EuroSTAR Conference

Europe's Largest Software Testing Conference.

Lauren Payne

Testing AI Agents: What QE Teams Need to Unlearn Before They Can Get This Right

April 13, 2026 by Lauren Payne

Run the same AI agent with the same input ten times. You will often get ten different results, sometimes subtly different, sometimes wildly.

That single fact breaks almost everything traditional QA was built on.

LangChain’s 2026 State of Agent Engineering report surveyed 1,300+ professionals. The findings are stark: 57% of organizations now have AI agents in production. Quality is the number one barrier to deployment, cited by 32% of teams. And only 52% have any evaluation system in place.

AI Agents Are in Production, but Evaluation Is Still Maturing

Do the math. Roughly half the organizations shipping agents to production have no structured way to know if those agents work reliably. For enterprises with 10,000+ employees, the top concern is not cost or speed. It is hallucinations and output consistency.

Gartner’s 2025 Hype Cycle placed AI agents at the Peak of Inflated Expectations, noting that multi-agent workflows and model non-determinism may trigger cascading failures.

That confidence gap is where QE teams should be rushing in.

Why the Input–Output Contract No Longer Holds

Traditional QA lives on a simple promise: given input X, expect output Y. AI agents break that promise by design. A customer service agent might resolve the same complaint through five different valid approaches. A coding agent might fix a bug with three different architectures. The output varies. The path varies. Both can be correct.

You cannot write an assertion that says “the response must equal this exact string.” You cannot build a regression suite expecting identical behavior across runs. And you cannot rely on pass-fail verdicts when the definition of “correct” depends on context, tone, and user intent. This is not a tooling problem. It is a thinking problem. And it demands that QE teams unlearn some deeply held assumptions about what testing looks like.
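One way to picture the alternative to exact-match assertions is a pass rate measured over repeated runs. The sketch below is only an illustration: `fake_refund_agent`, `mentions_refund`, and the threshold are hypothetical stand-ins, not part of any tool mentioned in this article.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              runs: int = 10) -> float:
    """Run the agent `runs` times and return the fraction of outputs that pass `check`."""
    passed = sum(1 for _ in range(runs) if check(agent(prompt)))
    return passed / runs


# Hypothetical stand-in for a non-deterministic agent (seeded so the demo is repeatable).
_rng = random.Random(0)


def fake_refund_agent(prompt: str) -> str:
    return _rng.choice([
        "I have issued a refund to your card.",
        "Your refund is on its way; expect it within 5 days.",
        "Refund processed. Anything else I can help with?",
    ])


def mentions_refund(output: str) -> bool:
    """A property check: the wording may vary, but a refund must be mentioned."""
    return "refund" in output.lower()


rate = pass_rate(fake_refund_agent, mentions_refund, "I want my money back", runs=10)
THRESHOLD = 0.9  # assumed acceptance threshold; choose one that fits your risk appetite
print(f"pass rate: {rate:.0%}, acceptable: {rate >= THRESHOLD}")
```

The verdict moves from "did this run match the expected string" to "how often does the agent satisfy the property", which is a number a team can track over time.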

Define Behavioural Boundaries, Not Exact Outputs

The most effective teams testing AI agents have made a counterintuitive shift: they stopped checking exact outputs and started defining behavioural bounds.

Anthropic’s engineering team addressed this in their guidance. They recommend evaluating the quality of the final output rather than the exact steps taken to reach it. Agents often arrive at effective solutions through alternative paths. If evaluation frameworks reject those paths, the test suite becomes brittle instead of robust.

Practically, this means asking different questions. Did the agent call the correct tools? Did it stay within policy guardrails? Did it reach a valid end state? Did it handle edge cases without hallucinating?
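Those questions can be turned into checks over a recorded agent run. The following is a minimal sketch under stated assumptions: the `AgentTrace` shape, the tool names, and the end states are all hypothetical, and the phrase check is only a crude proxy, since detecting hallucinations properly needs more than string matching.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """What one agent run did (hypothetical shape; adapt it to your own logs)."""
    tool_calls: list[str]
    final_state: str
    output: str


@dataclass
class Bounds:
    """Behavioural bounds: what must happen, what must not, and where a run may end."""
    required_tools: set[str]
    forbidden_tools: set[str]
    valid_end_states: set[str]
    banned_phrases: set[str] = field(default_factory=set)


def violations(trace: AgentTrace, bounds: Bounds) -> list[str]:
    """Return every bound the run broke; an empty list means the run is acceptable."""
    problems = []
    called = set(trace.tool_calls)
    missing = bounds.required_tools - called
    if missing:
        problems.append(f"missing required tools: {sorted(missing)}")
    forbidden = bounds.forbidden_tools & called
    if forbidden:
        problems.append(f"forbidden tools used: {sorted(forbidden)}")
    if trace.final_state not in bounds.valid_end_states:
        problems.append(f"invalid end state: {trace.final_state}")
    text = trace.output.lower()
    for phrase in sorted(bounds.banned_phrases):
        if phrase.lower() in text:
            problems.append(f"banned phrase in output: {phrase}")
    return problems


refund_bounds = Bounds(
    required_tools={"lookup_order"},
    forbidden_tools={"delete_account"},
    valid_end_states={"refund_issued", "escalated_to_human"},
    banned_phrases={"guaranteed"},
)

# Two different but valid paths, and one run that breaks the bounds.
run_a = AgentTrace(["lookup_order", "issue_refund"], "refund_issued",
                   "Your refund is on its way.")
run_b = AgentTrace(["check_policy", "lookup_order"], "escalated_to_human",
                   "I have passed this to a colleague.")
run_c = AgentTrace(["delete_account"], "account_closed", "Done, guaranteed.")

for run in (run_a, run_b, run_c):
    print(violations(run, refund_bounds))
```

Runs `run_a` and `run_b` take different paths and both pass, which is the point: the check constrains behaviour without prescribing the exact steps.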

Simulate Users, Not Just Inputs

Structured simulation frameworks help reduce production agent failures. The approach is simple: test agents against diverse user personas, communication styles, and edge cases before deployment.

A customer service agent that handles polite requests perfectly might collapse with ambiguous or frustrated users. A voice assistant tested only with clear enunciation will fail in noisy real-world environments. Testing AI agents means testing the full range of human unpredictability.
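As a rough illustration of persona-based simulation, the sketch below crosses a few communication styles with a few issues to build a scenario matrix. The personas, templates, and issues are invented for the example; a real simulation framework would generate far richer conversations.

```python
from itertools import product

# Hypothetical personas: each one rewrites the same issue in a different style.
PERSONAS = {
    "polite": "Hello, could you please help me? {issue}.",
    "frustrated": "This is the third time I am asking. {issue}. Fix it now.",
    "ambiguous": "Something is wrong again, I think it is about this: {issue}?",
}

# Hypothetical issues a customer service agent should handle.
ISSUES = [
    "My order arrived damaged",
    "I was charged twice",
]


def build_cases(personas: dict[str, str], issues: list[str]) -> list[dict[str, str]]:
    """Cross every persona with every issue to get one simulated opening message each."""
    return [
        {"persona": name, "issue": issue, "message": template.format(issue=issue)}
        for (name, template), issue in product(personas.items(), issues)
    ]


cases = build_cases(PERSONAS, ISSUES)
for case in cases:
    print(f"[{case['persona']}] {case['message']}")
```

Each generated message would then be sent to the agent under test and judged against behavioural bounds rather than an exact expected reply.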

This is exactly the problem TestMu AI’s Agent-to-Agent Testing platform was built to solve. It uses specialized AI agents to simulate diverse personas, generate thousands of test scenarios, and validate how your agent handles conversation, reasoning, and context across real-world conditions.

The concept of using agents to test agents sounds recursive, but it is the only approach that scales to match the complexity of these systems.

Quality Is a Continuous Signal

Many teams approaching agent testing are moving beyond the idea of quality as a one-time, pre-release checkpoint. Instead, they treat it as an ongoing signal.

Production logs can inform new test cases. Real user interactions can expand scenario libraries. Evaluation can run continuously as agents evolve, helping teams adapt as behaviour changes over time.

LangChain’s data confirms this shift: 89% of teams have implemented observability for their agents. But observability without structured evaluation is just logging.

The winning practice combines automated monitoring to flag anomalies with human reviewers making judgment calls on ambiguous cases. Platforms like KaneAI support this continuous model. When test authoring, execution, reporting, and test management live in one unified system, the feedback loop from a production anomaly back to the relevant test scenario becomes fast and actionable, tight enough to drive real quality improvements.

The Discipline Is Being Rewritten

Quality engineering is expanding. As AI systems introduce probabilistic behavior, tool orchestration, and adaptive workflows, the craft naturally grows more complex. Engineers who understand both testing fundamentals and AI system mechanics are well positioned to navigate that shift.

For teams already practicing strong QE, the shift is less about starting from scratch and more about refining the lens.

Author

Mudit Singh Co-Founder at TestMu AI

With over a decade of experience building and scaling software products, he has helped shape quality engineering and AI-driven testing strategies that empower engineering teams to ship reliable software faster. His work spans product strategy, AI-native quality engineering, and community-led innovation, bridging the gap between human expertise and autonomous systems.

TestMu AI are Gold Sponsors at EuroSTAR 2026. Join us at EuroSTAR Conference in Oslo 15-18 June 2026.

Filed Under: EuroSTAR Conference, EuroSTAR Expo, Gold Tagged With: 2026, EuroSTAR Conference, Expo, software testing tools

How the Testing Discipline Adapts to Conform with Agentic SDLC

April 3, 2026 by Lauren Payne

Disclaimer: this article is 100% human effort, no LLMs were leveraged while writing it.

Since the end of 2022, people have been using LLMs, first for fun, and then for work-related activities. 2023 was the year of doubts, with the majority of people calling this period the “AI-hype Era”. Many were afraid to even try out any of these LLMs. Then in 2024 we started hearing about successful results of AI adoption programs and saw AI-native solution adoption providing valuable assistance, primarily in coding tasks.

In 2025, the scenery changed again. We started hearing more about the pitfalls of GenAI transformation projects, the emerging risks and challenges, and how one could potentially bridge these gaps and avoid hurting their business. Most people were still cautious, but they were also curious.


Moving on from simply using chatbots, the natural next step was leveraging a code assistant inside an IDE. This is a great way to boost your output in test automation, but without the right context, i.e. proper test data and agentic knowledge of your enterprise systems, the code produced could turn out to be generic and not tailored to what you need.

The first answer to that problem was RAG (Retrieval Augmented Generation), and then more recently MCP (Model Context Protocol). The former enables you to leverage additional data (custom embeddings and datasets) and effectively expand what your LLM can access. The latter standardizes how LLMs and agentic systems communicate with external systems such as project or test management tools.

Although the tooling is important, the human aspect needs to be considered too, and in fact, is the biggest factor in successful AI transformation programs and adoption.

Now in 2026 we have an even clearer picture of implementing AI-native business and operational solutions across almost every industry. The first thing to highlight is that the number one discipline where we see success in AI adoption is:

Testing!

This may come as a surprise to you, my fellow Testers, as Testing often seems to be an afterthought. Dev and DevOps continue to prioritize coding and delivering fast value. So how come Testers are the forerunners now?

The recipe is simple: Testers have a critical, methodical, and investigative mindset. They provide unfiltered feedback and really care about the products they are working on. Testers also have a deep understanding of software and are comfortable with new, possibly unfamiliar technologies, both as users and as technical professionals.

Let’s break down how that aligns with GenAI:

  • Prompt Engineering became a globally recognized role in 2023, and it became a mandatory skill to pick up for teams working in an Agentic SDLC
  • For prompts, you need to be descriptive and have a thorough understanding of what you need the LLM to accomplish or provide
  • Testers already have the analytical mindset due to requirements analysis
  • Testers already understand what the end-users and the business are looking for
  • Testers already work closely together with developers
  • Test automation code needs to be integrated in CI/CD pipelines, and quality gates need to be defined at different stages of the delivery

EPAM realized that Testers are the Swiss Army Knives of the Software world. Testers make the perfect Prompt Engineers, as they possess all the required prerequisites to pick up the necessary new skillsets fast to excel in this AI world. And then, they can be the perfect catalysts and support system for pursuing broader AI adoption across an organization.

Example Agentic SDLC phases, AI assistants and benefits of using AI

Agentic SDLC is all about bringing AI assistance to every stage of development, QA and operations, be it requirements analysis, user story creation, developers’ review of user stories, test case definition, code change impact analysis, test orchestration, or vibe coding of product code and test automation code.

For each of those tasks, a pipeline of AI agents can provide task level productivity gains. The more use cases you identify to augment with AI, the more overall team productivity gains can be realized. For that, you need to investigate applicable disciplines holistically and ensure that each team member is engaging with the implemented AI solution while developing mastery (Note: A number of AI orchestration and collaboration platforms, like EPAM’s EliteA, have built-in tools to help managers track adoption and skill growth). That’s when adoption can accelerate, and your teams can together ensure an impactful ROI (return on investment) on GenAI adoption programs.

That’s when QA people come into the picture again: We like to set up QA metrics to see trends and be able to course correct when the ship is navigating in the wrong direction. AI solutions are software solutions as well. Usage of these needs to be carefully observed, and course corrected at times. Testers know how to do that and can help teams and organizations avoid waste through proactive, predictive, and preemptive monitoring.

Example agentic eco-system leveraging MCP and ELITEA’s system connectors

To provide better insight, let us share numbers from one of our clients, an insurance payment platform provider. We measured up to 90% task-level productivity gains on performance test results analysis and on requirements analysis. Test case generation and orchestration provided 75%, while user story and user guide creation provided 67% gains. Agent-assisted vibe coding enabled developers and test automation engineers to spend around 40-45% less time on coding.

These numbers may look high, but don’t forget that these were task-level gains. The team-level gains were between 27.8% and 31.8%, as not all the tasks of business analysts, developers, and testers were AI-assisted. As highlighted above, the more use cases are augmented with AI and the more disciplines adopt those solutions, the higher the overall productivity gains.
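To see why task-level gains shrink at team level, it can help to treat the team-level figure as a time-weighted average. The sketch below assumes that a "gain" means the fraction of time saved on a task; the time shares are purely hypothetical and are not the client's actual breakdown, only the task-level gains echo the figures quoted above.

```python
# Each entry: (share of total team time spent on the task, task-level time saved).
# The shares are HYPOTHETICAL illustration values, not measured data.
tasks = {
    "requirements analysis": (0.10, 0.90),
    "test case generation": (0.10, 0.75),
    "coding": (0.30, 0.40),
    "everything not AI-assisted": (0.50, 0.00),
}


def team_level_gain(breakdown: dict[str, tuple[float, float]]) -> float:
    """Time-weighted average: sum of (share of team time x fraction of time saved)."""
    total_share = sum(share for share, _ in breakdown.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("time shares must add up to 1.0")
    return sum(share * gain for share, gain in breakdown.values())


print(f"team-level gain: {team_level_gain(tasks):.1%}")
```

With half of the team's time untouched by AI in this made-up breakdown, even 90% gains on individual tasks average out to well under a third overall, which is why widening the set of AI-assisted use cases matters.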

Overall, this new era brings an exciting opportunity for our beloved discipline. But it’s important that you, as a Tester, start adapting to and working with this new style of delivery, or you risk being left behind. If you are unsure where or how to start, reach out to us; we are always happy to help.

Visit EPAM at booth 15 at the EuroSTAR conference. Come on over and say hello, and let’s seize these new AI opportunities together!

https://www.epam.com/services/engineering/quality-engineering

Author

Péter Földházi Quality Architect, AI & Game QA Consulting, North America

Péter was first involved with QA as a beta tester of DOTA in 2006. Since joining EPAM in 2012, he moved towards test automation and is currently working in the USA as a Quality Architect.

He is leading Game Testing Consulting and GenAI adoptions in the Americas.

Péter has authored two ISTQB syllabi: Test Automation Engineering & Test Automation Strategy. He also invented two test automation methodologies: the Flow Model Pattern and the Tri-Layer Testing Architecture, the latter published as a white paper by the PNSQC. Péter has been one of the review board members of the HUSTEF since 2015.

Péter is a regular keynote and tutorial speaker at conferences such as STARWEST, STAREAST, and SauceCon. He used to be a guest lecturer at three Budapest-based universities: Óbuda, Pázmány, and ELTE. Brewing beer and planting chilis are some of his hobbies.

Editor: Ted Weil – Marketing Manager, TestIO & EPAM Testing Practice

EPAM are Exhibitors at EuroSTAR 2026. Join us at EuroSTAR Conference in Oslo 15-18 June 2026.

Filed Under: EuroSTAR Conference, EuroSTAR Expo Tagged With: 2026, EuroSTAR Conference, Expo, software testing tools

How AI Is Changing Test Case Creation 

April 1, 2026 by Lauren Payne

Why test case creation is under pressure 

In software development, speed is no longer a competitive advantage — it is an expectation. Teams release continuously, requirements evolve rapidly, and documentation quality varies. Yet one constant remains: quality must be reliable. 

Test case creation sits at the heart of this challenge. It translates requirements into structured validation, turning ideas into verifiable outcomes. But under increasing time pressure, this critical step often becomes a bottleneck, and the window for careful analysis keeps shrinking. When test cases are rushed, inconsistent, or incomplete, the consequences surface later as escaped defects, costly rework, and delayed releases.

This growing tension between speed and quality is exactly where Artificial Intelligence begins to reshape the discipline — not by replacing testers, but by redefining how test cases are created, reviewed, and refined. 

Most organizations still rely on manual test case derivation from requirement documents, user stories, or specifications. That work is important, but it comes with familiar challenges: 

  • Time-intensive effort: Large requirement sets can take days or weeks to translate into structured test cases. 
  • Human variability: Two testers can interpret the same requirement differently, producing uneven quality. 
  • Coverage gaps: Under deadline pressure, edge cases and negative scenarios are often missed. 
  • Automation friction: Manually written cases are frequently not “automation-ready” and require rework to be useful in pipelines. 

AI addresses these challenges by changing how the work is distributed between tools and testers.

What AI changes in test case creation 

AI introduces a new operating model: machine-generated drafts plus human validation. Instead of starting from a blank page, testers start from a structured baseline created by an AI engine that has processed the underlying requirements. 

In practice, the shift is not just “faster writing.” It impacts four core outcomes:

  1. Speed: AI can generate test case drafts in a fraction of the time needed for manual extraction. That can reduce lead time from requirements to executable testing, which is especially helpful in early phases or short sprint cycles.
  2. Precision: When the AI is trained and designed for requirements understanding, it can standardize structure, language, and formatting across test cases, reducing ambiguity and improving consistency.
  3. Higher coverage: AI can systematically scan the full set of available requirements and create broader scenario sets, including negative paths, boundary conditions, and dependencies that are commonly overlooked when time is tight.
  4. Ready for automation: If test cases are generated in a structured format (clear preconditions, steps, expected results, and stable identifiers), they become significantly easier to map into automation frameworks and CI/CD pipelines.
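To make "structured format" concrete, here is a small sketch of what an automation-ready test case record might look like. The field names, the ID scheme, and the readiness rules are assumptions chosen for illustration; they do not describe the output format of any specific tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StructuredTestCase:
    """One test case with the fields an automation framework can map onto."""
    case_id: str            # stable identifier, e.g. reused across regenerations
    requirement_id: str     # traceability back to the source requirement
    title: str
    preconditions: tuple[str, ...]
    steps: tuple[str, ...]
    expected_results: tuple[str, ...]


def readiness_problems(case: StructuredTestCase) -> list[str]:
    """Return reasons the case is not ready for automation (empty list means ready)."""
    problems = []
    if not case.case_id.strip():
        problems.append("missing stable identifier")
    if not case.requirement_id.strip():
        problems.append("missing requirement reference")
    if not case.steps:
        problems.append("no steps")
    if not case.expected_results:
        problems.append("no expected results")
    return problems


login_case = StructuredTestCase(
    case_id="TC-LOGIN-001",          # hypothetical ID scheme
    requirement_id="REQ-42",         # hypothetical requirement key
    title="Login with valid credentials",
    preconditions=("A registered user exists",),
    steps=("Open the login page", "Enter valid credentials", "Submit the form"),
    expected_results=("The user lands on the dashboard",),
)

print(readiness_problems(login_case))          # an empty list: nothing is missing
print(json.dumps(asdict(login_case), indent=2))  # machine-readable form for a pipeline
```

Because every case carries the same fields, a reviewer can check drafts consistently and a pipeline can consume them without manual reformatting.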

The key is how this is implemented. AI creates value when it produces output that is immediately usable by testers and automation engineers, not when it generates generic text that still requires heavy rework. 

Introducing msg.TestcaseGen.ai: faster, more complete, automation-ready 

msg.TestcaseGen.ai was built to modernize test case creation with AI, without sacrificing professional QA standards. The tool automatically generates structured test cases from requirement documentation and supports review and refinement by subject matter testers, enabling organizations to combine AI efficiency with human expertise. 

From a test management perspective, the benefits align directly with what many teams need right now: 

  • Faster test case generation: Reduce manual effort and free experts for analysis, risk assessment, and exploratory work. 
  • More precise, consistent structure: Improve readability and reduce interpretation gaps across teams and projects.
  • Higher test case coverage: Systematically derive cases from the full requirements set, supporting more robust functional validation.  
  • Automation readiness: Produce standardized test cases that can be transitioned more efficiently into automated test suites. 

In short, msg.TestcaseGen.ai helps organizations move from “test cases as a documentation burden” to “test cases as an acceleration asset.” 

Where it fits best: functional testing that scales 

AI-based test case generation is particularly effective in functional testing, where traceability to requirements and structured step design matter most. Typical use cases include: 

  • Structured bug testing: Creating reliable, repeatable cases that uncover functional defects. 
  • Regression testing: Ensuring existing features still work after change, supported by consistent, maintainable test sets. 
  • Localization readiness: Supporting coverage across language and region variants by deriving scenarios systematically from specs. 

This matters because functional scope expands quickly, especially in large programs, and manual test case work rarely scales at the same pace. 

Human testers still lead, AI changes what they spend time on 

AI does not remove the need for skilled QA professionals. It changes where expertise delivers the greatest value. 

Instead of spending most of the time on drafting and formatting, testers can focus more on: 

  • validating intent and risk, not just steps 
  • improving test design quality and coverage strategy 
  • identifying missing requirements and inconsistencies 
  • designing automation architecture and stability
  • ensuring test suites remain relevant over time 

AI becomes a productivity layer, while testers remain the quality authority. 

A practical path forward 

If you are evaluating AI for test case creation, the most pragmatic approach is: 

  1. Start with a real requirement set (not a “demo” example). 
  2. Generate a baseline suite using AI. 
  3. Conduct expert review and refinement.
  4. Measure impact on lead times, coverage and automation usability. 

That is exactly the kind of practical, real-world impact msg.TestcaseGen.ai is designed to deliver, helping teams test faster, more precisely, with higher coverage, and ready for automation. 

This human-plus-AI model reduces lead times, improves consistency, and increases coverage—without compromising professional QA standards. 

msg will be present as an exhibitor at EuroSTAR 2026 in Oslo (June 15–18). If AI-driven test case generation is on your roadmap, msg.TestcaseGen.ai is worth a closer look. https://testcasegen.com/ 

Author

Tuan Truong – Head of Test Architect Product Development

Stephan Ingerberg, Head of Sales, msg Test & Quality Management 

Stephan Ingerberg is a seasoned professional with over a decade of experience in the realm of software quality and digital assurance. He has been a dedicated disciple of quality and testing since 2004.

He currently serves as a pivotal figure in the Test & Quality Management division of msg, where he is responsible for sales, customer relations, and commercial aspects within central Europe. His unwavering dedication to excellence and adept navigation of software quality make him indispensable in the pursuit of digital perfection.

https://www.linkedin.com/in/stephan-ingerberg-digital-transformation

msg Test & Quality Management is an Exhibitor at EuroSTAR 2026, join us in Oslo

Filed Under: EuroSTAR Conference, EuroSTAR Expo Tagged With: 2026, EuroSTAR Conference, software testing tools

Shift Left with Smart Test Selection 

March 27, 2026 by Lauren Payne

Software testers face a constant dilemma: software grows, but release cycles shorten, leaving less time for testing. »Shift Left« promises a solution, aiming for earlier, more frequent testing. But extensive end-to-end test suites can take hours or weeks, making this infeasible. Luckily, Test Selection offers a proven solution. 

The key idea is to run only a small, highly effective subset of your full test suite. We design this subset to require only a fraction of the full suite’s runtime, while concentrating a significant amount of bug-finding power. 

While such a small subset may not reveal all bugs, it finds most of them much faster. And when we run this smaller subset frequently, in addition to less frequent full test runs, we get rapid feedback on the majority of bugs without missing crucial issues. This even enables running tests on feature or developer branches, where end-to-end tests were previously uneconomical, further accelerating development. 

We tested many Test Selection approaches to find those that deliver good results and are easily applicable at scale, even in legacy industry projects. Our Software Quality Platform Teamscale implements the best approaches to help our customers regain fast feedback from their tests. 

Quality Gates vs. Change-Based Testing 

In working closely with dozens of teams, we found two main use cases for Test Selection. 

Quality Gates: We select a fixed set of tests to run repeatedly over a longer period of time (weeks, months), allowing us to protect valuable resources, such as: 

  • Expensive test runs. To avoid wasting costly test runs on high-level bugs that may mask many other bugs, only software versions passing our quality gate proceed to expensive testing, thereby reducing execution costs. 
  • Integration branches. To ensure stable shared branches, only changes passing the quality gate are merged, preventing buggy code from hindering team productivity. 

Change-based Testing: We select a different set of tests for every test run, specifically for the code changes under test. This is more precise, especially for smaller changes. However, it requires a more complex setup to dynamically incorporate change information. 

We can even combine both strategies, e.g., by using quality gates before full test runs and change-based testing on feature branches and pull requests for thorough, rapid feedback and comprehensive quality assurance. 

Build a Quality Gate with AI Vector Embeddings 

End-to-end test suites, often grown through years of copy-and-paste, typically contain numerous similar tests. To identify high-level bugs, we run only one test from each cluster of such similar tests, i.e., we run the most dissimilar tests. This allows us to cover significant software functionality with fewer tests, increasing our chances of finding bugs throughout the software system. 

Technically, we represent tests as numerical vectors in a multi-dimensional space, leveraging Large Language Models (LLMs) to generate embedding vectors that capture the semantic meaning of your tests. This works directly on the source code of your automated tests from your version control system (VCS) or on manual test instructions from your Application Lifecycle Management (ALM) system. We then select the test cases that are “furthest apart” in the vector space, to ensure a diverse and non-redundant subset. 
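The idea of picking the tests that are "furthest apart" can be illustrated with a greedy farthest-first selection. This is a generic heuristic shown under stated assumptions, not a description of how Teamscale implements it; the three-dimensional vectors are toy values standing in for real LLM embeddings, and the test names are invented.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for identical directions, larger for dissimilar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse(embeddings: dict[str, list[float]], k: int) -> list[str]:
    """Greedy farthest-first: repeatedly add the test farthest from everything chosen so far."""
    names = sorted(embeddings)  # sorted so the result is deterministic
    if not names or k <= 0:
        return []
    selected = [names[0]]
    while len(selected) < min(k, len(names)):
        candidates = [n for n in names if n not in selected]
        best = max(
            candidates,
            key=lambda n: min(cosine_distance(embeddings[n], embeddings[s]) for s in selected),
        )
        selected.append(best)
    return selected


# Toy embeddings (hypothetical): two login tests, two search tests, one checkout test.
toy_embeddings = {
    "test_login_ok": [1.0, 0.1, 0.0],
    "test_login_bad_password": [0.9, 0.2, 0.0],
    "test_search_basic": [0.0, 1.0, 0.1],
    "test_search_filters": [0.1, 0.9, 0.2],
    "test_checkout": [0.0, 0.1, 1.0],
}

print(select_diverse(toy_embeddings, k=3))
```

With a budget of three tests, the selection ends up with one test from each cluster of similar tests rather than two near-duplicates.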

Our experiments on open-source and industry systems show that this approach enables teams to find 90% of bugs in just 13% of the full test suite’s runtime. This enables you to turn a long-running and expensive test suite into a quality gate that you can quickly run before every merge. 

Test Change-based with The Power of a Search Engine 

We can improve further by selecting tests specifically for the changes under test, which we extract from your VCS. From these changes, we generate a search query, by extracting relevant terms (e.g., »search«, »login«, »user«) from modified functionalities. Leveraging existing search engine research, we index your tests (automated test source code or manual test instructions) into a searchable database. Then we query this database, to compute a ranked list of tests, with the most relevant ones at the top, much like Google finds you the most relevant websites. Given your test budget, e.g., 15 minutes, we then run the most relevant tests that fit into this time. 
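The search-engine analogy can be sketched with a very small ranked retrieval: score each test against terms taken from the change, then fill the time budget from the top of the ranking. The scoring below is a simple TF-IDF-style heuristic chosen for illustration, not the ranking function used by any particular product; test names, descriptions, and durations are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


def rank_tests(query_terms: list[str], tests: dict[str, str]) -> list[tuple[str, float]]:
    """Score each test by term frequency times a smoothed inverse document frequency."""
    docs = {name: Counter(tokenize(text)) for name, text in tests.items()}
    n = len(docs)

    def idf(term: str) -> float:
        df = sum(1 for counts in docs.values() if term in counts)
        return math.log((n + 1) / (df + 1)) + 1.0

    scores = {
        name: sum(counts[term] * idf(term) for term in query_terms)
        for name, counts in docs.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def select_within_budget(ranked: list[tuple[str, float]],
                         durations_s: dict[str, float],
                         budget_s: float) -> list[str]:
    """Walk the ranking top-down and keep every relevant test that still fits the budget."""
    chosen, used = [], 0.0
    for name, score in ranked:
        if score <= 0:
            break  # remaining tests share no term with the change
        if used + durations_s[name] <= budget_s:
            chosen.append(name)
            used += durations_s[name]
    return chosen


# Hypothetical test descriptions and runtimes in seconds.
tests = {
    "test_login_valid_user": "open login page enter user and password submit login",
    "test_search_returns_results": "login as user open search enter term check search results",
    "test_export_report": "open reports choose pdf export download file",
}
durations_s = {
    "test_login_valid_user": 300,
    "test_search_returns_results": 480,
    "test_export_report": 600,
}

ranked = rank_tests(["search", "login"], tests)  # terms extracted from the code change
print(select_within_budget(ranked, durations_s, budget_s=15 * 60))
```

In this toy run the search test ranks first, the login test second, and the unrelated export test is never scheduled.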

Our experiments on open-source and industry systems show that this approach enables teams to find 90% of bugs in just 4% of the full test suite’s runtime. This may transform even a once-a-week test run into fast, per-pull-request checks, ensuring thorough testing very early on. The only downside, compared to a quality gate, is the increased setup effort, as your test runner or manual testers need to query for the relevant tests at the start of each test run. 

Summary

The imperative to »Shift Left« requires smarter testing strategies. Test Selection enables you to run smaller, yet highly effective, subsets of your tests more frequently, thereby accelerating feedback on most bugs. Teamscale offers Test Selection to build quality gates (to protect resources like shared branches or costly test runs) as well as to test specific changes (for rapid feedback on pull requests or feature branches). Both approaches, or a combination, enable »Shift Left« even for extensive end-to-end test suites, ensuring faster feedback and higher software quality without increasing testing costs. 

Would you like to dive in deeper? Reach out or visit our Teamscale booth at EuroSTAR 2026! 

Author

Johannes Veihelmann 

Johannes Veihelmann is a consultant of CQSE GmbH for software quality. He obtained his Bachelor of Science degree in Bioinformatics and is part of the Test Intelligence Team. There he helps customers daily to successfully set up Test Selection analyses.

CQSE are participating in EuroSTAR Conference 2026 as an exhibitor

Filed Under: EuroSTAR Conference, EuroSTAR Expo Tagged With: 2026, EuroSTAR Conference, Expo

How Artificial Intelligence is Rerouting Quality Assurance

March 25, 2026 by Lauren Payne

AI QA is reshaping software testing by bringing intelligence into every stage of the development lifecycle. By combining AI and machine learning, QA teams are moving from brittle automation to adaptive, predictive strategies that catch bugs earlier, reduce test maintenance, and speed up releases. 

This post breaks down how Artificial Intelligence (AI) in Quality Assurance (QA) is transforming software testing, from smarter test case generation and faster defect prediction to continuous optimization. You’ll also see how teams can start applying AI in practical ways.

What is artificial intelligence in quality assurance? 

AI QA refers to the integration of AI and machine learning (ML) into quality assurance workflows. Practically speaking, it’s about using AI to take repetitive tasks off the QA team’s plate, giving them more time to focus on activities that require human insight, like exploratory testing or evaluating edge-case behavior. 

A few of the jobs AI QA can perform include: 

  • Generating test cases based on user behavior, system logs, requirements, or recent code changes.
  • Predicting failure points by analyzing historical defect data, commit patterns, and code complexity. 
  • Triaging bugs automatically using NLP (Natural Language Processing) to group related issues, flag duplicates, and suggest likely root causes. 
  • Prioritizing test execution based on risk scores, code velocity, and business-critical areas to reduce unnecessary test cycles. 
  • Maintaining and evolving test suites by identifying outdated tests and generating new ones in response to product changes. 

Why do QA teams need AI now? 

Modern QA teams aren’t lacking tools. They’re lacking time, visibility, and actionable insights. Release cycles are getting shorter, systems are more complex, and user expectations continue to rise. 

Here’s how that pressure shows up in practice: 

  • Tests are running, but the value is unclear.  
  • Automation is fragile, and fixing brittle test scripts often takes time away from coverage.
  • Bugs still make it to production when testing isn’t aligned with real-world risk. 
  • QA becomes a bottleneck when teams are expected to sign off without enough time, data, or confidence. 
  • Leadership can’t measure impact without clear metrics. 

AI QA helps solve these problems by enabling teams to work smarter, not harder. Instead of adding more scripts or expanding headcount, AI reduces waste and helps teams zero in on what really matters. The result is faster feedback and higher-quality releases. 

How AI is helping teams improve efficiency 

Here are five ways AI QA is helping teams improve efficiency and focus on what matters most: 

  • Smarter test generation: AI tools analyze usage patterns, code changes, and defect logs to automatically generate test cases, saving time and improving test coverage. 

With TestRail AI, you can generate first-draft test cases from your requirements text, then review and refine them before adding them to your suite. 

  • Faster defect prediction: By modeling factors like code churn, commit frequency, and historical defect density, AI highlights high-risk areas before issues reach staging or production. 
  • Intelligent bug triage: Using NLP, AI groups related bugs, flags duplicates, and suggests likely owners, helping teams resolve issues faster and reduce backlog noise. 
  • Risk-based test prioritization: Rather than running every test on every build, AI assigns risk scores and ranks test cases based on business impact, recent changes, and failure likelihood. 
  • Continuous test suite maintenance: AI flags outdated/redundant tests to reduce false positives and maintenance overhead. 
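Risk-based prioritization from the list above can be pictured as scoring and sorting. The sketch below uses a fixed weighted sum as a stand-in for whatever model an AI tool would actually learn; the weights, field names, and sample tests are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RiskProfile:
    name: str
    business_impact: float        # 0.0 (cosmetic) to 1.0 (business-critical)
    touches_recent_change: bool   # does the test cover recently changed code?
    recent_failure_rate: float    # 0.0 to 1.0 over recent runs


def risk_score(test: RiskProfile,
               w_impact: float = 0.4,
               w_change: float = 0.3,
               w_failure: float = 0.3) -> float:
    """Weighted sum of the three signals; the weights are illustrative, not tuned."""
    change = 1.0 if test.touches_recent_change else 0.0
    return (w_impact * test.business_impact
            + w_change * change
            + w_failure * test.recent_failure_rate)


def prioritize(tests: list[RiskProfile]) -> list[str]:
    """Highest risk first, so a shortened run still covers what matters most."""
    return [t.name for t in sorted(tests, key=lambda t: (-risk_score(t), t.name))]


suite = [
    RiskProfile("test_profile_avatar", 0.2, False, 0.1),
    RiskProfile("test_checkout", 0.9, True, 0.2),
    RiskProfile("test_search", 0.6, True, 0.0),
]

print(prioritize(suite))
```

A team could run only the top of this ranking on every build and keep the full suite for nightly or pre-release runs.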

The strategic edge: QA leaders are tracking and testing 

QA teams are taking a more strategic approach. They’re seeking better visibility into what’s working, what’s wasting time, and where risk is hiding. AI is helping leaders track metrics like: 

  • Test debt velocity: How quickly are tests becoming outdated, and how does that affect confidence in test results? 
  • Risk-based test ROI: Which tests are consistently catching critical bugs—and which ones are just noise? 
  • AI vs. manual performance: How do AI-generated tests compare to manual ones in terms of defect yield and maintenance cost? 
  • Suite stability trends: Where is test flakiness increasing, and what are the patterns behind it? 
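
As one illustration of a suite stability metric, the sketch below computes a flakiness rate per test. It assumes one particular definition of flaky (the same test both passed and failed on the same commit) and a hypothetical run-log shape of (test name, commit, passed) tuples; your own tooling may define and store this differently.

```python
from collections import defaultdict


def flakiness_rates(runs):
    """runs: iterable of (test_name, commit_sha, passed) tuples.

    Returns, per test, the share of commits on which the test
    produced both a pass and a fail.
    """
    outcomes = defaultdict(set)
    for name, sha, passed in runs:
        outcomes[(name, sha)].add(passed)

    flaky_commits = defaultdict(int)
    total_commits = defaultdict(int)
    for (name, _sha), seen in outcomes.items():
        total_commits[name] += 1
        if len(seen) == 2:  # both True and False were observed
            flaky_commits[name] += 1

    return {name: flaky_commits[name] / total_commits[name] for name in total_commits}


history = [
    ("test_login", "a1", True),
    ("test_login", "a1", False),  # same commit, different outcome
    ("test_login", "b2", True),
    ("test_search", "a1", True),
]
rates = flakiness_rates(history)
```

Tracking this number per release is one simple way to see where flakiness is increasing.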

How can my team start using AI in QA? 

Start small with efficiency. The most effective teams begin by identifying where AI can have the greatest immediate impact and then build from there. 

  1. Map your friction. Where are you losing speed or confidence today? 
  2. Pick one high-leverage use case. Flakiness detection and test generation are great entry points. One simple starting point is using TestRail AI to draft test cases from requirements, then standardizing them with a human review step. 
  3. Choose transparent tools. Make sure your AI doesn’t introduce black-box risk. 
  4. Connect everything to TestRail. Use it as your system of record to track, trace, and manage your evolving strategy.  

How TestRail helps teams create an AI QA strategy

AI QA tools can generate tests, flag risks, and optimize execution, but they work best within a structured system. TestRail brings those insights all together, helping teams turn AI QA into a repeatable strategy that scales across teams and release cycles. 

Here’s how TestRail works for AI QA: 

  • Track and generate tests in context: Use TestRail AI to draft test cases from requirements text, then manage AI-assisted and manual tests together with full visibility into history, execution, and ownership. 
  • Visualize test coverage by risk: Filter by release, component, or risk category to see gaps and trends. 
  • Centralize automated results: Connect TestRail to your automation and CI/CD pipeline to centralize reporting across automated test runs. 
  • Maintain end-to-end traceability: Link test execution to requirements, defects, and user stories for complete accountability. 
  • Report with clarity: Use dashboards and custom reports to surface performance trends, identify bottlenecks, and share QA impact across teams. 

TestRail is built so that speed is measurable. Even as complexity grows and scales, your team stays in control. 

Integrate a streamlined workflow with TestRail 

Quality assurance measures real-world risk and complexity. Platforms like TestRail let you leverage AI QA without losing visibility, giving you tighter feedback loops and more confident releases. See it firsthand: start your free 30-day trial today. 

Author

Patrícia Duarte Mateus

With more than a decade of experience in Software QA and expertise in several business areas, Patrícia Duarte Mateus has a QA mindset built by the different roles she has played—including tester, test manager, test analyst, and QA engineer. She’s Portuguese, living in Portugal, and is currently a Solution Architect and QA Advocate for TestRail. Patrícia is also a speaker, mentor and founder of a project whose objective is to demystify and educate on Software QA with a focus on Portuguese-speaking people, called “A QA Portuguesa”. Her areas of interest beyond QA include deepening her knowledge of psychology, tech, management, teaching/mentoring, health, and entrepreneurship. Books, podcasts, Ted Talks and YouTube are always on Patrícia’s to-do list to ensure a good day! 

TestRail is participating in EuroSTAR Conference 2026 as a Gold Sponsor. Join us at the EuroSTAR Conference EXPO in Oslo, 15-18 June 2026.

Filed Under: EuroSTAR Conference, EuroSTAR Expo, Gold Tagged With: 2026, EuroSTAR Conference, Expo

Agentic AI for Production-Grade, Domain-Specific Test Automation 

March 23, 2026 by Lauren Payne

For the past few years, AI in testing has been dominated by impressive demos. A model generates a test script from a user story. A chatbot repairs a broken locator. A prompt magically produces automation code. 

But when enterprise teams attempt to deploy these capabilities into real CI/CD pipelines, they quickly discover a gap between possibility and practicality. 

Large Language Models (LLMs) are powerful, but they introduce serious challenges in production testing environments. They are inherently probabilistic; identical prompts can yield different results, undermining regression reliability. They can hallucinate syntactically correct but logically flawed automation steps. As test flows grow longer and more complex, models may lose context. Inference latency and cost can escalate. Model drift can alter behavior over time, requiring continuous revalidation. 

Small Language Models (SLMs), by contrast, are fast, stable, and cost-efficient. They perform exceptionally well in structured, domain-specific tasks. However, they lack the deep reasoning power required for complex, multi-step intent interpretation. 

So the real question is not “LLM or SLM?” 
It is: How do we combine both intelligently to build reliable AI-powered automation? 

The Hybrid Architecture: Precision Meets Reasoning

Production-grade AI testing requires architectural discipline. 

The most effective approach is a hybrid model strategy: 

  • Custom SLMs trained on domain-specific automation assets—keywords, UI definitions, reusable components, and workflow logic. 
  • LLMs reserved for advanced reasoning tasks, such as decomposing high-level test intent into structured automation steps. 

This separation of responsibilities delivers determinism and efficiency for routine automation while preserving flexibility for complex reasoning. The result is faster inference, lower operational cost, improved stability, and reduced hallucination. 

Rather than relying on a single general-purpose model, a hybrid architecture assigns the right model to the right task. 
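
The routing idea can be sketched in a few lines. This is not how TestArchitect implements it; `small_model`, `large_model`, and the `KNOWN_ACTIONS` vocabulary are hypothetical stand-ins, and the routing rule (does the step start with a known keyword?) is deliberately simplistic.

```python
# Hypothetical stand-ins for real model clients.
def small_model(prompt: str) -> str:
    return f"[SLM] {prompt}"


def large_model(prompt: str) -> str:
    return f"[LLM] {prompt}"


# Assumed domain vocabulary the small model was trained on.
KNOWN_ACTIONS = {"click", "enter", "select", "verify", "open"}


def is_routine(step: str) -> bool:
    """Treat a step as routine if it starts with a known automation keyword."""
    words = step.lower().split()
    return bool(words) and words[0] in KNOWN_ACTIONS


def route(step: str) -> str:
    """Send routine steps to the small model and everything else to the large one."""
    model = small_model if is_routine(step) else large_model
    return model(step)


routine = route("click Login button")
complex_intent = route("Ensure a refund can be issued for a split payment")
```

The point of the design is that the expensive, less predictable model is only consulted when the cheap, stable one is out of its depth.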

This architectural pattern can be applied to modern automation frameworks in general. At DWS, we have successfully implemented this hybrid model within TestArchitect (TA), demonstrating how domain-trained SLMs and reasoning-driven LLMs can operate together inside a structured automation ecosystem.  

TestArchitect supports a wide spectrum of platforms and technologies, including SAP, Salesforce, mobile native, mobile web, mobile hybrid, browsers, and desktop applications. Teams can create fully no-code automated tests in plain English, allowing business analysts, domain experts, and testers of all skill levels to contribute without coding barriers. This approach accelerates in-sprint automation, helping teams keep pace with rapid release cycles while ensuring high-quality testing across diverse applications. 

From Single-Model Scripts to Agentic AI 

Even hybrid models alone are not enough. True production-grade automation is not generated by a single model guessing a script; it is orchestrated. 

Agentic AI introduces a multi-agent system where specialized AI components collaborate across the testing lifecycle: 

  • Interpreting business-level test intent
  • Mapping intent to reusable automation actions 
  • Generating structured UI interaction steps 
  • Validating and refining codeless automation 
  • Executing across platforms 
  • Performing intelligent self-healing 
  • Classifying root causes of failures 
  • Predicting impact and suggesting remediation  

Instead of relying on a single probabilistic output, the system validates itself through structured collaboration. Each agent has a defined responsibility, reducing instability and improving reliability. 
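
A toy version of that collaboration, with one agent proposing steps and a second one checking them, might look like the sketch below. The agents are plain functions with canned output, and the action vocabulary is hypothetical; in a real system each would wrap a model call or a framework service.

```python
# Assumed vocabulary of reusable automation actions.
KNOWN_ACTIONS = {"open", "click", "enter", "verify"}


def planner_agent(intent):
    """Stand-in for a model call: returns canned steps, one of them invalid."""
    return [
        "open login page",
        "enter credentials",
        "teleport to dashboard",  # not a known action; should be caught
        "verify welcome banner",
    ]


def validator_agent(steps):
    """Accept only steps whose first word is a known action."""
    accepted, rejected = [], []
    for step in steps:
        target = accepted if step.split()[0] in KNOWN_ACTIONS else rejected
        target.append(step)
    return accepted, rejected


def run_pipeline(intent):
    """Chain the agents; rejected steps are surfaced for human review."""
    accepted, rejected = validator_agent(planner_agent(intent))
    return {"intent": intent, "steps": accepted, "needs_review": rejected}


result = run_pipeline("User can log in and see the dashboard")
```

Because the validator has a narrow, deterministic job, a bad step from the generator is flagged instead of silently reaching execution.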

Within CI/CD pipelines, intelligent test selection based on code impact analysis ensures that only relevant test suites are executed, reducing cycle time while preserving coverage. 
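
In its simplest form, impact-based selection is a set intersection between changed files and the files each suite exercises. The coverage map below is hypothetical; in practice it would come from coverage data or dependency analysis.

```python
# Hypothetical mapping from test suite to the source files it exercises.
COVERAGE_MAP = {
    "suite_checkout": {"cart.py", "payment.py"},
    "suite_search": {"search.py", "index.py"},
    "suite_profile": {"user.py"},
}


def select_suites(changed_files, coverage_map=COVERAGE_MAP):
    """Return the suites that touch at least one changed file."""
    changed = set(changed_files)
    return sorted(name for name, files in coverage_map.items() if files & changed)


to_run = select_suites(["payment.py", "README.md"])
```

A change that touches no mapped file selects nothing, so a real pipeline would likely still run a small smoke suite as a safety net.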

This marks a shift from AI as a script generator to AI as an orchestrated testing system. 

Within TestArchitect, this agentic hybrid model powers end-to-end intelligent automation, but the architectural principles are transferable to other enterprise automation environments. 

Data Engineering and Model Validation

The real differentiator in AI-powered testing is not model size. It is data quality and governance. 

For domain-intensive industries such as finance, healthcare, and energy, AI must align precisely with industrial workflows. This requires: 

  • Structured automation datasets 
  • Clearly defined evaluation metrics 
  • Domain-specific fine-tuning 
  • Human-in-the-loop validation 

Domain experts play a critical role in guiding quality standards and ensuring AI-generated outputs meet enterprise expectations. 

To improve precision and consistency, Retrieval-Augmented Generation (RAG) is applied using structured project-specific automation data. Instead of relying solely on model weights, the system dynamically retrieves relevant context, existing action libraries, interface definitions, and project artifacts before generating tests. 

This approach reduces hallucination, preserves framework alignment, and improves maintainability. 

In TestArchitect implementations, this structured RAG strategy has enabled reliable AI-assisted test generation aligned with domain rules and enterprise governance standards. 

Deployment, Cost, and Sustainability

Enterprise adoption requires more than innovation; it requires sustainability. 

Custom SLMs can be hosted in private cloud or on-premises environments, ensuring that sensitive QA artifacts, requirements, test cases, and defect logs remain within organizational boundaries. Strategic collaboration with cloud-hosted LLMs minimizes unnecessary inference costs. 

Because SLMs are lightweight and CPU-friendly, organizations avoid GPU-heavy infrastructure and excessive cloud spending. Millisecond-level responses enable seamless integration into Agile and DevOps workflows. 

The hybrid agentic approach ensures that AI-driven automation remains: 

  • Secure
  • Cost-efficient
  • Predictable
  • Scalable

Without architectural discipline, AI in testing becomes unstable and expensive. With the right design, it becomes a long-term competitive advantage. 

Engineering the Future of AI-Driven Testing

AI will not replace automation frameworks. It will augment them. 

The organizations that succeed will not be those experimenting with prompts, but those engineering intelligent systems, combining hybrid models, agent orchestration, structured data pipelines, and rigorous validation. 

This hybrid, agentic AI architecture can be applied across automation ecosystems. Its successful implementation within TestArchitect demonstrates that production-grade, domain-specific, cost-controlled AI testing is not theoretical—it is achievable today. 

The future of testing belongs to teams that treat AI as an engineering discipline. 

Get Started with TestArchitect AI Today 

Author

Tuan Truong – Head of Test Architect Product Development

He leads the design and evolution of enterprise-scale test automation solutions, helping global organizations modernize quality engineering practices and deliver large-scale systems with confidence. With over 20 years of experience in software testing, automation architecture, and product engineering, Tuan specializes in integrating AI into practical testing workflows. His current focus is on designing production-grade AI systems that combine Custom SLMs, LLMs, and Agentic AI to create scalable, cost-efficient automation solutions. He works closely with enterprise teams to transform emerging AI capabilities into reliable, real-world testing systems. 

DWS is participating in EuroSTAR Conference 2026 as a Gold Sponsor

Filed Under: EuroSTAR Conference, EuroSTAR Expo Tagged With: 2026, EuroSTAR Conference, Expo

How We Learned to Test Our RAG (and Accidentally Tested Our Content) 

March 20, 2026 by Lauren Payne

In the last four years as a developer advocate at Qase, I’ve written more blog posts and LinkedIn articles than I can count. I’ve also reviewed hundreds of pieces from colleagues, checking whether the arguments hold, the examples work, and the claims are backed by evidence. 

I got better at it over time. My earliest articles are nowhere close to my latest ones, both in knowledge and in writing style. Naturally, I tried using LLMs along the way. I quickly found out that LLMs can’t be trusted to write on my behalf, and they certainly can’t be trusted to fact-check. They hallucinate confidently, they miss nuance, they produce text that sounds right but often isn’t. 

I did find one pattern that works: using LLMs as critics. I write, then I ask the LLM to poke holes in what I wrote. The diversity of their criticism is genuinely helpful. Even when they hallucinate in their feedback, it doesn’t matter: I still have to go through every point and decide what to keep. The hallucinations become noise I can filter, not errors that leak into my content. 

This was my comfortable setup for a while. Then my friend Anupam Krishnamurthy showed me something that changed how I think about content quality entirely. 

The Spark

Anupam and I co-own BeyondQuality, an open research community for software quality topics. He presented his research on evaluating RAG systems using a framework called Ragas. 

For those unfamiliar: RAG (Retrieval Augmented Generation) is a way to make LLMs answer questions from your own documents. You chunk your content, store it in a vector database, and when someone asks a question, the system retrieves the relevant chunks and feeds them to the LLM to generate an answer. Ragas is a framework that evaluates how well this works: does the answer stick to the retrieved content? Did the system retrieve the right content in the first place? 
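
The chunk, retrieve, and prompt steps can be sketched without any external services. This is a toy under stated assumptions: the "embedding" is a bag-of-words count standing in for a learned embedding model, the chunks live in a plain list instead of a vector database, and the prompt is only assembled, not sent to an LLM.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, context):
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


docs = [
    "Flaky tests pass and fail without code changes and erode trust in the suite.",
    "Exploratory testing is simultaneous learning, test design, and execution.",
]
chunks = [c for d in docs for c in chunk(d)]
top = retrieve("why do flaky tests erode trust", chunks, k=1)
prompt = build_prompt("why do flaky tests erode trust", top)
```

Evaluation then asks two separate questions about this pipeline: did `retrieve` pick the right chunks, and did the generated answer stay within them?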

Anupam built a RAG on the 37signals employee handbook and used Ragas to evaluate it. This was eye-opening. I started thinking about LLMs and content quality in a completely new way. 

From Curiosity to Production

My first experiment was pure fun: I built a chat interface to a Deming book I own. It worked, but there was no real business case. I know what Deming writes; I’d rather just re-read the book. 

The second idea had a real business case. At Qase, we have over a hundred blog posts, a Help Centre, and a customer support knowledge base. Every new piece of content operates on top of everything published before. If a new article contradicts something from six months ago, readers notice. It confuses people and erodes trust. 

This is a classical regression problem, just in content, not in code. A company with hundreds of published pieces has the same problem as a codebase with no tests: changes go out unchecked against existing behavior. Before this, the only check was someone reading a draft and trying to remember if it contradicts anything. With a hundred articles, that’s not realistic. With RAG, I could retrieve just the relevant pieces from the existing corpus and check the new draft against them automatically. Ragas then evaluates how well the retrieval and checking actually work. The feedback loop went from “hope someone remembers” to a 25-minute automated run. 

What Happened When We Tried It 

The first obstacle was building the evaluation itself. To test whether the system answers correctly, you need ground truths: the key claims and ideas from your content. I tried using several LLMs to extract these from our 106 blog posts. They all gave inconsistent results. The LLMs couldn’t agree on what the articles were actually saying. I ended up reading every article myself and writing down the key ideas manually. There is no shortcut here yet. 

Then I had to write evaluation questions. This turned out to be the same problem as writing tests after the code is already written: since you already know the system, you’re inclined to write tests that just confirm what’s there. My first questions were like that. They assumed knowledge of the content and led toward the answer. It took three iterations to learn to write questions the way a real person would ask them: short, simple, with genuine uncertainty about which way the answer goes. 

Once I had 240 questions and the evaluation was running, the results told a clear story. On broad, open-ended questions, the LLM stopped relying on the retrieved content and started answering from its own general knowledge. It sounded confident and correct, but it was no longer grounded in what we actually wrote. Ragas caught this. Without measurement, we would never have noticed. 
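
To give a feel for what "grounded" means here, the sketch below scores an answer by the share of its sentences whose words mostly appear in the retrieved context. This is a crude lexical stand-in written for illustration, not how Ragas computes its metrics, and the 0.6 threshold is an arbitrary assumption.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounded_fraction(answer, context, threshold=0.6):
    """Share of answer sentences whose words mostly occur in the context."""
    ctx = tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = tokens(sentence)
        if words and len(words & ctx) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)


context = "Our guide recommends quarantining flaky tests and fixing them within one sprint."
answer = (
    "The guide recommends quarantining flaky tests. "
    "Industry surveys show most teams ignore flakiness entirely."
)
score = grounded_fraction(answer, context)
```

The second sentence sounds plausible but has no support in the context, which is the pattern described above: confident, fluent, and not grounded in what was actually written.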

The whole evaluation for 240 questions costs $0.60 and takes 25 minutes. I’ve seen classical automated test suites run for longer! 
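
Broken down per question, those figures work out as follows. The projection helper assumes cost and time scale linearly with the number of questions, which is an assumption for illustration and not something measured here.

```python
QUESTIONS = 240
TOTAL_COST_USD = 0.60
TOTAL_MINUTES = 25

cost_per_question = round(TOTAL_COST_USD / QUESTIONS, 4)         # dollars
seconds_per_question = round(TOTAL_MINUTES * 60 / QUESTIONS, 2)  # seconds


def project(n_questions):
    """Estimated (cost in USD, duration in minutes), assuming linear scaling."""
    cost = round(cost_per_question * n_questions, 2)
    minutes = round(seconds_per_question * n_questions / 60, 1)
    return cost, minutes
```

That is a quarter of a cent and a little over six seconds per question.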

And then the biggest surprise. For some questions, the system couldn’t find the right article to answer from. Not because the retrieval was broken, but because we simply hadn’t written about those topics well enough. We set out to test the RAG. We ended up finding holes in our own content. 

Where This Is Going

This research started with Anupam. Without his curiosity and his work at BeyondQuality, I would not have explored this direction at all. 

Today we have regression testing for our content: the blog, the Help Centre, the CS knowledge base. New drafts get checked against everything we’ve already published. What started as a contradiction checker is now also helping us find gaps in our existing content we didn’t know were there. 

If you want to see how RAG evaluation works hands-on, come to Anupam’s live demo at EuroSTAR in Oslo in June 2026. He will walk through building and evaluating a RAG system step by step. And if the intersection of software quality and AI interests you, follow our work at BeyondQuality, where all research is published openly. 

I’ll be at the Qase booth throughout the conference. If any of this sparked your curiosity, come say hi. 

Author

Vitaly Sharovatov 

As a quality enthusiast, I believe that people should take pride in their work and companies should aim to produce high-quality products. I have spent the last 24 years in IT, focusing on engineering, quality assurance and mentorship. I am also a huge animal lover and have saved and raised more than 50 cats and dogs. 

Qase is participating in EuroSTAR Conference 2026 as a Gold Sponsor

Filed Under: EuroSTAR Conference, EuroSTAR Expo, Gold Tagged With: 2026, EuroSTAR Conference, Expo

The 4 Pillars of Modern Testing: Building a Unified Ecosystem

March 16, 2026 by Lauren Payne

In the pursuit of digital transformation, many QA teams are hindered by integration debt—the hidden cost of reconciling manual data and bridging visibility gaps between disconnected platforms. This debt is paid in the form of administrative overhead, reduced velocity, and the inherent friction that occurs when test management and test automation operate in separate silos.

To address these complexities, many organizations utilize the Inflectra ecosystem to bridge the gap between planning and execution. As a provider of software lifecycle management tools, Inflectra focuses on creating a single source of truth across the development pipeline. Within this framework, SpiraTest (test management) and Rapise (test automation) are engineered to function as a unified system, ensuring that automated execution remains tethered to original business requirements.

1. Transitioning from Integration to Unification

Traditionally, QA has been bifurcated. Test management tools track requirements and manual progress, while automation engines function within independent environments. This disconnect creates three primary strategic risks:

  • The Transparency Gap: Stakeholders may see successful execution reports without understanding which specific business requirements have been validated.
  • Version Divergence: Automated scripts often evolve independently of the test plans they support, leading to deceptive results based on outdated logic.
  • Operational Inefficiency: Engineers frequently duplicate effort by recreating manual test steps in code because a shared source of truth does not exist.

2. Transitioning from Integration to Unification

There is a fundamental difference between two tools connected by an API and a truly unified ecosystem. In a unified environment, the automation engine is not an external add-on; it is a native extension of the management layer.

When automation is managed as a primary asset within the test management tool, organizations achieve End-to-End Traceability. This allows a Project Manager to evaluate a high-level requirement and immediately view the precise execution logs and evidence that confirm its stability.

The Pillars of a Unified QA Workflow

To modernize a QA department, leadership should focus on four technical pillars that define a mature, unified strategy:

Pillar I: Requirement-Driven Automation
Instead of developing scripts in isolation, a unified system allows teams to derive automation directly from manual definitions. By using the manual test case as the structural blueprint, Rapise ensures that automation mirrors the original business intent. This alignment ensures that any change to a manual requirement in SpiraTest is immediately reflected in the automation gap analysis.
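
An automation gap analysis boils down to asking which requirements have no automated test linked to them. The sketch below uses plain dictionaries with hypothetical field names; it does not reflect the SpiraTest or Rapise data model.

```python
def automation_gaps(requirements, test_cases):
    """Return the requirement IDs that no automated test case covers."""
    automated = {tc["requirement"] for tc in test_cases if tc["automated"]}
    return sorted(r for r in requirements if r not in automated)


requirements = ["REQ-1", "REQ-2", "REQ-3"]
test_cases = [
    {"id": "TC-1", "requirement": "REQ-1", "automated": True},
    {"id": "TC-2", "requirement": "REQ-2", "automated": False},  # manual only
]
gaps = automation_gaps(requirements, test_cases)
```

REQ-2 shows up because its only test is manual, and REQ-3 because it has no test at all.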

Pillar II: Centralized Orchestration
Automation provides the greatest ROI when it is accessible and autonomous. A unified system acts as a Command Center, enabling teams to schedule and remotely execute tests across global labs directly from the management interface. Orchestration ensures the automatic capture of screenshots, logs, and outcomes into a centralized, auditable trail.

Pillar III: Data-Driven Scalability
Professional maturity in testing involves moving beyond simple record-and-playback functions toward Parameterization and Parallel Execution. By defining complex data sets in SpiraTest and deploying them through Rapise scripts simultaneously, teams can expand test coverage exponentially without increasing their maintenance footprint.
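
Parameterization and parallel execution can be illustrated with the standard library alone. The `login_test` function and its password rule are hypothetical stand-ins for a real UI test; the data set plays the role of rows defined in the test management tool.

```python
from concurrent.futures import ThreadPoolExecutor


def login_test(username, password, expected_ok):
    """Stand-in for a real UI test; the validation rule is hypothetical."""
    actual_ok = bool(username) and len(password) >= 8
    return actual_ok == expected_ok


# One test definition, many data rows: (username, password, expected result).
DATA_SET = [
    ("alice", "correct-horse", True),
    ("bob", "short", False),
    ("", "long-enough-pw", False),
]


def run_parallel(test_fn, rows, workers=4):
    """Run the same test once per data row, in parallel, preserving row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: test_fn(*row), rows))


results = run_parallel(login_test, DATA_SET)
```

Adding coverage means adding a row to the data set, not writing another script, which is where the maintenance savings come from.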


Pillar IV: The Continuous Feedback Loop (CI/CD)
In a modern CI/CD pipeline, the test management tool serves as the definitive quality gate. When build completions trigger Rapise automations, the results feed directly into SpiraTest release dashboards. This creates a self-documenting loop that provides stakeholders with a data-backed assessment of risk before code reaches production.
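
A quality gate of this kind can be reduced to a small decision function. The result shape, the 95% threshold, and the "critical" tag are assumptions made for this sketch, not settings taken from SpiraTest.

```python
def quality_gate(results, min_pass_rate=0.95, blocking_tags=("critical",)):
    """results: list of dicts with 'name', 'passed', and 'tags'.

    Returns (ok, reason). Any failed test carrying a blocking tag fails
    the gate outright; otherwise the overall pass rate decides.
    """
    if not results:
        return False, "no test results reported"

    blockers = [
        r["name"] for r in results
        if not r["passed"] and set(r["tags"]) & set(blocking_tags)
    ]
    if blockers:
        return False, "critical failures: " + ", ".join(blockers)

    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    if pass_rate < min_pass_rate:
        return False, f"pass rate {pass_rate:.0%} below {min_pass_rate:.0%}"
    return True, f"pass rate {pass_rate:.0%}"


build_results = [
    {"name": "checkout", "passed": False, "tags": ["critical"]},
    {"name": "search", "passed": True, "tags": []},
]
ok, reason = quality_gate(build_results)
```

In a pipeline, a step would typically exit with a non-zero status when `ok` is false, so the build stops before code reaches production.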

Leadership Perspective: Quality as a Discipline

True software excellence happens when strategy and execution are unified. By adopting a streamlined workflow, organizations reduce toolchain complexity and eliminate the labor-intensive effort required to synchronize disparate platforms. The result is a more resilient QA process where quality is a continuous, intelligent discipline rather than a final checkpoint.

Summary for QA Executives

If your team spends more time managing tools than testing software, your architecture is suffering from integration debt. A unified system like SpiraTest and Rapise goes beyond task automation; it synchronizes your entire quality strategy, ensuring that every automated action serves a documented business goal.

Author

Adam Sandman

Adam Sandman is a visionary entrepreneur and a respected thought leader in the enterprise software industry. As the Founder and CEO of Inflectra Corporation, Adam has dedicated his career to revolutionizing how businesses approach software development, testing, and lifecycle management.

Under Adam’s leadership, Inflectra has become a global provider of award-winning solutions, from SpiraTest’s powerful test management and flexible automation of Rapise to SpiraTeam’s end-to-end traceability. He has led Inflectra’s suite of software to grow into a global standard that empowers teams across the world to deliver high-quality software efficiently and collaboratively. His deep technical expertise, combined with a passion for innovation, has positioned him as a trusted voice in the field, influencing trends and shaping best practices for agile development and quality assurance.

Adam is known for his engaging presentations at industry conferences, where he shares insights on topics such as automation, project management, and emerging technologies. His ability to translate complex concepts into actionable strategies has earned him a reputation as an effective educator and mentor.

Beyond his technical acumen, Adam is committed to fostering a culture of inclusivity and collaboration within the tech community. Through thought-provoking blogs, webinars, and public speaking engagements, he inspires professionals worldwide to adopt forward-thinking approaches to software development and testing.

When he’s not at the helm of Inflectra, Adam enjoys exploring the latest advancements in technology, mentoring up-and-coming tech leaders,

Inflectra is participating in EuroSTAR Conference 2026 as a Gold Sponsor

Filed Under: EuroSTAR Expo, Gold Tagged With: 2026, EuroSTAR Conference, Expo
