Track Talk, T22

How to Train LLMs to Design Better Tests Than You?

István Forgács

16:45 - 17:30 CEST, Tuesday 16th June

Pre-trained large language models (LLMs) can generate test cases directly from requirements, but their performance is often disappointing. For example, in our benchmark study, GPT‑5 produced 16 test cases and detected only 67% of artificially seeded defects.
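To make the quoted figures concrete, seeded-defect benchmarks of this kind are typically scored mutation-style: a defect counts as detected when at least one test in the suite fails on the defective variant. The sketch below is illustrative only; the function and toy data are hypothetical stand-ins, not the harness used in the study.

```python
# Illustrative sketch of scoring a test suite against seeded-defect variants.
# All names and data here are hypothetical; this is not the study's benchmark harness.

def detection_rate(test_suite, seeded_variants, passes):
    """Fraction of seeded-defect variants exposed by at least one failing test.

    passes(variant, test) -> True if the test passes on that defective variant;
    a defect counts as detected when some test fails on it.
    """
    detected = sum(
        1
        for variant in seeded_variants
        if any(not passes(variant, test) for test in test_suite)
    )
    return detected / len(seeded_variants)


if __name__ == "__main__":
    # Toy data: 3 seeded defects, 2 tests; defect "d3" slips past both tests.
    outcomes = {
        ("d1", "t1"): False,  # t1 fails on d1 -> d1 detected
        ("d1", "t2"): True,
        ("d2", "t1"): True,
        ("d2", "t2"): False,  # t2 fails on d2 -> d2 detected
        ("d3", "t1"): True,
        ("d3", "t2"): True,   # no test fails on d3 -> undetected
    }
    rate = detection_rate(
        test_suite=["t1", "t2"],
        seeded_variants=["d1", "d2", "d3"],
        passes=lambda v, t: outcomes[(v, t)],
    )
    print(f"detection rate: {rate:.0%}")  # -> 67%
```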

To address this, we developed a specialized prompting approach that first analyzes requirements, identifies gaps or contradictions, and then applies advanced test design techniques. After extensive experimentation, we found Anthropic Claude Sonnet 4 to be the most effective foundation. We trained it to apply reliable domain testing, action‑state testing, complementary testing (a generalization of negative testing), extreme value testing, and the single‑fault assumption.
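As a rough picture of what such a staged prompt can look like when driven from code, here is a minimal sketch assuming the Anthropic Python SDK and a Claude Sonnet 4 model id. The system prompt and its two stages are simplified placeholders, not the actual prompt developed in the study.

```python
# Minimal sketch of a staged test-design prompt, assuming the Anthropic Python SDK.
# The system prompt and stages are simplified placeholders, not the study's prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You are a test design assistant. Work in two stages:\n"
    "1. Analyze the requirement: list ambiguities, gaps, and contradictions.\n"
    "2. Design test cases using domain testing, action-state testing, "
    "complementary (generalized negative) testing, extreme value testing, "
    "and the single-fault assumption. Justify every test against a technique."
)

def design_tests(requirement: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; adjust to the one you use
        max_tokens=4000,
        system=SYSTEM,
        messages=[{"role": "user", "content": requirement}],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(design_tests(
        "A price calculator gives a 10% discount for orders above 100 EUR "
        "and free shipping above 200 EUR."
    ))
```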

With these enhancements, the LLM consistently generated stronger test suites than human testers, including myself. On the same benchmark, its 15 generated test cases detected 100% of the seeded defects. Across nine benchmark programs available at test-design.org, the model achieved over 98% defect detection, compared with less than 80% for both human testers and baseline LLMs.

Designing the prompt was the hardest part. Traditional methods, such as few-shot prompting, meta-prompting, and chain-of-thought prompting, proved insufficient, largely due to biases such as attention bias and placement bias.

Instead, we relied on our specialized prompting approach. To organize the project, we combined Langfuse, an open-source LLM engineering platform, with Claude Sonnet 4 and our test design automation tool, Harmony, to iteratively build, debug, and refine the prompt.
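The iteration loop can be pictured roughly as follows: each prompt revision is run against the benchmark, the exchange is traced in Langfuse, and the defect-detection score is attached to the trace so revisions can be compared. This is a hedged sketch assuming the Langfuse v2 Python SDK; `call_claude` and `run_benchmark` are hypothetical stand-ins, and Harmony's role is not shown.

```python
# Rough sketch of the build-debug-refine loop, assuming the Langfuse v2 Python SDK.
# `call_claude` and `run_benchmark` are hypothetical stand-ins for the real pieces
# (e.g. the Claude call sketched above and a seeded-defect scoring harness).
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

def evaluate_prompt_revision(prompt_version: str, system_prompt: str, requirement: str,
                             call_claude, run_benchmark) -> float:
    trace = langfuse.trace(name="test-design-prompt", metadata={"version": prompt_version})

    generated_tests = call_claude(system_prompt, requirement)
    trace.generation(
        name="generate-tests",
        model="claude-sonnet-4-20250514",  # assumed model id
        input={"system": system_prompt, "requirement": requirement},
        output=generated_tests,
    )

    rate = run_benchmark(generated_tests)  # e.g. seeded-defect detection rate
    trace.score(name="defect-detection", value=rate)
    return rate

# After the loop, langfuse.flush() pushes the traces so revisions can be compared in the UI.
```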

In my presentation, I will demonstrate how to construct, test, and debug our prompt and how Copilot supported me throughout the process. I will introduce our advanced test design techniques and explain how we trained Claude Sonnet 4 to decide when to use states, how to create them, and how to satisfy rigorous test selection criteria.

I will also share strategies for overcoming the most critical prompting challenges and highlight the key factors behind a successful AI‑driven software testing project.

As AI adoption accelerates, many IT organizations struggle with prompt engineering. This talk will provide practical insights to help them avoid common pitfalls and harness LLMs to achieve superior outcomes.