Track Talk, T22

How to Train LLMs to Design Better Tests Than You?

István Forgács

16:45 - 17:30 CEST, Tuesday 16th June

Pre-trained large language models (LLMs) can generate test cases directly from requirements, but their performance is often disappointing. For example, in our benchmark study, GPT‑5 produced 16 test cases and detected only 67% of artificially seeded defects.
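To make the quoted figures concrete, seeded-defect benchmarks of this kind are typically scored mutation-style: a defect counts as detected when at least one test in the suite fails on the defective variant. The sketch below is illustrative only; the function and toy data are hypothetical stand-ins, not the harness used in the study.

```python
# Illustrative sketch of scoring a test suite against seeded-defect variants.
# All names and data here are hypothetical; this is not the study's benchmark harness.

def detection_rate(test_suite, seeded_variants, passes):
    """Fraction of seeded-defect variants exposed by at least one failing test.

    passes(variant, test) -> True if the test passes on that defective variant;
    a defect counts as detected when some test fails on it.
    """
    detected = sum(
        1
        for variant in seeded_variants
        if any(not passes(variant, test) for test in test_suite)
    )
    return detected / len(seeded_variants)


if __name__ == "__main__":
    # Toy data: 3 seeded defects, 2 tests; defect "d3" slips past both tests.
    outcomes = {
        ("d1", "t1"): False,  # t1 fails on d1 -> d1 detected
        ("d1", "t2"): True,
        ("d2", "t1"): True,
        ("d2", "t2"): False,  # t2 fails on d2 -> d2 detected
        ("d3", "t1"): True,
        ("d3", "t2"): True,   # no test fails on d3 -> undetected
    }
    rate = detection_rate(
        test_suite=["t1", "t2"],
        seeded_variants=["d1", "d2", "d3"],
        passes=lambda v, t: outcomes[(v, t)],
    )
    print(f"detection rate: {rate:.0%}")  # -> 67%
```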

To address this, we developed a specialized prompting approach that first analyzes requirements, identifies gaps or contradictions, and then applies advanced test design techniques. After extensive experimentation, we found Anthropic Claude Sonnet 4 to be the most effective foundation. We trained it to apply reliable domain testing, action‑state testing, complementary testing (a generalization of negative testing), extreme value testing, and the single‑fault assumption.
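As a rough picture of what such a staged prompt can look like when driven from code, here is a minimal sketch assuming the Anthropic Python SDK and a Claude Sonnet 4 model id. The system prompt and its two stages are simplified placeholders, not the actual prompt developed in the study.

```python
# Minimal sketch of a staged test-design prompt, assuming the Anthropic Python SDK.
# The system prompt and stages are simplified placeholders, not the study's prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You are a test design assistant. Work in two stages:\n"
    "1. Analyze the requirement: list ambiguities, gaps, and contradictions.\n"
    "2. Design test cases using domain testing, action-state testing, "
    "complementary (generalized negative) testing, extreme value testing, "
    "and the single-fault assumption. Justify every test against a technique."
)

def design_tests(requirement: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; adjust to the one you use
        max_tokens=4000,
        system=SYSTEM,
        messages=[{"role": "user", "content": requirement}],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(design_tests(
        "A price calculator gives a 10% discount for orders above 100 EUR "
        "and free shipping above 200 EUR."
    ))
```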

With these enhancements, the LLM consistently generated stronger test suites than human testers, including myself. On the same benchmark, its 15 generated test cases detected 100% of the seeded defects. Across nine benchmark programs available at test-design.org, the model achieved over 98% defect detection, compared with less than 80% for both human testers and baseline LLMs.

Designing the prompt was the hardest part. Traditional methods, such as few-shot prompting, meta-prompting, and chain-of-thought prompting, proved insufficient, largely due to biases such as attention bias and placement bias.

Instead, we relied on our specialized prompting approach. To organize the project, we combined Langfuse, an open-source LLM engineering platform, with Claude Sonnet 4 and our test design automation tool, Harmony, to iteratively build, debug, and refine the prompt.
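The iteration loop can be pictured roughly as follows: each prompt revision is run against the benchmark, the exchange is traced in Langfuse, and the defect-detection score is attached to the trace so revisions can be compared. This is a hedged sketch assuming the Langfuse v2 Python SDK; `call_claude` and `run_benchmark` are hypothetical stand-ins, and Harmony's role is not shown.

```python
# Rough sketch of the build-debug-refine loop, assuming the Langfuse v2 Python SDK.
# `call_claude` and `run_benchmark` are hypothetical stand-ins for the real pieces
# (e.g. the Claude call sketched above and a seeded-defect scoring harness).
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

def evaluate_prompt_revision(prompt_version: str, system_prompt: str, requirement: str,
                             call_claude, run_benchmark) -> float:
    trace = langfuse.trace(name="test-design-prompt", metadata={"version": prompt_version})

    generated_tests = call_claude(system_prompt, requirement)
    trace.generation(
        name="generate-tests",
        model="claude-sonnet-4-20250514",  # assumed model id
        input={"system": system_prompt, "requirement": requirement},
        output=generated_tests,
    )

    rate = run_benchmark(generated_tests)  # e.g. seeded-defect detection rate
    trace.score(name="defect-detection", value=rate)
    return rate

# After the loop, langfuse.flush() pushes the traces so revisions can be compared in the UI.
```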

In my presentation, I will demonstrate how to construct, test, and debug our prompt and how Copilot supported me throughout the process. I will introduce our advanced test design techniques and explain how we trained Claude Sonnet 4 to decide when to use states, how to create them, and how to satisfy rigorous test selection criteria.

I will also share strategies for overcoming the most critical prompting challenges and highlight the key factors behind a successful AI‑driven software testing project.

As AI adoption accelerates, many IT organizations struggle with prompt engineering. This talk will provide practical insights to help them avoid common pitfalls and harness LLMs to achieve superior outcomes.