
Track Talk, Th1

Automated Testing of Large Language Models: A Live Demo

Anupam Krishnamurthy

09:00 - 09:45 CEST Thursday 18th June

How do you test a large language model’s shape-shifting outputs using automated tests?

LLMs have boundless input possibilities, and the same input can often produce different outputs. Further, some requirements call for subjective evaluation, for example: the chatbot should be polite, provide concise answers, and address our customers in an informal, friendly tone. All of these pose challenges to automated testing.

The LLM-as-Judge technique has emerged as a promising solution. It involves prompting another LLM to evaluate the outputs of the LLM under test. However, these LLM judges need to be aligned with human judges, who are often domain experts, and this can be easier said than done.
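
To make the idea concrete, here is a minimal sketch of an LLM judge in Python. It is an illustration only, not the setup used in the talk: the prompt wording, the PASS/FAIL verdict format, and the names `build_judge_prompt`, `parse_verdict`, `judge` and `fake_llm` are all assumptions. The judge model is passed in as a plain callable, so no particular vendor API is implied.

```python
from typing import Callable

# Hypothetical judge prompt; the criteria mirror the example requirements above.
JUDGE_PROMPT = """You are evaluating a chatbot response.
Criteria: the response should be polite, concise, and informal in tone.

Question: {question}
Response: {response}

Answer with exactly one word: PASS or FAIL."""


def build_judge_prompt(question: str, response: str) -> str:
    """Fill the judge prompt with the question and the output under test."""
    return JUDGE_PROMPT.format(question=question, response=response)


def parse_verdict(raw: str) -> bool:
    """Map the judge's raw text to a boolean; fail loudly on anything else."""
    verdict = raw.strip().upper()
    if verdict.startswith("PASS"):
        return True
    if verdict.startswith("FAIL"):
        return False
    raise ValueError(f"Unparseable judge output: {raw!r}")


def judge(question: str, response: str, call_llm: Callable[[str], str]) -> bool:
    """Ask the judge model (any prompt -> text callable) for a verdict."""
    return parse_verdict(call_llm(build_judge_prompt(question, response)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real judge model, used only to make the sketch runnable."""
    return "PASS" if "thanks" in prompt.lower() else "FAIL"
```

In an automated test, `fake_llm` would be replaced by a call to a real judge model, and the test would assert on the boolean that `judge` returns.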

In this talk, I walk you through a structured approach to aligning LLM judges with human experts, using a live demo. Our system under test is a RAG-based system that generates responses to a list of questions. Together, we critically evaluate the responses to a few of these questions live, and then compare our collective judgement with that of an LLM judge. How closely does the machine’s evaluation align with ours?
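
One way to put a number on that alignment is to compare the human and judge verdicts item by item. The sketch below is an assumption about how this could be measured, not necessarily the method shown in the demo: it computes the raw agreement rate and Cohen's kappa (agreement corrected for chance) over pass/fail labels. The function names are hypothetical.

```python
from typing import Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of items on which the human and the LLM judge agree."""
    if len(human) != len(judge) or not human:
        raise ValueError("Label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Cohen's kappa for two raters with binary (pass/fail) labels."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    p_human_pass = sum(human) / n
    p_judge_pass = sum(judge) / n
    # Agreement expected if both raters labelled independently at their own rates.
    p_expected = (p_human_pass * p_judge_pass
                  + (1 - p_human_pass) * (1 - p_judge_pass))
    if p_expected == 1.0:
        # Both raters gave the same single label to every item.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up labels for four responses, for illustration only.
human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
print(agreement_rate(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))    # 0.5
```

A low kappa despite a high agreement rate suggests the judge agrees mostly by chance, which is a signal to revisit the judge prompt.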