In the last four years as a developer advocate at Qase, I’ve written more blog posts and LinkedIn articles than I can count. I’ve also reviewed hundreds of pieces from colleagues, checking whether the arguments hold, the examples work, and the claims are backed by evidence.
I got better at it over time. My earliest articles are nowhere close to my latest ones, both in knowledge and in writing style. Naturally, I tried using LLMs along the way. I quickly found out that LLMs can't be trusted to write on my behalf, and they certainly can't be trusted to fact-check. They hallucinate confidently, miss nuance, and produce text that sounds right but often isn't.
I did find one pattern that works: using LLMs as critics. I write, then I ask the LLM to poke holes in what I wrote. The diversity of their criticism is genuinely helpful. Even when they hallucinate in their feedback, it doesn't matter: I still have to go through every point and decide what to keep. The hallucinations become noise I can filter, not errors that leak into my content.
This was my comfortable setup for a while. Then my friend Anupam Krishnamurthy showed me something that changed how I think about content quality entirely.
The Spark
Anupam and I co-own BeyondQuality, an open research community for software quality topics. He presented his research on evaluating RAG systems using a framework called Ragas.
For those unfamiliar: RAG (Retrieval Augmented Generation) is a way to make LLMs answer questions from your own documents. You chunk your content, store it in a vector database, and when someone asks a question, the system retrieves the relevant chunks and feeds them to the LLM to generate an answer. Ragas is a framework that evaluates how well this works: does the answer stick to the retrieved content? Did the system retrieve the right content in the first place?
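The retrieval step can be sketched in a few lines of pure Python. This is only an illustration: a bag-of-words similarity stands in for the learned embeddings a real system would use, the sorted list stands in for a vector database, and all the names and sample chunks here are made up for the example.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy "embedding": a term-frequency vector over lowercase words.
    # A real RAG system uses a learned embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank all stored chunks by similarity to the question and keep the
    # top k. A vector database does the same thing, just at scale.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "A refund is processed within five business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
top = retrieve("How do I get a refund?", chunks, k=2)
# Both refund-related chunks outrank the unrelated one; in a real RAG
# pipeline they would be fed to the LLM as context for the answer.
```

The key property to notice: the LLM never sees the whole corpus, only the handful of chunks the retriever judged relevant, which is exactly why evaluating retrieval quality matters.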
Anupam built a RAG on the 37signals employee handbook and used Ragas to evaluate it. This was eye-opening. I started thinking about LLMs and content quality in a completely new way.
From Curiosity to Production
My first experiment was pure fun: I built a chat interface to a Deming book I own. It worked, but there was no real business case. I know what Deming writes; I’d rather just re-read the book.
The second idea had a real business case. At Qase, we have over a hundred blog posts, a Help Centre, and a customer support knowledge base. Every new piece of content operates on top of everything published before. If a new article contradicts something from six months ago, readers notice. It confuses people and erodes trust.
This is a classical regression problem, just in content, not in code. A company with hundreds of published pieces has the same problem as a codebase with no tests: changes go out unchecked against existing behavior. Before this, the only check was someone reading a draft and trying to remember if it contradicts anything. With a hundred articles, that’s not realistic. With RAG, I could retrieve just the relevant pieces from the existing corpus and check the new draft against them automatically. Ragas then evaluates how well the retrieval and checking actually work. The feedback loop went from “hope someone remembers” to a 25-minute automated run.
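The regression loop described above can be sketched as follows. Everything here is a hypothetical outline, not our production code: `retrieve` and `call_llm` are stand-ins for whatever retrieval and model you plug in, and the prompt wording is an assumption.

```python
def build_contradiction_prompt(draft: str, related: list[str]) -> str:
    # Assemble a prompt asking the model to flag contradictions between
    # a new draft and previously published excerpts.
    context = "\n".join(f"- {chunk}" for chunk in related)
    return (
        "You are reviewing a new draft against previously published content.\n"
        "List any claims in the draft that contradict the excerpts below, "
        "or reply 'NO CONTRADICTIONS'.\n\n"
        f"Published excerpts:\n{context}\n\nDraft:\n{draft}"
    )

def check_draft(draft, corpus_chunks, retrieve, call_llm, k=5):
    # The regression check: retrieve the k published chunks most relevant
    # to this draft, then ask the model to compare the draft against them.
    related = retrieve(draft, corpus_chunks, k)
    return call_llm(build_contradiction_prompt(draft, related))

prompt = build_contradiction_prompt(
    "Refunds now take ten business days.",
    ["A refund is processed within five business days."],
)
# The prompt contains both the old claim and the new one, so the model
# has everything it needs to spot the conflict.
```

The point of the structure is that only the retrieved excerpts travel to the model, so the check stays cheap even as the corpus grows.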
What Happened When We Tried It
The first obstacle was building the evaluation itself. To test whether the system answers correctly, you need ground truths: the key claims and ideas from your content. I tried using several LLMs to extract these from our 106 blog posts. They all gave inconsistent results. The LLMs couldn’t agree on what the articles were actually saying. I ended up reading every article myself and writing down the key ideas manually. There is no shortcut here yet.
Then I had to write evaluation questions. This turned out to be the same problem as writing tests after the code is already written: since you already know the system, you’re inclined to write tests that just confirm what’s there. My first questions were like that. They assumed knowledge of the content and led toward the answer. It took three iterations to learn to write questions the way a real person would ask them: short, simple, with genuine uncertainty about which way the answer goes.
Once I had 240 questions and the evaluation was running, the results told a clear story. On broad, open-ended questions, the LLM stopped relying on the retrieved content and started answering from its own general knowledge. It sounded confident and correct, but it was no longer grounded in what we actually wrote. Ragas caught this. Without measurement, we would never have noticed.
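To see how a groundedness check can work in principle, here is a crude lexical stand-in: it scores the fraction of answer sentences whose words mostly come from the retrieved contexts. Ragas's actual faithfulness metric uses an LLM to verify each claim against the context; this toy version only illustrates the idea of measuring whether an answer stayed grounded.

```python
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(answer: str, contexts: list[str], threshold: float = 0.5) -> float:
    # Fraction of answer sentences whose vocabulary mostly appears in the
    # retrieved contexts. A sentence counts as "supported" when at least
    # `threshold` of its words occur somewhere in the contexts. This is a
    # lexical proxy; Ragas verifies actual claims with an LLM.
    context_vocab = set().union(*(words(c) for c in contexts)) if contexts else set()
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        1 for s in sentences
        if words(s) and len(words(s) & context_vocab) / len(words(s)) >= threshold
    )
    return supported / len(sentences)

contexts = ["A refund is processed within five business days."]
grounded = grounding_score("A refund is processed within five business days.", contexts)
drifted = grounding_score("Quantum computing will revolutionize cryptography soon.", contexts)
# The grounded answer scores high; the answer drawn from "general
# knowledge" scores low, which is the failure mode described above.
```

Even this crude version makes the failure visible: a confident answer that shares no vocabulary with the retrieved content scores near zero.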
The whole evaluation for 240 questions costs $0.60 and takes 25 minutes; I've seen classic automated test suites run for longer!
And then the biggest surprise. For some questions, the system couldn’t find the right article to answer from. Not because the retrieval was broken, but because we simply hadn’t written about those topics well enough. We set out to test the RAG. We ended up finding holes in our own content.
Where This Is Going
This research started with Anupam. Without his curiosity and his work at BeyondQuality, I would not have explored this direction at all.
Today we have regression testing for our content: the blog, the Help Centre, the CS knowledge base. New drafts get checked against everything we’ve already published. What started as a contradiction checker is now also helping us find gaps in our existing content we didn’t know were there.
If you want to see how RAG evaluation works hands-on, come to Anupam's live demo at EuroSTAR in Oslo in June 2026. He will walk through building and evaluating a RAG system step by step. And if the intersection of software quality and AI interests you, follow our work at BeyondQuality, where all research is published openly.
I’ll be at the Qase booth throughout the conference. If any of this sparked your curiosity, come say hi.
Author

Vitaly Sharovatov
As a quality enthusiast, I believe that people should take pride in their work and companies should aim to produce high-quality products. I have spent the last 24 years in IT, focusing on engineering, quality assurance and mentorship. I am also a huge animal lover and have saved and raised more than 50 cats and dogs.
Qase is participating in EuroSTAR Conference 2026 as a Gold Sponsor.