It was a meeting like any other. The head of support for our company of more than 2,500 employees walked into the room and looked at my team. Then he said, in a concerned voice: “Please, follow me.” We sensed it before he uttered a word: our client’s production system was down. We reached a conference room full of apprehensive faces. Our top experts, all dedicated to unblocking the situation, were in that room. Millions of dollars were at stake, but the first problem was that no one knew what had happened. We had bet everything on the logs to tell us, but that day, it seemed they had let us down.
As the QA Lead for the module at fault, I could not help but think, ‘How did we allow this to happen? Why did we not give the logging system the testing it deserved?’ Then I realised: it was not that the logs had betrayed us; we had betrayed the logs.
Some people neglect logs entirely; many glance at them once; others write rigid automated checks that need maintenance every time the code changes. I crafted a test strategy that goes beyond the conventional scope, spanning the logs’ content, collection, storage, format, and display. I made a bold move by entrusting AI with verifying the accuracy of the logs, a task typically reserved for human scrutiny. Our assertions now catch bugs at minimal maintenance cost.
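To make the contrast with rigid checks concrete, here is a minimal sketch of the kind of low-maintenance log assertion the strategy favours. It assumes structured JSON logs with `level` and `message` fields; the function names and the sample log line are hypothetical, not taken from our actual system. The idea is to assert on an entry's level and intent (via a loose pattern) rather than on its exact wording, so minor message rewrites don't break the test.

```python
import json
import re

def parse_log_lines(raw: str) -> list[dict]:
    """Parse newline-delimited JSON log entries, skipping malformed lines."""
    entries = []
    for line in raw.strip().splitlines():
        try:
            entries.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate noise instead of failing the whole test
    return entries

def assert_log_event(entries: list[dict], level: str, pattern: str) -> dict:
    """Assert that some entry has the given level and a message matching
    a loose regex, instead of comparing against an exact string."""
    for e in entries:
        if e.get("level") == level and re.search(pattern, e.get("message", "")):
            return e
    raise AssertionError(f"no {level} entry matching {pattern!r}")

# Hypothetical captured output: one structured error entry.
raw_logs = '{"level": "ERROR", "message": "Payment gateway timeout after 30s", "ts": "2024-05-01T10:00:00Z"}'
entries = parse_log_lines(raw_logs)
event = assert_log_event(entries, "ERROR", r"timeout")
```

Because the assertion names only the level and a stable keyword, the check survives cosmetic changes to the log text while still failing loudly if the event disappears or is logged at the wrong severity.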
In this presentation, I’ll invite you along for that pivotal day. I’ll reveal how we navigated the crisis, emerged stronger with a robust test strategy for logs and their monitoring, and forged a deeper bond with our client. I’ll present a strategy for logging, from creation to display, and I’ll show you how to apply AI to help evaluate the accuracy of your logs.