This lesson introduces production testing for agentic AI systems — the kind that combine LLMs with traditional software components. You'll learn why testing isn’t just a nice-to-have, but a core part of building reliable, real-world applications, and you’ll get familiar with the four key types of tests: unit, integration, system, and performance.
What’s the difference between a good demo project and a production-grade system?
What’s the difference between the work of a junior AI developer and someone with real production experience?
It’s not just model accuracy.
It’s not just a slick Streamlit app.
It's a shift in mindset — from "Does it work for me?" to "Will it keep working when things change?"
And one of the clearest signals of that mindset? Testing.
You can often tell what kind of developer built a system just by scanning the repo:
Junior developers often build systems that work — as long as everything goes right.
Experienced developers build systems that work even when something goes wrong.
That’s the difference. And it shows.
Let’s be honest. Most first attempts at AI systems (or software projects in general) are built for the happy path: clean inputs, reliable APIs, and a model that returns exactly what you expect.
But production rarely stays on the happy path. Inputs get messy, dependencies update, APIs time out, and model outputs drift.
And suddenly, your once-reliable system collapses in unexpected ways.
Having something “work on your laptop” is not the same as trusting it in production. That’s why we test.
Let’s also acknowledge something else.
Testing doesn’t always sound exciting. It’s not as glamorous as chaining prompts or building clever tool-using agents.
It might even feel like someone else’s job: a task to be handled by a separate QA team, or pushed off until the end.
That used to be the norm in software teams.
But that mindset is changing fast.
Modern engineering teams now expect developers to write their own tests. Why? Because the people building the system are the ones who understand its edge cases best.
And the best developers? They design with testing in mind. That approach has a name, Test-Driven Development (TDD), and it consistently leads to more reliable, maintainable systems.
So don’t treat testing like a chore. Treat it like part of the craft. It's an important mindset shift that will make you a better developer.
Test-Driven Development is a software development methodology where you write tests before writing the actual code. The TDD cycle follows three simple steps:

1. Red: write a small test that describes the behaviour you want. It fails, because the code doesn’t exist yet.
2. Green: write just enough code to make that test pass.
3. Refactor: clean up the implementation while keeping the tests green.
Most experienced developers swear by TDD because it forces you to think about your code's behaviour before implementation. This leads to better design, clearer requirements, and more maintainable code. When you write tests first, you're essentially defining a contract for how your code should behave.
TDD also prevents over-engineering. Since you only write code to make tests pass, you avoid adding unnecessary features or complexity that might introduce bugs later.
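To make the cycle concrete, here’s a minimal sketch with pytest. Everything in it is hypothetical and invented for illustration: the two tests at the bottom are written first and fail (Red), then `slugify_tool_name` is filled in just enough to make them pass (Green), and can later be cleaned up without breaking them (Refactor).

```python
# tdd_sketch.py -- a toy TDD example; all names here are hypothetical
import pytest


def slugify_tool_name(name: str) -> str:
    """Minimal implementation, written only after the tests below existed (Green step)."""
    cleaned = name.strip().lower().replace(" ", "_")
    if not cleaned:
        raise ValueError("tool name must not be empty")
    return cleaned


# These tests were written first (Red step): they describe the behaviour we want.
def test_slugify_lowercases_and_strips_whitespace():
    assert slugify_tool_name("  Web Search ") == "web_search"


def test_slugify_rejects_empty_input():
    with pytest.raises(ValueError):
        slugify_tool_name("   ")
```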
The next few lessons are about building the testing muscle you’ll need to take your agentic system seriously.
Not just testing the LLM’s accuracy (what we call evaluation), but testing your actual software: the prompt templates, tool wrappers, validators, parsers, and glue code around the model.
In other words: all the stuff that breaks when you least expect it.
Let’s walk through the core categories of testing you’ll be using throughout the rest of this week. Each one answers a different question and catches a different kind of failure.
These are the smallest tests in your system. You’re checking individual functions, prompt templates, validators, or tool wrappers.
For example, you might check that a prompt template fills in its variables correctly, or that a validator rejects malformed output (see the sketch below).
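Here’s a minimal sketch of such a unit test with pytest. The `build_summary_prompt` helper is hypothetical and defined inline so the example runs on its own; in a real project it would live in your application code.

```python
# test_prompt_template.py -- a unit test targets one building block in isolation.
import pytest

SUMMARY_TEMPLATE = "Summarize the following text in one sentence:\n\n{text}"


def build_summary_prompt(text: str) -> str:
    # Hypothetical helper: renders the template and rejects empty input.
    if not text.strip():
        raise ValueError("text must not be empty")
    return SUMMARY_TEMPLATE.format(text=text)


def test_prompt_contains_the_user_text():
    prompt = build_summary_prompt("Quarterly revenue grew 12%.")
    assert "Quarterly revenue grew 12%." in prompt


def test_prompt_rejects_empty_input():
    with pytest.raises(ValueError):
        build_summary_prompt("   ")
```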
Unit tests are easy to run, fast to write, and give you confidence that the building blocks of your system behave as expected.
Sometimes each component works fine… until you connect them.
Integration tests verify that your chain of steps — like prompt → tool → format → output — actually works when real data flows through it.
For example, you might push a realistic user query through the whole chain with a stubbed model and check that the final output comes back well-formed (see the sketch below).
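Here’s a rough sketch of that idea. The pipeline and the `FakeLLM` stub below are hypothetical stand-ins for whatever your system actually wires together; the point is that the test exercises the connections, not any single piece.

```python
# test_pipeline_integration.py -- checks that the pieces still work once connected.
import json


class FakeLLM:
    """Stands in for the real model client so the test is fast and deterministic."""

    def complete(self, prompt: str) -> str:
        return json.dumps({"answer": "42", "confidence": 0.9})


def build_prompt(question: str) -> str:
    return f"Answer the question and reply as JSON: {question}"


def run_pipeline(question: str, llm) -> dict:
    prompt = build_prompt(question)      # prompt step
    raw = llm.complete(prompt)           # tool / model step
    parsed = json.loads(raw)             # format step
    return {"question": question, "answer": parsed["answer"]}  # output step


def test_pipeline_produces_structured_output():
    result = run_pipeline("What is 6 x 7?", llm=FakeLLM())
    assert result["answer"] == "42"
    assert result["question"] == "What is 6 x 7?"
```

Swapping the real model client for a deterministic fake keeps the test fast and repeatable; the same pattern works with `unittest.mock` or a pytest fixture.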
You’ll write integration tests to simulate realistic system behavior and catch breakdowns in coordination between parts.
System tests (also called end-to-end tests) check that your entire application, from input to final output, behaves correctly under real-world conditions. They simulate the full user experience across tools, prompts, and memory.
For example, you might run a complete user request through the app and check that the final answer, the intermediate tool calls, and the updated memory all look right (see the sketch below).
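As a sketch, a system test drives the application through its public entry point only, the way a user or calling service would. The toy `Agent` and its scripted model below are hypothetical:

```python
# test_agent_system.py -- exercises the whole app through its public entry point.


class ScriptedLLM:
    """Returns canned responses so the end-to-end run is deterministic."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def complete(self, prompt: str) -> str:
        return next(self._replies)


class Agent:
    """Toy agent: keeps conversational memory and calls the model once per turn."""

    def __init__(self, llm):
        self.llm = llm
        self.memory = []

    def ask(self, user_input: str) -> str:
        self.memory.append(f"user: {user_input}")
        prompt = "\n".join(self.memory) + "\nassistant:"
        reply = self.llm.complete(prompt)
        self.memory.append(f"assistant: {reply}")
        return reply


def test_full_conversation_flow():
    agent = Agent(ScriptedLLM(["Hello! How can I help?", "Paris."]))

    first = agent.ask("Hi there")
    second = agent.ask("What is the capital of France?")

    assert first == "Hello! How can I help?"
    assert second == "Paris."
    # Memory holds both turns, proving state persisted across the whole run.
    assert len(agent.memory) == 4
```

Because the only dependency is the scripted model, a test like this can run in CI without network access or API keys.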
System tests simulate actual usage scenarios — not just to verify correctness, but to build confidence that your AI app behaves reliably from start to finish.
Performance tests check how your system behaves under stress. They help you understand how response times hold up under load, how many concurrent requests your system can serve, and what happens when it is pushed past its limits.
These tests are especially important if you’re deploying to a platform like Hugging Face Spaces or Render, where cold starts, resource limits, and timeouts can cause hard-to-debug behavior.
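Here’s a simple, hypothetical sketch of the idea using plain pytest and timing assertions. The `handle_request` function and the latency budgets are placeholders to adapt to your own system; for serious load testing you’d typically reach for a dedicated tool such as Locust.

```python
# test_performance.py -- rough latency and concurrency checks.
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(payload: str) -> str:
    # Stand-in for your app's request handler; pretend it does a little work.
    time.sleep(0.01)
    return payload.upper()


def test_single_request_stays_under_latency_budget():
    start = time.perf_counter()
    handle_request("ping")
    elapsed = time.perf_counter() - start
    assert elapsed < 0.5  # hypothetical budget: half a second per request


def test_handles_a_burst_of_concurrent_requests():
    with ThreadPoolExecutor(max_workers=10) as pool:
        start = time.perf_counter()
        results = list(pool.map(handle_request, ["ping"] * 50))
        elapsed = time.perf_counter() - start

    assert len(results) == 50
    assert elapsed < 5.0  # the whole burst should finish within a few seconds
```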
In the next lesson, we’ll jump into pytest, the testing framework we’ll use throughout the week.
We’ll show you how to install it, structure your test files, and write and run your first tests.
Let’s get into it.