Amar Gautam
5 min read

The evaluation problem nobody wants to solve

Everyone wants to ship AI. Almost nobody wants to build the evaluation framework that tells you whether it's working.

agentic-ai · production

I've noticed a pattern in how AI projects get staffed. You have ML engineers for the model, backend engineers for the infrastructure, frontend engineers for the interface, maybe a PM to hold it together. What you almost never have is someone whose job is evaluation.

This is kind of bizarre when you think about it. You're deploying a system whose behavior is probabilistic, whose outputs vary with every input, and whose quality can degrade silently over time. And nobody is responsible for measuring whether it's working.

I think evaluation is the most important unsolved problem in production AI. The reason it stays unsolved is that it's genuinely hard and nobody wants to do it.

The obvious reason it gets skipped is that it's not glamorous. Building models is exciting. Building evaluation frameworks is tedious. The team that builds the model gets credit for shipping the feature. The team that would build evaluation gets credit for... finding problems. Nobody's career was made by writing great test cases for an AI system.

But there's a deeper reason too. Evaluation forces you to define what "good" means, and for most AI applications, nobody actually wants to have that conversation.

Take a system that summarizes customer support tickets. What's a good summary? One that captures the customer's intent? One that highlights actionable information? One that's concise? One that includes all relevant context? These aren't the same thing, and they often conflict. A comprehensive summary isn't concise. A summary that highlights action items might miss important context.
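One way to force that conversation into the open is to write the criteria down as an explicit, weighted rubric. This is a toy sketch — the criteria and weights are assumptions a team would have to argue over, which is exactly the point:

```python
# Hypothetical rubric for a support-ticket summarizer. The weights encode
# a trade-off: here, capturing intent beats concision. Whether that's
# right is a product decision, not a modeling one.
RUBRIC = {
    "captures_intent": 0.4,
    "actionable": 0.3,
    "concise": 0.2,
    "complete": 0.1,
}

def score(ratings: dict) -> float:
    # ratings: criterion -> 0..1 judgment (from a human or a model).
    return sum(RUBRIC[k] * ratings.get(k, 0.0) for k in RUBRIC)
```

The numbers matter less than the fact that writing them down makes the conflict between criteria visible and debatable.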

Most teams avoid this conversation by shipping the feature and seeing if users complain. This works until it doesn't, which is usually when someone important encounters a bad output and the team scrambles to figure out how widespread the problem is. Without an evaluation framework, they can't answer that question. They don't know their failure rate. They don't know which inputs produce bad outputs. They don't know if things are getting better or worse.

For traditional software, testing is straightforward. You write a test. The function returns 4 or it doesn't. Pass or fail.
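That binary case, as a line of hypothetical Python:

```python
def add(a: int, b: int) -> int:
    return a + b

# Deterministic code gives a binary verdict: pass or fail.
assert add(2, 2) == 4
```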

For AI systems, nothing about evaluation is binary. The output isn't right or wrong, it's somewhere on a spectrum, and where it falls depends on context and use case and user expectations. This makes normal software testing almost useless.

What you need instead is something more layered.

Start with what I'd call behavioral testing. Not "does the model produce the right output" but "does it behave correctly across a representative sample of scenarios." This means building evaluation datasets that cover the full range of inputs you'll actually see, including the messy, ambiguous ones that never appear in training data. Most teams test with clean examples. Production sends dirty ones.
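A behavioral suite might look like this sketch — the scenarios, categories, and checks are hypothetical placeholders for whatever inputs your system actually sees:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Scenario:
    input: str                      # what the system receives
    category: str                   # e.g. "clean", "messy", "ambiguous"
    check: Callable[[str], bool]    # property the output must satisfy

# Illustrative suite: cover the messy cases, not just the clean ones.
scenarios = [
    Scenario("Refund my order #1234", "clean",
             lambda out: "refund" in out.lower()),
    Scenario("it broke AGAIN!!! third time. fix it or cancel", "messy",
             lambda out: "cancel" in out.lower() or "escalate" in out.lower()),
]

def run_suite(system: Callable[[str], str],
              scenarios: List[Scenario]) -> Dict[str, float]:
    # Report a pass rate per category, not a single pass/fail.
    results: Dict[str, list] = {}
    for s in scenarios:
        results.setdefault(s.category, []).append(s.check(system(s.input)))
    return {cat: sum(r) / len(r) for cat, r in results.items()}
```

Per-category pass rates matter because a 95% overall score can hide a 40% score on the messy bucket your users actually live in.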

Then there's monitoring in production. Your evaluation dataset is always a sample. Production is the population. You need to measure output quality continuously, which means building automated quality signals (consistency checks, confidence tracking, user feedback loops) and reviewing actual outputs regularly. Almost nobody does this part. It's the part that catches silent degradation.
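A minimal sketch of that monitoring layer, with two illustrative stand-in signals (a length sanity check and a user-feedback rate) — real signals would be task-specific:

```python
from collections import deque
from typing import Optional

class QualityMonitor:
    """Rolling quality signals over recent production outputs (sketch)."""

    def __init__(self, window: int = 1000):
        self.lengths_ok = deque(maxlen=window)
        self.feedback = deque(maxlen=window)

    def record(self, output: str, thumbs_up: Optional[bool] = None):
        # Consistency check: outputs outside a sane length band are suspect.
        self.lengths_ok.append(10 <= len(output) <= 2000)
        if thumbs_up is not None:
            self.feedback.append(thumbs_up)

    def signals(self) -> dict:
        def rate(xs):
            return sum(xs) / len(xs) if xs else None
        return {"length_ok_rate": rate(self.lengths_ok),
                "thumbs_up_rate": rate(self.feedback)}
```

The rolling window is what makes silent degradation visible: you compare this week's signal values against last week's instead of against nothing.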

And then failure mode analysis. Every AI system has categories of inputs where it falls apart. You need to find those before your users do. This means systematically probing the boundaries, understanding where the system is solid and where it's fragile, and watching the fragile areas closely.
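Probing can start as simply as a curated set of suspected-fragile inputs tallied by category — the probe set and acceptability check here are made-up placeholders:

```python
from collections import Counter
from typing import Callable, Dict

# Hypothetical probe set: each entry targets a suspected fragile area.
probes = {
    "empty_input": "",
    "very_long": "word " * 5000,
    "mixed_language": "Please resumir este ticket por favor",
    "contradictory": "Cancel my order. Actually don't. Or do.",
}

def find_failure_modes(system: Callable[[str], str],
                       probes: Dict[str, str],
                       is_acceptable: Callable[[str], bool]) -> Counter:
    # Tally which categories of input the system handles poorly.
    failures = Counter()
    for mode, text in probes.items():
        if not is_acceptable(system(text)):
            failures[mode] += 1
    return failures
```

The output isn't a score; it's a map of where to concentrate monitoring and test coverage.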

The hardest part of all this is that you often need human judgment to assess quality, and human judgment is expensive, slow, and inconsistent.

I've seen teams try to solve this with automated metrics. BLEU scores, cosine similarity, exact match. These work for some narrow tasks, but for most enterprise applications, automated metrics don't actually correlate well with quality. A response can score great on automated metrics and still be wrong in a way that matters.
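A quick illustration of why surface metrics mislead, using token-overlap F1 as a stand-in for BLEU-style scoring — the example strings are invented:

```python
def token_f1(pred: str, ref: str) -> float:
    # Simple token-overlap F1, a stand-in for surface metrics like BLEU.
    p, r = pred.lower().split(), ref.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

ref = "the customer wants a refund for order 1234"
wrong = "the customer wants a refund for order 4321"  # wrong order number
# token_f1(wrong, ref) is high, but the answer fails where it matters.
```

Seven of eight tokens match, so the score looks excellent while the one token that carries the business-critical fact is wrong.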

The practical approach, I think, is to invest in human evaluation for a subset of outputs and use that to calibrate cheaper automated signals. Review 100 outputs carefully. Understand the failure modes. Build automated detectors for those specific failures. Then monitor the automated signals at scale, with periodic human review to check they're still calibrated.
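That calibration loop can be sketched in a few lines — `detector_agreement`, the label shapes, and the 0.85 threshold are all assumptions, not a standard:

```python
from typing import List

def detector_agreement(human_labels: List[bool],
                       detector_flags: List[bool]) -> float:
    """Fraction of human-reviewed outputs where a hypothetical automated
    failure detector agrees with the human judgment."""
    assert len(human_labels) == len(detector_flags)
    agree = sum(h == d for h, d in zip(human_labels, detector_flags))
    return agree / len(human_labels)

def needs_recalibration(human_labels: List[bool],
                        detector_flags: List[bool],
                        threshold: float = 0.85) -> bool:
    # If agreement drifts below the threshold, the cheap signal no
    # longer tracks human judgment and the detector needs rework.
    return detector_agreement(human_labels, detector_flags) < threshold
```

The periodic human review is what keeps the automated signal honest: without it, you're monitoring a proxy that may have quietly stopped meaning anything.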

This sounds expensive, and it is. But the alternative is shipping a system you can't measure, which means you can't improve it systematically, which means you're just hoping it works. In my experience, teams that invest in evaluation early ship better systems and iterate faster. The evaluation framework turns vague complaints about "model quality" into specific engineering problems you can actually fix.

The evaluation problem gets significantly harder with agentic AI, which is concerning because those are exactly the systems most likely to land in high-stakes enterprise settings.

An agentic system doesn't just produce an output. It takes actions, makes decisions, chains steps together, interacts with external systems. Evaluating a single output is hard enough. Evaluating a multi-step workflow where each step depends on the previous one, and where the system might take different paths depending on intermediate results, is a whole different thing.

For agentic systems, you need to evaluate not just output quality but decision quality. Did it choose the right action at each step? Did it know when to escalate versus proceed? Did it handle errors well? Did the whole workflow produce the right outcome?
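One way to structure that, sketched with hypothetical types: score step-level decision quality separately from the final outcome, since a workflow can stumble into the right result through wrong decisions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    action: str      # action the agent actually took
    expected: str    # action a reviewer judged correct at that point

@dataclass
class TrajectoryEval:
    steps: List[Step]
    outcome_ok: bool           # did the whole workflow end correctly?

    def decision_accuracy(self) -> float:
        # Step-level decision quality, independent of the outcome.
        right = sum(s.action == s.expected for s in self.steps)
        return right / len(self.steps)
```

A trajectory with `outcome_ok=True` but low `decision_accuracy` is a system that got lucky, and luck is exactly what silent degradation looks like before it bites.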

I don't think we've figured this out yet as an industry. Most agentic AI systems are evaluated on vibes. Does it feel like it's working? Are users complaining? That's not evaluation. That's hope with a feedback form.

The teams that build real evaluation frameworks for agentic systems will have a huge advantage. Not just because evaluation is quality assurance, but because it's the foundation for getting better over time. You can't improve what you can't measure, and right now most teams can't measure the thing that matters most.