Field Notes.

AI Triage: Evaluate First, Automate Second

Roger Rodriguez


I have seen this movie a few times now.

An org gets excited about AI triage, someone wires up auto-routing quickly, and for two days it feels magical.

Then reality shows up. Work items land in the wrong queue, severity gets overcalled on noisy phrasing, and senior folks start doing cleanup work that nobody planned for.

The pattern is predictable: we automate first, evaluate later.

I think that order is upside down.

If you cannot prove that triage decisions are accurate, calibrated, and safe under real intake noise, auto-routing just scales mistakes faster.

This post is the approach I now use before I let automation touch production routing.

The job to be done (and non-goals)

Before touching models, I force myself to write down what triage is actually responsible for across any intake workflow.

Core triage outcomes:

  • Classify ticket intent
  • Estimate severity
  • Route to the right destination
  • Draft a safe first response
  • Surface uncertainty when context is weak

Non-goals:

  • Fully replacing human judgment on high-risk categories
  • Hiding uncertainty behind fluent output
  • Optimizing speed at the expense of reroutes and reopen rates

It sounds obvious, but blurry boundaries are where expensive mistakes start.

Write a triage decision contract

Early on, I treated triage prompts like creative writing. That was a mistake.

What worked better was treating triage like an API contract.

Minimum output schema:

  • category
  • severity
  • target_queue
  • confidence
  • rationale
  • draft_reply
  • needs_human_review

Minimum behavior rules:

  • Must abstain when confidence is below threshold
  • Must explain decisions in operator-readable terms
  • Must fail closed on schema or policy violations

If the system cannot produce these fields reliably, it is not ready for automation. Full stop.
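To make the contract concrete, here is a minimal sketch of the schema and behavior rules as a Python dataclass plus a validator. The field names mirror the schema above; the confidence floor and severity values are illustrative assumptions, not a specific library's API.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.75  # assumed per-deployment threshold, tuned from eval data

@dataclass
class TriageDecision:
    category: str
    severity: str          # e.g. "low" | "medium" | "high" (illustrative)
    target_queue: str
    confidence: float      # 0.0 to 1.0
    rationale: str         # operator-readable explanation
    draft_reply: str
    needs_human_review: bool

def validate(decision: TriageDecision) -> TriageDecision:
    """Enforce the behavior rules: fail closed on bad values, abstain below threshold."""
    if not 0.0 <= decision.confidence <= 1.0:
        # Schema violation: fail closed rather than guessing.
        raise ValueError("confidence out of range; failing closed")
    if decision.confidence < CONFIDENCE_FLOOR:
        # Abstain path: low confidence forces human review.
        decision.needs_human_review = True
    return decision
```

The useful property is that every downstream consumer (routing, dashboards, audit logs) reads the same validated structure, so "not ready for automation" becomes a checkable condition rather than an opinion.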

Build an eval set that matches reality

One of the easiest traps is evaluating on tidy, textbook intake items.

Include:

  • Common categories (billing, bugs, usage, feature requests)
  • High-risk lanes (trust and safety, legal-sensitive issues)
  • Ambiguous items with mixed intent
  • Sparse items missing key context
  • Adversarial edge cases that tempt overconfident misrouting

Real intake streams are messy. Your eval set should be messy too.
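One way to keep the eval set honest is to tag every case with the slice it belongs to, so you can report results per slice instead of one flattering average. A hypothetical case format (the items, queue names, and tags below are invented for illustration):

```python
# Each eval case carries raw intake text, a human-labeled expected outcome,
# and tags for slicing results later. All examples here are hypothetical.
EVAL_CASES = [
    {"text": "I was double charged last month",
     "expected_queue": "billing", "tags": ["common"]},
    {"text": "App keeps crashing AND I want my money back!!!",
     "expected_queue": "billing", "tags": ["ambiguous", "mixed_intent"]},
    {"text": "help",
     "expected_queue": "needs_info", "tags": ["sparse"]},
    {"text": "URGENT: legal action pending, escalate immediately",
     "expected_queue": "legal", "tags": ["high_risk", "adversarial"]},
]

def cases_by_tag(tag: str) -> list[dict]:
    """Slice the eval set so each risk lane gets its own pass/fail report."""
    return [case for case in EVAL_CASES if tag in case["tags"]]
```

Reporting accuracy on `cases_by_tag("high_risk")` separately from `cases_by_tag("common")` is what surfaces the failure modes a blended score hides.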

Measure what operations actually cares about

I like model metrics, but operations does not run on F1 score alone.

Track metrics that show whether the workflow is healthier or just moving faster.

Quality metrics:

  • Routing accuracy
  • Severity calibration
  • Escalation precision/recall
  • Unsafe draft rate

Flow metrics:

  • Time to first triage decision
  • Reroute rate
  • Reopen rate
  • Workstream aging for priority items

I have made this mistake myself: celebrate faster first touches, then discover reroutes quietly doubled. That is not a win. That is debt with nicer charts.
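Severity calibration in particular is worth making mechanical. A minimal sketch: bucket predictions by stated confidence and compare each bucket's actual accuracy to its mean confidence. The bucket count and function names are assumptions, not a standard library API.

```python
def calibration_report(predictions, n_buckets=5):
    """predictions: list of (confidence: float, correct: bool) pairs.

    Returns one entry per confidence bucket with the gap between what the
    system claimed (mean confidence) and what it delivered (accuracy).
    """
    buckets = [[] for _ in range(n_buckets)]
    for conf, correct in predictions:
        idx = min(int(conf * n_buckets), n_buckets - 1)  # clamp conf == 1.0
        buckets[idx].append((conf, correct))
    report = []
    for bucket in buckets:
        if not bucket:
            report.append(None)  # no samples in this confidence range
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        report.append({"mean_confidence": mean_conf, "accuracy": accuracy,
                       "gap": mean_conf - accuracy, "n": len(bucket)})
    return report
```

A large positive `gap` in the high-confidence buckets is exactly the overconfident-misroute problem: the system claims 0.9 and delivers 0.7, which is where auto-routing quietly creates reroute debt.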

Roll out in three evaluation tiers

Tier 1: Offline replay

Run historical items through the system and compare outputs to human-labeled outcomes.

Goal: prove baseline quality and identify obvious failure classes.

Tier 2: Shadow mode

Run live recommendations, but humans keep final decision authority.

Goal: measure calibration drift and operator trust under real-time pressure.

Tier 3: Guarded automation

Enable auto-routing only for high-confidence, low-risk lanes with explicit fallback paths.

Goal: earn automation with evidence, not optimism.

The key move here is setting exit criteria before rollout starts. Otherwise every pilot becomes a vibe check.
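Writing exit criteria down before the pilot can be as simple as a fixed dict and a gate function. The threshold values below are placeholders to illustrate the shape; real values come from your Tier 1 baseline.

```python
# Hypothetical exit criteria, frozen before rollout so the pilot
# cannot drift into a vibe check. Numbers are placeholders.
EXIT_CRITERIA = {
    "routing_accuracy": 0.92,    # minimum acceptable
    "reroute_rate": 0.05,        # maximum acceptable
    "unsafe_draft_rate": 0.001,  # maximum acceptable
}

def ready_to_promote(metrics: dict) -> bool:
    """True only if every criterion is met; any miss keeps the current tier."""
    return (
        metrics["routing_accuracy"] >= EXIT_CRITERIA["routing_accuracy"]
        and metrics["reroute_rate"] <= EXIT_CRITERIA["reroute_rate"]
        and metrics["unsafe_draft_rate"] <= EXIT_CRITERIA["unsafe_draft_rate"]
    )
```

The design choice that matters is the AND: a pilot that nails accuracy but misses the unsafe-draft bar stays in shadow mode, no exceptions.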

Guardrails that actually prevent bad outcomes

Guardrails are not anti-innovation. They are how you ship without gambling.

Use concrete policy and routing gates:

  • Per-category confidence thresholds
  • Mandatory human review for sensitive categories
  • Explicit abstain path
  • Audit logs for every triage decision
  • One-click rollback to human-only routing

Good guardrails keep progress reversible on a bad day.
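The first three gates above compose into one small routing function. A sketch, assuming invented category names and thresholds:

```python
# Per-category confidence thresholds; a threshold above 1.0 means
# the category can never auto-route. All values are illustrative.
THRESHOLDS = {"billing": 0.85, "bug": 0.80, "legal": 1.1}
SENSITIVE = {"legal", "trust_and_safety"}  # mandatory human review

def route(category: str, confidence: float) -> str:
    if category in SENSITIVE:
        return "human_review"       # sensitive lanes never bypass a person
    threshold = THRESHOLDS.get(category)
    if threshold is None or confidence < threshold:
        return "abstain"            # unknown category or low confidence: fail closed
    return "auto_route"
```

Note the default: a category the gate has never seen abstains rather than guesses. Logging each `route()` decision with its inputs gives you the audit trail, and swapping `THRESHOLDS` for an empty dict is the one-click rollback to human-only routing.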

Build an evaluation harness, not a flashy demo

You need a repeatable way to test triage behavior across model, prompt, and policy changes.

Minimum harness capabilities:

  • Replay historical items through the current triage contract
  • Compare predictions against labeled outcomes
  • Break down failures by category, severity, and destination
  • Track confidence calibration, not just top-line accuracy
  • Run regression checks before shipping configuration changes

If your harness cannot tell you exactly what got worse after a change, you are operating blind.

A practical implementation order:

  1. Start with deterministic baseline logic and a fixed eval dataset.
  2. Add model-backed structured outputs behind the same contract.
  3. Introduce retrieval context and re-run the full regression suite.
  4. Gate deployments on quality and safety thresholds, not intuition.

That sequence keeps iteration fast while protecting operations from avoidable regressions.
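The regression check in step 4 can start very small: replay the fixed eval set through the current and candidate configurations and report what each change fixed and broke, per category. A minimal sketch, where `triage_fn` stands in for any callable from intake text to predicted queue (all names here are assumptions):

```python
from collections import defaultdict

def regression_diff(cases, current_fn, candidate_fn):
    """cases: dicts with 'text', 'expected_queue', 'category'.

    Returns per-category counts of cases the candidate fixed vs broke
    relative to the current configuration.
    """
    deltas = defaultdict(lambda: {"fixed": 0, "broken": 0, "n": 0})
    for case in cases:
        was_right = current_fn(case["text"]) == case["expected_queue"]
        now_right = candidate_fn(case["text"]) == case["expected_queue"]
        delta = deltas[case["category"]]
        delta["n"] += 1
        if now_right and not was_right:
            delta["fixed"] += 1
        elif was_right and not now_right:
            delta["broken"] += 1
    return dict(deltas)
```

This is what makes "what got worse after a change" a query instead of a post-incident investigation: a nonzero `broken` count in a high-risk category is a shipping blocker regardless of the top-line improvement.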

Common failure patterns

  • Overconfident misroutes that look polished but are wrong
  • Prompt changes shipped without eval regression checks
  • Excess focus on draft tone, not routing correctness
  • No clear owner for false positives and false negatives

If no one owns error classes, error classes will own your workflow. They always do.

Automation readiness checklist

You are ready to automate only when you can answer yes to each of these:

  1. Can we explain every routing decision in plain language?
  2. Do we have calibrated confidence thresholds per category?
  3. Do we have a reliable abstain path?
  4. Are sensitive categories forced through human review?
  5. Do we track reroutes and reopens as first-class metrics?
  6. Can we run regression evals before model or prompt changes?
  7. Can we roll back to human-only triage quickly?
  8. Do operators trust the recommendations in shadow mode?

If these are not true yet, stay in evaluation mode.

Final take

Evaluate first. Automate second.

AI triage is a systems problem with policy, quality, and operational consequences. When you treat it that way, automation becomes a controlled expansion of proven behavior instead of a leap of faith.