Field Notes.

What 5K Support Tickets Taught Me About Fine-Tuning

And where prompting stops working and fine-tuning actually earns its keep

Cover Image for What 5K Support Tickets Taught Me About Fine-Tuning
Roger Rodriguez
Roger Rodriguez

Listen to this post

AI-generated narration of What 5K Support Tickets Taught Me About Fine-Tuning.

At first glance, this is about support tickets.

It's not.

It's about a class of problems that quietly show up everywhere in enterprise systems:

  • repetitive decisions
  • messy inputs
  • and a need for clean, consistent outputs

Support tickets just happen to be one of the clearest places to see it.

And once you see the pattern, you start noticing it everywhere.

The Pattern Most Teams Miss

Across teams and systems, these problems start to look the same:

  • classifying inbound requests
  • routing work to the right team
  • tagging content
  • prioritizing issues
  • enforcing structured outputs (usually JSON)

Different surface areas. Same shape.

You're not asking the model to be creative. You’re asking it to make the same decision, over and over again, without drifting. That's a very different problem than most people think they're solving.

The trap most teams fall into

When people first discover fine-tuning, it's easy to think:

"This is how we make the model smarter."

So they reach for it early. Too early.

In reality, prompting, maybe with a few examples, gets you surprisingly far, far enough that fine-tuning often feels unnecessary.

Until it doesn't.

There's a point where your prompt starts to grow, then grow again, then quietly turns into something fragile. That's usually the signal.

A concrete example: what this looks like in support tickets

Imagine thousands of tickets coming in:

  • "Open rate looks normal, but delivered volume is far below expected"
  • "My campaign did not send to everyone"
  • "Login works on desktop but not mobile"

Now you need to decide:

  • category
  • priority
  • escalation
  • owning team

Seems straightforward.

It's not.

Where prompting starts to break

The first version looks simple:

Classify this ticket into one of the following categories...

Then you add examples. Then more. Then edge cases.

Before long, your prompt starts to look less like a prompt and more like a knowledge base.

And still:

  • outputs drift
  • categories blur
  • formatting breaks
  • edge cases pile up

It works, but not reliably enough to trust. So you start patching it.

You compensate:

  • longer prompts
  • more examples
  • more guardrails

At some point, you're not building a system anymore. You're babysitting one.

The shift: from capability to consistency

This is where the mental model changes.

  • Prompting gives you capability
  • Fine-tuning gives you consistency at scale

Fine-tuning isn't about making the model smarter.

It's about removing variability.

You're taking a task the model can already do and making it behave the same way every time.

The dataset

This is the real work.

If you have been running a system like this, you already have the data. That is the part most people miss.

A simple example:

{
  "ticket_text": "Customer was charged twice but only received one confirmation email",
  "category": "Billing Issue",
  "priority": "High",
  "escalation": true
}

That's it. Input to output. No clever prompting required.

The quality of your fine-tuned model is almost entirely determined by how consistent this data is.

Why fine-tuning wins here

Once trained, the behavior shifts in a few important ways:

  • Consistency: same input, same output structure, every time
  • Lower latency: no massive prompts means faster responses
  • Lower cost: fewer tokens per request
  • Simpler system design: no giant prompt to maintain or debug

You stop fighting the model. It becomes predictable. It starts behaving like a component in your system.

This is not about tickets

This is where people either get it or miss it entirely.

Support tickets are just one example of a broader category:

  • classifying leads
  • tagging product feedback
  • routing internal requests
  • labeling logs or errors
  • deciding which variant to serve in a personalization system

Different domains. Same underlying problem.

Same pattern:

  • high volume
  • repeatable decisions
  • structured outputs

That's where fine-tuning earns its keep.

When you should not fine-tune

Most use cases don't need fine-tuning.

Do not fine-tune if:

  • the task is low volume
  • your categories are unclear or constantly changing
  • you don't have clean, consistent examples
  • you are still figuring out the problem

In those cases, prompting is the better tool.

The bigger shift

What's happening here is subtle.

We're moving from:

"How do I prompt the model to do this?"

To:

"How do I make this behavior reliable inside a system?"

That's a different mindset. Less about clever prompts. More about designing repeatable outcomes.

Fine-tuning is not magic. It is just pattern recognition over examples you already have.

But when you apply it to the right class of problems, the ones that show up thousands of times a day, it stops feeling like a demo...

...and starts behaving like infrastructure.