
I tested 35 LLMs on fixing whisper-mangled KubeCon transcripts

Whisper turns KubeCon into Cukon and Gloo into Glue. I benchmarked 35 models to find which one fixes domain-specific ASR errors best, and how much it costs to correct an entire conference.


I’m processing KubeCon conference talks. Whisper handles the transcription, but it constantly mangles domain-specific terms: “KubeCon” becomes “Cukon”, “Gloo” becomes “Glue”, “SBOM” becomes “S-Bomb.” You need an LLM to fix these, but there are 40+ models and I had no idea which one to use.

So I benchmarked all of them.

What Whisper actually does to conference talks

I ran whisper base.en on 4 KubeCon NA 2024 keynotes (Apple M2, 24 seconds for a 10-minute talk, not bad). Here’s what came out:

Whisper said        Should be
Cukon               KubeCon
Glue (8 times)      Gloo
Kate’s gateway      K8s Gateway
S-Bombs             SBOMs
Gwok                GUAC
Locke 4J            Log4j
Kivarno             Kyverno
Tetre               Tetrate
home chart          Helm chart
hold our cells      hold ourselves

These aren’t edge cases. “Gloo” is a real CNCF project and Whisper turns it into the word “glue” every single time. “SBOM” becomes “S-Bomb” twelve times in one talk. The speaker says “GUAC” (a supply chain security tool) and Whisper writes “Gwok.”
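For reference, the transcription step itself is only a few lines with the openai-whisper package; the audio path below is a placeholder.

```python
import whisper

# base.en: the small English-only model that produced the errors above
model = whisper.load_model("base.en")

# "keynote.mp3" stands in for one downloaded talk recording
result = model.transcribe("keynote.mp3")
print(result["text"])
```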

The glossary trick

KubeCon EU 2026 publishes its schedule as a public iCal feed at kccnceu2026.sched.com/all.ics. 596 sessions, all the talk titles and abstracts, no auth needed. I scraped it and extracted 327 technical terms (CNCF projects, Kubernetes internals, speaker names, company names). When you feed this glossary to the LLM alongside the raw transcript, it knows “Gwok” should be “GUAC” because GUAC is in the term list.
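Roughly, the scrape looks like this. It assumes the requests and icalendar packages, and the term-extraction regex is a stand-in for whatever heuristic plus manual curation you prefer.

```python
import re
import requests
from icalendar import Calendar

# Public schedule feed for KubeCon EU 2026 (no auth needed)
ics = requests.get("https://kccnceu2026.sched.com/all.ics", timeout=30).content
cal = Calendar.from_ical(ics)

terms = set()
for event in cal.walk("VEVENT"):
    text = f"{event.get('summary', '')} {event.get('description', '')}"
    # Placeholder heuristic: keep tokens with two or more capital letters
    # ("KubeCon", "SBOM", "GUAC", "OpenTelemetry"); the real glossary was
    # curated down to 327 terms.
    terms.update(re.findall(r"\b[A-Z][A-Za-z0-9-]*[A-Z][A-Za-z0-9-]*\b", text))

glossary = sorted(terms)
print(len(glossary), "candidate terms")
```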

The glossary costs about 400 tokens of prompt overhead. Five cents total across 300 talks. 98% of technical talks have at least one glossary term in their abstract.
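The prompt itself is just the glossary prepended to the raw transcript; this template is illustrative rather than the exact wording I used.

```python
def build_prompt(glossary: list[str], transcript: str) -> str:
    # 327 comma-separated terms comes out to roughly 400 tokens of overhead
    return (
        "You are correcting an ASR transcript of a KubeCon talk. "
        "Fix misrecognized technical terms using the glossary below "
        "and change nothing else.\n\n"
        f"Glossary: {', '.join(glossary)}\n\n"
        f"Transcript:\n{transcript}"
    )
```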

The benchmark

4 talks, 34 manually-identified ASR errors, 35 models from 15 providers. Everything routed through OpenRouter with a 45-second timeout. All 4 talks fired in parallel per model.
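The harness is ordinary chat-completions calls against OpenRouter’s OpenAI-compatible endpoint. A condensed sketch, with the API key and model slug left as placeholders:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def correct(model: str, prompt: str) -> str | None:
    """Ask one model to correct one talk; None means it timed out."""
    try:
        r = requests.post(
            OPENROUTER_URL,
            headers=HEADERS,
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=45,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]
    except requests.exceptions.Timeout:
        return None

def run_model(model: str, prompts: list[str]) -> list[str | None]:
    # All 4 talks fire in parallel for each model
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(lambda p: correct(model, p), prompts))
```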

The talks:

  • Solo.io keynote on Kubernetes network security (10 min, 6 errors)
  • Envoy AI Gateway intro (5 min, 5 errors)
  • “Cloud Native’s Next Decade” with SBOMs, GUAC, Log4j, quantum crypto (15 min, 14 errors)
  • End user achievements keynote (12 min, 9 errors)

Talk 3 is the hard one. It has “S-Bomb” x12, “Gwok” x5, “salsa” for SLSA, “Kivarno” for Kyverno, “OPAR” for OPA, “in its scripts” for “init scripts.” If a model can’t handle that talk, it’s not going to work for a full conference.
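Scoring is a straight check against the hand-labelled errors. This is my reconstruction of it; the (wrong, right) pair format is an assumption.

```python
def score(corrected: str | None, errors: list[tuple[str, str]]) -> int:
    # errors: (what Whisper wrote, what the speaker said),
    # e.g. ("Gwok", "GUAC") or ("Kivarno", "Kyverno")
    if corrected is None:  # a timed-out talk scores zero
        return 0
    text = corrected.lower()
    return sum(
        1
        for wrong, right in errors
        if right.lower() in text and wrong.lower() not in text
    )
```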

Results

[Chart: Pareto frontier of cost vs. accuracy across 35 models]

Model                   Score   %     Cost for 300 talks   Latency   Timeouts
Claude Opus 4.6         32/34   94%   $71.27               30s       0
Claude 3.5 Haiku        30/34   88%   $7.56                38s       0
Claude Haiku 4.5        30/34   88%   $13.96               43s       0
Gemini 2.5 Flash Lite   28/34   82%   $1.08                7s        0
Gemini 2.5 Flash        28/34   82%   $5.86                12s       0
GPT-4.1 mini            26/34   76%   $4.25                44s       0
GPT-4o Mini             23/34   68%   $1.55                25s       0
GPT-4.1 nano            20/34   59%   $1.04                15s       0
Gemini 2.0 Flash        20/34   59%   $1.13                15s       0
Phi-4                   19/34   56%   $0.44                31s       0
Nova Lite               19/34   56%   $0.65                14s       0
Sonnet 4.6              19/34   56%   $27.20               45s       1
Nova 2 Lite             13/34   38%   $5.54                11s       0
Nova Micro              12/34   35%   $0.38                8s        0

[Chart: accuracy bars vs. projected cost dots for the reliable models]

Everything below Nova Micro had timeout issues and scored under 35%: DeepSeek, Grok, Kimi, Minimax, Llama 4, Mistral, NVIDIA Nemotron, Cohere, Liquid, IBM Granite, and the open-source Gemma models all timed out on at least one talk. Seven models timed out on all 4.

The full 35-model table is at the bottom of this post.

What I’d pick

Gemini 2.5 Flash Lite. $1.08 to correct 300 talks. 7-second latency. Zero timeouts. 82% accuracy. It nailed the hardest talk (14/14 on the SBOM/GUAC/Log4j/Kyverno mess).

Going from 82% to 88% means switching to Claude 3.5 Haiku at $7.56. Going from 88% to 94% means Opus at $71. The jump from Flash Lite to Haiku is 7x the cost for 2 extra fixes. Not worth it for batch processing.

But if I cared about those last 2 fixes, I’d improve the glossary and prompt, not throw money at a bigger model. The errors Flash Lite misses are contextual ones like “forced” → “forged” and “Audibility” → “observability.” A better prompt could fix those without changing the model.

Things I got wrong along the way

I ran a single-talk benchmark first and DeepSeek V3.2 scored 6/6. Llama 4 Scout scored 6/6. I almost picked one of them. Then I ran all 4 talks and DeepSeek dropped to 5/34 (timed out on 3 talks) and Llama dropped to 11/34 (0/9 on one talk). Single-talk benchmarks are worthless.

I also assumed premium models would be better. Claude Sonnet 4.6 ($3/1M input) scored 19/34. Microsoft Phi-4 ($0.065/1M input) also scored 19/34. Sonnet costs 46x more per token and got the exact same result. Gemini 2.5 Pro scored 5/34, worse than Amazon’s cheapest model. Price tells you nothing about suitability for a specific task.