I tested 35 LLMs on fixing whisper-mangled KubeCon transcripts
Whisper turns KubeCon into Cukon and Gloo into Glue. I benchmarked 35 models to find which one fixes domain-specific ASR errors best, and how much it costs to correct an entire conference.
I’m processing KubeCon conference talks. Whisper does the transcription, but it mangles domain-specific terms constantly: “KubeCon” becomes “Cukon”, “Gloo” becomes “Glue”, “SBOM” becomes “S-Bomb.” You need an LLM to fix these errors, but there are 40+ models and I had no idea which one to use.
So I benchmarked all of them.
What whisper actually does to conference talks
I ran whisper base.en on 4 KubeCon NA 2024 keynotes (Apple M2, 24 seconds for a 10-minute talk, not bad). Here’s what came out:
| Whisper said | Should be |
|---|---|
| Cukon | KubeCon |
| Glue (8 times) | Gloo |
| Kate’s gateway | K8s Gateway |
| S-Bombs | SBOMs |
| Gwok | GUAC |
| Locke 4J | Log4j |
| Kivarno | Kyverno |
| Tetre | Tetrate |
| home chart | Helm chart |
| hold our cells | hold ourselves |
These aren’t edge cases. “Gloo” is a real CNCF project and whisper turns it into the word “glue” every single time. “SBOM” becomes “S-Bomb” twelve times in one talk. The speaker says “GUAC” (a supply chain security tool) and whisper writes “Gwok.”
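Why an LLM rather than a find-and-replace pass? Because many of the mangled forms (“glue”, “salsa”) are real English words, so blind substitution corrupts sentences that were already fine. A minimal sketch (the `FIXES` table is illustrative, not my actual pipeline):

```python
# Naive substitution table mapping whisper's mangled forms to the real terms.
FIXES = {"Cukon": "KubeCon", "Glue": "Gloo", "S-Bombs": "SBOMs"}

def naive_fix(text: str) -> str:
    # Blindly replace every occurrence, with no awareness of context.
    for wrong, right in FIXES.items():
        text = text.replace(wrong, right)
    return text

ok = naive_fix("Cukon keynote: Glue handles the mesh.")   # intended fixes
bad = naive_fix("Glue the sticker to your laptop.")       # false positive
```

The second call shows the failure mode: a sentence that legitimately used the word “glue” gets rewritten to “Gloo.” An LLM with the glossary in context can tell the two apart.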
The glossary trick
KubeCon EU 2026 publishes its schedule as a public iCal feed at kccnceu2026.sched.com/all.ics. 596 sessions, all the talk titles and abstracts, no auth needed. I scraped it and extracted 327 technical terms (CNCF projects, Kubernetes internals, speaker names, company names). When you feed this glossary to the LLM alongside the raw transcript, it knows “Gwok” should be “GUAC” because GUAC is in the term list.
The glossary costs about 400 tokens of prompt overhead. Five cents total across 300 talks. 98% of technical talks have at least one glossary term in their abstract.
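The extraction itself can be done with the stdlib. A sketch, assuming the feed from kccnceu2026.sched.com/all.ics has been downloaded into `ics_text` (here a tiny inline sample); the capitalization heuristic and `STOPWORDS` set are illustrative, and the real 327-term list needs more curation than this:

```python
import re

# Inline stand-in for the downloaded iCal feed. Real VEVENT blocks escape
# commas as "\," per the iCalendar format.
ics_text = (
    "BEGIN:VEVENT\n"
    "SUMMARY:Supply Chain Security With GUAC and SLSA\n"
    "DESCRIPTION:How GUAC ingests SBOMs\\, plus Kyverno admission policies.\n"
    "END:VEVENT\n"
)

# Common English words that pass the capitalization heuristic but aren't terms.
STOPWORDS = {"How", "With", "The", "And", "Supply", "Chain", "Security"}

def extract_terms(text: str) -> set[str]:
    terms = set()
    for line in text.splitlines():
        if line.startswith(("SUMMARY:", "DESCRIPTION:")):
            body = line.split(":", 1)[1].replace("\\,", ",")
            # Capitalized words and acronyms are term candidates.
            for tok in re.findall(r"\b[A-Z][A-Za-z0-9]+\b", body):
                if tok not in STOPWORDS:
                    terms.add(tok)
    return terms

glossary = extract_terms(ics_text)
```

On this sample it pulls out `GUAC`, `SLSA`, `SBOMs`, and `Kyverno` while dropping title-case filler.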
The benchmark
4 talks, 34 manually identified ASR errors, 35 models from 15 providers. Everything routed through OpenRouter with a 45-second timeout. All 4 talks fired in parallel per model.
The talks:
- Solo.io keynote on Kubernetes network security (10 min, 6 errors)
- Envoy AI Gateway intro (5 min, 5 errors)
- “Cloud Native’s Next Decade” with SBOMs, GUAC, Log4j, quantum crypto (15 min, 14 errors)
- End user achievements keynote (12 min, 9 errors)
Talk 3 is the hard one. It has “S-Bomb” x12, “Gwok” x5, “salsa” for SLSA, “Kivarno” for Kyverno, “OPAR” for OPA, “in its scripts” for “init scripts.” If a model can’t handle that talk, it’s not going to work for a full conference.
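The harness shape is simple: per model, fire all talks in parallel, enforce the 45-second timeout, and count which expected corrections appear in the output. A sketch with `correct_transcript` stubbed out (the real version POSTs to OpenRouter's chat completions endpoint with the glossary in the prompt; the scoring-by-substring approach is my assumption about a reasonable check, not a spec):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout

TIMEOUT_S = 45

def correct_transcript(model: str, raw: str) -> str:
    # Stub for illustration. A real version would call OpenRouter with
    # {"model": model, "messages": [...]} and the glossary in the system prompt.
    return raw.replace("Gwok", "GUAC").replace("S-Bombs", "SBOMs")

def score_talk(model: str, raw: str, expected_fixes: list[str]) -> int:
    # One point per expected correction that shows up in the model's output.
    corrected = correct_transcript(model, raw)
    return sum(1 for fix in expected_fixes if fix in corrected)

def benchmark(model: str, talks) -> int:
    # All talks for one model run in parallel; a timed-out talk scores 0.
    with ThreadPoolExecutor(max_workers=len(talks)) as pool:
        futures = [pool.submit(score_talk, model, raw, fixes)
                   for raw, fixes in talks]
        total = 0
        for f in futures:
            try:
                total += f.result(timeout=TIMEOUT_S)
            except FutTimeout:
                pass  # count nothing for a talk that missed the deadline
        return total

talks = [("The speaker demoed Gwok ingesting S-Bombs.", ["GUAC", "SBOMs"])]
score = benchmark("google/gemini-2.5-flash-lite", talks)
```

Scoring timeouts as zero rather than retrying is what makes the reliability column meaningful: a model that's accurate but flaky still loses.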
Results
| Model | Score | % | Cost for 300 talks | Latency | Timeouts |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 32/34 | 94% | $71.27 | 30s | 0 |
| Claude 3.5 Haiku | 30/34 | 88% | $7.56 | 38s | 0 |
| Claude Haiku 4.5 | 30/34 | 88% | $13.96 | 43s | 0 |
| Gemini 2.5 Flash Lite | 28/34 | 82% | $1.08 | 7s | 0 |
| Gemini 2.5 Flash | 28/34 | 82% | $5.86 | 12s | 0 |
| GPT-4.1 mini | 26/34 | 76% | $4.25 | 44s | 0 |
| GPT-4o Mini | 23/34 | 68% | $1.55 | 25s | 0 |
| GPT-4.1 nano | 20/34 | 59% | $1.04 | 15s | 0 |
| Gemini 2.0 Flash | 20/34 | 59% | $1.13 | 15s | 0 |
| Phi-4 | 19/34 | 56% | $0.44 | 31s | 0 |
| Nova Lite | 19/34 | 56% | $0.65 | 14s | 0 |
| Claude Sonnet 4.6 | 19/34 | 56% | $27.20 | 45s | 1 |
| Nova 2 Lite | 13/34 | 38% | $5.54 | 11s | 0 |
| Nova Micro | 12/34 | 35% | $0.38 | 8s | 0 |
Everything below Nova Micro had timeout issues and scored under 35%: DeepSeek, Grok, Kimi, Minimax, Llama 4, Mistral, NVIDIA Nemotron, Cohere, Liquid, IBM Granite, and the open-source Gemma models all timed out on at least 1 talk. Seven models timed out on all 4.
The full 35-model table is at the bottom of this post.
What I’d pick
Gemini 2.5 Flash Lite. $1.08 to correct 300 talks. 7-second latency. Zero timeouts. 82% accuracy. It nailed the hardest talk (14/14 on the SBOM/GUAC/Log4j/Kyverno mess).
Going from 82% to 88% means switching to Claude 3.5 Haiku at $7.56. Going from 88% to 94% means Opus at $71. The jump from Flash Lite to Haiku is 7x the cost for 2 extra fixes. Not worth it for batch processing.
But if I cared about those last 2 fixes, I’d improve the glossary and prompt, not throw money at a bigger model. The errors Flash Lite misses are contextual ones like “forced” → “forged” and “Audibility” → “observability.” A better prompt could fix those without changing the model.
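What "a better prompt" might look like: besides listing the glossary, give the model a couple of contextual few-shot corrections. The wording below is my sketch, not the prompt used in the benchmark:

```python
# Hypothetical prompt builder: glossary plus contextual examples for the
# error class Flash Lite misses. All wording here is illustrative.
def build_prompt(glossary: set[str], transcript: str) -> str:
    terms = ", ".join(sorted(glossary))
    return (
        "Correct ASR errors in this cloud-native conference transcript.\n"
        f"Known terms the recognizer often mangles: {terms}.\n"
        "Also fix context-dependent errors, e.g. 'forced' -> 'forged' when "
        "the topic is tampered artifacts, 'Audibility' -> 'observability'.\n"
        "Change nothing else.\n\n"
        f"Transcript:\n{transcript}"
    )

prompt = build_prompt({"GUAC", "SBOM", "Kyverno"}, "We ran Gwok on the S-Bombs.")
```

The contextual examples cost a few dozen extra tokens per talk, which is cheap insurance compared to a 7x model-price jump.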
Things I got wrong along the way
I ran a single-talk benchmark first and DeepSeek V3.2 scored 6/6. Llama 4 Scout scored 6/6. I almost picked one of them. Then I ran all 4 talks and DeepSeek dropped to 5/34 (timed out on 3 talks) and Llama dropped to 11/34 (0/9 on one talk). Single-talk benchmarks are worthless.
I also assumed premium models would be better. Claude Sonnet 4.6 ($3/1M input) scored 19/34. Microsoft Phi-4 ($0.065/1M input) also scored 19/34. Sonnet costs 46x more per token and got the exact same result. Gemini 2.5 Pro scored 5/34, worse than Amazon’s cheapest model. Price tells you nothing about suitability for a specific task.