
I tested 35 LLMs on fixing whisper-mangled KubeCon transcripts

Whisper turns KubeCon into Cukon and Gloo into Glue. I benchmarked 35 models to find which one fixes domain-specific ASR errors best, and how much it costs to correct an entire conference.


I’m processing KubeCon conference talks. Whisper handles the transcription, but it constantly mangles domain-specific terms: “KubeCon” becomes “Cukon”, “Gloo” becomes “Glue”, “SBOM” becomes “S-Bomb.” You need an LLM to fix these, but there are 40+ models and I had no idea which one to use.

So I benchmarked all of them.

What Whisper actually does to conference talks

I ran whisper base.en on 4 KubeCon NA 2024 keynotes (Apple M2, 24 seconds for a 10-minute talk, not bad). Here’s what came out:

Whisper said        Should be
Cukon               KubeCon
Glue (8 times)      Gloo
Kate’s gateway      K8s Gateway
S-Bombs             SBOMs
Gwok                GUAC
Locke 4J            Log4j
Kivarno             Kyverno
Tetre               Tetrate
home chart          Helm chart
hold our cells      hold ourselves

These aren’t edge cases. “Gloo” is a real CNCF project and Whisper turns it into the word “glue” every single time. “SBOM” becomes “S-Bomb” twelve times in one talk. The speaker says “GUAC” (a supply chain security tool) and Whisper writes “Gwok.”
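For reference, the transcription step itself is only a few lines with the openai-whisper package; the audio path below is a placeholder.

```python
import whisper

# base.en: the small English-only model that produced the errors above
model = whisper.load_model("base.en")

# "keynote.mp3" stands in for one downloaded talk recording
result = model.transcribe("keynote.mp3")
print(result["text"])
```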

The glossary trick

KubeCon EU 2026 publishes its schedule as a public iCal feed at kccnceu2026.sched.com/all.ics. 596 sessions, all the talk titles and abstracts, no auth needed. I scraped it and extracted 327 technical terms (CNCF projects, Kubernetes internals, speaker names, company names). When you feed this glossary to the LLM alongside the raw transcript, it knows “Gwok” should be “GUAC” because GUAC is in the term list.
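Roughly, the scrape looks like this. It assumes the requests and icalendar packages, and the term-extraction regex is a stand-in for whatever heuristic plus manual curation you prefer.

```python
import re
import requests
from icalendar import Calendar

# Public schedule feed for KubeCon EU 2026 (no auth needed)
ics = requests.get("https://kccnceu2026.sched.com/all.ics", timeout=30).content
cal = Calendar.from_ical(ics)

terms = set()
for event in cal.walk("VEVENT"):
    text = f"{event.get('summary', '')} {event.get('description', '')}"
    # Placeholder heuristic: keep tokens with two or more capital letters
    # ("KubeCon", "SBOM", "GUAC", "OpenTelemetry"); the real glossary was
    # curated down to 327 terms.
    terms.update(re.findall(r"\b[A-Z][A-Za-z0-9-]*[A-Z][A-Za-z0-9-]*\b", text))

glossary = sorted(terms)
print(len(glossary), "candidate terms")
```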

The glossary costs about 400 tokens of prompt overhead. Five cents total across 300 talks. 98% of technical talks have at least one glossary term in their abstract.
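The prompt itself is just the glossary prepended to the raw transcript; this template is illustrative rather than the exact wording I used.

```python
def build_prompt(glossary: list[str], transcript: str) -> str:
    # 327 comma-separated terms comes out to roughly 400 tokens of overhead
    return (
        "You are correcting an ASR transcript of a KubeCon talk. "
        "Fix misrecognized technical terms using the glossary below "
        "and change nothing else.\n\n"
        f"Glossary: {', '.join(glossary)}\n\n"
        f"Transcript:\n{transcript}"
    )
```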

The benchmark

4 talks, 34 manually-identified ASR errors, 35 models from 15 providers. Everything routed through OpenRouter with a 45-second timeout. All 4 talks fired in parallel per model.
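The harness is ordinary chat-completions calls against OpenRouter’s OpenAI-compatible endpoint. A condensed sketch, with the API key and model slug left as placeholders:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def correct(model: str, prompt: str) -> str | None:
    """Ask one model to correct one talk; None means it timed out."""
    try:
        r = requests.post(
            OPENROUTER_URL,
            headers=HEADERS,
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=45,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]
    except requests.exceptions.Timeout:
        return None

def run_model(model: str, prompts: list[str]) -> list[str | None]:
    # All 4 talks fire in parallel for each model
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(lambda p: correct(model, p), prompts))
```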

The talks:

  • Solo.io keynote on Kubernetes network security (10 min, 6 errors)
  • Envoy AI Gateway intro (5 min, 5 errors)
  • “Cloud Native’s Next Decade” with SBOMs, GUAC, Log4j, quantum crypto (15 min, 14 errors)
  • End user achievements keynote (12 min, 9 errors)

Talk 3 is the hard one. It has “S-Bomb” x12, “Gwok” x5, “salsa” for SLSA, “Kivarno” for Kyverno, “OPAR” for OPA, “in its scripts” for “init scripts.” If a model can’t handle that talk, it’s not going to work for a full conference.
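Scoring is a straight check against the hand-labelled errors. This is my reconstruction of it; the (wrong, right) pair format is an assumption.

```python
def score(corrected: str | None, errors: list[tuple[str, str]]) -> int:
    # errors: (what Whisper wrote, what the speaker said),
    # e.g. ("Gwok", "GUAC") or ("Kivarno", "Kyverno")
    if corrected is None:  # a timed-out talk scores zero
        return 0
    text = corrected.lower()
    return sum(
        1
        for wrong, right in errors
        if right.lower() in text and wrong.lower() not in text
    )
```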

Results

[Chart: Pareto frontier of cost vs. accuracy across 35 models]

Model                   Score   %     Cost for 300 talks   Latency   Timeouts
Claude Opus 4.6         32/34   94%   $71.27               30s       0
Claude 3.5 Haiku        30/34   88%   $7.56                38s       0
Claude Haiku 4.5        30/34   88%   $13.96               43s       0
Gemini 2.5 Flash Lite   28/34   82%   $1.08                7s        0
Gemini 2.5 Flash        28/34   82%   $5.86                12s       0
GPT-4.1 mini            26/34   76%   $4.25                44s       0
GPT-4o Mini             23/34   68%   $1.55                25s       0
GPT-4.1 nano            20/34   59%   $1.04                15s       0
Gemini 2.0 Flash        20/34   59%   $1.13                15s       0
Phi-4                   19/34   56%   $0.44                31s       0
Nova Lite               19/34   56%   $0.65                14s       0
Sonnet 4.6              19/34   56%   $27.20               45s       1
Nova 2 Lite             13/34   38%   $5.54                11s       0
Nova Micro              12/34   35%   $0.38                8s        0

[Chart: accuracy bars vs. projected cost dots for the reliable models]

Everything below Nova Micro had timeout issues and scored under 35%: DeepSeek, Grok, Kimi, Minimax, Llama 4, Mistral, NVIDIA Nemotron, Cohere, Liquid, IBM Granite, and the open-source Gemma models all timed out on at least one talk. Seven models timed out on all 4.

The full 35-model table is at the bottom of this post.

What I’d pick

Gemini 2.5 Flash Lite. $1.08 to correct 300 talks. 7-second latency. Zero timeouts. 82% accuracy. It nailed the hardest talk (14/14 on the SBOM/GUAC/Log4j/Kyverno mess).

Going from 82% to 88% means switching to Claude 3.5 Haiku at $7.56. Going from 88% to 94% means Opus at $71. The jump from Flash Lite to Haiku is 7x the cost for 2 extra fixes. Not worth it for batch processing.

But if I cared about those last 2 fixes, I’d improve the glossary and prompt, not throw money at a bigger model. The errors Flash Lite misses are contextual ones like “forced” → “forged” and “Audibility” → “observability.” A better prompt could fix those without changing the model.

Things I got wrong along the way

I ran a single-talk benchmark first and DeepSeek V3.2 scored 6/6. Llama 4 Scout scored 6/6. I almost picked one of them. Then I ran all 4 talks and DeepSeek dropped to 5/34 (timed out on 3 talks) and Llama dropped to 11/34 (0/9 on one talk). Single-talk benchmarks are worthless.

I also assumed premium models would be better. Claude Sonnet 4.6 ($3/1M input) scored 19/34. Microsoft Phi-4 ($0.065/1M input) also scored 19/34. Sonnet costs 46x more per token and got the exact same result. Gemini 2.5 Pro scored 5/34, worse than Amazon’s cheapest model. Price tells you nothing about suitability for a specific task.