Word Error Rate (WER)
A standard metric for evaluating ASR accuracy by measuring the percentage of words incorrectly transcribed through substitutions, insertions, and deletions.
Word Error Rate, or WER, is the most widely used metric for evaluating how accurate a speech recognition system is. It gives you a single number: the count of word-level errors the system made, expressed as a percentage of the number of words in a human-verified reference transcript.
How WER is calculated
WER accounts for three types of errors: substitutions (the system wrote "cat" when the speaker said "car"), deletions (a word was spoken but the system missed it entirely), and insertions (the system added a word that was never spoken). The formula is straightforward:
WER = (Substitutions + Deletions + Insertions) / Total Words in Reference
A WER of zero percent means the transcript matches the reference exactly. A WER of ten percent means the system made roughly one error for every ten words in the reference. Because insertions count as errors while the denominator counts only reference words, WER can exceed one hundred percent if the system inserts many extra words.
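The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product's scoring pipeline: it assumes words are separated by whitespace, applies no normalisation for case or punctuation, and uses a hypothetical function name, wer. The combined count of substitutions, deletions, and insertions is taken as the minimum word-level edit distance between the reference and the system output.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction (multiply by 100 for a percentage)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words processed so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            )
        prev = curr

    # Total errors divided by the number of words in the reference.
    return prev[-1] / len(ref)


print(wer("the car is red", "the cat is red"))   # 0.25 (one substitution in four words)
print(wer("hello", "hello there my friend"))     # 3.0 (three insertions, one reference word)
```

The second call shows how WER can exceed one hundred percent: three inserted words are measured against a reference that contains only one word.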
What counts as "good" WER
For well-resourced languages like English, leading ASR systems achieve WERs between three and five percent on clean audio, approaching human-level performance. For African languages, WERs tend to be higher because there is less training data, greater acoustic diversity, and frequent code-switching. A WER of fifteen to twenty percent for a lower-resourced language may therefore represent strong performance given those constraints.
Limitations of WER
WER treats all errors equally, but not all errors are equal in practice. Misrecognising a person's name is far more consequential than dropping a filler word like "um." WER also does not account for whether the overall meaning of a sentence was preserved. For these reasons, WER is best used as one indicator among several, not as the sole measure of transcription quality.
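To make the first limitation concrete, the sketch below scores two imagined transcripts of the same sentence. The sentence, the name in it, and the helper wer_short are all invented for illustration, and the recursive edit-distance helper is only suitable for short sentences. One transcript drops the filler word and the other gets the person's name wrong, yet both receive the same WER.

```python
from functools import lru_cache


def wer_short(reference: str, hypothesis: str) -> float:
    """WER for short sentences, using a memoised recursive edit distance."""
    ref = tuple(reference.split())
    hyp = tuple(hypothesis.split())

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        # Edits needed to turn the first i reference words
        # into the first j hypothesis words.
        if i == 0:
            return j  # only insertions remain
        if j == 0:
            return i  # only deletions remain
        return min(
            dist(i - 1, j) + 1,                              # deletion
            dist(i, j - 1) + 1,                              # insertion
            dist(i - 1, j - 1) + (ref[i - 1] != hyp[j - 1]),  # match or substitution
        )

    return dist(len(ref), len(hyp)) / len(ref)


reference = "um please call doctor amina tomorrow"
dropped_filler = "please call doctor amina tomorrow"    # harmless omission
wrong_name = "um please call doctor amira tomorrow"     # consequential error

score_filler = wer_short(reference, dropped_filler)
score_name = wer_short(reference, wrong_name)
print(score_filler == score_name)  # True: one error in six words either way
```

A reader would barely notice the first transcript's error and could be misled by the second, which is why a WER figure is usually paired with human review.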
AuTrans uses WER alongside human review benchmarks to continuously improve transcription accuracy across every supported language.
Related
Accent Adaptation
The ability of a speech recognition system to adjust its models to accurately recognise speech from speakers with diverse regional or linguistic accents.
AI Summarization
The use of artificial intelligence to automatically generate concise summaries from longer texts, such as full transcripts of audio recordings.
ASR (Automatic Speech Recognition)
Technology that converts spoken language into written text using machine learning models trained on audio and language data.