Automatic speech recognition (ASR) software is ubiquitous, from Siri in the palm of your hand to Alexa in your home to productivity solutions like aiOla in your business. The one thing every ASR user agrees on is the need for accuracy. One way to measure accuracy is through the word error rate (WER).
How is the word error rate used? Is it the only good measurement to consider when selecting your next speech recognition technology? We’ll cover all this and more in this article.
What is Word Error Rate (WER)?
An ASR system listens to human speech and transcribes it into text. The word error rate (WER) measures how many errors appear in the transcript an ASR system delivers, compared against a reference transcription produced by a human.
The lower the word error rate, the better. It is considered the primary metric for measuring the accuracy of speech recognition.
How to Calculate Word Error Rate (WER)
The word error rate formula is straightforward:
WER = ((S + D + I) / N) x 100
S = the number of substitutions, where one word is replaced with another (e.g. the word said is “cat,” but the transcription reads “bat”)
D = the number of deletions, where a word is left out of the transcript (e.g. the phrase said is “just do it,” but the transcription reads “do it”)
I = the number of insertions, where a word is added that wasn’t said (e.g. the word said is “landing,” but the transcription reads “land in”)
N = the number of words in the reference (human) transcript
To derive accuracy, subtract the word error rate from 100. For example, if the WER is 20, then 100 - 20 = 80, meaning that transcription’s accuracy is 80%.
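To make the math concrete, here is a minimal Python sketch of a WER calculator. The standard way to obtain the S, D, and I counts is a word-level Levenshtein (edit-distance) alignment between the reference and the hypothesis; the function name and the simple whitespace tokenization here are illustrative choices, not any particular library’s API.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, in percent, via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits (S + D + I) needed to turn the first i
    # reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref) * 100

# One deletion ("just") out of three reference words:
print(wer("just do it", "do it"))        # 33.33...
print(100 - wer("just do it", "do it"))  # accuracy: 66.66...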
Word Error Rate and Speech Recognition
The word error rate is taken into consideration by researchers when designing and testing speech recognition technology. Similarly, when customers are choosing the best solution for their needs, especially in business settings, they tend to ask about the WER because it helps them understand the tool’s accuracy.
At this point, you may be wondering, “What can impact word error rate?” Several factors can affect it, including:
Accents
Everyone has an accent. One person may say “tomato” while another says “to-mah-to.” As the idiom goes, a difference that small isn’t worth arguing over, but when it comes to speech recognition, accents can have a big impact on understanding. Algorithms have to be trained and use context to decipher the right word, especially when accents come into play.
Background Noise
Another inhibiting factor for total accuracy is background noise while people are speaking. Think about it in your own daily life: if there’s a lot of noise when you’re talking to someone on the phone, it can be hard to understand what they’re saying. Automatic speech recognition tools face the same challenge.
Crosstalk
Additionally, have you ever been in a conversation where two people start talking at the same time? How do you, or an ASR system, know which voice to prioritize? The truth is, you often can’t, and neither can the technology. Some tools will simply ignore one voice entirely, which drives up the WER through missed (deleted) words.
Industry-Specific Jargon
Last but not least, when businesses leverage speech recognition technology, a major accuracy hurdle is industry-specific jargon and vocabulary. Every industry operates with its own set of words and acronyms. For the most part, if an ASR tool isn’t specifically trained on these terms, it will not understand what’s being said. The exception to this rule is aiOla. aiOla is a first-of-its-kind speech recognition software that knows business-specific jargon, in any accent, language, and acoustic environment, without the need to be retrained.
The word error rate is a useful metric for researchers and developers optimizing ASR solutions, as it identifies where room for improvement exists. This gives them a benchmark against which to refine the model’s overall performance and training.
Issues with Word Error Rate
As helpful as the word error rate is for developers and consumers, it isn’t without its limitations. This is mostly because WER treats every word as equal, regardless of its meaning within the sentence.
The word error rate does not account for:
Context
A minor substitution may not impact the overall meaning of a sentence, depending on the context. For a human, this is simple to judge. But the calculation counts the word as “wrong” either way, so even a substitution that leaves the meaning intact still inflates the WER.
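For instance, with the wer() sketch from the calculation section (a hypothetical example, assuming that function is in scope), swapping “very” for “really” is penalized exactly like any other substitution, even though a reader would treat the two transcripts as equivalent:

# One harmless substitution out of five reference words still costs 20% WER:
print(wer("the cat is very big", "the cat is really big"))  # 20.0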
Error Type
All errors within the WER calculation are equally weighted: a substitution, a deletion, and an insertion each carry the same cost in the equation, even though in the context of a sentence one may matter far more than another.
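Again reusing the wer() sketch, dropping a meaning-flipping word scores exactly the same as dropping a throwaway one:

# Each hypothesis drops one word from a five-word reference, so both
# score 20% WER, but only the first deletion reverses the meaning:
print(wer("do not press the button", "do press the button"))  # 20.0
print(wer("now do press the button", "do press the button"))  # 20.0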
Homophones
The word error rate will also penalize homophones. Even when the system hears the word correctly, it may write the homophone with a different spelling and meaning, and the metric counts that as a full substitution.
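With the same wer() sketch, two homophone confusions in a five-word sentence produce a 40% WER even though the two renderings sound identical:

# "they're" -> "there" and "their" -> "there": 2 substitutions / 5 words
print(wer("they're going to their house",
          "there going to there house"))  # 40.0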
Punctuation
If I were to tell you “I have $10,000” or “I have 10,000 dollars,” would it mean something different? No. But to a word error rate calculation, it does.
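A quick check with the wer() sketch shows how formatting alone can dominate the score, which is why evaluation pipelines typically normalize both texts before comparing them. The normalize() helper below is a deliberately toy rule for this example; real pipelines apply much richer normalization (casing, punctuation, number expansion) ahead of scoring:

def normalize(text: str) -> str:
    # Toy rule: lowercase and drop the currency symbol.
    return text.lower().replace("$", "")

print(wer("I have $10,000", "i have 10,000 dollars"))
# 100.0: two substitutions plus one insertion against three reference words

print(wer(normalize("I have $10,000"), normalize("i have 10,000 dollars")))
# 33.33...: only the "dollars" insertion remains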
Closing Words
As you can see, the word error rate is a useful metric, but it is not the be-all and end-all for the success of an automatic speech recognition system. While accuracy is of the utmost concern, the word error rate shouldn’t be the only consideration.
For businesses selecting speech recognition tools, it’s also worth assessing the technology’s speed, scalability, cost, and support. The word error rate is a highly useful starting point and can help narrow down your list of viable options, but it shouldn’t be weighed in a vacuum.