Electronic devices that can carry on a conversation with humans are no longer the stuff of science fiction. These days, no one is particularly fazed when someone pulls out their iPhone and asks Siri to look up nearby restaurants, or when a customer service recording asks callers to give voice commands to get to the correct part of the directory. However, this is still relatively new technology, and it’s worth learning how it has evolved to allow devices to talk to people.
What Is Automatic Speech Recognition?
The technology that allows devices to understand spoken language is referred to as automatic speech recognition (ASR). ASR involves a computer converting spoken language into text in real time so that a device can seemingly have a conversation with its user. It’s now commonly used on smartphones, and is activated when a user speaks into their device’s microphone. The device captures waveforms based on the user’s spoken sounds, and then reduces any background noise and normalizes the volume.
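The normalization step mentioned above can be sketched in a few lines. This is an illustrative toy, not a production audio pipeline: it assumes the waveform has already been digitized into a list of sample values between -1.0 and 1.0, and simply scales them so the loudest peak reaches a chosen target level.

```python
def normalize_volume(samples, target_peak=0.9):
    """Scale waveform samples so the loudest peak reaches target_peak.

    `samples` is assumed to be a list of floats in the range [-1.0, 1.0],
    as produced by a hypothetical earlier digitization step.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Quiet recordings get boosted and loud ones get attenuated, so the later recognition stages see input at a consistent level.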
So how does the device translate those waveforms into actual words? It breaks the waves down into individual phonemes, which are the sounds used to build words (the English language has 44 phonemes, in case you were curious). It will take the opening sound of a word and use statistical analysis to determine what phoneme is likely to come next (for example, if a word begins with the sounds ‘st’, the next sound is more likely to be ‘a’ than ‘y’).
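That phoneme-level prediction boils down to lookup in a table of counts. Here is a minimal sketch using a bigram table; the counts and the ‘st’ example are hypothetical stand-ins for statistics a real system would learn from training data.

```python
from collections import Counter

# Hypothetical counts of which sound follows which, learned from
# training data in a real system.
phoneme_bigrams = {
    "st": Counter({"a": 120, "o": 45, "y": 3}),
}

def most_likely_next(sound):
    """Return the sound most likely to follow, or None if unseen."""
    counts = phoneme_bigrams.get(sound)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Given the table above, `most_likely_next("st")` returns `"a"`, matching the intuition that ‘sta’ starts far more English words than ‘sty’.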
The device will also use statistical analysis and context clues to determine what words are likely to come next (you’ve probably already experienced this when your smartphone offers you suggestions for the next word to enter in a text based on words that you’ve commonly texted after that word in the past).
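The word-level version works the same way, one level up: count which word follows which in past text, then suggest the most frequent follower. A minimal sketch, with a made-up message history standing in for a user’s real texting data:

```python
from collections import defaultdict, Counter

def build_word_model(sentences):
    """Count, for each word, which words have followed it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def suggest_next(model, word):
    """Suggest the most frequent follower of `word`, or None."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

If a user's history were `["what's the weather forecast", "what's the weather today", "what's the time"]`, then `suggest_next(model, "the")` returns `"weather"`, because it has followed “the” most often.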
How Does ASR Learn from You?
There are two types of ASR: direct dialog and natural language. You may have experienced direct dialog if you have, for example, called your bank to check your balance. Direct dialog will give you a set menu of choices and will only respond to the series of phonemes that it has been programmed to recognize. Natural language, on the other hand, is open-ended and can actually learn from a user in order to build vocabulary and better recognize context.
Active learning occurs when devices store data from past conversations with the user, allowing them to recall words or combinations of words that are commonly used (the same thing happens with auto-correct when you’re texting on a smartphone). This allows the device to better figure out commands based on context. For example, if a user commonly asks the device, “What’s the weather forecast?”, data will show that the opening sound ‘wh’ is statistically likely to be followed by ‘a’, and that the combination of words is statistically likely to be “What’s the weather forecast?”, triggering the device to respond with the current local weather forecast.
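The recall step described above can be sketched as a simple command memory: each heard command bumps a counter, and an uncertain partial transcription is matched against the most frequent past command that starts the same way. The class name and example commands are hypothetical; a real system would score full probabilistic hypotheses rather than string prefixes.

```python
from collections import Counter

class CommandMemory:
    """Toy sketch of 'active learning': remember past commands so an
    uncertain partial transcription can be matched to a frequent one."""

    def __init__(self):
        self.history = Counter()

    def record(self, command):
        """Store one more occurrence of a recognized command."""
        self.history[command] += 1

    def best_match(self, prefix):
        """Most frequent past command starting with `prefix`, or None."""
        candidates = [(count, cmd) for cmd, count in self.history.items()
                      if cmd.startswith(prefix)]
        if not candidates:
            return None
        return max(candidates)[1]
```

If “What’s the weather forecast?” has been asked three times and “What’s the time?” once, hearing just “What’s” resolves to the weather question, mirroring how stored usage data biases recognition toward a user’s habitual commands.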
ASR is far from perfect right now; devices will still sometimes incorrectly ‘guess’ a word or fail to understand a command if there is too much background noise or multiple people talking, and homophones (words that sound alike, such as ‘their’ and ‘there’) are difficult for devices to tell apart. However, the scope of artificial intelligence will only continue to grow, and we’re likely to see new and improved ASR devices every year.
To learn more about how ASR works, check out this blog post and infographic from West Interactive.
By Matt Zajechowski