Introduction to Voice User Interfaces (Part — 1)

Dec 2, 2020

The resulting text must then be reasoned over using AI logic to determine its meaning and formulate a response. Finally, the response text must be converted back into understandable speech, again with machine learning tools.

These three parts constitute a general pipeline for building an end-to-end voice-enabled application. Each part employs some aspect of AI. And that’s why we’re here.

In this article, we’ll go through an overview of a VUI system and talk about some current VUI applications. We’ll focus on conversational AI applications, where we’ll learn some VUI best practices and why we need to think differently about user design for voice than for other interface media. Finally, we will put these ideas into practice by building our own conversational AI application.

VUI Overview

Let’s take a closer look at the basic VUI pipeline we described earlier. To recap, three general pieces were identified.

Voice to text,
Text input reasoned to text output,
And finally, text to speech.
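To make the shape of the pipeline concrete, here is a minimal sketch of the three pieces chained together. Every function in it is a hypothetical stand-in invented for illustration: a real application would call trained models or cloud APIs at each stage, and the canned strings are assumptions, not real recognizer output.

```python
# Hypothetical stand-ins for the three pipeline stages.

def speech_to_text(audio: bytes) -> str:
    # Assumption: pretend the recognizer decoded the audio as this question.
    return "how is the weather"

def reason(text: str) -> str:
    # Text in, text out: the "thinking" stage of the application.
    if "weather" in text:
        return "It is cold outside."
    return "Sorry, I did not understand that."

def text_to_speech(text: str) -> bytes:
    # Assumption: encode the text as bytes instead of producing real audio.
    return text.encode("utf-8")

def run_pipeline(audio: bytes) -> bytes:
    # Voice to text, text to text, text to speech, in that order.
    return text_to_speech(reason(speech_to_text(audio)))

reply_audio = run_pipeline(b"")
```

The point of the sketch is only the data flow: audio goes in, text moves through the middle stage, and audio comes back out.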

Speech Recognition

It starts with voice to text. This is speech recognition. Speech recognition has historically been hard for machines but easy for people, and it is an important goal of AI. As a person speaks into a microphone, sound vibrations are converted to an audio signal. This signal can be sampled at some rate and those samples converted into vectors of component frequencies. These vectors represent features of sound in a data set, so this step can be thought of as feature extraction.
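The feature extraction step can be sketched in a few lines of plain Python. This is a simplified illustration, not how a production recognizer does it: the sine tone stands in for a microphone signal, the sample rate and frame size are arbitrary assumptions, and the naive discrete Fourier transform is used only because it needs no extra libraries. Real systems typically add windowing and further processing on top of this idea.

```python
import cmath
import math

def make_tone(freq_hz, sample_rate, n_samples):
    """Sample a pure sine tone, standing in for a microphone signal."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def frame_signal(samples, frame_size):
    """Split the samples into consecutive, non-overlapping frames."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size]
            for i in range(0, last_start + 1, frame_size)]

def dft_magnitudes(frame):
    """Return the magnitude of each frequency bin up to the Nyquist bin."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags

SAMPLE_RATE = 8000   # samples per second (assumed)
FRAME_SIZE = 64      # samples per frame (assumed)

signal = make_tone(1000, SAMPLE_RATE, 256)

# One feature vector of component frequencies per frame.
features = [dft_magnitudes(f) for f in frame_signal(signal, FRAME_SIZE)]

# Each bin covers SAMPLE_RATE / FRAME_SIZE = 125 Hz.
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
```

Here each frame becomes one vector, and the strongest component of the first vector lands on the bin for the 1000 Hz tone we generated.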

Photo by Jonas Leupe on Unsplash

The next step in speech recognition is to decode or recognize the series of vectors as a word or sentence. In order to do that, we need probabilistic models that work well with time series data for the sound patterns. This is the acoustic model.
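One classic family of probabilistic models for time series is the hidden Markov model (HMM), decoded with the Viterbi algorithm. The article does not say which model to use, so treat the sketch below as one possible illustration. The two states, the "low"/"high" observations, and all the probabilities are made-up toy values; a real acoustic model would use sound units as states and feature vectors as observations.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob

# Toy model (all numbers are illustrative assumptions).
states = ("silence", "speech")
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.3, "high": 0.7},
}

path, prob = viterbi(["low", "high", "high", "low", "low"],
                     states, start_p, trans_p, emit_p)
```

The decoder looks at the whole sequence rather than each observation in isolation, which is what makes this kind of model a good fit for sound patterns that unfold over time.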

Decoding the vectors with an acoustic model will give us a best guess as to what the words are. This might not be enough, though, because some sequences of words are much more likely than others. For example, depending on how the phrase “hello world” was said, the acoustic model might not be sure if the words are “hello world” or “how a word” or something else.

Now you and I know that it was most likely the first choice, “hello world”. But why do we know? We know because we have a language model in our heads, trained from years of experience, and that is something we need to add to our decoder. An accent model may be needed for the same reason. If these models are well trained on lots of representative examples, we have a higher probability of producing the correct text. That’s a lot of models to train. Acoustic, language, and accent models are all needed for a robust system, and we haven’t even gone through the whole VUI pipeline yet.
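To see how a language model can break the tie between “hello world” and “how a word”, here is a small sketch using a bigram model with add-one smoothing. The five-sentence corpus and the acoustic scores are invented for illustration; a real language model would be trained on far more text, and the choice of bigrams is just one simple option.

```python
import math
from collections import Counter

# Tiny made-up training corpus (an assumption for illustration).
corpus = [
    "hello world",
    "hello there",
    "say hello world",
    "how are you",
    "a word of advice",
]

def train_bigrams(sentences):
    """Count single words and adjacent word pairs, with a start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log(
            (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        )
    return total

unigrams, bigrams = train_bigrams(corpus)

# Hypothetical acoustic log-scores: the acoustic model alone is nearly tied.
acoustic_log_scores = {"hello world": -4.1, "how a word": -4.0}

# Combine both sources of evidence and keep the best candidate.
best = max(
    acoustic_log_scores,
    key=lambda s: acoustic_log_scores[s] + log_prob(s, unigrams, bigrams),
)
```

Even though the acoustic scores slightly favor “how a word” in this toy setup, the language model has seen “hello world” before and tips the decision the right way.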

Reasoning Logic

Back to the pipeline: once we have our speech in the form of text, it’s time to do the thinking part of our voice application, the reasoning logic.

Suppose I ask you, a human, a question like “How’s the weather?”

You may respond in many ways, like “I don’t know”, “It’s cold outside”, “The thermometer says 90 degrees”, etc. In order to come up with a response, you first had to understand what I was asking for, and then process the request and formulate a response. This was easy because you’re human. It’s hard for a computer to understand what we want and what we mean when we speak. The field of natural language processing (NLP) is devoted to this quest. To fully implement NLP, large datasets of language must be processed, and there are a great many challenges to overcome. But let’s look at a smaller problem, like getting just a weather report from a VUI device.

Photo by Thomas Kolnowski on Unsplash

Let’s imagine an application that has weather information available in response to some text request. Rather than parsing all the words, we could take a shortcut and just map the most probable request phrases for the weather to a “get weather” process. In that case, the application would in fact understand requests most of the time. This won’t work if the request hasn’t been pre-mapped as a possible choice, but it can be quite effective for limited applications and can be improved over time.
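The phrase-mapping shortcut can be sketched as a simple lookup. The phrase list, the intent name `get_weather`, and the canned weather reply are all hypothetical; the sketch only shows the idea of mapping likely requests to a process and falling back when nothing matches.

```python
import re

# Hypothetical mapping from an intent name to phrases that should trigger it.
INTENT_PHRASES = {
    "get_weather": [
        "how's the weather",
        "what's the weather",
        "weather report",
        "is it cold outside",
    ],
}

def normalize(text):
    """Lowercase, straighten apostrophes, drop punctuation, collapse spaces."""
    text = text.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())

def match_intent(text):
    """Return the first intent whose mapped phrase appears in the request."""
    cleaned = normalize(text)
    for intent, phrases in INTENT_PHRASES.items():
        if any(phrase in cleaned for phrase in phrases):
            return intent
    return None

def get_weather():
    # Stand-in for a real weather lookup (an assumption for illustration).
    return "It's cold outside."

HANDLERS = {"get_weather": get_weather}

def respond(text):
    intent = match_intent(text)
    if intent is None:
        # The request was not pre-mapped, so the application cannot help.
        return "Sorry, I can't help with that yet."
    return HANDLERS[intent]()
```

A request such as "Will it rain tomorrow?" falls through to the fallback reply here, which is exactly the limitation described above; adding more phrases over time is how such a mapping improves.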

TTS (Text To Speech)

Once we have a text response, the remaining task in our VUI pipeline is to convert that text to speech. This is speech synthesis, or text to speech (TTS). Here again, examples of how words are spoken can be used to train a model to provide the most probable pronunciation components of spoken words. The complexity of the task can vary greatly when we move from, say, a monotone robotic voice to a rich, human-sounding voice that includes inflection and warmth. Some of the most realistic-sounding machine voices to date have been produced using deep learning techniques.
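As a very small illustration of "most probable pronunciation components", the sketch below counts how often each pronunciation was observed for a word and picks the most frequent one. The training examples and the phoneme symbols are toy assumptions, and this covers only the text-to-phoneme step; turning phonemes into an actual audio waveform is a separate and much harder part of synthesis.

```python
from collections import Counter, defaultdict

# Hypothetical training examples: (word, phoneme sequence that was spoken).
examples = [
    ("hello", ("HH", "AH", "L", "OW")),
    ("hello", ("HH", "AH", "L", "OW")),
    ("hello", ("HH", "EH", "L", "OW")),
    ("world", ("W", "ER", "L", "D")),
]

def train(examples):
    """Count how often each pronunciation was observed for each word."""
    counts = defaultdict(Counter)
    for word, phones in examples:
        counts[word][phones] += 1
    return counts

def most_probable_pronunciation(word, counts):
    """Return the most frequently observed pronunciation, or None if unseen."""
    if word not in counts:
        return None
    phones, _ = counts[word].most_common(1)[0]
    return phones

def text_to_phonemes(text, counts):
    """Look up the most probable pronunciation of each word in the text."""
    return [most_probable_pronunciation(w, counts)
            for w in text.lower().split()]

model = train(examples)
phonemes = text_to_phonemes("Hello world", model)
```

Words that never appeared in the examples come back as `None`, which hints at why real systems need large, representative training sets.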

VUI Applications

VUI applications are becoming more and more commonplace. There are a few reasons driving this. First of all, voice is natural for humans. It’s effortless for us to converse by voice compared to reading and typing. Secondly, it turns out it’s also fast: speaking into a text transcriber is about three times faster than typing. In addition, there are times when it is just too distracting to look at a visual interface, like when you’re walking or driving.

With the advent of better and more accessible speech recognition and speech synthesis technologies, a number of applications have flourished. For example, voice interfaces can be found in cars: drivers can initiate and answer phone calls, give and receive navigation commands, and even receive texts and e-mail without ever taking their eyes off the road.

Other applications in web and mobile have been around for a few years now but are getting better and better. Dictation applications leverage speech recognition technologies to make putting thoughts into words a snap. Translation applications leverage speech recognition and speech synthesis, as well as some reasoning logic in between, to convert speech in one language to speech in another. If you’ve tried any of these, you know it’s not quite a universal translator, but it’s pretty amazing to be able to communicate through one of these apps with someone you couldn’t even speak to before.

One of the most exciting innovations in VUI today is conversational AI technology. We can now carry on a conversation with a cloud-based system that incorporates well-tuned speech recognition, some functionality, and speech synthesis into one system or device. Examples include Apple’s Siri, Microsoft’s Cortana, Google Home, and Amazon’s Alexa on Echo. Conversational AI really captures our imaginations because it seems to be an early step toward the more general AI we’ve seen in science fiction movies.

The home assistant devices in this category are quite flexible. In addition to running a search or giving you the weather, these devices can interface with other devices on the internet linked with your accounts if you want, fetch saved data, and the list goes on.

Even better, development with these technologies is accessible to all of us. We really only need our computer to get started creating our own application in conversational AI. The heavy lifting of speech recognition and speech synthesis has been done for us and turned into cloud-based APIs. The field is new and just waiting for smart developers to imagine and implement the next big thing. There’s a lot of opportunity out there to come up with any voice-enabled application we can think of.