Why Aren't We All Talking With Our Devices Yet?

Why aren’t we all now talking to our phones … instead of squinting at them?

Experts were saying over four decades ago that speech recognition would become a mature technology in five to ten years.

Five years later, they were still saying the same thing.

It took a long time, but now, it does seem that reasonably high-quality recognition has become real. And on a mobile device yet! (Albeit with a lot of server-side support.) What can be done using phones today is simply stunning when you consider that in the 1970s we would buy really expensive dedicated hardware that tried to recognize sentences of up to five words out of a vocabulary of 120.

So why aren’t we all now talking to our phones … instead of squinting at them?

Because it takes more than just understanding a bunch of words to provide a good voice user interface -- especially on a mobile phone. We have to understand intent. But there are other factors at play here besides the technology of speech: output, interaction, and context.

#### Chris Schmandt

##### About

Chris Schmandt helped found the MIT Media Lab, where he currently directs the Speech + Mobility research group. He has been designing speech interfaces for decades and worked in the Architecture Machine Group, where he co-authored the first multimodal speech user interface as well as the first conversational speech application ("Put-that-there"). Schmandt also pioneered location-based mobile applications.

Currently deployed speech systems either get it or they don’t. It’s actually pretty amazing that when we ask for directions to some street, it’s recognized out of a database of thousands of street names. But if it’s not recognized, or the street is wrong, we have to start all over.

Ironically, our early conversational systems did better because they asked questions: They admitted what they didn’t understand, and allowed the talker to correct what was misunderstood. Now, when the systems don’t work, we just go back to the old finger habits that do work.
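
To make that contrast concrete, here is a minimal sketch of that kind of repair loop -- the function names, thresholds, and `confirm` callback are all hypothetical, not any deployed system's API:

```python
from typing import Callable, Optional

def handle_utterance(hypothesis: str, confidence: float,
                     confirm: Callable[[str], bool]) -> Optional[str]:
    """Confirm uncertain recognitions instead of failing silently.

    `confirm` asks the user a yes/no question and returns their answer;
    in a real system it would speak the question and listen for a reply.
    """
    if confidence >= 0.9:
        return hypothesis                     # confident enough to just act
    if confidence >= 0.5:
        # Admit the uncertainty and let the talker correct the misunderstanding.
        return hypothesis if confirm(f"Did you say '{hypothesis}'?") else None
    return None                               # too unsure to even guess; re-prompt instead

def ask(question: str) -> bool:
    print(question)        # stand-in for text-to-speech
    return True            # pretend the user answered "yes"

# A middling-confidence street name triggers a confirmation question
# rather than a wrong turn or a forced restart.
print(handle_utterance("Main Street", 0.6, confirm=ask))
```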

To get to the next step in voice, we have to recognize that current systems are little more than raw recognition technology -- good tech, but no more. What we need to do now is create voice user interfaces and, indeed, entire speech-based interaction paradigms.

Understanding intent means figuring out what the person really wants, which often involves conversation. It means understanding why they might be asking, which often requires figuring out what they’re doing. And it means responding in a manner appropriate to their situation, which isn’t easy when users are holding those screens in their hands while running around a world of distractions.

Interaction is a two-way street, which means we have to also focus on output: How does the voice-enabled application respond? By telling us what we want, as does the killer app of real-time driving directions. Keep your eyes on the road, your hands on the wheel, and do what the voice says -- that works.

What doesn’t work is something like reciting the entire bus schedule back when we ask for the next bus. Even telling me “The next bus is at 9:12” is going to be harder to remember than saying “in 12 minutes”.
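
As a toy illustration of that difference, a voice response can compute the relative phrasing from the clock time. This is just a sketch in Python; the function name is made up:

```python
from datetime import datetime, timedelta

def spoken_departure(next_bus: datetime, now: datetime) -> str:
    """Phrase a departure as relative time, which is easier to hold in
    memory than a clock time like "9:12"."""
    minutes = max(0, round((next_bus - now).total_seconds() / 60))
    if minutes == 0:
        return "The bus is arriving now."
    if minutes == 1:
        return "The next bus is in 1 minute."
    return f"The next bus is in {minutes} minutes."

# At 9:00, a 9:12 departure is announced as "in 12 minutes".
now = datetime(2013, 1, 1, 9, 0)
print(spoken_departure(now + timedelta(minutes=12), now))
```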

The point is that speech is transient -- once spoken, it’s gone, and if we were distracted or didn’t immediately commit the information to memory, we’ll have to ask again. And again.

This is why we shouldn't dismiss the interface screen yet; it is less taxing on our memory. The future of voice is screaming for multi-modal output: show plus tell. For example, why not show me the bus schedule on the screen and tell me the next bus is coming in 12 minutes in case I’m on the run?

Yes, this takes a lot of work -- we have to coordinate speech and graphical parts of the user interface. It’s even harder if we enable multi-modal input -- for example, touch the screen to show which bus stop. But this approach means we’ll end up with much more powerful ways of keeping track of the information we ask for.
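
One way to picture "show plus tell" is a response that pairs a short spoken summary with a richer payload that stays on screen. The sketch below is purely illustrative -- the class and field names are assumptions, not any platform's API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MultimodalResponse:
    """Pair a transient spoken summary with persistent on-screen detail."""
    speech: str                       # short and memorable, spoken once
    screen_title: str                 # stays visible for later reference
    screen_rows: List[str] = field(default_factory=list)

def next_bus_response(stop: str, departures_in_minutes: List[int]) -> MultimodalResponse:
    soonest = min(departures_in_minutes)
    return MultimodalResponse(
        speech=f"The next bus is in {soonest} minutes.",
        screen_title=f"Departures from {stop}",
        screen_rows=[f"in {m} minutes" for m in sorted(departures_in_minutes)],
    )

resp = next_bus_response("Kendall / MIT", [12, 27, 44])
print(resp.speech)           # tell: the one fact I need right now
for row in resp.screen_rows:
    print(row)               # show: the rest of the schedule stays on screen
```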

And that leads right into the third and most important factor around the future of voice: context. Human conversation carries so much context: what the words mean, what things pronouns refer to, what words we stress. This involves elaborate speech rituals (what psycholinguist Herb Clark calls “grounding”) so that a short query like “So did you get it?” can be shorthand for a nuanced, shared experience where the other person knows which “it” is being referred to.

This -- context -- is where phones are poised to work their magic, especially when it comes to the realities and future of voice. The phone knows where it is. It can learn when I go to work, and where that is. And it knows who I call, who is important to me.

When I use a set of unified Android or iOS apps, the phone also knows my email, my calendar, and my interests. So when I say “When’s the bus?” the phone should know which bus I take, what stop I use, whether I’m inbound or outbound, and even how long it takes me to get to the stop. In fact, even before I ask, my phone should be playing some crazy alarm tune and telling me it’s time to get my butt out the door, and don’t forget your umbrella today -- just like a good caretaker.
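
A rough sketch of how that context might be drawn on could look like the following -- every field, rule, and threshold here is a made-up assumption for illustration, not how any real assistant works:

```python
from dataclasses import dataclass

@dataclass
class UserContext:
    """Context the phone accumulates over time; all fields are illustrative."""
    usual_route: str          # learned from repeated trips
    home_stop: str
    work_stop: str
    walk_minutes: int         # how long it takes me to reach the stop

def resolve_bus_query(ctx: UserContext, hour: int, minutes_to_departure: int) -> str:
    """Answer "When's the bus?" without asking which bus, stop, or direction."""
    # Infer direction from time of day: mornings are inbound, evenings outbound.
    stop = ctx.home_stop if hour < 12 else ctx.work_stop
    answer = f"Route {ctx.usual_route} leaves {stop} in {minutes_to_departure} minutes."
    # Proactive nudge: speak up before the user even asks if time is tight.
    if minutes_to_departure <= ctx.walk_minutes + 2:
        answer += " You should leave now."
    return answer

ctx = UserContext(usual_route="1", home_stop="Main St", work_stop="Media Lab", walk_minutes=8)
print(resolve_bus_query(ctx, hour=8, minutes_to_departure=9))
```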

Drawing on context helps us understand not just the words, but the intent behind them. While these features already exist in some form in some apps, they aren’t really connected to the voice components. Not yet. And don’t forget the social factors: We might not talk to our phones in public because we don’t want to disturb those around us, or because we have secrets.

But things change. I was an early user of the first real hand-held phone in the U.S. -- the Motorola MicroTAC -- and people looked at me like I was crazy when I talked on it while walking around (even on the tech-savvy MIT campus). Now we readily accept people with Bluetooth headsets seemingly talking to themselves -- something that used to be a sign of a crazy person.

Editor: Sonal Chokshi @smc90