
Google scientist Jeff Dean on how neural networks are improving everything Google does

By Jon Xavier – Web Producer, Silicon Valley Business Journal

If you've ever been mystified by how Google knows what you're looking for before you even finish typing your query into the search box, or had voice search on Android recognize exactly what you said even though you're in a noisy subway, chances are you have Jeff Dean and the Systems Infrastructure Group to thank for it.

As a Google Research Fellow, Dean has been working on ways to use machine learning and deep neural networks to solve some of the toughest problems Google has, such as natural language processing, speech recognition, and computer vision. In this exclusive Q&A, he talks about his work and how it's making Google more powerful and easy to use.

Q: What does your group do at Google?

A: In our group, we're trying to do several things. One is we're trying to build very robust systems that can scale and be trained on very large sets of data. We're also looking to apply parallelism in various ways to train those models more quickly, you know, train up a lot of models at once and pick the best one. Some things are easier to parallelize than others. It's pretty easy to train up 100 models and pick the best one. If you want to train one big model but do it on hundreds of machines, that's a lot harder to parallelize. So a lot of what we do is just laying the groundwork so that this approach is ready to use for different problems.

We're also collaborating with a bunch of different groups within Google to see how we can solve their problems, both in the short and medium term, and then also thinking about where we want to be four years, five years down the road. It's nice to have short-term to medium-term things that we can apply and see real change in our products, but also have longer-term, five to 10 year goals that we're working toward.

We've been applying neural networks to several different problems. One of our earliest collaborations was with the speech group. Essentially, speech recognition breaks down into two main parts. First there's going from the raw waveform in a short time frame, you know, what the sound actually looks like, and from that trying to predict what small piece of a word is being uttered at that second.

And then there's a model that comes in after that and tries to stitch all of those temporal representations of sounds together into words, so if you said “Buh Ah Da” the model would construct that into the word “bad.” That's called a language model: you're stitching these pieces together to get a full word, and also stitching those words together to get phrases, sentences, and so on. That model is very useful for disambiguating between two possible words you could have said. Like, if we know you just said “Palo Alto,” it's much more likely the next word is going to be “restaurants” than, say, “cheetah.” But the neural network work we've done so far is mostly for the first part of that problem, going from sequences of sounds to words.
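To make those two stages concrete, here is a toy sketch: hypothetical per-word acoustic scores are combined with a tiny bigram language model, so “restaurants” beats “cheetah” after “Palo Alto.” Every probability below is invented for illustration; this is not Google's decoder.

```python
# Toy combination of an acoustic model and a bigram language model.
import math

# Hypothetical acoustic-model output: P(word | sounds heard).
acoustic_scores = {"bad": 0.6, "bud": 0.3, "bat": 0.1}

# Hypothetical bigram language model: P(next word | previous word).
bigram = {
    ("palo alto", "restaurants"): 0.05,
    ("palo alto", "cheetah"): 0.000001,
}

def combined_score(prev_word, candidate, acoustic_prob, lm_weight=1.0):
    """Log-combine acoustic and language-model evidence for one candidate word."""
    lm_prob = bigram.get((prev_word, candidate), 1e-9)
    return math.log(acoustic_prob) + lm_weight * math.log(lm_prob)

# The acoustic model alone is unsure which word followed "Palo Alto"...
candidates = {"restaurants": 0.4, "cheetah": 0.45}
# ...but the language model breaks the tie.
best = max(candidates, key=lambda w: combined_score("palo alto", w, candidates[w]))
print(best)  # restaurants
```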

Q: What are the advantages of neural nets for problems like speech recognition over older methods?

A: Well, neural nets (networks of functions that behave like neurons in the human brain) have been around for a long time, since the late '60s, but they're coming back into vogue for several reasons. One is that a lot of machine learning systems require you to hand-engineer a bunch of features you think are predictive. That works for some small to moderate problems, but for low-level perceptual problems it's often not clear, even for a real domain expert, what features you should look at that would be very predictive of the correct end result. The advantage of neural nets, especially deep ones, is that they build features that describe the data well automatically, without humans having to get involved. So that is one big advantage.

The other reason is there's a lot more computational power available, and a lot more labeled data, and people have figured out how to train very deep networks. Until four or five years ago, it was impossible to get more than about a three-layer network to train well, because each neuron in the network is a non-linear function, and as you get deeper and deeper the output gets more and more irregular. The deeper the network, the more difficult the optimization. But people have now figured out ways around that. You can pre-train the first layer, do your optimization there, get it into a good state, and then add a layer. You can kind of do it layer by layer now.
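As a rough illustration of the layer-by-layer idea, here is a minimal NumPy sketch that greedily pre-trains a stack of tiny tied-weight autoencoders, one layer at a time. The layer sizes, data, and learning rate are arbitrary assumptions for demonstration, not the method Dean's group actually used.

```python
# Greedy layer-wise pre-training sketch: train each layer to reconstruct the
# previous layer's output before adding the next layer on top.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(data, hidden_size, steps=100, lr=0.01):
    """Train one tied-weight autoencoder on `data`, return its encoder weights."""
    n, d = data.shape
    W = rng.normal(0, 0.1, size=(d, hidden_size))
    for _ in range(steps):
        h = sigmoid(data @ W)        # encode
        recon = h @ W.T              # decode with tied weights
        err = recon - data           # reconstruction error
        # Gradient of squared reconstruction error w.r.t. W (encode + decode paths).
        grad = data.T @ (err @ W * h * (1 - h)) + err.T @ h
        W -= lr * grad / n
    return W

# Build a 3-layer stack: each new layer sees the features learned below it.
x = rng.normal(size=(100, 20))       # fake unlabeled data
weights, activations = [], x
for hidden in (16, 12, 8):
    W = pretrain_layer(activations, hidden)
    weights.append(W)
    activations = sigmoid(activations @ W)

print([w.shape for w in weights])    # [(20, 16), (16, 12), (12, 8)]
```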

Q: What are the limitations then?

A: Computation power and data set size, mainly. If you have enough labeled data and a model that's big enough to capture it and enough power to run that, then I think this is the best approach for an awful lot of problems.

Q: Give me some examples.

A: Well, it turns out you can represent a lot of textual problems as neural net problems. For example, we can build a high-dimensional vector representation of different words and group them by how similar they are. The word “iPhone,” for example, is going to be much closer to “smartphone” than to other words. And we can use that to start to understand what you're searching for on Google no matter what you type. Like, if you type in “smartphone,” you're probably still expecting to see iPhones in the results.
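A toy sketch of that idea: once each word is a vector, cosine similarity shows “iphone” sitting much closer to “smartphone” than to an unrelated word. The 4-dimensional vectors below are invented for illustration; real systems learn far higher-dimensional vectors from huge text corpora.

```python
# Word-vector similarity sketch with made-up vectors.
import numpy as np

vectors = {
    "iphone":     np.array([0.90, 0.80, 0.10, 0.00]),
    "smartphone": np.array([0.85, 0.75, 0.20, 0.05]),
    "cheetah":    np.array([0.00, 0.10, 0.90, 0.80]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vectors["smartphone"]
for word, vec in vectors.items():
    print(word, round(cosine(query, vec), 3))
# "iphone" scores far higher than "cheetah", so a search for "smartphone"
# can reasonably surface iPhone results.
```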

Another is image recognition. We're training Google Street View to recognize street numbers. It's really two problems: Is this a street number, and if so, what is it? Once we train a model to answer those questions, we can do a better job of aligning Street View imagery with street addresses in Google Maps.
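Structurally, that two-question pipeline might look something like the sketch below, where a hypothetical detector answers “is this a street number?” and a hypothetical reader answers “what is it?” Both models are stand-in callables, not Google's actual systems.

```python
# Two-stage detect-then-read sketch for street numbers.
from typing import Callable, Optional

def transcribe_house_number(
    patch,
    detector: Callable[[object], float],   # returns P(patch contains a number)
    reader: Callable[[object], str],       # returns the digit string
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the street number in `patch`, or None if there isn't one."""
    if detector(patch) < threshold:
        return None          # question 1: is this a street number? -> no
    return reader(patch)     # question 2: what is it?

# Toy usage with dummy models that pretend every patch reads "1600".
result = transcribe_house_number(
    patch=object(),
    detector=lambda p: 0.9,
    reader=lambda p: "1600",
)
print(result)  # "1600", ready to be aligned with an address in Maps
```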

It's not just neural networks; machine learning in general is used underneath a lot of our products in ways that are probably not obvious to consumers, but which make a lot of features on the site run. Things like our ad network, which has a lot of machine learning built into it. Or Gmail's spam and virus recognition. That's a machine learning problem because you're having to predict which messages are spam when they're messages you've never seen before. Or on Google+, we use machine learning to try to predict which folks you'd like to interact with or which people you should add to your circles.

Q: What are the outstanding questions you're trying to answer? Like what are the things where you think, 'Oh, if we could just do this, we could build something amazing?'

A: We have the start of being able to do a kind of mixing of supervised and unsupervised learning, and if we can get that working well, that will be pretty important. In almost all cases you don't have as much labeled data as you'd really like. And being able to take advantage of the unlabeled data would probably improve our performance by an order of magnitude on the metrics we care about. You're always going to have 100x, 1000x as much unlabeled data as labeled data, so being able to use that is going to be really important.
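The interview doesn't say which technique they have in mind, but one common way to fold unlabeled data into a supervised model is self-training with pseudo-labels. Here is a minimal sketch of that loop on synthetic data, purely as an illustration of the general idea.

```python
# Self-training sketch: fit on the small labeled set, then add in unlabeled
# examples the model is already confident about.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Tiny synthetic problem: a little labeled data, a lot of unlabeled data.
X_labeled = rng.normal(size=(20, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_unlabeled = rng.normal(size=(2000, 5))

model = LogisticRegression()
X_train, y_train = X_labeled, y_labeled
for _ in range(3):
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) > 0.95           # keep only confident guesses
    pseudo_labels = probs.argmax(axis=1)[confident]
    X_train = np.vstack([X_labeled, X_unlabeled[confident]])
    y_train = np.concatenate([y_labeled, pseudo_labels])

print(f"labeled: {len(y_labeled)}, training set after self-training: {len(y_train)}")
```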

The second thing is being able to understand big problems and sort of break them down into manageable chunks. For example, say I told you, “Please book me a trip to Washington, D.C.” That's a very high-level set of instructions. If you're a human, you'd ask me a bunch of follow-up questions: “What hotel do you want to stay at?” “Do you mind a layover?” - that sort of thing. I don't think we have a good idea yet of how to break a request like that down into a set of follow-up questions so that a computer can solve the problem in manageable steps. The search team often talks about this as the “conversational search problem.”

And we kind of have baby steps in this area. You can ask Google, “Who's the president of the United States of America?” And it will come back with “Barack Obama.” Then if you ask, “Who's he married to?” It will remember that “he” in this instance means “Barack Obama” and it knows what married means because we have this curated knowledge graph of facts about the world, and it will infer what sort of result you're looking for and will come back with “Michelle Obama.”
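A toy sketch of that behavior: remember the entity from the previous answer, resolve “he” or “she” to it, and look the relation up in a small table of facts. The two facts and the crude pronoun rule below are deliberately simplistic stand-ins for a real knowledge graph.

```python
# Toy follow-up question handling over a tiny fact table.
knowledge_graph = {
    ("United States of America", "president"): "Barack Obama",
    ("Barack Obama", "married to"): "Michelle Obama",
}

class Conversation:
    def __init__(self):
        self.last_entity = None            # what "he"/"she" should refer to

    def ask(self, subject: str, relation: str) -> str:
        if subject in ("he", "she") and self.last_entity:
            subject = self.last_entity     # crude pronoun resolution
        answer = knowledge_graph.get((subject, relation), "unknown")
        self.last_entity = answer          # the answer becomes the new referent
        return answer

c = Conversation()
print(c.ask("United States of America", "president"))  # Barack Obama
print(c.ask("he", "married to"))                        # Michelle Obama
```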

Another big thing is a (user interface) challenge, I think. With these predictive models, often it's not that you can come up with one thing that is absolutely right, it's more like you can come up with a hundred things that are varying degrees of likely to be right. How do you order those and deliver the handful that are most right in a way that's useful to the user? So we need to do research on the best way to do that.

Some places, it's easier. In search, you have the search query, which is a very strong indicator of what someone is into at that second. With something like Google Now, you're trying to use the phone's sensors and what people have searched for recently to figure out, “I know you're in Mountain View and it's about 4:30, what are the most likely things you care about?”

It's often hard to come up with exactly one right answer. You can say, well, he probably cares about his commute because in the past he's left at around this hour, and he probably cares about the weather. But once you get beyond a handful of things that are kind of universal, it's hard to figure out.

Like, if it's trying to give me restaurant reviews, there are probably 50 possible restaurants to choose from. And they might all be pretty good suggestions, because it knows what sorts of food I like, but it's still a list of 50 restaurants. Again, this would be a place where a dialog would be useful. “Are you in the mood for Italian?” Something like that.

Q: Where do you see this in five years? What sorts of things will we be able to do that we can't do today?

A: I think the basic perceptual problems like voice search and image recognition are going to be much better than they are now. It's going to enable things that just weren't possible before. In particular, I think speech will be good enough that it's possible to have a device where that's the only input. Right now we're close, but it's really not there. If you can build something where speaking is the fundamental input, that's going to be important.

Vision I think is going to be an important input. Like, if you're using Google Glass, it's going to be able to look around and read all the text on signs and do background lookups on additional information and serve that. That will be pretty exciting.

I think we will have a much better handle on text understanding as well. You see the very slightest glimmer of that in word vectors, and what we'd like to get to is a higher-level understanding than just words. If we could get to the point where we understand sentences, that will really be quite powerful. If two sentences mean the same thing but are written very differently, and we are able to tell that, that would be really powerful. Because then you do sort of understand the text at some level, because you can paraphrase it.

Once you understand text, it changes the game. Because today, for search, for example, we're not really understanding the text we see on webpages at a deep human level; we're looking for the word you searched for and for related words, and we score them in some way. But if we really understand the text we see, and the text you entered for a query, that would fundamentally be pretty important. It might be possible to build user interfaces that read the things people read and do the things people do.

You could ask hard questions like, “What are some of the lesser-known causes of the Civil War?” Or queries where you have to join together data from lots of different sources. Like, “What's the Google engineering office with the highest average temperature?” There's no webpage that has that data on it. But if you know a page that has all the Google offices on it, and you know how to find historical temperature data, you can answer that question. Making the leap to being able to manipulate that data to answer the question depends fundamentally on actually understanding what the data is.
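As a toy illustration of that kind of multi-source join, the sketch below combines an invented list of offices with invented historical temperatures to answer the “warmest office” question. Neither dataset is real; the point is that the answer only exists once the two sources are joined.

```python
# Join a list of offices with temperature records, then take the max.
offices = ["Mountain View", "Zurich", "Sydney", "New York"]

# Hypothetical historical average temperatures in Celsius, keyed by city.
avg_temperature_c = {
    "Mountain View": 15.4,
    "Zurich": 9.3,
    "Sydney": 18.2,
    "New York": 12.9,
    "Reykjavik": 5.1,   # present in the weather data but not an office
}

# Keep only cities that are offices, then pick the warmest one.
warmest = max(
    (city for city in offices if city in avg_temperature_c),
    key=lambda city: avg_temperature_c[city],
)
print(warmest, avg_temperature_c[warmest])  # Sydney 18.2
```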