Gary Illyes On Information Retrieval At Google Search

Gary Illyes of Google spent a good amount of time in the latest Search Off The Record podcast talking about information retrieval. If you have not read anything in depth about this topic, I highly recommend you listen to what Gary said.

This is a super useful refresher from Gary. He digs into how Google handles information retrieval, including relevancy and the multi-stage ranking system. He goes a bit more into rankings and even digs into synonyms, specifically buy and sell synonyms. And yes, buy and sell are treated as synonyms, but Google weighs synonyms differently, and in this case, sell is weighted less than purchase as a synonym for buy. They dig even deeper into how to write content for buy vs. sell pages or adopt me pages. It is pretty cool.

This all starts at the 25:23 mark of the podcast, so click play below and it should start there.

If you don’t want to watch and prefer to read, here is the transcript:

I think in one of the previous episodes we talked about query parsing and understanding and briefly touched on synonyms. For example, if you search for something like “buy a car,” then that would be expanded to “buy cars,” and “purchase auto” or “purchase car,” and so on.

And then we search in our index for all those words, right? Because it might be helpful for some people. Now, searching the index, that’s an overloaded term because we don’t actually go through our whole index when we are looking for these words. We have something that we covered previously– something called posting list. That is essentially a map of terms to pages or documents containing those words.

For example, we could identify easily that the term “car” appears in doc A, B, C, D, E, F and G. And then “buy” appears in– I don’t know– B, C, D, E, F, G. And technically what, in the simplest form, what you want to do is to return an intersection of the two groups. Basically, you would return B, C, D, E, F, G because those were the docs that contained both words.

In reality, it’s not that simple. We would return both or all the documents to our serving system and let that deal with documents that are not relevant enough, let’s say, to the query.
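To make the posting list idea concrete, here is a minimal Python sketch. It is a toy illustration, not Google's implementation: the `posting_list` dictionary and both function names are assumptions, and the document IDs simply mirror Gary's example. It shows the "simplest form" (intersection) next to the broader retrieval he describes, where every document containing any of the terms is passed along.

```python
# Toy posting list: term -> set of IDs of documents containing that term.
# The data mirrors Gary's example; the structure itself is an assumption.
posting_list = {
    "car": {"A", "B", "C", "D", "E", "F", "G"},
    "buy": {"B", "C", "D", "E", "F", "G"},
}


def docs_with_all_terms(terms, index):
    """Simplest form: documents that contain every term (intersection)."""
    sets = [index.get(term, set()) for term in terms]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def docs_with_any_term(terms, index):
    """Broader retrieval: documents that contain at least one term (union)."""
    found = set()
    for term in terms:
        found |= index.get(term, set())
    return sorted(found)


both = docs_with_all_terms(["buy", "car"], posting_list)
either = docs_with_any_term(["buy", "car"], posting_list)
print(both)    # docs containing both words
print(either)  # docs containing either word
```

In this sketch, doc A only shows up in the broader set, since it contains "car" but not "buy." Deciding whether A is relevant enough is left to later stages.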

Now, relevancy is an interesting concept because it’s determined by multiple things. One thing is that, well, one part is rooted in the query itself where you could say that my original term was “buy car” or “buy a car”– we drop the “a” because it’s irrelevant to the query. So we would have “buy car.” And then those would be the terms that we are most curious about. Those are the terms that we really want in our result set.

You could say that those terms have the highest weight during the ranking process, during the sorting process. Everything that we expanded the query with, like “purchase auto,” for example, that would have a lower weight than the original term because that’s not what the user searched for– it’s just related terms to what the user searched for and it might be helpful, but it’s not what the user searched for.
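The weighting Gary describes could be sketched roughly like this in Python. Everything specific here is hypothetical: the stopword list, the synonym table, and the two weight values are made up for illustration, since Gary does not give actual numbers. The only point is that the original terms get a higher weight than the terms added by query expansion.

```python
# Hypothetical values: the podcast gives no actual weights or word lists.
ORIGINAL_WEIGHT = 1.0
EXPANDED_WEIGHT = 0.5
STOPWORDS = {"a", "an", "the"}
SYNONYMS = {"buy": ["purchase"], "car": ["cars", "auto"]}


def weighted_query_terms(query):
    """Map each query term to a weight; expanded terms weigh less."""
    weights = {}
    # Original terms, minus words like "a" that add nothing to the query.
    for term in query.lower().split():
        if term not in STOPWORDS:
            weights[term] = ORIGINAL_WEIGHT
    # Expanded terms never override the weight of an original term.
    for term in list(weights):
        for synonym in SYNONYMS.get(term, []):
            weights.setdefault(synonym, EXPANDED_WEIGHT)
    return weights


print(weighted_query_terms("buy a car"))
```

Here "buy" and "car" keep the full weight, "a" is dropped, and "purchase," "cars," and "auto" come in at the lower weight.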

We are going to look for those terms as well, but we will consider their relevance lower than the original term’s relevancy. And in the first stage, we will retrieve all the documents that we can. Basically, if we have one billion documents that contain the terms “buy car,” then in the first stage, we would have all those billion documents collected in one glob.

Then a sorting mechanism kicks in, which is basically our ranking system, and it will create a reverse sorted list of all those billion documents and makes a cut at roughly 1,000. And then those 1,000 documents will be pushed up in serving– I have no idea why I’m gesticulating here with my hands because no one can see it– but basically, those 1,000 documents are pushed back up towards the user.
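The "sort everything, then cut at roughly 1,000" step can be illustrated with a small Python sketch. This is an assumption about the general shape of such a step, not a description of Google's system: the function name `first_stage_cut` and the made-up scores are hypothetical, and `heapq.nlargest` is used simply because it returns the top k items in descending order without fully sorting a very large list.

```python
import heapq


def first_stage_cut(scored_docs, k=1000):
    """Keep the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])


# Made-up candidates standing in for the documents retrieved in stage one.
candidates = [(f"doc{i}", float(i)) for i in range(5000)]
top = first_stage_cut(candidates)
print(len(top), top[0], top[-1])
```

Everything below the cut is dropped at this point, so only the surviving documents move on to the rest of ranking.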

Here, I mentioned a little bit about ranking, and I think that’s a topic on its own for a next episode– we are not going to go there. But once we have those 1,000 documents, basically, we can start essentially serving them. And they have not finished the ranking. Basically, we just created a sorted list based on some of the signals that we have, but we need more signals to finish ranking those 1,000 documents. Basically, sorting them in the order that we believe that would be okay for the user, essentially.

And that happens in another stage of ranking. But at this point, we could give those results to the user and they probably would be okay with them already, in most cases. In our [live for] query classes, usually, we show these presorted lists, and usually, it looks okay. Of course, you can search for weird things and as we all know, there’s weird things on the internet, so sometimes you can see very weird things in those presorted lists. And that’s why further ranking is important.

For example, I don’t want results about pineapple pizza, and those would be demoted very aggressively, at least in my case. But in the presorted list, it will still be there because ranking is not finished yet.
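Gary's pineapple pizza example can be sketched as a second pass over the presorted list. This is a toy Python illustration under loud assumptions: the `rerank` function, the idea of a per-document multiplier, and all scores are hypothetical, because the podcast does not say which extra signals are used or how they are combined.

```python
def rerank(presorted, adjustments):
    """Apply extra per-document multipliers, then sort again, best first.

    presorted:   list of (doc_id, first_stage_score)
    adjustments: dict of doc_id -> multiplier (hypothetical extra signal);
                 documents without an entry keep their score.
    """
    rescored = [
        (doc_id, score * adjustments.get(doc_id, 1.0))
        for doc_id, score in presorted
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Made-up presorted list: the unwanted result is still on top at this stage.
presorted = [("pineapple-pizza", 0.9), ("margherita", 0.8), ("calzone", 0.7)]
final = rerank(presorted, {"pineapple-pizza": 0.1})
print([doc_id for doc_id, _ in final])
```

In the presorted list the pineapple pizza result is still first. After the extra signal is applied, it drops to the bottom while the other results keep their relative order.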

Then John Mueller expands:

John Mueller: Okay. So this happens basically across the different kinds of indexes that we have as well?

Gary Illyes: Right.

John Mueller: Or is that a different topic almost?

Gary Illyes: In the context of this episode, we are only talking about the web index and not image index, or video index, or whatnot, because they work slightly differently and I never worked on them so I can’t authoritatively talk about them, I guess.

John Mueller: Okay.

Gary Illyes: On the web index, I actually did work, so I know way more about it than any of the other indexes that we have.

Here is the video Gary mentioned regarding Paul Haahr’s talk:

Forum discussion at Twitter.