2.5. TEXT RETRIEVAL AND GOOGLE

Suppose you want to find an article about headhunters, groups of people who kill others and collect their heads. Unfortunately for you, the most common meaning of the word headhunter is that of a corporate recruiter, so that if you type "headhunter" in Google the returns will be for the corporate type. So you need to specify a bit more. You may try, for example, "headhunter amazon" because you have heard that headhunting was practiced amongst primitive people living in the Amazon rain forest. Google will return some items that are indeed close to what you are looking for but it will also return several irrelevant ones, for example about a music band called "Headhunters" whose albums are sold on amazon.com. Clearly, we must be careful in selecting the words that describe the subject we are interested in. For the current example "headhunter tribes" seems to be a reasonable choice.

Even if the word "headhunter" is used in a document in its literal meaning, its occurrence need not imply that the document deals with that subject. It may be an incidental reference within another topic. One could use the number of occurrences of a phrase in a document, but that is not always reliable. Google relies on user feedback. Here is roughly how it works.

Google returns links to all the web pages that contain the query words. (Matching words to text is fairly simple, as we saw in Section 1.1.) Typically, there are thousands of them and the challenge is to order them according to the likelihood of being of interest to the user.
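The matching step can be illustrated with a small sketch. It assumes a tiny in-memory collection of pages and a simple word index; the page names, the texts, and the helper functions (tokenize, build_index, pages_matching) are invented for illustration and say nothing about how Google actually stores its data.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map each word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def pages_matching(index, query):
    """Return the names of the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical toy pages, made up for this example.
pages = {
    "recruiter": "A headhunter helps companies hire executives.",
    "band": "Albums by the band Headhunter are sold on Amazon.",
    "tribes": "Stories of headhunter tribes of the Amazon rain forest.",
}
index = build_index(pages)

print(pages_matching(index, "headhunter"))         # all three pages
print(pages_matching(index, "headhunter amazon"))  # the band page is still included
print(pages_matching(index, "headhunter tribes"))  # only the page about tribes
```

Even in this toy collection, adding "amazon" to the query does not exclude the irrelevant page about the band, while "headhunter tribes" does.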

Google uses several criteria for ordering pages. One of them is a ranking of the web pages on the basis of the links they receive from other pages. The more web sites link to a particular page, the more popular that page seems to be, so it is given a high rating. Here is an example of some ratings:

Rating   Some Pages with that Rating
10       Google
9        Yahoo Main Page
8        Yahoo Sports; Stony Brook University; Wikipedia Main Page
7        Microsoft Store; State of N.J. Income Tax; 2010 Int. Conf. on Pattern Recognition
6        SBU Staller Center for the Arts
5        A financial services login page; a Travel Agency; a Table of the Digits of π
4        A site offering Tax Tables; Amazon page of a book on Economics
3        Amazon page of a best seller book ("Eat, Pray, Love")
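The idea of rating a page by the links pointing to it can be sketched as follows. This is only a plain count of incoming links under simplifying assumptions; it is not Google's actual formula, and it does not reproduce the 0 to 10 scale of the ratings listed above. The link graph and the function names are hypothetical.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # count each linking page only once
            if target != source:         # ignore links from a page to itself
                counts[target] += 1
    return counts

def order_by_rating(matches, counts):
    """Sort pages by incoming-link count, highest first; ties are broken by name."""
    return sorted(matches, key=lambda page: (-counts[page], page))

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a": ["hub", "b"],
    "b": ["hub"],
    "c": ["hub", "b"],
    "hub": ["a"],
}
counts = inbound_link_counts(links)
ranked = order_by_rating(["a", "b", "c", "hub"], counts)
print(ranked)
```

Here "hub" receives links from three pages, "b" from two, "a" from one, and "c" from none, so the pages are listed in that order.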

The higher the rating of a page, the earlier is its place in the returns for a match. However, the ratings of a page need not reflect its importance for a subject. A better way to improve the listing of the returns is to look at popular opinion, much as it is done in the recommendations of books and movies that we discussed in the last three sections.

Google tracks user responses and uses them to order the returns accordingly. If more users have clicked on the link for page A than on the link for page B, then page A must be more relevant to the query than page B. (Clearly, this requires keeping a lot of data around. For each page Google has a list of the queries it matched and how often it has been chosen for viewing.)
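The bookkeeping described in the parenthetical remark can be sketched as follows, assuming that every click is recorded as a (query, page) pair. The class ClickLog and its methods are hypothetical names chosen for this illustration; a real system would also have to deal with scale, spam, and the fact that pages listed first attract more clicks anyway.

```python
from collections import Counter, defaultdict

class ClickLog:
    """For each page, remember how often it was chosen for each query."""

    def __init__(self):
        # chosen[page][query] = number of times the page was clicked for that query
        self.chosen = defaultdict(Counter)

    def record_click(self, query, page):
        self.chosen[page][query] += 1

    def times_chosen(self, query, page):
        # Look up without creating an entry for pages never clicked.
        return self.chosen.get(page, Counter())[query]

    def order(self, query, matches):
        """Order matching pages by past clicks for this query, most clicked first."""
        return sorted(matches, key=lambda page: (-self.times_chosen(query, page), page))

# Hypothetical click history.
log = ClickLog()
for _ in range(3):
    log.record_click("headhunter", "agency")
log.record_click("headhunter", "tribes")

print(log.order("headhunter", ["tribes", "agency", "band"]))
```

With this history the employment agency comes first for the query "headhunter", followed by the page about tribes, and then the page that nobody chose.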

User feedback also explains why the simple query "headhunter" returns at the top links to employment agencies. Far more people are looking for a job than for stories about primitive tribes!

User feedback is critical to the success of Google. Suppose we are looking for stories about a dog named Lucy. Typing "dog named Lucy" on Google is too restrictive because the word "named" may not appear in a story. And using all three words does not guarantee that we will get what we want. When I did that on Google, one of the stories that was returned contained the passage: "Lucy and I spent the weekend alone together. We have a dog named Kyler." Insisting on the exact phrase "dog named Lucy" did not produce any unwanted returns, but it missed a lot of stories. It is better to try "dog Lucy". That did produce some unwanted returns (for example: "Ask Lucy - How to feed a senior dog"), but the first 22 returns were all about a dog named Lucy!
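The difference between matching all the words of a query and matching the exact phrase can be made concrete with a small sketch. The two quoted texts come from the example above; the other two stories and the function names are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def contains_all_words(text, query):
    """True if every word of the query occurs somewhere in the text."""
    words = set(tokenize(query))
    return bool(words) and words <= set(tokenize(text))

def contains_phrase(text, phrase):
    """True if the words of the phrase occur consecutively, in order, in the text."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return n > 0 and any(words[i:i + n] == target
                         for i in range(len(words) - n + 1))

kyler = "Lucy and I spent the weekend alone together. We have a dog named Kyler."
ask_lucy = "Ask Lucy - How to feed a senior dog"
beach = "Our dog Lucy loves the beach."               # hypothetical story
adopted = "We adopted a dog named Lucy last spring."  # hypothetical story

print(contains_all_words(kyler, "dog named Lucy"))  # True: all three words occur
print(contains_phrase(kyler, "dog named Lucy"))     # False: not as a phrase
print(contains_phrase(beach, "dog named Lucy"))     # False: a relevant story is missed
print(contains_all_words(beach, "dog Lucy"))        # True
print(contains_all_words(ask_lucy, "dog Lucy"))     # True: an unwanted return
```

None of these tests looks at meaning. The shorter query finds the relevant story and the unwanted one alike, which is why the ordering of the returns matters so much.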

A computer cannot search for stories with a given meaning but only for stories with given words. It is human intervention that enables a search for meaning. On one end is the construction of the query by the user and at the other end is the feedback from previous users of the system.
