Posted by Hal Varian, Chief Economist
We often use this space to discuss how we
treat user data and protect privacy. With the post
below, we're beginning an occasional series that discusses how
we harness the data we collect to improve our products and services
for our users. We think it's appropriate to start with a post
describing how data has been critical to the advancement of search
technology. - Ed.
Better data makes for better science. The history of information
retrieval illustrates this principle well.
Work in this area began in the early days of computing, with simple
document retrieval based on matching queries with words and phrases
in text files. Driven by the availability of new data sources,
algorithms evolved and became more sophisticated. The arrival of
the web presented new challenges for search, and now it is common
to use information from web links and many other indicators as
signals of relevance.
Today's web search algorithms are trained to a large degree by
the "wisdom of the crowds" drawn from the logs of
billions of previous search queries. This brief overview of the
history of search illustrates why using data is integral to making
Google web search valuable to our users.
A brief history of search
Nowadays search is a hot topic, especially with the widespread use
of the web, but the history of document search dates back to the
1950s. Search engines existed in those ancient times, but their
primary use was to search a static collection of documents. In the
early 60s, the research community gathered new data by digitizing
abstracts of articles, enabling rapid progress in the field in the
60s and 70s. But by the late 80s, progress in this area had slowed
down considerably.
In order to stimulate research in information retrieval, the
National Institute of Standards and Technology (NIST) launched the
Text Retrieval Conference
(TREC) in 1992. TREC introduced new data in the form of full-text
documents and used human judges to classify whether or not
particular documents were relevant to a set of queries. The
organizers released a sample of this data to researchers, who used
it to train and improve their systems. The researchers then ran
those systems on a new set of queries and compared the results with
TREC's human judgments and with other researchers' algorithms.
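
The evaluation loop described above can be sketched in a few lines. This is only an illustration, assuming a hypothetical set of human relevance judgments and a hypothetical result list from a system under test; it scores the system with precision and recall, two standard information retrieval measures. The document IDs and the function name are invented here, and this is not TREC's actual scoring software.

```python
def precision_recall(retrieved, relevant):
    """Score a system's result list against human relevance judgments.

    precision: share of retrieved documents that the judges marked relevant.
    recall:    share of judged-relevant documents that the system retrieved.
    """
    retrieved_set = set(retrieved)
    hits = retrieved_set & relevant
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for one query (not real TREC data).
judged_relevant = {"doc1", "doc4", "doc7", "doc9"}
# Hypothetical output of a retrieval system for the same query.
system_results = ["doc1", "doc2", "doc4", "doc5"]

precision, recall = precision_recall(system_results, judged_relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because every group scores its system against the same judgments, the numbers are directly comparable from one system to the next.
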
The TREC data revitalized research on information retrieval. Having
a standard, widely available, and carefully constructed set of data
laid the groundwork for further innovation in this field. The
yearly TREC conference fostered collaboration, innovation, and a
measured dose of competition (and bragging rights) that led to
better information retrieval.
New ideas spread rapidly, and the algorithms improved. But with
each new improvement, it became harder and harder to improve on
last year's techniques, and progress eventually slowed down
again.
And then came the web. In its beginning stages, researchers used
industry-standard algorithms based on the TREC research to find
documents on the web. But the need for better search was
apparent, now not just for researchers but also for everyday
users, and the web gave us lots of new data in the form of links
that offered the possibility of new advances.
There were developments on two fronts. On the commercial side, a
few companies started offering web search engines, but no one was
quite sure what business models would work.
On the academic side, the National Science Foundation started a
"Digital Library Project" which made grants to several
universities. Two Stanford grad students in computer science named
Larry Page and Sergey Brin worked on this project. Their insight
was to recognize that existing search algorithms could be
dramatically improved by using the special linking structure of web
documents. Thus PageRank was born.
How Google uses data
PageRank offered a significant improvement on existing algorithms
by ranking the relevance of a web page not by keywords alone but
also by the quality and quantity of the sites that linked to it. If
I have six links pointing to me from sites such as the Wall
Street Journal, New York Times, and the House of
Representatives, that carries more weight than 20 links from my old
college buddies who happen to have web pages.
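
To make the idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is an illustration under stated assumptions, not the ranking code Google runs: the damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the toy link graph (page names such as `major_paper` and `buddy1`) are all choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict mapping every page to its score; scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages so that no score is lost.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: one widely read site links to my_page, while
# three pages that nobody links to all point at other_page.
links = {
    "reader1": ["major_paper"],
    "reader2": ["major_paper"],
    "reader3": ["major_paper"],
    "reader4": ["major_paper"],
    "reader5": ["major_paper"],
    "major_paper": ["my_page"],
    "buddy1": ["other_page"],
    "buddy2": ["other_page"],
    "buddy3": ["other_page"],
}

ranks = pagerank(links)
for page in ("my_page", "other_page"):
    print(page, round(ranks[page], 4))
```

On this toy graph, `my_page` ends up ranked above `other_page` even though it has one inbound link against three, because its single link comes from a page that is itself heavily linked.
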
Larry and Sergey initially tried to license their algorithm to some
of the newly formed web search engines, but none were interested.
Since they couldn't sell their algorithm, they decided to start
a search engine themselves. The rest of the story is
well-known.
Over the years, Google has continued to invest in making search
better. Our information retrieval experts have added more than 200
additional signals to the algorithms that determine the relevance
of websites to a user's query.
So where did those other 200 signals come from? What's the next
stage of search, and what do we need to do to find even more
relevant information online?
We're constantly experimenting with our algorithm, tuning and tweaking on a
weekly basis to come up with more relevant and useful results for
our users.
But to come up with new ranking techniques and evaluate whether
users find them useful, we have to store and analyze search logs.
(Watch our videos at
http://www.youtube.com/view_play_list?p=ECB20E29232BCBBA
to see exactly what data we store in our logs.) What results do people click on? How
does their behavior change when we change aspects of our algorithm?
Using data in the logs, we can compare how well we're doing now
at finding useful information for you to how we did a year ago. If
we don't keep a history, we have no good way to evaluate our
progress and make improvements.
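
As a rough illustration of this kind of comparison, the sketch below computes a click-through rate for two ranking variants from a handful of log records. Everything in it is hypothetical: the record layout (variant name, whether the user clicked a result), the variant names, and the choice of click-through rate as the measure are assumptions for the example, not a description of Google's logs or metrics.

```python
from collections import defaultdict


def click_through_rate(log_records):
    """Per ranking variant, the fraction of logged searches that led to a click."""
    searches = defaultdict(int)
    clicks = defaultdict(int)
    for variant, clicked in log_records:
        searches[variant] += 1
        if clicked:
            clicks[variant] += 1
    return {variant: clicks[variant] / searches[variant] for variant in searches}


# Hypothetical records: (ranking variant shown, did the user click a result?)
log_records = [
    ("current", True), ("current", False), ("current", False), ("current", True),
    ("experiment", True), ("experiment", True), ("experiment", False), ("experiment", True),
]

rates = click_through_rate(log_records)
print(rates)
```

Without stored records like these there would be nothing to compute the rates from, which is the point of the paragraph above.
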
To choose a simple example: the Google spell checker is based on
our analysis of user searches compiled from our logs, not on a
dictionary. Similarly, we've had a lot of success in using
query data to improve our information about geographic locations,
enabling us to provide better local search.
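
One simple way to picture log-driven spelling correction is sketched below. It is a stand-in technique, not Google's actual spell checker: it counts how often each term appears in a small, made-up query log, then suggests the most frequent logged term that is at most one edit away from what the user typed. The log contents and the function names are hypothetical.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, term_counts):
    """Return the most frequent logged term within one edit of word.

    Falls back to the word itself when nothing in the log is close enough.
    """
    candidates = [w for w in edits1(word) | {word} if w in term_counts]
    if not candidates:
        return word
    return max(candidates, key=lambda w: term_counts[w])


# Hypothetical query log; real logs would be vastly larger.
query_log = [
    "weather boston",
    "weather tomorrow",
    "weather radar",
    "wether forecast",
]
term_counts = Counter(term for query in query_log for term in query.split())

print(suggest("wether", term_counts))
```

The misspelling "wether" appears in the log once, but "weather" appears three times and is one edit away, so the more common form wins. No dictionary is consulted at any point; the log itself supplies the vocabulary.
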
Storing and analyzing logs of user searches is how Google's
algorithm learns to give you more useful results. Just as data
availability has driven progress in search in the past, the data in
our search logs will certainly be a critical component of future
breakthroughs.