Tuesday, March 04, 2008
We often use this space to discuss how we treat user data and protect privacy. With the post below, we're beginning an occasional series that discusses how we harness the data we collect to improve our products and services for our users. We think it's appropriate to start with a post describing how data has been critical to the advancement of search technology. - Ed.
Better data makes for better science. The history of information retrieval illustrates this principle well.
Work in this area began in the early days of computing, with simple document retrieval based on matching queries with words and phrases in text files. Driven by the availability of new data sources, algorithms evolved and became more sophisticated. The arrival of the web presented new challenges for search, and now it is common to use information from web links and many other indicators as signals of relevance.
Today's web search algorithms are trained to a large degree by the "wisdom of the crowds" drawn from the logs of billions of previous search queries. This brief overview of the history of search illustrates why using data is integral to making Google web search valuable to our users.
A brief history of search
Nowadays search is a hot topic, especially with the widespread use of the web, but the history of document search dates back to the 1950s. Search engines existed in those ancient times, but their primary use was to search a static collection of documents. In the early 1960s, the research community gathered new data by digitizing abstracts of articles, enabling rapid progress in the field through the '60s and '70s. But by the late '80s, progress in this area had slowed considerably.
In order to stimulate research in information retrieval, the National Institute of Standards and Technology (NIST) launched the Text REtrieval Conference (TREC) in 1992. TREC introduced new data in the form of full-text documents and used human judges to determine whether particular documents were relevant to a set of queries. NIST released a sample of this data to researchers, who used it to train and improve their systems to find the documents relevant to a new set of queries, then compared their results against TREC's human judgments and other researchers' algorithms.
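The comparison against human judgments that TREC introduced boils down to two standard measures, precision and recall. Here's a minimal sketch of that scoring step, with invented document ids (this is an illustration of the evaluation idea, not TREC's actual tooling):

```python
# Score a system's retrieved documents against human relevance judgments.
# Precision: what fraction of what we returned was relevant?
# Recall: what fraction of the relevant documents did we find?
def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Invented example: the system returned four documents; judges marked three
# documents in the collection as relevant, and the system found two of them.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d3", "d7"])
```

Shared judgments like these are what let every participating team measure itself against the same yardstick, year after year.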
The TREC data revitalized research on information retrieval. Having a standard, widely available, and carefully constructed set of data laid the groundwork for further innovation in this field. The yearly TREC conference fostered collaboration, innovation, and a measured dose of competition (and bragging rights) that led to better information retrieval.
New ideas spread rapidly, and the algorithms improved. But with each new improvement, it became harder and harder to improve on last year's techniques, and progress eventually slowed down again.
And then came the web. In its beginning stages, researchers used industry-standard algorithms based on the TREC research to find documents on the web. But the need for better search was apparent -- now not just for researchers, but also for everyday users -- and the web gave us lots of new data in the form of links that offered the possibility of new advances.
There were developments on two fronts. On the commercial side, a few companies started offering web search engines, but no one was quite sure what business models would work.
On the academic side, the National Science Foundation started a "Digital Library Project" which made grants to several universities. Two Stanford grad students in computer science named Larry Page and Sergey Brin worked on this project. Their insight was to recognize that existing search algorithms could be dramatically improved by using the special linking structure of web documents. Thus PageRank was born.
How Google uses data
PageRank offered a significant improvement on existing algorithms by ranking the relevance of a web page not by keywords alone but also by the quality and quantity of the sites that linked to it. If I have six links pointing to me from sites such as the Wall Street Journal, New York Times, and the House of Representatives, that carries more weight than 20 links from my old college buddies who happen to have web pages.
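The intuition above can be sketched as a tiny iterative computation (a simplified illustration of the published PageRank idea, not Google's production algorithm): each page spreads its score evenly across its outgoing links, and a damping factor models a surfer who occasionally jumps to a random page.

```python
# Minimal PageRank sketch: repeatedly redistribute rank along links.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Invented link graph: "hub" is cited by several pages, so it ends up
# outranking a page that receives only a single stray link.
graph = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["a"],
}
ranks = pagerank(graph)
```

Note that rank flows through the graph: a link from a page that is itself heavily linked passes along more weight than a link from an obscure one, which is exactly the Wall Street Journal vs. college buddies distinction.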
Larry and Sergey initially tried to license their algorithm to some of the newly formed web search engines, but none were interested. Since they couldn't sell their algorithm, they decided to start a search engine themselves. The rest of the story is well-known.
Over the years, Google has continued to invest in making search better. Our information retrieval experts have added more than 200 additional signals to the algorithms that determine the relevance of websites to a user's query.
So where did those other 200 signals come from? What's the next stage of search, and what do we need to do to find even more relevant information online?
We're constantly experimenting with our algorithm, tuning and tweaking on a weekly basis to come up with more relevant and useful results for our users.
But in order to come up with new ranking techniques and evaluate whether users find them useful, we have to store and analyze search logs. (Watch our videos to see exactly what data we store in our logs.) What results do people click on? How does their behavior change when we change aspects of our algorithm? Using data in the logs, we can compare how well we're doing now at finding useful information for you with how we did a year ago. If we don't keep a history, we have no good way to evaluate our progress and make improvements.
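In highly simplified form, the kind of log analysis described above might look like this (the log format and the numbers are invented for illustration; real evaluation uses many more signals than click position):

```python
# Compare where users click before and after a ranking change.
# Each log entry is a (query, clicked_result_position) pair; if users are
# clicking results nearer the top, the ranking is likely serving them better.
def mean_click_position(log_entries):
    positions = [pos for _, pos in log_entries]
    return sum(positions) / len(positions)

# Hypothetical samples from before and after an algorithm tweak
# (position 1 = the top result on the page).
before = [("weather", 3), ("maps", 2), ("news", 4)]
after = [("weather", 1), ("maps", 1), ("news", 2)]

improved = mean_click_position(after) < mean_click_position(before)
```

Without the "before" history, there is nothing to compare the "after" against, which is the point the paragraph makes about keeping logs.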
To choose a simple example: the Google spell checker is based on our analysis of user searches compiled from our logs -- not a dictionary. Similarly, we've had a lot of success in using query data to improve our information about geographic locations, enabling us to provide better local search.
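A toy sketch conveys the spirit of that spell-check example (the query counts are invented and this is not Google's actual system): correct a misspelled query to the most frequently searched term within one edit of it, so the query logs themselves -- not a dictionary -- supply the vocabulary.

```python
from collections import Counter

def edits1(word):
    """All strings one insert, delete, substitution, or transposition away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, query_counts):
    """Return the most-searched candidate: the word itself, else one edit away."""
    if word in query_counts:
        return word
    candidates = edits1(word) & query_counts.keys()
    return max(candidates, key=query_counts.get, default=word)

# Hypothetical query-log counts standing in for billions of real searches.
logs = Counter({"britney": 9000, "brittany": 3000, "spears": 8000})
```

With data like this, `correct("britny", logs)` lands on "britney" simply because that's what people actually search for most.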
Storing and analyzing logs of user searches is how Google's algorithm learns to give you more useful results. Just as data availability has driven progress of search in the past, the data in our search logs will certainly be a critical component of future breakthroughs.