Official Blog
Insights from Googlers into our products, technology, and the Google culture
A picture of a thousand words?
October 30, 2008
(Note: Click on the first result in each of the search results pages linked to throughout the post to see this feature in action.)
A scanner is a wonderful tool. Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text itself. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world.
In the past, scanned documents were rarely included in search results because we couldn't be sure of their content. We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform Optical Character Recognition (OCR) on any scanned documents that we find stored in Adobe's PDF format. This technology lets us convert a picture (of a thousand words) into a thousand words -- words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world's information accessible and useful.
While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however -- it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.
To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.
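To make the circle question concrete, here is a minimal sketch of one way a program might use neighboring characters as context. It is an illustration only, not a description of how our system works: the placeholder glyph, the function name `resolve_circle`, and the majority-vote rule are all assumptions made up for this example.

```python
# Hypothetical placeholder for a circle-shaped glyph that a recognizer
# could not classify on shape alone (an assumption for this sketch).
AMBIGUOUS = "\u25cb"


def resolve_circle(token: str) -> str:
    """Guess whether unresolved circles in a token are zeros or letter O's.

    The guess is a simple majority vote over the characters that were
    recognized with confidence; real OCR systems use far richer context.
    """
    known = [ch for ch in token if ch != AMBIGUOUS]
    digits = sum(ch.isdigit() for ch in known)
    letters = sum(ch.isalpha() for ch in known)

    if digits > letters:
        replacement = "0"
    elif letters > digits:
        # Match the case of the surrounding letters.
        all_upper = all(ch.isupper() for ch in known if ch.isalpha())
        replacement = "O" if all_upper else "o"
    else:
        # No usable context (for example a lone circle): leave it unresolved.
        return token

    return token.replace(AMBIGUOUS, replacement)


if __name__ == "__main__":
    for sample in ["2\u25cb\u25cb8", "G\u25cb\u25cbGLE", "\u25cb"]:
        print(sample, "->", resolve_circle(sample))
```

A lone circle with no neighbors stays unresolved, which is exactly the coffee-ring case: without context, there is nothing to vote on.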
To see our new system at work, click on these search queries. Note the document excerpt in the search results, along with the full text presented after the 'View as HTML' link:
[repairing aluminum wiring]
[spin lock performance]
[Mumps and Severe Neutropenia]
[Steady success in a volatile world]
Posted by Evin Levey, Product Manager