Official Blog
Insights from Googlers into our products, technology, and the Google culture
A picture of a thousand words?
October 30, 2008
(Note: Click on the first result in each of the search results pages linked to throughout the post to see this feature in action.)
A scanner is a wonderful tool. Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text itself. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world.
In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform Optical Character Recognition (OCR) on any scanned documents that we find stored in Adobe's PDF format. OCR technology lets us convert a picture (of a thousand words) into a thousand words -- words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world's information accessible and useful.
While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however -- it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.
To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.
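To make the circle example a little more concrete, here is a toy sketch in Python of one way context can help. It is purely illustrative and is not the system described in this post: the `*` marker for an undecided round shape, the function names, and the neighbor-voting rule are all assumptions made up for this example.

```python
AMBIGUOUS = "*"  # hypothetical marker for a round shape the recognizer could not classify


def resolve_round_glyph(left: str, right: str) -> str:
    """Guess whether a round glyph is the digit '0' or the letter 'O'.

    left and right are the already-recognized neighboring characters
    (an empty string when there is no neighbor on that side).
    """
    neighbors = [c for c in (left, right) if c]
    digit_votes = sum(c.isdigit() for c in neighbors)
    letter_votes = sum(c.isalpha() for c in neighbors)
    if digit_votes > letter_votes:
        return "0"
    if letter_votes > digit_votes:
        return "O"
    return "?"  # no usable context, or a tie: leave it undecided


def resolve_text(text: str) -> str:
    """Replace every ambiguous marker in text using its immediate neighbors."""
    out = []
    for i, ch in enumerate(text):
        if ch != AMBIGUOUS:
            out.append(ch)
            continue
        left = text[i - 1] if i > 0 else ""
        right = text[i + 1] if i + 1 < len(text) else ""
        out.append(resolve_round_glyph(left, right))
    return "".join(out)


print(resolve_text("2*08"))    # 2008
print(resolve_text("C*FFEE"))  # COFFEE
```

Even this tiny rule gives up when the neighbors disagree or are missing, which hints at why the real problem is painstaking: a stray coffee ring has no helpful neighbors at all.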
To see our new system at work, click on these search queries. Note the document excerpt in the search results, along with the full text presented after the 'View as HTML' link:
[repairing aluminum wiring]
[spin lock performance]
[Mumps and Severe Neutropenia]
[Steady success in a volatile world]
Posted by Evin Levey, Product Manager