Official Blog
Insights from Googlers into our products, technology, and the Google culture
Unicode nearing 50% of the web
January 28, 2010
About 18 months ago, we published a graph showing that Unicode had just overtaken every other text encoding on the web. The growth since then has been even more dramatic.
Web pages can use a variety of different character encodings, such as ASCII, Latin-1, Windows-1252, or Unicode. Most encodings can only represent a few languages, but Unicode can represent thousands: from Arabic to Chinese to Zulu. We have long used Unicode as the internal format for all the text we search: any other encoding is first converted to Unicode for processing.
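To make the "convert everything to Unicode first" step concrete, here is a minimal Python sketch. It is only an illustration, not a description of Google's pipeline: the helper name `to_unicode` and the sample byte strings are hypothetical, and it assumes each page's encoding is already known, whereas real pages often need the encoding to be detected.

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode raw page bytes into a Unicode string (hypothetical helper)."""
    return raw.decode(declared_encoding)


# The same kind of text stored under four different encodings.
samples = {
    "ascii": b"cafe",
    "latin-1": b"caf\xe9",
    "cp1252": b"caf\xe9 \x80",              # 0x80 is the euro sign in Windows-1252
    "utf-8": b"caf\xc3\xa9 \xe2\x82\xac",   # the same text encoded as UTF-8
}

# After decoding, every page is an ordinary Unicode string, so later
# processing no longer needs to know which encoding the page used.
decoded = {name: to_unicode(raw, name) for name, raw in samples.items()}
```

Once decoded, the Windows-1252 and UTF-8 samples compare equal even though their bytes differ, which is what makes a single internal format convenient.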
This graph is from Google internal data, based on our indexing of web pages, and thus may vary somewhat from what other search engines find. However, the trends are pretty clear, and the continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover.
Searching for "nancials"?
Unicode is growing both in usage and in character coverage. We recently upgraded to the latest version of Unicode, version 5.2 (via ICU and CLDR). This adds over 6,600 new characters: some of mostly academic interest, such as Egyptian Hieroglyphs, but many others for living languages.
We're constantly improving our handling of existing characters. For example, the characters "fi" can be represented either as two characters ("f" and "i") or as a single special display form, the ligature "ﬁ" (U+FB01). A Google search for [financials] or [office] previously did not treat these as equivalent: to the software, the ligature versions just looked like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, especially generated PDF documents.
But no longer. After extensive testing, we recently turned on support for these and thousands of other characters, so your searches will now also find these documents. These are further steps in our mission to organize the world's information and make it universally accessible and useful.
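One common way to treat the "fi" ligature and the two-letter sequence as equivalent is Unicode compatibility normalization (NFKC). The sketch below uses Python's standard `unicodedata` module. It is an assumption for illustration only: the post does not say which mechanism Google's search uses, and the helper name `fold_compat` is hypothetical.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Map compatibility characters such as ligatures to their plain
    equivalents using NFKC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


LIG_FI = "\ufb01"  # LATIN SMALL LIGATURE FI, a single code point

# Words as they might appear in text extracted from a generated PDF.
pdf_financials = LIG_FI + "nancials"
pdf_office = "of" + LIG_FI + "ce"

# Without normalization the strings differ from what a user types.
raw_match = pdf_financials == "financials"

# After normalization the document text and the query compare equal.
folded_financials = fold_compat(pdf_financials)
folded_office = fold_compat(pdf_office)
```

Applying the same normalization to both the indexed text and the query is what lets a search for [financials] match either form.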
And we're angling for a party when Unicode hits 50%!
Posted by Mark Davis, Senior International Software Architect