Official Blog
Insights from Googlers into our products, technology, and the Google culture
Unicode over 60 percent of the web
February 3, 2012
Computers store every piece of text using a “character encoding,” which gives a number to each character. For example, the byte 61 (in hexadecimal) stands for ‘a’ and 62 stands for ‘b’ in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, the byte A1 could mean any of ¡, Ё, Ą, Ħ, ‘, ”, or part of any of thousands of multi-byte characters, from æ to 品. If you brought a file from one computer to another, it could come out as gobbledygook.
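To make the ambiguity concrete, here is a minimal Python sketch. It assumes the ISO 8859 family as the set of legacy encodings (the post does not name which ones it has in mind); in that family the single byte 0xA1 decodes to each of the six characters listed above. The helper name `decode_everywhere` is made up for this illustration.

```python
# In ASCII, the bytes 0x61 and 0x62 are 'a' and 'b'.
ascii_text = bytes([0x61, 0x62]).decode("ascii")

# Assumption: these six ISO 8859 parts stand in for the "hundreds of
# different encodings" mentioned in the text.
LEGACY_ENCODINGS = [
    "iso8859_1",   # Western European
    "iso8859_5",   # Cyrillic
    "iso8859_2",   # Central European
    "iso8859_3",   # South European
    "iso8859_7",   # Greek
    "iso8859_13",  # Baltic
]


def decode_everywhere(raw, encodings):
    """Decode the same bytes under each encoding (hypothetical helper)."""
    return {name: raw.decode(name) for name in encodings}


# One byte, six different readings.
readings = decode_everywhere(bytes([0xA1]), LEGACY_ENCODINGS)
for name, char in readings.items():
    print(f"{name}: {char} (U+{ord(char):04X})")
```

The byte never changes; only the table used to interpret it does, which is why a file moved between systems could turn into gobbledygook.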
Unicode was invented to solve that problem: to encode all human languages, from Chinese (中文) to Russian (русский) to Arabic (العربية), and even emoji symbols; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn’t even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted. Little did anyone realize at the time that the two would be so important for each other. Today, people can easily share documents on the web, no matter what their language.
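As a rough illustration of "one number per character, whatever the language," the following Python sketch lists the Unicode code point and UTF-8 bytes for the first character of each sample word from the paragraph above. The function name `describe` is invented for this example, and the figure of 1,114,112 is the size of the Unicode code space (U+0000 through U+10FFFF), not the number of characters actually assigned.

```python
import sys

# Sample words taken from the text: Chinese, Russian, Arabic.
SAMPLES = ["中文", "русский", "العربية"]


def describe(text):
    """Return (character, code point, UTF-8 bytes as hex) per character."""
    return [(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex()) for ch in text]


# "Room for over a million characters": code points run from 0 to 0x10FFFF.
CODE_SPACE = sys.maxunicode + 1

for word in SAMPLES:
    char, code_point, utf8_hex = describe(word)[0]
    print(f"{word}: first character {char} is {code_point}, UTF-8 bytes {utf8_hex}")
print(f"Code space size: {CODE_SPACE:,}")
```

Every script draws from the same numbering, so a single decoder handles all three words.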
Every January, we look at the percentage of the webpages in our index that are in different encodings. Here’s what our data looks like with the latest figures*:
*Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data.
As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006. Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). The more documents that are in Unicode, the less likely you are to see mangled characters (what the Japanese call mojibake) when you’re surfing the web.
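Two points in this paragraph can be sketched in a few lines of Python: why pure-ASCII pages can be counted with UTF-8, and how mojibake arises. The sample strings are made up for illustration, and Latin-1 is only one assumed example of a wrong guess at the encoding.

```python
# 1. ASCII is a subset of UTF-8: pure-ASCII text is byte-for-byte identical
#    in both, so an ASCII page is also a valid UTF-8 page.
plain = "plain ASCII text"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# 2. Mojibake: UTF-8 bytes read with the wrong encoding (here Latin-1).
original = "é"
utf8_bytes = original.encode("utf-8")   # two bytes: 0xC3 0xA9
garbled = utf8_bytes.decode("latin-1")  # each byte becomes its own character

# The damage is reversible only if the wrong guess is known.
repaired = garbled.encode("latin-1").decode("utf-8")

print(same_bytes, garbled, repaired)
```

When a page is stored and labeled as UTF-8, the decoding step has only one possible answer, so this kind of mismatch does not occur.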
We’ve long used Unicode as the internal format for all the text Google searches and processes: any other encoding is first converted to Unicode. Version 6.1 was just released with over 110,000 characters; soon we’ll be updating to that version and to Unicode’s locale data from CLDR 21 (both via ICU). The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index would be nearly impossible. It would be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.
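The "convert everything to Unicode first" idea can be sketched as follows. This is not Google's pipeline, which the post does not describe; it is a minimal Python illustration in which the function `to_unicode` and the three sample "pages" are hypothetical, and the encoding of each page is assumed to be known already.

```python
def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes into a Unicode string (hypothetical helper).

    Undecodable bytes become U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


word = "русский"

# The same word as it might be stored on three differently encoded pages.
pages = {
    "koi8_r": word.encode("koi8_r"),
    "cp1251": word.encode("cp1251"),
    "utf_8": word.encode("utf_8"),
}

# The raw bytes all differ, so a byte-level index would treat them as
# three unrelated strings.
distinct_byte_forms = len(set(pages.values()))

# After conversion there is exactly one form to index and search.
normalized = {to_unicode(raw, encoding) for encoding, raw in pages.items()}

print(distinct_byte_forms, normalized)
```

Once every page is in one internal format, matching a query against text no longer depends on how the page was originally stored.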
Posted by Mark Davis, International Software Architect