Official Blog
Insights from Googlers into our products, technology, and the Google culture
Our international approach to search
November 21, 2008
In previous posts in
this series
, you have read about the challenges of building a world-class search engine. Our goal is to make Google’s search relevant to all people, regardless of their language or country. As my colleague Amit Singhal
described
, we use statistical data as the basis for making sweeping algorithmic changes. Many of these changes can be rolled out across all languages we support, but in some cases the unique characteristics of a language require language-specific algorithmic considerations and tuning. And to make things really interesting, there are cases where the same language differs across countries. Obvious examples are "color" in the U.S. vs. "colour" in the U.K., or "camião" in Portugal vs. "caminhão" in Brazil.
My name is Daphne Dembo, and my focus is improving Google's international search. This is a tough challenge, since Google search is used in many countries and languages of which our engineers have little personal knowledge. Initially, the international search improvements were made by Search Quality engineers who were passionate about their languages and countries: Lina from Sweden improved our parsing of compound words in German and Swedish; Dimitra from Greece introduced diacritical support; Ishai from Israel worked on transliteration corrections for Hebrew and Arabic; Trystan from Australia created methods for identifying local search results and ranking them together with foreign ones from the same language; Alex, a bilingual Ukrainian and Russian speaker, introduced morphological understanding of these languages. As the importance of our international search grew, we solicited help from Googlers in all our offices. Finally, we now draw on an international network of search specialists who help us understand search within the unique combination of their language and country.
Our first step in providing search support for a language is to train our language model on a large collection of documents in that language. This ensures that our language model is more precise and comprehensive — for example, it incorporates names, idioms, colloquial usage, and newly coined words not often found in static dictionaries. For instance, we recently started identifying Swahili, and used pages such as this one for the
Parliament of Tanzania
to train our system with the language's nuances. Having a trained language model helps to categorize documents during crawling and indexing of the web and to parse the user's query. Once this stage was complete, we launched Swahili search in countries such as
Tanzania
and
Kenya
, enabling local searches for the "Dar es Salaam stock exchange" [
Soko la hisa dar es salaam
], and "cure for Malaria" [
Tiba ya malaria
]. (As always, we are using square brackets to denote a search query. For example, you can search for "soccer" in Hamburg, Germany by clicking on [
fußball in hamburg
]).
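To make the idea of training on sample documents concrete, here is a minimal sketch of language identification with character trigram counts. It is an illustration under stated assumptions, not a description of Google's system: the function names (`trigrams`, `train`, `identify`), the scoring rule, and the two one-sentence training samples are all hypothetical, and a real model would be trained on a very large collection of pages.

```python
from collections import Counter


def trigrams(text: str) -> list[str]:
    """Return the overlapping character trigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(corpus: dict[str, str]) -> dict[str, Counter]:
    """Build one trigram-count profile per language from sample text."""
    return {lang: Counter(trigrams(sample)) for lang, sample in corpus.items()}


def identify(text: str, profiles: dict[str, Counter]) -> str:
    """Pick the language whose profile has seen the text's trigrams most often."""
    scores = {
        lang: sum(profile[gram] for gram in trigrams(text))
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Toy training samples (hypothetical); real training data would be far larger.
TOY_CORPUS = {
    "sw": "bunge la tanzania ni chombo cha kutunga sheria",
    "en": "the parliament of tanzania is the body that makes laws",
}
PROFILES = train(TOY_CORPUS)

print(identify("sheria za bunge", PROFILES))
print(identify("the body of laws", PROFILES))
```

The same profiles could in principle be applied to both documents and queries, which is the dual use the paragraph above describes.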
We learn some things from our users, so as people start using our search engine, we can improve the way we rank in that language. Here are a few examples:
Spell corrections
: We recently launched spell corrections in Estonian. If your Estonian is rusty, and you don't remember how to spell "smoke detector," we can suggest a spell correction for [
suitsuantur
], leading to
better
search results.
Diacritical marks
:
Many languages have diacritical marks, which alter pronunciation. Our algorithms are built to support them, and even help users who mistype or completely ignore them. For example, if you're a resident of Quebec, Canada and would like to know the weather forecast in Quebec City, we'll serve good results whether you type with diacritical marks [
Météo à Québec
] or without [
meteo quebec
]. Czech users can read the same excellent results for a popular kids' cartoon by searching for [
krtecek
] and [
krteček
]. On the other hand, sometimes diacriticals change the meaning of the word and we have to use them correctly. For example, in Thai, [
ข้าว
] is "rice," with completely different results than [
ข่าว
], which is "news"; or in Slovak, results for "child" [
dieťa
] are different from results for "diet" [
diéta
].
Synonyms
:
A generalization of diacritical support is the handling of synonyms in different languages. Korean searches showed that "samsung" can be viewed as a synonym of "삼성", so that when users search for [
samsung
], they find results which have the company's name in Korean.
Compounding
:
Some languages allow compounding, which is the formation of new words by combining existing words. You can see a nice example in Swedish, where we return documents about a Swedish credit card for both compounded [
Visakort
] and non-compounded [
visa kort
] queries.
Stemming
:
Google has developed morphological models that can receive compound words as queries, and return pages which contain their stem, possibly as part of a different compound. For example, when searching for cars in Saudi Arabia, you can search for [
سيارة
] and [
سيارات
] because both are variants of the same stem, and both return many common results. A Polish user can search for "movie" [
film
], and get back results that contain other variants of the stem, such as "filmów," "filmu," "filmie," "filmy." A user from
Belarus
will find results for all word forms of the capital, Minsk [
Мінск
]: "Мінску," "Мінска," "Мінскага."
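The diacritics bullet above can be sketched in a few lines. This is a simplified illustration, not Google's algorithm: it folds accents only for languages listed in a hypothetical `FOLDABLE_LANGUAGES` set, so that French and Czech queries match with or without accents while Thai tone marks, which change the meaning, are left alone. The function names and the choice of languages are assumptions made for the example.

```python
import unicodedata

# Hypothetical set of language codes where dropping diacritics is assumed safe.
FOLDABLE_LANGUAGES = {"fr", "cs"}


def fold_diacritics(text: str) -> str:
    """Decompose characters, drop nonspacing marks, and recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def match_key(query: str, lang: str) -> str:
    """Build a comparison key; fold diacritics only where the language allows it."""
    key = query.casefold()
    if lang in FOLDABLE_LANGUAGES:
        key = fold_diacritics(key)
    return key


print(match_key("Météo à Québec", "fr"))
print(match_key("krteček", "cs") == match_key("krtecek", "cs"))
print(match_key("ข้าว", "th") == match_key("ข่าว", "th"))
```

Folding blindly would collapse the Thai words for "rice" and "news" into the same string, and likewise the Slovak "dieťa" and "diéta", which is why the sketch makes folding a per-language decision rather than a global one.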
In addition to these semantic factors, Google does even more to parse documents and queries. Understanding the details of language usage in a country is important. Notation of acronyms is different across languages: In Hebrew it is double quotes before the last (left-most) character, as in "prime minister" [
רה"מ
]; in Thai — a dot at the end of the word, as in police station [
สน.
]; while in the U.S. — dots after each character, as in [
I.B.M.
]. Chinese users quote works of art with a "《", as in: [
《手机》剧情
], and denote dates with a "日", as in: [
2006年1月13日
].
Beyond the linguistic elements of a language, we consider how people enter a query. For example, some languages that do not use Latin script require keyboards with dual alphanumeric keys. The user can switch between language input modes by typing special keystrokes. If the user forgets to type this sequence, the query ends up as gibberish. You can see correct handling of these mistakes in Arabic ([
hgsuv
] corrected to [
السعر
]) and ([
حقثسهيثىفهشم ثممثؤفهخىس
] corrected to [
presidential elections
]), Hebrew ([
vdrk, kuyu
] corrected to [
הגרלת לוטו
]), and Cyrillic ([
rehc ljkffhf
] corrected to [
курс доллара
]).
Another way of avoiding the inconvenience of switching keyboard modes is by typing the phonetic sounds of the query in Latin characters. Recreating the correct query in the target language isn't trivial, since there might be many possibilities. We can see several such examples in which we suggest the same query in the intended language for Russian ([
biskvitnyi rulet
] to [
бисквитный рулет
]), "movies" in Chinese ([
dianying
] to [
电影
]), and "Bank of Attica" in Greek [
trapeza attikhs
] returns good results for "Τράπεζα Αττικής". Users of eight Indic languages (such as Hindi, Gujarati, Telugu) can type the phonetic sound of the query, and then choose the intended words in that language's script.
Ease of typing and reading is also influenced by the language used. Since every Chinese word requires several keystrokes on a standard keyboard, we provide
category browsing by Images
and
related searches
so that people don't need to type as much. Similarly, we are now launching Google Suggest, or
real-time completion of queries
, in many languages.
So far I have described how we improve the quality of search in a language. However, the location of the user also has a strong effect, even if it is only approximated to the country, since in many cases local content is more relevant than global information. For example, searching for Spanish Yellow Pages [Páginas Amarillas] will result in several documents of global interest and several local results in
Peru
,
Mexico
, and
Spain
. Similarly, searching for [Côte d'Or] in
France
will return results for that region, whereas searches in
Belgium
will return results about the chocolate maker.
Note that the display of information should conform to the standards in that country, so we display "," as the decimal separator for Croatian users who want to know how many millimeters are in an inch [
inč u milimetrima
], or for Italian users who are interested in currency exchange rates [
50 euro in dollari
]. Similarly, temperatures in Norway [
Været i Oslo
] will be displayed in Celsius, while in the U.S. — in Fahrenheit [
weather Boston
].
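The display conventions above amount to a lookup keyed on the user's locale. The sketch below is an assumption-laden illustration: the `DECIMAL_SEPARATOR` and `FAHRENHEIT_COUNTRIES` tables and both function names are hypothetical, and they cover only the cases mentioned in this post.

```python
# Hypothetical lookup tables covering only the examples in this post.
DECIMAL_SEPARATOR = {"hr": ",", "it": ",", "en-US": "."}
FAHRENHEIT_COUNTRIES = {"US"}


def format_decimal(value: float, locale_code: str, digits: int = 1) -> str:
    """Format a number with the decimal separator expected in the locale."""
    text = f"{value:.{digits}f}"
    return text.replace(".", DECIMAL_SEPARATOR.get(locale_code, "."))


def format_temperature(celsius: float, country: str) -> str:
    """Show Fahrenheit where it is the local convention, Celsius elsewhere."""
    if country in FAHRENHEIT_COUNTRIES:
        return f"{round(celsius * 9 / 5 + 32)}°F"
    return f"{round(celsius)}°C"


print(format_decimal(25.4, "hr"))       # millimeters in an inch, Croatian style
print(format_decimal(25.4, "en-US"))
print(format_temperature(20, "NO"))
print(format_temperature(20, "US"))
```

Keeping the conventions in data rather than in code makes it easy to add a country without touching the formatting logic.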
If all else fails, we provide cross-language translations based upon Google's translation technology described in this
blog post
. We will translate your query to English, search English documents on the web, and translate the returned results from English back into the original query language. For example, Japanese users who are interested in viewing Halloween illustrations (Halloween is a holiday which originated in Ireland) can search for [
ハロウィン イラスト
]. You can then request a Japanese translation of the English pages (at the bottom of the page), which will bring up the translated page. Similarly, Korean users can search for the latest on Harry Potter [
해리 포터
], and Arabic readers can search for the opening of the Sydney Opera House [
افتتاح دار الاوبرا في سيدني
].
All in all, Google Search is being actively developed for more than 100 languages, in 150+ countries, with dozens of improvements launched each month. So far I've covered the basics of how international search works, but this is just the surface of all the international work we do. There are many other interesting topics that impact international markets like usability, homepage and results page layout, and connectivity. An understanding of real cultural and human factors is essential to creating a search engine that resonates with the people who use it.
(Update:
Replaced example in the 4th bullet point.)
Posted by Daphne Dembo, Engineering Director