Official Blog
Insights from Googlers into our products, technology, and the Google culture
The Robots Exclusion Protocol
February 22, 2007
Posted by Dan Crow, Product Manager
This is the second in a short series of posts about the Robots Exclusion Protocol, the standard for controlling how web pages on your site are indexed. This post provides more details and examples of the mechanisms for controlling access to, and indexing of, your website by Google.
In the first post in this series, I introduced robots.txt and robots META tags, giving an overview of when to use them. In this post, I'll look at some examples of the power of the protocol. These examples illustrate the detailed, fine-grained control online publishers have over how their websites are indexed.
Preventing Googlebot from following a link
Usually when the Googlebot finds a page, it reads all the links on that page and then fetches those pages and indexes them. This is the basic process by which Googlebot "crawls" the web. This is useful as it allows Google to include all the pages on your site, as long as they are linked together. Let's say you run the TheHighsteadPost.com website. Here's a map of part of the site:
When Googlebot crawls the index.html file, it finds the links to breakingnews.html and articles.html. From breakingnews.html, it can find valentinesday.html and promnight.html, and so on.
What if you didn't want valentinesday.html and promnight.html appearing in Google's index? The articles in the Breaking News section may only appear for a few hours before being updated and moved to the Articles section. In this case you want the full articles indexed, not the breaking news versions. You could put the NOINDEX tag on both those pages. But if the set of pages in the Breaking News section changed frequently, it would be a lot of work to continually update the pages with the NOINDEX tag and then remove it again when they moved into the Articles section. Instead, you can add the NOFOLLOW tag to the breakingnews.html page. This tells the Googlebot not to follow any links it finds on that page, thus hiding valentinesday.html, promnight.html, and any other pages linked from there. Simply add this line to the <HEAD> section of breakingnews.html:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
However, there is an important caveat to NOFOLLOW that you should know about. It only stops Google from following links from one page to another. If one of the linked pages is also linked from somewhere else, Google can still find and index that page via that other link. For example, if promnight.html is also linked from HighsteadCourier.com, Google can still find and index promnight.html when it indexes HighsteadCourier.com and follows the link from there to promnight.html.
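The crawl-and-follow behavior described above can be sketched as a simple breadth-first traversal. This is an illustrative model of the article's example site, not Googlebot's actual implementation; the in-memory site map and NOFOLLOW flag are hypothetical stand-ins:

```python
from collections import deque

# Hypothetical site map from the example: page -> (outgoing links, NOFOLLOW?)
SITE = {
    "index.html":         (["breakingnews.html", "articles.html"], False),
    "breakingnews.html":  (["valentinesday.html", "promnight.html"], True),  # NOFOLLOW set
    "articles.html":      ([], False),
    "valentinesday.html": ([], False),
    "promnight.html":     ([], False),
}

def crawl(start, site):
    """Breadth-first crawl that discovers pages by following links,
    but follows no links found on a NOFOLLOW page."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        links, nofollow = site[page]
        if nofollow:
            continue  # the page itself is still fetched, but its links are ignored
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(sorted(crawl("index.html", SITE)))
# ['articles.html', 'breakingnews.html', 'index.html']
```

Note that valentinesday.html and promnight.html are never discovered; adding a link to promnight.html from any other crawled page would reintroduce it, which is exactly the caveat described above.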
Using NOFOLLOW is generally not the best method to ensure content does not appear in our search results. Using the NOINDEX tag on individual pages or controlling access using robots.txt is the best way to achieve this.
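A robots.txt rule for the example pages can be checked with Python's standard urllib.robotparser module; the rules and paths below are assumptions based on the hypothetical site, shown only to illustrate how Disallow lines are interpreted:

```python
import urllib.robotparser

# Hypothetical robots.txt blocking the two Breaking News pages for all crawlers
rules = [
    "User-agent: *",
    "Disallow: /promnight.html",
    "Disallow: /valentinesday.html",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # parse() accepts the file's lines

print(rp.can_fetch("Googlebot", "/promnight.html"))  # False: matched by a Disallow line
print(rp.can_fetch("Googlebot", "/articles.html"))   # True: no rule matches
```

Unlike NOFOLLOW, this blocks access to the pages themselves, no matter where they are linked from.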
Controlling Caching and Snippets
The Robots Exclusion Protocol allows you to specify, to some extent, how you would like your web pages to appear in Google's search results. Usually search results show a cached page link and a snippet, two features that our users tell us are very useful. Here, for example, is the first result I got when I searched for "Mallard duck":
The snippet is the extract of text from the web page; in this case it starts "The mallard duck is found mostly in North America...". We know from user studies that users are more likely to visit your site if the search results show the snippet. Why? Because snippets make it much easier for users to see why the result is relevant to their query. If a user isn't able to make this determination quickly, he or she usually moves on to the next search result.
Underneath the snippet is the URL of the page followed by the "cached" link. Clicking on this link takes you to a copy of the page stored on Google's servers. This is useful in a number of cases: for sites that are temporarily unavailable; for news sites that get overloaded in the aftermath of a major event, for example, 9/11; for sites that are accidentally deleted. Another advantage is that Google's cached copy highlights the words a person searched for, allowing them to quickly see how the page is relevant to their query.
Usually you want Google to display both the snippet and the cached link. However, there are some cases where you might want to disable one or both of these. For example, say you're a newspaper publisher and you have a page whose content changes several times a day. It may take longer than a day for us to reindex a page, so users may have access to a cached copy that is not the same as the page currently on your site. In this case, you probably don't want the cached link appearing in our results.
Again, the Robots Exclusion Protocol comes to your aid. Add the NOARCHIVE tag to a web page and Google won't show a cached copy of that page in its search results:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
Similarly, you can tell Google not to display a snippet for a page. The NOSNIPPET tag achieves this:
<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">
Adding NOSNIPPET also has the effect of preventing a cached link from being shown, so if you specify NOSNIPPET you automatically get NOARCHIVE too.
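How a crawler might read these directives can be sketched with Python's standard html.parser; this is an illustrative reader, not Googlebot's implementation, and the implied-NOARCHIVE step simply encodes the rule stated above:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from META tags whose NAME is ROBOTS or GOOGLEBOT."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":  # html.parser lowercases tag and attribute names
            return
        a = {name: (value or "") for name, value in attrs}
        if a.get("name", "").lower() in ("robots", "googlebot"):
            # CONTENT may hold several comma-separated directives
            for directive in a.get("content", "").split(","):
                self.directives.add(directive.strip().lower())

page = '<html><head><META NAME="GOOGLEBOT" CONTENT="NOSNIPPET"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)

# NOSNIPPET implies NOARCHIVE, per the rule above
if "nosnippet" in parser.directives:
    parser.directives.add("noarchive")

print(sorted(parser.directives))  # ['noarchive', 'nosnippet']
```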
Learn more
As usual, the Google Webmaster Help pages have a lot of useful information:
More on Googlebot and robots.txt
Our robots.txt analysis tool
Next time...
The final post in this series will take some common exclusion problems that webmasters have told us about and show how to solve them using the Robots Exclusion Protocol.