Crawling control
The crawling of your website can be controlled with a robots.txt file. A robots.txt file contains instructions that specify which content of the website robots are or are not allowed to access and download. All robots that follow this standard read this file first when visiting your website and adjust their behavior according to the directives in the file. You can find a detailed description of its syntax on the official website of the standard.
Using the robots.txt standard you can stop all crawling performed by SeznamBot on your website, or only stop the downloading of specific pages. It typically takes several days for our crawler to recheck the restrictions in the robots.txt file and, if needed, update the index; for sites that are not visited often, it can take up to several weeks. If you only want to stop the indexing of a page but still allow SeznamBot to download and explore it, see Indexing Control. In that case you should also allow SeznamBot to download the page in the robots.txt file so that it is able to read the restrictions in the HTML code.
If you want to keep SeznamBot from accessing your site altogether, use the following directives in your robots.txt file:
User-agent: SeznamBot
Disallow: /
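If you want to check how such a file will be read, Python's standard urllib.robotparser module can parse the same rules. The following sketch is only an illustration with a hypothetical URL; it does not describe how SeznamBot itself processes the file.

from urllib import robotparser

# The directives above: SeznamBot is blocked from the whole site.
RULES = """User-agent: SeznamBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("SeznamBot", "https://www.example.com/any/page"))  # False
print(rp.can_fetch("OtherBot", "https://www.example.com/any/page"))   # True (no rules for other robots)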
Nonstandard Extensions of robots.txt Syntax Recognized By SeznamBot
On top of the official version 1.0 standard, SeznamBot recognizes additional directives as well as most parts of the proposed robots.txt standard version 2.0 extension (though the latter is being deprecated). These extensions are described in the sections below.
Allow Directive
The syntax of the Allow directive is the same as that of the standard Disallow directive, except for the name. The directive explicitly allows robots to access the given URL(s). This is useful when you want to instruct robots to avoid an entire directory but still want some HTML documents in that directory to be crawled and indexed.
Examples
User-agent: *
Disallow:

User-agent: *
Allow:

All robots can access and download all pages of the web. An empty value following the Disallow or Allow directive means that the directive doesn't apply at all. This is the default behavior (an empty or nonexistent robots.txt file has the same effect).

User-agent: *
Disallow: /

No robot can download any page.

User-agent: *
Disallow: /archive/
Disallow: /abc

No robot can enter the /archive/ directory of the website. Furthermore, no robot can download any page with a name starting with "abc".

User-agent: *
Disallow: /
Allow: /A/
Disallow: /A/B/

All robots can download files only from the /A/ directory and its subdirectories, except for the subdirectory B/. The order of the directives is not important.

User-agent: SeznamBot
Disallow: /

SeznamBot can't download anything from the website. Other robots are allowed by default.

User-agent: SeznamBot
Disallow: /discussion/

SeznamBot can't download anything from the /discussion/ directory. Other robots are allowed by default.
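The remark that the order of the directives is not important suggests that the most specific matching rule decides access. The Python sketch below illustrates this with a longest-matching-prefix rule applied to the /A/ example above; the matching strategy is an assumption made for illustration, not a description of SeznamBot's algorithm.

def longest_match_allowed(path, rules):
    """Decide access by the most specific (longest) matching prefix.

    rules is a list of ("Allow" or "Disallow", prefix) pairs whose order
    does not matter. Illustrative only, not SeznamBot's implementation.
    """
    best_kind, best_prefix = "Allow", ""  # no matching rule: access is allowed
    for kind, prefix in rules:
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_kind, best_prefix = kind, prefix
    return best_kind == "Allow"

rules = [("Disallow", "/"), ("Allow", "/A/"), ("Disallow", "/A/B/")]
print(longest_match_allowed("/A/page.html", rules))    # True
print(longest_match_allowed("/A/B/page.html", rules))  # False
print(longest_match_allowed("/C/page.html", rules))    # False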
Wildcards
You can use the following wildcards in a robots.txt file:
* | any number of any characters (an arbitrary string); can be used multiple times in a directive
$ | the end of the address string
Examples
User-agent: SeznamBot
Disallow: /*.pdf$

Disallows downloading of all files whose addresses end with ".pdf" (regardless of the characters preceding it).
User-agent: SeznamBot
Disallow: /discussion/*/$

Disallows downloading of the default document in any of the /discussion/ subdirectories while still allowing downloading of all other files in those subdirectories.
User-agent: SeznamBot
Disallow: /discussion$

Disallows downloading of /discussion while still allowing /discussion-01, /discussion-02, etc.
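The * and $ wildcards form a simple pattern language. The Python sketch below translates such a pattern into a regular expression and checks it against the examples above; it is only an approximation for illustration, not SeznamBot's implementation.

import re

def rule_to_regex(pattern):
    """Translate a robots.txt URL pattern with * and $ into a regex.

    * matches any sequence of characters and $ anchors the end of the
    address; everything else is taken literally. As in version 1.0,
    the pattern is matched from the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + ("$" if anchored else ""))

pdf_rule = rule_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/docs/manual.pdf")))      # True  -> disallowed
print(bool(pdf_rule.match("/docs/manual.pdf?x=1")))  # False -> allowed

exact_rule = rule_to_regex("/discussion$")
print(bool(exact_rule.match("/discussion")))         # True  -> disallowed
print(bool(exact_rule.match("/discussion-01")))      # False -> allowed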
Request-rate Directive
The Request-rate directive tells robots how many documents they may download from a website during a given time period. SeznamBot fully respects this directive, which enables you to set the download rate so that your servers are not overloaded or even crashed. On the other hand, if you want your files to be processed by SeznamBot at a faster rate, you can set the Request-rate to a higher value.
The request rate directive syntax is: Request-rate: <number of documents>/<time>
You can also specify a time period of the day during which the robot will observe the rate set by the directive. During the rest of the day, it will return to its regular behavior.
The general syntax in this case is: Request-rate: <rate> <time of day>
Examples
Request-rate: 1/10s
Robots are allowed to download one document every ten seconds.

Request-rate: 100/15m
100 documents every 15 minutes.

Request-rate: 400/1h
400 documents every hour.

Request-rate: 9000/1d
9000 documents every day.

Request-rate: 1/10s 1800-1900
Robots are allowed to download one document every ten seconds between 18:00 and 19:00 (UTC). At other times there is no limit on the download rate.
CAUTION
The minimum download rate for SeznamBot is 1 document every 10 seconds. If you specify a slower rate, SeznamBot will interpret it as this minimum rate. The maximum rate is limited only by the current speed of SeznamBot.
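A Request-rate value can also be parsed mechanically. The Python sketch below splits a value into the number of documents, the period and the optional time-of-day window, and clamps anything slower than the 1 document per 10 seconds minimum described above; it is an illustrative parser, not SeznamBot's implementation.

import re
from datetime import timedelta

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}  # seconds per time unit

def parse_request_rate(value):
    """Parse '<documents>/<time>[ <from>-<to>]' Request-rate values."""
    m = re.fullmatch(r"(\d+)/(\d+)([smhd])(?:\s+(\d{4})-(\d{4}))?", value.strip())
    if m is None:
        raise ValueError("unsupported Request-rate value: %r" % value)
    docs, amount, unit, start, end = m.groups()
    docs = int(docs)
    period = timedelta(seconds=int(amount) * UNITS[unit])
    # Rates slower than 1 document per 10 seconds fall back to the minimum.
    if period.total_seconds() / docs > 10:
        docs, period = 1, timedelta(seconds=10)
    window = (start, end) if start else None
    return docs, period, window

print(parse_request_rate("100/15m"))          # (100, datetime.timedelta(seconds=900), None)
print(parse_request_rate("1/10s 1800-1900"))  # (1, datetime.timedelta(seconds=10), ('1800', '1900'))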
Examples (specific and all other robots)
User-agent: *
Disallow: /images/
Request-rate: 30/1m
# all robots except for SeznamBot and Googlebot:
# do not access the /images/ directory, rate 30 URLs per minute

User-agent: SeznamBot
Disallow: /cz/chat/
Request-rate: 300/1m
# SeznamBot: do not access the /cz/chat/ directory, rate 300 URLs per minute

User-agent: Googlebot
Disallow: /logs/
Request-rate: 10/1m
# Googlebot: do not access /logs/, rate 10 URLs per minute
Examples (SeznamBot and all other robots)
User-agent: *
Disallow: /
# all robots except for SeznamBot: do not access anything
User-agent: SeznamBot
Request-rate: 300/1m
# SeznamBot: access everything, rate 300 URLs per minute
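As the comments indicate, a robot that has its own User-agent section ignores the section for *. This can be checked with Python's urllib.robotparser; the sketch below uses a hypothetical URL and is not SeznamBot's own logic. Note that urllib.robotparser does not understand the time-unit suffix in Request-rate, but the section grouping still applies.

from urllib import robotparser

RULES = """User-agent: *
Disallow: /

User-agent: SeznamBot
Request-rate: 300/1m
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# SeznamBot matches its own section (which contains no Disallow); other
# robots fall back to the * section, which disallows everything.
print(rp.can_fetch("SeznamBot", "https://www.example.com/page"))  # True
print(rp.can_fetch("OtherBot", "https://www.example.com/page"))   # False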
Sitemaps
Sitemaps allow you to fine-tune the movement of SeznamBot around your website. Through a sitemap you can tell SeznamBot which pages change frequently, when a given page was last updated, or what its indexing priority within the site is. Sitemaps are implemented through the Sitemap protocol, which uses XML files containing all the necessary information. You can find more information on sitemaps, including the exact syntax, on the official website sitemaps.org.
The sitemap directive syntax is: Sitemap: <absolute URL>
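For illustration, here is a minimal sitemap in the sitemaps.org format together with a short Python sketch that reads the fields a crawler can use; the URL and dates are hypothetical.

import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2020-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
for url in root.findall("sm:url", NS):
    print(url.findtext("sm:loc", namespaces=NS),
          url.findtext("sm:lastmod", namespaces=NS))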
Version 2.0 Wildcards
SeznamBot supports most parts of the proposed robots.txt standard version 2.0 extension. Note, however, that these features are being deprecated. The version 2.0 syntax is extended to allow basic wildcard patterns, similar to those used in the Unix shell. If you want to use version 2.0 with SeznamBot, you need to enable it in the robots.txt file. This is done through the directive Robot-version: 2.0 placed on the second line of the appropriate section of the robots.txt file.
Example
User-agent: *
Robot-version: 2.0
Disallow: /
The robots.txt standard version 2.0 allows you to use the following wildcards in the URL pattern of the Disallow: and Allow: directives:
* | matches any sequence of characters (including 0 characters)
? | matches any one character
\ | escapes the next special character (e.g. ?, *, ...) so that it is taken literally (e.g. the pattern /file\? will only match the path /file?, but not /files, etc.)
[<character set>] | matches any one character from the given set
[!<character set>] or [^<character set>] | matches any one character outside the given set
CAUTION
In version 2.0, as opposed to version 1.0, the robot tries to match the whole URL (not just its beginning). This means that, for example, the line Disallow: /helpme restricts robot access to the URL /helpme only. The original version 1.0 effect (restricting the download of all pages whose addresses start with the given string) can be achieved by adding the * wildcard at the end of the URL string (e.g. Disallow: /helpme*).
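Because version 2.0 patterns behave much like shell globs matched against the whole address, Python's fnmatch module can approximate them; it covers *, ?, [set] and [!set], but not the backslash escape, so the sketch below is only an illustration, not SeznamBot's matcher.

from fnmatch import fnmatchcase

def disallowed(path, pattern):
    # Version 2.0 semantics: the whole path must match the pattern.
    return fnmatchcase(path, pattern)

print(disallowed("/helpme", "/helpme"))       # True  - exact match only
print(disallowed("/helpme/faq", "/helpme"))   # False - whole URL must match
print(disallowed("/helpme/faq", "/helpme*"))  # True  - version 1.0 behavior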