Crawling control

SeznamBot fully complies with the robots exclusion standard (or simply robots.txt), which defines the rules of robot behavior through a robots.txt file. A robots.txt file contains instructions specifying which content of the website robots are or are not allowed to access and download. All robots visiting your website that follow this standard read this file first and adjust their behavior according to the directives it contains. You can find a detailed description of its syntax on the official website of the standard.

Using the robots.txt standard you can stop all crawling performed by SeznamBot on your website, or only stop the downloading of specific pages. It typically takes several days for our crawler to recheck the restrictions in the robots.txt file and update the index accordingly, though for sites that are not visited often it can take up to several weeks. If you only want to stop the indexing of a page but still allow SeznamBot to download and explore it, see Indexing Control. In that case you should also allow SeznamBot to download the page in the robots.txt file so that it can read the restrictions in the HTML code.

If you want to keep SeznamBot from accessing your site altogether, use the following directives in your robots.txt file:

User-agent: SeznamBot
Disallow: /
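
If you only want to keep SeznamBot away from specific pages or sections rather than the whole site, you can list just those paths instead; the paths below are purely illustrative:

User-agent: SeznamBot
Disallow: /private/
Disallow: /drafts/old-page.html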

Nonstandard Extensions of robots.txt Syntax Recognized By SeznamBot

On top of the official version 1.0 standard, SeznamBot recognizes additional directives as well as most parts of the proposed robots.txt standard version 2.0 extension (note that version 2.0 support is being deprecated). These extensions are described in the sections below.

Allow Directive

The syntax of the Allow directive is the same as that of the standard Disallow directive, except for the name. The directive explicitly allows robots to access the given URL(s). This is useful when you want to instruct robots to avoid an entire directory but still want some HTML documents in that directory to be crawled and indexed.
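
For instance, a hypothetical configuration that blocks an entire directory but still permits one document inside it might look as follows (the paths are illustrative):

User-agent: *
Disallow: /archive/
Allow: /archive/summary.html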

Examples

User-agent: *
Disallow:

User-agent: *
Allow:

All robots can access and download all pages of the website. Leaving the value after the Disallow/Allow directive empty means that the directive does not apply at all. This is the default behavior (an empty or missing robots.txt file has the same effect).

User-agent: *
Disallow: /

No robot can download any page.

User-agent: *
Disallow: /archive/
Disallow: /abc

No robot can enter the /archive/ directory of the website. Furthermore, no robot can download any page whose address starts with /abc.

User-agent: *
Disallow: /
Allow: /A/
Disallow: /A/B/

All robots can download files only from the /A/ directory and its subdirectories, except for the /A/B/ subdirectory. The order of the directives is not important.

User-agent: SeznamBot
Disallow: /

SeznamBot cannot download anything from the website. Other robots are allowed by default.

User-agent: SeznamBot
Disallow: /discussion/

SeznamBot cannot download anything from the /discussion/ directory. Other robots are allowed by default.

Wildcards

You can use the following wildcards in a robots.txt file:

* any number of any characters (an arbitrary string). Can be used multiple times in a directive.
$ the end of the address string

Examples

User-agent: SeznamBot
Disallow: *.pdf$

Disallows downloading of all files whose addresses end with ".pdf" (regardless of the characters preceding it).

User-agent: SeznamBot
Disallow: /*/discussion/$

Disallows downloading of the default document in any discussion/ subdirectory while still allowing downloading of all other files in those subdirectories.

User-agent: SeznamBot
Disallow: /discussion$

Disallows downloading of the exact address /discussion while still allowing /discussion-01, /discussion-02, etc.
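
Since the * wildcard may be used several times in a single directive, a hypothetical rule such as the following would block PDF files inside any print/ subdirectory (the path is illustrative):

User-agent: SeznamBot
Disallow: /*/print/*.pdf$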

Request-rate Directive

The Request-rate directive tells robots how many documents they may download from a website during a given time period. SeznamBot fully respects this directive, which lets you set the download rate in a way that prevents your servers from being overloaded or even crashing. Conversely, if you want your files to be processed by SeznamBot at a faster rate, you can set the Request-rate to a higher value.

The Request-rate directive syntax is: Request-rate: <number of documents>/<time>

You can also specify a period of the day during which the robot will observe the rate set by the directive. For the rest of the day, it will return to its regular behavior.

The general syntax in this case is: Request-rate: <rate> <time of day>

Examples

Request-rate: 1/10s
Robots are allowed to download one document every ten seconds.

Request-rate: 100/15m
100 documents every 15 minutes.

Request-rate: 400/1h
400 documents every hour.

Request-rate: 9000/1d
9000 documents every day.

Request-rate: 1/10s 1800-1900
Robots are allowed to download one document every ten seconds between 18:00 and 19:00 (UTC). At other times there is no limit on the download rate.

CAUTION

The minimum download rate for SeznamBot is 1 document every 10 seconds. If you specify a slower rate, SeznamBot will interpret it as this minimum rate. The maximum rate is limited only by the current speed of SeznamBot.
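
For example, a hypothetical rate slower than this minimum, such as one document per hour, would be treated by SeznamBot as one document every ten seconds:

User-agent: SeznamBot
Request-rate: 1/1h
# interpreted by SeznamBot as 1/10s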

Examples (specific and all other robots)

User-agent: *
Disallow: /images/
Request-rate: 30/1m
# all robots except for SeznamBot and Googlebot: 
#     do not access /images/ directory, rate 30 URLs per minute

User-agent: SeznamBot
Disallow: /cz/chat/
Request-rate: 300/1m 
# SeznamBot: do not access /cz/chat/ directory, rate 300 URLs per minute
  
User-agent: Googlebot
Disallow: /logs/ 
Request-rate: 10/1m
# Googlebot: do not access /logs/, rate 10 URLs per minute

Examples (SeznamBot and all other robots)

User-agent: *
Disallow: /
# all robots except for SeznamBot: do not access anything

User-agent: SeznamBot
Request-rate: 300/1m
# SeznamBot: access everything, rate 300 URLs per minute

Sitemaps

Sitemaps allow you to fine-tune the movement of SeznamBot around your website. Through a sitemap you can tell SeznamBot which pages change frequently, when a given page was last updated, or what its indexing priority within the site is. Sitemaps are implemented through the Sitemap protocol, which uses XML files containing all the necessary information. You can find more information on sitemaps, including the exact syntax, on the official website sitemaps.org.

The Sitemap directive syntax is: Sitemap: <absolute URL>
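
For example, a sitemap hosted at a hypothetical address would be announced to robots like this:

Sitemap: https://www.example.com/sitemap.xml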

Version 2.0 Wildcards

SeznamBot supports most parts of the proposed robots.txt standard version 2.0 extension. Note, however, that these features are being deprecated. Version 2.0 extends the syntax by allowing simple wildcard patterns similar to those used in the Unix shell. If you want to use version 2.0 with SeznamBot, you need to enable it in the robots.txt file. This is done by placing the directive Robot-version: 2.0 on the second line of the appropriate section of the robots.txt file.

Example

User-agent: *
Robot-version: 2.0
Disallow: /

The robots.txt standard version 2.0 allows you to use the following wildcards in the URL pattern of the Disallow: and Allow: directives:

* matches any sequence of characters (including 0 characters)
? matches any one character
\ escapes the next special character (e.g. ?, *, ...) to be taken literally (e.g. the pattern /file\? will only match the path /file?, but not /files, etc.)
[<character set>] matches any one character from the given set
[!<character set>] or [^<character set>] matches any one character outside the given set
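
As an illustration, a hypothetical version 2.0 section using these wildcards might look like this (the paths are purely illustrative):

User-agent: SeznamBot
Robot-version: 2.0
# blocks e.g. /print/page-1.html or /print/page-a.html
Disallow: /print/page-?.html
# blocks any URL starting with /archive/ followed by a digit
Disallow: /archive/[0-9]*
# blocks /search? followed by any query string
Disallow: /search\?*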

CAUTION

In version 2.0, as opposed to version 1.0, the robot tries to match the whole URL (not just its beginning). This means that, for example, the line Disallow: /helpme restricts robots' access to the URL /helpme only. The original version 1.0 effect (restricting the download of pages whose addresses start with the given string) can be achieved by adding the * wildcard at the end of the URL string (e.g. Disallow: /helpme*).
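
To keep the version 1.0 prefix behavior inside a version 2.0 section, the trailing * has to be added explicitly, as in this sketch:

User-agent: SeznamBot
Robot-version: 2.0
# matches every URL starting with /helpme, e.g. /helpme/faq.html
Disallow: /helpme*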
