CorpusCrawler

Czech corpus linguistic research crawler

About this crawler

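CorpusCrawler is a web crawler whose requests can be identified by matching the regular-expression pattern CorpusCrawler against the User-Agent request header. It is categorised as academic. Use this pattern to detect, log, allow, or block CorpusCrawler traffic in your web server or CDN edge rules. In robots.txt, use the plain name CorpusCrawler as the User-agent token, since that field takes a product name rather than a regular expression.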
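As an illustration of the detection step, here is a minimal Python sketch that tests a User-Agent value against the pattern. The function name `is_corpuscrawler` is hypothetical, and case-insensitive matching is an assumption chosen to mirror the `[NC]` and `~*` flags in the blocking rules on this page.

```python
import re

# Pattern taken from this entry. Case-insensitive matching is an assumption
# that mirrors the [NC] (Apache) and ~* (Nginx) flags used on this page.
CORPUSCRAWLER_RE = re.compile(r"CorpusCrawler", re.IGNORECASE)


def is_corpuscrawler(user_agent):
    """Return True if the User-Agent header value matches the pattern.

    `user_agent` may be None or empty when the header is absent.
    """
    return bool(user_agent) and CORPUSCRAWLER_RE.search(user_agent) is not None
```

A log filter or request handler could call `is_corpuscrawler(header_value)` and then decide whether to log, allow, or reject the request.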

Block-rate · top 25k sites

No block-rate data for this crawler.

Technical details

Name: CorpusCrawler
Pattern: CorpusCrawler
Tags: academic
Reference: http://corpora.fi.muni.cz/crawler/
Added: 2026/05/09
Instances: 15 known samples

Sample User-Agent strings

CorpusCrawler 2.0.0 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.8 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.9 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.10 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.12 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.13 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.14 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.15 (http://corpora.fi.muni.cz/crawler/)
CorpusCrawler 2.0.17 (http://corpora.fi.muni.cz/crawler/);Project:CzCorpus
CorpusCrawler 2.0.19 (http://corpora.fi.muni.cz/crawler/);Project:CzCorpus
CorpusCrawler 2.0.20 (http://corpora.fi.muni.cz/crawler/);Project:CzCorpus
CorpusCrawler 2.0.21 (http://corpora.fi.muni.cz/crawler/);Project:CzCorpus
CorpusCrawler 2.0.22 (http://corpora.fi.muni.cz/crawler/);Project:CzCorpus
CorpusCrawler 2.0.24 (http://corpora.fi.muni.cz/crawler/);Project:CzCorpus
CorpusCrawler 2.0.25 (http://corpora.fi.muni.cz/crawler/);Project:CzCorpus
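The samples above share one layout: the name, a dotted version, the reference URL in parentheses, and, in the later samples, a `;Project:` suffix. The following Python sketch splits a string into those parts. It assumes every CorpusCrawler string follows the layout of these 15 samples, which this page does not guarantee; the names `UA_RE` and `parse_corpuscrawler_ua` are hypothetical.

```python
import re

# Layout inferred from the sample strings on this page only (an assumption):
#   CorpusCrawler <version> (<url>)[;Project:<name>]
UA_RE = re.compile(
    r"^CorpusCrawler (?P<version>\d+(?:\.\d+)*)"
    r" \((?P<url>[^)]+)\)"
    r"(?:;Project:(?P<project>\S+))?$"
)


def parse_corpuscrawler_ua(user_agent):
    """Return a dict with version, url and project, or None if no match."""
    match = UA_RE.match(user_agent)
    if match is None:
        return None
    # Store the version as a tuple of ints so that 2.0.9 sorts before 2.0.10.
    version = tuple(int(part) for part in match.group("version").split("."))
    return {
        "version": version,
        "url": match.group("url"),
        "project": match.group("project"),  # None when the suffix is absent
    }
```

Keeping the version as a tuple of integers avoids the string-comparison trap in which "2.0.9" would sort after "2.0.10".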

Block this crawler

robots.txt — disallow CorpusCrawler:

User-agent: CorpusCrawler
Disallow: /
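Before deploying the rule, it can be checked with the robots.txt parser in Python's standard library. This is a sketch: the helper name `allowed` and the example.com URL are hypothetical, and it only shows how a compliant parser reads the rule. robots.txt is advisory, and this page does not state whether CorpusCrawler honours it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from this page, one directive per line.
ROBOTS_TXT = """\
User-agent: CorpusCrawler
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `allowed("CorpusCrawler", "https://example.com/page")` evaluates the rule for this crawler, while any agent name not listed in the file falls through to the parser's default of allowing the fetch.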

Apache .htaccess — return 403:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} CorpusCrawler [NC]
RewriteRule .* - [F,L]

Nginx — return 403 inside a server block:

if ($http_user_agent ~* "CorpusCrawler") {
    return 403;
}
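Where the web server configuration is out of reach, the same check can be made in application code. The sketch below assumes a Python WSGI application and is not something this page prescribes; `block_corpuscrawler`, `demo_app`, and `status_for` are hypothetical names used for illustration.

```python
import re
from wsgiref.util import setup_testing_defaults

# Same pattern and case-insensitive matching as the server rules on this page.
CORPUSCRAWLER_RE = re.compile(r"CorpusCrawler", re.IGNORECASE)


def block_corpuscrawler(app):
    """Wrap a WSGI app so that matching requests receive 403 Forbidden."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if CORPUSCRAWLER_RE.search(user_agent):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical application standing in for a real site."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"ok\n"]


def status_for(user_agent):
    """Run one fake request through the wrapped app and return its status."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []

    def start_response(status, headers, exc_info=None):
        seen.append(status)

    list(block_corpuscrawler(demo_app)(environ, start_response))
    return seen[0]
```

Blocking at the server or CDN is usually cheaper, because the request never reaches the application; the middleware is a fallback for hosting setups where only application code can be changed.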