Guide
WordPress bot traffic — identifying and blocking scrapers using request fingerprints
1. Problem
Your WordPress site is not under attack. Nobody is trying to log in. There is no error in the dashboard. But PHP-FPM workers are pinned at 80%, the cache hit rate has dropped, your hosting bill went up, and the access log scrolls past too fast to read. If you searched for "wordpress how to identify bot traffic", "wordpress block scraper bots", or "wordpress unknown bot user-agent flooding", this is the place.
You tail the access log and see something like this, repeated for hours:
20.171.103.18 - - [27/Apr/2026:09:14:11 +0000] "GET /category/news/page/47/ HTTP/2.0" 200 84211 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected])"
20.171.103.19 - - [27/Apr/2026:09:14:11 +0000] "GET /category/news/page/48/ HTTP/2.0" 200 86012 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected])"
185.224.128.41 - - [27/Apr/2026:09:14:12 +0000] "GET /?p=12873 HTTP/1.1" 200 41200 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
185.224.128.42 - - [27/Apr/2026:09:14:12 +0000] "GET /wp-content/uploads/2024/06/report.pdf HTTP/1.1" 200 1240288 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
This is wordpress unknown bot user-agent flooding in its most common form: a mix of declared AI training crawlers, headless-browser scrapers pretending to be Chrome, and vulnerability scanners walking your post archive. None of them tried to log in. None of them triggered Wordfence. They just consumed your CPU, your bandwidth, and your patience. Below is how to fingerprint each kind, decide which to block, and stop the rest from coming back next week.
2. Impact
This is not a credential-stuffing incident. It is a slow tax on the site:
- Origin server load. Bots ignore cache hints. A single AI crawler walking pagination at 30 req/s rebuilds your category pages on every miss and saturates PHP-FPM. Real users get 502s.
- Bandwidth cost. Datacenter bots download every PDF and image. On metered hosting or behind a CDN with origin pulls, this is real money.
- Analytics distortion. Headless-browser scrapers execute JavaScript. Your analytics now show 5x traffic. A/B tests are poisoned. Conversion rate looks collapsed.
- Content theft. AI training crawlers ingest your work without attribution. It shows up in someone else's chatbot.
- Reconnaissance. Vulnerability scanners walking /?author=N or hammering /wp-content/plugins/ are mapping your install. The next visit is a known-CVE exploit.
A site at the median Logystera-monitored volume — 172,000 fingerprint samples in seven days — has 30–60% of its traffic coming from bots. Most of it is invisible to the dashboard.
3. Why It’s Hard to Spot
WordPress core has no concept of a bot. Every request reaches index.php, routes through rewrites, and produces HTML. There is no log of "this was a bot." Your access log has the user agent, but reading it is manual work most of you stopped doing years ago.
The standard tools miss this predictably:
- Security plugins look for failed logins, file changes, and known-bad IPs. A polite AI crawler downloading pages with 200 status looks like a normal reader.
- CDN bot management is gated behind enterprise plans.
- Analytics platforms filter declared bots before you see them. The ones lying about their UA — the ones you most want to know about — are exactly the ones not filtered.
- Uptime monitors see 200 OK and report green. That a 200 took 4 seconds and used 180MB of PHP memory does not register.
- robots.txt is advisory. Adversarial scrapers ignore it. Even declared AI crawlers honor it inconsistently.
The result: your site is being walked end-to-end by multiple actors at once, and the only artifact is a slightly elevated load average. By the time you correlate cost with cause, the bot has cached your archive.
4. Cause
The wp_request_fingerprint_top signal is a per-entity, top-N rollup of request fingerprints by count. A fingerprint is a triple:
fingerprint = (user_agent_class, ip_prefix, path_class)
- user_agent_class — normalized user agent: claudebot, gptbot, bingbot, headless_chrome, python_requests, curl, empty, or unknown.
- ip_prefix — /24 for IPv4, /48 for IPv6. This collapses datacenter rotation while preserving organizational locality.
- path_class — normalized path: /category/, /?p=N, /wp-content/uploads/, /wp-json/*, /?author=N, /feed/, etc. Numeric IDs and slugs are stripped so /?p=12873 and /?p=12874 collapse into one.
Each fingerprint emits a counter, and the top N per entity per minute is published as wp_request_fingerprint_top. A healthy site has a long, flat tail — many distinct fingerprints with low counts each. A site under bot traffic has a short, tall head — a handful of fingerprints with thousands of hits each.
The shape of the top fingerprints tells you the bot type:
| Bot type | Typical fingerprint |
|---|---|
| AI training crawler | claudebot or gptbot + /24 from cloud provider + /category/ or /?p=N walking sequentially |
| Content scraper | python_requests or empty UA + rotating /24 from residential proxies + /feed/ and /?p=N |
| Headless-browser scraper | headless_chrome + small set of /24s + full pages including JS assets |
| Vulnerability scanner | unknown or stale Chrome UA + single /24 + /wp-content/plugins/, /?author=N, /xmlrpc.php |
| Legitimate search engine | googlebot or bingbot + verified IP ranges + diverse path classes |
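The "verified IP ranges" entry in the last row can be checked mechanically. As a sketch — `verify_googlebot` is an illustrative name, and it assumes the `host` utility (from dnsutils/bind-utils) is installed — a reverse-DNS lookup separates a real Googlebot from a spoofed UA:

```shell
# Sketch: reverse-DNS check for an IP claiming to be Googlebot. Real
# Googlebot IPs resolve to *.googlebot.com or *.google.com. Full
# verification (FCrDNS) also forward-resolves the PTR name back to the
# same IP; that second step is omitted here for brevity.
verify_googlebot() {
  ptr=$(host "$1" | awk '/domain name pointer/ {print $NF}')
  case "$ptr" in
    *.googlebot.com.|*.google.com.) echo "verified: $ptr" ;;
    *)                              echo "unverified: ${ptr:-no PTR}" ;;
  esac
}
# usage: verify_googlebot 66.249.66.1
```

A UA that says Googlebot but fails this check belongs in the vulnerability-scanner row, not the search-engine row.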
Supporting signals make the picture sharper. wp_top_bot_uris ranks the actual paths being hit, so you can see whether the bot is harvesting content (/?p=N), files (/wp-content/uploads/*), or probing (/wp-config.php.bak). wp_bot_requests_total is a raw counter you can alert on for sudden ramps. wp_request_peak_memory_mb ties bot pressure back to PHP cost — when it climbs above your normal baseline, a bot is forcing uncached page rebuilds.
The point of fingerprinting is that no single dimension is enough. UA alone gets spoofed. IP alone misses cloud-rotated bots. Path alone catches everyone. The combination is what isolates a single actor across thousands of hits.
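To make the combination concrete, here is a minimal sketch of the same rollup over a combined-format access log on stdin. The class lists are abbreviated, and `fingerprint` is a name chosen for this example, not part of any tool:

```shell
# Sketch: rebuild the (user_agent_class, ip_prefix, path_class) triple
# from a combined-format access log read on stdin, then count each triple.
fingerprint() {
  awk -F'"' '{
    split($1, pre, " "); split(pre[1], o, ".")
    prefix = o[1] "." o[2] "." o[3] ".0/24"     # IPv4 /24 only, for brevity
    ua = tolower($6); uac = "unknown"
    if      (ua ~ /claudebot/)       uac = "claudebot"
    else if (ua ~ /gptbot/)          uac = "gptbot"
    else if (ua ~ /headlesschrome/)  uac = "headless_chrome"
    else if (ua ~ /python-requests/) uac = "python_requests"
    else if (ua == "-" || ua == "")  uac = "empty"
    split($2, req, " "); path = req[2]          # "GET /path HTTP/x" -> path
    gsub(/[0-9]+/, "N", path)                   # strip numeric IDs and pages
    print uac, prefix, path
  }' | sort | uniq -c | sort -rn
}
# usage: tail -n 100000 /var/log/nginx/access.log | fingerprint | head
```

Run against the sample log at the top, the two ClaudeBot lines collapse into a single `claudebot 20.171.103.0/24 /category/news/page/N/` row — one actor, one fingerprint, across thousands of hits.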
5. Solution
5.1 Diagnose (logs first)
Every step here ties back to wp_request_fingerprint_top, wp_top_bot_uris, or wp_bot_requests_total. The goal is to reproduce the fingerprint rollup on your own logs.
Step 1: Find your loudest user agents
# Top 30 user agents in the last 100k requests
tail -n 100000 /var/log/nginx/access.log \
| awk -F'"' '{print $6}' \
| sort | uniq -c | sort -rn | head -30
Each line here contributes to a wp_request_fingerprint_top row. If you see ClaudeBot, GPTBot, CCBot, Bytespider, PerplexityBot, or anthropic-ai ranked above Googlebot, you have AI crawler load. If you see python-requests/, Go-http-client/, curl/*, or empty user agents in the top 10, you have scrapers and probes.
Step 2: Map user agents to IP prefixes
# For one suspicious UA, group source IPs by /24
tail -n 100000 /var/log/nginx/access.log \
| grep "ClaudeBot" \
| awk '{split($1,a,"."); print a[1]"."a[2]"."a[3]".0/24"}' \
| sort | uniq -c | sort -rn | head
This is the second axis of the fingerprint. A real Anthropic crawler will resolve to a small published range. A spoofed UA will come from residential proxies — many /24s, one or two hits each. Both patterns are produced by wp_request_fingerprint_top.
Step 3: Map fingerprints to path classes
# What is the bot actually fetching?
tail -n 100000 /var/log/nginx/access.log \
| grep "ClaudeBot" \
| awk '{print $7}' \
| sed -E 's|/[0-9]+/?$|/N/|; s|\?p=[0-9]+|?p=N|; s|/page/[0-9]+/|/page/N/|' \
| sort | uniq -c | sort -rn | head -20
This recreates the path_class axis and is exactly what wp_top_bot_uris reports. Sequential pagination, archive walks, and /feed/ dominance are the signature of content harvesting. Bursts against /wp-content/plugins/ are vulnerability mapping.
Step 4: Check the cost on the WordPress side
# PHP slowlog — requests over your slow threshold
grep "POST\|GET" /var/log/php-fpm/www-slow.log \
| awk '{print $NF}' | sort | uniq -c | sort -rn | head
# Memory peaks per request, if you have a custom logger
grep "peak_mem" /var/log/wordpress/perf.log \
| awk '$NF > 128 {print}' | tail -50
Bot fingerprints that correlate with elevated wp_request_peak_memory_mb are forcing uncached rebuilds — those are the expensive ones. Static-asset scrapers cost bandwidth but not CPU. Archive walkers cost CPU and database time. Prioritize the CPU-expensive bots first.
Step 5: Confirm volume against the bot counter
# Total bot-class requests per minute (rough approximation)
awk '/ClaudeBot|GPTBot|CCBot|Bytespider|python-requests|headless/ {print substr($4,2,17)}' \
/var/log/nginx/access.log \
| sort | uniq -c | tail -30
This is the per-minute view of wp_bot_requests_total. A flat line at 5/min is a healthy declared crawler. A sustained climb to 500/min that never settles is the load that justifies blocking.
5.2 Root Causes
(see root causes inline in 5.3 Fix)
5.3 Fix
There are four root causes. Each maps to a fingerprint shape, a signal, and a different fix. Apply in order: cheapest first, most permanent last.
Cause A: Declared AI crawler ignoring polite limits
Signal evidence: wp_request_fingerprint_top shows claudebot or gptbot UA on a known cloud /24, walking /category/* and /?p=N sequentially. wp_top_bot_uris shows pagination URLs at the top.
Fix:
- Add explicit robots.txt directives:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
- For crawlers that ignore robots.txt, block at nginx by UA:
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|anthropic-ai|Bytespider|PerplexityBot)") {
return 403;
}
- This removes the fingerprint entirely from wp_request_fingerprint_top.
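Before touching the nginx config, you can dry-run the block regex against sample user agents locally — same pattern, same case-insensitivity as the ~* operator:

```shell
# Dry-run the nginx UA block list: count which sample UAs it would catch.
# grep -i mirrors nginx's case-insensitive ~* match.
ua_block='(GPTBot|ClaudeBot|CCBot|anthropic-ai|Bytespider|PerplexityBot)'
matches=$(printf '%s\n' \
  'Mozilla/5.0 (compatible; ClaudeBot/1.0)' \
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
  | grep -icE "$ua_block")
echo "blocked: $matches of 2"   # Googlebot must NOT match
```

If a UA you rely on (uptime checks, your own crawler) matches here, fix the regex before deploying, not after.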
Cause B: Headless-browser scraper pretending to be Chrome
Signal evidence: wp_request_fingerprint_top shows headless_chrome or stale Chrome UA from a small datacenter /24, full-page fetches including assets. wp_request_peak_memory_mb climbs in lockstep.
Fix:
- Block known headless tells at nginx:
if ($http_user_agent ~* "(HeadlessChrome|Puppeteer|Playwright)") {
return 403;
}
- For lying UAs, add a JavaScript challenge at the CDN edge for the offending /24s. Real browsers pass; headless setups without proper stealth fail.
- Rate-limit by IP at nginx for any UA that requests both the HTML and /wp-content/themes/*.css within 100ms — real users serve assets from cache, scrapers cold-fetch everything:
# goes in the http{} context
limit_req_zone $binary_remote_addr zone=bots:10m rate=30r/m;
# goes in the relevant server{} block
location / { limit_req zone=bots burst=20 nodelay; }
Cause C: Python/curl scrapers and feed harvesters
Signal evidence: wp_request_fingerprint_top shows python_requests, Go-http-client, curl, or empty UA from rotating residential /24s, hitting /feed/, /?p=N, or /wp-json/wp/v2/posts.
Fix:
- Block clearly non-browser UAs that have no business reading content:
if ($http_user_agent ~* "^(python-requests|Go-http-client|curl|wget|libwww)") {
return 403;
}
if ($http_user_agent = "") { return 403; }
- Restrict /wp-json/wp/v2/posts to authenticated requests if you do not publish a public API for it — add a rest_authentication_errors filter that requires login for read endpoints you do not serve to anonymous users.
- Rate-limit /feed/ to one request per minute per IP. RSS readers poll on the order of every 15–60 minutes; anything faster is harvesting.
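A sketch of that feed rate limit in nginx — zone name and sizes are placeholders, and the location block should be merged with however your install already routes feeds to PHP:

```nginx
# http{} context: 1 request per minute per client IP for feed URLs
limit_req_zone $binary_remote_addr zone=feeds:10m rate=1r/m;

# server{} block: matches /feed/ and nested feeds like /category/x/feed/
location ~ /feed/?$ {
    limit_req zone=feeds burst=2 nodelay;
    try_files $uri $uri/ /index.php?$args;   # keep WordPress routing intact
}
```

Over-limit requests get a 503 by default; set limit_req_status 429 if you want harvesters to see an explicit rate-limit response.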
Cause D: Vulnerability scanner mapping the install
Signal evidence: wp_top_bot_uris shows hits to /wp-config.php.bak, /wp-content/plugins/, /?author=1, /xmlrpc.php, /.env, /wp-content/debug.log. Fingerprint UA is usually unknown or stale.
Fix:
- Return 403 for the /wp-content/plugins/*/readme.txt and /wp-content/themes/*/readme.txt patterns at the web server.
- Block /?author=N enumeration:
if ($args ~* "author=\d+") { return 403; }
- Disable XML-RPC if unused (covered separately in the credential-stuffing guide).
- Confirm /wp-config.php.bak, /.env, and /wp-content/debug.log return 404 and not 200. If any of these returns content, you have already leaked secrets.
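You can also check this from the log side. This sketch — `check_leaks` is an illustrative name — scans a combined-format access log on stdin for sensitive-file probes that got a 200 back:

```shell
# List sensitive paths that ever returned 200. Any output here means the
# file is (or was) being served and you should rotate the secrets in it.
check_leaks() {
  grep -E 'wp-config\.php\.bak|/\.env|/wp-content/debug\.log' \
    | awk '$9 == 200 {print $7}' | sort -u
}
# usage: check_leaks < /var/log/nginx/access.log
```

Empty output means the scanners got 404s; a single line of output outranks every other fix in this guide.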
5.4 Verify
The signal that should change is the shape of wp_request_fingerprint_top, not just the volume. After applying the relevant fixes, watch for 30–60 minutes and check:
# The top fingerprints should redistribute — no single one dominating
tail -n 30000 /var/log/nginx/access.log \
| awk -F'"' '{ua=$6; print ua}' \
| sort | uniq -c | sort -rn | head -10
Healthy looks like:
- wp_request_fingerprint_top shows a long tail again — top fingerprint under 5% of total requests.
- The blocked UAs return 403s, not 200s, and the count of 200-status responses to those UAs goes to zero.
- wp_bot_requests_total drops by 60–80% within minutes; remaining bot traffic is declared legitimate crawlers (googlebot, bingbot).
- wp_top_bot_uris no longer shows sequential pagination or archive walks at the top.
- wp_request_peak_memory_mb settles back to baseline. PHP-FPM idle worker count returns to normal.
- Cache hit rate at your CDN climbs by 10–30%.
# Confirm the blocked UAs are now hitting 403
grep "ClaudeBot\|GPTBot\|python-requests" /var/log/nginx/access.log \
| awk '{print $9}' | sort | uniq -c
If after an hour you still see one fingerprint dominating despite the UA block, the bot has switched user agents. Re-run Step 2 from 5.1 — the /24 will likely be the same, and you can block by IP prefix instead.
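Blocking by prefix is a one-liner in nginx — the prefix below is a placeholder taken from the sample log at the top, not a recommendation:

```nginx
# server{} or http{} context: drop the scraper's entire /24
deny 185.224.128.0/24;
```

deny returns 403. If the list grows past a handful of entries, move the prefixes into a geo map so they live in one file instead of scattered deny lines.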
6. How to Catch This Early
Fixing it is straightforward once you know the cause. The hard part is noticing the shape of your traffic shift before it costs you a week of CPU and a hosting upgrade.
wp_request_fingerprint_top exists for exactly this. It is emitted continuously by the Logystera WordPress plugin. Every request contributes to a fingerprint count, and the top-N per entity per minute lands in metrics. When the distribution shape shifts — when the top fingerprint moves from 0.5% of traffic to 30% — the rule fires and you see the offender, the IP prefix, and the path class in one view. No log diving. No correlation work.
The same plugin emits wp_top_bot_uris so you know what they are taking, wp_bot_requests_total for raw volume alerting, and wp_request_peak_memory_mb so you can tell whether a bot is just noisy or actually expensive. None of these signals exist in stock WordPress. None of them are surfaced by your hosting dashboard. Without continuous fingerprinting, this class of failure is only ever caught after the fact, in the next month's bill.
The detection is not clever. It is a counter, a normalization, and a top-N. The reason it works is that it runs all the time and someone — or something — is watching it.
7. Related Silent Failures
Same logs, same blind spots, often the same bots wearing different clothes:
- WordPress REST API hammered with login attempts — when a fingerprint with high wp_bot_requests_total shifts from /feed/ to /wp-json/jwt-auth/*, the scraper has graduated to credential stuffing.
- WordPress XML-RPC system.multicall amplification — vulnerability scanners probing /xmlrpc.php are often the same fingerprints walking /wp-content/plugins/*/readme.txt.
- WordPress username enumeration via /?author=N — a wp_top_bot_uris entry for /?author=N is reconnaissance for the credential-stuffing attack arriving 24 hours later.
- PHP memory_limit exhaustion under bot load — sustained wp_request_peak_memory_mb climbs from uncached rebuilds end in a php.fatal "Allowed memory size exhausted."
- WordPress slow queries triggered by archive walkers — sequential /category/*/page/N/ scraping forces full table scans on wp_posts. The slowlog shows it; the fingerprint shows who.
Each surfaces as a distinct pattern on the same fingerprint substrate. Watching the shape of wp_request_fingerprint_top is what lets you see them coming.
See what's actually happening in your WordPress system
Connect your site. Logystera starts monitoring within minutes.