
WordPress bot traffic — identifying and blocking scrapers using request fingerprints

1. Problem

Your WordPress site is not under attack. Nobody is trying to log in. There is no error in the dashboard. But PHP-FPM workers are pinned at 80%, the cache hit rate has dropped, your hosting bill went up, and the access log scrolls past too fast to read. If you searched for "wordpress how to identify bot traffic", "wordpress block scraper bots", or "wordpress unknown bot user-agent flooding", this is the place.

You tail the access log and see something like this, repeated for hours:

20.171.103.18 - - [27/Apr/2026:09:14:11 +0000] "GET /category/news/page/47/ HTTP/2.0" 200 84211 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected])"
20.171.103.19 - - [27/Apr/2026:09:14:11 +0000] "GET /category/news/page/48/ HTTP/2.0" 200 86012 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected])"
185.224.128.41 - - [27/Apr/2026:09:14:12 +0000] "GET /?p=12873 HTTP/1.1" 200 41200 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
185.224.128.42 - - [27/Apr/2026:09:14:12 +0000] "GET /wp-content/uploads/2024/06/report.pdf HTTP/1.1" 200 1240288 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

This is unknown-bot user-agent flooding in its most common form: a mix of declared AI training crawlers, headless-browser scrapers pretending to be Chrome, and vulnerability scanners walking your post archive. None of them tried to log in. None of them triggered Wordfence. They just consumed your CPU, your bandwidth, and your patience. Below is how to fingerprint each kind, decide which to block, and stop the rest from coming back next week.

2. Impact

This is not a credential-stuffing incident. It is a slow tax on the site:

  • Origin server load. Bots ignore cache hints. A single AI crawler walking pagination at 30 req/s rebuilds your category pages on every miss and saturates PHP-FPM. Real users get 502s.
  • Bandwidth cost. Datacenter bots download every PDF and image. On metered hosting or behind a CDN with origin pulls, this is real money.
  • Analytics distortion. Headless-browser scrapers execute JavaScript. Your analytics now show 5x traffic. A/B tests are poisoned. Conversion rate looks collapsed.
  • Content theft. AI training crawlers ingest your work without attribution. It shows up in someone else's chatbot.
  • Reconnaissance. Vulnerability scanners walking /?author=N or hammering /wp-content/plugins/ are mapping your install. The next visit is a known-CVE exploit.

A site at the median Logystera-monitored volume — 172,000 fingerprint samples in seven days — has 30–60% of its traffic coming from bots. Most of it is invisible to the dashboard.

3. Why It’s Hard to Spot

WordPress core has no concept of a bot. Every request reaches index.php, routes through rewrites, and produces HTML. There is no log entry that says "this was a bot." Your access log has the user agent, but reading it is manual work most admins stopped doing years ago.

The standard tools miss this predictably:

  • Security plugins look for failed logins, file changes, and known-bad IPs. A polite AI crawler downloading pages with 200 status looks like a normal reader.
  • CDN bot management is gated behind enterprise plans.
  • Analytics platforms filter declared bots before you see them. The ones lying about their UA — the ones you most want to know about — are exactly the ones not filtered.
  • Uptime monitors see 200 OK and report green. That a 200 took 4 seconds and used 180MB of PHP memory does not register.
  • robots.txt is advisory. Adversarial scrapers ignore it. Even declared AI crawlers honor it inconsistently.

The result: your site is being walked end-to-end by multiple actors at once, and the only artifact is a slightly elevated load average. By the time you correlate cost with cause, the bot has cached your archive.

4. Cause

The wp_request_fingerprint_top signal is a per-entity, top-N rollup of request fingerprints by count. A fingerprint is a triple:

fingerprint = (user_agent_class, ip_prefix, path_class)
  • user_agent_class — normalized user agent: claudebot, gptbot, bingbot, headless_chrome, python_requests, curl, empty, or unknown.
  • ip_prefix/24 for IPv4, /48 for IPv6. This collapses datacenter rotation while preserving organizational locality.
  • path_class — normalized path: /category/, /?p=N, /wp-content/uploads/, /wp-json/*, /?author=N, /feed/, etc. Numeric IDs and slugs are stripped so /?p=12873 and /?p=12874 collapse into one.

Each fingerprint emits a counter, and the top N per entity per minute is published as wp_request_fingerprint_top. A healthy site has a long, flat tail — many distinct fingerprints with low counts each. A site under bot traffic has a short, tall head — a handful of fingerprints with thousands of hits each.

The shape of the top fingerprints tells you the bot type:

| Bot type | Typical fingerprint |
|---|---|
| AI training crawler | claudebot or gptbot + /24 from a cloud provider + /category/ or /?p=N, walking sequentially |
| Content scraper | python_requests or empty UA + rotating /24s from residential proxies + /feed/ and /?p=N |
| Headless-browser scraper | headless_chrome + small set of /24s + full pages including JS assets |
| Vulnerability scanner | unknown or stale Chrome UA + single /24 + /wp-content/plugins/, /?author=N, /xmlrpc.php |
| Legitimate search engine | googlebot or bingbot + verified IP ranges + diverse path classes |

Supporting signals make the picture sharper. wp_top_bot_uris ranks the actual paths being hit, so you can see whether the bot is harvesting content (/?p=N), files (/wp-content/uploads/*), or probing (/wp-config.php.bak). wp_bot_requests_total is a raw counter you can alert on for sudden ramps. wp_request_peak_memory_mb ties bot pressure back to PHP cost — when it climbs above your normal baseline, a bot is forcing uncached page rebuilds.

The point of fingerprinting is that no single dimension is enough. UA alone gets spoofed. IP alone misses cloud-rotated bots. Path alone catches everyone. The combination is what isolates a single actor across thousands of hits.
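As a concrete illustration, assuming the default nginx combined log format, the three-axis rollup can be approximated with a short awk pipeline. This is a rough sketch of the idea, not the plugin's actual normalization; the numeric-ID stripping in particular is deliberately crude:

```shell
# Sketch: approximate wp_request_fingerprint_top from a combined-format
# access log on stdin. Field positions assume the default nginx
# "combined" format; the numeric-ID stripping is deliberately crude.
fingerprint_top() {
  awk -F'"' '{
    split($1, pre, " "); split(pre[1], o, ".")   # client IP -> /24 prefix
    prefix = o[1] "." o[2] "." o[3] ".0/24"
    n = split($2, req, " ")                      # "GET /path HTTP/x.y"
    path = (n >= 2) ? req[2] : "-"
    gsub(/[0-9]+/, "N", path)                    # /?p=12873 -> /?p=N
    print $6 " | " prefix " | " path             # $6 = user agent
  }' | sort | uniq -c | sort -rn | head
}
# Usage: tail -n 100000 /var/log/nginx/access.log | fingerprint_top
```

A healthy site prints a long list of small counts; a site being walked prints one triple with a count orders of magnitude above the rest.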

5. Solution

5.1 Diagnose (logs first)

Every step here ties back to wp_request_fingerprint_top, wp_top_bot_uris, or wp_bot_requests_total. The goal is to reproduce the fingerprint rollup on your own logs.

Step 1: Find your loudest user agents

# Top 30 user agents in the last 100k requests
tail -n 100000 /var/log/nginx/access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn | head -30

Each line here contributes to a wp_request_fingerprint_top row. If you see ClaudeBot, GPTBot, CCBot, Bytespider, PerplexityBot, or anthropic-ai ranked above Googlebot, you have AI crawler load. If you see python-requests/, Go-http-client/, curl/*, or empty user agents in the top 10, you have scrapers and probes.

Step 2: Map user agents to IP prefixes

# For one suspicious UA, group source IPs by /24
tail -n 100000 /var/log/nginx/access.log \
  | grep "ClaudeBot" \
  | awk '{split($1,a,"."); print a[1]"."a[2]"."a[3]".0/24"}' \
  | sort | uniq -c | sort -rn | head

This is the second axis of the fingerprint. A real Anthropic crawler will resolve to a small published range. A spoofed UA will come from residential proxies — many /24s, one or two hits each. Both patterns are produced by wp_request_fingerprint_top.
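To confirm a claimed crawler, check the address against the operator's published ranges (OpenAI and Anthropic publish lists of theirs), or, for search engines, use forward-confirmed reverse DNS. Below is a minimal sketch of the /24 membership test that matches the grouping above; the dig commands in the comments are the manual verification path for search engines:

```shell
# Sketch: crude IPv4 /24 membership check, matching the /24 grouping
# used in this guide. Use a real CIDR tool for arbitrary masks.
in_prefix24() {   # usage: in_prefix24 IP PREFIX/24
  [ "${1%.*}.0/24" = "$2" ]
}

# Manual forward-confirmed reverse DNS for search engines:
#   dig +short -x <ip>       # reverse: name should end in googlebot.com. etc.
#   dig +short <that name>   # forward: should return the original IP
in_prefix24 20.171.103.18 20.171.103.0/24 && echo "inside prefix"
```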

Step 3: Map fingerprints to path classes

# What is the bot actually fetching?
tail -n 100000 /var/log/nginx/access.log \
  | grep "ClaudeBot" \
  | awk '{print $7}' \
  | sed -E 's|/[0-9]+/?$|/N/|; s|\?p=[0-9]+|?p=N|; s|/page/[0-9]+/|/page/N/|' \
  | sort | uniq -c | sort -rn | head -20

This recreates the path_class axis and is exactly what wp_top_bot_uris reports. Sequential pagination, archive walks, and /feed/ dominance are the signature of content harvesting. Bursts against /wp-content/plugins/*/readme.txt are vulnerability mapping.

Step 4: Check the cost on the WordPress side

# PHP-FPM slowlog — scripts that exceeded request_slowlog_timeout
grep "script_filename" /var/log/php-fpm/www-slow.log \
  | awk '{print $NF}' | sort | uniq -c | sort -rn | head
# Memory peaks per request, if you have a custom logger
grep "peak_mem" /var/log/wordpress/perf.log \
  | awk '$NF > 128 {print}' | tail -50

Bot fingerprints that correlate with elevated wp_request_peak_memory_mb are forcing uncached rebuilds; those are the expensive ones. Static-asset scrapers cost bandwidth but not CPU. Archive walkers cost CPU and database time. Prioritize the CPU-expensive bots first.

Step 5: Confirm volume against the bot counter

# Total bot-class requests per minute (rough approximation)
awk '/ClaudeBot|GPTBot|CCBot|Bytespider|python-requests|headless/ {print substr($4,2,17)}' \
  /var/log/nginx/access.log \
  | sort | uniq -c | tail -30

This is the per-minute view of wp_bot_requests_total. A flat line at 5/min is a healthy declared crawler. A sustained climb to 500/min that never settles is the load that justifies blocking.
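If you want this as a standing check rather than a one-off, the same per-minute count can gate an alert from cron. A sketch; the UA list and the threshold are assumptions to tune against your own baseline:

```shell
# Sketch: count bot-class requests per minute from a log on stdin,
# print the busiest minute's count, exit non-zero above a threshold.
# The UA list and threshold are assumptions; tune to your baseline.
bot_rate_alarm() {   # usage: bot_rate_alarm THRESHOLD < access.log
  awk -v t="$1" '
    /ClaudeBot|GPTBot|CCBot|Bytespider|python-requests|HeadlessChrome/ {
      n[substr($4, 2, 17)]++     # minute bucket from [dd/Mon/yyyy:HH:MM
    }
    END {
      worst = 0
      for (m in n) if (n[m] > worst) worst = n[m]
      print worst
      exit worst > t ? 1 : 0
    }'
}
# e.g. in cron: tail -n 50000 /var/log/nginx/access.log | bot_rate_alarm 300 || <your alert command>
```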

5.2 Root Causes

(see root causes inline in 5.3 Fix)

5.3 Fix

There are four root causes. Each maps to a fingerprint shape, a signal, and a different fix. Apply in order: cheapest first, most permanent last.

Cause A: Declared AI crawler ignoring polite limits

Signal evidence: wp_request_fingerprint_top shows claudebot or gptbot UA on a known cloud /24, walking /category/* and /?p=N sequentially. wp_top_bot_uris shows pagination URLs at the top.

Fix:

  • Add explicit robots.txt directives:
  User-agent: GPTBot
  Disallow: /
  User-agent: ClaudeBot
  Disallow: /
  User-agent: CCBot
  Disallow: /
  • For crawlers that ignore robots.txt, block at nginx by UA:
  if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|anthropic-ai|Bytespider|PerplexityBot)") {
      return 403;
  }
  • This removes the fingerprint entirely from wp_request_fingerprint_top.
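The same rule can be written without `if` inside location blocks, which is the form the nginx documentation generally recommends. A sketch using map, same UA list as above; the map goes in the http context, the return in the server context:

```nginx
# http context
map $http_user_agent $is_ai_bot {
    default 0;
    "~*(GPTBot|ClaudeBot|CCBot|anthropic-ai|Bytespider|PerplexityBot)" 1;
}

# server context
if ($is_ai_bot) { return 403; }
```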

Cause B: Headless-browser scraper pretending to be Chrome

Signal evidence: wp_request_fingerprint_top shows headless_chrome or stale Chrome UA from a small datacenter /24, full-page fetches including assets. wp_request_peak_memory_mb climbs in lockstep.

Fix:

  • Block known headless tells at nginx:
  if ($http_user_agent ~* "(HeadlessChrome|Puppeteer|Playwright)") {
      return 403;
  }
  • For lying UAs, add a JavaScript challenge at the CDN edge for the offending /24s. Real browsers pass; headless without proper stealth setup fails.
  • Rate-limit by IP at nginx. Scrapers cold-fetch every page and every asset, while real users are served mostly from cache, so even a generous per-IP limit catches them. limit_req_zone belongs in the http block; limit_req goes in a server or location block:
  limit_req_zone $binary_remote_addr zone=bots:10m rate=30r/m;
  location / { limit_req zone=bots burst=20 nodelay; }

Cause C: Python/curl scrapers and feed harvesters

Signal evidence: wp_request_fingerprint_top shows python_requests, Go-http-client, curl, or empty UA from rotating residential /24s, hitting /feed/, /?p=N, or /wp-json/wp/v2/posts.

Fix:

  • Block clearly non-browser UAs that have no business reading content:
  if ($http_user_agent ~* "^(python-requests|Go-http-client|curl|wget|libwww)") {
      return 403;
  }
  if ($http_user_agent = "") { return 403; }
  • Restrict /wp-json/wp/v2/posts to authenticated requests if you do not publish a public API for it — add a rest_authentication_errors filter that requires login for read endpoints you do not serve to anonymous users.
  • Rate-limit /feed/ to one request per minute per IP. RSS readers poll on the order of every 15–60 minutes; anything faster is harvesting.
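The REST restriction above can be sketched as a small must-use plugin. This is an illustrative sketch, not a drop-in: the route prefix is an assumption, and the rest_route lookup shown is the common pattern for pretty-permalink sites; adjust it to the endpoints you actually keep private.

```php
<?php
// Hypothetical sketch: deny anonymous reads of the core posts endpoint
// via the rest_authentication_errors filter. The route prefix is an
// assumption; widen or narrow it to match your private endpoints.
add_filter( 'rest_authentication_errors', function ( $result ) {
    if ( null !== $result ) {
        return $result; // an earlier check already decided
    }
    $route = $GLOBALS['wp']->query_vars['rest_route'] ?? '';
    if ( ! is_user_logged_in() && 0 === strpos( $route, '/wp/v2/posts' ) ) {
        return new WP_Error(
            'rest_login_required',
            'Authentication required for this endpoint.',
            array( 'status' => 401 )
        );
    }
    return $result;
} );
```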

Cause D: Vulnerability scanner mapping the install

Signal evidence: wp_top_bot_uris shows hits to /wp-config.php.bak, /wp-content/plugins/*/readme.txt, /?author=1, /xmlrpc.php, /.env, /wp-content/debug.log. The fingerprint UA is usually unknown or a stale Chrome string.

Fix:

  • Return 403 for requests matching /wp-content/plugins/*/readme.txt and /wp-content/themes/*/readme.txt at the web server.
  • Block /?author=N enumeration:
  if ($args ~* "author=\d+") { return 403; }
  • Disable XML-RPC if unused (covered separately in the credential-stuffing guide).
  • Confirm /wp-config.php.bak, /.env, and /wp-content/debug.log return 404 and not 200. If any of these returns content, you have already leaked secrets.
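The last check can be scripted. A sketch using curl; the base URL is a placeholder for your own domain:

```shell
# Sketch: flag any known leak path the site actually serves with 200.
# Pass your own base URL; any "LEAK" line means secrets are exposed.
check_leaks() {   # usage: check_leaks https://example.com
  for p in /wp-config.php.bak /.env /wp-content/debug.log; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1$p")
    if [ "$code" = "200" ]; then
      echo "LEAK: $p returns 200"
    fi
  done
}
```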

5.4 Verify

The signal that should change is the shape of wp_request_fingerprint_top, not just the volume. After applying the relevant fixes, watch for 30–60 minutes and check:

# The top fingerprints should redistribute — no single one dominating
tail -n 30000 /var/log/nginx/access.log \
  | awk -F'"' '{ua=$6; print ua}' \
  | sort | uniq -c | sort -rn | head -10

Healthy looks like:

  • wp_request_fingerprint_top shows a long tail again — top fingerprint under 5% of total requests.
  • The blocked UAs return 403s, not 200s, and the count of 200-status responses to those UAs goes to zero.
  • wp_bot_requests_total drops by 60–80% within minutes; remaining bot traffic is declared legitimate crawlers (googlebot, bingbot).
  • wp_top_bot_uris no longer shows sequential pagination or archive walks at the top.
  • wp_request_peak_memory_mb settles back to baseline. PHP-FPM idle worker count returns to normal.
  • Cache hit rate at your CDN climbs by 10–30%.
# Confirm the blocked UAs are now hitting 403
grep "ClaudeBot\|GPTBot\|python-requests" /var/log/nginx/access.log \
  | awk '{print $9}' | sort | uniq -c

If after an hour you still see one fingerprint dominating despite the UA block, the bot has switched UA. Re-run Step 2 from section 5.1: the /24 will likely be the same, and you can block by IP prefix instead.
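Blocking the prefix is a one-liner at the web server. The prefix below is the residential-proxy example from the log excerpt at the top, used here as an assumption; substitute whatever /24 your Step 2 output actually shows:

```nginx
# server or http context: deny the offending prefix outright
deny 185.224.128.0/24;
```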

6. How to Catch This Early

Fixing this is straightforward once you know the cause; the fixes above are mechanical. The hard part is noticing the shape of your traffic shift before it costs you a week of CPU and a hosting upgrade.

wp_request_fingerprint_top exists for exactly this. It is emitted continuously by the Logystera WordPress plugin. Every request contributes to a fingerprint count, and the top-N per entity per minute lands in metrics. When the distribution shape shifts — when the top fingerprint moves from 0.5% of traffic to 30% — the rule fires and you see the offender, the IP prefix, and the path class in one view. No log diving. No correlation work.

The same plugin emits wp_top_bot_uris so you know what they are taking, wp_bot_requests_total for raw volume alerting, and wp_request_peak_memory_mb so you can tell whether a bot is just noisy or actually expensive. None of these signals exist in stock WordPress. None of them are surfaced by your hosting dashboard. Without continuous fingerprinting, this class of failure is only ever caught after the fact, in the next month's bill.

The detection is not clever. It is a counter, a normalization, and a top-N. The reason it works is that it runs all the time and someone — or something — is watching it.

7. Related Silent Failures

Same logs, same blind spots, often the same bots wearing different clothes:

  • WordPress REST API hammered with login attempts — when a fingerprint with high wp_bot_requests_total shifts from /feed/ to /wp-json/jwt-auth/*, the scraper has graduated to credential stuffing.
  • WordPress XML-RPC system.multicall amplification — vulnerability scanners probing /xmlrpc.php are often the same fingerprints walking /wp-content/plugins/*/readme.txt.
  • WordPress username enumeration via /?author=N — a wp_top_bot_uris entry for /?author=N is reconnaissance for the credential-stuffing attack arriving 24 hours later.
  • PHP memory_limit exhaustion under bot load — a sustained wp_request_peak_memory_mb climb from uncached rebuilds ends in a PHP fatal: "Allowed memory size exhausted."
  • WordPress slow queries triggered by archive walkers — sequential /category/*/page/N/ scraping forces full table scans on wp_posts. The slowlog shows it; the fingerprint shows who.

Each surfaces as a distinct pattern on the same fingerprint substrate. Watching the shape of wp_request_fingerprint_top is what lets you see them coming.

See what's actually happening in your WordPress system

Connect your site. Logystera starts monitoring within minutes.

Logystera
Monitoring for WordPress and Drupal sites. Install a plugin or module to catch silent failures — cron stalls, failed emails, login attacks, PHP errors — before users report them.
Copyright © 2026 Logystera. All rights reserved.