WordPress RSS / Atom feeds hammered by scrapers — detection and rate-limiting

1. Problem

You open New Relic, htop, or your hosting panel and CPU is pinned at 90%+. PHP-FPM workers are saturated. The site itself loads, barely. You check access logs and the top URL is /feed/. Then /comments/feed/. Then /category/news/feed/. Hundreds of requests per minute. Same handful of user agents. Different IPs every time.

This is the classic shape of a WordPress RSS feed hammered by bots — and it is one of the most common silent CPU disasters on a default WordPress install. Everything looked fine before. Pages load. Admin works. There is no PHP fatal, no 500 error, no obvious explanation in the dashboard. But the server is melting, and the cause is not your plugins or your theme. It is feed.php rendering a full WP_Query over and over, uncached, for crawlers you never asked for.

If you searched for "wordpress feed.php causing high cpu" or "wordpress block aggressive feed scrapers", you are in the right place. This guide explains the failure mode, shows you how to find the worst offenders in logs, and walks through the three real fix paths: cache, rate-limit, or kill the bot outright.

2. Impact

A single feed render is not cheap. WordPress executes WP_Query for the latest posts (usually 10), pulls post meta, runs every filter hooked into the_content and the_excerpt, and serializes the whole thing to XML. On a typical install that is 80–200ms of PHP and a handful of database round-trips. Multiply that by 30 requests per second from AI training crawlers and aggregators (30 req/s at 200ms each is 6 CPU-seconds of work every second), and you have a sustained 6 vCPU load coming from one endpoint.

The consequences in production:

  • PHP-FPM saturation. Every worker tied up rendering a feed is a worker not serving a real visitor. Page load times spike, then the queue fills and visitors get 502s.
  • Database lock contention. Repeated WP_Query against the same posts pulls the same rows, but on a hot site the read load competes with writes from comments, WooCommerce orders, and admin edits.
  • Memory pressure. Feed rendering peaks at 60–120MB per request on content-heavy sites. With 20 concurrent feed requests, that is 1.2–2.4GB consumed by scrapers alone.
  • Hosting bills. Managed WP hosts charge by visitor count or PHP execution time. Feed scraping shows up as a 5x cost increase with no traffic to show for it.
  • Content theft. AI training crawlers pull your full RSS to seed model datasets. Aggregators republish without attribution.

3. Why It’s Hard to Spot

WordPress does not warn you about feed scraping. There is no admin notice, no Site Health entry, no email. From WordPress's perspective, every feed request is a successful 200 response — it ran, it returned XML, the visitor got what they asked for. The application layer is doing its job.

Uptime monitors miss it entirely. The site is up. Pages render. Synthetic checks pass. The only externally visible symptom is slowness, which gets blamed on hosting, on the theme, on the most recently updated plugin — anything but the actual cause.

Hosting dashboards usually show CPU and memory but not which endpoint is consuming them. You see a red bar and have to dig into raw access logs to figure out why. By the time you correlate /feed/ traffic with the CPU spike, the site has already been degraded for hours.

Feeds are also not on most engineers' threat models. We worry about login brute force, REST API enumeration, XML-RPC abuse — feed scraping feels benign because it is reading public content. The damage is not data theft, it is resource exhaustion. Silent, expensive, and chronic.

The other reason it is hard: feed crawlers rotate IPs aggressively. Blocking one IP does nothing. Blocking a /24 does nothing. They use residential proxy pools with thousands of exit nodes. The only stable signal is the user agent and the request fingerprint — which is exactly what wp_request_fingerprint_top is built to expose.

4. Cause

The Logystera WordPress plugin emits wp_feed_requests_total every time WordPress serves a feed. Internally, the plugin hooks into do_feed and template_redirect to detect feed rendering before the heavy work begins. The signal carries labels for feed_type (rss2, atom, rdf, comments, category, tag), cache_hit (true/false if a feed cache plugin is active), and bot_class (derived from user agent matching against a known crawler list).

A healthy WordPress site with a real audience produces 10–200 wp_feed_requests_total events per hour, mostly from feed readers (Feedly, Inoreader, NetNewsWire) and well-behaved search engines. The rate looks like a flat baseline with small daily peaks.

A scraped site produces 3,000–50,000 events per hour, with the rate climbing in step changes as new crawlers join. The label distribution shifts: bot_class=ai_crawler and bot_class=unknown dominate, cache_hit=false is near 100%, and a single feed_type (usually rss2) accounts for 80%+ of traffic.

This is what the wp_feed_requests_total signal represents in practice — it is a direct count of how many times WordPress executed the feed-rendering pipeline. Each increment is a WP_Query, a template render, and an XML serialization that just happened on your server. When the rate goes up, your CPU goes up. They are the same event measured at two layers.

The supporting signals fill in the picture. wp_bot_requests_total counts every request classified as bot traffic across all endpoints, so you can see whether feed scraping is part of broader crawling or isolated. wp_request_fingerprint_top ranks the top request fingerprints (a hash of user-agent + IP-prefix + path-pattern), which is how you identify the specific scraper without staring at raw logs. wp_request_peak_memory_mb tracks peak PHP memory per request, and feed rendering on content-heavy sites shows up here as a sustained right-tail.
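
You can approximate that fingerprint ranking straight from the access log before any plugin is installed. A rough sketch, assuming IPv4 clients and nginx's default combined log format (the field positions below depend on it):

awk -F'"' '{
  split($1, pre, " "); split(pre[1], oct, ".")  # first field of the line is the client IP
  prefix = oct[1] "." oct[2]                    # collapse the IP to its /16 prefix
  split($2, req, " "); path = req[2]            # second quoted field is the request line
  sub(/\?.*/, "", path)                         # drop the query string
  print prefix " | " $6 " | " path              # $6 is the user agent
}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -15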

5. Solution

5.1 Diagnose (logs first)

Start at the web server. WordPress feed requests hit /feed/, /comments/feed/, /feed/atom/, /{post-type}/feed/, or any URL ending in /feed/?.... On nginx the access log path is usually /var/log/nginx/access.log or /var/log/nginx/your-site.access.log. On Apache it is /var/log/apache2/access.log.

Count feed requests in the last hour:

grep -E ' "GET [^"]*/feed/?( |\?)' /var/log/nginx/access.log \
  | awk -v d="$(date -d '1 hour ago' '+%d/%b/%Y:%H')" '$4 >= "["d' \
  | wc -l

If that number is over 1,000 on a site that does not have a podcast or a major audience, you are being scraped. This produces the wp_feed_requests_total signal at the application layer — every line in the access log corresponds to one increment of that counter.

Find the top user agents hitting feeds:

grep -E ' "GET [^"]*/feed/?( |\?)' /var/log/nginx/access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn | head -20

You will see entries like Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com), GPTBot, CCBot, Bytespider, PerplexityBot, and a long tail of generic python-requests/2.x and curl/7.x. Each row maps to a bot_class label on the wp_feed_requests_total signal, and the row count maps to the rate.

Pull the top source IP ranges, even though they rotate:

grep -E ' "GET [^"]*/feed/?( |\?)' /var/log/nginx/access.log \
  | awk '{print $1}' | cut -d. -f1-2 \
  | sort | uniq -c | sort -rn | head -10

A handful of /16 ranges concentrating most traffic confirms a residential proxy pool. This is the same pattern the wp_request_fingerprint_top signal extracts automatically by hashing UA + IP prefix + path.

Check WordPress's debug log for memory pressure during the scrape window. Enable WP_DEBUG_LOG in wp-config.php and look at wp-content/debug.log:

grep -E "Allowed memory size|peak memory" wp-content/debug.log | tail -50

If you see peak memory creeping toward memory_limit, that is wp_request_peak_memory_mb climbing — feed rendering on a content-heavy site is one of the few pages that legitimately consumes 80–150MB per request, and it correlates exactly with the feed traffic spike.
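
If WP-CLI is available, you can enable the debug log without hand-editing wp-config.php (run from the WordPress root; --raw writes the values as unquoted booleans):

wp config set WP_DEBUG true --raw
wp config set WP_DEBUG_LOG true --raw
wp config set WP_DEBUG_DISPLAY false --raw   # keep errors out of rendered pages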

For PHP-FPM saturation, check the slow log (e.g. /var/log/php-fpm/www.slow.log, written only when request_slowlog_timeout is set):

grep -A 20 "feed" /var/log/php-fpm/www.slow.log | head -100

Stack traces showing do_feed, load_template, and WP_Query::get_posts confirm feed rendering is what is blocking workers.

5.2 Root Causes

Root causes map one-to-one onto the fix paths, so each is covered inline in 5.3 Fix below.

5.3 Fix

Three real fix paths exist, and on a busy site you want all three layered.

Fix path 1 — cache the feed

Feeds change at most when you publish a post. Serving the same XML from cache for 5–15 minutes is not a tradeoff, it is correctness. The cause this addresses: every uncached feed request runs a full WP_Query, which appears in logs as a 200 response with 100–300ms TTFB and a fresh wp_feed_requests_total event labeled cache_hit=false.

If you run nginx in front of WordPress, add a fastcgi_cache rule for feed URLs:

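# assumes a cache zone and key are already defined in the http block, e.g.:
# fastcgi_cache_path /var/cache/nginx levels=1:2 keys_zone=WORDPRESS:100m inactive=60m;
# fastcgi_cache_key "$scheme$request_method$host$request_uri";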
location ~* /feed/?$ {
    fastcgi_cache WORDPRESS;
    fastcgi_cache_valid 200 10m;
    fastcgi_cache_bypass 0;
    fastcgi_no_cache 0;
    add_header X-Cache-Status $upstream_cache_status;
    include fastcgi_params;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
}

After deploying, watch X-Cache-Status: HIT increase. The signal effect: wp_feed_requests_total{cache_hit=true} rises, cache_hit=false drops by 90%+. CPU returns to baseline within minutes.
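
To confirm the cache is serving, request the feed twice and watch the header you just added (your-site.example is a placeholder):

curl -s -o /dev/null -D - https://your-site.example/feed/ | grep -i x-cache-status   # first hit: MISS
curl -s -o /dev/null -D - https://your-site.example/feed/ | grep -i x-cache-status   # second hit: HIT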

If you cannot edit nginx, install a page cache plugin that respects feeds (WP Super Cache, W3 Total Cache, LiteSpeed Cache) and explicitly enable feed caching in its settings — it is off by default in some of them.

Fix path 2 — rate-limit at the edge

Caching helps with repeat hits but not with crawlers that aggressively bust caches via query strings (/feed/?utm_source=bot). The cause this addresses: high-rate scrapers from rotating IPs that show up as a flat horizontal bar on wp_request_fingerprint_top, with the same UA + path pattern across hundreds of source IPs.

In nginx:

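# http {} context: buckets requests by the raw User-Agent string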
limit_req_zone $http_user_agent zone=feed_ua:10m rate=10r/m;

location ~* /feed/?$ {
    limit_req zone=feed_ua burst=5 nodelay;
    limit_req_status 429;   # nginx rejects with 503 by default; return 429 so scrapers see Too Many Requests
    # ... fastcgi_pass etc.
}

This caps any single user agent at 10 feed requests per minute. Legitimate feed readers poll once per 15–60 minutes and never hit this. Scrapers get 429s. The signal effect: wp_bot_requests_total stays flat or rises (they keep trying), but wp_feed_requests_total drops to baseline because rate-limited requests never reach PHP.
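
A quick way to confirm the limit is live, using a throwaway user agent (the UA string and domain are placeholders):

for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code}\n' -A 'feed-limit-probe' https://your-site.example/feed/
done
# expect the first request plus the burst of 5 to pass with 200, the rest to return 429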

Cloudflare users can do the same with a Rate Limiting rule: URI Path contains "/feed" → 10 requests per minute per UA → block.

Fix path 3 — return 410 Gone to specific bots

Some crawlers — AI training bots in particular — respect HTTP status codes and robots.txt more than rate limits. The cause this addresses: chronic, low-rate scraping from identifiable AI crawlers (ClaudeBot, GPTBot, CCBot, Bytespider, PerplexityBot) that show up clearly in the UA distribution you pulled in 5.1.

In nginx:

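# http {} context: the ~* prefix makes the regex case-insensitive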
map $http_user_agent $is_ai_bot {
    default 0;
    "~*ClaudeBot|GPTBot|CCBot|Bytespider|PerplexityBot|Diffbot|Amazonbot" 1;
}

location ~* /feed/?$ {
    if ($is_ai_bot) { return 410; }
    # ... fastcgi_pass etc.
}

410 Gone tells the crawler the resource is permanently unavailable, and well-behaved crawlers stop retrying. Combined with a robots.txt Disallow: /feed/ directive for these UAs, scraping from named bots drops within 24–48 hours. The signal effect: wp_feed_requests_total{bot_class=ai_crawler} drops sharply, and the residual feed traffic is real readers and search engines.
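
A matching robots.txt group might look like the sketch below. Disallow matches by path prefix, so /category/news/feed/ is not covered by /feed/ alone; the trailing wildcard rule handles those, though wildcard support varies by crawler:

User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /feed/
Disallow: /comments/feed/
Disallow: /*/feed/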

5.4 Verify

Pick a 30-minute window after deploying changes and check the signal directly. wp_feed_requests_total should drop by an order of magnitude on a previously scraped site — from thousands per hour to hundreds. The cache_hit=false slice should be near zero if you deployed caching, and the bot_class=ai_crawler slice should fall to near zero if you deployed UA blocking.

In logs, the equivalent check:

grep -E ' "GET [^"]*/feed/?( |\?)' /var/log/nginx/access.log \
  | awk -v d="$(date -d '30 minutes ago' '+%d/%b/%Y:%H:%M')" '$4 >= "["d' \
  | wc -l

A healthy result is under 200 for 30 minutes on a normal site. If it is still in the thousands, the cache is not catching, the rate limit is not matching, or you are blocking a different UA than the one actually hitting you — go back to 5.1 and re-pull the top user agent list.
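
If you deployed the 410 path, spot-check it by spoofing one of the blocked user agents (placeholder domain again):

curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot' https://your-site.example/feed/
# 410 means the map matched; 200 means the UA regex is not catching this bot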

PHP-FPM workers should drop back to single-digit active count under normal traffic. wp_request_peak_memory_mb distribution should lose its right-tail of 100MB+ requests if those were feed renders. CPU on the web server should return to baseline within 15 minutes of cache deployment, instantly with rate-limiting.
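
Two shell spot checks for the worker count, assuming PHP-FPM's status page is enabled via pm.status_path and exposed at /fpm-status (both are opt-in, so adjust paths to your setup):

curl -s localhost/fpm-status | grep -E 'active processes|listen queue'
ps -C php-fpm8.2 --no-headers | wc -l   # total pool workers; the process name varies by PHP version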

The verification window is short because feed scraping is high-volume — if you fixed it, you see the effect in the next data point, not the next day. If the wp_feed_requests_total rate has not dropped within 30 minutes, the fix did not land where you thought it did.

6. How to Catch This Early

Fixing it is straightforward once you know the cause. The hard part is knowing it happened at all.

The reason feed scraping runs for weeks before anyone notices is that no part of the WordPress stack alerts on it. Hosting dashboards show CPU but not endpoints. Uptime monitors show 200s. WordPress shows nothing. The traffic is technically legitimate — it is just expensive.

This failure surfaces as a sustained increase in wp_feed_requests_total, which Logystera detects and alerts on early via rate-of-change rules on the signal. The supporting signals tell you what to do next: wp_request_fingerprint_top identifies the worst offenders without you running grep at 2am, wp_bot_requests_total tells you whether feed scraping is part of broader crawling, and wp_request_peak_memory_mb confirms whether the requests are actually causing PHP memory pressure.

Logs reveal this immediately. Without log-driven detection, you find out when CPU pegs, the site degrades, and a customer complains. The signal exists in your access log the moment scraping starts — the question is only whether anything is watching for it.

7. Related Silent Failures

Feed scraping clusters with other bot-traffic and resource-exhaustion failures. If you saw this signal, check for:

  • WordPress REST API enumeration — /wp-json/wp/v2/users scraped by the same crawlers that hit feeds, surfaces as wp_rest_requests_total with route_group=users rate spikes.
  • xmlrpc.php pingback amplification — old XML-RPC endpoint used as DDoS reflector, surfaces as wp_xmlrpc_requests_total with method=pingback.ping from external IPs.
  • WooCommerce product feed scraping — competitors pulling /shop/?orderby=date&format=feed, surfaces as wp_feed_requests_total{feed_type=woocommerce_products}.
  • Search query DoS — bots hammering /?s= with random terms, each query bypassing object cache, surfaces as wp_search_requests_total rate spike correlated with wp_request_peak_memory_mb.
  • Login brute force from feed-scraping IP pools — same residential proxy networks used for feeds also run credential stuffing, surfaces as wp_login_attempts_total{result=failed} with overlap on the IP fingerprints from wp_request_fingerprint_top.
