Drupal queue workers stuck — finding the bad queue item

1. Problem

You open Drupal's status report. Cron last ran two minutes ago. The queue table — queue, or your Advanced Queue's advancedqueue — has 14,000 items. Yesterday it had 8,000. The day before, 3,000. Cron itself is finishing without errors. But the queue is growing faster than it is being drained, and items that should have shipped emails, rebuilt search indexes, posted webhooks, or pushed Salesforce updates are simply not happening.

This is the canonical shape of the "drupal queue workers stuck" problem: cron looks healthy from a green checkmark perspective, queues are visibly backing up, and somewhere inside that backlog is a single bad item — a poison message — that every worker run grabs, fails on, requeues, and grabs again on the next pass. The fleet has not stopped. It is locked in a loop on one item, or a small cluster of items, while every other queued job waits behind them.

This guide is about the queue.item_failed signal Logystera emits when a Drupal queue worker throws inside processItem(), why Drupal core's queue API hides that failure from the status report, and how to identify the exact item id, queue name, and exception trace so you can purge or fix it.

2. Impact

Drupal queues are the connective tissue of every non-trivial site: Search API indexing, email delivery, commerce webhooks, Migrate sync. When one item jams the worker, consequences cascade asymmetrically:

  • Email queue blocked: password resets and order confirmations do not go out. Customers think the site is broken.
  • Search API queue blocked: new content is published but never appears in search results. Editors swear they hit "Save".
  • Commerce queue blocked: webhook to ERP never fires. Orders exist in Drupal but not in the books. Reconciliation drift compounds daily.
  • Advanced Queue blocked on a poisoned job: the job stays in processing forever, holding its lease; the "max attempts" counter never advances, and items can be retried 50,000 times in a day.

The site is up. Pages render. Cron runs. By every conventional definition of "healthy", it is healthy. It is also losing money on every minute the queue grows.

3. Why It’s Hard to Spot

Drupal's queue API was designed for fire-and-forget work, not observability:

  1. Exceptions are not fatal to cron. Drupal's cron service wraps each processItem() call in a try/catch that logs and continues. The cron run completes successfully from the OS perspective. Your monitoring sees green.
  2. Status report does not surface queue health. /admin/reports/status shows "Last run: 2 minutes ago" but not queue depth, lag, or failure rate.
  3. dblog is volatile. Default retention is 1,000 rows. On a busy site, a queue-failure line is overwritten within minutes.
  4. Lease semantics hide repeated failures. A poisoned item gets re-claimed every lease_time seconds. To dblog, that's "the same exception 100 times". To Drupal core, it's "100 unrelated incidents".
  5. Cron is decoupled from the symptom. The user-visible failure is "I didn't get my password reset email". Nobody connects that to a queue item id buried in a database table.

The signal exists — processItem() threw — but Drupal routes it to a logging channel that is rotated, unmonitored, and stripped of the structured fields you need (queue name + item id + class) to act.

4. Cause

A queue.item_failed signal is emitted when a queue worker plugin's processItem() method throws an exception. Inside Drupal, the lifecycle is concrete:

  1. Drupal's cron service (or drush queue:run, or Advanced Queue's daemon) iterates over registered queue worker plugins.
  2. For each worker, it calls claimItem() on the queue backend. The backend returns one row from queue and sets a lease (expire = time() + lease_time).
  3. The worker invokes processItem($data). If this returns normally, deleteItem() is called and the row is removed.
  4. If processItem() throws \Drupal\Core\Queue\SuspendQueueException, the entire queue is paused for the rest of the cron run.
  5. If processItem() throws any other exception, the item is not deleted. Its lease expires after lease_time seconds (default 30, often 3600), and any worker can claim it again.

That last step is the trap. A thrown \Exception looks like a transient failure to core. Core does not increment a failure counter, does not move the item to a dead-letter queue, and does not log structured failure metadata. The item sits in the database, waits for its lease to expire, and gets re-claimed on the next cron pass. Forever.
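
A condensed sketch of that loop, paraphrasing what \Drupal\Core\Cron::processQueues() does (simplified for illustration; the exact structure varies by core version):

foreach ($queue_workers as $queue_name => $worker) {
  $queue = $queue_factory->get($queue_name);
  while ($item = $queue->claimItem($lease_time)) {
    try {
      $worker->processItem($item->data);
      // Success: the row is removed and never seen again.
      $queue->deleteItem($item);
    }
    catch (\Drupal\Core\Queue\SuspendQueueException $e) {
      // Release the item and skip the rest of this queue for this run.
      $queue->releaseItem($item);
      break;
    }
    catch (\Exception $e) {
      // Logged to the 'cron' channel, and nothing else. No deleteItem(),
      // no failure counter. The lease simply expires and the item is
      // re-claimed on the next pass.
      $logger->error($e->getMessage());
    }
  }
}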

The queue.item_failed signal payload Logystera captures includes the queue name, the item id, the exception class, the exception message, and the worker plugin id. That is the diagnostic skeleton. Without it, you have a generic "cron emitted some warnings" event in the dblog that nobody can act on.

When a poisoned item triggers a fatal — memory exhaustion mid-process, a recursive call into a bad entity reference — you also see a php.fatal from the worker process. When cron pickup stops happening, you see a gap in the cron.run heartbeat. When items pile up faster than they drain, queue.depth climbs monotonically.

5. Solution

5.1 Diagnose (logs first)

You need five sources, in order: Drupal dblog, the PHP error log, the queue tables themselves, cron's stdout under drush, and the cron heartbeat in state storage.

1. Find the queue.item_failed signal in dblog

Every uncaught exception in processItem() lands in dblog under the cron channel:

drush sql:query "SELECT timestamp, message, variables FROM watchdog \
  WHERE type='cron' AND severity <= 3 ORDER BY wid DESC LIMIT 50"

Or via syslog if you've redirected dblog:

grep -E "Cron run exited|Exception (thrown|while running)" /var/log/syslog | tail -100

Each match where the message contains a queue worker class is one queue.item_failed signal. Count repeats by message prefix to find the offender:

drush sql:query "SELECT COUNT(*) c, SUBSTRING(message, 1, 120) m FROM watchdog \
  WHERE type='cron' GROUP BY m ORDER BY c DESC LIMIT 10"

A message that appears 200+ times in a day with the same exception text is the poison item.

2. Cross-reference the PHP error log for fatals

If the worker died from a php.fatal (memory exhaustion, segfault, allowed memory size), dblog will not capture it — the process died before it could write. Check PHP's own log (the FPM log if cron is triggered over the web; the CLI error_log destination if it runs under drush):

grep -E "Fatal error|Allowed memory size|Out of memory" /var/log/php*-fpm.log \
  | grep -i "queue\|drush\|cron" | tail -50

A php.fatal correlated to a cron run window means the worker process itself died mid-item. The queue lease will expire and the same item will be picked up again, killing the next worker the same way.

3. Pull the actual stuck item from the queue tables

This is the load-bearing step. The queue.item_failed signal tells you which queue and which worker class are broken. The queue tables tell you which row.

For Drupal core's DatabaseQueue:

drush sql:query "SELECT item_id, name, created, expire, LENGTH(data) as size \
  FROM queue WHERE name='[queue_name]' ORDER BY created ASC LIMIT 20"

Items with expire > 0 and expire < UNIX_TIMESTAMP() are leased-but-failed (lease expired without delete — the worker did not survive processItem). For advancedqueue:

drush sql:query "SELECT id, queue_id, state, num_retries, processed, expires \
  FROM advancedqueue WHERE state IN ('processing', 'failure') ORDER BY processed DESC LIMIT 20"

Anything with state='processing' and expires in the past is stuck. Inspect the payload of the suspect item:

drush php:eval 'print_r(unserialize(\Drupal::database()->query(
  "SELECT data FROM queue WHERE item_id=12345")->fetchField()));'

Now you know the entity id, the URL, the email recipient, the migration row — whatever the worker was choking on.

4. Run the queue manually with verbose output

Once you have the suspect queue name, run it under drush with verbose mode:

drush -v queue:run [queue_name] --time-limit=60 2>&1 | tee /tmp/queue-debug.log

The full exception trace appears on stdout, including the file and line where processItem() threw — the trace dblog truncates and production cron throws away.

5. Watch the cron.run heartbeat

If cron.run is missing entirely, the symptom is identical from the queue depth angle but the cause is upstream:

drush sql:query "SELECT FROM_UNIXTIME(value__value) FROM key_value \
  WHERE collection='state' AND name='system.cron_last'"

If that timestamp is hours old, cron is not running at all. Check your cron daemon, hosting platform's cron trigger, or the crontab entry calling drush cron.
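
If the crontab entry is the problem, a typical /etc/cron.d-style line looks like this (user, paths, and schedule are illustrative; adjust for your host):

*/5 * * * * www-data /var/www/html/vendor/bin/drush --root=/var/www/html cron >> /var/log/drush-cron.log 2>&1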

Grep-to-signal map:

  • dblog WHERE type='cron' filtered to exceptions → queue.item_failed
  • grep "Allowed memory size" php-fpm.log in cron window → php.fatal killing a worker
  • SELECT against queue WHERE expire > 0 AND expire < UNIX_TIMESTAMP() → leased-but-failed items, the DB-side shape of queue.item_failed
  • system.cron_last timestamp gap → missing cron.run heartbeat
  • SELECT COUNT(*) FROM queue GROUP BY name over time → climbing queue.depth
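
A cheap way to watch that depth curve between cron runs, straight from a shell:

watch -n 60 'drush sql:query "SELECT name, COUNT(*) AS depth FROM queue GROUP BY name"'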

5.2 Root Causes

(Root causes are paired with their fixes inline in 5.3 below.)

5.3 Fix

Rank by what dblog and the queue table actually show.

Cause 1: Poisoned item — entity reference to a deleted node

By far the most common. A worker calls Node::load($nid) where $nid no longer exists, then dereferences null.

  • Signal shape: queue.item_failed with Error: Call to a member function id() on null (or a similar null dereference), repeating with the same item id every cron run.
  • Fix: Patch the worker to null-check, as sketched below (silently dropping bad refs is acceptable in most workers), or delete the offending row: drush sql:query "DELETE FROM queue WHERE item_id=12345". For advancedqueue, set the job state to failure.
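
A minimal sketch of that null-check inside the worker's processItem() (assuming the payload stores the target node id under a nid key; match it to your worker's actual data shape):

public function processItem($data) {
  $node = \Drupal\node\Entity\Node::load($data['nid']);
  if (!$node) {
    // The referenced node was deleted after the item was queued.
    // Returning normally lets cron call deleteItem() and drop the row.
    return;
  }
  // ... the worker's real work on $node ...
}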

Cause 2: External API timeout — webhook target down

Worker calls Salesforce/Stripe/Mailgun and times out. Guzzle throws ConnectException. Lease expires. Same item, same timeout, same dead worker.

  • Signal shape: queue.item_failed with GuzzleHttp\Exception\ConnectException, clustered by minute. queue.depth ramps during the outage.
  • Fix: Wrap the external call in try/catch with bounded retries. Throw SuspendQueueException after 5 consecutive failures so the rest of the cron run is not wasted. Switch to Advanced Queue's exponential backoff so retries don't hammer cron.
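
In its simplest form the guard suspends on the first connection failure (tracking five consecutive failures would require keeping state between items); $this->httpClient and the url payload key are assumptions for illustration:

public function processItem($data) {
  try {
    $this->httpClient->post($data['url'], ['timeout' => 10]);
  }
  catch (\GuzzleHttp\Exception\ConnectException $e) {
    // The target is down for every item, not just this one. Suspend the
    // whole queue for this cron run rather than failing item after item.
    throw new \Drupal\Core\Queue\SuspendQueueException($e->getMessage(), 0, $e);
  }
}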

Cause 3: Memory exhaustion on large entity

Worker calls Node::load() on a node with 50,000 paragraphs or a Migrate row with 8MB of source data. PHP hits memory_limit mid-processItem().

  • Signal shape: queue.item_failed not captured (process died before logging) plus php.fatal "Allowed memory size of N bytes exhausted" in cli error log, correlated to cron window.
  • Fix: Increase memory_limit for the cron CLI (not the web pool). For Migrate, use --limit and --feedback. For paragraph-heavy entities, stream-load via direct DB queries instead of full entity load.
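
One way to raise the limit for a single manual drain without touching the web pool or php.ini:

php -d memory_limit=1024M vendor/bin/drush queue:run [queue_name] --time-limit=120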

Cause 4: Lease too short — worker still running when re-claimed

Default lease_time is 30 seconds. A worker that legitimately takes 60 seconds gets its item re-claimed by a parallel cron while it is still working. Two workers process the same item; one deletes it; the other throws on a missing row.

  • Signal shape: queue.item_failed with EntityStorageException or "row not found", intermittent rather than consistent.
  • Fix: Raise the worker's cron time via hook_queue_info_alter() so items are claimed with a longer lease (sketch below). Disable concurrent cron, or run the queue under advancedqueue daemon mode.
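
A sketch of that alter hook (mymodule and mymodule_slow_worker are hypothetical names; in recent core versions this cron time is also passed to claimItem() as the lease):

/**
 * Implements hook_queue_info_alter().
 */
function mymodule_queue_info_alter(array &$queues) {
  // Give a legitimately slow worker a longer window per cron run.
  if (isset($queues['mymodule_slow_worker'])) {
    $queues['mymodule_slow_worker']['cron']['time'] = 120;
  }
}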

Cause 5: Code regression — bad deploy

A deploy changed a worker's signature, DI, or schema expectation. Existing items serialized against the old code shape now fail to deserialize.

  • Signal shape: queue.item_failed rate jumps from ~0 to hundreds per minute correlated with deploy timestamp. cron.run heartbeat is fine.
  • Fix: Rollback or hotfix the worker to handle both old and new payload shapes. Drain the queue before redeploying.

Cause 6: Cron not actually running

Not a queue-worker issue but presents identically: depth climbs, no work happens. The worker hasn't thrown — it hasn't been invoked.

  • Signal shape: No queue.item_failed at all. cron.run heartbeat absent for hours. system.cron_last timestamp is stale.
  • Fix: Restart the cron daemon. Confirm drush cron exits 0. Check hosting platform cron settings (Pantheon, Acquia, Platform.sh each have their own).
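
A quick sanity check on the trigger itself:

drush cron; echo "exit code: $?"

Anything other than exit code: 0 means drush is failing before the queue workers are ever invoked.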

5.4 Verify

You verify against the disappearance of queue.item_failed for the offending queue and the drain of queue.depth toward zero.

  1. Confirm the poison item is gone. The item ids you identified in 5.1 should no longer appear:
drush sql:query "SELECT item_id, name, created FROM queue WHERE name='[queue_name]' \
  ORDER BY created ASC LIMIT 5"
  2. Watch the queue drain. Run manually first to confirm forward progress, then let cron run normally and check depth every interval:
drush queue:run [queue_name] --time-limit=120
drush sql:query "SELECT name, COUNT(*) FROM queue GROUP BY name"

Healthy: depth strictly decreasing across consecutive cron runs.

  3. Tail dblog for the absence of queue.item_failed:
drush watchdog:tail | grep -E "queue|cron"

Healthy: no new exception lines for the previously-failing queue across at least three full cron intervals (typically 30 minutes).

  4. Confirm the cron.run heartbeat is regular: system.cron_last should advance every cron interval. Confirm no php.fatal in the worker process: tail -f /var/log/php*-fpm.log | grep -E "Fatal|memory size" should stay silent across a full cycle.

If queue.item_failed reappears for a different item, you have a second bad item, not an unfixed bug — re-run 5.1 against the new failing item id.

6. How to Catch This Early

Fixing it is straightforward once you know the cause. The hard part is knowing it happened at all.

Drupal queue health is a perfect example of a failure mode that is observable in principle and invisible in practice. The exception is logged. The queue table has the row. The PHP fatal is in the error log. Every piece of evidence exists somewhere on disk — none of it correlated, alerted on, or surfaced in a place a human will look before queue depth has been climbing for a week and a customer has noticed.

Drupal core does not alert on queue.item_failed. The status report does not show queue depth. dblog rolls over before you can investigate. The PHP error log usually lives on a different host from the database where the queue table sits.

This type of issue surfaces as queue.item_failed, which Logystera detects and alerts on the moment a worker throws — with the queue name, item id, exception class, and worker plugin id already extracted from the dblog row, correlated against the queue.depth curve and the cron.run heartbeat. You see the poison item by id within seconds of its first failure, not three days later when search results stop updating.

That is the difference between knowing your queue fleet is stuck and treating it as a tractable engineering problem you can resolve in one drush command.

7. Related Silent Failures

  • queue.depth growing while queue.item_failed is zero — cron is not running, or the worker is registered but not invoked. Check cron.run and system.cron_last.
  • php.fatal "Allowed memory size" during cron windows — worker dies before it can log to dblog. Correlated by timestamp to cron run.
  • cron.run heartbeat gap longer than your cron interval — Drupal cron is not running. Hosting cron trigger broken, drush failing to bootstrap, or system.cron_last lock stuck.
  • queue.item_failed clustered immediately after deploy — code regression invalidated existing serialized payloads.
  • http.request 5xx on /cron/[key] — public cron endpoint failing. Same story, different invocation path.

See what's actually happening in your Drupal system

Connect your site. Logystera starts monitoring within minutes.
