Drupal deployment regression — correlating the deploy event with the five things that follow

1. Problem

The deploy went green. Five minutes later the site is broken.

You're staring at the standard sequence: a CI pipeline that ended with drush deploy exiting 0, a Slack ping saying "production deploy complete," and a rising 5xx graph. Maybe the homepage still loads (Varnish is happy) but /admin/content returns a WSOD, or the contact form throws "The website encountered an unexpected error," or your JSON:API consumer started 500-ing at the deploy timestamp.

This is the textbook "drupal site broke after deployment what changed" scenario, and the dashboards are no help. drush deploy chains four sub-commands — updatedb, cache:rebuild, config:import, cache:rebuild — and any one can leave the site broken while still reporting [success]. By the time you realize the site is bleeding, the deploy output has scrolled off, the CI job is archived, and you have no anchor for what changed.

The deploy surfaces in Logystera as a drupal_deployment_events_total increment with the git SHA and timestamp. The five signals that follow it — config import, module enable/disable, PHP errors, 5xx surge, environment swap — are how you figure out which deploy step caused the regression.

2. Impact

A regression introduced at deploy time is the most expensive class of Drupal failure: it takes a known-good site and breaks it on your team's schedule — usually Friday at 16:30.

For a Drupal Commerce store, post-deploy regressions hit the checkout funnel hardest. A bad config import that disables a commerce_payment_gateway or strips a permission from anonymous user produces a silent checkout failure: customers add to cart, click "Place Order," and get a generic exception. The order row never writes — no abandoned cart to recover, no payment row to reconcile, just a quiet drop in conversion that finance notices three days later. A single hour of broken checkout on a mid-sized store routinely costs $8k–$25k in lost orders, plus customer-trust cost.

For a publisher or membership site, the cost is in the support inbox. A module enable that flips a Views display, a config import that overwrites a permission, an env-changed signal showing dev config in prod — each generates tickets that sound like "the page doesn't work anymore." Expect 30–60 minutes of broken site to produce 4–6 hours of support work.

The quietest cost: regressions that don't surface as 5xx. A permission strip on view own commerce_order is a drupal_config_import_total event with no error log, no exception. Dashboards stay green, the bug ships, and nobody notices for two weeks.

3. Why It’s Hard to Spot

Drupal deploys are uniquely opaque after the fact. Five different mutation channels can fire during a single drush deploy, and Drupal does not surface them as a unified event log:

  1. Config import mutates active config (permissions, views, field storage, third-party settings) — drupal_config_import_total per config item.
  2. Module enable/disable runs install hooks, schema updates, route rebuilds — drupal_module_changes_total per module.
  3. Database updates (updatedb) run hook_update_N — silent unless they throw.
  4. Settings/environment changes load on the next request — drupal_environment_changed_total when the active environment hash changes.
  5. Cached opcode + Drupal cache flips on cache:rebuild — invisible unless a previously masked PHP error now fires.

Drupal's own logging surfaces almost none of this. dblog records a generic "Updates were attempted" and a config-import success message — no diff, no list of items changed. drush deploy --verbose prints per-step output, but that output is gone unless CI captured it.

Standard uptime monitors miss post-deploy regressions because the site is up. A WSOD on /admin/content is invisible to a monitor that only checks /. A stripped permission is invisible to any monitor — the site returns 200 OK, just with the wrong content. And the CI dashboard shows green because drush deploy exited zero. The deploy reports success, the site reports 200, and the regression hides between them.

4. Cause

Every Drupal deployment that runs drush deploy (or its predecessors drush updb, drush cim, drush cr) generates a drupal_deployment_events_total signal the moment Drupal finishes bootstrapping with the new code. The Logystera Drupal agent hooks hook_post_update_NAME and drush_finish to emit it with three labels: the git SHA from composer.lock's version-name, the environment, and the deploy duration.

That signal is the anchor. By itself it doesn't say anything went wrong — it just marks the timestamp. The intelligence comes from the related_metrics chain wired into the metric definition: drupal_deployment_events_total is correlated against five signals that fire in the minutes after a deploy. Each pattern points at a different culprit:

  • drupal_config_import_total spike → config import changed something visible (permissions, views, fields).
  • drupal_module_changes_total spike → a module was enabled or disabled, often dragging schema updates and route changes with it.
  • drupal_php_error_total spike → new code triggers a fatal or recoverable error the old code didn't.
  • drupal_server_errors_total spike → 5xx rate climbed at the deploy boundary, regardless of cause.
  • drupal_environment_changed_total event → the deploy ran with a different environment fingerprint than before (usually wrong).

The deploy signal is what turns "the site broke at some point today" into "the site broke at the deploy at 14:03, and the next four minutes show a drupal_module_changes_total for commerce_payment followed by a drupal_php_error_total spike."
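The correlation itself is simple enough to sketch by hand. Below is a minimal illustration — the log path, sample lines, window, and patterns are stand-ins, not the agent's actual implementation — of counting each signal class in the minutes after a deploy timestamp:

```shell
# Illustrative only: a sample log and rough patterns stand in for the
# agent's real signal extraction.
log=/tmp/php-error.sample.log
cat > "$log" <<'EOF'
[10-Oct-2025 14:03:12] NOTICE: config_import user.role.authenticated
[10-Oct-2025 14:04:40] PHP Fatal error: Uncaught Error: Class "Foo" not found
[10-Oct-2025 14:05:02] PHP Fatal error: Uncaught Error: Class "Foo" not found
EOF

window='14:0[3-9]'                     # deploy at 14:03; the next ~7 minutes
for pat in 'config_import' 'enabled|uninstalled' 'Fatal|Uncaught'; do
  # Restrict to the post-deploy window, then count matches per pattern.
  count=$(awk -v w="$window" '$0 ~ w' "$log" | grep -ciE "$pat" || true)
  printf '%-22s %s\n' "$pat" "$count"
done
```

Whichever counter jumps first after the deploy line is the signal to chase in §5.1.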

5. Solution

5.1 Diagnose (logs first)

Confirm the deploy event, then walk the five correlated signals in order; the first one that fires is your culprit.

1. Anchor on the deploy event.

Find the exact deploy timestamp — drush deploy writes a recognizable opcache reset and bootstrap line:

grep -nE "drush.*deploy|cache_rebuild|opcache.*reset" /var/log/php-fpm/error.log | tail -n 20
journalctl --since "2 hours ago" -u php8.3-fpm | grep -iE "deploy|drush" | head -n 30

That timestamp is what drupal_deployment_events_total records. Every diagnostic step now answers "what fired in the 5 minutes after that line?"

2. Look for the config import spike.

drush watchdog:show --severity=Notice --type=config | head -n 30
grep -i "config_import\|configuration imported" /var/log/php-fpm/error.log | \
    awk -v deploy="14:03" '$0 ~ deploy' | head -n 20

A burst of config_import lines at the deploy boundary surfaces as drupal_config_import_total. The signal payload includes the config name (e.g., user.role.authenticated, views.view.frontpage) — the precise thing that changed.

3. Look for module enable/disable.

drush watchdog:show --type=system --count=50 | grep -iE "enabled|disabled|installed|uninstalled"
git -C /var/www/drupal log --since "30 minutes ago" -p config/sync/core.extension.yml

Any module that appears in or disappears from the module list in core.extension.yml (entries map module name to weight; enabling adds a line, uninstalling removes it) produces drupal_module_changes_total — the bridge to whatever the module's hook_install did (schema changes, route rebuilds, default config writes).
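To see exactly which modules flipped, diff the two versions of core.extension.yml. In production you'd extract them with git -C /var/www/drupal show HEAD~1:config/sync/core.extension.yml; the sketch below uses two sample files just to show the diff shape:

```shell
# Sample stand-ins; in production, extract the real versions with
#   git -C /var/www/drupal show HEAD~1:config/sync/core.extension.yml
cat > /tmp/ext.before.yml <<'EOF'
module:
  node: 0
  views: 10
EOF
cat > /tmp/ext.after.yml <<'EOF'
module:
  commerce_payment: 0
  node: 0
  views: 10
EOF

# ">" lines were enabled by the deploy; "<" lines were disabled.
diff /tmp/ext.before.yml /tmp/ext.after.yml | grep '^[<>]'
```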

4. Look for the PHP error spike.

grep "PHP" /var/log/php-fpm/error.log | \
    awk -v t="14:0[3-9]|14:1[0-3]" '$0 ~ t' | \
    grep -cE "Fatal|Uncaught|Error"

If this jumps from single digits into the dozens, drupal_php_error_total is firing. Classic post-deploy patterns: class autoload failure (Class not found after composer.lock changed without cache:rebuild) or a service-container error (drush cr skipped).

5. Look for the 5xx spike.

awk '$4 ~ /14:0[3-9]/ && $9 ~ /^5/ {print $9}' /var/log/nginx/access.log | \
    sort | uniq -c

drupal_server_errors_total aggregates the 5xx rate. A sustained jump — not a 30-second blip from cache flush — means real user traffic is hitting the regression.
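To tell a sustained rise from a blip, bucket the 5xx share per minute. A minimal sketch, assuming nginx combined log format ($4 is the timestamp, $9 the status); the sample lines stand in for /var/log/nginx/access.log:

```shell
log=/tmp/access.sample.log
cat > "$log" <<'EOF'
1.2.3.4 - - [10/Oct/2025:14:02:59 +0000] "GET / HTTP/1.1" 200 512 "-" "-"
1.2.3.4 - - [10/Oct/2025:14:03:10 +0000] "GET /cart HTTP/1.1" 500 312 "-" "-"
1.2.3.4 - - [10/Oct/2025:14:03:41 +0000] "GET /cart HTTP/1.1" 500 312 "-" "-"
1.2.3.4 - - [10/Oct/2025:14:04:05 +0000] "GET / HTTP/1.1" 200 512 "-" "-"
EOF

# Bucket requests by HH:MM and compute the 5xx share of each bucket.
awk '{split($4, t, ":"); m = t[2] ":" t[3]
      total[m]++; if ($9 ~ /^5/) err[m]++}
     END {for (m in total)
            printf "%s  %d/%d  %.0f%%\n", m, err[m], total[m],
                   100 * err[m] / total[m]}' "$log" | sort | tee /tmp/5xx.txt
```

Several consecutive minutes above baseline is a regression; one bad minute at the deploy boundary is usually just the cache flush.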

6. Look for environment fingerprint change.

md5sum /var/www/drupal/web/sites/default/settings.php \
       /var/www/drupal/web/sites/default/services.yml \
       /var/www/drupal/web/sites/default/settings.local.php 2>/dev/null

If the environment hash changed and you didn't expect it to, drupal_environment_changed_total fires. Classic case: CI accidentally deployed services.dev.yml to prod, flipping twig.config.debug: true and disabling render caching site-wide.
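One way to make this check mechanical: record a known-good fingerprint once after a deploy you have verified by hand, then verify it with md5sum -c in every subsequent deploy's post-step. The file path here is a stand-in for the real settings files:

```shell
# Stand-in for web/sites/default/settings.php and friends.
printf 'prod settings\n' > /tmp/settings.php

# Record once, after a verified deploy:
md5sum /tmp/settings.php > /tmp/env.baseline.md5

# Then in every deploy's post-step:
if md5sum -c --quiet /tmp/env.baseline.md5; then
  echo 'environment files unchanged'
else
  echo 'ENVIRONMENT FILES CHANGED -- investigate before trusting the deploy'
fi
```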

7. Time-correlate with the deploy SHA.

cd /var/www/drupal && git log --since "1 hour ago" --oneline --stat | head -n 50
git -C /var/www/drupal diff HEAD~1 HEAD -- config/sync/ | head -n 100

This is what the whole chain exists to answer: "the site broke after the 14:03 deploy of SHA e287b4a; the four minutes after show a drupal_module_changes_total for commerce_payment followed by a drupal_php_error_total spike — and the diff for that SHA touched commerce_payment.info.yml."

5.2 Root Causes

Each cause maps to a specific signal in the chain. Prioritized by frequency.

  • Bad config import — a config item in config/sync/ was committed in a broken state (permission removed by accident, view referencing a deleted field, third-party setting pointing at an un-enabled module). Produces drupal_config_import_total at deploy time and a delayed drupal_php_error_total or drupal_server_errors_total when the affected page is hit.
  • Module enable side effects — a newly-enabled module ran hook_install and rewrote default config, registered a route that shadows an existing path, or required a service that doesn't exist. Produces drupal_module_changes_total, then drupal_php_error_total from the route rebuild or service-container failure.
  • Module disable orphans — a module was disabled but its config items remain in config/sync/, or a custom module still calls a now-missing service. Produces drupal_module_changes_total and a Class not found storm in drupal_php_error_total.
  • composer.lock change without cache rebuild — the deploy updated a Composer dependency (e.g., a security release of symfony/http-kernel) but drush cr did not run, so the autoloader still maps to old class paths. Produces drupal_php_error_total with Class not found and no drupal_config_import_total. Smoking gun: the error appears immediately after the deploy event with no preceding config or module signal.
  • Wrong environment file deployed — CI pushed settings.dev.php or services.dev.yml to production. Produces drupal_environment_changed_total and an immediate drupal_server_errors_total spike from disabled caching.
  • hook_update_N failed silently — drush updatedb returned [success] but a hook ran a partial migration, leaving the schema inconsistent. No deploy-time signal, but generates drupal_php_error_total with DatabaseExceptionWrapper on the first request hitting the affected table.
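A quick first-pass classifier: the list of files the deploy's commit touched usually narrows the candidates before any log reading. In production you'd feed this from git -C /var/www/drupal diff --name-only HEAD~1 HEAD; the sample list below is illustrative, and the letters map to the causes in §5.3:

```shell
# Sample file list; in production, replace the printf with:
#   git -C /var/www/drupal diff --name-only HEAD~1 HEAD
printf '%s\n' \
  'config/sync/core.extension.yml' \
  'config/sync/user.role.authenticated.yml' \
  'composer.lock' |
awk '
  $0 == "config/sync/core.extension.yml" { print "module enable/disable -> B/C"; next }
  /^config\/sync\//                      { print "config import -> A"; next }
  /composer\.lock$/                      { print "dependency change -> D"; next }
  /settings.*\.php$|services.*\.yml$/    { print "environment file -> E"; next }
' | sort -u | tee /tmp/deploy-surfaces.txt
```

Cause F (a failed hook_update_N) leaves no file-level signature, which is exactly why it is the quietest of the six.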

5.3 Fix

The fix sequence is dictated by which signal in the chain fired first.

Cause A — Bad config import: roll back the specific config item using the previous SHA's config/sync/. Don't roll back the whole deploy unless multiple items are bad.

git -C /var/www/drupal show HEAD~1:config/sync/user.role.authenticated.yml > /tmp/role.yml
drush config:import --partial --source=/tmp --diff && drush cr

Cause B — Module enable side effect: identify the module from the drupal_module_changes_total payload, then either roll forward (fix the offending route/service) or uninstall and re-deploy without it.

drush pm:uninstall <module_name> && drush cim && drush cr

Cause C — Module disable orphans: the inverse of B. Run drush config:status to find drift, then re-enable the module or remove its config items from sync.

Cause D — Composer change without cache rebuild: run drush cr. If the autoloader still misses a class, dump it and recycle PHP-FPM.

cd /var/www/drupal && composer dump-autoload -o
systemctl reload php8.3-fpm && drush cr

Cause E — Wrong environment file: revert the offending file and recycle PHP-FPM.

git -C /var/www/drupal checkout HEAD~1 -- web/sites/default/services.yml
systemctl reload php8.3-fpm

Cause F — Failed hook_update_N: check key_value for the schema version, manually advance or roll back, re-run.

drush sqlq "SELECT name, value FROM key_value WHERE collection='system.schema' AND name='<module>';"
drush updatedb --no-post-updates
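One detail worth knowing before touching that row: the value column holds a PHP-serialized integer (e.g. i:9301; for schema version 9301, to the best of my knowledge), not a bare number. A tiny sketch of the encoding, no database required:

```shell
# system.schema values are PHP-serialized ints: i:<version>;
ver='i:9301;'

# Strip the serialization wrapper to read the bare version number:
num=${ver#i:}
num=${num%;}
echo "schema version: $num"

# Wrap a bare number back up before writing it to key_value:
rollback=9300
echo "serialized: i:${rollback};"
```

So a manual rollback writes value='i:9300;' (serialized), not value='9300'.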

5.4 Verify

Two conditions must hold simultaneously: the post-deploy signal chain is silent, and drupal_server_errors_total returns to baseline.

# Should be empty for at least 15 minutes after the fix:
grep -E "PHP Fatal|Uncaught" /var/log/php-fpm/error.log | tail -n 5
awk '{print $9}' /var/log/nginx/access.log | grep "^5" | tail -n 20

# JSON:API and admin paths return 200 (or 403 for anonymous requests to /admin/content), not 500
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/admin/content
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/jsonapi/node/article
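These spot checks are worth wrapping in a loop you can rerun after every fix attempt. A minimal sketch — the URLs and expected codes are illustrative; assert the code you actually expect per path:

```shell
# check <url> <expected_http_code>
check() {
  # "|| true" so an unreachable host reports FAIL instead of aborting.
  code=$(curl -s --max-time 5 -o /dev/null -w '%{http_code}' "$1" || true)
  if [ "$code" = "$2" ]; then
    echo "OK   $1 ($code)"
  else
    echo "FAIL $1 (got ${code:-none}, want $2)"
  fi
}

check https://example.com/ 200
check https://example.com/jsonapi/node/article 200
check https://example.com/admin/content 403   # anonymous request
```

Extend the list with your checkout and webform routes; a single FAIL line means the regression is not fully resolved.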

In Logystera's entity view, a healthy post-deploy state is: a single drupal_deployment_events_total increment at the deploy timestamp, no follow-on drupal_php_error_total spike, drupal_server_errors_total 5xx rate under 0.5%, and no drupal_environment_changed_total event unless intended.

Supporting-signal baselines:

  • drupal_php_error_total: 0–3/hour from contrib deprecation noise; above 10/hour post-deploy is anomalous.
  • drupal_server_errors_total: under 0.5% of requests is normal; sustained above 1% for 5+ minutes is regression.
  • drupal_config_import_total: zero between deploys; non-zero outside a deploy window means a manual drush cim.
  • drupal_module_changes_total: zero is normal; non-zero outside a deploy means a manual pm:enable or pm:uninstall.
  • drupal_environment_changed_total: zero is the only normal value in production.

If the deploy fired at 14:03 and 30 minutes later all five supporting signals are at baseline, the regression is resolved. If drupal_php_error_total is still firing 5+/hour, you fixed a symptom without fixing the cause.

6. How to Catch This Early

Fixing it is straightforward once you know the cause. The hard part is knowing it happened at all.

This issue surfaces as drupal_deployment_events_total.

Everything you just did manually — anchor on the deploy timestamp, walk through config import, module changes, PHP errors, 5xx rate, and environment changes in order — Logystera does automatically. The Drupal agent emits drupal_deployment_events_total the instant drush deploy finishes, and the metric definition's related_metrics field wires the five supporting signals into a triage chain that renders as a single post-deploy panel in the entity view.

[Figure: Logystera dashboard — drupal_deployment_events_total with correlated supporting signals, last 24h. Deploy at 14:03 followed by a drupal_module_changes_total event and a drupal_php_error_total spike within four minutes.]

The rule that fires is id 521 — Drupal post-deploy regression detected, severity warning, threshold any deploy event followed by ≥3 supporting-signal anomalies within 10 minutes. The rule does not fire on the deploy event alone — a clean deploy with no follow-on spikes is silent. It fires only when the chain lights up.

[Figure: Logystera alert — Drupal post-deploy regression detected. The warning fires within 10 minutes of a deploy event when the supporting-signal chain breaks baseline, and includes the deploy SHA and the names of the supporting signals that spiked.]

The alert payload includes the deploy timestamp, git SHA, and a per-signal breakdown showing which of the five supporting signals broke baseline and by how much — enough to decide which of the six root causes in §5.2 you have before you SSH into the box.

Logystera turns this from a 30-minute customer-reported puzzle into a 10-minute notification carrying the deploy SHA and the specific supporting signal that broke the chain.

7. Related Silent Failures

  • drupal_config_import_total spike without a deploy — somebody ran drush cim manually on production. Same regression class, no deploy event to anchor on.
  • drupal_module_changes_total outside a deploy window — manual drush pm:enable/uninstall in production. Often a hotfix that bypassed CI.
  • drupal_environment_changed_total without a deploy — settings.php was hand-edited on the box. Almost always means someone bypassed your config-as-code pipeline.
  • Failed hook_update_N with no signal — drush updatedb exits zero but the schema is half-migrated. Surfaces only as drupal_php_error_total with DatabaseExceptionWrapper on the first affected request.
  • drupal_config_import_failed — distinct from drupal_config_import_total. The import itself threw and the site is in a half-imported state. Covered in T3 #21 drupal-config-import-failed.

See what's actually happening in your Drupal system

Connect your site. Logystera starts monitoring within minutes.

Copyright © 2026 Logystera. All rights reserved.