Drift Guardrails

Purpose: Define acceptable infrastructure drift limits
Audience: Operators, DevOps
Verify: scripts/validate-infra.sh


What Is Drift?

Drift occurs when the actual infrastructure state differs from the state documented in TRUTH_MAP.md.

Examples of drift:
- New service added without updating TRUTH_MAP
- Environment variable changed without documentation
- Domain routing modified
- Database schema out of sync with migrations


Acceptable Drift

Some drift is acceptable during development:

| Category | Acceptable Drift | Duration |
|----------|------------------|----------|
| Feature branches | New env vars, test services | Until PR merged |
| Staging environment | Any changes | Permanent (staging is not documented) |
| Local development | Full freedom | N/A |

Unacceptable Drift

Production drift that must be resolved immediately:

| Category | Unacceptable Drift | Resolution |
|----------|--------------------|------------|
| Services | Undocumented service in production | Add to TRUTH_MAP or delete |
| Domains | Domain pointing to wrong target | Fix routing or update TRUTH_MAP |
| Secrets | Doppler out of sync with Railway/Vercel | Sync from Doppler |
| Database | Schema differs from migrations | Run migrations or fix code |
| MCP | Endpoint behavior differs from docs | Fix code or update docs |

Drift Detection

Automated (CI/CD)

# Run on every deploy
./scripts/validate-infra.sh

# Fails if:
# - Health endpoints don't respond
# - Domain routing incorrect
# - Required services missing
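
The health-endpoint condition above could be checked along the following lines. This is a minimal sketch, not the contents of the real validate-infra.sh: the function names `check_health` and `check_all` and the example URL are hypothetical, and the 10-second timeout is an arbitrary choice.

```bash
# Hypothetical sketch of a health-endpoint check; the real script may differ.

# check_health URL
# Returns 0 if the endpoint answers with an HTTP status below 400
# within 10 seconds.
check_health() {
  # -f: exit non-zero on HTTP errors; -sS: no progress bar, but show errors
  curl -fsS --max-time 10 -o /dev/null "$1"
}

# check_all URL...
# Checks every URL, reports each result, and returns non-zero if any failed.
check_all() {
  local failed=0 url
  for url in "$@"; do
    if check_health "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Example (placeholder URL, not taken from TRUTH_MAP):
# check_all "https://api.example.com/health" || exit 1
```

Checking every endpoint before returning, rather than stopping at the first failure, gives the operator the full list of broken services in one run.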

Manual (Weekly)

| Check | Procedure | Owner |
|-------|-----------|-------|
| Railway services match TRUTH_MAP | Railway dashboard → count services | Ops |
| Autodeploy disabled | Railway dashboard → each service → Settings | Ops |
| Doppler sync status | Doppler dashboard → prd → sync status | Ops |
| Vercel env vars | Vercel dashboard → Settings → Environment Variables | Ops |

Drift Resolution Protocol

1. Detect

Drift discovered via:
- validate-infra.sh failure
- Manual audit
- Incident investigation

2. Classify

| Severity | Definition | SLA |
|----------|------------|-----|
| Critical | Production broken, users affected | Fix immediately |
| High | Security risk, data integrity | Fix within 4 hours |
| Medium | Documentation mismatch, no impact | Fix within 1 week |
| Low | Cosmetic, naming inconsistency | Fix when convenient |
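
If the severity levels are used in scripts or alert messages, the mapping above can be encoded once so the SLA text stays consistent. The helper name `sla_for` is hypothetical; the strings are copied from the table.

```bash
# sla_for SEVERITY
# Prints the SLA for a severity level (case-insensitive).
# Returns 1 for an unknown level.
sla_for() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    critical) echo "Fix immediately" ;;
    high)     echo "Fix within 4 hours" ;;
    medium)   echo "Fix within 1 week" ;;
    low)      echo "Fix when convenient" ;;
    *)        echo "Unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example:
# echo "Drift classified as High: $(sla_for High)"
```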

3. Resolve

Option A: Update Infrastructure
- Fix infrastructure to match TRUTH_MAP
- No documentation change needed
- Preferred for unintentional drift

Option B: Update Documentation
- Update TRUTH_MAP to match infrastructure
- Requires evidence citation
- Use for intentional changes

4. Verify

# After resolution
./scripts/validate-infra.sh

# Update last verified date in TRUTH_MAP.md
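
The date update can be scripted so it only happens after a passing validation run. The sketch below assumes TRUTH_MAP.md contains a line of the form `Last Verified: YYYY-MM-DD`. That format and the function name `stamp_last_verified` are assumptions; adjust the pattern to match the actual file.

```bash
# stamp_last_verified FILE [DATE]
# Rewrites the "Last Verified: ..." line in FILE (assumed format) with DATE,
# defaulting to today's date in UTC. Other lines are left unchanged.
stamp_last_verified() {
  local file="$1"
  local stamp="${2:-$(date -u +%Y-%m-%d)}"
  local tmp
  tmp="$(mktemp)" || return 1
  # Write through a temp file instead of `sed -i`, whose syntax differs
  # between GNU and BSD sed; `cat >` keeps the original file permissions.
  sed "s/^Last Verified:.*/Last Verified: $stamp/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example: stamp only when validation succeeds.
# ./scripts/validate-infra.sh && stamp_last_verified docs/infra/TRUTH_MAP.md
```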

Drift Scenarios

Scenario: New Service Added

Detection: Railway shows 7 services, TRUTH_MAP says 6

Resolution:
1. Determine if service is intentional
2. If yes: Add to TRUTH_MAP Section A with evidence
3. If no: Delete service from Railway
4. Update Last Verified date
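
The detection step for this scenario is a comparison of two numbers. A small helper can make the direction of the mismatch explicit. This sketch takes both counts as arguments (for example, read from the Railway dashboard and from TRUTH_MAP); how they are obtained is left open, and the name `compare_service_count` is hypothetical.

```bash
# compare_service_count DOCUMENTED ACTUAL
# Prints a verdict and returns non-zero when the two counts differ.
compare_service_count() {
  local documented="$1" actual="$2"
  if [ "$actual" -gt "$documented" ]; then
    echo "DRIFT: $((actual - documented)) undocumented service(s) in Railway"
    return 1
  elif [ "$actual" -lt "$documented" ]; then
    echo "DRIFT: $((documented - actual)) documented service(s) missing from Railway"
    return 1
  fi
  echo "OK: $actual services, matches TRUTH_MAP"
}

# Example matching the scenario above (TRUTH_MAP says 6, Railway shows 7):
# compare_service_count 6 7
```

A matching count does not prove the service lists are identical, so this is a first-pass check only.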

Scenario: Domain Not Resolving

Detection: validate-infra.sh fails on domain check

Resolution:
1. Check DNS propagation (dig, nslookup)
2. Verify platform configuration (Railway/Vercel)
3. Check for platform outages
4. Fix routing or update TRUTH_MAP
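
For the DNS step, `dig +short` prints the records a name resolves to, one per line, with CNAME targets ending in a trailing dot. The sketch below wraps that in a check against the target documented in TRUTH_MAP. The function names and the example domain and target are placeholders.

```bash
# resolve_domain DOMAIN
# Prints the records dig returns for DOMAIN, one per line (may be empty).
resolve_domain() {
  dig +short "$1"
}

# domain_points_to DOMAIN EXPECTED
# Returns 0 if EXPECTED (a hostname or IP address) is among the resolved
# records. The trailing dot that dig prints on hostnames is stripped first.
domain_points_to() {
  resolve_domain "$1" | sed 's/\.$//' | grep -Fxq "$2"
}

# Example (placeholder values, not from TRUTH_MAP):
# domain_points_to app.example.com target.example.net \
#   || echo "app.example.com does not point at the documented target"
```

An empty result from `resolve_domain` usually means the record does not exist or has not propagated yet, which is a different failure from pointing at the wrong target.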

Scenario: Environment Variable Mismatch

Detection: Service fails to start, logs show missing env var

Resolution:
1. Check Doppler for correct value
2. Sync to Railway/Vercel
3. Update TRUTH_MAP if new variable
4. Redeploy service
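
To find which variable is missing, it can help to compare variable names between an export from Doppler and an export from the platform. The sketch below assumes both exports are plain `KEY=value` files. How they are produced is not covered here, and the function names are hypothetical. Only names are compared and printed, so secret values never reach the terminal.

```bash
# env_names FILE
# Prints the sorted, unique variable names from a KEY=value file,
# skipping blank lines, comments, and anything not in KEY=value form.
env_names() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# missing_vars SOURCE_FILE TARGET_FILE
# Prints names defined in SOURCE_FILE but absent from TARGET_FILE.
missing_vars() {
  local src dst
  src="$(mktemp)" || return 1
  dst="$(mktemp)" || { rm -f "$src"; return 1; }
  env_names "$1" > "$src"
  env_names "$2" > "$dst"
  # comm -23 keeps lines that appear only in the first (sorted) input.
  comm -23 "$src" "$dst"
  rm -f "$src" "$dst"
}

# Example (hypothetical file names):
# missing_vars doppler.env railway.env
```

Values that differ under the same name are not detected by this check; syncing from Doppler, as the steps above describe, covers that case.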


Prevention

Before Making Changes

  1. Check TRUTH_MAP - Understand current documented state
  2. Update Documentation First - For intentional changes
  3. Use Doppler - Never edit Railway/Vercel env vars directly
  4. Run Validation - After every change

CI/CD Integration

# .github/workflows/deploy.yml
- name: Validate Infrastructure
  run: ./scripts/validate-infra.sh

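- name: Check for Drift
  run: |
    # Compare the deployed commit with its parent. A bare `git diff` only sees
    # uncommitted changes, which a fresh CI checkout never has.
    # Assumes the checkout step fetched at least two commits (e.g. fetch-depth: 2).
    if git diff --name-only HEAD~1 HEAD | grep -qF "docs/infra/TRUTH_MAP.md"; then
      echo "TRUTH_MAP modified - ensure all claims have evidence"
    fi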

| Document | Purpose |
|----------|---------|
| TRUTH_MAP.md | Source of truth |
| DOC_GOVERNANCE.md | Evidence requirements |
| ../runbooks/DEPLOYMENT_GUIDE.md | Deployment procedures |

Document created: 2026-02-01
Authority: DOC_GOVERNANCE.md