Drift Guardrails
Purpose: Define acceptable infrastructure drift limits
Audience: Operators, DevOps
Verify: scripts/validate-infra.sh
What Is Drift?
Drift occurs when actual infrastructure state differs from documented state in TRUTH_MAP.md.
Examples of drift:
- New service added without updating TRUTH_MAP
- Environment variable changed without documentation
- Domain routing modified
- Database schema out of sync with migrations
Acceptable Drift
Some drift is acceptable during development:
| Category | Acceptable Drift | Duration |
|---|---|---|
| Feature branches | New env vars, test services | Until PR merged |
| Staging environment | Any changes | Permanent (staging is not documented) |
| Local development | Full freedom | N/A |
Unacceptable Drift
Production drift that must be resolved immediately:
| Category | Unacceptable Drift | Resolution |
|---|---|---|
| Services | Undocumented service in production | Add to TRUTH_MAP or delete |
| Domains | Domain pointing to wrong target | Fix routing or update TRUTH_MAP |
| Secrets | Doppler out of sync with Railway/Vercel | Sync from Doppler |
| Database | Schema differs from migrations | Run migrations or fix code |
| MCP | Endpoint behavior differs from docs | Fix code or update docs |
Drift Detection
Automated (CI/CD)
# Run on every deploy
./scripts/validate-infra.sh
# Fails if:
# - Health endpoints don't respond
# - Domain routing incorrect
# - Required services missing
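The contents of validate-infra.sh are not shown in this document, so the following is only a sketch of how such a script might collect failures across several checks before exiting non-zero. The `run_check` helper, the `failures` counter, and the example URL are all hypothetical.

```shell
# Hypothetical helper: not taken from validate-infra.sh.
failures=0

# Run one named check and record a failure instead of stopping at the first one.
# $1: label for the check; remaining arguments: the command to run.
run_check() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   $name"
  else
    echo "FAIL $name"
    failures=$((failures + 1))
  fi
}

# Example usage (placeholder URL):
#   run_check "api health" curl -fsS --max-time 10 https://api.example.com/health
#   [ "$failures" -eq 0 ] || exit 1
```

Running every check before exiting means a single deploy log shows all drift at once, rather than one failure per run.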
Manual (Weekly)
| Check | Procedure | Owner |
|---|---|---|
| Railway services match TRUTH_MAP | Railway dashboard → count services | Ops |
| Autodeploy disabled | Railway dashboard → each service → Settings | Ops |
| Doppler sync status | Doppler dashboard → prd → sync status | Ops |
| Vercel env vars | Vercel dashboard → Settings → Environment Variables | Ops |
Drift Resolution Protocol
1. Detect
Drift discovered via:
- validate-infra.sh failure
- Manual audit
- Incident investigation
2. Classify
| Severity | Definition | SLA |
|---|---|---|
| Critical | Production broken, users affected | Fix immediately |
| High | Security risk, data integrity | Fix within 4 hours |
| Medium | Documentation mismatch, no impact | Fix within 1 week |
| Low | Cosmetic, naming inconsistency | Fix when convenient |
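If drift reports are generated by a script, the table above could be encoded as a small lookup so that each report carries its SLA. This is a minimal sketch; the function name `sla_for_severity` and the lowercase severity keys are assumptions.

```shell
# Hypothetical lookup that mirrors the severity table.
sla_for_severity() {
  case "$1" in
    critical) echo "Fix immediately" ;;
    high)     echo "Fix within 4 hours" ;;
    medium)   echo "Fix within 1 week" ;;
    low)      echo "Fix when convenient" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```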
3. Resolve
Option A: Update Infrastructure
- Fix infrastructure to match TRUTH_MAP
- No documentation change needed
- Preferred for unintentional drift
Option B: Update Documentation
- Update TRUTH_MAP to match infrastructure
- Requires evidence citation
- Use for intentional changes
4. Verify
Re-run scripts/validate-infra.sh to confirm the drift is resolved, then update the Last Verified date.
Drift Scenarios
Scenario: New Service Added
Detection: Railway shows 7 services, TRUTH_MAP says 6
Resolution:
1. Determine if service is intentional
2. If yes: Add to TRUTH_MAP Section A with evidence
3. If no: Delete service from Railway
4. Update Last Verified date
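Counting services shows that something drifted but not which service. One possible way to name it is to compare two plain-text lists with `comm`. The sketch below assumes the documented and actual service names have already been exported to files with one name per line. How those files are produced is not covered here, and `diff_services` is a hypothetical helper.

```shell
# Print services present in one list but not the other.
# $1: file of documented service names, one per line (assumed export of TRUTH_MAP)
# $2: file of actual service names, one per line (assumed export from the platform)
diff_services() {
  documented=$(mktemp)
  actual=$(mktemp)
  # comm requires sorted input
  sort -u "$1" > "$documented"
  sort -u "$2" > "$actual"
  # comm -13 keeps lines only in the second file; comm -23 keeps lines only in the first
  comm -13 "$documented" "$actual" | sed 's/^/undocumented: /'
  comm -23 "$documented" "$actual" | sed 's/^/missing: /'
  rm -f "$documented" "$actual"
}
```

An `undocumented:` line corresponds to steps 2 and 3 above (document the service or delete it). A `missing:` line means TRUTH_MAP lists a service that is no longer running.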
Scenario: Domain Not Resolving
Detection: validate-infra.sh fails on domain check
Resolution:
1. Check DNS propagation (dig, nslookup)
2. Verify platform configuration (Railway/Vercel)
3. Check for platform outages
4. Fix routing or update TRUTH_MAP
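For the DNS step, a small filter can compare what `dig` returns against the documented target. This is a sketch only: `matches_target` is a hypothetical helper and the domain names in the usage comment are placeholders.

```shell
# Read `dig +short` output on stdin and succeed if any line equals the expected target.
# dig prints CNAME targets with a trailing dot, so the dot is stripped before comparing.
matches_target() {
  expected=$1
  sed 's/\.$//' | grep -Fqx -- "$expected"
}

# Example usage (placeholder names):
#   dig +short CNAME app.example.com | matches_target "target.example.net" \
#     || echo "app.example.com does not point at the documented target"
```

Keeping the comparison separate from the lookup means the same filter works for A records, where `dig +short` prints IP addresses without a trailing dot.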
Scenario: Environment Variable Mismatch
Detection: Service fails to start, logs show missing env var
Resolution:
1. Check Doppler for correct value
2. Sync to Railway/Vercel
3. Update TRUTH_MAP if new variable
4. Redeploy service
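A missing variable can be caught before the service starts by checking a list of required names at the top of a start or deploy script. The sketch below is one way to do that in POSIX shell. `missing_vars` is a hypothetical helper, and the variable names in the usage comment are placeholders, not the project's actual configuration.

```shell
# Print the names of required variables that are unset or empty.
# Returns 0 when all are set and 1 otherwise.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable named by $name.
    # Only pass trusted literal names here, because eval is used.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space before printing
  printf '%s\n' "${missing# }"
  [ -z "$missing" ]
}

# Example usage (placeholder names):
#   missing_vars DATABASE_URL API_KEY || { echo "missing env vars" >&2; exit 1; }
```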
Prevention
Before Making Changes
- Check TRUTH_MAP - Understand current documented state
- Update Documentation First - For intentional changes
- Use Doppler - Never edit Railway/Vercel env vars directly
- Run Validation - After every change
CI/CD Integration
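# .github/workflows/deploy.yml
- name: Validate Infrastructure
  run: ./scripts/validate-infra.sh
- name: Check for Drift
  run: |
    # Compare against the previous commit; a bare `git diff` reports nothing on a clean CI checkout.
    # Assumes the checkout step fetched at least two commits (for example, fetch-depth: 2 with actions/checkout).
    if git diff --name-only HEAD~1 HEAD | grep -q "docs/infra/TRUTH_MAP.md"; then
      echo "TRUTH_MAP modified - ensure all claims have evidence"
    fi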
Related Documents
| Document | Purpose |
|---|---|
| TRUTH_MAP.md | Source of truth |
| DOC_GOVERNANCE.md | Evidence requirements |
| ../runbooks/DEPLOYMENT_GUIDE.md | Deployment procedures |
Document created: 2026-02-01
Authority: DOC_GOVERNANCE.md