Disaster Recovery Runbook

Purpose: Disaster recovery procedures for Fulcrum infrastructure
Audience: DevOps, Infrastructure Team
Source of Truth: TRUTH_MAP.md

Last Updated: February 1, 2026
Owner: Infrastructure Team
Status: Production Ready


Overview

This runbook documents disaster recovery procedures for Fulcrum's production infrastructure. It covers database backup configuration, recovery procedures, and testing protocols.

Infrastructure Overview

IMPORTANT: Production runs on Railway. Disaster recovery relies on Railway-managed backups only. See TRUTH_MAP.md Section C for deprecated services.

Component           | Primary Provider             | DR Provider                  | Backup Strategy
--------------------|------------------------------|------------------------------|--------------------------------
PostgreSQL Database | Railway (timescaledb-docker) | None (no secondary provider) | Railway-managed backups
Redis Cache         | Railway                      | -                            | Ephemeral (no backups required)
Application State   | Railway Database             | -                            | Covered by PostgreSQL backups
Configuration       | Doppler + Git                | -                            | Version controlled

Platform Clarification

  • Railway: Primary production platform for all services and databases
  • Database hostname: timescaledb-docker.railway.internal:5432

1. PostgreSQL Backup Configuration (Railway)

1.1 Automated Backups

Railway-managed PostgreSQL backups are plan-dependent. Verify schedule, retention, and restore options in the Railway dashboard.

Current Configuration (verify in Railway):

  • Backup Frequency: Managed by provider plan
  • Retention Period: Managed by provider plan
  • Backup Type: Provider-managed snapshots/dumps
  • Point-in-Time Recovery (PITR): If available on plan

1.2 Enabling/Verifying Backups

Steps to verify backup configuration:

  1. Log in to the Railway dashboard
  2. Navigate to your PostgreSQL database service
  3. Open the backups/snapshots section
  4. Verify the following:
    • ✅ Automated backups are enabled
    • ✅ Recent backups are listed with timestamps
    • ✅ PITR is enabled (if available)
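
The "recent backups" check can be made mechanical once the newest backup's timestamp has been copied from the dashboard. The following is a minimal sketch, assuming the timestamp has been converted to epoch seconds and that 24 hours (the daily-backup RPO quoted in Section 2.2) is the acceptable age. `backup_is_recent` is a hypothetical helper, not a Railway command.

```shell
# backup_is_recent BACKUP_EPOCH NOW_EPOCH [MAX_AGE_HOURS]
# Hypothetical helper: succeeds when the backup is no older than
# MAX_AGE_HOURS (default 24, an assumption based on the daily-backup RPO).
backup_is_recent() {
  backup_epoch=$1
  now_epoch=$2
  max_age_hours=${3:-24}
  age_seconds=$((now_epoch - backup_epoch))
  # A negative age means the timestamp is in the future, which points to a
  # clock or copy/paste mistake, so it is treated as a failure.
  [ "$age_seconds" -ge 0 ] && [ "$age_seconds" -le $((max_age_hours * 3600)) ]
}
```

Possible usage: `backup_is_recent "$newest_backup_epoch" "$(date -u +%s)" || echo "WARNING: newest backup is older than 24h"`, where `newest_backup_epoch` is a value you supply.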

To modify backup settings:

  1. In the backups/snapshots section, open settings
  2. Configure:
    • Backup Retention: 7-30 days (recommended: 14 days for production)
    • Backup Window: Choose low-traffic time (e.g., 2-4 AM UTC)
  3. Click "Save Settings"

1.3 Manual Backup

To create an on-demand backup before major changes:

# Via Railway dashboard:
# 1. Go to Database → Backups/Snapshots
# 2. Click "Create Backup" or "Create Snapshot"
# 3. Add description (e.g., "Pre-migration backup - 2026-01-21")

2. Recovery Procedures

2.1 Point-in-Time Recovery (PITR)

Use Case: Recover to a specific timestamp (e.g., before data corruption)

Recovery Time Objective (RTO): 15-30 minutes
Recovery Point Objective (RPO): Up to 5 minutes of data loss

Steps:

  1. Identify Target Timestamp

    -- Find the last known good transaction time
    SELECT max(created_at) FROM fulcrum.envelopes WHERE status = 'completed';
    

  2. Initiate PITR via Railway dashboard (if available)

    • Navigate to Database → Backups/Snapshots
    • Choose "Point-in-Time Recovery" (if available)
    • Enter target timestamp (UTC): 2026-01-21T14:30:00Z
    • Choose recovery destination:
      • New Database (recommended for testing)
      • Overwrite Current (production restore)
    • Start recovery and monitor progress

  3. Verify Recovery

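    # Connect to the recovered database. If you restored to a new database,
    # use that database's URL here, since POSTGRES_CONN_STR still points at
    # the old one until the next step. The statements below run inside psql.
    psql "$POSTGRES_CONN_STR"

    -- Verify data integrity
    SELECT COUNT(*) FROM fulcrum.tenants;
    SELECT COUNT(*) FROM fulcrum.envelopes;
    SELECT COUNT(*) FROM fulcrum.policies;

    -- Check most recent records
    SELECT * FROM fulcrum.envelopes ORDER BY created_at DESC LIMIT 10;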
    

  4. Update Application Connection String

    • If recovered to new database, update POSTGRES_CONN_STR in Doppler
    • Restart all services: fulcrum-server, event-processor, dashboard
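
The PITR target must be entered as a UTC timestamp, and a malformed value is easy to miss under incident pressure. The following is a minimal sketch of a format check to run before pasting the value into the dashboard. `is_utc_timestamp` is a hypothetical helper, and it checks the shape only, not calendar validity.

```shell
# is_utc_timestamp VALUE
# Hypothetical helper: succeeds only for the shape YYYY-MM-DDTHH:MM:SSZ.
# It does not reject impossible dates such as 2026-02-31.
is_utc_timestamp() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]Z$'
}
```

Possible usage: `is_utc_timestamp "$target" || echo "Target must look like 2026-01-21T14:30:00Z"`, where `target` is the timestamp chosen in step 1.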

2.2 Full Database Restore from Backup

Use Case: Restore entire database from a specific backup snapshot

RTO: 30-60 minutes
RPO: Up to 24 hours (daily backups)

Steps:

  1. Select Backup

    • Navigate to Railway dashboard → Database → Backups/Snapshots
    • Choose backup by date/time
    • Note backup ID and timestamp

  2. Restore Options

Option A: Restore to New Database (Recommended)

# 1. Create new database from backup via Railway dashboard
#    - Click backup → "Restore to New Database"
#    - Name: "fulcrum-db-restored-{date}"
#    - Plan: Same as production

# 2. Wait for restore to complete (5-20 minutes)

# 3. Verify data
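psql "$NEW_DATABASE_URL" -c "SELECT COUNT(*) FROM fulcrum.tenants;"

# 4. Update connection string in Doppler
#    The "&" below assumes $NEW_DATABASE_URL already ends in a query string
#    (e.g. "?sslmode=..."); if it has none, use "?search_path=fulcrum" instead.
doppler secrets set POSTGRES_CONN_STR="$NEW_DATABASE_URL&search_path=fulcrum"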

# 5. Restart services
railway up --service fulcrum-server
railway up --service event-processor
railway up --service dashboard
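
Appending `search_path` to the connection string needs `?` when the URL has no query string and `&` when it already has one. The following is a minimal sketch that picks the right separator. `with_search_path` is a hypothetical helper, and it assumes any `?` in the credentials is percent-encoded.

```shell
# with_search_path URL SCHEMA
# Hypothetical helper: appends search_path=SCHEMA, using "?" when URL has
# no query string yet and "&" when it already has one.
with_search_path() {
  case $1 in
    *\?*) printf '%s&search_path=%s\n' "$1" "$2" ;;
    *)    printf '%s?search_path=%s\n' "$1" "$2" ;;
  esac
}
```

Possible usage: `doppler secrets set POSTGRES_CONN_STR="$(with_search_path "$NEW_DATABASE_URL" fulcrum)"`.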

Option B: In-Place Restore (Production)

# WARNING: This will overwrite current database
# Only use if current database is completely corrupted

# 1. Put application in maintenance mode
railway down --service fulcrum-server
railway down --service event-processor

# 2. Via Railway dashboard:
#    - Select backup → "Restore"
#    - Confirm overwrite

# 3. Wait for restore (10-30 minutes)

# 4. Verify data integrity
psql "$POSTGRES_CONN_STR" -c "\dt fulcrum.*"

# 5. Restart services
railway up --service fulcrum-server
railway up --service event-processor
railway up --service dashboard
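
Because an in-place restore is destructive, a scripted version of Option B may benefit from an explicit confirmation step. The following is a minimal sketch, assuming the operator must type the database name exactly. `confirm_overwrite` and the name passed to it are hypothetical and not part of the Railway CLI.

```shell
# confirm_overwrite EXPECTED_NAME
# Hypothetical guard: reads one line from stdin and succeeds only if it
# equals EXPECTED_NAME exactly. The prompt goes to stderr.
confirm_overwrite() {
  printf 'Type "%s" to confirm the in-place restore: ' "$1" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$1" ]
}
```

Possible usage: `confirm_overwrite "$DB_NAME" || exit 1` before stopping services, where `DB_NAME` is a variable you define.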

2.3 Schema-Only Recovery

Use Case: Restore schema without data (e.g., test environments)

# 1. Export schema from backup
#    (pg_dump is a standalone client tool, not a psql command)
pg_dump --dbname="$BACKUP_DATABASE_URL" \
  --schema-only --no-owner --no-privileges \
  -n fulcrum -n metrics > /tmp/schema.sql

# 2. Apply to target database
psql "$TARGET_DATABASE_URL" -f /tmp/schema.sql

# 3. Verify schema
psql "$TARGET_DATABASE_URL" -c "\dt fulcrum.*"
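
Before applying the dump, it can help to confirm that it contains both schemas. The following is a minimal sketch, assuming a plain-text dump in which `pg_dump` wrote one `CREATE SCHEMA` statement per schema. `dump_has_schemas` is a hypothetical helper.

```shell
# dump_has_schemas DUMP_FILE SCHEMA...
# Hypothetical helper: succeeds when the plain-text dump contains a
# CREATE SCHEMA statement for every schema name given.
dump_has_schemas() {
  dump_file=$1
  shift
  for schema in "$@"; do
    if ! grep -Eq "^CREATE SCHEMA (IF NOT EXISTS )?\"?${schema}\"?;" "$dump_file"; then
      echo "missing schema in dump: $schema" >&2
      return 1
    fi
  done
}
```

Possible usage: `dump_has_schemas /tmp/schema.sql fulcrum metrics || exit 1`.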

3. Testing Recovery Procedures

3.1 Monthly Recovery Test

Schedule: First Monday of each month
Duration: 30 minutes

Test Procedure:

  1. Create Test Backup

    # Manually trigger backup via Railway dashboard
    # Label: "Monthly DR Test - {date}"
    

  2. Restore to Staging Environment

    # 1. Restore backup to new database
    # 2. Connect staging environment to restored DB
    # 3. Run smoke tests
    
    # Run from the root of your Fulcrum checkout (no machine-specific path)
    cd "$(git rev-parse --show-toplevel)"
    go test ./tests/smoke/... -v
    

  3. Document Results

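    # Record in disaster-recovery-tests.log, using the column order from Section 3.2.
    # Replace the test type, result, RTO, and RPO with the values measured in this test.
    echo "$(date -u +%Y-%m-%d) | PITR | Success | 18min | 2min | Monthly test" \
      >> docs/runbooks/disaster-recovery-tests.log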
    

  4. Cleanup

    # Delete test database via Railway dashboard
    

3.2 Test Recovery Log

Create a test log file to track all recovery tests:

Location: docs/runbooks/disaster-recovery-tests.log

Format:

Date       | Test Type    | Result  | RTO    | RPO    | Notes
-----------|--------------|---------|--------|--------|------------------
2026-01-21 | PITR         | Success | 22min  | 3min   | Initial test
2026-02-03 | Full Restore | Success | 45min  | 24hrs  | Monthly test
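
Entries stay consistent if they are generated rather than typed by hand. The following is a minimal sketch that measures elapsed minutes from two epoch timestamps and formats a line with the column widths shown above. `elapsed_minutes` and `dr_log_line` are hypothetical helpers, and rounding up to whole minutes is an assumption.

```shell
# elapsed_minutes START_EPOCH END_EPOCH -> whole minutes, rounded up
elapsed_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# dr_log_line DATE TYPE RESULT RTO RPO NOTES
# Pads each column to the widths used in the log format above.
dr_log_line() {
  printf '%-10s | %-12s | %-7s | %-6s | %-6s | %s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

Possible usage: record `start=$(date -u +%s)` when the restore begins, then append `dr_log_line "$(date -u +%Y-%m-%d)" PITR Success "$(elapsed_minutes "$start" "$(date -u +%s)")min" "$rpo" "Monthly test"` to the log, where `rpo` is the data-loss window you measured.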


4. Data Loss Prevention

4.1 Pre-Change Backups

ALWAYS create manual backup before:

  • Database schema migrations
  • Bulk data operations
  • Major version upgrades
  • Configuration changes affecting RLS policies

# Create pre-migration backup via Railway dashboard
# Label format: "PRE-{operation}-{date}-{ticket-id}"
# Example: "PRE-MIGRATION-2026-01-21-P0-012"
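
The label format is easy to get slightly wrong when typed by hand. The following is a minimal sketch that builds it from its three parts. `pre_change_label` is a hypothetical helper, and uppercasing the operation is an assumption based on the example above.

```shell
# pre_change_label OPERATION DATE TICKET_ID
# Hypothetical helper: prints PRE-{OPERATION}-{date}-{ticket-id},
# with the operation uppercased.
pre_change_label() {
  operation=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  printf 'PRE-%s-%s-%s\n' "$operation" "$2" "$3"
}
```

Possible usage: `pre_change_label migration "$(date -u +%Y-%m-%d)" P0-012`, then paste the output into the backup description field.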

4.2 Multi-Region Backup Strategy (Future)

Current: Single-region (Railway)
Future: Cross-region backup replication (requires ADR approval)


5. Recovery Time Objectives (RTO)

Scenario                | RTO      | RPO   | Procedure
------------------------|----------|-------|-----------------------------------------------
Point-in-Time Recovery  | 15-30min | 5min  | Section 2.1
Full Database Restore   | 30-60min | 24hrs | Section 2.2
Schema Corruption       | 10-20min | 0min  | Replay migrations
Primary platform outage | 2-4hrs   | 24hrs | Restore from latest backup in provider console
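
A measured recovery time can be compared against these objectives automatically during the monthly test. The following is a minimal sketch that uses the upper RTO bound of each row, in minutes. The scenario keys (`pitr`, `full-restore`, and so on) and both function names are hypothetical.

```shell
# rto_limit_minutes SCENARIO -> upper RTO bound in minutes, taken from the
# table above. The scenario keys are hypothetical shorthand.
rto_limit_minutes() {
  case $1 in
    pitr)              echo 30 ;;
    full-restore)      echo 60 ;;
    schema-corruption) echo 20 ;;
    platform-outage)   echo 240 ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

# within_rto SCENARIO MEASURED_MINUTES: succeeds when the measured time is
# at or below the upper bound for that scenario.
within_rto() {
  limit=$(rto_limit_minutes "$1") || return 1
  [ "$2" -le "$limit" ]
}
```

Possible usage: `within_rto pitr "$measured_minutes" || echo "RTO objective missed"`, where `measured_minutes` is the value recorded in the test log.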

6. Contact Information

Escalation Path

Level | Contact             | Situation
------|---------------------|-----------------------
L1    | On-call engineer    | Initial response
L2    | Infrastructure lead | RTO >1hr
L3    | Platform support    | Provider outage/issues
L4    | CEO/CTO             | Data breach/major loss

7. Validation Checklist

After any recovery operation, verify:

  • [ ] All schemas present: fulcrum, metrics
  • [ ] Row counts match expected values
  • [ ] RLS policies active: SELECT * FROM fulcrum.tenants; (should be tenant-scoped)
  • [ ] Recent data present (check timestamps)
  • [ ] All services healthy: fulcrum-server, event-processor, dashboard
  • [ ] Application smoke tests pass: go test ./tests/smoke/...
  • [ ] No error logs in application monitoring
  • [ ] User login/authentication working
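
The "row counts match expected values" item can be checked by comparing two small files of `table count` lines, one captured before the incident or from a known-good backup and one from the recovered database. The following is a minimal sketch using `awk`. `diff_counts` and the file layout are assumptions, and it expects a non-empty expected file.

```shell
# diff_counts EXPECTED_FILE ACTUAL_FILE
# Hypothetical helper: each file holds "table count" lines. Prints one line
# per mismatch and fails if any table is missing or has a different count.
diff_counts() {
  awk '
    NR == FNR { expected[$1] = $2; next }
    { actual[$1] = $2 }
    END {
      bad = 0
      for (t in expected) {
        if (!(t in actual)) {
          printf "%s: missing from actual\n", t; bad = 1
        } else if (actual[t] != expected[t]) {
          printf "%s: expected %s, got %s\n", t, expected[t], actual[t]; bad = 1
        }
      }
      exit bad
    }
  ' "$1" "$2"
}
```

Possible usage: `diff_counts expected-counts.txt actual-counts.txt || echo "Row counts differ"`, where both file names are placeholders for files you generate with `SELECT COUNT(*)` queries.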

8. Configuration as Code

All infrastructure configuration is version controlled:

# Database schema
infra/migrations/postgres/*.sql

# Infrastructure config
railway.toml

# Secrets (not in git)
Doppler: https://dashboard.doppler.com

9. Notes & Lessons Learned

Known Issues

  1. PITR availability depends on plan - Verify support in Railway dashboard
  2. Restore time increases with database size - Current DB ~2GB = 20min restore
  3. Connection string updates require service restart - Plan for 2-3min downtime

Improvements Planned

  • [ ] Automated recovery testing
  • [ ] Cross-region backup replication
  • [ ] Sub-5-minute RTO with hot standby

Document History:

  • 2026-01-21: Initial version created (P0-012 infrastructure sprint)