Disaster Recovery Runbook

Purpose: Disaster recovery procedures for Fulcrum infrastructure
Audience: DevOps, Infrastructure Team
Source of Truth: TRUTH_MAP.md

Last Updated: February 1, 2026
Owner: Infrastructure Team
Status: Production Ready

Overview

This runbook documents disaster recovery procedures for Fulcrum's production infrastructure. It covers database backup configuration, recovery procedures, and testing protocols.

Infrastructure Overview

IMPORTANT: Production runs on Railway. Disaster recovery relies on Railway-managed backups only. See TRUTH_MAP.md Section C for deprecated services.

Component	Primary Provider	DR Provider	Backup Strategy
PostgreSQL Database	Railway (timescaledb-docker)	None (no secondary provider)	Railway-managed backups
Redis Cache	Railway	-	Ephemeral (no backups required)
Application State	Railway Database	-	Covered by PostgreSQL backups
Configuration	Doppler + Git	-	Version controlled

Platform Clarification

Railway: Primary production platform for all services and databases
Database hostname: timescaledb-docker.railway.internal:5432

1. PostgreSQL Backup Configuration (Railway)

1.1 Automated Backups

Railway-managed PostgreSQL backups are plan-dependent. Verify schedule, retention, and restore options in the Railway dashboard.

Current Configuration (verify in Railway):
- Backup Frequency: Managed by provider plan
- Retention Period: Managed by provider plan
- Backup Type: Provider-managed snapshots/dumps
- Point-in-Time Recovery (PITR): If available on plan

1.2 Enabling/Verifying Backups

Steps to verify backup configuration:

Log in to the Railway dashboard
Navigate to your PostgreSQL database service
Open the backups/snapshots section
Verify the following:
✅ Automated backups are enabled
✅ Recent backups are listed with timestamps
✅ PITR is enabled (if available)
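When checking that recent backups are listed, it can help to turn the newest backup's timestamp into an age and compare it with a threshold. The following is a minimal sketch, assuming GNU `date` and a timestamp copied by hand from the dashboard. The function names and the 24-hour default are illustrative assumptions, not Railway settings.

```shell
# backup_age_hours TIMESTAMP [NOW_EPOCH]
# Prints the whole hours elapsed since TIMESTAMP (ISO 8601, UTC).
# NOW_EPOCH is optional and defaults to the current time.
backup_age_hours() {
  ts_epoch=$(date -u -d "$1" +%s) || return 2
  now_epoch=${2:-$(date -u +%s)}
  echo $(( (now_epoch - ts_epoch) / 3600 ))
}

# backup_is_fresh TIMESTAMP [MAX_HOURS] [NOW_EPOCH]
# Succeeds if the backup is at most MAX_HOURS old.
# The default of 24 hours is an assumed threshold; align it with your plan.
backup_is_fresh() {
  age=$(backup_age_hours "$1" "$3") || return 2
  [ "$age" -le "${2:-24}" ]
}
```

For example, `backup_is_fresh 2026-01-21T02:00:00Z 24 || echo "Latest backup is stale"` flags a backup older than one day.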

To modify backup settings:

In the backups/snapshots section, open settings
Configure:
Backup Retention: 7-30 days (recommended: 14 days for production)
Backup Window: Choose low-traffic time (e.g., 2-4 AM UTC)
Click "Save Settings"

1.3 Manual Backup

To create an on-demand backup before major changes:

# Via Railway dashboard:
# 1. Go to Database → Backups/Snapshots
# 2. Click "Create Backup" or "Create Snapshot"
# 3. Add description (e.g., "Pre-migration backup - 2026-01-21")

2. Recovery Procedures

2.1 Point-in-Time Recovery (PITR)

Use Case: Recover to a specific timestamp (e.g., before data corruption)

Recovery Time Objective (RTO): 15-30 minutes
Recovery Point Objective (RPO): Up to 5 minutes of data loss

Steps:

Identify Target Timestamp

-- Find the last known good transaction time
SELECT max(created_at) FROM fulcrum.envelopes WHERE status = 'completed';

Initiate PITR via Railway dashboard (if available)
Navigate to Database → Backups/Snapshots
Choose "Point-in-Time Recovery" (if available)
Enter target timestamp (UTC): 2026-01-21T14:30:00Z
Choose recovery destination:
- New Database (recommended for testing)
- Overwrite Current (production restore)
Start recovery and monitor progress
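Because the dashboard expects the target timestamp in UTC, a mistyped or local-time value can restore to the wrong point. The sketch below checks the format before you paste it and converts a timestamp with an offset to UTC. It assumes GNU `date`; `is_utc_timestamp` and `to_utc_timestamp` are hypothetical helper names.

```shell
# is_utc_timestamp VALUE
# Succeeds only for the form YYYY-MM-DDTHH:MM:SSZ that also parses as a real date.
is_utc_timestamp() {
  printf '%s\n' "$1" \
    | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$' || return 1
  date -u -d "$1" +%s >/dev/null 2>&1
}

# to_utc_timestamp VALUE
# Converts any timestamp GNU date understands (including a UTC offset)
# to the YYYY-MM-DDTHH:MM:SSZ form.
to_utc_timestamp() {
  date -u -d "$1" +%Y-%m-%dT%H:%M:%SZ
}
```

For example, `to_utc_timestamp "2026-01-21T09:30:00-05:00"` converts an incident time reported in a UTC-5 zone before it is entered as the recovery target.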

Verify Recovery

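# Connect to the recovered database (if you restored to a new database,
# use that database's connection string instead of the current one)
psql "$POSTGRES_CONN_STR"

-- Run the following at the psql prompt.
-- Verify data integrity
SELECT COUNT(*) FROM fulcrum.tenants;
SELECT COUNT(*) FROM fulcrum.envelopes;
SELECT COUNT(*) FROM fulcrum.policies;

-- Check most recent records
SELECT * FROM fulcrum.envelopes ORDER BY created_at DESC LIMIT 10;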

Update Application Connection String
If recovered to new database, update POSTGRES_CONN_STR in Doppler
Restart all services: fulcrum-server, event-processor, dashboard

2.2 Full Database Restore from Backup

Use Case: Restore entire database from a specific backup snapshot

RTO: 30-60 minutes
RPO: Up to 24 hours (assumes daily backups; confirm the actual schedule in the Railway dashboard)

Steps:

Select Backup
Navigate to Railway dashboard → Database → Backups/Snapshots
Choose backup by date/time
Note backup ID and timestamp
Restore Options

Option A: Restore to New Database (Recommended)

# 1. Create new database from backup via Railway dashboard
#    - Click backup → "Restore to New Database"
#    - Name: "fulcrum-db-restored-{date}"
#    - Plan: Same as production

# 2. Wait for restore to complete (5-20 minutes)

# 3. Verify data
psql $NEW_DATABASE_URL -c "SELECT COUNT(*) FROM fulcrum.tenants;"

# 4. Update connection string in Doppler
#    (use "?search_path=fulcrum" instead if the URL has no query string yet)
doppler secrets set POSTGRES_CONN_STR="$NEW_DATABASE_URL&search_path=fulcrum"

# 5. Restart services
railway up --service fulcrum-server
railway up --service event-processor
railway up --service dashboard
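Step 4 above appends `&search_path=fulcrum`, which yields a valid URL only when the restored database's URL already has a query string. A small helper can choose the right separator. This is a sketch; `with_search_path` is a hypothetical name, not part of Doppler or Railway.

```shell
# with_search_path URL SCHEMA
# Appends search_path=SCHEMA, using "&" if URL already contains a
# query string and "?" otherwise.
with_search_path() {
  case "$1" in
    *\?*) printf '%s&search_path=%s\n' "$1" "$2" ;;
    *)    printf '%s?search_path=%s\n' "$1" "$2" ;;
  esac
}
```

It could be used as `doppler secrets set POSTGRES_CONN_STR="$(with_search_path "$NEW_DATABASE_URL" fulcrum)"`.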

Option B: In-Place Restore (Production)

# WARNING: This will overwrite current database
# Only use if current database is completely corrupted

# 1. Put application in maintenance mode
railway down --service fulcrum-server
railway down --service event-processor

# 2. Via Railway dashboard:
#    - Select backup → "Restore"
#    - Confirm overwrite

# 3. Wait for restore (10-30 minutes)

# 4. Verify data integrity
psql $POSTGRES_CONN_STR -c "\dt fulcrum.*"

# 5. Restart services
railway up --service fulcrum-server
railway up --service event-processor
railway up --service dashboard
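Since an in-place restore is destructive, a typed confirmation before step 1 reduces the chance of running it against the wrong database. The following is a minimal sketch; `confirm_overwrite` and the database name `fulcrum-db` are illustrative assumptions.

```shell
# confirm_overwrite DB_NAME
# Prompts on stderr, reads one line from stdin, and succeeds only if
# the operator typed DB_NAME exactly.
confirm_overwrite() {
  printf 'Type "%s" to confirm the in-place restore: ' "$1" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$1" ]
}
```

A wrapper script could then gate the shutdown on it, for example `confirm_overwrite fulcrum-db && railway down --service fulcrum-server`.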

2.3 Schema-Only Recovery

Use Case: Restore schema without data (e.g., test environments)

# 1. Export schema from backup
#    (pg_dump is a standalone client tool, so run it from the shell, not via psql -c)
pg_dump --schema-only --no-owner --no-privileges \
  -n fulcrum -n metrics -d "$BACKUP_DATABASE_URL" > /tmp/schema.sql

# 2. Apply to target database
psql $TARGET_DATABASE_URL -f /tmp/schema.sql

# 3. Verify schema
psql $TARGET_DATABASE_URL -c "\dt fulcrum.*"
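Before applying the exported file to a test environment, it is worth confirming that it really contains no row data. A rough check, assuming a plain-text dump at `/tmp/schema.sql` as in the steps above; the helper names are hypothetical, and the patterns only match statements that begin a line.

```shell
# schema_dump_has_data FILE
# Succeeds (exit 0) if FILE contains row-loading statements,
# meaning the dump is NOT schema-only.
schema_dump_has_data() {
  grep -Eq '^(COPY|INSERT INTO) ' "$1"
}

# count_tables FILE
# Prints the number of CREATE TABLE statements in FILE.
count_tables() {
  grep -c '^CREATE TABLE ' "$1"
}
```

For example, `schema_dump_has_data /tmp/schema.sql && echo "Dump contains data; do not apply"`.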

3. Testing Recovery Procedures

3.1 Monthly Recovery Test

Schedule: First Monday of each month
Duration: 30 minutes
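Standard cron syntax cannot express "first Monday of the month" directly, so a weekly Monday job typically needs a date guard. The following is a sketch of such a guard, assuming GNU `date`; `is_first_monday` is a hypothetical helper.

```shell
# is_first_monday [DATE]
# Succeeds if DATE (default: today, UTC) is a Monday that falls on
# day 1-7 of its month.
is_first_monday() {
  d=${1:-$(date -u +%Y-%m-%d)}
  dow=$(date -u -d "$d" +%u) || return 2   # 1 = Monday
  dom=$(date -u -d "$d" +%d) || return 2   # 01-31
  [ "$dow" -eq 1 ] && [ "${dom#0}" -le 7 ]
}
```

A script scheduled every Monday can start with `is_first_monday || exit 0` so that the reminder or test only runs once per month.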

Test Procedure:

Create Test Backup

# Manually trigger backup via Railway dashboard
# Label: "Monthly DR Test - {date}"

Restore to Staging Environment

# 1. Restore backup to new database
# 2. Connect staging environment to restored DB
# 3. Run smoke tests

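# Run from the root of the Fulcrum repository
# (the path below is a local checkout; adjust it for your machine)
cd /Users/td/ConceptDev/Projects/Fulcrum
go test ./tests/smoke/... -v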

Document Results

# Record in disaster-recovery-tests.log
# (example values shown; replace them with the measured result, RTO, and RPO)
echo "$(date -u +%Y-%m-%d) | PITR         | Success | 18min  | 2min   | Monthly test" \
  >> docs/runbooks/disaster-recovery-tests.log
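The RTO and RPO figures recorded here are more reliable when computed from noted timestamps than when estimated afterwards. The following is a minimal sketch, assuming GNU `date`; `elapsed_minutes` is a hypothetical helper.

```shell
# elapsed_minutes START END
# Prints the whole minutes between two timestamps (any form GNU date parses).
elapsed_minutes() {
  start=$(date -u -d "$1" +%s) || return 2
  end=$(date -u -d "$2" +%s) || return 2
  echo $(( (end - start) / 60 ))
}
```

For RTO, pass the time the restore was initiated and the time the smoke tests passed. For RPO, pass the newest `created_at` in the restored database and the time of the simulated incident.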

Cleanup

# Delete test database via Railway dashboard

3.2 Test Recovery Log

Create a test log file to track all recovery tests:

Location: docs/runbooks/disaster-recovery-tests.log

Format:

Date       | Test Type    | Result  | RTO    | RPO    | Notes
-----------|--------------|---------|--------|--------|------------------
2026-01-21 | PITR         | Success | 22min  | 3min   | Initial test
2026-02-03 | Full Restore | Success | 45min  | 24hrs  | Monthly test
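To keep entries aligned with the column layout above, the line can be produced with `printf` rather than typed by hand. The following is a sketch; `dr_log_line` is a hypothetical helper, and the column widths are taken from the sample rows.

```shell
# dr_log_line DATE TEST_TYPE RESULT RTO RPO NOTES
# Prints one log row padded to the column widths of the format above.
dr_log_line() {
  printf '%-10s | %-12s | %-7s | %-6s | %-6s | %s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}
```

For example, `dr_log_line "$(date -u +%Y-%m-%d)" PITR Success 22min 3min "Monthly test" >> docs/runbooks/disaster-recovery-tests.log`.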

4. Data Loss Prevention

4.1 Pre-Change Backups

ALWAYS create manual backup before:
- Database schema migrations
- Bulk data operations
- Major version upgrades
- Configuration changes affecting RLS policies

# Create pre-migration backup via Railway dashboard
# Label format: "PRE-{operation}-{date}-{ticket-id}"
# Example: "PRE-MIGRATION-2026-01-21-P0-012"
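Generating the label avoids inconsistent spellings across backups. The following is a small sketch of the label format above; `pre_change_label` is a hypothetical helper, and the date defaults to today in UTC.

```shell
# pre_change_label OPERATION TICKET_ID [DATE]
# Prints "PRE-{OPERATION}-{date}-{ticket-id}" with the operation uppercased.
pre_change_label() {
  op=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  printf 'PRE-%s-%s-%s\n' "$op" "${3:-$(date -u +%Y-%m-%d)}" "$2"
}
```

For example, `pre_change_label migration P0-012 2026-01-21` yields the label shown in the example above, ready to paste into the dashboard description field.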

4.2 Multi-Region Backup Strategy (Future)

Current: Single-region (Railway)
Future: Cross-region backup replication (requires ADR approval)

5. Recovery Objectives (RTO and RPO)

Scenario	RTO	RPO	Procedure
Point-in-Time Recovery	15-30min	5min	Section 2.1
Full Database Restore	30-60min	24hrs	Section 2.2
Schema Corruption	10-20min	0min	Replay migrations
Primary platform outage	2-4hrs	24hrs	Restore from latest backup in provider console
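A measured recovery time can be checked against the upper bound of each RTO range in the table, for instance when deciding whether a test result needs follow-up. The following is a sketch; the scenario keys (`pitr`, `full-restore`, `schema`, `outage`) and function names are illustrative assumptions.

```shell
# rto_limit_minutes SCENARIO
# Prints the upper bound, in minutes, of the RTO range from the table.
rto_limit_minutes() {
  case "$1" in
    pitr)         echo 30 ;;    # Point-in-Time Recovery: 15-30min
    full-restore) echo 60 ;;    # Full Database Restore: 30-60min
    schema)       echo 20 ;;    # Schema Corruption: 10-20min
    outage)       echo 240 ;;   # Primary platform outage: 2-4hrs
    *)            return 1 ;;
  esac
}

# within_rto SCENARIO MINUTES
# Succeeds if MINUTES does not exceed the scenario's upper bound.
within_rto() {
  limit=$(rto_limit_minutes "$1") || return 2
  [ "$2" -le "$limit" ]
}
```

For example, `within_rto full-restore 75 || echo "RTO exceeded"` reports a restore that ran past the 60-minute objective.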

6. Contact Information

Escalation Path

Level	Contact	Situation
L1	On-call engineer	Initial response
L2	Infrastructure lead	RTO >1hr
L3	Platform support	Provider outage/issues
L4	CEO/CTO	Data breach/major loss

7. Validation Checklist

After any recovery operation, verify:

[ ] All schemas present: fulcrum, metrics
[ ] Row counts match expected values
[ ] RLS policies active: SELECT * FROM fulcrum.tenants; (should be tenant-scoped)
[ ] Recent data present (check timestamps)
[ ] All services healthy: fulcrum-server, event-processor, dashboard
[ ] Application smoke tests pass: go test ./tests/smoke/...
[ ] No error logs in application monitoring
[ ] User login/authentication working
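The row-count item in the checklist can be partly automated by comparing counts from the recovered database with a baseline recorded earlier. The following is a sketch, assuming a baseline file of `table count` lines. The file format and function names are hypothetical, and `table_counts` needs `psql` and a reachable database.

```shell
# table_counts URL
# Prints "table count" for each key table in the database at URL.
table_counts() {
  for t in fulcrum.tenants fulcrum.envelopes fulcrum.policies; do
    printf '%s %s\n' "$t" "$(psql "$1" -At -c "SELECT COUNT(*) FROM $t;")"
  done
}

# diff_counts EXPECTED_FILE ACTUAL_FILE
# Prints every table whose count differs and exits non-zero if any do.
diff_counts() {
  awk 'BEGIN { bad = 0 }
       NR == FNR { want[$1] = $2; next }
       want[$1] != $2 {
         printf "%s expected=%s actual=%s\n", $1, want[$1], $2
         bad = 1
       }
       END { exit bad }' "$1" "$2"
}
```

For example, run `table_counts "$POSTGRES_CONN_STR" > /tmp/actual.txt` and then `diff_counts baseline.txt /tmp/actual.txt`. After a point-in-time recovery, counts lower than a later baseline are expected, so treat differences as something to explain rather than as automatic failures.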

8. Configuration as Code

All infrastructure configuration is version controlled:

# Database schema
infra/migrations/postgres/*.sql

# Infrastructure config
railway.toml

# Secrets (not in git)
Doppler: https://dashboard.doppler.com
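Because the schema lives in versioned SQL files, the "Replay migrations" procedure for schema corruption can be sketched as applying those files in order. This assumes that lexicographic filename order matches the intended migration order and that the files can be applied to the target as-is. Neither assumption is confirmed by this runbook, and the function names are hypothetical.

```shell
# list_migrations DIR
# Prints the .sql files directly inside DIR in lexicographic order.
list_migrations() {
  find "$1" -maxdepth 1 -type f -name '*.sql' | LC_ALL=C sort
}

# replay_migrations URL DIR
# Applies each migration in order and stops at the first failing file.
replay_migrations() {
  list_migrations "$2" | while IFS= read -r f; do
    echo "Applying $f" >&2
    psql "$1" -v ON_ERROR_STOP=1 -f "$f" || exit 1
  done
}
```

For example, `replay_migrations "$TARGET_DATABASE_URL" infra/migrations/postgres`. Try it against a restored copy before pointing it at production.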

9. Notes & Lessons Learned

Known Issues

PITR availability depends on plan - Verify support in Railway dashboard
Restore time increases with database size - The current database is ~2 GB and takes about 20 minutes to restore
Connection string updates require service restart - Plan for 2-3min downtime

Improvements Planned

[ ] Automated recovery testing
[ ] Cross-region backup replication
[ ] Sub-5-minute RTO with hot standby

10. References

PostgreSQL PITR Documentation
Fulcrum Production Runbook: docs/runbooks/RUNBOOKS.md
Database Schema: infra/migrations/postgres/

Document History:
- 2026-01-21: Initial version created (P0-012 infrastructure sprint)