Disaster Recovery Runbook
Purpose: Disaster recovery procedures for Fulcrum infrastructure
Audience: DevOps, Infrastructure Team
Source of Truth: TRUTH_MAP.md
Last Updated: February 1, 2026
Owner: Infrastructure Team
Status: Production Ready
Overview
This runbook documents disaster recovery procedures for Fulcrum's production infrastructure. It covers database backup configuration, recovery procedures, and testing protocols.
Infrastructure Overview
IMPORTANT: Production runs on Railway. Disaster recovery relies on Railway-managed backups only. See TRUTH_MAP.md Section C for deprecated services.
| Component | Primary Provider | DR Provider | Backup Strategy |
|---|---|---|---|
| PostgreSQL Database | Railway (timescaledb-docker) | None (no secondary provider) | Railway-managed backups |
| Redis Cache | Railway | - | Ephemeral (no backups required) |
| Application State | Railway Database | - | Covered by PostgreSQL backups |
| Configuration | Doppler + Git | - | Version controlled |
Platform Clarification
- Railway: Primary production platform for all services and databases
- Database host and port: timescaledb-docker.railway.internal:5432
1. PostgreSQL Backup Configuration (Railway)
1.1 Automated Backups
Railway-managed PostgreSQL backups are plan-dependent. Verify schedule, retention, and restore options in the Railway dashboard.
Current Configuration (verify in Railway):
- Backup Frequency: Managed by provider plan
- Retention Period: Managed by provider plan
- Backup Type: Provider-managed snapshots/dumps
- Point-in-Time Recovery (PITR): If available on plan
1.2 Enabling/Verifying Backups
Steps to verify backup configuration:
- Log in to the Railway dashboard
- Navigate to your PostgreSQL database service
- Open the backups/snapshots section
- Verify the following:
  - ✅ Automated backups are enabled
  - ✅ Recent backups are listed with timestamps
  - ✅ PITR is enabled (if available)
To modify backup settings:
- In the backups/snapshots section, open settings
- Configure:
  - Backup Retention: 7-30 days (recommended: 14 days for production)
  - Backup Window: Choose low-traffic time (e.g., 2-4 AM UTC)
- Click "Save Settings"
1.3 Manual Backup
To create an on-demand backup before major changes:
# Via Railway dashboard:
# 1. Go to Database → Backups/Snapshots
# 2. Click "Create Backup" or "Create Snapshot"
# 3. Add description (e.g., "Pre-migration backup - 2026-01-21")
2. Recovery Procedures
2.1 Point-in-Time Recovery (PITR)
Use Case: Recover to a specific timestamp (e.g., before data corruption)
Recovery Time Objective (RTO): 15-30 minutes
Recovery Point Objective (RPO): Up to 5 minutes of data loss
Steps:
- Identify Target Timestamp
- Initiate PITR via Railway dashboard (if available)
  - Navigate to Database → Backups/Snapshots
  - Choose "Point-in-Time Recovery" (if available)
  - Enter target timestamp (UTC): 2026-01-21T14:30:00Z
  - Choose recovery destination:
    - New Database (recommended for testing)
    - Overwrite Current (production restore)
- Start recovery and monitor progress
- Verify Recovery
- Update Application Connection String
  - If recovered to new database, update POSTGRES_CONN_STR in Doppler
  - Restart all services: fulcrum-server, event-processor, dashboard
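A mistyped or non-UTC target timestamp is an easy way to recover to the wrong point, so it may help to check the value before entering it in the dashboard. The following is a minimal sketch, assuming the `YYYY-MM-DDTHH:MM:SSZ` form shown in the steps above; `is_utc_timestamp` is a hypothetical helper and not part of any Railway tooling.

```shell
# Hypothetical helper: succeeds only for timestamps shaped like 2026-01-21T14:30:00Z.
# It checks the format only; it does not confirm that the instant falls inside
# the window the provider can actually recover to.
is_utc_timestamp() {
  printf '%s\n' "$1" | grep -Eq \
    '^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]Z$'
}
```

For example, `is_utc_timestamp "2026-01-21T14:30:00Z" && echo ok` prints `ok`, while a value without the trailing `Z` is rejected.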
2.2 Full Database Restore from Backup
Use Case: Restore entire database from a specific backup snapshot
RTO: 30-60 minutes
RPO: Up to 24 hours (daily backups)
Steps:
- Select Backup
  - Navigate to Railway dashboard → Database → Backups/Snapshots
  - Choose backup by date/time
  - Note backup ID and timestamp
- Restore Options
Option A: Restore to New Database (Recommended)
# 1. Create new database from backup via Railway dashboard
# - Click backup → "Restore to New Database"
# - Name: "fulcrum-db-restored-{date}"
# - Plan: Same as production
# 2. Wait for restore to complete (5-20 minutes)
# 3. Verify data
psql "$NEW_DATABASE_URL" -c "SELECT COUNT(*) FROM fulcrum.tenants;"
# 4. Update connection string in Doppler
doppler secrets set POSTGRES_CONN_STR="$NEW_DATABASE_URL&search_path=fulcrum"
# 5. Restart services (note: railway up uploads and deploys the current directory)
railway up --service fulcrum-server
railway up --service event-processor
railway up --service dashboard
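Step 4 above appends `&search_path=fulcrum`, which only produces a valid URL when `$NEW_DATABASE_URL` already contains a query string. A minimal sketch of a safer way to build the value follows; `with_search_path` is a hypothetical helper, and it assumes (as the existing connection string implies) that the application's driver accepts `search_path` as a URL parameter.

```shell
# Hypothetical helper: append search_path to a PostgreSQL URL, choosing "?" or "&"
# depending on whether the URL already carries query parameters.
# Caveat: a literal "?" inside the password would be mistaken for a query string.
with_search_path() (
  url=$1
  schema=$2
  case "$url" in
    *\?*) printf '%s&search_path=%s\n' "$url" "$schema" ;;
    *)    printf '%s?search_path=%s\n' "$url" "$schema" ;;
  esac
)
```

It could then be used as `doppler secrets set POSTGRES_CONN_STR="$(with_search_path "$NEW_DATABASE_URL" fulcrum)"`.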
Option B: In-Place Restore (Production)
# WARNING: This will overwrite current database
# Only use if current database is completely corrupted
# 1. Put application in maintenance mode
railway down --service fulcrum-server
railway down --service event-processor
# 2. Via Railway dashboard:
# - Select backup → "Restore"
# - Confirm overwrite
# 3. Wait for restore (10-30 minutes)
# 4. Verify data integrity
psql "$POSTGRES_CONN_STR" -c "\dt fulcrum.*"
# 5. Restart services
railway up --service fulcrum-server
railway up --service event-processor
railway up --service dashboard
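The "wait for restore" step in Option B can be scripted as a bounded polling loop instead of being watched by hand. The sketch below is a generic retry helper under that assumption; `wait_until` is a hypothetical name, and the choice of probe command is left to the operator.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, pausing between tries.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    # Only pause when another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  exit 1
)
```

One possible probe is `wait_until 60 30 pg_isready -d "$POSTGRES_CONN_STR"`, assuming the URL contains only parameters that libpq recognizes. Note that `pg_isready` reports only that the server accepts connections, so the data integrity check in step 4 is still required.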
2.3 Schema-Only Recovery
Use Case: Restore schema without data (e.g., test environments)
# 1. Export schema from backup (pg_dump is a standalone tool, not a psql command)
pg_dump --schema-only --no-owner --no-privileges \
  -n fulcrum -n metrics \
  --dbname="$BACKUP_DATABASE_URL" > /tmp/schema.sql
# 2. Apply to target database
psql "$TARGET_DATABASE_URL" -f /tmp/schema.sql
# 3. Verify schema
psql "$TARGET_DATABASE_URL" -c "\dt fulcrum.*"
3. Testing Recovery Procedures
3.1 Monthly Recovery Test
Schedule: First Monday of each month
Duration: 30 minutes
Test Procedure:
- Create Test Backup
- Restore to Staging Environment
- Document Results
- Cleanup
3.2 Test Recovery Log
Create a test log file to track all recovery tests:
Location: docs/runbooks/disaster-recovery-tests.log
Format:
Date | Test Type | Result | RTO | RPO | Notes
-----------|--------------|---------|--------|--------|------------------
2026-01-21 | PITR | Success | 22min | 3min | Initial test
2026-02-03 | Full Restore | Success | 45min | 24hrs | Monthly test
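To keep log rows aligned with the header above, entries can be generated rather than typed. This is a minimal sketch; `format_test_log_entry` is a hypothetical helper, and the column widths are inferred from the separator row in the format example.

```shell
# Hypothetical helper: print one log row in the column layout shown above.
# Arguments: DATE TEST_TYPE RESULT RTO RPO NOTES
# Values wider than a column are printed in full and will push later columns right.
format_test_log_entry() {
  printf '%-10s | %-12s | %-7s | %-6s | %-6s | %s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

A row can be appended with, for example, `format_test_log_entry 2026-02-03 "Full Restore" Success 45min 24hrs "Monthly test" >> docs/runbooks/disaster-recovery-tests.log`.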
4. Data Loss Prevention
4.1 Pre-Change Backups
ALWAYS create a manual backup before:
- Database schema migrations
- Bulk data operations
- Major version upgrades
- Configuration changes affecting RLS policies
# Create pre-migration backup via Railway dashboard
# Label format: "PRE-{operation}-{date}-{ticket-id}"
# Example: "PRE-MIGRATION-2026-01-21-P0-012"
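Because the label is typed into the dashboard by hand, generating it helps keep the format consistent across operators. A minimal sketch, assuming the `PRE-{operation}-{date}-{ticket-id}` format above; `backup_label` is a hypothetical helper.

```shell
# Hypothetical helper: build a label in the PRE-{operation}-{date}-{ticket-id} format.
# Arguments: OPERATION TICKET_ID [DATE]; DATE defaults to today's date in UTC.
backup_label() (
  operation=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  ticket=$2
  day=${3:-$(date -u +%Y-%m-%d)}
  printf 'PRE-%s-%s-%s\n' "$operation" "$day" "$ticket"
)
```

For instance, `backup_label migration P0-012 2026-01-21` prints `PRE-MIGRATION-2026-01-21-P0-012`, matching the example above.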
4.2 Multi-Region Backup Strategy (Future)
Current: Single-region (Railway)
Future: Cross-region backup replication (requires ADR approval)
5. Recovery Time and Recovery Point Objectives (RTO/RPO)
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Point-in-Time Recovery | 15-30min | 5min | Section 2.1 |
| Full Database Restore | 30-60min | 24hrs | Section 2.2 |
| Schema Corruption | 10-20min | 0min | Replay migrations |
| Primary platform outage | 2-4hrs | 24hrs | Restore from latest backup in provider console |
6. Contact Information
Escalation Path
| Level | Contact | Situation |
|---|---|---|
| L1 | On-call engineer | Initial response |
| L2 | Infrastructure lead | RTO >1hr |
| L3 | Platform support | Provider outage/issues |
| L4 | CEO/CTO | Data breach/major loss |
7. Validation Checklist
After any recovery operation, verify:
- [ ] All schemas present: fulcrum, metrics
- [ ] Row counts match expected values
- [ ] RLS policies active: SELECT * FROM fulcrum.tenants; (should be tenant-scoped)
- [ ] Recent data present (check timestamps)
- [ ] All services healthy: fulcrum-server, event-processor, dashboard
- [ ] Application smoke tests pass: go test ./tests/smoke/...
- [ ] No error logs in application monitoring
- [ ] User login/authentication working
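The first checklist item lends itself to a scripted check. The sketch below compares the schemas found in the recovered database against the expected names; `missing_schemas` is a hypothetical helper, and it takes the list of found schemas as an argument so that the comparison itself does not depend on a live database.

```shell
# Hypothetical helper: given a newline-separated list of schema names found in the
# database, print each expected schema that is absent.
# Arguments: FOUND_LIST EXPECTED_SCHEMA...
# Returns 0 when every expected schema is present, 1 otherwise.
missing_schemas() (
  found=$1
  shift
  status=0
  for schema in "$@"; do
    # -F: literal match, -x: whole line, -q: no output
    if ! printf '%s\n' "$found" | grep -Fxq "$schema"; then
      printf '%s\n' "$schema"
      status=1
    fi
  done
  exit "$status"
)
```

One way to feed it is `missing_schemas "$(psql "$POSTGRES_CONN_STR" -At -c "SELECT schema_name FROM information_schema.schemata")" fulcrum metrics`, which prints nothing and succeeds when both schemas exist.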
8. Configuration as Code
All infrastructure configuration is version controlled:
# Database schema
infra/migrations/postgres/*.sql
# Infrastructure config
railway.toml
# Secrets (not in git)
Doppler: https://dashboard.doppler.com
9. Notes & Lessons Learned
Known Issues
- PITR availability depends on plan: verify support in the Railway dashboard
- Restore time increases with database size: the current ~2 GB database takes about 20 minutes to restore
- Connection string updates require a service restart: plan for 2-3 minutes of downtime
Improvements Planned
- [ ] Automated recovery testing
- [ ] Cross-region backup replication
- [ ] Sub-5-minute RTO with hot standby
10. References
- PostgreSQL PITR Documentation
- Fulcrum Production Runbook: docs/runbooks/RUNBOOKS.md
- Database Schema: infra/migrations/postgres/
Document History:
- 2026-01-21: Initial version created (P0-012 infrastructure sprint)