Fulcrum Operational Runbooks

Version: 2.0.0
Last Updated: January 6, 2026
Platform: Fulcrum AI Governance
Audience: Operations, SRE, DevOps, On-Call Engineers


Table of Contents

  1. Service Management
     1.1 Starting the Full Stack
     1.2 Stopping Services Gracefully
     1.3 Restarting Individual Services
  2. Health Check Procedures
     2.1 Service Health Endpoints
     2.2 Expected Responses
     2.3 Alert Thresholds
  3. Common Incident Types and Responses
     3.1 Policy Evaluation Failures
     3.2 Database Connection Issues
     3.3 NATS Stream Lag
     3.4 Redis Cache Failures
     3.5 Dashboard Errors
  4. Database Maintenance
     4.1 Running Migrations
     4.2 Analyzing Query Performance
     4.3 Vacuum Procedures
  5. NATS Stream Management
     5.1 Viewing Stream Status
     5.2 Consumer Lag Investigation
     5.3 Stream Purge Procedures
  6. Log Analysis
     6.1 Log Locations
     6.2 Common Log Patterns
     6.3 Error Investigation
  7. Performance Debugging
     7.1 Profiling Go Services
     7.2 Identifying Bottlenecks
     7.3 Load Testing
  8. Rollback Procedures
     8.1 Application Rollback
     8.2 Database Migration Rollback
     8.3 Configuration Rollback

1. Service Management

1.1 Starting the Full Stack

Script Location: /scripts/start-stack.sh

Prerequisites Checklist

# Run preflight checks first
./scripts/preflight.sh

The preflight script validates:

  • Docker daemon is running
  • Docker Compose is available
  • Required ports are free: 5432, 4222, 6379, 50051, 8080, 3000, 3001, 9090
  • Minimum 5GB disk space
  • Minimum 4GB RAM
  • Configuration files exist
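The contents of preflight.sh are not reproduced in this runbook. As a rough illustration of what two of these checks might look like, the sketch below tests for a required command and for free disk space. The helper names (require_cmd, free_gib_from_df, has_min_disk_gib) are hypothetical and are not taken from the script.

```shell
# Hypothetical helpers; the real preflight.sh may work differently.

# require_cmd NAME: succeeds if NAME is available on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# free_gib_from_df: reads `df -Pk PATH` output on stdin and prints the
# available space in whole GiB (column 4 of the data row is KiB available).
free_gib_from_df() {
  awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_min_disk_gib MIN_GIB [PATH]: succeeds if PATH (default: current
# directory) has at least MIN_GIB GiB available.
has_min_disk_gib() (
  min=$1
  path=${2:-.}
  free=$(df -Pk "$path" | free_gib_from_df)
  [ "${free:-0}" -ge "$min" ]
)

# Example use:
#   require_cmd docker     || echo "docker not found"
#   has_min_disk_gib 5 .   || echo "less than 5 GB free"
```

If any of these manual checks fails, the preflight script would be expected to fail for the same reason.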

Start Procedure

# Full stack startup (recommended)
./scripts/start-stack.sh

What it does (step by step):

  1. Runs preflight checks
  2. Pulls Docker images (if needed)
  3. Starts infrastructure: PostgreSQL, NATS, Redis
  4. Waits for infrastructure health (120s timeout)
  5. Runs database migrations
  6. Starts application services: fulcrum-server, event-processor, dashboard
  7. Waits for application health (180s timeout)
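The two health waits in steps 4 and 7 amount to polling a check until it passes or a deadline expires. The script's own implementation is not shown here; the following is a minimal sketch of that pattern, and wait_until is a hypothetical helper name.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds.
# Returns 0 on success, 1 once TIMEOUT_SECONDS have elapsed.
wait_until() (
  timeout=$1
  interval=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      exit 1
    fi
    sleep "$interval"
  done
  exit 0
)

# Example use, mirroring the 120s infrastructure timeout:
#   wait_until 120 2 docker exec fulcrum-postgres pg_isready -U fulcrum -d fulcrum_dev
```

The same helper can be useful during a manual startup, where nothing else waits for PostgreSQL, NATS, and Redis before the application services start.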

Manual Startup (if script fails)

# Step 1: Start infrastructure
docker compose -f docker-compose.unified.yml up -d postgres nats redis

# Step 2: Check health; repeat until postgres, nats, and redis all report "healthy"
docker compose -f docker-compose.unified.yml ps

# Step 3: Start migrations and apps
docker compose -f docker-compose.unified.yml up -d

Verify Startup

# Check all services
docker compose -f docker-compose.unified.yml ps

# Expected output: All services "healthy" or "running"

Access Points After Startup:

Service        URL                      Credentials
gRPC API       localhost:50051          -
REST API       http://localhost:8080    -
Dashboard      http://localhost:3001    -
Grafana        http://localhost:3000    admin/admin
Prometheus     http://localhost:9090    -
NATS Monitor   http://localhost:8222    -

1.2 Stopping Services Gracefully

Script Location: /scripts/stop-stack.sh

Normal Shutdown (Preserves Data)

./scripts/stop-stack.sh

This stops all services but preserves:

  • PostgreSQL data
  • NATS JetStream data
  • Redis persistence
  • Prometheus metrics
  • Grafana dashboards

Full Cleanup (DELETES ALL DATA)

./scripts/stop-stack.sh --clean

WARNING: Requires typing DELETE to confirm. This removes all volumes.

Manual Shutdown

# Stop services only
docker compose -f docker-compose.unified.yml down

# Stop and remove volumes (destructive)
docker compose -f docker-compose.unified.yml down -v

1.3 Restarting Individual Services

Restart Single Service

# Restart fulcrum-server
docker compose -f docker-compose.unified.yml restart fulcrum-server

# Restart event-processor
docker compose -f docker-compose.unified.yml restart event-processor

# Restart dashboard
docker compose -f docker-compose.unified.yml restart dashboard

Rebuild and Restart (After Code Changes)

# Rebuild and restart specific service
docker compose -f docker-compose.unified.yml up -d --build fulcrum-server

# Rebuild all application services
docker compose -f docker-compose.unified.yml up -d --build fulcrum-server event-processor dashboard

Force Recreate (Clears Container State)

docker compose -f docker-compose.unified.yml up -d --force-recreate fulcrum-server

2. Health Check Procedures

2.1 Service Health Endpoints

Service           Endpoint            Port    Protocol
fulcrum-server    gRPC health check   50051   gRPC
fulcrum-server    /metrics            8080    HTTP
event-processor   /healthz            8081    HTTP
dashboard         /                   3001    HTTP
PostgreSQL        pg_isready          5432    PostgreSQL
Redis             PING                6379    Redis
NATS              /healthz            8222    HTTP
Prometheus        /-/healthy          9090    HTTP
Alertmanager      /-/healthy          9093    HTTP
Grafana           /api/health         3000    HTTP

2.2 Expected Responses

gRPC Health Check

# Install grpcurl if needed: brew install grpcurl

# Check gRPC health
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check

# Expected response:
# {
#   "status": "SERVING"
# }

# List available services
grpcurl -plaintext localhost:50051 list

# Expected services:
# fulcrum.policy.v1.PolicyService
# fulcrum.cost.v1.CostService
# fulcrum.eventstore.v1.EventStoreService
# grpc.health.v1.Health
# grpc.reflection.v1alpha.ServerReflection

HTTP Health Checks

# Event Processor
curl -s http://localhost:8081/healthz
# Expected: "OK" or {"status":"healthy"}

# Prometheus
curl -s http://localhost:9090/-/healthy
# Expected: "Prometheus Server is Healthy."

# Alertmanager
curl -s http://localhost:9093/-/healthy
# Expected: "OK"

# Grafana
curl -s http://localhost:3000/api/health
# Expected: {"commit":"...","database":"ok","version":"..."}

# Dashboard
curl -s -o /dev/null -w "%{http_code}" http://localhost:3001
# Expected: 200

Database Health

# PostgreSQL
docker exec fulcrum-postgres pg_isready -U fulcrum -d fulcrum_dev
# Expected: "localhost:5432 - accepting connections"

# Test query
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT 1;"
# Expected: Returns "1"

Redis Health

docker exec fulcrum-redis redis-cli ping
# Expected: "PONG"

docker exec fulcrum-redis redis-cli info memory | grep used_memory_human
# Check memory usage

NATS Health

curl -s http://localhost:8222/healthz
# Expected: "OK"

# JetStream status
curl -s http://localhost:8222/jsz | jq '.streams'
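When several endpoints need checking at once, the individual curl calls above can be wrapped in a small loop. This is a sketch only: check_http and check_all_http are hypothetical helper names, and the URLs are the ones listed in this section.

```shell
# check_http NAME URL: prints "NAME: ok" if URL answers with a
# non-error HTTP status within 5 seconds, otherwise "NAME: FAIL".
check_http() {
  if curl -fsS --max-time 5 -o /dev/null "$2" 2>/dev/null; then
    printf '%s: ok\n' "$1"
  else
    printf '%s: FAIL\n' "$1"
    return 1
  fi
}

# check_all_http: runs every HTTP health check from this section and
# returns non-zero if any of them failed.
check_all_http() (
  failed=0
  check_http event-processor http://localhost:8081/healthz    || failed=1
  check_http prometheus      http://localhost:9090/-/healthy  || failed=1
  check_http alertmanager    http://localhost:9093/-/healthy  || failed=1
  check_http grafana         http://localhost:3000/api/health || failed=1
  check_http dashboard       http://localhost:3001            || failed=1
  check_http nats            http://localhost:8222/healthz    || failed=1
  exit "$failed"
)
```

Calling check_all_http prints one line per service, so a failing component stands out without reading each response body. It does not cover the gRPC, PostgreSQL, or Redis checks, which use their own clients.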

2.3 Alert Thresholds

Critical (P0) - Immediate Response

Metric             Threshold            Action
Service Down       0 healthy pods       Page on-call
Error Rate         >20% for 2min        Page on-call
Database Down      Connection refused   Page on-call
Budget Exhausted   100% utilization     Notify customer

Warning (P1) - Response within 30 minutes

Metric           Threshold                 Action
Policy Latency   P99 >10ms for 5min        Investigate
Error Rate       >5% for 5min              Investigate
Budget Warning   >80% utilization          Notify customer
Memory Usage     >80% for 10min            Scale or optimize
NATS Lag         >1000 messages for 5min   Investigate

Info (P2) - Business Hours Response

Metric            Threshold   Action
Cache Miss Rate   >10%        Tune cache
Slow Queries      >100ms      Optimize
Disk Usage        >70%        Plan expansion
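Error Rate appears in both the P0 and P1 tables with different thresholds and durations. As a worked example of how the two rows combine, the sketch below maps an observed error rate and its duration to a severity. The function name classify_error_rate is hypothetical; the numbers come from the tables above.

```shell
# classify_error_rate RATE_PERCENT DURATION_MINUTES
# Prints P0, P1, or OK using the Error Rate rows of the alert tables:
#   P0: >20% sustained for at least 2 minutes
#   P1: >5%  sustained for at least 5 minutes
classify_error_rate() {
  awk -v rate="$1" -v mins="$2" 'BEGIN {
    if (rate > 20 && mins >= 2)     print "P0"
    else if (rate > 5 && mins >= 5) print "P1"
    else                            print "OK"
  }'
}

# Example use:
#   classify_error_rate 25 2     # prints P0
#   classify_error_rate 7.5 5    # prints P1
#   classify_error_rate 25 1     # prints OK (not yet sustained long enough)
```

A rate above 20% that has lasted less than two minutes is therefore not yet a page, but it becomes one as soon as the two-minute mark passes.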

3. Common Incident Types and Responses

3.1 Policy Evaluation Failures

Symptoms:

  • gRPC errors with code INTERNAL or UNAVAILABLE
  • High fulcrum_policy_evaluations_total{decision="ERROR"} metric
  • Customer reports of blocked operations

Diagnosis:

# Step 1: Check service health
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check

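# Step 2: Check logs for errors (2>&1 also captures logs written to stderr)
docker logs fulcrum-server --tail 100 2>&1 | grep -i error

# Step 3: Check policy cache (--scan iterates incrementally; KEYS can block Redis on large keyspaces)
docker exec fulcrum-redis redis-cli --scan --pattern "policy:*" | head -10
docker exec fulcrum-redis redis-cli dbsize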

# Step 4: Check database connectivity
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT COUNT(*) FROM policies;"

# Step 5: Check metrics
curl -s http://localhost:8080/metrics | grep fulcrum_policy
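The raw counters from Step 5 are easier to judge as a percentage. The sketch below sums every fulcrum_policy_evaluations_total sample and reports the share labelled decision="ERROR". The function name policy_error_pct is hypothetical, and the sketch assumes the metrics output carries no trailing timestamps, which is the usual case for client-library exporters.

```shell
# policy_error_pct: reads Prometheus text-format metrics on stdin and prints
# the percentage of fulcrum_policy_evaluations_total samples whose labels
# include decision="ERROR". Prints "n/a" if the metric is absent.
policy_error_pct() {
  awk '
    /^fulcrum_policy_evaluations_total[{ ]/ {
      total += $NF                       # last field is the sample value
      if ($0 ~ /decision="ERROR"/) errors += $NF
    }
    END {
      if (total > 0) printf "%.2f\n", 100 * errors / total
      else print "n/a"
    }'
}

# Example use:
#   curl -s http://localhost:8080/metrics | policy_error_pct
```

Because the counters are cumulative since process start, this is a lifetime ratio rather than a recent rate. Running it twice a minute apart and comparing the results gives a better sense of whether errors are still occurring.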

Resolution Steps:

# If cache is corrupted, flush and restart
docker exec fulcrum-redis redis-cli FLUSHDB
docker compose -f docker-compose.unified.yml restart fulcrum-server

# If database connection pool exhausted
docker compose -f docker-compose.unified.yml restart fulcrum-server

# If specific tenant affected, check tenant policies
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT id, name, enabled, updated_at
  FROM policies
  WHERE tenant_id = '<tenant_id>'
  ORDER BY updated_at DESC
  LIMIT 10;
"

3.2 Database Connection Issues

Symptoms:

  • "connection refused" or "connection pool exhausted" errors
  • Slow response times
  • Timeout errors in logs

Diagnosis:

# Step 1: Check PostgreSQL status
docker exec fulcrum-postgres pg_isready -U fulcrum -d fulcrum_dev

# Step 2: Check active connections
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT count(*), state
  FROM pg_stat_activity
  WHERE datname = 'fulcrum_dev'
  GROUP BY state;
"

# Step 3: Check max connections
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SHOW max_connections;"

# Step 4: Find long-running queries
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT pid, now() - query_start as duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 5;
"

# Step 5: Check for locks (pg_blocking_pids requires PostgreSQL 9.6+)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT blocked.pid AS blocked_pid,
         unnest(pg_blocking_pids(blocked.pid)) AS blocking_pid,
         blocked.query AS blocked_query
  FROM pg_stat_activity blocked
  WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
"

Resolution Steps:

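# Cancel a long-running query first (keeps the connection open)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT pg_cancel_backend(<PID>);"
# If the query does not stop, terminate the whole backend connection (use with caution)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT pg_terminate_backend(<PID>);"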

# Restart services to reset connection pools
docker compose -f docker-compose.unified.yml restart fulcrum-server event-processor

# If PostgreSQL is overloaded, increase connection limits
# Edit docker-compose.unified.yml to add:
# command: postgres -c 'max_connections=200'

3.3 NATS Stream Lag

Symptoms:

  • Events not being processed
  • High fulcrum_nats_pending_messages metric
  • Event processor logs show idle or slow consumption

Diagnosis:

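# Step 1: Check NATS server status (/varz returns JSON; the root path returns an HTML index)
curl -s http://localhost:8222/varz | jq

# Step 2: Check JetStream streams (per-stream detail requires streams=true)
curl -s "http://localhost:8222/jsz?streams=true" | jq '.account_details[].stream_detail[] | {name, messages: .state.messages, consumers: .state.consumer_count}'

# Step 3: Check consumer lag
curl -s "http://localhost:8222/jsz?consumers=true" | jq '.account_details[].stream_detail[].consumer_detail[]? | {name, num_pending, num_ack_pending}'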

# Step 4: Check event processor logs (2>&1 also captures logs written to stderr)
docker logs fulcrum-processor --tail 100 2>&1 | grep -E "(lag|pending|slow)"

# Step 5: Check event processor metrics
curl -s http://localhost:8081/metrics | grep nats

Resolution Steps:

# If consumer is stuck, restart event processor
docker compose -f docker-compose.unified.yml restart event-processor

# If stream is corrupted, purge and restart (CAUTION: loses unprocessed events)
docker exec -it fulcrum-nats nats stream purge FULCRUM_EXECUTION_EVENTS --force

# Scale up event processor
# Edit docker-compose.unified.yml to add replicas or deploy additional instances

3.4 Redis Cache Failures

Symptoms:

  • Policy lookups slow (P99 >2ms)
  • High cache miss rate
  • Redis connection errors in logs

Diagnosis:

# Step 1: Check Redis health
docker exec fulcrum-redis redis-cli ping

# Step 2: Check memory usage
docker exec fulcrum-redis redis-cli info memory | grep -E "(used_memory_human|maxmemory)"

# Step 3: Check keyspace stats
docker exec fulcrum-redis redis-cli info keyspace

# Step 4: Check slow log
docker exec fulcrum-redis redis-cli slowlog get 10

# Step 5: Check connection count
docker exec fulcrum-redis redis-cli info clients | grep connected
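None of the steps above measures the cache miss rate named in the symptoms. Redis reports keyspace_hits and keyspace_misses in its INFO stats section, so the rate can be derived from those two counters. A minimal sketch follows; redis_miss_rate is a hypothetical helper name.

```shell
# redis_miss_rate: reads `redis-cli info stats` output on stdin and prints
# the keyspace miss rate as a percentage, or "n/a" if no lookups are recorded.
redis_miss_rate() {
  # INFO output uses CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total > 0) printf "%.2f\n", 100 * misses / total
      else print "n/a"
    }'
}

# Example use:
#   docker exec fulcrum-redis redis-cli info stats | redis_miss_rate
```

The counters cover the whole Redis instance since its last restart (or CONFIG RESETSTAT), not only the policy keys, so the result is a coarse indicator to compare against the >10% Cache Miss Rate threshold.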

Resolution Steps:

# Clear stale keys (by pattern); --scan avoids blocking Redis the way KEYS can
docker exec fulcrum-redis redis-cli --scan --pattern "policy:stale:*" | xargs -r docker exec -i fulcrum-redis redis-cli del

# Flush entire cache (service will repopulate)
docker exec fulcrum-redis redis-cli FLUSHDB

# Restart Redis
docker compose -f docker-compose.unified.yml restart redis

# After restart, verify fulcrum-server reconnects (2>&1 also captures logs written to stderr)
docker logs fulcrum-server --tail 50 2>&1 | grep -i redis

3.5 Dashboard Errors

Symptoms:

  • Dashboard not loading
  • API errors displayed
  • Authentication failures

Diagnosis:

# Step 1: Check dashboard container status
docker logs fulcrum-dashboard --tail 100

# Step 2: Check if backend is reachable from dashboard
docker exec fulcrum-dashboard wget -q -O- http://fulcrum-server:8080/metrics | head -5

# Step 3: Check environment variables
docker exec fulcrum-dashboard env | grep -E "(API_URL|GRPC_URL|CLERK)"

# Step 4: Check Next.js build status
docker exec fulcrum-dashboard ls -la /app/.next

Resolution Steps:

# Rebuild dashboard
docker compose -f docker-compose.unified.yml up -d --build dashboard

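# Clear the Next.js build cache and restart
# (only .next/cache is removed; .next itself holds the build output the server needs)
docker exec fulcrum-dashboard rm -rf /app/.next/cache
docker compose -f docker-compose.unified.yml restart dashboard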

# Check for Clerk authentication issues
# Verify CLERK_SECRET_KEY and NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY are set correctly

4. Database Maintenance

4.1 Running Migrations

Script Location: /scripts/run-migrations.sh

Local Development

# Using the migration script
./scripts/run-migrations.sh --url "postgresql://fulcrum:fulcrum@localhost:5432/fulcrum_dev?sslmode=disable"

# Or via Docker Compose (automatic on startup)
docker compose -f docker-compose.unified.yml up migrate

Production (Railway)

# Get database URL from Railway
export DATABASE_URL=$(railway variables get DATABASE_URL)

# Dry run (preview migrations)
./scripts/run-migrations.sh --url "$DATABASE_URL" --dry-run

# Apply migrations
./scripts/run-migrations.sh --url "$DATABASE_URL"

View Applied Migrations

docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT version, applied_at, description
  FROM schema_migrations
  ORDER BY applied_at DESC;
"

Create New Migration

# Create migration files manually
# Capture the timestamp once so the up and down files share the same version prefix
ts=$(date +%Y%m%d%H%M%S)
touch "infra/migrations/postgres/${ts}_description.up.sql"
touch "infra/migrations/postgres/${ts}_description.down.sql"

4.2 Analyzing Query Performance

Enable Query Statistics

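# pg_stat_statements must also be listed in shared_preload_libraries
# (this requires a PostgreSQL restart) before the view returns data.
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
"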

Find Slow Queries

# Column names below apply to PostgreSQL 13+; older versions use total_time, mean_time, max_time
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT
    query,
    calls,
    round(total_exec_time::numeric, 2) as total_time_ms,
    round(mean_exec_time::numeric, 2) as mean_time_ms,
    round(max_exec_time::numeric, 2) as max_time_ms
  FROM pg_stat_statements
  WHERE dbid = (SELECT oid FROM pg_database WHERE datname = 'fulcrum_dev')
  ORDER BY mean_exec_time DESC
  LIMIT 10;
"

Analyze Query Plan

docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
  SELECT * FROM policies WHERE tenant_id = '<tenant_id>' AND enabled = true;
"

Check Missing Indexes

docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT
    relname as table,
    seq_scan,
    idx_scan,
    CASE WHEN seq_scan + idx_scan > 0
         THEN round(100.0 * idx_scan / (seq_scan + idx_scan), 2)
         ELSE 0 END as idx_hit_rate
  FROM pg_stat_user_tables
  WHERE seq_scan > 100
  ORDER BY seq_scan DESC;
"

4.3 Vacuum Procedures

Check Table Bloat

docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT
    schemaname,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as size,
    n_dead_tup as dead_tuples,
    last_vacuum,
    last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 10;
"

Manual Vacuum

# Vacuum specific table (non-blocking)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  VACUUM ANALYZE policies;
"

# Full vacuum (requires exclusive lock - use with caution)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  VACUUM FULL ANALYZE policies;
"

# Vacuum entire database
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  VACUUM ANALYZE;
"

Reindex Tables

# Reindex specific table
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  REINDEX TABLE policies;
"

# Reindex concurrently (non-blocking, PostgreSQL 12+)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  REINDEX TABLE CONCURRENTLY policies;
"

5. NATS Stream Management

5.1 Viewing Stream Status

List All Streams

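# Via NATS monitoring endpoint (per-stream detail requires streams=true)
curl -s "http://localhost:8222/jsz?streams=true" | jq '.account_details[].stream_detail[] | {name, messages: .state.messages, bytes: .state.bytes, consumers: .state.consumer_count}'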

# Via nats CLI (if installed in container)
docker exec fulcrum-nats nats stream ls

Stream Details

curl -s "http://localhost:8222/jsz?streams=true" | jq '.account_details[].stream_detail[] | select(.name == "FULCRUM_EXECUTION_EVENTS")'

View Stream Configuration

curl -s "http://localhost:8222/jsz?streams=true&config=true" | jq '.account_details[].stream_detail[] | {name, config}'

5.2 Consumer Lag Investigation

Check Consumer Status

curl -s "http://localhost:8222/jsz?consumers=true" | jq '
  .account_details[].stream_detail[] |
  {
    stream: .name,
    consumers: [.consumer_detail[]? | {
      name: .name,
      pending: .num_pending,
      ack_pending: .num_ack_pending,
      redelivered: .num_redelivered
    }]
  }
'

Monitor Consumer in Real-time

# Watch consumer lag every 5 seconds
watch -n 5 'curl -s "http://localhost:8222/jsz?consumers=true" | jq ".account_details[].stream_detail[].consumer_detail[]? | .num_pending"'
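A single pending count says little on its own; what matters during an incident is whether it is rising or falling. As a small worked example, the sketch below turns two num_pending readings taken a known number of seconds apart into a trend. The name lag_trend is hypothetical, and the interval is assumed to be greater than zero.

```shell
# lag_trend PREVIOUS CURRENT INTERVAL_SECONDS
# Prints whether consumer lag is growing, draining, or steady, together
# with the change in pending messages per second.
lag_trend() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" 'BEGIN {
    rate = (curr - prev) / secs
    if (rate > 0)      state = "growing"
    else if (rate < 0) state = "draining"
    else               state = "steady"
    printf "%s %.1f msg/s\n", state, rate
  }'
}

# Example use with two readings taken 60 seconds apart:
#   lag_trend 1000 1600 60    # prints "growing 10.0 msg/s"
```

A lag that is large but draining usually only needs time, whereas a growing lag points at a stuck or under-provisioned consumer.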

Identify Stuck Messages

# Check for consumers with many redelivered messages
curl -s "http://localhost:8222/jsz?consumers=true" | jq '
  .account_details[].stream_detail[].consumer_detail[]? |
  select(.num_redelivered > 10) |
  {name, redelivered: .num_redelivered}
'

5.3 Stream Purge Procedures

Purge All Messages (CAUTION: Loses Unprocessed Events)

# Using the nats CLI (if installed in the container)
docker exec fulcrum-nats nats stream purge FULCRUM_EXECUTION_EVENTS --force

Purge Messages Older Than N Hours

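# nats stream purge filters by --subject, --seq, or --keep; it has no age-based filter.
# To drop messages older than 24 hours, set the stream's max age instead.
# Note: this changes the stream's retention setting until it is edited again.
docker exec fulcrum-nats nats stream edit FULCRUM_EXECUTION_EVENTS --max-age=24h --force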

Delete and Recreate Stream

# Delete stream
docker exec fulcrum-nats nats stream delete FULCRUM_EXECUTION_EVENTS --force

# Stream will be recreated on next publish by fulcrum-server
docker compose -f docker-compose.unified.yml restart fulcrum-server

6. Log Analysis

6.1 Log Locations

Docker Compose Logs

# All services
docker compose -f docker-compose.unified.yml logs

# Specific service
docker compose -f docker-compose.unified.yml logs fulcrum-server

# Follow logs in real-time
docker compose -f docker-compose.unified.yml logs -f fulcrum-server

# Last N lines
docker compose -f docker-compose.unified.yml logs --tail 100 fulcrum-server

Container Log Files

# Find container log location
docker inspect fulcrum-server --format='{{.LogPath}}'

# View raw log file (requires root)
sudo cat /var/lib/docker/containers/<container-id>/<container-id>-json.log

Service-Specific Logs

Service           Primary Log                     Notes
fulcrum-server    docker logs fulcrum-server      JSON structured
event-processor   docker logs fulcrum-processor   JSON structured
dashboard         docker logs fulcrum-dashboard   Next.js format
PostgreSQL        docker logs fulcrum-postgres    PostgreSQL format
NATS              docker logs fulcrum-nats        NATS format
Redis             docker logs fulcrum-redis       Redis format

6.2 Common Log Patterns

Error Patterns

# Find all errors
docker logs fulcrum-server 2>&1 | grep -iE "(error|failed|fatal)"

# Find specific error types
docker logs fulcrum-server 2>&1 | grep -i "connection refused"
docker logs fulcrum-server 2>&1 | grep -i "timeout"
docker logs fulcrum-server 2>&1 | grep -i "authentication"

Performance Patterns

# Find slow operations of 100 ms or more (if latency_ms is logged as a JSON or key=value field)
docker logs fulcrum-server 2>&1 | grep -E 'latency_ms"?[:=] ?[0-9]{3,}'

# Find high memory usage warnings
docker logs fulcrum-server 2>&1 | grep -i "memory"

Request Patterns

# Find specific tenant activity
docker logs fulcrum-server 2>&1 | grep "tenant_id.*<tenant_id>"

# Find gRPC method calls
docker logs fulcrum-server 2>&1 | grep "grpc_method"

6.3 Error Investigation

Structured Error Analysis

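# Extract JSON logs and filter (-R with fromjson? skips non-JSON lines instead of aborting)
docker logs fulcrum-server 2>&1 | jq -R -r 'fromjson? | select(.level == "error") | "\(.time) \(.msg)"'

# Group errors by message (uses .msg only, so differing timestamps do not split the groups)
docker logs fulcrum-server 2>&1 | jq -R -r 'fromjson? | select(.level == "error") | .msg' | sort | uniq -c | sort -rn | head -10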
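Services that do not log JSON (PostgreSQL, NATS, Redis) need a different approach to grouping, because each line carries its own timestamp and often a varying ID or address. One option is to strip the leading timestamp and mask digit runs before counting. The sketch below does that; normalize_log_lines and top_errors are hypothetical helper names, and the timestamp pattern assumes lines start with an ISO-style date and time.

```shell
# normalize_log_lines: removes a leading ISO-style timestamp and replaces
# every run of digits with "N" so that otherwise identical lines compare equal.
normalize_log_lines() {
  sed -E \
    -e 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9:.]+(Z|[+-][0-9:]+)?[[:space:]]*//' \
    -e 's/[0-9]+/N/g'
}

# top_errors [COUNT]: reads log lines on stdin and prints the COUNT
# (default 10) most frequent error-like lines after normalization.
top_errors() {
  grep -iE 'error|failed|fatal' \
    | normalize_log_lines \
    | sort | uniq -c | sort -rn \
    | head -n "${1:-10}"
}

# Example use:
#   docker logs fulcrum-postgres 2>&1 | top_errors 5
```

Masking digits is deliberately blunt: it merges lines that differ only in ports, PIDs, or durations, at the cost of also rewriting digits inside identifiers.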

Trace Correlation

# Find all logs for a specific trace ID
TRACE_ID="<trace-id>"
docker logs fulcrum-server 2>&1 | grep "$TRACE_ID"
docker logs fulcrum-processor 2>&1 | grep "$TRACE_ID"

Time-Based Analysis

# Logs from last hour (--since filters on Docker's own log timestamps, not on timestamps inside the log lines)
docker logs --since 1h fulcrum-server 2>&1 | grep -i error

# Logs between specific times
docker logs --since "2026-01-06T10:00:00" --until "2026-01-06T11:00:00" fulcrum-server

7. Performance Debugging

7.1 Profiling Go Services

Enable pprof

pprof endpoints are exposed on the HTTP port (8080 for fulcrum-server).

# CPU profile (30 seconds); the URL is quoted so the shell does not interpret "?"
go tool pprof "http://localhost:8080/debug/pprof/profile?seconds=30"

# Memory/heap profile
go tool pprof http://localhost:8080/debug/pprof/heap

# Goroutine profile
go tool pprof http://localhost:8080/debug/pprof/goroutine

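# Block profile (blocking operations); empty unless the service calls runtime.SetBlockProfileRate
go tool pprof http://localhost:8080/debug/pprof/block

# Mutex profile; empty unless the service calls runtime.SetMutexProfileFraction
go tool pprof http://localhost:8080/debug/pprof/mutex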

Analyze Profile

# Interactive mode
go tool pprof http://localhost:8080/debug/pprof/heap

# Common commands in pprof:
# top10      - Show top 10 functions
# web        - Open graph in browser
# list <func> - Show source for function
# pdf        - Export as PDF

Generate Flame Graph

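# The pprof web UI includes a flame graph view; go-torch is deprecated in favor of it
# Port 8085 is an arbitrary free port; 8081 is avoided because event-processor listens on it
go tool pprof -http=:8085 "http://localhost:8080/debug/pprof/profile?seconds=30"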

7.2 Identifying Bottlenecks

Check Metrics for Bottlenecks

# Policy evaluation latency
curl -s http://localhost:8080/metrics | grep "fulcrum_policy_evaluation_duration_seconds"

# Database query latency
curl -s http://localhost:8080/metrics | grep "fulcrum_db_query_duration_seconds"

# gRPC latency
curl -s http://localhost:8080/metrics | grep "grpc_server_handling_seconds"

Calculate P99 Latency

# Using Prometheus query (via curl); sum by (le) merges all label sets into a single P99
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(fulcrum_policy_evaluation_duration_seconds_bucket[5m])))'

Database Bottlenecks

# Check for table/index scans
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT relname, seq_scan, idx_scan
  FROM pg_stat_user_tables
  WHERE seq_scan > idx_scan
  ORDER BY seq_scan DESC;
"

# Check cache hit ratio (nullif avoids a division-by-zero error on a database with no block reads yet)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT
    round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2) as cache_hit_ratio
  FROM pg_stat_database
  WHERE datname = 'fulcrum_dev';
"

7.3 Load Testing

Location: /tests/load/

Run k6 Load Test

cd tests/load

# Basic load test
k6 run load_test.js

# With specific VUs and duration
k6 run --vus 10 --duration 30s load_test.js

# With output to JSON
k6 run --out json=results.json load_test.js

gRPC Load Testing with ghz

# Install ghz: go install github.com/bojand/ghz/cmd/ghz@latest

# Basic policy check test
ghz --insecure \
    --proto ../proto/fulcrum/policy/v1/policy_service.proto \
    --call fulcrum.policy.v1.PolicyService/CheckPolicy \
    -d '{"tenant_id":"test","policy_id":"test"}' \
    localhost:50051

# Sustained load test
ghz --insecure \
    --concurrency 50 \
    --total 10000 \
    --proto ../proto/fulcrum/policy/v1/policy_service.proto \
    --call fulcrum.policy.v1.PolicyService/CheckPolicy \
    -d '{"tenant_id":"test","policy_id":"test"}' \
    localhost:50051

Test Coverage Report

./scripts/test-coverage-report.sh

# With HTML output
./scripts/test-coverage-report.sh --html

# CI mode (strict)
./scripts/test-coverage-report.sh --ci

8. Rollback Procedures

8.1 Application Rollback

Script Location: /scripts/deployment/rollback.sh

Kubernetes Deployment Rollback

# Rollback to previous version
./scripts/deployment/rollback.sh server

# Rollback to specific revision
./scripts/deployment/rollback.sh server 5

# Rollback all components
./scripts/deployment/rollback.sh all

Blue/Green Instant Switch

# Switch to blue deployment
./scripts/deployment/blue-green-switch.sh blue

# Switch to green deployment
./scripts/deployment/blue-green-switch.sh green

# Dry run (see what would happen)
./scripts/deployment/blue-green-switch.sh blue --dry-run

Docker Compose Rollback

# Stop current version
docker compose -f docker-compose.unified.yml stop fulcrum-server

# Pull previous image version
docker pull ghcr.io/your-org/fulcrum-server:v1.0.0

# Update docker-compose.unified.yml with the previous version tag
# Restart service
docker compose -f docker-compose.unified.yml up -d fulcrum-server
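The tag update in the middle of this procedure is a manual edit. If it needs to be scripted, something along the following lines could work. It is a sketch only: set_image_tag is a hypothetical helper, and it assumes the compose file pins the image on an unquoted line of the form "image: NAME:TAG".

```shell
# set_image_tag FILE IMAGE TAG
# Rewrites "image: IMAGE:<anything>" lines in FILE to "image: IMAGE:TAG".
# Matching is literal (no regex), and anything after the old tag on that
# line, such as a trailing comment, is dropped.
set_image_tag() (
  file=$1
  image=$2
  tag=$3
  tmp=$(mktemp) || exit 1
  awk -v image="$image" -v tag="$tag" '
    {
      prefix = "image: " image ":"
      pos = index($0, prefix)
      if (pos > 0) print substr($0, 1, pos - 1) prefix tag
      else print
    }' "$file" > "$tmp" || { rm -f "$tmp"; exit 1; }
  # Overwrite in place so the original file keeps its permissions.
  cat "$tmp" > "$file" || { rm -f "$tmp"; exit 1; }
  rm -f "$tmp"
)

# Example use:
#   set_image_tag docker-compose.unified.yml ghcr.io/your-org/fulcrum-server v1.0.0
```

Reviewing the result with git diff before restarting the service is a cheap safeguard against an unintended match.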

Railway Rollback

# List deployments
railway deployments --service fulcrum-server

# Rollback to specific deployment
railway rollback --service fulcrum-server --deployment-id <deployment-id>

8.2 Database Migration Rollback

Prerequisites

  1. Have backup from before migration
  2. Plan for downtime
  3. Notify affected users
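Because down migrations are written by hand, it can be worth confirming that one exists before committing to a rollback window. The sketch below lists every up migration without a matching down file. The function name missing_down_migrations is hypothetical; the .up.sql and .down.sql naming follows the files created in section 4.1.

```shell
# missing_down_migrations DIR
# Prints the name of each *.up.sql file in DIR that has no matching
# *.down.sql file. Returns 1 if any are missing, 0 otherwise.
missing_down_migrations() (
  dir=$1
  status=0
  for up in "$dir"/*.up.sql; do
    [ -e "$up" ] || continue            # the glob matched nothing
    down="${up%.up.sql}.down.sql"
    if [ ! -e "$down" ]; then
      printf '%s\n' "${up##*/}"
      status=1
    fi
  done
  exit "$status"
)

# Example use:
#   missing_down_migrations infra/migrations/postgres
```

An empty result only shows that the files exist; whether each down migration actually reverses its up migration still has to be reviewed.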

Rollback Procedure

# Step 1: Stop application services
docker compose -f docker-compose.unified.yml stop fulcrum-server event-processor

# Step 2: Identify migration to rollback
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT version, applied_at FROM schema_migrations ORDER BY applied_at DESC LIMIT 5;
"

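# Step 3: Apply down migration (if exists)
# Down migrations must be created manually.
# The file is read on the host and piped in, since the path is not visible inside the container unless mounted.
docker exec -i fulcrum-postgres psql -U fulcrum -d fulcrum_dev < /path/to/migration.down.sql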

# Step 4: Update schema_migrations table
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  DELETE FROM schema_migrations WHERE version = '<version-to-rollback>';
"

# Step 5: Restart services with previous code version
docker compose -f docker-compose.unified.yml up -d fulcrum-server event-processor

# Step 6: Verify rollback
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  SELECT version, applied_at FROM schema_migrations ORDER BY applied_at DESC LIMIT 5;
"

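Restore from Backup

# For major issues, restore the database volume from a backup.
# This is a full volume restore, not WAL-based point-in-time recovery:
# changes made after the backup was taken are lost.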
# Step 1: Stop all services
docker compose -f docker-compose.unified.yml down

# Step 2: Remove current database volume
docker volume rm fulcrum-postgres-data

# Step 3: Restore from backup
# For Docker: Create new volume and restore
docker volume create fulcrum-postgres-data
docker run --rm -v fulcrum-postgres-data:/data -v "$(pwd)":/backup alpine tar xzf /backup/postgres-backup.tar.gz -C /data

# Step 4: Start services
docker compose -f docker-compose.unified.yml up -d
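Step 2 deletes the current volume, so an unreadable archive discovered at Step 3 would leave no database at all. A quick integrity check before starting may be worthwhile. The sketch below is one way to do it; verify_backup is a hypothetical helper name.

```shell
# verify_backup ARCHIVE
# Succeeds and prints the number of entries if ARCHIVE is a readable,
# non-empty gzip-compressed tarball; prints a reason to stderr and fails otherwise.
verify_backup() (
  archive=$1
  [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; exit 1; }
  # tar -tzf lists entries without extracting and fails on a corrupt or truncated archive.
  listing=$(tar -tzf "$archive" 2>/dev/null) || { echo "unreadable archive: $archive" >&2; exit 1; }
  [ -n "$listing" ] || { echo "archive has no entries: $archive" >&2; exit 1; }
  printf '%s\n' "$listing" | wc -l | tr -d ' '
)

# Example use, before removing the volume:
#   verify_backup postgres-backup.tar.gz || echo "do not proceed with the restore"
```

This confirms only that the archive can be read end to end. It does not prove the data directory inside it is a consistent PostgreSQL snapshot.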

8.3 Configuration Rollback

Environment Variable Rollback

# Docker Compose: Edit .env file with previous values
cp .env.backup .env
docker compose -f docker-compose.unified.yml up -d

# Railway: Use dashboard or CLI
railway variables set KEY=previous_value --service fulcrum-server

Kubernetes ConfigMap Rollback

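# ConfigMaps have no rollout history (kubectl rollout history covers Deployments,
# DaemonSets, and StatefulSets only); inspect the live ConfigMap and compare it with the versioned file
kubectl get configmap fulcrum-config -n fulcrum -o yaml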

# Apply previous ConfigMap
kubectl apply -f infra/k8s/configmaps/fulcrum-config-v1.yaml

# Restart pods to pick up changes
kubectl rollout restart deployment fulcrum-server -n fulcrum

Feature Flag Rollback

# If using feature flags, disable problematic feature
# Via database:
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
  UPDATE feature_flags SET enabled = false WHERE name = 'new-feature';
"

# Via API (if available):
curl -X PATCH http://localhost:8080/api/v1/features/new-feature \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

Quick Reference Commands

Service Status

# All services status
docker compose -f docker-compose.unified.yml ps

# Specific service logs
docker compose -f docker-compose.unified.yml logs -f fulcrum-server --tail 100

# Check health endpoints
curl -s http://localhost:8080/metrics | head
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check

Emergency Actions

# Restart all services
docker compose -f docker-compose.unified.yml restart

# Stop everything immediately
docker compose -f docker-compose.unified.yml down

# Full reset (WARNING: loses data)
docker compose -f docker-compose.unified.yml down -v && docker compose -f docker-compose.unified.yml up -d

Database Quick Access

# Interactive psql
docker exec -it fulcrum-postgres psql -U fulcrum -d fulcrum_dev

# Quick query
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT COUNT(*) FROM policies;"

Log Investigation

# Recent errors
docker logs fulcrum-server --since 5m 2>&1 | grep -i error | tail -20

# All services errors
for svc in fulcrum-server fulcrum-processor fulcrum-dashboard; do
  echo "=== $svc ==="
  docker logs "$svc" --since 5m 2>&1 | grep -i error | tail -5
done

Contact Information

Escalation Path

Level   Contact               Response Time
L1      On-call Engineer      5 minutes
L2      Platform Team Lead    15 minutes
L3      Engineering Manager   30 minutes
L4      CTO                   1 hour

Resources

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000
  • Alertmanager: http://localhost:9093
  • NATS Monitor: http://localhost:8222
  • Documentation: /documentation/
  • Incident Slack: #fulcrum-incidents

Document Version: 2.0.0
Last Updated: January 6, 2026
Next Review: April 2026
Maintainer: Fulcrum Platform Team