Fulcrum Operational Runbooks
Version: 2.0.0
Last Updated: January 6, 2026
Platform: Fulcrum AI Governance
Audience: Operations, SRE, DevOps, On-Call Engineers
Table of Contents
- Service Management
- Starting the Full Stack
- Stopping Services Gracefully
- Restarting Individual Services
- Health Check Procedures
- Service Health Endpoints
- Expected Responses
- Alert Thresholds
- Common Incident Types and Responses
- Policy Evaluation Failures
- Database Connection Issues
- NATS Stream Lag
- Redis Cache Failures
- Dashboard Errors
- Database Maintenance
- Running Migrations
- Analyzing Query Performance
- Vacuum Procedures
- NATS Stream Management
- Viewing Stream Status
- Consumer Lag Investigation
- Stream Purge Procedures
- Log Analysis
- Log Locations
- Common Log Patterns
- Error Investigation
- Performance Debugging
- Profiling Go Services
- Identifying Bottlenecks
- Load Testing
- Rollback Procedures
- Application Rollback
- Database Migration Rollback
- Configuration Rollback
1. Service Management
1.1 Starting the Full Stack
Script Location: /scripts/start-stack.sh
Prerequisites Checklist
The preflight script validates:
- Docker daemon is running
- Docker Compose is available
- Required ports are free: 5432, 4222, 6379, 50051, 8080, 3000, 3001, 9090
- Minimum 5 GB disk space
- Minimum 4 GB RAM
- Configuration files exist
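If you need to verify the port requirement by hand before running the script, a minimal sketch (assuming lsof is available on the host) is:
# Check that nothing is already listening on the ports the stack needs
for port in 5432 4222 6379 50051 8080 3000 3001 9090; do
  if lsof -iTCP:"$port" -sTCP:LISTEN -t >/dev/null 2>&1; then
    echo "port $port is already in use"
  else
    echo "port $port is free"
  fi
done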
Start Procedure
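The script is normally invoked directly from the repository root (any flags it accepts are defined in the script itself):
# Start the full stack with preflight checks
./scripts/start-stack.sh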
What it does (step by step):
- Runs preflight checks
- Pulls Docker images (if needed)
- Starts infrastructure: PostgreSQL, NATS, Redis
- Waits for infrastructure health (120s timeout)
- Runs database migrations
- Starts application services: fulcrum-server, event-processor, dashboard
- Waits for application health (180s timeout)
Manual Startup (if script fails)
# Step 1: Start infrastructure
docker compose -f docker-compose.unified.yml up -d postgres nats redis
# Step 2: Wait for health
docker compose -f docker-compose.unified.yml ps
# Step 3: Start migrations and apps
docker compose -f docker-compose.unified.yml up -d
Verify Startup
# Check all services
docker compose -f docker-compose.unified.yml ps
# Expected output: All services "healthy" or "running"
Access Points After Startup:
| Service | URL | Credentials |
|---|---|---|
| gRPC API | localhost:50051 | - |
| REST API | http://localhost:8080 | - |
| Dashboard | http://localhost:3001 | - |
| Grafana | http://localhost:3000 | admin/admin |
| Prometheus | http://localhost:9090 | - |
| NATS Monitor | http://localhost:8222 | - |
1.2 Stopping Services Gracefully
Script Location: /scripts/stop-stack.sh
Normal Shutdown (Preserves Data)
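A plain invocation of the stop script performs the graceful, data-preserving shutdown described below (check the script for any additional flags):
# Graceful shutdown; data volumes are preserved
./scripts/stop-stack.sh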
This stops all services but preserves:
- PostgreSQL data
- NATS JetStream data
- Redis persistence
- Prometheus metrics
- Grafana dashboards
Full Cleanup (DELETES ALL DATA)
WARNING: Requires typing DELETE to confirm. This removes all volumes.
Manual Shutdown
# Stop services only
docker compose -f docker-compose.unified.yml down
# Stop and remove volumes (destructive)
docker compose -f docker-compose.unified.yml down -v
1.3 Restarting Individual Services
Restart Single Service
# Restart fulcrum-server
docker compose -f docker-compose.unified.yml restart fulcrum-server
# Restart event-processor
docker compose -f docker-compose.unified.yml restart event-processor
# Restart dashboard
docker compose -f docker-compose.unified.yml restart dashboard
Rebuild and Restart (After Code Changes)
# Rebuild and restart specific service
docker compose -f docker-compose.unified.yml up -d --build fulcrum-server
# Rebuild all application services
docker compose -f docker-compose.unified.yml up -d --build fulcrum-server event-processor dashboard
Force Recreate (Clears Container State)
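A force-recreate discards the existing containers (volumes are untouched) and starts fresh ones even when the configuration has not changed; the standard Docker Compose form is:
# Force-recreate a single service
docker compose -f docker-compose.unified.yml up -d --force-recreate fulcrum-server
# Force-recreate all services
docker compose -f docker-compose.unified.yml up -d --force-recreate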
2. Health Check Procedures
2.1 Service Health Endpoints
| Service | Endpoint | Port | Protocol |
|---|---|---|---|
| fulcrum-server | gRPC health check | 50051 | gRPC |
| fulcrum-server | /metrics | 8080 | HTTP |
| event-processor | /healthz | 8081 | HTTP |
| dashboard | / | 3000 | HTTP |
| PostgreSQL | pg_isready | 5432 | PostgreSQL |
| Redis | PING | 6379 | Redis |
| NATS | /healthz | 8222 | HTTP |
| Prometheus | /-/healthy | 9090 | HTTP |
| Alertmanager | /-/healthy | 9093 | HTTP |
| Grafana | /api/health | 3000 | HTTP |
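For a quick pass over the HTTP-based checks in this table, a minimal sweep of the host-published ports from section 1.1 (the dashboard is published on 3001; gRPC, PostgreSQL, and Redis still need the protocol-specific checks in 2.2):
# One-shot sweep of the HTTP health endpoints
for url in \
  http://localhost:8081/healthz \
  http://localhost:8222/healthz \
  http://localhost:9090/-/healthy \
  http://localhost:9093/-/healthy \
  http://localhost:3000/api/health \
  http://localhost:3001 ; do
  printf '%-40s %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
done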
2.2 Expected Responses
gRPC Health Check
# Install grpcurl if needed: brew install grpcurl
# Check gRPC health
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check
# Expected response:
# {
# "status": "SERVING"
# }
# List available services
grpcurl -plaintext localhost:50051 list
# Expected services:
# fulcrum.policy.v1.PolicyService
# fulcrum.cost.v1.CostService
# fulcrum.eventstore.v1.EventStoreService
# grpc.health.v1.Health
# grpc.reflection.v1alpha.ServerReflection
HTTP Health Checks
# Event Processor
curl -s http://localhost:8081/healthz
# Expected: "OK" or {"status":"healthy"}
# Prometheus
curl -s http://localhost:9090/-/healthy
# Expected: "Prometheus Server is Healthy."
# Alertmanager
curl -s http://localhost:9093/-/healthy
# Expected: "OK"
# Grafana
curl -s http://localhost:3000/api/health
# Expected: {"commit":"...","database":"ok","version":"..."}
# Dashboard
curl -s -o /dev/null -w "%{http_code}" http://localhost:3001
# Expected: 200
Database Health
# PostgreSQL
docker exec fulcrum-postgres pg_isready -U fulcrum -d fulcrum_dev
# Expected: output ending in "accepting connections"
# Test query
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT 1;"
# Expected: Returns "1"
Redis Health
docker exec fulcrum-redis redis-cli ping
# Expected: "PONG"
docker exec fulcrum-redis redis-cli info memory | grep used_memory_human
# Check memory usage
NATS Health
curl -s http://localhost:8222/healthz
# Expected: "OK"
# JetStream status
curl -s http://localhost:8222/jsz | jq '.streams'
2.3 Alert Thresholds
Critical (P0) - Immediate Response
| Metric | Threshold | Action |
|---|---|---|
| Service Down | 0 healthy pods | Page on-call |
| Error Rate | >20% for 2min | Page on-call |
| Database Down | Connection refused | Page on-call |
| Budget Exhausted | 100% utilization | Notify customer |
Warning (P1) - Response within 30 minutes
| Metric | Threshold | Action |
|---|---|---|
| Policy Latency | P99 >10ms for 5min | Investigate |
| Error Rate | >5% for 5min | Investigate |
| Budget Warning | >80% utilization | Notify customer |
| Memory Usage | >80% for 10min | Scale or optimize |
| NATS Lag | >1000 messages for 5min | Investigate |
Info (P2) - Business Hours Response
| Metric | Threshold | Action |
|---|---|---|
| Cache Miss Rate | >10% | Tune cache |
| Slow Queries | >100ms | Optimize |
| Disk Usage | >70% | Plan expansion |
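To spot-check the error-rate thresholds above against live data, one option is to run the underlying ratio directly against Prometheus. This assumes the fulcrum_policy_evaluations_total counter with its decision label, as referenced in section 3.1; adjust the metric and labels to match your deployment:
# Policy evaluation error rate over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(fulcrum_policy_evaluations_total{decision="ERROR"}[5m])) / sum(rate(fulcrum_policy_evaluations_total[5m]))' \
  | jq '.data.result'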
3. Common Incident Types and Responses
3.1 Policy Evaluation Failures
Symptoms:
- gRPC errors with code INTERNAL or UNAVAILABLE
- High fulcrum_policy_evaluations_total{decision="ERROR"} metric
- Customer reports of blocked operations
Diagnosis:
# Step 1: Check service health
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check
# Step 2: Check logs for errors
docker logs fulcrum-server --tail 100 | grep -i error
# Step 3: Check policy cache
docker exec fulcrum-redis redis-cli keys "policy:*" | head -10
docker exec fulcrum-redis redis-cli dbsize
# Step 4: Check database connectivity
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT COUNT(*) FROM policies;"
# Step 5: Check metrics
curl -s http://localhost:8080/metrics | grep fulcrum_policy
Resolution Steps:
# If cache is corrupted, flush and restart
docker exec fulcrum-redis redis-cli FLUSHDB
docker compose -f docker-compose.unified.yml restart fulcrum-server
# If database connection pool exhausted
docker compose -f docker-compose.unified.yml restart fulcrum-server
# If specific tenant affected, check tenant policies
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT id, name, enabled, updated_at
FROM policies
WHERE tenant_id = '<tenant_id>'
ORDER BY updated_at DESC
LIMIT 10;
"
3.2 Database Connection Issues
Symptoms:
- "connection refused" or "connection pool exhausted" errors
- Slow response times
- Timeout errors in logs
Diagnosis:
# Step 1: Check PostgreSQL status
docker exec fulcrum-postgres pg_isready -U fulcrum -d fulcrum_dev
# Step 2: Check active connections
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT count(*), state
FROM pg_stat_activity
WHERE datname = 'fulcrum_dev'
GROUP BY state;
"
# Step 3: Check max connections
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SHOW max_connections;"
# Step 4: Find long-running queries
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT pid, now() - query_start as duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 5;
"
# Step 5: Check for locks
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT pid AS blocked_pid,
       pg_blocking_pids(pid) AS blocking_pids,
       query AS blocked_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"
Resolution Steps:
# Kill long-running query (use with caution)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT pg_terminate_backend(<PID>);"
# Restart services to reset connection pools
docker compose -f docker-compose.unified.yml restart fulcrum-server event-processor
# If PostgreSQL is overloaded, increase connection limits
# Edit docker-compose.unified.yml to add:
# command: postgres -c 'max_connections=200'
3.3 NATS Stream Lag
Symptoms:
- Events not being processed
- High fulcrum_nats_pending_messages metric
- Event processor logs show idle or slow consumption
Diagnosis:
# Step 1: Check NATS server status
curl -s http://localhost:8222/varz | jq
# Step 2: Check JetStream streams
curl -s "http://localhost:8222/jsz?streams=true" | jq '.account_details[].stream_detail[] | {name, messages: .state.messages, consumers: .state.consumer_count}'
# Step 3: Check consumer lag
curl -s "http://localhost:8222/jsz?consumers=true" | jq '.account_details[].stream_detail[].consumer_detail[]? | {name, num_pending, num_ack_pending}'
# Step 4: Check event processor logs
docker logs fulcrum-processor --tail 100 | grep -E "(lag|pending|slow)"
# Step 5: Check event processor metrics
curl -s http://localhost:8081/metrics | grep nats
Resolution Steps:
# If consumer is stuck, restart event processor
docker compose -f docker-compose.unified.yml restart event-processor
# If stream is corrupted, purge and restart (CAUTION: loses unprocessed events)
docker exec -it fulcrum-nats nats stream purge FULCRUM_EXECUTION_EVENTS --force
# Scale up event processor
# Edit docker-compose.unified.yml to add replicas or deploy additional instances
3.4 Redis Cache Failures
Symptoms:
- Policy lookups slow (P99 >2ms)
- High cache miss rate
- Redis connection errors in logs
Diagnosis:
# Step 1: Check Redis health
docker exec fulcrum-redis redis-cli ping
# Step 2: Check memory usage
docker exec fulcrum-redis redis-cli info memory | grep -E "(used_memory_human|maxmemory)"
# Step 3: Check keyspace stats
docker exec fulcrum-redis redis-cli info keyspace
# Step 4: Check slow log
docker exec fulcrum-redis redis-cli slowlog get 10
# Step 5: Check connection count
docker exec fulcrum-redis redis-cli info clients | grep connected
Resolution Steps:
# Clear stale keys (by pattern)
docker exec fulcrum-redis redis-cli keys "policy:stale:*" | xargs -r docker exec -i fulcrum-redis redis-cli del
# Flush entire cache (service will repopulate)
docker exec fulcrum-redis redis-cli FLUSHDB
# Restart Redis
docker compose -f docker-compose.unified.yml restart redis
# After restart, verify fulcrum-server reconnects
docker logs fulcrum-server --tail 50 | grep -i redis
3.5 Dashboard Errors
Symptoms:
- Dashboard not loading
- API errors displayed
- Authentication failures
Diagnosis:
# Step 1: Check dashboard container status
docker logs fulcrum-dashboard --tail 100
# Step 2: Check if backend is reachable from dashboard
docker exec fulcrum-dashboard wget -q -O- http://fulcrum-server:8080/metrics | head -5
# Step 3: Check environment variables
docker exec fulcrum-dashboard env | grep -E "(API_URL|GRPC_URL|CLERK)"
# Step 4: Check Next.js build status
docker exec fulcrum-dashboard ls -la /app/.next
Resolution Steps:
# Rebuild dashboard
docker compose -f docker-compose.unified.yml up -d --build dashboard
# Clear Next.js cache and rebuild
docker exec fulcrum-dashboard rm -rf /app/.next
docker compose -f docker-compose.unified.yml restart dashboard
# Check for Clerk authentication issues
# Verify CLERK_SECRET_KEY and NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY are set correctly
4. Database Maintenance
4.1 Running Migrations
Script Location: /scripts/run-migrations.sh
Local Development
# Using the migration script
./scripts/run-migrations.sh --url "postgresql://fulcrum:fulcrum@localhost:5432/fulcrum_dev?sslmode=disable"
# Or via Docker Compose (automatic on startup)
docker compose -f docker-compose.unified.yml up migrate
Production (Railway)
# Get database URL from Railway
export DATABASE_URL=$(railway variables get DATABASE_URL)
# Dry run (preview migrations)
./scripts/run-migrations.sh --url "$DATABASE_URL" --dry-run
# Apply migrations
./scripts/run-migrations.sh --url "$DATABASE_URL"
View Applied Migrations
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT version, applied_at, description
FROM schema_migrations
ORDER BY applied_at DESC;
"
Create New Migration
# Create migration files manually (capture the timestamp once so the up/down pair matches)
ts=$(date +%Y%m%d%H%M%S)
touch "infra/migrations/postgres/${ts}_description.up.sql"
touch "infra/migrations/postgres/${ts}_description.down.sql"
4.2 Analyzing Query Performance
Enable Query Statistics
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
"
Find Slow Queries
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT
query,
calls,
round(total_exec_time::numeric, 2) as total_time_ms,
round(mean_exec_time::numeric, 2) as mean_time_ms,
round(max_exec_time::numeric, 2) as max_time_ms
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = 'fulcrum_dev')
ORDER BY mean_exec_time DESC
LIMIT 10;
"
Analyze Query Plan
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM policies WHERE tenant_id = '<tenant_id>' AND enabled = true;
"
Check Missing Indexes
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT
  relname AS table_name,
seq_scan,
idx_scan,
CASE WHEN seq_scan + idx_scan > 0
THEN round(100.0 * idx_scan / (seq_scan + idx_scan), 2)
ELSE 0 END as idx_hit_rate
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_scan DESC;
"
4.3 Vacuum Procedures
Check Table Bloat
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT
  schemaname,
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) as size,
n_dead_tup as dead_tuples,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
"
Manual Vacuum
# Vacuum specific table (non-blocking)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
VACUUM ANALYZE policies;
"
# Full vacuum (requires exclusive lock - use with caution)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
VACUUM FULL ANALYZE policies;
"
# Vacuum entire database
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
VACUUM ANALYZE;
"
Reindex Tables
# Reindex specific table
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
REINDEX TABLE policies;
"
# Reindex concurrently (non-blocking, PostgreSQL 12+)
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
REINDEX TABLE CONCURRENTLY policies;
"
5. NATS Stream Management
5.1 Viewing Stream Status
List All Streams
# Via NATS monitoring endpoint
curl -s "http://localhost:8222/jsz?streams=true" | jq '.account_details[].stream_detail[] | {name, messages: .state.messages, bytes: .state.bytes, consumers: .state.consumer_count}'
# Via nats CLI (if installed in container)
docker exec fulcrum-nats nats stream ls
Stream Details
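Assuming the nats CLI is available in the container (as used above), a full report for the execution-events stream:
# Detailed stream report (state, limits, consumers)
docker exec fulcrum-nats nats stream info FULCRUM_EXECUTION_EVENTS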
View Stream Configuration
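To see only the stream configuration (subjects, retention, limits), the same command can emit JSON and be filtered:
# Stream configuration only, as JSON
docker exec fulcrum-nats nats stream info FULCRUM_EXECUTION_EVENTS --json | jq '.config'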
5.2 Consumer Lag Investigation
Check Consumer Status
curl -s "http://localhost:8222/jsz?consumers=true" | jq '
.streams[] |
{
stream: .name,
consumers: [.consumers[] | {
name: .name,
pending: .num_pending,
ack_pending: .num_ack_pending,
redelivered: .num_redelivered
}]
}
'
Monitor Consumer in Real-time
# Watch consumer lag every 5 seconds
watch -n 5 'curl -s "http://localhost:8222/jsz?consumers=true" | jq ".account_details[].stream_detail[].consumer_detail[]? | .num_pending"'
Identify Stuck Messages
# Check for messages that have been redelivered many times
curl -s "http://localhost:8222/jsz?consumers=true" | jq '
.streams[].consumers[] |
select(.num_redelivered > 10) |
{name, redelivered: .num_redelivered}
'
5.3 Stream Purge Procedures
Purge All Messages (CAUTION: Loses Unprocessed Events)
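This is the same command used in the NATS lag resolution steps (section 3.3); it removes every message in the stream and cannot be undone:
# Remove every message from the execution events stream (irreversible)
docker exec -it fulcrum-nats nats stream purge FULCRUM_EXECUTION_EVENTS --force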
Purge Messages Older Than N Hours
# The nats CLI purges by --subject, --seq (purge up to but not including a sequence),
# or --keep (keep only the last N messages) rather than by message age. To drop
# messages older than roughly 24 hours, find the first sequence you want to keep
# (e.g. from "nats stream info FULCRUM_EXECUTION_EVENTS") and purge up to it:
docker exec fulcrum-nats nats stream purge FULCRUM_EXECUTION_EVENTS --seq=<first-seq-to-keep> --force
Delete and Recreate Stream
# Delete stream
docker exec fulcrum-nats nats stream delete FULCRUM_EXECUTION_EVENTS --force
# Stream will be recreated on next publish by fulcrum-server
docker compose -f docker-compose.unified.yml restart fulcrum-server
6. Log Analysis
6.1 Log Locations
Docker Compose Logs
# All services
docker compose -f docker-compose.unified.yml logs
# Specific service
docker compose -f docker-compose.unified.yml logs fulcrum-server
# Follow logs in real-time
docker compose -f docker-compose.unified.yml logs -f fulcrum-server
# Last N lines
docker compose -f docker-compose.unified.yml logs --tail 100 fulcrum-server
Container Log Files
# Find container log location
docker inspect fulcrum-server --format='{{.LogPath}}'
# View raw log file (requires root)
sudo cat /var/lib/docker/containers/<container-id>/<container-id>-json.log
Service-Specific Logs
| Service | Primary Log | Notes |
|---|---|---|
| fulcrum-server | docker logs fulcrum-server | JSON structured |
| event-processor | docker logs fulcrum-processor | JSON structured |
| dashboard | docker logs fulcrum-dashboard | Next.js format |
| PostgreSQL | docker logs fulcrum-postgres | PostgreSQL format |
| NATS | docker logs fulcrum-nats | NATS format |
| Redis | docker logs fulcrum-redis | Redis format |
6.2 Common Log Patterns
Error Patterns
# Find all errors
docker logs fulcrum-server 2>&1 | grep -iE "(error|failed|fatal)"
# Find specific error types
docker logs fulcrum-server 2>&1 | grep -i "connection refused"
docker logs fulcrum-server 2>&1 | grep -i "timeout"
docker logs fulcrum-server 2>&1 | grep -i "authentication"
Performance Patterns
# Find slow operations (if logged)
docker logs fulcrum-server 2>&1 | grep -E "latency_ms.*[0-9]{3,}"
# Find high memory usage warnings
docker logs fulcrum-server 2>&1 | grep -i "memory"
Request Patterns
# Find specific tenant activity
docker logs fulcrum-server 2>&1 | grep "tenant_id.*<tenant_id>"
# Find gRPC method calls
docker logs fulcrum-server 2>&1 | grep "grpc_method"
6.3 Error Investigation
Structured Error Analysis
# Extract JSON logs and filter
docker logs fulcrum-server 2>&1 | jq -r 'select(.level == "error") | "\(.time) \(.msg)"' 2>/dev/null
# Group errors by message
docker logs fulcrum-server 2>&1 | grep -i error | sort | uniq -c | sort -rn | head -10
Trace Correlation
# Find all logs for a specific trace ID
TRACE_ID="<trace-id>"
docker logs fulcrum-server 2>&1 | grep "$TRACE_ID"
docker logs fulcrum-processor 2>&1 | grep "$TRACE_ID"
Time-Based Analysis
# Logs from last hour (if timestamps in ISO format)
docker logs --since 1h fulcrum-server 2>&1 | grep -i error
# Logs between specific times
docker logs --since "2026-01-06T10:00:00" --until "2026-01-06T11:00:00" fulcrum-server
7. Performance Debugging
7.1 Profiling Go Services
Enable pprof
pprof endpoints are exposed on the HTTP port (8080 for fulcrum-server).
# CPU profile (30 seconds)
go tool pprof http://localhost:8080/debug/pprof/profile?seconds=30
# Memory/heap profile
go tool pprof http://localhost:8080/debug/pprof/heap
# Goroutine profile
go tool pprof http://localhost:8080/debug/pprof/goroutine
# Block profile (blocking operations)
go tool pprof http://localhost:8080/debug/pprof/block
# Mutex profile
go tool pprof http://localhost:8080/debug/pprof/mutex
Analyze Profile
# Interactive mode
go tool pprof http://localhost:8080/debug/pprof/heap
# Common commands in pprof:
# top10 - Show top 10 functions
# web - Open graph in browser
# list <func> - Show source for function
# pdf - Export as PDF
Generate Flame Graph
# pprof's built-in web UI includes a flame graph view
# (pick a free local port for -http; 8081 is already used by the event processor)
go tool pprof -http=:8082 http://localhost:8080/debug/pprof/profile?seconds=30
7.2 Identifying Bottlenecks
Check Metrics for Bottlenecks
# Policy evaluation latency
curl -s http://localhost:8080/metrics | grep "fulcrum_policy_evaluation_duration_seconds"
# Database query latency
curl -s http://localhost:8080/metrics | grep "fulcrum_db_query_duration_seconds"
# gRPC latency
curl -s http://localhost:8080/metrics | grep "grpc_server_handling_seconds"
Calculate P99 Latency
# Using Prometheus query (via curl)
curl -g 'http://localhost:9090/api/v1/query?query=histogram_quantile(0.99,rate(fulcrum_policy_evaluation_duration_seconds_bucket[5m]))'
Database Bottlenecks
# Check for table/index scans
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_scan DESC;
"
# Check cache hit ratio
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT
sum(blks_hit) / (sum(blks_hit) + sum(blks_read)) * 100 as cache_hit_ratio
FROM pg_stat_database
WHERE datname = 'fulcrum_dev';
"
7.3 Load Testing
Location: /tests/load/
Run k6 Load Test
cd tests/load
# Basic load test
k6 run load_test.js
# With specific VUs and duration
k6 run --vus 10 --duration 30s load_test.js
# With output to JSON
k6 run --out json=results.json load_test.js
gRPC Load Testing with ghz
# Install ghz: go install github.com/bojand/ghz/cmd/ghz@latest
# Basic policy check test
ghz --insecure \
--proto ../proto/fulcrum/policy/v1/policy_service.proto \
--call fulcrum.policy.v1.PolicyService/CheckPolicy \
-d '{"tenant_id":"test","policy_id":"test"}' \
localhost:50051
# Sustained load test
ghz --insecure \
--concurrency 50 \
--total 10000 \
--proto ../proto/fulcrum/policy/v1/policy_service.proto \
--call fulcrum.policy.v1.PolicyService/CheckPolicy \
-d '{"tenant_id":"test","policy_id":"test"}' \
localhost:50051
Test Coverage Report
./scripts/test-coverage-report.sh
# With HTML output
./scripts/test-coverage-report.sh --html
# CI mode (strict)
./scripts/test-coverage-report.sh --ci
8. Rollback Procedures
8.1 Application Rollback
Script Location: /scripts/deployment/rollback.sh
Kubernetes Deployment Rollback
# Rollback to previous version
./scripts/deployment/rollback.sh server
# Rollback to specific revision
./scripts/deployment/rollback.sh server 5
# Rollback all components
./scripts/deployment/rollback.sh all
Blue/Green Instant Switch
# Switch to blue deployment
./scripts/deployment/blue-green-switch.sh blue
# Switch to green deployment
./scripts/deployment/blue-green-switch.sh green
# Dry run (see what would happen)
./scripts/deployment/blue-green-switch.sh blue --dry-run
Docker Compose Rollback
# Stop current version
docker compose -f docker-compose.unified.yml stop fulcrum-server
# Pull previous image version
docker pull ghcr.io/your-org/fulcrum-server:v1.0.0
# Update docker-compose.yml with previous version tag
# Restart service
docker compose -f docker-compose.unified.yml up -d fulcrum-server
Railway Rollback
# List deployments
railway deployments --service fulcrum-server
# Rollback to specific deployment
railway rollback --service fulcrum-server --deployment-id <deployment-id>
8.2 Database Migration Rollback
Prerequisites
- Have backup from before migration
- Plan for downtime
- Notify affected users
Rollback Procedure
# Step 1: Stop application services
docker compose -f docker-compose.unified.yml stop fulcrum-server event-processor
# Step 2: Identify migration to rollback
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT version, applied_at FROM schema_migrations ORDER BY applied_at DESC LIMIT 5;
"
# Step 3: Apply down migration (if exists)
# Down migrations must be created manually
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -f /path/to/migration.down.sql
# Step 4: Update schema_migrations table
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
DELETE FROM schema_migrations WHERE version = '<version-to-rollback>';
"
# Step 5: Restart services with previous code version
docker compose -f docker-compose.unified.yml up -d fulcrum-server event-processor
# Step 6: Verify rollback
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
SELECT version, applied_at FROM schema_migrations ORDER BY applied_at DESC LIMIT 5;
"
Point-in-Time Recovery (PITR)
# For major issues, restore from backup
# Step 1: Stop all services
docker compose -f docker-compose.unified.yml down
# Step 2: Remove current database volume
docker volume rm fulcrum-postgres-data
# Step 3: Restore from backup
# For Docker: Create new volume and restore
docker volume create fulcrum-postgres-data
docker run --rm -v fulcrum-postgres-data:/data -v $(pwd):/backup alpine tar xzf /backup/postgres-backup.tar.gz -C /data
# Step 4: Start services
docker compose -f docker-compose.unified.yml up -d
8.3 Configuration Rollback
Environment Variable Rollback
# Docker Compose: Edit .env file with previous values
cp .env.backup .env
docker compose -f docker-compose.unified.yml up -d
# Railway: Use dashboard or CLI
railway variables set KEY=previous_value --service fulcrum-server
Kubernetes ConfigMap Rollback
# ConfigMaps have no rollout history; inspect the live object and compare it
# against the versioned manifest kept in the repo
kubectl get configmap fulcrum-config -n fulcrum -o yaml
# Apply previous ConfigMap
kubectl apply -f infra/k8s/configmaps/fulcrum-config-v1.yaml
# Restart pods to pick up changes
kubectl rollout restart deployment fulcrum-server -n fulcrum
Feature Flag Rollback
# If using feature flags, disable problematic feature
# Via database:
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "
UPDATE feature_flags SET enabled = false WHERE name = 'new-feature';
"
# Via API (if available):
curl -X PATCH http://localhost:8080/api/v1/features/new-feature \
-H "Content-Type: application/json" \
-d '{"enabled": false}'
Quick Reference Commands
Service Status
# All services status
docker compose -f docker-compose.unified.yml ps
# Specific service logs
docker compose -f docker-compose.unified.yml logs -f fulcrum-server --tail 100
# Check health endpoints
curl -s http://localhost:8080/metrics | head
grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check
Emergency Actions
# Restart all services
docker compose -f docker-compose.unified.yml restart
# Stop everything immediately
docker compose -f docker-compose.unified.yml down
# Full reset (WARNING: loses data)
docker compose -f docker-compose.unified.yml down -v && docker compose -f docker-compose.unified.yml up -d
Database Quick Access
# Interactive psql
docker exec -it fulcrum-postgres psql -U fulcrum -d fulcrum_dev
# Quick query
docker exec fulcrum-postgres psql -U fulcrum -d fulcrum_dev -c "SELECT COUNT(*) FROM policies;"
Log Investigation
# Recent errors
docker logs fulcrum-server --since 5m 2>&1 | grep -i error | tail -20
# All services errors
for svc in fulcrum-server fulcrum-processor fulcrum-dashboard; do
echo "=== $svc ==="
docker logs $svc --since 5m 2>&1 | grep -i error | tail -5
done
Contact Information
Escalation Path
| Level | Contact | Response Time |
|---|---|---|
| L1 | On-call Engineer | 5 minutes |
| L2 | Platform Team Lead | 15 minutes |
| L3 | Engineering Manager | 30 minutes |
| L4 | CTO | 1 hour |
Resources
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Alertmanager: http://localhost:9093
- NATS Monitor: http://localhost:8222
- Documentation: /documentation/
- Incident Slack: #fulcrum-incidents
Document Version: 2.0.0
Last Updated: January 6, 2026
Next Review: April 2026
Maintainer: Fulcrum Platform Team