# Monitoring
## Overview
Fulcrum’s observability stack is built on Prometheus for metrics collection, Grafana for visualization, OpenTelemetry for distributed tracing, and a centralized log aggregation pipeline (Loki/Elastic). This guide explains how to install, configure, and maintain the monitoring components in both development and production environments.
## 1. Metrics Collection (Prometheus)
### 1.1. Prometheus Deployment
- Docker Compose: `infra/docker/prometheus/prometheus.yml` (development) and `infra/prometheus/prometheus.production.yml` (production).
- Helm Chart: `infra/helm/fulcrum/templates/prometheus.yaml` (Kubernetes).
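The Compose service definition itself lives in the repository; as a rough sketch of how Prometheus is typically wired to these files (image tag, mount paths, and port mapping here are assumptions, not the actual Compose file):

```yaml
services:
  prometheus:
    image: prom/prometheus:v2.52.0          # pin whichever version the stack uses
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    volumes:
      - ./infra/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./infra/docker/prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
    ports:
      - "9090:9090"
```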
### 1.2. Scrape Targets
| Service | Job Name | Port | Path |
|---|---|---|---|
| Prometheus itself | prometheus | 9090 | /metrics |
| Fulcrum Server | fulcrum-server | 9090 | /metrics |
| NATS JetStream | nats | 8222 | /metrics |
| Redis (optional) | redis | 9121 | /metrics |
| Node Exporter (optional) | node | 9100 | /metrics |
Tip: Add additional services by extending `scrape_configs` in `prometheus.yml`.
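For instance, a hypothetical `fulcrum-worker` service exposing metrics on port 9091 could be added like this (the job name, host, and port are illustrative):

```yaml
scrape_configs:
  - job_name: fulcrum-worker              # illustrative job name
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['fulcrum-worker:9091']  # illustrative host:port
```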
### 1.3. Alert Rules
- Alert definitions live in `infra/docker/prometheus/alerts.yml` and are referenced via `rule_files`.
- Example rule for high CPU usage:
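A sketch of such a rule, assuming node-exporter CPU metrics are scraped (the exact expression, thresholds, and labels in `alerts.yml` may differ):

```yaml
groups:
  - name: fulcrum-resource-alerts          # illustrative group name
    rules:
      - alert: HighCPUUsage
        # Percentage of non-idle CPU time per instance over the last 5 minutes.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has stayed above 80% for 5 minutes."
```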
## 2. Visualization (Grafana)
### 2.1. Grafana Deployment
- Docker Compose: `infra/docker/grafana/` (development).
- Helm Chart: `infra/helm/fulcrum/templates/grafana.yaml` (Kubernetes).
### 2.2. Data Source Configuration
Add a Prometheus data source pointing to http://prometheus:9090 (dev) or the production Prometheus endpoint.
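If the data source is provisioned as code rather than through the UI, a Grafana provisioning file along these lines works (the file name and its location under `infra/docker/grafana/` are assumptions):

```yaml
# e.g. provisioning/datasources/prometheus.yml (path is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090    # use the production endpoint outside dev
    isDefault: true
```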
### 2.3. Dashboards
| Dashboard | Description |
|---|---|
| Fulcrum Overview | System health, request rates, latency |
| Budget Monitoring | Budget utilization, cost tracking |
| Performance | Execution metrics, latency analysis |
| LLM & Tools | LLM call metrics, tool performance |
| Cognitive Layer | Semantic Judge, Oracle, Immune System |
Dashboards are stored as JSON under `infra/grafana/provisioning/dashboards/`. Import them via the Grafana UI or provision them automatically with the Helm chart.
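For file-based provisioning outside the Helm chart, a dashboard provider entry of this shape points Grafana at that directory (the container mount path shown is an assumption):

```yaml
# e.g. provisioning/dashboards/fulcrum.yml (path is an assumption)
apiVersion: 1
providers:
  - name: fulcrum-dashboards
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards   # wherever the dashboard JSON files are mounted
```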
## 3. Alerting & Notification
### 3.1. Alertmanager
Alertmanager runs alongside Prometheus (`alertmanager:9093`). Configuration lives in `infra/docker/prometheus/alertmanager.yml`:
```yaml
route:
  receiver: default
  group_by: ['alertname', 'cluster', 'service', 'tenant_id']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: critical-alerts
    - match:
        team: security
      receiver: security-critical
    - match:
        severity: warning
      receiver: warning-alerts

receivers:
  - name: critical-alerts
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
    slack_configs:
      - channel: '#fulcrum-critical'
  - name: warning-alerts
    slack_configs:
      - channel: '#fulcrum-alerts'
  - name: security-critical
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
    slack_configs:
      - channel: '#fulcrum-security'
    email_configs:
      - to: 'security@fulcrum.io'
```
### 3.2. Critical Alerts (P0)
- Service Down – `up{job="fulcrum-server"} == 0`
- Critical Error Rate – error rate > 5% over 2 min
- Budget Exhausted – `fulcrum_adapter_budget_utilization_percent >= 100`
- Prompt Injection – `rate(fulcrum_semantic_judge_classifications_total{intent="malicious"}[5m]) > 0`
### 3.3. Warning Alerts (P1)
- High error rate (>0.1% SLO)
- Policy evaluation latency > 10 ms P99
- Budget utilization > 80%
- NATS consumer lag > 1,000 messages
- Semantic Judge fallback active
## 4. Log Aggregation (Loki / Elastic)
### 4.1. Structured Logging
All services emit JSON logs with the fields `level`, `ts`, `caller`, `msg`, `trace_id`, `span_id`, `tenant_id`, `execution_id`, and `error`. For example:

```json
{"level":"info","ts":"2026-01-06T10:15:30.123Z","caller":"policyengine/evaluator.go:145","msg":"Policy evaluated","trace_id":"abc123def456","span_id":"789xyz","tenant_id":"tenant-001","execution_id":"exec-456","policy_id":"policy-001","decision":"allowed","latency_ms":2.5}
```
### 4.2. Loki Query Examples (Grafana Explore)

```logql
# All errors for a tenant
{job="fulcrum-server"} |= "tenant-001" | json | level="error"

# Policy denials
{job="fulcrum-server"} | json | msg="Policy evaluated" | decision="denied"

# High latency operations
{job="fulcrum-server"} | json | latency_ms > 100
```
## 5. Distributed Tracing (OpenTelemetry + Jaeger)
### 5.1. Collector Configuration (`pkg/observability/otel.go`)

```go
type Config struct {
	ServiceName    string
	ServiceVersion string
	Environment    string
	OTLPEndpoint   string // e.g., "localhost:4317"
	SamplingRate   float64
	Enabled        bool
}
```
Set `OTEL_SAMPLING_RATE=0.1` in production (10% sampling).
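As a rough sketch of how this `Config` maps onto the OpenTelemetry Go SDK (the actual `otel.go` may differ; the `InitTracing` helper and its wiring are illustrative):

```go
package observability

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracing is an illustrative helper, not the repository's actual API.
func InitTracing(ctx context.Context, cfg Config) (*sdktrace.TracerProvider, error) {
	if !cfg.Enabled {
		return nil, nil
	}

	// Export spans over OTLP/gRPC to the configured collector endpoint.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName(cfg.ServiceName),
			semconv.ServiceVersion(cfg.ServiceVersion),
			semconv.DeploymentEnvironment(cfg.Environment),
		),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		// Honor the parent's sampling decision; otherwise sample cfg.SamplingRate of root traces.
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(cfg.SamplingRate))),
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```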
### 5.2. Viewing Traces
- Jaeger UI: http://localhost:16686
- Filter by `fulcrum.tenant.id` or `execution_id`.
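For those attributes to be searchable in Jaeger, spans need to carry them. A minimal sketch using the OTel attribute API (the `annotateSpan` helper is an assumption; the codebase may attach these attributes elsewhere):

```go
import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// annotateSpan tags a span with the identifiers used for trace search.
func annotateSpan(span trace.Span, tenantID, executionID string) {
	span.SetAttributes(
		attribute.String("fulcrum.tenant.id", tenantID),
		attribute.String("execution_id", executionID),
	)
}
```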
## 6. Custom Metrics
Add new metrics in `pkg/observability/metrics.go` using the Prometheus client:
```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var myNewMetric = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "fulcrum",
	Subsystem: "custom",
	Name:      "my_new_metric",
	Help:      "Description of my new metric",
}, []string{"label1", "label2"})
```
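Recording an observation against the new metric then looks like this (the label values and the `latencySeconds` variable are placeholders):

```go
// Observe a value (e.g. a latency in seconds) for one label combination.
myNewMetric.WithLabelValues("label1-value", "label2-value").Observe(latencySeconds)
```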
## 7. Operational Checklist
- Verify the Prometheus config (`prometheus.yml` / `prometheus.production.yml`).
- Ensure Grafana dashboards are provisioned.
- Confirm Alertmanager receivers (Slack, PagerDuty, email).
- Test a synthetic alert (e.g. with `amtool alert add`).
- Run `make monitor-test` to spin up the full stack locally.
- Validate log ingestion in Loki.
- Verify trace export to Jaeger.
- Review custom metrics coverage.
Document version: 1.0 (January 6, 2026)