
Monitoring

Overview

Fulcrum’s observability stack is built on Prometheus for metrics collection, Grafana for visualization, OpenTelemetry for distributed tracing, and a centralized log aggregation pipeline (Loki/Elastic). This guide explains how to install, configure, and maintain the monitoring components in both development and production environments.


1. Metrics Collection (Prometheus)

1.1. Prometheus Deployment

  • Docker Compose: infra/docker/prometheus/prometheus.yml (development) and infra/prometheus/prometheus.production.yml (production).
  • Helm Chart: infra/helm/fulcrum/templates/prometheus.yaml (Kubernetes).

1.2. Scrape Targets

Service                      Job Name           Port    Path
Prometheus itself            prometheus         9090    /metrics
Fulcrum Server               fulcrum-server     9090    /metrics
NATS JetStream               nats               8222    /metrics
Redis (optional)             redis              9121    /metrics
Node Exporter (optional)     node               9100    /metrics

Tip: Add additional services by extending scrape_configs in prometheus.yml.
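
For example, a minimal scrape_configs entry for an additional service might look like the sketch below; the job name, target, and port are placeholders rather than existing Fulcrum services:

scrape_configs:
  - job_name: my-new-service            # placeholder job name
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['my-new-service:9464']   # replace with the real host:port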

1.3. Alert Rules

  • Alert definitions live in infra/docker/prometheus/alerts.yml and are referenced via rule_files.
  • Example rule for high CPU usage (in alerts.yml, rules are nested under a named group; the group name below is illustrative):
    groups:
      - name: fulcrum-resource-alerts
        rules:
          - alert: HighCPUUsage
            expr: avg(rate(process_cpu_seconds_total[5m])) > 0.8
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "CPU usage is high on {{ $labels.instance }}"
              description: "CPU usage has been above 80% for more than 2 minutes."

2. Visualization (Grafana)

2.1. Grafana Deployment

  • Docker Compose: infra/docker/grafana/ (development).
  • Helm Chart: infra/helm/fulcrum/templates/grafana.yaml (Kubernetes).

2.2. Data Source Configuration

Add a Prometheus data source pointing to http://prometheus:9090 (dev) or the production Prometheus endpoint.
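
If you provision the data source from a file instead of the UI, a standard Grafana data-source provisioning file looks like the sketch below (the exact location under infra/docker/grafana/ is an assumption):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090        # swap for the production endpoint as needed
    isDefault: true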

2.3. Dashboards

Dashboard              Description
Fulcrum Overview       System health, request rates, latency
Budget Monitoring      Budget utilization, cost tracking
Performance            Execution metrics, latency analysis
LLM & Tools            LLM call metrics, tool performance
Cognitive Layer        Semantic Judge, Oracle, Immune System

Dashboards are stored as JSON under infra/grafana/provisioning/dashboards/. Import them via Grafana UI or provision automatically with the Helm chart.
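
Outside the Helm chart, automatic provisioning uses Grafana's standard dashboard provider file; the container mount path below is an assumption, not taken from the repository:

apiVersion: 1
providers:
  - name: fulcrum-dashboards
    folder: Fulcrum
    type: file
    options:
      # Assumes infra/grafana/provisioning/dashboards/ is mounted at this path
      path: /var/lib/grafana/dashboards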


3. Alerting & Notification

3.1. Alertmanager

Runs alongside Prometheus (alertmanager:9093). Configuration lives in infra/docker/prometheus/alertmanager.yml.

route:
  receiver: default
  group_by: ['alertname', 'cluster', 'service', 'tenant_id']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: critical-alerts
    - match:
        team: security
      receiver: security-critical
    - match:
        severity: warning
      receiver: warning-alerts
receivers:
  - name: critical-alerts
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
    slack_configs:
      - channel: '#fulcrum-critical'
  - name: warning-alerts
    slack_configs:
      - channel: '#fulcrum-alerts'
  - name: security-critical
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
    slack_configs:
      - channel: '#fulcrum-security'
    email_configs:
      - to: 'security@fulcrum.io'
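
Before reloading Alertmanager, the routing tree and receivers can be validated locally with amtool (adjust the path to your checkout):

amtool check-config infra/docker/prometheus/alertmanager.yml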

3.2. Critical Alerts (P0)

  • Service Down – up{job="fulcrum-server"} == 0
  • Critical Error Rate – error rate > 5% over 2 min
  • Budget Exhausted – fulcrum_adapter_budget_utilization_percent >= 100
  • Prompt Injection – rate(fulcrum_semantic_judge_classifications_total{intent="malicious"}[5m]) > 0

3.3. Warning Alerts (P1)

  • Error rate above the 0.1% SLO
  • Policy evaluation latency above 10 ms at P99
  • Budget utilization above 80% (see the example rule after this list)
  • NATS consumer lag above 1,000 messages
  • Semantic Judge fallback active
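
As a sketch, the budget-utilization warning could be expressed as a Prometheus rule like the one below; the alert name, the for duration, and the tenant_id label are illustrative assumptions, while the metric and threshold come from the list above:

- alert: BudgetUtilizationHigh          # illustrative alert name
  expr: fulcrum_adapter_budget_utilization_percent > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Budget utilization above 80% for {{ $labels.tenant_id }}"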

4. Log Aggregation (Loki / Elastic)

4.1. Structured Logging

All services emit JSON logs with fields: level, ts, caller, msg, trace_id, span_id, tenant_id, execution_id, error.

{"level":"info","ts":"2026-01-06T10:15:30.123Z","caller":"policyengine/evaluator.go:145","msg":"Policy evaluated","trace_id":"abc123def456","span_id":"789xyz","tenant_id":"tenant-001","execution_id":"exec-456","policy_id":"policy-001","decision":"allowed","latency_ms":2.5}

4.2. Loki Query Examples (Grafana Explore)

# All errors for a tenant
{job="fulcrum-server"} |= "tenant-001" | json | level="error"

# Policy denials
{job="fulcrum-server"} | json | msg="Policy evaluated" | decision="denied"

# High latency operations
{job="fulcrum-server"} | json | latency_ms > 100

5. Distributed Tracing (OpenTelemetry + Jaeger)

5.1. Collector Configuration (pkg/observability/otel.go)

type Config struct {
    ServiceName    string
    ServiceVersion string
    Environment    string
    OTLPEndpoint   string // e.g., "localhost:4317"
    SamplingRate   float64
    Enabled        bool
}

Set OTEL_SAMPLING_RATE=0.1 in production (10% sampling).
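
A minimal sketch of wiring this Config into the OpenTelemetry Go SDK; the InitTracer name and option choices are illustrative, not necessarily what pkg/observability/otel.go actually does:

package observability

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// InitTracer builds an OTLP/gRPC exporter and a sampled tracer provider from Config.
func InitTracer(ctx context.Context, cfg Config) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint), // e.g. "localhost:4317"
        otlptracegrpc.WithInsecure(),                 // assumes plaintext inside the cluster
    )
    if err != nil {
        return nil, err
    }

    res := resource.NewSchemaless(
        attribute.String("service.name", cfg.ServiceName),
        attribute.String("service.version", cfg.ServiceVersion),
        attribute.String("deployment.environment", cfg.Environment),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        // Honors the parent's decision, otherwise samples SamplingRate of new traces.
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(cfg.SamplingRate))),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}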

5.2. Viewing Traces

  • Jaeger UI: http://localhost:16686
  • Filter by fulcrum.tenant.id or execution_id.

6. Custom Metrics

Add new metrics in pkg/observability/metrics.go using the Prometheus client.

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Histogram with two label dimensions; promauto registers it with the default registry.
var myNewMetric = promauto.NewHistogramVec(prometheus.HistogramOpts{
    Namespace: "fulcrum",
    Subsystem: "custom",
    Name:      "my_new_metric",
    Help:      "Description of my new metric",
}, []string{"label1", "label2"})
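
Observations are then recorded against specific label values; prometheus.NewTimer is a convenient way to time an operation with the histogram (the label values are placeholders):

// Time an operation and observe its duration in seconds.
timer := prometheus.NewTimer(myNewMetric.WithLabelValues("value1", "value2"))
defer timer.ObserveDuration()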


7. Operational Checklist

  1. Verify Prometheus config (prometheus.yml / prometheus.production.yml).
  2. Ensure Grafana dashboards are provisioned.
  3. Confirm Alertmanager receivers (Slack, PagerDuty, email).
  4. Test a synthetic alert (for example, fire one with amtool alert add).
  5. Run make monitor-test to spin up the full stack locally.
  6. Validate log ingestion in Loki.
  7. Verify trace export to Jaeger.
  8. Review custom metrics coverage.

Document version: 1.0 (January 6, 2026)