
Monitoring

Overview

Fulcrum’s observability stack is built on Prometheus for metrics collection, Grafana for visualization, OpenTelemetry for distributed tracing, and a centralized log aggregation pipeline (Loki/Elastic). This guide explains how to install, configure, and maintain the monitoring components in both development and production environments.


1. Metrics Collection (Prometheus)

1.1. Prometheus Deployment

  • Docker Compose: infra/docker/prometheus/prometheus.yml (development) and infra/prometheus/prometheus.production.yml (production).
  • Helm Chart: infra/helm/fulcrum/templates/prometheus.yaml (Kubernetes).

1.2. Scrape Targets

Service                      Job Name           Port    Path
Prometheus itself            prometheus         9090    /metrics
Fulcrum Server               fulcrum-server     9090    /metrics
NATS JetStream               nats               8222    /metrics
Redis (optional)             redis              9121    /metrics
Node Exporter (optional)     node               9100    /metrics

Tip: Add additional services by extending scrape_configs in prometheus.yml.
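
For example, a minimal scrape_configs entry for an additional service might look like the sketch below; the job name, target, and port are placeholders rather than existing Fulcrum services:

scrape_configs:
  - job_name: my-new-service            # placeholder job name
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['my-new-service:9464']   # replace with the real host:port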

1.3. Alert Rules

  • Alert definitions live in infra/docker/prometheus/alerts.yml and are referenced via rule_files.
  • Example rule for high CPU usage (in alerts.yml, rules are nested under a named group; the group name below is illustrative):
    groups:
      - name: fulcrum-resource-alerts
        rules:
          - alert: HighCPUUsage
            expr: avg(rate(process_cpu_seconds_total[5m])) > 0.8
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "CPU usage is high on {{ $labels.instance }}"
              description: "CPU usage has been above 80% for more than 2 minutes."

2. Visualization (Grafana)

2.1. Grafana Deployment

  • Docker Compose: infra/docker/grafana/ (development).
  • Helm Chart: infra/helm/fulcrum/templates/grafana.yaml (Kubernetes).

2.2. Data Source Configuration

Add a Prometheus data source pointing to http://prometheus:9090 (dev) or the production Prometheus endpoint.
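
If you provision the data source from a file instead of the UI, a standard Grafana data-source provisioning file looks like the sketch below (the exact location under infra/docker/grafana/ is an assumption):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090        # swap for the production endpoint as needed
    isDefault: true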

2.3. Dashboards

Dashboard              Description
Fulcrum Overview       System health, request rates, latency
Budget Monitoring      Budget utilization, cost tracking
Performance            Execution metrics, latency analysis
LLM & Tools            LLM call metrics, tool performance
Cognitive Layer        Semantic Judge, Oracle, Immune System

Dashboards are stored as JSON under infra/grafana/provisioning/dashboards/. Import them via Grafana UI or provision automatically with the Helm chart.
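
Outside the Helm chart, automatic provisioning uses Grafana's standard dashboard provider file; the container mount path below is an assumption, not taken from the repository:

apiVersion: 1
providers:
  - name: fulcrum-dashboards
    folder: Fulcrum
    type: file
    options:
      # Assumes infra/grafana/provisioning/dashboards/ is mounted at this path
      path: /var/lib/grafana/dashboards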


3. Alerting & Notification

3.1. Alertmanager

Runs alongside Prometheus (alertmanager:9093). Configuration lives in infra/docker/prometheus/alertmanager.yml.

route:
  receiver: default
  group_by: ['alertname', 'cluster', 'service', 'tenant_id']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: critical-alerts
    - match:
        team: security
      receiver: security-critical
    - match:
        severity: warning
      receiver: warning-alerts
receivers:
  - name: critical-alerts
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
    slack_configs:
      - channel: '#fulcrum-critical'
  - name: warning-alerts
    slack_configs:
      - channel: '#fulcrum-alerts'
  - name: security-critical
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
    slack_configs:
      - channel: '#fulcrum-security'
    email_configs:
      - to: 'security@fulcrum.io'
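
Before reloading Alertmanager, the routing tree and receivers can be validated locally with amtool (adjust the path to your checkout):

amtool check-config infra/docker/prometheus/alertmanager.yml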

3.2. Critical Alerts (P0)

  • Service Down – up{job="fulcrum-server"} == 0
  • Critical Error Rate – error rate > 5% over 2 min
  • Budget Exhausted – fulcrum_adapter_budget_utilization_percent >= 100
  • Prompt Injection – rate(fulcrum_semantic_judge_classifications_total{intent="malicious"}[5m]) > 0

3.3. Warning Alerts (P1)

  • Error rate above the 0.1% SLO
  • Policy evaluation latency above 10 ms at P99
  • Budget utilization above 80% (see the example rule after this list)
  • NATS consumer lag above 1,000 messages
  • Semantic Judge fallback active
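
As a sketch, the budget-utilization warning could be expressed as a Prometheus rule like the one below; the alert name, the for duration, and the tenant_id label are illustrative assumptions, while the metric and threshold come from the list above:

- alert: BudgetUtilizationHigh          # illustrative alert name
  expr: fulcrum_adapter_budget_utilization_percent > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Budget utilization above 80% for {{ $labels.tenant_id }}"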

4. Log Aggregation (Loki / Elastic)

4.1. Structured Logging

All services emit JSON logs with fields: level, ts, caller, msg, trace_id, span_id, tenant_id, execution_id, error.

{"level":"info","ts":"2026-01-06T10:15:30.123Z","caller":"policyengine/evaluator.go:145","msg":"Policy evaluated","trace_id":"abc123def456","span_id":"789xyz","tenant_id":"tenant-001","execution_id":"exec-456","policy_id":"policy-001","decision":"allowed","latency_ms":2.5}

4.2. Loki Query Examples (Grafana Explore)

# All errors for a tenant
{job="fulcrum-server"} |= "tenant-001" | json | level="error"

# Policy denials
{job="fulcrum-server"} | json | msg="Policy evaluated" | decision="denied"

# High latency operations
{job="fulcrum-server"} | json | latency_ms > 100

5. Distributed Tracing (OpenTelemetry + Jaeger)

5.1. Collector Configuration (pkg/observability/otel.go)

type Config struct {
    ServiceName    string
    ServiceVersion string
    Environment    string
    OTLPEndpoint   string // e.g., "localhost:4317"
    SamplingRate   float64
    Enabled        bool
}

Set OTEL_SAMPLING_RATE=0.1 in production (10% sampling).
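
A minimal sketch of wiring this Config into the OpenTelemetry Go SDK; the InitTracer name and option choices are illustrative, not necessarily what pkg/observability/otel.go actually does:

package observability

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// InitTracer builds an OTLP/gRPC exporter and a sampled tracer provider from Config.
func InitTracer(ctx context.Context, cfg Config) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(cfg.OTLPEndpoint), // e.g. "localhost:4317"
        otlptracegrpc.WithInsecure(),                 // assumes plaintext inside the cluster
    )
    if err != nil {
        return nil, err
    }

    res := resource.NewSchemaless(
        attribute.String("service.name", cfg.ServiceName),
        attribute.String("service.version", cfg.ServiceVersion),
        attribute.String("deployment.environment", cfg.Environment),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        // Honors the parent's decision, otherwise samples SamplingRate of new traces.
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(cfg.SamplingRate))),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}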

5.2. Viewing Traces

  • Jaeger UI: http://localhost:16686
  • Filter by fulcrum.tenant.id or execution_id.

6. Custom Metrics

Add new metrics in pkg/observability/metrics.go using the Prometheus client.

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Histogram with two label dimensions; promauto registers it with the default registry.
var myNewMetric = promauto.NewHistogramVec(prometheus.HistogramOpts{
    Namespace: "fulcrum",
    Subsystem: "custom",
    Name:      "my_new_metric",
    Help:      "Description of my new metric",
}, []string{"label1", "label2"})
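
Observations are then recorded against specific label values; prometheus.NewTimer is a convenient way to time an operation with the histogram (the label values are placeholders):

// Time an operation and observe its duration in seconds.
timer := prometheus.NewTimer(myNewMetric.WithLabelValues("value1", "value2"))
defer timer.ObserveDuration()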


7. Operational Checklist

  1. Verify Prometheus config (prometheus.yml / prometheus.production.yml).
  2. Ensure Grafana dashboards are provisioned.
  3. Confirm Alertmanager receivers (Slack, PagerDuty, email).
  4. Test a synthetic alert (for example, fire one with amtool alert add).
  5. Run make monitor-test to spin up the full stack locally.
  6. Validate log ingestion in Loki.
  7. Verify trace export to Jaeger.
  8. Review custom metrics coverage.

Document version: 1.0 (January 6, 2026)