
Architecture Decision Records (ADR) Index

Fulcrum Cognitive AI Governance Platform
Version: 1.0 | Last Updated: January 6, 2026


Table of Contents

  1. What is an ADR?
  2. ADR Template
  3. ADR Index
  4. Decision Records
  5. ADR-001: Go as Backend Language
  6. ADR-002: PostgreSQL as Primary Database
  7. ADR-003: NATS JetStream for Event Streaming
  8. ADR-004: Adapter Abstraction Pattern
  9. ADR-005: OpenTelemetry for Observability
  10. ADR-006: Infrastructure as Code (Terraform + Helm)
  11. ADR-007: Python-Go Bridge Architecture

What is an ADR?

An Architecture Decision Record (ADR) is a document that captures an important architectural decision made along with its context and consequences. ADRs serve as a historical record of why certain technical choices were made, enabling future team members to understand the rationale behind the current system design.

Purpose

  • Preserve Context: Capture the reasoning behind decisions before institutional knowledge is lost
  • Enable Review: Allow stakeholders to review and discuss significant changes
  • Support Onboarding: Help new team members understand why things are the way they are
  • Prevent Repetition: Avoid relitigating the same decisions repeatedly
  • Guide Evolution: Provide a foundation for future architectural changes

When to Write an ADR

Create an ADR when making decisions that:

  1. Affect the overall system architecture
  2. Are difficult or costly to reverse
  3. Involve significant trade-offs
  4. Span multiple components or services
  5. Impact security, performance, or compliance
  6. Establish patterns that others will follow

ADR Template

Use this template when creating new Architecture Decision Records:

# ADR-000: [Title]

**Status:** Proposed | Accepted | Deprecated | Superseded
**Date:** YYYY-MM-DD
**Decision Makers:** [Names/Roles]
**Supersedes:** [ADR-000 if applicable]
**Superseded by:** [ADR-000 if applicable]

---

## Context

[Describe the issue, the forces at play, and why a decision is needed.
Include relevant constraints, requirements, and stakeholder concerns.]

### Requirements

| Requirement | Priority | Notes |
|-------------|----------|-------|
| [Requirement 1] | Critical/High/Medium/Low | [Details] |

### Constraints

- [Constraint 1]
- [Constraint 2]

---

## Decision

[State the decision clearly and concisely. Include the chosen approach
and key implementation details.]

---

## Consequences

### Positive

- [Benefit 1]
- [Benefit 2]

### Negative

- [Drawback 1]
- [Drawback 2]

### Mitigations

| Challenge | Mitigation |
|-----------|------------|
| [Challenge 1] | [How we'll address it] |

---

## Alternatives Considered

### Option A: [Alternative Name]

**Pros:** [List advantages]
**Cons:** [List disadvantages]
**Rejection Rationale:** [Why this was not chosen]

---

## Validation Criteria

This decision should be revisited if:

1. [Condition 1]
2. [Condition 2]

**Review Date:** [When to re-evaluate]

---

## References

- [Link to relevant documentation]
- [Link to research/benchmarks]

---

*Approved by:* [Name]
*Effective from:* [Date]

ADR Index

| ADR | Title | Status | Date | Summary |
|-----|-------|--------|------|---------|
| 001 | Go as Backend Language | Accepted | Dec 10, 2025 | Go selected for backend services due to performance, concurrency, and single-binary deployment |
| 002 | PostgreSQL as Primary Database | Accepted | Dec 10, 2025 | PostgreSQL 16 with TimescaleDB for relational data and time-series metrics |
| 003 | NATS JetStream for Event Streaming | Accepted | Dec 10, 2025 | NATS chosen over Kafka for simpler operations with built-in persistence |
| 004 | Adapter Abstraction Pattern | Accepted | Dec 10, 2025 | Framework-agnostic governance via Execution Envelope and adapter interfaces |
| 005 | OpenTelemetry for Observability | Accepted | Dec 10, 2025 | OTEL standard adopted for vendor-neutral distributed tracing |
| 006 | Infrastructure as Code | Accepted | Dec 10, 2025 | Terraform + Helm + GitHub Actions for reproducible deployments |
| 007 | Python-Go Bridge Architecture | Proposed | Dec 15, 2025 | Subprocess with Protocol Buffers for LangGraph integration |

Decision Records


ADR-001: Go as Backend Language

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum requires a backend language for implementing core control plane services:

  • Policy enforcement engine
  • Cost governance service
  • Agent task scheduler
  • Multi-framework adapter layer

Key Requirements:

| Requirement | Priority | Rationale |
|-------------|----------|-----------|
| <10ms policy latency | Critical | Every agent action routes through policy enforcement |
| High concurrency | Critical | Agent orchestration demands thousands of concurrent operations |
| Single binary deployment | High | Minimize runtime dependencies for Kubernetes |
| Strong typing | High | API contract enforcement for SDK generation |

Constraints:

  • 12-18 month competitive window before Microsoft Agent Framework GA
  • Initially solo developer, scaling to 2-3 during build
  • Target deployment: Kubernetes (GKE/EKS)

Decision

Primary Backend: Go (Golang 1.24+)

Fulcrum's core control plane services are implemented in Go.

Secondary: TypeScript

SDKs, CLI tooling, and developer-facing components use TypeScript.

Consequences

Positive:

  1. Single binary deployments - No runtime dependencies, minimal container images (~20MB)
  2. Native concurrency - Goroutines handle thousands of concurrent agent connections trivially
  3. Performance predictability - Fast, tunable garbage collector with <1ms pauses
  4. Operations excellence - Prometheus/OpenTelemetry native support
  5. Cloud ecosystem alignment - Kubernetes, Docker, NATS all Go-native
  6. Hiring pool - Go developers understand concurrent systems design

Negative:

  1. Initial velocity - TypeScript would be faster for first 2-3 weeks
  2. Frontend integration - Requires TypeScript SDKs for web integrations
  3. Ecosystem gaps - Fewer AI/ML libraries compared to Python

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Initial velocity | Use code generation from OpenAPI specs |
| Frontend integration | TypeScript SDK as separate package |
| ML libraries | Control plane doesn't run ML; adapters call external services |

Alternatives Considered

Rust - Rejected due to 2-3x longer development time and slower iteration during prototyping. Go provides sufficient performance for control plane operations.

TypeScript (Node.js) - Rejected because the event loop limits true parallelism. Policy enforcement cannot afford GC pauses or event loop stalls.

Python - Rejected due to GIL limitations and inability to guarantee consistent sub-10ms latency.

Validation Criteria

Revisit this decision if:

  1. Policy engine latency exceeds 10ms p99 under load
  2. Development velocity drops below 1 major feature/week
  3. Hiring proves significantly harder than expected

ADR-002: PostgreSQL as Primary Database

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum requires persistent storage for:

  • Agent state and checkpoints (high-frequency writes)
  • Policy definitions and audit logs (compliance-critical)
  • Cost tracking events (time-series data)
  • Multi-tenant organization data (relational)

Requirements Matrix:

| Requirement | Priority | Notes |
|-------------|----------|-------|
| Multi-tenancy isolation | Critical | Organizations must not see each other's data |
| Checkpoint durability | Critical | Agent state recovery depends on this |
| Audit compliance | Critical | SOC2, GDPR requirements |
| Query flexibility | High | Complex policy queries, cost analytics |
| Time-series performance | High | 10K+ events/second ingestion |

Scale Targets:

  • Phase 1-2: 10 tenants, 100 concurrent agents
  • Phase 3-4: 100 tenants, 1K concurrent agents
  • Phase 5-6: 1K tenants, 10K concurrent agents

Decision

Primary Database: PostgreSQL 16

All relational data, checkpoints, and audit logs stored in PostgreSQL.

Caching Layer: Redis 7

Session state, rate limiting counters, and hot policy data cached in Redis.

Time-series: TimescaleDB Extension

Cost events and metrics stored in TimescaleDB hypertables with continuous aggregates.
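As a sketch of what this looks like in practice (the table name, columns, and bucket interval below are illustrative assumptions, not taken from this document):

```sql
-- Hypothetical cost-event hypertable (names are illustrative)
CREATE TABLE cost_events (
    time      TIMESTAMPTZ NOT NULL,
    tenant_id UUID        NOT NULL,
    model     TEXT        NOT NULL,
    tokens    BIGINT      NOT NULL,
    cost_usd  NUMERIC     NOT NULL
);

SELECT create_hypertable('cost_events', 'time');

-- Continuous aggregate: hourly cost per tenant
CREATE MATERIALIZED VIEW cost_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       tenant_id,
       sum(cost_usd) AS cost_usd
FROM cost_events
GROUP BY bucket, tenant_id;
```

The continuous aggregate keeps dashboard queries off the raw hypertable as ingestion volume grows.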

Multi-Tenancy Strategy: Row-Level Security (RLS)

-- Tenant isolation via RLS
ALTER TABLE envelopes ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON envelopes
    USING (tenant_id = current_setting('fulcrum.current_tenant')::uuid);
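For the policy above to take effect, each request must set the tenant before touching tenant-scoped tables. A sketch of the per-transaction pattern (the UUID is a placeholder):

```sql
BEGIN;
-- SET LOCAL scopes the setting to this transaction only
SET LOCAL fulcrum.current_tenant = '00000000-0000-0000-0000-000000000000';
SELECT * FROM envelopes;  -- RLS filters rows to the current tenant
COMMIT;
```

Using SET LOCAL (rather than SET) ensures the tenant context cannot leak across pooled connections.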

Consequences

Positive:

  1. Single database operations - One backup strategy, one monitoring setup
  2. ACID guarantees - Checkpoint consistency critical for agent recovery
  3. SQL flexibility - Complex policy queries, audit searches
  4. TimescaleDB compatibility - Native PostgreSQL, no separate system
  5. Ecosystem maturity - pgx driver, excellent Go support
  6. Cost predictability - Self-hosted PostgreSQL vs serverless billing surprises

Negative:

  1. Horizontal scaling - Manual sharding if we exceed single-node capacity
  2. Global distribution - Requires additional architecture for multi-region
  3. Operations burden - Must manage backups, failover, upgrades
  4. Cold start - Redis cache warming needed after deployments

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Horizontal scaling | Partition by tenant_id; evaluate Citus at scale |
| Global distribution | Phase 6 concern; design schemas for future sharding |
| Operations | Use managed PostgreSQL (Cloud SQL/RDS) in production |
| Cold start | Implement lazy cache population, graceful degradation |

Alternatives Considered

Neon + Upstash (Serverless) - Rejected due to unpredictable costs at scale and connection pooling complexity.

Supabase - Rejected because TimescaleDB not available and platform coupling limits flexibility.

CockroachDB - Rejected as premature optimization; start with PostgreSQL, migrate only if multi-region becomes critical.

MongoDB + InfluxDB - Rejected because checkpoint recovery requires ACID guarantees and running two databases doubles operational complexity.

Validation Criteria

Revisit this decision if:

  1. Checkpoint write latency exceeds 50ms p99
  2. Cost event ingestion falls behind (>1 minute lag)
  3. PostgreSQL CPU consistently >80% at Phase 3 scale
  4. Multi-region deployment becomes critical

ADR-003: NATS JetStream for Event Streaming

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum requires a message queue for:

  • Agent task distribution across worker nodes
  • Async event processing (cost events, audit logs)
  • Cross-service communication (policy updates, circuit breaker signals)
  • Real-time streaming to clients (SSE/WebSocket backing)

Requirements:

| Requirement | Priority | Notes |
|-------------|----------|-------|
| At-least-once delivery | Critical | Tasks must not be lost |
| Low latency | Critical | <10ms for task dispatch |
| Backpressure handling | Critical | Prevent cascade failures |
| Ordering guarantees | High | Per-agent ordering required |
| Horizontal scaling | High | Must scale with agent count |
| Operations simplicity | High | Bootstrap budget constraints |

Decision

Message Queue: NATS with JetStream

NATS provides the messaging backbone with JetStream for persistence.

Subject Hierarchy:

tasks.{tenant}.{agent}.dispatch     # New task assignments
tasks.{tenant}.{agent}.complete     # Task completion signals
tasks.{tenant}.{agent}.cancel       # Cancellation requests

events.cost.{tenant}                # Cost tracking events
events.audit.{tenant}               # Audit log events
events.policy.{tenant}              # Policy change notifications

signals.circuit.{tenant}.{agent}    # Circuit breaker state
signals.ratelimit.{tenant}          # Rate limit exhaustion
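The hierarchy above can be sketched with a subject builder and a simple wildcard matcher. This is illustrative only (function names are not from the Fulcrum codebase, and NATS itself performs the matching server-side; `>` multi-level wildcards are omitted for brevity):

```go
package main

import (
	"fmt"
	"strings"
)

// taskSubject builds a concrete subject in the tasks.{tenant}.{agent}.{verb} hierarchy.
func taskSubject(tenant, agent, verb string) string {
	return fmt.Sprintf("tasks.%s.%s.%s", tenant, agent, verb)
}

// matches reports whether a subject matches a NATS-style pattern,
// where "*" matches exactly one dot-separated token.
func matches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	if len(p) != len(s) {
		return false
	}
	for i := range p {
		if p[i] != "*" && p[i] != s[i] {
			return false
		}
	}
	return true
}

func main() {
	subj := taskSubject("acme", "agent-7", "dispatch")
	fmt.Println(subj)                                   // tasks.acme.agent-7.dispatch
	fmt.Println(matches("tasks.acme.*.dispatch", subj)) // true: worker subscribed to all acme dispatches
}
```

A worker subscribed to `tasks.{tenant}.*.dispatch` receives dispatches for every agent in that tenant without per-agent subscriptions.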

Cluster Topology:

+-------------------------------------+
|       NATS Cluster (3 nodes)        |
|  +----------+----------+----------+ |
|  | nats-0   | nats-1   | nats-2   | |
|  | (leader) |(follower)|(follower)| |
|  +----------+----------+----------+ |
|                                     |
|  JetStream Replicas: 3              |
|  Consensus: Raft                    |
+-------------------------------------+

Consequences

Positive:

  1. Embedded deployment - NATS can run in-process for development
  2. Sub-millisecond latency - Fastest option for agent task dispatch
  3. Native Go client - First-class support, no FFI
  4. Simple operations - Single binary, minimal configuration
  5. Built-in clustering - No external coordination (ZooKeeper, etc.)
  6. JetStream persistence - Durable streams without separate system

Negative:

  1. Community size - Smaller than Kafka/RabbitMQ
  2. Complex routing - Less sophisticated than RabbitMQ exchanges
  3. Enterprise support - Fewer managed service options
  4. Exactly-once semantics - Requires application-level deduplication

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Community size | NATS is CNCF project, growing rapidly |
| Complex routing | Subject-based routing sufficient for Fulcrum |
| Enterprise support | Synadia offers commercial support |
| Exactly-once | Implement idempotency keys in task handlers |
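A minimal sketch of the idempotency-key mitigation (the type and its backing store are illustrative; a production deployment would persist seen keys in Redis or PostgreSQL rather than in memory):

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper drops messages whose idempotency key has already been processed,
// giving effectively-once handling on top of at-least-once delivery.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]bool)}
}

// Handle runs fn only the first time key is observed; redeliveries are no-ops.
// It returns true when fn was actually executed.
func (d *Deduper) Handle(key string, fn func()) bool {
	d.mu.Lock()
	if d.seen[key] {
		d.mu.Unlock()
		return false // duplicate delivery: skip
	}
	d.seen[key] = true
	d.mu.Unlock()
	fn()
	return true
}

func main() {
	d := NewDeduper()
	runs := 0
	d.Handle("task-42", func() { runs++ })
	d.Handle("task-42", func() { runs++ }) // redelivery is ignored
	fmt.Println(runs) // 1
}
```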

Alternatives Considered

RabbitMQ - Rejected because its routing sophistication is unnecessary and its 1-5ms latency is too high for the policy enforcement hot path.

Redis Streams - Rejected due to limited consumer group semantics and manual backpressure handling requirements.

Apache Kafka - Rejected due to significant operational overhead and overkill for Phase 1-3 scale.

Amazon SQS + SNS - Rejected due to vendor lock-in and latency (10-50ms) too high for policy enforcement.

Validation Criteria

Revisit this decision if:

  1. Message delivery latency exceeds 10ms p99
  2. Consumer lag grows unbounded under load
  3. Cluster consensus issues cause availability problems

ADR-004: Adapter Abstraction Pattern

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)
**Supersedes:** Original Phase 4 scope (Multi-Framework Adapters)

Context

Research conducted in Phase 0 revealed a critical market shift:

Microsoft Agent Framework is closer to GA than anticipated. Its filter/middleware pattern for guardrails is production-ready. LangGraph's orchestration primitives are mature. CrewAI and AutoGen have established ecosystems.

Strategic Implication: Fulcrum's value is not in orchestration mechanics; it is the governance layer that wraps existing orchestrators.

This ADR defines the adapter abstraction that enables Fulcrum to:

  1. Wrap any orchestration framework without coupling to its internals
  2. Enforce cost governance and policy at execution boundaries
  3. Provide unified observability across heterogeneous agent deployments
  4. Support A2A/MCP protocols as first-class integration patterns

Decision

Core Abstraction: The Execution Envelope

Fulcrum introduces the concept of an Execution Envelope, a framework-agnostic wrapper that captures:

+-----------------------------------------------------------+
|                   EXECUTION ENVELOPE                       |
+-----------------------------------------------------------+
|  metadata:                                                 |
|    tenant_id, workflow_id, execution_id, budget_id        |
|                                                            |
|  lifecycle:                                                |
|    PENDING -> AUTHORIZED -> RUNNING -> COMPLETED/FAILED   |
|                                                            |
|  governance_context:                                       |
|    token_budget, cost_limit, policy_set, timeout          |
|                                                            |
|  framework_context:                                        |
|    adapter_type, native_execution_ref, checkpoint_id      |
|                                                            |
|  events[]:                                                 |
|    { timestamp, event_type, payload, token_delta }        |
+-----------------------------------------------------------+

Every agent execution, regardless of framework, is wrapped in an Envelope. The Envelope is Fulcrum's unit of governance.
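The lifecycle in the Envelope diagram (PENDING -> AUTHORIZED -> RUNNING -> COMPLETED/FAILED) can be enforced with a small transition table. A sketch, with state names taken from the diagram and helper names illustrative:

```go
package main

import "fmt"

type EnvelopeState string

const (
	Pending    EnvelopeState = "PENDING"
	Authorized EnvelopeState = "AUTHORIZED"
	Running    EnvelopeState = "RUNNING"
	Completed  EnvelopeState = "COMPLETED"
	Failed     EnvelopeState = "FAILED"
)

// validNext encodes the lifecycle from the Envelope diagram above.
var validNext = map[EnvelopeState][]EnvelopeState{
	Pending:    {Authorized},
	Authorized: {Running},
	Running:    {Completed, Failed},
}

// canTransition reports whether moving from one state to another is legal.
func canTransition(from, to EnvelopeState) bool {
	for _, s := range validNext[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, Authorized)) // true
	fmt.Println(canTransition(Pending, Running))    // false: must be authorized first
}
```

Rejecting illegal transitions at the Envelope boundary means no adapter can move an execution into RUNNING without passing policy authorization.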

Adapter Interface:

type FrameworkAdapter interface {
    // Lifecycle
    WrapExecution(ctx context.Context, envelope *Envelope, native any) (*WrappedExecution, error)
    StartExecution(ctx context.Context, exec *WrappedExecution) error
    PauseExecution(ctx context.Context, exec *WrappedExecution) error
    ResumeExecution(ctx context.Context, exec *WrappedExecution) error
    TerminateExecution(ctx context.Context, exec *WrappedExecution, reason string) error

    // Event Capture
    RegisterEventHook(hook EventHook) error
    CaptureCheckpoint(ctx context.Context, exec *WrappedExecution) (*Checkpoint, error)

    // Cost Estimation
    EstimateTokens(ctx context.Context, input any) (TokenEstimate, error)

    // Framework Metadata
    FrameworkType() FrameworkType
    Capabilities() []AdapterCapability
}

Adapter Capabilities Matrix:

| Capability | Description | LangGraph | Microsoft | CrewAI | A2A Proxy |
|------------|-------------|-----------|-----------|--------|-----------|
| INTERCEPT_PRE | Pre-execution policy check | Full | Full | Full | Full |
| INTERCEPT_MID | Mid-execution budget enforcement | Full | Full | Partial | None |
| INTERCEPT_POST | Post-execution audit | Full | Full | Full | Full |
| CHECKPOINT | State snapshot capture | Full | Full | None | None |
| TERMINATE | Force stop execution | Full | Full | Full | Partial |
| TOKEN_STREAM | Real-time token counting | Full | Full | Partial | None |
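A sketch of how the control plane might consume this matrix at runtime, gating mid-execution enforcement on what an adapter reports via `Capabilities()` (the helper and constant names below are illustrative assumptions):

```go
package main

import "fmt"

// AdapterCapability mirrors the capability names in the matrix above.
type AdapterCapability string

const (
	CapInterceptMid AdapterCapability = "INTERCEPT_MID"
	CapCheckpoint   AdapterCapability = "CHECKPOINT"
)

// hasCapability checks an adapter's reported capability list. Adapters that
// cannot intercept mid-execution would fall back to post-hoc cost reconciliation.
func hasCapability(caps []AdapterCapability, want AdapterCapability) bool {
	for _, c := range caps {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	langGraphCaps := []AdapterCapability{CapInterceptMid, CapCheckpoint}
	a2aProxyCaps := []AdapterCapability{} // per the matrix: no mid-execution interception

	fmt.Println(hasCapability(langGraphCaps, CapInterceptMid)) // true
	fmt.Println(hasCapability(a2aProxyCaps, CapInterceptMid))  // false
}
```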

Integration Patterns:

  1. Embedded Middleware (Highest Control) - SDK wraps framework client, emits events to control plane
  2. Sidecar Observer (Lowest Friction) - Log/event stream ingestion for observability-only
  3. Protocol Proxy (A2A/MCP Integration) - A2A/MCP-compliant proxy with policy injection

Consequences

Positive:

  1. Framework Independence - New orchestrators require only adapter implementation, not core changes
  2. Governance Consistency - Same policies apply regardless of underlying framework
  3. Incremental Adoption - Start with Sidecar (observability), upgrade to Embedded (full governance)
  4. A2A/MCP Ready - Protocol proxy pattern enables cross-organization agent governance

Negative:

  1. Adapter Maintenance Burden - Each framework needs dedicated adapter development
  2. Capability Gaps - Some frameworks won't support all governance features
  3. Latency Overhead - Middleware pattern adds ~5-15ms per policy check
  4. Abstraction Leakage - Framework-specific behaviors may require special handling

Risks and Mitigations:

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Framework API changes break adapters | Medium | High | Version-pinned adapters, compatibility matrix |
| Event schema too rigid | Low | Medium | Extensible payload field, schema versioning |
| Performance overhead unacceptable | Low | High | Async event emission, local policy caching |
| New frameworks emerge | High | Low | Clean adapter interface enables rapid development |

Alternatives Considered

Build Our Own Orchestrator - Rejected because Microsoft and LangChain are 12-18 months ahead on orchestration mechanics.

Framework-Specific Governance Plugins - Rejected because it would fragment the codebase and create a maintenance nightmare.

Observability-Only (No Active Governance) - Rejected because observability is autopsy; governance is prevention.


ADR-005: OpenTelemetry for Observability

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum's value proposition is governance and cost control. Observability is not optional; it is core to the product. Requirements:

  1. Distributed tracing - Track requests across services
  2. Metrics - Cost aggregates, latency percentiles, error rates
  3. Logging - Structured logs for debugging and audit
  4. Alerting - Budget thresholds, policy violations, system health

Constraints:

  • Bootstrap budget (cost-sensitive)
  • Must support eventual SaaS deployment
  • Vendor-neutral for portability

Decision

Stack: OpenTelemetry + Prometheus + Grafana + Loki

+----------+     +-----------+     +------------+
| Services |---->|   OTEL    |---->| Prometheus |--+
+----------+     | Collector |     | (metrics)  |  |
                 |           |     +------------+  |   +--------------+
                 |           |     +------------+  +-->|   Grafana    |
                 |           |---->|    Loki    |----->| (dashboards) |
                 +-----------+     |   (logs)   |      +--------------+
                                   +------------+

Key Metrics:

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| fulcrum_envelope_created_total | Counter | tenant_id, adapter_type | Execution volume |
| fulcrum_envelope_duration_seconds | Histogram | tenant_id, state | Execution time |
| fulcrum_tokens_total | Counter | tenant_id, model | Token consumption |
| fulcrum_cost_usd_total | Counter | tenant_id, model | Cost tracking |
| fulcrum_policy_evaluations_total | Counter | tenant_id, result | Policy usage |
| fulcrum_budget_usage_ratio | Gauge | tenant_id, budget_id | Budget health |
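The fulcrum_cost_usd_total counter implies a token-to-USD conversion somewhere in the pipeline. A minimal sketch of that step (the per-model rates below are placeholders, not real pricing):

```go
package main

import "fmt"

// usdPerKToken maps model name to a placeholder USD cost per 1K tokens.
// These rates are illustrative only, not actual provider pricing.
var usdPerKToken = map[string]float64{
	"model-a": 0.01,
	"model-b": 0.03,
}

// costUSD converts a token count for a model into a USD amount,
// the value a service would add to fulcrum_cost_usd_total.
func costUSD(model string, tokens int) float64 {
	return float64(tokens) / 1000.0 * usdPerKToken[model]
}

func main() {
	fmt.Printf("%.4f\n", costUSD("model-a", 2500)) // 0.0250
}
```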

Instrumentation Pattern:

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func (s *EnvelopeService) CreateEnvelope(ctx context.Context, req *CreateEnvelopeRequest) (*Envelope, error) {
    ctx, span := otel.Tracer("envelope-service").Start(ctx, "CreateEnvelope")
    defer span.End()

    span.SetAttributes(
        attribute.String("tenant_id", req.TenantID),
        attribute.String("adapter_type", req.AdapterType),
    )

    // ... implementation
}

Consequences

Positive:

  1. Vendor-neutral - Can swap backends later (DataDog, Honeycomb, etc.)
  2. Native Go SDK - First-class OpenTelemetry Go support
  3. Unified API - Traces, metrics, and logs in one interface
  4. Industry standard - Growing ecosystem and tooling
  5. Self-hosted option - Prometheus + Grafana for predictable costs

Negative:

  1. Operational burden - Self-hosted requires management
  2. No managed alerting - Need Alertmanager or external integration
  3. Storage costs - Retention of metrics and traces

Migration Path:

If Phase 2+ requires managed observability:

  • OpenTelemetry exports to any backend
  • Grafana Cloud as a managed option
  • DataDog OTEL ingestion available

Validation Criteria

Revisit this decision if:

  1. Self-hosted operational burden becomes unsustainable
  2. Grafana Cloud or DataDog proves cost-effective at scale
  3. Multi-region requirements necessitate global observability platform

ADR-006: Infrastructure as Code

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum needs reproducible infrastructure for:

  • Local development (Docker Compose)
  • CI/CD preview environments
  • Production deployment (Kubernetes)

Requirements:

| Requirement | Priority | Notes |
|-------------|----------|-------|
| Multi-cloud capability | High | AWS/GCP/Azure |
| Kubernetes-native | High | Target deployment platform |
| Team-friendly | Medium | Readable, auditable |
| GitOps compatible | Medium | PR-based infrastructure changes |

Decision

Stack: Terraform + Helm + GitHub Actions

+-------------------------------------------------------+
|                    GitOps Flow                         |
+-------------------------------------------------------+
|                                                        |
|  +----------+    +----------+    +----------------+   |
|  |   Git    |--->| GitHub   |--->| Terraform      |   |
|  |  Commit  |    | Actions  |    |   Apply        |   |
|  +----------+    +----------+    +--------+-------+   |
|                                           |           |
|                                           v           |
|                                +------------------+   |
|                                |   Kubernetes     |   |
|                                |   (via Helm)     |   |
|                                +------------------+   |
|                                                        |
+-------------------------------------------------------+

Repository Structure:

/infra
+-- terraform/
|   +-- modules/
|   |   +-- networking/      # VPC, subnets, security groups
|   |   +-- kubernetes/      # EKS/GKE cluster
|   |   +-- database/        # RDS PostgreSQL
|   |   +-- redis/           # ElastiCache
|   |   +-- nats/            # NATS cluster config
|   +-- environments/
|       +-- dev/
|       +-- staging/
|       +-- prod/
+-- helm/
|   +-- fulcrum/
|       +-- Chart.yaml
|       +-- values.yaml
|       +-- values-dev.yaml
|       +-- values-prod.yaml
|       +-- templates/
+-- docker/
    +-- docker-compose.yaml   # Local development

Consequences

Positive:

  1. Standard tooling - Easy onboarding, extensive documentation
  2. Multi-cloud ready - AWS to GCP migration possible
  3. GitOps-friendly - PR reviews for infrastructure changes
  4. Reproducible environments - Consistent dev/staging/prod

Negative:

  1. Terraform state management - Requires backend configuration
  2. HCL verbosity - Complex logic is verbose
  3. Helm templating - Can be brittle with complex values

Phase 1 Simplification:

  • Single AWS region
  • Managed Kubernetes (EKS)
  • RDS PostgreSQL (not Aurora)
  • ElastiCache Redis (not self-hosted)

Validation Criteria

Revisit this decision if:

  1. Pulumi's TypeScript-native approach proves more productive
  2. Crossplane becomes mature enough for Kubernetes-native IaC
  3. Multi-cloud becomes Day 1 requirement

ADR-007: Python-Go Bridge Architecture

**Status:** Proposed
**Date:** December 15, 2025
**Decision Makers:** Technical Team
**Related:** ADR-001 (Go as Primary Language)

Context

Fulcrum's LangGraph adapter requires integration with Python-based LangGraph StateGraph execution. The adapter needs to support:

  1. Real StateGraph Execution - Create and execute LangGraph graphs in Python runtime
  2. Bidirectional Callback System - Forward Python callbacks to Go handlers
  3. Real-time Token Tracking - Extract actual token counts from LLM responses
  4. Budget Enforcement - Support mid-execution termination when budget exceeded
  5. Checkpoint Capture - Serialize LangGraph state for restore/recovery
  6. Low Latency - Target <10ms message latency for event forwarding

Decision Drivers:

| Driver | Weight | Notes |
|--------|--------|-------|
| Performance | High | <10ms latency requirement for callbacks |
| Reliability | High | Process isolation, crash handling |
| Complexity | Medium | Development time, debugging |
| Safety | High | Memory management, resource leaks |
| Testability | Medium | Unit and integration testing |

Decision

Selected: Subprocess with Protocol Buffers (stdin/stdout)

+------------------+                    +-------------------+
|   Go Process     |                    |  Python Process   |
|   (Fulcrum)      |                    |  (LangGraph)      |
|                  |                    |                   |
|   Adapter -------+---- stdin/stdout --+---- Bridge        |
|                  |     (protobuf)     |                   |
|   Callbacks <----+--------------------+---- Callbacks     |
+------------------+                    +-------------------+

Rationale:

  1. Meets Performance Requirements - Callback latency of 1-5ms well under <10ms target
  2. Superior Safety - Python crashes isolated from Fulcrum core
  3. Development Velocity - Simplest implementation, well-understood patterns
  4. Operational Simplicity - No sidecar deployment complexity
  5. Flexibility - Can run different Python versions per execution

Performance Profile:

| Metric | Value |
|--------|-------|
| Startup Latency | ~50-100ms (amortized via process pool) |
| Callback Latency | ~1-5ms per event |
| Throughput | ~200-500 messages/sec |
| Memory | Process isolation prevents sharing |

Process Pool Configuration:

type ProcessPool struct {
    maxSize     int           // Maximum concurrent Python processes
    idleTimeout time.Duration // How long to keep idle processes
    processes   chan *Process // Pool of ready processes
}

Consequences

Positive:

  1. Process isolation - Python crash doesn't crash Go process
  2. Simple protocol - stdin/stdout universally supported
  3. Easy testing - Can mock subprocess with test harness
  4. Resource tracking - OS-level process metrics (CPU, memory)
  5. Version flexibility - Different Python versions per execution

Negative:

  1. Startup overhead - ~50-100ms to spawn Python process
  2. Message serialization - Protobuf encode/decode on every message
  3. No shared memory - Cannot pass large data efficiently
  4. Process management - Need to handle zombie processes, cleanup

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Startup overhead | Use process pool to amortize startup cost |
| Serialization | Use protobuf ArenaAllocator for efficiency |
| Process management | Health checks and automatic restart |

Alternatives Considered

cgo with Python C API - Rejected due to GIL serialization defeating concurrency and memory safety risks in production.

gRPC Server (Python Sidecar) - Rejected because sidecar deployment adds operational complexity not justified by benefits.

Validation Criteria

Success Criteria:

  1. Callback latency <10ms (target: 1-5ms)
  2. Process startup amortized via pool
  3. Zero memory leaks over 1000 executions
  4. Graceful handling of Python crashes
  5. Support for concurrent LangGraph executions

Document Metadata

| Attribute | Value |
|-----------|-------|
| Document ID | ADR_INDEX |
| Version | 1.0 |
| Created | January 6, 2026 |
| Author | Technical Architecture Team |
| Reviewers | Engineering Leadership |
| Source Archive | /.archive/historical/phase-docs/phase-0-foundation/adrs/ |


This document consolidates architecture decisions from the Phase 0 Foundation period (December 2025) and is maintained as the canonical reference for Fulcrum's technical architecture rationale.