
Architecture Decision Records (ADR) Index

Fulcrum Cognitive AI Governance Platform
Version: 1.0 | Last Updated: January 6, 2026


Table of Contents

  1. What is an ADR?
  2. ADR Template
  3. ADR Index
  4. Decision Records
  5. ADR-001: Go as Backend Language
  6. ADR-002: PostgreSQL as Primary Database
  7. ADR-003: NATS JetStream for Event Streaming
  8. ADR-004: Adapter Abstraction Pattern
  9. ADR-005: OpenTelemetry for Observability
  10. ADR-006: Infrastructure as Code (Terraform + Helm)
  11. ADR-007: Python-Go Bridge Architecture

What is an ADR?

An Architecture Decision Record (ADR) is a document that captures an important architectural decision made along with its context and consequences. ADRs serve as a historical record of why certain technical choices were made, enabling future team members to understand the rationale behind the current system design.

Purpose

  • Preserve Context: Capture the reasoning behind decisions before institutional knowledge is lost
  • Enable Review: Allow stakeholders to review and discuss significant changes
  • Support Onboarding: Help new team members understand why things are the way they are
  • Prevent Repetition: Avoid relitigating the same decisions repeatedly
  • Guide Evolution: Provide a foundation for future architectural changes

When to Write an ADR

Create an ADR when making decisions that:

  1. Affect the overall system architecture
  2. Are difficult or costly to reverse
  3. Involve significant trade-offs
  4. Span multiple components or services
  5. Impact security, performance, or compliance
  6. Establish patterns that others will follow

ADR Template

Use this template when creating new Architecture Decision Records:

# ADR-000: [Title]

**Status:** Proposed | Accepted | Deprecated | Superseded
**Date:** YYYY-MM-DD
**Decision Makers:** [Names/Roles]
**Supersedes:** [ADR-000 if applicable]
**Superseded by:** [ADR-000 if applicable]

---

## Context

[Describe the issue, the forces at play, and why a decision is needed.
Include relevant constraints, requirements, and stakeholder concerns.]

### Requirements

| Requirement | Priority | Notes |
|-------------|----------|-------|
| [Requirement 1] | Critical/High/Medium/Low | [Details] |

### Constraints

- [Constraint 1]
- [Constraint 2]

---

## Decision

[State the decision clearly and concisely. Include the chosen approach
and key implementation details.]

---

## Consequences

### Positive

- [Benefit 1]
- [Benefit 2]

### Negative

- [Drawback 1]
- [Drawback 2]

### Mitigations

| Challenge | Mitigation |
|-----------|------------|
| [Challenge 1] | [How we'll address it] |

---

## Alternatives Considered

### Option A: [Alternative Name]

**Pros:** [List advantages]
**Cons:** [List disadvantages]
**Rejection Rationale:** [Why this was not chosen]

---

## Validation Criteria

This decision should be revisited if:

1. [Condition 1]
2. [Condition 2]

**Review Date:** [When to re-evaluate]

---

## References

- [Link to relevant documentation]
- [Link to research/benchmarks]

---

*Approved by:* [Name]
*Effective from:* [Date]

ADR Index

| ADR | Title | Status | Date | Summary |
|-----|-------|--------|------|---------|
| 001 | Go as Backend Language | Accepted | Dec 10, 2025 | Go selected for backend services due to performance, concurrency, and single-binary deployment |
| 002 | PostgreSQL as Primary Database | Accepted | Dec 10, 2025 | PostgreSQL 16 with TimescaleDB for relational data and time-series metrics |
| 003 | NATS JetStream for Event Streaming | Accepted | Dec 10, 2025 | NATS chosen over Kafka for simpler operations with built-in persistence |
| 004 | Adapter Abstraction Pattern | Accepted | Dec 10, 2025 | Framework-agnostic governance via Execution Envelope and adapter interfaces |
| 005 | OpenTelemetry for Observability | Accepted | Dec 10, 2025 | OTEL standard adopted for vendor-neutral distributed tracing |
| 006 | Infrastructure as Code | Accepted | Dec 10, 2025 | Terraform + Helm + GitHub Actions for reproducible deployments |
| 007 | Python-Go Bridge Architecture | Proposed | Dec 15, 2025 | Subprocess with Protocol Buffers for LangGraph integration |

Decision Records


ADR-001: Go as Backend Language

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum requires a backend language for implementing core control plane services:

  • Policy enforcement engine
  • Cost governance service
  • Agent task scheduler
  • Multi-framework adapter layer

Key Requirements:

| Requirement | Priority | Rationale |
|-------------|----------|-----------|
| <10ms policy latency | Critical | Every agent action routes through policy enforcement |
| High concurrency | Critical | Agent orchestration demands thousands of concurrent operations |
| Single binary deployment | High | Minimize runtime dependencies for Kubernetes |
| Strong typing | High | API contract enforcement for SDK generation |

Constraints:

  • 12-18 month competitive window before Microsoft Agent Framework GA
  • Initially solo developer, scaling to 2-3 during build
  • Target deployment: Kubernetes (GKE/EKS)

Decision

Primary Backend: Go (Golang 1.24+)

Fulcrum's core control plane services are implemented in Go.

Secondary: TypeScript

SDKs, CLI tooling, and developer-facing components use TypeScript.

Consequences

Positive:

  1. Single binary deployments - No runtime dependencies, minimal container images (~20MB)
  2. Native concurrency - Goroutines handle thousands of concurrent agent connections trivially
  3. Performance predictability - Fast, tunable garbage collector with <1ms pauses
  4. Operations excellence - Prometheus/OpenTelemetry native support
  5. Cloud ecosystem alignment - Kubernetes, Docker, NATS all Go-native
  6. Hiring pool - Go developers understand concurrent systems design

Negative:

  1. Initial velocity - TypeScript would be faster for first 2-3 weeks
  2. Frontend integration - Requires TypeScript SDKs for web integrations
  3. Ecosystem gaps - Fewer AI/ML libraries compared to Python

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Initial velocity | Use code generation from OpenAPI specs |
| Frontend integration | TypeScript SDK as separate package |
| ML libraries | Control plane doesn't run ML; adapters call external services |

Alternatives Considered

Rust - Rejected due to 2-3x longer development time and slower iteration during prototyping. Go provides sufficient performance for control plane operations.

TypeScript (Node.js) - Rejected because the event loop limits true parallelism. Policy enforcement cannot afford GC pauses or event loop stalls.

Python - Rejected due to GIL limitations and inability to guarantee consistent sub-10ms latency.

Validation Criteria

Revisit this decision if:

  1. Policy engine latency exceeds 10ms p99 under load
  2. Development velocity drops below 1 major feature/week
  3. Hiring proves significantly harder than expected

ADR-002: PostgreSQL as Primary Database

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum requires persistent storage for:

  • Agent state and checkpoints (high-frequency writes)
  • Policy definitions and audit logs (compliance-critical)
  • Cost tracking events (time-series data)
  • Multi-tenant organization data (relational)

Requirements Matrix:

| Requirement | Priority | Notes |
|-------------|----------|-------|
| Multi-tenancy isolation | Critical | Organizations must not see each other's data |
| Checkpoint durability | Critical | Agent state recovery depends on this |
| Audit compliance | Critical | SOC2, GDPR requirements |
| Query flexibility | High | Complex policy queries, cost analytics |
| Time-series performance | High | 10K+ events/second ingestion |

Scale Targets:

  • Phase 1-2: 10 tenants, 100 concurrent agents
  • Phase 3-4: 100 tenants, 1K concurrent agents
  • Phase 5-6: 1K tenants, 10K concurrent agents

Decision

Primary Database: PostgreSQL 16

All relational data, checkpoints, and audit logs stored in PostgreSQL.

Caching Layer: Redis 7

Session state, rate limiting counters, and hot policy data cached in Redis.

Time-series: TimescaleDB Extension

Cost events and metrics stored in TimescaleDB hypertables with continuous aggregates.
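As a sketch of what this looks like in practice (the table name, columns, and bucket interval below are illustrative assumptions, not taken from this document):

```sql
-- Hypothetical cost-event hypertable (names are illustrative)
CREATE TABLE cost_events (
    time      TIMESTAMPTZ NOT NULL,
    tenant_id UUID        NOT NULL,
    model     TEXT        NOT NULL,
    tokens    BIGINT      NOT NULL,
    cost_usd  NUMERIC     NOT NULL
);

SELECT create_hypertable('cost_events', 'time');

-- Continuous aggregate: hourly cost per tenant
CREATE MATERIALIZED VIEW cost_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       tenant_id,
       sum(cost_usd) AS cost_usd
FROM cost_events
GROUP BY bucket, tenant_id;
```

The continuous aggregate keeps dashboard queries off the raw hypertable as ingestion volume grows.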

Multi-Tenancy Strategy: Row-Level Security (RLS)

-- Tenant isolation via RLS
ALTER TABLE envelopes ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON envelopes
    USING (tenant_id = current_setting('fulcrum.current_tenant')::uuid);
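For the policy above to take effect, each request must set the tenant before touching tenant-scoped tables. A sketch of the per-transaction pattern (the UUID is a placeholder):

```sql
BEGIN;
-- SET LOCAL scopes the setting to this transaction only
SET LOCAL fulcrum.current_tenant = '00000000-0000-0000-0000-000000000000';
SELECT * FROM envelopes;  -- RLS filters rows to the current tenant
COMMIT;
```

Using SET LOCAL (rather than SET) ensures the tenant context cannot leak across pooled connections.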

Consequences

Positive:

  1. Single database operations - One backup strategy, one monitoring setup
  2. ACID guarantees - Checkpoint consistency critical for agent recovery
  3. SQL flexibility - Complex policy queries, audit searches
  4. TimescaleDB compatibility - Native PostgreSQL, no separate system
  5. Ecosystem maturity - pgx driver, excellent Go support
  6. Cost predictability - Self-hosted PostgreSQL vs serverless billing surprises

Negative:

  1. Horizontal scaling - Manual sharding if we exceed single-node capacity
  2. Global distribution - Requires additional architecture for multi-region
  3. Operations burden - Must manage backups, failover, upgrades
  4. Cold start - Redis cache warming needed after deployments

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Horizontal scaling | Partition by tenant_id; evaluate Citus at scale |
| Global distribution | Phase 6 concern; design schemas for future sharding |
| Operations | Use managed PostgreSQL (Cloud SQL/RDS) in production |
| Cold start | Implement lazy cache population, graceful degradation |

Alternatives Considered

Neon + Upstash (Serverless) - Rejected due to unpredictable costs at scale and connection pooling complexity.

Supabase - Rejected because TimescaleDB not available and platform coupling limits flexibility.

CockroachDB - Rejected as premature optimization; start with PostgreSQL, migrate only if multi-region becomes critical.

MongoDB + InfluxDB - Rejected because checkpoint recovery requires ACID guarantees and running two databases doubles operational complexity.

Validation Criteria

Revisit this decision if:

  1. Checkpoint write latency exceeds 50ms p99
  2. Cost event ingestion falls behind (>1 minute lag)
  3. PostgreSQL CPU consistently >80% at Phase 3 scale
  4. Multi-region deployment becomes critical

ADR-003: NATS JetStream for Event Streaming

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum requires a message queue for:

  • Agent task distribution across worker nodes
  • Async event processing (cost events, audit logs)
  • Cross-service communication (policy updates, circuit breaker signals)
  • Real-time streaming to clients (SSE/WebSocket backing)

Requirements:

| Requirement | Priority | Notes |
|-------------|----------|-------|
| At-least-once delivery | Critical | Tasks must not be lost |
| Low latency | Critical | <10ms for task dispatch |
| Backpressure handling | Critical | Prevent cascade failures |
| Ordering guarantees | High | Per-agent ordering required |
| Horizontal scaling | High | Must scale with agent count |
| Operations simplicity | High | Bootstrap budget constraints |

Decision

Message Queue: NATS with JetStream

NATS provides the messaging backbone with JetStream for persistence.

Subject Hierarchy:

tasks.{tenant}.{agent}.dispatch     # New task assignments
tasks.{tenant}.{agent}.complete     # Task completion signals
tasks.{tenant}.{agent}.cancel       # Cancellation requests

events.cost.{tenant}                # Cost tracking events
events.audit.{tenant}               # Audit log events
events.policy.{tenant}              # Policy change notifications

signals.circuit.{tenant}.{agent}    # Circuit breaker state
signals.ratelimit.{tenant}          # Rate limit exhaustion
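The hierarchy above can be sketched with a subject builder and a simple wildcard matcher. This is illustrative only (function names are not from the Fulcrum codebase, and NATS itself performs the matching server-side; `>` multi-level wildcards are omitted for brevity):

```go
package main

import (
	"fmt"
	"strings"
)

// taskSubject builds a concrete subject in the tasks.{tenant}.{agent}.{verb} hierarchy.
func taskSubject(tenant, agent, verb string) string {
	return fmt.Sprintf("tasks.%s.%s.%s", tenant, agent, verb)
}

// matches reports whether a subject matches a NATS-style pattern,
// where "*" matches exactly one dot-separated token.
func matches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	if len(p) != len(s) {
		return false
	}
	for i := range p {
		if p[i] != "*" && p[i] != s[i] {
			return false
		}
	}
	return true
}

func main() {
	subj := taskSubject("acme", "agent-7", "dispatch")
	fmt.Println(subj)                                   // tasks.acme.agent-7.dispatch
	fmt.Println(matches("tasks.acme.*.dispatch", subj)) // true: worker subscribed to all acme dispatches
}
```

A worker subscribed to `tasks.{tenant}.*.dispatch` receives dispatches for every agent in that tenant without per-agent subscriptions.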

Cluster Topology:

+-------------------------------------+
|       NATS Cluster (3 nodes)        |
|  +----------+----------+----------+ |
|  | nats-0   | nats-1   | nats-2   | |
|  | (leader) |(follower)|(follower)| |
|  +----------+----------+----------+ |
|                                     |
|  JetStream Replicas: 3              |
|  Consensus: Raft                    |
+-------------------------------------+

Consequences

Positive:

  1. Embedded deployment - NATS can run in-process for development
  2. Sub-millisecond latency - Fastest option for agent task dispatch
  3. Native Go client - First-class support, no FFI
  4. Simple operations - Single binary, minimal configuration
  5. Built-in clustering - No external coordination (ZooKeeper, etc.)
  6. JetStream persistence - Durable streams without separate system

Negative:

  1. Community size - Smaller than Kafka/RabbitMQ
  2. Complex routing - Less sophisticated than RabbitMQ exchanges
  3. Enterprise support - Fewer managed service options
  4. Exactly-once semantics - Requires application-level deduplication

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Community size | NATS is CNCF project, growing rapidly |
| Complex routing | Subject-based routing sufficient for Fulcrum |
| Enterprise support | Synadia offers commercial support |
| Exactly-once | Implement idempotency keys in task handlers |
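A minimal sketch of the idempotency-key mitigation (the type and its backing store are illustrative; a production deployment would persist seen keys in Redis or PostgreSQL rather than in memory):

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper drops messages whose idempotency key has already been processed,
// giving effectively-once handling on top of at-least-once delivery.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]bool)}
}

// Handle runs fn only the first time key is observed; redeliveries are no-ops.
// It returns true when fn was actually executed.
func (d *Deduper) Handle(key string, fn func()) bool {
	d.mu.Lock()
	if d.seen[key] {
		d.mu.Unlock()
		return false // duplicate delivery: skip
	}
	d.seen[key] = true
	d.mu.Unlock()
	fn()
	return true
}

func main() {
	d := NewDeduper()
	runs := 0
	d.Handle("task-42", func() { runs++ })
	d.Handle("task-42", func() { runs++ }) // redelivery is ignored
	fmt.Println(runs) // 1
}
```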

Alternatives Considered

RabbitMQ - Rejected because its routing sophistication is unnecessary and its 1-5ms latency is too high for the policy enforcement hot path.

Redis Streams - Rejected due to limited consumer group semantics and manual backpressure handling requirements.

Apache Kafka - Rejected due to significant operational overhead and overkill for Phase 1-3 scale.

Amazon SQS + SNS - Rejected due to vendor lock-in and latency (10-50ms) too high for policy enforcement.

Validation Criteria

Revisit this decision if:

  1. Message delivery latency exceeds 10ms p99
  2. Consumer lag grows unbounded under load
  3. Cluster consensus issues cause availability problems

ADR-004: Adapter Abstraction Pattern

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)
**Supersedes:** Original Phase 4 scope (Multi-Framework Adapters)

Context

Research conducted in Phase 0 revealed a critical market shift:

Microsoft Agent Framework is closer to GA than anticipated. Its filter/middleware pattern for guardrails is production-ready. LangGraph's orchestration primitives are mature. CrewAI and AutoGen have established ecosystems.

Strategic Implication: Fulcrum's value is not in orchestration mechanics; it is the governance layer that wraps existing orchestrators.

This ADR defines the adapter abstraction that enables Fulcrum to:

  1. Wrap any orchestration framework without coupling to its internals
  2. Enforce cost governance and policy at execution boundaries
  3. Provide unified observability across heterogeneous agent deployments
  4. Support A2A/MCP protocols as first-class integration patterns

Decision

Core Abstraction: The Execution Envelope

Fulcrum introduces the concept of an Execution Envelope, a framework-agnostic wrapper that captures:

+-----------------------------------------------------------+
|                   EXECUTION ENVELOPE                       |
+-----------------------------------------------------------+
|  metadata:                                                 |
|    tenant_id, workflow_id, execution_id, budget_id        |
|                                                            |
|  lifecycle:                                                |
|    PENDING -> AUTHORIZED -> RUNNING -> COMPLETED/FAILED   |
|                                                            |
|  governance_context:                                       |
|    token_budget, cost_limit, policy_set, timeout          |
|                                                            |
|  framework_context:                                        |
|    adapter_type, native_execution_ref, checkpoint_id      |
|                                                            |
|  events[]:                                                 |
|    { timestamp, event_type, payload, token_delta }        |
+-----------------------------------------------------------+

Every agent execution, regardless of framework, is wrapped in an Envelope. The Envelope is Fulcrum's unit of governance.
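The lifecycle in the Envelope diagram (PENDING -> AUTHORIZED -> RUNNING -> COMPLETED/FAILED) can be enforced with a small transition table. A sketch, with state names taken from the diagram and helper names illustrative:

```go
package main

import "fmt"

type EnvelopeState string

const (
	Pending    EnvelopeState = "PENDING"
	Authorized EnvelopeState = "AUTHORIZED"
	Running    EnvelopeState = "RUNNING"
	Completed  EnvelopeState = "COMPLETED"
	Failed     EnvelopeState = "FAILED"
)

// validNext encodes the lifecycle from the Envelope diagram above.
var validNext = map[EnvelopeState][]EnvelopeState{
	Pending:    {Authorized},
	Authorized: {Running},
	Running:    {Completed, Failed},
}

// canTransition reports whether moving from one state to another is legal.
func canTransition(from, to EnvelopeState) bool {
	for _, s := range validNext[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, Authorized)) // true
	fmt.Println(canTransition(Pending, Running))    // false: must be authorized first
}
```

Rejecting illegal transitions at the Envelope boundary means no adapter can move an execution into RUNNING without passing policy authorization.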

Adapter Interface:

type FrameworkAdapter interface {
    // Lifecycle
    WrapExecution(ctx context.Context, envelope *Envelope, native any) (*WrappedExecution, error)
    StartExecution(ctx context.Context, exec *WrappedExecution) error
    PauseExecution(ctx context.Context, exec *WrappedExecution) error
    ResumeExecution(ctx context.Context, exec *WrappedExecution) error
    TerminateExecution(ctx context.Context, exec *WrappedExecution, reason string) error

    // Event Capture
    RegisterEventHook(hook EventHook) error
    CaptureCheckpoint(ctx context.Context, exec *WrappedExecution) (*Checkpoint, error)

    // Cost Estimation
    EstimateTokens(ctx context.Context, input any) (TokenEstimate, error)

    // Framework Metadata
    FrameworkType() FrameworkType
    Capabilities() []AdapterCapability
}

Adapter Capabilities Matrix:

| Capability | Description | LangGraph | Microsoft | CrewAI | A2A Proxy |
|------------|-------------|-----------|-----------|--------|-----------|
| INTERCEPT_PRE | Pre-execution policy check | Full | Full | Full | Full |
| INTERCEPT_MID | Mid-execution budget enforcement | Full | Full | Partial | None |
| INTERCEPT_POST | Post-execution audit | Full | Full | Full | Full |
| CHECKPOINT | State snapshot capture | Full | Full | None | None |
| TERMINATE | Force stop execution | Full | Full | Full | Partial |
| TOKEN_STREAM | Real-time token counting | Full | Full | Partial | None |
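A sketch of how the control plane might consume this matrix at runtime, gating mid-execution enforcement on what an adapter reports via `Capabilities()` (the helper and constant names below are illustrative assumptions):

```go
package main

import "fmt"

// AdapterCapability mirrors the capability names in the matrix above.
type AdapterCapability string

const (
	CapInterceptMid AdapterCapability = "INTERCEPT_MID"
	CapCheckpoint   AdapterCapability = "CHECKPOINT"
)

// hasCapability checks an adapter's reported capability list. Adapters that
// cannot intercept mid-execution would fall back to post-hoc cost reconciliation.
func hasCapability(caps []AdapterCapability, want AdapterCapability) bool {
	for _, c := range caps {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	langGraphCaps := []AdapterCapability{CapInterceptMid, CapCheckpoint}
	a2aProxyCaps := []AdapterCapability{} // per the matrix: no mid-execution interception

	fmt.Println(hasCapability(langGraphCaps, CapInterceptMid)) // true
	fmt.Println(hasCapability(a2aProxyCaps, CapInterceptMid))  // false
}
```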

Integration Patterns:

  1. Embedded Middleware (Highest Control) - SDK wraps framework client, emits events to control plane
  2. Sidecar Observer (Lowest Friction) - Log/event stream ingestion for observability-only
  3. Protocol Proxy (A2A/MCP Integration) - A2A/MCP-compliant proxy with policy injection

Consequences

Positive:

  1. Framework Independence - New orchestrators require only adapter implementation, not core changes
  2. Governance Consistency - Same policies apply regardless of underlying framework
  3. Incremental Adoption - Start with Sidecar (observability), upgrade to Embedded (full governance)
  4. A2A/MCP Ready - Protocol proxy pattern enables cross-organization agent governance

Negative:

  1. Adapter Maintenance Burden - Each framework needs dedicated adapter development
  2. Capability Gaps - Some frameworks won't support all governance features
  3. Latency Overhead - Middleware pattern adds ~5-15ms per policy check
  4. Abstraction Leakage - Framework-specific behaviors may require special handling

Risks and Mitigations:

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Framework API changes break adapters | Medium | High | Version-pinned adapters, compatibility matrix |
| Event schema too rigid | Low | Medium | Extensible payload field, schema versioning |
| Performance overhead unacceptable | Low | High | Async event emission, local policy caching |
| New frameworks emerge | High | Low | Clean adapter interface enables rapid development |

Alternatives Considered

Build Our Own Orchestrator - Rejected because Microsoft and LangChain are 12-18 months ahead on orchestration mechanics.

Framework-Specific Governance Plugins - Rejected because it would fragment the codebase and create a maintenance nightmare.

Observability-Only (No Active Governance) - Rejected because observability is autopsy; governance is prevention.


ADR-005: OpenTelemetry for Observability

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum's value proposition is governance and cost control. Observability is not optional; it is core to the product. Requirements:

  1. Distributed tracing - Track requests across services
  2. Metrics - Cost aggregates, latency percentiles, error rates
  3. Logging - Structured logs for debugging and audit
  4. Alerting - Budget thresholds, policy violations, system health

Constraints:

  • Bootstrap budget (cost-sensitive)
  • Must support eventual SaaS deployment
  • Vendor-neutral for portability

Decision

Stack: OpenTelemetry + Prometheus + Grafana + Loki

+----------+     +-----------+     +------------+
| Services |---->|   OTEL    |---->| Prometheus |--+
+----------+     | Collector |     | (metrics)  |  |
                 |           |     +------------+  |   +--------------+
                 |           |     +------------+  +-->|   Grafana    |
                 |           |---->|    Loki    |----->| (dashboards) |
                 +-----------+     |   (logs)   |      +--------------+
                                   +------------+

Key Metrics:

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| fulcrum_envelope_created_total | Counter | tenant_id, adapter_type | Execution volume |
| fulcrum_envelope_duration_seconds | Histogram | tenant_id, state | Execution time |
| fulcrum_tokens_total | Counter | tenant_id, model | Token consumption |
| fulcrum_cost_usd_total | Counter | tenant_id, model | Cost tracking |
| fulcrum_policy_evaluations_total | Counter | tenant_id, result | Policy usage |
| fulcrum_budget_usage_ratio | Gauge | tenant_id, budget_id | Budget health |
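The fulcrum_cost_usd_total counter implies a token-to-USD conversion somewhere in the pipeline. A minimal sketch of that step (the per-model rates below are placeholders, not real pricing):

```go
package main

import "fmt"

// usdPerKToken maps model name to a placeholder USD cost per 1K tokens.
// These rates are illustrative only, not actual provider pricing.
var usdPerKToken = map[string]float64{
	"model-a": 0.01,
	"model-b": 0.03,
}

// costUSD converts a token count for a model into a USD amount,
// the value a service would add to fulcrum_cost_usd_total.
func costUSD(model string, tokens int) float64 {
	return float64(tokens) / 1000.0 * usdPerKToken[model]
}

func main() {
	fmt.Printf("%.4f\n", costUSD("model-a", 2500)) // 0.0250
}
```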

Instrumentation Pattern:

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func (s *EnvelopeService) CreateEnvelope(ctx context.Context, req *CreateEnvelopeRequest) (*Envelope, error) {
    ctx, span := otel.Tracer("envelope-service").Start(ctx, "CreateEnvelope")
    defer span.End()

    span.SetAttributes(
        attribute.String("tenant_id", req.TenantID),
        attribute.String("adapter_type", req.AdapterType),
    )

    // ... implementation
}

Consequences

Positive:

  1. Vendor-neutral - Can swap backends later (DataDog, Honeycomb, etc.)
  2. Native Go SDK - First-class OpenTelemetry Go support
  3. Unified API - Traces, metrics, and logs in one interface
  4. Industry standard - Growing ecosystem and tooling
  5. Self-hosted option - Prometheus + Grafana for predictable costs

Negative:

  1. Operational burden - Self-hosted requires management
  2. No managed alerting - Need Alertmanager or external integration
  3. Storage costs - Retention of metrics and traces

Migration Path:

If Phase 2+ requires managed observability:

  • OpenTelemetry exports to any backend
  • Grafana Cloud as a managed option
  • DataDog OTEL ingestion available

Validation Criteria

Revisit this decision if:

  1. Self-hosted operational burden becomes unsustainable
  2. Grafana Cloud or DataDog proves cost-effective at scale
  3. Multi-region requirements necessitate global observability platform

ADR-006: Infrastructure as Code

**Status:** Accepted
**Date:** December 10, 2025
**Decision Makers:** Tony Diefenbach, Claude (Opus 4.5)

Context

Fulcrum needs reproducible infrastructure for:

  • Local development (Docker Compose)
  • CI/CD preview environments
  • Production deployment (Kubernetes)

Requirements:

| Requirement | Priority | Notes |
|-------------|----------|-------|
| Multi-cloud capability | High | AWS/GCP/Azure |
| Kubernetes-native | High | Target deployment platform |
| Team-friendly | Medium | Readable, auditable |
| GitOps compatible | Medium | PR-based infrastructure changes |

Decision

Stack: Terraform + Helm + GitHub Actions

+-------------------------------------------------------+
|                    GitOps Flow                         |
+-------------------------------------------------------+
|                                                        |
|  +----------+    +----------+    +----------------+   |
|  |   Git    |--->| GitHub   |--->| Terraform      |   |
|  |  Commit  |    | Actions  |    |   Apply        |   |
|  +----------+    +----------+    +--------+-------+   |
|                                           |           |
|                                           v           |
|                                +------------------+   |
|                                |   Kubernetes     |   |
|                                |   (via Helm)     |   |
|                                +------------------+   |
|                                                        |
+-------------------------------------------------------+

Repository Structure:

/infra
+-- terraform/
|   +-- modules/
|   |   +-- networking/      # VPC, subnets, security groups
|   |   +-- kubernetes/      # EKS/GKE cluster
|   |   +-- database/        # RDS PostgreSQL
|   |   +-- redis/           # ElastiCache
|   |   +-- nats/            # NATS cluster config
|   +-- environments/
|       +-- dev/
|       +-- staging/
|       +-- prod/
+-- helm/
|   +-- fulcrum/
|       +-- Chart.yaml
|       +-- values.yaml
|       +-- values-dev.yaml
|       +-- values-prod.yaml
|       +-- templates/
+-- docker/
    +-- docker-compose.yaml   # Local development

Consequences

Positive:

  1. Standard tooling - Easy onboarding, extensive documentation
  2. Multi-cloud ready - AWS to GCP migration possible
  3. GitOps-friendly - PR reviews for infrastructure changes
  4. Reproducible environments - Consistent dev/staging/prod

Negative:

  1. Terraform state management - Requires backend configuration
  2. HCL verbosity - Complex logic is verbose
  3. Helm templating - Can be brittle with complex values

Phase 1 Simplification:

  • Single AWS region
  • Managed Kubernetes (EKS)
  • RDS PostgreSQL (not Aurora)
  • ElastiCache Redis (not self-hosted)

Validation Criteria

Revisit this decision if:

  1. Pulumi's TypeScript-native approach proves more productive
  2. Crossplane becomes mature enough for Kubernetes-native IaC
  3. Multi-cloud becomes Day 1 requirement

ADR-007: Python-Go Bridge Architecture

**Status:** Proposed
**Date:** December 15, 2025
**Decision Makers:** Technical Team
**Related:** ADR-001 (Go as Primary Language)

Context

Fulcrum's LangGraph adapter requires integration with Python-based LangGraph StateGraph execution. The adapter needs to support:

  1. Real StateGraph Execution - Create and execute LangGraph graphs in Python runtime
  2. Bidirectional Callback System - Forward Python callbacks to Go handlers
  3. Real-time Token Tracking - Extract actual token counts from LLM responses
  4. Budget Enforcement - Support mid-execution termination when budget exceeded
  5. Checkpoint Capture - Serialize LangGraph state for restore/recovery
  6. Low Latency - Target <10ms message latency for event forwarding

Decision Drivers:

| Driver | Weight | Notes |
|--------|--------|-------|
| Performance | High | <10ms latency requirement for callbacks |
| Reliability | High | Process isolation, crash handling |
| Complexity | Medium | Development time, debugging |
| Safety | High | Memory management, resource leaks |
| Testability | Medium | Unit and integration testing |

Decision

Selected: Subprocess with Protocol Buffers (stdin/stdout)

+------------------+                    +-------------------+
|   Go Process     |                    |  Python Process   |
|   (Fulcrum)      |                    |  (LangGraph)      |
|                  |                    |                   |
|   Adapter -------+---- stdin/stdout --+---- Bridge        |
|                  |     (protobuf)     |                   |
|   Callbacks <----+--------------------+---- Callbacks     |
+------------------+                    +-------------------+

Rationale:

  1. Meets Performance Requirements - Callback latency of 1-5ms well under <10ms target
  2. Superior Safety - Python crashes isolated from Fulcrum core
  3. Development Velocity - Simplest implementation, well-understood patterns
  4. Operational Simplicity - No sidecar deployment complexity
  5. Flexibility - Can run different Python versions per execution

Performance Profile:

| Metric | Value |
|--------|-------|
| Startup Latency | ~50-100ms (amortized via process pool) |
| Callback Latency | ~1-5ms per event |
| Throughput | ~200-500 messages/sec |
| Memory | Process isolation prevents sharing |

Process Pool Configuration:

type ProcessPool struct {
    maxSize     int           // Maximum concurrent Python processes
    idleTimeout time.Duration // How long to keep idle processes
    processes   chan *Process // Pool of ready processes
}

Consequences

Positive:

  1. Process isolation - Python crash doesn't crash Go process
  2. Simple protocol - stdin/stdout universally supported
  3. Easy testing - Can mock subprocess with test harness
  4. Resource tracking - OS-level process metrics (CPU, memory)
  5. Version flexibility - Different Python versions per execution

Negative:

  1. Startup overhead - ~50-100ms to spawn Python process
  2. Message serialization - Protobuf encode/decode on every message
  3. No shared memory - Cannot pass large data efficiently
  4. Process management - Need to handle zombie processes, cleanup

Mitigations:

| Challenge | Mitigation |
|-----------|------------|
| Startup overhead | Use process pool to amortize startup cost |
| Serialization | Use protobuf ArenaAllocator for efficiency |
| Process management | Health checks and automatic restart |

Alternatives Considered

cgo with Python C API - Rejected due to GIL serialization defeating concurrency and memory safety risks in production.

gRPC Server (Python Sidecar) - Rejected because sidecar deployment adds operational complexity not justified by benefits.

Validation Criteria

Success Criteria:

  1. Callback latency <10ms (target: 1-5ms)
  2. Process startup amortized via pool
  3. Zero memory leaks over 1000 executions
  4. Graceful handling of Python crashes
  5. Support for concurrent LangGraph executions

Document Metadata

| Attribute | Value |
|-----------|-------|
| Document ID | ADR_INDEX |
| Version | 1.0 |
| Created | January 6, 2026 |
| Author | Technical Architecture Team |
| Reviewers | Engineering Leadership |
| Source Archive | /.archive/historical/phase-docs/phase-0-foundation/adrs/ |


This document consolidates architecture decisions from the Phase 0 Foundation period (December 2025) and is maintained as the canonical reference for Fulcrum's technical architecture rationale.