Architecture Decision Records (ADR) Index
Fulcrum Cognitive AI Governance Platform
Version: 1.0 | Last Updated: January 6, 2026
Table of Contents
- What is an ADR?
- ADR Template
- ADR Index
- Decision Records
- ADR-001: Go as Backend Language
- ADR-002: PostgreSQL as Primary Database
- ADR-003: NATS JetStream for Event Streaming
- ADR-004: Adapter Abstraction Pattern
- ADR-005: OpenTelemetry for Observability
- ADR-006: Infrastructure as Code (Terraform + Helm)
- ADR-007: Python-Go Bridge Architecture
What is an ADR?
An Architecture Decision Record (ADR) is a document that captures an important architectural decision made along with its context and consequences. ADRs serve as a historical record of why certain technical choices were made, enabling future team members to understand the rationale behind the current system design.
Purpose
- Preserve Context: Capture the reasoning behind decisions before institutional knowledge is lost
- Enable Review: Allow stakeholders to review and discuss significant changes
- Support Onboarding: Help new team members understand why things are the way they are
- Prevent Repetition: Avoid relitigating the same decisions repeatedly
- Guide Evolution: Provide a foundation for future architectural changes
When to Write an ADR
Create an ADR when making decisions that:
- Affect the overall system architecture
- Are difficult or costly to reverse
- Involve significant trade-offs
- Span multiple components or services
- Impact security, performance, or compliance
- Establish patterns that others will follow
ADR Template
Use this template when creating new Architecture Decision Records:
# ADR-000: [Title]
**Status:** Proposed | Accepted | Deprecated | Superseded
**Date:** YYYY-MM-DD
**Decision Makers:** [Names/Roles]
**Supersedes:** [ADR-000 if applicable]
**Superseded by:** [ADR-000 if applicable]
---
## Context
[Describe the issue, the forces at play, and why a decision is needed.
Include relevant constraints, requirements, and stakeholder concerns.]
### Requirements
| Requirement | Priority | Notes |
|-------------|----------|-------|
| [Requirement 1] | Critical/High/Medium/Low | [Details] |
### Constraints
- [Constraint 1]
- [Constraint 2]
---
## Decision
[State the decision clearly and concisely. Include the chosen approach
and key implementation details.]
---
## Consequences
### Positive
- [Benefit 1]
- [Benefit 2]
### Negative
- [Drawback 1]
- [Drawback 2]
### Mitigations
| Challenge | Mitigation |
|-----------|------------|
| [Challenge 1] | [How we'll address it] |
---
## Alternatives Considered
### Option A: [Alternative Name]
**Pros:** [List advantages]
**Cons:** [List disadvantages]
**Rejection Rationale:** [Why this was not chosen]
---
## Validation Criteria
This decision should be revisited if:
1. [Condition 1]
2. [Condition 2]
**Review Date:** [When to re-evaluate]
---
## References
- [Link to relevant documentation]
- [Link to research/benchmarks]
---
*Approved by:* [Name]
*Effective from:* [Date]
ADR Index
| ADR | Title | Status | Date | Summary |
|---|---|---|---|---|
| 001 | Go as Backend Language | Accepted | Dec 10, 2025 | Go selected for backend services due to performance, concurrency, and single-binary deployment |
| 002 | PostgreSQL as Primary Database | Accepted | Dec 10, 2025 | PostgreSQL 16 with TimescaleDB for relational data and time-series metrics |
| 003 | NATS JetStream for Event Streaming | Accepted | Dec 10, 2025 | NATS chosen over Kafka for simpler operations with built-in persistence |
| 004 | Adapter Abstraction Pattern | Accepted | Dec 10, 2025 | Framework-agnostic governance via Execution Envelope and adapter interfaces |
| 005 | OpenTelemetry for Observability | Accepted | Dec 10, 2025 | OTEL standard adopted for vendor-neutral distributed tracing |
| 006 | Infrastructure as Code | Accepted | Dec 10, 2025 | Terraform + Helm + GitHub Actions for reproducible deployments |
| 007 | Python-Go Bridge Architecture | Proposed | Dec 15, 2025 | Subprocess with Protocol Buffers for LangGraph integration |
Decision Records
ADR-001: Go as Backend Language
Status: Accepted
Date: December 10, 2025
Decision Makers: Tony Diefenbach, Claude (Opus 4.5)
Context
Fulcrum requires a backend language for implementing core control plane services:
- Policy enforcement engine
- Cost governance service
- Agent task scheduler
- Multi-framework adapter layer
Key Requirements:
| Requirement | Priority | Rationale |
|---|---|---|
| <10ms policy latency | Critical | Every agent action routes through policy enforcement |
| High concurrency | Critical | Agent orchestration demands thousands of concurrent operations |
| Single binary deployment | High | Minimize runtime dependencies for Kubernetes |
| Strong typing | High | API contract enforcement for SDK generation |
Constraints:
- 12-18 month competitive window before Microsoft Agent Framework GA
- Initially solo developer, scaling to 2-3 during build
- Target deployment: Kubernetes (GKE/EKS)
Decision
Primary Backend: Go (Golang 1.24+)
Fulcrum's core control plane services are implemented in Go.
Secondary: TypeScript
SDKs, CLI tooling, and developer-facing components use TypeScript.
Consequences
Positive:
- Single binary deployments - No runtime dependencies, minimal container images (~20MB)
- Native concurrency - Goroutines handle thousands of concurrent agent connections trivially
- Performance predictability - Fast, tunable garbage collector with <1ms pauses
- Operations excellence - Prometheus/OpenTelemetry native support
- Cloud ecosystem alignment - Kubernetes, Docker, NATS all Go-native
- Hiring pool - Go developers understand concurrent systems design
Negative:
- Initial velocity - TypeScript would be faster for the first 2-3 weeks
- Frontend integration - Requires TypeScript SDKs for web integrations
- Ecosystem gaps - Fewer AI/ML libraries compared to Python
Mitigations:
| Challenge | Mitigation |
|---|---|
| Initial velocity | Use code generation from OpenAPI specs |
| Frontend integration | TypeScript SDK as separate package |
| ML libraries | Control plane doesn't run ML; adapters call external services |
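The concurrency claim above can be sketched in a few lines. The `policyCheck` function and its result type are illustrative stand-ins, not Fulcrum's actual policy API; the point is that spawning one goroutine per agent action is cheap enough to do at scale.

```go
package main

import (
	"fmt"
	"sync"
)

// policyCheck stands in for a per-action policy evaluation; in Fulcrum this
// would call the real policy engine. The name and signature are hypothetical.
func policyCheck(action string) bool {
	return action != "forbidden"
}

func main() {
	actions := []string{"read", "write", "forbidden", "deploy"}

	var wg sync.WaitGroup
	results := make([]bool, len(actions))

	// One goroutine per action: goroutine scheduling cost is tiny compared
	// to OS threads, which is what makes thousands of concurrent agent
	// connections practical.
	for i, a := range actions {
		wg.Add(1)
		go func(i int, a string) {
			defer wg.Done()
			results[i] = policyCheck(a) // each index written by exactly one goroutine
		}(i, a)
	}
	wg.Wait()

	fmt.Println(results) // [true true false true]
}
```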
Alternatives Considered
Rust - Rejected due to 2-3x longer development time and slower iteration during prototyping. Go provides sufficient performance for control plane operations.
TypeScript (Node.js) - Rejected because event loop limits true parallelism. Policy enforcement cannot afford GC pauses or event loop stalls.
Python - Rejected due to GIL limitations and inability to guarantee consistent sub-10ms latency.
Validation Criteria
Revisit this decision if:
- Policy engine latency exceeds 10ms p99 under load
- Development velocity drops below 1 major feature/week
- Hiring proves significantly harder than expected
ADR-002: PostgreSQL as Primary Database
Status: Accepted
Date: December 10, 2025
Decision Makers: Tony Diefenbach, Claude (Opus 4.5)
Context
Fulcrum requires persistent storage for:
- Agent state and checkpoints (high-frequency writes)
- Policy definitions and audit logs (compliance-critical)
- Cost tracking events (time-series data)
- Multi-tenant organization data (relational)
Requirements Matrix:
| Requirement | Priority | Notes |
|---|---|---|
| Multi-tenancy isolation | Critical | Organizations must not see each other's data |
| Checkpoint durability | Critical | Agent state recovery depends on this |
| Audit compliance | Critical | SOC2, GDPR requirements |
| Query flexibility | High | Complex policy queries, cost analytics |
| Time-series performance | High | 10K+ events/second ingestion |
Scale Targets:
- Phase 1-2: 10 tenants, 100 concurrent agents
- Phase 3-4: 100 tenants, 1K concurrent agents
- Phase 5-6: 1K tenants, 10K concurrent agents
Decision
Primary Database: PostgreSQL 16
All relational data, checkpoints, and audit logs stored in PostgreSQL.
Caching Layer: Redis 7
Session state, rate limiting counters, and hot policy data cached in Redis.
Time-series: TimescaleDB Extension
Cost events and metrics stored in TimescaleDB hypertables with continuous aggregates.
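A hypertable with a continuous aggregate can be sketched as follows. The table and column names here are illustrative, not Fulcrum's actual schema; `create_hypertable`, `time_bucket`, and `timescaledb.continuous` are standard TimescaleDB features.

```sql
-- Hypothetical cost-events table, partitioned by time via TimescaleDB
CREATE TABLE cost_events (
    time      TIMESTAMPTZ    NOT NULL,
    tenant_id UUID           NOT NULL,
    model     TEXT           NOT NULL,
    tokens    BIGINT         NOT NULL,
    cost_usd  NUMERIC(12, 6) NOT NULL
);

SELECT create_hypertable('cost_events', 'time');

-- Continuous aggregate: hourly cost per tenant, refreshed incrementally
CREATE MATERIALIZED VIEW cost_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       tenant_id,
       sum(cost_usd) AS cost_usd
FROM cost_events
GROUP BY bucket, tenant_id;
```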
Multi-Tenancy Strategy: Row-Level Security (RLS)
-- Tenant isolation via RLS
ALTER TABLE envelopes ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON envelopes
USING (tenant_id = current_setting('fulcrum.current_tenant')::uuid);
Consequences
Positive:
- Single database operations - One backup strategy, one monitoring setup
- ACID guarantees - Checkpoint consistency critical for agent recovery
- SQL flexibility - Complex policy queries, audit searches
- TimescaleDB compatibility - Native PostgreSQL, no separate system
- Ecosystem maturity - pgx driver, excellent Go support
- Cost predictability - Self-hosted PostgreSQL avoids serverless billing surprises
Negative:
- Horizontal scaling - Manual sharding if we exceed single-node capacity
- Global distribution - Requires additional architecture for multi-region
- Operations burden - Must manage backups, failover, upgrades
- Cold start - Redis cache warming needed after deployments
Mitigations:
| Challenge | Mitigation |
|---|---|
| Horizontal scaling | Partition by tenant_id; evaluate Citus at scale |
| Global distribution | Phase 6 concern; design schemas for future sharding |
| Operations | Use managed PostgreSQL (Cloud SQL/RDS) in production |
| Cold start | Implement lazy cache population, graceful degradation |
Alternatives Considered
Neon + Upstash (Serverless) - Rejected due to unpredictable costs at scale and connection pooling complexity.
Supabase - Rejected because TimescaleDB not available and platform coupling limits flexibility.
CockroachDB - Rejected as premature optimization; start with PostgreSQL, migrate only if multi-region becomes critical.
MongoDB + InfluxDB - Rejected because checkpoint recovery requires ACID guarantees and running two databases doubles operational complexity.
Validation Criteria
Revisit this decision if:
- Checkpoint write latency exceeds 50ms p99
- Cost event ingestion falls behind (>1 minute lag)
- PostgreSQL CPU consistently >80% at Phase 3 scale
- Multi-region deployment becomes critical
ADR-003: NATS JetStream for Event Streaming
Status: Accepted
Date: December 10, 2025
Decision Makers: Tony Diefenbach, Claude (Opus 4.5)
Context
Fulcrum requires a message queue for:
- Agent task distribution across worker nodes
- Async event processing (cost events, audit logs)
- Cross-service communication (policy updates, circuit breaker signals)
- Real-time streaming to clients (SSE/WebSocket backing)
Requirements:
| Requirement | Priority | Notes |
|---|---|---|
| At-least-once delivery | Critical | Tasks must not be lost |
| Low latency | Critical | <10ms for task dispatch |
| Backpressure handling | Critical | Prevent cascade failures |
| Ordering guarantees | High | Per-agent ordering required |
| Horizontal scaling | High | Must scale with agent count |
| Operations simplicity | High | Bootstrap budget constraints |
Decision
Message Queue: NATS with JetStream
NATS provides the messaging backbone with JetStream for persistence.
Subject Hierarchy:
tasks.{tenant}.{agent}.dispatch # New task assignments
tasks.{tenant}.{agent}.complete # Task completion signals
tasks.{tenant}.{agent}.cancel # Cancellation requests
events.cost.{tenant} # Cost tracking events
events.audit.{tenant} # Audit log events
events.policy.{tenant} # Policy change notifications
signals.circuit.{tenant}.{agent} # Circuit breaker state
signals.ratelimit.{tenant} # Rate limit exhaustion
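Subjects in this hierarchy can be built with trivial helpers, which keeps the naming scheme in one place. The helper names and signatures below are illustrative, not Fulcrum's actual API.

```go
package main

import "fmt"

// TaskSubject builds a NATS subject in the task hierarchy above.
// Helper names and signatures are hypothetical.
func TaskSubject(tenant, agent, verb string) string {
	return fmt.Sprintf("tasks.%s.%s.%s", tenant, agent, verb)
}

// CostSubject builds a per-tenant cost-event subject.
func CostSubject(tenant string) string {
	return fmt.Sprintf("events.cost.%s", tenant)
}

func main() {
	fmt.Println(TaskSubject("acme", "agent-7", "dispatch")) // tasks.acme.agent-7.dispatch
	fmt.Println(CostSubject("acme"))                        // events.cost.acme
}
```

A consumer can then use NATS wildcards naturally, e.g. subscribing to `tasks.acme.*.dispatch` to receive dispatches for every agent in one tenant.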
Cluster Topology:
+-------------------------------------+
| NATS Cluster (3 nodes) |
| +----------+----------+----------+ |
| | nats-0 | nats-1 | nats-2 | |
| | (leader) |(follower)|(follower)| |
| +----------+----------+----------+ |
| |
| JetStream Replicas: 3 |
| Consensus: Raft |
+-------------------------------------+
Consequences
Positive:
- Embedded deployment - NATS can run in-process for development
- Sub-millisecond latency - Fastest option for agent task dispatch
- Native Go client - First-class support, no FFI
- Simple operations - Single binary, minimal configuration
- Built-in clustering - No external coordination (ZooKeeper, etc.)
- JetStream persistence - Durable streams without separate system
Negative:
- Community size - Smaller than Kafka/RabbitMQ
- Complex routing - Less sophisticated than RabbitMQ exchanges
- Enterprise support - Fewer managed service options
- Exactly-once semantics - Requires application-level deduplication
Mitigations:
| Challenge | Mitigation |
|---|---|
| Community size | NATS is a CNCF project, growing rapidly |
| Complex routing | Subject-based routing sufficient for Fulcrum |
| Enterprise support | Synadia offers commercial support |
| Exactly-once | Implement idempotency keys in task handlers |
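The idempotency-key mitigation can be sketched as a small deduplicator that task handlers consult before doing work, so JetStream's at-least-once redeliveries are processed at most once. This in-memory version is for illustration only; a production handler would persist keys (e.g. in Redis or PostgreSQL) with a TTL.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper tracks idempotency keys so redelivered messages are processed
// at most once at the application level. In-memory sketch; not the
// production implementation.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// FirstDelivery reports whether key is new, recording it as seen.
func (d *Deduper) FirstDelivery(key string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[key]; ok {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	fmt.Println(d.FirstDelivery("task-123")) // true: first delivery, process it
	fmt.Println(d.FirstDelivery("task-123")) // false: redelivery, skip
}
```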
Alternatives Considered
RabbitMQ - Rejected because its routing sophistication is unnecessary for Fulcrum, and its 1-5ms latency is material on the policy-enforcement hot path where NATS delivers sub-millisecond dispatch.
Redis Streams - Rejected due to limited consumer group semantics and manual backpressure handling requirements.
Apache Kafka - Rejected due to significant operational overhead and overkill for Phase 1-3 scale.
Amazon SQS + SNS - Rejected due to vendor lock-in and latency (10-50ms) too high for policy enforcement.
Validation Criteria
Revisit this decision if:
- Message delivery latency exceeds 10ms p99
- Consumer lag grows unbounded under load
- Cluster consensus issues cause availability problems
ADR-004: Adapter Abstraction Pattern
Status: Accepted
Date: December 10, 2025
Decision Makers: Tony Diefenbach, Claude (Opus 4.5)
Supersedes: Original Phase 4 scope (Multi-Framework Adapters)
Context
Research conducted in Phase 0 revealed a critical market shift:
Microsoft Agent Framework is closer to GA than anticipated. Their filter/middleware pattern for guardrails is production-ready. LangGraph's orchestration primitives are mature. CrewAI and AutoGen have established ecosystems.
Strategic Implication: Fulcrum's value is NOT in orchestration mechanics--it's the governance layer that wraps existing orchestrators.
This ADR defines the adapter abstraction that enables Fulcrum to:
- Wrap any orchestration framework without coupling to its internals
- Enforce cost governance and policy at execution boundaries
- Provide unified observability across heterogeneous agent deployments
- Support A2A/MCP protocols as first-class integration patterns
Decision
Core Abstraction: The Execution Envelope
Fulcrum introduces the concept of an Execution Envelope--a framework-agnostic wrapper that captures:
+-----------------------------------------------------------+
| EXECUTION ENVELOPE |
+-----------------------------------------------------------+
| metadata: |
| tenant_id, workflow_id, execution_id, budget_id |
| |
| lifecycle: |
| PENDING -> AUTHORIZED -> RUNNING -> COMPLETED/FAILED |
| |
| governance_context: |
| token_budget, cost_limit, policy_set, timeout |
| |
| framework_context: |
| adapter_type, native_execution_ref, checkpoint_id |
| |
| events[]: |
| { timestamp, event_type, payload, token_delta } |
+-----------------------------------------------------------+
Every agent execution--regardless of framework--is wrapped in an Envelope. The Envelope is Fulcrum's unit of governance.
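The lifecycle shown in the diagram can be sketched as a tiny transition table that the control plane consults before moving an Envelope to a new state. The table and `CanTransition` helper below are a sketch, not Fulcrum's actual implementation.

```go
package main

import "fmt"

// State mirrors the Envelope lifecycle in the diagram above.
type State string

const (
	Pending    State = "PENDING"
	Authorized State = "AUTHORIZED"
	Running    State = "RUNNING"
	Completed  State = "COMPLETED"
	Failed     State = "FAILED"
)

// transitions encodes the legal lifecycle edges. Terminal states
// (COMPLETED, FAILED) have no outgoing edges.
var transitions = map[State][]State{
	Pending:    {Authorized, Failed},
	Authorized: {Running, Failed},
	Running:    {Completed, Failed},
}

// CanTransition reports whether moving from one state to another is legal.
func CanTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Pending, Authorized)) // true
	fmt.Println(CanTransition(Pending, Running))    // false: must authorize first
}
```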
Adapter Interface:
type FrameworkAdapter interface {
    // Lifecycle
    WrapExecution(ctx context.Context, envelope *Envelope, native any) (*WrappedExecution, error)
    StartExecution(ctx context.Context, exec *WrappedExecution) error
    PauseExecution(ctx context.Context, exec *WrappedExecution) error
    ResumeExecution(ctx context.Context, exec *WrappedExecution) error
    TerminateExecution(ctx context.Context, exec *WrappedExecution, reason string) error

    // Event Capture
    RegisterEventHook(hook EventHook) error
    CaptureCheckpoint(ctx context.Context, exec *WrappedExecution) (*Checkpoint, error)

    // Cost Estimation
    EstimateTokens(ctx context.Context, input any) (TokenEstimate, error)

    // Framework Metadata
    FrameworkType() FrameworkType
    Capabilities() []AdapterCapability
}
Adapter Capabilities Matrix:
| Capability | Description | LangGraph | Microsoft | CrewAI | A2A Proxy |
|---|---|---|---|---|---|
| INTERCEPT_PRE | Pre-execution policy check | Full | Full | Full | Full |
| INTERCEPT_MID | Mid-execution budget enforcement | Full | Full | Partial | None |
| INTERCEPT_POST | Post-execution audit | Full | Full | Full | Full |
| CHECKPOINT | State snapshot capture | Full | Full | None | None |
| TERMINATE | Force stop execution | Full | Full | Full | Partial |
| TOKEN_STREAM | Real-time token counting | Full | Full | Partial | None |
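The matrix above implies the control plane must degrade gracefully when an adapter lacks a capability: for instance, without INTERCEPT_MID, budget enforcement falls back to pre/post checks only. A minimal capability check might look like the sketch below; the constant values and helper are assumptions, not Fulcrum's production code.

```go
package main

import "fmt"

// AdapterCapability mirrors the capabilities matrix above; values are
// illustrative.
type AdapterCapability string

const (
	InterceptMid AdapterCapability = "INTERCEPT_MID"
	CheckpointCap AdapterCapability = "CHECKPOINT"
)

// HasCapability reports whether an adapter advertises a capability, so the
// control plane can select an enforcement strategy. Sketch only.
func HasCapability(caps []AdapterCapability, want AdapterCapability) bool {
	for _, c := range caps {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical adapter that supports checkpoints but not mid-execution
	// interception.
	adapterCaps := []AdapterCapability{CheckpointCap}

	if !HasCapability(adapterCaps, InterceptMid) {
		fmt.Println("mid-execution enforcement unavailable; using pre/post checks only")
	}
}
```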
Integration Patterns:
- Embedded Middleware (Highest Control) - SDK wraps framework client, emits events to control plane
- Sidecar Observer (Lowest Friction) - Log/event stream ingestion for observability-only
- Protocol Proxy (A2A/MCP Integration) - A2A/MCP-compliant proxy with policy injection
Consequences
Positive:
- Framework Independence - New orchestrators require only adapter implementation, not core changes
- Governance Consistency - Same policies apply regardless of underlying framework
- Incremental Adoption - Start with Sidecar (observability), upgrade to Embedded (full governance)
- A2A/MCP Ready - Protocol proxy pattern enables cross-organization agent governance
Negative:
- Adapter Maintenance Burden - Each framework needs dedicated adapter development
- Capability Gaps - Some frameworks won't support all governance features
- Latency Overhead - Middleware pattern adds ~5-15ms per policy check
- Abstraction Leakage - Framework-specific behaviors may require special handling
Risks and Mitigations:
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Framework API changes break adapters | Medium | High | Version-pinned adapters, compatibility matrix |
| Event schema too rigid | Low | Medium | Extensible payload field, schema versioning |
| Performance overhead unacceptable | Low | High | Async event emission, local policy caching |
| New frameworks emerge | High | Low | Clean adapter interface enables rapid development |
Alternatives Considered
Build Our Own Orchestrator - Rejected because Microsoft and LangChain are 12-18 months ahead on orchestration mechanics.
Framework-Specific Governance Plugins - Rejected because it would fragment the codebase and create a maintenance nightmare.
Observability-Only (No Active Governance) - Rejected because observability is autopsy; governance is prevention.
ADR-005: OpenTelemetry for Observability
Status: Accepted
Date: December 10, 2025
Decision Makers: Tony Diefenbach, Claude (Opus 4.5)
Context
Fulcrum's value proposition is governance and cost control. Observability is not optional--it's core to the product. Requirements:
- Distributed tracing - Track requests across services
- Metrics - Cost aggregates, latency percentiles, error rates
- Logging - Structured logs for debugging and audit
- Alerting - Budget thresholds, policy violations, system health
Constraints:
- Bootstrap budget (cost-sensitive)
- Must support eventual SaaS deployment
- Vendor-neutral for portability
Decision
Stack: OpenTelemetry + Prometheus + Grafana + Loki
+-------------+ +-------------+ +-------------+
| Services |---->| OTEL |---->| Prometheus |
| | | Collector | | (metrics) |
+-------------+ | | +------+------+
| | |
| |----> Loki |
| | (logs) |
| | v
| | +-------------+
| |---->| Grafana |
+-------------+ | (dashboards)|
+-------------+
Key Metrics:
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| fulcrum_envelope_created_total | Counter | tenant_id, adapter_type | Execution volume |
| fulcrum_envelope_duration_seconds | Histogram | tenant_id, state | Execution time |
| fulcrum_tokens_total | Counter | tenant_id, model | Token consumption |
| fulcrum_cost_usd_total | Counter | tenant_id, model | Cost tracking |
| fulcrum_policy_evaluations_total | Counter | tenant_id, result | Policy usage |
| fulcrum_budget_usage_ratio | Gauge | tenant_id, budget_id | Budget health |
Instrumentation Pattern:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func (s *EnvelopeService) CreateEnvelope(ctx context.Context, req *CreateEnvelopeRequest) (*Envelope, error) {
    ctx, span := otel.Tracer("envelope-service").Start(ctx, "CreateEnvelope")
    defer span.End()

    span.SetAttributes(
        attribute.String("tenant_id", req.TenantID),
        attribute.String("adapter_type", req.AdapterType),
    )

    // ... implementation
}
Consequences
Positive:
- Vendor-neutral - Can swap backends later (DataDog, Honeycomb, etc.)
- Native Go SDK - First-class OpenTelemetry Go support
- Unified API - Traces, metrics, and logs in one interface
- Industry standard - Growing ecosystem and tooling
- Self-hosted option - Prometheus + Grafana for predictable costs
Negative:
- Operational burden - Self-hosted requires management
- No managed alerting - Need Alertmanager or external integration
- Storage costs - Retention of metrics and traces
Migration Path:
If Phase 2+ requires managed observability:
- OpenTelemetry exports to any backend
- Grafana Cloud for managed option
- DataDog OTEL ingestion available
Validation Criteria
Revisit this decision if:
- Self-hosted operational burden becomes unsustainable
- Grafana Cloud or DataDog proves cost-effective at scale
- Multi-region requirements necessitate global observability platform
ADR-006: Infrastructure as Code
Status: Accepted
Date: December 10, 2025
Decision Makers: Tony Diefenbach, Claude (Opus 4.5)
Context
Fulcrum needs reproducible infrastructure for:
- Local development (Docker Compose)
- CI/CD preview environments
- Production deployment (Kubernetes)
Requirements:
| Requirement | Priority | Notes |
|---|---|---|
| Multi-cloud capability | High | AWS/GCP/Azure |
| Kubernetes-native | High | Target deployment platform |
| Team-friendly | Medium | Readable, auditable |
| GitOps compatible | Medium | PR-based infrastructure changes |
Decision
Stack: Terraform + Helm + GitHub Actions
+-------------------------------------------------------+
| GitOps Flow |
+-------------------------------------------------------+
| |
| +----------+ +----------+ +----------------+ |
| | Git |--->| GitHub |--->| Terraform | |
| | Commit | | Actions | | Apply | |
| +----------+ +----------+ +--------+-------+ |
| | |
| v |
| +------------------+ |
| | Kubernetes | |
| | (via Helm) | |
| +------------------+ |
| |
+-------------------------------------------------------+
Repository Structure:
/infra
+-- terraform/
| +-- modules/
| | +-- networking/ # VPC, subnets, security groups
| | +-- kubernetes/ # EKS/GKE cluster
| | +-- database/ # RDS PostgreSQL
| | +-- redis/ # ElastiCache
| | +-- nats/ # NATS cluster config
| +-- environments/
| +-- dev/
| +-- staging/
| +-- prod/
+-- helm/
| +-- fulcrum/
| +-- Chart.yaml
| +-- values.yaml
| +-- values-dev.yaml
| +-- values-prod.yaml
| +-- templates/
+-- docker/
+-- docker-compose.yaml # Local development
Consequences
Positive:
- Standard tooling - Easy onboarding, extensive documentation
- Multi-cloud ready - AWS to GCP migration possible
- GitOps-friendly - PR reviews for infrastructure changes
- Reproducible environments - Consistent dev/staging/prod
Negative:
- Terraform state management - Requires backend configuration
- HCL verbosity - Complex logic is verbose
- Helm templating - Can be brittle with complex values
Phase 1 Simplification:
- Single AWS region
- Managed Kubernetes (EKS)
- RDS PostgreSQL (not Aurora)
- ElastiCache Redis (not self-hosted)
Validation Criteria
Revisit this decision if:
- Pulumi's TypeScript-native approach proves more productive
- Crossplane becomes mature enough for Kubernetes-native IaC
- Multi-cloud becomes Day 1 requirement
ADR-007: Python-Go Bridge Architecture
Status: Proposed
Date: December 15, 2025
Decision Makers: Technical Team
Related: ADR-001 (Go as Primary Language)
Context
Fulcrum's LangGraph adapter requires integration with Python-based LangGraph StateGraph execution. The adapter needs to support:
- Real StateGraph Execution - Create and execute LangGraph graphs in Python runtime
- Bidirectional Callback System - Forward Python callbacks to Go handlers
- Real-time Token Tracking - Extract actual token counts from LLM responses
- Budget Enforcement - Support mid-execution termination when budget exceeded
- Checkpoint Capture - Serialize LangGraph state for restore/recovery
- Low Latency - Target <10ms message latency for event forwarding
Decision Drivers:
| Driver | Weight | Notes |
|---|---|---|
| Performance | High | <10ms latency requirement for callbacks |
| Reliability | High | Process isolation, crash handling |
| Complexity | Medium | Development time, debugging |
| Safety | High | Memory management, resource leaks |
| Testability | Medium | Unit and integration testing |
Decision
Selected: Subprocess with Protocol Buffers (stdin/stdout)
+------------------+ +-------------------+
| Go Process | | Python Process |
| (Fulcrum) | | (LangGraph) |
| | | |
| Adapter -------+---- stdin/stdout --+---- Bridge |
| | (protobuf) | |
| Callbacks <----+--------------------+---- Callbacks |
+------------------+ +-------------------+
Rationale:
- Meets Performance Requirements - Callback latency of 1-5ms well under <10ms target
- Superior Safety - Python crashes isolated from Fulcrum core
- Development Velocity - Simplest implementation, well-understood patterns
- Operational Simplicity - No sidecar deployment complexity
- Flexibility - Can run different Python versions per execution
Performance Profile:
| Metric | Value |
|---|---|
| Startup Latency | ~50-100ms (amortized via process pool) |
| Callback Latency | ~1-5ms per event |
| Throughput | ~200-500 messages/sec |
| Memory | Process isolation prevents sharing |
Process Pool Configuration:
type ProcessPool struct {
    maxSize     int           // Maximum concurrent Python processes
    idleTimeout time.Duration // How long to keep idle processes
    processes   chan *Process // Pool of ready processes
}
Consequences
Positive:
- Process isolation - Python crash doesn't crash Go process
- Simple protocol - stdin/stdout universally supported
- Easy testing - Can mock subprocess with test harness
- Resource tracking - OS-level process metrics (CPU, memory)
- Version flexibility - Different Python versions per execution
Negative:
- Startup overhead - ~50-100ms to spawn Python process
- Message serialization - Protobuf encode/decode on every message
- No shared memory - Cannot pass large data efficiently
- Process management - Need to handle zombie processes, cleanup
Mitigations:
| Challenge | Mitigation |
|---|---|
| Startup overhead | Use process pool to amortize startup cost |
| Serialization | Use protobuf ArenaAllocator for efficiency |
| Process management | Health checks and automatic restart |
Alternatives Considered
cgo with Python C API - Rejected due to GIL serialization defeating concurrency and memory safety risks in production.
gRPC Server (Python Sidecar) - Rejected because sidecar deployment adds operational complexity not justified by benefits.
Validation Criteria
Success Criteria:
- Callback latency <10ms (target: 1-5ms)
- Process startup amortized via pool
- Zero memory leaks over 1000 executions
- Graceful handling of Python crashes
- Support for concurrent LangGraph executions
Document Metadata
| Attribute | Value |
|---|---|
| Document ID | ADR_INDEX |
| Version | 1.0 |
| Created | January 6, 2026 |
| Author | Technical Architecture Team |
| Reviewers | Engineering Leadership |
| Source Archive | /.archive/historical/phase-docs/phase-0-foundation/adrs/ |
Related Documentation
This document consolidates architecture decisions from the Phase 0 Foundation period (December 2025) and is maintained as the canonical reference for Fulcrum's technical architecture rationale.