Observability (OpenTelemetry)
Codex ships an opt-in OpenTelemetry pipeline that emits traces and metrics over OTLP, plus an optional browser RUM layer that posts spans from the SPA through a same-origin proxy. Logs continue to flow through the existing tracing-subscriber stdout/file appender, with trace IDs injected on every line for correlation.
The exporter is vendor-neutral. Anything that speaks OTLP works without code changes: SigNoz, Grafana Tempo + Mimir, Honeycomb, Uptrace, the DataDog Agent OTLP receiver, and more.
Observability is disabled by default. Nothing is exported until an operator opts in. This is intentional for a self-hosted product: no telemetry leaves the box without explicit configuration.
Quickstart (Docker dev environment)
The bundled dev compose ships a Jaeger all-in-one sidecar on the dev profile and overrides the Codex config to point at it via env vars. make dev-up brings the whole stack up with observability already on — no YAML edit, no restart.
make dev-up
Jaeger exposes its UI at http://localhost:16686. Hit a few endpoints in the Codex app, then pick codex from the service dropdown in Jaeger. Traces should appear within a few seconds.
The env overrides live in docker-compose.yml under the codex-dev and codex-dev-worker services:
CODEX_OBSERVABILITY_ENABLED: "true"
CODEX_OBSERVABILITY_SERVICE_NAME: codex
CODEX_OBSERVABILITY_OTLP_ENDPOINT: http://jaeger:4317
CODEX_OBSERVABILITY_OTLP_PROTOCOL: grpc
CODEX_OBSERVABILITY_BROWSER_ENABLED: "true" # codex-dev only; enables RUM proxy
config/config.docker.yaml itself ships with the observability: block commented out so a production deployment using the same config doesn't quietly start exporting telemetry — the dev override is intentionally local to the compose file.
Jaeger all-in-one stores spans in memory (lost on restart) and the UI has no auth. It is appropriate for local dev and evaluation. For long-term storage, metrics, or a full APM UI in production, point Codex at a real OTLP backend (SigNoz, Grafana Tempo + Mimir, Honeycomb, Uptrace, etc.) per the backend matrix below.
Quickstart (outside the dev compose)
If you're running Codex outside of docker-compose.yml, any OTLP backend works. The smallest standalone setup is the same Jaeger all-in-one image:
docker run -d --name codex-jaeger \
-e COLLECTOR_OTLP_ENABLED=true \
-p 16686:16686 -p 4317:4317 -p 4318:4318 \
jaegertracing/all-in-one:1.62.0
Then enable observability in your config file with otlp.endpoint: http://localhost:4317.
Configuration
The full schema is documented in the Configuration reference. At a minimum, an enabled deployment needs:
- observability.enabled: true
- observability.otlp.endpoint set to an OTLP collector URL
- observability.otlp.headers populated if your backend requires auth (e.g. signoz-access-token, x-honeycomb-team)
observability:
  enabled: true
  otlp:
    endpoint: https://ingest.eu.signoz.cloud:443
    protocol: grpc
    headers:
      signoz-access-token: "your-token-here"
    timeout_ms: 5000
Choosing a backend
| Backend | Endpoint shape | Protocol | Notes |
|---|---|---|---|
| Self-hosted SigNoz | http://signoz-otel-collector:4317 | grpc | Easiest local setup. Use the bundled compose file below. |
| SigNoz Cloud | https://ingest.<region>.signoz.cloud | grpc | Requires signoz-access-token header. |
| Grafana Tempo (local) | http://tempo:4317 | grpc | Pair with Mimir for metrics. Grafana renders both. |
| Honeycomb | https://api.honeycomb.io | grpc | Requires x-honeycomb-team (and optionally x-honeycomb-dataset). |
| Uptrace | https://otlp.uptrace.dev:4317 | grpc | Requires uptrace-dsn header. |
| DataDog (OTLP receive) | http://datadog-agent:4317 | grpc | Agent must have otlp_config.receiver.protocols.grpc enabled. |
| HTTP-only environments | http://collector:4318 | http/protobuf | Use when load balancers don't terminate gRPC. |
Choosing a protocol
grpc is the default and the right choice in most environments: smaller payloads, persistent connections, lower overhead. Switch to http/protobuf only when something between Codex and the collector (a managed load balancer, a strict egress proxy) blocks gRPC. http/json exists for parity but produces noticeably larger payloads; prefer http/protobuf over it whenever both are an option.
Sampling guidance
Codex uses a parent-based sampler. Practically: if an incoming request already carries a traceparent, the sampled flag in that header is honored; otherwise the configured sample_ratio decides whether to sample at the root.
| Workload | Recommended traces.sample_ratio | Reasoning |
|---|---|---|
| Local development | 1.0 | You want every trace while iterating. |
| Small home server (1–5 active users) | 1.0 | Volume is low; full traces are cheap. |
| Medium deployment (10–50 active users) | 0.25–0.5 | Keep tail latency debuggable without flooding the collector. |
| Large/multi-tenant (100+ active users) | 0.05–0.1 | Pair with backend-side tail sampling if your collector supports it. |
| Diagnosing a specific incident | 1.0 temporarily | Crank up while reproducing, then back off. |
Browser RUM defaults to browser.sample_ratio: 0.1 because a busy SPA can produce many spans per user session. Raise it cautiously: a noisy front end can dwarf backend traffic at the collector.
The Rust SDK samples at the root span. If a downstream service (e.g. a plugin subprocess in a future iteration) makes its own decision, it does so independently. There is no global coordination.
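The root decision described above can be sketched in a few lines. This is an illustration under stated assumptions, not Codex's implementation: the function names are hypothetical, and the ratio check uses the trace-ID-ratio approach common in OpenTelemetry SDKs (comparing the low 64 bits of the trace ID against a bound). Only the W3C traceparent layout is taken as given.

```python
import re

# W3C traceparent, version 00: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parent_sampled(traceparent):
    """Return the parent's sampled flag, or None if the header is absent or invalid."""
    if not traceparent:
        return None
    match = TRACEPARENT_RE.match(traceparent.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None
    return bool(int(flags, 16) & 0x01)


def ratio_sampled(trace_id, sample_ratio):
    """Assumed ratio check: sample when the low 64 bits fall under the bound."""
    bound = round(sample_ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


def should_sample(traceparent, new_trace_id, sample_ratio):
    """Honor a valid parent decision; otherwise fall back to the ratio at the root."""
    inherited = parent_sampled(traceparent)
    if inherited is not None:
        return inherited
    return ratio_sampled(new_trace_id, sample_ratio)
```

Under this model, a request arriving with a sampled traceparent is traced even at sample_ratio 0.0, and an unsampled parent stays unsampled even at 1.0.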
What Codex sends
Trace spans
- HTTP server spans: every request, named by matched route template (e.g. GET /api/v1/series/:id, not the resolved URL). Standard http.* semantic-convention attributes.
- Repository spans: db.<entity>.<op> for hot-path operations on books, series, libraries, users, and plugin records. Carry db.system, db.operation, and the entity ID as an attribute (never in the span name).
- Plugin RPC spans: plugin.<method> around every JSON-RPC call to a plugin subprocess. Internal plugin.rpc.write / plugin.rpc.wait child spans break down the round-trip into stdio write vs. response wait.
- Scanner spans: scanner.scan_library / scanner.analyze_book as root spans for background work.
- Task worker spans: task.execute per claimed task, carrying task.id and task.type.
Metrics
Two flavors land in the OTLP pipeline:
- Counters and histograms — dual-written from the in-process plugin and task metrics services. Histograms (not just averages) let p95/p99 be queried server-side.
- Observable gauges — inventory snapshot (libraries, series, books, users, pages), refreshed every 30s; process CPU/memory; task in-flight count.
Concrete metric names:
| Metric | Type | Attributes |
|---|---|---|
| codex.plugin.requests.total | Counter | plugin_id, method, outcome |
| codex.plugin.duration_ms | Histogram (ms) | plugin_id, method, outcome |
| codex.task.completed.total | Counter | task_type, outcome |
| codex.task.duration_ms | Histogram (ms) | task_type, outcome |
| codex.task.queue_wait_ms | Histogram (ms) | task_type |
| codex.task.in_flight | Observable gauge | (none) |
| codex.inventory.libraries | Observable gauge | (none) |
| codex.inventory.series | Observable gauge | (none) |
| codex.inventory.books | Observable gauge | (none) |
| codex.inventory.users | Observable gauge | (none) |
| codex.inventory.pages | Observable gauge | (none) |
| http.server.request.duration | Histogram (seconds) | http.request.method, http.route, http.response.status_code |
| process.cpu.time | Observable gauge | (none) |
| process.memory.usage | Observable gauge | (none) |
| process.memory.virtual | Observable gauge | (none) |
The existing /api/v1/metrics/plugins dashboard endpoint is unchanged. The in-app store is still authoritative for that view; OTLP is a parallel consumer.
What Codex does not send
- Logs. Stdout / file logging is unchanged. Trace IDs are injected on every line so you can ship logs separately (Vector, Filebeat, Loki, etc.) and correlate by trace ID.
- Resource bodies. Span attributes carry IDs and operation names, not titles, file contents, or query strings.
- User-identifying browser data. The browser SDK emits document-load, fetch, click, and submit spans. There is no session replay, no DOM capture, no PII enrichment.
- Cross-process plugin spans. Plugin RPC spans wrap the manager-side call; traceparent is not propagated into plugin subprocesses in this release. Plugins remain black boxes from a tracing perspective.
Browser RUM
When observability.browser.enabled: true:
- The SPA fetches GET /api/v1/observability/config on startup. If the server flag is on and an OTLP endpoint is configured, the heavyweight OTel browser SDK is dynamically imported. Otherwise the chunk is never downloaded.
- The SDK registers document-load, fetch, user-interaction (click + submit only), and xml-http-request instrumentations.
- Spans are batched in memory (flush every 5s or 512 spans, max queue 2048) and POSTed to /api/v1/observability/otlp/v1/traces.
- Codex forwards the OTLP body verbatim to the configured collector, swapping in the operator-configured otlp.headers. Browser-supplied headers are dropped except for Content-Type.
- On pagehide, the SDK uses navigator.sendBeacon() to flush the final batch so spans survive navigation.
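The batching limits in the steps above (5 s flush interval, 512 spans per batch, queue capped at 2048) can be modeled as follows. The real batching happens inside the OpenTelemetry JS SDK in the browser; this Python sketch only illustrates the policy, and the class name, method names, and drop-when-full behavior are assumptions made for the example.

```python
import time
from collections import deque


class SpanBatcher:
    """Illustrative batching policy: flush on interval or full batch, bounded queue."""

    def __init__(self, export, max_batch=512, max_queue=2048,
                 flush_interval_s=5.0, clock=time.monotonic):
        self._export = export            # callable that receives a list of spans
        self._max_batch = max_batch
        self._max_queue = max_queue
        self._interval = flush_interval_s
        self._clock = clock              # injectable so the policy can be tested
        self._queue = deque()
        self._last_flush = clock()
        self.dropped = 0

    def add(self, span):
        """Queue a span; drop it (assumed behavior) when the queue is full."""
        if len(self._queue) >= self._max_queue:
            self.dropped += 1
            return False
        self._queue.append(span)
        return True

    def due(self):
        """A flush is due when a full batch is waiting or the interval elapsed."""
        if not self._queue:
            return False
        full_batch = len(self._queue) >= self._max_batch
        timed_out = self._clock() - self._last_flush >= self._interval
        return full_batch or timed_out

    def flush(self):
        """Export at most one batch from the front of the queue."""
        count = min(self._max_batch, len(self._queue))
        batch = [self._queue.popleft() for _ in range(count)]
        self._last_flush = self._clock()
        if batch:
            self._export(batch)

    def tick(self):
        """Called periodically by a timer; flushes only when due."""
        if self.due():
            self.flush()

    def drain(self):
        """Flush everything, as a page-hide handler would."""
        while self._queue:
            self.flush()
```

The bounded queue is what keeps a stalled export from growing browser memory without limit: once 2048 spans are waiting, new spans are counted as dropped rather than queued.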
FetchInstrumentation.propagateTraceHeaderCorsUrls is anchored to window.location.origin, so traceparent is injected only on Codex API calls and never leaked to third-party CDNs or external metadata sources.
Why the proxy?
The proxy exists for three reasons:
- No CORS configuration on the collector. The SPA always POSTs to its own origin.
- No collector credentials in the browser. Auth tokens stay on the server.
- Reuses existing session auth. The proxy is FlexibleAuthContext-gated, so the cookie or bearer the SPA already carries authenticates the export. The OTel JS exporter does not need custom auth wiring.
The proxy is a thin pass-through. It does not buffer, batch, transform, or sample. Body size is capped at 4 MiB and per-session rate limits apply.
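The header handling and size cap described above amount to a small, stateless transformation. The sketch below is a hypothetical model of that behavior, not the Codex handler: the function and exception names are invented for illustration, and session authentication and rate limiting are assumed to have happened before this step.

```python
MAX_BODY_BYTES = 4 * 1024 * 1024  # 4 MiB cap stated in the text


class BodyTooLarge(Exception):
    """Raised when the browser payload exceeds the proxy's body cap."""


def build_upstream_request(browser_headers, body, otlp_headers):
    """Return (headers, body) to forward to the collector.

    Only Content-Type survives from the browser request; every other
    browser-supplied header (cookies, bearer tokens, custom headers) is
    dropped, and the operator-configured OTLP headers are added instead.
    """
    if len(body) > MAX_BODY_BYTES:
        raise BodyTooLarge(f"body is {len(body)} bytes, cap is {MAX_BODY_BYTES}")

    upstream = {}
    for name, value in browser_headers.items():
        if name.lower() == "content-type":
            upstream["Content-Type"] = value

    # Collector credentials come from server-side config, never the browser.
    upstream.update(otlp_headers)
    return upstream, body  # body is forwarded verbatim
```

Dropping the browser's Cookie and Authorization headers matters as much as adding the collector's: it keeps Codex session credentials from being forwarded to a third-party backend.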
Trace ID correlation in logs
When observability is enabled, log lines pick up trace context:
2026-05-22T18:02:11.034Z INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 codex::services::plugin::manager: plugin.search_series finished plugin_id=anilist duration_ms=412
Ship the log file to any backend that can index by trace_id and you can pivot from a slow log line to the matching trace in your tracing backend and back.
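If no log backend is in place yet, the same correlation can be done by hand. The following is a minimal sketch that groups log lines by the trace_id field, assuming the key=value layout shown in the sample line; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches the 32-hex-digit trace ID as it appears in the sample log line.
TRACE_ID_RE = re.compile(r"\btrace_id=([0-9a-f]{32})\b")


def group_by_trace(lines):
    """Map each trace ID to the log lines that carry it, in input order."""
    groups = defaultdict(list)
    for line in lines:
        match = TRACE_ID_RE.search(line)
        if match:
            groups[match.group(1)].append(line.rstrip("\n"))
    return dict(groups)
```

Calling group_by_trace(open("codex.log")) (the file name is a placeholder) returns a dictionary whose keys can be pasted directly into a tracing UI's trace lookup.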
Performance impact
Codex's success criteria for this feature are:
- < 2% added request latency when observability is disabled (the default).
- < 5% added request latency when enabled with default sampling.
The disabled-path overhead is effectively zero: the OTel layer is not installed in the tracing-subscriber registry, repository #[instrument] attributes compile to inert spans without a subscriber, and metric instruments resolve to no-op implementations from metrics_stub.rs under --no-default-features. With observability enabled at sample_ratio: 1.0 on a representative endpoint, measured overhead falls inside the 5% budget (see the benchmark in the implementation notes for the methodology).
If you need to validate on your own deployment:
# Baseline (observability disabled)
# The URL is quoted so the shell does not treat "?" as a glob character.
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/api/v1/series?page=1"
# Then enable observability, restart, and re-run with the same args.
# Compare p50/p95/p99 in the ab output.
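To turn the two runs into a pass/fail check against the budget, compare the same percentile from each run. This is a small sketch; the function names are hypothetical and the latencies you feed in are whatever your own ab runs report.

```python
def overhead_percent(baseline_ms, enabled_ms):
    """Relative latency added by observability, as a percentage of baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (enabled_ms - baseline_ms) / baseline_ms * 100.0


def within_budget(baseline_ms, enabled_ms, budget_percent=5.0):
    """True when the added latency stays under the budget (5% when enabled)."""
    return overhead_percent(baseline_ms, enabled_ms) < budget_percent
```

Run the comparison separately for p50, p95, and p99, and repeat each benchmark a few times first: run-to-run noise on a small server can easily exceed a few percent.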
Disabling observability
Three ways, in order of granularity:
- Full off: set observability.enabled: false (the default) and restart. No providers initialize, no telemetry leaves the process.
- Per-signal off: keep observability.enabled: true but set observability.traces.enabled: false or observability.metrics.enabled: false. Useful when one pipeline needs maintenance.
- Sampling to zero: observability.traces.sample_ratio: 0.0 keeps the layer installed (so incoming traceparent is still extracted for logging) but no new traces start at the root. Cheaper than restarting if you need to drop trace volume without redeploying.
Browser RUM has its own switch: observability.browser.enabled: false disables the proxy endpoint and the SPA's config payload reports enabled: false, so the SDK chunk is never downloaded.
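How these switches combine can be summarized as a small decision function. This is a sketch of the rules as described in this document, not Codex's config loader: the function name is hypothetical, and the defaults used for missing keys (per-signal switches on, browser RUM off) are assumptions.

```python
def active_signals(config):
    """Report which pipelines would export, given a parsed config mapping."""
    obs = config.get("observability", {})
    endpoint = obs.get("otlp", {}).get("endpoint", "")

    # Master switch off, or enabled with an empty endpoint: nothing is installed.
    if not obs.get("enabled", False) or not endpoint:
        return {"traces": False, "metrics": False, "browser": False}

    return {
        # Assumed default: per-signal switches are on unless set to false.
        "traces": obs.get("traces", {}).get("enabled", True),
        "metrics": obs.get("metrics", {}).get("enabled", True),
        # Assumed default: browser RUM is off unless explicitly enabled.
        "browser": obs.get("browser", {}).get("enabled", False),
    }
```

For example, a config with enabled: true, a non-empty endpoint, and metrics.enabled: false yields traces only; the same config with an empty endpoint yields nothing at all.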
Troubleshooting
Traces don't appear in the backend.
- Check the Codex logs for otel_status_code=ERROR lines or failed to export warnings.
- Confirm observability.enabled is true and observability.otlp.endpoint is non-empty. An enabled config with an empty endpoint is treated as a misconfiguration and the OTel layer is not installed.
- For gRPC endpoints, the URL scheme matters: http://host:4317 for cleartext, https://host:4317 for TLS.
- For HTTP/protobuf endpoints, the SDK appends /v1/traces and /v1/metrics to the base URL. Configure http://collector:4318, not http://collector:4318/v1/traces.
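The HTTP/protobuf path rule is a common source of doubled paths such as /v1/traces/v1/traces. The sketch below shows the derivation as described above and flags a base URL that already includes a signal path; the function name is hypothetical, and rejecting such a base is a choice made for this example rather than documented Codex behavior.

```python
SIGNAL_PATHS = {"traces": "/v1/traces", "metrics": "/v1/metrics"}


def signal_url(base_endpoint, signal):
    """Derive the per-signal export URL from the configured base endpoint."""
    base = base_endpoint.rstrip("/")
    for path in SIGNAL_PATHS.values():
        if base.endswith(path):
            # Appending again would yield a doubled path the collector rejects.
            raise ValueError(
                f"endpoint {base_endpoint!r} already ends in {path}; "
                "configure the base URL only"
            )
    return base + SIGNAL_PATHS[signal]
```

With the recommended base, signal_url("http://collector:4318", "metrics") yields http://collector:4318/v1/metrics, while passing http://collector:4318/v1/traces raises an error instead of silently producing an unusable URL.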
Metrics arrive but with the wrong tenant / project / dataset.
- Headers configured under observability.otlp.headers apply to both traces and metrics exports. Most multi-tenant backends use a single header (e.g. x-honeycomb-team); for backends that route by dataset, set the dataset header at the OTLP level too.
Browser traces don't show up.
- Confirm GET /api/v1/observability/config returns enabled: true in the response body. If it returns enabled: false while you have browser.enabled: true in YAML, the OTLP endpoint is probably empty.
- Open the network panel. Successful proxy POSTs to /api/v1/observability/otlp/v1/traces return 204 No Content. A 503 means the proxy is disabled.
- The tracer-*.js chunk is loaded asynchronously. If it never appears in the network panel, the bootstrap probe failed or the chunk was blocked by an extension.
cargo build --no-default-features after enabling observability.
- The observability feature is in default = ["rar", "observability"]. --no-default-features compiles against the stub module: all instrumentation calls become no-ops and the OTel crates are not linked. There is no runtime config change required.
Reference
- Configuration reference: full schema and environment variable list
- docker-compose.yml: bundled Jaeger sidecar lives on the dev profile
- OpenTelemetry Rust: SDK source
- OpenTelemetry JS browser SDK: browser SDK source
- W3C Trace Context: the propagation format used end-to-end