Conversation
These changes add support & associated config for exporting logs with OTEL via OTLP over gRPC
Simply emits any OTEL data it receives.
These changes add support & associated config for exporting traces with OTEL via OTLP over gRPC
These changes add support & associated config for exporting metrics with OTEL via OTLP over gRPC
OTEL counters use add(), which increments a value, unlike Prometheus counters, which support absolute(). Track previous values for cumulative metrics (CPU time, disk I/O) and compute deltas before recording.
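A minimal sketch of that delta tracking (struct, field, and metric names here are illustrative, not the PR's actual code):

```rust
use opentelemetry::{metrics::Counter, KeyValue};

struct CpuTimeCollector {
    counter: Counter<f64>,
    prev_cpu_secs: f64,
}

impl CpuTimeCollector {
    fn record(&mut self, total_cpu_secs: f64) {
        // /proc reports a running total; OTEL's add() expects an
        // increment, so record only the change since last collection.
        let delta = (total_cpu_secs - self.prev_cpu_secs).max(0.0);
        self.counter.add(delta, &[KeyValue::new("state", "user")]);
        self.prev_cpu_secs = total_cpu_secs;
    }
}
```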
- Use OTEL_EXPORTER_OTLP_ENDPOINT instead of custom OTEL_ENDPOINT
- Remove team_name and deployment_env config fields
- Users should now set team.name and deployment.environment.name via the standard OTEL_RESOURCE_ATTRIBUTES environment variable
- Fix import ordering in process_metrics.rs
Remove the nilauth. prefix from process metrics to follow OTEL semantic conventions:
- nilauth.process.disk.syscalls -> process.disk.syscalls
- nilauth.process.network.connection -> process.network.connection.count

Business metrics retain the nilauth. prefix as intended.
Factor out the duplicated shutdown logic from shutdown() and Drop impl into a private shutdown_providers() method.
Allow users to configure the OTEL metrics export interval via the `otel.metrics.export_interval` config field. Defaults to 60 seconds.
Pass the ObservabilityGuard to run() and use its otel_metrics_enabled() method instead of checking config directly. This accounts for runtime conditions like OTEL_SDK_DISABLED that may override config values.
Remove redundant otel_enabled variable and duplicate is_otel_sdk_disabled() call. The early return already handles the SDK disabled case, so the subsequent condition can simply check config.otel.enabled.
Add comment explaining why process metrics don't dual-write to both backends like application metrics. Prometheus uses counter!().absolute() while OTEL requires delta computation, necessitating separate collectors.
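Side by side, the mismatch looks roughly like this (a sketch; the function and variable names are assumed, and in the PR the two calls live in separate collectors rather than one function):

```rust
use opentelemetry::metrics::Counter;

// Shown together only for contrast: the Prometheus side can write the
// cumulative total directly, while the OTEL side must record only the
// change since the previous reading.
fn record_disk_syscalls(otel: &Counter<u64>, prev: &mut u64, total: u64) {
    metrics::counter!("process_disk_syscalls").absolute(total);
    otel.add(total.saturating_sub(*prev), &[]);
    *prev = total;
}
```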
The `metrics.enabled` field was defined but never checked anywhere in the codebase. Prometheus metrics are now implicitly enabled unless OTEL metrics are enabled (which disables the Prometheus endpoint).
Replace eprintln! with tracing::error! for consistent logging during provider shutdown. The fmt layer remains active even after OTEL providers are shut down, so errors will still be logged to stderr.
The PeriodicReader's interval controls metric collection frequency, while the MetricExporter's timeout (already configured) controls network operation timeout. Also fix integration tests to pass ObservabilityGuard to run().
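A rough sketch of how the two knobs relate, assuming a recent opentelemetry_sdk/opentelemetry-otlp (builder signatures vary by version; older PeriodicReaders also take a runtime argument):

```rust
use std::time::Duration;
use opentelemetry_otlp::{MetricExporter, WithExportConfig};
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};

fn build_meter_provider(interval: Duration) -> Result<SdkMeterProvider, Box<dyn std::error::Error>> {
    let exporter = MetricExporter::builder()
        .with_tonic()
        // Bounds each network export operation...
        .with_timeout(Duration::from_secs(10))
        .build()?;
    // ...while the reader interval controls how often metrics are
    // collected and handed to the exporter.
    let reader = PeriodicReader::builder(exporter)
        .with_interval(interval)
        .build();
    Ok(SdkMeterProvider::builder().with_reader(reader).build())
}
```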
Add force_flush() calls before shutdown() to ensure all pending telemetry data is exported before providers are shut down.
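A sketch of the flush-then-shutdown pattern, assuming an ObservabilityGuard with an optional meter provider field (field names are illustrative):

```rust
use opentelemetry_sdk::metrics::SdkMeterProvider;

struct ObservabilityGuard {
    meter_provider: Option<SdkMeterProvider>,
    // ...tracer and logger providers elided...
}

impl ObservabilityGuard {
    fn shutdown_providers(&mut self) {
        if let Some(provider) = self.meter_provider.take() {
            // Export anything still pending before tearing down.
            if let Err(e) = provider.force_flush() {
                tracing::error!("Failed to flush OTEL metrics: {e}");
            }
            if let Err(e) = provider.shutdown() {
                tracing::error!("Failed to shut down OTEL meter provider: {e}");
            }
        }
        // ...same pattern for the tracer and logger providers.
    }
}
```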
Replace std::sync::Mutex with tokio::sync::Mutex in the OTEL process metrics collector. This avoids blocking the async runtime thread during metrics collection.
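A sketch of the collector shape this implies (all type and field names assumed):

```rust
use tokio::sync::Mutex;

struct PrevReadings {
    cpu_secs: f64,
}

struct ProcessMetricsCollector {
    prev: Mutex<PrevReadings>,
}

impl ProcessMetricsCollector {
    async fn collect(&self) {
        // lock().await suspends the task while waiting; a contended
        // std::sync::Mutex would block the runtime worker thread.
        let mut readings = self.prev.lock().await;
        // ...read /proc, compute deltas against `readings`, record...
        readings.cpu_secs = read_cpu_secs();
    }
}

// Hypothetical /proc reader, elided.
fn read_cpu_secs() -> f64 { 0.0 }
```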
Change the default OTEL metrics export interval from 60s to 15s for more responsive metric updates in typical deployments.
Allow partial observability when individual providers fail to initialize. Logs a warning to stderr and continues with the providers that did initialize successfully.
…bility

Use try_init() instead of init() to gracefully handle cases where a tracing subscriber is already set, such as in integration tests that initialize tracing before calling observability::init().
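The pattern, roughly (the layer composition here is illustrative):

```rust
use tracing_subscriber::prelude::*;

fn init_tracing() {
    // try_init() errors (instead of panicking like init()) when a
    // global subscriber is already set, e.g. by an integration test.
    if let Err(e) = tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .try_init()
    {
        eprintln!("Warning: tracing subscriber already set: {e}");
    }
}
```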
The OTEL_RESOURCE_ATTRIBUTES example was being interpreted as Rust code.
Derived Default gave empty strings/zero values instead of the documented serde defaults. This caused incorrect behavior when the otel config key was omitted entirely but OTEL was enabled via env vars.
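A minimal sketch of the fix, with illustrative names:

```rust
use serde::Deserialize;

fn default_export_interval() -> u64 { 15 }

#[derive(Deserialize)]
pub struct OtelMetricsConfig {
    /// Export interval in seconds.
    #[serde(default = "default_export_interval")]
    pub export_interval: u64,
}

// A derived Default would set export_interval to 0; implement it by
// hand so an omitted `otel` key matches the serde defaults.
impl Default for OtelMetricsConfig {
    fn default() -> Self {
        Self { export_interval: default_export_interval() }
    }
}
```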
Force-pushed from fb06b38 to e1de2c0.
Verify default values match serde defaults and endpoint resolution fallback logic.
Verify OTEL_SDK_DISABLED env var handling, resource creation with attributes, and guard shutdown safety.
Verify metric recording functions execute without panic.
```rust
/// - Prometheus: `invalid_payments_total{reason="..."}`
/// - OTEL: `nilauth.payment.invalid{reason="..."}`
```
Pardon my ignorance, but is there a reason prom uses snake_case while otel uses dots?
It was just a semantic choice made by the community, as far as I know. I'm sure it comes with benefits. I suppose it naturally lends itself more to "namespacing" than underscores ¯\_( ツ)_/¯
```rust
match init_tracer_provider(config, resource.clone()) {
    Ok(provider) => Some(provider),
    Err(e) => {
        eprintln!("Warning: Failed to initialize tracer provider: {e}");
```
Shouldn't these trigger a failure? You're otherwise risking starting the app with a broken configuration that exports no logs, metrics, nor traces.
```rust
/// Optional endpoint override for log export.
/// If not set, uses the global `otel.endpoint`.
#[serde(default)]
pub endpoint: Option<String>,
```
I find this level of config a bit too granular. Do we expect to use different endpoints for each of metrics, traces, and logs? Same for `enabled`: shouldn't OTEL be completely disabled until we switch over to this new way of running things, and then after that always fully enabled? I get allowing the whole thing to be disabled for testing purposes (which can be done via env var or via the top-level `enabled`), but this feels a bit "too configurable".
I'm adding OTEL to another service now and I'm going with a simple "fully enabled or fully disabled" approach, no endpoint overrides, etc., and I'm wondering if it's too short-sighted.
This PR adds support for exporting observability telemetry via OpenTelemetry OTLP over gRPC.
Config
The configuration has a new `otel` section to control how to export OTEL telemetry (sketched below). Standard OTEL environment variables are also supported, e.g. `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_RESOURCE_ATTRIBUTES`, and `OTEL_SDK_DISABLED`.
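For reference, a sketch of the shape that section might take in Rust (field names are based on ones discussed in this PR; the exact structure is illustrative):

```rust
use serde::Deserialize;

#[derive(Deserialize, Default)]
pub struct OtelConfig {
    /// Master switch for OTEL export.
    #[serde(default)]
    pub enabled: bool,
    /// Global OTLP gRPC endpoint, used unless a signal overrides it.
    #[serde(default)]
    pub endpoint: Option<String>,
    #[serde(default)]
    pub logs: SignalConfig,
    #[serde(default)]
    pub traces: SignalConfig,
    #[serde(default)]
    pub metrics: MetricsConfig,
}

/// Per-signal override for the export endpoint.
#[derive(Deserialize, Default)]
pub struct SignalConfig {
    #[serde(default)]
    pub endpoint: Option<String>,
}

#[derive(Deserialize)]
pub struct MetricsConfig {
    #[serde(default)]
    pub endpoint: Option<String>,
    /// Export interval in seconds (default: 15).
    #[serde(default = "default_export_interval")]
    pub export_interval: u64,
}

fn default_export_interval() -> u64 { 15 }

// Manual impl so the struct's Default matches the serde defaults.
impl Default for MetricsConfig {
    fn default() -> Self {
        Self { endpoint: None, export_interval: default_export_interval() }
    }
}
```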
Logs
If enabled, log messages will be emitted to the configured endpoint (global, or log-specific).
Logs will still be spat out to stdout/stderr.
Tracing
If enabled, traces will be emitted for any instrumented functions to the configured endpoint (global, or trace-specific).
I figured it's not my place to decide what gets instrumented, as I'm not the domain expert :)
To generate OTEL spans, you can simply decorate functions with `#[tracing::instrument]`:

- `skip(arg)` - exclude non-Debug/sensitive args
- `skip_all` - exclude all args
- `fields(key = %val)` - add custom span attributes
- `err` - record errors as span events
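For example (a hypothetical function; only the attribute syntax comes from tracing itself):

```rust
use tracing::instrument;

// `skip(conn)` keeps a non-Debug argument out of the span, `fields`
// adds a custom attribute, and `err` records failures as span events.
#[instrument(skip(conn), fields(user = %user_id), err)]
fn handle_request(conn: &Connection, user_id: u64) -> Result<(), std::io::Error> {
    tracing::info!("handling request"); // becomes an event on this span
    Ok(())
}

struct Connection;
```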
Spans nest automatically; `info!`/`error!` calls become span events with trace correlation.

Metrics
If enabled (typically just by setting `otel.enabled` to `true`), metrics will be exported via OTLP to the configured endpoint (global, or metrics-specific).

Application metrics:
- `nilauth.payment.invalid` - Counter of invalid payment attempts (by reason)
- `nilauth.payment.valid` - Counter of valid payments (by module)
- `nilauth.nuc.minted` - Counter of NUCs minted (by module)
- `nilauth.token.revoked` - Counter of tokens revoked
- `nilauth.token.expired_removed` - Counter of expired tokens cleaned up

Process metrics (Linux only):
- `process.cpu.time` - CPU time consumed (seconds)
- `process.memory.usage` - Memory usage (bytes)
- `process.open_file_descriptor.count` - Open file descriptors
- `process.thread.count` - Thread count
export_interval(default: 15 seconds). When OTEL metrics are enabled, the Prometheus/metricsendpoint is disabled to avoid duplicate collection.