
feat(observability): OTEL support #66

Open

cmacrae wants to merge 26 commits into main from feat/otel

Conversation

@cmacrae (Contributor) commented Jan 15, 2026

This PR adds support for exporting observability telemetry via OpenTelemetry OTLP over gRPC.

Config

The configuration has a new otel section to control how to export OTEL telemetry:

otel:
  enabled: true  # default: false
  endpoint: "http://localhost:4317"  # OTLP gRPC endpoint
  service_name: nilauth
  resource_attributes:  # optional: arbitrary attributes to attach to telemetry
    service.instance.id: nilauth-001
  export_timeout: 30
  logs:
    enabled: true  # default: true (when otel.enabled is true)
    endpoint: "http://localhost:4317"  # optional, overrides global endpoint
  traces:
    enabled: true  # default: true (when otel.enabled is true)
    endpoint: "http://localhost:4317"  # optional, overrides global endpoint
  metrics:
    enabled: true  # default: true - set to false to keep using Prometheus metrics
    endpoint: "http://localhost:4317"  # optional, overrides global endpoint
    export_interval: 15  # seconds between metric exports (default: 15)

Standard OTEL environment variables are also supported:

OTEL_SDK_DISABLED=true            - Disable OTEL SDK, use only fmt logging
OTEL_EXPORTER_OTLP_ENDPOINT       - Global OTLP gRPC endpoint URL (default: http://localhost:4317)
OTEL_SERVICE_NAME                 - Service name for telemetry (default: nilauth)
OTEL_RESOURCE_ATTRIBUTES          - Resource attributes as key=value,key=value
                                    Example: team.name=myteam,deployment.environment.name=prod
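
As a rough sketch of how the standard OTEL_SDK_DISABLED variable can be honored (a helper named is_otel_sdk_disabled() is mentioned in a later commit on this branch; this body is illustrative, not the code in the PR):

/// Returns true when OTEL_SDK_DISABLED asks us to skip all OTEL
/// initialization and fall back to plain fmt logging.
fn is_otel_sdk_disabled() -> bool {
    std::env::var("OTEL_SDK_DISABLED")
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}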

Logs

If enabled, log messages are emitted to the configured endpoint (global, or log-specific).
Logs are still written to stdout/stderr as well.

Tracing

If enabled, traces for any instrumented functions are emitted to the configured endpoint (global, or trace-specific).
I figured it's not my place to decide what gets instrumented, as I'm not the domain expert :)

To generate OTEL spans, you can simply decorate functions with #[tracing::instrument]:

use tracing::{info, instrument};

#[instrument(name = "my_operation", skip(config), fields(user_id = %user.id))]
async fn my_function(config: &Config, user: &User) -> Result<()> {
    // logs within this function are automatically correlated to this span
    info!("doing work");
    Ok(())
}
  • skip(arg) - exclude non-Debug/sensitive args
  • skip_all - exclude all args
  • fields(key = %val) - add custom span attributes
  • err - record errors as span events

Spans nest automatically; info!/error! calls become span events with trace correlation.
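
As an example of skip_all, fields, and err together, here is a hypothetical helper (the function, its fields, and its error type are illustrative, not code from this branch):

use tracing::{info, instrument};

// `skip_all` keeps every argument out of the span; the explicit `fields`
// entry adds token_id back as a span attribute, and `err` records an Err
// return value as an error event on the span.
#[instrument(name = "revoke_token", skip_all, fields(token_id = %token_id), err)]
async fn revoke_token(token_id: &str, dry_run: bool) -> Result<(), String> {
    info!("revoking token"); // span event, correlated to the surrounding trace
    if dry_run {
        return Err("dry run: nothing revoked".to_string());
    }
    Ok(())
}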

Metrics

If enabled (typically just by setting otel.enabled to true), metrics are exported via OTLP to the configured endpoint (global, or metrics-specific).

Application metrics:

  • nilauth.payment.invalid - Counter of invalid payment attempts (by reason)
  • nilauth.payment.valid - Counter of valid payments (by module)
  • nilauth.nuc.minted - Counter of NUCs minted (by module)
  • nilauth.token.revoked - Counter of tokens revoked
  • nilauth.token.expired_removed - Counter of expired tokens cleaned up

Process metrics (Linux only):

  • process.cpu.time - CPU time consumed (seconds)
  • process.memory.usage - Memory usage (bytes)
  • process.open_file_descriptor.count - Open file descriptors
  • process.thread.count - Thread count

Metrics are batched and exported at the configured export_interval (default: 15 seconds). When OTEL metrics are enabled, the Prometheus /metrics endpoint is disabled to avoid duplicate collection.
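
As a rough illustration of how one of these application counters might be recorded on both backends (the function name and the per-call creation of the OTEL counter are illustrative; the instrument builder is build() in recent opentelemetry releases and init() in older ones):

use metrics::counter;
use opentelemetry::{global, KeyValue};

// Sketch only: record an invalid payment attempt on both backends. In
// practice the OTEL counter would be created once and reused.
pub fn record_invalid_payment(reason: &str) {
    // Prometheus-style metric: snake_case name with a `reason` label.
    counter!("invalid_payments_total", "reason" => reason.to_string()).increment(1);

    // OTEL metric: dot-namespaced name, exported via OTLP.
    global::meter("nilauth")
        .u64_counter("nilauth.payment.invalid")
        .build()
        .add(1, &[KeyValue::new("reason", reason.to_string())]);
}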

These changes add support & associated config for exporting logs with OTEL via OTLP over gRPC

Simply emits any OTEL data it receives.

These changes add support & associated config for exporting traces with OTEL via OTLP over gRPC

These changes add support & associated config for exporting metrics with OTEL via OTLP over gRPC

OTEL counters use add() which increments values, unlike Prometheus counters which support absolute(). Track previous values for cumulative metrics (CPU time, disk I/O) and compute deltas before recording.
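
A minimal sketch of that delta computation, with hypothetical type and field names:

use opentelemetry::metrics::Counter;

// Sketch only: convert absolute readings into deltas before handing them to
// an OTEL counter, which only supports add().
struct CpuTimeRecorder {
    counter: Counter<f64>,
    prev_cpu_seconds: f64,
}

impl CpuTimeRecorder {
    fn record(&mut self, current_cpu_seconds: f64) {
        // Clamp at zero to guard against resets of the underlying source.
        let delta = (current_cpu_seconds - self.prev_cpu_seconds).max(0.0);
        self.counter.add(delta, &[]);
        self.prev_cpu_seconds = current_cpu_seconds;
    }
}
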
- Use OTEL_EXPORTER_OTLP_ENDPOINT instead of custom OTEL_ENDPOINT
- Remove team_name and deployment_env config fields
- Users should now set team.name and deployment.environment.name via standard OTEL_RESOURCE_ATTRIBUTES environment variable
- Fix import ordering in process_metrics.rs

Remove nilauth. prefix from process metrics to follow OTEL semantic conventions:

- nilauth.process.disk.syscalls -> process.disk.syscalls
- nilauth.process.network.connection -> process.network.connection.count

Business metrics retain the nilauth. prefix as intended.

Factor out the duplicated shutdown logic from shutdown() and Drop impl into a private shutdown_providers() method.
Allow users to configure the OTEL metrics export interval via the `otel.metrics.export_interval` config field. Defaults to 60 seconds.

Pass the ObservabilityGuard to run() and use its otel_metrics_enabled() method instead of checking config directly. This accounts for runtime conditions like OTEL_SDK_DISABLED that may override config values.

Remove redundant otel_enabled variable and duplicate is_otel_sdk_disabled() call. The early return already handles the SDK disabled case, so the subsequent condition can simply check config.otel.enabled.

Add comment explaining why process metrics don't dual-write to both backends like application metrics. Prometheus uses counter!().absolute() while OTEL requires delta computation, necessitating separate collectors.

The `metrics.enabled` field was defined but never checked anywhere in the codebase. Prometheus metrics are now implicitly enabled unless OTEL metrics are enabled (which disables the Prometheus endpoint).

Replace eprintln! with tracing::error! for consistent logging during provider shutdown. The fmt layer remains active even after OTEL providers are shut down, so errors will still be logged to stderr.

The PeriodicReader's interval controls metric collection frequency, while the MetricExporter's timeout (already configured) controls network operation timeout. Also fix integration tests to pass ObservabilityGuard to run().

Add force_flush() calls before shutdown() to ensure all pending telemetry data is exported before providers are shut down.

Replace std::sync::Mutex with tokio::sync::Mutex in the OTEL process metrics collector. This avoids blocking the async runtime thread during metrics collection.
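
A sketch of what that change looks like, with hypothetical names (the real collector tracks more than CPU time):

use std::sync::Arc;
use tokio::sync::Mutex;

// Sketch only: previous readings live behind a tokio Mutex, so waiting for
// the lock yields to the runtime instead of blocking a worker thread.
#[derive(Default)]
struct PrevReadings {
    cpu_seconds: f64,
}

#[derive(Clone, Default)]
struct ProcessMetricsCollector {
    prev: Arc<Mutex<PrevReadings>>,
}

impl ProcessMetricsCollector {
    async fn cpu_delta(&self, current_cpu_seconds: f64) -> f64 {
        let mut prev = self.prev.lock().await; // async lock
        let delta = (current_cpu_seconds - prev.cpu_seconds).max(0.0);
        prev.cpu_seconds = current_cpu_seconds;
        delta
    }
}
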
Change the default OTEL metrics export interval from 60s to 15s for more responsive metric updates in typical deployments.

Allow partial observability when individual providers fail to initialize. Logs warning to stderr and continues with the providers that did initialize successfully.

…bility

Use try_init() instead of init() to gracefully handle cases where a tracing subscriber is already set, such as in integration tests that initialize tracing before calling observability::init().
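
Roughly what the try_init() change looks like (a sketch; the real observability::init() also installs the OTEL layers):

use tracing_subscriber::{fmt, prelude::*};

fn init_fmt_subscriber() {
    // try_init() returns Err instead of panicking when a global subscriber
    // is already installed (e.g. by an integration test harness).
    if tracing_subscriber::registry()
        .with(fmt::layer())
        .try_init()
        .is_err()
    {
        eprintln!("tracing subscriber already set; keeping the existing one");
    }
}
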
The OTEL_RESOURCE_ATTRIBUTES example was being interpreted as Rust code.

Derived Default gave empty strings/zero values instead of the documented serde defaults. This caused incorrect behavior when the otel config key was omitted entirely but OTEL was enabled via env vars.
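
A sketch of the kind of fix that commit describes: a hand-written Default impl that mirrors the serde defaults (struct and field names are illustrative, not necessarily those in src/config.rs):

use serde::Deserialize;

fn default_true() -> bool {
    true
}

fn default_export_interval() -> u64 {
    15
}

#[derive(Debug, Deserialize)]
pub struct OtelMetricsConfig {
    #[serde(default = "default_true")]
    pub enabled: bool,
    #[serde(default)]
    pub endpoint: Option<String>,
    #[serde(default = "default_export_interval")]
    pub export_interval: u64,
}

// Mirrors the serde defaults, so omitting the `otel:` key entirely yields
// the same values as deserializing an empty map.
impl Default for OtelMetricsConfig {
    fn default() -> Self {
        Self {
            enabled: default_true(),
            endpoint: None,
            export_interval: default_export_interval(),
        }
    }
}
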
@cmacrae cmacrae force-pushed the feat/otel branch 2 times, most recently from fb06b38 to e1de2c0 on January 19, 2026 13:59
@cmacrae cmacrae marked this pull request as ready for review January 19, 2026 14:23
@cmacrae cmacrae requested review from mfontanini and tim-hm January 19, 2026 14:23

Verify default values match serde defaults and endpoint resolution fallback logic.

Verify OTEL_SDK_DISABLED env var handling, resource creation with attributes, and guard shutdown safety.

Verify metric recording functions execute without panic.

@tim-hm (Contributor) left a comment

LGTM!

Comment thread src/metrics.rs
Comment on lines +185 to +186
/// - Prometheus: `invalid_payments_total{reason="..."}`
/// - OTEL: `nilauth.payment.invalid{reason="..."}`

Pardon my ignorance, but is there a reason prom uses snake_case while otel uses dots?

@cmacrae (Contributor, Author) replied:

It was just a semantic choice made by the community, as far as I know. I'm sure it comes with benefits. I suppose it naturally lends itself more to "namespacing" than underscores ¯\_(ツ)_/¯

Comment thread src/observability.rs
match init_tracer_provider(config, resource.clone()) {
    Ok(provider) => Some(provider),
    Err(e) => {
        eprintln!("Warning: Failed to initialize tracer provider: {e}");

Shouldn't these trigger a failure? You're otherwise risking starting the app with a broken configuration that exports no logs, metrics, nor traces.

Comment thread src/config.rs
/// Optional endpoint override for log export.
/// If not set, uses the global `otel.endpoint`.
#[serde(default)]
pub endpoint: Option<String>,

I find this level of config a bit too granular. Do we expect to use different endpoints for each metrics, traces, logs? Same for enabled, shouldn't OTEL be completely disabled until we switch over to this new way of running things and then after that always fully enabled? I get allowing the whole thing to be disabled for testing purposes (which can be done via env var or via the top level enabled) but this feels like a bit "too configurable".

I'm adding OTEL to another service now and I'm going with a simple "fully enabled or fully disabled", no endpoint overrides, etc way, and I'm wondering if it's too short sighted.
