
feat(observability): OTEL support #66

Open

cmacrae wants to merge 26 commits into main from feat/otel

Conversation

@cmacrae (Contributor) commented Jan 15, 2026

This PR adds support for exporting observability telemetry via OpenTelemetry OTLP over gRPC.

Config

The configuration has a new otel section to control how to export OTEL telemetry:

otel:
  enabled: true  # default: false
  endpoint: "http://localhost:4317"  # OTLP gRPC endpoint
  service_name: nilauth
  resource_attributes:  # optional: arbitrary attributes to attach to telemetry
    service.instance.id: nilauth-001
  export_timeout: 30
  logs:
    enabled: true  # default: true (when otel.enabled is true)
    endpoint: "http://localhost:4317"  # optional, overrides global endpoint
  traces:
    enabled: true  # default: true (when otel.enabled is true)
    endpoint: "http://localhost:4317"  # optional, overrides global endpoint
  metrics:
    enabled: true  # default: true - set to false to keep using Prometheus metrics
    endpoint: "http://localhost:4317"  # optional, overrides global endpoint
    export_interval: 15  # seconds between metric exports (default: 15)

Standard OTEL environment variables are also supported:

OTEL_SDK_DISABLED=true            - Disable OTEL SDK, use only fmt logging
OTEL_EXPORTER_OTLP_ENDPOINT       - Global OTLP gRPC endpoint URL (default: http://localhost:4317)
OTEL_SERVICE_NAME                 - Service name for telemetry (default: nilauth)
OTEL_RESOURCE_ATTRIBUTES          - Resource attributes as key=value,key=value
                                    Example: team.name=myteam,deployment.environment.name=prod
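
As a rough sketch of how the standard OTEL_SDK_DISABLED variable can be honored (a helper named is_otel_sdk_disabled() is mentioned in a later commit on this branch; this body is illustrative, not the code in the PR):

/// Returns true when OTEL_SDK_DISABLED asks us to skip all OTEL
/// initialization and fall back to plain fmt logging.
fn is_otel_sdk_disabled() -> bool {
    std::env::var("OTEL_SDK_DISABLED")
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}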

Logs

If enabled, log messages are emitted to the configured endpoint (global, or log-specific).
Logs are still written to stdout/stderr as well.

Tracing

If enabled, traces for any instrumented functions are emitted to the configured endpoint (global, or trace-specific).
I figured it's not my place to decide what gets instrumented, as I'm not the domain expert :)

To generate OTEL spans, you can simply decorate functions with #[tracing::instrument]:

use tracing::{info, instrument};

#[instrument(name = "my_operation", skip(config), fields(user_id = %user.id))]
async fn my_function(config: &Config, user: &User) -> Result<()> {
    // logs within this function are automatically correlated to this span
    info!("doing work");
    Ok(())
}
  • skip(arg) - exclude non-Debug/sensitive args
  • skip_all - exclude all args
  • fields(key = %val) - add custom span attributes
  • err - record errors as span events

Spans nest automatically; info!/error! calls become span events with trace correlation.
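
As an example of skip_all, fields, and err together, here is a hypothetical helper (the function, its fields, and its error type are illustrative, not code from this branch):

use tracing::{info, instrument};

// `skip_all` keeps every argument out of the span; the explicit `fields`
// entry adds token_id back as a span attribute, and `err` records an Err
// return value as an error event on the span.
#[instrument(name = "revoke_token", skip_all, fields(token_id = %token_id), err)]
async fn revoke_token(token_id: &str, dry_run: bool) -> Result<(), String> {
    info!("revoking token"); // span event, correlated to the surrounding trace
    if dry_run {
        return Err("dry run: nothing revoked".to_string());
    }
    Ok(())
}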

Metrics

If enabled (typically just by setting otel.enabled to true), metrics are exported via OTLP to the configured endpoint (global, or metrics-specific).

Application metrics:

  • nilauth.payment.invalid - Counter of invalid payment attempts (by reason)
  • nilauth.payment.valid - Counter of valid payments (by module)
  • nilauth.nuc.minted - Counter of NUCs minted (by module)
  • nilauth.token.revoked - Counter of tokens revoked
  • nilauth.token.expired_removed - Counter of expired tokens cleaned up

Process metrics (Linux only):

  • process.cpu.time - CPU time consumed (seconds)
  • process.memory.usage - Memory usage (bytes)
  • process.open_file_descriptor.count - Open file descriptors
  • process.thread.count - Thread count

Metrics are batched and exported at the configured export_interval (default: 15 seconds). When OTEL metrics are enabled, the Prometheus /metrics endpoint is disabled to avoid duplicate collection.
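
As a rough illustration of how one of these application counters might be recorded on both backends (the function name and the per-call creation of the OTEL counter are illustrative; the instrument builder is build() in recent opentelemetry releases and init() in older ones):

use metrics::counter;
use opentelemetry::{global, KeyValue};

// Sketch only: record an invalid payment attempt on both backends. In
// practice the OTEL counter would be created once and reused.
pub fn record_invalid_payment(reason: &str) {
    // Prometheus-style metric: snake_case name with a `reason` label.
    counter!("invalid_payments_total", "reason" => reason.to_string()).increment(1);

    // OTEL metric: dot-namespaced name, exported via OTLP.
    global::meter("nilauth")
        .u64_counter("nilauth.payment.invalid")
        .build()
        .add(1, &[KeyValue::new("reason", reason.to_string())]);
}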

These changes add support & associated config for exporting logs with OTEL via OTLP over gRPC

Simply emits any OTEL data it receives.

These changes add support & associated config for exporting traces with OTEL via OTLP over gRPC

These changes add support & associated config for exporting metrics with OTEL via OTLP over gRPC

OTEL counters use add() which increments values, unlike Prometheus counters which support absolute(). Track previous values for cumulative metrics (CPU time, disk I/O) and compute deltas before recording.
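
A minimal sketch of that delta computation, with hypothetical type and field names:

use opentelemetry::metrics::Counter;

// Sketch only: convert absolute readings into deltas before handing them to
// an OTEL counter, which only supports add().
struct CpuTimeRecorder {
    counter: Counter<f64>,
    prev_cpu_seconds: f64,
}

impl CpuTimeRecorder {
    fn record(&mut self, current_cpu_seconds: f64) {
        // Clamp at zero to guard against resets of the underlying source.
        let delta = (current_cpu_seconds - self.prev_cpu_seconds).max(0.0);
        self.counter.add(delta, &[]);
        self.prev_cpu_seconds = current_cpu_seconds;
    }
}
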
- Use OTEL_EXPORTER_OTLP_ENDPOINT instead of custom OTEL_ENDPOINT
- Remove team_name and deployment_env config fields
- Users should now set team.name and deployment.environment.name via standard OTEL_RESOURCE_ATTRIBUTES environment variable
- Fix import ordering in process_metrics.rs

Remove nilauth. prefix from process metrics to follow OTEL semantic conventions:

- nilauth.process.disk.syscalls -> process.disk.syscalls
- nilauth.process.network.connection -> process.network.connection.count

Business metrics retain the nilauth. prefix as intended.

Factor out the duplicated shutdown logic from shutdown() and Drop impl into a private shutdown_providers() method.
Allow users to configure the OTEL metrics export interval via the `otel.metrics.export_interval` config field. Defaults to 60 seconds.

Pass the ObservabilityGuard to run() and use its otel_metrics_enabled() method instead of checking config directly. This accounts for runtime conditions like OTEL_SDK_DISABLED that may override config values.

Remove redundant otel_enabled variable and duplicate is_otel_sdk_disabled() call. The early return already handles the SDK disabled case, so the subsequent condition can simply check config.otel.enabled.

Add comment explaining why process metrics don't dual-write to both backends like application metrics. Prometheus uses counter!().absolute() while OTEL requires delta computation, necessitating separate collectors.

The `metrics.enabled` field was defined but never checked anywhere in the codebase. Prometheus metrics are now implicitly enabled unless OTEL metrics are enabled (which disables the Prometheus endpoint).

Replace eprintln! with tracing::error! for consistent logging during provider shutdown. The fmt layer remains active even after OTEL providers are shut down, so errors will still be logged to stderr.

The PeriodicReader's interval controls metric collection frequency, while the MetricExporter's timeout (already configured) controls network operation timeout. Also fix integration tests to pass ObservabilityGuard to run().

Add force_flush() calls before shutdown() to ensure all pending telemetry data is exported before providers are shut down.

Replace std::sync::Mutex with tokio::sync::Mutex in the OTEL process metrics collector. This avoids blocking the async runtime thread during metrics collection.
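
A sketch of what that change looks like, with hypothetical names (the real collector tracks more than CPU time):

use std::sync::Arc;
use tokio::sync::Mutex;

// Sketch only: previous readings live behind a tokio Mutex, so waiting for
// the lock yields to the runtime instead of blocking a worker thread.
#[derive(Default)]
struct PrevReadings {
    cpu_seconds: f64,
}

#[derive(Clone, Default)]
struct ProcessMetricsCollector {
    prev: Arc<Mutex<PrevReadings>>,
}

impl ProcessMetricsCollector {
    async fn cpu_delta(&self, current_cpu_seconds: f64) -> f64 {
        let mut prev = self.prev.lock().await; // async lock
        let delta = (current_cpu_seconds - prev.cpu_seconds).max(0.0);
        prev.cpu_seconds = current_cpu_seconds;
        delta
    }
}
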
Change the default OTEL metrics export interval from 60s to 15s for more responsive metric updates in typical deployments.

Allow partial observability when individual providers fail to initialize. Logs warning to stderr and continues with the providers that did initialize successfully.

…bility

Use try_init() instead of init() to gracefully handle cases where a tracing subscriber is already set, such as in integration tests that initialize tracing before calling observability::init().
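
Roughly what the try_init() change looks like (a sketch; the real observability::init() also installs the OTEL layers):

use tracing_subscriber::{fmt, prelude::*};

fn init_fmt_subscriber() {
    // try_init() returns Err instead of panicking when a global subscriber
    // is already installed (e.g. by an integration test harness).
    if tracing_subscriber::registry()
        .with(fmt::layer())
        .try_init()
        .is_err()
    {
        eprintln!("tracing subscriber already set; keeping the existing one");
    }
}
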
The OTEL_RESOURCE_ATTRIBUTES example was being interpreted as Rust code.

Derived Default gave empty strings/zero values instead of the documented serde defaults. This caused incorrect behavior when the otel config key was omitted entirely but OTEL was enabled via env vars.
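
A sketch of the kind of fix that commit describes: a hand-written Default impl that mirrors the serde defaults (struct and field names are illustrative, not necessarily those in src/config.rs):

use serde::Deserialize;

fn default_true() -> bool {
    true
}

fn default_export_interval() -> u64 {
    15
}

#[derive(Debug, Deserialize)]
pub struct OtelMetricsConfig {
    #[serde(default = "default_true")]
    pub enabled: bool,
    #[serde(default)]
    pub endpoint: Option<String>,
    #[serde(default = "default_export_interval")]
    pub export_interval: u64,
}

// Mirrors the serde defaults, so omitting the `otel:` key entirely yields
// the same values as deserializing an empty map.
impl Default for OtelMetricsConfig {
    fn default() -> Self {
        Self {
            enabled: default_true(),
            endpoint: None,
            export_interval: default_export_interval(),
        }
    }
}
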
@cmacrae cmacrae force-pushed the feat/otel branch 2 times, most recently from fb06b38 to e1de2c0 on January 19, 2026 13:59
@cmacrae cmacrae marked this pull request as ready for review January 19, 2026 14:23
@cmacrae cmacrae requested review from mfontanini and tim-hm January 19, 2026 14:23

Verify default values match serde defaults and endpoint resolution fallback logic.

Verify OTEL_SDK_DISABLED env var handling, resource creation with attributes, and guard shutdown safety.

Verify metric recording functions execute without panic.

@tim-hm (Contributor) left a comment

LGTM!

Comment thread src/metrics.rs
Comment on lines +185 to +186
/// - Prometheus: `invalid_payments_total{reason="..."}`
/// - OTEL: `nilauth.payment.invalid{reason="..."}`

Pardon my ignorance, but is there a reason prom uses snake_case while otel uses dots?

@cmacrae (Contributor, Author) replied:

It was just a semantic choice made by the community, as far as I know. I'm sure it comes with benefits. I suppose it naturally lends itself more to "namespacing" than underscores ¯\_(ツ)_/¯

Comment thread src/observability.rs
match init_tracer_provider(config, resource.clone()) {
    Ok(provider) => Some(provider),
    Err(e) => {
        eprintln!("Warning: Failed to initialize tracer provider: {e}");

Shouldn't these trigger a failure? You're otherwise risking starting the app with a broken configuration that exports no logs, metrics, nor traces.

Comment thread src/config.rs
/// Optional endpoint override for log export.
/// If not set, uses the global `otel.endpoint`.
#[serde(default)]
pub endpoint: Option<String>,

I find this level of config a bit too granular. Do we expect to use different endpoints for each metrics, traces, logs? Same for enabled, shouldn't OTEL be completely disabled until we switch over to this new way of running things and then after that always fully enabled? I get allowing the whole thing to be disabled for testing purposes (which can be done via env var or via the top level enabled) but this feels like a bit "too configurable".

I'm adding OTEL to another service now and I'm going with a simple "fully enabled or fully disabled", no endpoint overrides, etc way, and I'm wondering if it's too short sighted.
