514-labs · jmsuzuki · Apr 24, 2026 · coderabbitai · Apr 24, 2026 · coderabbitai
diff --git a/apps/framework-cli/docs/operator/metrics.md b/apps/framework-cli/docs/operator/metrics.md
@@ -0,0 +1,90 @@
+# Operator metrics — connection-churn observability (Phase 0)
+
+These metrics are exposed by the management server on `GET /metrics` in
+OpenMetrics text format. They were introduced by the Phase 0 observability
+work of the Redpanda connection-management plan
+(`.cursor/plans/redpanda-connection-management/connection-churn/plan_connection-churn.md`)
+and are the measurement baseline for every subsequent fix phase.
+
+## `moose_kafka_client_gauge{purpose}`
+
+- **Type**: Gauge
+- **Description**: Number of live Kafka client handles held by this
+  process, broken down by the role of the client. Incremented when a
+  handle is constructed, decremented when the last reference is dropped.
+- **Labels**:
+  - `purpose` — one of the role constants defined in
+    `apps/framework-cli/src/infrastructure/stream/kafka/client.rs`:
+    `ingest_producer`, `idempotent_producer`, `sync_producer`,
+    `sync_consumer`, `peek_consumer`, `mcp_sample_consumer`,
+    `health_probe`, `fetch_topics_consumer`, `fetch_topics_admin`,
+    `check_topic_size_consumer`, `admin_add_partitions`,
+    `admin_update_topic_config`, `admin_create_topics`,
+    `admin_delete_topics`, `admin_describe_topic_config`,
+    `function_worker_estimated` (see note below).
+- **Expected range (per pod)**: roughly 20–30 in steady state on a
+  main-branch deployment with 6 function processes; the exact
+  per-purpose breakdown is tracked on the connection-churn dashboard.
+  A sustained value of zero for any `purpose` usually indicates the
+  code path is inactive rather than a broken metric.
+- **Notes**: `function_worker_estimated` is a coarse proxy emitted from
+  the Rust supervisor (`functions_registry.rs`) — one tick per parallel
+  worker — because TypeScript and Python workers don't use `rdkafka` and
+  therefore can't be instrumented the same way.
+
+## `moose_function_worker_restarts_total{reason}`
+
+- **Type**: Counter (exposed as `..._total` by the OpenMetrics encoder).
+- **Description**: Cumulative count of streaming-function worker
+  restarts since process start, bucketed by exit classification.
+- **Labels**:
+  - `reason` — Rust-supervised children emit one of
+    `rust_child_exit_ok`, `rust_child_exit_err_code`,
+    `rust_child_exit_signal`, `rust_child_wait_err`. TypeScript clusters
+    emit `ts_worker_exit_code_0`, `ts_worker_exit_code_nonzero`,
+    `ts_worker_killed_by_signal_<SIGNAL>`, or `ts_worker_killed_other`.
+    Python runners emit `py_worker_exit_code_0`,
+    `py_worker_exit_code_nonzero`, or `py_worker_killed_by_<SIGNAL>`.
+- **Expected range**: `rate(...[5m])` should be 0 in steady state. Any
+  sustained non-zero rate indicates a misbehaving function and is worth
+  paging on.
+
+## `moose_function_process_diff_updated_total{reason}`
+
+- **Type**: Counter (exposed as `..._total`).
+- **Description**: Cumulative count of iterations through
+  `InfrastructureMap::diff_function_processes` that decided to emit a
+  `ProcessChange::Updated`, labelled by the root cause of the decision.
+- **Labels**:
+  - `reason`:
+    - `forced_always` — pre-Phase-1 behaviour: the process exists in
+      both maps and its fields differ, but no finer-grained cause is
+      recorded. This is the counter Phase 1 will drive to zero.
+    - `no_change` — fields match exactly; still emits an `Updated` in
+      Phase 0 (behaviour is unchanged; Phase 1 will make this a no-op).
+    - Phase 1 introduces finer-grained reasons
+      (`executable_changed`, `version_changed`, etc.); they'll replace
+      `forced_always` in the dominant series.
+- **Expected range**: Phase 0 baseline is ~0.83/s/pod across the fleet
+  (see research §3.1). Post-Phase-1, the `forced_always` series must
+  drop to zero; a non-zero rate after rollout is the regression signal.
+
+## Kill-switch — `MOOSE_KAFKA_CLIENT_METRICS_DISABLED`
+
+Setting this env var to a non-empty value disables `kafka_client_gauge`
+instrumentation for the process. Used to rule out metric-tracking as a
+CPU/memory regression source during rollout. The
+`function_worker_restarts_total` and `function_process_diff_updated_total`
+counters are **not** gated by this switch — they're cheap and do not
+touch the hot Kafka client path.
-## Kill-switch — `MOOSE_KAFKA_CLIENT_METRICS_DISABLED`
-
-Setting this env var to a non-empty value disables `kafka_client_gauge`
-instrumentation for the process. Used to rule out metric-tracking as a
-CPU/memory regression source during rollout. The
-`function_worker_restarts_total` and `function_process_diff_updated_total`
-counters are **not** gated by this switch — they're cheap and do not
-touch the hot Kafka client path.
+## Kill-switch — `MOOSE_KAFKA_CLIENT_METRICS_DISABLED`
+
+Setting this env var to any value (including empty) disables `kafka_client_gauge`
+instrumentation for the process. Used to rule out metric-tracking as a
+CPU/memory regression source during rollout. The
+`function_worker_restarts_total` and `function_process_diff_updated_total`
+counters are **not** gated by this switch — they're cheap and do not
+touch the hot Kafka client path.
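The suggested wording differs from the original only in how a set-but-empty variable is treated. A minimal sketch of both readings, written as pure functions over the raw value so they can be compared side by side. The function names are hypothetical, and which reading the process actually implements is not shown in this diff.

```rust
use std::ffi::OsStr;

/// Reading 1 ("non-empty value"): the switch is on only when the
/// variable is set AND holds at least one byte.
pub fn disabled_if_non_empty(raw: Option<&OsStr>) -> bool {
    raw.map_or(false, |v| !v.is_empty())
}

/// Reading 2 ("any value, including empty"): presence alone is enough.
pub fn disabled_if_present(raw: Option<&OsStr>) -> bool {
    raw.is_some()
}

/// Hypothetical call site: read the real variable and apply reading 2.
pub fn kafka_client_metrics_disabled() -> bool {
    let raw = std::env::var_os("MOOSE_KAFKA_CLIENT_METRICS_DISABLED");
    disabled_if_present(raw.as_deref())
}
```

The two readings disagree only for `MOOSE_KAFKA_CLIENT_METRICS_DISABLED=` (set, but empty), so that is the case the doc wording needs to pin down.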
+
+## Endpoints used by language workers
+
+Worker processes outside the Rust supervisor (TypeScript `cluster`
+workers, Python streaming runners) don't have direct access to the Rust
+metrics registry. They POST JSON event bodies to
+`http://127.0.0.1:${MOOSE_MANAGEMENT_PORT}/metrics-logs`, where the
+`metrics_log_route` handler forwards them to the shared registry. Only
+`StreamingFunctionEvent` and `FunctionWorkerRestart` payloads are
+currently accepted; see `apps/framework-cli/src/cli/local_webserver.rs`
+for the exact allow-list.
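When checking these series by hand or in a smoke test, it can help to pull a single sample out of the scraped `/metrics` body rather than matching whole lines. A minimal sketch, assuming the simple `name{labels} value` line shape asserted in this PR's end-to-end test. `sample_value` is a hypothetical helper, not part of the codebase, and it does not handle timestamps or exemplars.

```rust
/// Return the value of the first sample line whose series part
/// (metric name plus label set) equals `series` exactly.
///
/// Comment lines (`# TYPE`, `# HELP`, `# EOF`) are skipped.
pub fn sample_value(body: &str, series: &str) -> Option<f64> {
    body.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .find_map(|line| {
            // The value is the text after the last space on the line.
            let (lhs, rhs) = line.rsplit_once(' ')?;
            if lhs == series {
                rhs.parse::<f64>().ok()
            } else {
                None
            }
        })
}
```

A caller would pass the full series string, for example `moose_kafka_client_gauge{purpose="peek_consumer"}`, and get `None` when the series has not been emitted yet.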
diff --git a/apps/framework-cli/src/cli.rs b/apps/framework-cli/src/cli.rs
@@ -14,7 +14,7 @@ pub mod settings;
 /// `spawn_and_await_initial_compile`.
 pub mod ts_compilation_watcher;
 pub mod watcher;
-use super::metrics::Metrics;
+use super::metrics::{set_global_metrics_handle, Metrics};
 use crate::utilities::constants;
 use crate::utilities::docker::DockerClient;
 use crate::utilities::docker_provider::DockerInfraProvider;
@@ -773,6 +773,7 @@ pub async fn top_command_handler(
 
             let arc_metrics = Arc::new(metrics);
             arc_metrics.start_listening_to_metrics(rx_events).await;
+            set_global_metrics_handle(&arc_metrics);
 
             routines::start_development_mode(
                 project_arc,
@@ -1018,6 +1019,7 @@ pub async fn top_command_handler(
 
             let arc_metrics = Arc::new(metrics);
             arc_metrics.start_listening_to_metrics(rx_events).await;
+            set_global_metrics_handle(&arc_metrics);
 
             let capture_handle = crate::utilities::capture::capture_usage(
                 ActivityType::ProdCommand,
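The `set_global_metrics_handle` call presumably exists so that code with no `Arc<Metrics>` in scope, such as `diff_function_processes`, can record into the same registry through a free function. Its body is not part of this excerpt. One common shape for such a handle is a write-once static, sketched below with a stand-in `Metrics` type; everything other than the two function names taken from the diff is an assumption.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Stand-in for the real `Metrics` type; it only keeps per-reason counts.
#[derive(Default)]
pub struct Metrics {
    diff_updated: Mutex<HashMap<String, u64>>,
}

impl Metrics {
    pub fn function_process_diff_updated(&self, reason: &str) {
        let mut counts = self.diff_updated.lock().unwrap();
        *counts.entry(reason.to_string()).or_insert(0) += 1;
    }

    pub fn diff_updated_count(&self, reason: &str) -> u64 {
        let counts = self.diff_updated.lock().unwrap();
        counts.get(reason).copied().unwrap_or(0)
    }
}

/// Write-once slot for the process-wide handle.
static GLOBAL_METRICS: OnceLock<Arc<Metrics>> = OnceLock::new();

/// Install the handle. A second call is ignored, so the first caller wins.
pub fn set_global_metrics_handle(metrics: &Arc<Metrics>) {
    let _ = GLOBAL_METRICS.set(Arc::clone(metrics));
}

/// Free function usable from code that has no `Metrics` in scope.
/// Does nothing if no handle was installed (for example in unit tests).
pub fn record_function_process_diff_updated(reason: &str) {
    if let Some(metrics) = GLOBAL_METRICS.get() {
        metrics.function_process_diff_updated(reason);
    }
}
```

With this shape, anything recorded before the handle is installed is silently dropped, which is why the call sits right after `start_listening_to_metrics` in both command paths.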

diff --git a/apps/framework-cli/src/cli/local_webserver.rs b/apps/framework-cli/src/cli/local_webserver.rs
@@ -1362,23 +1362,12 @@ async fn metrics_log_route(
     let parsed: Result<MetricEvent, serde_json::Error> = serde_json::from_reader(body);
     trace!("Parsed metrics log route: {:?}", parsed);
 
-    if let Ok(MetricEvent::StreamingFunctionEvent {
-        count_in,
-        count_out,
-        bytes,
-        function_name,
-        timestamp,
-    }) = parsed
-    {
-        metrics
-            .send_metric_event(MetricEvent::StreamingFunctionEvent {
-                timestamp,
-                count_in,
-                count_out,
-                bytes,
-                function_name: function_name.clone(),
-            })
-            .await;
+    match parsed {
+        Ok(event @ MetricEvent::StreamingFunctionEvent { .. })
+        | Ok(event @ MetricEvent::FunctionWorkerRestart { .. }) => {
+            metrics.send_metric_event(event).await;
+        }
+        _ => {}
     }
 
     Response::builder()
@@ -2798,6 +2787,7 @@ impl Webserver {
         let producer = if project.features.streaming_engine {
             Some(kafka::client::create_producer(
                 project.redpanda_config.clone(),
+                kafka::client::PURPOSE_INGEST_PRODUCER,
             ))
         } else {
             None
@@ -4280,4 +4270,100 @@ mod tests {
         // Leading slash edge case
         assert_eq!(find_api_name("/api/1", &apis), "/api/1");
     }
+
+    #[tokio::test]
+    async fn metrics_endpoint_e2e_exposes_new_churn_observability_series() {
+        use crate::metrics::{Metrics, TelemetryMetadata};
+        use hyper::service::service_fn;
+        use std::convert::Infallible;
+        use std::time::Duration;
+        use tokio::net::TcpListener;
+
+        let (metrics, rx) = Metrics::new(
+            TelemetryMetadata {
+                machine_id: "smoke".to_string(),
+                is_moose_developer: false,
+                metric_labels: None,
+                metric_endpoints: None,
+                is_production: false,
+                project_name: "smoke".to_string(),
+                export_metrics: false,
+            },
+            None,
+        );
+        let metrics = Arc::new(metrics);
+        metrics.start_listening_to_metrics(rx).await;
+
+        metrics.kafka_client_created("smoke_producer");
+        metrics.kafka_client_created("smoke_producer");
+        metrics.kafka_client_dropped("smoke_producer");
+        metrics.function_worker_restart("smoke_reason".to_string());
+        metrics.function_process_diff_updated("forced_always");
+        metrics.function_process_diff_updated("forced_always");
+
+        let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
+        let addr = listener.local_addr().unwrap();
+
+        let metrics_for_server = metrics.clone();
+        let server_task = tokio::spawn(async move {
+            loop {
+                let (stream, _) = match listener.accept().await {
+                    Ok(pair) => pair,
+                    Err(_) => return,
+                };
+                let io = TokioIo::new(stream);
+                let metrics_clone = metrics_for_server.clone();
+                tokio::spawn(async move {
+                    let service = service_fn(move |_req: Request<Incoming>| {
+                        let m = metrics_clone.clone();
+                        async move {
+                            let resp = metrics_route(m).await.unwrap();
+                            Ok::<_, Infallible>(resp)
+                        }
+                    });
+                    let _ = auto::Builder::new(TokioExecutor::new())
+                        .serve_connection(io, service)
+                        .await;
+                });
+            }
+        });
+
+        tokio::time::sleep(Duration::from_millis(100)).await;
+
+        let url = format!("http://{addr}/metrics");
+        let client = reqwest::Client::builder()
+            .timeout(Duration::from_secs(5))
+            .build()
+            .unwrap();
+        let resp = client.get(&url).send().await.expect("GET /metrics failed");
+        assert_eq!(resp.status().as_u16(), 200);
+        let body = resp.text().await.expect("read body");
+
+        assert!(
+            body.contains("# TYPE moose_kafka_client_gauge gauge"),
+            "missing kafka gauge TYPE header in /metrics body:\n{body}"
+        );
+        assert!(
+            body.contains(r#"moose_kafka_client_gauge{purpose="smoke_producer"} 1"#),
+            "expected gauge value of 1 after 2 creates + 1 drop:\n{body}"
+        );
+        assert!(
+            body.contains("# TYPE moose_function_worker_restarts counter"),
+            "missing worker restarts TYPE header:\n{body}"
+        );
+        assert!(
+            body.contains(r#"moose_function_worker_restarts_total{reason="smoke_reason"} 1"#),
+            "expected worker restart counter == 1:\n{body}"
+        );
+        assert!(
+            body.contains("# TYPE moose_function_process_diff_updated counter"),
+            "missing diff updated TYPE header:\n{body}"
+        );
+        assert!(
+            body.contains(r#"moose_function_process_diff_updated_total{reason="forced_always"} 2"#),
+            "expected diff counter == 2:\n{body}"
+        );
+
+        server_task.abort();
+    }
 }
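The rewritten `match` in `metrics_log_route` forwards the parsed event unchanged for the two accepted variants and drops everything else, including parse failures. The pattern can be tried in isolation with a cut-down enum. The variant fields and the `Other` variant below are placeholders, since the real `MetricEvent` definition is not part of this excerpt.

```rust
/// Cut-down stand-in for the real `MetricEvent` enum.
#[derive(Debug, Clone, PartialEq)]
pub enum MetricEvent {
    StreamingFunctionEvent { function_name: String, count_in: u64 },
    FunctionWorkerRestart { reason: String },
    /// Placeholder for any variant that workers may not submit.
    Other,
}

/// Return the event if it is on the allow-list, otherwise `None`.
/// Parse errors are treated the same as disallowed variants.
pub fn allowed<E>(parsed: Result<MetricEvent, E>) -> Option<MetricEvent> {
    match parsed {
        // `event @ ...` binds the whole value, so it can be forwarded
        // without destructuring and rebuilding it field by field.
        Ok(event @ MetricEvent::StreamingFunctionEvent { .. })
        | Ok(event @ MetricEvent::FunctionWorkerRestart { .. }) => Some(event),
        _ => None,
    }
}
```

Compared with the removed `if let`, adding a new accepted variant is a one-line change to the pattern, and no field list has to be kept in sync with the enum.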
diff --git a/apps/framework-cli/src/cli/routines/peek.rs b/apps/framework-cli/src/cli/routines/peek.rs
@@ -14,7 +14,9 @@ use crate::project::Project;
 use super::{setup_redis_client, RoutineFailure, RoutineSuccess};
 
 use crate::infrastructure::olap::clickhouse::model::ClickHouseTable;
-use crate::infrastructure::stream::kafka::client::create_consumer;
+use crate::infrastructure::stream::kafka::client::{
+    create_consumer, KafkaClientHandle, PURPOSE_PEEK_CONSUMER,
+};
 use futures::stream::BoxStream;
 use rdkafka::consumer::{Consumer, StreamConsumer};
 use rdkafka::{Message as KafkaMessage, Offset, TopicPartitionList};
@@ -76,13 +78,17 @@ pub async fn peek(
             ))
         })?;
 
-    let consumer_ref: StreamConsumer;
+    let consumer_ref: KafkaClientHandle<StreamConsumer>;
     let table_ref: ClickHouseTable;
 
     let mut stream: BoxStream<anyhow::Result<Value>> = if is_stream {
         let group_id = project.redpanda_config.prefix_with_namespace("peek");
 
-        consumer_ref = create_consumer(&project.redpanda_config, &[("group.id", &group_id)]);
+        consumer_ref = create_consumer(
+            &project.redpanda_config,
+            &[("group.id", &group_id)],
+            PURPOSE_PEEK_CONSUMER,
+        );
         let consumer = &consumer_ref;
 
         let topic = find_topic_by_name(&infra, name).ok_or_else(|| {
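`create_consumer` now returns a `KafkaClientHandle<StreamConsumer>` tagged with a purpose, which lines up with the gauge description: up on construction, down on drop. The wrapper's definition is outside this excerpt. The sketch below shows one way such a handle could work, using a `Deref` wrapper and a `Drop` impl over a stand-in gauge; the `Gauge` type and all field names are assumptions. The documented behaviour decrements when the last reference is dropped, whereas this simplified version has a single owner and decrements when the handle itself goes out of scope.

```rust
use std::collections::HashMap;
use std::ops::Deref;
use std::sync::{Arc, Mutex};

/// Stand-in for a labelled gauge: live handle count per purpose.
#[derive(Default)]
pub struct Gauge(Mutex<HashMap<&'static str, i64>>);

impl Gauge {
    fn add(&self, purpose: &'static str, delta: i64) {
        *self.0.lock().unwrap().entry(purpose).or_insert(0) += delta;
    }

    pub fn get(&self, purpose: &str) -> i64 {
        self.0.lock().unwrap().get(purpose).copied().unwrap_or(0)
    }
}

/// Owns a client and keeps the gauge in step with its lifetime.
pub struct KafkaClientHandle<T> {
    inner: T,
    purpose: &'static str,
    gauge: Arc<Gauge>,
}

impl<T> KafkaClientHandle<T> {
    pub fn new(inner: T, purpose: &'static str, gauge: Arc<Gauge>) -> Self {
        gauge.add(purpose, 1); // count the handle as soon as it exists
        Self { inner, purpose, gauge }
    }
}

/// Lets call sites keep using the wrapped client's methods directly.
impl<T> Deref for KafkaClientHandle<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.inner
    }
}

impl<T> Drop for KafkaClientHandle<T> {
    fn drop(&mut self) {
        self.gauge.add(self.purpose, -1);
    }
}
```

Tying the decrement to `Drop` means early returns and `?` propagation, both common in `peek`, cannot leave the gauge inflated.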

diff --git a/apps/framework-cli/src/framework/core/infrastructure_map.rs b/apps/framework-cli/src/framework/core/infrastructure_map.rs
@@ -1504,9 +1504,18 @@ impl InfrastructureMap {
 
         for (id, process) in self_processes {
             if let Some(target_process) = target_processes.get(id) {
-                // Always treat function processes as updated if they exist in both maps
-                // This ensures function code changes are always redeployed
-                tracing::debug!("FunctionProcess updated (forced): {}", id);
+                let reason = if process == target_process {
+                    "no_change"
+                } else {
+                    "forced_always"
+                };
+                crate::metrics::record_function_process_diff_updated(reason);
+
+                tracing::debug!(
+                    "FunctionProcess updated (forced, diff_reason={}): {}",
+                    reason,
+                    id
+                );
                 process_updates += 1;
                 process_changes.push(ProcessChange::FunctionProcess(
                     Change::<FunctionProcess>::Updated {
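Note that the label in this hunk comes from a plain equality check and does not change the outcome: both branches still push a `ProcessChange::FunctionProcess` update. The labelling rule can be isolated as a pure function, sketched here with a stand-in `FunctionProcess` whose fields are placeholders; `diff_reason` and `tally_reasons` are hypothetical names.

```rust
use std::collections::HashMap;

/// Stand-in for the real `FunctionProcess`; fields are placeholders.
#[derive(Debug, Clone, PartialEq)]
pub struct FunctionProcess {
    pub name: String,
    pub version: String,
}

/// Phase 0 labelling rule from the hunk above: equal processes are
/// labelled `no_change`, anything else `forced_always`.
pub fn diff_reason(current: &FunctionProcess, target: &FunctionProcess) -> &'static str {
    if current == target {
        "no_change"
    } else {
        "forced_always"
    }
}

/// Tally reasons for every process id present in both maps. One entry is
/// counted per shared id, mirroring the fact that each one is still
/// emitted as `Updated` regardless of its label.
pub fn tally_reasons(
    current: &HashMap<String, FunctionProcess>,
    target: &HashMap<String, FunctionProcess>,
) -> HashMap<&'static str, u64> {
    let mut counts = HashMap::new();
    for (id, process) in current {
        if let Some(target_process) = target.get(id) {
            *counts.entry(diff_reason(process, target_process)).or_insert(0) += 1;
        }
    }
    counts
}
```

Ids that exist only in `current` are not counted here, since they take the removal path rather than the update path.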