
ENG-2894: kafka observability#4047

Draft
jmsuzuki wants to merge 1 commit into main from ENG-2794-redpanda-connection-churn-observability

Conversation


@jmsuzuki jmsuzuki commented Apr 24, 2026

Adds observability metrics to the local /metrics endpoint:

  • moose_kafka_client_gauge{purpose}
  • moose_function_worker_restarts_total{reason}
  • moose_function_process_diff_updated_total{reason}


Note

Medium Risk
Touches Kafka client construction across multiple producer/consumer/admin code paths by wrapping handles for lifecycle tracking, which could affect resource lifetimes and shutdown behavior if misused. Otherwise changes are additive observability and event forwarding with a kill-switch for the hot-path gauge.

Overview
Adds Phase-0 connection-churn observability to the management server’s GET /metrics, introducing moose_kafka_client_gauge{purpose}, moose_function_worker_restarts_total{reason}, and moose_function_process_diff_updated_total{reason} (with new operator docs).

Kafka client creation is now purpose-tagged via a new KafkaClientHandle<T> wrapper that increments/decrements the gauge on handle clone/drop, and key call sites are updated to pass explicit PURPOSE_* constants (producers, consumers/subscribers, admin clients, health probe, topic size checks). A MOOSE_KAFKA_CLIENT_METRICS_DISABLED env kill-switch disables only the Kafka gauge tracking.
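A minimal sketch of the wrapper pattern, using a plain atomic in place of the real Prometheus gauge family (names and semantics are assumptions based on the description above; per the review discussion, clones share one tracker, so the gauge moves once per underlying client rather than once per clone):

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for the per-purpose gauge family; the real code
// labels this by `purpose`.
static CLIENT_GAUGE: AtomicI64 = AtomicI64::new(0);

// Shared by every clone of a handle; increments on creation and decrements
// exactly once, when the last clone is dropped.
struct ClientTracker {
    purpose: &'static str,
}

impl ClientTracker {
    fn new(purpose: &'static str) -> Self {
        CLIENT_GAUGE.fetch_add(1, Ordering::SeqCst);
        ClientTracker { purpose }
    }
}

impl Drop for ClientTracker {
    fn drop(&mut self) {
        // In the real code this is where the labeled gauge is decremented.
        let _ = self.purpose;
        CLIENT_GAUGE.fetch_sub(1, Ordering::SeqCst);
    }
}

// Wrapper tying the client's lifetime to the gauge. Deriving Clone only
// applies when T: Clone (e.g. FutureProducer is Clone; StreamConsumer is not).
#[derive(Clone)]
struct KafkaClientHandle<T> {
    inner: T,
    tracker: Arc<ClientTracker>,
}

fn demo() -> Vec<i64> {
    let mut readings = Vec::new();
    let h = KafkaClientHandle {
        inner: (),
        tracker: Arc::new(ClientTracker::new("ingest_producer")),
    };
    readings.push(CLIENT_GAUGE.load(Ordering::SeqCst)); // created -> 1
    let h2 = h.clone();
    readings.push(CLIENT_GAUGE.load(Ordering::SeqCst)); // clone shares tracker -> still 1
    drop(h);
    readings.push(CLIENT_GAUGE.load(Ordering::SeqCst)); // one clone still alive -> still 1
    drop(h2);
    readings.push(CLIENT_GAUGE.load(Ordering::SeqCst)); // last handle dropped -> 0
    readings
}
```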

Adds restart/diff counters by (1) recording worker-exit classifications from the Rust process supervisor, and (2) logging whether function-process diffs were no_change vs forced_always; TypeScript and Python workers now POST restart reasons to /metrics-logs, and the route allow-list forwards these new events. Includes unit tests plus an end-to-end /metrics scrape test for the new series.

Reviewed by Cursor Bugbot for commit f4786f5. Bugbot is set up for automated code reviews on this repo.


vercel Bot commented Apr 24, 2026

The latest updates on your projects.

Project: docs-v2 · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Apr 24, 2026 0:43am


@linear

linear Bot commented Apr 24, 2026


This stack of pull requests is managed by Graphite. Learn more about stacking.


coderabbitai Bot commented Apr 24, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added OpenMetrics endpoints exposing Kafka client lifecycle metrics (gauge), function worker restart counters, and process update counters.
    • Worker processes now report exit reasons for enhanced observability and troubleshooting.
    • Kafka client metrics can be disabled via MOOSE_KAFKA_CLIENT_METRICS_DISABLED environment variable.
  • Documentation

    • Added operator metrics documentation specifying exposed endpoints, metric schemas, and label definitions.

Walkthrough

Adds comprehensive Kafka client lifecycle metrics instrumentation and worker exit-reason observability. Introduces purpose-tagged Kafka client tracking, global metrics handle propagation, worker exit-reason classification and HTTP reporting via /metrics-logs endpoint, and filtering of metric events from external sinks.

Changes

Cohort / File(s) Summary
Metrics Infrastructure
src/metrics.rs, src/metrics_inserter.rs
New MetricEvent variants for Kafka client lifecycle and function worker restarts/diffs. Prometheus gauge and counter families with label-keyed updates. Global OnceLock-based handle mechanism and env-gated (MOOSE_KAFKA_CLIENT_METRICS_DISABLED) helper functions. Metric events explicitly filtered out from external metric sink payload construction.
Kafka Client Instrumentation
src/infrastructure/stream/kafka/client.rs, src/infrastructure/stream/kafka/models.rs
New KafkaClientHandle<T> wrapper recording create/drop lifecycle via record_kafka_client_created/dropped with purpose labels. Factory functions (create_producer, create_consumer, create_subscriber, create_idempotent_producer) now accept purpose parameter and return wrapped clients. Topic admin operations similarly wrap AdminClient instances with purpose-specific handles.
CLI & Webserver Integration
src/cli.rs, src/cli/local_webserver.rs
Global metrics handle propagation in Dev/Prod paths. New POST /metrics-logs endpoint accepting StreamingFunctionEvent and FunctionWorkerRestart payloads. Kafka producer tagged with PURPOSE_INGEST_PRODUCER. E2e test validating metrics endpoint response and counter/gauge values.
Function Registry & Infrastructure
src/infrastructure/processes/functions_registry.rs, src/framework/core/infrastructure_map.rs
Registry stores estimated_clients per process; records Kafka client creation/drops scaled by worker count. Infrastructure map annotates process diffs with reason (no_change vs forced_always) and invokes record_function_process_diff_updated(reason).
Kafka Operations
src/cli/routines/peek.rs, src/infrastructure/processes/kafka_clickhouse_sync.rs, src/mcp/tools/sample_stream.rs
Consumer/producer creation calls updated to include purpose parameters (PURPOSE_PEEK_CONSUMER, PURPOSE_SYNC_CONSUMER, PURPOSE_SYNC_PRODUCER, PURPOSE_MCP_SAMPLE_CONSUMER). Consumer type signature changed to KafkaClientHandle<StreamConsumer>.
Worker Exit Reporting
src/utilities/system.rs, packages/py-moose-lib/.../streaming_function_runner.py, packages/ts-moose-lib/src/cluster-utils.ts
Rust: child-process exit classification (ok/err_code/signal/wait_err) passed to record_function_worker_restart. Python: atexit hook posts exit reason to /metrics-logs. TypeScript: worker exit event handler posts reason to management service at 127.0.0.1:MOOSE_MANAGEMENT_PORT/metrics-logs with silent error handling.
Documentation
docs/operator/metrics.md
New metrics endpoint specification documenting three metric series (moose_kafka_client_gauge{purpose}, moose_function_worker_restarts_total{reason}, moose_function_process_diff_updated_total{reason}), label schemas, steady-state behavior, and Phase 0/1 interpretation. Documents MOOSE_KAFKA_CLIENT_METRICS_DISABLED and metrics-logs payload allow-list.

Sequence Diagram(s)

sequenceDiagram
    participant W as Worker Process<br/>(Py/TS)
    participant M as Management Server<br/>(/metrics-logs)
    participant R as Metrics Registry<br/>(OpenMetrics)

    rect rgba(100, 150, 200, 0.5)
    note over W,R: Worker Exit Reporting Flow
    W->>W: Process exits/receives signal
    activate W
    W->>W: Classify exit reason<br/>(signal, code, error, ok)
    deactivate W
    
    W->>M: POST /metrics-logs<br/>{FunctionWorkerRestart{reason}}
    activate M
    M->>M: Parse event
    M->>R: Emit MetricEvent
    deactivate M
    
    activate R
    R->>R: Update counter:<br/>moose_function_worker_restarts_total{reason}
    deactivate R
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • callicles
  • LucioFranco

Poem

🔍 Kafka clients wake and sleep,
Metrics count each birth and keep,
Workers whisper exit tales,
Through the /metrics-logs trails—
Observability prevails! 📊

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title 'ENG-2894: kafka observability' accurately and concisely summarizes the main change—adding Kafka observability metrics to the /metrics endpoint.
Description check ✅ Passed Description clearly relates to the changeset, detailing the three new metrics, KafkaClientHandle wrapper, worker restart tracking, and environment controls.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.





@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



exit_code = 1 if fatal_error.is_set() else 0
_set_worker_exit_reason(
    "py_worker_exit_code_nonzero" if exit_code != 0 else "py_worker_exit_code_0"
)

Python finally block overwrites signal exit reason

High Severity

The finally block unconditionally calls _set_worker_exit_reason based on exit code, overwriting any signal-based reason set earlier by the shutdown handler. When a signal like SIGTERM is received, shutdown correctly sets the reason to py_worker_killed_by_SIGTERM, but then the finally block runs and replaces it with py_worker_exit_code_0 (since signal-based shutdowns don't set fatal_error). This makes the signal classification in the shutdown handler effectively dead code — all signal exits are misreported as clean exits.


Triggered by team rule: Pragmatic Developper



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/framework-cli/src/infrastructure/stream/kafka/client.rs (1)

546-589: 🛠️ Refactor suggestion | 🟠 Major

Public API signatures changed — docstrings didn't follow.

  • create_idempotent_producer now returns KafkaClientHandle<FutureProducer> (breaking change vs. prior FutureProducer); doc still says "A FutureProducer".
  • create_producer, create_consumer, create_subscriber gained a new purpose: &'static str parameter; none of the # Arguments blocks mention it, and none document the allowed label values.

As per coding guidelines ("Document all public APIs and breaking changes" / "Documentation is required for all public APIs in Rust"), please update the rustdoc blocks. Recommend pointing readers at the PURPOSE_* constants rather than letting new callers invent ad-hoc labels, which would fragment the moose_kafka_client_gauge{purpose=...} cardinality.

Also applies to: 704-716, 734-765

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/src/infrastructure/stream/kafka/client.rs` around lines
546 - 589, Update the rustdoc for the public producer/consumer functions to
reflect the new signatures and expected labels: change the return description of
create_idempotent_producer to state it returns KafkaClientHandle<FutureProducer>
(not FutureProducer), and add a new `purpose: &'static str` parameter
description to create_producer (and likewise to create_consumer and
create_subscriber) explaining that callers must supply one of the predefined
PURPOSE_* constants (reference PURPOSE_IDEMPOTENT_PRODUCER, PURPOSE_* constants)
and that these values are used as the `purpose` label for
moose_kafka_client_gauge; also list/point to the PURPOSE_* constants rather than
encouraging ad-hoc strings and update the Arguments and Returns sections
accordingly for each affected function (create_idempotent_producer,
create_producer, create_consumer, create_subscriber).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/framework-cli/docs/operator/metrics.md`:
- Around line 72-79: The doc is incorrect: kafka_client_tracking_enabled() in
metrics.rs currently treats any set value (including empty) as disabling
metrics; change the implementation in metrics.rs (function
kafka_client_tracking_enabled and reference KAFKA_METRICS_DISABLED_ENV) to
reject empty strings by using a check like
std::env::var(KAFKA_METRICS_DISABLED_ENV).is_ok_and(|v| !v.is_empty()), so the
kill-switch only trips for non-empty values (or alternatively update the doc to
remove "non-empty" if you prefer doc-only alignment).

In `@apps/framework-cli/src/cli/local_webserver.rs`:
- Around line 4274-4368: The test introduces three new public Prometheus series
(moose_kafka_client_gauge with label purpose,
moose_function_worker_restarts_total with label reason, and
moose_function_process_diff_updated_total with label reason) surfaced by
Metrics/Metrics::new and exposed by metrics_route; update the user docs in
apps/framework-docs-v2/content to document each series name, type (gauge or
counter), label meanings (purpose, reason), example metric lines (e.g.
moose_kafka_client_gauge{purpose="smoke_producer"} 1), and any related
configuration or opt-out notes so the new observability surface is discoverable;
also ensure the docs index/TOC includes this metrics page.

In `@apps/framework-cli/src/framework/core/infrastructure_map.rs`:
- Around line 1507-1518: The metric label for the restart reason is misleading:
change the branch that sets reason when process != target_process from
"forced_always" to "changed" (the variables involved are process,
target_process, and reason) so
crate::metrics::record_function_process_diff_updated(reason) emits "changed" for
actual config differences; keep the tracing::debug call but update any tests
that assert the old label to expect "changed" instead of "forced_always" in the
metrics assertions.

In `@apps/framework-cli/src/infrastructure/stream/kafka/client.rs`:
- Around line 35-55: Replace the free-form &'static str purpose labels with a
tiny newtype to prevent accidental new labels: declare pub struct
KafkaClientPurpose(&'static str) with a private constructor (or make the struct
non-exhaustive) and expose only the existing PURPOSE_* values as constants of
type KafkaClientPurpose (e.g., PURPOSE_INGEST_PRODUCER: KafkaClientPurpose).
Change any APIs that currently accept purpose: &'static str to accept
KafkaClientPurpose (look for usages of purpose and functions/methods that take a
&'static str purpose) so callers must use the predefined constants; keep the
inner string accessible only for metrics/label emission when needed. Ensure the
newtype derives Copy/Clone/Debug/Eq/PartialEq/Hash as appropriate.
- Around line 68-72: KafkaClientTracker::drop calls record_kafka_client_dropped
which ultimately routes through Metrics::try_send_metric_event and a global
mpsc; during runtime shutdown the receiver may be closed and drops on worker
threads can silently fail to decrement the gauge. Change the lifecycle path so
drops update a dedicated Gauge handle (or otherwise use an atomic/lock-protected
gauge API) instead of sending over the bounded mpsc: locate
KafkaClientTracker::drop and record_kafka_client_dropped and replace the
try_send-based decrement with a direct call to the Gauge decrement method (or
atomic decrement) provided by the metrics subsystem used by
start_listening_to_metrics, falling back to a no-op if the gauge handle is
unavailable to avoid panics during shutdown.
- Around line 74-78: Replace the derived Clone on KafkaClientHandle with an
explicit implementation: implement Clone only for T: Clone (impl<T: Clone> Clone
for KafkaClientHandle<T>) and add a doc comment on KafkaClientHandle explaining
that cloned handles share the same Arc<KafkaClientTracker> so only one "dropped"
event fires when the last handle is dropped; mention that StreamConsumer,
BaseConsumer, and AdminClient in rdkafka 0.36 do not implement Clone while
FutureProducer does, so the impl bounds keep behavior correct.
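The newtype proposed for the client.rs comments above could look roughly like this (constant names and label strings are illustrative; the PR's actual PURPOSE_* values may differ):

```rust
/// Closed set of gauge labels: callers can only use the predefined constants,
/// keeping moose_kafka_client_gauge{purpose=...} cardinality bounded. The
/// tuple field is private, so new labels can't be minted outside this module.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct KafkaClientPurpose(&'static str);

impl KafkaClientPurpose {
    pub const INGEST_PRODUCER: Self = KafkaClientPurpose("ingest_producer");
    pub const PEEK_CONSUMER: Self = KafkaClientPurpose("peek_consumer");
    pub const SYNC_PRODUCER: Self = KafkaClientPurpose("sync_producer");

    /// Inner string is exposed only where the metric label is emitted.
    pub fn as_label(self) -> &'static str {
        self.0
    }
}
```

Factory functions such as create_producer would then accept `purpose: KafkaClientPurpose` instead of `&'static str`.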

In `@apps/framework-cli/src/metrics.rs`:
- Around line 220-244: Add rustdoc comments for all public helpers to satisfy
the "Document all public APIs" rule: document try_send_metric_event to state
it's fire-and-forget / best-effort (may fail silently), document
kafka_client_created and kafka_client_dropped, function_worker_restart,
function_process_diff_updated and each record_* function with their purpose and
expected input, and document set_global_metrics_handle to state its
once-only/one-time initialization contract (what happens if called multiple
times) and kafka_client_tracking_enabled behavior; include short examples or
usage notes where helpful and call out any failure semantics or concurrency
guarantees for these symbols so callers know expected behavior.
- Around line 220-244: The current try_send_metric_event uses a bounded 32-slot
channel which can drop KafkaClientCreated/Dropped events and desync
moose_kafka_client_gauge; change this by removing the bounded mpsc for lifecycle
events and either (A) switch the MetricEvent channel to an unbounded channel
(use tokio::sync::mpsc::unbounded_channel and update the sender type so
try_send_metric_event uses unbounded_sender.send(...)) or (B, preferred) bypass
the channel for Kafka lifecycle metrics: give KafkaClientTracker direct access
to the Arc<Family<KafkaClientLabels, Gauge>> (moose_kafka_client_gauge) and
increment/decrement it in kafka_client_created and kafka_client_dropped (and
drop try_send_metric_event for these paths), ensuring no try_send is used from
Drop and avoiding lost events during bursts or shutdown.
- Around line 582-584: The function kafka_client_tracking_enabled currently
treats an empty KAFKA_METRICS_DISABLED_ENV as "disabled" because it only checks
std::env::var(...).is_err(); change it to consider the variable disabled only
when it is set to a non-empty value: read the env var with
std::env::var(KAFKA_METRICS_DISABLED_ENV) and return true (enabled) when the var
is unset or is an empty string, and return false (disabled) when the var is
present and non-empty (trim whitespace first to be safe).
- Around line 684-695: The test tracking_disabled_env_suppresses_events mutates
process-global env vars using std::env::set_var/remove_var which are unsafe on
Rust 1.94.0 and causes race conditions; wrap those calls in unsafe blocks and
mark the test with #[serial_test::serial] to prevent parallel execution, or
alternatively refactor kafka_client_tracking_enabled to accept an injectable env
value and call that from the test; ensure you update
tracking_disabled_env_suppresses_events to call unsafe {
std::env::set_var(KAFKA_METRICS_DISABLED_ENV, "1") } and unsafe {
std::env::set_var(...) / std::env::remove_var(...) } and add the serial_test
attribute so record_kafka_client_created and record_kafka_client_dropped run in
isolation.

In `@apps/framework-cli/src/utilities/system.rs`:
- Around line 343-362: The test classify_rust_child_exit_reasons uses
std::os::unix::process::ExitStatusExt and must be compiled only on Unix; add a
Unix-only configuration attribute (e.g., #[cfg(unix)]) to the test (or the
containing module) so Windows builds don't try to compile it. Locate the test
function classify_rust_child_exit_reasons and apply #[cfg(unix)] (or gate the
whole test module) to prevent Windows compilation errors while keeping the test
runnable on Unix.

In `@packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py`:
- Around line 695-703: The shutdown signal handler (shutdown) sets a
signal-based reason via _set_worker_exit_reason but the finally block later
unconditionally overwrites it with a default exit-code reason; change the
finally logic so it only sets the default py_worker_exit_code_... when no signal
reason is already recorded. Concretely, in the finally block (where you
currently call _set_worker_exit_reason for exit codes) query the existing reason
(add or use a getter like _get_worker_exit_reason or make
_set_worker_exit_reason accept an overwrite flag) and skip setting the exit-code
reason if the reason already starts with "py_worker_killed_by_" or is non-empty,
leaving the signal-set reason sticky after running.clear()/shutdown().
- Around line 59-77: Replace the bare try/except/pass in _report_worker_exit
with contextlib.suppress(Exception) to explicitly and narrowly suppress
exceptions; import contextlib if missing, wrap only the requests.post call (and
its timeout) inside the suppress block so atexit.register(_report_worker_exit)
cannot raise, and keep updating _worker_exit_reason via _set_worker_exit_reason
unchanged.

In `@packages/ts-moose-lib/src/cluster-utils.ts`:
- Around line 202-208: The code currently calls reportWorkerExit(...) before
checking this.shutdownInProgress, which causes clean shutdowns to be counted as
restarts (incrementing moose_function_worker_restarts_total); change the flow so
that you first check this.shutdownInProgress and only call reportWorkerExit and
schedule cluster.fork (setTimeout(() => cluster.fork(), RESTART_TIME_MS)) when
shutdownInProgress is false, or alter reportWorkerExit to accept a flag and skip
incrementing the restart metric when shutdownInProgress is true; update the call
site(s) (the block around reportWorkerExit, shutdownInProgress, cluster.fork,
and RESTART_TIME_MS) accordingly.

---

Outside diff comments:
In `@apps/framework-cli/src/infrastructure/stream/kafka/client.rs`:
- Around line 546-589: Update the rustdoc for the public producer/consumer
functions to reflect the new signatures and expected labels: change the return
description of create_idempotent_producer to state it returns
KafkaClientHandle<FutureProducer> (not FutureProducer), and add a new `purpose:
&'static str` parameter description to create_producer (and likewise to
create_consumer and create_subscriber) explaining that callers must supply one
of the predefined PURPOSE_* constants (reference PURPOSE_IDEMPOTENT_PRODUCER,
PURPOSE_* constants) and that these values are used as the `purpose` label for
moose_kafka_client_gauge; also list/point to the PURPOSE_* constants rather than
encouraging ad-hoc strings and update the Arguments and Returns sections
accordingly for each affected function (create_idempotent_producer,
create_producer, create_consumer, create_subscriber).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f25d073e-74fd-45e5-84e7-3bcc2d5e9809

📥 Commits

Reviewing files that changed from the base of the PR and between 2e13cfc and f4786f5.

📒 Files selected for processing (15)
  • apps/framework-cli/docs/operator/metrics.md
  • apps/framework-cli/src/cli.rs
  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/src/framework/core/infrastructure_map.rs
  • apps/framework-cli/src/infrastructure/processes/functions_registry.rs
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/models.rs
  • apps/framework-cli/src/mcp/tools/sample_stream.rs
  • apps/framework-cli/src/metrics.rs
  • apps/framework-cli/src/metrics_inserter.rs
  • apps/framework-cli/src/utilities/system.rs
  • packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py
  • packages/ts-moose-lib/src/cluster-utils.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Build CLI Binaries (linux-arm64-glibc, ubuntu-22-8-core, aarch64-unknown-linux-gnu)
  • GitHub Check: Build CLI Binaries (linux-x64-glibc, ubuntu-22-8-core, x86_64-unknown-linux-gnu)
  • GitHub Check: Build CLI Binaries (darwin-arm64, macos-14-large, aarch64-apple-darwin)
  • GitHub Check: Package and Publish Independant TS Package
  • GitHub Check: Test CLI (blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: Test CLI (macos-latest-large)
  • GitHub Check: Lints
  • GitHub Check: Check
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Analyze (rust)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Run cargo clippy to ensure Rust code passes Clippy's linting standards before each commit

**/*.rs: Use thiserror with #[derive(thiserror::Error)] for error handling in Rust; define errors near fallibility unit (NO global Error type); NEVER use anyhow::Result
Use snake_case for functions/variables, PascalCase for types/traits, SCREAMING_SNAKE_CASE for constants in Rust
Use tuple structs with validation constructors for newtypes in Rust (e.g., struct UserId(String))
Write tests inline with #[cfg(test)] modules in Rust
Documentation is required for all public APIs in Rust
Run cargo clippy --all-targets -- -D warnings pre-commit; no warnings allowed in Rust

Files:

  • apps/framework-cli/src/metrics_inserter.rs
  • apps/framework-cli/src/cli.rs
  • apps/framework-cli/src/utilities/system.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/models.rs
  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/src/framework/core/infrastructure_map.rs
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/mcp/tools/sample_stream.rs
  • apps/framework-cli/src/infrastructure/processes/functions_registry.rs
  • apps/framework-cli/src/metrics.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
apps/framework-cli/**/*.rs

📄 CodeRabbit inference engine (apps/framework-cli/AGENTS.md)

apps/framework-cli/**/*.rs: Always run cargo clippy --all-targets -- -D warnings before commits; fix all warnings - no Clippy warnings may remain (treat warnings as errors)
Use rustfmt --edition 2021 for consistent formatting
Prefer top-of-file use statements over inline ones in function bodies; add use imports for fully-qualified paths; local use for disambiguation (e.g., name collisions) is fine
Write meaningful names for functions, variables, and types
Keep functions focused and modular
Document all public APIs and breaking changes
Use thiserror crate for error handling instead of anyhow::Result
Define errors near their unit of fallibility (no global Error types)
Use #[derive(thiserror::Error)] with #[error()] messages for error structs
Structure error types as: context struct + error enum + #[source] chaining
Define newtypes as tuple structs: struct UserId(u64);
Add validation constructors for newtypes: UserId::new(id: u64) -> Result<Self, Error>
Derive standard traits for newtypes: #[derive(Debug, Clone, PartialEq)]
Implement From/TryFrom for newtype conversions
Use derive_more or nutype crates to reduce boilerplate in newtype definitions
Use const for static values (prefer over static)
Use UPPER_SNAKE_CASE naming for constants
Scope constant visibility: pub(crate) > pub(super) > pub
Group related constants together
Write unit tests for all public functions
Test error conditions and edge cases in unit and integration tests

Files:

  • apps/framework-cli/src/metrics_inserter.rs
  • apps/framework-cli/src/cli.rs
  • apps/framework-cli/src/utilities/system.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/models.rs
  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/src/framework/core/infrastructure_map.rs
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/mcp/tools/sample_stream.rs
  • apps/framework-cli/src/infrastructure/processes/functions_registry.rs
  • apps/framework-cli/src/metrics.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
**/framework-cli/src/**

⚙️ CodeRabbit configuration file

**/framework-cli/src/**: When reviewing changes to Moose CLI:

  1. Check if any user-facing changes were made (commands, flags, configs, apis, etc)
  2. If yes, verify the documentation for THAT SPECIFIC feature is updated in apps/framework-docs-v2/content
  3. If docs for that feature doesn't exist yet, it should be added. If the change removes public apis, the documentation for those should also be removed. Changing unrelated docs doesn't satisfy this requirement.

Files:

  • apps/framework-cli/src/metrics_inserter.rs
  • apps/framework-cli/src/cli.rs
  • apps/framework-cli/src/utilities/system.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/models.rs
  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/src/framework/core/infrastructure_map.rs
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/mcp/tools/sample_stream.rs
  • apps/framework-cli/src/infrastructure/processes/functions_registry.rs
  • apps/framework-cli/src/metrics.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Run linting checks before submitting PRs for TypeScript/JavaScript code

**/*.{ts,tsx,js,jsx}: Group imports by external deps, internal modules, types; use named exports from barrel files (index.ts)
Use camelCase for variable and function names, PascalCase for types/classes/components, UPPER_SNAKE_CASE for constants in TypeScript/JavaScript
Prefix unused variables with _ (e.g., _unusedParam) to bypass linting errors in TypeScript/JavaScript
Format with Prettier using experimentalTernaries: true; auto-formats on commit via Husky + lint-staged in TypeScript/JavaScript
ESLint extends Next.js, Turbo, TypeScript recommended; @typescript-eslint/no-explicit-any disabled

Files:

  • packages/ts-moose-lib/src/cluster-utils.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Prefer interfaces for objects, types for unions/intersections; explicit return types on public APIs in TypeScript

Files:

  • packages/ts-moose-lib/src/cluster-utils.ts
**/ts-moose-lib/src/**/*.ts

⚙️ CodeRabbit configuration file

**/ts-moose-lib/src/**/*.ts: When reviewing changes to typescript moose lib:

  1. Check if any public apis were changed (class, method, type, config, etc).
  2. If yes, verify the documentation for THAT SPECIFIC feature is updated in apps/framework-docs-v2/content.
  3. If docs for that feature doesn't exist yet, it should be added. If the change removes public apis, the documentation for those should also be removed. Changing unrelated docs doesn't satisfy this requirement.

Files:

  • packages/ts-moose-lib/src/cluster-utils.ts
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Follow PEP 8 style guide in Python
Use snake_case for functions/variables, PascalCase for classes, UPPER_SNAKE_CASE for constants in Python
Format Python code with Black (line-length 88); auto-formats on commit via Husky + lint-staged
Use type hints for function signatures and public APIs in Python
Use pytest with fixtures and parametrize decorators for Python tests

Files:

  • packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py
**/py-moose-lib/moose_lib/**/*.py

⚙️ CodeRabbit configuration file

**/py-moose-lib/moose_lib/**/*.py: When reviewing changes to python moose lib:

  1. Check if any public apis were changed (class, method, type, config, etc)
  2. If yes, verify the documentation for THAT SPECIFIC feature is updated in apps/framework-docs-v2/content
  3. If docs for that feature doesn't exist yet, it should be added. If the change removes public apis, the documentation for those should also be removed. Changing unrelated docs doesn't satisfy this requirement.

Files:

  • packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py
🧠 Learnings (39)
📚 Learning: 2026-02-06T01:43:06.078Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3462
File: apps/framework-cli/src/cli/routines/feedback.rs:40-72
Timestamp: 2026-02-06T01:43:06.078Z
Learning: In the Moose framework-cli (Rust), the telemetry functions capture_usage and wait_for_usage_capture use a fire-and-forget pattern: errors are logged internally and not surfaced to users. This behavior is intentional across all commands. Do not change error propagation for individual commands; any modification requires a broad refactor across the CLI telemetry layer. During reviews, treat changes to telemetry error reporting as high risk and scope changes to capture_usage/wait_for_usage_capture to the entire framework-cli, not single commands.

Applied to files:

  • apps/framework-cli/src/metrics_inserter.rs
  • apps/framework-cli/src/cli.rs
  • apps/framework-cli/src/utilities/system.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/models.rs
  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/src/framework/core/infrastructure_map.rs
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/mcp/tools/sample_stream.rs
  • apps/framework-cli/src/infrastructure/processes/functions_registry.rs
  • apps/framework-cli/src/metrics.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
📚 Learning: 2026-02-07T04:42:43.608Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3468
File: apps/framework-cli/src/cli/routines/docs.rs:878-887
Timestamp: 2026-02-07T04:42:43.608Z
Learning: For the Moose CLI under apps/framework-cli, document and implement that Windows support is best-effort only, while macOS (open) and Linux (xdg-open) are the primary targets. Treat Windows-specific code paths (e.g., using cmd /c start) as fallback implementations and ensure they are not required to be as robust as the primary platforms. When adding Windows fallbacks, include clear caveats, avoid blocking on Windows-specific behavior, and add appropriate runtime checks, logs, and TODOs. This guidance applies to Rust source files under apps/framework-cli/src (and adjacent Rust files in the same module) to keep Windows handling clearly scoped and maintainable.

Applied to files:

  • apps/framework-cli/src/metrics_inserter.rs
  • apps/framework-cli/src/cli.rs
  • apps/framework-cli/src/utilities/system.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/models.rs
  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/src/framework/core/infrastructure_map.rs
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/mcp/tools/sample_stream.rs
  • apps/framework-cli/src/infrastructure/processes/functions_registry.rs
  • apps/framework-cli/src/metrics.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
📚 Learning: 2026-04-18T22:28:25.491Z
Learnt from: callicles
Repo: 514-labs/moosestack PR: 4024
File: packages/ts-moose-lib/src/consumption-apis/runner.ts:620-635
Timestamp: 2026-04-18T22:28:25.491Z
Learning: In `packages/ts-moose-lib/src/consumption-apis/runner.ts`, the `server.on("error", ...)` handler (which calls `process.exit(1)` on any server error, not just startup `EADDRINUSE`) is intentional. The PR/ENG-2768 design requires the worker to exit non-zero on any socket error so the Rust `RestartingProcess` circuit breaker in `apps/framework-cli/src/utilities/system.rs` can count rapid failures. Narrowing to the listen phase only would silently swallow post-listen socket errors and break the circuit-breaker story. Do not suggest scoping this to a `once` / startup-only handler.

Applied to files:

  • packages/ts-moose-lib/src/cluster-utils.ts
  • packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py
📚 Learning: 2026-04-01T07:34:38.401Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: templates/typescript-agent/AGENTS.md:0-0
Timestamp: 2026-04-01T07:34:38.401Z
Learning: Applies to templates/typescript-agent/packages/moosestack-service/app/mcp/tools/**/*.ts : Return user-friendly error messages in MCP tool responses and do not expose internal error details or stack traces

Applied to files:

  • packages/ts-moose-lib/src/cluster-utils.ts
📚 Learning: 2026-04-01T07:34:38.401Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: templates/typescript-agent/AGENTS.md:0-0
Timestamp: 2026-04-01T07:34:38.401Z
Learning: Applies to templates/typescript-agent/packages/moosestack-service/app/index.ts : Export new data models, APIs, and MCP tools from `packages/moosestack-service/app/index.ts` so MooseStack discovers them automatically

Applied to files:

  • packages/ts-moose-lib/src/cluster-utils.ts
📚 Learning: 2026-04-01T07:34:51.710Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: templates/typescript-agent/packages/moosestack-service/AGENTS.md:0-0
Timestamp: 2026-04-01T07:34:51.710Z
Learning: Applies to templates/typescript-agent/packages/moosestack-service/app/index.ts : Export Moose-discovered primitives in `app/index.ts` as the public service entrypoint

Applied to files:

  • packages/ts-moose-lib/src/cluster-utils.ts
📚 Learning: 2026-04-01T07:34:38.401Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: templates/typescript-agent/AGENTS.md:0-0
Timestamp: 2026-04-01T07:34:38.401Z
Learning: Applies to templates/typescript-agent/packages/moosestack-service/app/mcp/tools/**/*.ts : Return errors in MCP tools via `{ content: [...], isError: true }` instead of throwing exceptions

Applied to files:

  • packages/ts-moose-lib/src/cluster-utils.ts
📚 Learning: 2026-03-31T03:47:14.870Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: templates/typescript-mcp/AGENTS.md:0-0
Timestamp: 2026-03-31T03:47:14.870Z
Learning: Applies to templates/typescript-mcp/packages/moosestack-service/app/apis/mcp.ts : Return user-friendly error messages in MCP tool responses instead of exposing internal error details or stack traces

Applied to files:

  • packages/ts-moose-lib/src/cluster-utils.ts
📚 Learning: 2026-04-18T22:28:28.229Z
Learnt from: callicles
Repo: 514-labs/moosestack PR: 4024
File: apps/framework-cli/src/utilities/native_infra/mod.rs:524-550
Timestamp: 2026-04-18T22:28:28.229Z
Learning: In `apps/framework-cli/src/utilities/native_infra/mod.rs` (514-labs/moosestack), `kill_pid_file` uses `kill -0` to poll liveness after SIGTERM. Zombie state is unreachable here: ClickHouse and Temporal are spawned with `kill_on_drop(false)` and the `Child` handle is never retained across CLI invocations. A subsequent `moose dev`/`moose clean` is a different process, so the kernel reparents the orphaned child to PID 1 (launchd/init), which reaps zombies immediately. Do not suggest replacing `kill -0` with `ps -o stat=` zombie detection — it adds parsing overhead for a scenario that cannot occur.

Applied to files:

  • packages/ts-moose-lib/src/cluster-utils.ts
  • apps/framework-cli/src/utilities/system.rs
📚 Learning: 2026-04-22T18:10:52.989Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: apps/framework-cli/AGENTS.md:0-0
Timestamp: 2026-04-22T18:10:52.989Z
Learning: Applies to apps/framework-cli/**/*.rs : Prefer top-of-file `use` statements over inline ones in function bodies; add `use` imports for fully-qualified paths; local `use` for disambiguation (e.g., name collisions) is fine

Applied to files:

  • apps/framework-cli/src/cli.rs
📚 Learning: 2026-04-22T18:10:52.989Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: apps/framework-cli/AGENTS.md:0-0
Timestamp: 2026-04-22T18:10:52.989Z
Learning: Applies to apps/framework-cli/**/*.rs : Test error conditions and edge cases in unit and integration tests

Applied to files:

  • apps/framework-cli/src/utilities/system.rs
📚 Learning: 2026-04-10T15:30:49.933Z
Learnt from: LucioFranco
Repo: 514-labs/moosestack PR: 3933
File: apps/devkafka/src/groups.rs:65-87
Timestamp: 2026-04-10T15:30:49.933Z
Learning: In `apps/devkafka/src/groups.rs` (514-labs/moosestack PR `#3933`), `choose_protocol()` and `remove_member()` intentionally rely on nondeterministic HashMap iteration order for protocol selection and leader failover. This is acceptable for a dev-only broker with typically 1-2 consumers. Do not flag nondeterministic protocol selection or leader election in these methods in future reviews.

Applied to files:

  • apps/framework-cli/src/infrastructure/stream/kafka/models.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
📚 Learning: 2026-01-20T15:49:26.025Z
Learnt from: 514Ben
Repo: 514-labs/moosestack PR: 3150
File: packages/py-moose-lib/moose_lib/dmv2/materialized_view.py:35-35
Timestamp: 2026-01-20T15:49:26.025Z
Learning: In the Python moose library (packages/py-moose-lib), prefer typing.Union[X, Y] syntax over PEP 604's X | Y for type annotations. Use Union when you want to explicitly convey a union in the type system, especially if readability or tooling friendliness (e.g., older type checkers) is a consideration. Reserve X | Y for simple, modern type hints where you want concise syntax and are sure the project’s type checker version supports PEP 604 without issues.

Applied to files:

  • packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py
📚 Learning: 2026-04-22T18:10:52.989Z
Learnt from: CR
Repo: 514-labs/moosestack PR: 0
File: apps/framework-cli/AGENTS.md:0-0
Timestamp: 2026-04-22T18:10:52.989Z
Learning: Applies to apps/framework-cli/**/*.rs : Document all public APIs and breaking changes

Applied to files:

  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/framework/core/infrastructure_map.rs
  • apps/framework-cli/src/metrics.rs
📚 Learning: 2026-04-09T15:11:17.424Z
Learnt from: LucioFranco
Repo: 514-labs/moosestack PR: 3936
File: apps/framework-cli/src/cli/local_webserver.rs:3088-3090
Timestamp: 2026-04-09T15:11:17.424Z
Learning: In 514-labs/moosestack, `moose dev --alpha` (NativeInfraProvider) starts ONLY ClickHouse and Temporal as native binaries — no Docker containers are launched at all (Kafka/Redis native support is not yet implemented). Skipping Docker teardown wholesale in `apps/framework-cli/src/cli/local_webserver.rs` when `project.use_native_infra == true` is intentional and correct for the current scope. This will need revisiting when native Kafka/Redis support is added.

Applied to files:

  • apps/framework-cli/src/cli/local_webserver.rs
📚 Learning: 2026-02-08T22:20:08.757Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3468
File: apps/framework-cli/src/cli/routines/docs.rs:1590-1838
Timestamp: 2026-02-08T22:20:08.757Z
Learning: In apps/framework-cli/src/cli/routines/**/*.rs (Moose CLI): For HTTP-dependent functionality like `fetch_docs`, prefer E2E or integration tests over unit tests. Unit testing HTTP fetching adds little value compared to integration testing and can be flaky due to network dependencies.

Applied to files:

  • apps/framework-cli/src/cli/local_webserver.rs
📚 Learning: 2026-03-24T19:15:00.795Z
Learnt from: DatGuyJonathan
Repo: 514-labs/moosestack PR: 3830
File: apps/framework-cli/src/infrastructure/olap/clickhouse/mod.rs:1638-1644
Timestamp: 2026-03-24T19:15:00.795Z
Learning: In `apps/framework-cli/src/infrastructure/olap/clickhouse/mod.rs`, `rls_bootstrap` (public) and `parse_row_policy_filter` (private) are intentionally covered by E2E tests in `apps/framework-cli-e2e/test/row-level-security.test.ts` (the "Row-Level Security" suite) rather than by inline `#[cfg(test)]` unit tests. Do not request additional unit tests for these functions in this file.

Applied to files:

  • apps/framework-cli/src/cli/local_webserver.rs
📚 Learning: 2026-04-18T22:28:30.509Z
Learnt from: callicles
Repo: 514-labs/moosestack PR: 4024
File: apps/framework-cli/src/cli/local_webserver.rs:600-609
Timestamp: 2026-04-18T22:28:30.509Z
Learning: In `apps/framework-cli/src/cli/local_webserver.rs` (514-labs/moosestack), the consumption-API retry loop in `get_consumption_api_res` is intentionally scoped to connect errors only (via `is_connect_error`). If `execute()` succeeds, the upstream has committed to a response; retrying on body-read failures after headers are received is unsafe because the retried response could differ. Do not suggest extending retries to body/stream errors here.

Applied to files:

  • apps/framework-cli/src/cli/local_webserver.rs
📚 Learning: 2026-04-10T15:29:00.915Z
Learnt from: LucioFranco
Repo: 514-labs/moosestack PR: 3933
File: apps/framework-cli/src/utilities/native_infra/devkafka.rs:21-39
Timestamp: 2026-04-10T15:29:00.915Z
Learning: In `apps/framework-cli/src/utilities/native_infra/devkafka.rs` (514-labs/moosestack), `health_check` intentionally uses `std::net::TcpStream::connect_timeout` (blocking) because its callers run inside a synchronous polling loop. Do not flag the blocking TCP connect as a concern; the function is not called from within an async context directly.

Applied to files:

  • apps/framework-cli/src/cli/local_webserver.rs
  • apps/framework-cli/src/cli/routines/peek.rs
  • apps/framework-cli/docs/operator/metrics.md
  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/infrastructure/processes/functions_registry.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
📚 Learning: 2026-04-10T15:16:06.497Z
Learnt from: LucioFranco
Repo: 514-labs/moosestack PR: 3933
File: apps/devkafka/src/broker.rs:134-149
Timestamp: 2026-04-10T15:16:06.497Z
Learning: In `apps/devkafka/src/broker.rs` (514-labs/moosestack), `reap_expired_members()` called inside `spawn_reaper` only performs HashMap iteration and member removal with no unwraps or fallible operations. It cannot panic, so no `catch_unwind` guard is needed. This is an intentional design for a dev-only embedded broker. Do not flag the lack of panic handling in `spawn_reaper`.

Applied to files:

  • apps/framework-cli/src/cli/routines/peek.rs
📚 Learning: 2026-02-06T01:42:35.875Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3462
File: apps/framework-cli/src/cli/routines/feedback.rs:107-119
Timestamp: 2026-02-06T01:42:35.875Z
Learning: In modules under apps/framework-cli/src/cli/routines (any Rust file under subdirectories) use raw println! for multi-line help text and formatted usage examples. Do not rely on display::show_message_wrapper for help output; reserve it for status messages and user-visible progress updates. Ensure help text remains readable and properly formatted.

Applied to files:

  • apps/framework-cli/src/cli/routines/peek.rs
📚 Learning: 2026-02-07T04:42:46.739Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3468
File: apps/framework-cli/src/cli/routines/docs.rs:40-48
Timestamp: 2026-02-07T04:42:46.739Z
Learning: In Rust files under apps/framework-cli/src/cli/routines/, prefer using inherent from_str methods for conversions when the conversion is only used in a single place and you are not using .parse() elsewhere. Avoid implementing std::str::FromStr purely for stylistic reasons if the conversion isn’t reused with .parse().

Applied to files:

  • apps/framework-cli/src/cli/routines/peek.rs
📚 Learning: 2026-02-07T06:10:34.807Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3468
File: apps/framework-cli/src/cli/routines/docs.rs:896-942
Timestamp: 2026-02-07T06:10:34.807Z
Learning: In apps/framework-cli/src/cli/routines/**/*.rs (Rust code for Moose CLI), variable shadowing is acceptable and idiomatic when progressively refining a value (e.g., from &str to String or from raw input to processed output). Consider this a preferred pattern in this codebase, but ensure shadowing is intentional and improves readability rather than obscuring the data flow.

Applied to files:

  • apps/framework-cli/src/cli/routines/peek.rs
📚 Learning: 2026-02-08T22:20:04.881Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3468
File: apps/framework-cli/src/cli/routines/docs.rs:1590-1838
Timestamp: 2026-02-08T22:20:04.881Z
Learning: For HTTP-dependent functionality under apps/framework-cli/src/cli/routines (e.g., fetch_docs), prefer E2E or integration tests over unit tests. Unit tests for HTTP fetching can be flaky due to network variability and typically add less value than integration tests. When adding tests in this directory, categorize tests as integration/E2E where they exercise real HTTP behavior or client-server interactions; mock or stub HTTP only when necessary to test internal logic, not the HTTP layer itself.

Applied to files:

  • apps/framework-cli/src/cli/routines/peek.rs
📚 Learning: 2026-02-08T22:31:17.000Z
Learnt from: oatsandsugar
Repo: 514-labs/moosestack PR: 3468
File: apps/framework-cli/src/cli/routines/docs.rs:0-0
Timestamp: 2026-02-08T22:31:17.000Z
Learning: In Moose CLI (apps/framework-cli/src/cli/routines/...), when implementing raw output mode (--raw), ensure public entry points show_toc, fetch_page, search_toc, browse_docs return an empty Message (Message::new("".to_string(), "".to_string())) to avoid polluting piped output. Also wrap show_message! calls with if !raw guards to prevent intermediate messages from appearing in piped output. This pattern applies to all Rust files under apps/framework-cli/src/cli/routines/ and should be checked during reviews.

Applied to files:

  • apps/framework-cli/src/cli/routines/peek.rs
📚 Learning: 2026-04-12T12:58:18.735Z
Learnt from: 514Ben
Repo: 514-labs/moosestack PR: 3954
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:2039-2042
Timestamp: 2026-04-12T12:58:18.735Z
Learning: 514-labs/moosestack — apps/framework-cli/src/framework/core/infrastructure_map.rs: InfrastructureMap::diff_dictionaries() uses dicts_equal_ignore_metadata(a, b) to ignore metadata when detecting dictionary updates. Treat metadata-only changes as no-ops in future reviews.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-03-24T20:17:43.845Z
Learnt from: DatGuyJonathan
Repo: 514-labs/moosestack PR: 3830
File: apps/framework-cli/src/cli/display/infrastructure.rs:675-677
Timestamp: 2026-03-24T20:17:43.845Z
Learning: In `apps/framework-cli/src/framework/core/infrastructure_map.rs` (514-labs/moosestack), renaming a `SelectRowPolicy` is always represented as a `Removed` + `Added` pair in `OlapChange`. The `Updated` variant only fires when non-name fields (column, claim, tables) change, so `before.name == after.name` is always true inside `OlapChange::SelectRowPolicy(Change::Updated { .. })`. Do not suggest rename-detection logic in that match arm.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-04-15T02:10:48.333Z
Learnt from: 514Ben
Repo: 514-labs/moosestack PR: 3997
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:2015-2121
Timestamp: 2026-04-15T02:10:48.333Z
Learning: Repo 514-labs/moosestack — apps/framework-cli/src/framework/core/infrastructure_map.rs: diff_dictionaries() re-keys both self and target maps to canonical IDs via OlapDictionary::id(default_database) before diffing. Lifecycle gating: ExternallyManaged blocks CREATE/UPDATE, DeletionProtected blocks DROP. This prevents false Added/Removed when default_database differs.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-04-12T02:31:35.851Z
Learnt from: 514Ben
Repo: 514-labs/moosestack PR: 3954
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:10256-10540
Timestamp: 2026-04-12T02:31:35.851Z
Learning: Repo 514-labs/moosestack — In apps/framework-cli/src/framework/core/infrastructure_map.rs, InfrastructureMap::diff_dictionaries() compares dictionaries by a canonical ID computed with default_database via OlapDictionary::id(default_database), preventing false Added/Removed when only default_database differs. Do not flag such cases as drift.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-04-14T16:21:40.959Z
Learnt from: 514Ben
Repo: 514-labs/moosestack PR: 3997
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:0-0
Timestamp: 2026-04-14T16:21:40.959Z
Learning: Repo 514-labs/moosestack — apps/framework-cli/src/framework/core/infrastructure_map.rs: InfrastructureMap::diff_dictionaries() must gate UPDATE filtering on target_dict.life_cycle (not the current dict.life_cycle). ExternallyManaged blocks CREATE/UPDATE; DeletionProtected blocks DROP only.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-01-26T00:56:27.011Z
Learnt from: DatGuyJonathan
Repo: 514-labs/moosestack PR: 3400
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:1238-1298
Timestamp: 2026-01-26T00:56:27.011Z
Learning: Repo 514-labs/moosestack — Workflows: The CLI’s workflow diff is intended to detect only Temporal schedule–affecting changes. In apps/framework-cli/src/framework/core/infrastructure_map.rs, workflows_config_equal should compare schedule, retries, and timeout only; it must not include tasks. Task code/config changes are picked up automatically when the orchestration worker restarts and should not trigger a WorkflowChange.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-04-12T02:31:22.486Z
Learnt from: 514Ben
Repo: 514-labs/moosestack PR: 3954
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:0-0
Timestamp: 2026-04-12T02:31:22.486Z
Learning: In 514-labs/moosestack (apps/framework-cli/src/framework/core/infrastructure_map.rs), InfrastructureMap::diff_dictionaries must match dictionaries by canonical ID computed via OlapDictionary::id(default_database) for both source and target maps. Do not compare using raw HashMap keys, as keys can differ after default-database rewrites.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-04-14T18:49:56.425Z
Learnt from: 514Ben
Repo: 514-labs/moosestack PR: 3997
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:4014-4021
Timestamp: 2026-04-14T18:49:56.425Z
Learning: Repo 514-labs/moosestack — apps/framework-cli/src/framework/core/infrastructure_map.rs: dicts_equal_ignore_metadata() now masks external-source credentials on both OlapDictionary clones via mask_dict_credentials() (in addition to clearing metadata) before comparison. mask_credentials_for_json_export() reuses mask_dict_credentials(), ensuring a single source of truth for dictionary credential redaction and preventing spurious UPDATEs after proto/Redis round-trips.

Applied to files:

  • apps/framework-cli/src/framework/core/infrastructure_map.rs
📚 Learning: 2026-04-08T00:08:36.251Z
Learnt from: phiSgr
Repo: 514-labs/moosestack PR: 3930
File: apps/framework-cli/src/cli/local_webserver.rs:3524-3535
Timestamp: 2026-04-08T00:08:36.251Z
Learning: In 514-labs/moosestack, the `/admin/inframap` and related `/admin/*` endpoints (e.g., `/admin/plan`, `/admin/integrate-changes`, `/admin/reality-check`) in `apps/framework-cli/src/cli/local_webserver.rs` are internal CLI-to-server communication endpoints. They are consumed programmatically by CLI routines (e.g., migration, remote-plan), not by end users. Do not flag missing framework-docs-v2 documentation for changes to these admin endpoints — the coding guideline requiring docs updates applies only to user-facing APIs, commands, flags, and configs.

Applied to files:

  • apps/framework-cli/docs/operator/metrics.md
📚 Learning: 2026-04-10T14:30:34.800Z
Learnt from: LucioFranco
Repo: 514-labs/moosestack PR: 3932
File: apps/framework-cli/src/cli.rs:0-0
Timestamp: 2026-04-10T14:30:34.800Z
Learning: In 514-labs/moosestack, `moose clean` calling `shutdown_embedded_servers()` in `apps/framework-cli/src/cli.rs` only closes in-memory handles in the current process. It cannot stop embedded servers (devredis, devkafka, ClickHouse/Temporal) owned by a separate `moose dev --alpha` process. Cross-process cleanup via PID files or a control socket is a known architectural gap shared across all embedded servers, intentionally deferred and out of scope for individual feature PRs. Do not flag this as a bug requiring immediate fix.

Applied to files:

  • apps/framework-cli/docs/operator/metrics.md
📚 Learning: 2026-04-10T15:16:06.242Z
Learnt from: LucioFranco
Repo: 514-labs/moosestack PR: 3933
File: apps/devkafka/src/handlers/fetch.rs:42-68
Timestamp: 2026-04-10T15:16:06.242Z
Learning: In 514-labs/moosestack, `apps/devkafka/src/handlers/fetch.rs` — `wait_any_notify` intentionally spawns one task per `Arc<Notify>` rather than using `futures::select_all`. This is an acceptable trade-off for the dev-only embedded Kafka broker (low partition count); adding a `futures` crate dependency just for `select_all` is not desired. Do not suggest replacing the spawn-based approach with `select_all` or similar combinators in future reviews.

Applied to files:

  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
  • apps/framework-cli/src/infrastructure/stream/kafka/client.rs
📚 Learning: 2026-03-04T00:09:14.325Z
Learnt from: phiSgr
Repo: 514-labs/moosestack PR: 3678
File: apps/framework-cli/src/infrastructure/olap/clickhouse/model.rs:483-483
Timestamp: 2026-03-04T00:09:14.325Z
Learning: In apps/framework-cli/src/infrastructure/olap/clickhouse/model.rs, `ClickHouseColumn` and related structs (e.g., `ClickHouseTable`, `ClickHouseIndex`) are internal CLI infrastructure types, not user-facing public APIs. Inline `//` comments are acceptable for their fields; the rustdoc `///` requirement does not apply to these internal structs.

Applied to files:

  • apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs
📚 Learning: 2026-04-10T14:28:37.548Z
Learnt from: LucioFranco
Repo: 514-labs/moosestack PR: 3932
File: apps/devredis/src/lib.rs:18-22
Timestamp: 2026-04-10T14:28:37.548Z
Learning: In 514-labs/moosestack, apps/devredis uses a crate-wide `pub type Error = Box<dyn std::error::Error + Send + Sync>` and `pub type Result<T>` alias intentionally. It is a dev-only embedded Redis server and the boxed error is acceptable here; do not flag it for thiserror/typed-error refactoring.

Applied to files:

  • apps/framework-cli/src/mcp/tools/sample_stream.rs
📚 Learning: 2026-03-04T22:31:33.997Z
Learnt from: cjus
Repo: 514-labs/moosestack PR: 3684
File: apps/framework-cli/src/framework/core/infrastructure_map.rs:0-0
Timestamp: 2026-03-04T22:31:33.997Z
Learning: Repo 514-labs/moosestack — In apps/framework-cli/src/framework/core/infrastructure_map.rs, InfrastructureMap::to_proto serializes select_row_policies and InfrastructureMap::from_proto deserializes them back into the map. Proto/Redis round-trips now preserve row policies; do not flag select_row_policies as being dropped on load in future reviews.

Applied to files:

  • apps/framework-cli/src/mcp/tools/sample_stream.rs
🪛 Ruff (0.15.11)
packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py

[warning] 67-74: Use contextlib.suppress(Exception) instead of try-except-pass

Replace try-except-pass with `with contextlib.suppress(Exception): ...`

(SIM105)


[error] 73-74: try-except-pass detected, consider logging the exception

(S110)


[warning] 73-73: Do not catch blind exception: Exception

(BLE001)


[warning] 700-700: Do not catch blind exception: Exception

(BLE001)
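The SIM105 finding above is mechanical to apply. A minimal Python sketch of the before/after (the `send` helper and reason label are hypothetical stand-ins, not the runner's actual code):

```python
import contextlib


def send(payload: dict) -> None:
    """Hypothetical network call that fails (stand-in for the real POST)."""
    raise ConnectionError("metrics endpoint unreachable")


# Before: the pattern Ruff flags (SIM105 / S110) - a silent try-except-pass.
def post_restart_reason_try(payload: dict) -> None:
    try:
        send(payload)
    except Exception:
        pass


# After: contextlib.suppress states the same intent in one statement.
def post_restart_reason_suppress(payload: dict) -> None:
    with contextlib.suppress(Exception):
        send(payload)


# Both variants swallow the failure instead of raising.
post_restart_reason_try({"reason": "crash"})
post_restart_reason_suppress({"reason": "crash"})
```

Note that `contextlib.suppress` only addresses SIM105; the S110/BLE001 complaints about silently catching bare `Exception` would still apply unless the except is narrowed or the failure is logged.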

🔇 Additional comments (6)
apps/framework-cli/src/mcp/tools/sample_stream.rs (1)

19-19: LGTM — purpose plumbing is consistent.

Import and create_consumer(..., PURPOSE_MCP_SAMPLE_CONSUMER) call match the new 3-arg signature and mirror the peek.rs update.

Also applies to: 323-327

apps/framework-cli/src/utilities/system.rs (1)

16-31: LGTM — classification + emission are clean.

Exit reason derived once from the borrowed result (no move), then should_restart consumes the owned value. Labels align with the metric series.

Also applies to: 175-205
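The classify-once-then-consume split praised above can be modeled in Python for illustration (the reason labels here are hypothetical, not the CLI's actual label set):

```python
from typing import Optional


def classify_exit(code: Optional[int]) -> str:
    """Derive the restart-metric reason once from a worker's exit status."""
    if code is None:
        return "killed_by_signal"  # hypothetical label for signal-terminated
    return "clean_exit" if code == 0 else "nonzero_exit"


def should_restart(reason: str) -> bool:
    """Decide restart from the already-derived reason, not the raw status."""
    return reason != "clean_exit"


reason = classify_exit(1)
restart = should_restart(reason)
```

Deriving the label first and threading it through both the metric emission and the restart decision keeps the two consumers guaranteed to agree on the classification.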

packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py (1)

66-77: Endpoint contract verified as correct. The /metrics-logs endpoint (lines 1350–1377 of local_webserver.rs) properly deserializes {"reason": "..."} into MetricEvent::FunctionWorkerRestart { reason }, explicitly handles it (line 1367), and always returns 200 OK without panicking. The variant is filtered from the sink batch as expected.
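The wire contract described above is just a bare single-field JSON object. A Python sketch of the payload shape (illustrative only; the reason value is hypothetical and the real workers POST this over HTTP to /metrics-logs):

```python
import json


# The worker serializes a bare {"reason": ...} object; the server's
# untagged enum matches it on the single "reason" field, no wrapper tag.
def build_restart_event(reason: str) -> str:
    return json.dumps({"reason": reason})


body = build_restart_event("nonzero_exit")  # hypothetical reason label
decoded = json.loads(body)
```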

apps/framework-cli/src/cli.rs (1)

774-776: No action needed. The execution order is safe: run_local_infrastructure_with_timeout and redis_client setup happen before Metrics::new. The window between Metrics::new (line 757) and set_global_metrics_handle (line 777) contains only start_listening_to_metrics.await, which doesn't spawn tasks or call record_* functions. Additionally, with_global_metrics gracefully handles a missing global handle by becoming a no-op.

> Likely an incorrect or invalid review comment.
packages/ts-moose-lib/src/cluster-utils.ts (1)

44-67: Payload format is correct; no parse failure will occur.

The Rust enum at line 49 uses #[serde(untagged)], and the FunctionWorkerRestart variant (lines 83–85) accepts exactly the structure being posted: { reason: String }. Untagged enums deserialize struct variants without a wrapper tag, so the bare { reason } payload matches the wire format precisely.

> Likely an incorrect or invalid review comment.
apps/framework-cli/src/infrastructure/processes/kafka_clickhouse_sync.rs (1)

34-37: LGTM — purpose tags wired through consistently.

Both the topic-to-topic and topic-to-table paths now thread PURPOSE_SYNC_CONSUMER / PURPOSE_SYNC_PRODUCER through the factories; import set is tight.

Also applies to: 484-490, 596-601

Comment on lines +72 to +79
## Kill-switch — `MOOSE_KAFKA_CLIENT_METRICS_DISABLED`

Setting this env var to a non-empty value disables `kafka_client_gauge`
instrumentation for the process. Used to rule out metric-tracking as a
CPU/memory regression source during rollout. The
`function_worker_restarts_total` and `function_process_diff_updated_total`
counters are **not** gated by this switch — they're cheap and do not
touch the hot Kafka client path.

⚠️ Potential issue | 🟡 Minor

Doc overstates the kill-switch condition.

kafka_client_tracking_enabled() in metrics.rs is std::env::var(KAFKA_METRICS_DISABLED_ENV).is_err(), so the switch trips on any set value — including empty string. Either tighten the check to reject empty (is_ok_and(|v| !v.is_empty())) or drop "non-empty" from the doc.

📝 Proposed doc tweak
-Setting this env var to a non-empty value disables `kafka_client_gauge`
-instrumentation for the process. Used to rule out metric-tracking as a
+Setting this env var to any value (including empty) disables `kafka_client_gauge`
+instrumentation for the process. Used to rule out metric-tracking as a
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change — before:

> ## Kill-switch — `MOOSE_KAFKA_CLIENT_METRICS_DISABLED`
> Setting this env var to a non-empty value disables `kafka_client_gauge`
> instrumentation for the process. Used to rule out metric-tracking as a
> CPU/memory regression source during rollout. The
> `function_worker_restarts_total` and `function_process_diff_updated_total`
> counters are **not** gated by this switch — they're cheap and do not
> touch the hot Kafka client path.

After:

> ## Kill-switch — `MOOSE_KAFKA_CLIENT_METRICS_DISABLED`
> Setting this env var to any value (including empty) disables `kafka_client_gauge`
> instrumentation for the process. Used to rule out metric-tracking as a
> CPU/memory regression source during rollout. The
> `function_worker_restarts_total` and `function_process_diff_updated_total`
> counters are **not** gated by this switch — they're cheap and do not
> touch the hot Kafka client path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/docs/operator/metrics.md` around lines 72 - 79, The doc is
incorrect: kafka_client_tracking_enabled() in metrics.rs currently treats any
set value (including empty) as disabling metrics; change the implementation in
metrics.rs (function kafka_client_tracking_enabled and reference
KAFKA_METRICS_DISABLED_ENV) to reject empty strings by using a check like
std::env::var(KAFKA_METRICS_DISABLED_ENV).is_ok_and(|v| !v.is_empty()), so the
kill-switch only trips for non-empty values (or alternatively update the doc to
remove "non-empty" if you prefer doc-only alignment).

Comment on lines +4274 to +4368
#[tokio::test]
async fn metrics_endpoint_e2e_exposes_new_churn_observability_series() {
use crate::metrics::{Metrics, TelemetryMetadata};
use hyper::service::service_fn;
use std::convert::Infallible;
use std::time::Duration;
use tokio::net::TcpListener;

let (metrics, rx) = Metrics::new(
TelemetryMetadata {
machine_id: "smoke".to_string(),
is_moose_developer: false,
metric_labels: None,
metric_endpoints: None,
is_production: false,
project_name: "smoke".to_string(),
export_metrics: false,
},
None,
);
let metrics = Arc::new(metrics);
metrics.start_listening_to_metrics(rx).await;

metrics.kafka_client_created("smoke_producer");
metrics.kafka_client_created("smoke_producer");
metrics.kafka_client_dropped("smoke_producer");
metrics.function_worker_restart("smoke_reason".to_string());
metrics.function_process_diff_updated("forced_always");
metrics.function_process_diff_updated("forced_always");

let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap();

let metrics_for_server = metrics.clone();
let server_task = tokio::spawn(async move {
loop {
let (stream, _) = match listener.accept().await {
Ok(pair) => pair,
Err(_) => return,
};
let io = TokioIo::new(stream);
let metrics_clone = metrics_for_server.clone();
tokio::spawn(async move {
let service = service_fn(move |_req: Request<Incoming>| {
let m = metrics_clone.clone();
async move {
let resp = metrics_route(m).await.unwrap();
Ok::<_, Infallible>(resp)
}
});
let _ = auto::Builder::new(TokioExecutor::new())
.serve_connection(io, service)
.await;
});
}
});

tokio::time::sleep(Duration::from_millis(100)).await;

let url = format!("http://{addr}/metrics");
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(5))
.build()
.unwrap();
let resp = client.get(&url).send().await.expect("GET /metrics failed");
assert_eq!(resp.status().as_u16(), 200);
let body = resp.text().await.expect("read body");

assert!(
body.contains("# TYPE moose_kafka_client_gauge gauge"),
"missing kafka gauge TYPE header in /metrics body:\n{body}"
);
assert!(
body.contains(r#"moose_kafka_client_gauge{purpose="smoke_producer"} 1"#),
"expected gauge value of 1 after 2 creates + 1 drop:\n{body}"
);
assert!(
body.contains("# TYPE moose_function_worker_restarts counter"),
"missing worker restarts TYPE header:\n{body}"
);
assert!(
body.contains(r#"moose_function_worker_restarts_total{reason="smoke_reason"} 1"#),
"expected worker restart counter == 1:\n{body}"
);
assert!(
body.contains("# TYPE moose_function_process_diff_updated counter"),
"missing diff updated TYPE header:\n{body}"
);
assert!(
body.contains(r#"moose_function_process_diff_updated_total{reason="forced_always"} 2"#),
"expected diff counter == 2:\n{body}"
);

server_task.abort();
}

⚠️ Potential issue | 🟡 Minor

Document the new /metrics series.

This adds user-visible metrics surface, but the PR context does not include a matching apps/framework-docs-v2/content update for these new series/labels.

As per coding guidelines, **/framework-cli/src/**: "If any user-facing changes were made ... verify the documentation for THAT SPECIFIC feature is updated in apps/framework-docs-v2/content".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/src/cli/local_webserver.rs` around lines 4274 - 4368, The
test introduces three new public Prometheus series (moose_kafka_client_gauge
with label purpose, moose_function_worker_restarts_total with label reason, and
moose_function_process_diff_updated_total with label reason) surfaced by
Metrics/Metrics::new and exposed by metrics_route; update the user docs in
apps/framework-docs-v2/content to document each series name, type (gauge or
counter), label meanings (purpose, reason), example metric lines (e.g.
moose_kafka_client_gauge{purpose="smoke_producer"} 1), and any related
configuration or opt-out notes so the new observability surface is discoverable;
also ensure the docs index/TOC includes this metrics page.

Comment on lines +1507 to +1518
let reason = if process == target_process {
"no_change"
} else {
"forced_always"
};
crate::metrics::record_function_process_diff_updated(reason);

tracing::debug!(
"FunctionProcess updated (forced, diff_reason={}): {}",
reason,
id
);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify current reason taxonomy and metric wiring for FunctionProcess diff updates.
fd -i 'metrics.rs' apps/framework-cli/src --exec rg -n "function_process_diff_updated|record_function_process_diff_updated|reason|no_change|forced_always|changed" {}
fd -i 'metrics_inserter.rs' apps/framework-cli/src --exec rg -n "FunctionProcessDiffUpdated|reason|no_change|forced_always|changed" {}
rg -n --type=rust "record_function_process_diff_updated|FunctionProcessDiffUpdated|no_change|forced_always|changed" apps/framework-cli/src

Repository: 514-labs/moosestack

Length of output: 40534


🏁 Script executed:

# View the diff_function_processes function around lines 1507-1518
sed -n '1480,1530p' apps/framework-cli/src/framework/core/infrastructure_map.rs

Repository: 514-labs/moosestack

Length of output: 2036


🏁 Script executed:

# Search for function signature and understand what this function does
rg -B 20 "let reason = if process == target_process" apps/framework-cli/src/framework/core/infrastructure_map.rs | head -60

Repository: 514-labs/moosestack

Length of output: 869


🏁 Script executed:

# Check if there are any comments or documentation about FunctionProcess diff semantics
rg -B 5 -A 5 "diff_function_processes|FunctionProcess.*diff" apps/framework-cli/src/framework/core/infrastructure_map.rs | head -80

Repository: 514-labs/moosestack

Length of output: 1243


🏁 Script executed:

# Check documentation/comments about what "forced_always" means and FunctionProcess restart semantics
rg -B 10 -A 5 "forced.*always|forced.*update.*process|FunctionProcess.*restart|restart.*process" apps/framework-cli/src/framework/core/infrastructure_map.rs | head -120

Repository: 514-labs/moosestack

Length of output: 722


🏁 Script executed:

# Check if there are similar patterns in orchestration/workflow diffs to understand the naming convention
rg -B 3 -A 3 "record_function_process_diff_updated|OrchestrationWorker.*diff|Workflow.*diff.*reason" apps/framework-cli/src/framework/core/infrastructure_map.rs | head -100

Repository: 514-labs/moosestack

Length of output: 330


🏁 Script executed:

# Look at the test to understand the metric intent
sed -n '8249,8300p' apps/framework-cli/src/framework/core/infrastructure_map.rs

Repository: 514-labs/moosestack

Length of output: 2377


🏁 Script executed:

# Look at the full test that uses "forced_always" to understand test expectations
sed -n '658,690p' apps/framework-cli/src/metrics.rs

Repository: 514-labs/moosestack

Length of output: 1337


🏁 Script executed:

# Check the full test_diff_function_process_no_changes_triggers_update to understand intent
sed -n '8249,8290p' apps/framework-cli/src/framework/core/infrastructure_map.rs

Repository: 514-labs/moosestack

Length of output: 2020


🏁 Script executed:

# Check if there are comments about why FunctionProcess is always forced to update
rg -B 20 "diff_function_processes" apps/framework-cli/src/framework/core/infrastructure_map.rs | grep -A 20 "fn diff_function_processes"

Repository: 514-labs/moosestack

Length of output: 94


🏁 Script executed:

# Search for all usages of the reason labels to see if they're hardcoded in tests/monitoring
rg "\"no_change\"|\"forced_always\"" apps/framework-cli/src --type=rust

Repository: 514-labs/moosestack

Length of output: 948


🏁 Script executed:

# Check if there's any documentation about what these metric reasons should mean
rg -B 5 -A 10 "function_process_diff_updated|FunctionProcessDiffUpdated" apps/framework-cli/src/metrics.rs | grep -A 15 "pub fn function_process_diff_updated"

Repository: 514-labs/moosestack

Length of output: 717


🏁 Script executed:

# Check if there are any monitoring dashboards or prometheus config that hardcode "forced_always"
rg "forced_always" apps/framework-cli --type=rust --type=yaml --type=json

Repository: 514-labs/moosestack

Length of output: 856


🏁 Script executed:

# Look at the git history or comments to understand why "forced_always" was chosen
git log --oneline --all -S "forced_always" -- apps/framework-cli/src/framework/core/infrastructure_map.rs | head -5

Repository: 514-labs/moosestack

Length of output: 101


🏁 Script executed:

# Read more context around the test to understand what the intent is
sed -n '8249,8330p' apps/framework-cli/src/framework/core/infrastructure_map.rs

Repository: 514-labs/moosestack

Length of output: 3600


🏁 Script executed:

# Check what the commit f4786f50 was about
git show f4786f50 --stat | head -30

Repository: 514-labs/moosestack

Length of output: 1635


🏁 Script executed:

# Check if there are any other test files that might reference these reason labels
find apps/framework-cli -name "*.rs" -type f -exec grep -l "forced_always\|no_change" {} \;

Repository: 514-labs/moosestack

Length of output: 319


🏁 Script executed:

# Check the broader context of how FunctionProcess updates are actually used/applied
rg -B 5 -A 10 "ProcessChange::FunctionProcess.*Updated" apps/framework-cli/src --type=rust | head -80

Repository: 514-labs/moosestack

Length of output: 5267


🏁 Script executed:

# Verify if the function worker actually restarts based on this diff (is it truly "forced_always"?)
rg -B 3 -A 3 "function_processes|FunctionProcess.*change" apps/framework-cli/src/framework/core/plan.rs | head -100

Repository: 514-labs/moosestack

Length of output: 45


Clarify reason labels to reflect actual vs identical diffs.

Line 1510 labels process != target_process as "forced_always", which obscures that the process config actually differs. For observability, the metric should distinguish whether a restart is due to an actual change or forced by the diff logic despite no change.

Proposed fix
-                let reason = if process == target_process {
-                    "no_change"
-                } else {
-                    "forced_always"
-                };
+                let reason = if process == target_process {
+                    "forced_no_change"
+                } else {
+                    "changed"
+                };

Update the metric tests in apps/framework-cli/src/metrics.rs (lines 673–679) and apps/framework-cli/src/cli/local_webserver.rs (lines 4301–4363) to use "changed" instead of "forced_always".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/src/framework/core/infrastructure_map.rs` around lines
1507 - 1518, The metric label for the restart reason is misleading: change the
branch that sets reason when process != target_process from "forced_always" to
"changed" (the variables involved are process, target_process, and reason) so
crate::metrics::record_function_process_diff_updated(reason) emits "changed" for
actual config differences; keep the tracing::debug call but update any tests
that assert the old label to expect "changed" instead of "forced_always" in the
metrics assertions.

Comment on lines +35 to +55
pub const PURPOSE_INGEST_PRODUCER: &str = "ingest_producer";
pub const PURPOSE_IDEMPOTENT_PRODUCER: &str = "idempotent_producer";
pub const PURPOSE_SYNC_PRODUCER: &str = "sync_producer";
pub const PURPOSE_SYNC_CONSUMER: &str = "sync_consumer";
pub const PURPOSE_PEEK_CONSUMER: &str = "peek_consumer";
pub const PURPOSE_MCP_SAMPLE_CONSUMER: &str = "mcp_sample_consumer";
pub const PURPOSE_HEALTH_PROBE: &str = "health_probe";
pub const PURPOSE_FETCH_TOPICS_CONSUMER: &str = "fetch_topics_consumer";
pub const PURPOSE_FETCH_TOPICS_ADMIN: &str = "fetch_topics_admin";
pub const PURPOSE_CHECK_TOPIC_SIZE_CONSUMER: &str = "check_topic_size_consumer";
pub const PURPOSE_ADMIN_ADD_PARTITIONS: &str = "admin_add_partitions";
pub const PURPOSE_ADMIN_UPDATE_TOPIC_CONFIG: &str = "admin_update_topic_config";
pub const PURPOSE_ADMIN_CREATE_TOPICS: &str = "admin_create_topics";
pub const PURPOSE_ADMIN_DELETE_TOPICS: &str = "admin_delete_topics";
pub const PURPOSE_ADMIN_DESCRIBE_TOPIC_CONFIG: &str = "admin_describe_topic_config";
// Option A (plan §0.6.7): function worker Kafka handles (TS/Python) are not
// rdkafka, so we approximate their presence in the gauge with one tick per
// parallel worker instance. Treat this as a proxy for "an external worker is
// alive and holding at least one Kafka socket"; actual socket count per
// worker is tracked separately on the language runtime side if/when needed.
pub const PURPOSE_FUNCTION_WORKER_ESTIMATED: &str = "function_worker_estimated";

🧹 Nitpick | 🔵 Trivial

Consider narrowing purpose constants to a newtype to prevent cardinality drift.

Right now any &'static str satisfies purpose: &'static str, so a typo'd literal at a call site silently creates a new gauge series. A tiny newtype (pub struct KafkaClientPurpose(&'static str);) with only the PURPOSE_* constructors — or a non-exhaustive enum — would give you a compile-time allowlist and match the coding guideline preference for newtype wrappers. Not a blocker for the PR, but cheap insurance while the label set is still small.
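A compile-checked sketch of that newtype under the stated assumption: the constant names mirror the PR's `PURPOSE_*` set, but this is illustrative, not the actual `client.rs` code.

```rust
// Opaque wrapper: only the constants below construct values, so a typo'd
// literal at a call site is a compile error instead of a new gauge series.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct KafkaClientPurpose(&'static str);

impl KafkaClientPurpose {
    // Inner string is exposed read-only, for metric label emission.
    pub fn as_str(&self) -> &'static str {
        self.0
    }
}

pub const PURPOSE_INGEST_PRODUCER: KafkaClientPurpose =
    KafkaClientPurpose("ingest_producer");
pub const PURPOSE_SYNC_CONSUMER: KafkaClientPurpose =
    KafkaClientPurpose("sync_consumer");

// Call sites accept the newtype rather than &'static str (hypothetical signature).
fn record_kafka_client_created(purpose: KafkaClientPurpose) -> &'static str {
    purpose.as_str()
}

fn main() {
    assert_eq!(
        record_kafka_client_created(PURPOSE_SYNC_CONSUMER),
        "sync_consumer"
    );
}
```

In the real crate the tuple field would sit in its own module so downstream code cannot construct `KafkaClientPurpose("typo")` directly.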

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/src/infrastructure/stream/kafka/client.rs` around lines 35
- 55, Replace the free-form &'static str purpose labels with a tiny newtype to
prevent accidental new labels: declare pub struct KafkaClientPurpose(&'static
str) with a private constructor (or make the struct non-exhaustive) and expose
only the existing PURPOSE_* values as constants of type KafkaClientPurpose
(e.g., PURPOSE_INGEST_PRODUCER: KafkaClientPurpose). Change any APIs that
currently accept purpose: &'static str to accept KafkaClientPurpose (look for
usages of purpose and functions/methods that take a &'static str purpose) so
callers must use the predefined constants; keep the inner string accessible only
for metrics/label emission when needed. Ensure the newtype derives
Copy/Clone/Debug/Eq/PartialEq/Hash as appropriate.

Comment on lines +68 to +72
impl Drop for KafkaClientTracker {
fn drop(&mut self) {
record_kafka_client_dropped(self.purpose);
}
}

⚠️ Potential issue | 🟡 Minor

`Drop`-driven `record_kafka_client_dropped` → `try_send` can fire from any thread/runtime context.

KafkaClientTracker::drop ends up calling Metrics::try_send_metric_event through the global handle. If a KafkaClientHandle is dropped after the tokio runtime hosting start_listening_to_metrics has shut down (e.g., during process teardown), try_send on a closed channel returns Err which is let _'d — fine. But be aware that any Drop running on a worker thread during shutdown will also skip the decrement. Combined with the bounded-channel concern raised in metrics.rs, this is why I recommend the dedicated Gauge handle path instead of routing lifecycle events through the mpsc.
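A std-only sketch of the recommended dedicated-gauge path, assuming a single global gauge for brevity (the real subsystem would keep one per `purpose` label; names are hypothetical):

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::OnceLock;

// An atomic gauge cell needs no live tokio runtime or channel receiver,
// so a Drop on any worker thread during teardown still lands.
static KAFKA_CLIENT_GAUGE: OnceLock<AtomicI64> = OnceLock::new();

fn gauge() -> &'static AtomicI64 {
    KAFKA_CLIENT_GAUGE.get_or_init(|| AtomicI64::new(0))
}

struct KafkaClientTracker;

impl KafkaClientTracker {
    fn new() -> Self {
        gauge().fetch_add(1, Ordering::Relaxed);
        KafkaClientTracker
    }
}

impl Drop for KafkaClientTracker {
    fn drop(&mut self) {
        // No mpsc send here: the decrement cannot be lost to a closed channel.
        gauge().fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    let a = KafkaClientTracker::new();
    let b = KafkaClientTracker::new();
    drop(a);
    assert_eq!(gauge().load(Ordering::Relaxed), 1);
    drop(b);
    assert_eq!(gauge().load(Ordering::Relaxed), 0);
}
```

The scrape path then reads the atomic directly instead of replaying lifecycle events from the channel.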

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/src/infrastructure/stream/kafka/client.rs` around lines 68
- 72, KafkaClientTracker::drop calls record_kafka_client_dropped which
ultimately routes through Metrics::try_send_metric_event and a global mpsc;
during runtime shutdown the receiver may be closed and drops on worker threads
can silently fail to decrement the gauge. Change the lifecycle path so drops
update a dedicated Gauge handle (or otherwise use an atomic/lock-protected gauge
API) instead of sending over the bounded mpsc: locate KafkaClientTracker::drop
and record_kafka_client_dropped and replace the try_send-based decrement with a
direct call to the Gauge decrement method (or atomic decrement) provided by the
metrics subsystem used by start_listening_to_metrics, falling back to a no-op if
the gauge handle is unavailable to avoid panics during shutdown.

Comment on lines +684 to +695
#[test]
fn tracking_disabled_env_suppresses_events() {
let prev = std::env::var(KAFKA_METRICS_DISABLED_ENV).ok();
std::env::set_var(KAFKA_METRICS_DISABLED_ENV, "1");
assert!(!kafka_client_tracking_enabled());
record_kafka_client_created("should_not_panic");
record_kafka_client_dropped("should_not_panic");
match prev {
Some(v) => std::env::set_var(KAFKA_METRICS_DISABLED_ENV, v),
None => std::env::remove_var(KAFKA_METRICS_DISABLED_ENV),
}
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the toolchain rustc version used by this crate to gauge set_var unsafety risk.
fd -t f 'rust-toolchain*' apps/framework-cli -x cat {}
rg -nP '^\s*edition\s*=' apps/framework-cli/Cargo.toml

Repository: 514-labs/moosestack

Length of output: 81


🏁 Script executed:

# Check global rust-toolchain at repo root
find . -maxdepth 1 -name 'rust-toolchain*' -type f -exec cat {} \;

# Check .github/workflows or CI config for Rust version
find .github -name '*.yml' -o -name '*.yaml' | head -5 | xargs grep -l 'rust' 2>/dev/null | head -1 | xargs cat | grep -A 5 -B 5 'rust\|toolchain' | head -30

# Check if Cargo.toml specifies MSRV
rg 'rust-version|msrv' Cargo.toml --ignore-case

Repository: 514-labs/moosestack

Length of output: 378


🏁 Script executed:

# Look for other env::set_var or env::remove_var usage in test code
rg 'env::(set_var|remove_var)' --type rs -B 2 -A 5

Repository: 514-labs/moosestack

Length of output: 89


🏁 Script executed:

# Check the test configuration - look for test settings in Cargo.toml or setup
rg '\[\[test\]\]|\[profile\.test\]' Cargo.toml -A 5

Repository: 514-labs/moosestack

Length of output: 45


🏁 Script executed:

# Fix: rg needs -t rust or just search .rs files directly
rg 'env::(set_var|remove_var)' apps/framework-cli -t rust -B 2 -A 2

Repository: 514-labs/moosestack

Length of output: 8240


🏁 Script executed:

# Check if cargo test runs serially or in parallel
rg 'test-threads|num_test_threads' apps/framework-cli/Cargo.toml

Repository: 514-labs/moosestack

Length of output: 45


🏁 Script executed:

# Look at the metrics.rs test module to understand test isolation
head -n 700 apps/framework-cli/src/metrics.rs | tail -n 50

Repository: 514-labs/moosestack

Length of output: 1906


🏁 Script executed:

# Check if serial_test is already a dependency
rg 'serial_test' apps/framework-cli/Cargo.toml

Repository: 514-labs/moosestack

Length of output: 106


Wrap env::set_var/remove_var calls in unsafe blocks and add test isolation.

On Rust 1.94.0, std::env::set_var and std::env::remove_var are unsafe fn, and this test calls them without unsafe blocks. Additionally, mutating process-global env state without isolation risks races when tests run in parallel. Use #[serial_test::serial] (already in Cargo.toml) or refactor kafka_client_tracking_enabled() to accept an injectable parameter.
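The injectable-parameter option can be sketched as a pure helper that never touches process-global state, so tests need neither `unsafe` nor serialization (function names here are hypothetical refactors, not the current code):

```rust
// Pure core: takes the env value directly. Mirrors the current is_err()
// semantics, where any set value — including "" — disables tracking.
fn tracking_enabled_from(env_value: Option<&str>) -> bool {
    env_value.is_none()
}

// Thin shim used by production code paths.
fn kafka_client_tracking_enabled() -> bool {
    tracking_enabled_from(
        std::env::var("MOOSE_KAFKA_CLIENT_METRICS_DISABLED")
            .ok()
            .as_deref(),
    )
}

fn main() {
    // Tests exercise the pure core without mutating the environment.
    assert!(tracking_enabled_from(None));
    assert!(!tracking_enabled_from(Some("1")));
    assert!(!tracking_enabled_from(Some(""))); // current semantics
    let _ = kafka_client_tracking_enabled();
}
```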

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/src/metrics.rs` around lines 684 - 695, The test
tracking_disabled_env_suppresses_events mutates process-global env vars using
std::env::set_var/remove_var which are unsafe on Rust 1.94.0 and causes race
conditions; wrap those calls in unsafe blocks and mark the test with
#[serial_test::serial] to prevent parallel execution, or alternatively refactor
kafka_client_tracking_enabled to accept an injectable env value and call that
from the test; ensure you update tracking_disabled_env_suppresses_events to call
unsafe { std::env::set_var(KAFKA_METRICS_DISABLED_ENV, "1") } and unsafe {
std::env::set_var(...) / std::env::remove_var(...) } and add the serial_test
attribute so record_kafka_client_created and record_kafka_client_dropped run in
isolation.

Comment on lines +343 to +362
#[test]
fn classify_rust_child_exit_reasons() {
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

let ok: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(0));
assert_eq!(classify_rust_child_exit(&ok), "rust_child_exit_ok");

let err_code: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(256));
assert_eq!(
classify_rust_child_exit(&err_code),
"rust_child_exit_err_code"
);

let signal: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(9));
assert_eq!(classify_rust_child_exit(&signal), "rust_child_exit_signal");

let wait_err: std::io::Result<ExitStatus> = Err(std::io::Error::other("wait failure"));
assert_eq!(classify_rust_child_exit(&wait_err), "rust_child_wait_err");
}

🧹 Nitpick | 🔵 Trivial

Test is Unix-only; gate the module/test.

std::os::unix::process::ExitStatusExt won't compile on Windows. While framework-cli Windows support is best-effort, an ungated #[test] still breaks cargo test on Windows. Add #[cfg(unix)].

🔧 Proposed fix
+    #[cfg(unix)]
     #[test]
     fn classify_rust_child_exit_reasons() {
         use std::os::unix::process::ExitStatusExt;
📝 Committable suggestion


Suggested change
#[test]
fn classify_rust_child_exit_reasons() {
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;
let ok: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(0));
assert_eq!(classify_rust_child_exit(&ok), "rust_child_exit_ok");
let err_code: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(256));
assert_eq!(
classify_rust_child_exit(&err_code),
"rust_child_exit_err_code"
);
let signal: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(9));
assert_eq!(classify_rust_child_exit(&signal), "rust_child_exit_signal");
let wait_err: std::io::Result<ExitStatus> = Err(std::io::Error::other("wait failure"));
assert_eq!(classify_rust_child_exit(&wait_err), "rust_child_wait_err");
}
#[cfg(unix)]
#[test]
fn classify_rust_child_exit_reasons() {
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;
let ok: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(0));
assert_eq!(classify_rust_child_exit(&ok), "rust_child_exit_ok");
let err_code: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(256));
assert_eq!(
classify_rust_child_exit(&err_code),
"rust_child_exit_err_code"
);
let signal: std::io::Result<ExitStatus> = Ok(ExitStatus::from_raw(9));
assert_eq!(classify_rust_child_exit(&signal), "rust_child_exit_signal");
let wait_err: std::io::Result<ExitStatus> = Err(std::io::Error::other("wait failure"));
assert_eq!(classify_rust_child_exit(&wait_err), "rust_child_wait_err");
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/framework-cli/src/utilities/system.rs` around lines 343 - 362, The test
classify_rust_child_exit_reasons uses std::os::unix::process::ExitStatusExt and
must be compiled only on Unix; add a Unix-only configuration attribute (e.g.,
#[cfg(unix)]) to the test (or the containing module) so Windows builds don't try
to compile it. Locate the test function classify_rust_child_exit_reasons and
apply #[cfg(unix)] (or gate the whole test module) to prevent Windows
compilation errors while keeping the test runnable on Unix.

Comment on lines +59 to +77
_worker_exit_reason: dict[str, str] = {"reason": "py_worker_exit_code_0"}


def _set_worker_exit_reason(reason: str) -> None:
_worker_exit_reason["reason"] = reason


def _report_worker_exit() -> None:
try:
requests.post(
f"http://localhost:{moose_management_port}/metrics-logs",
json={"reason": _worker_exit_reason["reason"]},
timeout=0.5,
)
except Exception:
pass


atexit.register(_report_worker_exit)

🧹 Nitpick | 🔵 Trivial

Minor: use contextlib.suppress and scope the guard.

Ruff flags the bare try/except/pass (SIM105/S110). Swallowing here is intentional (atexit must not raise), but contextlib.suppress(Exception) reads better and the BLE001 noise stops.

🔧 Proposed fix
+import contextlib
...
 def _report_worker_exit() -> None:
-    try:
-        requests.post(
+    with contextlib.suppress(Exception):
+        requests.post(
             f"http://localhost:{moose_management_port}/metrics-logs",
             json={"reason": _worker_exit_reason["reason"]},
             timeout=0.5,
         )
-    except Exception:
-        pass
🧰 Tools
🪛 Ruff (0.15.11)

[warning] 67-74: Use contextlib.suppress(Exception) instead of try-except-pass

Replace try-except-pass with with contextlib.suppress(Exception): ...

(SIM105)


[error] 73-74: try-except-pass detected, consider logging the exception

(S110)


[warning] 73-73: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py`
around lines 59 - 77, Replace the bare try/except/pass in _report_worker_exit
with contextlib.suppress(Exception) to explicitly and narrowly suppress
exceptions; import contextlib if missing, wrap only the requests.post call (and
its timeout) inside the suppress block so atexit.register(_report_worker_exit)
cannot raise, and keep updating _worker_exit_reason via _set_worker_exit_reason
unchanged.

Comment on lines 695 to 703
def shutdown(signum, frame):
"""Handle shutdown signals gracefully"""
log("Received shutdown signal, cleaning up...")
try:
name = signal.Signals(signum).name
except Exception:
name = f"signal_{signum}"
_set_worker_exit_reason(f"py_worker_killed_by_{name}")
running.clear() # This will trigger the main loop to exit

⚠️ Potential issue | 🟠 Major

Signal-based exit reason is clobbered in finally.

Flow on SIGTERM/SIGINT/etc.:

  1. shutdown() sets reason → py_worker_killed_by_SIGTERM.
  2. running.clear() → main loop exits.
  3. finally block unconditionally sets reason → py_worker_exit_code_0 (since fatal_error is typically not set on signal-triggered shutdown).
  4. atexit reports py_worker_exit_code_0 — the signal context is lost.

Only overwrite when no signal reason was recorded, or treat a set signal reason as sticky.

🔧 Proposed fix
-        exit_code = 1 if fatal_error.is_set() else 0
-        _set_worker_exit_reason(
-            "py_worker_exit_code_nonzero" if exit_code != 0 else "py_worker_exit_code_0"
-        )
+        exit_code = 1 if fatal_error.is_set() else 0
+        # Preserve signal-based reason if one was already set by shutdown().
+        if not _worker_exit_reason["reason"].startswith("py_worker_killed_by_"):
+            _set_worker_exit_reason(
+                "py_worker_exit_code_nonzero" if exit_code != 0 else "py_worker_exit_code_0"
+            )

Also applies to: 768-773

🧰 Tools
🪛 Ruff (0.15.11)

[warning] 695-695: Missing return type annotation for private function shutdown

Add return type annotation: None

(ANN202)


[warning] 695-695: Unused function argument: frame

(ARG001)


[warning] 700-700: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/py-moose-lib/moose_lib/streaming/streaming_function_runner.py`
around lines 695 - 703, The shutdown signal handler (shutdown) sets a
signal-based reason via _set_worker_exit_reason but the finally block later
unconditionally overwrites it with a default exit-code reason; change the
finally logic so it only sets the default py_worker_exit_code_... when no signal
reason is already recorded. Concretely, in the finally block (where you
currently call _set_worker_exit_reason for exit codes) query the existing reason
(add or use a getter like _get_worker_exit_reason or make
_set_worker_exit_reason accept an overwrite flag) and skip setting the exit-code
reason if the reason already starts with "py_worker_killed_by_" or is non-empty,
leaving the signal-set reason sticky after running.clear()/shutdown().

Comment on lines +202 to 208
try {
reportWorkerExit(code, signal);
} catch {}

if (!this.shutdownInProgress) {
setTimeout(() => cluster.fork(), RESTART_TIME_MS);
}

⚠️ Potential issue | 🟠 Major

Don't count clean shutdown exits as restarts.

This reports before the shutdownInProgress guard, so deliberate SIGTERM/SIGINT exits also increment moose_function_worker_restarts_total.

Proposed fix
-      try {
-        reportWorkerExit(code, signal);
-      } catch {}
-
       if (!this.shutdownInProgress) {
+        try {
+          reportWorkerExit(code, signal);
+        } catch {}
         setTimeout(() => cluster.fork(), RESTART_TIME_MS);
       }
📝 Committable suggestion


Suggested change
try {
reportWorkerExit(code, signal);
} catch {}
if (!this.shutdownInProgress) {
setTimeout(() => cluster.fork(), RESTART_TIME_MS);
}
if (!this.shutdownInProgress) {
try {
reportWorkerExit(code, signal);
} catch {}
setTimeout(() => cluster.fork(), RESTART_TIME_MS);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/ts-moose-lib/src/cluster-utils.ts` around lines 202 - 208, The code
currently calls reportWorkerExit(...) before checking this.shutdownInProgress,
which causes clean shutdowns to be counted as restarts (incrementing
moose_function_worker_restarts_total); change the flow so that you first check
this.shutdownInProgress and only call reportWorkerExit and schedule cluster.fork
(setTimeout(() => cluster.fork(), RESTART_TIME_MS)) when shutdownInProgress is
false, or alter reportWorkerExit to accept a flag and skip incrementing the
restart metric when shutdownInProgress is true; update the call site(s) (the
block around reportWorkerExit, shutdownInProgress, cluster.fork, and
RESTART_TIME_MS) accordingly.
