diff --git a/AGENTS.md b/AGENTS.md index 9a4daaf374c..d67e4df548b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,5 +1,3 @@ -# CLAUDE.md - Estuary is a real-time data platform with: - Control plane: user-facing catalog management APIs - Data planes: distributed runtime execution @@ -38,17 +36,18 @@ mise run ci:sql-tap # E2E tests over derivation examples (SLOW) mise run ci:catalog-test -``` -### Development - -A development Supabase instance is available: -```bash +# Start (just) local Supabase. +mise run local:supabase # Reset with current migrations as needed supabase db reset - # Interact directly with dev DB psql postgresql://postgres:postgres@localhost:5432/postgres -c 'SELECT 1;' + +# Start a complete local stack (see local/README.md) +mise run local:stack +# CLI for interacting with the platform. +cargo run -p flowctl -- --profile local --help ``` ## Architecture Overview @@ -69,6 +68,13 @@ The control plane compiles the user's catalog model into **built specs** that have extra specifics required by the runtime, and activates specs into their associated data plane. +Collections and tasks live in a unified, hierarchical namespace. +`/`-delimited prefixes act as "roles" and are the unit of AuthZ. +Users are granted capabilities to roles (`user_grants` table), +and roles are granted capabilities to other roles (`role_grants`). +A top-level prefix like `acmeCo/` homes an organization and +is called a "tenant". + ### Control-plane components - **Supabase**: catalog and platform config DB - **Agent**: APIs and background automation @@ -83,9 +89,9 @@ and activates specs into their associated data plane. 
### Protocols - `go/protocols/flow/flow.proto` - core types and built specs -- `go/protocols/capture/capture.proto` - protocol for capture tasks -- `go/protocols/derive/derive.proto` - for derivation tasks -- `go/protocols/materialize/materialize.proto` - for materialization tasks +- `go/protocols/capture/capture.proto` +- `go/protocols/derive/derive.proto` +- `go/protocols/materialize/materialize.proto` ## README.md @@ -115,8 +121,8 @@ Keep READMEs current - update with code changes. The exception is state machines: structs and enums that encapsulate fine-grain POD state into higher-order transitions that are easier to reason about. DO seek to decompose problems into state machines. -- Avoid trivial impl routines which could be inlined by the caller. - Indirection is harder to read; routines must buy us something. +- Avoid routines with trivial bodies that could be inlined into the caller. + Indirection has cost (hard to read): each routine must buy us something. - Decompose IO and POD processing into separate routines where possible. Routines should gravitate towards IO or processing, and not mix both. 
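Note on the AGENTS.md hunk above: it describes `/`-delimited prefixes acting as roles, with a top-level prefix such as `acmeCo/` called a tenant. Below is a minimal sketch of that idea, assuming a role covers a name by plain prefix match. Both `tenant_of` and `role_covers` are hypothetical helpers written for illustration; they are not taken from the repository, and the real authorization logic (`user_grants`, `role_grants`) may differ.

```rust
/// Hypothetical helper: return the top-level "tenant" prefix of a catalog
/// name, including its trailing slash (`acmeCo/` for `acmeCo/sales/orders`).
fn tenant_of(name: &str) -> Option<&str> {
    let idx = name.find('/')?;
    if idx == 0 {
        return None; // A leading slash has no tenant segment.
    }
    // '/' is a single ASCII byte, so slicing through `idx` is a valid boundary.
    Some(&name[..=idx])
}

/// Hypothetical check (assumption: prefix match): a role is a `/`-terminated
/// prefix, and it covers every name that begins with it.
fn role_covers(role: &str, name: &str) -> bool {
    role.ends_with('/') && name.starts_with(role)
}
```

Requiring the trailing slash in this sketch keeps a role like `acmeCo/sales/` from accidentally covering a sibling such as `acmeCo/salesforce/leads`.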
diff --git a/CLAUDE.md b/CLAUDE.md deleted file mode 100644 index 9a4daaf374c..00000000000 --- a/CLAUDE.md +++ /dev/null @@ -1,133 +0,0 @@ -# CLAUDE.md - -Estuary is a real-time data platform with: -- Control plane: user-facing catalog management APIs -- Data planes: distributed runtime execution -- Connectors: OCI images integrating external systems - -This repo lives at `https://github.com/estuary/flow` - -## Repository Overview - -Estuary is built with: -- **Rust** (primary language) - - Third-party sources under `~/.cargo/registry/src/` -- **Go** - integration glue with the Gazette consumer framework - - Third-party sources under `~/go/pkg/mod/` -- **Protobuf** - communication between control plane, data planes, and connectors -- **Supabase** - migrations are under `supabase/migrations/` - - pgTAP tests under `supabase/tests/` -- **Docs** - external user-facing product documentation under `site/` (Docusaurus) - -## Essential Commands - -### Build & Test - -Use regular `cargo` and `go` tools to build and test crates. - -```bash -# libsqlite3 tag is required for `bindings` and `flowctl-go` packages. 
-go build -tags libsqlite3 ./go/bindings - -# Regenerate checked-in protobuf (required after .proto changes) -mise run build:go-protobufs -mise run build:rust-protobufs - -# Run pgTAP SQL Tests -mise run ci:sql-tap - -# E2E tests over derivation examples (SLOW) -mise run ci:catalog-test -``` - -### Development - -A development Supabase instance is available: -```bash -# Reset with current migrations as needed -supabase db reset - -# Interact directly with dev DB -psql postgresql://postgres:postgres@localhost:5432/postgres -c 'SELECT 1;' -``` - -## Architecture Overview - -### Core Concepts - -Users interact with the control plane to manage a catalog of: -- **Captures**: tasks which capture from a user endpoint into target collections -- **Collections**: collections of data with enforced JSON Schema -- **Derivations**: both a collection and a task - the task builds its collection through transformation of other collections -- **Materializations**: tasks which maintain materialized views of source collections in an endpoint -- **Tests**: fixtures of source collection inputs and expected derivation outputs - -Collections and tasks have a declarative (JSON/YAML) **model**. -Users refine model changes in **drafts**, which are **published** -to the control plane for verification and testing. -The control plane compiles the user's catalog model into -**built specs** that have extra specifics required by the runtime, -and activates specs into their associated data plane. 
- -### Control-plane components -- **Supabase**: catalog and platform config DB -- **Agent**: APIs and background automation -- **Data-plane controller**: provisions data planes - -### Data-plane components -- **Gazette**: brokers serve the journals that back collections -- **Reactors**: runtime written to Gazette consumer framework; - executes tasks and runs connectors as sidecars over gRPC -- **Etcd**: config for gazette and reactors - -### Protocols - -- `go/protocols/flow/flow.proto` - core types and built specs -- `go/protocols/capture/capture.proto` - protocol for capture tasks -- `go/protocols/derive/derive.proto` - for derivation tasks -- `go/protocols/materialize/materialize.proto` - for materialization tasks - -## README.md - -Every crate/module should have a README.md with essential context: -- Purpose and fit within the project -- Key types and entry points -- Brief architecture and non-obvious details - -A README.md is ONLY a roadmap for expert developers, -orienting them where to look next. - -Keep READMEs current - update with code changes. - -## Development Guidelines - -### Implementation -- Use `var myVar = ...` in Go. Do NOT use `myVar := ...` (unless required due to shadowing) -- Write comments that document "why" - rationale, broader context, and non-obvious detail -- Do NOT write comments which describe the obvious behavior of code. - Don't write `// Get credentials` before a call `getCredentials()` -- Use early-return over nested conditionals -- Use at least one level of name qualification for third-party types and functions. - For example, `axum::Router::new()` instead of `use axum::Router; Router::new()`. - Types / functions should be unqualified ONLY if they're in the current module. -- Prefer pure functions that take and act over POD states. - AVOID structures that mix complex state and impl behaviors, where possible. 
- The exception is state machines: structs and enums that encapsulate fine-grain - POD state into higher-order transitions that are easier to reason about. - DO seek to decompose problems into state machines. -- Avoid trivial impl routines which could be inlined by the caller. - Indirection is harder to read; routines must buy us something. -- Decompose IO and POD processing into separate routines where possible. - Routines should gravitate towards IO or processing, and not mix both. - -### Testing -- Prefer snapshots over fine-grain assertions (`insta` / `cupaloy`) - -### Errors -- Wrap errors with context (`anyhow::Context` / `fmt.Errorf`) -- Return errors up the stack rather than logging -- Panic on impossible states (do NOT add spurious error handling) - -### Logging -- Structured logging with context (`tracing` / `logrus`) -- Avoid verbose logging in hot paths diff --git a/CLAUDE.md b/CLAUDE.md new file mode 120000 index 00000000000..47dc3e3d863 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1 @@ +AGENTS.md \ No newline at end of file diff --git a/Cargo.lock b/Cargo.lock index f93a95f221c..703697f0bbe 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1030,7 +1030,7 @@ dependencies = [ "http 1.3.1", "http-body 0.4.6", "hyper 0.14.32", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-rustls 0.24.2", "hyper-rustls 0.27.7", "hyper-util", @@ -1178,7 +1178,7 @@ dependencies = [ "http 1.3.1", "http-body 1.0.1", "http-body-util", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-util", "itoa", "matchit", @@ -1258,16 +1258,16 @@ dependencies = [ [[package]] name = "axum-server" -version = "0.7.2" +version = "0.7.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "495c05f60d6df0093e8fb6e74aa5846a0ad06abaf96d76166283720bf740f8ab" +checksum = "c1ab4a3ec9ea8a657c72d99a03a824af695bd0fb5ec639ccbd9cd3543b41a5f9" dependencies = [ "arc-swap", "bytes", "fs-err", "http 1.3.1", "http-body 1.0.1", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-util", "pin-project-lite", "rustls 0.23.32", 
@@ -1412,6 +1412,8 @@ dependencies = [ "prost", "proto-flow", "runtime", + "runtime-next", + "rustls 0.23.32", "serde", "serde_json", "thiserror 2.0.17", @@ -3135,6 +3137,17 @@ dependencies = [ "pin-project-lite", ] +[[package]] +name = "evmap" +version = "11.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1b8874945f036109c72242964c1174cf99434e30cfa45bf45fedc983f50046f8" +dependencies = [ + "hashbag", + "left-right", + "smallvec", +] + [[package]] name = "extractors" version = "0.0.0" @@ -3330,7 +3343,6 @@ dependencies = [ "assert_cmd", "async-process", "axum", - "base64 0.22.1", "build", "bytes", "chrono", @@ -3620,6 +3632,7 @@ dependencies = [ "hexdump", "jsonwebtoken", "memchr", + "metrics", "ops", "pin-project-lite", "proto-gazette", @@ -3639,6 +3652,21 @@ dependencies = [ "uuid", ] +[[package]] +name = "generator" +version = "0.8.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "52f04ae4152da20c76fe800fa48659201d5cf627c5149ca0b707b69d7eef6cf9" +dependencies = [ + "cc", + "cfg-if", + "libc", + "log", + "rustversion", + "windows-link 0.2.0", + "windows-result 0.4.0", +] + [[package]] name = "generic-array" version = "0.14.7" @@ -3859,6 +3887,12 @@ dependencies = [ "thiserror 2.0.17", ] +[[package]] +name = "hashbag" +version = "0.1.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7040a10f52cba493ddb09926e15d10a9d8a28043708a405931fe4c6f19fac064" + [[package]] name = "hashbrown" version = "0.12.3" @@ -4138,9 +4172,9 @@ dependencies = [ [[package]] name = "hyper" -version = "1.7.0" +version = "1.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "eb3aa54a13a0dfe7fbe3a59e0c76093041720fdc77b110cc0fc260fafb4dc51e" +checksum = "6299f016b246a94207e63da54dbe807655bf9e00044f73ded42c3ac5305fbcca" dependencies = [ "atomic-waker", "bytes", @@ -4153,7 +4187,6 @@ dependencies = [ "httpdate", "itoa", "pin-project-lite", - "pin-utils", "smallvec", 
"tokio", "want", @@ -4182,7 +4215,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e3c93eb611681b207e1fe55d5a71ecf91572ec8a6705cdb6857f7d8d5242cf58" dependencies = [ "http 1.3.1", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-util", "rustls 0.23.32", "rustls-native-certs 0.8.1", @@ -4199,7 +4232,7 @@ version = "0.5.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2b90d566bffbce6a75bd8b09a05aa8c2cb1fabb6cb348f8840c9e4c90a0d83b0" dependencies = [ - "hyper 1.7.0", + "hyper 1.9.0", "hyper-util", "pin-project-lite", "tokio", @@ -4227,7 +4260,7 @@ checksum = "70206fc6890eaca9fde8a0bf71caa2ddfc9fe045ac9e5c70df101a7dbde866e0" dependencies = [ "bytes", "http-body-util", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-util", "native-tls", "tokio", @@ -4248,7 +4281,7 @@ dependencies = [ "futures-util", "http 1.3.1", "http-body 1.0.1", - "hyper 1.7.0", + "hyper 1.9.0", "ipnet", "libc", "percent-encoding", @@ -4746,6 +4779,17 @@ dependencies = [ "spin", ] +[[package]] +name = "left-right" +version = "0.11.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0f0c21e4c8ff95f487fb34e6f9182875f42c84cef966d29216bf115d9bba835a" +dependencies = [ + "crossbeam-utils", + "loom", + "slab", +] + [[package]] name = "lexical-core" version = "1.0.6" @@ -4971,6 +5015,19 @@ version = "0.4.28" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "34080505efa8e45a4b816c349525ebe327ceaa8559756f0356cba97ef3bf7432" +[[package]] +name = "loom" +version = "0.7.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "419e0dc8046cb947daa77eb95ae174acfbddb7673b4151f56d1eed8e93fbfaca" +dependencies = [ + "cfg-if", + "generator", + "scoped-tls", + "tracing", + "tracing-subscriber", +] + [[package]] name = "lru" version = "0.12.5" @@ -5064,23 +5121,24 @@ checksum = "f52b00d39961fc5b2736ea853c9cc86238e165017a493d1d5c8eac6bdc4cc273" [[package]] name = "metrics" -version = "0.24.2" 
+version = "0.24.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "25dea7ac8057892855ec285c440160265225438c3c45072613c25a4b26e98ef5" +checksum = "89550ee9f79e88fef3119de263694973a8adb26c21d75322164fb8c493039fe2" dependencies = [ - "ahash", "portable-atomic", + "rapidhash", ] [[package]] name = "metrics-exporter-prometheus" -version = "0.17.2" +version = "0.18.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2b166dea96003ee2531cf14833efedced545751d800f03535801d833313f8c15" +checksum = "1db0d8f1fc9e62caebd0319e11eaec5822b0186c171568f0480b46a0137f9108" dependencies = [ "base64 0.22.1", + "evmap", "http-body-util", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-rustls 0.27.7", "hyper-util", "indexmap 2.11.4", @@ -5088,6 +5146,7 @@ dependencies = [ "metrics", "metrics-util", "quanta", + "rustls 0.23.32", "thiserror 2.0.17", "tokio", "tracing", @@ -5099,11 +5158,15 @@ version = "0.20.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "fe8db7a05415d0f919ffb905afa37784f71901c9a773188876984b4f769ab986" dependencies = [ + "aho-corasick", "crossbeam-epoch", "crossbeam-utils", "hashbrown 0.15.5", + "indexmap 2.11.4", "metrics", + "ordered-float 4.6.0", "quanta", + "radix_trie", "rand 0.9.2", "rand_xoshiro", "sketches-ddsketch", @@ -5617,6 +5680,15 @@ dependencies = [ "num-traits", ] +[[package]] +name = "ordered-float" +version = "4.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7bb71e1b3fa6ca1c61f383464aaf2bb0e2f8e772a1f01d486832464de363b951" +dependencies = [ + "num-traits", +] + [[package]] name = "outref" version = "0.5.2" @@ -6681,6 +6753,15 @@ dependencies = [ "rand_core 0.9.3", ] +[[package]] +name = "rapidhash" +version = "4.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b5e48930979c155e2f33aa36ab3119b5ee81332beb6482199a8ecd6029b80b59" +dependencies = [ + "rustversion", +] + [[package]] name = "raw-cpuid" version 
= "11.6.0" @@ -6898,7 +6979,7 @@ dependencies = [ "http 1.3.1", "http-body 1.0.1", "http-body-util", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-rustls 0.27.7", "hyper-tls 0.6.0", "hyper-util", @@ -7129,6 +7210,7 @@ dependencies = [ "labels", "librocksdb-sys", "locate-bin", + "metrics", "models", "ops", "pbjson-types", @@ -7143,6 +7225,7 @@ dependencies = [ "rocksdb", "serde", "serde_json", + "service-kit", "shuffle", "simd-doc", "tables", @@ -7163,6 +7246,29 @@ dependencies = [ "zeroize", ] +[[package]] +name = "runtime-sidecar" +version = "0.0.0" +dependencies = [ + "anyhow", + "clap", + "flow-client-next", + "futures", + "gazette", + "proto-gazette", + "runtime-next", + "rustls 0.23.32", + "service-kit", + "shuffle", + "tokens", + "tokio", + "tokio-stream", + "tonic", + "tracing", + "tracing-subscriber", + "url", +] + [[package]] name = "rusqlite" version = "0.32.1" @@ -7428,6 +7534,12 @@ dependencies = [ "syn 2.0.106", ] +[[package]] +name = "scoped-tls" +version = "1.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e1cf6437eb19a8f4a6cc0f7dca544973b0b78843adbfeb3683d1a94a0024a294" + [[package]] name = "scopeguard" version = "1.2.0" @@ -7687,6 +7799,26 @@ dependencies = [ "unsafe-libyaml", ] +[[package]] +name = "service-kit" +version = "0.0.0" +dependencies = [ + "anyhow", + "axum", + "chrono", + "metrics", + "metrics-exporter-prometheus", + "metrics-util", + "proto-gazette", + "serde", + "serde_json", + "tokio", + "tower", + "tower-http", + "tracing", + "tracing-subscriber", +] + [[package]] name = "sha1" version = "0.10.6" @@ -7744,6 +7876,7 @@ dependencies = [ "lz4", "md5", "memchr", + "metrics", "models", "ops", "pbjson-types", @@ -7760,6 +7893,7 @@ dependencies = [ "serde-transcode", "serde_json", "serde_yaml", + "service-kit", "simd-doc", "superslice", "tables", @@ -8498,7 +8632,7 @@ checksum = "7e54bc85fc7faa8bc175c4bab5b92ba8d9a3ce893d0e9f42cc455c8ab16a9e09" dependencies = [ "byteorder", "integer-encoding", - 
"ordered-float", + "ordered-float 2.10.1", ] [[package]] @@ -8778,7 +8912,7 @@ dependencies = [ "http 1.3.1", "http-body 1.0.1", "http-body-util", - "hyper 1.7.0", + "hyper 1.9.0", "hyper-timeout", "hyper-util", "percent-encoding", diff --git a/Cargo.toml b/Cargo.toml index 255691eed56..395880fc5d3 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -102,8 +102,9 @@ lz4 = "1" lz4_flex = "0" mime = "0" memchr = "2" -metrics = "0" -metrics-exporter-prometheus = "0" +metrics = "0.24" +metrics-exporter-prometheus = "0.18" +metrics-util = "0.20" prometheus = "0" md5 = "0" num-bigint = "0" @@ -169,7 +170,7 @@ sqlx = { version = "0", features = [ ] } tokio-rustls = "0" -rustls = { version = "0" } +rustls = { version = "0", features = ["aws_lc_rs"] } rustls-pemfile = "2" webpki = "0" tempfile = "3" @@ -234,7 +235,7 @@ aws-sdk-sts = "1" # Used for the agent http server axum = { version = "0", features = ["macros"] } -axum-server = { version = "0", features = ["tls-rustls"] } +axum-server = { version = "0.7.3", features = ["tls-rustls"] } axum-extra = { version = "0", features = ["typed-header", "query"] } async-graphql = { version = "7", features = [ "chrono", diff --git a/crates/bindings/Cargo.toml b/crates/bindings/Cargo.toml index ea67f81b669..d49061f0cce 100644 --- a/crates/bindings/Cargo.toml +++ b/crates/bindings/Cargo.toml @@ -18,6 +18,7 @@ derive = { path = "../derive" } ops = { path = "../ops" } proto-flow = { path = "../proto-flow" } runtime = { path = "../runtime" } +runtime-next = { path = "../runtime-next" } anyhow = { workspace = true } bytes = { workspace = true } @@ -25,6 +26,7 @@ futures = { workspace = true } prost = { workspace = true } serde = { workspace = true } serde_json = { workspace = true } +rustls = { workspace = true } thiserror = { workspace = true } time = { workspace = true } tracing = { workspace = true } diff --git a/crates/bindings/flow_bindings.h b/crates/bindings/flow_bindings.h index a395842f506..ccd2af6d1ea 100644 --- 
a/crates/bindings/flow_bindings.h +++ b/crates/bindings/flow_bindings.h @@ -112,6 +112,17 @@ typedef struct TaskService { uintptr_t err_cap; } TaskService; +typedef struct TaskServiceV2ImplPtr { + uint8_t _private[0]; +} TaskServiceV2ImplPtr; + +typedef struct TaskServiceV2 { + struct TaskServiceV2ImplPtr *svc_ptr; + uint8_t *err_ptr; + uintptr_t err_len; + uintptr_t err_cap; +} TaskServiceV2; + struct Channel *extract_create(int32_t log_level, int32_t log_dest_fd); void extract_invoke1(struct Channel *ch, struct In1 i); @@ -131,6 +142,10 @@ struct TaskService *new_task_service(const uint8_t *config_ptr, uint32_t config_ void task_service_drop(struct TaskService *svc); +struct TaskServiceV2 *new_task_service_v2(const uint8_t *config_ptr, uint32_t config_len); + +void task_service_v2_drop(struct TaskServiceV2 *svc); + struct Channel *upper_case_create(int32_t log_level, int32_t log_dest_fd); void upper_case_invoke1(struct Channel *ch, struct In1 i); diff --git a/crates/bindings/src/lib.rs b/crates/bindings/src/lib.rs index cadaba0747a..d98f78e6446 100644 --- a/crates/bindings/src/lib.rs +++ b/crates/bindings/src/lib.rs @@ -2,4 +2,15 @@ mod extract; mod metrics; mod service; mod task_service; +mod task_service_v2; mod upper_case; + +fn install_crypto_provider() { + static ONCE: std::sync::Once = std::sync::Once::new(); + + ONCE.call_once(|| { + // `bindings` is linked into Go binaries through CGO, so Rust binary + // entrypoints cannot install rustls' process-wide provider for us. 
+ let _ = rustls::crypto::aws_lc_rs::default_provider().install_default(); + }); +} diff --git a/crates/bindings/src/task_service.rs b/crates/bindings/src/task_service.rs index 31107336766..771a2d95a79 100644 --- a/crates/bindings/src/task_service.rs +++ b/crates/bindings/src/task_service.rs @@ -21,6 +21,8 @@ pub struct TaskService { #[unsafe(no_mangle)] pub extern "C" fn new_task_service(config_ptr: *const u8, config_len: u32) -> *mut TaskService { + crate::install_crypto_provider(); + let config = unsafe { std::slice::from_raw_parts(config_ptr, config_len as usize) }; let config = proto_flow::runtime::TaskServiceConfig::decode(config).unwrap(); diff --git a/crates/bindings/src/task_service_v2.rs b/crates/bindings/src/task_service_v2.rs new file mode 100644 index 00000000000..5fcd9f9bc58 --- /dev/null +++ b/crates/bindings/src/task_service_v2.rs @@ -0,0 +1,85 @@ +use prost::Message; + +// Opaque pointer for a TaskService instance in the ABI. +#[repr(C)] +pub struct TaskServiceV2ImplPtr { + _private: [u8; 0], +} + +#[repr(C)] +pub struct TaskServiceV2 { + // Opaque service pointer. + svc_ptr: *mut TaskServiceV2ImplPtr, + + // Terminal error returned by the TaskService. 
+ err_ptr: *mut u8, + err_len: usize, + err_cap: usize, +} + +#[unsafe(no_mangle)] +pub extern "C" fn new_task_service_v2( + config_ptr: *const u8, + config_len: u32, +) -> *mut TaskServiceV2 { + crate::install_crypto_provider(); + + let config = unsafe { std::slice::from_raw_parts(config_ptr, config_len as usize) }; + let config = proto_flow::runtime::TaskServiceConfig::decode(config).unwrap(); + + let log_file = unsafe { + use std::os::unix::io::FromRawFd; + std::fs::File::from_raw_fd(config.log_file_fd) + }; + + let svc_abi = match runtime_next::TaskService::new(config, log_file) { + Ok(svc) => { + let svc_ptr = Box::leak(Box::new(svc)) as *mut runtime_next::TaskService + as *mut TaskServiceV2ImplPtr; + + TaskServiceV2 { + svc_ptr, + err_ptr: 0 as *mut u8, + err_len: 0, + err_cap: 0, + } + } + Err(err) => { + let mut err = format!("{:?}", err); + let err_ptr = err.as_mut_ptr(); + let err_cap = err.capacity(); + let err_len = err.len(); + std::mem::forget(err); + + TaskServiceV2 { + svc_ptr: 0 as *mut TaskServiceV2ImplPtr, + err_ptr, + err_len, + err_cap, + } + } + }; + + Box::leak(Box::new(svc_abi)) +} + +#[unsafe(no_mangle)] +pub extern "C" fn task_service_v2_drop(svc: *mut TaskServiceV2) { + let TaskServiceV2 { + svc_ptr, + err_ptr, + err_len, + err_cap, + } = *unsafe { Box::from_raw(svc) }; + + if svc_ptr != 0 as *mut TaskServiceV2ImplPtr { + let svc = unsafe { Box::from_raw(svc_ptr as *mut runtime_next::TaskService) }; + svc.graceful_stop(); + } + let err_ptr = if err_cap == 0 { + std::ptr::NonNull::dangling().as_ptr() + } else { + err_ptr + }; + unsafe { String::from_raw_parts(err_ptr, err_len, err_cap) }; +} diff --git a/crates/control-plane-api/src/server/authorize_task.rs b/crates/control-plane-api/src/server/authorize_task.rs index f062914f886..deb4f1be21f 100644 --- a/crates/control-plane-api/src/server/authorize_task.rs +++ b/crates/control-plane-api/src/server/authorize_task.rs @@ -39,20 +39,14 @@ pub async fn authorize_task( // checking that the 
requested capability contains a particular grant isn't enough. // For example, we wouldn't want to allow a request for `REPLICATE` just // because it also requests `READ`. + use proto_gazette::capability::{APPEND, APPLY, LIST, READ}; let required_role = match cap { - cap if (cap == proto_gazette::capability::LIST) - || (cap == proto_gazette::capability::READ) - || (cap == (proto_gazette::capability::LIST | proto_gazette::capability::READ)) => - { - models::Capability::Read - } - // We're intentionally rejecting requests for both APPLY and APPEND, as those two - // grants authorize wildly different capabilities, and no sane logic should - // need both at the same time. So as a sanity check/defense-in-depth measure - // we won't grant you a token that has both, even if we technically could. - cap if (cap == proto_gazette::capability::APPLY) - || (cap == proto_gazette::capability::APPEND) => - { + // Legacy Go shuffle module obtains LIST and READ separately. + // The `shuffle` crate and `dekaf` obtain LIST | READ jointly. + cap if (cap == LIST) || (cap == READ) || (cap == (LIST | READ)) => models::Capability::Read, + // Legacy publisher separately obtains APPLY vs APPEND tokens. + // The `publisher` crate obtains APPEND | APPLY | LIST jointly. + cap if (cap == APPLY) || (cap == APPEND) || (cap == (APPEND | APPLY | LIST)) => { models::Capability::Write } cap => { diff --git a/crates/dekaf/tests/e2e/harness.rs b/crates/dekaf/tests/e2e/harness.rs index bca972aa946..457ec1daa25 100644 --- a/crates/dekaf/tests/e2e/harness.rs +++ b/crates/dekaf/tests/e2e/harness.rs @@ -3,9 +3,12 @@ use std::collections::BTreeMap; use std::sync::OnceLock; use std::time::Duration; -/// Prefix for test namespaces. Requires storage mappings and user grants -/// provisioned by `mise run local:test-tenant`. -const TEST_NAMESPACE_PREFIX: &str = "test/dekaf"; +/// Prefix for test namespaces, provisioned by `mise run local:test-tenant`, +/// which must match the FLOW_AUTH_TOKEN credential. 
+fn test_namespace_prefix() -> String { + let tenant = std::env::var("FLOW_TEST_TENANT").expect("FLOW_TEST_TENANT must be set"); + format!("{tenant}/dekaf") +} pub fn init_tracing() { let _ = tracing_subscriber::fmt() @@ -19,7 +22,7 @@ pub fn init_tracing() { } /// Create a flowctl command configured for local stack. -/// Requires FLOW_ACCESS_TOKEN environment variable to be set. +/// Requires FLOW_AUTH_TOKEN environment variable to be set. fn flowctl_command() -> anyhow::Result { // Try to find flowctl in cargo-target/debug first (where `cargo build` puts it), // falling back to locate_bin (which checks alongside the test binary and PATH). @@ -36,11 +39,10 @@ fn flowctl_command() -> anyhow::Result { let home = std::env::var("HOME").unwrap(); let ca_cert = std::env::var("SSL_CERT_FILE").unwrap_or_else(|_| format!("{}/flow-local/ca.crt", home)); - let access_token = std::env::var("FLOW_ACCESS_TOKEN") - .context("FLOW_ACCESS_TOKEN environment variable must be set for e2e tests")?; + let auth_token = std::env::var("FLOW_AUTH_TOKEN").context("FLOW_AUTH_TOKEN must be set")?; let mut cmd = async_process::Command::new(flowctl); - cmd.env("FLOW_ACCESS_TOKEN", access_token); + cmd.env("FLOW_AUTH_TOKEN", auth_token); cmd.env("SSL_CERT_FILE", ca_cert); cmd.arg("--profile").arg("local"); Ok(cmd) @@ -61,7 +63,7 @@ impl DekafTestEnv { /// rewritten to include a unique test namespace. 
pub async fn setup(test_name: &str, fixture_yaml: &str) -> anyhow::Result { let suffix = format!("{:04x}", rand::random::()); - let namespace = format!("{TEST_NAMESPACE_PREFIX}/{test_name}/{suffix}"); + let namespace = format!("{}/{test_name}/{suffix}", test_namespace_prefix()); tracing::info!(%namespace, "Setting up test environment"); diff --git a/crates/e2e-support/tests/hello_world.rs b/crates/e2e-support/tests/hello_world.rs index c6e4b30c2da..1ed0c65e9b1 100644 --- a/crates/e2e-support/tests/hello_world.rs +++ b/crates/e2e-support/tests/hello_world.rs @@ -53,7 +53,7 @@ async fn hello_world(build: Arc, journal_client: gazette::journal .as_ref() .expect("built collection should have a spec"); - let binding = publisher::Binding::from_collection_spec(collection_spec, None) + let binding = publisher::Binding::from_collection_spec(collection_spec) .expect("should build binding from collection spec"); let factory: gazette::journal::ClientFactory = Arc::new({ @@ -118,7 +118,7 @@ async fn hello_world(build: Arc, journal_client: gazette::journal .expect("ACK write should succeed"); // Snapshot the partition listing from the Publisher's own watch. - let (_client, partitions) = publisher.binding_client(0); + let (_client, partitions) = publisher.mapped_binding_client(0); let partitions_watch = partitions.ready().await; let splits = partitions_watch.token(); let splits = splits.result().expect("partitions should be available"); diff --git a/crates/flow-client-next/src/workflows/task_collection_auth.rs b/crates/flow-client-next/src/workflows/task_collection_auth.rs index 196a451f714..d53a4b96fea 100644 --- a/crates/flow-client-next/src/workflows/task_collection_auth.rs +++ b/crates/flow-client-next/src/workflows/task_collection_auth.rs @@ -28,9 +28,8 @@ pub struct TaskCollectionAuth { /// /// `capability` is the requested capability level of the authorization. /// This is NOT a models::Capability. 
Rather, it's a bit-mask in the u32 -/// Gazette capability namespace and is restricted to: -/// - proto_gazette::capability::READ -/// - proto_gazette::capability::APPEND +/// Gazette capability namespace and is restricted to +/// proto_gazette::capability::{APPEND, APPLY, LIST, READ}. /// /// `data_plane_fqdn` is the FQDN of the data-plane hosting the task. /// diff --git a/crates/flowctl/Cargo.toml b/crates/flowctl/Cargo.toml index 6b997abb0c3..30872ed1d1e 100644 --- a/crates/flowctl/Cargo.toml +++ b/crates/flowctl/Cargo.toml @@ -45,7 +45,6 @@ validation = { path = "../validation" } anyhow = { workspace = true } axum = { workspace = true } -base64 = { workspace = true } bytes = { workspace = true } chrono = { workspace = true } clap = { workspace = true } diff --git a/crates/flowctl/src/config.rs b/crates/flowctl/src/config.rs index 68865e02193..29968810cbf 100644 --- a/crates/flowctl/src/config.rs +++ b/crates/flowctl/src/config.rs @@ -1,5 +1,4 @@ use anyhow::Context; -use base64::Engine; use std::path::PathBuf; use flow_client::{ @@ -158,26 +157,26 @@ impl Config { config.user_refresh_token = refresh_token; } - // If a refresh token is not defined, attempt to parse one from the environment. - if config.user_refresh_token.is_none() { - if let Ok(env_token) = std::env::var(FLOW_AUTH_TOKEN) { - let decoded = base64::engine::general_purpose::STANDARD - .decode(env_token) + // FLOW_AUTH_TOKEN, if present, overrides a refresh or access token + // loaded from disk. A value with three dot-delimited segments is a JWT + // access token; anything else is a base64-encoded refresh token JSON. 
+ if let Ok(env_token) = std::env::var(FLOW_AUTH_TOKEN) { + if env_token.split('.').count() == 3 { + tracing::info!("using FLOW_AUTH_TOKEN environment access token"); + config.user_refresh_token = None; + config.user_access_token = Some(env_token); + } else { + let decoded = tokens::jwt::parse_base64(&env_token) .context("FLOW_AUTH_TOKEN is not base64")?; let token: RefreshToken = serde_json::from_slice(&decoded).context("FLOW_AUTH_TOKEN is invalid JSON")?; - tracing::info!("using refresh token from environment variable {FLOW_AUTH_TOKEN}"); + tracing::info!("using FLOW_AUTH_TOKEN environment refresh token"); config.user_refresh_token = Some(token); + config.user_access_token = None; } } - // Allow overriding access token via environment variable for CI/automation. - if let Ok(token) = std::env::var(FLOW_ACCESS_TOKEN) { - tracing::info!("using access token from environment variable {FLOW_ACCESS_TOKEN}"); - config.user_access_token = Some(token); - } - config.is_local = profile == "local"; Ok(config) @@ -231,7 +230,6 @@ impl Config { } } -// Environment variable which is inspected for a base64-encoded refresh token. +// Environment variable inspected for an auth credential: either a JWT access +// token, or a base64-encoded refresh token JSON. const FLOW_AUTH_TOKEN: &str = "FLOW_AUTH_TOKEN"; -// Environment variable which is inspected for an access token (for CI/automation). -const FLOW_ACCESS_TOKEN: &str = "FLOW_ACCESS_TOKEN"; diff --git a/crates/flowctl/src/raw/preview_next/driver.rs b/crates/flowctl/src/raw/preview_next/driver.rs index 93188122a71..316ccf7a156 100644 --- a/crates/flowctl/src/raw/preview_next/driver.rs +++ b/crates/flowctl/src/raw/preview_next/driver.rs @@ -108,6 +108,7 @@ async fn drive_one_shard( None, task_name, publisher_factory, + Default::default(), // Inert service_kit::Registry. 
); let mut response_rx = shard_svc.spawn_materialize(UnboundedReceiverStream::new(request_rx)); @@ -161,8 +162,6 @@ async fn drive_one_shard( .send(Ok(proto::Materialize { task: Some(proto::Task { spec: spec_bytes.clone(), - ops_stats_journal: String::new(), - ops_stats_spec: None, preview: true, max_transactions: target_txns, }), diff --git a/crates/flowctl/src/raw/preview_next/services.rs b/crates/flowctl/src/raw/preview_next/services.rs index eeed25d8234..0cbc7e50cae 100644 --- a/crates/flowctl/src/raw/preview_next/services.rs +++ b/crates/flowctl/src/raw/preview_next/services.rs @@ -59,15 +59,20 @@ impl Run { user_tokens, ); // 2 GiB matches the runtime-next default for sidecar-hosted shuffle services. - let shuffle_svc = - shuffle::Service::new(peer_endpoint.clone(), factory, 2 * 1024 * 1024 * 1024); + let shuffle_svc = shuffle::Service::new( + peer_endpoint.clone(), + factory, + 2 * 1024 * 1024 * 1024, + Default::default(), // Inert service_kit::Registry. + ); let publisher_factory: gazette::journal::ClientFactory = std::sync::Arc::new({ move |_authz_sub: String, _authz_obj: String| -> gazette::journal::Client { unreachable!("live Publisher is not used by preview ({_authz_sub}, {_authz_obj})") } }); - let runtime_svc = runtime_next::Service::new(shuffle_svc.clone(), publisher_factory); + let runtime_svc = + runtime_next::Service::new(shuffle_svc.clone(), publisher_factory, Default::default()); let router = tonic::transport::Server::builder() .add_service(runtime_svc.into_tonic_service()) diff --git a/crates/gazette/Cargo.toml b/crates/gazette/Cargo.toml index 9a4cb3dc3c8..a379fadb39d 100644 --- a/crates/gazette/Cargo.toml +++ b/crates/gazette/Cargo.toml @@ -25,6 +25,7 @@ futures = { workspace = true } futures-core = { workspace = true } jsonwebtoken = { workspace = true } memchr = { workspace = true } +metrics = { workspace = true } rand = { workspace = true } reqwest = { workspace = true } thiserror = { workspace = true } diff --git 
a/crates/gazette/src/journal/append.rs b/crates/gazette/src/journal/append.rs index 6c9f0e9a22d..dcc874dd1a3 100644 --- a/crates/gazette/src/journal/append.rs +++ b/crates/gazette/src/journal/append.rs @@ -19,9 +19,10 @@ impl Client { { coroutines::coroutine(move |mut co| async move { let mut attempt = 0; + let metrics = Metrics::new(&req.journal); loop { - let err = match self.try_append(&mut req, source()).await { + let err = match self.try_append(metrics.clone(), &mut req, source()).await { Ok(resp) => { () = co.yield_(Ok(resp)).await; return; @@ -49,6 +50,7 @@ impl Client { async fn try_append( &self, + metrics: Metrics, req: &mut broker::AppendRequest, source: S, ) -> crate::Result @@ -67,16 +69,19 @@ impl Client { // the initial metadata request containing the journal name and any other request metadata, then // "data" requests that contain chunks of data to write, then the final EOF indicating completion. let source = futures::stream::once(async move { Ok(req_clone) }) - .chain(source.filter_map(|input| { + .chain(source.filter_map(move |input| { futures::future::ready(match input { // It's technically possible to get an empty set of bytes when reading // from the input stream. Filter these out as otherwise they would look // like EOFs to the append RPC and cause confusion. 
Ok(content) if content.len() == 0 => None, - Ok(content) => Some(Ok(broker::AppendRequest { - content, - ..Default::default() - })), + Ok(content) => { + metrics.append.increment(content.len() as u64); + Some(Ok(broker::AppendRequest { + content, + ..Default::default() + })) + } Err(err) => Some(Err(err)), }) })) @@ -120,3 +125,24 @@ impl Client { } } } + +#[derive(Clone)] +struct Metrics { + append: metrics::Counter, +} + +impl Metrics { + fn new(journal: &str) -> Self { + static DESCRIBE: std::sync::Once = std::sync::Once::new(); + DESCRIBE.call_once(|| { + metrics::describe_counter!( + "gazette_append", + metrics::Unit::Bytes, + "number of bytes appended to a journal", + ); + }); + let append = metrics::counter!("gazette_append", "journal" => journal.to_string()); + + Self { append } + } +} diff --git a/crates/gazette/src/journal/read/mod.rs b/crates/gazette/src/journal/read/mod.rs index e64bf88d998..3729edcc0a0 100644 --- a/crates/gazette/src/journal/read/mod.rs +++ b/crates/gazette/src/journal/read/mod.rs @@ -20,8 +20,9 @@ impl Client { ) -> impl futures::Stream> + Send + 'static { coroutines::coroutine(move |mut co| async move { - let mut write_head = i64::MAX; let mut attempt = 0; + let mut write_head = i64::MAX; + let metrics = Metrics::new(&req.journal); loop { // Have we read through requested `end_offset`? 
@@ -33,7 +34,10 @@ impl Client { return; } - let err = match self.read_some(&mut co, &mut req, &mut write_head).await { + let err = match self + .read_some(&mut co, metrics.clone(), &mut req, &mut write_head) + .await + { Ok(()) => { attempt = 0; continue; @@ -62,6 +66,7 @@ impl Client { async fn read_some( &self, co: &mut coroutines::Suspend, ()>, + metrics: Metrics, req: &mut broker::ReadRequest, write_head: &mut i64, ) -> crate::Result<()> { @@ -95,7 +100,10 @@ impl Client { { *write_head = metadata.write_head; req.offset = metadata.write_head; + () = co.yield_(Ok(metadata)).await; + metrics.tick(req.offset, *write_head); + return Ok(()); } else if metadata.status() != broker::Status::Ok { // Note: we used to fall through and retry below on !Ok. That was @@ -115,11 +123,32 @@ impl Client { tracing::info!(req.journal, req.offset, metadata.offset, "offset jump"); req.offset = metadata.offset; } - *write_head = metadata.write_head; + let (fragment, fragment_url) = (fragment.clone(), metadata.fragment_url.clone()); () = co.yield_(Ok(metadata)).await; - return read_fragment_url(co, fragment, &self.fragment_client, fragment_url, req).await; + metrics.tick(req.offset, *write_head); + + return read_fragment_url( + co, + metrics, + fragment, + &self.fragment_client, + fragment_url, + req, + *write_head, + ) + .await; + } + + // We skipped the direct-fragment path. If the broker returned a + // `file://` URL, the fragment is persisted but lives on the broker's + // local filesystem — we have no transport to read it ourselves, so + // we must ask the broker to proxy. With `do_not_proxy=true` and no + // open spool file, `serveRead` short-circuits after sending only the + // fragment metadata, EOFs the stream, and the outer loop spins. 
+ if metadata.fragment_url.starts_with("file://") { + req.do_not_proxy = false; } tracing::trace!(req.offset, write_head, "started direct journal read"); @@ -148,12 +177,16 @@ impl Client { req.offset = resp.offset; } *write_head = resp.write_head; + () = co.yield_(Ok(resp)).await; + metrics.tick(req.offset, *write_head); } // Content response. (broker::Status::Ok, None, false) => { req.offset += resp.content.len() as i64; + () = co.yield_(Ok(resp)).await; + metrics.tick(req.offset, *write_head); } // All other statuses end the stream, and are handled by the caller. (status, _, _) => return Err(Error::BrokerStatus(status)), @@ -166,10 +199,12 @@ impl Client { async fn read_fragment_url( co: &mut coroutines::Suspend, ()>, + metrics: Metrics, fragment: broker::Fragment, fragment_client: &reqwest::Client, fragment_url: String, req: &mut broker::ReadRequest, + write_head: i64, ) -> crate::Result<()> { let mut get = fragment_client.get(fragment_url); @@ -201,16 +236,16 @@ async fn read_fragment_url( match fragment.compression_codec() { broker::CompressionCodec::None | broker::CompressionCodec::GzipOffloadDecompression => { - read_fragment_url_body(co, fragment, raw_reader, req).await + read_fragment_url_body(co, metrics, fragment, raw_reader, req, write_head).await } broker::CompressionCodec::Gzip => { let mut decoder = async_compression::futures::bufread::GzipDecoder::new(raw_reader); decoder.multiple_members(true); - read_fragment_url_body(co, fragment, decoder, req).await + read_fragment_url_body(co, metrics, fragment, decoder, req, write_head).await } broker::CompressionCodec::Zstandard => { let decoder = async_compression::futures::bufread::ZstdDecoder::new(raw_reader); - read_fragment_url_body(co, fragment, decoder, req).await + read_fragment_url_body(co, metrics, fragment, decoder, req, write_head).await } broker::CompressionCodec::Snappy => Err(Error::Protocol( "snappy compression is not yet implemented by this client", @@ -223,9 +258,11 @@ async fn 
read_fragment_url( async fn read_fragment_url_body( co: &mut coroutines::Suspend, ()>, + metrics: Metrics, fragment: broker::Fragment, r: impl futures::io::AsyncRead, req: &mut broker::ReadRequest, + write_head: i64, ) -> crate::Result<()> { use bytes::Buf; use tokio_util::compat::FuturesAsyncReadCompatExt; @@ -275,7 +312,53 @@ async fn read_fragment_url_body( .await; req.offset += content_len; + metrics.tick(req.offset, write_head); + metrics.fragment.increment(content_len as u64); } Ok(()) } + +#[derive(Clone)] +struct Metrics { + offset: metrics::Gauge, + remainder: metrics::Gauge, + fragment: metrics::Counter, +} + +impl Metrics { + fn new(journal: &str) -> Self { + static DESCRIBE: std::sync::Once = std::sync::Once::new(); + DESCRIBE.call_once(|| { + metrics::describe_gauge!( + "gazette_read_offset", + metrics::Unit::Bytes, + "current read offset for a journal", + ); + metrics::describe_gauge!( + "gazette_read_remainder", + metrics::Unit::Bytes, + "distance from current read offset to write head for a journal", + ); + metrics::describe_counter!( + "gazette_read_fragment", + metrics::Unit::Bytes, + "number of bytes directly read from journal fragment files", + ); + }); + let offset = metrics::gauge!("gazette_read_offset", "journal" => journal.to_string()); + let remainder = metrics::gauge!("gazette_read_remainder", "journal" => journal.to_string()); + let fragment = metrics::counter!("gazette_read_fragment", "journal" => journal.to_string()); + + Self { + offset, + remainder, + fragment, + } + } + + fn tick(&self, offset: i64, write_head: i64) { + self.offset.set(offset as f64); + self.remainder.set((write_head - offset) as f64); + } +} diff --git a/crates/labels/src/lib.rs b/crates/labels/src/lib.rs index 535f421fb67..32362fd7438 100644 --- a/crates/labels/src/lib.rs +++ b/crates/labels/src/lib.rs @@ -23,7 +23,7 @@ pub const TASK_TYPE_CAPTURE: &str = "capture"; pub const TASK_TYPE_DERIVATION: &str = "derivation"; pub const TASK_TYPE_MATERIALIZATION: &str = 
"materialization"; pub const RCLOCK_BEGIN: &str = "estuary.dev/rclock-begin"; -pub const RCLOCK_BEGIN_MIN: &str = KEY_BEGIN; +pub const RCLOCK_BEGIN_MIN: &str = KEY_BEGIN_MIN; pub const RCLOCK_END: &str = "estuary.dev/rclock-end"; pub const RCLOCK_END_MAX: &str = KEY_END_MAX; pub const SPLIT_TARGET: &str = "estuary.dev/split-target"; diff --git a/crates/models/src/schemas.rs b/crates/models/src/schemas.rs index 3e09c32e61e..5c1f4a914f4 100644 --- a/crates/models/src/schemas.rs +++ b/crates/models/src/schemas.rs @@ -215,6 +215,8 @@ struct RelaxedSchemaObj { _min_length: Option, #[serde(rename = "maxLength", default, skip_serializing)] _max_length: Option, + #[serde(rename = "redact", default, skip_serializing)] + _redact: Option, // Other keywords are passed-through. #[serde(flatten)] @@ -416,6 +418,33 @@ mod test { ); } + #[test] + fn test_relaxation_drops_redact() { + let schema = schema!({ + "type": "object", + "properties": { + "email": { + "type": "string", + "redact": { "strategy": "sha256" }, + "description": "Sensitive key component" + } + } + }); + + let relaxed = schema.to_relaxed_schema().unwrap().to_value(); + + assert_eq!( + relaxed, + json!({ + "properties": { + "email": { + "description": "Sensitive key component" + } + } + }) + ); + } + #[test] fn test_relaxation_drops_additional_properties_false() { let schema = schema!({ diff --git a/crates/proto-flow/build.rs b/crates/proto-flow/build.rs index e22c5f8c3b0..added534998 100644 --- a/crates/proto-flow/build.rs +++ b/crates/proto-flow/build.rs @@ -40,7 +40,7 @@ fn main() { // to BTreeMap<&str, &RawValue>, and deserialize in reverse. // * Fields ending in "_json_vec" are mapped from Vec to Vec<&RawValue>, // and deserialize in reverse as well. 
- // * Our stats documents' bytesTotal and docsTotal fields are typed as u64 to allow for + // * Our stats documents' bytesTotal, docsTotal, and bytesBehind fields are typed as u64 to allow for // tallying relatively large values in a single document but we do not want // this value serialized as a string, so we remove the string conversion. @@ -53,7 +53,7 @@ fn main() { regex::Regex::new(r#"struct_ser\.serialize_field\((".+"), &(self\..*_json_vec)\.iter\(\).map\(pbjson::private::base64::encode\).collect::>\(\)\)\?"#) .unwrap(); let ser_int64_re = - regex::Regex::new(r#"struct_ser\.serialize_field\("(bytesTotal|docsTotal)", ToString::to_string\(&self\.(bytes_total|docs_total)\).as_str\(\)\)\?;"#) + regex::Regex::new(r#"struct_ser\.serialize_field\("(bytesTotal|docsTotal|bytesBehind)", ToString::to_string\(&self\.(bytes_total|docs_total|bytes_behind)\).as_str\(\)\)\?;"#) .unwrap(); let de_json_re = @@ -121,7 +121,7 @@ fn main() { buf.replace_range(range, &format!("crate::RawJSONDeserialize")); } - // Handle serializing "bytesTotal"/"docsTotal" as an integer rather than a quoted integer. + // Handle serializing stats counters as integers rather than quoted integers. 
while let Some(capture) = ser_int64_re.captures(&buf) { let range = capture.get(0).unwrap().range(); buf.replace_range( diff --git a/crates/proto-flow/src/ops.serde.rs b/crates/proto-flow/src/ops.serde.rs index b18f3cffd54..1d9777f6cdc 100644 --- a/crates/proto-flow/src/ops.serde.rs +++ b/crates/proto-flow/src/ops.serde.rs @@ -1353,7 +1353,7 @@ impl serde::Serialize for stats::derive::Transform { if self.bytes_behind != 0 { #[allow(clippy::needless_borrow)] #[allow(clippy::needless_borrows_for_generic_args)] - struct_ser.serialize_field("bytesBehind", ToString::to_string(&self.bytes_behind).as_str())?; + struct_ser.serialize_field("bytesBehind", &self.bytes_behind)?; } struct_ser.end() } @@ -1739,7 +1739,7 @@ impl serde::Serialize for stats::MaterializeBinding { if self.bytes_behind != 0 { #[allow(clippy::needless_borrow)] #[allow(clippy::needless_borrows_for_generic_args)] - struct_ser.serialize_field("bytesBehind", ToString::to_string(&self.bytes_behind).as_str())?; + struct_ser.serialize_field("bytesBehind", &self.bytes_behind)?; } struct_ser.end() } diff --git a/crates/proto-flow/src/runtime.rs b/crates/proto-flow/src/runtime.rs index b4abfa0eebf..dc2c872073d 100644 --- a/crates/proto-flow/src/runtime.rs +++ b/crates/proto-flow/src/runtime.rs @@ -485,22 +485,16 @@ pub struct Joined { /// Task which is being processed by the runtime. /// Sent from Controller to Shard, and from Shard zero (only) to Leader /// after Joined. Other shards do not forward Task. -#[derive(Clone, PartialEq, ::prost::Message)] +#[derive(Clone, PartialEq, Eq, Hash, ::prost::Message)] pub struct Task { /// Task specification (protobuf-encoded bytes). #[prost(bytes = "bytes", tag = "1")] pub spec: ::prost::bytes::Bytes, - /// Collection journal partition to which task states are written. - #[prost(string, tag = "2")] - pub ops_stats_journal: ::prost::alloc::string::String, - /// Collection to which task stats are written. 
- #[prost(message, optional, tag = "3")] - pub ops_stats_spec: ::core::option::Option, /// When true, documents and stats are written to output and not directed to collections. - #[prost(bool, tag = "4")] + #[prost(bool, tag = "2")] pub preview: bool, /// Preview / harness control. Zero means unlimited. - #[prost(uint32, tag = "5")] + #[prost(uint32, tag = "3")] pub max_transactions: u32, } /// Recover is sent by each shard to the leader after Joined, and carries @@ -574,7 +568,7 @@ pub struct Recover { /// zero's RocksDB. Absent fields are inert. /// /// All fields of a Persist land together in a single WriteBatch. -/// `nonce` is echoed back by the shard's Persisted response, allowing +/// `seq_no` is echoed back by the shard's Persisted response, allowing /// the leader to match a Persisted response to its originating request. #[derive(Clone, PartialEq, ::prost::Message)] pub struct Persist { @@ -582,7 +576,7 @@ pub struct Persist { /// `Persisted` response. The leader chooses any value and the shard /// does not interpret it. #[prost(uint64, tag = "1")] - pub nonce: u64, + pub seq_no: u64, /// Delete previously-persisted ACK intents. Applies ahead of `ack_intents`. /// Effect: DeleteRange("AI:") #[prost(bool, tag = "2")] @@ -599,55 +593,64 @@ pub struct Persist { /// Effect: Put under "committed-close-clock". #[prost(fixed64, tag = "4")] pub committed_close_clock: u64, + /// Delete all previously-persisted committed Frontier entries. + /// Applies ahead of `committed_frontier`. + /// Effect: DeleteRange("FC:") + #[prost(bool, tag = "5")] + pub delete_committed_frontier: bool, /// Committed Frontier entries. /// Effect: Put under "FC:..." keys. - #[prost(message, optional, tag = "5")] + #[prost(message, optional, tag = "6")] pub committed_frontier: ::core::option::Option, /// Connector state patches. State Update Wire Format. /// Effect: Merge each patch under "connector-state". 
- #[prost(bytes = "bytes", tag = "6")] + #[prost(bytes = "bytes", tag = "7")] pub connector_patches_json: ::prost::bytes::Bytes, /// Clock at which the hinted transaction closed. /// Effect: Put under "hinted-close-clock". - #[prost(fixed64, tag = "7")] + #[prost(fixed64, tag = "8")] pub hinted_close_clock: u64, /// Delete a previously-persisted hinted frontier. Applies ahead of `hinted_frontier`. /// Effect: DeleteRange("FH:") - #[prost(bool, tag = "8")] + #[prost(bool, tag = "9")] pub delete_hinted_frontier: bool, /// Hinted Frontier entries. /// Effect: Put under "FH:" keys. - #[prost(message, optional, tag = "9")] + #[prost(message, optional, tag = "10")] pub hinted_frontier: ::core::option::Option, /// Last-applied task specification (protobuf-encoded bytes), or empty. /// Effect: Put under "last-applied" key. - #[prost(bytes = "bytes", tag = "10")] + #[prost(bytes = "bytes", tag = "11")] pub last_applied: ::prost::bytes::Bytes, - /// Legacy checkpoint, required for rollback to legacy runtime. + /// Delete a previously-persisted legacy checkpoint. + /// Effect: Delete the "checkpoint" key. + #[prost(bool, tag = "12")] + pub delete_legacy_checkpoint: bool, + /// Legacy checkpoint, required for rollback to the V1 runtime. /// Effect: Put under "checkpoint" key. - #[prost(message, optional, tag = "11")] + #[prost(message, optional, tag = "13")] pub legacy_checkpoint: ::core::option::Option<::proto_gazette::consumer::Checkpoint>, /// Per-binding max-key updates, reduced to per-binding maximum across shards. /// Key: binding index; Value: packed composite key tuple. /// Effect: Put value under "MK-v2:{state_key}" (state_key resolved by the encoder). - #[prost(btree_map = "uint32, bytes", tag = "12")] + #[prost(btree_map = "uint32, bytes", tag = "14")] pub max_keys: ::prost::alloc::collections::BTreeMap, /// Delete previously-persisted trigger parameters. Applies ahead of `trigger_params_json`. /// Effect: Delete the "trigger-params" key. 
- #[prost(bool, tag = "13")] + #[prost(bool, tag = "15")] pub delete_trigger_params: bool, /// Materialization trigger parameters. /// Effect: Put under "trigger-params" key. - #[prost(bytes = "bytes", tag = "14")] + #[prost(bytes = "bytes", tag = "16")] pub trigger_params_json: ::prost::bytes::Bytes, } /// Persisted is sent by shard zero to the leader after the state is durable /// in the recovery log. #[derive(Clone, Copy, PartialEq, Eq, Hash, ::prost::Message)] pub struct Persisted { - /// Echoed back from the originating `Persist.nonce` request. + /// Echoed back from the originating `Persist.seq_no` request. #[prost(uint64, tag = "1")] - pub nonce: u64, + pub seq_no: u64, } /// Apply asks shard zero to invoke its connector's Apply action, both for /// the initial application of a new spec and for re-application after a @@ -744,6 +747,12 @@ pub struct Materialize { /// Shard → Controller. Connector's reply to `validate`. #[prost(message, optional, tag = "4")] pub validated: ::core::option::Option, + /// Controller → Shard. Effective only on unary `spec` / `validate` + /// messages, which never see the Join-time labeling that supplies the + /// log level for session-bound work. Ignored on all other variants + /// (the leader → shard messages MUST NOT set this). + #[prost(enumeration = "super::ops::log::Level", tag = "5")] + pub log_level: i32, /// Controller → Shard. First message of a session-loop stream; /// never sent to the Leader. 
#[prost(message, optional, tag = "20")] diff --git a/crates/proto-flow/src/runtime.serde.rs b/crates/proto-flow/src/runtime.serde.rs index 89c06951649..664eeb3af60 100644 --- a/crates/proto-flow/src/runtime.serde.rs +++ b/crates/proto-flow/src/runtime.serde.rs @@ -3330,6 +3330,9 @@ impl serde::Serialize for Materialize { if self.validated.is_some() { len += 1; } + if self.log_level != 0 { + len += 1; + } if self.session_loop.is_some() { len += 1; } @@ -3415,6 +3418,11 @@ impl serde::Serialize for Materialize { if let Some(v) = self.validated.as_ref() { struct_ser.serialize_field("validated", v)?; } + if self.log_level != 0 { + let v = super::ops::log::Level::try_from(self.log_level) + .map_err(|_| serde::ser::Error::custom(format!("Invalid variant {}", self.log_level)))?; + struct_ser.serialize_field("logLevel", &v)?; + } if let Some(v) = self.session_loop.as_ref() { struct_ser.serialize_field("sessionLoop", v)?; } @@ -3502,6 +3510,8 @@ impl<'de> serde::Deserialize<'de> for Materialize { "specResponse", "validate", "validated", + "log_level", + "logLevel", "session_loop", "sessionLoop", "join", @@ -3538,6 +3548,7 @@ impl<'de> serde::Deserialize<'de> for Materialize { SpecResponse, Validate, Validated, + LogLevel, SessionLoop, Join, Joined, @@ -3587,6 +3598,7 @@ impl<'de> serde::Deserialize<'de> for Materialize { "specResponse" | "spec_response" => Ok(GeneratedField::SpecResponse), "validate" => Ok(GeneratedField::Validate), "validated" => Ok(GeneratedField::Validated), + "logLevel" | "log_level" => Ok(GeneratedField::LogLevel), "sessionLoop" | "session_loop" => Ok(GeneratedField::SessionLoop), "join" => Ok(GeneratedField::Join), "joined" => Ok(GeneratedField::Joined), @@ -3634,6 +3646,7 @@ impl<'de> serde::Deserialize<'de> for Materialize { let mut spec_response__ = None; let mut validate__ = None; let mut validated__ = None; + let mut log_level__ = None; let mut session_loop__ = None; let mut join__ = None; let mut joined__ = None; @@ -3684,6 +3697,12 @@ impl<'de> 
serde::Deserialize<'de> for Materialize { } validated__ = map_.next_value()?; } + GeneratedField::LogLevel => { + if log_level__.is_some() { + return Err(serde::de::Error::duplicate_field("logLevel")); + } + log_level__ = Some(map_.next_value::()? as i32); + } GeneratedField::SessionLoop => { if session_loop__.is_some() { return Err(serde::de::Error::duplicate_field("sessionLoop")); @@ -3835,6 +3854,7 @@ impl<'de> serde::Deserialize<'de> for Materialize { spec_response: spec_response__, validate: validate__, validated: validated__, + log_level: log_level__.unwrap_or_default(), session_loop: session_loop__, join: join__, joined: joined__, @@ -5921,7 +5941,7 @@ impl serde::Serialize for Persist { { use serde::ser::SerializeStruct; let mut len = 0; - if self.nonce != 0 { + if self.seq_no != 0 { len += 1; } if self.delete_ack_intents { @@ -5933,6 +5953,9 @@ impl serde::Serialize for Persist { if self.committed_close_clock != 0 { len += 1; } + if self.delete_committed_frontier { + len += 1; + } if self.committed_frontier.is_some() { len += 1; } @@ -5951,6 +5974,9 @@ impl serde::Serialize for Persist { if !self.last_applied.is_empty() { len += 1; } + if self.delete_legacy_checkpoint { + len += 1; + } if self.legacy_checkpoint.is_some() { len += 1; } @@ -5964,10 +5990,10 @@ impl serde::Serialize for Persist { len += 1; } let mut struct_ser = serializer.serialize_struct("runtime.Persist", len)?; - if self.nonce != 0 { + if self.seq_no != 0 { #[allow(clippy::needless_borrow)] #[allow(clippy::needless_borrows_for_generic_args)] - struct_ser.serialize_field("nonce", ToString::to_string(&self.nonce).as_str())?; + struct_ser.serialize_field("seqNo", ToString::to_string(&self.seq_no).as_str())?; } if self.delete_ack_intents { struct_ser.serialize_field("deleteAckIntents", &self.delete_ack_intents)?; @@ -5982,6 +6008,9 @@ impl serde::Serialize for Persist { #[allow(clippy::needless_borrows_for_generic_args)] struct_ser.serialize_field("committedCloseClock", 
ToString::to_string(&self.committed_close_clock).as_str())?; } + if self.delete_committed_frontier { + struct_ser.serialize_field("deleteCommittedFrontier", &self.delete_committed_frontier)?; + } if let Some(v) = self.committed_frontier.as_ref() { struct_ser.serialize_field("committedFrontier", v)?; } @@ -6006,6 +6035,9 @@ impl serde::Serialize for Persist { #[allow(clippy::needless_borrows_for_generic_args)] struct_ser.serialize_field("lastApplied", pbjson::private::base64::encode(&self.last_applied).as_str())?; } + if self.delete_legacy_checkpoint { + struct_ser.serialize_field("deleteLegacyCheckpoint", &self.delete_legacy_checkpoint)?; + } if let Some(v) = self.legacy_checkpoint.as_ref() { struct_ser.serialize_field("legacyCheckpoint", v)?; } @@ -6032,13 +6064,16 @@ impl<'de> serde::Deserialize<'de> for Persist { D: serde::Deserializer<'de>, { const FIELDS: &[&str] = &[ - "nonce", + "seq_no", + "seqNo", "delete_ack_intents", "deleteAckIntents", "ack_intents", "ackIntents", "committed_close_clock", "committedCloseClock", + "delete_committed_frontier", + "deleteCommittedFrontier", "committed_frontier", "committedFrontier", "connector_patches_json", @@ -6051,6 +6086,8 @@ impl<'de> serde::Deserialize<'de> for Persist { "hintedFrontier", "last_applied", "lastApplied", + "delete_legacy_checkpoint", + "deleteLegacyCheckpoint", "legacy_checkpoint", "legacyCheckpoint", "max_keys", @@ -6063,16 +6100,18 @@ impl<'de> serde::Deserialize<'de> for Persist { #[allow(clippy::enum_variant_names)] enum GeneratedField { - Nonce, + SeqNo, DeleteAckIntents, AckIntents, CommittedCloseClock, + DeleteCommittedFrontier, CommittedFrontier, ConnectorPatchesJson, HintedCloseClock, DeleteHintedFrontier, HintedFrontier, LastApplied, + DeleteLegacyCheckpoint, LegacyCheckpoint, MaxKeys, DeleteTriggerParams, @@ -6098,16 +6137,18 @@ impl<'de> serde::Deserialize<'de> for Persist { E: serde::de::Error, { match value { - "nonce" => Ok(GeneratedField::Nonce), + "seqNo" | "seq_no" => 
Ok(GeneratedField::SeqNo), "deleteAckIntents" | "delete_ack_intents" => Ok(GeneratedField::DeleteAckIntents), "ackIntents" | "ack_intents" => Ok(GeneratedField::AckIntents), "committedCloseClock" | "committed_close_clock" => Ok(GeneratedField::CommittedCloseClock), + "deleteCommittedFrontier" | "delete_committed_frontier" => Ok(GeneratedField::DeleteCommittedFrontier), "committedFrontier" | "committed_frontier" => Ok(GeneratedField::CommittedFrontier), "connectorPatches" | "connector_patches_json" => Ok(GeneratedField::ConnectorPatchesJson), "hintedCloseClock" | "hinted_close_clock" => Ok(GeneratedField::HintedCloseClock), "deleteHintedFrontier" | "delete_hinted_frontier" => Ok(GeneratedField::DeleteHintedFrontier), "hintedFrontier" | "hinted_frontier" => Ok(GeneratedField::HintedFrontier), "lastApplied" | "last_applied" => Ok(GeneratedField::LastApplied), + "deleteLegacyCheckpoint" | "delete_legacy_checkpoint" => Ok(GeneratedField::DeleteLegacyCheckpoint), "legacyCheckpoint" | "legacy_checkpoint" => Ok(GeneratedField::LegacyCheckpoint), "maxKeys" | "max_keys" => Ok(GeneratedField::MaxKeys), "deleteTriggerParams" | "delete_trigger_params" => Ok(GeneratedField::DeleteTriggerParams), @@ -6131,27 +6172,29 @@ impl<'de> serde::Deserialize<'de> for Persist { where V: serde::de::MapAccess<'de>, { - let mut nonce__ = None; + let mut seq_no__ = None; let mut delete_ack_intents__ = None; let mut ack_intents__ = None; let mut committed_close_clock__ = None; + let mut delete_committed_frontier__ = None; let mut committed_frontier__ = None; let mut connector_patches_json__ = None; let mut hinted_close_clock__ = None; let mut delete_hinted_frontier__ = None; let mut hinted_frontier__ = None; let mut last_applied__ = None; + let mut delete_legacy_checkpoint__ = None; let mut legacy_checkpoint__ = None; let mut max_keys__ = None; let mut delete_trigger_params__ = None; let mut trigger_params_json__ = None; while let Some(k) = map_.next_key()? 
{ match k { - GeneratedField::Nonce => { - if nonce__.is_some() { - return Err(serde::de::Error::duplicate_field("nonce")); + GeneratedField::SeqNo => { + if seq_no__.is_some() { + return Err(serde::de::Error::duplicate_field("seqNo")); } - nonce__ = + seq_no__ = Some(map_.next_value::<::pbjson::private::NumberDeserialize<_>>()?.0) ; } @@ -6178,6 +6221,12 @@ impl<'de> serde::Deserialize<'de> for Persist { Some(map_.next_value::<::pbjson::private::NumberDeserialize<_>>()?.0) ; } + GeneratedField::DeleteCommittedFrontier => { + if delete_committed_frontier__.is_some() { + return Err(serde::de::Error::duplicate_field("deleteCommittedFrontier")); + } + delete_committed_frontier__ = Some(map_.next_value()?); + } GeneratedField::CommittedFrontier => { if committed_frontier__.is_some() { return Err(serde::de::Error::duplicate_field("committedFrontier")); @@ -6220,6 +6269,12 @@ impl<'de> serde::Deserialize<'de> for Persist { Some(map_.next_value::<::pbjson::private::BytesDeserialize<_>>()?.0) ; } + GeneratedField::DeleteLegacyCheckpoint => { + if delete_legacy_checkpoint__.is_some() { + return Err(serde::de::Error::duplicate_field("deleteLegacyCheckpoint")); + } + delete_legacy_checkpoint__ = Some(map_.next_value()?); + } GeneratedField::LegacyCheckpoint => { if legacy_checkpoint__.is_some() { return Err(serde::de::Error::duplicate_field("legacyCheckpoint")); @@ -6252,16 +6307,18 @@ impl<'de> serde::Deserialize<'de> for Persist { } } Ok(Persist { - nonce: nonce__.unwrap_or_default(), + seq_no: seq_no__.unwrap_or_default(), delete_ack_intents: delete_ack_intents__.unwrap_or_default(), ack_intents: ack_intents__.unwrap_or_default(), committed_close_clock: committed_close_clock__.unwrap_or_default(), + delete_committed_frontier: delete_committed_frontier__.unwrap_or_default(), committed_frontier: committed_frontier__, connector_patches_json: connector_patches_json__.unwrap_or_default(), hinted_close_clock: hinted_close_clock__.unwrap_or_default(), delete_hinted_frontier: 
delete_hinted_frontier__.unwrap_or_default(), hinted_frontier: hinted_frontier__, last_applied: last_applied__.unwrap_or_default(), + delete_legacy_checkpoint: delete_legacy_checkpoint__.unwrap_or_default(), legacy_checkpoint: legacy_checkpoint__, max_keys: max_keys__.unwrap_or_default(), delete_trigger_params: delete_trigger_params__.unwrap_or_default(), @@ -6280,14 +6337,14 @@ impl serde::Serialize for Persisted { { use serde::ser::SerializeStruct; let mut len = 0; - if self.nonce != 0 { + if self.seq_no != 0 { len += 1; } let mut struct_ser = serializer.serialize_struct("runtime.Persisted", len)?; - if self.nonce != 0 { + if self.seq_no != 0 { #[allow(clippy::needless_borrow)] #[allow(clippy::needless_borrows_for_generic_args)] - struct_ser.serialize_field("nonce", ToString::to_string(&self.nonce).as_str())?; + struct_ser.serialize_field("seqNo", ToString::to_string(&self.seq_no).as_str())?; } struct_ser.end() } @@ -6299,12 +6356,13 @@ impl<'de> serde::Deserialize<'de> for Persisted { D: serde::Deserializer<'de>, { const FIELDS: &[&str] = &[ - "nonce", + "seq_no", + "seqNo", ]; #[allow(clippy::enum_variant_names)] enum GeneratedField { - Nonce, + SeqNo, } impl<'de> serde::Deserialize<'de> for GeneratedField { fn deserialize(deserializer: D) -> std::result::Result @@ -6326,7 +6384,7 @@ impl<'de> serde::Deserialize<'de> for Persisted { E: serde::de::Error, { match value { - "nonce" => Ok(GeneratedField::Nonce), + "seqNo" | "seq_no" => Ok(GeneratedField::SeqNo), _ => Err(serde::de::Error::unknown_field(value, FIELDS)), } } @@ -6346,21 +6404,21 @@ impl<'de> serde::Deserialize<'de> for Persisted { where V: serde::de::MapAccess<'de>, { - let mut nonce__ = None; + let mut seq_no__ = None; while let Some(k) = map_.next_key()? 
{ match k { - GeneratedField::Nonce => { - if nonce__.is_some() { - return Err(serde::de::Error::duplicate_field("nonce")); + GeneratedField::SeqNo => { + if seq_no__.is_some() { + return Err(serde::de::Error::duplicate_field("seqNo")); } - nonce__ = + seq_no__ = Some(map_.next_value::<::pbjson::private::NumberDeserialize<_>>()?.0) ; } } } Ok(Persisted { - nonce: nonce__.unwrap_or_default(), + seq_no: seq_no__.unwrap_or_default(), }) } } @@ -7624,12 +7682,6 @@ impl serde::Serialize for Task { if !self.spec.is_empty() { len += 1; } - if !self.ops_stats_journal.is_empty() { - len += 1; - } - if self.ops_stats_spec.is_some() { - len += 1; - } if self.preview { len += 1; } @@ -7642,12 +7694,6 @@ impl serde::Serialize for Task { #[allow(clippy::needless_borrows_for_generic_args)] struct_ser.serialize_field("spec", pbjson::private::base64::encode(&self.spec).as_str())?; } - if !self.ops_stats_journal.is_empty() { - struct_ser.serialize_field("opsStatsJournal", &self.ops_stats_journal)?; - } - if let Some(v) = self.ops_stats_spec.as_ref() { - struct_ser.serialize_field("opsStatsSpec", v)?; - } if self.preview { struct_ser.serialize_field("preview", &self.preview)?; } @@ -7665,10 +7711,6 @@ impl<'de> serde::Deserialize<'de> for Task { { const FIELDS: &[&str] = &[ "spec", - "ops_stats_journal", - "opsStatsJournal", - "ops_stats_spec", - "opsStatsSpec", "preview", "max_transactions", "maxTransactions", @@ -7677,8 +7719,6 @@ impl<'de> serde::Deserialize<'de> for Task { #[allow(clippy::enum_variant_names)] enum GeneratedField { Spec, - OpsStatsJournal, - OpsStatsSpec, Preview, MaxTransactions, } @@ -7703,8 +7743,6 @@ impl<'de> serde::Deserialize<'de> for Task { { match value { "spec" => Ok(GeneratedField::Spec), - "opsStatsJournal" | "ops_stats_journal" => Ok(GeneratedField::OpsStatsJournal), - "opsStatsSpec" | "ops_stats_spec" => Ok(GeneratedField::OpsStatsSpec), "preview" => Ok(GeneratedField::Preview), "maxTransactions" | "max_transactions" => 
Ok(GeneratedField::MaxTransactions), _ => Err(serde::de::Error::unknown_field(value, FIELDS)), @@ -7727,8 +7765,6 @@ impl<'de> serde::Deserialize<'de> for Task { V: serde::de::MapAccess<'de>, { let mut spec__ = None; - let mut ops_stats_journal__ = None; - let mut ops_stats_spec__ = None; let mut preview__ = None; let mut max_transactions__ = None; while let Some(k) = map_.next_key()? { @@ -7741,18 +7777,6 @@ impl<'de> serde::Deserialize<'de> for Task { Some(map_.next_value::<::pbjson::private::BytesDeserialize<_>>()?.0) ; } - GeneratedField::OpsStatsJournal => { - if ops_stats_journal__.is_some() { - return Err(serde::de::Error::duplicate_field("opsStatsJournal")); - } - ops_stats_journal__ = Some(map_.next_value()?); - } - GeneratedField::OpsStatsSpec => { - if ops_stats_spec__.is_some() { - return Err(serde::de::Error::duplicate_field("opsStatsSpec")); - } - ops_stats_spec__ = map_.next_value()?; - } GeneratedField::Preview => { if preview__.is_some() { return Err(serde::de::Error::duplicate_field("preview")); @@ -7771,8 +7795,6 @@ impl<'de> serde::Deserialize<'de> for Task { } Ok(Task { spec: spec__.unwrap_or_default(), - ops_stats_journal: ops_stats_journal__.unwrap_or_default(), - ops_stats_spec: ops_stats_spec__, preview: preview__.unwrap_or_default(), max_transactions: max_transactions__.unwrap_or_default(), }) diff --git a/crates/proto-flow/src/shuffle.rs b/crates/proto-flow/src/shuffle.rs index 8ff3b4c7574..227f6da83d1 100644 --- a/crates/proto-flow/src/shuffle.rs +++ b/crates/proto-flow/src/shuffle.rs @@ -2,16 +2,22 @@ /// Shard represents a participant in the shuffle topology (e.x. a task shard). #[derive(Clone, PartialEq, Eq, Hash, ::prost::Message)] pub struct Shard { + /// Fully-qualified identifier of this shard, e.g. + /// `//_`. + /// Used to label metrics and diagnostics, and as an authorization subject + /// for the shard's RPCs. 
+ #[prost(string, tag = "1")] + pub id: ::prost::alloc::string::String, /// Key and r-clock document range owned by this shard. - #[prost(message, optional, tag = "1")] + #[prost(message, optional, tag = "2")] pub range: ::core::option::Option, /// gRPC endpoint of this shard's shuffle service. - #[prost(string, tag = "2")] + #[prost(string, tag = "3")] pub endpoint: ::prost::alloc::string::String, /// Filesystem path where the Log actor writes segment files for this shard. /// The consumer joins over shuffle-produced log segments via this directory. /// Multiple shard indices may share a single directory. - #[prost(string, tag = "3")] + #[prost(string, tag = "4")] pub directory: ::prost::alloc::string::String, } #[derive(Clone, PartialEq, ::prost::Message)] diff --git a/crates/proto-flow/src/shuffle.serde.rs b/crates/proto-flow/src/shuffle.serde.rs index 900287889be..b102ea59a44 100644 --- a/crates/proto-flow/src/shuffle.serde.rs +++ b/crates/proto-flow/src/shuffle.serde.rs @@ -2029,6 +2029,9 @@ impl serde::Serialize for Shard { { use serde::ser::SerializeStruct; let mut len = 0; + if !self.id.is_empty() { + len += 1; + } if self.range.is_some() { len += 1; } @@ -2039,6 +2042,9 @@ impl serde::Serialize for Shard { len += 1; } let mut struct_ser = serializer.serialize_struct("shuffle.Shard", len)?; + if !self.id.is_empty() { + struct_ser.serialize_field("id", &self.id)?; + } if let Some(v) = self.range.as_ref() { struct_ser.serialize_field("range", v)?; } @@ -2058,6 +2064,7 @@ impl<'de> serde::Deserialize<'de> for Shard { D: serde::Deserializer<'de>, { const FIELDS: &[&str] = &[ + "id", "range", "endpoint", "directory", @@ -2065,6 +2072,7 @@ impl<'de> serde::Deserialize<'de> for Shard { #[allow(clippy::enum_variant_names)] enum GeneratedField { + Id, Range, Endpoint, Directory, @@ -2089,6 +2097,7 @@ impl<'de> serde::Deserialize<'de> for Shard { E: serde::de::Error, { match value { + "id" => Ok(GeneratedField::Id), "range" => Ok(GeneratedField::Range), "endpoint" => 
Ok(GeneratedField::Endpoint), "directory" => Ok(GeneratedField::Directory), @@ -2111,11 +2120,18 @@ impl<'de> serde::Deserialize<'de> for Shard { where V: serde::de::MapAccess<'de>, { + let mut id__ = None; let mut range__ = None; let mut endpoint__ = None; let mut directory__ = None; while let Some(k) = map_.next_key()? { match k { + GeneratedField::Id => { + if id__.is_some() { + return Err(serde::de::Error::duplicate_field("id")); + } + id__ = Some(map_.next_value()?); + } GeneratedField::Range => { if range__.is_some() { return Err(serde::de::Error::duplicate_field("range")); @@ -2137,6 +2153,7 @@ impl<'de> serde::Deserialize<'de> for Shard { } } Ok(Shard { + id: id__.unwrap_or_default(), range: range__, endpoint: endpoint__.unwrap_or_default(), directory: directory__.unwrap_or_default(), diff --git a/crates/proto-flow/tests/snapshots/regression__stats_json.snap b/crates/proto-flow/tests/snapshots/regression__stats_json.snap index e2de5a6c91d..dce700cb0f7 100644 --- a/crates/proto-flow/tests/snapshots/regression__stats_json.snap +++ b/crates/proto-flow/tests/snapshots/regression__stats_json.snap @@ -39,7 +39,7 @@ expression: json_test(msg) "bytesTotal": 369 }, "lastSourcePublishedAt": "1970-01-01T00:00:06.000000007+00:00", - "bytesBehind": "1000" + "bytesBehind": 1000 }, "otherTransform": { "source": "other/collection", @@ -48,7 +48,7 @@ expression: json_test(msg) "bytesTotal": 2389 }, "lastSourcePublishedAt": "1970-01-01T00:00:06.000000007+00:00", - "bytesBehind": "2000" + "bytesBehind": 2000 } }, "published": { @@ -76,7 +76,7 @@ expression: json_test(msg) "bytesTotal": 300 }, "lastSourcePublishedAt": "1970-01-01T00:00:06.000000007+00:00", - "bytesBehind": "5000" + "bytesBehind": 5000 } }, "interval": { diff --git a/crates/publisher/src/binding.rs b/crates/publisher/src/binding.rs index d82f4f2f0be..07261392ec2 100644 --- a/crates/publisher/src/binding.rs +++ b/crates/publisher/src/binding.rs @@ -2,8 +2,18 @@ use anyhow::Context; use proto_flow::flow; use 
proto_gazette::broker;
 
-/// Metadata for mapping documents to collection partitions.
-pub struct Binding {
+/// Metadata for routing publications to a specific journal.
+pub enum Binding {
+    /// `Mapped` bindings dynamically resolve documents to one of a collection's
+    /// physical partitions, creating partitions on-demand.
+    Mapped(MappedBinding),
+    /// `Fixed` bindings target a single, pre-existing journal by name.
+    Fixed(FixedBinding),
+}
+
+/// Routes documents to a collection's physical partitions via key hashing
+/// and partition-field extraction.
+pub struct MappedBinding {
     /// Target collection name (for logging/debugging).
     pub collection: models::Collection,
     /// Pre-built key extractors for the collection key pointers.
@@ -16,9 +26,14 @@ pub struct Binding {
     pub partitions_template: broker::JournalSpec,
     /// Maximum number of allowed partitions for this binding.
     pub partitions_limit: usize,
-    /// Collection partitions prefix ("{partitions_template.name}/"), or a
-    /// more-specific prefix or journal name to which this binding is scoped.
-    pub partitions_prefix_or_name: String,
+    /// Collection partitions prefix ("{partitions_template.name}/").
+    pub partitions_prefix: String,
+}
+
+/// Routes documents to a single named journal that already exists.
+pub struct FixedBinding {
+    /// Journal to which the binding publishes.
+    pub journal: String,
 }
 
 impl Binding {
@@ -33,25 +48,17 @@ impl Binding {
                     .as_ref()
                     .with_context(|| format!("capture binding {index} missing collection"))?;
 
-                Self::from_collection_spec(collection_spec, None).with_context(|| {
+                Self::from_collection_spec(collection_spec).with_context(|| {
                     format!("building binding for collection {}", collection_spec.name)
                 })
             })
             .collect()
     }
 
-    /// Build a Binding from a built CollectionSpec.
-    ///
-    /// If `partitions_prefix_or_name` is Some, the Binding will authorize-to and
-    /// watch only that sub-prefix or specific journal. When None, the Binding
-    /// authorizes to all partitions of the collection.
+    /// Build a Mapped Binding from a built CollectionSpec.
     ///
-    /// `partitions_prefix_or_name` must be prefixed by the CollectionSpec's
-    /// actual partition template prefix, or this routine errors.
-    pub fn from_collection_spec(
-        spec: &flow::CollectionSpec,
-        partitions_prefix_or_name: Option<&str>,
-    ) -> anyhow::Result<Self> {
+    /// The Binding authorizes to and watches all partitions of the collection.
+    pub fn from_collection_spec(spec: &flow::CollectionSpec) -> anyhow::Result<Self> {
         let flow::CollectionSpec {
             name,
             key,
@@ -67,17 +74,6 @@ impl Binding {
             .clone();
         let partitions_prefix = format!("{}/", &partitions_template.name);
 
-        let partitions_prefix_or_name = if let Some(fixed) = partitions_prefix_or_name {
-            if !fixed.starts_with(&partitions_prefix) {
-                anyhow::bail!(
-                    "prefix or name {fixed} must begin with collection prefix {partitions_prefix}"
-                );
-            }
-            fixed.to_string()
-        } else {
-            partitions_prefix
-        };
-
         let policy = doc::SerPolicy::noop();
         let key_extractors =
             extractors::for_key(key, projections, &policy).context("building key extractors")?;
@@ -94,14 +90,31 @@ impl Binding {
             100
         };
 
-        Ok(Self {
+        Ok(Self::Mapped(MappedBinding {
             collection: models::Collection::new(name),
             key_extractors,
             partition_fields: partition_fields.clone(),
             partition_extractors,
             partitions_template,
             partitions_limit,
-            partitions_prefix_or_name,
+            partitions_prefix,
+        }))
+    }
+
+    /// Build a Fixed Binding that publishes to a single named journal.
+    /// The binding skips the partitions watch and partition-mapping machinery.
+    pub fn for_fixed_journal(journal: impl Into<String>) -> Self {
+        Self::Fixed(FixedBinding {
+            journal: journal.into(),
         })
     }
+
+    /// AuthZ object string for this binding's lazy journal Client. For Mapped
+    /// bindings this is the partitions prefix; for Fixed it's the journal name.
+    pub(crate) fn authz_object(&self) -> &str {
+        match self {
+            Self::Mapped(b) => &b.partitions_prefix,
+            Self::Fixed(b) => &b.journal,
+        }
+    }
 }
diff --git a/crates/publisher/src/lib.rs b/crates/publisher/src/lib.rs
index 8ab94e9d5c1..17ddb09997b 100644
--- a/crates/publisher/src/lib.rs
+++ b/crates/publisher/src/lib.rs
@@ -6,34 +6,55 @@ pub mod publisher;
 pub mod watch;
 
 pub use appender::{Appender, AppenderGroup};
-pub use binding::Binding;
+pub use binding::{Binding, FixedBinding, MappedBinding};
 pub use publisher::Publisher;
 
-/// Boxed closure for lazy initialization of a partitions watch and journal Client.
-/// Callers of `Binding::from_collection_spec` provide this to control how the
-/// journal Client and partitions watch are created.
-type PartitionsClientInit = Box<
+/// Boxed closure for lazy initialization of a Mapped binding's partitions
+/// watch and journal Client.
+type MappedClientInit = Box<
     dyn FnOnce() -> (
         gazette::journal::Client,
         tokens::PendingWatch>,
     ) + Send,
 >;
 
-/// LazyPartitionsClient uses a LazyCell to defer initialization of a partitions
-/// watch and a paired journal Client for List, Apply, and Append RPCs.
+/// Boxed closure for lazy initialization of a Fixed binding's journal Client.
+type FixedClientInit = Box<dyn FnOnce() -> gazette::journal::Client + Send>;
+
+/// LazyBindingClient defers initialization of per-binding journal resources
+/// until first use.
+///
+/// Mapped bindings need both a journal Client and a long-lived list-watch
+/// stream of partitions. Fixed bindings only need a Client (the journal is
+/// known by name; no listing is required).
 ///
 /// An instantiated client and watch each consume background resources:
-/// periodic token refreshes for the client, and a long-lived list RPC for the watch.
-/// However, many (most?) bindings and collections are infrequently written and
-/// a Publisher instance may never interact with the binding during its lifetime,
-/// so avoid paying this cost until we know it's needed.
-type LazyPartitionsClient = std::sync::LazyLock<
-    (
-        gazette::journal::Client,
-        tokens::PendingWatch>,
+/// periodic token refreshes for the client, and a long-lived list RPC for the
+/// watch. However, many (most?) bindings and collections are infrequently
+/// written and a Publisher instance may never interact with the binding during
+/// its lifetime, so avoid paying this cost until we know it's needed.
+pub(crate) enum LazyBindingClient {
+    Mapped(
+        std::sync::LazyLock<
+            (
+                gazette::journal::Client,
+                tokens::PendingWatch>,
+            ),
+            MappedClientInit,
+        >,
     ),
-    PartitionsClientInit,
->;
+    Fixed(std::sync::LazyLock<gazette::journal::Client, FixedClientInit>),
+}
+
+impl LazyBindingClient {
+    /// Force initialization and return the underlying journal Client.
+    pub(crate) fn client(&self) -> &gazette::journal::Client {
+        match self {
+            Self::Mapped(lazy) => &lazy.0,
+            Self::Fixed(lazy) => &**lazy,
+        }
+    }
+}
 
 /// Sanity-check that `intents` is non-empty NDJSON: terminated by a newline,
 /// with every line a syntactically-valid JSON document.
diff --git a/crates/publisher/src/mapping.rs b/crates/publisher/src/mapping.rs
index f01253ea0be..57d7619949d 100644
--- a/crates/publisher/src/mapping.rs
+++ b/crates/publisher/src/mapping.rs
@@ -14,8 +14,14 @@ use proto_gazette::broker;
 ///
 /// Port of Go's `Mapper.Map` (`go/flow/mapping.go`).
pub(crate) async fn map_partition( - binding: &super::Binding, - client: &super::LazyPartitionsClient, + binding: &super::MappedBinding, + lazy: &std::sync::LazyLock< + ( + gazette::journal::Client, + tokens::PendingWatch>, + ), + crate::MappedClientInit, + >, doc: &N, prefix: String, packed_key: bytes::BytesMut, @@ -23,7 +29,7 @@ pub(crate) async fn map_partition( let (mut prefix, packed_key, key_hash) = extract_mapping_context(binding, doc, prefix, packed_key)?; - let (client, partitions) = &(**client); + let (client, partitions) = &(**lazy); let partitions = partitions.ready().await; loop { @@ -72,7 +78,7 @@ pub(crate) async fn map_partition( } fn extract_mapping_context( - binding: &super::Binding, + binding: &super::MappedBinding, doc: &N, mut prefix: String, mut packed_key: bytes::BytesMut, @@ -150,7 +156,7 @@ fn pick_partition( // Panics if field extraction fails, as build_logical_prefix() should have // already been called. fn build_partition_apply( - binding: &super::Binding, + binding: &super::MappedBinding, doc: &N, ) -> tonic::Result<(String, broker::ApplyRequest)> { let mut spec = binding.partitions_template.clone(); @@ -173,13 +179,6 @@ fn build_partition_apply( spec.name = name.clone(); spec.labels = Some(labels); - if !name.starts_with(&binding.partitions_prefix_or_name) { - return Err(tonic::Status::invalid_argument(format!( - "candidate partition to create is {name}, but this publisher is restricted to {}", - binding.partitions_prefix_or_name - ))); - } - Ok(( name, broker::ApplyRequest { @@ -287,8 +286,8 @@ mod test { assert_eq!(pick_partition(&p, "coll/a=1/", 0), None); } - /// Build a test Binding from a built CollectionSpec. - fn test_binding(spec: &flow::CollectionSpec) -> super::super::Binding { + /// Build a test MappedBinding from a built CollectionSpec. 
+ fn test_binding(spec: &flow::CollectionSpec) -> super::super::MappedBinding { let flow::CollectionSpec { name, partition_template, @@ -299,21 +298,21 @@ mod test { } = spec; let partition_template = partition_template.clone().unwrap(); - let partitions_prefix_or_name = format!("{}/", &partition_template.name); + let partitions_prefix = format!("{}/", &partition_template.name); let policy = doc::SerPolicy::noop(); let key_extractors = extractors::for_key(key, projections, &policy).unwrap(); let partition_extractors = extractors::for_fields(partition_fields, projections, &policy).unwrap(); - super::super::Binding { + super::super::MappedBinding { collection: models::Collection::new(name), key_extractors, partition_fields: partition_fields.clone(), partition_extractors, partitions_template: partition_template, partitions_limit: 100, - partitions_prefix_or_name, + partitions_prefix, } } @@ -331,7 +330,7 @@ mod test { .unwrap(); let spec = spec.as_ref().unwrap(); - let mut binding = test_binding(spec); + let binding = test_binding(spec); // extract_mapping_context encodes partition field values into a logical prefix. let (prefix_1, _, _) = extract_mapping_context( @@ -375,25 +374,5 @@ mod test { "change": request.changes.into_iter().next().unwrap(), }) ); - - // A more-specific prefix that still covers the candidate partition is OK. - binding.partitions_prefix_or_name = - "example/collection/2020202020202020/a_bool=%_true/".to_string(); - build_partition_apply( - &binding, - &json!({"a_key": "k", "a_bool": true, "a_str": "hello"}), - ) - .unwrap(); - - // A sibling prefix that does NOT cover the candidate partition is rejected. 
- binding.partitions_prefix_or_name = - "example/collection/2020202020202020/a_bool=%_false/".to_string(); - let err = build_partition_apply( - &binding, - &json!({"a_key": "k", "a_bool": true, "a_str": "hello"}), - ) - .unwrap_err(); - assert_eq!(err.code(), tonic::Code::InvalidArgument); - insta::assert_snapshot!(err.message(), @"candidate partition to create is example/collection/2020202020202020/a_bool=%_true/a_str=hello/pivot=00, but this publisher is restricted to example/collection/2020202020202020/a_bool=%_false/"); } } diff --git a/crates/publisher/src/publisher.rs b/crates/publisher/src/publisher.rs index 482a5ec60bd..79e786e06d7 100644 --- a/crates/publisher/src/publisher.rs +++ b/crates/publisher/src/publisher.rs @@ -10,8 +10,9 @@ pub struct Publisher { authz_subject: String, // Bindings of this Publisher. bindings: Vec, - // Lazily-initialized journal Client and partitions watch for each `bindings` entry. - binding_clients: Vec, + // Lazily-initialized journal Client (and, for Mapped bindings, partitions + // watch) for each `bindings` entry. + binding_clients: Vec, // Factory for building journal Clients on demand. client_factory: gazette::journal::ClientFactory, // Clock used to stamp published document UUIDs. @@ -31,7 +32,7 @@ impl Publisher { /// (one per entry of `bindings`), and to build ephemeral clients inside /// `write_intents` for ACK intents that do not match any current binding. /// `authz_subject` is passed through to this factory without modification, - /// and `Binding::partitions_prefix_or_name` is the AuthZ object. + /// and a binding's `authz_object()` is the AuthZ object. /// /// The `producer` identifies this Publisher as a distinct writer and is /// embedded in every UUID it generates. 
The `clock` provides a monotonic @@ -46,17 +47,26 @@ impl Publisher { let binding_clients = bindings .iter() .map(|b| { - let client_factory = client_factory.clone(); + let factory = client_factory.clone(); let authz_subject = authz_subject.clone(); - let authz_object = b.partitions_prefix_or_name.clone(); + let authz_object = b.authz_object().to_string(); - let init: crate::PartitionsClientInit = Box::new(move || { - let client = client_factory(authz_subject, authz_object.clone()); - let partitions = crate::watch::watch_partitions(client.clone(), &authz_object); - (client, partitions) - }); - - std::sync::LazyLock::new(init) + match b { + super::Binding::Mapped(_) => { + let init: crate::MappedClientInit = Box::new(move || { + let client = factory(authz_subject, authz_object.clone()); + let partitions = + crate::watch::watch_partitions(client.clone(), &authz_object); + (client, partitions) + }); + super::LazyBindingClient::Mapped(std::sync::LazyLock::new(init)) + } + super::Binding::Fixed(_) => { + let init: crate::FixedClientInit = + Box::new(move || factory(authz_subject, authz_object)); + super::LazyBindingClient::Fixed(std::sync::LazyLock::new(init)) + } + } }) .collect(); @@ -89,11 +99,13 @@ impl Publisher { /// Enqueue a document for publication to the appropriate journal partition. /// /// Assigns a UUID with the given `flags` and passes it to `doc`, which - /// returns `(binding_index, document)`. The document is mapped to a physical - /// partition (creating one if needed, which may issue an Apply RPC), serialized - /// as newline-delimited JSON into the partition's Appender buffer, and - /// checkpoint'd. The checkpoint may start a background Append RPC if the - /// buffer exceeds the flush threshold. + /// returns `(binding_index, document)`. For Mapped bindings the document + /// is mapped to a physical partition (creating one if needed, which may + /// issue an Apply RPC). 
For Fixed bindings the binding's journal is used + /// directly, with no key extraction or partition mapping. The document is + /// serialized as newline-delimited JSON into the partition's Appender + /// buffer, and checkpoint'd. The checkpoint may start a background Append + /// RPC if the buffer exceeds the flush threshold. pub async fn enqueue( &mut self, doc: impl FnOnce(uuid::Uuid) -> (usize, N), @@ -105,18 +117,24 @@ impl Publisher { // Sequence the document. let uuid = proto_gazette::uuid::build(self.producer, self.clock.tick(), flags); - let (binding, doc) = doc(uuid); - - let (mut journal, mut packed_key) = super::mapping::map_partition( - &self.bindings[binding], - &self.binding_clients[binding], - &doc, - prefix, - packed_key, - ) - .await?; - - let (client, _partitions) = &(*self.binding_clients[binding]); + let (binding_idx, doc) = doc(uuid); + + let (mut journal, mut packed_key) = match &self.bindings[binding_idx] { + super::Binding::Mapped(mapped) => { + let super::LazyBindingClient::Mapped(lazy) = &self.binding_clients[binding_idx] + else { + unreachable!("Mapped binding has Mapped lazy client"); + }; + super::mapping::map_partition(mapped, lazy, &doc, prefix, packed_key).await? + } + super::Binding::Fixed(fixed) => { + let mut prefix = prefix; + prefix.push_str(&fixed.journal); + (prefix, packed_key) + } + }; + + let client = self.binding_clients[binding_idx].client(); let appender = self.appenders.activate(&journal, client); // Enqueue the serialization to the Appender's buffer, then checkpoint. @@ -167,8 +185,9 @@ impl Publisher { /// Takes the output of `intents::build_transaction_intents()` — per-journal /// NDJSON `Bytes` — and appends each to its journal in parallel. For each /// journal, this uses a hybrid client strategy: - /// - If the journal is a prefix-match of an existing binding's - /// `auth_prefix_or_name`, reuse that binding's client. + /// - If the journal matches a binding, reuse that binding's client. 
+ /// For Mapped bindings the match is a prefix-match on the binding's + /// partitions prefix; for Fixed bindings it's an exact name match. /// - Otherwise, build an ephemeral client. This supports recovered ACK /// intents that may reference journals no longer bound to the current /// task (e.g. from a prior published task). For this class of journals, @@ -192,8 +211,11 @@ impl Publisher { let binding_client = self .bindings .iter() - .position(|b| journal.starts_with(&b.partitions_prefix_or_name)) - .map(|i| &(*self.binding_clients[i]).0); + .position(|b| match b { + super::Binding::Mapped(m) => journal.starts_with(&m.partitions_prefix), + super::Binding::Fixed(f) => journal == f.journal, + }) + .map(|i| self.binding_clients[i].client()); let appender = if let Some(client) = binding_client { self.appenders.activate(&journal, client) @@ -226,15 +248,20 @@ impl Publisher { Ok(()) } - /// Access the lazy Client and partitions watch for the binding at `index`. - /// Primarily used by tests. - pub fn binding_client( + /// Access the lazy Client and partitions watch for the Mapped binding at + /// `index`. Panics if the binding is Fixed. Primarily used by tests. 
+ pub fn mapped_binding_client( &self, index: usize, ) -> &( gazette::journal::Client, tokens::PendingWatch>, ) { - &*self.binding_clients[index] + match &self.binding_clients[index] { + super::LazyBindingClient::Mapped(lazy) => &**lazy, + super::LazyBindingClient::Fixed(_) => { + panic!("binding {index} is Fixed, not Mapped") + } + } } } diff --git a/crates/runtime-next/Cargo.toml b/crates/runtime-next/Cargo.toml index 0c541b58a92..35387fd5e1f 100644 --- a/crates/runtime-next/Cargo.toml +++ b/crates/runtime-next/Cargo.toml @@ -37,6 +37,7 @@ proto-grpc = { path = "../proto-grpc", features = [ "runtime_server", ] } publisher = { path = "../publisher" } +service-kit = { path = "../service-kit" } shuffle = { path = "../shuffle" } simd-doc = { path = "../simd-doc" } tables = { path = "../tables" } @@ -56,6 +57,7 @@ handlebars = { workspace = true } json-patch = { workspace = true } jsonwebtoken = { workspace = true } librocksdb-sys = { workspace = true } +metrics = { workspace = true } pbjson-types = { workspace = true } prost = { workspace = true } rand = { workspace = true } diff --git a/crates/runtime-next/README.md b/crates/runtime-next/README.md index 6587a75d6c2..22e2f76cee5 100644 --- a/crates/runtime-next/README.md +++ b/crates/runtime-next/README.md @@ -32,10 +32,10 @@ Reactor machine │ │ (RocksDB + Go Recorder on the shard hosting the recovery log) │ └─ Capture: per-shard RocksDB with Go Recorder │ - └─ shuffle sidecar process (Rust, one per machine) + └─ runtime sidecar process (Rust, one per machine) ├─ Shuffle Leader service (this crate, per-task via Join) ├─ Shuffle service (`crates/shuffle`, Session/Slice/Log RPCs) - └─ Listens on the fixed shuffle port, shared fleet-wide + └─ Listens on the fixed sidecar port, shared fleet-wide ``` The Gazette consumer framework's transaction lifecycle is **bypassed @@ -74,7 +74,7 @@ src/ │ └── shard/ # per-shard controller-facing service ├── service.rs # gRPC entry, dispatches by task type - ├── recovery.rs # Persist 
<-> RocksDB WriteBatch encode/decode + scan + ├── recovery.rs # Persist <-> RocksDB WriteBatch encode/decode + scan-time FC: pruning ├── rocksdb.rs # single Persist application path (capture will reuse) └── materialize/ ├── handler.rs # gRPC stream handler @@ -92,7 +92,7 @@ src/ signing key), constructs a `shard::Service`, and serves it over a per-shard Unix domain socket. - **`leader::Service::new`** (`leader/service.rs`) — sidecar process builds - one of these and registers it on the shuffle port alongside `shuffle::Service`. + one of these and registers it on the sidecar port alongside `shuffle::Service`. - **`shard::Service`** (`shard/service.rs`) — implements the controller-facing `Shard` trait. Each bidi stream terminates *both* the controller-bound protocol and the leader-bound protocol, translating between them and the @@ -136,10 +136,31 @@ documented inline in the proto. This crate ships **deployed inert** alongside the existing `runtime` crate; both coexist on the same reactor. Per-task feature flags on shard labels select which runtime serves a given task — all shards of a task use the -same runtime. The shuffle sidecar runs uniformly on every reactor machine +same runtime. The runtime sidecar runs uniformly on every reactor machine regardless of which tasks are assigned; old-runtime tasks simply don't talk to it. Rollback for any task is a feature-flag flip. +Once a task has stably cut over, the per-task `drop-runtime-v1-rollback` +shard-label flag tells the leader to stop maintaining the legacy V1 +`consumer.Checkpoint`; the leader deletes the persisted one during startup +(see below), forfeiting rollback in exchange for shedding compatibility state. + +## Startup checkpoint reconciliation + +The legacy V1 `consumer.Checkpoint` holds a *complete* committed frontier, +whereas the V2 `FC:` keys are written per-transaction as *deltas*. So at a +V1→V2 cutover the recovered `FC:` keys are not yet a sound recovery baseline. 
+`leader::materialize::startup` reconciles this synchronously: after the
+connector `Open`/`Opened` exchange, when the final status of the recovered V1
+checkpoint and any remote-authoritative connector checkpoint is known, it
+issues a single cleanup `Persist` to shard zero. If a checkpoint was
+*authoritative* (its mapped frontier replaced the recovered one), the cleanup
+clears all `FC:` keys and rewrites the complete baseline; if
+`drop-runtime-v1-rollback` is set, it also deletes the legacy `checkpoint`
+key. An authoritative (unmarked) checkpoint implies no V2 transaction has
+committed, so clearing `FC:` loses no V2 state. The transaction loop then
+only ever writes `FC:` deltas.
+
 ## Status
 
 - `leader::materialize` and `shard::materialize` are implemented.
diff --git a/crates/runtime-next/src/leader/materialize/actor.rs b/crates/runtime-next/src/leader/materialize/actor.rs
index 3ca2eb93ddf..4465f8deb3f 100644
--- a/crates/runtime-next/src/leader/materialize/actor.rs
+++ b/crates/runtime-next/src/leader/materialize/actor.rs
@@ -16,6 +16,8 @@ pub struct Actor {
     intents_write_fut: Option>>,
     // Optional full Frontier and Checkpoint, used for V1 rollback support.
     legacy_checkpoint: Option<(shuffle::Frontier, consumer::Checkpoint)>,
+    // Per-task metrics counters and gauges.
+    metrics: super::Metrics,
     // Publisher for stats and ACK intents, parked while no async operation is in-flight.
     parked_publisher: Option,
     // ACK intents to persist and append at later transaction stages.
@@ -37,6 +39,7 @@ impl Actor { pub fn new( http_client: reqwest::Client, legacy_checkpoint: Option, + metrics: super::Metrics, publisher: crate::Publisher, shard_tx: Vec>>, task: Task, @@ -45,6 +48,7 @@ impl Actor { http_client, intents_write_fut: None, legacy_checkpoint: legacy_checkpoint.map(|f| (f, consumer::Checkpoint::default())), + metrics, parked_publisher: Some(publisher), pending_ack_intents: BTreeMap::new(), shard_tx, @@ -62,7 +66,12 @@ impl Actor { session: shuffle::SessionClient, shard_rx: Vec>>, ) -> anyhow::Result<()> { - tracing::info!(self.task.n_shards, "materialize actor started"); + service_kit::event!( + tracing::Level::INFO, + "leader", + n_shards = self.task.n_shards, + "materialize Actor::serve started", + ); assert_eq!(self.task.n_shards, shard_rx.len()); assert_eq!(self.task.n_shards, self.shard_tx.len()); @@ -119,6 +128,7 @@ impl Actor { // Drive `tail` to idle. let action: fsm::Action; + let prev_kind = tail.kind(); (action, tail) = tail.step( self.intents_write_fut.is_none(), now, @@ -128,6 +138,17 @@ impl Actor { self.trigger_fut.is_some(), ); + if prev_kind != tail.kind() { + service_kit::event!( + tracing::Level::DEBUG, + "tail", + prev = prev_kind, + action = action.kind(), + next = tail.kind(), + "transition", + ); + } + match action { fsm::Action::Idle => (), fsm::Action::Sleep { .. } => unreachable!("Tail does not Sleep"), @@ -140,6 +161,7 @@ impl Actor { // Drive `head` to idle or stop. 
let action: fsm::Action; + let prev_kind = head.kind(); (action, head) = head.step( &mut binding_bytes_behind, &mut close_requested, @@ -155,6 +177,17 @@ impl Actor { &self.task, ); + if prev_kind != head.kind() { + service_kit::event!( + tracing::Level::DEBUG, + "head", + prev = prev_kind, + action = action.kind(), + next = head.kind(), + "transition", + ); + } + match action { fsm::Action::Idle => (), fsm::Action::Sleep { wake_after: w } => wake_after = w, @@ -165,6 +198,7 @@ impl Actor { &mut transactions_completed, &mut stopping, ); + self.metrics.transactions.increment(1); tail = fsm::Tail::Begin(fsm::TailBegin { pending }); continue; } @@ -198,6 +232,15 @@ impl Actor { self.parked_publisher = Some(publisher); self.pending_ack_intents = intents; self.stats_write_fut = None; + + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "completed ops stats publish", + ); + // Having just written stats, we know this measure is fresh. + let total: i64 = binding_bytes_behind.iter().copied().sum(); + self.metrics.bytes_behind.set(total as f64); } Some(result) = maybe_fut(&mut self.intents_write_fut) => { let publisher = result.map_err(crate::status_to_anyhow) @@ -205,10 +248,22 @@ impl Actor { self.parked_publisher = Some(publisher); self.intents_write_fut = None; + + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "completed ACK intents write", + ); } Some(result) = maybe_fut(&mut self.trigger_fut) => { () = result?; self.trigger_fut = None; + + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "completed trigger dispatch", + ); } // Process shard messages next. Some((shard_index, msg, rx)) = shard_rx.next() => { @@ -224,7 +279,21 @@ impl Actor { } // Poll for a next frontier when otherwise idle. 
Some(result) = frontier_rx.next(), if ready_frontier.is_none() => { - ready_frontier = Some(result?); + let frontier = result?; + let (journals, journal_producers, bytes_read_delta, bytes_behind_delta) = frontier.measures(); + let unresolved_hints = frontier.unresolved_hints; + ready_frontier = Some(frontier); + + service_kit::event!( + tracing::Level::DEBUG, + "leader", + bytes_behind_delta, + bytes_read_delta, + journal_producers, + journals, + unresolved_hints, + "received Frontier from shuffle Session", + ); } // Lowest priority. @@ -234,7 +303,11 @@ impl Actor { now.update(now_clock()); // Resync after blocking IO. } - tracing::info!("materialize actor exiting"); + service_kit::event!( + tracing::Level::INFO, + "leader", + "materialize Actor::serve exiting; broadcasting Stopped", + ); for tx in &self.shard_tx { let _ = tx.send(Ok(proto::Materialize { @@ -255,7 +328,7 @@ impl Actor { } fsm::Action::Load { frontier } => { - tracing::debug!(journals = frontier.journals.len(), "broadcasting L:Load"); + service_kit::event!(tracing::Level::DEBUG, "shard", "broadcasting L:Load"); self.broadcast(proto::Materialize { load: Some(proto::materialize::Load { frontier: Some(frontier.encode()), @@ -265,10 +338,7 @@ impl Actor { } fsm::Action::Flush { connector_patches } => { - tracing::debug!( - patches_bytes = connector_patches.len(), - "broadcasting L:Flush" - ); + service_kit::event!(tracing::Level::DEBUG, "shard", "broadcasting L:Flush"); self.broadcast(proto::Materialize { flush: Some(proto::materialize::Flush { connector_patches_json: connector_patches, @@ -278,7 +348,7 @@ impl Actor { } fsm::Action::Store => { - tracing::debug!("broadcasting L:Store"); + service_kit::event!(tracing::Level::DEBUG, "shard", "broadcasting L:Store"); self.broadcast(proto::Materialize { store: Some(proto::materialize::Store {}), ..Default::default() @@ -289,11 +359,7 @@ impl Actor { connector_checkpoint, connector_patches, } => { - tracing::debug!( - patches_bytes = 
connector_patches.len(), - "broadcasting L:StartCommit" - ); - + service_kit::event!(tracing::Level::DEBUG, "shard", "broadcasting L:StartCommit"); self.broadcast(proto::Materialize { start_commit: Some(proto::materialize::StartCommit { connector_checkpoint: Some(connector_checkpoint), @@ -304,10 +370,7 @@ impl Actor { } fsm::Action::Acknowledge { connector_patches } => { - tracing::debug!( - patches_bytes = connector_patches.len(), - "broadcasting L:Acknowledge" - ); + service_kit::event!(tracing::Level::DEBUG, "shard", "broadcasting L:Acknowledge"); self.broadcast(proto::Materialize { acknowledge: Some(proto::materialize::Acknowledge { connector_patches_json: connector_patches, @@ -317,8 +380,7 @@ impl Actor { } fsm::Action::Persist { persist } => { - tracing::debug!("dispatching L:Persist to shard zero"); - + service_kit::event!(tracing::Level::DEBUG, "shard", "sending L:Persist"); let _ = self.shard_tx[0].send(Ok(proto::Materialize { persist: Some(persist), ..Default::default() @@ -326,6 +388,11 @@ impl Actor { } fsm::Action::WriteStats { stats } => { + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "starting ops stats publish" + ); let mut publisher = self .parked_publisher .take() @@ -349,6 +416,11 @@ impl Actor { } fsm::Action::WriteIntents { ack_intents } => { + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "starting ACK intents write" + ); let mut publisher = self .parked_publisher .take() @@ -364,10 +436,17 @@ impl Actor { } fsm::Action::CallTrigger { trigger_params } => { + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "starting trigger execution" + ); let Some(compiled) = self.task.triggers.clone() else { - tracing::info!( + service_kit::event!( + tracing::Level::INFO, + "leader", trigger_params_bytes = trigger_params.len(), - "discarding recovered trigger parameters for materialization without triggers" + "discarding recovered trigger parameters for materialization without triggers", ); return Ok(()); }; @@ 
-377,6 +456,7 @@ impl Actor { let client = self.http_client.clone(); self.trigger_fut = Some( async move { + // TODO(johnny): Periodic writes into task ops logs if it takes a while. super::triggers::fire_pending_triggers(&compiled, &variables, &client).await } .boxed(), @@ -433,7 +513,13 @@ impl Actor { } else { "(other)" }; - tracing::debug!(shard_index, kind, "received from shard"); + service_kit::event!( + tracing::Level::DEBUG, + "shard", + shard_index, + kind, + "received from shard", + ); Ok(Some(msg)) } @@ -460,8 +546,10 @@ fn on_transaction_completed( *transactions_completed = transactions_completed.saturating_add(1); if *transactions_completed >= max_transactions { - tracing::info!( - transactions_completed, + service_kit::event!( + tracing::Level::INFO, + "leader", + transactions_completed = *transactions_completed, max_transactions, "materialize transaction limit reached; stopping gracefully", ); @@ -527,6 +615,7 @@ mod tests { let actor = Actor::new( reqwest::Client::new(), None, + super::super::Metrics::new("test/task/shard"), crate::Publisher::new_preview(), shard_tx, task, diff --git a/crates/runtime-next/src/leader/materialize/fsm.rs b/crates/runtime-next/src/leader/materialize/fsm.rs index 41391a96a82..a4d6316b61e 100644 --- a/crates/runtime-next/src/leader/materialize/fsm.rs +++ b/crates/runtime-next/src/leader/materialize/fsm.rs @@ -144,6 +144,25 @@ pub enum Action { }, } +impl Action { + pub fn kind(&self) -> &'static str { + match self { + Self::Acknowledge { .. } => "Acknowledge", + Self::CallTrigger { .. } => "CallTrigger", + Self::Flush { .. } => "Flush", + Self::Idle => "Idle", + Self::Load { .. } => "Load", + Self::Persist { .. } => "Persist", + Self::Rotate { .. } => "Rotate", + Self::Sleep { .. } => "Sleep", + Self::StartCommit { .. } => "StartCommit", + Self::Store => "Store", + Self::WriteIntents { .. } => "WriteIntents", + Self::WriteStats { .. 
} => "WriteStats", + } + } +} + impl Head { /// Dispatch to the current sub-state's `step()`. pub fn step( @@ -178,6 +197,19 @@ impl Head { Head::Stop => panic!("HeadFSM::Stop observed at step boundary"), } } + + pub fn kind(&self) -> &'static str { + match self { + Self::Idle(_) => "Idle", + Self::Extend(_) => "Extend", + Self::Flush(_) => "Flush", + Self::Persist(_) => "Persist", + Self::Store(_) => "Store", + Self::WriteStats(_) => "WriteStats", + Self::StartCommit(_) => "StartCommit", + Self::Stop => "Stop", + } + } } impl Tail { @@ -200,6 +232,17 @@ impl Tail { Tail::Done(_) => (Action::Idle, self), } } + + pub fn kind(&self) -> &'static str { + match self { + Self::Begin(_) => "Begin", + Self::Acknowledge(_) => "Acknowledge", + Self::WriteIntents(_) => "WriteIntents", + Self::Trigger(_) => "Trigger", + Self::Persist(_) => "Persist", + Self::Done(_) => "Done", + } + } } /// HeadIdle awaits a first ready Frontier that begins a transaction. @@ -323,9 +366,8 @@ impl HeadExtend { let max_source_clock = uuid::Clock::from_u64(max_source_clock); let extent = self.extents.bindings.entry(index).or_default(); - if !max_key_delta.is_empty() { - extent.max_key_delta = max_key_delta; - } + extent.max_key_delta = std::mem::take(&mut extent.max_key_delta).max(max_key_delta); + if extent.sourced.docs_total == 0 { extent.max_source_clock = max_source_clock; extent.min_source_clock = min_source_clock; @@ -515,7 +557,7 @@ impl HeadFlush { // Persist extents for idempotent transaction replay. 
let persist = proto::Persist { - nonce: now.as_u64(), + seq_no: now.as_u64(), connector_patches_json: take_patches(&mut pending.persist_patches), delete_hinted_frontier: true, hinted_close_clock: extents.close.as_u64(), @@ -531,7 +573,7 @@ impl HeadFlush { shard_stored: vec![false; task.n_shards], }; let persist_state = HeadPersist { - nonce: persist.nonce, + seq_no: persist.seq_no, next_action: Action::Store, next_state: Box::new(Head::Store(store_state)), }; @@ -544,7 +586,7 @@ impl HeadFlush { /// and chains its contained action and state. #[derive(Debug)] pub struct HeadPersist { - pub nonce: u64, + pub seq_no: u64, pub next_action: Action, pub next_state: Box, } @@ -554,16 +596,16 @@ impl HeadPersist { if let Some(( 0, proto::Materialize { - persisted: Some(proto::Persisted { nonce }), + persisted: Some(proto::Persisted { seq_no }), .. }, )) = shard_rx - && *nonce == self.nonce + && *seq_no == self.seq_no { shard_rx.take(); let Self { - nonce: _, + seq_no: _, next_action, next_state, } = self; @@ -644,7 +686,7 @@ impl HeadStore { } // Compose the trigger payload now that we have a complete txn-wide view. - if task.triggers.is_some() { + if task.triggers.is_some() && !extents.bindings.is_empty() { let collection_names: Vec = extents .bindings .keys() @@ -826,7 +868,7 @@ impl HeadStartCommit { .map(|(_full_frontier, full_checkpoint)| full_checkpoint.clone()); let persist = proto::Persist { - nonce: now.as_u64(), + seq_no: now.as_u64(), ack_intents: pending.ack_intents.clone(), committed_close_clock: close.as_u64(), committed_frontier: Some(shuffle::JournalFrontier::encode(&frontier.journals)), @@ -860,7 +902,7 @@ impl HeadStartCommit { }; let state = HeadPersist { - nonce: persist.nonce, + seq_no: persist.seq_no, next_action, next_state: Box::new(next_state), }; @@ -987,16 +1029,16 @@ impl TailAcknowledge { // If Acknowledged returned patches, wrap with a Persist that runs first. 
if !persist_patches.is_empty() { - let nonce = now.as_u64(); + let seq_no = now.as_u64(); state = Tail::Persist(TailPersist { - nonce, + seq_no, next_action: action, next_state: Box::new(state), }); action = Action::Persist { persist: proto::Persist { - nonce, + seq_no, connector_patches_json: persist_patches, ..Default::default() }, @@ -1043,16 +1085,16 @@ impl TailTrigger { let Self { shard_patches } = self; - let nonce = now.as_u64(); + let seq_no = now.as_u64(); let action = Action::Persist { persist: proto::Persist { - nonce, + seq_no, delete_trigger_params: true, ..Default::default() }, }; let state = TailPersist { - nonce, + seq_no, next_action: Action::Idle, next_state: Box::new(Tail::Done(TailDone { shard_patches })), }; @@ -1065,7 +1107,7 @@ impl TailTrigger { /// and chains its contained action and state. #[derive(Debug)] pub struct TailPersist { - pub nonce: u64, + pub seq_no: u64, pub next_action: Action, pub next_state: Box, } @@ -1075,16 +1117,16 @@ impl TailPersist { if let Some(( 0, proto::Materialize { - persisted: Some(proto::Persisted { nonce }), + persisted: Some(proto::Persisted { seq_no }), .. }, )) = shard_rx - && *nonce == self.nonce + && *seq_no == self.seq_no { shard_rx.take(); let Self { - nonce: _, + seq_no: _, next_action, next_state, } = self; @@ -1136,9 +1178,9 @@ pub struct CloseDecision { /// - `close_requested`, or `idempotent_replay && !unresolved_hints`: force close. /// - `unresolved_hints`: forces extend; suppresses close until hints resolve. /// - `idempotent_replay`: suppresses extend (replay is one-shot). -/// - `stopping` with `may_close=true`: suppresses extend so Head can stop after -/// the next commit. With Tail still draining, extend is permitted to keep the -/// pipeline full. +/// - `close_requested` or `stopping` with `may_close=true`: suppresses extend so +/// the current txn closes promptly (and Head can stop after the next commit). +/// With Tail still draining, extend is permitted to keep the pipeline full. 
/// - `!tail_done`: suppresses close (must hold open while Tail finishes). pub fn decide_close_policy(inputs: CloseInputs, task: &Task) -> CloseDecision { let CloseInputs { @@ -1160,17 +1202,23 @@ pub fn decide_close_policy(inputs: CloseInputs, task: &Task) -> CloseDecision { && read_bytes < task.read_bytes.end && read_docs < task.read_docs.end; - let mut policy_close = open_age > task.open_duration.start - && last_age > task.last_close_age.start - && (!policy_extend || max_combiner > task.combiner_usage_bytes.start) - && (!policy_extend || read_bytes > task.read_bytes.start) - && (!policy_extend || read_docs > task.read_docs.start); + let mut policy_close = open_age >= task.open_duration.start + && last_age >= task.last_close_age.start + && (!policy_extend || max_combiner >= task.combiner_usage_bytes.start) + && (!policy_extend || read_bytes >= task.read_bytes.start) + && (!policy_extend || read_docs >= task.read_docs.start); policy_close |= idempotent_replay && !unresolved_hints; policy_close |= close_requested; let may_close = policy_close && !unresolved_hints && tail_done; + + // A requested or stopping close stops extending the current txn once + // we're actually able to close it, so the txn finishes promptly. While + // we cannot yet close (Tail still draining, or unresolved hints), we + // keep extending if policy allows — maximizing parallelism as Tail works. + let finishing = close_requested || stopping; let may_extend = - (!idempotent_replay && policy_extend && (!stopping || !may_close)) || unresolved_hints; + (!idempotent_replay && policy_extend && (!finishing || !may_close)) || unresolved_hints; CloseDecision { may_extend, @@ -1213,10 +1261,17 @@ fn build_stats_doc( .copied() .unwrap_or_default(), ); + // Note that this measure can be clobbered if multiple bindings source + // from the same collection. This is a little unfortunate, and implied by + // the stats data-model. 
It's tempting to put a max() here, but that + // doesn't fundamentally solve the problem (updates can arrive in distinct + // txns, and then be reduced LWW by reporting). This can happen only when + // two bindings share the *same* collection and *different* priorities + // (otherwise they're same-cohort and process in lock-step). entry.last_source_published_at = extents.max_source_clock.to_pb_json_timestamp(); - ops::merge_docs_and_bytes(&extents.sourced, &mut entry.left); - ops::merge_docs_and_bytes(&extents.loaded, &mut entry.right); + ops::merge_docs_and_bytes(&extents.sourced, &mut entry.right); + ops::merge_docs_and_bytes(&extents.loaded, &mut entry.left); ops::merge_docs_and_bytes(&extents.stored, &mut entry.out); } @@ -1335,6 +1390,14 @@ mod tests { ) } + /// `mk_loaded` variant that overrides `max_key_delta` on the (sole) + /// binding, for tests that exercise its per-binding reduction. + fn mk_loaded_with_key(shard: usize, key: &'static [u8]) -> (usize, proto::Materialize) { + let (shard, mut msg) = mk_loaded(shard); + msg.loaded.as_mut().unwrap().bindings[0].max_key_delta = Bytes::from_static(key); + (shard, msg) + } + fn mk_flushed(shard: usize, patches: &'static [u8]) -> (usize, proto::Materialize) { ( shard, @@ -1393,28 +1456,28 @@ mod tests { } fn mk_head_persisted(head: &Head) -> (usize, proto::Materialize) { - let nonce = match head { - Head::Persist(p) => p.nonce, + let seq_no = match head { + Head::Persist(p) => p.seq_no, other => panic!("expected Head::Persist, got {other:?}"), }; ( 0, proto::Materialize { - persisted: Some(proto::Persisted { nonce }), + persisted: Some(proto::Persisted { seq_no }), ..Default::default() }, ) } fn mk_tail_persisted(tail: &Tail) -> (usize, proto::Materialize) { - let nonce = match tail { - Tail::Persist(p) => p.nonce, + let seq_no = match tail { + Tail::Persist(p) => p.seq_no, other => panic!("expected Tail::Persist, got {other:?}"), }; ( 0, proto::Materialize { - persisted: Some(proto::Persisted { nonce }), + 
persisted: Some(proto::Persisted { seq_no }), ..Default::default() }, ) @@ -1505,7 +1568,7 @@ mod tests { want: want(false, false), }, Case { - name: "close_requested overrides policy floor", + name: "close_requested with may_close: extend suppressed, close", inputs: CloseInputs { open_age: Duration::ZERO, last_age: Duration::ZERO, @@ -1515,7 +1578,7 @@ mod tests { close_requested: true, ..mid }, - want: want(true, true), + want: want(false, true), }, Case { name: "close_requested but tail still busy: hold open", @@ -1637,32 +1700,33 @@ mod tests { assert!(matches!(action, Action::Load { .. })); assert!(matches!(head, Head::Extend(_))); - // Loaded responses arrive from each shard. After both have landed - // HeadExtend rests (policy allows extend-or-close, neither input is - // provided) — the action is Sleep/Idle, which we don't assert. - for s in 0..2 { - ctx.shard_rx = Some(mk_loaded(s)); - let (_action, h) = ctx.step_head(head, &mut tail); - head = h; - assert!(ctx.shard_rx.is_none(), "Loaded was consumed"); - } + // Loaded(0) lands; HeadExtend still awaits Loaded(1) and rests. + ctx.shard_rx = Some(mk_loaded(0)); + let (_action, h) = ctx.step_head(head, &mut tail); + head = h; assert!(matches!(head, Head::Extend(_))); - // Extend the open transaction with a second ready Frontier. + // A second ready Frontier becomes available before Loaded(1) arrives — + // simulating the actor's loop pre-fetching the next frontier while + // awaiting the prior round's Loaded responses. ctx.ready_frontier = Some(shuffle::Frontier::default()); + ctx.shard_rx = Some(mk_loaded(1)); let (action, h) = ctx.step_head(head, &mut tail); head = h; + // With both inputs available the FSM extends rather than closes. assert!(matches!(action, Action::Load { .. 
})); + assert!(matches!(head, Head::Extend(_))); - for s in 0..2 { - ctx.shard_rx = Some(mk_loaded(s)); - let (_action, h) = ctx.step_head(head, &mut tail); - head = h; - } + // Second Load round: Loaded x2 arrive without another frontier queued. + // After the final Loaded the close-policy fires: ready_frontier is + // None and may_close is true (Tail::Done), so + // HeadExtend transitions straight into HeadFlush. + ctx.shard_rx = Some(mk_loaded(0)); + let (_action, h) = ctx.step_head(head, &mut tail); + head = h; + assert!(matches!(head, Head::Extend(_))); - // Close: with Tail::Done and no unresolved hints, `close_requested` - // forces may_close and we begin L:Flush. - ctx.close_requested = true; + ctx.shard_rx = Some(mk_loaded(1)); let (action, h) = ctx.step_head(head, &mut tail); head = h; assert!(matches!(action, Action::Flush { .. })); @@ -1721,18 +1785,18 @@ mod tests { "_meta": {}, "shard": {}, "ts": "2023-11-14T22:13:20.000000004+00:00", - "openSecondsTotal": 0.000000024, + "openSecondsTotal": 0.000000016, "txnCount": 1, "materialize": { "test/collection": { "left": { - "docsTotal": 12, - "bytesTotal": 1200 - }, - "right": { "docsTotal": 4, "bytesTotal": 400 }, + "right": { + "docsTotal": 12, + "bytesTotal": 1200 + }, "out": { "docsTotal": 8, "bytesTotal": 800 @@ -1940,6 +2004,68 @@ mod tests { assert!(matches!(tail, Tail::Done(_))); } + /// Verifies aggregation of L:Loaded `max_key_delta` across shards and Load cycles. 
+ #[test] + fn loaded_max_key_delta_reduction() { + let task = mk_task(2); + let mut ctx = Ctx { + binding_bytes_behind: vec![0; task.binding_collection_names.len()], + close_requested: false, + intents_idle: true, + legacy_checkpoint: None, + now: uuid::Clock::from_unix(1_700_000_000, 0), + pending_ack_intents: BTreeMap::new(), + ready_frontier: None, + shard_rx: None, + stats_idle: false, + stopping: false, + task, + trigger_running: false, + }; + let mut head = Head::Idle(HeadIdle::default()); + let mut tail = Tail::Done(TailDone::default()); + + // Open the first transaction. + ctx.ready_frontier = Some(shuffle::Frontier::default()); + let (_a, h) = ctx.step_head(head, &mut tail); + head = h; + + // Each row is one Load cycle: per-shard Loaded values for `max_key_delta` + // and the expected aggregated value after the cycle. The cycles share + // a single open transaction, so reductions must compose across cycles. + let cycles: &[(&[&'static [u8]], &'static [u8])] = &[ + // Cross-shard reduction: shard 1's "Z" beats shard 0's "M". + (&[b"M", b"Z"], b"Z"), + // Smaller "A" and an empty report must not clobber the prior "Z". + (&[b"", b"A"], b"Z"), + // Strictly-larger "Z9" ratchets the maximum forward. + (&[b"Z9", b""], b"Z9"), + ]; + + for (i, (per_shard_keys, expected)) in cycles.iter().enumerate() { + for (shard, key) in per_shard_keys.iter().enumerate() { + // Queue the next frontier alongside the last Loaded so the FSM + // extends back into a fresh Load cycle rather than closing. + // Mirrors the actor's pre-fetch pattern in `happy_path`. 
+ if shard + 1 == per_shard_keys.len() { + ctx.ready_frontier = Some(shuffle::Frontier::default()); + } + ctx.shard_rx = Some(mk_loaded_with_key(shard, *key)); + let (_a, h) = ctx.step_head(head, &mut tail); + head = h; + } + let aggregated = match &head { + Head::Extend(s) => s.extents.bindings[&0].max_key_delta.clone(), + other => panic!("expected Head::Extend after cycle {i}, got {other:?}"), + }; + assert_eq!( + aggregated, + Bytes::from_static(expected), + "after cycle {i} keys={per_shard_keys:?}", + ); + } + } + /// Fuzz Head and Tail by perturbing every Ctx field at each step. /// Random shard responses, frontiers, and idle/stopping flags drive /// arbitrary state transitions; the test asserts no panics. The FSMs @@ -1952,12 +2078,12 @@ mod tests { use rand::{Rng, SeedableRng, rngs::SmallRng}; // Synthesize a Materialize message of a randomly chosen variant. The - // `expected_nonce` is plumbed through so Persisted occasionally matches - // the in-progress nonce and lets HeadPersist / TailPersist actually + // `expected_seq_no` is plumbed through so Persisted occasionally matches + // the in-progress seq_no and lets HeadPersist / TailPersist actually // chain forward — without it, fuzz traces would rarely leave Persist. fn random_message( shard: usize, - expected_nonce: u64, + expected_seq_no: u64, rng: &mut SmallRng, ) -> (usize, proto::Materialize) { let mut msg = proto::Materialize::default(); @@ -2007,28 +2133,28 @@ mod tests { }); } _ => { - // Most of the time, target the in-progress Persist's nonce so + // Most of the time, target the in-progress Persist's seq_no so // the FSM can actually chain forward; otherwise emit garbage. - let nonce = if rng.random_bool(0.9) { - expected_nonce + let seq_no = if rng.random_bool(0.9) { + expected_seq_no } else { rng.random() }; - msg.persisted = Some(proto::Persisted { nonce }); + msg.persisted = Some(proto::Persisted { seq_no }); } } (shard, msg) } - // Pick a "best-guess" nonce to hand to `random_message`. 
When Head or - // Tail is awaiting Persisted we surface its nonce so the message is + // Pick a "best-guess" seq_no to hand to `random_message`. When Head or + // Tail is awaiting Persisted we surface its seq_no so the message is // sometimes accepted; otherwise return random noise. - fn pick_nonce(head: &Head, tail: &Tail, rng: &mut SmallRng) -> u64 { + fn pick_seq_no(head: &Head, tail: &Tail, rng: &mut SmallRng) -> u64 { if let Head::Persist(p) = head { - return p.nonce; + return p.seq_no; } if let Tail::Persist(p) = tail { - return p.nonce; + return p.seq_no; } rng.random() } @@ -2075,8 +2201,8 @@ mod tests { // (sometimes out-of-range) to exercise bounds handling. if rng.random_bool(0.50) { let shard = rng.random_range(0..=ctx.task.n_shards); - let nonce = pick_nonce(head, tail, rng); - ctx.shard_rx = Some(random_message(shard, nonce, rng)); + let seq_no = pick_seq_no(head, tail, rng); + ctx.shard_rx = Some(random_message(shard, seq_no, rng)); } // Add an ACK intent occasionally; HeadWriteStats drains them. diff --git a/crates/runtime-next/src/leader/materialize/handler.rs b/crates/runtime-next/src/leader/materialize/handler.rs index c6795da175f..331c285dad1 100644 --- a/crates/runtime-next/src/leader/materialize/handler.rs +++ b/crates/runtime-next/src/leader/materialize/handler.rs @@ -2,11 +2,31 @@ use super::{actor, fsm, startup}; use crate::{leader, proto}; use futures::StreamExt; use tokio::sync::mpsc; +use tracing::Instrument; pub(crate) async fn serve( + service: crate::Service, + request_rx: R, + response_tx: mpsc::UnboundedSender>, +) -> anyhow::Result<()> +where + R: futures::Stream> + Send + Unpin + 'static, +{ + // Run the whole handler inside its span so operator trace overrides (see + // `service_kit::trace`) reach every log line — the actor loop's periodic + // instrumentation included. 
+ let handler = service.registry.register("leader.materialize"); + let span = handler.span(); + serve_inner(service, request_rx, response_tx, handler) + .instrument(span) + .await +} + +async fn serve_inner( service: crate::Service, mut request_rx: R, response_tx: mpsc::UnboundedSender>, + mut handler: service_kit::HandlerGuard, ) -> anyhow::Result<()> where R: futures::Stream> + Send + Unpin + 'static, @@ -22,13 +42,17 @@ where }; let task_name = leader::validate_join(&join)?.to_string(); - tracing::info!( - %task_name, - shards = join.shards.len(), + handler.set_label(&join.shards[0].id); + handler.set_field("shards", join.shards.len()); + handler.set_field("etcd_mod_revision", join.etcd_mod_revision); + handler.set_phase("joining"); + + service_kit::event!( + tracing::Level::INFO, + "shard", shard_index = join.shard_index, etcd_mod_revision = join.etcd_mod_revision, - shuffle_directory = %join.shuffle_directory, - "received Join", + "received Join from shard", ); // Scope `guard` to prove it's not held across await points. 
@@ -48,12 +72,15 @@ where let slots = match outcome { leader::JoinOutcome::Pending { filled, target } => { - tracing::debug!( - %task_name, + service_kit::event!( + tracing::Level::DEBUG, + "leader", filled, target, - "registered pending Join", + "registered pending Join (awaiting consensus)", ); + handler.set_phase("awaiting-consensus"); + handler.finish_ok(); return Ok(()); } leader::JoinOutcome::Disagreement(slots) => { @@ -63,8 +90,9 @@ where .max() .unwrap(); - tracing::info!( - %task_name, + service_kit::event!( + tracing::Level::INFO, + "leader", max_etcd_revision, retrying = slots.len(), "broadcasting retry due to topology disagreement", @@ -76,19 +104,26 @@ where for slot in slots { let _ = slot.response_tx.send(Ok(retry.clone())); } + handler.set_phase("topology-disagreement"); + handler.finish_ok(); return Ok(()); } leader::JoinOutcome::Consensus(slots) => slots, }; - tracing::info!( - task_name, - shards = slots.len(), + handler.set_phase("starting"); + let metrics = super::Metrics::new(&slots[0].join.shards[0].id); + + service_kit::event!( + tracing::Level::INFO, + "leader", "consensus reached; starting session", ); let mut build = String::new(); + let mut drop_v1_rollback = false; + let mut ops_stats_journal = String::new(); let mut reactors: Vec = Vec::new(); let mut shard_rx = Vec::with_capacity(slots.len()); let mut shard_tx = Vec::with_capacity(slots.len()); @@ -122,19 +157,26 @@ where reactors.push(reactor.unwrap_or_default().suffix); shard_rx.push(slot_rx); shard_tx.push(slot_tx); - shard_ids.push(id); + shard_ids.push(id.clone()); shard_shuffles.push(shuffle::proto::Shard { + id, range: labeling.range, directory, endpoint, }); + + // Labels are identical across shards (enforced by Join equality check). 
build = labeling.build; + drop_v1_rollback = leader::flag_enabled(&labeling.flags, leader::DROP_V1_ROLLBACK_FLAG); + ops_stats_journal = labeling.stats_journal; } let error_tx = shard_tx.clone(); - // Run startup, and then the Actor transaction loop. - let result = async move { + // Run startup, and then the Actor transaction loop. The inner block is a + // try-scope: an error from either gets a best-effort broadcast to all shards + // below before propagating. + let result = async { let startup::Startup { committed_close, committed_frontier, @@ -146,6 +188,8 @@ where task, } = startup::run( build, + drop_v1_rollback, + ops_stats_journal, reactors, &mut shard_rx, &mut shard_tx, @@ -166,23 +210,36 @@ where }; let tail = fsm::Tail::Begin(fsm::TailBegin { pending }); - // TODO: Make this toggle-able for dropping rollback support. - let legacy_checkpoint = Some(committed_frontier); + // Maintain the legacy V1 `consumer.Checkpoint` from the recovered + // committed Frontier, unless the task has opted out of V1 rollback via + // `drop-runtime-v1-rollback`. + let legacy_checkpoint = if drop_v1_rollback { + None + } else { + Some(committed_frontier) + }; let mut actor = actor::Actor::new( service.http_client.clone(), legacy_checkpoint, + metrics, publisher, shard_tx, task, ); + handler.set_phase("running"); actor.serve(head, tail, session, shard_rx).await } .await; - let Err(err) = result else { - return Ok(()); + let err = match result { + Ok(()) => { + handler.finish_ok(); + return Ok(()); + } + Err(err) => err, }; + handler.finish_err(&format!("{err:#}")); // Best-effort broadcast of terminal error to all shards. 
let status = match err.downcast_ref::() { diff --git a/crates/runtime-next/src/leader/materialize/mod.rs b/crates/runtime-next/src/leader/materialize/mod.rs index 6caee52a03d..3ab43ab62f1 100644 --- a/crates/runtime-next/src/leader/materialize/mod.rs +++ b/crates/runtime-next/src/leader/materialize/mod.rs @@ -8,6 +8,44 @@ mod triggers; pub(crate) use handler::serve; +#[derive(Clone)] +pub(crate) struct Metrics { + /// Total transactions completed by this leader session. + transactions: metrics::Counter, + /// Aggregate bytes-behind across all bindings, observed when writing stats. + bytes_behind: metrics::Gauge, +} + +impl Metrics { + pub(crate) fn new(shard_zero: &str) -> Self { + static DESCRIBE: std::sync::Once = std::sync::Once::new(); + DESCRIBE.call_once(|| { + metrics::describe_counter!( + "runtime_leader_transactions", + metrics::Unit::Count, + "transactions completed by this leader session", + ); + metrics::describe_gauge!( + "runtime_leader_behind", + metrics::Unit::Bytes, + "aggregate bytes-behind across all bindings, observed when writing stats", + ); + }); + + let shard_zero = || shard_zero.to_string(); + Self { + transactions: metrics::counter!( + "runtime_leader_transactions", + "shard_zero" => shard_zero(), + ), + bytes_behind: metrics::gauge!( + "runtime_leader_behind", + "shard_zero" => shard_zero(), + ), + } + } +} + // Task configuration, as understood by the leader. 
// // Several fields express "close" vs "extend" policy thresholds: diff --git a/crates/runtime-next/src/leader/materialize/startup.rs b/crates/runtime-next/src/leader/materialize/startup.rs index 90760f81cca..7c49b04458b 100644 --- a/crates/runtime-next/src/leader/materialize/startup.rs +++ b/crates/runtime-next/src/leader/materialize/startup.rs @@ -37,6 +37,8 @@ pub(super) struct Startup { )] pub(super) async fn run( build: String, + drop_v1_rollback: bool, + ops_stats_journal: String, reactors: Vec, shard_rx: &mut Vec>>, shard_tx: &Vec>>, @@ -77,8 +79,6 @@ pub(super) async fn run( // Build task definition. let proto::Task { - ops_stats_journal, - ops_stats_spec, preview, max_transactions, spec: spec_bytes, @@ -94,13 +94,10 @@ pub(super) async fn run( let publisher = if preview { crate::Publisher::new_preview() } else { - let ops_stats_spec = ops_stats_spec.as_ref().context("missing ops stats spec")?; - crate::Publisher::new_real( shard_ids[0].clone(), // Shard zero is AuthZ subject. &service.publisher_factory, &ops_stats_journal, - ops_stats_spec, [], // No additional bindings. ) .context("creating publisher")? @@ -124,7 +121,6 @@ pub(super) async fn run( let mut committed_close = uuid::Clock::from_u64(committed_close); let hinted_close = uuid::Clock::from_u64(hinted_close); - let legacy_checkpoint = legacy_checkpoint.unwrap_or_default(); let mut hinted_frontier = shuffle::Frontier::decode(hinted_frontier.unwrap_or_default()) .context("validating hinted Frontier")?; @@ -188,14 +184,21 @@ pub(super) async fn run( .collect(); journal_read_suffix_index.sort(); + // Set when a recovered checkpoint (legacy V1 or connector) is authoritative + // and its mapped Frontier replaces `committed_frontier`. + let mut committed_frontier_rebuilt = false; + // Handle migration from `legacy_checkpoint`. 
- if !legacy_checkpoint.sources.is_empty() { + let legacy_checkpoint_present = legacy_checkpoint.is_some(); + if let Some(legacy_checkpoint) = legacy_checkpoint { let clock = frontier_mapping::extract_committed_close(&legacy_checkpoint); if clock == Some(committed_close) { - tracing::debug!( - ?committed_close, - "legacy_checkpoint present but matches Recover::committed_close (ignoring)" + service_kit::event!( + tracing::Level::DEBUG, + "leader", + committed_close, + "legacy_checkpoint present but matches Recover::committed_close (ignoring)", ); } else if let Some(clock) = clock { // Implementation error: these update together and should always sync. @@ -203,21 +206,27 @@ pub(super) async fn run( "legacy_checkpoint has clock {clock:?} that doesn't match Recover's committed_close ({committed_close:?})" ); } else { - tracing::debug!( - ?committed_close, - ?legacy_checkpoint, - "legacy_checkpoint doesn't contain committed-close-clock; treating as authoritative" + service_kit::event!( + tracing::Level::DEBUG, + "leader", + committed_close, + "legacy_checkpoint doesn't contain committed-close-clock; treating as authoritative", ); committed_frontier = frontier_mapping::checkpoint_to_frontier( &legacy_checkpoint.sources, &journal_read_suffix_index, ) .context("mapping recovered legacy checkpoint into Frontier")?; + committed_frontier_rebuilt = true; pending_ack_intents = legacy_checkpoint.ack_intents; } } else { - tracing::debug!("no legacy_checkpoint present"); + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "no legacy_checkpoint present", + ); } // Handle a `connector_checkpoint` from remote-authoritative connectors. 
@@ -227,17 +236,20 @@ pub(super) async fn run( let clock = frontier_mapping::extract_committed_close(&connector_checkpoint); if clock == Some(committed_close) { - tracing::debug!( - ?committed_close, - "connector_checkpoint present but matches Recover::committed_close (ignoring)" + service_kit::event!( + tracing::Level::DEBUG, + "leader", + committed_close, + "connector_checkpoint present but matches Recover::committed_close (ignoring)", ); } else if clock == Some(hinted_close) { // Connector declares that the hinted txn did in fact commit. - tracing::debug!( - ?committed_close, - ?hinted_close, - ?hinted_frontier, - "connector_checkpoint present and matches Recover::hinted_close; applying delta" + service_kit::event!( + tracing::Level::DEBUG, + "leader", + committed_close, + hinted_close, + "connector_checkpoint present and matches Recover::hinted_close; applying delta", ); committed_close = hinted_close; committed_frontier = committed_frontier.reduce(std::mem::take(&mut hinted_frontier)); @@ -250,10 +262,11 @@ pub(super) async fn run( committed_close ({committed_close:?}) or hinted_close ({hinted_close:?})" ); } else { - tracing::debug!( - ?committed_close, - ?connector_checkpoint, - "connector_checkpoint doesn't contain committed-close-clock; treating as authoritative" + service_kit::event!( + tracing::Level::DEBUG, + "leader", + committed_close, + "connector_checkpoint doesn't contain committed-close-clock; treating as authoritative", ); committed_frontier = frontier_mapping::checkpoint_to_frontier( @@ -261,11 +274,47 @@ pub(super) async fn run( &journal_read_suffix_index, ) .context("mapping recovered connector checkpoint into Frontier")?; + committed_frontier_rebuilt = true; pending_ack_intents = connector_checkpoint.ack_intents; } } else { - tracing::debug!("no connector_checkpoint present"); + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "no connector_checkpoint present", + ); + } + + // Reconcile RocksDB now that the final status of the 
recovered V1 and + // connector checkpoints is known. If `committed_frontier_rebuilt`, then + // `committed_frontier` is not natively represented in RocksDB and must be + // persisted (clearing stale state). This establishes a baseline for future + // recoveries. Go-forward commits are deltas that apply atop this base. + let delete_legacy_checkpoint = drop_v1_rollback && legacy_checkpoint_present; + if committed_frontier_rebuilt || delete_legacy_checkpoint { + service_kit::event!( + tracing::Level::INFO, + "leader", + committed_frontier_rebuilt, + delete_legacy_checkpoint, + "reconciling recovered checkpoint state", + ); + send_persist( + &mut shard_rx[0], + &shard_tx[0], + &task.peers[0], + proto::Persist { + seq_no: 0, + delete_committed_frontier: committed_frontier_rebuilt, + committed_frontier: committed_frontier_rebuilt + .then(|| shuffle::JournalFrontier::encode(&committed_frontier.journals)), + delete_legacy_checkpoint, + ..Default::default() + }, + ) + .await + .context("sending startup cleanup Persist")?; } // Compose the session resume Frontier: project the recovered hinted @@ -326,6 +375,31 @@ async fn recv_recovers( Ok(recovers.swap_remove(0)) } +/// Send a `Persist` to a shard and await the matching `Persisted` echo. +async fn send_persist( + rx: &mut BoxStream<'static, tonic::Result>, + tx: &mpsc::UnboundedSender>, + peer: &str, + persist: proto::Persist, +) -> anyhow::Result<()> { + let verify = crate::verify("Materialize", "Persisted", peer); + let seq_no = persist.seq_no; + + // Sends are best-effort: a closed peer surfaces on the next `rx`. + let _ = tx.send(Ok(proto::Materialize { + persist: Some(persist), + ..Default::default() + })); + + match verify.not_eof(rx.next().await)? { + proto::Materialize { + persisted: Some(proto::Persisted { seq_no: got }), + .. + } if got == seq_no => Ok(()), + other => Err(verify.fail_msg(other)), + } +} + // The apply loop's persistent state machine is `(last_applied, // connector_state_json)`. 
Each iteration may persist new connector state // patches; `last_applied` is bumped only on the FINAL iteration once the @@ -343,7 +417,6 @@ async fn apply_loop( connector_state_json: &mut Bytes, ) -> anyhow::Result<()> { let verify_applied = crate::verify("Materialize", "Applied", peer); - let verify_persisted = crate::verify("Materialize", "Persisted", peer); let last_version = if last_applied.is_empty() { String::new() } else { @@ -378,14 +451,14 @@ async fn apply_loop( }), .. } => { - tracing::info!( + let patches_clone: bytes::Bytes = connector_patches_json.clone(); + service_kit::event!( + tracing::Level::INFO, + "leader", iteration, - action_description, - patches = ?ops::DebugJson( - serde_json::from_slice::(&connector_patches_json) - .unwrap_or_default() - ), - "connector Apply completed" + action_description = action_description.clone(), + patches = service_kit::event::debug(patches_clone), + "connector Apply completed", ); connector_patches_json } @@ -393,28 +466,28 @@ async fn apply_loop( }; if applied_patches_json.is_empty() { - tracing::debug!(iteration, "apply loop complete"); + service_kit::event!( + tracing::Level::DEBUG, + "leader", + iteration, + "apply loop complete", + ); if last_applied == next_applied { return Ok(()); } - let _ = tx.send(Ok(proto::Materialize { - persist: Some(proto::Persist { - nonce: iteration, + send_persist( + rx, + tx, + peer, + proto::Persist { + seq_no: iteration, last_applied: next_applied.clone(), ..Default::default() - }), - ..Default::default() - })); - - match verify_persisted.not_eof(rx.next().await)? { - proto::Materialize { - persisted: Some(proto::Persisted { nonce }), - .. - } if nonce == iteration => {} - other => return Err(verify_persisted.fail_msg(other)), - } + }, + ) + .await?; return Ok(()); } @@ -425,23 +498,17 @@ async fn apply_loop( crate::patches::apply_state_patches(connector_state_json, &applied_patches_json)?; // Persist the iteration's patches to shard zero. 
- let _ = tx.send(Ok(proto::Materialize { - persist: Some(proto::Persist { - nonce: iteration, // End-of-sequence. + send_persist( + rx, + tx, + peer, + proto::Persist { + seq_no: iteration, // End-of-sequence. connector_patches_json: applied_patches_json, ..Default::default() - }), - ..Default::default() - })); - - // Receive Persisted. - match verify_persisted.not_eof(rx.next().await)? { - proto::Materialize { - persisted: Some(proto::Persisted { nonce }), - .. - } if nonce == iteration => {} - other => return Err(verify_persisted.fail_msg(other)), - } + }, + ) + .await?; } anyhow::bail!( @@ -516,9 +583,9 @@ mod tests { } } - fn persisted(nonce: u64) -> proto::Materialize { + fn persisted(seq_no: u64) -> proto::Materialize { proto::Materialize { - persisted: Some(proto::Persisted { nonce }), + persisted: Some(proto::Persisted { seq_no }), ..Default::default() } } @@ -550,7 +617,7 @@ mod tests { #[tokio::test] async fn apply_loop_persists_last_applied_when_no_patches_but_spec_changed() { // No patches but next != last: loop sends Apply, then Persist - // marking next_applied as the new last_applied with matching nonce. + // marking next_applied as the new last_applied with matching seq_no. let (mut rx, peer_tx, leader_tx, mut leader_rx) = channel_pair(); peer_tx.send(Ok(applied(b""))).unwrap(); peer_tx.send(Ok(persisted(1))).unwrap(); @@ -570,7 +637,7 @@ mod tests { let m2 = leader_rx.try_recv().unwrap().unwrap(); let p = m2.persist.unwrap(); - assert_eq!(p.nonce, 1); + assert_eq!(p.seq_no, 1); assert_eq!(p.last_applied, next); assert!(p.connector_patches_json.is_empty()); @@ -610,7 +677,7 @@ mod tests { ); // Persist iter 1 carries the connector's patches but no last_applied. 
let p1 = leader_rx.try_recv().unwrap().unwrap().persist.unwrap(); - assert_eq!(p1.nonce, 1); + assert_eq!(p1.seq_no, 1); assert!(p1.last_applied.is_empty()); assert_eq!(p1.connector_patches_json.as_ref(), patch1); @@ -622,7 +689,7 @@ mod tests { serde_json::json!({"nested":{"a":1},"keep":"v1","drop":"x"}), ); let p2 = leader_rx.try_recv().unwrap().unwrap().persist.unwrap(); - assert_eq!(p2.nonce, 2); + assert_eq!(p2.seq_no, 2); assert!(p2.last_applied.is_empty()); assert_eq!(p2.connector_patches_json.as_ref(), patch2); @@ -635,7 +702,7 @@ mod tests { ); // Final Persist promotes spec and carries no patches. let p3 = leader_rx.try_recv().unwrap().unwrap().persist.unwrap(); - assert_eq!(p3.nonce, 3); + assert_eq!(p3.seq_no, 3); assert_eq!(p3.last_applied, next); assert!(p3.connector_patches_json.is_empty()); @@ -660,16 +727,16 @@ mod tests { // Connector returns patches forever; we cap at MAX_APPLY_ITERATIONS. name: "no_convergence", seed: |tx| { - for nonce in 1..=4 { + for seq_no in 1..=4 { tx.send(Ok(applied(b"[{\"x\":1}\n]"))).unwrap(); - tx.send(Ok(persisted(nonce))).unwrap(); + tx.send(Ok(persisted(seq_no))).unwrap(); } }, expect: "did not converge", }, Case { - // Peer returns Persisted with a wrong nonce — protocol error. - name: "persisted_nonce_mismatch", + // Peer returns Persisted with a wrong seq_no — protocol error. + name: "persisted_seq_no_mismatch", seed: |tx| { tx.send(Ok(applied(b"[{\"x\":1}\n]"))).unwrap(); tx.send(Ok(persisted(99))).unwrap(); diff --git a/crates/runtime-next/src/leader/materialize/task.rs b/crates/runtime-next/src/leader/materialize/task.rs index 8db1c29b722..dcee0e583b8 100644 --- a/crates/runtime-next/src/leader/materialize/task.rs +++ b/crates/runtime-next/src/leader/materialize/task.rs @@ -80,12 +80,11 @@ impl Task { // Close-policy thresholds, many with placeholder defaults. // TODO: thread these through from the spec once they're supported there. 
-        let open_duration: std::ops::Range =
-            min_txn_duration..max_txn_duration;
-        let last_close_age = std::time::Duration::ZERO..std::time::Duration::from_secs(300);
         let combiner_usage_bytes = 0..(30 * 1024 * 1024 * 1024);
-        let read_docs = 0..u64::MAX;
+        let last_close_age = std::time::Duration::ZERO..std::time::Duration::MAX;
+        let open_duration = min_txn_duration..max_txn_duration;
         let read_bytes = 0..u64::MAX;
+        let read_docs = 0..u64::MAX;
 
         Ok(Self {
             binding_collection_names,
diff --git a/crates/runtime-next/src/leader/materialize/triggers.rs b/crates/runtime-next/src/leader/materialize/triggers.rs
index 9e4975f8b18..77302e6ddf5 100644
--- a/crates/runtime-next/src/leader/materialize/triggers.rs
+++ b/crates/runtime-next/src/leader/materialize/triggers.rs
@@ -52,7 +52,9 @@ pub async fn fire_pending_triggers(
         .await
         .context("trigger webhook delivery failed")?;
 
-    tracing::info!(
+    service_kit::event!(
+        tracing::Level::INFO,
+        "leader",
         num_triggers = compiled.configs.len(),
         elapsed_ms = started_at.elapsed().as_millis() as u64,
         "trigger webhooks delivered successfully",
@@ -174,24 +176,28 @@ async fn send_single_webhook(
                     );
                 }
 
-                tracing::warn!(
+                service_kit::event!(
+                    tracing::Level::WARN,
+                    "trigger",
                     trigger_index = index,
-                    url = %trigger.url,
-                    %status,
-                    attempt = attempt + 1,
+                    url = trigger.url.clone(),
+                    status = status.as_u16(),
+                    attempt,
                     total_attempts,
-                    "trigger webhook received non-success response, will retry"
+                    "trigger webhook received non-success response, will retry",
                 );
             }
             Err(err) => {
                 last_err = err.to_string();
-                tracing::warn!(
+                service_kit::event!(
+                    tracing::Level::WARN,
+                    "trigger",
                     trigger_index = index,
-                    url = %trigger.url,
-                    error = %err,
-                    attempt = attempt + 1,
+                    url = trigger.url.clone(),
+                    error = service_kit::event::lazy(move || err.to_string()),
+                    attempt,
                     total_attempts,
-                    "trigger webhook request failed, will retry"
+                    "trigger webhook request failed, will retry",
                 );
             }
         }
diff --git a/crates/runtime-next/src/leader/mod.rs b/crates/runtime-next/src/leader/mod.rs
index 04bfe0cab67..54ec8545246 100644
--- a/crates/runtime-next/src/leader/mod.rs
+++ b/crates/runtime-next/src/leader/mod.rs
@@ -11,3 +11,13 @@ mod materialize;
 // mod capture; // TODO: implement.
 
 pub use service::Service;
+
+/// Shard-label feature flag (under the `estuary.dev/flag/` prefix) that, when
+/// set to `"true"`, tells the leader to drop V1 rollback support for the task.
+const DROP_V1_ROLLBACK_FLAG: &str = "drop-runtime-v1-rollback";
+
+/// Reports whether `flags` (an `ops::ShardLabeling.flags` map) sets `flag` to
+/// `"true"`, mirroring the Go runtime's feature-flag convention.
+fn flag_enabled(flags: &std::collections::BTreeMap, flag: &str) -> bool {
+    flags.get(flag).map(String::as_str) == Some("true")
+}
diff --git a/crates/runtime-next/src/leader/service.rs b/crates/runtime-next/src/leader/service.rs
index b92b81cd6b8..eeabc55c6b9 100644
--- a/crates/runtime-next/src/leader/service.rs
+++ b/crates/runtime-next/src/leader/service.rs
@@ -18,18 +18,22 @@ pub struct ServiceImpl {
     pub(crate) publisher_factory: gazette::journal::ClientFactory,
     /// Process-wide HTTP client used by the actor to deliver trigger webhooks.
     pub(crate) http_client: reqwest::Client,
+    /// Registry of in-flight Leader session handlers, for the admin surface.
+    pub(crate) registry: service_kit::Registry,
 }
 
 impl Service {
     pub fn new(
         shuffle_service: shuffle::Service,
         publisher_factory: gazette::journal::ClientFactory,
+        registry: service_kit::Registry,
     ) -> Self {
         Self(Arc::new(ServiceImpl {
             materialize_joins: std::sync::Mutex::new(HashMap::new()),
             shuffle_service,
             publisher_factory,
             http_client: reqwest::Client::new(),
+            registry,
         }))
     }
 
diff --git a/crates/runtime-next/src/publish.rs b/crates/runtime-next/src/publish.rs
index 4ef8a81b4c0..2fcf695184b 100644
--- a/crates/runtime-next/src/publish.rs
+++ b/crates/runtime-next/src/publish.rs
@@ -11,12 +11,11 @@
 //! print to stdout in the `["{collection}",{...doc...}]` format used by
 //! `flowctl preview`.
 //!
-//! Construction is decided in `startup::run` based on the presence of
-//! ops_logs / ops_stats specs in `L:Open`: present ⇒ `Real`, absent ⇒
-//! `Preview`. The leader actor parks the `Publisher` across IO futures.
+//! Construction is decided in `startup::run` based on the `preview` flag in
+//! `L:Task`: `false` ⇒ `Real`, `true` ⇒ `Preview`. The leader actor parks the
+//! `Publisher` across IO futures.
 
 use bytes::Bytes;
-use proto_flow::flow;
 use proto_gazette::uuid;
 use std::collections::BTreeMap;
 
@@ -36,28 +35,24 @@ pub enum Publisher {
 }
 
 impl Publisher {
-    /// Build a real `Publisher` backed by a `publisher::Publisher` for
-    /// `ops_logs_spec` / `ops_stats_spec` journals plus any additional
-    /// supplied collection specs.
+    /// Build a real `Publisher` backed by a `publisher::Publisher` for the
+    /// pre-created `ops_stats_journal` plus any additional supplied collection
+    /// specs.
     pub fn new_real<'a, I>(
         authz_subject: String,
         client_factory: &gazette::journal::ClientFactory,
         ops_stats_journal: &str,
-        ops_stats_spec: &flow::CollectionSpec,
         collection_specs: I,
     ) -> anyhow::Result
     where
-        I: IntoIterator,
+        I: IntoIterator,
     {
         let mut bindings = Vec::new();
 
-        bindings.push(publisher::Binding::from_collection_spec(
-            ops_stats_spec,
-            Some(ops_stats_journal),
-        )?);
+        bindings.push(publisher::Binding::for_fixed_journal(ops_stats_journal));
 
         for spec in collection_specs {
-            bindings.push(publisher::Binding::from_collection_spec(spec, None)?);
+            bindings.push(publisher::Binding::from_collection_spec(spec)?);
         }
 
         let mut producer: [u8; 6] = rand::random();
@@ -101,7 +96,7 @@ impl Publisher {
             Self::Real(p) => {
                 p.enqueue(
                     |uuid| {
-                        // Binding index 0 is ops_stats_spec.
+                        // Binding index 0 is the fixed ops_stats journal.
stats.meta.as_mut().unwrap().uuid = uuid.to_string(); (0, serde_json::to_value(&stats).unwrap()) }, diff --git a/crates/runtime-next/src/shard/materialize/actor.rs b/crates/runtime-next/src/shard/materialize/actor.rs index 8ef18bd9c60..da5faad22bc 100644 --- a/crates/runtime-next/src/shard/materialize/actor.rs +++ b/crates/runtime-next/src/shard/materialize/actor.rs @@ -51,6 +51,8 @@ pub(super) struct Actor { // committed at transaction close (and not before; this lets us filter keys // above "committed" but below "current" during scans). max_keys: Vec<(Bytes, Bytes)>, + // Per-session metrics counters. + metrics: super::Metrics, } impl Actor { @@ -62,6 +64,7 @@ impl Actor { disable_load_optimization: bool, leader_tx: mpsc::UnboundedSender, max_keys: Vec<(Bytes, Bytes)>, + metrics: super::Metrics, ) -> Self { Self { bindings, @@ -74,6 +77,7 @@ impl Actor { leader_tx, load_keys: Default::default(), max_keys, + metrics, } } @@ -139,10 +143,20 @@ impl Actor { let (accumulator, shuffle_reader, shuffle_remainders, active) = scanner.into_parts(); + let combiner_bytes = accumulator.combiner_byte_usage(); + service_kit::event!( + tracing::Level::DEBUG, + "leader", + active_bindings = active.len(), + combiner_bytes, + "sending L:Loaded after frontier scan", + ); + self.metrics.scans_completed.increment(1); + _ = self.leader_tx.send(proto::Materialize { loaded: Some(proto::materialize::Loaded { bindings: active.into_values().collect(), - combiner_usage_bytes: accumulator.combiner_byte_usage(), + combiner_usage_bytes: combiner_bytes, }), ..Default::default() }); @@ -161,6 +175,14 @@ impl Actor { let (accumulator, shuffle_reader, shuffle_remainders, active) = drainer.into_parts()?; + service_kit::event!( + tracing::Level::DEBUG, + "leader", + active_bindings = active.len(), + "sending L:Stored after memtable drain", + ); + self.metrics.drains_completed.increment(1); + _ = self.leader_tx.send(proto::Materialize { stored: Some(proto::materialize::Stored { bindings: active }), 
..Default::default() @@ -197,8 +219,17 @@ impl Actor { // Next, a Persist completion. result = maybe_fut(&mut self.db_persist_fut) => { let (db, persisted) = result?; + let seq_no = persisted.seq_no; self.db = Some(db); + service_kit::event!( + tracing::Level::DEBUG, + "leader", + seq_no, + "RocksDB persist completed; sending L:Persisted", + ); + self.metrics.persists.increment(1); + _ = self.leader_tx.send(proto::Materialize { persisted: Some(persisted), ..Default::default() @@ -348,7 +379,7 @@ impl Actor { ..Default::default() }); } else if let Some(persist) = msg.persist { - let nonce = persist.nonce; + let seq_no = persist.seq_no; let (db, binding_state_keys) = self .db @@ -358,7 +389,7 @@ impl Actor { self.db_persist_fut = Some( async move { let db = db.persist(&persist, &binding_state_keys).await?; - Ok(((db, binding_state_keys), proto::Persisted { nonce })) + Ok(((db, binding_state_keys), proto::Persisted { seq_no })) } .boxed(), ); @@ -388,6 +419,9 @@ impl Actor { active.loaded_docs_total += 1; active.loaded_bytes_total += doc_json.len() as u64; + self.metrics.loaded_docs.increment(1); + self.metrics.loaded_bytes.increment(doc_json.len() as u64); + // C:Loaded responses arrive at the connector's pace, which may // straddle phase transitions: // * Phase::Scanning — a common case, mid-scan response. @@ -520,6 +554,7 @@ mod tests { load_keys: Default::default(), flushed: HashMap::new(), max_keys: Vec::new(), + metrics: super::super::Metrics::new("test/shard"), }, leader_rx, connector_rx, @@ -621,6 +656,7 @@ mod tests { load_keys: Default::default(), flushed: HashMap::new(), max_keys: Vec::new(), + metrics: super::super::Metrics::new("test/shard"), }; let accumulator = @@ -755,11 +791,11 @@ mod tests { let resp = actor_to_leader_rx.recv().await.unwrap(); assert!(resp.started_commit.is_some()); - // 5) L:Persist → RocksDB write → L:Persisted echoes nonce. + // 5) L:Persist → RocksDB write → L:Persisted echoes seq_no. 
leader_to_actor_tx .send(Ok(proto::Materialize { persist: Some(proto::Persist { - nonce: 42, + seq_no: 42, last_applied: Bytes::from_static(b"persisted-spec-bytes"), ..Default::default() }), @@ -768,7 +804,7 @@ mod tests { .unwrap(); let resp = actor_to_leader_rx.recv().await.unwrap(); - assert_eq!(resp.persisted.unwrap().nonce, 42); + assert_eq!(resp.persisted.unwrap().seq_no, 42); // 6) Controller Stop + CloseNow → forwarded to the leader. controller_to_actor_tx diff --git a/crates/runtime-next/src/shard/materialize/connector.rs b/crates/runtime-next/src/shard/materialize/connector.rs index 8e82716c109..06cff02c0b0 100644 --- a/crates/runtime-next/src/shard/materialize/connector.rs +++ b/crates/runtime-next/src/shard/materialize/connector.rs @@ -11,13 +11,13 @@ use zeroize::Zeroize; // plus OpenExtras with decrypted trigger configs and connector metadata. pub async fn start( service: &crate::shard::Service, + log_level: ops::LogLevel, mut initial: materialize::Request, ) -> anyhow::Result<( mpsc::Sender, BoxStream<'static, tonic::Result>, Option, )> { - let log_level = initial.get_internal()?.log_level(); let (endpoint, config_json, connector_type, catalog_name) = extract_endpoint(&mut initial)?; let (connector_tx, connector_rx) = mpsc::channel(crate::CHANNEL_BUFFER); diff --git a/crates/runtime-next/src/shard/materialize/handler.rs b/crates/runtime-next/src/shard/materialize/handler.rs index 5d06ee2fc2b..4391e6320b5 100644 --- a/crates/runtime-next/src/shard/materialize/handler.rs +++ b/crates/runtime-next/src/shard/materialize/handler.rs @@ -4,6 +4,7 @@ use anyhow::Context; use futures::StreamExt; use proto_flow::materialize; use tokio::sync::mpsc; +use tracing::Instrument; pub(crate) async fn serve( service: crate::shard::Service, @@ -34,25 +35,34 @@ where } proto::Materialize { - spec: Some(spec), .. + spec: Some(spec), + log_level, + .. 
} => { + let log_level = + ops::LogLevel::try_from(log_level).unwrap_or(ops::LogLevel::UndefinedLevel); + service.set_log_level(log_level); let request = materialize::Request { spec: Some(spec), ..Default::default() }; - let response = serve_unary(&service, request).await?; + let response = serve_unary(&service, request, log_level).await?; _ = controller_tx.send(Ok(response)); } proto::Materialize { validate: Some(validate), + log_level, .. } => { + let log_level = + ops::LogLevel::try_from(log_level).unwrap_or(ops::LogLevel::UndefinedLevel); + service.set_log_level(log_level); let request = materialize::Request { validate: Some(validate), ..Default::default() }; - let response = serve_unary(&service, request).await?; + let response = serve_unary(&service, request, log_level).await?; _ = controller_tx.send(Ok(response)); } @@ -65,12 +75,14 @@ where pub async fn serve_unary( service: &crate::shard::Service, request: materialize::Request, + log_level: ops::LogLevel, ) -> anyhow::Result { let is_spec = request.spec.is_some(); let is_validate = request.validate.is_some(); let is_apply = request.apply.is_some(); - let (connector_tx, mut connector_rx, _container) = connector::start(service, request).await?; + let (connector_tx, mut connector_rx, _container) = + connector::start(service, log_level, request).await?; std::mem::drop(connector_tx); // Send EOF. // Read connector response, and verify it matches the request type. @@ -146,6 +158,27 @@ async fn serve_session( db: crate::shard::RocksDB, join: proto::Join, ) -> anyhow::Result +where + R: futures::Stream> + Send + Unpin + 'static, +{ + // Run the whole session inside its handler span so operator trace overrides + // (see `service_kit::trace`) reach every log line — the actor loop's + // periodic instrumentation included. 
+ let handler = service.registry.register("shard.materialize"); + let span = handler.span(); + serve_session_inner(service, controller_rx, controller_tx, db, join, handler) + .instrument(span) + .await +} + +async fn serve_session_inner( + service: &crate::shard::Service, + controller_rx: &mut R, + controller_tx: &mpsc::UnboundedSender>, + db: crate::shard::RocksDB, + join: proto::Join, + mut handler: service_kit::HandlerGuard, +) -> anyhow::Result where R: futures::Stream> + Send + Unpin + 'static, { @@ -160,10 +193,28 @@ where .context("missing shard for shard index")?; let labeling = labeling.as_ref().context("missing shard labeling")?.clone(); + let log_level = labeling.log_level(); let shard_id = shard_id.clone(); let shard_index = join.shard_index; let shuffle_directory = join.shuffle_directory.clone(); + service.set_log_level(log_level); + + handler.set_label(&shard_id); + handler.set_field("shard_index", shard_index); + handler.set_field("etcd_mod_revision", join.etcd_mod_revision); + handler.set_phase("joining"); + + let metrics = super::Metrics::new(&shard_id); + + service_kit::event!( + tracing::Level::INFO, + "leader", + shard_index, + leader_endpoint = join.leader_endpoint.clone(), + "dialing leader and sending Join", + ); + let (joined, leader_stream) = startup::dial_and_join(join).await?; // Forward Joined to controller. @@ -172,9 +223,18 @@ where ..Default::default() })); let Some((leader_tx, leader_rx)) = leader_stream else { + service_kit::event!( + tracing::Level::DEBUG, + "leader", + "leader returned non-zero max_etcd_revision; controller must retry Join", + ); + handler.set_phase("awaiting-retry"); + handler.finish_ok(); return Ok(db); // Controller must retry Join/Joined. 
}; + handler.set_phase("starting"); + let startup::Startup { accumulator, bindings, @@ -194,6 +254,7 @@ where labeling, leader_rx, leader_tx, + log_level, service, shard_id, shard_index, @@ -201,7 +262,9 @@ where ) .await?; - let db = super::actor::Actor::new( + handler.set_phase("running"); + + let result = super::actor::Actor::new( bindings, binding_state_keys, connector_tx, @@ -209,6 +272,7 @@ where disable_load_optimization, leader_tx, max_keys, + metrics, ) .serve( accumulator, @@ -217,7 +281,18 @@ where &mut leader_rx, shuffle_reader, ) - .await?; + .await; + + let db = match result { + Ok(db) => { + handler.finish_ok(); + db + } + Err(err) => { + handler.finish_err(&format!("{err:#}")); + return Err(err); + } + }; _ = controller_tx.send(Ok(proto::Materialize { stopped: Some(proto::Stopped {}), diff --git a/crates/runtime-next/src/shard/materialize/mod.rs b/crates/runtime-next/src/shard/materialize/mod.rs index 54e9f11c869..c52371becc6 100644 --- a/crates/runtime-next/src/shard/materialize/mod.rs +++ b/crates/runtime-next/src/shard/materialize/mod.rs @@ -8,6 +8,77 @@ mod task; pub(crate) use handler::serve; +#[derive(Clone)] +pub(crate) struct Metrics { + /// RocksDB persists committed by this session. + persists: metrics::Counter, + /// Connector C:Loaded responses received. + loaded_docs: metrics::Counter, + /// Total bytes of C:Loaded document JSON received. + loaded_bytes: metrics::Counter, + /// Frontier scans completed (one per leader L:Load). + scans_completed: metrics::Counter, + /// Memtable drains completed (one per leader L:Store). 
+ drains_completed: metrics::Counter, +} + +impl Metrics { + pub(crate) fn new(shard_id: &str) -> Self { + static DESCRIBE: std::sync::Once = std::sync::Once::new(); + DESCRIBE.call_once(|| { + metrics::describe_counter!( + "runtime_shard_materialize_persists", + metrics::Unit::Count, + "RocksDB persists committed by this session", + ); + metrics::describe_counter!( + "runtime_shard_materialize_loaded_docs", + metrics::Unit::Count, + "connector C:Loaded responses received", + ); + metrics::describe_counter!( + "runtime_shard_materialize_loaded_bytes", + metrics::Unit::Bytes, + "total bytes of C:Loaded document JSON received", + ); + metrics::describe_counter!( + "runtime_shard_materialize_scans_completed", + metrics::Unit::Count, + "frontier scans completed (one per leader L:Load)", + ); + metrics::describe_counter!( + "runtime_shard_materialize_drains_completed", + metrics::Unit::Count, + "memtable drains completed (one per leader L:Store)", + ); + }); + + let shard_id = || shard_id.to_string(); + Self { + persists: metrics::counter!( + "runtime_shard_materialize_persists", + "shard_id" => shard_id(), + ), + loaded_docs: metrics::counter!( + "runtime_shard_materialize_loaded_docs", + "shard_id" => shard_id(), + ), + loaded_bytes: metrics::counter!( + "runtime_shard_materialize_loaded_bytes", + "shard_id" => shard_id(), + ), + scans_completed: metrics::counter!( + "runtime_shard_materialize_scans_completed", + "shard_id" => shard_id(), + ), + drains_completed: metrics::counter!( + "runtime_shard_materialize_drains_completed", + "shard_id" => shard_id(), + ), + } + } +} + #[derive(Debug)] struct Binding { collection_name: String, // Source collection. 
diff --git a/crates/runtime-next/src/shard/materialize/startup.rs b/crates/runtime-next/src/shard/materialize/startup.rs index efbca991ecb..b649f0fd725 100644 --- a/crates/runtime-next/src/shard/materialize/startup.rs +++ b/crates/runtime-next/src/shard/materialize/startup.rs @@ -81,6 +81,7 @@ pub(super) async fn run( labeling: ops::proto::ShardLabeling, mut leader_rx: tonic::Streaming, leader_tx: mpsc::UnboundedSender, + log_level: ops::LogLevel, service: &crate::shard::Service, shard_id: String, shard_index: u32, @@ -107,8 +108,6 @@ where let proto::Task { max_transactions: _, - ops_stats_journal, - ops_stats_spec, preview, spec: spec_bytes, } = l_task; @@ -126,13 +125,10 @@ where let publisher = if preview { crate::Publisher::new_preview() } else { - let ops_stats_spec = ops_stats_spec.as_ref().context("missing ops stats spec")?; - crate::Publisher::new_real( shard_id, // Shard ID is AuthZ subject. &service.publisher_factory, - &ops_stats_journal, - ops_stats_spec, + &labeling.stats_journal, [], // No additional bindings. ) .context("creating publisher")? @@ -204,6 +200,7 @@ where apply: Some(apply), ..Default::default() }, + log_level, ) .await?, ); @@ -219,7 +216,7 @@ where _ = leader_tx.send(proto::Materialize { persisted: Some(proto::Persisted { - nonce: persist.nonce, + seq_no: persist.seq_no, }), ..Default::default() }); @@ -254,7 +251,7 @@ where ..Default::default() }; let (connector_tx, mut connector_rx, container) = - super::connector::start(service, initial).await?; + super::connector::start(service, log_level, initial).await?; // Read C:Opened from the connector. let verify = crate::verify("Materialize", "Opened", "connector"); diff --git a/crates/runtime-next/src/shard/recovery.rs b/crates/runtime-next/src/shard/recovery.rs index fa61a288ffc..805e1bf0c82 100644 --- a/crates/runtime-next/src/shard/recovery.rs +++ b/crates/runtime-next/src/shard/recovery.rs @@ -7,6 +7,9 @@ //! `(key, value)` pair from a full `rocksdb::DB` scan on session startup, //! 
folding singleton state directly into a `proto::Recover` while collecting //! frontier entries separately for final sort and proto encoding. +//! [`prune_committed_frontier`] then drops stale `FC:` entries (conservatively; +//! see [`FRONTIER_PRUNE_CLOCK_HORIZON`] / [`FRONTIER_PRUNE_BYTE_HORIZON`]) so +//! `shard::rocksdb` can delete them from the DB before returning `Recover`. //! //! `{state_key}` below is the binding-stable `state_key` field of //! `flow::MaterializationSpec.Binding` — distinct from `journal_read_suffix`, @@ -41,6 +44,8 @@ pub const PREFIX_HINTED_FRONTIER_END: &[u8] = b"FH;"; /// Key prefix for committed Frontier entries: /// `FC:{journal}\0{state_key}\0{producer}`. pub const PREFIX_COMMITTED_FRONTIER: &[u8] = b"FC:"; +/// Exclusive upper bound used for `DeleteRange` over `PREFIX_COMMITTED_FRONTIER`. +pub const PREFIX_COMMITTED_FRONTIER_END: &[u8] = b"FC;"; /// Key prefix for per-journal ACK intent entries: `AI:{journal}`. pub const PREFIX_ACK_INTENT: &[u8] = b"AI:"; /// Exclusive upper bound used for `DeleteRange` over `PREFIX_ACK_INTENT`. @@ -60,6 +65,23 @@ pub const KEY_LAST_APPLIED: &[u8] = b"last-applied"; /// Trigger parameters (JSON `models::TriggerVariables`). pub const KEY_TRIGGER_PARAMS: &[u8] = b"trigger-params"; +/// Minimum clock distance between a committed-frontier producer and the most +/// recent committing producer of the same `(journal, state_key)` before the +/// stale producer becomes a pruning candidate. Time protects high-volume +/// journals from eager cleanup: when one producer quickly writes far ahead of +/// another, the byte-distance horizon alone would prune the laggard even though +/// little wall-clock time has actually passed. See [`prune_committed_frontier`]. 
+pub const FRONTIER_PRUNE_CLOCK_HORIZON: std::time::Duration = + std::time::Duration::from_secs(2 * 60 * 60); + +/// Minimum journal byte distance between a committed-frontier producer's read +/// offset and the furthest-along read offset of the same `(journal, state_key)` +/// before the stale producer becomes a pruning candidate. Byte distance is the +/// real operational cost of keeping an old pending span replayable: a retained +/// positive-offset entry forces the next session to re-read from that offset. +/// See [`prune_committed_frontier`]. +pub const FRONTIER_PRUNE_BYTE_HORIZON: i64 = 8 * 1024 * 1024 * 1024; + /// A single write effect contributed by a `Persist`. Values are carried as /// [`Bytes`] so shared allocations (e.g. a proto-decoded /// `connector_patches_json` buffer) can be split without copies. @@ -153,6 +175,12 @@ pub fn encode_persist>( }); } + if persist.delete_committed_frontier { + emit(KeyOp::DeleteRange { + from: Bytes::from_static(PREFIX_COMMITTED_FRONTIER), + to: Bytes::from_static(PREFIX_COMMITTED_FRONTIER_END), + }); + } if let Some(frontier) = &persist.committed_frontier { encode_frontier( PREFIX_COMMITTED_FRONTIER, @@ -200,6 +228,11 @@ pub fn encode_persist>( }); } + if persist.delete_legacy_checkpoint { + emit(KeyOp::Delete { + key: Bytes::from_static(KEY_LEGACY_CHECKPOINT), + }); + } if let Some(checkpoint) = &persist.legacy_checkpoint { checkpoint .encode(&mut buf) @@ -304,6 +337,21 @@ fn append_frontier_key( out.extend_from_slice(producer); } +/// Build a `FC:{journal}\0{state_key}\0{producer}` committed-frontier key. +/// Used by `shard::rocksdb` to delete entries chosen by +/// [`prune_committed_frontier`]. 
+pub fn committed_frontier_key(journal: &str, state_key: &str, producer: &uuid::Producer) -> Bytes { + let mut buf = BytesMut::new(); + append_frontier_key( + &mut buf, + PREFIX_COMMITTED_FRONTIER, + journal, + state_key, + producer.as_bytes(), + ); + buf.freeze() +} + // --------------------------------------------------------------------------- // Decoder // --------------------------------------------------------------------------- @@ -480,6 +528,78 @@ fn decode_frontier_entry( Ok(()) } +// --------------------------------------------------------------------------- +// Frontier pruning +// --------------------------------------------------------------------------- + +/// Drop stale committed-frontier producers from `committed` in place, returning +/// the `(journal, binding, producer)` of each removed entry so the caller can +/// issue the matching RocksDB deletes. `JournalFrontier`s left without any +/// producers are removed from `committed`. +/// +/// Pruning is conservative: within a `(journal, binding)` group — equivalently a +/// `(journal, state_key)` group, since each state_key maps to a single binding — +/// a producer `P` is pruned only when **all** of: +/// +/// 1. No `FH:` (hinted) entry exists for the same `(journal, binding, producer)` +/// — a hinted producer's committed entry is its idempotent-replay baseline +/// and must be retained. +/// 2. `P.last_commit` trails the group's newest `last_commit` by at least +/// [`FRONTIER_PRUNE_CLOCK_HORIZON`]. +/// 3. `P`'s read offset (`offset.abs()`) trails the group's furthest-along read +/// offset by at least [`FRONTIER_PRUNE_BYTE_HORIZON`]. +/// +/// `committed` and `hinted` are the per-`(journal, binding)` chunks collected by +/// [`decode_recover_key_value`]; both are grouped (consecutive RocksDB key +/// order) but neither need be globally sorted. 
+pub fn prune_committed_frontier( + committed: &mut Vec<shuffle::JournalFrontier>, + hinted: &[shuffle::JournalFrontier], +) -> Vec<(Box<str>, u16, uuid::Producer)> { + // Protected set: `(journal, binding, producer)` of every hinted entry. + let protected: std::collections::HashSet<(&str, u16, uuid::Producer)> = hinted + .iter() + .flat_map(|jf| { + jf.producers + .iter() + .map(move |p| (jf.journal.as_ref(), jf.binding, p.producer)) + }) + .collect(); + + let mut pruned = Vec::new(); + + committed.retain_mut(|jf| { + let group_clock = jf + .producers + .iter() + .map(|p| p.last_commit) + .max() + .unwrap_or_default(); + let group_offset = jf + .producers + .iter() + .map(|p| p.offset.unsigned_abs()) + .max() + .unwrap_or(0); + + jf.producers.retain(|p| { + let stale = !protected.contains(&(jf.journal.as_ref(), jf.binding, p.producer)) + && uuid::Clock::delta(group_clock, p.last_commit) >= FRONTIER_PRUNE_CLOCK_HORIZON + && group_offset.saturating_sub(p.offset.unsigned_abs()) + >= FRONTIER_PRUNE_BYTE_HORIZON as u64; + + if stale { + pruned.push((jf.journal.clone(), jf.binding, p.producer)); + } + !stale + }); + + !jf.producers.is_empty() + }); + + pruned +} + #[cfg(test)] mod test { use super::*; @@ -488,11 +608,7 @@ mod test { [0x01, tag, 0, 0, 0, 0] } - fn producer_frontier( - tag: u8, - last_commit: u64, - offset: i64, - ) -> shuffle::proto::ProducerFrontier { + fn proto_pf(tag: u8, last_commit: u64, offset: i64) -> shuffle::proto::ProducerFrontier { shuffle::proto::ProducerFrontier { producer: uuid::Producer::from_bytes(producer_id(tag)).as_i64(), last_commit, @@ -509,17 +625,14 @@ mod test { shuffle::proto::JournalFrontier { journal_name_suffix: "acme/events/000".into(), binding: 0, - producers: vec![ - producer_frontier(0xaa, 100, 250), - producer_frontier(0xbb, 90, -300), - ], + producers: vec![proto_pf(0xaa, 100, 250), proto_pf(0xbb, 90, -300)], ..Default::default() }, shuffle::proto::JournalFrontier { journal_name_truncate_delta: 3, journal_name_suffix: "001".into(), binding: 1, - 
producers: vec![producer_frontier(0xcc, 50, -50)], + producers: vec![proto_pf(0xcc, 50, -50)], ..Default::default() }, ], @@ -644,6 +757,16 @@ mod test { ..Default::default() }, ), + // V1 rollback dropped: the leader deletes the legacy "checkpoint" + // key and persists no replacement. + ( + "drop_legacy_checkpoint", + proto::Persist { + committed_close_clock: 789, + delete_legacy_checkpoint: true, + ..Default::default() + }, + ), // committed_frontier without the AI: prelude: the new proto // decouples delete_ack_intents from committed_frontier. ( @@ -653,6 +776,18 @@ mod test { ..Default::default() }, ), + // Startup checkpoint reconciliation: clear all FC: keys and rewrite + // the recovered baseline, also deleting the legacy "checkpoint" key. + // Pins the DeleteRange-before-Put ordering for committed_frontier. + ( + "reconcile_committed_frontier", + proto::Persist { + delete_committed_frontier: true, + committed_frontier: Some(frontier_fixture()), + delete_legacy_checkpoint: true, + ..Default::default() + }, + ), // Standalone ack clear: DeleteRange alone, no Put follow-up. ( "delete_ack_alone", @@ -733,6 +868,47 @@ mod test { insta::assert_debug_snapshot!(decode_pairs(store, &mapping).unwrap()); } + #[test] + fn delete_committed_frontier_clears_stale_keys() { + // A prior session left a stale/partial committed Frontier in `FC:`. + let stale = proto::Persist { + committed_frontier: Some(shuffle::proto::Frontier { + journals: vec![shuffle::proto::JournalFrontier { + journal_name_suffix: "stale/journal".into(), + binding: 0, + producers: vec![proto_pf(0x99, 1, 1)], + ..Default::default() + }], + ..Default::default() + }), + ..Default::default() + }; + // The startup cleanup Persist clears every `FC:` key, then rewrites the + // recovered baseline. 
+ let cleanup = proto::Persist { + delete_committed_frontier: true, + committed_frontier: Some(frontier_fixture()), + ..Default::default() + }; + + let binding_state_keys = &["materialize/mat/t1", "materialize/mat/t2"]; + let mut store: Vec<(Bytes, Bytes)> = Vec::new(); + for persist in [&stale, &cleanup] { + encode_persist(persist, binding_state_keys, |op| apply_op(&mut store, op)).unwrap(); + } + store.sort_by(|a, b| a.0.cmp(&b.0)); + + // Only the rewritten baseline survives; the stale journal is gone. + let mapping = state_key_index(&[("materialize/mat/t1", 0), ("materialize/mat/t2", 1)]); + let decoded = decode_pairs(store, &mapping).unwrap(); + let journals: Vec<&str> = decoded + .committed_frontier + .iter() + .map(|jf| jf.journal.as_ref()) + .collect(); + assert_eq!(journals, vec!["acme/events/000", "acme/events/001"]); + } + #[test] fn encode_persist_errors() { let cases: Vec<(&str, proto::Persist, &[&str], &str)> = vec![( @@ -756,8 +932,8 @@ mod test { #[test] fn decode_recover_classifies_ranges() { - let fh_value = producer_frontier(0xaa, 777, 12345).encode_to_vec(); - let fc_value = producer_frontier(0xbb, 999, 4242).encode_to_vec(); + let fh_value = proto_pf(0xaa, 777, 12345).encode_to_vec(); + let fc_value = proto_pf(0xbb, 999, 4242).encode_to_vec(); let pairs = vec![ ( @@ -822,7 +998,7 @@ mod test { // FH:/FC: and MK-v2: entries whose state_key is not in the // current binding mapping are silently discarded — they belong // to backfilled or removed bindings. - let fh = producer_frontier(0xaa, 1, 0).encode_to_vec(); + let fh = proto_pf(0xaa, 1, 0).encode_to_vec(); let pairs = vec![ ( frontier_key( @@ -845,7 +1021,7 @@ mod test { #[test] fn decode_recover_errors() { - let valid_value = Bytes::from(producer_frontier(0xaa, 1, 0).encode_to_vec()); + let valid_value = Bytes::from(proto_pf(0xaa, 1, 0).encode_to_vec()); // FH:/FC: layout: rest = journal \0 state_key \0 producer[6]. 
#[allow(clippy::type_complexity)] @@ -930,6 +1106,94 @@ mod test { } } + fn pf(tag: u8, last_commit_secs: u64, offset: i64) -> shuffle::ProducerFrontier { + shuffle::ProducerFrontier { + producer: uuid::Producer::from_bytes(producer_id(tag)), + last_commit: uuid::Clock::from_unix(last_commit_secs, 0), + hinted_commit: uuid::Clock::zero(), + offset, + } + } + + fn jf( + journal: &str, + binding: u16, + producers: Vec<shuffle::ProducerFrontier>, + ) -> shuffle::JournalFrontier { + shuffle::JournalFrontier { + journal: journal.into(), + binding, + producers, + bytes_read_delta: 0, + bytes_behind_delta: 0, + } + } + + const GIB: i64 = 1024 * 1024 * 1024; + + #[test] + fn prune_committed_frontier_policy() { + // `j/old` exists only on binding 0; `j/two` exists on bindings 0 and 1. + // Within `(j/old, binding 0)`, 0xff is the "fresh" producer pinning the + // group's latest clock (~11.5 days) and read offset (20 GiB); the other + // producers each exercise one prune predicate. + let mut committed = vec![ + jf( + "j/old", + 0, + vec![ + pf(0xff, 1_000_000, -20 * GIB), // group clock/offset leader + pf(0xaa, 0, 5 * GIB), // old clock AND 15 GiB behind -> prune + pf(0xbb, 0, 8 * GIB), // old clock AND 12 GiB behind -> prune + pf(0xcc, 0, 13 * GIB), // old clock but only 7 GiB behind (< 8 GiB) -> retain + pf(0xdd, 999_000, 0), // 20 GiB behind but only 1000s old (< 2h) -> retain + pf(0xee, 0, GIB), // old + 19 GiB behind, but FH-protected -> retain + ], + ), + jf("j/two", 0, vec![pf(0xff, 5, -GIB)]), // group's only producer (the leader) -> retain + jf("j/two", 1, vec![pf(0xaa, 0, GIB)]), // group's only producer (the leader) -> retain + ]; + // 0xee on (j/old, binding 0) is hinted: its committed entry is the + // idempotent-replay baseline and must survive pruning. 
+ let hinted = vec![jf("j/old", 0, vec![pf(0xee, 0, 0)])]; + + let pruned = prune_committed_frontier(&mut committed, &hinted); + + let mut pruned_tags: Vec<_> = pruned + .iter() + .map(|(j, b, p)| (j.to_string(), *b, p.as_bytes()[1])) + .collect(); + pruned_tags.sort(); + assert_eq!( + pruned_tags, + vec![ + ("j/old".to_string(), 0, 0xaa), + ("j/old".to_string(), 0, 0xbb), + ], + ); + + // Surviving producers of (j/old, binding 0): ff, cc, dd, ee. + let old0 = &committed[0]; + assert_eq!(old0.journal.as_ref(), "j/old"); + let mut survivors: Vec<u8> = old0 + .producers + .iter() + .map(|p| p.producer.as_bytes()[1]) + .collect(); + survivors.sort(); + assert_eq!(survivors, vec![0xcc, 0xdd, 0xee, 0xff]); + + // Both (j/two, _) journals retained intact. + assert_eq!(committed.len(), 3); + assert_eq!(committed[1].journal.as_ref(), "j/two"); + assert_eq!(committed[1].binding, 0); + assert_eq!(committed[2].journal.as_ref(), "j/two"); + assert_eq!(committed[2].binding, 1); + + // Pruning an already-pruned frontier is a no-op. + assert!(prune_committed_frontier(&mut committed, &hinted).is_empty()); + } + // Apply a KeyOp to an in-memory sorted store, respecting DeleteRange. // Merge is treated as append-with-newline so the round-trip snapshot sees // the framed accumulation; real RocksDB would reduce via the merge operator. diff --git a/crates/runtime-next/src/shard/rocksdb.rs b/crates/runtime-next/src/shard/rocksdb.rs index 1a7d166c5f4..fd498bc22bf 100644 --- a/crates/runtime-next/src/shard/rocksdb.rs +++ b/crates/runtime-next/src/shard/rocksdb.rs @@ -57,6 +57,15 @@ impl RocksDB { opts.create_if_missing(true); opts.create_missing_column_families(true); + // The MANIFEST file is a WAL of database file state, including current live + // SST files and their begin & ending key ranges. A new MANIFEST-00XYZ is + // created at database start, where XYZ is the next available sequence number, + // and CURRENT is updated to point at the live MANIFEST. 
By default MANIFEST + // files may grow to 4GB, but they are typically written very slowly and thus + // artificially inflate the recovery log horizon. We use a much smaller limit + // to encourage more frequent compactions into new files. + opts.set_max_manifest_file_size(1 << 17); // 131072 bytes + let db = rocksdb::DB::open_cf_descriptors(&opts, &path, cf_descriptors) .context("failed to open RocksDB")?; @@ -107,6 +116,10 @@ impl RocksDB { /// /// `binding_state_keys` is a sorted slice of `(state_key, binding_index)` /// tuples used to map from stable `state_key` to current binding index. + /// + /// As a side effect, stale committed-frontier (`FC:`) entries identified by + /// [`recovery::prune_committed_frontier`] are deleted from the DB before + /// `Recover` is returned, so the leader never observes them. pub async fn scan( self, binding_state_keys: Vec<(String, u32)>, @@ -139,6 +152,37 @@ impl RocksDB { () = it.status()?; std::mem::drop(it); + // Drop stale committed-frontier (`FC:`) entries: remove them + // from the recovered frontier and delete them from RocksDB so + // the leader never sees them and they stop costing scan time. + let pruned = + recovery::prune_committed_frontier(&mut committed_frontier, &hinted_frontier); + if !pruned.is_empty() { + // Invert `(state_key, binding)` → `binding → state_key`. + let state_key_of: std::collections::HashMap<u32, &str> = binding_state_keys + .iter() + .map(|(sk, idx)| (*idx, sk.as_str())) + .collect(); + + let mut wb = rocksdb::WriteBatch::default(); + for (journal, binding, producer) in &pruned { + let state_key = state_key_of + .get(&(*binding as u32)) + .expect("pruned binding is present in the binding mapping"); + wb.delete(recovery::committed_frontier_key( + journal, state_key, producer, + )); + } + // `wo` is not sync because this is GC, not a commit. 
+ let wo = rocksdb::WriteOptions::new(); + self.db.write_opt(wb, &wo)?; + + tracing::info!( + producers = pruned.len(), + "pruned stale committed-frontier entries during recovery scan" + ); + } + for (frontier, slot) in [ (&mut committed_frontier, &mut recover.committed_frontier), (&mut hinted_frontier, &mut recover.hinted_frontier), @@ -594,6 +638,83 @@ mod test { assert_eq!(hinted[1].binding, 1); } + /// `scan` drops stale `FC:` producers (old clock AND far behind in bytes) + /// from both the recovered frontier and the DB, but never an `FH:`-protected + /// producer's committed baseline. + #[tokio::test] + async fn scan_prunes_stale_committed_frontier() { + let db = RocksDB::open(None).await.unwrap(); + let producer = |tag: u8| proto_gazette::uuid::Producer::from_bytes([0x01, tag, 0, 0, 0, 0]); + let clock_secs = |s: u64| proto_gazette::uuid::Clock::from_unix(s, 0); + const GIB: i64 = 1024 * 1024 * 1024; + + let pf = |tag: u8, last_commit_secs: u64, offset: i64| shuffle::ProducerFrontier { + producer: producer(tag), + last_commit: clock_secs(last_commit_secs), + hinted_commit: proto_gazette::uuid::Clock::zero(), + offset, + }; + let committed = shuffle::JournalFrontier::encode(&[shuffle::JournalFrontier { + journal: "j/s".into(), + binding: 0, + producers: vec![ + pf(0x11, 1_000_000, -20 * GIB), // fresh leader: pins group clock + 20 GiB read offset + pf(0x22, 0, 0), // old clock + 20 GiB behind -> pruned + pf(0x33, 0, 0), // same, but FH-protected below -> retained + ], + bytes_read_delta: 0, + bytes_behind_delta: 0, + }]); + let hinted = shuffle::JournalFrontier::encode(&[shuffle::JournalFrontier { + journal: "j/s".into(), + binding: 0, + producers: vec![pf(0x33, 42, 7)], + bytes_read_delta: 0, + bytes_behind_delta: 0, + }]); + + let db = db + .persist( + &proto::Persist { + committed_frontier: Some(committed), + hinted_frontier: Some(hinted), + ..Default::default() + }, + &["sk0"], + ) + .await + .unwrap(); + + let mapping = vec![("sk0".to_string(), 0)]; + 
let (db, recover) = db.scan(mapping.clone()).await.unwrap(); + + let tags = |f: Option| -> Vec> { + f.into_iter() + .flat_map(shuffle::JournalFrontier::decode) + .map(|jf| { + jf.producers + .iter() + .map(|p| p.producer.as_bytes()[1]) + .collect() + }) + .collect() + }; + assert_eq!( + tags(recover.committed_frontier), + vec![vec![0x11_u8, 0x33]], + "0x22 pruned from recovered committed frontier" + ); + assert_eq!( + tags(recover.hinted_frontier), + vec![vec![0x33_u8]], + "hinted frontier untouched" + ); + + // A second scan confirms 0x22's FC: key was actually deleted. + let (_db, recover2) = db.scan(mapping).await.unwrap(); + assert_eq!(tags(recover2.committed_frontier), vec![vec![0x11_u8, 0x33]]); + } + /// Verify merge batching handles many operands that would exceed memory /// threshold. Connectors may emit many small merge-patch updates, and this /// can turn into many merge operands (hundreds of thousands). We handle this @@ -828,21 +949,21 @@ mod test { for persist in [ crate::proto::Persist { - nonce: 1, + seq_no: 1, ack_intents: [("j/A".to_string(), bytes::Bytes::from_static(b"INTENT-A"))] .into_iter() .collect(), ..Default::default() }, crate::proto::Persist { - nonce: 2, + seq_no: 2, ack_intents: [("j/B".to_string(), bytes::Bytes::from_static(b"INTENT-B"))] .into_iter() .collect(), ..Default::default() }, crate::proto::Persist { - nonce: 99, + seq_no: 99, last_applied: bytes::Bytes::from_static(b"v9"), ..Default::default() }, diff --git a/crates/runtime-next/src/shard/service.rs b/crates/runtime-next/src/shard/service.rs index 301564caba3..332fe97f782 100644 --- a/crates/runtime-next/src/shard/service.rs +++ b/crates/runtime-next/src/shard/service.rs @@ -22,6 +22,7 @@ pub struct Service { pub set_log_level: Option>, pub task_name: String, pub publisher_factory: gazette::journal::ClientFactory, + pub registry: service_kit::Registry, } impl Service { @@ -32,6 +33,7 @@ impl Service { /// - `set_log_level`: callback for adjusting the log level implied by 
runtime requests. /// - `task_name`: name which is used to label any started connector containers. /// - `publisher_factory`: client factory for creating and appending to collection partitions. + /// - `registry`: in-flight handler registry, shared with any co-hosted admin surface. pub fn new( plane: crate::Plane, container_network: String, @@ -39,6 +41,7 @@ impl Service { set_log_level: Option>, task_name: String, publisher_factory: gazette::journal::ClientFactory, + registry: service_kit::Registry, ) -> Self { Self { plane, @@ -47,6 +50,7 @@ impl Service { set_log_level, task_name, publisher_factory, + registry, } } diff --git a/crates/runtime-next/src/shard/snapshots/runtime_next__shard__recovery__test__encode_persist_snapshots.snap b/crates/runtime-next/src/shard/snapshots/runtime_next__shard__recovery__test__encode_persist_snapshots.snap index 15ab86ecc13..c77437870d9 100644 --- a/crates/runtime-next/src/shard/snapshots/runtime_next__shard__recovery__test__encode_persist_snapshots.snap +++ b/crates/runtime-next/src/shard/snapshots/runtime_next__shard__recovery__test__encode_persist_snapshots.snap @@ -1,6 +1,5 @@ --- source: crates/runtime-next/src/shard/recovery.rs -assertion_line: 758 expression: snapshot --- [ @@ -86,6 +85,18 @@ expression: snapshot }, ], ), + ( + "drop_legacy_checkpoint", + [ + Put { + key: b"committed-close", + value: b"\x15\x03\0\0\0\0\0\0", + }, + Delete { + key: b"checkpoint", + }, + ], + ), ( "committed_no_acks", [ @@ -103,6 +114,30 @@ expression: snapshot }, ], ), + ( + "reconcile_committed_frontier", + [ + DeleteRange { + from: b"FC:", + to: b"FC;", + }, + Put { + key: b"FC:acme/events/000\0materialize/mat/t1\0\x01\xaa\0\0\0\0", + value: b"\x11d\0\0\0\0\0\0\0 \xfa\x01", + }, + Put { + key: b"FC:acme/events/000\0materialize/mat/t1\0\x01\xbb\0\0\0\0", + value: b"\x11Z\0\0\0\0\0\0\0 \xd4\xfd\xff\xff\xff\xff\xff\xff\xff\x01", + }, + Put { + key: b"FC:acme/events/001\0materialize/mat/t2\0\x01\xcc\0\0\0\0", + value: b"\x112\0\0\0\0\0\0\0 
\xce\xff\xff\xff\xff\xff\xff\xff\xff\x01", + }, + Delete { + key: b"checkpoint", + }, + ], + ), ( "delete_ack_alone", [ diff --git a/crates/runtime-next/src/task_service.rs b/crates/runtime-next/src/task_service.rs index bdddee39797..d0d2fc1f47a 100644 --- a/crates/runtime-next/src/task_service.rs +++ b/crates/runtime-next/src/task_service.rs @@ -31,7 +31,7 @@ impl TaskService { std::env::var("FLOW_DATA_PLANE_FQDN").context("FLOW_DATA_PLANE_FQDN not set")?; let control_api_endpoint = std::env::var("FLOW_CONTROL_API").context("FLOW_CONTROL_API not set")?; - let availability_zone = std::env::var("ZONE").unwrap_or_else(|_| "local".to_string()); + let availability_zone = std::env::var("CONSUMER_ZONE").context("CONSUMER_ZONE not set")?; let data_plane_signing_key = first_consumer_auth_key()?; let log_handler = ::ops::new_encoded_json_write_handler(std::sync::Arc::new( @@ -47,10 +47,11 @@ impl TaskService { let control_api_endpoint: url::Url = url::Url::parse(&control_api_endpoint).context("invalid control API endpoint URL")?; + use proto_gazette::capability::{APPEND, APPLY, LIST}; let publisher_factory = flow_client_next::workflows::task_collection_auth::new_journal_client_factory( flow_client_next::rest::Client::new(&control_api_endpoint, "task-service"), - proto_gazette::capability::APPEND | proto_gazette::capability::APPLY, + APPEND | APPLY | LIST, gazette::Router::new(&availability_zone), data_plane_fqdn, tokens::jwt::EncodingKey::from_secret(&data_plane_signing_key), @@ -65,6 +66,9 @@ impl TaskService { Some(tokio_context.set_log_level_fn()), task_name, publisher_factory, + // Inert registry: TaskService is the CGO entry point and does not + // serve an admin surface; event! tracks still capture per-handler. 
+ service_kit::Registry::default(), ); let uds = tokio_context diff --git a/crates/runtime-sidecar/Cargo.toml b/crates/runtime-sidecar/Cargo.toml new file mode 100644 index 00000000000..9b593664187 --- /dev/null +++ b/crates/runtime-sidecar/Cargo.toml @@ -0,0 +1,33 @@ +[package] +name = "runtime-sidecar" +version.workspace = true +rust-version.workspace = true +edition.workspace = true +authors.workspace = true +homepage.workspace = true +repository.workspace = true +license.workspace = true + +[[bin]] +name = "runtime-sidecar" +path = "src/main.rs" + +[dependencies] +flow-client-next = { path = "../flow-client-next" } +gazette = { path = "../gazette" } +proto-gazette = { path = "../proto-gazette" } +runtime-next = { path = "../runtime-next" } +service-kit = { path = "../service-kit" } +shuffle = { path = "../shuffle" } +tokens = { path = "../tokens" } + +anyhow = { workspace = true } +clap = { workspace = true } +futures = { workspace = true } +rustls = { workspace = true } +tokio = { workspace = true } +tokio-stream = { workspace = true } +tonic = { workspace = true, features = ["tls-aws-lc"] } +tracing = { workspace = true } +tracing-subscriber = { workspace = true } +url = { workspace = true } diff --git a/crates/runtime-sidecar/README.md b/crates/runtime-sidecar/README.md new file mode 100644 index 00000000000..aaee123c4b5 --- /dev/null +++ b/crates/runtime-sidecar/README.md @@ -0,0 +1,23 @@ +# runtime-sidecar + +Production sidecar process for the runtime-v2 architecture +(`plans/runtime-v2/plan.md`). One per reactor machine, supervised by +systemd, hosting two gRPC services on a fixed fleet-wide port: + +- **Shuffle Leader** — `runtime_next::leader::Service`, the per-task + Join rendezvous and HeadFSM/TailFSM coordination for tasks whose + shard zero is on this machine. +- **Shuffle** — `shuffle::Service`, the Session/Slice/Log RPCs. + +## Listeners + +`--listen-port` binds a TCP listener at `[::]:`. 
TLS is on if +`--certificate-file` and `--certificate-key-file` are both provided. + +## Auth + +`--data-plane-auth-keys` is whitespace- or comma-separated base64 +HMAC keys, matching gazette's `auth-keys` semantics. The first key +signs outgoing `/authorize/task` requests issued by the leader and +shuffle to obtain Gazette journal tokens. (Incoming-gRPC verification +against the full key list is wired in a follow-up change.) diff --git a/crates/runtime-sidecar/src/lib.rs b/crates/runtime-sidecar/src/lib.rs new file mode 100644 index 00000000000..b94556bcdaa --- /dev/null +++ b/crates/runtime-sidecar/src/lib.rs @@ -0,0 +1,197 @@ +use anyhow::Context; +use clap::Parser; +use std::path::PathBuf; + +/// Command-line arguments for the runtime-sidecar process. +/// +/// Naming aligns with sibling Rust services (`dekaf`) for unprefixed +/// `DATA_PLANE_*`, `CERTIFICATE_*`, and `AGENT_ENDPOINT` envs. The +/// reactor (Go consumer) uses `FLOW_*` and `CONSUMER_*` namespaced +/// envs because that's `go-flags`/gazette `mainboilerplate` convention; +/// we follow the unprefixed form here. +#[derive(Debug, Parser)] +#[command(about, version)] +pub struct Args { + #[arg(long = "log-format", env = "LOG_FORMAT", default_value = "text")] + pub log_format: LogFormat, + + /// TCP port to listen on, binding `[::]:`. + #[arg(long, env = "LISTEN_PORT")] + pub listen_port: u16, + + /// When set, serve the admin surface (live gRPC handler inventory) on + /// `127.0.0.1:`. Loopback-only: this surface has no authentication. + #[arg(long, env = "ADMIN_PORT")] + pub admin_port: Option<u16>, + + /// Externally-reachable URL of this sidecar, advertised to peer + /// shuffle clients (e.g. `https://reactor-foo.flow.localhost:9100`). + #[arg(long, env = "PEER_ENDPOINT")] + pub peer_endpoint: String, + + /// Fully-qualified domain name of the data-plane that this sidecar + /// belongs to; used as the issuer claim of authorization tokens. 
+ #[arg(long, env = "DATA_PLANE_FQDN")] + pub data_plane_fqdn: String, + + /// Whitespace- or comma-separated base64 HMAC keys recognized by + /// the data plane. The first key signs outgoing `/authorize/task` + /// requests; all keys are accepted as verifiers for incoming + /// gRPC traffic (incoming verification is wired in a follow-up). + #[arg(long, env = "DATA_PLANE_AUTH_KEYS")] + pub data_plane_auth_keys: String, + + /// TLS server certificate PEM. Both `--certificate-file` and + /// `--certificate-key-file` must be provided together. + #[arg(long, env = "CERTIFICATE_FILE", requires = "certificate_key_file")] + pub certificate_file: Option<PathBuf>, + + /// TLS server private key PEM. Required iff `--certificate-file` is set. + #[arg(long, env = "CERTIFICATE_KEY_FILE", requires = "certificate_file")] + pub certificate_key_file: Option<PathBuf>, + + /// Estuary agent REST base URL used to issue `/authorize/task` calls. + #[arg(long, env = "AGENT_ENDPOINT")] + pub agent_endpoint: url::Url, + + /// Broker zone passed to `gazette::Router::new`. + #[arg(long, env = "GAZETTE_ZONE", default_value = "local")] + pub gazette_zone: String, + + /// On-disk shuffle log overflow threshold in bytes. Default is 2 GiB. + #[arg(long, env = "DISK_BACKLOG_THRESHOLD", default_value_t = 2 * 1024 * 1024 * 1024)] + pub disk_backlog_threshold: u64, +} + +#[derive(Debug, Clone, Copy, PartialEq, clap::ValueEnum)] +pub enum LogFormat { + Text, + Json, +} + +pub async fn run(args: Args, registry: service_kit::Registry) -> anyhow::Result<()> { + // Parse comma/whitespace-separated base64 HMAC keys. First key signs + // outgoing /authorize/task requests; the remaining keys are reserved + // for future incoming-gRPC verification (see plan). 
+ let keys: Vec<String> = args + .data_plane_auth_keys + .split(|c: char| c == ',' || c.is_whitespace()) + .filter(|s| !s.is_empty()) + .map(str::to_owned) + .collect(); + anyhow::ensure!( + !keys.is_empty(), + "--data-plane-auth-keys must contain at least one key" + ); + let signing_key = tokens::jwt::EncodingKey::from_base64_secret(&keys[0]) + .context("parsing first data-plane auth key (base64)")?; + let _verification_keys = keys; // TODO(runtime-v2): wire up incoming-gRPC verification. + + // Build REST + Router. + let api_client = flow_client_next::rest::Client::new(&args.agent_endpoint, "runtime-sidecar"); + let router = gazette::Router::new(&args.gazette_zone); + + use proto_gazette::capability::{APPEND, APPLY, LIST, READ}; + // Shuffle read factory watches (LIST) and reads (READ) source journals. + let read_factory = + flow_client_next::workflows::task_collection_auth::new_journal_client_factory( + api_client.clone(), + LIST | READ, + router.clone(), + args.data_plane_fqdn.clone(), + signing_key.clone(), + ); + // Publisher factory watches (LIST), creates partitions (APPLY), + // and appends (APPEND) to dest journals. + let publisher_factory = + flow_client_next::workflows::task_collection_auth::new_journal_client_factory( + api_client, + APPEND | APPLY | LIST, + router, + args.data_plane_fqdn, + signing_key, + ); + + let shuffle_svc = shuffle::Service::new( + args.peer_endpoint, + read_factory, + args.disk_backlog_threshold, + registry.clone(), + ); + let runtime_svc = + runtime_next::Service::new(shuffle_svc.clone(), publisher_factory, registry.clone()); + + // Build a TLS identity if both files were given. + // clap `requires` enforces both-or-neither. 
+ let tls_identity = if let (Some(cert), Some(key)) = ( + args.certificate_file.as_ref(), + args.certificate_key_file.as_ref(), + ) { + let cert_bytes = tokio::fs::read(cert) + .await + .with_context(|| format!("reading {}", cert.display()))?; + let key_bytes = tokio::fs::read(key) + .await + .with_context(|| format!("reading {}", key.display()))?; + Some(tonic::transport::Identity::from_pem(cert_bytes, key_bytes)) + } else { + None + }; + + // SIGTERM (systemd) and SIGINT (interactive Ctrl+C) both initiate graceful shutdown. + let (shutdown_tx, mut shutdown_rx) = tokio::sync::broadcast::channel::<()>(1); + { + let shutdown_tx = shutdown_tx.clone(); + tokio::spawn(async move { + let mut term = + tokio::signal::unix::signal(tokio::signal::unix::SignalKind::terminate()) + .expect("install SIGTERM handler"); + tokio::select! { + _ = term.recv() => tracing::info!("SIGTERM received"), + _ = tokio::signal::ctrl_c() => tracing::info!("SIGINT received"), + } + let _ = shutdown_tx.send(()); + }); + } + + // Optionally serve the loopback admin surface (handler dashboard). 
+ if let Some(admin_port) = args.admin_port { + let addr = std::net::SocketAddr::from(([127, 0, 0, 1], admin_port)); + let registry = registry.clone(); + let mut shutdown_rx = shutdown_rx.resubscribe(); + tokio::spawn(async move { + let shutdown = async move { + let _ = shutdown_rx.recv().await; + }; + if let Err(err) = + service_kit::admin::serve("runtime-sidecar", registry, addr, shutdown).await + { + tracing::error!(?err, "runtime-sidecar admin surface exited with error"); + } + }); + } + + let addr = format!("[::]:{}", args.listen_port); + let tcp = tokio::net::TcpListener::bind(&addr) + .await + .with_context(|| format!("binding TCP {addr}"))?; + tracing::info!(%addr, tls = tls_identity.is_some(), "runtime-sidecar listening on TCP"); + + let mut builder = tonic::transport::Server::builder(); + if let Some(identity) = tls_identity { + builder = builder + .tls_config(tonic::transport::ServerTlsConfig::new().identity(identity)) + .context("configuring TCP TLS")?; + } + builder + .add_service(runtime_svc.into_tonic_service()) + .add_service(shuffle_svc.into_tonic_service()) + .serve_with_incoming_shutdown( + tokio_stream::wrappers::TcpListenerStream::new(tcp), + async move { + let _ = shutdown_rx.recv().await; + }, + ) + .await + .context("serving runtime-sidecar TCP") +} diff --git a/crates/runtime-sidecar/src/main.rs b/crates/runtime-sidecar/src/main.rs new file mode 100644 index 00000000000..552511bbbdd --- /dev/null +++ b/crates/runtime-sidecar/src/main.rs @@ -0,0 +1,64 @@ +use clap::Parser; + +fn main() -> Result<(), anyhow::Error> { + // Required for libraries that use rustls (tonic TLS, gazette client TLS). 
+ // See https://docs.rs/rustls/latest/rustls/crypto/struct.CryptoProvider.html + rustls::crypto::aws_lc_rs::default_provider() + .install_default() + .expect("failed to install default crypto provider"); + + let args = runtime_sidecar::Args::parse(); + // The handler registry is shared between the tracing subscriber (which + // consults per-handler trace overrides) and the services (which populate it + // and expose it via the admin surface). + let registry = service_kit::Registry::new(); + install_tracing(args.log_format, registry.clone()); + + let runtime = tokio::runtime::Builder::new_multi_thread() + .enable_all() + .build()?; + + let result = runtime.block_on(runtime.spawn(runtime_sidecar::run(args, registry))); + runtime.shutdown_timeout(std::time::Duration::from_secs(5)); + result? +} + +/// Install a tracing subscriber that writes structured application logs to +/// stderr. The base `EnvFilter` (`RUST_LOG`, default `info`) is composed with +/// `service_kit::trace`'s per-handler override filter, so an operator can raise +/// a handler's verbosity at runtime via the admin dashboard; `service_kit::event` +/// additionally captures opt-in `event!` breadcrumbs into per-handler tracks +/// shown on the dashboard's handler drill-down page. +fn install_tracing(log_format: runtime_sidecar::LogFormat, registry: service_kit::Registry) { + use tracing_subscriber::Layer; + use tracing_subscriber::layer::SubscriberExt; + use tracing_subscriber::util::SubscriberInitExt; + + let env_filter = tracing_subscriber::EnvFilter::try_from_default_env() + .unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("info")); + + // `fmt` layer, boxed so the JSON and text variants share one assembly path. 
+ let fmt_layer: Box<dyn Layer<tracing_subscriber::Registry> + Send + Sync> = match log_format { + runtime_sidecar::LogFormat::Json => Box::new( + tracing_subscriber::fmt::layer() + .json() + .with_writer(std::io::stderr), + ), + runtime_sidecar::LogFormat::Text => { + let no_color = matches!(std::env::var("NO_COLOR"), Ok(v) if v == "1"); + Box::new( + tracing_subscriber::fmt::layer() + .with_ansi(!no_color) + .with_writer(std::io::stderr), + ) + } + }; + + tracing_subscriber::registry() + .with(fmt_layer.with_filter(service_kit::trace::layer_filter( + env_filter, + registry.clone(), + ))) + .with(service_kit::event::layer(registry)) + .init(); +} diff --git a/crates/runtime/src/materialize/protocol.rs b/crates/runtime/src/materialize/protocol.rs index 3bfaa8d608f..238bb5a21cb 100644 --- a/crates/runtime/src/materialize/protocol.rs +++ b/crates/runtime/src/materialize/protocol.rs @@ -572,7 +572,7 @@ pub fn recv_client_start_commit( txn: &mut Transaction, ) -> anyhow::Result<(Request, rocksdb::WriteBatch)> { let verify = verify("client", "StartCommit with runtime_checkpoint"); - let request = verify.not_eof(request)?; + let mut request = verify.not_eof(request)?; let Request { start_commit: @@ -581,7 +581,7 @@ pub fn recv_client_start_commit( .. }), .. - } = &request + } = &mut request else { return verify.fail(request); }; @@ -590,6 +590,17 @@ pub fn recv_client_start_commit( // merge-able, incremental update that's written to the WriteBatch. let _last_checkpoint = last_checkpoint; + // The V2 leader stamps a synthetic "committed-close" source into the + // consumer.Checkpoint on each commit, recording the V2 RocksDB epoch. + // If V1 inherits a checkpoint from a prior V2 run, the marker is + // preserved verbatim across V1 commits. A subsequent V2 rollforward + // would then mistake the stale marker for an in-sync RocksDB state + // and ignore the legacy_checkpoint, resuming from V2's stale frontier + // and re-processing whatever V1 had advanced past.
Strip the marker + // so V2 startup treats V1's advanced sources as authoritative. + runtime_checkpoint.sources.remove("committed-close"); + let runtime_checkpoint = runtime_checkpoint.clone(); + let mut wb = rocksdb::WriteBatch::default(); tracing::debug!( @@ -598,7 +609,7 @@ pub fn recv_client_start_commit( ); wb.put(RocksDB::CHECKPOINT_KEY, runtime_checkpoint.encode_to_vec()); - txn.checkpoint = runtime_checkpoint.clone(); + txn.checkpoint = runtime_checkpoint; Ok((request, wb)) } diff --git a/crates/service-kit/Cargo.toml b/crates/service-kit/Cargo.toml new file mode 100644 index 00000000000..7753a52642d --- /dev/null +++ b/crates/service-kit/Cargo.toml @@ -0,0 +1,28 @@ +[package] +name = "service-kit" +version.workspace = true +rust-version.workspace = true +edition.workspace = true +authors.workspace = true +homepage.workspace = true +repository.workspace = true +license.workspace = true + +[dependencies] +proto-gazette = { path = "../proto-gazette" } + +anyhow = { workspace = true } +axum = { workspace = true } +chrono = { workspace = true } +metrics-exporter-prometheus = { workspace = true } +metrics-util = { workspace = true } +serde = { workspace = true } +serde_json = { workspace = true } +tokio = { workspace = true } +tower-http = { workspace = true } +tracing = { workspace = true } +tracing-subscriber = { workspace = true } + +[dev-dependencies] +metrics = { workspace = true } +tower = { workspace = true } diff --git a/crates/service-kit/README.md b/crates/service-kit/README.md new file mode 100644 index 00000000000..dedc28b84ec --- /dev/null +++ b/crates/service-kit/README.md @@ -0,0 +1,70 @@ +# service-kit + +Building blocks for the operational surface of a long-running async service: +a loopback HTTP port exposing what the process is doing right now, with +controls to debug it without restarting. + +Service-agnostic. Used by Estuary reactors and runtime-next; nothing here +knows about Flow. 
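The "loopback HTTP port" constraint above can be sketched with the standard library alone. This is a minimal illustration, not part of service-kit: `admin_addr` and `require_loopback` are hypothetical helpers, assumed here only to show how a caller might pin the admin bind address to loopback before handing it to the admin surface.

```rust
use std::net::{Ipv4Addr, SocketAddr};

/// Hypothetical helper (not part of service-kit): build the admin bind
/// address for `port`, pinned to the IPv4 loopback interface.
fn admin_addr(port: u16) -> SocketAddr {
    SocketAddr::from((Ipv4Addr::LOCALHOST, port))
}

/// Hypothetical guard: reject any operator-supplied address that is not
/// loopback, since the admin surface carries no authentication.
fn require_loopback(addr: SocketAddr) -> Result<SocketAddr, String> {
    if addr.ip().is_loopback() {
        Ok(addr)
    } else {
        Err(format!("admin surface must bind loopback, got {addr}"))
    }
}
```

A caller could validate a configured address with `require_loopback(addr)?` before serving; both `127.0.0.1` and `::1` pass, while a wildcard such as `0.0.0.0` is rejected.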
+ +## Surface + +A service constructs a [`Registry`], wires its tracing/event/metrics layers +in, and serves the admin router on a loopback port. Every spawned unit of +work registers a [`HandlerGuard`]; its lifecycle, identity, and recent events +become visible on the dashboard. + +| Module | Role | +| ------------------ | --------------------------------------------------------------------------------------- | +| [`handlers`] | `Registry` / `HandlerGuard` — in-flight handler inventory plus a recently-finished ring | +| [`admin`] | HTML dashboard, JSON views, per-handler drill-down, trace-override `POST` endpoint | +| [`trace`] | Per-handler `tracing` verbosity override — additive filter composed with the base | +| [`event`] | Opt-in per-handler event tracks (named ring buffers) + `event!` macro; lazy capture | +| [`metrics`] | Prometheus `/metrics` exporter folded into the admin router; histogram upkeep tick | + +## Entry points + +- `Registry::new()` / `Registry::register(kind)` — register a handler; run + its body inside `guard.span()` so tracing/event layers can find it. +- `admin::serve(name, registry, addr, shutdown)` — bind the loopback admin + port and serve until shutdown. +- `trace::layer_filter(base, registry)` — wrap a `tracing_subscriber` + `EnvFilter` so operator overrides bypass it. +- `event::layer(registry)` — install alongside `fmt` so `event!` calls + capture into per-handler tracks. +- `metrics::install_recorder()` — idempotently install the global + `metrics` recorder; called transitively by `admin::build_router`. + +## Non-obvious details + +- **No auth.** Bind admin on loopback only. +- **Trace-override is additive** — it raises verbosity for one handler but + never suppresses what the base filter would keep. Cost when no override + is set is one extra `enabled()` check per disabled callsite (atomic load, + short scope walk only inside a handler span). 
+- **Handler spans must always be created.** `OverrideFilter` short-circuits + to `true` for the `service_kit::handler` target — that's where override + state is hung. Don't filter the target out at the base. +- **`event!` capture is lazy.** A literal message stays `&'static str`; an + interpolated message stores `(template, captured args)` and is rendered + only when read. Values must be cheap-to-capture types (numbers, bools, + `&'static str`, `String`, `Arc`, `Clock`) or a `lazy`/`json`/`debug` + thunk — borrowed `&str` or `?x`/`%x` formatters won't compile; the call + site converts. See [`event`] for the full contract. +- **`event!` is a no-op outside a handler span** *and* under any subscriber + that isn't a `tracing_subscriber::registry::Registry`. The accompanying + `tracing` event still fires. +- **Prometheus histogram upkeep** isn't driven by scrapes; `install_recorder` + spawns its own tick because we own the HTTP surface (the upstream + `PrometheusBuilder::install` convenience constructor isn't used). Idle- + metric pruning *is* scrape-driven, so it only fires when a scraper polls. +- **`metrics` recorder install is process-global and panics on conflict.** + Service-kit owns that slot. + +[`Registry`]: src/handlers.rs +[`HandlerGuard`]: src/handlers.rs +[`handlers`]: src/handlers.rs +[`admin`]: src/admin.rs +[`trace`]: src/trace.rs +[`event`]: src/event.rs +[`metrics`]: src/metrics.rs diff --git a/crates/service-kit/src/admin.rs b/crates/service-kit/src/admin.rs new file mode 100644 index 00000000000..65e1e953712 --- /dev/null +++ b/crates/service-kit/src/admin.rs @@ -0,0 +1,569 @@ +//! The admin surface: a loopback HTTP endpoint presenting a service's +//! [`Registry`] as an auto-refreshing HTML dashboard, a per-handler drill-down +//! page (identity, phase, and recent [`crate::event`] tracks), plus JSON views. +//! The Prometheus scrape endpoint from [`crate::metrics`] is folded in too, so +//! `/metrics` is reachable on the same loopback port. 
+ +use crate::{HandlerDetail, Registry, Snapshot}; +use std::fmt::Write; +use std::net::SocketAddr; + +#[derive(Clone)] +struct AdminState { + registry: Registry, + // Service name shown in the page title; the dashboard serves one service. + title: std::sync::Arc<str>, +} + +/// Build the admin router for `service_name`. +/// Must be called from within a Tokio runtime. +pub fn build_router(service_name: impl Into<String>, registry: Registry) -> axum::Router<()> { + use axum::routing::{get, post}; + + let state = AdminState { + registry, + title: service_name.into().into(), + }; + axum::Router::new() + .route("/", get(index)) + .route("/debug/handlers.json", get(handlers_json)) + .route("/debug/handlers/{id}", get(handler)) + .route("/debug/handlers/{id}/detail.json", get(handler_detail_json)) + // POST (not GET): this mutates process state, so it shouldn't be + // reachable by a link prefetch or an open-all-tabs. + .route("/debug/handlers/{id}/level/{level}", post(set_level)) + .with_state(state) + // `merge` (not `nest`): keep `/metrics` at the root of the admin port, + // matching what Prometheus scrapers expect by default. + .merge(crate::metrics::install_recorder()) + .layer(tower_http::trace::TraceLayer::new_for_http()) +} + +/// Bind `addr` and serve the admin surface until `shutdown` resolves. `addr` +/// should be loopback-only — there is no authentication on this surface.
+pub async fn serve( + service_name: impl Into<String>, + registry: Registry, + addr: SocketAddr, + shutdown: impl std::future::Future<Output = ()> + Send + 'static, +) -> anyhow::Result<()> { + let service_name = service_name.into(); + let listener = tokio::net::TcpListener::bind(addr) + .await + .map_err(|err| anyhow::anyhow!("binding {service_name} admin surface on {addr}: {err}"))?; + tracing::info!(%addr, service = %service_name, "service-kit admin surface listening"); + + axum::serve(listener, build_router(service_name.clone(), registry)) + .with_graceful_shutdown(shutdown) + .await + .map_err(|err| anyhow::anyhow!("serving {service_name} admin surface: {err}")) +} + +async fn handlers_json( + axum::extract::State(state): axum::extract::State<AdminState>, +) -> axum::Json<Snapshot> { + axum::Json(state.registry.snapshot()) +} + +async fn handler( + axum::extract::State(state): axum::extract::State<AdminState>, + axum::extract::Path(id): axum::extract::Path<u64>, + axum::extract::Query(params): axum::extract::Query<std::collections::HashMap<String, String>>, +) -> axum::response::Result<axum::response::Html<String>> { + let detail = state + .registry + .handler_detail(id) + .ok_or((axum::http::StatusCode::NOT_FOUND, "no such handler"))?; + // Drill-down time display: `?t=zulu` to flip from the default relative ages + // to absolute RFC-3339 timestamps; preserved across the page's 2s refresh. + let zulu = params.get("t").is_some_and(|v| v == "zulu"); + Ok(axum::response::Html(render_handler( + &state.title, + &detail, + zulu, + ))) +} + +async fn handler_detail_json( + axum::extract::State(state): axum::extract::State<AdminState>, + axum::extract::Path(id): axum::extract::Path<u64>, +) -> axum::response::Result<axum::Json<HandlerDetail>> { + let detail = state + .registry + .handler_detail(id) + .ok_or((axum::http::StatusCode::NOT_FOUND, "no such handler"))?; + Ok(axum::Json(detail)) +} + +/// `POST` target of the dashboard's trace-level buttons; replies with a +/// see-other redirect back to the index (post/redirect/get).
+async fn set_level( + axum::extract::State(state): axum::extract::State<AdminState>, + axum::extract::Path((id, level)): axum::extract::Path<(u64, String)>, +) -> axum::response::Result<axum::response::Redirect> { + let level = + parse_level(&level).ok_or((axum::http::StatusCode::BAD_REQUEST, "unknown trace level"))?; + // A miss (handler already finished) is unremarkable — fall through to the + // refreshed index, which will no longer list it. + let _ = state.registry.set_trace_override(id, level); + Ok(axum::response::Redirect::to("/")) +} + +/// Parse a path segment into a trace-override level; `Some(None)` clears it. +fn parse_level(s: &str) -> Option<Option<tracing::Level>> { + Some(match s { + "off" | "clear" | "none" => None, + "error" => Some(tracing::Level::ERROR), + "warn" => Some(tracing::Level::WARN), + "info" => Some(tracing::Level::INFO), + "debug" => Some(tracing::Level::DEBUG), + "trace" => Some(tracing::Level::TRACE), + _ => return None, + }) +} + +async fn index( + axum::extract::State(state): axum::extract::State<AdminState>, +) -> axum::response::Html<String> { + axum::response::Html(render_index(&state.title, &state.registry.snapshot())) +} + +/// Inline page styling shared by [`render_index`] and [`render_handler`]. +const PAGE_STYLE: &str = "body{font:13px/1.4 ui-monospace,Menlo,Consolas,monospace;margin:1.5rem;color:#222}\ +h1,h2,h3{font-size:1rem;margin:1.2rem 0 .4rem}\ +table{border-collapse:collapse;width:100%;margin-bottom:.8rem}\ +th,td{text-align:left;padding:.25rem .6rem;border-bottom:1px solid #ddd;vertical-align:top}\ +th{background:#f3f3f3}\ +a{color:#06c;text-decoration:none}a:hover{text-decoration:underline}\ +form{display:inline}\ +button{font:inherit;color:#06c;background:none;border:0;padding:0;cursor:pointer}button:hover{text-decoration:underline}\ +.kind{color:#06c}.phase{font-weight:bold}.muted{color:#888}.on{font-weight:bold;color:#c30}"; + +/// Open an auto-refreshing HTML page with the shared style; `head_title` is the +/// already-escaped `<title>` text.
+fn page_open(out: &mut String, head_title: &str) { + let _ = write!( + out, + "<!doctype html><html><head><meta charset=\"utf-8\">\ + <meta http-equiv=\"refresh\" content=\"2\">\ + <title>{head_title}", + ); +} + +/// A link to a handler's drill-down page, wrapping the given (pre-escaped) text. +fn handler_link(id: u64, text: &str) -> String { + format!("{text}") +} + +fn render_index(title: &str, snapshot: &Snapshot) -> String { + let title = esc(title); + let mut out = String::new(); + page_open(&mut out, &format!("{title} handlers")); + + let _ = write!( + out, + "

{title} handlers

{} live, {} recently finished — auto-refreshes every 2s · json

", + snapshot.live.len(), + snapshot.recent.len(), + ); + + out.push_str("

Live

"); + if snapshot.live.is_empty() { + out.push_str("

(none)

"); + } else { + out.push_str( + "", + ); + for h in &snapshot.live { + let _ = write!( + out, + "", + handler_link(h.id, &h.id.to_string()), + handler_link(h.id, &esc(h.kind)), + esc(&h.label), + fmt_age(h.age_seconds), + esc(&h.phase), + fmt_age(h.phase_age_seconds), + fmt_trace_controls(h.id, h.trace_override), + fmt_fields(&h.fields), + ); + } + out.push_str("
idkindlabelagephasetracefields
{}{}{}{}{} ({}){}{}
"); + } + + out.push_str("

Recently finished

"); + if snapshot.recent.is_empty() { + out.push_str("

(none)

"); + } else { + out.push_str( + "", + ); + // Newest first. + for h in snapshot.recent.iter().rev() { + let _ = write!( + out, + "", + handler_link(h.id, &h.id.to_string()), + handler_link(h.id, &esc(h.kind)), + esc(&h.label), + fmt_age(h.age_seconds), + esc(&h.final_phase), + ); + } + out.push_str("
idkindlabelran forfinal phase
{}{}{}{}{}
"); + } + + out.push_str(""); + out +} + +fn render_handler(title: &str, h: &HandlerDetail, zulu: bool) -> String { + let title = esc(title); + let mut out = String::new(); + page_open(&mut out, &format!("{title} handler #{}", h.id)); + + let status = if h.finished { "finished" } else { "live" }; + // Toggle: the currently-active mode is bold, the alternative is a link + // back to the same handler with the opposite `t=` query. + let toggle = if zulu { + format!("age · zulu", h.id) + } else { + format!( + "age · zulu", + h.id, + ) + }; + let _ = write!( + out, + "

← {title} handlers · json · times: {toggle}

\ +

handler #{} ({status})

", + h.id, h.id, + ); + + out.push_str(""); + let _ = write!( + out, + "", + esc(h.kind) + ); + let _ = write!(out, "", esc(&h.label)); + + // Phase parenthetical: phase age (default) or absolute phase-change time + // (zulu). `phase_since_rfc3339` and `phase_age_seconds` are both `None` + // on a finished handler, in which case there is no parenthetical. + let phase_paren = match (zulu, &h.phase_since_rfc3339, h.phase_age_seconds) { + (true, Some(at), _) => { + format!(" (since {})", esc(at)) + } + (false, _, Some(s)) => format!(" ({})", fmt_age(s)), + _ => String::new(), + }; + let _ = write!( + out, + "", + esc(&h.phase), + ); + + // Age row. For a finished handler, `age_seconds` is the total runtime — + // a duration with no zulu equivalent, so it renders the same in both + // modes. For a live handler in zulu mode, show the absolute start time. + let (age_label, age_value) = if h.finished { + ("ran for", fmt_age(h.age_seconds)) + } else if zulu { + match &h.started_at_rfc3339 { + Some(at) => ("started", esc(at)), + None => ("age", fmt_age(h.age_seconds)), + } + } else { + ("age", fmt_age(h.age_seconds)) + }; + let _ = write!(out, ""); + + if !h.finished { + let _ = write!( + out, + "", + fmt_trace_controls(h.id, h.trace_override), + ); + } + if !h.fields.is_empty() { + let _ = write!( + out, + "", + fmt_fields(&h.fields) + ); + } + out.push_str("
kind{}
label{}
phase{}{phase_paren}
{age_label}{age_value}
trace{}
fields{}
"); + + out.push_str("

Tracks

"); + if h.tracks.is_empty() { + out.push_str("

(no events captured)

"); + } else { + let time_col = if zulu { "at" } else { "age" }; + for (name, events) in &h.tracks { + let _ = write!(out, "

{}

", esc(name)); + let _ = write!( + out, + "", + ); + // Oldest first: the operator reads top-to-bottom following the + // sequence of events that led to the handler's current state. + for e in events { + let when = if zulu { + esc(&e.at_rfc3339) + } else { + fmt_age(e.age_seconds) + }; + let _ = write!( + out, + "", + e.level, + esc(&e.message), + fmt_fields(&e.fields), + ); + } + out.push_str("
{time_col}levelmessagefields
{when}{}{}{}
"); + } + } + + out.push_str(""); + out +} + +/// The current override (highlighted if set) plus the buttons to change it — +/// tiny `POST` forms rather than links, since the target mutates process state. +fn fmt_trace_controls(id: u64, current: Option<&'static str>) -> String { + let mut out = String::new(); + match current { + Some(level) => { + let _ = write!(out, "{level} "); + } + None => out.push_str("base "), + } + for label in ["trace", "debug", "off"] { + let _ = write!( + out, + "
", + ); + } + out +} + +fn fmt_fields(fields: &[(&'static str, String)]) -> String { + fields + .iter() + .map(|(k, v)| format!("{}={}", esc(k), esc(v))) + .collect::<Vec<_>>() + .join(" ") +} + +fn fmt_age(secs: u64) -> String { + if secs < 60 { + format!("{secs}s") + } else if secs < 3600 { + format!("{}m{}s", secs / 60, secs % 60) + } else { + format!("{}h{}m", secs / 3600, (secs % 3600) / 60) + } +} + +/// Minimal HTML-escaping for text interpolated into the dashboard. +fn esc(s: &str) -> String { + let mut out = String::with_capacity(s.len()); + for c in s.chars() { + match c { + '&' => out.push_str("&amp;"), + '<' => out.push_str("&lt;"), + '>' => out.push_str("&gt;"), + '"' => out.push_str("&quot;"), + '\'' => out.push_str("&#39;"), + _ => out.push(c), + } + } + out +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::event::EventView; + use crate::{FinishedView, HandlerDetail, HandlerView}; + + #[test] + fn render_index_includes_handlers_and_escapes() { + let snapshot = Snapshot { + live: vec![HandlerView { + id: 7, + kind: "leader.materialize", + label: "acmeCo/<svc>".to_string(), + phase: "running".to_string(), + age_seconds: 65, + phase_age_seconds: 3, + fields: vec![("shards", "2".to_string())], + trace_override: Some("TRACE"), + }], + recent: vec![FinishedView { + id: 6, + kind: "shuffle.log", + label: "dir/0".to_string(), + final_phase: "done".to_string(), + age_seconds: 4000, + tracks: Default::default(), + }], + }; + + let html = render_index("my-service", &snapshot); + assert!(html.contains("my-service handlers")); + assert!(html.contains("leader.materialize")); + assert!(html.contains("acmeCo/&lt;svc&gt;")); + assert!(!html.contains("acmeCo/<svc>")); + assert!(html.contains("shards=2")); + assert!(html.contains("1m5s")); + assert!(html.contains("1h6m")); + assert!(html.contains("shuffle.log")); + assert!(html.contains("7")); + assert!(html.contains("leader.materialize")); + assert!(html.contains("6")); + assert!(html.contains("shuffle.log")); + assert!(html.contains("running 
(3s)")); + // Trace controls for the live handler — `POST` forms, not links. + assert!(html.contains(">TRACE")); + assert!( + html.contains( + "