Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/workflows/platform-test.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,7 @@ jobs:
- run: mise run build:gazette
- run: mise run build:flowctl-go
- run: mise run ci:nextest-build
- run: mise run local:bigtable
- run: mise run ci:nextest-run
- run: mise run ci:doctest
- run: mise run ci:gotest
Expand Down
122 changes: 116 additions & 6 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,8 @@ futures = "0"
futures-core = "0"
futures-util = "0"
fxhash = "0"
gcp_auth = "0.12"
googleapis-tonic-google-bigtable-v2 = "0.36"
graphql_client = "*"
handlebars = "6"
hex = "0"
Expand Down
39 changes: 39 additions & 0 deletions crates/catalog-stats/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
[package]
name = "catalog-stats"
version.workspace = true
rust-version.workspace = true
edition.workspace = true
authors.workspace = true
homepage.workspace = true
repository.workspace = true
license.workspace = true

[lib]
doctest = false

[features]
# Exposes `catalog_stats::test_util`, the emulator-seeding helpers used by
# integration tests.
test_util = []

[dependencies]
coroutines = { path = "../coroutines" }
ops = { path = "../ops" }
tuple = { path = "../tuple" }

anyhow = { workspace = true }
chrono = { workspace = true }
futures = { workspace = true }
futures-core = { workspace = true }
gcp_auth = { workspace = true }
googleapis-tonic-google-bigtable-v2 = { workspace = true }
serde = { workspace = true }
serde_json = { workspace = true }
tokio = { workspace = true }
tonic = { workspace = true }
tracing = { workspace = true }

[dev-dependencies]
catalog-stats = { path = ".", features = ["test_util"] }
ops = { path = "../ops" }
futures = { workspace = true }
98 changes: 98 additions & 0 deletions crates/catalog-stats/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# catalog-stats

Read-side client for the BigTable tables that hold rolled-up catalog stats.

## Inspecting a local stack

The local `mise run local:stack` workflow materializes the same docs to
both BigTable (via this crate's tables) and Postgres (`catalog_stats`
table populated by `ops.us-central1.v1/stats-view`). The two are useful
to cross-check during development; the snippets below assume the default
local config.

### BigTable

`cbt` runs out of the `google/cloud-sdk` image and talks to the emulator
exposed by `mise run local:bigtable` on `localhost:8086`:

```sh
cbt() {
docker run --rm --network host \
-e BIGTABLE_EMULATOR_HOST=localhost:8086 \
-e CLOUDSDK_CORE_PROJECT=estuary-local \
google/cloud-sdk:latest \
cbt -instance estuary-local "$@"
}

# List the three grain tables.
cbt ls

# Read the first few rows of a grain, restricted to the flow_document column.
cbt read catalog_stats_hourly columns=f:flow_document count=5
```

cbt's `prefix=` / `start=` / `end=` arguments match against the raw row
key, which is FDB tuple-packed and begins with a `0x02` type byte — so a
literal `prefix=bobCo/` will NOT match. For ad-hoc filtering by catalog
name, it's easier to scan the table and post-filter the decoded JSON
(see the `jq` example below).

The `flow_document` cell is printed Go-`%q`-quoted on a 4-space-indented
line; for the ASCII JSON we store, that quoting is also a valid JSON
string, so `jq fromjson` recovers the document:

```sh
cbt read catalog_stats_hourly columns=f:flow_document count=1 \
| grep -aE '^ "' | sed 's/^ //' | jq 'fromjson'
```

### Postgres

The same docs land in the `catalog_stats` table under the local
`stats_loader` role:

```sh
psql() {
PGPASSWORD=stats_loader_password command psql \
-h localhost -U stats_loader -d postgres "$@"
}

# A few recent rows, projected columns.
psql -c "
SELECT catalog_name, grain, ts, docs_written_by_me, bytes_written_by_me
FROM catalog_stats
ORDER BY ts DESC
LIMIT 10"

# The full flow_document for a specific row.
psql -At -c "
SELECT jsonb_pretty(flow_document::jsonb)
FROM catalog_stats
WHERE catalog_name = 'bobCo/hw1/'
AND grain = 'hourly'
AND ts = '2026-05-14T18:00:00Z'"
```

### Comparing

The two stores will usually agree (sans small propagation delay) on every
field except `_meta.uuid`, which each materialization assigns
independently.

**Full dump.** Normalize both sides to one JSON-per-line, sort, and
diff. Identical rows produce byte-identical lines after `del(._meta)`
and key-sorting (`-S`), so `sort` lines them up without an explicit
join:

```sh
for grain in hourly daily monthly; do
cbt read "catalog_stats_${grain}" columns=f:flow_document
done \
| grep -aE '^ "' | sed 's/^ //' \
| jq -cS 'fromjson | del(._meta)' | sort > /tmp/bt.jsonl

psql -At -c "SELECT flow_document::text FROM catalog_stats" \
| jq -cS 'del(._meta)' | sort > /tmp/pg.jsonl

diff /tmp/bt.jsonl /tmp/pg.jsonl
```
Loading
Loading