Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
15 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions .claude/skills/sesh-mode/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,46 @@ Stateright DST Tests Production Metrics
(exhaustive) (simulation) (Observability)
```

## When a Verification Check Fails — STOP, Don't Weaken

**MUST**: when a TLA+ invariant, Stateright property, or DST assertion fails,
the **first** action is diagnosis, not modification. Verification properties
are load-bearing — they encode the safety claim the system has to satisfy.
Silently weakening them to make the check pass loses the proof obligation
and hides future bugs.

The failure has exactly two possible causes:

1. **The implementation/model has a real bug.** Fix the bug.
2. **The property is over-strong** — it asserts something the design does not
actually guarantee. *Sometimes* the right answer is to weaken or replace
the property — but the failing trace was probably also revealing the
real, weaker safety property the design *does* guarantee. Almost never
should the answer be just "remove the property."

**Required protocol**:
- Read the failing trace. State out loud what the property meant to claim,
and what the trace shows.
- If unsure whether (1) or (2) applies — **stop and ask the user**.
Do not silently rewrite the property.
- If (2) applies and you propose to weaken/replace, present the candidate
replacement *and* the safety property the original was reaching for, then
ask before changing. The replacement should usually preserve the spirit
via a different formulation (action property, liveness, narrower precondition)
— not just delete the constraint.

**Forbidden** without explicit user approval:
- Renaming an invariant to make its negation trivially true
- Deleting an invariant that just produced a counter-example
- Adding `=> TRUE` or other no-op weakenings
- Changing `\A` to `\E`, `[]` to `<>`, or similar quantifier flips, when the
motive is to suppress a violation rather than to capture a different claim

**Why**: invariants describe *the system's promise to its users*. When TLC
finds a counter-example, it has just told you either that the implementation
is wrong, or that you've been claiming a stronger promise than you actually
keep. Both deserve a conscious decision, never a silent edit.

## Testing Through Production Path

**MUST NOT** claim a feature works unless tested through the actual network stack.
Expand Down
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -32,3 +32,6 @@ qwdata

# GSD planning artifacts (local only)
.planning

# TLC model checker working directory (state dumps from `tla2tools.jar`)
states/
96 changes: 64 additions & 32 deletions docs/internals/specs/tla/CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,61 +4,91 @@ This directory contains formal TLA+ specifications for Quickwit protocols.

## Setup (One-Time)

Install TLA+ tools to a standard location (NOT in this directory):
Install TLA+ Toolbox via Homebrew cask:

```bash
# Option 1: Homebrew (recommended on macOS)
brew install tlaplus
brew install --cask tla+-toolbox
```

# Option 2: Download to ~/.local/lib
mkdir -p ~/.local/lib
curl -L -o ~/.local/lib/tla2tools.jar \
"https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar"
This installs the TLA+ Toolbox app to `/Applications/TLA+ Toolbox.app`,
which bundles `tla2tools.jar` at:

# Option 3: VS Code extension (for interactive use)
# Install: alygin.vscode-tlaplus
```
/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar
```

No separate `tlc` CLI is installed — run TLC via `java -jar`.

## Running Model Checker

From the repository root:
All commands run from the **repository root** (`quickwit-oss/quickwit/`):

```bash
# If installed via Homebrew:
tlc -config docs/internals/specs/tla/ExampleProtocol.cfg docs/internals/specs/tla/ExampleProtocol.tla
# Quick check (small config, ~seconds):
java -XX:+UseParallelGC -Xmx4g \
-jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
-workers 4 \
-config docs/internals/specs/tla/SomeSpec_small.cfg \
docs/internals/specs/tla/SomeSpec.tla

# If using downloaded jar:
java -XX:+UseParallelGC -Xmx4g -jar ~/.local/lib/tla2tools.jar \
# Full check (larger state space, ~seconds to minutes):
java -XX:+UseParallelGC -Xmx4g \
-jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
-workers 4 \
-config docs/internals/specs/tla/ExampleProtocol.cfg \
docs/internals/specs/tla/ExampleProtocol.tla
-config docs/internals/specs/tla/SomeSpec.cfg \
docs/internals/specs/tla/SomeSpec.tla
```

### Quick vs Full Verification

Some specs have `_small.cfg` variants for faster iteration:
### Run All Specs

```bash
# Quick (~1s, hundreds of states)
tlc -config docs/internals/specs/tla/ExampleProtocol_small.cfg docs/internals/specs/tla/ExampleProtocol.tla

# Full (~minutes, millions of states)
tlc -workers 4 -config docs/internals/specs/tla/ExampleProtocol.cfg docs/internals/specs/tla/ExampleProtocol.tla
cd /path/to/quickwit-oss/quickwit

# Iterate every .cfg next to a .tla and run TLC on each pair. Some specs
# ship multiple configs (e.g. one for fast iteration, one for thorough
# coverage, one focused on a specific behavior); this loop handles all.
for cfg in docs/internals/specs/tla/*.cfg; do
base="$(basename "${cfg%.cfg}")"
# Try <base>.tla as-is; if absent, strip a trailing _suffix
# (e.g. MergePipelineShutdown_chains.cfg -> MergePipelineShutdown.tla).
if [ -f "docs/internals/specs/tla/${base}.tla" ]; then
tla="docs/internals/specs/tla/${base}.tla"
else
tla="docs/internals/specs/tla/${base%_*}.tla"
fi
echo "=== $(basename "$tla") with $(basename "$cfg") ==="
java -XX:+UseParallelGC -Xmx4g \
-jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
-workers 4 -config "$cfg" "$tla"
echo
done
```

## Available Specifications
### Multiple Configs per Spec

| Spec | Config | Purpose |
|------|--------|---------|
A spec may have several `.cfg` files exploring different bounds:
- `<Spec>.cfg` — primary configuration, balanced for routine verification
- `<Spec>_<focus>.cfg` — alternate configurations targeting specific
behaviors (e.g. `_chains` for deep merge chains, `_small` for fastest
iteration). Each `.cfg` runs against the *same* `.tla` source.

*No specifications yet. Specs will be created as protocols are designed.*
## Available Specifications

| Spec | Small States | Full States | Invariants |
|------|-------------|-------------|------------|
| ParquetDataModel | 8 | millions | DM-1..DM-5 |
| SortSchema | 49,490 | millions | SS-1..SS-5 |
| TimeWindowedCompaction | 938 | thousands | TW-1..3, CS-1..3, MC-1..4 |
| MergePipelineShutdown (primary, multi-lifetime) | — | 15,732 (~1s) | RowsConserved, NoSplitLoss, NoDuplicateMerge, NoOrphanInPlanner, NoOrphanWhenConnected, LeakIsObjectStoreOnly, MP1\_LevelHomogeneity, BoundedWriteAmp, FinalizeWithinBound, ShutdownOnlyWhenDrained + RestartReSeedsAllImmature (action) + ShutdownEventuallyCompletes, NoPersistentOrphan (liveness) |
| MergePipelineShutdown\_chains (deep merge chain) | — | 217,854 (~12s) | same invariant set; MaxIngests=4, MaxRestarts=1 |

## Creating New Specs

1. Create `NewProtocol.tla` with the specification
2. Create `NewProtocol.cfg` with constants and properties to check
3. Add entry to `README.md` mapping table
4. Run model checker to verify
2. Create `NewProtocol_small.cfg` (minimal constants for fast iteration)
3. Create `NewProtocol.cfg` (larger constants for thorough checking)
4. Update the table above
5. Run both configs through TLC to verify

### Config File Template

Expand All @@ -69,6 +99,8 @@ CONSTANTS
Nodes = {n1, n2}
MaxItems = 3

CHECK_DEADLOCK FALSE

SPECIFICATION Spec

INVARIANTS
Expand All @@ -81,7 +113,7 @@ PROPERTIES

## Cleanup

TLC generates trace files and state directories on errors. Clean them up with:
TLC generates trace files and state directories on errors:

```bash
rm -rf docs/internals/specs/tla/*_TTrace_*.tla docs/internals/specs/tla/*.bin docs/internals/specs/tla/states
Expand Down
53 changes: 53 additions & 0 deletions docs/internals/specs/tla/MergePipelineShutdown.cfg
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
\* TLC Configuration for MergePipelineShutdown.tla — multi-lifetime focus.
\*
\* Optimized for exercising crash/restart and cross-process recovery: fewer
\* ingests but more restarts. ~16K distinct states; finishes in ~1 second on
\* a workstation.
\*
\* Coverage emphasis:
\* - Three full process lifetimes (initial + 2 restarts)
\* - Crash mid-flight (uploaded outputs become orphan_outputs)
\* - Disconnect → Publish (output published but planner not informed)
\* - Restart re-seeds cold_windows from durable published_splits
\* - Graceful shutdown then Restart (fresh process post-shutdown)
\* - NoPersistentOrphan liveness across multiple lifetimes
\*
\* For deeper merge-chain exercise (level 0→1→2 mature, BoundedWriteAmp on
\* multi-step compaction), see MergePipelineShutdown_chains.cfg.
\*
\* Tuning notes:
\* - MaxIngests dominates state space (~5x per increment).
\* - MaxIngests=4 with MaxRestarts=2 inflates to ~3M states; liveness
\* check then takes several minutes on the temporal-tableau pass.

CONSTANTS
MaxMerges = 2
MaxFinalizeOps = 1
Windows = {w1, w2}
MergeFactor = 2
MaxIngests = 3
MaxMergeOps = 2
MaxCrashes = 1
MaxRestarts = 2

CHECK_DEADLOCK FALSE

SPECIFICATION Spec

INVARIANTS
TypeInvariant
RowsConserved
NoSplitLoss
NoDuplicateMerge
NoOrphanInPlanner
NoOrphanWhenConnected
LeakIsObjectStoreOnly
MP1_LevelHomogeneity
BoundedWriteAmp
FinalizeWithinBound
ShutdownOnlyWhenDrained

PROPERTIES
RestartReSeedsAllImmature
ShutdownEventuallyCompletes
NoPersistentOrphan
Loading
Loading