
IGNORE -- crash from spr #2503

Closed
jucor wants to merge 14 commits into edge from spr/edge/80eaa87c

Conversation

@jucor
Collaborator

@jucor jucor commented Mar 30, 2026

Fix D7: match Clojure repness metric formula (product of 4 signed values)

## Summary

Changes the representativeness metric from a weighted sum of absolutes to the
Clojure product formula (repness.clj:188-190).

**Before (Python):**

- agree_metric = `pa * (|pat| + |rat|)` — weighted sum, tolerant of weak factors
- disagree_metric = `(1 - pd) * (|pdt| + |rdt|)` — doubly wrong: uses `(1-pd)` and a sum

**After (Clojure formula):**

- agree_metric = `ra * rat * pa * pat` — product of 4 signed values
- disagree_metric = `rd * rdt * pd * pdt` — product of 4 signed values

The product formula is more conservative: any factor near zero kills the entire
metric, requiring ALL dimensions (probability, significance, relative
representativeness) to be strong simultaneously.

The old disagree formula was doubly wrong:

1. Used `(1 - pd)` instead of `pd` — high metric when disagree probability is LOW
2. Used a weighted sum of absolutes instead of a signed product

No feature flag for the old formula — it has no defensible behavior.
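In code terms, the change amounts to swapping a sum of absolutes for a signed product. A minimal sketch (variable names follow the PR's `pa`/`pat`/`ra`/`rat` convention; this is illustrative, not the actual `repness.py` code):

```python
def agree_metric(ra: float, rat: float, pa: float, pat: float) -> float:
    """New formula: product of 4 signed values (Clojure repness.clj:188-190)."""
    return ra * rat * pa * pat

def disagree_metric(rd: float, rdt: float, pd: float, pdt: float) -> float:
    """Same product form for the disagree side."""
    return rd * rdt * pd * pdt

def old_agree_metric(pa: float, pat: float, rat: float) -> float:
    """Old Python formula: weighted sum of absolutes, tolerant of weak factors."""
    return pa * (abs(pat) + abs(rat))
```

Note how a single near-zero factor kills the product, whereas the old sum-based metric can stay large even when one factor is weak.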

## Test plan

- [x] 5 new D7 formula tests (agree product, disagree product, zero-kills-metric,
      sign preservation, multiple known values)
- [x] Updated unit tests in test_repness_unit.py and test_old_format_repness.py
- [x] Full test suite passes (excluding DynamoDB/MinIO tests)
- [x] Private dataset tests pass (--include-local)
- [x] Golden snapshots re-recorded for all 7 datasets

🤖 Generated with Claude Code

commit-id:80eaa87c


Stack:


⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

@jucor jucor force-pushed the spr/edge/80eaa87c branch from 06d3d43 to a2e7457 Compare March 30, 2026 21:04
@jucor jucor force-pushed the spr/edge/23c03d70 branch from 3192562 to 8b94c86 Compare March 30, 2026 21:32
@jucor jucor force-pushed the spr/edge/80eaa87c branch from a2e7457 to 1f209a6 Compare March 30, 2026 21:32
@jucor jucor changed the title Fix D7: match Clojure repness metric formula (product of 4 signed values) Fix D7: match Clojure repness metric formula (product of 4 signed val… Mar 30, 2026
@jucor jucor changed the base branch from spr/edge/23c03d70 to edge March 30, 2026 21:49
jucor added 10 commits March 30, 2026 22:51
When spr force-pushes all stack branches at once, each branch triggers
a new E2E run while the previous one is still queued. The concurrency
group (keyed on branch ref) cancels the stale run, cutting the number
of concurrent E2E jobs roughly in half.
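The concurrency group described above is a standard GitHub Actions pattern; a sketch of what such a workflow block could look like (the actual group name in the E2E workflow may differ):

```yaml
concurrency:
  group: e2e-${{ github.ref }}   # keyed on branch ref
  cancel-in-progress: true       # stale queued runs get cancelled on force-push
```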

commit-id:1a8eb714
## Summary


Add `SKIP_GOLDEN=1` environment variable to disable golden snapshot regression tests.

During stacked PR development, golden snapshots become stale as computation changes cascade through the stack. Rather than re-recording snapshots at every rebase (which causes conflict cascades in jj/git), we skip them until the stack is merged into `edge`.

### Changes

- **`test_regression.py`**: Add `@_skip_golden` decorator to `test_conversation_regression` and `test_conversation_stages_individually` — the only two tests that compare against golden snapshots. Other dataset-using tests (Clojure comparison, smoke tests) are unaffected.
- **`python-ci.yml`**: Set `SKIP_GOLDEN=1` in CI so the stacked PRs don't fail on stale snapshots.

### Usage

```bash
SKIP_GOLDEN=1 pytest tests/          # skip golden snapshot tests
pytest tests/                         # run everything (default)
```

## Test plan

- [x] `SKIP_GOLDEN=1 pytest tests/test_regression.py -v`: 4 skipped, 5 passed
- [x] `pytest tests/test_regression.py -v`: all 9 collected (golden tests run normally)

commit-id:d39cf65d
## Summary


Replace `pip install` with `uv pip install` in the delphi Dockerfile for faster dependency installation. `uv pip` is a drop-in replacement — same `requirements.lock`, same `pyproject.toml`, same installed packages in `site-packages`.

### Changes

- **`delphi/Dockerfile`**:
  - Copy `uv` binary from `ghcr.io/astral-sh/uv:0.11.2` (pinned version, single static binary)
  - Place in `/opt/uv/` in builder to avoid leaking into production image via `COPY --from=builder /usr/local/bin`
  - Set `UV_SYSTEM_PYTHON=1` (install into system Python, not a venv)
  - Replace all `pip install` with `uv pip install` in builder and test stages
  - Update BuildKit cache mount targets from `/root/.cache/pip` to `/root/.cache/uv`
  - Test stage copies `uv` from builder (single source of truth)

### What's NOT changed

- **Makefile** — untouched, `make rebuild-delphi` works as before
- **docker-compose.yml / docker-compose.test.yml** — untouched
- **pyproject.toml / requirements.lock** — untouched, same format
- **pip-compile workflow** — untouched, still used for lock file generation
- **Final/production image** — no `uv` added, stays lean

### CI Benchmark (GitHub Actions, `ubuntu-latest`, `--no-cache`)

5 pip runs (Mar 27 stack push) vs 3 uv pip runs (this PR):

| Step | pip (n=5) | uv pip (n=3) | Speedup |
|------|-----------|--------------|---------|
| **Docker build** | **264s** (sd=6) | **169s** (sd=2) | **1.56x (-94s)** |
| Pytest run | 223s (sd=4) | 227s (sd=4) | ~same |

**~94 seconds saved per CI run** on the Docker build step. Pytest runtime is unchanged (same packages, same tests). Low variance in both groups confirms this is a real improvement, not noise.

### Local Benchmark (Apple M1 Max, 64GB, Docker Desktop, `--no-cache`)

| Step | pip | uv pip | Speedup |
|------|-----|--------|---------|
| **Dependencies install** | **149.3s** | **80.4s** | **1.9x** |
| Dev deps install | 10.0s | 2.2s | **4.5x** |

## Test plan

- [x] `docker compose -f docker-compose.test.yml build --no-cache delphi` succeeds locally
- [x] Built image has all expected packages (`pip show` diagnostic passes in build log)
- [x] CI passes (3 successful runs)
- [x] `make rebuild-delphi` works (Makefile untouched)

commit-id:0c448343
## Summary


Documentation-only PR: deep analysis of Python vs Clojure discrepancies and a TDD fix plan.

### Changes

- Deep analysis documents (`deep-analysis-for-julien/`) comparing Python and Clojure implementations statement-by-statement
- Consolidate CLAUDE.md documentation for the delphi project
- Discrepancy fix plan (`docs/PLAN_DISCREPANCY_FIXES.md`) with prioritized list of fixes

## Test plan

- [x] Documentation only — no code changes
🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:d2f65026
## Summary


Per-discrepancy test infrastructure for TDD fixing of Python-Clojure differences.

### Changes

- Add per-discrepancy test markers and parametrized test infrastructure
- Cold-start recorder: coordinate parallel runs with marker file, auto-pause math workers
- Update journal with xpassed test breakdown across all datasets
- Address Copilot review: remove unused import, fix script issues
- Add naming convention documentation

## Test plan

- [x] 223 passed, 4 skipped, 22 xfailed, 7 xpassed, 0 failures
🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:bdc830db
## Summary


Fixes the in-conv participant threshold (D2), vote count source (D2c), and base-cluster sort order (D2b) to match Clojure. Adds monotonicity guard tests (D2d).

### D2: In-conv threshold

- **Before**: `threshold = 7 + sqrt(n_cmts) * 0.1` — increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)
- **After**: `threshold = min(7, n_cmts)` — matches Clojure exactly
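The two thresholds side by side (a sketch; `n_cmts` is the conversation's comment count):

```python
import math

def old_threshold(n_cmts: int) -> float:
    """Old Python rule: grows with conversation size."""
    return 7 + math.sqrt(n_cmts) * 0.1

def new_threshold(n_cmts: int) -> int:
    """Clojure rule: in-conv after min(7, n_cmts) votes, regardless of size."""
    return min(7, n_cmts)
```

For biodiversity's 314 comments this is the difference between requiring ~8.8 votes and requiring 7.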

### D2b: Base-cluster sort order (from Copilot review)

- **Before**: Base clusters sorted by size (descending) with IDs reassigned — changes encounter order of centers fed into group-level k-means
- **After**: Keep k-means ID order, matching Clojure's `(sort-by :id ...)`

### D2c: Vote count source (raw vs filtered matrix)

- **Before**: `_compute_user_vote_counts` and `n_cmts` used `self.rating_mat` (filtered — moderated-out comment columns removed). A participant who voted on 8 comments could drop to 5 visible votes after 3 comments were moderated-out, falling below threshold.
- **After**: Both use `self.raw_rating_mat` (includes all votes, even on moderated-out comments), matching Clojure's `user-vote-counts` (conversation.clj:217-225) which reads from `raw-rating-mat`.

### D2d: In-conv monotonicity (design decision)

Python does full recompute from `raw_rating_mat` every time, so monotonicity ("once in, always in") is guaranteed without persistence — votes are immutable in PostgreSQL, so a participant's count never decreases. This is **strictly better** than Clojure's approach (which persists in-conv to `math_main` because it uses delta vote processing).

5 guard tests (T1-T5) document this invariant and warn that switching to delta processing would require persisting in-conv to DynamoDB (ref: #2358).

### Impact

- biodiversity: 428 → 441 in-conv participants (now matches Clojure)
- Verified on 4 datasets with complete Clojure cold-start blobs

### Incremental vs cold-start blob testing

D2 tests run against both **cold-start** and **incremental** Clojure blobs (infrastructure from #2420):

- **Cold-start blobs** are computed in one pass on the full dataset. The in-conv threshold `min(7, n_cmts)` is evaluated once with the final `n_cmts`. Python matches these exactly.
- **Incremental blobs** were built progressively as votes trickled in over the conversation's lifetime. The threshold was evaluated at each iteration with a smaller `n_cmts`, admitting a few extra participants during earlier iterations. The difference is tiny (1–2 participants).

D2 tests on incremental blobs are currently **xfailed** with an explanatory comment. Matching incremental behaviour exactly would require simulating the progressive threshold — tracked as future work under Replay Infrastructure.

### Test results

```
253 passed, 5 skipped, 36 xfailed (0 failures)
```

## Test plan

- [x] D2 tests pass on all datasets with complete Clojure cold-start blobs
- [x] D2c: 3 synthetic tests verify vote counts include moderated-out votes, n_cmts includes moderated-out comments, participants stay in-conv after moderation
- [x] D2d: 5 monotonicity tests (basic across updates, survives moderation, worker restart + moderation, restart without new votes, mixed participants)
- [x] D2 tests xfail on incremental blobs (with explanatory comments)
- [x] Full test suite: 253 passed, 0 failures
- [x] Golden snapshots re-recorded for affected datasets

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:c0a682ec
## Summary


- Change `PSEUDO_COUNT` from 1.5 to 2.0, matching Clojure's Beta(2,2) prior
- This changes probability smoothing from `pa = (na + 0.75)/(ns + 1.5)` to `pa = (na + 1)/(ns + 2)`
- All `pa`/`pd` values now match Clojure's `p-success` exactly (verified on all datasets with Clojure blobs)
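The smoothing change in one expression (a sketch mirroring the formulas above; the function name follows Clojure's `p-success`, not necessarily the Python code):

```python
PSEUDO_COUNT = 2.0  # Beta(2,2) prior, matching Clojure

def p_success(n_agree: int, n_seen: int) -> float:
    """Smoothed agree probability: (na + 1) / (ns + 2) when PSEUDO_COUNT = 2."""
    return (n_agree + PSEUDO_COUNT / 2) / (n_seen + PSEUDO_COUNT)
```

With no votes at all the estimate sits at 0.5, the mean of the Beta(2,2) prior.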

## Changes

- `repness.py`: `PSEUDO_COUNT = 2.0` with updated comment
- `test_discrepancy_fixes.py`: remove xfail from 3 D4 tests (constant check, pa values per dataset, synthetic)
- `test_repness_unit.py`, `test_old_format_repness.py`: import `PSEUDO_COUNT` instead of hardcoding 1.5
- `simplified_repness_test.py`: update hardcoded constant
- Golden snapshots re-recorded for public datasets (vw, biodiversity)

## Test plan

- [x] TDD red: 6 D4 tests fail before fix
- [x] TDD green: all 6 D4 tests pass after fix
- [x] Full public suite: 258 passed, 0 failures
- [x] Private datasets (--include-local): 60 passed, 0 failures (discrepancy tests)
- [x] Regression tests pass on public + FLI + bg2018

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:6ae3ee43
## Summary


- Default `benchmark=False` in `compare_with_golden()` — benchmark mode ran the pipeline 3x for timing statistics, unnecessary for correctness checks. The `regression_comparer.py` script already had `--benchmark` as opt-in, so this aligns the default.
- Add `skip_intermediate_stages` parameter to `compute_all_stages()` — `test_conversation_regression` now skips stages 1-4 (empty, load-only, PCA-only, PCA+clustering) since it only checks `overall_match`. `test_conversation_stages_individually` still runs all stages for granular failure detection.

### Measured speedup on one of the large private test conversations

| Test | Before | After | Speedup |
|------|--------|-------|---------|
| `test_conversation_regression` | 317s | 23s | **13.9x** |
| `test_conversation_stages_individually` | 60s | 32s | **1.9x** |

The regression test's ~14x speedup comes from two combined effects: no longer running the pipeline 3x (benchmark), and skipping 4 redundant intermediate stages.

## Test plan

- [x] All 9 public regression tests pass (vw + biodiversity)
- [x] Private dataset tests pass (`--include-local`)
- [x] Timing verified on large private dataset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:f39f3218
## Summary


- Replaces the O(N×G×C) per-participant Python loop in `_compute_participant_info_optimized` with bulk NumPy operations: matrix-wide vote counting (`np.sum` over axis) and per-group Pearson correlation via `P @ g` matrix multiply
- Adds 31 unit tests covering vote counts, group correlations, edge cases (small groups, zero-std, NaN handling, missing members), and golden snapshot regression
- Correlations now return Python `float` instead of `numpy.float64`
- Includes a benchmark script (`scripts/benchmark_participant_info.py`) that runs old vs new on the same data
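The general pattern behind the vectorization (a sketch of the technique, not the actual `_compute_participant_info_optimized` code): correlating every participant row of a matrix against a group vector reduces to centered dot products, so one matmul replaces the per-participant loop.

```python
import numpy as np

def rowwise_pearson(P: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Pearson correlation of each row of P with vector g, via one matmul."""
    Pc = P - P.mean(axis=1, keepdims=True)   # center each participant row
    gc = g - g.mean()                        # center the group vector
    num = Pc @ gc                            # all numerators at once
    denom = np.sqrt((Pc ** 2).sum(axis=1) * (gc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        r = num / denom
    # Zero-variance rows have undefined correlation; mapping them to 0.0
    # is a policy choice for this sketch.
    return np.where(denom == 0, 0.0, r)
```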

## Benchmark results

Measured on real datasets (5 runs, median), old per-participant loop vs new vectorized:

| Dataset | Size | Old | New | Speedup |
|---------|------|-----|-----|---------|
| vw | 69p × 125c × 4g | 0.011s | 0.001s | **14.6x** |
| biodiversity | 536p × 314c × 2g | 0.047s | 0.006s | **8.1x** |
| _(larger private datasets)_ | | | | **3–6x** |

Speedup is higher on smaller datasets (loop overhead dominates) and lower on very large ones (matrix materialization dominates). Overall **3–15x** depending on size.

## Test plan

- [x] 31 unit tests pass (pre-vectorization baseline established first, then re-run post)
- [x] Golden snapshot regression passes for biodiversity + vw
- [x] Full regression test suite passes (40/40)
- [x] Benchmark run on all datasets including private (results above)
- [x] Max correlation diff across all datasets: < 2e-15


🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:ea747196
## Summary


- **Absorbed PR #2434** (DB connect_timeout): adds `connect_timeout=5` to all DB connections so tests fail fast instead of hanging when Postgres is unavailable. Reverts timeout from production code (a 5s timeout could break real deployments), keeping it only in test code.
- Adds `require_dynamodb()` and `require_s3()` helpers to conftest that probe services with short timeouts and `pytest.fail()` immediately — applied to all tests that previously hung when Docker services were down.
- Adds `pytest-timestamper` for per-test timing visibility.
- Fixes `test_postgres_real_data.py` to use `DATABASE_URL` (consistent with all other delphi Python code) via `python-dotenv`.
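Since libpq accepts `connect_timeout` as a connection-URL parameter, the test-side wiring can be as small as appending it to `DATABASE_URL`. A hypothetical helper (`make_test_dsn` is not from the PR, just an illustration):

```python
import os

def make_test_dsn(base_url: str, timeout_s: int = 5) -> str:
    """Append libpq's connect_timeout so tests fail fast instead of hanging."""
    sep = "&" if "?" in base_url else "?"
    return f"{base_url}{sep}connect_timeout={timeout_s}"

dsn = make_test_dsn(os.environ.get("DATABASE_URL", "postgresql://localhost/delphi"))
```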

Refs #2442

## Test plan

- [x] `test_conversation_from_postgres` passes
- [x] `test_pakistan_conversation_batch` passes (9 min, 400K votes)
- [x] Full delphi test suite: 294 passed, no regressions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:b9062b50
jucor added 4 commits March 30, 2026 22:51
## Summary


- Fix D9: change z-score significance thresholds from two-tailed to one-tailed, matching Clojure's `stats.clj`
- `Z_90`: 1.645 → 1.2816, `Z_95`: 1.96 → 1.6449
- Also resolves an internal inconsistency — Python's own `stats.py` already used the correct one-tailed values

## Why one-tailed?

The proportion tests in Polis check whether a comment's agree (or disagree) rate is **significantly above 0.5** — a directional hypothesis. One-tailed is correct because we only care about one direction at a time. The two-tailed values were 28% more conservative, causing fewer comments to pass significance.
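Both sets of thresholds fall out of the normal quantile function; with the standard library:

```python
from statistics import NormalDist

z = NormalDist()
one_tailed_90 = z.inv_cdf(0.90)   # ~1.2816 — new Z_90
one_tailed_95 = z.inv_cdf(0.95)   # ~1.6449 — new Z_95
two_tailed_90 = z.inv_cdf(0.95)   # ~1.6449 — old Z_90 (alpha split over both tails)
two_tailed_95 = z.inv_cdf(0.975)  # ~1.9600 — old Z_95
```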

## Test plan

- [x] TDD: removed xfail from 3 D9 tests, confirmed red (3 failures), applied fix, confirmed green
- [x] Discrepancy tests: 63 passed, 6 skipped, 50 xfailed (all 7 datasets including private)
- [x] Regression tests: 19 passed (all 7 datasets, golden snapshots re-recorded)
- [x] Repness unit tests: 36 passed (boundary values updated to match new thresholds)
- [x] 4 pre-existing failures unrelated to D9 (PCA incremental blobs, DB-dependent tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:0194003d
…eudocount)

## Summary


Replace Python's standard one-proportion z-test `prop_test(p, n, p0)` with
Clojure's Wilson-score-like formula `prop_test(succ, n)` from `stats.clj:10-15`:

```
2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)
```

The Clojure formula has a built-in +1 pseudocount (Laplace smoothing / Beta(1,1)
prior) that regularizes extreme values for small Polis groups. This is separate
from the `PSEUDO_COUNT=2.0` used for `pa`/`pd` estimation (Beta(2,2) prior):

- `pa = (na + 1) / (ns + 2)` — Beta(2,2) prior for probability estimation
- `pat = 2 * sqrt(ns+1) * ((na+1)/(ns+1) - 0.5)` — Beta(1,1) prior for significance testing

**What changed in the output**: `pat`, `pdt` values (proportion test z-scores),
and downstream `agree_metric` / `disagree_metric` values. The z-scores are
now slightly different due to `sqrt(n+1)` vs `sqrt(n)` and `(succ+1)/(n+1)` vs
`(na+1)/(n+2)` denominators.
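Transcribed to Python, the Clojure formula quoted above is a one-liner (a direct sketch of `stats.clj:10-15`, not the actual `repness.py` code):

```python
import math

def prop_test(succ: int, n: int) -> float:
    """One-proportion z-score with a built-in +1 pseudocount (Beta(1,1) prior)."""
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)
```

The pseudocount keeps the score finite and mild at tiny sample sizes: with zero observations the formula returns 1.0 rather than dividing by zero.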

## Changes
- `repness.py`: `prop_test(p, n, p0)` → `prop_test(succ, n)` with Clojure formula
- `repness.py`: `prop_test_vectorized(p, n, p0)` → `prop_test_vectorized(succ, n)`
- `repness.py`: Callers updated to pass raw counts `(na, ns)` instead of `(pa, ns, 0.5)`
- `test_discrepancy_fixes.py`: Removed xfail from D5 formula test, added 8 test cases + edge case
- `test_repness_unit.py`, `test_old_format_repness.py`: Updated for new signature
- Golden snapshots re-recorded for all datasets

## Test plan
- [x] D5 formula tests pass (8 input pairs + edge cases)
- [x] D5 Clojure blob consistency check passes (all datasets)
- [x] Full test suite passes (public + private, 19/19 regression tests)
- [x] Only pre-existing failure: pakistan-incremental D2 (unrelated)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:48b77ba3
## Summary


The Python `two_prop_test` used a standard two-proportion z-test with no pseudocounts,
while Clojure's `stats/two-prop-test` (stats.clj:18-33) adds +1 to all four inputs
(`succ-in`, `succ-out`, `pop-in`, `pop-out`) via `(map inc ...)` before computing
the pooled z-test. This Laplace smoothing regularizes z-scores for small group sizes,
which are common in Polis conversations.

## Changes
- **Signature change**: `two_prop_test(p1, n1, p2, n2)` (proportions) →
  `two_prop_test(succ_in, succ_out, pop_in, pop_out)` (raw counts)
- **Formula**: Standard pooled z-test on pseudocount-adjusted values:
  `pi1 = (succ_in+1)/(pop_in+1)`, `pi_hat = (s1+s2)/(p1+p2)`
- **Callers updated**: Both scalar (`add_comparative_stats`) and vectorized
  (`compute_group_comment_stats_df`) now pass raw counts matching Clojure's
  `(stats/two-prop-test (:na in-stats) (sum :na rest-stats) (:ns in-stats) (sum :ns rest-stats))`
  (repness.clj:97-100)
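Putting the pieces above together, the pseudocount-adjusted pooled z-test looks roughly like this (a sketch of Clojure's `stats/two-prop-test` per the description above, not the actual Python code):

```python
import math

def two_prop_test(succ_in: int, succ_out: int, pop_in: int, pop_out: int) -> float:
    """Pooled two-proportion z-test with +1 Laplace smoothing on all four counts."""
    si, so = succ_in + 1, succ_out + 1     # (map inc ...) in the Clojure source
    ni, no = pop_in + 1, pop_out + 1
    pi1 = si / ni
    pi2 = so / no
    pi_hat = (si + so) / (ni + no)          # pooled proportion
    se = math.sqrt(pi_hat * (1 - pi_hat) * (1 / ni + 1 / no))
    return (pi1 - pi2) / se
```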

## Affected output fields
- `rat` (agree representativeness test z-score)
- `rdt` (disagree representativeness test z-score)
- `agree_metric`, `disagree_metric` (downstream of rat/rdt)

## Test plan
- [x] Targeted D6 tests pass (formula, edge cases, regularization effect)
- [x] Full test suite passes (excluding DynamoDB/MinIO tests)
- [x] Private dataset tests pass (--include-local)
- [x] Golden snapshots re-recorded for all 7 datasets

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:23c03d70
Fix D7: match Clojure repness metric formula (product of 4 signed values)

## Summary


Changes the representativeness metric from a weighted sum of absolutes to the
Clojure product formula (repness.clj:188-190).

**Before (Python):**
- agree_metric = `pa * (|pat| + |rat|)` — weighted sum, tolerant of weak factors
- disagree_metric = `(1 - pd) * (|pdt| + |rdt|)` — doubly wrong: uses `(1-pd)` and sum

**After (Clojure formula):**
- agree_metric = `ra * rat * pa * pat` — product of 4 signed values
- disagree_metric = `rd * rdt * pd * pdt` — product of 4 signed values

The product formula is more conservative: any factor near zero kills the entire
metric, requiring ALL dimensions (probability, significance, relative
representativeness) to be strong simultaneously.

The old disagree formula was doubly wrong:
1. Used `(1 - pd)` instead of `pd` — high metric when disagree probability is LOW
2. Used a weighted sum of absolutes instead of a signed product

No feature flag for the old formula — it has no defensible behavior.

## Test plan
- [x] 5 new D7 formula tests (agree product, disagree product, zero-kills-metric,
      sign preservation, multiple known values)
- [x] Updated unit tests in test_repness_unit.py and test_old_format_repness.py
- [x] Full test suite passes (excluding DynamoDB/MinIO tests)
- [x] Private dataset tests pass (--include-local)
- [x] Golden snapshots re-recorded for all 7 datasets

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:80eaa87c
@jucor jucor force-pushed the spr/edge/80eaa87c branch from 1f209a6 to a19fe4a Compare March 30, 2026 21:56
@jucor
Collaborator Author

jucor commented Mar 30, 2026

PR created by a crashed jj spr update, ignore.

@jucor jucor closed this Mar 30, 2026
@jucor jucor changed the title Fix D7: match Clojure repness metric formula (product of 4 signed val… IGNORE -- crash from spr Mar 30, 2026