IGNORE -- crash from spr by jucor · Pull Request #2496 · compdemocracy/polis

jucor · 2026-03-30T20:57:29Z

Summary

Change PSEUDO_COUNT from 1.5 to 2.0, matching Clojure's Beta(2,2) prior
This changes probability smoothing from pa = (na + 0.75)/(ns + 1.5) to pa = (na + 1)/(ns + 2)
All pa/pd values now match Clojure's p-success exactly (verified on all datasets with Clojure blobs)
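
The effect of the constant change can be sketched as follows. This is a minimal illustration, not the code in `repness.py`: the function name `smoothed_probability` and the general form `(na + PSEUDO_COUNT / 2) / (ns + PSEUDO_COUNT)` are assumptions inferred from the two formulas quoted above.

```python
# Hypothetical helper for illustration; the real logic in repness.py may differ.
PSEUDO_COUNT = 2.0  # value after this PR (it was 1.5 before)


def smoothed_probability(n_agree, n_seen, pseudo_count=PSEUDO_COUNT):
    """Return (n_agree + pseudo_count / 2) / (n_seen + pseudo_count).

    With pseudo_count=1.5 this is (na + 0.75) / (ns + 1.5);
    with pseudo_count=2.0 it is (na + 1) / (ns + 2).
    """
    return (n_agree + pseudo_count / 2) / (n_seen + pseudo_count)


if __name__ == "__main__":
    # Compare the old and new smoothing on a few (na, ns) pairs.
    for na, ns in [(0, 0), (3, 10), (10, 10)]:
        old = smoothed_probability(na, ns, pseudo_count=1.5)
        new = smoothed_probability(na, ns, pseudo_count=2.0)
        print(f"na={na} ns={ns} old={old:.4f} new={new:.4f}")
```

Both settings give 0.5 when nobody has seen the comment; they diverge as soon as votes come in. For reference, `(na + 1)/(ns + 2)` is also the posterior mode of a binomial proportion under a Beta(2,2) prior, which is one way to read the "Beta(2,2) prior" wording.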

Changes

repness.py: PSEUDO_COUNT = 2.0 with updated comment
test_discrepancy_fixes.py: remove xfail from 3 D4 tests (constant check, pa values per dataset, synthetic)
test_repness_unit.py, test_old_format_repness.py: import PSEUDO_COUNT instead of hardcoding 1.5
simplified_repness_test.py: update hardcoded constant
Golden snapshots re-recorded for public datasets (vw, biodiversity)

Test plan

TDD red: 6 D4 tests fail before fix
TDD green: all 6 D4 tests pass after fix
Full public suite: 258 passed, 0 failures
Private datasets (--include-local): 60 passed, 0 failures (discrepancy tests)
Regression tests pass on public + FLI + bg2018

🤖 Generated with Claude Code

commit-id:6ae3ee43

Stack:

⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

When spr force-pushes all stack branches at once, each branch triggers a new E2E run while the previous one is still queued. The concurrency group (keyed on branch ref) cancels the stale run, cutting the number of concurrent E2E jobs roughly in half.

commit-id:1a8eb714

## Summary

Add `SKIP_GOLDEN=1` environment variable to disable golden snapshot regression tests.

During stacked PR development, golden snapshots become stale as computation changes cascade through the stack. Rather than re-recording snapshots at every rebase (which causes conflict cascades in jj/git), we skip them until the stack is merged into `edge`.

### Changes

- **`test_regression.py`**: Add `@_skip_golden` decorator to `test_conversation_regression` and `test_conversation_stages_individually`, the only two tests that compare against golden snapshots. Other dataset-using tests (Clojure comparison, smoke tests) are unaffected.
- **`python-ci.yml`**: Set `SKIP_GOLDEN=1` in CI so the stacked PRs don't fail on stale snapshots.

### Usage

```bash
SKIP_GOLDEN=1 pytest tests/   # skip golden snapshot tests
pytest tests/                 # run everything (default)
```

## Test plan

- [x] `SKIP_GOLDEN=1 pytest tests/test_regression.py -v`: 4 skipped, 5 passed
- [x] `pytest tests/test_regression.py -v`: all 9 collected (golden tests run normally)

commit-id:d39cf65d
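
An environment-gated skip decorator of this kind can be sketched as below. This is not the PR's code: it uses the standard library's `unittest.skipIf` so that it runs without extra dependencies (pytest's `pytest.mark.skipif(condition, reason=...)` is the usual equivalent in a pytest suite), and the helper `make_skip_golden`, the exact-match check against `"1"`, and the placeholder test body are all assumptions.

```python
import os
import unittest


def make_skip_golden(environ=os.environ):
    """Build a decorator that skips a test when SKIP_GOLDEN=1 is set.

    Assumption: only the exact value "1" disables the golden tests.
    """
    disabled = environ.get("SKIP_GOLDEN") == "1"
    return unittest.skipIf(disabled, "SKIP_GOLDEN=1: golden snapshots may be stale")


# Evaluated once at import time, like a module-level pytest marker would be.
_skip_golden = make_skip_golden()


@_skip_golden
def test_conversation_regression():
    # Placeholder body; the real test compares pipeline output to golden snapshots.
    return "compared against golden snapshot"
```

Keeping the environment lookup in one helper means the same decorator can be applied to every golden-snapshot test, while tests that do not touch snapshots stay undecorated and always run.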

## Summary

Replace `pip install` with `uv pip install` in the delphi Dockerfile for faster dependency installation. `uv pip` is a drop-in replacement: same `requirements.lock`, same `pyproject.toml`, same installed packages in `site-packages`.

### Changes

- **`delphi/Dockerfile`**:
  - Copy `uv` binary from `ghcr.io/astral-sh/uv:0.11.2` (pinned version, single static binary)
  - Place in `/opt/uv/` in builder to avoid leaking into production image via `COPY --from=builder /usr/local/bin`
  - Set `UV_SYSTEM_PYTHON=1` (install into system Python, not a venv)
  - Replace all `pip install` with `uv pip install` in builder and test stages
  - Update BuildKit cache mount targets from `/root/.cache/pip` to `/root/.cache/uv`
  - Test stage copies `uv` from builder (single source of truth)

### What's NOT changed

- **Makefile**: untouched, `make rebuild-delphi` works as before
- **docker-compose.yml / docker-compose.test.yml**: untouched
- **pyproject.toml / requirements.lock**: untouched, same format
- **pip-compile workflow**: untouched, still used for lock file generation
- **Final/production image**: no `uv` added, stays lean

### CI Benchmark (GitHub Actions, `ubuntu-latest`, `--no-cache`)

5 pip runs (Mar 27 stack push) vs 3 uv pip runs (this PR):

| Step | pip (n=5) | uv pip (n=3) | Speedup |
|------|-----------|--------------|---------|
| **Docker build** | **264s** (sd=6) | **169s** (sd=2) | **1.56x (-94s)** |
| Pytest run | 223s (sd=4) | 227s (sd=4) | ~same |

**~94 seconds saved per CI run** on the Docker build step. Pytest runtime is unchanged (same packages, same tests). Low variance in both groups confirms this is a real improvement, not noise.

### Local Benchmark (Apple M1 Max, 64GB, Docker Desktop, `--no-cache`)

| Step | pip | uv pip | Speedup |
|------|-----|--------|---------|
| **Dependencies install** | **149.3s** | **80.4s** | **1.9x** |
| Dev deps install | 10.0s | 2.2s | **4.5x** |

## Test plan

- [x] `docker compose -f docker-compose.test.yml build --no-cache delphi` succeeds locally
- [x] Built image has all expected packages (`pip show` diagnostic passes in build log)
- [x] CI passes (3 successful runs)
- [x] `make rebuild-delphi` works (Makefile untouched)

commit-id:0c448343

## Summary

Documentation-only PR: deep analysis of Python vs Clojure discrepancies and a TDD fix plan.

### Changes

- Deep analysis documents (`deep-analysis-for-julien/`) comparing Python and Clojure implementations statement-by-statement
- Consolidate CLAUDE.md documentation for the delphi project
- Discrepancy fix plan (`docs/PLAN_DISCREPANCY_FIXES.md`) with prioritized list of fixes

## Test plan

- [x] Documentation only, no code changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:d2f65026

## Summary

Per-discrepancy test infrastructure for TDD fixing of Python-Clojure differences.

### Changes

- Add per-discrepancy test markers and parametrized test infrastructure
- Cold-start recorder: coordinate parallel runs with marker file, auto-pause math workers
- Update journal with xpassed test breakdown across all datasets
- Address Copilot review: remove unused import, fix script issues
- Add naming convention documentation

## Test plan

- [x] 223 passed, 4 skipped, 22 xfailed, 7 xpassed, 0 failures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:bdc830db

## Summary

Fixes the in-conv participant threshold (D2), vote count source (D2c), and base-cluster sort order (D2b) to match Clojure. Adds monotonicity guard tests (D2d).

### D2: In-conv threshold

- **Before**: `threshold = 7 + sqrt(n_cmts) * 0.1`, increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)
- **After**: `threshold = min(7, n_cmts)`, matching Clojure exactly

### D2b: Base-cluster sort order (from Copilot review)

- **Before**: Base clusters sorted by size (descending) with IDs reassigned, which changes the encounter order of centers fed into group-level k-means
- **After**: Keep k-means ID order, matching Clojure's `(sort-by :id ...)`

### D2c: Vote count source (raw vs filtered matrix)

- **Before**: `_compute_user_vote_counts` and `n_cmts` used `self.rating_mat` (filtered: moderated-out comment columns removed). A participant who voted on 8 comments could drop to 5 visible votes after 3 comments were moderated-out, falling below threshold.
- **After**: Both use `self.raw_rating_mat` (includes all votes, even on moderated-out comments), matching Clojure's `user-vote-counts` (conversation.clj:217-225) which reads from `raw-rating-mat`.

### D2d: In-conv monotonicity (design decision)

Python does full recompute from `raw_rating_mat` every time, so monotonicity ("once in, always in") is guaranteed without persistence: votes are immutable in PostgreSQL, so a participant's count never decreases. This is **strictly better** than Clojure's approach (which persists in-conv to `math_main` because it uses delta vote processing). 5 guard tests (T1-T5) document this invariant and warn that switching to delta processing would require persisting in-conv to DynamoDB (ref: #2358).

### Impact

- biodiversity: 428 → 441 in-conv participants (now matches Clojure)
- Verified on 4 datasets with complete Clojure cold-start blobs

### Incremental vs cold-start blob testing

D2 tests run against both **cold-start** and **incremental** Clojure blobs (infrastructure from #2420):

- **Cold-start blobs** are computed in one pass on the full dataset. The in-conv threshold `min(7, n_cmts)` is evaluated once with the final `n_cmts`. Python matches these exactly.
- **Incremental blobs** were built progressively as votes trickled in over the conversation's lifetime. The threshold was evaluated at each iteration with a smaller `n_cmts`, admitting a few extra participants during earlier iterations. The difference is tiny (1–2 participants).

D2 tests on incremental blobs are currently **xfailed** with an explanatory comment. Matching incremental behaviour exactly would require simulating the progressive threshold; this is tracked as future work under Replay Infrastructure.

### Test results

```
253 passed, 5 skipped, 36 xfailed (0 failures)
```

## Test plan

- [x] D2 tests pass on all datasets with complete Clojure cold-start blobs
- [x] D2c: 3 synthetic tests verify vote counts include moderated-out votes, n_cmts includes moderated-out comments, participants stay in-conv after moderation
- [x] D2d: 5 monotonicity tests (basic across updates, survives moderation, worker restart + moderation, restart without new votes, mixed participants)
- [x] D2 tests xfail on incremental blobs (with explanatory comments)
- [x] Full test suite: 253 passed, 0 failures
- [x] Golden snapshots re-recorded for affected datasets

🤖 Generated with [Claude Code](https://claude.com/claude-code)

commit-id:c0a682ec
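
The D2 and D2c changes described in this commit message can be illustrated with a small sketch. It is not the code in `conversation.py`: the helper `in_conv`, the dict-of-sets vote representation, and the `>=` comparison against the threshold are assumptions made for illustration; only the two threshold formulas come from the text above.

```python
import math


def old_threshold(n_cmts):
    """Pre-fix Python rule quoted above: 7 + sqrt(n_cmts) * 0.1."""
    return 7 + math.sqrt(n_cmts) * 0.1


def new_threshold(n_cmts):
    """Post-fix rule quoted above: min(7, n_cmts)."""
    return min(7, n_cmts)


def in_conv(votes, comment_ids):
    """Return the participants whose vote count reaches the threshold.

    `votes` maps a participant id to the set of comment ids they voted on.
    Only votes on `comment_ids` are counted, and n_cmts is len(comment_ids).
    Assumption: a participant is in-conv when count >= threshold.
    """
    threshold = new_threshold(len(comment_ids))
    return {pid for pid, voted in votes.items()
            if len(voted & comment_ids) >= threshold}


# Toy data: 10 comments, 3 of which are later moderated out.
all_comments = set(range(10))
moderated_out = {0, 1, 2}
votes = {"p1": set(range(8)), "p2": {7, 8, 9}}  # p1 voted on 8 comments, p2 on 3

raw = in_conv(votes, all_comments)                        # counts every vote (D2c "after")
filtered = in_conv(votes, all_comments - moderated_out)   # counts visible votes only (D2c "before")
```

With the raw comment set, `p1` has 8 votes and stays in-conv; with the filtered set, the same participant has only 5 visible votes and drops out, which is the situation the D2c fix avoids.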


github-actions · 2026-03-30T22:10:30Z

Delphi Coverage Report

| File | Stmts | Miss | Cover |
|------|-------|------|-------|
| `__init__.py` | 2 | 0 | 100% |
| `benchmarks/bench_pca.py` | 76 | 76 | 0% |
| `benchmarks/bench_repness.py` | 81 | 81 | 0% |
| `benchmarks/bench_update_votes.py` | 38 | 38 | 0% |
| `benchmarks/benchmark_utils.py` | 34 | 34 | 0% |
| `components/__init__.py` | 1 | 0 | 100% |
| `components/config.py` | 165 | 133 | 19% |
| `conversation/__init__.py` | 2 | 0 | 100% |
| `conversation/conversation.py` | 1117 | 328 | 71% |
| `conversation/manager.py` | 131 | 42 | 68% |
| `database/__init__.py` | 1 | 0 | 100% |
| `database/dynamodb.py` | 387 | 234 | 40% |
| `database/postgres.py` | 305 | 205 | 33% |
| `pca_kmeans_rep/__init__.py` | 5 | 0 | 100% |
| `pca_kmeans_rep/clusters.py` | 257 | 22 | 91% |
| `pca_kmeans_rep/corr.py` | 98 | 17 | 83% |
| `pca_kmeans_rep/pca.py` | 52 | 16 | 69% |
| `pca_kmeans_rep/repness.py` | 361 | 51 | 86% |
| `pca_kmeans_rep/stats.py` | 107 | 22 | 79% |
| `regression/__init__.py` | 4 | 0 | 100% |
| `regression/clojure_comparer.py` | 188 | 17 | 91% |
| `regression/comparer.py` | 887 | 720 | 19% |
| `regression/datasets.py` | 135 | 27 | 80% |
| `regression/recorder.py` | 36 | 27 | 25% |
| `regression/utils.py` | 137 | 118 | 14% |
| `run_math_pipeline.py` | 260 | 114 | 56% |
| `umap_narrative/500_generate_embedding_umap_cluster.py` | 210 | 109 | 48% |
| `umap_narrative/501_calculate_comment_extremity.py` | 112 | 54 | 52% |
| `umap_narrative/502_calculate_priorities.py` | 135 | 135 | 0% |
| `umap_narrative/700_datamapplot_for_layer.py` | 502 | 502 | 0% |
| `umap_narrative/701_static_datamapplot_for_layer.py` | 310 | 310 | 0% |
| `umap_narrative/702_consensus_divisive_datamapplot.py` | 432 | 432 | 0% |
| `umap_narrative/801_narrative_report_batch.py` | 785 | 785 | 0% |
| `umap_narrative/802_process_batch_results.py` | 265 | 265 | 0% |
| `umap_narrative/803_check_batch_status.py` | 175 | 175 | 0% |
| `umap_narrative/llm_factory_constructor/__init__.py` | 2 | 2 | 0% |
| `umap_narrative/llm_factory_constructor/model_provider.py` | 157 | 157 | 0% |
| `umap_narrative/polismath_commentgraph/__init__.py` | 1 | 0 | 100% |
| `umap_narrative/polismath_commentgraph/cli.py` | 270 | 270 | 0% |
| `umap_narrative/polismath_commentgraph/core/__init__.py` | 3 | 3 | 0% |
| `umap_narrative/polismath_commentgraph/core/clustering.py` | 108 | 108 | 0% |
| `umap_narrative/polismath_commentgraph/core/embedding.py` | 104 | 104 | 0% |
| `umap_narrative/polismath_commentgraph/lambda_handler.py` | 219 | 219 | 0% |
| `umap_narrative/polismath_commentgraph/schemas/__init__.py` | 2 | 0 | 100% |
| `umap_narrative/polismath_commentgraph/schemas/dynamo_models.py` | 160 | 9 | 94% |
| `umap_narrative/polismath_commentgraph/tests/conftest.py` | 17 | 17 | 0% |
| `umap_narrative/polismath_commentgraph/tests/test_clustering.py` | 74 | 74 | 0% |
| `umap_narrative/polismath_commentgraph/tests/test_embedding.py` | 55 | 55 | 0% |
| `umap_narrative/polismath_commentgraph/tests/test_storage.py` | 87 | 87 | 0% |
| `umap_narrative/polismath_commentgraph/utils/__init__.py` | 3 | 0 | 100% |
| `umap_narrative/polismath_commentgraph/utils/converter.py` | 283 | 237 | 16% |
| `umap_narrative/polismath_commentgraph/utils/group_data.py` | 354 | 336 | 5% |
| `umap_narrative/polismath_commentgraph/utils/storage.py` | 584 | 477 | 18% |
| `umap_narrative/reset_conversation.py` | 159 | 50 | 69% |
| `umap_narrative/run_pipeline.py` | 453 | 312 | 31% |
| `utils/general.py` | 62 | 41 | 34% |
| **Total** | 10950 | 7647 | 30% |

jucor · 2026-03-30T22:14:46Z

PR created by a crashed jj spr update, ignore.

jucor force-pushed the spr/edge/6ae3ee43 branch from d575054 to bf3e3bb Compare March 30, 2026 21:04

jucor force-pushed the spr/edge/c0a682ec branch from 180e800 to a24be7e Compare March 30, 2026 21:04

jucor force-pushed the spr/edge/6ae3ee43 branch from bf3e3bb to c7bc6b2 Compare March 30, 2026 21:32

jucor force-pushed the spr/edge/c0a682ec branch from a24be7e to a9b49ee Compare March 30, 2026 21:32

jucor changed the base branch from spr/edge/c0a682ec to edge March 30, 2026 21:49

jucor added 7 commits March 30, 2026 22:51

jucor force-pushed the spr/edge/6ae3ee43 branch from c7bc6b2 to b437500 Compare March 30, 2026 21:56

jucor closed this Mar 30, 2026

jucor changed the title ~~Fix D4: pseudocount formula~~ IGNORE -- crash from spr Mar 30, 2026

jucor wants to merge 7 commits into `edge` from `spr/edge/6ae3ee43`
