[Stack 9/27] Per-discrepancy test infrastructure by jucor · Pull Request #2420 · compdemocracy/polis

jucor · 2026-03-05T13:12:03Z

Summary

Stacked on #2419 (Deep analysis of Python-Clojure discrepancies and fix plan). Please review and merge #2419 first.
Next in stack: #2421 (Fix D2: in-conv participant threshold + D2c vote count source)

Per-discrepancy test infrastructure for TDD fixing of Python-Clojure differences.

Changes

Add per-discrepancy test markers and parametrized test infrastructure
Cold-start recorder: coordinate parallel runs with marker file, auto-pause math workers
Update journal with xpassed test breakdown across all datasets
Address Copilot review: remove unused import, fix script issues
Add naming convention documentation

Test plan

223 passed, 4 skipped, 22 xfailed, 7 xpassed, 0 failures
🤖 Generated with Claude Code

Copilot

Pull request overview

Adds per-discrepancy (Python vs Clojure) test scaffolding to support TDD parity work, plus tooling updates to generate/compare Clojure cold-start blobs and document the fix workflow.

Changes:

Introduces tests/test_discrepancy_fixes.py with per-discrepancy, per-dataset parametrized tests (mostly xfail-guarded) and shared dataset/Conversation fixtures.
Updates cold-start blob generator to coordinate parallel runs (pause/unpause workers via marker file) and improve fake conversation replay.
Refreshes comparer/tooling/docs (vote loader swap, remove unused import, add PR naming convention + journal).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.


File	Description
delphi/tests/test_repness_smoke.py	Marks an existing flaky/unknown repness structure validation as `xfail`.
delphi/tests/test_discrepancy_fixes.py	New per-discrepancy test module with dataset discovery + xfail-based parity targets.
delphi/scripts/generate_cold_start_clojure.py	Adds pause-marker coordination and expands replay to copy participants/comments before votes.
delphi/scripts/clojure_comparer.py	Switches vote-loading helper and updates failure guidance message.
delphi/polismath/pca_kmeans_rep/clusters.py	Removes unused sklearn import.
delphi/docs/PLAN_DISCREPANCY_FIXES.md	Adds PR naming convention documentation.
delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md	Adds a running journal for parity work and test status tracking.


delphi/tests/test_discrepancy_fixes.py

delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md

delphi/scripts/generate_cold_start_clojure.py

delphi/tests/test_discrepancy_fixes.py

Copilot

Pull request overview

Adds per-discrepancy (Python vs Clojure) test infrastructure and supporting dataset/blob tooling to enable TDD-style parity fixes across multiple reference blob variants (incremental vs cold-start), plus updates to cold-start blob generation and legacy comparison tests.

Changes:

Introduces test_discrepancy_fixes.py with discrepancy-scoped, dataset+blob parametrized tests (mostly xfailed to document known gaps).
Extends regression dataset utilities to support explicit blob variants (incremental / cold_start) and “filled blob” discovery.
Updates cold-start Clojure blob generation script to coordinate parallel runs and reduce worker conflicts.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.


File	Description
delphi/tests/test_repness_smoke.py	Marks a pre-existing repness structure test as xfail.
delphi/tests/test_legacy_repness_comparison.py	Switches discovered dataset parametrization to include blob variants and parses composite ids.
delphi/tests/test_legacy_clojure_regression.py	Parametrizes legacy Clojure regression tests over blob variants; adds caching for shared Conversations.
delphi/tests/test_discrepancy_fixes.py	New per-discrepancy parity test suite with dataset+blob parametrization and targeted xfails.
delphi/tests/conftest.py	Adds composite-id parsing and extends `use_discovered_datasets` marker to support blob variants.
delphi/scripts/generate_cold_start_clojure.py	Adds pause/unpause coordination for math workers and improves cold-start replay reliability (participants/comments copy).
delphi/scripts/clojure_comparer.py	Switches vote loading helper and updates messaging to point to discrepancy plan.
delphi/polismath/regression/datasets.py	Adds explicit `blob_type` selection and `get_blob_variants()` discovery of filled blobs.
delphi/polismath/regression/__init__.py	Exports `get_blob_variants`.
delphi/polismath/pca_kmeans_rep/clusters.py	Removes an unused sklearn import.
delphi/docs/PLAN_DISCREPANCY_FIXES.md	Documents PR naming convention for the parity-fix series.
delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md	Adds a new journal tracking baseline + xfail/xpass status across datasets.


delphi/tests/test_repness_smoke.py

delphi/tests/test_legacy_repness_comparison.py

delphi/tests/test_discrepancy_fixes.py

delphi/polismath/regression/datasets.py

delphi/tests/conftest.py

delphi/tests/test_legacy_clojure_regression.py

delphi/tests/test_discrepancy_fixes.py

delphi/scripts/generate_cold_start_clojure.py

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.


delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md

delphi/tests/test_discrepancy_fixes.py

delphi/tests/conftest.py

delphi/scripts/generate_cold_start_clojure.py

delphi/tests/test_legacy_clojure_regression.py

delphi/tests/test_discrepancy_fixes.py

Copilot review fixes:
- clusters.py: adjust k down when fewer distinct points than requested
- conversation.py: group_clusters members are base-cluster IDs (not participants)
- pca.py: fallback comps shape uses min(n_comps, n_cols) not min(n_comps, 2)

Bug fix:
- conversation.py: convert pandas Index to list in repness (JSON serialization)

Test fixes:
- test_clusters: init_clusters returns empty members
- test_conversation: test_recompute uses 10 comments to meet vote threshold
- test_datasets: add missing has_comments arg and cold_start blob fixture
- test_edge_cases: update group_repness expectation for no-clusters case
- test_repness_smoke: mark test_repness_structure as xfail (pre-existing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Create tests/test_discrepancy_fixes.py with 30 tests across 12 classes, one per discrepancy (D2, D4, D5, D6, D7, D8, D9, D10, D11, D12, D15) plus synthetic edge cases. All discrepancy tests are xfailed: they are designed to fail before each fix and pass after. Parametrized by all datasets with Clojure reference blobs.

Current results: 5 passed, 2 skipped, 18 xfailed, 5 xpassed.

Create docs/JOURNAL_OF_DISCREPANCIES_FIX.md with initial baseline, test results, Clojure blob structure notes, and design decisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
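For readers unfamiliar with the pattern, here is a minimal sketch of an xfail-guarded, dataset-parametrized parity test. It is not taken from test_discrepancy_fixes.py: the dataset list, the two result dictionaries, and the test name are hypothetical, and the numbers are invented for illustration only.

```python
import pytest

# Hypothetical dataset names; the real suite discovers datasets that have
# Clojure reference blobs.
DATASETS = ["vw", "biodiversity"]

# Invented stand-ins for "value computed by Python" and "value stored in the
# Clojure blob". They agree on one dataset and differ on the other.
PYTHON_RESULT = {"vw": 7, "biodiversity": 7}
CLOJURE_RESULT = {"vw": 7, "biodiversity": 5}


@pytest.mark.parametrize("dataset", DATASETS)
@pytest.mark.xfail(reason="D2: known discrepancy, fix pending", strict=False)
def test_d2_matches_clojure(dataset):
    # Expected to fail until the fix lands; once it does, the xfail marker
    # can be removed and the test becomes an ordinary regression test.
    assert PYTHON_RESULT[dataset] == CLOJURE_RESULT[dataset]
```

Run under pytest, the `vw` case would be reported as xpassed and the `biodiversity` case as xfailed. With `strict=False` an unexpected pass does not fail the run; with `strict=True` it would.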

Instead of erroring when the math worker is running, automatically pause it during generation and resume it afterwards. This prevents the race condition (existing worker picking up fake conversation votes) without requiring manual intervention. Also give each poller container a predictable name and force-remove it on exit to prevent orphans. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move pause/unpause into per-dataset processing so each run handles it independently. Use a marker file (/tmp/polis-math-coldstart-paused) to track whether we caused the pause, and check for sibling coldstart containers before unpausing. Only the last concurrent run to finish unpauses the math worker, and only if it was paused by us (not manually by the user). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
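The coordination rule described above can be sketched as follows. This is an illustration under stated assumptions, not the script's code: `FakeWorker`, `begin_run`, `end_run`, and the temporary marker path are all hypothetical, and the sketch ignores the small window in which two runs could both try to create the marker.

```python
import tempfile
from pathlib import Path


class FakeWorker:
    """Stand-in for the math worker container (hypothetical)."""

    def __init__(self) -> None:
        self.paused = False


def begin_run(worker: FakeWorker, marker: Path) -> None:
    """Pause a running worker and record that this tooling caused the pause."""
    if not worker.paused:
        worker.paused = True
        marker.touch()
    # Already paused: either a sibling run created the marker, or the user
    # paused manually and no marker exists. Leave both cases untouched.


def end_run(worker: FakeWorker, marker: Path, siblings_running: int) -> None:
    """Unpause only as the last run, and only if the marker says we paused."""
    if siblings_running > 0 or not marker.exists():
        return
    marker.unlink()
    worker.paused = False


def demo() -> list:
    """Two overlapping runs, then one run against a manually paused worker."""
    states = []
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "coldstart-paused"
        worker = FakeWorker()
        begin_run(worker, marker)                    # run A pauses the worker
        begin_run(worker, marker)                    # run B finds it paused
        end_run(worker, marker, siblings_running=1)  # A ends while B is active
        states.append(worker.paused)
        end_run(worker, marker, siblings_running=0)  # B ends last and unpauses
        states.append(worker.paused)

        worker.paused = True                         # user pauses manually
        begin_run(worker, marker)
        end_run(worker, marker, siblings_running=0)  # no marker, stays paused
        states.append(worker.paused)
    return states
```

`demo()` returns `[True, False, True]`: the worker stays paused while a sibling is active, resumes after the last run, and is left alone when the user paused it.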

- Remove unused `pairwise_distances` import from clusters.py
- Move `load_votes` import from test code to `benchmark_utils` module so the clojure_comparer script doesn't depend on test-only code
- Fix outdated "K-means++" references: Python now uses first-k-distinct initialization matching Clojure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
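One plausible reading of "first-k-distinct initialization", combined with the earlier fix that adjusts k down when there are fewer distinct points than requested, is sketched below. The function name and signature are hypothetical, and the project's clusters.py may differ in detail.

```python
def first_k_distinct(points, k):
    """Return the first k distinct points, in input order, as initial centers.

    If the data holds fewer than k distinct points, fewer centers come back,
    which amounts to adjusting k down.
    """
    centers = []
    seen = set()
    for point in points:
        key = tuple(point)  # lists are unhashable, so compare as tuples
        if key in seen:
            continue
        seen.add(key)
        centers.append(list(point))
        if len(centers) == k:
            break
    return centers
```

For example, `first_k_distinct([[0, 0], [0, 0], [1, 1], [2, 2]], 2)` gives `[[0, 0], [1, 1]]`, and asking for three centers from `[[0, 0], [0, 0]]` gives only `[[0, 0]]`. Unlike k-means++, this selection is deterministic, which is what makes a point-by-point comparison against a reference implementation possible.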

Document the 10 xpassed tests (all strict=False):
- D2 in-conv × 2 on vw (thresholds coincide on small dataset)
- D6 two_prop_test × 1 (pseudocount diff too small)
- D9 repness_not_empty × 7 (test too weak: checks non-empty, not correct count. TODO: tighten when fixing D9)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Clojure comparison tests now run against both blob variants when available:
- incremental: result of progressive refinement as votes trickled in
- cold_start: computed from scratch in one pass on full dataset

Each dataset generates separate test IDs (e.g., biodiversity-incremental, biodiversity-cold_start). Only blobs with meaningful content (PCA data or non-empty clusters) are included.

Key changes:
- Add get_blob_variants() to discover filled blob variants per dataset
- Add _is_blob_filled() to check if a blob has meaningful content
- Extend get_dataset_files() with explicit blob_type parameter
- Add use_blobs=True option to @pytest.mark.use_discovered_datasets
- Add parse_dataset_blob_id() helper for composite ID parsing
- Update test_legacy_clojure_regression, test_discrepancy_fixes, and test_legacy_repness_comparison to parametrize by blob variant
- Conversation computation is shared across blob variants of same dataset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
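A rough sketch of how composite test IDs and the "filled blob" check might work is shown below. The helper names `split_composite_id` and `blob_has_content` are hypothetical stand-ins, and the blob keys `"pca"` and `"group-clusters"` are assumptions; the real `parse_dataset_blob_id()` and `_is_blob_filled()` may have different signatures and rules.

```python
BLOB_TYPES = ("incremental", "cold_start")


def split_composite_id(composite_id: str) -> tuple:
    """Split 'biodiversity-cold_start' into ('biodiversity', 'cold_start').

    Matching on the suffix keeps dataset names that contain hyphens intact.
    """
    for blob_type in BLOB_TYPES:
        suffix = f"-{blob_type}"
        if composite_id.endswith(suffix):
            return composite_id[: -len(suffix)], blob_type
    raise ValueError(f"No known blob type suffix in {composite_id!r}")


def blob_has_content(blob: dict) -> bool:
    """Treat a blob as filled if it carries PCA data or non-empty clusters."""
    return bool(blob.get("pca")) or bool(blob.get("group-clusters"))
```

With these definitions, `split_composite_id("my-data-incremental")` yields `("my-data", "incremental")`, and `blob_has_content({"pca": None, "group-clusters": []})` is `False`, so such a blob would be left out of the parametrization.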

Legacy repness comparison: replace visibility-only test with asserting tests (xfail for group coverage and set matching). Legacy clojure regression: add test_repness_matches_clojure comparing selected comment sets and z-values against the math blob (xfail). Fix int/str tid mismatch that caused all shared-comment lookups to find zero matches, making blob comparison tests pass vacuously. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
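The vacuous-pass failure mode is easy to reproduce in isolation. The sketch below assumes the Python side keys z-values by integer tid and the blob side by string tid (JSON object keys are always strings); which side held which type in the actual tests is not stated above, and all names and numbers here are invented.

```python
def compare_shared(py_z: dict, clj_z: dict, tol: float = 1e-6) -> tuple:
    """Return (number of shared tids, whether all shared z-values agree)."""
    shared = set(py_z) & set(clj_z)
    # all() over an empty iterable is True, so zero shared tids "passes".
    all_close = all(abs(py_z[t] - clj_z[t]) <= tol for t in shared)
    return len(shared), all_close


python_z = {1: 2.5, 2: -1.0}       # int tids
clojure_z = {"1": 2.5, "2": 3.0}   # str tids, as loaded from JSON

vacuous = compare_shared(python_z, clojure_z)

# Normalizing key types exposes the real disagreement on tid 2.
normalized = {int(tid): z for tid, z in clojure_z.items()}
real = compare_shared(python_z, normalized)
```

Here `vacuous` is `(0, True)` and `real` is `(2, False)`. Asserting that the shared set is non-empty before comparing values guards against this class of bug.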

… imports
- Fix doc reference: CLJ-PARITY-FIXES-PLAN.md → PLAN_DISCREPANCY_FIXES.md (test + journal)
- Remove unused imports: json, ClojureComparer, unfold_clojure_group_clusters
- Fix docstring examples: -full → -incremental (matching actual composite ID format)
- Fix error message: 'No full blob' → 'No incremental blob'
- Fix _extract_dataset_from_test: use endswith() instead of in/replace
- Add strict=True to xfail in test_repness_smoke.py

github-actions · 2026-03-30T17:41:02Z

Delphi Coverage Report

File	Stmts	Miss	Cover
__init__.py	2	0	100%
benchmarks/bench_pca.py	76	76	0%
benchmarks/bench_repness.py	81	81	0%
benchmarks/bench_update_votes.py	38	38	0%
benchmarks/benchmark_utils.py	34	34	0%
components/__init__.py	1	0	100%
components/config.py	165	133	19%
conversation/__init__.py	2	0	100%
conversation/conversation.py	1118	336	70%
conversation/manager.py	131	42	68%
database/__init__.py	1	0	100%
database/dynamodb.py	387	233	40%
database/postgres.py	305	205	33%
pca_kmeans_rep/__init__.py	5	0	100%
pca_kmeans_rep/clusters.py	257	22	91%
pca_kmeans_rep/corr.py	98	17	83%
pca_kmeans_rep/pca.py	52	16	69%
pca_kmeans_rep/repness.py	361	48	87%
pca_kmeans_rep/stats.py	107	22	79%
regression/__init__.py	4	0	100%
regression/clojure_comparer.py	188	17	91%
regression/comparer.py	887	720	19%
regression/datasets.py	135	27	80%
regression/recorder.py	36	27	25%
regression/utils.py	137	118	14%
run_math_pipeline.py	260	114	56%
umap_narrative/500_generate_embedding_umap_cluster.py	210	109	48%
umap_narrative/501_calculate_comment_extremity.py	112	54	52%
umap_narrative/502_calculate_priorities.py	135	135	0%
umap_narrative/700_datamapplot_for_layer.py	502	502	0%
umap_narrative/701_static_datamapplot_for_layer.py	310	310	0%
umap_narrative/702_consensus_divisive_datamapplot.py	432	432	0%
umap_narrative/801_narrative_report_batch.py	785	785	0%
umap_narrative/802_process_batch_results.py	265	265	0%
umap_narrative/803_check_batch_status.py	175	175	0%
umap_narrative/llm_factory_constructor/__init__.py	2	2	0%
umap_narrative/llm_factory_constructor/model_provider.py	157	157	0%
umap_narrative/polismath_commentgraph/__init__.py	1	0	100%
umap_narrative/polismath_commentgraph/cli.py	270	270	0%
umap_narrative/polismath_commentgraph/core/__init__.py	3	3	0%
umap_narrative/polismath_commentgraph/core/clustering.py	108	108	0%
umap_narrative/polismath_commentgraph/core/embedding.py	104	104	0%
umap_narrative/polismath_commentgraph/lambda_handler.py	219	219	0%
umap_narrative/polismath_commentgraph/schemas/__init__.py	2	0	100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py	160	9	94%
umap_narrative/polismath_commentgraph/tests/conftest.py	17	17	0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py	74	74	0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py	55	55	0%
umap_narrative/polismath_commentgraph/tests/test_storage.py	87	87	0%
umap_narrative/polismath_commentgraph/utils/__init__.py	3	0	100%
umap_narrative/polismath_commentgraph/utils/converter.py	283	237	16%
umap_narrative/polismath_commentgraph/utils/group_data.py	354	336	5%
umap_narrative/polismath_commentgraph/utils/storage.py	584	477	18%
umap_narrative/reset_conversation.py	159	50	69%
umap_narrative/run_pipeline.py	453	312	31%
utils/general.py	62	41	34%
Total	10951	7651	30%

jucor · 2026-03-30T22:54:30Z

Superseded by spr-managed PR stack. See the new stack starting at #2508.

This was referenced Mar 5, 2026

[Stack 10/27] Fix D2: in-conv participant threshold + D2c vote count source #2421

Closed

[Clj parity PR 0] Per-discrepancy test infrastructure #2401

Closed

jucor force-pushed the jc/kmeans_analysis_docs branch from 74d276e to cda5015 Compare March 6, 2026 15:34

jucor force-pushed the jc/series-of-fixes branch from e43b0f9 to 5e2a0de Compare March 6, 2026 15:34

jucor force-pushed the jc/kmeans_analysis_docs branch from cda5015 to f4136dd Compare March 10, 2026 11:12

jucor force-pushed the jc/series-of-fixes branch from 5e2a0de to 33ecc1c Compare March 10, 2026 11:12

jucor force-pushed the jc/kmeans_analysis_docs branch from f4136dd to d965abb Compare March 10, 2026 12:29

jucor force-pushed the jc/series-of-fixes branch from 33ecc1c to d7b7c34 Compare March 10, 2026 12:29

jucor force-pushed the jc/kmeans_analysis_docs branch from d965abb to 3886fb5 Compare March 10, 2026 14:13

jucor force-pushed the jc/series-of-fixes branch from d7b7c34 to d2b434e Compare March 10, 2026 14:14

jucor force-pushed the jc/kmeans_analysis_docs branch from 3886fb5 to 000bb0a Compare March 10, 2026 15:39

jucor force-pushed the jc/series-of-fixes branch from d2b434e to 4220632 Compare March 10, 2026 15:40

jucor mentioned this pull request Mar 10, 2026

[Stack 8/27] Deep analysis of Python-Clojure discrepancies and fix plan #2419

Closed

1 task

jucor requested review from ballPointPenguin and whilo March 10, 2026 16:08

jucor changed the title ~~[Clj parity PR 0] Per-discrepancy test infrastructure~~ [Stack 7/8] Per-discrepancy test infrastructure Mar 10, 2026

jucor mentioned this pull request Mar 11, 2026

[Stack 11/27] Fix D4: pseudocount formula #2435

Closed

5 tasks

jucor changed the title ~~[Stack 7/8] Per-discrepancy test infrastructure~~ [Stack 7/9] Per-discrepancy test infrastructure Mar 11, 2026

jucor changed the title ~~[Stack 7/9] Per-discrepancy test infrastructure~~ [Stack 7/10] Per-discrepancy test infrastructure Mar 11, 2026

jucor changed the title ~~[Stack 7/10] Per-discrepancy test infrastructure~~ [Stack 7/11] Per-discrepancy test infrastructure Mar 11, 2026

jucor requested a review from Copilot March 13, 2026 12:37

Copilot started reviewing on behalf of jucor March 13, 2026 12:38

Copilot AI reviewed Mar 13, 2026

jucor changed the title ~~[Stack 7/11] Per-discrepancy test infrastructure~~ [Stack 7/12] Per-discrepancy test infrastructure Mar 13, 2026

jucor changed the title ~~[Stack 7/12] Per-discrepancy test infrastructure~~ [Stack 7/13] Per-discrepancy test infrastructure Mar 13, 2026

jucor requested a review from Copilot March 13, 2026 16:12

Copilot started reviewing on behalf of jucor March 13, 2026 16:16

Copilot AI reviewed Mar 13, 2026

jucor changed the title ~~[Stack 7/13] Per-discrepancy test infrastructure~~ [Stack 7/15] Per-discrepancy test infrastructure Mar 16, 2026

jucor force-pushed the jc/kmeans_analysis_docs branch from 000bb0a to f323584 Compare March 16, 2026 16:04

jucor force-pushed the jc/kmeans_analysis_docs branch from 839b3c7 to 11831d4 Compare March 27, 2026 10:41

jucor changed the title ~~[Stack 7/25] Per-discrepancy test infrastructure~~ [Stack 8/26] Per-discrepancy test infrastructure Mar 30, 2026

jucor force-pushed the jc/kmeans_analysis_docs branch from 11831d4 to 2d235be Compare March 30, 2026 12:48

jucor force-pushed the jc/series-of-fixes branch from 679694b to 1439005 Compare March 30, 2026 12:48

jucor changed the title ~~[Stack 8/26] Per-discrepancy test infrastructure~~ [Stack 9/27] Per-discrepancy test infrastructure Mar 30, 2026

jucor force-pushed the jc/kmeans_analysis_docs branch from 2d235be to b979d12 Compare March 30, 2026 12:54

jucor force-pushed the jc/series-of-fixes branch from 1439005 to 689ed20 Compare March 30, 2026 12:54

jucor requested a review from Copilot March 30, 2026 16:25

Copilot started reviewing on behalf of jucor March 30, 2026 16:26

Copilot AI reviewed Mar 30, 2026

jucor and others added 9 commits March 30, 2026 17:44

Add naming convention

03cb573

jucor force-pushed the jc/kmeans_analysis_docs branch from b979d12 to fa0395b Compare March 30, 2026 16:48

jucor force-pushed the jc/series-of-fixes branch from 689ed20 to 35e84d5 Compare March 30, 2026 16:49

jucor force-pushed the jc/series-of-fixes branch from 35e84d5 to 9585ffa Compare March 30, 2026 17:05

This was referenced Mar 30, 2026

IGNORE -- crash from spr #2493

Closed

IGNORE -- crash from spr #2495

Closed

[Stack 6/17] Fix D2: in-conv participant threshold + D2c vote count source #2513

Open

[Stack 5/17] Per-discrepancy test infrastructure #2512

Open

jucor closed this Mar 30, 2026
