[Stack 9/27] Per-discrepancy test infrastructure #2420

Closed
jucor wants to merge 10 commits into jc/kmeans_analysis_docs from jc/series-of-fixes
Conversation

@jucor
Collaborator

@jucor jucor commented Mar 5, 2026

Summary

Stacked on #2419 (Deep analysis of Python-Clojure discrepancies and fix plan). Please review and merge #2419 first.
Next in stack: #2421 (Fix D2: in-conv participant threshold + D2c vote count source)

Per-discrepancy test infrastructure for TDD fixing of Python-Clojure differences.

Changes

  • Add per-discrepancy test markers and parametrized test infrastructure
  • Cold-start recorder: coordinate parallel runs with marker file, auto-pause math workers
  • Update journal with xpassed test breakdown across all datasets
  • Address Copilot review: remove unused import, fix script issues
  • Add naming convention documentation
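
As a rough illustration of the pattern listed above, a per-discrepancy test can be parametrized over datasets and guarded with a non-strict xfail until the corresponding fix lands. This is a minimal sketch under stated assumptions, not the code in tests/test_discrepancy_fixes.py: the dataset list, the `discrepancy` marker name, and the `participant_count_matches` helper are all hypothetical.

```python
import pytest

# Hypothetical dataset names; the real suite discovers datasets that have
# Clojure reference blobs instead of hard-coding them.
DATASETS = ["vw", "biodiversity"]


def participant_count_matches(dataset: str) -> bool:
    """Placeholder for a real Python-vs-Clojure comparison (assumption)."""
    return dataset == "vw"


@pytest.mark.discrepancy("D2")  # hypothetical custom marker; register it in pytest config
@pytest.mark.xfail(strict=False, reason="D2 not fixed yet")
@pytest.mark.parametrize("dataset", DATASETS)
def test_d2_in_conv_threshold(dataset):
    # With strict=False, a dataset where Python already agrees with Clojure
    # is reported as XPASS instead of failing the run.
    assert participant_count_matches(dataset)
```

Once a fix is merged, the xfail marker can be removed (or switched to `strict=True`) so that a regression shows up as a hard failure.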

Test plan

  • 223 passed, 4 skipped, 22 xfailed, 7 xpassed, 0 failures
    🤖 Generated with Claude Code

@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from 74d276e to cda5015 Compare March 6, 2026 15:34
@jucor jucor force-pushed the jc/series-of-fixes branch from e43b0f9 to 5e2a0de Compare March 6, 2026 15:34
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from cda5015 to f4136dd Compare March 10, 2026 11:12
@jucor jucor force-pushed the jc/series-of-fixes branch from 5e2a0de to 33ecc1c Compare March 10, 2026 11:12
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from f4136dd to d965abb Compare March 10, 2026 12:29
@jucor jucor force-pushed the jc/series-of-fixes branch from 33ecc1c to d7b7c34 Compare March 10, 2026 12:29
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from d965abb to 3886fb5 Compare March 10, 2026 14:13
@jucor jucor force-pushed the jc/series-of-fixes branch from d7b7c34 to d2b434e Compare March 10, 2026 14:14
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from 3886fb5 to 000bb0a Compare March 10, 2026 15:39
@jucor jucor force-pushed the jc/series-of-fixes branch from d2b434e to 4220632 Compare March 10, 2026 15:40
@jucor jucor requested review from ballPointPenguin and whilo March 10, 2026 16:08
@jucor jucor changed the title [Clj parity PR 0] Per-discrepancy test infrastructure [Stack 7/8] Per-discrepancy test infrastructure Mar 10, 2026
@jucor jucor changed the title [Stack 7/8] Per-discrepancy test infrastructure [Stack 7/9] Per-discrepancy test infrastructure Mar 11, 2026
@jucor jucor changed the title [Stack 7/9] Per-discrepancy test infrastructure [Stack 7/10] Per-discrepancy test infrastructure Mar 11, 2026
@jucor jucor changed the title [Stack 7/10] Per-discrepancy test infrastructure [Stack 7/11] Per-discrepancy test infrastructure Mar 11, 2026
@jucor jucor requested a review from Copilot March 13, 2026 12:37
Contributor

Copilot AI left a comment

Pull request overview

Adds per-discrepancy (Python vs Clojure) test scaffolding to support TDD parity work, plus tooling updates to generate/compare Clojure cold-start blobs and document the fix workflow.

Changes:

  • Introduces tests/test_discrepancy_fixes.py with per-discrepancy, per-dataset parametrized tests (mostly xfail-guarded) and shared dataset/Conversation fixtures.
  • Updates cold-start blob generator to coordinate parallel runs (pause/unpause workers via marker file) and improve fake conversation replay.
  • Refreshes comparer/tooling/docs (vote loader swap, remove unused import, add PR naming convention + journal).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| delphi/tests/test_repness_smoke.py | Marks an existing flaky/unknown repness structure validation as xfail. |
| delphi/tests/test_discrepancy_fixes.py | New per-discrepancy test module with dataset discovery + xfail-based parity targets. |
| delphi/scripts/generate_cold_start_clojure.py | Adds pause-marker coordination and expands replay to copy participants/comments before votes. |
| delphi/scripts/clojure_comparer.py | Switches vote-loading helper and updates failure guidance message. |
| delphi/polismath/pca_kmeans_rep/clusters.py | Removes unused sklearn import. |
| delphi/docs/PLAN_DISCREPANCY_FIXES.md | Adds PR naming convention documentation. |
| delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | Adds a running journal for parity work and test status tracking. |

@jucor jucor changed the title [Stack 7/11] Per-discrepancy test infrastructure [Stack 7/12] Per-discrepancy test infrastructure Mar 13, 2026
@jucor jucor changed the title [Stack 7/12] Per-discrepancy test infrastructure [Stack 7/13] Per-discrepancy test infrastructure Mar 13, 2026
@jucor jucor requested a review from Copilot March 13, 2026 16:12
Contributor

Copilot AI left a comment

Pull request overview

Adds per-discrepancy (Python vs Clojure) test infrastructure and supporting dataset/blob tooling to enable TDD-style parity fixes across multiple reference blob variants (incremental vs cold-start), plus updates to cold-start blob generation and legacy comparison tests.

Changes:

  • Introduces test_discrepancy_fixes.py with discrepancy-scoped, dataset+blob parametrized tests (mostly xfailed to document known gaps).
  • Extends regression dataset utilities to support explicit blob variants (incremental / cold_start) and “filled blob” discovery.
  • Updates cold-start Clojure blob generation script to coordinate parallel runs and reduce worker conflicts.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| delphi/tests/test_repness_smoke.py | Marks a pre-existing repness structure test as xfail. |
| delphi/tests/test_legacy_repness_comparison.py | Switches discovered dataset parametrization to include blob variants and parses composite ids. |
| delphi/tests/test_legacy_clojure_regression.py | Parametrizes legacy Clojure regression tests over blob variants; adds caching for shared Conversations. |
| delphi/tests/test_discrepancy_fixes.py | New per-discrepancy parity test suite with dataset+blob parametrization and targeted xfails. |
| delphi/tests/conftest.py | Adds composite-id parsing and extends use_discovered_datasets marker to support blob variants. |
| delphi/scripts/generate_cold_start_clojure.py | Adds pause/unpause coordination for math workers and improves cold-start replay reliability (participants/comments copy). |
| delphi/scripts/clojure_comparer.py | Switches vote loading helper and updates messaging to point to discrepancy plan. |
| delphi/polismath/regression/datasets.py | Adds explicit blob_type selection and get_blob_variants() discovery of filled blobs. |
| delphi/polismath/regression/__init__.py | Exports get_blob_variants. |
| delphi/polismath/pca_kmeans_rep/clusters.py | Removes an unused sklearn import. |
| delphi/docs/PLAN_DISCREPANCY_FIXES.md | Documents PR naming convention for the parity-fix series. |
| delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | Adds a new journal tracking baseline + xfail/xpass status across datasets. |

@jucor jucor changed the title [Stack 7/13] Per-discrepancy test infrastructure [Stack 7/15] Per-discrepancy test infrastructure Mar 16, 2026
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from 000bb0a to f323584 Compare March 16, 2026 16:04
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from 839b3c7 to 11831d4 Compare March 27, 2026 10:41
@jucor jucor changed the title [Stack 7/25] Per-discrepancy test infrastructure [Stack 8/26] Per-discrepancy test infrastructure Mar 30, 2026
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from 11831d4 to 2d235be Compare March 30, 2026 12:48
@jucor jucor force-pushed the jc/series-of-fixes branch from 679694b to 1439005 Compare March 30, 2026 12:48
@jucor jucor changed the title [Stack 8/26] Per-discrepancy test infrastructure [Stack 9/27] Per-discrepancy test infrastructure Mar 30, 2026
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from 2d235be to b979d12 Compare March 30, 2026 12:54
@jucor jucor force-pushed the jc/series-of-fixes branch from 1439005 to 689ed20 Compare March 30, 2026 12:54
@jucor jucor requested a review from Copilot March 30, 2026 16:25
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

jucor and others added 9 commits March 30, 2026 17:44
Copilot review fixes:
- clusters.py: adjust k down when fewer distinct points than requested
- conversation.py: group_clusters members are base-cluster IDs (not participants)
- pca.py: fallback comps shape uses min(n_comps, n_cols) not min(n_comps, 2)

Bug fix:
- conversation.py: convert pandas Index to list in repness (JSON serialization)

Test fixes:
- test_clusters: init_clusters returns empty members
- test_conversation: test_recompute uses 10 comments to meet vote threshold
- test_datasets: add missing has_comments arg and cold_start blob fixture
- test_edge_cases: update group_repness expectation for no-clusters case
- test_repness_smoke: mark test_repness_structure as xfail (pre-existing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create tests/test_discrepancy_fixes.py with 30 tests across 12 classes,
one per discrepancy (D2, D4, D5, D6, D7, D8, D9, D10, D11, D12, D15)
plus synthetic edge cases. All discrepancy tests are xfailed — designed
to fail before each fix and pass after. Parametrized by all datasets
with Clojure reference blobs.

Current results: 5 passed, 2 skipped, 18 xfailed, 5 xpassed.

Create docs/JOURNAL_OF_DISCREPANCIES_FIX.md with initial baseline,
test results, Clojure blob structure notes, and design decisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of erroring when the math worker is running, automatically pause
it during generation and resume it afterwards. This prevents the race
condition (existing worker picking up fake conversation votes) without
requiring manual intervention. Also give each poller container a
predictable name and force-remove it on exit to prevent orphans.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move pause/unpause into per-dataset processing so each run handles it
independently. Use a marker file (/tmp/polis-math-coldstart-paused) to
track whether we caused the pause, and check for sibling coldstart
containers before unpausing. Only the last concurrent run to finish
unpauses the math worker, and only if it was paused by us (not manually
by the user).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
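
The marker-file coordination described in the two commit messages above could be sketched roughly as follows. This is an illustrative sketch, not the code in generate_cold_start_clojure.py: the worker operations are passed in as callables (`is_paused`, `pause`, `unpause`, `sibling_count` are hypothetical names), and only the marker path is taken from the commit message. The exists-then-touch sequence is not atomic, so a real implementation may need additional locking.

```python
from pathlib import Path
from typing import Callable

# Marker path quoted in the commit message above.
MARKER = Path("/tmp/polis-math-coldstart-paused")


def pause_if_needed(is_paused: Callable[[], bool],
                    pause: Callable[[], None],
                    marker: Path = MARKER) -> None:
    """Pause the worker only if it is running, and record that we did it."""
    if marker.exists():
        return  # a sibling cold-start run already paused the worker
    if is_paused():
        return  # paused manually by the user: leave no marker behind
    pause()
    marker.touch()


def unpause_if_last(sibling_count: Callable[[], int],
                    unpause: Callable[[], None],
                    marker: Path = MARKER) -> bool:
    """Unpause only if we caused the pause and no sibling run remains."""
    if not marker.exists():
        return False  # the pause (if any) was not ours
    if sibling_count() > 0:
        return False  # another cold-start run is still active
    unpause()
    marker.unlink()
    return True
```

Passing the worker operations in as callables keeps the decision logic testable without a running container.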
- Remove unused `pairwise_distances` import from clusters.py
- Move `load_votes` import from test code to `benchmark_utils` module
  so the clojure_comparer script doesn't depend on test-only code
- Fix outdated "K-means++" references — Python now uses first-k-distinct
  initialization matching Clojure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
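
For readers unfamiliar with the term, "first-k-distinct" initialization can be sketched as below. This is a plain-Python illustration and not the code in clusters.py: the function name is hypothetical, and treating two points as distinct only when their coordinates differ exactly is an assumption. Returning fewer than k centers when the data has fewer distinct points is one way to get the "adjust k down" behaviour mentioned in the earlier review-fix commit.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def first_k_distinct(points: Sequence[Point], k: int) -> List[Point]:
    """Return the first k distinct points, in input order.

    If fewer than k distinct points exist, fewer centers are returned,
    which effectively lowers k.
    """
    if k <= 0:
        return []
    centers: List[Point] = []
    seen = set()
    for p in points:
        if p not in seen:
            seen.add(p)
            centers.append(p)
            if len(centers) == k:
                break
    return centers
```

Unlike K-means++, this choice is deterministic for a given input order, which makes it easier to compare results across two implementations.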
Document the 10 xpassed tests (all strict=False):
- D2 in-conv × 2 on vw (thresholds coincide on small dataset)
- D6 two_prop_test × 1 (pseudocount diff too small)
- D9 repness_not_empty × 7 (test too weak — checks non-empty,
  not correct count. TODO: tighten when fixing D9)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clojure comparison tests now run against both blob variants when available:
- incremental: result of progressive refinement as votes trickled in
- cold_start: computed from scratch in one pass on full dataset

Each dataset generates separate test IDs (e.g., biodiversity-incremental,
biodiversity-cold_start). Only blobs with meaningful content (PCA data or
non-empty clusters) are included.

Key changes:
- Add get_blob_variants() to discover filled blob variants per dataset
- Add _is_blob_filled() to check if a blob has meaningful content
- Extend get_dataset_files() with explicit blob_type parameter
- Add use_blobs=True option to @pytest.mark.use_discovered_datasets
- Add parse_dataset_blob_id() helper for composite ID parsing
- Update test_legacy_clojure_regression, test_discrepancy_fixes, and
  test_legacy_repness_comparison to parametrize by blob variant
- Conversation computation is shared across blob variants of same dataset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
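
The composite test IDs described above (for example `biodiversity-incremental`) can be split back into a dataset name and a blob variant with a suffix match. The sketch below is illustrative only: `parse_composite_id` and `looks_filled` are hypothetical stand-ins for the helpers named in the commit message, and the blob keys `"pca"` and `"group-clusters"` are assumptions about the blob layout.

```python
from typing import Tuple

BLOB_TYPES = ("incremental", "cold_start")


def parse_composite_id(test_id: str) -> Tuple[str, str]:
    """Split 'dataset-blobtype' into (dataset, blob_type).

    Matching on the suffix keeps dataset names intact even when they
    contain a hyphen or a blob-type substring themselves.
    """
    for blob_type in BLOB_TYPES:
        suffix = f"-{blob_type}"
        if test_id.endswith(suffix):
            return test_id[: -len(suffix)], blob_type
    raise ValueError(f"No known blob type suffix in {test_id!r}")


def looks_filled(blob: dict) -> bool:
    """Treat a blob as meaningful if it has PCA data or non-empty clusters.

    The key names used here are assumptions for illustration.
    """
    return bool(blob.get("pca")) or bool(blob.get("group-clusters"))
```

For instance, `parse_composite_id("biodiversity-cold_start")` yields the dataset `biodiversity` with variant `cold_start`.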
Legacy repness comparison: replace visibility-only test with asserting
tests (xfail for group coverage and set matching).

Legacy clojure regression: add test_repness_matches_clojure comparing
selected comment sets and z-values against the math blob (xfail).

Fix int/str tid mismatch that caused all shared-comment lookups to find
zero matches, making blob comparison tests pass vacuously.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
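
The int/str mismatch mentioned above is easy to reproduce in isolation. The following sketch uses made-up comment IDs and a hypothetical `shared_tids` helper; it only illustrates why the comparison passed vacuously and one way to guard against it, not the actual fix in the test suite.

```python
def shared_tids(python_tids, clojure_tids):
    """Intersect comment IDs after normalizing both sides to int."""
    return {int(t) for t in python_tids} & {int(t) for t in clojure_tids}


python_tids = [0, 1, 2]         # ints, e.g. taken from a DataFrame index
clojure_tids = ["1", "2", "3"]  # strings, e.g. JSON object keys

# Naive intersection is empty because 1 != "1" in Python.
naive = set(python_tids) & set(clojure_tids)
fixed = shared_tids(python_tids, clojure_tids)

# A loop over `naive` would run zero times, so assertions inside it would
# never execute and the test would pass vacuously. Guard against that:
assert fixed, "no shared comments: comparison would be vacuous"
```

Asserting that the shared set is non-empty before comparing values turns a silent no-op into a visible failure.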
@jucor jucor force-pushed the jc/kmeans_analysis_docs branch from b979d12 to fa0395b Compare March 30, 2026 16:48
@jucor jucor force-pushed the jc/series-of-fixes branch from 689ed20 to 35e84d5 Compare March 30, 2026 16:49
… imports

- Fix doc reference: CLJ-PARITY-FIXES-PLAN.md → PLAN_DISCREPANCY_FIXES.md (test + journal)
- Remove unused imports: json, ClojureComparer, unfold_clojure_group_clusters
- Fix docstring examples: -full → -incremental (matching actual composite ID format)
- Fix error message: 'No full blob' → 'No incremental blob'
- Fix _extract_dataset_from_test: use endswith() instead of in/replace
- Add strict=True to xfail in test_repness_smoke.py
@jucor jucor force-pushed the jc/series-of-fixes branch from 35e84d5 to 9585ffa Compare March 30, 2026 17:05
@github-actions

Delphi Coverage Report

| File | Stmts | Miss | Cover |
| --- | --- | --- | --- |
| __init__.py | 2 | 0 | 100% |
| benchmarks/bench_pca.py | 76 | 76 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/__init__.py | 1 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| conversation/__init__.py | 2 | 0 | 100% |
| conversation/conversation.py | 1118 | 336 | 70% |
| conversation/manager.py | 131 | 42 | 68% |
| database/__init__.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 233 | 40% |
| database/postgres.py | 305 | 205 | 33% |
| pca_kmeans_rep/__init__.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 257 | 22 | 91% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 52 | 16 | 69% |
| pca_kmeans_rep/repness.py | 361 | 48 | 87% |
| pca_kmeans_rep/stats.py | 107 | 22 | 79% |
| regression/__init__.py | 4 | 0 | 100% |
| regression/clojure_comparer.py | 188 | 17 | 91% |
| regression/comparer.py | 887 | 720 | 19% |
| regression/datasets.py | 135 | 27 | 80% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 137 | 118 | 14% |
| run_math_pipeline.py | 260 | 114 | 56% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 54 | 52% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 785 | 785 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/__init__.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/__init__.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/__init__.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 108 | 108 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/__init__.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/__init__.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 584 | 477 | 18% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 62 | 41 | 34% |
| Total | 10951 | 7651 | 30% |

@jucor
Collaborator Author

jucor commented Mar 30, 2026

Superseded by spr-managed PR stack. See the new stack starting at #2508.

@jucor jucor closed this Mar 30, 2026
2 participants