[Clj parity PR 0] Per-discrepancy test infrastructure #2401
jucor wants to merge 47 commits into compdemocracy:jc/kmeans_analysis_docs from
Conversation
Force-pushed 5bf8859 to 4056619
Force-pushed f9aaa5a to d2ddb21
Pull request overview
This PR advances the Python↔Clojure parity effort by expanding the regression/testing infrastructure, aligning clustering data structures toward Clojure’s two-level hierarchy, and adding supporting CLI tooling and documentation for comparison/analysis workflows.
Changes:
- Update clustering and serialization to support hierarchical clustering outputs (`base-clusters` folded + `group-clusters` referencing base cluster IDs) and adjust downstream computations/tests accordingly.
- Improve test execution ergonomics: dataset filtering (`--datasets`), xdist loadgroup grouping, and updated smoke/regression tests.
- Add/extend parity tooling & docs: Clojure comparison CLI, PCA benchmark script, and extensive discrepancy analysis/journal documentation.
Reviewed changes
Copilot reviewed 51 out of 55 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| delphi/tests/test_repness_smoke.py | Switch dataset parametrization helper and use unfolded clusters for repness calls. |
| delphi/tests/test_regression.py | Enable PCA sign-flip tolerance in comparer + add unit tests for comparer behavior. |
| delphi/tests/test_pipeline_integrity.py | Use dataset param helper for grouped parallel runs. |
| delphi/tests/test_pca_unit.py | Remove legacy PCA tests and add NaN/edge-case coverage around mean-fill behavior. |
| delphi/tests/test_pca_smoke.py | Switch to dataset param helper for grouped parallel runs. |
| delphi/tests/test_pca_edge_cases.py | Simplify to focus on pca_project_dataframe robustness. |
| delphi/tests/test_math_pipeline_runs_e2e.py | Update mock dataset path. |
| delphi/tests/test_legacy_repness_comparison.py | Switch dataset parametrization helper. |
| delphi/tests/test_legacy_clojure_output.py | Remove legacy script-style test file. |
| delphi/tests/test_golden_data.py | Remove redundant golden data comparison test. |
| delphi/tests/test_edge_cases.py | Update repness expectation for insufficient-data case. |
| delphi/tests/test_datasets.py | Update DatasetInfo tests for cold-start blob support. |
| delphi/tests/test_conversation_smoke.py | Switch dataset parametrization helper. |
| delphi/tests/test_conversation.py | Skip recompute in tests that only validate sorting/moderation; adjust synthetic votes to meet thresholds; filterwarnings for PCA warnings. |
| delphi/tests/test_clusters.py | Update expectations for init_clusters when k > n_points. |
| delphi/tests/legacy_compare_with_clojure.py | Remove legacy comparison script. |
| delphi/tests/conftest.py | Add --datasets option, xdist grouping helpers, and collection filtering. |
| delphi/tests/README.md | Document dataset selection + parallel execution options. |
| delphi/scripts/regression_recorder.py | Add --include-local and pass through to dataset discovery. |
| delphi/scripts/regression_comparer.py | Add CLI options (ignore PCA sign flip, include-local, tolerance knobs) and pass through to comparer. |
| delphi/scripts/clojure_comparer.py | New CLI to compare Python outputs against Clojure math blobs. |
| delphi/pyproject.toml | Add Click dependency; set pytest addopts to --dist=loadgroup; add marker documentation. |
| delphi/polismath/regression/datasets.py | Add cold-start blob handling and prefer-cold-start selection in get_dataset_files. |
| delphi/polismath/regression/__init__.py | Re-export Clojure comparer utilities. |
| delphi/polismath/pca_kmeans_rep/clusters.py | Switch init strategy to “first k distinct points” and add sklearn-based kmeans/silhouette helpers. |
| delphi/polismath/conversation/conversation.py | Implement two-level clustering flow; add unfolding helpers and fold/unfold base clusters; adjust repness/participant info/group votes to use unfolded groups. |
| delphi/polismath/benchmarks/bench_pca.py | New PCA benchmarking/profiling helper. |
| delphi/docs/SUBGROUP_CLUSTERING_THIRD_LEVEL.md | Document (not implement) Clojure’s subgroup clustering level. |
| delphi/docs/CLOJURE_TWO_LEVEL_CLUSTERING.md | Document Clojure clustering hierarchy and formats. |
| delphi/docs/CLOJURE_COMPARISON.md | Document Clojure comparison testing and known differences. |
| delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md | Add parity fixes journal and baseline tracking. |
| delphi/CLAUDE.md | Update documentation warning and add testing/regression guidance. |
| deep-analysis-for-julien/09-fix-plan.md | Add fix plan for parity work. |
| deep-analysis-for-julien/08-dead-code.md | Add dead-code analysis. |
| deep-analysis-for-julien/07-discrepancies.md | Add discrepancy catalog. |
| deep-analysis-for-julien/06-comment-routing.md | Add comment routing analysis. |
| deep-analysis-for-julien/05-participant-filtering.md | Add participant filtering / priority analysis. |
| deep-analysis-for-julien/04-repness-analysis.md | Add repness analysis. |
| deep-analysis-for-julien/03-clustering-analysis.md | Add clustering analysis. |
| deep-analysis-for-julien/02-pca-analysis.md | Add PCA analysis. |
| deep-analysis-for-julien/01-overview-and-architecture.md | Add overall architecture overview. |
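The `clusters.py` change in the table above switches initialization to "first k distinct points" to match Clojure. A minimal sketch of that strategy (function name and shapes are illustrative, not the actual API):

```python
import numpy as np

def first_k_distinct_init(points: np.ndarray, k: int) -> np.ndarray:
    """Pick the first k distinct rows of `points` as initial centers.

    Mirrors the deterministic "first k distinct points" initialization
    described for Clojure; if fewer than k distinct points exist, the
    effective k is reduced to the number of distinct points.
    """
    centers: list[np.ndarray] = []
    for row in points:
        # Only keep rows not already chosen as a center
        if not any(np.array_equal(row, c) for c in centers):
            centers.append(row)
        if len(centers) == k:
            break
    return np.array(centers)

pts = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
centers = first_k_distinct_init(pts, 2)  # deterministic: [[0,0],[1,1]]
```

Unlike k-means++, this is fully deterministic with no seed, which is what makes cross-implementation comparison tractable.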
Comments suppressed due to low confidence (1)
delphi/scripts/clojure_comparer.py:268
- This Click CLI attempts to signal failures by `return 1`, but Click commands run in standalone mode by default and won't use the callback's return value as the process exit code. As written, the script will typically exit with status 0 even when comparisons fail. Use `raise SystemExit(1)`, `ctx.exit(1)`, or follow the pattern used in `scripts/regression_comparer.py` (call `exit(1)`/`exit(0)` explicitly), or call `main(standalone_mode=False)` and `sys.exit` the returned code.
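A minimal sketch of the fix described above (command name, option, and `run_comparison` are hypothetical stand-ins, not the script's real API):

```python
import click

def run_comparison(dataset: str) -> bool:
    """Hypothetical stand-in for the real comparison logic."""
    return False  # pretend the comparison failed

@click.command()
@click.option("--dataset", default="example")
def cli(dataset: str) -> None:
    # In standalone mode Click discards the callback's return value,
    # so raise SystemExit to make the process exit code reflect failure.
    if not run_comparison(dataset):
        raise SystemExit(1)

if __name__ == "__main__":
    cli()
```

`raise SystemExit(1)` propagates through Click in both standalone and non-standalone modes, so the shell sees a non-zero status on comparison failure.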
delphi/tests/conftest.py (outdated)

```python
def _extract_dataset_from_test(item) -> str | None:
    """Extract dataset name from test item's parameter, if present."""
    # Check for parametrized marker with dataset/dataset_name parameter
    for marker in item.iter_markers("parametrize"):
        argnames = marker.args[0] if marker.args else ""
        if "dataset" in argnames:
            # Get the parameter value from callspec
            if hasattr(item, 'callspec'):
                for param_name in ['dataset', 'dataset_name']:
                    if param_name in item.callspec.params:
                        return item.callspec.params[param_name]
    return None
```
`--datasets` filtering won't apply to tests parametrized with `dataset_name` only. `_extract_dataset_from_test` only considers parametrizations whose argnames contain the substring `"dataset"`, so `@pytest.mark.parametrize("dataset_name", ...)` will return `None` and those tests won't be deselected. Consider checking for both `dataset` and `dataset_name` in argnames, or simpler: if `item.callspec` exists, read `item.callspec.params.get('dataset')` / `.get('dataset_name')` directly without relying on the parametrize marker.
```python
import random
from copy import deepcopy
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, pairwise_distances
```
`pairwise_distances` is imported from `sklearn.metrics` but never used in this module. Please remove it to avoid unused-import lint noise and reduce import overhead.

Suggested change:

```python
from sklearn.metrics import silhouette_score
```
```python
from polismath.regression.clojure_comparer import ClojureComparer, load_clojure_math_blob
from polismath.regression import list_available_datasets, get_dataset_files
from polismath.conversation import Conversation
from tests.common_utils import load_votes
```

This script imports `load_votes` from `tests.common_utils`. That makes the CLI depend on test-only code and will break if delphi is installed/distributed without the `tests/` tree. Prefer moving the CSV-loading helper into a non-test module (e.g., `polismath.regression` or `polismath.benchmarks.benchmark_utils`) or duplicating a small loader locally in this script.
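A minimal local loader of the kind the comment suggests duplicating. The column names (`voter-id`, `comment-id`, `vote`) are hypothetical: the real CSV schema lives in `tests.common_utils` and may differ.

```python
import pandas as pd

def load_votes(csv_path: str) -> pd.DataFrame:
    """Minimal vote-matrix loader (hypothetical column names).

    Assumes a long-format CSV with voter, comment, and vote columns,
    and pivots it into a participants x comments matrix with NaN for
    missing votes.
    """
    df = pd.read_csv(csv_path)
    return df.pivot_table(index="voter-id", columns="comment-id", values="vote")
```

Keeping a copy this small in the script avoids the packaging dependency on `tests/` at the cost of one duplicated function.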
```python
click.echo(f"✗ {dataset}: FAIL")
click.echo(f"\nLikely cause: Initialization algorithm difference")
click.echo(f"  Python: K-means++ (seed 42)")
click.echo(f"  Clojure: First k distinct points")
click.echo(f"{'='*60}")
```

The failure explanation printed here says Python uses "K-means++ (seed 42)", but the current clustering implementation uses first-k-distinct initialization (matching Clojure) via `use_first_k_init=True` in `kmeans_sklearn()`. This message should be updated so the CLI output doesn't mislead debugging.
```python
# Legacy field for backward compatibility (remove after full migration)
result['group_clusters'] = self.group_clusters
```

`to_dict()` now serializes `group-clusters` (and the supposedly "legacy" `group_clusters`) with members as base-cluster IDs rather than participant IDs. There is existing Python code that derives participant→group assignments by iterating `group-clusters[*].members` and treating them as participant IDs (e.g. `umap_narrative/polismath_commentgraph/utils/group_data.py`). To avoid breaking downstream consumers, consider either (a) keeping `group_clusters` as the unfolded (participant-id) representation, or (b) adding an explicit unfolded field (e.g. `group-clusters-unfolded`) and updating any in-repo consumers accordingly.

Suggested change:

```python
# Legacy field for backward compatibility:
# expose group_clusters with members as participant IDs (unfolded).
legacy_group_clusters = deepcopy(self.group_clusters) if self.group_clusters else None
if legacy_group_clusters and self.base_clusters:
    # Build mapping: base-cluster ID -> list of participant IDs
    base_to_participants = {}
    for pid, base_cid in self.base_clusters.items():
        base_to_participants.setdefault(base_cid, []).append(pid)
    # Replace group members (base-cluster IDs) with participant IDs
    for group in legacy_group_clusters:
        members = group.get('members', []) or []
        unfolded_members: List[Any] = []
        for base_cid in members:
            unfolded_members.extend(base_to_participants.get(base_cid, []))
        group['members'] = unfolded_members
result['group_clusters'] = legacy_group_clusters
```
Force-pushed a0b0dab to 97087a2
…arker
- Remove substring-based `_extract_dataset_from_test` and `pytest_collection_modifyitems` filtering
- Remove `get_available_dataset_params()` (import-time evaluation)
- Add `@pytest.mark.use_discovered_datasets` marker for dynamic parametrization that respects `--datasets` and `--include-local`
- Hardcoded `@pytest.mark.parametrize` tests are no longer affected by the `--datasets` CLI flag
- Unify parameter name to `dataset_name` across all test files
- Add `test_dataset_selection.py` with unit and pytester integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cluster centers are derived from PCA projections, so they inherit the sign ambiguity. This fix ensures sign flips are detected and corrected for `.center` paths in addition to `.pca.comps` and `.proj.` paths. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, sign flips were detected per-projection-vector, which fails when only some components are flipped (e.g., PC1 unchanged, PC2 flipped). Now detects flips at the component level (.pca.comps[N]) and stores them, then applies per-dimension correction to projections and cluster centers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
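A sketch of per-component sign-flip detection as described in the commit above (function name and array shapes are illustrative; components are rows, projections are participants × dimensions):

```python
import numpy as np

def detect_component_flips(golden_comps: np.ndarray, current_comps: np.ndarray) -> np.ndarray:
    """Return a +1/-1 factor per principal component.

    A component counts as flipped when it matches the golden component
    better after negation (the usual sign ambiguity of PCA eigenvectors).
    Detection is per component, so PC1 unchanged + PC2 flipped is handled.
    """
    flips = np.ones(len(golden_comps))
    for i, (g, c) in enumerate(zip(golden_comps, current_comps)):
        if np.linalg.norm(g + c) < np.linalg.norm(g - c):
            flips[i] = -1.0
    return flips

golden = np.array([[1.0, 0.0], [0.0, 1.0]])
current = np.array([[1.0, 0.0], [0.0, -1.0]])  # PC2 flipped
flips = detect_component_flips(golden, current)

# Per-dimension correction applies to projections and cluster centers alike:
proj = np.array([[0.5, 2.0], [-1.0, 3.0]])
corrected = proj * flips  # broadcasts the flip factors over rows
```

Storing the per-component factors once and reusing them for `.proj.` paths and `.center` paths keeps the two corrections consistent.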
… tests

TLDR: it works! Differences were just numerical errors, artificially looking huge on small values. Looking at the whole cloud of projections confirmed that the post-sklearn results match the pre-sklearn results perfectly well.

Problem:
--------
The regression comparer was failing on datasets like FLI and pakistan with thousands of "differences" showing relative errors up to 874%. Investigation revealed these were false positives: the high relative errors occurred on near-zero values (e.g., golden=4.54e-06 vs current=4.42e-05) where even tiny absolute differences (3e-04) produce huge relative errors.

Diagnosis:
----------
Comparing sklearn SVD-based PCA against power iteration golden snapshots:
- The projection point clouds are visually identical (see scatter plots)
- Important values (Q70-Q100 percentile) match within 0.2%
- Only near-zero values (Q0-Q7 percentile, ~0.0x median) show large rel errors
- These near-zero values represent participants at the origin who don't affect visualization or clustering

The element-wise (abs_tol, rel_tol) approach fundamentally cannot handle this case: it either fails on small values or is too loose for large values.

Solution:
---------
Added projection comparison metrics that measure what actually matters:

| Metric                 | Threshold | What it measures                  |
|------------------------|-----------|-----------------------------------|
| Max \|error\| / range  | < 1%      | Worst displacement as % of axis   |
| Mean \|error\| / range | < 0.1%    | Average displacement as % of axis |
| R² (all coordinates)   | > 0.9999  | Variance explained (99.99%)       |
| R² (per dimension)     | > 0.999   | Per-PC fit quality (99.9%)        |
| Procrustes disparity   | < 1e-4    | Shape similarity after alignment  |

Results for FLI dataset:
- Max |error| / range: 0.0617% (was flagging 28% rel error on Q1 values)
- R²: 0.9999992
- Procrustes: 8.2e-07

All 7 local datasets now pass regression tests.
Also added:
- Quantile context in error reports (computed from ALL values, not just failures)
- Explanation when element-wise diffs exist but metrics confirm a match
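The range-scaled error and R² metrics from the commit above can be sketched as follows (function name and return keys are illustrative, not the comparer's real API):

```python
import numpy as np

def projection_metrics(golden: np.ndarray, current: np.ndarray) -> dict:
    """Cloud-level comparison metrics for PCA projections.

    Relative error on near-zero coordinates is meaningless, so instead
    measure displacement as a fraction of each axis range, plus an
    overall R^2 across all coordinates.
    """
    err = np.abs(current - golden)
    data_range = golden.max(axis=0) - golden.min(axis=0)
    data_range = np.where(data_range == 0, 1.0, data_range)  # guard div by zero
    scaled = err / data_range
    ss_res = ((current - golden) ** 2).sum()
    ss_tot = ((golden - golden.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return {
        "max_err_over_range": float(scaled.max()),
        "mean_err_over_range": float(scaled.mean()),
        "r2": float(r2),
    }
```

A 3e-04 absolute error on a coordinate near zero then contributes only ~0.01% of a typical axis range, instead of an 874% relative error.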
- Exclude `.pca.center` from PCA sign-flip handling (not sign-ambiguous)
- Use AND logic for `overall_match` (don't let projection metrics override stage failures)
- Guard against division by zero when projection `data_range` is 0
- Fix return type annotation on `_log_projection_metrics`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Note: we will update the golden records in a later commit.
I took care to keep the "sparsity-aware" scaling, computed in a vectorized way.
It's not needed anymore: the PCA is done by sklearn, and the edge cases are already handled by the wrapped_pca caller, i.e. pca_project_dataframe. Less code, less complexity, less maintenance.
Re-recorded golden snapshots using the new sklearn SVD-based PCA. These snapshots now serve as the baseline for regression tests.
- test_basic_outputs: xfail for D9/D5/D7 (wrong z-score thresholds and repness formulas produce empty comment_repness)
- test_group_clustering: xfail for D2/D3 (wrong participant threshold and missing k-smoother produce different cluster counts)
- test_comment_priorities: update xfail reason to reference D12
- Re-record golden snapshots after two-level clustering changes

Test baseline: 11 passed, 6 xfailed, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot review fixes:
- clusters.py: adjust k down when there are fewer distinct points than requested
- conversation.py: group_clusters members are base-cluster IDs (not participants)
- pca.py: fallback comps shape uses min(n_comps, n_cols), not min(n_comps, 2)

Bug fix:
- conversation.py: convert pandas Index to list in repness (JSON serialization)

Test fixes:
- test_clusters: init_clusters returns empty members
- test_conversation: test_recompute uses 10 comments to meet vote threshold
- test_datasets: add missing has_comments arg and cold_start blob fixture
- test_edge_cases: update group_repness expectation for no-clusters case
- test_repness_smoke: mark test_repness_structure as xfail (pre-existing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…am consumers

Group clusters store base-cluster IDs in 'members' (matching Clojure's two-level clustering architecture), but downstream functions (conv_repness, participant_stats, group_votes) need participant IDs to join against the vote matrix.

Add _unfolded_group_clusters() helper (equivalent to Clojure's clusters/group-members) and use it in all 5 call sites:
- _compute_repness
- _compute_participant_info
- _compute_group_votes
- to_dict group-votes
- to_dynamo_dict group_votes

Also re-record golden snapshots and remove xfail from test_repness_structure (now passes with correct unfolding).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_group_clustering: unfold Python clusters via _unfolded_group_clusters() (both sides now use two-level clustering), tighten thresholds to 0.99 Jaccard / 0.01 distribution tolerance, require overall_match
- test_comment_priorities: require exact match (1e-6 tolerance) for all comment IDs instead of 70% at 20% tolerance
- clojure_comparer: fix docstring to reflect that Python also uses two-level clustering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
to_dict(), get_full_data(), and to_dynamo_dict() were writing self.group_clusters directly, whose 'members' contain base-cluster IDs (integers 0..N). Downstream consumers (group_data.py, Clojure compat, client apps) expect participant IDs.

The internal helper _unfolded_group_clusters() already existed and was used for repness/participant_info/group_votes computation, but the serialization sites were missed. Fix all of them to unfold before serializing:
- to_dict(): group-clusters, group_clusters, base-clusters
- get_full_data(): group_clusters
- to_dynamo_dict(): base_clusters / group_clusters

Add test_serialization_unfolding.py (TDD: 6 tests fail on the bug, 8/8 pass after the fix) using real recompute() pipeline output.

Re-record golden snapshots to reflect the corrected output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive 9-document analysis covering:
- Architecture and data flow comparison
- PCA implementation details (power iteration vs SVD)
- Two-level clustering with k-smoother analysis
- Representativeness metrics with formula verification
- Participant filtering and comment priority systems
- TypeScript comment routing (prioritized vs topical)
- 15 identified discrepancies rated by severity
- Dead code inventory across the Python codebase
- Prioritized fix plan to bring Python to Clojure parity

Key findings: 5 CRITICAL discrepancies (in-conv threshold, proportion test formula, repness metric, z-score thresholds, missing comment priorities), 3 HIGH, and 7 MEDIUM/LOW.

https://claude.ai/code/session_01FEPFmVHKz1eoqzvXSTmu14
- Doc 02: Correct Clojure sparsity-aware projection code to match the actual implementation (raw votes with nils skipped, not imputed). Add new discrepancy D1b documenting a subtle projection input difference.
- Doc 03: Fix the same-clustering convergence check description: it checks every pairwise distance < threshold, not the sum.
- Doc 07: Add D1b projection input discrepancy (LOW severity).

https://claude.ai/code/session_01FEPFmVHKz1eoqzvXSTmu14
- Remove AGENTS.md and CLAUDE.md/GEMINI.md symlinks
- Create delphi/CLAUDE.md as a real file with upstream content plus testing, regression, and sklearn migration sections
- Add outdated-docs warning
- Move uv-specific instructions to CLAUDE.local.md (gitignored)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive plan covering all 15 documented discrepancies (D1-D15) between the Delphi Python math pipeline and the Clojure reference implementation. Fixes are ordered by pipeline execution order with one PR per fix, stacked for reviewability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create tests/test_discrepancy_fixes.py with 30 tests across 12 classes, one per discrepancy (D2, D4, D5, D6, D7, D8, D9, D10, D11, D12, D15) plus synthetic edge cases. All discrepancy tests are xfailed — designed to fail before each fix and pass after. Parametrized by all datasets with Clojure reference blobs.

Current results: 5 passed, 2 skipped, 18 xfailed, 5 xpassed.

Create docs/JOURNAL_OF_DISCREPANCIES_FIX.md with initial baseline, test results, Clojure blob structure notes, and design decisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
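The per-discrepancy TDD pattern described above can be sketched like this (the test name, reason text, and helper functions are hypothetical illustrations, not the repo's actual tests):

```python
import pytest

# Hypothetical helpers standing in for the real pipeline and reference blob.
def compute_repness_count() -> int:
    raise NotImplementedError  # fails until the D9 fix lands

def clojure_reference_count() -> int:
    return 7

# The test is xfailed while the discrepancy exists; once the fix lands,
# the marker is removed and the test must pass. strict=False tolerates
# datasets where the discrepancy happens not to manifest (xpass).
@pytest.mark.xfail(reason="D9: z-score thresholds differ from Clojure", strict=False)
def test_d9_repness_comment_count():
    assert compute_repness_count() == clojure_reference_count()
```

With `strict=False`, an unexpected pass is reported as xpass rather than a failure, which matches the "5 xpassed" in the baseline above.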
Instead of erroring when the math worker is running, automatically pause it during generation and resume it afterwards. This prevents the race condition (existing worker picking up fake conversation votes) without requiring manual intervention. Also give each poller container a predictable name and force-remove it on exit to prevent orphans. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move pause/unpause into per-dataset processing so each run handles it independently. Use a marker file (/tmp/polis-math-coldstart-paused) to track whether we caused the pause, and check for sibling coldstart containers before unpausing. Only the last concurrent run to finish unpauses the math worker, and only if it was paused by us (not manually by the user). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused `pairwise_distances` import from clusters.py
- Move `load_votes` import from test code to the `benchmark_utils` module so the clojure_comparer script doesn't depend on test-only code
- Fix outdated "K-means++" references — Python now uses first-k-distinct initialization matching Clojure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the 10 xpassed tests (all strict=False):
- D2 in-conv × 2 on vw (thresholds coincide on small dataset)
- D6 two_prop_test × 1 (pseudocount diff too small)
- D9 repness_not_empty × 7 (test too weak — checks non-empty, not correct count. TODO: tighten when fixing D9)
Superseded by #2420 (moved head branch to upstream for simpler stacked PR workflow).
Summary

First PR in the discrepancy fix series described in `docs/PLAN_DISCREPANCY_FIXES.md`. Sets up TDD test infrastructure: all discrepancy tests are `xfail` now and will be un-xfailed as each fix lands in subsequent PRs.

Changes

- `tests/test_discrepancy_fixes.py` — 30 tests across 12 classes, one per discrepancy (D2, D4, D5, D6, D7, D8, D9, D10, D11, D12, D15) plus synthetic edge cases. Parametrized by all datasets with Clojure reference blobs.
- `docs/JOURNAL_OF_DISCREPANCIES_FIX.md` — Tracking document with initial baseline, test results, Clojure blob structure, and design decisions.

Test results

strict=False

Test plan

- `test_discrepancy_fixes.py` runs green (all failures are xfail)
- `test_legacy_clojure_regression.py`

🤖 Generated with Claude Code