153 changes: 153 additions & 0 deletions delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -0,0 +1,153 @@
# Journal: Fixing Python-Clojure Discrepancies

This is the ongoing tracking document for the TDD fix process described in
`PLAN_DISCREPANCY_FIXES.md`. It is the single source of truth for our working
state; commit messages and PR descriptions are written for reviewers.

---

## Initial Baseline (2026-02-25)

### Branch: `series-of-fixes` (forked from `origin/kmeans_analysis_docs`)

### Test Results

**`test_legacy_clojure_regression.py`** (2 datasets: vw, biodiversity):
- 2 passed (`test_pca_components_match_clojure` × 2)
- 6 xfailed:
- `test_basic_outputs` × 2 — D9/D5/D7: empty `comment_repness`
- `test_group_clustering` × 2 — D2/D3: wrong participant threshold, missing k-smoother
- `test_comment_priorities` × 2 — D12: not implemented

**Full test suite** (`tests/` except `test_batch_id.py`):
- 185 passed, 11 failed, 3 skipped, 6 xfailed
- Pre-existing failures (not caused by this work, inherited from stacked PRs):
- `test_clusters.py::test_init_clusters` — `init_clusters()` doesn't populate members when k > n_points
- `test_conversation.py::test_recompute` — clustering threshold (7.2) filters out all 20 participants in synthetic data
- `test_conversation.py::test_data_persistence` — same threshold issue
- `test_datasets.py` × 4 — DatasetInfo API changed (added `has_cold_start_blob`), tests use old 8-arg constructor
- `test_edge_cases.py::test_insufficient_data_for_pca` — repness returns empty dict for no-group case
- `test_repness_smoke.py` × 2 — repness empty due to D9/D5/D7 (the very discrepancies we're fixing)
- **Note**: CI runs on `main` which has different code — these failures are specific to this stacked branch.
4 of them were already known (documented in MEMORY.md under "Known pre-existing test failures in #2393").

### Datasets Available

| Dataset | Votes | Has cold-start Clojure blob? |
|---------|------:|-----|
| vw | 4,684 | Yes |
| biodiversity | 29,803 | Yes |
| *(5 private datasets)* | 91K–1M | **No** (need prodclone DB) |

### Clojure Blob Structure (vw example)

Repness entry keys: `tid`, `n-agree`, `p-test`, `repness-test`, `n-success`,
`repful-for`, `n-trials`, `repness`, `best-agree`, `p-success`

Consensus structure: `{agree: [{tid, n-success, n-trials, p-success, p-test}], disagree: [...]}`

Comment priorities: `{tid: priority_value}` — 125 entries for vw

In-conv: list of 67 participant IDs (vw)
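When comparing blobs by hand, the key set above is easy to assert directly. A minimal sketch (the helper name is ours, not part of the codebase):

```python
# Keys observed in each repness entry of the vw Clojure blob (see above).
REPNESS_KEYS = {
    "tid", "n-agree", "p-test", "repness-test", "n-success",
    "repful-for", "n-trials", "repness", "best-agree", "p-success",
}

def has_expected_repness_keys(entry: dict) -> bool:
    """Return True if a repness entry carries exactly the documented keys."""
    return set(entry) == REPNESS_KEYS
```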

---

## PR 0: Test Infrastructure (complete)

### What was done
- Created `tests/test_discrepancy_fixes.py` with per-discrepancy test classes
- Created this journal
- Documented initial baseline

### Test results for `test_discrepancy_fixes.py`

After rebase onto updated `origin/kmeans_analysis_docs`:

```
7 passed, 19 skipped, 39 xfailed, 10 xpassed (with --include-local, 7 datasets)
```

- **7 passed**: Clojure formula sanity checks (prop_test, repness metric product, repful rat>rdt) + Clojure blob consistency checks (pat values)
- **19 skipped**: D15 moderation (no moderated comments), incomplete Clojure blobs, engage duplicate files
- **39 xfailed**: Discrepancy tests correctly fail (D2-D12 constants, formulas, and real-data comparisons)
- **10 xpassed** (all `strict=False`, so green):
- D2 in-conv × 2 on vw — small dataset where old/new thresholds coincide
- D6 two_prop_test × 1 — pseudocount difference too small to matter for this test case
- D9 repness_not_empty × 7 on all datasets — `comment_repness` list is populated (all
(group, comment) pairs) even with wrong thresholds; only `group_repness` selection is
affected. **TODO**: tighten this test when fixing D9 to check correct *number* of
representative comments, not just non-emptiness
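The D6 behaviour can be illustrated as a pooled two-proportion z-test with +1 added to every count. This sketch shows why a single pseudocount barely moves the statistic on moderate counts; the exact pseudocount placement in the Clojure code may differ, so treat the formula as an assumption for illustration only:

```python
import math

def two_prop_test(succ_in, succ_out, n_in, n_out):
    """Pooled two-proportion z-test with +1 pseudocounts on each count.

    Illustrative only: the precise smoothing in the Clojure implementation
    may place the pseudocounts differently.
    """
    s1, s2, n1, n2 = succ_in + 1, succ_out + 1, n_in + 1, n_out + 1
    p1, p2 = s1 / n1, s2 / n2
    p_pool = (s1 + s2) / (n1 + n2)
    denom = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / denom
```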

### Design decisions
- All tests that verify targets not yet implemented are marked `@pytest.mark.xfail` with the discrepancy ID in the reason
- `strict=False` on D2 because vw (small) coincidentally matches while biodiversity (larger) does not
- D5 `test_clojure_pat_values_consistent_with_formula` is a sanity check that verifies Clojure data matches the documented formula — it's NOT xfailed because it doesn't test Python code
- D6 `test_two_prop_test_with_pseudocounts` is xfailed (Python lacks pseudocounts)
- D15 uses `pytest.skip` when dataset has no moderated comments
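As a concrete illustration of the `strict=False` convention (the test body is a placeholder, not the real blob comparison):

```python
import pytest

# strict=False: an unexpected pass (xpass) is reported but does not fail
# the run, which is what we want when small datasets like vw can match
# the Clojure output by coincidence.
@pytest.mark.xfail(strict=False, reason="D2: in-conv threshold differs from Clojure")
def test_in_conv_threshold_matches_clojure():
    # Placeholder standing in for the real Clojure-blob comparison.
    assert False
```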

### Test classes summary

| Class | Discrepancy | Tests | xfail? |
|-------|-------------|-------|--------|
| `TestD2InConvThreshold` | D2 | 2 (per dataset) | xfail(strict=False) |
| `TestD4Pseudocount` | D4 | 2 (1 constant + 1 per dataset) | both xfail |
| `TestD5ProportionTest` | D5 | 2 (1 formula + 1 sanity per dataset) | formula xfail, sanity passes |
| `TestD6TwoPropTest` | D6 | 1 | xfail |
| `TestD7RepnessMetric` | D7 | 1 | xfail |
| `TestD8FinalizeStats` | D8 | 1 | xfail |
| `TestD9ZScoreThresholds` | D9 | 3 (2 constants + 1 per dataset) | all xfail |
| `TestD10RepCommentSelection` | D10 | 1 (per dataset) | xfail |
| `TestD11ConsensusSelection` | D11 | 1 (per dataset) | xfail |
| `TestD12CommentPriorities` | D12 | 1 (per dataset) | xfail |
| `TestD15ModerationHandling` | D15 | 1 (per dataset) | skipped (no mod-out data) |
| `TestSyntheticEdgeCases` | multiple | 5 | 2 xfail (D4, D9), 3 pass |

---

## What's Next: PR 1 — Fix D2 (In-Conv Participant Threshold)

- Change `threshold = 7 + sqrt(n_cmts) * 0.1` to `threshold = min(7, n_cmts)` in `conversation.py:1238`
- Use `self.raw_rating_mat` instead of `self.rating_mat` for counting
- Add greedy fallback (top-15 voters if <15 qualify)
- Add monotonic persistence (once in, always in)
- Remove xfail from `TestD2InConvThreshold`
- Check cluster count impact (may change from 3→2 for biodiversity, matching Clojure)
- Re-record golden snapshots if needed
- Document new baseline
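The bullets above can be sketched as follows. Function and parameter names here are ours, and the exact fallback semantics are assumptions; the real change lands in `conversation.py`:

```python
import numpy as np

def in_conv_participants(raw_rating_mat, previously_in, fallback_size=15):
    """Sketch of the planned D2 in-conv rule (names and details assumed).

    raw_rating_mat: participants x comments array, NaN where no vote was cast.
    previously_in: participant ids already admitted on an earlier update.
    """
    n_cmts = raw_rating_mat.shape[1]
    threshold = min(7, n_cmts)  # replaces 7 + sqrt(n_cmts) * 0.1
    # Count votes on the raw (unfiltered) matrix, per the second bullet.
    vote_counts = np.sum(~np.isnan(raw_rating_mat), axis=1)
    in_conv = {int(p) for p in np.flatnonzero(vote_counts >= threshold)}
    if len(in_conv) < fallback_size:
        # Greedy fallback: admit the top voters so clustering has enough points.
        top = np.argsort(-vote_counts)[:fallback_size]
        in_conv |= {int(p) for p in top}
    return in_conv | set(previously_in)  # monotonic: once in, always in
```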

---

## Session Log

### Session 1 (2026-02-25)

- Assessed codebase: read test infra, repness module, conversation module, Clojure blob structure
- Created `tests/test_discrepancy_fixes.py` (30 tests, 12 classes)
- Created this journal with initial baseline
- Ran baseline: `test_legacy_clojure_regression.py` → 2 passed, 6 xfailed
- Ran full suite → 185 passed, 11 failed (pre-existing), 3 skipped, 6 xfailed
- Investigated 11 pre-existing failures: inherited from stacked PRs (DatasetInfo API change, threshold issues, D9/D5/D7 repness)
- Rebased onto updated `origin/kmeans_analysis_docs` after PR stack rebase
- Committed and created PR #2401 (`[Clj parity PR 0]`)

### Session 2 (2026-02-26)

- Rebased after stack update (base tests tightened, some xpassed→xfailed)
- Updated PR title convention: `[Clj parity PR N]` prefix for reviewer clarity
- Redacted private dataset names from git history across the full stack:
- `SESSION_HANDOFF_KMEANS.md` in `kmeans_clustering_tooling` (amended deep commit via `GIT_SEQUENCE_EDITOR` rebase)
- `PLAN_DISCREPANCY_FIXES.md` in `kmeans_analysis_docs` (amended tip)
- `CLJ-PARITY-FIXES-JOURNAL.md` in `series-of-fixes` (amended tip)
- Force-pushed all three branches, rebased the chain
- Tests unchanged: 5 passed, 2 skipped, 18 xfailed, 5 xpassed

---

## Notes for Future Sessions

- Private datasets not available in this worktree. Need prodclone DB + `generate_cold_start_clojure.py`.
- `test_discrepancy_fixes.py` uses same parametrization pattern as `test_legacy_clojure_regression.py` (own `pytest_generate_tests` hook, `dataset_name` fixture).
- 11 pre-existing test failures are from the stacked branch, not from our work. They should be fixed in their respective PRs before merging to main.
- `strict=False` on xfail means xpass (unexpected pass) is reported but not a failure. Used when some datasets pass by coincidence.
- After rebase, D9 `test_repness_not_empty` started xpassing — the `comment_repness` list is populated (all pairs), but `group_repness` (selected reps) may still be affected by wrong thresholds. Consider tightening this test when fixing D9.
- To rebase when base is updated: `git fetch origin kmeans_analysis_docs && git rebase --onto origin/kmeans_analysis_docs series-of-fixes-base series-of-fixes && git tag -f series-of-fixes-base origin/kmeans_analysis_docs`
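In minimal form, the shared parametrization pattern reduces to a module-level hook like the following (a simplified sketch; the real hooks discover datasets dynamically rather than hard-coding them):

```python
# Simplified sketch of the shared parametrization pattern. The real hooks in
# the test files build this list from the available datasets on disk.
DATASETS = ["vw", "biodiversity"]

def pytest_generate_tests(metafunc):
    """Parametrize every test that requests the `dataset_name` fixture."""
    if "dataset_name" in metafunc.fixturenames:
        metafunc.parametrize("dataset_name", DATASETS)
```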
2 changes: 2 additions & 0 deletions delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -6,6 +6,8 @@ The Delphi Python math pipeline has 15 documented discrepancies with the Clojure

Each fix will be a separate PR to keep reviews manageable. PRs will be **stacked** (each builds on the previous), since fixes are ordered by pipeline execution order — fixing upstream affects downstream. PRs should be clearly labeled as stacked with their dependency chain, so reviewers know the order. We can use `git rebase -i` to clean up commit history before merging to main.

**PR naming convention**: Clojure parity fix PRs use the title prefix `[Clj parity PR N]` (e.g., `[Clj parity PR 0] Per-discrepancy test infrastructure`, `[Clj parity PR 1] Fix D2: in-conv participant threshold`).

**Prerequisite**: The current `kmeans_work` branch has changes from `edge`. These should be separated into their own PR(s) first, before we start the discrepancy fix PRs. The discrepancy fix PRs should be based on the cleaned-up branch.

**Action**: Before starting any fix, analyze the diff of `kmeans_work` vs `upstream/edge` to understand what's changed. Group those changes into logical PR(s) — e.g., test infrastructure improvements, doc updates, minor bug fixes. Each should be reviewable independently. The discrepancy fix PRs (PR 1+) then stack on top of this clean base.
2 changes: 2 additions & 0 deletions delphi/polismath/regression/__init__.py
@@ -15,6 +15,7 @@
get_dataset_info,
get_dataset_files,
get_dataset_report_id,
get_blob_variants,
)
__all__ = [
'ConversationRecorder',
@@ -26,4 +27,5 @@
'get_dataset_info',
'get_dataset_files',
'get_dataset_report_id',
'get_blob_variants',
]
57 changes: 53 additions & 4 deletions delphi/polismath/regression/datasets.py
@@ -156,12 +156,15 @@ def get_dataset_report_id(name: str) -> str:
return get_dataset_info(name).report_id


def get_dataset_files(name: str, prefer_cold_start: bool = True) -> Dict[str, str]:
def get_dataset_files(name: str, prefer_cold_start: bool = True, blob_type: Optional[str] = None) -> Dict[str, str]:
"""Get file paths for a dataset.

Args:
name: Dataset name
prefer_cold_start: If True (default), use cold-start blob when available
prefer_cold_start: If True (default), use cold-start blob when available.
Ignored if blob_type is specified.
blob_type: Explicit blob type to use: 'incremental' or 'cold_start'.
If specified, overrides prefer_cold_start.
"""
info = get_dataset_info(name)
rid = info.report_id
@@ -174,11 +177,19 @@ def find_file(pattern: str) -> str:
raise ValueError(f"Multiple files matching {pattern} in {info.path}: {matches}")
return str(matches[0].resolve())

# Check for cold-start blob first, fall back to original
# Determine which blob to use
cold_start_blob = info.path / f"{rid}_math_blob_cold_start.json"
original_blob = info.path / f"{rid}_math_blob.json"

if prefer_cold_start and cold_start_blob.exists():
if blob_type == 'cold_start':
if not cold_start_blob.exists():
raise FileNotFoundError(f"No cold-start blob for {name}")
math_blob_path = str(cold_start_blob)
elif blob_type == 'incremental':
if not original_blob.exists():
raise FileNotFoundError(f"No incremental blob for {name}")
math_blob_path = str(original_blob)
elif prefer_cold_start and cold_start_blob.exists():
math_blob_path = str(cold_start_blob)
else:
math_blob_path = str(original_blob)
@@ -193,6 +204,44 @@ def find_file(pattern: str) -> str:
}


def _is_blob_filled(blob_path: Path) -> bool:
"""Check if a math blob has meaningful content (PCA, non-empty clusters, etc.)."""
import json
if not blob_path.exists():
return False
try:
with open(blob_path) as f:
data = json.load(f)
# A blob is "filled" if it has PCA data or non-empty base-clusters
has_pca = 'pca' in data and 'comps' in data.get('pca', {})
bc = data.get('base-clusters', {})
has_clusters = isinstance(bc, dict) and len(bc.get('id', [])) > 0
return has_pca or has_clusters
except (json.JSONDecodeError, OSError):
return False


def get_blob_variants(name: str) -> List[str]:
"""Get available filled blob variants for a dataset.

Returns a list of blob_type strings ('incremental', 'cold_start') for blobs
that exist and contain meaningful data.
"""
info = get_dataset_info(name)
rid = info.report_id
variants = []

full_blob = info.path / f"{rid}_math_blob.json"
if _is_blob_filled(full_blob):
variants.append('incremental')

cold_start_blob = info.path / f"{rid}_math_blob_cold_start.json"
if _is_blob_filled(cold_start_blob):
variants.append('cold_start')

return variants


# Legacy aliases
def get_dataset_directory(report_id: str, dataset_name: Optional[str] = None) -> Path:
"""Find dataset directory by report_id."""
8 changes: 6 additions & 2 deletions delphi/pyproject.toml
@@ -124,8 +124,12 @@ include = [

# Pytest configuration
[tool.pytest.ini_options]
# When using pytest-xdist (-n), group tests by xdist_group marker for efficient fixture sharing
addopts = "--dist=loadgroup"
# Force sequential execution (-n0) to leverage the session-scoped conversation cache.
# The cache shares computed conversations across test files, but each xdist worker
# has its own process with a separate cache. With only 2 datasets (biodiversity, vw),
# sequential execution with caching is ~6x faster than parallel execution where each
# worker recomputes the conversations independently.
addopts = "-n0"
filterwarnings = [
# Ignore python_multipart deprecation warning from ddtrace (third-party)
"ignore:Please use `import python_multipart`:PendingDeprecationWarning:ddtrace.internal.module",
17 changes: 5 additions & 12 deletions delphi/scripts/clojure_comparer.py
@@ -86,12 +86,7 @@ def main(datasets: tuple, include_local: bool, log_level: str,
from polismath.regression.clojure_comparer import ClojureComparer, load_clojure_math_blob
from polismath.regression.datasets import list_available_datasets, get_dataset_files
from polismath.conversation import Conversation
try:
from tests.common_utils import load_votes
except ImportError:
click.echo("Error: Could not import tests.common_utils.load_votes.", err=True)
click.echo("Run this script from the delphi/ directory with: uv run python scripts/clojure_comparer.py", err=True)
raise SystemExit(1)
from polismath.benchmarks.benchmark_utils import load_votes_from_csv

# Get available datasets with Clojure math_blob
all_datasets = list_available_datasets(include_local=include_local)
@@ -138,7 +133,7 @@

# Load and process Python output
dataset_files = get_dataset_files(dataset)
votes_data = load_votes(dataset_files['votes'])
votes_data = load_votes_from_csv(dataset_files['votes'])

click.echo(f"✓ Loaded votes data: {len(votes_data['votes'])} votes")

@@ -226,9 +221,7 @@
click.echo(f"✓ {dataset}: PASS")
else:
click.echo(f"✗ {dataset}: FAIL")
click.echo(f"\nLikely cause: Initialization algorithm difference")
click.echo(f" Python: K-means++ (seed 42)")
click.echo(f" Clojure: First k distinct points")
click.echo(f" See docs/PLAN_DISCREPANCY_FIXES.md for known discrepancies")
click.echo(f"{'='*60}")

except Exception as e:
@@ -260,8 +253,8 @@ def main(datasets: tuple, include_local: bool, log_level: str,
click.echo(f"\n✗ Failed:")
for name in failed_datasets:
click.echo(f" {name}")
click.echo("\nNote: Failures may be due to initialization differences (K-means++ vs first-k).")
click.echo("This is expected until clustering initialization is aligned.")
click.echo("\nNote: Failures may be due to remaining Python-Clojure discrepancies.")
click.echo("See docs/PLAN_DISCREPANCY_FIXES.md for the fix plan.")
return 1
else:
click.echo("\n✓ All datasets passed!")