134 changes: 134 additions & 0 deletions delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md
@@ -0,0 +1,134 @@
# Handoff: PR 14 — Make Vectorized Code Readable + Blob Injection Tests

## Goal

The scalar functions (`comment_stats`, `add_comparative_stats`, `repness_metric`,
`finalize_cmt_stats`) read like a step-by-step recipe. Their vectorized replacement
(`compute_group_comment_stats_df`) buries the same logic in 150 lines of DataFrame
plumbing. The vectorized path is the ONLY production code path — the scalar functions
are dead code, called only from tests and benchmarks.

**PR 14 must make the vectorized code at least as readable as the scalar code, then
delete the scalar functions.** This also enables vectorized blob injection tests —
the only non-tautological way to verify correctness against the Clojure math blob.

## Starting point

Branch off `jc/clj-parity-d9-fix` (Stack 13). This is BELOW D5-D8 in the stack,
so the refactor will be inherited by all formula fix PRs.

## Current stack (as of 2026-03-17)

```
Stack 13: jc/clj-parity-d9-fix ← branch off HERE for PR 14
Stack 14: jc/clj-parity-d5-prop-test (draft PR #2448)
Stack 15: jc/clj-parity-d6-two-prop-test (draft PR #2449)
Stack 16: jc/clj-parity-d7-repness-metric (draft PR #2450)
Stack 17: jc/clj-parity-d8-finalize-stats (draft PR #2451)
Stack 18: jc/clj-parity-d15-moderation-handling-zeros-vs-removes (PR #2452)
Stack 19: jc/clj-parity-kmeans-k-divergence (k-divergence fix)
Stack 20: jc/clj-parity-d10-rep-comment-selection
Stack 21: jc/clj-parity-d11-consensus-comment-selection
Stack 22: jc/clj-parity-d3-k-smoother-buffer
Stack 23: jc/clj-parity-d12-comment-priorities
Stack 24: jc/clj-parity-d1-pca-sign-flip-prevention
Stack 25: jc/clj-parity-pr15-load-votes-sort
```

D5-D8 are draft PRs waiting for vectorized blob tests before being marked ready.

## Task sequence

### 1. Refactor `compute_group_comment_stats_df`

Split into two phases:

**(a) DataFrame construction** (the plumbing):
- Build participant→group mapping
- Compute total counts per comment
- Create full (group, comment) cross-product index
- Join total counts, compute "other" counts
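As a sketch of what phase (a) might look like (column names, argument names, and vote encoding here are illustrative assumptions, not the current API):

```python
import pandas as pd

def build_group_comment_frame(votes_df: pd.DataFrame, grp_map: dict) -> pd.DataFrame:
    """Sketch of phase (a): pure DataFrame plumbing, no statistics.

    votes_df columns (assumed): pid, tid, vote (1=agree, -1=disagree, 0=pass).
    grp_map: participant id -> group id.
    """
    votes_df = votes_df.assign(gid=votes_df['pid'].map(grp_map))
    # Per-(group, comment) agree/disagree/seen counts
    agg = votes_df.groupby(['gid', 'tid'])['vote'].agg(
        na=lambda v: int((v == 1).sum()),
        nd=lambda v: int((v == -1).sum()),
        ns='size',
    )
    # Full (group, comment) cross-product so missing cells become zeros
    full_index = pd.MultiIndex.from_product(
        [sorted(set(grp_map.values())), sorted(votes_df['tid'].unique())],
        names=['gid', 'tid'],
    )
    df = agg.reindex(full_index, fill_value=0).reset_index()
    # "Other" counts = per-comment totals across all groups minus this group's counts
    totals = df.groupby('tid')[['na', 'nd', 'ns']].transform('sum')
    df['other_agree'] = totals['na'] - df['na']
    df['other_disagree'] = totals['nd'] - df['nd']
    df['other_votes'] = totals['ns'] - df['ns']
    return df
```

The output frame then feeds phase (b) unchanged, which is what makes (b) testable in isolation.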

**(b) Statistics computation** (the readable part):
A new function with clean DataFrame inputs (columns: `na`, `nd`, `ns`,
`other_agree`, `other_disagree`, `other_votes`) → outputs (columns: `pa`, `pd`,
`pat`, `pdt`, `ra`, `rd`, `rat`, `rdt`, `agree_metric`, `disagree_metric`, `repful`).

This function should read like the scalar recipe:
```python
# Probabilities with pseudocounts
df['pa'] = (df['na'] + PSEUDO_COUNT/2) / (df['ns'] + PSEUDO_COUNT)
# Proportion tests
df['pat'] = prop_test_vectorized(df['na'], df['ns'])
# Representativeness ratios
df['ra'] = df['pa'] / df['other_pa']
# ...etc
```

### 2. Write vectorized blob injection tests

Inject Clojure group memberships + votes into the new statistics function,
compare output to blob values. Test the PRODUCTION code path.

For each `(group, tid)` in the blob's `repness`:
- Compare `pat` (or `pdt`) to blob `p-test`
- Compare `rat` (or `rdt`) to blob `repness-test`
- Compare `pa` (or `pd`) to blob `p-success`
- Compare `repful` to blob `repful-for`

The blob only stores the winning side's values. `repful-for` tells you which
side won: if `agree`, then `n-success`=na, `p-test`=pat, `repness-test`=rat.
If `disagree`, then `n-success`=nd, `p-test`=pdt, `repness-test`=rdt.
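A per-entry comparison helper could look like this (a sketch: `stats_row` keys follow the column names from step 1, `blob_entry` keys follow the blob, and the tolerance is an assumed choice):

```python
import math

def assert_matches_blob(stats_row: dict, blob_entry: dict, tol: float = 1e-9) -> None:
    """Compare one (group, tid) row of vectorized output to its blob entry.

    The blob stores only the winning side's values; `repful-for` selects
    which Python columns to compare against.
    """
    if blob_entry['repful-for'] == 'agree':
        n_success, p_test, rep_test, p_success = (
            stats_row['na'], stats_row['pat'], stats_row['rat'], stats_row['pa'])
    else:  # 'disagree'
        n_success, p_test, rep_test, p_success = (
            stats_row['nd'], stats_row['pdt'], stats_row['rdt'], stats_row['pd'])
    assert n_success == blob_entry['n-success']
    assert math.isclose(p_test, blob_entry['p-test'], rel_tol=tol)
    assert math.isclose(rep_test, blob_entry['repness-test'], rel_tol=tol)
    assert math.isclose(p_success, blob_entry['p-success'], rel_tol=tol)
    assert stats_row['repful'] == blob_entry['repful-for']
```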

### 3. Verify scalar ≡ vectorized

Run both paths on all datasets, compare outputs field by field. Must be
identical (not just close — exact match) since they implement the same formulas.
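With pandas this check is a one-liner; `check_exact=True` disables the default tolerances (a sketch, assuming both paths emit frames with the same index and columns):

```python
import pandas as pd

def assert_exactly_equal(scalar_df: pd.DataFrame, vec_df: pd.DataFrame) -> None:
    """Exact equality, not approximate: same formulas, same floats."""
    pd.testing.assert_frame_equal(
        scalar_df.sort_index(axis=1),
        vec_df.sort_index(axis=1),
        check_exact=True,  # no rtol/atol applied
    )
```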

### 4. Delete scalar functions

Remove: `comment_stats`, `add_comparative_stats`, `repness_metric`,
`finalize_cmt_stats`, and scalar `prop_test`/`two_prop_test` if no longer
needed (the vectorized versions remain).

Update all tests that called the scalar API.

### 5. Cascade rebase

Rebase D5→D6→D7→D8→D15→k-divergence→D10→D11→D3→D12→D1→PR15 on top.
Use `.claude/skills/pr-stack/rebase-stack.sh` for the cascade.

### 6. Add per-PR vectorized blob injection tests

For each of D5, D6, D8: insert a RED blob test commit before the fix,
verify it fails, then verify the fix makes it pass. Same TDD pattern used
for the scalar blob tests already in those branches.
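One way to encode the RED step is a strict `xfail` marker that the GREEN commit removes (a self-contained toy showing only the mechanics; the real test compares repness output against the blob):

```python
import pytest

# Stand-in for "the fix has not landed yet"; the real condition is the
# unfixed formula producing a value that differs from the blob.
UNFIXED = True

# RED: the test asserts the correct (blob) behavior, so it fails on the
# unfixed code. strict=True makes an unexpected pass an error, forcing
# the GREEN commit to delete the marker rather than leave it stale.
@pytest.mark.xfail(strict=True, reason="D5 fix not yet applied (RED phase)")
def test_blob_value_placeholder():
    assert not UNFIXED  # fails while the fix is absent
```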

Then mark #2448-#2451 as ready (un-draft).

## Key blob context

- `n-trials` in the Clojure blob = `S` (total seen, INCLUDING passes), not
`A+D` (agrees + disagrees). Verified: `prop_test(11, 14)` matches blob
`p-test` for tid=49 group 0 in vw (A=2, D=11, S=14, A+D=13).
- `group-votes[gid].votes[tid]` = `{A: agrees, D: disagrees, S: total_seen}`
- Blob `repness[gid][i]` fields: `n-success`, `n-trials`, `p-success`,
`p-test`, `repness` (=ra or rd), `repness-test` (=rat or rdt),
`repful-for`, `tid`, `best-agree` (optional).

## Files to modify

- `delphi/polismath/pca_kmeans_rep/repness.py` — the refactor
- `delphi/tests/test_discrepancy_fixes.py` — vectorized blob tests
- `delphi/tests/test_repness_unit.py` — update for new API
- `delphi/tests/test_old_format_repness.py` — update for new API
- `delphi/polismath/benchmarks/bench_repness.py` — update imports

## Reference

- Journal: `delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md` — Session 11 entry,
PR 14 readability goal in Notes for Future Sessions
- Plan: `delphi/docs/PLAN_DISCREPANCY_FIXES.md` — mandatory blob comparison
section in Testing Principles
- Clojure source: `math/src/polismath/math/stats.clj` (prop-test, two-prop-test),
`math/src/polismath/math/repness.clj` (finalize-cmt-stats, repness-metric)
103 changes: 99 additions & 4 deletions delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -19,7 +19,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
| PR 1 (D2) | #2421 | Stack 8/10 | Fix D2: in-conv participant threshold + D2c vote count source |
| PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
-| PR 3 (D9) | | — | *Next: Fix D9 z-score thresholds* |
+| PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |

Future fix PRs will be appended to the stack as they're created.

@@ -50,6 +50,10 @@ Because this work will span multiple Claude Code sessions, we maintain:
- **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly.
- **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification.
- **Remove dead code after replacement**: When a function is replaced by a new implementation (e.g. vectorized version), the old function must be deleted and all callers updated — not left as dead code. Do this in the same PR or a follow-up, after benchmarks and tests confirm the replacement works.
- **Mathematical rigor**: These are math fixes. Every formula change must be verified against the Clojure reference implementation by reading the actual Clojure source and the Python source side-by-side. Verify algebraic equivalence explicitly — don't assume. When in doubt, add a comment showing the derivation.
- **Exhaustive RED phase**: In the RED phase of TDD, don't just write one test showing the discrepancy. Actively ask: "What other behaviors does this change affect? What are the boundary conditions? What happens with empty inputs, single-element inputs, all-agree cases, all-disagree cases?" Write tests for all of them. Before moving to GREEN, explicitly list what tests are still missing and add them. The goal is that the test suite for each fix is comprehensive enough that a wrong implementation cannot pass.
- **Check your work**: After implementing a fix, re-read the Clojure source one more time and verify each line of the Python implementation corresponds correctly. Check array shapes, index semantics (0-based vs 1-based), and aggregation axes. Off-by-one errors and transposed matrices are the most common bugs.
- **Large private datasets are slow**: Some private conversations have 100K–1M votes. Running the full test suite with `--include-local` on all of them can take a very long time. It's OK to run only the small/medium datasets (vw, biodiversity, and the smaller private ones) during the RED/GREEN cycle. Run the full set including large conversations only once, as a final validation before committing — and even then, if a specific large dataset is known to be slow, it's acceptable to skip it and note which ones were tested in the PR description.

### Datasets Available (sorted by size, smallest first)

@@ -383,11 +387,51 @@ This is non-trivial and should be one of the last fixes.

---

-### PR 14: Cleanup — Remove Dead Code
+### PR 14: Refactor Vectorized Code for Readability + Blob Injection Tests

**MOVED EARLIER**: PR 14 is now a prerequisite for all formula fix PRs (D5-D8+),
not a post-parity cleanup. It branches off `jc/clj-parity-d9-fix` (Stack 13),
below all formula fixes. Reason: the vectorized production path
(`compute_group_comment_stats_df`) is too monolithic to test against the Clojure
blob. The refactor makes it testable AND readable.

**The problem**: The scalar functions (`comment_stats`, `add_comparative_stats`,
`repness_metric`, `finalize_cmt_stats`) read like a step-by-step recipe. The
vectorized replacement (`compute_group_comment_stats_df`) buries the same logic
in 150 lines of DataFrame plumbing. The scalar path is dead code in production —
only called from tests and benchmarks.

**Task**:
1. Split `compute_group_comment_stats_df` into (a) DataFrame construction
(group mapping, cross-product index, joins) and (b) statistics computation
as its own function with clean inputs/outputs — readable AND testable.
2. Write vectorized blob injection tests: inject Clojure group memberships +
votes, compare output to blob values. Tests the PRODUCTION code path.
3. Verify scalar and vectorized paths produce identical output on all datasets.
4. Delete scalar functions. Update tests.

**Files**: `polismath/pca_kmeans_rep/repness.py`, `tests/test_discrepancy_fixes.py`,
`tests/test_repness_unit.py`, `tests/test_old_format_repness.py`,
`polismath/benchmarks/bench_repness.py`

See `delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md` for full details.

After PR 14, each fix PR gets vectorized blob injection tests added in RED→GREEN
TDD pattern. This includes D5-D8 (repness formula fixes), D10/D11 (selection),
D15 (moderation), D12 (priorities). For D3 (k-smoother) and D1 (PCA sign flip),
which are incremental-only features, add synthetic tests + skip markers for
incremental blob comparison pending replay infrastructure (see Replay PRs A/B/C).

**After adding vectorized tests to each PR, update the plan AND journal** to
record what was tested, what blob fields were compared, and any discrepancies found.
This is mandatory — the plan and journal are how future sessions know what's done.

---

### PR 14b: Cleanup — Remove Remaining Dead Code (after parity)

**Files**: Multiple (see `08-dead-code.md`)
- Custom kmeans chain in `clusters.py`
- Non-vectorized repness functions in `repness.py`
- Buggy `_compute_votes_base()` (after D12 replaces it)
- `stats.py` inconsistencies (after D9 makes `repness.py` authoritative)

@@ -425,13 +469,14 @@ By this point, we should have good test coverage from all the per-discrepancy te
| D6 | Two-proportion test | PR 5 | — | Fix |
| D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
| D8 | Finalize cmt stats | PR 7 | — | Fix |
-| D9 | Z-score thresholds | PR 3 | | Fix (next) |
+| D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** ✓ |
| D10 | Rep comment selection | PR 8 | — | Fix (with legacy env var) |
| D11 | Consensus selection | PR 9 | — | Fix (with legacy env var) |
| D12 | Comment priorities | PR 11 | — | Fix (implement from scratch) |
| D13 | Subgroup clustering | — | — | **Deferred** (unused) |
| D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) |
| D15 | Moderation handling | PR 12 | — | Fix |
| Replay | Replay infrastructure (A/B/C) | — | — | NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. |

### Non-discrepancy PRs in the stack

@@ -441,6 +486,56 @@

---

## Task parallelization

D9 is done (PR #2446). The remaining fixes have the following dependency structure:

### Repness chain dependency graph (all in `repness.py`)

```
D7  ← D5, D6
D8  ← D6
D10 ← D7, D8
D11 ← D5
```

- **D5, D6**: logically independent, but both modify `repness.py` (signature changes + caller updates in `compute_group_comment_stats_df`) — **must be sequential**
- **D7**: after D5 + D6
- **D8**: after D6
- **D10**: after D7 + D8
- **D11**: after D5 only (parallel with D7, D8, D10)

All are in `repness.py`, strictly sequential within this track.
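The same dependencies can be written down as data with a quick ordering check (a sketch; fix names only, no production code touched):

```python
# Repness-chain dependencies from the graph above: fix -> prerequisites.
DEPS = {
    'D5': [], 'D6': [],
    'D7': ['D5', 'D6'],
    'D8': ['D6'],
    'D10': ['D7', 'D8'],
    'D11': ['D5'],
}

def order_is_valid(order: list) -> bool:
    """True if every fix appears after all of its prerequisites."""
    pos = {fix: i for i, fix in enumerate(order)}
    return all(pos[dep] < pos[fix] for fix, deps in DEPS.items() for dep in deps)
```

For example, the planned Track A sequence D5 → D6 → D7 → D8 → D10 → D11 satisfies every edge.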

### File-boundary analysis

Every fix touches `test_discrepancy_fixes.py` (different test classes per fix — low conflict risk, but same file). The production code boundaries are:

| File | Fixes that modify it |
|------|---------------------|
| `repness.py` | D5, D6, D7, D8, D10, D11 |
| `conversation.py` | D3, D12, D15 |
| `pca.py` | D12, D1/D1b |
| `test_repness_unit.py` | D5, D6, D7, D8 |

**Within each file group, fixes must be sequential** to avoid merge conflicts.

### Practical parallel tracks (2 worktrees)

| Track | Worktree | Fixes (sequential within) | Files |
|-------|----------|--------------------------|-------|
| **A — Repness formulas** | main worktree | D5 → D6 → D7 → D8 → D10 → D11 | `repness.py`, `test_repness_unit.py` |
| **B — Conversation/PCA** | separate worktree | D3 → D15 → D12 | `conversation.py`, `pca.py` |
| **C — Late** | (after A+B) | D1/D1b | `pca.py` (needs replay infra) |

**Tracks A and B can run fully in parallel** using separate worktrees. Within each track, fixes are sequential (same files). Track B order is flexible — D3, D15, D12 touch different functions in `conversation.py`, so the order can be chosen for convenience. D12 is the largest (also touches `pca.py`), so putting it last gives D1/D1b a cleaner base.

The shared `test_discrepancy_fixes.py` file will need a mechanical merge when tracks converge, but since each fix modifies a different test class (already scaffolded with xfail markers), conflicts should be trivial to resolve.

**At convergence**: when both tracks are done, rebase Track B onto Track A (or vice versa). The only conflict will be in `test_discrepancy_fixes.py` — resolve by keeping both sets of test class changes.

---

## Test Infrastructure

### `tests/test_discrepancy_fixes.py` — New test file
13 changes: 2 additions & 11 deletions delphi/docs/usage_examples.md
@@ -136,17 +136,8 @@ for group_id, comments in repness["group_repness"].items():

### Statistical Functions

-```python
-from polismath.pca_kmeans_rep.stats import prop_test, two_prop_test
-
-# Test proportion difference
-z_score = prop_test(70, 100)  # 70 successes out of 100 trials
-print(f"Z-score: {z_score}")
-
-# Compare two proportions
-z_score = two_prop_test(70, 100, 50, 100)  # 70/100 vs 50/100
-print(f"Comparison Z-score: {z_score}")
-```
+Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are
+defined inline in `repness.py` where they are used. See that module for details.

## Practical Examples

14 changes: 7 additions & 7 deletions delphi/polismath/pca_kmeans_rep/repness.py
@@ -44,7 +44,7 @@ def z_score_sig_90(z: float) -> bool:
Returns:
True if significant at 90% confidence
"""
-    return abs(z) >= Z_90
+    return z > Z_90


def z_score_sig_95(z: float) -> bool:
@@ -57,7 +57,7 @@ def z_score_sig_95(z: float) -> bool:
Returns:
True if significant at 95% confidence
"""
-    return abs(z) >= Z_95
+    return z > Z_95


def prop_test(p: float, n: int, p0: float) -> float:
@@ -688,10 +688,10 @@ def select_rep_comments_df(stats_df: pd.DataFrame,
# Best agree: pa > pd and passes significance tests
agree_candidates = stats_df[stats_df['pa'] > stats_df['pd']].copy()
if not agree_candidates.empty:
-        # Check significance: |pat| >= Z_90 and |rat| >= Z_90
+        # Check significance: pat > Z_90 and rat > Z_90
         passing_agree = agree_candidates[
-            (agree_candidates['pat'].abs() >= Z_90) &
-            (agree_candidates['rat'].abs() >= Z_90) &
+            (agree_candidates['pat'] > Z_90) &
+            (agree_candidates['rat'] > Z_90) &
(agree_candidates['pa'] >= 0.5)
]
if not passing_agree.empty:
@@ -701,8 +701,8 @@
disagree_candidates = stats_df[stats_df['pd'] > stats_df['pa']].copy()
if not disagree_candidates.empty:
passing_disagree = disagree_candidates[
-            (disagree_candidates['pdt'].abs() >= Z_90) &
-            (disagree_candidates['rdt'].abs() >= Z_90) &
+            (disagree_candidates['pdt'] > Z_90) &
+            (disagree_candidates['rdt'] > Z_90) &
(disagree_candidates['pd'] >= 0.5)
]
if not passing_disagree.empty: