134 changes: 134 additions & 0 deletions delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md
@@ -0,0 +1,134 @@
# Handoff: PR 14 — Make Vectorized Code Readable + Blob Injection Tests

## Goal

The scalar functions (`comment_stats`, `add_comparative_stats`, `repness_metric`,
`finalize_cmt_stats`) read like a step-by-step recipe. Their vectorized replacement
(`compute_group_comment_stats_df`) buries the same logic in 150 lines of DataFrame
plumbing. The vectorized path is the ONLY production code path — the scalar functions
are dead code, called only from tests and benchmarks.

**PR 14 must make the vectorized code at least as readable as the scalar code, then
delete the scalar functions.** This also enables vectorized blob injection tests —
the only non-tautological way to verify correctness against the Clojure math blob.

## Starting point

Branch off `jc/clj-parity-d9-fix` (Stack 13). This is BELOW D5-D8 in the stack,
so the refactor will be inherited by all formula fix PRs.

## Current stack (as of 2026-03-17)

```
Stack 13: jc/clj-parity-d9-fix ← branch off HERE for PR 14
Stack 14: jc/clj-parity-d5-prop-test (draft PR #2448)
Stack 15: jc/clj-parity-d6-two-prop-test (draft PR #2449)
Stack 16: jc/clj-parity-d7-repness-metric (draft PR #2450)
Stack 17: jc/clj-parity-d8-finalize-stats (draft PR #2451)
Stack 18: jc/clj-parity-d15-moderation-handling-zeros-vs-removes (PR #2452)
Stack 19: jc/clj-parity-kmeans-k-divergence (k-divergence fix)
Stack 20: jc/clj-parity-d10-rep-comment-selection
Stack 21: jc/clj-parity-d11-consensus-comment-selection
Stack 22: jc/clj-parity-d3-k-smoother-buffer
Stack 23: jc/clj-parity-d12-comment-priorities
Stack 24: jc/clj-parity-d1-pca-sign-flip-prevention
Stack 25: jc/clj-parity-pr15-load-votes-sort
```

D5-D8 are draft PRs waiting for vectorized blob tests before being marked ready.

## Task sequence

### 1. Refactor `compute_group_comment_stats_df`

Split into two phases:

**(a) DataFrame construction** (the plumbing):
- Build participant→group mapping
- Compute total counts per comment
- Create full (group, comment) cross-product index
- Join total counts, compute "other" counts
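As a sketch of what phase (a) might look like (column names, argument names, and vote encoding here are illustrative assumptions, not the current API):

```python
import pandas as pd

def build_group_comment_frame(votes_df: pd.DataFrame, grp_map: dict) -> pd.DataFrame:
    """Sketch of phase (a): pure DataFrame plumbing, no statistics.

    votes_df columns (assumed): pid, tid, vote (1=agree, -1=disagree, 0=pass).
    grp_map: participant id -> group id.
    """
    votes_df = votes_df.assign(gid=votes_df['pid'].map(grp_map))
    # Per-(group, comment) agree/disagree/seen counts
    agg = votes_df.groupby(['gid', 'tid'])['vote'].agg(
        na=lambda v: int((v == 1).sum()),
        nd=lambda v: int((v == -1).sum()),
        ns='size',
    )
    # Full (group, comment) cross-product so missing cells become zeros
    full_index = pd.MultiIndex.from_product(
        [sorted(set(grp_map.values())), sorted(votes_df['tid'].unique())],
        names=['gid', 'tid'],
    )
    df = agg.reindex(full_index, fill_value=0).reset_index()
    # "Other" counts = per-comment totals across all groups minus this group's counts
    totals = df.groupby('tid')[['na', 'nd', 'ns']].transform('sum')
    df['other_agree'] = totals['na'] - df['na']
    df['other_disagree'] = totals['nd'] - df['nd']
    df['other_votes'] = totals['ns'] - df['ns']
    return df
```

The output frame then feeds phase (b) unchanged, which is what makes (b) testable in isolation.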

**(b) Statistics computation** (the readable part):
A new function with clean DataFrame inputs (columns: `na`, `nd`, `ns`,
`other_agree`, `other_disagree`, `other_votes`) → outputs (columns: `pa`, `pd`,
`pat`, `pdt`, `ra`, `rd`, `rat`, `rdt`, `agree_metric`, `disagree_metric`, `repful`).

This function should read like the scalar recipe:
```python
# Probabilities with pseudocounts
df['pa'] = (df['na'] + PSEUDO_COUNT/2) / (df['ns'] + PSEUDO_COUNT)
# Proportion tests
df['pat'] = prop_test_vectorized(df['na'], df['ns'])
# Representativeness ratios
df['ra'] = df['pa'] / df['other_pa']
# ...etc
```

### 2. Write vectorized blob injection tests

Inject Clojure group memberships + votes into the new statistics function,
compare output to blob values. Test the PRODUCTION code path.

For each `(group, tid)` in the blob's `repness`:
- Compare `pat` (or `pdt`) to blob `p-test`
- Compare `rat` (or `rdt`) to blob `repness-test`
- Compare `pa` (or `pd`) to blob `p-success`
- Compare `repful` to blob `repful-for`

The blob only stores the winning side's values. `repful-for` tells you which
side won: if `agree`, then `n-success`=na, `p-test`=pat, `repness-test`=rat.
If `disagree`, then `n-success`=nd, `p-test`=pdt, `repness-test`=rdt.
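A per-entry comparison helper could look like this (a sketch: `stats_row` keys follow the column names from step 1, `blob_entry` keys follow the blob, and the tolerance is an assumed choice):

```python
import math

def assert_matches_blob(stats_row: dict, blob_entry: dict, tol: float = 1e-9) -> None:
    """Compare one (group, tid) row of vectorized output to its blob entry.

    The blob stores only the winning side's values; `repful-for` selects
    which Python columns to compare against.
    """
    if blob_entry['repful-for'] == 'agree':
        n_success, p_test, rep_test, p_success = (
            stats_row['na'], stats_row['pat'], stats_row['rat'], stats_row['pa'])
    else:  # 'disagree'
        n_success, p_test, rep_test, p_success = (
            stats_row['nd'], stats_row['pdt'], stats_row['rdt'], stats_row['pd'])
    assert n_success == blob_entry['n-success']
    assert math.isclose(p_test, blob_entry['p-test'], rel_tol=tol)
    assert math.isclose(rep_test, blob_entry['repness-test'], rel_tol=tol)
    assert math.isclose(p_success, blob_entry['p-success'], rel_tol=tol)
    assert stats_row['repful'] == blob_entry['repful-for']
```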

### 3. Verify scalar ≡ vectorized

Run both paths on all datasets, compare outputs field by field. Must be
identical (not just close — exact match) since they implement the same formulas.
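With pandas this check is a one-liner; `check_exact=True` disables the default tolerances (a sketch, assuming both paths emit frames with the same index and columns):

```python
import pandas as pd

def assert_exactly_equal(scalar_df: pd.DataFrame, vec_df: pd.DataFrame) -> None:
    """Exact equality, not approximate: same formulas, same floats."""
    pd.testing.assert_frame_equal(
        scalar_df.sort_index(axis=1),
        vec_df.sort_index(axis=1),
        check_exact=True,  # no rtol/atol applied
    )
```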

### 4. Delete scalar functions

Remove: `comment_stats`, `add_comparative_stats`, `repness_metric`,
`finalize_cmt_stats`, and scalar `prop_test`/`two_prop_test` if no longer
needed (the vectorized versions remain).

Update all tests that called the scalar API.

### 5. Cascade rebase

Rebase D5→D6→D7→D8→D15→k-divergence→D10→D11→D3→D12→D1→PR15 on top.
Use `.claude/skills/pr-stack/rebase-stack.sh` for the cascade.

### 6. Add per-PR vectorized blob injection tests

For each of D5, D6, D8: insert a RED blob test commit before the fix,
verify it fails, then verify the fix makes it pass. Same TDD pattern used
for the scalar blob tests already in those branches.
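One way to encode the RED step is a strict `xfail` marker that the GREEN commit removes (a self-contained toy showing only the mechanics; the real test compares repness output against the blob):

```python
import pytest

# Stand-in for "the fix has not landed yet"; the real condition is the
# unfixed formula producing a value that differs from the blob.
UNFIXED = True

# RED: the test asserts the correct (blob) behavior, so it fails on the
# unfixed code. strict=True makes an unexpected pass an error, forcing
# the GREEN commit to delete the marker rather than leave it stale.
@pytest.mark.xfail(strict=True, reason="D5 fix not yet applied (RED phase)")
def test_blob_value_placeholder():
    assert not UNFIXED  # fails while the fix is absent
```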

Then mark #2448-#2451 as ready (un-draft).

## Key blob context

- `n-trials` in the Clojure blob = `S` (total seen, INCLUDING passes), not
`A+D` (agrees + disagrees). Verified: `prop_test(11, 14)` matches blob
`p-test` for tid=49 group 0 in vw (A=2, D=11, S=14, A+D=13).
- `group-votes[gid].votes[tid]` = `{A: agrees, D: disagrees, S: total_seen}`
- Blob `repness[gid][i]` fields: `n-success`, `n-trials`, `p-success`,
`p-test`, `repness` (=ra or rd), `repness-test` (=rat or rdt),
`repful-for`, `tid`, `best-agree` (optional).

## Files to modify

- `delphi/polismath/pca_kmeans_rep/repness.py` — the refactor
- `delphi/tests/test_discrepancy_fixes.py` — vectorized blob tests
- `delphi/tests/test_repness_unit.py` — update for new API
- `delphi/tests/test_old_format_repness.py` — update for new API
- `delphi/polismath/benchmarks/bench_repness.py` — update imports

## Reference

- Journal: `delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md` — Session 11 entry,
PR 14 readability goal in Notes for Future Sessions
- Plan: `delphi/docs/PLAN_DISCREPANCY_FIXES.md` — mandatory blob comparison
section in Testing Principles
- Clojure source: `math/src/polismath/math/stats.clj` (prop-test, two-prop-test),
`math/src/polismath/math/repness.clj` (finalize-cmt-stats, repness-metric)
103 changes: 99 additions & 4 deletions delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -19,7 +19,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
| PR 1 (D2) | #2421 | Stack 8/10 | Fix D2: in-conv participant threshold + D2c vote count source |
| PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
-| PR 3 (D9) | | — | *Next: Fix D9 z-score thresholds* |
+| PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |

Future fix PRs will be appended to the stack as they're created.

@@ -50,6 +50,10 @@ Because this work will span multiple Claude Code sessions, we maintain:
- **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly.
- **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification.
- **Remove dead code after replacement**: When a function is replaced by a new implementation (e.g. vectorized version), the old function must be deleted and all callers updated — not left as dead code. Do this in the same PR or a follow-up, after benchmarks and tests confirm the replacement works.
- **Mathematical rigor**: These are math fixes. Every formula change must be verified against the Clojure reference implementation by reading the actual Clojure source and the Python source side-by-side. Verify algebraic equivalence explicitly — don't assume. When in doubt, add a comment showing the derivation.
- **Exhaustive RED phase**: In the RED phase of TDD, don't just write one test showing the discrepancy. Actively ask: "What other behaviors does this change affect? What are the boundary conditions? What happens with empty inputs, single-element inputs, all-agree cases, all-disagree cases?" Write tests for all of them. Before moving to GREEN, explicitly list what tests are still missing and add them. The goal is that the test suite for each fix is comprehensive enough that a wrong implementation cannot pass.
- **Check your work**: After implementing a fix, re-read the Clojure source one more time and verify each line of the Python implementation corresponds correctly. Check array shapes, index semantics (0-based vs 1-based), and aggregation axes. Off-by-one errors and transposed matrices are the most common bugs.
- **Large private datasets are slow**: Some private conversations have 100K–1M votes. Running the full test suite with `--include-local` on all of them can take a very long time. It's OK to run only the small/medium datasets (vw, biodiversity, and the smaller private ones) during the RED/GREEN cycle. Run the full set including large conversations only once, as a final validation before committing — and even then, if a specific large dataset is known to be slow, it's acceptable to skip it and note which ones were tested in the PR description.

### Datasets Available (sorted by size, smallest first)

@@ -383,11 +387,51 @@ This is non-trivial and should be one of the last fixes.

---

-### PR 14: Cleanup — Remove Dead Code
+### PR 14: Refactor Vectorized Code for Readability + Blob Injection Tests

**MOVED EARLIER**: PR 14 is now a prerequisite for all formula fix PRs (D5-D8+),
not a post-parity cleanup. It branches off `jc/clj-parity-d9-fix` (Stack 13),
below all formula fixes. Reason: the vectorized production path
(`compute_group_comment_stats_df`) is too monolithic to test against the Clojure
blob. The refactor makes it testable AND readable.

**The problem**: The scalar functions (`comment_stats`, `add_comparative_stats`,
`repness_metric`, `finalize_cmt_stats`) read like a step-by-step recipe. The
vectorized replacement (`compute_group_comment_stats_df`) buries the same logic
in 150 lines of DataFrame plumbing. The scalar path is dead code in production —
only called from tests and benchmarks.

**Task**:
1. Split `compute_group_comment_stats_df` into (a) DataFrame construction
(group mapping, cross-product index, joins) and (b) statistics computation
as its own function with clean inputs/outputs — readable AND testable.
2. Write vectorized blob injection tests: inject Clojure group memberships +
votes, compare output to blob values. Tests the PRODUCTION code path.
3. Verify scalar and vectorized paths produce identical output on all datasets.
4. Delete scalar functions. Update tests.

**Files**: `polismath/pca_kmeans_rep/repness.py`, `tests/test_discrepancy_fixes.py`,
`tests/test_repness_unit.py`, `tests/test_old_format_repness.py`,
`polismath/benchmarks/bench_repness.py`

See `delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md` for full details.

After PR 14, each fix PR gets vectorized blob injection tests added in RED→GREEN
TDD pattern. This includes D5-D8 (repness formula fixes), D10/D11 (selection),
D15 (moderation), D12 (priorities). For D3 (k-smoother) and D1 (PCA sign flip),
which are incremental-only features, add synthetic tests + skip markers for
incremental blob comparison pending replay infrastructure (see Replay PRs A/B/C).

**After adding vectorized tests to each PR, update the plan AND journal** to
record what was tested, what blob fields were compared, and any discrepancies found.
This is mandatory — the plan and journal are how future sessions know what's done.

---

### PR 14b: Cleanup — Remove Remaining Dead Code (after parity)

**Files**: Multiple (see `08-dead-code.md`)
- Custom kmeans chain in `clusters.py`
- Non-vectorized repness functions in `repness.py`
- Buggy `_compute_votes_base()` (after D12 replaces it)
- `stats.py` inconsistencies (after D9 makes `repness.py` authoritative)

@@ -425,13 +469,14 @@ By this point, we should have good test coverage from all the per-discrepancy te
| D6 | Two-proportion test | PR 5 | — | Fix |
| D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
| D8 | Finalize cmt stats | PR 7 | — | Fix |
-| D9 | Z-score thresholds | PR 3 | | Fix (next) |
+| D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** ✓ |
| D10 | Rep comment selection | PR 8 | — | Fix (with legacy env var) |
| D11 | Consensus selection | PR 9 | — | Fix (with legacy env var) |
| D12 | Comment priorities | PR 11 | — | Fix (implement from scratch) |
| D13 | Subgroup clustering | — | — | **Deferred** (unused) |
| D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) |
| D15 | Moderation handling | PR 12 | — | Fix |
| Replay | Replay infrastructure (A/B/C) | — | — | NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. |

### Non-discrepancy PRs in the stack

@@ -441,6 +486,56 @@

---

## Task parallelization

D9 is done (PR #2446). The remaining fixes have the following dependency structure:

### Repness chain dependency graph (all in `repness.py`)

```
D7  ← D5, D6
D8  ← D6
D10 ← D7, D8
D11 ← D5
```

- **D5, D6**: logically independent, but both modify `repness.py` (signature changes + caller updates in `compute_group_comment_stats_df`) — **must be sequential**
- **D7**: after D5 + D6
- **D8**: after D6
- **D10**: after D7 + D8
- **D11**: after D5 only (parallel with D7, D8, D10)

All are in `repness.py`, strictly sequential within this track.
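The same dependencies can be written down as data with a quick ordering check (a sketch; fix names only, no production code touched):

```python
# Repness-chain dependencies from the graph above: fix -> prerequisites.
DEPS = {
    'D5': [], 'D6': [],
    'D7': ['D5', 'D6'],
    'D8': ['D6'],
    'D10': ['D7', 'D8'],
    'D11': ['D5'],
}

def order_is_valid(order: list) -> bool:
    """True if every fix appears after all of its prerequisites."""
    pos = {fix: i for i, fix in enumerate(order)}
    return all(pos[dep] < pos[fix] for fix, deps in DEPS.items() for dep in deps)
```

For example, the planned Track A sequence D5 → D6 → D7 → D8 → D10 → D11 satisfies every edge.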

### File-boundary analysis

Every fix touches `test_discrepancy_fixes.py` (different test classes per fix — low conflict risk, but same file). The production code boundaries are:

| File | Fixes that modify it |
|------|---------------------|
| `repness.py` | D5, D6, D7, D8, D10, D11 |
| `conversation.py` | D3, D12, D15 |
| `pca.py` | D12, D1/D1b |
| `test_repness_unit.py` | D5, D6, D7, D8 |

**Within each file group, fixes must be sequential** to avoid merge conflicts.

### Practical parallel tracks (2 worktrees)

| Track | Worktree | Fixes (sequential within) | Files |
|-------|----------|--------------------------|-------|
| **A — Repness formulas** | main worktree | D5 → D6 → D7 → D8 → D10 → D11 | `repness.py`, `test_repness_unit.py` |
| **B — Conversation/PCA** | separate worktree | D3 → D15 → D12 | `conversation.py`, `pca.py` |
| **C — Late** | (after A+B) | D1/D1b | `pca.py` (needs replay infra) |

**Tracks A and B can run fully in parallel** using separate worktrees. Within each track, fixes are sequential (same files). Track B order is flexible — D3, D15, D12 touch different functions in `conversation.py`, so the order can be chosen for convenience. D12 is the largest (also touches `pca.py`), so putting it last gives D1/D1b a cleaner base.

The shared `test_discrepancy_fixes.py` file will need a mechanical merge when tracks converge, but since each fix modifies a different test class (already scaffolded with xfail markers), conflicts should be trivial to resolve.

**At convergence**: when both tracks are done, rebase Track B onto Track A (or vice versa). The only conflict will be in `test_discrepancy_fixes.py` — resolve by keeping both sets of test class changes.

---

## Test Infrastructure

### `tests/test_discrepancy_fixes.py` — New test file
13 changes: 2 additions & 11 deletions delphi/docs/usage_examples.md
@@ -136,17 +136,8 @@ for group_id, comments in repness["group_repness"].items():

### Statistical Functions

-```python
-from polismath.pca_kmeans_rep.stats import prop_test, two_prop_test
-
-# Test proportion difference
-z_score = prop_test(70, 100)  # 70 successes out of 100 trials
-print(f"Z-score: {z_score}")
-
-# Compare two proportions
-z_score = two_prop_test(70, 100, 50, 100)  # 70/100 vs 50/100
-print(f"Comparison Z-score: {z_score}")
-```
+Statistical helpers (`z_sig_90`, `z_sig_95`, `prop_test`, `two_prop_test`) are
+defined inline in `repness.py` where they are used. See that module for details.

## Practical Examples

14 changes: 7 additions & 7 deletions delphi/polismath/pca_kmeans_rep/repness.py
@@ -44,7 +44,7 @@ def z_score_sig_90(z: float) -> bool:
Returns:
True if significant at 90% confidence
"""
-    return abs(z) >= Z_90
+    return z > Z_90


def z_score_sig_95(z: float) -> bool:
@@ -57,7 +57,7 @@ def z_score_sig_95(z: float) -> bool:
Returns:
True if significant at 95% confidence
"""
-    return abs(z) >= Z_95
+    return z > Z_95


def prop_test(p: float, n: int, p0: float) -> float:
@@ -688,10 +688,10 @@ def select_rep_comments_df(stats_df: pd.DataFrame,
# Best agree: pa > pd and passes significance tests
agree_candidates = stats_df[stats_df['pa'] > stats_df['pd']].copy()
if not agree_candidates.empty:
-        # Check significance: |pat| >= Z_90 and |rat| >= Z_90
+        # Check significance: pat > Z_90 and rat > Z_90
         passing_agree = agree_candidates[
-            (agree_candidates['pat'].abs() >= Z_90) &
-            (agree_candidates['rat'].abs() >= Z_90) &
+            (agree_candidates['pat'] > Z_90) &
+            (agree_candidates['rat'] > Z_90) &
(agree_candidates['pa'] >= 0.5)
]
if not passing_agree.empty:
@@ -701,8 +701,8 @@
disagree_candidates = stats_df[stats_df['pd'] > stats_df['pa']].copy()
if not disagree_candidates.empty:
passing_disagree = disagree_candidates[
-            (disagree_candidates['pdt'].abs() >= Z_90) &
-            (disagree_candidates['rdt'].abs() >= Z_90) &
+            (disagree_candidates['pdt'] > Z_90) &
+            (disagree_candidates['rdt'] > Z_90) &
(disagree_candidates['pd'] >= 0.5)
]
if not passing_disagree.empty: