54 changes: 54 additions & 0 deletions delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -711,6 +711,60 @@ to create a new worktree. If yes, provide a prompt they can use to start that session.

---

## Session: Fix D15 — Moderation Handling (2026-03-16)

### Branch: `jc/clj-parity-d15-moderation-handling-zeros-vs-removes`

### What was done

Fixed D15: Python now zeros out moderated-out comment columns instead of removing them,
matching Clojure's `zero-out-columns` behavior (named_matrix.clj:214-230).

**The discrepancy**: Python's `_apply_moderation()` removed moderated-out columns from
`rating_mat` entirely (`raw_rating_mat.loc[keep_ptpts, keep_comments]`). Clojure zeros
them out (`matrix/set-column m' i 0`), preserving matrix structure.

**The fix**: Changed `_apply_moderation()` to:
1. Still remove moderated-out participants (rows) — unchanged
2. Zero out moderated-out comment columns instead of removing them
3. `rating_mat` now has the same column count as `raw_rating_mat`

**Impact on downstream**:
- `tids` output now includes moderated-out tids (matching Clojure)
- PCA: zeroed columns contribute nothing to variance, so PCA results are effectively identical
- Repness: zeroed columns get na=0, nd=0, failing significance — effectively excluded
- Vote counting: `user-vote-counts` in `to_math_blob()` uses `rating_mat`, so moderated
columns now count as "pass" votes (matching Clojure's behavior with zeroed columns)
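The zero-out semantics described above can be illustrated standalone with pandas (a minimal sketch with made-up tids and votes — not the project's real data or API):

```python
import pandas as pd

# Toy raw rating matrix: 3 participants x 3 comments.
raw_rating_mat = pd.DataFrame(
    {"c0": [1.0, -1.0, 0.0], "c1": [1.0, 1.0, -1.0], "c2": [-1.0, 0.0, 1.0]},
    index=["p0", "p1", "p2"],
)
mod_out_tids = {"c1"}

# Clojure-style zero-out: keep the moderated column, set its values to 0.
rating_mat = raw_rating_mat.copy()
cols = rating_mat.columns.intersection(mod_out_tids)
rating_mat[cols] = 0.0

# Structure is preserved: same column count and order as the raw matrix.
assert list(rating_mat.columns) == list(raw_rating_mat.columns)
assert (rating_mat["c1"] == 0.0).all()
```

Contrast with the old behavior, `raw_rating_mat.drop(columns=cols)`, which would shrink the matrix and shift downstream column indices.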

### Tests

**New synthetic tests** (`TestD15SyntheticModeration`, 5 tests):
- `test_zeroing_preserves_columns` — moderated columns still present
- `test_zeroed_columns_are_all_zero` — moderated column values are 0.0
- `test_non_moderated_columns_unchanged` — other columns retain original values
- `test_empty_moderation_no_change` — no-op when no moderation
- `test_moderate_nonexistent_tid` — graceful handling of unknown tids

**Enhanced real-data tests** (`TestD15ModerationHandling`, 2 tests):
- `test_moderated_comments_zeroed_not_removed` — applies mod-out from Clojure blob, checks column count and zeroed values
- `test_tids_include_moderated` — verifies moderated tids remain in rating_mat columns

**Updated existing tests**:
- `test_conversation.py::test_moderation` — updated to expect zeroed columns
- `test_conversation.py::test_update_moderation` — same
- `test_discrepancy_fixes.py::TestD2cVoteCountSource::test_n_cmts_includes_moderated_out_comments` — updated comment count assertion

### Test results

- Public datasets: **328 passed, 0 failed, 6 skipped, 56 xfailed**
- Private datasets: 13 failures — all **pre-existing** (golden snapshot staleness from earlier fixes, not D15-related). Verified by running parent branch.

### What's next

- D12 (comment priorities) or D1/D1b (PCA sign flips) — per plan ordering

---

## Notes for Future Sessions

- Private datasets are in `delphi/real_data/.local/` (separate git repo, linked via `link-to-polis-worktree.sh`)
40 changes: 38 additions & 2 deletions delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -47,6 +47,9 @@ Because this work will span multiple Claude Code sessions, we maintain:
### Testing Principles

- **Granular tests per discrepancy**: Not just overall regression — each fix gets its own targeted test checking the specific aspect it addresses. Multiple discrepancies may affect `test_basic_outputs`; we need to see incremental improvement per fix.
- **Clojure blob comparison is MANDATORY**: Every fix PR must include tests that compare Python output to actual Clojure math blob values — not just formula re-implementation tests (which are tautological: they only verify our code matches our reading of the Clojure, not that it matches Clojure's actual output). The Clojure blob is the ground truth oracle.
- **Stage isolation via blob injection**: Since upstream stages (PCA, clustering) may not match between Python and Clojure, tests must inject Clojure blob values as inputs to the stage being tested, then compare outputs. For example: to test `prop_test` (D5), extract `n-success` and `n-trials` from the Clojure blob's `repness` entries, feed them to Python's `prop_test()`, and compare the result to the blob's `p-test`. This isolates each stage from upstream divergence.
- **Blob fields available for injection/comparison**: The Clojure cold-start blob provides per-group repness entries with: `n-success` (=na), `n-trials` (=ns), `p-success` (=pa), `p-test` (=pat), `repness` (=ra), `repness-test` (=rat), `repful-for`, `best-agree`, `tid`. Also: `group-clusters` (memberships), `group-votes` (per-group vote counts), `consensus` (selected consensus comments), `comment-priorities` (per-tid priority values), `in-conv` (participant list).
- **Targeted pipeline-stage tests**: For D2/D3 (participant filtering, clustering), check in-conv count, cluster count, and cluster memberships against Clojure blob. For D12, check comment-priorities against Clojure blob.
- **All datasets, not just biodiversity**: Every fix must pass on ALL datasets. biodiversity is just one reference among many.
- **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly.
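The blob-injection pattern for `prop_test` (D5) could look roughly like this — a sketch only: the z-formula shown (add-one smoothing, z = 2·sqrt(n+1)·(p − 1/2)) and the blob values are assumptions for illustration, not extracted from a real blob:

```python
import math

def prop_test(succ: int, n: int) -> float:
    # One-proportion z-test with add-one smoothing (assumed formula):
    # increment both counts, then z = 2 * sqrt(n) * (p - 0.5).
    succ, n = succ + 1, n + 1
    return 2.0 * math.sqrt(n) * (succ / n - 0.5)

# Values as they would be extracted from one Clojure repness entry
# (hypothetical numbers, made consistent with the assumed formula above):
blob_entry = {"n-success": 14, "n-trials": 20, "p-test": 1.96396}

got = prop_test(blob_entry["n-success"], blob_entry["n-trials"])
# Compare Python's output to the blob's stored p-test — not to our own
# re-derivation, which would be tautological.
assert abs(got - blob_entry["p-test"]) < 1e-3
```

The point of the pattern is the last two lines: inputs come from the blob, and the expected output comes from the blob, so the test cannot pass merely because our formula agrees with itself.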
Expand Down Expand Up @@ -441,6 +444,38 @@ By this point, we should have good test coverage from all the per-discrepancy te

---

### Investigation: Cold-Start K Divergence (after D15, before D12)

**Prerequisite**: All cold-start-relevant upstream fixes complete: D2/D2c/D2b (in-conv,
vote counts, sort order), D15 (moderation handling). Note: D1 (PCA sign flips) only
affects incremental updates — on cold start there are no previous components to align to.

After D15, the rating matrix construction, in-conv filtering, and PCA inputs should all
match Clojure. Both implementations use silhouette for k-selection. Yet on vw, Python
selects k=4 while Clojure selects k=2.

**Investigation steps**:

1. **PCA component comparison**: Feed the same rating matrix to both sklearn TruncatedSVD
and a Python reimplementation of Clojure's power iteration. Quantify divergence
(cosine similarity per component, Frobenius norm).
2. **Projection comparison**: Inject Clojure blob's PCA components into Python's
clustering path. Does k now match?
3. **Base-cluster comparison**: Given the same projections, compare k-means centroids
and member assignments. Check initialization (Clojure uses first-k-distinct centers
from base clusters — does Python match?).
4. **Silhouette score comparison**: Given the same base clusters, compare per-k
silhouette scores. Are the scores close but the winner differs?
5. **All datasets**: Run on all datasets with cold-start blobs, not just vw.
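Step 1 might be sketched as below (an illustration only: the power-iteration routine here is a generic reimplementation for comparison purposes, not Clojure's actual code, and the data and tolerance are made up):

```python
import numpy as np

def power_iteration_top_component(X: np.ndarray, iters: int = 200, seed: int = 0) -> np.ndarray:
    # First right-singular vector of X via plain power iteration on X^T X.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = X.T @ (X @ v)
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 8))
X -= X.mean(axis=0)  # center, as PCA does

# Reference component from full SVD (what sklearn's TruncatedSVD approximates).
_, _, vt = np.linalg.svd(X, full_matrices=False)
ref = vt[0]

v = power_iteration_top_component(X)
# Singular-vector sign is arbitrary; compare up to sign via |cosine similarity|.
cos = abs(float(v @ ref))
assert cos > 0.999
```

On real rating matrices the interesting case is when `cos` is high but not 1.0 for the second or third component — small component divergence can flip which k wins the silhouette comparison.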

**Outcome**: Either (a) identify a fixable discrepancy that makes k match, or
(b) document the inherent numerical divergence between sklearn SVD and Clojure
power iteration, and establish tolerance bounds for k agreement in tests.

See `delphi/docs/HANDOFF_K_DIVERGENCE_INVESTIGATION.md` for detailed context.

---

### Explicitly Deferred

- **D13 — Subgroup Clustering**: Not implemented in Python, never used by TypeScript consumers. No fix needed.
@@ -470,14 +505,15 @@ By this point, we should have good test coverage from all the per-discrepancy tests.
| D5 | Proportion test | **PR 4** | — | **DONE** ✓ |
| D6 | Two-proportion test | **PR 5** | — | **DONE** ✓ |
| D7 | Repness metric | PR 6 | — | **DONE** ✓ |
| D8 | Finalize cmt stats | PR 7 | — | Fix |
| D8 | Finalize cmt stats | PR 7 | — | **DONE** ✓ |
| D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** ✓ |
| D10 | Rep comment selection | PR 8 | — | Fix (with legacy env var) |
| D11 | Consensus selection | PR 9 | — | Fix (with legacy env var) |
| D12 | Comment priorities | PR 11 | — | Fix (implement from scratch) |
| D13 | Subgroup clustering | — | — | **Deferred** (unused) |
| D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) |
| D15 | Moderation handling | PR 12 | — | Fix |
| D15 | Moderation handling | PR 12 | — | **DONE** ✓ |
| K-inv | Cold-start k divergence | (investigation) | — | Branch off D15 (D2+D15 done, clustering independent of repness) |
| Replay | Replay infrastructure (A/B/C) | — | — | NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. |

### Non-discrepancy PRs in the stack
22 changes: 15 additions & 7 deletions delphi/polismath/conversation/conversation.py
@@ -296,15 +296,23 @@ def update_votes(self,
def _apply_moderation(self) -> None:
"""
Apply moderation settings to create filtered rating matrix.

Matches Clojure behavior (named_matrix.clj:214-230):
- Moderated-out participants are removed (rows dropped)
- Moderated-out comments are ZEROED OUT, not removed — the column
stays in the matrix with all values set to 0. This preserves
matrix structure so that tids, column indices, and dimensions
match between Python and Clojure.
"""
# Filter out moderated participants and comments, and keep them sorted!
# Note: set operations are unordered, hence the extra sort.
# Natural sort: preserves types and sorts numerically when possible
# Filter out moderated participants (remove rows)
keep_ptpts = natsorted(list(set(self.raw_rating_mat.index) - set(self.mod_out_ptpts)))
keep_comments = natsorted(list(set(self.raw_rating_mat.columns) - set(self.mod_out_tids)))

# Create filtered matrix
self.rating_mat = self.raw_rating_mat.loc[keep_ptpts, keep_comments]
self.rating_mat = self.raw_rating_mat.loc[keep_ptpts].copy()

Copilot AI commented (Mar 23, 2026):

Previously `_apply_moderation()` explicitly applied natural sorting to both participants and comments. After this change, rows are re-sorted but column order is now implicitly whatever `raw_rating_mat` currently has. If `raw_rating_mat.columns` can ever be out of natural-sorted order (e.g., loaded from persisted state or constructed outside `update_votes()`), this introduces a non-obvious behavior change and can destabilize downstream column-index assumptions. Consider explicitly reindexing columns to a deterministic order here (e.g., `self.raw_rating_mat.columns` or `natsorted(self.raw_rating_mat.columns)`) to preserve the prior sorted invariant.

Suggested change:

    # Ensure a deterministic, naturally sorted column order to match
    # previous behavior and keep column-index assumptions stable.
    self.rating_mat = self.rating_mat.reindex(
        columns=natsorted(self.raw_rating_mat.columns)
    )

# Zero out moderated-out comments (keep columns, set values to 0)
# Clojure: (matrix/set-column m' i 0) — zeroes the column
mod_cols = [c for c in self.mod_out_tids if c in self.rating_mat.columns]
Copilot AI commented (Mar 23, 2026):

For large matrices and large `mod_out_tids`, building `mod_cols` via repeated `c in self.rating_mat.columns` membership checks is avoidable overhead. Using an index/set intersection (e.g., `self.rating_mat.columns.intersection(...)`) is typically faster and communicates intent more directly.

Suggested change:

    - mod_cols = [c for c in self.mod_out_tids if c in self.rating_mat.columns]
    + mod_cols = list(self.rating_mat.columns.intersection(self.mod_out_tids))

if mod_cols:
self.rating_mat[mod_cols] = 0.0

def _compute_vote_stats(self) -> None:
"""
16 changes: 10 additions & 6 deletions delphi/tests/test_conversation.py
@@ -408,10 +408,13 @@ def test_moderation(self):
assert 'c2' in moderated_conv.mod_out_tids
assert 'p3' in moderated_conv.mod_out_ptpts

# Check filtered rating matrix
assert 'c2' not in moderated_conv.rating_mat.columns
assert 'p3' not in moderated_conv.rating_mat.index

# Check filtered rating matrix:
# - Moderated-out comments are ZEROED, not removed (D15 fix)
# - Moderated-out participants are still removed (rows dropped)
assert 'c2' in moderated_conv.rating_mat.columns # column kept
assert (moderated_conv.rating_mat['c2'] == 0.0).all() # but zeroed
assert 'p3' not in moderated_conv.rating_mat.index # participant removed

# Raw matrix should still have all data
assert 'c2' in moderated_conv.raw_rating_mat.columns
assert 'p3' in moderated_conv.raw_rating_mat.index
@@ -605,9 +608,10 @@ def test_update_moderation(self):

conv = manager.update_moderation('test_conv', moderation)

# Check moderation was applied
# Check moderation was applied: column kept but zeroed (D15 fix)
assert 'c2' in conv.mod_out_tids
assert 'c2' not in conv.rating_mat.columns
assert 'c2' in conv.rating_mat.columns
assert (conv.rating_mat['c2'] == 0.0).all()

# Suppress sklearn PCA warning: test uses minimal data (2 participants, 2 comments)
# which can have zero variance. The test validates that recompute runs, not PCA quality.
152 changes: 138 additions & 14 deletions delphi/tests/test_discrepancy_fixes.py
@@ -288,12 +288,13 @@ def test_n_cmts_includes_moderated_out_comments(self):
participant_votes={0: list(range(10))},
)

# raw_rating_mat has all columns; rating_mat has only non-moderated-out
# Both raw_rating_mat and rating_mat have all columns (D15 fix:
# moderated-out columns are zeroed, not removed)
n_cmts_raw = len(conv.raw_rating_mat.columns)
n_cmts_filtered = len(conv.rating_mat.columns)

assert n_cmts_raw == 10, f"raw_rating_mat should have 10 columns, got {n_cmts_raw}"
assert n_cmts_filtered == 5, f"rating_mat should have 5 columns, got {n_cmts_filtered}"
assert n_cmts_filtered == 10, f"rating_mat should keep all 10 columns (zeroed, not removed), got {n_cmts_filtered}"

# The threshold used by _get_in_conv_participants should be min(7, 10) = 7,
# not min(7, 5) = 5. Verify indirectly: participant with exactly 6 votes
@@ -1144,29 +1145,152 @@ class TestD15ModerationHandling:
"""
D15: Python removes moderated comments entirely from matrix.
Clojure zeros them out (keeps structure, sets values to 0).

Clojure behavior (named_matrix.clj:214-230):
zero-out-columns sets all values in moderated columns to 0,
preserving the matrix structure (same number of columns).

Python should match: _apply_moderation() must zero out moderated
columns rather than removing them, so that:
- rating_mat.columns includes moderated tids (zeroed)
- tids output includes moderated tids
- Matrix dimensions match Clojure
"""

def test_moderated_comments_zeroed_not_removed(self, conv, clojure_blob, dataset_name):
"""
Moderated comments should be zeroed out, not removed, if any exist.

Note: This test only applies when the dataset has moderated comments.
After applying moderation, rating_mat should still have all columns.
Moderated columns should be zeroed, not removed.
"""
# Check if Clojure blob has mod-out comments
mod_out = clojure_blob.get('mod-out', [])
mod_out = clojure_blob.get('mod-out') or []
if not mod_out:
pytest.skip(f"[{dataset_name}] No moderated comments in this dataset")

# If there ARE moderated comments, check Python's handling
n_cols_python = len(conv.rating_mat.columns)
# Apply moderation from the Clojure blob to the Python conversation
mod_conv = conv.update_moderation(
{'mod_out_tids': mod_out},
recompute=False,
)

n_cols_python = len(mod_conv.rating_mat.columns)
n_cols_raw = len(mod_conv.raw_rating_mat.columns)
n_tids_clojure = len(clojure_blob.get('tids', []))

print(f"[{dataset_name}] Moderated comments: {len(mod_out)}")
print(f"[{dataset_name}] Python matrix columns: {n_cols_python}, Clojure tids: {n_tids_clojure}")
print(f"[{dataset_name}] mod-out: {len(mod_out)}")
print(f"[{dataset_name}] Python rating_mat cols: {n_cols_python}, raw cols: {n_cols_raw}")
print(f"[{dataset_name}] Clojure tids: {n_tids_clojure}")
Copilot AI commented on lines +1179 to +1181 (Mar 23, 2026):

These unconditional `print()` calls will add noise to test output in normal runs. Prefer the test framework's logging (or printing only on assertion failure) so CI logs stay clean while still preserving debugging info when needed.


# Clojure keeps all tids (zeroed for mod-out), Python removes them
check.equal(n_cols_python, n_tids_clojure,
f"Matrix columns differ: Python={n_cols_python}, Clojure={n_tids_clojure} (mod-out={len(mod_out)})")
# After zeroing (not removing), column count should match raw matrix
check.equal(
n_cols_python, n_cols_raw,
f"rating_mat should keep all columns (zeroed, not removed): "
f"got {n_cols_python}, expected {n_cols_raw}"
)

# Moderated columns should be all zeros (not NaN, not original values)
for tid in mod_out:
if tid in mod_conv.rating_mat.columns:
col_values = mod_conv.rating_mat[tid].values
check.is_true(
np.all(col_values == 0.0),
f"Moderated tid {tid} should be all zeros, "
f"got non-zero values: {col_values[col_values != 0.0][:5]}"
)
Copilot AI commented on lines +1191 to +1198 (Mar 23, 2026):

This test silently skips the zeroing assertion when a moderated tid is missing from `rating_mat.columns`. Since D15's core requirement is that moderated columns are kept (and then zeroed), a regression back to "remove columns" could partially evade this check. Consider asserting that the tid is present in `rating_mat.columns` whenever it is present in `raw_rating_mat.columns`, and then asserting the zeroing.


def test_tids_include_moderated(self, conv, clojure_blob, dataset_name):
"""The tids output should include moderated-out comments (matching Clojure)."""
mod_out = clojure_blob.get('mod-out') or []
if not mod_out:
pytest.skip(f"[{dataset_name}] No moderated comments in this dataset")

mod_conv = conv.update_moderation(
{'mod_out_tids': mod_out},
recompute=False,
)

# rating_mat.columns (used for tids output) should include moderated tids
for tid in mod_out:
if tid in mod_conv.raw_rating_mat.columns:
check.is_in(
tid, set(mod_conv.rating_mat.columns),
f"Moderated tid {tid} should still be in rating_mat columns"
)


class TestD15SyntheticModeration:
"""
Synthetic tests for D15 moderation handling.

Clojure zeros out moderated columns (named_matrix.clj:214-230).
Python must match: _apply_moderation() zeros columns, not removes them.
"""

def _make_conversation_with_moderation(self, mod_out_tids):
"""Create a small conversation and apply moderation."""
import pandas as pd

# 5 participants, 4 comments. Votes: agree=1, disagree=-1, pass=0, no vote=NaN
data = {
0: [1.0, -1.0, 1.0, np.nan, 0.0],
1: [-1.0, 1.0, 0.0, 1.0, -1.0],
2: [1.0, 1.0, -1.0, -1.0, 1.0],
3: [np.nan, 0.0, 1.0, 1.0, -1.0],
}
votes_df = pd.DataFrame(data, index=[0, 1, 2, 3, 4])

conv = Conversation("synthetic_d15")
conv.raw_rating_mat = votes_df.copy()
conv.rating_mat = votes_df.copy()
conv.participant_count, conv.comment_count = votes_df.shape

# Apply moderation
conv.mod_out_tids = set(mod_out_tids)
conv._apply_moderation()
return conv

def test_zeroing_preserves_columns(self):
"""Moderated columns should still be present in rating_mat."""
conv = self._make_conversation_with_moderation(mod_out_tids=[1, 3])

# All 4 columns should still be present
assert len(conv.rating_mat.columns) == 4, (
f"Expected 4 columns, got {len(conv.rating_mat.columns)}: "
f"moderated columns should be zeroed, not removed"
)
assert set(conv.rating_mat.columns) == {0, 1, 2, 3}

def test_zeroed_columns_are_all_zero(self):
"""Moderated columns should have all values set to 0.0."""
conv = self._make_conversation_with_moderation(mod_out_tids=[1, 3])

for tid in [1, 3]:
col = conv.rating_mat[tid].values
assert np.all(col == 0.0), (
f"Moderated column {tid} should be all zeros, got {col}"
)

def test_non_moderated_columns_unchanged(self):
"""Non-moderated columns should retain their original values."""
conv = self._make_conversation_with_moderation(mod_out_tids=[1])

# Column 0 should be unchanged: [1, -1, 1, NaN, 0]
col0 = conv.rating_mat[0].values
assert col0[0] == 1.0
assert col0[1] == -1.0
assert np.isnan(col0[3]) # NaN preserved for non-moderated

def test_empty_moderation_no_change(self):
"""No moderation should leave the matrix unchanged."""
conv = self._make_conversation_with_moderation(mod_out_tids=[])
assert len(conv.rating_mat.columns) == 4

def test_moderate_nonexistent_tid(self):
"""Moderating a tid that doesn't exist in the matrix should be a no-op."""
conv = self._make_conversation_with_moderation(mod_out_tids=[99])
# All columns preserved, no crash
assert len(conv.rating_mat.columns) == 4
# Original values intact
assert conv.rating_mat[0].values[0] == 1.0


# ============================================================================