71 changes: 70 additions & 1 deletion delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -620,7 +620,7 @@ Every fix PR must now include blob comparison tests.
Both use silhouette. The divergence comes from upstream PCA/clustering differences
(sklearn SVD vs Clojure power iteration). This is independent of all repness fixes
(D4-D11). Investigation planned off D15 branch. See
`delphi/docs/HANDOFF_K_DIVERGENCE_INVESTIGATION.md`.
`delphi/docs/INVESTIGATION_K_DIVERGENCE.md`.

**Key discovery: `n-trials` in Clojure blob = `S` (total seen, including passes),**
not `A+D` (agrees + disagrees). Verified: `prop_test(11, 14)` = blob `p-test` for
@@ -661,6 +661,75 @@ adding vectorized blob tests at each stage.

---

## K-Divergence Investigation & Fix (2026-03-17/18)

### Branch: `jc/clj-parity-kmeans-k-divergence` (PR #2453, Stack 19/25)

### Investigation

Wrote `scripts/investigate_k_divergence.py` to isolate the source of divergence
on vw (Python k=4, Clojure k=2). Systematic elimination:

1. **PCA components**: identical (cosine similarity = 1.000000) — ruled out
2. **Silhouette implementation**: identical scores for both projection sets — ruled out
3. **K-means initialization**: both use first-k-distinct — ruled out
4. **Clojure blob injection**: injecting Clojure projections into Python clustering
still gave k=4 — so it's not about projection values
5. **Participant ordering**: **ROOT CAUSE FOUND** — Python sorted rows by PID via
`natsorted()`, Clojure preserves vote-encounter order (NamedMatrix insertion order).
Different row ordering → different first-k-distinct seeds → different local optima.

Verified Clojure ordering chain by reading `conversation.clj`, `named_matrix.clj`,
`clusters.clj`: `filter-by-index` preserves original matrix row order, not
set iteration order. The CSV first-appearance order `[2, 3, 4, 6, 8, ...]` matches
the Clojure blob's base-cluster PID order exactly.

### Fix

- `conversation.py update_votes()`: replaced `natsorted(existing_rows.union(new_rows))`
with first-appearance order tracking from `vote_updates`
- `conversation.py _apply_moderation()`: replaced `natsorted()` with order-preserving
list comprehension
- Column ordering remains natsorted (doesn't affect clustering)

### Cold-start blob results

| Dataset | Clj k | Py k (before) | Py k (after) | Sizes match? |
|---------|-------|---------------|--------------|--------------|
| vw | 2 | 4 | **2** | [50,17] exact |
| biodiversity | 2 | 2 | **2** | [81,19] exact |
| bg2018 | 2 | 2 | **2** | close ([52,48] vs [51,49]) |
| FLI | 2 | 3 | 3 | inherent PCA divergence |

FLI: 94.5% NaN sparsity, PCA |cos|≈0.9997 (not 1.0), silhouette gap 0.001. Not
fixable without replicating Clojure's power iteration PCA. Low priority.

### Test results

- 297 passed, 0 failed, 6 skipped, 58 xfailed
- Removed `test_group_clustering` xfail (now passes on cold-start blobs)
- Added incremental-blob xfail (different in-conv from single-shot)
- Updated 6 ordering tests (expect encounter order, not natsort)
- Re-recorded vw cold-start blob and golden snapshots for vw + biodiversity

### Session 12 (2026-03-17/18)

- Created branch off D15, investigated k divergence across all 7 datasets
- Re-recorded vw cold-start blob (confirmed k=2 is genuine, not generation artifact)
- Found root cause: `natsorted()` on participant rows
- Fixed `update_votes()` and `_apply_moderation()` to preserve encounter order
- Rebased branch onto new D15 (other session had rebased the stack)
- Inserted into stack at position 19/25, rebased D10→PR15 with `--onto`
- Created PR #2453

### What's Next

1. Refactor D10-D1 branches (tests, code cleanup) before creating PRs for them.
2. Re-record private dataset golden snapshots.
3. FLI k divergence: accept or investigate Clojure power iteration PCA (low priority).

---

## TDD Discipline

**CRITICAL: For every fix, ALWAYS follow this order:**
118 changes: 118 additions & 0 deletions delphi/docs/INVESTIGATION_K_DIVERGENCE.md
@@ -0,0 +1,118 @@
# K-Divergence Investigation — RESOLVED

## Problem (was)

After all cold-start-relevant formula fixes (D2-D15), Python and Clojure
selected different k values on cold-start blobs. On vw: Python=4, Clojure=2.

## Root Cause: Participant Row Ordering

The k divergence was caused by **different participant ordering in the rating
matrix**, which cascades through base-cluster IDs into group-level k-means
initialization via first-k-distinct.

### The chain

```
rating_mat row order
→ PCA projection order
→ base-cluster ID assignment (map-indexed on input rows)
→ group-level k-means first-k-distinct init (first k base-cluster centers)
→ different local optima → different silhouette scores → different k
```

### Clojure ordering

Clojure's NamedMatrix preserves **insertion order** (backed by
`java.util.Vector`). When `rowname-subset` filters to `in-conv` participants,
`filter-by-index` (utils.clj:128-133) preserves the **original matrix row
order** (iterates source, checks membership in filter set). So the base-cluster
ordering is the vote-encounter order of participants in the rating matrix.

### Python ordering (before fix)

Python used `natsorted()` (conversation.py:232) to sort rating matrix rows
by PID. This gave ascending PID order `[1, 2, 3, 4, 5, ...]` instead of
the vote-encounter order `[2, 3, 4, 6, 8, ...]` that Clojure produces.

### Impact

With first-k-distinct initialization, different ordering → different initial
centers → different k-means local optima → different silhouette landscape:

| k | Python (PID order) | Clojure (encounter order) |
|---|-------------------|--------------------------|
| 2 | sil=0.457 | **sil=0.487 (wins)** |
| 3 | sil=0.481 | sil=0.329 |
| 4 | **sil=0.508 (wins)** | sil=0.362 |

## Fix

Changed `update_votes()` and `_apply_moderation()` to preserve vote-encounter
order for participant rows instead of natsort:

1. `update_votes()`: track first-appearance order from `vote_updates`, append
new PIDs in encounter order (not `natsorted`)
2. `_apply_moderation()`: filter `raw_rating_mat.index` preserving order
(list comprehension instead of `natsorted`)
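
The two changes above can be sketched as follows. This is a minimal illustration, not the actual `conversation.py` code: the helper names `merge_row_order` and `drop_moderated`, and the `(pid, tid, vote)` tuple shape of the vote stream, are assumptions made for the example.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple

Vote = Tuple[Hashable, Hashable, int]  # assumed shape: (pid, tid, vote)


def merge_row_order(existing_rows: Sequence[Hashable],
                    vote_updates: Iterable[Vote]) -> List[Hashable]:
    """Keep existing rows in place, then append new PIDs in first-appearance order."""
    seen = set(existing_rows)
    ordered = list(existing_rows)
    for pid, _tid, _vote in vote_updates:
        if pid not in seen:
            seen.add(pid)
            ordered.append(pid)
    return ordered


def drop_moderated(rows: Sequence[Hashable],
                   mod_out: Iterable[Hashable]) -> List[Hashable]:
    """Filter out moderated PIDs without disturbing the remaining order."""
    blocked = set(mod_out)
    return [pid for pid in rows if pid not in blocked]


# Hypothetical vote stream: PID 2 votes first, then 3, then 1.
votes = [(2, 10, 1), (3, 10, -1), (2, 11, 1), (1, 10, 0)]
rows = merge_row_order([], votes)   # encounter order, not sorted order
kept = drop_moderated(rows, [3])
print(rows, kept)
```

Neither helper sorts, so an incremental update can only append rows and never reshuffles the ones already present.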

Column (comment ID) ordering remains `natsorted` — column permutation doesn't
affect PCA eigenvalues/vectors, only reorders component loadings.
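
One way to see why column order is harmless, sketched under the assumption of a dense matrix with no missing votes: participant projections from PCA are determined (up to sign) by the inner products between column-centered rows, and those inner products do not change when columns are permuted. The vote matrix and helper names below are made up for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_columns(mat: Matrix) -> Matrix:
    """Subtract each column's mean, as PCA does before decomposition."""
    n_rows = len(mat)
    means = [sum(col) / n_rows for col in zip(*mat)]
    return [[v - m for v, m in zip(row, means)] for row in mat]


def row_gram(mat: Matrix) -> Matrix:
    """Inner product between every pair of centered participant rows."""
    centered = center_columns(mat)
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]


# Hypothetical 4 participants x 3 comments vote matrix.
votes = [[1.0, -1.0, 0.0],
         [1.0, 1.0, -1.0],
         [-1.0, 0.0, 1.0],
         [0.0, -1.0, 1.0]]

perm = [2, 0, 1]  # reorder the comment columns
permuted = [[row[j] for j in perm] for row in votes]

# Compare the two Gram matrices entry by entry.
same = all(math.isclose(x, y, abs_tol=1e-12)
           for r1, r2 in zip(row_gram(votes), row_gram(permuted))
           for x, y in zip(r1, r2))
print(same)
```

A row permutation, by contrast, leaves the projected points unchanged but changes the order in which they reach first-k-distinct, which is where the divergence came from.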

## Results after fix

| Dataset | CS blob | Clj k | Py k | Sizes match? |
|---------|---------|-------|------|--------------|
| vw | ✓ | 2 | **2** | [50,17] exact |
| biodiversity | ✓ | 2 | **2** | [81,19] exact |
| bg2018 | ✓ | 2 | **2** | close ([51,49] vs [52,48]) |
| FLI | ✓ | 2 | 3 | **still diverges** |
| engage | empty | — | — | — |
| bg2050 | empty | — | — | — |
| pakistan | empty | — | — | — |

### FLI: inherent PCA divergence (not fixable without replicating Clojure's PCA)

FLI has 94.5% NaN sparsity. The PCA components are nearly but not exactly
identical (|cos|≈0.9997 vs 1.000000 for vw). This produces a silhouette
landscape where k=2 and k=3 differ by only 0.001. The tiny PCA difference
tips the balance. Injection test confirms: with Clojure projections injected,
Python picks k=2. This is inherent to the PCA algorithm difference (sklearn
full SVD vs Clojure power iteration) and not fixable without replicating
Clojure's PCA exactly.

## Investigation Findings (for the record)

### PCA is NOT the primary cause for most datasets

- vw: PCA components have cosine similarity = 1.000000 (identical!)
- Projections are exactly negated (sign flip, irrelevant for clustering)
- Silhouette scores are identical for both projection sets

### Silhouette implementation matches

- Both use (b-a)/max(a,b) formula, unweighted mean
- Both compute on base-cluster centers (not raw participants)
- Clojure's `weighted-mean` without weights = unweighted mean
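
For reference, the formula in the list above can be written out as a short sketch. This is not the Python or Clojure implementation. The function name, the Euclidean distance, and the convention of scoring a singleton cluster as 0 are assumptions for illustration. The points passed in stand for base-cluster centers.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def silhouette_score(points: List[Point], labels: List[int]) -> float:
    """Unweighted mean of (b - a) / max(a, b) over all points."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster: Dict[int, List[float]] = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical base-cluster centers forming two well-separated groups.
centers = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(centers, [0, 0, 1, 1])
bad = silhouette_score(centers, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

The well-separated labelling scores close to 1 and the mixed labelling scores below 0, which is the signal k-selection relies on.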

### K-means initialization matches

- Both use first-k-distinct (Clojure: `init-clusters`, Python: `_get_first_k_distinct_centers`)
- Both sort base clusters by ID
- The only difference was the DATA ORDER feeding into first-k-distinct
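
The sensitivity to data order can be illustrated with a small sketch of first-k-distinct seeding. The function below is a hypothetical stand-in, not the actual `_get_first_k_distinct_centers` or Clojure `init-clusters`, and the centers are made-up values.

```python
from typing import List, Sequence, Tuple

Center = Tuple[float, ...]


def first_k_distinct(centers: Sequence[Center], k: int) -> List[Center]:
    """Return the first k distinct centers, scanning in input order."""
    seeds: List[Center] = []
    for c in centers:
        if len(seeds) == k:
            break
        if c not in seeds:
            seeds.append(c)
    return seeds


# The same hypothetical base-cluster centers, listed in two row orders.
pid_order = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.0), (9.0, 1.0)]
encounter_order = [(5.0, 5.0), (9.0, 1.0), (0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]

seeds_pid = first_k_distinct(pid_order, 2)
seeds_enc = first_k_distinct(encounter_order, 2)
print(seeds_pid, seeds_enc)
```

Both calls see the same set of centers, yet the seeds differ, so k-means can start from different points and settle in a different local optimum.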

## Files modified

- `delphi/polismath/conversation/conversation.py` — `update_votes()` and `_apply_moderation()`
- `delphi/tests/test_conversation.py` — updated ordering tests
- `delphi/tests/test_legacy_clojure_regression.py` — removed xfail on `test_group_clustering`

## Future investigation

- **FLI k divergence**: Could be resolved by implementing Clojure's power
iteration PCA. Low priority — the silhouette gap is 0.001.
- **Column ordering**: Currently natsorted, Clojure uses insertion order.
Doesn't affect clustering but could affect other comparisons.
- **Multiple k-means restarts**: Using k-means++ with n_init=10 selects k=4
for vw regardless of row ordering, which suggests k=4 is the better optimum.
This would be more robust than first-k-distinct but would NOT match Clojure.
45 changes: 20 additions & 25 deletions delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -22,6 +22,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
| PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |
| PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |
| PR 5 (D6) | #2449 | Stack 15/25 | Fix D6: two-proportion test pseudocounts |
| (K-inv) | #2453 | Stack 19/25 | Fix K-means k divergence: preserve vote-encounter row order |

Future fix PRs will be appended to the stack as they're created.

@@ -444,35 +445,29 @@ By this point, we should have good test coverage from all the per-discrepancy te

---

### Investigation: Cold-Start K Divergence (after D15, before D12)
### K-Divergence Fix: Participant Row Ordering — **DONE** (PR #2453)

**Prerequisite**: All cold-start-relevant upstream fixes complete: D2/D2c/D2b (in-conv,
vote counts, sort order), D15 (moderation handling). Note: D1 (PCA sign flips) only
affects incremental updates — on cold start there are no previous components to align to.
**Root cause found and fixed.** Python's `natsorted()` sorted rating matrix rows by
PID, while Clojure's NamedMatrix preserves vote-encounter order (insertion order via
`java.util.Vector`). Different row ordering cascades through base-cluster ID assignment
into group-level k-means first-k-distinct initialization, producing different local
optima and different silhouette landscapes.

After D15, the rating matrix construction, in-conv filtering, and PCA inputs should all
match Clojure. Both implementations use silhouette for k-selection. Yet on vw, Python
selects k=4 while Clojure selects k=2.
**Fix**: `update_votes()` and `_apply_moderation()` now preserve vote-encounter order
for participant rows instead of natsort. Column ordering remains natsorted (doesn't
affect PCA eigenvalues/vectors).

**Investigation steps**:
**Cold-start blob results**:
- vw: k=2 exact match (was k=4), sizes [50,17] exact
- biodiversity: k=2 exact match, sizes [81,19] exact
- bg2018: k=2 match, sizes close ([52,48] vs [51,49])
- FLI: k=3 vs k=2 — inherent PCA divergence (94.5% NaN sparsity, silhouette gap 0.001)

1. **PCA component comparison**: Feed the same rating matrix to both sklearn TruncatedSVD
and a Python reimplementation of Clojure's power iteration. Quantify divergence
(cosine similarity per component, Frobenius norm).
2. **Projection comparison**: Inject Clojure blob's PCA components into Python's
clustering path. Does k now match?
3. **Base-cluster comparison**: Given the same projections, compare k-means centroids
and member assignments. Check initialization (Clojure uses first-k-distinct centers
from base clusters — does Python match?).
4. **Silhouette score comparison**: Given the same base clusters, compare per-k
silhouette scores. Are the scores close but the winner differs?
5. **All datasets**: Run on all datasets with cold-start blobs, not just vw.
**FLI residual divergence**: Not fixable without replicating Clojure's power iteration
PCA. The silhouette landscape is essentially flat between k=2 and k=3, and any tiny PCA
difference tips the balance. Low priority.

**Outcome**: Either (a) identify a fixable discrepancy that makes k match, or
(b) document the inherent numerical divergence between sklearn SVD and Clojure
power iteration, and establish tolerance bounds for k agreement in tests.

See `delphi/docs/HANDOFF_K_DIVERGENCE_INVESTIGATION.md` for detailed context.
See `delphi/docs/INVESTIGATION_K_DIVERGENCE.md` for the full investigation.

---

@@ -513,7 +508,7 @@ See `delphi/docs/HANDOFF_K_DIVERGENCE_INVESTIGATION.md` for detailed context.
| D13 | Subgroup clustering | — | — | **Deferred** (unused) |
| D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) |
| D15 | Moderation handling | PR 12 | — | **DONE** ✓ |
| K-inv | Cold-start k divergence | (investigation) | | Branch off D15 (D2+D15 done, clustering independent of repness) |
| K-inv | Cold-start k divergence (row ordering) | (after D15) | **#2453** | **DONE** ✓ (FLI residual: inherent PCA divergence) |
| Replay | Replay infrastructure (A/B/C) | — | — | NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. |

### Non-discrepancy PRs in the stack
36 changes: 29 additions & 7 deletions delphi/polismath/conversation/conversation.py
@@ -220,15 +220,35 @@ def update_votes(self,
# Step 4: Get new rows and columns by set difference
logger.info(f"[{time.time() - start_time:.2f}s] Identifying new rows and columns...")

existing_rows = set(existing_rows)
existing_rows_set = set(existing_rows)
existing_cols = set(existing_cols)

new_rows = set(updates_df['row']) - existing_rows
new_rows = set(updates_df['row']) - existing_rows_set
new_cols = set(updates_df['col']) - existing_cols

# Natural sort: preserves types and sorts numerically when possible
# Numbers are sorted numerically, alphanumeric strings use natural order (e.g., p1, p2, p10)
all_rows = natsorted(existing_rows.union(new_rows))
# Row order: preserve first-appearance order from votes.
#
# Clojure builds the rating matrix incrementally — each new participant
# gets a row appended in the order they first appear in the vote stream
# (conversation.clj, named_matrix.clj: NamedMatrix preserves insertion
# order via IndexHash backed by java.util.Vector). The base-cluster IDs
# are assigned by map-indexed on this row order, so the order directly
# determines group-level k-means initialization via first-k-distinct.
#
# Using natsort (PID-numeric order) instead would change the k-means
# seed points and produce different silhouette scores / different k.
# See delphi/docs/INVESTIGATION_K_DIVERGENCE.md for the full
# analysis showing this is the root cause of k divergence on vw.
new_rows_ordered = []
for pid, _, _ in vote_updates:
    if pid in new_rows and pid not in existing_rows_set:
        existing_rows_set.add(pid)
        new_rows_ordered.append(pid)
all_rows = list(existing_rows) + new_rows_ordered

# Column order: natsort is fine — column permutation doesn't affect PCA
# eigenvalues/vectors (only reorders the component loadings), so it has
# no effect on clustering k.
all_cols = natsorted(existing_cols.union(new_cols))

logger.info(f"[{time.time() - start_time:.2f}s] Found {len(new_rows)} new rows and {len(new_cols)} new columns")
@@ -304,8 +324,10 @@ def _apply_moderation(self) -> None:
matrix structure so that tids, column indices, and dimensions
match between Python and Clojure.
"""
# Filter out moderated participants (remove rows)
keep_ptpts = natsorted(list(set(self.raw_rating_mat.index) - set(self.mod_out_ptpts)))
# Filter out moderated participants (remove rows).
# Preserve raw_rating_mat row order (vote encounter order) — see
# update_votes() comment on why row order matters for Clojure parity.
keep_ptpts = [p for p in self.raw_rating_mat.index if p not in self.mod_out_ptpts]
self.rating_mat = self.raw_rating_mat.loc[keep_ptpts].copy()

# Zero out moderated-out comments (keep columns, set values to 0)