compdemocracy · jucor · Mar 30, 2026
diff --git a/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md b/delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -620,7 +620,7 @@ Every fix PR must now include blob comparison tests.
 Both use silhouette. The divergence comes from upstream PCA/clustering differences
 (sklearn SVD vs Clojure power iteration). This is independent of all repness fixes
 (D4-D11). Investigation planned off D15 branch. See
-`delphi/docs/HANDOFF_K_DIVERGENCE_INVESTIGATION.md`.
+`delphi/docs/INVESTIGATION_K_DIVERGENCE.md`.
 
 **Key discovery: `n-trials` in Clojure blob = `S` (total seen, including passes),**
 not `A+D` (agrees + disagrees). Verified: `prop_test(11, 14)` = blob `p-test` for
@@ -661,6 +661,75 @@ adding vectorized blob tests at each stage.
 
 ---
 
+## K-Divergence Investigation & Fix (2026-03-17/18)
+
+### Branch: `jc/clj-parity-kmeans-k-divergence` (PR #2453, Stack 19/25)
+
+### Investigation
+
+Wrote `scripts/investigate_k_divergence.py` to isolate the source of divergence
+on vw (Python k=4, Clojure k=2). Systematic elimination:
+
+1. **PCA components**: identical (cosine similarity = 1.000000) — ruled out
+2. **Silhouette implementation**: identical scores for both projection sets — ruled out
+3. **K-means initialization**: both use first-k-distinct — ruled out
+4. **Clojure blob injection**: injecting Clojure projections into Python clustering
+   still gave k=4 — so it's not about projection values
+5. **Participant ordering**: **ROOT CAUSE FOUND** — Python sorted rows by PID via
+   `natsorted()`, Clojure preserves vote-encounter order (NamedMatrix insertion order).
+   Different row ordering → different first-k-distinct seeds → different local optima.
+
+Verified Clojure ordering chain by reading `conversation.clj`, `named_matrix.clj`,
+`clusters.clj`: `filter-by-index` preserves original matrix row order, not
+set iteration order. The CSV first-appearance order `[2, 3, 4, 6, 8, ...]` matches
+the Clojure blob's base-cluster PID order exactly.
+
+### Fix
+
+- `conversation.py update_votes()`: replaced `natsorted(existing_rows.union(new_rows))`
+  with first-appearance order tracking from `vote_updates`
+- `conversation.py _apply_moderation()`: replaced `natsorted()` with order-preserving
+  list comprehension
+- Column ordering remains natsorted (doesn't affect clustering)
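This is a minimal standalone sketch of the row-ordering rule. It assumes `vote_updates` yields `(pid, tid, vote)` triples, which is what the `for pid, _, _ in vote_updates` loop in the fix implies. `order_rows_by_first_appearance` is a hypothetical helper name and does not exist in `conversation.py`.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple


def order_rows_by_first_appearance(
    existing_rows: Sequence[Hashable],
    vote_updates: Iterable[Tuple[Hashable, Hashable, float]],
) -> List[Hashable]:
    """Keep existing rows in place and append unseen PIDs in vote-stream order."""
    seen = set(existing_rows)          # O(1) membership checks
    ordered = list(existing_rows)      # existing rows keep their positions
    for pid, _tid, _vote in vote_updates:
        if pid not in seen:            # first vote from this participant
            seen.add(pid)
            ordered.append(pid)
    return ordered


# Participant 2 votes first, so it gets the first row even though PID 1 sorts lower.
votes = [(2, "c1", 1), (3, "c1", -1), (2, "c2", 1), (1, "c1", 0)]
rows = order_rows_by_first_appearance([], votes)
```

Only newly seen PIDs are appended, so rows that already exist never move when a later batch of votes arrives.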
+
+### Cold-start blob results
+
+| Dataset | Clj k | Py k (before) | Py k (after) | Sizes match? |
+|---------|-------|---------------|--------------|--------------|
+| vw | 2 | 4 | **2** | [50,17] exact |
+| biodiversity | 2 | 2 | **2** | [81,19] exact |
+| bg2018 | 2 | 2 | **2** | close ([52,48] vs [51,49]) |
+| FLI | 2 | 3 | 3 | inherent PCA divergence |
+
+FLI: 94.5% NaN sparsity, PCA |cos|≈0.9997 (not 1.0), silhouette gap 0.001. Not
+fixable without replicating Clojure's power iteration PCA. Low priority.
+
+### Test results
+
+- 297 passed, 0 failed, 6 skipped, 58 xfailed
+- Removed `test_group_clustering` xfail (now passes on cold-start blobs)
+- Added xfail for incremental blobs (their in-conv set differs from single-shot)
+- Updated 6 ordering tests (expect encounter order, not natsort)
+- Re-recorded vw cold-start blob and golden snapshots for vw + biodiversity
+
+### Session 12 (2026-03-17/18)
+
+- Created branch off D15, investigated k divergence across all 7 datasets
+- Re-recorded vw cold-start blob (confirmed k=2 is genuine, not generation artifact)
+- Found root cause: `natsorted()` on participant rows
+- Fixed `update_votes()` and `_apply_moderation()` to preserve encounter order
+- Rebased branch onto the new D15 (another session had rebased the stack)
+- Inserted into stack at position 19/25, rebased D10→PR15 with `--onto`
+- Created PR #2453
+
+### What's Next
+
+1. Refactor D10-D1 branches (tests, code cleanup) before creating PRs for them.
+2. Re-record private dataset golden snapshots.
+3. FLI k divergence: accept or investigate Clojure power iteration PCA (low priority).
+
+---
+
 ## TDD Discipline
 
 **CRITICAL: For every fix, ALWAYS follow this order:**

diff --git a/delphi/docs/INVESTIGATION_K_DIVERGENCE.md b/delphi/docs/INVESTIGATION_K_DIVERGENCE.md
@@ -0,0 +1,118 @@
+# K-Divergence Investigation — RESOLVED
+
+## Problem (was)
+
+After all cold-start-relevant formula fixes (D2-D15), Python and Clojure
+selected different k values on cold-start blobs. On vw: Python=4, Clojure=2.
+
+## Root Cause: Participant Row Ordering
+
+The k divergence was caused by **different participant ordering in the rating
+matrix**, which cascades through base-cluster IDs into group-level k-means
+initialization via first-k-distinct.
+
+### The chain
+
+```
+rating_mat row order
+  → PCA projection order
+    → base-cluster ID assignment (map-indexed on input rows)
+      → group-level k-means first-k-distinct init (first k base-cluster centers)
+        → different local optima → different silhouette scores → different k
+```
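The seed-selection step at the end of this chain can be sketched as follows. This is a minimal Python illustration of first-k-distinct seeding as described in this document: take the first k distinct points in input order. It assumes base-cluster centers are plain coordinate tuples. The sketch is not the code of Clojure's `init-clusters` or Python's `_get_first_k_distinct_centers`, and `first_k_distinct` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def first_k_distinct(points: Sequence[Point], k: int) -> List[Point]:
    """Return the first k distinct points in input order (fewer if not enough exist)."""
    centers: List[Point] = []
    for p in points:
        if p not in centers:           # exact equality defines "distinct"
            centers.append(p)
            if len(centers) == k:
                break
    return centers


# The same set of base-cluster centers, listed in two different row orders.
encounter_order = [(0.0, 1.0), (0.0, 1.0), (2.0, 0.0), (-1.0, -1.0)]
pid_order = [(-1.0, -1.0), (2.0, 0.0), (0.0, 1.0), (0.0, 1.0)]

seeds_encounter = first_k_distinct(encounter_order, 2)  # [(0.0, 1.0), (2.0, 0.0)]
seeds_pid = first_k_distinct(pid_order, 2)              # [(-1.0, -1.0), (2.0, 0.0)]
```

Both lists contain the same centers, but k-means starts from different seeds. This is how a pure reordering of rows can lead to a different local optimum.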
+
+### Clojure ordering
+
+Clojure's NamedMatrix preserves **insertion order** (backed by
+`java.util.Vector`). When `rowname-subset` filters to `in-conv` participants,
+`filter-by-index` (utils.clj:128-133) preserves the **original matrix row
+order** (iterates source, checks membership in filter set). So the base-cluster
+ordering is the vote-encounter order of participants in the rating matrix.
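Here is a Python sketch of the `filter-by-index` semantics described above: iterate the source rows and keep those that belong to the filter set. It is an illustration of the behaviour, not a transcription of `utils.clj`, and the function name is hypothetical.

```python
from typing import Hashable, Iterable, List, Sequence


def filter_by_index(rows: Sequence[Hashable], keep: Iterable[Hashable]) -> List[Hashable]:
    """Keep rows whose name is in `keep`, in the order they appear in `rows`."""
    keep_set = set(keep)  # set iteration order is never used
    return [r for r in rows if r in keep_set]


matrix_rows = [2, 3, 4, 6, 8, 1]        # vote-encounter order
in_conv = {1, 4, 8, 2}
kept = filter_by_index(matrix_rows, in_conv)
```

The result follows `matrix_rows`, not PID order. This is the property the `_apply_moderation()` list comprehension relies on.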
+
+### Python ordering (before fix)
+
+Python used `natsorted()` (conversation.py:232) to sort rating matrix rows
+by PID. This gave ascending PID order `[1, 2, 3, 4, 5, ...]` instead of
+the vote-encounter order `[2, 3, 4, 6, 8, ...]` that Clojure produces.
+
+### Impact
+
+With first-k-distinct initialization, different ordering → different initial
+centers → different k-means local optima → different silhouette landscape:
+
+| k | Python (PID order) | Clojure (encounter order) |
+|---|-------------------|--------------------------|
+| 2 | sil=0.457         | **sil=0.487 (wins)**     |
+| 3 | sil=0.481         | sil=0.329                |
+| 4 | **sil=0.508 (wins)** | sil=0.362             |
+
+## Fix
+
+Changed `update_votes()` and `_apply_moderation()` to preserve vote-encounter
+order for participant rows instead of natsort:
+
+1. `update_votes()`: track first-appearance order from `vote_updates`, append
+   new PIDs in encounter order (not `natsorted`)
+2. `_apply_moderation()`: filter `raw_rating_mat.index` preserving order
+   (list comprehension instead of `natsorted`)
+
+Column (comment ID) ordering remains `natsorted`: permuting columns leaves the
+PCA eigenvalues and participant projections unchanged and only reorders the
+entries of each component's loadings.
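The following pure-Python check illustrates that claim for a complete (already imputed) matrix. PCA scores come from the SVD X = UΣVᵀ of the column-centered matrix, and U and Σ are determined, up to sign, by X Xᵀ. For a permutation matrix P, (XP)(XP)ᵀ = X P Pᵀ Xᵀ = X Xᵀ, so the participant scores do not change, while the loadings become PᵀV. The helper names are hypothetical, and the sketch ignores NaN handling.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def center_columns(x: Sequence[Sequence[float]]) -> Matrix:
    """Subtract each column's mean, as PCA does before the SVD."""
    n = len(x)
    means = [sum(row[j] for row in x) / n for j in range(len(x[0]))]
    return [[v - means[j] for j, v in enumerate(row)] for row in x]


def gram(x: Sequence[Sequence[float]]) -> Matrix:
    """Row-by-row inner products X Xᵀ; the PCA scores are determined by this matrix."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]


def permute_columns(x: Sequence[Sequence[float]], perm: Sequence[int]) -> Matrix:
    """Reorder columns: new column i is old column perm[i]."""
    return [[row[j] for j in perm] for row in x]


X = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [-2.0, -2.0, 0.0]]
same_gram = gram(center_columns(X)) == gram(center_columns(permute_columns(X, [2, 0, 1])))
```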
+
+## Results after fix
+
+| Dataset | CS blob | Clj k | Py k | Sizes match? |
+|---------|---------|-------|------|--------------|
+| vw | ✓ | 2 | **2** | [50,17] exact |
+| biodiversity | ✓ | 2 | **2** | [81,19] exact |
+| bg2018 | ✓ | 2 | **2** | close ([51,49] vs [52,48]) |
+| FLI | ✓ | 2 | 3 | **still diverges** |
+| engage | empty | — | — | — |
+| bg2050 | empty | — | — | — |
+| pakistan | empty | — | — | — |
+
+### FLI: inherent PCA divergence (not fixable)
+
+FLI has 94.5% NaN sparsity. The PCA components are nearly but not exactly
+identical (|cos|≈0.9997 vs 1.000000 for vw). This produces a silhouette
+landscape where k=2 and k=3 differ by only 0.001. The tiny PCA difference
+tips the balance. Injection test confirms: with Clojure projections injected,
+Python picks k=2. This is inherent to the PCA algorithm difference (sklearn
+full SVD vs Clojure power iteration) and not fixable without replicating
+Clojure's PCA exactly.
+
+## Investigation Findings (for the record)
+
+### PCA is NOT the primary cause for most datasets
+
+- vw: PCA components have cosine similarity = 1.000000 (identical!)
+- Projections are exactly negated (sign flip, irrelevant for clustering)
+- Silhouette scores are identical for both projection sets
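SVD components are defined only up to sign, so comparisons like the one above are done on |cos|, with a per-component sign alignment before projections are compared. The sketch below shows that comparison. It is not taken from `scripts/investigate_k_divergence.py`, and both function names are hypothetical.

```python
import math
from typing import List, Sequence


def abs_cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """|cos| between two component vectors; 1.0 means identical up to sign."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return abs(dot) / norm


def align_sign(reference: Sequence[float], candidate: Sequence[float]) -> List[float]:
    """Flip `candidate` if it points away from `reference`."""
    dot = sum(a * b for a, b in zip(reference, candidate))
    return list(candidate) if dot >= 0 else [-c for c in candidate]


py_component = [1.0, 2.0, 2.0]
clj_component = [-1.0, -2.0, -2.0]   # same axis, opposite sign
similarity = abs_cosine(py_component, clj_component)
```

A global sign flip maps every projected point p to -p, so pairwise distances, base clusters, and silhouette scores are unchanged.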
+
+### Silhouette implementation matches
+
+- Both use (b-a)/max(a,b) formula, unweighted mean
+- Both compute on base-cluster centers (not raw participants)
+- Clojure's `weighted-mean` without weights = unweighted mean
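A minimal sketch of the silhouette formula described above is shown below: per-point (b - a) / max(a, b), averaged without weights, followed by picking the k with the highest score. Two assumptions apply. First, singleton clusters score 0, following scikit-learn's convention; this document does not say how Clojure handles them. Second, at least two clusters are present. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, ...]


def silhouette(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Unweighted mean over points of (b - a) / max(a, b); needs >= 2 clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = labels[i]
        # a: mean distance to the other members of the point's own cluster
        same = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == own and j != i]
        if not same:
            scores.append(0.0)  # assumed singleton convention (as in scikit-learn)
            continue
        a = sum(same) / len(same)
        # b: lowest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters
            if c != own
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(scores_by_k: Dict[int, float]) -> int:
    """Pick the k with the highest silhouette score."""
    return max(scores_by_k, key=scores_by_k.get)


# Scores from the vw impact table: the two row orders pick different winners.
python_order = {2: 0.457, 3: 0.481, 4: 0.508}
clojure_order = {2: 0.487, 3: 0.329, 4: 0.362}
```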
+
+### K-means initialization matches
+
+- Both use first-k-distinct (Clojure: `init-clusters`, Python: `_get_first_k_distinct_centers`)
+- Both sort base clusters by ID
+- The only difference was the DATA ORDER feeding into first-k-distinct
+
+## Files modified
+
+- `delphi/polismath/conversation/conversation.py` — `update_votes()` and `_apply_moderation()`
+- `delphi/tests/test_conversation.py` — updated ordering tests
+- `delphi/tests/test_legacy_clojure_regression.py` — removed xfail on `test_group_clustering`
+
+## Future investigation
+
+- **FLI k divergence**: Could be resolved by implementing Clojure's power
+  iteration PCA. Low priority — the silhouette gap is 0.001.
+- **Column ordering**: Currently natsorted, Clojure uses insertion order.
+  Doesn't affect clustering but could affect other comparisons.
+- **Multiple k-means restarts**: With k-means++ and n_init=10, k-means settles on
+  the same solution regardless of row ordering (k=4 for vw). This would be more
+  robust than first-k-distinct but would NOT match Clojure.
diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -22,6 +22,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
 | PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |
 | PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |
 | PR 5 (D6) | #2449 | Stack 15/25 | Fix D6: two-proportion test pseudocounts |
+| (K-inv) | #2453 | Stack 19/25 | Fix K-means k divergence: preserve vote-encounter row order |
 
 Future fix PRs will be appended to the stack as they're created.
 
@@ -444,35 +445,29 @@ By this point, we should have good test coverage from all the per-discrepancy te
 
 ---
 
-### Investigation: Cold-Start K Divergence (after D15, before D12)
+### K-Divergence Fix: Participant Row Ordering — **DONE** (PR #2453)
 
-**Prerequisite**: All cold-start-relevant upstream fixes complete: D2/D2c/D2b (in-conv,
-vote counts, sort order), D15 (moderation handling). Note: D1 (PCA sign flips) only
-affects incremental updates — on cold start there are no previous components to align to.
+**Root cause found and fixed.** Python's `natsorted()` sorted rating matrix rows by
+PID, while Clojure's NamedMatrix preserves vote-encounter order (insertion order via
+`java.util.Vector`). Different row ordering cascades through base-cluster ID assignment
+into group-level k-means first-k-distinct initialization, producing different local
+optima and different silhouette landscapes.
 
-After D15, the rating matrix construction, in-conv filtering, and PCA inputs should all
-match Clojure. Both implementations use silhouette for k-selection. Yet on vw, Python
-selects k=4 while Clojure selects k=2.
+**Fix**: `update_votes()` and `_apply_moderation()` now preserve vote-encounter order
+for participant rows instead of natsort. Column ordering remains natsorted (it doesn't
+change PCA eigenvalues or participant projections, only the order of the loadings).
 
-**Investigation steps**:
+**Cold-start blob results**:
+- vw: k=2 exact match (was k=4), sizes [50,17] exact
+- biodiversity: k=2 exact match, sizes [81,19] exact
+- bg2018: k=2 match, sizes close ([52,48] vs [51,49])
+- FLI: k=3 vs k=2 — inherent PCA divergence (94.5% NaN sparsity, silhouette gap 0.001)
 
-1. **PCA component comparison**: Feed the same rating matrix to both sklearn TruncatedSVD
-   and a Python reimplementation of Clojure's power iteration. Quantify divergence
-   (cosine similarity per component, Frobenius norm).
-2. **Projection comparison**: Inject Clojure blob's PCA components into Python's
-   clustering path. Does k now match?
-3. **Base-cluster comparison**: Given the same projections, compare k-means centroids
-   and member assignments. Check initialization (Clojure uses first-k-distinct centers
-   from base clusters — does Python match?).
-4. **Silhouette score comparison**: Given the same base clusters, compare per-k
-   silhouette scores. Are the scores close but the winner differs?
-5. **All datasets**: Run on all datasets with cold-start blobs, not just vw.
+**FLI residual divergence**: Not fixable without replicating Clojure's power iteration
+PCA. The silhouette landscape is essentially flat between k=2 and k=3, and any tiny PCA
+difference tips the balance. Low priority.
 
-**Outcome**: Either (a) identify a fixable discrepancy that makes k match, or
-(b) document the inherent numerical divergence between sklearn SVD and Clojure
-power iteration, and establish tolerance bounds for k agreement in tests.
-
-See `delphi/docs/HANDOFF_K_DIVERGENCE_INVESTIGATION.md` for detailed context.
+See `delphi/docs/INVESTIGATION_K_DIVERGENCE.md` for the full investigation.
 
 ---
 
@@ -513,7 +508,7 @@ See `delphi/docs/HANDOFF_K_DIVERGENCE_INVESTIGATION.md` for detailed context.
 | D13 | Subgroup clustering | — | — | **Deferred** (unused) |
 | D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) |
 | D15 | Moderation handling | PR 12 | — | **DONE** ✓ |
-| K-inv | Cold-start k divergence | (investigation) | — | Branch off D15 (D2+D15 done, clustering independent of repness) |
+| K-inv | Cold-start k divergence (row ordering) | (after D15) | **#2453** | **DONE** ✓ (FLI residual: inherent PCA divergence) |
 | Replay | Replay infrastructure (A/B/C) | — | — | NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. |
 
 ### Non-discrepancy PRs in the stack

diff --git a/delphi/polismath/conversation/conversation.py b/delphi/polismath/conversation/conversation.py
@@ -220,15 +220,35 @@ def update_votes(self,
         # Step 4: Get new rows and columns by set difference
         logger.info(f"[{time.time() - start_time:.2f}s] Identifying new rows and columns...")
 
-        existing_rows = set(existing_rows)
+        existing_rows_set = set(existing_rows)
         existing_cols = set(existing_cols)
 
-        new_rows = set(updates_df['row']) - existing_rows
+        new_rows = set(updates_df['row']) - existing_rows_set
         new_cols = set(updates_df['col']) - existing_cols
 
-        # Natural sort: preserves types and sorts numerically when possible
-        # Numbers are sorted numerically, alphanumeric strings use natural order (e.g., p1, p2, p10)
-        all_rows = natsorted(existing_rows.union(new_rows))
+        # Row order: preserve first-appearance order from votes.
+        #
+        # Clojure builds the rating matrix incrementally — each new participant
+        # gets a row appended in the order they first appear in the vote stream
+        # (conversation.clj, named_matrix.clj: NamedMatrix preserves insertion
+        # order via IndexHash backed by java.util.Vector). The base-cluster IDs
+        # are assigned by map-indexed on this row order, so the order directly
+        # determines group-level k-means initialization via first-k-distinct.
+        #
+        # Using natsort (PID-numeric order) instead would change the k-means
+        # seed points and produce different silhouette scores / different k.
+        # See delphi/docs/INVESTIGATION_K_DIVERGENCE.md for the full
+        # analysis showing this is the root cause of k divergence on vw.
+        new_rows_ordered = []
+        for pid, _, _ in vote_updates:
+            if pid in new_rows and pid not in existing_rows_set:
+                existing_rows_set.add(pid)
+                new_rows_ordered.append(pid)
+        all_rows = list(existing_rows) + new_rows_ordered
+
+        # Column order: natsort is fine. Permuting columns leaves the PCA
+        # eigenvalues and participant projections unchanged (it only reorders
+        # the entries of each component's loadings), so it has no effect on
+        # clustering k.
         all_cols = natsorted(existing_cols.union(new_cols))
 
         logger.info(f"[{time.time() - start_time:.2f}s] Found {len(new_rows)} new rows and {len(new_cols)} new columns")
@@ -304,8 +324,10 @@ def _apply_moderation(self) -> None:
           matrix structure so that tids, column indices, and dimensions
           match between Python and Clojure.
         """
-        # Filter out moderated participants (remove rows)
-        keep_ptpts = natsorted(list(set(self.raw_rating_mat.index) - set(self.mod_out_ptpts)))
+        # Filter out moderated participants (remove rows).
+        # Preserve raw_rating_mat row order (vote encounter order) — see
+        # update_votes() comment on why row order matters for Clojure parity.
+        mod_out = set(self.mod_out_ptpts)  # O(1) membership, as in the old set-based code
+        keep_ptpts = [p for p in self.raw_rating_mat.index if p not in mod_out]
         self.rating_mat = self.raw_rating_mat.loc[keep_ptpts].copy()
 
         # Zero out moderated-out comments (keep columns, set values to 0)