[Stack 17/17] Fix K-means k divergence: preserve vote-encounter row order#2524

Open
jucor wants to merge 1 commit into spr/edge/c3450b9a from spr/edge/4598a0a1
@jucor (Collaborator) commented Mar 30, 2026

Stack:


⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

@jucor changed the title from "Fix K-means k divergence: preserve vote-encounter row order" to "[Stack 17/17] Fix K-means k divergence: preserve vote-encounter row order" (Mar 30, 2026)
@jucor force-pushed the spr/edge/4598a0a1 branch from a9f2a48 to 5808e43 (March 30, 2026 22:39)
@jucor force-pushed the spr/edge/c3450b9a branch from 775510c to ae240ea (March 30, 2026 22:47)
@jucor force-pushed the spr/edge/4598a0a1 branch from 5808e43 to 1e0564f (March 30, 2026 22:47)

## Summary


- Fix K-means k divergence between Python and Clojure by preserving vote-encounter order for participant rows in the rating matrix
- Python was using `natsorted()` (PID-numeric order) while Clojure's NamedMatrix preserves insertion order. The different row ordering cascades into different first-k-distinct initialization seeds for group-level k-means
- On the vw dataset, Python previously picked k=4 (wrong) while Clojure picks k=2. Both now pick k=2 with identical cluster memberships
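
To make the ordering difference concrete, here is a minimal sketch. It assumes participant IDs are plain integers and uses the built-in `sorted` as a stand-in for `natsorted()` (the two agree on integers). The vote tuples are made up for illustration and are not taken from any dataset in this PR.

```python
# Hypothetical vote stream: (participant_id, comment_id, vote) in arrival order.
votes = [(7, 0, 1), (2, 0, -1), (7, 1, 1), (10, 0, 1), (2, 1, 0)]

# PID-numeric order: what a natural sort yields for integer participant IDs.
sorted_rows = sorted({pid for pid, _, _ in votes})

# Encounter order: each participant appears at the position of their first vote.
# dict.fromkeys de-duplicates while keeping insertion order (Python 3.7+).
encounter_rows = list(dict.fromkeys(pid for pid, _, _ in votes))

print(sorted_rows)     # [2, 7, 10]
print(encounter_rows)  # [7, 2, 10]
```

The same participants end up in different row positions, which is all it takes for an order-sensitive step downstream to see different input.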

## Investigation findings

The divergence chain: rating_mat row order → PCA projection order → base-cluster ID assignment → group k-means first-k-distinct init → different local optima → different silhouette landscape → different k.
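
The order sensitivity at the "first-k-distinct init" step can be sketched as follows. This is an illustrative reading of what "first-k-distinct" means (take the first k distinct points in row order as initial centers), not the project's actual implementation. The function name, the row labels, and the 2-D points are all hypothetical.

```python
def first_k_distinct(points, k):
    """Return the first k distinct points, scanning in row order."""
    seeds = []
    for p in points:
        if p not in seeds:
            seeds.append(p)
        if len(seeds) == k:
            break
    return seeds

# Hypothetical 2-D projections: two tight pairs, far apart from each other.
proj = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (5.0, 5.0), "d": (5.1, 5.0)}

order_1 = ["a", "b", "c", "d"]  # one row ordering
order_2 = ["a", "c", "b", "d"]  # same rows, different ordering

seeds_1 = first_k_distinct([proj[r] for r in order_1], k=2)
seeds_2 = first_k_distinct([proj[r] for r in order_2], k=2)

print(seeds_1)  # [(0.0, 0.0), (0.1, 0.0)]  both seeds in the same pair
print(seeds_2)  # [(0.0, 0.0), (5.0, 5.0)]  one seed per pair
```

With identical data, the two orderings start k-means from different centers, so the iterations can settle in different local optima.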

PCA components are identical (cosine similarity = 1.0), the silhouette implementation matches, and the k-means algorithm matches. Only the order of the data fed to first-k-distinct differed.
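
For reference, a cosine-similarity check between two component vectors could look like the sketch below. The helper and the example vectors are hypothetical and are not the comparison code used in the investigation. A value of 1.0 means the two vectors point in the same direction regardless of scale.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up components: same direction, different scale.
py_component = [0.6, 0.8, 0.0]
clj_component = [1.2, 1.6, 0.0]

sim = cosine_similarity(py_component, clj_component)
print(math.isclose(sim, 1.0))  # True
```

PCA component signs are arbitrary in general, so a comparison between implementations may need to check the absolute value of the similarity.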

## Changes

- `conversation.py`: `update_votes()` preserves vote-encounter order for participant rows instead of `natsorted()`
- `conversation.py`: `_apply_moderation()` preserves row order with list comprehension
- Column (comment ID) ordering remains `natsorted`; it does not affect clustering
- Re-recorded vw cold-start blob and golden snapshots
- Updated ordering tests, removed `test_group_clustering` xfail
- Added `scripts/investigate_k_divergence.py` diagnostic tool (removed again in a later squashed commit as a one-off diagnostic)
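
As a sketch of the "preserves row order with list comprehension" item above: filtering with a list comprehension keeps the surviving rows in their original relative order, whereas a set-based filter followed by a sort does not. The variable names and IDs here are hypothetical and do not come from `_apply_moderation()` itself.

```python
rows = [7, 2, 10, 3]   # hypothetical encounter-ordered participant IDs
moderated_out = {2}    # hypothetical IDs to drop

# Order-preserving filter: survivors keep their relative positions.
kept = [pid for pid in rows if pid not in moderated_out]

# Set difference plus sort: same members, but encounter order is lost.
resorted = sorted(set(rows) - moderated_out)

print(kept)      # [7, 10, 3]
print(resorted)  # [3, 7, 10]
```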

## Cold-start blob results

| Dataset | Clojure k | Python k | Match |
|---------|-----------|----------|-------|
| vw | 2 | 2 | exact (group sizes [50,17]) |
| biodiversity | 2 | 2 | exact (group sizes [81,19]) |
| bg2018 | 2 | 2 | close (group sizes [51,49] vs [52,48]) |
| FLI | 2 | 3 | inherent PCA divergence (94.5% NaN, silhouette gap 0.001) |

## Test plan

- [x] All 297 tests pass (0 failures, 58 xfailed)
- [x] vw cold-start: k=2 exact match with Clojure blob
- [x] biodiversity cold-start: k=2 exact match
- [x] Ordering tests updated to expect encounter order
- [ ] Re-record private dataset golden snapshots after stack rebase

🤖 Generated with [Claude Code](https://claude.com/claude-code)


## Squashed commits

- Fix K-means k divergence: preserve vote-encounter order for participant rows
- Update plan and journal: K-divergence investigation resolved
- Remove investigation script (one-off diagnostic, not production code)
- Rename k-divergence doc: investigation record, not a handoff
- Update references to renamed investigation doc

commit-id:4598a0a1

@jucor force-pushed the spr/edge/4598a0a1 branch from 1e0564f to fbfa1d1 (March 31, 2026 00:35)

@github-actions
Delphi Coverage Report

| File | Stmts | Miss | Cover |
|------|-------|------|-------|
| init.py | 2 | 0 | 100% |
| benchmarks/bench_pca.py | 76 | 76 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/init.py | 1 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| conversation/init.py | 2 | 0 | 100% |
| conversation/conversation.py | 1114 | 320 | 71% |
| conversation/manager.py | 131 | 42 | 68% |
| database/init.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 234 | 40% |
| database/postgres.py | 305 | 205 | 33% |
| pca_kmeans_rep/init.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 257 | 22 | 91% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 52 | 16 | 69% |
| pca_kmeans_rep/repness.py | 312 | 34 | 89% |
| regression/init.py | 4 | 0 | 100% |
| regression/clojure_comparer.py | 188 | 20 | 89% |
| regression/comparer.py | 887 | 720 | 19% |
| regression/datasets.py | 135 | 27 | 80% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 138 | 94 | 32% |
| run_math_pipeline.py | 260 | 114 | 56% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 53 | 53% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 785 | 785 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/init.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/init.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/init.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 108 | 108 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/init.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/init.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 584 | 518 | 11% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 62 | 41 | 34% |
| Total | 10792 | 7619 | 29% |
