Fix PCA NaN handling and remove Clojure alignment hack by jucor · Pull Request #2413 · compdemocracy/polis

jucor · 2026-03-05T13:04:34Z

Summary

Part 1 of the #2311 split (PCA overhaul). This is the foundational fix — subsequent PRs stack on this one.

Stack: This PR → #2414 (test infra) → #2415 (comparer) → #2416 (sklearn PCA) → #2417 (test cleanup)

Fix Python PCA to use column means for missing votes (instead of 0), matching the Clojure behavior
Add sign-flip-aware comparison in the regression comparer
Add tests for NaN handling, PCA regression, and Clojure comparison
Remove the "align with Clojure" scaling hack from PCA that was masking the real issue
Update golden records to reflect corrected PCA output

Background

Investigation revealed Python filled NaN with 0 while Clojure fills with column mean. The VW dataset had 16° and 52° angle differences in PC1/PC2. After fix, VW matches exactly and biodiversity is within 7° (expected for power iteration on 82% sparse data).

Reviewer note

The golden record JSON diffs (~47k lines) can be skipped — they are the expected output changes from the NaN handling fix. Focus review on:

`pca.py` (NaN handling fix)
`comparer.py` (sign-flip-aware comparison)
`test_pca_unit.py`, `test_legacy_clojure_regression.py` (new tests)

Test plan

226 passed, 2 skipped, 2 xfailed
PCA NaN handling unit tests pass
Clojure PCA comparison passes

🤖 Generated with Claude Code

commit 2024dde Author: Julien Cornebise <julien@cornebise.com> Date: Mon Nov 24 22:14:34 2025 +0000 Update test_powerit_pca_with_nans to expect ValueError The previous commit changed powerit_pca to raise ValueError when given NaN values, making NaN handling the caller responsibility. This test now verifies that behavior instead of testing graceful NaN handling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> commit 9839770 Author: Julien Cornebise <julien@cornebise.com> Date: Mon Nov 24 22:14:34 2025 +0000 Fix PCA NaN handling to use column mean (matching Clojure) The Python PCA was using 0 to fill missing votes, while Clojure uses column means. This caused significant differences in PCA components (16 deg and 52 deg angle differences on VW dataset). Changes: - pca_project_dataframe: Replace NaN with column means instead of 0 - powerit_pca: Add ValueError if NaN values reach it (caller must preprocess) - test_regression.py: Enable ignore_pca_sign_flip for golden comparisons New tests: - test_pca_unit.py: test_nan_handling_uses_column_mean verifies column mean is used, test_nan_handling_differs_from_zero_fill confirms we are not using 0 - test_legacy_clojure_regression.py: test_pca_components_match_clojure compares Python vs Clojure PCA components (correlation and angle) Results after fix: - VW: PC1/PC2 correlation 1.0/-1.0, angle 0 deg (perfect match) - Biodiversity: PC1/PC2 correlation 0.99/-1.0, angle 7/4 deg (minor numerical differences from power iteration on 82% sparse data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> commit 4b33297 Author: Julien Cornebise <julien@cornebise.com> Date: Mon Nov 24 22:14:34 2025 +0000 Test that sign flipping propagates If the PCA components flip sign, the projections must also flip sign. commit e307a52 Author: Julien Cornebise <julien@cornebise.com> Date: Fri Nov 21 09:36:22 2025 +0000 Flag an unneeded exception commit 91a1e9c Author: Julien Cornebise <julien@cornebise.com> Date: Wed Nov 19 15:49:34 2025 +0000 Report any scaling issues in PCA projections commit 395e636 Author: Julien Cornebise <julien@cornebise.com> Date: Wed Nov 19 15:25:46 2025 +0000 Turn off off the Clojure sign-flip for PCA Of course this breaks the comparison to the golden files, as those included the sign flip. So we're also giving the option in the regression test to ignore PCA sign-flips. That will be handy later when we work on improving the PCA implementation, as various PCA implementations have various sign conventions. commit 0dc1aa7 Author: Julien Cornebise <julien@cornebise.com> Date: Mon Nov 24 22:14:34 2025 +0000 Clean up unused variables and imports Address GitHub Copilot review comments: - Log superseded votes count in conversation.py instead of leaving unused - Remove unused p1_idx/p2_idx index lookups in corr.py - Remove unused all_passed variable in regression_comparer.py - Remove unused imports (numpy, Path, List, datetime, stats, pca/cluster functions) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

The original angle computation assumed unit-normalized vectors, which gave misleading results when comparing Python (exact PCA, unit vectors) with Clojure stochastic PCA for large conversations (non-unit vectors due to learning rate blending without re-normalization). Now outputs both the raw angle and the properly normalized angle, plus the norms of both vectors for diagnostics. The assertion now checks the normalized angle, which correctly identifies when vectors point in similar directions regardless of their magnitudes. Co-Authored-By: Claude <noreply@anthropic.com>

As discussed above, the alignment with clojure was an AI-hallucinated kludge to pass tests by customized scaling tailored to each dataset, hiding the problems with the Python PCA implementation (the handling of NaNs). Now that we have fixed those, we can remove that madness and update the golden records accordingly.

- Fix all-NaN column edge case in PCA: replace NaN col_means with 0.0 - Lower per-occurrence sign-flip log level to DEBUG (summary stays WARNING) - Fix dataset validation: explicit names always search local datasets Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jucor

Originally from @whilo (approved the previous version of this PR):

Overall: Approve. This is a solid fix. I traced the bug back to the original PCA port (c520b6af) — the Clojure code at conversation.clj:355-377 explicitly fills nil with per-column averages before PCA, and the Python port replaced that with np.nan_to_num(data, nan=0.0). The align_with_clojure hack (hardcoded sign flips and scaling by dataset size) was layered on top to mask the symptoms. Good to see both the root cause fixed and the hack removed.

Minor feedback:

comparer.py: _detect_scaling_factor uses float | None return type (Python 3.10+ syntax). The project pins >=3.12 in pyproject.toml so this works, but the rest of the codebase uses Optional[float] from typing — worth being consistent.
pca.py line ~508: When col_means contains NaN (all-NaN column), you silently fall back to 0.0. Consider logging a warning — an entirely unvoted column is unusual and could indicate a data pipeline issue upstream.
test_legacy_clojure_regression.py lines 333-337: The unnormalized angle computation is dead code (printed but never asserted on), and the # NOTE: This assumes both vectors are unit-normalized comment is misleading since the code right below it computes the normalized version that's actually used. Either remove it or label it clearly as diagnostic.
Sign convention suggestion for a future PR: Rather than using ignore_pca_sign_flip=True in regression tests (which loosens PCA sensitivity), consider a deterministic sign-fixing convention (e.g., force the largest-magnitude element of each component to be positive). This would make golden records fully reproducible without needing sign flip tolerance. Not a blocker for this PR.

jucor · 2026-03-05T14:36:58Z

@whilo Thank you very much for the in-depth review! I'm very excited to get this moving, thank you :)

Types. Thanks for catching that. Done.
all-NaN columns. Good call! In theory we should not be able to have them, they should be filtered by the way we create the matrices by processing the votes (in which case, why do I even have this check here, you'd ask -- and you'd be right to 😅 ), so a Warning is definitely in order ! Done.
Good catch. Rotting comments and dead code --> proof I'm using a LLM :) (although Opus 4.6 is so much better than Sonnet 3.7 !). Removing.
Ah, super interesting ! One thing we need to optimize for is stability (when warm-starting / online-ish computing), to avoid the axes directions changing from one day to the next and the user being super-confused. This has been a historically strong need for the users, and still is. So, in theory, I would think that having a deterministic sign-fixing convention, as opposed to the undetermined behaviour, would help. However, once we switch to incremental PCA from the warm-start (the aim of one of the ten next PRs in my todo-list), we will have stability for free thanks to the progressive update. So while a forced flip signed by enforcing a constraint (smart as it is) would solve the problem right now, I can see it causing some unwanted side-effects when moving to incremental PCA: imagine the case of two elements of opposite sign battling it for "biggest magnitude" by just a little, changing from one iteration to the next --> flip, whereas letting incremental PCA alone would have given it to us.
This being said, we could still use this hard-constraint route for only the golden-record tests. But again, in the weird case of numerical error between two almost-tied coordinates, I could foresee (unlikely as it is) a switch, that the current implementation, if I'm not mistaken, would be robust to.
But thanks for that suggestion!

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-05T15:12:33Z

Delphi Coverage Report

File	Stmts	Miss	Cover
init.py	3	0	100%
main.py	55	55	0%
benchmarks/bench_repness.py	81	81	0%
benchmarks/bench_update_votes.py	38	38	0%
benchmarks/benchmark_utils.py	34	34	0%
components/init.py	2	0	100%
components/config.py	165	133	19%
components/server.py	116	72	38%
conversation/init.py	2	0	100%
conversation/conversation.py	1036	353	66%
conversation/manager.py	131	42	68%
database/init.py	1	0	100%
database/dynamodb.py	387	234	40%
database/postgres.py	306	205	33%
pca_kmeans_rep/init.py	5	0	100%
pca_kmeans_rep/clusters.py	234	7	97%
pca_kmeans_rep/corr.py	98	17	83%
pca_kmeans_rep/pca.py	235	69	71%
pca_kmeans_rep/repness.py	361	44	88%
pca_kmeans_rep/stats.py	107	22	79%
poller.py	224	188	16%
regression/init.py	4	0	100%
regression/comparer.py	466	186	60%
regression/datasets.py	95	21	78%
regression/recorder.py	36	27	25%
regression/utils.py	137	38	72%
run_math_pipeline.py	260	239	8%
system.py	85	55	35%
umap_narrative/500_generate_embedding_umap_cluster.py	210	109	48%
umap_narrative/501_calculate_comment_extremity.py	112	54	52%
umap_narrative/502_calculate_priorities.py	135	135	0%
umap_narrative/700_datamapplot_for_layer.py	502	502	0%
umap_narrative/701_static_datamapplot_for_layer.py	310	310	0%
umap_narrative/702_consensus_divisive_datamapplot.py	432	432	0%
umap_narrative/801_narrative_report_batch.py	787	787	0%
umap_narrative/802_process_batch_results.py	265	265	0%
umap_narrative/803_check_batch_status.py	175	175	0%
umap_narrative/llm_factory_constructor/init.py	2	2	0%
umap_narrative/llm_factory_constructor/model_provider.py	157	157	0%
umap_narrative/polismath_commentgraph/init.py	1	0	100%
umap_narrative/polismath_commentgraph/cli.py	270	270	0%
umap_narrative/polismath_commentgraph/core/init.py	3	3	0%
umap_narrative/polismath_commentgraph/core/clustering.py	110	110	0%
umap_narrative/polismath_commentgraph/core/embedding.py	104	104	0%
umap_narrative/polismath_commentgraph/lambda_handler.py	219	219	0%
umap_narrative/polismath_commentgraph/schemas/init.py	2	0	100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py	160	9	94%
umap_narrative/polismath_commentgraph/tests/conftest.py	17	17	0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py	74	74	0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py	55	55	0%
umap_narrative/polismath_commentgraph/tests/test_storage.py	87	87	0%
umap_narrative/polismath_commentgraph/utils/init.py	3	0	100%
umap_narrative/polismath_commentgraph/utils/converter.py	283	237	16%
umap_narrative/polismath_commentgraph/utils/group_data.py	354	336	5%
umap_narrative/polismath_commentgraph/utils/storage.py	585	477	18%
umap_narrative/reset_conversation.py	159	50	69%
umap_narrative/run_pipeline.py	453	312	31%
utils/general.py	63	41	35%
Total	10793	7489	31%

whilo

LGTM 🙂

ballPointPenguin

nice work 👍🏻

jucor · 2026-03-10T09:38:11Z

Thanks a lot @ballPointPenguin and @whilo :)

jucor and others added 4 commits February 25, 2026 10:31

jucor commented Mar 5, 2026

View reviewed changes

jucor mentioned this pull request Mar 5, 2026

Fix PCA NaN handling and remove Clojure alignment hack #2396

Closed

3 tasks

jucor and others added 3 commits March 5, 2026 15:01

Use Optional[float] instead of float | None for typing consistency

ac34127

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Log warning when all-NaN columns fall back to 0.0 in PCA

cb444f9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove dead unnormalized angle code and misleading comment

b846866

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

whilo self-requested a review March 5, 2026 20:09

whilo approved these changes Mar 5, 2026

View reviewed changes

jucor requested a review from ballPointPenguin March 6, 2026 22:23

ballPointPenguin approved these changes Mar 8, 2026

View reviewed changes

jucor merged commit 9b0e711 into edge Mar 10, 2026
4 checks passed

jucor deleted the jc/pca_nan_fix branch March 10, 2026 09:42

jucor restored the jc/pca_nan_fix branch March 19, 2026 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix PCA NaN handling and remove Clojure alignment hack#2413

Fix PCA NaN handling and remove Clojure alignment hack#2413
jucor merged 7 commits intoedgefrom
jc/pca_nan_fix

jucor commented Mar 5, 2026 •

edited

Loading

Uh oh!

jucor left a comment

Uh oh!

jucor commented Mar 5, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

whilo left a comment

Uh oh!

ballPointPenguin left a comment

Uh oh!

jucor commented Mar 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jucor commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Background

Reviewer note

Test plan

Uh oh!

jucor left a comment

Choose a reason for hiding this comment

Uh oh!

jucor commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 5, 2026

Delphi Coverage Report

Uh oh!

whilo left a comment

Choose a reason for hiding this comment

Uh oh!

ballPointPenguin left a comment

Choose a reason for hiding this comment

Uh oh!

jucor commented Mar 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jucor commented Mar 5, 2026 •

edited

Loading

jucor commented Mar 5, 2026 •

edited

Loading