Fix PCA NaN handling and remove Clojure alignment hack #2413
commit 2024dde
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Update test_powerit_pca_with_nans to expect ValueError

The previous commit changed powerit_pca to raise ValueError when given NaN values, making NaN handling the caller's responsibility. This test now verifies that behavior instead of testing graceful NaN handling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

commit 9839770
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Fix PCA NaN handling to use column mean (matching Clojure)

The Python PCA was using 0 to fill missing votes, while Clojure uses column means. This caused significant differences in PCA components (16 deg and 52 deg angle differences on the VW dataset).

Changes:
- pca_project_dataframe: Replace NaN with column means instead of 0
- powerit_pca: Add ValueError if NaN values reach it (caller must preprocess)
- test_regression.py: Enable ignore_pca_sign_flip for golden comparisons

New tests:
- test_pca_unit.py: test_nan_handling_uses_column_mean verifies column mean is used, test_nan_handling_differs_from_zero_fill confirms we are not using 0
- test_legacy_clojure_regression.py: test_pca_components_match_clojure compares Python vs Clojure PCA components (correlation and angle)

Results after fix:
- VW: PC1/PC2 correlation 1.0/-1.0, angle 0 deg (perfect match)
- Biodiversity: PC1/PC2 correlation 0.99/-1.0, angle 7/4 deg (minor numerical differences from power iteration on 82% sparse data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

commit 4b33297
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Test that sign flipping propagates

If the PCA components flip sign, the projections must also flip sign.
commit e307a52
Author: Julien Cornebise <julien@cornebise.com>
Date: Fri Nov 21 09:36:22 2025 +0000

Flag an unneeded exception

commit 91a1e9c
Author: Julien Cornebise <julien@cornebise.com>
Date: Wed Nov 19 15:49:34 2025 +0000

Report any scaling issues in PCA projections

commit 395e636
Author: Julien Cornebise <julien@cornebise.com>
Date: Wed Nov 19 15:25:46 2025 +0000

Turn off the Clojure sign-flip for PCA

Of course this breaks the comparison to the golden files, as those included the sign flip. So we're also giving the option in the regression test to ignore PCA sign-flips. That will be handy later when we work on improving the PCA implementation, as various PCA implementations have various sign conventions.

commit 0dc1aa7
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Clean up unused variables and imports

Address GitHub Copilot review comments:
- Log superseded votes count in conversation.py instead of leaving unused
- Remove unused p1_idx/p2_idx index lookups in corr.py
- Remove unused all_passed variable in regression_comparer.py
- Remove unused imports (numpy, Path, List, datetime, stats, pca/cluster functions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The original angle computation assumed unit-normalized vectors, which gave misleading results when comparing Python (exact PCA, unit vectors) with Clojure stochastic PCA for large conversations (non-unit vectors due to learning rate blending without re-normalization).

Now outputs both the raw angle and the properly normalized angle, plus the norms of both vectors for diagnostics. The assertion now checks the normalized angle, which correctly identifies when vectors point in similar directions regardless of their magnitudes.

Co-Authored-By: Claude <noreply@anthropic.com>
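To illustrate the difference between the raw and normalized angles described in this commit, here is a minimal sketch. The function name `component_angles` and its return keys are hypothetical and are not taken from the test file; the sketch only assumes NumPy.

```python
import numpy as np


def component_angles(u, v):
    """Hypothetical helper: compare two PCA component vectors.

    Returns the "raw" angle (arccos of the plain dot product, only
    meaningful if both vectors are unit length), the normalized angle
    (arccos of the cosine similarity), and both norms for diagnostics.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_u = float(np.linalg.norm(u))
    norm_v = float(np.linalg.norm(v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot compute an angle with a zero vector")

    dot = float(np.dot(u, v))
    # Clip to [-1, 1] so rounding error (or non-unit inputs) cannot
    # push the argument outside the domain of arccos.
    raw_deg = float(np.degrees(np.arccos(np.clip(dot, -1.0, 1.0))))
    cosine = dot / (norm_u * norm_v)
    normalized_deg = float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))
    return {
        "raw_angle_deg": raw_deg,
        "normalized_angle_deg": normalized_deg,
        "norm_u": norm_u,
        "norm_v": norm_v,
    }


# Same direction, different magnitudes: the raw angle is misleading
# (about 60 degrees), while the normalized angle is 0.
result = component_angles([1.0, 0.0], [0.5, 0.0])
```

Asserting on `normalized_angle_deg` rather than `raw_angle_deg` keeps the check independent of vector magnitude, which is the behavior the commit message describes.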
As discussed above, the alignment with Clojure was an AI-hallucinated kludge to pass tests by customized scaling tailored to each dataset, hiding the problems with the Python PCA implementation (the handling of NaNs). Now that we have fixed those, we can remove that madness and update the golden records accordingly.
- Fix all-NaN column edge case in PCA: replace NaN col_means with 0.0
- Lower per-occurrence sign-flip log level to DEBUG (summary stays WARNING)
- Fix dataset validation: explicit names always search local datasets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jucor left a comment:
Originally from @whilo (approved the previous version of this PR):
Overall: Approve. This is a solid fix. I traced the bug back to the original PCA port (c520b6af) — the Clojure code at conversation.clj:355-377 explicitly fills nil with per-column averages before PCA, and the Python port replaced that with np.nan_to_num(data, nan=0.0). The align_with_clojure hack (hardcoded sign flips and scaling by dataset size) was layered on top to mask the symptoms. Good to see both the root cause fixed and the hack removed.
Minor feedback:
- `comparer.py:_detect_scaling_factor` uses `float | None` return type (Python 3.10+ syntax). The project pins `>=3.12` in pyproject.toml so this works, but the rest of the codebase uses `Optional[float]` from typing; worth being consistent.
- `pca.py` line ~508: When `col_means` contains NaN (all-NaN column), you silently fall back to 0.0. Consider logging a warning: an entirely unvoted column is unusual and could indicate a data pipeline issue upstream.
- `test_legacy_clojure_regression.py` lines 333-337: The unnormalized angle computation is dead code (printed but never asserted on), and the `# NOTE: This assumes both vectors are unit-normalized` comment is misleading since the code right below it computes the normalized version that's actually used. Either remove it or label it clearly as diagnostic.
- Sign convention suggestion for a future PR: Rather than using `ignore_pca_sign_flip=True` in regression tests (which loosens PCA sensitivity), consider a deterministic sign-fixing convention (e.g., force the largest-magnitude element of each component to be positive). This would make golden records fully reproducible without needing sign flip tolerance. Not a blocker for this PR.
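The sign convention suggested in the last item could look roughly like the sketch below. This is an illustration only: the function name `fix_component_signs` and the array layouts (components as rows, projections with one column per component) are assumptions, not the project's actual API.

```python
import numpy as np


def fix_component_signs(components, projections):
    """Hypothetical helper: make PCA signs deterministic.

    Assumed shapes: components is (n_components, n_features) and
    projections is (n_samples, n_components). Each component is flipped,
    if needed, so that its largest-magnitude element is positive; the
    matching projection column is flipped with it.
    """
    components = np.asarray(components, dtype=float)
    projections = np.asarray(projections, dtype=float)

    # Index of the largest-magnitude entry in each component
    # (argmax returns the first one when magnitudes tie).
    idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    signs[signs == 0] = 1.0  # an all-zero component is left unchanged

    fixed_components = components * signs[:, np.newaxis]
    fixed_projections = projections * signs[np.newaxis, :]
    return fixed_components, fixed_projections


comps = np.array([[0.6, -0.8], [0.8, 0.6]])
projs = np.array([[1.0, 2.0], [3.0, 4.0]])
fixed_comps, fixed_projs = fix_component_signs(comps, projs)
```

Because the same sign is applied to a component and to its projection column, the result is identical whether the PCA routine returned a component or its negation, which is what would let golden records be compared without a sign-flip tolerance. Exact ties in magnitude would still need a tie-breaking rule beyond "first index".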
@whilo Thank you very much for the in-depth review! I'm very excited to get this moving, thank you :)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks a lot @ballPointPenguin and @whilo :)
Summary
Background
Investigation revealed that Python filled NaN with 0 while Clojure fills with the column mean. On the VW dataset this produced 16° and 52° angle differences in PC1/PC2. After the fix, VW matches exactly and biodiversity is within 7° (expected for power iteration on 82% sparse data).
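For readers unfamiliar with the two fill strategies, the following sketch shows column-mean imputation with the all-NaN fallback to 0.0 mentioned in the commits. It is not the code from `pca.py`: the helper name `fill_nan_with_column_means` is hypothetical, and the warning is the reviewer's suggestion rather than confirmed behavior.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def fill_nan_with_column_means(matrix):
    """Hypothetical helper: replace each NaN with its column's mean.

    Rows are participants and columns are comments. Columns with no
    votes at all have no mean, so they fall back to 0.0 and a warning
    is logged.
    """
    filled = np.array(matrix, dtype=float)  # copy; the input is not mutated
    nan_mask = np.isnan(filled)

    # Mean over the observed entries only, computed without np.nanmean
    # so that all-NaN columns do not trigger a RuntimeWarning.
    counts = (~nan_mask).sum(axis=0)
    sums = np.where(nan_mask, 0.0, filled).sum(axis=0)
    col_means = np.divide(
        sums, counts, out=np.zeros(filled.shape[1]), where=counts > 0
    )

    empty_cols = np.flatnonzero(counts == 0)
    if empty_cols.size:
        logger.warning(
            "Columns with no votes, filled with 0.0: %s", empty_cols.tolist()
        )

    rows, cols = np.nonzero(nan_mask)
    filled[rows, cols] = col_means[cols]
    return filled


votes = np.array(
    [
        [1.0, np.nan, np.nan],
        [1.0, -1.0, np.nan],
        [np.nan, 1.0, np.nan],
    ]
)
filled = fill_nan_with_column_means(votes)
```

In this toy matrix the missing vote in the first column is filled with 1.0 (the mean of the observed votes), where a zero fill would have pulled that participant toward neutral. That shift is the kind of difference that can rotate the principal components.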
Reviewer note
The golden record JSON diffs (~47k lines) can be skipped — they are the expected output changes from the NaN handling fix. Focus review on:
Test plan
🤖 Generated with Claude Code