Skip to content

Fix PCA NaN handling and remove Clojure alignment hack#2413

Merged
jucor merged 7 commits intoedgefrom
jc/pca_nan_fix
Mar 10, 2026
Merged

Fix PCA NaN handling and remove Clojure alignment hack#2413
jucor merged 7 commits intoedgefrom
jc/pca_nan_fix

Conversation

@jucor
Copy link
Copy Markdown
Collaborator

@jucor jucor commented Mar 5, 2026

Summary

Part 1 of the #2311 split (PCA overhaul). This is the foundational fix — subsequent PRs stack on this one.

Stack: This PR#2414 (test infra) → #2415 (comparer) → #2416 (sklearn PCA) → #2417 (test cleanup)

  • Fix Python PCA to use column means for missing votes (instead of 0), matching the Clojure behavior
  • Add sign-flip-aware comparison in the regression comparer
  • Add tests for NaN handling, PCA regression, and Clojure comparison
  • Remove the "align with Clojure" scaling hack from PCA that was masking the real issue
  • Update golden records to reflect corrected PCA output

Background

Investigation revealed Python filled NaN with 0 while Clojure fills with column mean. The VW dataset had 16° and 52° angle differences in PC1/PC2. After fix, VW matches exactly and biodiversity is within 7° (expected for power iteration on 82% sparse data).

Reviewer note

The golden record JSON diffs (~47k lines) can be skipped — they are the expected output changes from the NaN handling fix. Focus review on:

  • `pca.py` (NaN handling fix)
  • `comparer.py` (sign-flip-aware comparison)
  • `test_pca_unit.py`, `test_legacy_clojure_regression.py` (new tests)

Test plan

  • 226 passed, 2 skipped, 2 xfailed
  • PCA NaN handling unit tests pass
  • Clojure PCA comparison passes

🤖 Generated with Claude Code

jucor and others added 4 commits February 25, 2026 10:31
commit 2024dde
Author: Julien Cornebise <julien@cornebise.com>
Date:   Mon Nov 24 22:14:34 2025 +0000

    Update test_powerit_pca_with_nans to expect ValueError

    The previous commit changed powerit_pca to raise ValueError when given
    NaN values, making NaN handling the caller responsibility. This test
    now verifies that behavior instead of testing graceful NaN handling.

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    Co-Authored-By: Claude <noreply@anthropic.com>

commit 9839770
Author: Julien Cornebise <julien@cornebise.com>
Date:   Mon Nov 24 22:14:34 2025 +0000

    Fix PCA NaN handling to use column mean (matching Clojure)

    The Python PCA was using 0 to fill missing votes, while Clojure uses
    column means. This caused significant differences in PCA components
    (16 deg and 52 deg angle differences on VW dataset).

    Changes:
    - pca_project_dataframe: Replace NaN with column means instead of 0
    - powerit_pca: Add ValueError if NaN values reach it (caller must preprocess)
    - test_regression.py: Enable ignore_pca_sign_flip for golden comparisons

    New tests:
    - test_pca_unit.py: test_nan_handling_uses_column_mean verifies column
      mean is used, test_nan_handling_differs_from_zero_fill confirms we
      are not using 0
    - test_legacy_clojure_regression.py: test_pca_components_match_clojure
      compares Python vs Clojure PCA components (correlation and angle)

    Results after fix:
    - VW: PC1/PC2 correlation 1.0/-1.0, angle 0 deg (perfect match)
    - Biodiversity: PC1/PC2 correlation 0.99/-1.0, angle 7/4 deg (minor
      numerical differences from power iteration on 82% sparse data)

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    Co-Authored-By: Claude <noreply@anthropic.com>

commit 4b33297
Author: Julien Cornebise <julien@cornebise.com>
Date:   Mon Nov 24 22:14:34 2025 +0000

    Test that sign flipping propagates

    If the PCA components flip sign, the projections
    must also flip sign.

commit e307a52
Author: Julien Cornebise <julien@cornebise.com>
Date:   Fri Nov 21 09:36:22 2025 +0000

    Flag an unneeded exception

commit 91a1e9c
Author: Julien Cornebise <julien@cornebise.com>
Date:   Wed Nov 19 15:49:34 2025 +0000

    Report any scaling issues in PCA projections

commit 395e636
Author: Julien Cornebise <julien@cornebise.com>
Date:   Wed Nov 19 15:25:46 2025 +0000

    Turn off off the Clojure sign-flip for PCA

    Of course this breaks the comparison to the golden files, as those included
    the sign flip. So we're also giving the option in the regression test to ignore
    PCA sign-flips. That will be handy later when we work on improving the PCA implementation,
    as various PCA implementations have various sign conventions.

commit 0dc1aa7
Author: Julien Cornebise <julien@cornebise.com>
Date:   Mon Nov 24 22:14:34 2025 +0000

    Clean up unused variables and imports

    Address GitHub Copilot review comments:
    - Log superseded votes count in conversation.py instead of leaving unused
    - Remove unused p1_idx/p2_idx index lookups in corr.py
    - Remove unused all_passed variable in regression_comparer.py
    - Remove unused imports (numpy, Path, List, datetime, stats, pca/cluster functions)

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    Co-Authored-By: Claude <noreply@anthropic.com>
The original angle computation assumed unit-normalized vectors, which
gave misleading results when comparing Python (exact PCA, unit vectors)
with Clojure stochastic PCA for large conversations (non-unit vectors
due to learning rate blending without re-normalization).

Now outputs both the raw angle and the properly normalized angle,
plus the norms of both vectors for diagnostics. The assertion now
checks the normalized angle, which correctly identifies when vectors
point in similar directions regardless of their magnitudes.

Co-Authored-By: Claude <noreply@anthropic.com>
As discussed above, the alignment with clojure was an AI-hallucinated kludge
to pass tests by customized scaling tailored to each dataset, hiding the problems
with the Python PCA implementation (the handling of NaNs). Now that we have fixed
those, we can remove that madness and update the golden records accordingly.
- Fix all-NaN column edge case in PCA: replace NaN col_means with 0.0
- Lower per-occurrence sign-flip log level to DEBUG (summary stays WARNING)
- Fix dataset validation: explicit names always search local datasets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator Author

@jucor jucor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally from @whilo (approved the previous version of this PR):

Overall: Approve. This is a solid fix. I traced the bug back to the original PCA port (c520b6af) — the Clojure code at conversation.clj:355-377 explicitly fills nil with per-column averages before PCA, and the Python port replaced that with np.nan_to_num(data, nan=0.0). The align_with_clojure hack (hardcoded sign flips and scaling by dataset size) was layered on top to mask the symptoms. Good to see both the root cause fixed and the hack removed.

Minor feedback:

  1. comparer.py: _detect_scaling_factor uses float | None return type (Python 3.10+ syntax). The project pins >=3.12 in pyproject.toml so this works, but the rest of the codebase uses Optional[float] from typing — worth being consistent.

  2. pca.py line ~508: When col_means contains NaN (all-NaN column), you silently fall back to 0.0. Consider logging a warning — an entirely unvoted column is unusual and could indicate a data pipeline issue upstream.

  3. test_legacy_clojure_regression.py lines 333-337: The unnormalized angle computation is dead code (printed but never asserted on), and the # NOTE: This assumes both vectors are unit-normalized comment is misleading since the code right below it computes the normalized version that's actually used. Either remove it or label it clearly as diagnostic.

  4. Sign convention suggestion for a future PR: Rather than using ignore_pca_sign_flip=True in regression tests (which loosens PCA sensitivity), consider a deterministic sign-fixing convention (e.g., force the largest-magnitude element of each component to be positive). This would make golden records fully reproducible without needing sign flip tolerance. Not a blocker for this PR.

@jucor
Copy link
Copy Markdown
Collaborator Author

jucor commented Mar 5, 2026

@whilo Thank you very much for the in-depth review! I'm very excited to get this moving, thank you :)

  1. Types. Thanks for catching that. Done.
  2. all-NaN columns. Good call! In theory we should not be able to have them, they should be filtered by the way we create the matrices by processing the votes (in which case, why do I even have this check here, you'd ask -- and you'd be right to 😅 ), so a Warning is definitely in order ! Done.
  3. Good catch. Rotting comments and dead code --> proof I'm using a LLM :) (although Opus 4.6 is so much better than Sonnet 3.7 !). Removing.
  4. Ah, super interesting ! One thing we need to optimize for is stability (when warm-starting / online-ish computing), to avoid the axes directions changing from one day to the next and the user being super-confused. This has been a historically strong need for the users, and still is. So, in theory, I would think that having a deterministic sign-fixing convention, as opposed to the undetermined behaviour, would help. However, once we switch to incremental PCA from the warm-start (the aim of one of the ten next PRs in my todo-list), we will have stability for free thanks to the progressive update. So while a forced flip signed by enforcing a constraint (smart as it is) would solve the problem right now, I can see it causing some unwanted side-effects when moving to incremental PCA: imagine the case of two elements of opposite sign battling it for "biggest magnitude" by just a little, changing from one iteration to the next --> flip, whereas letting incremental PCA alone would have given it to us.
    This being said, we could still use this hard-constraint route for only the golden-record tests. But again, in the weird case of numerical error between two almost-tied coordinates, I could foresee (unlikely as it is) a switch, that the current implementation, if I'm not mistaken, would be robust to.
    But thanks for that suggestion!

jucor and others added 3 commits March 5, 2026 15:01
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 5, 2026

Delphi Coverage Report

File Stmts Miss Cover
init.py 3 0 100%
main.py 55 55 0%
benchmarks/bench_repness.py 81 81 0%
benchmarks/bench_update_votes.py 38 38 0%
benchmarks/benchmark_utils.py 34 34 0%
components/init.py 2 0 100%
components/config.py 165 133 19%
components/server.py 116 72 38%
conversation/init.py 2 0 100%
conversation/conversation.py 1036 353 66%
conversation/manager.py 131 42 68%
database/init.py 1 0 100%
database/dynamodb.py 387 234 40%
database/postgres.py 306 205 33%
pca_kmeans_rep/init.py 5 0 100%
pca_kmeans_rep/clusters.py 234 7 97%
pca_kmeans_rep/corr.py 98 17 83%
pca_kmeans_rep/pca.py 235 69 71%
pca_kmeans_rep/repness.py 361 44 88%
pca_kmeans_rep/stats.py 107 22 79%
poller.py 224 188 16%
regression/init.py 4 0 100%
regression/comparer.py 466 186 60%
regression/datasets.py 95 21 78%
regression/recorder.py 36 27 25%
regression/utils.py 137 38 72%
run_math_pipeline.py 260 239 8%
system.py 85 55 35%
umap_narrative/500_generate_embedding_umap_cluster.py 210 109 48%
umap_narrative/501_calculate_comment_extremity.py 112 54 52%
umap_narrative/502_calculate_priorities.py 135 135 0%
umap_narrative/700_datamapplot_for_layer.py 502 502 0%
umap_narrative/701_static_datamapplot_for_layer.py 310 310 0%
umap_narrative/702_consensus_divisive_datamapplot.py 432 432 0%
umap_narrative/801_narrative_report_batch.py 787 787 0%
umap_narrative/802_process_batch_results.py 265 265 0%
umap_narrative/803_check_batch_status.py 175 175 0%
umap_narrative/llm_factory_constructor/init.py 2 2 0%
umap_narrative/llm_factory_constructor/model_provider.py 157 157 0%
umap_narrative/polismath_commentgraph/init.py 1 0 100%
umap_narrative/polismath_commentgraph/cli.py 270 270 0%
umap_narrative/polismath_commentgraph/core/init.py 3 3 0%
umap_narrative/polismath_commentgraph/core/clustering.py 110 110 0%
umap_narrative/polismath_commentgraph/core/embedding.py 104 104 0%
umap_narrative/polismath_commentgraph/lambda_handler.py 219 219 0%
umap_narrative/polismath_commentgraph/schemas/init.py 2 0 100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py 160 9 94%
umap_narrative/polismath_commentgraph/tests/conftest.py 17 17 0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py 74 74 0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py 55 55 0%
umap_narrative/polismath_commentgraph/tests/test_storage.py 87 87 0%
umap_narrative/polismath_commentgraph/utils/init.py 3 0 100%
umap_narrative/polismath_commentgraph/utils/converter.py 283 237 16%
umap_narrative/polismath_commentgraph/utils/group_data.py 354 336 5%
umap_narrative/polismath_commentgraph/utils/storage.py 585 477 18%
umap_narrative/reset_conversation.py 159 50 69%
umap_narrative/run_pipeline.py 453 312 31%
utils/general.py 63 41 35%
Total 10793 7489 31%

@whilo whilo self-requested a review March 5, 2026 20:09
Copy link
Copy Markdown
Collaborator

@whilo whilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🙂

@jucor jucor requested a review from ballPointPenguin March 6, 2026 22:23
Copy link
Copy Markdown
Member

@ballPointPenguin ballPointPenguin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work 👍🏻

@jucor
Copy link
Copy Markdown
Collaborator Author

jucor commented Mar 10, 2026

Thanks a lot @ballPointPenguin and @whilo :)

@jucor jucor merged commit 9b0e711 into edge Mar 10, 2026
4 checks passed
@jucor jucor deleted the jc/pca_nan_fix branch March 10, 2026 09:42
@jucor jucor restored the jc/pca_nan_fix branch March 19, 2026 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants