
Regression comparer: sign flip detection and projection metrics #2415

Merged

jucor merged 8 commits into edge from jc/pca_comparer on Mar 10, 2026
Conversation

@jucor
Collaborator

@jucor jucor commented Mar 5, 2026

Summary

Stacked on #2414 (test infra). Please review and merge #2413 and #2414 first.
Stack: #2413 (NaN fix) → #2414 (test infra) → this PR (#2415) → #2416 (sklearn PCA) → #2417 (test cleanup)

Improvements to the regression comparison tooling (`comparer.py`, `regression_comparer.py`):

  • Save the difference log even when errors occur (previously it was lost on exceptions; see the sketch after this summary)
  • Account for PCA sign flips in numerical difference reporting
  • Extend sign flip detection to cluster centers
  • Allow per-component error tolerance for PCA
  • Fix per-component PCA sign flip detection (was incorrectly applied globally)
  • Add projection-based comparison metrics: R², Procrustes distance, range-normalized error

These improvements are needed to properly validate the sklearn PCA transition (next PR in stack) but are independently useful for any PCA implementation change.
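
A minimal sketch of the "save the log regardless of errors" pattern referenced above, assuming a hypothetical `run_comparison` helper and a plain list-of-strings log; the real `comparer.py` structure may differ:

```python
def compare_with_log(golden, current, log_path):
    diff_log = []  # accumulates one entry per detected difference
    try:
        # Hypothetical comparison driver; may raise partway through.
        run_comparison(golden, current, diff_log)
    finally:
        # Persist whatever was collected, even if the comparison raised,
        # so partial difference logs are no longer lost on exceptions.
        with open(log_path, "w") as f:
            f.write("\n".join(diff_log))
```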

Test plan

  • 215 passed, 7 skipped, 2 xfailed
  • No production code changes — only regression testing tooling

🤖 Generated with Claude Code

Copy link
Copy Markdown
Member

@ballPointPenguin ballPointPenguin left a comment


Approve with a note:
the main regression-testing doc is now stale and gives a command that fails on this branch. The doc still tells readers to use `--tolerance-abs` / `--tolerance-rel`, but `regression_comparer.py` no longer defines those options.

Otherwise, looks good and the tests pass for me.

@jucor jucor force-pushed the jc/pca_test_infra branch from 8b38d9b to 3ab141e on March 10, 2026 at 09:46
Base automatically changed from jc/pca_test_infra to edge March 10, 2026 10:04
jucor and others added 8 commits March 10, 2026 11:03
Cluster centers are derived from PCA projections, so they inherit the
sign ambiguity. This fix ensures sign flips are detected and corrected
for `.center` paths in addition to `.pca.comps` and `.proj.` paths.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, sign flips were detected per-projection-vector, which fails
when only some components are flipped (e.g., PC1 unchanged, PC2 flipped).

Now detects flips at the component level (.pca.comps[N]) and stores them,
then applies per-dimension correction to projections and cluster centers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
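
A minimal sketch of the per-component detection described above, using numpy; the nearest-sign heuristic (a component counts as flipped when its negation is closer to the golden one) and the variable names are assumptions for illustration, not the exact comparer code:

```python
import numpy as np

def detect_component_flips(golden_comps, current_comps):
    """Return a sign vector with -1 at each index N where comps[N] is flipped."""
    signs = np.ones(len(golden_comps))
    for n, (g, c) in enumerate(zip(golden_comps, current_comps)):
        # Flipped if the negated component is closer to the golden one.
        if np.linalg.norm(g + c) < np.linalg.norm(g - c):
            signs[n] = -1.0
    return signs

# Toy example: PC1 unchanged, PC2 flipped.
golden = np.array([[1.0, 0.0], [0.0, 1.0]])
current = np.array([[1.0, 0.0], [0.0, -1.0]])
signs = detect_component_flips(golden, current)   # -> [ 1., -1.]

# The same per-dimension correction applies to projections and cluster
# centers, since both live in the space spanned by the components.
proj = np.array([[0.5, -2.0], [1.5, 3.0]])        # (n_points, n_comps)
proj_fixed = proj * signs                         # flips only the PC2 column
```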
… tests

TLDR: it works! The differences were just numerical errors that looked
artificially huge on small values. Looking at the whole cloud of projections
confirmed that the post-sklearn results match the pre-sklearn results
perfectly well.

Problem:
--------
The regression comparer was failing on datasets like FLI and pakistan with
thousands of "differences" showing relative errors up to 874%. Investigation
revealed these were false positives: the high relative errors occurred on
near-zero values (e.g., golden=4.54e-06 vs current=4.42e-05) where even tiny
absolute differences (3e-04) produce huge relative errors.

Diagnosis:
----------
Comparing sklearn SVD-based PCA against power iteration golden snapshots:
- The projection point clouds are visually identical (see scatter plots)
- Important values (Q70-Q100 percentile) match within 0.2%
- Only near-zero values (Q0-Q7 percentile, ~0.0x median) show large rel errors
- These near-zero values represent participants at the origin who don't
  affect visualization or clustering

The element-wise (abs_tol, rel_tol) approach fundamentally cannot handle
this case: it either fails on small values or is too loose for large values.
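
Plugging in the quoted numbers shows the failure mode in two lines:

```python
golden, current = 4.54e-06, 4.42e-05
abs_err = abs(current - golden)    # ~3.97e-05: negligible in absolute terms
rel_err = abs_err / abs(golden)    # ~8.74, i.e. the ~874% relative error above
```

Any rel_tol loose enough to pass this point would be meaningless for the large, well-determined values.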

Solution:
---------
Added projection comparison metrics that measure what actually matters:

| Metric                | Threshold | What it measures                    |
|-----------------------|-----------|-------------------------------------|
| Max |error| / range   | < 1%      | Worst displacement as % of axis    |
| Mean |error| / range  | < 0.1%    | Average displacement as % of axis  |
| R² (all coordinates)  | > 0.9999  | Variance explained (99.99%)        |
| R² (per dimension)    | > 0.999   | Per-PC fit quality (99.9%)         |
| Procrustes disparity  | < 1e-4    | Shape similarity after alignment   |

Results for FLI dataset:
- Max |error| / range: 0.0617% (was flagging 28% rel error on Q1 values)
- R²: 0.9999992
- Procrustes: 8.2e-07

All 7 local datasets now pass regression tests.
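
A minimal sketch of how metrics like these can be computed with numpy, scipy, and scikit-learn; this illustrates the idea rather than reproducing the exact comparer.py code, and the real implementation may normalize per axis rather than over the global range:

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.metrics import r2_score

def projection_metrics(golden_proj, current_proj):
    """golden_proj, current_proj: (n_participants, n_components) arrays."""
    err = np.abs(current_proj - golden_proj)
    data_range = golden_proj.max() - golden_proj.min()
    if data_range == 0:  # guard against division by zero (see "Also added" below)
        data_range = 1.0
    # procrustes() centers, scales, and rotates the clouds before comparing;
    # only the disparity (sum of squared pointwise differences) is kept here.
    _, _, disparity = procrustes(golden_proj, current_proj)
    return {
        "max_err_over_range": err.max() / data_range,    # threshold < 1%
        "mean_err_over_range": err.mean() / data_range,  # threshold < 0.1%
        "r2_all": r2_score(golden_proj.ravel(), current_proj.ravel()),
        "r2_per_dim": [r2_score(golden_proj[:, d], current_proj[:, d])
                       for d in range(golden_proj.shape[1])],
        "procrustes_disparity": disparity,
    }
```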

Also added:
- Quantile context in error reports (computed from ALL values, not just failures)
- Explanation when element-wise diffs exist but metrics confirm match
- Exclude .pca.center from PCA sign-flip handling (not sign-ambiguous)
- Use AND logic for overall_match (don't let projection metrics override stage failures)
- Guard against division by zero when projection data_range is 0
- Fix return type annotation on _log_projection_metrics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jucor jucor force-pushed the jc/pca_comparer branch from 2798a2f to 503762b on March 10, 2026 at 11:12
@github-actions

Delphi Coverage Report

| File | Stmts | Miss | Cover |
|------|-------|------|-------|
| __init__.py | 3 | 0 | 100% |
| __main__.py | 55 | 55 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/__init__.py | 2 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| components/server.py | 116 | 72 | 38% |
| conversation/__init__.py | 2 | 0 | 100% |
| conversation/conversation.py | 1036 | 352 | 66% |
| conversation/manager.py | 131 | 42 | 68% |
| database/__init__.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 234 | 40% |
| database/postgres.py | 306 | 205 | 33% |
| pca_kmeans_rep/__init__.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 234 | 7 | 97% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 238 | 69 | 71% |
| pca_kmeans_rep/repness.py | 361 | 44 | 88% |
| pca_kmeans_rep/stats.py | 107 | 22 | 79% |
| poller.py | 224 | 188 | 16% |
| regression/__init__.py | 4 | 0 | 100% |
| regression/comparer.py | 883 | 405 | 54% |
| regression/datasets.py | 95 | 21 | 78% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 137 | 38 | 72% |
| run_math_pipeline.py | 260 | 239 | 8% |
| system.py | 85 | 55 | 35% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 54 | 52% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 787 | 787 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/__init__.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/__init__.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/__init__.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 110 | 110 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/__init__.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/__init__.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 585 | 477 | 18% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 63 | 41 | 35% |
| **Total** | **11213** | **7707** | **31%** |

@jucor jucor merged commit 8ff5d49 into edge Mar 10, 2026
4 checks passed
@jucor jucor deleted the jc/pca_comparer branch March 10, 2026 11:25
@jucor jucor restored the jc/pca_comparer branch March 19, 2026 15:20