diff --git a/delphi/CLAUDE.md b/delphi/CLAUDE.md index 8d5bb9c17..92d0e8b0c 100644 --- a/delphi/CLAUDE.md +++ b/delphi/CLAUDE.md @@ -2,15 +2,9 @@ This document provides comprehensive guidance for working with the Delphi system, including database interactions, environment configuration, Docker services, and the distributed job queue system. It serves as both documentation and a practical reference for day-to-day operations. -## Documentation Directory +## Documentation -For a comprehensive list of all documentation files with descriptions, see: -[delphi/docs/DOCUMENTATION_DIRECTORY.md](docs/DOCUMENTATION_DIRECTORY.md) - -## Current work todos are located in - -delphi/docs/JOB_QUEUE_SCHEMA.md -delphi/docs/DISTRIBUTED_SYSTEM_ROADMAP.md +**Warning:** Many docs in `docs/` are outdated and should not be trusted. Always verify against the actual code. Start with `docs/PLAN_DISCREPANCY_FIXES.md` (canonical fix plan) and `docs/CLJ-PARITY-FIXES-JOURNAL.md` (session journal) for current Clojure parity work. ## Helpful terminology @@ -368,3 +362,25 @@ The system uses AWS Auto Scaling Groups to manage capacity: - Large Instance ASG: 1 instance by default, scales up to 3 based on demand CPU utilization triggers scaling actions (scale down when below 60%, scale up when above 80%). + + +## Testing + +Run tests with `pytest` on the `tests/` folder. + +### Reference datasets + +In `real_data`, we have several datasets of real conversations, exported from Polis, that can be used for testing and development. Those at the root of `real_data` are public. +In `real_data/.local`, we have some private datasets that can only be used internally. The comparer supports both public and private datasets via the `--include-local` flag. + +### Regressions and golden snapshots + +To catch regressions against the latest validated Python code, there are both regression unit tests in `tests/` and a test script that compares the output to "golden snapshots": `scripts/regression_comparer.py`. 
That script is more verbose than the tests and useful for debugging. + +Some amount of numerical error is acceptable; quantifying and tolerating it is what the regression comparer library is for. + +### Old Clojure reference implementation, and the move to sklearn + +For the math, there is an older Clojure implementation in `polismath`. Until we can replace it, we run comparisons between the two implementations in `tests/*legacy*`. Those tests run the Python code and compare selected outputs to the `math blob`, the JSON output of the Clojure implementation. In production this blob is stored in the PostgreSQL database, but for simplicity we keep copies alongside the golden (Python) snapshots used by the regression comparer, so these tests require neither Postgres nor Clojure. + +A lot of the current Python code was ported from Clojure using an AI agent (Sonnet 3.5, last year), including many home-made implementations of core algorithms. We are in the process of replacing those with standard implementations (such as sklearn for PCA and K-means). This is ongoing work, made harder by the fact that the Python code does not quite reproduce the Clojure output. The typical workflow is: identify where the ported Python code diverges from the Clojure code, adjust the Python code until it matches the Clojure output, then swap in the standard implementation (which may again produce slightly different output) and tune its parameters until the outputs agree. diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md new file mode 100644 index 000000000..aefb086d3 --- /dev/null +++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md @@ -0,0 +1,470 @@ +# Plan: TDD Approach to Fixing Python-Clojure Discrepancies + +## Context + +The Delphi Python math pipeline has 15 documented discrepancies with the Clojure reference implementation (see `deep-analysis-for-julien/07-discrepancies.md` and `deep-analysis-for-julien/09-fix-plan.md`). 
We need to fix them one-by-one with a TDD approach: **first extend the regression test to verify the discrepancy exists, then fix it, then verify the test passes**. + +Each fix will be a separate PR to keep reviews manageable. PRs will be **stacked** (each builds on the previous), since fixes are ordered by pipeline execution order — fixing upstream affects downstream. PRs should be clearly labeled as stacked, with their dependency chain, so reviewers know the order. We can use `git rebase -i` to clean up commit history before merging to main. + +**Prerequisite**: The current `kmeans_work` branch has accumulated changes relative to `edge`. These should be separated into their own PR(s) first, before we start the discrepancy fix PRs. The discrepancy fix PRs should be based on the cleaned-up branch. + +**Action**: Before starting any fix, analyze the diff of `kmeans_work` vs `upstream/edge` to understand what has changed. Group those changes into logical PR(s) — e.g., test infrastructure improvements, doc updates, minor bug fixes. Each should be reviewable independently. The discrepancy fix PRs (PR 1+) then stack on top of this clean base. + +### Session Continuity + +Because this work will span multiple Claude Code sessions, we maintain: + +1. **`delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md`** — An ongoing (committed) tracking document containing: + - Test baseline after each PR (which tests pass/fail) + - Gotchas and edge cases discovered along the way + - Detailed context, kept here in addition to commit messages and PR descriptions. The journal is the single source of truth for the fix process, while commit messages address reviewers, colleagues, and future maintainers. + - Notes for the next session to pick up where we left off +2. **`delphi/CLAUDE.md`** — Updated with stable patterns and conventions discovered during fixes +3. 
**Commit messages and PR descriptions** — All information useful for reviewers goes here (the journal should reference commit hashes, but can also duplicate content: different audiences, info can overlap, even be identical if it helps.) +4. **Reference docs** — `deep-analysis-for-julien/` (architecture, discrepancies, fix plan), this plan file + +### Current Test Baseline +- `test_legacy_clojure_regression.py`: **12 failed, 6 passed, 7 xfailed** (with --include-local) +- Key failures: `test_basic_outputs` (repness empty!), `test_group_clustering` (wrong k, wrong membership) +- `test_legacy_repness_comparison.py`: 4 passed (but only checks structure, not values) +- Only `biodiversity` and `vw` have Clojure math blobs; private datasets do not yet + +### Testing Principles + +- **Granular tests per discrepancy**: Not just overall regression — each fix gets its own targeted test checking the specific aspect it addresses. Multiple discrepancies may affect `test_basic_outputs`; we need to see incremental improvement per fix. +- **Targeted pipeline-stage tests**: For D2/D3 (participant filtering, clustering), check in-conv count, cluster count, and cluster memberships against Clojure blob. For D12, check comment-priorities against Clojure blob. +- **All datasets, not just biodiversity**: Every fix must pass on ALL datasets. biodiversity is just one reference among many. +- **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly. +- **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. 
Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification. + +### Datasets Available (sorted by size, smallest first) + +| Dataset | Votes | Has Clojure blob? | Notes | +|---------|------:|-------------------|-------| +| vw | 4,684 | Yes | Fastest for iteration | +| biodiversity | 29,803 | Yes | Medium-size reference | +| *(5 private datasets in real_data/.local/)* | 91K–1M | **Need to generate** | Require prodclone DB | + +**Generating missing Clojure blobs**: Use `delphi/scripts/generate_cold_start_clojure.py` which creates a temporary conversation in Postgres, runs the Clojure poller via Docker, captures the math blob, and cleans up. Requires the prodclone Postgres database to be running. This should be done as a **pre-requisite step** (PR 0 or setup task) before starting fixes. Note: this is ONLY for a full cold-start blob generation. For incremental testing (D3, D1), we need the replay infrastructure to feed votes in batches and capture intermediate blobs. See below. + +```bash +cd delphi +python scripts/generate_cold_start_clojure.py --all --include-local --pause-math +``` + +If the prodclone DB is not running, we need to start it first. Also check if the main polis worktree already has generated blobs we can copy. +**Testing strategy**: Start with `vw` (fastest), then `biodiversity`, then progressively larger datasets. Expect that some discrepancies only manifest in certain conversations (different edge cases). When we find such a case, add a synthetic unit test for that pattern. + +### TypeScript Server Dependency + +The TypeScript server (`server/src/nextComment.ts`) depends on Delphi's math outputs: +- **`comment-priorities`**: Used by `getNextPrioritizedComment()` → `selectProbabilistically()` for weighted comment routing. **Currently NOT computed by Python** — falls back to uniform random (weight=1 for all). 
+- **`repness`**: Consumed by `client-report/src/components/lists/participantGroups.jsx` for displaying representative comments per group in the **legacy report mode**. +- **`consensus`**: Consumed by `client-report/` via `normalizeConsensus.js` for consensus display. + +**Two report modes exist**: +- **Legacy report** (`/report/{report_id}`) — uses `math["repness"]`, `math["group-votes"]`, `math["group-clusters"]` from the math blob +- **Narrative report** (`/narrativeReport/{report_id}`) — uses DynamoDB UMAP/topic data, independent of math blob repness + +The formula-level fixes (D4, D5, D6, D7, D8, D9) are pure math improvements — they make the computation correct regardless of consumer. The **selection logic** fixes (D10, D11) need a legacy-mode flag since `client-report/` may depend on current behavior. Comment priorities (D12) is a net-new computation that the TypeScript server will consume. + +**Future PRs** (after parity): Implement Python versions of TypeScript routing logic (`getNextPrioritizedComment`, `selectProbabilistically`) so core computations live in Python, with TypeScript as thin routing layer. Open a GitHub Issue upstream documenting this proposal with permalinks to the relevant TypeScript code. + +--- + +## Fix Order (one PR per fix) + +Fixes are ordered by **pipeline execution order**: participant filtering → probability estimates → proportion tests → z-score thresholds → repness metrics → comment stats → selection → consensus → clustering stability → priorities. This way, after each fix, we re-run and compare up to that point. + +**After each PR**: re-check the full baseline. Document results in the journal. Some earlier fixes may resolve downstream failures. + +--- + +### PR 0: Generate Missing Clojure Blobs + Test Infrastructure Setup + +**Pre-requisite** before any fixes. + +1. Start prodclone Postgres if not running +2. 
Run `generate_cold_start_clojure.py --all --include-local --pause-math` to generate cold-start blobs for all private datasets +3. Create `tests/test_discrepancy_fixes.py` with the test infrastructure: + - One test class per discrepancy, parametrized by ALL datasets + - Shared comparison utilities that call the same core comparer logic as `regression_comparer.py` + - Synthetic edge-case test scaffolding +4. Create `delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md` with initial baseline +5. Update `delphi/CLAUDE.md` with TypeScript/Delphi connection documentation + +--- + +### PR 1: Fix D2 — In-Conv Participant Threshold + +**Why first**: This is the **first step in the pipeline** — it gates which participants enter clustering. Wrong threshold → wrong participants → wrong clusters → wrong everything downstream. + +**File**: `delphi/polismath/conversation/conversation.py` + +**Current**: `threshold = 7 + sqrt(n_cmts) * 0.1` → biodiversity (314 comments): threshold=8.8, keeps 428/536. +**Target**: `threshold = min(7, n_cmts)` → threshold=7, more participants qualify. + +**Additional changes**: +- Use `self.raw_rating_mat` instead of `self.rating_mat` (count on unmoderated matrix) +- Add greedy fallback (top-15 voters if <15 qualify) +- Add monotonic persistence (once in, always in) + +**Test-first approach**: +1. Add test comparing in-conv participant count/set between Python and Clojure for ALL datasets +2. Add synthetic edge-case tests: tiny conversation (fewer comments than threshold), conversation where greedy fallback triggers +3. Run tests → expect failure +4. Apply fix +5. Re-run ALL tests — cluster count may change (possibly 3→2 for biodiversity, matching Clojure) +6. **Cluster comparison test (CRITICAL)**: After fixing D2, add a test that compares Python and Clojure cluster assignments at the participant level — number of clusters AND which participants are in which cluster (using Jaccard similarity or exact match). 
The Python clustering calls sklearn K-means; Clojure uses a two-level approach. If clusters still don't match after the threshold fix, investigate why and potentially create an additional PR (PR 1b) to fix the remaining clustering discrepancy before moving on to repness fixes. +7. Document new baseline in journal + +--- + +### PR 2: Fix D4 — Pseudocount Formula + +**Why before D9**: The pseudocount affects probability estimates (`pa`, `pd`), which feed into proportion tests and z-scores. Fixing this first gives us correct probability values, even if the z-score thresholds are still wrong. This follows pipeline execution order: probabilities are computed before significance testing. + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current**: `PSEUDO_COUNT = 1.5` → `pa = (na + 0.75) / (ns + 1.5)` (Beta(1.75,1.75) prior) +**Target**: `PSEUDO_COUNT = 2.0` → `pa = (na + 1) / (ns + 2)` (Beta(2,2) prior, matching Clojure) + +**Test-first approach**: +1. Add unit test: for known (na, ns) pairs from Clojure math blob repness, verify `pa` values match +2. Compare `pa`/`pd` values for specific (group, comment) pairs across ALL datasets +3. Fix: Change `PSEUDO_COUNT = 2.0` +4. Document which repness values improved and which are still off (due to remaining D5/D6/D7/D9) + +--- + +### PR 3: Fix D9 — Z-Score Significance Thresholds + +**Why next**: Now that probabilities are correct (D4 fixed), we fix the significance gates. This is likely why `comment_repness` is empty — Python requires 28% higher z-scores. + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current**: `Z_90 = 1.645` (two-tailed), `Z_95 = 1.96` (two-tailed) +**Target**: `Z_90 = 1.2816` (one-tailed), `Z_95 = 1.6449` (one-tailed) — matching Clojure + +**Note**: Python's own `stats.py` already uses 1.2816. This fix resolves an internal inconsistency. + +**Test-first approach**: +1. Add test checking repness produces non-empty `comment_repness` for ALL datasets +2. 
Add test comparing repness values (pa, pat, ra, rat, overall metric) to Clojure — expect partial match (D5/D6/D7 still unfixed, so z-scores will differ, but more comments should now pass significance) +3. Fix: Change constants +4. Re-run full suite — `test_basic_outputs` may now pass +5. Document which repness values now match Clojure and which still differ + +--- + +### PR 4: Fix D5 — Proportion Test Formula + +**Why next**: With correct probabilities (D4) and thresholds (D9), now fix the proportion test formula itself. This changes ALL `pat`/`pdt` z-scores. + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Analysis**: + +| Aspect | Python (current) | Clojure (target) | +|--------|-----------------|------------------| +| Formula | `(p - 0.5) / sqrt(0.25/n)` | `2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)` | +| Type | Standard z-test | Wilson-score-like with built-in continuity correction | +| Pseudocount | None in test (applied separately) | Built-in Beta(1,1) prior: `(succ+1)/(n+1)` | +| Small n | Can produce extreme z-scores | Regularized — `+1` terms shrink extremes | + +The Clojure formula is preferable for Polis: small group sizes are common, and the built-in regularization prevents spurious significance. + +**Code comments**: Check Clojure source (`stats.clj`) for explanatory comments and copy them to Python. Add clear docstring explaining what this test does, why the Clojure formula is used, and how it differs from a standard z-test. + +**Signature change**: `prop_test(p, n, p0)` → `prop_test(succ, n)`. All callers need updating including `prop_test_vectorized`. + +**Test-first approach**: +1. Unit tests: for known (succ, n) pairs, compare to Clojure formula output +2. Compare pat/pdt values against Clojure blob for ALL datasets +3. Fix: Replace both scalar and vectorized implementations, update all callers +4. 
Document improvements + +--- + +### PR 5: Fix D6 — Two-Proportion Test Adjustment + +**Why next**: Similar to D5, this fixes the group-vs-other z-scores (`rat`/`rdt`). + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current**: Standard two-proportion z-test, no pseudocounts +**Target**: Add +1 pseudocount to all 4 inputs (succ1, n1, succ2, n2), matching Clojure + +**Signature change**: From proportions to counts. All callers (including `two_prop_test_vectorized`) need updating. + +**Test-first approach**: +1. Unit tests with known values, compare to Clojure formula +2. Compare rat/rdt values against Clojure blob for ALL datasets +3. Fix: Add pseudocounts, change signatures, update callers + +--- + +### PR 6: Fix D7 — Repness Metric Formula + +**Why next**: With all underlying statistics now correct (D4, D5, D6, D9), fix the metric that ranks comments. + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Analysis**: + +| Aspect | Python (current) | Clojure (target) | +|--------|-----------------|------------------| +| Formula | `pa * (\|pat\| + \|rat\|)` | `ra * rat * pa * pat` | +| Type | Weighted sum of 2 absolute values | Product of 4 signed values | +| Zero sensitivity | Tolerant | Strict — any factor near 0 kills metric | +| Behavior | High probability OR high test score | Requires ALL dimensions to be strong | + +The Clojure product formula is more conservative and arguably better for finding truly representative comments. + +**Possibly unintentional change**: The Python formula may have been an LLM hallucination during the Clojure→Python port. Team has been asked to confirm. **Preserve the current formula behind a flag** with a TODO comment to delete or keep once confirmed. Default to Clojure formula for regression tests. + +**Test-first approach**: +1. Extract Clojure repness scores from math blob for ALL datasets +2. Compare agree_metric/disagree_metric per (group, comment) +3. 
Fix: Change to `ra * rat * pa * pat` (both scalar and vectorized), keep old formula behind flag +4. Verify ranking order matches Clojure + +--- + +### PR 7: Fix D8 — Finalize Comment Stats Logic + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current**: `if pa > 0.5 AND ra > 1.0 → 'agree'; elif pd > 0.5 AND rd > 1.0 → 'disagree'` +**Target**: Simple `rat > rdt → 'agree'; else → 'disagree'` (matching Clojure) + +**Test-first approach**: +1. Test: verify `repful` classification matches Clojure across ALL datasets +2. Fix: Replace threshold checks with `rat > rdt` (both scalar and vectorized) + +--- + +### PR 8: Fix D10 — Representative Comment Selection + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Pre-investigation needed**: Dig into `client-report/src/components/lists/participantGroups.jsx` to understand exactly how it consumes `repness`. The legacy report uses `math["repness"][gid]`, the narrative report uses DynamoDB topics. Determine if changing selection logic (3+2 → 5 total, agrees first) would break the legacy report display. + +**Approach**: Implement Clojure's selection as default, keep current selection behind `DELPHI_LEGACY_REPNESS_SELECTION=true` env var (update `example.env`). TODO comment: remove legacy mode once verified in production. + +**Test-first approach**: +1. Compare selected rep comments per group to Clojure's repness output for ALL datasets +2. Fix: Match Clojure's single-pass with beats-best-by-test, up to 5 total (agrees first) + +--- + +### PR 9: Fix D11 — Consensus Comment Selection + +**File**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current**: ALL groups `pa > 0.6`, top 2 overall +**Target**: Per-comment `pa > 0.5`, top 5 agree + 5 disagree with z-test scores + +Same legacy-mode flag approach as D10. TODO comment + GitHub Issue (upstream) proposing consolidation of all math into Python, with permalinks to the TypeScript code that would be simplified. + +**Test-first approach**: +1. 
Compare consensus comments to Clojure's `consensus` blob key for ALL datasets +2. Fix: Match Clojure's logic + +--- + +### PR 10: Fix D3 — K-Smoother Buffer + +**File**: `delphi/polismath/conversation/conversation.py` + +**This is a temporal stability feature**: Clojure requires k to be best for 4 consecutive updates before switching. Without it, groups flicker. + +**WARNING**: The k-smoother makes `k` depend on the full history of updates, which we don't have in cold-start math blobs. Cold-start blobs are single-shot. Therefore: + +- **Cold-start test**: Verify the smoother doesn't change cold-start behavior (buffer=4 means first k is always accepted) +- **Incremental test**: Feed votes in batches, verify k doesn't flicker — Python-only test, no Clojure comparison possible without replay infrastructure + +If we discover that cold-start fixes can't be completed without warm-start testing, we'll need the replay infrastructure earlier (see Replay Infrastructure section below). + +**Test approach**: +1. Cold-start: verify same results as without smoother (first run always accepts k) +2. Incremental: feed votes in batches, assert k stability +3. Fix: Add `group_k_smoother` state with buffer=4 + +--- + +### PR 11: Fix D12 — Comment Priorities + +**Files**: `delphi/polismath/conversation/conversation.py`, `delphi/polismath/pca_kmeans_rep/pca.py` + +**Critical for production**: Without comment priorities, TypeScript server falls back to uniform random selection. Currently `_compute_votes_base()` is dead code with a bug (line 1076: invalid pandas syntax). + +This replaces the reading from the math blob with a proper Python computation matching Clojure. + +**Test-first approach**: +1. Remove xfail from existing `test_comment_priorities` +2. Extract `comment-priorities` from Clojure math blob, compare rankings (Spearman correlation) for ALL datasets +3. Implement: comment projection/extremity in PCA, `importance_metric`, `priority_metric`, full computation +4. 
Fix buggy `_compute_votes_base()` method + +--- + +### PR 12: Fix D15 — Moderation Handling + +**File**: `delphi/polismath/conversation/conversation.py` + +Clojure zeros out moderated comments (keeps structure, sets values to 0). Python removes them entirely. Affects matrix dimensions and vote counts. + +**Test-first approach**: +1. Add test with dataset containing moderated comments, compare behavior +2. Fix or document the difference + +--- + +### PR 13: Fix D1/D1b — PCA Sign Flip Prevention + +**Files**: `delphi/polismath/pca_kmeans_rep/pca.py` + +**This MUST be fixed**: We cannot afford sign flips between updates — they cause participants to appear to jump to the opposite side of the visualization. + +**Approach**: +1. Add a test that detects sign flips (feed incremental votes, check projection consistency) +2. Implement sign-consistency logic: after computing PCA, compare to previous components and flip signs to maintain consistency +3. Consider `IncrementalPCA` for warm-starting (also addresses D14 for free) + +This is non-trivial and should be one of the last fixes. + +--- + +### PR 14: Cleanup — Remove Dead Code + +**Files**: Multiple (see `08-dead-code.md`) +- Custom kmeans chain in `clusters.py` +- Non-vectorized repness functions in `repness.py` +- Buggy `_compute_votes_base()` (after D12 replaces it) +- `stats.py` inconsistencies (after D9 makes `repness.py` authoritative) + +By this point, we should have good test coverage from all the per-discrepancy tests, giving confidence that dead code removal won't break anything. Still run full regression suite after cleanup. + +--- + +### Explicitly Deferred + +- **D13 — Subgroup Clustering**: Not implemented in Python, never used by TypeScript consumers. No fix needed. +- **D14 — Large Conversation Optimization**: Clojure needed mini-batch PCA because it was slow. Python's sklearn SVD is fast enough. No fix needed unless we adopt `IncrementalPCA` for sign consistency (PR 13). 
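The sign-consistency idea behind PR 13 can be sketched in a few lines. This is a minimal sketch, not the actual `pca.py` implementation: the function name `align_signs` and the array layout (components as rows) are assumptions for illustration.

```python
import numpy as np

def align_signs(prev_components: np.ndarray, new_components: np.ndarray) -> np.ndarray:
    """Flip any newly computed principal component that points away from its
    predecessor, so participants do not appear to jump to the opposite side
    of the visualization between updates.

    Sketch only: in the real fix, the corresponding participant projections
    must be negated along with each flipped component.
    """
    aligned = new_components.copy()
    for i, prev in enumerate(prev_components):
        # A negative dot product means the component reversed direction
        # between updates; negating it restores continuity.
        if np.dot(prev, aligned[i]) < 0:
            aligned[i] = -aligned[i]
    return aligned
```

Because PCA components are only defined up to sign, this kind of alignment against the previous update is a standard remedy; whether it lives as a post-processing pass or inside an `IncrementalPCA`-based warm start is exactly the design question PR 13 has to settle.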
+ +--- + +### Future PRs (after parity) + +- **Python implementation of TypeScript routing logic**: `getNextPrioritizedComment`, `selectProbabilistically`, returning a pre-computed list of next comments so TypeScript becomes thin routing layer +- **GitHub Issue**: Document proposal to consolidate math in Python with permalinks to TypeScript code + +--- + +## Discrepancy Coverage Checklist + +| ID | Discrepancy | PR | Status | +|----|-------------|-----|--------| +| D1 | PCA sign flips | PR 13 | Fix (sign consistency) | +| D1b | Projection input | PR 13 | Fix with D1 | +| D2 | In-conv threshold | **PR 1** | Fix | +| D3 | K-smoother buffer | PR 10 | Fix | +| D4 | Pseudocount formula | **PR 2** | Fix | +| D5 | Proportion test | PR 4 | Fix | +| D6 | Two-proportion test | PR 5 | Fix | +| D7 | Repness metric | PR 6 | Fix (with flag for old formula) | +| D8 | Finalize cmt stats | PR 7 | Fix | +| D9 | Z-score thresholds | **PR 3** | Fix | +| D10 | Rep comment selection | PR 8 | Fix (with legacy env var) | +| D11 | Consensus selection | PR 9 | Fix (with legacy env var) | +| D12 | Comment priorities | PR 11 | Fix (implement from scratch) | +| D13 | Subgroup clustering | — | **Deferred** (unused) | +| D14 | Large conv optimization | — | **Deferred** (Python fast enough) | +| D15 | Moderation handling | PR 12 | Fix | + +--- + +## Test Infrastructure + +### `tests/test_discrepancy_fixes.py` — New test file + +**Two types of tests per discrepancy**: + +1. **Real-data parametrized tests**: One test class per discrepancy, parametrized by ALL datasets. Loads conversation + Clojure math blob, checks specific aspect, designed to FAIL before fix and PASS after. + +2. **Synthetic edge-case tests**: When we discover an edge case in a real conversation, extract the pattern into a test with completely made-up data (never real data from private datasets). These run fast, document the edge case clearly, and prevent regressions. 
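As a concrete example of the synthetic style, the Clojure-target proportion test from PR 4 (D5) can be pinned down with made-up counts. The function below is a sketch of the target formula from the PR 4 analysis table, not the actual `repness.py` signature:

```python
import math

def prop_test(succ: int, n: int) -> float:
    """Clojure-style regularized proportion test (target of PR 4):
    2 * sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5).
    The built-in +1 pseudocounts shrink extreme z-scores at small n.
    """
    return 2.0 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)

# Made-up counts, never real data: a unanimous group of 9 agrees.
z_unanimous = prop_test(9, 9)  # 2*sqrt(10)*0.5 = sqrt(10), about 3.162
# A single agreeing vote cannot reach one-tailed 95% significance (z >= 1.6449):
z_single = prop_test(1, 1)     # sqrt(2), about 1.414
```

Fixing a handful of such (succ, n) pairs as assertions documents the intended regularization behavior and guards the formula against later refactors.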
+ +**All datasets must have cold start Clojure blobs** — generate missing ones using `generate_cold_start_clojure.py` before starting fixes. + +### Shared comparison logic + +Both `pytest` tests and `regression_comparer.py` should use the **same core comparison logic**. The pytest tests call it with assertions; the script calls it with report generation. This ensures consistency and reduces duplication. Extend `ClojureComparer` with methods like: +- `compare_in_conv_participants()` — participant set comparison +- `compare_repness_values()` — per-(group, comment) metric comparison +- `compare_consensus()` — consensus comment overlap +- `compare_priorities()` — priority ranking correlation (Spearman) + +### Replay Infrastructure (separate PRs, needed for D3/D1) + +Three separate PRs for temporal/incremental testing: + +**Replay PR A**: Core replay infrastructure +- Script that takes a conversation's votes CSV, splits into batches of configurable size +- For each batch: feed votes to Python `Conversation.update()`, record state +- For each batch: feed votes to Clojure via DB + poller, record math blob +- Compare Python vs Clojure state after each batch +- Requires separate test Postgres database (not touching main environment) +- Generation can be done once but must be scripted for reproducibility + +**Replay PR B**: Use replay infrastructure for tests +- Test D3 (k-smoother stability) with real incremental data +- Test D1 (PCA sign consistency) with real incremental data +- Compare Python's incremental behavior to Clojure's + +**Replay PR C**: Visualization movie generation +- Adapt existing static visualization scripts (`delphi/scripts/visualize_cluster_comparison.py`) for temporal replay +- Generate frame-by-frame visualizations showing conversation evolution +- Fixed x/y ranges across all frames for consistent animation +- Useful for debugging, tutorials, and showcasing + +--- + +## Dataset Size Strategy + +When iterating on a fix: +1. 
Start with **vw** (4.7K votes) — fastest feedback loop +2. Then **biodiversity** (30K votes) — another reference +3. Then private datasets (91K–1M) — progressively larger +4. Full suite with ALL datasets only after the fix is stable on the smaller ones + +All datasets are equally important. No single dataset is "the main" reference — each may expose different edge cases. + +### On Test Redundancy: `pytest` vs `regression_comparer.py` + +**Keep both** — they serve different purposes: + +| | `pytest` tests | `regression_comparer.py` | +|---|---|---| +| **Purpose** | Binary pass/fail | Detailed human-readable reports | +| **CI-friendly** | Yes | No (interactive use) | +| **When** | After each change (automated) | Investigating failures (manual) | + +**Critical**: Both must use the same core comparison logic (shared in `polismath/regression/`), so updating one automatically benefits the other. + +### Golden snapshot updates + +After each fix: `uv run python scripts/regression_recorder.py` + +--- + +## Verification After Each PR + +1. Run the targeted discrepancy test on **vw**, passing the relevant test pattern to `-k`: `uv run pytest tests/test_discrepancy_fixes.py -k <pattern> --datasets=vw` +2. If it passes, run on **biodiversity**: `--datasets=biodiversity` +3. Run on **all datasets**: `--include-local` +4. Run the full legacy suite: `uv run pytest tests/test_legacy_clojure_regression.py --include-local` — no new failures +5. Run the full regression suite: `uv run pytest tests/test_regression.py --include-local` — re-record golden snapshots if needed +6. Optional deep inspection: `uv run python scripts/regression_comparer.py --include-local` +7. Document the updated baseline in `CLJ-PARITY-FIXES-JOURNAL.md` +8. 
Write clear commit message and PR description with test results diff --git a/delphi/docs/deep-analysis-for-julien/01-overview-and-architecture.md b/delphi/docs/deep-analysis-for-julien/01-overview-and-architecture.md new file mode 100644 index 000000000..35da75cf7 --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/01-overview-and-architecture.md @@ -0,0 +1,159 @@ +# Polis Math Pipeline: Overview and Architecture + +## Document Index + +This analysis is split across multiple files for manageability: + +1. **01-overview-and-architecture.md** (this file) — High-level architecture, data flow, storage +2. **02-pca-analysis.md** — PCA implementations (Clojure vs Python), mathematical formulas +3. **03-clustering-analysis.md** — K-means clustering, silhouette, k-selection +4. **04-repness-analysis.md** — Representativeness metrics, statistical tests, consensus +5. **05-participant-filtering.md** — In-conv logic, comment priorities, vote structures +6. **06-comment-routing.md** — TypeScript comment routing, topical vs prioritized +7. **07-discrepancies.md** — All Clojure vs Python discrepancies (THE KEY DOCUMENT) +8. **08-dead-code.md** — Dead/unreachable code identified +9. **09-fix-plan.md** — Plan to bring Python to Clojure parity + +--- + +## 1. 
High-Level Architecture + +The Polis math pipeline takes raw participant votes on comments and produces: + +- **PCA projections**: 2D coordinates for each participant in opinion space +- **Clusters**: Hierarchical grouping (base clusters → group clusters → optional subgroups) +- **Representativeness**: Which comments best characterize each group +- **Consensus**: Comments that all groups broadly agree on +- **Comment priorities**: Scores used to route the next comment to a participant + +### 1.1 Two Implementations + +| Aspect | Legacy (Clojure) | Delphi (Python) | +|--------|-----------------|-----------------| +| Location | `math/src/polismath/math/` | `delphi/polismath/` | +| Framework | Plumbing Graph (DAG computation) | Imperative class methods | +| Matrix repr | Custom `NamedMatrix` | pandas `DataFrame` | +| PCA method | Power iteration (custom) | sklearn SVD | +| Status | **CORRECT reference** | Has behavioral gaps | + +### 1.2 Computation Flow + +**Clojure** (`conversation.clj` lines 136–702): +``` +votes → customs (cap ptpts/cmts) + → raw-rating-mat (NamedMatrix) + → rating-mat (zero out mod-out columns) + → mat (replace nil with column means) + → pca (power iteration) + → proj (sparsity-aware projection) + → proj-nmat + → in-conv (filter participants) + → base-clusters (k-means, k=100) + → base-clusters-proj → bucket-dists + → group-clusterings (k=2..max-k) + → group-clusterings-silhouettes + → group-k-smoother (buffer before switching k) + → group-clusters + → subgroup-clusterings → subgroup-clusters + → votes-base → group-votes → subgroup-votes + → comment-priorities + → group-aware-consensus + → repness, subgroup-repness + → consensus + → ptpt-stats, subgroup-ptpt-stats +``` + +**Python** (`conversation.py`, `Conversation` class): +``` +update_votes() → raw_rating_mat (DataFrame) + → _apply_moderation() → rating_mat + → _compute_vote_stats() + → recompute(): + → _compute_pca() + → _compute_clusters() + → _compute_repness() + → 
_compute_participant_info() +``` + +### 1.3 Key Architectural Differences + +1. **Plumbing Graph vs Imperative**: Clojure uses a declarative computation graph where each node declares its dependencies. Python uses sequential method calls. + +2. **Immutability**: Clojure's pipeline is functionally pure — each step produces new data. Python's `Conversation` class mutates `self` attributes. + +3. **Missing in Python**: subgroup clustering, comment priorities computation, large-conv mini-batch PCA, k-smoother buffer, proper consensus selection (Clojure's `consensus-stats` + `select-consensus-comments`). + +--- + +## 2. Data Storage + +### 2.1 PostgreSQL (Source of Truth) + +**Read by both implementations:** + +| Table | Purpose | Key columns | +|-------|---------|-------------| +| `votes` | Individual vote records | `zid, pid, tid, vote, created` | +| `comments` | All comments | `zid, tid, txt, mod, active, is_seed` | +| `math_main` | Serialized math results | `zid, data, math_tick, caching_tick` | +| `math_ticks` | Update tracking | `zid, math_env` | +| `worker_tasks` | Task queue | `zid, math_env, task_type` | + +**Vote sign convention at the boundary:** +- Postgres stores: `AGREE = -1`, `DISAGREE = 1` (historical Polis convention) +- Delphi internal: `AGREE = 1`, `DISAGREE = -1` (standard convention) +- `postgres.py` line ~: `postgres_vote_to_delphi()` flips signs at the boundary + +### 2.2 DynamoDB (Delphi Output) + +Delphi writes results to these DynamoDB tables (`dynamodb.py`): + +| Table | Content | +|-------|---------| +| `Delphi_PCAConversationConfig` | PCA config (center, components) | +| `Delphi_PCAResults` | Summary results per conversation | +| `Delphi_KMeansClusters` | Cluster assignments and centers | +| `Delphi_CommentRouting` | Comment priority scores for routing | +| `Delphi_RepresentativeComments` | Per-group representative comments | +| `Delphi_PCAParticipantProjections` | Per-participant 2D projections | + +### 2.3 Clojure Math Storage + +Clojure 
serializes its entire conversation state (including all computed fields) into the `math_main` table as a single large JSON blob under the `data` column. The TypeScript server reads this blob via `getPca(zid, 0)` to extract `comment-priorities` for routing. + +--- + +## 3. Pipeline Entry Points + +### 3.1 Clojure + +- **`conv_man.clj`**: Conversation manager, polls for new votes +- **`conversation.clj:conv-update`** (line 766): Dispatches to `small-conv-update` or `large-conv-update` based on: + - `ptpt-cutoff = 10000` participants + - `cmt-cutoff = 5000` comments +- `small-conv-update` = `eager-profiled-compiler(small-conv-update-graph)` +- `large-conv-update` = same graph but with mini-batch PCA override + +### 3.2 Python (Delphi) + +- **`poller.py`**: Polls Postgres for new votes/moderation/tasks on separate threads +- **`run_math_pipeline.py`**: CLI tool for one-shot processing +- **`manager.py`**: Thread-safe management of multiple `Conversation` objects +- **`conversation.py:Conversation.update_votes()`**: Main entry, calls `recompute()` + +### 3.3 Default Configuration (Clojure) + +From `conversation.clj` lines 142–152: +```clojure +{:n-comps 2 + :pca-iters 100 + :base-iters 100 + :base-k 100 + :max-k 5 + :group-iters 100 + :max-ptpts 100000 + :max-cmts 10000 + :group-k-buffer 4} +``` + +Python matches `BASE_K=100`, `MAX_K=5`, but is missing `group-k-buffer` entirely. diff --git a/delphi/docs/deep-analysis-for-julien/02-pca-analysis.md b/delphi/docs/deep-analysis-for-julien/02-pca-analysis.md new file mode 100644 index 000000000..0703e28d6 --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/02-pca-analysis.md @@ -0,0 +1,204 @@ +# PCA Implementation Analysis + +## 1. Mathematical Foundation + +Both implementations perform PCA on the vote matrix `V` of shape `(n_participants × n_comments)` where entries are `{-1, 0, 1, NaN}`. The goal is to find a 2D projection that captures maximum variance in participant opinions. 
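As a concrete illustration of this stage, here is a minimal numpy-only sketch of the shared pipeline shape (toy data; imputation and projection details are simplified relative to both implementations):

```python
import numpy as np

# Toy vote matrix: 4 participants x 3 comments; entries in {-1, 0, 1}, NaN = unvoted
V = np.array([
    [ 1.0,   -1.0,  np.nan],
    [ 1.0,  np.nan,  1.0],
    [-1.0,    1.0,  -1.0],
    [np.nan,  1.0,  -1.0],
])

# Impute unvoted entries with column means (as both implementations do)
col_means = np.nanmean(V, axis=0)
M = np.where(np.isnan(V), col_means, V)

# Center, then take the top-2 right singular vectors as principal axes
X = M - M.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
comps = Vt[:2]         # 2 x n_comments component loadings
proj = X @ comps.T     # n_participants x 2 opinion-space coordinates
print(proj.shape)      # (4, 2)
```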
+ +### 1.1 Missing Data Imputation + +Both implementations replace `NaN` (unvoted) entries with column means before PCA: + +**Clojure** (`conversation.clj` lines 356–377): +```clojure +;; replace? nil? ← only replaces nil, not 0 (pass votes) +column-averages = mean of non-nil values per column +mat[i][j] = column-averages[j] if mat[i][j] is nil +``` + +**Python** (`pca.py` lines 42–50): +```python +col_means = np.nanmean(matrix_data, axis=0) +nan_indices = np.where(np.isnan(matrix_data)) +matrix_data_no_nan[nan_indices] = col_means[nan_indices[1]] +``` + +**Status: MATCH** — Both impute with column means. Clojure replaces `nil?` (not `0`), Python replaces `NaN`. Since pass votes are stored as `0` in Clojure and also `0` in Python (with `NaN` for unvoted), these are equivalent. + +--- + +## 2. PCA Computation + +### 2.1 Clojure: Power Iteration (`pca.clj`) + +The Clojure implementation uses **power iteration** (also called the power method) to find eigenvectors of the covariance matrix iteratively. + +**Algorithm** (`pca.clj:powerit-pca`): + +Given centered data matrix `X` (n × p), find top-k eigenvectors: + +``` +For each component i = 1..k: + v_i ← random unit vector of dimension p + For iter = 1..100: + v_i ← X^T · X · v_i (multiply by covariance matrix) + v_i ← v_i - Σ_{j<i} (v_i·v_j) v_j (deflate: remove projections onto earlier components) + v_i ← v_i / ||v_i|| (normalize) +``` + +For large conversations (>10K participants), Clojure uses mini-batch PCA with learning rate 0.01; Python has no such optimization. + +--- + +## 3. Sparsity-Aware Projection + +Both implementations apply identical sparsity-aware scaling to projections. This is critical for fair treatment of participants who have voted on few comments.
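The power-iteration-with-deflation scheme described in §2.1 can be sketched in numpy as follows. This is an illustrative reimplementation, not the `pca.clj` code; the sanity check uses synthetic data with a deliberately well-separated spectrum so the iteration converges.

```python
import numpy as np

def powerit_pca(X, n_comps=2, iters=100, seed=0):
    """Top-n_comps eigenvectors of X^T X via power iteration with deflation."""
    rng = np.random.default_rng(seed)
    comps = []
    for _ in range(n_comps):
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)
        for _ in range(iters):
            v = X.T @ (X @ v)          # multiply by (unnormalized) covariance
            for u in comps:            # deflate: remove earlier components
                v -= (v @ u) * u
            v /= np.linalg.norm(v)
        comps.append(v)
    return np.array(comps)

# Sanity check against an exact SVD; column scaling gives a clear eigengap
X = np.random.default_rng(1).normal(size=(50, 8))
X *= np.array([4.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
X -= X.mean(axis=0)
comps = powerit_pca(X)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
# Components match the top-2 right singular vectors up to sign
print(np.allclose(abs(comps @ Vt[:2].T), np.eye(2), atol=1e-3))  # True
```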
+ +### 3.1 Mathematical Formula + +For participant `i` with projection `p_i = [p_{i,1}, p_{i,2}]`: + +``` +n_votes_i = number of non-NaN entries in row i +n_comments = total number of comments +scaling_i = sqrt(n_comments / max(n_votes_i, 1)) +scaled_projection_i = p_i × scaling_i +``` + +Participants who have voted on fewer comments get **pushed outward** (larger scaling factor), compensating for the bias toward the center caused by mean imputation. + +### 3.2 Clojure (`pca.clj:sparsity-aware-project-ptpt`, lines 134–157) + +```clojure +(defn sparsity-aware-project-ptpt [votes {:keys [comps center]}] + (let [n-cmnts (count votes) + [pc1 pc2] comps + [n-votes p1 p2] + (reduce + (fn [[n-votes p1 p2] [x-n cntr-n pc1-n pc2-n]] + (if x-n ;; if voted (non-nil) + (let [x-n' (- x-n cntr-n)] ;; subtract center + [(inc n-votes) (+ p1 (* x-n' pc1-n)) (+ p2 (* x-n' pc2-n))]) + [n-votes p1 p2])) ;; skip if nil (unvoted) + [0 0.0 0.0] + (zip votes center pc1 pc2))] + (* (Math/sqrt (/ n-cmnts (max n-votes 1))) [p1 p2]))) +``` + +**Critical detail**: Clojure projects using the RAW votes with nils. Unvoted entries are **skipped** in the dot product (contribute 0). The projection uses the `rating-mat` (raw matrix with nils), NOT the imputed `mat`. + +### 3.3 Python (`pca.py` lines 96–102) + +```python +n_cmnts = matrix_data.shape[1] +n_seen = np.sum(~np.isnan(matrix_data), axis=1) +n_seen_safe = np.maximum(n_seen, 1) +proportions = np.sqrt(n_seen_safe / n_cmnts) +scaled_projections = projections / proportions[:, np.newaxis] +``` + +**IMPORTANT**: Python divides by `sqrt(n_seen/n_cmnts)`, which equals multiplying by `sqrt(n_cmnts/n_seen)`. The scaling is mathematically equivalent to the Clojure formula. 
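A quick numeric check that the two scalings above are algebraically identical (numpy, illustrative values):

```python
import numpy as np

votes = np.array([
    [ 1.0, np.nan, -1.0, np.nan],   # voted on 2 of 4 comments
    [ 1.0,  1.0,   -1.0, -1.0],     # voted on all 4
])
raw_proj = np.array([[0.5, -0.2], [1.0, 0.3]])  # hypothetical unscaled projections

n_cmnts = votes.shape[1]
n_seen = np.sum(~np.isnan(votes), axis=1)

# Clojure form: multiply by sqrt(n_cmnts / max(n_votes, 1))
clj = raw_proj * np.sqrt(n_cmnts / np.maximum(n_seen, 1))[:, None]
# Python form: divide by sqrt(n_seen / n_cmnts)
py = raw_proj / np.sqrt(np.maximum(n_seen, 1) / n_cmnts)[:, None]

print(np.allclose(clj, py))  # True — the two forms are the same scaling
```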
+ +### 3.4 DISCREPANCY: Projection Input + +The scaling is equivalent, but the **projection computation** differs subtly: + +| | Clojure | Python | +|---|---------|--------| +| Projection input | Raw votes (nils skipped in dot product) | Fully imputed matrix (NaN → col mean) | +| Effect of unvoted | Contributes 0 to dot product | Contributes `(col_mean - center) × loading` | + +Since `center ≈ col_mean` (both computed on the imputed matrix), the unvoted contribution in Python is approximately 0. The two approaches produce very similar but not identical results. The difference grows when the distribution of voters per comment is highly skewed. + +**Status: NEAR-MATCH** — Scaling equivalent; projection input differs subtly. + +--- + +## 4. Comment Projection and Extremity + +**Clojure only** (`conversation.clj` lines 338–349): + +Clojure also projects comments into the PCA space and computes their "extremity" (distance from origin): + +```clojure +cmnt-proj = pca-project-cmnts(pca) ;; project each comment onto PC axes +cmnt-extremity[j] = ||cmnt-proj[j]|| ;; L2 norm of comment's projection +``` + +This extremity is used in the `priority-metric` for comment routing. + +**Python**: Does NOT compute comment projection or extremity. This is a **missing feature** that affects comment priority computation. + +--- + +## 5. Large Conversation Handling + +### 5.1 Clojure Mini-Batch PCA (`conversation.clj` lines 706–755) + +For conversations with >10K participants or >5K comments, Clojure uses **mini-batch PCA**: + +``` +sample_size = interpolate(n_ptpts, start=(100,1500), stop=(1500,150000)) +For iter = 1..pca_iters: + indices = random_sample(range(n_ptpts), size=sample_size) + subset = mat[indices, :] + part_pca = powerit_pca(subset, start_vectors=current_pca.comps, iters=10) + pca.center = 0.99 * pca.center + 0.01 * part_pca.center + pca.comps = 0.99 * pca.comps + 0.01 * part_pca.comps +``` + +The learning rate of 0.01 ensures gradual adaptation. 
The `sample-size-fn` linearly interpolates between (100 samples at 1500 participants) and (1500 samples at 150000 participants). + +### 5.2 Python + +**No large-conversation optimization exists.** Python always runs full sklearn PCA on the entire matrix. This could become a performance bottleneck for very large conversations. + +**Status: MISSING in Python** diff --git a/delphi/docs/deep-analysis-for-julien/03-clustering-analysis.md b/delphi/docs/deep-analysis-for-julien/03-clustering-analysis.md new file mode 100644 index 000000000..c7066184c --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/03-clustering-analysis.md @@ -0,0 +1,249 @@ +# Clustering Analysis + +## 1. Two-Level Hierarchical Clustering + +Both implementations use a two-level clustering approach: + +``` +Level 1: Participants → ~100 base clusters (k-means on PCA projections) +Level 2: Base clusters → 2-5 group clusters (weighted k-means on base cluster centers) +``` + +The base clustering reduces the computational cost of the group-level clustering while preserving the distribution of participants in opinion space. + +--- + +## 2. K-Means Implementation + +### 2.1 Clojure (`clusters.clj`) + +**Initialization** (`init-clusters`): +```clojure +;; Take first k distinct rows from the NamedMatrix in encounter order +(defn init-clusters [nmat k] + (->> (nm/get-matrix nmat) + distinct + (take k) + (mapv (fn [row] {:center row :members [] :id nil})))) +``` + +**Iteration** (`kmeans`): +``` +1. Initialize clusters (init-clusters or clean-start-clusters if previous exists) +2. Repeat up to max-iters: + a. Assign each point to nearest cluster center (Euclidean distance) + b. Update centers as weighted mean of members + c. Remove empty clusters + d. If converged (same clustering as previous step, threshold=0.01), stop +3. Sort by :id +``` + +**Clean-start** (`clean-start-clusters`): When previous clusters exist: +``` +1. Start with old cluster centers +2. 
Assign all current data points to nearest old center +3. If need more clusters: split largest cluster using most-distal point +4. If need fewer: merge closest pair +5. Run k-means from these starting positions +``` + +**Convergence** (`same-clustering?`, lines 68–76): Two clusterings are "same" if EVERY pairwise center distance is < threshold (0.01 default). Centers are sorted before comparison. + +### 2.2 Python (`clusters.py`) + +Python has TWO k-means implementations: + +**Custom `kmeans()`** (lines 383–419): Mirrors Clojure's approach: +```python +clusters = clean_start_clusters(data, k, last_clusters) +for _ in range(max_iters): + new_clusters = cluster_step(data, clusters, weights) + if same_clustering(clusters, new_clusters): + break + clusters = new_clusters +``` + +**sklearn `kmeans_sklearn()`** (lines 595–665): Used in the actual pipeline: +```python +init = _get_first_k_distinct_centers(data, k) # match Clojure init +kmeans = KMeans(n_clusters=k, init=init, n_init=1, algorithm='lloyd') +kmeans.fit(data, sample_weight=weights) +``` + +**Which is actually used?** The `_compute_clusters()` method in `conversation.py` (line 576) calls `kmeans_sklearn()`, NOT the custom `kmeans()`. The custom implementation is **dead code**. + +--- + +## 3. Silhouette Coefficient + +The silhouette coefficient measures clustering quality. For each point `i`: + +``` +a(i) = mean distance to other points in same cluster +b(i) = min over other clusters C: mean distance to points in C +s(i) = (b(i) - a(i)) / max(a(i), b(i)) +``` + +Overall silhouette = mean of all s(i). Range: [-1, 1]. Higher is better. 
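The formula above can be sketched directly (pure numpy, toy data; skipping singleton clusters is a simplifying assumption, as conventions for that case differ):

```python
import numpy as np

def silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: skipped in this sketch
        a = dists[i][same].mean()  # mean intra-cluster distance
        b = min(dists[i][labels == c].mean()  # nearest other cluster
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated blobs -> silhouette close to 1
pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(silhouette(pts, [0, 0, 1, 1]))
```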
+ +### 3.1 Clojure (`clusters.clj:silhouette`) + +Custom implementation using a **named distance matrix** (`named-dist-matrix`): +```clojure +;; Precomputes pairwise distances between all base cluster centers +;; Then computes silhouette using those distances +(defn silhouette [dist-matrix clustering] + ;; For each cluster, for each member: + ;; a = avg distance to same-cluster members + ;; b = min avg distance to other-cluster members + ;; s = (b - a) / max(a, b) + ;; Return mean of all s values + ) +``` + +### 3.2 Python + +**Custom `silhouette()`** (lines 444–503): Direct implementation matching Clojure logic. + +**sklearn `calculate_silhouette_sklearn()`** (lines 668–686): Wrapper around `sklearn.metrics.silhouette_score`. + +The actual pipeline uses `calculate_silhouette_sklearn()` (`conversation.py` line 633). + +--- + +## 4. Optimal k Selection + +### 4.1 Clojure: k-smoother with Buffer + +**Max k calculation** (`conversation.clj` lines 271–276): +```clojure +(defn max-k-fn [data max-max-k] + (min max-max-k + (+ 2 (int (/ (count (nm/rownames data)) 12))))) +;; max-max-k defaults to 5 +;; Example: 100 base clusters → max-k = min(5, 2 + 100/12) = min(5, 10) = 5 +``` + +**k-smoother** (`conversation.clj` lines 452–468): +``` +State: {last-k, last-k-count, smoothed-k} +Algorithm: + this-k = argmax_k silhouette(k) # best k for current data + if this-k == last-k: + this-k-count = last-k-count + 1 + else: + this-k-count = 1 + if this-k-count >= group-k-buffer (default 4): + smoothed-k = this-k # switch to new k + else: + smoothed-k = previous smoothed-k # keep old k +``` + +**Purpose**: Prevents the number of groups from flickering between values. A new k must be the best for **4 consecutive updates** before the system switches to it. 
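The buffer logic can be sketched as a small stateful helper (illustrative; the dict-based state and the initial `smoothed_k = 2` are assumptions, not the Clojure state representation):

```python
def smooth_k(state, this_k, buffer=4):
    """Update {last_k, last_k_count, smoothed_k} given this update's best k."""
    count = state["last_k_count"] + 1 if this_k == state["last_k"] else 1
    smoothed = this_k if count >= buffer else state["smoothed_k"]
    return {"last_k": this_k, "last_k_count": count, "smoothed_k": smoothed}

state = {"last_k": None, "last_k_count": 0, "smoothed_k": 2}
ks = []
for best_k in [3, 3, 2, 3, 3, 3, 3]:   # best-by-silhouette k at each update
    state = smooth_k(state, best_k)
    ks.append(state["smoothed_k"])
print(ks)  # [2, 2, 2, 2, 2, 2, 3] — switches only after 4 consecutive wins
```

Note how the interloping `2` resets the count, so `3` must win four times in a row before the smoothed k changes.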
+ +### 4.2 Python: Direct Best Silhouette + +**Max k** (`conversation.py` line 614): +```python +max_k = min(MAX_K, 2 + len(base_clusters) // 12) +max_k = max(2, min(max_k, len(base_clusters))) +``` + +**k selection** (`conversation.py` lines 624–641): +```python +best_k = 2 +best_score = -1 +for k in range(2, max_k + 1): + labels, centers, members = kmeans_sklearn(base_centers, k=k, weights=base_weights) + score = calculate_silhouette_sklearn(base_centers, labels) + if score > best_score: + best_score = score + best_k = k +``` + +### 4.3 DISCREPANCY: No k-smoother in Python + +Python picks the best k by silhouette **every single update** without any buffering. This means: + +1. Groups can flicker between k=2 and k=3 (or other values) on consecutive updates +2. Participants may see their group assignment change frequently +3. The visualization becomes unstable + +**Impact**: HIGH — This is a user-visible instability. + +--- + +## 5. Subgroup Clustering + +### 5.1 Clojure (`conversation.clj` lines 480–570) + +For each group cluster, Clojure runs another round of clustering: +``` +For each group g: + Get base clusters belonging to g + For k = 2..max-k: + Run k-means on those base cluster centers + Apply k-smoother (same buffer logic) + Store best subgroup clustering +``` + +This produces a second level of hierarchy: groups → subgroups. + +### 5.2 Python + +**No subgroup clustering exists.** The `subgroup_clusters` attribute is initialized to `{}` and never populated. + +**Status: MISSING in Python** + +--- + +## 6. `determine_k()` Function — Python Dead Code + +Python has a `determine_k()` function (`clusters.py` lines 689–726) that uses a logarithmic heuristic: +```python +if n_rows < 10: return 2 +if n_rows >= 500: k = 2 + int(min(1, log2(n_rows) / 10)) +else: k = 2 + int(min(2, log2(n_rows) / 5)) +``` + +This function is **NOT used** in the actual pipeline. The pipeline uses the Clojure-matching `max_k` formula with silhouette selection. 
`determine_k()` is only called from `cluster_dataframe()`, which itself is not used in the main pipeline (the pipeline calls `kmeans_sklearn()` directly). + +--- + +## 7. Cluster Output Format + +### 7.1 Clojure + +Group clusters are sorted by `:id` and contain: +```clojure +{:id 0 + :center [x, y] + :members [bid1, bid2, ...] ;; base cluster IDs + :count n} +``` + +Base clusters contain: +```clojure +{:id 0 + :center [x, y] + :members [pid1, pid2, ...] ;; participant IDs + :count n} +``` + +### 7.2 Python + +Both levels use the same dict format: +```python +{'id': 0, 'center': [x, y], 'members': [...]} +``` + +Group clusters are sorted by size (descending) and re-assigned sequential IDs. This matches Clojure's behavior of sorting by `:id` after creation. + +### 7.3 Base Cluster Folding + +For serialization, Clojure uses a "folded" format: +```clojure +{:id [0 1 2 ...], :members [[...] [...] ...], :x [...], :y [...], :count [...]} +``` + +Python's `_fold_base_clusters()` (`conversation.py` lines 1250–1269) matches this format. diff --git a/delphi/docs/deep-analysis-for-julien/04-repness-analysis.md b/delphi/docs/deep-analysis-for-julien/04-repness-analysis.md new file mode 100644 index 000000000..da10ef297 --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/04-repness-analysis.md @@ -0,0 +1,377 @@ +# Representativeness (Repness) Analysis + +## 1. Overview + +The representativeness module determines which comments best characterize each opinion group. For each (group, comment) pair, it computes: + +1. **Proportion test**: Is this group's agreement on this comment significantly different from 50%? +2. **Representativeness ratio**: Does this group agree/disagree more than other groups? +3. **Two-proportion test**: Is the difference between group and others statistically significant? +4. **Composite metric**: Final ranking score combining the above. + +--- + +## 2. 
Vote Counting + +### 2.1 Clojure (`repness.clj:comment-stats-graphimpl`) + +```clojure +;; CRITICAL: Vote signs are INVERTED in Clojure repness +;; na counts votes == -1 (which is AGREE in Postgres convention) +;; nd counts votes == 1 (which is DISAGREE in Postgres convention) +;; Comment in code: "Change when we flip votes" +na = count of votes == -1 in group +nd = count of votes == 1 in group +ns = na + nd ;; total directional votes (excludes pass/nil) +``` + +### 2.2 Python (`repness.py:comment_stats`, line 133) + +```python +n_agree = np.sum(group_votes == AGREE) # AGREE = 1 +n_disagree = np.sum(group_votes == DISAGREE) # DISAGREE = -1 +n_votes = n_agree + n_disagree +``` + +### 2.3 DISCREPANCY: Vote Sign Convention + +This is a **critical semantic difference**. In Clojure, `na` counts votes equal to `-1`, which in the Postgres convention IS the agree vote. In Python, `na` counts votes equal to `+1` (AGREE constant). + +Since Delphi flips signs at the Postgres boundary (`postgres_vote_to_delphi()`), AGREE becomes `+1` internally. So both are counting "agrees" — just detecting them with different numeric values. **This is consistent** as long as the sign flip is correctly applied at the boundary. + +**Status: CONSISTENT** (but confusing and fragile) + +--- + +## 3. Pseudocount Smoothing (Bayesian Prior) + +### 3.1 Clojure (`repness.clj`) + +```clojure +pa = (1 + na) / (2 + ns) +pd = (1 + nd) / (2 + ns) +``` + +This is equivalent to adding 1 virtual agree and 1 virtual disagree to the counts (a Beta(2,2) prior, or equivalently Laplace smoothing with pseudocount=2). 
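A quick numeric illustration of this smoothing, alongside the Python variant with `PSEUDO_COUNT = 1.5` covered in the next subsection:

```python
def smooth_clj(n_agree, n_votes):
    """Clojure: (1 + na) / (2 + ns)."""
    return (1 + n_agree) / (2 + n_votes)

def smooth_py(n_agree, n_votes, pseudo=1.5):
    """Python: (na + pseudo/2) / (ns + pseudo)."""
    return (n_agree + pseudo / 2) / (n_votes + pseudo)

# 2 agrees out of 3 directional votes
print(round(smooth_clj(2, 3), 3))  # 0.6
print(round(smooth_py(2, 3), 3))   # 0.611
print(smooth_clj(0, 0))            # 0.5 — smoothing also avoids 0/0
```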
+ +### 3.2 Python (`repness.py`, lines 138–139) + +```python +PSEUDO_COUNT = 1.5 +pa = (n_agree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) + = (n_agree + 0.75) / (n_votes + 1.5) +pd = (n_disagree + PSEUDO_COUNT/2) / (n_votes + PSEUDO_COUNT) + = (n_disagree + 0.75) / (n_votes + 1.5) +``` + +### 3.3 DISCREPANCY: Pseudocount Values + +| | Clojure | Python | +|---|---------|--------| +| Numerator add | +1 | +0.75 | +| Denominator add | +2 | +1.5 | +| Effective prior | Beta(2,2) | Beta(1.75,1.75) | + +**Impact**: Python's weaker prior pulls probabilities less strongly toward 0.5. For small sample sizes (e.g., n=3 votes, 2 agrees): +- Clojure: pa = (1+2)/(2+3) = 0.60 +- Python: pa = (2+0.75)/(3+1.5) = 0.611 + +The effect diminishes with larger samples but matters for small groups. + +--- + +## 4. Proportion Test (p-test) + +### 4.1 Clojure (`stats.clj:prop-test`) + +```clojure +(defn prop-test [succ n] + (* 2 (Math/sqrt (inc n)) (+ (/ (inc succ) (inc n)) -0.5))) +``` + +Expanding: `z = 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)` + +This is a **custom formula** that: +- Adds 1 to both `succ` and `n` (pseudocount) +- Tests departure from 0.5 +- Uses `2*sqrt(n+1)` as the scaling factor instead of standard error + +### 4.2 Python (`repness.py:prop_test`, lines 64–86) + +```python +def prop_test(p, n, p0): + se = math.sqrt(p0 * (1 - p0) / n) + return (p - p0) / se +``` + +This is the **standard one-proportion z-test**: `z = (p - p0) / sqrt(p0*(1-p0)/n)` + +When `p0 = 0.5`: `z = (p - 0.5) / sqrt(0.25/n) = (p - 0.5) / (0.5/sqrt(n)) = 2*sqrt(n)*(p - 0.5)` + +### 4.3 DISCREPANCY: Proportion Test Formula + +| | Clojure | Python | +|---|---------|--------| +| Formula | `2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)` | `(p - 0.5) / sqrt(0.25/n)` | +| Pseudocount | +1/+1 built into test | None (uses pre-smoothed p) | +| Denominator | `1/(2*sqrt(n+1))` | `0.5/sqrt(n)` | + +For n=10, succ=8: +- Clojure: `2*sqrt(11)*((9/11)-0.5) = 2*3.317*0.318 = 2.11` +- Python (with pa from Clojure's smoothing): 
p=9/12=0.75, `(0.75-0.5)/sqrt(0.25/10) = 0.25/0.158 = 1.58` + +**Impact**: Clojure's test is more liberal (produces larger z-scores), making it easier to reach significance. + +### 4.4 Note: Python has TWO different prop_test functions + +There is also `stats.py:prop_test` which uses yet another formula: +```python +# stats.py version (NOT used by repness) +succ_adjusted = succ + 1 +n_adjusted = n + 2 +p = succ_adjusted / n_adjusted +se = math.sqrt(p * (1 - p) / n_adjusted) +z = (p - 0.5) / se +``` + +This third variant is not used in the repness pipeline. It exists in `stats.py` but is **dead code** relative to repness computation. + +--- + +## 5. Two-Proportion Test (Representativeness Test) + +### 5.1 Clojure (`stats.clj:two-prop-test`) + +```clojure +(defn two-prop-test [succ1 n1 succ2 n2] + (let [succ1 (inc succ1) n1 (inc n1) + succ2 (inc succ2) n2 (inc n2) + p1 (/ succ1 n1) p2 (/ succ2 n2) + p (/ (+ succ1 succ2) (+ n1 n2))] + (/ (- p1 p2) + (Math/sqrt (* p (- 1 p) (+ (/ 1 n1) (/ 1 n2))))))) +``` + +Key: Increments ALL four inputs by 1 before computing. + +### 5.2 Python (`repness.py:two_prop_test`, lines 89–115) + +```python +def two_prop_test(p1, n1, p2, n2): + p = (p1 * n1 + p2 * n2) / (n1 + n2) # pooled proportion + se = math.sqrt(p * (1 - p) * (1/n1 + 1/n2)) + return (p1 - p2) / se +``` + +Key: Takes proportions directly (already smoothed), no additional pseudocount. + +### 5.3 DISCREPANCY: Two-Proportion Test + +Clojure adds pseudocounts (+1 to each of 4 parameters) to prevent division by zero and stabilize estimates. Python takes pre-smoothed proportions and raw counts without additional adjustment. + +**Impact**: Different z-scores for the same data, especially at small sample sizes. Clojure is more conservative (z-scores pulled toward 0 by the +1 adjustments). + +--- + +## 6. 
Representativeness Ratio + +### 6.1 Clojure (`repness.clj:add-comparitive-stats`) + +```clojure +ra = pa_group / ((1 + sum(na_rest)) / (2 + sum(ns_rest))) +rd = pd_group / ((1 + sum(nd_rest)) / (2 + sum(ns_rest))) +``` + +The denominator for "others" uses the same pseudocount formula as group stats. + +### 6.2 Python (`repness.py:add_comparative_stats`, lines 172–173) + +```python +ra = pa_group / other_pa # other_pa already computed with PSEUDO_COUNT smoothing +rd = pd_group / other_pd +``` + +**Status: STRUCTURAL MATCH** — Both compute the ratio of group probability to "other" probability. The numeric values differ due to different pseudocount formulas (see Section 3). + +--- + +## 7. Repness Metric (Composite Score) + +### 7.1 Clojure (`repness.clj:repness-metric`) + +```clojure +(defn repness-metric [{:keys [repness repness-test p-success p-test]}] + (* repness repness-test p-success p-test)) +``` + +This is a **PRODUCT of 4 values**: +- `repness` = representativeness ratio (ra or rd) +- `repness-test` = two-proportion z-score (rat or rdt) +- `p-success` = agreement probability (pa or pd) +- `p-test` = proportion z-score (pat or pdt) + +### 7.2 Python (`repness.py:repness_metric`, lines 189–210) + +```python +def repness_metric(stats, key_prefix): + p = stats[f'p{key_prefix}'] # pa or pd + p_test = stats[f'p{key_prefix}t'] # pat or pdt + r = stats[f'r{key_prefix}'] # ra or rd (unused!) + r_test = stats[f'r{key_prefix}t'] # rat or rdt + p_factor = p if key_prefix == 'a' else (1 - p) + return p_factor * (abs(p_test) + abs(r_test)) +``` + +This is `p_factor * (|p_test| + |r_test|)` — a **weighted SUM of 2 absolute z-scores**. 
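A side-by-side sketch makes the behavioral difference concrete (the stats values are hypothetical):

```python
def repness_metric_clj(repness, repness_test, p_success, p_test):
    """Clojure: product of all four factors."""
    return repness * repness_test * p_success * p_test

def repness_metric_py(p, p_test, r_test):
    """Python: p-weighted sum of two absolute z-scores (the ratio itself unused)."""
    return p * (abs(p_test) + abs(r_test))

# Hypothetical agree-direction stats for one (group, comment) pair
pa, pat, ra, rat = 0.8, 2.5, 1.6, 1.9

print(repness_metric_clj(ra, rat, pa, pat))  # ~6.08
print(repness_metric_py(pa, pat, rat))       # ~3.52

# A near-zero factor collapses the Clojure product but barely moves the Python sum
print(repness_metric_clj(ra, 0.05, pa, pat))  # ~0.16
print(repness_metric_py(pa, pat, 0.05))       # ~2.04
```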
+ +### 7.3 DISCREPANCY: Repness Metric Formula + +| | Clojure | Python | +|---|---------|--------| +| Formula | `ra * rat * pa * pat` | `pa * (|pat| + |rat|)` | +| Combination | Product of 4 | Weighted sum of 2 | +| Uses ratio | Yes (as multiplier) | No (only ratio's z-test) | +| Uses abs | No | Yes | + +**Impact**: SEVERE — These produce fundamentally different rankings. The Clojure product heavily penalizes any factor near zero, while Python's sum is more forgiving. Additionally, Python takes absolute values of z-scores, meaning it doesn't distinguish direction. + +--- + +## 8. Finalize Comment Stats (Agree vs Disagree) + +### 8.1 Clojure (`repness.clj:finalize-cmt-stats`) + +```clojure +;; Simple comparison: which representativeness test is larger? +(if (> rat rdt) + ;; agree is more representative + {:repful :agree, :repness ra, :repness-test rat, ...} + ;; disagree is more representative + {:repful :disagree, :repness rd, :repness-test rdt, ...}) +``` + +Decision criterion: `rat > rdt` — whichever direction has a larger representativeness z-score wins. + +### 8.2 Python (`repness.py:finalize_cmt_stats`, lines 213–243) + +```python +if pa > 0.5 and ra > 1.0: + result['repful'] = 'agree' +elif pd > 0.5 and rd > 1.0: + result['repful'] = 'disagree' +else: + if agree_metric >= disagree_metric: + result['repful'] = 'agree' + else: + result['repful'] = 'disagree' +``` + +Decision criterion: Priority to agree if `pa > 0.5 AND ra > 1.0`, then disagree if `pd > 0.5 AND rd > 1.0`, then fallback to comparing metrics. + +### 8.3 DISCREPANCY: Agree/Disagree Selection + +Clojure uses a simple `rat > rdt` comparison. Python adds threshold checks (`pa > 0.5`, `ra > 1.0`) that can override the metric-based decision. + +**Impact**: Comments near the boundary between agree and disagree may be classified differently. + +--- + +## 9. Significance Thresholds + +### 9.1 Clojure (`stats.clj`) + +```clojure +(defn z-sig-90? [z-val] (> z-val 1.2816)) ;; one-tailed 90% +(defn z-sig-95? 
[z-val] (> z-val 1.6449)) ;; one-tailed 95% +``` + +These are **one-tailed** z-values (P(Z > z) < alpha). + +### 9.2 Python — INTERNAL INCONSISTENCY + +**`repness.py`** (line 19): +```python +Z_90 = 1.645 # two-tailed 90% (or one-tailed 95%) +Z_95 = 1.96 # two-tailed 95% +``` + +**`stats.py`** (used elsewhere, not in repness): +```python +z_sig_90 threshold = 1.2816 # matches Clojure +``` + +### 9.3 DISCREPANCY: Z-score Thresholds + +| | Clojure | Python repness.py | Python stats.py | +|---|---------|------------------|-----------------| +| 90% threshold | 1.2816 (one-tailed) | 1.645 (two-tailed) | 1.2816 (one-tailed) | +| 95% threshold | 1.6449 (one-tailed) | 1.96 (two-tailed) | — | + +**Impact**: Python's `repness.py` requires MUCH higher z-scores to pass significance tests. Many comments that would be "significant" in Clojure will fail significance in Python, leading to fewer representative comments being selected. + +--- + +## 10. Representative Comment Selection + +### 10.1 Clojure (`repness.clj:select-rep-comments`) + +Single-pass algorithm: +``` +For each (group, comment) pair: + If repful == :agree: + If beats-best-by-test? OR (passes-by-test? AND beats-best-agr?): + Add to best-agrees + If repful == :disagree: + Similar logic for best-disagrees +Final: Take top 5 (agrees before disagrees) +``` + +### 10.2 Python (`repness.py:select_rep_comments`, lines 315–392) + +Separate lists: +```python +agree_comments = sorted(best_agree(all_stats), key=agree_metric, reverse=True) +disagree_comments = sorted(best_disagree(all_stats), key=disagree_metric, reverse=True) +selected = top 3 agrees + top 2 disagrees +``` + +### 10.3 DISCREPANCY: Selection Algorithm + +| | Clojure | Python | +|---|---------|--------| +| Agree count | Up to 5 total | 3 agrees + 2 disagrees | +| Algorithm | Single-pass with replacement | Sort-and-take | +| Criteria | beats-best-by-test?, passes-by-test? | pa > pd, significance tests | + +--- + +## 11. 
Consensus Comments + +### 11.1 Clojure (`repness.clj:consensus-stats` + `select-consensus-comments`) + +```clojure +;; consensus-stats: For each comment, compute pa with Bayesian smoothing +;; select-consensus-comments: +;; - format-stat adds test scores +;; - Takes top 5 agrees (pa > 0.5) and top 5 disagrees (pd > 0.5) +;; - Filtered by mod-out +``` + +### 11.2 Python (`repness.py:select_consensus_comments`, lines 413–454) + +```python +# Group by comment +# Check if ALL groups have pa > 0.6 +# Sort by average agreement +# Take top 2 +``` + +### 11.3 DISCREPANCY: Consensus Selection + +| | Clojure | Python | +|---|---------|--------| +| Criterion | Per-comment pa > 0.5 | ALL groups pa > 0.6 | +| Output | Top 5 agree + 5 disagree | Top 2 overall | +| Smoothing | Bayesian (same as repness) | Same | + +**Impact**: Python is much more restrictive (requires all groups to agree at 60%+) and returns fewer consensus comments. diff --git a/delphi/docs/deep-analysis-for-julien/05-participant-filtering.md b/delphi/docs/deep-analysis-for-julien/05-participant-filtering.md new file mode 100644 index 000000000..9b5ed383c --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/05-participant-filtering.md @@ -0,0 +1,278 @@ +# Participant Filtering, Comment Priorities, and Vote Structures + +## 1. Participant Filtering (in-conv) + +The "in-conv" filter determines which participants have voted enough to be included in clustering. This is critical because participants with too few votes produce unreliable PCA projections. 
+ +### 1.1 Clojure (`conversation.clj` lines 240–266) + +```clojure +:in-conv (plmb/fnk [conv user-vote-counts n-cmts] + (as-> (or (:in-conv conv) #{}) in-conv + ;; Step 1: Include anyone with >= min(7, n-cmts) votes + (into in-conv + (map first + (filter + (fn [[rowname cnt]] + (>= cnt (min 7 n-cmts))) + user-vote-counts))) + ;; Step 2: If fewer than 15 participants, greedily add top voters + (let [greedy-n 15 + n-in-conv (count in-conv)] + (if (< n-in-conv greedy-n) + (->> user-vote-counts + (remove (fn [[k v]] (in-conv k))) + (sort-by (comp - second)) + (map first) + (take (- greedy-n n-in-conv)) + (into in-conv)) + in-conv)))) +``` + +**Algorithm**: +1. Start with previous in-conv set (once in, always in) +2. Add any participant with `vote_count >= min(7, n_comments)` +3. If result has fewer than 15 participants, greedily add the highest-voting participants until reaching 15 + +**Key property**: Monotonic — once a participant is in-conv, they stay in-conv forever (across updates). + +### 1.2 Python (`conversation.py` lines 1225–1248) + +```python +def _get_in_conv_participants(self) -> Set[str]: + n_cmts = len(self.rating_mat.columns) + threshold = 7 + np.sqrt(n_cmts) * 0.1 + vote_counts = self._compute_user_vote_counts() + in_conv = {pid for pid, count in vote_counts.items() if count >= threshold} + return in_conv +``` + +**Algorithm**: +1. Threshold = `7 + sqrt(n_comments) * 0.1` +2. Include any participant with `vote_count >= threshold` +3. No greedy fallback +4. 
No persistence (recomputed from scratch each time) + +### 1.3 DISCREPANCY: In-Conv Filtering + +| Property | Clojure | Python | +|----------|---------|--------| +| Threshold | `min(7, n_cmts)` | `7 + sqrt(n_cmts) * 0.1` | +| Greedy fallback | Yes (top-15 voters) | No | +| Persistence | Monotonic (once in, always in) | Recomputed each time | +| At 100 comments | 7 votes needed | 8 votes needed | +| At 1000 comments | 7 votes needed | 10.2 votes needed | + +**Impact**: HIGH +- Python's threshold is HIGHER and GROWS with comment count, excluding more participants +- Python lacks the greedy fallback, so small conversations may have too few participants for clustering +- Python recalculates from scratch, so participants can drop OUT of in-conv if new comments are added (increasing the threshold) + +### 1.4 Note on to_dict() In-Conv + +Interestingly, `conversation.py:to_dict()` (lines 1637–1644) has a DIFFERENT in-conv calculation for the serialized output: + +```python +min_votes = min(7, self.comment_count) +for pid, count in result['user-vote-counts'].items(): + if count >= min_votes: + in_conv.append(pid) +``` + +This matches Clojure's threshold formula but is only used for serialization output, NOT for the actual clustering computation. The clustering uses `_get_in_conv_participants()` with the different formula. + +--- + +## 2. User Vote Counts + +### 2.1 Clojure (`conversation.clj` lines 217–225) + +```clojure +:user-vote-counts + (->> (mapv + (fn [rowname row] + [rowname (count (remove nil? row))]) + (nm/rownames raw-rating-mat) + (nm/get-matrix raw-rating-mat)) + (into {})) +``` + +Counts non-nil entries per row of the **raw** rating matrix (before moderation). 
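The raw-matrix counting above translates directly to pandas (a sketch; `user_vote_counts` is a hypothetical name, not the actual method):

```python
import pandas as pd

def user_vote_counts(raw_rating_mat: pd.DataFrame) -> dict:
    """Count non-missing votes per participant row of the RAW (pre-moderation) matrix."""
    return raw_rating_mat.notna().sum(axis=1).to_dict()
```

Counting on a moderated matrix instead would undercount participants whose votes fell on moderated-out comments.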
+ +### 2.2 Python (`conversation.py:_compute_user_vote_counts`, lines 1179–1223) + +```python +# Uses self.rating_mat (AFTER moderation, not raw) +non_nan_mask = ~np.isnan(self.rating_mat.values) +row_sums = np.sum(non_nan_mask, axis=1) +``` + +### 2.3 DISCREPANCY: Raw vs Moderated Matrix + +Clojure counts votes on the **raw** matrix; Python counts on the **moderated** matrix. If comments are moderated out, Python undercounts participant votes. + +**Impact**: A participant who voted on 8 comments (including 2 moderated-out) would have count=8 in Clojure but count=6 in Python, potentially dropping below the in-conv threshold. + +--- + +## 3. Customs (Participant/Comment Caps) + +### 3.1 Clojure (`conversation.clj` lines 164–186) + +```clojure +:customs (plmb/fnk [conv votes opts'] + (reduce + (fn [{:keys [pids tids] :as result} {:keys [pid tid] :as vote}] + (let [pid-room (< (count pids) (:max-ptpts opts')) ;; 100000 + tid-room (< (count tids) (:max-cmts opts')) ;; 10000 + pid-in (pids pid) + tid-in (tids tid)] + (if (and (or pid-room pid-in) + (or tid-room tid-in)) + ;; Accept vote + ... + ;; Reject vote + result))) + ...)) +``` + +This enforces hard caps: max 100,000 participants, max 10,000 comments. A vote is accepted only if: +- The participant is already known OR there's room for new participants +- AND the comment is already known OR there's room for new comments + +### 3.2 Python + +**No customs/cap mechanism exists.** All votes are accepted regardless of conversation size. + +**Status: MISSING in Python** — Could be a problem for very large conversations, but less critical than other discrepancies. + +--- + +## 4. Comment Priorities + +### 4.1 Clojure (`conversation.clj` lines 308–669) + +Comment priorities determine which comments are shown next to participants. 
The pipeline computes: + +**Step 1: Votes Base** (`votes-base`, lines 583–590) +``` +For each comment tid: + For each base cluster: + A[bid] = count of agree votes from cluster members + D[bid] = count of disagree votes from cluster members + S[bid] = count of total votes from cluster members +``` + +**Step 2: Group Votes** (`group-votes`, lines 278–305) +``` +For each group: + For each comment: + A = sum of A[bid] for base clusters in this group + D = sum of D[bid] ... + S = sum of S[bid] ... +``` + +**Step 3: Importance Metric** (lines 308–312) +``` +importance_metric(A, P, S, E) = (1 - p) * (E + 1) * a +where: + p = (P + 1) / (S + 2) ;; pass probability (Bayesian) + a = (A + 1) / (S + 2) ;; agree probability (Bayesian) + P = S - A - D ;; pass count + E = comment extremity ;; L2 norm of comment's PCA projection +``` + +**Step 4: Priority Metric** (lines 318–327) +``` +priority_metric(is_meta, A, P, S, E) = + if is_meta: + meta_priority^2 = 7^2 = 49 + else: + (importance_metric(A, P, S, E) * (1 + 8 * 2^(S/-5)))^2 +``` + +The `(1 + 8 * 2^(S/-5))` factor is a **novelty boost** that gives high priority to comments with few total votes (S). As S increases: +- S=0: factor = 1 + 8*1 = 9 +- S=5: factor = 1 + 8*0.5 = 5 +- S=10: factor = 1 + 8*0.25 = 3 +- S=20: factor = 1 + 8*0.0625 = 1.5 +- S=∞: factor → 1 + +The squaring amplifies differences between priorities. + +**Step 5: Comment Priorities** (lines 638–669) +``` +For each tid: + Sum A, D, S, P across all groups + Get extremity from PCA + priority = priority_metric(is_meta, total_A, total_P, total_S, extremity) +``` + +### 4.2 Python + +**Comment priorities are NOT computed in the main pipeline.** The `_compute_votes_base()` method exists (lines 1047–1093) but: + +1. It has a **BUG** on line 1076: `self.rating_mat[:, 'tid']` is invalid pandas syntax (should be `self.rating_mat[tid]`) +2. It is marked with `TODO(julien): why is that not called anywhere?` +3. The `recompute()` method does NOT call it +4. 
No `importance_metric` or `priority_metric` functions exist in Python +5. Comment extremity is not computed (no comment projection) + +**Status: MISSING in Python** — Comment priorities are critical for the TypeScript server's comment routing. Without them, all comments get equal weight (default priority = 1). + +--- + +## 5. Group-Aware Consensus + +### 5.1 Clojure (`conversation.clj` lines 614–636) + +```clojure +:group-aware-consensus + ;; For each (group, comment): + ;; prob = (A + 1.0) / (S + 2.0) ;; Laplace-smoothed agree probability + ;; For each comment: + ;; consensus = product of all group probabilities + ;; Result: {tid → consensus_value} +``` + +Mathematical formula for comment `j`: +``` +consensus(j) = Π_{g ∈ groups} [ (A_g(j) + 1) / (S_g(j) + 2) ] +``` + +This is a product of Bayesian-smoothed agreement probabilities across all groups. High consensus means ALL groups agree. + +### 5.2 Python (`conversation.py:_compute_group_aware_consensus`, lines 1291–1349) + +```python +# Same formula: consensus_value *= (agree_count + 1.0) / (total_count + 2.0) +``` + +**Status: MATCH** — Both compute the same product of Laplace-smoothed probabilities. + +Also implemented inline in `to_dict()` (lines 1596–1631) with identical logic. + +--- + +## 6. Votes-Base Structure + +### 6.1 Clojure + +```clojure +;; {tid {:A [a0 a1 a2 ...], :D [...], :S [...]}} +;; where indices correspond to base cluster IDs (sorted) +;; a0 = number of agree votes from base cluster 0 for this comment +``` + +This per-base-cluster granularity enables efficient group-level aggregation. + +### 6.2 Python (`to_dict()`, lines 1497–1521) + +```python +# {tid: {'A': total_agree, 'D': total_disagree, 'S': total_votes}} +# NOT per-base-cluster — just global totals +``` + +### 6.3 DISCREPANCY: Granularity + +Clojure's votes-base has per-base-cluster arrays; Python's has only global totals. This means Python cannot efficiently recompute group-level vote statistics when group assignments change. 
Python works around this by recomputing from the raw vote matrix each time. diff --git a/delphi/docs/deep-analysis-for-julien/06-comment-routing.md b/delphi/docs/deep-analysis-for-julien/06-comment-routing.md new file mode 100644 index 000000000..3dd233bea --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/06-comment-routing.md @@ -0,0 +1,274 @@ +# Comment Routing Analysis + +## 1. Overview + +Comment routing determines which comment a participant sees next when they're ready to vote. The TypeScript server (`server/src/nextComment.ts`) mediates between two routing strategies: + +1. **Prioritized routing**: Uses math-computed priority scores for weighted random selection +2. **Topical routing**: Uses Delphi-generated topic clusters filtered by participant preferences + +--- + +## 2. Entry Point + +### `getNextComment()` (`nextComment.ts` lines 339–381) + +```typescript +export async function getNextComment(zid, pid, withoutTids, lang) { + const ratio = Config.getValidTopicalRatio(); + // ratio is TOPICAL_COMMENT_RATIO env var, float [0,1] or null + + const shouldUseTopical = + typeof ratio === 'number' && ratio > 0 && + Math.random() < ratio && // probabilistic split + pid !== -1; // not anonymous + + let next = null; + if (shouldUseTopical) { + next = await getNextTopicalComment(zid, pid, withoutTids); + } else { + next = await getNextPrioritizedComment(zid, pid, withoutTids); + } + + // Fallback: if topical yielded nothing, try prioritized + if (!next && shouldUseTopical) { + next = await getNextPrioritizedComment(zid, pid, withoutTids); + } + + if (!next) return next; + await ensureTranslations(zid, next, lang); + return next; +} +``` + +**Key design**: The split is probabilistic. With `TOPICAL_COMMENT_RATIO=0.5`, each request has a 50% chance of using topical routing and 50% prioritized. + +--- + +## 3. 
Prioritized Routing (Legacy Path)

### `getNextPrioritizedComment()` (`nextComment.ts` lines 51–103)

```typescript
async function getNextPrioritizedComment(zid, pid, withoutTids) {
  const [comments, mathRaw, remainingRows] = await Promise.all([
    getComments({zid, not_voted_by_pid: pid, ...}), // unvoted comments
    getPca(zid, 0),                                 // math results from DB
    getNumberOfCommentsRemaining(zid, pid),         // count
  ]);

  const commentPriorities = math.asPOJO?.['comment-priorities'] || {};
  const selectedRow = selectProbabilistically(comments, commentPriorities);
  selectedRow.remaining = remainingCount;
  selectedRow.total = totalCount;
  return selectedRow;
}
```

**Data flow**:
1. Fetch all comments the participant hasn't voted on
2. Fetch `comment-priorities` from the math results stored in the `math_main` table
3. Use weighted random selection

### `selectProbabilistically()` (`nextComment.ts` lines 105–140)

```typescript
function selectProbabilistically(comments, priorities) {
  // Build cumulative weight lookup
  const lookup = _.reduce(comments, (o, comment) => {
    const lookup_val = o.lastCount + (priorities[comment.tid] || 1);
    //                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    //                               default weight = 1 if no priority
    o.lookup.push([lookup_val, comment]);
    o.lastCount = lookup_val;
    return o;
  }, { lastCount: 0, lookup: [] });

  // Weighted random selection
  const randomN = Math.random() * lookup.lastCount;
  const result = _.find(lookup.lookup, (x) => x[0] > randomN);
  return result[1];
}
```

**Mathematical formulation**:

Given comments `c_1, ..., c_n` with priorities `w_1, ..., w_n` (default `w_i = 1`):

```
P(select c_i) = w_i / Σ_j w_j
```

This is **probability-proportional-to-size (PPS) sampling** of a single comment per request; across requests it behaves like sampling without replacement, since already-voted comments are excluded from the pool. Each comment's selection probability is proportional to its priority weight.
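A minimal Python sketch of this weighted selection (illustrative only; the actual implementation is the TypeScript above, and `rng` is a parameter added here for testability):

```python
import random

def select_probabilistically(comments, priorities, rng=random.random):
    """Pick one comment with probability proportional to its priority (default weight 1)."""
    weights = [priorities.get(c["tid"], 1) for c in comments]
    r = rng() * sum(weights)
    cumulative = 0.0
    for comment, w in zip(comments, weights):
        cumulative += w
        if cumulative > r:
            return comment
    return comments[-1]  # fallback against float rounding at the upper edge
```

With weights {49, 4, 1}, the first comment is chosen 49/54 ≈ 90.7% of the time.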
+ +**Example** with priorities {A: 49, B: 4, C: 1}: +``` +P(A) = 49/54 = 90.7% ← meta comment (priority^2 = 7^2) +P(B) = 4/54 = 7.4% ← normal comment +P(C) = 1/54 = 1.9% ← default (no priority computed) +``` + +--- + +## 4. Topical Routing (Delphi Path) + +### `getNextTopicalComment()` (`nextComment.ts` lines 285–327) + +```typescript +export async function getNextTopicalComment(zid, pid, withoutTids) { + const tids = await getTidsForParticipantTopicAgenda(zid, pid); + if (!tids || tids.length === 0) { + return getNextPrioritizedComment(zid, pid, withoutTids); // fallback + } + + const rows = await getComments({ + zid, + not_voted_by_pid: pid, + tids, // restrict to topic-relevant comments + withoutTids, + random: true, // uniform random within pool + limit: 1, + }); + + if (!rows[0]) { + return getNextPrioritizedComment(zid, pid, withoutTids); // fallback + } + + return { zid, tid: rows[0].tid, txt: rows[0].txt }; +} +``` + +### `getTidsForParticipantTopicAgenda()` (`nextComment.ts` lines 153–276) + +``` +1. Query PostgreSQL: topic_agenda_selections WHERE zid AND pid + → Get archetypal_selections array (participant's chosen topics) + +2. Extract unique topic_keys from selections + +3. For each topic_key: + Query DynamoDB: Delphi_CommentClustersLLMTopicNames + → Get (layer_id, cluster_id) for each topic + +4. For each (layer_id, cluster_id): + Query DynamoDB: Delphi_CommentHierarchicalClusterAssignments + → Get all comment tids in that cluster + +5. De-duplicate and return tids +``` + +**Key difference from prioritized**: Topical routing uses **uniform random** selection within the filtered pool (no weighting). The filtering is by LLM-generated topic clusters, not by math-computed priorities. + +--- + +## 5. 
How Comment Priorities Reach the Server + +### 5.1 Clojure Pipeline + +``` +conversation.clj:comment-priorities + → Serialized into math_main.data JSON blob in PostgreSQL + → Read by server via getPca(zid, 0) + → Extracted as math.asPOJO['comment-priorities'] + → Passed to selectProbabilistically() +``` + +The Clojure pipeline computes fresh priorities on each update and stores them in the `math_main` table. + +### 5.2 Delphi (Python) Pipeline + +Python does NOT compute comment priorities (see 05-participant-filtering.md Section 4). This means: + +- If running Delphi instead of Clojure math, `comment-priorities` will be empty (`{}`) +- `selectProbabilistically()` will default every comment to weight 1 +- Result: **uniform random selection** — all comments equally likely + +**Impact**: The entire priority-based routing degrades to random selection when using Delphi. + +--- + +## 6. Server API Endpoints + +### GET `/api/v3/nextComment` + +Handler: `handle_GET_nextComment()` in `routes/comments.ts` + +Parameters: +- `zid` — conversation ID +- `not_voted_by_pid` — participant ID (exclude already-voted) +- `without` — array of tids to exclude +- `lang` — language for translation + +Response: +```json +{ + "tid": 42, + "txt": "The comment text...", + "created": 1703000000000, + "remaining": 15, + "total": 50, + "currentPid": 7 +} +``` + +### POST `/api/v3/votes` + +Handler: `handle_POST_votes()` in `routes/votes.ts` + +After recording the vote, the handler calls `getNextComment()` and includes the next comment in the response, enabling seamless vote → next comment flow without a separate request. + +--- + +## 7. 
Comparison: Delphi vs Legacy Comment Routing + +| Aspect | Legacy (Clojure + Prioritized) | Delphi (Topical) | +|--------|-------------------------------|------------------| +| Selection pool | All unvoted comments | Only topic-relevant unvoted comments | +| Weighting | Priority scores (importance × novelty) | Uniform random within pool | +| Novelty boost | Yes (`2^(S/-5)` decay) | No | +| Meta comments | 7× priority boost (squared: 49×) | Not distinguished | +| Extremity factor | Yes (PCA comment projection norm) | No | +| Topic filtering | No | Yes (LLM-generated clusters) | +| Participant prefs | No | Yes (topic_agenda_selections) | +| Fallback | Uniform random if no priorities | Prioritized routing | +| Data source | math_main PostgreSQL table | DynamoDB + PostgreSQL | + +### 7.1 Strengths of Legacy Approach +- Comments with high information value (high extremity, low vote count) surface faster +- Meta/featured comments get strong priority boost +- Battle-tested across many conversations + +### 7.2 Strengths of Delphi Approach +- Respects participant interests (topic selection) +- LLM-generated topics can be more semantically meaningful +- Reduces survey fatigue by focusing on relevant topics + +### 7.3 Current Hybrid +The `TOPICAL_COMMENT_RATIO` parameter blends both approaches probabilistically. When set to 0.5, each comment request has a 50% chance of using each strategy. + +--- + +## 8. Client-Side Integration + +### `client-participation/js/stores/polis.js` (lines 157–194) + +```javascript +getNextComment(o) { + params = { + not_voted_by_pid: myPid, + limit: 1, + conversation_id: conversation_id, + lang: Utils.uiLanguage() + }; + if (o.notTid) { + params.without = [o.notTid]; // exclude current comment + } + return polisGet('api/v3/nextComment', params) + .then((c) => { + nextCommentCache = c; + if (c.currentPid) processPidResponse(c.currentPid); + }); +} +``` + +The client calls `getNextComment` after each vote, caches the response, and displays it. 
The client has no knowledge of the routing strategy — it simply requests and displays whatever the server returns. diff --git a/delphi/docs/deep-analysis-for-julien/07-discrepancies.md b/delphi/docs/deep-analysis-for-julien/07-discrepancies.md new file mode 100644 index 000000000..288e6d3ea --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/07-discrepancies.md @@ -0,0 +1,268 @@ +# ALL Discrepancies: Clojure (CORRECT) vs Python (Delphi) + +This is the critical reference document. Every discrepancy is rated by severity and lists the exact code locations. + +--- + +## SEVERITY LEGEND + +- **CRITICAL**: Fundamentally changes mathematical behavior, produces different results +- **HIGH**: Significant behavioral difference visible to users +- **MEDIUM**: Subtle numerical differences, may affect edge cases +- **LOW**: Cosmetic or minor structural difference + +--- + +## D1. PCA Algorithm [MEDIUM] + +| | Clojure | Python | +|---|---------|--------| +| File | `math/src/polismath/math/pca.clj` | `delphi/polismath/pca_kmeans_rep/pca.py` | +| Method | Power iteration (100 iters) | sklearn SVD | +| Warm-start | Yes (previous components) | No | +| Large conv | Mini-batch with learning rate 0.01 | Full SVD (no optimization) | + +**Why MEDIUM**: Both methods converge to the same eigenspace for well-separated eigenvalues. The main risk is temporal instability in Python (PCA can flip/rotate between updates). + +**Fix**: Could switch to incremental PCA (sklearn `IncrementalPCA`) for warm-starting, or accept SVD as equivalent and address stability separately. + +### D1b. Projection Input [LOW] + +| | Clojure | Python | +|---|---------|--------| +| File | `pca.clj:134–157` | `pca.py:76,96–102` | +| Input | Raw votes (nils skipped) | Imputed matrix (NaN → col mean) | + +Clojure projects against raw sparse votes, skipping unvoted entries. Python projects the fully imputed matrix. Since `center ≈ col_mean`, the difference is small but nonzero. See doc 02, Section 3.4 for details. 
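The D1b difference can be sketched as follows (hypothetical helper names; this ignores any additional sparsity rescaling the Clojure projection may apply):

```python
import numpy as np

def project_sparse(votes_row, comps, center):
    """Clojure-style: project using only the entries the participant actually voted on."""
    mask = ~np.isnan(votes_row)
    return (votes_row[mask] - center[mask]) @ comps[:, mask].T

def project_imputed(votes_row, comps, center, col_means):
    """Python-style: fill missing votes with column means, then project everything."""
    filled = np.where(np.isnan(votes_row), col_means, votes_row)
    return (filled - center) @ comps.T
```

When `center` equals the column means exactly, each imputed entry contributes zero and the two projections coincide; the residual discrepancy comes from `center ≈ col_mean` holding only approximately.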
+ +--- + +## D2. In-Conv Participant Threshold [CRITICAL] + +| | Clojure | Python | +|---|---------|--------| +| File | `conversation.clj:240–266` | `conversation.py:1225–1248` | +| Threshold | `min(7, n_cmts)` | `7 + sqrt(n_cmts) * 0.1` | +| Greedy fallback | Top-15 voters if < 15 qualify | None | +| Persistence | Monotonic (once in, always in) | Recomputed from scratch | +| Matrix used | raw_rating_mat | rating_mat (moderated) | + +**Example at 100 comments**: +- Clojure: need 7 votes → most participants qualify +- Python: need 8 votes → slightly fewer qualify + +**Example at 1000 comments**: +- Clojure: need 7 votes → most participants qualify +- Python: need 10.2 votes → significantly fewer qualify + +**Why CRITICAL**: This is the gatekeeper for clustering. Wrong threshold means different participants are clustered, cascading to different groups, different repness, different everything. + +**Fix**: Change Python to `min(7, n_cmts)`, add greedy fallback, add persistence, use `raw_rating_mat`. + +--- + +## D3. K-Smoother Buffer [HIGH] + +| | Clojure | Python | +|---|---------|--------| +| File | `conversation.clj:452–468` | `conversation.py:620–641` | +| Buffer | 4 consecutive updates before switching k | None (immediate switch) | +| State | Persisted across updates | None | + +**Why HIGH**: Without the buffer, the number of groups can flicker between values on successive updates, causing visible instability in the UI. + +**Fix**: Add `group_k_smoother` state to `Conversation` class with `{last_k, last_k_count, smoothed_k}`. + +--- + +## D4. Pseudocount Formula [MEDIUM] + +| | Clojure | Python | +|---|---------|--------| +| File | `repness.clj` | `repness.py:35,138–139` | +| Numerator add | +1 | +0.75 (`PSEUDO_COUNT/2`) | +| Denominator add | +2 | +1.5 (`PSEUDO_COUNT`) | +| Prior | Beta(2,2) | Beta(1.75,1.75) | + +**Why MEDIUM**: Small numerical differences that diminish with sample size but matter for small groups. 
+ +**Fix**: Change `PSEUDO_COUNT = 2.0` in Python (giving +1/+2 to match Clojure). + +--- + +## D5. Proportion Test Formula [CRITICAL] + +| | Clojure | Python | +|---|---------|--------| +| File | `stats.clj:prop-test` | `repness.py:prop_test` lines 64–86 | +| Clojure | `2*sqrt(n+1)*((succ+1)/(n+1) - 0.5)` | — | +| Python | — | `(p - 0.5) / sqrt(0.25/n)` (standard z-test) | + +**Why CRITICAL**: These are fundamentally different formulas. Clojure's built-in pseudocount in the test produces different z-scores, especially for small samples. Since these z-scores are used in significance tests that gate which comments become representative, different formulas → different representative comments. + +**Fix**: Implement Clojure's exact formula in Python. + +--- + +## D6. Two-Proportion Test [MEDIUM] + +| | Clojure | Python | +|---|---------|--------| +| File | `stats.clj:two-prop-test` | `repness.py:two_prop_test` lines 89–115 | +| Pseudocount | +1 to all 4 inputs | None (uses pre-smoothed proportions) | + +**Why MEDIUM**: The +1 adjustment provides regularization that prevents extreme z-scores when sample sizes are very small. Python's version can produce larger z-scores. + +**Fix**: Add +1 adjustment to match Clojure, or apply pseudocount before calling. + +--- + +## D7. Repness Metric Formula [CRITICAL] + +| | Clojure | Python | +|---|---------|--------| +| File | `repness.clj:repness-metric` | `repness.py:repness_metric` lines 189–210 | +| Formula | `ra * rat * pa * pat` (product of 4) | `pa * (|pat| + |rat|)` (weighted sum of 2) | +| Uses ratio directly | Yes (`ra`) | No (only `rat`) | +| Abs values | No | Yes | + +**Why CRITICAL**: Completely different mathematical formulas for ranking representative comments. The product form zeros out when any factor is near zero; the sum form is more tolerant. They will produce different rankings for the same data. + +**Fix**: Replace Python formula with `ra * rat * pa * pat` to match Clojure. + +--- + +## D8. 
Finalize Comment Stats [MEDIUM] + +| | Clojure | Python | +|---|---------|--------| +| File | `repness.clj:finalize-cmt-stats` | `repness.py:finalize_cmt_stats` lines 213–243 | +| Decision | `rat > rdt` (simple comparison) | `pa > 0.5 AND ra > 1.0` (threshold check first) | + +**Why MEDIUM**: Different logic for choosing agree vs disagree classification. The threshold checks in Python can override what would be the metric-based choice. + +**Fix**: Replace Python logic with simple `rat > rdt` comparison. + +--- + +## D9. Z-Score Significance Thresholds [CRITICAL] + +| | Clojure | Python repness.py | +|---|---------|------------------| +| File | `stats.clj` | `repness.py:19` | +| 90% threshold | 1.2816 (one-tailed) | 1.645 (two-tailed) | +| 95% threshold | 1.6449 (one-tailed) | 1.96 (two-tailed) | + +**Why CRITICAL**: Python requires 28% higher z-scores to pass significance at 90% confidence. Many comments that would be selected as representative in Clojure will fail to pass significance in Python. + +**Additional issue**: Python's own `stats.py` uses 1.2816 (matching Clojure), creating an **internal inconsistency** within the Python codebase. + +**Fix**: Change `Z_90 = 1.2816` and `Z_95 = 1.6449` in `repness.py`. + +--- + +## D10. Representative Comment Selection [HIGH] + +| | Clojure | Python | +|---|---------|--------| +| File | `repness.clj:select-rep-comments` | `repness.py:select_rep_comments` lines 315–392 | +| Algorithm | Single-pass with beats-best | Sort-and-take-top | +| Max agree | 5 total (agrees first) | 3 agrees | +| Max disagree | (from the 5) | 2 disagrees | +| Total | Up to 5 | Up to 5 | + +**Why HIGH**: Different selection counts and algorithm produce different representative comment sets. + +**Fix**: Match Clojure's selection logic (5 total, agrees first, single-pass). + +--- + +## D11. 
Consensus Comment Selection [HIGH] + +| | Clojure | Python | +|---|---------|--------| +| File | `repness.clj:consensus-stats` + `select-consensus-comments` | `repness.py:select_consensus_comments` lines 413–454 | +| Criterion | Per-comment `pa > 0.5` | ALL groups `pa > 0.6` | +| Output | Top 5 agree + 5 disagree with test scores | Top 2 overall | +| Test scores | Yes (format-stat adds z-tests) | No | + +**Why HIGH**: Python is much more restrictive and returns far fewer consensus comments. + +**Fix**: Match Clojure's consensus-stats + select-consensus-comments logic. + +--- + +## D12. Comment Priorities [CRITICAL] + +| | Clojure | Python | +|---|---------|--------| +| File | `conversation.clj:308–669` | `conversation.py:1047–1093` (dead code) | +| Status | Fully computed and stored | NOT COMPUTED (dead code with bug) | +| Components | importance × novelty, squared | N/A | +| Comment extremity | Yes (from PCA) | Not computed | +| Meta priority | 7 (squared = 49) | N/A | + +**Why CRITICAL**: Without comment priorities, the TypeScript server falls back to uniform random comment selection. This eliminates the information-theoretic comment routing that is core to Polis's design. + +**Fix**: Implement `importance_metric`, `priority_metric`, and the full `comment-priorities` computation. Requires first implementing comment projection and extremity in PCA. + +--- + +## D13. Subgroup Clustering [LOW] + +| | Clojure | Python | +|---|---------|--------| +| File | `conversation.clj:480–570` | N/A | +| Status | Full implementation with k-smoother | Not implemented | + +**Why LOW**: Subgroups are a secondary feature. The primary group clustering is more important. + +**Fix**: Implement if needed, but lower priority. + +--- + +## D14. 
Large Conversation Optimization [LOW] + +| | Clojure | Python | +|---|---------|--------| +| Dispatch | `ptpt-cutoff=10000`, `cmt-cutoff=5000` | No dispatch | +| Large PCA | Mini-batch with learning rate | Full SVD | +| Sample size | `interpolate(n, 100..1500, 1500..150000)` | N/A | + +**Why LOW**: Only matters for very large conversations. Can be addressed later. + +**Fix**: Use `IncrementalPCA` from sklearn for large conversations. + +--- + +## D15. Moderation Handling [MEDIUM] + +| | Clojure | Python | +|---|---------|--------| +| Method | `zero-out-columns` (set values to 0, keep structure) | `loc[keep_ptpts, keep_comments]` (remove entirely) | + +**Why MEDIUM**: Clojure keeps moderated comments in the matrix with zeroed values; Python removes them entirely. This affects matrix dimensions and potentially vote count calculations. + +**Fix**: Consider switching to zero-out approach, but current approach may be acceptable since Python works with DataFrames. + +--- + +## Summary Table + +| ID | Discrepancy | Severity | Fix Effort | +|----|-------------|----------|------------| +| D1 | PCA algorithm | MEDIUM | Low (accept or use IncrementalPCA) | +| D2 | In-conv threshold | CRITICAL | Low (change formula) | +| D3 | K-smoother buffer | HIGH | Medium (add state) | +| D4 | Pseudocount formula | MEDIUM | Low (change constant) | +| D5 | Proportion test | CRITICAL | Low (change formula) | +| D6 | Two-proportion test | MEDIUM | Low (add +1 adjustment) | +| D7 | Repness metric | CRITICAL | Low (change formula) | +| D8 | Finalize cmt stats | MEDIUM | Low (simplify logic) | +| D9 | Z-score thresholds | CRITICAL | Low (change constants) | +| D10 | Rep comment selection | HIGH | Medium (change algorithm) | +| D11 | Consensus selection | HIGH | Medium (match Clojure) | +| D12 | Comment priorities | CRITICAL | High (full implementation) | +| D13 | Subgroup clustering | LOW | High (full implementation) | +| D14 | Large conv optimization | LOW | Medium | +| D15 | Moderation 
handling | MEDIUM | Low | diff --git a/delphi/docs/deep-analysis-for-julien/08-dead-code.md b/delphi/docs/deep-analysis-for-julien/08-dead-code.md new file mode 100644 index 000000000..e27583ce2 --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/08-dead-code.md @@ -0,0 +1,131 @@ +# Dead and Unreachable Code + +## 1. Python (Delphi) Dead Code + +### 1.1 `_compute_votes_base()` — Dead Code with Bug + +**File**: `conversation.py` lines 1047–1093 +**Status**: DEAD — Never called from `recompute()` or any other pipeline method +**Bug**: Line 1076 uses `self.rating_mat[:, 'tid']` which is invalid pandas syntax. Should be `self.rating_mat[tid]`. +**Note**: Marked with `TODO(julien): why is that not called anywhere?` + +The `to_dict()` method (lines 1497–1521) has its own inline votes-base computation that IS used for serialization, making this method redundant. + +### 1.2 Custom `kmeans()` function — Dead Code + +**File**: `clusters.py` lines 383–419 +**Status**: DEAD — The pipeline uses `kmeans_sklearn()` exclusively +**Related dead code**: +- `cluster_step()` (lines 219–250) — only called by custom `kmeans()` +- `assign_points_to_clusters()` (lines 164–188) — only called by `cluster_step()` +- `update_cluster_centers()` (lines 191–203) — only called by `cluster_step()` +- `filter_empty_clusters()` (lines 206–216) — only called by `cluster_step()` +- `same_clustering()` (lines 135–161) — only called by custom `kmeans()` +- `clean_start_clusters()` (lines 318–380) — only called by custom `kmeans()` +- `split_cluster()` (lines 279–315) — only called by `clean_start_clusters()` +- `most_distal()` (lines 253–276) — only called by `split_cluster()` +- `distance_matrix()` (lines 422–441) — only called by custom `silhouette()` +- Custom `silhouette()` (lines 444–503) — pipeline uses `calculate_silhouette_sklearn()` + +### 1.3 `cluster_dataframe()` — Dead Code + +**File**: `clusters.py` lines 729–790 +**Status**: DEAD — The pipeline calls `kmeans_sklearn()` directly from 
`_compute_clusters()` +**Related**: `determine_k()` (lines 689–726) is only called by `cluster_dataframe()`, making it also dead. + +### 1.4 `clusters_from_dict()` — Dead Code + +**File**: `clusters.py` lines 536–564 +**Status**: DEAD — Used by `cluster_dataframe()` which is dead. The pipeline constructs clusters as plain dicts, not `Cluster` objects. + +### 1.5 `clusters_to_dict()` — Dead Code + +**File**: `clusters.py` lines 506–533 +**Status**: DEAD — Same reason as above. + +### 1.6 `Cluster` class — Partially Dead + +**File**: `clusters.py` lines 19–79 +**Status**: PARTIALLY DEAD — The `Cluster` class is used by the dead custom `kmeans()` chain but not by `kmeans_sklearn()`. The pipeline works with plain dicts instead. + +### 1.7 Legacy `comment_stats()`, `add_comparative_stats()`, etc. — Dead Code + +**File**: `repness.py` lines 118–392 +**Status**: DEAD — These are the non-vectorized versions. The pipeline uses the vectorized DataFrame functions (`compute_group_comment_stats_df`, `select_rep_comments_df`, etc.) exclusively via `conv_repness()`. + +Specifically dead: +- `comment_stats()` (lines 118–154) +- `add_comparative_stats()` (lines 157–186) +- `finalize_cmt_stats()` (lines 213–243) +- `passes_by_test()` (lines 246–268) +- `best_agree()` (lines 271–290) +- `best_disagree()` (lines 293–312) +- `select_rep_comments()` (lines 315–392) +- `select_consensus_comments()` (lines 413–454) + +The vectorized replacements (`compute_group_comment_stats_df`, `select_rep_comments_df`, `select_consensus_comments_df`) are used instead. + +### 1.8 `calculate_kl_divergence()` — Dead Code + +**File**: `repness.py` lines 395–410 +**Status**: DEAD — Never called from anywhere. + +### 1.9 `_compute_group_votes()` — Partially Dead + +**File**: `conversation.py` lines 1095–1177 +**Status**: Called by `_compute_group_aware_consensus()` which is called from nowhere in the main pipeline. The `to_dict()` method has its own inline group-votes computation. 
+ +### 1.10 `_compute_group_aware_consensus()` — Dead Code + +**File**: `conversation.py` lines 1291–1349 +**Status**: DEAD — Never called from `recompute()`. The `to_dict()` method has its own inline consensus computation. + +### 1.11 `comment_priorities` attribute — Dead Code + +**File**: `conversation.py` lines 1039–1041, 1452–1453 +**Status**: The attribute is checked in `to_dict()` and `get_full_data()` but never populated by any pipeline method. + +--- + +## 2. Python Code That Is Used But Has Issues + +### 2.1 `_get_in_conv_participants()` — Used but Wrong Formula + +**File**: `conversation.py` lines 1225–1248 +**Status**: ACTIVE but uses wrong threshold formula (see D2 in discrepancies) + +### 2.2 `to_dict()` In-Conv — Used but Different from Pipeline + +**File**: `conversation.py` lines 1637–1644 +**Status**: ACTIVE — Uses `min(7, n_cmts)` threshold (matching Clojure) but only for serialization output, not for the actual clustering pipeline. + +### 2.3 `repness.py` Z_90 = 1.645 — Used but Wrong Value + +**File**: `repness.py` line 19 +**Status**: ACTIVE — Should be 1.2816 to match Clojure (see D9 in discrepancies) + +--- + +## 3. Clojure Dead Code + +### 3.1 `choose-group-k` — Commented Out + +**File**: `conversation.clj` lines 97–101 +**Status**: DEAD — Commented out, replaced by silhouette-based k-selection. + +### 3.2 `cmnt-proj` binding — Commented Out + +**File**: `conversation.clj` lines 391–393 +**Status**: DEAD — Commented out in favor of computing projection+extremity together in `with-proj-and-extremtiy` (sic — typo in the Clojure source, `conversation.clj:338`). + +--- + +## 4. 
Summary + +| Category | Count | Impact | +|----------|-------|--------| +| Dead pipeline methods (Python) | 3 (`_compute_votes_base`, `_compute_group_votes`, `_compute_group_aware_consensus`) | None (unused) | +| Dead utility functions (Python) | ~15 (custom kmeans chain, non-vectorized repness) | None (unused) | +| Dead but referenced attributes (Python) | 1 (`comment_priorities`) | Confusing — suggests feature is implemented when it's not | +| Active code with bugs (Python) | 3 (in-conv threshold, Z_90 value, to_dict inconsistency) | HIGH | +| Dead Clojure code | 2 (commented out, harmless) | None | diff --git a/delphi/docs/deep-analysis-for-julien/09-fix-plan.md b/delphi/docs/deep-analysis-for-julien/09-fix-plan.md new file mode 100644 index 000000000..6dd0f9ec6 --- /dev/null +++ b/delphi/docs/deep-analysis-for-julien/09-fix-plan.md @@ -0,0 +1,395 @@ +# Fix Plan: Bringing Python to Clojure Parity + +> **Note**: This was a preliminary analysis. The canonical fix ordering is in +> [`PLAN_DISCREPANCY_FIXES.md`](../PLAN_DISCREPANCY_FIXES.md), which reordered +> some fixes (e.g., D4 before D9) based on pipeline execution order. + +## Prioritized Fix Order + +Fixes are ordered by: (1) cascading impact, (2) severity, (3) ease of implementation. Items that affect downstream computations come first. + +--- + +## Phase 1: Foundation Fixes (Highest Priority) + +### Fix 1: In-Conv Participant Threshold [D2] + +**Files to modify**: `delphi/polismath/conversation/conversation.py` + +**Current** (lines 1237–1238): +```python +threshold = 7 + np.sqrt(n_cmts) * 0.1 +``` + +**Target**: +```python +threshold = min(7, n_cmts) +``` + +**Additional changes needed**: +1. Use `self.raw_rating_mat` instead of `self.rating_mat` for vote counting +2. Add monotonic persistence: store `self._in_conv` set, union with new qualifiers +3. 
Add greedy fallback: if `len(in_conv) < 15`, add top voters until reaching 15 + +**Estimated effort**: 1–2 hours + +### Fix 2: Z-Score Significance Thresholds [D9] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current** (line 19): +```python +Z_90 = 1.645 +Z_95 = 1.96 +``` + +**Target**: +```python +Z_90 = 1.2816 # one-tailed, matching Clojure +Z_95 = 1.6449 # one-tailed, matching Clojure +``` + +**Estimated effort**: 5 minutes + +### Fix 3: Pseudocount Formula [D4] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current** (line 35): +```python +PSEUDO_COUNT = 1.5 +``` + +**Target**: +```python +PSEUDO_COUNT = 2.0 # +1 numerator, +2 denominator, matching Clojure's Beta(1,1) (Laplace) smoothing +``` + +Verify that `pa = (na + PSEUDO_COUNT/2) / (ns + PSEUDO_COUNT)` with `PSEUDO_COUNT=2.0` gives `pa = (na + 1) / (ns + 2)`, which matches Clojure. ✓ + +**Estimated effort**: 5 minutes + +### Fix 4: Proportion Test Formula [D5] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current** (`prop_test`, lines 64–86): +```python +def prop_test(p, n, p0): + se = math.sqrt(p0 * (1 - p0) / n) + return (p - p0) / se +``` + +**Target** (match Clojure's `stats.clj:prop-test`): +```python +def prop_test(succ, n): + """Clojure-compatible proportion test. + z = 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5) + """ + if n == 0: + return 0.0 + return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5) +``` + +**Note**: This changes the function signature from `(p, n, p0)` to `(succ, n)`. All callers need to be updated. The vectorized version (`prop_test_vectorized`) also needs updating.
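As a standalone sanity check of the target formula (example values computed here, not taken from Clojure output):

```python
import math

def prop_test(succ: int, n: int) -> float:
    """Target formula: z = 2 * sqrt(n+1) * ((succ+1)/(n+1) - 0.5)."""
    if n == 0:
        return 0.0
    return 2 * math.sqrt(n + 1) * ((succ + 1) / (n + 1) - 0.5)

# Heavy agreement pushes z positive, heavy disagreement mirrors it
# negative, and an even split sits near zero.
print(round(prop_test(9, 10), 4))   # 2.7136
print(round(prop_test(0, 10), 4))   # -2.7136
print(round(prop_test(5, 10), 4))   # 0.3015
```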
+ +**Estimated effort**: 1 hour (including caller updates and vectorized version) + +--- + +## Phase 2: Repness Metric Fixes + +### Fix 5: Repness Metric Formula [D7] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current** (`repness_metric`, lines 189–210): +```python +p_factor = p if key_prefix == 'a' else (1 - p) +return p_factor * (abs(p_test) + abs(r_test)) +``` + +**Target** (match Clojure's product of 4): +```python +def repness_metric(stats, key_prefix): + p = stats[f'p{key_prefix}'] # pa or pd + p_test = stats[f'p{key_prefix}t'] # pat or pdt + r = stats[f'r{key_prefix}'] # ra or rd + r_test = stats[f'r{key_prefix}t'] # rat or rdt + return r * r_test * p * p_test +``` + +**Also update vectorized version** in `compute_group_comment_stats_df`: +```python +# Current (lines 649–650): +stats_df['agree_metric'] = stats_df['pa'] * (stats_df['pat'].abs() + stats_df['rat'].abs()) +stats_df['disagree_metric'] = (1 - stats_df['pd']) * (stats_df['pdt'].abs() + stats_df['rdt'].abs()) + +# Target: +stats_df['agree_metric'] = stats_df['ra'] * stats_df['rat'] * stats_df['pa'] * stats_df['pat'] +stats_df['disagree_metric'] = stats_df['rd'] * stats_df['rdt'] * stats_df['pd'] * stats_df['pdt'] +``` + +**Estimated effort**: 30 minutes + +### Fix 6: Finalize Comment Stats Logic [D8] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current** (`finalize_cmt_stats`, lines 230–241): +```python +if result['pa'] > 0.5 and result['ra'] > 1.0: + result['repful'] = 'agree' +elif result['pd'] > 0.5 and result['rd'] > 1.0: + result['repful'] = 'disagree' +else: ... 
+``` + +**Target** (match Clojure's simple comparison): +```python +if result['rat'] > result['rdt']: + result['repful'] = 'agree' +else: + result['repful'] = 'disagree' +``` + +**Also update vectorized version** in `compute_group_comment_stats_df` (lines 656–663): +```python +# Target: +stats_df['repful'] = np.where(stats_df['rat'] > stats_df['rdt'], 'agree', 'disagree') +``` + +**Estimated effort**: 15 minutes + +### Fix 7: Two-Proportion Test Adjustment [D6] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +**Current** (`two_prop_test`, lines 89–115): +```python +def two_prop_test(p1, n1, p2, n2): + p = (p1 * n1 + p2 * n2) / (n1 + n2) + ... +``` + +**Target** (add +1 pseudocount like Clojure): +```python +def two_prop_test(succ1, n1, succ2, n2): + """Clojure-compatible two-proportion test with +1 pseudocounts.""" + succ1, n1 = succ1 + 1, n1 + 1 + succ2, n2 = succ2 + 1, n2 + 1 + p1 = succ1 / n1 + p2 = succ2 / n2 + p = (succ1 + succ2) / (n1 + n2) + se = math.sqrt(p * (1 - p) * (1/n1 + 1/n2)) + if se == 0: + return 0.0 + return (p1 - p2) / se +``` + +**Note**: Signature changes from proportions to counts. Callers must be updated. + +**Estimated effort**: 1 hour + +--- + +## Phase 3: Selection & Consensus Fixes + +### Fix 8: Representative Comment Selection [D10] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +Match Clojure's `select-rep-comments` logic: +- Single pass with beats-best-by-test? and passes-by-test? 
+- Up to 5 total comments (agrees prioritized over disagrees) + +**Estimated effort**: 2–3 hours (algorithm redesign) + +### Fix 9: Consensus Comment Selection [D11] + +**Files to modify**: `delphi/polismath/pca_kmeans_rep/repness.py` + +Match Clojure's `consensus-stats` + `select-consensus-comments`: +- Per-comment stats with `pa = (A+1)/(S+2)` smoothing +- Top 5 agree (pa > 0.5) + top 5 disagree (pd > 0.5) +- Include z-test scores via `format-stat` + +**Estimated effort**: 2–3 hours + +--- + +## Phase 4: K-Smoother and Comment Priorities + +### Fix 10: K-Smoother Buffer [D3] + +**Files to modify**: `delphi/polismath/conversation/conversation.py` + +Add state tracking to `Conversation`: +```python +# In __init__: +self.group_k_smoother = { + 'last_k': None, + 'last_k_count': 0, + 'smoothed_k': None +} + +# In _compute_clusters(), after finding best_k by silhouette: +GROUP_K_BUFFER = 4 +smoother = self.group_k_smoother +this_k = best_k +same = (smoother['last_k'] is not None and this_k == smoother['last_k']) +this_k_count = smoother['last_k_count'] + 1 if same else 1 + +if this_k_count >= GROUP_K_BUFFER: + smoothed_k = this_k +else: + smoothed_k = smoother['smoothed_k'] if smoother['smoothed_k'] is not None else this_k + +self.group_k_smoother = { + 'last_k': this_k, + 'last_k_count': this_k_count, + 'smoothed_k': smoothed_k +} + +# Use smoothed_k instead of best_k for the actual clustering +``` + +**Estimated effort**: 1–2 hours + +### Fix 11: Comment Priorities [D12] + +**Files to modify**: `delphi/polismath/conversation/conversation.py`, `delphi/polismath/pca_kmeans_rep/pca.py` + +**Step 1**: Add comment projection and extremity to PCA: +```python +# In pca.py, after computing components: +comment_projection = pca.components_ # (n_comps × n_comments) +comment_extremity = np.linalg.norm(comment_projection, axis=0) +# Add to pca_results +pca_results['comment-projection'] = comment_projection +pca_results['comment-extremity'] = comment_extremity +``` + +**Step 2**: 
Implement `importance_metric`: +```python +def importance_metric(A, P, S, E): + p = (P + 1) / (S + 2) + a = (A + 1) / (S + 2) + return (1 - p) * (E + 1) * a +``` + +**Step 3**: Implement `priority_metric`: +```python +META_PRIORITY = 7 + +def priority_metric(is_meta, A, P, S, E): + if is_meta: + return META_PRIORITY ** 2 # = 49 + imp = importance_metric(A, P, S, E) + novelty = 1 + 8 * (2 ** (S / -5)) + return (imp * novelty) ** 2 +``` + +**Step 4**: Compute priorities in `recompute()`: +```python +def _compute_comment_priorities(self): + group_votes = self._compute_group_votes() + extremities = dict(zip( + self.rating_mat.columns, + self.pca.get('comment-extremity', np.zeros(len(self.rating_mat.columns))) + )) + + priorities = {} + for tid in self.rating_mat.columns: + total_A = total_D = total_S = total_P = 0 + for gid, gdata in group_votes.items(): + v = gdata['votes'].get(tid, {'A': 0, 'D': 0, 'S': 0}) + total_A += v['A'] + total_D += v['D'] + total_S += v['S'] + total_P += v['S'] - v['A'] - v['D'] + + E = extremities.get(tid, 0) + is_meta = tid in self.meta_tids if self.meta_tids else False + priorities[tid] = priority_metric(is_meta, total_A, total_P, total_S, E) + + self.comment_priorities = priorities +``` + +**Step 5**: Call from `recompute()` after `_compute_repness()`. 
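A quick standalone check of the behavior the steps above should produce, using the metric functions as written with illustrative vote counts (hypothetical inputs, not real conversation data):

```python
META_PRIORITY = 7

def importance_metric(A, P, S, E):
    # A = agrees, P = passes, S = total votes seen, E = extremity
    p = (P + 1) / (S + 2)
    a = (A + 1) / (S + 2)
    return (1 - p) * (E + 1) * a

def priority_metric(is_meta, A, P, S, E):
    if is_meta:
        return META_PRIORITY ** 2  # meta comments are pinned at 49
    novelty = 1 + 8 * (2 ** (S / -5))  # decays as the comment is seen more
    return (importance_metric(A, P, S, E) * novelty) ** 2

# Same smoothed agree rate (0.5) and extremity, ten times the exposure:
# the novelty term should rank the fresh comment higher.
fresh = priority_metric(False, 2, 0, 4, 0.5)
seasoned = priority_metric(False, 20, 0, 40, 0.5)
print(priority_metric(True, 0, 0, 0, 0))  # 49
print(fresh > seasoned)                   # True
```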
+ +**Estimated effort**: 4–6 hours + +--- + +## Phase 5: Cleanup + +### Fix 12: Remove Dead Code + +Remove the extensive dead code identified in `08-dead-code.md`: +- Custom kmeans chain in `clusters.py` +- Non-vectorized repness functions in `repness.py` +- Buggy `_compute_votes_base()` in `conversation.py` + +**Estimated effort**: 1–2 hours + +### Fix 13: Resolve Internal Inconsistencies + +- Remove or align `stats.py:prop_test` with the repness version +- Remove `stats.py:z_sig_90` (uses 1.2816) since `repness.py` is authoritative + +**Estimated effort**: 30 minutes + +--- + +## Testing Strategy + +### Unit Tests Per Fix + +Each fix should include tests comparing Python output to known Clojure output: + +1. **In-conv test**: Given a vote matrix with known vote counts, verify the same participants qualify +2. **Proportion test**: Compare z-scores for specific (succ, n) pairs +3. **Repness metric test**: Compute metric for known stats, compare to Clojure value +4. **Full pipeline test**: Run both implementations on the same small dataset and compare: + - PCA projections (direction, not exact values due to SVD vs power iteration) + - Cluster assignments + - Representative comments per group + - Consensus comments + - Comment priorities + +### Integration Test + +Use a real conversation dataset (export from production) and run both pipelines, comparing: +- Group count +- Group membership overlap +- Representative comment overlap +- Priority ranking correlation (Spearman rank) + +--- + +## Timeline Estimate + +| Phase | Fixes | Effort | Dependencies | +|-------|-------|--------|-------------| +| Phase 1 | D2, D9, D4, D5 | 3–4 hours | None | +| Phase 2 | D7, D8, D6 | 2–3 hours | Phase 1 | +| Phase 3 | D10, D11 | 4–6 hours | Phase 2 | +| Phase 4 | D3, D12 | 6–8 hours | Phase 3 | +| Phase 5 | Cleanup | 2–3 hours | All above | +| Testing | Unit + integration | 4–6 hours | All above | +| **Total** | | **21–30 hours** | | + +--- + +## Future: sklearn Streamlining + +Once 
parity is achieved, opportunities to leverage sklearn more effectively: + +1. **PCA**: Already using sklearn. Could use `IncrementalPCA` for warm-starting and large-conversation support. +2. **K-means**: Already using sklearn. The custom `kmeans()` is dead code and can be removed. +3. **Silhouette**: Already using sklearn. +4. **Statistical tests**: Could use `statsmodels.stats.proportion.proportions_ztest` instead of custom implementations, BUT need to ensure the pseudocount behavior matches Clojure. +5. **Clustering pipeline**: Could use sklearn's `Pipeline` or a custom pipeline class for cleaner composition. + +**Important**: Any sklearn adoption must preserve the exact mathematical behavior of the Clojure implementation. The goal is code simplification, not behavioral change.
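To make point 4 concrete (an off-the-shelf z-test cannot be swapped in blindly), here is a minimal sketch contrasting the Fix 7 pseudocounted two-proportion test with the textbook pooled z-test on identical counts. The inputs are illustrative, not taken from a real conversation:

```python
import math

def two_prop_test_clj(succ1, n1, succ2, n2):
    """Clojure-style two-proportion z with +1 pseudocounts (Fix 7)."""
    succ1, n1 = succ1 + 1, n1 + 1
    succ2, n2 = succ2 + 1, n2 + 1
    p1, p2 = succ1 / n1, succ2 / n2
    p = (succ1 + succ2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se else 0.0

def two_prop_test_textbook(succ1, n1, succ2, n2):
    """Standard pooled two-proportion z-test, no smoothing."""
    p1, p2 = succ1 / n1, succ2 / n2
    p = (succ1 + succ2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se else 0.0

# Identical inputs, visibly different z-scores:
z_clj = two_prop_test_clj(8, 10, 4, 10)
z_std = two_prop_test_textbook(8, 10, 4, 10)
print(round(z_clj, 3), round(z_std, 3))  # 1.773 1.826
```

Small-sample conversations are exactly where the pseudocounts matter most, so any library replacement would need to reproduce the smoothed values, not the textbook ones.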