68 changes: 43 additions & 25 deletions delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -4,13 +4,24 @@

The Delphi Python math pipeline has 15 documented discrepancies with the Clojure reference implementation (see `deep-analysis-for-julien/07-discrepancies.md` and `deep-analysis-for-julien/09-fix-plan.md`). We need to fix them one-by-one with a TDD approach: **first extend the regression test to verify the discrepancy exists, then fix it, then verify the test passes**.

Each fix will be a separate PR to keep reviews manageable. PRs will be **stacked** (each builds on the previous), since fixes are ordered by pipeline execution order — fixing upstream affects downstream. PRs should be clearly labeled as stacked with their dependency chain, so reviewers know the order. We can use `git rebase -i` to clean up commit history before merging to main.
Each fix will be a separate PR to keep reviews manageable. PRs are **stacked** (each builds on the previous), since fixes are ordered by pipeline execution order — fixing upstream affects downstream. The stack is managed via `.claude/STACK` and the `/pr-stack` skill.

**PR naming convention**: Clojure parity fix PRs use the title prefix `[Clj parity PR N]` (e.g., `[Clj parity PR 0] Per-discrepancy test infrastructure`, `[Clj parity PR 1] Fix D2: in-conv participant threshold`).
**PR naming**: Titles use `[Stack N/M]` prefix (auto-managed by `.claude/skills/pr-stack/update-stack-titles.sh`). The descriptive part of the title should be self-explanatory.

**Prerequisite**: The current `kmeans_work` branch has changes from `edge`. These should be separated into their own PR(s) first, before we start the discrepancy fix PRs. The discrepancy fix PRs should be based on the cleaned-up branch.
### Stack ↔ Plan Cross-Reference

**Action**: Before starting any fix, analyze the diff of `kmeans_work` vs `upstream/edge` to understand what's changed. Group those changes into logical PR(s) — e.g., test infrastructure improvements, doc updates, minor bug fixes. Each should be reviewable independently. The discrepancy fix PRs (PR 1+) then stack on top of this clean base.
The full PR stack includes infrastructure PRs (Stack 1-7) followed by discrepancy fixes.
This plan's "PR N" labels map to actual GitHub PRs as follows:

| Plan label | GitHub PR | Stack | Title |
|-----------|-----------|-------|-------|
| PR 0 (infra) | #2417–#2420 | Stack 1-7 | Test cleanup, clustering, cold-start tooling, analysis docs |
| PR 1 (D2) | #2421 | Stack 8/10 | Fix D2: in-conv participant threshold + D2c vote count source |
| PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
Comment on lines +19 to +21
Copilot AI Mar 23, 2026


This cross-reference table hard-codes stack positions like Stack 8/10 and 10/10, but this PR is titled Stack 11/25, so the doc becomes stale as soon as the stack grows. Suggestion: avoid fixed denominators (e.g., use Stack 8 / Stack 10), or reference the canonical stack file (.claude/STACK) and keep this table focused on PR numbers ↔ GitHub PR links rather than mutable stack indices.

Suggested change
| PR 1 (D2) | #2421 | Stack 8/10 | Fix D2: in-conv participant threshold + D2c vote count source |
| PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula |
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
| PR 1 (D2) | #2421 | Stack 8 | Fix D2: in-conv participant threshold + D2c vote count source |
| PR 2 (D4) | #2435 | Stack 9 | Fix D4: pseudocount formula |
| (perf) | #2436 | Stack 10 | Speed up regression tests |

| PR 3 (D9) | — | — | *Next: Fix D9 z-score thresholds* |

Future fix PRs will be appended to the stack as they're created.

### Session Continuity

@@ -38,6 +49,7 @@ Because this work will span multiple Claude Code sessions, we maintain:
- **All datasets, not just biodiversity**: Every fix must pass on ALL datasets. biodiversity is just one reference among many.
- **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly.
- **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification.
- **Remove dead code after replacement**: When a function is replaced by a new implementation (e.g. vectorized version), the old function must be deleted and all callers updated — not left as dead code. Do this in the same PR or a follow-up, after benchmarks and tests confirm the replacement works.

### Datasets Available (sorted by size, smallest first)

@@ -399,27 +411,33 @@ By this point, we should have good test coverage from all the per-discrepancy tests.

## Discrepancy Coverage Checklist

| ID | Discrepancy | PR | Status |
|----|-------------|-----|--------|
| D1 | PCA sign flips | PR 13 | Fix (sign consistency) |
| D1b | Projection input | PR 13 | Fix with D1 |
| D2 | In-conv threshold | **PR 1** | **DONE** ✓ |
| D2b | Base-cluster sort order | **PR 1** | **DONE** ✓ |
| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | **DONE** ✓ |
| D2d | In-conv monotonicity (once in, always in) | **PR 1** | **DONE** ✓ (5 guard tests, T1-T5) |
| D3 | K-smoother buffer | PR 10 | Fix |
| D4 | Pseudocount formula | **PR 2** | **DONE** ✓ |
| D5 | Proportion test | PR 4 | Fix |
| D6 | Two-proportion test | PR 5 | Fix |
| D7 | Repness metric | PR 6 | Fix (with flag for old formula) |
| D8 | Finalize cmt stats | PR 7 | Fix |
| D9 | Z-score thresholds | **PR 3** | Fix |
| D10 | Rep comment selection | PR 8 | Fix (with legacy env var) |
| D11 | Consensus selection | PR 9 | Fix (with legacy env var) |
| D12 | Comment priorities | PR 11 | Fix (implement from scratch) |
| D13 | Subgroup clustering | — | **Deferred** (unused) |
| D14 | Large conv optimization | — | **Deferred** (Python fast enough) |
| D15 | Moderation handling | PR 12 | Fix |
| ID | Discrepancy | Plan PR | GitHub PR | Status |
|----|-------------|---------|-----------|--------|
| D1 | PCA sign flips | PR 13 | — | Fix (sign consistency) |
| D1b | Projection input | PR 13 | — | Fix with D1 |
| D2 | In-conv threshold | **PR 1** | **#2421** | **DONE** ✓ |
| D2b | Base-cluster sort order | **PR 1** | **#2421** | **DONE** ✓ |
| D2c | Vote count source (raw vs filtered matrix) | **PR 1** | **#2421** | **DONE** ✓ |
| D2d | In-conv monotonicity (once in, always in) | **PR 1** | **#2421** | **DONE** ✓ (5 guard tests, T1-T5) |
| D3 | K-smoother buffer | PR 10 | — | Fix |
| D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** ✓ |
| D5 | Proportion test | PR 4 | — | Fix |
| D6 | Two-proportion test | PR 5 | — | Fix |
| D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
| D8 | Finalize cmt stats | PR 7 | — | Fix |
| D9 | Z-score thresholds | PR 3 | — | Fix (next) |
| D10 | Rep comment selection | PR 8 | — | Fix (with legacy env var) |
| D11 | Consensus selection | PR 9 | — | Fix (with legacy env var) |
| D12 | Comment priorities | PR 11 | — | Fix (implement from scratch) |
| D13 | Subgroup clustering | — | — | **Deferred** (unused) |
| D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) |
| D15 | Moderation handling | PR 12 | — | Fix |

### Non-discrepancy PRs in the stack

| GitHub PR | Stack | Description |
|-----------|-------|-------------|
| #2436 | 10/10 | Speed up regression tests (benchmark off, skip intermediate stages) |

---

2 changes: 1 addition & 1 deletion delphi/notebooks/run_analysis.py
@@ -45,7 +45,7 @@ def check_environment():

# Import polismath modules
from polismath.conversation.conversation import Conversation
from polismath.pca_kmeans_rep.repness import conv_repness, participant_stats
from polismath.pca_kmeans_rep.repness import conv_repness
from polismath.pca_kmeans_rep.corr import compute_correlation

def load_votes(votes_path):
163 changes: 75 additions & 88 deletions delphi/polismath/conversation/conversation.py
@@ -20,7 +20,7 @@
kmeans_sklearn,
calculate_silhouette_sklearn
)
from polismath.pca_kmeans_rep.repness import conv_repness, participant_stats
from polismath.pca_kmeans_rep.repness import conv_repness
from polismath.pca_kmeans_rep.corr import compute_correlation


@@ -792,114 +792,101 @@ def _compute_participant_info_optimized(self, vote_matrix: pd.DataFrame, group_c

# OPTIMIZATION 3: Precompute group vote matrices and average votes

# Precompute group vote matrices and their valid comment masks
group_vote_matrices = {}
# Precompute group average votes and valid comment masks
group_avg_votes = {}
group_valid_masks = {}

for group_id, member_indices in group_member_indices.items():
if len(member_indices) >= 3: # Only calculate for groups with enough members
# Extract the group vote matrix
group_vote_matrix = matrix_values[member_indices, :]
group_vote_matrices[group_id] = group_vote_matrix


# Calculate average votes per comment for this group
group_avg_votes[group_id] = np.mean(group_vote_matrix, axis=0)

# Precompute which comments have at least 3 votes from this group
group_valid_masks[group_id] = np.sum(group_vote_matrix != 0, axis=0) >= 3

# OPTIMIZATION 4: Use vectorized operations for participant stats
# VECTORIZED: Compute vote counts for ALL participants at once

process_start = time.time()
batch_start = time.time()

for p_idx, participant_id in enumerate(vote_matrix.index):
if p_idx >= matrix_values.shape[0]:

n_agree_all = np.sum(matrix_values > 0, axis=1) # (N,)
n_disagree_all = np.sum(matrix_values < 0, axis=1) # (N,)
n_pass_all = np.sum(matrix_values == 0, axis=1) # (N,)
n_votes_all = n_agree_all + n_disagree_all # (N,)

# Mask: participants with at least one real vote
has_votes = n_votes_all > 0 # (N,) bool

# VECTORIZED: Compute per-group correlations for ALL participants at once
# Store as {group_id: corr_array} where corr_array is (N,)
group_corr_arrays = {}

for group_id, member_indices in group_member_indices.items():
if len(member_indices) < 3 or group_id not in group_avg_votes:
# All correlations default to 0.0
group_corr_arrays[group_id] = np.zeros(participant_count)
continue

# Print progress for large participant sets
if participant_count > 100 and p_idx % 100 == 0:
now = time.time()
elapsed = now - process_start
batch_time = now - batch_start
batch_start = now
percent = (p_idx / participant_count) * 100
logger.info(f"Processed {p_idx}/{participant_count} participants ({percent:.1f}%) - " +
f"Elapsed: {elapsed:.2f}s, Batch: {batch_time:.4f}s")

# Get participant votes
participant_votes = matrix_values[p_idx, :]

# Count votes using vectorized operations
n_agree = np.sum(participant_votes > 0)
n_disagree = np.sum(participant_votes < 0)
n_pass = np.sum(participant_votes == 0)
n_votes = n_agree + n_disagree

# Skip participants with no votes
if n_votes == 0:

valid_mask = group_valid_masks[group_id]
n_valid = int(np.sum(valid_mask))

if n_valid < 3:
group_corr_arrays[group_id] = np.zeros(participant_count)
continue

# Find participant's group using precomputed mapping
participant_group = ptpt_group_map.get(participant_id)

# OPTIMIZATION 5: Efficient group correlation calculation

# Calculate agreement with each group - optimized version
group_agreements = {}

for group_id, member_indices in group_member_indices.items():
if len(member_indices) < 3:
# Skip groups with too few members
group_agreements[group_id] = 0.0
continue

if group_id not in group_avg_votes or group_id not in group_valid_masks:
group_agreements[group_id] = 0.0
continue

# Use precomputed data
g_votes = group_avg_votes[group_id]
valid_mask = group_valid_masks[group_id]

if np.sum(valid_mask) >= 3: # At least 3 valid comments
# Extract only valid comment votes
p_votes = participant_votes[valid_mask]
g_votes_valid = g_votes[valid_mask]

# Fast correlation calculation
p_std = np.std(p_votes)
g_std = np.std(g_votes_valid)

if p_std > 0 and g_std > 0:
# Use numpy's built-in correlation (faster and more numerically stable)
correlation = np.corrcoef(p_votes, g_votes_valid)[0, 1]

if not np.isnan(correlation):
group_agreements[group_id] = correlation
else:
group_agreements[group_id] = 0.0
else:
group_agreements[group_id] = 0.0
else:
group_agreements[group_id] = 0.0

# Store participant stats

# P: all participants' votes on valid comments — (N, n_valid)
P = matrix_values[:, valid_mask]
# g: group average on valid comments — (n_valid,)
g = group_avg_votes[group_id][valid_mask]

p_mean = P.mean(axis=1) # (N,)
g_mean = g.mean() # scalar
p_std = P.std(axis=1) # (N,)
g_std = g.std() # scalar
Comment on lines +839 to +847
Copilot AI Mar 23, 2026


Inside the per-group loop, P = matrix_values[:, valid_mask] uses boolean indexing, which allocates a new (N × n_valid) array for each group. For many groups and/or many valid columns this can become a major memory/time cost and partially offsets the vectorization gains. A concrete optimization is to compute correlations only for participants with votes (use has_votes to slice rows), then write results into a full-length (N,) array (zeros for non-voters), which reduces the size of P and the amount of work in mean/std/dot products.

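Acting on this reviewer suggestion, the slice-then-scatter pattern could be sketched as follows. This is a standalone sketch with invented data, not the PR's actual code, and the group-average step is simplified to a whole-matrix column mean rather than a per-group mean:

```python
import numpy as np

# Invented 3x4 vote matrix; row 1 is a non-voter (all zeros)
matrix_values = np.array([
    [1, 0, -1, 1],
    [0, 0, 0, 0],
    [-1, 1, 1, 0],
], dtype=float)

has_votes = (matrix_values != 0).any(axis=1)      # (N,) bool
valid_mask = np.array([True, True, False, True])  # comments with enough votes

# Slice rows to voters BEFORE the heavy ops: P is (n_voters, n_valid)
P = matrix_values[np.ix_(has_votes, valid_mask)]
g = matrix_values[:, valid_mask].mean(axis=0)     # simplified "group" average
n_valid = int(valid_mask.sum())

p_mean = P.mean(axis=1)
p_std = P.std(axis=1)
cross_mean = (P @ g) / n_valid

# np.where evaluates both branches; suppress the p_std == 0 warnings
with np.errstate(invalid="ignore", divide="ignore"):
    voter_corr = np.where(
        p_std > 0,
        (cross_mean - p_mean * g.mean()) / (p_std * g.std()),
        0.0,
    )
voter_corr = np.nan_to_num(voter_corr, nan=0.0)

# Scatter back: full-length result array, zeros for non-voters
corr = np.zeros(matrix_values.shape[0])
corr[has_votes] = voter_corr
```

The per-group `P` allocations then scale with the number of voters rather than all participants, which is the memory saving the comment describes.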

if g_std == 0:
group_corr_arrays[group_id] = np.zeros(participant_count)
continue

# Pearson correlation: (mean(P*g) - mean(P)*mean(g)) / (std(P)*std(g))
cross_mean = (P @ g) / n_valid # (N,)

# np.where evaluates both branches; suppress divide-by-zero for p_std==0
with np.errstate(invalid='ignore', divide='ignore'):
corr = np.where(
p_std > 0,
(cross_mean - p_mean * g_mean) / (p_std * g_std),
0.0,
)
corr = np.nan_to_num(corr, nan=0.0)
group_corr_arrays[group_id] = corr

# Assemble result dicts (zero computation — just indexing)
group_ids = list(group_member_indices.keys())

for p_idx, participant_id in enumerate(vote_matrix.index):
if not has_votes[p_idx]:
continue

result['stats'][participant_id] = {
'n_agree': int(n_agree),
'n_disagree': int(n_disagree),
'n_pass': int(n_pass),
'n_votes': int(n_votes),
'group': participant_group,
'group_correlations': group_agreements
'n_agree': int(n_agree_all[p_idx]),
'n_disagree': int(n_disagree_all[p_idx]),
'n_pass': int(n_pass_all[p_idx]),
'n_votes': int(n_votes_all[p_idx]),
'group': ptpt_group_map.get(participant_id),
'group_correlations': {
gid: float(group_corr_arrays[gid][p_idx])
for gid in group_ids
}
}

total_time = time.time() - start_time
process_time = time.time() - process_start
logger.info(f"Participant stats completed in {total_time:.2f}s (preparation: {prep_time:.2f}s, processing: {process_time:.2f}s)")
logger.info(f"Processed {len(result['stats'])} participants with {len(group_clusters)} groups")

return result

def _compute_participant_info(self) -> None:
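The vectorized correlation code above relies on the closed-form per-row Pearson formula. As a sanity check, a standalone sketch (made-up 4×5 vote data, not the project's API) can verify it against `np.corrcoef` row by row:

```python
import numpy as np

# Made-up vote matrix: 4 participants x 5 comments, votes in {-1, 0, 1}
matrix_values = np.array([
    [1, -1, 0, 1, 1],
    [-1, 1, 1, 0, -1],
    [1, 1, -1, -1, 0],
    [0, -1, 1, 1, -1],
], dtype=float)

g = matrix_values.mean(axis=0)        # group average vote per comment
n_valid = matrix_values.shape[1]

p_mean = matrix_values.mean(axis=1)   # (N,)
p_std = matrix_values.std(axis=1)     # (N,) population std; ddof cancels in r
cross_mean = (matrix_values @ g) / n_valid

# Closed-form Pearson r per row: (E[xy] - E[x]E[y]) / (std(x) * std(y))
with np.errstate(invalid="ignore", divide="ignore"):
    corr = np.where(
        p_std > 0,
        (cross_mean - p_mean * g.mean()) / (p_std * g.std()),
        0.0,
    )
corr = np.nan_to_num(corr, nan=0.0)

# Reference: per-row np.corrcoef agrees to machine precision
expected = np.array([np.corrcoef(row, g)[0, 1] for row in matrix_values])
assert np.allclose(corr, expected)
```

Because Pearson r is scale-invariant, the choice of population vs. sample standard deviation cancels, so the closed form matches `np.corrcoef` exactly.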
3 changes: 1 addition & 2 deletions delphi/polismath/pca_kmeans_rep/__init__.py
@@ -10,14 +10,13 @@

from polismath.pca_kmeans_rep.pca import pca_project_dataframe
from polismath.pca_kmeans_rep.clusters import cluster_dataframe, Cluster
from polismath.pca_kmeans_rep.repness import conv_repness, participant_stats
from polismath.pca_kmeans_rep.repness import conv_repness
from polismath.pca_kmeans_rep.corr import compute_correlation

__all__ = [
'pca_project_dataframe',
'cluster_dataframe',
'Cluster',
'conv_repness',
'participant_stats',
'compute_correlation',
]
Comment on lines +13 to 22
Copilot AI Mar 23, 2026


Removing participant_stats from the package exports is a breaking public API change (imports like from polismath.pca_kmeans_rep import participant_stats will fail). If downstream callers exist, consider keeping a participant_stats shim (e.g., thin wrapper calling the new optimized implementation) and marking it as deprecated, or document the breaking change explicitly (release notes / migration guidance).

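If the shim route suggested here is chosen, it could be as small as the following sketch. `_optimized_participant_stats` is a placeholder name for however the new code path is exposed; it stands in for the logic that this PR moved into `Conversation._compute_participant_info_optimized`:

```python
import warnings
from typing import Any

def _optimized_participant_stats(*args: Any, **kwargs: Any) -> dict:
    # Placeholder for the new optimized code path; returns the same
    # {"stats": {...}} shape the old function produced.
    return {"stats": {}}

def participant_stats(*args: Any, **kwargs: Any) -> dict:
    """Deprecated shim preserving the old public export."""
    warnings.warn(
        "participant_stats is deprecated; participant statistics are now "
        "computed inside Conversation._compute_participant_info_optimized",
        DeprecationWarning,
        stacklevel=2,
    )
    return _optimized_participant_stats(*args, **kwargs)
```

Keeping the shim in `__all__` for one release, then deleting it, gives downstream callers a migration window without leaving dead code behind permanently.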