Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 52 additions & 2 deletions delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
Original file line number Diff line number Diff line change
Expand Up @@ -440,8 +440,58 @@ significance filtering. Using different priors is intentional.

### What's Next

1. **PR 5 — Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
from proportions to raw counts.
1. **PR 6 — Fix D7 (Repness metric)**: Change formula from `pa * (|pat| + |rat|)` to
`ra * rat * pa * pat` (Clojure product formula).

---

## PR 5: Fix D6 — Two-Proportion Test Pseudocounts

### TDD steps
1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 102 passed, 5 skipped, 143 xfailed, 2 xpassed
2. **Red**: Rewrote `TestD6TwoPropTest` with new signature `two_prop_test(succ_in, succ_out, pop_in, pop_out)`
and correct Clojure formula → 3 failures (TypeError: old function expects proportions)
3. **Fix**: Replaced both `two_prop_test` and `two_prop_test_vectorized` with Clojure formula:
add +1 to all 4 inputs (stats.clj:20), compute `pi1=(s+1)/(p+1)`, standard pooled z-test
4. **Green**: All 3 D6 formula tests pass, 4 blob comparison tests xfail (depend on D10)
5. **Full suite**: 4 regression failures, all in `rat`/`rdt`/`agree_metric`/`disagree_metric` — direct
downstream of the formula change. No unexpected fields affected.
6. **Re-recorded golden snapshots** for all 7 datasets (public + private)
7. **Final**: 1 failed (pakistan-incremental D2, pre-existing), 102 passed, 5 skipped, 143 xfailed, 2 xpassed

### Changes
- `repness.py`: `two_prop_test(p1, n1, p2, n2)` → `two_prop_test(succ_in, succ_out, pop_in, pop_out)`
with +1 pseudocount on all 4 inputs, matching Clojure's `(map inc ...)` (stats.clj:20)
- `repness.py`: `two_prop_test_vectorized` — same signature change
- `repness.py`: Updated callers in `add_comparative_stats` and `compute_group_comment_stats_df`
to pass raw counts `(na, other_na, ns, other_ns)` instead of `(pa, ns, other_pa, other_ns)`
- `test_discrepancy_fixes.py`: Rewrote `TestD6TwoPropTest` with correct formula, 7 test cases,
edge cases, and regularization effect test
- `test_repness_unit.py`: Updated `test_two_prop_test`, `test_two_prop_test_vectorized`,
`test_two_prop_test_vectorized_edge_cases` for new signature
- `test_old_format_repness.py`: Updated `test_two_prop_test` for new signature

### Key insight: existing test had wrong expected formula
The pre-existing D6 test computed expected values using `(succ+1)/(n+2)` — as if two pseudocounts
were added to the denominator. But Clojure's `(map inc ...)` adds +1 to each value independently,
giving `(succ+1)/(pop+1)`. The formula is a standard pooled z-test on the pseudocount-adjusted values,
not a Beta distribution posterior.

### Session 8 (2026-03-13)

- Created branch `jc/clj-parity-d6-two-prop-test` on top of `jc/clj-parity-d5-prop-test`
- Read Clojure source (stats.clj:18-33, repness.clj:97-100) to verify formula and call sites
- Discovered the existing D6 test had wrong expected formula — fixed
- TDD cycle: red (3 TypeError failures) → fix → green (3 pass, 4 xfail)
- Updated all callers: both scalar (`add_comparative_stats`) and vectorized
(`compute_group_comment_stats_df`) now pass raw counts
- Full suite: 4 regression failures, all in rat/rdt/metric fields (expected)
- Re-recorded golden snapshots for all 7 datasets
- Final validation: 102 passed, 1 pre-existing failure (pakistan-incremental D2)

### What's Next

1. **PR 6 — Fix D7 (Repness metric)**: Change from `pa * (|pat| + |rat|)` to `ra * rat * pa * pat`.

---

Expand Down
3 changes: 2 additions & 1 deletion delphi/docs/PLAN_DISCREPANCY_FIXES.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
| PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |
| PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |
| PR 5 (D6) | #2449 | Stack 15/25 | Fix D6: two-proportion test pseudocounts |

Future fix PRs will be appended to the stack as they're created.

Expand Down Expand Up @@ -467,7 +468,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
| D3 | K-smoother buffer | PR 10 | — | Fix |
| D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** ✓ |
| D5 | Proportion test | **PR 4** | — | **DONE** ✓ |
| D6 | Two-proportion test | PR 5 | — | Fix |
| D6 | Two-proportion test | **PR 5** | — | **DONE** ✓ |
| D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
| D8 | Finalize cmt stats | PR 7 | — | Fix |
| D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** ✓ |
Expand Down
120 changes: 75 additions & 45 deletions delphi/polismath/pca_kmeans_rep/repness.py
Original file line number Diff line number Diff line change
Expand Up @@ -95,33 +95,51 @@ def prop_test(succ: int, n: int) -> float:
return 2 * math.sqrt(n_pc) * (succ_pc / n_pc - 0.5)


def two_prop_test(p1: float, n1: int, p2: float, n2: int) -> float:
def two_prop_test(succ_in: int, succ_out: int, pop_in: int, pop_out: int) -> float:
"""
Two-proportion z-test.

Two-proportion z-test with +1 pseudocount on all inputs.

Matches Clojure's stats/two-prop-test (stats.clj:18-33):
(let [[succ-in succ-out pop-in pop-out] (map inc [succ-in succ-out pop-in pop-out])
pi1 (/ succ-in pop-in)
pi2 (/ succ-out pop-out)
pi-hat (/ (+ succ-in succ-out) (+ pop-in pop-out))]
...)

The +1 pseudocount (Laplace smoothing) regularizes the z-score for small
samples, preventing extreme values when group sizes are tiny.

Args:
p1: First proportion
n1: Number of observations for first proportion
p2: Second proportion
n2: Number of observations for second proportion
succ_in: Number of successes in the group (e.g., agrees)
succ_out: Number of successes outside the group
pop_in: Total votes in the group
pop_out: Total votes outside the group

Returns:
Z-score
Z-score (positive means group proportion > other proportion)
"""
if n1 == 0 or n2 == 0:
if pop_in == 0 or pop_out == 0:
return 0.0

# Pooled probability
p = (p1 * n1 + p2 * n2) / (n1 + n2)

# Standard error
se = math.sqrt(p * (1 - p) * (1/n1 + 1/n2))

# Z-score calculation

# Add +1 pseudocount to all four inputs (Clojure: map inc)
s1 = succ_in + 1
s2 = succ_out + 1
p1 = pop_in + 1
p2 = pop_out + 1

pi1 = s1 / p1
pi2 = s2 / p2
pi_hat = (s1 + s2) / (p1 + p2)

if pi_hat == 1.0:
# Clojure note (stats.clj:26-27): "this isn't quite right... could
# actually solve this using limits" — returning 0 for now, matching Clojure.
return 0.0

se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
if se == 0:
return 0.0
else:
return (p1 - p2) / se
return (pi1 - pi2) / se


def comment_stats(votes: np.ndarray, group_members: List[int]) -> Dict[str, Any]:
Expand Down Expand Up @@ -182,15 +200,17 @@ def add_comparative_stats(comment_stats: Dict[str, Any],
result['ra'] = result['pa'] / other_stats['pa'] if other_stats['pa'] > 0 else 1.0
result['rd'] = result['pd'] / other_stats['pd'] if other_stats['pd'] > 0 else 1.0

# Calculate representativeness tests
# Calculate representativeness tests — pass raw counts, matching Clojure's
# (stats/two-prop-test (:na in-stats) (sum :na rest-stats)
# (:ns in-stats) (sum :ns rest-stats)) (repness.clj:97-100)
result['rat'] = two_prop_test(
result['pa'], result['ns'],
other_stats['pa'], other_stats['ns']
result['na'], other_stats['na'],
result['ns'], other_stats['ns']
)

result['rdt'] = two_prop_test(
result['pd'], result['ns'],
other_stats['pd'], other_stats['ns']
result['nd'], other_stats['nd'],
result['ns'], other_stats['ns']
)

return result
Expand Down Expand Up @@ -493,30 +513,38 @@ def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series:
return z


def two_prop_test_vectorized(p1: pd.Series, n1: pd.Series,
p2: pd.Series, n2: pd.Series) -> pd.Series:
def two_prop_test_vectorized(succ_in: pd.Series, succ_out: pd.Series,
pop_in: pd.Series, pop_out: pd.Series) -> pd.Series:
"""
Vectorized two-proportion z-test.
Vectorized two-proportion z-test with +1 pseudocount on all inputs.

Matches Clojure's stats/two-prop-test (stats.clj:18-33).
See two_prop_test() scalar version for formula details.

Args:
p1: Series of first proportions
n1: Series of number of observations for first proportion
p2: Series of second proportions
n2: Series of number of observations for second proportion
succ_in: Series of success counts in the group
succ_out: Series of success counts outside the group
pop_in: Series of total vote counts in the group
pop_out: Series of total vote counts outside the group

Returns:
Series of z-scores
"""
# Pooled probability
p_pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
# Add +1 pseudocount to all four inputs (Clojure: map inc)
s1 = succ_in + 1
s2 = succ_out + 1
p1 = pop_in + 1
p2 = pop_out + 1

# Standard error
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
pi1 = s1 / p1
pi2 = s2 / p2
pi_hat = (s1 + s2) / (p1 + p2)

# Z-score calculation
z = (p1 - p2) / se
se = np.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
z = (pi1 - pi2) / se

# Handle edge cases
# Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
z = z.fillna(0.0)
z = z.replace([np.inf, -np.inf], 0.0)
return z
Expand Down Expand Up @@ -649,14 +677,16 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame,
stats_df['ra'] = stats_df['ra'].replace([np.inf, -np.inf], 1.0).fillna(1.0)
stats_df['rd'] = stats_df['rd'].replace([np.inf, -np.inf], 1.0).fillna(1.0)

# Compute representativeness tests (two-proportion z-test: group vs other)
# Compute representativeness tests — pass raw counts, matching Clojure's
# (stats/two-prop-test (:na in-stats) (sum :na rest-stats)
# (:ns in-stats) (sum :ns rest-stats)) (repness.clj:97-100)
stats_df['rat'] = two_prop_test_vectorized(
stats_df['pa'], stats_df['ns'],
stats_df['other_pa'], stats_df['other_votes']
stats_df['na'], stats_df['other_agree'],
stats_df['ns'], stats_df['other_votes']
)
stats_df['rdt'] = two_prop_test_vectorized(
stats_df['pd'], stats_df['ns'],
stats_df['other_pd'], stats_df['other_votes']
stats_df['nd'], stats_df['other_disagree'],
stats_df['ns'], stats_df['other_votes']
)

# Compute metrics
Expand Down
Loading
Loading