diff --git a/delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md b/delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md new file mode 100644 index 0000000000..f586914544 --- /dev/null +++ b/delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md @@ -0,0 +1,134 @@ +# Handoff: PR 14 — Make Vectorized Code Readable + Blob Injection Tests + +## Goal + +The scalar functions (`comment_stats`, `add_comparative_stats`, `repness_metric`, +`finalize_cmt_stats`) read like a step-by-step recipe. Their vectorized replacement +(`compute_group_comment_stats_df`) buries the same logic in 150 lines of DataFrame +plumbing. The vectorized path is the ONLY production code path — the scalar functions +are dead code, called only from tests and benchmarks. + +**PR 14 must make the vectorized code at least as readable as the scalar code, then +delete the scalar functions.** This also enables vectorized blob injection tests — +the only non-tautological way to verify correctness against the Clojure math blob. + +## Starting point + +Branch off `jc/clj-parity-d9-fix` (Stack 13). This is BELOW D5-D8 in the stack, +so the refactor will be inherited by all formula fix PRs. 
+ +## Current stack (as of 2026-03-17) + +``` +Stack 13: jc/clj-parity-d9-fix ← branch off HERE for PR 14 +Stack 14: jc/clj-parity-d5-prop-test (draft PR #2448) +Stack 15: jc/clj-parity-d6-two-prop-test (draft PR #2449) +Stack 16: jc/clj-parity-d7-repness-metric (draft PR #2450) +Stack 17: jc/clj-parity-d8-finalize-stats (draft PR #2451) +Stack 18: jc/clj-parity-d15-moderation-handling-zeros-vs-removes (PR #2452) +Stack 19: jc/clj-parity-kmeans-k-divergence (k-divergence fix) +Stack 20: jc/clj-parity-d10-rep-comment-selection +Stack 21: jc/clj-parity-d11-consensus-comment-selection +Stack 22: jc/clj-parity-d3-k-smoother-buffer +Stack 23: jc/clj-parity-d12-comment-priorities +Stack 24: jc/clj-parity-d1-pca-sign-flip-prevention +Stack 25: jc/clj-parity-pr15-load-votes-sort +``` + +D5-D8 are draft PRs waiting for vectorized blob tests before being marked ready. + +## Task sequence + +### 1. Refactor `compute_group_comment_stats_df` + +Split into two phases: + +**(a) DataFrame construction** (the plumbing): +- Build participant→group mapping +- Compute total counts per comment +- Create full (group, comment) cross-product index +- Join total counts, compute "other" counts + +**(b) Statistics computation** (the readable part): +A new function with clean DataFrame inputs (columns: `na`, `nd`, `ns`, +`other_agree`, `other_disagree`, `other_votes`) → outputs (columns: `pa`, `pd`, +`pat`, `pdt`, `ra`, `rd`, `rat`, `rdt`, `agree_metric`, `disagree_metric`, `repful`). + +This function should read like the scalar recipe: +```python +# Probabilities with pseudocounts +df['pa'] = (df['na'] + PSEUDO_COUNT/2) / (df['ns'] + PSEUDO_COUNT) +# Proportion tests +df['pat'] = prop_test_vectorized(df['na'], df['ns']) +# Representativeness ratios +df['ra'] = df['pa'] / df['other_pa'] +# ...etc +``` + +### 2. Write vectorized blob injection tests + +Inject Clojure group memberships + votes into the new statistics function, +compare output to blob values. Test the PRODUCTION code path. 
+ +For each `(group, tid)` in the blob's `repness`: +- Compare `pat` (or `pdt`) to blob `p-test` +- Compare `rat` (or `rdt`) to blob `repness-test` +- Compare `pa` (or `pd`) to blob `p-success` +- Compare `repful` to blob `repful-for` + +The blob only stores the winning side's values. `repful-for` tells you which +side won: if `agree`, then `n-success`=na, `p-test`=pat, `repness-test`=rat. +If `disagree`, then `n-success`=nd, `p-test`=pdt, `repness-test`=rdt. + +### 3. Verify scalar ≡ vectorized + +Run both paths on all datasets, compare outputs field by field. Must be +identical (not just close — exact match) since they implement the same formulas. + +### 4. Delete scalar functions + +Remove: `comment_stats`, `add_comparative_stats`, `repness_metric`, +`finalize_cmt_stats`, and scalar `prop_test`/`two_prop_test` if no longer +needed (the vectorized versions remain). + +Update all tests that called the scalar API. + +### 5. Cascade rebase + +Rebase D5→D6→D7→D8→D15→k-divergence→D10→D11→D3→D12→D1→PR15 on top. +Use `.claude/skills/pr-stack/rebase-stack.sh` for the cascade. + +### 6. Add per-PR vectorized blob injection tests + +For each of D5, D6, D8: insert a RED blob test commit before the fix, +verify it fails, then verify the fix makes it pass. Same TDD pattern used +for the scalar blob tests already in those branches. + +Then mark #2448-#2451 as ready (un-draft). + +## Key blob context + +- `n-trials` in the Clojure blob = `S` (total seen, INCLUDING passes), not + `A+D` (agrees + disagrees). Verified: `prop_test(11, 14)` matches blob + `p-test` for tid=49 group 0 in vw (A=2, D=11, S=14, A+D=13). +- `group-votes[gid].votes[tid]` = `{A: agrees, D: disagrees, S: total_seen}` +- Blob `repness[gid][i]` fields: `n-success`, `n-trials`, `p-success`, + `p-test`, `repness` (=ra or rd), `repness-test` (=rat or rdt), + `repful-for`, `tid`, `best-agree` (optional). 
+ +## Files to modify + +- `delphi/polismath/pca_kmeans_rep/repness.py` — the refactor +- `delphi/tests/test_discrepancy_fixes.py` — vectorized blob tests +- `delphi/tests/test_repness_unit.py` — update for new API +- `delphi/tests/test_old_format_repness.py` — update for new API +- `delphi/polismath/benchmarks/bench_repness.py` — update imports + +## Reference + +- Journal: `delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md` — Session 11 entry, + PR 14 readability goal in Notes for Future Sessions +- Plan: `delphi/docs/PLAN_DISCREPANCY_FIXES.md` — mandatory blob comparison + section in Testing Principles +- Clojure source: `math/src/polismath/math/stats.clj` (prop-test, two-prop-test), + `math/src/polismath/math/repness.clj` (finalize-cmt-stats, repness-metric) diff --git a/delphi/docs/PLAN_DISCREPANCY_FIXES.md b/delphi/docs/PLAN_DISCREPANCY_FIXES.md index 056415156d..3c3ceecfbe 100644 --- a/delphi/docs/PLAN_DISCREPANCY_FIXES.md +++ b/delphi/docs/PLAN_DISCREPANCY_FIXES.md @@ -19,7 +19,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows: | PR 1 (D2) | #2421 | Stack 8/10 | Fix D2: in-conv participant threshold + D2c vote count source | | PR 2 (D4) | #2435 | Stack 9/10 | Fix D4: pseudocount formula | | (perf) | #2436 | Stack 10/10 | Speed up regression tests | -| PR 3 (D9) | — | — | *Next: Fix D9 z-score thresholds* | +| PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) | Future fix PRs will be appended to the stack as they're created. @@ -50,6 +50,10 @@ Because this work will span multiple Claude Code sessions, we maintain: - **Synthetic edge-case tests**: Every time we discover an edge case specific to one conversation, extract it into a synthetic unit test with made-up data (never real data from private datasets). These run fast and document the intent clearly. - **E2E awareness**: GitHub Actions has Cypress E2E tests (`cypress-tests.yml`) testing UI workflows, and `python-ci.yml` running pytest regression. 
The Cypress tests don't test math output values directly, but `python-ci.yml` will break if clustering/repness changes. Formula-level fixes (D4, D5, D6, D7, D8, D9) are pure computation — no E2E risk. Selection logic changes (D10, D11) and priority computation (D12) could affect what the TypeScript server returns. We decide case-by-case which PRs need E2E verification. - **Remove dead code after replacement**: When a function is replaced by a new implementation (e.g. vectorized version), the old function must be deleted and all callers updated — not left as dead code. Do this in the same PR or a follow-up, after benchmarks and tests confirm the replacement works. +- **Mathematical rigor**: These are math fixes. Every formula change must be verified against the Clojure reference implementation by reading the actual Clojure source and the Python source side-by-side. Verify algebraic equivalence explicitly — don't assume. When in doubt, add a comment showing the derivation. +- **Exhaustive RED phase**: In the RED phase of TDD, don't just write one test showing the discrepancy. Actively ask: "What other behaviors does this change affect? What are the boundary conditions? What happens with empty inputs, single-element inputs, all-agree cases, all-disagree cases?" Write tests for all of them. Before moving to GREEN, explicitly list what tests are still missing and add them. The goal is that the test suite for each fix is comprehensive enough that a wrong implementation cannot pass. +- **Check your work**: After implementing a fix, re-read the Clojure source one more time and verify each line of the Python implementation corresponds correctly. Check array shapes, index semantics (0-based vs 1-based), and aggregation axes. Off-by-one errors and transposed matrices are the most common bugs. +- **Large private datasets are slow**: Some private conversations have 100K–1M votes. Running the full test suite with `--include-local` on all of them can take a very long time. 
It's OK to run only the small/medium datasets (vw, biodiversity, and the smaller private ones) during the RED/GREEN cycle. Run the full set including large conversations only once, as a final validation before committing — and even then, if a specific large dataset is known to be slow, it's acceptable to skip it and note which ones were tested in the PR description. ### Datasets Available (sorted by size, smallest first) @@ -383,11 +387,51 @@ This is non-trivial and should be one of the last fixes. --- -### PR 14: Cleanup — Remove Dead Code +### PR 14: Refactor Vectorized Code for Readability + Blob Injection Tests + +**MOVED EARLIER**: PR 14 is now a prerequisite for all formula fix PRs (D5-D8+), +not a post-parity cleanup. It branches off `jc/clj-parity-d9-fix` (Stack 13), +below all formula fixes. Reason: the vectorized production path +(`compute_group_comment_stats_df`) is too monolithic to test against the Clojure +blob. The refactor makes it testable AND readable. + +**The problem**: The scalar functions (`comment_stats`, `add_comparative_stats`, +`repness_metric`, `finalize_cmt_stats`) read like a step-by-step recipe. The +vectorized replacement (`compute_group_comment_stats_df`) buries the same logic +in 150 lines of DataFrame plumbing. The scalar path is dead code in production — +only called from tests and benchmarks. + +**Task**: +1. Split `compute_group_comment_stats_df` into (a) DataFrame construction + (group mapping, cross-product index, joins) and (b) statistics computation + as its own function with clean inputs/outputs — readable AND testable. +2. Write vectorized blob injection tests: inject Clojure group memberships + + votes, compare output to blob values. Tests the PRODUCTION code path. +3. Verify scalar and vectorized paths produce identical output on all datasets. +4. Delete scalar functions. Update tests. 
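The intended shape of the split can be sketched as follows. Names and the `PSEUDO_COUNT` value are illustrative (the real constant and signatures live in `repness.py`); the point is that phase (b) is a pure function over a clean frame and reads like the scalar recipe:

```python
import numpy as np
import pandas as pd

PSEUDO_COUNT = 2  # illustrative; the real constant lives in repness.py

def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series:
    """One-proportion z-test with +1/+2 pseudocounts, applied column-wise."""
    p_hat = (succ + 1) / (n + 2)
    se = np.sqrt(p_hat * (1 - p_hat) / (n + 2))
    return (p_hat - 0.5) / se

def compute_comment_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Phase (b) sketch: statistics on clean per-(group, comment) columns."""
    out = df.copy()
    # Probabilities with pseudocounts
    out["pa"] = (out["na"] + PSEUDO_COUNT / 2) / (out["ns"] + PSEUDO_COUNT)
    out["pd"] = (out["nd"] + PSEUDO_COUNT / 2) / (out["ns"] + PSEUDO_COUNT)
    # Proportion tests
    out["pat"] = prop_test_vectorized(out["na"], out["ns"])
    out["pdt"] = prop_test_vectorized(out["nd"], out["ns"])
    return out

# One row per (group, comment); na/nd/ns come from phase (a) plumbing.
demo = pd.DataFrame({"na": [2], "nd": [11], "ns": [14]})
stats = compute_comment_stats(demo)
```

Phase (a) then only builds `demo`-shaped input (group mapping, cross-product index, joins), and the blob injection tests can target `compute_comment_stats` directly.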
+ +**Files**: `polismath/pca_kmeans_rep/repness.py`, `tests/test_discrepancy_fixes.py`, +`tests/test_repness_unit.py`, `tests/test_old_format_repness.py`, +`polismath/benchmarks/bench_repness.py` + +See `delphi/docs/HANDOFF_PR14_VECTORIZED_REFACTOR.md` for full details. + +After PR 14, each fix PR gets vectorized blob injection tests added in RED→GREEN +TDD pattern. This includes D5-D8 (repness formula fixes), D10/D11 (selection), +D15 (moderation), D12 (priorities). For D3 (k-smoother) and D1 (PCA sign flip), +which are incremental-only features, add synthetic tests + skip markers for +incremental blob comparison pending replay infrastructure (see Replay PRs A/B/C). + +**After adding vectorized tests to each PR, update the plan AND journal** to +record what was tested, what blob fields were compared, and any discrepancies found. +This is mandatory — the plan and journal are how future sessions know what's done. + +--- + +### PR 14b: Cleanup — Remove Remaining Dead Code (after parity) **Files**: Multiple (see `08-dead-code.md`) - Custom kmeans chain in `clusters.py` -- Non-vectorized repness functions in `repness.py` - Buggy `_compute_votes_base()` (after D12 replaces it) - `stats.py` inconsistencies (after D9 makes `repness.py` authoritative) @@ -425,13 +469,14 @@ By this point, we should have good test coverage from all the per-discrepancy te | D6 | Two-proportion test | PR 5 | — | Fix | | D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) | | D8 | Finalize cmt stats | PR 7 | — | Fix | -| D9 | Z-score thresholds | PR 3 | — | Fix (next) | +| D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** ✓ | | D10 | Rep comment selection | PR 8 | — | Fix (with legacy env var) | | D11 | Consensus selection | PR 9 | — | Fix (with legacy env var) | | D12 | Comment priorities | PR 11 | — | Fix (implement from scratch) | | D13 | Subgroup clustering | — | — | **Deferred** (unused) | | D14 | Large conv optimization | — | — | **Deferred** (Python fast enough) | | 
D15 | Moderation handling | PR 12 | — | Fix | +| Replay | Replay infrastructure (A/B/C) | — | — | NOT BUILT — D3/D1 used synthetic tests only. Needed for incremental blob comparison. | ### Non-discrepancy PRs in the stack @@ -441,6 +486,56 @@ By this point, we should have good test coverage from all the per-discrepancy te --- +## Tasks parallelization + +D9 is done (PR #2446). The remaining fixes have the following dependency structure: + +### Repness chain dependency graph (all in `repness.py`) + +``` +D5 ─┬─→ D7 ─┐ +D6 ─┘ D8 ─┼─→ D10 + │ +D5 ──────────┴─→ D11 +``` + +- **D5, D6**: logically independent, but both modify `repness.py` (signature changes + caller updates in `compute_group_comment_stats_df`) — **must be sequential** +- **D7**: after D5 + D6 +- **D8**: after D6 +- **D10**: after D7 + D8 +- **D11**: after D5 only (parallel with D7, D8, D10) + +All are in `repness.py`, strictly sequential within this track. + +### File-boundary analysis + +Every fix touches `test_discrepancy_fixes.py` (different test classes per fix — low conflict risk, but same file). The production code boundaries are: + +| File | Fixes that modify it | +|------|---------------------| +| `repness.py` | D5, D6, D7, D8, D10, D11 | +| `conversation.py` | D3, D12, D15 | +| `pca.py` | D12, D1/D1b | +| `test_repness_unit.py` | D5, D6, D7, D8 | + +**Within each file group, fixes must be sequential** to avoid merge conflicts. + +### Practical parallel tracks (2 worktrees) + +| Track | Worktree | Fixes (sequential within) | Files | +|-------|----------|--------------------------|-------| +| **A — Repness formulas** | main worktree | D5 → D6 → D7 → D8 → D10 → D11 | `repness.py`, `test_repness_unit.py` | +| **B — Conversation/PCA** | separate worktree | D3 → D15 → D12 | `conversation.py`, `pca.py` | +| **C — Late** | (after A+B) | D1/D1b | `pca.py` (needs replay infra) | + +**Tracks A and B can run fully in parallel** using separate worktrees. Within each track, fixes are sequential (same files). 
Track B order is flexible — D3, D15, D12 touch different functions in `conversation.py`, so the order can be chosen for convenience. D12 is the largest (also touches `pca.py`), so putting it last gives D1/D1b a cleaner base.
+
+The shared `test_discrepancy_fixes.py` file will need a mechanical merge when tracks converge, but since each fix modifies a different test class (already scaffolded with xfail markers), conflicts should be trivial to resolve.
+
+**At convergence**: when both tracks are done, rebase Track B onto Track A (or vice versa). The only conflict will be in `test_discrepancy_fixes.py` — resolve by keeping both sets of test class changes.
+
+---
+
 ## Test Infrastructure

 ### `tests/test_discrepancy_fixes.py` — New test file
diff --git a/delphi/docs/usage_examples.md b/delphi/docs/usage_examples.md
index 7552f78c7b..c845524f70 100644
--- a/delphi/docs/usage_examples.md
+++ b/delphi/docs/usage_examples.md
@@ -136,17 +136,8 @@ for group_id, comments in repness["group_repness"].items():

 ### Statistical Functions

-```python
-from polismath.pca_kmeans_rep.stats import prop_test, two_prop_test
-
-# Test proportion difference
-z_score = prop_test(70, 100) # 70 successes out of 100 trials
-print(f"Z-score: {z_score}")
-
-# Compare two proportions
-z_score = two_prop_test(70, 100, 50, 100) # 70/100 vs 50/100
-print(f"Comparison Z-score: {z_score}")
-```
+Statistical helpers (`z_score_sig_90`, `z_score_sig_95`, `prop_test`, `two_prop_test`)
+are defined inline in `repness.py` where they are used. See that module for details.
## Practical Examples diff --git a/delphi/polismath/pca_kmeans_rep/repness.py b/delphi/polismath/pca_kmeans_rep/repness.py index c9e4266346..cc869acdb5 100644 --- a/delphi/polismath/pca_kmeans_rep/repness.py +++ b/delphi/polismath/pca_kmeans_rep/repness.py @@ -44,7 +44,7 @@ def z_score_sig_90(z: float) -> bool: Returns: True if significant at 90% confidence """ - return abs(z) >= Z_90 + return z > Z_90 def z_score_sig_95(z: float) -> bool: @@ -57,7 +57,7 @@ def z_score_sig_95(z: float) -> bool: Returns: True if significant at 95% confidence """ - return abs(z) >= Z_95 + return z > Z_95 def prop_test(p: float, n: int, p0: float) -> float: @@ -688,10 +688,10 @@ def select_rep_comments_df(stats_df: pd.DataFrame, # Best agree: pa > pd and passes significance tests agree_candidates = stats_df[stats_df['pa'] > stats_df['pd']].copy() if not agree_candidates.empty: - # Check significance: |pat| >= Z_90 and |rat| >= Z_90 + # Check significance: pat > Z_90 and rat > Z_90 passing_agree = agree_candidates[ - (agree_candidates['pat'].abs() >= Z_90) & - (agree_candidates['rat'].abs() >= Z_90) & + (agree_candidates['pat'] > Z_90) & + (agree_candidates['rat'] > Z_90) & (agree_candidates['pa'] >= 0.5) ] if not passing_agree.empty: @@ -701,8 +701,8 @@ def select_rep_comments_df(stats_df: pd.DataFrame, disagree_candidates = stats_df[stats_df['pd'] > stats_df['pa']].copy() if not disagree_candidates.empty: passing_disagree = disagree_candidates[ - (disagree_candidates['pdt'].abs() >= Z_90) & - (disagree_candidates['rdt'].abs() >= Z_90) & + (disagree_candidates['pdt'] > Z_90) & + (disagree_candidates['rdt'] > Z_90) & (disagree_candidates['pd'] >= 0.5) ] if not passing_disagree.empty: diff --git a/delphi/polismath/pca_kmeans_rep/stats.py b/delphi/polismath/pca_kmeans_rep/stats.py deleted file mode 100644 index 430d697abe..0000000000 --- a/delphi/polismath/pca_kmeans_rep/stats.py +++ /dev/null @@ -1,338 +0,0 @@ -""" -Statistical functions for the Pol.is math module. 
- -This module provides statistical utilities used throughout the system, -particularly for measuring representativeness and significance. -""" - -import numpy as np -import math -from typing import Dict, List, Optional, Tuple, Union - - -def prop_test(success_count: int, total_count: int) -> float: - """ - Proportion test for a single proportion. - - Calculates a z-statistic for a single proportion using a pseudocount adjustment - to prevent division by zero. - - Args: - success_count: Number of successes - total_count: Total number of trials - - Returns: - Z-statistic for the proportion test - """ - # Add pseudocount to avoid division by zero - success_count_adj = success_count + 1 - total_count_adj = total_count + 2 - - # Calculate proportion - p_hat = success_count_adj / total_count_adj - - # Standard error - se = math.sqrt(p_hat * (1 - p_hat) / total_count_adj) - - # Return z-statistic - return (p_hat - 0.5) / se - - -def two_prop_test(success_count_1: int, total_count_1: int, - success_count_2: int, total_count_2: int) -> float: - """ - Two-proportion z-test. - - Compares proportions between two populations using pseudocounts for stability. 
- - Args: - success_count_1: Number of successes in first group - total_count_1: Total number of trials in first group - success_count_2: Number of successes in second group - total_count_2: Total number of trials in second group - - Returns: - Z-statistic for the two-proportion test - """ - # Add pseudocounts to avoid division by zero - success_count_1_adj = success_count_1 + 1 - total_count_1_adj = total_count_1 + 2 - success_count_2_adj = success_count_2 + 1 - total_count_2_adj = total_count_2 + 2 - - # Calculate proportions - p_hat_1 = success_count_1_adj / total_count_1_adj - p_hat_2 = success_count_2_adj / total_count_2_adj - - # Pooled proportion - pooled_p_hat = (success_count_1_adj + success_count_2_adj) / (total_count_1_adj + total_count_2_adj) - - # Handle edge case when pooled proportion is 1 - if pooled_p_hat >= 0.9999: - pooled_p_hat = 0.9999 - - # Standard error - se = math.sqrt(pooled_p_hat * (1 - pooled_p_hat) * - (1/total_count_1_adj + 1/total_count_2_adj)) - - # Return z-statistic - return (p_hat_1 - p_hat_2) / se - - -def z_sig_90(z: float) -> bool: - """ - Test significance at 90% confidence level. - - Args: - z: Z-statistic to test - - Returns: - True if significant at 90% confidence level - """ - return abs(z) > 1.2816 - - -def z_sig_95(z: float) -> bool: - """ - Test significance at 95% confidence level. - - Args: - z: Z-statistic to test - - Returns: - True if significant at 95% confidence level - """ - return abs(z) > 1.6449 - - -def shannon_entropy(p: np.ndarray) -> float: - """ - Calculate Shannon entropy for a probability distribution. - - Args: - p: Probability distribution - - Returns: - Shannon entropy value - """ - # Filter zeros to avoid log(0) - p = p[p > 0] - return -np.sum(p * np.log2(p)) - - -def gini_coefficient(values: np.ndarray) -> float: - """ - Calculate Gini coefficient as a measure of inequality. 
- - Args: - values: Array of values - - Returns: - Gini coefficient (0 = perfect equality, 1 = perfect inequality) - """ - # Handle edge cases - values = np.asarray(values) - n = len(values) - if n <= 1 or np.all(values == values[0]): - return 0.0 - - # Ensure all values are non-negative (Gini is typically for income/wealth) - if np.any(values < 0): - values = values - np.min(values) # Shift to non-negative - - # Handle zero sum case - if np.sum(values) == 0: - return 0.0 - - # Sort values (ascending) - sorted_values = np.sort(values) - - # Calculate cumulative proportion of population and values - cumulative_population = np.arange(1, n + 1) / n - cumulative_values = np.cumsum(sorted_values) / np.sum(sorted_values) - - # Calculate Gini coefficient using the area method - # Area between Lorenz curve and line of equality - return 1 - 2 * np.trapz(cumulative_values, cumulative_population) - - -def weighted_stddev(values: np.ndarray, weights: Optional[np.ndarray] = None) -> float: - """ - Calculate weighted standard deviation. - - Args: - values: Array of values - weights: Optional weights for values - - Returns: - Weighted standard deviation - """ - if weights is None: - return np.std(values) - - # Normalize weights - weights = weights / np.sum(weights) - - # Calculate weighted mean - weighted_mean = np.sum(values * weights) - - # Calculate weighted variance - weighted_variance = np.sum(weights * (values - weighted_mean)**2) - - # Return weighted standard deviation - return np.sqrt(weighted_variance) - - -def ci_95(values: np.ndarray) -> Tuple[float, float]: - """ - Calculate 95% confidence interval using Student's t-distribution. 
- - Args: - values: Array of values - - Returns: - Tuple of (lower bound, upper bound) - """ - n = len(values) - if n < 2: - return (0.0, 0.0) - - mean = np.mean(values) - stderr = np.std(values, ddof=1) / np.sqrt(n) - - # 95% CI using t-distribution - t_crit = 1.96 # Approximation for large samples - if n < 30: - from scipy import stats - t_crit = stats.t.ppf(0.975, n-1) - - lower = mean - t_crit * stderr - upper = mean + t_crit * stderr - - return (lower, upper) - - -def bayesian_ci_95(success_count: int, total_count: int) -> Tuple[float, float]: - """ - Calculate 95% Bayesian confidence interval for a proportion. - - Uses the Jeffreys prior for better behavior at extremes. - - Args: - success_count: Number of successes - total_count: Total number of trials - - Returns: - Tuple of (lower bound, upper bound) - """ - from scipy import stats - - # Jeffreys prior (Beta(0.5, 0.5)) - alpha = success_count + 0.5 - beta = total_count - success_count + 0.5 - - lower = stats.beta.ppf(0.025, alpha, beta) - upper = stats.beta.ppf(0.975, alpha, beta) - - return (lower, upper) - - -def bootstrap_ci_95(values: np.ndarray, - statistic: callable = np.mean, - n_bootstrap: int = 1000) -> Tuple[float, float]: - """ - Calculate 95% confidence interval using bootstrap resampling. 
- - Args: - values: Array of values - statistic: Function to compute the statistic of interest - n_bootstrap: Number of bootstrap samples - - Returns: - Tuple of (lower bound, upper bound) - """ - from scipy import stats - - n = len(values) - if n < 2: - return (0.0, 0.0) - - # Generate bootstrap samples - bootstrap_stats = [] - for _ in range(n_bootstrap): - # Sample with replacement - sample = np.random.choice(values, size=n, replace=True) - # Compute statistic - bootstrap_stats.append(statistic(sample)) - - # Calculate 95% confidence interval - lower = np.percentile(bootstrap_stats, 2.5) - upper = np.percentile(bootstrap_stats, 97.5) - - return (lower, upper) - - -def binomial_test(success_count: int, total_count: int, p: float = 0.5) -> float: - """ - Perform a binomial test for a proportion. - - Args: - success_count: Number of successes - total_count: Total number of trials - p: Expected proportion under null hypothesis - - Returns: - P-value for the test - """ - from scipy import stats - - if total_count == 0: - return 1.0 - - # In newer versions of scipy, binom_test was renamed to binomtest - # and its API was updated to return an object with a pvalue attribute - try: - # Try the new API first - result = stats.binomtest(success_count, total_count, p) - return result.pvalue - except AttributeError: - # Fall back to the old API if available - try: - return stats.binom_test(success_count, total_count, p) - except AttributeError: - # If neither is available, implement a simple approximation - import math - from scipy.special import comb - - # Calculate binomial PMF - def binom_pmf(k, n, p): - return comb(n, k) * (p ** k) * ((1 - p) ** (n - k)) - - # Calculate two-sided p-value - observed_pmf = binom_pmf(success_count, total_count, p) - p_value = 0.0 - - for k in range(total_count + 1): - k_pmf = binom_pmf(k, total_count, p) - if k_pmf <= observed_pmf: - p_value += k_pmf - - return min(p_value, 1.0) - - -def fisher_exact_test(count_matrix: np.ndarray) -> 
Tuple[float, float]: - """ - Perform Fisher's exact test on a 2x2 contingency table. - - Args: - count_matrix: 2x2 contingency table - - Returns: - Tuple of (odds ratio, p-value) - """ - from scipy import stats - - if count_matrix.shape != (2, 2): - raise ValueError("Count matrix must be 2x2") - - return stats.fisher_exact(count_matrix) \ No newline at end of file diff --git a/delphi/tests/test_discrepancy_fixes.py b/delphi/tests/test_discrepancy_fixes.py index 9652d7dca9..e79b767a8f 100644 --- a/delphi/tests/test_discrepancy_fixes.py +++ b/delphi/tests/test_discrepancy_fixes.py @@ -37,6 +37,8 @@ PSEUDO_COUNT, Z_90, Z_95, + z_score_sig_90, + z_score_sig_95, prop_test, two_prop_test, repness_metric, @@ -538,11 +540,11 @@ def test_pa_values_match_clojure(self, conv, clojure_blob, dataset_name): @pytest.mark.clojure_comparison class TestD9ZScoreThresholds: """ - D9: Python uses two-tailed z-scores (Z_90=1.645, Z_95=1.96) - Clojure uses one-tailed z-scores (Z_90=1.2816, Z_95=1.6449) + D9: Z-score thresholds and semantics must match Clojure stats.clj: + (defn z-sig-90? [z-val] (> z-val 1.2816)) + (defn z-sig-95? [z-val] (> z-val 1.6449)) - Python's higher thresholds mean fewer comments pass significance, - leading to empty comment_repness. + Three aspects: correct values, strict > (not >=), one-tailed (no abs). """ @pytest.mark.xfail(reason="D9: Z_90=1.645 (two-tailed), target is 1.2816 (one-tailed)") @@ -557,6 +559,20 @@ def test_z95_matches_clojure(self): check.almost_equal(Z_95, 1.6449, abs=0.001, msg=f"Z_95 should be 1.6449 (one-tailed), got {Z_95}") + def test_z_sig_strict_greater_than(self): + """Clojure uses strict >, not >=. Boundary values must NOT pass.""" + check.is_false(z_score_sig_90(Z_90), + "z_score_sig_90(Z_90) should be False (strict >, not >=)") + check.is_false(z_score_sig_95(Z_95), + "z_score_sig_95(Z_95) should be False (strict >, not >=)") + + def test_z_sig_one_tailed(self): + """Clojure uses (> z-val threshold), no abs(). 
Negative values must NOT pass.""" + check.is_false(z_score_sig_90(-2.0), + "z_score_sig_90(-2.0) should be False (one-tailed, no abs)") + check.is_false(z_score_sig_95(-2.0), + "z_score_sig_95(-2.0) should be False (one-tailed, no abs)") + def test_repness_not_empty(self, conv, dataset_name): """Repness should produce non-empty comment_repness with correct thresholds.""" repness = conv.repness @@ -567,6 +583,104 @@ def test_repness_not_empty(self, conv, dataset_name): check.greater(len(repness['comment_repness']), 0, "comment_repness should not be empty") + @pytest.mark.xfail(reason="D5/D6: z-values differ → different significance decisions → different sets") + def test_significance_sets_match_clojure(self, conv, clojure_blob, dataset_name): + """Post-significance-filtering comment sets should match Clojure per group. + + Both sides apply z-sig-90? to their z-values and select top comments. + With D9 the gate semantics match (>, no abs), but the z-values + themselves differ until D5 (prop test) and D6 (two-prop test) are fixed. 
+ """ + clojure_repness = clojure_blob.get('repness', {}) + if not clojure_repness: + pytest.skip("No repness in Clojure blob") + + python_repness = (conv.repness or {}).get('group_repness', {}) + + mismatches = [] + for gid_str, clj_entries in clojure_repness.items(): + gid = int(gid_str) + clj_tids = set(e['tid'] for e in clj_entries) + py_entries = python_repness.get(gid, []) + py_tids = set(int(e['comment_id']) for e in py_entries) + + if clj_tids != py_tids: + only_clj = sorted(clj_tids - py_tids) + only_py = sorted(py_tids - clj_tids) + mismatches.append(f" g{gid}: clj_only={only_clj}, py_only={only_py}") + + if mismatches: + print(f"[{dataset_name}] {len(mismatches)} groups with different rep comment sets:") + for m in mismatches: + print(m) + + check.equal(len(mismatches), 0, + f"{len(mismatches)} groups differ in selected rep comments") + + @pytest.mark.xfail(reason="D5/D6/D10: different z-values and selection → no shared comments to compare") + def test_z_values_match_clojure(self, conv, clojure_blob, dataset_name): + """Z-score values for shared rep comments should match Clojure. + + For comments in BOTH Clojure and Python selections, compare: + - p-test (Clojure) vs pat (Python) — proportion test z-score + - repness-test (Clojure) vs rat (Python) — two-proportion test z-score + + Requires D10 (same comment selection) so there ARE shared comments, + then D5/D6 so the values match. 
+        """
+        clojure_repness = clojure_blob.get('repness', {})
+        if not clojure_repness:
+            pytest.skip("No repness in Clojure blob")
+
+        python_repness = (conv.repness or {}).get('group_repness', {})
+
+        shared_count = 0
+        pat_mismatches = []
+        rat_mismatches = []
+
+        for gid_str, clj_entries in clojure_repness.items():
+            gid = int(gid_str)
+            py_entries = python_repness.get(gid, [])
+            py_by_tid = {int(e['comment_id']): e for e in py_entries}
+
+            for clj_entry in clj_entries:
+                tid = clj_entry['tid']
+                py_entry = py_by_tid.get(tid)
+                if py_entry is None:
+                    continue
+
+                shared_count += 1
+
+                # Compare p-test vs pat (proportion test z-score)
+                clj_pat = clj_entry.get('p-test', 0)
+                py_pat = py_entry.get('pat', 0)
+                if abs(clj_pat - py_pat) > 0.01:
+                    pat_mismatches.append(
+                        f"  g{gid}/t{tid}: clj p-test={clj_pat:.4f}, py pat={py_pat:.4f}")
+
+                # Compare repness-test vs rat (two-proportion test z-score)
+                clj_rat = clj_entry.get('repness-test', 0)
+                py_rat = py_entry.get('rat', 0)
+                if abs(clj_rat - py_rat) > 0.01:
+                    rat_mismatches.append(
+                        f"  g{gid}/t{tid}: clj repness-test={clj_rat:.4f}, py rat={py_rat:.4f}")
+
+        if pat_mismatches:
+            print(f"[{dataset_name}] {len(pat_mismatches)} pat/p-test mismatches:")
+            for m in pat_mismatches[:10]:
+                print(m)
+        if rat_mismatches:
+            print(f"[{dataset_name}] {len(rat_mismatches)} rat/repness-test mismatches:")
+            for m in rat_mismatches[:10]:
+                print(m)
+
+        check.greater(shared_count, 0,
+                      "No shared comments to compare — selection sets are disjoint (D10)")
+        check.equal(len(pat_mismatches), 0,
+                    f"{len(pat_mismatches)} p-test/pat mismatches (D5)")
+        check.equal(len(rat_mismatches), 0,
+                    f"{len(rat_mismatches)} repness-test/rat mismatches (D6)")
+

 # ============================================================================
 # D5 — Proportion Test Formula
@@ -627,6 +741,40 @@ def test_clojure_pat_values_consistent_with_formula(self, clojure_blob, dataset_
         print(f"[{dataset_name}] pat consistency: {total - mismatches}/{total} match formula (max_diff={max_diff:.4f})")
 
         check.equal(mismatches, 0,
                     f"Clojure p-test values don't match formula for {mismatches}/{total}")
 
+    @pytest.mark.xfail(reason="D5/D10: prop test formula differs + no shared comments")
+    def test_pat_values_match_clojure_blob(self, conv, clojure_blob, dataset_name):
+        """p-test (Clojure) vs pat (Python) for shared rep comments."""
+        clojure_repness = clojure_blob.get('repness', {})
+        if not clojure_repness:
+            pytest.skip("No repness in Clojure blob")
+
+        python_repness = (conv.repness or {}).get('group_repness', {})
+        shared_count = 0
+        mismatches = []
+
+        for gid_str, clj_entries in clojure_repness.items():
+            gid = int(gid_str)
+            py_by_tid = {int(e['comment_id']): e for e in python_repness.get(gid, [])}
+            for clj_entry in clj_entries:
+                tid = clj_entry['tid']
+                py_entry = py_by_tid.get(tid)
+                if py_entry is None:
+                    continue
+                shared_count += 1
+                clj_val = clj_entry.get('p-test', 0)
+                py_val = py_entry.get('pat', 0)
+                if abs(clj_val - py_val) > 0.01:
+                    mismatches.append(
+                        f"  g{gid}/t{tid}: clj={clj_val:.4f}, py={py_val:.4f}")
+
+        if mismatches:
+            print(f"[{dataset_name}] {len(mismatches)} p-test/pat mismatches:")
+            for m in mismatches[:10]:
+                print(m)
+        check.greater(shared_count, 0,
+                      "No shared comments to compare (D10)")
+        check.equal(len(mismatches), 0, f"{len(mismatches)} p-test/pat mismatches")
+

 # ============================================================================
 # D6 — Two-Proportion Test Adjustment
@@ -661,6 +809,44 @@ def test_two_prop_test_with_pseudocounts(self):
         check.almost_equal(python_result, expected, abs=0.01,
                           msg=f"two_prop_test should include pseudocounts: Python={python_result:.4f}, expected={expected:.4f}")
 
+    @pytest.mark.xfail(reason="D6/D10: two-prop test differs + no shared comments to compare")
+    def test_rat_values_match_clojure_blob(self, conv, clojure_blob, dataset_name):
+        """repness-test (Clojure) vs rat (Python) for shared rep comments.
+
+        Unlike D5's pat test, rat (two-proportion test) needs group-vs-others
+        counts which aren't in the blob, so we compare for shared comments only.
+        """
+        clojure_repness = clojure_blob.get('repness', {})
+        if not clojure_repness:
+            pytest.skip("No repness in Clojure blob")
+
+        python_repness = (conv.repness or {}).get('group_repness', {})
+        shared_count = 0
+        mismatches = []
+
+        for gid_str, clj_entries in clojure_repness.items():
+            gid = int(gid_str)
+            py_by_tid = {int(e['comment_id']): e for e in python_repness.get(gid, [])}
+            for clj_entry in clj_entries:
+                tid = clj_entry['tid']
+                py_entry = py_by_tid.get(tid)
+                if py_entry is None:
+                    continue
+                shared_count += 1
+                clj_val = clj_entry.get('repness-test', 0)
+                py_val = py_entry.get('rat', 0)
+                if abs(clj_val - py_val) > 0.01:
+                    mismatches.append(
+                        f"  g{gid}/t{tid}: clj={clj_val:.4f}, py={py_val:.4f}")
+
+        if mismatches:
+            print(f"[{dataset_name}] {len(mismatches)} repness-test/rat mismatches:")
+            for m in mismatches[:10]:
+                print(m)
+        check.greater(shared_count, 0,
+                      "No shared comments to compare — selection sets are disjoint (D10)")
+        check.equal(len(mismatches), 0, f"{len(mismatches)} repness-test/rat mismatches")
+

 # ============================================================================
 # D7 — Repness Metric Formula
@@ -695,6 +881,40 @@ def test_metric_formula_is_product(self):
         check.almost_equal(result, expected_agree, abs=0.01,
                           msg=f"agree_metric should be ra*rat*pa*pat={expected_agree:.4f}, got {result:.4f}")
 
+    @pytest.mark.xfail(reason="D7/D10: metric formula differs + no shared comments")
+    def test_repness_metric_matches_clojure_blob(self, conv, clojure_blob, dataset_name):
+        """repness (Clojure) vs agree/disagree_metric (Python) for shared comments."""
+        clojure_repness = clojure_blob.get('repness', {})
+        if not clojure_repness:
+            pytest.skip("No repness in Clojure blob")
+
+        python_repness = (conv.repness or {}).get('group_repness', {})
+        shared_count = 0
+        mismatches = []
+
+        for gid_str, clj_entries in clojure_repness.items():
+            gid = int(gid_str)
+            py_by_tid = {int(e['comment_id']): e for e in python_repness.get(gid, [])}
+            for clj_entry in clj_entries:
+                tid = clj_entry['tid']
+                py_entry = py_by_tid.get(tid)
+                if py_entry is None:
+                    continue
+                shared_count += 1
+                clj_val = clj_entry.get('repness', 0)
+                py_val = py_entry.get('agree_metric', 0) if py_entry.get('repful') == 'agree' else py_entry.get('disagree_metric', 0)
+                if abs(clj_val - py_val) > 0.01:
+                    mismatches.append(
+                        f"  g{gid}/t{tid}: clj={clj_val:.4f}, py={py_val:.4f}")
+
+        if mismatches:
+            print(f"[{dataset_name}] {len(mismatches)} repness metric mismatches:")
+            for m in mismatches[:10]:
+                print(m)
+        check.greater(shared_count, 0,
+                      "No shared comments to compare (D10)")
+        check.equal(len(mismatches), 0, f"{len(mismatches)} repness metric mismatches")
+

 # ============================================================================
 # D8 — Finalize Comment Stats Logic
@@ -727,6 +947,40 @@ def test_repful_uses_rat_vs_rdt(self):
         check.equal(result['repful'], 'disagree',
                     f"repful should be 'disagree' when rat < rdt, got '{result['repful']}'")
 
+    @pytest.mark.xfail(reason="D8/D10: repful logic differs + no shared comments")
+    def test_repful_matches_clojure_blob(self, conv, clojure_blob, dataset_name):
+        """repful-for (Clojure) vs repful (Python) for shared rep comments."""
+        clojure_repness = clojure_blob.get('repness', {})
+        if not clojure_repness:
+            pytest.skip("No repness in Clojure blob")
+
+        python_repness = (conv.repness or {}).get('group_repness', {})
+        shared_count = 0
+        mismatches = []
+
+        for gid_str, clj_entries in clojure_repness.items():
+            gid = int(gid_str)
+            py_by_tid = {int(e['comment_id']): e for e in python_repness.get(gid, [])}
+            for clj_entry in clj_entries:
+                tid = clj_entry['tid']
+                py_entry = py_by_tid.get(tid)
+                if py_entry is None:
+                    continue
+                shared_count += 1
+                clj_val = clj_entry.get('repful-for', '')
+                py_val = py_entry.get('repful', '')
+                if clj_val != py_val:
+                    mismatches.append(
+                        f"  g{gid}/t{tid}: clj={clj_val}, py={py_val}")
+
+        if mismatches:
+            print(f"[{dataset_name}] {len(mismatches)} repful mismatches:")
+            for m in mismatches[:10]:
+                print(m)
+        check.greater(shared_count, 0,
+                      "No shared comments to compare (D10)")
+        check.equal(len(mismatches), 0, f"{len(mismatches)} repful-for/repful mismatches")
+

 # ============================================================================
 # D10 — Representative Comment Selection
@@ -757,7 +1011,7 @@ def test_rep_comments_match_clojure(self, conv, clojure_blob, dataset_name):
             clj_tids = set(e['tid'] for e in clj_entries)
             py_entries = group_repness.get(gid, [])
-            py_tids = set(e.get('comment_id') for e in py_entries)
+            py_tids = set(int(e['comment_id']) for e in py_entries)
 
             total_groups += 1
             overlap = len(clj_tids & py_tids)
@@ -796,7 +1050,7 @@ def test_consensus_matches_clojure(self, conv, clojure_blob, dataset_name):
         clj_all = clj_agree_tids | clj_disagree_tids
 
         py_consensus = conv.repness.get('consensus_comments', []) if conv.repness else []
-        py_tids = set(c.get('comment_id') for c in py_consensus)
+        py_tids = set(int(c['comment_id']) for c in py_consensus)
 
         print(f"[{dataset_name}] Consensus: Clojure agree={sorted(clj_agree_tids)}, disagree={sorted(clj_disagree_tids)}")
         print(f"[{dataset_name}] Consensus: Python={sorted(py_tids)}")
diff --git a/delphi/tests/test_old_format_repness.py b/delphi/tests/test_old_format_repness.py
index ae24226d18..514ddf9e8a 100644
--- a/delphi/tests/test_old_format_repness.py
+++ b/delphi/tests/test_old_format_repness.py
@@ -28,17 +28,19 @@ class TestStatisticalFunctions:
 
     def test_z_score_significance(self):
         """Test z-score significance checks."""
-        # 90% confidence
-        assert z_score_sig_90(1.645)
+        # 90% confidence — one-tailed, strict >, matching Clojure
         assert z_score_sig_90(2.0)
-        assert z_score_sig_90(-1.645)
+        assert not z_score_sig_90(1.2816)  # boundary: not significant (strict >)
+        assert not z_score_sig_90(-1.2816)  # negative: not significant (one-tailed)
         assert not z_score_sig_90(1.0)
+        assert not z_score_sig_90(1.28)
 
-        # 95% confidence
-        assert z_score_sig_95(1.96)
+        # 95% confidence — one-tailed, strict >, matching Clojure
         assert z_score_sig_95(2.5)
-        assert z_score_sig_95(-1.96)
+        assert not z_score_sig_95(1.6449)  # boundary: not significant (strict >)
+        assert not z_score_sig_95(-1.6449)  # negative: not significant (one-tailed)
         assert not z_score_sig_95(1.5)
+        assert not z_score_sig_95(1.64)
 
     def test_prop_test(self):
         """Test one-proportion z-test."""
diff --git a/delphi/tests/test_repness_unit.py b/delphi/tests/test_repness_unit.py
index b061ccb88c..dfdcad8aaf 100644
--- a/delphi/tests/test_repness_unit.py
+++ b/delphi/tests/test_repness_unit.py
@@ -29,17 +29,19 @@ class TestStatisticalFunctions:
 
     def test_z_score_significance(self):
         """Test z-score significance checks."""
-        # 90% confidence
-        assert z_score_sig_90(1.645)
+        # 90% confidence — one-tailed, strict >, matching Clojure
         assert z_score_sig_90(2.0)
-        assert z_score_sig_90(-1.645)
+        assert not z_score_sig_90(1.2816)  # boundary: not significant (strict >)
+        assert not z_score_sig_90(-1.2816)  # negative: not significant (one-tailed)
         assert not z_score_sig_90(1.0)
-
-        # 95% confidence
-        assert z_score_sig_95(1.96)
+        assert not z_score_sig_90(1.28)
+
+        # 95% confidence — one-tailed, strict >, matching Clojure
         assert z_score_sig_95(2.5)
-        assert z_score_sig_95(-1.96)
+        assert not z_score_sig_95(1.6449)  # boundary: not significant (strict >)
+        assert not z_score_sig_95(-1.6449)  # negative: not significant (one-tailed)
         assert not z_score_sig_95(1.5)
+        assert not z_score_sig_95(1.64)
 
     def test_prop_test(self):
         """Test one-proportion z-test."""
diff --git a/delphi/tests/test_stats.py b/delphi/tests/test_stats.py
deleted file mode 100644
index 6b33f555e9..0000000000
--- a/delphi/tests/test_stats.py
+++ /dev/null
@@ -1,321 +0,0 @@
-"""
-Tests for the statistics module.
-"""
-
-import pytest
-import numpy as np
-import sys
-import os
-import math
-from scipy import stats as scipy_stats
-
-# Add the parent directory to the path to import the module
-sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-
-from polismath.pca_kmeans_rep.stats import (
-    prop_test, two_prop_test, z_sig_90, z_sig_95,
-    shannon_entropy, gini_coefficient, weighted_stddev,
-    ci_95, bayesian_ci_95, bootstrap_ci_95, binomial_test,
-    fisher_exact_test
-)
-
-
-class TestProportionTests:
-    """Tests for proportion test functions."""
-
-    def test_prop_test(self):
-        """Test the proportion test function."""
-        # Test with different proportions
-        z1 = prop_test(80, 100)  # 80% success
-        z2 = prop_test(50, 100)  # 50% success
-        z3 = prop_test(20, 100)  # 20% success
-
-        # Higher proportion should yield positive z-scores
-        assert z1 > 0
-        # 50% proportion should yield z-score close to 0
-        assert abs(z2) < 0.5
-        # Lower proportion should yield negative z-scores
-        assert z3 < 0
-
-        # Test with extreme values
-        assert prop_test(0, 0) != float('inf')  # Should handle edge cases
-        assert prop_test(1, 1) != float('inf')
-
-    def test_two_prop_test(self):
-        """Test the two-proportion test function."""
-        # Test with different proportions
-        z1 = two_prop_test(80, 100, 50, 100)  # 80% vs 50%
-        z2 = two_prop_test(50, 100, 50, 100)  # 50% vs 50%
-        z3 = two_prop_test(20, 100, 50, 100)  # 20% vs 50%
-
-        # First proportion higher should yield positive z-scores
-        assert z1 > 0
-        # Equal proportions should yield z-score close to 0
-        assert abs(z2) < 0.5
-        # First proportion lower should yield negative z-scores
-        assert z3 < 0
-
-        # Test with extreme values
-        assert two_prop_test(0, 0, 50, 100) != float('inf')  # Should handle edge cases
-        assert two_prop_test(100, 100, 100, 100) != float('inf')
-
-    def test_significance_functions(self):
-        """Test the significance testing functions."""
-        # 90% confidence level (z > 1.2816)
-        assert z_sig_90(1.3)
-        assert z_sig_90(-1.3)
-        assert not z_sig_90(1.0)
-        assert not z_sig_90(-1.0)
-
-        # 95% confidence level (z > 1.6449)
-        assert z_sig_95(1.7)
-        assert z_sig_95(-1.7)
-        assert not z_sig_95(1.5)
-        assert not z_sig_95(-1.5)
-
-    def test_prop_test_vs_scipy(self):
-        """Compare our prop_test with scipy's version."""
-        # Calculate using our function
-        our_z = prop_test(70, 100)
-
-        # Calculate using scipy
-        from scipy import stats as scipy_stats
-        # Scipy doesn't apply pseudocounts, so results will differ
-        # We add pseudocounts for comparison
-        p_hat = (70 + 1) / (100 + 2)
-        scipy_z = (p_hat - 0.5) / math.sqrt(p_hat * (1 - p_hat) / (100 + 2))
-
-        # Should be close
-        assert abs(our_z - scipy_z) < 0.01
-
-    def test_two_prop_test_vs_scipy(self):
-        """Compare our two_prop_test with scipy's version."""
-        # Calculate using our function
-        our_z = two_prop_test(70, 100, 50, 100)
-
-        # Calculate using scipy
-        from scipy import stats as scipy_stats
-        # Scipy doesn't apply pseudocounts, so results will differ
-        # We add pseudocounts for comparison
-        p1 = (70 + 1) / (100 + 2)
-        p2 = (50 + 1) / (100 + 2)
-        pooled_p = ((70 + 1) + (50 + 1)) / ((100 + 2) + (100 + 2))
-        scipy_z = (p1 - p2) / math.sqrt(pooled_p * (1 - pooled_p) * (1/(100+2) + 1/(100+2)))
-
-        # Should be close
-        assert abs(our_z - scipy_z) < 0.01
-
-
-class TestInformationTheory:
-    """Tests for information theory functions."""
-
-    def test_shannon_entropy(self):
-        """Test Shannon entropy calculation."""
-        # Uniform distribution has maximum entropy
-        uniform = np.array([0.25, 0.25, 0.25, 0.25])
-        max_entropy = shannon_entropy(uniform)
-        assert np.isclose(max_entropy, 2.0)  # log2(4) = 2
-
-        # Non-uniform distribution has lower entropy
-        non_uniform = np.array([0.5, 0.25, 0.125, 0.125])
-        lower_entropy = shannon_entropy(non_uniform)
-        assert lower_entropy < max_entropy
-
-        # Distribution with certainty has zero entropy
-        certain = np.array([1.0, 0.0, 0.0, 0.0])
-        zero_entropy = shannon_entropy(certain)
-        assert np.isclose(zero_entropy, 0.0)
-
-    def test_gini_coefficient(self):
-        """Test Gini coefficient calculation."""
-        # Perfect equality has Gini = 0
-        equal = np.array([10, 10, 10, 10])
-        assert np.isclose(gini_coefficient(equal), 0.0)
-
-        # Perfect inequality has Gini = 1 - 1/n
-        unequal = np.array([0, 0, 0, 10])
-        expected_gini = 1 - 1/4
-        assert np.isclose(gini_coefficient(unequal), expected_gini, atol=0.01)
-
-        # Some inequality
-        partial = np.array([5, 10, 15, 20])
-        gini = gini_coefficient(partial)
-        assert 0 < gini < 1
-
-
-class TestDescriptiveStatistics:
-    """Tests for descriptive statistics functions."""
-
-    def test_weighted_stddev(self):
-        """Test weighted standard deviation calculation."""
-        # Test against numpy's unweighted version
-        values = np.array([1, 2, 3, 4, 5])
-
-        # Unweighted
-        std_unweighted = weighted_stddev(values)
-        assert np.isclose(std_unweighted, np.std(values))
-
-        # Weighted with equal weights (should be same as unweighted)
-        weights = np.array([1, 1, 1, 1, 1])
-        std_weighted_equal = weighted_stddev(values, weights)
-        assert np.isclose(std_weighted_equal, np.std(values))
-
-        # Weighted with different weights
-        weights = np.array([5, 1, 1, 1, 1])  # More weight on first value
-        std_weighted = weighted_stddev(values, weights)
-
-        # Manually calculate weighted standard deviation
-        normalized_weights = weights / np.sum(weights)
-        weighted_mean = np.sum(values * normalized_weights)
-        weighted_variance = np.sum(normalized_weights * (values - weighted_mean)**2)
-        manual_weighted_std = np.sqrt(weighted_variance)
-
-        assert np.isclose(std_weighted, manual_weighted_std)
-
-
-class TestConfidenceIntervals:
-    """Tests for confidence interval functions."""
-
-    def test_ci_95(self):
-        """Test 95% confidence interval calculation."""
-        # Generate normally distributed data
-        np.random.seed(42)
-        values = np.random.normal(100, 15, 1000)
-
-        # Calculate 95% CI
-        lower, upper = ci_95(values)
-
-        # Mean should be within the interval
-        mean = np.mean(values)
-        assert lower <= mean <= upper
-
-        # For large samples, CI width should be about 3.92 * standard error
-        stderr = np.std(values, ddof=1) / np.sqrt(len(values))
-        expected_width = 3.92 * stderr
-        actual_width = upper - lower
-        assert np.isclose(actual_width, expected_width, rtol=0.1)
-
-        # Test with small sample
-        small_values = values[:10]
-        lower_small, upper_small = ci_95(small_values)
-
-        # Small sample CI should be wider than large sample CI
-        small_width = upper_small - lower_small
-        assert small_width > actual_width
-
-    def test_bayesian_ci_95(self):
-        """Test Bayesian 95% confidence interval for proportions."""
-        # Test with different proportions
-        lower1, upper1 = bayesian_ci_95(80, 100)  # 80% success
-        lower2, upper2 = bayesian_ci_95(50, 100)  # 50% success
-
-        # Intervals should contain the point estimates
-        assert lower1 <= 0.8 <= upper1
-        assert lower2 <= 0.5 <= upper2
-
-        # Higher proportion should have narrower interval (due to binomial variance)
-        width1 = upper1 - lower1
-        width2 = upper2 - lower2
-        assert width1 < width2
-
-        # Test with small sample
-        lower3, upper3 = bayesian_ci_95(8, 10)  # 80% success but small sample
-        width3 = upper3 - lower3
-
-        # Small sample should have wider interval
-        assert width3 > width1
-
-    def test_bootstrap_ci_95(self):
-        """Test bootstrap 95% confidence interval."""
-        # Generate non-normal data
-        np.random.seed(42)
-        values = np.concatenate([
-            np.random.normal(100, 10, 900),  # Normal part
-            np.random.normal(150, 20, 100)   # Outliers
-        ])
-
-        # Calculate bootstrap CI for mean
-        lower, upper = bootstrap_ci_95(values)
-
-        # Mean should be within the interval
-        mean = np.mean(values)
-        assert lower <= mean <= upper
-
-        # Bootstrap CI for different statistics
-        median_lower, median_upper = bootstrap_ci_95(values, np.median)
-
-        # Median should be within its interval
-        median = np.median(values)
-        assert median_lower <= median <= median_upper
-
-
-class TestStatisticalTests:
-    """Tests for statistical test functions."""
-
-    def test_binomial_test(self):
-        """Test binomial test calculation."""
-        # Test against scipy's implementation
-        p1 = binomial_test(70, 100, 0.5)
-
-        # Use the newer SciPy API if available
-        try:
-            p2 = scipy_stats.binomtest(70, 100, 0.5).pvalue
-        except AttributeError:
-            # Fall back to older API if necessary
-            try:
-                p2 = scipy_stats.binom_test(70, 100, 0.5)
-            except AttributeError:
-                # If neither is available, skip the test
-                pytest.skip("SciPy binomial test functions not available")
-
-        assert np.isclose(p1, p2)
-
-        # Test with different expected proportions
-        p3 = binomial_test(70, 100, 0.7)
-
-        # Use the newer SciPy API if available
-        try:
-            p4 = scipy_stats.binomtest(70, 100, 0.7).pvalue
-        except AttributeError:
-            # Fall back to older API if necessary
-            try:
-                p4 = scipy_stats.binom_test(70, 100, 0.7)
-            except AttributeError:
-                # If neither is available, skip the test
-                return
-
-        assert np.isclose(p3, p4)
-
-        # Test significance
-        p5 = binomial_test(90, 100, 0.5)  # Very unlikely with p=0.5
-        assert p5 < 0.001
-
-    def test_fisher_exact_test(self):
-        """Test Fisher's exact test."""
-        # Create a 2x2 contingency table
-        table = np.array([
-            [12, 5],
-            [7, 25]
-        ])
-
-        # Calculate using our function
-        odds_ratio, p_value = fisher_exact_test(table)
-
-        # Calculate using scipy
-        scipy_odds, scipy_p = scipy_stats.fisher_exact(table)
-
-        # Results should match
-        assert np.isclose(odds_ratio, scipy_odds)
-        assert np.isclose(p_value, scipy_p)
-
-        # Test for significance
-        assert p_value < 0.05  # This table should show significance
-
-        # Test with a non-significant table
-        balanced_table = np.array([
-            [10, 10],
-            [10, 10]
-        ])
-
-        _, p_value2 = fisher_exact_test(balanced_table)
-        assert p_value2 > 0.05  # This table should not show significance
\ No newline at end of file
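For reviewers reading this diff without the stats module open: a minimal sketch of the statistical helpers the updated tests exercise, reconstructed from the assertions above. Names, signatures, thresholds, and the pseudocount scheme are inferred from the tests in this diff, not copied from the production implementation.

```python
import math

# Significance checks: one-tailed, strict '>', matching the Clojure semantics
# the updated tests assert (1.2816 and 1.6449 are the one-tailed 90%/95% cutoffs).
def z_score_sig_90(z: float) -> bool:
    return z > 1.2816

def z_score_sig_95(z: float) -> bool:
    return z > 1.6449

# One-proportion z-test with +1/+2 pseudocounts, as in the deleted
# test_prop_test_vs_scipy comparison.
def prop_test(succ: int, n: int) -> float:
    p_hat = (succ + 1) / (n + 2)
    return (p_hat - 0.5) / math.sqrt(p_hat * (1 - p_hat) / (n + 2))

# Two-proportion z-test with pseudocounts and a pooled variance estimate,
# as in the deleted test_two_prop_test_vs_scipy comparison.
def two_prop_test(s1: int, n1: int, s2: int, n2: int) -> float:
    p1 = (s1 + 1) / (n1 + 2)
    p2 = (s2 + 1) / (n2 + 2)
    pooled = ((s1 + 1) + (s2 + 1)) / ((n1 + 2) + (n2 + 2))
    se = math.sqrt(pooled * (1 - pooled) * (1 / (n1 + 2) + 1 / (n2 + 2)))
    return (p1 - p2) / se
```

Under these definitions the boundary assertions in the diff hold: `z_score_sig_90(1.2816)` is false because the comparison is strict, and the pseudocounts keep `prop_test(0, 0)` finite since the denominator never reaches zero.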