153 changes: 153 additions & 0 deletions delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -0,0 +1,153 @@
# Journal: Fixing Python-Clojure Discrepancies

This is the ongoing tracking document for the TDD fix process described in
`PLAN_DISCREPANCY_FIXES.md`. It is the single source of truth for our working
state; commit messages and PR descriptions are written for reviewers.

---

## Initial Baseline (2026-02-25)

### Branch: `series-of-fixes` (forked from `origin/kmeans_analysis_docs`)

### Test Results

**`test_legacy_clojure_regression.py`** (2 datasets: vw, biodiversity):
- 2 passed (`test_pca_components_match_clojure` × 2)
- 6 xfailed:
- `test_basic_outputs` × 2 — D9/D5/D7: empty `comment_repness`
- `test_group_clustering` × 2 — D2/D3: wrong participant threshold, missing k-smoother
- `test_comment_priorities` × 2 — D12: not implemented

**Full test suite** (`tests/` except `test_batch_id.py`):
- 185 passed, 11 failed, 3 skipped, 6 xfailed
- Pre-existing failures (not caused by this work, inherited from stacked PRs):
- `test_clusters.py::test_init_clusters` — `init_clusters()` doesn't populate members when k > n_points
- `test_conversation.py::test_recompute` — clustering threshold (7.2) filters out all 20 participants in synthetic data
- `test_conversation.py::test_data_persistence` — same threshold issue
- `test_datasets.py` × 4 — DatasetInfo API changed (added `has_cold_start_blob`), tests use old 8-arg constructor
- `test_edge_cases.py::test_insufficient_data_for_pca` — repness returns empty dict for no-group case
- `test_repness_smoke.py` × 2 — repness empty due to D9/D5/D7 (the very discrepancies we're fixing)
- **Note**: CI runs on `main` which has different code — these failures are specific to this stacked branch.
4 of them were already known (documented in MEMORY.md under "Known pre-existing test failures in #2393").

### Datasets Available

| Dataset | Votes | Has cold-start Clojure blob? |
|---------|------:|-----|
| vw | 4,684 | Yes |
| biodiversity | 29,803 | Yes |
| *(5 private datasets)* | 91K–1M | **No** (need prodclone DB) |

### Clojure Blob Structure (vw example)

Repness entry keys: `tid`, `n-agree`, `p-test`, `repness-test`, `n-success`,
`repful-for`, `n-trials`, `repness`, `best-agree`, `p-success`

Consensus structure: `{agree: [{tid, n-success, n-trials, p-success, p-test}], disagree: [...]}`

Comment priorities: `{tid: priority_value}` — 125 entries for vw

In-conv: list of 67 participant IDs (vw)
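When comparing blobs by hand, the key set above is easy to assert directly. A minimal sketch (the helper name is ours, not part of the codebase):

```python
# Keys observed in each repness entry of the vw Clojure blob (see above).
REPNESS_KEYS = {
    "tid", "n-agree", "p-test", "repness-test", "n-success",
    "repful-for", "n-trials", "repness", "best-agree", "p-success",
}

def has_expected_repness_keys(entry: dict) -> bool:
    """Return True if a repness entry carries exactly the documented keys."""
    return set(entry) == REPNESS_KEYS
```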

---

## PR 0: Test Infrastructure (complete)

### What was done
- Created `tests/test_discrepancy_fixes.py` with per-discrepancy test classes
- Created this journal
- Documented initial baseline

### Test results for `test_discrepancy_fixes.py`

After rebase onto updated `origin/kmeans_analysis_docs`:

```
7 passed, 19 skipped, 39 xfailed, 10 xpassed (with --include-local, 7 datasets)
```

- **7 passed**: Clojure formula sanity checks (prop_test, repness metric product, repful rat>rdt) + Clojure blob consistency checks (pat values)
- **19 skipped**: D15 moderation (no moderated comments), incomplete Clojure blobs, engage duplicate files
- **39 xfailed**: Discrepancy tests correctly fail (D2-D12 constants, formulas, and real-data comparisons)
- **10 xpassed** (all `strict=False`, so green):
- D2 in-conv × 2 on vw — small dataset where old/new thresholds coincide
- D6 two_prop_test × 1 — pseudocount difference too small to matter for this test case
- D9 repness_not_empty × 7 on all datasets — `comment_repness` list is populated (all
(group, comment) pairs) even with wrong thresholds; only `group_repness` selection is
affected. **TODO**: tighten this test when fixing D9 to check correct *number* of
representative comments, not just non-emptiness
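The D6 behaviour can be illustrated as a pooled two-proportion z-test with +1 added to every count. This sketch shows why a single pseudocount barely moves the statistic on moderate counts; the exact pseudocount placement in the Clojure code may differ, so treat the formula as an assumption for illustration only:

```python
import math

def two_prop_test(succ_in, succ_out, n_in, n_out):
    """Pooled two-proportion z-test with +1 pseudocounts on each count.

    Illustrative only: the precise smoothing in the Clojure implementation
    may place the pseudocounts differently.
    """
    s1, s2, n1, n2 = succ_in + 1, succ_out + 1, n_in + 1, n_out + 1
    p1, p2 = s1 / n1, s2 / n2
    p_pool = (s1 + s2) / (n1 + n2)
    denom = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / denom
```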

### Design decisions
- All tests that verify targets not yet implemented are marked `@pytest.mark.xfail` with the discrepancy ID in the reason
- `strict=False` on D2 because vw (small) coincidentally matches while biodiversity (larger) does not
- D5 `test_clojure_pat_values_consistent_with_formula` is a sanity check that verifies Clojure data matches the documented formula — it's NOT xfailed because it doesn't test Python code
- D6 `test_two_prop_test_with_pseudocounts` is xfailed (Python lacks pseudocounts)
- D15 uses `pytest.skip` when dataset has no moderated comments
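As a concrete illustration of the `strict=False` convention (the test body is a placeholder, not the real blob comparison):

```python
import pytest

# strict=False: an unexpected pass (xpass) is reported but does not fail
# the run, which is what we want when small datasets like vw can match
# the Clojure output by coincidence.
@pytest.mark.xfail(strict=False, reason="D2: in-conv threshold differs from Clojure")
def test_in_conv_threshold_matches_clojure():
    # Placeholder standing in for the real Clojure-blob comparison.
    assert False
```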

### Test classes summary

| Class | Discrepancy | Tests | xfail? |
|-------|-------------|-------|--------|
| `TestD2InConvThreshold` | D2 | 2 (per dataset) | xfail(strict=False) |
| `TestD4Pseudocount` | D4 | 2 (1 constant + 1 per dataset) | both xfail |
| `TestD5ProportionTest` | D5 | 2 (1 formula + 1 sanity per dataset) | formula xfail, sanity passes |
| `TestD6TwoPropTest` | D6 | 1 | xfail |
| `TestD7RepnessMetric` | D7 | 1 | xfail |
| `TestD8FinalizeStats` | D8 | 1 | xfail |
| `TestD9ZScoreThresholds` | D9 | 3 (2 constants + 1 per dataset) | all xfail |
| `TestD10RepCommentSelection` | D10 | 1 (per dataset) | xfail |
| `TestD11ConsensusSelection` | D11 | 1 (per dataset) | xfail |
| `TestD12CommentPriorities` | D12 | 1 (per dataset) | xfail |
| `TestD15ModerationHandling` | D15 | 1 (per dataset) | skipped (no mod-out data) |
| `TestSyntheticEdgeCases` | multiple | 5 | 2 xfail (D4, D9), 3 pass |

---

## What's Next: PR 1 — Fix D2 (In-Conv Participant Threshold)

- Change `threshold = 7 + sqrt(n_cmts) * 0.1` to `threshold = min(7, n_cmts)` in `conversation.py:1238`
- Use `self.raw_rating_mat` instead of `self.rating_mat` for counting
- Add greedy fallback (top-15 voters if <15 qualify)
- Add monotonic persistence (once in, always in)
- Remove xfail from `TestD2InConvThreshold`
- Check cluster count impact (may change from 3→2 for biodiversity, matching Clojure)
- Re-record golden snapshots if needed
- Document new baseline
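The bullets above can be sketched as follows. Function and parameter names here are ours, and the exact fallback semantics are assumptions; the real change lands in `conversation.py`:

```python
import numpy as np

def in_conv_participants(raw_rating_mat, previously_in, fallback_size=15):
    """Sketch of the planned D2 in-conv rule (names and details assumed).

    raw_rating_mat: participants x comments array, NaN where no vote was cast.
    previously_in: participant ids already admitted on an earlier update.
    """
    n_cmts = raw_rating_mat.shape[1]
    threshold = min(7, n_cmts)  # replaces 7 + sqrt(n_cmts) * 0.1
    # Count votes on the raw (unfiltered) matrix, per the second bullet.
    vote_counts = np.sum(~np.isnan(raw_rating_mat), axis=1)
    in_conv = {int(p) for p in np.flatnonzero(vote_counts >= threshold)}
    if len(in_conv) < fallback_size:
        # Greedy fallback: admit the top voters so clustering has enough points.
        top = np.argsort(-vote_counts)[:fallback_size]
        in_conv |= {int(p) for p in top}
    return in_conv | set(previously_in)  # monotonic: once in, always in
```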

---

## Session Log

### Session 1 (2026-02-25)

- Assessed codebase: read test infra, repness module, conversation module, Clojure blob structure
- Created `tests/test_discrepancy_fixes.py` (30 tests, 12 classes)
- Created this journal with initial baseline
- Ran baseline: `test_legacy_clojure_regression.py` → 2 passed, 6 xfailed
- Ran full suite → 185 passed, 11 failed (pre-existing), 3 skipped, 6 xfailed
- Investigated 11 pre-existing failures: inherited from stacked PRs (DatasetInfo API change, threshold issues, D9/D5/D7 repness)
- Rebased onto updated `origin/kmeans_analysis_docs` after PR stack rebase
- Committed and created PR #2401 (`[Clj parity PR 0]`)

### Session 2 (2026-02-26)

- Rebased after stack update (base tests tightened, some xpassed→xfailed)
- Updated PR title convention: `[Clj parity PR N]` prefix for reviewer clarity
- Redacted private dataset names from git history across the full stack:
- `SESSION_HANDOFF_KMEANS.md` in `kmeans_clustering_tooling` (amended deep commit via `GIT_SEQUENCE_EDITOR` rebase)
- `PLAN_DISCREPANCY_FIXES.md` in `kmeans_analysis_docs` (amended tip)
- `CLJ-PARITY-FIXES-JOURNAL.md` in `series-of-fixes` (amended tip)
- Force-pushed all three branches, rebased the chain
- Tests unchanged: 5 passed, 2 skipped, 18 xfailed, 5 xpassed

---

## Notes for Future Sessions

- Private datasets not available in this worktree. Need prodclone DB + `generate_cold_start_clojure.py`.
- `test_discrepancy_fixes.py` uses same parametrization pattern as `test_legacy_clojure_regression.py` (own `pytest_generate_tests` hook, `dataset_name` fixture).
- 11 pre-existing test failures are from the stacked branch, not from our work. They should be fixed in their respective PRs before merging to main.
- `strict=False` on xfail means xpass (unexpected pass) is reported but not a failure. Used when some datasets pass by coincidence.
- After rebase, D9 `test_repness_not_empty` started xpassing — the `comment_repness` list is populated (all pairs), but `group_repness` (selected reps) may still be affected by wrong thresholds. Consider tightening this test when fixing D9.
- To rebase when base is updated: `git fetch origin kmeans_analysis_docs && git rebase --onto origin/kmeans_analysis_docs series-of-fixes-base series-of-fixes && git tag -f series-of-fixes-base origin/kmeans_analysis_docs`
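In minimal form, the shared parametrization pattern reduces to a module-level hook like the following (a simplified sketch; the real hooks discover datasets dynamically rather than hard-coding them):

```python
# Simplified sketch of the shared parametrization pattern. The real hooks in
# the test files build this list from the available datasets on disk.
DATASETS = ["vw", "biodiversity"]

def pytest_generate_tests(metafunc):
    """Parametrize every test that requests the `dataset_name` fixture."""
    if "dataset_name" in metafunc.fixturenames:
        metafunc.parametrize("dataset_name", DATASETS)
```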
2 changes: 2 additions & 0 deletions delphi/docs/PLAN_DISCREPANCY_FIXES.md
@@ -6,6 +6,8 @@ The Delphi Python math pipeline has 15 documented discrepancies with the Clojure

Each fix will be a separate PR to keep reviews manageable. PRs will be **stacked** (each builds on the previous), since fixes are ordered by pipeline execution order — fixing upstream affects downstream. PRs should be clearly labeled as stacked with their dependency chain, so reviewers know the order. We can use `git rebase -i` to clean up commit history before merging to main.

**PR naming convention**: Clojure parity fix PRs use the title prefix `[Clj parity PR N]` (e.g., `[Clj parity PR 0] Per-discrepancy test infrastructure`, `[Clj parity PR 1] Fix D2: in-conv participant threshold`).

**Prerequisite**: The current `kmeans_work` branch has changes from `edge`. These should be separated into their own PR(s) first, before we start the discrepancy fix PRs. The discrepancy fix PRs should be based on the cleaned-up branch.

**Action**: Before starting any fix, analyze the diff of `kmeans_work` vs `upstream/edge` to understand what's changed. Group those changes into logical PR(s) — e.g., test infrastructure improvements, doc updates, minor bug fixes. Each should be reviewable independently. The discrepancy fix PRs (PR 1+) then stack on top of this clean base.
2 changes: 2 additions & 0 deletions delphi/polismath/regression/__init__.py
@@ -15,6 +15,7 @@
get_dataset_info,
get_dataset_files,
get_dataset_report_id,
get_blob_variants,
)
__all__ = [
'ConversationRecorder',
@@ -26,4 +27,5 @@
'get_dataset_info',
'get_dataset_files',
'get_dataset_report_id',
'get_blob_variants',
]
57 changes: 53 additions & 4 deletions delphi/polismath/regression/datasets.py
@@ -156,12 +156,15 @@ def get_dataset_report_id(name: str) -> str:
return get_dataset_info(name).report_id


def get_dataset_files(name: str, prefer_cold_start: bool = True) -> Dict[str, str]:
def get_dataset_files(name: str, prefer_cold_start: bool = True, blob_type: Optional[str] = None) -> Dict[str, str]:
"""Get file paths for a dataset.

Args:
name: Dataset name
prefer_cold_start: If True (default), use cold-start blob when available
prefer_cold_start: If True (default), use cold-start blob when available.
Ignored if blob_type is specified.
blob_type: Explicit blob type to use: 'incremental' or 'cold_start'.
If specified, overrides prefer_cold_start.
"""
info = get_dataset_info(name)
rid = info.report_id
@@ -174,11 +177,19 @@ def find_file(pattern: str) -> str:
raise ValueError(f"Multiple files matching {pattern} in {info.path}: {matches}")
return str(matches[0].resolve())

# Check for cold-start blob first, fall back to original
# Determine which blob to use
cold_start_blob = info.path / f"{rid}_math_blob_cold_start.json"
original_blob = info.path / f"{rid}_math_blob.json"

if prefer_cold_start and cold_start_blob.exists():
if blob_type == 'cold_start':
if not cold_start_blob.exists():
raise FileNotFoundError(f"No cold-start blob for {name}")
math_blob_path = str(cold_start_blob)
elif blob_type == 'incremental':
if not original_blob.exists():
raise FileNotFoundError(f"No incremental blob for {name}")
math_blob_path = str(original_blob)
elif prefer_cold_start and cold_start_blob.exists():
math_blob_path = str(cold_start_blob)
else:
math_blob_path = str(original_blob)
@@ -193,6 +204,44 @@ def find_file(pattern: str) -> str:
}


def _is_blob_filled(blob_path: Path) -> bool:
"""Check if a math blob has meaningful content (PCA, non-empty clusters, etc.)."""
import json
if not blob_path.exists():
return False
try:
with open(blob_path) as f:
data = json.load(f)
# A blob is "filled" if it has PCA data or non-empty base-clusters
has_pca = 'pca' in data and 'comps' in data.get('pca', {})
bc = data.get('base-clusters', {})
has_clusters = isinstance(bc, dict) and len(bc.get('id', [])) > 0
return has_pca or has_clusters
except (json.JSONDecodeError, OSError):
return False


def get_blob_variants(name: str) -> List[str]:
"""Get available filled blob variants for a dataset.

Returns a list of blob_type strings ('incremental', 'cold_start') for blobs
that exist and contain meaningful data.
"""
info = get_dataset_info(name)
rid = info.report_id
variants = []

full_blob = info.path / f"{rid}_math_blob.json"
if _is_blob_filled(full_blob):
variants.append('incremental')

cold_start_blob = info.path / f"{rid}_math_blob_cold_start.json"
if _is_blob_filled(cold_start_blob):
variants.append('cold_start')

return variants


# Legacy aliases
def get_dataset_directory(report_id: str, dataset_name: Optional[str] = None) -> Path:
"""Find dataset directory by report_id."""
8 changes: 6 additions & 2 deletions delphi/pyproject.toml
@@ -124,8 +124,12 @@ include = [

# Pytest configuration
[tool.pytest.ini_options]
# When using pytest-xdist (-n), group tests by xdist_group marker for efficient fixture sharing
addopts = "--dist=loadgroup"
# Force sequential execution (-n0) to leverage the session-scoped conversation cache.
# The cache shares computed conversations across test files, but each xdist worker
# has its own process with a separate cache. With only 2 datasets (biodiversity, vw),
# sequential execution with caching is ~6x faster than parallel execution where each
# worker recomputes the conversations independently.
addopts = "-n0"
filterwarnings = [
# Ignore python_multipart deprecation warning from ddtrace (third-party)
"ignore:Please use `import python_multipart`:PendingDeprecationWarning:ddtrace.internal.module",
17 changes: 5 additions & 12 deletions delphi/scripts/clojure_comparer.py
@@ -86,12 +86,7 @@ def main(datasets: tuple, include_local: bool, log_level: str,
from polismath.regression.clojure_comparer import ClojureComparer, load_clojure_math_blob
from polismath.regression.datasets import list_available_datasets, get_dataset_files
from polismath.conversation import Conversation
try:
from tests.common_utils import load_votes
except ImportError:
click.echo("Error: Could not import tests.common_utils.load_votes.", err=True)
click.echo("Run this script from the delphi/ directory with: uv run python scripts/clojure_comparer.py", err=True)
raise SystemExit(1)
from polismath.benchmarks.benchmark_utils import load_votes_from_csv

# Get available datasets with Clojure math_blob
all_datasets = list_available_datasets(include_local=include_local)
@@ -138,7 +133,7 @@

# Load and process Python output
dataset_files = get_dataset_files(dataset)
votes_data = load_votes(dataset_files['votes'])
votes_data = load_votes_from_csv(dataset_files['votes'])

click.echo(f"✓ Loaded votes data: {len(votes_data['votes'])} votes")

@@ -226,9 +221,7 @@
click.echo(f"✓ {dataset}: PASS")
else:
click.echo(f"✗ {dataset}: FAIL")
click.echo(f"\nLikely cause: Initialization algorithm difference")
click.echo(f" Python: K-means++ (seed 42)")
click.echo(f" Clojure: First k distinct points")
click.echo(f" See docs/PLAN_DISCREPANCY_FIXES.md for known discrepancies")
click.echo(f"{'='*60}")

except Exception as e:
@@ -260,8 +253,8 @@ def main(datasets: tuple, include_local: bool, log_level: str,
click.echo(f"\n✗ Failed:")
for name in failed_datasets:
click.echo(f" {name}")
click.echo("\nNote: Failures may be due to initialization differences (K-means++ vs first-k).")
click.echo("This is expected until clustering initialization is aligned.")
click.echo("\nNote: Failures may be due to remaining Python-Clojure discrepancies.")
click.echo("See docs/PLAN_DISCREPANCY_FIXES.md for the fix plan.")
return 1
else:
click.echo("\n✓ All datasets passed!")