4 changes: 4 additions & 0 deletions .github/workflows/cypress-tests.yml
@@ -9,6 +9,10 @@ on:
- stable
- 'jc/**'

concurrency:
group: e2e-${{ github.ref }}
cancel-in-progress: true

jobs:
cypress-run:
runs-on: ubuntu-latest
1 change: 1 addition & 0 deletions .github/workflows/python-ci.yml
@@ -93,6 +93,7 @@ jobs:
-e POSTGRES_HOST=postgres \
-e POSTGRES_PASSWORD=PdwPNS2mDN73Vfbc \
-e POSTGRES_DB=polis-test \
-e SKIP_GOLDEN=1 \
delphi \
bash -c " \
set -e; \
14 changes: 14 additions & 0 deletions .spr.yml
@@ -0,0 +1,14 @@
githubRepoOwner: compdemocracy
githubRepoName: polis
githubHost: github.com
githubRemote: origin
githubBranch: edge
requireChecks: true
requireApproval: true
defaultReviewers: []
mergeMethod: squash
mergeQueue: false
prTemplateType: stack
forceFetchTags: false
showPrTitlesInStack: false
branchPushIndividually: false
32 changes: 24 additions & 8 deletions delphi/CLAUDE.md
@@ -2,15 +2,9 @@

This document provides comprehensive guidance for working with the Delphi system, including database interactions, environment configuration, Docker services, and the distributed job queue system. It serves as both documentation and a practical reference for day-to-day operations.

## Documentation Directory
## Documentation

For a comprehensive list of all documentation files with descriptions, see:
[delphi/docs/DOCUMENTATION_DIRECTORY.md](docs/DOCUMENTATION_DIRECTORY.md)

## Current work todos are located in

delphi/docs/JOB_QUEUE_SCHEMA.md
delphi/docs/DISTRIBUTED_SYSTEM_ROADMAP.md
**Warning:** Many docs in `docs/` are outdated and should not be trusted. Always verify against the actual code. Start with `docs/PLAN_DISCREPANCY_FIXES.md` (canonical fix plan) and `docs/CLJ-PARITY-FIXES-JOURNAL.md` (session journal) for current Clojure parity work.

## Helpful terminology

@@ -368,3 +362,25 @@ The system uses AWS Auto Scaling Groups to manage capacity:
- Large Instance ASG: 1 instance by default, scales up to 3 based on demand

CPU utilization triggers scaling actions (scale down when below 60%, scale up when above 80%).


## Testing

Run tests with `pytest` on the `tests/` folder.

### Reference datasets

In `real_data`, we have several datasets of real conversations, exported from Polis, that can be used for testing and development. Those at the root of `real_data` are public.
In `real_data/.local`, we have some private datasets that can only be used internally. The comparer supports both public and private datasets via the `--include-local` flag.

### Regressions and golden snapshots

To catch regressions against the latest validated Python code, there are regression unit tests in `tests/`, as well as a script that compares output to "golden snapshots": `scripts/regression_comparer.py`. That script is more verbose than the tests, which makes it useful for debugging.

Some amount of numerical error is acceptable; handling that tolerance is what the regression comparer library is for.
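The tolerance idea can be illustrated with a plain `numpy` check (a sketch, not the comparer's actual interface; the function name and default tolerances are illustrative):

```python
import numpy as np

def close_enough(golden: np.ndarray, current: np.ndarray,
                 rtol: float = 1e-6, atol: float = 1e-9) -> bool:
    # Small floating-point drift between runs (BLAS versions, summation
    # order) is expected; flag only differences beyond the tolerances.
    return np.allclose(golden, current, rtol=rtol, atol=atol)
```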

### Old Clojure reference implementation, and moving to scikit-learn

For the math, there is an older implementation in Clojure, in `polismath`. Until we can replace it, we run comparisons between the two implementations in `tests/*legacy*`. Those tests run the Python code and compare parts of its output to the `math blob`, the JSON output of the Clojure implementation. That blob is usually stored in the PostgreSQL database, but for simplicity a copy is kept alongside the golden (Python) snapshots used by the regression comparer, so these tests require neither Postgres nor Clojure.

Much of the current Python code was ported from Clojure by an AI agent (Sonnet 3.5, last year), including many home-made implementations of core algorithms. We are in the process of replacing those with standard implementations (such as sklearn for PCA and k-means). This is ongoing work, made harder by the fact that the Python code does not produce quite the same output as the Clojure code. So typically we have to find what the ported Python code does differently from the Clojure code, adjust the Python code to match the Clojure output, and then swap in the standard implementation, which may again produce slightly different output, so we adjust parameters until the outputs are similar.
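One concrete source of "slightly different output" is that PCA components are only defined up to sign, so a home-made PCA and sklearn's can return opposite directions for the same component. A minimal sign-alignment sketch (illustrative, not the project's actual comparison code):

```python
import numpy as np

def align_signs(components: np.ndarray, reference: np.ndarray) -> np.ndarray:
    # Flip each row of `components` so it points the same way as the
    # matching row of `reference` (rows with a negative dot product flip).
    flips = np.sign(np.einsum("ij,ij->i", components, reference))
    flips[flips == 0] = 1.0
    return components * flips[:, None]
```

Aligning signs before comparing lets the comparison focus on genuine numerical differences between implementations.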
29 changes: 19 additions & 10 deletions delphi/Dockerfile
@@ -7,6 +7,13 @@

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV UV_SYSTEM_PYTHON=1

# Install uv for faster package installation (~2x faster than pip)
# Placed in /opt/uv (not /usr/local/bin) to avoid leaking into the final/prod image
# via the blanket COPY --from=builder /usr/local/bin in Stage 2.
COPY --from=ghcr.io/astral-sh/uv:0.11.2 /uv /opt/uv/uv
ENV PATH="/opt/uv:$PATH"

RUN apt-get update && \
apt-get install -y --no-install-recommends \
@@ -28,21 +35,21 @@
COPY pyproject.toml requirements.lock ./

# Install dependencies from lock file (cached layer - reused unless requirements.lock changes)
# BuildKit cache mount keeps pip cache between builds for faster rebuilds
# BuildKit cache mount keeps uv cache between builds for faster rebuilds
# If USE_CPU_TORCH is true, we install CPU-specific wheels and filter them out of the lockfile
RUN --mount=type=cache,target=/root/.cache/pip \
RUN --mount=type=cache,target=/root/.cache/uv \
if [ "$USE_CPU_TORCH" = "true" ]; then \
echo "USE_CPU_TORCH=true: Installing CPU-only PyTorch..." && \
pip install --index-url https://download.pytorch.org/whl/cpu \
uv pip install --index-url https://download.pytorch.org/whl/cpu \
torch==2.8.0 \
torchvision==0.23.0 \
torchaudio==2.8.0 && \
echo "Filtering standard torch packages from requirements.lock..." && \
grep -vE "^(torch|torchvision|torchaudio)==" requirements.lock > requirements.filtered.lock && \
pip install -r requirements.filtered.lock; \
uv pip install -r requirements.filtered.lock; \
else \
echo "USE_CPU_TORCH=false: Installing standard dependencies..." && \
pip install -r requirements.lock; \
uv pip install -r requirements.lock; \
fi

# ===== OPTIMIZATION: Copy source code LAST (busts cache on code changes) =====
@@ -54,8 +61,8 @@ RUN --mount=type=cache,target=/root/.cache/pip \

# Install the project package (without dependencies - they're already installed)
# This registers entry points and installs the package in development mode
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-deps .
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --no-deps .

RUN echo "--- PyTorch Check (after pyproject.toml installation) ---" && \
pip show torch torchvision torchaudio && \
@@ -132,12 +139,14 @@ RUN apt-get update && \
&& apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Copy pyproject.toml to install dev dependencies
# Copy uv from builder for faster package installation
COPY --from=builder /opt/uv/uv /usr/local/bin/uv
ENV UV_SYSTEM_PYTHON=1
COPY pyproject.toml .

# Install dev dependencies (pytest, etc.) using caching
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-cache-dir ".[dev]"
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install ".[dev]"

# Default command for test container (can be overridden)
CMD ["tail", "-f", "/dev/null"]
153 changes: 153 additions & 0 deletions delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
@@ -0,0 +1,153 @@
# Journal: Fixing Python-Clojure Discrepancies

This is the ongoing tracking document for the TDD fix process described in
`PLAN_DISCREPANCY_FIXES.md`. It serves as the single source of truth for
our work, while commit messages and PR descriptions serve reviewers.

---

## Initial Baseline (2026-02-25)

### Branch: `series-of-fixes` (forked from `origin/kmeans_analysis_docs`)

### Test Results

**`test_legacy_clojure_regression.py`** (2 datasets: vw, biodiversity):
- 2 passed (`test_pca_components_match_clojure` × 2)
- 6 xfailed:
- `test_basic_outputs` × 2 — D9/D5/D7: empty `comment_repness`
- `test_group_clustering` × 2 — D2/D3: wrong participant threshold, missing k-smoother
- `test_comment_priorities` × 2 — D12: not implemented

**Full test suite** (`tests/` except `test_batch_id.py`):
- 185 passed, 11 failed, 3 skipped, 6 xfailed
- Pre-existing failures (not caused by this work, inherited from stacked PRs):
- `test_clusters.py::test_init_clusters` — `init_clusters()` doesn't populate members when k > n_points
- `test_conversation.py::test_recompute` — clustering threshold (7.2) filters out all 20 participants in synthetic data
- `test_conversation.py::test_data_persistence` — same threshold issue
- `test_datasets.py` × 4 — DatasetInfo API changed (added `has_cold_start_blob`), tests use old 8-arg constructor
- `test_edge_cases.py::test_insufficient_data_for_pca` — repness returns empty dict for no-group case
- `test_repness_smoke.py` × 2 — repness empty due to D9/D5/D7 (the very discrepancies we're fixing)
- **Note**: CI runs on `main` which has different code — these failures are specific to this stacked branch.
4 of them were already known (documented in MEMORY.md under "Known pre-existing test failures in #2393").

### Datasets Available

| Dataset | Votes | Has cold-start Clojure blob? |
|---------|------:|-----|
| vw | 4,684 | Yes |
| biodiversity | 29,803 | Yes |
| *(5 private datasets)* | 91K–1M | **No** (need prodclone DB) |

### Clojure Blob Structure (vw example)

Repness entry keys: `tid`, `n-agree`, `p-test`, `repness-test`, `n-success`,
`repful-for`, `n-trials`, `repness`, `best-agree`, `p-success`

Consensus structure: `{agree: [{tid, n-success, n-trials, p-success, p-test}], disagree: [...]}`

Comment priorities: `{tid: priority_value}` — 125 entries for vw

In-conv: list of 67 participant IDs (vw)

---

## PR 0: Test Infrastructure (complete)

### What was done
- Created `tests/test_discrepancy_fixes.py` with per-discrepancy test classes
- Created this journal
- Documented initial baseline

### Test results for `test_discrepancy_fixes.py`

After rebase onto updated `origin/kmeans_analysis_docs`:

```
7 passed, 19 skipped, 39 xfailed, 10 xpassed (with --include-local, 7 datasets)
```

- **7 passed**: Clojure formula sanity checks (prop_test, repness metric product, repful rat>rdt) + Clojure blob consistency checks (pat values)
- **19 skipped**: D15 moderation (no moderated comments), incomplete Clojure blobs, engage duplicate files
- **39 xfailed**: Discrepancy tests correctly fail (D2-D12 constants, formulas, and real-data comparisons)
- **10 xpassed** (all `strict=False`, so green):
- D2 in-conv × 2 on vw — small dataset where old/new thresholds coincide
- D6 two_prop_test × 1 — pseudocount difference too small to matter for this test case
- D9 repness_not_empty × 7 on all datasets — `comment_repness` list is populated (all
(group, comment) pairs) even with wrong thresholds; only `group_repness` selection is
affected. **TODO**: tighten this test when fixing D9 to check correct *number* of
representative comments, not just non-emptiness

### Design decisions
- All tests that verify targets not yet implemented are marked `@pytest.mark.xfail` with the discrepancy ID in the reason
- `strict=False` on D2 because vw (small) coincidentally matches while biodiversity (larger) does not
- D5 `test_clojure_pat_values_consistent_with_formula` is a sanity check that verifies Clojure data matches the documented formula — it's NOT xfailed because it doesn't test Python code
- D6 `test_two_prop_test_with_pseudocounts` is xfailed (Python lacks pseudocounts)
- D15 uses `pytest.skip` when dataset has no moderated comments

### Test classes summary

| Class | Discrepancy | Tests | xfail? |
|-------|-------------|-------|--------|
| `TestD2InConvThreshold` | D2 | 2 (per dataset) | xfail(strict=False) |
| `TestD4Pseudocount` | D4 | 2 (1 constant + 1 per dataset) | both xfail |
| `TestD5ProportionTest` | D5 | 2 (1 formula + 1 sanity per dataset) | formula xfail, sanity passes |
| `TestD6TwoPropTest` | D6 | 1 | xfail |
| `TestD7RepnessMetric` | D7 | 1 | xfail |
| `TestD8FinalizeStats` | D8 | 1 | xfail |
| `TestD9ZScoreThresholds` | D9 | 3 (2 constants + 1 per dataset) | all xfail |
| `TestD10RepCommentSelection` | D10 | 1 (per dataset) | xfail |
| `TestD11ConsensusSelection` | D11 | 1 (per dataset) | xfail |
| `TestD12CommentPriorities` | D12 | 1 (per dataset) | xfail |
| `TestD15ModerationHandling` | D15 | 1 (per dataset) | skipped (no mod-out data) |
| `TestSyntheticEdgeCases` | multiple | 5 | 2 xfail (D4, D9), 3 pass |

---

## What's Next: PR 1 — Fix D2 (In-Conv Participant Threshold)

- Change `threshold = 7 + sqrt(n_cmts) * 0.1` to `threshold = min(7, n_cmts)` in `conversation.py:1238`
- Use `self.raw_rating_mat` instead of `self.rating_mat` for counting
- Add greedy fallback (top-15 voters if <15 qualify)
- Add monotonic persistence (once in, always in)
- Remove xfail from `TestD2InConvThreshold`
- Check cluster count impact (may change from 3→2 for biodiversity, matching Clojure)
- Re-record golden snapshots if needed
- Document new baseline
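The planned threshold change and greedy fallback can be sketched as follows (names and the NaN-based vote encoding are illustrative; the real code lives in `conversation.py`, and the monotonic-persistence step is omitted here):

```python
import numpy as np

def in_conv_participants(raw_rating_mat: np.ndarray, min_pool: int = 15) -> np.ndarray:
    # Clojure-parity rule: a participant is "in conversation" once they
    # have voted on min(7, n_comments) comments, counted on the *raw*
    # vote matrix (NaN = no vote in this sketch).
    n_cmts = raw_rating_mat.shape[1]
    threshold = min(7, n_cmts)
    vote_counts = np.sum(~np.isnan(raw_rating_mat), axis=1)
    in_conv = np.flatnonzero(vote_counts >= threshold)
    if len(in_conv) < min_pool:
        # Greedy fallback: take the top-`min_pool` voters instead.
        in_conv = np.argsort(-vote_counts)[:min_pool]
    return in_conv
```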

---

## Session Log

### Session 1 (2026-02-25)

- Assessed codebase: read test infra, repness module, conversation module, Clojure blob structure
- Created `tests/test_discrepancy_fixes.py` (30 tests, 12 classes)
- Created this journal with initial baseline
- Ran baseline: `test_legacy_clojure_regression.py` → 2 passed, 6 xfailed
- Ran full suite → 185 passed, 11 failed (pre-existing), 3 skipped, 6 xfailed
- Investigated 11 pre-existing failures: inherited from stacked PRs (DatasetInfo API change, threshold issues, D9/D5/D7 repness)
- Rebased onto updated `origin/kmeans_analysis_docs` after PR stack rebase
- Committed and created PR #2401 (`[Clj parity PR 0]`)

### Session 2 (2026-02-26)

- Rebased after stack update (base tests tightened, some xpassed→xfailed)
- Updated PR title convention: `[Clj parity PR N]` prefix for reviewer clarity
- Redacted private dataset names from git history across the full stack:
- `SESSION_HANDOFF_KMEANS.md` in `kmeans_clustering_tooling` (amended deep commit via `GIT_SEQUENCE_EDITOR` rebase)
- `PLAN_DISCREPANCY_FIXES.md` in `kmeans_analysis_docs` (amended tip)
- `CLJ-PARITY-FIXES-JOURNAL.md` in `series-of-fixes` (amended tip)
- Force-pushed all three branches, rebased the chain
- Tests unchanged: 5 passed, 2 skipped, 18 xfailed, 5 xpassed

---

## Notes for Future Sessions

- Private datasets not available in this worktree. Need prodclone DB + `generate_cold_start_clojure.py`.
- `test_discrepancy_fixes.py` uses same parametrization pattern as `test_legacy_clojure_regression.py` (own `pytest_generate_tests` hook, `dataset_name` fixture).
- 11 pre-existing test failures are from the stacked branch, not from our work. They should be fixed in their respective PRs before merging to main.
- `strict=False` on xfail means xpass (unexpected pass) is reported but not a failure. Used when some datasets pass by coincidence.
- After rebase, D9 `test_repness_not_empty` started xpassing — the `comment_repness` list is populated (all pairs), but `group_repness` (selected reps) may still be affected by wrong thresholds. Consider tightening this test when fixing D9.
- To rebase when base is updated: `git fetch origin kmeans_analysis_docs && git rebase --onto origin/kmeans_analysis_docs series-of-fixes-base series-of-fixes && git tag -f series-of-fixes-base origin/kmeans_analysis_docs`