Agentic builds: dsynth evidence capture hooks by tuxillo · Pull Request #1517 · DragonFlyBSD/DeltaPorts

tuxillo · 2026-01-10T00:59:52Z

Goal

We are designing a system to automatically (agent-assisted) fix ports while keeping the existing, build-driven workflow intact:

dsynth stays the authoritative build executor.
On failure, we capture a bounded evidence bundle (distilled errors + small port context) so automated triage/patch generation can be driven by real build output without dumping huge logs or entire work directories into an AI context.
Evidence is intended to flow into an asynchronous agent pipeline (triage → patch → review) via a central queue (documented), so builds never block on AI availability.

What this PR adds (foundation)

dsynth hook scripts under scripts/dsynth-hooks/:
- hook_run_start / hook_run_end group failures per build run and snapshot dsynth summary lists.
- hook_pkg_failure creates a per-failure evidence bundle with:
  - logs/errors.txt (high-signal extract, capped at 200KB)
  - logs/full.log.gz (full log preserved for humans)
  - port/* snapshot (Makefile/distinfo/pkg-plist/patches, etc.)
  - meta.txt and basic dsynth profile/config snapshots
Design/usage documentation in docs/AGENTIC_BUILDS.md describing:
- the overall automated-fixing workflow (bounded evidence → triage → snippet escalation → patch → rebuild)
- an opencode integration plan, including a central queue model for asynchronous triage
A small README pointer to the hook location.

What this PR does not do (yet)

No network calls from hooks.
No queue writer/runner implementation.
No automated patch application.

Those are intentionally deferred so this PR can land the core evidence-capture mechanism safely and independently.

How to try it

Install hooks by copying/symlinking scripts/dsynth-hooks/hook_* and scripts/dsynth-hooks/hook_common.sh into dsynth’s config base (/etc/dsynth/ or /usr/local/etc/dsynth/) and making them executable.
Run dsynth normally.
On a port failure, inspect ${Directory_logs}/evidence/runs/.../ports/.../ for the evidence bundle.

Why this matters for automated fixing

Reliable, size-capped evidence capture is the prerequisite for an automated port-fixing system:

the triage agent needs consistent inputs (errors.txt + port metadata)
the patch agent can generate DeltaPorts-style diffs based on evidence, not guesses
the rebuild loop stays dsynth-driven, and automation can be layered on without destabilizing build infrastructure

Add dsynth hook scripts that snapshot distilled build errors and relevant port metadata on failures, grouped by run, so debugging can stay build-driven without keeping full workdirs. Document the bounded evidence contract and the planned opencode integration/central queue model for asynchronous triage.

Add observe-only state server for remote UI integration: - REST API for runs, jobs, bundles, ports, artifacts - SSE event stream with replay support - SQLite persistence for full history - Filesystem reconciler for live updates Validated on DragonFlyBSD VM - all endpoints tested.

- Add vanilla JS Bootstrap 5 UI served by state-server - Live SSE event stream with replay/reconnect - Views: Overview, Events, Jobs, Runs, Ports, Bundles - Artifact viewer for markdown, diffs, logs - SSE improvements: after_id, tail query params, ts in payloads

- Add /bundles API endpoint listing recent bundles - Add #/bundles route with renderBundles() view - Add Bundles nav item to navbar - Update Phase 9 docs with completion status and new route

- agent-queue-runner: add apply job type and iteration tracking - apply-patch: add DragonFly local mode, --no-push flag, BSD-compatible patch - hook_common.sh: detect rebuild iterations, track previous bundles - Add KEDB entry for DragonFly source patch conventions

Makefiles use tabs, not spaces. The agent was generating patches with spaces which caused patch application failures. Added rule #8 to emphasize preserving exact whitespace from the bundle context.

When retrying a patch application, the branch may already exist from a previous failed attempt. Delete it first to allow the retry.

Stop extraction when hitting common section markers like 'Rationale', 'Files Modified', etc. Also detect when prose text starts after hunks. This prevents non-diff content from being included in patch.diff.

The agent was generating patches with incorrect hunk line counts. Added detailed instructions on unified diff format with example.

- Change dports-patch prompt to request complete file contents - Add extract_files_from_response() to parse FILE content blocks - Add generate_unified_diff() to create diffs programmatically - Add generate_combined_diff() for multi-file patches - Update write_patch_outputs() to try new format first, fallback to legacy This fixes the malformed diff issue - LLMs are good at generating file content but struggle with unified diff syntax and line counts.

The agent was outputting diff syntax inside FILE blocks for Makefile.DragonFly. Make it explicit that Makefile.DragonFly should be raw makefile content, while dragonfly/patch-* files are actual diffs. Also add specific hint for the IFM_IEEE80211_VHT5G error.

…er UI - Add activity_log and runner_status tables to state-server schema - Add /activity and /runner-status API endpoints with SSE events - Update agent-queue-runner to log activities at all job stages - Add heartbeat thread for runner liveness detection (5s interval) - UI: Add Activity Log panel showing last 10 runner activities - UI: Add Runner Status indicator with staleness detection (>15s) - UI: Add back button for artifact navigation in bundle view - UI: Hide session_id.txt files from artifact lists

…b error display - state-server: Only emit runner_status SSE events when status/job_id/stage changes, not on every heartbeat update_at change - app.js: Don't trigger full re-render for runner_status/activity events (fixes bundle tab reset issue), only re-render on overview page - app.js: Add renderJobDetail() with prominent error display and related activity log entries for failed jobs - agent-queue-runner: Write .job.error files before moving failed jobs, move error files along with job files

llm.py's tokenizers stub only fired when llm was imported — but the runner / tools modules / manual inspections can hit litellm without going through llm first. Moving the stub to dportsv3/agent/__init__.py makes it run as soon as any module under the package is imported. This unblocks invocations like: python -c "import dportsv3.agent; import litellm; ..." without needing to pre-import llm. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ting When litellm's model-name → provider heuristic mis-routes (e.g., any model name containing 'deepseek' or 'claude' is shunted to the native provider client even when openai/ prefix and api_base are set), custom_llm_provider forces a specific code path. Generic passthrough; default None means "let litellm pick from prefix as before." Set per flow: - agent-queue-runner: DP_HARNESS_TRIAGE_PROVIDER env var (DP_HARNESS_PATCH_PROVIDER will follow in step 4 when patch wires) - llm.complete(), tool_loop.run(), triage.run(): custom_llm_provider kwarg - _manual_test_tool_loop: DP_TEST_PROVIDER env var Native providers (anthropic/, deepseek/, nvidia_nim/, ...) work unchanged because they don't set custom_llm_provider. The override is only used when needed (most often: openai-compat third-party endpoints with model names that fool the heuristic). Also commits the manual test helper for tool_loop that was previously left untracked. Useful while step 4 is in flight. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Thinking-mode providers (DeepSeek v4-pro/v4-flash directly or via opencode.ai/zen, OpenAI o-series via some relays) emit a reasoning_content field alongside content + tool_calls, holding the model's intermediate chain-of-thought. The upstream API requires this field to be passed back on the next request, or the multi-turn call fails with HTTP 400: "The reasoning_content in the thinking mode must be passed back to the API." Changes: - llm.Response gains optional reasoning_content field; llm.complete extracts it from msg.reasoning_content if present (None otherwise). - tool_loop._assistant_message_from includes reasoning_content in the reconstructed assistant message when set, so the next LLM request preserves continuity. No-op for non-thinking models — reasoning_content stays None, nothing extra is sent. Verified with stubbed Response objects: thinking-mode reconstructed message carries reasoning_content; non-thinking does not. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Previously every get_file result was base64. For UTF-8 text files (Makefiles, patches, source, the bulk of what the agent reads), this inflated content by ~33% AND made the model mentally decode base64 to find anything inside — burning prompt AND completion tokens. Now: read bytes, try UTF-8 decode with a NUL-byte sanity check; return {encoding: 'text', content: <str>} on success, fall back to {encoding: 'base64', content: <b64>} for binary. sha256 is computed over the raw bytes, so put_file's expected_sha256 round-trip works regardless of encoding. Verified with a temp-fs harness: text Makefile returns text; PNG-header file returns base64. Schema description updated so the LLM understands the dual-mode return shape. Example path in description updated to /work/DPorts/... (the common path; agent reads materialized port files from DPorts, edits source-of-truth in DeltaPorts). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The patch agent now runs end-to-end through the harness instead of opencode. New code: - prompts.PATCH_SYSTEM: 4kB system prompt spelling out the dev-env's three-tree layout (freebsd-ports / DeltaPorts / DPorts), tool vocabulary, the repair loop, discipline rules (no commits/push/PRs), and the mandatory output format ending in the new rebuild_proof.json schema (origin, rebuild_ok, dsynth_profile, build_command, timestamp_utc — no branch/head/fports fields). - attempt_loop.run: budget-bounded retry around tool_loop. Each attempt is a fresh [system, user] conversation (with a small failure- context user turn appended on retries) so tool-call traces don't compound across attempts. Stops on rebuild_ok=true, budget exhaustion, or max_iterations. Returns PatchResult{status, final_text, usage, attempts[], proof}. - patch.run: thin wrapper over attempt_loop.run. Runner wiring (mirrors step 1 triage adapter): - New env vars: DP_HARNESS_PATCH_{MODEL,API_BASE,API_KEY,PROVIDER, TIMEOUT}, DP_HARNESS_ENV (dev-env name default), DP_HARNESS_POLICY (optional override of config/agentic-policy.json path). - process_patch_job: when DP_HARNESS_PATCH_MODEL is set, route to _process_patch_job_harness. It reads triage.md, resolves the tier via policy.tier_for(classification, confidence), and calls dportsv3.agent.patch.run with the tier's budget. - Bundle outputs: analysis/patch.md (final LLM text), analysis/ rebuild_proof.json (parsed proof block), analysis/patch_audit.json (status + tokens + per-attempt info + model), analysis/changes.diff (host-side git diff vs HEAD in the env's DeltaPorts overlay). Verified attempt_loop against a stubbed tool_loop: - success on first attempt - failure then success (failure-context message added to retry) - budget exhausted mid-sequence - needs-help after all attempts fail - missing rebuild_proof JSON falls back to needs-help End-to-end against a real LLM + env requires a manual smoke run with DP_HARNESS_PATCH_MODEL + a bundle on disk; covered in the next message. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

_manual_test_patch_flow.py fixtures a minimal bundle under /tmp (meta.txt, errors.txt, analysis/triage.md) and invokes dportsv3.agent.patch.run directly with a fabricated payload — bypassing the queue runner so the harness's loop is exercised in isolation against a real LLM + real dev-env. The fixture intentionally doesn't simulate a broken port; it asks the agent to verify the current state of the port via dsynth_build and emit rebuild_proof.json accordingly. Pointing at devel/readline (default) should reach rebuild_ok=true within 1-2 attempts. Env vars mirror _manual_test_tool_loop (DP_TEST_MODEL, ENV, ORIGIN, TIER_ITERATIONS, TIER_TOKENS, plus PROVIDER/API_BASE/API_KEY). The bundle dir is preserved on exit so you can inspect the artifacts the runner-side adapter would have written: patch.md, patch_audit.json, rebuild_proof.json, changes.diff (note: those are written by agent-queue-runner's _process_patch_job_harness, NOT by this fixture — this fixture only calls patch.run and reports the PatchResult). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

dsynth's 'build' subcommand asks interactive questions (most commonly "Rebuild local repository? [Y/n]" before scanning, sometimes follow- ups during the build). The agent has no tty, so the subprocess sat in [ttyin] state and the patch flow hung — observed mid-test: load: 0.67 cmd: dsynth 31619 [ttyin] 0.00u 0.06s 0% 4128k Fixes: - worker._exec accepts optional input_text kwarg; default stdin is empty string (effectively /dev/null) so unexpected prompts fail fast rather than blocking. - worker.dsynth_build pipes 'y\\n' * 50 to stdin to clear dsynth's prompts. Generous enough for multi-question build cycles, cheap to send. dbuild (the dev-env helper) is unchanged — humans running it interactively still get the prompts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…_turns default Observed: a single attempt burned 2,073,090 tokens before attempt_loop's between-attempts budget check caught it. Root cause: tool_loop only enforced max_turns (30), not the token budget. The model went into a tool-call frenzy and attempt_loop only noticed after 30 turns of accumulating 70k-token contexts. Fixes: - tool_loop.run: new max_tokens kwarg; checked at the top of each turn before issuing the LLM call. When the running total reaches the cap, return whatever Response we have. Default 0 = no cap (callers should pass remaining budget). - attempt_loop.run: passes tier's remaining budget (max_tokens - tokens_used_so_far) as max_tokens to tool_loop on each attempt. Also short-circuits with status=budget-exhausted before kicking off a new attempt if the budget is already gone. - tool_loop max_turns default: 20 -> 12. A patch task taking more than ~12 tool calls per attempt is in trouble; the cap should stop it sooner. - attempt_loop max_tool_turns default: 30 -> 12. Verified with stubbed LLM: tool_loop stops at 1200 tokens when max_tokens=1200 (turn 3 was the first check after total>=cap). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When the patch fixture run produces a surprising token count, we need to see what the model actually did — final_text alone tells us nothing if the loop ended on a tool call. _install_session_dump wraps llm.complete and tools.dispatch to write each turn as a JSON line to <bundle>/session.jsonl: - llm_call records: messages_preview (with long strings truncated to 800 chars), response.text (1200 chars), tool_calls, reasoning_content (600 chars), usage. - tool_dispatch records: tool name, arguments, ok flag, stdout/stderr tails truncated to 600 chars. Excludes result body (file bytes, full schemas) to keep the trace compact and shareable. After a run, share session.jsonl and the per-turn behavior is visible without re-running. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The first end-to-end run of the patch flow burned 40k tokens exploring devel/readline without ever finding the actual build error. Root cause was not a single bug — the agent's information feed had multiple compounding problems. Eight fixes addressing each, ranked by impact: 1. dsynth_build noise → use 'dsynth -S -y' directly. Skip the dbuild helper (which keeps ncurses for humans) and invoke dsynth with -S (disable ncurses TUI) and -y (assume-yes). Previously the agent received ~2kB of curses escape codes as stdout. -y also retires the 'y\n'*50 stdin hack. 2. grep used rg, which isn't packaged for DragonFly. Switch to POSIX 'grep -rn'. grep rc=1 (no matches) → ok=True with match_count=0; rc>=2 → ok=False. Prior behavior surfaced "no matches" as ok=False and the model concluded "rg is not available" (wrong inference but understandable). 3. dev-env exec INFO mount-prep noise on every chroot tool call. New '--quiet' flag on 'dportsv3 dev-env exec' and matching DPORTS_DEV_ENV_QUIET env var. worker._exec always passes --quiet so the harness's contexts stop accumulating 8 lines of "INFO: mount already present at ..." per call. 4. Surface dsynth's per-port build log. dsynth writes the actual build error to /work/dsynth/logs/<origin-with-underscores>.log (Directory_logs from dsynth.ini). Two changes: - dsynth_build result now carries 'log_hint' pointing at this path. - New 'dsynth_log(origin, tail_lines=200)' tool reads the tail. PATCH_SYSTEM updated to direct the agent: on build failure, call dsynth_log immediately — don't grep DPorts for *.log files (they don't exist there). 5. Add 'list_dir(path)' tool. Previously the agent tried get_file on directories and got opaque failures. list_dir returns entries with name/kind/size, capped at max_entries. 6. Tool schemas trimmed. Each schema's description now one focused sentence (was 2-4 sentences with examples). Total schema chars ~6.5kB → ~4kB. The example paths and prose moved to PATCH_SYSTEM, which is sent once per attempt-start instead of every turn. 7. Sliding-window reasoning_content. Thinking-mode providers require the most recent assistant turn's reasoning_content to be echoed back; older turns' reasoning is dead weight in the prompt. tool_loop._strip_old_reasoning drops it from all but the most recent assistant message after each turn. 8. give-up directive in PATCH_SYSTEM (no new tool — prompt-only). Explicit: "if you've tried two distinct approaches and both failed at the same point, stop and emit Rebuild Status: gave-up". Also: "if dsynth_build returned rebuild_ok=true, stop immediately — don't keep exploring." Plus: get_file failure envelopes now differentiate 'missing' / 'is_directory' / 'not_a_regular_file' via a 'kind' field, so the agent can react usefully. Verified with unit tests: - reasoning_content sliding window keeps LATEST, strips older - list_dir returns entries with kind+size - get_file on a directory returns kind=is_directory with a pointer to list_dir/grep - grep on a non-existent pattern returns ok=True match_count=0 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Deepseek's thinking-mode API requires reasoning_content on EVERY prior assistant turn, not just the most recent. Empirical proof: 3 turns in, after the trim removed turn 1's reasoning_content, the API rejected with HTTP 400: The reasoning_content in the thinking mode must be passed back to the API. So the trim violates the protocol, not just leaves tokens on the table. Reverting that change. The other 7 fixes in 9e35959 stand. Token cost of preserving all reasoning_content is the price of using a thinking-mode model — accept it or switch providers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous run on devel/readline got 9 turns in and was making real progress (diagnosed the actual C compile error, found a version-skew bug in DeltaPorts' overlay patch, was about to read the source to fix it) when 40k tokens ran out. Bump fixture defaults to match the ASSIST tier in agentic-policy.json: - DP_TEST_TIER_ITERATIONS: 2 -> 4 - DP_TEST_TIER_TOKENS: 40000 -> 120000 Real bundles classified as ASSIST will get these caps from the policy file; the fixture should mirror that so test results are representative of the production budget. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Triage now resolves a tier from (classification, confidence) via policy.tier_for and uses the tier name to decide whether to auto- enqueue a patch job. The tier name propagates into the patch .job file so the patch worker uses the same budget without re-resolving. agent-queue-runner._process_triage_job_harness: - Loads config/agentic-policy.json (or DP_HARNESS_POLICY path). - Resolves tier from result.classification + result.confidence. - tier=MANUAL: skips patch auto-enqueue, upserts a user_context request so the UI flags it for operator attention, returns status="manual_tier". - tier=AUTO|ASSIST: enqueues the patch job carrying tier= and dev_env= fields. enqueue_patch_job: optional tier_name + dev_env kwargs propagate into the new .job file. _process_patch_job_harness: prefer job.get('tier') (set by triage when running through the harness path) over re-parsing triage.md. Re-parse is the fallback for hand-fired patch jobs. Bug fix found by the new tier matrix test: policy.tier_for only downgraded once. plist-error + low confidence ended at ASSIST even though ASSIST's medium floor wasn't met. Fix: cascade downgrades in a while loop until either confidence meets the current tier's floor or MANUAL is reached. Verified against all combinations: plist-error/high -> AUTO plist-error/medium -> ASSIST (AUTO floor=high not met -> downgrade) plist-error/low -> MANUAL (cascades both downgrades) compile-error/high -> ASSIST compile-error/low -> MANUAL (ASSIST floor=medium not met) runtime-error/any -> MANUAL (mapped directly) unknown/any -> MANUAL (default) needs_user_context and should_enqueue_patch are no longer used by the harness triage path; they remain only for the (also legacy) opencode path. They'll go away in step 6 with the rest of the opencode sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirrors _manual_test_patch_flow but for the triage side of the loop. Fixtures a bundle with a synthetic-but-realistic error log, runs dportsv3.agent.triage.run against a real LLM, then asks policy.tier_for what the runner would do with the result. Three built-in fixtures: - compile-error — readline-shape 'lvalue required' compile error - plist-error — pkg-plist mismatch - unknown — opaque generic failure Reports: classification, confidence, resolved tier, the patch-budget the tier would grant, and whether the runner would auto-enqueue. Dumps per-turn LLM trace to <bundle>/session.jsonl for inspection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Final cleanup pass. With the harness fully validated end-to-end (real LLM + real env + real port fix on devel/readline; clean build on archivers/liblz4), the legacy opencode-driven code paths and the workspace concept have no callers and no purpose. Deleted: - config/opencode/ entirely: dports-triage.md (40), dports-patch.md (62), tool/dports.ts (257) — TS plugin and agent markdown moved to dportsv3.agent.{prompts, tools} in earlier steps. - agent-queue-runner constants: PATCHABLE_CLASSIFICATIONS, PATCHABLE_CONFIDENCE (replaced by policy.tier_for in step 5), DEFAULT_VM_SSH_KEY/PORT/HOST, DEFAULT_WORKSPACE_CONFIG, DEFAULT_MAX_SNIPPET_ROUNDS. - agent-queue-runner functions: - should_enqueue_patch, needs_user_context (replaced by policy tier dispatch) - parse_snippet_requests (only called by the now-dead snippet re-enqueue path; snippet rounds fold into harness triage) - get_vm_ssh_command, run_snippet_extractor (SSH-to-VM workaround for Linux dev hosts; harness runs natively on dfly) - enqueue_followup_job, check_and_handle_snippet_requests (legacy snippet escalation; harness triage handles in-process) - call_opencode, extract_response_text (HTTP plumbing for opencode serve) - extract_section, extract_json_block (only used by the legacy write_*_outputs; the harness has its own _PROOF_BLOCK_RE in attempt_loop.py) - write_triage_outputs, write_patch_outputs (legacy bundle writers; harness has _write_triage_audit_harness + _write_patch_audit_harness) - load_workspace_config (workspace.json reader; workspace concept retired) - The workspace-config embedding section of build_triage_payload - Legacy bodies of process_triage_job and process_patch_job: both shrink to thin wrappers that require DP_HARNESS_*_MODEL and call the corresponding _process_*_job_harness adapter. No more feature-flag-gated dual path. - process_job: drops opencode_url, opencode_provider, opencode_model, timeout, max_retries, retry_delay parameters from the signature and call sites. Snippet round display removed (in-process now). - main(): drops all OPENCODE_* env reads; startup log now reports DP_HARNESS_TRIAGE_MODEL + DP_HARNESS_PATCH_MODEL instead. - Docstring header at the top of the file rewritten to document the harness env vars and job-file conventions. Net effect: - scripts/agent-queue-runner: ~2300 LOC -> 1685 LOC - config/opencode/ gone (359 LOC) - Total: ~1000 LOC retired Negative checks pass: - 'opencode|OPENCODE_|VM_SSH|workspace\\.json|agentic-workspace| PATCHABLE_|should_enqueue_patch|call_opencode|extract_response_text| check_and_handle_snippet_requests|load_workspace_config' in scripts/agent-queue-runner: 0 hits - The remaining "opencode" mentions in dportsv3.agent/llm.py + prompts.py + _manual_test_tool_loop.py are documentation strings about opencode.ai/zen (a third-party OpenAI-compat relay) — they describe what works with the harness, not legacy code. - All harness modules import cleanly; 13 tools registered. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 3 (opencode → litellm harness, dev-env-native, no PR/branch/SSH) is complete in the branch. Updates: - docs/agentic-consolidation-plan.md: new "Status: shipped" callout at the top with a brief summary of what landed and pointers to the commit range (985889d ... 6f6db28). - docs/AGENTIC_BUILDS.md: warning banner that the doc describes the pre-Phase-3 architecture (opencode, OPENCODE_* env vars, VM_SSH, /build/synth/agentic-workspace, workspace.json, agentic-worker, process_pr_job — all retired). Sections below the banner are kept as historical context until the doc is rewritten. - docs/TESTING_E2E.md: same banner, plus pointers to the three manual test fixtures in scripts/generator/dportsv3/agent/_manual_test_*.py that exercise the new harness against a real dev-env. A full rewrite of AGENTIC_BUILDS.md and TESTING_E2E.md is queued as follow-up work — not in scope here. Banners prevent operators from following the stale instructions in the meantime. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Fold the tracker schema into state.db so that when tracker becomes a read-only consumer in step 4 the tables already exist. No write behavior changes in this step — schema only, additive. Lifted verbatim from scripts/generator/dportsv3/tracker/db.py: - build_types (with seed rows 'test', 'release') - build_runs - build_results (incl. status default 'recorded' via idempotent ALTER) - port_status - 5 supporting indexes + uq_build_runs_active unique partial index - Idempotent ALTER migrations (build_results.status, build_runs.total_expected) Plus, per the consolidation plan's "weak link" model: add nullable runs.build_run_id (idempotent ALTER) so a dsynth invocation can later be associated with a campaign campaign via DPORTSV3_BUILD_RUN_ID (wired in step 3). Enable PRAGMA foreign_keys=ON on the artifact-store connection so tracker's FK constraints (build_results -> build_runs, port_status -> build_runs) are enforced going forward. None of artifact-store's existing tables have FKs, so the change only affects writes to the new tracker tables. Verified locally: - 4 tables + 6 indexes present after init - 2 seed rows in build_types - FK enforcement works (INSERT with bad build_run_id raises IntegrityError; valid insert succeeds) - Re-init is idempotent (no errors on second _init_db call) tracker.db remains authoritative until step 4 flips tracker to read state.db; both DBs are valid in parallel during the transition. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The 530-line standalone script becomes a 20-line shim; logic moves into dportsv3.artifact_store (importable, testable). Schema moves into dportsv3.db.schema (shared with tracker once it switches to state.db in step 4). New: - dportsv3/db/__init__.py — package marker - dportsv3/db/schema.py — SCHEMA, DEFAULT_BUILD_TYPES, MIGRATIONS, init_db(conn) helper. Idempotent on re-init. - dportsv3/artifact_store.py — ArtifactStore, Handler, ArtifactStoreServer, main(). Imports init_db from db.schema. Changed: - scripts/artifact-store — shrinks to a sys.path bootstrap + `from dportsv3.artifact_store import main; main()`. Same invocation, same behaviour. Executable bit preserved. - scripts/generator/pyproject.toml — adds `artifact-store = "dportsv3.artifact_store:main"` console script, so the generator venv's bin/ gets an `artifact-store` entry too. Invocation matrix now: - ./scripts/artifact-store --logs-root /path (production; no venv) - python -m dportsv3.artifact_store --logs-root /path (in-venv) - $VENV/bin/artifact-store --logs-root /path (console script) All identical in behaviour. Single source of truth for the state.db schema removes the duplication step 1 introduced (both artifact-store and tracker/db.py held the same 4 CREATE TABLEs). Verified: - All 15 tables created, 6 new indexes, FK enforced, runs.build_run_id column present, build_types seeded. - ArtifactStore.upsert_run_bundle / put_blob / get_artifact round-trip works against a temp dir. - Re-init is idempotent. - Both ./scripts/artifact-store --help and `python -m dportsv3.artifact_store --help` print the same help text. tracker/db.py is unchanged — it still uses its own schema for tracker.db. Step 4 will switch it to import from dportsv3.db.schema when it reads state.db. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Port state-server's POST /user-context to artifact-store. Same body shape, same semantics, same emit_event. State-server still serves its legacy /user-context in parallel until step 8 retires it. (Note: the Phase 4 plan also mentioned a POST /v1/jobs/enqueue/pr endpoint as part of step 2. That was prospective — state-server never actually had it, and Phase 3 deleted process_pr_job + the type=pr dispatch arm. No PR enqueue path exists today, so nothing to port. Step 2's real scope is just user-context.) New on the ArtifactStore class: - upsert_user_context(run_id, origin, context_text) -> int Looks up existing context_rev, increments, upserts the row, emits user_context_updated event, returns the new rev under the lock. New on the Handler: - POST /v1/user-context with body {run_id, origin, context_text}. Validation matches state-server: required fields, non-empty after strip, <= 8000 chars, valid JSON. Returns {"ok": true, "context_rev": N} on success. Verified locally via curl against the running shim: - First write to (r1, devel/readline): context_rev=1 - Second write same key: context_rev=2 (increment) - Different origin: starts fresh at context_rev=1 - Missing context_text -> 400 - Empty/whitespace -> 400 - > 8000 chars -> 400 - Malformed JSON -> 400 - state.db rows match: 2 user_context rows, 3 events with the expected rev/timestamp content. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add dsynth hook scripts that translate run start/end and per-port state changes into `dportsv3 tracker` API calls for the current build profile. The hooks persist one active tracker run id per dsynth profile, enqueue ports before marking them building, record final port outcomes, and fail soft with hook-local logging so tracker outages do not interrupt package builds. Include a shared helper, a config template, support for both `hook_pkg_start` and `hook_pkg_started`, and installation notes for copying the hook set into `/etc/dsynth`. Co-authored-by: OpenAI <noreply@openai.com>

Stop dsynth tracker hooks from reusing stale run ids when a new start-build request fails, and disable tracking for that dsynth run instead of continuing to enqueue and record into the previous build. Also switch the tracker server from one shared SQLite connection to fresh per-request connections so concurrent enqueue and status updates do not corrupt transaction state under hook traffic. Add a regression test covering the request connection lifecycle. Co-authored-by: OpenAI <noreply@openai.com>

…ok set The two parallel hook sets were a real problem: dsynth has one Hooks_Directory, so only one executable per event name (hook_pkg_failure etc.) can live there. Today an operator has to pick artifact-store evidence OR tracker-side build_results, not both. This commit folds builderhooks' tracker integration into the existing scripts/dsynth-hooks/, preserving the good ideas from builderhooks: - per-profile state file (under evidence_root/.tracker-state by default) - soft-fail logging (tracker outages don't fail dsynth builds) - disable-on-collision (if start-build fails because an active run exists, the new run is marked TRACKING_DISABLED instead of reusing a stale run id) - conf-driven, default-on tracker integration hook_common.sh gains a "tracker integration" section with: - tracker_log, tracker_fail_soft, tracker_should_skip - tracker_load_config / load_state / write_state / clear_state / disable_state - tracker_pkg_version, tracker_enqueue_one - tracker_run_start, tracker_run_end, tracker_mark_building, tracker_record_result Defaults: - DPORTSV3_TRACKER_TARGET = @${PROFILE} (one profile per target) - DPORTSV3_TRACKER_BUILD_TYPE = test - DPORTSV3_TRACKER_STATE_DIR = ${DIR_LOGS}/evidence/.tracker-state - DPORTSV3_TRACKER_HOOK_LOG = ${DIR_LOGS}/dportsv3-hooks.log Includes the mktemp fix from the wip commit (drop .json suffix — BSD mktemp requires X's at the end of the template). Hooks wired: - hook_run_start: existing evidence-root setup + new tracker_run_start - hook_run_end: existing evidence-pointer cleanup + new tracker_run_end - hook_pkg_failure: existing full bundle write + enqueue triage job + new tracker_record_result fail - hook_pkg_success / skipped / ignored: replace no-op with tracker_record_result {pass,skipped,ignored} - New hook_pkg_start + hook_pkg_started for tracker_mark_building (both names provided to match either dsynth variant) New supporting files: - scripts/dsynth-hooks/dportsv3-hooks.conf.example (single config file covering both artifact-store overrides and tracker config) - scripts/dsynth-hooks/README.md (install instructions + operational notes) Tracker integration is opt-in via the config file. Without DPORTSV3_TRACKER_URL set, every tracker_* high-level helper short-circuits and the hooks only do artifact-store work — preserving the previous behaviour for operators who don't run the tracker. Retired: - scripts/builderhooks/* (README, conf template, 9 hook stubs, and tracker_common.sh). The two unrelated poudriere-era scripts (bulk_started.sh, pkgbuild.sh) that lived in the same dir are kept in place pending a separate decision about whether to retire them too (they write per-port STATUS files — the thing tracker is meant to replace). Verified locally: - All hooks pass `sh -n` syntax check. - hook_pkg_success with tracker disabled (no config): exits 0, silent, no log file. - hook_pkg_success with tracker config but unreachable: exits 0 (soft-fail), error logged to dportsv3-hooks.log. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Detour before step 4 to close the "env's DeltaPorts is stale" gap. Two pieces: 1. runtime.py: bind-mount the host's repo mirror cache (config.repos_dir) into the chroot read-only at the same path. The env's git origins were recorded at clone time as host paths like /root/.cache/dports-dev/repos/deltaports.git — without this mount, `git pull` from inside the env fails because that path doesn't exist in the chroot's filesystem view. With the mount, the path resolves and standard git operations work from inside the env shell. 2. New `dportsv3 dev-env update NAME [--force]` subcommand. Two phases: - Phase 1: refresh the bare mirrors under config.repos_dir from the host's working tree (reuses RepoCache.refresh_all — same logic the builder runs at env create time). - Phase 2: for each env-side repo (work/DeltaPorts, work/freebsd- ports), run host-side `git fetch --prune origin` + `git pull --ff-only origin <current-branch>`. Errors when the working tree is dirty unless --force; errors when the branch can't fast-forward (divergent history) with a clear message. Logs before/after short SHAs per repo so the operator sees what moved. DPorts is intentionally excluded — it's compose-generated, not a git checkout. No --branch flag: switching branches in the env is now a normal `git -C /work/DeltaPorts checkout <other>` inside the env shell (works thanks to the bind mount). Also extends `dportsv3 dev-env status NAME` JSON output with per-repo `{branch, commit, dirty}` for DeltaPorts and freebsd-ports. Lets operators see what the env is tracking without entering it. Verification: argparse smoke + module parse + import. End-to-end verification needs a dfly env (next). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The tracker module now opens the same SQLite file artifact-store does (state.db). Two writers under SQLite WAL: one writer at a time at the SQLite layer, readers parallel. Pragmas (WAL, busy_timeout=5000, foreign_keys=ON) applied per-connection so the tracker server's per- request connections (commit a14fe9c) inherit them. tracker/db.py: - Drop the duplicate SCHEMA + DEFAULT_BUILD_TYPES + MIGRATIONS that lived inline. init_db now delegates to dportsv3.db.schema.init_db so the schema is single-sourced (matches what artifact-store writes). Re-exports DEFAULT_BUILD_TYPES for any consumer that imported it from this module. - open_db now also sets PRAGMA busy_timeout=5000 (was missing) to match artifact-store and survive concurrent-writer contention. CLI default --db path resolution (was hardcoded "tracker.db"): 1. --db PATH (operator override) 2. DPORTSV3_STATE_DB env var 3. $PWD/state.db (fall-back) Documented in cli.py help text. Operator is responsible for matching artifact-store's logs-root (e.g. /build/synth/logs/evidence/state.db when artifact-store runs with --logs-root /build/synth/logs). Tests: - All tracker test fixtures (test_tracker_api, test_tracker_queue, test_tracker_integration) switched from tmp_path / "tracker.db" to tmp_path / "state.db". Schema is identical via the shared module so test bodies don't change. - New test_state_db_concurrency.py: two threads hammering the same state.db (one as artifact-store, one as tracker) with ~60 writes each. Confirms no "database is locked" errors under WAL + busy_timeout, both sides' rows land, runtime well under the 15s bound. Plus a small FK-enforcement guard test. What did not change: - All high-level tracker functions (create_build_run, record_results, get_target_summary, …) — same API, same callers. - Tracker server (server.py) — per-request connections from a14fe9c stay. - Tracker CLI commands (start-build, record-result, …) — they hit the tracker server over HTTP; no awareness of the DB path. tracker.db file: - No code opens it anymore; safe to delete whenever convenient. - ~38k rows of test data (per the consolidation plan) — abandoned, not migrated. Static checks done locally: parse, import, schema delegation, all 15 expected tables present after tracker_init_db, pragmas applied. The pytest run requires the generator venv (dev deps); to verify on dfly: cd scripts/generator && .venv/bin/python -m pytest tests/ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

schema.init_db(conn) mutates in place and returns None; the test wrapped the connect() call inside it and then tried to .close() the result. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds nullable target columns on bundles/jobs/runs (with indexes) via idempotent ALTER TABLE migrations. Hook + artifact-store + state-server propagate target on every write so step 8 can retire state-server without losing the target dimension. Tracker absorbs the agentic read API as eleven /api/* endpoints (runs, jobs, bundles, ports, activity, runner-status, agentic-status, artifacts, SSE events) all accepting a ?target= filter where applicable. Plan doc rewritten as Phase 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Server-rendered Jinja views for the agentic side: bundle list/detail, job list/detail, run detail, runner status, activity log. Each list view exposes a target selector populated from distinct_targets across bundles/jobs/runs. Bundle detail links artifact paths to the existing /api/bundles/<id>/artifacts/<relpath> streamer. Adds an "Agentic" nav entry next to Targets / Builds / Diff. Same Bootstrap layout as the build dashboard — no SPA, no JS framework. The dsynth-progress aesthetic redesign is deferred to Phase 5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Deletes scripts/state-server (1373 LOC) and scripts/state-server-ui (2381 LOC across .css/.js/.html). The tracker absorbed the read API in step 5 and the HTML views in step 6, so the legacy server has nothing left to do. agent-queue-runner: STATE_SERVER_URL retired; bundle/artifact lookups now go through DPORTSV3_TRACKER_URL against /api/bundles and /api/ports. No backward-compat fallback — hard cutover. Stale "until step 8 retires it" comments scrubbed. Phase 4 complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Plumbing only, no existing page reskinned. Copies progress.{css,js} and the dsynth/favicon PNGs into dportsv3/tracker/static/, lifts the index.html as a Jinja template at templates/progress.html (with a <base> tag pinning relative URLs to the canonical path), and adds a progress_adapter that maps build_runs + build_results into the {summary.json, NN_history.json} shape progress.js consumes. Three new routes mounted under /target/{target}/progress/, leaving the existing dashboard untouched. Result vocabulary mapped success→built, failure→failed, skipped→skipped, ignored→ignored; meta is left at 0 (no tracker analog). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

tuxillo added 30 commits May 15, 2026 00:59

docs: add UI option for the agentic proof of concept

68404a0

docs: add implementation plan

3631bdf

scripts: complete phase 1

6b6545d

scripts: complete phase 2

d170e43

scripts: complete phase 3

6c01236

scripts: complete phase 4

7ba5b9e

scripts: complete phase 5

54a3c4a

scripts: add Bundles to UI navbar

d4bb5a4

- Add /bundles API endpoint listing recent bundles - Add #/bundles route with renderBundles() view - Add Bundles nav item to navbar - Update Phase 9 docs with completion status and new route

scripts: complete phase 7

c3ed911

config: add whitespace preservation rule to dports-patch agent

cbaba2d

Makefiles use tabs, not spaces. The agent was generating patches with spaces which caused patch application failures. Added rule #8 to emphasize preserving exact whitespace from the bundle context.

scripts/apply-patch: delete existing branch before creating

d25c941

When retrying a patch application, the branch may already exist from a previous failed attempt. Delete it first to allow the retry.

scripts/agent-queue-runner: improve diff extraction from agent response

daebb86

Stop extraction when hitting common section markers like 'Rationale', 'Files Modified', etc. Also detect when prose text starts after hunks. This prevents non-diff content from being included in patch.diff.

config: add explicit diff format requirements to dports-patch agent

cb678f7

The agent was generating patches with incorrect hunk line counts. Added detailed instructions on unified diff format with example.

fix(runner): make apply-patch flags configurable

f8c2e64

fix(runner): write apply context at runs root

880f2a9

fix(hooks): honor apply context without ai-fix branch

94faa46

fix(hooks): use apply context even without DeltaPorts dir

fc8179a

fix(runner): propagate iteration to patch jobs

4d82796

feat(observability): add user context gating and retriage

55335c8

fix(patch): enforce unified diff validation before apply

2eba718

feat(blobstore): add artifact-store daemon and db-only bundles

387b212

fix(hooks): snapshot DeltaPorts port files for patch base

00ec2aa

tuxillo and others added 30 commits May 18, 2026 21:49

wip

e8d7e4e

test(state-db): fix init_state_db call sequence in concurrency test

e642e2a

schema.init_db(conn) mutates in place and returns None; the test wrapped the connect() call inside it and then tried to .close() the result. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agentic builds: dsynth evidence capture hooks#1517

Agentic builds: dsynth evidence capture hooks#1517
tuxillo wants to merge 114 commits into
masterfrom
agentic-dsynth-evidence-hooks

tuxillo commented Jan 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tuxillo commented Jan 10, 2026

Goal

What this PR adds (foundation)

What this PR does not do (yet)

How to try it

Why this matters for automated fixing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant