Agentic builds: dsynth evidence capture hooks#1517
Open
tuxillo wants to merge 114 commits into
Open
Conversation
Add dsynth hook scripts that snapshot distilled build errors and relevant port metadata on failures, grouped by run, so debugging can stay build-driven without keeping full workdirs. Document the bounded evidence contract and the planned opencode integration/central queue model for asynchronous triage.
Add observe-only state server for remote UI integration: - REST API for runs, jobs, bundles, ports, artifacts - SSE event stream with replay support - SQLite persistence for full history - Filesystem reconciler for live updates Validated on DragonFlyBSD VM - all endpoints tested.
- Add vanilla JS Bootstrap 5 UI served by state-server - Live SSE event stream with replay/reconnect - Views: Overview, Events, Jobs, Runs, Ports, Bundles - Artifact viewer for markdown, diffs, logs - SSE improvements: after_id, tail query params, ts in payloads
- Add /bundles API endpoint listing recent bundles - Add #/bundles route with renderBundles() view - Add Bundles nav item to navbar - Update Phase 9 docs with completion status and new route
- agent-queue-runner: add apply job type and iteration tracking - apply-patch: add DragonFly local mode, --no-push flag, BSD-compatible patch - hook_common.sh: detect rebuild iterations, track previous bundles - Add KEDB entry for DragonFly source patch conventions
Makefiles use tabs, not spaces. The agent was generating patches with spaces which caused patch application failures. Added rule #8 to emphasize preserving exact whitespace from the bundle context.
When retrying a patch application, the branch may already exist from a previous failed attempt. Delete it first to allow the retry.
Stop extraction when hitting common section markers like 'Rationale', 'Files Modified', etc. Also detect when prose text starts after hunks. This prevents non-diff content from being included in patch.diff.
The agent was generating patches with incorrect hunk line counts. Added detailed instructions on unified diff format with example.
- Change dports-patch prompt to request complete file contents - Add extract_files_from_response() to parse FILE content blocks - Add generate_unified_diff() to create diffs programmatically - Add generate_combined_diff() for multi-file patches - Update write_patch_outputs() to try new format first, fallback to legacy This fixes the malformed diff issue - LLMs are good at generating file content but struggle with unified diff syntax and line counts.
The agent was outputting diff syntax inside FILE blocks for Makefile.DragonFly. Make it explicit that Makefile.DragonFly should be raw makefile content, while dragonfly/patch-* files are actual diffs. Also add specific hint for the IFM_IEEE80211_VHT5G error.
…er UI - Add activity_log and runner_status tables to state-server schema - Add /activity and /runner-status API endpoints with SSE events - Update agent-queue-runner to log activities at all job stages - Add heartbeat thread for runner liveness detection (5s interval) - UI: Add Activity Log panel showing last 10 runner activities - UI: Add Runner Status indicator with staleness detection (>15s) - UI: Add back button for artifact navigation in bundle view - UI: Hide session_id.txt files from artifact lists
…b error display - state-server: Only emit runner_status SSE events when status/job_id/stage changes, not on every heartbeat update_at change - app.js: Don't trigger full re-render for runner_status/activity events (fixes bundle tab reset issue), only re-render on overview page - app.js: Add renderJobDetail() with prominent error display and related activity log entries for failed jobs - agent-queue-runner: Write .job.error files before moving failed jobs, move error files along with job files
llm.py's tokenizers stub only fired when llm was imported — but the runner / tools modules / manual inspections can hit litellm without going through llm first. Moving the stub to dportsv3/agent/__init__.py makes it run as soon as any module under the package is imported. This unblocks invocations like: python -c "import dportsv3.agent; import litellm; ..." without needing to pre-import llm. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ting When litellm's model-name → provider heuristic mis-routes (e.g., any model name containing 'deepseek' or 'claude' is shunted to the native provider client even when openai/ prefix and api_base are set), custom_llm_provider forces a specific code path. Generic passthrough; default None means "let litellm pick from prefix as before." Set per flow: - agent-queue-runner: DP_HARNESS_TRIAGE_PROVIDER env var (DP_HARNESS_PATCH_PROVIDER will follow in step 4 when patch wires) - llm.complete(), tool_loop.run(), triage.run(): custom_llm_provider kwarg - _manual_test_tool_loop: DP_TEST_PROVIDER env var Native providers (anthropic/, deepseek/, nvidia_nim/, ...) work unchanged because they don't set custom_llm_provider. The override is only used when needed (most often: openai-compat third-party endpoints with model names that fool the heuristic). Also commits the manual test helper for tool_loop that was previously left untracked. Useful while step 4 is in flight. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Thinking-mode providers (DeepSeek v4-pro/v4-flash directly or via opencode.ai/zen, OpenAI o-series via some relays) emit a reasoning_content field alongside content + tool_calls, holding the model's intermediate chain-of-thought. The upstream API requires this field to be passed back on the next request, or the multi-turn call fails with HTTP 400: "The reasoning_content in the thinking mode must be passed back to the API." Changes: - llm.Response gains optional reasoning_content field; llm.complete extracts it from msg.reasoning_content if present (None otherwise). - tool_loop._assistant_message_from includes reasoning_content in the reconstructed assistant message when set, so the next LLM request preserves continuity. No-op for non-thinking models — reasoning_content stays None, nothing extra is sent. Verified with stubbed Response objects: thinking-mode reconstructed message carries reasoning_content; non-thinking does not. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously every get_file result was base64. For UTF-8 text files
(Makefiles, patches, source, the bulk of what the agent reads), this
inflated content by ~33% AND made the model mentally decode base64
to find anything inside — burning prompt AND completion tokens.
Now: read bytes, try UTF-8 decode with a NUL-byte sanity check;
return {encoding: 'text', content: <str>} on success, fall back to
{encoding: 'base64', content: <b64>} for binary. sha256 is computed
over the raw bytes, so put_file's expected_sha256 round-trip works
regardless of encoding.
Verified with a temp-fs harness: text Makefile returns text;
PNG-header file returns base64.
Schema description updated so the LLM understands the dual-mode
return shape. Example path in description updated to /work/DPorts/...
(the common path; agent reads materialized port files from DPorts,
edits source-of-truth in DeltaPorts).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The patch agent now runs end-to-end through the harness instead of opencode.
New code:
- prompts.PATCH_SYSTEM: 4kB system prompt spelling out the dev-env's
three-tree layout (freebsd-ports / DeltaPorts / DPorts), tool
vocabulary, the repair loop, discipline rules (no commits/push/PRs),
and the mandatory output format ending in the new rebuild_proof.json
schema (origin, rebuild_ok, dsynth_profile, build_command,
timestamp_utc — no branch/head/fports fields).
- attempt_loop.run: budget-bounded retry around tool_loop. Each
attempt is a fresh [system, user] conversation (with a small failure-
context user turn appended on retries) so tool-call traces don't
compound across attempts. Stops on rebuild_ok=true, budget exhaustion,
or max_iterations. Returns PatchResult{status, final_text, usage,
attempts[], proof}.
- patch.run: thin wrapper over attempt_loop.run.
Runner wiring (mirrors step 1 triage adapter):
- New env vars: DP_HARNESS_PATCH_{MODEL,API_BASE,API_KEY,PROVIDER,
TIMEOUT}, DP_HARNESS_ENV (dev-env name default), DP_HARNESS_POLICY
(optional override of config/agentic-policy.json path).
- process_patch_job: when DP_HARNESS_PATCH_MODEL is set, route to
_process_patch_job_harness. It reads triage.md, resolves the tier
via policy.tier_for(classification, confidence), and calls
dportsv3.agent.patch.run with the tier's budget.
- Bundle outputs: analysis/patch.md (final LLM text), analysis/
rebuild_proof.json (parsed proof block), analysis/patch_audit.json
(status + tokens + per-attempt info + model), analysis/changes.diff
(host-side git diff vs HEAD in the env's DeltaPorts overlay).
Verified attempt_loop against a stubbed tool_loop:
- success on first attempt
- failure then success (failure-context message added to retry)
- budget exhausted mid-sequence
- needs-help after all attempts fail
- missing rebuild_proof JSON falls back to needs-help
End-to-end against a real LLM + env requires a manual smoke run with
DP_HARNESS_PATCH_MODEL + a bundle on disk; covered in the next message.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_manual_test_patch_flow.py fixtures a minimal bundle under /tmp (meta.txt, errors.txt, analysis/triage.md) and invokes dportsv3.agent.patch.run directly with a fabricated payload — bypassing the queue runner so the harness's loop is exercised in isolation against a real LLM + real dev-env. The fixture intentionally doesn't simulate a broken port; it asks the agent to verify the current state of the port via dsynth_build and emit rebuild_proof.json accordingly. Pointing at devel/readline (default) should reach rebuild_ok=true within 1-2 attempts. Env vars mirror _manual_test_tool_loop (DP_TEST_MODEL, ENV, ORIGIN, TIER_ITERATIONS, TIER_TOKENS, plus PROVIDER/API_BASE/API_KEY). The bundle dir is preserved on exit so you can inspect the artifacts the runner-side adapter would have written: patch.md, patch_audit.json, rebuild_proof.json, changes.diff (note: those are written by agent-queue-runner's _process_patch_job_harness, NOT by this fixture — this fixture only calls patch.run and reports the PatchResult). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dsynth's 'build' subcommand asks interactive questions (most commonly "Rebuild local repository? [Y/n]" before scanning, sometimes follow- ups during the build). The agent has no tty, so the subprocess sat in [ttyin] state and the patch flow hung — observed mid-test: load: 0.67 cmd: dsynth 31619 [ttyin] 0.00u 0.06s 0% 4128k Fixes: - worker._exec accepts optional input_text kwarg; default stdin is empty string (effectively /dev/null) so unexpected prompts fail fast rather than blocking. - worker.dsynth_build pipes 'y\\n' * 50 to stdin to clear dsynth's prompts. Generous enough for multi-question build cycles, cheap to send. dbuild (the dev-env helper) is unchanged — humans running it interactively still get the prompts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_turns default Observed: a single attempt burned 2,073,090 tokens before attempt_loop's between-attempts budget check caught it. Root cause: tool_loop only enforced max_turns (30), not the token budget. The model went into a tool-call frenzy and attempt_loop only noticed after 30 turns of accumulating 70k-token contexts. Fixes: - tool_loop.run: new max_tokens kwarg; checked at the top of each turn before issuing the LLM call. When the running total reaches the cap, return whatever Response we have. Default 0 = no cap (callers should pass remaining budget). - attempt_loop.run: passes tier's remaining budget (max_tokens - tokens_used_so_far) as max_tokens to tool_loop on each attempt. Also short-circuits with status=budget-exhausted before kicking off a new attempt if the budget is already gone. - tool_loop max_turns default: 20 -> 12. A patch task taking more than ~12 tool calls per attempt is in trouble; the cap should stop it sooner. - attempt_loop max_tool_turns default: 30 -> 12. Verified with stubbed LLM: tool_loop stops at 1200 tokens when max_tokens=1200 (turn 3 was the first check after total>=cap). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the patch fixture run produces a surprising token count, we need to see what the model actually did — final_text alone tells us nothing if the loop ended on a tool call. _install_session_dump wraps llm.complete and tools.dispatch to write each turn as a JSON line to <bundle>/session.jsonl: - llm_call records: messages_preview (with long strings truncated to 800 chars), response.text (1200 chars), tool_calls, reasoning_content (600 chars), usage. - tool_dispatch records: tool name, arguments, ok flag, stdout/stderr tails truncated to 600 chars. Excludes result body (file bytes, full schemas) to keep the trace compact and shareable. After a run, share session.jsonl and the per-turn behavior is visible without re-running. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The first end-to-end run of the patch flow burned 40k tokens
exploring devel/readline without ever finding the actual build
error. Root cause was not a single bug — the agent's information
feed had multiple compounding problems. Eight fixes addressing
each, ranked by impact:
1. dsynth_build noise → use 'dsynth -S -y' directly.
Skip the dbuild helper (which keeps ncurses for humans) and
invoke dsynth with -S (disable ncurses TUI) and -y (assume-yes).
Previously the agent received ~2kB of curses escape codes as
stdout. -y also retires the 'y\n'*50 stdin hack.
2. grep used rg, which isn't packaged for DragonFly. Switch to
POSIX 'grep -rn'. grep rc=1 (no matches) → ok=True with
match_count=0; rc>=2 → ok=False. Prior behavior surfaced "no
matches" as ok=False and the model concluded "rg is not
available" (wrong inference but understandable).
3. dev-env exec INFO mount-prep noise on every chroot tool call.
New '--quiet' flag on 'dportsv3 dev-env exec' and matching
DPORTS_DEV_ENV_QUIET env var. worker._exec always passes
--quiet so the harness's contexts stop accumulating 8 lines of
"INFO: mount already present at ..." per call.
4. Surface dsynth's per-port build log. dsynth writes the actual
build error to /work/dsynth/logs/<origin-with-underscores>.log
(Directory_logs from dsynth.ini). Two changes:
- dsynth_build result now carries 'log_hint' pointing at this
path.
- New 'dsynth_log(origin, tail_lines=200)' tool reads the tail.
PATCH_SYSTEM updated to direct the agent: on build failure,
call dsynth_log immediately — don't grep DPorts for *.log
files (they don't exist there).
5. Add 'list_dir(path)' tool. Previously the agent tried
get_file on directories and got opaque failures. list_dir
returns entries with name/kind/size, capped at max_entries.
6. Tool schemas trimmed. Each schema's description now one
focused sentence (was 2-4 sentences with examples). Total
schema chars ~6.5kB → ~4kB. The example paths and prose
moved to PATCH_SYSTEM, which is sent once per attempt-start
instead of every turn.
7. Sliding-window reasoning_content. Thinking-mode providers
require the most recent assistant turn's reasoning_content to
be echoed back; older turns' reasoning is dead weight in the
prompt. tool_loop._strip_old_reasoning drops it from all but
the most recent assistant message after each turn.
8. give-up directive in PATCH_SYSTEM (no new tool — prompt-only).
Explicit: "if you've tried two distinct approaches and both
failed at the same point, stop and emit Rebuild Status:
gave-up". Also: "if dsynth_build returned rebuild_ok=true, stop
immediately — don't keep exploring."
Plus: get_file failure envelopes now differentiate
'missing' / 'is_directory' / 'not_a_regular_file' via a 'kind'
field, so the agent can react usefully.
Verified with unit tests:
- reasoning_content sliding window keeps LATEST, strips older
- list_dir returns entries with kind+size
- get_file on a directory returns kind=is_directory with a
pointer to list_dir/grep
- grep on a non-existent pattern returns ok=True match_count=0
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deepseek's thinking-mode API requires reasoning_content on EVERY prior assistant turn, not just the most recent. Empirical proof: 3 turns in, after the trim removed turn 1's reasoning_content, the API rejected with HTTP 400: The reasoning_content in the thinking mode must be passed back to the API. So the trim violates the protocol, not just leaves tokens on the table. Reverting that change. The other 7 fixes in 9e35959 stand. Token cost of preserving all reasoning_content is the price of using a thinking-mode model — accept it or switch providers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous run on devel/readline got 9 turns in and was making real progress (diagnosed the actual C compile error, found a version-skew bug in DeltaPorts' overlay patch, was about to read the source to fix it) when 40k tokens ran out. Bump fixture defaults to match the ASSIST tier in agentic-policy.json: - DP_TEST_TIER_ITERATIONS: 2 -> 4 - DP_TEST_TIER_TOKENS: 40000 -> 120000 Real bundles classified as ASSIST will get these caps from the policy file; the fixture should mirror that so test results are representative of the production budget. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Triage now resolves a tier from (classification, confidence) via
policy.tier_for and uses the tier name to decide whether to auto-
enqueue a patch job. The tier name propagates into the patch .job
file so the patch worker uses the same budget without re-resolving.
agent-queue-runner._process_triage_job_harness:
- Loads config/agentic-policy.json (or DP_HARNESS_POLICY path).
- Resolves tier from result.classification + result.confidence.
- tier=MANUAL: skips patch auto-enqueue, upserts a user_context
request so the UI flags it for operator attention, returns
status="manual_tier".
- tier=AUTO|ASSIST: enqueues the patch job carrying tier= and
dev_env= fields.
enqueue_patch_job: optional tier_name + dev_env kwargs propagate
into the new .job file.
_process_patch_job_harness: prefer job.get('tier') (set by triage
when running through the harness path) over re-parsing triage.md.
Re-parse is the fallback for hand-fired patch jobs.
Bug fix found by the new tier matrix test: policy.tier_for only
downgraded once. plist-error + low confidence ended at ASSIST even
though ASSIST's medium floor wasn't met. Fix: cascade downgrades
in a while loop until either confidence meets the current tier's
floor or MANUAL is reached. Verified against all combinations:
plist-error/high -> AUTO
plist-error/medium -> ASSIST (AUTO floor=high not met -> downgrade)
plist-error/low -> MANUAL (cascades both downgrades)
compile-error/high -> ASSIST
compile-error/low -> MANUAL (ASSIST floor=medium not met)
runtime-error/any -> MANUAL (mapped directly)
unknown/any -> MANUAL (default)
needs_user_context and should_enqueue_patch are no longer used by
the harness triage path; they remain only for the (also legacy)
opencode path. They'll go away in step 6 with the rest of the
opencode sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors _manual_test_patch_flow but for the triage side of the loop. Fixtures a bundle with a synthetic-but-realistic error log, runs dportsv3.agent.triage.run against a real LLM, then asks policy.tier_for what the runner would do with the result. Three built-in fixtures: - compile-error — readline-shape 'lvalue required' compile error - plist-error — pkg-plist mismatch - unknown — opaque generic failure Reports: classification, confidence, resolved tier, the patch-budget the tier would grant, and whether the runner would auto-enqueue. Dumps per-turn LLM trace to <bundle>/session.jsonl for inspection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Final cleanup pass. With the harness fully validated end-to-end (real
LLM + real env + real port fix on devel/readline; clean build on
archivers/liblz4), the legacy opencode-driven code paths and the
workspace concept have no callers and no purpose.
Deleted:
- config/opencode/ entirely: dports-triage.md (40), dports-patch.md
(62), tool/dports.ts (257) — TS plugin and agent markdown moved
to dportsv3.agent.{prompts, tools} in earlier steps.
- agent-queue-runner constants: PATCHABLE_CLASSIFICATIONS,
PATCHABLE_CONFIDENCE (replaced by policy.tier_for in step 5),
DEFAULT_VM_SSH_KEY/PORT/HOST, DEFAULT_WORKSPACE_CONFIG,
DEFAULT_MAX_SNIPPET_ROUNDS.
- agent-queue-runner functions:
- should_enqueue_patch, needs_user_context (replaced by policy
tier dispatch)
- parse_snippet_requests (only called by the now-dead snippet
re-enqueue path; snippet rounds fold into harness triage)
- get_vm_ssh_command, run_snippet_extractor (SSH-to-VM
workaround for Linux dev hosts; harness runs natively on dfly)
- enqueue_followup_job, check_and_handle_snippet_requests
(legacy snippet escalation; harness triage handles in-process)
- call_opencode, extract_response_text (HTTP plumbing for
opencode serve)
- extract_section, extract_json_block (only used by the legacy
write_*_outputs; the harness has its own _PROOF_BLOCK_RE in
attempt_loop.py)
- write_triage_outputs, write_patch_outputs (legacy bundle
writers; harness has _write_triage_audit_harness +
_write_patch_audit_harness)
- load_workspace_config (workspace.json reader; workspace
concept retired)
- The workspace-config embedding section of build_triage_payload
- Legacy bodies of process_triage_job and process_patch_job: both
shrink to thin wrappers that require DP_HARNESS_*_MODEL and call
the corresponding _process_*_job_harness adapter. No more
feature-flag-gated dual path.
- process_job: drops opencode_url, opencode_provider, opencode_model,
timeout, max_retries, retry_delay parameters from the signature
and call sites. Snippet round display removed (in-process now).
- main(): drops all OPENCODE_* env reads; startup log now reports
DP_HARNESS_TRIAGE_MODEL + DP_HARNESS_PATCH_MODEL instead.
- Docstring header at the top of the file rewritten to document
the harness env vars and job-file conventions.
Net effect:
- scripts/agent-queue-runner: ~2300 LOC -> 1685 LOC
- config/opencode/ gone (359 LOC)
- Total: ~1000 LOC retired
Negative checks pass:
- 'opencode|OPENCODE_|VM_SSH|workspace\\.json|agentic-workspace|
PATCHABLE_|should_enqueue_patch|call_opencode|extract_response_text|
check_and_handle_snippet_requests|load_workspace_config' in
scripts/agent-queue-runner: 0 hits
- The remaining "opencode" mentions in dportsv3.agent/llm.py +
prompts.py + _manual_test_tool_loop.py are documentation strings
about opencode.ai/zen (a third-party OpenAI-compat relay) — they
describe what works with the harness, not legacy code.
- All harness modules import cleanly; 13 tools registered.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 3 (opencode → litellm harness, dev-env-native, no PR/branch/SSH) is complete in the branch. Updates: - docs/agentic-consolidation-plan.md: new "Status: shipped" callout at the top with a brief summary of what landed and pointers to the commit range (985889d ... 6f6db28). - docs/AGENTIC_BUILDS.md: warning banner that the doc describes the pre-Phase-3 architecture (opencode, OPENCODE_* env vars, VM_SSH, /build/synth/agentic-workspace, workspace.json, agentic-worker, process_pr_job — all retired). Sections below the banner are kept as historical context until the doc is rewritten. - docs/TESTING_E2E.md: same banner, plus pointers to the three manual test fixtures in scripts/generator/dportsv3/agent/_manual_test_*.py that exercise the new harness against a real dev-env. A full rewrite of AGENTIC_BUILDS.md and TESTING_E2E.md is queued as follow-up work — not in scope here. Banners prevent operators from following the stale instructions in the meantime. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fold the tracker schema into state.db so that when tracker becomes a read-only consumer in step 4 the tables already exist. No write behavior changes in this step — schema only, additive. Lifted verbatim from scripts/generator/dportsv3/tracker/db.py: - build_types (with seed rows 'test', 'release') - build_runs - build_results (incl. status default 'recorded' via idempotent ALTER) - port_status - 5 supporting indexes + uq_build_runs_active unique partial index - Idempotent ALTER migrations (build_results.status, build_runs.total_expected) Plus, per the consolidation plan's "weak link" model: add nullable runs.build_run_id (idempotent ALTER) so a dsynth invocation can later be associated with a campaign campaign via DPORTSV3_BUILD_RUN_ID (wired in step 3). Enable PRAGMA foreign_keys=ON on the artifact-store connection so tracker's FK constraints (build_results -> build_runs, port_status -> build_runs) are enforced going forward. None of artifact-store's existing tables have FKs, so the change only affects writes to the new tracker tables. Verified locally: - 4 tables + 6 indexes present after init - 2 seed rows in build_types - FK enforcement works (INSERT with bad build_run_id raises IntegrityError; valid insert succeeds) - Re-init is idempotent (no errors on second _init_db call) tracker.db remains authoritative until step 4 flips tracker to read state.db; both DBs are valid in parallel during the transition. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 530-line standalone script becomes a 20-line shim; logic moves into dportsv3.artifact_store (importable, testable). Schema moves into dportsv3.db.schema (shared with tracker once it switches to state.db in step 4). New: - dportsv3/db/__init__.py — package marker - dportsv3/db/schema.py — SCHEMA, DEFAULT_BUILD_TYPES, MIGRATIONS, init_db(conn) helper. Idempotent on re-init. - dportsv3/artifact_store.py — ArtifactStore, Handler, ArtifactStoreServer, main(). Imports init_db from db.schema. Changed: - scripts/artifact-store — shrinks to a sys.path bootstrap + `from dportsv3.artifact_store import main; main()`. Same invocation, same behaviour. Executable bit preserved. - scripts/generator/pyproject.toml — adds `artifact-store = "dportsv3.artifact_store:main"` console script, so the generator venv's bin/ gets an `artifact-store` entry too. Invocation matrix now: - ./scripts/artifact-store --logs-root /path (production; no venv) - python -m dportsv3.artifact_store --logs-root /path (in-venv) - $VENV/bin/artifact-store --logs-root /path (console script) All identical in behaviour. Single source of truth for the state.db schema removes the duplication step 1 introduced (both artifact-store and tracker/db.py held the same 4 CREATE TABLEs). Verified: - All 15 tables created, 6 new indexes, FK enforced, runs.build_run_id column present, build_types seeded. - ArtifactStore.upsert_run_bundle / put_blob / get_artifact round-trip works against a temp dir. - Re-init is idempotent. - Both ./scripts/artifact-store --help and `python -m dportsv3.artifact_store --help` print the same help text. tracker/db.py is unchanged — it still uses its own schema for tracker.db. Step 4 will switch it to import from dportsv3.db.schema when it reads state.db. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Port state-server's POST /user-context to artifact-store. Same body
shape, same semantics, same emit_event. State-server still serves
its legacy /user-context in parallel until step 8 retires it.
(Note: the Phase 4 plan also mentioned a POST /v1/jobs/enqueue/pr
endpoint as part of step 2. That was prospective — state-server
never actually had it, and Phase 3 deleted process_pr_job + the
type=pr dispatch arm. No PR enqueue path exists today, so nothing
to port. Step 2's real scope is just user-context.)
New on the ArtifactStore class:
- upsert_user_context(run_id, origin, context_text) -> int
Looks up existing context_rev, increments, upserts the row, emits
user_context_updated event, returns the new rev under the lock.
New on the Handler:
- POST /v1/user-context with body {run_id, origin, context_text}.
Validation matches state-server: required fields, non-empty after
strip, <= 8000 chars, valid JSON. Returns
{"ok": true, "context_rev": N} on success.
Verified locally via curl against the running shim:
- First write to (r1, devel/readline): context_rev=1
- Second write same key: context_rev=2 (increment)
- Different origin: starts fresh at context_rev=1
- Missing context_text -> 400
- Empty/whitespace -> 400
- > 8000 chars -> 400
- Malformed JSON -> 400
- state.db rows match: 2 user_context rows, 3 events with the
expected rev/timestamp content.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add dsynth hook scripts that translate run start/end and per-port state changes into `dportsv3 tracker` API calls for the current build profile. The hooks persist one active tracker run id per dsynth profile, enqueue ports before marking them building, record final port outcomes, and fail soft with hook-local logging so tracker outages do not interrupt package builds. Include a shared helper, a config template, support for both `hook_pkg_start` and `hook_pkg_started`, and installation notes for copying the hook set into `/etc/dsynth`. Co-authored-by: OpenAI <noreply@openai.com>
Stop dsynth tracker hooks from reusing stale run ids when a new start-build request fails, and disable tracking for that dsynth run instead of continuing to enqueue and record into the previous build. Also switch the tracker server from one shared SQLite connection to fresh per-request connections so concurrent enqueue and status updates do not corrupt transaction state under hook traffic. Add a regression test covering the request connection lifecycle. Co-authored-by: OpenAI <noreply@openai.com>
…ok set
The two parallel hook sets were a real problem: dsynth has one
Hooks_Directory, so only one executable per event name (hook_pkg_failure
etc.) can live there. Today an operator has to pick artifact-store
evidence OR tracker-side build_results, not both.
This commit folds builderhooks' tracker integration into the existing
scripts/dsynth-hooks/, preserving the good ideas from builderhooks:
- per-profile state file (under evidence_root/.tracker-state by default)
- soft-fail logging (tracker outages don't fail dsynth builds)
- disable-on-collision (if start-build fails because an active run
exists, the new run is marked TRACKING_DISABLED instead of reusing
a stale run id)
- conf-driven, default-on tracker integration
hook_common.sh gains a "tracker integration" section with:
- tracker_log, tracker_fail_soft, tracker_should_skip
- tracker_load_config / load_state / write_state / clear_state /
disable_state
- tracker_pkg_version, tracker_enqueue_one
- tracker_run_start, tracker_run_end, tracker_mark_building,
tracker_record_result
Defaults:
- DPORTSV3_TRACKER_TARGET = @${PROFILE} (one profile per target)
- DPORTSV3_TRACKER_BUILD_TYPE = test
- DPORTSV3_TRACKER_STATE_DIR = ${DIR_LOGS}/evidence/.tracker-state
- DPORTSV3_TRACKER_HOOK_LOG = ${DIR_LOGS}/dportsv3-hooks.log
Includes the mktemp fix from the wip commit (drop .json suffix — BSD
mktemp requires X's at the end of the template).
Hooks wired:
- hook_run_start: existing evidence-root setup + new tracker_run_start
- hook_run_end: existing evidence-pointer cleanup + new tracker_run_end
- hook_pkg_failure: existing full bundle write + enqueue triage job
+ new tracker_record_result fail
- hook_pkg_success / skipped / ignored: replace no-op with
tracker_record_result {pass,skipped,ignored}
- New hook_pkg_start + hook_pkg_started for tracker_mark_building
(both names provided to match either dsynth variant)
New supporting files:
- scripts/dsynth-hooks/dportsv3-hooks.conf.example (single config file
covering both artifact-store overrides and tracker config)
- scripts/dsynth-hooks/README.md (install instructions + operational
notes)
Tracker integration is opt-in via the config file. Without
DPORTSV3_TRACKER_URL set, every tracker_* high-level helper
short-circuits and the hooks only do artifact-store work — preserving
the previous behaviour for operators who don't run the tracker.
Retired:
- scripts/builderhooks/* (README, conf template, 9 hook stubs, and
tracker_common.sh). The two unrelated poudriere-era scripts
(bulk_started.sh, pkgbuild.sh) that lived in the same dir are kept
in place pending a separate decision about whether to retire them
too (they write per-port STATUS files — the thing tracker is meant
to replace).
Verified locally:
- All hooks pass `sh -n` syntax check.
- hook_pkg_success with tracker disabled (no config): exits 0,
silent, no log file.
- hook_pkg_success with tracker config but unreachable: exits 0
(soft-fail), error logged to dportsv3-hooks.log.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Detour before step 4 to close the "env's DeltaPorts is stale" gap.
Two pieces:
1. runtime.py: bind-mount the host's repo mirror cache (config.repos_dir)
into the chroot read-only at the same path. The env's git origins
were recorded at clone time as host paths like
/root/.cache/dports-dev/repos/deltaports.git — without this mount,
`git pull` from inside the env fails because that path doesn't
exist in the chroot's filesystem view. With the mount, the path
resolves and standard git operations work from inside the env shell.
2. New `dportsv3 dev-env update NAME [--force]` subcommand. Two
phases:
- Phase 1: refresh the bare mirrors under config.repos_dir from
the host's working tree (reuses RepoCache.refresh_all — same
logic the builder runs at env create time).
- Phase 2: for each env-side repo (work/DeltaPorts, work/freebsd-
ports), run host-side `git fetch --prune origin` + `git pull
--ff-only origin <current-branch>`. Errors when the working
tree is dirty unless --force; errors when the branch can't
fast-forward (divergent history) with a clear message.
Logs before/after short SHAs per repo so the operator sees what
moved. DPorts is intentionally excluded — it's compose-generated,
not a git checkout.
No --branch flag: switching branches in the env is now a normal
`git -C /work/DeltaPorts checkout <other>` inside the env shell
(works thanks to the bind mount).
Also extends `dportsv3 dev-env status NAME` JSON output with
per-repo `{branch, commit, dirty}` for DeltaPorts and freebsd-ports.
Lets operators see what the env is tracking without entering it.
Verification: argparse smoke + module parse + import. End-to-end
verification needs a dfly env (next).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The tracker module now opens the same SQLite file artifact-store does (state.db). Two writers under SQLite WAL: one writer at a time at the SQLite layer, readers parallel. Pragmas (WAL, busy_timeout=5000, foreign_keys=ON) applied per-connection so the tracker server's per- request connections (commit a14fe9c) inherit them. tracker/db.py: - Drop the duplicate SCHEMA + DEFAULT_BUILD_TYPES + MIGRATIONS that lived inline. init_db now delegates to dportsv3.db.schema.init_db so the schema is single-sourced (matches what artifact-store writes). Re-exports DEFAULT_BUILD_TYPES for any consumer that imported it from this module. - open_db now also sets PRAGMA busy_timeout=5000 (was missing) to match artifact-store and survive concurrent-writer contention. CLI default --db path resolution (was hardcoded "tracker.db"): 1. --db PATH (operator override) 2. DPORTSV3_STATE_DB env var 3. $PWD/state.db (fall-back) Documented in cli.py help text. Operator is responsible for matching artifact-store's logs-root (e.g. /build/synth/logs/evidence/state.db when artifact-store runs with --logs-root /build/synth/logs). Tests: - All tracker test fixtures (test_tracker_api, test_tracker_queue, test_tracker_integration) switched from tmp_path / "tracker.db" to tmp_path / "state.db". Schema is identical via the shared module so test bodies don't change. - New test_state_db_concurrency.py: two threads hammering the same state.db (one as artifact-store, one as tracker) with ~60 writes each. Confirms no "database is locked" errors under WAL + busy_timeout, both sides' rows land, runtime well under the 15s bound. Plus a small FK-enforcement guard test. What did not change: - All high-level tracker functions (create_build_run, record_results, get_target_summary, …) — same API, same callers. - Tracker server (server.py) — per-request connections from a14fe9c stay. - Tracker CLI commands (start-build, record-result, …) — they hit the tracker server over HTTP; no awareness of the DB path. tracker.db file: - No code opens it anymore; safe to delete whenever convenient. - ~38k rows of test data (per the consolidation plan) — abandoned, not migrated. Static checks done locally: parse, import, schema delegation, all 15 expected tables present after tracker_init_db, pragmas applied. The pytest run requires the generator venv (dev deps); to verify on dfly: cd scripts/generator && .venv/bin/python -m pytest tests/ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
schema.init_db(conn) mutates in place and returns None; the test wrapped the connect() call inside it and then tried to .close() the result. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds nullable target columns on bundles/jobs/runs (with indexes) via idempotent ALTER TABLE migrations. Hook + artifact-store + state-server propagate target on every write so step 8 can retire state-server without losing the target dimension. Tracker absorbs the agentic read API as eleven /api/* endpoints (runs, jobs, bundles, ports, activity, runner-status, agentic-status, artifacts, SSE events) all accepting a ?target= filter where applicable. Plan doc rewritten as Phase 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Server-rendered Jinja views for the agentic side: bundle list/detail, job list/detail, run detail, runner status, activity log. Each list view exposes a target selector populated from distinct_targets across bundles/jobs/runs. Bundle detail links artifact paths to the existing /api/bundles/<id>/artifacts/<relpath> streamer. Adds an "Agentic" nav entry next to Targets / Builds / Diff. Same Bootstrap layout as the build dashboard — no SPA, no JS framework. The dsynth-progress aesthetic redesign is deferred to Phase 5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deletes scripts/state-server (1373 LOC) and scripts/state-server-ui (2381 LOC across .css/.js/.html). The tracker absorbed the read API in step 5 and the HTML views in step 6, so the legacy server has nothing left to do. agent-queue-runner: STATE_SERVER_URL retired; bundle/artifact lookups now go through DPORTSV3_TRACKER_URL against /api/bundles and /api/ports. No backward-compat fallback — hard cutover. Stale "until step 8 retires it" comments scrubbed. Phase 4 complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plumbing only, no existing page reskinned. Copies progress.{css,js}
and the dsynth/favicon PNGs into dportsv3/tracker/static/, lifts the
index.html as a Jinja template at templates/progress.html (with a
<base> tag pinning relative URLs to the canonical path), and adds a
progress_adapter that maps build_runs + build_results into the
{summary.json, NN_history.json} shape progress.js consumes.
Three new routes mounted under /target/{target}/progress/, leaving
the existing dashboard untouched. Result vocabulary mapped
success→built, failure→failed, skipped→skipped, ignored→ignored;
meta is left at 0 (no tracker analog).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Goal
We are designing a system to automatically (agent-assisted) fix ports while keeping the existing, build-driven workflow intact:
dsynthstays the authoritative build executor.What this PR adds (foundation)
scripts/dsynth-hooks/:hook_run_start/hook_run_endgroup failures per build run and snapshot dsynth summary lists.hook_pkg_failurecreates a per-failure evidence bundle with:logs/errors.txt(high-signal extract, capped at 200KB)logs/full.log.gz(full log preserved for humans)port/*snapshot (Makefile/distinfo/pkg-plist/patches, etc.)meta.txtand basic dsynth profile/config snapshotsdocs/AGENTIC_BUILDS.mddescribing:What this PR does not do (yet)
Those are intentionally deferred so this PR can land the core evidence-capture mechanism safely and independently.
How to try it
scripts/dsynth-hooks/hook_*andscripts/dsynth-hooks/hook_common.shinto dsynth’s config base (/etc/dsynth/or/usr/local/etc/dsynth/) and making them executable.dsynthnormally.${Directory_logs}/evidence/runs/.../ports/.../for the evidence bundle.Why this matters for automated fixing
Reliable, size-capped evidence capture is the prerequisite for an automated port-fixing system:
errors.txt+ port metadata)dsynth-driven, and automation can be layered on without destabilizing build infrastructure