Skip to content

Agentic builds: dsynth evidence capture hooks#1517

Open
tuxillo wants to merge 114 commits into
masterfrom
agentic-dsynth-evidence-hooks
Open

Agentic builds: dsynth evidence capture hooks#1517
tuxillo wants to merge 114 commits into
masterfrom
agentic-dsynth-evidence-hooks

Conversation

@tuxillo
Copy link
Copy Markdown
Member

@tuxillo tuxillo commented Jan 10, 2026

Goal

We are designing a system to automatically (agent-assisted) fix ports while keeping the existing, build-driven workflow intact:

  • dsynth stays the authoritative build executor.
  • On failure, we capture a bounded evidence bundle (distilled errors + small port context) so automated triage/patch generation can be driven by real build output without dumping huge logs or entire work directories into an AI context.
  • Evidence is intended to flow into an asynchronous agent pipeline (triage → patch → review) via a central queue (documented), so builds never block on AI availability.

What this PR adds (foundation)

  • dsynth hook scripts under scripts/dsynth-hooks/:
    • hook_run_start / hook_run_end group failures per build run and snapshot dsynth summary lists.
    • hook_pkg_failure creates a per-failure evidence bundle with:
      • logs/errors.txt (high-signal extract, capped at 200KB)
      • logs/full.log.gz (full log preserved for humans)
      • port/* snapshot (Makefile/distinfo/pkg-plist/patches, etc.)
      • meta.txt and basic dsynth profile/config snapshots
  • Design/usage documentation in docs/AGENTIC_BUILDS.md describing:
    • the overall automated-fixing workflow (bounded evidence → triage → snippet escalation → patch → rebuild)
    • an opencode integration plan, including a central queue model for asynchronous triage
  • A small README pointer to the hook location.

What this PR does not do (yet)

  • No network calls from hooks.
  • No queue writer/runner implementation.
  • No automated patch application.

Those are intentionally deferred so this PR can land the core evidence-capture mechanism safely and independently.

How to try it

  1. Install hooks by copying/symlinking scripts/dsynth-hooks/hook_* and scripts/dsynth-hooks/hook_common.sh into dsynth’s config base (/etc/dsynth/ or /usr/local/etc/dsynth/) and making them executable.
  2. Run dsynth normally.
  3. On a port failure, inspect ${Directory_logs}/evidence/runs/.../ports/.../ for the evidence bundle.

Why this matters for automated fixing

Reliable, size-capped evidence capture is the prerequisite for an automated port-fixing system:

  • the triage agent needs consistent inputs (errors.txt + port metadata)
  • the patch agent can generate DeltaPorts-style diffs based on evidence, not guesses
  • the rebuild loop stays dsynth-driven, and automation can be layered on without destabilizing build infrastructure

tuxillo added 30 commits May 15, 2026 00:59
Add dsynth hook scripts that snapshot distilled build errors and relevant port metadata on failures, grouped by run, so debugging can stay build-driven without keeping full workdirs.

Document the bounded evidence contract and the planned opencode integration/central queue model for asynchronous triage.
Add observe-only state server for remote UI integration:
- REST API for runs, jobs, bundles, ports, artifacts
- SSE event stream with replay support
- SQLite persistence for full history
- Filesystem reconciler for live updates

Validated on DragonFlyBSD VM - all endpoints tested.
- Add vanilla JS Bootstrap 5 UI served by state-server
- Live SSE event stream with replay/reconnect
- Views: Overview, Events, Jobs, Runs, Ports, Bundles
- Artifact viewer for markdown, diffs, logs
- SSE improvements: after_id, tail query params, ts in payloads
- Add /bundles API endpoint listing recent bundles
- Add #/bundles route with renderBundles() view
- Add Bundles nav item to navbar
- Update Phase 9 docs with completion status and new route
- agent-queue-runner: add apply job type and iteration tracking
- apply-patch: add DragonFly local mode, --no-push flag, BSD-compatible patch
- hook_common.sh: detect rebuild iterations, track previous bundles
- Add KEDB entry for DragonFly source patch conventions
Makefiles use tabs, not spaces. The agent was generating patches with
spaces which caused patch application failures. Added rule #8 to
emphasize preserving exact whitespace from the bundle context.
When retrying a patch application, the branch may already exist from
a previous failed attempt. Delete it first to allow the retry.
Stop extraction when hitting common section markers like 'Rationale',
'Files Modified', etc. Also detect when prose text starts after hunks.
This prevents non-diff content from being included in patch.diff.
The agent was generating patches with incorrect hunk line counts.
Added detailed instructions on unified diff format with example.
- Change dports-patch prompt to request complete file contents
- Add extract_files_from_response() to parse FILE content blocks
- Add generate_unified_diff() to create diffs programmatically
- Add generate_combined_diff() for multi-file patches
- Update write_patch_outputs() to try new format first, fallback to legacy

This fixes the malformed diff issue - LLMs are good at generating
file content but struggle with unified diff syntax and line counts.
The agent was outputting diff syntax inside FILE blocks for Makefile.DragonFly.
Make it explicit that Makefile.DragonFly should be raw makefile content,
while dragonfly/patch-* files are actual diffs.

Also add specific hint for the IFM_IEEE80211_VHT5G error.
…er UI

- Add activity_log and runner_status tables to state-server schema
- Add /activity and /runner-status API endpoints with SSE events
- Update agent-queue-runner to log activities at all job stages
- Add heartbeat thread for runner liveness detection (5s interval)
- UI: Add Activity Log panel showing last 10 runner activities
- UI: Add Runner Status indicator with staleness detection (>15s)
- UI: Add back button for artifact navigation in bundle view
- UI: Hide session_id.txt files from artifact lists
…b error display

- state-server: Only emit runner_status SSE events when status/job_id/stage
  changes, not on every heartbeat update_at change
- app.js: Don't trigger full re-render for runner_status/activity events
  (fixes bundle tab reset issue), only re-render on overview page
- app.js: Add renderJobDetail() with prominent error display and related
  activity log entries for failed jobs
- agent-queue-runner: Write .job.error files before moving failed jobs,
  move error files along with job files
tuxillo and others added 30 commits May 18, 2026 21:49
llm.py's tokenizers stub only fired when llm was imported — but
the runner / tools modules / manual inspections can hit litellm
without going through llm first. Moving the stub to
dportsv3/agent/__init__.py makes it run as soon as any module
under the package is imported.

This unblocks invocations like:
  python -c "import dportsv3.agent; import litellm; ..."

without needing to pre-import llm.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ting

When litellm's model-name → provider heuristic mis-routes (e.g., any
model name containing 'deepseek' or 'claude' is shunted to the native
provider client even when openai/ prefix and api_base are set),
custom_llm_provider forces a specific code path.

Generic passthrough; default None means "let litellm pick from prefix
as before." Set per flow:

- agent-queue-runner: DP_HARNESS_TRIAGE_PROVIDER env var
  (DP_HARNESS_PATCH_PROVIDER will follow in step 4 when patch wires)
- llm.complete(), tool_loop.run(), triage.run(): custom_llm_provider
  kwarg
- _manual_test_tool_loop: DP_TEST_PROVIDER env var

Native providers (anthropic/, deepseek/, nvidia_nim/, ...) work
unchanged because they don't set custom_llm_provider. The override is
only used when needed (most often: openai-compat third-party endpoints
with model names that fool the heuristic).

Also commits the manual test helper for tool_loop that was previously
left untracked. Useful while step 4 is in flight.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Thinking-mode providers (DeepSeek v4-pro/v4-flash directly or via
opencode.ai/zen, OpenAI o-series via some relays) emit a
reasoning_content field alongside content + tool_calls, holding the
model's intermediate chain-of-thought. The upstream API requires
this field to be passed back on the next request, or the multi-turn
call fails with HTTP 400:

  "The reasoning_content in the thinking mode must be passed back
  to the API."

Changes:
- llm.Response gains optional reasoning_content field; llm.complete
  extracts it from msg.reasoning_content if present (None otherwise).
- tool_loop._assistant_message_from includes reasoning_content in
  the reconstructed assistant message when set, so the next LLM
  request preserves continuity.

No-op for non-thinking models — reasoning_content stays None,
nothing extra is sent.

Verified with stubbed Response objects: thinking-mode reconstructed
message carries reasoning_content; non-thinking does not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously every get_file result was base64. For UTF-8 text files
(Makefiles, patches, source, the bulk of what the agent reads), this
inflated content by ~33% AND made the model mentally decode base64
to find anything inside — burning prompt AND completion tokens.

Now: read bytes, try UTF-8 decode with a NUL-byte sanity check;
return {encoding: 'text', content: <str>} on success, fall back to
{encoding: 'base64', content: <b64>} for binary. sha256 is computed
over the raw bytes, so put_file's expected_sha256 round-trip works
regardless of encoding.

Verified with a temp-fs harness: text Makefile returns text;
PNG-header file returns base64.

Schema description updated so the LLM understands the dual-mode
return shape. Example path in description updated to /work/DPorts/...
(the common path; agent reads materialized port files from DPorts,
edits source-of-truth in DeltaPorts).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The patch agent now runs end-to-end through the harness instead of opencode.

New code:
- prompts.PATCH_SYSTEM: 4kB system prompt spelling out the dev-env's
  three-tree layout (freebsd-ports / DeltaPorts / DPorts), tool
  vocabulary, the repair loop, discipline rules (no commits/push/PRs),
  and the mandatory output format ending in the new rebuild_proof.json
  schema (origin, rebuild_ok, dsynth_profile, build_command,
  timestamp_utc — no branch/head/fports fields).
- attempt_loop.run: budget-bounded retry around tool_loop. Each
  attempt is a fresh [system, user] conversation (with a small failure-
  context user turn appended on retries) so tool-call traces don't
  compound across attempts. Stops on rebuild_ok=true, budget exhaustion,
  or max_iterations. Returns PatchResult{status, final_text, usage,
  attempts[], proof}.
- patch.run: thin wrapper over attempt_loop.run.

Runner wiring (mirrors step 1 triage adapter):
- New env vars: DP_HARNESS_PATCH_{MODEL,API_BASE,API_KEY,PROVIDER,
  TIMEOUT}, DP_HARNESS_ENV (dev-env name default), DP_HARNESS_POLICY
  (optional override of config/agentic-policy.json path).
- process_patch_job: when DP_HARNESS_PATCH_MODEL is set, route to
  _process_patch_job_harness. It reads triage.md, resolves the tier
  via policy.tier_for(classification, confidence), and calls
  dportsv3.agent.patch.run with the tier's budget.
- Bundle outputs: analysis/patch.md (final LLM text), analysis/
  rebuild_proof.json (parsed proof block), analysis/patch_audit.json
  (status + tokens + per-attempt info + model), analysis/changes.diff
  (host-side git diff vs HEAD in the env's DeltaPorts overlay).

Verified attempt_loop against a stubbed tool_loop:
- success on first attempt
- failure then success (failure-context message added to retry)
- budget exhausted mid-sequence
- needs-help after all attempts fail
- missing rebuild_proof JSON falls back to needs-help

End-to-end against a real LLM + env requires a manual smoke run with
DP_HARNESS_PATCH_MODEL + a bundle on disk; covered in the next message.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_manual_test_patch_flow.py fixtures a minimal bundle under /tmp
(meta.txt, errors.txt, analysis/triage.md) and invokes
dportsv3.agent.patch.run directly with a fabricated payload —
bypassing the queue runner so the harness's loop is exercised in
isolation against a real LLM + real dev-env.

The fixture intentionally doesn't simulate a broken port; it asks
the agent to verify the current state of the port via dsynth_build
and emit rebuild_proof.json accordingly. Pointing at devel/readline
(default) should reach rebuild_ok=true within 1-2 attempts.

Env vars mirror _manual_test_tool_loop (DP_TEST_MODEL, ENV, ORIGIN,
TIER_ITERATIONS, TIER_TOKENS, plus PROVIDER/API_BASE/API_KEY).

The bundle dir is preserved on exit so you can inspect the artifacts
the runner-side adapter would have written: patch.md, patch_audit.json,
rebuild_proof.json, changes.diff (note: those are written by
agent-queue-runner's _process_patch_job_harness, NOT by this fixture
— this fixture only calls patch.run and reports the PatchResult).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dsynth's 'build' subcommand asks interactive questions (most commonly
"Rebuild local repository? [Y/n]" before scanning, sometimes follow-
ups during the build). The agent has no tty, so the subprocess sat
in [ttyin] state and the patch flow hung — observed mid-test:

  load: 0.67  cmd: dsynth 31619 [ttyin] 0.00u 0.06s 0% 4128k

Fixes:
- worker._exec accepts optional input_text kwarg; default stdin is
  empty string (effectively /dev/null) so unexpected prompts fail
  fast rather than blocking.
- worker.dsynth_build pipes 'y\\n' * 50 to stdin to clear dsynth's
  prompts. Generous enough for multi-question build cycles, cheap
  to send.

dbuild (the dev-env helper) is unchanged — humans running it
interactively still get the prompts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_turns default

Observed: a single attempt burned 2,073,090 tokens before
attempt_loop's between-attempts budget check caught it. Root cause:
tool_loop only enforced max_turns (30), not the token budget. The
model went into a tool-call frenzy and attempt_loop only noticed
after 30 turns of accumulating 70k-token contexts.

Fixes:
- tool_loop.run: new max_tokens kwarg; checked at the top of each
  turn before issuing the LLM call. When the running total reaches
  the cap, return whatever Response we have. Default 0 = no cap
  (callers should pass remaining budget).
- attempt_loop.run: passes tier's remaining budget (max_tokens -
  tokens_used_so_far) as max_tokens to tool_loop on each attempt.
  Also short-circuits with status=budget-exhausted before kicking
  off a new attempt if the budget is already gone.
- tool_loop max_turns default: 20 -> 12. A patch task taking more
  than ~12 tool calls per attempt is in trouble; the cap should
  stop it sooner.
- attempt_loop max_tool_turns default: 30 -> 12.

Verified with stubbed LLM: tool_loop stops at 1200 tokens when
max_tokens=1200 (turn 3 was the first check after total>=cap).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the patch fixture run produces a surprising token count, we
need to see what the model actually did — final_text alone tells us
nothing if the loop ended on a tool call.

_install_session_dump wraps llm.complete and tools.dispatch to write
each turn as a JSON line to <bundle>/session.jsonl:

- llm_call records: messages_preview (with long strings truncated
  to 800 chars), response.text (1200 chars), tool_calls,
  reasoning_content (600 chars), usage.
- tool_dispatch records: tool name, arguments, ok flag, stdout/stderr
  tails truncated to 600 chars. Excludes result body (file bytes,
  full schemas) to keep the trace compact and shareable.

After a run, share session.jsonl and the per-turn behavior is
visible without re-running.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The first end-to-end run of the patch flow burned 40k tokens
exploring devel/readline without ever finding the actual build
error. Root cause was not a single bug — the agent's information
feed had multiple compounding problems. Eight fixes addressing
each, ranked by impact:

1. dsynth_build noise → use 'dsynth -S -y' directly.
   Skip the dbuild helper (which keeps ncurses for humans) and
   invoke dsynth with -S (disable ncurses TUI) and -y (assume-yes).
   Previously the agent received ~2kB of curses escape codes as
   stdout. -y also retires the 'y\n'*50 stdin hack.

2. grep used rg, which isn't packaged for DragonFly. Switch to
   POSIX 'grep -rn'. grep rc=1 (no matches) → ok=True with
   match_count=0; rc>=2 → ok=False. Prior behavior surfaced "no
   matches" as ok=False and the model concluded "rg is not
   available" (wrong inference but understandable).

3. dev-env exec INFO mount-prep noise on every chroot tool call.
   New '--quiet' flag on 'dportsv3 dev-env exec' and matching
   DPORTS_DEV_ENV_QUIET env var. worker._exec always passes
   --quiet so the harness's contexts stop accumulating 8 lines of
   "INFO: mount already present at ..." per call.

4. Surface dsynth's per-port build log. dsynth writes the actual
   build error to /work/dsynth/logs/<origin-with-underscores>.log
   (Directory_logs from dsynth.ini). Two changes:
   - dsynth_build result now carries 'log_hint' pointing at this
     path.
   - New 'dsynth_log(origin, tail_lines=200)' tool reads the tail.
   PATCH_SYSTEM updated to direct the agent: on build failure,
   call dsynth_log immediately — don't grep DPorts for *.log
   files (they don't exist there).

5. Add 'list_dir(path)' tool. Previously the agent tried
   get_file on directories and got opaque failures. list_dir
   returns entries with name/kind/size, capped at max_entries.

6. Tool schemas trimmed. Each schema's description now one
   focused sentence (was 2-4 sentences with examples). Total
   schema chars ~6.5kB → ~4kB. The example paths and prose
   moved to PATCH_SYSTEM, which is sent once per attempt-start
   instead of every turn.

7. Sliding-window reasoning_content. Thinking-mode providers
   require the most recent assistant turn's reasoning_content to
   be echoed back; older turns' reasoning is dead weight in the
   prompt. tool_loop._strip_old_reasoning drops it from all but
   the most recent assistant message after each turn.

8. give-up directive in PATCH_SYSTEM (no new tool — prompt-only).
   Explicit: "if you've tried two distinct approaches and both
   failed at the same point, stop and emit Rebuild Status:
   gave-up". Also: "if dsynth_build returned rebuild_ok=true, stop
   immediately — don't keep exploring."

Plus: get_file failure envelopes now differentiate
'missing' / 'is_directory' / 'not_a_regular_file' via a 'kind'
field, so the agent can react usefully.

Verified with unit tests:
- reasoning_content sliding window keeps LATEST, strips older
- list_dir returns entries with kind+size
- get_file on a directory returns kind=is_directory with a
  pointer to list_dir/grep
- grep on a non-existent pattern returns ok=True match_count=0

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deepseek's thinking-mode API requires reasoning_content on EVERY
prior assistant turn, not just the most recent. Empirical proof: 3
turns in, after the trim removed turn 1's reasoning_content, the
API rejected with HTTP 400:

  The reasoning_content in the thinking mode must be passed back
  to the API.

So the trim violates the protocol, not just leaves tokens on the
table. Reverting that change. The other 7 fixes in 9e35959
stand.

Token cost of preserving all reasoning_content is the price of
using a thinking-mode model — accept it or switch providers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous run on devel/readline got 9 turns in and was making
real progress (diagnosed the actual C compile error, found a
version-skew bug in DeltaPorts' overlay patch, was about to read
the source to fix it) when 40k tokens ran out.

Bump fixture defaults to match the ASSIST tier in agentic-policy.json:
- DP_TEST_TIER_ITERATIONS: 2 -> 4
- DP_TEST_TIER_TOKENS: 40000 -> 120000

Real bundles classified as ASSIST will get these caps from the policy
file; the fixture should mirror that so test results are
representative of the production budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Triage now resolves a tier from (classification, confidence) via
policy.tier_for and uses the tier name to decide whether to auto-
enqueue a patch job. The tier name propagates into the patch .job
file so the patch worker uses the same budget without re-resolving.

agent-queue-runner._process_triage_job_harness:
- Loads config/agentic-policy.json (or DP_HARNESS_POLICY path).
- Resolves tier from result.classification + result.confidence.
- tier=MANUAL: skips patch auto-enqueue, upserts a user_context
  request so the UI flags it for operator attention, returns
  status="manual_tier".
- tier=AUTO|ASSIST: enqueues the patch job carrying tier= and
  dev_env= fields.

enqueue_patch_job: optional tier_name + dev_env kwargs propagate
into the new .job file.

_process_patch_job_harness: prefer job.get('tier') (set by triage
when running through the harness path) over re-parsing triage.md.
Re-parse is the fallback for hand-fired patch jobs.

Bug fix found by the new tier matrix test: policy.tier_for only
downgraded once. plist-error + low confidence ended at ASSIST even
though ASSIST's medium floor wasn't met. Fix: cascade downgrades
in a while loop until either confidence meets the current tier's
floor or MANUAL is reached. Verified against all combinations:

  plist-error/high   -> AUTO
  plist-error/medium -> ASSIST  (AUTO floor=high not met -> downgrade)
  plist-error/low    -> MANUAL  (cascades both downgrades)
  compile-error/high -> ASSIST
  compile-error/low  -> MANUAL  (ASSIST floor=medium not met)
  runtime-error/any  -> MANUAL  (mapped directly)
  unknown/any        -> MANUAL  (default)

needs_user_context and should_enqueue_patch are no longer used by
the harness triage path; they remain only for the (also legacy)
opencode path. They'll go away in step 6 with the rest of the
opencode sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors _manual_test_patch_flow but for the triage side of the loop.
Fixtures a bundle with a synthetic-but-realistic error log, runs
dportsv3.agent.triage.run against a real LLM, then asks policy.tier_for
what the runner would do with the result.

Three built-in fixtures:
- compile-error  — readline-shape 'lvalue required' compile error
- plist-error    — pkg-plist mismatch
- unknown        — opaque generic failure

Reports: classification, confidence, resolved tier, the patch-budget
the tier would grant, and whether the runner would auto-enqueue.
Dumps per-turn LLM trace to <bundle>/session.jsonl for inspection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Final cleanup pass. With the harness fully validated end-to-end (real
LLM + real env + real port fix on devel/readline; clean build on
archivers/liblz4), the legacy opencode-driven code paths and the
workspace concept have no callers and no purpose.

Deleted:
- config/opencode/ entirely: dports-triage.md (40), dports-patch.md
  (62), tool/dports.ts (257) — TS plugin and agent markdown moved
  to dportsv3.agent.{prompts, tools} in earlier steps.

- agent-queue-runner constants: PATCHABLE_CLASSIFICATIONS,
  PATCHABLE_CONFIDENCE (replaced by policy.tier_for in step 5),
  DEFAULT_VM_SSH_KEY/PORT/HOST, DEFAULT_WORKSPACE_CONFIG,
  DEFAULT_MAX_SNIPPET_ROUNDS.

- agent-queue-runner functions:
  - should_enqueue_patch, needs_user_context (replaced by policy
    tier dispatch)
  - parse_snippet_requests (only called by the now-dead snippet
    re-enqueue path; snippet rounds fold into harness triage)
  - get_vm_ssh_command, run_snippet_extractor (SSH-to-VM
    workaround for Linux dev hosts; harness runs natively on dfly)
  - enqueue_followup_job, check_and_handle_snippet_requests
    (legacy snippet escalation; harness triage handles in-process)
  - call_opencode, extract_response_text (HTTP plumbing for
    opencode serve)
  - extract_section, extract_json_block (only used by the legacy
    write_*_outputs; the harness has its own _PROOF_BLOCK_RE in
    attempt_loop.py)
  - write_triage_outputs, write_patch_outputs (legacy bundle
    writers; harness has _write_triage_audit_harness +
    _write_patch_audit_harness)
  - load_workspace_config (workspace.json reader; workspace
    concept retired)
  - The workspace-config embedding section of build_triage_payload

- Legacy bodies of process_triage_job and process_patch_job: both
  shrink to thin wrappers that require DP_HARNESS_*_MODEL and call
  the corresponding _process_*_job_harness adapter. No more
  feature-flag-gated dual path.

- process_job: drops opencode_url, opencode_provider, opencode_model,
  timeout, max_retries, retry_delay parameters from the signature
  and call sites. Snippet round display removed (in-process now).

- main(): drops all OPENCODE_* env reads; startup log now reports
  DP_HARNESS_TRIAGE_MODEL + DP_HARNESS_PATCH_MODEL instead.

- Docstring header at the top of the file rewritten to document
  the harness env vars and job-file conventions.

Net effect:
- scripts/agent-queue-runner: ~2300 LOC -> 1685 LOC
- config/opencode/ gone (359 LOC)
- Total: ~1000 LOC retired

Negative checks pass:
- 'opencode|OPENCODE_|VM_SSH|workspace\\.json|agentic-workspace|
  PATCHABLE_|should_enqueue_patch|call_opencode|extract_response_text|
  check_and_handle_snippet_requests|load_workspace_config' in
  scripts/agent-queue-runner: 0 hits
- The remaining "opencode" mentions in dportsv3.agent/llm.py +
  prompts.py + _manual_test_tool_loop.py are documentation strings
  about opencode.ai/zen (a third-party OpenAI-compat relay) — they
  describe what works with the harness, not legacy code.
- All harness modules import cleanly; 13 tools registered.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 3 (opencode → litellm harness, dev-env-native, no PR/branch/SSH)
is complete in the branch. Updates:

- docs/agentic-consolidation-plan.md: new "Status: shipped" callout
  at the top with a brief summary of what landed and pointers to
  the commit range (985889d ... 6f6db28).

- docs/AGENTIC_BUILDS.md: warning banner that the doc describes the
  pre-Phase-3 architecture (opencode, OPENCODE_* env vars, VM_SSH,
  /build/synth/agentic-workspace, workspace.json, agentic-worker,
  process_pr_job — all retired). Sections below the banner are kept
  as historical context until the doc is rewritten.

- docs/TESTING_E2E.md: same banner, plus pointers to the three manual
  test fixtures in scripts/generator/dportsv3/agent/_manual_test_*.py
  that exercise the new harness against a real dev-env.

A full rewrite of AGENTIC_BUILDS.md and TESTING_E2E.md is queued as
follow-up work — not in scope here. Banners prevent operators from
following the stale instructions in the meantime.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fold the tracker schema into state.db so that when tracker becomes a
read-only consumer in step 4 the tables already exist. No write
behavior changes in this step — schema only, additive.

Lifted verbatim from scripts/generator/dportsv3/tracker/db.py:
- build_types (with seed rows 'test', 'release')
- build_runs
- build_results (incl. status default 'recorded' via idempotent ALTER)
- port_status
- 5 supporting indexes + uq_build_runs_active unique partial index
- Idempotent ALTER migrations (build_results.status,
  build_runs.total_expected)

Plus, per the consolidation plan's "weak link" model: add nullable
runs.build_run_id (idempotent ALTER) so a dsynth invocation can
later be associated with a campaign campaign via DPORTSV3_BUILD_RUN_ID
(wired in step 3).

Enable PRAGMA foreign_keys=ON on the artifact-store connection so
tracker's FK constraints (build_results -> build_runs, port_status
-> build_runs) are enforced going forward. None of artifact-store's
existing tables have FKs, so the change only affects writes to the
new tracker tables.

Verified locally:
- 4 tables + 6 indexes present after init
- 2 seed rows in build_types
- FK enforcement works (INSERT with bad build_run_id raises
  IntegrityError; valid insert succeeds)
- Re-init is idempotent (no errors on second _init_db call)

tracker.db remains authoritative until step 4 flips tracker to read
state.db; both DBs are valid in parallel during the transition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 530-line standalone script becomes a 20-line shim; logic moves
into dportsv3.artifact_store (importable, testable). Schema moves
into dportsv3.db.schema (shared with tracker once it switches to
state.db in step 4).

New:
- dportsv3/db/__init__.py  — package marker
- dportsv3/db/schema.py    — SCHEMA, DEFAULT_BUILD_TYPES, MIGRATIONS,
  init_db(conn) helper. Idempotent on re-init.
- dportsv3/artifact_store.py — ArtifactStore, Handler,
  ArtifactStoreServer, main(). Imports init_db from db.schema.

Changed:
- scripts/artifact-store — shrinks to a sys.path bootstrap +
  `from dportsv3.artifact_store import main; main()`. Same
  invocation, same behaviour. Executable bit preserved.
- scripts/generator/pyproject.toml — adds
  `artifact-store = "dportsv3.artifact_store:main"` console script,
  so the generator venv's bin/ gets an `artifact-store` entry too.

Invocation matrix now:
- ./scripts/artifact-store --logs-root /path  (production; no venv)
- python -m dportsv3.artifact_store --logs-root /path  (in-venv)
- $VENV/bin/artifact-store --logs-root /path  (console script)
All identical in behaviour.

Single source of truth for the state.db schema removes the
duplication step 1 introduced (both artifact-store and
tracker/db.py held the same 4 CREATE TABLEs).

Verified:
- All 15 tables created, 6 new indexes, FK enforced, runs.build_run_id
  column present, build_types seeded.
- ArtifactStore.upsert_run_bundle / put_blob / get_artifact round-trip
  works against a temp dir.
- Re-init is idempotent.
- Both ./scripts/artifact-store --help and `python -m dportsv3.artifact_store --help`
  print the same help text.

tracker/db.py is unchanged — it still uses its own schema for
tracker.db. Step 4 will switch it to import from dportsv3.db.schema
when it reads state.db.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Port state-server's POST /user-context to artifact-store. Same body
shape, same semantics, same emit_event. State-server still serves
its legacy /user-context in parallel until step 8 retires it.

(Note: the Phase 4 plan also mentioned a POST /v1/jobs/enqueue/pr
endpoint as part of step 2. That was prospective — state-server
never actually had it, and Phase 3 deleted process_pr_job + the
type=pr dispatch arm. No PR enqueue path exists today, so nothing
to port. Step 2's real scope is just user-context.)

New on the ArtifactStore class:
- upsert_user_context(run_id, origin, context_text) -> int
  Looks up existing context_rev, increments, upserts the row, emits
  user_context_updated event, returns the new rev under the lock.

New on the Handler:
- POST /v1/user-context with body {run_id, origin, context_text}.
  Validation matches state-server: required fields, non-empty after
  strip, <= 8000 chars, valid JSON. Returns
  {"ok": true, "context_rev": N} on success.

Verified locally via curl against the running shim:
- First write to (r1, devel/readline): context_rev=1
- Second write same key: context_rev=2 (increment)
- Different origin: starts fresh at context_rev=1
- Missing context_text -> 400
- Empty/whitespace -> 400
- > 8000 chars -> 400
- Malformed JSON -> 400
- state.db rows match: 2 user_context rows, 3 events with the
  expected rev/timestamp content.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add dsynth hook scripts that translate run start/end and per-port
state changes into `dportsv3 tracker` API calls for the current
build profile.

The hooks persist one active tracker run id per dsynth profile,
enqueue ports before marking them building, record final port
outcomes, and fail soft with hook-local logging so tracker outages
do not interrupt package builds.

Include a shared helper, a config template, support for both
`hook_pkg_start` and `hook_pkg_started`, and installation notes for
copying the hook set into `/etc/dsynth`.

Co-authored-by: OpenAI <noreply@openai.com>
Stop dsynth tracker hooks from reusing stale run ids when a new
start-build request fails, and disable tracking for that dsynth run
instead of continuing to enqueue and record into the previous build.

Also switch the tracker server from one shared SQLite connection to
fresh per-request connections so concurrent enqueue and status
updates do not corrupt transaction state under hook traffic. Add a
regression test covering the request connection lifecycle.

Co-authored-by: OpenAI <noreply@openai.com>
…ok set

The two parallel hook sets were a real problem: dsynth has one
Hooks_Directory, so only one executable per event name (hook_pkg_failure
etc.) can live there. Today an operator has to pick artifact-store
evidence OR tracker-side build_results, not both.

This commit folds builderhooks' tracker integration into the existing
scripts/dsynth-hooks/, preserving the good ideas from builderhooks:
- per-profile state file (under evidence_root/.tracker-state by default)
- soft-fail logging (tracker outages don't fail dsynth builds)
- disable-on-collision (if start-build fails because an active run
  exists, the new run is marked TRACKING_DISABLED instead of reusing
  a stale run id)
- conf-driven, default-on tracker integration

hook_common.sh gains a "tracker integration" section with:
- tracker_log, tracker_fail_soft, tracker_should_skip
- tracker_load_config / load_state / write_state / clear_state /
  disable_state
- tracker_pkg_version, tracker_enqueue_one
- tracker_run_start, tracker_run_end, tracker_mark_building,
  tracker_record_result

Defaults:
- DPORTSV3_TRACKER_TARGET = @${PROFILE} (one profile per target)
- DPORTSV3_TRACKER_BUILD_TYPE = test
- DPORTSV3_TRACKER_STATE_DIR = ${DIR_LOGS}/evidence/.tracker-state
- DPORTSV3_TRACKER_HOOK_LOG = ${DIR_LOGS}/dportsv3-hooks.log

Includes the mktemp fix from the wip commit (drop .json suffix — BSD
mktemp requires X's at the end of the template).

Hooks wired:
- hook_run_start: existing evidence-root setup + new tracker_run_start
- hook_run_end:   existing evidence-pointer cleanup + new tracker_run_end
- hook_pkg_failure: existing full bundle write + enqueue triage job
                    + new tracker_record_result fail
- hook_pkg_success / skipped / ignored: replace no-op with
  tracker_record_result {pass,skipped,ignored}
- New hook_pkg_start + hook_pkg_started for tracker_mark_building
  (both names provided to match either dsynth variant)

New supporting files:
- scripts/dsynth-hooks/dportsv3-hooks.conf.example (single config file
  covering both artifact-store overrides and tracker config)
- scripts/dsynth-hooks/README.md (install instructions + operational
  notes)

Tracker integration is opt-in via the config file. Without
DPORTSV3_TRACKER_URL set, every tracker_* high-level helper
short-circuits and the hooks only do artifact-store work — preserving
the previous behaviour for operators who don't run the tracker.

Retired:
- scripts/builderhooks/* (README, conf template, 9 hook stubs, and
  tracker_common.sh). The two unrelated poudriere-era scripts
  (bulk_started.sh, pkgbuild.sh) that lived in the same dir are kept
  in place pending a separate decision about whether to retire them
  too (they write per-port STATUS files — the thing tracker is meant
  to replace).

Verified locally:
- All hooks pass `sh -n` syntax check.
- hook_pkg_success with tracker disabled (no config): exits 0,
  silent, no log file.
- hook_pkg_success with tracker config but unreachable: exits 0
  (soft-fail), error logged to dportsv3-hooks.log.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Detour before step 4 to close the "env's DeltaPorts is stale" gap.

Two pieces:

1. runtime.py: bind-mount the host's repo mirror cache (config.repos_dir)
   into the chroot read-only at the same path. The env's git origins
   were recorded at clone time as host paths like
   /root/.cache/dports-dev/repos/deltaports.git — without this mount,
   `git pull` from inside the env fails because that path doesn't
   exist in the chroot's filesystem view. With the mount, the path
   resolves and standard git operations work from inside the env shell.

2. New `dportsv3 dev-env update NAME [--force]` subcommand. Two
   phases:
   - Phase 1: refresh the bare mirrors under config.repos_dir from
     the host's working tree (reuses RepoCache.refresh_all — same
     logic the builder runs at env create time).
   - Phase 2: for each env-side repo (work/DeltaPorts, work/freebsd-
     ports), run host-side `git fetch --prune origin` + `git pull
     --ff-only origin <current-branch>`. Errors when the working
     tree is dirty unless --force; errors when the branch can't
     fast-forward (divergent history) with a clear message.
   Logs before/after short SHAs per repo so the operator sees what
   moved. DPorts is intentionally excluded — it's compose-generated,
   not a git checkout.

   No --branch flag: switching branches in the env is now a normal
   `git -C /work/DeltaPorts checkout <other>` inside the env shell
   (works thanks to the bind mount).

Also extends `dportsv3 dev-env status NAME` JSON output with
per-repo `{branch, commit, dirty}` for DeltaPorts and freebsd-ports.
Lets operators see what the env is tracking without entering it.

Verification: argparse smoke + module parse + import. End-to-end
verification needs a dfly env (next).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The tracker module now opens the same SQLite file artifact-store does
(state.db). Two writers under SQLite WAL: one writer at a time at the
SQLite layer, readers parallel. Pragmas (WAL, busy_timeout=5000,
foreign_keys=ON) applied per-connection so the tracker server's per-
request connections (commit a14fe9c) inherit them.

tracker/db.py:
- Drop the duplicate SCHEMA + DEFAULT_BUILD_TYPES + MIGRATIONS that
  lived inline. init_db now delegates to dportsv3.db.schema.init_db
  so the schema is single-sourced (matches what artifact-store writes).
  Re-exports DEFAULT_BUILD_TYPES for any consumer that imported it
  from this module.
- open_db now also sets PRAGMA busy_timeout=5000 (was missing) to
  match artifact-store and survive concurrent-writer contention.

CLI default --db path resolution (was hardcoded "tracker.db"):
  1. --db PATH (operator override)
  2. DPORTSV3_STATE_DB env var
  3. $PWD/state.db (fall-back)
Documented in cli.py help text. Operator is responsible for matching
artifact-store's logs-root (e.g. /build/synth/logs/evidence/state.db
when artifact-store runs with --logs-root /build/synth/logs).

Tests:
- All tracker test fixtures (test_tracker_api, test_tracker_queue,
  test_tracker_integration) switched from tmp_path / "tracker.db" to
  tmp_path / "state.db". Schema is identical via the shared module
  so test bodies don't change.
- New test_state_db_concurrency.py: two threads hammering the same
  state.db (one as artifact-store, one as tracker) with ~60 writes
  each. Confirms no "database is locked" errors under WAL +
  busy_timeout, both sides' rows land, runtime well under the 15s
  bound. Plus a small FK-enforcement guard test.

What did not change:
- All high-level tracker functions (create_build_run, record_results,
  get_target_summary, …) — same API, same callers.
- Tracker server (server.py) — per-request connections from
  a14fe9c stay.
- Tracker CLI commands (start-build, record-result, …) — they hit
  the tracker server over HTTP; no awareness of the DB path.

tracker.db file:
- No code opens it anymore; safe to delete whenever convenient.
- ~38k rows of test data (per the consolidation plan) — abandoned, not
  migrated.

Static checks done locally: parse, import, schema delegation, all
15 expected tables present after tracker_init_db, pragmas applied.
The pytest run requires the generator venv (dev deps); to verify
on dfly:
  cd scripts/generator && .venv/bin/python -m pytest tests/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
schema.init_db(conn) mutates in place and returns None; the test wrapped
the connect() call inside it and then tried to .close() the result.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds nullable target columns on bundles/jobs/runs (with indexes) via
idempotent ALTER TABLE migrations. Hook + artifact-store + state-server
propagate target on every write so step 8 can retire state-server
without losing the target dimension. Tracker absorbs the agentic read
API as eleven /api/* endpoints (runs, jobs, bundles, ports, activity,
runner-status, agentic-status, artifacts, SSE events) all accepting a
?target= filter where applicable. Plan doc rewritten as Phase 4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Server-rendered Jinja views for the agentic side: bundle list/detail,
job list/detail, run detail, runner status, activity log. Each list
view exposes a target selector populated from distinct_targets across
bundles/jobs/runs. Bundle detail links artifact paths to the existing
/api/bundles/<id>/artifacts/<relpath> streamer. Adds an "Agentic" nav
entry next to Targets / Builds / Diff. Same Bootstrap layout as the
build dashboard — no SPA, no JS framework. The dsynth-progress
aesthetic redesign is deferred to Phase 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deletes scripts/state-server (1373 LOC) and scripts/state-server-ui
(2381 LOC across .css/.js/.html). The tracker absorbed the read API
in step 5 and the HTML views in step 6, so the legacy server has
nothing left to do.

agent-queue-runner: STATE_SERVER_URL retired; bundle/artifact lookups
now go through DPORTSV3_TRACKER_URL against /api/bundles and
/api/ports. No backward-compat fallback — hard cutover.

Stale "until step 8 retires it" comments scrubbed. Phase 4 complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plumbing only, no existing page reskinned. Copies progress.{css,js}
and the dsynth/favicon PNGs into dportsv3/tracker/static/, lifts the
index.html as a Jinja template at templates/progress.html (with a
<base> tag pinning relative URLs to the canonical path), and adds a
progress_adapter that maps build_runs + build_results into the
{summary.json, NN_history.json} shape progress.js consumes.

Three new routes mounted under /target/{target}/progress/, leaving
the existing dashboard untouched. Result vocabulary mapped
success→built, failure→failed, skipped→skipped, ignored→ignored;
meta is left at 0 (no tracker analog).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant