Migrate to generic sandbox contract #31

Open: tthuwng wants to merge 1 commit into main from ht/generic-sandbox


@tthuwng tthuwng commented Jun 10, 2026

Summary

Migrates SWE-bench to the provider-generic sandbox interface from create-benchmark-service v0.7.4 (vals-ai/valkyrie-ticket-library#89), following the VCB (vcb-benchmark-service#86) and ProgramBench (programbench-benchmark-service#19) reference migrations.

  • setup_task / evaluate_instance / _stream_command_with_retry and with_retry now take the generic Sandbox instead of Daytona AsyncSandbox
  • sandbox.process.exec → sandbox.exec (.result → .stdout), sandbox.fs.upload_file(data, path) → sandbox.upload_file(path, data)
  • retrieve_task returns a typed ImageSource instead of the raw docker_image string; the serialized response still includes the legacy docker_image field via the framework's computed property (covered by a unit-test assertion)
  • DaytonaError/DaytonaNotFoundError handling replaced with SandboxError/SandboxNotFoundError in with_retry and the stream-retry paths; the Daytona-only best-effort sandbox.refresh_data() diagnostic call was dropped (the generic adapter already embeds sandbox name/id/state in wrapped errors)
  • No DaytonaTimeoutError handling existed in this service, so no exit-code-124 timeout mapping was needed
  • Integration tests create sandboxes through the generic provider (DaytonaBackendConfig(...).create_provider() + SandboxCreateRequest) instead of a raw AsyncDaytona client; the unused Daytona-only get_session_logger test helper (session-log streaming has no generic equivalent) was removed
  • create-benchmark-service pin bumped v0.5.0 → git main (resolves to v0.7.4)
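
For reviewers, the call-shape change described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the framework's code: `FakeSandbox` and `ExecResult` are hypothetical in-memory stand-ins, the async signatures are an assumption, and the only details taken from this PR are the method names, the `.stdout` attribute, and the `(path, data)` argument order.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ExecResult:
    """Hypothetical result type; this PR only confirms a `.stdout` attribute."""

    stdout: str


@dataclass
class FakeSandbox:
    """Hypothetical in-memory stand-in for the generic Sandbox interface."""

    files: dict[str, bytes] = field(default_factory=dict)

    async def exec(self, command: str) -> ExecResult:
        # Pretend to run the command and echo it back as stdout.
        return ExecResult(stdout=f"ran: {command}")

    async def upload_file(self, path: str, data: bytes) -> None:
        # Generic order is (path, data); the Daytona fs call took (data, path).
        self.files[path] = data


async def apply_patch(sandbox: FakeSandbox, patch: bytes) -> str:
    # Before: await sandbox.fs.upload_file(patch, "/tmp/patch.diff")
    await sandbox.upload_file("/tmp/patch.diff", patch)
    # Before: (await sandbox.process.exec(cmd)).result
    result = await sandbox.exec("git apply /tmp/patch.diff")
    return result.stdout


sandbox = FakeSandbox()
output = asyncio.run(apply_patch(sandbox, b"diff --git a/x b/x\n"))
```

The swapped argument order in `upload_file` is the easiest part of the migration to get wrong, since both arguments were positional at the old call sites.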

Note: the dependency bump pulls starlette 1.2.x, whose TestClient is type-annotated against the optional httpx2 package. Installing httpx2 was not possible in this environment, so tests/utils.py carries a file-level pyright suppression for the resulting Unknown-type diagnostics (runtime behavior is unchanged; tests pass on plain httpx). Happy to swap that for an httpx2 dev dependency instead. @JarettForzano

Validation

  • uv run ruff check / uv run ruff format --check clean (ruff format also fixed 2 pre-existing drift hunks present on main)
  • uv run pytest -q — 24 passed, 1 skipped (sandbox e2e, no Daytona creds), 2 deselected (experimental)
  • uv run basedpyright — 0 errors, matching the clean-tree baseline (verified via stash)

🤖 Generated with Claude Code


Replace Daytona-native sandbox usage with the provider-generic Sandbox
interface from create-benchmark-service 0.7.x:

- type setup/eval entrypoints and helpers with generic Sandbox
- sandbox.process.exec -> sandbox.exec (.result -> .stdout)
- sandbox.fs.upload_file(data, path) -> sandbox.upload_file(path, data)
- retrieve_task returns typed ImageSource instead of docker_image string
- DaytonaError/DaytonaNotFoundError -> SandboxError/SandboxNotFoundError
  in with_retry and stream retry paths; drop Daytona-only refresh_data()
  diagnostic call
- integration tests create sandboxes via DaytonaBackendConfig provider
  and SandboxCreateRequest instead of a raw AsyncDaytona client
- drop unused Daytona-only get_session_logger test helper
- bump create-benchmark-service pin to git main (v0.7.4)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@tthuwng tthuwng requested a review from JarettForzano June 10, 2026 18:58
assert-app Bot commented Jun 10, 2026

Review on Assert →

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

Comment on lines +112 to +113
# Legacy field preserved for older Valkyrie clients
assert data["docker_image"] == expected_image

🚩 Test asserts legacy docker_image field exists in serialized response

The test at tests/unit/test_endpoints.py:112-113 asserts data["docker_image"] == expected_image, but the retrieve_task method at src/swebench_service/benchmark_service.py:84-90 no longer passes docker_image= to RetrieveTaskResponse — it only passes source=ImageSource(image=docker_image). This assertion will only pass if the RetrieveTaskResponse schema in the create-benchmark-service framework automatically computes a docker_image field from the source for backward compatibility. The comment "Legacy field preserved for older Valkyrie clients" suggests this is intentional framework behavior, but I could not verify by inspecting the RetrieveTaskResponse schema definition directly.
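
The behavior this assertion relies on can be sketched with the standard library alone. The classes below are hypothetical stand-ins that reuse the names `ImageSource` and `RetrieveTaskResponse`; the real schema in create-benchmark-service was not inspected and may be implemented differently. `to_dict` and the image tag are also illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageSource:
    """Hypothetical stand-in for the framework's typed image source."""

    image: str


@dataclass(frozen=True)
class RetrieveTaskResponse:
    """Hypothetical stand-in: only `source` is passed in by the caller."""

    source: ImageSource

    @property
    def docker_image(self) -> str:
        # Legacy field derived from the typed source, never set directly.
        return self.source.image

    def to_dict(self) -> dict[str, object]:
        # Serialize both the new typed field and the derived legacy field.
        return {
            "source": {"image": self.source.image},
            "docker_image": self.docker_image,
        }


expected_image = "example/image:tag"  # illustrative value only
data = RetrieveTaskResponse(source=ImageSource(image=expected_image)).to_dict()
assert data["docker_image"] == expected_image
```

If the framework derives the field this way, the test passes without `retrieve_task` ever passing `docker_image=`; confirming that still requires reading the actual `RetrieveTaskResponse` definition.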
