Migrate to generic sandbox contract by tthuwng · Pull Request #31 · vals-ai/SWEBenchBenchmarkService

tthuwng · 2026-06-10T18:58:39Z

Summary

Migrates SWE-bench to the provider-generic sandbox interface from create-benchmark-service v0.7.4 (vals-ai/valkyrie-ticket-library#89), following the VCB (vcb-benchmark-service#86) and ProgramBench (programbench-benchmark-service#19) reference migrations.

- setup_task, evaluate_instance, _stream_command_with_retry, and with_retry now take the generic Sandbox instead of the Daytona AsyncSandbox
- sandbox.process.exec → sandbox.exec (.result → .stdout), sandbox.fs.upload_file(data, path) → sandbox.upload_file(path, data)
- retrieve_task returns a typed ImageSource instead of the raw docker_image string; the serialized response still includes the legacy docker_image field via the framework's computed property (covered by a unit-test assertion)
- DaytonaError/DaytonaNotFoundError handling replaced with SandboxError/SandboxNotFoundError in with_retry and the stream-retry paths; the Daytona-only best-effort sandbox.refresh_data() diagnostic call was dropped (the generic adapter already embeds sandbox name/id/state in wrapped errors)
- No DaytonaTimeoutError handling existed in this service, so no exit-code-124 timeout mapping was needed
- Integration tests create sandboxes through the generic provider (DaytonaBackendConfig(...).create_provider() + SandboxCreateRequest) instead of a raw AsyncDaytona client; the unused Daytona-only get_session_logger test helper (session-log streaming has no generic equivalent) was removed
- create-benchmark-service pin bumped v0.5.0 → git main (resolves to v0.7.4)
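
For readers unfamiliar with the new contract, the call-site changes listed above can be sketched roughly as follows. This is an illustration only: FakeSandbox, ExecResult, the exit_code field, and the local SandboxError/SandboxNotFoundError classes are hypothetical stand-ins, not the create-benchmark-service types. The retry policy shown (retry on SandboxError, give up immediately on SandboxNotFoundError) is an assumption, not a description of the service's actual with_retry.

```python
import asyncio
from dataclasses import dataclass


class SandboxError(Exception):
    """Hypothetical stand-in for the generic sandbox error."""


class SandboxNotFoundError(SandboxError):
    """Hypothetical stand-in: the sandbox is gone, so a retry cannot help."""


@dataclass
class ExecResult:
    stdout: str      # the generic result exposes .stdout (was .result on Daytona)
    exit_code: int   # assumed field, included only for illustration


class FakeSandbox:
    """In-memory stand-in for the generic Sandbox; not the framework class."""

    def __init__(self, failures: int = 0) -> None:
        self.files: dict[str, bytes] = {}
        self._failures = failures  # number of transient errors to raise first

    async def exec(self, command: str) -> ExecResult:
        # Previously: sandbox.process.exec(...)
        if self._failures > 0:
            self._failures -= 1
            raise SandboxError("transient failure")
        return ExecResult(stdout=f"ran: {command}", exit_code=0)

    async def upload_file(self, path: str, data: bytes) -> None:
        # Generic argument order is (path, data); Daytona's was (data, path).
        self.files[path] = data


async def with_retry(operation, attempts: int = 3, delay: float = 0.0):
    """Assumed policy: retry SandboxError, re-raise SandboxNotFoundError at once."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except SandboxNotFoundError:
            raise
        except SandboxError:
            if attempt == attempts:
                raise
            await asyncio.sleep(delay)


async def main() -> tuple[str, dict[str, bytes]]:
    sandbox = FakeSandbox(failures=2)
    await sandbox.upload_file("/tmp/patch.diff", b"diff --git a b")
    result = await with_retry(lambda: sandbox.exec("git apply /tmp/patch.diff"))
    return result.stdout, sandbox.files


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The swapped upload_file argument order is the easiest change to miss, because both orders type-check when path and data are loosely typed.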

Note: the dependency bump pulls starlette 1.2.x, whose TestClient is type-annotated against the optional httpx2 package. Installing httpx2 was not possible in this environment, so tests/utils.py carries a file-level pyright suppression for the resulting Unknown-type diagnostics (runtime behavior is unchanged; tests pass on plain httpx). Happy to swap that for an httpx2 dev dependency instead. @JarettForzano

Validation

- uv run ruff check / uv run ruff format --check: clean (ruff format also fixed 2 pre-existing drift hunks present on main)
- uv run pytest -q: 24 passed, 1 skipped (sandbox e2e, no Daytona creds), 2 deselected (experimental)
- uv run basedpyright: 0 errors, matching the clean-tree baseline (verified via stash)

🤖 Generated with Claude Code

Replace Daytona-native sandbox usage with the provider-generic Sandbox interface from create-benchmark-service 0.7.x:

- type setup/eval entrypoints and helpers with generic Sandbox
- sandbox.process.exec -> sandbox.exec (.result -> .stdout)
- sandbox.fs.upload_file(data, path) -> sandbox.upload_file(path, data)
- retrieve_task returns typed ImageSource instead of docker_image string
- DaytonaError/DaytonaNotFoundError -> SandboxError/SandboxNotFoundError in with_retry and stream retry paths; drop Daytona-only refresh_data() diagnostic call
- integration tests create sandboxes via DaytonaBackendConfig provider and SandboxCreateRequest instead of a raw AsyncDaytona client
- drop unused Daytona-only get_session_logger test helper
- bump create-benchmark-service pin to git main (v0.7.4)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

assert-app · 2026-06-10T18:58:41Z

Review on Assert →

devin-ai-integration

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

devin-ai-integration · 2026-06-10T19:02:04Z

+        # Legacy field preserved for older Valkyrie clients
+        assert data["docker_image"] == expected_image


🚩 Test asserts legacy docker_image field exists in serialized response

The test at tests/unit/test_endpoints.py:112-113 asserts data["docker_image"] == expected_image, but the retrieve_task method at src/swebench_service/benchmark_service.py:84-90 no longer passes docker_image= to RetrieveTaskResponse — it only passes source=ImageSource(image=docker_image). This assertion will only pass if the RetrieveTaskResponse schema in the create-benchmark-service framework automatically computes a docker_image field from the source for backward compatibility. The comment "Legacy field preserved for older Valkyrie clients" suggests this is intentional framework behavior, but I could not verify by inspecting the RetrieveTaskResponse schema definition directly.
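
The mechanism this finding depends on can be sketched as below. This is a hypothetical stand-in built on plain dataclasses, not the framework's RetrieveTaskResponse; the to_dict helper and the example image name are invented for illustration. It only shows how a legacy docker_image key could be derived from source at serialization time without the caller passing it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ImageSource:
    image: str


@dataclass(frozen=True)
class RetrieveTaskResponse:
    """Hypothetical stand-in; the real schema lives in create-benchmark-service."""

    source: ImageSource

    @property
    def docker_image(self) -> str:
        # Derived from `source`, so callers never pass docker_image= explicitly.
        return self.source.image

    def to_dict(self) -> dict:
        # Serialize the declared fields, then add the derived legacy key.
        payload = asdict(self)
        payload["docker_image"] = self.docker_image
        return payload


response = RetrieveTaskResponse(source=ImageSource(image="example/image:tag"))
data = response.to_dict()
```

If the framework follows a pattern like this, the assertion in the test holds even though retrieve_task only passes source; confirming that still requires reading the actual schema definition.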


tthuwng requested a review from JarettForzano June 10, 2026 18:58

devin-ai-integration Bot reviewed Jun 10, 2026

tthuwng wants to merge 1 commit into main from ht/generic-sandbox
