Migrate to generic sandbox contract #31
Replace Daytona-native sandbox usage with the provider-generic Sandbox interface from create-benchmark-service 0.7.x:

- type setup/eval entrypoints and helpers with generic Sandbox
- sandbox.process.exec -> sandbox.exec (.result -> .stdout)
- sandbox.fs.upload_file(data, path) -> sandbox.upload_file(path, data)
- retrieve_task returns typed ImageSource instead of docker_image string
- DaytonaError/DaytonaNotFoundError -> SandboxError/SandboxNotFoundError in with_retry and stream retry paths; drop Daytona-only refresh_data() diagnostic call
- integration tests create sandboxes via DaytonaBackendConfig provider and SandboxCreateRequest instead of a raw AsyncDaytona client
- drop unused Daytona-only get_session_logger test helper
- bump create-benchmark-service pin to git main (v0.7.4)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
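The two call-shape changes listed in the commit message can be sketched as follows. This is only an illustration: `ExecResult`, the `Sandbox` protocol, `FakeSandbox`, `apply_patch`, and the `/tmp/fix.patch` path are hypothetical stand-ins, and the sketch assumes the generic methods are async. It is not the actual create-benchmark-service interface.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ExecResult:
    """Hypothetical result type; the real one may carry more fields."""

    exit_code: int
    stdout: str


class Sandbox(Protocol):
    """Hypothetical stand-in for the provider-generic Sandbox interface."""

    async def exec(self, command: str) -> ExecResult: ...

    async def upload_file(self, path: str, data: bytes) -> None: ...


@dataclass
class FakeSandbox:
    """In-memory sandbox used only to exercise the call shapes."""

    files: dict[str, bytes] = field(default_factory=dict)

    async def exec(self, command: str) -> ExecResult:
        # Echo the command back instead of running anything.
        return ExecResult(exit_code=0, stdout=f"ran: {command}")

    async def upload_file(self, path: str, data: bytes) -> None:
        self.files[path] = data


async def apply_patch(sandbox: Sandbox, patch: bytes) -> str:
    # Before: await sandbox.fs.upload_file(patch, "/tmp/fix.patch")
    # After: the path comes first, then the data.
    await sandbox.upload_file("/tmp/fix.patch", patch)
    # Before: (await sandbox.process.exec(...)).result
    # After: a flat exec() call whose output lives in .stdout.
    result = await sandbox.exec("git apply /tmp/fix.patch")
    return result.stdout


sandbox = FakeSandbox()
output = asyncio.run(apply_patch(sandbox, b"diff --git a/x b/x\n"))
```

The swapped `upload_file` argument order is the easiest part of this migration to get wrong silently when both arguments are positional, so passing them by keyword at call sites may be worth considering.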
    # Legacy field preserved for older Valkyrie clients
    assert data["docker_image"] == expected_image
🚩 Test asserts legacy docker_image field exists in serialized response
The test at tests/unit/test_endpoints.py:112-113 asserts data["docker_image"] == expected_image, but the retrieve_task method at src/swebench_service/benchmark_service.py:84-90 no longer passes docker_image= to RetrieveTaskResponse — it only passes source=ImageSource(image=docker_image). This assertion will only pass if the RetrieveTaskResponse schema in the create-benchmark-service framework automatically computes a docker_image field from the source for backward compatibility. The comment "Legacy field preserved for older Valkyrie clients" suggests this is intentional framework behavior, but I could not verify by inspecting the RetrieveTaskResponse schema definition directly.
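For reference, the backward-compatibility behavior this comment hypothesizes could look roughly like the sketch below. It is built on stdlib dataclasses rather than the framework's real schema, which was not inspected here; the `ImageSource` and `RetrieveTaskResponse` classes, the `to_dict` helper, and the image name are all hypothetical stand-ins.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ImageSource:
    """Hypothetical stand-in for the framework's typed image source."""

    image: str


@dataclass(frozen=True)
class RetrieveTaskResponse:
    """Hypothetical stand-in; the real schema may differ."""

    source: ImageSource

    @property
    def docker_image(self) -> str:
        # Legacy field derived from the typed source, never passed in directly.
        return self.source.image

    def to_dict(self) -> dict:
        # Serialize the declared fields, then add the derived legacy field.
        data = asdict(self)
        data["docker_image"] = self.docker_image
        return data


# The image name is made up for illustration.
response = RetrieveTaskResponse(source=ImageSource(image="swebench/example:latest"))
data = response.to_dict()
```

If the framework behaves this way, the caller only passes `source=` and the assertion on `data["docker_image"]` still holds. If it does not, the test would fail with a `KeyError`.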
Summary
Migrates SWE-bench to the provider-generic sandbox interface from `create-benchmark-service` v0.7.4 (vals-ai/valkyrie-ticket-library#89), following the VCB (vcb-benchmark-service#86) and ProgramBench (programbench-benchmark-service#19) reference migrations.

- `setup_task` / `evaluate_instance` / `_stream_command_with_retry` and `with_retry` now take the generic `Sandbox` instead of Daytona `AsyncSandbox`
- `sandbox.process.exec` → `sandbox.exec` (`.result` → `.stdout`), `sandbox.fs.upload_file(data, path)` → `sandbox.upload_file(path, data)`
- `retrieve_task` returns a typed `ImageSource` instead of the raw `docker_image` string; the serialized response still includes the legacy `docker_image` field via the framework's computed property (covered by a unit-test assertion)
- `DaytonaError` / `DaytonaNotFoundError` handling replaced with `SandboxError` / `SandboxNotFoundError` in `with_retry` and the stream-retry paths; the Daytona-only best-effort `sandbox.refresh_data()` diagnostic call was dropped (the generic adapter already embeds sandbox name/id/state in wrapped errors)
- No `DaytonaTimeoutError` handling existed in this service, so no exit-code-124 timeout mapping was needed
- Integration tests create sandboxes via the provider (`DaytonaBackendConfig(...).create_provider()` + `SandboxCreateRequest`) instead of a raw `AsyncDaytona` client; the unused Daytona-only `get_session_logger` test helper (session-log streaming has no generic equivalent) was removed
- `create-benchmark-service` pin bumped `v0.5.0` → git `main` (resolves to v0.7.4)

Note: the dependency bump pulls starlette 1.2.x, whose `TestClient` is type-annotated against the optional `httpx2` package. Installing `httpx2` was not possible in this environment, so `tests/utils.py` carries a file-level pyright suppression for the resulting Unknown-type diagnostics (runtime behavior is unchanged; tests pass on plain `httpx`). Happy to swap that for an `httpx2` dev dependency instead. @JarettForzano

Validation

- `uv run ruff check` / `uv run ruff format --check`: clean (ruff format also fixed 2 pre-existing drift hunks present on main)
- `uv run pytest -q`: 24 passed, 1 skipped (sandbox e2e, no Daytona creds), 2 deselected (experimental)
- `uv run basedpyright`: 0 errors, matching the clean-tree baseline (verified via stash)

🤖 Generated with Claude Code