Merged
1 change: 1 addition & 0 deletions assets/lab/environments/AGENTS.md
Expand Up @@ -911,6 +911,7 @@ Newer and more experimental environment classes include:
)
```
- **V1 `vf.Env` / `vf.Taskset` / `vf.Harness`** — preferred taskset/harness pattern for composing task data and program execution without subclassing. Use this for new environments that need reusable tasksets, reusable harnesses, config-driven metrics, rewards, toolsets, users, endpoint interception, or sandboxed Python/command programs. `vf.Taskset` owns train/eval rows, prompt shaping, setup/update/reward hooks, and toolsets. `vf.Harness` owns the framework program, endpoint proxy, model controls, sandbox options, and runtime hooks. `vf.Env` wires them into the standard evaluation and training surface.
- **`SWEDebugEnv`** — no-agent debugger for SWE-style `SandboxTaskSet` instances. It creates the task sandbox, optionally runs `taskset.setup(state)`, performs one debug step (`none`, `gold_patch`, `command`, or `script`), and optionally runs the task tests and scorer. It records setup, sandbox creation, gold patch, debug command, and test timings in state for validation and timing investigations.
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- **`RLMEnv`** — implements [Recursive Language Models](https://alexzhang13.github.io/blog/2025/rlm/) for unbounded context processing via REPL-based decomposition and recursive sub-LLM calls
- **`OpenCodeEnv`** — runs [OpenCode](https://opencode.ai) CLI agents inside sandboxes with API call interception
1 change: 1 addition & 0 deletions docs/environments.md
Expand Up @@ -905,6 +905,7 @@ Newer and more experimental environment classes include:
)
```
- **V1 `vf.Env` / `vf.Taskset` / `vf.Harness`** — preferred taskset/harness pattern for composing task data and program execution without subclassing. Use this for new environments that need reusable tasksets, reusable harnesses, config-driven metrics, rewards, toolsets, users, endpoint interception, or sandboxed Python/command programs. `vf.Taskset` owns train/eval rows, prompt shaping, setup/update/reward hooks, and toolsets. `vf.Harness` owns the framework program, endpoint proxy, model controls, sandbox options, and runtime hooks. `vf.Env` wires them into the standard evaluation and training surface.
- **`SWEDebugEnv`** — no-agent debugger for SWE-style `SandboxTaskSet` instances. It creates the task sandbox, optionally runs `taskset.setup(state)`, performs one debug step (`none`, `gold_patch`, `command`, or `script`), and optionally runs the task tests and scorer. It records setup, sandbox creation, gold patch, debug command, and test timings in state for validation and timing investigations.
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- **`RLMEnv`** — implements [Recursive Language Models](https://alexzhang13.github.io/blog/2025/rlm/) for unbounded context processing via REPL-based decomposition and recursive sub-LLM calls
- **`OpenCodeEnv`** — runs [OpenCode](https://opencode.ai) CLI agents inside sandboxes with API call interception
29 changes: 29 additions & 0 deletions docs/reference.md
Expand Up @@ -516,6 +516,35 @@ class OpenEnvEnv(MultiTurnEnv):

OpenEnv integration that runs OpenEnv projects in Prime Sandboxes using a prebuilt image manifest (`.build.json`), supports both gym and MCP contracts, and requires a `prompt_renderer` to convert observations into chat messages.

#### SWEDebugEnv

```python
class SWEDebugEnv(SandboxMixin, MultiTurnEnv):
    def __init__(
        self,
        taskset: SandboxTaskSet,
        dataset: Any = None,
        *,
        run_setup: bool = True,
        debug_step: Literal["none", "gold_patch", "command", "script"] = "gold_patch",
        run_tests: bool = True,
        debug_command: str | None = None,
        debug_script: str | None = None,
        debug_script_path: str | None = None,
        debug_timeout: int | None = None,
        test_timeout: int = 900,
        cpu_cores: int | None = None,
        memory_gb: int | None = None,
        disk_size_gb: int | None = None,
        labels: list[str] | None = None,
        timeout_seconds: float = 1800.0,
        output_tail_chars: int = 2000,
        **sandbox_kwargs,
    ): ...
```

No-agent debugger for SWE-style `SandboxTaskSet` instances. It creates the task sandbox, optionally runs task setup, runs one debug step (`none`, `gold_patch`, `command`, or `script`), and optionally runs tests and scores the result.
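
The control flow described above can be sketched without the library. The following is a minimal, self-contained illustration, not the actual `SWEDebugEnv` implementation: `DebugRun`, `timed`, and `run_debug` are hypothetical names, the sandbox actions are replaced by log entries, the timing keys are made up, and the rule that `command` and `script` require their matching argument is an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DebugRun:
    """Hypothetical stand-in for rollout state; not the library's state object."""
    timings: Dict[str, float] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def timed(run: DebugRun, key: str, action: Callable[[], None]) -> None:
    """Run an action and store its wall-clock duration under `key`."""
    start = time.perf_counter()
    action()
    run.timings[key] = time.perf_counter() - start


def run_debug(
    debug_step: str = "gold_patch",
    run_setup: bool = True,
    run_tests: bool = True,
    debug_command: Optional[str] = None,
    debug_script: Optional[str] = None,
) -> DebugRun:
    """Mimic the documented order: sandbox, setup, one debug step, tests."""
    run = DebugRun()
    timed(run, "sandbox_creation", lambda: run.log.append("create sandbox"))
    if run_setup:
        timed(run, "setup", lambda: run.log.append("taskset.setup(state)"))

    if debug_step == "gold_patch":
        timed(run, "gold_patch", lambda: run.log.append("apply gold patch"))
    elif debug_step == "command":
        if debug_command is None:  # assumed validation rule
            raise ValueError("debug_step='command' needs debug_command")
        timed(run, "debug_command", lambda: run.log.append(f"run: {debug_command}"))
    elif debug_step == "script":
        if debug_script is None:  # assumed validation rule
            raise ValueError("debug_step='script' needs debug_script")
        timed(run, "debug_command", lambda: run.log.append("run uploaded script"))
    elif debug_step != "none":
        raise ValueError(f"unknown debug_step: {debug_step!r}")

    if run_tests:
        timed(run, "tests", lambda: run.log.append("run tests and score"))
    return run
```

Calling `run_debug(debug_step="none", run_tests=False)` in this sketch exercises only sandbox creation and setup, which mirrors how the real flags might be combined to isolate setup time from patch and test time.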

#### EnvGroup

```python
1 change: 1 addition & 0 deletions environments/AGENTS.md
Expand Up @@ -911,6 +911,7 @@ Newer and more experimental environment classes include:
)
```
- **V1 `vf.Env` / `vf.Taskset` / `vf.Harness`** — preferred taskset/harness pattern for composing task data and program execution without subclassing. Use this for new environments that need reusable tasksets, reusable harnesses, config-driven metrics, rewards, toolsets, users, endpoint interception, or sandboxed Python/command programs. `vf.Taskset` owns train/eval rows, prompt shaping, setup/update/reward hooks, and toolsets. `vf.Harness` owns the framework program, endpoint proxy, model controls, sandbox options, and runtime hooks. `vf.Env` wires them into the standard evaluation and training surface.
- **`SWEDebugEnv`** — no-agent debugger for SWE-style `SandboxTaskSet` instances. It creates the task sandbox, optionally runs `taskset.setup(state)`, performs one debug step (`none`, `gold_patch`, `command`, or `script`), and optionally runs the task tests and scorer. It records setup, sandbox creation, gold patch, debug command, and test timings in state for validation and timing investigations.
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- **`RLMEnv`** — implements [Recursive Language Models](https://alexzhang13.github.io/blog/2025/rlm/) for unbounded context processing via REPL-based decomposition and recursive sub-LLM calls
- **`OpenCodeEnv`** — runs [OpenCode](https://opencode.ai) CLI agents inside sandboxes with API call interception
2 changes: 2 additions & 0 deletions verifiers/envs/experimental/__init__.py
Expand Up @@ -8,6 +8,7 @@
    "TaskSet",
    "Harness",
    "ComposableEnv",
    "SWEDebugEnv",
]


Expand All @@ -19,6 +20,7 @@ def __getattr__(name: str):
        "TaskSet": "verifiers.envs.experimental.composable:TaskSet",
        "Harness": "verifiers.envs.experimental.composable:Harness",
        "ComposableEnv": "verifiers.envs.experimental.composable:ComposableEnv",
        "SWEDebugEnv": "verifiers.envs.experimental.composable:SWEDebugEnv",
    }
    if name in _lazy:
        import importlib
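
The hunk above is cut off after `import importlib`, so the rest of the function is not visible here. As a hedged illustration of how a `"module:attribute"` table like `_lazy` is commonly resolved, the sketch below uses standard-library targets. `resolve_lazy` is a hypothetical helper name, and the lookup logic is an assumption about the general pattern, not the repository's actual code.

```python
import importlib

# Hypothetical lazy table using stdlib targets in "module:attribute" form.
_lazy = {
    "OrderedDict": "collections:OrderedDict",
    "sqrt": "math:sqrt",
}


def resolve_lazy(name: str):
    """Import the target module on first use and return the named attribute."""
    if name not in _lazy:
        raise AttributeError(f"no lazy attribute named {name!r}")
    module_path, attr = _lazy[name].split(":")
    module = importlib.import_module(module_path)
    return getattr(module, attr)
```

When a function like this is installed as a module-level `__getattr__`, heavy submodules are imported only when one of their exported names is first accessed, which keeps package import time low.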
2 changes: 2 additions & 0 deletions verifiers/envs/experimental/composable/__init__.py
Expand Up @@ -7,6 +7,7 @@
)
from verifiers.envs.experimental.composable.harness import Harness
from verifiers.envs.experimental.composable.composable_env import ComposableEnv
from verifiers.envs.experimental.composable.swe_debug_env import SWEDebugEnv

__all__ = [
    "SandboxSpec",
Expand All @@ -15,5 +16,6 @@
    "SandboxTaskSet",
    "Harness",
    "ComposableEnv",
    "SWEDebugEnv",
    "discover_sibling_dir",
]