diff --git a/.ai/README.md b/.ai/README.md
new file mode 100644
index 000000000..20eb3b402
--- /dev/null
+++ b/.ai/README.md
@@ -0,0 +1,3 @@
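+# Introduction
+
+We believe users rely on AI tools to get their tasks done. Our goal is to gather task context in a single folder so that users can quickly build their own AI tools that follow that context and keep the task moving forward.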
diff --git a/.ai/task/nav.md b/.ai/task/nav.md
new file mode 100644
index 000000000..2dbc2310f
--- /dev/null
+++ b/.ai/task/nav.md
@@ -0,0 +1,86 @@
+
+```task
+
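+We want to build a benchmark and an agent system for RL post-training (reinforcement-learning post-training).
+
+You can draw heavily on the contents of `rdagent/scenarios/finetune`.
+
+We start with the benchmark.
+
+The final benchmark code will be implemented in rdagent/scenarios/rl/eval
+
+TODO list:
+- [ ] Build an example workspace (a workspace is a solution generated by the agent system) and run it in a Docker environment
+  - Build a dedicated environment for evaluation and write tests that evaluate the example workspace
+  - Example test case: test/rl/test_example_workspace.py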
+
+## Build the benchmark & example workspace
+rdagent/scenarios/rl/eval/AutoRL-Bench/example_workspace
+
+Based on the Dockerfile in `rdagent/scenarios/rl/eval/AutoRL-Bench/env/`,
+write an RL-specific DockerEnv in rdagent/utils/env.py
+
+
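+We now start a new task.
+
+## Build an agent to generate the solution (workspace)
+
+### Step 1:
+
+Refer to the finetune agent; its main workflow is in `rdagent/scenarios/finetune/loop.py`
+
+Implement a scaffold for RL post-training that follows the structure above.
+
+
+### Step 2:
+
+Implement a concrete example within the scaffold.
+
+
+Let's start with step 1.
+
+Add an entry point for the RL scenario, similar to rdagent/app/finetune/llm/loop.py
+
+
+Code structure:
+- `rdagent/scenarios/rl/`: scenario-specific functionality
+- `/rdagent/app/rl`: CLI entry point & configuration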
+
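+## Component notes
+
+CoSTEER: code generation is hard and takes multiple steps. CoSTEER is responsible for executing the plan (the plan comes from the outer loop).
+- for step in all_steps:
+  - run step:
+    - When the step is coding, an inner loop generates the code.
+
+
+- TODO:
+  - Simplify the scaffold logic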
+```
+
+
+[[test/rl/test_example_workspace.py:6]]
+
+
+## Coding Principles
+Don't catch unknown exceptions when implementing new code. I prefer to let the error propagate so it can be detected and fixed promptly.
+
+## (R)un a specific feature
+
+
+```
+```
+
+### For debugging
+
+## (A)I edits
+ <instructions to send to the AI>
+
+## (E)xplanation
+ <key parts for understanding the code>
+
+## (Q)uestions
+ <questions to ask colleagues>
+
+## (B)acklogs
+ <design improvements>
diff --git a/.ai/task/rl-naive-bench.md b/.ai/task/rl-naive-bench.md
new file mode 100644
index 000000000..225f76b9a
--- /dev/null
+++ b/.ai/task/rl-naive-bench.md
@@ -0,0 +1,42 @@
+
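+# Task description
+
+We are developing a minimal benchmark for RL post-training. Development follows these principles:
+- Keeping the code simple is the top priority
+- Performance is out of scope
+
+## Technical decisions:
+
+- We do not want to reinvent repo-level code generation, so we plan to use an existing coder to generate repo-level code.
+  - Candidates: aider, openhands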
+
+## TODO:
+
+- [/] (xiao) A repo-level coder may not provide interfaces that fit the current CoSTEER interface.
+  - related code:
+    - `rdagent/components/coder/CoSTEER/evolving_strategy.py`
+  - This is not required.
+  - Key question:
+    - Do we have requirements to launch multiple runs?
+    - Extremely long code (2~3K lines)
+
+- UI:
+  - Ideal UI: if we use the same framework, we expect a unified UI for all scenarios.
+    - BUT: the current UI may not be general enough for all scenarios.
+
+- Define benchmark interface:
+  - Users (e.g. the agent) interact only with the benchmark's public interface.
+  - Interaction scenarios:
+    - Code in R&D-Agent interacting with the benchmark
+    - ...
+
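+# Coding principles
+Don't catch unknown exceptions when implementing new code. I prefer to let the error propagate so it can be detected and fixed promptly.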
+
+
+
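+# Potential refactoring backlog
+## Framework
+- Simplify the process of building a new CoSTEER coder (xiao is thinking about it)
+  - Related code: `rdagent/components/coder/rl/costeer.py`
+- In `rdagent/core/experiment.py`: can a new Workspace be created in the Generic class?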
diff --git a/.env.example b/.env.example
index 89b5b4398..d11e7eae3 100644
--- a/.env.example
+++ b/.env.example
@@ -55,5 +55,7 @@ EMBEDDING_MODEL="litellm_proxy/BAAI/bge-large-en-v1.5"
 # Cache Setting (Optional):
 # USE_CHAT_CACHE=True
 # USE_EMBEDDING_CACHE=True
+# FT_DOCKER_ENABLE_CACHE=True
+# DS_DOCKER_ENABLE_CACHE=True
 # Senario Configs:
 # ==========================================
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 7f4f8c8c6..086cbc73b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -182,4 +182,13 @@ static/
 # AI assistant
 .cursor/
 .claude/
-AGENTS.md
\ No newline at end of file
+AGENTS.md
+
+scripts/
+
+# AutoRL-Bench (legacy)
+rdagent/scenarios/rl/eval/autorl_bench/runs/
+rdagent/scenarios/rl/eval/autorl_bench/example_workspace/
+
+# Temporary files
+tmp/
diff --git a/rdagent/app/finetune/llm/README.md b/rdagent/app/finetune/llm/README.md
new file mode 100644
index 000000000..a28a14aeb
--- /dev/null
+++ b/rdagent/app/finetune/llm/README.md
@@ -0,0 +1,256 @@
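+# LLM Fine-tuning (FT) Scenario Guide
+
+This document describes how to run the LLM fine-tuning scenario of RD-Agent.
+
+## Introduction
+
+The FT scenario automates improving a large language model's performance on a specific benchmark. The system automatically:
+1. Generates data-processing and training code
+2. Runs model fine-tuning
+3. Evaluates model performance on the target benchmark
+4. Iterates based on feedback
+
+## Supported Benchmarks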
+
+| Category | Benchmark | Dataset | Description |
+|------|-----------|--------|------|
+| Math | `aime24`, `aime25` | `deepscaler` | AIME math competition |
+| Patent | `panorama_par4pc` | `panorama-par4pc` | Patent prior-art retrieval |
+| Patent | `panorama_pi4pc` | `panorama-pi4pc` | Patent paragraph identification |
+| Patent | `panorama_noc4pc` | `panorama-noc4pc` | Patent novelty/obviousness classification |
+| Chemistry | `chemcotbench_mol_und` | `chemcot-mol_und` | Molecule understanding |
+| Chemistry | `chemcotbench_mol_edit` | `chemcot-mol_edit` | Molecule editing |
+| Chemistry | `chemcotbench_mol_opt` | `chemcot-mol_opt` | Molecule optimization |
+| Chemistry | `chemcotbench_reaction` | `chemcot-rxn` | Chemical reaction prediction |
+
+> Dataset configurations live in the `DATASETS` dictionary in `rdagent/scenarios/finetune/datasets/__init__.py`.
+
+> At runtime, the agent inspects all datasets and selects those relevant to the target benchmark and scenario.
+
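+## Environment Setup
+
+### 1. Runtime environment
+
+Make sure the main `rdagent` runtime environment is installed; other required environments are created automatically.
+
+> Set `FT_Coder_CoSTEER_env_type` to `conda` or `docker` in the `.env` file to choose the environment type.
+
+### 2. .env configuration file
+
+Create a `.env` file in the project root, using the following template as a reference: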
+
+```bash
+# ========== API Configuration ==========
+BACKEND=rdagent.oai.backend.LiteLLMAPIBackend
+CHAT_MODEL=gpt-5.2
+CHAT_TEMPERATURE=1
+CHAT_STREAM=True
+OPENAI_API_KEY=sk-xxx
+OPENAI_API_BASE=http://your-api-endpoint
+
+EMBEDDING_MODEL=text-embedding-ada-002
+EMBEDDING_USE_AZURE=True
+
+# ========== Global Configs ==========
+MAX_RETRY=12000
+RETRY_WAIT_SECONDS=5
+MULTI_PROC_N=16
+STEP_SEMAPHORE=1
+
+# ========== Cache Settings ==========
+DUMP_CHAT_CACHE=False
+USE_CHAT_CACHE=False
+DUMP_EMBEDDING_CACHE=True
+USE_EMBEDDING_CACHE=True
+LOG_LLM_CHAT_CONTENT=True
+
+CHAT_FREQUENCY_PENALTY=0.1
+CHAT_PRESENCE_PENALTY=0.0
+
+# ========== FT Scenario Specific ==========
+FT_FILE_PATH=/path/to/your/finetune/workspace
+
+# Environment type: docker or conda
+# Set to "conda" when Docker is unavailable
+FT_Coder_CoSTEER_env_type=conda
+
+# Docker settings (only used when env_type=docker)
+FT_DOCKER_ENABLE_CACHE=True
+FT_UPDATE_LLAMA_FACTORY=False
+
+# Data processing API concurrency (adjust based on target API capacity)
+FT_API_MAX_WORKERS=1000
+
+# Data processing Model
+FT_STRONG_MODELS='["gpt-5", "gpt-5.1"]'
+FT_WEAK_MODELS='["gpt-4o-mini"]'
+
+# Benchmark and target (can be overridden in script)
+FT_TARGET_BENCHMARK=aime25
+FT_USER_TARGET_SCENARIO="I need to enhance the model's performance on math reasoning tasks."
+
+# Timeout settings
+FT_DATA_PROCESSING_TIMEOUT=28800
+
+# Judge settings (optional)
+# FT_JUDGE_MODEL=gpt-5.1
+# FT_JUDGE_RETRY=10
+
+REASONING_THINK_RM=True
+
+# ========== Logging ==========
+LOG_FORMAT_CONSOLE="{time:YYYY-MM-DD HH:mm:ss.SSS} | {level: <8} | <cyan>{process}</cyan> | {name}:{function}:{line} - {message}"
+
+# ========== HuggingFace ==========
+HF_TOKEN=hf_xxx
+```
+
+## How to Run
+
+### Basic command
+
+```bash
+# Activate the conda environment
+conda activate rdagent
+
+# Run the FT scenario
+dotenv run -- python rdagent/app/finetune/llm/loop.py --base-model <MODEL>
+```
+
+### Command-line arguments
+
+| Argument | Description | Example |
+|------|------|------|
+| `--base-model` | Base model name (required; all other arguments are optional) | `Qwen/Qwen2.5-7B-Instruct` |
+| `--benchmark` | Target benchmark | `aime25` |
+| `--benchmark-description` | Benchmark description | - |
+| `--dataset` | Dataset to use | - |
+| `--step-n` | Step limit | `10` |
+| `--loop-n` | Loop limit | `5` |
+| `--timeout` | Total time limit | - |
+
+### Examples
+
+```bash
+# Fine-tune Qwen2.5-7B on AIME25 (benchmark taken from FT_TARGET_BENCHMARK in .env)
+dotenv run -- python rdagent/app/finetune/llm/loop.py \
+    --base-model Qwen/Qwen2.5-7B-Instruct
+
+# Run on specific GPUs
+CUDA_VISIBLE_DEVICES=0,1 dotenv run -- python rdagent/app/finetune/llm/loop.py \
+    --base-model Qwen/Qwen2.5-7B-Instruct
+
+# Limit the number of loops
+dotenv run -- python rdagent/app/finetune/llm/loop.py \
+    --base-model Qwen/Qwen2.5-7B-Instruct \
+    --loop-n 3
+```
+
+### Running multiple tasks in parallel
+
+Create a `tasks.json` configuration file:
+```json
+{
+  "tasks": [
+    {"model": "Qwen/Qwen2.5-7B-Instruct", "benchmark": "aime25", "gpus": "0,1"},
+    {"model": "Qwen/Qwen2.5-7B-Instruct", "benchmark": "gsm8k", "gpus": "2,3"}
+  ]
+}
+```
+
+Run it with the `run_ft_deploy.sh` script:
+```bash
+./run_ft_deploy.sh tasks.json           # normal run
+./run_ft_deploy.sh tasks.json --dry-run # preview the configuration only
+./run_ft_deploy.sh tasks.json --no-sync # disable blob sync
+```
+
+<details>
+<summary>run_ft_deploy.sh script reference</summary>
+
+```bash
+#!/bin/bash
+# Multi-task parallel deployment script (simplified)
+
+RDAGENT_DIR="$HOME/RD-Agent"
+ENV_TEMPLATE=".env.ft"
+STAGGER_DELAY=60
+
+cd "$RDAGENT_DIR"
+source ~/miniconda3/etc/profile.d/conda.sh
+conda activate rdagent
+
+CONFIG_FILE="${1:-tasks.json}"
+NUM_TASKS=$(jq '.tasks | length' "$CONFIG_FILE")
+
+for ((i=0; i<NUM_TASKS; i++)); do
+    model=$(jq -r ".tasks[$i].model" "$CONFIG_FILE")
+    benchmark=$(jq -r ".tasks[$i].benchmark" "$CONFIG_FILE")
+    gpus=$(jq -r ".tasks[$i].gpus" "$CONFIG_FILE")
+
+    # Update the benchmark in .env
+    cp "$ENV_TEMPLATE" .env
+    sed -i "s|^FT_TARGET_BENCHMARK=.*|FT_TARGET_BENCHMARK=$benchmark|" .env
+
+    CUDA_VISIBLE_DEVICES=$gpus \
+    dotenv run -- python rdagent/app/finetune/llm/loop.py --base-model "$model" &
+
+    # The first task waits for environment creation; later tasks are staggered
+    [[ $i -eq 0 ]] && sleep 120 || sleep $STAGGER_DELAY
+done
+
+wait
+```
+
+</details>
+
+## Blob Log Sync
+
+Use Azure Blob to sync log files across multiple machines.
+
+### 1. Generate a SAS token
+
+```bash
+# First, log in to the Azure CLI
+az login
+
+# Generate a token (valid for 7 days by default)
+bash rdagent/utils/blob/gen_token.sh
+
+# Or specify an expiry time
+bash rdagent/utils/blob/gen_token.sh 2025-01-31T00:00Z
+```
+
+The token is saved to `git_ignore_folder/.az_sas_token`.
+
+### 2. Sync logs
+
+Sync path: `log/` ↔ `blob://epeastus/rdagent/FinetuneAgenticLLM/FT_qizheng/logs`
+
+```bash
+# Upload local logs to Blob
+bash rdagent/utils/blob/azsync.sh up
+
+# Download logs from Blob to the local machine
+bash rdagent/utils/blob/azsync.sh down
+```
+
+> To change the remote path, edit the `REMOTE_PATH` variable in `rdagent/utils/blob/azsync.sh`.
+
+## Viewing Logs
+
+Run logs are saved under the `log/` directory:
+
+```
+log/
+└── 2025-01-01_12-00-00-123456/
+    ├── Loop_0/
+    │   ├── direct_exp_gen/   # hypothesis generation
+    │   ├── coding/           # code generation
+    │   ├── running/          # training execution
+    │   └── feedback/         # feedback summary
+    └── Loop_1/
+        └── ...
+```
+
+
diff --git a/rdagent/app/finetune/llm/conf.py b/rdagent/app/finetune/llm/conf.py
index 52ec66d82..3da4bb22a 100644
--- a/rdagent/app/finetune/llm/conf.py
+++ b/rdagent/app/finetune/llm/conf.py
@@ -1,43 +1,125 @@
-import os
+from pathlib import Path
 
 from pydantic_settings import SettingsConfigDict
 
-from rdagent.app.data_science.conf import DS_RD_SETTING
-from rdagent.core.conf import RD_AGENT_SETTINGS, ExtendedBaseSettings
+from rdagent.core.conf import ExtendedBaseSettings
 
 
-class LLMFinetuneScen(ExtendedBaseSettings):
+class LLMFinetunePropSetting(ExtendedBaseSettings):
+    """LLM Fine-tune dedicated property settings.
+
+    - Adjust timeouts and template
+    - Use FT_ env prefix for overrides
+    """
+
     model_config = SettingsConfigDict(env_prefix="FT_", protected_namespaces=())
-    scen: str = "rdagent.app.finetune.llm.scen.LLMFinetuneScen"
+
+    # Main Components
+    scen: str = "rdagent.scenarios.finetune.scen.scenario.LLMFinetuneScen"
+    """Scenario class for LLM fine-tuning tasks."""
+
+    hypothesis_gen: str = "rdagent.scenarios.finetune.proposal.proposal.LLMFinetuneExpGen"
+    """Hypothesis generation class for LLM fine-tuning tasks."""
+
+    coder: str = "rdagent.components.coder.finetune.LLMFinetuneCoSTEER"
+    """Code generator.
+    Function: Generate LLM fine-tuning code based on experiment design.
     """
-    Scenario class for data science tasks.
-    - For Kaggle competitions, use: "rdagent.scenarios.data_science.scen.KaggleScen"
-    - For custom data science scenarios, use: "rdagent.scenarios.data_science.scen.DataScienceScen"
-    - For LLM finetune scenarios, use: "rdagent.app.finetune.llm.scen.LLMFinetuneScen"
-    - For Data science finetune scenarios, use: "rdagent.app.finetune.data_science.scen.DSFinetuneScen"
+
+    runner: str = "rdagent.scenarios.finetune.train.runner.LLMFinetuneRunner"  # TODO
+    """Code runner.
+    Function: Execute LLM fine-tuning code in a Docker environment.
     """
 
-    hypothesis_gen: str = "rdagent.app.finetune.llm.proposal.FinetuneExpGen"
-    """Hypothesis generation class"""
+    summarizer: str = "rdagent.scenarios.finetune.dev.feedback.FTExperiment2Feedback"
+    """Result summarizer - To be implemented.
+    Function: Analyze fine-tuning results and generate feedback, including performance metrics and error analysis.
+    """
 
-    debug_timeout: int = 36000
-    """The timeout limit for running on debugging data"""
+    # Timeouts (longer for LLM training, all for Docker container timeout)
     full_timeout: int = 360000
-    """The timeout limit for running on full data"""
+    """Full training timeout in seconds (default 100 hours, env: FT_FULL_TIMEOUT). Used in running stage for complete model training."""
+    data_processing_timeout: int = 3600
+    """Data processing script timeout in seconds (default 1 hour, env: FT_DATA_PROCESSING_TIMEOUT). Used for full data processing in running stage."""
+    debug_data_processing_timeout: int = 1200
+    """Debug data processing timeout in seconds (default 20 minutes, env: FT_DEBUG_DATA_PROCESSING_TIMEOUT). Used for --debug mode in coding stage."""
+    micro_batch_timeout: int = 1800
+    """Micro-batch test timeout in seconds (default 30 minutes, env: FT_MICRO_BATCH_TIMEOUT)."""
 
+    # Pipeline behavior
     coder_on_whole_pipeline: bool = True
-    enable_model_dump: bool = True
-    app_tpl: str = "app/finetune/llm/tpl"
+    app_tpl: str = "scenarios/finetune"
 
+    # Benchmark evaluation (always enabled as part of evaluation pipeline)
 
-def update_settings(competition: str):
-    """
-    Update the RD_AGENT_SETTINGS with the values from LLM_FINETUNE_SETTINGS.
-    """
-    LLM_FINETUNE_SETTINGS = LLMFinetuneScen()
-    RD_AGENT_SETTINGS.app_tpl = LLM_FINETUNE_SETTINGS.app_tpl
-    os.environ["DS_CODER_COSTEER_EXTRA_EVALUATOR"] = '["rdagent.app.finetune.share.eval.PrevModelLoadEvaluator"]'
-    for field_name, new_value in LLM_FINETUNE_SETTINGS.model_dump().items():
-        if hasattr(DS_RD_SETTING, field_name):
-            setattr(DS_RD_SETTING, field_name, new_value)
-    DS_RD_SETTING.competition = competition
+    benchmark_timeout: int = 0
+    """Benchmark evaluation timeout in seconds. 0 means no timeout."""
+
+    # Judge API configuration (for llmjudge benchmarks like AIME)
+    judge_model: str = "gpt-5.1"
+    """LLM judge model name for evaluation"""
+
+    judge_api_key: str | None = None
+    """API key for judge model (if None, will try to use from environment)"""
+
+    judge_api_base: str | None = None
+    """API base URL for judge model (if None, will use default)"""
+
+    judge_retry: int = 10
+    """Number of retries for LLM judge API calls (env: FT_JUDGE_RETRY)"""
+
+    benchmark_limit: int | None = None
+    """Limit number of samples for benchmark evaluation (None for full evaluation). Use for quick testing and debugging."""
+
+    benchmark_num_runs: int = 1
+    """Number of times to run each sample (for computing average or pass@k). Set >1 for multiple runs."""
+
+    benchmark_pass_k: list[int] | None = None
+    """Pass@k parameter list for code generation tasks (e.g., [1, 5, 10]). None to disable."""
+
+    # Data paths and processing
+    file_path: Path = Path.cwd() / "git_ignore_folder" / "finetune_files"
+    show_nan_columns: bool = False
+    sample_data_by_LLM: bool = True
+
+    # LLM-specific fields
+    user_target_scenario: str | None = None
+    target_benchmark: str | None = None
+    """Benchmark dataset to evaluate on. Supported: aime25, aime24, mmlu, gsm8k, math, etc."""
+    benchmark_description: str | None = None
+    base_model: str | None = None
+    dataset: str | None = None
+    upper_data_size_limit: int = 2000
+
+    # Data processing LLM models (for API calls in data processing scripts)
+    strong_models: list[str] = ["gpt-5", "gpt-5.1"]
+    """Strong models for complex tasks (CoT generation, reasoning) - supports list (env: FT_STRONG_MODELS)"""
+
+    weak_models: list[str] = ["gpt-4o-mini", "o4-mini", "gpt-5-mini"]
+    """Weak models for simple tasks (filtering, format conversion) - supports list (env: FT_WEAK_MODELS)"""
+
+    embedding_models: list[str] = ["text-embedding-3-small", "text-embedding-3-large"]
+
+    # Docker settings
+    docker_enable_cache: bool = False
+    """Enable Docker cache for training (set via FT_DOCKER_ENABLE_CACHE)"""
+
+    # data sample count
+    data_sample_count: int = 3
+
+    # API concurrency for data processing
+    api_max_workers: int = 1000
+    """Max concurrent workers for LLM API calls in data processing scripts (env: FT_API_MAX_WORKERS)"""
+
+    # Coder settings
+    coder_max_loop: int = 10
+
+    # CoT format settings
+    force_think_token: bool = False
+    """Force <think> token wrapping for CoT training data (env: FT_FORCE_THINK_TOKEN).
+    When True: Data must be wrapped in <think>...</think> format, benchmark uses extract-non-reasoning-content postprocessor.
+    When False: CoT reasoning required but format is flexible, no postprocessor needed."""
+
+
+# Global setting instance for LLM finetuning scenario
+FT_RD_SETTING = LLMFinetunePropSetting()
diff --git a/rdagent/app/finetune/llm/job/README.md b/rdagent/app/finetune/llm/job/README.md
new file mode 100644
index 000000000..6eacd6d25
--- /dev/null
+++ b/rdagent/app/finetune/llm/job/README.md
@@ -0,0 +1,131 @@
+# FT Job Runner
+
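+A script for running multiple LLM fine-tuning tasks in parallel as a batch.
+
+## Quick Start
+
+```bash
+# 1. Prepare the environment configuration
+cp .env.template .env
+# Edit .env and fill in the API key and other settings
+
+# 2. Prepare the task configuration
+cp tasks.json.example tasks.json
+# Edit tasks.json to define the tasks to run
+
+# 3. Run
+./run_ft_job.sh
+```
+
+## Usage
+
+```bash
+./run_ft_job.sh [tasks.json]
+```
+
+| Argument | Description |
+|------|------|
+| `tasks.json` | Path to the task configuration file (optional; defaults to `tasks.json` in the same directory) |
+| `-h, --help` | Show help |
+
+### Examples
+
+```bash
+# Use the default configuration
+./run_ft_job.sh
+
+# Specify a custom configuration file
+./run_ft_job.sh /path/to/my_tasks.json
+```
+
+## Configuration Files
+
+### tasks.json
+
+Defines the list of tasks to run in parallel: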
+
+```json
+{
+  "tasks": [
+    {
+      "model": "Qwen/Qwen3-8B",
+      "benchmark": "aime25",
+      "gpus": "0,1"
+    },
+    {
+      "model": "Qwen/Qwen3-8B",
+      "benchmark": "gsm8k",
+      "gpus": "2,3",
+      "scenario": "custom optimization goal"
+    }
+  ]
+}
+```
+
+| Field | Required | Default | Description |
+|------|:----:|--------|------|
+| `model` | ✅ | - | HuggingFace model path |
+| `benchmark` | ✅ | - | Evaluation benchmark (e.g. `aime25`, `gsm8k`) |
+| `gpus` | ❌ | `"0"` | GPU IDs to use |
+| `scenario` | ❌ | `"Improve model performance on {benchmark}"` | Description of the optimization goal |
+
+### .env
+
+Environment configuration file containing API keys, model settings, and so on. Copy it from `.env.template` and edit it:
+
+```bash
+cp .env.template .env
+```
+
+Main settings:
+
+| Setting | Description |
+|------|------|
+| `OPENAI_API_KEY` | OpenAI API key |
+| `OPENAI_API_BASE` | API base URL |
+| `FT_Coder_CoSTEER_env_type` | Environment type: `docker` or `conda` |
+| `HF_TOKEN` | HuggingFace token |
+
+## Output
+
+After a run, a job folder is created under the `log/` directory:
+
+```
+log/2025-12-23/
+├── aime25_Qwen3-8B.log      # task log
+├── gsm8k_Qwen3-8B.log
+└── aime25_Qwen3-8B/         # task trace (loop data)
+    ├── Loop_0/
+    └── ...
+```
+
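+## Monitoring
+
+### Command line
+
+```bash
+# View logs for all tasks
+tail -f log/2025-12-23/*.log
+
+# View a specific task
+tail -f log/2025-12-23/aime25_Qwen3-8B.log
+```
+
+### Web UI
+
+```bash
+streamlit run rdagent/app/finetune/llm/ui/app.py
+```
+
+In the UI, set Job Folder to the corresponding log directory to view the run status.
+
+## Dependencies
+
+- `jq`: JSON parsing tool
+- `conda` environment: `rdagent`
+
+## Notes
+
+1. Tasks are started 60 seconds apart by default (`STAGGER_DELAY`) to avoid resource contention from simultaneous starts
+2. Make sure the specified GPU IDs do not conflict
+3. If you run multiple times on the same day, directories such as `log/2025-12-23_1/` and `log/2025-12-23_2/` are created automatically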
diff --git a/rdagent/app/finetune/llm/job/run_ft_job.sh b/rdagent/app/finetune/llm/job/run_ft_job.sh
new file mode 100755
index 000000000..79f68ef32
--- /dev/null
+++ b/rdagent/app/finetune/llm/job/run_ft_job.sh
@@ -0,0 +1,186 @@
+#!/bin/bash
+# Run multiple FT tasks in parallel under a single job directory
+#
+# Usage: ./run_ft_job.sh [tasks.json]
+#
+# Config format (tasks.json):
+# {
+#   "tasks": [
+#     {"model": "Qwen/Qwen3-8B", "benchmark": "aime25", "gpus": "0,1"},
+#     {"model": "Qwen/Qwen3-8B", "benchmark": "gsm8k", "gpus": "2,3"}
+#   ]
+# }
+
+set -e
+
+# ========== CONFIG ==========
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+RDAGENT_DIR="$(cd "$SCRIPT_DIR/../../../../.." && pwd)"
+ENV_FILE="$SCRIPT_DIR/.env"
+SCENARIOS_FILE="$SCRIPT_DIR/scenarios.json"
+STAGGER_DELAY=60
+
+usage() {
+    echo "Usage: $0 [tasks.json]"
+    echo "Run multiple FT tasks under a single job directory."
+    echo "UI: streamlit run rdagent/app/finetune/llm/ui/app.py"
+    exit 0
+}
+
+# ========== PARSE ARGS ==========
+CONFIG_FILE=""
+
+for arg in "$@"; do
+    case $arg in
+        -h|--help) usage ;;
+        *) [[ -z "$CONFIG_FILE" ]] && CONFIG_FILE="$arg" ;;
+    esac
+done
+
+[[ -z "$CONFIG_FILE" ]] && CONFIG_FILE="$SCRIPT_DIR/tasks.json"
+[[ ! -f "$CONFIG_FILE" ]] && echo "Error: Config not found: $CONFIG_FILE" && exit 1
+
+# Check .env file
+if [[ ! -f "$ENV_FILE" ]]; then
+    echo "Error: .env not found at $ENV_FILE"
+    echo "Please create it from template: cp $SCRIPT_DIR/.env.template $ENV_FILE"
+    exit 1
+fi
+
+# Check jq
+command -v jq &>/dev/null || { echo "Error: jq required"; exit 1; }
+
+# ========== SETUP ==========
+# Get log and workspace base paths from environment or use defaults
+# Default to project-relative paths; can be overridden by environment variables
+FT_LOG_BASE="${FT_LOG_BASE:-$RDAGENT_DIR/log}"
+FT_WORKSPACE_BASE="${FT_WORKSPACE_BASE:-$RDAGENT_DIR/git_ignore_folder/RD-Agent_workspace}"
+
+JOB_ID=$(date +%Y-%m-%d_%H-%M)
+JOB_DIR="$FT_LOG_BASE/$JOB_ID"
+if [[ -d "$JOB_DIR" ]]; then
+    i=1; while [[ -d "${JOB_DIR}_$i" ]]; do ((i++)); done
+    JOB_ID="${JOB_ID}_$i"; JOB_DIR="${JOB_DIR}_$i"
+fi
+mkdir -p "$JOB_DIR"
+
+cd "$RDAGENT_DIR"
+
+NUM_TASKS=$(jq '.tasks | length' "$CONFIG_FILE")
+
+echo "=============================================="
+echo "FT Job: $JOB_ID"
+echo "=============================================="
+echo "Config:    $CONFIG_FILE"
+echo "Tasks:     $NUM_TASKS"
+echo "Log:       $JOB_DIR"
+echo "Workspace: $FT_WORKSPACE_BASE/$JOB_ID"
+echo ""
+
+# Setup tmux session
+TMUX_SESSION="rdagent"
+tmux kill-session -t "$TMUX_SESSION" 2>/dev/null || true
+tmux new-session -d -s "$TMUX_SESSION" -n "main"
+echo "Tmux session created: $TMUX_SESSION"
+echo ""
+
+for ((i=0; i<NUM_TASKS; i++)); do
+    model=$(jq -r ".tasks[$i].model" "$CONFIG_FILE")
+    benchmark=$(jq -r ".tasks[$i].benchmark" "$CONFIG_FILE")
+    gpus=$(jq -r ".tasks[$i].gpus // \"0\"" "$CONFIG_FILE")
+    port=$(jq -r ".tasks[$i].port // empty" "$CONFIG_FILE")
+    task_timeout=$(jq -r ".tasks[$i].timeout // \"12h\"" "$CONFIG_FILE")
+
+    # Load benchmark_description: tasks.json -> scenarios.json
+    benchmark_desc=$(jq -r ".tasks[$i].benchmark_description // empty" "$CONFIG_FILE")
+    if [[ -z "$benchmark_desc" ]]; then
+        benchmark_desc=$(jq -r ".[\"$benchmark\"].benchmark_description // empty" "$SCENARIOS_FILE")
+    fi
+    # Note: Special characters in benchmark_desc are handled by writing to env file
+    model_name=$(basename "$model")
+    task_name="${benchmark}_${model_name}"
+    trace_path="$JOB_DIR/$task_name"
+
+    port_info=""
+    [[ -n "$port" ]] && port_info=", port=$port"
+    echo "Task $i: $task_name (model=$model, benchmark=$benchmark, gpus=$gpus$port_info)"
+
+    # Run task in tmux window with script -c for output capture
+    task_workspace="$FT_WORKSPACE_BASE/$JOB_ID/$task_name"
+    mkdir -p "$task_workspace"
+    LOG_FILE="$JOB_DIR/${task_name}.log"
+
+    # Write task-specific env file (avoids command-line escaping issues with special chars)
+    TASK_ENV_FILE="$task_workspace/.task_env"
+    cat > "$TASK_ENV_FILE" << EOF
+CUDA_VISIBLE_DEVICES='$gpus'
+LOG_TRACE_PATH='$trace_path'
+WORKSPACE_PATH='$task_workspace'
+FT_TARGET_BENCHMARK='$benchmark'
+EOF
+    # Escape shell special characters for double-quoted string: \ " ` $
+    if [[ -n "$benchmark_desc" ]]; then
+        escaped_desc="$benchmark_desc"
+        escaped_desc="${escaped_desc//\\/\\\\}"  # \ -> \\
+        escaped_desc="${escaped_desc//\"/\\\"}"  # " -> \"
+        escaped_desc="${escaped_desc//\`/\\\`}"  # ` -> \`
+        escaped_desc="${escaped_desc//\$/\\\$}"  # $ -> \$
+        echo "FT_BENCHMARK_DESCRIPTION=\"$escaped_desc\"" >> "$TASK_ENV_FILE"
+    fi
+    [[ -n "$port" ]] && echo "OPENAI_API_BASE='http://localhost:$port'" >> "$TASK_ENV_FILE"
+
+    # Create tmux window for this task and get its full target (e.g., rdagent:1.0)
+    # Use "session:" format to ensure window is created in the correct session
+    WIN_TARGET=$(tmux new-window -t "$TMUX_SESSION:" -n "$benchmark" -P)
+
+    # Build the command with environment setup (env vars loaded from file)
+    timeout_arg=""
+    [[ -n "$task_timeout" ]] && timeout_arg="--timeout $task_timeout"
+
+    TASK_CMD="source ~/miniconda3/etc/profile.d/conda.sh && conda activate qz_rdagent"
+    TASK_CMD="$TASK_CMD && set -a && source '$ENV_FILE' && source '$TASK_ENV_FILE' && set +a"
+    TASK_CMD="$TASK_CMD && cd '$RDAGENT_DIR'"
+    TASK_CMD="$TASK_CMD && python rdagent/app/finetune/llm/loop.py --base-model '$model' $timeout_arg"
+
+    # Run with script -c to capture terminal output (using full target for reliability)
+    tmux send-keys -t "$WIN_TARGET" "script -q '$LOG_FILE' -c \"$TASK_CMD\"" Enter
+
+    echo "  Window:    $benchmark"
+    echo ""
+
+    # Stagger starts
+    if [[ $i -eq 0 ]]; then
+        # First task: wait for initialization
+        # Get FT_FILE_PATH from .env or use default
+        FT_FILE_PATH=$(grep -E "^FT_FILE_PATH=" "$ENV_FILE" | cut -d= -f2 | tr -d '"' || echo "")
+        [[ -z "$FT_FILE_PATH" ]] && FT_FILE_PATH="$RDAGENT_DIR/git_ignore_folder/finetune"
+        DATASET_INFO="$FT_FILE_PATH/datasets/dataset_info.json"
+
+        echo "  Waiting for scenario initialization (dataset_info.json)..."
+        while [[ ! -f "$DATASET_INFO" ]]; do
+            sleep 5
+        done
+        echo "  Scenario initialized!"
+
+        echo "  Waiting for llm_finetune conda env..."
+        while ! conda run -n llm_finetune python -c "import requests" 2>/dev/null; do
+            sleep 10
+        done
+
+        echo "  Waiting for opencompass conda env..."
+        while ! conda run -n opencompass python -c "import opencompass" 2>/dev/null; do
+            sleep 10
+        done
+        echo "  Environment ready!"
+    elif [[ $i -lt $((NUM_TASKS - 1)) ]]; then
+        sleep $STAGGER_DELAY
+    fi
+done
+
+echo "=============================================="
+echo "All tasks started in tmux session: $TMUX_SESSION"
+echo "  - Attach:  tmux attach -t $TMUX_SESSION"
+echo "  - List:    tmux list-windows -t $TMUX_SESSION"
+echo "  - Select:  tmux select-window -t $TMUX_SESSION:{window_name}"
+echo "Monitor: tail -f $JOB_DIR/*.log"
+echo "UI: streamlit run rdagent/app/finetune/llm/ui/app.py (Job Folder: $JOB_DIR)"
diff --git a/rdagent/app/finetune/llm/job/scenarios.json b/rdagent/app/finetune/llm/job/scenarios.json
new file mode 100644
index 000000000..07dfd2803
--- /dev/null
+++ b/rdagent/app/finetune/llm/job/scenarios.json
@@ -0,0 +1,128 @@
+{
+  "_comment": "Benchmark scenarios for FT tasks. Used by run_ft_job.sh and UI.",
+
+  "aime24": {
+    "category": "math",
+    "scenario": "Improve the model's ability to solve advanced competition math problems through multi-step reasoning, including number theory, combinatorics, geometry, and algebraic manipulation, with answers expressed as integers from 0 to 999.",
+    "benchmark_description": "AIME 2024 (American Invitational Mathematics Examination) - Advanced high school math competition problems requiring creative problem-solving. Each answer is an integer 0-999. Topics include number theory, algebra, geometry, trigonometry, probability, and combinatorics. Problems require multi-step reasoning and often have elegant solutions. Expected Output Format: Put final answer within \\boxed{}, e.g., \\boxed{42}."
+  },
+  "aime25": {
+    "category": "math",
+    "scenario": "Improve the model's ability to solve advanced competition math problems through multi-step reasoning, including number theory, combinatorics, geometry, and algebraic manipulation, with answers expressed as integers from 0 to 999.",
+    "benchmark_description": "AIME 2025 (American Invitational Mathematics Examination) - Advanced high school math competition problems requiring creative problem-solving. Each answer is an integer 0-999. Topics include number theory, algebra, geometry, trigonometry, probability, and combinatorics. Problems require multi-step reasoning and often have elegant solutions. Expected Output Format: Put final answer within \\boxed{}, e.g., \\boxed{42}."
+  },
+  "panorama": {
+    "category": "patent",
+    "scenario": "Improve the model's patent examination capabilities including prior art retrieval, paragraph identification, and novelty/obviousness classification based on USPTO standards.",
+    "benchmark_description": "PANORAMA tests patent examination capabilities based on real USPTO Office Actions. Tasks include: retrieving relevant prior art patents, identifying specific paragraphs in prior art that relate to claims, and classifying claims as allowable, lacking novelty (102), or obvious (103). Requires understanding patent law, technical document analysis, and legal reasoning. Expected Output Format: Return JSON with task-specific format (see subtask descriptions)."
+  },
+  "panorama_par4pc": {
+    "category": "patent",
+    "scenario": "Improve the model's ability to retrieve relevant prior art patents given a patent claim, by understanding claim scope, identifying technical similarities, and ranking patents by relevance for rejection analysis.",
+    "benchmark_description": "PAR4PC (Prior Art Retrieval for Patent Claims) - Given a patent claim, retrieve the most relevant prior art patents from a candidate pool. Requires understanding claim scope, identifying technical similarities, and ranking patents by relevance for potential 35 USC 102/103 rejections. Expected Output Format: Return JSON: {\"answer\": \"A\"} for single patent or {\"answer\": [\"A\", \"C\"]} for multiple patents (codes A-H)."
+  },
+  "panorama_pi4pc": {
+    "category": "patent",
+    "scenario": "Improve the model's ability to identify specific paragraphs in prior art patents that are most relevant for evaluating a claim's novelty and obviousness through element-by-element analysis.",
+    "benchmark_description": "PI4PC (Paragraph Identification for Patent Claims) - Given a patent claim and cited prior art patent, identify specific paragraphs most relevant for evaluating novelty and obviousness. Requires detailed technical reading, element-by-element claim analysis, and understanding how prior art teachings map to claim limitations. Expected Output Format: Return JSON: {\"answer\": \"<paragraph_id>\"}."
+  },
+  "panorama_noc4pc": {
+    "category": "patent",
+    "scenario": "Improve the model's ability to classify patent claims as allowable, anticipated, or obvious by applying patent law standards to analyze claim limitations against prior art.",
+    "benchmark_description": "NOC4PC (Novelty/Obviousness Classification) - Classify patent claims as ALLOW (patentable), 102 (anticipated/lacks novelty), or 103 (obvious). Requires applying patent law standards: 102 when single reference discloses all elements, 103 when combination of references with motivation makes claim obvious to skilled artisan. Expected Output Format: Return JSON: {\"code\": \"ALLOW\"} or {\"code\": \"102\"} or {\"code\": \"103\"}."
+  },
+  "panorama_par4pc_cot": {
+    "category": "patent",
+    "scenario": "Improve the model's ability to retrieve relevant prior art patents while providing explicit chain-of-thought reasoning explaining which claim elements each patent teaches and how it supports a rejection.",
+    "benchmark_description": "PAR4PC with chain-of-thought - Retrieve relevant prior art while providing explicit reasoning. Explain why each retrieved patent is relevant: which claim elements it teaches, what technical problems it addresses, and how it could support a rejection. Expected Output Format: Provide reasoning first, then return JSON: {\"answer\": \"A\"} or {\"answer\": [\"A\", \"C\"]}."
+  },
+  "panorama_pi4pc_cot": {
+    "category": "patent",
+    "scenario": "Improve the model's ability to identify relevant prior art paragraphs while providing element-by-element mapping showing how specific paragraph teachings correspond to claim limitations.",
+    "benchmark_description": "PI4PC with chain-of-thought - Identify relevant prior art paragraphs while explaining the technical connections. Provide element-by-element mapping showing how specific paragraph teachings correspond to claim limitations. Expected Output Format: Provide reasoning first, then return JSON: {\"answer\": \"<paragraph_id>\"}."
+  },
+  "panorama_noc4pc_cot": {
+    "category": "patent",
+    "scenario": "Improve the model's ability to classify patent claims with examiner-style rationale, explaining how references anticipate limitations or how combinations with motivation render claims obvious.",
+    "benchmark_description": "NOC4PC with chain-of-thought - Classify claims with examiner-style rationale. For 102: explain how reference anticipates each limitation. For 103: identify references, explain motivation to combine, and show how combination renders claim obvious. Use proper USPTO citation format. Expected Output Format: Return JSON: {\"reason\": \"<Office Action analysis>\", \"code\": \"ALLOW\"|\"102\"|\"103\"}."
+  },
+
+  "chemcotbench": {
+    "category": "chemistry",
+    "scenario": "Improve the model's chemical reasoning capabilities on molecular structures including understanding molecular features, editing molecules, optimizing properties, and predicting reaction outcomes.",
+    "benchmark_description": "ChemCoTBench tests step-wise chemical reasoning on SMILES molecular structures. Tasks include molecule understanding (identify functional groups, ring systems), molecule editing (add/delete/substitute groups while maintaining validity), molecule optimization (modify for desired properties), and reaction prediction (products, mechanisms, conditions). Contains subtasks with different output requirements. Expected Output Format: Return JSON: {\"output\": \"<answer>\"} where answer format depends on subtask - SMILES string for molecular tasks, numeric count for counting tasks, or Yes/No for equivalence tasks."
+  },
+  "chemcotbench_mol_und": {
+    "category": "chemistry",
+    "scenario": "Improve the model's ability to analyze molecular structures and identify structural features including functional groups (hydroxyl, carboxyl, amine), ring systems (aromatic, aliphatic), and molecular scaffolds.",
+    "benchmark_description": "Molecule Understanding - Analyze SMILES strings for structural features. Subtasks: (1) fg_count/ring_count: return integer count, (2) equivalence/ring_system_scaffold: return Yes or No, (3) Murcko_scaffold: return SMILES string. Requires parsing SMILES notation and applying organic chemistry knowledge. Expected Output Format: Return JSON: {\"output\": \"<answer>\"} where answer is integer/Yes/No/SMILES depending on subtask."
+  },
+  "chemcotbench_mol_edit": {
+    "category": "chemistry",
+    "scenario": "Improve the model's ability to perform precise structural modifications on molecules (add, delete, substitute functional groups) while maintaining chemical validity and molecule integrity.",
+    "benchmark_description": "Molecule Editing - Perform structural modifications on SMILES. Subtasks: add (add functional group), delete (remove group), sub (substitute group). Output must be valid SMILES representing chemically feasible molecules. Expected Output Format: Return JSON: {\"output\": \"<valid SMILES>\"}. SMILES validity is verified using RDKit."
+  },
+  "chemcotbench_mol_opt": {
+    "category": "chemistry",
+    "scenario": "Improve the model's ability to modify molecular structures to achieve target properties such as improved solubility, drug-likeness, or binding affinity to specific biological targets.",
+    "benchmark_description": "Molecule Optimization - Modify structures to achieve target properties. Subtasks: drd/gsk/jnk (binding affinity to DRD2/GSK3β/JNK3 targets), logp (lipophilicity), qed (drug-likeness), solubility. Requires understanding structure-property relationships. Expected Output Format: Return JSON: {\"output\": \"<optimized SMILES>\"}."
+  },
+  "chemcotbench_reaction": {
+    "category": "chemistry",
+    "scenario": "Improve the model's ability to predict chemical reaction outcomes including forward synthesis, retrosynthesis, mechanism selection, and reaction conditions based on functional group transformations.",
+    "benchmark_description": "Reaction Prediction - Predict reaction outcomes. Subtasks: fs (forward synthesis: reactants→products), retro (retrosynthesis: products→reactants), rcr (reaction condition recommendation), nepp (next elementary-step product prediction), mechsel (mechanism selection). Requires understanding reaction types and functional group transformations. Expected Output Format: Return JSON: {\"output\": \"<SMILES or text answer>\"}."
+  },
+
+  "tablebench_data_analysis": {
+    "category": "table_qa",
+    "scenario": "Improve the model's ability to analyze tabular data for complex questions including trend identification, correlation analysis, statistical computation, and data-driven forecasting.",
+    "benchmark_description": "Table Data Analysis - Analyze tabular data to answer complex questions. Subtask types with different evaluation: (1) CorrelationAnalysis/TrendForecasting/StatisticalAnalysis: numeric answers with ±10% relative error tolerance, (2) ImpactAnalysis: exact match required, (3) Other analysis types: evaluated using ROUGE-L. Requires reading tables accurately and applying analytical reasoning. Expected Output Format: End response with \"Final Answer: <value>\"."
+  },
+  "tablebench_fact_checking": {
+    "category": "table_qa",
+    "scenario": "Improve the model's ability to verify factual claims against tabular data through accurate data extraction, implicit relationship understanding, and multi-hop reasoning across table cells.",
+    "benchmark_description": "Table Fact Checking - Answer table-based factual questions accurately. Questions may ask for specific information (numbers, names, dates) or verification (Yes/No, True/False). Uses Exact Match evaluation. Expected Output Format: End response with \"Final Answer: <value>\" where value is the precise answer to the question."
+  },
+  "tablebench_numerical_reasoning": {
+    "category": "table_qa",
+    "scenario": "Improve the model's ability to perform mathematical operations on table data including arithmetic, aggregations (sum, average, count), comparisons, percentages, and multi-step calculations.",
+    "benchmark_description": "Table Numerical Reasoning - Perform mathematical operations on table data: arithmetic (sum, difference, product), aggregations (average, count, max/min), comparisons, percentages, and multi-step calculations. Requires accurate number extraction and correct mathematical computation. Expected Output Format: End response with \"Final Answer: <numeric value>\"."
+  },
+  "tablebench_visualization": {
+    "category": "table_qa",
+    "scenario": "Improve the model's ability to generate Python code that creates appropriate visualizations (bar, line, pie, scatter charts) from tabular data with correct chart type selection and data mapping.",
+    "benchmark_description": "Table Visualization - Generate Python code to create appropriate visualizations from tabular data: bar charts, line charts, pie charts, scatter plots. Select correct chart type for data, map columns correctly to axes, and produce executable matplotlib/pandas code. Expected Output Format: Return Python code in ```python code block using matplotlib/pandas. Code will be executed to verify correctness."
+  },
+  "tablebench_gen": {
+    "category": "table_qa",
+    "scenario": "Improve the model's overall table question answering capabilities across fact checking, numerical reasoning, data analysis, and visualization by understanding table structure and generating accurate answers.",
+    "benchmark_description": "TableBench General - Comprehensive table QA covering fact checking, numerical reasoning, data analysis, and visualization. Questions require understanding table structure, extracting relevant data, performing reasoning or computation, and generating accurate answers or code. Expected Output Format: End response with \"Final Answer: <answer>\"."
+  },
+
+  "FinanceIQ_gen": {
+    "category": "finance",
+    "scenario": "Improve the model's financial domain knowledge and reasoning capabilities across Chinese financial certification exams including CPA, banking, securities, fund, futures, insurance, tax, and actuarial qualifications through multiple-choice question answering.",
+    "benchmark_description": "FinanceIQ tests financial domain knowledge through multiple-choice questions (A/B/C/D). Covers 10 Chinese financial certification exams: CPA (注册会计师), banking qualification, securities qualification, fund qualification, futures qualification, insurance qualification (CICE), tax advisor, economist, financial planner, and actuary. Uses LLM Judge for evaluation with 5-shot in-context learning. Evaluation metric: accuracy."
+  },
+
+  "bioprobench_gen": {
+    "category": "biology",
+    "scenario": "Improve the model's ability to generate complete, detailed experimental protocol steps from research context, including specific reagent concentrations, temperatures, incubation times, and equipment settings.",
+    "benchmark_description": "Protocol Generation - Generate complete experimental protocol steps given research context and objectives. Output detailed, actionable instructions: specify reagent concentrations, temperatures, incubation times, equipment settings. Protocols must be scientifically valid and reproducible. Expected Output Format: Wrap protocol steps in [ANSWER_START]Step 1: ... Step 2: ...[ANSWER_END]. Evaluated using BLEU, ROUGE, and step matching metrics."
+  },
+  "bioprobench_ord": {
+    "category": "biology",
+    "scenario": "Improve the model's ability to arrange shuffled experimental procedure steps in correct logical and temporal sequence by understanding procedural dependencies and scientific workflow logic.",
+    "benchmark_description": "Step Ordering - Arrange shuffled experimental procedure steps in correct logical and temporal sequence. Requires understanding procedural dependencies: which steps must precede others, timing constraints, and scientific logic of experimental workflows. Expected Output Format: Return [ANSWER_START][0, 2, 1, 3][ANSWER_END] with indices as a Python list. Evaluated using Exact Match and Kendall's Tau."
+  },
+  "bioprobench_err": {
+    "category": "biology",
+    "scenario": "Improve the model's ability to identify and correct errors in biological protocol text including incorrect temperatures, concentrations, reagents, missing steps, and procedural mistakes.",
+    "benchmark_description": "Error Correction - Identify and correct errors in biological protocol text. Errors include: incorrect temperatures (e.g., 37°C vs 4°C), wrong concentrations, inappropriate reagents, missing steps, or procedural mistakes. Requires domain expertise to spot scientifically incorrect instructions. Expected Output Format: Return [ANSWER_START]True[ANSWER_END] if protocol has errors, or [ANSWER_START]False[ANSWER_END] if correct."
+  },
+  "bioprobench_pqa": {
+    "category": "biology",
+    "scenario": "Improve the model's ability to extract specific factual information from experimental protocols including temperatures, concentrations, incubation times, reagent quantities, and procedural details.",
+    "benchmark_description": "Protocol QA - Extract specific factual information from experimental protocols: temperatures, concentrations, incubation times, reagent quantities, equipment specifications, and procedural details. Requires careful reading and accurate information extraction from technical text. Expected Output Format: Return [ANSWER_START]<answer text> & <confidence 0-100>%[ANSWER_END], e.g., [ANSWER_START]Option A & 95%[ANSWER_END]. Evaluated using accuracy and Brier Score."
+  }
+}
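Review note: the benchmark descriptions above name three recurring answer conventions: a trailing `Final Answer: <value>` line, an `[ANSWER_START]...[ANSWER_END]` wrapper, and a flat JSON object such as `{"answer": "A"}`. The sketch below shows one way an evaluator might pull the answer out of a model response for each convention. The function names are hypothetical and not part of this patch, and the actual benchmark graders may parse responses differently.

```python
import json
import re
from typing import Any, Optional


def extract_final_answer(text: str) -> Optional[str]:
    """Return the value after the last 'Final Answer:' marker, or None."""
    matches = re.findall(r"Final Answer:\s*(.+)", text)
    return matches[-1].strip() if matches else None


def extract_tagged_answer(text: str) -> Optional[str]:
    """Return the text inside the last [ANSWER_START]...[ANSWER_END] pair, or None."""
    matches = re.findall(r"\[ANSWER_START\](.*?)\[ANSWER_END\]", text, re.DOTALL)
    return matches[-1].strip() if matches else None


def extract_json_field(text: str, field: str) -> Optional[Any]:
    """Return `field` from the last flat (non-nested) JSON object in the text, or None."""
    for candidate in reversed(re.findall(r"\{[^{}]*\}", text)):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the previous candidate
        if isinstance(obj, dict) and field in obj:
            return obj[field]
    return None
```

Taking the last match in each case lets chain-of-thought text precede the answer, which matches the "Provide reasoning first, then return JSON" variants.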
diff --git a/rdagent/app/finetune/llm/job/tasks.json.example b/rdagent/app/finetune/llm/job/tasks.json.example
new file mode 100644
index 000000000..c0f60df74
--- /dev/null
+++ b/rdagent/app/finetune/llm/job/tasks.json.example
@@ -0,0 +1,20 @@
+{
+  "tasks": [
+    {
+      "model": "Qwen/Qwen3-8B",
+      "benchmark": "aime25",
+      "gpus": "0,1"
+    },
+    {
+      "model": "Qwen/Qwen3-8B",
+      "benchmark": "gsm8k",
+      "gpus": "2,3"
+    },
+    {
+      "model": "meta-llama/Llama-3-8B",
+      "benchmark": "aime25",
+      "gpus": "4,5",
+      "scenario": "Improve AIME 2025 math reasoning with custom approach"
+    }
+  ]
+}
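Review note: each task in the example above carries `model`, `benchmark`, and `gpus`, with `scenario` as an optional override. A minimal validation sketch follows, assuming tasks run concurrently and therefore should not share GPU indices. That assumption is inferred from the disjoint `gpus` values in the example and is not stated by the patch. `validate_tasks` and `parse_gpus` are hypothetical helpers.

```python
REQUIRED_KEYS = ("model", "benchmark", "gpus")


def parse_gpus(spec: str) -> set:
    """Turn a comma-separated GPU string such as '0,1' into a set of ints."""
    return {int(part) for part in spec.split(",") if part.strip()}


def validate_tasks(config: dict) -> list:
    """Check required keys and (by assumption) that no GPU index is assigned twice."""
    tasks = config["tasks"]
    used = set()
    for index, task in enumerate(tasks):
        missing = [key for key in REQUIRED_KEYS if key not in task]
        if missing:
            raise ValueError(f"task {index} is missing keys: {missing}")
        gpus = parse_gpus(task["gpus"])
        overlap = used & gpus
        if overlap:
            raise ValueError(f"task {index} reuses GPUs {sorted(overlap)}")
        used |= gpus
    return tasks
```

If jobs are instead scheduled one after another, the overlap check would be unnecessary and could be dropped.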
diff --git a/rdagent/app/finetune/llm/loop.py b/rdagent/app/finetune/llm/loop.py
index a7fb087e5..1839020ac 100644
--- a/rdagent/app/finetune/llm/loop.py
+++ b/rdagent/app/finetune/llm/loop.py
@@ -1,40 +1,101 @@
+"""
+LLM Fine-tuning Entry Point
+
+Standard RDLoop entry point for LLM fine-tuning, consistent with the data science implementation.
+"""
+
 import asyncio
-from pathlib import Path
+from typing import Optional, cast
 
-import fire
+import typer
+from typing_extensions import Annotated
 
-from rdagent.app.data_science.conf import DS_RD_SETTING
-from rdagent.app.finetune.llm.conf import update_settings
-from rdagent.core.utils import import_class
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
 from rdagent.log import rdagent_logger as logger
-from rdagent.scenarios.data_science.loop import DataScienceRDLoop
+from rdagent.scenarios.finetune.loop import LLMFinetuneRDLoop
 
 
 def main(
-    model: str | None = None,
-    dataset: str | None = None,
+    path: Optional[str] = None,
+    checkout: Annotated[bool, typer.Option("--checkout/--no-checkout", "-c/-C")] = True,
+    user_target_scenario: Optional[str] = None,
+    benchmark: Optional[str] = None,
+    benchmark_description: Optional[str] = None,
+    dataset: Optional[str] = None,
+    base_model: Optional[str] = None,
+    upper_data_size_limit: Optional[int] = None,
+    step_n: Optional[int] = None,
+    loop_n: Optional[int] = None,
+    timeout: Optional[str] = None,
 ):
     """
+    LLM fine-tuning entry point
+
     Parameters
     ----------
-    dataset :
-        Dateset name, used for finetune.
+    path :
+        A path like `$LOG_PATH/__session__/1/0_propose`. This indicates that we restore the state after finishing step 0 in loop 1.
+    checkout :
+        Used to control the log session path. Boolean type, default is True.
+        - If True, the new loop will use the existing folder and clear logs for sessions after the one corresponding to the given path.
+        - If False, the new loop will use the existing folder but keep the logs for sessions after the one corresponding to the given path.
+    dataset : str, optional
+        Dataset name for fine-tuning (e.g., 'shibing624/alpaca-zh').
+    base_model : str, optional
+        Model name for fine-tuning (e.g., 'Qwen/Qwen2.5-1.5B-Instruct').
+        If not provided, auto-selects optimal model based on hardware and dataset.
+    step_n : int, optional
+        Number of steps to run; if None, runs indefinitely until completion or error
+    loop_n : int, optional
+        Number of loops to run; if None, runs indefinitely until completion or error
+    timeout : str, optional
+        Maximum duration for the entire process
 
-    Auto R&D Evolving loop for models finetune.
-    You can continue running a session by using the command:
+    Examples:
     .. code-block:: bash
-        dotenv run -- python rdagent/app/finetune/llm/loop.py --dataset shibing624/alpaca-zh
+        dotenv run -- python rdagent/app/finetune/llm/loop.py --dataset shibing624/alpaca-zh --base-model Qwen/Qwen2.5-1.5B-Instruct
+        dotenv run -- python rdagent/app/finetune/llm/loop.py --dataset shibing624/alpaca-zh    # TODO: not enabled yet
     """
-    if not dataset:
-        raise Exception("Please specify dataset name.")
 
-    model_folder = Path(DS_RD_SETTING.local_data_path) / dataset / "prev_model"
-    if not model_folder.exists():
-        raise Exception(f"Please put the model path to {model_folder}.")
-    update_settings(dataset)
-    rd_loop: DataScienceRDLoop = DataScienceRDLoop(DS_RD_SETTING)
-    asyncio.run(rd_loop.run())
+    if user_target_scenario:
+        FT_RD_SETTING.user_target_scenario = user_target_scenario
+    assert (
+        FT_RD_SETTING.user_target_scenario is None
+    ), "user_target_scenario is not yet supported, please specify via benchmark and benchmark_description"
+    if upper_data_size_limit:
+        FT_RD_SETTING.upper_data_size_limit = upper_data_size_limit
+        logger.info(f"Set upper_data_size_limit to {FT_RD_SETTING.upper_data_size_limit}")
+    if benchmark and benchmark_description:
+        FT_RD_SETTING.target_benchmark = benchmark
+        FT_RD_SETTING.benchmark_description = benchmark_description
+    assert FT_RD_SETTING.user_target_scenario or (
+        FT_RD_SETTING.target_benchmark and FT_RD_SETTING.benchmark_description
+    ), "Either user_target_scenario or both target_benchmark and benchmark_description must be specified for LLM fine-tuning."
+
+    # Update configuration with provided parameters
+    if dataset:
+        FT_RD_SETTING.dataset = dataset
+    if base_model:
+        FT_RD_SETTING.base_model = base_model
+
+    # Create and run LLM fine-tuning loop
+    data_set_target = FT_RD_SETTING.dataset if FT_RD_SETTING.dataset else "auto generated dataset"
+    model_target = FT_RD_SETTING.base_model if FT_RD_SETTING.base_model else "auto selected model"
+
+    # Temporary assertion until auto-selection is implemented
+    assert (
+        FT_RD_SETTING.base_model is not None
+    ), "Base model auto selection not yet supported, please specify via --base-model"
+
+    logger.info(f"Starting LLM fine-tuning on dataset='{data_set_target}' with model='{model_target}'")
+
+    if path is None:
+        loop = LLMFinetuneRDLoop(FT_RD_SETTING)
+    else:
+        loop = cast(LLMFinetuneRDLoop, LLMFinetuneRDLoop.load(str(path), checkout=checkout))
+
+    asyncio.run(loop.run(step_n=step_n, loop_n=loop_n, all_duration=timeout))
 
 
 if __name__ == "__main__":
-    fire.Fire(main)
+    typer.run(main)
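Review note: taken together, the assertions in `main()` currently reject any `user_target_scenario`, so a run only proceeds when a benchmark, its description, and a base model are all set. The sketch below restates that precedence as a pure function. `SketchSettings` is a hypothetical stand-in for `FT_RD_SETTING`, and explicit `ValueError`s are swapped in for the patch's `assert` statements, since asserts are skipped when Python runs with `-O`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchSettings:
    """Hypothetical stand-in for FT_RD_SETTING; only the fields main() checks."""

    user_target_scenario: Optional[str] = None
    target_benchmark: Optional[str] = None
    benchmark_description: Optional[str] = None
    base_model: Optional[str] = None


def apply_cli_overrides(
    settings: SketchSettings,
    user_target_scenario: Optional[str] = None,
    benchmark: Optional[str] = None,
    benchmark_description: Optional[str] = None,
    base_model: Optional[str] = None,
) -> SketchSettings:
    """Apply CLI values over existing settings, then validate in the same order as main()."""
    if user_target_scenario:
        settings.user_target_scenario = user_target_scenario
    if settings.user_target_scenario is not None:
        raise ValueError("user_target_scenario is not yet supported")
    # The benchmark pair is only applied when both values are given.
    if benchmark and benchmark_description:
        settings.target_benchmark = benchmark
        settings.benchmark_description = benchmark_description
    if not (settings.target_benchmark and settings.benchmark_description):
        raise ValueError("target_benchmark and benchmark_description must both be set")
    if base_model:
        settings.base_model = base_model
    if settings.base_model is None:
        raise ValueError("base model auto selection is not yet supported")
    return settings
```

One consequence worth noting: passing only `--benchmark` without `--benchmark-description` leaves both settings untouched, so the run fails validation unless they were already configured elsewhere.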
diff --git a/rdagent/app/finetune/llm/prompts.yaml b/rdagent/app/finetune/llm/prompts.yaml
deleted file mode 100644
index c686d12b1..000000000
--- a/rdagent/app/finetune/llm/prompts.yaml
+++ /dev/null
@@ -1,13 +0,0 @@
-scenario_description: |-
-  ------Background of the scenario------
-  You are a world-class machine learning engineer. Your task is to finetune a model on the given dataset using QLoRA method.
-  ------Dataset Description------
-  {{ raw_description }}
-
-competition_background: |-
-  ## QLoRA Fine-Tuning
-  You are a world-class machine learning engineer and prompt engineer specializing in parameter-efficient fine-tuning of large language models using **QLoRA**. Your expertise includes 4-bit quantization, low-rank adaptation, and maximizing performance on GPU clusters. You are committed to building accurate, resource-efficient, and robust LLMs.
-
-  - **Fine-Tuning Method**: QLoRA (4-bit quantized LoRA)  
-  - **Training Dataset**:  
-    > {{ raw_description }}
\ No newline at end of file
diff --git a/rdagent/app/finetune/llm/proposal.py b/rdagent/app/finetune/llm/proposal.py
deleted file mode 100644
index 4222859ba..000000000
--- a/rdagent/app/finetune/llm/proposal.py
+++ /dev/null
@@ -1,46 +0,0 @@
-from rdagent.app.data_science.conf import DS_RD_SETTING
-from rdagent.core.proposal import ExpGen
-from rdagent.core.scenario import Scenario
-from rdagent.log import rdagent_logger as logger
-from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
-from rdagent.scenarios.data_science.proposal.exp_gen.base import DSHypothesis, DSTrace
-from rdagent.scenarios.data_science.proposal.exp_gen.proposal import DSProposalV2ExpGen
-from rdagent.utils.agent.tpl import T
-
-
-class FinetuneExpGen(DSProposalV2ExpGen):
-    def gen(
-        self,
-        trace: DSTrace,
-    ) -> DSExperiment:
-        component_desc = T("scenarios.data_science.share:component_description_in_pipeline").r()
-
-        if (sota_exp_fb := trace.sota_experiment_fb()) is None:
-            sota_exp, fb_to_sota_exp = None, None
-        else:
-            sota_exp, fb_to_sota_exp = sota_exp_fb
-
-        if not isinstance(sota_exp, DSExperiment):
-            eda_output = None
-        else:
-            eda_output = sota_exp.experiment_workspace.file_dict.get("EDA.md", None)
-        scenario_desc = self.scen.get_scenario_all_desc(eda_output=eda_output)
-
-        # TODO: this is a over simplified version. More features will be added after more survey
-        sota_exp_desc = "No previous SOTA experiments available."
-        failed_exp_feedback_list_desc = "No previous experiments available."
-
-        return self.task_gen(
-            component_desc=component_desc,
-            scenario_desc=scenario_desc,
-            sota_exp_desc=sota_exp_desc,
-            sota_exp=sota_exp,
-            hypotheses=[
-                DSHypothesis(
-                    component="Model",
-                )
-            ],
-            pipeline=True,
-            failed_exp_feedback_list_desc=failed_exp_feedback_list_desc,
-            fb_to_sota_exp=fb_to_sota_exp,
-        )
diff --git a/rdagent/app/finetune/llm/scen.py b/rdagent/app/finetune/llm/scen.py
deleted file mode 100644
index 98f71095c..000000000
--- a/rdagent/app/finetune/llm/scen.py
+++ /dev/null
@@ -1,87 +0,0 @@
-from pathlib import Path
-
-from rdagent.app.data_science.conf import DS_RD_SETTING
-from rdagent.core.scenario import Scenario
-from rdagent.log import rdagent_logger as logger
-from rdagent.scenarios.data_science.scen import DataScienceScen
-from rdagent.scenarios.data_science.scen.utils import describe_data_folder_v2
-from rdagent.utils.agent.tpl import T
-
-
-class LLMFinetuneScen(DataScienceScen):
-    """LLMFinetuneScen Scenario"""
-
-    def __init__(self, competition: str) -> None:
-        self._download_data(competition=competition)
-        super().__init__(competition)
-        self._analysis_competition_description()
-
-    def _get_data_folder_description(self) -> str:
-        folder_desc = describe_data_folder_v2(
-            Path(DS_RD_SETTING.local_data_path) / self.competition, show_nan_columns=DS_RD_SETTING.show_nan_columns
-        )
-        return folder_desc
-
-    def _download_data(self, competition: str):
-        """
-        Download dateset from Hugging Face Hub
-
-        Parameters
-        ----------
-        - competition (str): Dateset ID, like "shibing624/alpaca-zh".
-        """
-        save_path = f"{DS_RD_SETTING.local_data_path}/{competition}"
-        if Path(save_path).exists():
-            logger.info(f"{save_path} already exists.")
-        else:
-            logger.info(f"Downloading {competition} to {save_path}")
-            try:
-                from huggingface_hub import snapshot_download
-
-                snapshot_download(
-                    repo_id=competition,
-                    repo_type="dataset",
-                    local_dir=save_path,
-                    local_dir_use_symlinks=False,
-                )
-            except ImportError:
-                raise ImportError(
-                    "Please install huggingface_hub first. "
-                    'You can install it with `pip install -U "huggingface_hub[cli]"`.'
-                )
-            except Exception as e:
-                logger.error(f"Error when downloading {competition}: {e}")
-                raise e
-
-    def _get_description(self):
-        if (fp := Path(f"{DS_RD_SETTING.local_data_path}/{self.competition}/README.md")).exists():
-            logger.info(f"{self.competition}/Found README.md, loading from local file.")
-            return fp.read_text()
-
-    def _get_direction(self):
-        return True
-
-    @property
-    def rich_style_description(self) -> str:
-        raise NotImplementedError
-
-    @property
-    def background(self) -> str:
-        background_template = T(".prompts:competition_background")
-        background_prompt = background_template.r(
-            raw_description=self.raw_description,
-        )
-        return background_prompt
-
-    def get_competition_full_desc(self) -> str:
-        return T(".prompts:scenario_description").r(
-            raw_description=self.raw_description,
-        )
-
-    def get_scenario_all_desc(self, eda_output=None) -> str:
-        """
-        eda_output depends on dynamic .md files from current workspace, not fixed.
-        """
-        return T(".prompts:scenario_description").r(
-            raw_description=self.raw_description,
-        )
diff --git a/rdagent/app/finetune/llm/tpl/components/coder/data_science/pipeline/prompts.yaml b/rdagent/app/finetune/llm/tpl/components/coder/data_science/pipeline/prompts.yaml
deleted file mode 100644
index 845a72885..000000000
--- a/rdagent/app/finetune/llm/tpl/components/coder/data_science/pipeline/prompts.yaml
+++ /dev/null
@@ -1,71 +0,0 @@
-pipeline_coder:
-  system: |-
-    You are a world-class ML engineer specializing in parameter-efficient LLM fine-tuning with QLoRA.
-    Design a single-file `main.py` that:
-      • Loads a pretrained model from `./workspace_input/prev_model`.  
-      • Attaches 4-bit LoRA adapters, runs fine-tuning, evaluates on the validation set.  
-      • Uses `print()` for progress and debug output (no `logging` or progress bars).  
-      • Wraps file reads in `try/except` only to catch missing files—do not suppress other errors.  
-      • Hardcodes all paths and hyperparameters—no CLI parsing.  
-      • Is directly executable via `python main.py`.
-
-    ## Task Description
-    {{ task_desc }}
-
-    ## The runtime environment your code will running on
-    {{ runtime_environment }}
-
-    {% if queried_former_failed_knowledge|length != 0 %}
-    --------- Previous Failed Attempts ---------
-    {% for former_failed_knowledge in queried_former_failed_knowledge %} Attempt {{ loop.index }}:
-    =====Code:=====
-    {{ former_failed_knowledge.implementation.all_codes }}
-    =====Feedback:=====
-    {{ former_failed_knowledge.feedback }}
-    {% endfor %}
-    {% endif %}
-
-    ## Guidelines
-    1. Ensure that the dataset is loaded strictly from `{% include "scenarios.data_science.share:scen.input_path" %}`, following the exact folder structure described in the **Data Folder Description**, and do not attempt to load data from the current directory (`./`).
-    2. You should avoid using logging module to output information in your generated code, and instead use the print() function.
-    3. You should be very careful about the try catch block in your code. You may use it to handle missing files in data reading, but you should not use it to handle the errors in your code. Especially use it to bypass the errors in your code. Directly solve the errors in your code instead of using try catch block to bypass them.
-    4. Initialize random seeds and specify device (`cpu`/`cuda`) for reproducibility.  
-    5. Ensure `main.py` runs end-to-end: training → validation → save `./scores.csv`.  
-    6. Save finetuned adapter to `./models/` directory.
-    7. When run the code again, the code will skip finetune process and directly load the finetuned adapter from `./models/` directory.
-
-    {% if enable_debug_mode %}
-    Your code will be executed in a debug mode with following command: 
-    ```bash
-    python main.py --debug
-    ```
-    In debug mode, you should only sample smallest possible subset from the training data and run the minimum epochs to quickly test the correctness of the code.
-    In debug mode, you should implement a timer to measure the time taken for your debug configuration and estimate the time required for the full run.
-    For example, you can sample smallest possible subset from the training data and run for one epoch, then the full run with ten epochs will take one hundred times the time taken for the debug run. The scale is calculated by yourself depending on the data sampling and epoch number you choose. If your full run enables early stopping, the scale should be smaller considering the early stopping will stop the training earlier than the full epochs.
-    You should sample the data after train valid split. When you split the data after sampling, you might get a class with only one sample which might cause the split strategy to fail.
-    Your debug code should run exactly the same as the full run, except for the data sampling and epoch number, to ensure the correctness of the code.
-    You should print total time and estimated time in standard output using print function in the following schema:
-    === Start of Debug Information ===
-    debug_time: time_taken_for_debug_run_in_seconds (e.g., 'debug_time: 10.0')
-    estimated_time: estimated_time_for_full_run_in_seconds (e.g., 'estimated_time: 100.0')
-    === End of Debug Information ===
-    User will use the following code to match: re.search(r"(.*?)=== Start of Debug Information ===(.*)=== End of Debug Information ===", stdout, re.DOTALL).groups()[1]
-    Notice, data sampling should only be applied in debug mode. Always use the full data in the full run!
-    Example code:
-    ```python
-    if args.debug:
-      sample_size = int(0.01 * len(train_dataset))  # 1% for debug
-    else:
-      sample_size = len(train_dataset)
-    ```
-    {% endif %}
-
-    ## Output Format
-    {% if out_spec %}
-    {{ out_spec }}
-    {% else %}
-    Please response the full runable code in the following json format. Here is an example structure for the JSON output:
-    {
-        "code": "The Python code as a string."
-    }
-    {% endif %}
diff --git a/rdagent/app/finetune/llm/tpl/scenarios/data_science/prompts.yaml b/rdagent/app/finetune/llm/tpl/scenarios/data_science/prompts.yaml
deleted file mode 100644
index 2317a9e2f..000000000
--- a/rdagent/app/finetune/llm/tpl/scenarios/data_science/prompts.yaml
+++ /dev/null
@@ -1,10 +0,0 @@
-system: |-
-    You are a world-class ML engineer specializing in parameter-efficient LLM fine-tuning with QLoRA.
-    Design a single-file `main.py` that:
-      • Loads a pretrained model from `./workspace_input/prev_model`.  
-      • Attaches 4-bit LoRA adapters, runs fine-tuning, evaluates on the validation set.  
-      • Uses `print()` for progress and debug output (no `logging` or progress bars).  
-      • Wraps file reads in `try/except` only to catch missing files—do not suppress other errors.  
-      • Hardcodes all paths and hyperparameters—no CLI parsing.  
-      • Is directly executable via `python main.py`.
-   
\ No newline at end of file
diff --git a/rdagent/app/finetune/llm/tpl/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml b/rdagent/app/finetune/llm/tpl/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml
deleted file mode 100644
index 3c7c2d020..000000000
--- a/rdagent/app/finetune/llm/tpl/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml
+++ /dev/null
@@ -1,82 +0,0 @@
-scenario_problem:
-  system: |-
-    You are a world-class machine learning and prompt engineer specializing in parameter-efficient fine-tuning of large language models using QLoRA (4-bit quantized LoRA adapters).  
-    Each iteration (trace) represents one training run or adapter update. If an iteration’s validation metric exceeds the current best, it becomes the new SOTA adapter; otherwise it is a failed experiment.
-    Your task is to analyze the scenario (and SOTA, if given) and identify a concise list of **2–3 Key Challenges** that most critically limit fine-tuning performance.
-
-    ### Core Analysis Dimensions
-    1. **Adapter-Model Alignment**  
-       Compare current LoRA adapter configuration against model capacity and task complexity.  
-    2. **Optimization Dynamics**  
-       Identify where training diverges or plateaus (e.g. LR too high/low, quantization noise).  
-    3. **Data-Model Coherence**  
-       Spot mismatches between the dataset’s characteristics and model input preprocessing or sequence length.  
-
-    ## Key Challenges / Core Problems
-    Categorize each challenge as one of:
-    - **Data-Driven Challenge**  
-      Issues in dataset size, domain mismatch, label noise, sequence length distribution, etc.
-    - **Model-Optimization Challenge**  
-      LoRA rank selection, quantization artifacts, learning rate schedule, gradient accumulation, etc.
-
-    ### For Each Challenge
-    1. Be **specific** and **actionable**.  
-    2. Focus on **methodological** aspects, not trivial bugs.  
-    3. Directly tie to improving the **target metric**.  
-    4. If no SOTA exists, include at least one challenge that guides building a minimal baseline adapter.
-
-    {% if task_output_format is not none %}
-    {% endif %}
-
-task_gen:
-  system: |-
-    You are an expert in LLM fine-tuning with QLoRA. Each iteration applies a specific hypothesis to improve the current adapter (SOTA) or establish an initial adapter.
-
-    **Inputs**:
-      - Scenario: base model, task, data path, evaluation metric  
-      - Current SOTA adapter & feedback (if any)  
-      - Proposed Hypothesis  
-      - Failed runs feedback (if any)
-
-    **Your task**: Outline a conceptual plan for `main.py` that implements the Proposed Hypothesis.
-
-    **Standards**:
-      - Run via `python main.py` with no CLI args; configs are hard-coded.  
-      - No code or pseudo-code—describe each step in plain language.  
-      - Do **not** use progress bars.  
-      - Do **not** infer test indices from sample files.
-
-    **Sketch**:
-      1. **Load Data**  
-         - Read train/validation files from the given data path.  
-         - Tokenize or preprocess inputs for the model.
-
-      2. **Initialize Model & Adapter**  
-         - Load the base LLM.  
-         - Attach a QLoRA adapter.
-
-      3. **Train with Hypothesis**  
-         - Apply the hypothesis change (e.g., modify learning schedule, adapter config).  
-         - Train and validate iteratively.
-
-      4. **Validate & Record**  
-         - Compute the metric on validation set.  
-         - Save results to `scores.csv` (with adapter name and “ensemble”).
-
-      5. **Generate Submission**  
-         - Write `submission.jsonl` or `.csv` matching the competition format exactly.
-
-    **Key Reminders for Developer**:
-      - Hard-code all paths; do not rely on sample files for indices.  
-      - Ensure tokenizer and model names match.  
-      - Validate output formats for `scores.csv` and `submission`.  
-      - Handle file I/O robustly (e.g., zipped data).
-
-    {% if task_output_format is not none %}
-    ## [Partial Response Format 1] Task Output Format:
-    {{ task_output_format }}
-    Your final output should strictly adhere to the following JSON format. 
-    {
-      "task_design": ---The dict corresponding to task output format---,
-    }
-    {% endif %}
diff --git a/rdagent/app/finetune/llm/tpl/scenarios/data_science/scen/prompts.yaml b/rdagent/app/finetune/llm/tpl/scenarios/data_science/scen/prompts.yaml
deleted file mode 100644
index d02d0286c..000000000
--- a/rdagent/app/finetune/llm/tpl/scenarios/data_science/scen/prompts.yaml
+++ /dev/null
@@ -1,18 +0,0 @@
-competition_description_template:
-  system: |-
-    You are a data science assistant that extracts structured information from unstructured text.
-    The user will provide you a description of an LLM fine-tuning project, and you need to extract specific details from it.
-    For the dataset, the user has already reviewed and provided any additional context—include that information in your response.
-    Please answer in JSON format with the following schema:
-    {
-      "Task Type":       "The type of fine-tuning task, e.g., 'Question Answering', 'Text Classification', 'Summarization', 'Translation', 'Code Generation'",
-      "Data Type":       "The type of data used for fine-tuning, e.g., 'Text (Natural Language)', 'Code', 'Multimodal', 'Dialogue'",
-      "Brief Description": "A concise summary of the fine-tuning project and its objectives",
-      "Dataset Description": "A description of the dataset as organized in the Processed Data folder: list files, formats, sizes, and any pre-processing steps applied, reconciled with contextual details from the project description",
-      "Training Specifications": "Details of the fine-tuning setup, including base model name, number of epochs, batch size, learning rate, optimizer, and any scheduler or early-stopping rules",
-      "Output Format":   "The expected model output format per sample (e.g., single label, probability distribution over N classes, generated text sequence)",
-      "Channels per Sample": "An integer indicating output dimensionality per example (e.g., 1 for single regression value, N for N-class probabilities, variable for generated text)",
-      "Evaluation Metric Description": "A precise explanation of how model performance is measured, including the formula or procedure used",
-      "Metric Name":     "The name of the evaluation metric (e.g., 'Accuracy', 'ROUGE-L', 'BLEU', 'F1'), please only choose one metric name",
-      "Metric Direction": true or false  // true if higher is better, false if lower is better
-    }
diff --git a/rdagent/app/finetune/llm/ui/__init__.py b/rdagent/app/finetune/llm/ui/__init__.py
new file mode 100644
index 000000000..837ecd963
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/__init__.py
@@ -0,0 +1 @@
+# FT (Fine-tune) scenario UI
diff --git a/rdagent/app/finetune/llm/ui/app.py b/rdagent/app/finetune/llm/ui/app.py
new file mode 100644
index 000000000..be3d06b68
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/app.py
@@ -0,0 +1,207 @@
+"""
+FT (Fine-tune) Timeline Viewer
+Hierarchical view: Session > Loop > Stage > EvoLoop > Events
+
+Run:
+    streamlit run rdagent/app/finetune/llm/ui/app.py
+"""
+
+import os
+from pathlib import Path
+
+import streamlit as st
+from streamlit import session_state as state
+
+from rdagent.app.finetune.llm.ui.benchmarks import get_core_metric_score
+from rdagent.app.finetune.llm.ui.components import render_session, render_summary
+from rdagent.app.finetune.llm.ui.config import ALWAYS_VISIBLE_TYPES, OPTIONAL_TYPES
+from rdagent.app.finetune.llm.ui.data_loader import (
+    get_summary,
+    get_valid_sessions,
+    load_ft_session,
+)
+from rdagent.app.finetune.llm.ui.ft_summary import render_job_summary
+
+DEFAULT_LOG_BASE = "log/"
+
+
+def get_job_options(base_path: Path) -> list[str]:
+    """
+    Scan directory and return job options list.
+    - "." means standalone tasks in root directory
+    - Others are job directory names
+    """
+    options = []
+    has_root_tasks = False
+    job_dirs = []
+
+    if not base_path.exists():
+        return options
+
+    for d in base_path.iterdir():
+        if not d.is_dir():
+            continue
+        # Check if standalone task (has __session__ directly)
+        if (d / "__session__").exists():
+            has_root_tasks = True
+        # Check if job directory (subdirs have __session__)
+        else:
+            try:
+                if any((sub / "__session__").exists() for sub in d.iterdir() if sub.is_dir()):
+                    job_dirs.append(d.name)
+            except PermissionError:
+                pass
+
+    # Sort job dirs by name descending (newest first, since names are date-based)
+    job_dirs.sort(reverse=True)
+
+    # Add job dirs first, then root tasks at the end
+    options.extend(job_dirs)
+    if has_root_tasks:
+        options.append(". (Current)")
+
+    return options
+
+
+def main():
+    st.set_page_config(layout="wide", page_title="FT Timeline", page_icon="🔬")
+
+    # ========== Sidebar ==========
+    with st.sidebar:
+        # View mode selection
+        view_mode = st.radio("View Mode", ["Job Summary", "Single Task"], horizontal=True)
+
+        st.divider()
+
+        default_log = os.environ.get("FT_LOG_PATH", DEFAULT_LOG_BASE)
+        job_folder = default_log  # Initialize for both modes
+        selected_types = ALWAYS_VISIBLE_TYPES.copy()  # Initialize for both modes
+        is_root_job = False  # Track if viewing root tasks
+
+        if view_mode == "Job Summary":
+            # Job Summary mode
+            st.header("Job")
+            base_folder = st.text_input("Base Folder", value=default_log, key="base_folder_input")
+            base_path = Path(base_folder)
+
+            job_options = get_job_options(base_path)
+            if job_options:
+                selected_job = st.selectbox("Select Job", job_options, key="job_select")
+                if selected_job.startswith("."):
+                    job_folder = base_folder
+                    is_root_job = True
+                else:
+                    job_folder = str(base_path / selected_job)
+                # Save to session_state for Single Task mode
+                state.selected_job_folder = job_folder
+            else:
+                st.warning("No jobs found in this directory")
+                job_folder = base_folder
+
+            if st.button("Refresh", type="primary", key="refresh_job"):
+                st.rerun()
+        else:
+            # Single Task mode
+            st.header("Session")
+            # Use job_folder from Job Summary mode if available
+            default_path = getattr(state, "selected_job_folder", default_log)
+            log_folder = st.text_input("Log Folder", value=default_path)
+            log_path = Path(log_folder)
+
+            sessions = get_valid_sessions(log_path)
+            if not sessions:
+                st.warning("No valid sessions found")
+                return
+
+            selected_session = st.selectbox("Session", sessions)
+
+            if st.button("Load", type="primary") or "session" not in state:
+                with st.spinner("Loading..."):
+                    state.session = load_ft_session(log_path / selected_session)
+                    state.session_name = selected_session
+
+            st.divider()
+
+            # Optional type toggles
+            st.subheader("Show More")
+            selected_types = ALWAYS_VISIBLE_TYPES.copy()
+            for event_type, (label, default) in OPTIONAL_TYPES.items():
+                if st.toggle(label, value=default, key=f"toggle_{event_type}"):
+                    selected_types.append(event_type)
+
+            st.divider()
+
+            # Display options
+            st.subheader("Display Options")
+            state.render_markdown = st.toggle("Render Prompts", value=False, key="render_markdown_toggle")
+
+            st.divider()
+
+            # Summary in sidebar
+            if "session" in state:
+                summary = get_summary(state.session)
+                st.subheader("Summary")
+                st.metric("Loops", summary.get("loop_count", 0))
+                st.metric("LLM Calls", summary.get("llm_call_count", 0))
+                success = summary.get("docker_success", 0)
+                fail = summary.get("docker_fail", 0)
+                st.metric("Docker", f"{success}✓ / {fail}✗")
+
+    # ========== Main Content ==========
+    if view_mode == "Job Summary":
+        st.title("📊 FT Job Summary")
+        job_path = Path(job_folder)
+        if job_path.exists():
+            render_job_summary(job_path, is_root=is_root_job)
+        else:
+            st.warning(f"Job folder not found: {job_folder}")
+        return
+
+    # Single Task mode
+    st.title("🔬 FT Timeline Viewer")
+
+    if "session" not in state:
+        st.info("Select a session and click **Load** to view")
+        return
+
+    session = state.session
+    summary = get_summary(session)
+
+    # Global info header (Base Model, Datasets, Benchmark) - compact style
+    scenario_event = next((e for e in session.init_events if e.type == "scenario"), None)
+    dataset_event = next((e for e in session.init_events if e.type == "dataset_selection"), None)
+
+    if scenario_event or dataset_event:
+        if scenario_event and hasattr(scenario_event.content, "base_model"):
+            st.markdown(f"🧠 **Model:** `{scenario_event.content.base_model}`")
+        if dataset_event:
+            selected = (
+                dataset_event.content.get("selected_datasets", []) if isinstance(dataset_event.content, dict) else []
+            )
+            if selected:
+                st.markdown(f"📂 **Datasets:** `{', '.join(selected)}`")
+        if scenario_event and hasattr(scenario_event.content, "target_benchmark"):
+            st.markdown(f"🎯 **Benchmark:** `{scenario_event.content.target_benchmark}`")
+        # Display baseline benchmark score
+        if scenario_event and hasattr(scenario_event.content, "baseline_benchmark_score"):
+            baseline = scenario_event.content.baseline_benchmark_score
+            if baseline and isinstance(baseline, dict):
+                benchmark_name = getattr(scenario_event.content, "target_benchmark", "")
+                accuracy_summary = baseline.get("accuracy_summary", {})
+                if accuracy_summary:
+                    result = get_core_metric_score(benchmark_name, accuracy_summary)
+                    if result:
+                        metric_name, score, _ = result
+                        st.markdown(f"📊 **Baseline:** `{metric_name} = {score:.1f}`")
+
+    # Summary bar
+    render_summary(summary)
+
+    st.divider()
+
+    # Hierarchical view
+    render_session(session, selected_types)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/rdagent/app/finetune/llm/ui/benchmarks/__init__.py b/rdagent/app/finetune/llm/ui/benchmarks/__init__.py
new file mode 100644
index 000000000..c0c05b293
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/benchmarks/__init__.py
@@ -0,0 +1,70 @@
+"""Benchmark processors for core metric extraction.
+
+Each benchmark has its own processor that knows how to extract
+the core metric name and value from accuracy_summary data.
+"""
+
+from .bioprobench import BioProBenchProcessor
+from .chemcotbench import ChemCotBenchProcessor
+from .financeiq import FinanceIQProcessor
+from .panorama import PanoramaProcessor
+from .tablebench import TableBenchProcessor
+
+PROCESSORS = [
+    FinanceIQProcessor,
+    PanoramaProcessor,
+    ChemCotBenchProcessor,
+    TableBenchProcessor,
+    BioProBenchProcessor,
+]
+
+
+def get_core_metric_score(benchmark_name: str, accuracy_summary: dict) -> tuple[str, float, bool] | None:
+    """Get core metric name, score, and direction for a benchmark.
+
+    Args:
+        benchmark_name: The benchmark name (e.g., "FinanceIQ", "panorama_par4pc")
+        accuracy_summary: {dataset_name: {metric: value, ...}, ...}
+
+    Returns:
+        (metric_name, value, higher_is_better) or None
+        - metric_name: includes "(average)" suffix if multiple datasets are averaged
+        - value: the score
+        - higher_is_better: True if higher values are better (use ↑), False otherwise (use ↓)
+    """
+    for processor in PROCESSORS:
+        if processor.match(benchmark_name):
+            return processor.get_core_metric(accuracy_summary)
+
+    # Default fallback: prefer "accuracy", else the first numeric value per dataset; always labeled "accuracy"
+    scores = []
+    for ds, metrics in accuracy_summary.items():
+        if not isinstance(metrics, dict):
+            continue
+        if "accuracy" in metrics:
+            scores.append(float(metrics["accuracy"]))
+        else:
+            for v in metrics.values():
+                if isinstance(v, (int, float)):
+                    scores.append(float(v))
+                    break
+
+    if not scores:
+        return None
+
+    avg = sum(scores) / len(scores)
+    if len(scores) == 1:
+        return ("accuracy", avg, True)  # higher is better
+    else:
+        return ("accuracy (average)", avg, True)  # higher is better
+
+
+__all__ = [
+    "get_core_metric_score",
+    "PROCESSORS",
+    "FinanceIQProcessor",
+    "PanoramaProcessor",
+    "ChemCotBenchProcessor",
+    "TableBenchProcessor",
+    "BioProBenchProcessor",
+]
diff --git a/rdagent/app/finetune/llm/ui/benchmarks/base.py b/rdagent/app/finetune/llm/ui/benchmarks/base.py
new file mode 100644
index 000000000..c3b7a5601
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/benchmarks/base.py
@@ -0,0 +1,53 @@
+"""Base class for benchmark core metric extraction."""
+
+from abc import ABC, abstractmethod
+
+
+class BenchmarkProcessor(ABC):
+    """Base class for benchmark core metric extraction."""
+
+    # Metrics where higher values are better (default assumption)
+    # Override in subclass if needed
+    HIGHER_IS_BETTER: set[str] = {
+        "accuracy",
+        "exact_match",
+        "f1",
+        "f1_score",
+        "macro_f1",
+        "correct_rate",
+        "success_rate",
+        "gold_hit_rate",
+        "score",
+        "scaffold_hard",
+        "kendall_tau",
+        "ROUGE-L",
+    }
+
+    @classmethod
+    @abstractmethod
+    def match(cls, benchmark_name: str) -> bool:
+        """Check if this processor handles the given benchmark."""
+        pass
+
+    @classmethod
+    @abstractmethod
+    def get_core_metric(cls, accuracy_summary: dict) -> tuple[str, float, bool] | None:
+        """Extract core metric name, value, and direction from accuracy_summary.
+
+        Args:
+            accuracy_summary: {dataset_name: {metric: value, ...}, ...}
+
+        Returns:
+            (metric_name, value, higher_is_better) or None
+            - metric_name: includes "(average)" suffix if multiple datasets
+            - value: the score
+            - higher_is_better: True if higher values are better, False otherwise
+        """
+        pass
+
+    @classmethod
+    def is_higher_better(cls, metric_name: str) -> bool:
+        """Check if higher values are better for this metric."""
+        # Remove (average) suffix for checking
+        base_metric = metric_name.replace(" (average)", "").strip()
+        return base_metric.lower() in {m.lower() for m in cls.HIGHER_IS_BETTER}
diff --git a/rdagent/app/finetune/llm/ui/benchmarks/bioprobench.py b/rdagent/app/finetune/llm/ui/benchmarks/bioprobench.py
new file mode 100644
index 000000000..8d8f990bd
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/benchmarks/bioprobench.py
@@ -0,0 +1,60 @@
+"""BioProBench benchmark processor."""
+
+from .base import BenchmarkProcessor
+
+
+class BioProBenchProcessor(BenchmarkProcessor):
+    """BioProBench: Biology protocol benchmark with different task types."""
+
+    CORE_METRICS = {
+        "pqa": "accuracy",
+        "ord": "kendall_tau",
+        "err": "f1",
+        "gen": "ROUGE-L",
+    }
+
+    @classmethod
+    def match(cls, benchmark_name: str) -> bool:
+        return "bioprobench" in benchmark_name.lower()
+
+    @classmethod
+    def get_core_metric(cls, accuracy_summary: dict) -> tuple[str, float, bool] | None:
+        scores = []
+        metrics_used = []
+
+        for ds, metrics in accuracy_summary.items():
+            if not isinstance(metrics, dict):
+                continue
+            ds_lower = ds.lower()
+            # Find matching core metric
+            core_metric = "accuracy"  # fallback
+            for pattern, metric in cls.CORE_METRICS.items():
+                if pattern in ds_lower:
+                    core_metric = metric
+                    break
+
+            if core_metric in metrics:
+                scores.append(float(metrics[core_metric]))
+                metrics_used.append(core_metric)
+            elif core_metric.lower() in [k.lower() for k in metrics.keys()]:
+                # Case-insensitive fallback for metrics like "ROUGE-L"
+                for k, v in metrics.items():
+                    if k.lower() == core_metric.lower():
+                        scores.append(float(v))
+                        metrics_used.append(core_metric)
+                        break
+
+        if not scores:
+            return None
+
+        avg = sum(scores) / len(scores)
+        unique = list(set(metrics_used))
+
+        if len(scores) == 1:
+            metric_name = unique[0]
+        elif len(unique) == 1:
+            metric_name = f"{unique[0]} (average)"
+        else:
+            metric_name = "mixed (average)"
+
+        return (metric_name, avg, all(cls.is_higher_better(m) for m in unique))
diff --git a/rdagent/app/finetune/llm/ui/benchmarks/chemcotbench.py b/rdagent/app/finetune/llm/ui/benchmarks/chemcotbench.py
new file mode 100644
index 000000000..2b5c7d78e
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/benchmarks/chemcotbench.py
@@ -0,0 +1,105 @@
+"""ChemCotBench benchmark processor."""
+
+from .base import BenchmarkProcessor
+
+
+class ChemCotBenchProcessor(BenchmarkProcessor):
+    """ChemCotBench: Chemistry reasoning with various subtasks.
+
+    All metrics are 0-100 percentages, enabling unified averaging within each subset.
+    """
+
+    # Define core metric field names for each task
+    CORE_METRICS = {
+        # Molecular understanding
+        "mol_und_fg_count": "accuracy",
+        "mol_und_ring_count": "accuracy",
+        "mol_und_murcko_scaffold": "scaffold_hard",  # Exact match rate (0-100%)
+        "mol_und_ring_system_scaffold": "score",  # "Yes" ratio (0-100%)
+        "mol_und_equivalence": "accuracy",
+        # Molecular editing
+        "mol_edit_add": "correct_rate",
+        "mol_edit_delete": "correct_rate",
+        "mol_edit_sub": "correct_rate",
+        # Molecular optimization (prefix match)
+        "mol_opt_": "success_rate",
+        # Reaction tasks - unified to exact_match
+        "reaction_fs": "exact_match",
+        "reaction_retro": "exact_match",
+        "reaction_nepp": "exact_match",
+        "reaction_rcr": "exact_match",
+        "reaction_mechsel": "exact_match",  # Will fallback to accuracy if exact_match not found
+    }
+
+    # Metric groups: unified display names for each subset
+    METRIC_GROUPS = {
+        "mol_und": "accuracy",  # mol_und subset displays as accuracy
+        "mol_edit": "correct_rate",
+        "mol_opt": "success_rate",
+        "reaction": "exact_match",  # reaction subset displays as exact_match
+    }
+
+    @classmethod
+    def match(cls, benchmark_name: str) -> bool:
+        return "chemcot" in benchmark_name.lower()
+
+    @classmethod
+    def get_core_metric(cls, accuracy_summary: dict) -> tuple[str, float, bool] | None:
+        scores = []
+        group_detected = None
+
+        for ds, metrics in accuracy_summary.items():
+            if not isinstance(metrics, dict):
+                continue
+            ds_lower = ds.lower()
+
+            # Detect subset type
+            for group in cls.METRIC_GROUPS:
+                if group in ds_lower:
+                    group_detected = group
+                    break
+
+            # Find matching core metric
+            core_metric = "accuracy"  # fallback
+            for pattern, metric in cls.CORE_METRICS.items():
+                # Substring match in both branches; patterns ending with _ (e.g. mol_opt_) cover a dataset family
+                if pattern.endswith("_"):
+                    if pattern in ds_lower:
+                        core_metric = metric
+                        break
+                else:
+                    if pattern in ds_lower:
+                        core_metric = metric
+                        break
+
+            # Try to get metric value with fallback support
+            value = None
+            if core_metric in metrics:
+                value = float(metrics[core_metric])
+            elif core_metric == "exact_match" and "accuracy" in metrics:
+                # reaction_mechsel fallback: exact_match -> accuracy
+                value = float(metrics["accuracy"])
+
+            if value is not None:
+                scores.append(value)
+
+        if not scores:
+            return None
+
+        avg = sum(scores) / len(scores)
+
+        # Use unified metric name for the detected subset
+        if group_detected and group_detected in cls.METRIC_GROUPS:
+            unified_name = cls.METRIC_GROUPS[group_detected]
+            if len(scores) == 1:
+                metric_name = unified_name
+            else:
+                metric_name = f"{unified_name} (average)"
+        else:
+            # Fallback for unknown subsets
+            if len(scores) == 1:
+                metric_name = "accuracy"
+            else:
+                metric_name = "accuracy (average)"
+
+        return (metric_name, avg, cls.is_higher_better(metric_name))
diff --git a/rdagent/app/finetune/llm/ui/benchmarks/financeiq.py b/rdagent/app/finetune/llm/ui/benchmarks/financeiq.py
new file mode 100644
index 000000000..f15e8073c
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/benchmarks/financeiq.py
@@ -0,0 +1,29 @@
+"""FinanceIQ benchmark processor."""
+
+from .base import BenchmarkProcessor
+
+
+class FinanceIQProcessor(BenchmarkProcessor):
+    """FinanceIQ: 10 exam subjects, all use accuracy."""
+
+    @classmethod
+    def match(cls, benchmark_name: str) -> bool:
+        return "financeiq" in benchmark_name.lower()
+
+    @classmethod
+    def get_core_metric(cls, accuracy_summary: dict) -> tuple[str, float, bool] | None:
+        scores = []
+        for ds, metrics in accuracy_summary.items():
+            if not isinstance(metrics, dict):
+                continue
+            if "accuracy" in metrics:
+                scores.append(float(metrics["accuracy"]))
+
+        if not scores:
+            return None
+
+        avg = sum(scores) / len(scores)
+        if len(scores) == 1:
+            return ("accuracy", avg, True)  # higher is better
+        else:
+            return ("accuracy (average)", avg, True)  # higher is better
diff --git a/rdagent/app/finetune/llm/ui/benchmarks/panorama.py b/rdagent/app/finetune/llm/ui/benchmarks/panorama.py
new file mode 100644
index 000000000..fa3001cf9
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/benchmarks/panorama.py
@@ -0,0 +1,52 @@
+"""Panorama benchmark processor."""
+
+from .base import BenchmarkProcessor
+
+
+class PanoramaProcessor(BenchmarkProcessor):
+    """Panorama: Different sub-datasets use different metrics."""
+
+    CORE_METRICS = {
+        "par4pc": "macro_f1",
+        "pi4pc": "gold_hit_rate",
+        "noc4pc": "macro_f1",
+    }
+
+    @classmethod
+    def match(cls, benchmark_name: str) -> bool:
+        return "panorama" in benchmark_name.lower()
+
+    @classmethod
+    def get_core_metric(cls, accuracy_summary: dict) -> tuple[str, float, bool] | None:
+        scores = []
+        metrics_used = []
+
+        for ds, metrics in accuracy_summary.items():
+            if not isinstance(metrics, dict):
+                continue
+            ds_lower = ds.lower()
+            # Find matching core metric
+            core_metric = "accuracy"  # fallback
+            for pattern, metric in cls.CORE_METRICS.items():
+                if pattern in ds_lower:
+                    core_metric = metric
+                    break
+
+            if core_metric in metrics:
+                scores.append(float(metrics[core_metric]))
+                metrics_used.append(core_metric)
+
+        if not scores:
+            return None
+
+        avg = sum(scores) / len(scores)
+        unique = list(set(metrics_used))
+
+        if len(scores) == 1:
+            metric_name = unique[0]
+        elif len(unique) == 1:
+            metric_name = f"{unique[0]} (average)"
+        else:
+            metric_name = "mixed (average)"
+
+        return (metric_name, avg, all(cls.is_higher_better(m) for m in unique))
diff --git a/rdagent/app/finetune/llm/ui/benchmarks/tablebench.py b/rdagent/app/finetune/llm/ui/benchmarks/tablebench.py
new file mode 100644
index 000000000..7ee78ec30
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/benchmarks/tablebench.py
@@ -0,0 +1,60 @@
+"""TableBench benchmark processor."""
+
+from .base import BenchmarkProcessor
+
+
+class TableBenchProcessor(BenchmarkProcessor):
+    """TableBench: Table QA with different subtasks."""
+
+    CORE_METRICS = {
+        "fact": "accuracy",
+        "numerical": "accuracy",
+        "analysis": "accuracy",
+        "visualization": "Pass@1",  # TableBench visualization uses Pass@1 as core metric
+    }
+
+    # TableBench-specific metrics where higher is better
+    HIGHER_IS_BETTER = BenchmarkProcessor.HIGHER_IS_BETTER | {
+        "Pass@1",
+        "ECR@1",
+        "Parse@1",
+    }
+
+    @classmethod
+    def match(cls, benchmark_name: str) -> bool:
+        return "tablebench" in benchmark_name.lower()
+
+    @classmethod
+    def get_core_metric(cls, accuracy_summary: dict) -> tuple[str, float, bool] | None:
+        scores = []
+        metrics_used = []
+
+        for ds, metrics in accuracy_summary.items():
+            if not isinstance(metrics, dict):
+                continue
+            ds_lower = ds.lower()
+            # Find matching core metric
+            core_metric = "accuracy"  # fallback
+            for pattern, metric in cls.CORE_METRICS.items():
+                if pattern in ds_lower:
+                    core_metric = metric
+                    break
+
+            if core_metric in metrics:
+                scores.append(float(metrics[core_metric]))
+                metrics_used.append(core_metric)
+
+        if not scores:
+            return None
+
+        avg = sum(scores) / len(scores)
+        unique = list(set(metrics_used))
+
+        if len(scores) == 1:
+            metric_name = unique[0]
+        elif len(unique) == 1:
+            metric_name = f"{unique[0]} (average)"
+        else:
+            metric_name = "mixed (average)"
+
+        return (metric_name, avg, all(cls.is_higher_better(m) for m in unique))
diff --git a/rdagent/app/finetune/llm/ui/components.py b/rdagent/app/finetune/llm/ui/components.py
new file mode 100644
index 000000000..75242a8bd
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/components.py
@@ -0,0 +1,721 @@
+"""
+FT UI Components - Hierarchical Event Renderers
+"""
+
+import re
+from pathlib import Path
+from typing import Any
+
+import plotly.graph_objects as go
+import streamlit as st
+
+from rdagent.app.finetune.llm.ui.benchmarks import get_core_metric_score
+from rdagent.app.finetune.llm.ui.config import ICONS
+from rdagent.app.finetune.llm.ui.data_loader import Event, EvoLoop, Loop, Session
+
+
+def convert_latex_for_streamlit(text: str) -> str:
+    r"""Convert LaTeX syntax to Streamlit-compatible format.
+
+    Streamlit uses $...$ and $$...$$ for LaTeX rendering.
+    This converts \(...\) and \[...\] to the Streamlit format.
+    """
+    if not text:
+        return text
+    # Convert \(...\) to $...$
+    text = text.replace(r"\(", "$").replace(r"\)", "$")
+    # Convert \[...\] to $$...$$
+    text = text.replace(r"\[", "$$").replace(r"\]", "$$")
+    return text
+
+
+def format_duration(seconds: float | None) -> str:
+    if seconds is None:
+        return ""
+    if seconds < 60:
+        return f"{seconds:.1f}s"
+    # Round first so that e.g. 119.6s renders as "2m 0s" rather than "1m 60s"
+    minutes, secs = divmod(round(seconds), 60)
+    return f"{minutes}m {secs}s"
+
+
+def render_session(session: Session, show_types: list[str]) -> None:
+    """Render full session with hierarchy"""
+    # Init events (before any loop)
+    if session.init_events:
+        filtered = [e for e in session.init_events if e.type in show_types]
+        if filtered:
+            with st.expander("🚀 **Initialization**", expanded=False):
+                for event in filtered:
+                    render_event(event)
+
+    # Loops
+    for loop_id in sorted(session.loops.keys()):
+        loop = session.loops[loop_id]
+        render_loop(loop, show_types)
+
+
+def render_loop(loop: Loop, show_types: list[str]) -> None:
+    """Render a single loop with lazy loading"""
+    # 1. Coding stage results
+    evo_results = []
+    for evo in loop.coding.values():
+        if evo.success is True:
+            evo_results.append("✓")
+        elif evo.success is False:
+            evo_results.append("✗")
+    coding_str = f"💻{''.join(evo_results)}" if evo_results else ""
+
+    # 2. Running stage results
+    runner_success = None
+    benchmark_score = None
+    for event in loop.runner:
+        # Docker (Full Train) result - check exit_code, not LLM evaluation
+        if event.type == "docker_exec" and "Full Train" in event.title and event.success is not None:
+            runner_success = event.success
+        # Benchmark score - use core metric from processor
+        if event.type == "feedback" and "Benchmark Result" in event.title:
+            content = event.content
+            if isinstance(content, dict):
+                benchmark_name = content.get("benchmark_name", "")
+                accuracy_summary = content.get("accuracy_summary", {})
+                if isinstance(accuracy_summary, dict) and accuracy_summary:
+                    result = get_core_metric_score(benchmark_name, accuracy_summary)
+                    if result is not None:
+                        _, benchmark_score, _ = result
+
+    # 3. Get feedback decision for benchmark score coloring
+    feedback_decision = None
+    for event in loop.feedback:
+        if event.type == "feedback" and "Feedback:" in event.title:
+            feedback_decision = event.success
+            break
+
+    # 4. Build title string (only show existing stages)
+    parts = []
+    if coding_str:
+        parts.append(coding_str)
+    if runner_success is not None:
+        runner_str = "🏃✓" if runner_success else "🏃✗"
+        parts.append(runner_str)
+    # Show benchmark score with emoji based on feedback decision
+    if benchmark_score is not None:
+        if feedback_decision is True:
+            parts.append(f"✅📊{benchmark_score:.2f}")
+        elif feedback_decision is False:
+            parts.append(f"❌📊{benchmark_score:.2f}")
+        else:
+            parts.append(f"📊{benchmark_score:.2f}")
+
+    result_str = " ".join(parts) if parts else ""
+
+    loop_key = f"loop_{loop.loop_id}_loaded"
+    with st.expander(f"🔄 **Loop {loop.loop_id}** {result_str}", expanded=False):
+        if not st.session_state.get(loop_key, False):
+            # Lazy load: show button first
+            if st.button("📥 Load Content", key=f"load_{loop.loop_id}"):
+                st.session_state[loop_key] = True
+                st.rerun()
+        else:
+            # Render actual content
+            _render_loop_content(loop, show_types)
+
+
+def _render_loop_content(loop: Loop, show_types: list[str]) -> None:
+    """Render loop content (called after lazy load)"""
+    # Exp Gen
+    if loop.exp_gen:
+        filtered = [e for e in loop.exp_gen if e.type in show_types]
+        if filtered:
+            st.markdown("#### 🧪 Experiment Generation")
+            for event in filtered:
+                render_event(event)
+
+    # Coding (Evo Loops)
+    if loop.coding:
+        st.markdown("#### 💻 Coding")
+        for evo_id in sorted(loop.coding.keys()):
+            evo = loop.coding[evo_id]
+            render_evo_loop(evo, show_types)
+
+    # Runner
+    if loop.runner:
+        filtered = [e for e in loop.runner if e.type in show_types]
+        if filtered:
+            st.markdown("#### 🏃 Running (Full Train)")
+            for event in filtered:
+                render_event(event)
+
+    # Feedback
+    if loop.feedback:
+        filtered = [e for e in loop.feedback if e.type in show_types]
+        if filtered:
+            st.markdown("#### 📊 Feedback")
+            for event in filtered:
+                render_event(event)
+
+
+def render_evo_loop(evo: EvoLoop, show_types: list[str]) -> None:
+    """Render evolution loop"""
+    filtered = [e for e in evo.events if e.type in show_types]
+    if not filtered:
+        return
+
+    status = "🟢" if evo.success else "🔴" if evo.success is False else "⚪"
+    with st.expander(f"{status} Evo {evo.evo_id}", expanded=False):
+        for event in filtered:
+            render_event(event)
+
+
+def render_event(event: Event) -> None:
+    """Render a single event"""
+    icon = ICONS.get(event.type, "📌")
+    duration_str = f" ({format_duration(event.duration)})" if event.duration else ""
+
+    status = ""
+    if event.success is True:
+        status = "🟢 "
+    elif event.success is False:
+        status = "🔴 "
+
+    title = f"{event.time_str} {icon} {status}{event.title}{duration_str}"
+
+    renderers = {
+        "scenario": render_scenario,
+        "llm_call": render_llm_call,
+        "template": render_template,
+        "experiment": render_experiment,
+        "code": render_code,
+        "docker_exec": render_docker_exec,
+        "evaluator": render_docker_exec,  # Reuse docker_exec renderer for evaluator feedback
+        "feedback": render_feedback,
+        "token": render_token,
+        "time": render_time_info,
+        "settings": render_settings,
+        "hypothesis": render_hypothesis,
+        "dataset_selection": render_dataset_selection,
+    }
+
+    renderer = renderers.get(event.type, render_generic)
+    with st.expander(title, expanded=False):
+        # Pass event.title to docker_exec/evaluator renderers for context-aware labels
+        if event.type in ("docker_exec", "evaluator"):
+            renderer(event.content, event.title)
+        else:
+            renderer(event.content)
+
+
+def render_scenario(content: Any) -> None:
+    """Render scenario details (main info shown in page header, this shows extras)."""
+    import json
+
+    # 1. User target scenario
+    if hasattr(content, "user_target_scenario") and content.user_target_scenario:
+        st.markdown(f"**Target Scenario:** {content.user_target_scenario}")
+
+    # 2. Benchmark description
+    if hasattr(content, "benchmark_description") and content.benchmark_description:
+        st.markdown(f"**Benchmark Description:** {content.benchmark_description}")
+
+    # 3. Full timeout
+    if hasattr(content, "real_full_timeout"):
+        try:
+            timeout_hours = content.real_full_timeout() / 60 / 60
+            st.markdown(f"**Full Train Timeout:** {timeout_hours:.2f} hours")
+        except Exception:  # Best-effort display: skip the timeout line if it cannot be computed
+            pass
+
+    # 4. Device info - formatted nicely
+    if hasattr(content, "device_info") and content.device_info:
+        device = content.device_info
+        # Parse string to dict if needed
+        if isinstance(device, str):
+            try:
+                device = json.loads(device)
+            except json.JSONDecodeError:
+                st.markdown(f"**Device:** `{device}`")
+                device = None
+        if isinstance(device, dict):
+            parts = []
+            # Runtime info
+            runtime = device.get("runtime", {})
+            if runtime.get("python_version"):
+                parts.append(f"🐍 Python `{runtime['python_version'].split()[0]}`")
+            if runtime.get("os"):
+                parts.append(f"💻 {runtime['os']}")
+            # GPU info
+            gpu_info = device.get("gpu", {})
+            gpus = gpu_info.get("gpus", [])
+            if gpus:
+                gpu_name = gpus[0].get("name", "Unknown")
+                gpu_mem_gb = gpus[0].get("memory_total_gb", 0)
+                if len(gpus) > 1:
+                    parts.append(f"🎮 {len(gpus)}x {gpu_name} ({gpu_mem_gb}GB)")
+                else:
+                    parts.append(f"🎮 {gpu_name} ({gpu_mem_gb}GB)")
+            if parts:
+                st.markdown(" · ".join(parts))
+
+    # 5. Model info (detailed specs)
+    if hasattr(content, "model_info") and content.model_info:
+        model_info = content.model_info
+        if isinstance(model_info, dict) and model_info:
+            with st.expander("Model Info", expanded=False):
+                # Show key specs in a readable format
+                if "specs" in model_info and model_info["specs"]:
+                    st.markdown("**Specs:**")
+                    st.code(model_info["specs"], language="text", wrap_lines=True)
+                # Show other fields
+                other_info = {k: v for k, v in model_info.items() if k != "specs" and v}
+                if other_info:
+                    st.json(other_info)
+
+    # 6. Memory report (estimation based on hardware and model)
+    if hasattr(content, "memory_report") and content.memory_report:
+        with st.expander("Memory Estimation", expanded=False):
+            st.code(content.memory_report, language="text", wrap_lines=True)
+
+
+def render_dataset_selection(content: Any) -> None:
+    if not isinstance(content, dict):
+        st.json(content) if content else st.info("No content")
+        return
+
+    selected = content.get("selected_datasets", [])
+    total = content.get("total_datasets", 0)
+    reasoning = content.get("reasoning", "")
+
+    if selected:
+        st.markdown(f"**Selected ({len(selected)}/{total}):** " + ", ".join(f"`{ds}`" for ds in selected))
+
+    if reasoning:
+        with st.expander("Selection Reasoning", expanded=True):
+            st.markdown(reasoning)
+
+
+def render_hypothesis(content: Any) -> None:
+    """Render hypothesis content (Base Model shown in page header, not here)."""
+    if hasattr(content, "hypothesis") and content.hypothesis:
+        st.markdown("**Hypothesis:**")
+        st.markdown(content.hypothesis)
+    if hasattr(content, "reason") and content.reason:
+        with st.expander("Reason", expanded=False):
+            st.markdown(content.reason)
+
+
+def render_settings(content: Any) -> None:
+    if isinstance(content, dict):
+        st.json(content)
+    else:
+        st.code(str(content), wrap_lines=True)
+
+
+def render_llm_call(content: Any) -> None:
+    if not isinstance(content, dict):
+        st.json(content) if content else st.info("No content")
+        return
+
+    if content.get("start") and content.get("end"):
+        duration = (content["end"] - content["start"]).total_seconds()
+        st.caption(f"Duration: {format_duration(duration)}")
+
+    # Check if markdown rendering is enabled
+    render_md = st.session_state.get("render_markdown_toggle", False)
+
+    system = content.get("system", "")
+    if system:
+        with st.expander("System Prompt", expanded=False):
+            if render_md:
+                st.markdown(system)
+            else:
+                st.code(system, language="text", line_numbers=True, wrap_lines=True)
+
+    user = content.get("user", "")
+    if user:
+        with st.expander("User Prompt", expanded=False):
+            if render_md:
+                st.markdown(user)
+            else:
+                st.code(user, language="text", line_numbers=True, wrap_lines=True)
+
+    resp = content.get("resp", "")
+    if resp:
+        st.markdown("**Response:**")
+        if render_md:
+            st.markdown(resp)
+        elif resp.strip().startswith(("{", "[")):
+            st.code(resp, language="json", line_numbers=True, wrap_lines=True)
+        elif resp.strip().startswith("```"):
+            st.markdown(resp)
+        else:
+            st.code(resp, language="text", line_numbers=True, wrap_lines=True)
+
+
+def render_template(content: Any) -> None:
+    if not isinstance(content, dict):
+        st.json(content) if content else st.info("No content")
+        return
+
+    uri = content.get("uri", "")
+    st.caption(f"URI: `{uri}`")
+
+    context = content.get("context", {})
+    if context:
+        with st.expander("Context Variables", expanded=False):
+            st.json(context)
+
+    template = content.get("template", "")
+    if template:
+        with st.expander("Template", expanded=False):
+            st.code(template, language="text", line_numbers=True, wrap_lines=True)
+
+    rendered = content.get("rendered", "")
+    if rendered:
+        with st.expander("Rendered", expanded=True):
+            st.code(rendered, language="text", line_numbers=True, wrap_lines=True)
+
+
+def render_experiment(content: Any) -> None:
+    """Render experiment tasks (Base Model and Datasets shown in page header, not here)."""
+    if isinstance(content, list):
+        for i, task in enumerate(content):
+            if len(content) > 1:
+                st.markdown(f"**Task {i}**")
+
+            if hasattr(task, "description") and task.description:
+                st.markdown("**Description:**")
+                st.markdown(task.description)
+    else:
+        st.json(content) if content else st.info("No content")
+
+
+def render_code(content: Any) -> None:
+    if not isinstance(content, list):
+        st.info("No code available")
+        return
+
+    for i, ws in enumerate(content):
+        if not hasattr(ws, "file_dict") or not ws.file_dict:
+            continue
+
+        if len(content) > 1:
+            st.markdown(f"**Workspace {i}**")
+
+        for filename, code in ws.file_dict.items():
+            lang = "yaml" if filename.endswith((".yaml", ".yml")) else "python"
+            with st.expander(filename, expanded=False):
+                st.code(code, language=lang, line_numbers=True, wrap_lines=True)
+
+
+def _extract_evaluator_name(title: str) -> str:
+    """Extract evaluator name from event title like 'Eval (Data Processing) ✓'."""
+    match = re.search(r"\(([^)]+)\)", title)
+    return match.group(1) if match else ""
+
+
+def _render_single_feedback(fb: Any, evaluator_name: str = "") -> None:
+    """Render a single CoSTEERSingleFeedback object.
+
+    Structure:
+    - execution: LLM-generated execution summary (what happened, success/failure reason)
+    - raw_execution: Raw script stdout/stderr output
+    - return_checking: LLM-generated data quality assessment
+    - code: LLM-generated code improvement suggestions
+    """
+    decision = getattr(fb, "final_decision", None)
+    if decision is True:
+        st.success("Execution: PASS")
+    elif decision is False:
+        st.error("Execution: FAIL")
+
+    # 1. Execution Summary (LLM-generated)
+    execution = getattr(fb, "execution", "")
+    if execution:
+        label = f"{evaluator_name} Summary" if evaluator_name else "Execution Summary"
+        with st.expander(label, expanded=True):
+            st.code(execution, language="text", line_numbers=True, wrap_lines=True)
+
+    # 2. Raw Execution Log (script stdout)
+    raw_execution = getattr(fb, "raw_execution", "")
+    if raw_execution:
+        with st.expander("Raw Output (stdout)", expanded=False):
+            st.code(raw_execution, language="text", line_numbers=True, wrap_lines=True)
+
+    # 3. Data Quality Check (LLM-generated)
+    return_checking = getattr(fb, "return_checking", "")
+    if return_checking:
+        with st.expander("Data Quality Check", expanded=False):
+            st.code(return_checking, language="text", line_numbers=True, wrap_lines=True)
+
+    # 4. Code Improvement Suggestions (LLM-generated, often very long)
+    code_fb = getattr(fb, "code", "")
+    if code_fb:
+        with st.expander("Code Improvement Suggestions", expanded=False):
+            # Use markdown rendering if content contains markdown formatting
+            if "**" in code_fb or "```" in code_fb or "- " in code_fb:
+                st.markdown(code_fb)
+            else:
+                st.code(code_fb, language="text", line_numbers=True, wrap_lines=True)
+
+
+def render_docker_exec(content: Any, event_title: str = "") -> None:
+    # Extract evaluator name from event title for context-aware labels
+    evaluator_name = _extract_evaluator_name(event_title)
+
+    # Docker run raw output (dict with exit_code/stdout)
+    if isinstance(content, dict) and ("exit_code" in content or "stdout" in content or "success" in content):
+        # Show workspace ID if available (only the UUID part)
+        workspace_path = content.get("workspace_path")
+        if workspace_path:
+            workspace_id = Path(workspace_path).name
+            st.caption(f"📁 `{workspace_id}`")
+
+        exit_code = content.get("exit_code")
+        success = content.get("success")
+        if exit_code is not None:
+            if exit_code == 0:
+                st.success(f"Exit code: {exit_code}")
+            else:
+                st.error(f"Exit code: {exit_code}")
+        elif success is not None:
+            if success:
+                st.success("Execution: PASS")
+            else:
+                st.error("Execution: FAIL")
+
+        stdout = content.get("stdout", "")
+        if stdout:
+            label = f"{evaluator_name} Output" if evaluator_name else "Execution Output"
+            with st.expander(label, expanded=True):
+                st.code(stdout, language="text", line_numbers=True, wrap_lines=True)
+        return
+
+    # CoSTEERMultiFeedback (has feedback_list)
+    if hasattr(content, "feedback_list"):
+        for i, fb in enumerate(content.feedback_list):
+            if len(content.feedback_list) > 1:
+                st.markdown(f"**Feedback {i}**")
+            _render_single_feedback(fb, evaluator_name)
+        return
+
+    # Single CoSTEERSingleFeedback (has final_decision)
+    if hasattr(content, "final_decision"):
+        _render_single_feedback(content, evaluator_name)
+        return
+
+    # FTExperiment (runner result)
+    if hasattr(content, "sub_workspace_list"):
+        for ws in content.sub_workspace_list:
+            if not hasattr(ws, "running_info") or ws.running_info is None:
+                continue
+
+            info = ws.running_info
+            running_time = getattr(info, "running_time", None)
+            if running_time:
+                st.metric("Running Time", f"{running_time:.1f}s")
+
+            stdout = getattr(info, "stdout", "")
+            if stdout:
+                with st.expander("Full Train Log", expanded=True):
+                    st.code(stdout, language="text", line_numbers=True, wrap_lines=True)
+
+            result = getattr(info, "result", {})
+            if result:
+                render_training_result(result)
+        return
+
+    st.json(content) if content else st.info("No content")
+
+
+def render_feedback(content: Any) -> None:
+    # Handle benchmark result (dict with accuracy_summary)
+    if isinstance(content, dict) and "accuracy_summary" in content:
+        render_benchmark_result(content)
+        return
+
+    col1, col2, col3 = st.columns(3)
+    with col1:
+        decision = getattr(content, "decision", None)
+        if decision is not None:
+            st.metric("Decision", "Accept" if decision else "Reject")
+    with col2:
+        acceptable = getattr(content, "acceptable", None)
+        if acceptable is not None:
+            st.metric("Acceptable", "Yes" if acceptable else "No")
+    with col3:
+        error_type = getattr(content, "observations", None)
+        if error_type:
+            st.metric("Error Type", error_type)
+
+    # FT scenario only uses code_change_summary (observations, hypothesis_evaluation,
+    # new_hypothesis, eda_improvement are DS scenario specific)
+    fields = [
+        ("code_change_summary", "Code Change Summary"),
+    ]
+
+    for attr, label in fields:
+        value = getattr(content, attr, None)
+        if value:
+            with st.expander(label, expanded=False):
+                st.markdown(value)
+
+    reason = getattr(content, "reason", None)
+    if reason:
+        with st.expander("Reason (Full Details)", expanded=True):
+            st.code(reason, language="text", line_numbers=True, wrap_lines=True)
+
+    exception = getattr(content, "exception", None)
+    if exception:
+        st.error(f"Exception: {exception}")
+
+
+def render_token(content: Any) -> None:
+    if isinstance(content, dict):
+        col1, col2, col3 = st.columns(3)
+        with col1:
+            st.metric("Prompt", content.get("prompt_tokens", 0))
+        with col2:
+            st.metric("Completion", content.get("completion_tokens", 0))
+        with col3:
+            st.metric("Total", content.get("total_tokens", 0))
+    else:
+        st.json(content) if content else st.info("No content")
+
+
+def render_time_info(content: Any) -> None:
+    if isinstance(content, dict):
+        for k, v in content.items():
+            st.metric(k, f"{v:.1f}s" if isinstance(v, (int, float)) else str(v))
+    else:
+        st.json(content) if content else st.info("No content")
+
+
+def render_generic(content: Any) -> None:
+    if hasattr(content, "__dict__"):
+        st.json(vars(content))
+    elif content:
+        st.json(content)
+    else:
+        st.info("No content")
+
+
+def render_training_result(result: dict) -> None:
+    training_metrics = result.get("training_metrics", {})
+    loss_history = training_metrics.get("loss_history", {})
+
+    # loss_history is Dict[str, List[Dict]] with "train" and "eval" keys
+    train_history = loss_history.get("train", []) if isinstance(loss_history, dict) else []
+    if train_history:
+        fig = go.Figure()
+        steps = [entry.get("step", i) for i, entry in enumerate(train_history)]
+        losses = [entry.get("loss", 0) for entry in train_history]
+        fig.add_trace(go.Scatter(x=steps, y=losses, mode="lines+markers", name="Loss"))
+        fig.update_layout(title="Training Loss", xaxis_title="Step", yaxis_title="Loss", height=300)
+        st.plotly_chart(fig, use_container_width=True)
+
+        col1, col2 = st.columns(2)
+        initial_loss = training_metrics.get("initial_loss")
+        final_loss = training_metrics.get("final_loss")
+        if initial_loss is not None:  # a loss of exactly 0.0 is still worth showing
+            col1.metric("Initial Loss", f"{initial_loss:.4f}")
+        if final_loss is not None:
+            col2.metric("Final Loss", f"{final_loss:.4f}")
+
+    # Validation benchmark ([:100]) - used for SOTA judgment
+    benchmark = result.get("benchmark", {})
+    if benchmark:
+        st.markdown("**Validation Benchmark**")
+        # Detect format: old format has "accuracy_summary" at top level,
+        # new format has benchmark names as keys with nested accuracy_summary
+        if "accuracy_summary" in benchmark:
+            # Old format: {accuracy_summary: {...}, error_samples: [...]}
+            accuracy_summary = benchmark.get("accuracy_summary", {})
+            if accuracy_summary:
+                rows = [{"dataset": ds, **metrics} for ds, metrics in accuracy_summary.items()]
+                st.dataframe(rows)
+        else:
+            # New format: {bm_name: {accuracy_summary: {...}}, ...}
+            for bm_name, bm_result in benchmark.items():
+                if isinstance(bm_result, dict) and "accuracy_summary" in bm_result:
+                    st.markdown(f"*{bm_name}:*")
+                    accuracy_summary = bm_result.get("accuracy_summary", {})
+                    if accuracy_summary:
+                        rows = [{"dataset": ds, **metrics} for ds, metrics in accuracy_summary.items()]
+                        st.dataframe(rows)
+
+    # Test benchmark ([100:200]) - frontend display only, not visible to agent
+    benchmark_test = result.get("benchmark_test", {})
+    if benchmark_test and benchmark_test != benchmark:  # Avoid duplicate display for small datasets
+        st.markdown("**Test Benchmark**")
+        if "accuracy_summary" in benchmark_test:
+            accuracy_summary = benchmark_test.get("accuracy_summary", {})
+            if accuracy_summary:
+                rows = [{"dataset": ds, **metrics} for ds, metrics in accuracy_summary.items()]
+                st.dataframe(rows)
+        else:
+            for bm_name, bm_result in benchmark_test.items():
+                if isinstance(bm_result, dict) and "accuracy_summary" in bm_result:
+                    st.markdown(f"*{bm_name}:*")
+                    accuracy_summary = bm_result.get("accuracy_summary", {})
+                    if accuracy_summary:
+                        rows = [{"dataset": ds, **metrics} for ds, metrics in accuracy_summary.items()]
+                        st.dataframe(rows)
+
+
+def render_benchmark_result(content: dict) -> None:
+    """Render benchmark evaluation result"""
+    import pandas as pd
+
+    benchmark_name = content.get("benchmark_name", "Unknown")
+    st.markdown(f"**Benchmark: {benchmark_name}**")
+
+    # Accuracy summary table
+    # accuracy_summary is a dict: {dataset_name: {metric: value, ...}, ...}
+    accuracy_summary = content.get("accuracy_summary", {})
+    if accuracy_summary and isinstance(accuracy_summary, dict):
+        st.markdown("**Accuracy Summary:**")
+        # Convert dict {dataset: {metric: value}} to list of dicts for dataframe
+        rows = []
+        for ds, metrics in accuracy_summary.items():
+            row = {"dataset": ds, **metrics}
+            rows.append(row)
+
+        # Create DataFrame and reorder columns
+        df = pd.DataFrame(rows)
+        cols = ["dataset"] + [c for c in df.columns if c != "dataset"]
+        df = df[cols]
+        st.dataframe(df)
+
+    # Error samples
+    error_samples = content.get("error_samples", [])
+    if error_samples:
+        with st.expander(f"Error Samples ({len(error_samples)})", expanded=False):
+            for i, sample in enumerate(error_samples):
+                with st.expander(f"Sample {i+1} (Gold: {sample.get('gold', 'N/A')})", expanded=False):
+                    st.markdown(
+                        '<div style="font-size: 0.85em;">',
+                        unsafe_allow_html=True,
+                    )
+                    st.markdown("**Question:**")
+                    st.markdown(convert_latex_for_streamlit(sample.get("question", "N/A")))
+                    st.markdown("---")
+                    st.markdown(f"**Gold:** `{sample.get('gold', 'N/A')}`")
+                    st.markdown("---")
+                    st.markdown("**Model Output:**")
+                    st.markdown(convert_latex_for_streamlit(sample.get("model_output", "N/A")))
+                    st.markdown("</div>", unsafe_allow_html=True)
+
+
+def render_summary(summary: dict) -> None:
+    col1, col2, col3, col4 = st.columns(4)
+    with col1:
+        st.metric("Loops", summary.get("loop_count", 0))
+    with col2:
+        st.metric("LLM Calls", summary.get("llm_call_count", 0))
+    with col3:
+        llm_time = summary.get("llm_total_time", 0)
+        st.metric("LLM Time", format_duration(llm_time))
+    with col4:
+        success = summary.get("docker_success", 0)
+        fail = summary.get("docker_fail", 0)
+        st.metric("Executions", f"{success}✓ / {fail}✗")
diff --git a/rdagent/app/finetune/llm/ui/config.py b/rdagent/app/finetune/llm/ui/config.py
new file mode 100644
index 000000000..3c971217d
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/config.py
@@ -0,0 +1,69 @@
+"""
+FT UI Configuration Constants
+
+Centralized configuration for FT Timeline Viewer.
+"""
+
+from typing import Literal
+
+# Event type definition
+EventType = Literal[
+    "scenario",
+    "llm_call",
+    "template",
+    "experiment",
+    "code",
+    "docker_exec",
+    "evaluator",  # Evaluator feedback (separate from docker_exec)
+    "feedback",
+    "token",
+    "time",
+    "settings",
+    "hypothesis",
+    "dataset_selection",
+]
+
+# Event type icons
+ICONS = {
+    "scenario": "🎯",
+    "llm_call": "💬",
+    "template": "📋",
+    "experiment": "🧪",
+    "code": "📄",
+    "docker_exec": "🐳",
+    "evaluator": "📝",  # Evaluator feedback icon
+    "feedback": "📊",
+    "token": "🔢",
+    "time": "⏱️",
+    "settings": "⚙️",
+    "hypothesis": "💡",
+    "dataset_selection": "📂",
+}
+
+# Evaluator configuration mapping (name, default_stage)
+EVALUATOR_CONFIG = {
+    "FTDataEvaluator": ("Data Processing", "coding"),
+    "FTCoderEvaluator": ("Micro-batch Test", "coding"),
+    "FTRunnerEvaluator": ("Full Train", "runner"),
+}
+
+# Always visible event types
+ALWAYS_VISIBLE_TYPES = [
+    "scenario",
+    "dataset_selection",
+    "hypothesis",
+    "llm_call",
+    "experiment",
+    "code",
+    "docker_exec",
+    "evaluator",
+    "feedback",
+]
+
+# Optional event types with toggle config (label, default_enabled)
+OPTIONAL_TYPES = {
+    "template": ("📋 Template", False),
+    "token": ("🔢 Token", False),
+    "time": ("⏱️ Time", False),
+    "settings": ("⚙️ Settings", False),
+}
diff --git a/rdagent/app/finetune/llm/ui/data_loader.py b/rdagent/app/finetune/llm/ui/data_loader.py
new file mode 100644
index 000000000..6004c9912
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/data_loader.py
@@ -0,0 +1,455 @@
+"""
+FT UI Data Loader
+Load pkl logs and convert to hierarchical timeline structure
+"""
+
+import re
+from dataclasses import dataclass, field
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+
+import streamlit as st
+
+from rdagent.app.finetune.llm.ui.config import EVALUATOR_CONFIG, EventType
+from rdagent.log.storage import FileStorage
+
+
+@dataclass
+class Event:
+    """Timeline event"""
+
+    type: EventType
+    timestamp: datetime
+    tag: str
+    title: str
+    content: Any
+    loop_id: int | None = None
+    evo_id: int | None = None
+    stage: str = ""
+    duration: float | None = None
+    success: bool | None = None
+
+    @property
+    def time_str(self) -> str:
+        return self.timestamp.strftime("%H:%M:%S")
+
+
+@dataclass
+class EvoLoop:
+    """Evolution loop containing events"""
+
+    evo_id: int
+    events: list[Event] = field(default_factory=list)
+    success: bool | None = None
+
+
+@dataclass
+class Loop:
+    """Main loop containing stages"""
+
+    loop_id: int
+    exp_gen: list[Event] = field(default_factory=list)
+    coding: dict[int, EvoLoop] = field(default_factory=dict)  # evo_id -> EvoLoop
+    runner: list[Event] = field(default_factory=list)
+    feedback: list[Event] = field(default_factory=list)
+
+
+@dataclass
+class Session:
+    """Session containing init events and loops"""
+
+    init_events: list[Event] = field(default_factory=list)
+    loops: dict[int, Loop] = field(default_factory=dict)  # loop_id -> Loop
+
+
+def extract_loop_id(tag: str) -> int | None:
+    match = re.search(r"Loop_(\d+)", tag)
+    return int(match.group(1)) if match else None
+
+
+def extract_evo_id(tag: str) -> int | None:
+    match = re.search(r"evo_loop_(\d+)", tag)
+    return int(match.group(1)) if match else None
+
+
+def extract_stage(tag: str) -> str:
+    if "direct_exp_gen" in tag:
+        return "exp_gen"
+    if "coding" in tag:
+        return "coding"
+    if "running" in tag:  # Note: tag uses "running", not "runner"
+        return "runner"
+    if "feedback" in tag:
+        return "feedback"
+    return ""
+
+
+def get_valid_sessions(log_folder: Path) -> list[str]:
+    if not log_folder.exists():
+        return []
+    sessions = []
+    for d in log_folder.iterdir():
+        if d.is_dir() and d.joinpath("__session__").exists():
+            sessions.append(d.name)
+    return sorted(sessions, reverse=True)
+
+
+def parse_event(tag: str, content: Any, timestamp: datetime) -> Event | None:
+    loop_id = extract_loop_id(tag)
+    evo_id = extract_evo_id(tag)
+    stage = extract_stage(tag)
+
+    # Scenario
+    if tag == "scenario":
+        model = getattr(content, "base_model", "Unknown")
+        return Event(type="scenario", timestamp=timestamp, tag=tag, title=f"Scenario: {model}", content=content)
+
+    # Dataset selection
+    if "dataset_selection" in tag:
+        selected = content.get("selected_datasets", []) if isinstance(content, dict) else []
+        total = content.get("total_datasets", 0) if isinstance(content, dict) else 0
+        return Event(
+            type="dataset_selection",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Dataset Selection: {len(selected)}/{total}",
+            content=content,
+        )
+
+    # Settings
+    if "SETTINGS" in tag:
+        name = tag.replace("_SETTINGS", "").replace("SETTINGS", "")
+        return Event(type="settings", timestamp=timestamp, tag=tag, title=f"Settings: {name}", content=content)
+
+    # Hypothesis
+    if tag == "hypothesis" or (loop_id is not None and "hypothesis" in tag):
+        return Event(
+            type="hypothesis",
+            timestamp=timestamp,
+            tag=tag,
+            title="Hypothesis",
+            content=content,
+            loop_id=loop_id,
+            stage="exp_gen",
+        )
+
+    # LLM Call
+    if "debug_llm" in tag:
+        if isinstance(content, dict) and ("user" in content or "system" in content):
+            duration = None
+            if content.get("start") and content.get("end"):
+                duration = (content["end"] - content["start"]).total_seconds()
+            return Event(
+                type="llm_call",
+                timestamp=timestamp,
+                tag=tag,
+                title="LLM Call",
+                content=content,
+                loop_id=loop_id,
+                evo_id=evo_id,
+                stage=stage,
+                duration=duration,
+            )
+
+    # Template
+    if "debug_tpl" in tag:
+        if isinstance(content, dict) and "uri" in content:
+            uri = content.get("uri", "")
+            tpl_name = uri.split(":")[-1] if ":" in uri else uri
+            return Event(
+                type="template",
+                timestamp=timestamp,
+                tag=tag,
+                title=f"Template: {tpl_name}",
+                content=content,
+                loop_id=loop_id,
+                evo_id=evo_id,
+                stage=stage,
+            )
+
+    # Experiment generation
+    if "experiment generation" in tag:
+        task_count = len(content) if isinstance(content, list) else 1
+        return Event(
+            type="experiment",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Experiment ({task_count} task{'s' if task_count != 1 else ''})",
+            content=content,
+            loop_id=loop_id,
+            stage=stage,
+        )
+
+    # Evolving code
+    if "evolving code" in tag:
+        file_count = 0
+        if isinstance(content, list):
+            for ws in content:
+                if hasattr(ws, "file_dict"):
+                    file_count += len(ws.file_dict)
+        return Event(
+            type="code",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Code ({file_count} files)",
+            content=content,
+            loop_id=loop_id,
+            evo_id=evo_id,
+            stage=stage or "coding",
+        )
+
+    # Benchmark execution (Docker or Conda) - must check before generic docker_run/conda_run
+    if "docker_run.Benchmark" in tag or "conda_run.Benchmark" in tag:
+        benchmark_name = content.get("benchmark_name", "Unknown") if isinstance(content, dict) else "Unknown"
+        exit_code = content.get("exit_code") if isinstance(content, dict) else None
+        success = exit_code == 0 if exit_code is not None else None
+        env_type = "Docker" if "docker_run" in tag else "Conda"
+        return Event(
+            type="docker_exec",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Benchmark ({benchmark_name}) [{env_type}] {'✓' if success else '✗' if success is False else ''}",
+            content=content,
+            loop_id=loop_id,
+            stage="runner",
+            success=success,
+        )
+
+    # Environment run (Docker or Conda, raw execution logged before LLM evaluation)
+    if "docker_run." in tag or "conda_run." in tag:
+        is_docker = "docker_run." in tag
+        tag_prefix = "docker_run." if is_docker else "conda_run."
+        class_name = tag.split(tag_prefix)[-1].split(".")[0]
+
+        # FTWorkspace unified logging - determine type from entry command
+        if class_name == "FTWorkspace":
+            entry = content.get("entry", "") if isinstance(content, dict) else ""
+            if "llamafactory-cli train" in entry:
+                # Distinguish by yaml file name: debug_train.yaml for micro-batch, train.yaml for full training
+                if "debug_train.yaml" in entry:
+                    evaluator_name, default_stage = "Micro-batch Test", "coding"
+                else:
+                    evaluator_name, default_stage = "Full Train", "runner"
+            elif "process_data" in entry.lower():
+                evaluator_name, default_stage = "Data Processing", "coding"
+            elif entry.startswith("rm "):
+                evaluator_name, default_stage = "Cleanup", "runner"
+            else:
+                evaluator_name, default_stage = "Env Run", "coding"
+        else:
+            evaluator_name, default_stage = EVALUATOR_CONFIG.get(class_name, (class_name, "coding"))
+
+        exit_code = content.get("exit_code") if isinstance(content, dict) else None
+        success = exit_code == 0 if exit_code is not None else (content.get("success") if isinstance(content, dict) else None)
+        env_label = "Docker" if is_docker else "Conda"
+        title = f"{env_label} ({evaluator_name}) {'✓' if success else '✗' if success is False else ''}"
+        return Event(
+            type="docker_exec",
+            timestamp=timestamp,
+            tag=tag,
+            title=title,
+            content=content,
+            loop_id=loop_id,
+            evo_id=evo_id,
+            stage=stage or default_stage,
+            success=success,
+        )
+
+    # Docker execution (individual evaluator feedback, logged after LLM evaluation)
+    if "docker_exec." in tag:
+        class_name = tag.split("docker_exec.")[-1].split(".")[0]
+        evaluator_name, default_stage = EVALUATOR_CONFIG.get(class_name, (class_name, "coding"))
+        success = getattr(content, "final_decision", None)
+        title = f"Eval ({evaluator_name}) {'✓' if success else '✗' if success is False else '?'}"
+        return Event(
+            type="docker_exec",
+            timestamp=timestamp,
+            tag=tag,
+            title=title,
+            content=content,
+            loop_id=loop_id,
+            evo_id=evo_id,
+            stage=stage or default_stage,
+            success=success,
+        )
+
+    # Evaluator feedback (logged from FT evaluators with final_decision)
+    if "evaluator_feedback." in tag:
+        class_name = tag.split("evaluator_feedback.")[-1].split(".")[0]
+        evaluator_name, default_stage = EVALUATOR_CONFIG.get(class_name, (class_name, "coding"))
+        success = getattr(content, "final_decision", None)
+        title = f"Eval ({evaluator_name}) {'✓' if success else '✗' if success is False else '?'}"
+        return Event(
+            type="evaluator",  # Use dedicated evaluator type with 📝 icon
+            timestamp=timestamp,
+            tag=tag,
+            title=title,
+            content=content,
+            loop_id=loop_id,
+            evo_id=evo_id,
+            stage=stage or default_stage,
+            success=success,
+        )
+
+    # Final feedback
+    if "feedback.feedback" in tag or (tag.endswith(".feedback") and "evo_loop" not in tag):
+        decision = getattr(content, "decision", None)
+        return Event(
+            type="feedback",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Feedback: {'Accept' if decision else 'Reject' if decision is False else 'Unknown'}",
+            content=content,
+            loop_id=loop_id,
+            stage="feedback",
+            success=decision,
+        )
+
+    # Benchmark result (supports benchmark_result, benchmark_result.validation, benchmark_result.test)
+    if "benchmark_result" in tag:
+        benchmark_name = content.get("benchmark_name", "Unknown") if isinstance(content, dict) else "Unknown"
+        accuracy = content.get("accuracy_summary", {}) if isinstance(content, dict) else {}
+        # Extract split from tag or content
+        split = content.get("split", "") if isinstance(content, dict) else ""
+        if not split and "." in tag and not tag.endswith("benchmark_result"):
+            split = tag.split(".")[-1]  # e.g., "validation" or "test" from "benchmark_result.validation"
+        split_label = f" [{split.title()}]" if split and split != "default" else ""
+        return Event(
+            type="feedback",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Benchmark Result{split_label} ({benchmark_name}: {len(accuracy)} datasets)",
+            content=content,
+            loop_id=loop_id,
+            stage="runner",
+        )
+
+    # Runner result
+    if "runner result" in tag:
+        return Event(
+            type="docker_exec",
+            timestamp=timestamp,
+            tag=tag,
+            title="Full Train",
+            content=content,
+            loop_id=loop_id,
+            stage="runner",
+        )
+
+    # Token cost
+    if "token_cost" in tag:
+        if isinstance(content, dict):
+            total = content.get("total_tokens", 0)
+            return Event(
+                type="token",
+                timestamp=timestamp,
+                tag=tag,
+                title=f"Token: {total}",
+                content=content,
+                loop_id=loop_id,
+                evo_id=evo_id,
+                stage=stage,
+            )
+
+    # Time info
+    if "time_info" in tag:
+        return Event(
+            type="time", timestamp=timestamp, tag=tag, title="Time Info", content=content, loop_id=loop_id, stage=stage
+        )
+
+    return None
+
+
+@st.cache_data(ttl=300, hash_funcs={Path: str})
+def load_ft_session(log_path: Path) -> Session:
+    """Load events into hierarchical session structure"""
+    session = Session()
+    storage = FileStorage(log_path)
+
+    events = []
+    for msg in storage.iter_msg():
+        if not msg.tag:
+            continue
+        event = parse_event(msg.tag, msg.content, msg.timestamp)
+        if event:
+            events.append(event)
+
+    # Sort by timestamp
+    events.sort(key=lambda e: e.timestamp)
+
+    # Organize into hierarchy
+    for event in events:
+        if event.loop_id is None:
+            session.init_events.append(event)
+            continue
+
+        # Ensure loop exists
+        if event.loop_id not in session.loops:
+            session.loops[event.loop_id] = Loop(loop_id=event.loop_id)
+        loop = session.loops[event.loop_id]
+
+        # Place event in appropriate stage
+        if event.stage == "exp_gen":
+            loop.exp_gen.append(event)
+        elif event.stage == "coding":
+            if event.evo_id is not None:
+                if event.evo_id not in loop.coding:
+                    loop.coding[event.evo_id] = EvoLoop(evo_id=event.evo_id)
+                evo = loop.coding[event.evo_id]
+                evo.events.append(event)
+                # Use evaluator feedback (final_decision) for evo success, fallback to docker_exec
+                if event.type in ("evaluator", "docker_exec") and event.success is not None:
+                    if evo.success is None:
+                        evo.success = event.success
+                    else:
+                        evo.success = evo.success and event.success  # AND logic: all evaluators must pass
+            else:
+                # Coding events without evo_id go to evo 0
+                if 0 not in loop.coding:
+                    loop.coding[0] = EvoLoop(evo_id=0)
+                loop.coding[0].events.append(event)
+        elif event.stage == "runner":
+            loop.runner.append(event)
+        elif event.stage == "feedback":
+            loop.feedback.append(event)
+        else:
+            # Unknown stage - put in exp_gen
+            loop.exp_gen.append(event)
+
+    return session
+
+
+def get_summary(session: Session) -> dict:
+    """Get summary statistics"""
+    llm_calls = []
+    docker_execs = []
+
+    # Collect from init
+    for e in session.init_events:
+        if e.type == "llm_call":
+            llm_calls.append(e)
+        elif e.type == "docker_exec":
+            docker_execs.append(e)
+
+    # Collect from loops
+    for loop in session.loops.values():
+        for e in loop.exp_gen + loop.runner + loop.feedback:
+            if e.type == "llm_call":
+                llm_calls.append(e)
+            elif e.type == "docker_exec":
+                docker_execs.append(e)
+        for evo in loop.coding.values():
+            for e in evo.events:
+                if e.type == "llm_call":
+                    llm_calls.append(e)
+                elif e.type == "docker_exec":
+                    docker_execs.append(e)
+
+    return {
+        "loop_count": len(session.loops),
+        "llm_call_count": len(llm_calls),
+        "llm_total_time": sum(e.duration or 0 for e in llm_calls),
+        "docker_success": sum(1 for e in docker_execs if e.success is True),
+        "docker_fail": sum(1 for e in docker_execs if e.success is False),
+    }
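Reviewer note on the tag helpers in this file: they decide how each logged message is bucketed by loop, evo step, and stage. The following is a minimal standalone sketch of that mapping. The helper bodies are copied from this diff; the sample tag strings are hypothetical illustrations, since the exact tag layout emitted by the logger is not shown in this patch.

```python
from __future__ import annotations

import re


def extract_loop_id(tag: str) -> int | None:
    # Case-sensitive: matches "Loop_3" but not the "loop_" inside "evo_loop_1".
    match = re.search(r"Loop_(\d+)", tag)
    return int(match.group(1)) if match else None


def extract_evo_id(tag: str) -> int | None:
    match = re.search(r"evo_loop_(\d+)", tag)
    return int(match.group(1)) if match else None


def extract_stage(tag: str) -> str:
    # Checks run in order, so a tag containing both "coding" and
    # "feedback" resolves to "coding".
    if "direct_exp_gen" in tag:
        return "exp_gen"
    if "coding" in tag:
        return "coding"
    if "running" in tag:
        return "runner"
    if "feedback" in tag:
        return "feedback"
    return ""


# Hypothetical tags, used only to illustrate the parsing rules.
coding_tag = "Loop_3.coding.evo_loop_1.debug_llm"
runner_tag = "Loop_0.running.docker_run.Benchmark"
mixed_tag = "Loop_2.coding.evo_loop_0.evaluator_feedback.SomeEvaluator"

print(extract_loop_id(coding_tag), extract_evo_id(coding_tag), extract_stage(coding_tag))
print(extract_loop_id(runner_tag), extract_evo_id(runner_tag), extract_stage(runner_tag))
print(extract_stage(mixed_tag))
```

One consequence worth keeping in mind: because the stage is resolved by substring order, an evaluator feedback message logged under a coding tag lands in the coding stage, not the feedback stage, which appears to be what `load_ft_session` relies on when it aggregates evo success.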
diff --git a/rdagent/app/finetune/llm/ui/ft_summary.py b/rdagent/app/finetune/llm/ui/ft_summary.py
new file mode 100644
index 000000000..053619b7e
--- /dev/null
+++ b/rdagent/app/finetune/llm/ui/ft_summary.py
@@ -0,0 +1,580 @@
+"""
+FT Job Summary View
+Display summary table for all tasks in a job directory
+"""
+
+import pickle
+from pathlib import Path
+
+import pandas as pd
+import streamlit as st
+
+from rdagent.app.finetune.llm.ui.benchmarks import get_core_metric_score
+
+
+def is_valid_task(task_path: Path) -> bool:
+    """Check if directory is a valid FT task (has __session__ subdirectory)"""
+    return task_path.is_dir() and (task_path / "__session__").exists()
+
+
+def get_loop_dirs(task_path: Path) -> list[Path]:
+    """Get sorted list of Loop directories"""
+    loops = [d for d in task_path.iterdir() if d.is_dir() and d.name.startswith("Loop_")]
+    return sorted(loops, key=lambda d: int(d.name.split("_")[1]))
+
+
+def extract_benchmark_score(loop_path: Path, split: str = "") -> tuple[str, float, bool] | None:
+    """Extract benchmark score, metric name, and direction from loop directory.
+
+    Args:
+        loop_path: Path to loop directory
+        split: Filter by split type ("validation", "test", or "" for any)
+
+    Returns:
+        (metric_name, score, higher_is_better) or None
+        - metric_name includes "(average)" suffix if multiple datasets are averaged
+        - higher_is_better: True if higher values are better
+    """
+    for pkl_file in loop_path.rglob("**/benchmark_result*/**/*.pkl"):
+        try:
+            with open(pkl_file, "rb") as f:
+                content = pickle.load(f)
+            if isinstance(content, dict):
+                # Check split filter
+                content_split = content.get("split", "")
+                if split and content_split != split:
+                    continue
+
+                benchmark_name = content.get("benchmark_name", "")
+                accuracy_summary = content.get("accuracy_summary", {})
+                if isinstance(accuracy_summary, dict) and accuracy_summary:
+                    result = get_core_metric_score(benchmark_name, accuracy_summary)
+                    if result is not None:
+                        return result
+        except Exception:
+            pass
+    return None
+
+
+def extract_benchmark_scores(loop_path: Path) -> dict[str, tuple[str, float, bool] | None]:
+    """Extract both validation and test benchmark scores from loop directory.
+
+    Returns:
+        Dict with keys "validation" and "test", each containing
+        (metric_name, score, higher_is_better) or None
+    """
+    return {
+        "validation": extract_benchmark_score(loop_path, split="validation"),
+        "test": extract_benchmark_score(loop_path, split="test"),
+    }
+
+
+def extract_baseline_score(task_path: Path) -> tuple[str, float] | None:
+    """Extract baseline benchmark score from scenario object (legacy, validation only).
+
+    Returns:
+        (metric_name, score) or None
+    """
+    scenario_dir = task_path / "scenario"
+    if not scenario_dir.exists():
+        return None
+
+    for pkl_file in scenario_dir.rglob("*.pkl"):
+        try:
+            with open(pkl_file, "rb") as f:
+                scenario = pickle.load(f)
+            baseline_score = getattr(scenario, "baseline_benchmark_score", None)
+            if baseline_score and isinstance(baseline_score, dict):
+                benchmark_name = getattr(scenario, "target_benchmark", "")
+                accuracy_summary = baseline_score.get("accuracy_summary", {})
+                if isinstance(accuracy_summary, dict) and accuracy_summary:
+                    result = get_core_metric_score(benchmark_name, accuracy_summary)
+                    if result is not None:
+                        metric_name, score, _ = result
+                        return metric_name, score
+        except Exception:
+            pass
+    return None
+
+
+def extract_baseline_scores(task_path: Path) -> dict[str, tuple[str, float, bool] | None]:
+    """Extract both validation and test baseline benchmark scores from scenario.
+
+    Returns:
+        {"validation": (metric_name, score, higher_is_better) or None,
+         "test": (metric_name, score, higher_is_better) or None}
+    """
+    scenario_dir = task_path / "scenario"
+    if not scenario_dir.exists():
+        return {"validation": None, "test": None}
+
+    for pkl_file in scenario_dir.rglob("*.pkl"):
+        try:
+            with open(pkl_file, "rb") as f:
+                scenario = pickle.load(f)
+
+            benchmark_name = getattr(scenario, "target_benchmark", "")
+            result = {"validation": None, "test": None}
+
+            # Validation score
+            baseline_val = getattr(scenario, "baseline_benchmark_score", None)
+            if baseline_val and isinstance(baseline_val, dict):
+                accuracy_summary = baseline_val.get("accuracy_summary", {})
+                if isinstance(accuracy_summary, dict) and accuracy_summary:
+                    core = get_core_metric_score(benchmark_name, accuracy_summary)
+                    if core:
+                        result["validation"] = core
+
+            # Test score (new format only)
+            baseline_test = getattr(scenario, "baseline_benchmark_score_test", None)
+            if baseline_test and isinstance(baseline_test, dict):
+                accuracy_summary = baseline_test.get("accuracy_summary", {})
+                if isinstance(accuracy_summary, dict) and accuracy_summary:
+                    core = get_core_metric_score(benchmark_name, accuracy_summary)
+                    if core:
+                        result["test"] = core
+
+            return result
+        except Exception:
+            pass
+    return {"validation": None, "test": None}
+
+
+def get_loop_status(task_path: Path, loop_id: int) -> tuple[str, float | None, float | None, str | None, bool | None, bool]:
+    """
+    Get loop status, validation score, test score, metric name with direction arrow, feedback decision, and direction.
+    Returns: (status_str, val_score_or_none, test_score_or_none, metric_display_or_none, feedback_decision, higher_is_better)
+    Status: '-'=No loop dir, 'C'=Coding, 'R'=Running, 'OK'=Accepted (no score), 'X'=Rejected/Failed, '?'=Unknown, score_str=Success
+    metric_display: metric name with direction arrow (e.g., "accuracy ↑")
+    feedback_decision: True=accepted, False=rejected, None=no feedback
+    higher_is_better: True if higher values are better for this metric
+    """
+    loop_path = task_path / f"Loop_{loop_id}"
+    if not loop_path.exists():
+        return "-", None, None, None, None, True
+
+    # Check for benchmark results first (highest priority - means completed)
+    scores = extract_benchmark_scores(loop_path)
+    val_result = scores.get("validation")
+    test_result = scores.get("test")
+
+    # Fallback to old format (no split) if no validation/test found
+    if val_result is None and test_result is None:
+        legacy_result = extract_benchmark_score(loop_path, split="")
+        if legacy_result is not None:
+            val_result = legacy_result  # Treat legacy as validation
+
+    # Get feedback decision (used for both score coloring and fallback status)
+    feedback_decision = None
+    feedback_files = list(loop_path.rglob("**/feedback/**/*.pkl"))
+    for f in feedback_files:
+        try:
+            with open(f, "rb") as fp:
+                content = pickle.load(fp)
+            decision = getattr(content, "decision", None)
+            if decision is not None:
+                feedback_decision = decision
+                break
+        except Exception:
+            pass
+
+    if val_result is not None:
+        metric_name, val_score, higher_is_better = val_result
+        test_score = test_result[1] if test_result else None
+        arrow = "↑" if higher_is_better else "↓"
+        metric_display = f"{metric_name} {arrow}"
+        # Format: "val/test" or just "val" if no test
+        if test_score is not None:
+            status_str = f"{val_score:.1f}/{test_score:.1f}"
+        else:
+            status_str = f"{val_score:.1f}"
+        return status_str, val_score, test_score, metric_display, feedback_decision, higher_is_better
+
+    # Check feedback stage (no benchmark result, use feedback decision directly)
+    if feedback_decision is not None:
+        return ("OK" if feedback_decision else "X"), None, None, None, feedback_decision, True
+
+    # Check running stage
+    running_files = list(loop_path.rglob("**/running/**/*.pkl"))
+    if running_files:
+        return "R", None, None, None, None, True
+
+    # Check coding stage
+    coding_files = list(loop_path.rglob("**/coding/**/*.pkl"))
+    if coding_files:
+        return "C", None, None, None, None, True
+
+    # Has directory but no recognized files
+    return "?", None, None, None, None, True
+
+
+def get_max_loops(job_path: Path) -> int:
+    """Get number of loop columns needed across all tasks (highest Loop index + 1)"""
+    max_loops = 0
+    for task_dir in job_path.iterdir():
+        if is_valid_task(task_dir):
+            loop_ids = [int(d.name.split("_")[1]) for d in get_loop_dirs(task_dir)]
+            max_loops = max(max_loops, max(loop_ids, default=-1) + 1)
+    return max_loops
+
+
+def get_job_summary_df(job_path: Path) -> tuple[pd.DataFrame, pd.DataFrame]:
+    """Generate summary DataFrame and decision DataFrame for all tasks in job
+
+    Each loop column shows "val/test" format when both scores are available.
+    Best columns show the best validation and test scores separately.
+
+    Returns:
+        (df, decisions_df): df is display data, decisions_df has same structure
+        but values are True/False/None for feedback decision
+    """
+    if not job_path.exists():
+        return pd.DataFrame(), pd.DataFrame()
+
+    tasks = [d for d in sorted(job_path.iterdir(), reverse=True) if is_valid_task(d)]
+    if not tasks:
+        return pd.DataFrame(), pd.DataFrame()
+
+    max_loops = get_max_loops(job_path)
+    if max_loops == 0:
+        max_loops = 10  # Default display columns
+
+    data = []
+    decisions_data = []
+    for task_path in tasks:
+        row = {"Task": task_path.name}
+        decision_row = {"Task": task_path.name}
+        best_val_score = None
+        best_test_score = None
+        best_metric = None
+        best_higher_is_better = True  # Default to higher is better
+
+        # Extract baseline scores (validation and test) from scenario
+        baseline_scores = extract_baseline_scores(task_path)
+        val_baseline = baseline_scores.get("validation")
+        test_baseline = baseline_scores.get("test")
+        if val_baseline and test_baseline:
+            row["Baseline"] = f"{val_baseline[1]:.1f}/{test_baseline[1]:.1f}"
+        elif val_baseline:
+            row["Baseline"] = f"{val_baseline[1]:.1f}"
+        else:
+            row["Baseline"] = "-"
+        decision_row["Baseline"] = None
+
+        for i in range(max_loops):
+            status, val_score, test_score, metric_name, feedback_decision, higher_is_better = get_loop_status(task_path, i)
+            row[f"L{i}"] = status
+            decision_row[f"L{i}"] = feedback_decision
+            if val_score is not None:
+                # Use higher_is_better to determine if this score is better
+                if best_val_score is None:
+                    best_val_score = val_score
+                    best_higher_is_better = higher_is_better
+                    best_metric = metric_name
+                elif (higher_is_better and val_score > best_val_score) or \
+                     (not higher_is_better and val_score < best_val_score):
+                    best_val_score = val_score
+                    best_higher_is_better = higher_is_better
+                    best_metric = metric_name
+            if test_score is not None:
+                # Use same direction as validation score for consistency
+                if best_test_score is None:
+                    best_test_score = test_score
+                elif (best_higher_is_better and test_score > best_test_score) or \
+                     (not best_higher_is_better and test_score < best_test_score):
+                    best_test_score = test_score
+
+        # Show best validation and test scores
+        if best_val_score is not None and best_test_score is not None:
+            row["Best"] = f"{best_val_score:.1f}/{best_test_score:.1f}"
+        elif best_val_score is not None:
+            row["Best"] = f"{best_val_score:.1f}"
+        else:
+            row["Best"] = "-"
+        row["Metric"] = best_metric if best_metric else "-"
+        decision_row["Metric"] = None
+        decision_row["Best"] = None
+        data.append(row)
+        decisions_data.append(decision_row)
+
+    # Ensure column order: Task, Metric, Baseline, L0, L1, ..., Best
+    df = pd.DataFrame(data)
+    decisions_df = pd.DataFrame(decisions_data)
+    if not df.empty:
+        loop_cols = [c for c in df.columns if c.startswith("L")]
+        cols = ["Task", "Metric", "Baseline"] + sorted(loop_cols, key=lambda x: int(x[1:])) + ["Best"]
+        df = df[cols]
+        decisions_df = decisions_df[cols]
+    return df, decisions_df
+
+
+def style_status_cell(val: str, decision: bool | None = None) -> str:
+    """Style cell based on status value and feedback decision
+
+    Args:
+        val: The cell value
+        decision: True=accepted (green), False=rejected (red), None=no feedback (gray)
+    """
+    if val == "-":
+        return "color: #888"
+    if val == "C":
+        return "color: #f0ad4e; font-weight: bold"  # Orange for coding
+    if val == "R":
+        return "color: #5bc0de; font-weight: bold"  # Blue for running
+    if val == "X":
+        return "color: #d9534f; font-weight: bold"  # Red for failed
+    if val == "OK":
+        return "color: #5cb85c; font-weight: bold"  # Green for success
+    if val == "?":
+        return "color: #888"
+
+    # Check if it's a numeric score (with optional "/" separator)
+    is_numeric = False
+    try:
+        float(val)
+        is_numeric = True
+    except ValueError:
+        if "/" in val:
+            parts = val.split("/")
+            try:
+                float(parts[0])
+                is_numeric = True
+            except ValueError:
+                pass
+
+    if is_numeric:
+        # Use decision for coloring (use == instead of is for numpy.bool_ compatibility)
+        if decision == True:
+            return "color: #5cb85c; font-weight: bold"  # Green for accepted
+        elif decision == False:
+            return "color: #d9534f; font-weight: bold"  # Red for rejected
+        else:
+            return "color: #888"  # Gray for no feedback
+
+    return ""
+
+
+def style_df_with_decisions(df: pd.DataFrame, decisions_df: pd.DataFrame) -> pd.io.formats.style.Styler:
+    """Apply styling to dataframe based on decision data
+
+    Args:
+        df: Display dataframe
+        decisions_df: DataFrame with same shape, containing True/False/None values
+    """
+
+    def apply_styles(row_idx: int, col: str) -> str:
+        val = df.iloc[row_idx][col]
+        decision = decisions_df.iloc[row_idx][col] if col in decisions_df.columns else None
+        return style_status_cell(str(val), decision)
+
+    # Build style matrix
+    styles = pd.DataFrame("", index=df.index, columns=df.columns)
+    for row_idx in range(len(df)):
+        for col_idx, col in enumerate(df.columns):
+            styles.iat[row_idx, col_idx] = apply_styles(row_idx, col)  # avoid chained assignment, which may write to a copy
+
+    return df.style.apply(lambda _: styles, axis=None)
+
+
+def render_job_summary(job_path: Path, is_root: bool = False) -> None:
+    """Render job summary UI"""
+    title = "Standalone Tasks" if is_root else f"Job: {job_path.name}"
+    st.subheader(title)
+
+    df, decisions_df = get_job_summary_df(job_path)
+    if df.empty:
+        st.warning("No valid tasks found in this job directory")
+        return
+
+    # Display legend
+    st.markdown(
+        "**Legend:** "
+        "<span style='color:#f0ad4e'>C</span>=Coding, "
+        "<span style='color:#5bc0de'>R</span>=Running, "
+        "<span style='color:#5cb85c'>Score</span>=Accepted, "
+        "<span style='color:#d9534f'>Score/X</span>=Rejected/Failed, "
+        "<span style='color:#888'>Score</span>=No feedback",
+        unsafe_allow_html=True,
+    )
+
+    # Style and display dataframe
+    styled_df = style_df_with_decisions(df, decisions_df)
+    st.dataframe(styled_df, use_container_width=True, hide_index=True)
+
+    # Summary stats
+    col1, col2, col3 = st.columns(3)
+    with col1:
+        st.metric("Tasks", len(df))
+    with col2:
+        # Count tasks with any score
+        tasks_with_score = df["Best"].apply(lambda x: x != "-").sum()
+        st.metric("With Score", tasks_with_score)
+    with col3:
+        # Count tasks with at least one improved loop (decision=True)
+        loop_cols = [c for c in decisions_df.columns if c.startswith("L")]
+        tasks_improved = decisions_df[loop_cols].apply(lambda row: any(v is True for v in row), axis=1).sum()
+        st.metric("Improved", tasks_improved)
+
+    # Detailed scores table
+    render_task_detail_selector(job_path)
+
+
+def extract_full_benchmark(loop_path: Path, split: str = "") -> dict | None:
+    """Extract full accuracy_summary from loop directory.
+
+    Args:
+        loop_path: Path to loop directory
+        split: Filter by split type ("validation", "test", or "" for any)
+
+    Returns:
+        accuracy_summary dict {dataset: {metric: value, ...}, ...} or None
+    """
+    for pkl_file in loop_path.rglob("**/benchmark_result*/**/*.pkl"):
+        try:
+            with open(pkl_file, "rb") as f:
+                content = pickle.load(f)
+            if isinstance(content, dict):
+                # Check split filter
+                content_split = content.get("split", "")
+                if split and content_split != split:
+                    continue
+
+                accuracy_summary = content.get("accuracy_summary", {})
+                if isinstance(accuracy_summary, dict) and accuracy_summary:
+                    return accuracy_summary
+        except Exception:
+            pass
+    return None
+
+
+def extract_baseline_full_benchmark(task_path: Path, split: str = "validation") -> dict | None:
+    """Extract full accuracy_summary from baseline scenario.
+
+    Args:
+        task_path: Path to task directory
+        split: "validation" or "test"
+
+    Returns:
+        accuracy_summary dict or None
+    """
+    scenario_dir = task_path / "scenario"
+    if not scenario_dir.exists():
+        return None
+
+    for pkl_file in scenario_dir.rglob("*.pkl"):
+        try:
+            with open(pkl_file, "rb") as f:
+                scenario = pickle.load(f)
+
+            if split == "validation":
+                baseline = getattr(scenario, "baseline_benchmark_score", None)
+            else:
+                baseline = getattr(scenario, "baseline_benchmark_score_test", None)
+
+            if baseline and isinstance(baseline, dict):
+                accuracy_summary = baseline.get("accuracy_summary", {})
+                if isinstance(accuracy_summary, dict) and accuracy_summary:
+                    return accuracy_summary
+        except Exception:
+            pass
+    return None
+
+
+def get_task_full_benchmark_df(task_path: Path, split: str) -> pd.DataFrame:
+    """Generate full benchmark table for a single task and split.
+
+    Returns DataFrame with columns: Dataset, Metric, Baseline, Loop_0, Loop_1, ...
+    Each row is a dataset-metric combination.
+    """
+    # Collect all sources (Baseline + Loops)
+    sources = ["Baseline"]
+    loop_dirs = sorted(
+        [d for d in task_path.iterdir() if d.is_dir() and d.name.startswith("Loop_")],
+        key=lambda x: int(x.name.split("_")[1]),
+    )
+    sources.extend([d.name for d in loop_dirs])
+
+    # Collect all accuracy_summaries
+    all_summaries = {}
+
+    # Baseline
+    baseline_summary = extract_baseline_full_benchmark(task_path, split)
+    if baseline_summary:
+        all_summaries["Baseline"] = baseline_summary
+
+    # Loops
+    for loop_dir in loop_dirs:
+        loop_summary = extract_full_benchmark(loop_dir, split)
+        if loop_summary:
+            all_summaries[loop_dir.name] = loop_summary
+
+    if not all_summaries:
+        return pd.DataFrame()
+
+    # Collect all dataset-metric combinations
+    all_keys = set()
+    for summary in all_summaries.values():
+        for dataset, metrics in summary.items():
+            if isinstance(metrics, dict):
+                for metric in metrics.keys():
+                    all_keys.add((dataset, metric))
+
+    # Sort keys for consistent display
+    all_keys = sorted(all_keys)
+
+    # Build table data
+    data = []
+    for dataset, metric in all_keys:
+        row = {"Dataset": dataset, "Metric": metric}
+        for source in sources:
+            summary = all_summaries.get(source, {})
+            metrics_dict = summary.get(dataset, {})
+            value = metrics_dict.get(metric) if isinstance(metrics_dict, dict) else None
+            if value is not None:
+                row[source] = f"{value:.2f}" if isinstance(value, float) else str(value)
+            else:
+                row[source] = "-"
+        data.append(row)
+
+    df = pd.DataFrame(data)
+    # Ensure column order
+    if not df.empty:
+        cols = ["Dataset", "Metric"] + [s for s in sources if s in df.columns]
+        df = df[cols]
+    return df
+
+
+def render_task_detail_selector(job_path: Path) -> None:
+    """Render task selector dropdown and full benchmark tables."""
+    tasks = [d for d in sorted(job_path.iterdir(), reverse=True) if is_valid_task(d)]
+    if not tasks:
+        return
+
+    st.markdown("---")
+    st.subheader("Detailed Benchmark Scores")
+
+    # Task selector dropdown
+    task_names = [t.name for t in tasks]
+    selected_task = st.selectbox("Select Task", options=task_names, index=0, key="task_detail_selector")
+
+    if selected_task:
+        task_path = job_path / selected_task
+
+        # Display Validation and Test tables side by side
+        col1, col2 = st.columns(2)
+
+        with col1:
+            st.markdown("**Validation**")
+            df_val = get_task_full_benchmark_df(task_path, "validation")
+            if not df_val.empty:
+                st.dataframe(df_val, use_container_width=True, hide_index=True)
+            else:
+                st.info("No validation scores")
+
+        with col2:
+            st.markdown("**Test**")
+            df_test = get_task_full_benchmark_df(task_path, "test")
+            if not df_test.empty:
+                st.dataframe(df_test, use_container_width=True, hide_index=True)
+            else:
+                st.info("No test scores")
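Review note: the pivot performed by `get_task_full_benchmark_df` can be checked without Streamlit or pandas. The sketch below mirrors that logic in plain Python, under stated assumptions: `build_rows` is a hypothetical helper name and the sample scores are made up for illustration; neither is part of this PR.

```python
def build_rows(all_summaries: dict, sources: list) -> list:
    """Pivot {source: {dataset: {metric: value}}} into one row per (dataset, metric)."""
    # Union of every (dataset, metric) pair seen in any source, sorted for stable display.
    keys = sorted(
        {
            (dataset, metric)
            for summary in all_summaries.values()
            for dataset, metrics in summary.items()
            if isinstance(metrics, dict)
            for metric in metrics
        }
    )
    rows = []
    for dataset, metric in keys:
        row = {"Dataset": dataset, "Metric": metric}
        for source in sources:
            metrics = all_summaries.get(source, {}).get(dataset, {})
            value = metrics.get(metric) if isinstance(metrics, dict) else None
            if value is None:
                row[source] = "-"  # source has no score for this pair
            elif isinstance(value, float):
                row[source] = f"{value:.2f}"
            else:
                row[source] = str(value)
        rows.append(row)
    return rows


# Illustrative data only: Loop_1 exists as a source but produced no summary.
summaries = {
    "Baseline": {"gsm8k": {"accuracy": 0.5}},
    "Loop_0": {"gsm8k": {"accuracy": 0.75, "n": 8}},
}
rows = build_rows(summaries, ["Baseline", "Loop_0", "Loop_1"])
for row in rows:
    print(row)
```

A source without any summary still gets a column, filled with "-", because every row is populated for every entry in `sources`.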
diff --git a/rdagent/app/rl/conf.py b/rdagent/app/rl/conf.py
new file mode 100644
index 000000000..4721e9816
--- /dev/null
+++ b/rdagent/app/rl/conf.py
@@ -0,0 +1,41 @@
+from pathlib import Path
+
+from pydantic_settings import SettingsConfigDict
+
+from rdagent.core.conf import ExtendedBaseSettings
+
+
+class RLPostTrainingPropSetting(ExtendedBaseSettings):
+    """RL Post-training dedicated property settings.
+
+    Use RL_ env prefix for overrides.
+    """
+
+    model_config = SettingsConfigDict(env_prefix="RL_", protected_namespaces=())
+
+    # Main Components
+    scen: str = "rdagent.scenarios.rl.scen.scenario.RLPostTrainingScen"
+    hypothesis_gen: str = "rdagent.scenarios.rl.proposal.proposal.RLPostTrainingExpGen"
+    coder: str = "rdagent.components.coder.rl.RLCoSTEER"
+    runner: str = "rdagent.scenarios.rl.train.runner.RLPostTrainingRunner"
+    summarizer: str = "rdagent.scenarios.rl.dev.feedback.RLExperiment2Feedback"
+
+    # Resource paths (unified directory management, similar to SFT)
+    file_path: Path = Path.cwd() / "git_ignore_folder" / "rl_files"
+    """RL resource root directory. Contains datasets/ and models/ subdirectories.
+    Can be overridden via RL_FILE_PATH environment variable."""
+
+    # Core config
+    base_model: str | None = None
+    """Model name (e.g., 'Qwen2.5-Coder-0.5B-Instruct'). Docker path: /models/{base_model}"""
+
+    benchmark: str | None = None
+    """Benchmark/dataset name (e.g., 'gsm8k'). Docker path: /data/{benchmark}"""
+
+    # Benchmark evaluation
+    benchmark_timeout: int = 0
+    """Benchmark evaluation timeout in seconds. 0 means no timeout."""
+
+
+# Global setting instance
+RL_RD_SETTING = RLPostTrainingPropSetting()
diff --git a/rdagent/app/rl/loop.py b/rdagent/app/rl/loop.py
new file mode 100644
index 000000000..f0ce18f53
--- /dev/null
+++ b/rdagent/app/rl/loop.py
@@ -0,0 +1,64 @@
+"""
+RL Post-training Entry Point
+"""
+
+import asyncio
+from typing import Optional
+
+import typer
+from typing_extensions import Annotated
+
+from rdagent.app.rl.conf import RL_RD_SETTING
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.rl.loop import RLPostTrainingRDLoop
+
+
+def main(
+    base_model: Annotated[Optional[str], typer.Option("--base-model", "-m")] = None,
+    benchmark: Annotated[Optional[str], typer.Option("--benchmark", "-b")] = None,
+    step_n: Optional[int] = None,
+    loop_n: Optional[int] = None,
+    timeout: Optional[str] = None,
+):
+    """
+    RL post-training entry point
+
+    Parameters
+    ----------
+    base_model : str, optional
+        Model name (e.g., 'Qwen2.5-Coder-0.5B-Instruct')
+        Docker path: /models/{base_model}
+    benchmark : str, optional
+        Benchmark/dataset name (e.g., 'gsm8k')
+        Docker path: /data/{benchmark}
+    step_n : int, optional
+        Number of steps to run; if None, runs all steps per loop
+    loop_n : int, optional
+        Number of loops to run; if None, runs indefinitely
+    timeout : str, optional
+        Maximum duration for the entire process
+
+    Examples
+    --------
+    .. code-block:: bash
+
+        # RL resource root; contains datasets/ and models/ subdirectories (see RL_RD_SETTING.file_path)
+        export RL_FILE_PATH=/path/to/rl_files
+        python rdagent/app/rl/loop.py --base-model Qwen2.5-Coder-0.5B-Instruct --benchmark gsm8k
+    """
+    # Update config from CLI
+    if base_model:
+        RL_RD_SETTING.base_model = base_model
+    if benchmark:
+        RL_RD_SETTING.benchmark = benchmark
+
+    logger.info(f"Starting RL post-training: model={RL_RD_SETTING.base_model}, benchmark={RL_RD_SETTING.benchmark}")
+
+    # RDLoop automatically creates the Scenario from RL_RD_SETTING.scen
+    # Scenario.__init__() automatically runs the baseline evaluation
+    loop = RLPostTrainingRDLoop(RL_RD_SETTING)
+    asyncio.run(loop.run(step_n=step_n, loop_n=loop_n, all_duration=timeout))
+
+
+if __name__ == "__main__":
+    typer.run(main)
diff --git a/rdagent/app/rl/ui/__init__.py b/rdagent/app/rl/ui/__init__.py
new file mode 100644
index 000000000..9ef14f1e2
--- /dev/null
+++ b/rdagent/app/rl/ui/__init__.py
@@ -0,0 +1,2 @@
+"""RL Post-training UI"""
+
diff --git a/rdagent/app/rl/ui/app.py b/rdagent/app/rl/ui/app.py
new file mode 100644
index 000000000..4bc8aa399
--- /dev/null
+++ b/rdagent/app/rl/ui/app.py
@@ -0,0 +1,145 @@
+"""
+RL Post-training Timeline Viewer
+Hierarchical view: Session > Loop > Stage > Events
+
+Run:
+    streamlit run rdagent/app/rl/ui/app.py
+"""
+
+import os
+from pathlib import Path
+
+import streamlit as st
+from streamlit import session_state as state
+
+from rdagent.app.rl.ui.components import render_session, render_summary
+from rdagent.app.rl.ui.config import ALWAYS_VISIBLE_TYPES, OPTIONAL_TYPES
+from rdagent.app.rl.ui.data_loader import get_summary, get_valid_sessions, load_session
+from rdagent.app.rl.ui.rl_summary import render_job_summary
+
+DEFAULT_LOG_BASE = "log/"
+
+
+def get_job_options(base_path: Path) -> list[str]:
+    """Scan directory and return job options list."""
+    options = []
+    has_root_tasks = False
+    job_dirs = []
+
+    if not base_path.exists():
+        return options
+
+    for d in base_path.iterdir():
+        if not d.is_dir():
+            continue
+        if (d / "__session__").exists():
+            has_root_tasks = True
+        else:
+            try:
+                if any((sub / "__session__").exists() for sub in d.iterdir() if sub.is_dir()):
+                    job_dirs.append(d.name)
+            except PermissionError:
+                pass
+
+    job_dirs.sort(reverse=True)
+    options.extend(job_dirs)
+    if has_root_tasks:
+        options.append(". (Current)")
+
+    return options
+
+
+def main():
+    st.set_page_config(layout="wide", page_title="RL Timeline", page_icon="🤖")
+
+    with st.sidebar:
+        view_mode = st.radio("View Mode", ["Job Summary", "Single Task"], horizontal=True)
+        st.divider()
+
+        default_log = os.environ.get("RL_LOG_PATH", DEFAULT_LOG_BASE)
+        job_folder = default_log
+        selected_types = ALWAYS_VISIBLE_TYPES.copy()
+        is_root_job = False
+
+        if view_mode == "Job Summary":
+            st.header("Job")
+            base_folder = st.text_input("Base Folder", value=default_log, key="base_folder_input")
+            base_path = Path(base_folder)
+
+            job_options = get_job_options(base_path)
+            if job_options:
+                selected_job = st.selectbox("Select Job", job_options, key="job_select")
+                if selected_job.startswith("."):
+                    job_folder = base_folder
+                    is_root_job = True
+                else:
+                    job_folder = str(base_path / selected_job)
+                state.selected_job_folder = job_folder
+            else:
+                st.warning("No jobs found in this directory")
+                job_folder = base_folder
+
+            if st.button("Refresh", type="primary", key="refresh_job"):
+                st.rerun()
+        else:
+            st.header("Session")
+            default_path = getattr(state, "selected_job_folder", default_log)
+            log_folder = st.text_input("Log Folder", value=default_path)
+            log_path = Path(log_folder)
+
+            sessions = get_valid_sessions(log_path)
+            if not sessions:
+                st.warning("No valid sessions found")
+                return
+
+            selected_session = st.selectbox("Session", sessions)
+
+            if st.button("Load", type="primary") or "session" not in state:
+                with st.spinner("Loading..."):
+                    state.session = load_session(log_path / selected_session)
+                    state.session_name = selected_session
+
+            st.divider()
+
+            st.subheader("Show More")
+            selected_types = ALWAYS_VISIBLE_TYPES.copy()
+            for event_type, (label, default) in OPTIONAL_TYPES.items():
+                if st.toggle(label, value=default, key=f"toggle_{event_type}"):
+                    selected_types.append(event_type)
+
+            st.divider()
+
+            if "session" in state:
+                summary = get_summary(state.session)
+                st.subheader("Summary")
+                st.metric("Loops", summary.get("loop_count", 0))
+                st.metric("LLM Calls", summary.get("llm_call_count", 0))
+                success = summary.get("docker_success", 0)
+                fail = summary.get("docker_fail", 0)
+                st.metric("Docker", f"{success}✓ / {fail}✗")
+
+    if view_mode == "Job Summary":
+        st.title("📊 RL Job Summary")
+        job_path = Path(job_folder)
+        if job_path.exists():
+            render_job_summary(job_path, is_root=is_root_job)
+        else:
+            st.warning(f"Job folder not found: {job_folder}")
+        return
+
+    st.title("🤖 RL Timeline Viewer")
+
+    if "session" not in state:
+        st.info("Select a session and click **Load** to view")
+        return
+
+    session = state.session
+    summary = get_summary(session)
+    render_summary(summary)
+    st.divider()
+    render_session(session, selected_types)
+
+
+if __name__ == "__main__":
+    main()
+
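Review note: `get_job_options` distinguishes two log layouts. A directory that directly contains `__session__` is treated as a root-level task, while a directory whose subdirectories contain `__session__` is treated as a job. The sketch below reproduces that classification against a temporary tree. The helper name `classify_log_root` and the directory names are hypothetical, and creating `__session__` as a directory is an assumption (the check only uses `exists()`).

```python
import tempfile
from pathlib import Path


def classify_log_root(base: Path):
    """Return (job directory names sorted newest-first by name, has_root_tasks)."""
    job_dirs, has_root_tasks = [], False
    for d in base.iterdir():
        if not d.is_dir():
            continue
        if (d / "__session__").exists():
            # The base folder itself holds task folders.
            has_root_tasks = True
        elif any((sub / "__session__").exists() for sub in d.iterdir() if sub.is_dir()):
            # d is a job folder that groups several task folders.
            job_dirs.append(d.name)
    return sorted(job_dirs, reverse=True), has_root_tasks


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "job_a" / "task_1" / "__session__").mkdir(parents=True)
    (root / "job_b" / "task_1" / "__session__").mkdir(parents=True)
    (root / "task_x" / "__session__").mkdir(parents=True)
    (root / "notes").mkdir()  # no sessions anywhere below: ignored
    jobs, has_root = classify_log_root(root)

print(jobs, has_root)
```

With this layout the sidebar would list the two job folders first and then the `. (Current)` entry, since the base folder also holds a task of its own.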
diff --git a/rdagent/app/rl/ui/components.py b/rdagent/app/rl/ui/components.py
new file mode 100644
index 000000000..099f9367a
--- /dev/null
+++ b/rdagent/app/rl/ui/components.py
@@ -0,0 +1,316 @@
+"""
+RL UI Components - Event Renderers
+Simplified version without EvoLoop
+"""
+
+from typing import Any
+
+import streamlit as st
+
+from rdagent.app.rl.ui.config import ICONS
+from rdagent.app.rl.ui.data_loader import Event, Loop, Session
+
+
+def format_duration(seconds: float | None) -> str:
+    if seconds is None:
+        return ""
+    if seconds < 60:
+        return f"{seconds:.1f}s"
+    minutes = int(seconds // 60)
+    secs = seconds % 60
+    return f"{minutes}m {secs:.0f}s"
+
+
+def render_session(session: Session, show_types: list[str]) -> None:
+    """Render full session"""
+    if session.init_events:
+        filtered = [e for e in session.init_events if e.type in show_types]
+        if filtered:
+            with st.expander("🚀 **Initialization**", expanded=False):
+                for event in filtered:
+                    render_event(event)
+
+    for loop_id in sorted(session.loops.keys()):
+        loop = session.loops[loop_id]
+        render_loop(loop, show_types)
+
+
+def render_loop(loop: Loop, show_types: list[str]) -> None:
+    """Render a single loop"""
+    # Get status indicators
+    docker_success = None
+    feedback_decision = None
+
+    for event in loop.running:
+        if event.type == "docker_exec" and event.success is not None:
+            docker_success = event.success
+
+    for event in loop.feedback:
+        if event.type == "feedback" and event.success is not None:
+            feedback_decision = event.success
+
+    # Build title
+    parts = []
+    if docker_success is not None:
+        parts.append("🐳✓" if docker_success else "🐳✗")
+    if feedback_decision is not None:
+        parts.append("✅" if feedback_decision else "❌")
+
+    result_str = " ".join(parts) if parts else ""
+
+    with st.expander(f"🔄 **Loop {loop.loop_id}** {result_str}", expanded=False):
+        # Proposal
+        if loop.proposal:
+            filtered = [e for e in loop.proposal if e.type in show_types]
+            if filtered:
+                st.markdown("#### 💡 Proposal")
+                for event in filtered:
+                    render_event(event)
+
+        # Coding
+        if loop.coding:
+            filtered = [e for e in loop.coding if e.type in show_types]
+            if filtered:
+                st.markdown("#### 💻 Coding")
+                for event in filtered:
+                    render_event(event)
+
+        # Running
+        if loop.running:
+            filtered = [e for e in loop.running if e.type in show_types]
+            if filtered:
+                st.markdown("#### 🏃 Running")
+                for event in filtered:
+                    render_event(event)
+
+        # Feedback
+        if loop.feedback:
+            filtered = [e for e in loop.feedback if e.type in show_types]
+            if filtered:
+                st.markdown("#### 📊 Feedback")
+                for event in filtered:
+                    render_event(event)
+
+
+def render_event(event: Event) -> None:
+    """Render a single event"""
+    icon = ICONS.get(event.type, "📌")
+    duration_str = f" ({format_duration(event.duration)})" if event.duration else ""
+
+    status = ""
+    if event.success is True:
+        status = "🟢 "
+    elif event.success is False:
+        status = "🔴 "
+
+    title = f"{event.time_str} {icon} {status}{event.title}{duration_str}"
+
+    renderers = {
+        "scenario": render_scenario,
+        "llm_call": render_llm_call,
+        "template": render_template,
+        "experiment": render_experiment,
+        "code": render_code,
+        "docker_exec": render_docker_exec,
+        "feedback": render_feedback,
+        "token": render_token,
+        "time": render_time_info,
+        "settings": render_settings,
+        "hypothesis": render_hypothesis,
+    }
+
+    renderer = renderers.get(event.type, render_generic)
+    with st.expander(title, expanded=False):
+        renderer(event.content)
+
+
+def render_scenario(content: Any) -> None:
+    if hasattr(content, "base_model"):
+        st.markdown(f"**Base Model:** `{content.base_model}`")
+    if hasattr(content, "benchmark"):
+        st.markdown(f"**Benchmark:** `{content.benchmark}`")
+    render_generic(content)
+
+
+def render_hypothesis(content: Any) -> None:
+    if hasattr(content, "hypothesis") and content.hypothesis:
+        st.markdown("**Hypothesis:**")
+        st.markdown(content.hypothesis)
+    if hasattr(content, "reason") and content.reason:
+        with st.expander("Reason", expanded=False):
+            st.markdown(content.reason)
+
+
+def render_settings(content: Any) -> None:
+    if isinstance(content, dict):
+        st.json(content)
+    else:
+        st.code(str(content), wrap_lines=True)
+
+
+def render_llm_call(content: Any) -> None:
+    if not isinstance(content, dict):
+        st.json(content) if content else st.info("No content")
+        return
+
+    if content.get("start") and content.get("end"):
+        duration = (content["end"] - content["start"]).total_seconds()
+        st.caption(f"Duration: {format_duration(duration)}")
+
+    system = content.get("system", "")
+    if system:
+        with st.expander("System Prompt", expanded=False):
+            st.code(system, language="text", line_numbers=True, wrap_lines=True)
+
+    user = content.get("user", "")
+    if user:
+        with st.expander("User Prompt", expanded=False):
+            st.code(user, language="text", line_numbers=True, wrap_lines=True)
+
+    resp = content.get("resp", "")
+    if resp:
+        st.markdown("**Response:**")
+        if resp.strip().startswith("{") or resp.strip().startswith("["):
+            st.code(resp, language="json", line_numbers=True, wrap_lines=True)
+        elif resp.strip().startswith("```"):
+            st.markdown(resp)
+        else:
+            st.code(resp, language="text", line_numbers=True, wrap_lines=True)
+
+
+def render_template(content: Any) -> None:
+    if not isinstance(content, dict):
+        st.json(content) if content else st.info("No content")
+        return
+
+    uri = content.get("uri", "")
+    st.caption(f"URI: `{uri}`")
+
+    context = content.get("context", {})
+    if context:
+        with st.expander("Context Variables", expanded=False):
+            st.json(context)
+
+    rendered = content.get("rendered", "")
+    if rendered:
+        with st.expander("Rendered", expanded=True):
+            st.code(rendered, language="text", line_numbers=True, wrap_lines=True)
+
+
+def render_experiment(content: Any) -> None:
+    if isinstance(content, list):
+        for i, task in enumerate(content):
+            if len(content) > 1:
+                st.markdown(f"**Task {i}**")
+            if hasattr(task, "description") and task.description:
+                st.markdown(task.description)
+    else:
+        render_generic(content)
+
+
+def render_code(content: Any) -> None:
+    if isinstance(content, list):
+        for ws in content:
+            if hasattr(ws, "file_dict") and ws.file_dict:
+                for filename, code in ws.file_dict.items():
+                    lang = "yaml" if filename.endswith((".yaml", ".yml")) else "python"
+                    with st.expander(filename, expanded=False):
+                        st.code(code, language=lang, line_numbers=True, wrap_lines=True)
+    elif hasattr(content, "file_dict") and content.file_dict:
+        for filename, code in content.file_dict.items():
+            lang = "yaml" if filename.endswith((".yaml", ".yml")) else "python"
+            with st.expander(filename, expanded=False):
+                st.code(code, language=lang, line_numbers=True, wrap_lines=True)
+    else:
+        render_generic(content)
+
+
+def render_docker_exec(content: Any) -> None:
+    if isinstance(content, dict):
+        exit_code = content.get("exit_code")
+        if exit_code is not None:
+            if exit_code == 0:
+                st.success(f"Exit code: {exit_code}")
+            else:
+                st.error(f"Exit code: {exit_code}")
+
+        stdout = content.get("stdout", "")
+        if stdout:
+            with st.expander("Output", expanded=True):
+                st.code(stdout, language="text", line_numbers=True, wrap_lines=True)
+    else:
+        render_generic(content)
+
+
+def render_feedback(content: Any) -> None:
+    # Handle benchmark result (dict)
+    if isinstance(content, dict):
+        if "accuracy" in content or "accuracy_summary" in content:
+            st.markdown("**Benchmark Result:**")
+            st.json(content)
+        else:
+            st.json(content)
+        return
+
+    # Handle HypothesisFeedback object
+    col1, col2 = st.columns(2)
+    with col1:
+        decision = getattr(content, "decision", None)
+        if decision is not None:
+            st.metric("Decision", "Accept" if decision else "Reject")
+
+    reason = getattr(content, "reason", None)
+    if reason:
+        with st.expander("Reason", expanded=True):
+            st.code(reason, language="text", line_numbers=True, wrap_lines=True)
+
+    code_change = getattr(content, "code_change_summary", None)
+    if code_change:
+        with st.expander("Code Change Summary", expanded=False):
+            st.markdown(code_change)
+
+
+def render_token(content: Any) -> None:
+    if isinstance(content, dict):
+        col1, col2, col3 = st.columns(3)
+        with col1:
+            st.metric("Prompt", content.get("prompt_tokens", 0))
+        with col2:
+            st.metric("Completion", content.get("completion_tokens", 0))
+        with col3:
+            st.metric("Total", content.get("total_tokens", 0))
+    else:
+        render_generic(content)
+
+
+def render_time_info(content: Any) -> None:
+    if isinstance(content, dict):
+        for k, v in content.items():
+            st.metric(k, f"{v:.1f}s" if isinstance(v, (int, float)) else str(v))
+    else:
+        render_generic(content)
+
+
+def render_generic(content: Any) -> None:
+    if hasattr(content, "__dict__"):
+        st.json(vars(content))
+    elif content:
+        st.json(content)
+    else:
+        st.info("No content")
+
+
+def render_summary(summary: dict) -> None:
+    col1, col2, col3, col4 = st.columns(4)
+    with col1:
+        st.metric("Loops", summary.get("loop_count", 0))
+    with col2:
+        st.metric("LLM Calls", summary.get("llm_call_count", 0))
+    with col3:
+        llm_time = summary.get("llm_total_time", 0)
+        st.metric("LLM Time", format_duration(llm_time))
+    with col4:
+        success = summary.get("docker_success", 0)
+        fail = summary.get("docker_fail", 0)
+        st.metric("Docker", f"{success}✓ / {fail}✗")
+
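Review note: `format_duration` formats the seconds remainder with `:.0f`, so a value such as 119.6 renders as "1m 60s" (the remainder 59.6 rounds up to 60). One possible alternative, sketched here under the assumption that whole-second precision is acceptable above one minute, rounds the total before splitting it. The name `format_duration_rounded` is hypothetical and not part of this PR.

```python
from __future__ import annotations


def format_duration_rounded(seconds: float | None) -> str:
    """Like format_duration, but rounds the total first so seconds never show as 60."""
    if seconds is None:
        return ""
    if seconds < 60:
        return f"{seconds:.1f}s"
    # Round once, then split, so 119.6 becomes 120 -> 2m 0s instead of 1m 60s.
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


print(format_duration_rounded(119.6))
print(format_duration_rounded(75))
print(format_duration_rounded(12.34))
```

The sub-minute branch is left unchanged, so values just below 60 (for example 59.96) still render as "60.0s".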
diff --git a/rdagent/app/rl/ui/config.py b/rdagent/app/rl/ui/config.py
new file mode 100644
index 000000000..e0ac3180e
--- /dev/null
+++ b/rdagent/app/rl/ui/config.py
@@ -0,0 +1,55 @@
+"""
+RL UI Configuration Constants
+"""
+
+from typing import Literal
+
+# Event type definition
+EventType = Literal[
+    "scenario",
+    "llm_call",
+    "template",
+    "experiment",
+    "code",
+    "docker_exec",
+    "feedback",
+    "token",
+    "time",
+    "settings",
+    "hypothesis",
+]
+
+# Event type icons
+ICONS = {
+    "scenario": "🎯",
+    "llm_call": "💬",
+    "template": "📋",
+    "experiment": "🧪",
+    "code": "📄",
+    "docker_exec": "🐳",
+    "feedback": "📊",
+    "token": "🔢",
+    "time": "⏱️",
+    "settings": "⚙️",
+    "hypothesis": "💡",
+}
+
+# Always visible event types
+ALWAYS_VISIBLE_TYPES = [
+    "scenario",
+    "hypothesis",
+    "llm_call",
+    "experiment",
+    "code",
+    "docker_exec",
+    "feedback",
+]
+
+# Optional event types with toggle config (label, default_enabled)
+OPTIONAL_TYPES = {
+    "template": ("📋 Template", False),
+    "token": ("🔢 Token", False),
+    "time": ("⏱️ Time", False),
+    "settings": ("⚙️ Settings", False),
+}
+
diff --git a/rdagent/app/rl/ui/data_loader.py b/rdagent/app/rl/ui/data_loader.py
new file mode 100644
index 000000000..3ab7ae1fa
--- /dev/null
+++ b/rdagent/app/rl/ui/data_loader.py
@@ -0,0 +1,310 @@
+"""
+RL UI Data Loader
+Load pkl logs and convert to hierarchical timeline structure
+Simplified version: no EvoLoop (RL doesn't have evolution loops)
+"""
+
+import pickle
+import re
+from dataclasses import dataclass, field
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+
+import streamlit as st
+
+from rdagent.app.rl.ui.config import EventType
+from rdagent.log.storage import FileStorage
+
+
+@dataclass
+class Event:
+    """Timeline event"""
+
+    type: EventType
+    timestamp: datetime
+    tag: str
+    title: str
+    content: Any
+    loop_id: int | None = None
+    stage: str = ""
+    duration: float | None = None
+    success: bool | None = None
+
+    @property
+    def time_str(self) -> str:
+        return self.timestamp.strftime("%H:%M:%S")
+
+
+@dataclass
+class Loop:
+    """Main loop containing stages (no EvoLoop for RL)"""
+
+    loop_id: int
+    proposal: list[Event] = field(default_factory=list)  # hypothesis generation
+    coding: list[Event] = field(default_factory=list)  # code generation
+    running: list[Event] = field(default_factory=list)  # docker training + benchmark
+    feedback: list[Event] = field(default_factory=list)  # feedback
+
+
+@dataclass
+class Session:
+    """Session containing init events and loops"""
+
+    init_events: list[Event] = field(default_factory=list)
+    loops: dict[int, Loop] = field(default_factory=dict)
+
+
+def extract_loop_id(tag: str) -> int | None:
+    match = re.search(r"Loop_(\d+)", tag)
+    return int(match.group(1)) if match else None
+
+
+def extract_stage(tag: str) -> str:
+    if "proposal" in tag or "direct_exp_gen" in tag:
+        return "proposal"
+    if "coding" in tag:
+        return "coding"
+    if "running" in tag:
+        return "running"
+    if "feedback" in tag:
+        return "feedback"
+    return ""
+
+
+def get_valid_sessions(log_folder: Path) -> list[str]:
+    if not log_folder.exists():
+        return []
+    sessions = []
+    for d in log_folder.iterdir():
+        if d.is_dir() and d.joinpath("__session__").exists():
+            sessions.append(d.name)
+    return sorted(sessions, reverse=True)
+
+
+def parse_event(tag: str, content: Any, timestamp: datetime) -> Event | None:
+    loop_id = extract_loop_id(tag)
+    stage = extract_stage(tag)
+
+    # Scenario
+    if tag == "scenario":
+        return Event(type="scenario", timestamp=timestamp, tag=tag, title="Scenario", content=content)
+
+    # Settings
+    if "SETTINGS" in tag:
+        name = tag.replace("_SETTINGS", "").replace("SETTINGS", "")
+        return Event(type="settings", timestamp=timestamp, tag=tag, title=f"Settings: {name}", content=content)
+
+    # Hypothesis
+    if "hypothesis" in tag:
+        return Event(
+            type="hypothesis",
+            timestamp=timestamp,
+            tag=tag,
+            title="Hypothesis",
+            content=content,
+            loop_id=loop_id,
+            stage="proposal",
+        )
+
+    # LLM Call
+    if "debug_llm" in tag:
+        if isinstance(content, dict) and ("user" in content or "system" in content):
+            duration = None
+            if content.get("start") and content.get("end"):
+                duration = (content["end"] - content["start"]).total_seconds()
+            return Event(
+                type="llm_call",
+                timestamp=timestamp,
+                tag=tag,
+                title="LLM Call",
+                content=content,
+                loop_id=loop_id,
+                stage=stage,
+                duration=duration,
+            )
+
+    # Template
+    if "debug_tpl" in tag:
+        if isinstance(content, dict) and "uri" in content:
+            uri = content.get("uri", "")
+            tpl_name = uri.split(":")[-1] if ":" in uri else uri
+            return Event(
+                type="template",
+                timestamp=timestamp,
+                tag=tag,
+                title=f"Template: {tpl_name}",
+                content=content,
+                loop_id=loop_id,
+                stage=stage,
+            )
+
+    # Experiment/Coder result
+    if "coder result" in tag or "experiment generation" in tag:
+        return Event(
+            type="experiment",
+            timestamp=timestamp,
+            tag=tag,
+            title="Experiment",
+            content=content,
+            loop_id=loop_id,
+            stage=stage or "coding",
+        )
+
+    # Code
+    if "code" in tag.lower():  # also covers "evolving code" tags
+        return Event(
+            type="code",
+            timestamp=timestamp,
+            tag=tag,
+            title="Code",
+            content=content,
+            loop_id=loop_id,
+            stage=stage or "coding",
+        )
+
+    # Docker run
+    if "docker_run" in tag:
+        exit_code = content.get("exit_code") if isinstance(content, dict) else None
+        success = exit_code == 0 if exit_code is not None else None
+        return Event(
+            type="docker_exec",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Docker Run {'✓' if success else '✗' if success is False else ''}",
+            content=content,
+            loop_id=loop_id,
+            stage="running",
+            success=success,
+        )
+
+    # Benchmark result
+    if "benchmark" in tag.lower():
+        return Event(
+            type="feedback",
+            timestamp=timestamp,
+            tag=tag,
+            title="Benchmark Result",
+            content=content,
+            loop_id=loop_id,
+            stage="running",
+        )
+
+    # Feedback
+    if "feedback" in tag:
+        decision = getattr(content, "decision", None)
+        return Event(
+            type="feedback",
+            timestamp=timestamp,
+            tag=tag,
+            title=f"Feedback: {'Accept' if decision else 'Reject' if decision is False else 'Unknown'}",
+            content=content,
+            loop_id=loop_id,
+            stage="feedback",
+            success=decision,
+        )
+
+    # Token cost
+    if "token_cost" in tag:
+        if isinstance(content, dict):
+            total = content.get("total_tokens", 0)
+            return Event(
+                type="token",
+                timestamp=timestamp,
+                tag=tag,
+                title=f"Token: {total}",
+                content=content,
+                loop_id=loop_id,
+                stage=stage,
+            )
+
+    # Time info
+    if "time_info" in tag:
+        return Event(
+            type="time",
+            timestamp=timestamp,
+            tag=tag,
+            title="Time Info",
+            content=content,
+            loop_id=loop_id,
+            stage=stage,
+        )
+
+    return None
+
+
+@st.cache_data(ttl=300, hash_funcs={Path: str})
+def load_session(log_path: Path) -> Session:
+    """Load events into hierarchical session structure"""
+    session = Session()
+    
+    # Walk the pkl files manually, skipping any that cannot be loaded
+    events = []
+    pkl_files = sorted(log_path.rglob("*.pkl"))
+    for pkl_file in pkl_files:
+        if pkl_file.name == "debug_llm.pkl":
+            continue
+        try:
+            with open(pkl_file, "rb") as f:
+                content = pickle.load(f)
+            timestamp = datetime.strptime(pkl_file.stem, "%Y-%m-%d_%H-%M-%S-%f")
+            # Derive the tag from the path: Loop_5/running/debug_tpl/2957404/xxx.pkl -> Loop_5.running.debug_tpl
+            tag = ".".join(pkl_file.relative_to(log_path).as_posix().replace("/", ".").split(".")[:-3])
+            event = parse_event(tag, content, timestamp)
+            if event:
+                events.append(event)
+        except (ModuleNotFoundError, ImportError, pickle.UnpicklingError, ValueError):
+            # Skip files that cannot be loaded (different Python version or malformed content)
+            continue
+
+    events.sort(key=lambda e: e.timestamp)
+
+    for event in events:
+        if event.loop_id is None:
+            session.init_events.append(event)
+            continue
+
+        if event.loop_id not in session.loops:
+            session.loops[event.loop_id] = Loop(loop_id=event.loop_id)
+        loop = session.loops[event.loop_id]
+
+        if event.stage == "proposal":
+            loop.proposal.append(event)
+        elif event.stage == "coding":
+            loop.coding.append(event)
+        elif event.stage == "running":
+            loop.running.append(event)
+        elif event.stage == "feedback":
+            loop.feedback.append(event)
+        else:
+            loop.proposal.append(event)
+
+    return session
+
+
+def get_summary(session: Session) -> dict:
+    """Get summary statistics"""
+    llm_calls = []
+    docker_execs = []
+
+    for e in session.init_events:
+        if e.type == "llm_call":
+            llm_calls.append(e)
+        elif e.type == "docker_exec":
+            docker_execs.append(e)
+
+    for loop in session.loops.values():
+        for e in loop.proposal + loop.coding + loop.running + loop.feedback:
+            if e.type == "llm_call":
+                llm_calls.append(e)
+            elif e.type == "docker_exec":
+                docker_execs.append(e)
+
+    return {
+        "loop_count": len(session.loops),
+        "llm_call_count": len(llm_calls),
+        "llm_total_time": sum(e.duration or 0 for e in llm_calls),
+        "docker_success": sum(1 for e in docker_execs if e.success is True),
+        "docker_fail": sum(1 for e in docker_execs if e.success is False),
+    }
+
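Note on the tag derivation in `load_session`: the tag is built by joining the path relative to the log root with dots and dropping the last three dot-separated parts (the numeric directory, the file stem, and the extension). A minimal standalone sketch of that step follows; `derive_tag` is a hypothetical helper name, and the example file name is made up to match the `%Y-%m-%d_%H-%M-%S-%f` stem format.

```python
from pathlib import PurePosixPath


def derive_tag(relative_pkl_path: str) -> str:
    """Join path parts with dots and drop the last three (numeric dir, stem, extension)."""
    dotted = PurePosixPath(relative_pkl_path).as_posix().replace("/", ".")
    return ".".join(dotted.split(".")[:-3])


# Hypothetical log file path, relative to the log root.
print(derive_tag("Loop_5/running/debug_tpl/2957404/2025-01-01_00-00-00-000000.pkl"))
```

Because the split is on every dot, this only works as long as neither the directory names nor the file stem contain a dot themselves.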
diff --git a/rdagent/app/rl/ui/rl_summary.py b/rdagent/app/rl/ui/rl_summary.py
new file mode 100644
index 000000000..8beb4966b
--- /dev/null
+++ b/rdagent/app/rl/ui/rl_summary.py
@@ -0,0 +1,183 @@
+"""
+RL Job Summary View
+Display summary table for all tasks in a job directory
+"""
+
+import pickle
+from pathlib import Path
+
+import pandas as pd
+import streamlit as st
+
+
+def is_valid_task(task_path: Path) -> bool:
+    """Check if directory is a valid RL task (has __session__ subdirectory)"""
+    return task_path.is_dir() and (task_path / "__session__").exists()
+
+
+def get_loop_dirs(task_path: Path) -> list[Path]:
+    """Get sorted list of Loop directories"""
+    loops = [d for d in task_path.iterdir() if d.is_dir() and d.name.startswith("Loop_")]
+    return sorted(loops, key=lambda d: int(d.name.split("_")[1]))
+
+
+def get_loop_status(task_path: Path, loop_id: int) -> tuple[str, bool | None]:
+    """
+    Get loop status and feedback decision.
+    Returns: (status_str, feedback_decision)
+    Status: 'C'=Coding, 'R'=Running, 'X'=Failed, 'OK'=Success, '-'=loop directory missing, '?'=no stage files found
+    """
+    loop_path = task_path / f"Loop_{loop_id}"
+    if not loop_path.exists():
+        return "-", None
+
+    # Check for feedback
+    feedback_decision = None
+    feedback_files = list(loop_path.rglob("**/feedback/**/*.pkl"))
+    for f in feedback_files:
+        try:
+            with open(f, "rb") as fp:
+                content = pickle.load(fp)
+            decision = getattr(content, "decision", None)
+            if decision is not None:
+                feedback_decision = decision
+                break
+        except Exception:
+            pass
+
+    if feedback_decision is not None:
+        return ("OK" if feedback_decision else "X"), feedback_decision
+
+    # Check running stage
+    running_files = list(loop_path.rglob("**/running/**/*.pkl"))
+    if running_files:
+        return "R", None
+
+    # Check coding stage
+    coding_files = list(loop_path.rglob("**/coding/**/*.pkl"))
+    if coding_files:
+        return "C", None
+
+    return "?", None
+
+
+def get_max_loops(job_path: Path) -> int:
+    """Get maximum number of loops across all tasks"""
+    max_loops = 0
+    for task_dir in job_path.iterdir():
+        if is_valid_task(task_dir):
+            loops = get_loop_dirs(task_dir)
+            max_loops = max(max_loops, len(loops))
+    return max_loops
+
+
+def get_job_summary_df(job_path: Path) -> tuple[pd.DataFrame, pd.DataFrame]:
+    """Generate summary DataFrame for all tasks in job"""
+    if not job_path.exists():
+        return pd.DataFrame(), pd.DataFrame()
+
+    tasks = [d for d in sorted(job_path.iterdir(), reverse=True) if is_valid_task(d)]
+    if not tasks:
+        return pd.DataFrame(), pd.DataFrame()
+
+    max_loops = get_max_loops(job_path)
+    if max_loops == 0:
+        max_loops = 10
+
+    data = []
+    decisions_data = []
+    for task_path in tasks:
+        row = {"Task": task_path.name}
+        decision_row = {"Task": task_path.name}
+        success_count = 0
+        fail_count = 0
+
+        for i in range(max_loops):
+            status, feedback_decision = get_loop_status(task_path, i)
+            row[f"L{i}"] = status
+            decision_row[f"L{i}"] = feedback_decision
+            if feedback_decision is True:
+                success_count += 1
+            elif feedback_decision is False:
+                fail_count += 1
+
+        row["Summary"] = f"{success_count}✓/{fail_count}✗"
+        decision_row["Summary"] = None
+        data.append(row)
+        decisions_data.append(decision_row)
+
+    df = pd.DataFrame(data)
+    decisions_df = pd.DataFrame(decisions_data)
+    if not df.empty:
+        loop_cols = [c for c in df.columns if c.startswith("L")]
+        cols = ["Task"] + sorted(loop_cols, key=lambda x: int(x[1:])) + ["Summary"]
+        df = df[cols]
+        decisions_df = decisions_df[cols]
+    return df, decisions_df
+
+
+def style_status_cell(val: str, decision: bool | None = None) -> str:
+    """Style cell based on status value"""
+    if val == "-":
+        return "color: #888"
+    if val == "C":
+        return "color: #f0ad4e; font-weight: bold"
+    if val == "R":
+        return "color: #5bc0de; font-weight: bold"
+    if val == "X":
+        return "color: #d9534f; font-weight: bold"
+    if val == "OK":
+        return "color: #5cb85c; font-weight: bold"
+    if val == "?":
+        return "color: #888"
+    return ""
+
+
+def style_df_with_decisions(df: pd.DataFrame, decisions_df: pd.DataFrame):
+    """Apply styling to dataframe"""
+    def apply_styles(row_idx: int, col: str) -> str:
+        val = df.iloc[row_idx][col]
+        decision = decisions_df.iloc[row_idx][col] if col in decisions_df.columns else None
+        return style_status_cell(str(val), decision)
+
+    styles = pd.DataFrame("", index=df.index, columns=df.columns)
+    for row_idx in range(len(df)):
+        for col in df.columns:
+            styles.iat[row_idx, styles.columns.get_loc(col)] = apply_styles(row_idx, col)
+
+    return df.style.apply(lambda _: styles, axis=None)
+
+
+def render_job_summary(job_path: Path, is_root: bool = False) -> None:
+    """Render job summary UI"""
+    title = "Standalone Tasks" if is_root else f"Job: {job_path.name}"
+    st.subheader(title)
+
+    df, decisions_df = get_job_summary_df(job_path)
+    if df.empty:
+        st.warning("No valid tasks found in this job directory")
+        return
+
+    st.markdown(
+        "**Legend:** "
+        "<span style='color:#f0ad4e'>C</span>=Coding, "
+        "<span style='color:#5bc0de'>R</span>=Running, "
+        "<span style='color:#5cb85c'>OK</span>=Success, "
+        "<span style='color:#d9534f'>X</span>=Failed",
+        unsafe_allow_html=True,
+    )
+
+    styled_df = style_df_with_decisions(df, decisions_df)
+    st.dataframe(styled_df, use_container_width=True, hide_index=True)
+
+    col1, col2, col3 = st.columns(3)
+    with col1:
+        st.metric("Tasks", len(df))
+    with col2:
+        loop_cols = [c for c in decisions_df.columns if c.startswith("L")]
+        tasks_success = decisions_df[loop_cols].apply(lambda row: any(v is True for v in row), axis=1).sum()
+        st.metric("With Success", tasks_success)
+    with col3:
+        total_loops = sum(1 for _, row in decisions_df.iterrows() for c in loop_cols if row[c] is not None)
+        st.metric("Total Loops", total_loops)
+
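Note on `get_loop_status`: the status follows a fixed precedence. A feedback decision wins, then the presence of running artifacts, then coding artifacts. The sketch below restates that ordering as a pure function so it can be checked without touching the filesystem; `loop_status` and its boolean parameters are hypothetical and only mirror the checks in the diff above.

```python
from typing import Optional


def loop_status(decision: Optional[bool], has_running: bool, has_coding: bool) -> str:
    """Return the status code for a loop directory that exists."""
    if decision is not None:
        # A recorded feedback decision overrides everything else.
        return "OK" if decision else "X"
    if has_running:
        return "R"
    if has_coding:
        return "C"
    return "?"


print(loop_status(None, True, True))
```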
diff --git a/rdagent/app/utils/ws_ft.py b/rdagent/app/utils/ws_ft.py
new file mode 100644
index 000000000..fb518a039
--- /dev/null
+++ b/rdagent/app/utils/ws_ft.py
@@ -0,0 +1,52 @@
+from typing import Optional
+
+import typer
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.finetune.conf import get_ft_env
+from rdagent.utils.agent.tpl import T
+
+app = typer.Typer(help="Run LLM fine-tuning environment commands.")
+
+
+@app.command()
+def run(
+    dataset: str,
+    model: str,
+    cmd: str,
+    local_path: str = "./",
+    mount_path: str | None = None,
+):
+    """
+    Launch the LLM fine-tuning environment for a specific dataset and model, then run the
+    provided command.
+
+    Example:
+        1) start the container:
+        dotenv run -- python -m rdagent.app.utils.ws_ft alpaca_gpt4_zh qwen2-7b "sleep 3600" --local-path your_workspace
+
+        2) then run the following command to enter the latest container:
+        - docker exec -it `docker ps --filter 'status=running' -l --format '{{.Names}}'` bash
+        Or you can attach to the container by specifying the container name (find it in the run info)
+        - docker exec -it sweet_robinson bash
+
+    Arguments:
+        dataset: The dataset name for fine-tuning.
+        model: The base model name for fine-tuning.
+        cmd: The shell command or script entry point to execute inside
+             the environment.
+    """
+    # Don't set time limitation and always disable cache
+    env = get_ft_env(
+        running_timeout_period=None,
+        enable_cache=False,
+    )
+
+    if mount_path is not None:
+        env.conf.mount_path = mount_path
+
+    env.run(entry=cmd, local_path=local_path)
+
+
+if __name__ == "__main__":  # pragma: no cover
+    app()
diff --git a/rdagent/components/benchmark/__init__.py b/rdagent/components/benchmark/__init__.py
new file mode 100644
index 000000000..5cf96f7d2
--- /dev/null
+++ b/rdagent/components/benchmark/__init__.py
@@ -0,0 +1,6 @@
+"""Shared benchmark evaluation utilities."""
+
+from pathlib import Path
+
+# Shared configuration directory
+BENCHMARK_CONFIGS_DIR = Path(__file__).parent / "configs"
diff --git a/rdagent/components/benchmark/configs/__init__.py b/rdagent/components/benchmark/configs/__init__.py
new file mode 100644
index 000000000..42199d0bc
--- /dev/null
+++ b/rdagent/components/benchmark/configs/__init__.py
@@ -0,0 +1 @@
+"""Shared OpenCompass benchmark configurations."""
diff --git a/rdagent/components/benchmark/configs/models.yaml b/rdagent/components/benchmark/configs/models.yaml
new file mode 100644
index 000000000..f8759d655
--- /dev/null
+++ b/rdagent/components/benchmark/configs/models.yaml
@@ -0,0 +1,102 @@
+# Model Inference Parameters Configuration
+# Used by benchmark.py to determine inference settings for different models
+
+# Default configuration (used when model is not explicitly listed)
+default:
+  temperature: 0.0  # Greedy decoding for reproducible results
+  top_p: 1.0
+  top_k: 1
+  max_seq_len: 32768
+  max_out_len: 8192
+  batch_size: 16
+  tensor_parallel_size: auto  # Will be auto-determined based on GPU count
+  gpu_memory_utilization: 0.9
+  repetition_penalty: 1.0
+  dtype: bfloat16
+  enable_thinking: false
+  use_cot_postprocessor: true  # Enable CoT postprocessor to extract answer from <think>...</think>answer format
+
+# Model-specific configurations (override default values)
+models:
+  # Qwen3 series - support thinking mode and longer sequences
+  "Qwen/Qwen3-8B":
+    temperature: 0.6
+    top_p: 0.95
+    top_k: 20
+    max_seq_len: 40960
+    max_out_len: 38912
+    enable_thinking: true  # Qwen3-specific feature
+
+  "Qwen/Qwen3-32B":
+    temperature: 0.6
+    top_p: 0.95
+    top_k: 20
+    max_seq_len: 40960
+    max_out_len: 38912
+    enable_thinking: true
+
+  "Qwen/Qwen3-1.7B":
+    temperature: 0.6
+    top_p: 0.95
+    top_k: 20
+    max_seq_len: 40960
+    max_out_len: 38912
+    enable_thinking: true
+    gpu_memory_utilization: 0.7  # It does not use too much GPU memory. But it is worth 
+
+  # Qwen2.5 series - standard configuration with CoT postprocessor for fine-tuned models
+  "Qwen/Qwen2.5-0.5B-Instruct":
+    temperature: 0.0
+    top_p: 1.0
+    top_k: 1
+    max_seq_len: 32768
+    max_out_len: 8192
+    gpu_memory_utilization: 0.5  # 0.5B model is very small, no need for 0.9
+
+  "Qwen/Qwen2.5-0.5B":
+    temperature: 0.0
+    top_p: 1.0
+    top_k: 1
+    max_seq_len: 32768
+    max_out_len: 8192
+    gpu_memory_utilization: 0.5
+
+  "Qwen/Qwen2.5-7B-Instruct":
+    temperature: 0.0  # Greedy decoding for consistency
+    top_p: 1.0
+    top_k: 1
+    max_seq_len: 32768
+    max_out_len: 8192
+    use_cot_postprocessor: true  # Extract answer from CoT format after fine-tuning
+
+  "Qwen/Qwen2.5-32B-Instruct":
+    temperature: 0.0
+    top_p: 1.0
+    top_k: 1
+    max_seq_len: 32768
+    max_out_len: 8192
+
+  # Llama 3.1 series (128K context, 4K max output)
+  "meta-llama/Llama-3.1-8B-Instruct":
+    temperature: 0.7
+    top_p: 0.95
+    top_k: 40
+    max_seq_len: 32768 # 131072
+    max_out_len: 4096
+
+
+  # Mistral series
+  "mistralai/Mistral-7B-Instruct-v0.3":
+    temperature: 0.7
+    top_p: 0.95
+    top_k: 50
+    max_seq_len: 32768
+    max_out_len: 8192
+
+  # DeepSeek series
+  "deepseek-ai/deepseek-coder-33b-instruct":
+    temperature: 0.0
+    top_p: 1.0
+    top_k: 1
+    max_seq_len: 16384
+    max_out_len: 4096
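Note on `models.yaml`: the comments say that `default` applies when a model is not listed and that per-model entries override default values. How `benchmark.py` performs that merge is not shown in this part of the diff, so the following is only a sketch of one plausible reading (a shallow override). `resolve_model_params` and the in-memory `CONFIG` dict are hypothetical; the values are copied from the file above.

```python
from typing import Any

# Hypothetical stand-in for the parsed YAML, trimmed to a few keys.
CONFIG: dict[str, Any] = {
    "default": {"temperature": 0.0, "top_k": 1, "max_out_len": 8192, "gpu_memory_utilization": 0.9},
    "models": {
        "Qwen/Qwen3-8B": {"temperature": 0.6, "top_k": 20, "max_out_len": 38912},
    },
}


def resolve_model_params(config: dict[str, Any], model_name: str) -> dict[str, Any]:
    """Start from the defaults, then let the model-specific entry override them."""
    params = dict(config["default"])
    params.update(config.get("models", {}).get(model_name, {}))
    return params


print(resolve_model_params(CONFIG, "Qwen/Qwen3-8B"))
```

Under this reading, a key that a model entry omits (here `gpu_memory_utilization`) keeps its default value, and an unlisted model receives the defaults unchanged.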
diff --git a/rdagent/components/benchmark/configs/opencompass_template.yaml b/rdagent/components/benchmark/configs/opencompass_template.yaml
new file mode 100644
index 000000000..3b86ef9d0
--- /dev/null
+++ b/rdagent/components/benchmark/configs/opencompass_template.yaml
@@ -0,0 +1,129 @@
+# Auto-generated OpenCompass Config for RD-Agent Benchmark
+# DO NOT EDIT MANUALLY - Generated by benchmark.py
+
+template: |-
+    from mmengine.config import read_base
+    from opencompass.models import VLLMwithChatTemplate
+
+    # ==================== Dataset Import ====================
+    with read_base():
+    {% for dataset_module in dataset_imports %}
+        from {{ dataset_module }} import *
+    {% endfor %}
+
+    # Clean up non-serializable variables leaked by imported dataset configs.
+    # (e.g. BBH's config.py does `import os` and `with open() as f` at module scope,
+    # which get pulled in by `import *` and break OpenCompass config serialization.)
+    import types as _types
+    _cleanup = {_k for _k, _v in dict(locals()).items()
+                if isinstance(_v, _types.ModuleType) or (hasattr(_v, 'read') and hasattr(_v, 'close'))}
+    del _types
+    for _k in _cleanup:
+        exec(f'{_k} = None')
+    del _cleanup
+
+    # Aggregate all dataset variables
+    datasets = sum([v for k, v in locals().items() if (k == 'datasets' or k.endswith('_datasets')) and isinstance(v, list)], [])
+
+    # Apply dataset modifications
+    for ds in datasets:
+    {% if test_range %}
+        # Apply dataset range (e.g., "[:100]" for validation, "[-100:]" for test)
+        if 'reader_cfg' not in ds:
+            ds['reader_cfg'] = {}
+        ds['reader_cfg']['test_range'] = '{{ test_range }}'
+
+        # Sync to evaluator's dataset_cfg
+        if 'eval_cfg' in ds and 'evaluator' in ds['eval_cfg']:
+            evaluator = ds['eval_cfg']['evaluator']
+            if isinstance(evaluator, dict) and 'dataset_cfg' in evaluator:
+                if 'reader_cfg' not in evaluator['dataset_cfg']:
+                    evaluator['dataset_cfg']['reader_cfg'] = {}
+                evaluator['dataset_cfg']['reader_cfg']['test_range'] = '{{ test_range }}'
+    {% endif %}
+    {% if num_runs and num_runs > 1 %}
+        # Multiple runs (repeat each sample n times for averaging or pass@k)
+        ds['n'] = {{ num_runs }}
+    {% endif %}
+    {% if pass_k %}
+        # Pass@k evaluation
+        ds['k'] = {{ pass_k }}
+    {% endif %}
+        pass
+
+    # ==================== Model Configuration ====================
+    models = [
+        dict(
+            type=VLLMwithChatTemplate,
+            abbr='{{ model_abbr }}',
+            path='{{ model_path }}',
+            model_kwargs=dict(
+                tensor_parallel_size={{ tensor_parallel_size }},
+                gpu_memory_utilization={{ gpu_memory_utilization }},
+                trust_remote_code=True,
+                dtype='{{ dtype }}',
+                max_model_len={{ max_seq_len }},
+    {% if is_lora %}
+                enable_lora=True,
+                max_lora_rank=64,
+                max_cpu_loras=1,
+    {% endif %}
+            ),
+    {% if is_lora %}
+            lora_path='{{ lora_path }}',
+    {% endif %}
+            max_seq_len={{ max_seq_len }},
+            max_out_len={{ max_out_len }},
+            batch_size={{ batch_size }},
+            generation_kwargs=dict(
+                temperature={{ temperature }},
+                top_p={{ top_p }},
+                top_k={{ top_k }},
+    {% if repetition_penalty != 1.0 %}
+                repetition_penalty={{ repetition_penalty }},
+    {% endif %}
+            ),
+    {% if enable_thinking %}
+            chat_template_kwargs=dict(enable_thinking=True),
+    {% endif %}
+    {% if enable_thinking or use_cot_postprocessor %}
+            pred_postprocessor=dict(type='extract-non-reasoning-content'),
+    {% endif %}
+            run_cfg=dict(
+                num_gpus={{ tensor_parallel_size }},
+                num_procs=1,
+            ),
+        ),
+    ]
+
+    # ==================== Inference Configuration ====================
+    infer = dict(
+        partitioner=dict(
+            type='NaivePartitioner',
+        ),
+        runner=dict(
+            type='LocalRunner',
+            max_num_workers=16,
+            task=dict(
+                type='OpenICLInferTask',
+            ),
+        ),
+    )
+
+    # ==================== Evaluation Configuration ====================
+    eval = dict(
+        partitioner=dict(
+            type='NaivePartitioner',
+        ),
+        runner=dict(
+            type='LocalRunner',
+            max_num_workers=16,
+            task=dict(
+                type='OpenICLEvalTask',
+                dump_details=True,
+            ),
+        ),
+    )
+
+    # ==================== Work Directory ====================
+    work_dir = '{{ work_dir }}'
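Note on the dataset aggregation in the template: every list bound to `datasets` or to a name ending in `_datasets` is concatenated into a single list with `sum(..., [])`. The sketch below applies the same expression to an explicit dictionary instead of `locals()`; `aggregate_datasets` and the sample variable names are hypothetical.

```python
import os


def aggregate_datasets(namespace: dict) -> list:
    """Concatenate every list stored under 'datasets' or a '*_datasets' name."""
    return sum(
        [v for k, v in namespace.items() if (k == "datasets" or k.endswith("_datasets")) and isinstance(v, list)],
        [],
    )


# Hypothetical namespace, as might be left behind by `from ... import *`.
namespace = {
    "gsm8k_datasets": [{"abbr": "gsm8k"}],
    "bbh_datasets": [{"abbr": "bbh-a"}, {"abbr": "bbh-b"}],
    "os": os,  # a leaked module: ignored because it is not a list
    "model_name": "demo",  # ignored because the name does not match
}
print([d["abbr"] for d in aggregate_datasets(namespace)])
```

The result follows the insertion order of the namespace, so the order of the dataset imports determines the order of the aggregated list.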
diff --git a/rdagent/components/coder/CoSTEER/__init__.py b/rdagent/components/coder/CoSTEER/__init__.py
index b74313cc6..0eff56cf0 100644
--- a/rdagent/components/coder/CoSTEER/__init__.py
+++ b/rdagent/components/coder/CoSTEER/__init__.py
@@ -28,6 +28,7 @@ def __init__(
         with_knowledge: bool = True,
         knowledge_self_gen: bool = True,
         max_loop: int | None = None,
+        stop_eval_chain_on_fail: bool = False,
         **kwargs,
     ) -> None:
         super().__init__(*args, **kwargs)
@@ -46,6 +47,7 @@ def __init__(
         self.evolving_strategy = es
         self.evaluator = eva
         self.evolving_version = evolving_version
+        self.stop_eval_chain_on_fail = stop_eval_chain_on_fail
 
         # init rag method
         self.rag = (
@@ -99,10 +101,10 @@ def develop(self, exp: Experiment) -> Experiment:
             evolving_strategy=self.evolving_strategy,
             rag=self.rag,
             with_knowledge=self.with_knowledge,
-            with_feedback=True,
             knowledge_self_gen=self.knowledge_self_gen,
             enable_filelock=self.settings.enable_filelock,
             filelock_path=self.settings.filelock_path,
+            stop_eval_chain_on_fail=self.stop_eval_chain_on_fail,
         )
 
         # Evolving the solution
diff --git a/rdagent/components/coder/CoSTEER/evaluators.py b/rdagent/components/coder/CoSTEER/evaluators.py
index bbce2333b..5d55257f6 100644
--- a/rdagent/components/coder/CoSTEER/evaluators.py
+++ b/rdagent/components/coder/CoSTEER/evaluators.py
@@ -1,11 +1,13 @@
+import json
 from abc import abstractmethod
 from copy import deepcopy
-from dataclasses import dataclass
-from typing import TYPE_CHECKING, List
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Dict, Generator, List
 
 from rdagent.components.coder.CoSTEER.evolvable_subjects import EvolvingItem
 from rdagent.core.conf import RD_AGENT_SETTINGS
 from rdagent.core.evaluation import Evaluator, Feedback
+from rdagent.core.evolving_agent import RAGEvaluator
 from rdagent.core.evolving_framework import QueriedKnowledge
 from rdagent.core.experiment import Task, Workspace
 from rdagent.core.utils import multiprocessing_wrapper
@@ -37,12 +39,16 @@ class CoSTEERSingleFeedback(Feedback):
     It is design align the phases of the implemented code
     - Execution -> Return Value -> Code -> Final Decision
     """
-    execution: str
+    execution: str  # Summarized execution feedback
     # execution_feedback
     return_checking: str | None  # including every check in the testing (constraints about the generated value)
     # value_feedback, shape_feedback, value_generated_flag
     code: str
-    final_decision: bool | None = None
+    final_decision: bool
+    raw_execution: str = ""  # Full raw stdout for UI display
+    # Maps each source tag to its final_decision. Feedback may be merged from several sources, so this
+    # records where each decision came from; the mapping also includes this feedback's own source.
+    source_feedback: Dict[str, bool] = field(default_factory=dict)
 
     @staticmethod
     def val_and_update_init_dict(data: dict) -> dict:
@@ -72,8 +78,8 @@ def val_and_update_init_dict(data: dict) -> dict:
             raise ValueError(f"'final_decision' must be a boolean, not {type(data['final_decision'])}")
 
         for attr in "execution", "return_checking", "code":
-            if data[attr] is not None and not isinstance(data[attr], str):
-                raise ValueError(f"'{attr}' must be a string, not {type(data[attr])}")
+            if data.get(attr) is not None and not isinstance(data[attr], str):
+                data[attr] = json.dumps(data[attr], indent=2, ensure_ascii=False)
         return data
 
     @classmethod
@@ -95,6 +101,10 @@ def merge(cls, feedback_li: list["CoSTEERSingleFeedback"]) -> "CoSTEERSingleFeed
                 attr,
                 "\n\n".join([getattr(_fb, attr) for _fb in feedback_li if getattr(_fb, attr) is not None]),
             )
+        fb.source_feedback = {}
+        for _fb in feedback_li:
+            for tag, decision in _fb.source_feedback.items():
+                fb.source_feedback[tag] = decision
         return fb
 
     def __str__(self) -> str:
@@ -226,6 +236,7 @@ def __init__(
     # TODO:
     # I think we should have unified interface for all evaluates, for examples.
     # So we should adjust the interface of other factors
+    # Based on the implementation, a more fitting name would be something like "task-implementation evaluator"
     @abstractmethod
     def evaluate(
         self,
@@ -237,19 +248,23 @@ def evaluate(
         raise NotImplementedError("Please implement the `evaluator` method")
 
 
-class CoSTEERMultiEvaluator(CoSTEEREvaluator):
+class CoSTEERMultiEvaluator(RAGEvaluator):
     """This is for evaluation of experiment. Due to we have multiple tasks, so we will return a list of evaluation feebacks"""
 
-    def __init__(self, single_evaluator: CoSTEEREvaluator | list[CoSTEEREvaluator], *args, **kwargs) -> None:
-        super().__init__(*args, **kwargs)
+    def __init__(self, single_evaluator: CoSTEEREvaluator | list[CoSTEEREvaluator], scen: "Scenario") -> None:
+        super().__init__()
+        self.scen = scen
         self.single_evaluator = single_evaluator
 
-    def evaluate(
+    def evaluate_iter(
         self,
-        evo: EvolvingItem,
         queried_knowledge: QueriedKnowledge = None,
         **kwargs,
-    ) -> CoSTEERMultiFeedback:
+    ) -> Generator[CoSTEERMultiFeedback, EvolvingItem | None, CoSTEERMultiFeedback]:
+        evo = yield CoSTEERMultiFeedback(
+            []
+        )  # The first yield only receives the evo sent by the caller; it does not carry useful feedback
+
         eval_l = self.single_evaluator if isinstance(self.single_evaluator, list) else [self.single_evaluator]
 
         # 1) Evaluate each sub_task
@@ -279,7 +294,12 @@ def evaluate(
                 ],
                 n=RD_AGENT_SETTINGS.multi_proc_n,
             )
+            # If None is sent back, skip the remaining evaluators and return the overall feedback directly
+            evo_next_iter = yield CoSTEERMultiFeedback(multi_implementation_feedback)
             task_li_feedback_li.append(multi_implementation_feedback)
+            if evo_next_iter is None:
+                break
+            evo = evo_next_iter
 
         # 2) merge the feedbacks along the sub_tasks to aggregate the multiple evaluation feedbacks
         merged_task_feedback = []
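Note on the `evaluate_iter` protocol: the generator is primed first, then receives an evolving item through `send`, yields feedback after each evaluator, and stops early when it receives `None`. The toy version below illustrates only that control flow. It is a sketch under simplifying assumptions: evaluators are plain names, feedback is a string, and `drive` is a hypothetical caller, not the actual driver used by the evolving agent.

```python
from typing import Generator, Optional


def evaluate_iter(evaluators: list[str]) -> Generator[list[str], Optional[str], list[str]]:
    """Toy stand-in for the generator-based evaluator chain."""
    evo = yield []  # priming yield: only receives the first evo
    collected: list[str] = []
    for name in evaluators:
        feedback = f"{name} checked {evo}"
        evo_next = yield [feedback]
        collected.append(feedback)
        if evo_next is None:  # the caller asked to stop the chain
            break
        evo = evo_next
    return collected


def drive(evaluators: list[str], evos: list[str]) -> list[str]:
    """Hypothetical caller: sends one evo per step, then None once it runs out."""
    gen = evaluate_iter(evaluators)
    next(gen)  # advance to the priming yield
    pending = iter(evos)
    to_send: Optional[str] = next(pending)
    try:
        while True:
            gen.send(to_send)
            to_send = next(pending, None)
    except StopIteration as stop:
        return stop.value  # the generator's return value


print(drive(["exec", "lint"], ["v1", "v2"]))
```

Sending `None` after the first step, for example `drive(["exec", "lint"], ["v1"])`, ends the chain before the second evaluator runs.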
diff --git a/rdagent/components/coder/CoSTEER/evolving_strategy.py b/rdagent/components/coder/CoSTEER/evolving_strategy.py
index 16b7639b0..fbbf38615 100644
--- a/rdagent/components/coder/CoSTEER/evolving_strategy.py
+++ b/rdagent/components/coder/CoSTEER/evolving_strategy.py
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 from abc import abstractmethod
+from typing import Callable, Generator
 
 from rdagent.components.coder.CoSTEER.config import CoSTEERSettings
 from rdagent.components.coder.CoSTEER.evaluators import (
@@ -26,7 +27,6 @@ def __init__(self, scen: Scenario, settings: CoSTEERSettings, improve_mode: bool
         self.settings = settings
         self.improve_mode = improve_mode  # improve mode means we only implement the task which has failed before. The main diff is the first loop will not implement all tasks.
 
-    @abstractmethod
     def implement_one_task(
         self,
         target_task: Task,
@@ -58,6 +58,15 @@ def implement_one_task(
         """
         raise NotImplementedError
 
+    def implement_func_list(self) -> list[Callable]:
+        """
+        One evolve solution will be divided into multiple implement functions.
+        The functions will be called sequentially.
+
+        `implement_one_task` is the default implementation.  Please refer to its signature for more details.
+        """
+        return [self.implement_one_task]
+
     @abstractmethod
     def assign_code_list_to_evo(self, code_list: list[dict], evo: EvolvingItem) -> None:
         """
@@ -66,19 +75,48 @@ def assign_code_list_to_evo(self, code_list: list[dict], evo: EvolvingItem) -> N
         Due to the implement_one_task take `workspace` as input and output the `modification`.
         We should apply implementation to evo
 
+        Assumptions:
+        - The modification on evo should happen in-place!!
+
         The code list is aligned with the evolving item's sub-tasks.
         If a task is not implemented, put a None in the list.
         """
         raise NotImplementedError
 
-    def evolve(
+    def assign_code_list_to_evo(self, code_list: list[dict | None], evo) -> None:
+        """Assign code modifications to evolving item.
+
+        For runner, coder already generated full training config, so typically no modifications.
+        But this method is required by the abstract base class.
+        """
+        for index in range(len(evo.sub_tasks)):
+            if code_list[index] is None:
+                continue
+            if evo.sub_workspace_list[index] is None:
+                evo.sub_workspace_list[index] = evo.experiment_workspace
+
+            # If there are any modifications (usually empty for runner)
+            if code_list[index]:
+                # Handle change summary if present
+                if self.KEY_CHANGE_SUMMARY in code_list[index]:
+                    evo.sub_workspace_list[index].change_summary = code_list[index].pop(self.KEY_CHANGE_SUMMARY)
+                # Inject any modified files
+                evo.sub_workspace_list[index].inject_files(**code_list[index])
+
+        return evo
+
+    def evolve_iter(
         self,
         *,
         evo: EvolvingItem,
         queried_knowledge: CoSTEERQueriedKnowledge | None = None,
         evolving_trace: list[EvoStep] = [],
         **kwargs,
-    ) -> EvolvingItem:
+    ) -> Generator[EvolvingItem, EvolvingItem, None]:
+        if queried_knowledge is None:
+            raise ValueError(
+                "MultiProcessEvolvingStrategy requires queried_knowledge for efficient implementation. Please set with_knowledge=True in CoSTEER constructor."
+            )
         code_list = [None for _ in range(len(evo.sub_tasks))]
 
         last_feedback = None
@@ -111,24 +149,24 @@ def evolve(
                         {}
                     )  # empty implementation for skipped task, but assign_code_list_to_evo will still assign it
 
-        result = multiprocessing_wrapper(
-            [
-                (
-                    self.implement_one_task,
+        for implement_func in self.implement_func_list():
+            result = multiprocessing_wrapper(
+                [
                     (
-                        evo.sub_tasks[target_index],
-                        queried_knowledge,
-                        evo.experiment_workspace,
-                        None if last_feedback is None else last_feedback[target_index],
-                    ),
-                )
-                for target_index in to_be_finished_task_index
-            ],
-            n=RD_AGENT_SETTINGS.multi_proc_n,
-        )
-        for index, target_index in enumerate(to_be_finished_task_index):
-            code_list[target_index] = result[index]
-
-        evo = self.assign_code_list_to_evo(code_list, evo)
-
-        return evo
+                        implement_func,
+                        (
+                            evo.sub_tasks[target_index],
+                            queried_knowledge,
+                            evo.experiment_workspace,
+                            None if last_feedback is None else last_feedback[target_index],
+                        ),
+                    )
+                    for target_index in to_be_finished_task_index
+                ],
+                n=RD_AGENT_SETTINGS.multi_proc_n,
+            )
+            for index, target_index in enumerate(to_be_finished_task_index):
+                code_list[target_index] = result[index]
+
+            self.assign_code_list_to_evo(code_list, evo)
+            yield evo
diff --git a/rdagent/components/coder/CoSTEER/knowledge_management.py b/rdagent/components/coder/CoSTEER/knowledge_management.py
index 11659a9d6..d45e90f2f 100644
--- a/rdagent/components/coder/CoSTEER/knowledge_management.py
+++ b/rdagent/components/coder/CoSTEER/knowledge_management.py
@@ -100,6 +100,38 @@ def load_dumped_knowledge_base(self, *args, **kwargs):
 
 
 class CoSTEERQueriedKnowledge(QueriedKnowledge):
+    """
+    Data container for knowledge retrieved from the CoSTEER knowledge base during a query operation.
+
+    Parameters
+    ----------
+    success_task_to_knowledge_dict : dict, optional
+        A mapping between task information strings and their corresponding `CoSTEERKnowledge` objects
+        for tasks that were successfully completed.
+        Type: dict[str, CoSTEERKnowledge]
+        Example:
+            {
+                "task_info_1": CoSTEERKnowledge(target_task=Task(...),
+                                                implementation=FBWorkspace(...),
+                                                feedback=CoSTEERSingleFeedback(...)),
+                "task_info_2": CoSTEERKnowledge(...)
+            }
+    failed_task_info_set : set, optional
+        A set containing task information strings that were attempted but failed repeatedly beyond
+        the allowed trial limit.
+        Type: set[str]
+        Example:
+            {
+                "failed_task_info_1",
+                "failed_task_info_2"
+            }
+
+    Returns
+    -------
+    None
+        This class is a data holder; initialization does not return a value.
+    """
+
     def __init__(self, success_task_to_knowledge_dict: dict = {}, failed_task_info_set: set = set()) -> None:
         self.success_task_to_knowledge_dict = success_task_to_knowledge_dict
         self.failed_task_info_set = failed_task_info_set
@@ -134,6 +166,8 @@ def __init__(
 
 
 class CoSTEERRAGStrategyV1(CoSTEERRAGStrategy):
+    """This strategy is deprecated."""
+
     def __init__(self, settings: CoSTEERSettings, *args, **kwargs) -> None:
         super().__init__(*args, **kwargs)
         self.current_generated_trace_count = 0
@@ -245,6 +279,62 @@ def query(
 
 
 class CoSTEERQueriedKnowledgeV2(CoSTEERQueriedKnowledgeV1):
+    """
+    Aggregation subclass of `CoSTEERQueriedKnowledgeV1` that extends the queried knowledge to also
+    include mappings between tasks and knowledge related to similar errors from successful executions.
+
+    Parameters
+    ----------
+    task_to_former_failed_traces : dict, optional
+        Mapping from task information strings to a tuple containing:
+            - A list of `CoSTEERKnowledge` objects representing the most recent failed attempts for that task.
+            - An optional `CoSTEERKnowledge` object of the latest failed attempt after a successful execution,
+              or `None` if not applicable.
+        Type: dict[str, tuple[list[CoSTEERKnowledge], CoSTEERKnowledge | None]]
+        Example:
+            {
+                "task_info_A": ([CoSTEERKnowledge(...), CoSTEERKnowledge(...)], None),
+                "task_info_B": ([CoSTEERKnowledge(...), CoSTEERKnowledge(...)], CoSTEERKnowledge(...))
+            }
+
+    task_to_similar_task_successful_knowledge : dict, optional
+        Mapping from task information strings to a list of `CoSTEERKnowledge` objects representing
+        knowledge from similar tasks that have been successfully completed.
+        Type: dict[str, list[CoSTEERKnowledge]]
+        Example:
+            {
+                "task_info_A": [CoSTEERKnowledge(...), CoSTEERKnowledge(...)],
+                "task_info_C": []
+            }
+
+    task_to_similar_error_successful_knowledge : dict, optional
+        Mapping from task information strings to a list of tuples, each containing:
+            - A string describing the error(s) encountered.
+            - A tuple of two `CoSTEERKnowledge` objects:
+                * The first corresponds to the trace where that error was encountered.
+                * The second is related to a successful implementation that had the same error in a prior attempt.
+        Type: dict[str, list[tuple[str, tuple[CoSTEERKnowledge, CoSTEERKnowledge]]]]
+        Example:
+            {
+                "task_info_B": [
+                    (
+                        "1. ErrorType: ValueError; Error line: some_function_call()",
+                        (CoSTEERKnowledge(...), CoSTEERKnowledge(...))
+                    )
+                ]
+            }
+
+    **kwargs : dict
+        Additional keyword arguments passed to the parent constructor, such as:
+            - success_task_to_knowledge_dict: dict[str, CoSTEERKnowledge]
+            - failed_task_info_set: set[str]
+
+    Returns
+    -------
+    None
+        This class is purely a data container and does not return a value upon initialization.
+    """
+
     # Aggregation of knowledge
     def __init__(
         self,
@@ -338,7 +428,7 @@ def generate_knowledge(
             self.current_generated_trace_count = len(evolving_trace)
             return None
 
-    def query(self, evo: EvolvableSubjects, evolving_trace: list[EvoStep]) -> CoSTEERQueriedKnowledge | None:
+    def query(self, evo: EvolvableSubjects, evolving_trace: list[EvoStep]) -> CoSTEERQueriedKnowledge:
         conf_knowledge_sampler = self.settings.v2_knowledge_sampler
         queried_knowledge_v2 = CoSTEERQueriedKnowledgeV2(
             success_task_to_knowledge_dict=self.knowledgebase.success_task_to_knowledge_dict,
diff --git a/rdagent/components/coder/data_science/ensemble/eval.py b/rdagent/components/coder/data_science/ensemble/eval.py
index d424e0d13..ad207c454 100644
--- a/rdagent/components/coder/data_science/ensemble/eval.py
+++ b/rdagent/components/coder/data_science/ensemble/eval.py
@@ -67,7 +67,7 @@ def evaluate(
 
         implementation.inject_files(**{fname: test_code})
         result = implementation.run(env=env, entry=f"python {fname}")
-        stdout = result.get_truncated_stdout()
+        stdout = result.stdout
         ret_code = result.exit_code
 
         stdout += f"\nNOTE: the above scripts run with return code {ret_code}"
diff --git a/rdagent/components/coder/data_science/feature/eval.py b/rdagent/components/coder/data_science/feature/eval.py
index 8f51c62e4..4135a6a0d 100644
--- a/rdagent/components/coder/data_science/feature/eval.py
+++ b/rdagent/components/coder/data_science/feature/eval.py
@@ -1,8 +1,5 @@
-import json
-import re
 from pathlib import Path
 
-from rdagent.app.data_science.conf import DS_RD_SETTING
 from rdagent.components.coder.CoSTEER.evaluators import (
     CoSTEEREvaluator,
     CoSTEERSingleFeedback,
@@ -13,7 +10,6 @@
 from rdagent.core.experiment import FBWorkspace, Task
 from rdagent.utils.agent.tpl import T
 from rdagent.utils.agent.workflow import build_cls_from_json_with_retry
-from rdagent.utils.fmt import shrink_text
 
 DIRNAME = Path(__file__).absolute().resolve().parent
 
@@ -69,7 +65,7 @@ def evaluate(
             workflow_code=implementation.all_codes,
         )
         user_prompt = T(".prompts:feature_eval.user").r(
-            stdout=result.get_truncated_stdout(),
+            stdout=result.stdout,
             workflow_stdout=workflow_stdout,
         )
 
diff --git a/rdagent/components/coder/data_science/model/eval.py b/rdagent/components/coder/data_science/model/eval.py
index 56ec86c2b..23fe27897 100644
--- a/rdagent/components/coder/data_science/model/eval.py
+++ b/rdagent/components/coder/data_science/model/eval.py
@@ -71,7 +71,7 @@ def evaluate(
             )  # only check the model changed this time
             implementation.inject_files(**{fname: test_code})
             result = implementation.run(env=env, entry=f"python {fname}")
-            stdout = result.get_truncated_stdout()
+            stdout = result.stdout
             ret_code = result.exit_code
 
             if stdout is None:
diff --git a/rdagent/components/coder/data_science/pipeline/eval.py b/rdagent/components/coder/data_science/pipeline/eval.py
index f296d986f..38126832a 100644
--- a/rdagent/components/coder/data_science/pipeline/eval.py
+++ b/rdagent/components/coder/data_science/pipeline/eval.py
@@ -163,7 +163,7 @@ def evaluate(
             result = implementation.run(
                 env=env, entry=f"strace -e trace=file -f -o trace.log python -m coverage run main.py"
             )
-        result_stdout = result.get_truncated_stdout()
+        result_stdout = result.stdout
 
         nb_conversion_ret_code = 0
         nb_conversion_check_text = ""
@@ -261,7 +261,7 @@ def evaluate(
             implementation.inject_files(**{"test/submission_format_test.py": base_check_code})
             # stdout += "----Submission Check 1-----\n"
             submission_result = implementation.run(env=env, entry="python test/submission_format_test.py")
-            submission_check_out = submission_result.get_truncated_stdout()
+            submission_check_out = submission_result.stdout
             submission_ret_code = submission_result.exit_code
             stdout += "\n" + submission_check_out
 
diff --git a/rdagent/components/coder/data_science/raw_data_loader/eval.py b/rdagent/components/coder/data_science/raw_data_loader/eval.py
index 2289f56f7..e21e2fae0 100644
--- a/rdagent/components/coder/data_science/raw_data_loader/eval.py
+++ b/rdagent/components/coder/data_science/raw_data_loader/eval.py
@@ -56,7 +56,7 @@ def evaluate(
         test_code = (DIRNAME / "eval_tests" / "data_loader_test.txt").read_text()
         implementation.inject_files(**{fname: test_code})
         result = implementation.run(env=env, entry=f"python {fname}")
-        stdout = result.get_truncated_stdout()
+        stdout = result.stdout
         ret_code = result.exit_code
         match = re.search(r"(.*?)=== Start of EDA part ===(.*)=== End of EDA part ===(.*)", stdout, re.DOTALL)
         stdout_part_1, eda_output, stdout_part_2 = match.groups() if match else (stdout, None, "")
diff --git a/rdagent/components/coder/data_science/workflow/eval.py b/rdagent/components/coder/data_science/workflow/eval.py
index d8d489fea..49fbc97c7 100644
--- a/rdagent/components/coder/data_science/workflow/eval.py
+++ b/rdagent/components/coder/data_science/workflow/eval.py
@@ -125,7 +125,7 @@ def evaluate(
         implementation.inject_files(**{"test/submission_format_test.py": base_check_code})
         # stdout += "----Submission Check 1-----\n"
         submission_result = implementation.run(env=env, entry="python test/submission_format_test.py")
-        submission_check_out = submission_result.get_truncated_stdout()
+        submission_check_out = submission_result.stdout
         submission_ret_code = submission_result.exit_code
         stdout += "\n" + submission_check_out
 
diff --git a/rdagent/components/coder/finetune/__init__.py b/rdagent/components/coder/finetune/__init__.py
new file mode 100644
index 000000000..7569ea8b2
--- /dev/null
+++ b/rdagent/components/coder/finetune/__init__.py
@@ -0,0 +1,391 @@
+"""
+LLM Fine-tuning CoSTEER Implementation
+
+This module provides fine-tuning specific components for the CoSTEER framework,
+including evaluators and evolving strategies.
+"""
+
+import json
+from pathlib import Path
+from typing import Callable
+
+import yaml
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.CoSTEER import CoSTEER
+from rdagent.components.coder.CoSTEER.evaluators import (
+    CoSTEERMultiEvaluator,
+    CoSTEERSingleFeedback,
+)
+from rdagent.components.coder.CoSTEER.evolving_strategy import (
+    MultiProcessEvolvingStrategy,
+)
+from rdagent.components.coder.CoSTEER.knowledge_management import (
+    CoSTEERQueriedKnowledge,
+)
+from rdagent.components.coder.finetune.conf import (
+    FT_DATA_SCRIPT_NAME,
+    FT_PATHS,
+    FT_TEST_PARAMS_FILE_NAME,
+    FT_YAML_FILE_NAME,
+    FTCoderCoSTEERSettings,
+)
+from rdagent.components.coder.finetune.eval import FTCoderEvaluator, FTDataEvaluator
+from rdagent.core.experiment import FBWorkspace, Task
+from rdagent.core.scenario import Scenario
+from rdagent.log import rdagent_logger as logger
+from rdagent.oai.llm_utils import APIBackend
+from rdagent.scenarios.finetune.scen.llama_factory_manager import LLaMAFactory_manager
+from rdagent.scenarios.finetune.scen.utils import FinetuneDatasetDescriptor
+from rdagent.utils.agent.tpl import T
+
+DIRNAME = Path(__file__).absolute().resolve().parent
+
+
+class LLMFinetuneEvolvingStrategy(MultiProcessEvolvingStrategy):
+    """LLM Fine-tuning specific evolving strategy"""
+
+    def __init__(self, scen: Scenario, settings, *args, **kwargs):
+        super().__init__(scen, settings)
+        self.llama_factory_manager = LLaMAFactory_manager
+
+    def implement_func_list(self) -> list[Callable]:
+        return [self.implement_data, self.implement_lf_config]
+
+    def implement_data(
+        self,
+        target_task: Task,
+        queried_knowledge: CoSTEERQueriedKnowledge | None = None,
+        workspace: FBWorkspace | None = None,
+        prev_task_feedback: CoSTEERSingleFeedback | None = None,
+    ) -> dict[str, str]:
+        """Generate data processing script based on task.
+
+        This method generates a Python script that processes seed datasets
+        and outputs a data.json file in Alpaca format.
+
+        Returns:
+            dict with a "process_data.py" key containing the script code,
+            or an empty dict when regeneration is skipped (existing script reused or already passing).
+        """
+        # Check if proposal decided to skip data processing (reuse SOTA's data processing script)
+        if getattr(target_task, "skip_data_processing", False):
+            # Defensive check: ensure data script actually exists before skipping
+            script_exists = False
+            if workspace is not None:
+                script_exists = FT_DATA_SCRIPT_NAME in workspace.file_dict
+
+            if script_exists:
+                logger.info("Proposal decided to skip data processing, reusing SOTA's data script")
+                return {}
+            else:
+                logger.warning(
+                    "skip_data_processing=True but process_data.py not found in workspace, "
+                    "this indicates SOTA injection failed - system design issue"
+                )
+                # Don't fall back silently; let it fail early to expose the issue
+
+        # check whether the current code passes evaluation
+        if (
+            prev_task_feedback is not None
+            and "FTDataEvaluator" in prev_task_feedback.source_feedback
+            and prev_task_feedback.source_feedback["FTDataEvaluator"]
+        ):
+            logger.info("Previous data processing code passed evaluation, skipping regeneration")
+            return {}
+
+        # build former failed trace
+        queried_former_failed_knowledge = (
+            queried_knowledge.task_to_former_failed_traces[target_task.get_task_information()]
+            if queried_knowledge is not None
+            else ([], None)
+        )
+        queried_former_failed_knowledge = (
+            [
+                knowledge
+                for knowledge in queried_former_failed_knowledge[0]
+                if knowledge.implementation.file_dict.get(FT_YAML_FILE_NAME)
+                != workspace.file_dict.get(FT_YAML_FILE_NAME)
+            ],
+            queried_former_failed_knowledge[1],
+        )
+
+        # Get dataset information for the task
+        involving_datasets = getattr(target_task, "involving_datasets", [])
+        dataset_info = self._get_dataset_info(involving_datasets, datasets_path=FT_PATHS.datasets)
+
+        # Generate data processing script using LLM
+        system_prompt = T(".prompts:data_coder.system").r(
+            scenario=self.scen.get_scenario_all_desc(),
+            task_desc=target_task.get_task_information(),
+            dataset_info=dataset_info,
+            queried_former_failed_knowledge=queried_former_failed_knowledge[0],
+            api_max_workers=FT_RD_SETTING.api_max_workers,
+            datasets_path=FT_PATHS.datasets,
+            workspace_path=FT_PATHS.workspace,
+            force_think_token=FT_RD_SETTING.force_think_token,
+        )
+
+        user_prompt = T(".prompts:data_coder.user").r(
+            datasets_path=FT_PATHS.datasets,
+            workspace_path=FT_PATHS.workspace,
+            latest_code=workspace.file_dict.get(FT_DATA_SCRIPT_NAME, "") if workspace else "",
+            latest_feedback=prev_task_feedback,
+            involved_dataset_folder_desc={
+                ds_name: FinetuneDatasetDescriptor().describe_dataset_folder(
+                    Path(FT_RD_SETTING.file_path) / "datasets" / ds_name, include_dataset_readme=True
+                )
+                for ds_name in involving_datasets
+            },
+        )
+
+        script_code = APIBackend().build_messages_and_create_chat_completion(
+            user_prompt=user_prompt,
+            system_prompt=system_prompt,
+            json_mode=False,
+            code_block_language="python",
+            code_block_fallback=False,
+        )
+        logger.info(f"Generated data processing script ({len(script_code)} chars)")
+
+        return {FT_DATA_SCRIPT_NAME: script_code}
+
+    def _get_dataset_info(self, involving_datasets: list[str], datasets_path: str | None = None) -> str:
+        """Read dataset_info.json and return information for specified datasets.
+
+        Handles unified tasks structure:
+        - readme: Dataset README content
+        - file_tree: Directory structure
+        - total_samples: Total sample count
+        - tasks: Dict of task info (use "_root" for root-level data files)
+
+        Args:
+            involving_datasets: List of dataset names to include
+            datasets_path: Base path for datasets (e.g., "/assets/datasets/")
+        """
+        datasets_dir = Path(FT_RD_SETTING.file_path) / "datasets"
+        dataset_info_path = datasets_dir / "dataset_info.json"
+
+        # Use provided path or get from config
+        if datasets_path is None:
+            datasets_path = FT_PATHS.datasets
+
+        if not dataset_info_path.exists():
+            logger.warning(f"dataset_info.json not found at {dataset_info_path}")
+            return "No dataset information available."
+
+        try:
+            with open(dataset_info_path, "r", encoding="utf-8") as f:
+                all_dataset_info = json.load(f)
+        except (OSError, ValueError) as e:  # I/O, encoding, or JSON decoding errors
+            logger.error(f"Failed to read dataset_info.json: {e}")
+            return f"Error reading dataset info: {e}"
+
+        # Filter to only involved datasets, or use all if none specified
+        if involving_datasets:
+            filtered_info = {name: info for name, info in all_dataset_info.items() if name in involving_datasets}
+        else:
+            filtered_info = all_dataset_info
+
+        if not filtered_info:
+            return "No matching datasets found in dataset_info.json."
+
+        # Format dataset info for the prompt
+        info_parts = []
+        for name, info in filtered_info.items():
+            info_text = f"### Dataset: {name}\n"
+            # IMPORTANT: Tell LLM the full path to dataset directory
+            dataset_full_path = f"{datasets_path}{name}/"
+            info_text += f"- **Dataset path**: `{dataset_full_path}` (each dataset has its own subdirectory)\n"
+            info_text += f"- Total samples: {info.get('total_samples', 'N/A')}\n"
+            info_text += f"- Size: {info.get('total_size_mb', 'N/A')} MB\n"
+
+            # File tree for understanding directory structure
+            if info.get("file_tree"):
+                file_tree = info["file_tree"]
+                # Truncate if too long
+                if len(file_tree) > 1000:
+                    file_tree = file_tree[:1000] + "\n..."
+                info_text += f"\n**File Structure** (relative to `{dataset_full_path}`):\n```\n{file_tree}\n```\n"
+
+            # Handle unified tasks structure
+            tasks = info.get("tasks", {})
+            if tasks:
+                info_text += "\n**Tasks:**\n"
+                for task_name, task_info in tasks.items():
+                    # "_root" indicates data files are in root directory
+                    display_name = "(root)" if task_name == "_root" else task_name
+                    info_text += f"\n#### {display_name}\n"
+                    # Show full paths for data files
+                    files = task_info.get("files", [])
+                    info_text += f"- Files: {files}\n"
+                    if files:
+                        info_text += f"  - Full path example: `{dataset_full_path}{files[0]}`\n"
+                    info_text += f"- Sample count: {task_info.get('sample_count', 'N/A')}\n"
+                    if task_info.get("column_stats"):
+                        # Show key token stats
+                        stats_summary = []
+                        for col, stats in task_info["column_stats"].items():
+                            if stats.get("p50_tokens", 0) > 0:
+                                stats_summary.append(f"{col}: p50={stats['p50_tokens']}, p99={stats['p99_tokens']}")
+                        if stats_summary:
+                            info_text += f"- Token stats: {'; '.join(stats_summary[:5])}\n"
+
+            # README excerpt
+            if info.get("readme"):
+                readme = info["readme"]
+                if len(readme) > 500:
+                    readme = readme[:500] + "..."
+                info_text += f"\n**README:**\n{readme}\n"
+
+            info_parts.append(info_text)
+
+        return "\n".join(info_parts)
+
+    def implement_lf_config(
+        self,
+        target_task: Task,
+        queried_knowledge: CoSTEERQueriedKnowledge | None = None,
+        workspace: FBWorkspace | None = None,
+        prev_task_feedback: CoSTEERSingleFeedback | None = None,
+    ) -> dict[str, str]:
+        """Implement a single fine-tuning task by generating LlamaFactory config"""
+        if prev_task_feedback is not None and prev_task_feedback.source_feedback.get("FTCoderEvaluator", False):
+            logger.info("Previous training code passed evaluation, skipping regeneration")
+            return {}
+
+        task_info = target_task.get_task_information()
+
+        queried_former_failed_knowledge = (
+            queried_knowledge.task_to_former_failed_traces[task_info] if queried_knowledge is not None else ([], None)
+        )
+        queried_former_failed_knowledge = (
+            [
+                knowledge
+                for knowledge in queried_former_failed_knowledge[0]
+                if knowledge.implementation.file_dict.get(FT_YAML_FILE_NAME)
+                != workspace.file_dict.get(FT_YAML_FILE_NAME)
+            ],
+            queried_former_failed_knowledge[1],
+        )
+
+        # Get task parameters from the task object
+        base_model = target_task.base_model
+
+        # Use LLM to generate LlamaFactory config YAML
+        # Coder will decide method based on hypothesis and available parameters
+        config_files = self._generate_llamafactory_config_with_llm(
+            base_model=base_model,
+            task_info=task_info,
+            queried_former_failed_knowledge=queried_former_failed_knowledge,
+            prev_feedback=prev_task_feedback,
+            workspace=workspace,
+        )
+
+        # Return generated config files directly - validation happens in evaluator
+        return config_files
+
+    def _generate_llamafactory_config_with_llm(
+        self,
+        base_model: str,
+        task_info: str = "",
+        queried_former_failed_knowledge: tuple | None = None,
+        prev_feedback=None,
+        workspace=None,
+    ) -> dict[str, str]:
+        """Generate LlamaFactory configuration YAML using LLM"""
+
+        # Query LLaMA Factory parameters: shared params once + method-specific params
+        available_methods = self.llama_factory_manager.methods
+        shared_params = self.llama_factory_manager.format_shared_params()
+
+        # Format method-specific parameters only (no duplication of shared params)
+        methods_specific_params = {}
+        for method in available_methods:
+            methods_specific_params[method] = self.llama_factory_manager.format_method_specific_params(method)
+
+        # Use environment-aware paths (Docker vs Conda)
+        # Note: datasets_path in finetune_coder uses workspace path where processed
+        # data.json and dataset_info.json are located (generated by FTDataEvaluator)
+
+        # Generate prompts using templates with all required parameters
+        system_prompt = T(".prompts:finetune_coder.system").r(
+            scenario=self.scen.get_scenario_all_desc(),
+            task_desc=task_info,
+            queried_former_failed_knowledge=queried_former_failed_knowledge[0],
+            available_methods=", ".join(available_methods),
+            shared_params=shared_params,
+            methods_specific_params=methods_specific_params,
+        )
+
+        # Read data_stats.json from workspace (injected by FTDataEvaluator)
+        data_stats = workspace.file_dict.get("data_stats.json", "")
+
+        user_prompt = T(".prompts:finetune_coder.user").r(
+            latest_code=workspace.file_dict.get(FT_YAML_FILE_NAME, ""),
+            latest_feedback=prev_feedback,
+            base_model=base_model,
+            models_path=FT_PATHS.models,
+            datasets_path=FT_PATHS.workspace,  # Training config uses workspace path for processed data
+            workspace_path=FT_PATHS.workspace,
+            deepspeed_path=FT_PATHS.deepspeed,
+            data_stats=data_stats,
+            has_think_token=self.scen.model_info.get("has_think_token", False),
+            force_think_token=FT_RD_SETTING.force_think_token,
+        )
+
+        # Call LLM to generate config (multi-turn)
+        session = APIBackend().build_chat_session(session_system_prompt=system_prompt)
+
+        # Turn 1: Generate main training config
+        train_config_yaml = session.build_chat_completion(
+            user_prompt=user_prompt,
+            json_mode=False,
+            code_block_language="yaml",
+            code_block_fallback=False,
+        )
+
+        # Validate main config YAML syntax
+        yaml.safe_load(train_config_yaml)
+        logger.info("Extracted main YAML config successfully")
+
+        # Turn 2: Generate test parameters (test_params.yaml)
+        test_params_prompt = T(".prompts:finetune_coder.user_test_params").r(workspace_path=FT_PATHS.workspace)
+        test_params_yaml = session.build_chat_completion(
+            user_prompt=test_params_prompt,
+            json_mode=False,
+            code_block_language="yaml",
+            code_block_fallback=False,
+        )
+
+        # Validate test params YAML syntax
+        yaml.safe_load(test_params_yaml)
+        logger.info("Extracted test params YAML successfully")
+
+        return {FT_YAML_FILE_NAME: train_config_yaml, FT_TEST_PARAMS_FILE_NAME: test_params_yaml}
+
+
+class LLMFinetuneCoSTEER(CoSTEER):
+    """LLM Fine-tuning CoSTEER implementation"""
+
+    def __init__(
+        self,
+        scen: Scenario,
+        *args,
+        **kwargs,
+    ) -> None:
+        settings = FTCoderCoSTEERSettings()
+        eva = CoSTEERMultiEvaluator([FTDataEvaluator(scen=scen), FTCoderEvaluator(scen=scen)], scen=scen)
+        es = LLMFinetuneEvolvingStrategy(scen=scen, settings=settings)
+
+        super().__init__(
+            *args,
+            settings=settings,
+            eva=eva,
+            es=es,
+            evolving_version=2,
+            scen=scen,
+            max_loop=getattr(FT_RD_SETTING, "coder_max_loop", 5),
+            stop_eval_chain_on_fail=True,  # fine-tuning involves partial implementations
+            **kwargs,
+        )
diff --git a/rdagent/components/coder/finetune/conf.py b/rdagent/components/coder/finetune/conf.py
new file mode 100644
index 000000000..27cbc28ff
--- /dev/null
+++ b/rdagent/components/coder/finetune/conf.py
@@ -0,0 +1,417 @@
+import json
+import os
+import re
+import shutil
+from pathlib import Path
+from typing import Any, Literal
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.CoSTEER.config import CoSTEERSettings
+from rdagent.core.experiment import FBWorkspace
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.finetune.scen.utils import _compute_column_stats
+from rdagent.utils.agent.tpl import T
+from rdagent.utils.env import (
+    BenchmarkCondaConf,
+    BenchmarkCondaEnv,
+    BenchmarkDockerConf,
+    BenchmarkDockerEnv,
+    DockerEnv,
+    Env,
+    FTCondaConf,
+    FTCondaEnv,
+    FTDockerEnv,
+)
+
+
+def is_docker_env(env: Env) -> bool:
+    """Check if the environment is Docker-based."""
+    return isinstance(env, DockerEnv)
+
+
+def get_workspace_prefix(env: Env) -> str:
+    """Return workspace path prefix based on env type.
+
+    Docker uses /workspace as the mount point; conda uses the current directory.
+    """
+    return "/workspace" if is_docker_env(env) else "."
+
+
+FT_YAML_FILE_NAME = "train.yaml"
+FT_DATA_PROC_FILE_NAME = "data_process.py"
+FT_DEBUG_YAML_FILE_NAME = "debug_train.yaml"
+FT_TEST_PARAMS_FILE_NAME = "test_params.yaml"
+FT_DATA_FILE_NAME = "data.json"
+FT_DATA_SCRIPT_NAME = "process_data.py"
+
+# Env info: paths of the models and datasets inside the container/environment
+FT_MODEL_PATH = "/assets/models"
+FT_DATASET_PATH = "/assets/datasets"
+
+
+def get_data_processing_cache_key(local_path: str | Path) -> list[list[str]]:
+    """Generate cache key based only on data processing script and dataset info.
+
+    This ensures that data processing results are reused as long as the script
+    and dataset configuration remain unchanged, even if other files in the
+    workspace (like training config) have been modified.
+    """
+    content = []
+    local_path = Path(local_path)
+    # We only care about the script that generates data and the dataset configuration
+    for filename in [FT_DATA_SCRIPT_NAME, "dataset_info.json"]:
+        file_path = local_path / filename
+        if file_path.exists():
+            content.append([filename, file_path.read_text()])
+    content.sort(key=lambda x: x[0])
+    return content
+
+
+class FTPathConfig:
+    """Centralized path configuration for FT scenario.
+
+    Provides environment-aware paths for Docker vs Conda modes.
+    Uses lazy evaluation (properties) to avoid import-time errors.
+
+    Usage:
+        from rdagent.components.coder.finetune.conf import FT_PATHS
+
+        models_path = FT_PATHS.models      # e.g., "/assets/models/" or "/path/to/finetune/models/"
+        datasets_path = FT_PATHS.datasets  # e.g., "/assets/datasets/" or "/path/to/finetune/datasets/"
+        workspace_path = FT_PATHS.workspace  # e.g., "/workspace/" or "./"
+    """
+
+    @property
+    def is_docker(self) -> bool:
+        """Check if current environment is Docker-based."""
+        # FIXME: the env should work the same way for Docker and conda.
+        # We should not expose the env type everywhere.
+        return FTCoderCoSTEERSettings().env_type == "docker"
+
+    @property
+    def models(self) -> str:
+        """Model directory path (with trailing slash)."""
+        if self.is_docker:
+            return FT_MODEL_PATH + "/"
+        return str(Path(FT_RD_SETTING.file_path) / "models") + "/"
+
+    @property
+    def datasets(self) -> str:
+        """Dataset directory path for raw datasets (with trailing slash)."""
+        if self.is_docker:
+            return FT_DATASET_PATH + "/"
+        return str(Path(FT_RD_SETTING.file_path) / "datasets") + "/"
+
+    @property
+    def workspace(self) -> str:
+        """Workspace path prefix for prompts (with trailing slash)."""
+        return "/workspace/" if self.is_docker else "./"
+
+    @property
+    def deepspeed(self) -> str:
+        """DeepSpeed config directory."""
+        if self.is_docker:
+            return "/app/examples/deepspeed/"
+        # Conda mode: use bundled deepspeed configs in project
+        # Path: conf.py -> finetune -> coder -> components -> rdagent -> scenarios/finetune/env/conda/deepspeed
+        rdagent_root = Path(__file__).parent.parent.parent.parent
+        deepspeed_path = rdagent_root / "scenarios" / "finetune" / "env" / "conda" / "deepspeed"
+        return str(deepspeed_path) + "/" if deepspeed_path.exists() else ""
+
+
+# Singleton instance for path configuration
+FT_PATHS = FTPathConfig()
+
+
+class FTCoderCoSTEERSettings(CoSTEERSettings):
+    """LLM Fine-tuning CoSTEER settings"""
+
+    class Config:
+        env_prefix = "FT_Coder_CoSTEER_"
+
+    max_seconds_multiplier: int = 8
+    """LLM training takes longer, use higher multiplier"""
+
+    env_type: str = "docker"
+    """Environment type for LLM fine-tuning (docker/conda)"""
+
+    extra_eval: list[str] = []
+    """Extra evaluators"""
+
+
+def _get_standard_ft_volumes() -> dict:
+    """Get standard mount volume configuration for LLM finetune environments.
+
+    Creates standard directory mappings:
+    - models -> /assets/models (ro)
+    - datasets -> /assets/datasets (ro)
+
+    Returns:
+        Dictionary of local_path -> docker_mount_config mappings
+    """
+    base_path = Path(FT_RD_SETTING.file_path)
+    volumes = {}
+
+    # Read-only mounts for data and models
+    readonly_mounts = [
+        ("models", FT_MODEL_PATH),
+        ("datasets", FT_DATASET_PATH),
+    ]
+
+    for local_dir, docker_path in readonly_mounts:
+        local_path = base_path / local_dir
+        volumes[str(local_path)] = {"bind": docker_path, "mode": "ro"}
+
+    return volumes
+
+
+def get_ft_env(
+    extra_volumes: dict = {},
+    operation: str = "full_training",
+    enable_cache: bool | None = None,
+) -> Env:
+    """LLM finetune dedicated environment construction function.
+
+    Automatically includes standard finetune volume mounts:
+    - models -> /assets/models (ro)
+    - datasets -> /assets/datasets (ro)
+    - output -> /workspace/output (rw, auto-created)
+
+    Note: .llama_factory_info is no longer automatically mounted.
+    Pass llama_factory_info volume via extra_volumes when needed.
+
+    Args:
+        extra_volumes: Additional volume mounts beyond standard ones
+        operation: Operation type for timeout selection (unknown values fall back to full_timeout).
+            - "data_processing" / "debug_data_processing": Data processing (normal / debug timeout)
+            - "micro_batch": Micro-batch test (micro_batch_timeout)
+            - "full_training": Full training (full_timeout)
+        enable_cache: Whether to enable caching (None means use config value)
+
+    Returns:
+        Configured environment ready for use
+    """
+
+    conf = FTCoderCoSTEERSettings()
+
+    # Select timeout based on operation type
+    timeout_map = {
+        "data_processing": FT_RD_SETTING.data_processing_timeout,
+        "debug_data_processing": FT_RD_SETTING.debug_data_processing_timeout,
+        "micro_batch": FT_RD_SETTING.micro_batch_timeout,
+        "full_training": FT_RD_SETTING.full_timeout,
+    }
+    running_timeout_period = timeout_map.get(operation, FT_RD_SETTING.full_timeout)
+
+    # Use config value if enable_cache is not explicitly provided
+    if enable_cache is None:
+        enable_cache = FT_RD_SETTING.docker_enable_cache
+
+    # Use dedicated LLM docker or conda env based on config
+    if conf.env_type == "docker":
+        env = FTDockerEnv()
+        # Docker mode: setup volume mounts for models/datasets
+        standard_volumes = _get_standard_ft_volumes()
+        combined_volumes = standard_volumes.copy()
+        combined_volumes.update(extra_volumes)
+        env.conf.extra_volumes = combined_volumes
+    elif conf.env_type == "conda":
+        env = FTCondaEnv(conf=FTCondaConf())  # Auto-installs dependencies if env doesn't exist
+        # Conda mode: no volume mounts needed, use local paths directly
+        # extra_volumes are ignored in conda mode
+    else:
+        raise ValueError(f"Unknown env type: {conf.env_type}")
+
+    env.conf.running_timeout_period = running_timeout_period
+    env.conf.enable_cache = enable_cache
+    env.prepare()
+    return env
+
+
+def get_data_processing_env(
+    enable_cache: bool | None = None,
+    is_debug: bool = False,
+) -> tuple[Env, dict]:
+    """Get environment for data processing scripts with LLM API access.
+
+    This environment is configured for running data processing scripts that may
+    need to call LLM APIs. It includes:
+    - Standard finetune volume mounts (datasets, models)
+    - LLM API environment variables (OPENAI_API_KEY, OPENAI_BASE_URL, etc.)
+
+    Args:
+        enable_cache: Whether to enable Docker caching
+        is_debug: Whether running in debug mode (shorter timeout, default 20 min vs 1 hour)
+
+    Returns:
+        Tuple of (env, env_vars) where env_vars contains LLM API keys
+        to be passed to env.run() as the env parameter
+    """
+    env = get_ft_env(
+        operation="debug_data_processing" if is_debug else "data_processing",
+        enable_cache=enable_cache,
+    )
+
+    # Collect LLM API environment variables to pass to env.run()
+    llm_env_vars = {"PYTHONPATH": "./"}  # Base env var
+
+    # Pass OPENAI_API_KEY directly
+    if api_key := os.getenv("OPENAI_API_KEY"):
+        llm_env_vars["OPENAI_API_KEY"] = api_key
+
+    # Read OPENAI_API_BASE from env, but pass as OPENAI_BASE_URL (OpenAI SDK expects this name)
+    if api_base := os.getenv("OPENAI_API_BASE"):
+        llm_env_vars["OPENAI_BASE_URL"] = api_base
+
+    # Pass model pools as JSON environment variables for load balancing
+    llm_env_vars["STRONG_MODEL_POOL"] = json.dumps(FT_RD_SETTING.strong_models)
+    llm_env_vars["WEAK_MODEL_POOL"] = json.dumps(FT_RD_SETTING.weak_models)
+
+    return env, llm_env_vars
+
+
+def clear_workspace(workspace: FBWorkspace, env: Env) -> None:
+    """
+    Clean the files in LLM finetune workspace.
+    Only keeps the files that are injected by the coder (in workspace.file_dict) and `logs`.
+
+    Args:
+        workspace: The workspace object containing the file dictionary.
+        env: The environment to execute the clean command in.
+    """
+    target_path = workspace.workspace_path
+    if not target_path.exists():
+        return
+
+    # The cache_path is created when mounting, so permission changes do not work on it.
+    keep_items = {"logs", T("scenarios.data_science.share:scen.cache_path").r()}
+
+    for file_path in workspace.file_dict.keys():
+        top_level = Path(file_path).parts[0]
+        keep_items.add(top_level)
+
+    remove_items = []
+    for item in target_path.iterdir():
+        if item.name in keep_items:
+            continue
+        remove_items.append(item.name)
+
+    if remove_items:
+        ws_prefix = get_workspace_prefix(env)
+        # Construct rm command with all items to remove
+        # Items are relative to workspace root inside the env
+        items_str = " ".join([f"'{ws_prefix}/{item}'" for item in remove_items])
+        cmd = f"rm -rf {items_str}"
+        workspace.execute(env=env, entry=cmd)
+
+
+def get_benchmark_env(
+    extra_volumes: dict = {},
+    timeout: int | None = None,
+) -> Env:
+    """OpenCompass benchmark environment construction function.
+
+    Supports both Docker and conda environments based on FT_Coder_CoSTEER_env_type.
+
+    Args:
+        extra_volumes: Additional volume mounts (only used in Docker mode)
+        timeout: Running timeout in seconds (None uses config default)
+
+    Returns:
+        Configured environment ready for benchmark evaluation
+    """
+    conf = FTCoderCoSTEERSettings()
+
+    # Use benchmark-specific timeout or config default
+    if timeout is None:
+        # 0 means no timeout, use 7 days as practical "infinite"
+        timeout = FT_RD_SETTING.benchmark_timeout if FT_RD_SETTING.benchmark_timeout > 0 else 86400 * 7
+
+    benchmark_volumes = {}
+    # Set up the shared finetune folder mount for benchmarks
+    (FT_RD_SETTING.file_path / "benchmarks").mkdir(parents=True, exist_ok=True)
+    # NOTE: we choose a folder in the workspace as the mount point because we may run multiple instances on the same
+    # host machine. If the conda env is used, the mount points would otherwise conflict with each other.
+    benchmark_volumes[str((FT_RD_SETTING.file_path / "benchmarks").resolve())] = {
+        "bind": "./benchmarks",
+        "mode": "rw",
+    }
+    env_dict = {"COMPASS_DATA_CACHE": "./benchmarks/opencompass_data"}
+    # Mount models directory for LoRA base model access (vLLM needs base model config)
+    models_path = FT_RD_SETTING.file_path / "models"
+    if models_path.exists():
+        benchmark_volumes[str(models_path.resolve())] = {"bind": FT_MODEL_PATH, "mode": "ro"}
+    benchmark_volumes.update(extra_volumes)
+
+    if conf.env_type == "docker":
+        docker_conf = BenchmarkDockerConf()
+        docker_conf.running_timeout_period = timeout
+        docker_conf.extra_volumes = benchmark_volumes
+        docker_conf.env_dict = env_dict
+        env = BenchmarkDockerEnv(conf=docker_conf)
+    elif conf.env_type == "conda":
+        # NOTE:
+        # We assume the user has permission to create the symlink in the target directory.
+        # If requirements change in the future, we suggest making the target directory configurable in BenchmarkCondaConf.
+        conda_conf = BenchmarkCondaConf()
+        conda_conf.running_timeout_period = timeout
+        conda_conf.extra_volumes = benchmark_volumes
+        conda_conf.env_dict = env_dict
+        env = BenchmarkCondaEnv(conf=conda_conf)  # Auto-installs dependencies if env doesn't exist
+    else:
+        raise ValueError(f"Unknown env type: {conf.env_type}")
+
+    env.prepare()
+    return env
+
+
+def parse_estimation_from_stdout(stdout: str) -> dict:
+    """Parse estimation info from script SUMMARY output.
+
+    Expected format in stdout:
+    ========== SUMMARY ==========
+    Total output samples: 10
+    Raw samples processed: 100
+    Raw samples total: 50000
+    Estimated full output: ~5000
+    =============================
+    """
+    estimation = {}
+    if not stdout:
+        return estimation
+
+    patterns = {
+        "output_samples": r"Total output samples:\s*(\d+)",
+        "raw_processed": r"Raw samples processed:\s*(\d+)",
+        "raw_total": r"Raw samples total:\s*(\d+)",
+        "estimated_full": r"Estimated full output:\s*~?(\d+)",
+    }
+    for key, pattern in patterns.items():
+        match = re.search(pattern, stdout)
+        if match:
+            estimation[key] = int(match.group(1))
+
+    return estimation
+
+
+def inject_data_stats(implementation: FBWorkspace, data: list, stdout: str) -> None:
+    """Compute token statistics and inject data_stats.json.
+
+    Used by both FTDataEvaluator (coding stage) and FTRunnerEvaluator (running stage).
+
+    Args:
+        implementation: The workspace to inject data_stats.json into
+        data: The data list from data.json
+        stdout: The stdout from process_data.py execution
+    """
+    token_stats = _compute_column_stats(data)
+    estimation = parse_estimation_from_stdout(stdout)
+
+    data_stats = {
+        "total_samples": len(data),
+        "token_stats": token_stats,
+        "estimation": estimation,
+    }
+
+    implementation.inject_files(**{"data_stats.json": json.dumps(data_stats, indent=2)})
+    logger.info(f"Injected data_stats.json with {len(data)} samples")
diff --git a/rdagent/components/coder/finetune/eval.py b/rdagent/components/coder/finetune/eval.py
new file mode 100644
index 000000000..3e9aff053
--- /dev/null
+++ b/rdagent/components/coder/finetune/eval.py
@@ -0,0 +1,396 @@
+"""
+LLM Fine-tuning Evaluation Components
+
+Provides simplified evaluation: parameter filtering + micro-batch testing.
+No redundant LLM feedback generation - test results speak for themselves.
+"""
+
+import json
+import random
+from pathlib import Path
+from typing import Optional
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.CoSTEER.evaluators import (
+    CoSTEEREvaluator,
+    CoSTEERSingleFeedback,
+)
+from rdagent.components.coder.finetune.conf import (
+    FT_DATA_FILE_NAME,
+    FT_DATA_SCRIPT_NAME,
+    FT_YAML_FILE_NAME,
+    clear_workspace,
+    get_data_processing_cache_key,
+    get_data_processing_env,
+    get_ft_env,
+    get_workspace_prefix,
+    inject_data_stats,
+)
+from rdagent.components.coder.finetune.unified_validator import LLMConfigValidator, SYSTEM_MANAGED_PARAMS
+from rdagent.core.evolving_framework import QueriedKnowledge
+from rdagent.core.experiment import FBWorkspace, Task
+from rdagent.log import rdagent_logger as logger
+from rdagent.utils.agent.tpl import T
+from rdagent.utils.agent.workflow import build_cls_from_json_with_retry
+
+DIRNAME = Path(__file__).absolute().resolve().parent
+
+
+class FTDataEvaluator(CoSTEEREvaluator):
+    """Evaluator for data processing results.
+
+    This evaluator:
+    1. Executes the process_data.py script in Docker
+    2. Validates the output data.json file
+    3. Generates dataset_info.json for LlamaFactory
+    """
+
+    def evaluate(
+        self,
+        target_task: Task,
+        implementation: FBWorkspace,
+        gt_implementation: FBWorkspace,
+        queried_knowledge: Optional[QueriedKnowledge] = None,
+        **kwargs,
+    ) -> CoSTEERSingleFeedback:
+        """Evaluate data processing implementation with LLM feedback."""
+
+        script_code = implementation.file_dict.get(FT_DATA_SCRIPT_NAME, "")
+        data_json_path = implementation.workspace_path / FT_DATA_FILE_NAME
+        execution_output = ""
+        exit_code = 0
+        data = None
+        error_msg = None
+
+        # Step 1: Check script exists
+        if not script_code:
+            feedback = CoSTEERSingleFeedback(
+                execution=f"No {FT_DATA_SCRIPT_NAME} found",
+                return_checking="Data processing script missing",
+                code="Please generate a data processing script first.",
+                final_decision=False,
+            )
+            logger.log_object(feedback, tag="evaluator_feedback.FTDataEvaluator")
+            return feedback
+
+        # NOTE: we depend on the cache to speed up data generation,
+        # so we clear the workspace every time.
+
+        # Step 3: Execute script in DEBUG mode (generates ~10 samples for fast validation)
+        env, env_vars = get_data_processing_env(is_debug=True)
+
+        # Clear workspace (except logs and file_dict items) before data processing
+        clear_workspace(implementation, env=env)
+        ws_prefix = get_workspace_prefix(env)
+
+        # Use FTWorkspace.run() for unified Docker logging
+        # --debug flag tells the script to generate only ~10 samples
+        result = implementation.run(
+            env=env,
+            entry=f"python {ws_prefix}/{FT_DATA_SCRIPT_NAME} --debug",
+            env_vars=env_vars,
+            cache_key_extra_func=get_data_processing_cache_key,
+            cache_files_to_extract=[FT_DATA_FILE_NAME],
+        )
+        execution_output = result.stdout if hasattr(result, "stdout") else str(result)
+        exit_code = result.exit_code if hasattr(result, "exit_code") else -1
+
+        # Step 4: Validate output
+        if not data_json_path.exists():
+            error_msg = f"{FT_DATA_FILE_NAME} not generated"
+        else:
+            validation_result = self._validate_data_json(data_json_path)
+            if not validation_result["valid"]:
+                error_msg = validation_result["error"]
+            else:
+                self._update_dataset_info(implementation, validation_result["sample_count"])
+
+        # Step 5: Load data if valid
+        if error_msg is None and data_json_path.exists():
+            with open(data_json_path, "r", encoding="utf-8") as f:
+                data = json.load(f)
+
+        # Step 5.5: Compute token stats and inject data_stats for yaml coder
+        if data is not None and error_msg is None:
+            inject_data_stats(implementation, data, execution_output)
+
+        # Step 6: Generate LLM feedback
+        # Truncate stdout from end for LLM (summary at the end is more useful)
+        stdout_summary = execution_output[-1500:] if execution_output else ""
+        return self._generate_llm_feedback(
+            target_task=target_task,
+            script_code=script_code if error_msg else "",  # Only show script on error
+            stdout=stdout_summary,  # Always show summary (truncated from end)
+            exit_code=exit_code,
+            data=data,
+            error_msg=error_msg,
+            queried_knowledge=queried_knowledge,
+            raw_stdout=execution_output,  # Full log for UI
+        )
+
+    def _generate_llm_feedback(
+        self,
+        target_task: Task,
+        script_code: str,
+        stdout: str,
+        exit_code: int,
+        data: Optional[list],
+        error_msg: Optional[str],
+        queried_knowledge: Optional[QueriedKnowledge],
+        raw_stdout: str = "",
+    ) -> CoSTEERSingleFeedback:
+        """Generate LLM-based feedback for data processing evaluation."""
+
+        # Prepare data statistics and samples
+        if data:
+            stats = self._analyze_data_quality(data)
+            data_stats = json.dumps(stats, indent=2)
+            sampled_data = self._sample_data(data)
+            data_samples = json.dumps(sampled_data, indent=2, ensure_ascii=False)
+            sample_count = len(sampled_data)
+            total_samples = len(data)
+        else:
+            data_stats = json.dumps({"error": error_msg or "No data generated"})
+            data_samples = "[]"
+            sample_count = 0
+            total_samples = 0
+
+        # Extract similar successful knowledge
+        queried_similar_successful_knowledge = []
+        if queried_knowledge is not None:
+            task_info = target_task.get_task_information()
+            queried_similar_successful_knowledge = queried_knowledge.task_to_similar_task_successful_knowledge.get(
+                task_info, []
+            )
+
+        # Build prompts
+        system_prompt = T(".prompts:data_eval.system").r(
+            scenario=self.scen.get_scenario_all_desc(),
+            queried_similar_successful_knowledge=queried_similar_successful_knowledge,
+            upper_data_size_limit=FT_RD_SETTING.upper_data_size_limit,
+            force_think_token=FT_RD_SETTING.force_think_token,
+        )
+        user_prompt = T(".prompts:data_eval.user").r(
+            task_desc=target_task.get_task_information(),
+            script_code=script_code,
+            exit_code=exit_code,
+            stdout=stdout[:3000] if stdout else "",  # Empty string triggers {% if stdout %} = false
+            data_stats=data_stats,
+            sample_count=sample_count,
+            total_samples=total_samples,
+            data_samples=data_samples,
+        )
+
+        logger.info(
+            f"Generating LLM feedback for data evaluation (samples: {total_samples}, has_error: {bool(error_msg)})"
+        )
+
+        feedback = build_cls_from_json_with_retry(
+            CoSTEERSingleFeedback,
+            system_prompt=system_prompt,
+            user_prompt=user_prompt,
+            init_kwargs_update_func=CoSTEERSingleFeedback.val_and_update_init_dict,
+        )
+
+        # NOTE: a zero exit code is a hard criterion for success
+        if exit_code != 0:
+            feedback.final_decision = False
+
+        feedback.raw_execution = raw_stdout
+        feedback.source_feedback[self.__class__.__name__] = feedback.final_decision
+        logger.log_object(feedback, tag="evaluator_feedback.FTDataEvaluator")
+        return feedback
+
+    def _validate_data_json(self, data_json_path: Path) -> dict:
+        """Validate data.json file format and content."""
+        try:
+            with open(data_json_path, "r", encoding="utf-8") as f:
+                data = json.load(f)
+
+            # Must be a non-empty list
+            if not isinstance(data, list):
+                return {"valid": False, "error": "data.json must be a JSON array", "sample_count": 0}
+
+            if len(data) == 0:
+                return {"valid": False, "error": "data.json is empty", "sample_count": 0}
+
+            # Check required fields in samples
+            required_fields = ["instruction", "output"]
+            for i, sample in enumerate(data[:10]):  # Check first 10 samples
+                if not isinstance(sample, dict):
+                    return {"valid": False, "error": f"Sample {i} is not a dict", "sample_count": 0}
+
+                missing = [f for f in required_fields if f not in sample]
+                if missing:
+                    return {"valid": False, "error": f"Sample {i} missing fields: {missing}", "sample_count": 0}
+
+                # Check for empty required fields
+                for field in required_fields:
+                    if not sample.get(field):
+                        return {
+                            "valid": False,
+                            "error": f"Sample {i} has empty '{field}' field",
+                            "sample_count": 0,
+                        }
+
+            return {"valid": True, "error": None, "sample_count": len(data)}
+
+        except json.JSONDecodeError as e:
+            return {"valid": False, "error": f"Invalid JSON: {e}", "sample_count": 0}
+        except Exception as e:
+            return {"valid": False, "error": f"Error reading file: {e}", "sample_count": 0}
+
+    def _update_dataset_info(self, implementation: FBWorkspace, sample_count: int):
+        """Generate dataset_info.json for LlamaFactory to use the processed data.
+
+        Note: LlamaFactory's columns mapping uses internal names (prompt, query, response)
+        that map to the actual column names in the data file (instruction, input, output).
+        See: https://github.com/hiyouga/LLaMA-Factory/blob/main/src/llamafactory/data/parser.py
+        """
+        dataset_info = {
+            "processed_data": {
+                "file_name": FT_DATA_FILE_NAME,
+                "formatting": "alpaca",
+                "columns": {
+                    "prompt": "instruction",
+                    "query": "input",
+                    "response": "output",
+                },
+            }
+        }
+
+        try:
+            implementation.inject_files(**{"dataset_info.json": json.dumps(dataset_info, indent=2)})
+            logger.info(f"Updated dataset_info.json with processed_data ({sample_count} samples)")
+        except Exception as e:
+            logger.warning(f"Failed to update dataset_info.json: {e}")
+
+    def _sample_data(self, data: list, n: int = 5) -> list:
+        """Random sampling for LLM evaluation."""
+        if len(data) <= n:
+            return data
+        return random.sample(data, n)
+
+    def _analyze_data_quality(self, data: list) -> dict:
+        """Analyze data quality statistics for all fields."""
+        if not data:
+            return {"total_samples": 0, "error": "Empty data"}
+
+        # Analyze length stats for all standard fields
+        fields = ["instruction", "input", "output"]
+        stats = {"total_samples": len(data)}
+
+        for field in fields:
+            lens = [len(str(d.get(field, ""))) for d in data]
+            empty_count = sum(1 for d in data if not d.get(field))
+            stats[f"{field}_len"] = {
+                "min": min(lens),
+                "max": max(lens),
+                "avg": round(sum(lens) / len(lens), 1),
+            }
+            stats[f"{field}_empty_ratio"] = round(empty_count / len(data) * 100, 1)
+
+        # Detect duplicates by full record (instruction + input + output)
+        record_set = set(
+            (str(d.get("instruction", "")), str(d.get("input", "")), str(d.get("output", ""))) for d in data
+        )
+        duplicate_count = len(data) - len(record_set)
+        stats["duplicate_count"] = duplicate_count
+        stats["duplicate_ratio"] = round(duplicate_count / len(data) * 100, 1)
+
+        return stats
+
+
+class FTCoderEvaluator(CoSTEEREvaluator):
+    """Evaluator for LLM fine-tuning implementations with simplified validation"""
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+    def evaluate(
+        self,
+        target_task: Task,
+        implementation: FBWorkspace,
+        gt_implementation: FBWorkspace,
+        queried_knowledge: Optional[QueriedKnowledge] = None,
+        **kwargs,
+    ) -> CoSTEERSingleFeedback:
+        """Evaluate LLM fine-tuning implementation with two-step validation"""
+
+        task_info = target_task.get_task_information()
+
+        # Check task history
+        if queried_knowledge is not None:
+            if task_info in queried_knowledge.success_task_to_knowledge_dict:
+                return queried_knowledge.success_task_to_knowledge_dict[task_info].feedback
+            elif task_info in queried_knowledge.failed_task_info_set:
+                feedback = CoSTEERSingleFeedback(
+                    execution="Task failed too many times, skipping.",
+                    return_checking="Task failed too many times, skipping.",
+                    code="Task failed too many times, skipping.",
+                    final_decision=False,
+                )
+                logger.log_object(feedback, tag="evaluator_feedback.FTCoderEvaluator")
+                return feedback
+
+        env = get_ft_env(operation="micro_batch")
+        config_yaml = implementation.file_dict.get(FT_YAML_FILE_NAME, "")
+        if not config_yaml:
+            feedback = CoSTEERSingleFeedback(
+                execution=f"No {FT_YAML_FILE_NAME} found",
+                return_checking="Configuration file missing",
+                code="No valid configuration file",
+                final_decision=False,
+            )
+            logger.log_object(feedback, tag="evaluator_feedback.FTCoderEvaluator")
+            return feedback
+
+        # Two-step validation: parameter filtering + micro-batch test
+        validation_result = LLMConfigValidator().validate_and_test(
+            config_yaml=config_yaml, workspace=implementation, env=env
+        )
+        # NOTE: Docker execution is logged by FTWorkspace.run() automatically
+
+        # Update config with filtered version
+        if validation_result.filtered_config != config_yaml:
+            implementation.inject_files(**{FT_YAML_FILE_NAME: validation_result.filtered_config})
+
+        queried_similar_successful_knowledge = (
+            queried_knowledge.task_to_similar_task_successful_knowledge[target_task.get_task_information()]
+            if queried_knowledge is not None
+            else []
+        )
+
+        system_prompt = T(".prompts:finetune_eval.system").r(
+            queried_similar_successful_knowledge=queried_similar_successful_knowledge,
+            system_managed_params=SYSTEM_MANAGED_PARAMS,
+        )
+        user_prompt = T(".prompts:finetune_eval.user").r(
+            scenario=self.scen.get_scenario_all_desc(),
+            task_desc=target_task.get_task_information(),
+            stdout=validation_result.execution_output or "No output",
+            code_yaml=implementation.file_dict[FT_YAML_FILE_NAME],
+            workspace_files="\n".join(
+                [
+                    f"- {file.name} ({file.stat().st_size} bytes)"
+                    for file in implementation.workspace_path.rglob("*")
+                    if file.is_file() and "checkpoint" not in file.absolute().as_posix()
+                ]
+            ),
+        )
+        feedback = build_cls_from_json_with_retry(
+            CoSTEERSingleFeedback,
+            system_prompt=system_prompt,
+            user_prompt=user_prompt,
+            init_kwargs_update_func=CoSTEERSingleFeedback.val_and_update_init_dict,
+        )
+
+        # Force failure if validation failed programmatically
+        if not validation_result.success:
+            feedback.final_decision = False
+            logger.warning("FTCoderEvaluator: Forced final_decision=False due to validation failure")
+
+        feedback.raw_execution = validation_result.raw_stdout or ""
+        feedback.source_feedback[self.__class__.__name__] = feedback.final_decision
+        logger.log_object(feedback, tag="evaluator_feedback.FTCoderEvaluator")
+        return feedback
diff --git a/rdagent/components/coder/finetune/exp.py b/rdagent/components/coder/finetune/exp.py
new file mode 100644
index 000000000..e19582241
--- /dev/null
+++ b/rdagent/components/coder/finetune/exp.py
@@ -0,0 +1,39 @@
+"""
+LLM Fine-tuning Experiment Components
+
+Defines tasks for LLM fine-tuning following data science pattern.
+"""
+
+from typing import List, Optional
+
+from rdagent.components.coder.CoSTEER.task import CoSTEERTask
+
+
+class FTTask(CoSTEERTask):
+    """Training task class for LLM fine-tuning operations - follows data science pattern"""
+
+    def __init__(
+        self,
+        base_model: str,
+        description: str,
+        benchmark: str,
+        involving_datasets: Optional[List[str]] = None,
+        skip_data_processing: bool = False,
+        *args,
+        **kwargs,
+    ) -> None:
+        super().__init__(name="LLM-Fine-Tuning", description=description, *args, **kwargs)
+        self.base_model = base_model
+        self.benchmark = benchmark
+        self.involving_datasets = involving_datasets or []
+        self.skip_data_processing = skip_data_processing  # If True, reuse SOTA's data processing script
+
+    def get_task_information(self) -> str:
+        """Get task information for coder prompt generation"""
+        task_desc = f"""name: {self.name}
+description: {self.description}
+base_model: {self.base_model}
+"""
+        if self.involving_datasets:
+            task_desc += f"involving_datasets: {self.involving_datasets}\n"
+        return task_desc
diff --git a/rdagent/components/coder/finetune/prompts.yaml b/rdagent/components/coder/finetune/prompts.yaml
new file mode 100644
index 000000000..af55eb9ad
--- /dev/null
+++ b/rdagent/components/coder/finetune/prompts.yaml
@@ -0,0 +1,761 @@
+data_coder:
+  system: |-
+    You are a world-class data engineer specializing in preparing training data for large language model fine-tuning.
+    Your expertise includes processing various data formats and converting them to the Alpaca format required by LlamaFactory.
+
+    # Part 1: Context
+
+    ## 1.1 Scenario Description
+    {{ scenario }}
+
+    ## 1.2 Task Description
+    {{ task_desc }}
+
+    ## 1.3 Available Datasets
+    The following datasets are available for processing:
+    {{ dataset_info }}
+
+    ## 1.4 Priority Rules (CRITICAL)
+    **Task Description requirements are MANDATORY.** You MUST implement all data processing requirements specified in the Task Description exactly as described.
+
+    # Part 2: Output Specification
+
+    ## 2.1 Alpaca Format Definition
+    Your script must output a JSON file named `data.json` in the current working directory (`{{ workspace_path }}`).
+    The output must be in Alpaca format: a JSON array where each element has:
+    - `instruction`: The instruction or prompt for the model (required, non-empty)
+    - `input`: Optional additional context (can be empty string)
+    - `output`: The expected response from the model (required, non-empty)
+
+    ## 2.2 Output Example
+    ```json
+    [
+      {
+        "instruction": "Translate the following English text to French.",
+        "input": "Hello, how are you?",
+        "output": "Bonjour, comment allez-vous?"
+      },
+      {
+        "instruction": "Summarize the following article.",
+        "input": "Article content here...",
+        "output": "Summary of the article..."
+      }
+    ]
+    ```
+
+    ## 2.3 Data Quality Awareness (IMPORTANT)
+    - Raw datasets may contain low-quality, noisy, or incorrect samples
+    - It is better to DISCARD questionable samples than to include them in training data
+    - When encountering samples that are ambiguous, malformed, or have inconsistent answers, prefer filtering them out
+    - A smaller but high-quality dataset is more valuable than a larger noisy one
+    - A high filtering rate is acceptable and expected: it means the script is doing quality control properly
+
+    ## 2.4 Data Validation Rules
+    Before writing the final data.json, implement these validations:
+
+    ### 2.4.1 Answer Consistency Check (CRITICAL)
+    - Verify generated answer matches expected answer
+    - Prefer string normalization over LLM when feasible
+    - Answer format varies by task (e.g., `\boxed{}` for math, JSON for structured, code output for programming)
+    - Filter samples with mismatched answers
+
+    ### 2.4.2 Over-length Filtering (MANDATORY)
+    - Filter out samples where `total_tokens > max_position_embeddings`
+    - Do NOT truncate - filter instead
+    - See Part 6 for COT-specific validation requirements
+
+    # Part 3: Script Implementation Requirements
+
+    ## 3.1 Basic Conventions
+    1. Read data from `{{ datasets_path }}` directory (mounted read-only)
+    2. Use standard Python libraries (json, csv, os, pathlib) when possible
+    3. Handle file encoding properly (use utf-8)
+    4. Include error handling for file operations
+    5. Print progress information to stdout for debugging
+    6. **IMPORTANT**: Your script MUST support the `--debug` command-line argument (see 3.2). Other than `--debug`, do NOT expect any other command-line arguments.
+
+    ## 3.2 Debug Mode (CRITICAL)
+    Your script MUST support `--debug` for fast validation:
+    - Sampling/filtering is a pure code operation (no LLM), so it runs to completion in both modes
+    - `--debug`: Process ~100 samples through LLM pipeline, print actual sampled total
+    - No flag: Process ALL sampled data through LLM pipeline
+
+    ### Debug Mode Example
+    ```python
+    import random
+
+    # Step 1: Run complete sampling/filtering (fast, no LLM) - runs in BOTH modes
+    sampled_data = apply_sampling_strategy(raw_data)  # e.g., 50000 → 2000
+
+    # Step 2: Limit LLM processing in debug mode only
+    if args.debug:
+        samples_to_process = random.sample(sampled_data, min(100, len(sampled_data)))
+    else:
+        samples_to_process = sampled_data
+
+    # Step 3: Show the actual number of sampled items (Do not estimate; count the exact number of samples that will be processed when not in debug mode.)
+    print(f"Sampled data size from raw: {len(sampled_data)} / {len(raw_data)}")  # Actual training data size
+    ```
+
+    ## 3.3 Logging Convention
+    Only print progress at 20%, 40%, 60%, 80%, 100%. No per-item logs.
+
+    ## 3.4 Output Statistics Format
+    Your script should print statistics at the end of execution:
+
+    ### Script Execution Summary (REQUIRED)
+    ```
+    # Debug mode (--debug):
+    ========== SUMMARY ==========
+    Total output samples: {actual_output}
+    Sampled data size from raw: {sampled_count} / {raw_count}
+    Debug samples processed: {debug_processed_count}
+    Output file: {{ workspace_path }}data.json
+    =============================
+
+    # Full mode (no --debug):
+    ========== SUMMARY ==========
+    Total output samples: {actual_output}
+    Sampled data size from raw: {sampled_count} / {raw_count}
+    Output file: {{ workspace_path }}data.json
+    =============================
+    ```
+
+    ### CoT Quality Statistics (REQUIRED for COT tasks)
+    ```
+    ========== COT QUALITY STATS ==========
+    COT format check: {with_think_tags}/{total} have <think> tags
+    Over-length filtered: {count} ({percentage}%)
+    Answer consistency check: {passed}/{total} passed
+    Length distribution: p25={}, p50={}, p75={}, p99={}
+    =======================================
+    ```
+
+    # Part 4: Scope Clarification (IMPORTANT)
+    **Your script should ONLY handle data processing and output data.json.**
+    - DO NOT generate training configuration files (e.g., train.yaml, training_config.json)
+    - DO NOT include training scripts or fine-tuning code
+    - DO NOT save any files other than data.json
+    - Training configuration will be handled separately by another component
+
+    # Part 5: LLM API Usage Guide
+
+    ## 5.1 Model Pool - Load Balancing
+    **All models have INDEPENDENT quotas** - distribute load evenly across models!
+
+    ```python
+    import os, json
+    import litellm; litellm.suppress_debug_info = True
+    from litellm import completion
+
+    STRONG_MODELS = json.loads(os.getenv("STRONG_MODEL_POOL", "[]"))  # CoT generation
+    WEAK_MODELS = json.loads(os.getenv("WEAK_MODEL_POOL", "[]"))      # simple/fast tasks
+
+    # Default timeout for API calls (in seconds)
+    API_TIMEOUT = 120
+
+    def call_llm(messages, models, start_idx=0, timeout=API_TIMEOUT):
+        """Load-balanced LLM call with timeout. Use start_idx to distribute across models."""
+        if not models:
+            raise RuntimeError("Model pool is empty. Set STRONG_MODEL_POOL/WEAK_MODEL_POOL env vars.")
+        last_err = None
+        for i in range(len(models)):
+            model = models[(start_idx + i) % len(models)]
+            try:
+                resp = completion(model=model, messages=messages, drop_params=True, timeout=timeout)
+                return resp.choices[0].message.content
+            except Exception as e:
+                last_err = e
+                continue
+        raise RuntimeError(f"All models failed. Last error: {last_err}")
+    ```
+
+    ## 5.2 Timeout & Efficiency (CRITICAL)
+    - Set `timeout=120` for API calls to prevent blocking on complex problems
+    - If a call still times out after retries, skip the sample and continue
+    - Prefer string/regex over LLM for validation (answer check, structure check) when possible
+
+    ## 5.3 Concurrency - CRITICAL
+    **MANDATORY**: Use `ThreadPoolExecutor(max_workers={{ api_max_workers }})` for parallel sample processing.
+    - DO NOT use `os.cpu_count()` - it limits parallelism unnecessarily
+    - The value {{ api_max_workers }} is intentional for maximizing API throughput
+    - Pass `start_idx=sample_index % len(models)` to distribute load evenly
+
+    ```python
+    with ThreadPoolExecutor(max_workers={{ api_max_workers }}) as executor:  # NOT os.cpu_count()!
+        futures = {executor.submit(process_sample, i, sample, i % len(STRONG_MODELS)): i
+                   for i, sample in enumerate(samples)}
+    ```
+
+    # Part 6: CoT Processing Guide (CRITICAL)
+    ## 6.1 CoT Output Requirement (MANDATORY)
+    **CRITICAL: ALL training data MUST include Chain-of-Thought reasoning in output field.**
+
+    ### Why This Matters
+    - Models learn to reason by seeing reasoning examples
+    - Direct answers (A/B/C/D, True/False) provide NO training signal for reasoning
+
+    ### Generation Process
+    - Ask LLM to provide step-by-step reasoning before the final answer
+    - Good: "Explain your reasoning step by step, then give the final answer"
+    - Bad: "Output with <think> tags" (models will refuse)
+    - Let LLM generate reasoning naturally
+
+    ### Output Format
+    {% if force_think_token %}
+    - Your script MUST wrap LLM output into `<think>...</think>` format
+    - Format: `<think>{reasoning}</think>{answer}`
+    - The **answer** (content AFTER `</think>`) must follow **Benchmark Description**
+    - DO NOT ask for `<think>` tags in prompts (models refuse this)
+    {% else %}
+    - If base model is NOT a thinking model (no native `<think>` token), DO NOT add `<think>` tags
+    - Output must contain step-by-step reasoning (CoT)
+    {% endif %}
+    - **Answer format must follow Benchmark Description**
+
+    ## 6.2 Post-Processing Validation
+    {% if force_think_token %}
+    - **Structure check**: `"<think>" in output and "</think>" in output`
+    {% endif %}
+    - **Content check**: Output must contain reasoning (not just direct answer)
+    - **Answer check**: Answer format must match Benchmark Description
+
+    # Part 7: Previous Failed Attempts
+    {% if queried_former_failed_knowledge|length != 0 %}
+    {% for former_failed_knowledge in queried_former_failed_knowledge %} Attempt {{ loop.index }}:
+    =====Code:=====
+    {{ former_failed_knowledge.implementation.all_codes }}
+    =====Feedback:=====
+    {{ former_failed_knowledge.feedback }}
+    {% endfor %}
+    {% endif %}
+
+    # Part 8: Response Format
+    Provide ONLY the Python script in a markdown code block:
+    ```python
+    # Your complete Python script here
+    ```
+
+    Do NOT add explanations before or after the code block.
+
+  user: |-
+    Please generate a Python script that processes the available datasets and outputs a `data.json` file in Alpaca format.
+
+    The script will be executed in two modes:
+    1. **Debug mode (coding phase):** `python {{ workspace_path }}process_data.py --debug` - process 100 samples for fast validation
+    2. **Full mode (running phase):** `python {{ workspace_path }}process_data.py` - generates all samples for training
+
+    Dataset files are located at: {{ datasets_path }}
+
+    ## Detailed Dataset Descriptions
+    {% for ds_name, ds_desc in involved_dataset_folder_desc.items() %}
+    ### Dataset: {{ ds_name }}
+    (Note: All file paths for this dataset are relative to `{{ datasets_path }}{{ ds_name }}/`)
+    {{ ds_desc }}
+    {% endfor %}
+
+    Output file should be: {{ workspace_path }}data.json
+
+    {% if latest_code %}
+    ## Previous Data Processing Script
+    ```python
+    {{ latest_code }}
+    ```
+
+    {% if latest_feedback is not none %}
+    ## Feedback on Previous Script
+    {{ latest_feedback }}
+
+    Please improve the 'Previous Data Processing Script' based on the feedback above. Do not create a new script. Consider the feedback carefully and make necessary corrections. If the feedback asks for more information or logging, make sure to include that in your revised script to help the evaluator to better assess your implementation.
+    {% endif %}
+    {% else %}
+    Please create a new Data Processing Script based on the task description.
+    {% endif %}
+
+    **IMPORTANT**: Make sure your script supports the `--debug` argument as described in the system prompt.
+
+finetune_coder:
+  system: |-
+    You are a world-class machine learning engineer specializing in large language model fine-tuning using LlamaFactory.
+    Your expertise includes creating optimal LlamaFactory configuration files for various fine-tuning scenarios.
+
+    # Scenario Description
+    {{ scenario }}
+
+    # Task Description
+    {{ task_desc }}
+
+    {% if queried_former_failed_knowledge|length != 0 %}
+    ## Previous Failed Attempts
+    {% for former_failed_knowledge in queried_former_failed_knowledge %} Attempt {{ loop.index }}:
+    =====Code:=====
+    {{ former_failed_knowledge.implementation.all_codes }}
+    =====Feedback:=====
+    {{ former_failed_knowledge.feedback }}
+    {% endfor %}
+    {% endif %}
+
+    ## Available Fine-tuning Methods
+    {{ available_methods }}
+
+    ## Shared Parameters
+    These parameters apply to all fine-tuning methods:
+    {{ shared_params }}
+
+    ## Method-Specific Parameters
+    {% for method, params_desc in methods_specific_params.items() %}
+    {{ params_desc }}
+    {% endfor %}
+
+    ## Priority Rules (CRITICAL)
+    **Task Description parameters are MANDATORY.** You MUST use exactly the hyperparameter values specified in the Task Description. Guidelines below are defaults only - they apply ONLY when task description does not specify a value.
+
+    ## Requirements
+    1. Create a LlamaFactory configuration file named `train.yaml`
+    2. Based on the hypothesis provided by the user, select the most appropriate fine-tuning method
+    3. Generate full training configuration (no sample limit)
+    4. Ensure all parameters are valid for LlamaFactory
+    5. **Adaptive Logging Configuration (CRITICAL)**:
+       - Set `logging_strategy` to 'steps' for consistent monitoring
+       - Calculate `logging_steps` adaptively:
+         * Estimate total_steps = (num_samples × num_epochs) / (batch_size × gradient_accumulation_steps × num_gpus)
+         * Target 20-50 log entries total
+    6. **Validation and Checkpoint Strategy (CRITICAL for best model selection)**:
+       - **Validation Split**: Set `val_size` to split a portion of training data for validation. Choose ratio based on dataset size and task needs.
+       - **Save Strategy**: Choose `save_strategy` ('steps' or 'epoch') based on training duration. MUST ensure `eval_strategy` == `save_strategy`.
+       - **Best Model Selection**: Use `load_best_model_at_end: true` with `save_total_limit: 1` to automatically keep and load the best checkpoint based on eval_loss. Note: `save_total_limit` will be force-injected to 1.
+    7. If the previous configuration produced an error, fix the error while staying aligned with the task. If these two goals conflict, prioritize fixing the error.
+
+    ## Configuration Principle
+    **ONLY include parameters you want to change from defaults**
+    If a parameter's default value matches your intention, OMIT it entirely
+    This prevents unnecessary dependencies and keeps configuration clean
+    Example: if `mixture_of_depths` defaults to `false` and you don't need it, DO NOT include it
+
+    ## Output Format
+    You MUST output the YAML configuration in a standard markdown code block:
+    ```yaml
+    model_name_or_path: /path/to/model
+    stage: sft
+    ...
+    ```
+
+    Do NOT add explanations before or after the YAML block.
+
+  user: |-
+    ## Path Configuration
+    - dataset_dir: "{{ datasets_path }}"
+    - output_dir: "./output" (auto-injected, you can omit this)
+    - model_name_or_path: "{{ models_path }}{{ base_model }}"
+    - tokenized_path: "{{ workspace_path }}tokenized_cache"
+
+    ## Critical Configuration Rules
+    - dataset: MUST be "processed_data" (this is the dataset name in dataset_info.json)
+    - model_name_or_path: use local model path instead of HuggingFace model identifier
+    - dataset_info.json is located at: "{{ datasets_path }}dataset_info.json" (contains the "processed_data" entry)
+    - template: NEVER set to "auto" or "none" - these are invalid values.
+      - For Qwen-series models, set to "qwen"; for Qwen3-series models specifically, set to "qwen3".
+      - For other models, DO NOT include this field (LlamaFactory auto-detects from tokenizer).
+    - tokenized_path: MUST set to "{{ workspace_path }}tokenized_cache" (datasets directory is read-only mounted)
+    - batch_size: Be aware that `auto_find_batch_size` can cause synchronization issues in multi-GPU (DDP) training. Consider setting `per_device_train_batch_size` explicitly if training hangs
+    - flash_attn: For models that support FlashAttention-2 (e.g., Qwen series, Llama series), set to "fa2" to speed up training and reduce memory usage
+    {% if deepspeed_path %}- deepspeed: If the number of GPUs > 1, use DeepSpeed with ZeRO Stage 2 or 3 for memory optimization. Specifically, set this to "{{ deepspeed_path }}ds_z3_config.json" for ZeRO Stage 3, or to "{{ deepspeed_path }}ds_z2_config.json" for ZeRO Stage 2{% endif %}
+    - **IMPORTANT Compatibility Rules**:
+      - `pissa_init: true` is NOT compatible with DeepSpeed ZeRO-3. If using ZeRO-3, do NOT set pissa_init to true
+        - If you need PiSSA initialization, use ZeRO Stage 2 instead of ZeRO Stage 3
+      - `load_best_model_at_end: true` requires `eval_strategy` == `save_strategy` (both "steps" or both "epoch"). Always set both to the same value.
+
+    {% if force_think_token %}
+    {% if has_think_token is defined and not has_think_token %}
+    ## Special Token Configuration for CoT Training
+    The base model does NOT have `<think>` token in its vocabulary.
+    To train with Chain-of-Thought reasoning format (output like `<think>reasoning</think>answer`), you MUST add special tokens AND train the new embeddings:
+    ```yaml
+    new_special_tokens: ["<think>", "</think>"]
+    resize_vocab: true
+    additional_target: embed_tokens,lm_head  # MANDATORY for LoRA/QLoRA when resize_vocab=true; full fine-tuning does not need this field.
+    ```
+    This ensures `<think>` and `</think>` are tokenized as single tokens, not split into subwords.
+    {% elif has_think_token is defined and has_think_token %}
+    ## Special Token Note
+    The base model already supports `<think>` token natively. No need to add special tokens for CoT training.
+    {% endif %}
+    {% endif %}
+    {# When force_think_token=false, no special token configuration needed #}
+
+    {% if data_stats %}
+    ## Processed Data Statistics (from debug mode)
+    {{ data_stats }}
+
+    **Configuration Guidelines based on memory estimates:**
+    - `per_device_train_batch_size`: Use the recommended value from Scenario's Memory Estimates table
+      - For long CoT training (>8K tokens), prefer batch_size=1
+      - **IMPORTANT**: Smaller batch = can fit longer sequences = better reasoning quality
+    - `gradient_accumulation_steps`: Adjust to achieve effective batch of 16-64 (batch × accum × num_gpus)
+    - `cutoff_len`: Must accommodate your CoT length target
+      - Check data p99 and ensure cutoff_len > p99
+      - For reasoning tasks, aim for cutoff_len >= 8192
+    - `num_train_epochs` / `max_steps`: **If the task description specifies a specific value, use that value.** Otherwise, for small datasets (<1000), use 3-5 epochs; for large datasets (>10000), use 1-2 epochs.
+    {% endif %}
+
+    {% if latest_code %}
+    ## Previous Configuration
+    ```yaml
+    {{ latest_code }}
+    ```
+
+    {% if latest_feedback is not none %}
+    ## Feedback on Previous Configuration
+    {{ latest_feedback }}
+
+    Please improve the configuration based on the feedback above and the hypothesis.
+    {% endif %}
+    {% else %}
+    Please create a new configuration for the model {{ base_model }} based on the hypothesis above.
+
+    **Remember to include ALL required fields:**
+    - stage: sft
+    - finetuning_type: [select appropriate method based on hypothesis]
+    - do_train: true
+    - model_name_or_path: {{ models_path }}{{ base_model }}
+    - dataset: processed_data
+    - dataset_dir: {{ datasets_path }}
+    - tokenized_path: {{ workspace_path }}tokenized_cache
+    {% endif %}
+
+  user_test_params: |-
+    Now, please provide a set of "test parameters" that will be merged into the above configuration specifically for the DEBUG/MICRO-BATCH test phase.
+    
+    The debug phase runs on a very small subset (~10 samples).
+    Override the parameters that depend on dataset size so that the YAML config can be debugged quickly.
+
+    **Example for Test Parameters:**
+    - Set `num_train_epochs` to 1.
+    - Set `max_samples` to a very small number.
+
+    **Output Format:**
+    Output ONLY the YAML block for these test parameters:
+    ```yaml
+    num_train_epochs: 1
+    ...
+    ```
+
+finetune_eval:
+  system: |-
+    You are a world-class machine learning engineer specializing in evaluating fine-tuning configurations for large language models using LlamaFactory.
+    Your expertise includes validating LlamaFactory configuration files to ensure they meet all necessary requirements for successful fine-tuning.
+    
+    You will be provided with:
+    1. A detailed description of a scenario that requires fine-tuning an LLM.
+    2. A yaml configuration file named `train.yaml` created for LlamaFactory fine-tuning.
+    3. A structured execution summary (JSON format) containing: status, exit_code, errors, training metrics, and warnings.
+    4. The files generated during the execution.
+    5. Other YAML configurations for similar tasks, which may help you give better feedback and suggest corrections.
+
+    Your task is to:
+    1. Check the execution summary to determine if the run succeeded.
+    2. Validate the provided `train.yaml` configuration file to ensure it adheres to the required standards for LlamaFactory fine-tuning using the specified method.
+    3. Provide clear and concise feedback on any issues found in the configuration file or execution logs.
+    4. Suggest specific corrections or improvements if any issues are identified.
+
+    ## Task Parameter Alignment Verification
+    You MUST verify that train.yaml parameters match task description requirements.
+
+    **IMPORTANT EXCEPTIONS - System-Managed Parameters**:
+    These parameters are automatically set by the system and should NOT be checked for alignment:
+    - {{ system_managed_params | join(", ") }}
+
+    If a parameter from task description is missing because LlamaFactory doesn't support it, this is expected, not a mismatch.
+
+    You must give a false final decision in the following cases:
+    - The execution fails with non-zero exit code.
+    - Configuration parameters (excluding system-managed ones) do NOT match task description requirements.
+    
+    {% if queried_similar_successful_knowledge|length != 0 %}
+    ### Similar Successful Implementations to help training config Improvement
+    The user has completed several similar tasks and obtained some successful implementations. These YAML configurations may not target exactly the same task, but they are similar to yours and may work well for it.
+    Refer to these successful implementations and suggest in your response how to correct the current configuration based on them.
+    ## Successful Implementations for Similar Tasks
+    {% for similar_successful_knowledge in queried_similar_successful_knowledge %}===== Similar Task {{ loop.index }}: =====
+    {{ similar_successful_knowledge.target_task.get_task_information() }}
+    =====Yaml configurations:=====
+    {{ similar_successful_knowledge.implementation.all_codes }}
+    {% endfor %} 
+    {% endif %}
+
+    # Important Notice
+    - You may find that the execution is short with limited data and iterations. This is expected as we are only validating the configuration file's correctness and not performing full-scale training. Don't treat this as a failure. Also do not put this information in your feedback.
+
+    ## Output Format
+    Please respond with your feedback in the following JSON format without anything else.
+    ```json
+    {
+        "execution": "State if run succeeded. If errors, include all messages verbatim. Classify cause: algorithm, implementation, or environment.",
+        "return_checking": "Plain text. Examine the generated files from the user input. Does the output contain a fine-tuned model or the expected artifacts? If not, specify what is missing or incorrect.",
+        "code": "Plain text. Use short simple sentences: say if approach fits task, what works, main issues, brief improvement suggestions.",
+        "final_decision": <true/false>  # Final decision on whether the configuration is acceptable for full data fine-tuning
+    }
+    ```
+
+  user: |-
+    # Scenario Information
+    {{ scenario }}
+
+    # Task Description
+    {{ task_desc }}
+
+    # Yaml Configuration File
+    ```yaml
+    {{ code_yaml }}
+    ```
+    ## Execution Summary (Structured)
+    ```json
+    {{ stdout }}
+    ```
+
+    ## Workspace Files
+    {{ workspace_files }}
+
+data_eval:
+  system: |-
+    You are a data quality expert for LLM fine-tuning using LlamaFactory.
+    Your expertise includes evaluating training data quality and validating data processing scripts.
+
+    You will evaluate:
+    1. **Data format correctness**: Alpaca format requires instruction, input (optional), output fields
+    2. **Data quality**: length distribution, duplicates, semantic reasonableness
+    3. **Alignment with task objectives**: whether the data matches what the task requires
+    4. **Code logic correctness**: whether the processing script is well-designed
+
+    ## The Main Scenario Description
+    {{ scenario }}
+
+    {% if queried_similar_successful_knowledge|length != 0 %}
+    ## Similar Successful Data Processing Examples
+    The following are successful data processing implementations for similar tasks:
+    {% for knowledge in queried_similar_successful_knowledge %}
+    ### Example {{ loop.index }}:
+    **Task:** {{ knowledge.target_task.get_task_information() }}
+    **Code:**
+    ```python
+    {{ knowledge.implementation.file_dict.get("process_data.py", "N/A") }}
+    ```
+    {% endfor %}
+    {% endif %}
+
+    ## Debug Mode Context (IMPORTANT)
+    This evaluation runs during the CODING phase in DEBUG MODE.
+    - The script is executed with `--debug` flag to process only ~100 samples for fast validation
+    - Sample count less than 100 is EXPECTED and should NOT be considered a quality issue
+    - Focus on evaluating:
+      1. Data format correctness (Alpaca format)
+      2. Data quality of the generated samples
+      3. Script logic correctness (will it work in full mode?)
+    - Do NOT fail the evaluation just because sample count is low
+
+    ## Evaluation Criteria
+    - **Format**: All samples must have non-empty instruction and output fields
+    - **Length**: instruction/output should be reasonable length (not too short or excessively long)
+    - **Duplicates**: High duplicate ratio indicates data quality issues
+    - **Semantic**: instruction should be a question/task, output should be an answer/response
+    - **Alignment**: Data should match the task's training objective
+
+    ## CRITICAL: Task Requirement Alignment Verification
+    You MUST verify that the data processing script and output match task description requirements:
+    1. EXTRACT all data processing requirements from task description
+       - Look for data format specifications (e.g., "Alpaca format", "ShareGPT format")
+       - Look for content requirements (e.g., "add system prompt", "include few-shot examples")
+       - Look for filtering criteria (e.g., "filter by answer consistency", "remove samples longer than X")
+       - Look for CoT/reasoning requirements (e.g., "generate step-by-step reasoning")
+    2. COMPARE with actual implementation and generated data samples
+    3. If ANY mismatch found between task requirements and actual output:
+       - Report the mismatch in "return_checking" field
+       - Set final_decision to FALSE
+
+    ## CoT Quality Evaluation (Task-Adaptive)
+    **IMPORTANT: CoT quality ≠ CoT length. Adapt criteria based on task type from README metadata.**
+
+    **Check README's `CoT Quality Assessment` section for `task_type` and `quality_ready` fields.**
+
+    1. **Over-length Check** (Report only):
+       - Report percentage of samples exceeding `max_position_embeddings`
+       - High over-length ratio is a warning sign, but NOT an automatic failure if the script handles it correctly
+
+    2. **Answer Consistency Check** (Informational):
+       - Note: The data processing script already filters for answer consistency
+       - If the script implements answer verification, trust its filtering logic
+       - Only flag as issue if the SCRIPT lacks answer verification logic entirely
+
+    3. **Structure Quality Check** (Task-adaptive):
+       - **Math/Code**: Look for step-by-step markers, verification, backtracking
+       - **Chemistry/Structured**: Look for JSON structure or "Step N:" format (short but structured is OK)
+       - **General**: No strict structure requirement
+
+    4. **Length Assessment** (Informational only):
+       - Report length distribution for reference
+       - Length alone should NOT determine pass/fail
+       - Different tasks have different natural length distributions
+
+    5. **Polish Quality Assessment**:
+       - All data must be polished before use
+       - If README shows `baseline_quality: high`: verify enrichment was applied
+       - If README shows `baseline_quality: low`: verify full generation/rewrite was done
+       - Check polish met the requirements in `polish_strategy`
+
+    **Include in return_checking:**
+    - "Task type: {type}, Quality ready: {ready}"
+    - "CoT stats: p50={}, over-length={X}%, structure quality={Y}%"
+    - Assessment based on task-appropriate criteria
+
+    ## Hard Check Criteria (AUTOMATIC FAIL if not met)
+    {% if force_think_token %}
+    ### 1. COT Format Verification (HARD FAIL)
+    - EVERY sample MUST contain `<think>` and `</think>` tags
+    - Content AFTER `</think>` must be non-empty
+
+    **Rejection:** "FAIL: {X} samples missing <think> tags."
+    {% else %}
+    ### 1. COT Format Verification (HARD FAIL)
+    - Output must contain reasoning content (not just a direct answer)
+    - Answer format must match **Benchmark Description**
+    - Do NOT reject for reasoning quality or answer correctness
+
+    **Rejection:** "FAIL: {X}% of samples are direct answers without reasoning."
+    {% endif %}
+
+    ### 2. Sample Count Check
+    - Debug mode should generate ~100 samples
+    - Estimated full run samples should be at most {{ upper_data_size_limit }}
+    - Reject if either criterion is not met
+
+    ## Final Decision Guidelines
+    **Core Principle: Strict on COT format, lenient on reasoning quality and answer correctness.**
+
+    - **Approve (true)** if:
+      - Script runs successfully (exit_code == 0)
+      - At least 1 sample is generated
+      {% if force_think_token %}- ALL samples have `<think>` and `</think>` tags (MANDATORY){% else %}- ALL samples contain reasoning content (not just direct answers){% endif %}
+      - Data format is correct (Alpaca format with instruction/output)
+
+    - **Reject (false)** if ANY of these:
+      - Script fails to run (exit_code != 0)
+      - Zero samples are generated
+      {% if force_think_token %}- **ANY sample missing `<think>` or `</think>` tags (HARD FAIL)**{% else %}- **ANY sample missing reasoning content (just direct answer)**{% endif %}
+      - Data format is fundamentally broken
+      - **Data does NOT match task description requirements**
+
+    - **Do NOT reject** for:
+      - Low sample count in debug mode (expected)
+      - Moderate quality variations in individual samples
+      - Length distribution not matching ideal patterns
+      - High filtering rate (script doing its job)
+  
+    ## Important Note
+    - Do not summarize the code in your feedback, and do NOT copy the task description either. Only provide new insights based on your evaluation.
+    - If the current logging information is not sufficient to identify the issues, specify in the 'code' field of your feedback what additional logging is needed. The user will provide that additional logging information in the next iteration.
+    - Do not write any code in your response, use plain text only.
+
+    ## Output Format
+    Respond with JSON only (no markdown code block):
+    {
+        "execution": "Script execution status and data generation result. Include exit code and any errors.",
+        "return_checking": "Data quality analysis: format validation, length distribution assessment, duplicate ratio, semantic issues found; Hard check criteria: does the solution meet the hard check criteria",
+        "code": "Code issues and specific improvement suggestions. What works well, what needs fixing.",
+        "final_decision": true/false
+    }
+
+  user: |-
+    # Task Description
+    {{ task_desc }}
+    {% if script_code %}
+
+    # Data Processing Script (for debugging)
+    ```python
+    {{ script_code }}
+    ```
+    {% endif %}
+    {% if stdout %}
+
+    # Execution Output ({% if exit_code != 0 %}error logs{% else %}summary{% endif %})
+    ```
+    Exit code: {{ exit_code }}
+    {{ stdout }}
+    ```
+    {% endif %}
+
+    # Data Statistics
+    ```json
+    {{ data_stats }}
+    ```
+
+    # Sample Data ({{ sample_count }} samples from total {{ total_samples }}) [DEBUG MODE]
+    ```json
+    {{ data_samples }}
+    ```
+
+runner_eval:
+  system: |-
+    You are a world-class ML engineer evaluating LLM fine-tuning results.
+
+    ## Your Task
+    Analyze the training run information and determine if the experiment succeeded.
+
+    ## Evaluation Criteria (for final_decision)
+    1. **Execution Success**: Did training complete without errors? Check exit_code and model outputs.
+    2. **Benchmark Execution**: Did benchmark run successfully? Check benchmark results availability.
+
+    ## Loss Analysis (for improvement suggestions ONLY - does NOT affect final_decision)
+    - Analyze loss trajectory: Is loss decreasing steadily? Any signs of overfitting?
+    - Use this information ONLY to provide suggestions in the "code" field
+    - Loss patterns should NEVER cause final_decision to be false
+
+    ## Error Categories (if failed)
+    - **Timeout (exit_code=124)**: Process was killed due to timeout. Check "failed_stage" and "timeout" fields in stdout:
+      - If failed_stage is "data_processing": Data processing script timed out. This is often due to LLM API calls for CoT data generation taking too long.
+      - If failed_stage is "training": Training timed out. 
+    - **OOM**: GPU memory exhaustion - suggest batch size/model changes
+    - **CUDA**: Driver/device issues - suggest environment checks
+    - **Config**: Invalid parameters - suggest specific fixes
+    - **Data**: Dataset issues - suggest data pipeline fixes
+
+    ## Output Format
+    Respond with JSON only:
+    {
+        "execution": "Execution status: SUCCESS or FAILED with category [Timeout/OOM/CUDA/Config/Data]. Include key metrics or error details.",
+        "return_checking": "If success: benchmark analysis. If failed: what failed and expected behavior.",
+        "code": "Configuration assessment and improvement suggestions",
+        "final_decision": true/false  // Set to true as long as training succeeded (exit_code=0) and benchmark ran successfully
+    }
+
+  user: |-
+    # Task Description
+    {{ task_desc }}
+
+    # Training Configuration
+    ```yaml
+    {{ config_yaml }}
+    ```
+
+    # Execution Info
+    - Exit Code: {{ exit_code }}
+    - Model Output Files: {{ model_files_status }}
+    {% if failed_stage %}- Failed Stage: {{ failed_stage }}
+    - Stage Timeout Config: {{ timeout_seconds }} seconds
+    {% endif %}
+
+    # Benchmark Results
+    ```json
+    {{ benchmark_result }}
+    ```
+
+    # Loss History (train loss and eval_loss if validation enabled)
+    ```json
+    {{ loss_history }}
+    ```
+    {% include "components.coder.finetune.prompts:runner_eval.train_output" %}
+
+  train_output: |-
+    # Training Output (key information extracted from stdout)
+    ```
+    {{ stdout }}
+    ```
diff --git a/rdagent/components/coder/finetune/unified_validator.py b/rdagent/components/coder/finetune/unified_validator.py
new file mode 100644
index 000000000..7d76be84b
--- /dev/null
+++ b/rdagent/components/coder/finetune/unified_validator.py
@@ -0,0 +1,304 @@
+"""
+Simplified LLM Fine-tuning Configuration Validator
+
+Validation steps:
+1. Parameter filtering and injection - Remove unsupported parameters, then inject system-managed ones
+2. Micro-batch testing - Runtime validation with small dataset
+"""
+
+import json
+import re
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Dict, List, Optional, Set
+
+import yaml
+
+from rdagent.components.coder.finetune.conf import (
+    FT_DEBUG_YAML_FILE_NAME,
+    FT_TEST_PARAMS_FILE_NAME,
+    get_ft_env,
+    get_workspace_prefix,
+)
+from rdagent.core.experiment import FBWorkspace
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.finetune.scen.llama_factory_manager import LLaMAFactory_manager
+
+DIRNAME = Path(__file__).absolute().resolve().parent
+
+# System-managed parameters that are automatically injected during validation.
+# These should NOT be checked for alignment in eval prompts.
+# Single source of truth: modify here to change injected parameters.
+SYSTEM_MANAGED_PARAMS = {
+    "overwrite_cache": True,  # Avoid HF datasets cache lock contention
+    "save_only_model": True,  # Save disk space
+    "save_total_limit": 1,  # Limit checkpoint count to save disk space
+    "output_dir": "./output",  # Standardize model output location
+    "per_device_eval_batch_size": 1,  # Prevent OOM during evaluation
+}
+
+
+@dataclass
+class ValidationResult:
+    """Configuration validation result"""
+
+    success: bool
+    filtered_config: str
+    execution_output: str = ""  # Parsed/summarized output for LLM
+    raw_stdout: str = ""  # Full raw stdout for UI display
+    errors: List[str] = field(default_factory=list)
+    execution_time: float = 0.0
+
+
+class LLMConfigValidator:
+    """LLM configuration validator. Validation steps:
+
+    1. Parameter filtering and injection - Remove unsupported parameters, then inject system-managed ones
+    2. Micro-batch test - Runtime validation with small dataset
+
+    The micro-batch test inherently validates completeness, so no separate completeness check is needed.
+    """
+
+    def __init__(self):
+        self._supported_params_cache: Optional[Set[str]] = None
+
+    def validate_and_test(self, config_yaml: str, workspace: FBWorkspace, env) -> ValidationResult:
+        """Validate a config: filter unsupported parameters, inject system-managed ones, then run the micro-batch test"""
+        start_time = time.time()
+
+        # Step 1: Parameter filtering
+        filtered_config, removed_params = self._filter_parameters(config_yaml)
+
+        # Step 2: Inject required parameters for multi-task environments
+        injected_config = self._inject_required_parameters(filtered_config)
+
+        # Step 3: Micro-batch testing (validates everything at runtime)
+        result = self._run_micro_batch_test(injected_config, workspace, env)
+        result.execution_time = time.time() - start_time
+
+        # Add filtered params info to execution_output for agent learning
+        if removed_params:
+            filter_info = (
+                f"\n\n[Filtered Parameters] {len(removed_params)} unsupported params removed: {removed_params}"
+            )
+            result.execution_output += filter_info
+
+        return result
+
+    def _filter_parameters(self, config_yaml: str) -> tuple[str, List[str]]:
+        """Filter configuration parameters to only include supported ones.
+
+        Returns:
+            tuple: (filtered_yaml, removed_params_list)
+        """
+        config_dict = yaml.safe_load(config_yaml)
+        if not isinstance(config_dict, dict):
+            return config_yaml, []
+
+        supported_params = self._get_supported_parameters()
+
+        filtered_config = {}
+        removed_params = []
+        for k, v in config_dict.items():
+            if k in supported_params:
+                filtered_config[k] = v
+            else:
+                removed_params.append(k)
+
+        if removed_params:
+            logger.info(f"Filtered out {len(removed_params)} unsupported parameters: {removed_params}")
+
+        return yaml.dump(filtered_config, default_flow_style=False, sort_keys=False), removed_params
+
+    def _inject_required_parameters(self, config_yaml: str) -> str:
+        """Inject required parameters for multi-task environments.
+
+        Uses SYSTEM_MANAGED_PARAMS as the single source of truth.
+        """
+        config = yaml.safe_load(config_yaml)
+        if not isinstance(config, dict):
+            return config_yaml
+
+        config.update(SYSTEM_MANAGED_PARAMS)
+
+        logger.info(f"Injected required parameters: {SYSTEM_MANAGED_PARAMS}")
+        return yaml.dump(config, default_flow_style=False, sort_keys=False)
+
+    def _get_supported_parameters(self) -> Set[str]:
+        """Get supported parameters from LlamaFactory Manager"""
+        if self._supported_params_cache is not None:
+            return self._supported_params_cache
+
+        all_params = LLaMAFactory_manager.get_parameters()
+
+        # Extract all parameter names from all parameter types (including nested structures)
+        supported_params = set()
+        for param_type, params_dict in all_params.items():
+            if isinstance(params_dict, dict):
+                # Recursively extract parameter names from nested dictionaries
+                for key, value in params_dict.items():
+                    if isinstance(value, dict) and "name" in value:
+                        # This is a parameter definition with metadata
+                        supported_params.add(key)
+                    elif isinstance(value, dict):
+                        # This is a nested category (e.g., BaseModelArguments, LoraArguments)
+                        # Extract parameter names from the nested structure
+                        for nested_key, nested_value in value.items():
+                            if isinstance(nested_value, dict) and "name" in nested_value:
+                                supported_params.add(nested_key)
+
+        if not supported_params:
+            raise RuntimeError("No parameters found in LlamaFactory Manager")
+
+        logger.info(f"Loaded {len(supported_params)} parameters from LlamaFactory Manager")
+        self._supported_params_cache = supported_params
+        return supported_params
+
+    def _parse_execution_log(self, stdout: str, exit_code: int, failed_stage: Optional[str] = None) -> str:
+        """Parse execution log and extract key information for LLM evaluation.
+
+        Reduces log from ~36k tokens to ~500 tokens by extracting only:
+        - Status and exit code
+        - Error messages (if any)
+        - Training metrics (if successful)
+        - Warnings (limited)
+        - Timeout and stage information (if applicable)
+
+        Args:
+            stdout: The execution output
+            exit_code: The process exit code
+            failed_stage: Which stage failed - "data_processing" or "training"
+        """
+        result = {
+            "status": "success" if exit_code == 0 else "failed",
+            "exit_code": exit_code,
+        }
+
+        # Handle timeout (exit_code 124)
+        if exit_code == 124:
+            result["timeout"] = True
+            if failed_stage:
+                result["failed_stage"] = failed_stage
+
+        # 1. Extract error information (highest priority)
+        # Strategy: extract rank0's error block (each line prefixed with [rank0]:)
+        error_text = None
+
+        # Method A: Extract [rank0]: prefixed lines and reconstruct traceback
+        rank0_lines = re.findall(r"\[rank0\]:[^\n]+", stdout)
+        if rank0_lines:
+            rank0_block = "\n".join(line.replace("[rank0]: ", "").replace("[rank0]:", "") for line in rank0_lines)
+            # Find traceback in rank0 block
+            tb_match = re.search(
+                r"Traceback \(most recent call last\):.*?(?:Error|Exception):[^\n]+", rank0_block, re.DOTALL
+            )
+            if tb_match:
+                error_text = tb_match.group(0)
+
+        # Method B: Fallback to generic traceback (no rank prefix)
+        # Use findall to get ALL tracebacks, then keep the first one (root cause)
+        if not error_text:
+            all_tracebacks = re.findall(
+                r"Traceback \(most recent call last\):.*?(?:Error|Exception):[^\n]+", stdout, re.DOTALL
+            )
+            if all_tracebacks:
+                # First traceback is usually the root cause
+                error_text = all_tracebacks[0]
+                if len(all_tracebacks) > 1:
+                    error_text += f"\n\n[Note: {len(all_tracebacks)} total errors, showing root cause]"
+
+        if error_text:
+            # Limit length but keep from the END (actual error type/message is at the end of traceback)
+            result["error"] = error_text[-4000:] if len(error_text) > 4000 else error_text
+
+        # 2. Extract training information
+        if "Running training" in stdout:
+            result["training_started"] = True
+
+            # Extract training config
+            # NOTE: we may have log like "Num examples = 1,000,000" and "Num Epochs = 1,000"; So we need to handle ","
+            num_examples = re.search(r"Num examples\s*=\s*([\d,]+)", stdout)
+            num_epochs = re.search(r"Num Epochs\s*=\s*([\d,]+)", stdout)
+            if num_examples:
+                result["num_examples"] = int(num_examples.group(1).replace(",", ""))
+            if num_epochs:
+                result["num_epochs"] = int(num_epochs.group(1).replace(",", ""))
+
+            # Extract final metrics (Python dict repr printed by the trainer, not strict JSON)
+            final_metrics = re.search(r"\{'train_runtime':[^}]+\}", stdout)
+            if final_metrics:
+                try:
+                    metrics = eval(final_metrics.group(0))  # NOTE: evaluates log text; prefer ast.literal_eval
+                    result["final_metrics"] = {
+                        "train_loss": metrics.get("train_loss"),
+                        "train_runtime": metrics.get("train_runtime"),
+                        "train_samples_per_second": metrics.get("train_samples_per_second"),
+                    }
+                except Exception:
+                    pass
+
+            # Check completion
+            if "Training completed" in stdout:
+                result["completed"] = True
+
+        # 3. Extract warnings (limit to 20)
+        warnings = re.findall(r"\[WARNING[^\]]*\][^\n]+", stdout)
+        if warnings:
+            result["warnings"] = list(dict.fromkeys(warnings))[:20]  # de-duplicate, keep first-seen order
+
+        # 4. Fallback: if parsing failed, include truncated raw log
+        if not result.get("error") and not result.get("training_started"):
+            result["raw_log_tail"] = stdout[-2000:] if len(stdout) > 2000 else stdout
+
+        return json.dumps(result, indent=2, ensure_ascii=False)
+
+    def _run_micro_batch_test(self, config_yaml: str, workspace: FBWorkspace, env) -> ValidationResult:
+        """Run micro-batch training test for runtime validation"""
+        result = ValidationResult(success=True, filtered_config=config_yaml)
+        ws_prefix = get_workspace_prefix(env)
+
+        # Create micro-batch test configuration
+        config = yaml.safe_load(config_yaml)
+        if not isinstance(config, dict):
+            result.success = False
+            result.execution_output = "Invalid YAML configuration"
+            result.errors.append("Invalid configuration for micro-batch test")
+            return result
+
+        test_config = config.copy()
+
+        # Load extra test parameters from workspace (generated by coder in 2nd turn)
+        extra_test_params = yaml.safe_load(workspace.file_dict[FT_TEST_PARAMS_FILE_NAME])
+
+        # Merge extra test parameters (overrides previous settings)
+        if extra_test_params:
+            test_config.update(extra_test_params)
+
+        # Run micro-batch training
+        workspace.inject_files(**{FT_DEBUG_YAML_FILE_NAME: yaml.dump(test_config, default_flow_style=False)})
+        training_result = workspace.run(
+            env=env,
+            entry=f"llamafactory-cli train {FT_DEBUG_YAML_FILE_NAME}",
+        )
+
+        # Remove micro-batch test files
+        workspace.remove_files([FT_DEBUG_YAML_FILE_NAME, FT_TEST_PARAMS_FILE_NAME])
+
+        # Parse and store structured execution output (reduces ~36k tokens to ~500)
+        raw_stdout = training_result.stdout if training_result.stdout else ""
+        result.raw_stdout = raw_stdout  # Keep full log for UI
+        result.execution_output = self._parse_execution_log(raw_stdout, training_result.exit_code)
+
+        # Check results
+        progress_indicators = ["train_loss", "Training:", "Epoch", "loss:", "step"]
+        has_progress = any(ind.lower() in raw_stdout.lower() for ind in progress_indicators)
+
+        if training_result.exit_code == 0 and has_progress:
+            logger.info("Micro-batch test passed")
+            result.success = True
+        else:
+            result.success = False
+            result.errors.append(f"Micro-batch test failed (exit_code={training_result.exit_code})")
+
+        return result
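Review note on `_parse_execution_log`: the final-metrics branch calls `eval()` on text taken from the training log. Below is a minimal sketch of the same extraction using `ast.literal_eval`, which accepts only Python literals. The helper names `parse_final_metrics` and `parse_count` and the sample log are hypothetical and not part of this PR; the regular expressions mirror the ones in the validator.

```python
import ast
import re
from typing import Optional


def parse_final_metrics(stdout: str) -> Optional[dict]:
    """Return selected metrics from the trainer's final summary dict, if present."""
    match = re.search(r"\{'train_runtime':[^}]+\}", stdout)
    if match is None:
        return None
    try:
        # literal_eval only accepts Python literals, so log text cannot run code.
        metrics = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    keys = ("train_loss", "train_runtime", "train_samples_per_second")
    return {key: metrics.get(key) for key in keys}


def parse_count(stdout: str, label: str) -> Optional[int]:
    """Parse counters such as 'Num examples = 1,000,000', dropping the commas."""
    match = re.search(rf"{re.escape(label)}\s*=\s*([\d,]+)", stdout)
    return int(match.group(1).replace(",", "")) if match else None


# Hypothetical log excerpt, used only for illustration.
sample_log = (
    "  Num examples = 1,000,000\n"
    "{'train_runtime': 12.5, 'train_samples_per_second': 80.0, 'train_loss': 0.42}\n"
)
print(parse_final_metrics(sample_log))
print(parse_count(sample_log, "Num examples"))
```

Catching only `ValueError` and `SyntaxError` also keeps unexpected failures visible, which matches the "let errors propagate" coding principle stated in `.ai/task/nav.md`.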
diff --git a/rdagent/components/coder/rl/__init__.py b/rdagent/components/coder/rl/__init__.py
new file mode 100644
index 000000000..4d9bc9028
--- /dev/null
+++ b/rdagent/components/coder/rl/__init__.py
@@ -0,0 +1 @@
+from rdagent.components.coder.rl.costeer import RLCoSTEER
\ No newline at end of file
diff --git a/rdagent/components/coder/rl/costeer.py b/rdagent/components/coder/rl/costeer.py
new file mode 100644
index 000000000..dd24d444b
--- /dev/null
+++ b/rdagent/components/coder/rl/costeer.py
@@ -0,0 +1,134 @@
+"""RL CoSTEER - Code generation component for RL post-training"""
+
+from typing import Generator
+
+from rdagent.components.coder.CoSTEER import CoSTEER
+from rdagent.components.coder.CoSTEER.config import CoSTEERSettings
+from rdagent.components.coder.CoSTEER.evolvable_subjects import EvolvingItem
+from rdagent.components.coder.CoSTEER.evaluators import CoSTEERMultiEvaluator, CoSTEERSingleFeedback
+from rdagent.components.coder.CoSTEER.knowledge_management import CoSTEERQueriedKnowledge
+from rdagent.core.evolving_agent import EvolvingStrategy, EvoStep
+from rdagent.core.experiment import FBWorkspace, Task
+from rdagent.core.scenario import Scenario
+from rdagent.oai.llm_utils import APIBackend
+from rdagent.utils.agent.tpl import T
+from rdagent.log import rdagent_logger as logger
+
+
+class RLCoderCoSTEERSettings(CoSTEERSettings):
+    """RL Coder settings."""
+    pass
+
+
+class RLEvolvingStrategy(EvolvingStrategy):
+    """RL code generation strategy using LLM."""
+
+    def __init__(self, scen: Scenario, settings: CoSTEERSettings):
+        self.scen = scen
+        self.settings = settings
+
+    def evolve_iter(
+        self,
+        *,
+        evo: EvolvingItem,
+        queried_knowledge: CoSTEERQueriedKnowledge | None = None,
+        evolving_trace: list[EvoStep] = [],
+        **kwargs,
+    ) -> Generator[EvolvingItem, EvolvingItem, None]:
+        """Generate code for all tasks using LLM."""
+        for index, target_task in enumerate(evo.sub_tasks):
+            code = self._generate_code(target_task, evolving_trace)
+            if evo.sub_workspace_list[index] is None:
+                evo.sub_workspace_list[index] = evo.experiment_workspace
+            evo.sub_workspace_list[index].inject_files(**code)
+
+        evo = yield evo
+        return
+
+    def _generate_code(self, task: Task, evolving_trace: list[EvoStep] | None = None) -> dict[str, str]:
+        """Generate RL training code using LLM."""
+        from rdagent.app.rl.conf import RL_RD_SETTING
+
+        # Get feedback from the previous round
+        feedback = None
+        if evolving_trace:
+            last_step = evolving_trace[-1]
+            if hasattr(last_step, 'feedback') and last_step.feedback:
+                feedback = str(last_step.feedback)
+
+        # Build the prompts
+        system_prompt = T(".prompts:rl_coder.system").r()
+        user_prompt = T(".prompts:rl_coder.user").r(
+            task_description=task.description if hasattr(task, 'description') else str(task),
+            base_model=RL_RD_SETTING.base_model or "",
+            benchmark=RL_RD_SETTING.benchmark or "",
+            hypothesis=str(task.name) if hasattr(task, 'name') else "Train RL model",
+            feedback=feedback,
+        )
+
+        # Call the LLM
+        session = APIBackend().build_chat_session(session_system_prompt=system_prompt)
+        code = session.build_chat_completion(
+            user_prompt=user_prompt,
+            json_mode=False,
+            code_block_language="python",
+        )
+        logger.info(f"LLM generated code:\n{code[:200]}...")
+        return {"main.py": code}
+
+    def _mock_code(self) -> dict[str, str]:
+        """Fallback mock code."""
+        return {"main.py": '''import gymnasium as gym
+from stable_baselines3 import PPO
+
+env = gym.make("CartPole-v1")
+model = PPO("MlpPolicy", env, verbose=1)
+model.learn(total_timesteps=1000)
+model.save("ppo_cartpole")
+print("Training completed!")
+'''}
+
+
+class RLCoderEvaluator:
+    """RL code evaluator (mock implementation)."""
+
+    def __init__(self, scen: Scenario) -> None:
+        self.scen = scen
+
+    def evaluate(
+        self,
+        target_task: Task,
+        implementation: FBWorkspace,
+        gt_implementation: FBWorkspace | None,
+        queried_knowledge: CoSTEERQueriedKnowledge | None = None,
+    ) -> CoSTEERSingleFeedback:
+        """Evaluate RL code. Currently returns mock success."""
+        # TODO: implement the real evaluation logic
+        return CoSTEERSingleFeedback(
+            execution="Mock: executed successfully",
+            return_checking=None,
+            code="Mock: code looks good",
+            final_decision=True,
+        )
+
+
+class RLCoSTEER(CoSTEER):
+    """RL CoSTEER - orchestrates code generation and evaluation."""
+
+    def __init__(self, scen: Scenario, *args, **kwargs) -> None:
+        settings = RLCoderCoSTEERSettings()
+        eva = CoSTEERMultiEvaluator([RLCoderEvaluator(scen=scen)], scen=scen)
+        es = RLEvolvingStrategy(scen=scen, settings=settings)
+
+        super().__init__(
+            *args,
+            settings=settings,
+            eva=eva,
+            es=es,
+            scen=scen,
+            max_loop=1,
+            stop_eval_chain_on_fail=False,
+            with_knowledge=False,
+            knowledge_self_gen=False,
+            **kwargs,
+        )
diff --git a/rdagent/components/coder/rl/prompts.yaml b/rdagent/components/coder/rl/prompts.yaml
new file mode 100644
index 000000000..00da21e8e
--- /dev/null
+++ b/rdagent/components/coder/rl/prompts.yaml
@@ -0,0 +1,94 @@
+rl_coder:
+  system: |-
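+    You are an RL post-training expert responsible for generating training code.
+
+    ## Runtime Environment
+    The code is deployed to `$WORKSPACE/code/main.py` and executed from that directory.
+    The framework sets the following environment variables; read them in code with `os.environ["..."]`:
+    - `MODEL_PATH`: absolute path to the base model (read-only)
+    - `DATA_PATH`: absolute path to the training data directory (read-only)
+    - `OUTPUT_DIR`: absolute path to the model output directory (`$WORKSPACE/output/`)
+    - `GRADING_SERVER_URL`: grading service address (the system submits automatically after training; the code does not need to call it)
+
+    ## Framework: trl (version 0.27+)
+
+    ## Available Algorithms
+    - **GRPO**: recommended; needs only a reward function, no pre-built preference pairs
+    - **DPO**: requires (prompt, chosen, rejected) preference pairs
+
+    ## API Essentials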
+
+    ### GRPOTrainer
+    ```python
+    from trl import GRPOConfig, GRPOTrainer
+
+    trainer = GRPOTrainer(
+        model=MODEL_PATH,                # model path
+        reward_funcs=reward_fn,          # reward function
+        args=GRPOConfig(
+            output_dir=OUTPUT_DIR,       # output directory
+            ...
+        ),
+        train_dataset=dataset,           # must have a "prompt" column
+        processing_class=tokenizer,
+    )
+    ```
+
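+    ### Reward function signature (important!)
+    ```python
+    def reward_fn(completions, answer, **kwargs):
+        # completions: list[str] - responses generated by the model
+        # answer: list[str] - the dataset's "answer" column (passed in automatically)
+        # kwargs: other dataset columns (e.g. question)
+        return [float(...) for ...]  # return a list of rewards
+    ```
+
+    ### Key GRPOConfig parameters
+    - `num_generations`: number of samples per prompt, must be >= 2
+    - `max_completion_length`: maximum generation length
+    - `per_device_train_batch_size`: batch size
+
+    ## Output Requirements
+    - Generate a complete `main.py` that can be run directly
+    - Obtain all paths through `os.environ`; **do not hard-code paths**
+    - Load data from the jsonl files under `$DATA_PATH`
+    - Save the model to `$OUTPUT_DIR` (a subdirectory such as `$OUTPUT_DIR/v1` is allowed)
+
+    ## Evaluation Mechanism
+    After training completes, the system automatically submits the latest model under `$OUTPUT_DIR` to the Grading Server.
+    - Model present → evaluated automatically, a score is returned
+    - Empty → evaluation is skipped
+    The code only needs to train and save the model; it does **not** need to call the evaluation API itself.
+
+    ## Code Template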
+    ```python
+    import os
+    MODEL_PATH = os.environ["MODEL_PATH"]
+    DATA_PATH = os.environ["DATA_PATH"]
+    OUTPUT_DIR = os.environ["OUTPUT_DIR"]
+    # ... training logic ...
+    trainer.save_model(OUTPUT_DIR)
+    ```
+
+  user: |-
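+    ## Task
+    {{ task_description }}
+
+    ## Base Model
+    - Name: {{ base_model }}
+    - Path: read from the $MODEL_PATH environment variable
+
+    ## Training Data
+    - Dataset: {{ benchmark }}
+    - Path: read from the $DATA_PATH environment variable
+
+    ## Hypothesis
+    {{ hypothesis }}
+
+    {% if feedback %}
+    ## Feedback from the Previous Round
+    {{ feedback }}
+    {% endif %}
+
+    Based on the data format and the hypothesis, generate the complete training code (main.py).
+    Note: obtain all paths through os.environ; do not hard-code them.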
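Review note on the `rl_coder` prompt: the `reward_fn(completions, answer, **kwargs)` signature is described only in outline. The sketch below shows one possible reward function with that shape, using the standard library only. It assumes completions arrive as plain strings and that each `answer` entry contains a numeric final answer; the matching rule (compare the last number found in each string) and the helper `last_number` are hypothetical illustrations, not something this PR defines.

```python
import re
from typing import List, Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number in text, ignoring thousands separators."""
    found = _NUMBER.findall(text.replace(",", ""))
    return float(found[-1]) if found else None


def reward_fn(completions: List[str], answer: List[str], **kwargs) -> List[float]:
    """Give 1.0 when the completion's last number equals the reference answer, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, answer):
        predicted = last_number(completion)
        expected = last_number(gold)
        matched = predicted is not None and predicted == expected
        rewards.append(1.0 if matched else 0.0)
    return rewards


# Hypothetical batch, used only for illustration.
print(reward_fn(["The total is 42.", "I think it is 7", "no idea"], ["42", "8", "3"]))
```

A binary exact-match reward is the simplest choice; whether it suits a given benchmark depends on the answer format of that benchmark's data.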
diff --git a/rdagent/components/workflow/conf.py b/rdagent/components/workflow/conf.py
index 0f37c6e18..dec406b36 100644
--- a/rdagent/components/workflow/conf.py
+++ b/rdagent/components/workflow/conf.py
@@ -7,14 +7,14 @@ class BasePropSetting(ExtendedBaseSettings):
     You can add following config in the subclass to distinguish the environment variables.
     """
 
-    scen: str = ""
-    knowledge_base: str = ""
-    knowledge_base_path: str = ""
-    hypothesis_gen: str = ""
-    interactor: str = ""
-    hypothesis2experiment: str = ""
-    coder: str = ""
-    runner: str = ""
-    summarizer: str = ""
+    scen: str | None = None
+    knowledge_base: str | None = None
+    knowledge_base_path: str | None = None
+    hypothesis_gen: str | None = None
+    interactor: str | None = None
+    hypothesis2experiment: str | None = None
+    coder: str | None = None
+    runner: str | None = None
+    summarizer: str | None = None
 
     evolving_n: int = 10
diff --git a/rdagent/components/workflow/rd_loop.py b/rdagent/components/workflow/rd_loop.py
index b7f217b88..4d2768dae 100644
--- a/rdagent/components/workflow/rd_loop.py
+++ b/rdagent/components/workflow/rd_loop.py
@@ -30,14 +30,30 @@ def __init__(self, PROP_SETTING: BasePropSetting):
         logger.log_object(scen, tag="scenario")
         logger.log_object(PROP_SETTING.model_dump(), tag="RDLOOP_SETTINGS")
         logger.log_object(RD_AGENT_SETTINGS.model_dump(), tag="RD_AGENT_SETTINGS")
-        self.hypothesis_gen: HypothesisGen = import_class(PROP_SETTING.hypothesis_gen)(scen)
+        self.hypothesis_gen: HypothesisGen = (
+            import_class(PROP_SETTING.hypothesis_gen)(scen)
+            if hasattr(PROP_SETTING, "hypothesis_gen") and PROP_SETTING.hypothesis_gen
+            else None
+        )
 
-        self.hypothesis2experiment: Hypothesis2Experiment = import_class(PROP_SETTING.hypothesis2experiment)()
+        self.hypothesis2experiment: Hypothesis2Experiment = (
+            import_class(PROP_SETTING.hypothesis2experiment)()
+            if hasattr(PROP_SETTING, "hypothesis2experiment") and PROP_SETTING.hypothesis2experiment
+            else None
+        )
 
-        self.coder: Developer = import_class(PROP_SETTING.coder)(scen)
-        self.runner: Developer = import_class(PROP_SETTING.runner)(scen)
+        self.coder: Developer = (
+            import_class(PROP_SETTING.coder)(scen) if hasattr(PROP_SETTING, "coder") and PROP_SETTING.coder else None
+        )
+        self.runner: Developer = (
+            import_class(PROP_SETTING.runner)(scen) if hasattr(PROP_SETTING, "runner") and PROP_SETTING.runner else None
+        )
 
-        self.summarizer: Experiment2Feedback = import_class(PROP_SETTING.summarizer)(scen)
+        self.summarizer: Experiment2Feedback = (
+            import_class(PROP_SETTING.summarizer)(scen)
+            if hasattr(PROP_SETTING, "summarizer") and PROP_SETTING.summarizer
+            else None
+        )
         self.trace = Trace(scen=scen)
         super().__init__()
 
@@ -72,21 +88,21 @@ def running(self, prev_out: dict[str, Any]):
         return exp
 
     def feedback(self, prev_out: dict[str, Any]):
+        # TODO: the logic branch of exception should be moved to summarizer
         e = prev_out.get(self.EXCEPTION_KEY, None)
         if e is not None:
             feedback = HypothesisFeedback(
-                observations=str(e),
-                hypothesis_evaluation="",
-                new_hypothesis="",
-                reason="",
+                reason=str(e),
                 decision=False,
+                code_change_summary="",
+                acceptable=False,
             )
-            logger.log_object(feedback, tag="feedback")
-            self.trace.hist.append((prev_out["direct_exp_gen"]["exp_gen"], feedback))
         else:
             feedback = self.summarizer.generate_feedback(prev_out["running"], self.trace)
-            logger.log_object(feedback, tag="feedback")
-            self.trace.hist.append((prev_out["running"], feedback))
+        logger.log_object(feedback, tag="feedback")
+        return feedback
 
-    # TODO: `def record(self, prev_out: dict[str, Any]):` has already been hard coded into LoopBase
-    # So we should add it into RDLoop class to make sure every RDLoop Sub Class be aware of it.
+    def record(self, prev_out: dict[str, Any]):
+        feedback = prev_out["feedback"]
+        exp = prev_out.get("running") or prev_out.get("coding") or prev_out.get("direct_exp_gen", {}).get("exp_gen")
+        self.trace.sync_dag_parent_and_hist((exp, feedback), prev_out[self.LOOP_IDX_KEY])
diff --git a/rdagent/core/evaluation.py b/rdagent/core/evaluation.py
index 077830674..fc4fef64e 100644
--- a/rdagent/core/evaluation.py
+++ b/rdagent/core/evaluation.py
@@ -49,7 +49,6 @@ class Evaluator(ABC):
             2. advanced/summarized feedback information. (evaluate will handle this)
     """
 
-    @abstractmethod
     def evaluate(
         self,
         eo: EvaluableObj,
diff --git a/rdagent/core/evolving_agent.py b/rdagent/core/evolving_agent.py
index 145bb87af..6c4911854 100644
--- a/rdagent/core/evolving_agent.py
+++ b/rdagent/core/evolving_agent.py
@@ -9,7 +9,13 @@
 from tqdm import tqdm
 
 from rdagent.core.evaluation import EvaluableObj, Evaluator, Feedback
-from rdagent.core.evolving_framework import EvolvableSubjects, EvolvingStrategy, EvoStep
+from rdagent.core.evolving_framework import (
+    EvolvableSubjects,
+    EvolvingStrategy,
+    EvoStep,
+    IterEvaluator,
+    RAGStrategy,
+)
 from rdagent.log import rdagent_logger as logger
 
 ASpecificEvaluator = TypeVar("ASpecificEvaluator", bound=Evaluator)
@@ -26,22 +32,44 @@ def __init__(self, max_loop: int, evolving_strategy: EvolvingStrategy) -> None:
     def multistep_evolve(
         self,
         evo: ASpecificEvolvableSubjects,
-        eva: ASpecificEvaluator | Feedback,
+        eva: ASpecificEvaluator,
     ) -> Generator[ASpecificEvolvableSubjects, None, None]:
         """
         yield EvolvableSubjects for caller for easier process control and logging.
         """
 
 
-class RAGEvaluator(Evaluator):
+class RAGEvaluator(IterEvaluator):
 
     @abstractmethod
-    def evaluate(
-        self,
-        eo: EvaluableObj,
-        queried_knowledge: object = None,
-    ) -> Feedback:
-        raise NotImplementedError
+    def evaluate_iter(
+        self, queried_knowledge: object = None, evolving_trace: list[EvoStep] = []
+    ) -> Generator[Feedback, EvaluableObj | None, Feedback]:
+        """
+
+        1) It evaluates each implemented part in turn and yields the feedback for that part.
+        2) Finally, it summarizes all the feedback and returns an overall feedback.
+
+        Sending None will stop the evaluation chain and just return the overall feedback.
+
+        Assumptions:
+        - The evaluation process will make modifications on evo in-place.
+
+        A typical implementation of this method is:
+
+        .. code-block:: python
+
+            evo = yield Feedback()  # the first yield only receives the sent evo; its feedback is a placeholder
+            assert evo is not None
+            for partial_eval_func in self.evaluate_func_iter():
+                partial_fb = partial_eval_func(evo, queried_knowledge, evolving_trace)
+                # return the partial feedback and receive the evolved solution for next iteration
+                yield partial_fb
+
+            final_fb = get_final_fb(...)
+            return final_fb
+
+        """
 
 
 class RAGEvoAgent(EvoAgent[RAGEvaluator, ASpecificEvolvableSubjects], Generic[ASpecificEvolvableSubjects]):
@@ -50,27 +78,62 @@ def __init__(
         self,
         max_loop: int,
         evolving_strategy: EvolvingStrategy,
-        rag: Any,
+        rag: RAGStrategy,
         *,
         with_knowledge: bool = False,
-        with_feedback: bool = True,
         knowledge_self_gen: bool = False,
         enable_filelock: bool = False,
         filelock_path: str | None = None,
+        stop_eval_chain_on_fail: bool = False,
     ) -> None:
+        """
+        Initialize a Retrieval-Augmented Generation (RAG) based evolutionary agent.
+
+        Args:
+            max_loop (int): Maximum number of evolution loops to execute.
+            evolving_strategy (EvolvingStrategy): Strategy defining how the subjects evolve each step.
+            rag (RAGStrategy): Retrieval-Augmented Generation strategy instance used for knowledge querying and/or creation.
+            with_knowledge (bool, optional): If True, retrieves knowledge from RAG for each evolution step. Defaults to False.
+            knowledge_self_gen (bool, optional): If True, enables RAG to load, generate, and dump new knowledge from the evolving trace. Defaults to False.
+            enable_filelock (bool, optional): If True, enables file-based lock when accessing/modifying the RAG knowledge base. Defaults to False.
+            filelock_path (str | None, optional): Path to the lock file when enable_filelock is True. Defaults to None.
+
+        This class coordinates the multi-step evolution process with optional:
+            - Knowledge retrieval before evolving.
+            - Feedback collection after evolving.
+            - Self-generation and persisting of knowledge base updates.
+
+        Evolving trace is maintained across steps for adaptive strategies and knowledge generation.
+        """
         super().__init__(max_loop, evolving_strategy)
         self.rag = rag
         self.evolving_trace: list[EvoStep[ASpecificEvolvableSubjects]] = []
         self.with_knowledge = with_knowledge
-        self.with_feedback = with_feedback
         self.knowledge_self_gen = knowledge_self_gen
         self.enable_filelock = enable_filelock
         self.filelock_path = filelock_path
+        self.stop_eval_chain_on_fail = stop_eval_chain_on_fail
+
+    def _get_overall_feedback(
+        self, eva_iter: Generator[Any, Any, Feedback], evo: EvolvableSubjects, eval_failed_happened: bool
+    ) -> Feedback:
+        """get overall feedback from eva_iter"""
+        try:
+            if self.stop_eval_chain_on_fail and eval_failed_happened:
+                fb = eva_iter.send(
+                    None
+                )  # send the signal to skip the rest partial evaluation and return the overall feedback directly
+            else:
+                fb = eva_iter.send(evo)
+                if not fb:
+                    eval_failed_happened = True
+        except StopIteration as e:
+            return e.value
 
     def multistep_evolve(
         self,
         evo: ASpecificEvolvableSubjects,
-        eva: RAGEvaluator | Feedback,
+        eva: RAGEvaluator,
     ) -> Generator[ASpecificEvolvableSubjects, None, None]:
         for evo_loop_id in tqdm(range(self.max_loop), "Implementing"):
             with logger.tag(f"evo_loop_{evo_loop_id}"):
@@ -80,22 +143,35 @@ def multistep_evolve(
                     # TODO: Putting the evolving trace in here doesn't actually work
                     queried_knowledge = self.rag.query(evo, self.evolving_trace)
 
-                # 2. evolve
-                evo = self.evolving_strategy.evolve(
+                # 2. evolve:
+                # A complete solution of an evo can be broken down into multiple evolving steps.
+                # Each evolving step can be evaluated separately.
+                # Assumptions:
+                # - if we want to stop at some point of the implementation, there must be a corresponding evaluator (otherwise, stopping is meaningless)
+                evo_iter = self.evolving_strategy.evolve_iter(
                     evo=evo,
                     evolving_trace=self.evolving_trace,
                     queried_knowledge=queried_knowledge,
                 )
+                eva_iter = eva.evaluate_iter(
+                    evolving_trace=self.evolving_trace,
+                    queried_knowledge=queried_knowledge,
+                )
+                next(eva_iter)  # kick off the first iteration
+                eval_failed_happened = False
+                for evo in evo_iter:
+                    step_feedback = eva_iter.send(evo)
+                    if not step_feedback:
+                        eval_failed_happened = True
+                        if self.stop_eval_chain_on_fail:
+                            break
+                overall_feedback = self._get_overall_feedback(eva_iter, evo, eval_failed_happened)
 
                 # 3. Pack evolve results
-                es = EvoStep[ASpecificEvolvableSubjects](evo, queried_knowledge)
+                es = EvoStep[ASpecificEvolvableSubjects](evo, queried_knowledge, overall_feedback)
 
                 # 4. Evaluation
-                if self.with_feedback:
-                    es.feedback = (
-                        eva if isinstance(eva, Feedback) else eva.evaluate(evo, queried_knowledge=queried_knowledge)
-                    )
-                    logger.log_object(es.feedback, tag="evolving feedback")
+                logger.log_object(es.feedback, tag="evolving feedback")
 
                 # 5. update trace
                 self.evolving_trace.append(es)
@@ -110,6 +186,6 @@ def multistep_evolve(
                 yield evo  # yield the control to caller for process control and logging.
 
                 # 7. check if all tasks are completed
-                if self.with_feedback and es.feedback is not None and es.feedback.finished():
+                if es.feedback.finished():
                     logger.info("All tasks in evolving subject have been completed.")
                     break
diff --git a/rdagent/core/evolving_framework.py b/rdagent/core/evolving_framework.py
index b0ae68d3e..b49133904 100644
--- a/rdagent/core/evolving_framework.py
+++ b/rdagent/core/evolving_framework.py
@@ -3,9 +3,9 @@
 import copy
 from abc import ABC, abstractmethod
 from dataclasses import dataclass
-from typing import TYPE_CHECKING, Any, Generic, TypeVar
+from typing import TYPE_CHECKING, Any, Generator, Generic, TypeVar
 
-from rdagent.core.evaluation import EvaluableObj
+from rdagent.core.evaluation import EvaluableObj, Evaluator
 from rdagent.core.knowledge_base import KnowledgeBase
 
 if TYPE_CHECKING:
@@ -61,20 +61,65 @@ class EvolvingStrategy(ABC, Generic[ASpecificEvolvableSubjects]):
     def __init__(self, scen: Scenario) -> None:
         self.scen = scen
 
-    @abstractmethod
-    def evolve(
+    def evolve_iter(
         self,
-        *evo: ASpecificEvolvableSubjects,
-        evolving_trace: list[EvoStep[ASpecificEvolvableSubjects]] | None = None,
-        queried_knowledge: QueriedKnowledge | None = None,
-        **kwargs: Any,
-    ) -> ASpecificEvolvableSubjects:
-        """The evolving trace is a list of (evolvable_subjects, feedback) ordered
+        evo: ASpecificEvolvableSubjects,
+        queried_knowledge: QueriedKnowledge = None,
+        evolving_trace: list[EvoStep] = [],
+    ) -> Generator[ASpecificEvolvableSubjects, None, None]:
+        """
+        The evolving trace is a list of (evolvable_subjects, feedback) ordered
         according to the time.
 
         The reason why the parameter is important for the evolving.
         - evolving_trace: the historical feedback is important.
         - queried_knowledge: queried knowledge
+
+        Assumptions:
+        - The evolving process modifies evo in-place, so the yielded evo and the parameter evo are the same object.
+
+
+        A typical implementation of this method is:
+
+        .. code-block:: python
+
+            for evolve_function in self.evolve_func_iter():
+                yield evolve_function(evo=evo, queried_knowledge=queried_knowledge, evolving_trace=evolving_trace)
+                # evolve_function returns a partially evolved solution.
+        """
+
+
+class IterEvaluator(Evaluator):
+    """
+    Some evolving implementations (e.g. evolve_iter) iteratively produce partial solutions before reaching a complete final solution.
+
+    To match that strategy, evaluation is iterative as well.
+    """
+
+    @abstractmethod
+    def evaluate_iter(self) -> Generator[Feedback, EvaluableObj | None, Feedback]:
+        """
+
+        1) It evaluates each partially implemented solution and yields the feedback for that part.
+        2) Finally, it summarizes all the feedback and returns an overall feedback.
+
+        Sending None instead of an evo stops the evaluation chain; the overall feedback is returned directly.
+
+        A typical implementation of this method is:
+
+        .. code-block:: python
+
+            evo = yield Feedback()  # the first yield only receives the sent evo; its Feedback is a placeholder, not a real evaluation
+            assert evo is not None
+            for partial_eval_func in self.evaluate_func_iter():
+                partial_fb = partial_eval_func(evo)
+                # return the partial feedback and receive the evolved solution for next iteration
+                evo_next_iter = yield partial_fb
+                evo = evo_next_iter
+
+            final_fb = get_final_fb(...)
+            return final_fb
+
         """
 
 
@@ -98,7 +143,7 @@ def query(
         evo: ASpecificEvolvableSubjects,
         evolving_trace: list[EvoStep],
         **kwargs: Any,
-    ) -> QueriedKnowledge | None:
+    ) -> QueriedKnowledge:
         pass
 
     @abstractmethod
diff --git a/rdagent/core/exception.py b/rdagent/core/exception.py
index eb18b8e2c..2a1ca4991 100644
--- a/rdagent/core/exception.py
+++ b/rdagent/core/exception.py
@@ -10,6 +10,16 @@ class FormatError(WorkflowError):
     """
 
 
+class CodeBlockParseError(FormatError):
+    """Raised when code block extraction fails after all strategies."""
+
+    def __init__(self, message: str, content: str, language: str):
+        self.message = message
+        self.content = content
+        self.language = language
+        super().__init__(message)
+
+
 class CoderError(WorkflowError):
     """
     Exceptions raised when Implementing and running code.
diff --git a/rdagent/core/experiment.py b/rdagent/core/experiment.py
index f9dac480b..287974c31 100644
--- a/rdagent/core/experiment.py
+++ b/rdagent/core/experiment.py
@@ -13,7 +13,7 @@
 from copy import deepcopy
 from dataclasses import dataclass
 from pathlib import Path
-from typing import TYPE_CHECKING, Any, Generic, TypeVar
+from typing import TYPE_CHECKING, Any, Generic, List, TypeVar
 
 from rdagent.core.conf import RD_AGENT_SETTINGS
 from rdagent.core.evaluation import Feedback
@@ -242,6 +242,18 @@ def inject_files(self, **files: str) -> None:
                 target_file_path.parent.mkdir(parents=True, exist_ok=True)
                 target_file_path.write_text(v)
 
+    def remove_files(self, file_names: str | List[str]) -> None:
+        """
+        Remove specified files from the workspace.
+        """
+        if isinstance(file_names, str):
+            file_names = [file_names]
+        for file_name in file_names:
+            target_file_path = self.workspace_path / file_name
+            if target_file_path.exists():
+                target_file_path.unlink()  # Unlink the file if it exists
+            self.file_dict.pop(file_name, None)  # Safely remove the key from file_dict
+
     def get_files(self) -> list[Path]:
         """
         Get the environment description.
@@ -264,6 +276,11 @@ def inject_code_from_file_dict(self, workspace: FBWorkspace) -> None:
         """
         Load the workspace from the file_dict
         """
+        # NOTE: this method is deprecated; use inject_from_workspace instead
+        # TODO: remove this method; it is kept only for compatibility with old code
+        self.inject_from_workspace(workspace)
+
+    def inject_from_workspace(self, workspace: FBWorkspace) -> None:
         for name, code in workspace.file_dict.items():
             self.inject_files(**{name: code})
 
@@ -292,7 +309,7 @@ def execute(self, env: Env, entry: str) -> str:
         Before each execution, make sure to prepare and inject code.
         """
         result = self.run(env, entry)
-        return result.get_truncated_stdout()  # NOTE: truncating just for aligning with the old code.
+        return result.stdout  # NOTE: the result is no longer truncated here.
 
     def run(self, env: Env, entry: str) -> EnvResult:
         """
diff --git a/rdagent/core/proposal.py b/rdagent/core/proposal.py
index 964ce3683..c0a2da479 100644
--- a/rdagent/core/proposal.py
+++ b/rdagent/core/proposal.py
@@ -96,13 +96,13 @@ def from_exception(cls, e: Exception) -> ExperimentFeedback:
 class HypothesisFeedback(ExperimentFeedback):
     def __init__(
         self,
-        observations: str,
-        hypothesis_evaluation: str,
-        new_hypothesis: str,
         reason: str,
-        *,
-        code_change_summary: str | None = None,
         decision: bool,
+        code_change_summary: str,
+        *,
+        observations: str | None = None,
+        hypothesis_evaluation: str | None = None,
+        new_hypothesis: str | None = None,
         eda_improvement: str | None = None,
         acceptable: bool | None = None,
     ) -> None:
@@ -118,10 +118,18 @@ def __init__(
         self.acceptable = acceptable
 
     def __str__(self) -> str:
-        return f"""{super().__str__()}
-Observations: {self.observations}
-Hypothesis Evaluation: {self.hypothesis_evaluation}
-New Hypothesis: {self.new_hypothesis}"""
+        upper_str = f"""{super().__str__()}"""
+        if self.observations is not None:
+            upper_str += f"\nObservations: {self.observations}"
+        if self.hypothesis_evaluation is not None:
+            upper_str += f"\nHypothesis Evaluation: {self.hypothesis_evaluation}"
+        if self.new_hypothesis is not None:
+            upper_str += f"\nNew Hypothesis: {self.new_hypothesis}"
+        if self.eda_improvement is not None:
+            upper_str += f"\nEDA Improvement: {self.eda_improvement}"
+        if self.acceptable is not None:
+            upper_str += f"\nOverall Acceptable: {self.acceptable}"
+        return upper_str
 
 
 ASpecificScen = TypeVar("ASpecificScen", bound=Scenario)
@@ -131,6 +139,7 @@ def __str__(self) -> str:
 class Trace(Generic[ASpecificScen, ASpecificKB]):
     NodeType = tuple[Experiment, ExperimentFeedback]  # Define NodeType as a new type representing the tuple
     NEW_ROOT: tuple = ()
+    SEL_LATEST_SOTA: tuple = (-1,)  # select the SOTA experiment in latest node
 
     def __init__(self, scen: ASpecificScen, knowledge_base: ASpecificKB | None = None) -> None:
         self.scen: ASpecificScen = scen
@@ -160,7 +169,9 @@ def __init__(self, scen: ASpecificScen, knowledge_base: ASpecificKB | None = Non
 
         # TODO: self.hist is 2-tuple now, remove hypothesis from it, change old code for this later.
         self.knowledge_base: ASpecificKB | None = knowledge_base
-        self.current_selection: tuple[int, ...] = (-1,)
+
+        # The next expanding point of the selection. Set it as a state of the trace will make
+        self.current_selection: tuple[int, ...] = self.SEL_LATEST_SOTA
 
     def get_sota_hypothesis_and_experiment(self) -> tuple[Hypothesis | None, Experiment | None]:
         """Access the last experiment result, sub-task, and the corresponding hypothesis."""
@@ -240,6 +251,70 @@ def get_parents(self, child_idx: int) -> list[int]:
 
         return ancestors
 
+    def sync_dag_parent_and_hist(
+        self,
+        exp_and_fb: NodeType,
+        cur_loop_id: int,
+    ) -> None:
+        """
+        Append the corresponding parent index to dag_parent whenever a new node is added to hist.
+        Call this method instead of appending to hist directly, so that hist and dag_parent stay in sync.
+        """
+        # Prioritize local_selection from the experiment if available
+        exp = exp_and_fb[0]
+        selection = getattr(exp, "local_selection", None)
+        if selection is None:
+            selection = self.get_current_selection()
+
+        if len(self.hist) == 0 or len(selection) == 0:
+            # the node we are going to add is the first node of hist / root node of a new sub-trace
+            self.dag_parent.append(self.NEW_ROOT)
+
+        else:
+            current_node_idx = selection[0]
+
+            if current_node_idx == -1:
+                # the current selection is the latest one
+                current_node_idx = len(self.hist) - 1
+
+            self.dag_parent.append((current_node_idx,))
+        self.hist.append(exp_and_fb)
+        self.idx2loop_id[len(self.hist) - 1] = cur_loop_id
+
+    def get_children(self, parent_idx: int | None = None) -> list[NodeType]:
+        """
+        Get all children nodes for a given parent index.
+        If parent_idx is None, returns the root nodes (experiments starting from scratch).
+        """
+        target_parents = (parent_idx,) if parent_idx is not None else self.NEW_ROOT
+        children = []
+        for i, parents in enumerate(self.dag_parent):
+            if parents == target_parents and i < len(self.hist):
+                children.append(self.hist[i])
+        return children
+
+    def get_sota_experiment(self, node_id: int | None = None) -> Experiment | None:
+        """
+        Get the SOTA experiment from the trace by traversing ancestors backwards from node_id.
+        """
+        # NOTE: it is first used in the finetune scenario.
+        if node_id is None:
+            selection = self.get_current_selection()
+            if self.is_selection_new_tree(selection):
+                return None
+            node_id = selection[0]
+
+        if node_id == -1:
+            if not self.hist:
+                return None
+            node_id = len(self.hist) - 1
+
+        ancestors = self.get_parents(node_id)
+        for i in reversed(ancestors):
+            if self.hist[i][1].decision:
+                return self.hist[i][0]
+        return None
+
 
 class CheckpointSelector:
     """
@@ -298,7 +373,7 @@ def __init__(self, scen: Scenario) -> None:
         self.scen = scen
 
     @abstractmethod
-    def gen(self, trace: Trace, plan: ExperimentPlan | None = None) -> Experiment:
+    def gen(self, trace: Trace) -> Experiment:
         """
         Generate the experiment based on the trace.
         Planning is part of gen, but since we may support multi-stage planning,
@@ -379,7 +454,9 @@ def __init__(self, scen: Scenario) -> None:
         self.scen = scen
 
     @abstractmethod
-    def generate_feedback(self, exp: Experiment, trace: Trace) -> ExperimentFeedback:
+    def generate_feedback(
+        self, exp: Experiment, trace: Trace, exception: Exception | None = None
+    ) -> ExperimentFeedback:
         """
         The `exp` should be executed and the results should be included, as well as the comparison
         between previous results (done by LLM).
diff --git a/rdagent/core/scenario.py b/rdagent/core/scenario.py
index b333c069b..80e384e45 100644
--- a/rdagent/core/scenario.py
+++ b/rdagent/core/scenario.py
@@ -35,22 +35,20 @@ def source_data(self) -> str:
     # We should not set them in the base class
 
     @property
-    @abstractmethod
     def rich_style_description(self) -> str:
         """Rich style description to present"""
+        return self.background
 
-    @abstractmethod
     def get_scenario_all_desc(
         self,
         task: Task | None = None,
-        filtered_tag: str | None = None,
-        simple_background: bool | None = None,
     ) -> str:
         """
         Combine all descriptions together
 
         The scenario description varies based on the task being performed.
         """
+        return f"Task:{task}\n {self.background}"
 
     @abstractmethod
     def get_runtime_environment(self) -> str:
diff --git a/rdagent/log/storage.py b/rdagent/log/storage.py
index 1aca9df87..2063173d3 100644
--- a/rdagent/log/storage.py
+++ b/rdagent/log/storage.py
@@ -1,3 +1,4 @@
+import dataclasses
 import json
 import pickle
 import re
@@ -10,6 +11,57 @@
 
 LOG_LEVEL = Literal["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
 
+try:
+    import numpy as np
+except Exception:  # pragma: no cover - optional
+    np = None
+
+
+def _to_jsonable(obj: object, seen: set[int] | None = None) -> object:
+    if seen is None:
+        seen = set()
+    obj_id = id(obj)
+    if obj_id in seen:
+        return "<recursion>"
+    seen.add(obj_id)
+
+    if obj is None or isinstance(obj, (str, int, float, bool)):
+        return obj
+    if isinstance(obj, bytes):
+        try:
+            return obj.decode("utf-8")
+        except Exception:
+            return obj.decode("utf-8", errors="replace")
+    if isinstance(obj, Path):
+        return str(obj)
+    if isinstance(obj, datetime):
+        return obj.isoformat()
+    if dataclasses.is_dataclass(obj):
+        return {f.name: _to_jsonable(getattr(obj, f.name), seen) for f in dataclasses.fields(obj)}
+    if hasattr(obj, "model_dump"):
+        try:
+            return _to_jsonable(obj.model_dump(mode="json"), seen)
+        except Exception:
+            try:
+                return _to_jsonable(obj.model_dump(), seen)
+            except Exception:
+                pass
+    if isinstance(obj, dict):
+        return {str(k): _to_jsonable(v, seen) for k, v in obj.items()}
+    if isinstance(obj, (list, tuple, set)):
+        return [_to_jsonable(v, seen) for v in obj]
+    if np is not None:
+        if isinstance(obj, np.ndarray):
+            return obj.tolist()
+        if isinstance(obj, np.generic):
+            try:
+                return obj.item()
+            except Exception:
+                pass
+    if hasattr(obj, "__dict__"):
+        return {"__type__": type(obj).__name__, "__dict__": _to_jsonable(obj.__dict__, seen)}
+    return {"__type__": type(obj).__name__, "__repr__": repr(obj)}
+
 
 def _remove_empty_dir(path: Path) -> None:
     """
@@ -53,16 +105,21 @@ def log(
 
         if save_type == "json":
             path = path.with_suffix(".json")
-            with path.open("w") as f:
-                try:
-                    json.dump(obj, f)
-                except TypeError:
-                    json.dump(json.loads(str(obj)), f)
+            with path.open("w", encoding="utf-8") as f:
+                json.dump(_to_jsonable(obj), f, ensure_ascii=False, indent=2)
             return path
         elif save_type == "pkl":
             path = path.with_suffix(".pkl")
             with path.open("wb") as f:
                 pickle.dump(obj, f)
+            # TODO: save_type: list[Literal["json", "text", "pkl"]]  = ["pkl", "json"]
+            # for save_type in save_type:
+            try:
+                json_path = path.with_suffix(".json")
+                with json_path.open("w", encoding="utf-8") as f:
+                    json.dump(_to_jsonable(obj), f, ensure_ascii=False, indent=2)
+            except Exception:
+                pass
             return path
         elif save_type == "text":
             obj = str(obj)
diff --git a/rdagent/oai/backend/base.py b/rdagent/oai/backend/base.py
index 79664b9c6..a53468a5f 100644
--- a/rdagent/oai/backend/base.py
+++ b/rdagent/oai/backend/base.py
@@ -16,7 +16,7 @@
 import pytz
 from pydantic import BaseModel, TypeAdapter
 
-from rdagent.core.exception import PolicyError
+from rdagent.core.exception import CodeBlockParseError, PolicyError
 from rdagent.core.utils import LLM_CACHE_SEED_GEN, SingletonBaseClass
 from rdagent.log import LogColors
 from rdagent.log import rdagent_logger as logger
@@ -136,6 +136,62 @@ def _extract_first_json(response: str) -> str:
         return json.dumps(obj)
 
 
+class CodeBlockParser:
+    """
+    Generic code block extractor supporting multiple languages.
+    Raises CodeBlockParseError on extraction failure to trigger retry.
+    """
+
+    SUPPORTED_LANGUAGES = {
+        "python": ["python", "py", "python3", "Python", "Py"],
+        "yaml": ["yaml", "yml"],
+    }
+
+    def __init__(self, language: str = "python", fallback_to_raw: bool = False) -> None:
+        """
+        Args:
+            language: Target language type (python, yaml, etc.)
+            fallback_to_raw: If True, return raw content when extraction fails.
+                           If False (default), raise CodeBlockParseError to trigger retry.
+        """
+        self.language = language.lower()
+        self.fallback_to_raw = fallback_to_raw
+        self._lang_aliases = self._get_language_aliases(self.language)
+
+    def _get_language_aliases(self, language: str) -> List[str]:
+        """Get all possible aliases for the language."""
+        for lang, aliases in self.SUPPORTED_LANGUAGES.items():
+            if language in [lang] + aliases:
+                return [lang] + aliases
+        return [language]
+
+    def parse(self, content: str) -> str:
+        """
+        Parse content and extract code block with exact language tag.
+
+        Returns:
+            Extracted code string.
+
+        Raises:
+            CodeBlockParseError: When extraction fails and fallback_to_raw=False.
+        """
+        # Match code block with exact language tag (```python, ```yaml, etc.)
+        for alias in self._lang_aliases:
+            pattern = rf"```{alias}\s*\n(.*?)\n```"
+            match = re.search(pattern, content, re.DOTALL | re.IGNORECASE)
+            if match:
+                return match.group(1).strip()
+
+        if self.fallback_to_raw:
+            return content.strip()
+
+        raise CodeBlockParseError(
+            message=f"Failed to extract {self.language} code block",
+            content=content,
+            language=self.language,
+        )
+
+
 class SQliteLazyCache(SingletonBaseClass):
     def __init__(self, cache_location: str) -> None:
         super().__init__()
@@ -267,7 +323,14 @@ def build_chat_completion(self, user_prompt: str, *args, **kwargs) -> str:  # ty
             )
             end_time = datetime.now(pytz.timezone("Asia/Shanghai"))
             logger.log_object(
-                {"user": user_prompt, "resp": response, "start": start_time, "end": end_time}, tag="debug_llm"
+                {
+                    "system": self.system_prompt,
+                    "user": user_prompt,
+                    "resp": response,
+                    "start": start_time,
+                    "end": end_time,
+                },
+                tag="debug_llm",
             )
 
         messages.append(
@@ -568,6 +631,8 @@ def _create_chat_completion_auto_continue(
         json_target_type: Optional[str] = None,
         add_json_in_prompt: bool = False,
         response_format: Optional[Union[dict, Type[BaseModel]]] = None,
+        code_block_language: Optional[str] = None,
+        code_block_fallback: bool = False,
         **kwargs: Any,
     ) -> str:
         """
@@ -617,13 +682,14 @@ def _create_chat_completion_auto_continue(
 
         # 2) refine the response and return
         if LLM_SETTINGS.reasoning_think_rm:
-            # Strategy 1: Try to match complete <think>...</think> pattern
-            match = re.search(r"<think>(.*?)</think>(.*)", all_response, re.DOTALL)
+            # Only remove <think>...</think> if it appears at the beginning of the response
+            # Strategy 1: Try to match complete <think>...</think> pattern at the start
+            match = re.match(r"\s*<think>(.*?)</think>(.*)", all_response, re.DOTALL)
             if match:
                 _, all_response = match.groups()
             else:
-                # Strategy 2: If no complete match, try to match only </think>
-                match = re.search(r"</think>(.*)", all_response, re.DOTALL)
+                # Strategy 2: If no complete match, try to match only </think> at the start
+                match = re.match(r"\s*</think>(.*)", all_response, re.DOTALL)
                 if match:
                     all_response = match.group(1)
                 # If no match at all, keep original content
@@ -636,6 +702,14 @@ def _create_chat_completion_auto_continue(
                 # deepseek will enter this branch
                 TypeAdapter(json_target_type).validate_json(all_response)
 
+        # 4) code block extraction
+        if code_block_language:
+            code_parser = CodeBlockParser(
+                language=code_block_language,
+                fallback_to_raw=code_block_fallback,
+            )
+            all_response = code_parser.parse(all_response)
+
         if response_format is not None:
             if not isinstance(response_format, dict) and issubclass(response_format, BaseModel):
                 # It may raise TypeError if initialization fails
diff --git a/rdagent/oai/backend/litellm.py b/rdagent/oai/backend/litellm.py
index 514c5aaae..15857a46d 100644
--- a/rdagent/oai/backend/litellm.py
+++ b/rdagent/oai/backend/litellm.py
@@ -135,7 +135,11 @@ def _create_chat_completion_inner_function(  # type: ignore[no-untyped-def] # no
         Call the chat completion function
         """
 
-        if response_format and not supports_response_schema(model=LITELLM_SETTINGS.chat_model):
+        if response_format and not supports_response_schema(
+            model=LITELLM_SETTINGS.chat_model,
+            # LiteLLM (1.43+) requires this arg; None means auto-infer provider from model.
+            custom_llm_provider=None,
+        ):
             # Deepseek will enter this branch
             logger.warning(
                 f"{LogColors.YELLOW}Model {LITELLM_SETTINGS.chat_model} does not support response schema, ignoring response_format argument.{LogColors.END}",
@@ -204,9 +208,13 @@ def _create_chat_completion_inner_function(  # type: ignore[no-untyped-def] # no
                 logger.info(
                     f"Current Cost: ${float(cost):.10f}; Accumulated Cost: ${float(ACC_COST):.10f}; {finish_reason=}",
                 )
-
-        prompt_tokens = token_counter(model=model, messages=messages)
-        completion_tokens = token_counter(model=model, text=content)
+        try:
+            prompt_tokens = token_counter(model=model, messages=messages)
+            completion_tokens = token_counter(model=model, text=content)
+        except ValueError as e:
+            logger.warning(f"Token counting failed for model {model}: {e}. Skip token statistics.")
+            prompt_tokens = 0
+            completion_tokens = 0
         logger.log_object(
             {
                 "model": model,
@@ -223,7 +231,13 @@ def supports_response_schema(self) -> bool:
         """
         Check if the backend supports function calling
         """
-        return supports_response_schema(model=LITELLM_SETTINGS.chat_model) and LITELLM_SETTINGS.enable_response_schema
+        return (
+            supports_response_schema(
+                model=LITELLM_SETTINGS.chat_model,
+                custom_llm_provider=None,
+            )
+            and LITELLM_SETTINGS.enable_response_schema
+        )
 
     @property
     def chat_token_limit(self) -> int:
diff --git a/rdagent/oai/llm_conf.py b/rdagent/oai/llm_conf.py
index 848045416..a9a1130e7 100644
--- a/rdagent/oai/llm_conf.py
+++ b/rdagent/oai/llm_conf.py
@@ -61,6 +61,7 @@ class LLMSettings(ExtendedBaseSettings):
 
     # Chat configs
     openai_api_key: str = ""  # TODO: simplify the key design.
+    openai_api_base: str = ""
     chat_openai_api_key: str | None = None
     chat_openai_base_url: str | None = None  #
     chat_azure_api_base: str = ""
diff --git a/rdagent/scenarios/data_science/dev/runner/eval.py b/rdagent/scenarios/data_science/dev/runner/eval.py
index c46797414..269bb61e8 100644
--- a/rdagent/scenarios/data_science/dev/runner/eval.py
+++ b/rdagent/scenarios/data_science/dev/runner/eval.py
@@ -100,7 +100,7 @@ def evaluate(
 
         # execute workflow
         result = implementation.run(env=env, entry="python -m coverage run main.py")
-        stdout = result.get_truncated_stdout()
+        stdout = result.stdout
         execute_ret_code = result.exit_code
         implementation.running_info.running_time = result.running_time
 
diff --git a/rdagent/scenarios/data_science/proposal/exp_gen/select/submit.py b/rdagent/scenarios/data_science/proposal/exp_gen/select/submit.py
index 69215fa2a..2f183917a 100644
--- a/rdagent/scenarios/data_science/proposal/exp_gen/select/submit.py
+++ b/rdagent/scenarios/data_science/proposal/exp_gen/select/submit.py
@@ -477,7 +477,7 @@ def _generate_and_run_script(
             result = ws.run(
                 env=env, entry=f"python {script_type}.py --cache-buster={time.time()}"
             )  # Do not cache the result
-            stdout = re.sub(r"^chmod:.*\n?", "", result.get_truncated_stdout(), flags=re.MULTILINE)
+            stdout = re.sub(r"^chmod:.*\n?", "", result.stdout, flags=re.MULTILINE)
 
             if result.exit_code == 0:
                 logger.info(f"Successfully generated and ran {script_type}.py.")
@@ -487,7 +487,7 @@ def _generate_and_run_script(
                         running_timeout_period=DS_RD_SETTING.full_timeout,
                     )
                     result = ws.run(env=env, entry=f"python reference_code.py")
-                    stdout = re.sub(r"^chmod:.*\n?", "", result.get_truncated_stdout(), flags=re.MULTILINE)
+                    stdout = re.sub(r"^chmod:.*\n?", "", result.stdout, flags=re.MULTILINE)
                     if result.exit_code == 0:
                         # move submission.csv to mock_folder
                         if Path(ws.workspace_path / "submission.csv").exists():
@@ -559,7 +559,7 @@ def process_experiment(
             env.conf.running_timeout_period = DS_RD_SETTING.debug_timeout
             result = ws.run(env=env, entry="python grade.py")
             if result.exit_code == 0:
-                grade_stdout = re.sub(r"^chmod:.*\n?", "", result.get_truncated_stdout(), flags=re.MULTILINE)
+                grade_stdout = re.sub(r"^chmod:.*\n?", "", result.stdout, flags=re.MULTILINE)
             logger.info(f"Ran grade.py for {competition}/{loop_id}; exit_code: {result.exit_code}")
         else:
             logger.warning(f"Skipping grading for {competition}/{loop_id} due to main.py execution failure.")
diff --git a/rdagent/scenarios/finetune/benchmark/__init__.py b/rdagent/scenarios/finetune/benchmark/__init__.py
new file mode 100644
index 000000000..a39131c5d
--- /dev/null
+++ b/rdagent/scenarios/finetune/benchmark/__init__.py
@@ -0,0 +1,3 @@
+from .benchmark import get_benchmark_ranges, run_benchmark
+
+__all__ = ["get_benchmark_ranges", "run_benchmark"]
diff --git a/rdagent/scenarios/finetune/benchmark/benchmark.py b/rdagent/scenarios/finetune/benchmark/benchmark.py
new file mode 100644
index 000000000..9fe4c7a55
--- /dev/null
+++ b/rdagent/scenarios/finetune/benchmark/benchmark.py
@@ -0,0 +1,398 @@
+"""
+Benchmark Evaluation using OpenCompass
+
+Evaluator that runs OpenCompass in Docker to evaluate fine-tuned models on standard benchmarks.
+
+Configure benchmark behavior by editing .env to override the default settings in conf.py:
+```
+FT_BENCHMARK_DATASETS='["aime25", "gsm8k"]'
+FT_BENCHMARK_NUM_RUNS=4
+FT_JUDGE_MODEL="gpt-4"
+FT_JUDGE_API_KEY="sk-xxx"
+FT_JUDGE_API_BASE="https://api.openai.com/v1"
+```
+"""
+
+import json
+import random
+import shutil
+import subprocess
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+import pandas as pd
+import yaml
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.finetune.conf import (
+    FT_MODEL_PATH,
+    get_benchmark_env,
+    get_ft_env,
+    get_workspace_prefix,
+    is_docker_env,
+)
+from rdagent.core.experiment import FBWorkspace, Task
+from rdagent.log import rdagent_logger as logger
+from rdagent.oai.llm_conf import LLM_SETTINGS
+from rdagent.scenarios.finetune.benchmark.data.adaptor import (
+    BENCHMARK_CONFIG_DICT,
+    BenchmarkConfig,
+)
+from rdagent.scenarios.finetune.benchmark.data.default import extract_error_samples
+from rdagent.scenarios.finetune.benchmark.merge.merge import (
+    check_if_merging_needed,
+    merge_model,
+)
+from rdagent.utils.agent.tpl import T
+
+
+def get_model_inference_config(base_model_name: str, gpu_count: int) -> dict:
+    """
+    Load model inference configuration from YAML file.
+
+    Args:
+        base_model_name: HuggingFace model name (e.g., "Qwen/Qwen3-8B")
+        gpu_count: GPU count for tensor_parallel_size (from scenario.device_info)
+
+    Returns:
+        dict: Merged configuration (model-specific overrides default)
+              Uses exact match first, then longest prefix match, finally default only.
+    """
+    from rdagent.components.benchmark import BENCHMARK_CONFIGS_DIR
+    config_data = yaml.safe_load((BENCHMARK_CONFIGS_DIR / "models.yaml").read_text())
+
+    default_config = config_data.get("default", {})
+    models_config = config_data.get("models", {})
+
+    # 1. Exact match
+    if base_model_name in models_config:
+        model_specific = models_config[base_model_name]
+    else:
+        # 2. Prefix match - find longest matching prefix
+        model_specific = {}
+        best_match_len = 0
+        for configured_model in models_config:
+            if base_model_name.startswith(configured_model) and len(configured_model) > best_match_len:
+                model_specific = models_config[configured_model]
+                best_match_len = len(configured_model)
+
+    final_config = {**default_config, **model_specific}
+
+    # Handle auto tensor_parallel_size
+    if final_config.get("tensor_parallel_size") == "auto":
+        if gpu_count <= 0:
+            final_config["tensor_parallel_size"] = 1
+        else:
+            # Round down to nearest power of 2
+            power = 0
+            while (1 << (power + 1)) <= gpu_count:
+                power += 1
+            final_config["tensor_parallel_size"] = 1 << power
+
+    return final_config
+
+
+def detect_model_type(model_path: str) -> bool:
+    """
+    Detect whether the given model path corresponds to a LoRA adapter.
+
+    Returns:
+        True if LoRA adapter, False otherwise.
+    """
+    model_dir = Path(model_path)
+
+    # LoRA (llama-factory style)
+    if (model_dir / "adapter_config.json").exists():
+        return True
+
+    # Alternate LoRA file indicators
+    for fname in ("adapter_model.bin", "adapter_model.safetensors"):
+        if (model_dir / fname).exists():
+            return True
+
+    return False
+
+
+def run_benchmark(
+    workspace_path: str,
+    model_path: str,
+    model_name: str,
+    benchmark_name: str,
+    gpu_count: int,
+    test_range: Optional[str] = "[:100]",
+    num_runs: int = 1,
+    pass_k: Optional[List[int]] = None,
+    max_error_samples: int = 10,
+    result_subdir: str = "",
+) -> Dict[str, Any]:
+    """
+    Run benchmark evaluation on a fine-tuned model.
+
+    Args:
+        workspace_path: Path to workspace directory
+        model_path: Path to fine-tuned model (supports full/LoRA auto-detection)
+        model_name: HuggingFace model name
+        benchmark_name: Benchmark dataset name (e.g., "aime25", "gsm8k")
+        gpu_count: GPU count for tensor_parallel_size (from scenario.device_info)
+        test_range: Python slice string for dataset sampling (e.g., "[:100]", "[-100:]").
+                    Negative indexing allows automatic adaptation to varying subset sizes.
+        num_runs: Number of times to run each sample (default: 1)
+        pass_k: Optional list of k values for pass@k evaluation (e.g., [1, 5, 10])
+        max_error_samples: Maximum number of error samples to extract for feedback
+        result_subdir: Subdirectory for results (e.g., "validation", "test")
+
+    Returns:
+        Dict containing:
+        - accuracy_summary: Dict mapping dataset -> {metric: value}, grouped by dataset
+        - error_samples: List of error samples for feedback analysis
+    """
+    # Load configurations
+    benchmark_cfg: BenchmarkConfig = BENCHMARK_CONFIG_DICT[benchmark_name]
+    dataset_imports = benchmark_cfg.dataset
+
+    # Auto download dependent data if configured on this benchmark
+    if benchmark_cfg.download is not None:
+        benchmark_cfg.download()
+
+    model_is_lora = detect_model_type(model_path)
+    inference_config = get_model_inference_config(model_name, gpu_count)
+    workspace_path = Path(workspace_path)
+
+    # Get environment first to determine path prefix
+    env = get_benchmark_env()
+    ws_prefix = get_workspace_prefix(env)
+    is_docker = is_docker_env(env)
+
+    # Determine model paths based on environment type
+    model_rel_path = Path(model_path).relative_to(workspace_path)
+    adapter_path_in_env = Path(ws_prefix) / model_rel_path
+
+    if model_is_lora:
+        if is_docker:
+            # Docker: use /assets/models mount
+            model_path_in_env = Path(FT_MODEL_PATH) / model_name
+        else:
+            # Conda: use actual file path
+            model_path_in_env = Path(FT_RD_SETTING.file_path) / "models" / model_name
+        lora_path_in_env = adapter_path_in_env
+
+        # Check if we need to merge the model (e.g. vLLM doesn't support LoRA with modules_to_save)
+        if check_if_merging_needed(model_path):
+            merged_model_dir_inside_env = Path(ws_prefix) / "merged_model"
+
+            # Create a temporary environment for merging (use FT env as it has peft/transformers)
+            merge_env = get_ft_env()
+
+            merge_model(
+                env=merge_env,
+                workspace_path=workspace_path,
+                base_model_path=str(model_path_in_env),
+                adapter_path=str(lora_path_in_env),
+                output_path=str(merged_model_dir_inside_env),
+            )
+
+            # Switch to using the merged model
+            model_path_in_env = merged_model_dir_inside_env
+            model_is_lora = False
+            lora_path_in_env = ""
+            adapter_path_in_env = merged_model_dir_inside_env
+    else:
+        model_path_in_env = adapter_path_in_env
+        lora_path_in_env = ""
+
+    # Prepare template variables (merge inference config from models.yaml)
+    template_vars = {
+        # Model configuration
+        "model_abbr": f"ft-{benchmark_name}",
+        "model_path": model_path_in_env,
+        "is_lora": model_is_lora,
+        "lora_path": lora_path_in_env,
+        # Dataset configuration
+        "dataset_imports": [dataset_imports],
+        "test_range": test_range,
+        "num_runs": num_runs,
+        "pass_k": pass_k,
+        "work_dir": adapter_path_in_env,
+        # Merge all inference parameters from models.yaml (default + model-specific)
+        **inference_config,
+    }
+
+    # Override use_cot_postprocessor based on force_think_token setting
+    # When force_think_token=false, we don't need the CoT postprocessor to extract answers
+    if not FT_RD_SETTING.force_think_token:
+        template_vars["use_cot_postprocessor"] = False
+
+    # Render Jinja2 template
+    config_content = T("rdagent.components.benchmark.configs.opencompass_template:template").r(**template_vars)
+
+    # Note: env was already created above via get_benchmark_env()
+
+    (workspace_path / "config.py").write_text(config_content)
+    # Use result_subdir for validation/test separation
+    if result_subdir:
+        benchmark_work_dir = f"{ws_prefix}/benchmark_results/{result_subdir}"
+    else:
+        benchmark_work_dir = f"{ws_prefix}/benchmark_results"
+
+    # Logging
+    logger.info(f"Running benchmark '{benchmark_name}' on model: {model_path}")
+    logger.info(f"Base model: {model_name}, LoRA?: {model_is_lora}")
+    logger.info(f"Workspace: {workspace_path}")
+    logger.info(f"Benchmark work_dir: {benchmark_work_dir}")
+    if test_range:
+        logger.info(f"Dataset range: {test_range}")
+
+    # Environment variables
+    env_vars = {
+        "OC_JUDGE_MODEL": FT_RD_SETTING.judge_model or LLM_SETTINGS.chat_model,
+        "OC_JUDGE_API_KEY": FT_RD_SETTING.judge_api_key or LLM_SETTINGS.openai_api_key,
+        "OC_JUDGE_API_BASE": FT_RD_SETTING.judge_api_base or LLM_SETTINGS.openai_api_base,
+        "OC_JUDGE_RETRY": str(FT_RD_SETTING.judge_retry),
+    }
+
+    # Check if results already exist (skip re-running if cached)
+    results_base = workspace_path / "benchmark_results"
+    if result_subdir:
+        results_base = results_base / result_subdir
+    timestamped_dirs = sorted([d for d in results_base.glob("202*_*") if d.is_dir()], reverse=True)
+
+    if timestamped_dirs:
+        logger.info(f"Found existing results in {timestamped_dirs[0].name}, skipping benchmark execution")
+    else:
+        # Run OpenCompass
+        entry_cmd = f"opencompass {ws_prefix}/config.py --work-dir {benchmark_work_dir}"
+
+        result = env.run(
+            entry=entry_cmd,
+            local_path=str(workspace_path),
+            env=env_vars,
+        )
+
+        # Log execution immediately (for UI display)
+        tag_prefix = "docker_run" if is_docker else "conda_run"
+        logger.log_object(
+            {
+                "exit_code": result.exit_code,
+                "stdout": (result.stdout or ""),
+                "benchmark_name": benchmark_name,
+                "model_path": str(model_path),
+                "workspace_path": str(workspace_path),
+            },
+            tag=f"{tag_prefix}.Benchmark",
+        )
+
+        # Check execution status
+        if result.exit_code != 0:
+            error_msg = result.stdout[-2000:] if result.stdout else "No output"
+            raise RuntimeError(f"Benchmark execution failed (exit_code={result.exit_code})\n{error_msg}")
+
+        # Re-scan for timestamped directories after execution
+        timestamped_dirs = sorted([d for d in results_base.glob("202*_*") if d.is_dir()], reverse=True)
+
+    # The newest timestamped run directory holds the summary CSV under <timestamp>/summary/
+    results_subdir = timestamped_dirs[0] / "summary"
+
+    results_csv_path = sorted([f for f in results_subdir.rglob("*.csv")], reverse=True)[0]
+    logger.info(f"Detailed results CSV: {results_csv_path.relative_to(results_base)}")
+
+    # Read CSV content for accuracy summary (grouped by dataset)
+    df = pd.read_csv(results_csv_path)
+    # Get score column (the model name column, e.g., 'api-chemcotbench')
+    score_col = [c for c in df.columns if c not in ["dataset", "version", "metric", "mode"]][0]
+    # Pivot to group by dataset, with metrics as columns (use pivot_table to handle duplicates)
+    pivoted = df.pivot_table(index="dataset", columns="metric", values=score_col, aggfunc="first").to_dict("index")
+    # Filter out NaN values (different datasets have different metrics)
+    accuracy_summary = {ds: {k: v for k, v in metrics.items() if pd.notna(v)} for ds, metrics in pivoted.items()}
+
+    # Extract error samples for feedback
+    error_samples = extract_error_samples(
+        timestamped_dirs[0],
+        max_samples=max_error_samples,
+    )
+
+    # Log benchmark result for UI display
+    # Use result_subdir to distinguish validation vs test in tag
+    log_tag = f"benchmark_result.{result_subdir}" if result_subdir else "benchmark_result"
+    logger.log_object(
+        {
+            "accuracy_summary": accuracy_summary,
+            "error_samples": error_samples,
+            "benchmark_name": benchmark_name,
+            "split": result_subdir or "default",  # validation, test, or default
+        },
+        tag=log_tag,
+    )
+
+    return {
+        "accuracy_summary": accuracy_summary,
+        "error_samples": error_samples,
+    }
+
+
+def get_benchmark_ranges() -> tuple[str, str]:
+    """Get validation and test range strings for benchmark evaluation.
+
+    Uses dynamic expressions that adapt to any dataset size:
+    - For small datasets (<200): splits 50/50 to avoid overlap
+    - For large datasets (>=200): takes 100 samples each
+
+    The expressions use OpenCompass's eval mechanism with index_list variable.
+
+    Returns:
+        Tuple of (validation_range, test_range) - guaranteed non-overlapping:
+        - validation: first min(100, 50%) samples
+        - test: last min(100, 50%) samples
+    """
+    return "[:min(100, len(index_list)//2)]", "[-min(100, len(index_list)//2):]"
+
+
+if __name__ == "__main__":
+    """Test benchmark evaluation on Qwen3-1.7B with LoRA adapter."""
+    # Configuration - Fill in your LoRA adapter path and model name
+    LORA_ADAPTER_PATH = "/home/v-qizhengli/workspace/FT_workspace/gitignore_folder/B200/B200_FT_workspace/limo/train/b200_sweep_yamls/saves/qwen3-1.7b/lora_b200_lr1e-4_acc4/checkpoint-100"
+    MODEL_NAME = "Qwen/Qwen3-1.7B"
+    BENCHMARK = "aime25"
+    GPU_COUNT = 1
+
+    print("=" * 80)
+    print("Benchmark Evaluation Test")
+    print("=" * 80)
+    print(f"\nEnvironment: FT_JUDGE_API_KEY={'Set' if FT_RD_SETTING.judge_api_key else 'Not Set'}")
+    print(f"Judge API Base: {FT_RD_SETTING.judge_api_base or 'Not Set'}")
+
+    if not Path(LORA_ADAPTER_PATH).exists():
+        print("\nPlease set LORA_ADAPTER_PATH to a valid checkpoint directory")
+        print(f"Current path does not exist: {LORA_ADAPTER_PATH}")
+        exit(1)
+
+    print(f"\nModel: {MODEL_NAME}")
+    print(f"Adapter: {LORA_ADAPTER_PATH}")
+    print(f"Benchmark: {BENCHMARK}")
+    print("-" * 80)
+
+    try:
+        # Create FBWorkspace for test (auto-generates UUID workspace)
+        test_task = Task(name=f"benchmark_test_{BENCHMARK}")
+        test_workspace = FBWorkspace(target_task=test_task)
+        test_workspace.prepare()
+
+        print(f"\nWorkspace: {test_workspace.workspace_path}")
+
+        result = run_benchmark(
+            workspace_path=str(test_workspace.workspace_path),
+            model_path=LORA_ADAPTER_PATH,
+            model_name=MODEL_NAME,
+            benchmark_name=BENCHMARK,
+            gpu_count=GPU_COUNT,
+        )
+
+        print("\nEvaluation completed!")
+        print(f"Accuracy Summary: {result['accuracy_summary']}")
+        print(f"Error Samples: {len(result['error_samples'])} samples")
+        print(f"\nResults saved to: {test_workspace.workspace_path / 'benchmark_results'}")
+
+    except Exception as e:
+        print(f"\nEvaluation failed: {e}")
+        import traceback
+
+        traceback.print_exc()
diff --git a/rdagent/scenarios/finetune/benchmark/data/adaptor.py b/rdagent/scenarios/finetune/benchmark/data/adaptor.py
new file mode 100644
index 000000000..9e508ab89
--- /dev/null
+++ b/rdagent/scenarios/finetune/benchmark/data/adaptor.py
@@ -0,0 +1,129 @@
+"""
+Benchmark dataset configuration and data preparation adaptor for finetune benchmarks.
+
+This module centralizes:
+- Mapping of benchmark names to OpenCompass dataset config import paths.
+- Optional dataset download / preparation hooks for benchmarks.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Callable, Dict, Optional
+
+from rdagent.scenarios.finetune.benchmark.data import financeiq_gen
+
+DownloadFunc = Callable[[], None]
+
+
+@dataclass
+class BenchmarkConfig:
+    """
+    Configuration for a single benchmark.
+
+    Attributes:
+        dataset: Import path for the dataset config in OpenCompass.
+        download: Optional function to ensure the dataset is available (e.g. download from HF).
+    """
+
+    dataset: str
+    download: Optional[DownloadFunc] = None
+
+
+# Mapping from benchmark_name -> benchmark configuration.
+BENCHMARK_CONFIG_DICT: Dict[str, BenchmarkConfig] = {
+    # Math Reasoning Benchmarks
+    "aime24": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.aime2024.aime2024_gen_17d799",
+    ),
+    "aime25": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f",
+    ),
+    "math": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.math.math_0shot_gen_393424",
+    ),
+    # General Knowledge Benchmarks
+    "mmlu": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.mmlu.mmlu_gen",
+    ),
+    # Code Generation Benchmarks
+    "humaneval": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.humaneval.humaneval_gen",
+    ),
+    "mbpp": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.mbpp.mbpp_gen",
+    ),
+    # PANORAMA - Patent Analysis Benchmarks (zero-shot)
+    "panorama": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.panorama.panorama_gen",
+    ),
+    "panorama_par4pc": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.panorama.panorama_par4pc_gen",
+    ),
+    "panorama_pi4pc": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.panorama.panorama_pi4pc_gen",
+    ),
+    "panorama_noc4pc": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.panorama.panorama_noc4pc_gen",
+    ),
+    # PANORAMA - Patent Analysis Benchmarks (CoT)
+    "panorama_par4pc_cot": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.panorama.panorama_par4pc_cot_gen",
+    ),
+    "panorama_pi4pc_cot": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.panorama.panorama_pi4pc_cot_gen",
+    ),
+    "panorama_noc4pc_cot": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.panorama.panorama_noc4pc_cot_gen",
+    ),
+    # ChemCoTBench - Chemistry Reasoning Benchmarks
+    "chemcotbench": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.chemcotbench.chemcotbench_gen",
+    ),
+    "chemcotbench_mol_und": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.chemcotbench.chemcotbench_mol_und_gen",
+    ),
+    "chemcotbench_mol_edit": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.chemcotbench.chemcotbench_mol_edit_gen",
+    ),
+    "chemcotbench_mol_opt": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.chemcotbench.chemcotbench_mol_opt_gen",
+    ),
+    "chemcotbench_reaction": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.chemcotbench.chemcotbench_reaction_gen",
+    ),
+    # TableBench - Table Question Answering Benchmarks
+    "tablebench_data_analysis": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.tablebench.tablebench_data_analysis_gen",
+    ),
+    "tablebench_fact_checking": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.tablebench.tablebench_fact_checking_gen",
+    ),
+    "tablebench_numerical_reasoning": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.tablebench.tablebench_numerical_reasoning_gen",
+    ),
+    "tablebench_visualization": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.tablebench.tablebench_visualization_gen",
+    ),
+    "tablebench_gen": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.tablebench.tablebench_gen",
+    ),
+    # BioProBench
+    "bioprobench_gen": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.bioprobench.bioprobench_gen",
+    ),
+    "bioprobench_ord": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.bioprobench.bioprobench_ord",
+    ),
+    "bioprobench_err": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.bioprobench.bioprobench_err",
+    ),
+    "bioprobench_pqa": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.bioprobench.bioprobench_pqa",
+    ),
+    # Native OpenCompass benchmarks
+    "FinanceIQ_gen": BenchmarkConfig(
+        dataset="opencompass.configs.datasets.FinanceIQ.FinanceIQ_llmjudge_gen",
+        download=financeiq_gen.download_financeiq_dataset,
+    ),
+}
diff --git a/rdagent/scenarios/finetune/benchmark/data/default.py b/rdagent/scenarios/finetune/benchmark/data/default.py
new file mode 100644
index 000000000..8e80cbb2c
--- /dev/null
+++ b/rdagent/scenarios/finetune/benchmark/data/default.py
@@ -0,0 +1,292 @@
+"""
+Error sample extraction from OpenCompass benchmark results.
+
+This module provides a unified approach to extract error samples from various
+OpenCompass evaluator formats using both results and predictions directories.
+"""
+
+from __future__ import annotations
+
+import json
+import random
+from pathlib import Path
+from typing import Any, Dict, List
+
+from rdagent.log import rdagent_logger as logger
+
+# ============================================================================
+# Helper Functions
+# ============================================================================
+
+
+def _to_bool(value: Any) -> bool:
+    """
+    Unified boolean conversion supporting multiple types.
+
+    Handles: list, str, bool, None, and other types.
+    Lists are True only when non-empty and every element is truthy: [False] -> False, [True] -> True
+    """
+    if value is None:
+        return False
+    if isinstance(value, list):
+        return all(_to_bool(v) for v in value) if value else False
+    if isinstance(value, str):
+        return value.strip().upper() in ("A", "CORRECT", "TRUE", "YES", "1")
+    return bool(value)
+
+
+def _is_correct(sample: Dict) -> bool:
+    """
+    Unified correctness check - returns True if sample is correct (should be skipped).
+
+    Checks fields in priority order from results directory.
+    """
+    # Direct fields
+    for field in ["cascade_correct", "correct", "is_correct", "exact_match"]:
+        if field in sample:
+            return _to_bool(sample[field])
+
+    # Nested llm_evaluation
+    llm_eval = sample.get("llm_evaluation")
+    if llm_eval and isinstance(llm_eval, list) and llm_eval:
+        return _to_bool(llm_eval[0].get("llm_correct"))
+
+    # Nested rule_evaluation
+    rule_eval = sample.get("rule_evaluation")
+    if rule_eval and isinstance(rule_eval, list) and rule_eval:
+        return _to_bool(rule_eval[0].get("correct"))
+
+    return False
+
+
+def _format_value(value: Any) -> str:
+    """Format value as a string: None or an empty list -> "N/A"; a list -> its first element."""
+    if value is None:
+        return "N/A"
+    if isinstance(value, list):
+        return str(value[0]) if value else "N/A"
+    return str(value)
+
+
+def _format_prompt(prompt: Any) -> str:
+    """
+    Format prompt to readable string (matches model input format).
+
+    Handles:
+    - Simple string: return as-is
+    - Single message dict: extract prompt field
+    - Single-turn list [{'role': 'HUMAN', 'prompt': '...'}]: return prompt directly (no prefix)
+    - Multi-turn few-shot: format with ChatML-style role markers
+    """
+    if isinstance(prompt, str):
+        return prompt
+    if isinstance(prompt, dict):
+        return prompt.get("prompt", str(prompt))
+    if isinstance(prompt, list) and prompt:
+        first = prompt[0]
+        # Check if it's conversation format
+        if isinstance(first, dict) and "role" in first:
+            # Single-turn: return prompt directly without prefix
+            if len(prompt) == 1:
+                return first.get("prompt", str(first))
+            # Multi-turn few-shot: format with ChatML-style markers
+            parts = []
+            for msg in prompt:
+                if isinstance(msg, dict):
+                    role = msg.get("role", "UNKNOWN")
+                    content = msg.get("prompt", str(msg))
+                    # Map HUMAN/BOT to user/assistant
+                    role_name = "user" if role == "HUMAN" else "assistant"
+                    parts.append(f"<|im_start|>{role_name}\n{content}<|im_end|>")
+                else:
+                    parts.append(str(msg))
+            return "\n".join(parts)
+        # Single item list (not conversation format)
+        if isinstance(first, dict):
+            return first.get("prompt", str(first))
+        return str(first)
+    return "N/A"
+
+
+def _extract_tag_content(prompt: Any, tag_name: str) -> str:
+    """
+    Extract content from <tag_name Begin>...<tag_name End> in prompt.
+
+    Used for extracting Original Question and Predicted Answer from LLM Judge prompts.
+    """
+    if isinstance(prompt, list):
+        prompt = str(prompt)
+    prompt_str = str(prompt)
+
+    start_tag = f"<{tag_name} Begin>"
+    end_tag = f"<{tag_name} End>"
+
+    start = prompt_str.find(start_tag)
+    end = prompt_str.find(end_tag)
+
+    if start != -1 and end > start:
+        content = prompt_str[start + len(start_tag) : end].strip()
+        # Clean up formatting artifacts
+        if content.startswith(": \\n"):
+            content = content[4:]
+        return content.strip()
+
+    return "N/A"
+
+
+def _get_question(sample: Dict, pred_entry: Dict) -> str:
+    """Extract question - prioritize predictions for complete content."""
+    # 1. Priority: predictions directory origin_prompt
+    if pred_entry.get("origin_prompt"):
+        return _format_prompt(pred_entry["origin_prompt"])
+
+    # 2. Results directory direct fields
+    for field in ["origin_prompt", "prompt", "source"]:
+        if field in sample and sample[field]:
+            return _format_prompt(sample[field])
+
+    # 3. Nested llm_evaluation (extract from <Original Question> tag)
+    llm_eval = sample.get("llm_evaluation")
+    if llm_eval and isinstance(llm_eval, list) and llm_eval:
+        prompt = llm_eval[0].get("origin_prompt")
+        if prompt:
+            content = _extract_tag_content(prompt, "Original Question")
+            if content != "N/A":
+                return content
+
+    return sample.get("example_abbr", "N/A")
+
+
+def _get_gold(sample: Dict, pred_entry: Dict) -> str:
+    """Extract gold/reference answer - prioritize predictions."""
+    # 1. Priority: predictions directory
+    if pred_entry.get("gold") is not None:
+        return _format_value(pred_entry["gold"])
+
+    # 2. Results directory direct fields
+    for field in ["gold", "answer", "reference", "references"]:
+        if field in sample and sample[field] is not None:
+            return _format_value(sample[field])
+
+    # 3. Nested structures
+    for nested in ["llm_evaluation", "rule_evaluation"]:
+        eval_data = sample.get(nested)
+        if eval_data and isinstance(eval_data, list) and eval_data:
+            gold = eval_data[0].get("gold") or eval_data[0].get("answer")
+            if gold is not None:
+                return _format_value(gold)
+
+    return "N/A"
+
+
+def _get_prediction(sample: Dict, pred_entry: Dict) -> str:
+    """Extract model prediction/output - prioritize predictions."""
+    # 1. Priority: predictions directory
+    if pred_entry.get("prediction") is not None:
+        return _format_value(pred_entry["prediction"])
+
+    # 2. Results directory direct fields (PANORAMA and similar formats)
+    for field in ["pred_raw", "pred", "origin_prediction"]:
+        if field in sample:
+            return _format_value(sample[field])
+
+    # 3. Nested rule_evaluation.pred (CascadeEvaluator extracted answer)
+    rule_eval = sample.get("rule_evaluation")
+    if rule_eval and isinstance(rule_eval, list) and rule_eval:
+        pred = rule_eval[0].get("pred")
+        if pred is not None:
+            return _format_value(pred)
+
+    return "N/A"
+
+
+# ============================================================================
+# Main Entry Point
+# ============================================================================
+
+
+def extract_error_samples(
+    results_base: Path,
+    max_samples: int = 10,
+) -> List[Dict[str, Any]]:
+    """
+    Extract error samples from OpenCompass benchmark results.
+
+    Uses both results and predictions directories:
+    - results: correctness judgment
+    - predictions: complete question/gold/prediction content
+
+    Args:
+        results_base: Path to benchmark_results/{timestamp} directory
+        max_samples: Maximum number of error samples to return
+
+    Returns:
+        List of error samples, each containing:
+        - question: The original prompt/question
+        - gold: The expected/ground truth answer
+        - model_output: The model's actual output
+        - silver_answers (optional): For PANORAMA evaluator
+        - custom_score (optional): For PANORAMA evaluator
+    """
+    errors: List[Dict[str, Any]] = []
+    results_dir = results_base / "results"
+    predictions_dir = results_base / "predictions"
+
+    if not results_dir.exists():
+        logger.warning(f"Results directory not found: {results_dir}")
+        return errors
+
+    for result_file in results_dir.rglob("*.json"):
+        with open(result_file) as f:
+            results_data = json.load(f)
+
+        # Load corresponding predictions file
+        rel_path = result_file.relative_to(results_dir)
+        pred_file = predictions_dir / rel_path
+        predictions: Dict[str, Any] = {}
+        if pred_file.exists():
+            with open(pred_file) as f:
+                predictions = json.load(f)
+
+        details = results_data.get("details", [])
+        if not details:
+            continue
+
+        # Handle both list and dict formats
+        if isinstance(details, list):
+            iterator = enumerate(details)
+        else:
+            iterator = details.items()
+
+        for idx, sample in iterator:
+            if not isinstance(sample, dict):
+                continue
+
+            # Skip correct samples (from results)
+            if _is_correct(sample):
+                continue
+
+            # Get predictions entry (complete content)
+            pred_entry = predictions.get(str(idx), {})
+
+            # Build error sample with core fields
+            error: Dict[str, Any] = {
+                "question": _get_question(sample, pred_entry),
+                "gold": _get_gold(sample, pred_entry),
+                "model_output": _get_prediction(sample, pred_entry),
+            }
+
+            # Add PANORAMA extra fields if present
+            if "silver" in sample:
+                error["silver_answers"] = sample.get("silver", [])
+            if "custom_score" in sample:
+                error["custom_score"] = sample.get("custom_score", 0.0)
+
+            errors.append(error)
+
+    # Random sample if we have more than max_samples
+    if len(errors) > max_samples:
+        errors = random.sample(errors, max_samples)
+
+    logger.info(f"Extracted {len(errors)} error samples from benchmark results")
+    return errors
diff --git a/rdagent/scenarios/finetune/benchmark/data/financeiq_gen.py b/rdagent/scenarios/finetune/benchmark/data/financeiq_gen.py
new file mode 100644
index 000000000..cfe990ef5
--- /dev/null
+++ b/rdagent/scenarios/finetune/benchmark/data/financeiq_gen.py
@@ -0,0 +1,148 @@
+from __future__ import annotations
+
+import json
+import random
+import shutil
+import subprocess
+from pathlib import Path
+from typing import Any, Dict, List
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.finetune.datasets.financeiq.split import split_financeiq_dataset
+
+
+def download_financeiq_dataset() -> None:
+    """
+    Download and arrange the FinanceIQ dataset for OpenCompass.
+
+    This downloads from `Duxiaoman-DI/FinanceIQ` into:
+        <FT_RD_SETTING.file_path>/benchmarks/opencompass_data/data/FinanceIQ
+
+    The repo structure includes a `data` subdirectory; we move `dev` and `test`
+    up one level to match the expected OpenCompass layout.
+    """
+    target_dir = FT_RD_SETTING.file_path / "benchmarks" / "opencompass_data" / "data" / "FinanceIQ"
+    if target_dir.exists():
+        logger.info(f"FinanceIQ dataset already exists at {target_dir}")
+        return
+
+    logger.info(f"Downloading FinanceIQ dataset to {target_dir}")
+    target_dir.parent.mkdir(parents=True, exist_ok=True)
+
+    subprocess.check_call(
+        [
+            "git",
+            "clone",
+            "https://huggingface.co/datasets/Duxiaoman-DI/FinanceIQ",
+            str(target_dir),
+        ]
+    )
+
+    # Move dev and test folders to upper level (opencompass_data/data/FinanceIQ)
+    data_subdir = target_dir / "data"
+    if data_subdir.exists():
+        for folder in ("dev", "test"):
+            src = data_subdir / folder
+            if src.exists():
+                shutil.move(str(src), str(target_dir / folder))
+        shutil.rmtree(data_subdir)
+
+    # Apply split for benchmark (keep test set only)
+    split_financeiq_dataset(str(target_dir), split="test")
+
+
+def extract_error_samples(results_base: Path, max_samples: int = 10) -> List[Dict[str, Any]]:
+    """
+    Deprecated: FinanceIQ_gen results are now handled by the unified error-sample extraction logic.
+    Extract error samples specifically for FinanceIQ_gen benchmark.
+
+    FinanceIQ_gen result files (per subject) look like:
+
+        {
+            "accuracy": 60.0,
+            "details": {
+                "type": "GEN",
+                "0": {
+                    "prompt": [...],
+                    "origin_prediction": "...",
+                    "predictions": "D",
+                    "references": "B"
+                },
+                "1": { ... },
+                ...
+            }
+        }
+
+    We treat a sample as error when predictions != references.
+    The question text is taken from the last HUMAN prompt in the prompt list.
+
+    Args:
+        results_base: Path to benchmark_results/{timestamp} directory
+        max_samples: Maximum number of error samples to return
+
+    Returns:
+        List of error samples, each containing:
+        - question: The original prompt/question
+        - gold: The expected/ground truth answer (references)
+        - model_output: The model's actual output (predictions)
+    """
+    error_samples: List[Dict[str, Any]] = []
+    results_dir = results_base / "results" / "ft-FinanceIQ_gen"
+
+    if not results_dir.exists():
+        logger.warning(f"FinanceIQ_gen results directory not found: {results_dir}")
+        return error_samples
+
+    # Iterate through all FinanceIQ subject JSON files
+    for result_file in sorted(results_dir.glob("*.json")):
+        with open(result_file) as f:
+            data = json.load(f)
+
+        details = data.get("details", {})
+        if not isinstance(details, dict):
+            continue
+
+        # Each key in details except "type" is a sample index
+        for key, sample in details.items():
+            if key == "type" or not isinstance(sample, dict):
+                continue
+
+            pred = sample.get("predictions")
+            gold = sample.get("references")
+
+            # Skip if either is missing
+            if pred is None or gold is None:
+                continue
+
+            # Only keep incorrect predictions
+            if str(pred) == str(gold):
+                continue
+
+            prompt_list = sample.get("prompt", [])
+            question = "N/A"
+            if isinstance(prompt_list, list) and prompt_list:
+                # Take the last HUMAN message as the question
+                for msg in reversed(prompt_list):
+                    if isinstance(msg, dict) and msg.get("role") == "HUMAN":
+                        question = msg.get("prompt", "N/A")
+                        break
+
+            error_samples.append(
+                {
+                    "question": question,
+                    "gold": str(gold),
+                    "model_output": str(pred),
+                }
+            )
+
+    if not error_samples:
+        logger.info("No FinanceIQ_gen error samples found")
+        return error_samples
+
+    # Random sampling if too many error samples
+    if len(error_samples) > max_samples:
+        error_samples = random.sample(error_samples, max_samples)
+
+    logger.info(f"Extracted {len(error_samples)} FinanceIQ_gen error samples from {results_dir}")
+    return error_samples
diff --git a/rdagent/scenarios/finetune/benchmark/merge/__init__.py b/rdagent/scenarios/finetune/benchmark/merge/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/rdagent/scenarios/finetune/benchmark/merge/merge.py b/rdagent/scenarios/finetune/benchmark/merge/merge.py
new file mode 100644
index 000000000..bde5458ca
--- /dev/null
+++ b/rdagent/scenarios/finetune/benchmark/merge/merge.py
@@ -0,0 +1,75 @@
+import json
+import subprocess
+from pathlib import Path
+
+from rdagent.components.coder.finetune.conf import get_workspace_prefix
+from rdagent.log import rdagent_logger as logger
+from rdagent.utils.agent.tpl import T
+
+BLACKWELL_GPU_KEYWORDS = ["b100", "b200", "b300"]
+
+
+def is_blackwell_gpu() -> bool:
+    """Check if the current GPU is NVIDIA Blackwell architecture (B100, B200, B300)."""
+    try:
+        result = subprocess.run(
+            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
+            capture_output=True,
+            text=True,
+            timeout=10,
+        )
+        if result.returncode == 0:
+            gpu_names = result.stdout.strip().lower()
+            return any(kw in gpu_names for kw in BLACKWELL_GPU_KEYWORDS)
+    except (OSError, subprocess.TimeoutExpired):
+        pass  # nvidia-smi is missing or unresponsive: assume not Blackwell
+    return False
+
+
+def check_if_merging_needed(model_path: str | Path) -> bool:
+    """
+    Check whether the LoRA adapter must be merged into the base model before benchmarking.
+    Merging is required when the adapter sets modules_to_save (unsupported by vLLM) or when running on a Blackwell GPU.
+    """
+    config_path = Path(model_path) / "adapter_config.json"
+    if not config_path.exists():
+        return False
+    with open(config_path, "r") as f:
+        config = json.load(f)
+    # Check for modules_to_save which requires merging for vLLM
+    # The logic is based on https://github.com/vllm-project/vllm/issues/9280
+    if config.get("modules_to_save") is not None:
+        logger.info(f"Model merging required due to modules_to_save: {config.get('modules_to_save')}")
+        return True
+    if is_blackwell_gpu():
+        logger.info("Model merging required due to Blackwell GPU (B100/B200/B300)")
+        return True
+    return False
+
+
+def merge_model(env, workspace_path: Path, base_model_path: str, adapter_path: str, output_path: str):
+    """
+    Merge LoRA adapter into base model using a template-generated script.
+    """
+    # Prepare template variables
+    template_vars = {
+        "base_model_path": base_model_path,
+        "adapter_path": adapter_path,
+        "output_path": output_path,
+    }
+
+    # Render Jinja2 template
+    merge_script = T("rdagent.scenarios.finetune.benchmark.merge.merge_model_template:template").r(**template_vars)
+
+    script_path = workspace_path / "merge_model.py"
+    script_path.write_text(merge_script)
+
+    logger.info(f"Starting model merging from {adapter_path}...")
+
+    ws_prefix = get_workspace_prefix(env)
+    cmd = f"python {ws_prefix}/merge_model.py"
+
+    result = env.run(cmd, local_path=str(workspace_path))
+    if result.exit_code != 0:
+        raise RuntimeError(f"Model merging failed (exit_code={result.exit_code}):\n{result.stdout}")
+    logger.info("Model merging completed.")
diff --git a/rdagent/scenarios/finetune/benchmark/merge/merge_model_template.yaml b/rdagent/scenarios/finetune/benchmark/merge/merge_model_template.yaml
new file mode 100644
index 000000000..9625bd447
--- /dev/null
+++ b/rdagent/scenarios/finetune/benchmark/merge/merge_model_template.yaml
@@ -0,0 +1,44 @@
+# Jinja2 template for merging LoRA models
+# Rendered by merge.py (merge_model) to generate the merge script
+
+template: |-
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+    from peft import PeftModel
+    import os
+    import shutil
+
+    base_model_path = "{{ base_model_path }}"
+    adapter_path = "{{ adapter_path }}"
+    output_path = "{{ output_path }}"
+
+    print(f"Loading base model from {base_model_path}...")
+    base_model = AutoModelForCausalLM.from_pretrained(
+        base_model_path,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        trust_remote_code=True,
+        local_files_only=True
+    )
+
+    print(f"Loading LoRA adapter from {adapter_path}...")
+    model = PeftModel.from_pretrained(base_model, adapter_path, local_files_only=True)
+
+    print(f"Loading tokenizer from {adapter_path}...")
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(adapter_path, trust_remote_code=True, local_files_only=True)
+    except Exception:
+        print("Tokenizer not found in adapter, loading from base model...")
+        tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True, local_files_only=True)
+
+    print("Merging model...")
+    model = model.merge_and_unload()
+
+    if os.path.exists(output_path):
+        print(f"Removing existing output path: {output_path}")
+        shutil.rmtree(output_path)
+
+    print(f"Saving merged model to {output_path}...")
+    model.save_pretrained(output_path)
+    tokenizer.save_pretrained(output_path)
+    print("Merge Done.")
diff --git a/rdagent/scenarios/finetune/datasets/README.md b/rdagent/scenarios/finetune/datasets/README.md
new file mode 100644
index 000000000..310726519
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/README.md
@@ -0,0 +1,190 @@
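+# Dataset Management Module
+
+This module manages the datasets for the LLM fine-tuning scenario. Each dataset is fetched by downloading the complete HuggingFace repository with `snapshot_download`.
+
+## Design Goals
+
+1. **Simplicity**: download the complete HF repository and preserve its original file structure
+2. **Extensibility**: support an optional `post_download_fn` for custom processing (e.g., removing the test split)
+
+## Usage
+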
+```python
+from rdagent.scenarios.finetune.datasets import prepare, prepare_all, DATASETS
+
+# 1. List the registered datasets
+print(list(DATASETS))
+# ['chemcot', 'panorama', 'deepscaler', 'financeiq', 'tableinstruct', 'bioprobench']
+
+# 2. Prepare a single dataset (download it locally)
+path = prepare("chemcot")
+# Downloaded to: <FT_RD_SETTING.file_path>/datasets/chemcot/
+
+# 3. Prepare all datasets
+prepare_all()
+```
+
+## Dataset Configuration
+
+Each dataset is configured through a `DatasetConfig`:
+
+```python
+@dataclass
+class DatasetConfig:
+    repo_id: str                                                # HuggingFace repository ID
+    post_download_fn: Optional[Callable[[str], None]] = None    # Optional post-download hook
+```
+
+## Registered Datasets
+
+| Name | Repository | Description |
+|------|------------|-------------|
+| `chemcot` | OpenMol/ChemCoTDataset | Chemical reasoning with CoT |
+| `panorama` | LG-AI-Research/PANORAMA | Patent examination benchmark |
+| `deepscaler` | agentica-org/DeepScaleR-Preview-Dataset | Mathematical reasoning |
+| `financeiq` | Duxiaoman-DI/FinanceIQ | Financial question answering |
+
+## Adding a New Dataset
+
+Add a configuration to the `DATASETS` dictionary in `__init__.py`:
+
+```python
+DATASETS["my-dataset"] = DatasetConfig(
+    repo_id="organization/dataset-name",
+    post_download_fn=my_cleanup_function,  # optional
+)
+```
+
+---
+
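+## README Replacement Mechanism
+
+**Important**: when a dataset is downloaded, the local README overwrites the original HuggingFace README.
+
+### How It Works
+
+```python
+# Logic in __init__.py
+custom_readme = Path(__file__).parent / name / "README.md"
+if custom_readme.exists():
+    shutil.copy(custom_readme, out_dir / "README.md")
+```
+
+1. After the dataset has been downloaded, check whether `datasets/{name}/README.md` exists in the source tree
+2. If it exists, overwrite the README in the download directory with the local version
+3. This gives each dataset its own **customized documentation**
+
+### Directory Layout
+
+```
+rdagent/scenarios/finetune/datasets/
+├── __init__.py          # Main module: prepare(), prepare_all(), DATASETS
+├── README.md            # This document
+├── chemcot/
+│   └── README.md        # ChemCoT dataset documentation (overwrites the HF original)
+├── panorama/
+│   └── README.md        # PANORAMA dataset documentation (overwrites the HF original)
+├── deepscaler/
+│   └── README.md        # DeepScaleR dataset documentation (overwrites the HF original)
+└── financeiq/
+    └── README.md        # FinanceIQ dataset documentation (overwrites the HF original)
+```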
+
+---
+
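+## README Writing Guidelines
+
+When writing the README for a dataset, we recommend including the following sections:
+
+### 1. Basic Information (required)
+
+```markdown
+# Dataset Name
+
+Brief description + paper link
+
+**Repository**: [HuggingFace link]
+
+## Overview
+
+Overview of the dataset's scale, sources, and intended use
+```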
+
+### 2. Dataset Scale (required)
+
+```markdown
+## Dataset Scale
+
+| Category | Subtasks | Samples |
+|----------|----------|---------|
+| xxx | xxx | 1,234 |
+| **Total** | **N subtasks** | **total count** |
+```
+
+### 3. Data Fields (required)
+
+```markdown
+## Data Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `id` | string | Unique identifier |
+| `query` | string | Question/instruction |
+| `answer` | string | Answer |
+| ... | ... | ... |
+```
+
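+### 4. CoT Quality Assessment (critical)
+
+This is the most important section: it tells users directly whether the data is usable and how it should be processed:
+
+```markdown
+## CoT Quality Assessment
+
+**IMPORTANT**: [core warning about data quality]
+
+| Dimension | Value |
+|-----------|-------|
+| baseline_quality | low / medium / high / N/A |
+| task_type | math / chemistry / legal / ... |
+| polish_difficulty | low / medium / high |
+
+**Baseline**: [detailed explanation]
+- If CoT is present: describe its source, how it was validated, and any quality issues
+- If CoT is absent: mark it explicitly as "NO CoT" and state that it must be generated
+```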
+
+### 5. Baseline Performance (recommended)
+
+```markdown
+## Baseline Performance
+
+| Task | Best Model | Score |
+|------|-----------|-------|
+| xxx | GPT-4o | 85.2% |
+```
+
+### 6. License (required)
+
+```markdown
+## License
+
+MIT / CC-BY-NC-4.0 / ...
+```
+
+---
+
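+## Reference Examples
+
+- **DeepScaleR**: [deepscaler/README.md](deepscaler/README.md) - the model example, with the clearest CoT Quality Assessment
+- **ChemCoT**: [chemcot/README.md](chemcot/README.md) - CoT is present but needs refinement
+- **PANORAMA**: [panorama/README.md](panorama/README.md) - no CoT is provided
+
+---
+
+## Notes
+
+1. **Token**: private datasets require the `HF_TOKEN` environment variable to be set
+2. **Caching**: the HuggingFace hub caches downloaded content automatically
+3. **Force refresh**: use `prepare(name, force=True)` to re-download
+4. **README precedence**: the local README overwrites the HuggingFace original, which keeps the documentation consistent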
diff --git a/rdagent/scenarios/finetune/datasets/__init__.py b/rdagent/scenarios/finetune/datasets/__init__.py
new file mode 100644
index 000000000..df8210a58
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/__init__.py
@@ -0,0 +1,133 @@
+"""Dataset preparation module for finetune scenarios.
+
+Usage:
+    from rdagent.scenarios.finetune.datasets import prepare, prepare_all
+
+    prepare("chemcot")     # Download ChemCoT dataset
+    prepare("panorama")    # Download PANORAMA dataset
+    prepare_all()          # Prepare all registered datasets
+"""
+
+import shutil
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Callable, Optional
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.scenarios.finetune.datasets.chemcot import normalize_rcr
+from rdagent.scenarios.finetune.datasets.financeiq.split import split_financeiq_dataset
+from rdagent.scenarios.finetune.download.hf import download_dataset
+
+
+@dataclass
+class DatasetConfig:
+    """Configuration for a registered dataset.
+
+    Attributes:
+        repo_id: HuggingFace dataset repository ID
+        post_download_fn: Optional function to run after download (e.g., remove test split)
+    """
+
+    repo_id: str
+    post_download_fn: Optional[Callable[[str], None]] = field(default=None)
+
+
+def _remove_eval_splits(out_dir: str) -> None:
+    """Remove validation and test split files to prevent data leakage."""
+    for pattern in ["*validation*", "*test*"]:
+        for f in list(Path(out_dir).rglob(pattern)):
+            if f.is_file():
+                f.unlink()
+            elif f.is_dir():
+                shutil.rmtree(f)
+
+
+# Dataset registry: name -> DatasetConfig
+DATASETS: dict[str, DatasetConfig] = {
+    "chemcot": DatasetConfig(
+        repo_id="OpenMol/ChemCoTDataset",
+        post_download_fn=normalize_rcr,
+    ),
+    "panorama": DatasetConfig(
+        repo_id="LG-AI-Research/PANORAMA",
+        post_download_fn=_remove_eval_splits,
+    ),
+    "deepscaler": DatasetConfig(
+        repo_id="agentica-org/DeepScaleR-Preview-Dataset",
+    ),
+    "financeiq": DatasetConfig(
+        repo_id="Duxiaoman-DI/FinanceIQ",
+        post_download_fn=lambda out_dir: split_financeiq_dataset(out_dir, split="train"),
+    ),
+    "tableinstruct": DatasetConfig(
+        repo_id="Multilingual-Multimodal-NLP/TableInstruct",
+    ),
+    "bioprobench": DatasetConfig(
+        repo_id="bowenxian/BioProBench",
+    ),
+}
+
+
+def prepare(name: str, force: bool = False) -> str:
+    """Download dataset to local directory using snapshot_download.
+
+    Downloads the entire HuggingFace dataset repository, preserving the original
+    file structure.
+
+    Args:
+        name: Dataset name (must be registered in DATASETS)
+        force: If True, re-download even if exists
+
+    Returns:
+        Path to the dataset directory
+    """
+    if name not in DATASETS:
+        raise ValueError(f"Unknown dataset: {name}. Available: {list(DATASETS.keys())}")
+
+    config = DATASETS[name]
+    out_dir = Path(FT_RD_SETTING.file_path) / "datasets" / name
+
+    # Skip if already exists and not forcing
+    if not force and out_dir.exists():
+        return str(out_dir)
+
+    # Download using snapshot_download
+    download_dataset(
+        repo_id=config.repo_id,
+        out_dir=str(out_dir),
+        force=force,
+    )
+
+    # Run post-download processing if defined
+    if config.post_download_fn:
+        config.post_download_fn(str(out_dir))
+
+    # Copy custom README if exists in source code
+    custom_readme = Path(__file__).parent / name / "README.md"
+    if custom_readme.exists():
+        shutil.copy(custom_readme, out_dir / "README.md")
+
+    return str(out_dir)
+
+
+def prepare_all(force: bool = False) -> dict[str, str]:
+    """Prepare all registered datasets.
+
+    Args:
+        force: If True, re-download even if exists
+
+    Returns:
+        Dict mapping dataset name to download path
+    """
+    return {name: prepare(name, force=force) for name in DATASETS}
+
+
+if __name__ == "__main__":
+    import sys
+
+    if len(sys.argv) > 1:
+        dataset_name = sys.argv[1]
+        path = prepare(dataset_name)
+        print(f"Dataset prepared at: {path}")
+    else:
+        print(f"Available datasets: {list(DATASETS.keys())}")
diff --git a/rdagent/scenarios/finetune/datasets/bioprobench/README.md b/rdagent/scenarios/finetune/datasets/bioprobench/README.md
new file mode 100644
index 000000000..468b6a6d6
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/bioprobench/README.md
@@ -0,0 +1,98 @@
+---
+license: cc-by-4.0
+configs:
+- config_name: PQA
+  data_files:
+  - split: train
+    path: PQA.json
+  - split: test
+    path: PQA_test.json
+- config_name: ERR
+  data_files:
+  - split: train
+    path: ERR.json
+  - split: test
+    path: ERR_test.json
+- config_name: ORD
+  data_files:
+  - split: train
+    path: ORD.json
+  - split: test
+    path: ORD_test.json
+- config_name: GEN
+  data_files:
+  - split: train
+    path: GEN.json
+  - split: test
+    path: GEN_test.json
+---
+
+# BioProBench Dataset for LLM Fine-Tuning
+
+BioProBench is a large-scale, multi-task benchmark focused on biological protocol understanding and reasoning for large language models (LLMs). It spans four fine-tuning tasks provided here: Protocol Question Answering (PQA), Step Ordering (ORD), Error Correction (ERR), and Protocol Generation (GEN).
+
+This dataset is built on a raw corpus of ~27K biological protocols and provides over 550K structured instances across tasks, with a held-out test set of 1,000 examples per task. See the original benchmark for full details:
+- Code: https://github.com/YuyangSunshine/bioprotocolbench/
+- Dataset hub: https://huggingface.co/BioProBench
+- License: CC BY 4.0
+
+## Data Files
+
+The JSON files for each task (train/test) are organized per task. If your fine-tuning pipeline expects local files, place them alongside this README or update paths accordingly.
+
+- PQA: [PQA.json](PQA.json), [PQA_test.json](PQA_test.json)
+- ERR: [ERR.json](ERR.json), [ERR_test.json](ERR_test.json)
+- ORD: [ORD.json](ORD.json), [ORD_test.json](ORD_test.json)
+- GEN: [GEN.json](GEN.json), [GEN_test.json](GEN_test.json)
+
+## Task Definitions and Fields
+
+### PQA — Protocol Question Answering
+Multiple-choice QA over protocol content.
+- Fields:
+	- `question`: the question string
+	- `choices`: list of candidate answers
+	- `answer`: the correct answer
+	- `type`: category of the question (e.g., parameter, reagent, operation)
+	- `id`: unique identifier
+
+### ORD — Step Ordering
+Order protocol steps correctly (top-level or sub-step sequences).
+- Fields:
+	- `question`: prompt describing the step list and context/title
+	- `wrong_steps`: list of steps in a shuffled or incorrect order
+	- `correct_steps`: steps in the correct chronological order
+	- `type`: sequence granularity (e.g., `top`, `child`)
+	- `id`: unique identifier
+
+### ERR — Error Correction
+Detect and correct errors in protocol text with local context.
+- Fields:
+	- `context`: object with `purpose`, `prior_step`, `next_step`
+	- `corrupted_text`: the erroneous text (may be `null` for correct cases)
+	- `corrected_text`: corrected version of the text
+	- `is_correct`: boolean indicating whether the provided text was already correct
+	- `type`: category (e.g., parameter, reagent, operation, or `correct`)
+	- `error_description`: brief rationale for the correction
+	- `id`: unique identifier
+
+### GEN — Protocol Generation
+Generate concise, single-level, numbered protocol steps from prompts.
+- Fields:
+	- `system_prompt`: role/system instruction
+	- `instruction`: formatting and style constraints
+	- `input`: task description or query
+	- `output`: list of numbered steps (flat 1., 2., 3. ...)
+	- `id`: unique identifier
+	- `type`: difficulty tag (e.g., `easy`)
+
+## Splits
+- Train: use the non-`_test.json` files per task.
+- Test: each task provides a held-out set of 1,000 examples.
+
+## License
+- CC BY 4.0 — see https://creativecommons.org/licenses/by/4.0/
+
+## Notes
+- Tasks cover protocol QA, ordering, correction, and generation (REA is part of the broader benchmark but not included in the files above).
+- Data spans diverse biological domains and repositories; see the original benchmark for details.
diff --git a/rdagent/scenarios/finetune/datasets/chemcot/README.md b/rdagent/scenarios/finetune/datasets/chemcot/README.md
new file mode 100644
index 000000000..8438c766d
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/chemcot/README.md
@@ -0,0 +1,107 @@
+---
+language:
+- en
+license: mit
+tags:
+- chemistry
+- chain-of-thought
+- molecular-reasoning
+size_categories:
+- 10K<n<100K
+task_categories:
+- text-generation
+- question-answering
+---
+
+# ChemCoT Dataset
+
+Chemical reasoning dataset with Chain-of-Thought annotations from [ChemCoTBench](https://arxiv.org/abs/2505.21318).
+
+**Repository**: [OpenMol/ChemCoTDataset](https://huggingface.co/datasets/OpenMol/ChemCoTDataset)
+
+## Overview
+
+The **ChemCoTDataset** provides ~23K high-quality chain-of-thought samples for training chemical reasoning models. CoT annotations were distilled from state-of-the-art reasoning models (Gemini-2.5-pro, DeepSeek-R1, Claude-3.7-sonnet-thinking) and validated by 13 chemistry PhD candidates with >90% accuracy.
+
+### Dataset Scale
+
+| Category | Subtasks | Samples |
+|----------|----------|---------|
+| mol_und | fg_count, ring_count, ring_system_scaffold, Murcko_scaffold | 6,319 |
+| mol_edit | add, delete, sub | 4,497 |
+| mol_opt | drd, gsk, jnk, qed, solubility, logp | 5,587 |
+| rxn | fs_by_product, fs_major_product, rcr | 6,820 |
+| **Total** | **16 subtasks** | **23,223** |
+
+## Tasks
+
+### 1. Molecular Understanding (mol_und)
+
+| Subtask | Description |
+|---------|-------------|
+| `fg_count` | Functional group counting |
+| `ring_count` | Ring counting |
+| `Murcko_scaffold` | Murcko scaffold extraction |
+| `ring_system_scaffold` | Ring system scaffold extraction |
+
+**Metrics**: MAE for counting, Tanimoto similarity for scaffold extraction
+
+### 2. Molecular Editing (mol_edit)
+
+| Subtask | Description |
+|---------|-------------|
+| `add` | Add functional groups to molecules |
+| `delete` | Delete functional groups from molecules |
+| `sub` | Substitute functional groups in molecules |
+
+**Metrics**: Pass@1 (validity and instruction matching)
+
+### 3. Molecular Optimization (mol_opt)
+
+| Subtask | Description |
+|---------|-------------|
+| `logp` | LogP (lipophilicity) optimization |
+| `solubility` | Aqueous solubility optimization |
+| `qed` | QED (drug-likeness) optimization |
+| `drd` | DRD2 binding affinity optimization |
+| `gsk` | GSK3-beta binding affinity optimization |
+| `jnk` | JNK3 binding affinity optimization |
+
+**Metrics**: Mean improvement rate, Success rate
+
+### 4. Reaction Prediction (rxn)
+
+| Subtask | Description |
+|---------|-------------|
+| `fs_major_product`, `fs_by_product` | Forward synthesis (major product and by-product prediction) |
+| `rcr` | Reaction Condition Recommendation (catalyst prediction) |
+
+**Metrics**: Top-1 accuracy, Fingerprint similarity
+
+## Data Format
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `id` | string | Unique sample identifier |
+| `query` | string | The chemical problem/question |
+| `task` | string | Task category (mol_und, mol_edit, mol_opt, rxn) |
+| `subtask` | string | Specific subtask name |
+| `struct_cot` | string | Structured chain-of-thought reasoning |
+| `raw_cot` | string | Raw chain-of-thought annotation |
+| `meta` | object | Additional metadata |
+
+## CoT Quality Assessment
+
+**IMPORTANT**: Distilled CoT may require domain refinement.
+
+| Dimension | Value |
+|-----------|-------|
+| baseline_quality | medium-high |
+| task_type | chemistry |
+| polish_difficulty | medium |
+
+**Baseline**: CoT distilled from Gemini-2.5-pro/DeepSeek-R1/Claude, validated by 13 chemistry PhD candidates (>90% accuracy). Paper notes: *"distillation strategy falters in chemistry"* - consider expert refinement for optimal results.
+
+## License
+
+MIT License
diff --git a/rdagent/scenarios/finetune/datasets/chemcot/__init__.py b/rdagent/scenarios/finetune/datasets/chemcot/__init__.py
new file mode 100644
index 000000000..bf7c3b2da
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/chemcot/__init__.py
@@ -0,0 +1,38 @@
+"""ChemCoT dataset preparation utilities."""
+
+import json
+from pathlib import Path
+
+
+def normalize_rcr(out_dir: str) -> None:
+    """Normalize rcr.json to match standard data format.
+
+    Fixes:
+    1. Move `gt` from top-level into `meta`
+    2. Rename `cot_result` to `struct_cot` and strip markdown wrapper
+    """
+    rcr_path = Path(out_dir) / "chemcotbench-cot" / "rxn" / "rcr.json"
+    if not rcr_path.exists():
+        return
+
+    with open(rcr_path) as f:
+        data = json.load(f)
+
+    for item in data:
+        # 1. Move gt from top-level into meta
+        if "gt" in item:
+            meta = json.loads(item["meta"]) if isinstance(item["meta"], str) else item["meta"]
+            meta["gt"] = item.pop("gt")
+            item["meta"] = json.dumps(meta)
+
+        # 2. Rename cot_result -> struct_cot, strip markdown wrapper
+        if "cot_result" in item:
+            cot = item.pop("cot_result").strip()
+            if cot.startswith("```json"):
+                cot = cot[7:]
+            if cot.endswith("```"):
+                cot = cot[:-3]
+            item["struct_cot"] = cot.strip()
+
+    with open(rcr_path, "w") as f:
+        json.dump(data, f, indent=4)
diff --git a/rdagent/scenarios/finetune/datasets/deepscaler/README.md b/rdagent/scenarios/finetune/datasets/deepscaler/README.md
new file mode 100644
index 000000000..568754997
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/deepscaler/README.md
@@ -0,0 +1,61 @@
+---
+language:
+- en
+size_categories:
+- 10K<n<100K
+license: mit
+configs:
+- config_name: default
+  data_files:
+  - split: train
+    path: data/train-*
+  splits:
+  - name: train
+    num_examples: 40315
+---
+
+# DeepScaleR Mathematical Reasoning Dataset
+
+Dataset for DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL.
+
+> DeepScaleR-1.5B-Preview achieves **43.1% Pass@1 accuracy on AIME 2024**, representing a **15% improvement** over the base model (28.8%) and **surpassing OpenAI's O1-Preview performance** with just 1.5B parameters through distributed reinforcement learning.
+
+## Overview
+
+The **DeepScaleR dataset** is a carefully curated collection of approximately **40,000 unique mathematics problem-answer pairs** designed for training mathematical reasoning models through reinforcement learning. This dataset covers a wide range of competition-level mathematics problems from high school to olympiad level, providing a robust foundation for scaling RL algorithms on reasoning tasks.
+
+DeepScaleR demonstrates that sophisticated mathematical reasoning can be achieved through strategic data curation combined with iterative context length scaling (8K→16K→24K) using Group Relative Policy Optimization (GRPO).
+
+
+### Data Sources
+
+Our training dataset consists of problems compiled from prestigious mathematics competitions and curated datasets:
+
+- **AIME** (American Invitational Mathematics Examination) problems (1984-2023)
+- **AMC** (American Mathematics Competition) problems (prior to 2023)
+- **Omni-MATH** dataset
+- **Still** dataset
+
+### Data Fields
+
+The dataset contains three key fields:
+
+- `problem`: The mathematical problem statement, formatted with LaTeX notation
+- `solution`: Official solution to the problem, including LaTeX formatting and boxed final answers. If there is no solution, the `solution` field is an empty string
+- `answer`: The final mathematical result/answer, usually extracted from the solution
+
+## CoT Quality Assessment
+
+**IMPORTANT**: Raw data must be polished before training.
+
+| Dimension | Value |
+|-----------|-------|
+| baseline_quality | low |
+| task_type | math |
+| polish_difficulty | high |
+
+**Baseline**: 82% of samples have an empty `solution`; the remaining 18% are too short (p50 = 373 tokens, summary-style). Exploratory CoT must be generated for all samples. For reference, a well-structured CoT is usually longer than 1/4 of the model's `max_position_embeddings` (in tokens).
+
+## License
+
+This dataset is released under the MIT License.
diff --git a/rdagent/scenarios/finetune/datasets/financeiq/__init__.py b/rdagent/scenarios/finetune/datasets/financeiq/__init__.py
new file mode 100644
index 000000000..4073e454c
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/financeiq/__init__.py
@@ -0,0 +1 @@
+from .split import get_split_indices, split_financeiq_dataset
diff --git a/rdagent/scenarios/finetune/datasets/financeiq/split.py b/rdagent/scenarios/finetune/datasets/financeiq/split.py
new file mode 100644
index 000000000..04ff96a31
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/financeiq/split.py
@@ -0,0 +1,64 @@
+import csv
+import math
+from pathlib import Path
+from typing import Literal
+
+
+def get_split_indices(
+    total_count: int, split: Literal["train", "test"], test_limit: int = 100, test_ratio: float = 0.5
+) -> slice:
+    """
+    Calculate the slice for train/test split.
+
+    Rule:
+    - Test set size = min(ceil(total_count * test_ratio), test_limit)
+    - Test set takes from the END of the data.
+    - Train set takes the rest (from the START).
+    """
+    test_count = min(int(math.ceil(total_count * test_ratio)), test_limit)
+
+    if split == "test":
+        return slice(total_count - test_count, total_count)
+    else:
+        return slice(0, total_count - test_count)
+
+
+def split_financeiq_dataset(data_dir: str, split: Literal["train", "test"]) -> None:
+    """
+    Iterate over CSV files in the directory and apply the split in-place.
+    """
+    path = Path(data_dir)
+
+    # Process CSV files
+    for f in list(path.rglob("*.csv")):
+        # HACK:
+        # FinanceIQ specific: 'dev' folder is small and used for few-shot.
+        # We preserve it for benchmarking (split='test') but remove for training (split='train') to avoid leakage.
+        # Sometimes, training in LLaMA-Factory's debug mode checks only a few samples, which may result in failures.
+        rel_parts = f.relative_to(path).parts
+        if "dev" in rel_parts:
+            if split == "train":
+                f.unlink()
+            continue
+
+        rows = []
+        header = None
+        # Use 'utf-8-sig' to handle potential BOM in Excel-saved CSVs, or just 'utf-8'
+        # Assuming 'utf-8' for now as it's standard for HF datasets
+        with open(f, "r", encoding="utf-8", newline="") as fp:
+            reader = csv.reader(fp)
+            try:
+                header = next(reader)
+                rows = list(reader)
+            except StopIteration:
+                # Empty file
+                continue
+
+        indices = get_split_indices(len(rows), split)
+        new_rows = rows[indices]
+
+        with open(f, "w", encoding="utf-8", newline="") as fp:
+            writer = csv.writer(fp)
+            if header:
+                writer.writerow(header)
+            writer.writerows(new_rows)
diff --git a/rdagent/scenarios/finetune/datasets/panorama/README.md b/rdagent/scenarios/finetune/datasets/panorama/README.md
new file mode 100644
index 000000000..117c2b8d1
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/panorama/README.md
@@ -0,0 +1,137 @@
+---
+language:
+- en
+license: cc-by-nc-4.0
+tags:
+- patent
+- legal
+- retrieval
+- classification
+size_categories:
+- 100K<n<1M
+task_categories:
+- text-classification
+- question-answering
+---
+
+# PANORAMA Dataset
+
+Patent examination benchmark capturing decision trails and rationales from [PANORAMA](https://huggingface.co/datasets/LG-AI-Research/PANORAMA).
+
+**Repository**: [LG-AI-Research/PANORAMA](https://huggingface.co/datasets/LG-AI-Research/PANORAMA)
+
+## Tasks
+
+### 1. PAR4PC: Prior-Art Retrieval for Patent Claims
+
+**Task**: Multi-label classification - select relevant prior-art documents from 8 candidates.
+
+**Train samples**: 54,028
+
+**Metrics**: Exact Match Accuracy, Custom Score (partial credit)
+
+### 2. PI4PC: Paragraph Identification for Patent Claims
+
+**Task**: Single-choice - identify the most relevant paragraph in a prior-art document.
+
+**Train samples**: 64,210
+
+**Metrics**: Exact Match Accuracy
+
+### 3. NOC4PC: Novelty and Non-Obviousness Classification
+
+**Task**: Ternary classification - determine if a claim should be ALLOW, 102 rejection, or 103 rejection.
+
+**Train samples**: 136,211
+
+**Metrics**: Macro F1-score, Per-class Accuracy
+
+## Legal Background
+
+- **35 U.S.C. §102 (Novelty)**: Claim rejected if anticipated by a single prior art reference
+- **35 U.S.C. §103 (Non-Obviousness)**: Claim rejected if obvious from combining prior art
+
+## Data Format (Parquet Fields)
+
+### PAR4PC / PI4PC Format
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `application_number` | str | Patent application identifier |
+| `claim_number` | int64 | Specific claim number being evaluated |
+| `context` | dict | Patent context: `{abstract: str, claims: list[str], title: str}` |
+| `options` | dict | 8 candidate documents: `{A: {abstract, claims, patent_id, title}, B: {...}, ...}` |
+| `gold_answers` | ndarray | Correct answer labels, e.g. `array(['G'])` or `array(['A', 'C'])` |
+| `silver_answers` | ndarray | Partially correct answers |
+| `negative_answers` | ndarray | Incorrect options |
+
+**Note**: PI4PC has an additional `prior_art_specification` field containing the relevant prior-art document text.
+
+### NOC4PC Format
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `application_number` | str | Patent application identifier |
+| `claim_number` | int64 | Specific claim number being evaluated |
+| `context` | dict | Patent context: `{abstract: str, claims: list[str], title: str}` |
+| `prior_art_specifications` | list | Prior art document specifications |
+| `answer` | str | Classification label: `ALLOW`, `102`, or `103` |
+
+**Important**: Array fields (`gold_answers`, `silver_answers`, `negative_answers`) are `numpy.ndarray` type.
+Use `.tolist()` to convert to Python list before processing.
+
+### Example Data
+
+```python
+{
+    "application_number": 14281639,
+    "claim_number": 1,
+    "context": {
+        "abstract": "In an endodontic procedure...",
+        "claims": ["claim 1 text", "claim 2 text", ...],
+        "title": "Method for irrigating root canals"
+    },
+    "options": {
+        "A": {"abstract": "...", "claims": [...], "patent_id": "US1234567", "title": "..."},
+        "B": {"abstract": "...", "claims": [...], "patent_id": "US2345678", "title": "..."},
+        # ... G, H
+    },
+    "gold_answers": array(['G'], dtype=object),  # numpy.ndarray, use .tolist() -> ['G']
+    "negative_answers": array(['A', 'B', 'C', 'D', 'E', 'F', 'H'], dtype=object)
+}
+```
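+
+As a minimal sketch of the `.tolist()` conversion noted above (the `normalize_answers` helper and the sample record are hypothetical, not part of the dataset tooling), array fields can be normalized to plain lists before further processing:
+
+```python
+import numpy as np
+
+ARRAY_FIELDS = ("gold_answers", "silver_answers", "negative_answers")
+
+
+def normalize_answers(record: dict) -> dict:
+    """Return a copy of `record` with ndarray answer fields converted to lists."""
+    out = dict(record)
+    for key in ARRAY_FIELDS:
+        value = out.get(key)
+        if isinstance(value, np.ndarray):
+            out[key] = value.tolist()
+    return out
+
+
+# Hypothetical record shaped like the example above.
+sample = {
+    "gold_answers": np.array(["G"], dtype=object),
+    "negative_answers": np.array(["A", "B"], dtype=object),
+}
+clean = normalize_answers(sample)
+```
+
+Plain lists can be serialized with `json.dumps`, which `numpy.ndarray` values cannot.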
+
+## CoT Quality Assessment
+
+**IMPORTANT**: This dataset does NOT contain CoT annotations.
+
+| Dimension | Value |
+|-----------|-------|
+| baseline_quality | N/A (no CoT) |
+| task_type | legal reasoning |
+| polish_difficulty | high |
+
+**Baseline**: Raw data contains rejection reasons but NO step-by-step reasoning chains. Paper explicitly states *"lacked ground-truth CoTs"*. **You MUST generate CoT** for all samples before training.
+
+## Baseline Performance (CoT Prompting)
+
+| Task | Best Model | Score |
+|------|-----------|-------|
+| PAR4PC | Gemma-3-12B | 77.30% |
+| PI4PC | GPT-4o | 62.62% |
+| NOC4PC | Claude-3.7-Sonnet | 45.40% |
+
+## Citation
+
+```bibtex
+@article{panorama2024,
+  title={PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination},
+  author={LG AI Research and KAIST},
+  year={2024},
+  url={https://huggingface.co/datasets/LG-AI-Research/PANORAMA}
+}
+```
+
+## License
+
+CC-BY-NC-4.0 License
diff --git a/rdagent/scenarios/finetune/datasets/tableinstruct/README.md b/rdagent/scenarios/finetune/datasets/tableinstruct/README.md
new file mode 100644
index 000000000..d0d7c9a4c
--- /dev/null
+++ b/rdagent/scenarios/finetune/datasets/tableinstruct/README.md
@@ -0,0 +1,262 @@
+---
+language:
+- en
+size_categories:
+- 1K<n<10K
+license: mit
+configs:
+- config_name: test
+  data_files:
+  - split: test
+    path: data/test-*
+  splits:
+  - name: test
+    num_examples: 886
+- config_name: train
+  data_files:
+  - split: train
+    path: data/train-*
+  splits:
+  - name: train
+    num_examples: ~10K
+---
+
+# TableBench: Table Question Answering Dataset
+
+Dataset for TableBench: A Comprehensive and Complex Benchmark for Table Question Answering.
+
+> TableBench is a **comprehensive** and **complex** benchmark designed to evaluate Table Question Answering (TableQA) capabilities, covering **18 question categories** across **4 major categories** with **886** carefully curated test cases. 
+
+## Overview
+
+The **TableBench dataset** consists of two main components:
+
+1. **TableBench (Test)**: 886 high-quality test cases for evaluation across 4 major reasoning categories
+2. **TableInstruct (Train)**: Large-scale training dataset with diverse table QA examples
+
+TableBench substantially pushes the boundaries of large language models in complex TableQA scenarios, aligning closely with the "Reasoning Complexity of Questions" dimension in real-world Table QA applications.
+
+### Task Categories
+
+The benchmark covers **4 major categories** with **18 sub-tasks**:
+
+1. **Fact Checking**: Verify factual statements against table data
+   - Simple fact verification, cross-table validation, temporal consistency
+
+2. **Numerical Reasoning**: Mathematical computations and comparisons
+   - Arithmetic operations, aggregations, comparative analysis
+
+3. **Data Analysis**: Complex analytical reasoning
+   - Impact analysis, correlation analysis, trend forecasting, statistical analysis
+
+4. **Visualization**: Chart generation and interpretation
+   - Bar charts, line charts, pie charts, scatter plots
+
+### Data Sources
+
+**Test Data (TableBench)**:
+- Repository: [Multilingual-Multimodal-NLP/TableBench](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/TableBench)
+- 886 carefully curated and verified test cases
+- Enhanced version released April 2025 with error corrections
+
+**Train Data (TableInstruct)**:
+- Repository: [Multilingual-Multimodal-NLP/TableInstruct](https://huggingface.co/datasets/Multilingual-Multimodal-NLP/TableInstruct)
+- Large-scale instruction tuning dataset for table QA
+- Diverse question types and reasoning patterns
+
+### Data Fields
+
+The TableInstruct (train) dataset contains the following fields:
+
+- `id` (string): Unique identifier for each sample
+- `qtype` (string): Major task category (4 values)
+  - `FactChecking`, `NumericalReasoning`, `DataAnalysis`, `Visualization`
+- `qsubtype` (string): Specific sub-task type (18 values)
+  - Examples: `Counting`, `Aggregation`, `Comparison`, `CorrelationAnalysis`, etc.
+- `instruction` (string): Complete instruction template with task guidelines
+  - Contains the full prompt template defining how to approach the task
+  - Includes role definition, guidelines, code format requirements
+  - Typically 800-15,000 characters depending on instruction type
+- `instruction_type` (string): Reasoning strategy type (4 values)
+  - `DP` (Direct Prompting), `TCoT` (Textual Chain-of-Thought)
+  - `PoT` (Program-of-Thought), `SCoT` (Symbolic Chain-of-Thought)
+- `table` (string): Table data in JSON format
+  - Structure: `{"columns": [...], "data": [[...], [...], ...]}`
+- `question` (string): Specific question about the table
+- `response` (string): Model's answer including reasoning process
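+
+A minimal sketch of working with these fields, assuming each sample is a dict with the keys listed above (the `table_to_rows` and `filter_by_instruction_type` helpers and the sample values are hypothetical):
+
+```python
+import json
+
+
+def table_to_rows(table_json: str) -> list[dict]:
+    """Parse the `table` string into one dict per row, keyed by column name."""
+    table = json.loads(table_json)
+    return [dict(zip(table["columns"], row)) for row in table["data"]]
+
+
+def filter_by_instruction_type(samples: list[dict], wanted: set[str]) -> list[dict]:
+    """Keep only samples whose `instruction_type` is in `wanted`."""
+    return [s for s in samples if s["instruction_type"] in wanted]
+
+
+# Hypothetical samples; real entries carry all fields listed above.
+samples = [
+    {"instruction_type": "PoT", "table": '{"columns": ["year", "sales"], "data": [[2020, 10], [2021, 15]]}'},
+    {"instruction_type": "DP", "table": '{"columns": ["name"], "data": [["a"]]}'},
+]
+pot_samples = filter_by_instruction_type(samples, {"PoT"})
+rows = table_to_rows(pot_samples[0]["table"])
+```
+
+Filtering on `instruction_type` first keeps the table parsing limited to the samples that will actually be used for training.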
+
+**TableBench Test Dataset Fields**:
+
+- `question`: The table question or task description
+- `table`: The table data (JSON format)
+- `answer`: The ground truth answer
+- `category`: Major category
+- `subcategory`: Specific sub-task type
+
+<!-- - `question`: The table question or task description
+- `table`: The table data (various formats: CSV, JSON, markdown)
+- `answer`: The ground truth answer or expected output
+- `category`: Major category (Fact Checking, Numerical Reasoning, Data Analysis, Visualization)
+- `subcategory`: Specific sub-task type
+- `reasoning_steps`: Optional chain-of-thought reasoning (for training data) -->
+
+### Instruction Types and Reasoning Strategies
+TableBench training data (TableInstruct) covers multiple instruction types, each defining how the model approaches reasoning and generates answers. Understanding these types is crucial for dataset filtering and fine-tuning strategy selection.
+
+### Available Instruction Types
+
+#### 1. Direct Prompting (DP)
+
+**Characteristics**:
+- Provides solutions directly without intermediate reasoning steps
+- Simplest instruction format focused on immediate answer generation
+- Best for straightforward fact-checking and simple queries
+
+**Instruction Template Pattern**:
+  You are a table analyst. Your task is to answer questions based on the table content.
+  Read the table below in JSON format: [TABLE]
+  Question: [QUESTION]
+  Answer directly.
+
+**Response Format**:
+  [Direct Answer]
+
+#### 2. Textual Chain-of-Thought (TCoT)
+
+**Characteristics**:
+- LLMs incrementally derive intermediate steps through textual reasoning
+- Natural language explanations for each reasoning step
+- Suitable for complex reasoning requiring logical deduction
+
+**Instruction Template Pattern**:
+  You are a table analyst. Your task is to answer questions based on the table content.
+  [Guidelines for step-by-step reasoning]
+  Think step by step
+  Show your reasoning process
+  Provide the final answer
+
+**Response Format**:
+  Let's analyze this step by step:
+  [First reasoning step]
+  [Second reasoning step]
+  ...
+  Final Answer: [Answer]
+
+#### 3. Program-of-Thought (PoT)
+
+**Characteristics**:
+- Decomposes problems into executable Python code
+- Separates computation from reasoning using programming
+- Ideal for numerical reasoning and computational tasks
+- Most common type in TableInstruct for analytical tasks
+
+**Instruction Template Pattern** (actual from dataset):
+  You are a data analyst proficient in Python. Your task is to write executable Python
+  code to analyze the table and then answer questions.
+  [Guidelines]
+  1. Based on the question, write out your analytical approach, then write Python code
+  2. The code needs to be concise and easy to understand
+  3. Code blocks need to strictly start with
+  '''
+  import pandas as pd
+  df = pd.read_csv('table.csv')
+  ...
+  print(f'Final Answer: {answer}')
+  '''
+  4.Your analysis must be based entirely on the above data
+  5.Generate executable code with results using print function
+  6.Ensure to load the table with: df = pd.read_csv('table.csv')
+
+
+#### 4. Symbolic Chain-of-Thought (SCoT)
+
+**Characteristics**:
+- A methodology that utilizes Python-based instructions to facilitate logical reasoning
+- Combines symbolic reasoning with executable code verification
+- Three primary steps repeated until a definitive conclusion is derived
+- Distinguishes itself from PoT by emphasizing iterative analysis-generation-simulation cycles
+
+**Three-Step Process**:
+- **STEP-1**: Analyzing the available information to determine the next move
+- **STEP-2**: Generating instructions using Python programming language commands
+- **STEP-3**: Simulating the outcomes by executing the instructions and analyzing the results
+
+**Instruction Template Pattern**:
+  You are a table analyst. Use symbolic reasoning with iterative Python commands.
+  Process:
+  STEP-1: Analyze available information to determine the next move
+  STEP-2: Generate Python programming language commands
+  STEP-3: Simulate outcomes by executing instructions and analyzing results
+  Repeat these three steps until reaching a definitive conclusion
+
+
+
+
+### Evaluation Metrics
+
+Different metrics are used based on task type:
+
+| Task Type | Metric | Description |
+|-----------|--------|-------------|
+| Fact Checking | Exact Match (EM) | Exact match of predicted statement |
+| Numerical Reasoning | Exact Match (EM) | Correctness of numerical outputs |
+| Impact Analysis | Exact Match (EM) | Precise match of influential factors |
+| Correlation/Trend/Stats | EM_with_error_10 | ±10% numerical margin of error |
+| Other Data Analysis | ROUGE-L | For open-ended textual responses |
+| Visualization | Pass@1 | Correct chart generated on first attempt |
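+
+As a rough sketch of how a ±10% numerical margin could be checked (an illustrative assumption, not the official TableBench scorer; `within_relative_error` is a hypothetical helper):
+
+```python
+def within_relative_error(predicted: str, expected: str, tolerance: float = 0.10) -> bool:
+    """Return True if `predicted` is within `tolerance` (relative) of `expected`."""
+    try:
+        pred, gold = float(predicted), float(expected)
+    except ValueError:
+        # Non-numeric answers fall back to an exact string match.
+        return predicted.strip() == expected.strip()
+    if gold == 0:
+        # A relative margin is undefined at zero, so require an exact zero.
+        return pred == 0
+    return abs(pred - gold) <= tolerance * abs(gold)
+
+
+close_enough = within_relative_error("0.90", "0.95")
+too_far = within_relative_error("0.80", "0.95")
+```
+
+Here the margin is taken relative to the ground-truth value; whether the official metric uses the same reference point is not stated in this README.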
+
+## CoT Quality Assessment
+
+**IMPORTANT**: Consider enhancing reasoning chains during training preparation.
+
+| Dimension | Value |
+|-----------|-------|
+| baseline_quality | medium-high |
+| task_type | table_qa |
+| polish_difficulty | medium |
+
+**Baseline**: Training data (TableInstruct) contains reasoning examples, but test data focuses on final answers. For complex reasoning tasks (Data Analysis, Numerical Reasoning), generating detailed step-by-step CoT can significantly improve model performance.
+
+**Recommendation**: For Data Analysis and Numerical Reasoning categories, expand reasoning chains to include:
+- Table understanding and schema identification
+- Step-by-step computation or logical reasoning
+- Intermediate results and verification
+- Final answer with confidence indicators
+
+## Example
+
+### Fact Checking
+```json
+{
+  "question": "Based on the table, verify if the statement is true: 'Company A had higher revenue than Company B in Q4 2023'",
+  "table": "| Company | Q4 2023 Revenue |\n|---------|----------------|\n| A       | $2.5M          |\n| B       | $3.1M          |",
+  "answer": "False",
+  "category": "Fact Checking",
+  "subcategory": "simple_fact_verification"
+}
+```
+
+### Numerical Reasoning
+```json
+{
+  "question": "What is the total revenue across all quarters for Product X?",
+  "table": "| Quarter | Product X Revenue |\n|---------|------------------|\n| Q1      | 150              |\n| Q2      | 200              |\n| Q3      | 175              |\n| Q4      | 225              |",
+  "answer": "750",
+  "category": "Numerical Reasoning",
+  "subcategory": "aggregation"
+}
+```
+
+### Data Analysis
+```json
+{
+  "question": "Analyze the correlation between marketing spend and sales growth. What is the correlation coefficient?",
+  "table": "| Month | Marketing ($K) | Sales Growth (%) |\n|-------|----------------|------------------|\n| Jan   | 50             | 12               |\n| Feb   | 75             | 18               |\n| Mar   | 60             | 15               |",
+  "answer": "0.95",
+  "category": "Data Analysis",
+  "subcategory": "correlation_analysis"
+}
+```
+
+
+## License
+
+This dataset is released under the MIT License.
+
diff --git a/rdagent/scenarios/finetune/dev/feedback.py b/rdagent/scenarios/finetune/dev/feedback.py
new file mode 100644
index 000000000..0a5145903
--- /dev/null
+++ b/rdagent/scenarios/finetune/dev/feedback.py
@@ -0,0 +1,154 @@
+"""
+LLM Fine-tuning Experiment Feedback Generation
+
+Provides feedback analysis for LLM fine-tuning experiments, including
+model performance evaluation, training metrics analysis, and improvement suggestions.
+"""
+
+import json
+from typing import Dict
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.core.proposal import (
+    Experiment2Feedback,
+    ExperimentFeedback,
+    HypothesisFeedback,
+)
+from rdagent.core.scenario import Scenario
+from rdagent.log import rdagent_logger as logger
+from rdagent.log.utils import dict_get_with_warning
+from rdagent.oai.llm_utils import APIBackend
+from rdagent.scenarios.finetune.experiment.experiment import FTExperiment
+from rdagent.scenarios.finetune.proposal.proposal import FTHypothesis
+from rdagent.scenarios.finetune.proposal.trace import FTTrace
+from rdagent.utils import convert2bool
+from rdagent.utils.agent.tpl import T
+
+
+class FTExperiment2Feedback(Experiment2Feedback):
+    """Generate feedback for LLM fine-tuning experiments"""
+
+    def __init__(self, scen: Scenario, version: str = "exp_feedback") -> None:
+        super().__init__(scen)
+        self.version = version
+
+    def generate_feedback(
+        self, exp: FTExperiment, trace: FTTrace | None = None, exception: Exception | None = None
+    ) -> ExperimentFeedback:
+        """
+        Generate comprehensive feedback for LLM fine-tuning experiment.
+
+        Args:
+            exp: The experiment to analyze
+            trace: Experiment trace (optional)
+            exception: If provided, indicates experiment failed and contains error details
+
+        Note: If exception is None, it means training succeeded and we evaluate quality/effectiveness.
+              If exception is provided, we analyze the failure cause.
+        """
+        # Get task information
+        task_desc = exp.sub_tasks[0].get_task_information()
+
+        # Initialize for SOTA update logic later
+        sota_benchmark = None
+
+        if exception is not None:
+            # Error case: use error analysis prompt
+            version = "exp_feedback_error"
+            error_info = str(exception)
+
+            # Try to get FTRunnerEvaluator's analysis result from workspace
+            # This contains structured feedback (execution, return_checking, code) instead of raw error string
+            runner_feedback = None
+            if exp.sub_workspace_list:
+                for ws in exp.sub_workspace_list:
+                    if ws and hasattr(ws, "feedback") and ws.feedback:
+                        runner_feedback = ws.feedback
+                        break
+
+            if runner_feedback:
+                # Use FTRunnerEvaluator's structured analysis result
+                error_info = f"""## Execution Analysis
+{runner_feedback.execution}
+
+## Return Checking
+{runner_feedback.return_checking}
+
+## Code Analysis
+{runner_feedback.code}"""
+
+            system_prompt = T(f".prompts:{version}.system").r(
+                scenario=self.scen.get_scenario_all_desc(),
+            )
+            # Get workspace files safely
+            workspace_files = {}
+            if hasattr(exp, "experiment_workspace") and exp.experiment_workspace is not None:
+                workspace_files = exp.experiment_workspace.file_dict
+            user_prompt = T(f".prompts:{version}.user").r(
+                hypothesis=exp.hypothesis,
+                task_desc=task_desc,
+                workspace_files=workspace_files,
+                error_info=error_info,
+            )
+        else:
+            # Success case: use normal feedback prompt
+            version = self.version
+            # Process experiment result - handle both new and legacy formats
+            exp_result = exp.experiment_workspace.running_info.result
+            if isinstance(exp_result, dict) and "benchmark" in exp_result:
+                # New format: contains benchmark and training_metrics
+                benchmark = exp_result.get("benchmark", {})
+                raw_metrics = exp_result.get("training_metrics", {})
+                # Pass loss_history directly (simpler and preserves full information)
+                loss_history = raw_metrics.get("loss_history", {"train": [], "eval": []})
+                # Keep only the first and last 30 train entries when there are too many, to avoid token bloat
+                if len(loss_history.get("train", [])) > 60:
+                    loss_history["train"] = loss_history["train"][:30] + loss_history["train"][-30:]
+                has_loss = loss_history.get("train") or loss_history.get("eval")
+                training_metrics = {"loss_history": loss_history} if has_loss else {}
+            else:
+                # Legacy format: exp_result is directly the benchmark result (list of dicts)
+                benchmark = {"accuracy_summary": exp_result, "error_samples": []}
+                training_metrics = {}
+
+            # Get SOTA experiment's benchmark results for comparison
+            sota_benchmark = trace.sota_benchmark() if trace else None
+
+            # Get baseline benchmark (always exists, computed at scenario init)
+            baseline_benchmark = getattr(self.scen, "baseline_benchmark_score", None)
+
+            system_prompt = T(f".prompts:{version}.system").r(
+                scenario=self.scen.get_scenario_all_desc(),
+                has_sota=sota_benchmark is not None,
+                force_think_token=FT_RD_SETTING.force_think_token,
+            )
+            user_prompt = T(f".prompts:{version}.user").r(
+                hypothesis=exp.hypothesis,
+                task_desc=task_desc,
+                workspace_files=exp.experiment_workspace.file_dict,
+                execution_time=exp.experiment_workspace.running_info.running_time,
+                benchmark=benchmark,
+                training_metrics=training_metrics,
+                sota_benchmark=sota_benchmark,
+                baseline_benchmark=baseline_benchmark,
+            )
+
+        resp_dict = json.loads(
+            APIBackend().build_messages_and_create_chat_completion(
+                user_prompt=user_prompt,
+                system_prompt=system_prompt,
+                json_mode=True,
+                json_target_type=Dict[str, str | bool | int],
+            )
+        )
+
+        # Extract feedback components
+        error_type = resp_dict.get("Error Type") if exception is not None else None
+        hypothesis_feedback = HypothesisFeedback(
+            code_change_summary=dict_get_with_warning(resp_dict, "Code Summary", "No code summary provided"),
+            reason=dict_get_with_warning(resp_dict, "Reason", "No reasoning provided"),
+            decision=convert2bool(dict_get_with_warning(resp_dict, "Decision", "no")),
+            acceptable=exception is None,  # Only acceptable if no error
+            observations=error_type,  # Store error type for history display
+        )
+
+        return hypothesis_feedback
diff --git a/rdagent/scenarios/finetune/dev/prompts.yaml b/rdagent/scenarios/finetune/dev/prompts.yaml
new file mode 100644
index 000000000..a1e7e25c1
--- /dev/null
+++ b/rdagent/scenarios/finetune/dev/prompts.yaml
@@ -0,0 +1,234 @@
+exp_feedback:
+  system: |-
+    You are an expert AI assistant specializing in analyzing LLM fine-tuning experiments.
+
+    Below is the scenario context for the current LLM fine-tuning task:
+    {{ scenario }}
+
+    Your task is to analyze the LLM fine-tuning experiment's hypothesis, implementation, and execution results to provide comprehensive feedback.
+    Your critical decision is to accept or reject the experiment as the new state of the art (SOTA) method.
+
+    # Decision Making Framework:
+    ## Step 0: Pre-definition
+    - The user has proposed a hypothesis for fine-tuning a specific base model. Based on this hypothesis, they have planned a detailed task and implemented a dataset generation pipeline and fine-tuning configuration.
+    - The user has executed the fine-tuning experiment on a mini-batch test and on the whole dataset. The execution was successful.
+    - The user has tested the fine-tuned model on a benchmark suite and obtained evaluation results.
+
+    ## Step 1: Benchmark Metrics Evaluation (HIGHEST PRIORITY)
+    **This is the most critical step. Benchmark performance is the primary decision factor.**
+    - The user will provide you the benchmark evaluation results after executing the fine-tuned model on a benchmark suite.
+    {% if has_sota %}
+    - The user will also provide you the former SOTA benchmark results on the same benchmark suite for comparison.
+    - If the current experiment **exceeds SOTA on the primary metrics**, this is a strong signal to ACCEPT.
+    - If the results are significantly worse than SOTA, reject with [Benchmark Performance Issue].
+    {% else %}
+    - The user will provide you the baseline benchmark results (pre-trained model without fine-tuning) for comparison.
+    - If the current experiment **exceeds baseline**, this is a strong signal to ACCEPT.
+    - If the results are worse than or equal to baseline, reject with [Benchmark Performance Issue].
+    {% endif %}
+
+    ## Step 2: Code Quality Assessment
+    - Evaluate the implementation quality and best practices
+    - Compare the implementation against SOTA methods. If the implementation is significantly worse than SOTA methods, reject the experiment and start your reason with: [Implementation Quality Issue].
+
+    ## Step 3: Final Decision (Acceptance as SOTA)
+    You MUST determine the "Decision" (yes/no) based on the following:
+
+    {% if has_sota %}
+    **Compare with SOTA**
+    - **Primary rule**: If benchmark results exceed SOTA → Decision: "yes"
+    - Consider metrics comprehensively, but prioritize actual performance over hypothesis alignment
+    - Set "Decision": "no" only if SOTA is still better on the primary metrics
+    {% else %}
+    **Compare with BASELINE (no SOTA yet)**
+    - **Primary rule**: If benchmark results exceed baseline → Decision: "yes"
+    - The baseline results will be provided in the user prompt
+    - Set "Decision": "no" only if results are worse than or equal to baseline
+    {% endif %}
+    - A config that "doesn't match hypothesis" but produces better results is still a valid finding worth accepting.
+
+    # Core improvement identification
+    ## Failure identification (On rejection)
+    - The user has provided you the hypothesis, task description, implementation code, execution logs, and benchmark results. You should analyze them and provide an in-depth explanation.
+    - Identify the main cause of failure. Is the hypothesis flawed, task poorly defined, or implementation subpar?
+    - Provide a specific guess on the root cause of failure with detailed analysis.
+    - Put your analysis in the "reason" field of your final response.
+
+    ## Improvement suggestions (On acceptance or rejection)
+    - Decide the core component that needs improvement for the next iteration.
+    - Suggest specific improvements or alternative approaches.
+    - Put your suggestions in the "reason" field of your final response.
+
+    # Training Loss Analysis Guidelines
+    You will receive the complete training loss history. Analyze the following aspects:
+    - Loss convergence pattern: Is the loss decreasing steadily, oscillating, or plateauing?
+    - Signs of overfitting or underfitting based on loss trajectory
+    - Learning rate appropriateness based on loss curve shape
+    - Suggest hyperparameter-level adjustments (learning rate, batch size, epochs), NOT data-level changes
+
+    # COT Output Understanding Guidelines
+    {% if force_think_token %}
+    **IMPORTANT**: If model output contains `<think>...</think>` tags, this is NORMAL and EXPECTED.
+
+    - During benchmark evaluation, a postprocessor REMOVES `<think>...</think>` content
+    - The evaluator ONLY sees content AFTER `</think>`
+    - Having `<think>` tags is correct CoT training behavior, NOT an error
+    {% endif %}
+    {# When force_think_token=false, model output won't have <think> tags, no special explanation needed #}
+
+    # Error Sample Analysis Guidelines (CRITICAL - Avoid Benchmark Leakage)
+    You will receive model outputs for incorrectly answered questions.
+    **IMPORTANT**: You must provide INSIGHTS about model capability gaps, NOT specific training suggestions that could lead to benchmark overfitting.
+
+    **DO:**
+    - Identify error patterns (e.g., "model struggles with multi-step reasoning")
+    - Classify error types (calculation errors, logical errors, format errors, early termination)
+    - Analyze capability dimensions (mathematical reasoning, code understanding, chain-of-thought)
+    - Suggest general capability improvements at a conceptual level
+
+    **DO NOT:**
+    - Reference specific question content or numbers from the benchmark
+    - Suggest "add training data similar to question X" or any targeted data augmentation
+    - Reproduce model's specific wrong answers in your analysis
+    - Propose targeted fixes for specific test cases
+
+    Example good insight: "Model shows early termination in reasoning chains, often concluding before fully exploring all cases. This suggests insufficient training on long-form reasoning tasks."
+    Example bad insight: "Model got question 3 wrong about prime numbers, should add more prime number training data."
+
+    # Code Change Summary
+    - Concisely summarize the user's implementation approach and key components, compared with SOTA methods.
+
+    Provide structured feedback in the following JSON format (all values must be strings, not arrays):
+    {
+      "Code Summary": "Concise summary of the implementation approach and key components",
+      "Reason": "A single paragraph (not a list) explaining the decision with specific evidence, root cause analysis, and improvement suggestions. Limit to 3-5 sentences.",
+      "Decision": "yes or no - whether this experiment should be accepted as the new SOTA (see Step 3)"
+    }
+
+  user: |-
+    # Current LLM Fine-tuning Experiment Analysis
+
+    ## Hypothesis
+    {{ hypothesis }}
+
+    ## Task Description
+    {{ task_desc }}
+
+    ## Workspace Files
+    {% for file_name, file_content in workspace_files.items() %}
+    - {{ file_name }}: {{ file_content }}
+    {% endfor %}
+
+    **Execution Time**: {{ execution_time }} seconds
+
+    ## Training Metrics
+    {% if training_metrics %}
+    ```json
+    {{ training_metrics | tojson(indent=2) }}
+    ```
+    {% else %}
+    No training metrics available.
+    {% endif %}
+
+    ## Benchmark Results
+    ### Accuracy Summary
+    {% if benchmark.accuracy_summary %}
+    ```json
+    {{ benchmark.accuracy_summary | tojson(indent=2) }}
+    ```
+    {% else %}
+    No accuracy summary available.
+    {% endif %}
+
+    ### Error Sample Analysis ({{ benchmark.error_samples | length }} samples)
+    Below are model outputs for incorrectly answered questions.
+    Analyze the error patterns and provide INSIGHTS, not specific training suggestions:
+
+    {% for sample in benchmark.error_samples %}
+    **Error {{ loop.index }}:**
+    - Question: {{ sample.question[:1000] }}{% if sample.question | length > 1000 %}... (truncated){% endif %}
+    - Expected Answer: {{ sample.gold }}
+    - Model Output: {{ sample.model_output[:500] }}{% if sample.model_output | length > 500 %}... (truncated){% endif %}
+
+    {% endfor %}
+
+    {% if sota_benchmark %}
+    ## Previous SOTA Benchmark Results
+    The following are the benchmark results from the current best (SOTA) experiment.
+    Compare the current results with these to determine if the current experiment should become the new SOTA.
+
+    ### SOTA Accuracy Summary
+    {% if sota_benchmark.accuracy_summary %}
+    ```json
+    {{ sota_benchmark.accuracy_summary | tojson(indent=2) }}
+    ```
+    {% else %}
+    No SOTA accuracy summary available.
+    {% endif %}
+    {% else %}
+    ## Baseline Benchmark Results (Pre-trained Model)
+    **No SOTA exists yet.** Compare against the BASELINE (model performance before fine-tuning).
+    **IMPORTANT**: Only set "Decision": "yes" if the fine-tuned model EXCEEDS this baseline.
+
+    ### Baseline Accuracy Summary
+    ```json
+    {{ baseline_benchmark.accuracy_summary | tojson(indent=2) }}
+    ```
+    {% endif %}
+
+exp_feedback_error:
+  system: |-
+    You are an expert LLM fine-tuning debugger specializing in analyzing experiment failures.
+
+    Below is the scenario context:
+    {{ scenario }}
+
+    Your task is to analyze why the LLM fine-tuning experiment failed and provide actionable feedback.
+
+    # Failure Analysis Framework:
+
+    ## Step 1: Error Classification
+    Identify the type of failure (use these exact labels):
+    - CONFIG: YAML syntax, invalid parameters, incompatible settings
+    - OOM: GPU memory exhaustion, CUDA out of memory
+    - DATA: Dataset format issues, tokenization failures, empty data
+    - ENV: Missing dependencies, version conflicts, file not found
+
+    ## Step 2: Root Cause Analysis
+    - Examine the error message and stack trace
+    - Identify the specific component that failed
+    - Determine if it's a code bug, configuration issue, or resource limitation
+
+    ## Step 3: Actionable Suggestions
+    - Provide specific fixes for the identified issues
+    - Suggest configuration changes or code modifications
+    - Recommend debugging steps if root cause is unclear
+
+    Provide structured feedback in JSON format (all values must be strings, not arrays):
+    {
+      "Error Type": "CONFIG|OOM|DATA|ENV",
+      "Code Summary": "Brief description of what was attempted",
+      "Reason": "A single paragraph (not a list) with detailed error analysis, root cause, and specific fix suggestions. Limit to 3-5 sentences.",
+      "Decision": "no"
+    }
+
+  user: |-
+    # Failed LLM Fine-tuning Experiment Analysis
+
+    ## Hypothesis
+    {{ hypothesis }}
+
+    ## Task Description
+    {{ task_desc }}
+
+    ## Workspace Files
+    {% for file_name, file_content in workspace_files.items() %}
+    - {{ file_name }}: {{ file_content }}
+    {% endfor %}
+
+    ## Error Information
+    ```
+    {{ error_info }}
+    ```
+
+    Please analyze why this experiment failed and provide suggestions for fixing it.
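
Reviewer note: both feedback prompts above ask for a JSON object whose values are all strings, with "Decision" restricted to yes/no. A minimal sketch of how a caller might validate such a response is shown here. `parse_feedback` and `REQUIRED_KEYS` are hypothetical names used for illustration only; they are not part of this diff or of rdagent, and the real summarizer may parse the response differently.

```python
import json

# Keys shared by both feedback schemas in the prompts (assumption: the
# error schema's extra "Error Type" key is optional for this check).
REQUIRED_KEYS = ("Code Summary", "Reason", "Decision")


def parse_feedback(raw: str) -> dict:
    """Validate a feedback response and add a boolean `accepted` flag."""
    data = json.loads(raw)

    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"missing feedback keys: {missing}")

    # The prompts require every value to be a string, not an array.
    non_str = [key for key, value in data.items() if not isinstance(value, str)]
    if non_str:
        raise TypeError(f"non-string feedback values: {non_str}")

    decision = data["Decision"].strip().lower()
    if decision not in ("yes", "no"):
        raise ValueError(f"unexpected decision: {data['Decision']!r}")

    return {**data, "accepted": decision == "yes"}
```

Errors are raised rather than swallowed, in line with the coding principle stated in `.ai/task/nav.md`.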
diff --git a/rdagent/scenarios/finetune/download/__init__.py b/rdagent/scenarios/finetune/download/__init__.py
new file mode 100644
index 000000000..33f8ddbfd
--- /dev/null
+++ b/rdagent/scenarios/finetune/download/__init__.py
@@ -0,0 +1,25 @@
+"""
+Hugging Face download utility module
+
+Provides functions to download models and datasets from the Hugging Face Hub.
+
+Main functions:
+- download_dataset: Download entire dataset repo using snapshot_download
+- download_model: Download model repo using snapshot_download
+
+For high-level dataset management (with registered datasets), use:
+    from rdagent.scenarios.finetune.datasets import prepare, prepare_all
+
+Environment variable configuration:
+- HF_TOKEN / HUGGINGFACE_TOKEN / HUGGING_FACE_HUB_TOKEN: Hugging Face access token
+- FT_FILE_PATH: Root directory for finetuning files (managed by FT_RD_SETTING)
+
+Usage example:
+    from rdagent.scenarios.finetune.download.hf import download_dataset, download_model
+
+    # Download dataset
+    dataset_path = download_dataset("OpenMol/ChemCoTDataset", "/path/to/chemcot")
+
+    # Download model
+    model_path = download_model("Qwen/Qwen2.5-7B")
+"""
diff --git a/rdagent/scenarios/finetune/download/hf.py b/rdagent/scenarios/finetune/download/hf.py
new file mode 100644
index 000000000..c7ce0bbe9
--- /dev/null
+++ b/rdagent/scenarios/finetune/download/hf.py
@@ -0,0 +1,108 @@
+import os
+import shutil
+from pathlib import Path
+from typing import Optional
+
+
+def _ensure_parent(path: Path) -> None:
+    os.makedirs(path.parent, mode=0o777, exist_ok=True)
+
+
+def _get_hf_token(token: Optional[str] = None) -> Optional[str]:
+    """Get HuggingFace token from parameter or environment variables."""
+    return (
+        token
+        or os.environ.get("HF_TOKEN")
+        or os.environ.get("HUGGINGFACE_TOKEN")
+        or os.environ.get("HUGGING_FACE_HUB_TOKEN")
+    )
+
+
+def download_dataset(
+    repo_id: str,
+    out_dir: str,
+    token: Optional[str] = None,
+    revision: Optional[str] = None,
+    force: bool = False,
+) -> str:
+    """
+    Download HuggingFace dataset to a specified directory using snapshot_download.
+    Preserves the original file structure from HuggingFace.
+
+    Args:
+        repo_id: HuggingFace dataset repository ID
+        out_dir: Directory to save the dataset
+        token: HuggingFace token for private datasets
+        revision: Specific revision to download
+        force: If True, re-download even if exists
+
+    Returns:
+        Path to the downloaded dataset directory
+    """
+    save_path = Path(out_dir)
+    _ensure_parent(save_path)
+
+    if force and save_path.exists():
+        shutil.rmtree(save_path)
+
+    try:
+        from huggingface_hub import snapshot_download
+    except ImportError as e:
+        raise ImportError(
+            "huggingface_hub is missing. Please install it first: pip install -U 'huggingface_hub[cli]'"
+        ) from e
+
+    snapshot_download(
+        repo_id=repo_id,
+        repo_type="dataset",
+        local_dir=str(save_path),
+        local_dir_use_symlinks=False,
+        token=_get_hf_token(token),
+        revision=revision,
+    )
+    return str(save_path)
+
+
+def download_model(
+    repo_id: str,
+    out_dir_root: Optional[str] = None,
+    token: Optional[str] = None,
+    revision: Optional[str] = None,
+    force: bool = False,
+) -> str:
+    """
+    Download Hugging Face model to a subdirectory under the specified root: <out_dir_root>/<repo_id>
+    Returns the actual download directory path as a string.
+    """
+    if out_dir_root:
+        save_root = Path(out_dir_root)
+    else:
+        # Use FT_RD_SETTING for default root directory
+        from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+
+        if not FT_RD_SETTING.file_path:
+            raise ValueError("No out_dir_root specified and FT_FILE_PATH not set")
+        save_root = Path(FT_RD_SETTING.file_path) / "model"
+
+    save_path = save_root / repo_id
+    _ensure_parent(save_path)
+
+    if force and save_path.exists():
+        shutil.rmtree(save_path)
+
+    try:
+        from huggingface_hub import snapshot_download
+    except ImportError as e:
+        raise ImportError(
+            "huggingface_hub is missing. Please install it first: pip install -U 'huggingface_hub[cli]'"
+        ) from e
+
+    snapshot_download(
+        repo_id=repo_id,
+        repo_type="model",
+        local_dir=str(save_path),
+        local_dir_use_symlinks=False,
+        token=_get_hf_token(token),
+        revision=revision,
+    )
+    return str(save_path)
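
Reviewer note: the token lookup and the model directory layout in `hf.py` can be illustrated without touching the network. The sketch below mirrors the precedence in `_get_hf_token` (explicit argument first, then the three environment variables in order) and the `<out_dir_root>/<repo_id>` layout from `download_model`. `resolve_token` and `model_save_path` are hypothetical stand-ins written for this note, not functions added by the diff.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional

# Same lookup order as _get_hf_token in the diff.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGINGFACE_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def resolve_token(
    token: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty token: explicit argument, then env vars."""
    env = os.environ if environ is None else environ
    if token:
        return token
    for name in TOKEN_ENV_VARS:
        value = env.get(name)
        if value:  # empty strings fall through, like the `or` chain
            return value
    return None


def model_save_path(out_dir_root: str, repo_id: str) -> PurePosixPath:
    """Build <out_dir_root>/<repo_id>; the org name becomes a subdirectory."""
    return PurePosixPath(out_dir_root) / repo_id
```

Because a repo ID such as `Qwen/Qwen2.5-7B` contains a slash, the model lands in a nested directory under the organization name, and `_ensure_parent` only needs to create that parent.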
diff --git a/rdagent/scenarios/finetune/env/conda/deepspeed/ds_z2_config.json b/rdagent/scenarios/finetune/env/conda/deepspeed/ds_z2_config.json
new file mode 100644
index 000000000..c4177e5e0
--- /dev/null
+++ b/rdagent/scenarios/finetune/env/conda/deepspeed/ds_z2_config.json
@@ -0,0 +1,28 @@
+{
+  "train_batch_size": "auto",
+  "train_micro_batch_size_per_gpu": "auto",
+  "gradient_accumulation_steps": "auto",
+  "gradient_clipping": "auto",
+  "zero_allow_untested_optimizer": true,
+  "fp16": {
+    "enabled": "auto",
+    "loss_scale": 0,
+    "loss_scale_window": 1000,
+    "initial_scale_power": 16,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "bf16": {
+    "enabled": "auto"
+  },
+  "zero_optimization": {
+    "stage": 2,
+    "allgather_partitions": true,
+    "allgather_bucket_size": 5e8,
+    "overlap_comm": false,
+    "reduce_scatter": true,
+    "reduce_bucket_size": 5e8,
+    "contiguous_gradients": true,
+    "round_robin_gradients": true
+  }
+}
diff --git a/rdagent/scenarios/finetune/env/conda/deepspeed/ds_z3_config.json b/rdagent/scenarios/finetune/env/conda/deepspeed/ds_z3_config.json
new file mode 100644
index 000000000..46584a769
--- /dev/null
+++ b/rdagent/scenarios/finetune/env/conda/deepspeed/ds_z3_config.json
@@ -0,0 +1,30 @@
+{
+  "train_batch_size": "auto",
+  "train_micro_batch_size_per_gpu": "auto",
+  "gradient_accumulation_steps": "auto",
+  "gradient_clipping": "auto",
+  "zero_allow_untested_optimizer": true,
+  "fp16": {
+    "enabled": "auto",
+    "loss_scale": 0,
+    "loss_scale_window": 1000,
+    "initial_scale_power": 16,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "bf16": {
+    "enabled": "auto"
+  },
+  "zero_optimization": {
+    "stage": 3,
+    "overlap_comm": false,
+    "contiguous_gradients": true,
+    "sub_group_size": 1e9,
+    "reduce_bucket_size": "auto",
+    "stage3_prefetch_bucket_size": "auto",
+    "stage3_param_persistence_threshold": "auto",
+    "stage3_max_live_parameters": 1e9,
+    "stage3_max_reuse_distance": 1e9,
+    "stage3_gather_16bit_weights_on_model_save": true
+  }
+}
diff --git a/rdagent/scenarios/finetune/env/conda/llm_finetune_requirements.txt b/rdagent/scenarios/finetune/env/conda/llm_finetune_requirements.txt
new file mode 100644
index 000000000..e889153b5
--- /dev/null
+++ b/rdagent/scenarios/finetune/env/conda/llm_finetune_requirements.txt
@@ -0,0 +1,58 @@
+# LLaMA Factory Environment Requirements
+# Equivalent to: rdagent/scenarios/finetune/env/docker/llm_finetune/Dockerfile
+# Docker base: hiyouga/llamafactory:0.9.4 uses PyTorch 2.6.0 + CUDA 12.4 + flash-attn 2.7.4
+
+# PyTorch 2.9.0 with CUDA 12.8 (for B200 GPUs with sm_100 architecture)
+# Note: PyTorch 2.6.0 only supports up to sm_90, B200 requires 2.8.0+
+# For non-B200 machines with CUDA 12.4, change to cu124 and torch==2.6.0
+--index-url https://download.pytorch.org/whl/cu128
+torch==2.9.0
+torchvision==0.24.0
+
+# Reset to default index for other packages
+--index-url https://pypi.org/simple
+
+# Core LlamaFactory package (PyPI latest is 0.9.3, Docker uses 0.9.4 from GitHub)
+llamafactory==0.9.3
+
+# FlashAttention-2: installed separately via llm_finetune_flash_attn.txt
+# (requires torch installed first, and --no-build-isolation flag)
+
+# Transformers library (for tokenizer)
+transformers
+
+# Additional dependencies (matches Dockerfile line 17)
+bitsandbytes>=0.39.0
+mixture-of-depth>=1.1.6
+litellm
+
+# Common utilities for data processing scripts
+requests
+
+# DeepSpeed for memory optimization
+# Note: LlamaFactory 0.9.3 requires deepspeed<=0.16.9 (hardcoded check in parser.py)
+deepspeed>=0.10.0,<=0.16.9
+
+# LlamaFactory optional dependencies (commonly used)
+# Liger Kernel - fused triton kernels for training acceleration
+liger-kernel>=0.5.5
+
+# Metrics for evaluation
+nltk
+jieba
+rouge-chinese
+
+# Advanced optimizers
+galore-torch
+apollo-torch
+badam>=1.2.1
+adam-mini
+
+# Quantization
+hqq
+
+# FP8 training support
+torchao>=0.8.0
+
+# Chemistry support
+rdkit
diff --git a/rdagent/scenarios/finetune/env/conda/opencompass_requirements.txt b/rdagent/scenarios/finetune/env/conda/opencompass_requirements.txt
new file mode 100644
index 000000000..c87a22a20
--- /dev/null
+++ b/rdagent/scenarios/finetune/env/conda/opencompass_requirements.txt
@@ -0,0 +1,22 @@
+# OpenCompass Benchmark Environment Requirements
+# Equivalent to: rdagent/scenarios/finetune/env/docker/opencompass/Dockerfile
+
+# PyTorch 2.9.0 with CUDA 12.8 (for B200 GPUs with sm_100 architecture)
+# Note: PyTorch 2.1.0 only supports up to sm_90, B200 requires 2.8.0+
+# For non-B200 machines with CUDA 12.4, change to cu124 and torch==2.6.0
+--index-url https://download.pytorch.org/whl/cu128
+torch==2.9.0
+torchvision==0.24.0
+
+# Reset to default index for other packages
+--index-url https://pypi.org/simple
+
+# vLLM for model inference (latest version supports PyTorch 2.9.0)
+vllm>=0.12.0
+
+# OpenCompass benchmark framework (custom fork with cascade eval support)
+opencompass @ git+https://github.com/Jensen246/opencompass.git
+
+# Math evaluation dependencies (matches Dockerfile line 22)
+math_verify
+latex2sympy2_extended
diff --git a/rdagent/scenarios/finetune/env/docker/llm_finetune/Dockerfile b/rdagent/scenarios/finetune/env/docker/llm_finetune/Dockerfile
new file mode 100644
index 000000000..ae674370c
--- /dev/null
+++ b/rdagent/scenarios/finetune/env/docker/llm_finetune/Dockerfile
@@ -0,0 +1,20 @@
+FROM hiyouga/llamafactory:0.9.4
+
+# Set CUDA environment variables for DeepSpeed compilation
+ENV CUDA_HOME=/usr/local/cuda
+ENV PATH=$CUDA_HOME/bin:$PATH
+ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
+
+RUN apt-get clean && apt-get update && apt-get install -y \
+    curl \
+    vim \
+    git \
+    build-essential \
+    git-lfs \
+    unzip \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN pip install "bitsandbytes>=0.39.0" "mixture-of-depth>=1.1.6" "litellm"
+
+# Set working directory for experiments
+WORKDIR /workspace
diff --git a/rdagent/scenarios/finetune/env/docker/opencompass/Dockerfile b/rdagent/scenarios/finetune/env/docker/opencompass/Dockerfile
new file mode 100644
index 000000000..fe7d21ed2
--- /dev/null
+++ b/rdagent/scenarios/finetune/env/docker/opencompass/Dockerfile
@@ -0,0 +1,40 @@
+FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
+
+# Install system dependencies
+RUN apt-get clean && apt-get update && apt-get install -y \
+    curl \
+    vim \
+    git \
+    build-essential \
+    git-lfs \
+    && rm -rf /var/lib/apt/lists/*
+
+# Upgrade pip
+RUN pip install --upgrade pip setuptools wheel --no-cache-dir
+
+# Install OpenCompass with vLLM backend support
+RUN git clone https://github.com/Jensen246/opencompass.git /opencompass
+WORKDIR /opencompass
+
+RUN pip install ".[vllm]" --no-cache-dir
+
+# Install math evaluation dependencies for AIME/MATH benchmarks
+RUN pip install math_verify latex2sympy2_extended --no-cache-dir
+
+# Install peft and transformers for model merging
+RUN pip install peft transformers --no-cache-dir
+
+# Set working directory
+WORKDIR /workspace
+
+# Set environment variables for cache directories
+ENV HF_HOME=/benchmarks/hf_cache
+ENV HF_HUB_CACHE=/benchmarks/hf_cache/hub
+ENV TRANSFORMERS_CACHE=/benchmarks/hf_cache/transformers
+ENV HF_DATASETS_CACHE=/benchmarks/datasets
+ENV COMPASS_DATA_CACHE=/benchmarks/opencompass_data
+
+# Fix MKL threading layer compatibility issue with vLLM
+ENV MKL_SERVICE_FORCE_INTEL=1
+ENV MKL_THREADING_LAYER=GNU
+
diff --git a/rdagent/scenarios/finetune/experiment/__init__.py b/rdagent/scenarios/finetune/experiment/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/rdagent/scenarios/finetune/experiment/experiment.py b/rdagent/scenarios/finetune/experiment/experiment.py
new file mode 100644
index 000000000..924b3fc1b
--- /dev/null
+++ b/rdagent/scenarios/finetune/experiment/experiment.py
@@ -0,0 +1,32 @@
+import re
+from typing import Literal
+
+import pandas as pd
+
+from rdagent.components.coder.finetune.conf import FT_YAML_FILE_NAME
+from rdagent.core.experiment import Experiment, Task
+from rdagent.scenarios.finetune.experiment.workspace import FTWorkspace
+
+COMPONENT = Literal["Training"]
+
+
+class FTExperiment(Experiment[Task, FTWorkspace, FTWorkspace]):
+    def __init__(self, sub_tasks: list[Task], *args, **kwargs) -> None:
+        super().__init__(sub_tasks=sub_tasks, *args, **kwargs)
+        # Status
+        # - Initial: blank;
+        # - Injecting from SOTA code;
+        # - New version no matter successful or not
+        # the initial workspace or the successful new version after coding
+        self.experiment_workspace = FTWorkspace()
+
+        self.format_check_result = None
+        # Optional field: it is not None only when a format checker is available. Currently only the following cases are supported:
+        # - mle-bench
+
+    def is_ready_to_run(self) -> bool:
+        """
+        ready to run does not indicate the experiment is runnable
+        (so it is different from `trace.next_incomplete_component`.)
+        """
+        return self.experiment_workspace is not None and FT_YAML_FILE_NAME in self.experiment_workspace.file_dict
diff --git a/rdagent/scenarios/finetune/experiment/workspace.py b/rdagent/scenarios/finetune/experiment/workspace.py
new file mode 100644
index 000000000..0432b0a0c
--- /dev/null
+++ b/rdagent/scenarios/finetune/experiment/workspace.py
@@ -0,0 +1,104 @@
+"""
+FT-specific Workspace implementation with minimal checkpoint strategy.
+
+This module provides FTWorkspace, which configures checkpoint to only save
+configuration files (train.yaml), excluding all training outputs.
+
+Design Philosophy:
+- Checkpoint is for code version control during CoSTEER evolution
+- Model persistence is handled separately by Runner's save_model()
+- This separation keeps concerns clear and checkpoints lightweight
+"""
+
+from typing import TYPE_CHECKING, Any
+
+from rdagent.components.coder.finetune.conf import FT_YAML_FILE_NAME
+from rdagent.core.conf import RD_AGENT_SETTINGS
+from rdagent.core.experiment import FBWorkspace
+from rdagent.log import rdagent_logger as logger
+from rdagent.utils.env import CacheKeyFunc, DockerEnv, LocalEnv
+
+if TYPE_CHECKING:
+    from rdagent.utils.env import Env
+
+from rdagent.utils.env import EnvResult
+
+
+class FTWorkspace(FBWorkspace):
+    """
+    Fine-tuning workspace with minimal checkpoint strategy and unified Docker logging.
+
+    Checkpoint Strategy:
+    - Only saves configuration files (train.yaml) for version control
+    - Training outputs (models, checkpoints) are excluded by design
+    - Final model persistence is Runner's responsibility, not checkpoint's
+    """
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+        # Configure checkpoint to save essential files for training
+        # Training outputs (models, checkpoints) are managed separately by save_final_model()
+        RD_AGENT_SETTINGS.workspace_ckp_white_list_names = [
+            FT_YAML_FILE_NAME,  # train.yaml - training config
+            "dataset_info.json",  # LlamaFactory dataset config
+        ]
+        RD_AGENT_SETTINGS.workspace_ckp_size_limit = 100 * 1024
+
+    def run(
+        self,
+        env: "Env",
+        entry: str,
+        env_vars: dict | None = None,
+        cache_key_extra_func: CacheKeyFunc | None = None,
+        cache_files_to_extract: list[str] | None = None,
+    ) -> "EnvResult":
+        """Execute the code in the environment with unified Docker logging.
+
+        Args:
+            env: The environment to run in (DockerEnv, LocalEnv, etc.)
+            entry: The command to execute
+            env_vars: Optional additional environment variables (e.g., LLM API keys)
+                     Will be merged with default {"PYTHONPATH": "./"}
+            cache_key_extra_func: Optional extra function for cache key calculation
+            cache_files_to_extract: Optional list of files to extract from cache
+
+        Returns:
+            EnvResult with stdout, exit_code, running_time
+        """
+        self.prepare()
+        self.inject_files(**self.file_dict)
+
+        # Merge default env with custom env_vars
+        run_env = {"PYTHONPATH": "./"}
+        if env_vars:
+            run_env.update(env_vars)
+
+        result = env.run(
+            entry,
+            str(self.workspace_path),
+            env=run_env,
+            cache_key_extra_func=cache_key_extra_func,
+            cache_files_to_extract=cache_files_to_extract,
+        )
+
+        # Unified execution logging for FT scenario (supports both Docker and Conda)
+        if isinstance(env, DockerEnv):
+            tag_prefix = "docker_run"
+        elif isinstance(env, LocalEnv):
+            tag_prefix = "conda_run"
+        else:
+            tag_prefix = "env_run"
+
+        logger.log_object(
+            {
+                "exit_code": result.exit_code,
+                "stdout": result.stdout or "",
+                "running_time": result.running_time,
+                "entry": entry,
+                "workspace_path": str(self.workspace_path),
+            },
+            tag=f"{tag_prefix}.FTWorkspace",
+        )
+
+        return result
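
Reviewer note: `FTWorkspace.run` starts from a default `{"PYTHONPATH": "./"}` and then applies the caller's `env_vars` on top, so a caller can override the default as well as add keys. A minimal sketch of that merge in isolation follows. `merge_run_env` is a hypothetical helper name used only for this note; the diff performs the merge inline.

```python
from typing import Dict, Optional


def merge_run_env(env_vars: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge caller-supplied variables over the default run environment."""
    run_env = {"PYTHONPATH": "./"}  # default used by FTWorkspace.run
    if env_vars:
        # Later keys win, so callers may replace PYTHONPATH if they need to.
        run_env.update(env_vars)
    return run_env
```

A fresh dictionary is built on every call, so the caller's `env_vars` is never mutated.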
diff --git a/rdagent/scenarios/finetune/loop.py b/rdagent/scenarios/finetune/loop.py
new file mode 100644
index 000000000..e2ba8a936
--- /dev/null
+++ b/rdagent/scenarios/finetune/loop.py
@@ -0,0 +1,62 @@
+import asyncio
+from typing import Any
+
+from rdagent.app.finetune.llm.conf import LLMFinetunePropSetting
+from rdagent.components.coder.finetune.conf import get_ft_env
+from rdagent.components.workflow.rd_loop import RDLoop
+from rdagent.core.conf import RD_AGENT_SETTINGS
+from rdagent.core.exception import CoderError
+from rdagent.core.proposal import HypothesisFeedback
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.finetune.proposal.trace import FTTrace
+
+
+class LLMFinetuneRDLoop(RDLoop):
+    """LLM fine-tuning loop using standard RDLoop workflow"""
+
+    skip_loop_error = (CoderError,)
+    skip_loop_error_stepname = "feedback"  # when a `skip_loop_error` is raised, skip ahead and continue from the feedback step
+    withdraw_loop_error = ()
+
+    def __init__(self, PROP_SETTING: LLMFinetunePropSetting):
+        # Store finetune-specific settings
+        self.ft_rd_setting = PROP_SETTING
+        self.dataset = PROP_SETTING.dataset
+        self.model = PROP_SETTING.base_model
+
+        # Initialize using base class
+        super().__init__(PROP_SETTING)
+
+        # Replace generic Trace with FTTrace for SOTA tracking
+        self.trace = FTTrace(scen=self.trace.scen)
+
+    async def direct_exp_gen(self, prev_out: dict[str, Any]):
+        """Generate LLM fine-tuning experiment"""
+        exp = await self.hypothesis_gen.async_gen(self.trace, self)
+        logger.log_object(exp.hypothesis, tag="hypothesis")
+        logger.log_object(exp.sub_tasks, tag="experiment generation")
+        return exp
+
+    def coding(self, prev_out: dict[str, Any]):
+        """Generate fine-tuning code"""
+        exp = prev_out["direct_exp_gen"]
+        exp = self.coder.develop(exp)
+        logger.log_object(exp.sub_workspace_list, tag="coder result")
+        return exp
+
+    def feedback(self, prev_out: dict[str, Any]):
+        """Generate feedback for LLM fine-tuning experiment - always call LLM"""
+
+        # Get experiment from available sources
+        exp = prev_out.get("running") or prev_out.get("coding") or prev_out.get("direct_exp_gen")
+        e = prev_out.get(self.EXCEPTION_KEY, None)
+        feedback = self.summarizer.generate_feedback(exp, self.trace, exception=e)
+
+        logger.log_object(feedback, tag="feedback")
+        return feedback
+
+    def record(self, prev_out: dict[str, Any]):
+        """Record the experiment and feedback into trace"""
+        feedback = prev_out["feedback"]
+        exp = prev_out.get("running") or prev_out.get("coding") or prev_out.get("direct_exp_gen")
+        self.trace.sync_dag_parent_and_hist((exp, feedback), prev_out[self.LOOP_IDX_KEY])
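
Reviewer note: `feedback` and `record` both pick the experiment from the most advanced step that produced one, using an `or` chain over `prev_out`. The sketch below isolates that fallback so its behavior after a skipped step is easy to see. `latest_experiment` is a hypothetical name for illustration, and plain strings stand in for experiment objects.

```python
from typing import Any, Dict


def latest_experiment(prev_out: Dict[str, Any]) -> Any:
    """Return the experiment from the latest step that produced one.

    Same expression as in LLMFinetuneRDLoop.feedback/record: a missing or
    falsy entry falls through to the previous step's output.
    """
    return (
        prev_out.get("running")
        or prev_out.get("coding")
        or prev_out.get("direct_exp_gen")
    )
```

When a `CoderError` skips the loop to the feedback step, `prev_out` has no `running` or `coding` entry, so the experiment from `direct_exp_gen` is the one that receives feedback.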
diff --git a/rdagent/scenarios/finetune/proposal/__init__.py b/rdagent/scenarios/finetune/proposal/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/rdagent/scenarios/finetune/proposal/prompts.yaml b/rdagent/scenarios/finetune/proposal/prompts.yaml
new file mode 100644
index 000000000..539e2f1c1
--- /dev/null
+++ b/rdagent/scenarios/finetune/proposal/prompts.yaml
@@ -0,0 +1,346 @@
+# =============================================================================
+# Unified Hypothesis Generation
+# =============================================================================
+# Single prompt that covers both data processing and training configuration.
+# LLM decides the focus based on historical experiments and current needs.
+
+unified_hypothesis_gen:
+  system_prompt: |-
+    You are an expert in both data processing and LLM fine-tuning. Your task is to generate a comprehensive hypothesis covering BOTH data processing AND training configuration to build the best possible model given the constraints.
+
+    You should make decisions in a hypothesis that aims to achieve the best performance possible given the constraints. Following the hypothesis, provide a detailed task for the code generator to implement.
+
+    The user might have historical experiments to learn from. Use them wisely to avoid repeating mistakes and build upon successful strategies.
+
+    # Scenario Description
+    {{ scenario }}
+
+    # ═══════════════════════════════════════════════════════════════════════════
+    # PART 1: DATA PROCESSING
+    # ═══════════════════════════════════════════════════════════════════════════
+
+    ## 1.0 Core Principle: Less is More
+
+    **Your Goal:** Create a **small, diverse, high-quality** dataset.
+
+    ### The Three Rules
+
+    1. **Quality over Quantity**: A smaller set of excellent samples beats a larger set of mediocre ones
+    2. **Diversity over Volume**: Cover different problem types, difficulty levels, and reasoning patterns
+    3. **Simplicity over Complexity**: Each processing step you add is a potential failure point
+
+    ### Warning Signs (When to Simplify)
+
+    If you observe any of these, your pipeline is probably over-engineered:
+
+    - **Low retention**: Most samples are being filtered out
+    - **Empty output**: Debug mode produces very few or zero samples
+    - **Cascading failures**: One step's output causes the next step to fail
+    - **Diminishing returns**: Adding more processing but results don't improve
+
+    **When in doubt, do less. A simple pipeline that works beats a complex one that fails.**
+
+    ## 1.1 Data Quality Assessment (Before Processing)
+
+    **Step 1: Understand your data before processing it.**
+
+    | Dataset Quality | Action | Example |
+    |-----------------|--------|---------|
+    | High (structured CoT, correct format) | Use directly with minimal changes | Math datasets with step-by-step solutions |
+    | Medium (has reasoning, needs polish) | Targeted improvements only | Q&A with brief explanations |
+    | Low (no CoT, format issues) | Full processing needed | Direct answer-only datasets |
+
+    **Key insight: High-quality data does NOT need heavy processing. Over-processing good data can degrade it.**
+
+    ## 1.2 Processing Methods
+
+    ### Code-Based Methods (For filtering and formatting)
+    - **Length filtering**: Remove samples exceeding context limit (DO NOT truncate)
+    - **Format validation**: Check required fields exist and are non-empty
+    - **Deduplication**: N-gram or exact match
+    - **Sampling**: Random or stratified by category
+
+    ### LLM-Based Methods (For content generation)
+
+    **✅ Core Operation: CoT Generation with Strong Models**
+
+    This is the most valuable use of LLM in data processing. High-quality CoT is essential for training reasoning ability.
+
+    - **Actively use strong models** to generate detailed, logical reasoning chains
+    - Quality of CoT directly impacts training effectiveness
+    - The cost of strong model calls is justified by better training data
+
+    **When to generate CoT:**
+    - Dataset lacks reasoning traces (direct answers only)
+    - Existing reasoning is shallow, unclear, or incomplete
+    - You want to ensure consistent high-quality reasoning format
+
+    **❌ Redundant Operations: Avoid These**
+    - LLM-based answer validation (inconsistent, expensive, adds little value)
+    - Multi-stage quality scoring (compounds errors, slow)
+    - LLM judging if CoT is "logically correct" (subjective, unreliable)
+    - Multiple LLM calls per sample for different purposes
+
+    **Key Distinction:**
+    - ✅ One high-quality LLM call per sample to generate CoT → Good investment
+    - ❌ Multiple LLM calls per sample (generate + validate + score + rewrite) → Wasteful
+
+    **Note**: Do NOT specify exact model names. Describe which tier (strong/weak) for each step. Model selection is automatic.
+
+    ## 1.3 CoT Generation Strategy
+
+    **Philosophy: Invest in quality CoT generation, not in redundant validation.**
+
+    **CRITICAL: ALL training data MUST include Chain-of-Thought reasoning. No direct answers.**
+
+    **How to generate CoT:**
+    1. **Use strong model tier** - this is where quality matters most
+    2. Generate naturally - let the model reason step by step
+    3. Don't request specific format tags in the prompt (models may refuse)
+    4. Post-process to add required format (`<think>` tags) via code
+
+    **Quality Assurance (Lightweight):**
+    - **Outcome-based check**: If CoT leads to correct final answer, accept it
+    - **For math/code**: Verify answer with tools (calculator, code execution), not LLM
+    - **Self-consistency (optional)**: Generate 2-3 chains, keep if majority agree on answer
+
+    **What to avoid:**
+    - Using LLM to judge if reasoning is "good enough" (subjective, inconsistent)
+    - Rejecting samples because CoT style differs from expectation
+    - Adding validation steps that filter out valid samples
+
+    ## 1.4 Diversity Sampling
+
+    **Why diversity matters:** Training on varied examples helps the model generalize.
+
+    **Implementation:**
+    1. Identify natural categories in your dataset (topic, difficulty, source, format)
+    2. Sample from each category (stratified sampling) rather than randomly from the whole
+    3. Prioritize coverage across categories over total volume
+
+    **Example:**
+    - Dataset has difficulty levels (easy/medium/hard)
+    - Avoid: Taking whatever comes first (may be 90% easy)
+    - Prefer: Sample balanced amounts from each level
+
+    ## 1.5 Length & Filtering
+
+    **Core Formula**: `total_tokens = input_tokens + cot_tokens + answer_tokens`
+
+    This total must satisfy: `total_tokens ≤ cutoff_len ≤ max_position_embeddings`
+
+    - Filter samples exceeding context limit (do NOT truncate)
+    - Set `cutoff_len` based on Memory Constraints table
+    - Maximize CoT length within constraints
+
+    ## 1.6 Output Format
+
+    Output filename: `data.json` (path handled by system). Use Alpaca format:
+
+    ```json
+    [
+      {
+        "instruction": "problem statement",
+        "input": "optional additional context",
+    {% if force_think_token %}
+        "output": "<think>[step-by-step reasoning]</think>[final answer]"
+    {% else %}
+        "output": "[step-by-step reasoning]...[final answer]"
+    {% endif %}
+      }
+    ]
+    ```
+
+    {% if force_think_token %}
+    **Note**: `<think>` tags are added by code post-processing, not requested in LLM prompts.
+    The **answer** (after `</think>`) must follow **Benchmark Description**.
+    {% else %}
+    **Note**: Focus on reasoning quality. Let LLM generate naturally. DO NOT include `<think>` tags.
+    {% endif %}
+
+    **Answer format**: Follow the format specified in Benchmark Description.
+
+    # ═══════════════════════════════════════════════════════════════════════════
+    # PART 2: TRAINING CONFIGURATION
+    # ═══════════════════════════════════════════════════════════════════════════
+
+    ## 2.1 Hardware Memory Constraints
+
+    The **Hardware Memory Constraints** table in Scenario Description shows:
+    - Max `seq_len` each method can support at `batch_size=1`
+    - Model's `max_position_embeddings` limit
+
+    **Method Selection based on seq_len needs:**
+    1. Check which methods support your required seq_len
+    2. Among viable methods: **prefer `full` > `full_gc` > `lora` > `qlora`** for quality
+    3. `full` is not always viable - choose based on your actual seq_len requirements
+
+    **Set cutoff_len:** `cutoff_len ≤ min(max_seq_len from table, max_position_embeddings)`
+
+    **Batch size trade-offs:**
+    - Smaller seq_len → can increase batch_size
+    - Larger seq_len → must decrease batch_size (possibly to 1)
+    - Use `gradient_accumulation_steps` to achieve effective batch size of 16-64
+
+    **Example Decision Flow:**
+    Given 4×48GB GPU, 7B model, need 16K seq_len for rich CoT:
+    1. Check table: `full`=18K ✓, `full_gc`=52K ✓, `lora`=89K ✓
+    2. All methods viable → choose `full` (best quality)
+    3. Set `cutoff_len`=16384 (≤ 18K and ≤ max_position_embeddings)
+    4. batch_size=1 per GPU, gradient_accumulation=16, 4 GPUs → effective batch = 1 × 16 × 4 = 64
+
+    ## 2.2 Available Resources
+
+    {% if select_model %}
+    **Available Models**:
+    {{ available_models }}
+    {% endif %}
+
+    **Available Fine-tuning Methods**:
+    {{ available_methods }}
+
+    **Shared Parameters** (apply to all methods):
+    {{ shared_params }}
+
+    ## 2.3 Method-Specific Parameters
+
+    {% for method, params_desc in methods_specific_params.items() %}
+    {{ params_desc }}{% endfor %}
+
+    # ═══════════════════════════════════════════════════════════════════════════
+    # PART 3: OUTPUT SPECIFICATION
+    # ═══════════════════════════════════════════════════════════════════════════
+
+    ## 3.1 Guidelines
+
+    - Provide the hypothesis in its simplest form - avoid unnecessary complexity
+    - Consider hardware constraints for training and available LLM endpoints for data processing
+    - **IMPORTANT**: Check dataset info for quality issues - not just missing fields, but whether **content quality** (length, depth, richness) matches training objectives
+    - When data quality is insufficient, augmentation/rewrite is expected, not direct use
+    - Chain data processing methods logically: filtering → quality scoring → augmentation/generation
+    - If history shows a method failed, explain why your new approach differs
+    - Use code-based sampling to reduce dataset size before LLM processing (see 1.2)
+
+    ## 3.2 Focus Strategy
+
+    {% if not based_on_a_successful_parent %}
+    **You are drafting an experiment from scratch.** You must provide a comprehensive strategy covering BOTH:
+    1. Data processing: How to prepare the training data
+    2. Training configuration: How to configure the fine-tuning process
+
+    Both aspects are equally important.
+    {% else %}
+    **This is a subsequent experiment.** Based on an existing parent experiment:
+    - Identify which aspect (data processing OR training configuration) needs MORE improvement
+    - You can choose to focus primarily on ONE aspect while keeping the other stable
+    - Or you can improve BOTH if needed
+    - Clearly state your focus in the hypothesis (e.g., "Focus on improving data quality while keeping training config stable")
+
+    **Data Processing Skip Option:**
+    If the Parent's data processing strategy is already good and you want to focus ONLY on training configuration improvements:
+    - Set `skip_data_processing: true` in your response to reuse the Parent's data processing script
+    - This saves LLM API costs and allows you to focus purely on hyperparameter tuning
+    - Only use this option when you believe the data quality is sufficient
+    {% endif %}
+
+    ## 3.3 Response Format
+
+    **Hypothesis**: Provide in natural language, integrating both data processing strategy and training configuration. Structure: "[Data Processing] ... [Training] ..." or a unified narrative covering both aspects.
+
+    **Task Specification**: A clear task for the code generator, following these rules:
+    - **No Code**: MUST NOT contain programming code, library calls, or pseudo-code
+    - **Structure**: Organize into 1) Data Processing, 2) Training Configuration
+    - **Specificity**:
+      - [Data] Which datasets to use and how to process them
+      - [Data] Which LLM endpoints for which processing steps
+      - [Data] Filtering strategy (do NOT hardcode specific thresholds like "score < 8.0")
+      - [Training] Which training methods and hyperparameters to use (single-stage only)
+
+    **Output JSON format:**
+    ```json
+        {
+          "reason": "[Your reasoning about why this approach should work, covering BOTH data processing and training aspects, referencing history if available]",
+          "hypothesis": "[Your hypothesis in natural language, integrating both data processing strategy and training configuration, comprehensive and specific]",
+          "task": "[Step-by-step task description for the code generator, covering the complete workflow from data processing to training, no code]",
+          "skip_data_processing": false  // Set to true ONLY if you want to reuse Parent's data processing script (not applicable for first experiment)
+        }
+    ```
+    Since responding with the whole content in one message may exceed the token limit, the user has requested that you provide the reason, hypothesis, and task one by one in separate messages. Each response must be a valid JSON object, so always include the closing curly brace.
+
+  user_prompt: |-
+    {% if siblings %}
+    ## Sibling Experiments
+    These are other experiments that branched from the same parent.
+    {% for sib_exp, sib_fb in siblings %}
+    ### Sibling {{ loop.index }}
+    - Hypothesis: {{ sib_exp.hypothesis }}
+    - Result: {{ "✅ Successful" if sib_fb.decision else "❌ Failed" }}{% if sib_fb.observations %} [{{ sib_fb.observations }}]{% endif %}
+    - Reason: {{ sib_fb.reason }}
+    {% endfor %}
+    {% endif %}
+
+    {% if parent_exp %}
+    {% set parent_info = trace.get_experiment_info(parent_exp) %}
+    ## Parent Experiment (Base for this iteration)
+    This is the successful experiment you are building upon.
+
+    ### Parent Hypothesis
+    {{ parent_info.hypothesis }}
+
+    {% if parent_info.config %}
+    ### Parent Training Configuration
+    ```yaml
+    {{ parent_info.config }}
+    ```
+    {% endif %}
+
+    {% if parent_info.data_script %}
+    ### Parent Data Processing Script
+    ```python
+    {{ parent_info.data_script }}
+    ```
+    {% endif %}
+
+    {% if parent_info.benchmark %}
+    ### Parent Benchmark Results
+    ```json
+    {{ parent_info.benchmark | tojson(indent=2) }}
+    ```
+    {% endif %}
+
+    **Improvement Focus**: Analyze the Parent's limitations and propose improvements. Consider:
+    - What aspects of the current Parent could be improved?
+    - Are there any hyperparameters that seem suboptimal?
+    - Could the data processing strategy be enhanced?
+    - If Parent's data processing is already good, you may focus on training config improvements only.
+    {% endif %}
+
+    {% if based_on_a_successful_parent %}
+    **Task**: Based on the parent and sibling results above, propose a NEW hypothesis covering BOTH data processing AND training configuration that:
+    - Learns from sibling failures to avoid repeating mistakes
+    - Builds upon the successful parent while exploring improvements
+    - Tests promising directions not yet explored
+    - Decides which aspect (data/training/both) to focus on for this iteration
+    {% else %}
+    **Task**: This is the first experiment (or starting from scratch). Propose an optimal comprehensive strategy covering both data processing and training based on the scenarios and the given seed datasets.
+    {% endif %}
+
+  specific_format: |-
+    In your response, provide ONLY the following JSON structure without any additional text or explanation:
+
+    {% if field == "task" %}
+    ```json
+    {
+      "task": "the step-by-step task description for the code generator",
+      "skip_data_processing": false
+    }
+    ```
+    Note: Set `skip_data_processing` to `true` ONLY if you want to reuse the Parent's data processing script and focus purely on training configuration improvements. This is only valid for subsequent experiments (not the first one).
+    {% else %}
+    ```json
+    {
+      "{{ field }}": "the content to {{ field }} following the instruction in the previous message"
+    }
+    ```
+    {% endif %}
+
diff --git a/rdagent/scenarios/finetune/proposal/proposal.py b/rdagent/scenarios/finetune/proposal/proposal.py
new file mode 100644
index 000000000..5d7a7de88
--- /dev/null
+++ b/rdagent/scenarios/finetune/proposal/proposal.py
@@ -0,0 +1,181 @@
+"""LLM Fine-tuning Proposal Generator
+
+Unified hypothesis generation that covers both data processing and training configuration.
+LLM decides the focus based on historical experiments and current needs.
+"""
+
+import json
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.finetune.exp import FTTask
+from rdagent.core.proposal import ExpGen, Hypothesis, Trace
+from rdagent.log import rdagent_logger as logger
+from rdagent.oai.llm_utils import APIBackend
+from rdagent.scenarios.finetune.experiment.experiment import FTExperiment
+from rdagent.scenarios.finetune.proposal.trace import FTTrace
+from rdagent.scenarios.finetune.scen.llama_factory_manager import (
+    LLaMAFactory_manager,
+)
+from rdagent.scenarios.finetune.scen.scenario import LLMFinetuneScen
+from rdagent.scenarios.finetune.utils import ensure_ft_assets_exist
+from rdagent.utils.agent.tpl import T
+
+
+class FTHypothesis(Hypothesis):
+    """LLM fine-tuning hypothesis class."""
+
+    def __init__(
+        self,
+        base_model: str,
+        hypothesis: str | None = None,
+        reason: str | None = None,
+    ) -> None:
+        super().__init__(
+            hypothesis,
+            reason,
+            concise_reason="",
+            concise_observation="",
+            concise_justification="",
+            concise_knowledge="",
+        )
+        self.base_model = base_model
+
+    def __str__(self) -> str:
+        if self.hypothesis is None:
+            return "No hypothesis available. Constructing the first runnable fine-tuning experiment."
+
+        lines = [
+            f"Base Model: {self.base_model}",
+            f"Hypothesis: {self.hypothesis}",
+        ]
+        if self.reason:
+            lines.append(f"Reason: {self.reason}")
+        return "\n".join(lines)
+
+
+class LLMFinetuneExpGen(ExpGen):
+    """LLM fine-tuning experiment generator.
+
+    Generates unified hypothesis covering both data processing and training configuration.
+    """
+
+    def __init__(self, scen: LLMFinetuneScen):
+        super().__init__(scen)
+
+    def gen(self, trace: Trace) -> FTExperiment:
+        """Generate LLM fine-tuning experiment."""
+        base_model = FT_RD_SETTING.base_model
+        logger.info(f"Generating experiment with base model: {base_model}")
+
+        sota_exp = trace.get_sota_experiment()  # use sota_exp as the parent
+
+        return self._gen_hypothesis(trace, base_model, parent_exp=sota_exp)
+
+    def _gen_hypothesis(self, trace: Trace, base_model: str, parent_exp: FTExperiment | None = None) -> FTExperiment:
+        """Generate hypothesis covering both data processing and training configuration.
+
+        Args:
+            trace: Experiment trace history
+            base_model: Base model name
+            parent_exp: Parent experiment to base this one on; usually the SOTA experiment
+
+        Returns:
+            FTExperiment with tasks for both data processing and training
+        """
+        based_on_a_successful_parent = parent_exp is not None
+        logger.info(f"Generating hypothesis based on (parent_exp={parent_exp})")
+
+        available_models = LLaMAFactory_manager.models
+        available_methods = LLaMAFactory_manager.methods
+        shared_params = LLaMAFactory_manager.format_shared_params()
+        methods_specific_params = {}
+        for method in available_methods:
+            methods_specific_params[method] = LLaMAFactory_manager.format_method_specific_params(method)
+
+        # Find siblings
+        parent_idx = trace.exp2idx(parent_exp) if parent_exp else None
+        # Handle potential list return
+        if isinstance(parent_idx, list):
+            parent_idx = parent_idx[0] if parent_idx else None
+
+        # If no parent, start from void root node
+        siblings = trace.get_children(parent_idx)
+
+        system_prompt = T(".prompts:unified_hypothesis_gen.system_prompt").r(
+            based_on_a_successful_parent=based_on_a_successful_parent,
+            scenario=self.scen.get_scenario_all_desc(enable_dataset_description=True),
+            available_models=available_models,
+            available_methods=available_methods,
+            shared_params=shared_params,
+            methods_specific_params=methods_specific_params,
+            select_model=base_model is None,
+            force_think_token=FT_RD_SETTING.force_think_token,
+        )
+
+        user_prompt = T(".prompts:unified_hypothesis_gen.user_prompt").r(
+            parent_exp=parent_exp,
+            siblings=siblings,
+            trace=trace,
+            based_on_a_successful_parent=based_on_a_successful_parent,
+        )
+
+        session = APIBackend().build_chat_session(session_system_prompt=system_prompt)
+        reason_dict = json.loads(
+            session.build_chat_completion(
+                user_prompt=user_prompt + "\n" + T(".prompts:unified_hypothesis_gen.specific_format").r(field="reason"),
+                json_target_type=dict,
+            )
+        )
+        hypothesis_dict = json.loads(
+            session.build_chat_completion(
+                user_prompt=T(".prompts:unified_hypothesis_gen.specific_format").r(field="hypothesis"),
+                json_target_type=dict,
+            )
+        )
+        task_dict = json.loads(
+            session.build_chat_completion(
+                user_prompt=T(".prompts:unified_hypothesis_gen.specific_format").r(field="task"),
+                json_target_type=dict,
+            )
+        )
+
+        ensure_ft_assets_exist(model=base_model, check_model=True)
+
+        # Get skip_data_processing from task_dict (merged with task in 3rd LLM call)
+        # Only valid for subsequent experiments, first experiment always generates data
+        skip_data_processing = task_dict.get("skip_data_processing", False) if based_on_a_successful_parent else False
+        if skip_data_processing:
+            logger.info("Proposal decided to skip data processing, will reuse Parent's data script")
+
+        # Use pre-selected datasets from scenario initialization
+        task = FTTask(
+            base_model=base_model,
+            description=task_dict.get("task"),
+            benchmark=FT_RD_SETTING.target_benchmark,
+            involving_datasets=self.scen.selected_datasets,
+            skip_data_processing=skip_data_processing,
+        )
+
+        hypothesis = FTHypothesis(
+            base_model=base_model,
+            hypothesis=hypothesis_dict.get("hypothesis"),
+            reason=reason_dict.get("reason", ""),
+        )
+
+        exp = FTExperiment(sub_tasks=[task], hypothesis=hypothesis)
+        if parent_exp:
+            # Reuse the normalized parent_idx computed above; exp2idx may return a list
+            if parent_idx is not None:
+                exp.local_selection = (parent_idx,)
+        else:
+            # If no parent, it is an experiment from scratch
+            exp.local_selection = trace.NEW_ROOT
+
+        # Inject workspace files from Parent or SOTA experiment (if available)
+        if parent_exp and (ws := parent_exp.experiment_workspace) is not None and ws.file_dict:
+            exp.experiment_workspace.inject_from_workspace(ws)
+            logger.info(f"Injected {len(ws.file_dict)} files from parent: {list(ws.file_dict.keys())}")
+
+        logger.info("Experiment created")
+
+        return exp
diff --git a/rdagent/scenarios/finetune/proposal/trace.py b/rdagent/scenarios/finetune/proposal/trace.py
new file mode 100644
index 000000000..b2488adac
--- /dev/null
+++ b/rdagent/scenarios/finetune/proposal/trace.py
@@ -0,0 +1,80 @@
+"""FT Trace - Specialized Trace for LLM Fine-tuning scenario.
+
+Provides SOTA experiment tracking functionality.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING, Any
+
+from rdagent.components.coder.finetune.conf import (
+    FT_DATA_SCRIPT_NAME,
+    FT_YAML_FILE_NAME,
+)
+from rdagent.core.evolving_framework import KnowledgeBase
+from rdagent.core.proposal import ExperimentFeedback, Trace
+from rdagent.log import rdagent_logger as logger
+
+if TYPE_CHECKING:
+    from rdagent.scenarios.finetune.experiment.experiment import FTExperiment
+    from rdagent.scenarios.finetune.scen.scenario import LLMFinetuneScen
+
+
+class FTTrace(Trace["LLMFinetuneScen", KnowledgeBase]):
+    """Specialized Trace for LLM Fine-tuning scenario.
+
+    Adds SOTA experiment tracking on top of the base Trace class.
+    SOTA is explicitly managed via DAG traversal.
+    """
+
+    def __init__(self, scen: "LLMFinetuneScen", knowledge_base: KnowledgeBase | None = None) -> None:
+        super().__init__(scen, knowledge_base)
+
+        # Type hint for linting
+        self.hist: list[tuple[FTExperiment, ExperimentFeedback]] = []
+
+    def sota_benchmark(self) -> dict | None:
+        """Return SOTA experiment's benchmark results."""
+        sota_exp = self.get_sota_experiment()
+        if sota_exp is None:
+            return None
+        ws = sota_exp.experiment_workspace
+        if ws is None or ws.running_info is None:
+            return None
+        result = getattr(ws.running_info, "result", None)
+        if result and isinstance(result, dict) and "benchmark" in result:
+            return result["benchmark"]
+        return None
+
+    def get_experiment_info(self, exp: "FTExperiment") -> dict[str, Any]:
+        """Return experiment's full info for hypothesis generation."""
+        info: dict[str, Any] = {
+            "hypothesis": str(exp.hypothesis) if exp.hypothesis else None,
+            "config": None,
+            "benchmark": None,
+            "data_script": None,
+        }
+
+        ws = exp.experiment_workspace
+        if ws is None:
+            return info
+
+        if ws.file_dict:
+            if FT_YAML_FILE_NAME in ws.file_dict:
+                info["config"] = ws.file_dict[FT_YAML_FILE_NAME]
+            if FT_DATA_SCRIPT_NAME in ws.file_dict:
+                info["data_script"] = ws.file_dict[FT_DATA_SCRIPT_NAME]
+
+        if ws.running_info:
+            result = getattr(ws.running_info, "result", None)
+            if result and isinstance(result, dict) and "benchmark" in result:
+                info["benchmark"] = result["benchmark"].get("accuracy_summary")
+
+        return info
+
+    def sota_info(self) -> dict[str, Any] | None:
+        """Return SOTA experiment's full info for hypothesis generation."""
+        sota_exp = self.get_sota_experiment()
+        if sota_exp is None:
+            return None
+        return self.get_experiment_info(sota_exp)
diff --git a/rdagent/scenarios/finetune/scen/__init__.py b/rdagent/scenarios/finetune/scen/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/rdagent/scenarios/finetune/scen/docker_scripts/extract_parameters.py b/rdagent/scenarios/finetune/scen/docker_scripts/extract_parameters.py
new file mode 100644
index 000000000..1880d58a8
--- /dev/null
+++ b/rdagent/scenarios/finetune/scen/docker_scripts/extract_parameters.py
@@ -0,0 +1,123 @@
+"""
+Streamlined LLaMA Factory parameter extraction script.
+Extracts all parameters directly from LLaMA Factory without hardcoded filtering.
+Reads parameters from the LLaMA Factory version installed in the current environment.
+"""
+
+import json
+import subprocess
+import sys
+from dataclasses import fields
+from pathlib import Path
+
+import requests
+from llamafactory.data.template import TEMPLATES
+from llamafactory.extras.constants import METHODS, SUPPORTED_MODELS, TRAINING_STAGES
+from llamafactory.hparams.data_args import DataArguments
+from llamafactory.hparams.finetuning_args import (
+    ApolloArguments,
+    BAdamArgument,
+    FinetuningArguments,
+    FreezeArguments,
+    GaloreArguments,
+    LoraArguments,
+    RLHFArguments,
+    SwanLabArguments,
+)
+from llamafactory.hparams.model_args import ModelArguments, QuantizationArguments
+from transformers import TrainingArguments
+
+
+def extract_field_info(field):
+    """Extract field information from a dataclass field."""
+    from dataclasses import MISSING
+
+    # Handle default value - avoid MISSING type which is not JSON serializable
+    if hasattr(field, "default") and field.default is not MISSING:
+        default_value = field.default
+    elif hasattr(field, "default_factory") and field.default_factory is not MISSING:
+        default_value = "<factory>"
+    else:
+        default_value = None
+
+    return {
+        "name": field.name,
+        "type": str(field.type).replace("typing.", "").replace("<class '", "").replace("'>", ""),
+        "default": default_value,
+        "help": field.metadata.get("help", "") if field.metadata else "",
+    }
+
+
+def extract_params(cls):
+    """Extract all parameters from a dataclass."""
+    return {field.name: extract_field_info(field) for field in fields(cls)}
+
+
+def extract_base_params(cls):
+    """Extract only the parameters defined in the class itself, not inherited."""
+    # Get all fields from the class
+    all_fields = {f.name: f for f in fields(cls)}
+
+    # Get fields from all parent classes
+    parent_fields = set()
+    for base in cls.__bases__:
+        if hasattr(base, "__dataclass_fields__"):
+            parent_fields.update(base.__dataclass_fields__.keys())
+
+    # Keep only fields defined in the class itself
+    own_fields = {name: field for name, field in all_fields.items() if name not in parent_fields}
+
+    return {name: extract_field_info(field) for name, field in own_fields.items()}
+
+
+def save_parameters(base_dir):
+    """Extract and save all LLaMA Factory parameters with category information."""
+    base_path = Path(base_dir)
+    base_path.mkdir(parents=True, exist_ok=True)
+
+    # Save constants
+    constants = {
+        "methods": list(METHODS),
+        "training_stages": dict(TRAINING_STAGES),
+        "supported_models": dict(SUPPORTED_MODELS) if SUPPORTED_MODELS else {},
+        "templates": list(TEMPLATES.keys()),
+    }
+    (base_path / "constants.json").write_text(json.dumps(constants, indent=2))
+
+    # Save parameters - preserve parameter ownership by categorizing them
+    parameters = {
+        "model": extract_params(ModelArguments),
+        "data": extract_params(DataArguments),
+        "training": extract_params(TrainingArguments),
+        "finetuning": {
+            # Categorize parameters by the argument group that defines them
+            "freeze": extract_params(FreezeArguments),
+            "lora": extract_params(LoraArguments),
+            "galore": extract_params(GaloreArguments),
+            "apollo": extract_params(ApolloArguments),
+            "badam": extract_params(BAdamArgument),
+            "rlhf": extract_params(RLHFArguments),
+            "swanlab": extract_params(SwanLabArguments),
+            "quantization": extract_params(QuantizationArguments),
+            # Extract only FinetuningArguments' own parameters (excluding inherited ones)
+            "base": extract_base_params(FinetuningArguments),
+        },
+    }
+    (base_path / "parameters.json").write_text(json.dumps(parameters, indent=2))
+
+
+def main():
+    """Main entry point for parameter extraction."""
+    base_dir = sys.argv[1] if len(sys.argv) > 1 else "/workspace/.llama_factory_info"
+
+    try:
+        save_parameters(base_dir)
+        print("Successfully extracted LLaMA Factory parameters")
+        return 0
+    except Exception as e:
+        print(f"ERROR: {e}", file=sys.stderr)
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/rdagent/scenarios/finetune/scen/llama_factory_manager.py b/rdagent/scenarios/finetune/scen/llama_factory_manager.py
new file mode 100644
index 000000000..71ddfd837
--- /dev/null
+++ b/rdagent/scenarios/finetune/scen/llama_factory_manager.py
@@ -0,0 +1,287 @@
+"""
+Streamlined LLaMA Factory manager for parameter extraction.
+"""
+
+import json
+import re
+import shutil
+from pathlib import Path
+from typing import Dict, List, Optional
+
+import requests
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.finetune.conf import (
+    get_ft_env,
+    get_workspace_prefix,
+    is_docker_env,
+)
+from rdagent.core.experiment import FBWorkspace
+from rdagent.log import rdagent_logger as logger
+
+EXTRACT_PARAMETERS_SCRIPT_NAME = "extract_parameters.py"
+DEFAULT_HELP_TRUNCATE_LEN = None  # Default max length for help text in formatted output
+
+# Regex patterns to exclude parameters not relevant for SFT training prompts
+EXCLUDED_PARAM_PATTERNS = [
+    # Inference engines & inference-only params
+    r"^infer_",  # Inference related (infer_backend, infer_dtype)
+    r"^vllm_",  # vLLM engine
+    r"^sglang_",  # SGLang engine
+    r"^kt_",  # KTransformers config (kt_maxlen, kt_mode, etc.)
+    r"^use_kt$",  # KTransformers toggle
+    r"^use_kv_cache$",  # Inference only
+    r"^use_cache$",  # KV cache for generation
+    r"^cpu_infer$",  # KTransformers: CPU cores for computation
+    r"^chunk_size$",  # KTransformers: chunk size for CPU compute
+    # Hub/Cloud
+    r"^push_to_hub",  # Hub push
+    r"^hub_",  # Hub related
+    r"_hub_token$",  # Hub tokens (hf_hub_token, ms_hub_token, om_hub_token)
+    # Multimodal inputs (text-only SFT)
+    r"^image_",  # Image inputs
+    r"^video_",  # Video inputs
+    r"^audio_",  # Audio inputs
+    r"^crop_to_patches$",  # Image processing for internvl
+    r"^use_audio_in_video$",  # Video audio
+    r"^media_dir$",  # Media directory for multimodal
+    r"^freeze_vision_tower$",  # MLLM: freeze vision encoder
+    r"^freeze_multi_modal_projector$",  # MLLM: freeze projector
+    r"^freeze_language_model$",  # MLLM: freeze LLM backbone
+    # Export (post-training)
+    r"^export_",  # Model export
+    # Hardware specific (non-NVIDIA)
+    r"^tpu_",  # TPU related (tpu_num_cores, tpu_metrics_debug)
+    r"^use_cpu$",  # CPU-only training
+    r"^use_ipex$",  # Intel Extension for PyTorch
+    r"^jit_mode_eval$",  # PyTorch JIT for inference
+    # Third-party logging & reporting tools
+    r"^ray_",  # Ray hyperparameter search
+    r"^swanlab_",  # SwanLab logging
+    r"^use_swanlab$",  # SwanLab toggle
+    r"^trackio_",  # Trackio logging
+    r"^logging_dir$",  # Tensorboard log directory
+    r"^report_to$",  # Logging integrations (wandb, tensorboard, mlflow, comet)
+    r"^run_name$",  # Run name for logging tools (wandb, mlflow, trackio, comet, swanlab)
+    # RLHF/DPO (not for SFT)
+    r"^pref_",  # Preference learning (DPO/KTO/ORPO/SimPO)
+    r"^dpo_",  # DPO specific
+    r"^kto_",  # KTO specific
+    r"^simpo_",  # SimPO specific
+    r"^ppo_",  # PPO specific
+    r"^ref_model",  # Reference model for RLHF
+    r"^reward_model",  # Reward model for PPO
+    r"^ld_alpha$",  # LD-DPO
+    # Deprecated (per help text)
+    r"^no_cuda$",  # Deprecated in transformers 5.0
+    r"^use_mps_device$",  # Deprecated in transformers 5.0
+    r"^per_gpu_",  # Deprecated: use per_device_* instead
+    r"^torchdynamo$",  # Deprecated: use torch_compile_backend
+    r"^fp16_backend$",  # Deprecated: use half_precision_backend
+    r"^include_inputs_for_metrics$",  # Deprecated: use include_for_metrics
+    # Unsloth (third-party, not used by default)
+    r"^use_unsloth",  # use_unsloth, use_unsloth_gc
+    # Internal/derived params (help says "Do not specify it")
+    r"^compute_dtype$",
+    r"^device_map$",
+    r"^model_max_length$",
+    r"^block_diag_attn$",
+    # Platform-specific / internal
+    r"^mp_parameters$",  # SageMaker launcher only
+    r"^_n_gpu$",  # Internal variable
+    r"^use_legacy_prediction_loop$",  # Legacy feature
+    r"^past_index$",  # Rarely used
+    r"^print_param_status$",  # Debug only
+]
+EXCLUDED_PARAM_REGEX = re.compile("|".join(EXCLUDED_PARAM_PATTERNS))
+
+
+class LLaMAFactoryManager:
+    """Manager for LLaMA Factory parameter extraction and caching."""
+
+    def __init__(self):
+        """Initialize the manager instance."""
+        self.cache_dir = Path(FT_RD_SETTING.file_path) / ".llama_factory_info"
+        self._info_cache: Optional[Dict] = None
+
+    def extract_info_from_docker(self) -> Dict:
+        """Extract LLaMA Factory information from Docker/Conda environment."""
+        if not self.cache_dir.exists() or not any(self.cache_dir.iterdir()):
+            logger.info("Extract LLaMA Factory parameters")
+            # Prepare extraction script
+            workspace = FBWorkspace()
+            script_path = Path(__file__).parent / "docker_scripts" / EXTRACT_PARAMETERS_SCRIPT_NAME
+            workspace.inject_files(**{EXTRACT_PARAMETERS_SCRIPT_NAME: script_path.read_text()})
+
+            # Setup cache directory and volumes
+            if self.cache_dir.exists():
+                shutil.rmtree(self.cache_dir)
+            self.cache_dir.mkdir(parents=True, exist_ok=True)
+            volumes = {str(self.cache_dir): {"bind": "/workspace/.llama_factory_info", "mode": "rw"}}
+
+            # Run extraction
+            env = get_ft_env(extra_volumes=volumes, enable_cache=False)
+            env.conf.running_timeout_period = 120  # Short timeout for parameter extraction
+
+            # Determine output path based on environment type
+            # Docker: uses volume mount, output to /workspace/.llama_factory_info
+            # Conda: no volume mount, output directly to cache_dir (absolute path)
+            if is_docker_env(env):
+                output_path = "/workspace/.llama_factory_info"
+            else:
+                # For conda mode, use absolute path to cache_dir
+                output_path = str(self.cache_dir)
+
+            result = workspace.run(
+                env=env,
+                entry=f"python {EXTRACT_PARAMETERS_SCRIPT_NAME} {output_path}",
+            )
+
+            if result.exit_code != 0:
+                raise RuntimeError(f"Parameter extraction failed: {result.stdout}")
+
+        else:
+            logger.info("Skip updating LLaMA Factory, using local cache")
+
+        # Load the extracted data
+        self._info_cache = self._load_extracted_data()
+        if not self._info_cache:
+            raise RuntimeError("Failed to load extracted LLaMA Factory information")
+
+        logger.info("Successfully loaded LLaMA Factory parameters")
+        return self._info_cache
+
+    def _load_extracted_data(self) -> Dict:
+        """Load extracted information from flat file structure."""
+        data = {}
+
+        # Load constants
+        constants_file = self.cache_dir / "constants.json"
+        if constants_file.exists():
+            with open(constants_file, encoding="utf-8") as f:
+                data.update(json.load(f))
+
+        # Load parameters
+        parameters_file = self.cache_dir / "parameters.json"
+        if parameters_file.exists():
+            with open(parameters_file, encoding="utf-8") as f:
+                data["parameters"] = json.load(f)
+
+        return data
+
+    def get_info(self) -> Dict:
+        """Get complete LLaMA Factory information, extracting on first call."""
+        if self._info_cache is None:
+            self._info_cache = self.extract_info_from_docker()
+        return self._info_cache
+
+    @property
+    def methods(self) -> List[str]:
+        """Available fine-tuning methods."""
+        return self.get_info().get("methods", [])
+
+    @property
+    def models(self) -> List[str]:
+        """Available base models."""
+        return list(self.get_info().get("supported_models", {}).keys())
+
+    @property
+    def hf_models(self) -> List[str]:
+        """Available HuggingFace models."""
+        supported_models = self.get_info().get("supported_models", {})
+        return list({v for v in supported_models.values() if isinstance(v, str)})
+
+    @property
+    def peft_methods(self) -> List[str]:
+        """Available PEFT methods, dynamically filtered from available methods."""
+        known_peft = {"lora", "qlora", "adalora"}
+        return [m for m in self.methods if m in known_peft]
+
+    @property
+    def training_stages(self) -> Dict[str, str]:
+        """Training stage mapping."""
+        return self.get_info().get("training_stages", {})
+
+    @property
+    def templates(self) -> List[str]:
+        """Available chat templates."""
+        return self.get_info().get("templates", [])
+
+    def is_peft_method(self, method: str) -> bool:
+        """Check if the given method is a PEFT method."""
+        return method in self.peft_methods
+
+    def get_parameters(self, param_type: Optional[str] = None) -> Dict:
+        """Get parameters by type or all parameters."""
+        params = self.get_info().get("parameters", {})
+        if param_type:
+            return params.get(param_type, {})
+        return params
+
+    def _format_param_line(self, param_name: str, param_info: dict, max_help_len: int | None) -> str:
+        """Format a single parameter line.
+
+        Args:
+            max_help_len: Max length for help text. None means no truncation.
+        """
+        help_text = param_info["help"]
+        if max_help_len:
+            help_text = help_text[:max_help_len]
+        type_text = param_info.get("type", "").replace("typing.", "")
+        default_val = param_info.get("default")
+
+        # Build metadata: filter out empty parts, join with comma
+        parts = [p for p in [type_text, f"default={default_val}" if default_val is not None else ""] if p]
+        meta = f" ({', '.join(parts)})" if parts else ""
+        return f"- {param_name}{meta}: {help_text}"
+
+    def _format_params_dict(self, params_dict: dict, max_help_len: int | None) -> list[str]:
+        """Format a dictionary of parameters."""
+        return [
+            self._format_param_line(name, info, max_help_len)
+            for name, info in params_dict.items()
+            if isinstance(info, dict) and "help" in info and not EXCLUDED_PARAM_REGEX.search(name)
+        ]
+
+    def format_shared_params(self, max_help_len: int | None = DEFAULT_HELP_TRUNCATE_LEN) -> str:
+        """Format shared parameters (model, data, training) that apply to all methods.
+
+        Args:
+            max_help_len: Max length for help text. None means no truncation.
+        """
+        all_params = self.get_parameters()
+        sections = []
+
+        for param_type in ["model", "data", "training"]:
+            if param_type in all_params:
+                sections.append(f"### {param_type.upper()} Parameters:")
+                sections.extend(self._format_params_dict(all_params[param_type], max_help_len))
+                sections.append("")
+
+        return "\n".join(sections).rstrip()
+
+    def format_method_specific_params(self, method: str, max_help_len: int | None = DEFAULT_HELP_TRUNCATE_LEN) -> str:
+        """Format only method-specific finetuning parameters.
+
+        Args:
+            max_help_len: Max length for help text. None means no truncation.
+        """
+        all_params = self.get_parameters()
+        if "finetuning" not in all_params:
+            return f"**{method}**: No specific parameters"
+
+        finetuning_params = all_params["finetuning"]
+        method_lower = method.lower()
+
+        if method_lower == "full":
+            return f"**{method}**: Uses shared parameters only (full-parameter training)"
+
+        if method_lower not in finetuning_params or not finetuning_params[method_lower]:
+            return f"**{method}**: Uses shared parameters only"
+
+        lines = [f"**{method}**:"]
+        lines.extend(self._format_params_dict(finetuning_params[method_lower], max_help_len))
+        return "\n".join(lines)
+
+
+LLaMAFactory_manager = LLaMAFactoryManager()
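
Note on the parameter filtering above: the manager drops any parameter whose name matches the combined exclusion regex, then renders the rest as one bullet line each. The following is a minimal standalone sketch of that behaviour, assuming a hypothetical three-pattern subset of the exclusion list and made-up sample parameters; it mirrors `_format_param_line` and `_format_params_dict` but is not the module itself.

```python
import re

# Hypothetical subset of EXCLUDED_PARAM_PATTERNS, for illustration only.
EXCLUDED = re.compile("|".join([r"^ray_", r"^report_to$", r"^dpo_"]))


def format_param_line(name, info, max_help_len=None):
    """Render one parameter as '- name (type, default=...): help'."""
    help_text = info["help"]
    if max_help_len:
        help_text = help_text[:max_help_len]
    type_text = info.get("type", "").replace("typing.", "")
    default = info.get("default")
    # Keep only the non-empty metadata parts.
    parts = [p for p in [type_text, f"default={default}" if default is not None else ""] if p]
    meta = f" ({', '.join(parts)})" if parts else ""
    return f"- {name}{meta}: {help_text}"


def format_params(params, max_help_len=None):
    """Skip entries without help text and entries matching the exclusion regex."""
    return [
        format_param_line(name, info, max_help_len)
        for name, info in params.items()
        if isinstance(info, dict) and "help" in info and not EXCLUDED.search(name)
    ]


# Made-up sample data; real entries come from the extracted parameters.json.
sample = {
    "cutoff_len": {"help": "Max tokens per sample.", "type": "int", "default": 2048},
    "report_to": {"help": "Logging integrations.", "type": "typing.Optional[str]"},
    "dpo_beta": {"help": "DPO beta.", "type": "float", "default": 0.1},
    "template": {"help": "Chat template name.", "type": "typing.Optional[str]"},
}
lines = format_params(sample)
```

Because the patterns are anchored with `^`, `re.search` behaves like a prefix or exact-name match here, so `report_to` and `dpo_beta` are dropped while `cutoff_len` and `template` survive.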
diff --git a/rdagent/scenarios/finetune/scen/memory_estimator.py b/rdagent/scenarios/finetune/scen/memory_estimator.py
new file mode 100644
index 000000000..229ad7032
--- /dev/null
+++ b/rdagent/scenarios/finetune/scen/memory_estimator.py
@@ -0,0 +1,136 @@
+"""LLM Fine-tuning Memory Constraints Calculator
+
+Calculate max supported seq_len for each fine-tuning method.
+Based on EleutherAI Transformer Math: https://blog.eleuther.ai/transformer-math/
+"""
+
+import re
+
+
+class MemoryEstimator:
+    """Calculate memory constraints for fine-tuning methods."""
+
+    # Memory factors (GB per billion parameters)
+    MEM_FACTOR = {
+        "full": 18,  # bf16 params + bf16 grads + fp32 optimizer states
+        "base_bf16": 2,  # bf16 params only (frozen)
+        "base_4bit": 0.5,  # 4-bit quantized params
+        "trainable": 18,  # trainable params
+    }
+
+    # Architecture estimation: params_b -> (hidden_dim, num_layers)
+    ARCH = {
+        3: (2048, 24),
+        7: (4096, 32),
+        13: (5120, 40),
+        34: (6144, 48),
+        70: (8192, 80),
+    }
+
+    DEFAULT_LORA_RANK = 64
+
+    def __init__(
+        self,
+        params_b: float,
+        gpu_mem: float,
+        num_gpus: int,
+        max_position_embeddings: int = 32768,
+    ):
+        self.params_b = params_b
+        self.gpu_mem = gpu_mem
+        self.num_gpus = num_gpus
+        self.total_mem = gpu_mem * num_gpus
+        self.max_ctx = max_position_embeddings
+
+        # Estimate architecture
+        self.hidden, self.layers = next(
+            (v for k, v in self.ARCH.items() if params_b <= k),
+            (8192, 96),
+        )
+
+    @classmethod
+    def from_model_name(
+        cls,
+        name: str,
+        gpu_mem: float,
+        num_gpus: int,
+        model_specs: str = "",
+    ) -> "MemoryEstimator":
+        """Create from model name and specs."""
+        # Parse params from name: Qwen2.5-7B -> 7.0
+        match = re.search(r"(\d+(?:\.\d+)?)[Bb]", name)
+        params_b = float(match.group(1)) if match else 7.0
+
+        # Parse max_position_embeddings from specs
+        max_ctx = 32768
+        if model_specs:
+            ctx_match = re.search(r"max_position_embeddings:\s*(\d+)", model_specs)
+            if ctx_match:
+                max_ctx = int(ctx_match.group(1))
+
+        return cls(params_b, gpu_mem, num_gpus, max_ctx)
+
+    def _base_memory(self, method: str) -> float:
+        """Base memory without activations (GB)."""
+        lora_p = 2 * self.DEFAULT_LORA_RANK * self.hidden * 4 * self.layers / 1e9
+
+        if method == "full":
+            return self.params_b * self.MEM_FACTOR["full"]
+        elif method == "full_gc":
+            return self.params_b * self.MEM_FACTOR["full"]
+        elif method == "lora":
+            return self.params_b * self.MEM_FACTOR["base_bf16"] + lora_p * self.MEM_FACTOR["trainable"]
+        elif method == "qlora":
+            return self.params_b * self.MEM_FACTOR["base_4bit"] + lora_p * self.MEM_FACTOR["trainable"]
+        return 0.0
+
+    def _activation_factor(self, method: str) -> float:
+        """Activation memory factor (gradient checkpointing reduces this)."""
+        return 0.35 if method == "full_gc" else 1.0
+
+    def _find_max_seq_len(self, method: str, batch_size: int = 1) -> int:
+        """Find max seq_len that fits in memory."""
+        available = self.total_mem * 0.9
+        base = self._base_memory(method)
+        remaining = available - base * 1.2
+
+        if remaining <= 0:
+            return 0
+
+        act_factor = self._activation_factor(method)
+        # activation = seq * hidden * layers * 8 * batch / 1e9 * act_factor * 1.2
+        max_seq = int(remaining * 1e9 / (self.hidden * self.layers * 8 * batch_size * act_factor * 1.2))
+        return max_seq  # Don't cap at max_ctx here, show raw capability
+
+    def estimate(self) -> dict[str, int]:
+        """Calculate max seq_len for each method (batch=1)."""
+        methods = ["full", "full_gc", "lora", "qlora"]
+        return {m: self._find_max_seq_len(m) for m in methods}
+
+    def format(self, estimates: dict[str, int] | None = None) -> str:
+        """Format as constraint table."""
+        if estimates is None:
+            estimates = self.estimate()
+
+        lines = [
+            "## Hardware Memory Constraints",
+            f"**Hardware**: {self.num_gpus}x {self.gpu_mem:.0f}GB GPU = {self.total_mem:.0f}GB total",
+            f"**Model**: {self.params_b}B parameters",
+            f"**Model max_position_embeddings**: {self.max_ctx}",
+            "",
+            "| Method | Max seq_len (batch=1) |",
+            "|--------|----------------------|",
+        ]
+
+        for method, max_seq in estimates.items():
+            if max_seq > 0:
+                lines.append(f"| {method} | {max_seq} |")
+            else:
+                lines.append(f"| {method} | Not viable |")
+
+        lines.append("")
+        lines.append("**Note**: Choose `cutoff_len` <= min(max_seq_len, max_position_embeddings)")
+        lines.append("- Larger `cutoff_len` enables longer CoT but reduces batch_size")
+        lines.append("- Method quality: full > lora > qlora (when all can support your seq_len needs)")
+
+        return "\n".join(lines)
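
Note on the estimator arithmetic above: the calculation reduces to "usable memory minus padded base memory, divided by per-token activation cost". The following is a standalone sketch of that arithmetic, assuming the same constants as the diff (18/2/0.5 GB per billion parameters, 90% usable memory, a 1.2 safety margin, LoRA rank 64) and the 7B architecture row (hidden=4096, layers=32). The function name `max_seq_len` is hypothetical.

```python
def max_seq_len(params_b, total_mem_gb, method, hidden=4096, layers=32, lora_rank=64, batch=1):
    """Re-derive the estimator's max sequence length for one method (sketch)."""
    # LoRA adapter size in billions of parameters, as in _base_memory.
    lora_b = 2 * lora_rank * hidden * 4 * layers / 1e9
    base_gb = {
        "full": params_b * 18,
        "full_gc": params_b * 18,
        "lora": params_b * 2 + lora_b * 18,
        "qlora": params_b * 0.5 + lora_b * 18,
    }[method]
    # 90% of memory is considered usable; base memory is padded by 20%.
    remaining_gb = total_mem_gb * 0.9 - base_gb * 1.2
    if remaining_gb <= 0:
        return 0  # reported as "Not viable" in the table
    # Gradient checkpointing scales activation memory by 0.35.
    act_factor = 0.35 if method == "full_gc" else 1.0
    bytes_per_token = hidden * layers * 8 * batch * act_factor * 1.2
    return int(remaining_gb * 1e9 / bytes_per_token)


single_gpu = {m: max_seq_len(7.0, 80.0, m) for m in ["full", "full_gc", "lora", "qlora"]}
eight_gpus = {m: max_seq_len(7.0, 640.0, m) for m in ["full", "full_gc", "lora", "qlora"]}
```

For a 7B model on a single 80 GB GPU, full fine-tuning needs 7 x 18 x 1.2 = 151.2 GB against 72 GB usable, so `single_gpu["full"]` is 0, while LoRA and QLoRA leave room for activations. Note that the estimate treats all GPUs as one memory pool, which is an optimistic simplification.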
diff --git a/rdagent/scenarios/finetune/scen/prompts.yaml b/rdagent/scenarios/finetune/scen/prompts.yaml
new file mode 100644
index 000000000..577601764
--- /dev/null
+++ b/rdagent/scenarios/finetune/scen/prompts.yaml
@@ -0,0 +1,141 @@
+scenario_description: |-
+  The user wants to fine-tune a model to perform best in specific scenarios, based on the provided dataset.
+  The user has decided to fine-tune the model using the LLaMA-Factory framework. Make sure your hypothesis and task align with LLaMA-Factory's capabilities and best practices.
+
+  # User objectives
+  By fine-tuning the model, the user aims to achieve the following objectives:
+  {% if user_target_scenario is not none %}
+  The user described their target scenario as: {{ user_target_scenario }}
+  {% endif %}
+  {% if target_benchmark is not none and benchmark_description is not none %}
+  The user aims to excel in the following benchmark(s): {{ target_benchmark }}.
+  The benchmark can be described as: {{ benchmark_description }}.
+  {% endif %}
+
+  # Device Information
+  The device available for fine-tuning has the following specifications:
+  {{ device_info }}
+  The hardware constraints might limit certain choices, so consider them carefully.
+
+  {% if memory_report %}
+  {{ memory_report }}
+  {% endif %}
+
+  {% if chosen_model %}
+  # Base Model Details
+  The user has chosen the base model to fine-tune: {{ base_model }}.
+  ## Model Details
+  {{ model_info }}
+  {% else %}
+  The user has not yet chosen a base model to fine-tune.
+  {% endif %}
+
+  {%- if enable_dataset_description %}
+  # Dataset Configuration
+  {%- for ds_name, ds_info in dataset_config.items() %}
+  ## Dataset: {{ ds_name }}
+  - **total_samples**: {{ ds_info.total_samples }}
+  - **total_size_mb**: {{ ds_info.total_size_mb }}
+  {%- if ds_info.file_tree %}
+  - **file_tree**:
+    ```
+    {{ ds_info.file_tree }}
+    ```
+  {%- endif %}
+  {%- if ds_info.tasks %}
+  - **tasks**:
+    {%- for task_name, task_info in ds_info.tasks.items() %}
+    ### {{ "(root)" if task_name == "_root" else task_name }}
+    - files: {{ task_info.files }}
+    - sample_count: {{ task_info.sample_count }}
+    {%- if task_info.column_stats %}
+    - column_stats:
+      {%- for col, col_stats in task_info.column_stats.items() %}
+      - {{ col }}: empty={{ col_stats.empty_count }}, min_tokens={{ col_stats.min_tokens }}, max_tokens={{ col_stats.max_tokens }}, p50_tokens={{ col_stats.p50_tokens }}, p99_tokens={{ col_stats.p99_tokens }}
+      {%- endfor %}
+    {%- endif %}
+    {%- if task_info.samples and task_info.samples | length > 0 %}
+    - first_sample:
+      ```json
+      {{ task_info.samples[0] | tojson }}
+      ```
+    {%- endif %}
+    {%- endfor %}
+  {%- endif %}
+  {%- if ds_info.readme %}
+  - **readme**: {{ ds_info.readme | tojson }}
+  {%- endif %}
+  {%- endfor %}
+
+  ## Timeout Constraints
+  - Full Training Timeout: {{ full_timeout }}
+  - Data Processing Timeout: {{ data_processing_timeout }}
+  {% endif %}
+
+  ## (Very important!) Sample Size Control (Code-Based, No LLM)
+  To bound training cost, there is a strict upper limit on the number of training samples fed into LLM fine-tuning. Sample the data with a rule-based method that does not feed all the data through an LLM, because processing every sample may exceed budget or time limits.
+  The upper limit is {{ upper_data_size_limit }} samples.
+
+  You can choose one of the following strategies to control the sample size (all strategies must be code-based, with no LLM calls):
+  1. Quality-first: Prefer samples with complete fields, reasonable length, and clear structure
+  2. Diversity:
+      - If dataset has categories/sources, sample proportionally to preserve distribution
+      - **Difficulty-aware**: If difficulty metadata exists, use stratified sampling to match the difficulty distribution of the target benchmark/test set, so that the initial training stage covers all evaluation scenarios. In subsequent training stages,
+      adjust the difficulty proportions based on base model capability, training objectives, and previous experiment results, focusing on the model's capability boundary for maximum learning efficiency.
+
+  The hypothesis should specify which sampling strategy to use based on dataset info. The data processing script will implement it.
+
+dataset_selection:
+  system: |-
+    You are a dataset selection expert. Your task is to select relevant datasets for a specific fine-tuning goal.
+
+    ## User Goal
+    {{ user_target_scenario }}
+    {% if target_benchmark %}
+
+    ## Target Benchmark
+    {{ target_benchmark }}
+    {{ benchmark_description }}
+    {% endif %}
+
+    ## Selection Guidelines
+    - Select datasets that are directly relevant to the user's target scenario
+    - Consider domain alignment (e.g., math datasets for math reasoning tasks)
+    - Consider task type alignment (e.g., reasoning datasets for reasoning tasks)
+    - When uncertain, include the dataset (better to have false positives than miss relevant data)
+
+    ## Output Format
+    Return a JSON object:
+    ```json
+    {
+      "selected_datasets": ["dataset1", "dataset2"],
+      "reasoning": "Brief explanation of why these datasets were selected"
+    }
+    ```
+
+  user: |-
+    ## Available Datasets
+    {% for ds in datasets %}
+    ### {{ ds.name }}
+    - **total_samples**: {{ ds.total_samples }}
+    - **total_size_mb**: {{ ds.total_size_mb }}
+    {%- if ds.tasks %}
+    - **tasks**:
+      {%- for task_name, task_info in ds.tasks.items() %}
+      #### {{ "(root)" if task_name == "_root" else task_name }}
+      - sample_count: {{ task_info.sample_count }}
+      {%- if task_info.column_stats %}
+      - column_stats:
+        {%- for col, col_stats in task_info.column_stats.items() %}
+        - {{ col }}: p50={{ col_stats.p50_tokens }}, p99={{ col_stats.p99_tokens }}
+        {%- endfor %}
+      {%- endif %}
+      {%- endfor %}
+    {%- endif %}
+    {%- if ds.readme %}
+    - **readme**: {{ ds.readme }}
+    {%- endif %}
+
+    {% endfor %}
+
+    Please select the datasets most relevant to the user's fine-tuning goal.
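
Note on the "Sample Size Control" section of the prompt above: the proportional and difficulty-aware strategies both amount to stratified sampling under a hard cap. The following is a minimal code-only sketch (no LLM calls), assuming each sample is a dict with a hypothetical `difficulty` field; the function name `stratified_sample` and the largest-remainder rounding are illustrative choices, not something the prompt prescribes.

```python
import random
from collections import defaultdict


def stratified_sample(samples, key, limit, seed=0):
    """Pick at most `limit` samples while preserving the share of each `key` group."""
    if len(samples) <= limit:
        return list(samples)
    rng = random.Random(seed)  # fixed seed keeps the data processing reproducible
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    total = len(samples)
    # Floor each group's proportional quota first.
    quotas = {g: limit * len(items) // total for g, items in groups.items()}
    # Hand the leftover slots to the groups with the largest remainders.
    leftover = limit - sum(quotas.values())
    by_remainder = sorted(groups, key=lambda g: (limit * len(groups[g])) % total, reverse=True)
    for g in by_remainder[:leftover]:
        quotas[g] += 1
    picked = []
    for g, items in groups.items():
        picked.extend(rng.sample(items, quotas[g]))
    return picked


# Made-up data: 70 easy, 20 medium, 10 hard samples.
data = (
    [{"id": i, "difficulty": "easy"} for i in range(70)]
    + [{"id": i, "difficulty": "medium"} for i in range(70, 90)]
    + [{"id": i, "difficulty": "hard"} for i in range(90, 100)]
)
subset = stratified_sample(data, key="difficulty", limit=10)
counts = {d: sum(s["difficulty"] == d for s in subset) for d in ["easy", "medium", "hard"]}
```

With a limit of 10 the subset keeps the 7/2/1 split of the full set. For later training stages, the quotas could be reweighted toward harder groups instead of following the raw proportions.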
diff --git a/rdagent/scenarios/finetune/scen/scenario.py b/rdagent/scenarios/finetune/scen/scenario.py
new file mode 100644
index 000000000..5d9859f3c
--- /dev/null
+++ b/rdagent/scenarios/finetune/scen/scenario.py
@@ -0,0 +1,304 @@
+import json
+import os
+import shutil
+from pathlib import Path
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.finetune.conf import get_ft_env
+from rdagent.core.utils import cache_with_pickle
+from rdagent.log import rdagent_logger as logger
+from rdagent.oai.llm_utils import APIBackend
+from rdagent.scenarios.data_science.scen import DataScienceScen
+from rdagent.scenarios.finetune.benchmark import get_benchmark_ranges, run_benchmark
+from rdagent.scenarios.finetune.datasets import prepare_all
+from rdagent.scenarios.finetune.experiment.workspace import FTWorkspace
+from rdagent.scenarios.finetune.scen.llama_factory_manager import LLaMAFactory_manager
+from rdagent.scenarios.finetune.scen.memory_estimator import MemoryEstimator
+from rdagent.scenarios.finetune.scen.utils import (
+    FinetuneDatasetDescriptor,
+    generate_dataset_info_config,
+)
+from rdagent.scenarios.finetune.utils import ensure_ft_assets_exist
+from rdagent.scenarios.shared.get_runtime_info import get_runtime_environment_by_env
+from rdagent.utils.agent.tpl import T
+
+
+class LLMFinetuneScen(DataScienceScen):
+    """LLMFinetuneScen Scenario"""
+
+    def __init__(self) -> None:
+        """Initialize LLM finetune scenario using configuration from FT_RD_SETTING."""
+        logger.info("Initializing LLM Fine-tune scenario")
+
+        # Basic attributes
+        self.user_target_scenario = FT_RD_SETTING.user_target_scenario
+        self.target_benchmark = FT_RD_SETTING.target_benchmark
+        self.benchmark_description = FT_RD_SETTING.benchmark_description
+        self.dataset = FT_RD_SETTING.dataset
+        self.base_model = FT_RD_SETTING.base_model
+
+        # Validate and prepare environment
+        self._validate_and_prepare_environment()
+
+        # Initialize LLaMA Factory manager
+        self._initialize_llama_factory()
+
+        # Generate dataset configuration for all datasets first
+        self.dataset_config = self._prepare_dataset_config()
+
+        # Select relevant datasets based on user target scenario (using full config info)
+        self.selected_datasets = self._select_relevant_datasets()
+
+        # Filter dataset_config to only include selected datasets
+        self.dataset_config = {k: v for k, v in self.dataset_config.items() if k in self.selected_datasets}
+
+        # timeout tracking
+        self.timeout_increase_count = 0
+
+        # NOTE: the environment cache is disabled in case the CUDA config changes
+        self.device_info = get_runtime_environment_by_env(get_ft_env(enable_cache=False))
+        self.gpu_count = json.loads(self.device_info).get("gpu_count", 0)
+        self.model_info = FinetuneDatasetDescriptor().describe_model(self.base_model)
+
+        # Initialize memory estimator
+        self.memory_report = self._generate_memory_report()
+
+        baseline_result = self.run_baseline_model_evaluation(
+            model_name=self.base_model, benchmark_name=self.target_benchmark
+        )
+        # Agent only sees validation score
+        self.baseline_benchmark_score = baseline_result.get("benchmark", {})
+        # Test score is for frontend display only
+        self.baseline_benchmark_score_test = baseline_result.get("benchmark_test", {})
+
+    def benchmark_hash(self, model_name, benchmark_name) -> str:
+        return f"llm_finetune_baseline_eval_{model_name}_{benchmark_name}"
+
+    @cache_with_pickle(benchmark_hash)
+    def run_baseline_model_evaluation(self, model_name, benchmark_name) -> dict:
+        ws = FTWorkspace()
+        shutil.copytree(
+            Path(FT_RD_SETTING.file_path) / "models" / model_name,
+            ws.workspace_path / "models" / model_name,
+            dirs_exist_ok=True,
+        )
+        val_range, test_range = get_benchmark_ranges()
+
+        # Validation set - visible to agent
+        validation_result = run_benchmark(
+            workspace_path=str(ws.workspace_path),
+            model_path=ws.workspace_path / "models" / model_name,
+            model_name=model_name,
+            benchmark_name=benchmark_name,
+            gpu_count=self.gpu_count,
+            test_range=val_range,
+            result_subdir="validation",
+        )
+        # Test set - NOT visible to agent, frontend only
+        test_result = run_benchmark(
+            workspace_path=str(ws.workspace_path),
+            model_path=ws.workspace_path / "models" / model_name,
+            model_name=model_name,
+            benchmark_name=benchmark_name,
+            gpu_count=self.gpu_count,
+            test_range=test_range,
+            result_subdir="test",
+        )
+        return {
+            "benchmark": validation_result,  # Agent sees this
+            "benchmark_test": test_result,  # Agent does NOT see this
+        }
+
+    def real_full_timeout(self):
+        return FT_RD_SETTING.full_timeout
+
+    def _generate_memory_report(self) -> str:
+        """Generate memory estimation report based on hardware and model."""
+        try:
+            # Parse device info
+            device_info = json.loads(self.device_info) if isinstance(self.device_info, str) else self.device_info
+            gpu_info = device_info.get("gpu", {})
+
+            # Extract GPU info based on source
+            if gpu_info.get("source") == "pytorch":
+                # PyTorch format: gpu_count at top level, total_memory_gb in summary
+                num_gpus = gpu_info.get("gpu_count")
+                gpu_mem = gpu_info.get("summary", {}).get("total_memory_gb")
+            else:
+                # nvidia-smi format: has gpus array with memory_total_gb
+                gpus = gpu_info.get("gpus", [])
+                num_gpus = len(gpus) if gpus else None
+                gpu_mem = gpus[0].get("memory_total_gb", 0) if gpus else None
+
+            # Skip if GPU info not available
+            if not num_gpus or not gpu_mem:
+                logger.warning("GPU info not available, skipping memory report")
+                return ""
+
+            # Create estimator from model name (pass model_specs for max_position_embeddings)
+            estimator = MemoryEstimator.from_model_name(
+                name=self.base_model,
+                gpu_mem=gpu_mem,
+                num_gpus=num_gpus,
+                model_specs=self.model_info.get("specs", ""),
+            )
+            return estimator.format()
+        except Exception as e:
+            logger.warning(f"Failed to generate memory report: {e}")
+            return ""
+
+    def _validate_and_prepare_environment(self):
+        """Validate FT_FILE_PATH and prepare all registered datasets"""
+        ft_root = Path(FT_RD_SETTING.file_path)
+        if not ft_root.exists():
+            os.makedirs(ft_root, mode=0o777, exist_ok=True)
+            logger.info(f"FT_FILE_PATH did not exist; created directory: {ft_root}")
+
+        # Prepare all registered datasets
+        prepare_all()
+
+        # Ensure model assets exist
+        if self.base_model:
+            ensure_ft_assets_exist(model=self.base_model, check_model=True)
+
+    def _initialize_llama_factory(self):
+        """Initialize LLaMA Factory information manager"""
+
+        # Extract LLaMA Factory information (reuses the local cache when it exists)
+        info = LLaMAFactory_manager.get_info()
+
+        # Log extracted information
+        methods_count = len(info.get("methods", []))
+        params_count = sum(len(p) if isinstance(p, dict) else 0 for p in info.get("parameters", {}).values())
+        logger.info(f"LLaMA Factory initialized: {methods_count} methods, {params_count} parameters")
+
+    def _select_relevant_datasets(self) -> list[str]:
+        """Select relevant datasets based on user target scenario using LLM.
+
+        Uses self.dataset_config which contains full information (stats, description, samples).
+        """
+        total = len(self.dataset_config)
+
+        # If user specified a dataset, use it directly
+        if self.dataset:
+            selected, reasoning = [self.dataset], "User specified dataset directly"
+        elif not self.dataset_config:
+            logger.warning("No datasets found for selection")
+            return []
+        else:
+            # Use LLM to select relevant datasets
+            logger.info(f"Found {total} datasets, selecting relevant ones...")
+            selected, reasoning = self._llm_select_datasets()
+
+        # Log results
+        logger.info(f"Dataset selection: {len(selected)}/{total} - {selected}")
+        logger.log_object(
+            {"selected_datasets": selected, "total_datasets": total, "reasoning": reasoning},
+            tag="dataset_selection",
+        )
+        return selected
+
+    def _llm_select_datasets(self) -> tuple[list[str], str]:
+        """Use LLM to select relevant datasets."""
+        # Pass dataset_config directly - it already has the unified tasks structure
+        dataset_summaries = [
+            {
+                "name": ds_name,
+                "total_samples": ds_config.get("total_samples"),
+                "total_size_mb": ds_config.get("total_size_mb"),
+                "tasks": ds_config.get("tasks", {}),
+                "readme": ds_config.get("readme"),
+            }
+            for ds_name, ds_config in self.dataset_config.items()
+        ]
+
+        system_prompt = T(".prompts:dataset_selection.system").r(
+            user_target_scenario=self.user_target_scenario,
+            target_benchmark=self.target_benchmark,
+            benchmark_description=self.benchmark_description,
+        )
+        user_prompt = T(".prompts:dataset_selection.user").r(datasets=dataset_summaries)
+
+        response = APIBackend().build_messages_and_create_chat_completion(
+            system_prompt=system_prompt,
+            user_prompt=user_prompt,
+            json_mode=True,
+        )
+
+        result = json.loads(response)
+        return result.get("selected_datasets", []), result.get("reasoning", "")
+
+    def _prepare_dataset_config(self) -> dict:
+        """Generate dataset_info.json configuration.
+
+        This is the single source of truth for dataset information, containing:
+        - LlamaFactory compatible fields (file_name, formatting, columns)
+        - Auto-computed statistics (stats.column_stats)
+        - Data samples (truncated)
+        - AI-generated description
+
+        Returns:
+            dict: Complete dataset configuration
+        """
+        datasets_dir = Path(FT_RD_SETTING.file_path) / "datasets"
+        dataset_info_path = datasets_dir / "dataset_info.json"
+
+        # Check if already configured
+        existing_config = {}
+        if dataset_info_path.exists():
+            try:
+                with open(dataset_info_path, "r", encoding="utf-8") as f:
+                    existing_config = json.load(f)
+
+                # Only keep entries that have corresponding local directories
+                local_datasets = {d.name for d in datasets_dir.iterdir() if d.is_dir() and not d.name.startswith(".")}
+                existing_config = {k: v for k, v in existing_config.items() if k in local_datasets}
+
+            except Exception as e:
+                logger.warning(f"Failed to load existing dataset_info.json: {e}")
+
+        # Generate config for all datasets (will be filtered later by _select_relevant_datasets)
+        target_dataset_list = [] if self.dataset is None else [self.dataset]
+        logger.info(
+            f"Generating dataset_info.json configuration for: {target_dataset_list if target_dataset_list else 'all datasets'}"
+        )
+        generated_config = generate_dataset_info_config(target_dataset_list, FT_RD_SETTING.file_path, existing_config)
+        for dataset_name, config in generated_config.items():
+            existing_config[dataset_name] = config
+
+        try:
+            os.makedirs(datasets_dir, mode=0o777, exist_ok=True)
+
+            with open(dataset_info_path, "w", encoding="utf-8") as f:
+                json.dump(existing_config, f, indent=2, ensure_ascii=False)
+            logger.info(f"Successfully updated dataset_info.json with configuration for: {target_dataset_list}")
+        except Exception as e:
+            raise RuntimeError(f"Failed to write dataset_info.json: {e}") from e
+        return existing_config
+
+    @property
+    def metric_direction(self) -> bool:
+        """Metric direction for LLM fine-tuning (higher is better)"""
+        return True
+
+    def get_scenario_all_desc(self, enable_dataset_description: bool = False) -> str:
+        """Get complete scenario description for LLM fine-tuning.
+
+        Uses dataset_config as the single source of truth for dataset information.
+        The prompt template renders tasks with their statistics and samples.
+        """
+        return T(".prompts:scenario_description").r(
+            user_target_scenario=self.user_target_scenario,
+            target_benchmark=self.target_benchmark,
+            benchmark_description=self.benchmark_description,
+            device_info=self.device_info,
+            memory_report=self.memory_report,
+            chosen_model=FT_RD_SETTING.base_model is not None,
+            base_model=FT_RD_SETTING.base_model,
+            dataset_config=self.dataset_config,
+            model_info=self.model_info,
+            full_timeout=f"{self.real_full_timeout() / 60 / 60:.2f} hours",
+            data_processing_timeout=f"{FT_RD_SETTING.data_processing_timeout / 60:.0f} minutes",
+            enable_dataset_description=enable_dataset_description,
+            upper_data_size_limit=FT_RD_SETTING.upper_data_size_limit,
+        )
diff --git a/rdagent/scenarios/finetune/scen/utils.py b/rdagent/scenarios/finetune/scen/utils.py
new file mode 100644
index 000000000..d7764a4d9
--- /dev/null
+++ b/rdagent/scenarios/finetune/scen/utils.py
@@ -0,0 +1,909 @@
+"""Utilities for fine-tuning scenario data extraction and analysis."""
+
+import json
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pandas as pd
+import tiktoken
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.core.utils import cache_with_pickle
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.data_science.scen.utils import FileTreeGenerator
+from rdagent.utils import md5_hash
+
+# Fixed tokenizer model for token counting
+_TOKENIZER_MODEL = "gpt-3.5-turbo"
+
+
+def _find_data_files(dataset_path: Path, max_files: int = 50) -> list[Path]:
+    """Find data files in dataset directory using recursive glob.
+
+    Args:
+        dataset_path: Root path of the dataset
+        max_files: Maximum number of files to return
+
+    Returns:
+        List of Path objects for discovered data files, sorted by name
+    """
+    patterns = ["*.json", "*.jsonl", "*.csv", "*.txt", "*.parquet"]
+    files = []
+    for pattern in patterns:
+        files.extend(dataset_path.rglob(pattern))
+    # Sort by name for deterministic order, limit count to avoid excessive files
+    dataset_files = sorted(files, key=lambda x: x.name)[:max_files]
+    return [f for f in dataset_files if f != dataset_path / "dataset_info.json"]
+
+
+def _truncate_long_values(obj, max_length: int = 3000):
+    """Recursively truncate long string values in nested data structures.
+
+    Args:
+        obj: The object to truncate (dict, list, ndarray, or str)
+        max_length: Maximum length for string values
+
+    Returns:
+        Truncated object with the same structure, showing omitted character count.
+        numpy arrays are converted to Python lists for JSON serialization.
+    """
+    if isinstance(obj, np.ndarray):
+        # Convert numpy array to list first, then process recursively
+        return _truncate_long_values(obj.tolist(), max_length)
+    elif isinstance(obj, dict):
+        return {k: _truncate_long_values(v, max_length) for k, v in obj.items()}
+    elif isinstance(obj, list):
+        return [_truncate_long_values(item, max_length) for item in obj]
+    elif isinstance(obj, str) and len(obj) > max_length:
+        omitted = len(obj) - max_length
+        return obj[:max_length] + f"...(omitted {omitted} chars)"
+    elif isinstance(obj, (np.integer, np.floating)):
+        # Convert numpy scalar types to Python native types
+        return obj.item()
+    return obj
+
+
+def _compute_column_stats(data: list[dict]) -> dict[str, dict]:
+    """Compute token statistics for each string column in the dataset.
+
+    Uses tiktoken batch encoding, which is substantially faster than encoding texts one by one.
+    Always uses the gpt-3.5-turbo tokenizer (see _TOKENIZER_MODEL).
+
+    Args:
+        data: List of dictionaries representing dataset samples
+
+    Returns:
+        Dictionary mapping column names to their token statistics:
+        {column_name: {empty_count, min_tokens, max_tokens, p50_tokens, p99_tokens}}
+    """
+    if not data:
+        return {}
+
+    # Collect all column names from the dataset
+    all_columns: set[str] = set()
+    for item in data:
+        if isinstance(item, dict):
+            all_columns.update(item.keys())
+
+    # Get tiktoken encoder (cached after first call)
+    try:
+        encoding = tiktoken.encoding_for_model(_TOKENIZER_MODEL)
+    except Exception:
+        encoding = tiktoken.get_encoding("cl100k_base")
+
+    column_stats = {}
+    for col in all_columns:
+        texts: list[str] = []
+        empty_count = 0
+
+        # Collect all non-empty texts for this column
+        for item in data:
+            if isinstance(item, dict):
+                val = item.get(col, "")
+                if isinstance(val, str):
+                    if not val.strip():
+                        empty_count += 1
+                    else:
+                        texts.append(val)
+
+        if texts:
+            # Batch encode all texts at once (10-50x faster than individual calls)
+            try:
+                encoded_batch = encoding.encode_batch(texts)
+                token_counts = [len(tokens) for tokens in encoded_batch]
+            except Exception as e:
+                logger.warning(f"Batch encoding failed for column '{col}': {e}, falling back to sequential")
+                token_counts = [len(encoding.encode(t)) for t in texts]
+
+            column_stats[col] = {
+                "empty_count": empty_count,
+                "min_tokens": int(min(token_counts)),
+                "max_tokens": int(max(token_counts)),
+                "p50_tokens": int(np.percentile(token_counts, 50)),
+                "p99_tokens": int(np.percentile(token_counts, 99)),
+            }
+        else:
+            column_stats[col] = {
+                "empty_count": empty_count,
+                "min_tokens": 0,
+                "max_tokens": 0,
+                "p50_tokens": 0,
+                "p99_tokens": 0,
+            }
+
+    return column_stats
+
+
+def _load_dataset_for_stats(data_files: list[Path], max_samples: int = 50000) -> list[dict]:
+    """Load dataset samples from data files for statistics computation.
+
+    Args:
+        data_files: List of data file paths
+        max_samples: Maximum number of samples to load
+
+    Returns:
+        List of dictionaries representing dataset samples
+    """
+    all_data: list[dict] = []
+
+    for data_file in data_files:
+        if len(all_data) >= max_samples:
+            break
+
+        suffix = data_file.suffix.lower()
+        try:
+            if suffix == ".json":
+                with open(data_file, "r", encoding="utf-8") as f:
+                    data = json.load(f)
+                    if isinstance(data, list):
+                        all_data.extend(data[: max_samples - len(all_data)])
+                    elif isinstance(data, dict):
+                        all_data.append(data)
+
+            elif suffix == ".jsonl":
+                with open(data_file, "r", encoding="utf-8") as f:
+                    for line in f:
+                        if len(all_data) >= max_samples:
+                            break
+                        line = line.strip()
+                        if line:
+                            all_data.append(json.loads(line))
+
+            elif suffix == ".csv":
+                df = pd.read_csv(data_file, nrows=max_samples - len(all_data))
+                all_data.extend(df.to_dict("records"))
+
+            elif suffix == ".parquet":
+                df = pd.read_parquet(data_file)
+                all_data.extend(df.head(max_samples - len(all_data)).to_dict("records"))
+
+        except Exception as e:
+            logger.warning(f"Failed to load {data_file.name} for stats: {e}")
+
+    return all_data
+
+
+class FinetuneDatasetDescription(dict):
+    """Specialized dataset description for finetune scenarios."""
+
+    def __str__(self) -> str:
+        """Generate human-readable description for LLM prompts."""
+        parts = []
+
+        if "file_tree" in self:
+            parts.append(f"## File Tree:\n{self['file_tree']}")
+
+        if "file_path_to_descriptions" in self:
+            for file_path, file_desc in self["file_path_to_descriptions"]:
+                parts.append(f"### File path: {file_path}\n{file_desc}")
+
+        if "readme_file_descs" in self and self["readme_file_descs"] is not None:
+            parts.append(f"## Dataset readme Description:\n{self['readme_file_descs']}")
+
+        if "stats" in self:
+            stats = self["stats"]
+            parts.append(
+                f"## Statistics:\n"
+                f"- Files: {stats.get('file_count', 0)}\n"
+                f"- Samples: {stats.get('sample_count', 0)}\n"
+                f"- Size: {stats.get('total_size_mb', 0)} MB"
+            )
+
+        return "\n\n".join(parts) if parts else "Empty dataset description"
+
+
+class FinetuneFileDescription(dict):
+    """Specialized file description for finetune scenarios."""
+
+    def __str__(self) -> str:
+        """Generate human-readable file description."""
+        output_str = f"File name: {self.get('name', 'unknown')}\nFile Type: {self.get('type', 'unknown')}"
+        if "samples" in self:
+            output_str += f"\nFile Samples:\n{self['samples']}"
+        for k in self:
+            if k not in ["name", "type", "samples"]:
+                output_str += f"\n{k.capitalize()}: {self[k]}"
+        return output_str
+
+
+class FinetuneDatasetDescriptor:
+    """Specialized dataset descriptor for finetune scenarios that provides separated file tree and data samples."""
+
+    def _generate_file_tree(self, dataset_path: Path) -> str:
+        """Generate file tree for the dataset directory."""
+        try:
+            generator = FileTreeGenerator(max_lines=150)
+            return generator.generate_tree(dataset_path)
+        except Exception as e:
+            logger.warning(f"Could not generate file tree: {e}")
+            return f"Error generating file tree: {str(e)}"
+
+    def _count_samples_in_file(self, data_file: Path) -> int:
+        """Count total samples in a single data file.
+
+        Args:
+            data_file: Path to data file
+
+        Returns:
+            Total number of samples in file (0 if error or unsupported format)
+        """
+        suffix = data_file.suffix.lower()
+
+        try:
+            if suffix == ".json":
+                with open(data_file, "r", encoding="utf-8") as f:
+                    data = json.load(f)
+                    if isinstance(data, list):
+                        return len(data)
+                    elif isinstance(data, dict):
+                        return 1  # Single object
+
+            elif suffix == ".jsonl":
+                with open(data_file, "r", encoding="utf-8") as f:
+                    return sum(1 for line in f if line.strip())
+
+            elif suffix in [".csv", ".parquet"]:
+                df = pd.read_csv(data_file) if suffix == ".csv" else pd.read_parquet(data_file)
+                return len(df)
+
+        except Exception as e:
+            logger.warning(f"Cannot count samples in {data_file.name}: {e}")
+
+        return 0
+
+    def _generate_stats(self, dataset_path: Path, include_column_stats: bool = False) -> dict[str, Any]:
+        """Calculate dataset statistics: sample count, file size, and optionally column token stats.
+
+        Args:
+            dataset_path: Path to the dataset directory
+            include_column_stats: Whether to compute per-column token statistics
+
+        Returns:
+            Dictionary with sample_count, total_size_mb, file_count, and optionally column_stats.
+            Note: column_stats contains TOKEN counts (not character lengths) for each column,
+            using gpt-3.5-turbo tokenizer:
+            {column_name: {empty_count, min_tokens, max_tokens, p50_tokens, p99_tokens}}
+        """
+        try:
+            data_files = _find_data_files(dataset_path, max_files=50)
+
+            total_samples = 0
+            total_size_bytes = 0
+            file_count = len(data_files)
+
+            for data_file in data_files:
+                # Calculate file size
+                try:
+                    total_size_bytes += data_file.stat().st_size
+                except (OSError, FileNotFoundError):
+                    logger.warning(f"Cannot get size of {data_file}")
+
+                # Count samples using unified method
+                total_samples += self._count_samples_in_file(data_file)
+
+            stats = {
+                "sample_count": total_samples,
+                "total_size_mb": round(total_size_bytes / (1024 * 1024), 2),
+                "file_count": file_count,
+            }
+
+            # Compute column token statistics if requested
+            if include_column_stats and data_files:
+                try:
+                    dataset_samples = _load_dataset_for_stats(data_files)
+                    if dataset_samples:
+                        stats["column_stats"] = _compute_column_stats(dataset_samples)
+                        logger.info(
+                            f"Computed column token stats for {len(stats['column_stats'])} columns "
+                            f"(using tokenizer: {_TOKENIZER_MODEL})"
+                        )
+                except Exception as e:
+                    logger.warning(f"Failed to compute column token stats: {e}")
+
+            return stats
+
+        except Exception as e:
+            logger.warning(f"Failed to calculate dataset stats: {e}")
+            return {
+                "sample_count": 0,
+                "total_size_mb": 0,
+                "file_count": 0,
+            }
+
+    def hash_dataset_path(
+        self, dataset_path: Path, dataset_name: str | None = None, include_dataset_readme: bool = False
+    ) -> str:
+        """Generate hash key for dataset description caching."""
+        key_parts = []
+        key_parts.append(str(dataset_path))
+        files = sorted(str(path.relative_to(dataset_path)) for path in dataset_path.rglob("*") if path.is_file())
+        key_parts.append(",".join(files))
+        if dataset_name:
+            key_parts.append(dataset_name)
+        key_parts.append(str(include_dataset_readme))
+        return md5_hash("|".join(key_parts))
+
+    @cache_with_pickle(hash_dataset_path)
+    def describe_dataset_folder(
+        self, dataset_path: Path, dataset_name: str | None = None, include_dataset_readme: bool = False
+    ) -> FinetuneDatasetDescription:
+        """Generate complete dataset folder description.
+
+        Args:
+            dataset_path: Path to the dataset directory
+            dataset_name: Name of the dataset (defaults to directory name)
+
+        Returns:
+            FinetuneDatasetDescription with comprehensive dataset information
+        """
+        try:
+            logger.info(f"Generating dataset folder description for {dataset_path}...")
+            # Generate file tree and stats
+            file_tree = self._generate_file_tree(dataset_path)
+            stats = self._generate_stats(dataset_path)
+
+            # Get data files
+            data_files = _find_data_files(dataset_path, max_files=50)
+
+            # Use public interface to describe files
+            file_path_to_descriptions = []
+            for data_file in data_files[: FT_RD_SETTING.data_sample_count]:  # Process first N files for samples
+                try:
+                    file_path_to_descriptions.append(
+                        (data_file.relative_to(dataset_path), self.describe_data_file(data_file))
+                    )
+                except Exception as e:
+                    logger.warning(f"Could not describe file {data_file.name}: {e}")
+
+            # Read description from README
+            if include_dataset_readme:
+                readme_file_descs = self._read_dataset_readme(dataset_path)
+            else:
+                readme_file_descs = None
+
+            # Get file list
+            files = []
+            for file_path in data_files:
+                try:
+                    relative_path = file_path.relative_to(dataset_path)
+                    files.append(str(relative_path))
+                except ValueError:
+                    files.append(file_path.name)
+
+            return FinetuneDatasetDescription(
+                {
+                    # For new interface (generate_dataset_info_config)
+                    "file_tree": file_tree,
+                    "file_path_to_descriptions": file_path_to_descriptions,
+                    "stats": stats,
+                    # For templates (scenario_description, task_description)
+                    "name": dataset_name or dataset_path.name,
+                    "readme_file_descs": readme_file_descs,
+                    "files": files,
+                    "sample_count": stats.get("sample_count", 0),
+                    "total_size_mb": stats.get("total_size_mb", 0),
+                    "file_count": stats.get("file_count", 0),
+                }
+            )
+        except Exception as e:
+            logger.warning(f"Could not generate dataset folder description: {e}")
+            return FinetuneDatasetDescription(
+                {
+                    "file_tree": f"Error: {str(e)}",
+                    "data_samples": f"Error: {str(e)}",
+                    "stats": {"sample_count": 0, "total_size_mb": 0, "file_count": 0},
+                    "name": dataset_name or "unknown",
+                    "readme_file_descs": None,
+                    "files": [],
+                    "sample_count": 0,
+                    "total_size_mb": 0,
+                    "file_count": 0,
+                }
+            )
+
+    def get_dataset_stats(self, dataset_path: Path) -> dict[str, Any]:
+        """Calculate dataset statistics (public interface for compatibility)."""
+        return self._generate_stats(dataset_path)
+
+    def _walk(self, dir_path: Path, depth: int, max_depth: int, target_names: set[str]) -> list[Path]:
+        results = []
+        if depth > max_depth:
+            return results
+        for entry in dir_path.iterdir():
+            if entry.is_file():
+                # Case-sensitive match (consistent with the stated requirement)
+                if entry.name in target_names:
+                    results.append(entry)
+                # For case-insensitive matching, use this instead:
+                # if entry.name.lower() in {"readme.md", "readme.txt"}:
+                #     results.append(entry)
+            elif entry.is_dir():
+                results.extend(self._walk(entry, depth + 1, max_depth, target_names))
+        return results
+
+    def _read_dataset_readme(self, dataset_path: Path, max_chars: int = 5000) -> str:
+        """Read README description from dataset directory.
+
+        Args:
+            dataset_path: Path to dataset directory
+            max_chars: Maximum characters to read from each README file
+
+        Returns:
+            README content (truncated to max_chars) or empty string
+        """
+        target_names = {"README.md", "readme.md", "README.txt"}
+        readme_files = self._walk(dataset_path, depth=0, max_depth=2, target_names=target_names)
+        readme_file_descs = ""
+        for readme_file in readme_files:
+            try:
+                description = readme_file.read_text(encoding="utf-8")[:max_chars]
+                logger.info(f"Loaded dataset description from {readme_file.relative_to(dataset_path)}")
+                readme_file_descs += f"### From readme file: {readme_file.relative_to(dataset_path)}:\n<start_of_readme>\n{description}<end_of_readme>\n\n"
+            except Exception as e:
+                logger.warning(f"Failed to read {readme_file.relative_to(dataset_path)}: {e}")
+        return readme_file_descs
+
+    def _extract_samples_for_template(self, data_files: list[Path], max_samples: int = 2) -> list:
+        """Extract samples from first data file for template usage.
+
+        Args:
+            data_files: List of data file paths
+            max_samples: Maximum samples to extract
+
+        Returns:
+            List of sample dicts (may be empty if extraction fails)
+        """
+        if not data_files:
+            return []
+
+        try:
+            first_file = data_files[0]
+            suffix = first_file.suffix.lower()
+
+            # Dispatch to appropriate handler
+            if suffix == ".json":
+                file_desc = self.describe_file_json(first_file, max_samples=max_samples)
+            elif suffix == ".jsonl":
+                file_desc = self.describe_file_jsonl(first_file, max_samples=max_samples)
+            elif suffix == ".csv":
+                file_desc = self.describe_file_csv(first_file, max_samples=max_samples)
+            elif suffix == ".parquet":
+                file_desc = self.describe_file_parquet(first_file, max_samples=max_samples)
+            else:
+                return []
+
+            return file_desc.get("samples", [])
+
+        except Exception as e:
+            logger.warning(f"Failed to extract samples for template: {e}")
+            return []
+
+    def describe_model(self, base_model_name: str | None = None, ft_file_path: str | None = None) -> dict[str, Any]:
+        """Extract model information from config and metadata.
+
+        Args:
+            base_model_name: Name of the base model
+            ft_file_path: Path to finetune directory structure
+
+        Returns:
+            dict with model information (name, description, specs)
+        """
+        model_name = base_model_name or FT_RD_SETTING.base_model
+        info = {
+            "name": model_name or "Unknown",
+            "description": "",
+            "specs": "",
+        }
+
+        if not model_name:
+            return info
+
+        # Find model path
+        if not ft_file_path:
+            ft_file_path = FT_RD_SETTING.file_path
+
+        if not ft_file_path:
+            return info
+
+        model_path = Path(ft_file_path) / "models" / model_name
+        if not model_path.exists():
+            return info
+
+        # Read config
+        config_path = model_path / "config.json"
+        if config_path.exists():
+            try:
+                with open(config_path, encoding="utf-8") as f:
+                    config = json.load(f)
+                    specs = []
+                    for key in ["model_type", "max_position_embeddings"]:
+                        if key in config:
+                            specs.append(f"{key}: {config[key]}")
+                    info["specs"] = ", ".join(specs)
+            except Exception as e:
+                logger.warning(f"Failed to read model config: {e}")
+
+        # Read description
+        for readme in ["README.md", "readme.md", "model_card.md"]:
+            readme_path = model_path / readme
+            if readme_path.exists():
+                try:
+                    info["description"] = readme_path.read_text(encoding="utf-8")[:1000]
+                    logger.info(f"Loaded model description from {readme}")
+                    break
+                except Exception as e:
+                    logger.warning(f"Failed to read {readme}: {e}")
+
+        # Check if tokenizer supports <think> token for CoT training
+        info["has_think_token"] = False
+        tokenizer_path = model_path / "tokenizer.json"
+        if tokenizer_path.exists():
+            try:
+                with open(tokenizer_path, encoding="utf-8") as f:
+                    tokenizer_config = json.load(f)
+                    # Check in vocabulary
+                    vocab = tokenizer_config.get("model", {}).get("vocab", {})
+                    # Check in added_tokens
+                    added_tokens = tokenizer_config.get("added_tokens", [])
+                    added_token_contents = {t.get("content") for t in added_tokens if isinstance(t, dict)}
+
+                    if "<think>" in vocab or "<think>" in added_token_contents:
+                        info["has_think_token"] = True
+                        logger.info(f"Model {model_name} has native <think> token support")
+            except Exception as e:
+                logger.warning(f"Failed to check tokenizer for <think> token: {e}")
+
+        return info
+
+    def describe_file_json(self, data_file: Path, max_samples: int = 3) -> FinetuneFileDescription:
+        samples = []
+        try:
+            with open(data_file, "r", encoding="utf-8") as f:
+                data = json.load(f)
+                if isinstance(data, list) and len(data) > 0:
+                    samples = _truncate_long_values(data[:max_samples])
+                elif isinstance(data, dict):
+                    truncated_data = _truncate_long_values(data)
+                    samples = [truncated_data]
+        except Exception as e:
+            logger.warning(f"Error extracting samples from {data_file.name}: {e}")
+
+        return FinetuneFileDescription({"name": data_file.name, "type": "json", "samples": samples})
+
+    def describe_file_jsonl(self, data_file: Path, max_samples: int = 3) -> FinetuneFileDescription:
+        samples = []
+        jsonl_shape = None
+        try:
+            with open(data_file, "r", encoding="utf-8") as f:
+                for i, line in enumerate(f):
+                    if i >= max_samples:
+                        break
+                    line = line.strip()
+                    if line:
+                        samples.append(json.loads(line))
+            if samples:
+                samples = _truncate_long_values(samples)
+            jsonl_shape = (len(samples),)  # number of sampled records; also safe for empty files
+
+        except Exception as e:
+            logger.warning(f"Error extracting samples from {data_file.name}: {e}")
+
+        return FinetuneFileDescription(
+            {"name": data_file.name, "type": "jsonl", "samples": samples, "shape": jsonl_shape}
+        )
+
+    def describe_file_csv(self, data_file: Path, max_samples: int = 3) -> FinetuneFileDescription:
+        samples = []
+        df_shape = None
+        df_columns = []
+        try:
+            df = pd.read_csv(data_file)
+            if len(df) > 0:
+                samples = df.head(max_samples).to_dict("records")
+                samples = _truncate_long_values(samples)
+            df_shape = df.shape
+            df_columns = df.columns.tolist()
+        except Exception as e:
+            logger.warning(f"Error extracting samples from {data_file.name}: {e}")
+
+        return FinetuneFileDescription(
+            {"name": data_file.name, "type": "csv", "samples": samples, "shape": df_shape, "columns": df_columns}
+        )
+
+    def describe_file_parquet(self, data_file: Path, max_samples: int = 3) -> FinetuneFileDescription:
+        samples = []
+        df_shape = None
+        df_columns = []
+        try:
+            df = pd.read_parquet(data_file)
+            if len(df) > 0:
+                samples = df.head(max_samples).to_dict("records")
+                samples = _truncate_long_values(samples)
+            df_shape = df.shape
+            df_columns = df.columns.tolist()
+        except Exception as e:
+            logger.warning(f"Error extracting samples from {data_file.name}: {e}")
+
+        return FinetuneFileDescription(
+            {"name": data_file.name, "type": "parquet", "samples": samples, "shape": df_shape, "columns": df_columns}
+        )
+
+    def describe_data_file(self, data_file: Path) -> FinetuneFileDescription:
+        """Describe data file based on suffix, dispatching to specific format handlers.
+
+        This is the main public interface for describing individual data files.
+        It automatically detects file type and calls the appropriate handler.
+
+        Args:
+            data_file: Path to the data file
+
+        Returns:
+            FinetuneFileDescription with file metadata and samples
+        """
+        suffix = data_file.suffix.lower()
+        describe_map = {
+            ".json": self.describe_file_json,
+            ".jsonl": self.describe_file_jsonl,
+            ".csv": self.describe_file_csv,
+            ".parquet": self.describe_file_parquet,
+        }
+        describe_func = describe_map.get(suffix)
+        if describe_func:
+            return describe_func(data_file)
+        # For unsupported file types, return basic info
+        return FinetuneFileDescription({"name": data_file.name, "type": "unknown", "samples": []})
+
+    def _discover_subtasks(self, dataset_dir: Path) -> dict:
+        """Discover subtasks by scanning directory structure.
+
+        Groups data files by their parent directory name. The deepest directory
+        containing data files is considered a subtask.
+
+        Args:
+            dataset_dir: Root directory of the dataset
+
+        Returns:
+            Dictionary mapping subtask names to their info:
+            {subtask_name: {"files": [relative_paths], "file_paths": [absolute_paths]}}
+        """
+        data_extensions = {".json", ".jsonl", ".parquet", ".csv"}
+        subtasks: dict[str, dict] = {}
+
+        for data_file in dataset_dir.rglob("*"):
+            if not data_file.is_file():
+                continue
+            if data_file.suffix.lower() not in data_extensions:
+                continue
+            if data_file.name.startswith("."):
+                continue
+
+            rel_path = data_file.relative_to(dataset_dir)
+            # Use deepest directory name as subtask, or "_root" if file is in top-level
+            subtask_name = rel_path.parent.name if len(rel_path.parts) > 1 else "_root"
+
+            if subtask_name not in subtasks:
+                subtasks[subtask_name] = {"files": [], "file_paths": []}
+            subtasks[subtask_name]["files"].append(str(rel_path))
+            subtasks[subtask_name]["file_paths"].append(data_file)
+
+        return subtasks
+
+    def analyze_dataset(self, dataset_dir: Path) -> dict:
+        """Analyze a dataset directory and generate dataset_info.json entry.
+
+        This method:
+        1. Reads README from the dataset directory
+        2. Generates file tree for LLM understanding
+        3. Discovers tasks by directory structure
+        4. Computes statistics for each task (sample count, token stats)
+        5. Extracts sample data for each task
+
+        All datasets have a unified "tasks" structure. For datasets with files
+        directly in the root directory, "_root" is used as the task name.
+
+        Args:
+            dataset_dir: Root directory of the dataset
+
+        Returns:
+            Dictionary containing dataset info ready for dataset_info.json
+        """
+        # 1. Read README
+        readme = self._read_dataset_readme(dataset_dir)
+
+        # 2. Generate file tree (for LLM to understand directory structure)
+        file_tree = self._generate_file_tree(dataset_dir)
+
+        # 3. Discover tasks
+        tasks = self._discover_subtasks(dataset_dir)
+
+        if not tasks:
+            logger.warning(f"No data files found in {dataset_dir}")
+            return {
+                "readme": readme,
+                "file_tree": file_tree,
+                "total_samples": 0,
+                "total_size_mb": 0,
+                "tasks": {},
+            }
+
+        # 4. Compute stats for each task
+        total_samples = 0
+        total_size = 0
+        for name, info in tasks.items():
+            file_paths = info["file_paths"]
+            data = _load_dataset_for_stats(file_paths)
+            info["sample_count"] = len(data)
+            info["column_stats"] = _compute_column_stats(data)
+            info["samples"] = _truncate_long_values(self._extract_samples_for_template(file_paths, max_samples=3))
+            total_samples += info["sample_count"]
+            total_size += sum(f.stat().st_size for f in file_paths)
+            # Remove file_paths as it's not JSON serializable and not needed in output
+            del info["file_paths"]
+
+        # 5. Return unified structure (all datasets have tasks)
+        return {
+            "readme": readme,
+            "file_tree": file_tree,
+            "total_samples": total_samples,
+            "total_size_mb": round(total_size / 1024 / 1024, 2),
+            "tasks": tasks,
+        }
+
+
+def _read_single_dataset_readme(dataset_path: Path, max_chars: int = 2000) -> str:
+    """Read README file from a single dataset directory or its parent directories.
+
+    Args:
+        dataset_path: Path to the dataset directory
+        max_chars: Maximum characters to read (default: 2000)
+
+    Returns:
+        README content as string, or empty string if not found
+    """
+    target_names = ("README.md", "readme.md", "README.txt", "README")  # tuple keeps the lookup order deterministic
+
+    try:
+        # Check current directory first
+        for readme_name in target_names:
+            readme_file = dataset_path / readme_name
+            if readme_file.exists() and readme_file.is_file():
+                try:
+                    content = readme_file.read_text(encoding="utf-8")[:max_chars]
+                    logger.info(f"Loaded README from {readme_file} ({len(content)} chars)")
+                    return content
+                except Exception as e:
+                    logger.warning(f"Failed to read {readme_file}: {e}")
+
+        # If not found in current directory, check parent directory
+        parent_path = dataset_path.parent
+        if parent_path != dataset_path:  # Avoid infinite loop at filesystem root
+            for readme_name in target_names:
+                readme_file = parent_path / readme_name
+                if readme_file.exists() and readme_file.is_file():
+                    try:
+                        content = readme_file.read_text(encoding="utf-8")[:max_chars]
+                        logger.info(f"Loaded README from parent directory {readme_file} ({len(content)} chars)")
+                        return content
+                    except Exception as e:
+                        logger.warning(f"Failed to read {readme_file}: {e}")
+
+        # If still not found, check one level down in subdirectories
+        if dataset_path.exists():
+            for item in dataset_path.iterdir():
+                if item.is_dir():
+                    for readme_name in target_names:
+                        readme_file = item / readme_name
+                        if readme_file.exists() and readme_file.is_file():
+                            try:
+                                content = readme_file.read_text(encoding="utf-8")[:max_chars]
+                                logger.info(f"Loaded README from subdirectory {readme_file} ({len(content)} chars)")
+                                return content
+                            except Exception as e:
+                                logger.warning(f"Failed to read {readme_file}: {e}")
+    except Exception as e:
+        logger.warning(f"Error searching for README in {dataset_path}: {e}")
+
+    return ""
+
+
+def check_all_dataset_in_info(ft_file_path, existing_config, max_depth: int = 3):
+    """Scan datasets directory and return top-level dataset names not yet in existing_config.
+
+    Only scans first-level directories under datasets/. Each top-level directory is treated
+    as a single dataset, regardless of its internal structure.
+
+    Examples:
+        - datasets/chemcot/ → dataset: "chemcot"
+        - datasets/panorama/ → dataset: "panorama"
+        - datasets/deepscaler/ → dataset: "deepscaler"
+
+    Args:
+        ft_file_path: Path to finetune directory structure
+        existing_config: Existing dataset_info.json configuration
+        max_depth: Unused, kept for API compatibility
+
+    Returns:
+        list: Dataset names (top-level directory names) not yet in existing_config
+    """
+    root_path = Path(ft_file_path) / "datasets"
+    dataset_list = []
+
+    try:
+        for item in root_path.iterdir():
+            if item.is_dir() and not item.name.startswith("."):
+                dataset_list.append(item.name)
+    except Exception as e:
+        logger.warning(f"Error scanning datasets directory: {e}")
+
+    remain_dataset_list = [dataset_name for dataset_name in dataset_list if dataset_name not in existing_config]
+    return remain_dataset_list
+
+
+def generate_dataset_info_config(target_dataset_list: list, ft_file_path: str, existing_config: dict) -> dict:
+    """Generate dataset_info.json configuration with auto-discovered subtasks.
+
+    This function analyzes datasets not yet in existing_config and generates
+    structured information including:
+    - README content
+    - File tree structure
+    - Auto-discovered subtasks with statistics
+    - Column token statistics for each subtask
+    - Sample data for LLM understanding
+
+    The dataset_info.json acts as a cache - existing datasets are skipped.
+
+    Args:
+        target_dataset_list: List of specific datasets to process (empty for all)
+        ft_file_path: Path to finetune directory structure
+        existing_config: Existing dataset_info.json configuration (used as cache)
+
+    Returns:
+        dict: New configuration entries for dataset_info.json
+    """
+    # Find datasets not yet in existing_config
+    remain_dataset_list = check_all_dataset_in_info(ft_file_path, existing_config)
+    if not remain_dataset_list:
+        return {}
+
+    datasets_root = Path(ft_file_path) / "datasets"
+    descriptor = FinetuneDatasetDescriptor()
+    new_config = {}
+
+    # Determine which datasets to process
+    datasets_to_process = (
+        remain_dataset_list if not target_dataset_list else [d for d in target_dataset_list if d in remain_dataset_list]
+    )
+
+    for dataset_name in datasets_to_process:
+        dataset_dir = datasets_root / dataset_name
+        if dataset_dir.exists() and dataset_dir.is_dir():
+            logger.info(f"Analyzing dataset '{dataset_name}'...")
+            new_config[dataset_name] = descriptor.analyze_dataset(dataset_dir)
+            logger.info(
+                f"Analyzed dataset '{dataset_name}': "
+                f"{new_config[dataset_name].get('total_samples', 0)} samples, "
+                f"{new_config[dataset_name].get('total_size_mb', 0)} MB"
+            )
+
+    return new_config
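Review note: the cache behaviour of `generate_dataset_info_config` (skip anything already in `existing_config`, then narrow to the requested targets) can be sketched in isolation. This is a minimal illustration, not the code in this patch: `pending_datasets` is a hypothetical helper, and the `sorted` call is an added assumption to make the order reproducible.

```python
from pathlib import Path
import tempfile


def pending_datasets(datasets_root: Path, existing_config: dict, targets: list[str]) -> list[str]:
    """Return dataset directory names that still need analysis (hypothetical helper)."""
    # Each top-level, non-hidden directory is treated as one dataset.
    discovered = sorted(
        p.name for p in datasets_root.iterdir() if p.is_dir() and not p.name.startswith(".")
    )
    # Entries already present in the config act as a cache and are skipped.
    remaining = [name for name in discovered if name not in existing_config]
    # An empty target list means "process everything that remains".
    if not targets:
        return remaining
    return [name for name in targets if name in remaining]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("chemcot", "panorama", ".cache"):
        (root / name).mkdir()
    cached = {"chemcot": {"total_samples": 10}}
    all_pending = pending_datasets(root, cached, [])
    targeted = pending_datasets(root, cached, ["chemcot", "panorama"])
```

Note that a dataset named in the target list is still skipped when it already has a cached entry, so refreshing one dataset would require removing its entry from the config first.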
diff --git a/rdagent/scenarios/finetune/share.yaml b/rdagent/scenarios/finetune/share.yaml
new file mode 100644
index 000000000..6fbfefe05
--- /dev/null
+++ b/rdagent/scenarios/finetune/share.yaml
@@ -0,0 +1,5 @@
+scen:  # customizable
+  role: |-
+    You are an expert in Large Language Model fine-tuning with deep knowledge of training techniques, hyperparameter optimization, and model evaluation.
+  assets_path: "./assets/"
+
diff --git a/rdagent/scenarios/finetune/train/eval.py b/rdagent/scenarios/finetune/train/eval.py
new file mode 100644
index 000000000..96a0d3fac
--- /dev/null
+++ b/rdagent/scenarios/finetune/train/eval.py
@@ -0,0 +1,322 @@
+import json
+from typing import Any, Dict, List, Optional
+
+from rdagent.components.coder.CoSTEER.evaluators import (
+    CoSTEEREvaluator,
+    CoSTEERSingleFeedback,
+)
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.finetune.conf import (
+    FT_DATA_FILE_NAME,
+    FT_DATA_SCRIPT_NAME,
+    FT_YAML_FILE_NAME,
+    clear_workspace,
+    get_data_processing_cache_key,
+    get_data_processing_env,
+    get_ft_env,
+    get_workspace_prefix,
+    inject_data_stats,
+)
+from rdagent.components.coder.finetune.exp import FTTask
+from rdagent.components.coder.finetune.unified_validator import LLMConfigValidator
+from rdagent.core.evolving_framework import QueriedKnowledge
+from rdagent.core.experiment import FBWorkspace
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.finetune.benchmark import get_benchmark_ranges, run_benchmark
+from rdagent.utils.agent.tpl import T
+from rdagent.utils.agent.workflow import build_cls_from_json_with_retry
+
+
+def extract_loss_history(output_path) -> Dict[str, List[Dict[str, Any]]]:
+    """
+    Extract training and evaluation loss history from LlamaFactory's trainer_state.json.
+
+    Args:
+        output_path: Path to the training output directory
+
+    Returns:
+        Dict with 'train' and 'eval' keys, each containing a list of loss entries.
+    """
+    trainer_state_path = output_path / "trainer_state.json"
+    result = {"train": [], "eval": []}
+
+    if not trainer_state_path.exists():
+        logger.warning(f"trainer_state.json not found at {trainer_state_path}")
+        return result
+
+    try:
+        with open(trainer_state_path) as f:
+            trainer_state = json.load(f)
+
+        log_history = trainer_state.get("log_history", [])
+        for entry in log_history:
+            if "loss" in entry:
+                result["train"].append({
+                    "step": entry.get("step"),
+                    "epoch": entry.get("epoch"),
+                    "loss": entry.get("loss"),
+                })
+            if "eval_loss" in entry:
+                result["eval"].append({
+                    "step": entry.get("step"),
+                    "epoch": entry.get("epoch"),
+                    "eval_loss": entry.get("eval_loss"),
+                })
+
+        logger.info(f"Extracted {len(result['train'])} train + {len(result['eval'])} eval entries")
+
+    except (json.JSONDecodeError, OSError) as e:
+        logger.warning(f"Failed to parse trainer_state.json: {e}")
+
+    return result
+
+
+class FTRunnerEvaluator(CoSTEEREvaluator):
+    """LLM Fine-tuning specific evaluator that uses LLM Docker environment."""
+
+    def evaluate(
+        self,
+        target_task: FTTask,
+        implementation: FBWorkspace,
+        gt_implementation: FBWorkspace,
+        queried_knowledge: Optional[QueriedKnowledge] = None,
+        **kwargs,
+    ) -> CoSTEERSingleFeedback:
+        """Evaluate LLM fine-tuning implementation using dedicated LLM environment.
+
+        This evaluator performs three stages:
+        0. Clean workspace (remove old training outputs)
+        1. Full data processing (without --debug flag) to generate complete data.json
+        2. Full training with the complete dataset, followed by benchmark runs on the validation and test ranges
+        """
+
+        # Check if FT_YAML_FILE_NAME exists
+        if FT_YAML_FILE_NAME not in implementation.file_dict:
+            fb = CoSTEERSingleFeedback(
+                execution=f"No {FT_YAML_FILE_NAME} found in workspace",
+                return_checking="Config file missing",
+                code="No valid configuration file",
+                final_decision=False,
+            )
+            implementation.feedback = fb
+            logger.log_object(fb, tag="evaluator_feedback.FTRunnerEvaluator")
+            return fb
+
+        # Use LLM-specific environment with appropriate timeout for training
+        env = get_ft_env(operation="full_training")
+
+        # ========== Stage 0: Clean Workspace ==========
+        # Clean old training outputs before data processing and training
+        clear_workspace(implementation, env)
+
+        # ========== Stage 1: Full Data Processing ==========
+        # Execute data processing WITHOUT --debug flag to generate complete data.json
+        data_result = self._run_full_data_processing(implementation)
+        data_stdout = data_result.stdout or ""
+
+        if data_result.exit_code != 0:
+            # Data processing failed, return feedback to enter next loop
+            logger.error(f"Full data processing failed with exit_code={data_result.exit_code}")
+            return self._generate_llm_feedback(
+                target_task=target_task,
+                implementation=implementation,
+                raw_stdout=data_stdout,
+                exit_code=data_result.exit_code,
+                model_files_exist=False,
+                benchmark_result=None,
+                loss_history=None,
+                failed_stage="data_processing",
+            )
+
+        logger.info("Full data processing completed successfully")
+
+        # Update data_stats.json with full dataset statistics
+        # This ensures feedback sees the correct sample count, not debug mode count
+        data_json_path = implementation.workspace_path / FT_DATA_FILE_NAME
+        if data_json_path.exists():
+            with open(data_json_path, "r", encoding="utf-8") as f:
+                data = json.load(f)
+            if isinstance(data, list) and len(data) > 0:
+                inject_data_stats(implementation, data, data_stdout)
+
+        # ========== Stage 2: Full Training ==========
+
+        # Execute LlamaFactory training
+        train_result = implementation.run(
+            env=env,
+            entry=f"llamafactory-cli train {FT_YAML_FILE_NAME}",
+        )
+        # Combine data processing and training stdout for comprehensive feedback
+        combined_stdout = (
+            f"=== DATA PROCESSING OUTPUT ===\n{data_stdout}\n\n=== TRAINING OUTPUT ===\n{train_result.stdout or ''}"
+        )
+        implementation.running_info.running_time = train_result.running_time
+        # NOTE: Docker execution is logged by FTWorkspace.run() automatically
+
+        # Simple success check: exit code
+        training_success = train_result.exit_code == 0
+
+        # Check for model output files
+        workspace_path = implementation.workspace_path
+        output_path = workspace_path / "output"
+        model_output_files = (
+            list(output_path.glob("*.safetensors"))
+            + list(output_path.glob("*.bin"))
+            + list(output_path.glob("adapter_*"))
+            if output_path.exists()
+            else []
+        )
+
+        # Early return if training failed
+        if not training_success or len(model_output_files) == 0:
+            return self._generate_llm_feedback(
+                target_task=target_task,
+                implementation=implementation,
+                raw_stdout=combined_stdout,
+                exit_code=train_result.exit_code,
+                model_files_exist=len(model_output_files) > 0,
+                benchmark_result=None,
+                loss_history=None,
+                failed_stage="training",
+            )
+
+        # Extract loss history from training output
+        loss_history = extract_loss_history(output_path)
+
+        val_range, test_range = get_benchmark_ranges()
+
+        # Validation set - used for SOTA judgment, visible to agent
+        validation_result = run_benchmark(
+            workspace_path=str(workspace_path),
+            model_path=output_path,
+            model_name=target_task.base_model,
+            benchmark_name=target_task.benchmark,
+            gpu_count=self.scen.gpu_count,
+            test_range=val_range,
+            result_subdir="validation",
+        )
+
+        # Test set - only for frontend display, not visible to agent
+        test_result = run_benchmark(
+            workspace_path=str(workspace_path),
+            model_path=output_path,
+            model_name=target_task.base_model,
+            benchmark_name=target_task.benchmark,
+            gpu_count=self.scen.gpu_count,
+            test_range=test_range,
+            result_subdir="test",
+        )
+
+        # Build comprehensive result with training metrics and benchmark results
+        # Note: "benchmark" is for agent (SOTA judgment), "benchmark_test" is for frontend only
+        train_history = loss_history.get("train", []) if loss_history else []
+        implementation.running_info.result = {
+            "benchmark": validation_result,  # Agent visible - used for SOTA judgment
+            "benchmark_test": test_result,  # Agent invisible - frontend display only
+            "training_metrics": {
+                "loss_history": loss_history,
+                "final_loss": train_history[-1]["loss"] if train_history else None,
+                "initial_loss": train_history[0]["loss"] if train_history else None,
+            },
+        }
+        benchmark_result = validation_result  # For backward compatibility with feedback
+
+        # Call LLM for feedback analysis - LLM will determine final_decision
+        return self._generate_llm_feedback(
+            target_task=target_task,
+            implementation=implementation,
+            raw_stdout=combined_stdout,
+            exit_code=train_result.exit_code,
+            model_files_exist=len(model_output_files) > 0,
+            benchmark_result=benchmark_result,
+            loss_history=loss_history,
+        )
+
+    def _generate_llm_feedback(
+        self,
+        target_task: FTTask,
+        implementation: FBWorkspace,
+        raw_stdout: str,
+        exit_code: int,
+        model_files_exist: bool,
+        benchmark_result: Optional[Dict] = None,
+        loss_history: Optional[Dict[str, List[Dict]]] = None,
+        failed_stage: Optional[str] = None,
+    ) -> CoSTEERSingleFeedback:
+        """Generate LLM-based feedback for runner evaluation.
+
+        LLM will determine final_decision based on all provided information.
+
+        Args:
+            failed_stage: Which stage failed - "data_processing" or "training"
+        """
+        # Parse execution log to extract structured info (reuse unified_validator's method)
+        # Reduces ~36k tokens to ~500 tokens by extracting: status, errors, metrics, warnings
+        parsed_stdout = LLMConfigValidator()._parse_execution_log(raw_stdout, exit_code, failed_stage)
+
+        # Get timeout config for the failed stage
+        timeout_seconds = None
+        if failed_stage == "data_processing":
+            timeout_seconds = FT_RD_SETTING.data_processing_timeout
+        elif failed_stage == "training":
+            timeout_seconds = FT_RD_SETTING.full_timeout
+
+        # Pass loss_history to the prompt. If there are more than 60 train entries, keep only the
+        # first 30 and last 30 to limit tokens; build a new dict so the caller's history stays intact.
+        if loss_history and len(loss_history.get("train", [])) > 60:
+            loss_history = {**loss_history, "train": loss_history["train"][:30] + loss_history["train"][-30:]}
+
+        system_prompt = T("rdagent.components.coder.finetune.prompts:runner_eval.system").r()
+        user_prompt = T("rdagent.components.coder.finetune.prompts:runner_eval.user").r(
+            task_desc=target_task.get_task_information(),
+            config_yaml=implementation.file_dict.get(FT_YAML_FILE_NAME, ""),
+            exit_code=exit_code,
+            model_files_status="Found" if model_files_exist else "Not found",
+            stdout=parsed_stdout,  # Structured JSON instead of raw truncated log
+            benchmark_result=(
+                json.dumps(benchmark_result, indent=2) if benchmark_result else "N/A (not executed or failed)"
+            ),
+            loss_history=json.dumps(loss_history, indent=2) if (loss_history and (loss_history.get("train") or loss_history.get("eval"))) else "N/A",
+            failed_stage=failed_stage,
+            timeout_seconds=timeout_seconds,
+        )
+
+        feedback = build_cls_from_json_with_retry(
+            CoSTEERSingleFeedback,
+            system_prompt=system_prompt,
+            user_prompt=user_prompt,
+            init_kwargs_update_func=CoSTEERSingleFeedback.val_and_update_init_dict,
+        )
+        feedback.raw_execution = raw_stdout
+        implementation.feedback = feedback
+        logger.log_object(feedback, tag="evaluator_feedback.FTRunnerEvaluator")
+        return feedback
+
+    def _run_full_data_processing(self, implementation: FBWorkspace):
+        """Execute full data processing (without --debug flag) to generate complete data.json.
+
+        This is called at the beginning of the running stage to regenerate data.json
+        with all samples instead of the debug subset created during coding stage.
+
+        Args:
+            implementation: The workspace containing process_data.py
+
+        Returns:
+            EnvResult with exit_code, stdout, etc.
+        """
+        # Get data processing environment with LLM API access
+        env, env_vars = get_data_processing_env()
+        ws_prefix = get_workspace_prefix(env)
+
+        logger.info("Starting full data processing (without --debug flag)")
+
+        # Execute WITHOUT --debug flag to generate all samples
+        result = implementation.run(
+            env=env,
+            entry=f"python {ws_prefix}/{FT_DATA_SCRIPT_NAME}",  # No --debug flag
+            env_vars=env_vars,
+            cache_key_extra_func=get_data_processing_cache_key,
+            cache_files_to_extract=[FT_DATA_FILE_NAME],
+        )
+
+        return result
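Review note: the head/tail sampling applied to the train loss entries before they are rendered into the feedback prompt can be sketched on its own. This is a minimal sketch under assumptions: `sample_loss_history` and its `head`/`tail` parameters are hypothetical names, and it returns a new dict instead of modifying its argument, so a history stored elsewhere (for example in the run result) keeps every entry.

```python
from typing import Any, Dict, List, Optional

LossHistory = Dict[str, List[Dict[str, Any]]]


def sample_loss_history(
    loss_history: Optional[LossHistory], head: int = 30, tail: int = 30
) -> Optional[LossHistory]:
    """Keep the first `head` and last `tail` train entries (hypothetical helper)."""
    if not loss_history:
        return loss_history
    train = loss_history.get("train", [])
    if len(train) <= head + tail:
        # Short histories are passed through unchanged.
        return loss_history
    # Build a new dict; the input mapping and its lists are left untouched.
    return {**loss_history, "train": train[:head] + train[len(train) - tail:]}


full = {"train": [{"step": i, "loss": 1.0 / (i + 1)} for i in range(100)], "eval": []}
sampled = sample_loss_history(full)
```

Keeping both ends of the curve preserves the initial and final loss, which is usually what a reviewer or an LLM needs to judge convergence.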
diff --git a/rdagent/scenarios/finetune/train/runner.py b/rdagent/scenarios/finetune/train/runner.py
new file mode 100644
index 000000000..7a29f1dfa
--- /dev/null
+++ b/rdagent/scenarios/finetune/train/runner.py
@@ -0,0 +1,129 @@
+"""
+LLM Fine-tuning Runner Implementation
+
+This module provides a specialized runner for LLM fine-tuning that executes
+LLaMA-Factory configuration files generated by the coder.
+"""
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.components.coder.CoSTEER import CoSTEER
+from rdagent.components.coder.CoSTEER.evaluators import (
+    CoSTEERMultiEvaluator,
+    CoSTEERSingleFeedback,
+)
+from rdagent.components.coder.CoSTEER.evolving_strategy import (
+    MultiProcessEvolvingStrategy,
+)
+from rdagent.components.coder.CoSTEER.knowledge_management import (
+    CoSTEERQueriedKnowledge,
+)
+from rdagent.components.coder.finetune.conf import (
+    FT_YAML_FILE_NAME,
+    FTCoderCoSTEERSettings,
+)
+from rdagent.components.coder.finetune.eval import FTDataEvaluator
+from rdagent.core.experiment import FBWorkspace, Task
+from rdagent.core.scenario import Scenario
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.finetune.train.eval import FTRunnerEvaluator
+
+
+class FTRunnerSettings(FTCoderCoSTEERSettings):
+    """LLM Fine-tuning specific runner settings."""
+
+    class Config:
+        env_prefix = "LLM_FT_Runner_"
+
+
+class FTRunnerEvolvingStrategy(MultiProcessEvolvingStrategy):
+    """Evolving strategy for LLM fine-tuning runner.
+
+    Runner directly executes the yaml from coder without modification.
+    The coder generates full training config, and its validator tests with micro-batch.
+    """
+
+    def implement_one_task(
+        self,
+        target_task: Task,
+        queried_knowledge: CoSTEERQueriedKnowledge | None = None,
+        workspace: FBWorkspace | None = None,
+        prev_task_feedback: CoSTEERSingleFeedback | None = None,
+    ) -> dict[str, str]:
+        """No modification needed - directly use coder's full training config."""
+        # TODO: detect error during training automatically, and fix it here
+        if not workspace or FT_YAML_FILE_NAME not in workspace.file_dict:
+            logger.error(f"No {FT_YAML_FILE_NAME} found in workspace")
+            return {}
+
+        # Coder already generated full training config, no modification needed
+        # Return empty dict to indicate no changes
+        return {}
+
+
+class LLMFinetuneRunner(CoSTEER):
+    """LLM Fine-tuning specific runner that executes LLaMA-Factory configurations."""
+
+    def __init__(
+        self,
+        scen: Scenario,
+        *args,
+        **kwargs,
+    ) -> None:
+        eval_l = [
+            FTRunnerEvaluator(scen=scen),  # Training validation
+        ]
+
+        eva = CoSTEERMultiEvaluator(single_evaluator=eval_l, scen=scen)
+        settings = FTRunnerSettings()
+
+        # Use runner-specific evolving strategy for full dataset training
+        es = FTRunnerEvolvingStrategy(scen=scen, settings=settings, improve_mode=True)
+
+        # Initialize with LLM-specific configuration
+        super().__init__(
+            *args,
+            settings=settings,
+            eva=eva,
+            es=es,
+            evolving_version=2,
+            scen=scen,
+            max_loop=getattr(FT_RD_SETTING, "runner_max_loop", 1),  # Default to 1 loop for running
+            stop_eval_chain_on_fail=True,  # Fine-tuning involves partial implementations.
+            **kwargs,
+        )
+
+    def develop(self, exp):
+        """Execute LLaMA-Factory fine-tuning on full dataset.
+
+        Runner directly executes the full training config generated by coder.
+        The actual training execution and basic validation are handled by FTRunnerEvaluator,
+        which also runs the validation and test benchmarks once training succeeds.
+        """
+        logger.info("Starting full dataset LLM fine-tuning with LLaMA-Factory")
+
+        # Run the standard CoSTEER develop process:
+        # 1. Execute training using coder's full training config (no modification)
+        # 2. Validate execution using FTRunnerEvaluator
+        exp = super().develop(exp)
+        return exp
+
+    def get_develop_max_seconds(self) -> int | None:
+        """Get maximum seconds for development using FT settings."""
+        return int(self.scen.real_full_timeout() * self.settings.max_seconds_multiplier)
+
+    def compare_and_pick_fb(self, base_fb, new_fb) -> bool:
+        """Compare feedback for LLM fine-tuning results."""
+        if base_fb is None:
+            return True
+
+        base_fb = base_fb[0]
+        new_fb = new_fb[0]
+
+        def compare_scores(s1, s2) -> bool:
+            if s2 is None:
+                return False
+            if s1 is None:
+                return True
+            return (s2 > s1) == self.scen.metric_direction
+
+        return compare_scores(getattr(base_fb, "score", None), getattr(new_fb, "score", None))
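Review note: the comparison in `compare_scores` can be checked in isolation. The sketch below assumes `metric_direction` is a boolean where `True` means higher is better; the function names are hypothetical. It shows one edge case worth being aware of: with the equality-based form, a tie counts as an improvement when lower is better, while a strict variant rejects ties in both directions.

```python
from typing import Optional


def is_better(base: Optional[float], new: Optional[float], higher_is_better: bool) -> bool:
    """Mirror of the equality-based comparison (assumed semantics)."""
    if new is None:
        return False
    if base is None:
        return True
    return (new > base) == higher_is_better


def is_strictly_better(base: Optional[float], new: Optional[float], higher_is_better: bool) -> bool:
    """Alternative that never treats a tie as an improvement."""
    if new is None:
        return False
    if base is None:
        return True
    return new > base if higher_is_better else new < base


tie_lower_is_better = is_better(0.5, 0.5, higher_is_better=False)
tie_strict = is_strictly_better(0.5, 0.5, higher_is_better=False)
```

Whether a tie should replace the current best is a design choice; the strict variant is shown only as an option, not as the intended behaviour of this patch.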
diff --git a/rdagent/scenarios/finetune/utils.py b/rdagent/scenarios/finetune/utils.py
new file mode 100644
index 000000000..f35eb15da
--- /dev/null
+++ b/rdagent/scenarios/finetune/utils.py
@@ -0,0 +1,48 @@
+from pathlib import Path
+
+from rdagent.app.finetune.llm.conf import FT_RD_SETTING
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.finetune.datasets import prepare as prepare_dataset
+from rdagent.scenarios.finetune.download.hf import download_model
+
+
+def ensure_ft_assets_exist(
+    *, model: str | None = None, dataset: str | None = None, check_model: bool = False, check_dataset: bool = False
+) -> None:
+    """Ensure dataset and model assets exist under FT_FILE_PATH structure.
+
+    Args:
+        model: Model name to check/download. Required if check_model=True.
+        dataset: Dataset name (registered in DATASETS) to check/download. Required if check_dataset=True.
+        check_model: Whether to ensure model exists.
+        check_dataset: Whether to ensure dataset exists.
+
+    Paths:
+        - Dataset path: FT_RD_SETTING.file_path/datasets/<dataset>
+        - Model path:   FT_RD_SETTING.file_path/models/<model>
+    """
+    # Ensure dataset exists if requested
+    if check_dataset:
+        if dataset is None:
+            raise ValueError("Dataset name is required when check_dataset=True")
+
+        dataset_dir = Path(FT_RD_SETTING.file_path) / "datasets" / dataset
+        if not dataset_dir.exists():
+            try:
+                logger.info(f"Preparing dataset '{dataset}' to {dataset_dir}")
+                prepare_dataset(dataset)
+            except Exception as e:
+                raise Exception(f"Failed to prepare dataset '{dataset}' to {dataset_dir}: {e}") from e
+
+    # Ensure model exists if requested
+    if check_model:
+        if model is None:
+            raise ValueError("Model name is required when check_model=True")
+
+        model_dir = Path(FT_RD_SETTING.file_path) / "models" / model
+        if not model_dir.exists():
+            try:
+                logger.info(f"Downloading model '{model}' to {model_dir}")
+                download_model(model, out_dir_root=str(Path(FT_RD_SETTING.file_path) / "models"))
+            except Exception as e:
+                raise Exception(f"Failed to download model '{model}' to {model_dir}: {e}") from e
diff --git a/rdagent/scenarios/rl/autorl_bench/.gitignore b/rdagent/scenarios/rl/autorl_bench/.gitignore
new file mode 100644
index 000000000..cdc739fc5
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/.gitignore
@@ -0,0 +1,12 @@
+# Generated at runtime
+workspace/
+results.csv
+log/
+doc/
+
+# Python
+__pycache__/
+*.pyc
+
+# Jupyter
+.ipynb_checkpoints/
diff --git a/rdagent/scenarios/rl/autorl_bench/README.md b/rdagent/scenarios/rl/autorl_bench/README.md
new file mode 100644
index 000000000..d725c5394
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/README.md
@@ -0,0 +1,332 @@
+# AutoRL-Bench
+
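+AutoRL-Bench lets a large model (e.g. GPT-5.2) autonomously drive the RL training pipeline to improve a small model (e.g. Qwen2.5-7B) on various benchmarks, and measures the gain from "large-model-driven RL".
+
+> Core question: given a benchmark and its baseline, can the small model's score exceed the baseline after the large model has RL-trained it through a workflow?
+
+| Role | Example | Responsibility |
+|------|------|------|
+| **Benchmark** | GSM8K, ALFWorld, etc. | Provides the task environment and automatic scoring |
+| **Small model** | Qwen2.5-0.5B/7B | The agent being trained with RL |
+| **Large model** | GPT-5.2, etc. | Drives RL optimization offline (generating rewards, tuning hyperparameters, etc.) |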
+
+---
+
+## Quick Start
+
+### 1. Install the environment
+
+```bash
+# --- 1a. Clone the code ---
+git clone git@github.com:microsoft/RD-Agent.git ~/RD-Agent
+cd ~/RD-Agent
+
+# --- 1b. Base conda environment ---
+conda create -n cwy-rl python=3.10 -y
+conda activate cwy-rl
+pip install -e .
+
+# Global dependencies (trl, vllm, torch, opencompass, etc.)
+pip install -r rdagent/scenarios/rl/autorl_bench/requirements.txt
+
+# --- 1c. Install extra benchmark dependencies as needed ---
+# ALFWorld (alfworld, textworld, openai)
+pip install -r rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/requirements.txt
+# GSM8K: no extra dependencies
+
+# --- 1d. OpenHands Agent (only if you use it) ---
+git clone git@github.com:couragec/openhands-rl.git ~/openhands-rl
+# OpenHands needs its own conda environment (Python 3.12)
+conda create -n openhands python=3.12 -y
+conda run -n openhands pip install -r ~/openhands-rl/requirements.txt
+```
+
+### 2. Configure `.env`
+
+```bash
+cp .env.example .env  # or create it manually
+```
+
+Key settings required in `.env`:
+
+```env
+# LLM API (required by the OpenHands Agent)
+OPENAI_API_KEY=your_api_key
+OPENAI_API_BASE=https://your-api-endpoint/v1
+CHAT_MODEL=gpt-5.2
+
+# OpenHands environment (optional, defaults apply)
+# CONDA_ENV_OPENHANDS=openhands      # default: openhands
+# OPENHANDS_RL_ROOT=$HOME/openhands-rl  # default: ~/openhands-rl
+```
+
+### 3. Run
+
+```bash
+cd /path/to/RD-Agent
+conda activate cwy-rl
+
+# Example Agent (simple GRPO training, to validate the pipeline)
+python -m rdagent.scenarios.rl.autorl_bench.run \
+    --agent example_agent --task gsm8k --model Qwen/Qwen2.5-0.5B --timeout 7200
+
+# OpenHands Agent + GSM8K
+python -m rdagent.scenarios.rl.autorl_bench.run \
+    --agent openhands --task gsm8k --model Qwen/Qwen2.5-0.5B --timeout 41600
+
+# OpenHands Agent + ALFWorld (the first run automatically downloads ~2GB of game data)
+python -m rdagent.scenarios.rl.autorl_bench.run \
+    --agent openhands --task alfworld --model Qwen/Qwen2.5-0.5B-Instruct --timeout 41600
+
+# Run in the background (recommended)
+nohup python -m rdagent.scenarios.rl.autorl_bench.run \
+    --agent openhands --task alfworld --model Qwen/Qwen2.5-0.5B-Instruct \
+    --timeout 41600 > /dev/null 2>&1 &
+```
+
+> **Automatic data download**: the first time a benchmark runs, `run.py` calls the corresponding `data.py` to download the training data; no manual step is needed.
+> - GSM8K: downloaded from HuggingFace (~5MB)
+> - ALFWorld: calls `alfworld-download` to fetch from GitHub Releases (~2GB, including json/pddl/tw-pddl/logic)
+
+### 4. View results
+
+```bash
+# Follow the run log in real time
+tail -f workspace/alfworld/20260228T100000_openhands/agent.log
+
+# View the score records
+cat workspace/alfworld/20260228T100000_openhands/scores.json
+
+# View the global experiment summary
+cat rdagent/scenarios/rl/autorl_bench/results.csv
+
+# Web UI (Streamlit dashboard)
+streamlit run rdagent/scenarios/rl/autorl_bench/core/ui.py --server.port 8511
+```
+
+### Command-line arguments
+
+| Argument | Description | Example |
+|------|------|------|
+| `--agent` | Agent type | `example_agent`, `rdagent`, `openhands` |
+| `--task` | Benchmark task name (matches a subdirectory of `benchmarks/`) | `gsm8k`, `alfworld` |
+| `--model` | HuggingFace model repo_id, downloaded automatically on first use | `Qwen/Qwen2.5-0.5B` |
+| `--timeout` | Maximum agent runtime (seconds) | `41600` (~11.6 h) |
+| `--port` | Grading Server port (default 5000) | `5000` |
+
+---
+
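+## Core flow
+
+```
+run.py starts
+ │
+ ├─ 1. Prepare resources: download the model (HuggingFace) and the training data (each benchmark's data.py)
+ ├─ 2. Build the workspace: create an isolated directory, symlink the model and data
+ ├─ 3. Mount files: description.md + instructions.md + benchmark-specific files
+ ├─ 4. Start the Grading Server (Flask evaluation service)
+ ├─ 5. Evaluate the baseline: score the original model once (cached)
+ ├─ 6. Run the Agent: the Agent trains inside the workspace and submits for evaluation multiple times
+ ├─ 7. Collect results: fetch all submission records from the Grading Server
+ └─ 8. Save results: append a row to results.csv with this run's best score
+```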
+
+### Resource storage
+
+Downloaded models and data are stored under `git_ignore_folder/rl_files/` (override with `AUTORL_FILE_PATH`):
+
+```
+git_ignore_folder/rl_files/
+├── models/Qwen/Qwen2.5-0.5B/    # model weights (snapshot_download)
+├── datasets/
+│   ├── gsm8k/train.jsonl         # training data (visible to the agent)
+│   └── alfworld/train → ...      # training game data (visible to the agent; evaluation data is kept elsewhere)
+└── baseline_workspace/           # baseline score cache
+    └── gsm8k_Qwen_Qwen2.5-0.5B.json
+```
+
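+### Workspace (isolated per run)
+
+Each run creates its own workspace directory (`workspace/<task>/<run_id>/`) and mounts resources through symlinks:
+
+```
+workspace/gsm8k/
+├── 20260211T143000_openhands/        # one independent experiment (the agent's full lifecycle within the time limit)
+│   ├── code/                         # agent code area (all code the agent writes)
+│   │   ├── train.py                  # training script
+│   │   └── ...                       # other scripts (analysis, processing, ...)
+│   ├── output/                       # model outputs ($OUTPUT_DIR)
+│   │   ├── v1/                       # first model version
+│   │   └── v2/                       # second model version (iterative improvement)
+│   ├── models/Qwen/Qwen2.5-0.5B →   # symlink → rl_files/models/... (read-only)
+│   ├── data →                        # symlink → rl_files/datasets/gsm8k/ (read-only)
+│   ├── description.md →              # symlink → benchmarks/gsm8k/description.md
+│   ├── instructions.md →             # symlink → core/instructions.md
+│   ├── scores.json                   # scores of every submission in this experiment
+│   └── grading_server.log            # Grading Server log
+└── 20260211T160000_rdagent/          # another independent experiment
+    └── ...
+```
+
+> **Evaluation principle**: each experiment (one `run.py` invocation) is an independent evaluation unit.
+> Within the `--timeout` limit the agent may train and submit as many times as it likes; the final result is the highest score **within that experiment**.
+> Experiments are fully isolated from one another, and there is no cross-experiment "global best".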
+
+### results.csv (experiment log)
+
+`autorl_bench/results.csv` is a plain log used to summarize experiments for the paper; it **does not take part in the evaluation logic**:
+
+```csv
+run_id,timestamp,task,agent,base_model,baseline,best_score,improvement,submissions,duration_s,success,workspace
+20260211T143000,2026-02-11 14:30:00,gsm8k,openhands,Qwen/Qwen2.5-0.5B,21.61,22.37,0.76,3,3600,True,workspace/gsm8k/...
+20260211T160000,2026-02-11 16:00:00,gsm8k,rdagent,Qwen/Qwen2.5-0.5B,21.61,23.12,1.51,7,3600,True,workspace/gsm8k/...
+```
+
+Each row records the result of one independent experiment, which makes it easy to compare agents under identical conditions.
+
+---
+
+## Agent environment variables
+
+Environment variables available when the agent starts (`start.sh`):
+
+| Variable | Description | Example |
+|------|------|------|
+| `TASK` | Task name | `gsm8k` |
+| `BASE_MODEL` | Model name | `Qwen/Qwen2.5-0.5B` |
+| `WORKSPACE` | Workspace root | `workspace/gsm8k/20260211T143000_openhands` |
+| `MODEL_PATH` | Model path (read-only) | `$WORKSPACE/models/Qwen/Qwen2.5-0.5B` |
+| `DATA_PATH` | Data path (read-only) | `$WORKSPACE/data` |
+| `OUTPUT_DIR` | Output directory | `$WORKSPACE/output` |
+| `GRADING_SERVER_URL` | Grading service address | `http://localhost:5000` |
+
+### Grading Server API
+
+| Endpoint | Method | Description |
+|------|------|------|
+| `/submit` | POST | `{"model_path": "..."}` → returns score + best + improvement |
+| `/set_baseline` | POST | `{"score": 21.91}` → sets the baseline |
+| `/health` | GET | Health check |
+
+`/submit` response:
+
+```json
+{
+  "submission_id": 3,
+  "score": 65.0,
+  "baseline_score": 45.0,
+  "improvement": 20.0,
+  "best": {"submission_id": 2, "score": 68.0},
+  "total_submissions": 3
+}
+```
+
+---
+
+## Code structure
+
+```
+autorl_bench/
+├── run.py                    # entry script
+├── conf.py                   # path configuration
+│
+├── core/                     # [core code]
+│   ├── evaluator.py          # BaseEvaluator base class
+│   ├── opencompass.py        # OpenCompassEvaluator (general-purpose evaluator)
+│   ├── server.py             # Grading Server (Flask)
+│   ├── utils.py              # utilities (download, symlinks, baseline)
+│   └── instructions.md       # general instructions for agents
+│
+├── benchmarks/               # [benchmark extensions]
+│   ├── __init__.py           # BENCHMARKS registry
+│   ├── gsm8k/
+│   │   ├── data.py           # data download (train split)
+│   │   └── description.md
+│   └── alfworld/
+│       ├── data.py           # data download (training game data)
+│       ├── eval.py           # custom evaluator
+│       ├── requirements.txt  # extra dependencies (alfworld, textworld)
+│       ├── description.md
+│       └── react_prompts.json
+│
+├── agents/                   # [agent extensions]
+│   ├── registry.py           # registry (reads config.yaml)
+│   ├── example_agent/        # simple GRPO training
+│   ├── openhands/            # OpenHands SDK
+│   └── rdagent/              # RD-Agent
+│
+└── workspace/                # [runtime] workspaces + results
+```
+
+---
+
+## Extension guide
+
+### Adding a new Benchmark
+
+Create a `benchmarks/new_task/` directory and complete three steps:
+
+**1. `data.py`: data download (expose only training data to the agent; the benchmark manages its own evaluation data)**
+
+```python
+# benchmarks/new_task/data.py
+from pathlib import Path
+from loguru import logger
+
+def download_train_data(target_dir: Path) -> None:
+    """Download training data into target_dir; the agent only sees what is placed here."""
+    # target_dir is symlinked to workspace/data
+    ...
+```
+
+**2. `description.md`: task description (visible to the agent)**
+
+**3. Register it in `benchmarks/__init__.py`**
+
+```python
+BENCHMARKS["new_task"] = BenchmarkConfig(
+    id="new_task",
+    evaluator_class="rdagent.scenarios.rl.autorl_bench.core.opencompass.OpenCompassEvaluator",
+    data_module="rdagent.scenarios.rl.autorl_bench.benchmarks.new_task.data",
+    description="Description of the new task",
+    eval_config={"dataset": "opencompass.configs.datasets.xxx"},
+)
+```
+
+If you need custom evaluation logic (instead of OpenCompass), also add an `eval.py`:
+
+```python
+# benchmarks/new_task/eval.py
+from rdagent.scenarios.rl.autorl_bench.core import BaseEvaluator
+
+class NewTaskEvaluator(BaseEvaluator):
+    def __init__(self, config):
+        self.config = config
+
+    def run_eval(self, model_path: str, workspace_path: str, **kwargs) -> dict:
+        return {"score": 85.0, "accuracy_summary": {...}}
+```
+
+### Adding a new Agent
+
+```yaml
+# agents/my_agent/config.yaml
+name: "My Agent"
+start: "start.sh"
+env_vars:
+  MY_PARAM: "value"
+```
+
+```bash
+#!/bin/bash
+# agents/my_agent/start.sh
+# Write training scripts under code/; write model outputs to output/
+python "$WORKSPACE/code/train.py" --model "$MODEL_PATH" --data "$DATA_PATH" --output "$OUTPUT_DIR/v1"
+curl -X POST "$GRADING_SERVER_URL/submit" \
+    -H "Content-Type: application/json" \
+    -d '{"model_path": "'"$OUTPUT_DIR"'/v1"}'
+```
+
+Agents are registered automatically through `config.yaml`; no code changes are needed.
+
+
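The `/submit` contract documented in the README above can also be exercised from Python rather than `curl`. The sketch below is a minimal offline illustration that assumes only the request and response fields shown in that README (`model_path`, `score`, `baseline_score`, `improvement`, `best`). The helper names `build_submit_payload` and `summarize_submission` are hypothetical and are not part of `autorl_bench`.

```python
import json


def build_submit_payload(output_dir: str, version: str) -> str:
    """Build the JSON body for POST /submit (hypothetical helper)."""
    return json.dumps({"model_path": f"{output_dir}/{version}"})


def summarize_submission(response: dict) -> str:
    """Turn a /submit response into a one-line summary.

    Assumes the field names shown in the README example response.
    """
    best = response["best"]
    is_best = best["submission_id"] == response["submission_id"]
    return (
        f"submission {response['submission_id']}: score={response['score']:.2f} "
        f"(baseline {response['baseline_score']:.2f}, "
        f"improvement {response['improvement']:+.2f}), "
        f"best so far: #{best['submission_id']} at {best['score']:.2f}"
        f"{' (this one)' if is_best else ''}"
    )


# The example response from the README, used here as offline test data.
example = {
    "submission_id": 3,
    "score": 65.0,
    "baseline_score": 45.0,
    "improvement": 20.0,
    "best": {"submission_id": 2, "score": 68.0},
    "total_submissions": 3,
}

print(build_submit_payload("/tmp/output", "v1"))
print(summarize_submission(example))
```

Building the body with `json.dumps` avoids the shell quoting needed in the `curl` variant when the output path contains spaces or quotes.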
diff --git a/rdagent/scenarios/rl/autorl_bench/__init__.py b/rdagent/scenarios/rl/autorl_bench/__init__.py
new file mode 100644
index 000000000..8b00830be
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/__init__.py
@@ -0,0 +1,5 @@
+"""
+AutoRL-Bench: Benchmark for evaluating RL Post-training Agents
+"""
+
+__version__ = "0.1.0"
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/__init__.py b/rdagent/scenarios/rl/autorl_bench/agents/__init__.py
new file mode 100644
index 000000000..621a98cfe
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/__init__.py
@@ -0,0 +1,3 @@
+from .registry import get_agent, list_agents
+
+__all__ = ["get_agent", "list_agents"]
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/example_agent/config.yaml b/rdagent/scenarios/rl/autorl_bench/agents/example_agent/config.yaml
new file mode 100644
index 000000000..cf9844b3f
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/example_agent/config.yaml
@@ -0,0 +1,6 @@
+name: "Example Agent"
+description: "GRPO training + evaluation"
+start: "start.sh"
+env_vars:
+  TRAIN_RATIO: "0.1"
+  NUM_EPOCHS: "1"
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/example_agent/start.sh b/rdagent/scenarios/rl/autorl_bench/agents/example_agent/start.sh
new file mode 100755
index 000000000..469f2b004
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/example_agent/start.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+echo "=== Example Agent ==="
+echo "Task: $TASK"
+echo "Model: $BASE_MODEL"
+python "$(dirname "$0")/train.py"
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/example_agent/train.py b/rdagent/scenarios/rl/autorl_bench/agents/example_agent/train.py
new file mode 100644
index 000000000..88049c5e7
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/example_agent/train.py
@@ -0,0 +1,145 @@
+"""
+GRPO Training Loop
+"""
+import json
+import os
+import re
+import time
+
+import requests
+from datasets import Dataset
+from transformers import AutoTokenizer
+from trl import GRPOConfig, GRPOTrainer
+
+
+def extract_answer(text):
+    match = re.search(r"####\s*([-+]?\d[\d,]*\.?\d*)", text)
+    if match:
+        try:
+            return float(match.group(1).replace(",", ""))
+        except ValueError:
+            pass
+    numbers = re.findall(r"[-+]?\d[\d,]*\.?\d*", text)
+    if numbers:
+        try:
+            return float(numbers[-1].replace(",", ""))
+        except ValueError:
+            pass
+    return None
+
+
+def load_data(file_path, ratio=1.0):
+    records = []
+    with open(file_path, "r") as f:
+        for line in f:
+            item = json.loads(line)
+            prompt = f"Solve this math problem step by step. Put your final answer after ####.\n\nQuestion: {item['question']}\n\nSolution:"
+            records.append({"prompt": prompt, "question": item["question"], "answer": item["answer"]})
+    if ratio < 1.0:
+        n = max(10, int(len(records) * ratio))
+        records = records[:n]
+    return records
+
+
+def gsm8k_reward_func(completions, answer, **kwargs):
+    rewards = []
+    for completion, gold_answer in zip(completions, answer):
+        pred = extract_answer(completion)
+        gold = extract_answer(gold_answer)
+        if pred is not None and gold is not None and abs(pred - gold) < 1e-6:
+            rewards.append(1.0)
+        else:
+            rewards.append(-1.0)
+    return rewards
+
+
+def submit_for_grading(grading_url: str, model_path: str) -> dict | None:
+    if not grading_url:
+        return None
+    try:
+        resp = requests.post(f"{grading_url}/submit", json={"model_path": model_path}, timeout=600)
+        if resp.status_code == 200:
+            return resp.json()
+    except requests.RequestException as e:
+        print(f"  Grading error: {e}")
+    return None
+
+
+def main():
+    MODEL_PATH = os.environ.get("MODEL_PATH")
+    DATA_PATH = os.environ.get("DATA_PATH")
+    OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/tmp/autorl_output")
+    GRADING_SERVER_URL = os.environ.get("GRADING_SERVER_URL", "")
+    TRAIN_RATIO = float(os.environ.get("TRAIN_RATIO", "0.05"))
+    NUM_EPOCHS = int(os.environ.get("NUM_EPOCHS", "3"))
+    
+    if not MODEL_PATH or not DATA_PATH:
+        raise ValueError("MODEL_PATH and DATA_PATH required")
+
+    print(f"Model: {MODEL_PATH}")
+    print(f"Data: {DATA_PATH}")
+    print(f"Output: {OUTPUT_DIR}")
+
+    train_file = f"{DATA_PATH}/train.jsonl"
+    train_data = load_data(train_file, TRAIN_RATIO)
+    print(f"Train samples: {len(train_data)}")
+    dataset = Dataset.from_list([{"prompt": d["prompt"], "answer": d["answer"]} for d in train_data])
+
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    os.makedirs(OUTPUT_DIR, exist_ok=True)
+    start_time = time.time()
+
+    # The first epoch starts from the base model; later epochs start from the previous checkpoint
+    current_model_path = MODEL_PATH
+
+    for epoch in range(NUM_EPOCHS):
+        print(f"\n=== Epoch {epoch + 1}/{NUM_EPOCHS} ===")
+
+        config = GRPOConfig(
+            output_dir=OUTPUT_DIR,
+            num_train_epochs=1,
+            per_device_train_batch_size=4,       # small batch to avoid OOM
+            gradient_accumulation_steps=16,      # accumulation keeps the effective per-device batch at 64
+            learning_rate=1e-5,
+            max_completion_length=256,
+            num_generations=4,
+            logging_steps=5,
+            save_strategy="no",
+            report_to="none",
+            bf16=True,
+        )
+
+        # Pass the model path directly and let GRPOTrainer load the model itself,
+        # so the model is not loaded twice (causing OOM) in vLLM colocate mode
+        trainer = GRPOTrainer(
+            model=current_model_path,
+            reward_funcs=gsm8k_reward_func,
+            args=config,
+            train_dataset=dataset,
+            processing_class=tokenizer,
+        )
+
+        trainer.train()
+
+        checkpoint_dir = f"{OUTPUT_DIR}/checkpoint-epoch{epoch + 1}"
+        trainer.save_model(checkpoint_dir)
+        tokenizer.save_pretrained(checkpoint_dir)
+        
+        # The next epoch continues training from this checkpoint
+        current_model_path = checkpoint_dir
+
+        result = submit_for_grading(GRADING_SERVER_URL, checkpoint_dir)
+        if result:
+            print(f"  Score: {result.get('score')}")
+
+    trainer.save_model(OUTPUT_DIR)
+    tokenizer.save_pretrained(OUTPUT_DIR)
+    submit_for_grading(GRADING_SERVER_URL, OUTPUT_DIR)
+    print(f"\nDone! Total: {(time.time() - start_time) / 60:.1f} min")
+
+
+if __name__ == "__main__":
+    main()
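The reward logic in `train.py` is easiest to check on a few hand-written strings. The sketch below copies the two regular expressions from `extract_answer` and applies the same +1.0 / -1.0 rule. The sample completions and the name `gsm8k_reward` are made up for illustration, and the function drops the `**kwargs` that the trainer-facing version accepts.

```python
import re


def extract_answer(text: str):
    """Return the number after '####', else the last number in the text, else None."""
    match = re.search(r"####\s*([-+]?\d[\d,]*\.?\d*)", text)
    if match:
        try:
            return float(match.group(1).replace(",", ""))
        except ValueError:
            pass
    numbers = re.findall(r"[-+]?\d[\d,]*\.?\d*", text)
    if numbers:
        try:
            return float(numbers[-1].replace(",", ""))
        except ValueError:
            pass
    return None


def gsm8k_reward(completions, answers):
    """Score each completion: 1.0 if its number matches the gold number, else -1.0."""
    rewards = []
    for completion, gold_answer in zip(completions, answers):
        pred = extract_answer(completion)
        gold = extract_answer(gold_answer)
        matched = pred is not None and gold is not None and abs(pred - gold) < 1e-6
        rewards.append(1.0 if matched else -1.0)
    return rewards


# Illustrative inputs (not taken from the dataset).
completions = [
    "9 + 9 makes the total.\n#### 18",        # uses the '####' marker
    "I think the answer is 1,250 dollars",   # no marker: falls back to the last number
    "No idea.",                               # no number at all
]
answers = ["... #### 18", "#### 1250", "#### 7"]

rewards = gsm8k_reward(completions, answers)
print(rewards)
```

Note that the fallback takes the last number anywhere in the completion, so a completion that ends with an unrelated number is scored on that number.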
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/opencode/README.md b/rdagent/scenarios/rl/autorl_bench/agents/opencode/README.md
new file mode 100644
index 000000000..831372a1a
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/opencode/README.md
@@ -0,0 +1,197 @@
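+# OpenCode Agent
+
+A fixed-stage pipeline agent built on [opencode-rl](https://github.com/shatianming5/opencode-rl). A large model (such as GPT-5.2) drives an iterative loop of code generation → training → evaluation → feedback through OpenCode.
+
+## Architecture
+
+The OpenCode Agent uses a **plug-in design**: the core pipeline code lives in the separate [opencode-rl](https://github.com/shatianming5/opencode-rl) repository, and RD-Agent invokes that external opencode-rl directory through `start.sh`.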
+
+```
+RD-Agent (autorl_bench)          opencode-rl (external repo)
+┌─────────────────────┐          ┌──────────────────────────┐
+│ run.py              │          │ main.py                  │
+│   ↓                 │  exec    │ pipeline/                │
+│ start.sh ─────────────────────→│ runner_fsm/              │
+│   (sets env vars)   │          │ benchmarks/              │
+│                     │          │   gsm8k, humaneval, ...  │
+│ Grading Server      │◄─ HTTP ──│   smith-*, alfworld, ... │
+│ (scoring & models)  │          └──────────────────────────┘
+└─────────────────────┘
+```
+
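+**Benefits**:
+- opencode-rl can be developed, tested, and iterated on independently of RD-Agent's release cycle
+- It can run standalone (without RD-Agent) or as an agent plug-in
+- The `OPENCODE_RL_ROOT` environment variable makes it easy to switch versions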
+
+## Quick start
+
+### 1. Prepare opencode-rl
+
+```bash
+# Clone opencode-rl locally (if you have not already)
+git clone https://github.com/shatianming5/opencode-rl.git /path/to/opencode-rl
+cd /path/to/opencode-rl
+pip install -r requirements.txt
+```
+
+### 2. Install RD-Agent dependencies
+
+```bash
+cd ~/RD-Agent
+pip install -e .
+pip install -r rdagent/scenarios/rl/autorl_bench/requirements.txt
+```
+
+You also need the [OpenCode](https://opencode.ai/) CLI:
+
+```bash
+npm install -g opencode-ai    # requires Node.js >= 18
+```
+
+### 3. Configure `.env`
+
+Add the following to `.env` in the RD-Agent root directory:
+
+```env
+# LLM API (required)
+OPENAI_API_KEY=your_api_key
+OPENAI_API_BASE=https://your-api-endpoint/v1
+
+# Model used by OpenCode (gpt-5.2 recommended; falls back to CHAT_MODEL, then gpt-5)
+OPENCODE_MODEL=gpt-5.2
+
+# Path to opencode-rl (the default in start.sh is /data/userdata/v-tiansha/opencode-rl)
+OPENCODE_RL_ROOT=/path/to/your/opencode-rl
+
+# GPU configuration (optional)
+CUDA_VISIBLE_DEVICES=0,1,2,3
+```
+
+### 4. Run
+
+```bash
+cd /path/to/RD-Agent
+
+# GSM8K
+python -m rdagent.scenarios.rl.autorl_bench.run \
+    --agent opencode --task gsm8k --model Qwen/Qwen2.5-0.5B-Instruct --timeout 41600
+
+# ALFWorld
+python -m rdagent.scenarios.rl.autorl_bench.run \
+    --agent opencode --task alfworld --model Qwen/Qwen2.5-0.5B-Instruct --timeout 41600
+
+# Run in the background
+nohup python -m rdagent.scenarios.rl.autorl_bench.run \
+    --agent opencode --task gsm8k --model Qwen/Qwen2.5-0.5B-Instruct \
+    --timeout 41600 > /dev/null 2>&1 &
+```
+
+### 5. View logs
+
+```bash
+# Live agent log
+tail -f workspace/gsm8k/20260301T160000_opencode/agent.log
+
+# Score records
+cat workspace/gsm8k/20260301T160000_opencode/scores.json
+```
+
+---
+
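+## Pipeline execution flow
+
+Each iteration runs the following fixed stages:
+
+```
+Code Gen → Training → Eval → Analysis → next iteration
+   │          │         │        │
+   │          │         │        └─ Agent summarizes the results and plans improvements
+   │          │         └─ Submit the model to the Grading Server for scoring
+   │          └─ accelerate launch train.py (GRPO training)
+   └─ Agent generates/edits train.py
+```
+
+- **Code Gen**: the large model (through OpenCode) generates the training code `train.py`
+- **Training**: runs RL training (GRPO) with `accelerate`
+- **Eval**: submits the trained model to the Grading Server for evaluation
+- **Analysis**: the large model analyzes the evaluation results and decides what to improve in the next iteration
+
+Failed stages are retried automatically (up to `MAX_RETRIES` times), and `--resume` continues an interrupted run.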
+
+---
+
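+## Configuration parameters
+
+The following parameters are set in `config.yaml` and passed to opencode-rl as environment variables:
+
+| Parameter | Default | Description |
+|------|--------|------|
+| `MAX_ITERATIONS` | 5 | Maximum number of iterations |
+| `MAX_RETRIES` | 20 | Retries per stage on failure |
+| `MAX_AGENT_STEPS` | 25 | Maximum agent steps per stage |
+| `TRAINING_TIMEOUT` | 7200 | Training timeout (seconds) |
+| `STALE_TIMEOUT` | 1800 | Timeout when the LLM stops responding (seconds) |
+| `HTTP_TIMEOUT` | 600 | HTTP request timeout (seconds) |
+| `EVAL_TIMEOUT` | 7200 | Evaluation request timeout (seconds) |
+
+These can be overridden in `.env` or through environment variables on the command line.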
+
+---
+
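+## Directory layout
+
+```
+agents/opencode/
+├── config.yaml              # agent registration config (parameters, start script)
+├── start.sh                 # start script (sets environment variables, then execs opencode-rl)
+├── README.md                # this document
+│
+└── opencode-rl/             # bundled copy (fallback; an external OPENCODE_RL_ROOT takes precedence)
+    └── ...
+```
+
+For the layout of the external opencode-rl repository, see https://github.com/shatianming5/opencode-rl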
+
+---
+
+## Customizing the opencode-rl path
+
+`start.sh` uses the `OPENCODE_RL_ROOT` environment variable to decide which opencode-rl checkout to run:
+
+```bash
+# Default value (in start.sh)
+OPENCODE_RL_ROOT="${OPENCODE_RL_ROOT:-/data/userdata/v-tiansha/opencode-rl}"
+```
+
+You can override it in `.env`:
+
+```env
+OPENCODE_RL_ROOT=/home/user/my-opencode-rl
+```
+
+Or set it for a single run:
+
+```bash
+OPENCODE_RL_ROOT=/tmp/opencode-rl-dev python -m rdagent.scenarios.rl.autorl_bench.run --agent opencode --task gsm8k
+```
+
+---
+
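+## FAQ
+
+### The LLM stays in "thinking" for a long time
+
+This is normal for reasoning models such as gpt-5.2, which may think for 1-3 minutes when generating complex training code. `OPENCODE_MODEL=gpt-5.2` is the recommended setting.
+
+### Updating opencode-rl
+
+opencode-rl is maintained separately, so an update only requires:
+
+```bash
+cd /path/to/opencode-rl
+git pull
+pip install -r requirements.txt  # if dependencies changed
+```
+
+Nothing needs to change on the RD-Agent side; the next run picks up the new version automatically.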
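The fixed-stage loop described in the README above (Code Gen → Training → Eval → Analysis, with per-stage retries) can be sketched in a few lines. This is an illustration of the control flow only, not the opencode-rl implementation. The function names and the stubbed stages are hypothetical, as are the choice of `RuntimeError` as the retryable failure and the reading of `max_retries` as "extra attempts".

```python
from typing import Callable, Optional


def run_with_retries(stage: Callable[[], object], max_retries: int):
    """Run a stage, retrying on RuntimeError up to max_retries extra times."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):
        try:
            return stage()
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"stage failed after {max_retries} retries") from last_error


def run_pipeline(code_gen, train, evaluate, analyze, max_iterations: int, max_retries: int):
    """Iterate the four stages and return the best score seen."""
    feedback = None
    best = None
    for _ in range(max_iterations):
        code = run_with_retries(lambda: code_gen(feedback), max_retries)
        model = run_with_retries(lambda: train(code), max_retries)
        score = run_with_retries(lambda: evaluate(model), max_retries)
        best = score if best is None else max(best, score)
        feedback = analyze(score)  # feeds the next Code Gen stage
    return best


# Stub stages: training fails once, then succeeds.
calls = {"train": 0}


def flaky_train(code):
    calls["train"] += 1
    if calls["train"] == 1:
        raise RuntimeError("simulated OOM")
    return f"model-from-{code}"


scores = iter([10.0, 12.5, 11.0])
best = run_pipeline(
    code_gen=lambda feedback: f"train.py({feedback})",
    train=flaky_train,
    evaluate=lambda model: next(scores),
    analyze=lambda score: f"last score {score}",
    max_iterations=3,
    max_retries=2,
)
print(best)
```

Keeping the retry wrapper separate from the loop means each stage is retried on its own, so a training failure does not repeat code generation.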
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/opencode/config.yaml b/rdagent/scenarios/rl/autorl_bench/agents/opencode/config.yaml
new file mode 100644
index 000000000..2ddc8d4dd
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/opencode/config.yaml
@@ -0,0 +1,12 @@
+name: "OpenCode Agent"
+description: "Fixed-stage pipeline: code generation → training → evaluation → feedback (based on opencode-rl)"
+start: "start.sh"
+env_vars:
+  MAX_ITERATIONS: "5"
+  TRAINING_TIMEOUT: "7200"
+  MAX_AGENT_STEPS: "25"
+  MAX_RETRIES: "20"
+  STALE_TIMEOUT: "1800"
+  HTTP_TIMEOUT: "600"
+  EVAL_TIMEOUT: "7200"
+  MAX_STEPS: "20"
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/opencode/start.sh b/rdagent/scenarios/rl/autorl_bench/agents/opencode/start.sh
new file mode 100755
index 000000000..b310932ce
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/opencode/start.sh
@@ -0,0 +1,78 @@
+#!/bin/bash
+# OpenCode Agent wrapper for AutoRL-Bench
+
+echo "=== OpenCode Agent ==="
+echo "Task: $TASK"
+echo "Model: $BASE_MODEL"
+echo "Workspace: $WORKSPACE"
+echo "Grading Server: $GRADING_SERVER_URL"
+echo "Output Dir: $OUTPUT_DIR"
+
+# Load .env (the script is started from the RD-Agent directory)
+if [ -f .env ]; then
+    export $(grep -v '^#' .env | xargs)
+    echo "Loaded .env"
+fi
+
+# opencode-rl path: defaults to an external standalone directory
+OPENCODE_RL_ROOT="${OPENCODE_RL_ROOT:-/data/userdata/v-tiansha/opencode-rl}"
+
+# OPENCODE_MODEL takes precedence if set; otherwise fall back to CHAT_MODEL, then gpt-5
+export OPENCODE_MODEL="${OPENCODE_MODEL:-${CHAT_MODEL:-gpt-5}}"
+echo "OpenCode Model: $OPENCODE_MODEL"
+
+export PYTHONUNBUFFERED=1
+
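+# The opencode CLI may be installed in ~/.opencode/bin; make sure it is on PATH
+export PATH="$HOME/.opencode/bin:$PATH"
+
+# Prepend the training environment's bin directory to PATH so that the LLM agent's bash tool calls
+# (python3 -c "from trl import ...") also pick up the correct training dependencies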
+if [ -n "$TRAINING_PYTHON" ]; then
+    TRAINING_BIN_DIR="$(dirname "$TRAINING_PYTHON")"
+    export PATH="$TRAINING_BIN_DIR:$PATH"
+    echo "Training env bin: $TRAINING_BIN_DIR (prepended to PATH)"
+fi
+
+# Python interpreter: prefer OPENCODE_PYTHON from .env, otherwise python3
+PYTHON="${OPENCODE_PYTHON:-python3}"
+echo "Python: $PYTHON"
+
+# Generate the opencode config (using the API settings from the RD-Agent root .env)
+export XDG_CONFIG_HOME="${OPENCODE_RL_ROOT}/.opencode-config"
+mkdir -p "$XDG_CONFIG_HOME/opencode"
+cat > "$XDG_CONFIG_HOME/opencode/opencode.json" <<EOCFG
+{
+  "\$schema": "https://opencode.ai/config.json",
+  "provider": {
+    "openai": {
+      "npm": "@ai-sdk/openai",
+      "name": "Auto-configured",
+      "options": {
+        "baseURL": "${OPENAI_API_BASE}",
+        "apiKey": "${OPENAI_API_KEY}"
+      },
+      "models": {
+        "${OPENCODE_MODEL}": { "name": "${OPENCODE_MODEL}" }
+      }
+    }
+  }
+}
+EOCFG
+
+# Run the opencode-rl pipeline; abort if the directory is missing
+cd "$OPENCODE_RL_ROOT" || exit 1
+
+# Use exec to REPLACE bash with python3, so signals go directly to python3
+# without an intermediate bash process. This avoids double signal delivery.
+exec "$PYTHON" main.py \
+    --benchmark "$TASK" \
+    --base-model "$BASE_MODEL" \
+    --run-dir "$WORKSPACE" \
+    --max-iterations ${MAX_ITERATIONS:-5} \
+    --max-retries ${MAX_RETRIES:-20} \
+    --training-timeout ${TRAINING_TIMEOUT:-7200} \
+    --stale-timeout ${STALE_TIMEOUT:-1800} \
+    --http-timeout ${HTTP_TIMEOUT:-600} \
+    --eval-timeout ${EVAL_TIMEOUT:-7200} \
+    --max-agent-steps ${MAX_AGENT_STEPS:-25}
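The heredoc in `start.sh` interpolates `OPENAI_API_BASE`, `OPENAI_API_KEY`, and `OPENCODE_MODEL` into JSON as raw text, so a value containing `"` or `\` would yield an invalid file. A possible alternative, sketched below, is to build the same structure with a JSON serializer. The keys mirror the heredoc above, while `render_opencode_config` is a hypothetical helper and not something the script currently calls.

```python
import json


def render_opencode_config(api_base: str, api_key: str, model: str) -> str:
    """Render the opencode.json structure used in start.sh with proper JSON escaping."""
    config = {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "openai": {
                "npm": "@ai-sdk/openai",
                "name": "Auto-configured",
                "options": {"baseURL": api_base, "apiKey": api_key},
                "models": {model: {"name": model}},
            }
        },
    }
    return json.dumps(config, indent=2)


# A key containing a quote and a backslash still round-trips cleanly.
rendered = render_opencode_config(
    api_base="https://your-api-endpoint/v1",
    api_key='abc"def\\ghi',
    model="gpt-5.2",
)
parsed = json.loads(rendered)
print(parsed["provider"]["openai"]["options"]["apiKey"])
```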
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/openhands/config.yaml b/rdagent/scenarios/rl/autorl_bench/agents/openhands/config.yaml
new file mode 100644
index 000000000..d0f186de0
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/openhands/config.yaml
@@ -0,0 +1,8 @@
+name: "OpenHands Agent"
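+description: "Fixed-stage pipeline: each round runs code generation → training → evaluation → feedback (modeled on openhands-magic)"
+start: "start.sh"
+env_vars:
+  MAX_ITERATIONS: "30"           # pipeline iterations (each round = write code + train + evaluate)
+  TRAINING_TIMEOUT: "36000"      # per-round training timeout in seconds (10 hours)
+  MAX_AGENT_STEPS: "20"         # max agent steps for code generation per round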
+  LLM_MODEL: "gpt-5.2"
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/openhands/start.sh b/rdagent/scenarios/rl/autorl_bench/agents/openhands/start.sh
new file mode 100755
index 000000000..b2f71861b
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/openhands/start.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# OpenHands Agent wrapper for AutoRL-Bench
+
+echo "=== OpenHands Agent ==="
+echo "Task: $TASK"
+echo "Model: $BASE_MODEL"
+echo "Workspace: $WORKSPACE"
+echo "Grading Server: $GRADING_SERVER_URL"
+echo "Output Dir: $OUTPUT_DIR"
+
+# Load .env (the script is started from the RD-Agent directory)
+if [ -f .env ]; then
+    export $(grep -v '^#' .env | xargs)
+    echo "Loaded .env"
+fi
+
+# Map environment variables (rdagent uses OPENAI_API_KEY, openhands uses LLM_API_KEY)
+export LLM_API_KEY="${OPENAI_API_KEY}"
+# LLM_MODEL from config.yaml takes precedence; otherwise fall back to CHAT_MODEL, then gpt-5
+export LLM_MODEL="${LLM_MODEL:-${CHAT_MODEL:-gpt-5}}"
+export LLM_BASE_URL="${OPENAI_API_BASE}"
+echo "LLM Model: $LLM_MODEL"
+
+# Training-environment Python path (set TRAINING_PYTHON in .env to skip the conda fallback)
+if [ -z "$TRAINING_PYTHON" ]; then
+    echo "WARNING: TRAINING_PYTHON not set in .env, trying conda fallback..."
+    source "$(conda info --base 2>/dev/null || echo /root/miniconda3)/etc/profile.d/conda.sh" 2>/dev/null
+    conda activate "${CONDA_ENV_TRAINING:-cwy-rl}" 2>/dev/null
+    export TRAINING_PYTHON="$(which python)"
+    conda activate "${CONDA_ENV_OPENHANDS:-openhands}" 2>/dev/null
+fi
+echo "Training Python: $TRAINING_PYTHON"
+
+# Run the openhands-rl pipeline; abort if the directory is missing
+cd "${OPENHANDS_RL_ROOT:-$HOME/openhands-rl}" || exit 1
+OPENHANDS_PYTHON="${OPENHANDS_PYTHON:-python}"
+
+"$OPENHANDS_PYTHON" main.py \
+    --benchmark "$TASK" \
+    --base-model "$BASE_MODEL" \
+    --workspace "$WORKSPACE" \
+    --max-iterations ${MAX_ITERATIONS:-10} \
+    --training-timeout ${TRAINING_TIMEOUT:-7200} \
+    --max-agent-steps ${MAX_AGENT_STEPS:-50}
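The nested default `${LLM_MODEL:-${CHAT_MODEL:-gpt-5}}` in the script above treats an empty variable the same as an unset one. A rough Python equivalent, useful when the same precedence has to be reproduced outside bash, might look like this. `resolve_model` is a hypothetical name, and the environment is passed in as a plain dict so the behaviour can be checked without touching `os.environ`.

```python
def resolve_model(env: dict, primary: str = "LLM_MODEL",
                  secondary: str = "CHAT_MODEL", default: str = "gpt-5") -> str:
    """Mimic bash ${PRIMARY:-${SECONDARY:-default}}.

    `or` skips empty strings as well as missing keys, which matches the
    ':-' form of parameter expansion (unset OR empty falls through).
    """
    return env.get(primary) or env.get(secondary) or default


print(resolve_model({}))
print(resolve_model({"CHAT_MODEL": "gpt-5.2"}))
print(resolve_model({"LLM_MODEL": "", "CHAT_MODEL": "gpt-5.2"}))
print(resolve_model({"LLM_MODEL": "custom-model", "CHAT_MODEL": "gpt-5.2"}))
```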
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/rdagent/config.yaml b/rdagent/scenarios/rl/autorl_bench/agents/rdagent/config.yaml
new file mode 100644
index 000000000..fc9e06242
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/rdagent/config.yaml
@@ -0,0 +1,6 @@
+name: "RD-Agent"
+description: "RD-Agent RL Post-training Loop (automatic hypothesis generation + code generation + validation iterations)"
+start: "start.sh"
+env_vars:
+  STEP_N: "200"
+  LOOP_N: "40"
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/rdagent/start.sh b/rdagent/scenarios/rl/autorl_bench/agents/rdagent/start.sh
new file mode 100755
index 000000000..6d5f6eaa3
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/rdagent/start.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+# RD-Agent wrapper for AutoRL-Bench
+
+echo "=== RD-Agent ==="
+echo "Task: $TASK"
+echo "Model: $BASE_MODEL"
+echo "Workspace: $WORKSPACE"
+
+# Load .env (the script is started from the RD-Agent directory)
+if [ -f .env ]; then
+    export $(grep -v '^#' .env | xargs)
+    echo "Loaded .env"
+fi
+
+# Set the rdagent data directory (base_model and benchmark are passed on the command line)
+export RL_FILE_PATH="$(dirname "$(dirname "$MODEL_PATH")")"
+echo "RL_FILE_PATH: $RL_FILE_PATH"
+
+# Run rdagent (each iteration calls the grading server for evaluation internally)
+python -m rdagent.app.rl.loop \
+    --base-model "$BASE_MODEL" \
+    --benchmark "$TASK" \
+    --step-n "$STEP_N" \
+    --loop-n "$LOOP_N"
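`RL_FILE_PATH` above is derived by applying `dirname` twice to `MODEL_PATH`, so its value depends on how many path components the model id has. The sketch below shows what that expression evaluates to for the `MODEL_PATH` layout documented in the bench README (`$WORKSPACE/models/<org>/<name>`) and for a model id without an org prefix. The concrete paths and the `gpt2` id are illustrative assumptions, and `two_levels_up` is a hypothetical name.

```python
from pathlib import PurePosixPath


def two_levels_up(model_path: str) -> str:
    """Equivalent of $(dirname $(dirname "$MODEL_PATH")) for POSIX paths."""
    return str(PurePosixPath(model_path).parents[1])


# Model id with an org prefix: the result is the models/ directory.
with_org = two_levels_up("/ws/models/Qwen/Qwen2.5-0.5B")
print(with_org)

# Model id without an org prefix: the result is one level higher.
without_org = two_levels_up("/ws/models/gpt2")
print(without_org)
```

If the loop expects one fixed root regardless of the model id, deriving it from `WORKSPACE` instead of `MODEL_PATH` may be more robust; whether that is needed depends on `rdagent.app.rl.loop`, which is not shown here.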
diff --git a/rdagent/scenarios/rl/autorl_bench/agents/registry.py b/rdagent/scenarios/rl/autorl_bench/agents/registry.py
new file mode 100644
index 000000000..9bf128252
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/agents/registry.py
@@ -0,0 +1,40 @@
+"""
+Agent Registry
+"""
+import yaml
+from dataclasses import dataclass
+from pathlib import Path
+
+AGENTS_DIR = Path(__file__).parent
+
+
+@dataclass
+class Agent:
+    id: str
+    name: str
+    start: Path
+    env_vars: dict = None
+    
+    def __post_init__(self):
+        self.env_vars = self.env_vars or {}
+
+
+def get_agent(agent_id: str) -> Agent:
+    agent_dir = AGENTS_DIR / agent_id
+    config_file = agent_dir / "config.yaml"
+    
+    if not config_file.exists():
+        raise ValueError(f"Agent not found: {agent_id}")
+    
+    data = yaml.safe_load(config_file.read_text())
+    
+    return Agent(
+        id=agent_id,
+        name=data.get("name", agent_id),
+        start=agent_dir / data.get("start", "start.sh"),
+        env_vars=data.get("env_vars", {}),
+    )
+
+
+def list_agents() -> list[str]:
+    return [d.name for d in AGENTS_DIR.iterdir() if d.is_dir() and (d / "config.yaml").exists()]
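`Agent.env_vars` is meant to be handed to the agent's `start.sh` together with the runtime variables listed in the bench README (`TASK`, `WORKSPACE`, and so on). How `run.py` merges them is not shown in this diff, so the sketch below is only one plausible layering: inherited environment first, then `config.yaml` values, then runtime values. The precedence order and the name `build_agent_env` are assumptions.

```python
def build_agent_env(base_env: dict, agent_env_vars: dict, runtime_vars: dict) -> dict:
    """Merge environment layers; later layers win (assumed precedence).

    Values are converted to str because process environments only accept
    strings, while YAML may parse unquoted values such as 5 as integers.
    """
    env = dict(base_env)
    env.update({key: str(value) for key, value in agent_env_vars.items()})
    env.update({key: str(value) for key, value in runtime_vars.items()})
    return env


env = build_agent_env(
    base_env={"PATH": "/usr/bin", "MAX_ITERATIONS": "1"},
    agent_env_vars={"MAX_ITERATIONS": 5, "MAX_RETRIES": "20"},
    runtime_vars={"TASK": "gsm8k", "GRADING_SERVER_URL": "http://localhost:5000"},
)
print(env["MAX_ITERATIONS"], env["TASK"])
```

The resulting dict could then be passed as `env=` to `subprocess.run` when launching the start script.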
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/__init__.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/__init__.py
new file mode 100644
index 000000000..1548d42d9
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/__init__.py
@@ -0,0 +1,97 @@
+"""
+AutoRL-Bench Benchmarks Registry
+
+Registry of all available benchmark evaluators.
+Register new benchmarks here.
+"""
+import importlib
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Dict, Optional, Type
+
+from rdagent.scenarios.rl.autorl_bench.core.evaluator import BaseEvaluator
+
+
+BENCHMARKS_DIR = Path(__file__).parent
+
+
+@dataclass
+class BenchmarkConfig:
+    """Benchmark configuration.
+
+    Each benchmark keeps its data download/processing logic in the data.py of its own directory
+    rather than here, so adding a benchmark only requires code in that benchmark's directory.
+    """
+    id: str
+    evaluator_class: str  # fully qualified path of the evaluator class
+    data_module: str = ""  # data module path (implements download_train_data)
+    description: str = ""
+    eval_config: Optional[Dict[str, Any]] = field(default=None)
+    expose_files: list = field(default_factory=list)  # extra benchmark-specific files (description.md and instructions.md are mounted by run.py)
+    bench_dir: Optional[str] = None  # custom benchmark directory (None means BENCHMARKS_DIR / id)
+
+
+# Benchmark registry
+BENCHMARKS: Dict[str, BenchmarkConfig] = {
+    "gsm8k": BenchmarkConfig(
+        id="gsm8k",
+        evaluator_class="rdagent.scenarios.rl.autorl_bench.core.opencompass.OpenCompassEvaluator",
+        data_module="rdagent.scenarios.rl.autorl_bench.benchmarks.gsm8k.data",
+        description="Grade School Math 8K - grade-school math reasoning",
+        eval_config={
+            "dataset": "opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4",
+        },
+    ),
+    "alfworld": BenchmarkConfig(
+        id="alfworld",
+        evaluator_class="rdagent.scenarios.rl.autorl_bench.benchmarks.alfworld.eval.ALFWorldEvaluator",
+        data_module="rdagent.scenarios.rl.autorl_bench.benchmarks.alfworld.data",
+        description="ALFWorld - text-game interactive environment (ReAct agent, supports vLLM/API)",
+        eval_config={
+            "max_steps": 50,
+            "env_num": 134,  # full evaluation set (valid_unseen); was set to 1 while debugging
+        },
+        expose_files=["eval.py", "react_prompts.json"],
+    ),
+    "webshop": BenchmarkConfig(
+        id="webshop",
+        evaluator_class="rdagent.scenarios.rl.autorl_bench.benchmarks.webshop.eval.WebShopEvaluator",
+        data_module="rdagent.scenarios.rl.autorl_bench.benchmarks.webshop.data",
+        description="WebShop - online shopping website environment (ReAct agent, supports vLLM/API)",
+        eval_config={
+            "max_steps": 50,
+            "num_instructions": 100,  # number of evaluation instructions (the full set has about 12,000)
+            "webshop_port": 8080,     # WebShop server port
+        },
+        expose_files=["eval.py"],
+    ),
+}
+
+
+from rdagent.scenarios.rl.autorl_bench.benchmarks.smith import discover_smith_benchmarks
+BENCHMARKS.update(discover_smith_benchmarks())
+
+
+def get_benchmark(benchmark_id: str) -> BenchmarkConfig:
+    """Return the benchmark configuration."""
+    if benchmark_id not in BENCHMARKS:
+        available = list(BENCHMARKS.keys())
+        raise ValueError(f"Unknown benchmark: {benchmark_id}. Available: {available}")
+    return BENCHMARKS[benchmark_id]
+
+
+def get_evaluator(benchmark_id: str) -> BaseEvaluator:
+    """Return an evaluator instance for the benchmark."""
+    config = get_benchmark(benchmark_id)
+
+    # Import the evaluator class dynamically
+    module_path, class_name = config.evaluator_class.rsplit(".", 1)
+    module = importlib.import_module(module_path)
+    evaluator_class: Type[BaseEvaluator] = getattr(module, class_name)
+    
+    return evaluator_class(config)
+
+
+def list_benchmarks() -> list[str]:
+    """List all available benchmarks."""
+    return list(BENCHMARKS.keys())
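`get_evaluator` resolves `evaluator_class` from a dotted string at call time, so registering a benchmark does not import its dependencies until the evaluator is actually needed. The mechanism can be tried in isolation with a standard-library class standing in for an evaluator. `load_class` is a hypothetical helper name, and `collections.OrderedDict` is used purely as a stand-in.

```python
import collections
import importlib


def load_class(dotted_path: str) -> type:
    """Import `package.module.ClassName` and return the class object."""
    module_path, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# A stdlib class stands in for an evaluator class here.
cls = load_class("collections.OrderedDict")
print(cls is collections.OrderedDict)

# A wrong class name surfaces as AttributeError; a wrong module path
# would surface as ModuleNotFoundError from import_module.
try:
    load_class("collections.NoSuchEvaluator")
except AttributeError as err:
    print(f"lookup failed: {err}")
```

Letting these errors propagate, rather than catching them, keeps a mistyped registry entry visible at the first run that uses it.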
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/__init__.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/__init__.py
new file mode 100644
index 000000000..d5921e5c2
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/__init__.py
@@ -0,0 +1 @@
+"""ALFWorld Benchmark"""
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/base_config.yaml b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/base_config.yaml
new file mode 100644
index 000000000..61311b006
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/base_config.yaml
@@ -0,0 +1,36 @@
+# ALFWorld base config (from alfworld official repo)
+# $ALFWORLD_DATA is set by eval.py -> data._ensure_alfworld_data()
+
+dataset:
+  data_path: '$ALFWORLD_DATA/json_2.1.1/train'
+  eval_id_data_path: '$ALFWORLD_DATA/json_2.1.1/valid_seen'
+  eval_ood_data_path: '$ALFWORLD_DATA/json_2.1.1/valid_unseen'
+  num_train_games: -1
+  num_eval_games: -1
+
+logic:
+  domain: '$ALFWORLD_DATA/logic/alfred.pddl'
+  grammar: '$ALFWORLD_DATA/logic/alfred.twl2'
+
+env:
+  type: 'AlfredTWEnv'
+  domain_randomization: False
+  task_types: [1, 2, 3, 4, 5, 6]
+  expert_timeout_steps: 150
+  expert_type: "handcoded"
+  goal_desc_human_anns_prob: 0.0
+
+controller:
+  type: 'oracle'
+  debug: False
+  load_receps: True
+
+general:
+  random_seed: 42
+  use_cuda: True
+  task: 'alfred'
+  training_method: 'dagger'
+
+dagger:
+  training:
+    max_nb_steps_per_episode: 50
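The paths in `base_config.yaml` contain a literal `$ALFWORLD_DATA`, which only becomes a real path once the environment variable is set and the strings are expanded. The diff does not show where that expansion happens, so the sketch below is just one way to do it after loading the YAML into a dict. The `expand_paths` helper and the `/tmp/alfworld-cache` value are assumptions for illustration.

```python
import os


def expand_paths(node):
    """Recursively expand $VARS in every string of a nested config structure."""
    if isinstance(node, dict):
        return {key: expand_paths(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_paths(item) for item in node]
    if isinstance(node, str):
        return os.path.expandvars(node)
    return node  # numbers, booleans, None are left untouched


# Illustrative value; in the benchmark this is set by the evaluation code.
os.environ["ALFWORLD_DATA"] = "/tmp/alfworld-cache"

config = {
    "dataset": {
        "data_path": "$ALFWORLD_DATA/json_2.1.1/train",
        "eval_ood_data_path": "$ALFWORLD_DATA/json_2.1.1/valid_unseen",
        "num_train_games": -1,
    },
    "env": {"task_types": [1, 2, 3, 4, 5, 6]},
}

expanded = expand_paths(config)
print(expanded["dataset"]["data_path"])
```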
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/data.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/data.py
new file mode 100644
index 000000000..8962bf31c
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/data.py
@@ -0,0 +1,77 @@
+"""
+ALFWorld data preparation
+
+The official alfworld-download fetches all data in one go (json + pddl + game.tw-pddl + logic)
+into ~/.cache/alfworld/; only the training data is then symlinked for the agent.
+"""
+import sys
+from pathlib import Path
+
+from loguru import logger
+
+
+def _run_alfworld_download() -> None:
+    """Run alfworld-download, working around conda env PATH issues."""
+    import subprocess
+
+    bin_dir = Path(sys.executable).parent
+    script = bin_dir / "alfworld-download"
+    if script.exists():
+        subprocess.run([sys.executable, str(script)], check=True)
+    else:
+        subprocess.run(["alfworld-download"], check=True)
+
+
+def _ensure_alfworld_data() -> Path:
+    """Ensure the full ALFWorld dataset is downloaded and return the data root.
+
+    alfworld-download fetches three zips into ~/.cache/alfworld/:
+      - json_2.1.1_json.zip  -> traj_data.json
+      - json_2.1.1_pddl.zip  -> initial_state.pddl
+      - json_2.1.3_tw-pddl.zip -> game.tw-pddl
+      + logic/alfred.pddl, logic/alfred.twl2
+    """
+    cache_dir = Path.home() / ".cache" / "alfworld"
+    json_dir = cache_dir / "json_2.1.1"
+
+    tw_pddl_ok = json_dir.exists() and any(json_dir.rglob("game.tw-pddl"))
+    pddl_ok = json_dir.exists() and any(json_dir.rglob("initial_state.pddl"))
+    logic_ok = (cache_dir / "logic" / "alfred.pddl").exists()
+
+    if tw_pddl_ok and pddl_ok and logic_ok:
+        logger.info(f"ALFWorld data already complete: {cache_dir}")
+        return cache_dir
+
+    logger.info("Running alfworld-download (downloads ~2GB, first time only)...")
+    _run_alfworld_download()
+
+    if not any(json_dir.rglob("game.tw-pddl")):
+        raise RuntimeError(
+            f"alfworld-download finished but game.tw-pddl not found in {json_dir}. "
+            "Check network connectivity to GitHub releases."
+        )
+    logger.info(f"ALFWorld data ready: {cache_dir}")
+    return cache_dir
+
+
+def download_train_data(target_dir: Path) -> None:
+    """Prepare the ALFWorld training data (visible to the agent)."""
+    marker = target_dir / ".downloaded"
+    if marker.exists():
+        logger.info(f"ALFWorld train data exists: {target_dir}")
+        return
+
+    target_dir.mkdir(parents=True, exist_ok=True)
+
+    cache_dir = _ensure_alfworld_data()
+    train_src = cache_dir / "json_2.1.1" / "train"
+
+    if not train_src.exists():
+        raise FileNotFoundError(f"ALFWorld train data not found: {train_src}")
+
+    train_dst = target_dir / "train"
+    if not train_dst.exists():
+        train_dst.symlink_to(train_src)
+    logger.info(f"ALFWorld train data linked: {train_dst} -> {train_src}")
+
+    marker.touch()
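Review note: `download_train_data` combines a marker file with a symlink so that repeated calls are cheap no-ops. The sketch below reproduces that pattern in isolation, assuming temporary directories in place of the real cache and workspace; the helper name `link_train_data` and its boolean return value are illustrative and not part of the patch.

```python
import tempfile
from pathlib import Path


def link_train_data(train_src: Path, target_dir: Path) -> bool:
    """Symlink train_src into target_dir once; return False when already prepared."""
    marker = target_dir / ".downloaded"
    if marker.exists():
        return False

    target_dir.mkdir(parents=True, exist_ok=True)
    train_dst = target_dir / "train"
    if not train_dst.exists():
        train_dst.symlink_to(train_src, target_is_directory=True)

    # The marker is written last, so an interrupted run is retried next time.
    marker.touch()
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "cache" / "train"
    src.mkdir(parents=True)
    (src / "game.txt").write_text("demo")

    dst = Path(tmp) / "workspace"
    first = link_train_data(src, dst)   # creates the link and the marker
    second = link_train_data(src, dst)  # short-circuits on the marker
    linked_file_visible = (dst / "train" / "game.txt").exists()
```

One caveat of this pattern: `Path.exists()` follows symlinks, so a dangling link left over from a deleted cache would make `symlink_to` raise `FileExistsError`.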
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/description.md b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/description.md
new file mode 100644
index 000000000..895973382
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/description.md
@@ -0,0 +1,62 @@
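+# ALFWorld Task
+
+## Goal
+Train a model to achieve a higher task success rate in the ALFWorld text-game environment. This is an **interactive** task: the model makes multi-step decisions in the environment (a rollout) rather than generating a single answer.
+
+## Environment Overview
+ALFWorld is a text-simulated household environment built on the TextWorld engine. The model acts as the agent and issues text commands to navigate rooms and manipulate objects in order to complete tasks.
+
+## Task Types (6)
+1. **pick_and_place**: pick up an object and place it at a target location
+2. **pick_clean_then_place**: clean an object, then place it at a target location
+3. **pick_heat_then_place**: heat an object, then place it at a target location
+4. **pick_cool_then_place**: cool an object, then place it at a target location
+5. **look_at_obj_in_light**: examine an object under a light
+6. **pick_two_obj_and_place**: pick up two objects and place them at a target location
+
+## Rollout Flow
+
+Interaction loop for each game: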
+
+```
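+Init: ob, info = env.reset()     # get the initial observation (room description + task goal)
+
+Loop (each step):
+  action = model(observation_history)          # the model generates a text action from the history
+  ob, reward, done, info = env.step([action])  # the environment executes the action and returns a new observation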
+  if done:
+      break
+```
+
+**Example rollout (pick_and_place):**
+```
+Task: "put a pencil in/on shelf."
+
+Step 1:  Obs: "You are in the middle of a room. Looking around you, you see a bed 1, a desk 1, a shelf 1..."
+         Act: "go to desk 1"
+Step 2:  Obs: "On the desk 1, you see a pencil 1, a book 2."
+         Act: "take pencil 1 from desk 1"
+Step 3:  Obs: "You pick up the pencil 1 from the desk 1."
+         Act: "go to shelf 1"
+Step 4:  Obs: "You arrive at shelf 1. On the shelf 1, you see nothing."
+         Act: "put pencil 1 in/on shelf 1"
+Step 5:  Obs: "You put the pencil 1 in/on the shelf 1."
+         Result: task completed
+```
+
+## Action Space
+The agent's actions are free-form text. Common actions include:
+- Navigate: `go to {object} {id}` (e.g. `go to desk 1`, `go to fridge 1`)
+- Take: `take {object} {id} from {location} {id}`
+- Put: `put {object} {id} in/on {location} {id}`
+- Open/close: `open {object} {id}`, `close {object} {id}`
+- Heat/cool: `heat {object} {id} with microwave {id}`, `cool {object} {id} with fridge {id}`
+- Clean: `clean {object} {id} with sinkbasin {id}`
+- Use: `use {object} {id}` (e.g. `use desklamp 1`)
+- Think: `think: {reasoning}` (does not affect the environment state)
+
+## Evaluation Metric
+- **Success rate** = number of successful tasks / total number of tasks
+
+## Reference Code
+See `eval.py` for the full implementation of environment interaction and evaluation.
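Review note: the rollout loop described in description.md can be exercised without installing ALFWorld. The sketch below assumes a hypothetical `ToyEnv` that only mimics the batch-size-1 interface (lists of length one for observations, rewards and done flags) and a scripted policy in place of a model; neither is part of the patch.

```python
class ToyEnv:
    """Hypothetical stand-in for a batch-size-1 ALFWorld environment."""

    GOAL = [
        "go to desk 1",
        "take pencil 1 from desk 1",
        "go to shelf 1",
        "put pencil 1 in/on shelf 1",
    ]

    def reset(self):
        self.progress = 0
        return ["You are in the middle of a room."], {"won": [False]}

    def step(self, actions):
        action = actions[0]
        if self.progress < len(self.GOAL) and action == self.GOAL[self.progress]:
            self.progress += 1
            ob = "OK."
        else:
            ob = "Nothing happens."
        won = self.progress == len(self.GOAL)
        return [ob], [float(won)], [won], {"won": [won]}


def rollout(env, policy, max_steps=50):
    """Play one game; return (won, steps_taken)."""
    ob, info = env.reset()
    history = [ob[0]]
    for step in range(1, max_steps + 1):
        action = policy(history)
        ob, reward, done, info = env.step([action])
        history += [action, ob[0]]
        if done[0]:
            return bool(info["won"][0]), step
    return False, max_steps


def scripted_policy(history):
    """Replay the goal sequence; history holds one observation plus (action, observation) pairs."""
    actions_taken = (len(history) - 1) // 2
    return ToyEnv.GOAL[min(actions_taken, len(ToyEnv.GOAL) - 1)]


won, steps = rollout(ToyEnv(), scripted_policy)
lost, lost_steps = rollout(ToyEnv(), lambda history: "look", max_steps=3)

results = [won, lost]
success_rate = sum(results) / len(results)
```

The last two lines mirror the metric defined above: successful games divided by total games.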
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/eval.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/eval.py
new file mode 100644
index 000000000..0a253c952
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/eval.py
@@ -0,0 +1,355 @@
+"""
+ALFWorld Evaluator - interactive text-game environment
+
+Evaluates an LLM in ALFWorld with a ReAct agent (few-shot + full history).
+Two backends are supported:
+  - vllm: local model inference (text completion, same as the original ReAct)
+  - api:  OpenAI-compatible API (chat completion)
+
+Official ReAct code: https://github.com/ysymyth/ReAct/blob/main/alfworld.ipynb
+"""
+import json
+import os
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Callable, Dict, List
+from rdagent.scenarios.rl.autorl_bench.core.evaluator import BaseEvaluator
+
+# Log directory
+LOG_DIR = Path(__file__).resolve().parent.parent.parent / "log"
+
+
+class _Tee:
+    """Write output to both the terminal and a log file."""
+    def __init__(self, filepath):
+        self.terminal = sys.__stdout__
+        self.log = open(filepath, "w", encoding="utf-8")
+    def write(self, message):
+        self.terminal.write(message)
+        self.log.write(message)
+        self.log.flush()
+    def flush(self):
+        self.terminal.flush()
+        self.log.flush()
+    def isatty(self):
+        return False
+    def fileno(self):
+        return self.terminal.fileno()
+
+
+def _log(msg: str):
+    """Simple print-based logging (the Tee also writes it to the log file)."""
+    print(msg, flush=True)
+
+
+# ============================================================
+# ReAct agent core logic (from the official alfworld.ipynb)
+# ============================================================
+
+# Mapping from task type to few-shot prompt key
+TASK_PREFIXES = {
+    "pick_and_place": "put",
+    "pick_clean_then_place": "clean",
+    "pick_heat_then_place": "heat",
+    "pick_cool_then_place": "cool",
+    "look_at_obj": "examine",
+    "pick_two_obj": "puttwo",
+}
+
+
+def process_ob(ob: str) -> str:
+    """Observation cleanup from the official ReAct code."""
+    if ob.startswith("You arrive at loc "):
+        ob = ob[ob.find(". ") + 2 :]
+    return ob
+
+
+def alfworld_run(llm_fn: Callable, env, prompt: str, ob: str, max_steps: int = 50) -> tuple:
+    """
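+    Single-game evaluation logic from the official ReAct code.
+
+    Args:
+        llm_fn: llm(prompt, stop) -> str
+        env: ALFWorld environment instance
+        prompt: few-shot prompt (contains 2 examples)
+        ob: initial observation
+        max_steps: maximum number of steps
+
+    Returns:
+        (reward, steps): reward=1 means success; steps is the number of steps actually taken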
+    """
+    init_prompt = prompt + ob + "\n>"
+    history = ""
+    for i in range(1, max_steps + 1):
+        action = llm_fn(init_prompt + history, stop=["\n"]).strip()
+        observation, reward, done, info = env.step([action])
+        observation = process_ob(observation[0])
+        reward = info["won"][0]
+        done = done[0]
+        if action.startswith("think:"):
+            observation = "OK."
+        _log(f"  Act {i}: {action}")
+        _log(f"  Obs {i}: {observation}")
+        history += f" {action}\n{observation}\n>"
+        if done:
+            return reward, i
+    return 0, max_steps
+
+
+# ============================================================
+# LLM backend factory
+# ============================================================
+
+def create_llm_fn(backend: str, model_path: str, **kwargs) -> tuple:
+    """
+    Create a unified llm(prompt, stop) function.
+
+    backend="vllm": local model, text completion (same behavior as the original ReAct)
+    backend="api":  OpenAI-compatible chat API
+
+    Returns:
+        (llm_fn, cleanup_fn): cleanup_fn releases GPU memory
+    """
+    if backend == "vllm":
+        from vllm import LLM, SamplingParams
+        from vllm.distributed.parallel_state import destroy_model_parallel
+
+        llm_engine = LLM(model=model_path, tensor_parallel_size=kwargs.get("tensor_parallel_size", 1), trust_remote_code=True)
+
+        def vllm_fn(prompt: str, stop: List[str] = None) -> str:
+            params = SamplingParams(temperature=0, max_tokens=100, stop=stop or ["\n"])
+            outputs = llm_engine.generate([prompt], params)
+            return outputs[0].outputs[0].text
+
+        def cleanup():
+            nonlocal llm_engine
+            import gc
+            import torch
+            destroy_model_parallel()
+            llm_engine = None
+            gc.collect()
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+            _log("vLLM engine released, GPU memory freed.")
+
+        return vllm_fn, cleanup
+
+    elif backend == "api":
+        from openai import OpenAI
+
+        client = OpenAI(
+            api_key=kwargs.get("api_key", os.getenv("OPENAI_API_KEY")),
+            base_url=kwargs.get("api_base", os.getenv("OPENAI_API_BASE")),
+        )
+        model_name = model_path
+
+        system_msg = (
+            "You are playing a text-based household game. "
+            "You will be given a task and interaction history. "
+            "Output ONLY the next action (e.g. 'go to desk 1', 'take mug 1 from desk 1', "
+            "'use desklamp 1', 'think: I need to find...') with NO extra text, "
+            "NO prefix like '>' or 'Action:', just the raw action string."
+        )
+
+        def api_fn(prompt: str, stop: List[str] = None) -> str:
+            response = client.chat.completions.create(
+                model=model_name,
+                messages=[
+                    {"role": "system", "content": system_msg},
+                    {"role": "user", "content": prompt},
+                ],
+                temperature=0,
+                max_tokens=100,
+                stop=stop or ["\n"],
+            )
+            text = response.choices[0].message.content or ""
+            text = text.strip()
+            if text.startswith("> "):
+                text = text[2:]
+            return text
+
+        return api_fn, lambda: None
+
+    else:
+        raise ValueError(f"Unknown backend: {backend}. Use 'vllm' or 'api'.")
+
+
+# ============================================================
+# Evaluator
+# ============================================================
+
+class ALFWorldEvaluator(BaseEvaluator):
+    """
+    ALFWorld evaluator (ReAct agent)
+
+    eval_config fields:
+        max_steps:    maximum steps per game (default 50)
+        env_num:      number of games to evaluate (default 134)
+        react_prompts: path to the ReAct few-shot prompts file
+        backend:      "vllm" or "api" (auto-detected by default)
+        api_key:      API key (when backend=api)
+        api_base:     API base URL (when backend=api)
+    """
+
+    def __init__(self, config):
+        self.config = config
+        self.benchmark_id = config.id
+        self.eval_config = config.eval_config or {}
+
+    def run_eval(
+        self,
+        model_path: str,
+        workspace_path: str,
+        **kwargs,
+    ) -> Dict[str, Any]:
+        """Run the ALFWorld evaluation."""
+        result = self.get_default_result(self.benchmark_id, model_path)
+        result["eval_type"] = "alfworld"
+
+        # Merge kwargs into eval_config (kwargs take precedence)
+        cfg = {**self.eval_config, **kwargs}
+        max_steps = cfg.get("max_steps", 50)
+        env_num = cfg.get("env_num", 134)
+
+        # --- Set up the log Tee ---
+        LOG_DIR.mkdir(parents=True, exist_ok=True)
+        model_safe = model_path.replace("/", "_")
+        log_file = LOG_DIR / f"alfworld_{model_safe}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
+        sys.stdout = _Tee(log_file)
+
+        # --- Determine the backend (a non-existent local path is treated as an API model name) ---
+        backend = cfg.get("backend")
+        if backend is None:
+            backend = "api" if not Path(model_path).exists() else "vllm"
+        _log(f"Log: {log_file}")
+        _log(f"ALFWorld eval: backend={backend}, model={model_path}")
+
+        # --- Create the LLM function ---
+        llm_fn, llm_cleanup = create_llm_fn(
+            backend=backend,
+            model_path=model_path,
+            api_key=cfg.get("api_key"),
+            api_base=cfg.get("api_base"),
+            tensor_parallel_size=cfg.get("tensor_parallel_size", 1),
+        )
+
+        # --- Load the ReAct few-shot prompts ---
+        prompts_path = cfg.get("react_prompts")
+        if prompts_path is None:
+            # Default path: react_prompts.json in the same directory as eval.py
+            prompts_path = Path(__file__).parent / "react_prompts.json"
+        with open(prompts_path) as f:
+            react_prompts = json.load(f)
+
+        # --- Ensure the ALFWorld game data is downloaded ---
+        self._ensure_alfworld_data()
+
+        # --- Initialize the ALFWorld environment ---
+        workspace = Path(workspace_path)
+
+        from rdagent.scenarios.rl.autorl_bench.benchmarks.alfworld.data import _ensure_alfworld_data
+        alfworld_data = str(_ensure_alfworld_data())
+        os.environ["ALFWORLD_DATA"] = alfworld_data
+
+        # env_config: read the official base_config.yaml in this directory and expand $ALFWORLD_DATA
+        config_yaml = Path(__file__).parent / "base_config.yaml"
+        with open(config_yaml) as f:
+            import yaml
+            env_config = yaml.safe_load(f)
+        env_config = self._expand_env_vars(env_config)
+
+        from alfworld.agents.environment import get_environment
+
+        split = cfg.get("split", "eval_out_of_distribution")
+        env_type = env_config.get("env", {}).get("type", "AlfredTWEnv")
+        alfred_env = get_environment(env_type)(env_config, train_eval=split)
+        env = alfred_env.init_env(batch_size=1)
+
+        num_games = min(env_num, alfred_env.num_games)
+        _log(f"ALFWorld: {num_games} games, max {max_steps} steps, split={split}")
+
+        # --- Evaluation loop (official ReAct logic) ---
+        cnts = [0] * 6
+        rs = [0] * 6
+
+        for game_no in range(num_games):
+            ob, info = env.reset()
+            ob = "\n".join(ob[0].split("\n\n")[1:])
+            name = "/".join(info["extra.gamefile"][0].split("/")[-3:-1])
+            _log(f"\n[Game {game_no + 1}/{num_games}] {name}")
+
+            matched = False
+            for i, (prefix, prompt_key) in enumerate(TASK_PREFIXES.items()):
+                if name.startswith(prefix):
+                    prompt = (
+                        "Interact with a household to solve a task. Here are two examples.\n"
+                        + react_prompts[f"react_{prompt_key}_1"]
+                        + react_prompts[f"react_{prompt_key}_0"]
+                        + "\nHere is the task.\n"
+                    )
+                    reward, steps = alfworld_run(llm_fn, env, prompt, ob, max_steps)
+                    rs[i] += reward
+                    cnts[i] += 1
+                    matched = True
+                    _log(f"  Result: {'WON' if reward else 'LOST'} ({steps} steps)")
+                    break
+
+            if not matched:
+                _log(f"  WARNING: Unknown task type: {name}, skipping")
+                continue
+
+            total_r, total_c = sum(rs), sum(cnts)
+            _log(f"  Running: {total_r}/{total_c} = {total_r / max(total_c, 1):.1%}")
+
+        env.close()
+        llm_cleanup()
+
+        # --- Aggregate results ---
+        total_success = sum(rs)
+        total_count = sum(cnts)
+        success_rate = total_success / total_count if total_count > 0 else 0.0
+
+        per_task = {}
+        for (prefix, _), s, c in zip(TASK_PREFIXES.items(), rs, cnts):
+            if c > 0:
+                per_task[prefix] = {"success": s, "total": c, "rate": s / c}
+
+        result["score"] = success_rate * 100
+        result["accuracy_summary"] = {
+            "success_count": total_success,
+            "total_count": total_count,
+            "success_rate": success_rate,
+            "per_task": per_task,
+        }
+
+        _log(f"\nALFWorld done: {total_success}/{total_count} = {success_rate:.2%}")
+        for prefix, stats in per_task.items():
+            _log(f"  {prefix:30s} {stats['success']}/{stats['total']} = {stats['rate']:.0%}")
+
+        # Restore stdout (not reached if an exception is raised above)
+        sys.stdout = sys.__stdout__
+
+        return result
+
+    @staticmethod
+    def _ensure_alfworld_data():
+        """Check for the ALFWorld game data (~2.1GB) and download it if missing."""
+        import subprocess
+        cache_dir = Path.home() / ".cache" / "alfworld"
+        if (cache_dir / "json_2.1.1").exists():
+            return
+        _log("Downloading ALFWorld game data (~2.1GB, first time only)...")
+        subprocess.run(["alfworld-download"], check=True)
+        _log(f"ALFWorld data downloaded to {cache_dir}")
+
+    def _expand_env_vars(self, obj):
+        """Recursively expand $ENV_VAR references."""
+        if isinstance(obj, str):
+            return os.path.expandvars(obj)
+        elif isinstance(obj, dict):
+            return {k: self._expand_env_vars(v) for k, v in obj.items()}
+        elif isinstance(obj, list):
+            return [self._expand_env_vars(x) for x in obj]
+        return obj
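Review note: `run_eval` assigns `sys.stdout = _Tee(log_file)` and restores it only on the success path, so an exception during evaluation would leave stdout redirected and the log file open. One possible hardening, sketched here under the assumption that a context manager is acceptable in this code base, is to scope the redirection with `try`/`finally`. The names `Tee` and `tee_stdout` are illustrative and do not appear in the patch.

```python
import contextlib
import sys
import tempfile
from pathlib import Path


class Tee:
    """Forward writes to both the original stream and an open log file."""

    def __init__(self, terminal, log):
        self.terminal = terminal
        self.log = log

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
        return len(message)

    def flush(self):
        self.terminal.flush()
        self.log.flush()


@contextlib.contextmanager
def tee_stdout(filepath):
    """Mirror stdout into filepath; always restore stdout and close the file."""
    original = sys.stdout
    with open(filepath, "w", encoding="utf-8") as log:
        sys.stdout = Tee(original, log)
        try:
            yield
        finally:
            sys.stdout = original


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "eval.log"
    before = sys.stdout
    try:
        with tee_stdout(log_path):
            print("Act 1: go to desk 1")
            raise RuntimeError("simulated evaluation failure")
    except RuntimeError:
        pass
    restored = sys.stdout is before
    logged = log_path.read_text(encoding="utf-8")
```

Even though the simulated failure aborts the block, stdout is restored and the line printed before the failure is still present in the log file.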
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/react_prompts.json b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/react_prompts.json
new file mode 100644
index 000000000..0e7c20481
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/react_prompts.json
@@ -0,0 +1 @@
+{"react_put_0": "You are in the middle of a room. Looking quickly around you, you see a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a countertop 1, a garbagecan 1, a handtowelholder 2, a handtowelholder 1, a sinkbasin 2, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.\nYour task is to: put some spraybottle on toilet.\n> think: To solve the task, I need to find and take a sparybottle, then put it on toilet.\nOK.\n> think: First I need to find a spraybottle. A spraybottle is more likely to appear in cabinet (1-4), countertop (1), toilet (1), sinkbasin (1-2), garbagecan (1). I can check one by one, starting with cabinet 1.\nOK.\n> go to cabinet 1\nOn the cabinet 1, you see a cloth 1, a soapbar 1, a soapbottle 1.\n> go to cabinet 2\nThe cabinet 2 is closed.\n> open cabinet 2\nYou open the cabinet 2. The cabinet 2 is open. In it, you see a candle 1, and a spraybottle 2.\n> think: Now I find a spraybottle (2). Next, I need to take it.\n> take spraybottle 2 from cabinet 2\nYou pick up the spraybottle 2 from the cabinet 2.\n> think: Now I take a spraybottle (2). Next, I need to put it in/on toilet 1.\nOK.\n> go to toilet 1\nOn the toilet 1, you see a soapbottle 2.\n> put spraybottle 2 in/on toilet 1\nYou put the spraybottle 2 in/on the toilet 1.\n", "react_put_1": "You are in the middle of a room. Looking quickly around you, you see a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 3, a diningtable 2, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sidetable 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: find some apple and put it in sidetable.\n> think: To solve the task, I need to find and take an apple, then put it in sidetable.\nOK.\n> think: First I need to find an apple. 
An apple is more likely to appear in fridges (1), diningtables (1-3), sidetables (1), countertops (1), sinkbasins (1), garbagecan (1). I can check one by one, starting with fridge 1.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a lettuce 2, a mug 2, a potato 2, and a tomato 1.\n> go to diningtable 1\nOn the diningtable 1, you see nothing.\n> go to diningtable 2\nOn the diningtable 2, you see a pineapple 1.\n> go to diningtable 3\nOn the diningtable 3, you see a bowl 2, a bowl 1, a bread 2, and a bread 1.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> go to countertop 1\nOn the countertop 1, you see a butterknife 1, a dishsponge 1, and a soapbottle 1.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.\n> go to garbagecan 1\nOn the garbagecan 1, you see a apple 3, and a egg 3.\n> think: Now I find an apple (3). Next, I need to take it.\n> take apple 3 from garbagecan 1\nYou pick up the apple 3 from the garbagecan 1.\n> think: Now I find an apple (3). Next, I need to put it in/on sidetable 1.\nOK.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> put apple 3 in/on sidetable 1\nYou put the apple 3 in/on the sidetable 1.\n", "react_put_2": "You are in the middle of a room. Looking quickly around you, you see a bathtubbasin 1, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a countertop 1, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a handtowelholder 1, a sinkbasin 2, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.\nYour task is to: put a soapbottle in garbagecan.\n> think: To solve the task, I need to find and take a soapbottle, and put it in garbagecan.\nOK.\n> think: First I need to find a soapbottle. 
A soapbottle is more likely to appear in toilet (1), sinkbasin (1-2), bathtubbasin (1), cabinet (1-4), countertop (1), dresser (1), drawer (1-4), garbagecan (1). I can check one by one, starting with toilet 1.\nOK.\n> go to toilet 1\nOn the toilet 1, you see nothing.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see nothing.\n> go to sinkbasin 2\nOn the sinkbasin 2, you see nothing.\n> go to bathtubbasin 1\nOn the bathtubbasin 1, you see a cloth 2.\n> go to cabinet 1\nThe cabinet 1 is closed.\n> open cabinet 1\nYou open the cabinet 1. The cabinet 1 is open. In it, you see nothing.\n> go to cabinet 2\nThe cabinet 2 is closed.\n> open cabinet 2\nYou open the cabinet 2. The cabinet 2 is open. In it, you see nothing.\n> go to cabinet 3\nThe cabinet 3 is closed.\n> open cabinet 3\nYou open the cabinet 3. The cabinet 3 is open. In it, you see a toiletpaper 2.\n> go to cabinet 4\nThe cabinet 4 is closed.\n> open cabinet 4\nYou open the cabinet 4. The cabinet 4 is open. In it, you see a spraybottle 2.\n> go to countertop 1\nOn the countertop 1, you see a soapbottle 2, and a spraybottle 1.\n> think: Now I find a soapbottle (2). Next, I need to take it.\nOK.\n> take soapbottle 2 from countertop 1\nYou pick up the soapbottle 2 from the countertop 1.\n> think: Now I take a soapbottle (2). Next, I need to put it in/on garbagecan 1.\nOK.\n> go to garbagecan 1\nOn the garbagecan 1, you see nothing.\n> put soapbottle 2 in/on garbagecan 1\nYou put the soapbottle 2 in/on the garbagecan 1.\n", "react_clean_0": "You are in the middle of a room. 
Looking quickly around you, you see a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put a clean lettuce in diningtable.\n> think: To solve the task, I need to find and take a lettuce, then clean it with sinkbasin, then put it in diningtable. \nOK.\n> think: First I need to find a lettuce. A lettuce is more likely to appear in fridge (1), diningtable (1), sinkbasin (1), stoveburner (1-3), cabinet (1-13). I can check one by one, starting with fridge 1.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a cup 3, a egg 2, a potato 3, and a potato 2.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 1, a bread 1, a butterknife 2, a cup 2, a fork 2, a knife 2, a knife 1, a ladle 1, a lettuce 1, a mug 2, a mug 1, a pan 2, a peppershaker 1, a spatula 3, a tomato 2, and a tomato 1.\n> think: Now I find a lettuce (1). Next, I need to take it.\nOK.\n> take lettuce 1 from diningtable 1\nYou pick up the lettuce 1 from the diningtable 1.\n> think: Now I take a lettuce (1). Next, I need to go to sinkbasin (1) and clean it.\nOK.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a apple 2, a ladle 2, a spoon 1, and a tomato 3.\n> clean lettuce 1 with sinkbasin 1\nYou clean the lettuce 1 using the sinkbasin 1.\n> think: Now I clean a lettuce (1). 
Next, I need to put it in/on diningtable 1.\nOK.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 1, a bread 1, a butterknife 2, a cup 2, a fork 2, a knife 2, a knife 1, a ladle 1, a mug 2, a mug 1, a pan 2, a peppershaker 1, a spatula 3, a tomato 2, and a tomato 1.\n> put lettuce 1 in/on diningtable 1\nYou put the lettuce 1 in/on the diningtable 1.\n", "react_clean_1": "You are in the middle of a room. Looking quickly around you, you see a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 3, a diningtable 2, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sidetable 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: clean some apple and put it in sidetable.\n> think: To solve the task, I need to find and take an apple, then clean it with sinkbasin, then put it in sidetable.\nOK.\n> think: First I need to find an apple. An apple is more likely to appear in fridges (1), diningtable (1-3), sidetable (1), countertop (1), sinkbasin (1), garbagecan (1). I can check one by one, starting with fridge 1.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. 
In it, you see a lettuce 2, a mug 2, a potato 2, and a tomato 1.\n> go to diningtable 1\nOn the diningtable 1, you see nothing.\n> go to diningtable 2\nOn the diningtable 2, you see a pineapple 1.\n> go to diningtable 3\nOn the diningtable 3, you see a bowl 2, a bowl 1, a bread 2, and a bread 1.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> go to countertop 1\nOn the countertop 1, you see a butterknife 1, a dishsponge 1, and a soapbottle 1.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.\n> go to garbagecan 1\nOn the garbagecan 1, you see a apple 3, and a egg 3.\n> think: Now I find an apple (3). Next, I need to take it.\nOK.\n> take apple 3 from garbagecan 1\nYou pick up the apple 3 from the garbagecan 1.\n> think: Now I take an apple (3). Next, I need to go to sinkbasin (1) and clean it.\nOK.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.\n> clean apple 3 with sinkbasin 1\nYou clean the apple 3 using the sinkbasin 1.\n> think: Now I clean an apple (3). Next, I need to put it in/on sidetable 1.\nOK.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> put apple 3 in/on sidetable 1\nYou put the apple 3 in/on the sidetable 1.\n", "react_clean_2": "You are in the middle of a room. Looking quickly around you, you see a bathtubbasin 1, a garbagecan 1, a handtowelholder 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.\nYour task is to: clean some soapbar and put it in toilet.\n> think: To solve the task, I need to find and take a soapbar, then clean it with sinkbasin, then put it in toilet.\nOK.\n> think: First I need to find a soapbar. A soapbar is more likely to appear in toilet (1), bathtubbasin (1), sinkbasin (1), shelf (1-3), garbagecan (1). 
I can check one by one, starting with toilet 1.\nOK.\n> go to toilet 1\nOn the toilet 1, you see a soapbar 4, and a soapbar 3.\n> think: Now I find a soapbar (4). Next, I need to take it.\nOK.\n> take soapbar 4 from toilet 1\nYou pick up the soapbar 4 from the toilet 1.\n> think: Now I take a soapbar (4). Next, I need to go to sinkbasin (1) and clean it.\nOK.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see nothing.\n> clean soapbar 4 with sinkbasin 1\nYou clean the soapbar 4 using the sinkbasin 1.\n> think: Now I clean a soapbar (4). Next, I need to put it in/on toilet 1.\nOK.\n> go to toilet 1\nOn the toilet 1, you see a soapbar 3.\n> put soapbar 4 in/on toilet 1\nYou put the soapbar 4 in/on the toilet 1.\n", "react_heat_0": "You are in the middle of a room. Looking quickly around you, you see a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a diningtable 1, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: heat some egg and put it in diningtable.\n> think: To solve the task, I need to find and take an egg, then heat it with microwave, then put it in diningtable.\nOK.\n> think: First I need to find an egg. An egg is more likely to appear in fridge (1), countertop (1-3), diningtable (1), stoveburner (1-4), toaster (1), garbagecan (1), cabinet (1-10). I can check one by one, starting with fridge 1.\nOK.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. 
In it, you see a lettuce 2, a mug 2, and a potato 3.\n> go to countertop 1\nOn the countertop 1, you see a bread 1, a fork 1, and a saltshaker 1.\n> go to countertop 2\nOn the countertop 2, you see nothing.\n> go to countertop 3\nOn the countertop 3, you see a bowl 1, a butterknife 1, a egg 2, a kettle 2, a plate 1, a sink 1, and a spatula 2.\n> think: Now I find an egg (2). Next, I need to take it.\nOK.\n> take egg 2 from countertop 3\nYou pick up the egg 2 from the countertop 3.\n> think: Now I take an egg (2). Next, I need go to a microwave (1) and heat it.\nOK.\n> go to microwave 1\nThe microwave 1 is closed.\n> heat egg 2 with microwave 1\nYou heat the egg 2 using the microwave 1.\n> think: Now I heat an egg (2). Next, I need to put it in/on diningtable 1.\nOK.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 2, a bread 3, a egg 1, a kettle 1, a knife 1, a mug 1, a papertowelroll 1, a peppershaker 2, a potato 1, a soapbottle 1, and a spatula 1.\n> put egg 2 in/on diningtable 1\nYou put the egg 2 in/on the diningtable 1.\n", "react_heat_1": "You are in the middle of a room. Looking quickly around you, you see a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put a hot apple in fridge.\n> think: To solve the task, I need to find and take an apple, then heat it with microwave, then put it in fridge.\nOK.\n> think: First I need to find an apple. An apple is more likely to appear in fridge (1), diningtable (1), coffeetable (1), drawer (1), cabinet (1-13), garbagecan (1). 
I can check one by one, starting with fridge 1.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a cup 1, and a egg 1.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 1, a bread 1, a fork 3, a fork 2, a fork 1, a mug 2, a peppershaker 3, a plate 2, a pot 1, a soapbottle 3, a spatula 1, a spoon 3, a spoon 2, a spoon 1, a winebottle 3, and a winebottle 2.\n> think: Now I find an apple (1). Next, I need to take it.\nOK.\n> take apple 1 from diningtable 1\nYou pick up the apple 1 from the diningtable 1.\n> think: Now I take an apple (1). Next, I need to go to a microwave (1) and heat it.\nOK.\n> go to microwave 1\nThe microwave 1 is closed.\n> heat apple 1 with microwave 1\nYou heat the apple 1 using the microwave 1.\n> think: Now I heat an apple (1). Next, I need to put it in/on fridge 1.\nOK.\n> go to fridge 1\nThe fridge 1 is open. In it, you see a cup 1, and a egg 1.\n> put apple 1 in/on fridge 1\nYou put the apple 1 in/on the fridge 1.\n", "react_heat_2": "You are in the middle of a room. Looking quickly around you, you see a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: heat some bread and put it in countertop.\n> think: To solve the task, I need to find and take a bread, then heat it with microwave, then put it in countertop.\nOK.\n> think: First I need to find a bread. A bread is more likely to appear in fridge (1), toaster (1), diningtable (1), countertop (1-2), drawer (1-3), cabinet (1-10). 
I can check one by one, starting with fridge 1.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a mug 3, and a mug 2.\n> go to toaster 1\nOn the toaster 1, you see nothing.\n> go to diningtable 1\nOn the diningtable 1, you see a bread 4, a bread 3, a bread 2, a bread 1, a cup 2, a knife 1, a lettuce 1, a potato 3, a tomato 3, a tomato 2, and a winebottle 3.\n> think: Now I find a bread (1). Next, I need to take it.\nOK.\n> take bread 1 from diningtable 1\nYou pick up the bread 1 from the diningtable 1.\n> think: Now I take a bread (1). Next, I need to go to a microwave (1) and heat it.\nOK.\n> go to microwave 1\nThe microwave 1 is closed.\n> heat bread 1 with microwave 1\nYou heat the bread 1 using the microwave 1.\n> think: Now I heat a bread (1). Next, I need to put it in/on countertop 1.\nOK.\n> go to countertop 1\nOn the countertop 1, you see a apple 1, a egg 1, a pan 2, a peppershaker 3, a plate 2, a plate 1, a soapbottle 1, and a winebottle 1.\n> put bread 1 in/on countertop 1\nYou put the bread 1 in/on the countertop 1.\n", "react_cool_0": "You are in the middle of a room. Looking quickly around you, you see a cabinet 16, a cabinet 15, a cabinet 14, a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 1, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a safe 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: cool some pan and put it in stoveburner.\n> think: To solve the task, I need to find and take a pan, then cool it with fridge, then put it in stoveburner.\nOK.\n> think: First I need to find a pan. 
An pan is more likely to appear in stoveburner (1-4), sinkbasin (1), diningtable (1), countertop (1-2), cabinet (1-16), drawer (1-5). I can check one by one, starting with stoveburner 1.\nOK.\n>go to stoveburner 1\nOn the stoveburner 1, you see nothing.\n>go to stoveburner 2\nOn the stoveburner 2, you see a pot 1.\n> go to stoveburner 3\nOn the stoveburner 3, you see a pan 1.\n> take pan 1 from stoveburner 3\n> think: Now I find a pan (1). Next, I need to take it.\nOK.\nYou pick up the pan 1 from the stoveburner 3.\n> think: Now I take a pan (1). Next, I need to go to a fridge (1) and cool it.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> cool pan 1 with fridge 1\nYou cool the pan 1 using the fridge 1.\n> think: Now I cool a pan (1). Next, I need to put it in/on stoveburner 1.\nOK.\n>go to stoveburner 1\nOn the stoveburner 1, you see nothing.\n> put pan 1 in/on stoveburner 1\nYou put the pan 1 in/on the stoveburner 1.\n", "react_cool_1": "You are in the middle of a room. Looking quickly around you, you see a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a drawer 9, a drawer 8, a drawer 7, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put a cool mug in shelf.\n> think: To solve the task, I need to find and take a mug, then cool it with fridge, then put it in shelf.\nOK.\n> think: First I need to find a mug. A mug is more likely to appear in countertop (1-3), coffeemachine (1), cabinet (1-9), shelf (1-3), drawer (1-9). 
I can check one by one, starting with countertop 1.\nOK.\n> go to countertop 1\nOn the countertop 1, you see a knife 1, a pan 2, a pan 1, and a plate 1.\n> go to countertop 2\nOn the countertop 2, you see a apple 2, a butterknife 2, a fork 3, a houseplant 1, a knife 2, a lettuce 3, a spatula 1, and a winebottle 1.\n> go to countertop 3\nOn the countertop 3, you see a apple 1, a book 1, a bread 1, a butterknife 3, a glassbottle 2, a glassbottle 1, a knife 3, a lettuce 2, a lettuce 1, a papertowelroll 1, a plate 3, a spoon 1, a statue 1, and a vase 2.\n> go to coffeemachine 1\nOn the coffeemachine 1, you see nothing.\n> go to cabinet 1\nThe cabinet 1 is closed.\n> open cabinet 1\nYou open the cabinet 1. The cabinet 1 is open. In it, you see nothing.\n> go to cabinet 2\nThe cabinet 2 is closed.\n> open cabinet 2\nYou open the cabinet 2. The cabinet 2 is open. In it, you see nothing.\n> go to cabinet 3\nThe cabinet 3 is closed.\n> open cabinet 3\nYou open the cabinet 3. The cabinet 3 is open. In it, you see a pot 1.\n> go to cabinet 4\nOn the cabinet 4, you see nothing.\n> go to cabinet 5\nThe cabinet 5 is closed.\n> open cabinet 5\nYou open the cabinet 5. The cabinet 5 is open. In it, you see nothing.\n> go to cabinet 6\nThe cabinet 6 is closed.\n> open cabinet 6\nYou open the cabinet 6. The cabinet 6 is open. In it, you see a mug 3.\n> think: Now I find a mug (3). Next, I need to take it.\nOK.\n> take mug 3 from cabinet 6\nYou pick up the mug 3 from the cabinet 6.\n> think: Now I take a mug (3). Next, I need to go to a fridge (1) and cool it.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> cool mug 3 with fridge 1\nYou cool the mug 3 using the fridge 1.\n> think: Now I cool the mug. Next, I need to put it in/on shelf 1.\nOK.\n> go to shelf 1\nOn the shelf 1, you see a cup 2, a cup 1, a peppershaker 1, and a saltshaker 1.\n> put mug 3 in/on shelf 1\nYou put the mug 3 in/on the shelf 1.\n", "react_cool_2": "You are in the middle of a room. 
Looking quickly around you, you see a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: cool some potato and put it in diningtable.\n> think: To solve the task, I need to find and take a potato, then cool it with fridge, then put it in diningtable.\nOK.\n> think: First I need to find a potato. A potato is more likely to appear in fridge (1), diningtable (1), countertop (1-2), cabinet (1-12), drawer (1-3), garbagecan (1). I can check one by one, starting with fridge 1.\nOK.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a apple 1, a bowl 1, a lettuce 2, and a lettuce 1.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 3, a bread 1, a butterknife 3, a egg 2, a fork 1, a knife 2, a knife 1, a mug 1, a peppershaker 3, a plate 1, a potato 4, a potato 2, a saltshaker 2, a soapbottle 2, and a tomato 3.\n> think: Now I find a potato (2). Next, I need to take it.\nOK.\n> take potato 2 from diningtable 1\nYou pick up the potato 2 from the diningtable 1.\n> think: Now I take a potato (2). Next, I need to go to a fridge (1) and cool it.\nOK.\n> go to fridge 1\nThe fridge 1 is open. In it, you see a apple 1, a bowl 1, a lettuce 2, and a lettuce 1.\n> cool potato 2 with fridge 1\nYou cool the potato 2 using the fridge 1.\n> think: Now I cool a potato (2). 
Next, I need to put it in/on diningtable 1.\nOK.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 3, a bread 1, a butterknife 3, a egg 2, a fork 1, a knife 2, a knife 1, a mug 1, a peppershaker 3, a plate 1, a potato 4, a saltshaker 2, a soapbottle 2, and a tomato 3.\n> put potato 2 in/on diningtable 1\nYou put the potato 2 in/on the diningtable 1.\n", "react_puttwo_0": "You are in the middle of a room. Looking quickly around you, you see a armchair 2, a armchair 1, a bed 1, a countertop 1, a diningtable 1, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a laundryhamper 1, and a sidetable 1.\nYour task is to: put two creditcard in dresser.\n> think: To solve the task, I need to find and take the first creditcard, then put it in dresser, then find and take the second creditcard, then put it in dresser.\nOK.\n> think: First I need to find the first creditcard. A creditcard is more likely to appear in drawer (1-2), coutertop (1), sidetable (1), diningtable (1), armchair (1-2), bed (1). I can check one by one, starting with drawer 1.\nOK.\n> go to drawer 1\nThe drawer 1 is closed.\n> open drawer 1\nYou open the drawer 1. The drawer 1 is open. In it, you see a book 1, a cd 1, and a pen 1.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. In it, you see nothing.\n> go to countertop 1\nOn the countertop 1, you see a cellphone 2, a creditcard 4, a creditcard 3, a creditcard 2, a mirror 1, a pencil 2, and a pencil 1.\n> think: Now I find the first creditcard (2). Next, I need to take it. I can find the second creditcard (3) later in countertop 1.\nOK.\n> take creditcard 2 from countertop 1\nYou pick up the creditcard 2 from the countertop 1.\n> think: Now I take the first creditcard (2). 
Next, I need to put it in/on dresser 1.\nOK.\n> go to dresser 1\nOn the dresser 1, you see a mug 1, and a television 1.\n> put creditcard 2 in/on dresser 1\nYou put the creditcard 2 in/on the dresser 1.\n> think: Now I put the first creditcard in dresser. Next, I need to find the second creditcard. I can directly go to countertop 1.\nOK.\n> go to countertop 1\nOn the countertop 1, you see a cellphone 2, a creditcard 4, a creditcard 3, a mirror 1, a pencil 2, and a pencil 1.\n> think: Now I find the second creditcard (3). Next, I need to take it.\nOK.\n> take creditcard 3 from countertop 1\nYou pick up the creditcard 3 from the countertop 1.\n> think: Now I take the second creditcard (3). Next, I need to put it in/on dresser 1.\nOK.\n> go to dresser 1\nOn the dresser 1, you see a creditcard 2, a mug 1, and a television 1.\n> put creditcard 3 in/on dresser 1\nYou put the creditcard 3 in/on the dresser 1.\n", "react_puttwo_1": "You are in the middle of a room. Looking quickly around you, you see a coffeetable 1, a diningtable 1, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a sidetable 2, a sidetable 1, and a sofa 1.\nYour task is to: put two cellphone in sofa.\n> think: To solve the task, I need to find and take the first cellphone, then put it in sofa, then find and take the second cellphone, then put it in sofa.\nOK.\n> think: First I need to find the first cellphone. A cellphone is more likely to appear in coffeetable (1), diningtable (1), sidetable (1-2), drawer (1-4), sofa (1), dresser (1), garbagecan (1). I can check one by one, starting with coffeetable 1.\nOK.\n> go to coffeetable 1\nOn the coffeetable 1, you see a box 1, a cellphone 3, and a plate 1.\n> take cellphone 3 from coffeetable 1\nYou pick up the cellphone 3 from the coffeetable 1.\n> think: Now I find the first cellphone (3). 
Next, I need to put it in/on sofa 1.\nOK.\n> go to sofa 1\nOn the sofa 1, you see a newspaper 2, a pillow 1, a remotecontrol 2, and a remotecontrol 1.\n> put cellphone 3 in/on sofa 1\nYou put the cellphone 3 in/on the sofa 1.\n> think: Now I put the first cellphone in sofa. Next, I need to find the second cellphone. I have visited coffeetable (1), so the second cellphone is more likely to appear in diningtable (1), sidetable (1-2), drawer (1-4), sofa (1), dresser (1), garbagecan (1). I can check one by one, starting with coffeetable 1.\nOK.\n> go to diningtable 1\nOn the diningtable 1, you see a cellphone 2, a keychain 2, a laptop 1, a statue 2, and a statue 1.\n> think: Now I find the second cellphone (2). Next, I need to take it.\nOK.\n> take cellphone 2 from diningtable 1\nYou pick up the cellphone 2 from the diningtable 1.\n> think: Now I take the second cellphone (2). Next, I need to put it in/on sofa 1.\nOK.\n> go to sofa 1\nOn the sofa 1, you see a cellphone 3, a newspaper 2, a pillow 1, a remotecontrol 2, and a remotecontrol 1.\n> put cellphone 2 in/on sofa 1\nYou put the cellphone 2 in/on the sofa 1.\n", "react_puttwo_2": "You are in the middle of a room. 
Looking quickly around you, you see a cabinet 20, a cabinet 19, a cabinet 18, a cabinet 17, a cabinet 16, a cabinet 15, a cabinet 14, a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put two saltshaker in drawer.\n> think: To solve the task, I need to find and take the first saltshaker, then put it in drawer, then find and take the second salkshaker, then put it in drawer.\nOK.\n> think: First I need to find the first saltshaker. A saltshaker is more likely to appear in drawer (1-3), countertop (1-3), shelf (1-3), cabinet (1-20). I can check one by one, starting with drawer 1.\nOK.\n> go to drawer 1\nThe drawer 1 is closed.\n> open drawer 1\nYou open the drawer 1. The drawer 1 is open. In it, you see nothing.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. In it, you see a dishsponge 3.\n> go to drawer 3\nThe drawer 3 is closed.\n> open drawer 3\nYou open the drawer 3. The drawer 3 is open. In it, you see a fork 2, a ladle 2, and a spoon 1.\n> go to countertop 1\nOn the countertop 1, you see a bowl 1, a butterknife 1, a mug 2, a pan 1, a papertowelroll 1, a spatula 1, and a statue 3.\n> go to countertop 2\nOn the countertop 2, you see a bread 1, a butterknife 2, a dishsponge 2, a fork 1, a kettle 1, a knife 1, a ladle 1, a spatula 3, a spoon 2, a statue 2, and a vase 3.\n> go to countertop 3\nOn the countertop 3, you see a bowl 2, a houseplant 1, a knife 2, a peppershaker 1, a saltshaker 4, a soapbottle 1, a spatula 2, and a tomato 3.\n> think: Now I find the first saltshaker (4). 
Next, I need to take it.\nOK.\n> take saltshaker 4 from countertop 3\nYou pick up the saltshaker 4 from the countertop 3.\n> think: Now I take the first saltshaker (4). Next, I need to put it in/on drawer 1.\nOK.\n> go to drawer 1\nThe drawer 1 is open. In it, you see nothing.\n> put saltshaker 4 in/on drawer 1\nYou put the saltshaker 4 in/on the drawer 1.\n> think: Now I put the first saltshaker in drawer. Next, I need to find the second saltshaker. I have visited drawer (1-3), countertop (1-3), so the second saltshaker is more likely to appear in shelf (1-3), cabinet (1-20). I can check one by one, starting with shelf 1.\nOK.\n> go to shelf 1\nOn the shelf 1, you see a statue 1.\n> go to shelf 2\nOn the shelf 2, you see a dishsponge 1, and a vase 1.\n> go to shelf 3\nOn the shelf 3, you see nothing.\n> go to cabinet 1\nThe cabinet 1 is closed.\n> open cabinet 1\nYou open the cabinet 1. The cabinet 1 is open. In it, you see nothing.\n> go to cabinet 2\nNothing happens.\n> look\nYou are facing the cabinet 2, and cabinet 1. Next to it, you see nothing.\n> go to cabinet 3\nThe cabinet 3 is closed.\n> open cabinet 3\nYou open the cabinet 3. The cabinet 3 is open. In it, you see a saltshaker 3, and a saltshaker 2.\n> think: Now I find the second saltshaker (2). Next, I need to take it.\nOK.\n> take saltshaker 2 from cabinet 3\nYou pick up the saltshaker 2 from the cabinet 3.\n> think: Now I take the second saltshaker (2). Next, I need to put it in/on drawer 1.\nOK.\n> go to drawer 1\nThe drawer 1 is open. In it, you see a saltshaker 4.\n> put saltshaker 2 in/on drawer 1\nYou put the saltshaker 2 in/on the drawer 1.\n", "react_examine_0": "You are in the middle of a room. 
Looking quickly around you, you see a bed 1, a desk 1, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a safe 1, a shelf 5, a shelf 4, a shelf 3, a shelf 2, a shelf 1, a sidetable 2, and a sidetable 1.\nYour task is to: look at bowl under the desklamp.\n> think: To solve the task, I need to find and take a bowl, then find and use a desklamp. \nOK. \n> think: First I need to find a bowl. A bowl is more likely to appear in drawer (1-3), desk (1), sidetable (1-2), shelf (1-5), garbagecan (1). I can check one by one, starting with drawer 1.\nOK.\n> go to drawer 1\nOn the drawer 1, you see nothing.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. In it, you see nothing.\n> go to drawer 3\nThe drawer 3 is closed.\n> open drawer 3\nYou open the drawer 3. The drawer 3 is open. In it, you see nothing.\n> go to desk 1\nOn the desk 1, you see a alarmclock 2, a book 1, a cellphone 1, a keychain 1, a laptop 2, a laptop 1, and a pen 3.\n> go to sidetable 1\nOn the sidetable 1, you see a cd 1, a pen 1, and a pencil 1.\n> go to shelf 1\nOn the shelf 1, you see nothing.\n> go to shelf 2\nOn the shelf 2, you see a bowl 1.\n> think: Now I find a bowl (1). Next, I need to take it.\nOK.\n> take bowl 1 from shelf 2\nYou pick up the bowl 1 from the shelf 2.\n> think: Now I take a bowl (1). Next, I need to find a desklamp. A desklamp is more likely to appear in desk (1), sidetable (1-2), shelf (1-5), bed (1), drawer (1-3). I can check one by one, starting with desk 1.\nOK.\n> go to desk 1\nOn the desk 1, you see a alarmclock 2, a book 1, a cellphone 1, a keychain 1, a laptop 2, a laptop 1, and a pen 3.\n> go to sidetable 1\nOn the sidetable 1, you see a cd 1, a pen 1, and a pencil 1.\n> go to sidetable 2\nOn the sidetable 2, you see a alarmclock 1, a desklamp 1, and a pen 2.\n> think: Now I find a desklamp (1). 
Next, I need to use it.\nOK.\n> use desklamp 1\nYou turn on the desklamp 1.\n", "react_examine_1": "You are in the middle of a room. Looking quickly around you, you see a bed 1, a drawer 10, a drawer 9, a drawer 8, a drawer 7, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a shelf 9, a shelf 8, a shelf 7, a shelf 6, a shelf 5, a shelf 4, a shelf 3, a shelf 2, and a shelf 1.\nYour task is to: examine the pen with the desklamp.\n> think: To solve the task, I need to find and take a pen, then find and use a desklamp. First I need to find a pen. A pen is more likely to appear in drawer (1-10), shelf (1-9), bed (1), garbagecan (1). I can check one by one, starting with drawer 1.\nOK.\n> go to drawer 1\nThe drawer 1 is closed.\n> open drawer 1\nYou open the drawer 1. The drawer 1 is open. In it, you see nothing.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. In it, you see nothing.\n> go to drawer 3\nThe drawer 3 is closed.\n> open drawer 3\nYou open the drawer 3. The drawer 3 is open. In it, you see nothing.\n> go to drawer 4\nThe drawer 4 is closed.\n> open drawer 4\nYou open the drawer 4. The drawer 4 is open. In it, you see a cd 3, a keychain 1, and a pen 2.\n> think: Now I find a pen (2). Next, I need to take it.\nOK.\n> take pen 2 from drawer 4\nYou pick up the pen 2 from the drawer 4.\n> think: Now I take a pen (2). Next, I need to find a desklamp. A desklamp is more likely to appear in dresser (1), shelf (1-9), bed (1), garbagecan (1), drawer (1-10). I can check one by one, starting with dresser 1.\nOK.\n> go to dresser 1\nOn the dresser 1, you see a alarmclock 2, a alarmclock 1, a book 1, a cd 1, a creditcard 1, a desklamp 1, a keychain 2, a pen 3, and a pen 1.\n> think: Now I find a desklamp (1). Next, I need to use it.\nOK.\n> use desklamp 1\nYou turn on the desklamp 1.\n", "react_examine_2": "You are in the middle of a room. 
Looking quickly around you, you see a coffeetable 1, a diningtable 1, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a sidetable 2, a sidetable 1, and a sofa 1.\nYour task is to: look at statue under the desklamp.\n> think: To solve the task, I need to find and take a statue, then find and use a desklamp.\nOK.\n> think: First I need to find a statue. A statue is more likely to appear in dresser (1), coffeetable (1), sidetable (1-2), diningtable (1), sofa (1), garbagecan (1). I can check one by one, starting with dresser 1.\nOK.\n> go to dresser 1\nOn the dresser 1, you see a cellphone 3, a newspaper 2, a statue 1, and a television 1.\n> think: Now I find a statue (1). Next, I need to take it.\nOK.\n> take statue 1 from dresser 1\nYou pick up the statue 1 from the dresser 1.\n> think: Now I take a statue (1). Next, I need to find a desklamp. A desklamp is more likely to appear in sidetable (1-2), coffeetable (1), diningtable (1). I can check one by one, starting with sidetable 1.\nOK.\n> go to sidetable 1\nOn the sidetable 1, you see nothing.\n> go to sidetable 2\nOn the sidetable 2, you see a desklamp 3, a newspaper 1, and a statue 2.\n> think: Now I find a desklamp (3). Next, I need to use it.\nOK.\n> use desklamp 3\nYou turn on the desklamp 3.\n", "act_put_0": "You are in the middle of a room. Looking quickly around you, you see a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a countertop 1, a garbagecan 1, a handtowelholder 2, a handtowelholder 1, a sinkbasin 2, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.\nYour task is to: put some spraybottle on toilet.\n> go to cabinet 1\nOn the cabinet 1, you see a cloth 1, a soapbar 1, a soapbottle 1.\n> go to cabinet 2\nThe cabinet 2 is closed.\n> open cabinet 2\nYou open the cabinet 2. The cabinet 2 is open. 
In it, you see a candle 1, and a spraybottle 2.\n> take spraybottle 2 from cabinet 2\nYou pick up the spraybottle 2 from the cabinet 2.\n> go to toilet 1\nOn the toilet 1, you see a soapbottle 2.\n> put spraybottle 2 in/on toilet 1\nYou put the spraybottle 2 in/on the toilet 1.\n", "act_put_1": "You are in the middle of a room. Looking quickly around you, you see a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 3, a diningtable 2, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sidetable 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: find some apple and put it in sidetable.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a lettuce 2, a mug 2, a potato 2, and a tomato 1.\n> go to diningtable 1\nOn the diningtable 1, you see nothing.\n> go to diningtable 2\nOn the diningtable 2, you see a pineapple 1.\n> go to diningtable 3\nOn the diningtable 3, you see a bowl 2, a bowl 1, a bread 2, and a bread 1.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> go to countertop 1\nOn the countertop 1, you see a butterknife 1, a dishsponge 1, and a soapbottle 1.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.\n> go to garbagecan 1\nOn the garbagecan 1, you see a apple 3, and a egg 3.\n> take apple 3 from garbagecan 1\nYou pick up the apple 3 from the garbagecan 1.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> put apple 3 in/on sidetable 1\nYou put the apple 3 in/on the sidetable 1.\n", "act_put_2": "You are in the middle of a room. 
Looking quickly around you, you see a bathtubbasin 1, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a countertop 1, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a handtowelholder 1, a sinkbasin 2, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.\nYour task is to: put a soapbottle in garbagecan.\n> go to toilet 1\nOn the toilet 1, you see nothing.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see nothing.\n> go to sinkbasin 2\nOn the sinkbasin 2, you see nothing.\n> go to bathtubbasin 1\nOn the bathtubbasin 1, you see a cloth 2.\n> go to cabinet 1\nThe cabinet 1 is closed.\n> open cabinet 1\nYou open the cabinet 1. The cabinet 1 is open. In it, you see nothing.\n> go to cabinet 2\nThe cabinet 2 is closed.\n> open cabinet 2\nYou open the cabinet 2. The cabinet 2 is open. In it, you see nothing.\n> go to cabinet 3\nThe cabinet 3 is closed.\n> open cabinet 3\nYou open the cabinet 3. The cabinet 3 is open. In it, you see a toiletpaper 2.\n> go to cabinet 4\nThe cabinet 4 is closed.\n> open cabinet 4\nYou open the cabinet 4. The cabinet 4 is open. In it, you see a spraybottle 2.\n> go to countertop 1\nOn the countertop 1, you see a soapbottle 2, and a spraybottle 1.\n> take soapbottle 2 from countertop 1\nYou pick up the soapbottle 2 from the countertop 1.\n> go to garbagecan 1\nOn the garbagecan 1, you see nothing.\n> put soapbottle 2 in/on garbagecan 1\nYou put the soapbottle 2 in/on the garbagecan 1.\n", "act_clean_0": "You are in the middle of a room. 
Looking quickly around you, you see a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put a clean lettuce in diningtable.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a cup 3, a egg 2, a potato 3, and a potato 2.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 1, a bread 1, a butterknife 2, a cup 2, a fork 2, a knife 2, a knife 1, a ladle 1, a lettuce 1, a mug 2, a mug 1, a pan 2, a peppershaker 1, a spatula 3, a tomato 2, and a tomato 1.\n> take lettuce 1 from diningtable 1\nYou pick up the lettuce 1 from the diningtable 1.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a apple 2, a ladle 2, a spoon 1, and a tomato 3.\n> clean lettuce 1 with sinkbasin 1\nYou clean the lettuce 1 using the sinkbasin 1.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 1, a bread 1, a butterknife 2, a cup 2, a fork 2, a knife 2, a knife 1, a ladle 1, a mug 2, a mug 1, a pan 2, a peppershaker 1, a spatula 3, a tomato 2, and a tomato 1.\n> put lettuce 1 in/on diningtable 1\nYou put the lettuce 1 in/on the diningtable 1.\n", "act_clean_1": "You are in the middle of a room. 
Looking quickly around you, you see a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 3, a diningtable 2, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sidetable 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: clean some apple and put it in sidetable.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a lettuce 2, a mug 2, a potato 2, and a tomato 1.\n> go to diningtable 1\nOn the diningtable 1, you see nothing.\n> go to diningtable 2\nOn the diningtable 2, you see a pineapple 1.\n> go to diningtable 3\nOn the diningtable 3, you see a bowl 2, a bowl 1, a bread 2, and a bread 1.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> go to countertop 1\nOn the countertop 1, you see a butterknife 1, a dishsponge 1, and a soapbottle 1.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.\n> go to garbagecan 1\nOn the garbagecan 1, you see a apple 3, and a egg 3.\n> take apple 3 from garbagecan 1\nYou pick up the apple 3 from the garbagecan 1.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.\n> clean apple 3 with sinkbasin 1\nYou clean the apple 3 using the sinkbasin 1.\n> go to sidetable 1\nOn the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.\n> put apple 3 in/on sidetable 1\nYou put the apple 3 in/on the sidetable 1.\n", "act_clean_2": "You are in the middle of a room. 
Looking quickly around you, you see a bathtubbasin 1, a garbagecan 1, a handtowelholder 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a toilet 1, a toiletpaperhanger 1, and a towelholder 1.\nYour task is to: clean some soapbar and put it in toilet.\n> go to toilet 1\nOn the toilet 1, you see a soapbar 4, and a soapbar 3.\n> take soapbar 4 from toilet 1\nYou pick up the soapbar 4 from the toilet 1.\n> go to sinkbasin 1\nOn the sinkbasin 1, you see nothing.\n> clean soapbar 4 with sinkbasin 1\nYou clean the soapbar 4 using the sinkbasin 1.\n> go to toilet 1\nOn the toilet 1, you see a soapbar 3.\n> put soapbar 4 in/on toilet 1\nYou put the soapbar 4 in/on the toilet 1.\n", "act_heat_0": "You are in the middle of a room. Looking quickly around you, you see a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a diningtable 1, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: heat some egg and put it in diningtable.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. 
In it, you see a lettuce 2, a mug 2, and a potato 3.\n> go to countertop 1\nOn the countertop 1, you see a bread 1, a fork 1, and a saltshaker 1.\n> go to countertop 2\nOn the countertop 2, you see nothing.\n> go to countertop 3\nOn the countertop 3, you see a bowl 1, a butterknife 1, a egg 2, a kettle 2, a plate 1, a sink 1, and a spatula 2.\n> take egg 2 from countertop 3\nYou pick up the egg 2 from the countertop 3.\n> go to microwave 1\nThe microwave 1 is closed.\n> heat egg 2 with microwave 1\nYou heat the egg 2 using the microwave 1.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 2, a bread 3, a egg 1, a kettle 1, a knife 1, a mug 1, a papertowelroll 1, a peppershaker 2, a potato 1, a soapbottle 1, and a spatula 1.\n> put egg 2 in/on diningtable 1\nYou put the egg 2 in/on the diningtable 1.\n", "act_heat_1": "You are in the middle of a room. Looking quickly around you, you see a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put a hot apple in fridge.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. 
In it, you see a cup 1, and a egg 1.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 1, a bread 1, a fork 3, a fork 2, a fork 1, a mug 2, a peppershaker 3, a plate 2, a pot 1, a soapbottle 3, a spatula 1, a spoon 3, a spoon 2, a spoon 1, a winebottle 3, and a winebottle 2.\n> take apple 1 from diningtable 1\nYou pick up the apple 1 from the diningtable 1.\n> go to microwave 1\nThe microwave 1 is closed.\n> heat apple 1 with microwave 1\nYou heat the apple 1 using the microwave 1.\n> go to fridge 1\nThe fridge 1 is open. In it, you see a cup 1, and a egg 1.\n> put apple 1 in/on fridge 1\nYou put the apple 1 in/on the fridge 1.\n", "act_heat_2": "You are in the middle of a room. Looking quickly around you, you see a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: heat some bread and put it in countertop.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. 
In it, you see a mug 3, and a mug 2.\n> go to toaster 1\nOn the toaster 1, you see nothing.\n> go to diningtable 1\nOn the diningtable 1, you see a bread 4, a bread 3, a bread 2, a bread 1, a cup 2, a knife 1, a lettuce 1, a potato 3, a tomato 3, a tomato 2, and a winebottle 3.\n> take bread 1 from diningtable 1\nYou pick up the bread 1 from the diningtable 1.\n> go to microwave 1\nThe microwave 1 is closed.\n> heat bread 1 with microwave 1\nYou heat the bread 1 using the microwave 1.\n> go to countertop 1\nOn the countertop 1, you see a apple 1, a egg 1, a pan 2, a peppershaker 3, a plate 2, a plate 1, a soapbottle 1, and a winebottle 1.\n> put bread 1 in/on countertop 1\nYou put the bread 1 in/on the countertop 1.\n", "act_cool_0": "You are in the middle of a room. Looking quickly around you, you see a cabinet 16, a cabinet 15, a cabinet 14, a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 1, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a safe 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: cool some pan and put it in stoveburner.\n>go to stoveburner 1\nOn the stoveburner 1, you see nothing.\n>go to stoveburner 2\nOn the stoveburner 2, you see a pot 1.\n> go to stoveburner 3\nOn the stoveburner 3, you see a pan 1.\n> take pan 1 from stoveburner 3\nYou pick up the pan 1 from the stoveburner 3.\n> go to fridge 1\nThe fridge 1 is closed.\n> cool pan 1 with fridge 1\nYou cool the pan 1 using the fridge 1.\n>go to stoveburner 1\nOn the stoveburner 1, you see nothing.\n> put pan 1 in/on stoveburner 1\nYou put the pan 1 in/on the stoveburner 1.\n", "act_cool_1": "You are in the middle of a room. 
Looking quickly around you, you see a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a drawer 9, a drawer 8, a drawer 7, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put a cool mug in shelf.\n> go to countertop 1\nOn the countertop 1, you see a knife 1, a pan 2, a pan 1, and a plate 1.\n> go to countertop 2\nOn the countertop 2, you see a apple 2, a butterknife 2, a fork 3, a houseplant 1, a knife 2, a lettuce 3, a spatula 1, and a winebottle 1.\n> go to countertop 3\nOn the countertop 3, you see a apple 1, a book 1, a bread 1, a butterknife 3, a glassbottle 2, a glassbottle 1, a knife 3, a lettuce 2, a lettuce 1, a papertowelroll 1, a plate 3, a spoon 1, a statue 1, and a vase 2.\n> go to coffeemachine 1\nOn the coffeemachine 1, you see nothing.\n> go to cabinet 1\nThe cabinet 1 is closed.\n> open cabinet 1\nYou open the cabinet 1. The cabinet 1 is open. In it, you see nothing.\n> go to cabinet 2\nThe cabinet 2 is closed.\n> open cabinet 2\nYou open the cabinet 2. The cabinet 2 is open. In it, you see nothing.\n> go to cabinet 3\nThe cabinet 3 is closed.\n> open cabinet 3\nYou open the cabinet 3. The cabinet 3 is open. In it, you see a pot 1.\n> go to cabinet 4\nOn the cabinet 4, you see nothing.\n> go to cabinet 5\nThe cabinet 5 is closed.\n> open cabinet 5\nYou open the cabinet 5. The cabinet 5 is open. In it, you see nothing.\n> go to cabinet 6\nThe cabinet 6 is closed.\n> open cabinet 6\nYou open the cabinet 6. The cabinet 6 is open. 
In it, you see a mug 3.\n> take mug 3 from cabinet 6\nYou pick up the mug 3 from the cabinet 6.\n> go to fridge 1\nThe fridge 1 is closed.\n> cool mug 3 with fridge 1\nYou cool the mug 3 using the fridge 1.\n> go to shelf 1\nOn the shelf 1, you see a cup 2, a cup 1, a peppershaker 1, and a saltshaker 1.\n> put mug 3 in/on shelf 1\nYou put the mug 3 in/on the shelf 1.\n", "act_cool_2": "You are in the middle of a room. Looking quickly around you, you see a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 2, a countertop 1, a diningtable 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: cool some potato and put it in diningtable.\n> go to fridge 1\nThe fridge 1 is closed.\n> open fridge 1\nYou open the fridge 1. The fridge 1 is open. In it, you see a apple 1, a bowl 1, a lettuce 2, and a lettuce 1.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 3, a bread 1, a butterknife 3, a egg 2, a fork 1, a knife 2, a knife 1, a mug 1, a peppershaker 3, a plate 1, a potato 4, a potato 2, a saltshaker 2, a soapbottle 2, and a tomato 3.\n> take potato 2 from diningtable 1\nYou pick up the potato 2 from the diningtable 1.\n> go to fridge 1\nThe fridge 1 is open. In it, you see a apple 1, a bowl 1, a lettuce 2, and a lettuce 1.\n> cool potato 2 with fridge 1\nYou cool the potato 2 using the fridge 1.\n> go to diningtable 1\nOn the diningtable 1, you see a apple 3, a bread 1, a butterknife 3, a egg 2, a fork 1, a knife 2, a knife 1, a mug 1, a peppershaker 3, a plate 1, a potato 4, a saltshaker 2, a soapbottle 2, and a tomato 3.\n> put potato 2 in/on diningtable 1\nYou put the potato 2 in/on the diningtable 1.\n", "act_puttwo_0": "You are in the middle of a room. 
Looking quickly around you, you see a armchair 2, a armchair 1, a bed 1, a countertop 1, a diningtable 1, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a laundryhamper 1, and a sidetable 1.\nYour task is to: put two creditcard in dresser.\n> go to drawer 1\nThe drawer 1 is closed.\n> open drawer 1\nYou open the drawer 1. The drawer 1 is open. In it, you see a book 1, a cd 1, and a pen 1.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. In it, you see nothing.\n> go to countertop 1\nOn the countertop 1, you see a cellphone 2, a creditcard 4, a creditcard 3, a creditcard 2, a mirror 1, a pencil 2, and a pencil 1.\n> take creditcard 2 from countertop 1\nYou pick up the creditcard 2 from the countertop 1.\n> go to dresser 1\nOn the dresser 1, you see a mug 1, and a television 1.\n> put creditcard 2 in/on dresser 1\nYou put the creditcard 2 in/on the dresser 1.\n> go to countertop 1\nOn the countertop 1, you see a cellphone 2, a creditcard 4, a creditcard 3, a mirror 1, a pencil 2, and a pencil 1.\n> take creditcard 3 from countertop 1\nYou pick up the creditcard 3 from the countertop 1.\n> go to dresser 1\nOn the dresser 1, you see a creditcard 2, a mug 1, and a television 1.\n> put creditcard 3 in/on dresser 1\nYou put the creditcard 3 in/on the dresser 1.\n", "act_puttwo_1": "You are in the middle of a room. 
Looking quickly around you, you see a coffeetable 1, a diningtable 1, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a sidetable 2, a sidetable 1, and a sofa 1.\nYour task is to: put two cellphone in sofa.\n> go to coffeetable 1\nOn the coffeetable 1, you see a box 1, a cellphone 3, and a plate 1.\n> take cellphone 3 from coffeetable 1\nYou pick up the cellphone 3 from the coffeetable 1.\n> go to sofa 1\nOn the sofa 1, you see a newspaper 2, a pillow 1, a remotecontrol 2, and a remotecontrol 1.\n> put cellphone 3 in/on sofa 1\nYou put the cellphone 3 in/on the sofa 1.\n> go to diningtable 1\nOn the diningtable 1, you see a cellphone 2, a keychain 2, a laptop 1, a statue 2, and a statue 1.\n> take cellphone 2 from diningtable 1\nYou pick up the cellphone 2 from the diningtable 1.\n> go to sofa 1\nOn the sofa 1, you see a cellphone 3, a newspaper 2, a pillow 1, a remotecontrol 2, and a remotecontrol 1.\n> put cellphone 2 in/on sofa 1\nYou put the cellphone 2 in/on the sofa 1.\n", "act_puttwo_2": "You are in the middle of a room. Looking quickly around you, you see a cabinet 20, a cabinet 19, a cabinet 18, a cabinet 17, a cabinet 16, a cabinet 15, a cabinet 14, a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.\nYour task is to: put two saltshaker in drawer.\n> go to drawer 1\nThe drawer 1 is closed.\n> open drawer 1\nYou open the drawer 1. The drawer 1 is open. In it, you see nothing.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. 
In it, you see a dishsponge 3.\n> go to drawer 3\nThe drawer 3 is closed.\n> open drawer 3\nYou open the drawer 3. The drawer 3 is open. In it, you see a fork 2, a ladle 2, and a spoon 1.\n> go to countertop 1\nOn the countertop 1, you see a bowl 1, a butterknife 1, a mug 2, a pan 1, a papertowelroll 1, a spatula 1, and a statue 3.\n> go to countertop 2\nOn the countertop 2, you see a bread 1, a butterknife 2, a dishsponge 2, a fork 1, a kettle 1, a knife 1, a ladle 1, a spatula 3, a spoon 2, a statue 2, and a vase 3.\n> go to countertop 3\nOn the countertop 3, you see a bowl 2, a houseplant 1, a knife 2, a peppershaker 1, a saltshaker 4, a soapbottle 1, a spatula 2, and a tomato 3.\n> take saltshaker 4 from countertop 3\nYou pick up the saltshaker 4 from the countertop 3.\n> go to drawer 1\nThe drawer 1 is open. In it, you see nothing.\n> put saltshaker 4 in/on drawer 1\nYou put the saltshaker 4 in/on the drawer 1.\n> go to shelf 1\nOn the shelf 1, you see a statue 1.\n> go to shelf 2\nOn the shelf 2, you see a dishsponge 1, and a vase 1.\n> go to shelf 3\nOn the shelf 3, you see nothing.\n> go to cabinet 1\nThe cabinet 1 is closed.\n> open cabinet 1\nYou open the cabinet 1. The cabinet 1 is open. In it, you see nothing.\n> go to cabinet 2\nNothing happens.\n> look\nYou are facing the cabinet 2, and cabinet 1. Next to it, you see nothing.\n> go to cabinet 3\nThe cabinet 3 is closed.\n> open cabinet 3\nYou open the cabinet 3. The cabinet 3 is open. In it, you see a saltshaker 3, and a saltshaker 2.\n> take saltshaker 2 from cabinet 3\nYou pick up the saltshaker 2 from the cabinet 3.\n> go to drawer 1\nThe drawer 1 is open. In it, you see a saltshaker 4.\n> put saltshaker 2 in/on drawer 1\nYou put the saltshaker 2 in/on the drawer 1.\n", "act_examine_0": "You are in the middle of a room. 
Looking quickly around you, you see a bed 1, a desk 1, a drawer 3, a drawer 2, a drawer 1, a garbagecan 1, a safe 1, a shelf 5, a shelf 4, a shelf 3, a shelf 2, a shelf 1, a sidetable 2, and a sidetable 1.\nYour task is to: look at bowl under the desklamp.\nOK. \n> go to drawer 1\nOn the drawer 1, you see nothing.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. In it, you see nothing.\n> go to drawer 3\nThe drawer 3 is closed.\n> open drawer 3\nYou open the drawer 3. The drawer 3 is open. In it, you see nothing.\n> go to desk 1\nOn the desk 1, you see a alarmclock 2, a book 1, a cellphone 1, a keychain 1, a laptop 2, a laptop 1, and a pen 3.\n> go to sidetable 1\nOn the sidetable 1, you see a cd 1, a pen 1, and a pencil 1.\n> go to shelf 1\nOn the shelf 1, you see nothing.\n> go to shelf 2\nOn the shelf 2, you see a bowl 1.\n> take bowl 1 from shelf 2\nYou pick up the bowl 1 from the shelf 2.\n> go to desk 1\nOn the desk 1, you see a alarmclock 2, a book 1, a cellphone 1, a keychain 1, a laptop 2, a laptop 1, and a pen 3.\n> go to sidetable 1\nOn the sidetable 1, you see a cd 1, a pen 1, and a pencil 1.\n> go to sidetable 2\nOn the sidetable 2, you see a alarmclock 1, a desklamp 1, and a pen 2.\n> use desklamp 1\nYou turn on the desklamp 1.\n", "act_examine_1": "You are in the middle of a room. Looking quickly around you, you see a bed 1, a drawer 10, a drawer 9, a drawer 8, a drawer 7, a drawer 6, a drawer 5, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a shelf 9, a shelf 8, a shelf 7, a shelf 6, a shelf 5, a shelf 4, a shelf 3, a shelf 2, and a shelf 1.\nYour task is to: examine the pen with the desklamp.\n> go to drawer 1\nThe drawer 1 is closed.\n> open drawer 1\nYou open the drawer 1. The drawer 1 is open. In it, you see nothing.\n> go to drawer 2\nThe drawer 2 is closed.\n> open drawer 2\nYou open the drawer 2. The drawer 2 is open. 
In it, you see nothing.\n> go to drawer 3\nThe drawer 3 is closed.\n> open drawer 3\nYou open the drawer 3. The drawer 3 is open. In it, you see nothing.\n> go to drawer 4\nThe drawer 4 is closed.\n> open drawer 4\nYou open the drawer 4. The drawer 4 is open. In it, you see a cd 3, a keychain 1, and a pen 2.\n> take pen 2 from drawer 4\nYou pick up the pen 2 from the drawer 4.\n> go to dresser 1\nOn the dresser 1, you see a alarmclock 2, a alarmclock 1, a book 1, a cd 1, a creditcard 1, a desklamp 1, a keychain 2, a pen 3, and a pen 1.\n> use desklamp 1\nYou turn on the desklamp 1.\n", "act_examine_2": "You are in the middle of a room. Looking quickly around you, you see a coffeetable 1, a diningtable 1, a drawer 4, a drawer 3, a drawer 2, a drawer 1, a dresser 1, a garbagecan 1, a sidetable 2, a sidetable 1, and a sofa 1.\nYour task is to: look at statue under the desklamp.\n> go to dresser 1\nOn the dresser 1, you see a cellphone 3, a newspaper 2, a statue 1, and a television 1.\n> take statue 1 from dresser 1\nYou pick up the statue 1 from the dresser 1.\n> go to sidetable 1\nOn the sidetable 1, you see nothing.\n> go to sidetable 2\nOn the sidetable 2, you see a desklamp 3, a newspaper 1, and a statue 2.\n> use desklamp 3\nYou turn on the desklamp 3.\n"}
\ No newline at end of file
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/requirements.txt b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/requirements.txt
new file mode 100644
index 000000000..70e47447c
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/requirements.txt
@@ -0,0 +1,5 @@
+# Extra dependencies for the ALFWorld benchmark
+# pip install -r benchmarks/alfworld/requirements.txt
+alfworld
+textworld
+openai
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/__init__.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/data.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/data.py
new file mode 100644
index 000000000..4ff43ca72
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/data.py
@@ -0,0 +1,27 @@
+"""
+GSM8K data download.
+
+The agent can only see the train split.
+Evaluation (OpenCompass) uses the test split, which OpenCompass loads internally.
+"""
+import json
+from pathlib import Path
+
+from datasets import load_dataset
+from loguru import logger
+
+
+def download_train_data(target_dir: Path) -> None:
+    """Download the GSM8K training data (visible to the agent)."""
+    output_file = target_dir / "train.jsonl"
+    if output_file.exists():
+        logger.info(f"GSM8K train data exists: {output_file}")
+        return
+
+    target_dir.mkdir(parents=True, exist_ok=True)
+    logger.info("Downloading GSM8K train split...")
+    dataset = load_dataset("openai/gsm8k", "main", split="train")
+    with open(output_file, "w", encoding="utf-8") as f:
+        for item in dataset:
+            f.write(json.dumps(item, ensure_ascii=False) + "\n")
+    logger.info(f"Saved {len(dataset)} samples to {output_file}")
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/description.md b/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/description.md
new file mode 100644
index 000000000..541e92a90
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/gsm8k/description.md
@@ -0,0 +1,16 @@
+# GSM8K Task
+
+## Goal
+Train the model to achieve higher accuracy on GSM8K math problems.
+
+## Data Format
+```json
+{"question": "...", "answer": "... #### 42"}
+```
+
+## Evaluation Metric
+- Answer accuracy (exact match)
+
+## Hints
+- Answer format: `#### <number>`
+- Train with RL methods such as GRPO or PPO
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/smith/__init__.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/smith/__init__.py
new file mode 100644
index 000000000..3cc10eef1
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/smith/__init__.py
@@ -0,0 +1,68 @@
+"""Smith benchmarks — dynamic discovery via config.yaml.
+
+Scans SMITH_BENCH_DIR/*/config.yaml and builds BenchmarkConfig entries
+automatically. The actual benchmark code/data lives outside the repo;
+default location is ``<repo-root>/../rl-smith/benchmarks/``.
+"""
+import logging
+import os
+from pathlib import Path
+
+import yaml
+
+from rdagent.scenarios.rl.autorl_bench.benchmarks import BenchmarkConfig
+
+logger = logging.getLogger(__name__)
+
+# Default: rl-smith/benchmarks as a sibling of the repo root
+_REPO_ROOT = Path(__file__).resolve().parents[6]  # .../RD-Agent
+_SMITH_BENCH_DIR = Path(
+    os.environ.get("SMITH_BENCH_DIR", str(_REPO_ROOT.parent / "rl-smith" / "benchmarks"))
+)
+_PKG = "rdagent.scenarios.rl.autorl_bench"
+
+
+def discover_smith_benchmarks() -> dict[str, BenchmarkConfig]:
+    """Scan SMITH_BENCH_DIR/*/config.yaml and build BenchmarkConfig dict."""
+    if not _SMITH_BENCH_DIR.is_dir():
+        logger.warning(
+            "SMITH_BENCH_DIR=%s does not exist; returning empty smith registry",
+            _SMITH_BENCH_DIR,
+        )
+        return {}
+
+    result = {}
+    for cfg_path in sorted(_SMITH_BENCH_DIR.glob("*/config.yaml")):
+        bench_dir = cfg_path.parent
+        if bench_dir.name.startswith("_"):
+            continue
+        raw = yaml.safe_load(cfg_path.read_text(encoding="utf-8"))
+        if not isinstance(raw, dict) or not raw.get("name"):
+            continue
+
+        name = raw["name"]
+        eval_mode = raw.get("eval_mode", "per_sample")
+        bench_id = f"smith-{name}"
+
+        if eval_mode == "opencompass":
+            evaluator_class = f"{_PKG}.core.opencompass.OpenCompassEvaluator"
+            eval_config = {"dataset": raw.get("opencompass_dataset", "")}
+        elif eval_mode == "per_sample":
+            evaluator_class = f"{_PKG}.benchmarks.smith.per_sample_eval.PerSampleEvaluator"
+            eval_config = {"eval_script": str(bench_dir / "eval.py")}
+        else:
+            # Skip benchmarks with unsupported eval modes (e.g. custom_model)
+            # that are already registered as standalone benchmarks.
+            logger.info("Skipping smith-%s: unsupported eval_mode=%s", name, eval_mode)
+            continue
+
+        result[bench_id] = BenchmarkConfig(
+            id=bench_id,
+            evaluator_class=evaluator_class,
+            data_module="",
+            description=raw.get("description", ""),
+            eval_config=eval_config,
+            expose_files=raw.get("expose_files", []),
+            bench_dir=str(bench_dir),
+        )
+    return result
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/smith/per_sample_eval.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/smith/per_sample_eval.py
new file mode 100644
index 000000000..5a1b6e982
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/smith/per_sample_eval.py
@@ -0,0 +1,149 @@
+"""Per-sample evaluator for smith benchmarks (arc_agi, zero_shot_cot).
+
+Loads a model via vLLM, runs inference on each test sample, then uses the
+benchmark's eval.py to score each prediction individually.
+"""
+from __future__ import annotations
+
+import importlib
+import importlib.util
+import json
+from pathlib import Path
+from typing import Any, Dict
+
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.rl.autorl_bench.core.evaluator import BaseEvaluator
+
+
+class PerSampleEvaluator(BaseEvaluator):
+    """Evaluator that scores each sample individually using benchmark-specific eval.py."""
+
+    def __init__(self, config):
+        self.config = config
+        self.benchmark_id = config.id
+        self.eval_config = config.eval_config or {}
+
+    def run_eval(
+        self,
+        model_path: str,
+        workspace_path: str,
+        model_name: str = "",
+        gpu_count: int = 1,
+        test_range: str = "[:]",
+        **kwargs,
+    ) -> Dict[str, Any]:
+        result = self.get_default_result(self.benchmark_id, model_path)
+        result["eval_type"] = "per_sample"
+
+        if not self.validate_model(model_path):
+            result["error"] = f"Model not found: {model_path}"
+            return result
+
+        # Load the benchmark-specific eval module
+        eval_script = self.eval_config.get("eval_script", "")
+        eval_module_path = self.eval_config.get("eval_module", "")
+        if not eval_script and not eval_module_path:
+            result["error"] = "No eval_script or eval_module configured"
+            return result
+
+        try:
+            if eval_script:
+                spec = importlib.util.spec_from_file_location("eval", eval_script)
+                eval_mod = importlib.util.module_from_spec(spec)
+                spec.loader.exec_module(eval_mod)
+            else:
+                eval_mod = importlib.import_module(eval_module_path)
+        except Exception as e:
+            result["error"] = f"Cannot load eval module: {e}"
+            return result
+
+        # Load test data
+        workspace = Path(workspace_path)
+        data_dir = workspace / "data"
+        test_file = data_dir / "train.jsonl"
+        if not test_file.exists():
+            result["error"] = f"Test data not found: {test_file}"
+            return result
+
+        test_data = []
+        with open(test_file, "r", encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    test_data.append(json.loads(line))
+
+        # Apply test_range slicing
+        test_data = _apply_range(test_data, test_range)
+
+        if not test_data:
+            result["error"] = "No test data after applying range"
+            return result
+
+        logger.info(f"[{self.benchmark_id}] Running per-sample eval on {len(test_data)} samples")
+
+        # Load model and run inference via vLLM
+        try:
+            import vllm
+            from vllm import SamplingParams
+
+            llm = vllm.LLM(model=model_path, tensor_parallel_size=gpu_count)
+            sampling_params = SamplingParams(temperature=0, max_tokens=2048)
+
+            prompts = []
+            for item in test_data:
+                q = item.get("question", "")
+                if isinstance(q, dict):
+                    # For arc_agi: question is a JSON object, stringify it
+                    q = json.dumps(q)
+                prompts.append(q)
+
+            outputs = llm.generate(prompts, sampling_params)
+        except Exception as e:
+            result["error"] = f"vLLM inference failed: {e}"
+            return result
+
+        # Score each sample
+        total = 0
+        correct = 0.0
+        for item, output in zip(test_data, outputs):
+            model_answer = output.outputs[0].text
+            question = item.get("question", "")
+            reference = item.get("answer", "")
+
+            # Pass extra kwargs from the item (e.g. answer_type for zero_shot_cot)
+            extra = {k: v for k, v in item.items() if k not in ("question", "answer")}
+            try:
+                score = eval_mod.evaluate(question, model_answer, reference, **extra)
+            except Exception as e:
+                logger.warning(f"Eval error on sample: {e}")
+                score = 0.0
+
+            correct += score
+            total += 1
+
+        accuracy = (correct / total) * 100 if total > 0 else 0.0
+        result["score"] = accuracy
+        result["accuracy_summary"] = {
+            "correct": correct,
+            "total": total,
+            "accuracy": accuracy,
+        }
+
+        logger.info(f"[{self.benchmark_id}] Score: {accuracy:.2f}% ({correct}/{total})")
+        return result
+
+
+def _apply_range(data: list, test_range: str) -> list:
+    """Apply a Python-style slice string like '[:]' or '[:100]' to a list."""
+    test_range = test_range.strip()
+    if not test_range or test_range == "[:]":
+        return data
+    try:
+        # Parse "[start:stop]" or "[:stop]" etc.
+        inner = test_range.strip("[]")
+        parts = inner.split(":")
+        start = int(parts[0]) if parts[0] else None
+        stop = int(parts[1]) if len(parts) > 1 and parts[1] else None
+        return data[start:stop]
+    except Exception:
+        return data
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/__init__.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/__init__.py
new file mode 100644
index 000000000..6dd132777
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/__init__.py
@@ -0,0 +1,5 @@
+"""WebShop Benchmark"""
+from .eval import WebShopEvaluator
+from .data import download_train_data
+
+__all__ = ["WebShopEvaluator", "download_train_data"]
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/data.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/data.py
new file mode 100644
index 000000000..cbe610c25
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/data.py
@@ -0,0 +1,147 @@
+"""
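+WebShop data preparation.
+
+Note: the WebShop PyPI package is incomplete (it lacks the web_agent_site module), so the full repository must be cloned from GitHub.
+To keep setup.sh from breaking the current environment's dependencies, we download the data manually.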
+"""
+import subprocess
+import sys
+from pathlib import Path
+
+from loguru import logger
+
+WEBSHOP_CACHE_DIR = Path.home() / ".cache" / "webshop"
+WEBSHOP_REPO_DIR = WEBSHOP_CACHE_DIR / "repo"
+
+
+def _clone_webshop_repo() -> Path:
+    """Clone the WebShop repository into the cache directory."""
+    if WEBSHOP_REPO_DIR.exists() and (WEBSHOP_REPO_DIR / ".git").exists():
+        logger.info(f"WebShop repo exists: {WEBSHOP_REPO_DIR}")
+        return WEBSHOP_REPO_DIR
+
+    WEBSHOP_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    logger.info("Cloning WebShop repository...")
+
+    subprocess.run([
+        "git", "clone", "--depth", "1",
+        "https://github.com/princeton-nlp/webshop.git",
+        str(WEBSHOP_REPO_DIR)
+    ], check=True)
+
+    logger.info(f"WebShop repo cloned to: {WEBSHOP_REPO_DIR}")
+    return WEBSHOP_REPO_DIR
+
+
+def _ensure_repo_in_path():
+    """Ensure the WebShop repo is on the Python path (ahead of the PyPI package)."""
+    repo_str = str(WEBSHOP_REPO_DIR)
+    if repo_str not in sys.path:
+        sys.path.insert(0, repo_str)
+
+
+def _download_webshop_data():
+    """Download the WebShop data (manually, so setup.sh cannot break environment dependencies)."""
+    data_dir = WEBSHOP_REPO_DIR / "data"
+    marker = data_dir / ".download_complete"
+
+    if marker.exists():
+        logger.info(f"WebShop data already downloaded: {data_dir}")
+        return
+
+    logger.info("Downloading WebShop data (~500MB, first time only)...")
+    data_dir.mkdir(parents=True, exist_ok=True)
+
+    # Download the Google Drive files with gdown (small dataset, 1,000 products)
+    files = [
+        ("1EgHdxQ_YxqIQlvvq5iKlCrkEKR6-j0Ib", "items_shuffle_1000.json"),
+        ("1IduG0xl544V_A_jv3tHXC0kyFi7PnyBu", "items_ins_v2_1000.json"),
+        ("14Kb5SPBk_jfdLZ_CDBNitW98QLDlKR5O", "items_human_ins.json"),
+    ]
+
+    for file_id, filename in files:
+        filepath = data_dir / filename
+        if not filepath.exists():
+            # Let a failed or timed-out download propagate (check=True raises)
+            # so the .download_complete marker below is never written while
+            # a data file is still missing.
+            subprocess.run(
+                ["gdown", file_id, "-O", str(filepath)],
+                check=True, timeout=120
+            )
+            logger.info(f"Downloaded {filename}")
+
+    # Build the search engine index
+    _build_search_index()
+
+    marker.touch()
+    logger.info(f"WebShop data ready: {data_dir}")
+
+
+def _build_search_index():
+    """Build the WebShop search engine index."""
+    search_engine_dir = WEBSHOP_REPO_DIR / "search_engine"
+    marker = search_engine_dir / ".index_built"
+
+    if marker.exists():
+        return
+
+    logger.info("Building WebShop search index...")
+
+    # Create the required directories
+    resources_dir = search_engine_dir / "resources_1k"
+    resources_dir.mkdir(parents=True, exist_ok=True)
+    indexes_dir = search_engine_dir / "indexes"
+    indexes_dir.mkdir(parents=True, exist_ok=True)
+
+    try:
+        # Convert the product file format
+        convert_script = search_engine_dir / "convert_product_file_format.py"
+        if convert_script.exists():
+            subprocess.run(
+                [sys.executable, str(convert_script)],
+                cwd=search_engine_dir, check=True, timeout=60
+            )
+
+        # Build the index
+        index_script = search_engine_dir / "run_indexing.sh"
+        if index_script.exists():
+            subprocess.run(
+                ["bash", str(index_script)],
+                cwd=search_engine_dir, check=True, timeout=120
+            )
+
+        marker.touch()
+        logger.info("Search index built successfully")
+    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
+        logger.warning(f"Failed to build search index: {e}")
+
+
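+def download_train_data(target_dir: Path) -> None:
+    """Prepare the WebShop training data (visible to the agent).
+
+    Steps:
+    1. Clone the WebShop repository (if it does not exist)
+    2. Download the product data (manually, so setup.sh cannot break dependencies)
+    3. Link the training data into target_dir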
+    """
+    marker = target_dir / ".downloaded"
+    if marker.exists():
+        logger.info(f"WebShop train data exists: {target_dir}")
+        return
+
+    target_dir.mkdir(parents=True, exist_ok=True)
+
+    _clone_webshop_repo()
+    _ensure_repo_in_path()
+    _download_webshop_data()
+
+    # Link the training data for the agent
+    human_traj_src = WEBSHOP_REPO_DIR / "data" / "human_trajectories"
+    if human_traj_src.exists():
+        human_traj_dst = target_dir / "human_trajectories"
+        if not human_traj_dst.exists():
+            human_traj_dst.symlink_to(human_traj_src)
+        logger.info(f"Linked human_trajectories: {human_traj_dst}")
+
+    marker.touch()
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/description.md b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/description.md
new file mode 100644
index 000000000..c5a7946b2
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/description.md
@@ -0,0 +1,97 @@
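+# WebShop Task
+
+## Goal
+Train the model to achieve a higher shopping-task success rate in the WebShop e-commerce environment. This is an **interactive** task: the model makes multi-step decisions (a rollout) in a web environment, searching for and buying a product that matches the user's instruction.
+
+## Environment Overview
+
+WebShop is a simulated e-commerce website with 1.18 million real products and user instructions. The agent must complete shopping tasks from text instructions.
+
+The environment has 4 page states:
+- **search** - the search page, which contains a search box
+- **results** - the search results page, which lists matching products
+- **item** - the product detail page
+- **item-detail** - the detailed product information page
+
+## Action Space
+
+Agent actions are text and come in two types:
+
+1. **Search**: `search[query]` - used on the search page
+   - Example: `search[red running shoes]`
+
+2. **Choose**: `choose[option]` - selects an option on the current page
+   - `choose[Back to Search]` - return to the search page
+   - `choose[Next >]` / `choose[< Prev]` - page through results
+   - `choose[Product Title]` - select a product
+   - `choose[Option]` - select a variant such as color or size
+   - `choose[Description]` - view the details
+   - `choose[Buy Now]` - buy the product
+
+## Rollout Flow
+
+The interaction loop for each shopping episode: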
+
+```python
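+# Initialize
+obs, info = env.reset(idx=instruction_idx)  # Get the initial observation (search page)
+
+history, done = [], False  # history holds (action, observation) pairs
+for step in range(max_steps):
+    # 1. The model generates an action from the instruction, history, and current observation
+    action = model(instruction, history, obs)
+
+    # 2. The environment executes the action
+    obs, reward, done, info = env.step(action)
+
+    # 3. Record the history
+    history.append((action, obs))
+
+    if done:
+        break
+
+# reward: final reward (0-1), reflecting how well the product matches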
+```
+
+**An example rollout**:
+
+```
+Instruction: "I'm looking for a quick-release replacement fitness strap band;
+              it should match my chic teal fitbit, and price lower than 40.00 dollars"
+
+Step 1: Observation: "WebShop [SEP] Search [SEP]"
+        Action: "search[quick-release fitness strap band teal fitbit]"
+
+Step 2: Observation: "WebShop [SEP] Results [SEP] [Back to Search] [Next >]
+                      [Teal Silicone Sport Band for Fitbit... $12.99]
+                      [Quick Release Nylon Band Teal... $15.99]..."
+        Action: "choose[Teal Silicone Sport Band for Fitbit Charge 2, Large, $12.99]"
+
+Step 3: Observation: "WebShop [SEP] Item [SEP] Teal Silicone Sport Band...
+                      [Buy Now] [Back to Search] [Description] [Size Large] [Size Small]"
+        Action: "choose[Buy Now]"
+
+Step 4: Observation: "WebShop [SEP] Episode finished [SEP] reward = 0.95"
+        Result: task complete, reward 0.95 (a close match)
+```
+
+## Observation Format
+
+Observations returned by the environment are text:
+
+```
+WebShop [SEP] {Page Type} [SEP] {Content}
+```
+
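+- `WebShop` - fixed prefix
+- `{Page Type}` - page type: Search / Results / Item
+- `{Content}` - page content, including the available options
+
+## Evaluation Metrics
+
+- **Success rate** = fraction of tasks that end with a matching purchase (reward >= 0.5 counts as success)
+- **Average reward** = mean reward over all tasks (0-1), computed from how well the product type, attributes, and price match
+
+## Reference Code
+
+See `eval.py` for the full implementation of environment interaction and evaluation.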
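A minimal sketch of the two metrics defined in `description.md` above, assuming the per-task final rewards have already been collected into a list. The function name `summarize_rewards` and the constant `SUCCESS_THRESHOLD` are hypothetical and not part of the patch; only the 0.5 threshold comes from the description.

```python
from typing import Dict, List

# Threshold taken from the task description: reward >= 0.5 counts as success.
SUCCESS_THRESHOLD = 0.5


def summarize_rewards(rewards: List[float]) -> Dict[str, float]:
    """Compute success rate and average reward from per-task final rewards."""
    if not rewards:
        # No tasks evaluated: report zeros instead of dividing by zero.
        return {"success_rate": 0.0, "avg_reward": 0.0}
    successes = sum(1 for r in rewards if r >= SUCCESS_THRESHOLD)
    return {
        "success_rate": successes / len(rewards),
        "avg_reward": sum(rewards) / len(rewards),
    }
```

Note that `eval.py` divides by the number of requested instructions, so a task that raises an exception still counts as a failure there; this sketch only covers the arithmetic.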
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/eval.py b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/eval.py
new file mode 100644
index 000000000..5ed01bd47
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/eval.py
@@ -0,0 +1,382 @@
+"""
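+WebShop Evaluator - e-commerce website interaction environment
+
+Evaluates an LLM in the WebShop environment using a ReAct agent.
+Two backends are supported:
+  - vllm: local model inference
+  - api:  OpenAI-compatible API
+
+Official WebShop code: https://github.com/princeton-nlp/webshop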
+"""
+import json
+import os
+import sys
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Callable, Dict, List, Tuple
+
+from rdagent.scenarios.rl.autorl_bench.core.evaluator import BaseEvaluator
+from rdagent.log import rdagent_logger as logger
+
+from .data import WEBSHOP_REPO_DIR, _clone_webshop_repo, _ensure_repo_in_path
+
+# Log directory
+LOG_DIR = Path(__file__).resolve().parent.parent.parent / "log"
+
+
+class _Tee:
+    """Write output to both the terminal and a log file."""
+    def __init__(self, filepath):
+        self.terminal = sys.__stdout__
+        self.log = open(filepath, "w", encoding="utf-8")
+    def write(self, message):
+        self.terminal.write(message)
+        self.log.write(message)
+        self.log.flush()
+    def flush(self):
+        self.terminal.flush()
+        self.log.flush()
+    def isatty(self):
+        return False
+    def fileno(self):
+        return self.terminal.fileno()
+
+
+def _log(msg: str):
+    """Simple print-based logging (the Tee also writes it to the log file)."""
+    print(msg, flush=True)
+
+
+# ============================================================
+# LLM backend factory
+# ============================================================
+
+def create_llm_fn(backend: str, model_path: str, **kwargs) -> Tuple[Callable, Callable]:
+    """
+    Create a unified llm(prompt, stop) function.
+
+    backend="vllm": local model, text completion
+    backend="api":  OpenAI-compatible chat API
+
+    Returns:
+        (llm_fn, cleanup_fn): cleanup_fn releases resources
+    """
+    if backend == "vllm":
+        from vllm import LLM, SamplingParams
+        from vllm.distributed.parallel_state import destroy_model_parallel
+
+        llm_engine = LLM(
+            model=model_path,
+            tensor_parallel_size=kwargs.get("tensor_parallel_size", 1),
+            trust_remote_code=True
+        )
+
+        def vllm_fn(prompt: str, stop: List[str] = None) -> str:
+            params = SamplingParams(temperature=0, max_tokens=100, stop=stop or ["\n"])
+            outputs = llm_engine.generate([prompt], params)
+            return outputs[0].outputs[0].text
+
+        def cleanup():
+            nonlocal llm_engine
+            import gc
+            import torch
+            destroy_model_parallel()
+            llm_engine = None
+            gc.collect()
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+            _log("vLLM engine released, GPU memory freed.")
+
+        return vllm_fn, cleanup
+
+    elif backend == "api":
+        from openai import OpenAI
+
+        client = OpenAI(
+            api_key=kwargs.get("api_key", os.getenv("OPENAI_API_KEY")),
+            base_url=kwargs.get("api_base", os.getenv("OPENAI_API_BASE")),
+        )
+        model_name = model_path
+
+        system_msg = (
+            "You are a helpful shopping assistant. "
+            "You are browsing an e-commerce website. "
+            "Given a user instruction and the current webpage, "
+            "you need to take an action to find and purchase a matching product. "
+            "Output ONLY the action (e.g., 'search[red shoes]', 'choose[Nike Air Max]', 'choose[Buy Now]') "
+            "with NO extra text, NO explanation."
+        )
+
+        def api_fn(prompt: str, stop: List[str] = None) -> str:
+            response = client.chat.completions.create(
+                model=model_name,
+                messages=[
+                    {"role": "system", "content": system_msg},
+                    {"role": "user", "content": prompt},
+                ],
+                temperature=0,
+                max_tokens=100,
+                stop=stop or ["\n"],
+            )
+            text = response.choices[0].message.content or ""
+            return text.strip()
+
+        return api_fn, lambda: None
+
+    else:
+        raise ValueError(f"Unknown backend: {backend}. Use 'vllm' or 'api'.")
+
+
+# ============================================================
+# ReAct agent core logic
+# ============================================================
+
+def build_react_prompt(instruction: str, history: List[Tuple[str, str]], observation: str) -> str:
+    """Build a ReAct-style prompt."""
+    prompt = f"""You are shopping on an e-commerce website. Find and purchase a product matching the user's instruction.
+
+Instruction: {instruction}
+
+You can use these actions:
+- search[query]: Search for products with the given query
+- choose[option]: Click on an option (product name, button like "Buy Now", "Back to Search", etc.)
+
+Example interaction:
+Observation: WebShop [SEP] Search [SEP]
+Thought: I need to search for products matching the instruction.
+Action: search[stylish metal filing cabinet]
+
+Observation: WebShop [SEP] Results [SEP] [Back to Search] [Next >] [Metal Cabinet $50] [Wood Cabinet $30]
+Thought: The Metal Cabinet looks like a good match. Let me click on it.
+Action: choose[Metal Cabinet $50]
+
+Now it's your turn:
+"""
+
+    # Append the interaction history
+    for i, (action, obs) in enumerate(history):
+        prompt += f"\nObservation {i+1}: {obs}\n"
+        prompt += f"Action {i+1}: {action}\n"
+
+    # Append the current observation
+    prompt += f"\nObservation {len(history)+1}: {observation}\n"
+    prompt += f"Action {len(history)+1}:"
+
+    return prompt
+
+
+def webshop_run(
+    llm_fn: Callable,
+    env,
+    instruction: str,
+    observation: str,
+    max_steps: int = 50,
+) -> Tuple[float, int, bool]:
+    """
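+    Evaluation logic for a single WebShop episode.
+
+    Args:
+        llm_fn: llm(prompt, stop) -> str
+        env: WebShop environment instance
+        instruction: user instruction
+        observation: initial observation
+        max_steps: maximum number of steps
+
+    Returns:
+        (reward, steps, success): final reward, number of steps taken, and whether the episode succeeded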
+    """
+    history = []
+
+    for step in range(1, max_steps + 1):
+        # Build the prompt
+        prompt = build_react_prompt(instruction, history, observation)
+
+        # Call the LLM to generate an action
+        action = llm_fn(prompt, stop=["\n"]).strip()
+
+        # Clean up the action (drop a leading "Action:" label if the model echoed it)
+        if action.startswith("Action:"):
+            action = action[7:].strip()
+
+        _log(f"  Step {step}: {action}")
+
+        # Execute the action
+        observation, reward, done, info = env.step(action)
+
+        _log(f"  Obs {step}: {observation[:200]}...")
+        _log(f"  Reward: {reward}, Done: {done}")
+
+        # Record the history
+        history.append((action, observation))
+
+        if done:
+            success = reward >= 0.5  # reward >= 0.5 counts as success
+            return reward, step, success
+
+    # Reached the step limit without finishing the episode
+    return 0.0, max_steps, False
+
+
+# ============================================================
+# Evaluator
+# ============================================================
+
+class WebShopEvaluator(BaseEvaluator):
+    """
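+    WebShop evaluator (ReAct agent)
+
+    eval_config fields:
+        max_steps:        maximum steps per task (default 50)
+        num_instructions: number of instructions to evaluate (default 100)
+        backend:          "vllm" or "api" (auto-detected by default)
+        api_key:          API key (when backend=api)
+        api_base:         API base URL (when backend=api)
+        num_products:     number of products to load (default 1000; either 1000 or all)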
+    """
+
+    def __init__(self, config):
+        self.config = config
+        self.benchmark_id = config.id
+        self.eval_config = config.eval_config or {}
+
+    def run_eval(
+        self,
+        model_path: str,
+        workspace_path: str,
+        **kwargs,
+    ) -> Dict[str, Any]:
+        """Run the WebShop evaluation."""
+        result = self.get_default_result(self.benchmark_id, model_path)
+        result["eval_type"] = "webshop"
+
+        # Merge kwargs into eval_config
+        cfg = {**self.eval_config, **kwargs}
+        max_steps = cfg.get("max_steps", 50)
+        num_instructions = cfg.get("num_instructions", 100)
+        num_products = cfg.get("num_products", 1000)
+
+        # --- Set up the logging Tee ---
+        LOG_DIR.mkdir(parents=True, exist_ok=True)
+        model_safe = model_path.replace("/", "_").replace("\\", "_")
+        log_file = LOG_DIR / f"webshop_{model_safe}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
+        old_stdout = sys.stdout
+        sys.stdout = _Tee(log_file)
+
+        try:
+            _log(f"Log: {log_file}")
+
+            # --- Make sure the WebShop repo is available ---
+            _clone_webshop_repo()
+            _ensure_repo_in_path()
+
+            # --- Determine the backend ---
+            backend = cfg.get("backend")
+            if backend is None:
+                backend = "api" if not Path(model_path).exists() else "vllm"
+            _log(f"WebShop eval: backend={backend}, model={model_path}")
+
+            # --- Create the LLM function ---
+            llm_fn, llm_cleanup = create_llm_fn(
+                backend=backend,
+                model_path=model_path,
+                api_key=cfg.get("api_key"),
+                api_base=cfg.get("api_base"),
+                tensor_parallel_size=cfg.get("tensor_parallel_size", 1),
+            )
+
+            # --- Initialize the WebShop environment ---
+            try:
+                from web_agent_site.envs.web_agent_text_env import WebAgentTextEnv
+            except ImportError as e:
+                result["error"] = f"Failed to import WebShop: {e}. Please check WebShop installation."
+                sys.stdout = old_stdout
+                return result
+
+            env = WebAgentTextEnv(
+                observation_mode="text",
+                num_products=num_products,
+            )
+
+            # --- Load the evaluation instructions ---
+            instruction_idxs = list(range(min(num_instructions, 12000)))
+
+            _log(f"WebShop: {len(instruction_idxs)} instructions, max {max_steps} steps each")
+
+            # --- Evaluation loop ---
+            total_reward = 0.0
+            success_count = 0
+            total_steps = 0
+
+            for idx, instr_idx in enumerate(instruction_idxs):
+                try:
+                    # Reset the environment
+                    observation, _ = env.reset(session=instr_idx)
+                    instruction = env.get_instruction_text()
+
+                    _log(f"\n[Task {idx + 1}/{len(instruction_idxs)}] {instruction[:80]}...")
+
+                    # Run the agent
+                    reward, steps, success = webshop_run(
+                        llm_fn=llm_fn,
+                        env=env,
+                        instruction=instruction,
+                        observation=observation,
+                        max_steps=max_steps,
+                    )
+
+                    total_reward += reward
+                    total_steps += steps
+                    if success:
+                        success_count += 1
+
+                    _log(f"  Result: {'SUCCESS' if success else 'FAIL'} (reward={reward:.2f}, steps={steps})")
+
+                    # Print progress
+                    current_success_rate = success_count / (idx + 1)
+                    _log(f"  Running: {success_count}/{idx + 1} = {current_success_rate:.1%}")
+
+                except Exception as e:
+                    _log(f"  ERROR: {e}")
+                    import traceback
+                    _log(traceback.format_exc())
+                    continue
+
+            # --- Aggregate the results ---
+            total_count = len(instruction_idxs)
+            success_rate = success_count / total_count if total_count > 0 else 0.0
+            avg_reward = total_reward / total_count if total_count > 0 else 0.0
+            avg_steps = total_steps / total_count if total_count > 0 else 0.0
+
+            result["score"] = success_rate * 100  # convert to a percentage
+            result["accuracy_summary"] = {
+                "success_count": success_count,
+                "total_count": total_count,
+                "success_rate": success_rate,
+                "avg_reward": avg_reward,
+                "avg_steps": avg_steps,
+                "total_reward": total_reward,
+            }
+
+            _log(f"\nWebShop done: {success_count}/{total_count} = {success_rate:.2%}")
+            _log(f"  Average reward: {avg_reward:.3f}")
+            _log(f"  Average steps: {avg_steps:.1f}")
+
+        except Exception as e:
+            result["error"] = str(e)
+            _log(f"ERROR: {e}")
+            import traceback
+            _log(traceback.format_exc())
+
+        finally:
+            # --- Cleanup ---
+            if 'env' in locals():
+                env.close()
+
+            # Release LLM resources
+            if 'llm_cleanup' in locals():
+                llm_cleanup()
+
+            # Restore stdout
+            sys.stdout = old_stdout
+
+        return result
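The `_Tee` class in `eval.py` opens a log file that `run_eval` never closes when it restores `sys.stdout`. One possible alternative, sketched here under the assumption that a context manager fits the evaluator's control flow, wraps the same tee behavior in `contextlib.redirect_stdout` so the file is closed on exit. The names `ClosingTee` and `tee_stdout` are hypothetical and not part of the patch.

```python
import contextlib
import sys


class ClosingTee:
    """Hypothetical _Tee variant: mirrors writes to a stream and a log file."""

    def __init__(self, filepath, terminal):
        self.terminal = terminal
        self.log = open(filepath, "w", encoding="utf-8")

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
        self.log.flush()
        return len(message)

    def flush(self):
        self.terminal.flush()
        self.log.flush()

    def close(self):
        # Close only the log file; the wrapped stream is owned by the caller.
        self.log.close()


@contextlib.contextmanager
def tee_stdout(filepath):
    """Redirect stdout through a ClosingTee and close the log file on exit."""
    tee = ClosingTee(filepath, sys.stdout)
    try:
        with contextlib.redirect_stdout(tee):
            yield tee
    finally:
        tee.close()
```

With this shape, the body of `run_eval` could sit inside `with tee_stdout(log_file):`, which restores `sys.stdout` and closes the log even when an exception propagates.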
diff --git a/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/requirements.txt b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/requirements.txt
new file mode 100644
index 000000000..404e82479
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/benchmarks/webshop/requirements.txt
@@ -0,0 +1,29 @@
+# WebShop benchmark dependencies
+#
+# Prerequisites: Java 11+ (JDK) and faiss-cpu
+#   conda install -c conda-forge openjdk=11 faiss-cpu
+#
+# Install:
+#   pip install -r benchmarks/webshop/requirements.txt
+#   python -m spacy download en_core_web_sm
+#
+# Note: Flask/Werkzeug are already pinned to 2.x in the main requirements.txt
+
+# WebShop PyPI package
+webshop
+
+# Data download tool
+gdown
+
+# WebShop-specific dependencies
+gym==0.24.0
+beautifulsoup4==4.11.1
+cleantext==1.1.4
+pyserini==0.17.0
+rank_bm25==0.2.2
+thefuzz==0.19.0
+spacy==3.7.2
+
+# Note: Flask/Werkzeug are pinned to 2.x (Flask 3.x is incompatible with WebShop)
+flask==2.2.5
+Werkzeug==2.2.3
\ No newline at end of file
diff --git a/rdagent/scenarios/rl/autorl_bench/conf.py b/rdagent/scenarios/rl/autorl_bench/conf.py
new file mode 100644
index 000000000..50d59c25c
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/conf.py
@@ -0,0 +1,49 @@
+"""
+AutoRL-Bench configuration
+
+Standalone configuration: it does not depend on RL_RD_SETTING and only reuses the rdagent base class.
+"""
+from pathlib import Path
+
+from pydantic_settings import SettingsConfigDict
+
+from rdagent.core.conf import ExtendedBaseSettings
+
+
+class AutoRLBenchSettings(ExtendedBaseSettings):
+    """AutoRL-Bench configuration
+    
+    Environment variable prefix: AUTORL_
+    Example: AUTORL_FILE_PATH=/data/autorl_bench
+    """
+    model_config = SettingsConfigDict(env_prefix="AUTORL_", protected_namespaces=())
+    
+    file_path: Path = Path.cwd() / "git_ignore_folder" / "rl_files"
+    rdagent_root: Path = Path.cwd()  # used for Docker mounts; override via AUTORL_RDAGENT_ROOT
+
+
+AUTORL_BENCH_SETTING = AutoRLBenchSettings()
+
+
+def get_autorl_bench_dir() -> Path:
+    return Path(__file__).parent
+
+
+def get_workspace_dir() -> Path:
+    return get_autorl_bench_dir() / "workspace"
+
+
+def get_instructions_file() -> Path:
+    return get_autorl_bench_dir() / "core" / "instructions.md"
+
+
+def get_models_dir() -> Path:
+    return AUTORL_BENCH_SETTING.file_path / "models"
+
+
+def get_data_dir() -> Path:
+    return AUTORL_BENCH_SETTING.file_path / "datasets"
+
+
+def get_baseline_cache_dir() -> Path:
+    return AUTORL_BENCH_SETTING.file_path / "baseline_workspace"
diff --git a/rdagent/scenarios/rl/autorl_bench/core/__init__.py b/rdagent/scenarios/rl/autorl_bench/core/__init__.py
new file mode 100644
index 000000000..a4b7bf38a
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/core/__init__.py
@@ -0,0 +1,94 @@
+"""
+AutoRL-Bench Core Module
+
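+Core code that defines the unified evaluation interface and services.
+This module does not need to change when a new benchmark or agent is developed.
+
+================================================================================
+Interface contract for developers (inputs / outputs / stages)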
+================================================================================
+Put all the branches and complexity during initialization.
+
+Config:
+- eval_class
+    - uri: rdagent.scenarios.rl.autorl_bench.core.OpenCompassEvaluator
+    - kwargs:
+        config: BenchmarkConfig   # see benchmarks/__init__.py
+        ...
+
+RESTful API  -- (...) -> def run_eval(...):
+
+class BenchmarkBase:
+    '''
+    Base class for all benchmark evaluators (the actual code is BaseEvaluator in evaluator.py)
+    '''
+    def run_eval(self, model_path: str, workspace_path: str, **kwargs) -> dict:
+        '''
+        Inputs:
+            model_path:     path to the trained model (local directory)
+            workspace_path: path to the working directory
+            **kwargs:       task configuration (model name, GPU count, test range, etc.)
+
+        Output (dict):
+            benchmark:        str           # benchmark name
+            model_path:       str           # path of the evaluated model
+            score:            float         # evaluation score (0-100)
+            accuracy_summary: Dict[str,Any] # detailed metrics
+
+        Side effects:
+            - writes evaluation result files under workspace_path
+            - logs to logger
+        '''
+        ...
+
+class DRBenchmark(BenchmarkBase):
+    '''Example concrete implementation (e.g. OpenCompassEvaluator, ALFWorldEvaluator)'''
+    def run_eval(self, model_path: str, workspace_path: str, **kwargs) -> dict:
+        '''Run the benchmark-specific evaluation logic and return a result dict in the unified format'''
+        ...
+
+================================================================================
+"""
+from .evaluator import (
+    BaseEvaluator,
+    EvalInput,
+    EvalResult,
+)
+from .opencompass import OpenCompassEvaluator
+from .utils import (
+    ensure_symlink,
+    download_model,
+    download_data,
+    get_baseline_score,
+    submit_to_grading_server,
+    set_baseline_to_server,
+    create_grading_server,
+    setup_workspace,
+    append_result,
+    detect_driver_model,
+    print_summary,
+    kill_process_group,
+)
+
+__all__ = [
+    # Data structures
+    "EvalInput",
+    "EvalResult",
+    # Evaluators
+    "BaseEvaluator",
+    "OpenCompassEvaluator",
+    # Utility functions
+    "ensure_symlink",
+    "download_model",
+    "download_data",
+    "get_baseline_score",
+    "submit_to_grading_server",
+    "set_baseline_to_server",
+    "create_grading_server",
+    # workspace & results
+    "setup_workspace",
+    "append_result",
+    "detect_driver_model",
+    "print_summary",
+    "kill_process_group",
+]
diff --git a/rdagent/scenarios/rl/autorl_bench/core/evaluator.py b/rdagent/scenarios/rl/autorl_bench/core/evaluator.py
new file mode 100644
index 000000000..da17c2869
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/core/evaluator.py
@@ -0,0 +1,159 @@
+"""
+AutoRL-Bench Evaluator Base Class
+
+Base class for all benchmark evaluators; defines the unified evaluation interface.
+
+To add a new benchmark, subclass BaseEvaluator and implement the run_eval method.
+"""
+from abc import ABC, abstractmethod
+from pathlib import Path
+from typing import Any, Dict
+from typing_extensions import TypedDict, NotRequired
+
+
+# ============================================================
+# Data structure definitions (schema)
+# ============================================================
+
+class EvalInput(TypedDict):
+    """
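+    Evaluation input parameters
+    
+    Attributes:
+        model_path: path to the trained model (local directory)
+        workspace_path: path to the working directory
+        model_name: model name (used to configure inference parameters)
+        gpu_count: number of available GPUs
+        test_range: test data range, e.g. "[:]" or "[:100]"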
+    """
+    model_path: str
+    workspace_path: str
+    model_name: NotRequired[str]
+    gpu_count: NotRequired[int]
+    test_range: NotRequired[str]
+
+
+class EvalResult(TypedDict):
+    """
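+    Evaluation output
+    
+    Required fields:
+        benchmark: benchmark name
+        model_path: path of the evaluated model
+        score: evaluation score (0-100)
+        accuracy_summary: dictionary of detailed metrics
+        
+    Optional fields:
+        eval_type: evaluation type ("opencompass" / "alfworld" / ...)
+        error: error message (when evaluation fails)
+        raw_output: raw output log
+    """
+    # Required fields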
+    benchmark: str
+    model_path: str
+    score: float
+    accuracy_summary: Dict[str, Any]
+    
+    # Optional fields
+    eval_type: NotRequired[str]
+    error: NotRequired[str]
+    raw_output: NotRequired[str]
+
+
+# ============================================================
+# Abstract base class
+# ============================================================
+
+class BaseEvaluator(ABC):
+    """
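+    Base class for benchmark evaluators
+    
+    Every custom benchmark must subclass this class and implement the run_eval method.
+    
+    =====================================================
+    Simplest approach: call the benchmark's own evaluation code
+    =====================================================
+    
+    Most benchmarks (e.g. HumanEval, MBPP, ALFWorld) ship official evaluation scripts,
+    so you only need to:
+    1. Download the benchmark repo
+    2. Call its evaluation function
+    3. Convert the result into the EvalResult format
+    
+    Example (wrapping an existing evaluation):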
+        class MyBenchmarkEvaluator(BaseEvaluator):
+            def __init__(self, config):
+                self.config = config
+                self.benchmark_id = config.id
+            
+            def run_eval(self, model_path, workspace_path, **kwargs) -> EvalResult:
+                result = self.get_default_result(self.benchmark_id, model_path)
+                
+                # 1. Call the benchmark's own evaluation
+                from some_benchmark import evaluate  # the benchmark's official library
+                raw_result = evaluate(model_path)    # run the official evaluation
+                
+                # 2. Convert to the unified format
+                result["score"] = raw_result["accuracy"] * 100
+                result["accuracy_summary"] = raw_result
+                return result
+    
+    =====================================================
+    Full example: custom evaluation logic
+    =====================================================
+    
+    If the evaluation must be fully custom (e.g. an interactive environment):
+    
+    Example:
+        class InteractiveEvaluator(BaseEvaluator):
+            def run_eval(self, model_path, workspace_path, **kwargs) -> EvalResult:
+                result = self.get_default_result(self.benchmark_id, model_path)
+                
+                # 1. Load the model
+                model = load_model(model_path)
+                
+                # 2. Run the evaluation loop
+                success = 0
+                for task in tasks:
+                    output = model.generate(task.prompt)
+                    if task.check(output):
+                        success += 1
+                
+                # 3. Return the result
+                result["score"] = success / len(tasks) * 100
+                result["accuracy_summary"] = {"success": success, "total": len(tasks)}
+                return result
+    """
+    
+    @abstractmethod
+    def run_eval(
+        self,
+        model_path: str,
+        workspace_path: str,
+        **kwargs
+    ) -> EvalResult:
+        """
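+        Run the evaluation
+        
+        Args:
+            model_path: path to the trained model (local directory)
+            workspace_path: path to the working directory
+            **kwargs: other evaluation parameters (see EvalInput)
+            
+        Returns:
+            EvalResult: the evaluation result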
+        """
+        pass
+    
+    def validate_model(self, model_path: str) -> bool:
+        """Check whether the model path exists."""
+        return Path(model_path).exists()
+    
+    def get_default_result(self, benchmark_name: str, model_path: str) -> EvalResult:
+        """Return the default result structure."""
+        return {
+            "benchmark": benchmark_name,
+            "model_path": model_path,
+            "score": 0.0,
+            "accuracy_summary": {},
+        }
diff --git a/rdagent/scenarios/rl/autorl_bench/core/instructions.md b/rdagent/scenarios/rl/autorl_bench/core/instructions.md
new file mode 100644
index 000000000..84dcc6259
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/core/instructions.md
@@ -0,0 +1,93 @@
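+# AutoRL-Bench Task Instructions
+
+You are a reinforcement learning training agent. Your goal is to improve model performance through RL post-training.
+
+## Environment Variables
+- TASK: task name
+- BASE_MODEL: base model name
+- MODEL_PATH: base model path (read-only)
+- DATA_PATH: training data path (read-only)
+- OUTPUT_DIR: model output directory (when submitting for evaluation, give a model path under this directory)
+- GRADING_SERVER_URL: grading server URL
+
+## Workspace Isolation
+**Your current directory is the workspace. Every file you need is under the current directory.**
+- **Do not `cd` outside the current directory** (do not access the parent directory or other paths)
+- **Use relative paths only** (e.g. `./code/train.py`, not absolute paths)
+- If a symlink points to an external path, ignore that and access it through the relative path
+
+## Directory Structure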
+```
+./
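+├── code/               # your code area (put all code you write here)
+├── data/               # training data (read-only)
+├── models/             # base model (read-only)
+├── output/             # model output (save trained models here)
+├── description.md      # task description (required reading)
+├── instructions.md     # this file
+└── ...                 # benchmark-specific files (run ls for the full list)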
+```
+
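+**Run `ls` first to see every file available in the current directory.** Different kinds of benchmarks provide different extra files:
+- **Interactive environments** (e.g. ALFWorld): provide `eval.py` (environment interaction + evaluation logic), prompt templates, config files, and so on. These are the key references for writing training code
+- **Static datasets** (e.g. GSM8K): provide training samples mainly through the data files under `data/`
+
+Explore the workspace and understand the available resources before writing any code.
+
+**Notes**:
+- `code/`: write and run training scripts here (e.g. `code/train.py`)
+- `output/`: where trained models are stored. It can hold multiple versions (e.g. `output/v1/`, `output/v2/`); give the specific path when submitting
+
+## Task Workflow
+1. Run `ls` to see every file available in the current directory
+2. Read `description.md` to understand the task goal
+3. If there is an `eval.py`, **read it carefully**: it contains the environment interaction logic, the model inference method, and the evaluation flow
+4. Explore `data/` to understand the training data format
+5. Write training scripts under `code/` (SFT, GRPO, PPO, etc. are all acceptable; the end goal is RL post-training)
+6. Save the model to $OUTPUT_DIR
+7. Submit for evaluation: POST $GRADING_SERVER_URL/submit
+8. Adjust your strategy based on the returned score and repeat steps 5-7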
+
+## API
+```bash
+# Submit for evaluation (specify the model path; returns score, improvement, best)
+curl -X POST "$GRADING_SERVER_URL/submit" \
+    -H "Content-Type: application/json" \
+    -d '{"model_path": "'"$OUTPUT_DIR"'/v1"}'
+
+# Evaluate on a specific GPU (optional; defaults to the first available GPU)
+curl -X POST "$GRADING_SERVER_URL/submit" \
+    -H "Content-Type: application/json" \
+    -d '{"model_path": "'"$OUTPUT_DIR"'/v1", "gpu": "0"}'
+
+# Multi-GPU evaluation
+curl -X POST "$GRADING_SERVER_URL/submit" \
+    -H "Content-Type: application/json" \
+    -d '{"model_path": "'"$OUTPUT_DIR"'/v1", "gpu": "2,3"}'
+
+# Health check (returns the list of available GPUs and other info)
+curl "$GRADING_SERVER_URL/health"
+```
+
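+### /submit Parameters
+| Parameter | Type | Required | Description |
+|------|------|------|------|
+| model_path | string | Yes | Model path |
+| gpu | string | No | GPU(s) to use (e.g. "0", "1", "0,1"); each must be one of the available GPUs. Defaults to the first available GPU when omitted. Check /health for the available list |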
+
+### /submit Response Example
+```json
+{
+  "submission_id": 3,
+  "score": 65.0,
+  "baseline_score": 45.0,
+  "improvement": 20.0,
+  "best": {"submission_id": 2, "score": 68.0},
+  "total_submissions": 3
+}
+```
+
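+## Notes
+- You may submit different model versions multiple times; the system tracks the best score automatically
+- Use your time wisely and iterate based on the score feedback
+- After trl saves a model, `extra_special_tokens` in `tokenizer_config.json` is stored as a list, but vLLM/transformers expect a dict when loading. Delete that field after saving the model, or evaluation will fail.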
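The last note in `instructions.md` asks for the `extra_special_tokens` field to be deleted from `tokenizer_config.json` after saving. A minimal sketch of that cleanup follows, assuming the saved model directory contains a plain JSON `tokenizer_config.json`. The helper name `strip_extra_special_tokens` is hypothetical, and removing the field only when it is a list is a conservative choice made for this sketch, not something the instructions specify.

```python
import json
from pathlib import Path


def strip_extra_special_tokens(model_dir: str) -> bool:
    """Delete a list-valued `extra_special_tokens` field; return True if the file changed."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.exists():
        return False

    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: a dict-valued field is already loadable, so only a list is removed.
    if not isinstance(config.get("extra_special_tokens"), list):
        return False

    del config["extra_special_tokens"]
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return True
```

It could be called right after the training script saves a checkpoint, for example `strip_extra_special_tokens("output/v1")`, before the path is submitted for grading.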
diff --git a/rdagent/scenarios/rl/autorl_bench/core/opencompass.py b/rdagent/scenarios/rl/autorl_bench/core/opencompass.py
new file mode 100644
index 000000000..bda4d8fe0
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/core/opencompass.py
@@ -0,0 +1,165 @@
+"""
+OpenCompass Evaluator
+
+For every benchmark evaluated with OpenCompass (gsm8k, math, etc.).
+"""
+import subprocess
+from pathlib import Path
+from typing import Any, Dict
+
+import pandas as pd
+import yaml
+
+from rdagent.components.benchmark import BENCHMARK_CONFIGS_DIR
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.rl.autorl_bench.core.evaluator import BaseEvaluator
+from rdagent.utils.agent.tpl import T
+
+
+class OpenCompassEvaluator(BaseEvaluator):
+    """
+    Generic OpenCompass evaluator
+    
+    Applies to every benchmark evaluated with OpenCompass.
+    """
+    
+    def __init__(self, config):
+        self.config = config
+        self.benchmark_id = config.id
+        self.eval_config = config.eval_config or {}
+    
+    def run_eval(
+        self,
+        model_path: str,
+        workspace_path: str,
+        model_name: str = "",
+        gpu_count: int = 1,
+        test_range: str = "[:]",
+        **kwargs
+    ) -> Dict[str, Any]:
+        """Evaluate with OpenCompass."""
+        result = self.get_default_result(self.benchmark_id, model_path)
+        result["eval_type"] = "opencompass"
+        
+        if not self.validate_model(model_path):
+            result["error"] = f"Model not found: {model_path}"
+            return result
+        
+        workspace = Path(workspace_path)
+        model_path = str(Path(model_path).resolve())
+        work_dir = workspace / "benchmark_results"
+        work_dir.mkdir(parents=True, exist_ok=True)
+        
+        # Get the evaluation config
+        dataset_import = self.eval_config.get(
+            "dataset", 
+            f"opencompass.configs.datasets.{self.benchmark_id}"
+        )
+        
+        # Get the model inference config from models.yaml
+        inference_config = self._get_model_inference_config(model_name, gpu_count)
+        
+        # Generate the OpenCompass config
+        template_vars = {
+            "model_abbr": f"rl-{self.benchmark_id}",
+            "model_path": model_path,
+            "dataset_imports": [dataset_import],
+            "test_range": test_range,
+            "num_runs": 1,
+            "pass_k": None,
+            "work_dir": str(work_dir),
+            "is_lora": False,
+            "lora_path": "",
+            **inference_config,
+        }
+        
+        config_content = T("rdagent.components.benchmark.configs.opencompass_template:template").r(**template_vars)
+        config_path = workspace / "opencompass_config.py"
+        config_path.write_text(config_content)
+        
+        logger.info(f"Running OpenCompass benchmark: {self.benchmark_id}")
+        logger.info(f"Model: {model_path}")
+        
+        # Run OpenCompass
+        cmd = ["opencompass", str(config_path), "--work-dir", str(work_dir)]
+        
+        try:
+            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=3600)
+        except subprocess.TimeoutExpired:
+            result["error"] = "OpenCompass timeout (3600s)"
+            return result
+        
+        if proc.returncode != 0:
+            error_msg = proc.stderr[:1000] if proc.stderr else proc.stdout[:1000] if proc.stdout else "No output"
+            logger.warning(f"OpenCompass failed: {error_msg}")
+            result["error"] = f"OpenCompass exit code: {proc.returncode}"
+            result["raw_output"] = error_msg
+            return result
+        
+        # Parse the results
+        result = self._parse_results(work_dir, result)
+        logger.info(f"Benchmark score: {result['score']}")
+        return result
+    
+    def _get_model_inference_config(self, model_name: str, gpu_count: int) -> dict:
+        """Load the model inference config from models.yaml."""
+        config_data = yaml.safe_load((BENCHMARK_CONFIGS_DIR / "models.yaml").read_text(encoding="utf-8"))
+        
+        default_config = config_data.get("default", {})
+        models_config = config_data.get("models", {})
+        
+        model_specific = models_config.get(model_name, {})
+        if not model_specific:
+            best_match_len = 5
+            for configured_model in models_config:
+                if model_name.startswith(configured_model) and len(configured_model) > best_match_len:
+                    model_specific = models_config[configured_model]
+                    best_match_len = len(configured_model)
+        
+        final_config = {**default_config, **model_specific}
+        
+        # Resolve "auto" tensor_parallel_size to the largest power of two <= gpu_count
+        if final_config.get("tensor_parallel_size") == "auto":
+            if gpu_count <= 0:
+                final_config["tensor_parallel_size"] = 1
+            else:
+                power = 0
+                while (1 << (power + 1)) <= gpu_count:
+                    power += 1
+                final_config["tensor_parallel_size"] = 1 << power
+        
+        return final_config
+    
+    def _parse_results(self, work_dir: Path, result: dict) -> dict:
+        """Parse the OpenCompass output."""
+        timestamped_dirs = sorted(
+            [d for d in work_dir.glob("202*_*") if d.is_dir()], 
+            reverse=True
+        )
+        
+        if not timestamped_dirs:
+            result["error"] = "No results directory found"
+            return result
+        
+        summary_dir = timestamped_dirs[0] / "summary"
+        csv_files = list(summary_dir.rglob("*.csv"))
+        
+        if not csv_files:
+            result["error"] = "No results CSV found"
+            return result
+        
+        df = pd.read_csv(csv_files[0])
+        score_col = [c for c in df.columns if c not in ["dataset", "version", "metric", "mode"]]
+        
+        if score_col:
+            scores = df[score_col[0]].dropna().values
+            if len(scores) > 0:
+                raw = scores[0]
+                try:
+                    result["score"] = float(raw)
+                except (ValueError, TypeError):
+                    logger.warning(f"OpenCompass returned non-numeric score: {raw!r}, treating as 0.0")
+                    result["score"] = 0.0
+                result["accuracy_summary"] = {"accuracy": result["score"]}
+        
+        return result
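
The two selection rules in `_get_model_inference_config` above (longest configured prefix wins when there is no exact match, and `"auto"` resolves to the largest power of two that fits in the GPU count) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the project's code: `pick_model_config` and `resolve_tensor_parallel_size` are hypothetical helper names, and the prefix rule assumes a starting match length of 0.

```python
def pick_model_config(model_name: str, models_config: dict) -> dict:
    """Return the exact-match config, else the config of the longest matching prefix.

    Hypothetical helper; assumes any non-empty configured prefix may match.
    """
    exact = models_config.get(model_name)
    if exact:
        return exact
    best: dict = {}
    best_len = 0
    for prefix, cfg in models_config.items():
        if model_name.startswith(prefix) and len(prefix) > best_len:
            best, best_len = cfg, len(prefix)
    return best


def resolve_tensor_parallel_size(gpu_count: int) -> int:
    """Return the largest power of two that is <= gpu_count (never below 1)."""
    if gpu_count <= 0:
        return 1
    power = 0
    while (1 << (power + 1)) <= gpu_count:
        power += 1
    return 1 << power
```

With 6 visible GPUs this yields a tensor-parallel size of 4, so two GPUs stay idle; the rule rounds down to a power of two rather than using every GPU.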
diff --git a/rdagent/scenarios/rl/autorl_bench/core/server.py b/rdagent/scenarios/rl/autorl_bench/core/server.py
new file mode 100644
index 000000000..d83453df8
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/core/server.py
@@ -0,0 +1,260 @@
+"""
+AutoRL-Bench Grading Server (Simplified)
+
+A minimal grading service that mainly exposes the submit endpoint.
+"""
+import json
+import os
+import time
+from datetime import datetime
+from pathlib import Path
+from typing import Optional, Set
+
+from flask import Flask, jsonify, request
+
+from rdagent.log import rdagent_logger as logger
+
+app = Flask(__name__)
+
+
+def _get_available_gpus() -> Set[str]:
+    """Return the set of available GPUs listed in CUDA_VISIBLE_DEVICES."""
+    cuda_env = os.environ.get("CUDA_VISIBLE_DEVICES", "")
+    if not cuda_env.strip():
+        return set()
+    return {g.strip() for g in cuda_env.split(",") if g.strip()}
+
+
+def _validate_gpu(gpu: str, available: Set[str]) -> Optional[str]:
+    """Validate the gpu parameter; return an error message, or None if it is valid."""
+    requested = {g.strip() for g in gpu.split(",") if g.strip()}
+    if not requested:
+        return "gpu parameter is empty"
+    invalid = requested - available
+    if invalid:
+        return f"GPU {invalid} not in available GPUs {sorted(available)} (from CUDA_VISIBLE_DEVICES)"
+    return None
+
+
+class GradingServer:
+    """Grading server."""
+    
+    def __init__(
+        self,
+        task: str,
+        base_model: str,
+        workspace: Path,
+    ):
+        self.task = task
+        self.base_model = base_model
+        self.workspace = Path(workspace)
+        self.scores_file = self.workspace / "scores.json"
+        self.baseline_score: Optional[float] = None
+        self.available_gpus: Set[str] = _get_available_gpus()
+    
+    def load_scores(self) -> list[dict]:
+        if self.scores_file.exists():
+            return json.loads(self.scores_file.read_text())
+        return []
+    
+    def save_scores(self, scores: list[dict]):
+        self.scores_file.write_text(json.dumps(scores, indent=2, ensure_ascii=False))
+    
+    def get_evaluator(self):
+        """Return the evaluator for the current task."""
+        from rdagent.scenarios.rl.autorl_bench.benchmarks import get_evaluator
+        return get_evaluator(self.task)
+    
+    def submit(self, model_path: str, gpu: Optional[str] = None) -> dict:
+        """
+        Submit a model for evaluation.
+        
+        Args:
+            model_path: Path to the model.
+            gpu: GPU(s) to use (e.g. "0", "1", "0,1"); must be a subset of CUDA_VISIBLE_DEVICES.
+                 If None, the lowest-numbered GPU in CUDA_VISIBLE_DEVICES is used.
+            
+        Returns:
+            A result dict with the full details, including score, best, and improvement.
+            
+        Raises:
+            ValueError: gpu is not within CUDA_VISIBLE_DEVICES.
+        """
+        if self.available_gpus:
+            if gpu is None:
+                gpu = sorted(self.available_gpus, key=int)[0]
+            else:
+                err = _validate_gpu(gpu, self.available_gpus)
+                if err:
+                    raise ValueError(err)
+        
+        start_time = time.time()
+        scores = self.load_scores()
+        submission_id = len(scores) + 1
+        
+        logger.info(f"[SUBMIT #{submission_id}] Started | model_path={model_path} | gpu={gpu}")
+        
+        old_cuda = os.environ.get("CUDA_VISIBLE_DEVICES")
+        if gpu is not None:
+            os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
+        
+        try:
+            evaluator = self.get_evaluator()
+            result = evaluator.run_eval(
+                model_path=model_path,
+                workspace_path=str(self.workspace),
+                model_name=self.base_model,
+            )
+        finally:
+            if old_cuda is not None:
+                os.environ["CUDA_VISIBLE_DEVICES"] = old_cuda
+            elif "CUDA_VISIBLE_DEVICES" in os.environ:
+                del os.environ["CUDA_VISIBLE_DEVICES"]
+        
+        elapsed_seconds = time.time() - start_time
+        
+        # Extract the score
+        score = result.get("score", 0.0)
+        
+        # Compute the improvement over the baseline
+        improvement = None
+        if self.baseline_score is not None:
+            improvement = round(score - self.baseline_score, 6)
+        
+        # Build the result entry
+        entry = {
+            "submission_id": submission_id,
+            "timestamp": datetime.now().isoformat(),
+            "model_path": model_path,
+            "score": score,
+            "baseline_score": self.baseline_score,
+            "improvement": improvement,
+            "elapsed_seconds": round(elapsed_seconds, 2),
+        }
+        
+        scores.append(entry)
+        self.save_scores(scores)
+        
+        # Find the highest-scoring submission
+        best_entry = max(scores, key=lambda x: x.get("score", 0))
+        
+        logger.info(f"[SUBMIT #{submission_id}] Done | score={score}, best={best_entry['score']}")
+        
+        return {
+            **entry,
+            "best": best_entry,
+            "total_submissions": len(scores),
+        }
+    
+    def set_baseline(self, score: float):
+        """Set the baseline score."""
+        self.baseline_score = score
+        logger.info(f"[BASELINE] Set to {score}")
+
+
+# Global server instance
+_server: Optional[GradingServer] = None
+
+
+def get_server() -> GradingServer:
+    global _server
+    if _server is None:
+        raise RuntimeError("Server not initialized. Call init_server() first.")
+    return _server
+
+
+def init_server(task: str, base_model: str, workspace: str) -> GradingServer:
+    """Initialize the server."""
+    global _server
+    _server = GradingServer(task, base_model, Path(workspace))
+    return _server
+
+
+# Flask routes
+@app.route("/submit", methods=["POST"])
+def submit():
+    """
+    Submit a model for evaluation.
+    
+    Request:
+        {"model_path": "/path/to/model", "gpu": "0"}  (gpu is optional)
+        
+    Response:
+        {
+            "submission_id": 1,
+            "score": 85.0,
+            "improvement": 5.0,
+            "best": {...},
+            "total_submissions": 10
+        }
+    """
+    data = request.get_json() or {}
+    model_path = data.get("model_path")
+    gpu = data.get("gpu")
+    
+    if not model_path:
+        return jsonify({"error": "Missing model_path"}), 400
+
+    server = get_server()
+    if gpu is not None:
+        gpu = str(gpu)
+        err = _validate_gpu(gpu, server.available_gpus)
+        if err:
+            return jsonify({
+                "error": err,
+                "available_gpus": sorted(server.available_gpus, key=int),
+            }), 400
+
+    try:
+        result = server.submit(model_path, gpu=gpu)
+        return jsonify(result)
+    except (RuntimeError, ValueError, OSError) as e:
+        logger.error(f"[SUBMIT] Error: {e}")
+        return jsonify({"error": str(e)}), 500
+
+
+@app.route("/health", methods=["GET"])
+def health():
+    """Health check."""
+    server = get_server()
+    return jsonify({
+        "status": "ok",
+        "task": server.task,
+        "workspace": str(server.workspace),
+        "available_gpus": sorted(server.available_gpus, key=int) if server.available_gpus else [],
+    })
+
+
+@app.route("/set_baseline", methods=["POST"])
+def set_baseline():
+    """Set the baseline score."""
+    data = request.get_json() or {}
+    score = data.get("score")
+    
+    if score is None:
+        return jsonify({"error": "Missing score"}), 400
+    
+    server = get_server()
+    server.set_baseline(float(score))
+    return jsonify({"baseline_score": score, "status": "set"})
+
+
+def run_server(task: str, base_model: str, workspace: str, host: str = "0.0.0.0", port: int = 5000):
+    """Start the server."""
+    init_server(task, base_model, workspace)
+    logger.info(f"Grading Server | task={task} | {host}:{port}")
+    app.run(host=host, port=port, debug=False, threaded=True)
+
+
+if __name__ == "__main__":
+    import argparse
+    
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--task", type=str, required=True)
+    parser.add_argument("--base-model", type=str, default="")
+    parser.add_argument("--workspace", type=str, default=".")
+    parser.add_argument("--port", type=int, default=5000)
+    parser.add_argument("--host", type=str, default="0.0.0.0")
+    args = parser.parse_args()
+    
+    run_server(args.task, args.base_model, args.workspace, args.host, args.port)
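
`GradingServer.submit` above temporarily overrides `CUDA_VISIBLE_DEVICES` and restores it in a `finally` block. The same save-and-restore pattern can be written as a small context manager. This is a hedged sketch only: `temporary_env` is a hypothetical name that does not appear in the patch, and because `os.environ` is process-wide, the override is visible to every thread while the block runs.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def temporary_env(name: str, value: Optional[str]) -> Iterator[None]:
    """Set an environment variable for the duration of a block, then restore it.

    Hypothetical helper: if value is None the variable is left untouched.
    """
    old = os.environ.get(name)
    if value is not None:
        os.environ[name] = value
    try:
        yield
    finally:
        if old is not None:
            # Restore the previous value.
            os.environ[name] = old
        else:
            # The variable did not exist before, so remove it again.
            os.environ.pop(name, None)
```

Wrapping the evaluation call in `with temporary_env("CUDA_VISIBLE_DEVICES", gpu):` would keep the restore logic in one place. It does not make concurrent submissions safe, since two requests can still overwrite each other's value.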
diff --git a/rdagent/scenarios/rl/autorl_bench/core/ui.py b/rdagent/scenarios/rl/autorl_bench/core/ui.py
new file mode 100644
index 000000000..cab5fb5f1
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/core/ui.py
@@ -0,0 +1,126 @@
+"""
+AutoRL-Bench Results Dashboard
+
+Usage:
+    streamlit run rdagent/scenarios/rl/autorl_bench/core/ui.py --server.port=8510 --server.address=0.0.0.0
+"""
+import pandas as pd
+import streamlit as st
+from pathlib import Path
+
+# ---------- Page config ----------
+st.set_page_config(page_title="AutoRL-Bench", page_icon="🧪", layout="wide")
+
+CSV_PATH = Path(__file__).resolve().parent.parent / "results.csv"
+
+# ---------- Custom styles ----------
+st.markdown("""
+<style>
+    /* Metric cards */
+    div[data-testid="stMetric"] {
+        background: linear-gradient(135deg, #667eea11, #764ba211);
+        border: 1px solid #e0e0e0;
+        border-radius: 10px;
+        padding: 10px 14px;
+    }
+    div[data-testid="stMetric"] label {
+        font-size: 0.72rem;
+        font-weight: 600;
+        text-transform: uppercase;
+        letter-spacing: 0.5px;
+        opacity: 0.7;
+    }
+    div[data-testid="stMetric"] div[data-testid="stMetricValue"] {
+        font-size: 1.3rem;
+        font-weight: 700;
+    }
+    /* Table cell font size */
+    .stDataFrame td {
+        font-size: 0.9rem;
+    }
+</style>
+""", unsafe_allow_html=True)
+
+# ---------- Title ----------
+st.markdown("# 🧪 AutoRL-Bench Results")
+st.divider()
+
+# ---------- Load data ----------
+if not CSV_PATH.exists():
+    st.info("No results yet. Run an experiment first.")
+    st.stop()
+
+df = pd.read_csv(CSV_PATH)
+df["timestamp"] = pd.to_datetime(df["timestamp"])
+df["duration_min"] = (df["duration_s"] / 60).round(1)
+
+# ---------- Sidebar ----------
+with st.sidebar:
+    st.markdown("### Filters")
+    agents = ["All"] + sorted(df["agent"].unique().tolist())
+    sel_agent = st.selectbox("Agent", agents)
+
+    tasks = ["All"] + sorted(df["task"].unique().tolist())
+    sel_task = st.selectbox("Task", tasks)
+
+    st.divider()
+    st.markdown("### About")
+    st.markdown(
+        "Evaluating LLM-driven agents that optimize smaller LLMs "
+        "via RL post-training."
+    )
+
+filtered = df.copy()
+if sel_agent != "All":
+    filtered = filtered[filtered["agent"] == sel_agent]
+if sel_task != "All":
+    filtered = filtered[filtered["task"] == sel_task]
+
+# ---------- Agent comparison ----------
+if len(filtered) > 1:
+    st.markdown("#### Agent Summary")
+    summary = (
+        filtered.groupby(["agent", "task", "base_model"])
+        .agg(
+            runs=("agent", "size"),
+            success=("success", "sum"),
+            best=("best_score", "max"),
+            best_improve=("improvement", "max"),
+            subs=("submissions", "sum"),
+        )
+        .round(2)
+        .reset_index()
+        .sort_values("best", ascending=False)
+    )
+    summary.columns = ["Agent", "Task", "Base Model", "Runs", "Success", "Best", "Best Impr.", "Submissions"]
+    st.dataframe(summary, use_container_width=True, hide_index=True)
+
+st.divider()
+
+# ---------- Results table ----------
+st.markdown("#### Run History")
+display = filtered[[
+    "timestamp", "agent", "driver_model", "base_model", "task",
+    "baseline", "best_score", "improvement", "submissions",
+    "duration_min", "success", "workspace",
+]].sort_values("timestamp", ascending=False)
+
+display.columns = [
+    "Time", "Agent", "Driver LLM", "Base Model", "Task",
+    "Baseline", "Best Score", "Improvement", "Submissions",
+    "Duration(min)", "Success", "Workspace",
+]
+
+st.dataframe(
+    display,
+    use_container_width=True,
+    hide_index=True,
+    column_config={
+        "Time": st.column_config.DatetimeColumn(format="YYYY-MM-DD HH:mm"),
+        "Best Score": st.column_config.NumberColumn(format="%.2f"),
+        "Baseline": st.column_config.NumberColumn(format="%.2f"),
+        "Improvement": st.column_config.NumberColumn(format="%.2f"),
+        "Duration(min)": st.column_config.NumberColumn(format="%.0f"),
+        "Success": st.column_config.CheckboxColumn(),
+    },
+)
diff --git a/rdagent/scenarios/rl/autorl_bench/core/utils.py b/rdagent/scenarios/rl/autorl_bench/core/utils.py
new file mode 100644
index 000000000..1180cfb9b
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/core/utils.py
@@ -0,0 +1,402 @@
+"""
+AutoRL-Bench Core Utilities
+
+Shared utility functions: downloads, baseline, grading client, workspace, results
+"""
+import csv
+import json
+import os
+import re
+import subprocess
+import threading
+import time
+from datetime import datetime
+from pathlib import Path
+from typing import Optional
+
+import requests
+from huggingface_hub import snapshot_download
+
+from rdagent.log import rdagent_logger as logger
+
+from rdagent.scenarios.rl.autorl_bench.conf import (
+    AUTORL_BENCH_SETTING,
+    get_baseline_cache_dir,
+    get_data_dir,
+    get_models_dir,
+)
+from werkzeug.serving import make_server
+
+from rdagent.scenarios.rl.autorl_bench.core.server import app, init_server
+
+
+def kill_process_group(proc: "subprocess.Popen") -> None:
+    """Best-effort kill of the process group: SIGTERM → SIGKILL → proc.kill()."""
+    import signal as _signal
+
+    if proc.poll() is not None:
+        return
+    for sig in (_signal.SIGTERM, _signal.SIGKILL):
+        try:
+            os.killpg(os.getpgid(proc.pid), sig)
+            proc.wait(timeout=10)
+            return
+        except ProcessLookupError:
+            return
+        except subprocess.TimeoutExpired:
+            continue
+        except OSError:
+            break
+    proc.kill()
+    proc.wait()
+
+
+# ============================================================
+# File utilities
+# ============================================================
+
+def ensure_symlink(src: Path, dst: Path):
+    """Create a symlink dst -> src; no-op if src is missing or dst already exists (safe under concurrent calls)."""
+    if not src.exists():
+        return
+    try:
+        dst.symlink_to(src)
+    except FileExistsError:
+        pass
+
+
+# ============================================================
+# Downloads
+# ============================================================
+
+def download_model(model_name: str, model_dir: Optional[str] = None) -> str:
+    """Download the model (skipped if it already exists)."""
+    base_dir = Path(model_dir) if model_dir else get_models_dir()
+    target_dir = base_dir / model_name
+    
+    if target_dir.exists() and any(target_dir.iterdir()):
+        logger.info(f"Model exists: {target_dir}")
+        return str(target_dir)
+    
+    logger.info(f"Downloading model: {model_name}...")
+    target_dir.mkdir(parents=True, exist_ok=True)
+    snapshot_download(repo_id=model_name, local_dir=str(target_dir), local_dir_use_symlinks=False)
+    logger.info(f"Model downloaded to {target_dir}")
+    return str(target_dir)
+
+
+def download_data(task: str, data_dir: Optional[str] = None) -> str:
+    """Download the training data (the portion visible to the agent).
+
+    Two modes are supported:
+    1. data_module mode (legacy): call download_train_data() from data.py
+    2. download_data.py script mode (smith benchmarks): run the script directly
+    """
+    import importlib
+    import shutil
+    import sys
+    from rdagent.scenarios.rl.autorl_bench.benchmarks import get_benchmark, BENCHMARKS_DIR
+
+    config = get_benchmark(task)
+    base_dir = Path(data_dir) if data_dir else get_data_dir()
+    target_dir = base_dir / task
+
+    if config.data_module:
+        # Legacy path (gsm8k, alfworld, etc.)
+        module = importlib.import_module(config.data_module)
+        module.download_train_data(target_dir)
+    else:
+        # Script path (all smith benchmarks)
+        bench_dir = Path(config.bench_dir) if config.bench_dir else BENCHMARKS_DIR / task
+        script = bench_dir / "download_data.py"
+        if script.exists():
+            target_dir.mkdir(parents=True, exist_ok=True)
+            subprocess.run(
+                [sys.executable, str(script)],
+                cwd=str(bench_dir),
+                check=True,
+            )
+            # The script writes to bench_dir/data/train.jsonl; copy it to target_dir
+            src = bench_dir / "data" / "train.jsonl"
+            dst = target_dir / "train.jsonl"
+            if src.exists() and not dst.exists():
+                shutil.copy2(src, dst)
+        else:
+            # No download script — copy pre-existing data from bench_dir/data/
+            target_dir.mkdir(parents=True, exist_ok=True)
+            src = bench_dir / "data" / "train.jsonl"
+            dst = target_dir / "train.jsonl"
+            if src.exists() and not dst.exists():
+                shutil.copy2(src, dst)
+                logger.info(f"Copied {src} → {dst}")
+            elif not src.exists():
+                logger.warning(f"Benchmark {task} has no data_module, download_data.py, or train.jsonl")
+
+    return str(target_dir)
+
+
+# ============================================================
+# Baseline
+# ============================================================
+
+def _safe_model_name(model_name: str) -> str:
+    """Convert a model name into a filesystem-safe file name."""
+    return re.sub(r"[/\\:*?\"<>|]", "_", model_name)
+
+
+def get_baseline_score(
+    task: str,
+    model_name: str,
+    model_path: str,
+    workspace_path: str,
+    gpu_count: int = 1,
+    test_range: str = "[:]",
+    force_rerun: bool = False,
+) -> float:
+    """Return the baseline score (read from the cache if present, otherwise run the evaluation)."""
+    safe_name = _safe_model_name(model_name)
+    cache_file = get_baseline_cache_dir() / f"{task}_{safe_name}.json"
+    
+    # Check the cache
+    if not force_rerun and cache_file.exists():
+        data = json.loads(cache_file.read_text())
+        score = data.get("score", 0.0)
+        logger.info(f"Baseline cache hit: {cache_file.name}, score={score}")
+        return score
+    
+    # Run the evaluation
+    logger.info(f"Running baseline evaluation: task={task}, model={model_name}")
+    from rdagent.scenarios.rl.autorl_bench.benchmarks import get_evaluator
+    evaluator = get_evaluator(task)
+    result = evaluator.run_eval(
+        model_path=model_path,
+        workspace_path=workspace_path,
+        model_name=model_name,
+        gpu_count=gpu_count,
+        test_range=test_range,
+    )
+    
+    score = result.get("score", 0.0)
+    logger.info(f"Baseline score: {score}")
+    
+    # Save to the cache
+    cache_file.parent.mkdir(parents=True, exist_ok=True)
+    cache_data = {
+        "task": task,
+        "model_name": model_name,
+        "score": score,
+        "test_range": test_range,
+        "timestamp": datetime.now().isoformat(),
+    }
+    cache_file.write_text(json.dumps(cache_data, indent=2, ensure_ascii=False))
+    
+    return score
+
+
+# ============================================================
+# Grading Server Client
+# ============================================================
+
+def submit_to_grading_server(
+    model_path: str,
+    grading_url: Optional[str] = None,
+    timeout: int = 600,
+) -> dict | None:
+    """Submit a model to the grading server for evaluation."""
+    url = grading_url or os.environ.get("GRADING_SERVER_URL")
+    if not url:
+        return None
+    
+    logger.info(f"Submitting to grading server: {url}/submit")
+    resp = requests.post(f"{url}/submit", json={"model_path": model_path}, timeout=timeout)
+    resp.raise_for_status()
+    result = resp.json()
+    logger.info(f"Grading result: score={result.get('score')}")
+    return result
+
+
+def set_baseline_to_server(score: float, grading_url: Optional[str] = None) -> bool:
+    """Send the baseline score to the grading server."""
+    url = grading_url or os.environ.get("GRADING_SERVER_URL")
+    if not url:
+        return False
+    
+    resp = requests.post(f"{url}/set_baseline", json={"score": score}, timeout=30)
+    resp.raise_for_status()
+    return True
+
+
+# ============================================================
+# Grading Server context managers
+# ============================================================
+
+class GradingServerContext:
+    """Base class for grading server contexts."""
+    
+    def __enter__(self):
+        return self
+    
+    def __exit__(self, *args):
+        pass
+    
+    def get_baseline(self, task: str, model_name: str, model_path: str, workspace_path: str) -> float:
+        raise NotImplementedError
+    
+    def load_scores(self) -> list:
+        raise NotImplementedError
+
+
+class LocalServerContext(GradingServerContext):
+    """Local Flask server."""
+
+    def __init__(self, task: str, base_model: str, workspace: str, port: int):
+        self.task = task
+        self.base_model = base_model
+        self.workspace = workspace
+        self.port = port
+        self.server = None
+        self._http_server = None
+        self._thread = None
+
+    def __enter__(self):
+        logger.info(f"[Local Mode] Starting evaluation server on port {self.port}...")
+        self.server = init_server(self.task, self.base_model, self.workspace)
+
+        self._http_server = make_server("0.0.0.0", self.port, app, threaded=True)
+        self._thread = threading.Thread(target=self._http_server.serve_forever, daemon=True)
+        self._thread.start()
+
+        # Poll /health for up to 15 seconds instead of blind sleep(2)
+        deadline = time.time() + 15
+        while time.time() < deadline:
+            try:
+                resp = requests.get(f"http://localhost:{self.port}/health", timeout=2)
+                if resp.status_code == 200:
+                    break
+            except (requests.ConnectionError, requests.Timeout):
+                pass
+            time.sleep(0.5)
+        else:
+            raise RuntimeError(f"Grading server failed to start on port {self.port}")
+
+        return self
+
+    def __exit__(self, *args):
+        if self._http_server:
+            self._http_server.shutdown()
+            self._http_server = None
+    
+    def get_baseline(self, task: str, model_name: str, model_path: str, workspace_path: str) -> float:
+        baseline = get_baseline_score(task, model_name, model_path, workspace_path)
+        self.server.set_baseline(baseline)
+        return baseline
+    
+    def load_scores(self) -> list:
+        return self.server.load_scores() if self.server else []
+
+
+def create_grading_server(benchmark, workspace: Path, port: int, base_model: str) -> GradingServerContext:
+    """Create the grading server context."""
+    return LocalServerContext(
+        task=benchmark.id,
+        base_model=base_model,
+        workspace=str(workspace),
+        port=port,
+    )
+
+
+# ============================================================
+# Workspace setup
+# ============================================================
+
+def setup_workspace(
+    run_id: str,
+    agent_id: str,
+    task: str,
+    base_model: str,
+    model_path: str,
+    data_path: str,
+    benchmark,
+) -> Path:
+    """Create an isolated workspace directory, link the resource files into it, and return its path."""
+    from rdagent.scenarios.rl.autorl_bench.benchmarks import BENCHMARKS_DIR
+    from rdagent.scenarios.rl.autorl_bench.conf import get_instructions_file, get_workspace_dir
+
+    workspace = get_workspace_dir() / task / f"{run_id}_{agent_id}"
+    workspace.mkdir(parents=True, exist_ok=True)
+    (workspace / "code").mkdir(exist_ok=True)
+    (workspace / "output").mkdir(exist_ok=True)
+
+    # Model & data symlinks
+    model_link = workspace / "models" / base_model
+    data_link = workspace / "data"
+    model_link.parent.mkdir(parents=True, exist_ok=True)
+
+    ensure_symlink(Path(model_path), model_link)
+    ensure_symlink(Path(data_path), data_link)
+
+    # Linked files: task description + general instructions + benchmark-specific files
+    bench_dir = Path(benchmark.bench_dir) if benchmark.bench_dir else BENCHMARKS_DIR / task
+    ensure_symlink(bench_dir / "description.md", workspace / "description.md")
+    ensure_symlink(get_instructions_file(), workspace / "instructions.md")
+
+    for fname in benchmark.expose_files:
+        ensure_symlink(bench_dir / fname, workspace / fname)
+
+    return workspace
+
+
+# ============================================================
+# Results CSV logging
+# ============================================================
+
+RESULTS_CSV_COLUMNS = [
+    "run_id", "timestamp", "task", "agent", "driver_model", "base_model",
+    "baseline", "best_score", "improvement", "submissions",
+    "duration_s", "success", "workspace",
+]
+
+
+def detect_driver_model(env: dict) -> str:
+    """Detect, from environment variables, the name of the LLM that drives the agent."""
+    return (
+        env.get("LLM_MODEL")
+        or os.environ.get("CHAT_MODEL")
+        or os.environ.get("OPENAI_MODEL")
+        or "unknown"
+    )
+
+
+def append_result(row: dict) -> Path:
+    """Append one row to the global results.csv and return the file path."""
+    from rdagent.scenarios.rl.autorl_bench.conf import get_autorl_bench_dir
+
+    results_csv = get_autorl_bench_dir() / "results.csv"
+    write_header = not results_csv.exists()
+    with open(results_csv, "a", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=RESULTS_CSV_COLUMNS)
+        if write_header:
+            writer.writeheader()
+        writer.writerow(row)
+    return results_csv
+
+
+# ============================================================
+# Run summary
+# ============================================================
+
+def print_summary(
+    baseline: float,
+    best: dict | None,
+    scores: list,
+    workspace,
+) -> None:
+    """Print the run summary."""
+    logger.info("=" * 60)
+    logger.info(f"Baseline: {baseline}")
+    if best:
+        logger.info(f"Best Score: {best.get('score', 0)}")
+        logger.info(f"Improvement: {best.get('improvement')}")
+    logger.info(f"Total Submissions: {len(scores)}")
+    logger.info(f"Workspace: {workspace}")
+    logger.info("=" * 60)
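
`LocalServerContext.__enter__` above polls `/health` until a deadline instead of sleeping for a fixed time. The pattern generalizes to any readiness check. The sketch below is an illustration under assumptions, not code from the patch: `wait_until` is a hypothetical helper, and it uses `time.monotonic()` (rather than `time.time()` as in the diff) so that wall-clock adjustments cannot shorten or extend the wait.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 15.0, interval: float = 0.5) -> None:
    """Call check() repeatedly until it returns True or the timeout expires.

    Hypothetical helper: raises RuntimeError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return
        time.sleep(interval)
    raise RuntimeError(f"Condition not met within {timeout} seconds")
```

A health probe would pass a `check` that issues the HTTP request and returns `True` on status 200, treating connection errors as "not ready yet".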
diff --git a/rdagent/scenarios/rl/autorl_bench/requirements.txt b/rdagent/scenarios/rl/autorl_bench/requirements.txt
new file mode 100644
index 000000000..d24f2a53d
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/requirements.txt
@@ -0,0 +1,33 @@
+# AutoRL-Bench dependencies
+# conda environment: cwy-rl (Python 3.10)
+
+# RL training (core)
+trl>=0.27.0
+accelerate>=1.0.0
+datasets>=3.0.0
+peft>=0.18.1
+
+# Evaluation
+opencompass==0.5.1
+
+# Inference acceleration (optional; TRL supports 0.10.2-0.12.0)
+vllm>=0.12.0
+
+# Data processing
+numpy>=1.26.0
+pandas>=1.5.0
+pydantic>=2.0.0
+
+# Models
+torch>=2.0.0
+transformers>=4.40.0
+huggingface_hub>=0.20.0
+
+# Web service
+flask
+flask-cors
+
+# Utilities
+loguru
+requests
+pyyaml
diff --git a/rdagent/scenarios/rl/autorl_bench/run.py b/rdagent/scenarios/rl/autorl_bench/run.py
new file mode 100644
index 000000000..212dce3bb
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/run.py
@@ -0,0 +1,199 @@
+#!/usr/bin/env python
+"""
+AutoRL-Bench Runner
+
+Entry-point script.
+
+Usage:
+    python -m rdagent.scenarios.rl.autorl_bench.run \
+        --agent example_agent --task gsm8k --model Qwen/Qwen2.5-0.5B
+"""
+import argparse
+import os
+import signal
+import subprocess
+import sys
+from datetime import datetime
+
+from dotenv import load_dotenv
+from loguru import logger as loguru_logger
+
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.rl.autorl_bench.agents import get_agent
+from rdagent.scenarios.rl.autorl_bench.benchmarks import get_benchmark
+from rdagent.scenarios.rl.autorl_bench.core import (
+    download_model,
+    download_data,
+    create_grading_server,
+    setup_workspace,
+    append_result,
+    detect_driver_model,
+    print_summary,
+    kill_process_group,
+)
+
+
+def run(
+    agent_id: str,
+    task: str,
+    base_model: str,
+    timeout: int = 3600,
+    port: int = 5000,
+) -> dict:
+    """Run an agent evaluation."""
+    from rdagent.scenarios.rl.autorl_bench.conf import get_workspace_dir
+
+    start_time = datetime.now()
+    run_id = start_time.strftime("%Y%m%dT%H%M%S")
+    if port != 5000:
+        run_id = f"{run_id}_p{port}"
+    benchmark = get_benchmark(task)
+
+    # Each run gets its own workspace and its own log file
+    workspace = get_workspace_dir() / task / f"{run_id}_{agent_id}"
+    workspace.mkdir(parents=True, exist_ok=True)
+    log_file = workspace / "run.log"
+    _sink_id = loguru_logger.add(log_file, format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}", level="DEBUG")
+
+    # Use a mutable container so the closure can see the agent subprocess assigned later
+    _agent_proc = [None]
+
+    # On SIGTERM/SIGINT, kill the whole process tree before exiting
+    def _on_signal(signum, frame):
+        sig_name = signal.Signals(signum).name
+        logger.warning(f"Received {sig_name}, terminating...")
+        proc = _agent_proc[0]
+        if proc is not None:
+            kill_process_group(proc)
+        logger.info(f"Run interrupted by {sig_name} at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+        loguru_logger.remove(_sink_id)
+        sys.exit(128 + signum)
+
+    signal.signal(signal.SIGTERM, _on_signal)
+    signal.signal(signal.SIGINT, _on_signal)
+
+    logger.info("=== AutoRL-Bench ===")
+    logger.info(f"Agent: {agent_id}, Task: {task}, Model: {base_model}")
+    logger.info(f"Workspace: {workspace}")
+    logger.info(f"Start: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
+
+    # 1. Prepare resources (downloads are skipped if already present)
+    logger.info("Preparing resources...")
+    model_path = download_model(base_model)
+    data_path = download_data(task)
+
+    # 2. Set up the workspace (adds the symlinked resources)
+    workspace = setup_workspace(
+        run_id, agent_id, task, base_model, model_path, data_path, benchmark,
+    )
+
+    # 3. Start the Grading Server and run the agent
+    with create_grading_server(benchmark, workspace, port, base_model) as grading:
+        logger.info("Evaluating baseline...")
+        baseline = grading.get_baseline(
+            task, base_model, str(workspace / "models" / base_model), str(workspace),
+        )
+        logger.info(f"Baseline Score: {baseline}")
+
+        agent = get_agent(agent_id)
+        logger.info(f"Running agent: {agent.name}")
+
+        env = {
+            **os.environ,
+            "TASK": task,
+            "BASE_MODEL": base_model,
+            "WORKSPACE": str(workspace),
+            "MODEL_PATH": str(workspace / "models" / base_model),
+            "DATA_PATH": str(workspace / "data"),
+            "OUTPUT_DIR": str(workspace / "output"),
+            "GRADING_SERVER_URL": f"http://localhost:{port}",
+            **agent.env_vars,
+        }
+
+        agent_log = workspace / "agent.log"
+        success = False
+        with open(agent_log, "w", encoding="utf-8") as af:
+            proc = subprocess.Popen(
+                ["bash", str(agent.start)],
+                env=env,
+                stdout=af,
+                stderr=subprocess.STDOUT,
+                start_new_session=True,
+            )
+            _agent_proc[0] = proc
+            try:
+                proc.wait(timeout=timeout)
+                success = proc.returncode == 0
+                logger.info(f"Agent finished, exit_code={proc.returncode}, log: {agent_log}")
+            except subprocess.TimeoutExpired:
+                logger.warning(f"Agent timed out after {timeout}s, killing process group...")
+                kill_process_group(proc)
+
+        scores = grading.load_scores()
+
+    # 4. Save the results
+    end_time = datetime.now()
+    best = max(scores, key=lambda x: x.get("score", 0)) if scores else None
+
+    result = {
+        "success": success,
+        "agent_id": agent_id,
+        "task": task,
+        "base_model": base_model,
+        "baseline_score": baseline,
+        "best": best,
+        "total_submissions": len(scores),
+        "duration_seconds": (end_time - start_time).total_seconds(),
+    }
+
+    # Append to the global results.csv
+    append_result({
+        "run_id": run_id,
+        "timestamp": start_time.strftime("%Y-%m-%d %H:%M:%S"),
+        "task": task,
+        "agent": agent_id,
+        "driver_model": detect_driver_model(env),
+        "base_model": base_model,
+        "baseline": baseline,
+        "best_score": best.get("score", 0) if best else 0,
+        "improvement": best.get("improvement") if best else None,
+        "submissions": len(scores),
+        "duration_s": round((end_time - start_time).total_seconds()),
+        "success": success,
+        "workspace": str(workspace),
+    })
+
+    print_summary(baseline, best, scores, workspace)
+
+    logger.info(f"Log saved to: {log_file}")
+
+    # Remove the file sink added for this run
+    loguru_logger.remove(_sink_id)
+
+    return result
+
+
+def main():
+    load_dotenv(".env")
+
+    parser = argparse.ArgumentParser(description="AutoRL-Bench Runner")
+    parser.add_argument("--agent", "-a", required=True, help="Agent ID (openhands, rdagent)")
+    parser.add_argument("--task", "-t", required=True, help="Task name (gsm8k, math, alfworld)")
+    parser.add_argument("--model", "-m", required=True, help="Base model name")
+    parser.add_argument("--timeout", type=int, default=3600, help="Timeout in seconds")
+    parser.add_argument("--port", type=int, default=5000, help="Grading server port")
+    args = parser.parse_args()
+
+    result = run(
+        agent_id=args.agent,
+        task=args.task,
+        base_model=args.model,
+        timeout=args.timeout,
+        port=args.port,
+    )
+
+    sys.exit(0 if result["success"] else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/rdagent/scenarios/rl/autorl_bench/test/__init__.py b/rdagent/scenarios/rl/autorl_bench/test/__init__.py
new file mode 100644
index 000000000..9c35b136f
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/test/__init__.py
@@ -0,0 +1 @@
+# AutoRL-Bench test module
diff --git a/rdagent/scenarios/rl/autorl_bench/test/test_benchmark.py b/rdagent/scenarios/rl/autorl_bench/test/test_benchmark.py
new file mode 100644
index 000000000..62a6f6d9c
--- /dev/null
+++ b/rdagent/scenarios/rl/autorl_bench/test/test_benchmark.py
@@ -0,0 +1,110 @@
+"""
+Test the benchmark evaluation flow.
+
+Usage:
+    python -m rdagent.scenarios.rl.autorl_bench.test.test_benchmark \
+        --model-path /path/to/model \
+        --task gsm8k
+"""
+import argparse
+import json
+import sys
+import time
+from pathlib import Path
+
+import requests
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model-path", required=True, help="Local model path")
+    parser.add_argument("--model-name", default=None, help="Model name (inferred from the path by default)")
+    parser.add_argument("--task", default="gsm8k", help="Evaluation task")
+    parser.add_argument("--port", type=int, default=15000, help="Grading server port")
+    args = parser.parse_args()
+
+    model_path = Path(args.model_path).resolve()
+    if not model_path.exists():
+        print(f"[ERROR] Model not found: {model_path}")
+        return 1
+
+    model_name = args.model_name or model_path.name
+    grading_url = f"http://localhost:{args.port}"
+
+    print(f"Model Path: {model_path}")
+    print(f"Model Name: {model_name}")
+    print(f"Task: {args.task}")
+    print(f"Grading URL: {grading_url}")
+    print("-" * 50)
+
+    # Use a fixed workspace
+    from rdagent.scenarios.rl.autorl_bench.conf import get_workspace_dir
+    workspace = get_workspace_dir() / args.task
+    workspace.mkdir(parents=True, exist_ok=True)
+    print(f"Workspace: {workspace}")
+
+    # Start the grading server
+    from rdagent.scenarios.rl.autorl_bench.core.server import init_server, app
+    import threading
+    
+    server = init_server(args.task, model_name, str(workspace))
+    
+    print(f"Starting grading server on port {args.port}...")
+    server_thread = threading.Thread(
+        target=lambda: app.run(host="0.0.0.0", port=args.port, debug=False, threaded=False),
+        daemon=True
+    )
+    server_thread.start()
+    
+    # Wait for the server to come up
+    for _ in range(10):
+        time.sleep(0.5)
+        try:
+            resp = requests.get(f"{grading_url}/health", timeout=2)
+            if resp.status_code == 200:
+                print("Grading server started.")
+                break
+        except requests.RequestException:
+            pass
+    else:
+        print("[ERROR] Grading server failed to start")
+        return 1
+
+    # Submit for evaluation
+    print("-" * 50)
+    print("Submitting model for evaluation...")
+    print(f"POST {grading_url}/submit")
+    
+    start_time = time.time()
+    resp = requests.post(
+        f"{grading_url}/submit",
+        json={"model_path": str(model_path)},
+        timeout=3600,
+    )
+    elapsed = time.time() - start_time
+
+    print("-" * 50)
+    print(f"Response status: {resp.status_code}")
+    print(f"Elapsed: {elapsed:.2f}s")
+    print("Result:")
+    
+    passed = False
+    if resp.status_code == 200:
+        result = resp.json()
+        print(json.dumps(result, indent=2, ensure_ascii=False))
+        score = result.get("score") or 0
+        print("-" * 50)
+        passed = score > 0
+        print(f"[{'SUCCESS' if passed else 'FAILED'}] Score: {score}")
+    else:
+        print(f"Error response: {resp.text}")
+        print("-" * 50)
+        print(f"[ERROR] Server returned {resp.status_code}")
+
+    print("Done.")
+    # Exit non-zero when the submission failed or scored zero
+    return 0 if passed else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/rdagent/scenarios/rl/dev/feedback.py b/rdagent/scenarios/rl/dev/feedback.py
new file mode 100644
index 000000000..994e456bd
--- /dev/null
+++ b/rdagent/scenarios/rl/dev/feedback.py
@@ -0,0 +1,116 @@
+import json
+from typing import Any
+
+from rdagent.core.proposal import Experiment2Feedback, HypothesisFeedback
+from rdagent.core.scenario import Scenario
+from rdagent.log import rdagent_logger as logger
+from rdagent.oai.llm_utils import APIBackend
+from rdagent.utils.agent.tpl import T
+
+
+class RLExperiment2Feedback(Experiment2Feedback):
+    """Generate feedback for RL post-training experiments using LLM."""
+
+    def __init__(self, scen: Scenario, version: str = "exp_feedback") -> None:
+        super().__init__(scen)
+        self.version = version
+
+    def generate_feedback(
+        self, exp: Any, trace: Any | None = None, exception: Exception | None = None
+    ) -> HypothesisFeedback:
+        """Generate feedback using LLM."""
+        # Extract the experiment result
+        result = getattr(exp, "result", {}) or {}
+        exit_code = result.get("exit_code", -1)
+        stdout = result.get("stdout", "")
+        running_time = result.get("running_time", 0)
+        benchmark = result.get("benchmark")
+        benchmark_summary = None
+        if benchmark:
+            try:
+                benchmark_summary = json.dumps(benchmark, ensure_ascii=False, indent=2)
+            except TypeError:
+                benchmark_summary = str(benchmark)
+        
+        # Extract the hypothesis and task description
+        hypothesis = str(exp.hypothesis) if exp.hypothesis else "N/A"
+        task_desc = exp.sub_tasks[0].get_task_information() if exp.sub_tasks else "N/A"
+        
+        if exception is not None:
+            return self._gen_error_feedback(hypothesis, str(exception))
+        
+        return self._gen_feedback_with_llm(
+            hypothesis=hypothesis,
+            task_desc=task_desc,
+            exit_code=exit_code,
+            stdout=stdout,
+            running_time=running_time,
+            benchmark=benchmark_summary,
+        )
+
+    def _gen_feedback_with_llm(
+        self,
+        hypothesis: str,
+        task_desc: str,
+        exit_code: int,
+        stdout: str,
+        running_time: float,
+        benchmark: str | None,
+    ) -> HypothesisFeedback:
+        """Generate feedback using LLM."""
+        system_prompt = T(".prompts:exp_feedback.system").r()
+        user_prompt = T(".prompts:exp_feedback.user").r(
+            hypothesis=hypothesis,
+            task_desc=task_desc,
+            exit_code=exit_code,
+            stdout=stdout,
+            running_time=running_time,
+            benchmark=benchmark,
+            exception=None,
+        )
+
+        resp = APIBackend().build_messages_and_create_chat_completion(
+            user_prompt=user_prompt,
+            system_prompt=system_prompt,
+            json_mode=True,
+        )
+        resp_dict = json.loads(resp)
+
+        decision = resp_dict.get("decision", exit_code == 0)
+        reason = resp_dict.get("reason", "")
+        suggestions = resp_dict.get("suggestions", "")
+
+        logger.info(f"Feedback: decision={decision}, reason={reason[:100]}...")
+
+        return HypothesisFeedback(
+            decision=decision,
+            reason=reason,
+            code_change_summary=suggestions,
+        )
+
+    def _gen_error_feedback(self, hypothesis: str, error_info: str) -> HypothesisFeedback:
+        """Generate feedback for failed experiments."""
+        system_prompt = T(".prompts:exp_feedback_error.system").r()
+        user_prompt = T(".prompts:exp_feedback_error.user").r(
+            hypothesis=hypothesis,
+            error_info=error_info,
+        )
+
+        resp = APIBackend().build_messages_and_create_chat_completion(
+            user_prompt=user_prompt,
+            system_prompt=system_prompt,
+            json_mode=True,
+        )
+        resp_dict = json.loads(resp)
+
+        error_type = resp_dict.get("error_type", "Unknown")
+        root_cause = resp_dict.get("root_cause", error_info)
+        fix_suggestion = resp_dict.get("fix_suggestion", "")
+
+        logger.error(f"Error feedback: {error_type} - {root_cause[:100]}...")
+
+        return HypothesisFeedback(
+            decision=False,
+            reason=f"[{error_type}] {root_cause}",
+            code_change_summary=fix_suggestion,
+        )
diff --git a/rdagent/scenarios/rl/dev/prompts.yaml b/rdagent/scenarios/rl/dev/prompts.yaml
new file mode 100644
index 000000000..dd59b5477
--- /dev/null
+++ b/rdagent/scenarios/rl/dev/prompts.yaml
@@ -0,0 +1,63 @@
+exp_feedback:
+  system: |-
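+    You are an RL post-training expert responsible for analyzing experiment results and generating feedback.
+
+    ## Analysis dimensions
+    1. Whether training completed successfully
+    2. Code quality and implementation correctness
+    3. Whether the hypothesis goal was achieved
+    4. Suggestions for improvement
+
+    ## Output requirements
+    JSON format: {"decision": true/false, "reason": "...", "suggestions": "..."}
+    - decision: true accepts the current experiment, false rejects it
+    - reason: the reason for the decision
+    - suggestions: suggested improvements for the next step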
+
+  user: |-
+    ## Hypothesis
+    {{ hypothesis }}
+
+    ## Task description
+    {{ task_desc }}
+
+    ## Execution result
+    - exit_code: {{ exit_code }}
+    - running_time: {{ running_time }}s
+    {% if stdout %}
+    - stdout (first 1000 characters):
+    {{ stdout[:1000] }}
+    {% endif %}
+    {% if benchmark %}
+    ## Benchmark result
+    {{ benchmark }}
+    {% endif %}
+
+    {% if exception %}
+    ## Exception
+    {{ exception }}
+    {% endif %}
+
+    Analyze the experiment result and provide feedback.
+
+exp_feedback_error:
+  system: |-
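+    You are an RL post-training expert responsible for analyzing failed experiments.
+
+    ## Common error types
+    - ImportError: a dependency is missing
+    - SyntaxError: the code has a syntax error
+    - RuntimeError: a runtime error (OOM, CUDA, etc.)
+    - API incompatibility: a library version problem
+
+    ## Output requirements
+    JSON format: {"error_type": "...", "root_cause": "...", "fix_suggestion": "..."}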
+
+  user: |-
+    ## Hypothesis
+    {{ hypothesis }}
+
+    ## Error information
+    {{ error_info }}
+
+    Analyze the cause of the error and suggest a fix.
diff --git a/rdagent/scenarios/rl/env/__init__.py b/rdagent/scenarios/rl/env/__init__.py
new file mode 100644
index 000000000..1581d26a2
--- /dev/null
+++ b/rdagent/scenarios/rl/env/__init__.py
@@ -0,0 +1,8 @@
+"""RL Environment Configuration"""
+
+from rdagent.scenarios.rl.env.conf import (
+    RL_DATA_DIR,
+    RL_MODELS_DIR,
+)
+
+__all__ = ["RL_DATA_DIR", "RL_MODELS_DIR"]
diff --git a/rdagent/scenarios/rl/env/conf.py b/rdagent/scenarios/rl/env/conf.py
new file mode 100644
index 000000000..85cdc8eb2
--- /dev/null
+++ b/rdagent/scenarios/rl/env/conf.py
@@ -0,0 +1,15 @@
+"""
+RL Training Environment Configuration
+
+In autorl_bench mode, run.py has already set up the environment, so Docker is not needed.
+The base path configuration is kept for other modules to reference.
+"""
+
+import os
+from pathlib import Path
+
+from rdagent.app.rl.conf import RL_RD_SETTING
+
+# RL resource paths (env vars take precedence, falling back to RL_RD_SETTING)
+RL_MODELS_DIR = Path(os.environ.get("MODEL_PATH", str(RL_RD_SETTING.file_path / "models")))
+RL_DATA_DIR = Path(os.environ.get("DATA_PATH", str(RL_RD_SETTING.file_path / "datasets")))
diff --git a/rdagent/scenarios/rl/env/docker/base/Dockerfile b/rdagent/scenarios/rl/env/docker/base/Dockerfile
new file mode 100644
index 000000000..c11a83bdf
--- /dev/null
+++ b/rdagent/scenarios/rl/env/docker/base/Dockerfile
@@ -0,0 +1,18 @@
+# Base image: PyTorch 2.9.1 + TRL + transformers (shared by training and evaluation)
+FROM pytorch/pytorch:2.9.1-cuda12.6-cudnn9-runtime
+
+WORKDIR /workspace
+
+# System dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git ca-certificates \
+    && rm -rf /var/lib/apt/lists/*
+
+# LLM post-training libraries (trl automatically installs compatible versions of transformers, accelerate, and datasets)
+# Also include `litellm` for AutoRL-Bench evaluation adapters (e.g. GSM8K).
+# Note: transformers 4.57.x fixes a compatibility issue between tokenizer save_pretrained and vLLM;
+# transformers 5.0 removed Qwen2TokenizerFast, which makes the saved format incompatible.
+RUN pip install --no-cache-dir trl==0.27.0 peft verl==0.7.0 "litellm>=1.73" "transformers>=4.50,<5.0"
+
+# Default entrypoint
+CMD ["bash"]
diff --git a/rdagent/scenarios/rl/env/docker/evalplus/Dockerfile b/rdagent/scenarios/rl/env/docker/evalplus/Dockerfile
new file mode 100644
index 000000000..2585b327e
--- /dev/null
+++ b/rdagent/scenarios/rl/env/docker/evalplus/Dockerfile
@@ -0,0 +1,9 @@
+# EvalPlus training + evaluation image
+FROM autorl-bench/base:latest
+
+WORKDIR /workspace
+
+# Extra install: evalplus
+RUN pip install --no-cache-dir evalplus
+
+CMD ["bash"]
diff --git a/rdagent/scenarios/rl/env/docker/gsm8k/Dockerfile b/rdagent/scenarios/rl/env/docker/gsm8k/Dockerfile
new file mode 100644
index 000000000..bcd878b6d
--- /dev/null
+++ b/rdagent/scenarios/rl/env/docker/gsm8k/Dockerfile
@@ -0,0 +1,10 @@
+# GSM8K training image
+FROM autorl-bench/base:latest
+
+WORKDIR /workspace
+
+# GSM8K needs no extra dependencies; the base image already includes everything
+# main.py generated by the agent is mounted at /workspace
+
+CMD ["python", "main.py"]
+
diff --git a/rdagent/scenarios/rl/env/docker/miniwob/Dockerfile b/rdagent/scenarios/rl/env/docker/miniwob/Dockerfile
new file mode 100644
index 000000000..894b70b69
--- /dev/null
+++ b/rdagent/scenarios/rl/env/docker/miniwob/Dockerfile
@@ -0,0 +1,23 @@
+# MiniWoB training + evaluation image
+FROM autorl-bench/base:latest
+
+WORKDIR /workspace
+
+# Extra installs: browser + selenium + miniwob
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    chromium chromium-driver \
+    fonts-liberation \
+    libnss3 libxss1 libasound2 libgbm1 \
+    libx11-6 libxext6 libxrender1 libxtst6 \
+    libgtk-3-0 \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN pip install --no-cache-dir \
+    miniwob \
+    gymnasium \
+    selenium
+
+ENV CHROME_BIN=/usr/bin/chromium \
+    CHROMEDRIVER_BIN=/usr/bin/chromedriver
+
+CMD ["bash"]
diff --git a/rdagent/scenarios/rl/experiment/__init__.py b/rdagent/scenarios/rl/experiment/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/rdagent/scenarios/rl/experiment/experiment.py b/rdagent/scenarios/rl/experiment/experiment.py
new file mode 100644
index 000000000..3579d98b0
--- /dev/null
+++ b/rdagent/scenarios/rl/experiment/experiment.py
@@ -0,0 +1,21 @@
+"""RL Post-training Experiment"""
+
+from rdagent.core.experiment import Experiment, Task
+from rdagent.scenarios.rl.experiment.workspace import RLWorkspace
+
+
+class RLTask(Task):
+    """Task description used inside RDLoop (one per iteration).
+
+    Only used for internal flow within the rdagent framework; unrelated to the autorl_bench benchmark.
+    """
+    pass
+
+
+class RLExperiment(Experiment[RLTask, RLWorkspace, RLWorkspace]):
+    """RL post-training experiment with workspace initialization."""
+
+    def __init__(self, sub_tasks: list[RLTask], *args, **kwargs) -> None:
+        super().__init__(sub_tasks, *args, **kwargs)
+        # Initialize experiment workspace (required by CoSTEER)
+        self.experiment_workspace = RLWorkspace()
diff --git a/rdagent/scenarios/rl/experiment/workspace.py b/rdagent/scenarios/rl/experiment/workspace.py
new file mode 100644
index 000000000..d9d84429f
--- /dev/null
+++ b/rdagent/scenarios/rl/experiment/workspace.py
@@ -0,0 +1,42 @@
+"""
+RL Post-training Workspace
+
+Modeled on the SFT workspace: rdagent/scenarios/finetune/experiment/workspace.py
+"""
+
+from pathlib import Path
+from typing import TYPE_CHECKING
+
+from rdagent.core.experiment import FBWorkspace
+from rdagent.log import rdagent_logger as logger
+
+if TYPE_CHECKING:
+    from rdagent.utils.env import Env
+
+from rdagent.utils.env import DockerEnv, EnvResult
+
+
+class RLWorkspace(FBWorkspace):
+    """RL training workspace."""
+
+    def run(self, env: "Env", entry: str) -> EnvResult:
+        """Run a command in the given environment."""
+        self.prepare()
+        self.inject_files(**self.file_dict)
+        
+        result = env.run(entry, str(self.workspace_path))
+        
+        tag_prefix = "docker_run" if isinstance(env, DockerEnv) else "env_run"
+        logger.log_object(
+            {
+                "exit_code": result.exit_code,
+                "stdout": result.stdout or "",
+                "running_time": result.running_time,
+                "entry": entry,
+                "workspace_path": str(self.workspace_path),
+            },
+            tag=f"{tag_prefix}.RLWorkspace",
+        )
+        
+        return result
+
diff --git a/rdagent/scenarios/rl/loop.py b/rdagent/scenarios/rl/loop.py
new file mode 100644
index 000000000..1f944ea3c
--- /dev/null
+++ b/rdagent/scenarios/rl/loop.py
@@ -0,0 +1,65 @@
+import asyncio
+from typing import Any, TYPE_CHECKING
+
+from rdagent.components.workflow.rd_loop import RDLoop
+from rdagent.core.exception import CoderError
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.rl.proposal.trace import RLTrace
+
+if TYPE_CHECKING:
+    from rdagent.scenarios.rl.scen.scenario import RLPostTrainingScen
+
+
+class RLPostTrainingRDLoop(RDLoop):
+    """RL post-training loop using standard RDLoop workflow"""
+
+    skip_loop_error = (CoderError,)
+    skip_loop_error_stepname = "feedback"
+    withdraw_loop_error = ()
+
+    def __init__(self, PROP_SETTING: "RLPostTrainingScen"):
+        # Store rl-specific settings
+        self.rl_rd_setting = PROP_SETTING
+        # Initialize using base class
+        super().__init__(PROP_SETTING)
+
+        # Replace generic Trace with RLTrace for SOTA tracking
+        self.trace = RLTrace(scen=PROP_SETTING)
+
+    async def direct_exp_gen(self, prev_out: dict[str, Any]):
+        """Generate RL post-training experiment"""
+        exp = await self.hypothesis_gen.async_gen(self.trace, self)
+        logger.log_object(exp.hypothesis, tag="hypothesis")
+        logger.log_object(exp.sub_tasks, tag="experiment generation")
+        return exp
+
+    def coding(self, prev_out: dict[str, Any]):
+        """Generate rl post-training code"""
+        exp = prev_out["direct_exp_gen"]
+        exp = self.coder.develop(exp)
+        logger.log_object(exp.sub_workspace_list, tag="coder result")
+        return exp
+
+    def feedback(self, prev_out: dict[str, Any]):
+        """Generate feedback for RL post-training experiment - always call LLM"""
+
+        # Get experiment from available sources
+        exp = prev_out.get("running") or prev_out.get("coding") or prev_out.get("direct_exp_gen")
+        e = prev_out.get(self.EXCEPTION_KEY, None)
+        feedback = self.summarizer.generate_feedback(exp, self.trace, exception=e)
+
+        logger.log_object(feedback, tag="feedback")
+        return feedback
+
+    def record(self, prev_out: dict[str, Any]):
+        """Record the experiment and feedback into trace"""
+        feedback = prev_out["feedback"]
+        exp = prev_out.get("running") or prev_out.get("coding") or prev_out.get("direct_exp_gen")
+        self.trace.sync_dag_parent_and_hist((exp, feedback), prev_out[self.LOOP_IDX_KEY])
+
+    def dump(self, path):
+        """Skip dump if the loop contains unpicklable objects."""
+        try:
+            super().dump(path)
+        except TypeError as e:
+            logger.warning(f"Skip dump due to pickling error: {e}")
diff --git a/rdagent/scenarios/rl/proposal/prompts.yaml b/rdagent/scenarios/rl/proposal/prompts.yaml
new file mode 100644
index 000000000..c9bbf5770
--- /dev/null
+++ b/rdagent/scenarios/rl/proposal/prompts.yaml
@@ -0,0 +1,80 @@
+hypothesis_gen:
+  system: |-
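+    You are an RL post-training expert responsible for generating training hypotheses.
+
+    ## Core objective
+    **Improve the model's score on the benchmark.** This is the only objective.
+
+    ## Runtime environment
+    The system automatically deploys the code to `$WORKSPACE/code/` and runs it.
+    Environment variables (already set by the framework; read them directly via `os.environ`):
+    - `MODEL_PATH`: base model path (read-only)
+    - `DATA_PATH`: training data path (read-only)
+    - `OUTPUT_DIR`: model output directory (`$WORKSPACE/output/`)
+    - `GRADING_SERVER_URL`: grading service URL
+
+    ## Evaluation mechanism
+    After training completes, the system automatically submits the latest model under `$OUTPUT_DIR` to the Grading Server for evaluation.
+    - A model exists under `$OUTPUT_DIR` → submitted automatically, a score is returned
+    - `$OUTPUT_DIR` is empty → evaluation is skipped
+    - Subdirectories can distinguish versions (e.g. `output/v1/`, `output/v2/`); the system picks the latest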
+
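+    ## Strategy selection
+
+    ### Case 1: first run / code keeps failing (exit_code≠0)
+    - Generate simple, stable training code
+    - Goal: get the code to run end to end (exit_code=0)
+    - Saving the model can wait; validate the pipeline first
+
+    ### Case 2: code is stable but there is no evaluation score
+    - **This means training did not save a model to $OUTPUT_DIR**
+    - Generate a **formal training** hypothesis now
+    - The model must be saved to $OUTPUT_DIR
+
+    ### Case 3: an evaluation score exists and needs improving
+    - Focus on hyperparameter tuning
+    - Try different algorithms or configurations
+    - Change one variable at a time so results can be attributed
+
+    ## Available algorithms
+    - **GRPO**: recommended; works well for math reasoning and needs no preference pairs
+    - DPO: requires (chosen, rejected) preference pairs
+    - PPO/RLOO: other options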
+
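+    ## Framework
+    - trl (version 0.27+): GRPOTrainer, DPOTrainer, PPOTrainer
+
+    ## Output requirements
+    JSON format:
+    {
+      "hypothesis": "a concrete description of the training strategy",
+      "reason": "why this approach, based on analysis of the history",
+      "algorithm": "GRPO/DPO/PPO/RLOO",
+      "is_formal_training": true/false
+    }
+
+    - is_formal_training=true: formal training; the model is saved to $OUTPUT_DIR
+    - is_formal_training=false: debugging/validation; the model is not saved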
+
+  user: |-
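+    ## Base model
+    {{ base_model }}
+
+    ## Experiment history
+    {% if trace_summary %}
+    {{ trace_summary }}
+
+    **Analyze the history:**
+    1. exit_code: how many runs succeeded (0) and how many failed (non-zero)?
+    2. Benchmark score: is it a number or None?
+       - If None: no model was saved, so formal training is needed
+       - If a number: optimize from that baseline
+    3. Error patterns: are any errors repeated? How can they be avoided?
+    {% else %}
+    No experiment history (first run)
+    - Suggestion: generate simple, stable GRPO training code
+    - Goal: get the code running first and validate the training pipeline
+    {% endif %}
+
+    Generate the hypothesis for the next round of experiments.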
+
diff --git a/rdagent/scenarios/rl/proposal/proposal.py b/rdagent/scenarios/rl/proposal/proposal.py
new file mode 100644
index 000000000..d3330b603
--- /dev/null
+++ b/rdagent/scenarios/rl/proposal/proposal.py
@@ -0,0 +1,79 @@
+import json
+
+from rdagent.app.rl.conf import RL_RD_SETTING
+from rdagent.core.proposal import ExpGen, Hypothesis, Trace
+from rdagent.core.scenario import Scenario
+from rdagent.log import rdagent_logger as logger
+from rdagent.oai.llm_utils import APIBackend
+from rdagent.scenarios.rl.experiment.experiment import RLExperiment, RLTask
+from rdagent.utils.agent.tpl import T
+
+
+
+class RLPostTrainingExpGen(ExpGen):
+    """RL post-training experiment generator with LLM."""
+
+    def __init__(self, scen: Scenario | None = None):
+        super().__init__(scen)
+
+    def gen(self, trace: Trace) -> RLExperiment:
+        """Generate RL post-training experiment using LLM."""
+        # Build a summary of the experiment history
+        trace_summary = self._build_trace_summary(trace)
+
+        # Ask the LLM to generate a hypothesis
+        hypothesis_data = self._gen_hypothesis_with_llm(trace_summary)
+
+        # Create the task and the experiment
+        algorithm = hypothesis_data.get("algorithm", "PPO")
+        rl_task = RLTask(
+            name=f"RLTask_{algorithm}",
+            description=hypothesis_data.get("hypothesis", "Train RL agent"),
+        )
+        hypothesis = Hypothesis(
+            hypothesis=hypothesis_data.get("hypothesis", "Train RL agent"),
+            reason=hypothesis_data.get("reason", ""),
+            concise_reason="",
+            concise_observation="",
+            concise_justification="",
+            concise_knowledge="",
+        )
+        exp = RLExperiment(sub_tasks=[rl_task], hypothesis=hypothesis)
+        logger.info(f"Generated experiment: {hypothesis.hypothesis} (algorithm={algorithm})")
+        return exp
+
+    def _build_trace_summary(self, trace: Trace) -> str:
+        """Build summary of historical experiments."""
+        if not trace or not trace.hist:
+            return ""
+        
+        summaries = []
+        for i, (exp, feedback) in enumerate(trace.hist[-3:]):  # the 3 most recent experiments
+            status = "success" if feedback is not None and feedback.decision else "failure"
+            hypothesis = exp.hypothesis.hypothesis if exp.hypothesis else "N/A"
+            summaries.append(f"### Experiment {i+1}: {hypothesis}")
+            summaries.append(f"- Result: {status}")
+            # Add the failure reason and suggestions
+            if feedback is not None:
+                if getattr(feedback, 'reason', None):
+                    summaries.append(f"- Reason: {feedback.reason}")
+                if getattr(feedback, 'code_change_summary', None):
+                    summaries.append(f"- Suggestion: {feedback.code_change_summary}")
+        
+        return "\n".join(summaries)
+
+    def _gen_hypothesis_with_llm(self, trace_summary: str) -> dict:
+        """Generate hypothesis using LLM."""
+        system_prompt = T(".prompts:hypothesis_gen.system").r()
+        user_prompt = T(".prompts:hypothesis_gen.user").r(
+            base_model=RL_RD_SETTING.base_model or "",
+            trace_summary=trace_summary,
+        )
+
+        resp = APIBackend().build_messages_and_create_chat_completion(
+            user_prompt=user_prompt,
+            system_prompt=system_prompt,
+            json_mode=True,
+        )
+        return json.loads(resp)
diff --git a/rdagent/scenarios/rl/proposal/trace.py b/rdagent/scenarios/rl/proposal/trace.py
new file mode 100644
index 000000000..7b978850a
--- /dev/null
+++ b/rdagent/scenarios/rl/proposal/trace.py
@@ -0,0 +1,6 @@
+from __future__ import annotations
+
+from rdagent.core.evolving_framework import KnowledgeBase
+from rdagent.core.proposal import Trace
+
+RLTrace = Trace["RLPostTrainingScen", KnowledgeBase]
diff --git a/rdagent/scenarios/rl/scen/scenario.py b/rdagent/scenarios/rl/scen/scenario.py
new file mode 100644
index 000000000..6f39dfbe4
--- /dev/null
+++ b/rdagent/scenarios/rl/scen/scenario.py
@@ -0,0 +1,79 @@
+"""
+RL Post-training Scenario
+
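+When running as an autorl_bench agent, run.py has already handled:
+- Resource download (model, data)
+- Workspace creation and symlinks
+- Grading Server startup and baseline evaluation
+- Passing environment variables
+
+This Scenario only reads that information and does not repeat those steps.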
+"""
+
+import os
+from pathlib import Path
+
+from rdagent.app.rl.conf import RL_RD_SETTING
+from rdagent.core.scenario import Scenario
+from rdagent.log import rdagent_logger as logger
+
+
+class RLPostTrainingScen(Scenario):
+    """RL Post-training Scenario
+
+    Reads configuration from the environment variables passed by run.py; does not re-download resources or re-evaluate the baseline.
+    """
+
+    def __init__(self) -> None:
+        logger.info("Initializing RL Post-training scenario")
+
+        # Read from env vars (set by run.py), with CLI arguments as the fallback
+        self.base_model = os.environ.get("BASE_MODEL") or RL_RD_SETTING.base_model or ""
+        self.benchmark = os.environ.get("TASK") or RL_RD_SETTING.benchmark or ""
+        self.workspace = os.environ.get("WORKSPACE", "")
+        self.model_path = os.environ.get("MODEL_PATH", "")
+        self.data_path = os.environ.get("DATA_PATH", "")
+        self.output_dir = os.environ.get("OUTPUT_DIR", "")
+        self.grading_server_url = os.environ.get("GRADING_SERVER_URL", "")
+
+        if not self.base_model:
+            raise ValueError("BASE_MODEL env var or --base-model required")
+        if not self.benchmark:
+            raise ValueError("TASK env var or --benchmark required")
+
+        logger.info(f"  Benchmark: {self.benchmark}")
+        logger.info(f"  Base model: {self.base_model}")
+        logger.info(f"  Workspace: {self.workspace}")
+        logger.info(f"  Grading Server: {self.grading_server_url}")
+
+        # Read the task description (description.md in the workspace, symlinked by run.py)
+        desc_file = Path(self.workspace) / "description.md" if self.workspace else None
+        if desc_file and desc_file.exists():
+            self.task_description = desc_file.read_text(encoding="utf-8")
+            logger.info(f"  Loaded task description from {desc_file}")
+        else:
+            self.task_description = ""
+            logger.warning("  Task description not found in workspace")
+
+    @property
+    def background(self) -> str:
+        """Background information for LLM prompts"""
+        bg = f"""RL Post-training Scenario
+
+Benchmark: {self.benchmark}
+Base Model: {self.base_model}
+Model Path: {self.model_path}
+Data Path: {self.data_path}
+Output Dir: {self.output_dir}
+Grading Server: {self.grading_server_url}
+
+Goal: Improve model performance on {self.benchmark} through RL post-training.
+Submit trained model via POST {self.grading_server_url}/submit for evaluation.
+"""
+        if self.task_description:
+            bg += f"\n## Task Description\n{self.task_description}"
+        return bg
+
+    def get_runtime_environment(self) -> str:
+        """Get runtime environment info"""
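+        import json  # local import; serializing properly escapes quotes and backslashes in paths
+
+        return json.dumps(
+            {"workspace": self.workspace, "grading_server": self.grading_server_url},
+            ensure_ascii=False,
+        )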
diff --git a/rdagent/scenarios/rl/train/runner.py b/rdagent/scenarios/rl/train/runner.py
new file mode 100644
index 000000000..b5bc746e3
--- /dev/null
+++ b/rdagent/scenarios/rl/train/runner.py
@@ -0,0 +1,138 @@
+"""
+RL Runner - runs the training code and submits the result to the Grading Server
+
+Runs as an autorl_bench agent:
+- Training code executes locally (under $WORKSPACE/code/)
+- Evaluation goes through HTTP POST $GRADING_SERVER_URL/submit
+"""
+
+import json
+import os
+import subprocess
+import time
+from pathlib import Path
+
+import requests
+
+from rdagent.core.developer import Developer
+from rdagent.core.experiment import Experiment
+from rdagent.core.scenario import Scenario
+from rdagent.log import rdagent_logger as logger
+
+
+class RLPostTrainingRunner(Developer):
+    """RL Runner - local training execution + evaluation over the HTTP API"""
+
+    def __init__(self, scen: Scenario, timeout: int = 360000) -> None:
+        self.scen = scen
+        self.timeout = timeout
+
+    def develop(self, exp: Experiment) -> Experiment:
+        """
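+        Run the training code and submit the result for evaluation.
+
+        Steps:
+        1. Write the generated code into $WORKSPACE/code/
+        2. Run main.py locally
+        3. Submit for evaluation via POST $GRADING_SERVER_URL/submit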
+        """
+        workspace = exp.experiment_workspace
+        if workspace is None or "main.py" not in workspace.file_dict:
+            logger.warning("No main.py in experiment workspace, skipping")
+            exp.result = {"exit_code": -1, "stdout": "No main.py generated"}
+            return exp
+
+        # Read paths from env vars (set by run.py)
+        ws_dir = os.environ.get("WORKSPACE", "")
+        output_dir = os.environ.get("OUTPUT_DIR", "")
+        grading_url = os.environ.get("GRADING_SERVER_URL", "")
+
+        if not ws_dir:
+            logger.error("WORKSPACE env var not set")
+            exp.result = {"exit_code": -1, "stdout": "WORKSPACE not set"}
+            return exp
+
+        code_dir = Path(ws_dir) / "code"
+        code_dir.mkdir(parents=True, exist_ok=True)
+
+        # 1. Write the generated code into code/
+        for filename, content in workspace.file_dict.items():
+            dst = code_dir / filename
+            dst.parent.mkdir(parents=True, exist_ok=True)
+            dst.write_text(content, encoding="utf-8")
+            logger.info(f"  Wrote {dst}")
+
+        # 2. Run main.py locally
+        main_py = code_dir / "main.py"
+        logger.info(f"=== Executing {main_py} ===")
+        start_time = time.time()
+
+        try:
+            proc = subprocess.run(
+                ["python", str(main_py)],
+                cwd=str(code_dir),
+                capture_output=True,
+                text=True,
+                timeout=self.timeout,
+                env={**os.environ, "PYTHONUNBUFFERED": "1"},
+            )
+            exit_code = proc.returncode
+            stdout = proc.stdout + proc.stderr
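+        except subprocess.TimeoutExpired as e:
+            exit_code = -1
+            # On POSIX, e.stdout can be bytes even with text=True, so decode before formatting
+            partial = e.stdout.decode(errors="replace") if isinstance(e.stdout, bytes) else (e.stdout or "")
+            stdout = f"Timeout after {self.timeout}s\n{partial}"
+            logger.warning(f"Training timed out after {self.timeout}s")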
+
+        elapsed = time.time() - start_time
+        logger.info(f"Training finished: exit_code={exit_code}, time={elapsed:.1f}s")
+
+        if exit_code != 0:
+            logger.warning(f"Training failed:\n{stdout[:2000]}")
+
+        exp.result = {
+            "exit_code": exit_code,
+            "stdout": stdout,
+            "running_time": elapsed,
+            "benchmark": None,
+        }
+
+        # 3. Submit to the Grading Server for evaluation
+        if exit_code != 0 or not grading_url or not output_dir:
+            return exp
+
+        output_path = Path(output_dir)
+        if not output_path.exists() or not any(output_path.iterdir()):
+            logger.info("No model output found, skipping evaluation")
+            return exp
+
+        # Find the latest model directory under output/ (there may be v1/, v2/, ... subdirectories)
+        model_path = self._find_latest_model(output_path)
+        logger.info(f"=== Submitting to Grading Server: {model_path} ===")
+
+        try:
+            resp = requests.post(
+                f"{grading_url}/submit",
+                json={"model_path": str(model_path)},
+                timeout=600,
+            )
+            result = resp.json()
+            exp.result["benchmark"] = result
+            logger.info(f"  Score: {result.get('score')}, "
+                        f"Improvement: {result.get('improvement')}, "
+                        f"Best: {result.get('best', {}).get('score')}")
+        except (requests.RequestException, ValueError) as e:
+            logger.error(f"Grading server submission failed: {e}")
+
+        return exp
+
+    @staticmethod
+    def _find_latest_model(output_dir: Path) -> Path:
+        """Locate the model path under output/.
+
+        If there are subdirectories (v1/, v2/, ...), return the most recently modified one;
+        otherwise return output/ itself.
+        """
+        subdirs = [d for d in output_dir.iterdir() if d.is_dir() and not d.name.startswith(".")]
+        if subdirs:
+            return max(subdirs, key=lambda d: d.stat().st_mtime)
+        return output_dir
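
Reviewer note: a minimal standalone sketch of the selection rule used by `_find_latest_model`, written so it can be run outside the runner class. The function name `find_latest_model`, the `demo` helper, and the directory names `v1`, `v2`, and `.cache` are illustrative assumptions, not part of the patch. It shows that hidden directories are ignored even when they are the newest, and that an output directory with no subdirectories is returned as-is.

```python
import os
import tempfile
from pathlib import Path


def find_latest_model(output_dir: Path) -> Path:
    """Return the most recently modified visible subdirectory, else output_dir itself."""
    subdirs = [d for d in output_dir.iterdir() if d.is_dir() and not d.name.startswith(".")]
    if subdirs:
        return max(subdirs, key=lambda d: d.stat().st_mtime)
    return output_dir


def demo() -> tuple:
    """Build a throwaway output/ tree and report what the rule selects."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # No subdirectories yet: the rule falls back to the directory itself.
        falls_back_to_root = find_latest_model(root) == root
        # Assign increasing, explicit mtimes so the ordering is deterministic.
        for i, name in enumerate(["v1", "v2", ".cache"]):
            d = root / name
            d.mkdir()
            os.utime(d, (1_000 + i, 1_000 + i))
        # ".cache" is the newest entry, but hidden directories are skipped.
        return falls_back_to_root, find_latest_model(root).name
```

Because the rule relies on modification time rather than on the version suffix, touching an older directory after training would make it the "latest"; that trade-off is inherent in the approach above.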
diff --git a/rdagent/scenarios/shared/get_runtime_info.py b/rdagent/scenarios/shared/get_runtime_info.py
index 3e6cb7c66..9e66c7d3d 100644
--- a/rdagent/scenarios/shared/get_runtime_info.py
+++ b/rdagent/scenarios/shared/get_runtime_info.py
@@ -1,3 +1,5 @@
+import json
+import re
 from pathlib import Path
 
 from rdagent.core.experiment import FBWorkspace
@@ -9,7 +11,9 @@ def get_runtime_environment_by_env(env: Env) -> str:
     fname = "runtime_info.py"
     implementation.inject_files(**{fname: (Path(__file__).absolute().resolve().parent / "runtime_info.py").read_text()})
     stdout = implementation.execute(env=env, entry=f"python {fname}")
-    return stdout
+    # Extract JSON from stdout (skip CUDA/container warnings)
+    json_match = re.search(r"\{.*\}", stdout, re.DOTALL)
+    return json.dumps(json.loads(json_match.group()), indent=2) if json_match else stdout
 
 
 def check_runtime_environment(env: Env) -> str:
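
Reviewer note: a small self-contained sketch of the JSON-extraction step in `get_runtime_environment_by_env`, assuming the script prints exactly one JSON object, possibly surrounded by warning lines. The function name `extract_json_block` and the sample `noisy` string are hypothetical, and returning the raw text when no braces are found is an assumption of this sketch rather than confirmed behavior of the patch.

```python
import json
import re


def extract_json_block(stdout: str) -> str:
    """Return the JSON object embedded in stdout, re-serialized with indent=2.

    The greedy pattern spans from the first "{" to the last "}", so it assumes
    the surrounding noise contains no braces of its own.
    """
    match = re.search(r"\{.*\}", stdout, re.DOTALL)
    if match is None:
        # Assumption: fall back to the raw text instead of raising.
        return stdout
    return json.dumps(json.loads(match.group()), indent=2)


# Hypothetical output: a container warning followed by the JSON payload.
noisy = 'WARNING: CUDA driver mismatch\n{"runtime": {"os": "Linux"}, "gpu": {}}\n'
cleaned = extract_json_block(noisy)
```

If a warning line itself contains braces, the greedy match would capture it and `json.loads` would raise; printing the payload between explicit sentinel markers would be a more robust alternative.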
diff --git a/rdagent/scenarios/shared/runtime_info.py b/rdagent/scenarios/shared/runtime_info.py
index 3f3836df9..281e1673a 100644
--- a/rdagent/scenarios/shared/runtime_info.py
+++ b/rdagent/scenarios/shared/runtime_info.py
@@ -1,66 +1,113 @@
+import json
 import platform
+import re
 import subprocess
 import sys
 from importlib.metadata import distributions
 
 
-def print_runtime_info():
-    print("=== Python Runtime Info ===")
-    print(f"Python {sys.version} on {platform.system()} {platform.release()}")
+def get_runtime_info():
+    return {
+        "python_version": sys.version,
+        "os": platform.system(),
+        "os_release": platform.release(),
+    }
 
 
 def get_gpu_info():
+    gpu_info = {}
     try:
-        # Option 1: Use PyTorch
         import torch
 
         if torch.cuda.is_available():
-            print("\n=== GPU Info (via PyTorch) ===")
-            print(f"CUDA Version: {torch.version.cuda}")
-            print(f"GPU Count: {torch.cuda.device_count()}")
+            gpu_info["source"] = "pytorch"
+            gpu_info["cuda_version"] = torch.version.cuda
+            gpu_info["gpu_count"] = torch.cuda.device_count()
             if torch.cuda.device_count() > 0:
                 gpu_name_list = []
                 gpu_total_mem_list = []
                 gpu_allocated_mem_list = []
-                gpu_cached_mem_list = []
 
                 for i in range(torch.cuda.device_count()):
                     gpu_name_list.append(torch.cuda.get_device_name(i))
                     gpu_total_mem_list.append(torch.cuda.get_device_properties(i).total_memory)
                     gpu_allocated_mem_list.append(torch.cuda.memory_allocated(i))
-                    gpu_cached_mem_list.append(torch.cuda.memory_reserved(i))
 
+                gpu_info["gpus"] = []
                 for i in range(torch.cuda.device_count()):
-                    print(f"  - GPU {i}: {gpu_name_list[i]}")
-                    print(f"    Total Memory: {gpu_total_mem_list[i] / 1024**3:.2f} GB")
-                    print(f"    Allocated Memory: {gpu_allocated_mem_list[i] / 1024**3:.2f} GB")
-                    print(f"    Cached Memory: {gpu_cached_mem_list[i] / 1024**3:.2f} GB")
-                print("  - All GPUs Summary:")
-                print(f"    Total Memory: {sum(gpu_total_mem_list) / 1024**3:.2f} GB")
-                print(f"    Total Allocated Memory: {sum(gpu_allocated_mem_list) / 1024**3:.2f} GB")
-                print(f"    Total Cached Memory: {sum(gpu_cached_mem_list) / 1024**3:.2f} GB")
+                    gpu_info["gpus"].append(
+                        {
+                            "index": i,
+                            "name": gpu_name_list[i],
+                            "memory_total_gb": round(gpu_total_mem_list[i] / 1024**3, 2),
+                            "memory_used_gb": round(gpu_allocated_mem_list[i] / 1024**3, 2),
+                        }
+                    )
+                gpu_info["summary"] = {
+                    "gpu_count": torch.cuda.device_count(),
+                    "total_memory_gb": round(sum(gpu_total_mem_list) / 1024**3, 2),
+                    "total_used_memory_gb": round(sum(gpu_allocated_mem_list) / 1024**3, 2),
+                }
             else:
-                print("No CUDA GPU detected (PyTorch)!")
+                gpu_info["message"] = "No CUDA GPU detected (PyTorch)"
         else:
-            print("\nNo CUDA GPU detected (PyTorch).")
-
+            gpu_info["source"] = "pytorch"
+            gpu_info["message"] = "No CUDA GPU detected"
     except ImportError:
-        # Option 2: Use nvidia-smi
         try:
             result = subprocess.run(
-                ["nvidia-smi", "--query-gpu=name,memory.total,memory.used", "--format=csv"],
+                ["nvidia-smi", "--query-gpu=name,memory.total,memory.used", "--format=csv,noheader,nounits"],
                 capture_output=True,
                 text=True,
             )
             if result.returncode == 0:
-                print("\n=== GPU Info (via nvidia-smi) ===")
-                print(result.stdout.strip())
+                gpu_info["source"] = "nvidia-smi"
+                gpu_info["cuda_version"] = None
+                version_result = subprocess.run(
+                    ["nvidia-smi"],
+                    capture_output=True,
+                    text=True,
+                )
+                if version_result.returncode == 0:
+                    match = re.search(r"CUDA Version:\s*([0-9.]+)", version_result.stdout)
+                    if match:
+                        gpu_info["cuda_version"] = match.group(1)
+                lines = result.stdout.strip().splitlines()
+                gpu_info["gpus"] = []
+                total_mem_list = []
+                used_mem_list = []
+                for index, line in enumerate(lines):
+                    name, mem_total, mem_used = [x.strip() for x in line.split(",")]
+                    total_mem_list.append(int(mem_total))
+                    used_mem_list.append(int(mem_used))
+                    gpu_info["gpus"].append(
+                        {
+                            "index": index,
+                            "name": name,
+                            "memory_total_gb": round(int(mem_total) / 1024, 2),
+                            "memory_used_gb": round(int(mem_used) / 1024, 2),
+                        }
+                    )
+                gpu_info["gpu_count"] = len(gpu_info["gpus"])
+                gpu_info["summary"] = {
+                    "gpu_count": len(gpu_info["gpus"]),
+                    "total_memory_gb": round(sum(total_mem_list) / 1024, 2),
+                    "total_used_memory_gb": round(sum(used_mem_list) / 1024, 2),
+                }
             else:
-                print("\nNo GPU detected (nvidia-smi not available).")
+                gpu_info["source"] = "nvidia-smi"
+                gpu_info["cuda_version"] = None
+                gpu_info["message"] = "No GPU detected or nvidia-smi not available"
         except FileNotFoundError:
-            print("\nNo GPU detected (nvidia-smi not installed).")
+            gpu_info["source"] = "nvidia-smi"
+            gpu_info["cuda_version"] = None
+            gpu_info["message"] = "nvidia-smi not installed"
+    return gpu_info
 
 
 if __name__ == "__main__":
-    print_runtime_info()
-    get_gpu_info()
+    info = {
+        "runtime": get_runtime_info(),
+        "gpu": get_gpu_info(),
+    }
+    print(json.dumps(info, indent=4))
diff --git a/rdagent/utils/blob/azsync.sh b/rdagent/utils/blob/azsync.sh
new file mode 100755
index 000000000..15794be27
--- /dev/null
+++ b/rdagent/utils/blob/azsync.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+# Azure Blob sync script - for syncing FT scenario files across machines
+# Supports both logs and workspace directories
+
+# ========== Configuration ==========
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PROJECT_ROOT="$SCRIPT_DIR/../../.."
+TOKEN_FILE="$PROJECT_ROOT/git_ignore_folder/.az_sas_token"
+
+# Blob configuration
+ACCOUNT="epeastus"
+CONTAINER="rdagent"
+REMOTE_BASE="FinetuneAgenticLLM/FT_qizheng"
+
+# Directory mappings (support environment variable override)
+# Default to project-relative paths; can be overridden by environment variables
+LOCAL_LOG_DIR="${FT_LOG_BASE:-$PROJECT_ROOT/log}"
+LOCAL_WORKSPACE_DIR="${FT_WORKSPACE_BASE:-$PROJECT_ROOT/git_ignore_folder/RD-Agent_workspace}"
+LOCAL_LITELLM_LOG_DIR="${LITELLM_LOG_DIR:-/workspace/rdagent/litllm_log}"
+# Support sub-path for syncing specific job directory (e.g., SYNC_SUBPATH="2024-01-01_12-00")
+SYNC_SUBPATH="${SYNC_SUBPATH:-}"
+REMOTE_LOG_PATH="${REMOTE_BASE}/logs${SYNC_SUBPATH:+/$SYNC_SUBPATH}"
+REMOTE_WORKSPACE_PATH="${REMOTE_BASE}/workspace${SYNC_SUBPATH:+/$SYNC_SUBPATH}"
+# litellm_log doesn't use SYNC_SUBPATH since local dir is shared across jobs
+REMOTE_LITELLM_LOG_PATH="${REMOTE_BASE}/litellm_log"
+
+# Read SAS Token
+if [ -f "$TOKEN_FILE" ]; then
+    SAS_TOKEN=$(cat "$TOKEN_FILE")
+else
+    SAS_TOKEN=""
+fi
+# ========== End Configuration ==========
+
+# Get paths based on sync type (logs/workspace/litellm_log)
+get_paths() {
+    local sync_type="${1:-logs}"
+    case "$sync_type" in
+        logs)
+            LOCAL_DIR="$LOCAL_LOG_DIR"
+            REMOTE_PATH="$REMOTE_LOG_PATH"
+            ;;
+        workspace)
+            LOCAL_DIR="$LOCAL_WORKSPACE_DIR"
+            REMOTE_PATH="$REMOTE_WORKSPACE_PATH"
+            ;;
+        litellm_log)
+            LOCAL_DIR="$LOCAL_LITELLM_LOG_DIR"
+            REMOTE_PATH="$REMOTE_LITELLM_LOG_PATH"
+            ;;
+        *)
+            echo "Error: Unknown sync type '$sync_type'. Use 'logs', 'workspace', or 'litellm_log'."
+            exit 1
+            ;;
+    esac
+    BLOB_URL="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/${REMOTE_PATH}?${SAS_TOKEN}"
+}
+
+usage() {
+    echo "Usage: $0 [up|down] [logs|workspace|litellm_log]"
+    echo ""
+    echo "  up    Upload local directory to blob"
+    echo "  down  Download blob to local directory"
+    echo "  (no args) Show this help"
+    echo ""
+    echo "Sync types:"
+    echo "  logs        Sync log directory (default)"
+    echo "  workspace   Sync workspace directory"
+    echo "  litellm_log Sync litellm log directory"
+    echo ""
+    echo "Configuration:"
+    echo "  Log directory:         $LOCAL_LOG_DIR"
+    echo "  Workspace directory:   $LOCAL_WORKSPACE_DIR"
+    echo "  Litellm log directory: $LOCAL_LITELLM_LOG_DIR"
+    echo "  Remote base:           $REMOTE_BASE"
+    echo ""
+    echo "SAS Token: Run ./gen_token.sh to generate"
+    exit 0
+}
+
+check_token() {
+    if [ -z "$SAS_TOKEN" ]; then
+        echo "Error: SAS Token not found"
+        echo "Please run: ./gen_token.sh first"
+        exit 1
+    fi
+}
+
+case "${1:-}" in
+    up)
+        check_token
+        get_paths "${2:-logs}"
+        echo "Uploading: $LOCAL_DIR -> $REMOTE_PATH"
+        azcopy sync "$LOCAL_DIR" "$BLOB_URL" --recursive=true \
+            --exclude-path="pickle_cache;prompt_cache.db"
+        ;;
+    down)
+        check_token
+        get_paths "${2:-logs}"
+        mkdir -p "$LOCAL_DIR"
+        echo "Downloading: $REMOTE_PATH -> $LOCAL_DIR"
+        azcopy sync "$BLOB_URL" "$LOCAL_DIR" --recursive=true
+        ;;
+    *)
+        usage
+        ;;
+esac
diff --git a/rdagent/utils/blob/gen_token.sh b/rdagent/utils/blob/gen_token.sh
new file mode 100755
index 000000000..7ab95590a
--- /dev/null
+++ b/rdagent/utils/blob/gen_token.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+# Generate Azure Blob SAS Token and save it
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PROJECT_ROOT="$SCRIPT_DIR/../../.."
+TOKEN_FILE="$PROJECT_ROOT/git_ignore_folder/.az_sas_token"
+
+# Blob configuration
+ACCOUNT="epeastus"
+CONTAINER="rdagent"
+REMOTE_PATH="FinetuneAgenticLLM/FT_qizheng/logs"
+
+# Default expiry: 7 days from now
+DEFAULT_EXPIRY=$(date -u -d "+7 days" +%Y-%m-%dT00:00Z 2>/dev/null || date -u -v+7d +%Y-%m-%dT00:00Z)
+EXPIRY="${1:-$DEFAULT_EXPIRY}"
+
+echo "Generating SAS Token..."
+echo "Expires at: $EXPIRY"
+echo ""
+
+# Generate token
+TOKEN=$(az storage container generate-sas \
+    --as-user \
+    --auth-mode login \
+    --account-name "$ACCOUNT" \
+    --name "$CONTAINER" \
+    --permissions lrwd \
+    --expiry "$EXPIRY" \
+    -o tsv)
+
+if [ -z "$TOKEN" ]; then
+    echo "Error: Token generation failed, please ensure you are logged in to az cli"
+    echo "Run: az login"
+    exit 1
+fi
+
+# Save token
+mkdir -p "$(dirname "$TOKEN_FILE")"
+echo "$TOKEN" > "$TOKEN_FILE"
+echo "Token saved to: $TOKEN_FILE"
+echo ""
+
+# Output full URL
+BLOB_URL="https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}/${REMOTE_PATH}?${TOKEN}"
+echo "Full Blob URL:"
+echo "$BLOB_URL"
diff --git a/rdagent/utils/env.py b/rdagent/utils/env.py
index 5ae073e92..a44218f7b 100644
--- a/rdagent/utils/env.py
+++ b/rdagent/utils/env.py
@@ -19,10 +19,12 @@
 import uuid
 import zipfile
 from abc import abstractmethod
+from collections import deque
 from dataclasses import dataclass
+from datetime import datetime
 from pathlib import Path
 from types import MappingProxyType
-from typing import Any, Generator, Generic, Mapping, Optional, TypeVar, cast
+from typing import Any, Callable, Generator, Generic, Mapping, Optional, TypeVar, cast
 
 import docker  # type: ignore[import-untyped]
 import docker.models  # type: ignore[import-untyped]
@@ -32,13 +34,16 @@
 from pydantic_settings import SettingsConfigDict
 from rich import print
 from rich.console import Console
+from rich.live import Live
 from rich.progress import Progress, SpinnerColumn, TextColumn
 from rich.rule import Rule
 from rich.table import Table
+from rich.text import Text
 from tqdm import tqdm
 
 from rdagent.core.conf import ExtendedBaseSettings
 from rdagent.core.experiment import RD_AGENT_SETTINGS
+from rdagent.core.utils import cache_with_pickle
 from rdagent.log import rdagent_logger as logger
 from rdagent.oai.llm_utils import md5_hash
 from rdagent.utils import filter_redundant_text
@@ -46,6 +51,32 @@
 from rdagent.utils.fmt import shrink_text
 from rdagent.utils.workflow import wait_retry
 
+CacheKeyFunc = Callable[[str | Path], list[list[str]]]
+
+
+def extract_dir_name_from_path_config(path_str: str) -> str:
+    """
+    Extract the first directory component from a relative path string.
+
+    This is used to get the basename from path configurations like "./workspace_input/"
+    to use in chmod exclusion patterns.
+
+    Args:
+        path_str: A path string, typically from T() template configuration
+
+    Returns:
+        The first directory component, or empty string if not a relative path
+
+    Examples:
+        "./workspace_input/" -> "workspace_input"
+        "./assets/" -> "assets"
+        "/absolute/path" -> ""
+    """
+    p = Path(path_str)
+    if not p.is_absolute() and p.parts:
+        return p.parts[0]
+    return ""
+
 
 def cleanup_container(container: docker.models.containers.Container | None, context: str = "") -> None:  # type: ignore[no-any-unimported]
     """
@@ -120,12 +151,41 @@ def pull_image_with_progress(image: str) -> None:
 
 class EnvConf(ExtendedBaseSettings):
     default_entry: str
+    env_dict: dict = {}
     extra_volumes: dict = {}
     running_timeout_period: int | None = 3600  # 10 minutes
+
+    """The method below collects the workspace content used to calculate cache hash keys."""
+
+    def get_workspace_content_for_hash(self, local_path: str | Path) -> list[list[str]]:
+        """Get content of key files in workspace for cache hash calculation.
+
+        Scans .py, .csv, and .yaml files.
+        """
+        # we must add the information of data (beyond code) into the key.
+        # Otherwise, all commands operating on data will become invalid (e.g. rm -r submission.csv)
+        # So we recursively walk in the folder and add the sorted relative filename list as part of the key.
+        # data_key = []
+        # for path in Path(local_path).rglob("*"):
+        #     p = str(path.relative_to(Path(local_path)))
+        #     if p.startswith("__pycache__"):
+        #         continue
+        #     data_key.append(p)
+        # data_key = sorted(data_key)
+        local_path = Path(local_path)
+        return [
+            [str(path.relative_to(local_path)), path.read_text()]
+            for path in sorted(
+                list(local_path.rglob("*.py")) + list(local_path.rglob("*.csv")) + list(local_path.rglob("*.yaml"))
+            )
+        ]
+
+    redirect_stdout_to_file: bool = False
     # helper settings to support transparent;
     enable_cache: bool = True
     retry_count: int = 5  # retry count for the docker run
     retry_wait_seconds: int = 10  # retry wait seconds for the docker run
+    exclude_chmod_paths: list[str] = []  # List of directory names to exclude from chmod operation
 
     model_config = SettingsConfigDict(
         # TODO: add prefix ....
@@ -143,13 +203,30 @@ class EnvResult:
     It contains the stdout, the exit code, and the running time in seconds.
     """
 
-    stdout: str
-    exit_code: int
-    running_time: float
+    def __init__(self, stdout: str, exit_code: int, running_time: float):
+        self.full_stdout = stdout
+        self.exit_code = exit_code
+        self.running_time = running_time
+        self.stored_full_stdout_to_truncated_stdout = {}
+
+    def update_stdout(self, stdout: str) -> None:
+        self.full_stdout = stdout
+
+    @property
+    def stdout(self) -> str:
+        if self.full_stdout not in self.stored_full_stdout_to_truncated_stdout:
+            self.stored_full_stdout_to_truncated_stdout[self.full_stdout] = self._get_truncated_stdout(
+                full_stdout=self.full_stdout
+            )
+        return self.stored_full_stdout_to_truncated_stdout[self.full_stdout]
+
+    def hash_full_stdout(self, full_stdout) -> str:
+        return md5_hash(full_stdout)
 
-    def get_truncated_stdout(self) -> str:
+    @cache_with_pickle(hash_full_stdout)
+    def _get_truncated_stdout(self, full_stdout) -> str:
         return shrink_text(
-            filter_redundant_text(self.stdout),
+            filter_redundant_text(full_stdout),
             context_lines=RD_AGENT_SETTINGS.stdout_context_len,
             line_len=RD_AGENT_SETTINGS.stdout_line_len,
         )
@@ -174,19 +251,32 @@ def zip_a_folder_into_a_file(self, folder_path: str, zip_file_path: str) -> None
         with zipfile.ZipFile(zip_file_path, "w") as z:
             for root, _, files in os.walk(folder_path):
                 for file in files:
-                    z.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), folder_path))
+                    z.write(
+                        os.path.join(root, file),
+                        os.path.relpath(os.path.join(root, file), folder_path),
+                    )
 
-    def unzip_a_file_into_a_folder(self, zip_file_path: str, folder_path: str) -> None:
+    def unzip_a_file_into_a_folder(
+        self, zip_file_path: str, folder_path: str, files_to_extract: list[str] | None = None
+    ) -> None:
         """
         Unzip a file into a folder, use zipfile instead of subprocess
         """
-        # Clear folder_path before extracting
-        if os.path.exists(folder_path):
-            shutil.rmtree(folder_path)
-        os.makedirs(folder_path)
+        if files_to_extract is None:
+            # Clear folder_path before extracting
+            if os.path.exists(folder_path):
+                shutil.rmtree(folder_path)
+            os.makedirs(folder_path)
 
         with zipfile.ZipFile(zip_file_path, "r") as z:
-            z.extractall(folder_path)
+            if files_to_extract is not None:
+                for file_name in files_to_extract:
+                    try:
+                        z.extract(file_name, folder_path)
+                    except KeyError:
+                        logger.warning(f"File {file_name} not found in cache zip.")
+            else:
+                z.extractall(folder_path)
 
     @abstractmethod
     def prepare(self, *args, **kwargs) -> None:  # type: ignore[no-untyped-def]
@@ -195,7 +285,11 @@ def prepare(self, *args, **kwargs) -> None:  # type: ignore[no-untyped-def]
         """
 
     def check_output(
-        self, entry: str | None = None, local_path: str = ".", env: dict | None = None, **kwargs: dict
+        self,
+        entry: str | None = None,
+        local_path: str = ".",
+        env: dict | None = None,
+        **kwargs: dict,
     ) -> str:
         """
         Run the folder under the environment.
@@ -258,7 +352,9 @@ def run(
         entry: str | None = None,
         local_path: str = ".",
         env: dict | None = None,
-        **kwargs: dict,
+        running_extra_volume: Mapping = MappingProxyType({}),
+        cache_key_extra_func: CacheKeyFunc | None = None,
+        cache_files_to_extract: list[str] | None = None,
     ) -> EnvResult:
         """
         Run the folder under the environment and return the stdout, exit code, and running time.
@@ -275,12 +371,22 @@ def run(
             - simply run the image. The results are produced by output or network
         env : dict | None
             Run the code with your specific environment.
+        running_extra_volume : Mapping
+            Extra volumes to mount during execution.
+        cache_key_extra_func : CacheKeyFunc | None
+            Optional function that computes the extra information included in the cache key.
+        cache_files_to_extract : list[str] | None
+            Optional list of files to extract from cache zip. If None, extract all.
 
         Returns
         -------
             EnvResult: An object containing the stdout, the exit code, and the running time in seconds.
         """
-        running_extra_volume = kwargs.get("running_extra_volume", {})
+        _env = self.conf.env_dict.copy()
+        if env:
+            _env.update(env)
+        env = _env
+
         if entry is None:
             entry = self.conf.default_entry
 
@@ -291,26 +397,25 @@ def run(
                 "the last command in the pipeline.",
             )
 
-        # FIXME: the input path and cache path is hard coded here.
-        # We don't want to change the content in input and cache path.
-        # Otherwise, it may produce large amount of warnings.
+        # Exclude configured directories from chmod operation to prevent modifying
+        # read-only or specially configured directories that may produce warnings.
         def _get_chmod_cmd(workspace_path: str) -> str:
-            def _get_path_stem(path: str) -> str | None:
-                # If the input path is relative, keep only the first component
-                p = Path(path)
-                if not p.is_absolute() and p.parts:
-                    return p.parts[0]
-                return None
-
             find_cmd = f"find {workspace_path} -mindepth 1 -maxdepth 1"
-            for name in [
-                _get_path_stem(T("scenarios.data_science.share:scen.cache_path").r()),
-                _get_path_stem(T("scenarios.data_science.share:scen.input_path").r()),
-            ]:
-                find_cmd += f" ! -name {name}"
+
+            # Use the configurable exclude paths from the environment configuration (self.conf)
+            for name in self.conf.exclude_chmod_paths:
+                if name:  # Skip empty names
+                    find_cmd += f" ! -name {name}"
+
             chmod_cmd = f"{find_cmd} -exec chmod -R 777 {{}} +"
             return chmod_cmd
 
+        if self.conf.redirect_stdout_to_file:
+            log_file_name = md5_hash(entry)[:8] + ".log"
+            log_file = Path(local_path) / f"{log_file_name}"
+            log_file_relative_path = log_file.relative_to(Path(local_path))
+            entry = f"{entry} > {log_file_relative_path} 2>&1"
+
         if self.conf.running_timeout_period is None:
             timeout_cmd = entry
         else:
@@ -331,7 +436,14 @@ def _get_path_stem(path: str) -> str | None:
         )
 
         if self.conf.enable_cache:
-            result = self.cached_run(entry_add_timeout, local_path, env, running_extra_volume)
+            result = self.cached_run(
+                entry_add_timeout,
+                local_path,
+                env,
+                running_extra_volume,
+                cache_key_extra_func,
+                cache_files_to_extract,
+            )
         else:
             result = self.__run_with_retry(
                 entry_add_timeout,
@@ -339,6 +451,12 @@ def _get_path_stem(path: str) -> str | None:
                 env,
                 running_extra_volume,
             )
+        if self.conf.redirect_stdout_to_file:
+            stdout = log_file.read_text()
+            log_file.unlink(missing_ok=True)
+            result.update_stdout(stdout)
+        if str(Path(local_path).resolve()) in result.stdout:
+            result.update_stdout(result.stdout.replace(str(Path(local_path).resolve()), "<WORKSPACE_PATH>"))
 
         return result
 
@@ -348,6 +466,8 @@ def cached_run(
         local_path: str = ".",
         env: dict | None = None,
         running_extra_volume: Mapping = MappingProxyType({}),
+        cache_key_extra_func: CacheKeyFunc | None = None,
+        cache_files_to_extract: list[str] | None = None,
     ) -> EnvResult:
         """
         Run the folder under the environment.
@@ -357,24 +477,13 @@ def cached_run(
         target_folder = Path(RD_AGENT_SETTINGS.pickle_cache_folder_path_str) / f"utils.env.run"
         target_folder.mkdir(parents=True, exist_ok=True)
 
-        # we must add the information of data (beyond code) into the key.
-        # Otherwise, all commands operating on data will become invalid (e.g. rm -r submission.csv)
-        # So we recursively walk in the folder and add the sorted relative filename list as part of the key.
-        # data_key = []
-        # for path in Path(local_path).rglob("*"):
-        #     p = str(path.relative_to(Path(local_path)))
-        #     if p.startswith("__pycache__"):
-        #         continue
-        #     data_key.append(p)
-        # data_key = sorted(data_key)
+        if cache_key_extra_func is not None:
+            cache_key_extra = cache_key_extra_func(local_path)
+        else:
+            cache_key_extra = self.conf.get_workspace_content_for_hash(local_path)
 
         key = md5_hash(
-            json.dumps(
-                [
-                    [str(path.relative_to(Path(local_path))), path.read_text()]
-                    for path in sorted(list(Path(local_path).rglob("*.py")) + list(Path(local_path).rglob("*.csv")))
-                ]
-            )
+            json.dumps(cache_key_extra)
             + json.dumps({"entry": entry, "running_extra_volume": dict(running_extra_volume)})
             + json.dumps({"extra_volumes": self.conf.extra_volumes})
             # + json.dumps(data_key)
@@ -382,7 +491,7 @@ def cached_run(
         if Path(target_folder / f"{key}.pkl").exists() and Path(target_folder / f"{key}.zip").exists():
             with open(target_folder / f"{key}.pkl", "rb") as f:
                 ret = pickle.load(f)
-            self.unzip_a_file_into_a_folder(str(target_folder / f"{key}.zip"), local_path)
+            self.unzip_a_file_into_a_folder(str(target_folder / f"{key}.zip"), local_path, cache_files_to_extract)
         else:
             ret = self.__run_with_retry(entry, local_path, env, running_extra_volume)
             with open(target_folder / f"{key}.pkl", "wb") as f:
@@ -447,6 +556,10 @@ def dump_python_code_run_and_get_results(
                 return log_output, []
         return log_output, results
 
+    def refresh_env(self) -> None:
+        """Refresh the environment, e.g., pull the latest docker image or rebuild the conda env."""
+        pass
+
 
 # class EnvWithCache
 #
@@ -521,7 +634,17 @@ def _symlink_ctx(vol_map: Mapping[str, str]) -> Generator[None, None, None]:
             # Setup environment
             if env is None:
                 env = {}
-            path = [*self.conf.bin_path.split(":"), "/bin/", "/usr/bin/", *env.get("PATH", "").split(":")]
+
+            # Auto-propagate CUDA_VISIBLE_DEVICES for proper GPU isolation
+            if "CUDA_VISIBLE_DEVICES" in os.environ and "CUDA_VISIBLE_DEVICES" not in env:
+                env["CUDA_VISIBLE_DEVICES"] = os.environ["CUDA_VISIBLE_DEVICES"]
+
+            path = [
+                *self.conf.bin_path.split(":"),
+                "/bin/",
+                "/usr/bin/",
+                *env.get("PATH", "").split(":"),
+            ]
             env["PATH"] = ":".join(path)
 
             if entry is None:
@@ -613,6 +736,15 @@ class CondaConf(LocalConf):
 
     @model_validator(mode="after")
     def change_bin_path(self, **data: Any) -> "CondaConf":
+        self._update_bin_path()
+        return self
+
+    def _update_bin_path(self) -> None:
+        """Update bin_path by querying the conda environment's PATH.
+
+        This is called during initialization and can be called again after prepare()
+        to ensure bin_path is set correctly even if the conda env was just created.
+        """
         conda_path_result = subprocess.run(
             f"conda run -n {self.conda_env_name} --no-capture-output env | grep '^PATH='",
             capture_output=True,
@@ -620,7 +752,6 @@ def change_bin_path(self, **data: Any) -> "CondaConf":
             shell=True,
         )
         self.bin_path = conda_path_result.stdout.strip().split("=")[1] if conda_path_result.returncode == 0 else ""
-        return self
 
 
 class MLECondaConf(CondaConf):
@@ -643,6 +774,16 @@ class DockerConf(EnvConf):
     {<host_path>: {"bind": <container_path>, "mode": <mode, ro/rw/default is extra_volume_mode>}}
     """
     extra_volume_mode: str = "ro"  # by default. only the mount_path should be writable, others are changed to read-only
+
+    exclude_chmod_paths: list[str] = []
+    """List of directory names to exclude from chmod -R 777 operation.
+    This prevents modifying permissions of read-only or specially configured directories."""
+
+    # Declarative configuration for auto-populating exclude_chmod_paths from share.yaml
+    # Subclasses can override these to specify which config keys to read
+    _scenario_name: str | None = None  # e.g., "data_science", "finetune"
+    _exclude_path_keys: list[str] = []  # e.g., ["input_path", "cache_path"]
+
     # Sometime, we need maintain some extra data for the workspace.
     # And the extra data may be shared and the downloading can be time consuming.
     # So we just want to download it once.
@@ -651,6 +792,9 @@ class DockerConf(EnvConf):
     enable_gpu: bool = True  # because we will automatically disable GPU if not available. So we enable it by default.
     mem_limit: str | None = "48g"  # Add memory limit attribute
     cpu_count: int | None = None  # Add CPU limit attribute
+    read_only: bool = False  # Mount container filesystem as read-only
+    cap_drop_all: bool = False  # Drop all Linux capabilities
+    pids_limit: int | None = None  # Limit the number of processes
 
     running_timeout_period: int | None = 3600  # 1 hour
 
@@ -659,6 +803,30 @@ class DockerConf(EnvConf):
     retry_count: int = 5  # retry count for the docker run
     retry_wait_seconds: int = 10  # retry wait seconds for the docker run
 
+    terminal_tail_lines: int = 20
+    save_logs_to_file: bool = False  # off by default to preserve the previous behavior
+
+    @model_validator(mode="after")
+    def populate_exclude_chmod_paths(self) -> "DockerConf":
+        """
+        Automatically populate exclude_chmod_paths from share.yaml configuration.
+
+        This method reads path configurations from scenarios/<scenario_name>/share.yaml
+        based on _scenario_name and _exclude_path_keys class attributes.
+        """
+        if not self.exclude_chmod_paths and self._scenario_name and self._exclude_path_keys:
+            # Extract directory names from scenario configuration
+            self.exclude_chmod_paths = [
+                name
+                for key in self._exclude_path_keys
+                if (
+                    name := extract_dir_name_from_path_config(
+                        T(f"scenarios.{self._scenario_name}.share:scen.{key}").r()
+                    )
+                )
+            ]
+        return self
+
 
 class QlibCondaConf(CondaConf):
     conda_env_name: str = "rdagent4qlib"
@@ -690,10 +858,153 @@ def prepare(self) -> None:
                     f"conda run -n {self.conf.conda_env_name} pip install catboost xgboost tables torch",
                     shell=True,
                 )
+
         except Exception as e:
             print(f"[red]Failed to prepare conda env: {e}[/red]")
 
 
+# ========== Conda Environment Configuration Loader ==========
+# Config files location: rdagent/scenarios/finetune/env/conda/
+
+FT_CONDA_CONFIG_DIR = Path(__file__).parent.parent / "scenarios" / "finetune" / "env" / "conda"
+
+# Track which conda environments have been prepared in this process
+# This avoids redundant pip install checks that produce verbose output
+_CONDA_ENV_PREPARED: set[str] = set()
+
+
+def _sync_conda_cache_with_real_envs() -> None:
+    """Ensure the prepared cache includes environments that already exist on disk."""
+    try:
+        result = subprocess.run(
+            "conda env list",
+            capture_output=True,
+            text=True,
+            shell=True,
+            check=False,
+        )
+    except Exception as exc:  # pragma: no cover - best-effort helper
+        logger.warning(f"Failed to inspect conda env list: {exc}")
+        return
+
+    env_names: set[str] = set()
+    for line in result.stdout.splitlines():
+        line = line.strip()
+        if not line or line.startswith("#"):
+            continue
+        # Lines look like: "base                  *  /opt/conda"
+        first_column = line.split()[0]
+        name = first_column.replace("*", "").strip()
+        if name:
+            env_names.add(name)
+
+    _CONDA_ENV_PREPARED.update(env_names)
+
+
+def _prepare_conda_env(env_name: str, requirements_file: Path, python_version: str = "3.10") -> None:
+    """Prepare conda environment with dependencies from requirements.txt.
+
+    Creates the env if it doesn't exist, then installs dependencies.
+    Records the env in the process-level cache so callers can skip redundant preparation.
+
+    Args:
+        env_name: Conda environment name
+        requirements_file: Path to requirements.txt file
+        python_version: Python version for the environment
+    """
+    # 1. Create conda environment if not exists
+    result = subprocess.run(f"conda env list | grep -q '^{env_name} '", shell=True)
+    if result.returncode != 0:
+        print(f"[yellow]Creating conda env '{env_name}' (Python {python_version})...[/yellow]")
+        subprocess.check_call(f"conda create -y -n {env_name} python={python_version}", shell=True)
+        subprocess.check_call(f"conda run -n {env_name} pip install --upgrade pip", shell=True)
+
+    print(f"[yellow]Installing dependencies from {requirements_file.name}...[/yellow]")
+    subprocess.check_call(f"conda run -n {env_name} pip install -r {requirements_file}", shell=True)
+    print(f"[green]Conda env '{env_name}' ready[/green]")
+
+    _CONDA_ENV_PREPARED.add(env_name)
+
+
+# ========== FT (LLaMA Factory) Conda Environment ==========
+class FTCondaConf(CondaConf):
+    """Conda configuration for LLM fine-tuning environment."""
+
+    model_config = SettingsConfigDict(env_prefix="FT_CONDA_")
+
+    conda_env_name: str = "llm_finetune"
+    default_entry: str = "llamafactory-cli version"
+    enable_cache: bool = False
+
+
+class FTCondaEnv(LocalEnv[FTCondaConf]):
+    """LLaMA Factory Conda Environment with auto-dependency installation.
+
+    Requirements: rdagent/scenarios/finetune/env/conda/llm_finetune_requirements.txt
+    Docker equivalent: rdagent/scenarios/finetune/env/docker/llm_finetune/Dockerfile
+    """
+
+    def prepare(self) -> None:
+        try:
+            # Skip if the env already exists on disk or was prepared earlier in this process
+            _sync_conda_cache_with_real_envs()
+            if self.conf.conda_env_name in _CONDA_ENV_PREPARED:
+                return
+
+            # Step 1: Install base dependencies (torch, llamafactory, etc.)
+            req_file = FT_CONDA_CONFIG_DIR / "llm_finetune_requirements.txt"
+            _prepare_conda_env(self.conf.conda_env_name, req_file)
+
+            # Step 2: Install flash-attn (requires torch first, uses --no-build-isolation)
+            # --no-cache-dir: avoid cross-filesystem hardlink error when /tmp and ~/.cache/pip are on different mounts
+            # Note: flash-attn>=2.8 is required for B200 (sm_100) support
+            print("[yellow]Installing flash-attn (compiling, may take a few minutes)...[/yellow]")
+            subprocess.check_call(
+                f"conda run -n {self.conf.conda_env_name} pip install 'flash-attn>=2.8' --no-build-isolation --no-cache-dir",
+                shell=True,
+            )
+
+            # Re-update bin_path after prepare() in case the conda env was just created
+            if not self.conf.bin_path:
+                self.conf._update_bin_path()
+        except Exception as e:
+            print(f"[red]Failed to prepare LLaMA Factory conda env: {e}[/red]")
+
+
+# ========== Benchmark (OpenCompass) Conda Environment ==========
+class BenchmarkCondaConf(CondaConf):
+    """Conda configuration for OpenCompass benchmark evaluation."""
+
+    model_config = SettingsConfigDict(env_prefix="BENCHMARK_CONDA_")
+
+    conda_env_name: str = "opencompass"
+    default_entry: str = "opencompass --help"
+    enable_cache: bool = False
+    env_dict: dict = {"COMPASS_DATA_CACHE": "/benchmarks/opencompass_data"}
+
+
+class BenchmarkCondaEnv(LocalEnv[BenchmarkCondaConf]):
+    """OpenCompass Conda Environment with auto-dependency installation.
+
+    Requirements: rdagent/scenarios/finetune/env/conda/opencompass_requirements.txt
+    Docker equivalent: rdagent/scenarios/finetune/env/docker/opencompass/Dockerfile
+    """
+
+    def prepare(self) -> None:
+        try:
+            # Skip if the env already exists on disk or was prepared earlier in this process
+            _sync_conda_cache_with_real_envs()
+            if self.conf.conda_env_name in _CONDA_ENV_PREPARED:
+                return
+            req_file = FT_CONDA_CONFIG_DIR / "opencompass_requirements.txt"
+            _prepare_conda_env(self.conf.conda_env_name, req_file)
+            # Re-update bin_path after prepare() in case the conda env was just created
+            if not self.conf.bin_path:
+                self.conf._update_bin_path()
+        except Exception as e:
+            print(f"[red]Failed to prepare OpenCompass conda env: {e}[/red]")
+
+
 class QlibDockerConf(DockerConf):
     model_config = SettingsConfigDict(
         env_prefix="QLIB_DOCKER_",
@@ -706,7 +1017,10 @@ class QlibDockerConf(DockerConf):
     mount_path: str = "/workspace/qlib_workspace/"
     default_entry: str = "qrun conf.yaml"
     extra_volumes: dict = {
-        str(Path("~/.qlib/").expanduser().resolve().absolute()): {"bind": "/root/.qlib/", "mode": "rw"}
+        str(Path("~/.qlib/").expanduser().resolve().absolute()): {
+            "bind": "/root/.qlib/",
+            "mode": "rw",
+        }
     }
     shm_size: str | None = "16g"
     enable_gpu: bool = True
@@ -747,6 +1061,10 @@ class DSDockerConf(DockerConf):
         "48g"  # Add memory limit attribute # new-york-city-taxi-fare-prediction may need more memory
     )
 
+    # Declarative configuration: automatically loads from scenarios/data_science/share.yaml
+    _scenario_name: str = "data_science"
+    _exclude_path_keys: list[str] = ["input_path", "cache_path"]
+
 
 class MLEBDockerConf(DockerConf):
     model_config = SettingsConfigDict(env_prefix="MLEB_DOCKER_")
@@ -767,6 +1085,74 @@ class MLEBDockerConf(DockerConf):
     enable_cache: bool = False
 
 
+class FTDockerConf(DockerConf):
+    model_config = SettingsConfigDict(env_prefix="FT_DOCKER_")
+
+    build_from_dockerfile: bool = True
+    dockerfile_folder_path: Path = (
+        Path(__file__).parent.parent / "scenarios" / "finetune" / "env" / "docker" / "llm_finetune"
+    )
+    image: str = "local_llm_finetune:latest"
+    mount_path: str = "/workspace/"
+    default_entry: str = "llamafactory-cli version"
+
+    running_timeout_period: int | None = 36000  # 10 hours for training
+    mem_limit: str | None = "48g"  # Large memory for LLM training
+    shm_size: str | None = "16g"  # Shared memory for multi-GPU training
+    enable_gpu: bool = True  # Enable GPU for LLM training
+    enable_cache: bool = False  # Disable cache to avoid conflicts during training, True for debug
+
+    # Override log output control for FT training
+    save_logs_to_file: bool = True
+    terminal_tail_lines: int = 20
+
+    # Declarative configuration: automatically loads from scenarios/finetune/share.yaml
+    _scenario_name: str = "finetune"
+    _exclude_path_keys: list[str] = ["assets_path"]
+
+    network: str | None = "host"  # Use host network for finetune access to litellm proxy
+
+    def get_workspace_content_for_hash(self, local_path: str | Path) -> list[list[str]]:
+        """Include dataset_info.json in cache key calculation."""
+        content = super().get_workspace_content_for_hash(local_path)
+        local_path = Path(local_path)
+        # Add dataset_info.json if it exists
+        # NOTE: data.json is excluded because it is a generated file
+        for path in local_path.rglob("dataset_info.json"):
+            content.append([str(path.relative_to(local_path)), path.read_text()])
+
+        # Re-sort for a deterministic order: super() returns sorted content, but the appended entries may not be
+        content.sort(key=lambda x: x[0])
+        return content
+
+
+class BenchmarkDockerConf(DockerConf):
+    """Docker configuration for OpenCompass benchmark evaluation."""
+
+    model_config = SettingsConfigDict(env_prefix="BENCHMARK_DOCKER_")
+
+    build_from_dockerfile: bool = True
+    dockerfile_folder_path: Path = (
+        Path(__file__).parent.parent / "scenarios" / "finetune" / "env" / "docker" / "opencompass"
+    )
+    image: str = "rdagent-opencompass:latest"
+    mount_path: str = "/workspace/"
+    default_entry: str = "opencompass --help"
+
+    running_timeout_period: int | None = 3600  # 1 hour default for benchmarks
+    mem_limit: str | None = "32g"  # Moderate memory for inference
+    shm_size: str | None = "8g"  # Shared memory for model loading
+    enable_gpu: bool = True  # Enable GPU for fast inference
+    enable_cache: bool = False  # Disable cache for reproducibility
+
+    # Benchmark-specific log settings
+    save_logs_to_file: bool = True
+    terminal_tail_lines: int = 50  # Show more lines for benchmark progress
+
+    network: str | None = "host"  # Use host network for benchmark access to litellm proxy
+    env_dict: dict = {"COMPASS_DATA_CACHE": "/benchmarks/opencompass_data"}
+
+
 # physionet.org/files/mimic-eicu-fiddle-feature/1.0.0/FIDDLE_mimic3
 class DockerEnv(Env[DockerConf]):
     # TODO: Save the output into a specific file
@@ -783,7 +1169,9 @@ def prepare(self, *args, **kwargs) -> None:  # type: ignore[no-untyped-def]
         ):
             logger.info(f"Building the image from dockerfile: {self.conf.dockerfile_folder_path}")
             resp_stream = client.api.build(
-                path=str(self.conf.dockerfile_folder_path), tag=self.conf.image, network_mode=self.conf.network
+                path=str(self.conf.dockerfile_folder_path),
+                tag=self.conf.image,
+                network_mode=self.conf.network,
             )
             if isinstance(resp_stream, str):
                 logger.info(resp_stream)
@@ -795,7 +1183,10 @@ def prepare(self, *args, **kwargs) -> None:  # type: ignore[no-untyped-def]
                         if line.strip():
                             status_dict = json.loads(line)
                             if "error" in status_dict:
-                                p.update(task, description=f"[red]error: {status_dict['error']}")
+                                p.update(
+                                    task,
+                                    description=f"[red]error: {status_dict['error']}",
+                                )
                                 raise docker.errors.BuildError(status_dict["error"], "")
                             if "stream" in status_dict:
                                 p.update(task, description=status_dict["stream"])
@@ -812,7 +1203,11 @@ def prepare(self, *args, **kwargs) -> None:  # type: ignore[no-untyped-def]
                 status_task = sp.add_task("[bright_magenta]layer status", progress="")
                 for line in image_pull:
                     if "error" in line:
-                        sp.update(status_task, description=f"[red]error", progress=line["error"])
+                        sp.update(
+                            status_task,
+                            description=f"[red]error",
+                            progress=line["error"],
+                        )
                         raise docker.errors.APIError(line["error"])
 
                     layer_id = line["id"]
@@ -828,7 +1223,10 @@ def prepare(self, *args, **kwargs) -> None:  # type: ignore[no-untyped-def]
                     if status == "Pull complete" or status == "Already exists":
                         completed_layers += 1
 
-                    sp.update(main_task, progress=f"[green]{completed_layers}[white]/{len(layer_set)} layers completed")
+                    sp.update(
+                        main_task,
+                        progress=f"[green]{completed_layers}[white]/{len(layer_set)} layers completed",
+                    )
                     sp.update(
                         status_task,
                         description=f"[bright_magenta]layer {layer_id} [yellow]{status}",
@@ -838,14 +1236,29 @@ def prepare(self, *args, **kwargs) -> None:  # type: ignore[no-untyped-def]
             raise RuntimeError(f"Error while pulling the image: {e}")
 
     def _gpu_kwargs(self, client: docker.DockerClient) -> dict:  # type: ignore[no-any-unimported]
-        """get gpu kwargs based on its availability"""
+        """Get GPU kwargs based on GPU availability.
+
+        Supports GPU selection via CUDA_VISIBLE_DEVICES environment variable.
+        If set, only the specified GPUs will be available in the container.
+        Example: CUDA_VISIBLE_DEVICES=0,1 will only expose GPU 0 and 1.
+        """
         if not self.conf.enable_gpu:
             return {}
-        gpu_kwargs = {
-            "device_requests": (
-                [docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])] if self.conf.enable_gpu else None
-            ),
-        }
+
+        # Check if specific GPUs are requested via CUDA_VISIBLE_DEVICES
+        cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")
+        if cuda_visible:
+            # Use device_ids to specify exact GPUs (cannot use count with device_ids)
+            device_ids = [gpu.strip() for gpu in cuda_visible.split(",") if gpu.strip()]
+            gpu_kwargs = {
+                "device_requests": [docker.types.DeviceRequest(device_ids=device_ids, capabilities=[["gpu"]])],
+            }
+            logger.info(f"GPU selection: using specific GPUs {device_ids}")
+        else:
+            # Default: use all available GPUs
+            gpu_kwargs = {
+                "device_requests": [docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
+            }
 
         def get_image(image_name: str) -> None:
             try:
@@ -870,6 +1283,129 @@ def _f() -> dict:
 
         return _f()
 
+    def _generate_log_header(self, entry: str | None = None) -> str:
+        """
+        Generate a header for log files with execution info.
+
+        Args:
+            entry: Command entry that was executed
+
+        Returns:
+            Formatted header string
+        """
+        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+        header = "=" * 80 + "\n"
+        header += "Docker Execution Log\n"
+        header += f"Timestamp: {timestamp}\n"
+        header += f"Image: {self.conf.image}\n"
+        if entry:
+            header += f"Command: {entry}\n"
+        header += "=" * 80 + "\n\n"
+        return header
+
+    def _process_container_logs(self, logs, local_path: str = ".", entry: str | None = None) -> str:
+        """
+        Process Docker container logs with optional tail mode.
+
+        This method can be controlled via configuration:
+        - save_logs_to_file: Save full logs to timestamped files in logs/ subdirectory
+        - terminal_tail_lines: Show only last N lines in terminal (0 = show all)
+
+        Args:
+            logs: Docker container log stream
+            local_path: Path to workspace for saving log files
+            entry: Command entry that was executed (for logging header)
+
+        Returns:
+            Complete log output as string
+        """
+        log_output = ""
+
+        # Determine if we should use tail mode
+        use_tail_mode = self.conf.terminal_tail_lines > 0
+        save_to_file = self.conf.save_logs_to_file
+
+        # Set up log file with timestamp if needed
+        log_file_path = None
+        if save_to_file and local_path:
+            workspace = Path(local_path)
+
+            # Create logs subdirectory
+            logs_dir = workspace / "logs"
+            logs_dir.mkdir(parents=True, exist_ok=True)
+
+            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+            log_file_path = logs_dir / f"docker_execution_{timestamp}.log"
+
+            # Write header with execution info
+            header = self._generate_log_header(entry)
+            with open(log_file_path, "w", encoding="utf-8") as f:
+                f.write(header)
+
+            # Path of the "latest" symlink; the link itself is created after the log stream ends
+            latest_link = logs_dir / "docker_execution_latest.log"
+
+            print(f"[cyan]Full logs will be saved to: {log_file_path.absolute()}[/cyan]")
+
+        # Process logs with tail mode
+        if use_tail_mode:
+
+            log_buffer = deque(maxlen=self.conf.terminal_tail_lines)
+
+            def format_tail_display():
+                text = Text()
+                text.append(
+                    f"[Showing last {len(log_buffer)}/{self.conf.terminal_tail_lines} lines",
+                    style="dim",
+                )
+                if log_file_path:
+                    text.append(f" | Full log: {log_file_path.name}]\n", style="dim cyan")
+                else:
+                    text.append("]\n", style="dim")
+                text.append("-" * 80 + "\n", style="dim")
+                for line in log_buffer:
+                    text.append(line + "\n")
+                return text
+
+            with Live(format_tail_display(), refresh_per_second=2, console=Console()) as live:
+                for log in logs:
+                    decoded_log = log.strip().decode()
+                    log_output += decoded_log + "\n"
+                    log_buffer.append(decoded_log)
+
+                    if log_file_path:
+                        with open(log_file_path, "a", encoding="utf-8") as f:
+                            f.write(decoded_log + "\n")
+
+                    live.update(format_tail_display())
+        else:
+            # Tail mode disabled (terminal_tail_lines <= 0): print every log line
+            for log in logs:
+                decoded_log = log.strip().decode()
+                Console().print(decoded_log, markup=False)
+                log_output += decoded_log + "\n"
+
+                if log_file_path:
+                    with open(log_file_path, "a", encoding="utf-8") as f:
+                        f.write(decoded_log + "\n")
+
+        # Show log file location and create latest symlink
+        if log_file_path and log_file_path.exists():
+            print(f"[green]Full execution log saved to: {log_file_path.absolute()}[/green]")
+
+            # Create or update symlink to latest log
+            latest_link = log_file_path.parent / "docker_execution_latest.log"
+            if latest_link.exists() or latest_link.is_symlink():
+                latest_link.unlink()
+            try:
+                latest_link.symlink_to(log_file_path.name)
+                print(f"[dim]Latest log symlink: logs/{latest_link.name} -> {log_file_path.name}[/dim]")
+            except Exception:
+                # Symlinks might not work on all systems (e.g., Windows without admin)
+                pass
+
+        return log_output
+
     def _run(
         self,
         entry: str | None = None,
@@ -883,6 +1419,7 @@ def _run(
         env["PYTHONWARNINGS"] = "ignore"
         env["TF_CPP_MIN_LOG_LEVEL"] = "2"
         env["PYTHONUNBUFFERED"] = "1"
+        env["TOKENIZERS_PARALLELISM"] = "false"  # Avoid tokenizer fork warning in multi-process training
         client = docker.from_env()
 
         volumes = {}
@@ -895,7 +1432,10 @@ def _run(
                 volumes[lp] = rp if isinstance(rp, dict) else {"bind": rp, "mode": self.conf.extra_volume_mode}
             cache_path = "/tmp/sample" if "/sample/" in "".join(self.conf.extra_volumes.keys()) else "/tmp/full"
             Path(cache_path).mkdir(parents=True, exist_ok=True)
-            volumes[cache_path] = {"bind": T("scenarios.data_science.share:scen.cache_path").r(), "mode": "rw"}
+            volumes[cache_path] = {
+                "bind": T("scenarios.data_science.share:scen.cache_path").r(),
+                "mode": "rw",
+            }
         for lp, rp in running_extra_volume.items():
             volumes[lp] = rp if isinstance(rp, dict) else {"bind": rp, "mode": self.conf.extra_volume_mode}
 
@@ -917,6 +1457,10 @@ def _run(
                 shm_size=self.conf.shm_size,
                 mem_limit=self.conf.mem_limit,  # Set memory limit
                 cpu_count=self.conf.cpu_count,  # Set CPU limit
+                read_only=self.conf.read_only,
+                cap_drop=["ALL"] if self.conf.cap_drop_all else None,
+                pids_limit=self.conf.pids_limit,
+                tmpfs={"/tmp": "rw,noexec,nosuid,size=1g"} if self.conf.read_only else None,
                 **self._gpu_kwargs(client),
             )
             assert container is not None  # Ensure container was created successfully
@@ -932,10 +1476,10 @@ def _run(
             table.add_row("Env", "\n".join(f"{k}:{v}" for k, v in env.items()))
             table.add_row("Volumes", "\n".join(f"{k}:\n  {v}" for k, v in volumes.items()))
             print(table)
-            for log in logs:
-                decoded_log = log.strip().decode()
-                Console().print(decoded_log, markup=False)
-                log_output += decoded_log + "\n"
+
+            # Process logs (supports tail mode if configured)
+            log_output = self._process_container_logs(logs, local_path, entry=entry)
+
             exit_status = container.wait()["StatusCode"]
             print(Rule("[bold green]Docker Logs End[/bold green]", style="dark_orange"))
             return log_output, exit_status
@@ -948,6 +1492,23 @@ def _run(
         finally:
             cleanup_container(container)
 
+    def refresh_env(self) -> None:
+        """Remove the Docker image associated with this environment, then rebuild it via prepare()."""
+        client = docker.from_env()
+        try:
+            # Remove the specific image
+            client.images.remove(image=self.conf.image, force=True)
+            logger.info(f"Removed Docker image: {self.conf.image}")
+
+            client.images.prune()
+            client.api.prune_builds()
+            logger.info("Pruned dangling Docker images and build cache")
+        except docker.errors.ImageNotFound:
+            logger.warning(f"Docker image not found, cannot remove: {self.conf.image}")
+        except docker.errors.APIError as e:
+            logger.error(f"Error while removing Docker image: {e}")
+        self.prepare()
+
 
 class QTDockerEnv(DockerEnv):
     """Qlib Torch Docker"""
@@ -981,3 +1542,75 @@ class MLEBDockerEnv(DockerEnv):
 
     def __init__(self, conf: DockerConf = MLEBDockerConf()):
         super().__init__(conf)
+
+
+class FTDockerEnv(DockerEnv):
+    """
+    LLM Fine-tuning Docker Environment with improved log output control.
+
+    FTDockerConf enables:
+    - save_logs_to_file: True (saves full logs to <workspace>/logs/docker_execution_<timestamp>.log)
+    - terminal_tail_lines: 20 (only shows last 20 lines in terminal)
+
+    To customize, set environment variables:
+        export FT_DOCKER_terminal_tail_lines=50  # show last 50 lines
+        export FT_DOCKER_save_logs_to_file=false # disable log file
+    """
+
+    def __init__(self, conf: DockerConf = FTDockerConf()):
+        super().__init__(conf)
+
+
+class BenchmarkDockerEnv(DockerEnv):
+    """
+    OpenCompass Benchmark Docker Environment.
+
+    Uses BenchmarkDockerConf for evaluation-specific settings:
+    - Moderate memory/GPU allocation for inference
+    - Longer terminal output (50 lines) to track benchmark progress
+    - Automatic Dockerfile building from scenarios/finetune/env/docker/opencompass
+
+    To customize, set environment variables:
+        export BENCHMARK_DOCKER_running_timeout_period=7200  # 2 hours
+        export BENCHMARK_DOCKER_terminal_tail_lines=100  # show last 100 lines
+    """
+
+    def __init__(self, conf: DockerConf = BenchmarkDockerConf()):
+        super().__init__(conf)
+
+
+class RLDockerConf(DockerConf):
+    model_config = SettingsConfigDict(env_prefix="RL_DOCKER_")
+
+    build_from_dockerfile: bool = True
+    dockerfile_folder_path: Path = (
+        Path(__file__).parent.parent / "scenarios" / "rl" / "eval" / "autorl_bench" / "env" / "train"
+    )
+    image: str = "local_rl:latest"
+    mount_path: str = "/workspace/"
+    default_entry: str = "python main.py"
+
+    # Mount the assets directories (read-only)
+    extra_volumes: dict = {
+        str(Path(__file__).parent.parent / "scenarios" / "rl" / "eval" / "autorl_bench" / "assets" / "data"): {
+            "bind": "/data",
+            "mode": "ro",
+        },
+        str(Path(__file__).parent.parent / "scenarios" / "rl" / "eval" / "autorl_bench" / "assets" / "models"): {
+            "bind": "/models",
+            "mode": "ro",
+        },
+    }
+
+    running_timeout_period: int | None = 3600
+    mem_limit: str | None = "48g"
+    shm_size: str | None = "16g"
+    enable_gpu: bool = True
+    enable_cache: bool = False
+
+
+class RLDockerEnv(DockerEnv):
+    """RL Docker Environment"""
+
+    def __init__(self, conf: DockerConf = RLDockerConf()):
+        super().__init__(conf)
diff --git a/rdagent/utils/workflow/loop.py b/rdagent/utils/workflow/loop.py
index 34dc119e3..256f598e6 100644
--- a/rdagent/utils/workflow/loop.py
+++ b/rdagent/utils/workflow/loop.py
@@ -95,6 +95,7 @@ class LoopBase:
     loop_trace: dict[int, list[LoopTrace]]
 
     skip_loop_error: tuple[type[BaseException], ...] = ()  # you can define a list of error that will skip current loop
+    skip_loop_error_stepname: str | None = None  # step to jump to on skip_loop_error; None means the last step
     withdraw_loop_error: tuple[
         type[BaseException], ...
     ] = ()  # you can define a list of error that will withdraw current loop
@@ -245,8 +246,15 @@ async def _run_step(self, li: int, force_subproc: bool = False) -> None:
                 except Exception as e:
                     if isinstance(e, self.skip_loop_error):
                         logger.warning(f"Skip loop {li} due to {e}")
-                        # Jump to the last step (assuming last step is for recording)
-                        next_step_idx = len(self.steps) - 1
+                        if self.skip_loop_error_stepname:
+                            next_step_idx = self.steps.index(self.skip_loop_error_stepname)
+                            if next_step_idx <= si:
+                                raise RuntimeError(
+                                    f"Cannot skip backwards or to same step. Current: {si} ({name}), Target: {next_step_idx} ({self.skip_loop_error_stepname})"
+                                ) from e
+                        else:
+                            # Jump to the last step (assuming last step is for recording)
+                            next_step_idx = len(self.steps) - 1
                         self.loop_prev_out[li][name] = None
                         self.loop_prev_out[li][self.EXCEPTION_KEY] = e
                     elif isinstance(e, self.withdraw_loop_error):
@@ -465,28 +473,35 @@ def load(
             An instance of LoopBase with the loaded session.
         """
         path = Path(path)
+        session_folder = None
         # if the path is a directory, load the latest session
         if path.is_dir():
             if path.name != "__session__":
-                path = path / "__session__"
+                session_folder = path / "__session__"
+            else:
+                session_folder = path
 
-            if not path.exists():
+            if not session_folder.exists():
                 raise FileNotFoundError(f"No session file found in {path}")
 
             # iterate the dump steps in increasing order
-            files = sorted(path.glob("*/*_*"), key=lambda f: (int(f.parent.name), int(f.name.split("_")[0])))
+            files = sorted(session_folder.glob("*/*_*"), key=lambda f: (int(f.parent.name), int(f.name.split("_")[0])))
             path = files[-1]
             logger.info(f"Loading latest session from {path}")
+        else:
+            session_folder = path.parent.parent
+
         with path.open("rb") as f:
             session = cast(LoopBase, pickle.load(f))
 
         # set session folder
         if checkout:
             if checkout is True:
+                session.session_folder = session_folder
                 logger.set_storages_path(session.session_folder.parent)
-                max_loop = max(session.loop_trace.keys())
 
                 # truncate log storages after the max loop
+                max_loop = max(session.loop_trace.keys())
                 session.truncate_session_folder(max_loop, len(session.loop_trace[max_loop]) - 1)
                 logger.truncate_storages(session.loop_trace[max_loop][-1].end)
             else:
@@ -495,6 +510,8 @@ def load(
                 session.session_folder = checkout / "__session__"
                 logger.set_storages_path(checkout)
 
+            logger.info(f"Checkout session to {session.session_folder.parent}")
+
         if session.timer.started:
             if replace_timer:
                 RD_Agent_TIMER_wrapper.replace_timer(session.timer)
diff --git a/requirements.txt b/requirements.txt
index 619b19fa8..952bd3e02 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -75,4 +75,9 @@ types-pytz
 pydantic-ai-slim[mcp,openai,prefect]
 nest-asyncio
 
-prefect
\ No newline at end of file
+# Visualize SFT training
+tensorboard     # tensorboard --logdir git_ignore_folder/RD-Agent_workspace
+prefect
+
+# HuggingFace datasets
+datasets
\ No newline at end of file
diff --git a/test/finetune/test_benchmark.py b/test/finetune/test_benchmark.py
new file mode 100644
index 000000000..5f1774641
--- /dev/null
+++ b/test/finetune/test_benchmark.py
@@ -0,0 +1,251 @@
+"""
+Standalone test script for testing extract_error_samples.
+
+Usage:
+    python test_benchmark.py
+
+Uses rdagent's Docker environment with cache enabled.
+"""
+
+from __future__ import annotations
+
+import os
+from datetime import datetime
+from pathlib import Path
+
+# Set FT_file_path BEFORE importing rdagent modules (so Docker mounts correct path)
+_project_root = Path(__file__).resolve().parents[2]
+os.environ["FT_file_path"] = str(_project_root / "git_ignore_folder" / "finetune_files")
+
+import pandas as pd
+
+from rdagent.components.coder.finetune.conf import get_benchmark_env
+from rdagent.scenarios.finetune.benchmark.data.adaptor import BENCHMARK_CONFIG_DICT
+from rdagent.scenarios.finetune.benchmark.data.default import extract_error_samples
+from rdagent.utils.agent.tpl import T
+
+
+def run_benchmark_simple(
+    workspace_path: str,
+    model_path_in_docker: str,
+    benchmark_name: str,
+    gpu_count: int = 4,
+    limit: int = 3,
+    offset: int = 0,
+    max_error_samples: int = 5,
+    result_subdir: str = "",
+):
+    """
+    Simplified benchmark runner using rdagent Docker env.
+
+    Args:
+        workspace_path: Local workspace path
+        model_path_in_docker: Model path inside Docker (e.g., /finetune/models/Qwen/Qwen2.5-1.5B)
+        benchmark_name: Benchmark name
+        gpu_count: GPU count
+        limit: Dataset limit
+        offset: Starting offset for dataset sampling (default: 0)
+        max_error_samples: Max error samples to extract
+        result_subdir: Subdirectory for results (e.g., "validation", "test")
+    """
+    workspace = Path(workspace_path)
+    workspace.mkdir(parents=True, exist_ok=True)
+
+    cfg = BENCHMARK_CONFIG_DICT[benchmark_name]
+
+    # Auto download dependent data if configured
+    if cfg.download is not None:
+        cfg.download()
+
+    # Calculate tensor_parallel_size (round down to power of 2)
+    tp_size = 1
+    power = 0
+    while (1 << (power + 1)) <= gpu_count:
+        power += 1
+    tp_size = 1 << power
+
+    # Generate config.py (paths are Docker paths)
+    config_content = T("rdagent.scenarios.finetune.benchmark.configs.opencompass_template:template").r(
+        model_abbr=f"test-{benchmark_name}",
+        model_path=model_path_in_docker,
+        is_lora=False,
+        lora_path="",
+        dataset_imports=[cfg.dataset],
+        limit=limit,
+        offset=offset,
+        num_runs=1,
+        pass_k=None,
+        work_dir="/workspace",  # Docker workspace path
+        tensor_parallel_size=tp_size,
+        gpu_memory_utilization=0.9,
+        dtype="bfloat16",
+        max_seq_len=32768,
+        max_out_len=8192,
+        batch_size=16,
+        temperature=0.0,
+        top_p=1.0,
+        top_k=1,
+        repetition_penalty=1.0,
+        enable_thinking=False,
+    )
+
+    config_file = workspace / "config.py"
+    config_file.write_text(config_content)
+
+    # Get Docker env with cache enabled
+    env = get_benchmark_env()
+    env.conf.enable_cache = True
+
+    # Environment variables for LLM judge (required for cascade eval benchmarks like AIME25)
+    env_vars = {
+        "OC_JUDGE_MODEL": "gpt-5.1",
+        "OC_JUDGE_API_KEY": "sk-1234",
+        "OC_JUDGE_API_BASE": "http://localhost:3000",
+        "OC_JUDGE_RETRY": "3",
+    }
+
+    # Run opencompass in Docker
+    if result_subdir:
+        benchmark_work_dir = f"/workspace/benchmark_results/{result_subdir}"
+    else:
+        benchmark_work_dir = "/workspace/benchmark_results"
+    cmd = f"opencompass /workspace/config.py --work-dir {benchmark_work_dir}"
+    print(f"Running in Docker: {cmd}")
+    if offset:
+        print(f"Dataset range: [{offset}:{offset + limit}]")
+
+    result = env.run(
+        entry=cmd,
+        local_path=str(workspace),
+        env=env_vars,
+    )
+
+    print(f"Exit code: {result.exit_code}")
+    if result.exit_code != 0:
+        print(f"Error: {result.stdout[-2000:] if result.stdout else 'No output'}")
+        raise RuntimeError(f"Benchmark failed with exit code {result.exit_code}")
+
+    # Extract results from local workspace
+    work_dir = workspace / "benchmark_results"
+    if result_subdir:
+        work_dir = work_dir / result_subdir
+    timestamped_dirs = sorted(work_dir.glob("202*_*"), reverse=True)
+    if not timestamped_dirs:
+        raise RuntimeError(f"No results found in {work_dir}")
+
+    result_dir = timestamped_dirs[0]
+    csv_files = sorted(result_dir.rglob("summary/*.csv"), reverse=True)
+    if not csv_files:
+        raise RuntimeError(f"No CSV files found in {result_dir}")
+
+    # Parse benchmark results from CSV, grouped by dataset
+    df = pd.read_csv(csv_files[0])
+    # Get score column (the model name column, e.g., 'test-chemcotbench')
+    score_col = [c for c in df.columns if c not in ["dataset", "version", "metric", "mode"]][0]
+    # Pivot to group by dataset, with metrics as columns (use pivot_table to handle duplicates)
+    pivoted = df.pivot_table(index="dataset", columns="metric", values=score_col, aggfunc="first").to_dict("index")
+    # Filter out NaN values (different datasets have different metrics)
+    benchmark_results = {ds: {k: v for k, v in metrics.items() if pd.notna(v)} for ds, metrics in pivoted.items()}
+
+    # Extract error samples
+    errors = extract_error_samples(
+        result_dir,
+        max_samples=max_error_samples,
+    )
+
+    return {"benchmark_results": benchmark_results, "error_samples": errors}
+
+
+if __name__ == "__main__":
+    # Change to project root (required for template resolution)
+    os.chdir(_project_root)
+
+    # Configuration
+    MODEL = "Qwen/Qwen3-8B"
+    LIMIT = 3
+    GPU_COUNT = 4
+
+    # Docker model path (models are mounted at /finetune/models)
+    model_path_in_docker = f"/finetune/models/{MODEL}"
+
+    # Create test directory
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    test_base = _project_root / "git_ignore_folder" / "test" / timestamp
+
+    print("=" * 60)
+    print(f"BENCHMARK TEST: {MODEL} (limit={LIMIT})")
+    print(f"Docker model path: {model_path_in_docker}")
+    print(f"Output: {test_base}")
+    print("=" * 60)
+
+    results_summary = {}
+
+    # Hardcoded benchmark list - comment/uncomment to select benchmarks to test
+    BENCHMARKS_TO_TEST = [
+        # Math Reasoning
+        # "aime24",
+        # "aime25",
+        # "math",
+        # General Knowledge
+        # "mmlu",
+        # Code Generation
+        # "humaneval",
+        # "mbpp",
+        # PANORAMA - Patent Analysis (zero-shot)
+        # "panorama",
+        # "panorama_par4pc",
+        # "panorama_pi4pc",
+        # "panorama_noc4pc",
+        # PANORAMA - Patent Analysis (CoT)
+        # "panorama_par4pc_cot",
+        # "panorama_pi4pc_cot",
+        # "panorama_noc4pc_cot",
+        # ChemCoTBench - Chemistry Reasoning
+        # "chemcotbench",
+        "chemcotbench_mol_und",
+        "chemcotbench_mol_edit",
+        "chemcotbench_mol_opt",
+        "chemcotbench_reaction",
+        # TableBench - Table QA
+        "tablebench_data_analysis",
+        "tablebench_fact_checking",
+        "tablebench_numerical_reasoning",
+        "tablebench_visualization",
+        # "tablebench_gen",
+        # Finance
+        # "FinanceIQ_gen",
+    ]
+
+    for benchmark_name in BENCHMARKS_TO_TEST:
+        print(f"\n{'='*60}")
+        print(f"Running: {benchmark_name}")
+        print("=" * 60)
+
+        workspace = test_base / benchmark_name
+        result = run_benchmark_simple(
+            workspace_path=str(workspace),
+            model_path_in_docker=model_path_in_docker,
+            benchmark_name=benchmark_name,
+            gpu_count=GPU_COUNT,
+            limit=LIMIT,
+            max_error_samples=5,
+        )
+
+        error_samples = result.get("error_samples", [])
+        benchmark_results = result.get("benchmark_results", {})
+
+        print(f"  Results: {benchmark_results}")
+        print(f"  Error samples: {len(error_samples)}")
+        if error_samples:
+            print(f"  Sample: {error_samples[0]}")
+
+        results_summary[benchmark_name] = {
+            "error_count": len(error_samples),
+            "benchmark_results": benchmark_results,
+        }
+
+    print("\n" + "=" * 60)
+    print("SUMMARY")
+    print("=" * 60)
+    for name, info in results_summary.items():
+        print(f"  {name}: errors={info['error_count']}")
diff --git a/test/finetune/test_benchmark_api.py b/test/finetune/test_benchmark_api.py
new file mode 100644
index 000000000..b5c7c7c91
--- /dev/null
+++ b/test/finetune/test_benchmark_api.py
@@ -0,0 +1,512 @@
+"""
+Standalone test script for API-based benchmark testing.
+
+Usage:
+    python test_benchmark_api.py
+
+Uses OpenAI-compatible API with Docker environment for running opencompass.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+from datetime import datetime
+from pathlib import Path
+
+# Set FT_file_path BEFORE importing rdagent modules (so Docker mounts correct path)
+_project_root = Path(__file__).resolve().parents[2]
+os.environ["FT_file_path"] = str(_project_root / "git_ignore_folder" / "finetune_files")
+
+import pandas as pd
+
+from rdagent.components.coder.finetune.conf import get_benchmark_env
+from rdagent.scenarios.finetune.benchmark.benchmark import get_benchmark_ranges
+from rdagent.scenarios.finetune.benchmark.data.adaptor import BENCHMARK_CONFIG_DICT
+from rdagent.scenarios.finetune.benchmark.data.default import extract_error_samples
+
+# OpenCompass API config template
+API_CONFIG_TEMPLATE = """
+from mmengine.config import read_base
+from opencompass.models import OpenAI
+
+# ==================== Dataset Import ====================
+with read_base():
+{dataset_imports}
+
+# Aggregate all dataset variables
+datasets = sum([v for k, v in locals().items() if (k == 'datasets' or k.endswith('_datasets')) and isinstance(v, list)], [])
+
+# Apply dataset modifications
+for ds in datasets:
+{limit_config}
+    pass
+
+# ==================== API Model Configuration ====================
+api_meta_template = dict(round=[
+    dict(role='HUMAN', api_role='HUMAN'),
+    dict(role='BOT', api_role='BOT', generate=True),
+])
+
+models = [
+    dict(
+        abbr='{model_abbr}',
+        type=OpenAI,
+        path='{model_path}',
+        key='{api_key}',
+        openai_api_base='{api_base}',
+        meta_template=api_meta_template,
+        query_per_second={query_per_second},
+        max_out_len={max_out_len},
+        max_seq_len={max_seq_len},
+        batch_size={batch_size},
+        retry={retry},
+    ),
+]
+
+# ==================== Inference Configuration ====================
+infer = dict(
+    partitioner=dict(type='NaivePartitioner'),
+    runner=dict(
+        type='LocalRunner',
+        max_num_workers={max_num_workers},
+        retry=2,
+        task=dict(type='OpenICLInferTask'),
+    ),
+)
+
+# ==================== Evaluation Configuration ====================
+eval = dict(
+    partitioner=dict(type='NaivePartitioner'),
+    runner=dict(
+        type='LocalRunner',
+        max_num_workers=4,
+        retry=2,
+        task=dict(type='OpenICLEvalTask', dump_details=True),
+    ),
+)
+
+# ==================== Work Directory ====================
+work_dir = '{work_dir}'
+"""
+
+
+def generate_api_config(
+    model_abbr: str,
+    model_path: str,
+    api_key: str,
+    api_base: str,
+    dataset_imports: list[str],
+    limit: int | None = None,
+    offset: int = 0,
+    test_range: str | None = None,
+    work_dir: str = "/workspace",
+    max_out_len: int = 8192,
+    max_seq_len: int = 32768,
+    batch_size: int = 8,
+    query_per_second: int = 1,
+    max_num_workers: int = 16,
+    retry: int = 5,
+) -> str:
+    """Generate OpenCompass config for API-based model evaluation.
+
+    Args:
+        test_range: Direct test_range expression (e.g., "[:min(100, len(index_list)//2)]").
+                    If provided, overrides limit/offset parameters.
+    """
+    # Format dataset imports
+    dataset_import_lines = "\n".join(f"    from {module} import *" for module in dataset_imports)
+
+    # Format limit config - support direct test_range or limit/offset
+    if test_range:
+        # Use direct test_range expression (supports dynamic expressions like len(index_list))
+        limit_config = f"""    # Apply test_range for dataset sampling
+    if 'reader_cfg' not in ds:
+        ds['reader_cfg'] = {{}}
+    ds['reader_cfg']['test_range'] = '{test_range}'
+
+    # Sync to evaluator's dataset_cfg
+    if 'eval_cfg' in ds and 'evaluator' in ds['eval_cfg']:
+        evaluator = ds['eval_cfg']['evaluator']
+        if isinstance(evaluator, dict) and 'dataset_cfg' in evaluator:
+            if 'reader_cfg' not in evaluator['dataset_cfg']:
+                evaluator['dataset_cfg']['reader_cfg'] = {{}}
+            evaluator['dataset_cfg']['reader_cfg']['test_range'] = '{test_range}'"""
+    elif limit:
+        if offset:
+            computed_range = f"[{offset}:{offset + limit}]"
+        else:
+            computed_range = f"[:{limit}]"
+        limit_config = f"""    # Limit dataset size for faster testing
+    if 'reader_cfg' not in ds:
+        ds['reader_cfg'] = {{}}
+    ds['reader_cfg']['test_range'] = '{computed_range}'
+
+    # Limit few-shot examples to avoid index out of range
+    # FixKRetriever uses fix_id_list to select examples from train/dev split
+    if 'infer_cfg' in ds and 'retriever' in ds['infer_cfg']:
+        retriever = ds['infer_cfg']['retriever']
+        if isinstance(retriever, dict) and 'fix_id_list' in retriever:
+            # Limit fix_id_list to valid range (0 to limit-1)
+            retriever['fix_id_list'] = [i for i in retriever['fix_id_list'] if i < {limit}]
+
+    # Sync to evaluator's dataset_cfg
+    if 'eval_cfg' in ds and 'evaluator' in ds['eval_cfg']:
+        evaluator = ds['eval_cfg']['evaluator']
+        if isinstance(evaluator, dict) and 'dataset_cfg' in evaluator:
+            if 'reader_cfg' not in evaluator['dataset_cfg']:
+                evaluator['dataset_cfg']['reader_cfg'] = {{}}
+            evaluator['dataset_cfg']['reader_cfg']['test_range'] = '{computed_range}'"""
+    else:
+        limit_config = ""
+
+    return API_CONFIG_TEMPLATE.format(
+        dataset_imports=dataset_import_lines,
+        limit_config=limit_config,
+        model_abbr=model_abbr,
+        model_path=model_path,
+        api_key=api_key,
+        api_base=api_base,
+        work_dir=work_dir,
+        max_out_len=max_out_len,
+        max_seq_len=max_seq_len,
+        batch_size=batch_size,
+        query_per_second=query_per_second,
+        max_num_workers=max_num_workers,
+        retry=retry,
+    )
+
+
+def run_benchmark_api(
+    workspace_path: str,
+    model_name: str,
+    api_key: str,
+    api_base: str,
+    benchmark_name: str,
+    limit: int | None = 3,
+    offset: int = 0,
+    test_range: str | None = None,
+    max_error_samples: int = 5,
+    max_out_len: int = 8192,
+    max_seq_len: int = 32768,
+    batch_size: int = 8,
+    query_per_second: int = 1,
+    max_num_workers: int = 16,
+    retry: int = 5,
+    hf_token: str | None = None,
+    result_subdir: str = "",
+):
+    """
+    API-based benchmark runner using rdagent Docker env.
+
+    Args:
+        workspace_path: Local workspace path
+        model_name: API model name (e.g., gpt-4o-mini)
+        api_key: OpenAI API key
+        api_base: OpenAI API base URL (currently unused; the Docker-side URL is hardcoded below)
+        benchmark_name: Benchmark name
+        limit: Dataset limit (ignored if test_range is provided)
+        offset: Starting offset for dataset sampling (ignored if test_range is provided)
+        test_range: Direct test_range expression (e.g., "[:min(100, len(index_list)//2)]").
+                    If provided, overrides limit/offset parameters.
+        max_error_samples: Max error samples to extract
+        max_out_len: Maximum output length
+        max_seq_len: Maximum sequence length
+        batch_size: Batch size for API calls
+        query_per_second: Rate limit for API calls
+        max_num_workers: Max number of workers for inference
+        hf_token: Hugging Face token for gated datasets
+        result_subdir: Subdirectory for results (e.g., "validation", "test")
+    """
+    workspace = Path(workspace_path)
+    workspace.mkdir(parents=True, exist_ok=True)
+
+    cfg = BENCHMARK_CONFIG_DICT[benchmark_name]
+
+    # Auto download dependent data if configured
+    if cfg.download is not None:
+        cfg.download()
+
+    # Docker uses host network, so localhost works directly
+    # OpenAI class (inference) expects full URL with /chat/completions
+    docker_api_base = "http://localhost:3000/v1/chat/completions"
+    # OpenAISDK class (LLM judge) auto-appends /chat/completions, so use base only
+    docker_api_base_sdk = "http://localhost:3000/v1"
+
+    # Generate config.py
+    config_content = generate_api_config(
+        model_abbr=f"api-{benchmark_name}",
+        model_path=model_name,
+        api_key=api_key,
+        api_base=docker_api_base,
+        dataset_imports=[cfg.dataset],
+        limit=limit,
+        offset=offset,
+        test_range=test_range,
+        work_dir="/workspace",
+        max_out_len=max_out_len,
+        max_seq_len=max_seq_len,
+        batch_size=batch_size,
+        query_per_second=query_per_second,
+        max_num_workers=max_num_workers,
+        retry=retry,
+    )
+
+    config_file = workspace / "config.py"
+    config_file.write_text(config_content)
+
+    # Get Docker env with cache enabled
+    env = get_benchmark_env()
+    env.conf.enable_cache = True
+
+    # Environment variables for LLM judge (required for cascade eval benchmarks like AIME25)
+    # Note: LLM judge uses OpenAISDK which auto-appends /chat/completions
+    env_vars = {
+        "OC_JUDGE_MODEL": model_name,
+        "OC_JUDGE_API_KEY": api_key,
+        "OC_JUDGE_API_BASE": docker_api_base_sdk,  # SDK auto-appends /chat/completions
+        "OC_JUDGE_RETRY": "3",
+        # Pass API credentials for use inside Docker
+        "OPENAI_API_KEY": api_key,
+        "OPENAI_BASE_URL": docker_api_base_sdk,  # SDK auto-appends /chat/completions
+    }
+    # Add HF token for gated datasets (e.g., ChemCoTBench)
+    if hf_token:
+        env_vars["HF_TOKEN"] = hf_token
+
+    # Run opencompass in Docker with --debug to avoid subprocess segfault
+    if result_subdir:
+        benchmark_work_dir = f"/workspace/benchmark_results/{result_subdir}"
+    else:
+        benchmark_work_dir = "/workspace/benchmark_results"
+    cmd = f"opencompass /workspace/config.py --work-dir {benchmark_work_dir} --debug"
+    print(f"Running in Docker: {cmd}")
+    print(f"API Base (Docker): {docker_api_base}")
+    if offset and limit and not test_range:
+        print(f"Dataset range: [{offset}:{offset + limit}]")
+
+    result = env.run(
+        entry=cmd,
+        local_path=str(workspace),
+        env=env_vars,
+    )
+
+    print(f"Exit code: {result.exit_code}")
+    if result.exit_code != 0:
+        print(f"Error: {result.stdout[-2000:] if result.stdout else 'No output'}")
+        raise RuntimeError(f"Benchmark failed with exit code {result.exit_code}")
+
+    # Extract results from local workspace
+    work_dir = workspace / "benchmark_results"
+    if result_subdir:
+        work_dir = work_dir / result_subdir
+    timestamped_dirs = sorted(work_dir.glob("202*_*"), reverse=True)
+    if not timestamped_dirs:
+        raise RuntimeError(f"No results found in {work_dir}")
+
+    result_dir = timestamped_dirs[0]
+    csv_files = sorted(result_dir.rglob("summary/*.csv"), reverse=True)
+    if not csv_files:
+        raise RuntimeError(f"No CSV files found in {result_dir}")
+
+    # Parse benchmark results from CSV, grouped by dataset
+    df = pd.read_csv(csv_files[0])
+    # Get score column (the model name column, e.g., 'api-chemcotbench')
+    score_col = [c for c in df.columns if c not in ["dataset", "version", "metric", "mode"]][0]
+    # Pivot to group by dataset, with metrics as columns (use pivot_table to handle duplicates)
+    pivoted = df.pivot_table(index="dataset", columns="metric", values=score_col, aggfunc="first").to_dict("index")
+    # Filter out NaN values (different datasets have different metrics)
+    benchmark_results = {ds: {k: v for k, v in metrics.items() if pd.notna(v)} for ds, metrics in pivoted.items()}
+
+    # Extract error samples
+    errors = extract_error_samples(
+        result_dir,
+        max_samples=max_error_samples,
+    )
+
+    return {"benchmark_results": benchmark_results, "error_samples": errors}
+
+
+if __name__ == "__main__":
+    # Change to project root (required for template resolution)
+    os.chdir(_project_root)
+
+    # ==================== API Configuration ====================
+    API_KEY = "sk-1234"
+    API_BASE = "http://localhost:3000"
+    MODEL = "gpt-4o-mini"
+    HF_TOKEN = "hf_xxxx"  # For gated datasets
+
+    # ==================== Test Configuration ====================
+    MAX_OUT_LEN = 8192
+    MAX_SEQ_LEN = 32768
+    BATCH_SIZE = 8
+    QUERY_PER_SECOND = 1
+    MAX_NUM_WORKERS = 16
+
+    # Create test directory
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    test_base = _project_root / "git_ignore_folder" / "test_api" / timestamp
+
+    # ==================== Test Mode Selection ====================
+    # Set to True to test get_benchmark_ranges() with validation/test splits
+    TEST_BENCHMARK_RANGES = True
+
+    if TEST_BENCHMARK_RANGES:
+        # Test get_benchmark_ranges() with AIME25 (small dataset, 15 samples per subset)
+        val_range, test_range = get_benchmark_ranges()
+        print("=" * 60)
+        print("TESTING get_benchmark_ranges() NON-OVERLAPPING SPLITS")
+        print("=" * 60)
+        print(f"Validation range: {val_range}")
+        print(f"Test range: {test_range}")
+        print(f"API Base: {API_BASE}")
+        print(f"Output: {test_base}")
+        print("=" * 60)
+
+        # Test with AIME25 - a small dataset (15 samples per subset)
+        BENCHMARK = "aime25"
+        results_summary = {}
+
+        for split_name, split_range in [("validation", val_range), ("test", test_range)]:
+            print(f"\n{'='*60}")
+            print(f"Running: {BENCHMARK} - {split_name} split")
+            print(f"test_range: {split_range}")
+            print("=" * 60)
+
+            workspace = test_base / BENCHMARK / split_name
+            result = run_benchmark_api(
+                workspace_path=str(workspace),
+                model_name=MODEL,
+                api_key=API_KEY,
+                api_base=API_BASE,
+                benchmark_name=BENCHMARK,
+                limit=None,  # Disabled, use test_range instead
+                test_range=split_range,
+                max_error_samples=5,
+                max_out_len=MAX_OUT_LEN,
+                max_seq_len=MAX_SEQ_LEN,
+                batch_size=BATCH_SIZE,
+                query_per_second=QUERY_PER_SECOND,
+                max_num_workers=MAX_NUM_WORKERS,
+                hf_token=HF_TOKEN,
+                result_subdir=split_name,
+            )
+
+            error_samples = result.get("error_samples", [])
+            benchmark_results = result.get("benchmark_results", {})
+
+            # Save result to workspace
+            result_file = workspace / "result.json"
+            with open(result_file, "w", encoding="utf-8") as f:
+                json.dump(result, f, indent=2, ensure_ascii=False)
+            print(f"  Result saved to: {result_file}")
+
+            print(f"  Results: {benchmark_results}")
+            print(f"  Error samples: {len(error_samples)}")
+
+            results_summary[f"{BENCHMARK}_{split_name}"] = {
+                "error_count": len(error_samples),
+                "benchmark_results": benchmark_results,
+            }
+
+        print("\n" + "=" * 60)
+        print("SUMMARY - get_benchmark_ranges() TEST")
+        print("=" * 60)
+        for name, info in results_summary.items():
+            print(f"  {name}: {info['benchmark_results']}")
+
+    else:
+        # Original test mode with fixed limit/offset
+        LIMIT = 3
+        print("=" * 60)
+        print(f"API BENCHMARK TEST: {MODEL} (limit={LIMIT})")
+        print(f"API Base: {API_BASE}")
+        print(f"Output: {test_base}")
+        print("=" * 60)
+
+        results_summary = {}
+
+        # Hardcoded benchmark list - comment/uncomment to select benchmarks to test
+        BENCHMARKS_TO_TEST = [
+            # Math Reasoning
+            # "aime24",
+            # "aime25",
+            # "math",
+            # General Knowledge
+            # "mmlu",
+            # Code Generation
+            # "humaneval",
+            # "mbpp",
+            # PANORAMA - Patent Analysis (zero-shot)
+            "panorama",
+            "panorama_par4pc",
+            "panorama_pi4pc",
+            "panorama_noc4pc",
+            # PANORAMA - Patent Analysis (CoT)
+            "panorama_par4pc_cot",
+            "panorama_pi4pc_cot",
+            "panorama_noc4pc_cot",
+            # ChemCoTBench - Chemistry Reasoning
+            "chemcotbench",
+            "chemcotbench_mol_und",
+            "chemcotbench_mol_edit",
+            "chemcotbench_mol_opt",
+            "chemcotbench_reaction",
+            # TableBench - Table QA
+            "tablebench_data_analysis",
+            "tablebench_fact_checking",
+            "tablebench_numerical_reasoning",
+            "tablebench_visualization",
+            "tablebench_gen",
+            # Finance
+            "FinanceIQ_gen",
+        ]
+
+        for benchmark_name in BENCHMARKS_TO_TEST:
+            print(f"\n{'='*60}")
+            print(f"Running: {benchmark_name}")
+            print("=" * 60)
+
+            workspace = test_base / benchmark_name
+            result = run_benchmark_api(
+                workspace_path=str(workspace),
+                model_name=MODEL,
+                api_key=API_KEY,
+                api_base=API_BASE,
+                benchmark_name=benchmark_name,
+                limit=LIMIT,
+                max_error_samples=5,
+                max_out_len=MAX_OUT_LEN,
+                max_seq_len=MAX_SEQ_LEN,
+                batch_size=BATCH_SIZE,
+                query_per_second=QUERY_PER_SECOND,
+                max_num_workers=MAX_NUM_WORKERS,
+                hf_token=HF_TOKEN,
+                offset=100,
+            )
+
+            error_samples = result.get("error_samples", [])
+            benchmark_results = result.get("benchmark_results", {})
+
+            # Save result to workspace
+            result_file = workspace / "result.json"
+            with open(result_file, "w", encoding="utf-8") as f:
+                json.dump(result, f, indent=2, ensure_ascii=False)
+            print(f"  Result saved to: {result_file}")
+
+            print(f"  Results: {benchmark_results}")
+            print(f"  Error samples: {len(error_samples)}")
+            if error_samples:
+                print(f"  Sample: {error_samples[0]}")
+
+            results_summary[benchmark_name] = {
+                "error_count": len(error_samples),
+                "benchmark_results": benchmark_results,
+            }
+
+        print("\n" + "=" * 60)
+        print("SUMMARY")
+        print("=" * 60)
+        for name, info in results_summary.items():
+            print(f"  {name}: errors={info['error_count']}")
diff --git a/test/finetune/test_benchmark_tablebench.py b/test/finetune/test_benchmark_tablebench.py
new file mode 100644
index 000000000..575734d4a
--- /dev/null
+++ b/test/finetune/test_benchmark_tablebench.py
@@ -0,0 +1,221 @@
+"""
+TableBench standalone test script.
+Runs the TableBench family of benchmarks.
+"""
+
+from __future__ import annotations
+
+import os
+from datetime import datetime
+from pathlib import Path
+
+# 1. Set environment variables (must happen before importing rdagent)
+_project_root = Path(__file__).resolve().parents[2]
+os.environ["FT_file_path"] = str(_project_root / "git_ignore_folder" / "finetune_files")
+
+import pandas as pd
+
+from rdagent.components.coder.finetune.conf import get_benchmark_env
+from rdagent.scenarios.finetune.benchmark.data.adaptor import BENCHMARK_CONFIG_DICT
+from rdagent.scenarios.finetune.benchmark.data.default import extract_error_samples
+from rdagent.utils.agent.tpl import T
+
+
+def run_benchmark_simple(
+    workspace_path: str,
+    model_path_in_docker: str,
+    benchmark_name: str,
+    gpu_count: int = 4,
+    limit: int = 3,
+    offset: int = 0,
+    max_error_samples: int = 5,
+    result_subdir: str = "",
+):
+    """
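+    Simplified benchmark runner.
+
+    Args:
+        workspace_path: Local workspace path (where results are saved)
+        model_path_in_docker: Model path inside Docker
+        benchmark_name: Benchmark name
+        gpu_count: Number of GPUs
+        limit: Sample limit (for quick tests)
+        offset: Starting offset for dataset sampling (default: 0)
+        max_error_samples: Number of error samples to extract
+        result_subdir: Subdirectory for results (e.g., "validation", "test")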
+    """
+    workspace = Path(workspace_path)
+    workspace.mkdir(parents=True, exist_ok=True)
+
+    # Get the benchmark config
+    cfg = BENCHMARK_CONFIG_DICT[benchmark_name]
+
+    # Automatically download dependent data
+    if cfg.download is not None:
+        cfg.download()
+
+    # Compute tensor_parallel_size (round down to the nearest power of 2)
+    tp_size = 1
+    power = 0
+    while (1 << (power + 1)) <= gpu_count:
+        power += 1
+    tp_size = 1 << power
+
+    # Generate the OpenCompass config file
+    config_content = T("rdagent.scenarios.finetune.benchmark.configs.opencompass_template:template").r(
+        model_abbr=f"test-{benchmark_name}",
+        model_path=model_path_in_docker,
+        is_lora=False,
+        lora_path="",
+        dataset_imports=[cfg.dataset],
+        limit=limit,
+        offset=offset,
+        num_runs=1,
+        pass_k=None,
+        work_dir="/workspace",
+        tensor_parallel_size=tp_size,
+        gpu_memory_utilization=0.9,
+        dtype="bfloat16",
+        max_seq_len=32768,
+        max_out_len=8192,
+        batch_size=16,
+        temperature=0.0,
+        top_p=1.0,
+        top_k=1,
+        repetition_penalty=1.0,
+        enable_thinking=False,
+    )
+
+    config_file = workspace / "config.py"
+    config_file.write_text(config_content)
+
+    # Get the Docker environment (with cache enabled)
+    env = get_benchmark_env()
+    env.conf.enable_cache = True
+
+    # Environment variables (for benchmarks that need an LLM judge)
+    env_vars = {
+        "OC_JUDGE_MODEL": "gpt-5.1",
+        "OC_JUDGE_API_KEY": "sk-1234",
+        "OC_JUDGE_API_BASE": "http://localhost:3000",
+        "OC_JUDGE_RETRY": "3",
+    }
+
+    # Run OpenCompass in Docker
+    if result_subdir:
+        benchmark_work_dir = f"/workspace/benchmark_results/{result_subdir}"
+    else:
+        benchmark_work_dir = "/workspace/benchmark_results"
+    cmd = f"opencompass /workspace/config.py --work-dir {benchmark_work_dir}"
+    print(f"Running in Docker: {cmd}")
+    if offset:
+        print(f"Dataset range: [{offset}:{offset + limit if limit is not None else ''}]")
+
+    result = env.run(
+        entry=cmd,
+        local_path=str(workspace),
+        env=env_vars,
+    )
+
+    print(f"Exit code: {result.exit_code}")
+    if result.exit_code != 0:
+        print(f"Error: {result.stdout[-2000:] if result.stdout else 'No output'}")
+        raise RuntimeError(f"Benchmark failed with exit code {result.exit_code}")
+
+    # Extract results from the local workspace
+    work_dir = workspace / "benchmark_results"
+    if result_subdir:
+        work_dir = work_dir / result_subdir
+    timestamped_dirs = sorted(work_dir.glob("202*_*"), reverse=True)
+    if not timestamped_dirs:
+        raise RuntimeError(f"No results found in {work_dir}")
+
+    result_dir = timestamped_dirs[0]
+    csv_files = sorted(result_dir.rglob("summary/*.csv"), reverse=True)
+    if not csv_files:
+        raise RuntimeError(f"No CSV files found in {result_dir}")
+
+    # Parse the CSV results
+    df = pd.read_csv(csv_files[0])
+    score_col = [c for c in df.columns if c not in ["dataset", "version", "metric", "mode"]][0]
+    pivoted = df.pivot_table(index="dataset", columns="metric", values=score_col, aggfunc="first").to_dict("index")
+    benchmark_results = {ds: {k: v for k, v in metrics.items() if pd.notna(v)} for ds, metrics in pivoted.items()}
+
+    # Extract error samples
+    errors = extract_error_samples(result_dir, max_samples=max_error_samples)
+
+    return {"benchmark_results": benchmark_results, "error_samples": errors}
+
+
+if __name__ == "__main__":
+    # Switch to the project root (required for template resolution)
+    os.chdir(_project_root)
+
+    # ========== Configuration ==========
+    MODEL = "Qwen/Qwen2.5-1.5B"  # change to your model name
+    LIMIT = 10  # sample limit (None means no limit)
+    GPU_COUNT = 4  # number of GPUs available
+
+    # Model path in Docker (automatically mounted at /finetune/models)
+    model_path_in_docker = f"/finetune/models/{MODEL}"
+
+    # Create the test directory
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    test_base = _project_root / "git_ignore_folder" / "test" / timestamp
+
+    print("=" * 60)
+    print(f"TABLEBENCH TEST: {MODEL} (limit={LIMIT})")
+    print(f"Docker model path: {model_path_in_docker}")
+    print(f"Output: {test_base}")
+    print("=" * 60)
+
+    results_summary = {}
+
+    # TableBench benchmarks to run
+    BENCHMARKS_TO_TEST = [
+        "tablebench_data_analysis",  # data analysis
+        "tablebench_fact_checking",  # fact checking
+        "tablebench_numerical_reasoning",  # numerical reasoning
+        "tablebench_visualization",  # visualization
+        # "tablebench_gen",               # combined (covers all of the types above)
+    ]
+
+    # Run each benchmark
+    for benchmark_name in BENCHMARKS_TO_TEST:
+        print(f"\n{'='*60}")
+        print(f"Running: {benchmark_name}")
+        print("=" * 60)
+
+        workspace = test_base / benchmark_name
+        result = run_benchmark_simple(
+            workspace_path=str(workspace),
+            model_path_in_docker=model_path_in_docker,
+            benchmark_name=benchmark_name,
+            gpu_count=GPU_COUNT,
+            limit=LIMIT,
+            max_error_samples=5,
+        )
+
+        error_samples = result.get("error_samples", [])
+        benchmark_results = result.get("benchmark_results", {})
+
+        print(f"  Results: {benchmark_results}")
+        print(f"  Error samples: {len(error_samples)}")
+        if error_samples:
+            print(f"  First error: {error_samples[0]}")
+
+        results_summary[benchmark_name] = {
+            "error_count": len(error_samples),
+            "benchmark_results": benchmark_results,
+        }
+
+    # Print the summary
+    print("\n" + "=" * 60)
+    print("SUMMARY")
+    print("=" * 60)
+    for name, info in results_summary.items():
+        results = info["benchmark_results"]
+        print(f"\n{name}:")
+        print(f"  Error count: {info['error_count']}")
+        for dataset, metrics in results.items():
+            print(f"  {dataset}: {metrics}")
diff --git a/test/oai/test_llm_connectivity.py b/test/oai/test_llm_connectivity.py
new file mode 100644
index 000000000..49d6653ed
--- /dev/null
+++ b/test/oai/test_llm_connectivity.py
@@ -0,0 +1,50 @@
+#!/usr/bin/env python3
+"""Test LLM connectivity for multiple models in parallel."""
+import concurrent.futures
+import os
+
+os.environ["OPENAI_API_KEY"] = "sk-1234"
+os.environ["OPENAI_API_BASE"] = "http://localhost:4000"
+
+import litellm
+
+litellm.suppress_debug_info = True
+from litellm import completion
+
+TIMEOUT = 30
+
+MODELS = [
+    "gpt-5",
+    "gpt-5.1",
+    "gpt-5.2",
+    "openai/gpt-5.1-chat",
+    "openai/gpt-5.2-chat",
+    "gpt-4o-mini",
+    "o3",
+    "o4-mini",
+    "gpt-5-mini",
+    "gpt-5-nano",
+    "gpt-4.1",
+    "gpt-4o",
+]
+
+
+def check_model(model: str) -> tuple:
+    try:
+        resp = completion(
+            model=model,
+            messages=[{"role": "user", "content": "Who is the president of the United States?"}],
+            drop_params=True,
+            timeout=TIMEOUT,
+        )
+        return (model, True, resp.choices[0].message.content)
+    except Exception as e:
+        return (model, False, str(e))
+
+
+if __name__ == "__main__":
+    print(f"Testing {len(MODELS)} model(s)...\n")
+    with concurrent.futures.ThreadPoolExecutor(max_workers=len(MODELS)) as ex:
+        for model, ok, msg in ex.map(check_model, MODELS):
+            status = "OK" if ok else "FAIL"
+            print(f"[{status}] {model}: {msg}")
diff --git a/test/rl/__init__.py b/test/rl/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/test/rl/test_example_workspace.py b/test/rl/test_example_workspace.py
new file mode 100644
index 000000000..d91c1cce5
--- /dev/null
+++ b/test/rl/test_example_workspace.py
@@ -0,0 +1,25 @@
+import pytest
+from pathlib import Path
+from rdagent.utils.env import RLDockerEnv
+from rdagent.scenarios.rl.eval.autorl_bench.env.workspace import RLWorkspace
+
+def test_example_workspace():
+    # 1. Create an RLDockerEnv
+    env = RLDockerEnv()
+    env.prepare() # build the docker image
+
+    # 2. Create an RLWorkspace
+    workspace = RLWorkspace()
+
+    # 3. Inject the code from rdagent/scenarios/rl/eval/autorl_bench/example_workspace into the workspace
+    # Previous code: example_workspace_path = Path(__file__).parent.parent.parent / "rdagent" / "scenarios" / "rl" / "eval" / "example_workspace"
+    example_workspace_path = Path(__file__).parent.parent.parent / "rdagent" / "scenarios" / "rl" / "eval" / "autorl_bench" / "example_workspace"
+    workspace.inject_code_from_folder(example_workspace_path)
+
+    # 4. Run the workspace in the Docker environment
+    result = workspace.run(env, "python main.py")
+
+    # 5. Assert that the run was successful and the model file exists
+    assert result.exit_code == 0
+    model_file_path = workspace.workspace_path / "ppo_cartpole.zip"
+    assert model_file_path.exists()
diff --git a/test/utils/coder/test_finetune_coder.py b/test/utils/coder/test_finetune_coder.py
new file mode 100644
index 000000000..1bec18815
--- /dev/null
+++ b/test/utils/coder/test_finetune_coder.py
@@ -0,0 +1,31 @@
+from rdagent.components.coder.finetune import LLMFinetuneCoSTEER
+from rdagent.components.coder.finetune.exp import FTTask
+from rdagent.scenarios.finetune.experiment.experiment import FTExperiment
+from rdagent.scenarios.finetune.scen.scenario import LLMFinetuneScen
+
+desc = "Data loading and preparation:\n- Load LIMO-v2/limo-v2.jsonl and s1K-1.1/data/train-00000-of-00001.parquet.\n- For s1K, treat the following fields as primary: question, deepseek_thinking_trajectory, deepseek_attempt, deepseek_grade, solution, metadata (parse Year if present).\n- For LIMO, treat: question, solution (the step-by-step), answer (final). \n\nFiltering and decontamination:\n- s1K correctness filter: keep only rows where deepseek_grade == 'Yes'.\n- s1K benchmark decontamination: if metadata contains Year and Year >= 2023, drop the sample.\n- Answer-consistency checks:\n  - For s1K retained rows: extract a final numeric/string answer from deepseek_attempt. If s1K solution is numeric, ensure it matches (string-equal after normalization). If solution is non-numeric prose, trust deepseek_grade. Drop mismatches.\n  - For LIMO: ensure the final tokenized answer in ‘solution’ ends with the ‘answer’ field value (normalize spaces/LaTeX formatting). If mismatch, drop.\n- Length/quality screening:\n  - Drop samples where the reasoning text (solution for LIMO; deepseek_thinking_trajectory for s1K) is too short (< 60 words) or excessively long (> 2500 words) or incoherent (gpt-4o-mini coherence score < 3/5).\n- Structural health check:\n  - Use gpt-4o to score each reasoning trace on a 1–5 scale for step progression, local verification, clarity, and absence of major leaps. Keep samples with score >= 3.5.\n- Deduplication:\n  - Normalize questions (strip whitespace, unify punctuation, lowercase except LaTeX, remove redundant spaces).\n  - Apply 13-gram overlap dedup across combined set; drop one from pairs with overlap >= 0.8.\n  - Apply embedding-based dedup on normalized questions; drop pairs with cosine similarity >= 0.92.\n\nTopic classification and difficulty tagging:\n- Topic: Use gpt-4o-mini to classify each question into algebra, geometry, number theory, combinatorics, probability, or other. 
Store tag for balancing.\n- Difficulty: Use gpt-4o-mini to attempt each question with pass@3. If any attempt hits the correct final answer (as per previous extraction), tag as easy; else medium/hard. Aim for final sampling proportions: 30% easy, 50% medium, 20% hard. If topic or difficulty buckets are imbalanced, downsample overrepresented buckets and upsample (preferentially keep highest structural scores) underrepresented ones.\n\nLong-short CoT mixture creation:\n- Long-CoT split:\n  - For LIMO: output text = original solution cleaned, ensure it ends with a line “Final Answer: {answer}”.\n  - For s1K: output text = deepseek_thinking_trajectory cleaned, followed by a final line “Final Answer: {extracted_answer}”.\n  - For both, set input = “Provide a detailed derivation.”\n- Short-CoT split creation (for ~70% of retained long samples, stratified by topic and difficulty):\n  - Use gpt-4o-mini to compress each long solution into 5–7 ordered steps focusing on key inferences.\n  - Append a final ‘Check’ step that verifies the final result (e.g., substitution, modular check, dimensional consistency) and concludes.\n  - Ensure final line “Final Answer: {same_answer_as_long}”.\n  - Set input = “Provide a concise 5–7 step solution and include a check.”\n\nFormat normalization (Alpaca):\n- Convert all items (long and short) into Alpaca schema:\n  - instruction: the problem statement (question) with LaTeX preserved.\n  - input: guidance string as above.\n  - output: curated reasoning trace ending with “Final Answer: X”.\n- Sanitize artifacts:\n  - Remove extraneous headers, markdown footers, and unintended HTML.\n  - Preserve LaTeX math blocks and escape sequences.\n  - Standardize the final answer line exactly as “Final Answer: {answer}”.\n\nFinal assembly and splits:\n- Create two JSONL files in an output folder:\n  - processed/train-long.jsonl: all retained long-CoT items (~1200–1300 expected).\n  - processed/train-short.jsonl: compressed short-CoT items (~800–900 
expected).\n- Ensure each item includes topic and difficulty tags in an auxiliary field if supported; if not, keep a separate CSV index mapping IDs to topic/difficulty for future sampling.\n- Keep a 5% random holdout (stratified by topic/difficulty) in separate files: processed/holdout-long.jsonl and processed/holdout-short.jsonl, excluded from training.\n\nQuality report and expected counts:\n- After each major step (correctness filter, decontamination, dedup, structural filter), log retained counts and proportions by topic and difficulty.\n- Target final total items: ~2000–2200 combined (long + short), with topic balance approx ~20% per major category and difficulty balance 30/50/20 (easy/medium/hard). If deviations exceed ±8 percentage points, adjust by sampling from the highest structural-scored items in underrepresented buckets.\n\nLLM endpoints usage:\n- gpt-4o: structural scoring (1–5), and occasional ambiguous answer-extraction resolution.\n- gpt-4o-mini: topic classification, coherence scoring (1–5), difficulty tagging via pass@3 attempts, and short-CoT compression/summarization.\n\nDeliverables:\n- Alpaca-formatted training files: processed/train-long.jsonl, processed/train-short.jsonl.\n- Stratified holdouts: processed/holdout-long.jsonl, processed/holdout-short.jsonl.\n- A summary report (JSON) capturing per-step counts, topic/difficulty distributions, and dedup stats."
+
+
+def develop_one_competition():
+    # Initialize scenario and coder
+    scen = LLMFinetuneScen()
+    ft_coder = LLMFinetuneCoSTEER(scen)
+
+    # Create the fine-tuning task with the actual data description and target benchmark
+    task = FTTask(
+        base_model="Qwen/Qwen3-1.7B",
+        description=desc,
+        benchmark="aime25",
+    )
+
+    exp = FTExperiment(sub_tasks=[task])
+
+    # # Injecting the corresponding specification
+    # exp.experiment_workspace.inject_files(**{"spec/ensemble.md": ensemble_spec})
+
+    # Develop the experiment
+    exp = ft_coder.develop(exp)
+
+
+if __name__ == "__main__":
+    develop_one_competition()