Standalone benchmark for evaluating A2UI JSON generation.
This repository is organized around a JSON-first evaluation path: the main benchmark reads model JSON outputs and computes L1/L2/L3 scores without requiring a render service. Render- and VLM-based visual checks are available as an optional extension, not the default workflow.
Core path:
- Generate or reuse task JSONs
- Run evaluate_api_model.py
- Score L1/L2/L3 from JSON outputs
Optional extension:
- Start render/
- Run visual_eval.py or visual_compare_models.py
- Add VLM-based visual scoring on top of the JSON benchmark
Repository layout:
- evaluate_api_model.py: main JSON-based L1/L2/L3 evaluator for API models.
- prepare_eval_split.py: build a fixed-size eval split from the bundled source tasks.
- run_benchmark.sh: default JSON-first pipeline. Visual stage stays off unless explicitly enabled.
- data/eval_300/: bundled 300-task benchmark split.
- data/source/: bundled source task files used for resampling.
- visual_eval.py: optional render + screenshot + VLM-based visual scoring.
- visual_compare_models.py: optional cross-model visual comparison.
- render/: optional bundled renderer project.
- vendor/a2ui_demo/: bundled A2UI lint/schema assets required by the evaluator.
- render/vendor/a2ui/renderers/: bundled local renderer packages required only by render/.
Core JSON benchmark only:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If you also want visual evaluation:
pip install -r requirements-visual.txt
Create local env file:
cp .env.example .env
Set at least:
- OPENAI_API_KEY
- OPENAI_BASE_URL, if you are not using the default OpenRouter-compatible endpoint
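Before starting a long run, it can help to confirm that the required setting is actually visible to the process. The following is a minimal sketch, not part of the repository: `missing_credentials` is a hypothetical helper, and it assumes the values have been exported into the environment (the benchmark scripts may load .env in a different way).

```python
import os
from typing import List, Mapping, Optional

# Hypothetical helper, not shipped with the benchmark.
# OPENAI_BASE_URL is deliberately absent: the README treats it as optional.
REQUIRED_VARS = ("OPENAI_API_KEY",)


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_credentials()` with no argument inspects the current environment; an empty list means the required key is present.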
Run the default JSON evaluation pipeline:
bash run_benchmark.sh
By default this will:
- Build a sampled eval set from bundled ./data/source.
- Run JSON-based API evaluation.
- Compute L1/L2/L3 outputs into ./results.
- Skip render/VLM entirely unless ENABLE_VISUAL_EVAL=1.
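The default behaviour listed above amounts to a fixed JSON path plus one opt-in stage. A rough Python sketch of that gating follows. The stage labels and the `planned_stages` function are illustrative assumptions, and the exact test that run_benchmark.sh applies to ENABLE_VISUAL_EVAL is not shown in this README.

```python
from typing import List, Mapping

# Stage labels are illustrative; they are not identifiers from run_benchmark.sh.
JSON_STAGES = ["build_eval_split", "json_api_evaluation", "l1_l2_l3_scoring"]


def planned_stages(env: Mapping[str, str]) -> List[str]:
    """List the stages that would run for a given environment mapping."""
    stages = list(JSON_STAGES)
    # Assumption: only the literal value "1" switches the visual stage on.
    if env.get("ENABLE_VISUAL_EVAL", "") == "1":
        stages.append("visual_eval")
    return stages
```

With an empty environment the function returns only the three JSON stages, which matches the documented default of skipping render/VLM.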
If you want to use the bundled fixed split directly:
python -u evaluate_api_model.py \
--task-dir ./data/eval_300 \
--sources annomi esconv multiwoz sgd \
--models openai/gpt-4o-mini \
--judge-model openai/gpt-5.1 \
--max-per-scenario 0 \
--seed 42 \
--prompt-mode minimal \
--output-dir ./results \
--model-concurrency 8 \
--judge-concurrency 8
Only use the optional visual stage if you specifically want render- and VLM-based visual checks in addition to the JSON benchmark.
Install optional Python deps first:
pip install -r requirements-visual.txt
Start the bundled renderer in another terminal:
cd render
npm install
npm run dev -- --host 127.0.0.1 --port 5173
Sanity check:
curl -I http://127.0.0.1:5173/
Then enable the optional stage:
ENABLE_VISUAL_EVAL=1 bash run_benchmark.sh
Or run visual comparison directly:
python visual_compare_models.py \
--results-dir ./results \
--model-slugs openai__gpt-4o-mini \
--render-url http://127.0.0.1:5173/ \
--vlm-model moonshotai/kimi-k2.5 \
--max-workers 2 \
--output-dir ./results/visual_compare
Environment variables:
- TASK_SOURCE_DIR: source task directory, default ./data/source
- EVAL_SPLIT_DIR: sampled/fixed task directory, default ./data/eval_300
- RESULTS_DIR: JSON evaluation outputs, default ./results
- MODEL_LIST: space-separated API model list, default openai/gpt-4o-mini
- ENABLE_VISUAL_EVAL: set to 1 only when you want the optional visual stage
- VISUAL_MODEL_SLUGS: space-separated result folder slugs for visual comparison, default openai__gpt-4o-mini
- RENDER_URL: visual renderer URL, default http://127.0.0.1:5173/
- MODEL_CONCURRENCY, JUDGE_CONCURRENCY, VISUAL_CONCURRENCY: concurrency controls
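For scripting around the benchmark, the documented defaults can be mirrored in Python. This is a sketch under stated assumptions: `resolve_config`, `model_slug`, and `renderer_listening` are hypothetical helpers, not repository code. The slug rule (replace "/" with "__") is inferred only from the two default values above. The port probe is coarser than the curl check, since it only confirms that something accepts TCP connections at the RENDER_URL host and port.

```python
import socket
from typing import Dict, Mapping
from urllib.parse import urlsplit

# Defaults copied from the variable list in this README.
DEFAULTS: Dict[str, str] = {
    "TASK_SOURCE_DIR": "./data/source",
    "EVAL_SPLIT_DIR": "./data/eval_300",
    "RESULTS_DIR": "./results",
    "MODEL_LIST": "openai/gpt-4o-mini",
    "VISUAL_MODEL_SLUGS": "openai__gpt-4o-mini",
    "RENDER_URL": "http://127.0.0.1:5173/",
}


def resolve_config(env: Mapping[str, str]) -> Dict[str, str]:
    """Overlay non-empty environment values on the documented defaults."""
    return {key: env.get(key) or default for key, default in DEFAULTS.items()}


def model_slug(model_id: str) -> str:
    """Assumed naming rule: 'openai/gpt-4o-mini' -> 'openai__gpt-4o-mini'."""
    return model_id.replace("/", "__")


def renderer_listening(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False
```

Under the slug assumption, applying `model_slug` to each entry of the resolved MODEL_LIST gives candidate values for VISUAL_MODEL_SLUGS. Check them against the actual folder names in the results directory before relying on them.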
Notes:
- The intended open-source default is JSON L1-L3 evaluation, not render/VLM.
- The repo bundles the validator/schema subset required by the evaluator under vendor/a2ui_demo/.
- The repo bundles local renderer package dependencies under render/vendor/a2ui/renderers/ so render/ can be installed independently when needed.
- Do not commit real API keys. .env is ignored.