feat(benchmark): separate judge VLM from pipeline VLM by dippatel1994 · Pull Request #253 · llmsresearch/paperbanana

dippatel1994 · 2026-06-12T06:46:47Z

What

Adds the ability to run the benchmark judge on a different model than the generation pipeline:

- New CLI flags --judge-provider / --judge-model on paperbanana benchmark.
- New settings judge_vlm_provider / judge_vlm_model (env: JUDGE_VLM_PROVIDER / JUDGE_VLM_MODEL, plus judge.* YAML keys).
- ProviderRegistry.create_vlm() gains a model_override param that takes precedence over provider-specific and generic model fields, without mutating the passed settings.
- The benchmark judge factory swaps provider/model when configured; pipeline settings are untouched.
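
The precedence described for model_override can be sketched roughly as follows. This is an illustration only, not the code in this PR: the Settings dataclass, its vlm_model and provider_models fields, and the resolve_vlm_model helper are hypothetical stand-ins, and the model names are placeholders. Only the name model_override comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the real settings object."""

    vlm_provider: str = "provider-a"
    # Generic model field (assumed name).
    vlm_model: str = "generic-model"
    # Provider-specific model fields, keyed by provider (assumed shape).
    provider_models: Dict[str, str] = field(default_factory=dict)


def resolve_vlm_model(settings: Settings, model_override: Optional[str] = None) -> str:
    """Pick a model name: override first, then provider-specific, then generic.

    The settings object is only read, never modified.
    """
    if model_override is not None:
        return model_override
    specific = settings.provider_models.get(settings.vlm_provider)
    if specific is not None:
        return specific
    return settings.vlm_model


settings = Settings(provider_models={"provider-a": "provider-specific-model"})
print(resolve_vlm_model(settings))                      # provider-specific-model
print(resolve_vlm_model(settings, "judge-only-model"))  # judge-only-model
```

Passing the override as an argument, rather than writing it into the settings, is what keeps the pipeline configuration unchanged when the judge asks for a different model.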

Why

The judge and the generation pipeline previously shared one VLM, so --vlm-provider changed both. Separating them lets you pair a cheap pipeline model with a stronger, independent judge (and keeps the judge fixed when comparing pipeline variants).

Tests

New coverage in test_benchmark.py (judge defaults to pipeline VLM; uses judge model/provider override; pipeline settings unchanged) and test_features.py (model_override precedence, no settings mutation). Full affected suites green (66 passed).

Backwards compatible: both new settings default to None, preserving existing behavior.
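
One way to picture the None defaults is the sketch below, which is not the PR's implementation. The judge_vlm_provider and judge_vlm_model names come from this PR; the Settings dataclass, the judge_vlm_choice helper, and the placeholder provider and model names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in holding only the fields relevant here."""

    vlm_provider: str
    judge_vlm_provider: Optional[str] = None
    judge_vlm_model: Optional[str] = None


def judge_vlm_choice(settings: Settings) -> Tuple[str, Optional[str]]:
    """Return (provider, model_override) for the judge.

    With both judge fields left at None, the judge uses the pipeline provider
    and no model override, which matches the pre-existing behavior.
    """
    provider = settings.judge_vlm_provider or settings.vlm_provider
    return provider, settings.judge_vlm_model


default = Settings(vlm_provider="pipeline-provider")
custom = Settings(
    vlm_provider="pipeline-provider",
    judge_vlm_provider="judge-provider",
    judge_vlm_model="judge-model",
)
print(judge_vlm_choice(default))  # ('pipeline-provider', None)
print(judge_vlm_choice(custom))   # ('judge-provider', 'judge-model')
```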

Add --judge-provider/--judge-model (JUDGE_VLM_PROVIDER/JUDGE_VLM_MODEL) so the benchmark judge can use a different, stronger model than the generation pipeline. create_vlm() gains a model_override param that takes precedence over provider-specific and generic model fields. The benchmark judge factory swaps provider/model when configured, leaving pipeline settings untouched. Covered by tests for fallback and override behavior.

dippatel1994 wants to merge 1 commit into main from feat/judge-model-separation