
feat(benchmark): separate judge VLM from pipeline VLM #253

Open
dippatel1994 wants to merge 1 commit into main from feat/judge-model-separation


@dippatel1994

Member

What

Adds the ability to run the benchmark judge on a different model than the generation pipeline:

  • New CLI flags --judge-provider / --judge-model on paperbanana benchmark.
  • New settings judge_vlm_provider / judge_vlm_model (env: JUDGE_VLM_PROVIDER / JUDGE_VLM_MODEL, plus judge.* YAML keys).
  • ProviderRegistry.create_vlm() gains a model_override param that takes precedence over provider-specific and generic model fields, without mutating the passed settings.
  • The benchmark judge factory swaps provider/model when configured; pipeline settings are untouched.
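The precedence described for model_override can be sketched roughly as below. This is only an illustration under assumptions, not the code in this PR: SketchSettings, its field names, resolve_vlm_model, and the placeholder model names are all hypothetical, and the relative order of the provider-specific and generic fields (provider-specific first) is assumed, since the description only states that the override beats both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: any attempt to mutate the settings raises an error
class SketchSettings:
    """Hypothetical, simplified stand-in for the project's settings object."""
    vlm_provider: str = "provider_a"
    vlm_model: Optional[str] = None         # generic model field (assumed name)
    provider_a_model: Optional[str] = None  # provider-specific field (assumed name)


def resolve_vlm_model(
    settings: SketchSettings, model_override: Optional[str] = None
) -> Optional[str]:
    """Return the model name to use; settings are only read, never written."""
    if model_override is not None:
        # The explicit override wins over every field on settings.
        return model_override
    # Assumption: the provider-specific field beats the generic one.
    provider_specific = getattr(settings, f"{settings.vlm_provider}_model", None)
    return provider_specific or settings.vlm_model


settings = SketchSettings(vlm_model="generic-model", provider_a_model="specific-model")
print(resolve_vlm_model(settings))                 # no override: falls back to settings
print(resolve_vlm_model(settings, "judge-model"))  # override takes precedence
```

Passing the override as an argument, rather than copying it into the settings, is what lets the pipeline keep its own configuration while the judge resolves a different model.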

Why

The judge and the generation pipeline previously shared one VLM, so --vlm-provider changed both. Separating them lets you pair a cheap pipeline model with a stronger, independent judge (and keeps the judge fixed when comparing pipeline variants).

Tests

New coverage in test_benchmark.py (the judge defaults to the pipeline VLM, the judge uses the provider/model override when set, and pipeline settings stay unchanged) and in test_features.py (model_override precedence, no settings mutation). All affected suites pass (66 passed).

Backwards compatible: both new settings default to None, preserving existing behavior.
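The fallback behavior can be pictured with a small sketch. It assumes a simplified settings object; SketchSettings, judge_vlm_choice, and the placeholder provider and model names are hypothetical, and only the judge_vlm_provider / judge_vlm_model field names come from this PR's description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical stand-in; only the judge_* field names come from the PR."""
    vlm_provider: str = "pipeline-provider"
    judge_vlm_provider: Optional[str] = None  # None: reuse the pipeline provider
    judge_vlm_model: Optional[str] = None     # None: no model override


def judge_vlm_choice(settings: SketchSettings) -> Tuple[str, Optional[str]]:
    """Return (provider, model_override) for the judge without changing settings."""
    provider = settings.judge_vlm_provider or settings.vlm_provider
    # A None override means "resolve the model the same way as before".
    return provider, settings.judge_vlm_model


default = SketchSettings()
custom = SketchSettings(judge_vlm_provider="judge-provider", judge_vlm_model="judge-model")
print(judge_vlm_choice(default))  # judge follows the pipeline VLM
print(judge_vlm_choice(custom))   # judge uses its own provider and model
```

With both judge fields left at None, the judge resolves to the same provider as the pipeline and applies no override, which is the pre-existing behavior.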

Add --judge-provider/--judge-model (JUDGE_VLM_PROVIDER/JUDGE_VLM_MODEL)
so the benchmark judge can use a different, stronger model than the
generation pipeline. create_vlm() gains a model_override param that takes
precedence over provider-specific and generic model fields. The benchmark
judge factory swaps provider/model when configured, leaving pipeline
settings untouched. Covered by tests for fallback and override behavior.