fix: don't pass task stop sequences to vLLM for reasoning models#3700

Open
jwmacd wants to merge 1 commit into EleutherAI:main from jwmacd:fix/reasoning-stop-sequences

Conversation


@jwmacd jwmacd commented Apr 12, 2026

When think_end_token is set, task-level stop sequences like "\n\n" (the fewshot delimiter default) fire inside `<think>` blocks and truncate generation before any response is produced.

Two changes:

  1. vllm_causallms.py: When think_end_token is set, only pass EOS to vLLM's SamplingParams. Task stop sequences remain in the cached gen_kwargs for post-processing.
  2. utils.py: Reorder postprocess_generated_text to strip thinking content before applying stop sequences, so stops match the actual response rather than the reasoning trace.
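
Change 1 can be illustrated with a minimal sketch of the stop-selection logic (the function and parameter names here are illustrative, not the repo's actual identifiers):

```python
def select_vllm_stops(task_stops, eos, think_end_token=None):
    """Return the stop sequences to hand to vLLM's SamplingParams.

    Reasoning models emit a <think>...</think> trace before the visible
    response, so task-level stops like "\n\n" would fire inside the trace
    and truncate generation. In that case only EOS is passed to vLLM;
    the task stops are deferred to post-processing.
    """
    if think_end_token:
        return [eos]                   # defer task stops to post-processing
    return list(task_stops) + [eos]    # non-reasoning path unchanged
```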

Non-reasoning models are unaffected — the code path only diverges when think_end_token is set.
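
Change 2's reordering can be sketched with a hypothetical post-processing helper (the actual postprocess_generated_text in utils.py differs in detail; this only shows the strip-then-stop order):

```python
def postprocess_generated_text(text, stops, think_end_token=None):
    # 1) Strip the reasoning trace first, so stop sequences are matched
    #    against the visible response rather than the <think> block.
    if think_end_token and think_end_token in text:
        text = text.split(think_end_token, 1)[1].lstrip()
    # 2) Then truncate at task-level stop sequences.
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text
```

With the old order (stops applied first), the "\n\n" between two reasoning paragraphs would have truncated the whole output before the trace ever closed.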

Scope

This affects all generate_until tasks. 17 tasks in the repo don't specify an explicit until in their generation_kwargs and inherit the fewshot delimiter (typically "\n\n") as a stop sequence. Any reasoning model evaluated on these tasks may produce truncated or empty output without this fix.

Test results

Tested with Kimi-K2.5 (MoE, reasoning_parser=kimi_k2) on JSONSchemaBench (generate_until task, default until: ["\n\n"] from fewshot delimiter, max_model_len=65536, max_gen_toks=32768):

|                    | JS-Easy | JS-Medium | JS-Hard |
|--------------------|---------|-----------|---------|
| Before (unpatched) | 0.00    | 0.00      | 0.00    |
| After (patched)    | 0.99    | 0.96      | 0.88    |

Before: all 1,531 samples scored 0.0 on both json_validity and schema_compliance — generation was truncated inside `<think>` blocks before any JSON was produced. Total eval time 778s (near-immediate truncation per sample).

After: real scores, 34 min eval time with full thinking traces and JSON output.

@jwmacd jwmacd requested a review from 0xSMT as a code owner April 12, 2026 20:24

CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.
