
[Feature] tool_calls and reasoning: Tracking and evaluation#3685

Draft
RawthiL wants to merge 3 commits into EleutherAI:main from pnyxai:feature-tool-calls-evaluation

Conversation

Contributor

RawthiL commented Apr 7, 2026

This PR adds support for tracking and filtering evaluation results based on model reasoning and tool call data.

Currently, the tool_calls and reasoning fields of generated answers are discarded. However, they can be useful for evaluating specific behaviors of language models when they are used with chat templates. For example, one may want to know whether a model uses the provided tools correctly and calls them with the correct format, without having to re-implement the whole parsing step (something normally handled by the backend). One might also be interested in observing (and evaluating) the reasoning trace of a thinking model in addition to its final response.

To make this possible, this PR contributes the following:

  • We add optional tool_calls and reasoning fields to the Instance class to track generations.
  • We add these fields to the result schema to keep it tidy.
  • We modify the evaluator (keeping it backward compatible) to accept a tuple (responses, tool_calls, reasoning) and log each part properly.
  • We include an example modification for the client we normally use, LocalChatCompletion in lm_eval/models/openai_completions.py. This can be extended to other models later.
  • Finally, we add a new method to the Filter(ABC) class: apply_wkwargs, a version of apply that accepts **kwargs and handles the optional tool_calls and reasoning parameters. Surfacing the traces here keeps the evaluation function tidy: one might want to do a simple exact-match evaluation on content coming from tool_calls, from the resps field, or from a combination of both. Exposing the traces at this point (via custom filters) gives users the most versatility, since they can prepare filtered_resps in a way that is compatible with existing comparison methods, or build custom structures to be handled by other custom evaluation methods. Using **kwargs instead of fixed parameters also lets the method accept new fields in the future without further changes (the original apply signature is untouched to keep everything backward compatible).
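The apply_wkwargs idea above can be sketched as follows. This is a minimal, hypothetical illustration based on the PR description, not the actual lm-eval code: the Filter base class is simplified, and ToolCallFilter is an invented example subclass.

```python
import json
from abc import ABC


class Filter(ABC):
    """Simplified sketch of a Filter base class (names assumed from the PR text)."""

    def apply(self, resps, docs):
        raise NotImplementedError

    def apply_wkwargs(self, resps, docs, **kwargs):
        # Default behavior: ignore extra fields such as tool_calls/reasoning
        # and fall back to the original apply(), preserving backward compatibility.
        return self.apply(resps, docs)


class ToolCallFilter(Filter):
    """Hypothetical filter that surfaces serialized tool calls instead of raw text."""

    def apply(self, resps, docs):
        return resps

    def apply_wkwargs(self, resps, docs, **kwargs):
        tool_calls = kwargs.get("tool_calls")
        if tool_calls is None:
            # No tool-call traces were provided: behave exactly like apply().
            return resps
        # Replace each response with a JSON view of the tool calls it produced.
        return [[json.dumps(call) for call in per_doc] for per_doc in tool_calls]
```

A filter that does not override apply_wkwargs keeps its old behavior, which is how the change stays backward compatible.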

With this PR it is possible to define a filter like:

filter_list:
  - name: "tool_call_extract"
    filter:
      - function: "custom"
        filter_wkwargs_fn: !function utils.build_predictions_call_with_tools

which must resolve to a function with a signature like:

def build_predictions_call_with_tools(
    resps: list[list[str]],
    docs: list[dict],
    tool_calls: list[list[dict]],
    reasoning: list[list[str]] | None = None,
    **kwargs,
) -> list[list[str]]:

This PR, together with #3684, is part of an effort to enable tool-usage and function-calling evaluation in lm-eval. We have modified the T-Eval dataset to create an lm-eval-compatible version, the T-Eval PNYX dataset, which includes several tasks for measuring function calls. The tasks are working and we would like to share them.

Contributor Author

RawthiL commented Apr 7, 2026

I prefer not to lint at this point, since the changes made by the linter would be larger than the actual changes introduced in this PR.

Happy to lint before merging, though.

Contributor Author

RawthiL commented Apr 7, 2026

I'm not sure whether the failing CPU test is caused by something introduced in this PR. I would appreciate some guidance here, as many other task-execution tests are passing.
