
[Feature] tool_calls and reasoning: Tracking and evaluation#3685

Draft
RawthiL wants to merge 3 commits into EleutherAI:main from pnyxai:feature-tool-calls-evaluation

Conversation

Contributor

RawthiL commented Apr 7, 2026

This PR adds support for tracking and filtering evaluation results based on model reasoning and tool call data.

Currently, the tool_calls and reasoning fields of generated answers are discarded. However, they can be useful for evaluating specific behaviors of language models when they are used with chat templates. For example, one may want to know whether a model uses the provided tools correctly and calls them with the correct format, without having to re-implement the whole parsing step (something normally handled by the backend). One might also be interested in observing (and evaluating) the reasoning trace of a thinking model in addition to its final response.

To make this possible, this PR contributes the following:

  • We add optional tool_calls and reasoning fields to the Instance class to track generations.
  • We add these fields to the result schema to keep it tidy.
  • We modify the evaluator (keeping it backward compatible) to accept a tuple (responses, tool_calls, reasoning) and log each part properly.
  • We include an example modification for the client we normally use, LocalChatCompletion in lm_eval/models/openai_completions.py. This can be extended to other models later.
  • Finally, we add a new method to the Filter(ABC) class: apply_wkwargs, a version of apply that accepts **kwargs and handles the optional tool_calls and reasoning parameters. Surfacing the traces here keeps the evaluation function tidy: one might want to do a simple exact-match evaluation on content coming from tool_calls, from the resps field, or from a combination of both. Exposing the traces at this point (via custom filters) gives users the most versatility, since they can prepare filtered_resps in a way that is compatible with existing comparison methods, or build custom structures to be handled by other custom evaluation methods. Using **kwargs instead of fixed parameters also lets the method accept new fields in the future without further changes (the original apply signature is untouched to keep everything backward compatible).
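The apply_wkwargs idea above can be sketched as follows. This is a minimal, hypothetical illustration based on the PR description, not the actual lm-eval code: the Filter base class is simplified, and ToolCallFilter is an invented example subclass.

```python
import json
from abc import ABC


class Filter(ABC):
    """Simplified sketch of a Filter base class (names assumed from the PR text)."""

    def apply(self, resps, docs):
        raise NotImplementedError

    def apply_wkwargs(self, resps, docs, **kwargs):
        # Default behavior: ignore extra fields such as tool_calls/reasoning
        # and fall back to the original apply(), preserving backward compatibility.
        return self.apply(resps, docs)


class ToolCallFilter(Filter):
    """Hypothetical filter that surfaces serialized tool calls instead of raw text."""

    def apply(self, resps, docs):
        return resps

    def apply_wkwargs(self, resps, docs, **kwargs):
        tool_calls = kwargs.get("tool_calls")
        if tool_calls is None:
            # No tool-call traces were provided: behave exactly like apply().
            return resps
        # Replace each response with a JSON view of the tool calls it produced.
        return [[json.dumps(call) for call in per_doc] for per_doc in tool_calls]
```

A filter that does not override apply_wkwargs keeps its old behavior, which is how the change stays backward compatible.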

With this PR it is possible to define a filter like:

filter_list:
  - name: "tool_call_extract"
    filter:
      - function: "custom"
        filter_wkwargs_fn: !function utils.build_predictions_call_with_tools

which must resolve to a function with a signature like:

def build_predictions_call_with_tools(
    resps: list[list[str]],
    docs: list[dict],
    tool_calls: list[list[dict]],
    reasoning: list[list[str]] | None = None,
    **kwargs,
) -> list[list[str]]:

This PR, together with #3684, is part of an effort to enable tool-usage and function-calling evaluation in lm-eval. We have modified the T-Eval dataset to create an lm-eval-compatible version, the T-Eval PNYX dataset, which includes several tasks for measuring function calls. The tasks are working and we would like to share them.

Contributor Author

RawthiL commented Apr 7, 2026

I prefer not to lint at this point, since the changes made by the linter would be larger than the actual changes introduced in this PR.

Happy to lint before merging, though.

Contributor Author

RawthiL commented Apr 7, 2026

I'm not sure whether the failing CPU test is caused by something introduced in this PR. I would appreciate some guidance here, as many other task-execution tests are passing.
