[Feature] tool_calls and reasoning: Tracking and evaluation #3685
Draft
RawthiL wants to merge 3 commits into EleutherAI:main from
Conversation
Contributor (Author)
I prefer not to lint at this point, since the changes made by the linter would be more extensive than the actual changes introduced in this PR. Happy to lint before merging, though.
Contributor (Author)
I'm not sure whether the failing CPU test is something introduced in this PR. I would appreciate some guidance here, as many other task-execution tests are passing.
This PR adds support for tracking and filtering evaluation results based on model reasoning and tool call data.
Currently, the `tool_calls` and `reasoning` fields of the generated answers are discarded. However, they can be useful for evaluating specific behaviors of language models when they are used with chat templates. For example, one may want to know whether a model uses the provided tools correctly and calls them in the correct format, without having to re-write the whole parsing step (something normally solved by the backend). Additionally, one might be interested in observing (and evaluating) the reasoning trace of a thinking model in addition to its final response. To make this possible, we contribute the following:
- Add the `tool_calls` and `reasoning` fields to the `Instance` class (as optional), to track generations.
- Surface these fields in `LocalChatCompletion` in `lm_eval/models/openai_completions.py`. This can be extended to other models later.
- Extend the `Filter(ABC)` class with a new method, `apply_wkwargs`: a version of `apply` that accepts `kwargs` and handles the optional `tool_calls` and `reasoning` parameters. The reason to surface the traces here is to keep the evaluation function tidy. One might want to do a simple `exact-match` evaluation with content coming from the `tool_calls` field, the `resps` field, or a combination of both. Surfacing the traces at this point (via custom filters) offers the most versatility to the user, who can prepare `filtered_resps` in a way that is compatible with existing comparison methods, or create custom structures that can be handled by other custom evaluation methods. Using `kwargs` instead of fixed parameters also allows the function to accept potentially new fields in the future without further modification (the basic `apply` function signature was not changed, to keep everything backward compatible).

With this PR it is possible to define a filter like:
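The code snippet from the PR description is not reproduced in this page capture. As a rough illustration, here is a minimal, self-contained sketch of what such a filter could look like; only the names `Filter`, `apply`, and `apply_wkwargs` come from the description, and everything else (the subclass name, the output structure) is hypothetical:

```python
from abc import ABC, abstractmethod


class Filter(ABC):
    """Sketch of the extended base class described in this PR."""

    @abstractmethod
    def apply(self, resps, docs):
        """Original signature, left untouched for backward compatibility."""
        ...

    def apply_wkwargs(self, resps, docs, **kwargs):
        """By default, ignore the extra fields and fall back to apply()."""
        return self.apply(resps, docs)


class ToolCallMatchFilter(Filter):
    """Hypothetical custom filter: pairs each response with its
    tool-call trace so a downstream exact-match metric can compare
    either field, or both."""

    def apply(self, resps, docs):
        # Without traces, behave like a pass-through filter.
        return resps

    def apply_wkwargs(self, resps, docs, **kwargs):
        tool_calls = kwargs.get("tool_calls")
        if tool_calls is None:
            return self.apply(resps, docs)
        # Build a custom structure combining responses and traces.
        return [
            {"resp": r, "tool_calls": t}
            for r, t in zip(resps, tool_calls)
        ]
```

The default `apply_wkwargs` falling back to `apply` is one way the change could stay backward compatible: existing filters keep working, and only filters that care about the new fields override the new method.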
That will respond to a signature like:
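The signature snippet is likewise missing from this capture. Under the same assumptions as above (hypothetical names; a trivial pass-through filter standing in for a real one), the call site could look like:

```python
class PassThroughFilter:
    """Hypothetical minimal filter used only to illustrate the call."""

    def apply(self, resps, docs):
        return resps

    def apply_wkwargs(self, resps, docs, **kwargs):
        # Optional fields such as tool_calls/reasoning arrive via kwargs.
        return self.apply(resps, docs)


resps = [["final answer"]]          # model responses, one list per instance
docs = [{"question": "2+2?"}]       # task documents
tool_calls = [[{"name": "calc"}]]   # optional per-instance tool-call traces
reasoning = [["first, add ..."]]    # optional per-instance reasoning traces

filtered_resps = PassThroughFilter().apply_wkwargs(
    resps, docs, tool_calls=tool_calls, reasoning=reasoning
)
print(filtered_resps)  # → [['final answer']]
```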
This PR, together with #3684, is part of an effort to enable tool-usage and function-calling evaluation in the `lm-eval` software. We have modified the T-Eval dataset and created an `lm-eval`-compatible one, the T-Eval PNYX dataset, which includes several tasks for measuring function calls. The tasks are working and we wish to share them.