feat: cross-loader tool-calling toolkit + transformers wiring #51
Adds modelship.openai.tool_calling, a small package that turns raw
chat-completion text into OpenAI-shape tool_calls. Loaders whose engines
already emit structured calls (vLLM, llama.cpp via a function-calling
chat handler) keep their native path; loaders that emit raw text
(Transformers today, plugin-wrapped engines later) call into the toolkit.
Includes:
- ToolCallParser ABC + ParsedToolCalls result type
- Hermes-style <tool_call>{...}</tool_call> parser (Hermes-2, Qwen2.5,
many community fine-tunes)
- name -> parser registry with register_parser hook for plugin code
- resolve_tools_for_request applying OpenAI tool_choice semantics
(none / auto / required / specific function)
Wires the Transformers chat path to it: when tools are active,
pre-renders the prompt via apply_chat_template(tools=...) and parses
output through the configured parser, setting finish_reason="tool_calls"
and populating ChatMessage.tool_calls. Streaming buffers tokens while
tools are active and emits a single resolved delta at the end so we
never stream a fragment of a tool-call marker as if it were prose.
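The non-streaming parse step can be illustrated with a small sketch, assuming Hermes-style markers; `parse_output` is a hypothetical name, not the shipped function.

```python
# Illustrative only: extract Hermes-style <tool_call>{...}</tool_call> blocks
# from raw model output and decide the OpenAI finish_reason.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_output(text: str):
    """Return (content, tool_calls, finish_reason) from raw completion text."""
    calls = []
    for i, m in enumerate(TOOL_CALL_RE.finditer(text)):
        payload = json.loads(m.group(1))
        calls.append({
            "index": i,
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI carries arguments as a JSON-encoded string
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    # Content is whatever prose remains outside the tool-call regions.
    content = TOOL_CALL_RE.sub("", text).strip() or None
    finish_reason = "tool_calls" if calls else "stop"
    return content, calls, finish_reason
```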
Also fixes ChatCompletionRequest.tool_choice default from "none" to
None: per the OpenAI spec, "auto" is the default when tools are
present. The previous default suppressed tools whenever a client
omitted tool_choice, including via the llama.cpp passthrough.
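The tool_choice semantics described above amount to roughly the following; `resolve_tools` is a simplified stand-in for the real resolve_tools_for_request.

```python
# Hedged sketch of OpenAI tool_choice resolution; the shipped function's
# signature and return shape may differ.
def resolve_tools(tools, tool_choice):
    """Return (active_tools, forced_name) for a chat-completion request.

    Per the OpenAI spec, tool_choice defaults to "auto" when tools are
    present, so a None value must NOT suppress tools.
    """
    if not tools or tool_choice == "none":
        return [], None
    if tool_choice in (None, "auto", "required"):
        return tools, None
    # Specific function: {"type": "function", "function": {"name": ...}}
    name = tool_choice["function"]["name"]
    return [t for t in tools if t["function"]["name"] == name], name
```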
Tests:
- 28 unit tests covering parser shape, registry behavior, tool_choice
resolution, and the serving_chat tool path against a faked HF pipeline
- Integration test deploying Qwen/Qwen2.5-0.5B-Instruct via the
transformers loader and round-tripping a get_weather tool call
Code Review
This pull request introduces a cross-loader tool-calling toolkit designed to enable tool-use support for loaders that emit raw text, specifically the Transformers loader. It adds a registry for model-family-specific parsers, starting with a Hermes-style XML parser, and implements logic to resolve OpenAI-style tool_choice semantics. The chat completion flow in serving_chat.py is updated to handle tool resolution, parsing, and a buffering mechanism for streaming when tools are active. Feedback suggests optimizing the streaming experience by buffering only potential tool-call tags instead of the entire response and moving local imports to the top of the file for consistency.
Replaces the buffer-until-done streaming path with a vLLM-style stateful diff loop so the client receives content tokens and tool-call argument fragments as fast as the model emits them, instead of seeing nothing until generation finishes. ToolCallParser is reshaped around three knobs per family: `start_marker` / `end_marker` and two extractors, `extract_partial_name` and `extract_partial_args`. A new `ToolCallStreamer` instance is created per request and holds the high-water marks `_sent_content_idx` / `_sent_name[i]` / `_sent_args[i]`. On each `extract_streaming(current_text)` call it re-derives the content stream view (text with tool-call regions excised) and per-block fragments, then diffs against state and returns a `DeltaMessage | None` carrying just the new bytes.

Tests:
- 9 new TestToolCallStreamer cases covering: pure-content streaming; a marker-prefix tail held back until disambiguated or finalized; name emitted before args; args streamed incrementally across many small chunks (concatenated they form valid JSON); multiple tool calls getting distinct indices; content resuming after a tool call; a partial name held until its closing quote; and an unterminated block not crashing on finalize.
- Two existing parser tests updated where vLLM-style semantics differ from the old block-level parser (raw-bytes args passthrough; blocks with no extractable name silently dropped).
- Integration: `test_tool_calling_streaming_transformers_loader` and `test_tool_calling_streaming_vllm_loader` exercise streaming + tool calling end to end through the gateway. The transformers test asserts that the function name arrives in exactly one delta, that arguments arrive in >= 2 deltas (the key invariant proving the diff loop is actually diffing rather than buffering), and that the rebuilt args parse as JSON.
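The high-water-mark diff can be sketched in miniature as below. This is an assumption-laden toy: it keeps only the content index and per-block argument offsets, and deliberately omits the marker-prefix hold-back and name extraction the real streamer performs.

```python
# Toy illustration of the stateful diff loop: each call receives the full
# accumulated text and returns only the bytes not yet sent.
import re


class ToolCallStreamer:
    # Matches a tool-call body, terminated or still open at end of text.
    BLOCK_RE = re.compile(r"<tool_call>(.*?)(?:</tool_call>|\Z)", re.DOTALL)

    def __init__(self):
        self._sent_content_idx = 0
        self._sent_args: dict[int, int] = {}  # block index -> arg bytes sent

    def extract_streaming(self, current_text: str):
        delta = {}
        # Content view: prose before the first tool-call marker.
        marker = current_text.find("<tool_call>")
        content_view = current_text if marker < 0 else current_text[:marker]
        if len(content_view) > self._sent_content_idx:
            delta["content"] = content_view[self._sent_content_idx:]
            self._sent_content_idx = len(content_view)
        # Argument fragments: new bytes inside each block since the last call.
        for i, m in enumerate(self.BLOCK_RE.finditer(current_text)):
            body = m.group(1)
            sent = self._sent_args.get(i, 0)
            if len(body) > sent:
                delta.setdefault("tool_calls", []).append(
                    {"index": i, "fragment": body[sent:]})
                self._sent_args[i] = len(body)
        return delta or None
```

Concatenating the fragments emitted for one index reconstructs that call's raw argument bytes, which is the invariant the streaming tests check.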
/gemini review
Code Review
This pull request introduces a cross-loader tool-calling toolkit, enabling tool-calling support for the transformers loader via marker-based parsing. It includes a new ToolCallStreamer for incremental streaming, a registry for parsers, and extensive testing. Feedback focuses on improving the robustness of the tool-call finalization logic to handle malformed outputs, ensuring consistency of the created timestamp in streaming responses, and removing unused class attributes.
Replaces the fragile index-equals-length condition for finalizing tool-call blocks with a dedicated `_finalized_indices` set. This ensures valid blocks are correctly finalized even if preceding blocks are malformed and skipped by the streaming parser. Also allows the parser to continue processing subsequent blocks when a malformed complete block (missing a valid function name) is encountered.
The OpenAI specification requires the `created` timestamp to remain consistent across all chunks in a streaming response. Previously, the transformers loader recalculated the timestamp for each chunk using `int(time.time())` inside `_delta_chunk` and for the final finish/usage chunks. Now, the timestamp is calculated once at the start of `_stream` and explicitly passed to all chunk generation functions.
/gemini review
Code Review
This pull request introduces a cross-loader tool-calling toolkit designed to support models that emit structured markers (such as Hermes-style <tool_call> tags) but lack native engine support. The transformers loader is updated to handle tool rendering in prompts and output parsing for both standard and streaming completions. Feedback highlights a performance concern in the streaming implementation, where re-parsing the entire accumulated string on every token delta results in quadratic (O(N^2)) time complexity as the response grows.
Eliminates the O(N^2) complexity caused by calling `"".join(accumulated)` inside the chunk-by-chunk stream loop. We now maintain a running cumulative string `accumulated_str` built via fast appends, rather than allocating and joining the entire list of previously yielded tokens on every new token.
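The difference can be shown with a toy pair of accumulators. Joining the whole list on every token costs O(i) at step i, O(N^2) overall; incremental string appends are effectively amortized O(len(token)) per step in CPython (which resizes the string in place when its refcount allows).

```python
# Toy comparison of the two accumulation strategies discussed above.
def accumulate_quadratic(tokens):
    accumulated = []
    current = ""
    for tok in tokens:
        accumulated.append(tok)
        current = "".join(accumulated)  # re-joins everything each step: O(N^2) total
    return current


def accumulate_linear(tokens):
    accumulated_str = ""
    for tok in tokens:
        accumulated_str += tok  # fast append; no per-step re-join
    return accumulated_str
```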