feat: cross-loader tool-calling toolkit + transformers wiring #51
Adds modelship.openai.tool_calling, a small package that turns raw
chat-completion text into OpenAI-shape tool_calls. Loaders whose engines
already emit structured calls (vLLM, llama.cpp via a function-calling
chat handler) keep their native path; loaders that emit raw text
(Transformers today, plugin-wrapped engines later) call into the toolkit.
Includes:
- ToolCallParser ABC + ParsedToolCalls result type
- Hermes-style <tool_call>{...}</tool_call> parser (Hermes-2, Qwen2.5,
many community fine-tunes)
- name -> parser registry with register_parser hook for plugin code
- resolve_tools_for_request applying OpenAI tool_choice semantics
(none / auto / required / specific function)
Wires the Transformers chat path to it: when tools are active,
pre-renders the prompt via apply_chat_template(tools=...) and parses
output through the configured parser, setting finish_reason="tool_calls"
and populating ChatMessage.tool_calls. Streaming buffers tokens while
tools are active and emits a single resolved delta at the end so we
never stream a fragment of a tool-call marker as if it were prose.
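The non-streaming parse step can be illustrated with a small sketch, assuming Hermes-style markers; `parse_output` is a hypothetical name, not the shipped function.

```python
# Illustrative only: extract Hermes-style <tool_call>{...}</tool_call> blocks
# from raw model output and decide the OpenAI finish_reason.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_output(text: str):
    """Return (content, tool_calls, finish_reason) from raw completion text."""
    calls = []
    for i, m in enumerate(TOOL_CALL_RE.finditer(text)):
        payload = json.loads(m.group(1))
        calls.append({
            "index": i,
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI carries arguments as a JSON-encoded string
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    # Content is whatever prose remains outside the tool-call regions.
    content = TOOL_CALL_RE.sub("", text).strip() or None
    finish_reason = "tool_calls" if calls else "stop"
    return content, calls, finish_reason
```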
Also fixes ChatCompletionRequest.tool_choice default from "none" to
None: per the OpenAI spec, "auto" is the default when tools are
present. The previous default suppressed tools whenever a client
omitted tool_choice, including via the llama.cpp passthrough.
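The tool_choice semantics described above amount to roughly the following; `resolve_tools` is a simplified stand-in for the real resolve_tools_for_request.

```python
# Hedged sketch of OpenAI tool_choice resolution; the shipped function's
# signature and return shape may differ.
def resolve_tools(tools, tool_choice):
    """Return (active_tools, forced_name) for a chat-completion request.

    Per the OpenAI spec, tool_choice defaults to "auto" when tools are
    present, so a None value must NOT suppress tools.
    """
    if not tools or tool_choice == "none":
        return [], None
    if tool_choice in (None, "auto", "required"):
        return tools, None
    # Specific function: {"type": "function", "function": {"name": ...}}
    name = tool_choice["function"]["name"]
    return [t for t in tools if t["function"]["name"] == name], name
```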
Tests:
- 28 unit tests covering parser shape, registry behavior, tool_choice
resolution, and the serving_chat tool path against a faked HF pipeline
- Integration test deploying Qwen/Qwen2.5-0.5B-Instruct via the
transformers loader and round-tripping a get_weather tool call
Code Review
This pull request introduces a cross-loader tool-calling toolkit designed to enable tool-use support for loaders that emit raw text, specifically the Transformers loader. It adds a registry for model-family-specific parsers, starting with a Hermes-style XML parser, and implements logic to resolve OpenAI-style tool_choice semantics. The chat completion flow in serving_chat.py is updated to handle tool resolution, parsing, and a buffering mechanism for streaming when tools are active. Feedback suggests optimizing the streaming experience by buffering only potential tool-call tags instead of the entire response and moving local imports to the top of the file for consistency.
Replaces the buffer-until-done streaming path with a vLLM-style stateful diff loop so the client receives content tokens and tool-call argument fragments as fast as the model emits them, instead of seeing nothing until generation finishes. ToolCallParser is reshaped around three knobs per family: `start_marker` / `end_marker` and two extractors, `extract_partial_name` and `extract_partial_args`. A new `ToolCallStreamer` instance is created per request and holds the high-water marks `_sent_content_idx` / `_sent_name[i]` / `_sent_args[i]`. On each `extract_streaming(current_text)` call it re-derives the content stream view (text with tool-call regions excised) and per-block fragments, then diffs against state and returns a `DeltaMessage | None` carrying just the new bytes.

Tests:
- 9 new TestToolCallStreamer cases covering: pure-content streaming; a marker-prefix tail held back until disambiguated or finalized; name emitted before args; args streamed incrementally across many small chunks (concatenated they form valid JSON); multiple tool calls getting distinct indices; content resuming after a tool call; a partial name held until its closing quote; and an unterminated block not crashing on finalize.
- Two existing parser tests updated where vLLM-style semantics differ from the old block-level parser (raw-bytes args passthrough; blocks with no extractable name silently dropped).
- Integration: `test_tool_calling_streaming_transformers_loader` and `test_tool_calling_streaming_vllm_loader` exercise streaming + tool calling end to end through the gateway. The transformers test asserts that the function name arrives in exactly one delta, that arguments arrive in >= 2 deltas (the key invariant proving the diff loop is actually diffing rather than buffering), and that the rebuilt args parse as JSON.
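The high-water-mark diff can be sketched in miniature as below. This is an assumption-laden toy: it keeps only the content index and per-block argument offsets, and deliberately omits the marker-prefix hold-back and name extraction the real streamer performs.

```python
# Toy illustration of the stateful diff loop: each call receives the full
# accumulated text and returns only the bytes not yet sent.
import re


class ToolCallStreamer:
    # Matches a tool-call body, terminated or still open at end of text.
    BLOCK_RE = re.compile(r"<tool_call>(.*?)(?:</tool_call>|\Z)", re.DOTALL)

    def __init__(self):
        self._sent_content_idx = 0
        self._sent_args: dict[int, int] = {}  # block index -> arg bytes sent

    def extract_streaming(self, current_text: str):
        delta = {}
        # Content view: prose before the first tool-call marker.
        marker = current_text.find("<tool_call>")
        content_view = current_text if marker < 0 else current_text[:marker]
        if len(content_view) > self._sent_content_idx:
            delta["content"] = content_view[self._sent_content_idx:]
            self._sent_content_idx = len(content_view)
        # Argument fragments: new bytes inside each block since the last call.
        for i, m in enumerate(self.BLOCK_RE.finditer(current_text)):
            body = m.group(1)
            sent = self._sent_args.get(i, 0)
            if len(body) > sent:
                delta.setdefault("tool_calls", []).append(
                    {"index": i, "fragment": body[sent:]})
                self._sent_args[i] = len(body)
        return delta or None
```

Concatenating the fragments emitted for one index reconstructs that call's raw argument bytes, which is the invariant the streaming tests check.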
/gemini review
Code Review
This pull request introduces a cross-loader tool-calling toolkit, enabling tool-calling support for the transformers loader via marker-based parsing. It includes a new ToolCallStreamer for incremental streaming, a registry for parsers, and extensive testing. Feedback focuses on improving the robustness of the tool-call finalization logic to handle malformed outputs, ensuring consistency of the created timestamp in streaming responses, and removing unused class attributes.
Replaces the fragile index-equals-length condition for finalizing tool-call blocks with a dedicated `_finalized_indices` set. This ensures valid blocks are correctly finalized even if preceding blocks are malformed and skipped by the streaming parser. Also allows the parser to continue processing subsequent blocks when a malformed complete block (missing a valid function name) is encountered.
The OpenAI specification requires the `created` timestamp to remain consistent across all chunks in a streaming response. Previously, the transformers loader recalculated the timestamp for each chunk using `int(time.time())` inside `_delta_chunk` and for the final finish/usage chunks. Now, the timestamp is calculated once at the start of `_stream` and explicitly passed to all chunk generation functions.
/gemini review
Code Review
This pull request introduces a cross-loader tool-calling toolkit designed to support models that emit structured markers (such as Hermes-style <tool_call> tags) but lack native engine support. The transformers loader is updated to handle tool rendering in prompts and output parsing for both standard and streaming completions. Feedback highlights a performance concern in the streaming implementation, where re-parsing the entire accumulated string on every token delta results in quadratic (O(N^2)) time complexity as the response grows.
Eliminates the O(N^2) complexity caused by calling `"".join(accumulated)` inside the chunk-by-chunk stream loop. We now maintain a running cumulative string `accumulated_str` built via fast appends, rather than allocating and joining the entire list of previously yielded tokens on every new token.
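The difference can be shown with a toy pair of accumulators. Joining the whole list on every token costs O(i) at step i, O(N^2) overall; incremental string appends are effectively amortized O(len(token)) per step in CPython (which resizes the string in place when its refcount allows).

```python
# Toy comparison of the two accumulation strategies discussed above.
def accumulate_quadratic(tokens):
    accumulated = []
    current = ""
    for tok in tokens:
        accumulated.append(tok)
        current = "".join(accumulated)  # re-joins everything each step: O(N^2) total
    return current


def accumulate_linear(tokens):
    accumulated_str = ""
    for tok in tokens:
        accumulated_str += tok  # fast append; no per-step re-join
    return accumulated_str
```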