
feat: cross-loader tool-calling toolkit + transformers wiring #51

Merged
alez007 merged 6 commits into main from feat/transformers-tool-calling on May 4, 2026
Conversation

@alez007 alez007 commented May 1, 2026


Adds modelship.openai.tool_calling, a small package that turns raw
chat-completion text into OpenAI-shape tool_calls. Loaders whose engines
already emit structured calls (vLLM, llama.cpp via a function-calling
chat handler) keep their native path; loaders that emit raw text
(Transformers today, plugin-wrapped engines later) call into the toolkit.

Includes:
- ToolCallParser ABC + ParsedToolCalls result type
- Hermes-style <tool_call>{...}</tool_call> parser (Hermes-2, Qwen2.5,
  many community fine-tunes)
- name -> parser registry with register_parser hook for plugin code
- resolve_tools_for_request applying OpenAI tool_choice semantics
  (none / auto / required / specific function)
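The Hermes-style parsing described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual `ToolCallParser` implementation; the real ABC, `ParsedToolCalls` type, and registry hooks in `modelship.openai.tool_calling` may differ in shape.

```python
import json
import re
import uuid

# Hypothetical sketch: extract Hermes-style <tool_call>{...}</tool_call>
# blocks and convert them into OpenAI-shape tool_calls entries.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes(text: str) -> tuple[str, list[dict]]:
    """Split raw completion text into (remaining content, tool_calls)."""
    tool_calls = []
    for match in TOOL_CALL_RE.finditer(text):
        payload = json.loads(match.group(1))
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:24]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI clients expect `arguments` as a JSON *string*.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    # Content is whatever text remains once tool-call regions are excised.
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, tool_calls

content, calls = parse_hermes(
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
```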

Wires the Transformers chat path to it: when tools are active,
pre-renders the prompt via apply_chat_template(tools=...) and parses
output through the configured parser, setting finish_reason="tool_calls"
and populating ChatMessage.tool_calls. Streaming buffers tokens while
tools are active and emits a single resolved delta at the end so we
never stream a fragment of a tool-call marker as if it were prose.

Also fixes ChatCompletionRequest.tool_choice default from "none" to
None: per the OpenAI spec, "auto" is the default when tools are
present. The previous default suppressed tools whenever a client
omitted tool_choice, including via the llama.cpp passthrough.
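The corrected semantics can be sketched like this; `resolve_tools_for_request` here is a hypothetical simplification (tool and `tool_choice` payloads as plain dicts), and the real signature may differ:

```python
# Sketch of OpenAI tool_choice semantics as described above, including the
# fixed default: a missing tool_choice means "auto" when tools are present.
def resolve_tools_for_request(tools, tool_choice):
    """Return (effective_tools, forced_function_name)."""
    if not tools:
        return None, None
    if tool_choice is None or tool_choice == "auto":
        return tools, None          # default: model may use any tool
    if tool_choice == "none":
        return None, None           # tools suppressed
    if tool_choice == "required":
        return tools, None          # model must call some tool
    # {"type": "function", "function": {"name": ...}} forces one tool
    name = tool_choice["function"]["name"]
    forced = [t for t in tools if t["function"]["name"] == name]
    return forced, name

tools = [{"type": "function", "function": {"name": "get_weather", "parameters": {}}}]
effective, forced = resolve_tools_for_request(tools, None)
```

With the old default of `"none"`, the first call above would have suppressed the tools; with `None` it falls through to the `"auto"` branch.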

Tests:
- 28 unit tests covering parser shape, registry behavior, tool_choice
  resolution, and the serving_chat tool path against a faked HF pipeline
- Integration test deploying Qwen/Qwen2.5-0.5B-Instruct via the
  transformers loader and round-tripping a get_weather tool call

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a cross-loader tool-calling toolkit designed to enable tool-use support for loaders that emit raw text, specifically the Transformers loader. It adds a registry for model-family-specific parsers, starting with a Hermes-style XML parser, and implements logic to resolve OpenAI-style tool_choice semantics. The chat completion flow in serving_chat.py is updated to handle tool resolution, parsing, and a buffering mechanism for streaming when tools are active. Feedback suggests optimizing the streaming experience by buffering only potential tool-call tags instead of the entire response and moving local imports to the top of the file for consistency.

Comment thread modelship/infer/transformers/openai/serving_chat.py Outdated
Comment thread modelship/infer/transformers/openai/serving_chat.py Outdated
Replaces the buffer-until-done streaming path with a vLLM-style stateful
diff loop so the client receives content tokens and tool-call argument
fragments as fast as the model emits them, instead of seeing nothing
until generation finishes.

ToolCallParser is reshaped around three knobs per family:
``start_marker`` / ``end_marker`` and two extractors,
``extract_partial_name`` and ``extract_partial_args``. A new
``ToolCallStreamer`` instance is created per request and holds the
high-water marks ``_sent_content_idx`` / ``_sent_name[i]`` /
``_sent_args[i]``. On each ``extract_streaming(current_text)`` call it
re-derives the content stream view (text with tool-call regions excised)
and per-block fragments, then diffs against state and returns a
``DeltaMessage | None`` carrying just the new bytes.

Tests:
- 9 new TestToolCallStreamer cases covering pure-content streaming,
  marker-prefix tail held back until disambiguated/finalize, name
  emitted before args, args streamed incrementally across many small
  chunks (concatenated they form valid JSON), multiple tool calls get
  distinct indices, content resumes after a tool call, partial name
  held until its closing quote, unterminated block doesn't crash on
  finalize.
- Two existing parser tests updated where vLLM-style semantics differ
  from the old block-level parser (raw-bytes args passthrough; blocks
  with no extractable name silently dropped).
- Integration: ``test_tool_calling_streaming_transformers_loader`` and
  ``test_tool_calling_streaming_vllm_loader`` exercise streaming + tool
  calling end to end through the gateway. The transformers test asserts
  the function name arrives in exactly one delta, arguments arrive in
  >= 2 deltas (the key invariant proving the diff loop is actually
  diffing rather than buffering), and the rebuilt args parse as JSON.

alez007 commented May 4, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a cross-loader tool-calling toolkit, enabling tool-calling support for the transformers loader via marker-based parsing. It includes a new ToolCallStreamer for incremental streaming, a registry for parsers, and extensive testing. Feedback focuses on improving the robustness of the tool-call finalization logic to handle malformed outputs, ensuring consistency of the created timestamp in streaming responses, and removing unused class attributes.

Comment thread modelship/openai/tool_calling/parsers/base.py Outdated
Comment thread modelship/infer/transformers/openai/serving_chat.py Outdated
Comment thread modelship/infer/transformers/openai/serving_chat.py Outdated
Comment thread modelship/openai/tool_calling/parsers/base.py Outdated
Comment thread modelship/openai/tool_calling/parsers/base.py Outdated
Alex M added 3 commits May 4, 2026 17:01
Replaces the fragile index-equals-length condition for finalizing
tool-call blocks with a dedicated `_finalized_indices` set. This ensures
valid blocks are correctly finalized even if preceding blocks are
malformed and skipped by the streaming parser. Also allows the parser
to continue processing subsequent blocks when a malformed complete block
(missing a valid function name) is encountered.
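The bookkeeping change can be sketched as follows; the function and variable names here are hypothetical stand-ins for the real streaming-parser internals:

```python
# Sketch of the fix: finalize by membership in a set rather than by an
# "index equals length" check, so a malformed block 0 no longer prevents
# a valid block 1 from ever being finalized.
def finalize_new_blocks(blocks, finalized_indices):
    """blocks: parsed tool-call blocks, with None where a complete block
    was malformed (no extractable name) and should be skipped."""
    newly = []
    for i, block in enumerate(blocks):
        if block is None:
            continue                 # skip, but keep processing later blocks
        if i in finalized_indices:
            continue                 # already finalized on a previous pass
        finalized_indices.add(i)
        newly.append((i, block))
    return newly

done = set()
out = finalize_new_blocks([None, {"name": "get_weather"}], done)
```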
The OpenAI specification requires the `created` timestamp to remain
consistent across all chunks in a streaming response. Previously, the
transformers loader recalculated the timestamp for each chunk using
`int(time.time())` inside `_delta_chunk` and for the final finish/usage
chunks. Now, the timestamp is calculated once at the start of `_stream`
and explicitly passed to all chunk generation functions.
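The shape of that fix can be sketched as follows (`_delta_chunk` and `stream_chunks` are hypothetical simplifications of the real chunk builders):

```python
import time

# Sketch: compute `created` once at the start of the stream and thread it
# through every chunk, instead of calling int(time.time()) per chunk.
def _delta_chunk(request_id: str, created: int, delta: dict) -> dict:
    return {
        "id": request_id,
        "object": "chat.completion.chunk",
        "created": created,   # identical across all chunks of one stream
        "choices": [{"index": 0, "delta": delta, "finish_reason": None}],
    }

def stream_chunks(request_id: str, deltas: list[dict]) -> list[dict]:
    created = int(time.time())  # once, at the start of the stream
    return [_delta_chunk(request_id, created, d) for d in deltas]

chunks = stream_chunks("chatcmpl-1", [{"content": "Hi"}, {"content": "!"}])
```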

alez007 commented May 4, 2026

/gemini review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a cross-loader tool-calling toolkit designed to support models that emit structured markers (such as Hermes-style <tool_call> tags) but lack native engine support. The transformers loader is updated to handle tool rendering in prompts and output parsing for both standard and streaming completions. Feedback highlights a performance concern in the streaming implementation, where re-parsing the entire accumulated string on every token delta results in $O(N^2)$ complexity, suggesting a move toward incremental parsing.

Comment thread modelship/infer/transformers/openai/serving_chat.py Outdated
Eliminate the O(N^2) complexity caused by calling `"".join(accumulated)`
inside the chunk-by-chunk stream loop. Now we maintain a running cumulative
string `accumulated_str` that is built via fast appends rather than
allocating and joining the entire array of previously yielded tokens
on every new token.
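The before/after difference reduces to this (a sketch with hypothetical names; in CPython, repeated `+=` on a string with a single reference is optimized into an in-place grow rather than a fresh copy per token):

```python
# Sketch: maintain one running cumulative string instead of rebuilding it
# with "".join(accumulated) on every token, which re-scans all prior
# tokens each iteration and makes the stream loop O(N^2) overall.
def stream_accumulate(tokens):
    accumulated_str = ""
    views = []
    for tok in tokens:
        accumulated_str += tok       # append only the new token
        views.append(accumulated_str)  # cumulative view fed to the parser
    return views

views = stream_accumulate(["<tool", "_call>", "{"])
```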
@alez007 alez007 merged commit 37c2784 into main May 4, 2026
2 checks passed