[None][feat] Add llm.encode() fast path for encoder-only models #12801
tingyangk wants to merge 1 commit into NVIDIA:main
Conversation
Signed-off-by: tingyangk <tingyangk@nvidia.com>
Force-pushed from 04bcd64 to 83bc6b9
nvrohanv left a comment

Some comments on the tokenization piece and the handling of empty batches, but overall looks good!
    unbatched = not isinstance(inputs, list)
    if not unbatched:
        if isinstance(inputs[0], int):
Nit: how do we handle the case where an empty batch is passed in? Unless there's handling elsewhere, I think this would cause an IndexError. I'm guessing this logic exists elsewhere as well, so this might be a general question of how we want to handle it.
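For illustration, a minimal sketch of the guard being suggested. The function name and the choice to raise ValueError are assumptions; only the three lines of dispatch logic come from the diff above:

```python
def classify_inputs(inputs):
    """Mirror of the snippet above, with an explicit empty-batch guard.

    Returns True when `inputs` is a single (unbatched) prompt, e.g. a
    string or a flat list of token ids; False for a batch of prompts.
    """
    unbatched = not isinstance(inputs, list)
    if not unbatched:
        if not inputs:
            # Assumed handling: fail fast with a clear message instead
            # of letting inputs[0] raise a bare IndexError below.
            raise ValueError("empty batch passed to encode()")
        if isinstance(inputs[0], int):
            # A flat list of ints is one tokenized prompt, not a batch.
            unbatched = True
    return unbatched

print(classify_inputs([101, 7592, 102]))     # single tokenized prompt
print(classify_inputs([[101, 102], [101]]))  # batch of two prompts
```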
        max_seq_len_batch = max(max_seq_len_batch, seq_len)
        prompts.append(None)
    elif "prompt" in inp:
        token_ids, _ = self.input_processor(inp, sampling_params)
Is it faster to do the tokenization in a batch? I see that later we do the processing to turn it into a flat "packed" tensor. Especially for cases with larger batch sizes, I'm curious whether we could cut down on tokenization overhead this way. I'm not sure about the structure of input_processor and whether it can support this well.
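As a side note, the flat "packed" layout mentioned here can be sketched in pure Python. This helper is hypothetical (the real code builds torch tensors, and the offsets representation is an assumption):

```python
def pack_token_ids(batch_token_ids):
    """Concatenate per-prompt token id lists into one flat list plus
    cumulative offsets; prompt i spans packed[offsets[i]:offsets[i+1]]."""
    packed, offsets = [], [0]
    for ids in batch_token_ids:
        packed.extend(ids)
        offsets.append(len(packed))
    return packed, offsets

packed, offsets = pack_token_ids([[101, 7592, 102], [101, 102]])
print(packed)   # [101, 7592, 102, 101, 102]
print(offsets)  # [0, 3, 5]
```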
    encoder_only: Optional[bool] = Field(
        default=None,
        description=
        "Set to True for encoder-only models (BERT, RoBERTa, reward models, "
Can we make this a bit clearer that it would also work for decoder-only models being run in "encoder-only" style?
        status="prototype",
    )

    encoder_only: Optional[bool] = Field(
Similar to mm_encoder_only above, should this field be bool with default=False?
    from .encoder_executor import EncoderExecutor

    torch.cuda.set_per_process_memory_fraction(1.0)
    checkpoint_loader = _construct_checkpoint_loader(llm_args.backend,
The checkpoint-loader logic is common across create_py_executor and create_encoder_executor; it could be moved into a helper function.
    Dict with 'logits' tensor and any other model outputs.
    """
    model_inputs = self._prepare_encoder_inputs(inputs)
    return self._forward_step(model_inputs, None, False)
Suggested change:

    - return self._forward_step(model_inputs, None, False)
    + return self._forward_step(model_inputs, gather_ids=None, gather_context_logits=False)
I would suggest enforcing keyword arguments by adding *, after inputs here.
@Superjomn could you review since it adds a new method to the LLM API? Thx.
    @torch.inference_mode()
    @with_model_extra_attrs(lambda self: self.model.extra_attrs)
    @nvtx_range("encoder_forward")
    def encoder_forward(self,
How does this function relate to _forward_step_mm_encoder_only? The purpose seems similar. Does it make sense to unify them?
Summary
Adds a dedicated llm.encode() API for encoder-only models (BERT, RoBERTa, reward models) that bypasses the decoder-oriented PyExecutor loop entirely.

Problem

The current LLM API routes encoder models through the same PyExecutor designed for autoregressive decoders, introducing significant CPU overhead per batch from the scheduler, KV cache management, sampling, and the request state machine, none of which apply to encoders. Encoder models need a simple, direct path to the model's forward call, with batch inference executed in a single pass.

Solution

A new execution path (encoder_only=True) creates a lightweight EncoderExecutor instead of the full PyExecutor. The encode() method tokenizes, packs, and runs a single forward pass directly through ModelEngine.encoder_forward(), returning EncoderOutput with logits. This new API demonstrates a 3.92× speedup for the BERT 110M model (textattack/bert-base-uncased-yelp-polarity) in eager mode with batch size 10.

Usage

- encoder_only=True must be explicitly set. The default (None) uses the old generate() path.
- encoder_only=True creates only EncoderExecutor; False/None creates only PyExecutor. The two are mutually exclusive.
- generate()/generate_async() raise RuntimeError when encoder_only=True; encode() is the only API.
- Because llm.encode() reuses PyTorchModelEngine and its _forward_step() path, features like TorchCompileConfig are compatible.

Future Works

- EncoderExecutor (AttentionMetadata, etc.)

Test Coverage

- tests/unittest/llmapi/test_llm_encode.py: 11 new tests (BertForSequenceClassification; encoder_only=True → old path)
- tests/integration/defs/test_e2e.py::test_ptp_quickstart_bert
- tests/unittest/llmapi/test_llm_pytorch.py::test_llm_reward_model

CC: @symphonylyh @amukkara @nvrohanv @schetlur-nv @juney-nvidia
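Putting the usage rules above together, a hypothetical end-to-end call might look like the following. This is an untested sketch: the constructor arguments and return shape follow the PR description, not a verified API surface.

```python
from tensorrt_llm import LLM

# encoder_only=True must be set explicitly; it builds the lightweight
# EncoderExecutor and makes generate()/generate_async() raise RuntimeError.
llm = LLM(model="textattack/bert-base-uncased-yelp-polarity",
          encoder_only=True)

# encode() tokenizes, packs, and runs a single forward pass through
# ModelEngine.encoder_forward(), returning EncoderOutput with logits.
outputs = llm.encode(["great food!", "terrible service."])
```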
PR Checklist
Please review the following before submitting your PR:
- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Tava architecture diagram updated if there is a significant design change in the PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.