[https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters #12815
brb-nv wants to merge 1 commit into NVIDIA:main from
Conversation
Force-pushed 93d25ac to 5de12f6
📝 Walkthrough
A backend-dependent optimization is introduced to LoRA tensor management.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unittest/others/test_lora_manager.py`:
- Around line 1-169: Run the ruff formatter on the test file to apply the
CI-required style fixes; locally run the equivalent of `ruff format` targeting
the test file that contains TestLoraManagerRetainDeviceTensors and the helper
_create_dummy_hf_lora_adapter so imports/spacing/line breaks match CI
expectations, then re-run tests and commit the formatted changes.
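As a sketch, the suggested cleanup could look like the following (the file path is taken from the comment above; the exact ruff and pytest invocations CI expects are assumptions):

```shell
# Apply ruff's formatter to the test file flagged by CI.
ruff format tests/unittest/others/test_lora_manager.py

# Re-run the affected tests before committing (invocation is an assumption).
pytest tests/unittest/others/test_lora_manager.py

# Commit the formatted result.
git add tests/unittest/others/test_lora_manager.py
git commit -s -m "Apply ruff format to test_lora_manager.py"
```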
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e90c5094-e382-42e9-81b3-cc4692ae673d
📒 Files selected for processing (3)
- tensorrt_llm/lora_manager.py
- tests/unittest/llmapi/test_llm_pytorch.py
- tests/unittest/others/test_lora_manager.py
Force-pushed 5de12f6 to 042db83
/bot run --disable-fail-fast

PR_Github #42184 [ run ] triggered by Bot. Commit:
Force-pushed 042db83 to 9f6fe8f
…adapters
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Force-pushed 9f6fe8f to 0543050
/bot run --disable-fail-fast

PR_Github #42203 [ run ] triggered by Bot. Commit:

PR_Github #42203 [ run ] completed with state
Description
This addresses https://nvbugspro.nvidia.com/bug/5658258 for the PyTorch backend. Legacy TRT backend behavior is unchanged.
Also unwaives the test for https://nvbugs/5636857, as that test no longer fails on main.

Background:
When serving many unique LoRA adapters, TRT-LLM hits CUDA OOM after processing a few requests, despite the C++ PeftCacheManager being configured with max_loras to limit GPU-resident adapters and evict via LRU.

Root cause:
LoraManager unconditionally appended every loaded adapter's GPU tensors to self._lora_weights, a list that only grows and never evicts, eventually causing OOM. This list is used only by the legacy TRT backend. The PyTorch backend does not read _lora_weights; it uses the C++ PeftCacheManager, which maintains its own GPU cache with proper eviction. The result was two copies of every adapter on GPU: one in the C++ cache (bounded, evictable) and one in _lora_weights (unbounded, leaked, and, more importantly, unused).

Fix:
_lora_weightsappend and_lora_weights_pointers_listpopulation by a_retain_device_tensorsflag onLoraManager._cpp_lora_weights) are retained for the C++ cache.Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.

Summary by CodeRabbit
Bug Fixes
Tests