
[https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters#12815

Open
brb-nv wants to merge 1 commit into NVIDIA:main from brb-nv:user/brb/many-lora-adapters

Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Apr 7, 2026

Description

This addresses https://nvbugspro.nvidia.com/bug/5658258 for the PyTorch backend. Legacy TRT backend behavior is unchanged.

Also unwaives the test for https://nvbugs/5636857, as it no longer fails on main.

Background:
When serving many unique LoRA adapters, TRT-LLM hits CUDA OOM after processing a few requests, despite the C++ PeftCacheManager being configured with max_loras to limit GPU-resident adapters and evict via LRU.

Root cause:

  • The Python LoraManager unconditionally appended every loaded adapter's GPU tensors to self._lora_weights, a list that only grows and never evicts (eventually causes OOM). This list is used only by the legacy TRT backend.
  • The PyTorch backend never reads from _lora_weights — it uses the C++ PeftCacheManager, which maintains its own GPU cache with proper eviction. The result was two copies of every adapter on GPU: one in the C++ cache (bounded, evictable) and one in _lora_weights (unbounded, leaked, more importantly unused).
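The leak pattern described above can be sketched as follows. This is an illustrative sketch, not the actual TRT-LLM code: the class and method names are hypothetical, and plain byte buffers stand in for GPU tensors so the example runs anywhere.

```python
class LeakyLoraManager:
    """Sketch of the pre-fix behavior: every loaded adapter's weights are
    appended to a list that only grows and never evicts."""

    def __init__(self):
        # Grows without bound; only the legacy TRT backend ever reads it.
        self._lora_weights = []

    def load_adapter(self, rank=8, hidden=64):
        # Stand-ins for the adapter's in/out projection tensors on GPU.
        w_in = bytearray(rank * hidden)
        w_out = bytearray(hidden * rank)
        # Unconditional append: these stay alive even after the C++
        # PeftCacheManager evicts the adapter from its own bounded cache.
        self._lora_weights.extend([w_in, w_out])


mgr = LeakyLoraManager()
for _ in range(100):
    mgr.load_adapter()
print(len(mgr._lora_weights))  # 200 — two retained buffers per adapter
```

With real adapter sizes, each entry is a multi-megabyte GPU tensor, so the list's unbounded growth translates directly into CUDA OOM.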

Fix:

  • Guard _lora_weights append and _lora_weights_pointers_list population by a _retain_device_tensors flag on LoraManager.
  • In the PyTorch backend, the temporary GPU tensors created during loading go out of scope and are freed each iteration; only CPU copies (_cpp_lora_weights) are retained for the C++ cache.
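The guard can be sketched as below. The `_retain_device_tensors` flag name comes from the PR; everything else (class shape, method signatures) is a simplified illustration, not the real `LoraManager` API.

```python
class LoraManagerSketch:
    """Simplified sketch of the fix: device tensors are retained only when
    no C++ PeftCacheManager is present (i.e., the legacy TRT backend)."""

    def __init__(self, cpp_peft_cache_manager=None):
        # PyTorch backend supplies a C++ cache manager, so the Python side
        # does not need to keep its own GPU copies alive.
        self._retain_device_tensors = cpp_peft_cache_manager is None
        self._lora_weights = []      # device tensors, legacy TRT path only
        self._cpp_lora_weights = {}  # CPU copies handed to the C++ cache

    def load_adapter(self, uid, device_tensors, cpu_tensors):
        # CPU copies are always kept for the C++ cache.
        self._cpp_lora_weights[uid] = cpu_tensors
        if self._retain_device_tensors:
            # Legacy TRT path: keep GPU tensors alive for engine pointers.
            self._lora_weights.extend(device_tensors)
        # Otherwise device_tensors go out of scope here and are freed.


trt = LoraManagerSketch(cpp_peft_cache_manager=None)
pt = LoraManagerSketch(cpp_peft_cache_manager=object())
trt.load_adapter("a", ["gpu_w"], ["cpu_w"])
pt.load_adapter("a", ["gpu_w"], ["cpu_w"])
print(len(trt._lora_weights), len(pt._lora_weights))  # 1 0
```

The key design point is that both backends keep identical CPU-side state; only the redundant GPU retention is made conditional.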

Test Coverage

$ pytest tests/unittest/others/test_lora_manager.py -s -v
$ pytest tests/unittest/llmapi/test_llm_pytorch.py::test_lora_many_adapters_no_memory_leak -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed GPU memory accumulation issue in LoRA adapter management to prevent memory leaks during multi-adapter inference.
  • Tests

    • Added regression tests to validate GPU memory efficiency with multiple LoRA adapters and backend-specific behavior.

@brb-nv brb-nv requested a review from a team as a code owner April 7, 2026 20:25
@brb-nv brb-nv requested a review from amitz-nv April 7, 2026 20:25
@brb-nv brb-nv requested a review from shaharmor98 April 7, 2026 20:26
@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 93d25ac to 5de12f6 Compare April 7, 2026 20:28
@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

A backend-dependent optimization is introduced to LoRA tensor management in tensorrt_llm/lora_manager.py. The change adds a retention flag that controls whether GPU tensors are stored in memory, conditionally based on whether a C++ PEFT cache manager is available. Corresponding unit and integration tests validate the new behavior and detect potential GPU memory accumulation issues.

Changes

  • Core LoRA Manager Implementation (tensorrt_llm/lora_manager.py): Added _retain_device_tensors flag set based on cpp_peft_cache_manager availability. Modified weight loading in load_from_model_file and load_from_model_dir to conditionally populate _lora_weights_pointers_list and GPU tensors, while always computing ranks and building CPU tensors for _cpp_lora_weights.
  • Unit Tests (tests/unittest/others/test_lora_manager.py): New test module validating _retain_device_tensors behavior across backend scenarios. Tests verify the flag state based on cache manager presence, validate _lora_weights and _cpp_lora_weights contents in each path, and confirm GPU memory does not accumulate when loading many adapters with the C++ cache manager.
  • Integration Tests (tests/unittest/llmapi/test_llm_pytorch.py): Added regression test test_lora_many_adapters_no_memory_leak that constructs 20 unique LoRA adapters with constrained slots and measures GPU memory growth per adapter, asserting growth stays under ~1 MB/adapter threshold to detect _lora_weights accumulation.
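The shape of such a regression test can be sketched as below. The real test measures CUDA memory (e.g., via torch.cuda.memory_allocated) against a full LLM; this sketch substitutes tracemalloc CPU measurements and a hypothetical FixedLoraManager so it runs anywhere, and the 1 MiB threshold is illustrative.

```python
import tracemalloc


class FixedLoraManager:
    """Manager sketch with the fix applied: transient buffers are freed
    each iteration; only small CPU-side metadata is retained."""

    def __init__(self):
        self._cpp_lora_weights = {}

    def load_adapter(self, uid):
        tmp = bytearray(1 << 20)               # transient 1 MiB "device" buffer
        self._cpp_lora_weights[uid] = b"meta"  # only small CPU metadata kept
        del tmp                                # transient buffer freed here


def test_many_adapters_no_memory_leak():
    mgr = FixedLoraManager()
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    n_adapters = 20
    for i in range(n_adapters):
        mgr.load_adapter(f"adapter-{i}")
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth_per_adapter = (current - baseline) / n_adapters
    # With retention disabled, per-adapter growth stays far below the
    # size of one adapter's weights.
    assert growth_per_adapter < 1 << 20


test_many_adapters_no_memory_leak()
```

Measuring growth per adapter rather than absolute usage makes the test robust to fixed startup overhead while still catching unbounded accumulation.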

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the fix as addressing an OOM issue with large numbers of LoRA adapters, which matches the main objective of the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed PR description comprehensively explains the issue, root cause, and fix with clear background and test coverage provided.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unittest/others/test_lora_manager.py`:
- Around line 1-169: Run the ruff formatter on the test file to apply the
CI-required style fixes; locally run the equivalent of `ruff format` targeting
the test file that contains TestLoraManagerRetainDeviceTensors and the helper
_create_dummy_hf_lora_adapter so imports/spacing/line breaks match CI
expectations, then re-run tests and commit the formatted changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e90c5094-e382-42e9-81b3-cc4692ae673d

📥 Commits

Reviewing files that changed from the base of the PR and between a1777fd and 93d25ac.

📒 Files selected for processing (3)
  • tensorrt_llm/lora_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tests/unittest/others/test_lora_manager.py

@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 5de12f6 to 042db83 Compare April 7, 2026 20:33
@brb-nv brb-nv requested a review from a team as a code owner April 7, 2026 20:33
@brb-nv brb-nv requested a review from StanleySun639 April 7, 2026 20:34
@brb-nv
Collaborator Author

brb-nv commented Apr 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42184 [ run ] triggered by Bot. Commit: 042db83 Link to invocation

@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 042db83 to 9f6fe8f Compare April 7, 2026 23:48
…adapters

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 9f6fe8f to 0543050 Compare April 8, 2026 00:04
@brb-nv
Collaborator Author

brb-nv commented Apr 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42203 [ run ] triggered by Bot. Commit: 0543050 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42203 [ run ] completed with state SUCCESS. Commit: 0543050
/LLM/main/L0_MergeRequest_PR pipeline #33023 completed with status: 'SUCCESS'

CI Report

Link to invocation

@brb-nv brb-nv enabled auto-merge (squash) April 8, 2026 05:06