
[https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters#12815

Open
brb-nv wants to merge 1 commit into NVIDIA:main from brb-nv:user/brb/many-lora-adapters

Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Apr 7, 2026

Description

This addresses https://nvbugspro.nvidia.com/bug/5658258 for the PyTorch backend. Legacy TRT backend behavior is unchanged.

Also unwaives the test for https://nvbugs/5636857, as it no longer fails on main.

Background:
When serving many unique LoRA adapters, TRT-LLM hits CUDA OOM after processing a few requests, despite the C++ PeftCacheManager being configured with max_loras to limit GPU-resident adapters and evict via LRU.

Root cause:

  • The Python LoraManager unconditionally appended every loaded adapter's GPU tensors to self._lora_weights, a list that only grows and never evicts (eventually causes OOM). This list is used only by the legacy TRT backend.
  • The PyTorch backend never reads from _lora_weights — it uses the C++ PeftCacheManager, which maintains its own GPU cache with proper eviction. The result was two copies of every adapter on GPU: one in the C++ cache (bounded, evictable) and one in _lora_weights (unbounded, leaked, more importantly unused).
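The leak pattern described above can be sketched as follows. This is an illustrative sketch, not the actual TRT-LLM code: the class and method names are hypothetical, and plain byte buffers stand in for GPU tensors so the example runs anywhere.

```python
class LeakyLoraManager:
    """Sketch of the pre-fix behavior: every loaded adapter's weights are
    appended to a list that only grows and never evicts."""

    def __init__(self):
        # Grows without bound; only the legacy TRT backend ever reads it.
        self._lora_weights = []

    def load_adapter(self, rank=8, hidden=64):
        # Stand-ins for the adapter's in/out projection tensors on GPU.
        w_in = bytearray(rank * hidden)
        w_out = bytearray(hidden * rank)
        # Unconditional append: these stay alive even after the C++
        # PeftCacheManager evicts the adapter from its own bounded cache.
        self._lora_weights.extend([w_in, w_out])


mgr = LeakyLoraManager()
for _ in range(100):
    mgr.load_adapter()
print(len(mgr._lora_weights))  # 200 — two retained buffers per adapter
```

With real adapter sizes, each entry is a multi-megabyte GPU tensor, so the list's unbounded growth translates directly into CUDA OOM.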

Fix:

  • Guard _lora_weights append and _lora_weights_pointers_list population by a _retain_device_tensors flag on LoraManager.
  • In the PyTorch backend, the temporary GPU tensors created during loading go out of scope and are freed each iteration; only CPU copies (_cpp_lora_weights) are retained for the C++ cache.
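The guard can be sketched as below. The `_retain_device_tensors` flag name comes from the PR; everything else (class shape, method signatures) is a simplified illustration, not the real `LoraManager` API.

```python
class LoraManagerSketch:
    """Simplified sketch of the fix: device tensors are retained only when
    no C++ PeftCacheManager is present (i.e., the legacy TRT backend)."""

    def __init__(self, cpp_peft_cache_manager=None):
        # PyTorch backend supplies a C++ cache manager, so the Python side
        # does not need to keep its own GPU copies alive.
        self._retain_device_tensors = cpp_peft_cache_manager is None
        self._lora_weights = []      # device tensors, legacy TRT path only
        self._cpp_lora_weights = {}  # CPU copies handed to the C++ cache

    def load_adapter(self, uid, device_tensors, cpu_tensors):
        # CPU copies are always kept for the C++ cache.
        self._cpp_lora_weights[uid] = cpu_tensors
        if self._retain_device_tensors:
            # Legacy TRT path: keep GPU tensors alive for engine pointers.
            self._lora_weights.extend(device_tensors)
        # Otherwise device_tensors go out of scope here and are freed.


trt = LoraManagerSketch(cpp_peft_cache_manager=None)
pt = LoraManagerSketch(cpp_peft_cache_manager=object())
trt.load_adapter("a", ["gpu_w"], ["cpu_w"])
pt.load_adapter("a", ["gpu_w"], ["cpu_w"])
print(len(trt._lora_weights), len(pt._lora_weights))  # 1 0
```

The key design point is that both backends keep identical CPU-side state; only the redundant GPU retention is made conditional.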

Test Coverage

$ pytest tests/unittest/others/test_lora_manager.py -s -v
$ pytest tests/unittest/llmapi/test_llm_pytorch.py::test_lora_many_adapters_no_memory_leak -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed GPU memory accumulation issue in LoRA adapter management to prevent memory leaks during multi-adapter inference.
  • Tests

    • Added regression tests to validate GPU memory efficiency with multiple LoRA adapters and backend-specific behavior.

@brb-nv brb-nv requested a review from a team as a code owner April 7, 2026 20:25
@brb-nv brb-nv requested a review from amitz-nv April 7, 2026 20:25
@brb-nv brb-nv requested a review from shaharmor98 April 7, 2026 20:26
@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 93d25ac to 5de12f6 Compare April 7, 2026 20:28
@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

A backend-dependent optimization is introduced to LoRA tensor management in tensorrt_llm/lora_manager.py. The change adds a retention flag that controls whether GPU tensors are stored in memory, conditionally based on whether a C++ PEFT cache manager is available. Corresponding unit and integration tests validate the new behavior and detect potential GPU memory accumulation issues.

Changes

  • Core LoRA Manager Implementation (tensorrt_llm/lora_manager.py): Added _retain_device_tensors flag set based on cpp_peft_cache_manager availability. Modified weight loading in load_from_model_file and load_from_model_dir to conditionally populate _lora_weights_pointers_list and GPU tensors, while always computing ranks and building CPU tensors for _cpp_lora_weights.
  • Unit Tests (tests/unittest/others/test_lora_manager.py): New test module validating _retain_device_tensors behavior across backend scenarios. Tests verify the flag state based on cache manager presence, validate _lora_weights and _cpp_lora_weights contents in each path, and confirm GPU memory does not accumulate when loading many adapters with the C++ cache manager.
  • Integration Tests (tests/unittest/llmapi/test_llm_pytorch.py): Added regression test test_lora_many_adapters_no_memory_leak that constructs 20 unique LoRA adapters with constrained slots and measures GPU memory growth per adapter, asserting growth stays under ~1 MB/adapter threshold to detect _lora_weights accumulation.
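The shape of such a regression test can be sketched as below. The real test measures CUDA memory (e.g., via torch.cuda.memory_allocated) against a full LLM; this sketch substitutes tracemalloc CPU measurements and a hypothetical FixedLoraManager so it runs anywhere, and the 1 MiB threshold is illustrative.

```python
import tracemalloc


class FixedLoraManager:
    """Manager sketch with the fix applied: transient buffers are freed
    each iteration; only small CPU-side metadata is retained."""

    def __init__(self):
        self._cpp_lora_weights = {}

    def load_adapter(self, uid):
        tmp = bytearray(1 << 20)               # transient 1 MiB "device" buffer
        self._cpp_lora_weights[uid] = b"meta"  # only small CPU metadata kept
        del tmp                                # transient buffer freed here


def test_many_adapters_no_memory_leak():
    mgr = FixedLoraManager()
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    n_adapters = 20
    for i in range(n_adapters):
        mgr.load_adapter(f"adapter-{i}")
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    growth_per_adapter = (current - baseline) / n_adapters
    # With retention disabled, per-adapter growth stays far below the
    # size of one adapter's weights.
    assert growth_per_adapter < 1 << 20


test_many_adapters_no_memory_leak()
```

Measuring growth per adapter rather than absolute usage makes the test robust to fixed startup overhead while still catching unbounded accumulation.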

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the fix as addressing an OOM issue with large numbers of LoRA adapters, which matches the main objective of the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed PR description comprehensively explains the issue, root cause, and fix with clear background and test coverage provided.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unittest/others/test_lora_manager.py`:
- Around line 1-169: Run the ruff formatter on the test file to apply the
CI-required style fixes; locally run the equivalent of `ruff format` targeting
the test file that contains TestLoraManagerRetainDeviceTensors and the helper
_create_dummy_hf_lora_adapter so imports/spacing/line breaks match CI
expectations, then re-run tests and commit the formatted changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e90c5094-e382-42e9-81b3-cc4692ae673d

📥 Commits

Reviewing files that changed from the base of the PR and between a1777fd and 93d25ac.

📒 Files selected for processing (3)
  • tensorrt_llm/lora_manager.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tests/unittest/others/test_lora_manager.py

@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 5de12f6 to 042db83 Compare April 7, 2026 20:33
@brb-nv brb-nv requested a review from a team as a code owner April 7, 2026 20:33
@brb-nv brb-nv requested a review from StanleySun639 April 7, 2026 20:34
@brb-nv
Collaborator Author

brb-nv commented Apr 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42184 [ run ] triggered by Bot. Commit: 042db83 Link to invocation

@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 042db83 to 9f6fe8f Compare April 7, 2026 23:48
…adapters

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv force-pushed the user/brb/many-lora-adapters branch from 9f6fe8f to 0543050 Compare April 8, 2026 00:04
@brb-nv
Collaborator Author

brb-nv commented Apr 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42203 [ run ] triggered by Bot. Commit: 0543050 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42203 [ run ] completed with state SUCCESS. Commit: 0543050
/LLM/main/L0_MergeRequest_PR pipeline #33023 completed with status: 'SUCCESS'

CI Report

Link to invocation

@brb-nv brb-nv enabled auto-merge (squash) April 8, 2026 05:06