[main] qwen-vl THD packed-sequence support and fixes #3323
DAISY-gh wants to merge 40 commits into NVIDIA-NeMo:main from
Conversation
Force-pushed from e9ff163 to 8d8b782
📝 Walkthrough

This update introduces batch-level sequence packing support for Qwen Vision Language models with THD (Token Head Dimension) diagnostics infrastructure, adds optional dependency handling for the diffusion and modelopt modules, and extends the Energon dataset configuration for flexible path resolution and packing strategies.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Energon as Energon Provider
    participant TaskEnc as Task Encoder
    participant Step as VLM Step
    participant Model as Qwen3VL Model
    rect rgba(100, 150, 200, 0.5)
        Note over Energon,Model: Unpacked Path (Standard)
        Energon->>TaskEnc: samples (unpacked)
        TaskEnc->>TaskEnc: batch() → QwenVLTaskBatch<br/>(no packing metadata)
        Step->>Step: forward_step() → standard<br/>attention_mask, position_ids
        Step->>Model: forward(position_ids, attention_mask)
    end
    rect rgba(150, 200, 100, 0.5)
        Note over Energon,Model: Energon Pre-Packed Path (THD)
        Energon->>TaskEnc: samples (packed)<br/>or batch_level_packing=True
        TaskEnc->>TaskEnc: select_samples_to_pack()<br/>+ pack_selected_samples()
        TaskEnc->>TaskEnc: batch() → QwenVLTaskBatch<br/>(cu_lengths, cu_seqlens_argmin)
        Step->>Step: forward_step() detects cu_seqlens<br/>in batch → fast return<br/>cu_seqlens, moe_padding_mask
        Step->>Model: forward(rope_cu_seqlens,<br/>moe_padding_mask, position_ids=None)
        Model->>Model: derive MRoPE positions<br/>from cu_seqlens
    end
```
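The "derive MRoPE positions from cu_seqlens" step in the packed path amounts to resetting position ids at every packed-sequence boundary. A minimal plain-Python sketch of that idea follows; the helper name is illustrative and this is not the actual Qwen3VL implementation, which also handles the multimodal 3D rope axes:

```python
# Hypothetical sketch: rebuild per-sequence position ids from cu_seqlens,
# restarting at 0 for each sub-sequence in the packed (THD) batch.
def positions_from_cu_seqlens(cu_seqlens):
    positions = []
    # consecutive cu_seqlens entries delimit one packed sub-sequence each
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        positions.extend(range(end - start))  # 0 .. len-1 per sub-sequence
    return positions

# Three packed sequences of lengths 3, 2, and 4:
print(positions_from_cu_seqlens([0, 3, 5, 9]))
# -> [0, 1, 2, 0, 1, 0, 1, 2, 3]
```

This is why the packed path passes `position_ids=None`: the model can reconstruct boundary-aware positions from `cu_seqlens` alone.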
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)
251-265: ⚠️ Potential issue | 🟠 Major

`batch_level_packing` is enabled before the THD metadata is actually wired through.

This path now turns on packed handling for `dataset.batch_level_packing`, but it still re-synthesizes `PackedSeqParams`, rewrites the mask to all ones, and never forwards the new `padding_mask` / rope-MoE boundary metadata. THD batches will therefore keep counting alignment padding as real tokens on this Qwen3-specific step. Mirror the metadata flow in `src/megatron/bridge/training/vlm_step.py` before enabling this mode here.

Also applies to: 280-292
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 251 - 265, The code enables batch_level_packing (via dataset.batch_level_packing and variable batch_level_pack_enabled) before the THD/packed-sequence metadata is propagated, but then calls pack_or_pad_batch_sequences which re-synthesizes PackedSeqParams, overwrites masks (loss_mask/attention_mask) and never forwards the original padding_mask / rope-MoE boundary metadata, causing THD batches to treat alignment padding as real tokens; fix by mirroring the metadata flow from src/megatron/bridge/training/vlm_step.py: when batch_level_packing (and in_batch_pack_enabled) is true, do not re-create PackedSeqParams blindly—retrieve and forward the existing PackedSeqParams/padding_mask/rope-MoE boundary metadata from this_pg_collection (or the THD batch metadata), ensure pack_or_pad_batch_sequences is invoked in a mode that preserves and returns the original packed_seq_params and padding_mask, and wire those returned values into subsequent steps so that batch_level_packing is only enabled after the proper metadata has been wired through.

src/megatron/bridge/training/utils/packed_seq_utils.py (1)
46-57: ⚠️ Potential issue | 🔴 Critical

The `torch.argmin` fallback trims every `cu_seqlens*` tensor to empty.

`cu_seqlens` starts with `0` by construction, so `torch.argmin(...)` returns the first element, not the first sentinel tail. Any batch that reaches this fallback path will build empty boundary arrays and invalid `PackedSeqParams`.

Suggested change
```diff
 if cu_seqlens_argmin is not None:
     cu_seqlens_padded = cu_seqlens_padded[: cu_seqlens_argmin.item()]
 else:
-    cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded)]
+    padded_tail = torch.nonzero(cu_seqlens_padded[1:] == 0, as_tuple=False)
+    first_tail = padded_tail[0].item() + 1 if padded_tail.numel() else cu_seqlens_padded.numel()
+    cu_seqlens_padded = cu_seqlens_padded[:first_tail]
 if cu_seqlens_unpadded is not None:
     if cu_seqlens_unpadded_argmin is not None:
         cu_seqlens_unpadded = cu_seqlens_unpadded[: cu_seqlens_unpadded_argmin.item()]
     else:
-        cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded)]
+        unpadded_tail = torch.nonzero(cu_seqlens_unpadded[1:] == 0, as_tuple=False)
+        first_tail = unpadded_tail[0].item() + 1 if unpadded_tail.numel() else cu_seqlens_unpadded.numel()
+        cu_seqlens_unpadded = cu_seqlens_unpadded[:first_tail]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/utils/packed_seq_utils.py` around lines 46 - 57, The fallback use of torch.argmin on cu_seqlens_padded and cu_seqlens_unpadded yields 0 (because they start with 0) and trims tensors to empty; replace both torch.argmin fallbacks with logic that finds the start index of the sentinel/tail region (e.g., the first index whose value equals the final cumulative value or otherwise the first index where values stop increasing) and slice up to that index; update the branches that reference cu_seqlens_padded, cu_seqlens_argmin, cu_seqlens_unpadded, and cu_seqlens_unpadded_argmin to use this computed tail_start instead of torch.argmin to avoid producing empty boundary arrays.
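To make the failure mode concrete: every `cu_seqlens` tensor begins with `0`, so `argmin` lands on index 0 and the slice `[:0]` is empty. A plain-Python sketch of the safe tail-trim the suggestion implements (the real code operates on torch tensors; the helper name is illustrative):

```python
# Sketch: trim sentinel tail padding from a cu_seqlens list. The first
# element is always 0 by construction, so we look for the first zero AFTER
# position 0, which marks the start of the padded tail.
def trim_cu_seqlens(cu_seqlens):
    for i, v in enumerate(cu_seqlens[1:], start=1):
        if v == 0:              # first sentinel entry in the tail
            return cu_seqlens[:i]
    return cu_seqlens           # no tail padding present

print(trim_cu_seqlens([0, 3, 7, 12, 0, 0, 0]))  # -> [0, 3, 7, 12]
print(trim_cu_seqlens([0, 4, 9]))               # -> [0, 4, 9]
```

A naive `argmin`-style search would pick index 0 in both cases and return an empty list, which is exactly the bug described above.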
🧹 Nitpick comments (3)
src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py (3)
470: Remove unused `merge_length` variable.

The `merge_length` variable is assigned but never used in `pack_selected_samples()`.

Suggested fix

```diff
- merge_length = self.merge_size ** 2
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py` at line 470, In pack_selected_samples(), remove the unused local variable merge_length (assignment: merge_length = self.merge_size ** 2) since it is never referenced; simply delete that line from the pack_selected_samples function (reference symbols: merge_length, self.merge_size, pack_selected_samples) so no unused variable remains.
528-531: Rename ambiguous loop variable `l` to `length`.

Single-letter `l` is easily confused with `1` (one). Use a descriptive name.

Suggested fix

```diff
- for l in sub_lengths:
-     cu_lengths_list.append(cu_lengths_list[-1] + l)
+ for length in sub_lengths:
+     cu_lengths_list.append(cu_lengths_list[-1] + length)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py` around lines 528 - 531, The loop uses an ambiguous single-letter variable `l` when iterating `sub_lengths`; rename it to a descriptive name like `length` in the loop that builds `cu_lengths_list` (the block that defines `cu_lengths_list = [0]`, iterates over `sub_lengths`, appends cumulative sums, and constructs `cu_lengths`), updating the iteration variable wherever referenced to avoid confusion with the digit `1`.
68-104: Add `strict=True` to `zip()` for safer iteration.

The `zip()` at line 85 should include `strict=True` to catch length mismatches between `item_sizes` and `samples`, which would indicate a bug in the caller.

Suggested fix

```diff
- sorted_sizes, sorted_samples = zip(*sorted(zip(item_sizes, samples), key=lambda x: x[0]))
+ sorted_sizes, sorted_samples = zip(*sorted(zip(item_sizes, samples, strict=True), key=lambda x: x[0]), strict=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py` around lines 68 - 104, The zip() call that builds sorted_sizes and sorted_samples in greedy_knapsack should use strict=True to catch any unexpected length mismatches between item_sizes and samples; update the unzip step (currently: sorted_sizes, sorted_samples = zip(*sorted(zip(item_sizes, samples), key=lambda x: x[0]))) to pass strict=True to the inner zip so the tuple unpacking of sorted_sizes and sorted_samples will raise immediately if lengths differ, keeping the existing sort and subsequent logic (sorted_sizes, sorted_samples, _search_for_fit, popping) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/conversion/auto_bridge.py`:
- Around line 53-59: The optional import of is_quantized currently uses a broad
except Exception which can mask real errors; change the except to catch
ImportError only and emit a fallback log message including the exception reason
before defining the fallback is_quantized function so callers know quantization
detection was unavailable; update the try/except around "from
modelopt.torch.quantization.utils import is_quantized" to except ImportError as
e and log the fallback (using the module logger or logging.getLogger(__name__))
consistent with the other optional-import patterns in this file.
In `@src/megatron/bridge/recipes/__init__.py`:
- Around line 21-27: The current single try/except around "from
megatron.bridge.diffusion.recipes.flux.flux import *" and "from
megatron.bridge.diffusion.recipes.wan.wan import *" hides one optional family if
the other fails; split them into independent guarded imports so a
ModuleNotFoundError in flux doesn't suppress wan (and vice versa) by wrapping
each import in its own try/except (catch ModuleNotFoundError and pass) and keep
the same module names (flux, wan) so exports remain correct.
In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py`:
- Around line 737-739: cu_seqlens_argmin is created with dtype=torch.int32 but
downstream code (and vlm_step.py) expects int64; change the creation of
cu_seqlens_argmin from torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
to use dtype=torch.int64 so its .item()/host handling is consistent with
vlm_step.py; ensure related tensors (e.g., max_lengths if needed) remain
unchanged unless they also require int64.
In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py`:
- Around line 29-32: Replace the broad "except Exception" guards around the
Qwen3VLProcessor import and the runtime .from_pretrained() load with narrow
exception types and record a log when falling back: for the import block
catching Qwen3VLProcessor, catch only ImportError and ModuleNotFoundError and
log the exception and that the processor is unavailable; for the runtime load
around Qwen3VLProcessor.from_pretrained(), catch AttributeError (missing
submodule) and expected load errors such as OSError/RuntimeError (or ValueError)
and log the exception and that you're using the fallback path; reference
Qwen3VLProcessor and its from_pretrained() call when implementing these changes
so the handlers are applied to the correct places.
In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 393-417: The docstring for qwen35_vl_35b_a3b_sft_energon_config
incorrectly claims "enables batch-level packing by default (THD path)" while the
implementation sets cfg.dataset.batch_level_packing = False (and also disables
cfg.dataset.pack_sequences_in_batch); either update the docstring to state that
both in-batch and batch-level packing are disabled by default and must be
enabled via CLI overrides, or change the default in the function to
cfg.dataset.batch_level_packing = True (and adjust packing_buffer_size/policy
comments accordingly) so the code matches the docstring; ensure references to
cfg.dataset.pack_sequences_in_batch, cfg.dataset.batch_level_packing, and
cfg.dataset.packing_buffer_size are consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 48c45fd4-5bfc-4e7e-957c-c94172451c57
📒 Files selected for processing (17)

- 3rdparty/Megatron-LM
- src/megatron/bridge/data/energon/energon_provider.py
- src/megatron/bridge/models/conversion/auto_bridge.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py
- src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py
- src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
- src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
- src/megatron/bridge/recipes/__init__.py
- src/megatron/bridge/recipes/qwen_vl/__init__.py
- src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py
- src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
- src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
- src/megatron/bridge/training/config.py
- src/megatron/bridge/training/utils/packed_seq_utils.py
- src/megatron/bridge/training/vlm_step.py
```python
cu_seqlens_argmin_list = [len(cu) for cu in cu_lengths_list]
cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
max_lengths = torch.tensor(max_lengths_list, dtype=torch.int32)
```

`cu_seqlens_argmin` dtype should be `torch.int64` to match downstream expectations.

The downstream code in vlm_step.py (lines 122-125) expects `cu_seqlens_argmin` on host with a dtype compatible with `.item()` returning an integer. While int32 works, vlm_step.py line 414 explicitly creates it as `torch.int64`. For consistency:

Suggested fix

```diff
- cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
+ cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int64)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
cu_seqlens_argmin_list = [len(cu) for cu in cu_lengths_list]
cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int64)
max_lengths = torch.tensor(max_lengths_list, dtype=torch.int32)
```
Thanks for the suggestion. cu_seqlens_argmin is currently consumed via .item() / host-side scalar reads, where torch.int32 and torch.int64 are functionally equivalent. In this code path, we intentionally keep int32 to stay consistent with surrounding packed-seq metadata tensors and avoid unnecessary dtype conversions. So here, changing to int64 is a consistency preference, not a functional requirement. If we later standardize all sequence-length/index metadata to int64 across the full pipeline, we can update this in one pass.
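The functional-equivalence claim is easy to verify: for host-side scalar reads, `.item()` yields the same plain Python `int` regardless of whether the tensor is stored as int32 or int64:

```python
# Quick check: .item() on an int32 tensor and an int64 tensor both return
# an ordinary Python int with the same value.
import torch

a = torch.tensor([4], dtype=torch.int32)
b = torch.tensor([4], dtype=torch.int64)
assert a[0].item() == b[0].item() == 4
print(type(a[0].item()).__name__)  # -> int
```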
The PR description image shows four training metric charts for the Qwen 35B-A3B model, comparing three runs: qwen35vl_35B-A3B_bin0_mbs1_tp1cp1pp1ep4_energon_cord (pink), qwen35vl_35B-A3B_bin0_mbs1_tp1cp1pp1ep4_energon_cord_nopa (purple), and qwen35vl_35B-A3B_mbs4_tp1cp1pp1ep8 (brown). The charts show lm loss, mtp_1 loss, grad-norm, and global_load_balancing_loss over ~150 steps. The lm and mtp losses converge smoothly and comparably across all three runs. The grad-norm shows some early spikes but stabilizes similarly. The global_load_balancing_loss shows more differentiation — the brown (non-Energon) run has a faster declining curve starting around step 25, while the two Energon runs track closely together with a slightly higher and slower decline.
@DAISY-gh, that's a fair design rationale — since cu_seqlens_argmin is only consumed via .item() / host-side scalar reads in this path, int32 is functionally equivalent to int64 and keeping it consistent with the surrounding int32 packed-seq metadata (e.g., padded_cu, max_lengths) avoids unnecessary conversions. Acknowledged, and I'll note this for future reviews.
✏️ Learnings added
Learnt from: DAISY-gh
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 3323
File: src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py:737-739
Timestamp: 2026-04-14T13:42:22.684Z
Learning: In NVIDIA-NeMo/Megatron-Bridge (task_encoder.py), `cu_seqlens_argmin` for Qwen-VL Energon packed batches is intentionally created as `torch.int32` (not `torch.int64`) to stay consistent with surrounding packed-sequence metadata tensors (`padded_cu`, `max_lengths`). It is only consumed via `.item()` / host-side scalar reads where `int32` and `int64` are functionally equivalent. Do not flag the `int32` dtype as a bug or inconsistency.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: cuichenx
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 3100
File: src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py:196-270
Timestamp: 2026-04-01T22:37:23.790Z
Learning: In NVIDIA-NeMo/Megatron-Bridge, treat Qwen3.5 model HuggingFace identifiers (e.g., "Qwen/Qwen3.5-9B", "Qwen/Qwen3.5-35B-A3B", "Qwen/Qwen3.5-122B-A10B", "Qwen/Qwen3.5-397B-A17B") as vision-language models even though the identifier text does not include "VL". When review feedback suggests these are text-only due to the model name, do not flag them as incorrect if they are used as VLM pretrain/SFT/PEFT `hf_path` values (as in `qwen35_vl.py` and its `qwen35_vl_*` functions).
Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2007
File: src/megatron/bridge/data/mimo/dataset.py:158-166
Timestamp: 2026-01-28T20:41:39.210Z
Learning: In Bridge Megatron-Core datasets, use torch.ones_like(input_ids) for attention masks instead of masking padding tokens. The Megatron-Core attention kernel handles causal masking internally during the forward pass, not at the dataset level. This pattern should apply broadly to Bridge datasets: SFT Dataset uses get_attention_mask_from_fusion=True (setting attention_mask=None), VLM Providers use skip_getting_attention_mask_from_dataset=True, and Qwen VL uses torch.ones_like(input_ids). Ensure dataset code consistently generates input_ids with a corresponding attention_mask of ones (or None where the kernel derives it) and verify through the dataset and forward-pass configuration that the model relies on the kernel's internal masking.
Learnt from: yaoyu-33
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2342
File: src/megatron/bridge/models/qwen_omni/thinker_model.py:326-345
Timestamp: 2026-02-13T00:22:54.549Z
Learning: In Python implementations (e.g., src/megatron/bridge/models/qwen_omni/thinker_model.py), when a feature is not supported (such as audio embeddings), raise an explicit error (e.g., NotImplementedError) instead of silently ignoring the input to fail fast with a clear message. Use a descriptive exception message that identifies the unsupported feature and the expected behavior.
Pass a THD padding_mask derived from lm_input_ids into GPTModel so MoE router aux/global-aux metrics exclude tail padding and match BSHD token-count semantics. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Allow Qwen3VLGPTModel and Qwen3VLTransformerBlock to accept and propagate padding_mask so THD MoE routing can consume valid-token masking without crashing. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Forward padding_mask from Qwen3VLGPTModel into _postprocess so MTP MoE global aux-loss uses masked token accounting consistently. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Point submodule to the main-based Megatron-LM branch commit that propagates padding_mask through MTP MoE routing. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Guard diffusion recipe imports in recipes.__init__ so missing optional dependencies like diffusers do not block LLM/VLM training startup. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
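The guarded-import approach this commit describes, combined with the review's request to split the guards per recipe family, can be sketched as below. The module paths come from the review thread; the comments are illustrative:

```python
# Sketch: independent try/except per optional recipe family, so a missing
# `diffusers` dependency for one family does not hide the other.
try:
    from megatron.bridge.diffusion.recipes.flux.flux import *  # noqa: F401,F403
except ModuleNotFoundError:
    pass  # flux recipes are optional

try:
    from megatron.bridge.diffusion.recipes.wan.wan import *  # noqa: F401,F403
except ModuleNotFoundError:
    pass  # wan recipes are optional
```

A single try block wrapping both imports would skip the wan recipes whenever the flux import fails first, even if wan's dependencies are present.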
Avoid import-time crashes in training environments missing optional diffusion/modelopt/transformers pieces by adding graceful fallbacks for quantization and processor loading. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
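One of the graceful fallbacks this commit describes, narrowed per the review feedback, might look like the following sketch. The import path is taken from the review thread; the warning text and fallback body are illustrative assumptions, not the actual auto_bridge.py code:

```python
# Sketch: optional modelopt import with a logged fallback, catching only
# ImportError so real bugs in modelopt are not masked.
import logging

logger = logging.getLogger(__name__)

try:
    from modelopt.torch.quantization.utils import is_quantized
except ImportError as e:  # only the missing-dependency case
    logger.warning("modelopt unavailable (%s); quantization detection disabled", e)

    def is_quantized(module):
        # Without modelopt we cannot detect quantization; report not-quantized.
        return False
```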
Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…cessor fallback Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…step Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…sync TaskEncoder seq_len from CLI override - QwenVLTaskBatch: pass __key__ and __restore_key__ to satisfy Energon Batch base class - energon_provider: sync TaskEncoder.seq_len from EnergonProvider.seq_length at build time so CLI overrides (e.g. dataset.seq_length=8192) propagate to packing capacity and padding Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…D format Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…ng segment to avoid GDN NaN Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…schedule.py convention Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…oss regression Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Pass position_ids=None for VLM batches so Qwen3VLModel runs its packed MRoPE path, and add short-lived debug logs to verify packed boundary position resets are actually applied. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add local helper functions used by THD diagnostic logging after cherry-picking THD MRoPE fixes, preventing runtime NameError for _thd_diag_enabled/_rank0. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add missing Energon batch key fields and normalize/concat visual blocks along token dimension so packed multimodal batches no longer fail on mixed visual lengths. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Drop hardcoded first-iteration debug dumps from vlm_step and Qwen3VL MRoPE path so training logs stay clean while preserving core packed-sequence behavior. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Route THD diagnostic logs behind THD_DIAG and compute image token counts from the active model config so IterStats reflects real tokenizer ids. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add switch-gated low-content packed-sample diagnostics and gate remaining THD debug logs so all related debug output is controlled by THD_DIAG. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Connect THD_FORCE_BSHD, THD_FORCE_SINGLE_SEGMENT_CU, and THD_SKIP_PREPROCESS_PACKED_POS to actual vlm_step control flow, and log switch activation only when enabled. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Keep unpadded cu_seqlens for MRoPE sub-sequence semantics while passing padded boundaries for TE/GDN kernels, and limit THD switch logs to rank0. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Avoid switching to the unstable unpadded TE packed path by keeping PackedSeqParams padded-only, and pass unpadded cu_seqlens separately for Qwen MRoPE sub-sequence indexing. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add rank0 THD_DIAG logs to correlate packed boundaries, MoE padding mask counts, and loss-mask valid tokens without changing training behavior. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Build MoE padding mask from packed boundaries and pass it explicitly through Qwen3VL instead of inferring padding from lm_input_ids.eq(0), with fallback kept for compatibility. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Ensure packed mask remapping uses boolean indexing semantics when preprocessing moe_padding_mask, preventing explicit THD padding mask from collapsing to zero. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add a default-off THD_DIAG_ALIGN switch to log pre/post packed remap summaries for labels and loss_mask, so we can directly verify whether THD preprocessing preserves supervision alignment. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add opt-in THD MRoPE pre/post remap diagnostics behind THD_DIAG_MROPE to inspect position semantics without affecting default behavior. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Use boolean attention masks at all THD preprocess_packed_seqs call sites and add a function-level bool cast guard to prevent silent positional remap corruption. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add opt-in strict diagnostics to compare THD-remapped position ids against explicit bool-mask gather semantics for cp_size==1. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add opt-in THD boundary diagnostics to compare kernel vs rope cu summaries and report boundary-window label/loss/moe mask consistency, while keeping all new logic disabled by default. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Document how argmin-based trimming and padded boundary fields are used so THD packed-kernel boundary expectations are explicit. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Allow EnergonProvider to accept and forward shuffle_buffer_size so training scripts can explicitly control sample shuffling behavior in THD/BSHD comparison runs. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Apply nodebug recipe/provider updates while intentionally keeping the Megatron-LM submodule pointer unchanged. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Introduce independent batch-level packing control for THD dataloader, keep in-batch packing semantics isolated, and update validation/step logic so the two modes cannot be mixed. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Keep both in-batch and THD batch-level packing disabled by default, requiring explicit CLI overrides to enable either mode and avoid unintended coupling. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Force-pushed from 8d8b782 to 30106c3
Narrow optional-import exception scopes, add fallback warnings, split diffusion optional imports, and align the qwen35 energon docstring with actual default packing behavior. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…level packing Use safe cu_seqlens tail trimming when argmin metadata is absent, and prevent qwen3_vl_step from enabling batch-level packing before full THD metadata wiring. Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Summary
Summary by CodeRabbit
Release Notes
New Features
Improvements