
[main] qwen-vl THD packed-sequence support and fixes #3323

Open
DAISY-gh wants to merge 40 commits into NVIDIA-NeMo:main from DAISY-gh:thd-support-main

Conversation

Contributor

@DAISY-gh DAISY-gh commented Apr 14, 2026

Summary

[image] Training curves of SFT on the cord_v2 dataset, based on Qwen3.5 35B-A3B.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added batch-level sequence packing support for improved training efficiency in data processing pipelines.
    • Introduced new Qwen3.5-VL training recipes with Energon dataset integration and configurable packing options.
  • Improvements

    • Enhanced robustness by making optional dependencies (modelopt, diffusion modules) gracefully degrade when unavailable.
    • Improved multi-modal model training with better position computation and token accounting for mixture-of-experts layers.
    • Extended data provider configuration with new shuffling and packing buffer parameters for flexible batch preparation.


copy-pr-bot bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

This update introduces batch-level sequence packing support for Qwen Vision Language models with THD (total-tokens, heads, head-dim packed layout) diagnostics infrastructure, adds optional dependency handling for diffusion and modelopt modules, and extends Energon dataset configuration for flexible path resolution and packing strategies.

Changes

Cohort / File(s) Summary
Submodule Reference
3rdparty/Megatron-LM
Updated Megatron-LM vendored submodule commit reference from 980211ae... to 9da0ee7b....
Dependency Management
src/megatron/bridge/models/conversion/auto_bridge.py, src/megatron/bridge/recipes/__init__.py, src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
Added guarded imports with fallback behavior: is_quantized defaults to False on import failure; diffusion recipes wrapped in try/except; Qwen3VLProcessor falls back to AutoProcessor when unavailable. Enables graceful degradation when optional dependencies are absent.
Energon Provider Configuration
src/megatron/bridge/data/energon/energon_provider.py
Added batch-level packing fields (batch_level_packing, packing_buffer_size, shuffle_buffer_size) and bin-based path resolution (cord_bins_root, cord_bin_prefix, cord_bin_id). Implements computed resolved_path logic and synchronizes task encoder sequence lengths when applicable.
Packed Sequence Core Logic
src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py
Introduced QwenVLTaskSamplePacked dataclass, greedy knapsack binpacking algorithm, and packing metadata fields (cu_lengths, max_lengths, cu_seqlens_argmin) to QwenVLTaskBatch. Implements select_samples_to_pack() and pack_selected_samples() with optional THD diagnostics.
Qwen3VL Model Extensions
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
Added rope_cu_seqlens and moe_padding_mask parameters, THD-gated diagnostics infrastructure, packed RoPE/position-id computation from cu_seqlens splits, and MoE padding mask derivation. Includes alignment and multi-rope (MRoPE) diagnostics for packed sequences.
Qwen3VL Model Integration
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py, src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py
Threaded padding_mask parameter through text model and transformer block forward passes, including checkpointed activation paths; ensures mask propagation into decoder, postprocess, and layer invocations.
Packed Sequence Utilities
src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py, src/megatron/bridge/training/utils/packed_seq_utils.py
Converted attention_mask to boolean in preprocess_packed_seqs for correct indexing semantics. Updated get_packed_seq_params to compute padded/unpadded CU boundaries and generate cu_seqlens_q_padded/cu_seqlens_kv_padded when unpadded variant unavailable.
Model Provider and Configuration
src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py, src/megatron/bridge/recipes/qwen_vl/__init__.py, src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
Added _patch_mtp_attention_specs() helper and cached MTP spec computation; introduced new Energon-backed SFT configs (qwen35_vl_35b_a3b_sft_energon_config, qwen35_vl_35b_a3b_sft_energon_nopack_config); exported new config symbols in __all__.
Step and Training Integration
src/megatron/bridge/training/vlm_step.py, src/megatron/bridge/training/config.py, src/megatron/bridge/recipes/qwen_vl/qwen3_vl_step.py
Added Energon pre-packed fast path in vlm_step, THD feature gates (THD_DIAG, THD_FORCE_BSHD, etc.), unpadded CU boundary derivation, and MoE padding mask generation. Implemented mutual-exclusivity validation between in-batch and batch-level packing; unified packing enablement logic across both modes.
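Several entries above revolve around cu_seqlens metadata (padded vs. unpadded cumulative boundaries). As orientation, here is a minimal pure-Python sketch of how the two boundary variants relate for a packed batch; the helper name and rounding policy are illustrative assumptions, not the PR's actual code:

```python
from itertools import accumulate

def build_cu_seqlens(seq_lengths, pad_to=None):
    """Cumulative boundaries for sequences packed back-to-back (THD layout).

    seq_lengths: per-sample token counts, packed in order.
    pad_to: optional alignment; each segment is rounded up to this
    multiple for the 'padded' boundaries, while the raw lengths give
    the 'unpadded' ones.
    Returns (cu_unpadded, cu_padded); both start with 0 by construction.
    """
    cu_unpadded = [0] + list(accumulate(seq_lengths))
    if pad_to is None:
        return cu_unpadded, cu_unpadded
    padded = [((n + pad_to - 1) // pad_to) * pad_to for n in seq_lengths]
    cu_padded = [0] + list(accumulate(padded))
    return cu_unpadded, cu_padded

# Three samples of 5, 3, and 7 tokens, aligned to multiples of 4:
unpadded, padded = build_cu_seqlens([5, 3, 7], pad_to=4)
# unpadded -> [0, 5, 8, 15]; padded -> [0, 8, 12, 20]
```

When only the padded variant is available, a utility like get_packed_seq_params can derive attention boundaries from it, which is the situation the packed-sequence utilities row above describes.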

Sequence Diagram(s)

sequenceDiagram
    participant Energon as Energon Provider
    participant TaskEnc as Task Encoder
    participant Step as VLM Step
    participant Model as Qwen3VL Model

    rect rgba(100, 150, 200, 0.5)
        Note over Energon,Model: Unpacked Path (Standard)
        Energon->>TaskEnc: samples (unpacked)
        TaskEnc->>TaskEnc: batch() → QwenVLTaskBatch<br/>(no packing metadata)
        Step->>Step: forward_step() → standard<br/>attention_mask, position_ids
        Step->>Model: forward(position_ids, attention_mask)
    end

    rect rgba(150, 200, 100, 0.5)
        Note over Energon,Model: Energon Pre-Packed Path (THD)
        Energon->>TaskEnc: samples (packed)<br/>or batch_level_packing=True
        TaskEnc->>TaskEnc: select_samples_to_pack()<br/>+ pack_selected_samples()
        TaskEnc->>TaskEnc: batch() → QwenVLTaskBatch<br/>(cu_lengths, cu_seqlens_argmin)
        Step->>Step: forward_step() detects cu_seqlens<br/>in batch → fast return<br/>cu_seqlens, moe_padding_mask
        Step->>Model: forward(rope_cu_seqlens,<br/>moe_padding_mask, position_ids=None)
        Model->>Model: derive MRoPE positions<br/>from cu_seqlens
    end
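The "derive MRoPE positions from cu_seqlens" step in the packed path above amounts to restarting position ids at every segment boundary. A minimal text-only sketch (the real model additionally derives the multi-axis MRoPE components for vision tokens):

```python
def positions_from_cu_seqlens(cu_seqlens):
    """Restart position ids at each packed-segment boundary.

    cu_seqlens = [0, 5, 8] describes two packed sequences of lengths
    5 and 3; each sequence gets positions counted from zero, instead
    of positions running continuously across the packed buffer.
    """
    positions = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        positions.extend(range(end - start))
    return positions

positions_from_cu_seqlens([0, 5, 8])
# -> [0, 1, 2, 3, 4, 0, 1, 2]
```

Without this reset, rotary embeddings for the second and later segments would be computed as if they continued the first sequence.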

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.96%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes | ⚠️ Warning | The PR introduces major changes affecting numerics, performance, and stability, but explicitly states testing has not yet been executed and provides no test results or regression documentation. | Execute the documented test plan (unit tests, pre-commit checks, convergence validation, and performance measurements) before merging.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title 'qwen-vl THD packed-sequence support and fixes' accurately reflects the main changes: THD packed-sequence infrastructure for Qwen-VL, including packing support, padding masks, MRoPE fixes, and related configuration updates.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)

251-265: ⚠️ Potential issue | 🟠 Major

batch_level_packing is enabled before the THD metadata is actually wired through.

This path now turns on packed handling for dataset.batch_level_packing, but it still re-synthesizes PackedSeqParams, rewrites the mask to all ones, and never forwards the new padding_mask / rope-MoE boundary metadata. THD batches will therefore keep counting alignment padding as real tokens on this Qwen3-specific step. Mirror the metadata flow in src/megatron/bridge/training/vlm_step.py before enabling this mode here.

Also applies to: 280-292

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 251 - 265,
The code enables batch_level_packing (via dataset.batch_level_packing and
variable batch_level_pack_enabled) before the THD/packed-sequence metadata is
propagated, but then calls pack_or_pad_batch_sequences which re-synthesizes
PackedSeqParams, overwrites masks (loss_mask/attention_mask) and never forwards
the original padding_mask / rope-MoE boundary metadata, causing THD batches to
treat alignment padding as real tokens; fix by mirroring the metadata flow from
src/megatron/bridge/training/vlm_step.py: when batch_level_packing (and
in_batch_pack_enabled) is true, do not re-create PackedSeqParams
blindly—retrieve and forward the existing PackedSeqParams/padding_mask/rope-MoE
boundary metadata from this_pg_collection (or the THD batch metadata), ensure
pack_or_pad_batch_sequences is invoked in a mode that preserves and returns the
original packed_seq_params and padding_mask, and wire those returned values into
subsequent steps so that batch_level_packing is only enabled after the proper
metadata has been wired through.
src/megatron/bridge/training/utils/packed_seq_utils.py (1)

46-57: ⚠️ Potential issue | 🔴 Critical

The torch.argmin fallback trims every cu_seqlens* tensor to empty.

cu_seqlens starts with 0 by construction, so torch.argmin(...) returns the first element, not the first sentinel tail. Any batch that reaches this fallback path will build empty boundary arrays and invalid PackedSeqParams.

Suggested change
     if cu_seqlens_argmin is not None:
         cu_seqlens_padded = cu_seqlens_padded[: cu_seqlens_argmin.item()]
     else:
-        cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded)]
+        padded_tail = torch.nonzero(cu_seqlens_padded[1:] == 0, as_tuple=False)
+        first_tail = padded_tail[0].item() + 1 if padded_tail.numel() else cu_seqlens_padded.numel()
+        cu_seqlens_padded = cu_seqlens_padded[:first_tail]
 
     if cu_seqlens_unpadded is not None:
         if cu_seqlens_unpadded_argmin is not None:
             cu_seqlens_unpadded = cu_seqlens_unpadded[: cu_seqlens_unpadded_argmin.item()]
         else:
-            cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded)]
+            unpadded_tail = torch.nonzero(cu_seqlens_unpadded[1:] == 0, as_tuple=False)
+            first_tail = unpadded_tail[0].item() + 1 if unpadded_tail.numel() else cu_seqlens_unpadded.numel()
+            cu_seqlens_unpadded = cu_seqlens_unpadded[:first_tail]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/training/utils/packed_seq_utils.py` around lines 46 - 57,
The fallback use of torch.argmin on cu_seqlens_padded and cu_seqlens_unpadded
yields 0 (because they start with 0) and trims tensors to empty; replace both
torch.argmin fallbacks with logic that finds the start index of the
sentinel/tail region (e.g., the first index whose value equals the final
cumulative value or otherwise the first index where values stop increasing) and
slice up to that index; update the branches that reference cu_seqlens_padded,
cu_seqlens_argmin, cu_seqlens_unpadded, and cu_seqlens_unpadded_argmin to use
this computed tail_start instead of torch.argmin to avoid producing empty
boundary arrays.
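The failure mode is easy to reproduce: because every cu_seqlens tensor begins with 0, argmin over it points at index 0, so slicing up to argmin always yields an empty array. The demo below uses pure Python lists rather than torch tensors, and the fixed helper is one plausible implementation of the review's suggestion (find the first zero after position 0, which marks the padded sentinel tail):

```python
def trim_with_argmin(cu_seqlens):
    # Buggy fallback: cu_seqlens starts with 0, so argmin is index 0
    # and the slice below is always empty.
    argmin = min(range(len(cu_seqlens)), key=cu_seqlens.__getitem__)
    return cu_seqlens[:argmin]

def trim_to_sentinel_tail(cu_seqlens):
    # Fixed: scan for the first zero AFTER position 0; zeros there mark
    # the sentinel tail padded onto fixed-size metadata tensors.
    for i, value in enumerate(cu_seqlens[1:], start=1):
        if value == 0:
            return cu_seqlens[:i]
    return cu_seqlens

cu = [0, 5, 8, 15, 0, 0]   # three real segments plus zero padding
trim_with_argmin(cu)        # -> [] (empty boundaries, invalid PackedSeqParams)
trim_to_sentinel_tail(cu)   # -> [0, 5, 8, 15]
```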
🧹 Nitpick comments (3)
src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py (3)

470-470: Remove unused merge_length variable.

The merge_length variable is assigned but never used in pack_selected_samples().

Suggested fix
-        merge_length = self.merge_size ** 2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py` at line
470, In pack_selected_samples(), remove the unused local variable merge_length
(assignment: merge_length = self.merge_size ** 2) since it is never referenced;
simply delete that line from the pack_selected_samples function (reference
symbols: merge_length, self.merge_size, pack_selected_samples) so no unused
variable remains.

528-531: Rename ambiguous loop variable l to length.

Single-letter l is easily confused with 1 (one). Use a descriptive name.

Suggested fix
-        for l in sub_lengths:
-            cu_lengths_list.append(cu_lengths_list[-1] + l)
+        for length in sub_lengths:
+            cu_lengths_list.append(cu_lengths_list[-1] + length)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py` around
lines 528 - 531, The loop uses an ambiguous single-letter variable `l` when
iterating `sub_lengths`; rename it to a descriptive name like `length` in the
loop that builds `cu_lengths_list` (the block that defines `cu_lengths_list =
[0]`, iterates over `sub_lengths`, appends cumulative sums, and constructs
`cu_lengths`), updating the iteration variable wherever referenced to avoid
confusion with the digit `1`.

68-104: Add strict=True to zip() for safer iteration.

The zip() at line 85 should include strict=True to catch length mismatches between item_sizes and samples, which would indicate a bug in the caller.

Suggested fix
-    sorted_sizes, sorted_samples = zip(*sorted(zip(item_sizes, samples), key=lambda x: x[0]))
+    sorted_sizes, sorted_samples = zip(*sorted(zip(item_sizes, samples, strict=True), key=lambda x: x[0]), strict=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py` around
lines 68 - 104, The zip() call that builds sorted_sizes and sorted_samples in
greedy_knapsack should use strict=True to catch any unexpected length mismatches
between item_sizes and samples; update the unzip step (currently: sorted_sizes,
sorted_samples = zip(*sorted(zip(item_sizes, samples), key=lambda x: x[0]))) to
pass strict=True to the inner zip so the tuple unpacking of sorted_sizes and
sorted_samples will raise immediately if lengths differ, keeping the existing
sort and subsequent logic (sorted_sizes, sorted_samples, _search_for_fit,
popping) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/conversion/auto_bridge.py`:
- Around line 53-59: The optional import of is_quantized currently uses a broad
except Exception which can mask real errors; change the except to catch
ImportError only and emit a fallback log message including the exception reason
before defining the fallback is_quantized function so callers know quantization
detection was unavailable; update the try/except around "from
modelopt.torch.quantization.utils import is_quantized" to except ImportError as
e and log the fallback (using the module logger or logging.getLogger(__name__))
consistent with the other optional-import patterns in this file.

In `@src/megatron/bridge/recipes/__init__.py`:
- Around line 21-27: The current single try/except around "from
megatron.bridge.diffusion.recipes.flux.flux import *" and "from
megatron.bridge.diffusion.recipes.wan.wan import *" hides one optional family if
the other fails; split them into independent guarded imports so a
ModuleNotFoundError in flux doesn't suppress wan (and vice versa) by wrapping
each import in its own try/except (catch ModuleNotFoundError and pass) and keep
the same module names (flux, wan) so exports remain correct.

In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py`:
- Around line 737-739: cu_seqlens_argmin is created with dtype=torch.int32 but
downstream code (and vlm_step.py) expects int64; change the creation of
cu_seqlens_argmin from torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
to use dtype=torch.int64 so its .item()/host handling is consistent with
vlm_step.py; ensure related tensors (e.g., max_lengths if needed) remain
unchanged unless they also require int64.

In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py`:
- Around line 29-32: Replace the broad "except Exception" guards around the
Qwen3VLProcessor import and the runtime .from_pretrained() load with narrow
exception types and record a log when falling back: for the import block
catching Qwen3VLProcessor, catch only ImportError and ModuleNotFoundError and
log the exception and that the processor is unavailable; for the runtime load
around Qwen3VLProcessor.from_pretrained(), catch AttributeError (missing
submodule) and expected load errors such as OSError/RuntimeError (or ValueError)
and log the exception and that you're using the fallback path; reference
Qwen3VLProcessor and its from_pretrained() call when implementing these changes
so the handlers are applied to the correct places.

In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 393-417: The docstring for qwen35_vl_35b_a3b_sft_energon_config
incorrectly claims "enables batch-level packing by default (THD path)" while the
implementation sets cfg.dataset.batch_level_packing = False (and also disables
cfg.dataset.pack_sequences_in_batch); either update the docstring to state that
both in-batch and batch-level packing are disabled by default and must be
enabled via CLI overrides, or change the default in the function to
cfg.dataset.batch_level_packing = True (and adjust packing_buffer_size/policy
comments accordingly) so the code matches the docstring; ensure references to
cfg.dataset.pack_sequences_in_batch, cfg.dataset.batch_level_packing, and
cfg.dataset.packing_buffer_size are consistent.

---

Outside diff comments:
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py`:
- Around line 251-265: The code enables batch_level_packing (via
dataset.batch_level_packing and variable batch_level_pack_enabled) before the
THD/packed-sequence metadata is propagated, but then calls
pack_or_pad_batch_sequences which re-synthesizes PackedSeqParams, overwrites
masks (loss_mask/attention_mask) and never forwards the original padding_mask /
rope-MoE boundary metadata, causing THD batches to treat alignment padding as
real tokens; fix by mirroring the metadata flow from
src/megatron/bridge/training/vlm_step.py: when batch_level_packing (and
in_batch_pack_enabled) is true, do not re-create PackedSeqParams
blindly—retrieve and forward the existing PackedSeqParams/padding_mask/rope-MoE
boundary metadata from this_pg_collection (or the THD batch metadata), ensure
pack_or_pad_batch_sequences is invoked in a mode that preserves and returns the
original packed_seq_params and padding_mask, and wire those returned values into
subsequent steps so that batch_level_packing is only enabled after the proper
metadata has been wired through.

In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 46-57: The fallback use of torch.argmin on cu_seqlens_padded and
cu_seqlens_unpadded yields 0 (because they start with 0) and trims tensors to
empty; replace both torch.argmin fallbacks with logic that finds the start index
of the sentinel/tail region (e.g., the first index whose value equals the final
cumulative value or otherwise the first index where values stop increasing) and
slice up to that index; update the branches that reference cu_seqlens_padded,
cu_seqlens_argmin, cu_seqlens_unpadded, and cu_seqlens_unpadded_argmin to use
this computed tail_start instead of torch.argmin to avoid producing empty
boundary arrays.

---

Nitpick comments:
In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py`:
- Line 470: In pack_selected_samples(), remove the unused local variable
merge_length (assignment: merge_length = self.merge_size ** 2) since it is never
referenced; simply delete that line from the pack_selected_samples function
(reference symbols: merge_length, self.merge_size, pack_selected_samples) so no
unused variable remains.
- Around line 528-531: The loop uses an ambiguous single-letter variable `l`
when iterating `sub_lengths`; rename it to a descriptive name like `length` in
the loop that builds `cu_lengths_list` (the block that defines `cu_lengths_list
= [0]`, iterates over `sub_lengths`, appends cumulative sums, and constructs
`cu_lengths`), updating the iteration variable wherever referenced to avoid
confusion with the digit `1`.
- Around line 68-104: The zip() call that builds sorted_sizes and sorted_samples
in greedy_knapsack should use strict=True to catch any unexpected length
mismatches between item_sizes and samples; update the unzip step (currently:
sorted_sizes, sorted_samples = zip(*sorted(zip(item_sizes, samples), key=lambda
x: x[0]))) to pass strict=True to the inner zip so the tuple unpacking of
sorted_sizes and sorted_samples will raise immediately if lengths differ,
keeping the existing sort and subsequent logic (sorted_sizes, sorted_samples,
_search_for_fit, popping) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 48c45fd4-5bfc-4e7e-957c-c94172451c57

📥 Commits

Reviewing files that changed from the base of the PR and between ad27e2c and e9ff163.

📒 Files selected for processing (17)
  • 3rdparty/Megatron-LM
  • src/megatron/bridge/data/energon/energon_provider.py
  • src/megatron/bridge/models/conversion/auto_bridge.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/transformer_block.py
  • src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/utils.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
  • src/megatron/bridge/recipes/__init__.py
  • src/megatron/bridge/recipes/qwen_vl/__init__.py
  • src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py
  • src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
  • src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
  • src/megatron/bridge/training/config.py
  • src/megatron/bridge/training/utils/packed_seq_utils.py
  • src/megatron/bridge/training/vlm_step.py

Comment thread src/megatron/bridge/models/conversion/auto_bridge.py
Comment thread src/megatron/bridge/recipes/__init__.py
Comment on lines +737 to +739
cu_seqlens_argmin_list = [len(cu) for cu in cu_lengths_list]
cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
max_lengths = torch.tensor(max_lengths_list, dtype=torch.int32)
Contributor

@coderabbitai coderabbitai bot Apr 14, 2026


⚠️ Potential issue | 🟡 Minor

cu_seqlens_argmin dtype should be torch.int64 to match downstream expectations.

The downstream code in vlm_step.py (line 122-125) expects cu_seqlens_argmin on host with dtype compatible with .item() returning an integer. While int32 works, vlm_step.py line 414 explicitly creates it as torch.int64. For consistency:

Suggested fix
-            cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
+            cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int64)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 cu_seqlens_argmin_list = [len(cu) for cu in cu_lengths_list]
-cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
+cu_seqlens_argmin = torch.tensor(cu_seqlens_argmin_list, dtype=torch.int64)
 max_lengths = torch.tensor(max_lengths_list, dtype=torch.int32)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py` around
lines 737 - 739, cu_seqlens_argmin is created with dtype=torch.int32 but
downstream code (and vlm_step.py) expects int64; change the creation of
cu_seqlens_argmin from torch.tensor(cu_seqlens_argmin_list, dtype=torch.int32)
to use dtype=torch.int64 so its .item()/host handling is consistent with
vlm_step.py; ensure related tensors (e.g., max_lengths if needed) remain
unchanged unless they also require int64.

Contributor Author


Thanks for the suggestion. cu_seqlens_argmin is currently consumed via .item() / host-side scalar reads, where torch.int32 and torch.int64 are functionally equivalent. In this code path, we intentionally keep int32 to stay consistent with surrounding packed-seq metadata tensors and avoid unnecessary dtype conversions. So here, changing to int64 is a consistency preference, not a functional requirement. If we later standardize all sequence-length/index metadata to int64 across the full pipeline, we can update this in one pass.
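The equivalence the author relies on (reading a single value back yields the same Python int regardless of the storage width, as long as values stay within int32 range) can be illustrated with a stdlib analogue; the same holds for torch tensors read via .item():

```python
from array import array

# 'i' is a C int (typically 32-bit), 'q' a C long long (64-bit).
# Host-side scalar reads are identical for both widths, mirroring how
# cu_seqlens_argmin.item() behaves the same for int32 and int64
# while values remain below 2**31.
a32 = array("i", [7])
a64 = array("q", [7])
a32[0] == a64[0]  # -> True
```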

Contributor


The PR description image shows four training metric charts for the Qwen 35B-A3B model, comparing three runs: qwen35vl_35B-A3B_bin0_mbs1_tp1cp1pp1ep4_energon_cord (pink), qwen35vl_35B-A3B_bin0_mbs1_tp1cp1pp1ep4_energon_cord_nopa (purple), and qwen35vl_35B-A3B_mbs4_tp1cp1pp1ep8 (brown). The charts show lm loss, mtp_1 loss, grad-norm, and global_load_balancing_loss over ~150 steps. The lm and mtp losses converge smoothly and comparably across all three runs. The grad-norm shows some early spikes but stabilizes similarly. The global_load_balancing_loss shows more differentiation — the brown (non-Energon) run has a faster declining curve starting around step 25, while the two Energon runs track closely together with a slightly higher and slower decline.

@DAISY-gh, that's a fair design rationale — since cu_seqlens_argmin is only consumed via .item() / host-side scalar reads in this path, int32 is functionally equivalent to int64 and keeping it consistent with the surrounding int32 packed-seq metadata (e.g., padded_cu, max_lengths) avoids unnecessary conversions. Acknowledged, and I'll note this for future reviews.


✏️ Learnings added
Learnt from: DAISY-gh
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 3323
File: src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py:737-739
Timestamp: 2026-04-14T13:42:22.684Z
Learning: In NVIDIA-NeMo/Megatron-Bridge (task_encoder.py), `cu_seqlens_argmin` for Qwen-VL Energon packed batches is intentionally created as `torch.int32` (not `torch.int64`) to stay consistent with surrounding packed-sequence metadata tensors (`padded_cu`, `max_lengths`). It is only consumed via `.item()` / host-side scalar reads where `int32` and `int64` are functionally equivalent. Do not flag the `int32` dtype as a bug or inconsistency.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: cuichenx
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 3100
File: src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py:196-270
Timestamp: 2026-04-01T22:37:23.790Z
Learning: In NVIDIA-NeMo/Megatron-Bridge, treat Qwen3.5 model HuggingFace identifiers (e.g., "Qwen/Qwen3.5-9B", "Qwen/Qwen3.5-35B-A3B", "Qwen/Qwen3.5-122B-A10B", "Qwen/Qwen3.5-397B-A17B") as vision-language models even though the identifier text does not include "VL". When review feedback suggests these are text-only due to the model name, do not flag them as incorrect if they are used as VLM pretrain/SFT/PEFT `hf_path` values (as in `qwen35_vl.py` and its `qwen35_vl_*` functions).

Learnt from: aroshanghias-nvd
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2007
File: src/megatron/bridge/data/mimo/dataset.py:158-166
Timestamp: 2026-01-28T20:41:39.210Z
Learning: In Bridge Megatron-Core datasets, use torch.ones_like(input_ids) for attention masks instead of masking padding tokens. The Megatron-Core attention kernel handles causal masking internally during the forward pass, not at the dataset level. This pattern should apply broadly to Bridge datasets: SFT Dataset uses get_attention_mask_from_fusion=True (setting attention_mask=None), VLM Providers use skip_getting_attention_mask_from_dataset=True, and Qwen VL uses torch.ones_like(input_ids). Ensure dataset code consistently generates input_ids with a corresponding attention_mask of ones (or None where the kernel derives it) and verify through the dataset and forward-pass configuration that the model relies on the kernel's internal masking.

Learnt from: yaoyu-33
Repo: NVIDIA-NeMo/Megatron-Bridge PR: 2342
File: src/megatron/bridge/models/qwen_omni/thinker_model.py:326-345
Timestamp: 2026-02-13T00:22:54.549Z
Learning: In Python implementations (e.g., src/megatron/bridge/models/qwen_omni/thinker_model.py), when a feature is not supported (such as audio embeddings), raise an explicit error (e.g., NotImplementedError) instead of silently ignoring the input to fail fast with a clear message. Use a descriptive exception message that identifies the unsupported feature and the expected behavior.

Comment thread src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
Comment thread src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
DAISY-gh added 25 commits April 14, 2026 21:13
Pass a THD padding_mask derived from lm_input_ids into GPTModel so MoE router
aux/global-aux metrics exclude tail padding and match BSHD token-count semantics.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Allow Qwen3VLGPTModel and Qwen3VLTransformerBlock to accept and propagate padding_mask so THD MoE routing can consume valid-token masking without crashing.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Forward padding_mask from Qwen3VLGPTModel into _postprocess so MTP MoE global aux-loss uses masked token accounting consistently.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Point submodule to the main-based Megatron-LM branch commit that propagates padding_mask through MTP MoE routing.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Guard diffusion recipe imports in recipes.__init__ so missing optional dependencies like diffusers do not block LLM/VLM training startup.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Avoid import-time crashes in training environments missing optional diffusion/modelopt/transformers pieces by adding graceful fallbacks for quantization and processor loading.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…cessor fallback

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…step

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…sync TaskEncoder seq_len from CLI override

- QwenVLTaskBatch: pass __key__ and __restore_key__ to satisfy Energon Batch base class
- energon_provider: sync TaskEncoder.seq_len from EnergonProvider.seq_length at build time
  so CLI overrides (e.g. dataset.seq_length=8192) propagate to packing capacity and padding

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
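The build-time sync described above can be sketched as follows; class and attribute names are simplified stand-ins for the real `EnergonProvider` / `TaskEncoder` wiring.

```python
# Hedged sketch (names simplified): propagate a CLI-overridden seq_length
# from the provider into the task encoder at build time, so packing
# capacity and padding use the effective value, not the encoder default.
class TaskEncoderSketch:
    def __init__(self, seq_len=4096):
        self.seq_len = seq_len

class EnergonProviderSketch:
    def __init__(self, seq_length):
        self.seq_length = seq_length  # e.g. set via dataset.seq_length=8192

    def build(self, encoder):
        encoder.seq_len = self.seq_length  # sync override into the encoder
        return encoder
```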
…D format

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…ng segment to avoid GDN NaN

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…schedule.py convention

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…oss regression

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Pass position_ids=None for VLM batches so Qwen3VLModel runs its packed MRoPE path, and add short-lived debug logs to verify packed boundary position resets are actually applied.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add local helper functions used by THD diagnostic logging after cherry-picking THD MRoPE fixes, preventing runtime NameError for _thd_diag_enabled/_rank0.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add missing Energon batch key fields and normalize/concat visual blocks along token dimension so packed multimodal batches no longer fail on mixed visual lengths.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Drop hardcoded first-iteration debug dumps from vlm_step and Qwen3VL MRoPE path so training logs stay clean while preserving core packed-sequence behavior.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Route THD diagnostic logs behind THD_DIAG and compute image token counts from the active model config so IterStats reflects real tokenizer ids.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add switch-gated low-content packed-sample diagnostics and gate remaining THD debug logs so all related debug output is controlled by THD_DIAG.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Connect THD_FORCE_BSHD, THD_FORCE_SINGLE_SEGMENT_CU, and THD_SKIP_PREPROCESS_PACKED_POS to actual vlm_step control flow, and log switch activation only when enabled.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Keep unpadded cu_seqlens for MRoPE sub-sequence semantics while passing padded boundaries for TE/GDN kernels, and limit THD switch logs to rank0.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Avoid switching to the unstable unpadded TE packed path by keeping PackedSeqParams padded-only, and pass unpadded cu_seqlens separately for Qwen MRoPE sub-sequence indexing.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
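The two boundary views kept by these commits can be sketched with the same cumulative-sum helper; plain lists stand in for tensors, and the example lengths are illustrative.

```python
# Hedged sketch: one helper, two views of packed-sequence boundaries.
# Padded lengths feed the TE/GDN attention kernels via PackedSeqParams;
# valid (unpadded) lengths drive MRoPE sub-sequence position resets.
def cu_seqlens(lengths):
    cu = [0]
    for n in lengths:
        cu.append(cu[-1] + n)
    return cu

# Example: two packed samples of valid length 3 and 2, each padded to 4.
cu_padded = cu_seqlens([4, 4])    # boundaries for the attention kernels
cu_unpadded = cu_seqlens([3, 2])  # boundaries for MRoPE position resets
```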
DAISY-gh added 13 commits April 14, 2026 21:13
Add rank0 THD_DIAG logs to correlate packed boundaries, MoE padding mask counts,
and loss-mask valid tokens without changing training behavior.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Build MoE padding mask from packed boundaries and pass it explicitly through Qwen3VL
instead of inferring padding from lm_input_ids.eq(0), with fallback kept for compatibility.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
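Building the mask from boundaries rather than from `input_ids.eq(0)` can be sketched as follows; the helper name is hypothetical and plain lists stand in for tensors.

```python
# Hedged sketch: construct the MoE padding mask from packed boundaries.
# Each segment spans cu_seqlens_padded[i] .. cu_seqlens_padded[i+1]; the
# first `valid` positions are real tokens, the rest are tail padding.
def moe_padding_mask(cu_seqlens_padded, seqlens_valid):
    mask = []
    for i, valid in enumerate(seqlens_valid):
        seg = cu_seqlens_padded[i + 1] - cu_seqlens_padded[i]
        mask.extend([True] * valid + [False] * (seg - valid))
    return mask
```

Unlike the id-based fallback, this stays correct even when a genuine token id happens to equal the pad id.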
Ensure packed mask remapping uses boolean indexing semantics when preprocessing
moe_padding_mask, preventing explicit THD padding mask from collapsing to zero.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add a default-off THD_DIAG_ALIGN switch to log pre/post packed remap summaries for labels and loss_mask, so we can directly verify whether THD preprocessing preserves supervision alignment.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add opt-in THD MRoPE pre/post remap diagnostics behind THD_DIAG_MROPE to inspect position semantics without affecting default behavior.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Use boolean attention masks at all THD preprocess_packed_seqs call sites and add a function-level bool cast guard to prevent silent positional remap corruption.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
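The failure mode this guard prevents is the indexing-semantics difference between integer and boolean masks. A sketch with numpy standing in for torch tensors (the guard placement and names are assumptions):

```python
# Hedged sketch: an integer 0/1 mask used for indexing performs a gather
# over indices 0 and 1 -- not masking -- which silently corrupts the
# positional remap. The function-level bool cast guard prevents this.
import numpy as np

def remap_positions(position_ids, attention_mask):
    mask = np.asarray(attention_mask)
    if mask.dtype != np.bool_:  # function-level guard
        mask = mask.astype(bool)
    return np.asarray(position_ids)[mask]
```

With the guard, `[1, 1, 0, 1]` selects positions 0, 1, and 3; without it, the same mask would gather elements at indices 1, 1, 0, 1.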
Add opt-in strict diagnostics to compare THD-remapped position ids against explicit bool-mask gather semantics for cp_size==1.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Add opt-in THD boundary diagnostics to compare kernel vs rope cu summaries and report boundary-window label/loss/moe mask consistency, while keeping all new logic disabled by default.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Document how argmin-based trimming and padded boundary fields are used so THD packed-kernel boundary expectations are explicit.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
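The argmin-style trimming documented above can be sketched as follows, with plain lists in place of tensors. The idea: an over-allocated cu_seqlens buffer repeats its last boundary, so the first zero-increment entry marks where the padded tail begins.

```python
# Hedged sketch of argmin-style cu_seqlens tail trimming (real code
# operates on tensors and may use argmin over the diffs directly).
def trim_cu_seqlens(cu):
    for i in range(1, len(cu)):
        if cu[i] == cu[i - 1]:  # zero-length segment => padded tail begins
            return cu[:i]
    return cu
```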
Allow EnergonProvider to accept and forward shuffle_buffer_size so training scripts can explicitly control sample shuffling behavior in THD/BSHD comparison runs.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Apply nodebug recipe/provider updates while intentionally keeping the Megatron-LM submodule pointer unchanged.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Introduce independent batch-level packing control for THD dataloader, keep in-batch packing semantics isolated, and update validation/step logic so the two modes cannot be mixed.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
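The mutual-exclusion check described above can be sketched like this; the flag names are hypothetical stand-ins for the actual config fields.

```python
# Hedged sketch (hypothetical flag names): in-batch packing and THD
# batch-level packing must not both be enabled, and both default to off.
def validate_packing(pack_in_batch: bool = False,
                     pack_batch_level: bool = False) -> None:
    if pack_in_batch and pack_batch_level:
        raise ValueError(
            "in-batch packing and batch-level THD packing are mutually "
            "exclusive; enable at most one via explicit CLI override."
        )
```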
Keep both in-batch and THD batch-level packing disabled by default, requiring explicit CLI overrides to enable either mode and avoid unintended coupling.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
Narrow optional-import exception scopes, add fallback warnings, split diffusion optional imports, and align the qwen35 energon docstring with actual default packing behavior.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>
…level packing

Use safe cu_seqlens tail trimming when argmin metadata is absent, and prevent qwen3_vl_step from enabling batch-level packing before full THD metadata wiring.

Signed-off-by: Daisy Gao <daisyg@nvidia.com>