[ROCm]: fix: reduce pipeline temp memory — replace ppermute collectives with lax.slice/pad (PR2) by cj401-amd · Pull Request #4192 · AI-Hypercomputer/maxtext

cj401-amd · 2026-06-17T22:54:22Z

Summary

Replace shard_map + ppermute collective operations in the pipeline with pure
lax.slice/jnp.pad/jnp.concatenate equivalents. This eliminates the shard_map
overhead and removes stage-axis sharding constraints that caused temp memory bloat
and shape-divisibility errors.

Changes in PipelineBase.get_new_loop_state:

_rotate_right: shard_map + ppermute → lax.slice_in_dim + concatenate
_shift_right: shard_map + ppermute + where → pad + lax.slice
_update_state_io: shard_map + _rotate_left/_shift_left → pad + slice_in_dim + where (also removes extra stream_buf_idx arg)
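
For reviewers less familiar with the rewrite, here is a minimal sketch of the idea behind the first two bullets. It is not the code from this PR: the function names, the choice of axis 0 as the stage axis, and the zero fill value are assumptions for illustration. It only shows that a cyclic rotation and a non-cyclic shift along a leading axis can be written with slicing, padding, and concatenation, with no collective.

```python
import jax.numpy as jnp
from jax import lax


def rotate_right(x):
  """Cyclic shift along axis 0: slice i moves to i + 1, the last slice wraps to 0."""
  n = x.shape[0]
  last = lax.slice_in_dim(x, n - 1, n, axis=0)  # shape (1, ...)
  rest = lax.slice_in_dim(x, 0, n - 1, axis=0)  # shape (n - 1, ...)
  return jnp.concatenate([last, rest], axis=0)


def shift_right(x):
  """Non-cyclic shift along axis 0: slice 0 becomes zeros, the last slice is dropped."""
  # Pad one zero slice in front of axis 0 only, then keep the first n slices.
  pad_width = [(1, 0)] + [(0, 0)] * (x.ndim - 1)
  padded = jnp.pad(x, pad_width)
  return lax.slice_in_dim(padded, 0, x.shape[0], axis=0)


# Hypothetical example: 4 stages, 2 values per stage.
x = jnp.arange(8.0).reshape(4, 2)
rotated = rotate_right(x)
shifted = shift_right(x)
```

On a single device the rotation matches `jnp.roll(x, 1, axis=0)`. Whether the compiler still inserts communication when the leading axis is sharded depends on the sharding in effect, which this sketch does not model.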

Changes in PipelineBase.get_iteration_inputs:

Remove redundant _maybe_shard_with_logical calls on shift and first_stage_in
Remove out_sharding from broadcasted_iota
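
As a rough illustration of where `broadcasted_iota` fits, the sketch below selects the fresh input for stage 0 and the shifted state for every other stage. The function name, argument shapes, and fill values are assumptions, not the PR's code. The iota is created without any sharding argument, which is the form the bullet above describes.

```python
import jax.numpy as jnp
from jax import lax


def select_stage_inputs(first_stage_in, shift):
  """Stage 0 reads first_stage_in; all other stages read the shifted state."""
  # Stage index along axis 0, broadcast to the full shape of `shift`.
  stage_idx = lax.broadcasted_iota(jnp.int32, shift.shape, 0)
  # first_stage_in has no stage axis and broadcasts against `shift`.
  return jnp.where(stage_idx == 0, first_stage_in, shift)


# Hypothetical sizes: 4 stages, microbatch of 2, embedding of 3.
num_stages, micro, emb = 4, 2, 3
shift = jnp.ones((num_stages, micro, emb))
first_stage_in = jnp.full((micro, emb), 7.0)
stages_in = select_stage_inputs(first_stage_in, shift)
```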

Other:

Remove _maybe_shard_with_name on microbatches_processed in get_microbatch_and_repeat_ids
Make get_pipeline_remat_policy conditional on pipeline_save_decoder_layer_input:
when False, omit decoder_layer_input from saved names to reduce remat temp memory
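
A minimal sketch of what such a conditional policy could look like, assuming the policy is built with `jax.checkpoint_policies.save_only_these_names`. The helper name `pipeline_remat_policy`, the `iteration_input` name, and the toy `layer` function are hypothetical; only `decoder_layer_input` and the flag name come from the description above.

```python
import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name


def pipeline_remat_policy(save_decoder_layer_input: bool):
  """Builds a remat policy; decoder_layer_input is saved only when requested."""
  names = ["iteration_input"]  # hypothetical always-saved name
  if save_decoder_layer_input:
    names.append("decoder_layer_input")
  return jax.checkpoint_policies.save_only_these_names(*names)


def layer(x):
  # Tag the layer input so a name-based policy can decide whether to save it.
  x = checkpoint_name(x, "decoder_layer_input")
  return jnp.sin(x) * 2.0


# With the flag off, the tagged input is recomputed in the backward pass
# instead of being kept as a saved residual.
policy = pipeline_remat_policy(save_decoder_layer_input=False)
grad_fn = jax.grad(lambda x: jax.checkpoint(layer, policy=policy)(x).sum())
g = grad_fn(jnp.ones(3))
```

The gradient is the same under either setting; the flag only trades saved activations for recomputation.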

Test plan

python3 -m pytest tests/unit/train_compile_test.py -v -k "pipeline"
python3 -m pytest tests/integration/pipeline_parallelism_test.py -v
Smoke-test pp=8 config with pipeline_save_decoder_layer_input=false

codecov · 2026-06-18T22:47:47Z

Codecov Report

❌ Patch coverage is 40.69767% with 51 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/layers/pipeline.py	38.59%	35 Missing ⚠️
src/maxtext/layers/attention_op.py	0.00%	7 Missing ⚠️
src/maxtext/layers/normalizations.py	54.54%	4 Missing and 1 partial ⚠️
src/maxtext/trainers/pre_train/train.py	42.85%	3 Missing and 1 partial ⚠️

cj401-amd requested a review from NuojCheng June 17, 2026 22:54

cj401-amd added 3 commits June 19, 2026 06:40

fix: JAX/TE compatibility — sharding, reshard, serialize API, fsdp_transpose

768d279

fix: add pipeline_save_decoder_layer_input config field to branch 1 so yml can reference it

4d69958

fix: pipeline tmem reduction — replace ppermute collectives, expose pipeline_save_decoder_layer_input flag

62907cf

cj401-amd force-pushed the cj/tmem-fixes-clean-2-pipeline-tmem branch from face36a to 62907cf Compare June 18, 2026 22:41

cj401-amd wants to merge 3 commits into AI-Hypercomputer:main from cj401-amd:cj/tmem-fixes-clean-2-pipeline-tmem
