[ROCm]: fix: reduce MoE temp memory — embedding cap, weight sum default, skip trivial specs (PR3) by cj401-amd · Pull Request #4193 · AI-Hypercomputer/maxtext

cj401-amd · 2026-06-17T22:57:03Z

Summary

- Embeddings: cap use_iota_embed at a one-hot size of ≤2 GiB to prevent OOM on large vocabularies; add an explicit nn.with_logical_constraint after the embedding lookup.
- MoE config: change the float32_weight_sum default from true to false, because the f32 upcast adds ~2 GB of temporary memory per device with minimal numerical benefit for most configs.
- DeepSeek: fix the activation PartitionSpec to include the fsdp_transpose and context axes; use the remove_size_one_mesh_axis helper; remove redundant jax.reshard calls.
- Mixtral: replace nn.with_logical_constraint with maybe_shard_with_logical(..., skip_trivial_specs=True) throughout MixtralDecoderLayer to avoid no-op sharding constraints that add XLA overhead.
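As a rough illustration of two of the guards summarized above (the one-hot size cap and the skipping of trivial sharding specs), here is a minimal pure-Python sketch. It is not the code from this PR. The helper names (`one_hot_bytes`, `should_use_iota_embed`, `is_trivial_spec`), the size formula, the 2-byte element size, and the dictionary representation of the mesh are all assumptions made for illustration. Only the 2 GiB threshold and the idea of skipping no-op constraints come from the description.

```python
GIB = 1024 ** 3
ONE_HOT_CAP_BYTES = 2 * GIB  # the "<= 2 GiB" cap named in the summary


def one_hot_bytes(batch, seq_len, vocab_size, bytes_per_element=2):
    """Bytes in a dense [batch, seq_len, vocab_size] one-hot tensor.

    Assumption: 2 bytes per element (e.g. bf16); the PR does not state the dtype.
    """
    return batch * seq_len * vocab_size * bytes_per_element


def should_use_iota_embed(requested, batch, seq_len, vocab_size, bytes_per_element=2):
    """Honor the requested flag only if the one-hot tensor fits under the cap."""
    size = one_hot_bytes(batch, seq_len, vocab_size, bytes_per_element)
    return requested and size <= ONE_HOT_CAP_BYTES


def is_trivial_spec(spec, mesh_shape):
    """True if the spec splits no dimension across a mesh axis of size > 1.

    `spec` is a tuple whose entries are None, an axis name, or a tuple of
    axis names. `mesh_shape` maps axis name -> size (hypothetical layout).
    """
    for entry in spec:
        if entry is None:
            continue
        axes = (entry,) if isinstance(entry, str) else tuple(entry)
        if any(mesh_shape.get(axis, 1) > 1 for axis in axes):
            return False
    return True


# A 32k vocabulary fits under the cap at batch 8, sequence length 4096 ...
small_vocab_ok = should_use_iota_embed(True, 8, 4096, 32_000)
# ... while a 128k vocabulary at the same shape does not.
large_vocab_ok = should_use_iota_embed(True, 8, 4096, 128_256)
```

Under these assumptions, a sharding helper could return its input unchanged whenever `is_trivial_spec` is true, which is the kind of no-op constraint the Mixtral change is described as avoiding.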

Test plan

python3 -m pytest tests/unit/train_compile_test.py -v -k "moe or deepseek or mixtral"
Smoke-test MoE model (e.g. mixtral-8x7b or deepseek3-test config)

codecov · 2026-06-18T22:50:46Z

Codecov Report

❌ Patch coverage is 53.78151% with 55 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/layers/pipeline.py	38.59%	35 Missing ⚠️
src/maxtext/layers/attention_op.py	0.00%	7 Missing ⚠️
src/maxtext/layers/normalizations.py	54.54%	4 Missing and 1 partial ⚠️
src/maxtext/trainers/pre_train/train.py	42.85%	3 Missing and 1 partial ⚠️
src/maxtext/models/mixtral.py	76.92%	0 Missing and 3 partials ⚠️
src/maxtext/models/deepseek.py	50.00%	1 Missing ⚠️

cj401-amd requested a review from NuojCheng June 17, 2026 22:57

cj401-amd added 4 commits June 19, 2026 06:41

fix: JAX/TE compatibility — sharding, reshard, serialize API, fsdp_transpose (57bb35b)

fix: add pipeline_save_decoder_layer_input config field to branch 1 so yml can reference it (4bf6d87)

fix: pipeline tmem reduction — replace ppermute collectives, expose pipeline_save_decoder_layer_input flag (281944a)

fix: MoE tmem reduction — megablox 9-tuple tiling, float32_weight_sum default, skip_trivial_specs (0ed140e)

cj401-amd force-pushed the cj/tmem-fixes-clean-3-moe-tmem branch from 666bf09 to 0ed140e Compare June 18, 2026 22:42

cj401-amd wants to merge 4 commits into AI-Hypercomputer:main from cj401-amd:cj/tmem-fixes-clean-3-moe-tmem