Releases: AI-Hypercomputer/maxtext
maxtext-v0.2.3
Changes
- Upgraded JAX to version 0.10.0 for pre-training and 0.10.1 for post-training.
- New vLLM-Powered Evaluation Framework: Introduced an eval framework for running lm-eval, evalchemy, and custom benchmarking against MaxText checkpoints. See the evaluation guide for details.
- Added support for pre-training new models:
- Direct Preference Optimization (DPO/ORPO) Support: Full support for DPO and ORPO alignment pipelines. See the DPO tutorial for details.
- Reinforcement Learning (RL) Recipe: Added a pre-configured RL recipe for Qwen3-30b-a3b.
- Iterative Quality Monitoring (RL): Added intermediate evaluation hooks to automatically run quality benchmarks during RL training (every eval_interval steps), optimized with a new eval_batch_size configuration knob.
- Developer Extensibility: Added dataset_processor_path CLI knob for custom dataset integration, and refactored shared post-training hooks to simplify custom SFT, DPO, and RL workflow development.
- Generalized Learn-to-Init (LTI) for Distillation: Enhanced post-training distillation capabilities with generalized LTI support.
- Added support for recording elastic goodput events during training to track efficiency (PR #3901).
- Installation Updates: Updated the [tpu-post-train] installation command to require UV_TORCH_BACKEND=cpu (see Installation Guide).
- Zero1 AOT Compilation: Added zero1 support to Ahead-Of-Time (AOT) compilation in train compile, improving compilation capabilities for zero1 config.
- MoE Performance Optimization: Integrated ragged gather reduce into Mixture of Experts (MoE) layers, replacing ragged scatter and supporting the backward pass, to reduce memory use and improve performance.
- Added E2E scripts to run checkpoint conversion, pre-training, and post-training (SFT, RL) with the Gemma3-4B model.
- Bug Fixes and Usability Enhancements:
- Attention Masking Fix in RL: Fixed an issue in TunixMaxTextAdapter where queries at non-pad positions could attend to pad-position keys during training, which was corrupting log-probabilities and affecting GRPO training reward trajectories (PR #4016).
- JAX/NNX Gradient Mutation Fix: Refactored post-training loops (train_distill, train_sft, train_rl) to use jax.value_and_grad with explicit NNX state split/merge instead of nesting nnx.value_and_grad inside nnx.jit (PR #3652).
- Qwen3-MoE Checkpoint Conversion: Fixed checkpoint conversion issues for Qwen3-MoE models (PR #3868).
- Duplicate Configuration Failures Fix: Allowed identical config overrides and handled configuration exceptions cleanly (PR #3933).
- Documentation Improvements: Updated the Getting started guide and added new guides for the evaluation framework and DPO.
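The attention masking fix listed above is about which key positions a query is allowed to see. The following is only a rough sketch of the general idea, not the TunixMaxTextAdapter code: it assumes a causal decoder and a left-padded sequence, and the names build_attention_mask and masked_softmax are hypothetical. It shows that once pad-position keys are masked out, they receive zero attention weight and so cannot influence the log-probabilities of real tokens.

```python
import math


def build_attention_mask(is_token):
    """Return mask[q][k], True when query q may attend to key k.

    is_token[i] is True for real tokens and False for padding.
    A key is visible only if it is a real token and not in the future,
    and pad-position queries attend to nothing.
    """
    n = len(is_token)
    return [
        [is_token[q] and is_token[k] and k <= q for k in range(n)]
        for q in range(n)
    ]


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Two pad positions followed by three real tokens (left padding, assumed).
is_token = [False, False, True, True, True]
mask = build_attention_mask(is_token)

# Attention weights for the query at position 3 over illustrative raw scores.
weights = masked_softmax([0.5, 2.0, 1.0, 0.0, -1.0], mask[3])
```

Without the is_token[k] term in the mask, the pad key at position 1 (raw score 2.0) would dominate the weights for this query, which is the kind of leakage the fix describes.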
Deprecations
- Deleted legacy DPO implementation in favor of the integrated DPO trainer.
- Removed stack trace collection feature.
maxtext-v0.2.2
Changes
- Upgraded JAX to version 0.9.2, improving support for both pre-training and post-training.
- Introduced simplified APIs for accessing MaxText models.
- Included maxtext_with_gepa.ipynb, a new notebook demonstrating AIME prompt optimization using the GEPA framework within MaxText.
- Added support for Kimi-K2 models and the MuonClip optimizer. Users can explore this with the kimi-k2-1t config (see user guide for details).
- Kimi-K2-Thinking, Kimi-K2.5 (text), and Kimi-K2.6 (text) are now supported. See Run_Kimi.md for details.
- DeepSeek-V3.2 is now supported, including DeepSeek Sparse Attention for handling long contexts. Use the deepseek3.2-671b config to try it out (refer to the user guide for more information).
- Support has been added for Gemma 4 multi-modal models (26B MoE and 31B dense). These can be used with the gemma4-26b and gemma4-31b configs. See Run_Gemma4.md for further details.
- Support has been added for Gemma 4 inference using the MaxText on vLLM plugin.
- Enhanced RL capabilities with support for the open-r1/OpenR1-Math-220k dataset and nvidia/OpenMathReasoning.
- Added more evaluation modes for RL, such as majority voting and pass@1 estimation.
- Weights are now synced to vLLM before pre-RL evaluation.
- Made usage of math-verify in RL more robust.
- MaxText's Supervised Fine-Tuning (SFT) now supports non-instruct models.
- Added support for tensor parallelism using the Fused MoE kernel for MaxText on vLLM inference.
- Added support for MaxText-to-vLLM converters for the Qwen3 and Gemma4 families of models.
- validate_converter.py now runs in multislice environments to test larger models, with utilities to compare MaxText and vLLM weights.
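The RL evaluation modes mentioned in this list (majority voting and pass@1 estimation) can be illustrated with a small sketch. This is not MaxText's evaluation code. The function names are hypothetical, answers are assumed to be already normalized strings, and pass@k uses the commonly cited combinatorial estimator 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    counts = Counter(answers)
    return max(counts, key=counts.get)


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Eight sampled final answers for one math problem (illustrative data).
samples = ["42", "41", "42", "7", "42", "42", "13", "41"]
reference = "42"

num_correct = sum(answer == reference for answer in samples)
voted_answer = majority_vote(samples)
pass_at_1 = pass_at_k(len(samples), num_correct, 1)
```

Majority voting scores the single consensus answer per problem, while pass@1 averages correctness over individual samples, so the two can disagree on the same set of generations.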
Deprecations
- Legacy MaxText.* shims have been removed. Please refer to src/MaxText/README.md for details on the new command locations and how to migrate.
- Sequence parallelism has been deprecated; please use context parallelism instead.
- The flag expert_shard_attention_option is deprecated; use custom_mesh_and_rule=ep-as-cp for the same functionality.
maxtext-v0.2.1
- Use the new maxtext[runner] installation option to build Docker images without cloning the repository. This can be used for scheduling jobs through XPK. See the MaxText installation instructions for more info.
- Config can now be inferred for most MaxText commands. If you choose not to provide a config, MaxText will now select an appropriate one.
- Configs in MaxText PyPI will now be picked up without storing them locally.
- New features from DeepSeek-AI are now supported: Conditional Memory via Scalable Lookup (Engram) and Manifold-Constrained Hyper-Connections (mHC). Try them out with our deepseek-custom starter config.
- MaxText now supports customizing your own mesh and logical rules. Two examples showing how to use your own mesh and rules for sharding are provided in the custom_mesh_and_rule directory.
maxtext-v0.2.0
Changes
- Qwen3-Next is now supported.
- New tpu-post-train target in PyPI. Please also use this installation option for running vllm_decode. See the MaxText installation instructions for more info.
- New MaxText structure! MaxText has been restructured according to RESTRUCTURE.md. Please feel free to share your thoughts and feedback.
- Muon optimizer is now supported.
- DeepSeek V3.1 is now supported. Use the existing configs for DeepSeek V3 671B and load a V3.1 checkpoint to use the model.
- New RL and SFT Notebook tutorials are available.
- The ReadTheDocs documentation site has been reorganized.
- Multi-host support for GSPO and GRPO is now available via new RL tutorials.
- A new guide, What is Post Training in MaxText?, is now available.
- Ironwood TPU co-designed AI stack announced. Read the blog post on its co-design with MaxText.
- Optimized models tiering documentation has been refreshed.
- Added Versioning. Check out our first set of release notes!
- Post-Training (SFT, RL) via Tunix is now available.
- Vocabulary tiling (PR) is now supported in MaxText! Adjust config num_vocab_tiling to unlock more efficient memory usage.
- The GPT-OSS family of models (20B, 120B) is now supported.
Deprecations
- Many MaxText modules have changed locations. Core commands like train, decode, sft, etc. will still work as expected temporarily. Please update your commands to the latest file locations.
- The install_maxtext_github_deps installation script has been replaced with install_maxtext_tpu_github_deps.
- tools/setup/setup_post_training_requirements.sh for post-training dependency installation is deprecated in favor of pip installation.
maxtext-tutorial-v1.5.0
Merge pull request #2898 from AI-Hypercomputer:tests_docker_image PiperOrigin-RevId: 850456883
maxtext-tutorial-v1.4.0
maxtext-tutorial-v1.3.0
Merge pull request #2706 from AI-Hypercomputer:mohit/tokamax_quant_gmm PiperOrigin-RevId: 834605168
maxtext-tutorial-v1.2.0: Merge pull request #2676 from AI-Hypercomputer:pypi_release
PiperOrigin-RevId: 832378885
Recipe Branch for TPU performance results
Merge pull request #2539 from AI-Hypercomputer:qinwen/latest-tokamax PiperOrigin-RevId: 823749360
maxtext-tutorial-v1.0.0
Merge pull request #2538 from AI-Hypercomputer:mohit/fix_docker PiperOrigin-RevId: 822796389