A minimal, hackable pre-training stack for GPT-style language models. This project provides a clean foundation for training transformer models from scratch with distributed training support.
- **Modular GPT Architecture**: Flexible transformer implementation with support for:
  - Grouped Query Attention (GQA)
  - Mixture of Experts (MoE)
  - Customizable attention, MLP, and normalization layers
  - Flash Attention optimization support
- **Distributed Training**:
  - ZeRO-1 optimizer state partitioning for memory efficiency (native support for Muon)
  - DistributedDataParallel (DDP) for multi-GPU training
  - Sequence-Expert joint parallelism (SEP) via `SEP_SIZE` / `--sep_size`
  - Gradient accumulation for large effective batch sizes
- **Training Optimizations**:
  - Mixed precision training (BFloat16)
  - Gradient clipping
  - Cosine learning rate schedule with warmup
  - Automatic checkpoint resumption with full state recovery
-
Developer-Friendly:
- Comprehensive profiling utilities
- Model FLOPs Utilization (MFU) tracking
- Auto-tune script for fast throughput search (
scripts/autotune.sh) - Mock data mode for rapid debugging
- Minimal dependencies
    .
    ├── tinytron/
    │   ├── model/                      # Model architecture
    │   │   ├── __init__.py
    │   │   ├── gpt.py                  # GPT model implementation
    │   │   └── modules/                # Modular components
    │   │       ├── attn.py             # Attention mechanisms
    │   │       ├── mlp.py              # Dense MLP and MoE layers
    │   │       ├── norm.py             # Normalization layers
    │   │       ├── loss.py             # SP-aware cross entropy loss
    │   │       └── emb.py              # Embedding layers
    │   │
    │   ├── training/                   # Training pipeline
    │   │   ├── __init__.py
    │   │   ├── config.py               # Config dataclasses (ModelConfig, etc.)
    │   │   ├── arguments.py            # CLI argument definitions
    │   │   └── trainer.py              # Trainer and dataset init
    │   │
    │   ├── distributed/                # Distributed training components
    │   │   ├── __init__.py
    │   │   ├── parallel_state.py       # DP/SEP process group construction
    │   │   ├── zero1/
    │   │   │   └── distributed_optimizer.py  # ZeRO-1 implementation
    │   │   ├── sequence_parallel/
    │   │   │   └── ulysses.py          # SP collectives and grad sync helpers
    │   │   └── expert_parallel/
    │   │       └── comm.py             # EP all-to-all communication
    │   │
    │   └── utils/                      # Utility functions
    │       ├── __init__.py
    │       ├── model.py                # Model utilities (param counting, etc.)
    │       ├── training.py             # Schedule helpers (get_training_info, etc.)
    │       └── profile.py              # Profiling and MFU computation
    │
    ├── scripts/                        # Launch scripts
    │   ├── autotune.sh                 # Auto-tune SEP_SIZE/BATCH_SIZE by tok/sec
    │   ├── debug_gpt_0.25b/
    │   │   └── pretrain.sh             # 0.25B debug (pretrain_debug.py)
    │   ├── debug_gpt_0.3b_a0.17b/
    │   │   └── pretrain.sh             # 0.3B MoE debug (pretrain_debug.py)
    │   └── example_gpt_0.25b/
    │       └── pretrain.sh             # 0.25B example with custom data (pretrain_example.py)
    │
    ├── pretrain_debug.py               # Debug entry (mock data, minimal deps)
    ├── pretrain_example.py             # Example entry (custom dataset / tokenizer)
    └── README.md
- Python 3.10+
- PyTorch 2.0+ with CUDA/NCCL support
- tqdm
- numpy
Install minimal runtime dependencies:

    pip install torch tqdm numpy

For `pretrain_example.py`, also install:

    pip install datasets transformers

Using training scripts (recommended):
    # Train 0.25B dense model (8 GPUs)
    bash scripts/debug_gpt_0.25b/pretrain.sh

    # Train 0.3B MoE model (8 GPUs)
    bash scripts/debug_gpt_0.3b_a0.17b/pretrain.sh

    # Override SEP (sequence-expert joint) parallel size
    SEP_SIZE=2 bash scripts/debug_gpt_0.25b/pretrain.sh

Direct command for quick testing:
    torchrun --nproc_per_node=8 pretrain_debug.py \
        --exp_name debug_test \
        --use_mock_data \
        --mock_data_num_samples 1280 \
        --total_batch_size 524288 \
        --batch_size 8 \
        --seq_len 4096 \
        --sep_size 1 \
        --max_epochs 1 \
        --debug

All training scripts support multi-node training via environment variables:
    # Node 0 (master, e.g. IP: 192.168.1.100)
    NUM_NODES=2 NODE_RANK=0 MASTER_ADDR=192.168.1.100 \
    bash scripts/debug_gpt_0.25b/pretrain.sh

    # Node 1 (worker)
    NUM_NODES=2 NODE_RANK=1 MASTER_ADDR=192.168.1.100 \
    bash scripts/debug_gpt_0.25b/pretrain.sh

On some distributed training platforms you do not need to specify `--node_rank`, `--nnodes`, or `--master_addr`: `torchrun` automatically detects the injected environment variables and uses the `env://` initialization method to set up distributed communication.
Use the example entry point and override `_init_dataset`: see `pretrain_example.py` for a subclass that uses a real dataset and tokenizer. The base implementation (mock data) lives in `tinytron/training/trainer.py`; override it in your entry script, or subclass `Trainer` and pass your dataset there. You can also use Streaming-Dataloader to build a memory-efficient streaming dataset pipeline for LLM pretraining.
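The trainer expects each item to provide a contiguous run of `seq_len + 1` token ids (inputs plus one shifted target). The packing step can be sketched as follows; this is a pure-Python illustration, not the repo's implementation:

```python
def pack_tokens(token_ids, seq_len):
    """Split a flat token stream into contiguous chunks of seq_len + 1 ids.

    Each chunk yields inputs (chunk[:-1]) and targets (chunk[1:]) for
    causal LM training; an incomplete trailing chunk is dropped.
    """
    chunk = seq_len + 1
    return [token_ids[i:i + chunk]
            for i in range(0, len(token_ids) - chunk + 1, chunk)]

# 10 tokens with seq_len=4 -> two chunks of 5 ids each
print(pack_tokens(list(range(10)), 4))  # → [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

A real dataset would apply the same chunking to tokenized documents and return tensors from `__getitem__`.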
The repository includes an auto-tuner at scripts/autotune.sh to search throughput-friendly combinations of SEP_SIZE and BATCH_SIZE.
Default search space:

    SEP_SIZES="1 2 4 8"
    BATCH_SIZES="1 2 4 8 16 32"
    RUN_SCRIPT="scripts/debug_gpt_0.25b/pretrain.sh"
Run with defaults:

    bash scripts/autotune.sh

Run with a custom search space and target script:
    SEP_SIZES="1 2 4" \
    BATCH_SIZES="4 8 16" \
    TARGET_STEPS=80 \
    WARMUP_STEPS=20 \
    RUN_SCRIPT="scripts/debug_gpt_0.3b_a0.17b/pretrain.sh" \
    bash scripts/autotune.sh

Outputs:

- Summary CSV: `autotune_results.csv`
- Temporary log (auto-cleaned): `autotune_temp.log`
- Best config printed at the end as `SEP_SIZE=<...>, BATCH_SIZE=<...>`
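The summary CSV can also be post-processed to re-select the best configuration later. A small sketch, under the assumption that the file has `sep_size`, `batch_size`, and `tokens_per_sec` columns (check the actual header of `autotune_results.csv` before using it):

```python
import csv
import io

def best_config(csv_text, metric="tokens_per_sec"):
    """Pick the (sep_size, batch_size) row with the highest throughput.

    Column names are assumptions about the autotune CSV, not guaranteed.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda r: float(r[metric]))
    return best["sep_size"], best["batch_size"]

sample = "sep_size,batch_size,tokens_per_sec\n1,8,52000\n2,16,61000\n"
print(best_config(sample))  # → ('2', '16')
```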
Configuration is built from CLI arguments via tinytron/training/arguments.py and assembled into a unified Config in tinytron/training/config.py.
    @dataclass
    class ModelConfig:
        block_size: int = 4096            # Maximum sequence length
        vocab_size: int = 50304           # Vocabulary size
        num_layer: int = 32               # Number of transformer layers
        num_attention_heads: int = 128    # Number of attention heads
        num_key_value_heads: int = 8      # Number of KV heads (GQA)
        hidden_size: int = 1024           # Hidden dimension
        intermediate_size: int = 4096     # FFN intermediate size
        dropout: float = 0.0              # Dropout rate
        tied_lm_head: bool = False        # Tie input/output embeddings (enable via --tied_lm_head)
        # Mixture of Experts (optional)
        use_moe: bool = False             # Enable MoE
        num_experts: int = 128            # Total number of experts
        num_experts_per_tok: int = 8      # Active experts per token
        moe_intermediate_size: int = 256  # Expert FFN size

Key CLI options (see `tinytron/training/arguments.py` for the full list):
| Option | Default | Description |
|---|---|---|
| `--exp_name` | `gpt` | Experiment name |
| `--total_batch_size` | `524288` | Global batch size in tokens |
| `--batch_size` | `8` | Micro batch size per device |
| `--seq_len` | `4096` | Sequence length |
| `--max_lr` / `--min_lr` | `4e-3` / `3e-5` | Learning rate range |
| `--weight_decay` | `0.1` | AdamW weight decay |
| `--grad_clip_value` | `1.0` | Gradient clipping threshold |
| `--warmup_steps` | `1000` | LR warmup steps |
| `--max_epochs` | `1` | Training epochs |
| `--do_save` | `False` | Enable checkpoint saving |
| `--save_every_steps` | `5000` | Checkpoint frequency |
| `--do_val` | `False` | Enable validation during training |
| `--val_every_steps` | `250` | Validation frequency (when `--do_val` is enabled) |
| `--optimizer` | `adam` | Optimizer type (`adam` / `muon`) |
| `--use_distributed_optimizer` | `False` | Enable ZeRO-1-style optimizer sharding |
| `--pin_memory` | `False` | Enable DataLoader pinned memory |
| `--tied_lm_head` | `False` | Tie token embedding and LM head weights |
| `--use_compile` | flag | Enable PyTorch 2.0 compilation |
`sep_size` controls the SEP group size (sequence-expert joint parallelism).

- CLI flag: `--sep_size` (default: `8` in `tinytron/training/arguments.py`)
- Script env var: `SEP_SIZE` (mapped to `--sep_size`; the script default is `1`)
- Dense models (`--use_moe` disabled): SEP degenerates to pure SP.
- Constraints: `WORLD_SIZE % sep_size == 0`, and the sequence length must be divisible by the SEP size (`seq_len % sep_size == 0`)
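The constraints above can be expressed as a quick pre-launch check. This is a standalone sketch; the helper name and its return value (the resulting DP degree) are illustrative, not part of the repo:

```python
def check_sep_config(world_size, sep_size, seq_len):
    """Validate the SEP constraints; returns the implied DP degree."""
    if world_size % sep_size != 0:
        raise ValueError(
            f"WORLD_SIZE ({world_size}) must be divisible by sep_size ({sep_size})")
    if seq_len % sep_size != 0:
        raise ValueError(
            f"seq_len ({seq_len}) must be divisible by sep_size ({sep_size})")
    return world_size // sep_size

print(check_sep_config(8, 2, 4096))  # → 4 (DP degree with 8 GPUs, SEP=2)
```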
Example:

    torchrun --nproc_per_node=8 pretrain_debug.py \
        --batch_size 8 \
        --seq_len 4096 \
        --sep_size 2 \
        --max_epochs 1

Checkpoint saving is disabled by default. Enable it with:

    --do_save --save_every_steps 5000

When enabled, the trainer can save and resume checkpoints, preserving:
- Model weights (`*_model.pt`)
- Optimizer states (`*_opt/` directory)
- Training metadata (`*_meta.pt`): step counter, RNG state, dataloader position
To resume, restart the same training command. The trainer searches for checkpoints under the current experiment `log_dir` by default, or you can specify `--resume_path` explicitly.
Memory-efficient optimizer state partitioning:
- Optimizer states are sharded across GPUs
- Model parameters remain replicated
- Automatic gradient synchronization and parameter broadcasting
Enable it with:

    --use_distributed_optimizer

Native support for Muon + ZeRO-1 is also available:

    --optimizer muon --use_distributed_optimizer

Gradient accumulation steps are computed automatically from the global token budget:

    grad_accum_steps = total_batch_size / (batch_size × seq_len × num_dp_ranks)
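As a concrete check of the formula above (a standalone sketch, assuming the DP degree is `WORLD_SIZE / sep_size`, consistent with the SEP constraints):

```python
def grad_accum_steps(total_batch_size, batch_size, seq_len, world_size, sep_size=1):
    """Micro-steps accumulated until the global batch (in tokens) is reached."""
    num_dp_ranks = world_size // sep_size  # assumption: SEP shrinks the DP degree
    tokens_per_micro_step = batch_size * seq_len * num_dp_ranks
    assert total_batch_size % tokens_per_micro_step == 0, \
        "total_batch_size must be a multiple of batch_size * seq_len * num_dp_ranks"
    return total_batch_size // tokens_per_micro_step

# Debug-script defaults: 524288-token global batch, micro batch 8, seq_len 4096, 8 GPUs
print(grad_accum_steps(524288, 8, 4096, world_size=8))  # → 2
```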
Implements cosine annealing with linear warmup:

- Linear warmup: 0 → max_lr over `warmup_steps`
- Cosine decay: max_lr → min_lr over the remaining steps
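The schedule shape can be sketched in a few lines. This is a minimal illustration of the two phases; the repo's actual helper lives in `tinytron/utils/training.py` and its name and signature may differ:

```python
import math

def get_lr(step, max_lr=4e-3, min_lr=3e-5, warmup_steps=1000, max_steps=10000):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup: 0 -> max_lr over warmup_steps
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay: max_lr -> min_lr over the remaining steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(999))    # → 0.004 (end of warmup reaches max_lr)
print(get_lr(10000))  # → 3e-05 (fully decayed to min_lr)
```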
Validation is optional and disabled by default. Enable it with:

    --do_val --val_every_steps 250

When enabled, validation runs every `val_every_steps` steps and on the last step (unless `--debug` is set).
Real-time tracking of hardware efficiency:

    MFU = (Actual FLOPs) / (Peak Hardware FLOPs)
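A back-of-envelope version of this metric, using the common ~6N FLOPs-per-token approximation for a forward+backward pass (attention FLOPs omitted; the repo's own computation in `tinytron/utils/profile.py` may differ):

```python
def estimate_mfu(num_params, tokens_per_sec, peak_flops_per_sec):
    """MFU estimate from the ~6N FLOPs/token forward+backward approximation."""
    achieved_flops_per_sec = 6.0 * num_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec

# e.g. a 1B-param model at 100k tok/s on hardware with a 1 PFLOP/s peak
print(estimate_mfu(1e9, 1e5, 1e15))  # → 0.6
```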
Enable the PyTorch profiler for performance analysis:

    python pretrain_debug.py \
        --use_profiler \
        --steps_to_profile 15 20   # profile steps 15 to 20

This generates a Chrome trace file at `<log_dir>/rank{rank}_trace.json` (for the exporting process) that can be viewed in chrome://tracing.
    # 12-layer config
    --num_layer 12 \
    --num_attention_heads 32 \
    --num_key_value_heads 4 \
    --hidden_size 1024 \
    --intermediate_size 4096

    # 24-layer config
    --num_layer 24 \
    --num_attention_heads 64 \
    --num_key_value_heads 8 \
    --hidden_size 2048 \
    --intermediate_size 8192

    # 32-layer config
    --num_layer 32 \
    --num_attention_heads 128 \
    --num_key_value_heads 16 \
    --hidden_size 4096 \
    --intermediate_size 16384

Implement your dataset class and override `_init_dataset`: subclass `Trainer` in your entry script (e.g. `pretrain_example.py`) and set `self.train_dataset` to your dataset. Each item should provide tensors compatible with the trainer (e.g. contiguous token ids of length `seq_len + 1` for causal LM).
Modify components in `tinytron/model/modules/`:

- `attn.py`: Implement custom attention mechanisms
- `mlp.py`: Add new feedforward architectures
- `norm.py`: Experiment with normalization strategies
Replace AdamW in `_init_optimizer` in `tinytron/training/trainer.py` (or in a `Trainer` subclass):

    def _init_optimizer(self, config: Config):
        self.optimizer = YourOptimizer(
            self.raw_model.parameters(),
            lr=config.optim.max_lr,
        )
        # Optional: wrap with ZeRO-1 sharding (--use_distributed_optimizer)
        self.optimizer = DistributedOptimizer(
            optimizer=self.optimizer,
            process_group=self.dp_group,
        )

Training logs are saved to:

    <log_dir>/<exp_name>_modelsize_<...>_lr<...>_BS<...>_SL<...>_DP<...>_SEP<...>/log.txt
Log format:

    <step> train <loss>
    <step> val <val_loss>

Example:

    0 train 10.8234
    100 train 8.4521
    250 val 8.3012
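Given this simple format, the log is easy to post-process for plotting; a small sketch (the parser is illustrative, not part of the repo):

```python
def parse_log(lines):
    """Parse '<step> <split> <loss>' lines into per-split (step, loss) series."""
    series = {"train": [], "val": []}
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[1] in series:
            series[parts[1]].append((int(parts[0]), float(parts[2])))
    return series

log = ["0 train 10.8234", "100 train 8.4521", "250 val 8.3012"]
print(parse_log(log)["val"])  # → [(250, 8.3012)]
```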
- Enable compilation: add `--use_compile` for PyTorch 2.0+ (20-30% speedup)
- Tune batch size: maximize `--batch_size` per GPU to improve throughput
- Run auto-tune first: use `bash scripts/autotune.sh` to quickly find strong `SEP_SIZE` + `BATCH_SIZE` settings
- Use Flash Attention: ensure Flash Attention is available for faster attention
- Gradient checkpointing: implement in `tinytron/model/gpt.py` for larger models
- Mixed precision: BFloat16 is enabled by default (more numerically stable than FP16 for training)
- Reduce `--batch_size` (micro batch size)
- Enable gradient checkpointing
- Use larger `grad_accum_steps` by reducing `--batch_size`
- Ensure your PyTorch/CUDA build supports optimized SDPA kernels
- Enable `--use_compile`
- For MoE, prefer grouped GEMM kernels where possible
- Check the MFU percentage (should be >30% for efficient training)
- Increase `--batch_size` to better utilize the GPU
- Ensure all processes have write access to `log_dir`
- Check disk space for optimizer state storage
If you use this code in your research, please cite:
    @software{tinytron,
      title  = {Tinytron},
      author = {Liangyu Wang},
      year   = {2026},
      url    = {https://github.com/liangyuwang/Tinytron}
    }

This project is licensed under the terms specified in the LICENSE file.
This implementation draws inspiration from:
- nanoGPT by Andrej Karpathy
- Megatron-LM by NVIDIA
- DeepSpeed ZeRO optimization
Contributions are welcome! Please feel free to submit issues or pull requests.
Note: This is a minimal training stack designed for educational purposes and rapid prototyping. For production-scale training, consider using frameworks like DeepSpeed, Megatron-LM, or Composer.