This repository contains a PyTorch implementation of a Diffusion Language Model trained on the TinyStories dataset. The model has roughly 10M parameters.
Check out the live demo on Hugging Face Spaces: Tiny DiffLM Story Teller
The outputs on HF Spaces aren't great because of the limited compute resources (CPU-only inference). You can try it out on Colab or locally on a GPU for better results.
The model weights are available here: Medium model and Large model
- Diffusion-style Token Generation: Iterative decoding where the model predicts tokens over multiple steps instead of traditional autoregressive (left-to-right) generation.
- SwiGLU Activation: ⚡ Employs the SwiGLU variant in the MLP blocks (`F.silu(self.w1(x)) * self.w2(x)`), maintaining an effective 8/3 expansion ratio.
- Rotary Position Embeddings (RoPE): Multi-Head Attention leverages rotary embeddings for relative positional encoding.
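To make the SwiGLU bullet concrete, here is a minimal sketch of a gated MLP block built around the expression quoted above. Only `w1`, `w2`, and the 8/3 ratio come from this README; the class name `SwiGLUMLP`, the output projection `w3`, the bias-free layers, and the rounding to a multiple of 64 are assumptions made for illustration and may differ from the code in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUMLP(nn.Module):
    """Gated MLP sketch: w3(silu(w1(x)) * w2(x))."""

    def __init__(self, n_embd: int = 768, multiple_of: int = 64):
        super().__init__()
        # An 8/3 expansion keeps the parameter count close to a classic 4x MLP,
        # because SwiGLU uses three weight matrices instead of two.
        hidden = int(8 * n_embd / 3)
        # Round up to a multiple of `multiple_of` (assumption, not from the repo).
        hidden = multiple_of * ((hidden + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(n_embd, hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(n_embd, hidden, bias=False)  # value projection
        self.w3 = nn.Linear(hidden, n_embd, bias=False)  # output projection (hypothetical name)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated gate multiplies the value branch elementwise.
        return self.w3(F.silu(self.w1(x)) * self.w2(x))


mlp = SwiGLUMLP(n_embd=768)
x = torch.randn(2, 16, 768)  # (batch, sequence, embedding)
y = mlp(x)                   # same shape as the input
```

With `n_embd=768` the hidden width works out to 2048, so the three projections hold about as many weights as a two-matrix MLP with a 4x expansion.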
The project dependencies are outlined within the notebook itself, but you will minimally need:
- Python 3.8+
- PyTorch
- Hugging Face `datasets`
- `tiktoken`
- `wandb`
- Install dependencies:
  Open the notebook `tinystories_diffusion.ipynb`. The first cell contains the necessary install commands: `pip install -r requirements.txt`
- Dataset Setup:
  Ensure your custom dataset `tinystories_46k.jsonl` or `tinystories_full.jsonl` is present in the root of the project directory alongside the notebook. Alternatively, you can download it using the script `python scripts/Tinystories_data_download_all.py`
- Training:
  Run all subsequent cells in `tinystories-diffusion_gpt-2.ipynb` or use the script `python scripts/Tinystories-diffusion-GPT-2.py`. Training takes around 4 hours on 2x T4 GPUs.
- Inferencing:
  If you only want to run inference, first download the weights with `mkdir -p model && wget -P model https://huggingface.co/spaces/Jyo-K/Tiny-Diffusion-Language-Model/resolve/main/tinystories_diffusion_GPT2_dual.pt?download=true`, then run `python3 scripts/inference_new.py --model gpt2 --prompt "Once upon a time, there was a dog who loved" --max_new_tokens 150`. The `--model` flag accepts `gpt2` or `medium`.
- Tokenizer: TikToken (`gpt2` BPE encoding) with an extended vocabulary space adding a `[MASK]` token.
- Context Window: 512 tokens (`block_size`)
- Dimensions: $768$ embedding size, $12$ attention heads, and $12$ transformer layers.
- Loss Calculation: Computed dynamically using Mean Cross-Entropy over randomly injected masked tokens.
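The loss described in the last bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the function name `masked_diffusion_loss`, the per-sequence uniform masking ratio, the toy vocabulary of 100 tokens, and the tiny stand-in model are all hypothetical. The only parts taken from this README are that tokens are masked at random and that mean cross-entropy is computed over the masked positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def masked_diffusion_loss(model, tokens, mask_token_id):
    """Mean cross-entropy over randomly masked positions (sketch)."""
    batch, seq_len = tokens.shape
    # Assumption: one masking ratio per sequence, drawn uniformly from [0, 1).
    ratio = torch.rand(batch, 1, device=tokens.device)
    is_masked = torch.rand(batch, seq_len, device=tokens.device) < ratio
    # Make sure every sequence contributes at least one masked position.
    empty_rows = ~is_masked.any(dim=1)
    is_masked[empty_rows, 0] = True
    # Replace the selected tokens with the [MASK] id and predict the originals.
    corrupted = tokens.masked_fill(is_masked, mask_token_id)
    logits = model(corrupted)  # (batch, seq_len, vocab)
    # Boolean indexing keeps only masked positions; the default reduction is the mean.
    return F.cross_entropy(logits[is_masked], tokens[is_masked])


torch.manual_seed(0)
vocab_size, mask_token_id = 100, 100  # toy values; [MASK] takes the extra id
toy_model = nn.Sequential(
    nn.Embedding(vocab_size + 1, 32),
    nn.Linear(32, vocab_size + 1),
)
tokens = torch.randint(0, vocab_size, (4, 16))
loss = masked_diffusion_loss(toy_model, tokens, mask_token_id)
loss.backward()
```

Unmasked positions contribute nothing to the loss, so the model is only trained to recover tokens it could not see.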
Below is a visual representation comparing standard autoregressive generation (GPT-style) against the parallel decoding method used in this diffusion language model.
Instead of decoding strictly left to right, the diffusion generation algorithm starts from a fully masked block and, at each iteration, unmasks the tokens whose confidence score exceeds a specified `confidence_threshold`, repeating until every position in the context block is filled.
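The iterative unmasking described above might look roughly like the sketch below. Only `confidence_threshold` is a name from this README. The function `diffusion_decode`, the use of the maximum softmax probability as the confidence score, the fallback that always accepts the single most confident position, and the toy model are assumptions for illustration; the actual inference script may differ.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def diffusion_decode(model, prompt_ids, max_new_tokens, mask_token_id,
                     confidence_threshold=0.9):
    """Confidence-threshold unmasking of a fully masked block (sketch)."""
    masked_block = torch.full((max_new_tokens,), mask_token_id,
                              dtype=torch.long, device=prompt_ids.device)
    x = torch.cat([prompt_ids, masked_block]).unsqueeze(0)  # (1, T)
    # Each step fills at least one position, so max_new_tokens steps suffice.
    for _ in range(max_new_tokens):
        still_masked = x[0] == mask_token_id
        if not still_masked.any():
            break
        logits = model(x)[0]                      # (T, vocab)
        logits[:, mask_token_id] = float("-inf")  # never predict [MASK] itself
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Already filled positions (prompt included) are never overwritten.
        conf = conf.masked_fill(~still_masked, -1.0)
        accept = conf > confidence_threshold
        # Fallback (assumption): accept the most confident masked position
        # so the loop always makes progress.
        accept[conf.argmax()] = True
        x[0, accept] = pred[accept]
    return x[0]


torch.manual_seed(0)
vocab_size, mask_token_id = 50, 50  # toy values
toy_model = nn.Sequential(
    nn.Embedding(vocab_size + 1, 16),
    nn.Linear(16, vocab_size + 1),
)
prompt = torch.tensor([1, 2, 3])
out = diffusion_decode(toy_model, prompt, max_new_tokens=8,
                       mask_token_id=mask_token_id)
```

A lower `confidence_threshold` fills more positions per step and finishes in fewer iterations, at the cost of committing to less certain tokens.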
