TinyStories Diffusion Language Model

This repository contains a PyTorch implementation of a Diffusion Language Model trained on the TinyStories dataset. The model has roughly 10M parameters.

Check out the live demo on Hugging Face Spaces: Tiny DiffLM Story Teller

The outputs on HF Spaces aren't great because of the limited compute resources (CPU-only inference). For better results, try it on Colab or locally on a GPU.

The model weights are available HERE -> Medium model & Large model

About

Diffusion-style Token Generation: Iterative decoding where the model predicts tokens over multiple steps instead of traditional autoregressive (left-to-right) generation.
SwiGLU Activation: ⚡ Employs the SwiGLU variant in the MLP blocks (F.silu(self.w1(x)) * self.w2(x)), maintaining an effective 8/3 expansion ratio.
Rotary Position Embeddings (RoPE): Multi-Head Attention leverages rotary embeddings for relative positional encoding.
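
To make the 8/3 ratio concrete, here is a minimal pure-Python sketch, not the repository's code. It assumes an embedding size of 768 (the figure quoted under Model Architecture Details) and a third down-projection matrix, hypothetically called w3 here. The `swiglu_gate` helper mirrors the F.silu(self.w1(x)) * self.w2(x) expression on plain lists.

```python
import math
from typing import List

def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(a: List[float], b: List[float]) -> List[float]:
    """Elementwise silu(a) * b, where a stands for w1(x) and b for w2(x)."""
    return [silu(ai) * bi for ai, bi in zip(a, b)]

d_model = 768              # assumed embedding size
hidden = d_model * 8 // 3  # 8/3 expansion ratio

# Weight counts, ignoring biases.
swiglu_params = 3 * d_model * hidden        # w1, w2, plus an assumed w3 down projection
classic_params = 2 * d_model * 4 * d_model  # ordinary two-matrix MLP with 4x expansion

print(hidden, swiglu_params, classic_params)  # 2048 4718592 4718592
```

Under these assumptions both layouts use the same number of weights, which is the usual reason for picking 8/3 when a gated MLP replaces a 4x one.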

Requirements

The project dependencies are outlined within the notebook itself, but you will minimally need:

Python 3.8+
PyTorch
Hugging Face datasets
tiktoken
wandb

Usage

Install dependencies:
Open the notebook tinystories_diffusion.ipynb. The first cell contains the necessary install commands:
```
pip install -r requirements.txt
```
Dataset Setup:
Ensure your custom dataset tinystories_46k.jsonl or tinystories_full.jsonl is present in the root of the project directory alongside the notebook.

Alternatively, download the data with the script:
```
python scripts/Tinystories_data_download_all.py
```
Training:
Run all subsequent cells in tinystories-diffusion_gpt-2.ipynb, or use the script:
```
python scripts/Tinystories-diffusion-GPT-2.py
```
Training takes around 4 hours on 2x T4 GPUs.

Inference:
If you only want to run inference, download the weights and call the inference script. Set --model to either gpt2 or medium:
```
mkdir -p model && wget -P model "https://huggingface.co/spaces/Jyo-K/Tiny-Diffusion-Language-Model/resolve/main/tinystories_diffusion_GPT2_dual.pt?download=true"
python3 scripts/inference_new.py \
 --model gpt2 \
 --prompt "Once upon a time, there was a dog who loved" \
 --max_new_tokens 150
```

Model Architecture Details

Tokenizer: TikToken (gpt2 BPE encoding) with an extended vocabulary space adding a [MASK] token.
Context Window: 512 tokens (block_size)
Dimensions: $768$ embedding size, $12$ attention heads, and $12$ transformer layers.
Loss Calculation: Mean cross-entropy, computed only over the positions that were randomly replaced with [MASK].
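
As a rough illustration of the loss described above, the sketch below masks tokens at random and averages cross-entropy over the masked positions only. It is written in plain Python rather than PyTorch, and the names `mask_tokens`, `masked_cross_entropy`, and `MASK_ID` are hypothetical. Placing [MASK] at id 50257, directly after the 50257 GPT-2 BPE tokens, is an assumption about how the vocabulary is extended.

```python
import math
import random
from typing import List, Tuple

MASK_ID = 50257  # assumption: first id after the GPT-2 BPE vocabulary (ids 0..50256)

def mask_tokens(tokens: List[int], ratio: float,
                rng: random.Random) -> Tuple[List[int], List[bool]]:
    """Replace each token with MASK_ID independently with probability `ratio`."""
    flags = [rng.random() < ratio for _ in tokens]
    noisy = [MASK_ID if m else t for t, m in zip(tokens, flags)]
    return noisy, flags

def masked_cross_entropy(logits: List[List[float]], targets: List[int],
                         flags: List[bool]) -> float:
    """Mean cross-entropy over masked positions; unmasked positions are ignored."""
    losses = []
    for row, target, masked in zip(logits, targets, flags):
        if not masked:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0

# Toy check with a 4-token vocabulary and uniform logits.
targets = [1, 3, 0, 2]
uniform_logits = [[0.0] * 4 for _ in targets]
loss = masked_cross_entropy(uniform_logits, targets, [True, False, True, False])
print(loss)  # equals log(4) for a uniform prediction
```

With uniform logits, every masked position contributes log(4), so the mean does not depend on which positions were masked.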

Parallel Decoding: GPT (Autoregressive) vs Diffusion LM

Below is a visual representation comparing standard autoregressive generation (GPT-style) against the parallel decoding method used in this diffusion language model.

Instead of decoding strictly left to right, the diffusion generation algorithm starts from a fully masked block. At each step it scores every masked position and unmasks the tokens whose confidence exceeds a specified confidence_threshold, repeating until every position in the block is filled.
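
The loop described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the repository's decoder: `toy_predict` is a hypothetical stand-in for the model, the `MASK` value is a placeholder, and the fallback that commits the single most confident position when nothing clears the threshold is an assumed way to guarantee progress.

```python
from typing import Callable, List, Tuple

MASK = -1  # placeholder mask id for this toy example

Predictor = Callable[[List[int]], List[Tuple[int, float]]]

def parallel_decode(predict: Predictor, length: int, confidence_threshold: float,
                    max_steps: int = 50) -> Tuple[List[int], int]:
    """Start fully masked and commit, at each step, every masked position
    whose confidence exceeds the threshold. Returns (tokens, steps_used)."""
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens and steps < max_steps:
        steps += 1
        guesses = predict(tokens)  # one (token, confidence) pair per position
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        accepted = [i for i in masked if guesses[i][1] > confidence_threshold]
        if not accepted:
            # Assumed fallback: commit the most confident masked position.
            accepted = [max(masked, key=lambda i: guesses[i][1])]
        for i in accepted:
            tokens[i] = guesses[i][0]
    return tokens, steps

def toy_predict(tokens: List[int]) -> List[Tuple[int, float]]:
    """Hypothetical model: even positions are confident from the start,
    odd positions gain confidence once their neighbours are revealed."""
    out = []
    for i in range(len(tokens)):
        left = i > 0 and tokens[i - 1] != MASK
        right = i + 1 < len(tokens) and tokens[i + 1] != MASK
        bonus = 0.5 if i % 2 == 0 else 0.0
        out.append((100 + i, min(1.0, 0.5 + 0.25 * (left + right) + bonus)))
    return out

tokens, steps = parallel_decode(toy_predict, length=5, confidence_threshold=0.7)
print(tokens, steps)  # [100, 101, 102, 103, 104] 2
```

In this toy run five positions are filled in two steps, whereas left-to-right decoding would need five. In a real model the confidence would typically be the highest softmax probability at each position, which is also an assumption here.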
