A language model built from biological neurons — not transformers.
Real Sam uses Leaky Integrate-and-Fire (LIF) spiking neurons, curriculum learning, and environment-driven plasticity to learn language the way a brain does: through binary spikes, temporal dynamics, and sparse computation.
| Phase | Seq Length | Val Loss | Perplexity |
|---|---|---|---|
| Words | 8 | 2.64 | 14 |
| Phrases | 16 | 2.55 | 13 |
| Sentences | 32 | 2.48 | 12 |
| Stories | 64 | 2.40 | 11 |
6M parameters. ~10% firing rate. Trained on a single GTX 1050 Ti.
```
Token → STE Spike Encoder → Linear Projection (256 → 512)
  → 6x Environment Spiking Blocks (LIF + diversity + residual)
  → Weight-Tied Readout → Next Token
```
Each token becomes a binary spike vector via the Straight-Through Estimator. Six layers of LIF neurons with learnable decay process the sequence recurrently: membrane potential carries temporal context, and spikes are binary {0, 1}. The readout layer reuses the transposed embedding matrix, saving more than 60% of the model's parameters.
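The encoder step can be sketched as follows. This is a minimal illustration, not the repo's actual API: `STESpike` and `encode` are hypothetical names, and the 0.5 threshold is an assumption.

```python
import torch

class STESpike(torch.autograd.Function):
    """Binarize to {0, 1} in the forward pass; pass gradients
    straight through (identity) in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return (x > 0.5).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # the "straight-through" part

def encode(token_ids, embedding):
    # sigmoid squashes the embedding into (0, 1); thresholding binarizes it
    return STESpike.apply(torch.sigmoid(embedding(token_ids)))

emb = torch.nn.Embedding(4096, 256)   # vocab 4096, embedding dim 256
spikes = encode(torch.tensor([5, 17, 42]), emb)
```

The custom `backward` is what lets gradients reach the embedding table despite the non-differentiable threshold.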
Generation is O(1) per token. No attention. No KV cache. Just spiking state.
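A generation loop under this scheme could look like the sketch below. The `model(token, state) -> (logits, state)` interface is an assumption for illustration; the point is that only a fixed-size spiking state is carried between steps, so each token costs O(1).

```python
import torch

def generate(model, prompt_ids, n_new, state=None):
    """Greedy autoregressive generation with no attention and no KV
    cache: only the recurrent spiking `state` is carried forward."""
    ids = list(prompt_ids)
    for t in ids:  # warm up the spiking state on the prompt
        logits, state = model(torch.tensor([t]), state)
    for _ in range(n_new):
        next_id = int(torch.argmax(logits, dim=-1))
        ids.append(next_id)
        logits, state = model(torch.tensor([next_id]), state)
    return ids
```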
Curriculum Learning — Data complexity increases in phases (words, phrases, sentences, stories, conversations), mimicking infant language development. Phase transitions are automatic, triggered by loss convergence.
Shared Environment — One global stress signal modulates all neurons simultaneously, like cortisol in the bloodstream. High loss = stressed environment = neurons explore more. Inspired by Cortical Labs' DishBrain and the Free Energy Principle.
Neuron Diversity — Each neuron has a fixed "personality" sampled at initialization (LogNormal diversity factor), simulating biological receptor density. Same environment, different responses. Sensitive explorers and resilient anchors.
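Together, the shared environment and per-neuron diversity might be sketched like this. The LogNormal parameters, the baseline, and the exact modulation formula are all assumptions for illustration; only the overall shape (one global stress scalar, fixed per-neuron factors) follows the description above.

```python
import torch

n_neurons = 512
# Fixed per-neuron "personality", sampled once at initialization
diversity = torch.distributions.LogNormal(0.0, 0.5).sample((n_neurons,))

def neuron_gain(loss_value, baseline=2.5):
    # One global stress scalar shared by every neuron, like cortisol:
    # high training loss -> stressed environment -> larger gain
    stress = torch.sigmoid(torch.tensor(loss_value - baseline))
    # Same environment, but each neuron responds per its diversity factor
    return 1.0 + stress * diversity

stressed = neuron_gain(4.0)  # high loss: gains spread out, more exploration
calm = neuron_gain(1.0)      # low loss: gains stay near 1
```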
Firing Rate Regularization — Neurons maintain ~10% sparse firing through a loss penalty, not threshold manipulation. Backprop naturally discovers efficient sparse codes, just like biological cortex.
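A firing-rate penalty of this kind can be written as a squared deviation of the mean spike rate from the target. This is an illustrative form; the repo's exact penalty and weighting may differ.

```python
import torch

def firing_rate_penalty(spikes, target_rate=0.10, weight=1.0):
    """Squared deviation of the mean firing rate from the target.

    spikes: binary tensor, e.g. shape (batch, time, neurons).
    """
    rate = spikes.float().mean()  # fraction of neurons that fired
    return weight * (rate - target_rate) ** 2

dense = torch.ones(8, 32, 512)                    # every neuron fires
sparse = (torch.rand(8, 32, 512) < 0.10).float()  # ~10% firing
```

Because the penalty is differentiable in the spike probabilities (via the STE), backprop can push the network toward sparse codes without touching thresholds directly.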
```
git clone https://github.com/nakaiwilliams/real-sam.git
cd real-sam
pip install -r requirements.txt
```

Download and prepare the training data:

```
python -m src.large_data --data-dir data --vocab 4096
```

This downloads TinyStories, Alpaca, Dolly, and OpenAssistant data, trains a BPE tokenizer, and caches everything locally.
```
python src/train.py --mode v4 --epochs 80 --batch-size 32 --grad-accum 4
```

Training runs on CUDA, Apple MPS, or CPU. A GTX 1050 Ti (4 GB VRAM) handles `batch_size=32` comfortably.
Resume an interrupted run:

```
python src/train.py --mode v4 --resume --epochs 80
```

Chat with the trained model:

```
python -m src.chat --checkpoint checkpoints/real-sam-v4.pt
```

Generate text from a prompt:

```
python src/generate.py --checkpoint checkpoints/real-sam-v4.pt --prompt "Once upon a time"
```

Repository layout:

```
src/
  neurons.py            LIF neuron implementations (V1-V4)
  encoder.py            STE spike encoder (tokens → binary spikes)
  network.py            Full model architectures (RealSam V1-V4)
  train.py              Training loop with curriculum learning
  data.py               BPE tokenizer and dataset utilities
  curriculum_data.py    Multi-phase curriculum data pipeline
  chat.py               Interactive chat interface
  generate.py           Text generation
  spiking_ner.py        Spiking NER model (for PII detection)
  train_spiking_ner.py  NER training pipeline
docs/
  index.html            Project landing page
checkpoints/            Model checkpoints (not in git; train or download)
data/                   Training data (not in git; download via script)
```
| Version | Params | Key Feature | Notes |
|---|---|---|---|
| V1 | ~1M | Basic LIF + recurrence | Character-level Shakespeare |
| V2 | ~3M | Residual blocks + LayerNorm | BPE tokenizer, conversation data |
| V3 | ~6M | Homeostatic thresholds | Per-neuron adaptive thresholds (deprecated) |
| V4 | ~6M | Environment + diversity | Shared stress signal, neuron personalities |
- Python 3.9+
- PyTorch 2.0+
- snnTorch 0.7+
- tokenizers
- datasets (HuggingFace)
- tqdm, numpy, matplotlib
Real Sam processes language through spiking dynamics:

1. Encoding: Each BPE token is embedded and passed through a sigmoid + threshold to produce a binary spike vector. The Straight-Through Estimator provides gradients for backpropagation.

2. Processing: Six stacked spiking blocks process the sequence one token at a time. Each block has:
   - A feedforward path (`fc_in`)
   - A recurrent path from previous spikes (`fc_rec`)
   - A LIF neuron with learnable decay (`beta`)
   - A gated residual connection

3. Environment: A shared stress signal, computed from the training loss, modulates all neurons' gain. Each neuron's response is scaled by its fixed diversity factor; some neurons are sensitive explorers, others are resilient anchors.

4. Readout: The output projects back to embedding space and multiplies by the transposed embedding matrix (weight tying). This produces next-token logits without a separate vocabulary projection.

5. Curriculum: Training data complexity increases automatically through phases. The model learns words before phrases, phrases before sentences, and so on, just like a child.
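The per-block update in the processing step can be sketched as below. The `fc_in`/`fc_rec`/`beta` names follow the description above, but the reset rule, gating, and initialization are illustrative assumptions, not the repo's implementation.

```python
import torch
import torch.nn as nn

class SpikingBlock(nn.Module):
    """One spiking block: feedforward + recurrent input paths, a LIF
    neuron with learnable decay, and a gated residual connection."""
    def __init__(self, dim, threshold=1.0):
        super().__init__()
        self.fc_in = nn.Linear(dim, dim)     # feedforward path
        self.fc_rec = nn.Linear(dim, dim)    # recurrent path from previous spikes
        self.beta = nn.Parameter(torch.zeros(dim))  # learnable decay (via sigmoid)
        self.gate = nn.Parameter(torch.zeros(dim))  # residual gate
        self.threshold = threshold

    def forward(self, x, mem, prev_spikes):
        # Membrane potential decays, then integrates both input paths
        mem = torch.sigmoid(self.beta) * mem + self.fc_in(x) + self.fc_rec(prev_spikes)
        spikes = (mem > self.threshold).float()  # binary {0, 1}; an STE would pass gradients
        mem = mem - spikes * self.threshold      # soft reset where a spike fired
        out = x + torch.sigmoid(self.gate) * spikes  # gated residual connection
        return out, mem, spikes

block = SpikingBlock(512)
x = torch.randn(4, 512)
mem = torch.zeros(4, 512)
prev = torch.zeros(4, 512)
out, mem, spk = block(x, mem, prev)
```

Because `mem` persists across tokens, the block carries temporal context without attention; generation just threads `(mem, spikes)` forward one token at a time.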
Transformers are brilliant. But they're not how brains work.
Biological neurons communicate through binary spikes — discrete events in time. Information is encoded in when neurons fire, not in continuous activation values. This is fundamentally more efficient: most neurons are silent most of the time.
Real Sam explores whether this principle can work for language. It's not trying to beat GPT-4. It's asking: what if we built language models the way evolution built brains?
The answer, so far: 6 million spiking neurons can learn grammar, narrative structure, and basic conversation. Not perfectly. But they do it with ~10% of neurons active at any time, O(1) generation per token, and no attention mechanism at all.
MIT. See LICENSE.
Built by Nakai Williams. Powered by spikes, not attention.