fix: implement Llama 3 RoPE frequency scaling by KenForever1 · Pull Request #1 · jmaczan/tiny-vllm

KenForever1 · 2026-05-30T16:32:38Z

Fix: Implement Llama 3 RoPE Frequency Scaling

Problem

The Llama 3.2 1B Instruct model uses rope_type: "llama3" frequency scaling (config: factor=32, high_freq_factor=4, low_freq_factor=1, original_max_position_embeddings=8192), which was not implemented. The RoPE kernels were computing simple inverse frequencies 1/(500000^(2i/64)) without the three-region scaling, causing all hidden states after position 0 to diverge from the reference HuggingFace implementation. Token generation was garbled.

Root Cause

Llama 3 RoPE divides frequency pairs into three regions based on wavelength λ = 2π / inv_freq:

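| Region | Condition | Transformation |
| --- | --- | --- |
| High freq | λ < old_context_len / high_freq_factor (2048) | Keep original `inv_freq` |
| Mid freq | 2048 ≤ λ ≤ old_context_len / low_freq_factor (8192) | Smooth interpolation: `(1-smooth)·inv_freq/32 + smooth·inv_freq` |
| Low freq | λ > 8192 | Divide `inv_freq` by 32 |

Here `smooth = (old_context_len / λ - low_freq_factor) / (high_freq_factor - low_freq_factor)`, which falls from 1 at λ = 2048 to 0 at λ = 8192, and old_context_len is original_max_position_embeddings (8192).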

For the 32 dimension pairs (head_dim=64), pairs 0-14 are unchanged, pairs 15-17 are interpolated, and pairs 18-31 are divided by 32. Without this scaling, the high-index frequency pairs have angles 32× too large, corrupting the rotation.
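
The three-region rule above can be sketched on the host as follows. This is a minimal illustration rather than the code in this PR: the constants come from the config quoted in the Problem section, while the names `baseInvFreq`, `classify`, `scaleInvFreq`, and `computeScaledInvFreqs` are hypothetical, and `double` is used for clarity even though a GPU table would likely hold `float`.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Config values quoted in the PR description (Llama 3.2 1B Instruct).
constexpr int    kHeadDim        = 64;
constexpr int    kNumPairs       = kHeadDim / 2;   // 32 rotation pairs
constexpr double kRopeTheta      = 500000.0;
constexpr double kFactor         = 32.0;
constexpr double kLowFreqFactor  = 1.0;
constexpr double kHighFreqFactor = 4.0;
constexpr double kOldContextLen  = 8192.0;         // original_max_position_embeddings
constexpr double kTwoPi          = 6.283185307179586;

enum class Region { High, Mid, Low };

// Unscaled inverse frequency for one pair: 1 / theta^(2i / head_dim).
double baseInvFreq(int pair) {
    assert(pair >= 0 && pair < kNumPairs);
    return 1.0 / std::pow(kRopeTheta, 2.0 * pair / kHeadDim);
}

// Pick the region from the wavelength lambda = 2*pi / inv_freq.
Region classify(double inv_freq) {
    const double wavelen = kTwoPi / inv_freq;
    if (wavelen < kOldContextLen / kHighFreqFactor) return Region::High;  // < 2048
    if (wavelen > kOldContextLen / kLowFreqFactor)  return Region::Low;   // > 8192
    return Region::Mid;
}

// Apply the three-region transformation to a single inverse frequency.
double scaleInvFreq(double inv_freq) {
    switch (classify(inv_freq)) {
        case Region::High: return inv_freq;            // keep as-is
        case Region::Low:  return inv_freq / kFactor;  // divide by 32
        case Region::Mid:  break;                      // interpolate below
    }
    const double wavelen = kTwoPi / inv_freq;
    const double smooth = (kOldContextLen / wavelen - kLowFreqFactor) /
                          (kHighFreqFactor - kLowFreqFactor);
    return (1.0 - smooth) * inv_freq / kFactor + smooth * inv_freq;
}

// Build the full table of scaled inverse frequencies, one per pair.
std::array<double, kNumPairs> computeScaledInvFreqs() {
    std::array<double, kNumPairs> freqs{};
    for (int pair = 0; pair < kNumPairs; ++pair) {
        freqs[pair] = scaleInvFreq(baseInvFreq(pair));
    }
    return freqs;
}
```

Calling `classify(baseInvFreq(i))` for each pair is a quick way to check the 0-14 / 15-17 / 18-31 split stated above.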

Changes

src/kernels.cu

- Added Llama 3 RoPE scaling parameters as `constexpr` values
- Added `__device__ __constant__ float d_rope_freqs[32]`: precomputed scaled inverse frequencies accessible from all kernels
- Added `initRopeFreqs()`: computes the 32 scaled frequencies once at startup using the three-region algorithm, uploads to GPU constant memory via `cudaMemcpyToSymbol`
- Refactored `ropeKernel` (prefill) to use `d_rope_freqs[pair_idx]` instead of computing `1/pow(500000, 2i/64)` per thread, removing a standing TODO
- Refactored `ropeKernelDecode` (decode) identically
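
To illustrate the table-lookup pattern the kernels now follow, here is a host-side C++ reference for rotating one attention head. It is a sketch under assumptions, not the kernel from this PR: the function name `applyRopeToHead` is hypothetical, and the pairing of element `i` with element `i + head_dim/2` follows the HuggingFace `rotate_half` convention, which may differ from the layout used in `src/kernels.cu`.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int kHeadDim  = 64;
constexpr int kNumPairs = kHeadDim / 2;

// Rotate one head in place. `inv_freqs` plays the role of d_rope_freqs:
// a table computed once, so no pow() call is needed per element.
// Assumption: pair i is (head[i], head[i + kNumPairs]), as in HF rotate_half.
void applyRopeToHead(float* head, int position,
                     const std::array<float, kNumPairs>& inv_freqs) {
    assert(position >= 0);
    for (int pair = 0; pair < kNumPairs; ++pair) {
        const float angle = static_cast<float>(position) * inv_freqs[pair];
        const float c = std::cos(angle);
        const float s = std::sin(angle);
        const float x0 = head[pair];
        const float x1 = head[pair + kNumPairs];
        head[pair]             = x0 * c - x1 * s;  // 2D rotation by `angle`
        head[pair + kNumPairs] = x0 * s + x1 * c;
    }
}
```

In a kernel, the loop body would typically map to one thread per pair, with `inv_freqs[pair]` read from constant memory. At position 0 every angle is zero, so the rotation is the identity; this is consistent with the Problem section's observation that hidden states only diverged after position 0.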

src/kernels.cuh

Declared initRopeFreqs()

src/main.cpp

Call initRopeFreqs() at startup before loading weights

Verification

Tested with unsloth/Llama-3.2-1B-Instruct on RTX 4050 Laptop GPU:

| Prompt | Before Fix | After Fix |
| --- | --- | --- |
| "What is 2+2?" | garbled tokens | "The answer is 4" |
| "Name a color." | garbled tokens | "Blue." |
| "Capital of France?" | garbled tokens | "The capital of France is Paris." |

Outputs now match the HuggingFace reference implementation.

The Llama 3.2 1B model uses `rope_type: "llama3"` frequency scaling (factor=32, high_freq_factor=4, low_freq_factor=1) which was not implemented. The RoPE kernels were using simple `1/(500000^(2i/64))` frequencies without scaling, causing completely wrong hidden states after position 0 and garbled token generation.

- Precompute 32 scaled inverse frequencies at init time into `__constant__` GPU memory via `initRopeFreqs()`
- Wavelengths > 8192: divide inv_freq by factor (32)
- Wavelengths 2048-8192: smooth interpolation between original and scaled frequencies
- Wavelengths < 2048: keep original inv_freq unchanged
- Both ropeKernel (prefill) and ropeKernelDecode now use precomputed frequencies, resolving a standing TODO

Verified: "What is 2+2?" → "The answer is 4", "Capital of France?" → "The capital of France is Paris.", matching HuggingFace reference.

Co-Authored-By: DeepSeek V4 Pro <noreply@deepseek.com>

KenForever1 wants to merge 1 commit into jmaczan:main from KenForever1:main
