fix: implement Llama 3 RoPE frequency scaling #1

Open
KenForever1 wants to merge 1 commit into jmaczan:main from KenForever1:main

KenForever1 commented May 30, 2026

Fix: Implement Llama 3 RoPE Frequency Scaling

Problem

The Llama 3.2 1B Instruct model uses rope_type: "llama3" frequency scaling (config: factor=32, high_freq_factor=4, low_freq_factor=1, original_max_position_embeddings=8192), which was not implemented. The RoPE kernels were computing simple inverse frequencies 1/(500000^(2i/64)) without the three-region scaling, causing all hidden states after position 0 to diverge from the reference HuggingFace implementation. Token generation was garbled.

Root Cause

Llama 3 RoPE divides frequency pairs into three regions based on wavelength λ = 2π / inv_freq:

| Region | Condition | Transformation |
| --- | --- | --- |
| High freq | λ < old_context_len / high_freq_factor (2048) | Keep original inv_freq |
| Mid freq | 2048 ≤ λ ≤ old_context_len / low_freq_factor (8192) | Smooth interpolation: (1-smooth)·inv_freq/32 + smooth·inv_freq |
| Low freq | λ > 8192 | Divide inv_freq by 32 |
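
The `smooth` weight used in the mid-frequency interpolation is not spelled out above. Assuming this change follows the HuggingFace reference it is verified against, with old_context_len = original_max_position_embeddings, the weight is:

$$
\text{smooth} = \frac{\text{old\_context\_len} / \lambda - \text{low\_freq\_factor}}{\text{high\_freq\_factor} - \text{low\_freq\_factor}}
$$

With the config values quoted in the Problem section this reduces to smooth = (8192/λ − 1) / 3, which equals 1 at λ = 2048 (the frequency is left as-is) and 0 at λ = 8192 (the frequency is fully divided by 32), so the three regions join continuously.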

For the 32 dimension pairs (head_dim=64), pairs 0-14 are unchanged, pairs 15-17 are interpolated, and pairs 18-31 are divided by 32. Without this scaling, the high-index frequency pairs have angles 32× too large, corrupting the rotation.
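
The three-region rule and the pair ranges quoted above can be checked with a short host-side sketch. This is a minimal illustration, not the PR's actual `initRopeFreqs()`: the names `Llama3RopeConfig` and `computeLlama3InvFreqs` are hypothetical, and the `smooth` formula is assumed to match the HuggingFace reference implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical parameter bundle; defaults mirror the config quoted in the PR.
struct Llama3RopeConfig {
    double rope_theta = 500000.0;
    int head_dim = 64;
    double factor = 32.0;
    double low_freq_factor = 1.0;
    double high_freq_factor = 4.0;
    double old_context_len = 8192.0;  // original_max_position_embeddings
};

// Returns head_dim / 2 scaled inverse frequencies, one per rotation pair.
// Assumption: same three-region rule as the HuggingFace "llama3" rope type.
std::vector<float> computeLlama3InvFreqs(const Llama3RopeConfig& cfg) {
    assert(cfg.head_dim > 0 && cfg.head_dim % 2 == 0);
    assert(cfg.high_freq_factor > cfg.low_freq_factor);

    const double kTwoPi = 6.283185307179586;
    const double low_freq_wavelen = cfg.old_context_len / cfg.low_freq_factor;    // 8192
    const double high_freq_wavelen = cfg.old_context_len / cfg.high_freq_factor;  // 2048
    const int num_pairs = cfg.head_dim / 2;

    std::vector<float> scaled(num_pairs);
    for (int i = 0; i < num_pairs; ++i) {
        // Unscaled frequency: 1 / theta^(2i / head_dim).
        const double inv_freq = std::pow(cfg.rope_theta, -2.0 * i / cfg.head_dim);
        const double wavelen = kTwoPi / inv_freq;

        double result;
        if (wavelen < high_freq_wavelen) {
            result = inv_freq;               // high frequency: unchanged
        } else if (wavelen > low_freq_wavelen) {
            result = inv_freq / cfg.factor;  // low frequency: divided by factor
        } else {
            // Mid frequency: blend between the scaled and unscaled values.
            const double smooth =
                (cfg.old_context_len / wavelen - cfg.low_freq_factor) /
                (cfg.high_freq_factor - cfg.low_freq_factor);
            result = (1.0 - smooth) * inv_freq / cfg.factor + smooth * inv_freq;
        }
        scaled[i] = static_cast<float>(result);
    }
    return scaled;
}
```

Under these assumptions, the resulting 32 floats are what a startup routine would copy into `d_rope_freqs` with `cudaMemcpyToSymbol`. Computing in double on the host and narrowing to float once is a design choice of this sketch, not something the PR states.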

Changes

src/kernels.cu

  • Added Llama 3 RoPE scaling parameters as constexpr values
  • Added __device__ __constant__ float d_rope_freqs[32] — precomputed scaled inverse frequencies accessible from all kernels
  • Added initRopeFreqs() — computes the 32 scaled frequencies once at startup using the three-region algorithm, uploads to GPU constant memory via cudaMemcpyToSymbol
  • Refactored ropeKernel (prefill) to use d_rope_freqs[pair_idx] instead of computing 1/pow(500000, 2i/64) per thread, removing a standing TODO
  • Refactored ropeKernelDecode (decode) identically
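
The effect of the kernel refactor, reading a precomputed frequency per pair instead of calling `pow` per thread, can be illustrated with a CPU reference. This is a hedged sketch only: `applyRopeReference` is a hypothetical name, and the half-split pairing (element i rotated with element i + head_dim/2, as in HuggingFace's `rotate_half`) is an assumption, since the kernels' actual memory layout is not shown in this description.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotates one attention head in place for a given token position.
// inv_freqs holds one precomputed (already scaled) inverse frequency per pair,
// playing the role that d_rope_freqs[pair_idx] plays in the kernels.
// Assumption: half-split pairing, i.e. element i is paired with element i + half.
void applyRopeReference(std::vector<float>& head, int position,
                        const std::vector<float>& inv_freqs) {
    const std::size_t half = inv_freqs.size();
    assert(head.size() == 2 * half);

    for (std::size_t i = 0; i < half; ++i) {
        // Table lookup replaces the per-thread 1 / pow(500000, 2i / 64).
        const float angle = static_cast<float>(position) * inv_freqs[i];
        const float c = std::cos(angle);
        const float s = std::sin(angle);

        const float x0 = head[i];
        const float x1 = head[i + half];
        head[i] = x0 * c - x1 * s;
        head[i + half] = x0 * s + x1 * c;
    }
}
```

At position 0 every angle is zero and the vector is returned unchanged, which is consistent with the observation that only hidden states after position 0 diverged before the fix.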

src/kernels.cuh

  • Declared initRopeFreqs()

src/main.cpp

  • Call initRopeFreqs() at startup before loading weights

Verification

Tested with unsloth/Llama-3.2-1B-Instruct on an RTX 4050 Laptop GPU:

| Prompt | Before Fix | After Fix |
| --- | --- | --- |
| "What is 2+2?" | garbled tokens | "The answer is 4" |
| "Name a color." | garbled tokens | "Blue." |
| "Capital of France?" | garbled tokens | "The capital of France is Paris." |

Outputs now match the HuggingFace reference implementation.

The Llama 3.2 1B model uses `rope_type: "llama3"` frequency scaling
(factor=32, high_freq_factor=4, low_freq_factor=1) which was not
implemented. The RoPE kernels were using simple `1/(500000^(2i/64))`
frequencies without scaling, causing completely wrong hidden states
after position 0 and garbled token generation.

- Precompute 32 scaled inverse frequencies at init time into
  __constant__ GPU memory via `initRopeFreqs()`
- Wavelengths > 8192: divide inv_freq by factor (32)
- Wavelengths 2048-8192: smooth interpolation between original
  and scaled frequencies
- Wavelengths < 2048: keep original inv_freq unchanged
- Both ropeKernel (prefill) and ropeKernelDecode now use
  precomputed frequencies, resolving a standing TODO

Verified: "What is 2+2?" → "The answer is 4", "Capital of France?"
→ "The capital of France is Paris.", matching HuggingFace reference.

Co-Authored-By: DeepSeek V4 Pro <noreply@deepseek.com>