fix: implement Llama 3 RoPE frequency scaling #1
Open
KenForever1 wants to merge 1 commit into
The Llama 3.2 1B model uses `rope_type: "llama3"` frequency scaling (factor=32, high_freq_factor=4, low_freq_factor=1), which was not implemented. The RoPE kernels were using simple `1/(500000^(2i/64))` frequencies without scaling, causing completely wrong hidden states after position 0 and garbled token generation.

- Precompute 32 scaled inverse frequencies at init time into `__constant__` GPU memory via `initRopeFreqs()`
  - Wavelengths > 8192: divide inv_freq by factor (32)
  - Wavelengths 2048-8192: smooth interpolation between original and scaled frequencies
  - Wavelengths < 2048: keep original inv_freq unchanged
- Both ropeKernel (prefill) and ropeKernelDecode now use precomputed frequencies, resolving a standing TODO

Verified: "What is 2+2?" → "The answer is 4", "Capital of France?" → "The capital of France is Paris.", matching HuggingFace reference.

Co-Authored-By: DeepSeek V4 Pro <noreply@deepseek.com>
Fix: Implement Llama 3 RoPE Frequency Scaling
Problem
The Llama 3.2 1B Instruct model uses `rope_type: "llama3"` frequency scaling (config: `factor=32`, `high_freq_factor=4`, `low_freq_factor=1`, `original_max_position_embeddings=8192`), which was not implemented. The RoPE kernels were computing simple inverse frequencies `1/(500000^(2i/64))` without the three-region scaling, causing all hidden states after position 0 to diverge from the reference HuggingFace implementation. Token generation was garbled.
Root Cause
Llama 3 RoPE divides frequency pairs into three regions based on wavelength `λ = 2π / inv_freq`:
- λ < 2048 (high frequency): keep `inv_freq` unchanged
- 2048 ≤ λ ≤ 8192 (medium frequency): `(1-smooth)·inv_freq/32 + smooth·inv_freq`
- λ > 8192 (low frequency): divide `inv_freq` by 32
For the 32 dimension pairs (head_dim=64), pairs 0-14 are unchanged, pairs 15-17 are interpolated, and pairs 18-31 are divided by 32. Without this scaling, the high-index frequency pairs have angles 32× too large, corrupting the rotation.
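The three-region rule above can be sketched on the host side as follows. This is a minimal illustration and not the code from this PR: `Llama3RopeConfig` and `computeLlama3RopeFreqs` are hypothetical names, and the `smooth` formula and the derivation of the 2048 and 8192 thresholds (`original_max_position_embeddings` divided by `high_freq_factor` and `low_freq_factor`) are assumptions based on how the HuggingFace `llama3` RoPE initialization is commonly written.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Config values quoted in the PR description for Llama 3.2 1B Instruct.
// The struct and its field names are hypothetical, chosen for this sketch.
struct Llama3RopeConfig {
    int    headDim        = 64;        // 32 rotation pairs
    double ropeTheta      = 500000.0;  // base of the unscaled frequencies
    double factor         = 32.0;
    double lowFreqFactor  = 1.0;
    double highFreqFactor = 4.0;
    double originalMaxPos = 8192.0;    // original_max_position_embeddings
};

// Returns one scaled inverse frequency per dimension pair.
std::vector<float> computeLlama3RopeFreqs(const Llama3RopeConfig& cfg) {
    // Guard against division by zero in the smooth term below.
    assert(cfg.highFreqFactor > cfg.lowFreqFactor);

    const double pi = std::acos(-1.0);
    const double lowFreqWavelen  = cfg.originalMaxPos / cfg.lowFreqFactor;   // 8192
    const double highFreqWavelen = cfg.originalMaxPos / cfg.highFreqFactor;  // 2048
    const int numPairs = cfg.headDim / 2;

    std::vector<float> freqs(numPairs);
    for (int i = 0; i < numPairs; ++i) {
        // Unscaled frequency: 1 / theta^(2i / head_dim).
        const double invFreq = 1.0 / std::pow(cfg.ropeTheta, (2.0 * i) / cfg.headDim);
        const double wavelen = 2.0 * pi / invFreq;

        double scaled;
        if (wavelen < highFreqWavelen) {
            scaled = invFreq;                 // high frequency: unchanged
        } else if (wavelen > lowFreqWavelen) {
            scaled = invFreq / cfg.factor;    // low frequency: divide by factor
        } else {
            // Medium band: blend between the scaled and original frequency.
            const double smooth = (cfg.originalMaxPos / wavelen - cfg.lowFreqFactor) /
                                  (cfg.highFreqFactor - cfg.lowFreqFactor);
            scaled = (1.0 - smooth) * invFreq / cfg.factor + smooth * invFreq;
        }
        freqs[i] = static_cast<float>(scaled);
    }
    return freqs;
}
```

Calling `computeLlama3RopeFreqs(Llama3RopeConfig{})` yields 32 values. Under these assumptions they are the kind of table the PR describes uploading to constant memory with `cudaMemcpyToSymbol`.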
Changes
- `src/kernels.cu`
  - `constexpr` values
  - `__device__ __constant__ float d_rope_freqs[32]`: precomputed scaled inverse frequencies accessible from all kernels
  - `initRopeFreqs()`: computes the 32 scaled frequencies once at startup using the three-region algorithm, uploads to GPU constant memory via `cudaMemcpyToSymbol`
  - `ropeKernel` (prefill) now uses `d_rope_freqs[pair_idx]` instead of computing `1/pow(500000, 2i/64)` per thread, removing a standing TODO
  - `ropeKernelDecode` (decode) changed identically
- `src/kernels.cuh`: declaration of `initRopeFreqs()`
- `src/main.cpp`: calls `initRopeFreqs()` at startup before loading weights
Verification
Tested with `unsloth/Llama-3.2-1B-Instruct` on an RTX 4050 Laptop GPU. Outputs now match the HuggingFace reference implementation.
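Beyond end-to-end generation, a small CPU reference for the rotation can help spot-check kernel output for a single head. The sketch below is an illustration only: `applyRopeReference` is a hypothetical helper, not part of this PR, and it assumes the HuggingFace-style layout in which element `i` is paired with element `i + head_dim/2`. A kernel that pairs adjacent elements `(2i, 2i+1)` would need the indexing adjusted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotates one head vector in place for the given token position.
// invFreqs holds one (already scaled) inverse frequency per pair,
// playing the role of a lookup table such as d_rope_freqs.
void applyRopeReference(std::vector<float>& head, int position,
                        const std::vector<float>& invFreqs) {
    const std::size_t half = invFreqs.size();
    assert(head.size() == 2 * half);  // head_dim must be twice the pair count

    for (std::size_t i = 0; i < half; ++i) {
        // Table lookup replaces a per-element pow() call.
        const float angle = static_cast<float>(position) * invFreqs[i];
        const float c = std::cos(angle);
        const float s = std::sin(angle);

        const float x0 = head[i];
        const float x1 = head[i + half];
        head[i]        = x0 * c - x1 * s;
        head[i + half] = x0 * s + x1 * c;
    }
}
```

At position 0 every angle is zero, so the vector is returned unchanged. This is consistent with the observation that only hidden states after position 0 were affected by the missing scaling.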