fix(bpe): widen pair_counts to i64 + add overflow regression test (#2058)#2087
fix(bpe): widen pair_counts to i64 + add overflow regression test (#2058)#2087pjdurden wants to merge 1 commit into
Conversation
`BpeTrainer` accumulated pair counts as `i32`. On large corpora a single pair can occur more than `i32::MAX` (2,147,483,647) times -- e.g. the space-space pair in heavily indented source code -- causing the counter to wrap negative. Negative counts fail the `count > 0` guard, so the high-frequency pair is silently dropped from the merges and the merge ordering is corrupted. Widen the pair-count accumulator from `i32` to `i64` in `count_pairs` and the merge-delta update. `i64::MAX` is far beyond any realistic pair frequency, and the map is short-lived and allocated once per training run, so there is no meaningful performance impact. Downstream `as u64` conversions are unaffected. Add a regression test that exercises the exact overflow path with a single word whose `u64` frequency exceeds `i32::MAX`, keeping the test fast and deterministic (no multi-billion-token corpus required). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR fixes an integer overflow issue in BPE pair-count accumulation by widening count types and adds a regression test to ensure high-frequency pairs aren’t silently dropped.
Changes:
- Switched pair-count accumulation from
i32toi64in the BPE trainer. - Updated pair-count delta computation to use
i64arithmetic. - Added a regression test covering counts above
i32::MAX.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Initialize pair_counts and where_to_update for this pair if we just saw it | ||
| // Then update counts | ||
| *pair_counts.entry(cur_pair).or_default() += counts[i] as i32; | ||
| *pair_counts.entry(cur_pair).or_default() += counts[i] as i64; |
There was a problem hiding this comment.
i64 is sufficient here, so I'm keeping it. counts[i] is a single word's frequency, so this cast only wraps if one word occurs more than i64::MAX (~9.2e18) times. The largest training corpora today are on the order of 10^13 tokens total, ~6 orders of magnitude below that, so it's unreachable in practice. The bug this PR fixes is the i32 limit (~2.1e9), which is reachable - e.g. the space-space pair in large, heavily-indented code corpora.
| let count = change as i64 * counts[iw] as i64; | ||
| *pair_counts.entry(pair).or_default() += count; |
There was a problem hiding this comment.
Same thing here i64 is enough. change is a small i32 adjacency delta (a handful per word), and counts[iw] is a per-word frequency, so the product is bounded by roughly |change| × total-corpus-tokens, on the order of 10^14 at most. well under i64::MAX (~9.2e18). I cast both operands to i64 before the multiply specifically so the product is evaluated in i64 rather than i32, which closes the only realistic overflow (the original code did this multiply in i32). An i128 intermediate would only guard magnitudes that aren't physically reachable.
Motivation
Closes #2058.
BpeTraineraccumulates pair counts asi32. On large corpora a single pair canoccur more than
i32::MAX(2,147,483,647) times — e.g. the space-space pair inheavily indented source code — so the counter wraps negative. A negative count
fails the
count > 0guard, so the high-frequency pair is silently dropped fromthe merges and the merge ordering is corrupted.
Modifications
i32toi64incount_pairsand themerge-delta update (
tokenizers/src/models/bpe/trainer.rs).i64::MAXis farbeyond any realistic pair frequency, the map is short-lived and allocated once
per training run, so there is no meaningful performance impact, and the
downstream
as u64conversions are unaffected.change as i64 * counts[iw] as i64, casting bothoperands before the multiply so the product itself can't overflow
i32either.Tests
bpe_test_pair_count_no_i32_overflow, which drives the exact overflow pathwith a single word whose
u64frequency exceedsi32::MAX(the('a', 'a')pairat 3e9 occurrences) and asserts the pair survives into the merges. It is fast and
deterministic — no multi-billion-token corpus required. Verified RED before the
fix and GREEN after.
cargo test --lib, 202 tests);cargo fmt --checkandcargo clippyare clean.Note on the existing PRs
There are already two open PRs for this issue — #2057 (draft) and #2059 — both of
which widen the field but neither adds a regression test for the overflow. This PR
keeps the same minimal widening and adds a test that pins the behavior so it cannot
silently regress (and also widens the merge-delta multiplication, not just the map
value). Happy to consolidate or defer if a maintainer prefers one of the others —
mainly flagging the coverage gap.