
fix(bpe): widen pair_counts to i64 + add overflow regression test (#2058) #2087

Open
pjdurden wants to merge 1 commit into huggingface:main from pjdurden:fix/bpe-trainer-pair-count-i64-overflow


@pjdurden pjdurden commented Jun 6, 2026

Motivation

Closes #2058.

BpeTrainer accumulates pair counts as i32. On large corpora a single pair can
occur more than i32::MAX (2,147,483,647) times — e.g. the space-space pair in
heavily indented source code — so the counter wraps negative. A negative count
fails the count > 0 guard, so the high-frequency pair is silently dropped from
the merges and the merge ordering is corrupted.
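
The wrap described above can be reproduced in isolation. This is a minimal sketch, not the trainer code: `narrow_to_i32` and `widen_to_i64` are hypothetical helpers that only model the `as` cast applied to a `u64` word frequency.

```rust
/// Hypothetical stand-in for the old accumulation step: narrowing a u64
/// frequency to i32 with `as` keeps only the low 32 bits.
fn narrow_to_i32(word_freq: u64) -> i32 {
    word_freq as i32
}

/// The same step with the widened accumulator type.
fn widen_to_i64(word_freq: u64) -> i64 {
    word_freq as i64
}

fn main() {
    // A frequency above i32::MAX (2_147_483_647).
    let freq: u64 = 3_000_000_000;

    let narrow = narrow_to_i32(freq);
    let wide = widen_to_i64(freq);

    // 3_000_000_000 - 2^32 = -1_294_967_296, so a `count > 0` guard rejects it.
    println!("i32: {narrow}, passes guard: {}", narrow > 0);
    println!("i64: {wide}, passes guard: {}", wide > 0);
}
```

With the `i32` cast the value is negative and fails the guard; with `i64` it stays at 3,000,000,000.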

Modifications

  • Widen the pair-count accumulator from i32 to i64 in count_pairs and the
    merge-delta update (tokenizers/src/models/bpe/trainer.rs). i64::MAX is far
    beyond any realistic pair frequency. The map is short-lived and allocated once
    per training run, so there is no meaningful performance impact, and the
    downstream as u64 conversions are unaffected.
  • The merge-delta computes change as i64 * counts[iw] as i64, casting both
    operands before the multiply so the product itself can't overflow i32 either.
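
To illustrate why both operands are cast before the multiply, here is a small sketch. The function names are hypothetical, and `delta_i32` only approximates the pre-fix arithmetic (it uses `wrapping_mul` so the sketch does not panic in debug builds). Note that `as` binds tighter than `*`, so `change as i64 * word_freq as i64` parses as `(change as i64) * (word_freq as i64)`.

```rust
/// Hypothetical model of the pre-fix arithmetic: the frequency is narrowed to
/// i32 and the product is evaluated in i32, with wrapping made explicit.
fn delta_i32(change: i32, word_freq: u64) -> i32 {
    change.wrapping_mul(word_freq as i32)
}

/// The widened form: each operand is cast to i64 first, so the product is
/// evaluated in i64.
fn delta_i64(change: i32, word_freq: u64) -> i64 {
    change as i64 * word_freq as i64
}

fn main() {
    // Both operands fit in i32 on their own, but their product does not.
    let change: i32 = 2;
    let word_freq: u64 = 1_500_000_000;

    println!("i32 product: {}", delta_i32(change, word_freq));
    println!("i64 product: {}", delta_i64(change, word_freq));
}
```

The example picks operands that each fit in `i32`, to show that widening only the map value would not be enough if the multiply itself stayed in `i32`.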

Tests

  • Added bpe_test_pair_count_no_i32_overflow, which drives the exact overflow path
    with a single word whose u64 frequency exceeds i32::MAX (the ('a', 'a') pair
    at 3e9 occurrences) and asserts the pair survives into the merges. It is fast and
    deterministic — no multi-billion-token corpus required. Verified RED before the
    fix and GREEN after.
  • Full lib suite passes (cargo test --lib, 202 tests); cargo fmt --check and
    cargo clippy are clean.
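
The shape of the regression check can be sketched without the trainer itself. The code below is a toy pair counter, not the tokenizers implementation and not the actual `bpe_test_pair_count_no_i32_overflow` test; `count_pairs_i64` and `surviving_pairs` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Toy pair counter (an assumption, not the library code): adds each word's
/// frequency to every adjacent character pair the word contains.
fn count_pairs_i64(words: &[(&str, u64)]) -> HashMap<(char, char), i64> {
    let mut pair_counts: HashMap<(char, char), i64> = HashMap::new();
    for &(word, freq) in words {
        let chars: Vec<char> = word.chars().collect();
        for window in chars.windows(2) {
            *pair_counts.entry((window[0], window[1])).or_default() += freq as i64;
        }
    }
    pair_counts
}

/// Pairs that pass a `count > 0` guard, sorted for deterministic output.
fn surviving_pairs(pair_counts: &HashMap<(char, char), i64>) -> Vec<(char, char)> {
    let mut pairs: Vec<(char, char)> = pair_counts
        .iter()
        .filter(|(_, count)| **count > 0)
        .map(|(pair, _)| *pair)
        .collect();
    pairs.sort();
    pairs
}

fn main() {
    // One word whose frequency exceeds i32::MAX, as in the test description.
    let words = [("aa", 3_000_000_000u64)];
    let counts = count_pairs_i64(&words);

    assert_eq!(counts[&('a', 'a')], 3_000_000_000);
    assert_eq!(surviving_pairs(&counts), vec![('a', 'a')]);
    println!("('a', 'a') survives with count {}", counts[&('a', 'a')]);
}
```

A single high-frequency word is enough to reach the overflow path, which is why no large corpus is needed.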

Note on the existing PRs

There are already two open PRs for this issue — #2057 (draft) and #2059 — both of
which widen the field but neither adds a regression test for the overflow. This PR
keeps the same minimal widening and adds a test that pins the behavior so it cannot
silently regress (and also widens the merge-delta multiplication, not just the map
value). Happy to consolidate or defer if a maintainer prefers one of the others —
mainly flagging the coverage gap.

`BpeTrainer` accumulated pair counts as `i32`. On large corpora a single
pair can occur more than `i32::MAX` (2,147,483,647) times -- e.g. the
space-space pair in heavily indented source code -- causing the counter
to wrap negative. Negative counts fail the `count > 0` guard, so the
high-frequency pair is silently dropped from the merges and the merge
ordering is corrupted.

Widen the pair-count accumulator from `i32` to `i64` in `count_pairs`
and the merge-delta update. `i64::MAX` is far beyond any realistic pair
frequency, and the map is short-lived and allocated once per training
run, so there is no meaningful performance impact. Downstream `as u64`
conversions are unaffected.

Add a regression test that exercises the exact overflow path with a
single word whose `u64` frequency exceeds `i32::MAX`, keeping the test
fast and deterministic (no multi-billion-token corpus required).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 6, 2026 23:59

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes an integer overflow issue in BPE pair-count accumulation by widening count types and adds a regression test to ensure high-frequency pairs aren’t silently dropped.

Changes:

  • Switched pair-count accumulation from i32 to i64 in the BPE trainer.
  • Updated pair-count delta computation to use i64 arithmetic.
  • Added a regression test covering counts above i32::MAX.


  // Initialize pair_counts and where_to_update for this pair if we just saw it
  // Then update counts
- *pair_counts.entry(cur_pair).or_default() += counts[i] as i32;
+ *pair_counts.entry(cur_pair).or_default() += counts[i] as i64;

@pjdurden pjdurden Jun 7, 2026


i64 is sufficient here, so I'm keeping it. counts[i] is a single word's frequency, so this cast only wraps if one word occurs more than i64::MAX (~9.2e18) times. The largest training corpora today are on the order of 10^13 tokens total, ~6 orders of magnitude below that, so it's unreachable in practice. The bug this PR fixes is the i32 limit (~2.1e9), which is reachable - e.g. the space-space pair in large, heavily-indented code corpora.
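
The headroom argument can be checked with a few lines of arithmetic. This is only a sketch: `headroom_factor` and `fits_in_i32` are hypothetical helpers, and the 10^13 figure is the corpus size quoted in the comment above, not a measured value.

```rust
/// How many times a count of `tokens` fits below i64::MAX.
fn headroom_factor(tokens: u64) -> u64 {
    i64::MAX as u64 / tokens
}

/// Whether a count can be represented in i32 without wrapping.
fn fits_in_i32(count: u64) -> bool {
    i32::try_from(count).is_ok()
}

fn main() {
    // 10^13 tokens, the order of magnitude quoted for large corpora.
    let corpus_tokens: u64 = 10_000_000_000_000;

    println!("i64 headroom factor: {}", headroom_factor(corpus_tokens));
    println!("10^13 fits in i32: {}", fits_in_i32(corpus_tokens));
    println!("3e9 fits in i32: {}", fits_in_i32(3_000_000_000));
}
```

The factor comes out to a little under 10^6, consistent with the "~6 orders of magnitude" estimate.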

Comment on lines +584 to 585
let count = change as i64 * counts[iw] as i64;
*pair_counts.entry(pair).or_default() += count;

@pjdurden pjdurden Jun 7, 2026


Same thing here: i64 is enough. change is a small i32 adjacency delta (a handful per word), and counts[iw] is a per-word frequency, so the product is bounded by roughly |change| × total-corpus-tokens, on the order of 10^14 at most, well under i64::MAX (~9.2e18). I cast both operands to i64 before the multiply specifically so the product is evaluated in i64 rather than i32, which closes the only realistic overflow (the original code did this multiply in i32). An i128 intermediate would only guard against magnitudes that aren't physically reachable.



Development

Successfully merging this pull request may close these issues.

BPE trainer: i32 overflow in pair_counts corrupts merge ordering on large code corpora

2 participants