fix(bpe): widen pair_counts to i64 + add overflow regression test (#2058) by pjdurden · Pull Request #2087 · huggingface/tokenizers

pjdurden · 2026-06-06T23:59:16Z

Motivation

Closes #2058.

BpeTrainer accumulates pair counts as i32. On large corpora a single pair can
occur more than i32::MAX (2,147,483,647) times — e.g. the space-space pair in
heavily indented source code — so the counter wraps negative. A negative count
fails the count > 0 guard, so the high-frequency pair is silently dropped from
the merges and the merge ordering is corrupted.

Modifications

Widen the pair-count accumulator from i32 to i64 in count_pairs and the
merge-delta update (tokenizers/src/models/bpe/trainer.rs). i64::MAX is far
beyond any realistic pair frequency, the map is short-lived and allocated once
per training run, so there is no meaningful performance impact, and the
downstream as u64 conversions are unaffected.
The merge-delta computes change as i64 * counts[iw] as i64, casting both
operands before the multiply so the product itself can't overflow i32 either.

Tests

Added bpe_test_pair_count_no_i32_overflow, which drives the exact overflow path
with a single word whose u64 frequency exceeds i32::MAX (the ('a', 'a') pair
at 3e9 occurrences) and asserts the pair survives into the merges. It is fast and
deterministic — no multi-billion-token corpus required. Verified RED before the
fix and GREEN after.
Full lib suite passes (cargo test --lib, 202 tests); cargo fmt --check and
cargo clippy are clean.

Note on the existing PRs

There are already two open PRs for this issue — #2057 (draft) and #2059 — both of
which widen the field but neither adds a regression test for the overflow. This PR
keeps the same minimal widening and adds a test that pins the behavior so it cannot
silently regress (and also widens the merge-delta multiplication, not just the map
value). Happy to consolidate or defer if a maintainer prefers one of the others —
mainly flagging the coverage gap.

`BpeTrainer` accumulated pair counts as `i32`. On large corpora a single pair can occur more than `i32::MAX` (2,147,483,647) times -- e.g. the space-space pair in heavily indented source code -- causing the counter to wrap negative. Negative counts fail the `count > 0` guard, so the high-frequency pair is silently dropped from the merges and the merge ordering is corrupted. Widen the pair-count accumulator from `i32` to `i64` in `count_pairs` and the merge-delta update. `i64::MAX` is far beyond any realistic pair frequency, and the map is short-lived and allocated once per training run, so there is no meaningful performance impact. Downstream `as u64` conversions are unaffected. Add a regression test that exercises the exact overflow path with a single word whose `u64` frequency exceeds `i32::MAX`, keeping the test fast and deterministic (no multi-billion-token corpus required). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes an integer overflow issue in BPE pair-count accumulation by widening count types and adds a regression test to ensure high-frequency pairs aren’t silently dropped.

Changes:

Switched pair-count accumulation from i32 to i64 in the BPE trainer.
Updated pair-count delta computation to use i64 arithmetic.
Added a regression test covering counts above i32::MAX.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pjdurden · 2026-06-07T00:02:19Z

                    // Initialize pair_counts and where_to_update for this pair if we just saw it
                    // Then update counts
-                    *pair_counts.entry(cur_pair).or_default() += counts[i] as i32;
+                    *pair_counts.entry(cur_pair).or_default() += counts[i] as i64;


i64 is sufficient here, so I'm keeping it. counts[i] is a single word's frequency, so this cast only wraps if one word occurs more than i64::MAX (~9.2e18) times. The largest training corpora today are on the order of 10^13 tokens total, ~6 orders of magnitude below that, so it's unreachable in practice. The bug this PR fixes is the i32 limit (~2.1e9), which is reachable - e.g. the space-space pair in large, heavily-indented code corpora.

pjdurden · 2026-06-07T00:02:20Z

+                let count = change as i64 * counts[iw] as i64;
                *pair_counts.entry(pair).or_default() += count;


Same thing here i64 is enough. change is a small i32 adjacency delta (a handful per word), and counts[iw] is a per-word frequency, so the product is bounded by roughly |change| × total-corpus-tokens, on the order of 10^14 at most. well under i64::MAX (~9.2e18). I cast both operands to i64 before the multiply specifically so the product is evaluated in i64 rather than i32, which closes the only realistic overflow (the original code did this multiply in i32). An i128 intermediate would only guard magnitudes that aren't physically reachable.

Copilot AI review requested due to automatic review settings June 6, 2026 23:59

Copilot AI reviewed Jun 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bpe): widen pair_counts to i64 + add overflow regression test (#2058)#2087

fix(bpe): widen pair_counts to i64 + add overflow regression test (#2058)#2087
pjdurden wants to merge 1 commit into
huggingface:mainfrom
pjdurden:fix/bpe-trainer-pair-count-i64-overflow

pjdurden commented Jun 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

pjdurden Jun 7, 2026 •

edited

Loading

Uh oh!

pjdurden Jun 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		let count = change as i64 * counts[iw] as i64;
		*pair_counts.entry(pair).or_default() += count;

Conversation

pjdurden commented Jun 6, 2026

Motivation

Modifications

Tests

Note on the existing PRs

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

pjdurden Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pjdurden Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pjdurden Jun 7, 2026 •

edited

Loading

pjdurden Jun 7, 2026 •

edited

Loading