Fix Unigram trainer prune loss to use per-piece alternative count#2070
Open
hunter-heidenreich wants to merge 2 commits into
Open
Fix Unigram trainer prune loss to use per-piece alternative count#2070hunter-heidenreich wants to merge 2 commits into
hunter-heidenreich wants to merge 2 commits into
Conversation
prune_sentence_pieces computed logsum_alt with alternatives.len() (the total piece count) instead of alternatives[id].len() (the alternatives for the piece being scored), diverging from SentencePiece's PruneSentencePieces. This inflates the loss for high-frequency pieces and changes which pieces survive pruning. Add the first tests for prune_sentence_pieces (previously uncovered): a regression test locking the per-piece keep decision, and a smoke test exercising the EM + prune loop end to end. Refs huggingface#2069 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment-only follow-up to the per-piece alternatives fix (no logic change): - Cite SentencePiece's Trainer::PruneSentencePieces at the fix site and note why the term must use alternatives[id], not alternatives.len(), to guard against silent re-drift of the ported comment. - Tidy the drifted comment: (alternatives[i] - 1) -> (alternatives[i].size() - 1). - Explain the slot arithmetic in test_prune_sentence_pieces_keeps_costly_alternative (why exactly one slot is contested, and what m/n padding is for). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
UnigramTrainer::prune_sentence_piecescomputes the prune-loss termlogsum_altwithalternatives.len()(the total number of pieces) where it should usealternatives[id].len()(the number of alternative segmentations for the onepiece being scored).
alternativesis the per-piece vectorVec<Vec<usize>>, soalternatives.len()is justpieces.len()(the whole vocabulary size) rather than the handful of piecesidwould be re-segmented into if removed.Closes #2069. (Same defect previously reported in #1536, closed without a reply.)
Why it matters
logsum_altis the log of the corpus token total after removing pieceidand reassigning its frequency to its alternatives. That total should grow byfreq[id] * (number_of_alternatives - 1). Using the total piece count instead multiplies the growth term by the entire vocabulary size, inflating the piece's loss. Because the inflation scales withfreq[id], it systematically over-values high-frequency pieces in the keep/drop ranking.The giveaway is the comment, ported verbatim from SentencePiece but with the
[i]dropped, and the code followed the comment.Parity with SentencePiece
This restores the reference behavior. SentencePiece
src/unigram_model_trainer.cc,Trainer::PruneSentencePieces:Does it change real tokenizers? Yes... bounded, and only when retraining
This affects only new
train()runs; existing serializedtokenizer.jsonfiles and all inference/encoding are byte-for-byte unaffected. The divergence is concentrated where pruning does most of the work (smallvocab_size) and disappears once the vocab is large enough that the EM threshold andfinalizetrim the affected margin anyway.Training Unigram on Pride and Prejudice (Project Gutenberg 1342, words lowercased),
shrinking_factor=0.9, single-threaded:At
vocab_size=1000the fix keeps more complete words (arrangement,benefit,explanation,inferior,leisure,original,similar); the buggy formula instead keeps short high-frequency pieces and fragments (into,act,best,friends,ten,tion,ize). The result is deterministic.Reproduction script (train on
mainvs this branch, then diff)Set
RAYON_RS_NUM_THREADS=1for a deterministic single-threaded run. The--compare differ=Nvalue counts both directions (sodiffer=28⇒ 14 each way).Tests
prune_sentence_pieceshad no coverage. The existing training tests use a tiny corpus that breaks out of the EM loop before pruning runs. This adds:test_prune_sentence_pieces_keeps_costly_alternative: a white-box unit test that locks the per-piece keep decision. It fails on the old line (keeps the cheap-to-replace piece) and passes on the fix, independent of the corpus above.test_do_train_runs_prune: the first end-to-end test that actually exercises the EM + prune loop (coverage for the previously-untested path).cargo fmt --checkandcargo clippy --all-targets --all-features -- -D warningsare clean;models::unigramand the Python Unigram binding tests pass.