Add Latin (LA) language support #657
Open
balwierz wants to merge 1 commit into
Cardinals fully declined for the forms that decline in Latin: all
three genders × six cases for 1, 2, 3 and the hundreds 200..900.
Default citation form is masculine nominative (matching standard
Latin grammars / dictionaries); pass `gender=` ('m', 'f', 'n') and
`case=` ('nom', 'gen', 'dat', 'acc', 'abl', 'voc') to request
specific agreement.
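As a rough illustration of how `gender=` and `case=` could select a form, here is a minimal sketch for the numeral 1 only. The table name `UNUS` and the helper `decline_one` are hypothetical and are not taken from the PR; the forms are the standard textbook paradigm of ūnus, and the PR's own tables may be organised differently.

```python
# Hypothetical lookup table for the numeral 1 (standard paradigm of "unus").
# The PR's real data layout is not shown in this description.
UNUS = {
    "m": {"nom": "ūnus", "gen": "ūnīus", "dat": "ūnī",
          "acc": "ūnum", "abl": "ūnō", "voc": "ūne"},
    "f": {"nom": "ūna", "gen": "ūnīus", "dat": "ūnī",
          "acc": "ūnam", "abl": "ūnā", "voc": "ūna"},
    "n": {"nom": "ūnum", "gen": "ūnīus", "dat": "ūnī",
          "acc": "ūnum", "abl": "ūnō", "voc": "ūnum"},
}


def decline_one(gender="m", case="nom"):
    """Return the form of 1 for the given gender and case.

    Defaults mirror the citation form (masculine nominative).
    """
    try:
        return UNUS[gender][case]
    except KeyError:
        raise ValueError(
            f"unsupported gender/case: {gender!r}/{case!r}"
        ) from None


print(decline_one())            # citation form
print(decline_one("f", "abl"))  # feminine ablative
```

Note that feminine nominative and ablative differ only in the final macron, which is the distinction the macron-bearing output is meant to preserve.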
Output uses Classical-Latin orthography with macrons (long-vowel
diacritics) so that case-syncretic forms like `ūnā` (abl.sg.fem.)
and `ūna` (nom.sg.fem.) are visually distinguished. Pass
`macrons=False` to opt out and get plain ASCII.
>>> num2words(50, lang='la')
'quīnquāgintā'
>>> num2words(50, lang='la', macrons=False)
'quinquaginta'
>>> num2words(38_630_666, lang='la', macrons=False)
'triginta octo miliones sescenta triginta milia sescenti sexaginta sex'
>>> num2words(1, lang='la', gender='f', case='abl')
'ūnā'
>>> num2words(200, lang='la', gender='n', case='abl')
'ducentīs'
>>> num2words(2024, lang='la', to='year')
'duo mīlia vīgintī quattuor'
>>> num2words(-44, lang='la', to='year')
'ante Chrīstum quadrāgintā quattuor'
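One way the `macrons=False` opt-out could be approximated is by Unicode decomposition: split each precomposed long vowel into base letter plus combining macron, then drop the combining mark. This is a sketch under that assumption; the helper name `strip_macrons` is hypothetical and the PR may implement the opt-out differently (for example with a translation table).

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text: str) -> str:
    """Remove macrons from Latin text, leaving other characters intact."""
    # NFD splits e.g. "ā" into "a" + U+0304 (combining macron).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch != COMBINING_MACRON)
    # Recompose so any unrelated accented characters keep their usual form.
    return unicodedata.normalize("NFC", stripped)


print(strip_macrons("quīnquāgintā"))
```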
Powers of ten use the long-scale Neo-Latin convention (parallel to
lang_PL): mīlle (10^3), milio / miliones (10^6), miliardus /
miliardi (10^9), billio / billiones (10^12), billiardus /
billiardi (10^15). MAXVAL = 10^18; OverflowError raised beyond
that. The Neo-Latin high-unit nouns are left without macrons —
those coinages have no settled long-vowel convention.
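The scale described above can be sketched as a decomposition of the number into (count, unit) pairs, largest unit first. The names `SCALE` and `split_by_scale` are hypothetical, only the singular unit names are listed, and treating `MAXVAL` itself as out of range is an assumption; this does not reproduce the PR's pluralisation or agreement logic.

```python
# Unit values and singular names as listed in the description.
SCALE = [
    (10**15, "billiardus"),
    (10**12, "billio"),
    (10**9, "miliardus"),
    (10**6, "milio"),
    (10**3, "mīlle"),
]
MAXVAL = 10**18  # assumption: values >= MAXVAL are rejected


def split_by_scale(n):
    """Split n into (count, unit_name) pairs; the last pair may have unit ''."""
    if not 0 <= n < MAXVAL:
        raise OverflowError(f"{n} is outside the supported range")
    parts = []
    for value, name in SCALE:
        count, n = divmod(n, value)
        if count:
            parts.append((count, name))
    # Keep the remainder below 1000, or a lone zero when n was 0.
    if n or not parts:
        parts.append((n, ""))
    return parts


print(split_by_scale(38_630_666))
# [(38, 'milio'), (630, 'mīlle'), (666, '')]
```

Each count is itself below 1000, so it can be spelled with the ordinary cardinal rules before the unit noun is attached.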
Ordinals are partially supported: 1..20, decadal (20, 30, …, 90),
100..900 (step 100), 1000, plus a placeholder
`deciēs centiēs mīllēsimus` for 10^6. Compound ordinals (21st,
33rd, …) fall back to the cardinal form: Latin grammar allows both
`ūnus et vīcēsimus` and `vīcēsimus prīmus`, so there is no
unambiguous algorithm. `to_ordinal_num` returns Roman numerals
(1..3999) and falls back to digits beyond 3999, since the overline
convention isn't representable in plain ASCII.
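The Roman-numeral behaviour described above can be sketched with the usual greedy subtractive-pair table. The function name `to_roman_or_digits` is hypothetical, and returning digits for values below 1 is an assumption not stated in the description.

```python
# Value/symbol pairs in descending order, including subtractive pairs.
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman_or_digits(n):
    """Return n as a Roman numeral for 1..3999, else as plain digits."""
    if not 1 <= n <= 3999:
        # Larger values would need the overline convention; fall back.
        return str(n)
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


print(to_roman_or_digits(1999))  # MCMXCIX
print(to_roman_or_digits(4000))  # 4000
```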
Side-fix in tests/test_errors.py: the "unknown language" sentinel
used to be `lalala`, which the dispatcher truncates to `la` and
now matches Latin. Replaced with `xxxxx` (2-letter prefix `xx`
isn't a registered code either).
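The collision can be reproduced with a small model of the lookup described above. `REGISTERED` and `resolve_lang` are hypothetical stand-ins for the real dispatcher and its converter table; only the truncate-then-retry behaviour is taken from the description.

```python
# Hypothetical registry standing in for the real converter table.
REGISTERED = {"en", "pl", "la"}


def resolve_lang(lang):
    """Resolve a language code, retrying with its two-letter prefix."""
    if lang not in REGISTERED:
        lang = lang[:2]
    if lang not in REGISTERED:
        raise NotImplementedError(lang)
    return lang


# Once "la" is registered, the old sentinel resolves instead of failing.
print(resolve_lang("lalala"))  # la

# The new sentinel still fails, because "xx" is not registered.
try:
    resolve_lang("xxxxx")
except NotImplementedError:
    print("xxxxx is still unknown")
```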
161 new tests in tests/test_la.py cover the full cardinal range,
gender/case agreement (including the macron-only distinctions
between syncretic forms), negatives, decimals, year formatting,
ordinals, Roman numerals, and the ASCII opt-out path. Full
repository suite: 1649 tests pass.
Add Latin (`la`) language support

Adds a Latin number-to-words converter, registers it in the dispatcher, and ships ~150 sub-test assertions covering it.
What lands
- Cardinals, e.g. `num2words(38_630_666, lang='la')` → `'trīgintā octō miliones sescenta trīgintā mīlia sescentī sexāgintā sex'`.
- Full declension (three genders × six cases) for `1`, `2`, `3` and the hundreds `200..900`. Everything else (`4..19`, the tens, `100`, `1000`) is invariant in Latin and stays invariant here.
- Classical-Latin orthography with macrons (`ūnus`, `trēs`, `mīlle`, `vīgintī`, …) so that case-syncretic forms are visually distinguishable: e.g. `ūnā` (abl.sg.fem.) vs `ūna` (nom.sg.fem.), which are indistinguishable without macrons. Pass `macrons=False` for plain ASCII.
- Long-scale Neo-Latin powers of ten: `mīlle` (10³), `milio` / `miliones` (10⁶), `miliardus` / `miliardi` (10⁹), `billio` / `billiones` (10¹²), `billiardus` / `billiardi` (10¹⁵). `MAXVAL = 10**18`. The Neo-Latin terms are left without macrons: those coinages have no settled long-vowel tradition, and marking them would be hypercorrection.
- `gender=` ("m"/"f"/"n") and `case=` ("nom"/"gen"/"dat"/"acc"/"abl"/"voc") kwargs control agreement when the caller knows what they're modifying.
- Ordinals are partially supported. Compound ordinals (21st, 33rd, …) fall back to the cardinal form, since Latin allows both `ūnus et vīcēsimus` and `vīcēsimus prīmus` and there's no unambiguous rule to pick between them.
- `to_ordinal_num` returns Roman numerals (`I`, `IV`, `MCMXCIX`, …) up to 3999; numbers ≥ 4000 fall back to digits (the overline convention isn't representable in plain ASCII).
- `to_year` spells the year as a cardinal in masc.nom; negative years are prefixed with `ante Chrīstum`.

Examples
Side-fix in `tests/test_errors.py`

`Num2WordsErrorsTest.test_NotImplementedError` used `lang="lalala"` as its "unknown language" sentinel. The dispatcher truncates an unknown long code to its first two letters before giving up, so `"lalala"[:2] == "la"` now matches the new Latin entry and the test stops raising. Replaced the sentinel with `"xxxxx"` (the 2-letter prefix `xx` isn't a registered code either) and left a comment explaining the collision so the next person to touch this can pick a non-conflicting sentinel safely.

Verification
Locally, with the project's own test workflow:
The 8 uncovered production lines are all dead-code defensive guards (`if n == 0` in helpers that are already filtered, `raise ValueError` on impossible inputs, an `except (ValueError, TypeError)` for float-coercion edge cases, a fallthrough `return stem` in the ordinal-declension helper for a stem shape that doesn't occur in our tables). All sensibly skippable; the public-API surface is fully covered.

CHANGES.rst

Deliberately untouched, following the established pattern of one-line release-entry additions by maintainers when a version is cut.
Latin grammar notes (for review)
A few choice points worth flagging:
- Long scale: parallel to `lang_PL` (Polish), which uses Latin-derived Neo-Latin-style names with `-iard` for the intermediate orders. `miliardus` matches Italian miliardo, French milliard, German Milliarde. The short-scale convention (`billion` = 10⁹) is debated in Neo-Latin; long-scale is more widely attested in scientific Latin and Vatican publications.
- Macrons: marked on `ūnus`, `trēs`, `mīlle`, `vīgintī`, `quīnquāgintā`, `ducentī`, `nōngentī`, `prīmus`, `quīntus`, etc., that is, every form for which Lewis & Short / Oxford Latin Dictionary marks a long vowel. I deliberately did not mark `milio`, `miliardus`, `billio`, `billiardus`: these are 15th-century-and-later coinages with no settled macron tradition, and marking them would be hypercorrection.
- Additive compounds: `vīgintī octō` / `vīgintī novem` / `trīgintā octō` / `trīgintā novem` style throughout, not the Classical subtractive `duodēvīgintī` / `ūndēvīgintī` / `duodētrīgintā` / `ūndētrīgintā` (those forms only appear for `18`/`19` themselves, since they're invariant lemmas; for `28`/`29` etc. I emit the additive form). The subtractive variants are documented but less common in practice and inconsistent across modern Latin pedagogy.

Happy to revisit any of these. I'm guessing there are Latinists on the project's review side who'll have strong opinions, and I'd rather get it right than push a default that needs follow-up patches.