Add Latin (LA) language support #657

Open: balwierz wants to merge 1 commit into savoirfairelinux:master from balwierz:add-latin
Add Latin (la) language support

Adds a Latin number-to-words converter, registers it in the dispatcher, and ships ~150 sub-test assertions covering it.

What lands

num2words(38_630_666, lang='la')
'trīgintā octō miliones sescenta trīgintā mīlia sescentī sexāgintā sex'

  • Cardinals fully declined across all three genders × six cases for the forms that decline in Latin: 1, 2, 3 and the hundreds 200..900. Everything else (4..19, the tens, 100, 1000) is invariant in Latin and stays invariant here.
  • Output is in Classical-Latin orthography with macrons by default (ūnus, trēs, mīlle, vīgintī, …) so that case-syncretic forms are visually distinguishable: e.g. ūnā (abl.sg.fem.) vs ūna (nom.sg.fem.) — without macrons these are indistinguishable. Pass macrons=False for plain ASCII.
  • Long-scale Neo-Latin units up to 10¹⁵: mīlle (10³), milio / miliones (10⁶), miliardus / miliardi (10⁹), billio / billiones (10¹²), billiardus / billiardi (10¹⁵). MAXVAL = 10**18. The Neo-Latin terms are left without macrons — those coinages have no settled long-vowel tradition, and marking them would be hypercorrection.
  • Optional gender= ("m" / "f" / "n") and case= ("nom" / "gen" / "dat" / "acc" / "abl" / "voc") kwargs control agreement when the caller knows what they're modifying.
  • Ordinals partially supported: 1..20, the decadal forms (20, 30, …, 90), 100..900, and 1000. Compound ordinals (21st, 33rd, …) fall back to the cardinal because Latin grammar allows both ūnus et vīcēsimus and vīcēsimus prīmus and there's no unambiguous rule to pick between them.
  • to_ordinal_num returns Roman numerals (I, IV, MCMXCIX, …) up to 3999; numbers ≥ 4000 fall back to digits (the overline convention isn't representable in plain ASCII).
  • to_year spells the year as a cardinal in masc.nom; negative years are prefixed with ante Chrīstum.
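
The gender= / case= agreement described above can be pictured as a table lookup. The following is a minimal sketch, not the code in lang_LA.py: the names CASES, UNUS and decline_one are hypothetical, and only the paradigm for "one" is shown.

```python
# Hypothetical lookup table for "one"; the names here are illustrative
# and are not taken from lang_LA.py.
CASES = ("nom", "gen", "dat", "acc", "abl", "voc")

UNUS = {
    "m": ("ūnus", "ūnīus", "ūnī", "ūnum", "ūnō", "ūne"),
    "f": ("ūna", "ūnīus", "ūnī", "ūnam", "ūnā", "ūna"),
    "n": ("ūnum", "ūnīus", "ūnī", "ūnum", "ūnō", "ūnum"),
}


def decline_one(gender="m", case="nom"):
    """Return the form of 'one' for the requested gender and case.

    Defaults to masculine nominative, the citation form.
    """
    if gender not in UNUS:
        raise ValueError(f"unknown gender: {gender!r}")
    if case not in CASES:
        raise ValueError(f"unknown case: {case!r}")
    return UNUS[gender][CASES.index(case)]
```

With this table, decline_one("f", "abl") gives 'ūnā' and decline_one("f", "nom") gives 'ūna', which is the macron-only distinction mentioned in the list.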

Examples

>>> num2words(50, lang='la')
'quīnquāgintā'
>>> num2words(50, lang='la', macrons=False)
'quinquaginta'
>>> num2words(1, lang='la', gender='f', case='abl')
'ūnā'
>>> num2words(200, lang='la', gender='n', case='abl')
'ducentīs'
>>> num2words(2024, lang='la', to='year')
'duo mīlia vīgintī quattuor'
>>> num2words(-44, lang='la', to='year')
'ante Chrīstum quadrāgintā quattuor'
>>> num2words(38_630_666, lang='la')
'trīgintā octō miliones sescenta trīgintā mīlia sescentī sexāgintā sex'
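
One plausible way to derive the macrons=False output from the macronized forms is Unicode decomposition. This is a sketch under that assumption; strip_macrons is a hypothetical helper, and the PR may well store or compute the ASCII forms differently.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text):
    """Remove macrons by decomposing, dropping U+0304, and recomposing."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch != COMBINING_MACRON)
    return unicodedata.normalize("NFC", stripped)
```

Applied to 'quīnquāgintā' this yields 'quinquaginta'. It also shows why the ASCII output loses information: 'ūnā' and 'ūna' both collapse to 'una'.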

Side-fix in tests/test_errors.py

Num2WordsErrorsTest.test_NotImplementedError used lang="lalala" as its "unknown language" sentinel. The dispatcher truncates an unknown long code to its first two letters before giving up, so "lalala"[:2] == "la" now matches the new Latin entry and the test stops raising. Replaced the sentinel with "xxxxx" (the 2-letter prefix xx isn't a registered code either) and left a comment explaining the collision so the next person to touch this can pick a non-conflicting sentinel safely.
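
The collision can be reproduced with a small stand-in for the dispatcher. The registry REGISTERED and the function resolve_lang below are hypothetical; they only mirror the behaviour described above (try the full code, then its first two letters, then give up).

```python
# Hypothetical registry standing in for the project's converter table.
REGISTERED = {"en": "English", "pl": "Polish", "la": "Latin"}


def resolve_lang(lang):
    """Resolve a language code the way the description above says."""
    if lang not in REGISTERED:
        # Unknown long code: fall back to its two-letter prefix.
        lang = lang[:2]
    if lang not in REGISTERED:
        raise NotImplementedError(lang)
    return lang
```

Once "la" is registered, resolve_lang("lalala") returns "la" instead of raising, while resolve_lang("xxxxx") still raises NotImplementedError.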

Verification

Locally, with the project's own test workflow:

Ran 1509 tests in 1.2s   # 1488 pre-existing + 21 new Latin TestCase methods
                          # (~150 sub-test assertions via subTest)
                          # (the 1 collection error is the pre-existing
                          #  delegator-missing test_cli.py; unrelated)

coverage report --fail-under=75 --omit=.tox/*,tests/*,/usr/*
  TOTAL                97%    # pre-existing was 95%
  num2words/lang_LA.py 93%    # 8 uncovered lines are defensive guards

coverage report --fail-under=100 --include=tests/test_la.py
  TOTAL                100%

flake8 num2words/lang_LA.py tests/test_la.py …  # clean
isort  --check-only --float-to-top …             # clean

The 8 uncovered production lines are all defensive guards on paths that should not be reachable: if n == 0 checks in helpers whose callers already filter zero, raise ValueError on impossible inputs, an except (ValueError, TypeError) for float-coercion edge cases, and a fallthrough return stem in the ordinal-declension helper for a stem shape that doesn't occur in our tables. None of them is reachable through the public API, which is fully covered.

CHANGES.rst

Deliberately untouched — following the established pattern of one-line release-entry additions by maintainers when a version is cut.

Latin grammar notes (for review)

A few choice points worth flagging:

  • Default citation form = masculine nominative. Matches standard Latin grammars and Lewis & Short. Other languages in this repo similarly emit the most-cited form by default.
  • Long-scale powers of ten. Parallels lang_PL (Polish), which uses Latin-derived names with -iard for the intermediate orders. miliardus matches Italian miliardo, French milliard, German Milliarde. The short-scale convention (billion = 10⁹) is debated in Neo-Latin; long-scale is more widely attested in scientific Latin and Vatican publications.
  • Macrons on Classical lemmas, not on Neo-Latin coinages. I marked ūnus, trēs, mīlle, vīgintī, quīnquāgintā, ducentī, nōngentī, prīmus, quīntus, etc. — every form for which Lewis & Short / Oxford Latin Dictionary marks a long vowel. I deliberately did not mark milio, miliardus, billio, billiardus — these are 15th-century-and-later coinages with no settled macron tradition, and marking them would be hypercorrection.
  • Compound 28 / 29, 38 / 39, etc. I went with the additive style throughout (vīgintī octō / vīgintī novem / trīgintā octō / trīgintā novem), not the Classical subtractive one (duodētrīgintā / ūndētrīgintā for 28 / 29, and so on). The subtractive forms appear only for 18 / 19 themselves (duodēvīgintī / ūndēvīgintī), since those are invariant lemmas. The subtractive variants for the higher decades are documented but less common in practice and inconsistent across modern Latin pedagogy.

Happy to revisit any of these — I'm guessing there are Latinists on the project's review side who'll have strong opinions, and I'd rather get it right than push a default that needs follow-up patches.
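
For reviewers weighing the long-scale choice, here is a minimal sketch of how a number splits into three-digit groups, each paired with its unit from the list in "What lands". UNIT_NAMES and split_long_scale are hypothetical names, and treating 10**18 itself as out of range is an assumption; the converter's actual grouping code may differ.

```python
# Singular unit names by power of ten, as listed in this PR.
UNIT_NAMES = {
    0: "",
    3: "mīlle",
    6: "milio",
    9: "miliardus",
    12: "billio",
    15: "billiardus",
}


def split_long_scale(n):
    """Split n into (count, unit) pairs, largest unit first.

    Assumes 0 <= n < 10**18; groups whose count is zero are skipped.
    """
    if not 0 <= n < 10**18:
        raise OverflowError("value out of range for this sketch")
    groups = []
    exponent = 0
    while n:
        n, chunk = divmod(n, 1000)
        if chunk:
            groups.append((chunk, UNIT_NAMES[exponent]))
        exponent += 3
    return groups[::-1]
```

For 38_630_666 this returns [(38, 'milio'), (630, 'mīlle'), (666, '')], matching the three groups in the headline example (38 miliones, 630 mīlia, 666). Pluralization and agreement are left out of the sketch.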

Cardinals fully declined for the forms that decline in Latin: all
three genders × six cases for 1, 2, 3 and the hundreds 200..900.
Default citation form is masculine nominative (matching standard
Latin grammars / dictionaries); pass `gender=` ('m', 'f', 'n') and
`case=` ('nom', 'gen', 'dat', 'acc', 'abl', 'voc') to request
specific agreement.

Output uses Classical-Latin orthography with macrons (long-vowel
diacritics) so that case-syncretic forms like `ūnā` (abl.sg.fem.)
and `ūna` (nom.sg.fem.) are visually distinguished. Pass
`macrons=False` to opt out and get plain ASCII.

  >>> num2words(50, lang='la')
  'quīnquāgintā'
  >>> num2words(50, lang='la', macrons=False)
  'quinquaginta'
  >>> num2words(38_630_666, lang='la', macrons=False)
  'triginta octo miliones sescenta triginta milia sescenti sexaginta sex'
  >>> num2words(1, lang='la', gender='f', case='abl')
  'ūnā'
  >>> num2words(200, lang='la', gender='n', case='abl')
  'ducentīs'
  >>> num2words(2024, lang='la', to='year')
  'duo mīlia vīgintī quattuor'
  >>> num2words(-44, lang='la', to='year')
  'ante Chrīstum quadrāgintā quattuor'

Powers of ten use the long-scale Neo-Latin convention (parallel to
lang_PL): mīlle (10^3), milio / miliones (10^6), miliardus /
miliardi (10^9), billio / billiones (10^12), billiardus /
billiardi (10^15). MAXVAL = 10^18; OverflowError raised beyond
that. The Neo-Latin high-unit nouns are left without macrons —
those coinages have no settled long-vowel convention.

Ordinals partially supported: 1..20, decadal (20, 30, …, 90),
100..900 (step 100), 1000, plus a placeholder
`deciēs centiēs mīllēsimus` for 10^6. Compound ordinals (21st,
33rd, …) fall back to the cardinal form since Latin grammar
allows both `ūnus et vīcēsimus` and `vīcēsimus prīmus` — no
unambiguous algorithm. `to_ordinal_num` returns Roman numerals
(1..3999); falls back to digits beyond 3999 since the overline
convention isn't representable in plain ASCII.

Side-fix in tests/test_errors.py: the "unknown language" sentinel
used to be `lalala`, which the dispatcher truncates to `la` and
now matches Latin. Replaced with `xxxxx` (2-letter prefix `xx`
isn't a registered code either).

161 new tests in tests/test_la.py cover the full cardinal range,
gender/case agreement (including the macron-only distinctions
between syncretic forms), negatives, decimals, year formatting,
ordinals, Roman numerals, and the ASCII opt-out path. Full
repository suite: 1649 tests pass.