Releases: modesty/pdf2json
Stable Build v4.0.3
pdf2json v4.0.3 Release Notes
Bug Fixes
- Text reading order — Added spatial sort (lib/pdftextsorter.js) to getRawTextContent() so multi-column
and complex-layout PDFs return text in correct top-to-bottom, left-to-right order instead of internal PDF
object order. (#422)
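As a rough illustration of what a spatial sort does, the sketch below orders text items top to bottom and then left to right. It is not the code in lib/pdftextsorter.js: the `sortReadingOrder` name, the `{ x, y, text }` item shape, and the tolerance value are assumptions made for this example, and it does no column detection.

```javascript
// Hypothetical sketch: sort text items top-to-bottom, then left-to-right.
// Assumes items look like { x, y, text } with y increasing downward.
// The default tolerance is illustrative, not a value taken from pdf2json.
function sortReadingOrder(items, yTolerance = 0.3) {
  // Sort by y first so items on the same visual line become neighbours.
  const byY = [...items].sort((a, b) => a.y - b.y);

  // Group neighbours whose y is within the tolerance of the line's first item.
  const lines = [];
  for (const item of byY) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(item.y - current[0].y) <= yTolerance) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }

  // Within each line, order left to right, then flatten back to one list.
  return lines.flatMap((line) => line.sort((a, b) => a.x - b.x));
}

const sampleItems = [
  { x: 10, y: 5.0, text: "world" },
  { x: 2, y: 9.0, text: "second line" },
  { x: 2, y: 5.1, text: "hello" },
];
console.log(sortReadingOrder(sampleItems).map((i) => i.text).join(" | "));
// prints "hello | world | second line"
```

Grouping by a y-tolerance before sorting by x keeps items with slightly different baselines on the same line, which a plain two-key sort would not.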
CLI Improvements (#423)
- New --json flag: Emits a structured JSON summary to stdout (version, output file paths, stats, errors, elapsed time) for programmatic and scripted consumption.
- New --quiet flag: Suppresses all non-error output (timer, status messages).
- Granular exit codes: 0 success · 1 parse failure · 2 argument error · 3 I/O error (previously only 0 or 1).
- Fixed --singleton / -si flags: Parser instance is now correctly shared at the CLI level; previously broken.
- Directory filter: Only skips dotfiles now; previously silently skipped files starting with -, _, or whitespace.
- 7 internal bug fixes: Eliminated Promise constructor anti-pattern, replaced callback-style fs.writeFile/fs.readdir with fs.promises, fixed addResultCount type mismatch, removed dead warningCount, and resolved a TOCTOU race condition in validateParams.
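A script that drives the CLI could combine the --json flag with the new exit codes roughly as follows. This is a sketch under assumptions: `describeExitCode` and `runPdf2json` are hypothetical helper names, `pdf2json` is assumed to be on PATH, and because the notes do not give the summary's field names, the parsed object is returned untouched.

```javascript
const { spawnSync } = require("node:child_process");

// Exit codes as listed in the v4.0.3 notes.
const EXIT_MEANINGS = {
  0: "success",
  1: "parse failure",
  2: "argument error",
  3: "I/O error",
};

function describeExitCode(code) {
  return EXIT_MEANINGS[code] ?? `unknown exit code ${code}`;
}

// Hypothetical wrapper: runs the CLI with --json and parses stdout.
// Assumes the -f (input) and -o (output directory) options of the CLI.
function runPdf2json(inputPath, outputDir) {
  const result = spawnSync(
    "pdf2json",
    ["-f", inputPath, "-o", outputDir, "--json"],
    { encoding: "utf8" }
  );
  if (result.error) throw result.error; // e.g. the binary was not found

  let summary = null;
  try {
    summary = JSON.parse(result.stdout);
  } catch {
    // stdout was not valid JSON; leave summary as null
  }
  return { status: describeExitCode(result.status), summary };
}
```

A caller might treat "argument error" as a bug in the calling script and "I/O error" as something worth retrying, which the earlier 0-or-1 exit codes could not distinguish.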
Build & Configuration
- tsconfig.json: Removed dead decorator options; updated moduleResolution/module to node16.
- package.json: Fixed exports map with proper types entries for ESM and CJS TypeScript consumers; removed unused tslib dependency; added test:coverage script.
- rollup.config.js: Enabled tree-shaking for CLI bundle; documented build order dependency.
- CI: Upgraded to actions/checkout@v4; added tsc --noEmit type-check step; bumped Node.js to 22.x.
Tests
- 3 new test suites, 22 new tests: CLI integration (_test_cli.cjs), Stream API (_test_stream.cjs), and error paths (_test_errors.cjs); all previously had zero coverage.
- Total: 74 tests / 7 suites (up from 52 / 4).
- Fixed listener leak in multi-parse test; standardized on Jest expect() over Node assert.
- Renamed _test_getRawTextContent.cjs → _test_sortBidiTexts.cjs to reflect actual coverage.
- Regenerated 37 baseline JSON files to reflect current parser output (baselines were stale since v0.6.8).
Stable Build v4.0.2
Added support for transparent groups and ensured that endGroup merges sub-canvas text, lines, etc. back into the primary output data. This completes the fix for #418.
Stable Build v4.0.1
Stable Build v4.0.0 [Breaking Changes]
v4.0.0 Release Notes
v4.0.0 includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. This release contains breaking changes that require attention when upgrading from v3.x.
🚨 Breaking Changes
Text Encoding Change (Issue #385, PR #410)
What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.
Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.
Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.
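For code that has to read JSON produced by both v3.x and v4.x during an upgrade, one option is to decode only when the output is known to be URI-encoded. The sketch below assumes a hypothetical `readRunText` helper and a caller-supplied `uriEncoded` flag; nothing in these notes says the JSON records which version produced it.

```javascript
// Hypothetical helper: `uriEncoded` must be supplied by the caller, for
// example based on which pdf2json major version produced the JSON.
function readRunText(run, uriEncoded) {
  return uriEncoded ? decodeURIComponent(run.T) : run.T;
}

console.log(readRunText({ T: "Added%20Text%20from%20Acrobat" }, true));
console.log(readRunText({ T: "Added Text from Acrobat" }, false));

// Decoding v4 output unconditionally is unsafe: a literal "%" in plain
// UTF-8 text is not a valid escape sequence and makes the call throw.
try {
  decodeURIComponent("50% off");
} catch (err) {
  console.log(err instanceof URIError); // prints true
}
```

Text that happens to contain a valid escape, such as "100%25", would be silently changed rather than rejected, which is another reason to drop the decode call once all inputs come from v4.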
JSON Output Examples
Before v4.0.0 (URI-encoded):
{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added%20Text%20from%20Acrobat"
      }]
    }]
  }]
}
After v4.0.0 (UTF-8):
{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added Text from Acrobat"
      }]
    }]
  }]
}
Code Migration
Before v4.0.0:
// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"
After v4.0.0:
// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"
CJK Character Support
Before v4.0.0:
{
  "T": "%E4%B8%AD%E6%96%87"
}
After v4.0.0:
{
  "T": "中文"
}
✨ Features & Enhancements
Accurate Space Preservation (Issues #355, #361, #319, PR #411)
Complete overhaul of space detection and preservation in text extraction (to try it, run the CLI with the -c command-line option):
- Glyph-based width calculation - Uses actual font metrics instead of estimates
- Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
- Text scale support - Applies textHScale for compressed/expanded text
- Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)
Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
Example Output Improvement
Before v4.0.0:
Name:JohnDoeSSN:123-45-6789
After v4.0.0:
Name: John Doe SSN: 123-45-6789
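The font-size-aware tolerance described above can be pictured with a small sketch. Only the 0.15 factor comes from these notes; the run fields (`x`, `y`, `width`, `fontSize`, `text`), the `joinRuns` name, and the horizontal gap threshold are assumptions for illustration and do not reflect pdf2json's internal data model.

```javascript
// Illustrative only: field names and the gap threshold are assumptions.
const Y_TOLERANCE_FACTOR = 0.15; // from the notes: fontSize × 0.15

// Two runs count as the same line when their y values differ by no more
// than 15% of the larger font size.
function sameLine(a, b) {
  const tolerance = Math.max(a.fontSize, b.fontSize) * Y_TOLERANCE_FACTOR;
  return Math.abs(a.y - b.y) <= tolerance;
}

// Joins runs in order, adding a newline between lines and a space when the
// horizontal gap to the previous run exceeds gapFactor × fontSize.
function joinRuns(runs, gapFactor = 0.2) {
  let out = "";
  let prev = null;
  for (const run of runs) {
    if (prev) {
      if (!sameLine(prev, run)) {
        out += "\n";
      } else {
        const gap = run.x - (prev.x + prev.width);
        if (gap > run.fontSize * gapFactor) out += " ";
      }
    }
    out += run.text;
    prev = run;
  }
  return out;
}

console.log(joinRuns([
  { x: 0, y: 100, width: 30, fontSize: 12, text: "Name:" },
  { x: 34, y: 100.5, width: 24, fontSize: 12, text: "John" },
  { x: 61, y: 100, width: 18, fontSize: 12, text: "Doe" },
]));
// prints "Name: John Doe"
```

Scaling the tolerance with font size means small baseline jitter in large headings is tolerated, while tightly spaced small text is still split into separate lines.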
🐛 Bug Fixes
Text Block Coordinate Accuracy (Issue #408, PR #409)
- Fixed text block coordinate calculations for proper positioning
- Added comprehensive coordinate tests
- Ensures accurate x/y values in JSON output
Character Extraction Completeness (Issue #385, PR #410)
- Fixed missing character extraction for glyphs marked as "disabled"
- Moved text extraction outside glyph.disabled check
- All visible characters now properly extracted
CLI Error Handling (Issue #414)
- Unified error and exception handling for CLI operations
- Better error messages for invalid input parameters
- Auto-creates output directory when not specified (removed unnecessary validation)
- Improved stack trace display
The following related issues should also be fixed by this release (test PDFs are needed to confirm):
- #352 : unexpected space
- #291 : problem with sentences broken into 1 word
- #272 : unrecognized Text
- #220 : two TEXTs unexpected joined together in one RUN
- #212 : content is being randomly split into multiple lines
- #177 : heading level of text is not captured
- #156 : extracting table content
- #94 : parser not handling some spaces between words
📦 Dependencies
- Maintained zero runtime dependencies (since v3.1.6)
- Updated development dependencies for build tooling
Stable Build v3.2.2
- fix #406
- refactor: separate out logger functionality from nodeUtil
Stable build: V3.2.1
- types update:
-- fix #392
-- update types for root pdfparser.js
- feat: add type3 glyph font test support
- chores: update README, bump dev dependencies versions while keeping zero dependency
Stable build v3.2.0
- add support for deno and bun plus tests
-- fix: issue #68 and #396
-- add the node: protocol to built-in module imports to make them explicit when running in environments other than node, including deno and bun
- moved root pdfparser source and types to ./src and ./src/types respectively
-- please double-check your import paths; all exports now come from ./dist
- reduce distributed package size to 2.1mb, improve pack and build
- feat: enable reading multiple pdf files with a single PDFParser object, credit @nicolabaesso
- other chores, including tests, jest upgrade, readme update, etc.
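Reusing one parser for several files might look like the sketch below. It assumes the parser behaves like a pdf2json PDFParser, that is, an event emitter with `loadPDF(path)` that emits `pdfParser_dataReady` or `pdfParser_dataError`; the `parseOne` and `parseAll` helpers are hypothetical names, and the parser is passed in so the helpers stay independent of how it is constructed.

```javascript
// Wraps one parse in a Promise. `parser` is expected to behave like a
// pdf2json PDFParser (on/removeListener/loadPDF and the two events below).
function parseOne(parser, path) {
  return new Promise((resolve, reject) => {
    const onReady = (data) => { cleanup(); resolve(data); };
    const onError = (err) => { cleanup(); reject(err); };
    // Remove both listeners once either fires, so repeated parses on the
    // same instance do not accumulate listeners.
    const cleanup = () => {
      parser.removeListener("pdfParser_dataReady", onReady);
      parser.removeListener("pdfParser_dataError", onError);
    };
    parser.on("pdfParser_dataReady", onReady);
    parser.on("pdfParser_dataError", onError);
    parser.loadPDF(path);
  });
}

// Parses files one at a time with a single parser instance.
async function parseAll(parser, paths) {
  const results = [];
  for (const path of paths) {
    results.push(await parseOne(parser, path)); // sequential on purpose
  }
  return results;
}
```

With the library installed, `parser` would be a `new PDFParser()` instance. The parses run sequentially because a single instance emits its events without saying which file they belong to, so overlapping calls could not be told apart.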
Stable build v3.1.6
What's Changed
- zero dependency: remove dependency on @xmldom/xmldom to make pdf2json zero dependency
- fix: correct link for open code of conduct #204
- Fixed radio/checkbox return values in getAllFieldsTypes(), thanks @bogie for #383
- fix: move package manager version from engines to devEngines, thanks @styfle for #387
Full Changelog: v3.1.5...v3.1.6
Stable build v3.1.5
feature added:
- add commonjs type definition file generation, thanks @grainrigi
- add 'types' to package.json 'exports' root, thanks @jeremybanka
Issues addressed:
- fix #165: check and make buffer before parse
- fix #373: handle bad encoding exception by starting page rendering after the page operator list is resolved
- fix #306: infinite loop on invalid stream
- fix #369: handle object value for field's rectangle coordinates
- other maintenance, eslint, tsconfig, dependency version bumps, etc.