16 Apr 01:51

modesty

8554a3a

Stable Build v4.0.3 Latest

pdf2json v4.0.3 Release Notes

Bug Fixes

Text reading order — Added spatial sort (lib/pdftextsorter.js) to getRawTextContent() so multi-column
and complex-layout PDFs return text in correct top-to-bottom, left-to-right order instead of internal PDF
object order. (#422)
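
As an illustration of what a spatial sort does, here is a minimal sketch. It is not the code in lib/pdftextsorter.js: the item shape ({ x, y, text }), the function name sortReadingOrder, and the fixed yTolerance are assumptions made for this example, and it does not attempt column detection.

```javascript
// Hypothetical sketch: not the code in lib/pdftextsorter.js.
// Each item is assumed to look like { x, y, text }, with y growing downward.
function sortReadingOrder(items, yTolerance = 0.5) {
  // Sort by vertical position first so lines can be grouped in one pass.
  const byY = [...items].sort((a, b) => a.y - b.y);
  const lines = [];
  for (const item of byY) {
    const current = lines[lines.length - 1];
    // Items whose y is within the tolerance of the line's first item share a line.
    if (current && Math.abs(item.y - current[0].y) <= yTolerance) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }
  // Within each line, read left to right.
  return lines.flatMap((line) => line.sort((a, b) => a.x - b.x));
}

const scrambled = [
  { x: 10, y: 5.1, text: "world" },
  { x: 1, y: 9, text: "second line" },
  { x: 1, y: 5, text: "hello" },
];
const orderedTexts = sortReadingOrder(scrambled).map((item) => item.text);
console.log(orderedTexts.join(" | ")); // hello | world | second line
```

Grouping into lines before sorting by x keeps small baseline differences from reordering words on the same line.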

CLI Improvements (#423)

New --json flag — Emits a structured JSON summary to stdout (version, output file paths, stats, errors,
elapsed time) for programmatic and scripted consumption.
New --quiet flag — Suppresses all non-error output (timer, status messages).
Granular exit codes — 0 success · 1 parse failure · 2 argument error · 3 I/O error (previously only 0 or
1).
Fixed --singleton / -si flags — Parser instance is now correctly shared at the CLI level; previously
broken.
Directory filter — Only skips dotfiles now; previously silently skipped files starting with -, _, or
whitespace.
7 internal bug fixes — Eliminated Promise constructor anti-pattern, replaced callback-style
fs.writeFile/fs.readdir with fs.promises, fixed addResultCount type mismatch, removed dead warningCount,
and resolved a TOCTOU race condition in validateParams.
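
For scripted use, the exit codes and the --json flag could be consumed roughly as follows. This is a sketch under assumptions: it presumes a pdf2json binary on PATH and that -f names the input file, and the helper names (describeExit, runPdf2json) are hypothetical. The field names of the JSON summary are not spelled out above, so the sketch treats the parsed summary as opaque.

```javascript
const { spawnSync } = require("node:child_process");

// Exit codes as listed in the v4.0.3 notes.
const EXIT_MEANINGS = {
  0: "success",
  1: "parse failure",
  2: "argument error",
  3: "I/O error",
};

function describeExit(code) {
  return EXIT_MEANINGS[code] ?? `unknown exit code ${code}`;
}

// Hypothetical wrapper: assumes `pdf2json` is on PATH and `-f` selects the input file.
function runPdf2json(pdfPath) {
  const result = spawnSync("pdf2json", ["-f", pdfPath, "--json"], {
    encoding: "utf8",
  });
  if (result.error) {
    // The binary could not be started at all (for example, it is not installed).
    return { ok: false, reason: result.error.message, summary: null };
  }
  let summary = null;
  try {
    // Field names are not assumed here; the summary is passed through as parsed.
    summary = JSON.parse(result.stdout);
  } catch {
    summary = null; // stdout was not valid JSON
  }
  return { ok: result.status === 0, reason: describeExit(result.status), summary };
}

console.log(describeExit(2)); // argument error
```

A caller can branch on reason to decide, for instance, whether a retry makes sense (I/O error) or not (argument error).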

Build & Configuration

tsconfig.json: Removed dead decorator options; updated moduleResolution/module to node16.
package.json: Fixed exports map with proper types entries for ESM and CJS TypeScript consumers; removed
unused tslib dependency; added test:coverage script.
rollup.config.js: Enabled tree-shaking for CLI bundle; documented build order dependency.
CI: Upgraded to actions/checkout@v4; added tsc --noEmit type-check step; bumped Node.js to 22.x.

Tests

3 new test suites, 22 new tests — CLI integration (_test_cli.cjs), Stream API (_test_stream.cjs), and
error paths (_test_errors.cjs); all previously had zero coverage.
Total: 74 tests / 7 suites (up from 52 / 4).
Fixed listener leak in multi-parse test; standardized on Jest expect() over Node assert.
Renamed _test_getRawTextContent.cjs → _test_sortBidiTexts.cjs to reflect actual coverage.
Regenerated 37 baseline JSON files to reflect current parser output (baselines were stale since v0.6.8).

Full Changelog

b0067d7...eed63fb


17 Jan 01:27

modesty

v4.0.2

48b50bf

Stable Build v4.0.2

Add support for transparent groups: endGroup now merges sub-canvas text, lines, etc. back into the primary output data. This completes the fix for #418.

07 Jan 20:26

modesty

v4.0.1

de176e5

Stable Build v4.0.1

Bug fixes

fix: correct circular dependency without dup (PR #415)
fix: issue #418

12 Oct 19:47

modesty

v4.0.0

c8b372b

Stable Build v4.0.0 [Breaking Changes]

v4.0.0 Release Notes

v4.0.0 includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. This release contains breaking changes that require attention when upgrading from v3.x.

🚨 Breaking Changes

Text Encoding Change (Issue #385, PR #410)

What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.

Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.

Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.

JSON Output Examples

Before v4.0.0 (URI-encoded):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added%20Text%20from%20Acrobat"
      }]
    }]
  }]
}

After v4.0.0 (UTF-8):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added Text from Acrobat"
      }]
    }]
  }]
}

Code Migration

Before v4.0.0:

// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"

After v4.0.0:

// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"

CJK Character Support

Before v4.0.0:

{
  "T": "%E4%B8%AD%E6%96%87"
}

After v4.0.0:

{
  "T": "中文"
}
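
Code that has to read output from both sides of the upgrade could route text access through one helper. The sketch below assumes the caller knows which pdf2json version produced the JSON; runText, collectTexts, and the legacy flag are hypothetical names for this example, not part of pdf2json.

```javascript
// Hypothetical helper: `legacy` is a flag the caller sets, not something pdf2json provides.
function runText(run, legacy) {
  if (!legacy) return run.T; // v4.0.0 and later: already plain UTF-8
  try {
    return decodeURIComponent(run.T); // before v4.0.0: URI-encoded
  } catch {
    return run.T; // malformed escape sequence: fall back to the raw string
  }
}

// Walks the Pages -> Texts -> R structure shown in the JSON examples.
function collectTexts(output, legacy = false) {
  return output.Pages.flatMap((page) =>
    page.Texts.flatMap((text) => text.R.map((run) => runText(run, legacy)))
  );
}

const v3Output = { Pages: [{ Texts: [{ R: [{ T: "%E4%B8%AD%E6%96%87" }] }] }] };
const v4Output = { Pages: [{ Texts: [{ R: [{ T: "中文" }] }] }] };
console.log(collectTexts(v3Output, true)); // [ '中文' ]
console.log(collectTexts(v4Output)); // [ '中文' ]
```

The flag is explicit on purpose: guessing from the string is unreliable, because plain v4 text may legitimately contain a % character, and decoding it would either throw or corrupt the text.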

✨ Features & Enhancements

Accurate Space Preservation (Issues #355, #361, #319, PR #411)

Complete overhaul of space detection and preservation in text extraction (to try it, run the CLI with the -c option):

Glyph-based width calculation - Uses actual font metrics instead of estimates
Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
Text scale support - Applies textHScale for compressed/expanded text
Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)

Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
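
To make the Y-tolerance and width-based ideas concrete, here is a simplified sketch. It is not pdf2json's internal code: the run shape ({ x, y, width, fontSize, text }), the use of the larger font size for the tolerance, and the gapFactor threshold are all assumptions for illustration. Only the 0.15 factor comes from the notes above. Runs are assumed to arrive already in reading order.

```javascript
// Hypothetical sketch of the ideas listed above; not pdf2json's internal code.
// Each run is assumed to look like { x, y, width, fontSize, text }, all in the same units.
function sameLine(a, b) {
  // Font-size-aware tolerance, using the 0.15 factor quoted in the notes.
  const tolerance = Math.max(a.fontSize, b.fontSize) * 0.15;
  return Math.abs(a.y - b.y) <= tolerance;
}

function joinRuns(runs, gapFactor = 0.2) {
  let out = "";
  let prev = null;
  for (const run of runs) {
    if (prev) {
      if (!sameLine(prev, run)) {
        out += "\n";
      } else {
        // Horizontal gap between the end of the previous run and the start of this one.
        const gap = run.x - (prev.x + prev.width);
        // gapFactor is an assumed threshold, expressed as a fraction of the font size.
        if (gap > run.fontSize * gapFactor) out += " ";
      }
    }
    out += run.text;
    prev = run;
  }
  return out;
}

const runs = [
  { x: 0, y: 100, width: 30, fontSize: 12, text: "Name:" },
  { x: 34, y: 100.5, width: 24, fontSize: 12, text: "John" },
  { x: 58.5, y: 100, width: 20, fontSize: 12, text: "son" },
  { x: 0, y: 115, width: 22, fontSize: 12, text: "SSN:" },
];
console.log(joinRuns(runs)); // "Name: Johnson" then "SSN:" on a new line
```

The quality of the result depends on how accurate each run's width is, which is why measuring widths from real glyph metrics matters more than the exact threshold.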

Example Output Improvement

Before v4.0.0:

Name:JohnDoeSSN:123-45-6789

After v4.0.0:

Name: John Doe    SSN: 123-45-6789

🐛 Bug Fixes

Text Block Coordinate Accuracy (Issue #408, PR #409)

Fixed text block coordinate calculations for proper positioning
Added comprehensive coordinate tests
Ensures accurate x/y values in JSON output

Character Extraction Completeness (Issue #385, PR #410)

Fixed missing character extraction for glyphs marked as "disabled"
Moved text extraction outside glyph.disabled check
All visible characters now properly extracted

CLI Error Handling (Issue #414)

Unified error and exception handling for CLI operations
Better error messages for invalid input parameters
Auto-creates output directory when not specified (removed unnecessary validation)
Improved stack trace display

More related issues are likely fixed as well (test PDFs are needed to confirm):

#352 : unexpected space
#291 : problem with sentences broken into 1 word
#272 : unrecognized Text
#220 : two TEXTs unexpected joined together in one RUN
#212 : content is being randomly split into multiple lines
#177 : heading level of text is not captured
#156 : extracting table content
#94 : parser not handling some spaces between words

📦 Dependencies

Maintained zero runtime dependencies (since v3.1.6)
Updated development dependencies for build tooling

19 Sep 01:47

modesty

v3.2.2

1faf820

Stable Build v3.2.2

fix #406
refactor: separate out logger functionality from nodeUtil

13 Sep 23:47

modesty

v3.2.1

b03348e

Stable build: V3.2.1

types update:
- fix #392
- update types for root pdfparser.js
feat: add type3 glyph font test support
- issue fixed: #389, #377, #332
- architectural compliance, separate the type3 glyph fonts processing from rendering, use standard canvas text rendering pipeline for glyph, tested with /test/pdf/misc/i389_type3_glyph.pdf
chores: update README, bump dev dependencies versions while keeping zero dependency

26 Jul 23:02

modesty

v3.2.0

be47b08

Stable build v3.2.0

add support for Deno and Bun, plus tests
-- fix: issues #68 and #396
-- add the node: protocol prefix to built-in imports so they are explicit when running in environments other than Node, including Deno and Bun
moved root pdfparser source and types to ./src and ./src/types respectively; double-check your import paths, since all exports now come from ./dist
reduce distributed package size to 2.1 MB, improve pack and build
feat: enable reading multiple PDF files with a single PDFParser object, credit @nicolabaesso
other chores, including tests, jest upgrade, readme update, etc.
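
Reusing one parser for several files could look roughly like the sketch below. It assumes the parser behaves like a PDFParser instance: an event emitter with loadPDF(path) that emits "pdfParser_dataReady" or "pdfParser_dataError". parseOne and parseAll are hypothetical helpers written for this example, not library functions.

```javascript
// Sketch only: `parser` is expected to behave like a pdf2json PDFParser instance.
function parseOne(parser, pdfPath) {
  return new Promise((resolve, reject) => {
    const onReady = (data) => {
      parser.removeListener("pdfParser_dataError", onError);
      resolve(data);
    };
    const onError = (err) => {
      parser.removeListener("pdfParser_dataReady", onReady);
      reject(err);
    };
    // once() plus explicit removal keeps listeners from piling up across parses.
    parser.once("pdfParser_dataReady", onReady);
    parser.once("pdfParser_dataError", onError);
    parser.loadPDF(pdfPath);
  });
}

async function parseAll(parser, pdfPaths) {
  const results = [];
  // Files are parsed one after another so events from different files cannot interleave.
  for (const pdfPath of pdfPaths) {
    results.push(await parseOne(parser, pdfPath));
  }
  return results;
}
```

With a parser created via new PDFParser(), a call such as parseAll(parser, ["a.pdf", "b.pdf"]) would resolve to one result per file, in input order.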

Contributors

nicolabaesso

24 May 00:20

modesty

v3.1.6

2298a86

Stable build v3.1.6

What's Changed

zero dependency: remove dependency on @xmldom/xmldom to make pdf2json zero dependency
fix: correct link for open code of conduct #204
Fixed radio/checkbox return values in getAllFieldsTypes(), thanks @bogie for #383
fix: move package manager version from engines to devEngines, thanks @styfle for #387

New Contributors

@bogie made their first contribution in #383
@styfle made their first contribution in #387

Full Changelog: v3.1.5...v3.1.6

Contributors

bogie and styfle

03 Jan 23:13

modesty

v3.1.5

49486ef

Stable build v3.1.5

feature added:

add commonjs type definition file generation, thanks @grainrigi
add 'types' to package.json 'exports' root, thanks @jeremybanka

Issues addressed:

fix #165: check and make buffer before parse
fix #373: handle bad encoding exception by starting page rendering after the page operator list is resolved
fix #306: infinite loop on invalid stream
fix #369: handle object value for field's rectangle coordinates
other maintenance, eslint, tsconfig, dependency version bumps, etc.

Contributors

jeremybanka and grainrigi

09 Aug 18:54

modesty

v3.1.4

115d618

Stable Build v3.1.4

dev-dependency updates for braces,
correct import for typescript type to fix #349: Cannot compile project with 3.1.3
plus issues addressed in v3.1.4:
- #350: replace nodeUtil.warn with nodeUtil.p2jwarn
- #274: Invalid XRef stream
- #216: stream must have data, verified fix

Releases: modesty/pdf2json
