Releases: modesty/pdf2json

Stable Build v4.0.3

16 Apr 01:51

pdf2json v4.0.3 Release Notes


Bug Fixes

  • Text reading order — Added spatial sort (lib/pdftextsorter.js) to getRawTextContent() so multi-column
    and complex-layout PDFs return text in correct top-to-bottom, left-to-right order instead of internal PDF
    object order. (#422)
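
The idea of a spatial sort can be illustrated with a minimal sketch. This is not the code in lib/pdftextsorter.js; the function name `sortReadingOrder`, the `{ x, y, text }` item shape, and the `yTolerance` default are assumptions made for illustration only.

```javascript
// Hypothetical sketch, not the actual lib/pdftextsorter.js implementation.
// Each item is assumed to look like { x, y, text }, with y growing downward.
function sortReadingOrder(items, yTolerance = 0.5) {
  // Sort by vertical position first so rows can be built in a single pass.
  const byY = [...items].sort((a, b) => a.y - b.y);
  const rows = [];
  for (const item of byY) {
    const row = rows[rows.length - 1];
    // Items whose y lies within the tolerance of the row's first item share a row.
    if (row && Math.abs(item.y - row[0].y) <= yTolerance) {
      row.push(item);
    } else {
      rows.push([item]);
    }
  }
  // Within each row, order left to right, then flatten back into one list.
  return rows.flatMap((row) => row.sort((a, b) => a.x - b.x));
}

const items = [
  { x: 10, y: 5.1, text: "world" },
  { x: 1, y: 9, text: "second line" },
  { x: 1, y: 5, text: "hello" },
];
console.log(sortReadingOrder(items).map((i) => i.text).join(" | "));
// prints "hello | world | second line"
```

Grouping into rows before sorting by x avoids a tolerance-based comparator, which would not be transitive. A real multi-column layout would likely also need column detection, which this sketch leaves out.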

CLI Improvements (#423)

  • New --json flag — Emits a structured JSON summary to stdout (version, output file paths, stats, errors,
    elapsed time) for programmatic and scripted consumption.
  • New --quiet flag — Suppresses all non-error output (timer, status messages).
  • Granular exit codes — 0 success · 1 parse failure · 2 argument error · 3 I/O error (previously only 0 or
    1).
  • Fixed --singleton / -si flags — Parser instance is now correctly shared at the CLI level; previously
    broken.
  • Directory filter — Only skips dotfiles now; previously silently skipped files starting with -, _, or
    whitespace.
  • 7 internal bug fixes — Eliminated Promise constructor anti-pattern, replaced callback-style
    fs.writeFile/fs.readdir with fs.promises, fixed addResultCount type mismatch, removed dead warningCount,
    and resolved a TOCTOU race condition in validateParams.
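
For scripted use, the granular exit codes listed above could be mapped to readable outcomes. The sketch below is an assumption about how a calling script might do this; `EXIT_CODES` and `interpretExit` are hypothetical names and not part of pdf2json.

```javascript
// Exit codes as listed in the release notes above.
const EXIT_CODES = {
  0: "success",
  1: "parse failure",
  2: "argument error",
  3: "I/O error",
};

// Hypothetical helper: turn a numeric exit code into a small result object.
function interpretExit(code) {
  // Codes outside the documented range are reported as "unknown".
  const meaning = EXIT_CODES[code] ?? "unknown";
  return { code, meaning, ok: code === 0 };
}

console.log(interpretExit(2));
// prints { code: 2, meaning: 'argument error', ok: false }
```

In a Node.js script, the `code` argument could come from the `status` field returned by `spawnSync` in `node:child_process` after running the CLI.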

Build & Configuration

  • tsconfig.json: Removed dead decorator options; updated moduleResolution/module to node16.
  • package.json: Fixed exports map with proper types entries for ESM and CJS TypeScript consumers; removed
    unused tslib dependency; added test:coverage script.
  • rollup.config.js: Enabled tree-shaking for CLI bundle; documented build order dependency.
  • CI: Upgraded to actions/checkout@v4; added tsc --noEmit type-check step; bumped Node.js to 22.x.

Tests

  • 3 new test suites, 22 new tests — CLI integration (_test_cli.cjs), Stream API (_test_stream.cjs), and
    error paths (_test_errors.cjs); all previously had zero coverage.
  • Total: 74 tests / 7 suites (up from 52 / 4).
  • Fixed listener leak in multi-parse test; standardized on Jest expect() over Node assert.
  • Renamed _test_getRawTextContent.cjs → _test_sortBidiTexts.cjs to reflect actual coverage.
  • Regenerated 37 baseline JSON files to reflect current parser output (baselines were stale since v0.6.8).

Full Changelog

b0067d7...eed63fb

Stable Build v4.0.2

17 Jan 01:27
48b50bf

Added support for transparent groups: endGroup now merges sub-canvas text, lines, and other elements back into the primary output data. This completes the fix for #418.

Stable Build v4.0.1

07 Jan 20:26

Bug fixes

  1. fix: correct circular dependency without dup (PR #415)
  2. fix: issue #418

Stable Build v4.0.0 [Breaking Changes]

12 Oct 19:47
c8b372b

v4.0.0 Release Notes

This release includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. It contains breaking changes that require attention when upgrading from v3.x.

🚨 Breaking Changes

Text Encoding Change (Issue #385, PR #410)

What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.

Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.

Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.

JSON Output Examples

Before v4.0.0 (URI-encoded):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added%20Text%20from%20Acrobat"
      }]
    }]
  }]
}

After v4.0.0 (UTF-8):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added Text from Acrobat"
      }]
    }]
  }]
}

Code Migration

Before v4.0.0:

// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"

After v4.0.0:

// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"

CJK Character Support

Before v4.0.0:

{
  "T": "%E4%B8%AD%E6%96%87"
}

After v4.0.0:

{
  "T": "中文"
}
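
Code that has to read output from both older and newer versions could isolate the decoding step in one place. The sketch below is an assumption for illustration; `readRunText` and its `uriEncoded` parameter are hypothetical and not part of pdf2json.

```javascript
// Hypothetical helper for code that must read both old and new output.
// `uriEncoded` must be supplied by the caller (true for pre-v4.0.0 output).
// Guessing from the text itself is unsafe, because plain UTF-8 text may
// legitimately contain a "%" character.
function readRunText(run, uriEncoded) {
  return uriEncoded ? decodeURIComponent(run.T) : run.T;
}

console.log(readRunText({ T: "%E4%B8%AD%E6%96%87" }, true)); // prints "中文"
console.log(readRunText({ T: "100% done" }, false)); // prints "100% done"
```

Passing the flag explicitly matters: calling `decodeURIComponent` on v4.0.0 text such as "100% done" throws a `URIError`, so decoding should be dropped entirely once all stored output comes from v4.0.0 or later.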

✨ Features & Enhancements

Accurate Space Preservation (Issues #355, #361, #319, PR #411)

Complete overhaul of space detection and preservation in text extraction (to try it from the CLI, use the -c command-line option):

  • Glyph-based width calculation - Uses actual font metrics instead of estimates
  • Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
  • Text scale support - Applies textHScale for compressed/expanded text
  • Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)

Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.

Example Output Improvement

Before v4.0.0:

Name:JohnDoeSSN:123-45-6789

After v4.0.0:

Name: John Doe    SSN: 123-45-6789
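
A rough sketch of how a font-size-aware Y-tolerance and gap-based spacing might fit together follows. Only the 0.15 factor comes from the notes above; the run shape `{ x, y, width, fontSize, text }`, the use of the larger font size, and the `gapFactor` threshold are assumptions, and the real implementation relies on glyph metrics rather than this simplification.

```javascript
// Hypothetical sketch; the run shape { x, y, width, fontSize, text } is assumed.
const Y_TOLERANCE_FACTOR = 0.15; // from the release notes: fontSize × 0.15

// Two runs count as being on the same line when their y values differ by
// no more than the tolerance (assumption: the larger font size is used).
function onSameLine(a, b) {
  const tolerance = Math.max(a.fontSize, b.fontSize) * Y_TOLERANCE_FACTOR;
  return Math.abs(a.y - b.y) <= tolerance;
}

// Assumption: a horizontal gap wider than gapFactor × fontSize means a space.
function joinRuns(runs, gapFactor = 0.25) {
  let out = "";
  runs.forEach((run, i) => {
    const prev = runs[i - 1];
    if (prev) {
      const gap = run.x - (prev.x + prev.width);
      if (!onSameLine(prev, run)) out += "\n";
      else if (gap > gapFactor * run.fontSize) out += " ";
    }
    out += run.text;
  });
  return out;
}

const runs = [
  { x: 0, y: 100, width: 30, fontSize: 12, text: "Name:" },
  { x: 34, y: 100.5, width: 24, fontSize: 12, text: "John" },
  { x: 62, y: 100, width: 20, fontSize: 12, text: "Doe" },
  { x: 0, y: 115, width: 22, fontSize: 12, text: "SSN:" },
];
console.log(JSON.stringify(joinRuns(runs)));
// prints "Name: John Doe\nSSN:"
```

With a 12-unit font the tolerance is 1.8, so the 0.5 vertical jitter on "John" does not start a new line, while the 15-unit drop to "SSN:" does.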

🐛 Bug Fixes

Text Block Coordinate Accuracy (Issue #408, PR #409)

  • Fixed text block coordinate calculations for proper positioning
  • Added comprehensive coordinate tests
  • Ensures accurate x/y values in JSON output

Character Extraction Completeness (Issue #385, PR #410)

  • Fixed missing character extraction for glyphs marked as "disabled"
  • Moved text extraction outside glyph.disabled check
  • All visible characters now properly extracted

CLI Error Handling (Issue #414)

  • Unified error and exception handling for CLI operations
  • Better error messages for invalid input parameters
  • Auto-creates output directory when not specified (removed unnecessary validation)
  • Improved stack trace display

The following related issues are likely fixed as well (test PDFs are needed to confirm):

  • #352 : unexpected space
  • #291 : problem with sentences broken into 1 word
  • #272 : unrecognized Text
  • #220 : two TEXTs unexpected joined together in one RUN
  • #212 : content is being randomly split into multiple lines
  • #177 : heading level of text is not captured
  • #156 : extracting table content
  • #94 : parser not handling some spaces between words

📦 Dependencies

  • Maintained zero runtime dependencies (since v3.1.6)
  • Updated development dependencies for build tooling

Stable Build v3.2.2

19 Sep 01:47
1faf820

  • fix #406
  • refactor: separate out logger functionality from nodeUtil

Stable build: V3.2.1

13 Sep 23:47
b03348e

  • types update:
    • fix #392
    • update types for root pdfparser.js
  • feat: add type3 glyph font test support
    • issue fixed: #389, #377, #332
    • architectural compliance: separated Type3 glyph font processing from rendering and switched glyphs to the standard canvas text rendering pipeline; tested with /test/pdf/misc/i389_type3_glyph.pdf
  • chores: update README, bump dev dependencies versions while keeping zero dependency

Stable build v3.2.0

26 Jul 23:02

  1. add support for Deno and Bun, plus tests
    • fix: issues #68 and #396
    • add the node: protocol to built-in module imports so they are explicit when running in environments other than Node, including Deno and Bun
  2. moved root pdfparser source and types to ./src and ./src/types respectively; please double-check your import paths, since all exports now come from ./dist
  3. reduce distributed package size to 2.1 MB, improve pack and build
  4. feat: enable reading multiple pdf files with a single PDFParser object, credit @nicolabaesso
  5. other chores, including tests, jest upgrade, readme update, etc.

Stable build v3.1.6

24 May 00:20
2298a86

What's Changed

  • zero dependency: remove dependency on @xmldom/xmldom to make pdf2json zero dependency
  • fix: correct link for open code of conduct #204
  • Fixed radio/checkbox return values in getAllFieldsTypes(), thanks @bogie for #383
  • fix: move package manager version from engines to devEngines, thanks @styfle for #387

Full Changelog: v3.1.5...v3.1.6

Stable build v3.1.5

03 Jan 23:13
49486ef

Features added:

  1. add commonjs type definition file generation, thanks @grainrigi
  2. add 'types' to package.json 'exports' root, thanks @jeremybanka

Issues addressed:

  1. fix #165: check and make buffer before parse
  2. fix #373: handle bad encoding exception by starting page rendering after the page operator list is resolved
  3. fix #306: infinite loop on invalid stream
  4. fix #369: handle object value for field's rectangle coordinates
  5. other maintenance, eslint, tsconfig, dependency version bumps, etc.

Stable Build v3.1.4

09 Aug 18:54
115d618

  • dev-dependency updates for braces
  • correct import for typescript type to fix #349: Cannot compile project with 3.1.3
  • plus issues addressed in v3.1.4:
    • #350: replace nodeUtil.warn with nodeUtil.p2jwarn
    • #274: Invalid XRef stream
    • #216: stream must have data, verified fix