GH-49955: [C++] Fix OOM vulnerability in Parquet Delta decoders #49954

Open
sivaadityacoder wants to merge 1 commit into apache:main from sivaadityacoder:fix/oom-delta-decoder

@sivaadityacoder sivaadityacoder commented May 9, 2026

Rationale for this change

This PR addresses an uncontrolled memory allocation vulnerability in the Parquet DeltaByteArrayDecoder, DeltaLengthByteArrayDecoder, and DeltaBitPackDecoder.

Currently, these decoders trust the num_values (and implicitly the total_value_count_) provided by the Parquet data page header. The decoders eagerly allocate memory arrays sized proportionally to this unvalidated count (e.g. buffered_prefix_length_->Resize(num_prefix * sizeof(int32_t))).

An attacker can exploit this by crafting a tiny Parquet file (e.g., ~300 bytes) with a maliciously forged num_values (e.g., 1,000,000,000). When this file is opened (even just to read dictionary pages during ParquetFileReader::Open), Arrow attempts an immediate massive allocation (e.g., 4 GB), resulting in a std::bad_alloc and an immediate Denial of Service (OOM) crash.
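The arithmetic behind the 4 GB figure can be sketched as follows. This is only an illustration of the sizing described above; `RequestedBytes` is a hypothetical helper name, not a function in Arrow.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Arrow): the number of bytes an eager
// Resize(num_values * sizeof(int32_t)) would request for a given count.
constexpr std::uint64_t RequestedBytes(std::uint64_t num_values) {
  return num_values * sizeof(std::int32_t);
}
```

With the forged count from the example, `RequestedBytes(1000000000)` evaluates to 4,000,000,000 bytes, which is on the order of ten million times the size of a ~300-byte file.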

What changes are included in this PR?

This PR introduces conservative bounds checks in cpp/src/parquet/decoder.cc that verify the parsed value counts are plausible for the number of bytes remaining in the input buffer (bytes_left()) before any eager allocation is attempted.

  • DeltaBitPackDecoder::InitHeader: Validates that the requested mini_blocks_per_block_ buffer allocation does not exceed the remaining bytes in the buffer.
  • DeltaLengthByteArrayDecoder::DecodeLengths: Validates that num_length does not exceed a conservative multiplier of the remaining buffer size (bytes_left() * 10000) before calling Resize().
  • DeltaByteArrayDecoderImpl::SetData: Validates that num_prefix does not exceed a conservative multiplier of the remaining buffer size before calling Resize().
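The shape of such a check can be sketched as a standalone function. This is a minimal sketch under assumptions, not the code in this PR: `IsPlausibleValueCount` and `kMaxValuesPerByte` are hypothetical names, the 10000 multiplier is taken from the DecodeLengths bullet above, and the saturation step is an added precaution against integer overflow in the multiplication.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical constant mirroring the "bytes_left() * 10000" bound above.
constexpr std::int64_t kMaxValuesPerByte = 10000;

// Hypothetical helper (not part of Arrow): returns true if `value_count`
// could plausibly be encoded in a buffer with `bytes_left` bytes remaining.
inline bool IsPlausibleValueCount(std::int64_t value_count,
                                  std::int64_t bytes_left) {
  if (value_count < 0 || bytes_left < 0) {
    return false;
  }
  constexpr std::int64_t kMax = std::numeric_limits<std::int64_t>::max();
  // Saturate the limit instead of letting bytes_left * multiplier overflow.
  const std::int64_t limit = (bytes_left > kMax / kMaxValuesPerByte)
                                 ? kMax
                                 : bytes_left * kMaxValuesPerByte;
  return value_count <= limit;
}
```

Under these assumptions, a 300-byte buffer admits at most 3,000,000 values, so a forged count of 1,000,000,000 is rejected before any Resize() call, while a count of 1,000 passes.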

These checks reject forged, oversized counts in tiny files while still allowing legitimate, highly compressed Parquet files to be decoded.

Are these changes tested?

Yes, the changes have been tested against a Proof of Concept (PoC) malicious file. Instead of triggering an OS OOM or std::bad_alloc, the parser now correctly throws a ParquetException ("Excessive num_prefix in DeltaByteArrayDecoder").
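A regression check for this behaviour could take roughly the following shape. This is a sketch with stand-ins, not a test from this PR: `FakeSetData` is a hypothetical substitute for the guarded decoder entry point, `std::runtime_error` stands in for ParquetException, and the 10000 multiplier is assumed from the description above.

```cpp
#include <cstdint>
#include <exception>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the guarded decoder entry point (not Arrow code).
inline void FakeSetData(std::int64_t num_prefix, std::int64_t bytes_left) {
  constexpr std::int64_t kMul = 10000;  // assumed multiplier
  // If bytes_left * kMul would overflow, every non-negative count is in range.
  const bool would_overflow =
      bytes_left > std::numeric_limits<std::int64_t>::max() / kMul;
  if (num_prefix < 0 || bytes_left < 0 ||
      (!would_overflow && num_prefix > bytes_left * kMul)) {
    throw std::runtime_error("Excessive num_prefix in DeltaByteArrayDecoder");
  }
}

// Runs `fn` and returns the message of any std::exception it throws,
// or an empty string if it completes normally.
template <typename Fn>
std::string CaughtMessage(Fn&& fn) {
  try {
    fn();
  } catch (const std::exception& e) {
    return e.what();
  }
  return "";
}
```

A forged header is then expected to yield the error message, and a modest count to yield an empty string, for example `CaughtMessage([] { FakeSetData(1000000000, 300); })` versus `CaughtMessage([] { FakeSetData(100, 300); })`.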

(Optional: Note if you have included any new C++ unit tests in the PR that generate a bad header and assert the exception is thrown).

Are there any user-facing changes?

No, this strictly improves the security and stability of the Parquet C++ decoders.


Note: This vulnerability was originally submitted to the Huntr bug bounty platform (Reference ID: 7814255d-e945-427f-ab84-6eddc3a35a37), and is being filed here at the recommendation of the ASF Security Team.

@sivaadityacoder sivaadityacoder requested a review from wgtmac as a code owner May 9, 2026 04:29

github-actions Bot commented May 9, 2026

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@sivaadityacoder sivaadityacoder changed the title GH-XXXXX: [C++] Fix OOM vulnerability in Parquet Delta decoders GH-49955: [C++] Fix OOM vulnerability in Parquet Delta decoders May 9, 2026

github-actions Bot commented May 9, 2026

⚠️ GitHub issue #49955 has been automatically assigned in GitHub to PR creator.

github-actions Bot commented May 9, 2026

⚠️ GitHub issue #49955 has no components, please add labels for components.
