
GH-48701: [C++][Parquet] Add ALPpd encoding#48345

Open
prtkgaur wants to merge 90 commits into apache:main from prtkgaur:gh540-alp-pseudoDecimal-encoding


@prtkgaur prtkgaur commented Dec 5, 2025

Co-authored-by: dhirhan17@gmail.com

@Reviewer: The suggested order in which to look at the code while reviewing is outdated; I will update it shortly.

Rationale for this change

ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding/compression techniques.

Spec

Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md, which can go into Encodings.md.

Parquet Format PR

Dataset PR (parquet-testing)

apache/parquet-testing#100

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the Arrow C++ code.
We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.

Adding the above required the following classes:

  • Alp h/cc : Houses core logic for encoding and decoding.
  • Sampler h/cc : Houses logic to sample and select parameters for encoding.
  • AlpWrapper h/cc : Binds together Alp and Sampler classes.
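To make the pseudo-decimal idea concrete, here is a minimal sketch of the round-trip test at the heart of ALP: scale a double by a power of ten, round to an integer, and accept the result only if decoding reproduces the original bits. The function names and the loop-based `Pow10Sketch` helper are hypothetical and are not taken from this PR; the actual code in Alp h/cc may differ in rounding strategy and parameter selection.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical helper (not from the PR): 10^n for small non-negative n.
double Pow10Sketch(int n) {
  double result = 1.0;
  for (int i = 0; i < n; ++i) result *= 10.0;
  return result;
}

// Decode: integer * 10^factor / 10^exponent.
double DecodePseudoDecimal(int64_t encoded, int exponent, int factor) {
  return static_cast<double>(encoded) * Pow10Sketch(factor) / Pow10Sketch(exponent);
}

// Encode: returns the integer if the value round-trips bit-exactly,
// otherwise std::nullopt (the value would be stored as an exception).
std::optional<int64_t> TryEncodePseudoDecimal(double value, int exponent, int factor) {
  if (!std::isfinite(value)) return std::nullopt;
  const double scaled = value * Pow10Sketch(exponent) / Pow10Sketch(factor);
  // Stay well inside the int64_t range before rounding.
  if (std::fabs(scaled) > 9.0e18) return std::nullopt;
  const int64_t encoded = std::llround(scaled);
  const double decoded = DecodePseudoDecimal(encoded, exponent, factor);
  // Compare bit patterns so that cases like -0.0 are not silently altered.
  if (std::memcmp(&decoded, &value, sizeof(double)) != 0) return std::nullopt;
  return encoded;
}
```

With exponent 1 and factor 0, the value 12.5 encodes to the integer 125, while a value such as pi with exponent 2 does not round-trip and would be kept as an exception.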

The above code is integrated in:

  • Encoder/Decoder cc, which exposes a wrapper to encode a buffer of data.

Are these changes tested?

  • We have added unit tests for the code.
  • Benchmarks have also been added that cover a wide variety of floating-point values, from low precision to high precision.

Unit tests

  • alp_test.cc

Benchmark tests

  • encoding_benchmark.cc and encoding_alp_benchmark.cc

Are there any user-facing changes?

  • It is a new encoding, so the only impact is on query performance, which we claim will only get better.

DuckDB

  • We looked at DuckDB's ALP implementation while implementing ALP and would like to give that team due credit.


github-actions Bot commented Dec 5, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch 3 times, most recently from 1b78a5c to d563ce0 Compare December 7, 2025 15:46
Contributor

I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations

In this case I would recommend https://github.com/apache/parquet-testing

Author

Makes sense. Thanks.
apache/parquet-testing#100

Comment thread cpp/src/parquet/types.h
DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,
Contributor

🎉

Contributor

Contributor

bump

Author

done

Author

For parquet-format we have this PR : apache/parquet-format#557

Contributor

alamb commented Dec 8, 2025

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

@prtkgaur prtkgaur changed the title [Gh540] Add ALPpd encoding to parquet [Gh539] Add ALPpd encoding to parquet Dec 8, 2025

std::string tarball_path = std::string(__FILE__);
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";
Author

@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100


// Unsafe resize without initialization - use only when you will immediately
// overwrite the memory (e.g., before memcpy). Only safe for POD types.
void UnsafeResize(size_t n) {
Author

Using this over resize gave us around 2-3% performance improvement
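For readers unfamiliar with the trick, the sketch below shows one way an uninitialized resize can be written for trivially constructible element types. `RawBufferSketch` is a hypothetical container invented for illustration; the PR's actual container and its `UnsafeResize` implementation are not shown in this thread and may work differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical buffer (not the PR's class) whose resize skips zero-filling.
template <typename T>
class RawBufferSketch {
  static_assert(std::is_trivially_copyable_v<T> &&
                    std::is_trivially_default_constructible_v<T>,
                "uninitialized resize is only safe for trivial types");

 public:
  // Grows or shrinks the logical size. New elements hold indeterminate
  // values and must be overwritten before they are read.
  void UnsafeResize(std::size_t n) {
    if (n > capacity_) {
      // `new T[n]` default-initializes, which leaves trivial T uninitialized.
      std::unique_ptr<T[]> grown(new T[n]);
      if (size_ > 0) std::memcpy(grown.get(), data_.get(), size_ * sizeof(T));
      data_ = std::move(grown);
      capacity_ = n;
    }
    size_ = n;
  }

  T* data() { return data_.get(); }
  std::size_t size() const { return size_; }

 private:
  std::unique_ptr<T[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```

The saving comes from skipping the value-initialization that `std::vector::resize` performs; it only pays off when the caller overwrites the whole new range right away, for example with `memcpy`.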

@prtkgaur prtkgaur changed the title [Gh539] Add ALPpd encoding to parquet [Gh539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
@prtkgaur prtkgaur changed the title [Gh539][Encoding] Add ALPpd encoding to parquet [Gh-539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
Comment thread cpp/src/arrow/util/alp/alp.h Outdated
///
/// \tparam T the floating point type (float or double)
template <typename T>
struct AlpEncodedForVectorInfo {
Contributor

same note on struct vs class.

Comment thread cpp/src/arrow/util/alp/alp.h Outdated
class AlpEncodedVector {
public:
/// ALP-specific metadata (exponent, factor, num_exceptions)
AlpEncodedVectorInfo alp_info;
Contributor

Comment thread cpp/src/arrow/util/alp/alp.h Outdated
}

// ----------------------------------------------------------------------
// AlpMetadataCache (LEGACY - not used with offset-based layout)
Contributor

Can this be removed then?

/// The format supports arbitrary power-of-2 sizes via log_vector_size in the
/// page header, but this implementation currently only supports 1024.
/// Must fit in uint16_t (max 65535), so log_vector_size must be <= 15.
static constexpr int64_t kAlpVectorSize = 1024;
Contributor

I thought you said C++ now supports arbitrary sizes?
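As a point of reference for the constraint in the quoted comment, a minimal sketch of mapping a vector size to its `log_vector_size` could look like the following. The function name is hypothetical; the limit of 15 comes from the quoted comment, and the PR's own `AlpHeader::Log2` may be implemented differently.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: returns log2(vector_size) when vector_size is a power
// of two whose logarithm is at most 15, otherwise std::nullopt.
std::optional<uint8_t> Log2OfVectorSizeSketch(int64_t vector_size) {
  const bool is_power_of_two =
      vector_size > 0 && (vector_size & (vector_size - 1)) == 0;
  if (!is_power_of_two) return std::nullopt;
  uint8_t log = 0;
  while ((int64_t{1} << log) < vector_size) ++log;
  if (log > 15) return std::nullopt;  // limit taken from the quoted comment
  return log;
}
```

Under these assumptions the default size of 1024 maps to 10, and 65536 is rejected because its logarithm is 16.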

Comment thread cpp/src/arrow/util/alp/alp.h Outdated
///
/// \tparam T the type of data to be compressed. Currently float and double.
template <typename T>
class AlpCompression : private AlpConstants {
Contributor

nit: we shouldn't be using private inheritance

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
alp_info.Store({output_buffer.data() + offset, AlpEncodedVectorInfo::kStoredSize});
offset += AlpEncodedVectorInfo::kStoredSize;

// Store ForInfo (6/10 bytes)
Contributor

5 or 9? maybe just remove the detail?

Suggested change
- // Store ForInfo (6/10 bytes)
+ // Store ForInfo

std::memcpy(output_buffer.data() + offset, exceptions.data(), exception_size);
offset += exception_size;

ARROW_CHECK(offset == data_size)
Contributor

same comment about safety.

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
{input_buffer.data() + input_offset, AlpEncodedVectorInfo::kStoredSize}));
input_offset += AlpEncodedVectorInfo::kStoredSize;

// Load ForInfo (6/10 bytes)
Contributor

Suggested change
- // Load ForInfo (6/10 bytes)
+ // Load ForInfo

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
const int64_t bit_packed_size =
AlpEncodedForVectorInfo<T>::GetBitPackedSize(num_elements, result.for_info.bit_width);

result.packed_values.resize(bit_packed_size);
Contributor

is this in the hot path? did zeroing out values show up in any profiling? Maybe we can leave a TODO to re-examine?

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
ptr += sizeof(result.frame_of_reference);

// bit_width: 1 byte
result.bit_width = *ptr;
Contributor

validate bit_width is <= target size.
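A minimal sketch of the requested check, assuming "target size" means the bit width of the integer type that holds the encoded values (32 bits for float, 64 bits for double). The function name is hypothetical and the PR's `AlpEncodedForVectorInfo::Load` may report the error differently, for example through `Status`.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical validator: rejects a bit width larger than the integer type
// the values are unpacked into. ExactInt is assumed to be uint32_t for float
// and uint64_t for double.
template <typename ExactInt>
std::optional<uint8_t> ParseBitWidthSketch(uint8_t raw_bit_width) {
  constexpr uint8_t kMaxBitWidth = static_cast<uint8_t>(sizeof(ExactInt) * 8);
  if (raw_bit_width > kMaxBitWidth) return std::nullopt;
  return raw_bit_width;
}
```

Without a check like this, a corrupted page could ask the unpacker to read more bits per value than the output type can hold.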

Comment thread cpp/src/arrow/util/alp/alp_codec.h Outdated
/// \param[in] input the compressed buffer
/// \param[in] input_size the size of the compressed data
/// \return the AlpHeader, or an error if the buffer is too small
static Result<AlpHeader> LoadHeader(const char* input, int64_t input_size);
Contributor

nit: consistency on char/uint8_t

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
// AlpCodec implementation

template <typename T>
auto AlpCodec<T>::LoadHeader(const char* input, int64_t input_size)
Contributor

is auto needed here? It can just return AlpHeader directly?

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
header.compression_mode = static_cast<uint8_t>(input[0]);
header.integer_encoding = static_cast<uint8_t>(input[1]);
header.log_vector_size = static_cast<uint8_t>(input[2]);
std::memcpy(&header.num_elements, input + 3, sizeof(header.num_elements));
Contributor

SafeLoad ...
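For context, the commits later in this thread mention `util::SafeLoadAs` and `util::SafeStore`. The sketch below is a standalone stand-in for that style of helper, written with hypothetical names so it compiles without Arrow; it is not Arrow's implementation. The point is that values at arbitrary byte offsets (such as offset 3 in the header) are read and written through `memcpy` behind a typed function rather than through a cast pointer.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a SafeLoadAs-style helper: reads a T from a
// possibly unaligned byte address without dereferencing a cast pointer.
template <typename T>
T LoadUnalignedSketch(const uint8_t* src) {
  static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}

// Hypothetical stand-in for a SafeStore-style helper.
template <typename T>
void StoreUnalignedSketch(uint8_t* dst, T value) {
  static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");
  std::memcpy(dst, &value, sizeof(T));
}
```

Wrapping the copy in a typed helper keeps call sites short and makes the size of each load explicit in the template argument.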

header.integer_encoding = static_cast<uint8_t>(input[1]);
header.log_vector_size = static_cast<uint8_t>(input[2]);
std::memcpy(&header.num_elements, input + 3, sizeof(header.num_elements));
return header;
Contributor

validate log_vector_size, compression_mode and integer encoding here? Also, num_elements > 0
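A minimal sketch of what header validation in one place could look like, based on the field layout in the quoted snippet (three single bytes followed by a 4-byte `num_elements` at offset 3). Everything suffixed `Sketch` is hypothetical: the numeric value assumed for the integer encoding, the 7-byte size, and the `std::optional` return are illustration-only assumptions, whereas the PR returns `Result<AlpHeader>`.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>

struct HeaderSketch {
  uint8_t compression_mode;
  uint8_t integer_encoding;
  uint8_t log_vector_size;
  int32_t num_elements;
};

constexpr int64_t kHeaderSizeSketch = 3 + sizeof(int32_t);  // assumed size
constexpr uint8_t kModeAlpSketch = 0;            // kAlp = 0 per a later commit
constexpr uint8_t kForBitPackSketch = 0;         // assumed value
constexpr uint8_t kMaxLogVectorSizeSketch = 15;  // from the kAlpVectorSize comment

// Parses and validates the header in one place; returns std::nullopt on any
// malformed field so callers never see a partially valid header.
std::optional<HeaderSketch> LoadHeaderSketch(const uint8_t* input, int64_t input_size) {
  if (input == nullptr || input_size < kHeaderSizeSketch) return std::nullopt;
  HeaderSketch header;
  header.compression_mode = input[0];
  header.integer_encoding = input[1];
  header.log_vector_size = input[2];
  std::memcpy(&header.num_elements, input + 3, sizeof(header.num_elements));
  if (header.compression_mode != kModeAlpSketch) return std::nullopt;
  if (header.integer_encoding != kForBitPackSketch) return std::nullopt;
  if (header.log_vector_size > kMaxLogVectorSizeSketch) return std::nullopt;
  if (header.num_elements < 0) return std::nullopt;
  return header;
}
```

Doing all checks here means the decode path can trust every header it receives, which is the consolidation the later review comments also ask for.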

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
template <typename T>
auto AlpCodec<T>::CreateSamplingPreset(const T* input, int64_t input_size)
-> AlpSamplerResult {
ARROW_CHECK(input_size >= 0 && input_size % sizeof(T) == 0)
Contributor

should this just take a span of T, or alternatively should input_size be the number of elements to begin with?

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
const AlpSamplerResult& preset,
int32_t vector_size,
char* output, int64_t* output_size) {
ARROW_CHECK(input_size >= 0 && input_size % sizeof(T) == 0)
Contributor

same question, can Span be used instead?

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
header.compression_mode = static_cast<uint8_t>(AlpMode::kAlp);
header.integer_encoding = static_cast<uint8_t>(AlpIntegerEncoding::kForBitPack);
header.log_vector_size = AlpHeader::Log2(vector_size);
header.num_elements = static_cast<int32_t>(element_count);
Contributor

Passing element_count directly could avoid the downcast? If we are allowing the freedom of int64, we probably want to check that this is a safe truncation.
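A minimal sketch of such a check, with a hypothetical function name. It narrows a 64-bit element count to the 32-bit header field only when the value is representable; the PR's follow-up commit describes an INT32_MAX bounds check, but its exact form is not shown here.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: converts an int64_t count to int32_t, or returns
// std::nullopt when the value is negative or would not fit.
std::optional<int32_t> CheckedElementCountSketch(int64_t element_count) {
  if (element_count < 0 ||
      element_count > std::numeric_limits<int32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<int32_t>(element_count);
}
```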

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
encoded_header[0] = header.compression_mode;
encoded_header[1] = header.integer_encoding;
encoded_header[2] = header.log_vector_size;
std::memcpy(encoded_header + 3, &header.num_elements, sizeof(header.num_elements));
Contributor

SafeStore

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
Status AlpCodec<T>::Decode(int32_t num_elements, const char* input, int64_t input_size,
TargetType* output) {
ARROW_ASSIGN_OR_RAISE(const AlpHeader header, LoadHeader(input, input_size));
if (header.log_vector_size > AlpConstants::kMaxLogVectorSize) {
Contributor

also, less than? Move this into the LoadHeader function?

Comment thread cpp/src/arrow/util/alp/alp_codec.cc Outdated
const char* body = input + AlpHeader::kSize;
const int64_t body_size = input_size - static_cast<int64_t>(AlpHeader::kSize);

if (header.GetCompressionMode() != AlpMode::kAlp) {
Contributor

same comment, consider doing all validation in one place.

Contributor

@emkornfield emkornfield left a comment

Still reviewing but wanted to flush comments for what I have so far.

sfc-gh-pgaur and others added 9 commits June 8, 2026 03:17
- Replace std::memcpy with util::SafeLoadAs/SafeStore for all single-value
  loads/stores from uint8_t* in alp.cc and alp_codec.cc
- Convert AlpEncodedVectorInfo and AlpEncodedForVectorInfo from struct to
  class per Google C++ style guide (private data with trailing underscore,
  public getters/setters)
- Add bit_width validation in AlpEncodedForVectorInfo::Load
- Fix incorrect comment "(6/10 bytes)" → remove byte count detail
- Add safety comments on ARROW_CHECK assertions in Store paths
- Add TODO for resize() zero-initialization on decode hot path
- Make AlpMode::kAlp explicitly = 0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…with getters

Per Google C++ style guide, class data members use trailing underscores.
Convert AlpEncodedVector<T> and AlpEncodedVectorView<T> (struct→class)
with private members, const getters, mutable getters for vectors, and
setters. Updates all ~95 call sites across alp.cc, alp_codec.cc, and
alp_test.cc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlpMetadataCache was designed for an older grouped metadata layout that
has been superseded by the offset-based interleaved format. The codec
reads offsets and metadata inline, making this cache unnecessary.

Also removes GetNumElements() which is now redundant with the
num_elements() getter added in the prior commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use explicit AlpConstants:: qualification instead of private inheritance
per reviewer feedback. Private inheritance is discouraged as it obscures
the relationship between classes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eader

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reject invalid compression_mode, integer_encoding, log_vector_size,
and negative num_elements when loading the ALP page header.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace input_size (bytes) with num_elements across all AlpCodec
encode APIs. This removes the sizeof(T) divisibility precondition,
simplifies callers, and makes the encode path consistent with the
decode path which already takes element count.

Also consolidates validation: encode checks are in EncodeWithPreset,
decode checks are in LoadHeader. Adds INT32_MAX bounds check before
header truncation, and uses SafeStore for all header field writes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns with Arrow buffer conventions (Buffer::data() returns uint8_t*).
This eliminates reinterpret_casts at parquet encoder/decoder call sites.

Also updates kAlpVectorSize comment to reflect that arbitrary power-of-2
vector sizes are supported (1024 is just the default, not a limitation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>