GH-48701: [C++][Parquet] Add ALPpd encoding #48345
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations
In this case I would recommend https://github.com/apache/parquet-testing
    DELTA_BYTE_ARRAY = 7,
    RLE_DICTIONARY = 8,
    BYTE_STREAM_SPLIT = 9,
    ALP = 10,
https://github.com/apache/arrow/blob/main/cpp/src/parquet/parquet.thrift#L631 needs to be updated here and in parquet-format.
For parquet-format we have this PR : apache/parquet-format#557
Thanks @prtkgaur -- it is super exciting to see this movement. Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review. I started the CI checks on this PR and had some comments about the testing.
    std::string tarball_path = std::string(__FILE__);
    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
    tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
    tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";
@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100
    // Unsafe resize without initialization - use only when you will immediately
    // overwrite the memory (e.g., before memcpy). Only safe for POD types.
    void UnsafeResize(size_t n) {
Using this over resize gave us around 2-3% performance improvement
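For readers unfamiliar with the idea, skipping the zero-fill on resize can be sketched in standard C++ with an allocator adaptor that default-initializes elements. This is not the PR's `UnsafeResize`; it swaps in a different, well-known technique, and `DefaultInitAllocator`, `RawByteVector`, and `CopyBytes` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor (illustrative, not from the PR): construct() with no
// arguments default-initializes, so trivial types are left uninitialized
// instead of being zero-filled by resize().
template <typename T, typename A = std::allocator<T>>
class DefaultInitAllocator : public A {
 public:
  using A::A;

  template <typename U>
  struct rebind {
    using other = DefaultInitAllocator<
        U, typename std::allocator_traits<A>::template rebind_alloc<U>>;
  };

  template <typename U>
  void construct(U* ptr) noexcept(
      std::is_nothrow_default_constructible<U>::value) {
    ::new (static_cast<void*>(ptr)) U;  // default-init: no zeroing for POD
  }

  template <typename U, typename... Args>
  void construct(U* ptr, Args&&... args) {
    std::allocator_traits<A>::construct(static_cast<A&>(*this), ptr,
                                        std::forward<Args>(args)...);
  }
};

using RawByteVector =
    std::vector<unsigned char, DefaultInitAllocator<unsigned char>>;

// Hypothetical helper: grow the buffer, then immediately overwrite every byte.
inline RawByteVector CopyBytes(const unsigned char* src, std::size_t n) {
  RawByteVector out;
  out.resize(n);  // no zero-fill pass over the new elements
  if (n > 0) std::memcpy(out.data(), src, n);
  return out;
}
```

As with the comment on `UnsafeResize`, this is only appropriate when every element is overwritten before it is read.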
Co-authored-by: Dhirhan Kanesalingam <dhirhan17@gmail.com>
Also ensure that no line exceeds 90 characters
    ///
    /// \tparam T the floating point type (float or double)
    template <typename T>
    struct AlpEncodedForVectorInfo {
same note on struct vs class.
    class AlpEncodedVector {
    public:
    /// ALP-specific metadata (exponent, factor, num_exceptions)
    AlpEncodedVectorInfo alp_info;
nit: class member names end in '_' (https://google.github.io/styleguide/cppguide.html#Variable_Names)
    }

    // ----------------------------------------------------------------------
    // AlpMetadataCache (LEGACY - not used with offset-based layout)
Can this be removed then?
    /// The format supports arbitrary power-of-2 sizes via log_vector_size in the
    /// page header, but this implementation currently only supports 1024.
    /// Must fit in uint16_t (max 65535), so log_vector_size must be <= 15.
    static constexpr int64_t kAlpVectorSize = 1024;
I thought you said C++ now supports arbitrary sizes?
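To make the constraint in the quoted comment concrete, here is a minimal sketch of deriving `log_vector_size` from a vector size while enforcing the power-of-2 and `<= 15` limits stated there. `Log2OfVectorSize` and `kMaxLogVectorSizeSketch` are hypothetical names; the PR's own `AlpHeader::Log2` may behave differently.

```cpp
#include <cstdint>

// Upper bound taken from the quoted comment: log_vector_size must be <= 15.
constexpr int kMaxLogVectorSizeSketch = 15;

// Hypothetical helper: returns true and writes log2(vector_size) when
// vector_size is a power of two no larger than 2^15; returns false otherwise.
inline bool Log2OfVectorSize(int64_t vector_size, uint8_t* log_out) {
  if (vector_size <= 0) return false;
  // A power of two has exactly one bit set.
  if ((vector_size & (vector_size - 1)) != 0) return false;
  uint8_t log = 0;
  while ((int64_t{1} << log) < vector_size) ++log;
  if (log > kMaxLogVectorSizeSketch) return false;
  *log_out = log;
  return true;
}
```

With this shape, 1024 is simply the default input rather than the only accepted value.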
    ///
    /// \tparam T the type of data to be compressed. Currently float and double.
    template <typename T>
    class AlpCompression : private AlpConstants {

    alp_info.Store({output_buffer.data() + offset, AlpEncodedVectorInfo::kStoredSize});
    offset += AlpEncodedVectorInfo::kStoredSize;

    // Store ForInfo (6/10 bytes)
5 or 9? maybe just remove the detail?
Suggested change:
    - // Store ForInfo (6/10 bytes)
    + // Store ForInfo
    std::memcpy(output_buffer.data() + offset, exceptions.data(), exception_size);
    offset += exception_size;

    ARROW_CHECK(offset == data_size)
same comment about safety.
    {input_buffer.data() + input_offset, AlpEncodedVectorInfo::kStoredSize}));
    input_offset += AlpEncodedVectorInfo::kStoredSize;

    // Load ForInfo (6/10 bytes)
Suggested change:
    - // Load ForInfo (6/10 bytes)
    + // Load ForInfo
    const int64_t bit_packed_size =
    AlpEncodedForVectorInfo<T>::GetBitPackedSize(num_elements, result.for_info.bit_width);

    result.packed_values.resize(bit_packed_size);
is this in the hot path? did zeroing out values show up in any profiling? Maybe we can leave a TODO to re-examine?
    ptr += sizeof(result.frame_of_reference);

    // bit_width: 1 byte
    result.bit_width = *ptr;
validate bit_width is <= target size.
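A minimal sketch of the check being requested, assuming the unpacked values are held in an unsigned integer as wide as the floating point type (32 bits for float, 64 bits for double). `IsValidBitWidth` is a hypothetical helper, not a function from the PR.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: a bit width read from untrusted page bytes is only
// usable if it does not exceed the width of the integer type that will hold
// the unpacked values (assumed: uint32_t for float, uint64_t for double).
template <typename UInt>
inline bool IsValidBitWidth(uint8_t bit_width) {
  static_assert(std::is_unsigned<UInt>::value,
                "UInt must be an unsigned integer type");
  return bit_width <= sizeof(UInt) * 8;
}
```

A loader would call this right after reading the byte and return an error status instead of continuing with an out-of-range width.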
    /// \param[in] input the compressed buffer
    /// \param[in] input_size the size of the compressed data
    /// \return the AlpHeader, or an error if the buffer is too small
    static Result<AlpHeader> LoadHeader(const char* input, int64_t input_size);
nit: consistency on char/uint8_t
    // AlpCodec implementation

    template <typename T>
    auto AlpCodec<T>::LoadHeader(const char* input, int64_t input_size)
is auto needed here? It can just return AlpHeader directly?
    header.compression_mode = static_cast<uint8_t>(input[0]);
    header.integer_encoding = static_cast<uint8_t>(input[1]);
    header.log_vector_size = static_cast<uint8_t>(input[2]);
    std::memcpy(&header.num_elements, input + 3, sizeof(header.num_elements));
    return header;
validate log_vector_size, compression_mode and integer encoding here? Also, num_elements > 0
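A sketch of what validating everything at load time could look like, using the byte layout visible in the quoted snippet (three one-byte fields, then `num_elements` at offset 3). Everything named here is hypothetical: `HeaderSketch`, `ParseHeader`, and the constants are illustrative, and the value 0 for the FOR/bit-pack integer encoding is an assumption (only `AlpMode::kAlp = 0` is stated in this PR).

```cpp
#include <cstdint>
#include <cstring>

// Illustrative mirror of the header fields shown in the quoted diff.
struct HeaderSketch {
  uint8_t compression_mode;
  uint8_t integer_encoding;
  uint8_t log_vector_size;
  int32_t num_elements;
};

constexpr int64_t kHeaderSizeSketch = 3 + static_cast<int64_t>(sizeof(int32_t));
constexpr uint8_t kModeAlp = 0;             // stated in the PR: kAlp = 0
constexpr uint8_t kEncodingForBitPack = 0;  // ASSUMPTION: actual value unknown
constexpr uint8_t kMaxLogVectorSize = 15;   // from the kAlpVectorSize comment

// Hypothetical parser: returns false for any malformed header, so callers
// never see a partially validated result.
inline bool ParseHeader(const uint8_t* input, int64_t input_size,
                        HeaderSketch* out) {
  if (input == nullptr || input_size < kHeaderSizeSketch) return false;
  HeaderSketch h;
  h.compression_mode = input[0];
  h.integer_encoding = input[1];
  h.log_vector_size = input[2];
  // memcpy avoids an unaligned int32_t read at offset 3 (native byte order).
  std::memcpy(&h.num_elements, input + 3, sizeof(h.num_elements));
  if (h.compression_mode != kModeAlp) return false;
  if (h.integer_encoding != kEncodingForBitPack) return false;
  if (h.log_vector_size > kMaxLogVectorSize) return false;
  if (h.num_elements < 0) return false;
  *out = h;
  return true;
}
```

Keeping all four checks in one function means the decode path can assume a well-formed header after a single call.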
    template <typename T>
    auto AlpCodec<T>::CreateSamplingPreset(const T* input, int64_t input_size)
    -> AlpSamplerResult {
    ARROW_CHECK(input_size >= 0 && input_size % sizeof(T) == 0)
should this just take a span of T, or alternatively should input_size be the number of elements to begin with?
    const AlpSamplerResult& preset,
    int32_t vector_size,
    char* output, int64_t* output_size) {
    ARROW_CHECK(input_size >= 0 && input_size % sizeof(T) == 0)
same question, can Span be used instead?
    header.compression_mode = static_cast<uint8_t>(AlpMode::kAlp);
    header.integer_encoding = static_cast<uint8_t>(AlpIntegerEncoding::kForBitPack);
    header.log_vector_size = AlpHeader::Log2(vector_size);
    header.num_elements = static_cast<int32_t>(element_count);
Passing element_count directly could avoid the downcast? If we are allowing the freedom of int64, we probably want to check that this is a safe truncation.
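A minimal sketch of a checked narrowing from `int64_t` to `int32_t`, the kind of guard that would precede the `static_cast<int32_t>(element_count)` in the quoted snippet. `CheckedNarrowToInt32` is a hypothetical helper name, and rejecting negative counts is an assumption about what a valid element count is.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: narrows only when the value is a non-negative count
// that fits in int32_t; otherwise reports failure and leaves *out untouched.
inline bool CheckedNarrowToInt32(int64_t value, int32_t* out) {
  if (value < 0) return false;
  if (value > std::numeric_limits<int32_t>::max()) return false;
  *out = static_cast<int32_t>(value);
  return true;
}
```

In Arrow-style code the failure branch would typically become an error `Status` rather than a `bool`.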
    encoded_header[0] = header.compression_mode;
    encoded_header[1] = header.integer_encoding;
    encoded_header[2] = header.log_vector_size;
    std::memcpy(encoded_header + 3, &header.num_elements, sizeof(header.num_elements));

    Status AlpCodec<T>::Decode(int32_t num_elements, const char* input, int64_t input_size,
    TargetType* output) {
    ARROW_ASSIGN_OR_RAISE(const AlpHeader header, LoadHeader(input, input_size));
    if (header.log_vector_size > AlpConstants::kMaxLogVectorSize) {
also, less than? Move this into the LoadHeader function?
    const char* body = input + AlpHeader::kSize;
    const int64_t body_size = input_size - static_cast<int64_t>(AlpHeader::kSize);

    if (header.GetCompressionMode() != AlpMode::kAlp) {
same comment, consider doing all validation in one place.
emkornfield left a comment
Still reviewing but wanted to flush comments for what I have so far.
- Replace std::memcpy with util::SafeLoadAs/SafeStore for all single-value loads/stores from uint8_t* in alp.cc and alp_codec.cc
- Convert AlpEncodedVectorInfo and AlpEncodedForVectorInfo from struct to class per Google C++ style guide (private data with trailing underscore, public getters/setters)
- Add bit_width validation in AlpEncodedForVectorInfo::Load
- Fix incorrect comment "(6/10 bytes)" → remove byte count detail
- Add safety comments on ARROW_CHECK assertions in Store paths
- Add TODO for resize() zero-initialization on decode hot path
- Make AlpMode::kAlp explicitly = 0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
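For context on the first bullet, the idea behind a typed load/store helper can be sketched as a thin wrapper over `std::memcpy`. This is a standalone illustration, not Arrow's `util::SafeLoadAs`/`SafeStore` implementation; `LoadUnaligned` and `StoreUnaligned` are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: read a T from a possibly unaligned byte pointer.
// memcpy keeps this well-defined where a reinterpret_cast dereference
// would not be.
template <typename T>
inline T LoadUnaligned(const uint8_t* src) {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}

// Hypothetical helper: write a T to a possibly unaligned byte pointer.
template <typename T>
inline void StoreUnaligned(uint8_t* dst, T value) {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  std::memcpy(dst, &value, sizeof(T));
}
```

The typed wrapper states the intended width at the call site, for example `LoadUnaligned<int32_t>(input + 3)` for a field that sits at an odd offset.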
…with getters

Per Google C++ style guide, class data members use trailing underscores. Convert AlpEncodedVector<T> and AlpEncodedVectorView<T> (struct→class) with private members, const getters, mutable getters for vectors, and setters. Updates all ~95 call sites across alp.cc, alp_codec.cc, and alp_test.cc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AlpMetadataCache was designed for an older grouped metadata layout that has been superseded by the offset-based interleaved format. The codec reads offsets and metadata inline, making this cache unnecessary. Also removes GetNumElements() which is now redundant with the num_elements() getter added in the prior commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use explicit AlpConstants:: qualification instead of private inheritance per reviewer feedback. Private inheritance is discouraged as it obscures the relationship between classes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eader

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reject invalid compression_mode, integer_encoding, log_vector_size, and negative num_elements when loading the ALP page header.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace input_size (bytes) with num_elements across all AlpCodec encode APIs. This removes the sizeof(T) divisibility precondition, simplifies callers, and makes the encode path consistent with the decode path which already takes element count. Also consolidates validation: encode checks are in EncodeWithPreset, decode checks are in LoadHeader. Adds INT32_MAX bounds check before header truncation, and uses SafeStore for all header field writes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Aligns with Arrow buffer conventions (Buffer::data() returns uint8_t*). This eliminates reinterpret_casts at parquet encoder/decoder call sites. Also updates kAlpVectorSize comment to reflect that arbitrary power-of-2 vector sizes are supported (1024 is just the default, not a limitation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: dhirhan17@gmail.com
@Reviewer: Suggested order in which to look at the code while reviewing: outdated, will update shortly.
Rationale for this change
ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding/compression techniques.
Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md, which can go into Encodings.md.
Parquet Format PR
Dataset PR (parquet-testing)
apache/parquet-testing#100
What changes are included in this PR?
This PR
Introduces ALP (pseudo-decimal) encoding into the Arrow C++ code.
We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.
Adding the above required the following classes.
Integration of the above code was done in
Are these changes tested?
Unit tests
Benchmark tests
Are there any user-facing changes?
DuckDB