From b2412fdc43cdc80b3c83118aa9802bcc552e4821 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Isma=C3=ABl=20Mej=C3=ADa?= Date: Sat, 23 May 2026 14:57:49 +0200 Subject: [PATCH 1/4] Fix specification inconsistencies, typos, and errors - BloomFilter.md: Fix block_check pseudocode (setBit -> isSet) - BloomFilter.md: Fix struct name to match thrift (BloomFilterHeader) - parquet.thrift: Fix typos ("to be be", "documention", "not necessary") - parquet.thrift: Remove off-by-one in DataPageHeaderV2 is_compressed comment - README.md: Fix repetition level value for non-nested columns (1 -> 0) - README.md: Update defunct Twitter Code of Conduct links to ASF - LogicalTypes.md: Fix embedded types ordering contradiction - LogicalTypes.md: Add nanosecond to TIME precision description - VariantEncoding.md: Fix BINARY -> BYTE_ARRAY for equivalent Parquet type - VariantEncoding.md: Add note on decimal little-endian vs big-endian difference - Compression.md: Fix ZSTD RFC reference (8478 -> 8878) - Encryption.md: Fix double-negative and align GCM invocation limit to NIST - Encodings.md: Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY --- BloomFilter.md | 8 ++++---- Compression.md | 2 +- Encodings.md | 2 -- Encryption.md | 6 +++--- LogicalTypes.md | 6 ++++-- README.md | 6 ++++-- VariantEncoding.md | 6 +++++- src/main/thrift/parquet.thrift | 8 ++++---- 8 files changed, 25 insertions(+), 19 deletions(-) diff --git a/BloomFilter.md b/BloomFilter.md index 98ec685b6..a85736381 100644 --- a/BloomFilter.md +++ b/BloomFilter.md @@ -122,7 +122,7 @@ boolean block_check(block b, unsigned int32 x) { for i in [0..7] { for j in [0..31] { if (masked.getWord(i).isSet(j)) { - if (not b.getWord(i).setBit(j)) { + if (not b.getWord(i).isSet(j)) { return false } } @@ -307,7 +307,7 @@ union BloomFilterCompression { * Bloom filter header is stored at beginning of Bloom filter data of each column * and followed by its bitset. 
**/ -struct BloomFilterPageHeader { +struct BloomFilterHeader { /** The size of bitset in bytes **/ 1: required i32 numBytes; /** The algorithm for setting bits. **/ @@ -339,8 +339,8 @@ information such as the presence of value. Therefore the Bloom filter of columns data should be encrypted with the column key, and the Bloom filter of other (not sensitive) columns do not need to be encrypted. -Bloom filters have two serializable modules - the PageHeader thrift structure (with its internal -fields, including the BloomFilterPageHeader `bloom_filter_page_header`), and the Bitset. The header +Bloom filters have two serializable modules - the Bloom filter header (the BloomFilterHeader thrift +structure and its internal fields), and the Bitset. The header structure is serialized by Thrift, and written to file output stream; it is followed by the serialized Bitset. diff --git a/Compression.md b/Compression.md index c1cad5d29..be3b70988 100644 --- a/Compression.md +++ b/Compression.md @@ -89,7 +89,7 @@ switch to the newer, interoperable `LZ4_RAW` codec. ### ZSTD A codec based on the Zstandard format defined by -[RFC 8478](https://tools.ietf.org/html/rfc8478). If any ambiguity arises +[RFC 8878](https://tools.ietf.org/html/rfc8878). If any ambiguity arises when implementing this format, the implementation provided by the [Zstandard compression library](https://facebook.github.io/zstd/) is authoritative. diff --git a/Encodings.md b/Encodings.md index 1c766fb5a..c8436f13c 100644 --- a/Encodings.md +++ b/Encodings.md @@ -322,8 +322,6 @@ The delta encoding algorithm described above stores a bit width per miniblock an Supported Types: BYTE_ARRAY -This encoding is always preferred over PLAIN for byte array columns. - For this encoding, we will take all the byte array lengths and encode them using delta encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length data just concatenated back to back. 
The expected savings is from the cost of encoding the lengths diff --git a/Encryption.md b/Encryption.md index 180b9aa6b..a2509cd00 100644 --- a/Encryption.md +++ b/Encryption.md @@ -136,9 +136,9 @@ one IV is ever repeated, then the implementation may be vulnerable"*. *"Complian requirement is crucial to the security of GCM"*. The bulk of modules in a Parquet file are page headers and data pages. Therefore, one encryption -key shall not not be used for more than 2^31 (~2 billion) pages. In Parquet files encrypted with -multiple keys (footer and column keys), the constraint on the number of invocations is applied -to each key separately. +key shall not be used for more than 2^32 total module encryptions, as per the NIST specification. +In Parquet files encrypted with multiple keys (footer and column keys), the constraint on the +number of invocations is applied to each key separately. When running in the context of a larger system, any particular Parquet writer implementation likely does not have sufficient context to enforce key invocation limits system-wide. Therefore, diff --git a/LogicalTypes.md b/LogicalTypes.md index 795c223f9..b7d55ac5c 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -271,7 +271,8 @@ The sort order used for `DATE` is signed. ### TIME -`TIME` is used for a logical time type without a date with millisecond or microsecond precision. +`TIME` is used for a logical time type without a date with millisecond, microsecond, +or nanosecond precision. The type has two type parameters: UTC adjustment (`true` or `false`) and unit (`MILLIS` or `MICROS`, `NANOS`). @@ -544,7 +545,8 @@ are found during reading, they must be ignored. ## Embedded Types -Embedded types do not have type-specific orderings. +Embedded types do not have type-specific orderings beyond the unsigned +byte-wise comparison of their physical type (`BYTE_ARRAY`). 
### JSON diff --git a/README.md b/README.md index d398ac4f2..87519d2ab 100644 --- a/README.md +++ b/README.md @@ -193,7 +193,7 @@ The value of `uncompressed_page_size` specified in the header is for all the 3 p The encoded values for the data page is always required. The definition and repetition levels are optional, based on the schema definition. If the column is not nested (i.e. the path to the column has length 1), we do not encode the repetition levels (it would -always have the value 1). For data that is required, the definition levels are +always have the value 0). For data that is required, the definition levels are skipped (if encoded, it will always have the value of the max definition level). For example, in the case where the column is non-nested and required, the data in the @@ -278,7 +278,9 @@ Changes to this core format definition are proposed and discussed in depth on th ## Code of Conduct -We hold ourselves and the Parquet developer community to a code of conduct as described by [Twitter OSS](https://engineering.twitter.com/opensource): . +We hold ourselves and the Parquet developer community to two codes of conduct: +1. [The Apache Software Foundation Code of Conduct](https://www.apache.org/foundation/policies/conduct.html) +2. [The Apache Software Foundation Code of Conduct for GitHub](https://github.com/apache/.github/blob/main/CODE_OF_CONDUCT.md) ## License Copyright 2013 Twitter, Cloudera and other contributors. diff --git a/VariantEncoding.md b/VariantEncoding.md index b78c02ef8..2481880d6 100644 --- a/VariantEncoding.md +++ b/VariantEncoding.md @@ -390,6 +390,10 @@ It is semantically identical to the "string" primitive type. The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`. +Note: Decimal values in the Variant binary encoding use little-endian byte order for the +unscaled value. 
This differs from Parquet's DECIMAL logical type which uses big-endian +two's complement encoding for `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` physical types. + ## Encoding types *Variant basic types* @@ -419,7 +423,7 @@ The Decimal type contains a scale, but no precision. The implied precision of a | Timestamp | timestamp | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian | | TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | Float | float | `14` | FLOAT | IEEE little-endian | -| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes | +| Binary | binary | `15` | BYTE_ARRAY | 4 byte little-endian size, followed by bytes | | String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes | | TimeNTZ | time without time zone | `17` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | Timestamp | timestamp with time zone | `18` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 225f85f96..a88203165 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -752,7 +752,7 @@ struct DataPageHeaderV2 { /** Whether the values are compressed. Which means the section of the page between - definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included) + definition_levels_byte_length + repetition_levels_byte_length and compressed_page_size (included) is compressed with the compression_codec. 
If missing it is considered compressed */ 7: optional bool is_compressed = true; @@ -816,7 +816,7 @@ struct PageHeader { /** Compressed (and potentially encrypted) page size in bytes, not including this header **/ 3: required i32 compressed_page_size - /** The 32-bit CRC checksum for the page, to be be calculated as follows: + /** The 32-bit CRC checksum for the page, to be calculated as follows: * * - The standard CRC32 algorithm is used (with polynomial 0x04C11DB7, * the same as in e.g. GZip). @@ -1230,7 +1230,7 @@ struct OffsetIndex { /** * Unencoded/uncompressed size for BYTE_ARRAY types. * - * See documention for unencoded_byte_array_data_bytes in SizeStatistics for + * See documentation for unencoded_byte_array_data_bytes in SizeStatistics for * more details on this field. */ 2: optional list unencoded_byte_array_data_bytes @@ -1399,7 +1399,7 @@ struct FileMetaData { * Sort order used for the min_value and max_value fields in the Statistics * objects and the min_values and max_values fields in the ColumnIndex * objects of each column in this file. Sort orders are listed in the order - * matching the columns in the schema. The indexes are not necessary the same + * matching the columns in the schema. The indexes are not necessarily the same * though, because only leaf nodes of the schema are represented in the list * of sort orders. * From 2cfbc57eb7e509be2bf4ee716663428fe49bdc8d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Isma=C3=ABl=20Mej=C3=ADa?= Date: Sat, 23 May 2026 15:23:55 +0200 Subject: [PATCH 2/4] Fix more specification inconsistencies and clarify ambiguous descriptions Round-2 cleanup pass over the spec docs. 
Highlights: Bugs / clear errors: - PageIndex.md, parquet.thrift: fix ""Blart Versenwald III" double-quote typo - VariantShredding.md: fix Python syntax error (`: Variant:` -> `-> Variant:`) - VariantShredding.md: replace BINARY with BYTE_ARRAY (BINARY is not a Parquet physical type); fix `,` -> `:` inside a JSON-like literal in a table cell - BloomFilter.md: include missing `bloom_filter_length` field in the ColumnMetaData snippet (it exists in parquet.thrift) - Encodings.md: "bitwidth of each block" -> "each miniblock" in DELTA_BINARY_PACKED description - README.md: add missing colon and code formatting in BIT_PACKED/RLE sentence - LogicalTypes.md: fix TIME unit list punctuation, "Despite there is" grammar Cross-document consistency: - LogicalTypes.md: align DECIMAL precision/scale wording with parquet.thrift ("should" -> "must") - Geospatial.md: use uppercase edge-interpolation algorithm names (SPHERICAL/VINCENTY/...) to match parquet.thrift enum and LogicalTypes.md - Geospatial.md: make srid: prefix consistent (lowercase example) - VariantEncoding.md, VariantShredding.md: use `INT(N, true)` notation consistent with LogicalTypes.md syntax - VariantEncoding.md: make sorted_strings description consistent across the three places it is defined - Encodings.md: PLAIN BOOLEAN no longer links to the deprecated MSB-first BITPACKED section; references the RLE/bit-packing hybrid section instead - parquet.thrift: disambiguate `compressed_page_size` comment in PageLocation (it includes the header; the field of the same name on PageHeader does not) Coherence/clarity: - VariantEncoding.md: label undocumented reserved bits in metadata header, object value_header, and array value_header diagrams - VariantEncoding.md: fix decimal implied-precision formula for val <= 0 - Encodings.md: BIT_PACKED is already replaced, not "will be replaced" - Encryption.md: replace "allows to" with idiomatic English - Encryption.md: "Data PageHeader" -> "Data Page Header" (spacing) - BloomFilter.md: 
"64 bits version" -> "the 64-bit version" - Geospatial.md: fix XYZM table column alignment --- BloomFilter.md | 10 +++++++++- Encodings.md | 9 +++++---- Encryption.md | 11 ++++++----- Geospatial.md | 14 +++++++------- LogicalTypes.md | 6 +++--- PageIndex.md | 2 +- README.md | 2 +- VariantEncoding.md | 21 ++++++++++++--------- VariantShredding.md | 12 ++++++------ src/main/thrift/parquet.thrift | 6 +++--- 10 files changed, 53 insertions(+), 40 deletions(-) diff --git a/BloomFilter.md b/BloomFilter.md index a85736381..c8824de55 100644 --- a/BloomFilter.md +++ b/BloomFilter.md @@ -282,7 +282,7 @@ union BloomFilterAlgorithm { } /** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash - * algorithm. It uses 64 bits version of xxHash. + * algorithm. It uses the 64-bit version of xxHash. **/ struct XxHash {} @@ -322,6 +322,14 @@ struct ColumnMetaData { ... /** Byte offset from beginning of file to Bloom filter data. **/ 14: optional i64 bloom_filter_offset; + + /** Size of Bloom filter data including the serialized header, in bytes. + * Added in 2.10 so readers may not read this field from old files and + * it can be obtained after the BloomFilterHeader has been deserialized. + * Writers should write this field so readers can read the bloom filter + * in a single I/O. + */ + 15: optional i32 bloom_filter_length; } ``` diff --git a/Encodings.md b/Encodings.md index c8436f13c..eb83c8c35 100644 --- a/Encodings.md +++ b/Encodings.md @@ -56,7 +56,8 @@ intended to be the simplest encoding. Values are encoded back to back. The plain encoding is used whenever a more efficient encoding can not be used. 
It stores the data in the following format: - - BOOLEAN: [Bit Packed](#BITPACKED), LSB first + - BOOLEAN: bit-packed, LSB first (using the same packing scheme as the + [RLE/bit-packing hybrid](#RLE) encoding) - INT32: 4 bytes little endian - INT64: 8 bytes little endian - INT96: 12 bytes little endian (deprecated) @@ -151,7 +152,7 @@ data: * Dictionary indices * Boolean values in data pages, as an alternative to PLAIN encoding -Whether prepending the four-byte `length` to the `encoded-data` is summarized as the table below: +Whether prepending the four-byte `length` to the `encoded-data` is summarized in the table below: ``` +--------------+------------------------+-----------------+ | Page kind | RLE-encoded data kind | Prepend length? | @@ -171,7 +172,7 @@ Whether prepending the four-byte `length` to the `encoded-data` is summarized as ### Bit-packed (Deprecated) (BIT_PACKED = 4) -This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding. +This is a bit-packed only encoding, which is deprecated; it has been replaced by the [RLE/bit-packing](#RLE) hybrid encoding. Each value is encoded back to back using a fixed width. There is no padding between values (except for the last byte, which is padded with 0s). For example, if the max repetition level was 3 (2 bits) and the max definition level as 3 @@ -230,7 +231,7 @@ Each block contains ``` * the min delta is a zigzag ULEB128 int (we compute a minimum as we need positive integers for bit packing) - * the bitwidth of each block is stored as a byte + * the bitwidth of each miniblock is stored as a byte * each miniblock is a list of bit packed ints according to the bit width stored at the beginning of the block diff --git a/Encryption.md b/Encryption.md index a2509cd00..9d43315bb 100644 --- a/Encryption.md +++ b/Encryption.md @@ -101,7 +101,7 @@ other on a combination of GCM and CTR modes. AES GCM is an authenticated encryption. 
Besides the data confidentiality (encryption), it supports two levels of integrity verification (authentication): of the data (default), and of the data combined with an optional AAD (“additional authenticated data”). The -authentication allows to make sure the data has not been tampered with. An AAD +authentication makes it possible to verify that the data has not been tampered with. An AAD is a free text to be authenticated, together with the data. The user can, for example, pass the file name with its version (or creation timestamp) as an AAD input, to verify that the file has not been replaced with an older version. The details on how Parquet creates @@ -161,8 +161,9 @@ tag used to verify the ciphertext and AAD integrity. #### 4.2.2 AES_GCM_CTR_V1 + In this Parquet algorithm, all modules except pages are encrypted with the GCM cipher, as described -above. The pages are encrypted by the CTR cipher without padding. This allows to encrypt/decrypt +above. The pages are encrypted by the CTR cipher without padding. This makes it possible to encrypt/decrypt the bulk of the data faster, while still verifying the metadata integrity and making sure the file has not been replaced with a wrong version. However, tampering with the page data might go unnoticed. The AES CTR cipher @@ -221,7 +222,7 @@ group 1. The module AAD is a direct concatenation of the prefix and suffix parts #### 4.4.1 AAD prefix File swapping can be prevented by an AAD prefix string, that uniquely identifies the file and -allows to differentiate it e.g. from older versions of the file or from other partition files in the same +makes it possible to differentiate it e.g. from older versions of the file or from other partition files in the same data set (table). This string is optionally passed by a writer upon file creation. If provided, the AAD prefix is stored in an `aad_prefix` field in the file, and is made available to the readers. This field is not encrypted. 
If a user is concerned about keeping the file identity inside the file, @@ -262,8 +263,8 @@ The following module types are defined: * ColumnMetaData (1) * Data Page (2) * Dictionary Page (3) - * Data PageHeader (4) - * Dictionary PageHeader (5) + * Data Page Header (4) + * Dictionary Page Header (5) * ColumnIndex (6) * OffsetIndex (7) * BloomFilter Header (8) diff --git a/Geospatial.md b/Geospatial.md index 50a16e39d..6a6253b31 100644 --- a/Geospatial.md +++ b/Geospatial.md @@ -48,7 +48,7 @@ Non-default CRS values are specified by any string that uniquely identifies a co To maximize interoperability, suggested (but not limited to) formats for CRS are: * `` - A complete CRS definition embedded directly using the [PROJJSON](https://proj.org/en/stable/specifications/projjson.html) specification. Example for `OGC:CRS83`: `{"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json","type": "GeographicCRS","name": "NAD83 (CRS83)","datum": {"type": "GeodeticReferenceFrame"...` * `:` - where `` represents some well known authorities, and `code` is the code used by the authority to identify the CRS. Examples are - `OGC:CRS84`, `OGC:CRS83`, `OGC:CRS27`, `EPSG:4326`, `EPSG:3857`, `IGNF:ATI`. See [https://spatialreference.org/](https://spatialreference.org/) for definitions of coordinate reference systems provided by some well known authorities. -* `srid:` - A reference using a [Spatial reference identifier (SRID)](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), where is the numeric SRID value. For example: `SRID:0`. +* `srid:` - A reference using a [Spatial reference identifier (SRID)](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), where is the numeric SRID value. For example: `srid:0`. * `projjson:` - where refers to a key within the file key-value metadata, where CRS definition in [PROJJSON](https://proj.org/en/stable/specifications/projjson.html) format is stored. 
For geographic CRS, longitudes are bound by [-180, 180] and latitudes are bound @@ -60,11 +60,11 @@ by [-90, 90]. An algorithm for interpolating edges, and is one of the following values: -* `spherical`: edges are interpolated as geodesics on a sphere. -* `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae) -* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970. -* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965. -* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/) +* `SPHERICAL`: edges are interpolated as geodesics on a sphere. +* `VINCENTY`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae) +* `THOMAS`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970. +* `ANDOYER`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965. +* `KARNEY`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/) # Logical Types @@ -137,7 +137,7 @@ values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code]. 
Table below shows the most common geospatial types and their codes: | Type | XY | XYZ | XYM | XYZM | -| :----------------- | :--- | :--- | :--- | :--: | +| :----------------- | :--- | :--- | :--- | :--- | | Point | 0001 | 1001 | 2001 | 3001 | | LineString | 0002 | 1002 | 2002 | 3002 | | Polygon | 0003 | 1003 | 2003 | 3003 | diff --git a/LogicalTypes.md b/LogicalTypes.md index b7d55ac5c..8a15c67d9 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -243,7 +243,7 @@ comparison. *Compatibility* -To support compatibility with older readers, implementations of parquet-format should +To support compatibility with older readers, implementations of parquet-format must write `DecimalType` precision and scale into the corresponding SchemaElement field in metadata. ### FLOAT16 @@ -274,7 +274,7 @@ The sort order used for `DATE` is signed. `TIME` is used for a logical time type without a date with millisecond, microsecond, or nanosecond precision. The type has two type parameters: UTC adjustment (`true` or `false`) -and unit (`MILLIS` or `MICROS`, `NANOS`). +and unit (`MILLIS`, `MICROS`, or `NANOS`). `TIME` with unit `MILLIS` is used for millisecond precision. It must annotate an `int32` that stores the number of @@ -300,7 +300,7 @@ counterpart, it must annotate an `int32`. type that is UTC normalized and has `MICROS` precision. Like the logical type counterpart, it must annotate an `int64`. 
-Despite there is no exact corresponding ConvertedType for local time semantic, +Although there is no exact corresponding ConvertedType for local time semantic, in order to support forward compatibility with those libraries, which annotated their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotation, Parquet writer implementation *must* annotate local time with legacy annotations too, diff --git a/PageIndex.md b/PageIndex.md index a371c4253..60a93ea40 100644 --- a/PageIndex.md +++ b/PageIndex.md @@ -81,7 +81,7 @@ Some observations: * We store lower and upper bounds for the values of each page. These may be the actual minimum and maximum values found on a page, but can also be (more compact) values that do not exist on a page. For example, instead of storing - ""Blart Versenwald III", a writer may set `min_values[i]="B"`, + `"Blart Versenwald III"`, a writer may set `min_values[i]="B"`, `max_values[i]="C"`. This allows writers to truncate large values and writers should use this to enforce some reasonable bound on the size of the index structures. diff --git a/README.md b/README.md index 87519d2ab..49a46c16a 100644 --- a/README.md +++ b/README.md @@ -172,7 +172,7 @@ be computed from the schema (i.e. how much nesting there is). This defines the maximum number of bits required to store the levels (levels are defined for all values in the column). -Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now used as it supersedes BIT_PACKED. +Two encodings for the levels are supported: `BIT_PACKED` and `RLE`. Only `RLE` is now used as it supersedes `BIT_PACKED`. ## Nulls Nullity is encoded in the definition levels (which is run-length encoded). NULL values diff --git a/VariantEncoding.md b/VariantEncoding.md index 2481880d6..fc425d72e 100644 --- a/VariantEncoding.md +++ b/VariantEncoding.md @@ -77,7 +77,7 @@ The encoded metadata always starts with a header byte. 
``` 7 6 5 4 3 0 +-------+---+---+---------------+ -header | | | | version | +header | | R | | version | +-------+---+---+---------------+ ^ ^ | +-- sorted_strings @@ -87,6 +87,7 @@ The `version` is a 4-bit value that must always contain the value `1`. `sorted_strings` is a 1-bit value indicating whether dictionary strings are sorted and unique. `offset_size_minus_one` is a 2-bit value providing the number of bytes per dictionary size and offset field. The actual number of bytes, `offset_size`, is `offset_size_minus_one + 1`. +Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers. The entire metadata is encoded as the following diagram shows: ``` @@ -129,7 +130,7 @@ The grammar for encoded metadata is as follows metadata:
header: 1 byte ( | << 4 | ( << 6)) version: a 4-bit version ID. Currently, must always contain the value 1 -sorted_strings: a 1-bit value indicating whether metadata strings are sorted +sorted_strings: a 1-bit value indicating whether dictionary strings are sorted and unique offset_size_minus_one: 2-bit value providing the number of bytes per dictionary size and offset field. dictionary_size: `offset_size` bytes. unsigned little-endian value indicating the number of strings in the dictionary dictionary: * @@ -195,7 +196,7 @@ When `basic_type` is `2`, `value_header` is made up of `field_offset_size_minus_ ``` 5 4 3 2 1 0 +---+---+-------+-------+ -value_header | | | | | +value_header | R | | | | +---+---+-------+-------+ ^ ^ ^ | | +-- field_offset_size_minus_one @@ -206,6 +207,7 @@ value_header | | | | | The actual number of bytes is computed as `field_offset_size_minus_one + 1` and `field_id_size_minus_one + 1`. `is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements. If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used. +Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers. #### Value Header for Array (`basic_type`=3) @@ -213,7 +215,7 @@ When `basic_type` is `3`, `value_header` is made up of `field_offset_size_minus_ ``` 5 3 2 1 0 +-----------+---+-------+ -value_header | | | | +value_header | RRR | | | +-----------+---+-------+ ^ ^ | +-- field_offset_size_minus_one @@ -223,6 +225,7 @@ value_header | | | | The actual number of bytes is computed as `field_offset_size_minus_one + 1`. `is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements. If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used. +Bits 5-3 (marked `RRR`) are reserved; they must be set to 0 by writers and ignored by readers. 
### Value Data @@ -388,7 +391,7 @@ It is valid for an implementation to use a larger value than necessary for any o The "short string" basic type may be used as an optimization to fold string length into the type byte for strings less than 64 bytes. It is semantically identical to the "string" primitive type. -The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(val)) + 1`. +The Decimal type contains a scale, but no precision. The implied precision of a decimal value is `floor(log_10(|val|)) + 1` (and `1` when `val` is `0`). Note: Decimal values in the Variant binary encoding use little-endian byte order for the unscaled value. This differs from Parquet's DECIMAL logical type which uses big-endian @@ -411,10 +414,10 @@ two's complement encoding for `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` physical t | NullType | null | `0` | UNKNOWN | none | | Boolean | boolean (True) | `1` | BOOLEAN | none | | Boolean | boolean (False) | `2` | BOOLEAN | none | -| Exact Numeric | int8 | `3` | INT(8, signed) | 1 byte | -| Exact Numeric | int16 | `4` | INT(16, signed) | 2 byte little-endian | -| Exact Numeric | int32 | `5` | INT(32, signed) | 4 byte little-endian | -| Exact Numeric | int64 | `6` | INT(64, signed) | 8 byte little-endian | +| Exact Numeric | int8 | `3` | INT(8, true) | 1 byte | +| Exact Numeric | int16 | `4` | INT(16, true) | 2 byte little-endian | +| Exact Numeric | int32 | `5` | INT(32, true) | 4 byte little-endian | +| Exact Numeric | int64 | `6` | INT(64, true) | 8 byte little-endian | | Double | double | `7` | DOUBLE | IEEE little-endian | | Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | | Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | diff --git a/VariantShredding.md b/VariantShredding.md index 
4f7d61423..fe579ed0c 100644 --- a/VariantShredding.md +++ b/VariantShredding.md @@ -85,8 +85,8 @@ Shredded values must use the following Parquet types: | Variant Type | Parquet Physical Type | Parquet Logical Type | |-----------------------------|-----------------------------------|--------------------------| | boolean | BOOLEAN | | -| int8 | INT32 | INT(8, signed=true) | -| int16 | INT32 | INT(16, signed=true) | +| int8 | INT32 | INT(8, true) | +| int16 | INT32 | INT(16, true) | | int32 | INT32 | | | int64 | INT64 | | | float | FLOAT | | @@ -100,8 +100,8 @@ Shredded values must use the following Parquet types: | timestamptz(9) | INT64 | TIMESTAMP(true, NANOS) | | timestampntz(6) | INT64 | TIMESTAMP(false, MICROS) | | timestampntz(9) | INT64 | TIMESTAMP(false, NANOS) | -| binary | BINARY | | -| string | BINARY | STRING | +| binary | BYTE_ARRAY | | +| string | BYTE_ARRAY | STRING | | uuid | FIXED_LEN_BYTE_ARRAY[len=16] | UUID | | array | GROUP; see Arrays below | LIST | | object | GROUP; see Objects below | | @@ -206,7 +206,7 @@ The table below shows how the series of objects in the first column would be sto |-----------------------------------------------------------------------------------|-----------------------------------|---------------|--------------------------------|--------------------------------------|------------------------------|------------------------------------|----------------------------------------------------------------------------| | `{"event_type": "noop", "event_ts": 1729794114937}` | null | non-null | null | `noop` | null | 1729794114937 | Fully shredded object | | `{"event_type": "login", "event_ts": 1729794146402, "email": "user@example.com"}` | `{"email": "user@example.com"}` | non-null | null | `login` | null | 1729794146402 | Partially shredded object | -| `{"error_msg": "malformed: ..."}` | `{"error_msg", "malformed: ..."}` | non-null | null | null | null | null | Object with all shredded fields missing | +| `{"error_msg": 
"malformed: ..."}` | `{"error_msg": "malformed: ..."}` | non-null | null | null | null | null | Object with all shredded fields missing | | `"malformed: not an object"` | `malformed: not an object` | null | | | | | Not an object (stored as Variant string) | | `{"event_ts": 1729794240241, "click": "_button"}` | `{"click": "_button"}` | non-null | null | null | null | 1729794240241 | Field `event_type` is missing | | `{"event_type": null, "event_ts": 1729794954163}` | null | non-null | `00` (field exists, is null) | null | null | 1729794954163 | Field `event_type` is present and is null | @@ -334,7 +334,7 @@ def construct_variant(metadata: Metadata, value: Variant, typed_value: Any) -> V # value is missing return None -def primitive_to_variant(typed_value: Any): Variant: +def primitive_to_variant(typed_value: Any) -> Variant: if isinstance(typed_value, int): return VariantInteger(typed_value) elif isinstance(typed_value, str): diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index a88203165..289172285 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -1201,8 +1201,8 @@ struct PageLocation { 1: required i64 offset /** - * Size of the page, including header. Sum of compressed_page_size and header - * length + * Size of the page, including header. Equal to the sum of the page's + * PageHeader.compressed_page_size and the size of the serialized PageHeader. */ 2: required i32 compressed_page_size @@ -1260,7 +1260,7 @@ struct ColumnIndex { * Two lists containing lower and upper bounds for the values of each page * determined by the ColumnOrder of the column. These may be the actual * minimum and maximum values found on a page, but can also be (more compact) - * values that do not exist on a page. For example, instead of storing ""Blart + * values that do not exist on a page. For example, instead of storing "Blart * Versenwald III", a writer may set min_values[i]="B", max_values[i]="C". 
* Such more compact values must still be valid values within the column's * logical type. Readers must make sure that list entries are populated before From 4970764ad148671b576c1af6c81fce6db14e0e84 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Isma=C3=ABl=20Mej=C3=ADa?= Date: Sat, 23 May 2026 16:01:19 +0200 Subject: [PATCH 3/4] Fix additional typos, grammar, invalid HTML, and consistency issues Round 3 of specification cleanup. 28 minor fixes across 8 files: CONTRIBUTING.md (7 typos): docuemnt, an prototype, demostrate, interopability, libaries, highlighed, compatiblity Encryption.md (6): - 'reflects the identity' -> 'reflect the identity' (plural subject) - explictly -> explicitly - Data/Dictionary PageHeader -> Data/Dictionary Page Header (spacing, same as previous fix in section 4.1, second occurrence in section 4.13) - Removed double space after 'right after' - 'the the FileMetaData' -> 'the FileMetaData' - Smart quotes 'PAR1' -> ASCII "PAR1" for magic-bytes literal Encodings.md (1): 'at at time' -> 'at a time' PageIndex.md (1): Added missing terminal period after parquet.thrift link parquet.thrift (4): - 'a element' -> 'an element' (DataPageHeaderV2.num_nulls comment) - Terminal periods after 'It was never used' and 'use PLAIN instead' - Rewrote BIT_PACKED comment to clarify it is superseded by RLE and cross-reference Encodings.md Compression.md (1): Removed double space before [Brotli] link LogicalTypes.md (7): - Terminal periods after INT(8, true) and INT(8, false) paragraphs - Removed invalid from three logical-type tables ( does not accept colspan; the intended colspan is already on the header row) - Made 'precision' consistent with backticked identifier style - 'NANs' -> 'NaNs' - 'Despite there is no' -> 'Although there is no' (same fix as round 2 in a different paragraph) - 'In case of not present' -> 'If not present' VariantEncoding.md (1): Hyphenated '1 byte', '2 byte', '4 byte', '8 byte' to '1-byte', etc. 
in primitive-types table (character-count-neutral edit, column widths preserved) --- CONTRIBUTING.md | 12 ++++++------ Compression.md | 2 +- Encodings.md | 2 +- Encryption.md | 16 ++++++++-------- LogicalTypes.md | 16 ++++++++-------- PageIndex.md | 2 +- VariantEncoding.md | 20 ++++++++++---------- src/main/thrift/parquet.thrift | 9 +++++---- 8 files changed, 40 insertions(+), 39 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d6049a887..083bd3ce5 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -45,8 +45,8 @@ The general steps for adding features to the format are as follows: This phase starts with a discussion of changes on the developer mailing list (dev@parquet.apache.org). Depending on the scope and goals of the feature the it can be useful to provide additional artifacts as part of a discussion. The - artifacts can include a design docuemnt, a draft pull request to make the - discussion concrete and/or an prototype implementation to demostrate the + artifacts can include a design document, a draft pull request to make the + discussion concrete and/or a prototype implementation to demonstrate the viability of implementation. This step is complete when there is lazy consensus. Part of the consensus is whether it is sufficient to provide two working implementations as outlined in step 2, or if demonstration of the @@ -58,7 +58,7 @@ The general steps for adding features to the format are as follows: 2. Completeness: The goal of this phase is to ensure the feature is viable, there is no ambiguity in its specification by demonstrating compatibility between implementations. Once a change has lazy consensus, two - implementations of the feature demonstrating interopability must also be + implementations of the feature demonstrating interoperability must also be provided. One implementation MUST be [`parquet-java`](http://github.com/apache/parquet-java). 
It is preferred that the second implementation be @@ -154,7 +154,7 @@ recommendations for managing features: 2. Forward compatible features/changes may be enabled and used by default in implementations once the parquet-format containing those changes has been formally released. For features that may pose a significant performance - regression to older format readers, libaries should consider delaying default + regression to older format readers, libraries should consider delaying default enablement until 1 year after the release of the parquet-java implementation that contains the feature implementation. @@ -162,7 +162,7 @@ recommendations for managing features: until 2 years after the parquet-java implementation containing the feature is released. It is recommended that changing the default value for a forward incompatible feature flag should be clearly advertised to consumers (e.g. via - a major version release if using Semantic Versioning, or highlighed in + a major version release if using Semantic Versioning, or highlighted in release notes). For forward compatible changes which have a high chance of performance @@ -174,7 +174,7 @@ the same timelines as `parquet-java`. Parquet-java will wait to enable features by default until the most conservative timelines outlined above have been exceeded. This timeline is an attempt to balance ensuring new features make their way into the ecosystem and avoiding -breaking compatiblity for readers that are slower to adopt new standards. We +breaking compatibility for readers that are slower to adopt new standards. We encourage earlier adoption of new features when an organization using Parquet can guarantee that all readers of the parquet files they produce can read a new feature. 
diff --git a/Compression.md b/Compression.md index be3b70988..41341a696 100644 --- a/Compression.md +++ b/Compression.md @@ -72,7 +72,7 @@ A codec based on or interoperable with the A codec based on the Brotli format defined by [RFC 7932](https://tools.ietf.org/html/rfc7932). If any ambiguity arises when implementing this format, the implementation -provided by the [Brotli compression library](https://github.com/google/brotli) +provided by the [Brotli compression library](https://github.com/google/brotli) is authoritative. ### LZ4 diff --git a/Encodings.md b/Encodings.md index eb83c8c35..a2698017c 100644 --- a/Encodings.md +++ b/Encodings.md @@ -131,7 +131,7 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex ``` The reason for this packing order is to have fewer word-boundaries on little-endian hardware - when deserializing more than one byte at at time. This is because 4 bytes can be read into a + when deserializing more than one byte at a time. This is because 4 bytes can be read into a 32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by shifting and ORing with a mask. (to make this optimization work on a big-endian machine, you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding) diff --git a/Encryption.md b/Encryption.md index 9d43315bb..1839ac04f 100644 --- a/Encryption.md +++ b/Encryption.md @@ -209,7 +209,7 @@ it can't prevent replacement of one ciphertext with another (encrypted with the Parquet modular encryption leverages AADs to protect against swapping ciphertext modules (encrypted with AES GCM) inside a file or between files. Parquet can also protect against swapping full files - for example, replacement of a file with an old version, or replacement of one table -partition with another. AADs are built to reflects the identity of a file and of the modules +partition with another. 
AADs are built to reflect the identity of a file and of the modules inside the file. Parquet constructs a module AAD from two components: an optional AAD prefix - a string provided @@ -227,7 +227,7 @@ data set (table). This string is optionally passed by a writer upon file creatio the AAD prefix is stored in an `aad_prefix` field in the file, and is made available to the readers. This field is not encrypted. If a user is concerned about keeping the file identity inside the file, the writer code can explicitly request Parquet not to store the AAD prefix. Then the aad_prefix field -will be empty; AAD prefixes must be fully managed by the caller code and supplied explictly to Parquet +will be empty; AAD prefixes must be fully managed by the caller code and supplied explicitly to Parquet readers for each file. The protection against swapping full files is optional. It is not enabled by default because @@ -277,8 +277,8 @@ The following module types are defined: | ColumnMetaData | yes | yes (1) | yes | yes | no | | Data Page | yes | yes (2) | yes | yes | yes | | Dictionary Page | yes | yes (3) | yes | yes | no | -| Data PageHeader | yes | yes (4) | yes | yes | yes | -| Dictionary PageHeader| yes | yes (5) | yes | yes | no | +| Data Page Header | yes | yes (4) | yes | yes | yes | +| Dictionary Page Header| yes | yes (5) | yes | yes | no | | ColumnIndex | yes | yes (6) | yes | yes | no | | OffsetIndex | yes | yes (7) | yes | yes | no | | BloomFilter Header | yes | yes (8) | yes | yes | no | @@ -440,7 +440,7 @@ little endian integer, followed by a final magic string, "PARE". The same magic written at the beginning of the file (offset 0). Parquet readers start file parsing by reading and checking the magic string. 
Therefore, the encrypted footer mode uses a new magic string ("PARE") in order to instruct readers to look for a file crypto metadata -before the footer - and also to immediately inform legacy readers (expecting ‘PAR1’ +before the footer - and also to immediately inform legacy readers (expecting "PAR1" bytes) that they can’t parse this file. ```c @@ -491,14 +491,14 @@ The plaintext footer is signed in order to prevent tampering with the structure with the AES GCM algorithm - using a footer signing key, and an AAD constructed according to the instructions of the section 4.4. Only the nonce and GCM tag are stored in the file – as a 28-byte -fixed-length array, written right after the footer itself. The ciphertext is not stored, +fixed-length array, written right after the footer itself. The ciphertext is not stored, because it is not required for footer integrity verification by readers. | nonce (12 bytes) | tag (16 bytes) | |------------------|-----------------| -The plaintext footer mode sets the following fields in the the FileMetaData structure: +The plaintext footer mode sets the following fields in the FileMetaData structure: ```c struct FileMetaData { @@ -523,7 +523,7 @@ The 28-byte footer signature is written after the plaintext footer, followed by that contains the combined length of the footer and its signature. A final magic string, "PAR1", is written at the end of the file. The same magic string is written at the beginning of the file (offset 0). The magic bytes -for plaintext footer mode are ‘PAR1’ to allow legacy readers to read projections of the file +for plaintext footer mode are "PAR1" to allow legacy readers to read projections of the file that do not include encrypted columns. 
![File Layout - Encrypted footer](doc/images/FileLayoutEncryptionPF.png) diff --git a/LogicalTypes.md b/LogicalTypes.md index 8a15c67d9..fd54d1dcd 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -97,7 +97,7 @@ The sort order used for `UUID` values is unsigned byte-wise comparison. The annotation has two parameters: bit width and sign. Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`. For signed integers, the second parameter should be `true`, -for example, a signed integer with bit width of 8 is defined as `INT(8, true)` +for example, a signed integer with bit width of 8 is defined as `INT(8, true)`. Implementations may use these annotations to produce smaller in-memory representations when reading data. @@ -120,7 +120,7 @@ along with a maximum number of bits in the stored value. The annotation has two parameters: bit width and sign. Allowed bit width values are `8`, `16`, `32`, `64`, and sign can be `true` or `false`. In case of unsigned integers, the second parameter should be `false`, -for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)` +for example, an unsigned integer with bit width of 8 is defined as `INT(8, false)`. Implementations may use these annotations to produce smaller in-memory representations when reading data. @@ -166,7 +166,7 @@ unsigned integers with 8, 16, 32, or 64 bit width. *Forward compatibility:* - + @@ -227,7 +227,7 @@ integer. A precision too large for the underlying type (see below) is an error. * `int32`: for 1 <= precision <= 9 * `int64`: for 1 <= precision <= 18; precision < 10 will produce a warning -* `fixed_len_byte_array`: precision is limited by the array size. Length `n` +* `fixed_len_byte_array`: `precision` is limited by the array size. Length `n` can store <= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits * `byte_array`: `precision` is not limited, but is required. The minimum number of bytes to store the unscaled value should be used. 
@@ -316,7 +316,7 @@ as shown below. *Forward compatibility:*
LogicalType ConvertedType
- + @@ -476,7 +476,7 @@ type counterpart, it must annotate an `int64`. logical type that is UTC normalized and has `MICROS` precision. Like the logical type counterpart, it must annotate an `int64`. -Despite there is no exact corresponding ConvertedType for local timestamp semantic, +Although there is no exact corresponding ConvertedType for local timestamp semantic, in order to support forward compatibility with those libraries, which annotated their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotation, Parquet writer implementation *must* annotate local timestamps with legacy annotations too, @@ -492,7 +492,7 @@ as shown below. *Forward compatibility:*
LogicalType ConvertedType
- + @@ -836,7 +836,7 @@ to values. `MAP` must annotate a 3-level structure: field of the repeated `key_value` group. * The `value` field encodes the map's value type and repetition. This field can be `required`, `optional`, or omitted. It must always be the second field of - the repeated `key_value` group if present. In case of not present, it can be + the repeated `key_value` group if present. If not present, it can be represented as a map with all null values or as a set of keys. The following example demonstrates the type for a non-null map from strings to diff --git a/PageIndex.md b/PageIndex.md index 60a93ea40..cb184e428 100644 --- a/PageIndex.md +++ b/PageIndex.md @@ -23,7 +23,7 @@ In Parquet, a *page index* is optional metadata for a ColumnChunk, containing statistics for DataPages that can be used to skip those pages when scanning in ordered and unordered columns. The page index is stored using the OffsetIndex and ColumnIndex structures, -defined in [`parquet.thrift`](src/main/thrift/parquet.thrift) +defined in [`parquet.thrift`](src/main/thrift/parquet.thrift). 
## Problem Statement In previous versions of the format, Statistics are stored for ColumnChunks in diff --git a/VariantEncoding.md b/VariantEncoding.md index fc425d72e..e566c7535 100644 --- a/VariantEncoding.md +++ b/VariantEncoding.md @@ -414,20 +414,20 @@ two's complement encoding for `BYTE_ARRAY` and `FIXED_LEN_BYTE_ARRAY` physical t | NullType | null | `0` | UNKNOWN | none | | Boolean | boolean (True) | `1` | BOOLEAN | none | | Boolean | boolean (False) | `2` | BOOLEAN | none | -| Exact Numeric | int8 | `3` | INT(8, true) | 1 byte | -| Exact Numeric | int16 | `4` | INT(16, true) | 2 byte little-endian | -| Exact Numeric | int32 | `5` | INT(32, true) | 4 byte little-endian | -| Exact Numeric | int64 | `6` | INT(64, true) | 8 byte little-endian | +| Exact Numeric | int8 | `3` | INT(8, true) | 1-byte | +| Exact Numeric | int16 | `4` | INT(16, true) | 2-byte little-endian | +| Exact Numeric | int32 | `5` | INT(32, true) | 4-byte little-endian | +| Exact Numeric | int64 | `6` | INT(64, true) | 8-byte little-endian | | Double | double | `7` | DOUBLE | IEEE little-endian | -| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | -| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | -| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | -| Date | date | `11` | DATE | 4 byte little-endian | +| Exact Numeric | decimal4 | `8` | DECIMAL(precision, scale) | 1-byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | +| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1-byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | +| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 
1-byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) | +| Date | date | `11` | DATE | 4-byte little-endian | | Timestamp | timestamp | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian | | TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | Float | float | `14` | FLOAT | IEEE little-endian | -| Binary | binary | `15` | BYTE_ARRAY | 4 byte little-endian size, followed by bytes | -| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes | +| Binary | binary | `15` | BYTE_ARRAY | 4-byte little-endian size, followed by bytes | +| String | string | `16` | STRING | 4-byte little-endian size, followed by UTF-8 encoded bytes | | TimeNTZ | time without time zone | `17` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | Timestamp | timestamp with time zone | `18` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | | TimestampNTZ | timestamp without time zone | `19` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian | diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 289172285..09c4e4095 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -504,7 +504,7 @@ union LogicalType { } /** - * Represents a element inside a schema definition. + * Represents an element inside a schema definition. * - if it is a group (inner node) then type is undefined and num_children is defined * - if it is a primitive type (leaf) then type is defined and num_children is undefined * the nodes are listed in depth first traversal order. @@ -583,7 +583,7 @@ enum Encoding { PLAIN = 0; /** Group VarInt encoding for INT32/INT64. - * This encoding is deprecated. It was never used + * This encoding is deprecated. It was never used. */ // GROUP_VAR_INT = 1; @@ -591,7 +591,7 @@ enum Encoding { * Deprecated: Dictionary encoding. 
The values in the dictionary are encoded in the * plain type. * in a data page use RLE_DICTIONARY instead. - * in a Dictionary page use PLAIN instead + * in a Dictionary page use PLAIN instead. */ PLAIN_DICTIONARY = 2; @@ -600,8 +600,9 @@ enum Encoding { */ RLE = 3; - /** Bit packed encoding. This can only be used if the data has a known max + /** Deprecated: Bit packed encoding. This can only be used if the data has a known max * width. Usable for definition/repetition levels encoding. + * Superseded by RLE (which is a hybrid of RLE and bit packing); see Encodings.md. */ BIT_PACKED = 4; From 1c1089a4229b511e75f15c5eb38d010702321817 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Isma=C3=ABl=20Mej=C3=ADa?= Date: Sat, 23 May 2026 16:24:38 +0200 Subject: [PATCH 4/4] Fix additional typos, grammar, hyphenation, and consistency issues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round 4 of specification cleanup. 52 minor fixes across 13 files. parquet.thrift (6): - L676: 'a OffsetIndex' -> 'an OffsetIndex' - L427/446/452: 'edges interpolation' -> 'edge interpolation' (Geospatial doc comments; align with thrift enum/struct naming) - L44: 'frameworks(e.g. hive, pig)' -> 'frameworks (e.g. Hive, Pig)' (missing space + capitalize proper nouns) - L816: 'GZip' -> 'GZIP' (match Compression.md heading and enum casing) Geospatial.md (8): - L31: Missing space before parenthesis: 'OGC(' -> 'OGC (' - L50: 'well known' -> 'well-known' (compound adjective, 2 occurrences) - L61: ', and is also commonly used' -> '. 
It is also commonly used' (comma splice between independent clauses) - L97: 'Y Values' -> 'Y values' (mid-sentence; sibling 'X values' lowercase) - L137: Added 'The' before bullet sentence for grammaticality - L157: Markdown heading: bare line -> proper '# Coordinate Axis Order' - L72/73: 'edges interpolation' -> 'edge interpolation' LogicalTypes.md (11): - 'to describes' -> 'that describes' - DECIMAL scale/precision: 'integer literal annotation' wording made precise ('non-negative integer' / 'positive integer') - TIME L302-303: 'annotation' -> 'annotations'; 'implementation' -> 'implementations' (plural agreement) - L302/353/360: Oxford comma added before 'or NANOS' (consistency with earlier paragraph) - L449/460/478/479: 'can not' -> 'cannot' (consistent with rest of repo) - L608/623: 'edges interpolation' -> 'edge interpolation' README.md (8): - L191: 'encoded values for the data page is' -> 'are' (plural subject) - L193/195: 'it would always' / 'if encoded, it will' -> 'they would' / 'they will' (refer to plural repetition/definition levels) - L137-140: '1 bit' / '32 bit' / '64 bit' / '96 bit' / 'fixed length' -> hyphenated forms (compound adjectives) - L225: 'GZip' -> 'GZIP' (same as thrift) - L240: 'rc or avro files' -> 'RCFile or Avro files' (proper nouns) - L243: Removed trailing period from heading - L257: 'fine grained' -> 'fine-grained' (compound adjective) CONTRIBUTING.md (6): - L46: Removed extra 'the'; added comma after 'feature' - L46: 'features desirability' -> "a feature's desirability" (possessive apostrophe) - L53: 'After the first two steps are complete' + 'After the vote passes': added missing commas after introductory clauses - L58: 'an external dependencies' -> 'an external dependency' (agreement) - L70: Comma splice between independent clauses -> semicolon - L90: Removed trailing period from heading BinaryProtocolExtensions.md (10): - L29/53/56/79: 'FileMetadata' -> 'FileMetaData' (4 occurrences; match the canonical thrift struct name used 
elsewhere in this file) - L29: 'implementers which MUST' -> 'implementers who MUST' (who for people) - L70: 'extension shared publicly' -> 'extension is shared publicly' (missing copula) - L53/55/72/77/80: 'Flatbuffers' / 'flatbuffer' -> 'FlatBuffers' (5 occurrences; project's official capitalization, already used twice elsewhere in the same file) Encodings.md (5): - L57: 'can not' -> 'cannot' - L72: '4 byte little endian' -> '4-byte little endian' (compound adjective; preserved 'little endian' as in surrounding text) - L134-135: '32 bit register' / '64 bit register' -> '32-bit' / '64-bit' - L178: 'max definition level as 3' -> 'max definition level was 3' (parallel construction with previous clause) - L175: 'RLE/Bit packed described above' -> 'RLE/Bit-Packed described above' (match section heading 'Bit-packed (Deprecated)') - L235: 'list of bit packed ints' -> 'list of bit-packed ints' Compression.md (2): - L51: 'provided by Google Snappy [library]' -> 'provided by the [Snappy compression library]' (match parallel construction used for GZIP, BROTLI, ZSTD sibling entries in the same file) - L61: Comma splice between independent clauses -> semicolon Encryption.md (7): - L82: Removed double space after 'pages and' - L250: Removed double space after 'files' - L289: Heading '## 5 File Format' -> '## 5. File Format' (period to match sibling headings ## 5.1, ## 5.2, ## 5.3) - L256-258: '2 byte short, little endian' -> '2-byte short, little-endian' (compound adjectives; 3 lines) - L396: 'from a secret data' -> 'from secret data' (mass noun; no article) - L412: '/** Column metadata for this chunk.. 
**/' -> 'this chunk' (two trailing dots is neither sentence end nor ellipsis) BloomFilter.md (4): - L268-270: 'multi-block bloom filter' / 'bloom filter header' / 'bloom filter bitset' / 'bloom filter bit set' -> normalized casing ('Bloom filter' as a proper noun, consistent with the rest of this file and with the thrift comments) and 'bit set' -> 'bitset' - L311: Added terminal period in '/** The size of bitset in bytes **/' - L317: Added terminal period in '/** The compression used in the Bloom filter **/' PageIndex.md (2): - L20: Title-case in heading: 'page index' -> 'Page Index' (proper noun; rest of heading already title-cased) - L40: 'one data page per the retrieved column' -> 'one data page per retrieved column' (article+'per' is ungrammatical) VariantShredding.md (2): - L297: Python bug 'for (name, field) in typed_value' -> 'for (name, field) in typed_value.items()' (iterating a dict yields keys only; the unpacking would fail at runtime — verified with 'python3 -c' that the original raises ValueError) - L151: Removed trailing space inside backticks of table header ('`typed_value `' -> '`typed_value`') VariantEncoding.md (3 lines, 8 hyphenations): - L145: '3 byte offsets' -> '3-byte offsets' (compound adjective; the same sentence already uses '1-byte', '2-byte', '4-byte' hyphenated) - L373-375: 'a 1 or 4 byte' / 'a 1, 2, 3 or 4 byte' -> '1- or 4-byte' / '1-, 2-, 3-, or 4-byte' (hyphenated number/unit compounds) - L386: '3 byte IDs' -> '3-byte IDs' - L387: 'one or four byte value' -> 'one- or four-byte value' Thrift validation passes after these edits (only pre-existing doctext warnings on lines 18, 22, 588 remain — unrelated to any fix). 
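Illustration for the VariantShredding.md L297 item above. This is a minimal sketch, not the spec's pseudocode: the dict contents and the two helper names are assumptions chosen only to show why iterating the dict directly fails and why `.items()` is the intended form.

```python
# Hypothetical stand-in for the shredded object's fields; the helper
# names below are illustrative and do not appear in VariantShredding.md.
typed_value = {"event_type": "noop", "event_ts": 1729794114937}


def iterate_without_items(fields):
    # Iterating a dict yields its keys only, so each key string is
    # unpacked character by character into (name, field).
    return [(name, field) for (name, field) in fields]


def iterate_with_items(fields):
    # .items() yields (key, value) pairs, which is what the loop expects.
    return [(name, field) for (name, field) in fields.items()]


try:
    iterate_without_items(typed_value)
    original_failed = False
except ValueError:
    # Raised because "event_type" has more than two characters to unpack.
    original_failed = True

pairs = iterate_with_items(typed_value)
```

Note that a key that happens to be exactly two characters long would unpack without raising, so the original form can also fail silently depending on the field names.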
--- BinaryProtocolExtensions.md | 22 +++++++++++----------- BloomFilter.md | 8 ++++---- CONTRIBUTING.md | 16 ++++++++-------- Compression.md | 4 ++-- Encodings.md | 12 ++++++------ Encryption.md | 16 ++++++++-------- Geospatial.md | 16 ++++++++-------- LogicalTypes.md | 24 ++++++++++++------------ PageIndex.md | 4 ++-- README.md | 24 ++++++++++++------------ VariantEncoding.md | 12 ++++++------ VariantShredding.md | 4 ++-- src/main/thrift/parquet.thrift | 12 ++++++------ 13 files changed, 87 insertions(+), 87 deletions(-) diff --git a/BinaryProtocolExtensions.md b/BinaryProtocolExtensions.md index e23d3328f..431e67404 100644 --- a/BinaryProtocolExtensions.md +++ b/BinaryProtocolExtensions.md @@ -26,11 +26,11 @@ The extension mechanism of the `binary` Thrift field-id `32767` has some desirab * The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift. * Extensions can be appended to existing Thrift serialized structs [without requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation (or changes to the thrift IDL). -Because only one field-id is reserved the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers which MUST put enough unique state in their extension bytes for disambiguation. This can be relatively easily achieved by adding a [UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) at the start or end of the extension bytes. The extension does not specify a disambiguation mechanism to allow more flexibility to implementers. +Because only one field-id is reserved the extension bytes themselves require disambiguation; otherwise readers will not be able to decode extensions safely. This is left to implementers who MUST put enough unique state in their extension bytes for disambiguation. 
This can be relatively easily achieved by adding a [UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) at the start or end of the extension bytes. The extension does not specify a disambiguation mechanism to allow more flexibility to implementers. Putting everything together in an example, if we would extend `FileMetaData` it would look like this on the wire. - N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field) + N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field) 4 bytes | 08 FF FF 01 (long form header for 32767: binary) 1-5 bytes | ULEB128(M) encoded size of the extension M bytes | extension bytes @@ -50,14 +50,14 @@ To illustrate the applicability of the extension mechanism we provide examples o ### Footer -A variant of `FileMetaData` encoded in Flatbuffers is introduced. This variant is more performant and can scale to very wide tables, something that current Thrift `FileMetaData` struggles with. +A variant of `FileMetaData` encoded in FlatBuffers is introduced. This variant is more performant and can scale to very wide tables, something that current Thrift `FileMetaData` struggles with. 
In its private form the footer of a Parquet file will look like so: - N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field) + N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field) 4 bytes | 08 FF FF 01 (long form header for 32767: binary) 1-5 bytes | ULEB128(K+28) encoded size of the extension - K bytes | Flatbuffers representation (v0) of FileMetaData + K bytes | FlatBuffers representation (v0) of FileMetaData 4 bytes | little-endian crc32(flatbuffer) 4 bytes | little-endian size(flatbuffer) 4 bytes | little-endian crc32(size(flatbuffer)) @@ -67,20 +67,20 @@ In its private form the footer of a Parquet file will look like so: some-UUID is some UUID picked for this extension and it is used throughout (possibly internal) experimentation. It is put at the end to allow detection of the extension when parsed in reverse. The little-endian sizes and crc32s are also to the end to facilitate efficient parsing the footer in reverse without requiring parsing the Thrift compact protocol that precedes it. -At some point the experiments conclude and the extension shared publicly with the community. The extension is proposed for inclusion to the standard with a migration plan to replace the existing `FileMetaData`. +At some point the experiments conclude and the extension is shared publicly with the community. The extension is proposed for inclusion to the standard with a migration plan to replace the existing `FileMetaData`. -The community reviews the proposal and (potentially) proposes changes to the Flatbuffers IDL representation. In addition, because this extension is a *replacement* of an existing struct, it must: +The community reviews the proposal and (potentially) proposes changes to the FlatBuffers IDL representation. In addition, because this extension is a *replacement* of an existing struct, it must: 1. have some way of being extended in the future much like what it replaces. 
Because the extension mechanism only allows for a single extension, without this in place we cannot have footer extensions during the migration.
 2. consider its intermediate form where both the **Thrift** `FileMetaData` and the **FlatBuffers** `FileMetaData` will be present.
 3. consider its final form where the long form header for `32767: binary` may not be present.
-Once the design is ratified the new `FileMetaData` encoding is made final with the following migration plan. For the next N years writers will write both the Thrift and the flatbuffer `FileMetaData`. It will look much like its private form except the flatbuffer IDL may be different:
+Once the design is ratified the new `FileMetaData` encoding is made final with the following migration plan. For the next N years writers will write both the Thrift and the FlatBuffers `FileMetaData`. It will look much like its private form except the FlatBuffers IDL may be different:
- N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift stop field)
+ N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift stop field)
 4 bytes | 08 FF FF 01 (long form header for 32767: binary)
 1-5 bytes | ULEB128(K+28) encoded size of the extension
- K bytes | Flatbuffers representation (v1) of FileMetaData
+ K bytes | FlatBuffers representation (v1) of FileMetaData
 4 bytes | little-endian crc32(flatbuffer)
 4 bytes | little-endian size(flatbuffer)
 4 bytes | little-endian crc32(size(flatbuffer))
@@ -90,7 +90,7 @@ Once the design is ratified the new `FileMetaData` encoding is made final with t
 After the migration period, the end of the Parquet file may look like this:
- K bytes | Flatbuffers representation (v1) of FileMetaData
+ K bytes | FlatBuffers representation (v1) of FileMetaData
 4 bytes | little-endian crc32(flatbuffer)
 4 bytes | little-endian size(flatbuffer)
 4 bytes | little-endian crc32(size(flatbuffer))
diff --git a/BloomFilter.md b/BloomFilter.md
index c8824de55..3b50e54f7 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -266,8 +266,8 @@ false positive rates:
 #### File Format
 Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block
-bloom filter consists of the bloom filter header followed by the bloom filter bitset. The bloom filter
-header encodes the size of the bloom filter bit set in bytes that is used to read the bitset.
+Bloom filter consists of the Bloom filter header followed by the Bloom filter bitset. The Bloom filter
+header encodes the size of the Bloom filter bitset in bytes that is used to read the bitset.
 Here are the Bloom filter definitions in thrift:
@@ -308,13 +308,13 @@ union BloomFilterCompression {
 * and followed by its bitset.
 **/
 struct BloomFilterHeader {
- /** The size of bitset in bytes **/
+ /** The size of bitset in bytes. **/
 1: required i32 numBytes;
 /** The algorithm for setting bits. **/
 2: required BloomFilterAlgorithm algorithm;
 /** The hash function used for Bloom filter. **/
 3: required BloomFilterHash hash;
- /** The compression used in the Bloom filter **/
+ /** The compression used in the Bloom filter. **/
 4: required BloomFilterCompression compression;
 }
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 083bd3ce5..c1d1750b2 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -43,8 +43,8 @@ The general steps for adding features to the format are as follows:
 1. Design/scoping: The goal of this phase is to identify design goals of a feature and provide some demonstration that the feature meets those goals. This phase starts with a discussion of changes on the developer mailing list
- (dev@parquet.apache.org). Depending on the scope and goals of the feature the
- it can be useful to provide additional artifacts as part of a discussion. The
+ (dev@parquet.apache.org). Depending on the scope and goals of the feature, it
+ can be useful to provide additional artifacts as part of a discussion. The
 artifacts can include a design document, a draft pull request to make the discussion concrete and/or a prototype implementation to demonstrate the viability of implementation. This step is complete when there is lazy
@@ -73,21 +73,21 @@ The general steps for adding features to the format are as follows:
 fit for inclusion (for example, they were submitted as a pull request against the target repository and committers gave positive reviews). Reports on the benefits from closed source implementations are welcome and can help lend
- weight to features desirability but are not sufficient for acceptance of a
+ weight to a feature's desirability but are not sufficient for acceptance of a
 new feature.
 Unless otherwise discussed, it is expected the implementations will be developed from their respective main branch (i.e. backporting is not required), to demonstrate that the feature is mergeable to its implementation.
-3. Ratification: After the first two steps are complete a formal vote is held on
+3. Ratification: After the first two steps are complete, a formal vote is held on
 dev@parquet.apache.org to officially ratify the feature. After the vote
- passes the format change is merged into the `parquet-format` repository and
+ passes, the format change is merged into the `parquet-format` repository and
 it is expected the changes from step 2 will also be merged soon after (implementations should not be merged until the addition has been merged to `parquet-format`).
-#### General guidelines/preferences on additions.
+#### General guidelines/preferences on additions
 1. To the greatest extent possible changes should have an option for forward compatibility (old readers can still read files). The [compatibility and
@@ -95,13 +95,13 @@ demonstrate that the feature is mergeable to its implementation.
 provides more details on expectations for changes that break compatibility.
 2. New encodings should be fully specified in this repository and not
- rely on an external dependencies for implementation (i.e. `parquet-format` is
+ rely on an external dependency for implementation (i.e. `parquet-format` is
 the source of truth for the encoding). If it does require an external dependency, then the external dependency must have its own specification separate from implementation.
 3. New compression mechanisms should have a pure Java implementation that can be
- used as a dependency in `parquet-java`, exceptions may be
+ used as a dependency in `parquet-java`; exceptions may be
 discussed on the mailing list to see if a non-native Java implementation is acceptable.
diff --git a/Compression.md b/Compression.md
index 41341a696..c397476ab 100644
--- a/Compression.md
+++ b/Compression.md
@@ -48,7 +48,7 @@ No-op codec. Data is left uncompressed.
 A codec based on the [Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt). If any ambiguity arises when implementing this format, the implementation
-provided by Google Snappy [library](https://github.com/google/snappy/)
+provided by the [Snappy compression library](https://github.com/google/snappy/)
 is authoritative.
 ### GZIP
@@ -58,7 +58,7 @@ formats) defined by [RFC 1952](https://tools.ietf.org/html/rfc1952). If any
 ambiguity arises when implementing this format, the implementation provided by the [zlib compression library](https://zlib.net/) is authoritative.
-Readers should support reading pages containing multiple GZIP members, however,
+Readers should support reading pages containing multiple GZIP members; however,
 as this has historically not been supported by all implementations, it is recommended that writers refrain from creating such pages by default for better interoperability.
diff --git a/Encodings.md b/Encodings.md
index a2698017c..00d178e0a 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -54,7 +54,7 @@ Supported Types: all
 This is the plain encoding that must be supported for types. It is intended to be the simplest encoding. Values are encoded back to back.
-The plain encoding is used whenever a more efficient encoding can not be used. It
+The plain encoding is used whenever a more efficient encoding cannot be used. It
 stores the data in the following format:
 - BOOLEAN: bit-packed, LSB first (using the same packing scheme as the [RLE/bit-packing hybrid](#RLE) encoding)
@@ -69,7 +69,7 @@ stores the data in the following format:
 For native types, this outputs the data as little endian. Floating point types are encoded in IEEE.
-For the byte array type, it encodes the length as a 4 byte little
+For the byte array type, it encodes the length as a 4-byte little
 endian, followed by the bytes.
@@ -83,7 +83,7 @@ written first, before the data pages of the column chunk.
 Dictionary page format: the entries in the dictionary using the [plain](#PLAIN) encoding.
 Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32),
-followed by the values encoded using RLE/Bit packed described above (with the given bit width).
+followed by the values encoded using RLE/Bit-Packed described above (with the given bit width).
 Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY` in a data page and `PLAIN` in a dictionary page for new Parquet files.
@@ -132,7 +132,7 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
 The reason for this packing order is to have fewer word-boundaries on little-endian hardware when deserializing more than one byte at a time.
This is because 4 bytes can be read into a
- 32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by
+ 32-bit register (or 8 bytes into a 64-bit register) and values can be unpacked just by
 shifting and ORing with a mask. (to make this optimization work on a big-endian machine, you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding)
@@ -175,7 +175,7 @@ Whether prepending the four-byte `length` to the `encoded-data` is summarized in
 This is a bit-packed only encoding, which is deprecated; it has been replaced by the [RLE/bit-packing](#RLE) hybrid encoding. Each value is encoded back to back using a fixed width. There is no padding between values (except for the last byte, which is padded with 0s).
-For example, if the max repetition level was 3 (2 bits) and the max definition level as 3
+For example, if the max repetition level was 3 (2 bits) and the max definition level was 3
 (2 bits), to encode 30 values, we would have 30 * 2 = 60 bits = 8 bytes.
 This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is a superset of this implementation.
@@ -232,7 +232,7 @@ Each block contains
 * the min delta is a zigzag ULEB128 int (we compute a minimum as we need positive integers for bit packing)
 * the bitwidth of each miniblock is stored as a byte
- * each miniblock is a list of bit packed ints according to the bit width
+ * each miniblock is a list of bit-packed ints according to the bit width
 stored at the beginning of the block
 To encode a block, we will:
diff --git a/Encryption.md b/Encryption.md
index 1839ac04f..150beceee 100644
--- a/Encryption.md
+++ b/Encryption.md
@@ -79,7 +79,7 @@ in order to verify its integrity. New footer fields keep an information about the file encryption algorithm and the footer signing key.
For encrypted columns, the following modules are always encrypted, with the same column key:
-pages and page headers (both dictionary and data), column indexes, offset indexes, bloom filter
+pages and page headers (both dictionary and data), column indexes, offset indexes, bloom filter
 headers and bitsets.
 If the column key is different from the footer encryption key, the column metadata is serialized separately and encrypted with the column key. In this case, the column metadata is also
@@ -247,15 +247,15 @@ of all partition files (prefixes) from 0 to N-1.
 #### 4.4.2 AAD suffix
 The suffix part of a module AAD protects against module swapping inside a file. It also protects against
-module swapping between files - in situations when an encryption key is re-used in multiple files and the
+module swapping between files - in situations when an encryption key is re-used in multiple files and the
 writer has not provided a unique AAD prefix for each file.
 Unlike AAD prefix, a suffix is built internally by Parquet, by direct concatenation of the following parts:
 1. [All modules] internal file identifier - a random byte array generated for each file (implementation-defined length)
 2. [All modules] module type (1 byte)
-3. [All modules except footer] row group ordinal (2 byte short, little endian)
-4. [All modules except footer] column ordinal (2 byte short, little endian)
-5. [Data page and header only] page ordinal (2 byte short, little endian)
+3. [All modules except footer] row group ordinal (2-byte short, little-endian)
+4. [All modules except footer] column ordinal (2-byte short, little-endian)
+5. [Data page and header only] page ordinal (2-byte short, little-endian)
 The following module types are defined:
@@ -286,7 +286,7 @@ The following module types are defined:
-## 5 File Format
+## 5. File Format
 ### 5.1 Encrypted module serialization
 All modules, except column pages, are encrypted with the GCM cipher.
In the AES_GCM_V1 algorithm,
@@ -393,7 +393,7 @@ struct ColumnChunk {
 ### 5.3 Protection of sensitive metadata
 The Parquet file footer, and its nested structures, contain sensitive information - ranging
-from a secret data (column statistics) to other information that can be exploited by an
+from secret data (column statistics) to other information that can be exploited by an
 attacker (e.g. schema, num_values, key_value_metadata, encoding and crypto_metadata). This information is automatically protected when the footer and secret columns are encrypted with the same key. In other cases - when column(s) and the
@@ -409,7 +409,7 @@ field in the `ColumnChunk`.
 struct ColumnChunk {
 ...
- /** Column metadata for this chunk.. **/
+ /** Column metadata for this chunk **/
 3: optional ColumnMetaData meta_data
 ..
 /** Crypto metadata of encrypted columns **/
diff --git a/Geospatial.md b/Geospatial.md
index 6a6253b31..cc5958acc 100644
--- a/Geospatial.md
+++ b/Geospatial.md
@@ -28,7 +28,7 @@ The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
 Well-Known Binary (WKB) serializations (ISO variant supporting XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for Geographic information - Simple feature access - Part 1: Common architecture][sfa-part1],
-from [OGC(Open Geospatial Consortium)][ogc].
+from [OGC (Open Geospatial Consortium)][ogc].
 The version of the OGC standard first used here is 1.2.1, but future versions may also be used if the WKB representation remains wire-compatible.
@@ -47,7 +47,7 @@ in the order of longitude/latitude based on the WGS84 datum.
 Non-default CRS values are specified by any string that uniquely identifies a coordinate reference system associated with this type. To maximize interoperability, suggested (but not limited to) formats for CRS are:
 * `` - A complete CRS definition embedded directly using the [PROJJSON](https://proj.org/en/stable/specifications/projjson.html) specification.
Example for `OGC:CRS83`: `{"$schema": "https://proj.org/schemas/v0.7/projjson.schema.json","type": "GeographicCRS","name": "NAD83 (CRS83)","datum": {"type": "GeodeticReferenceFrame"...`
-* `:` - where `` represents some well known authorities, and `code` is the code used by the authority to identify the CRS. Examples are - `OGC:CRS84`, `OGC:CRS83`, `OGC:CRS27`, `EPSG:4326`, `EPSG:3857`, `IGNF:ATI`. See [https://spatialreference.org/](https://spatialreference.org/) for definitions of coordinate reference systems provided by some well known authorities.
+* `:` - where `` represents some well-known authorities, and `code` is the code used by the authority to identify the CRS. Examples are - `OGC:CRS84`, `OGC:CRS83`, `OGC:CRS27`, `EPSG:4326`, `EPSG:3857`, `IGNF:ATI`. See [https://spatialreference.org/](https://spatialreference.org/) for definitions of coordinate reference systems provided by some well-known authorities.
 * `srid:` - A reference using a [Spatial reference identifier (SRID)](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), where is the numeric SRID value. For example: `srid:0`.
 * `projjson:` - where refers to a key within the file key-value metadata, where CRS definition in [PROJJSON](https://proj.org/en/stable/specifications/projjson.html) format is stored.
@@ -58,7 +58,7 @@ by [-90, 90].
 ## Edge Interpolation Algorithm
-An algorithm for interpolating edges, and is one of the following values:
+An algorithm for interpolating edges. It is one of the following values:
 * `SPHERICAL`: edges are interpolated as geodesics on a sphere.
 * `VINCENTY`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
@@ -69,8 +69,8 @@ An algorithm for interpolating edges, and is one of the following values:
 # Logical Types
 Two geospatial logical type annotations are supported:
-* `GEOMETRY`: geospatial features in the WKB format with linear/planar edges interpolation. See [Geometry](LogicalTypes.md#geometry)
-* `GEOGRAPHY`: geospatial features in the WKB format with an explicit (non-linear/non-planar) edges interpolation algorithm. See [Geography](LogicalTypes.md#geography)
+* `GEOMETRY`: geospatial features in the WKB format with linear/planar edge interpolation. See [Geometry](LogicalTypes.md#geometry)
+* `GEOGRAPHY`: geospatial features in the WKB format with an explicit (non-linear/non-planar) edge interpolation algorithm. See [Geography](LogicalTypes.md#geography)
 # Statistics
@@ -94,7 +94,7 @@ fourth dimension. These values can be used as a linear reference value (e.g.,
 highway milepost value), a timestamp, or some other value as defined by the CRS.
 Bounding box is defined as the thrift struct below in the representation of
-min/max value pair of coordinates from each axis. Note that X and Y Values are
+min/max value pair of coordinates from each axis. Note that X and Y values are
 always present. Z and M are omitted for 2D geospatial instances.
 When calculating a bounding box, null or NaN values in a coordinate
@@ -134,7 +134,7 @@ column, or an empty list if they are not known.
 This is borrowed from [geometry_types of GeoParquet][geometry-types] except that values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
-Table below shows the most common geospatial types and their codes:
+The table below shows the most common geospatial types and their codes:
 | Type | XY | XYZ | XYM | XYZM |
 | :----------------- | :--- | :--- | :--- | :--- |
@@ -154,7 +154,7 @@ In addition, the following rules are applied:
 [geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
 [wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
-# Coordinate axis order
+# Coordinate Axis Order
 The axis order of the coordinates in WKB and bounding box stored in Parquet follows the de facto standard for axis order in WKB and is therefore always
diff --git a/LogicalTypes.md b/LogicalTypes.md
index fd54d1dcd..d951e1f04 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -219,8 +219,8 @@ scale stores the number of digits of that value that are to the right of the
 decimal point, and the precision stores the maximum number of digits supported in the unscaled value.
-If not specified, the scale is 0. Scale must be zero or a positive integer less
-than or equal to the precision. Precision is required and must be a non-zero positive
+If not specified, the scale is 0. Scale must be a non-negative integer less
+than or equal to the precision. Precision is required and must be a positive
 integer. A precision too large for the underlying type (see below) is an error.
 `DECIMAL` can be used to annotate the following types:
@@ -302,8 +302,8 @@ counterpart, it must annotate an `int64`.
Although there is no exact corresponding ConvertedType for local time semantic, in order to support forward compatibility with those libraries, which annotated
-their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotation,
-Parquet writer implementation *must* annotate local time with legacy annotations too,
+their local time with legacy `TIME_MICROS` and `TIME_MILLIS` annotations,
+Parquet writer implementations *must* annotate local time with legacy annotations too,
 as shown below.
 *Backward compatibility:*
@@ -359,7 +359,7 @@ time-line and such interpretations are allowed on purpose.
 The `TIMESTAMP` type has two type parameters:
 - `isAdjustedToUTC` must be either `true` or `false`.
-- `unit` must be one of `MILLIS`, `MICROS` or `NANOS`. This list is subject
+- `unit` must be one of `MILLIS`, `MICROS`, or `NANOS`. This list is subject
 to potential expansion in the future. Upon reading, unknown `unit`-s must be handled as unsupported features (rather than as errors in the data files).
@@ -449,7 +449,7 @@ limits and implementations may choose to only support a limited range.
 On the other hand, not every combination of year, month, day, hour, minute, second and subsecond values can be encoded into an `int64`. Most notably:
-- An arbitrary combination of timestamp fields can not be encoded as a single
+- An arbitrary combination of timestamp fields cannot be encoded as a single
 number if the values for some of the fields are outside of their normal range (where the "normal range" corresponds to everyday usage). For example, neither of the following can be represented in a timestamp:
@@ -460,7 +460,7 @@ second and subsecond values can be encoded into an `int64`. Most notably:
 - day = 29, month = 2, year = any non-leap year
 - Due to the range of the `int64` type, timestamps using the `NANOS` unit can only represent values between 1677-09-21 00:12:43 and 2262-04-11 23:47:16.
- Values outside of this range can not be represented with the `NANOS`
+ Values outside of this range cannot be represented with the `NANOS`
 unit. (Other precisions have similar limits but those are outside of the domain for practical everyday usage.)
@@ -478,8 +478,8 @@ type counterpart, it must annotate an `int64`.
 Although there is no exact corresponding ConvertedType for local timestamp semantic, in order to support forward compatibility with those libraries, which annotated
-their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotation,
-Parquet writer implementation *must* annotate local timestamps with legacy annotations too,
+their local timestamps with legacy `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` annotations,
+Parquet writer implementations *must* annotate local timestamps with legacy annotations too,
 as shown below.
 *Backward compatibility:*
@@ -608,7 +608,7 @@ optional group variant_shredded (VARIANT(1)) {
 ### GEOMETRY
 `GEOMETRY` is used for geospatial features in the Well-Known Binary (WKB) format
-with linear/planar edges interpolation. It must annotate a `BYTE_ARRAY`
+with linear/planar edge interpolation. It must annotate a `BYTE_ARRAY`
 primitive type. See [Geospatial.md](Geospatial.md) for more detail.
 The type has only one type parameter:
@@ -623,14 +623,14 @@ are found during reading, they must be ignored.
 ### GEOGRAPHY
 `GEOGRAPHY` is used for geospatial features in the WKB format with an explicit
-(non-linear/non-planar) edges interpolation algorithm. It must annotate a
+(non-linear/non-planar) edge interpolation algorithm. It must annotate a
 `BYTE_ARRAY` primitive type. See [Geospatial.md](Geospatial.md) for more detail.
 The type has two type parameters:
 - `crs`: An optional string value for CRS. It must be a geographic CRS, where longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90]. If unset, the CRS defaults to `"OGC:CRS84"`.
-- `algorithm`: An optional enum value to describes the edge interpolation
+- `algorithm`: An optional enum value that describes the edge interpolation
 algorithm. Supported values are: `SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`, `KARNEY`. If unset, the algorithm defaults to `SPHERICAL`.
diff --git a/PageIndex.md b/PageIndex.md
index cb184e428..58a8ecf06 100644
--- a/PageIndex.md
+++ b/PageIndex.md
@@ -17,7 +17,7 @@
 - under the License.
 -->
-# Parquet page index: Layout to Support Page Skipping
+# Parquet Page Index: Layout to Support Page Skipping
 In Parquet, a *page index* is optional metadata for a ColumnChunk, containing statistics for DataPages that can be used
@@ -37,7 +37,7 @@ data from disk.
 1. Make both range scans and point lookups I/O efficient by allowing direct access to pages based on their min and max values. In particular:
 * A single-row lookup in a row group based on the sort column of that row group
- will only read one data page per the retrieved column.
+ will only read one data page per retrieved column.
 * Range scans on the sort column will only need to read the exact data pages that contain relevant data.
 * Make other selective scans I/O efficient: if we have a very selective
diff --git a/README.md b/README.md
index 49a46c16a..fe59583cd 100644
--- a/README.md
+++ b/README.md
@@ -134,14 +134,14 @@ with a focus on how the types affect disk storage.
 For example, 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding. This reduces the complexity of implementing readers and writers for the format.
The types are:
- - BOOLEAN: 1 bit boolean
- - INT32: 32 bit signed ints
- - INT64: 64 bit signed ints
- - INT96: 96 bit signed ints
+ - BOOLEAN: 1-bit boolean
+ - INT32: 32-bit signed ints
+ - INT64: 64-bit signed ints
+ - INT96: 96-bit signed ints
 - FLOAT: IEEE 32-bit floating point values
 - DOUBLE: IEEE 64-bit floating point values
 - BYTE_ARRAY: arbitrarily long byte arrays
- - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
+ - FIXED_LEN_BYTE_ARRAY: fixed-length byte arrays
 ### Logical Types
 Logical types are used to extend the types that parquet can be used to store,
@@ -190,11 +190,11 @@ In order we have:
 The value of `uncompressed_page_size` specified in the header is for all the 3 pieces combined.
-The encoded values for the data page is always required. The definition and repetition levels
+The encoded values for the data page are always required. The definition and repetition levels
 are optional, based on the schema definition. If the column is not nested (i.e.
-the path to the column has length 1), we do not encode the repetition levels (it would
+the path to the column has length 1), we do not encode the repetition levels (they would
 always have the value 0). For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level).
+skipped (if encoded, they will always have the value of the max definition level).
 For example, in the case where the column is non-nested and required, the data in the page is only the encoded values.
@@ -224,7 +224,7 @@ the reasoning behind adding these to the format.
 ## Checksumming
 Pages of all kinds can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups. Checksums are calculated
-using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
+using the standard CRC32 algorithm - as used in e.g. GZIP - on the serialized binary
 representation of a page (not including the page header itself).
 ## Error recovery
@@ -239,10 +239,10 @@ metadata at the end. If an error happens while writing the file metadata, all t
 data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so
-far. Combining this with the strategy used for rc or avro files using sync markers,
+far. Combining this with the strategy used for RCFile or Avro files using sync markers,
 a reader could recover partially written files.
-## Separating metadata and column data.
+## Separating metadata and column data
 The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.
@@ -256,7 +256,7 @@ one HDFS block. Therefore, HDFS block sizes should also be set to be larger. A
 optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file.
 - Data page size: Data pages should be considered indivisible so smaller data pages
-allow for more fine grained reading (e.g. single row lookup). Larger page sizes
+allow for more fine-grained reading (e.g. single row lookup). Larger page sizes
 incur less space overhead (less page headers) and potentially less parsing overhead (processing headers). Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes.
diff --git a/VariantEncoding.md b/VariantEncoding.md
index e566c7535..eec9de0a8 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -142,7 +142,7 @@ Notes:
 - Offsets are relative to the start of the `bytes` array.
 - The length of the ith string can be computed as `offset[i+1] - offset[i]`.
 - The offset of the first string is always equal to 0 and is therefore redundant.
It is included in the spec to simplify in-memory-processing.
-- `offset_size_minus_one` indicates the number of bytes per `dictionary_size` and `offset` entry. I.e. a value of 0 indicates 1-byte offsets, 1 indicates 2-byte offsets, 2 indicates 3 byte offsets and 3 indicates 4-byte offsets.
+- `offset_size_minus_one` indicates the number of bytes per `dictionary_size` and `offset` entry. I.e. a value of 0 indicates 1-byte offsets, 1 indicates 2-byte offsets, 2 indicates 3-byte offsets and 3 indicates 4-byte offsets.
 - If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.
@@ -370,9 +370,9 @@ primitive_val: see table for binary representation
 short_string_val: UTF-8 encoded bytes
 object_val: * *
 array_val: *
-num_elements: a 1 or 4 byte unsigned little-endian value (depending on is_large in /)
-field_id: a 1, 2, 3 or 4 byte unsigned little-endian value (depending on field_id_size_minus_one in ), indexing into the dictionary
-field_offset: a 1, 2, 3 or 4 byte unsigned little-endian value (depending on field_offset_size_minus_one in /), providing the offset in bytes within fields
+num_elements: a 1- or 4-byte unsigned little-endian value (depending on is_large in /)
+field_id: a 1-, 2-, 3-, or 4-byte unsigned little-endian value (depending on field_id_size_minus_one in ), indexing into the dictionary
+field_offset: a 1-, 2-, 3-, or 4-byte unsigned little-endian value (depending on field_offset_size_minus_one in /), providing the offset in bytes within fields
 fields: *
 ```
@@ -383,8 +383,8 @@ The last entry is the offset that is one byte past the last field (i.e. the tota
 All offsets are relative to the first byte of the first field in the object/array.
 `field_id_size_minus_one` and `field_offset_size_minus_one` indicate the number of bytes per field ID/offset.
-For example, a value of 0 indicates 1-byte IDs, 1 indicates 2-byte IDs, 2 indicates 3 byte IDs and 3 indicates 4-byte IDs.
-The `is_large` flag for arrays and objects is used to indicate whether the number of elements is indicated using a one or four byte value.
+For example, a value of 0 indicates 1-byte IDs, 1 indicates 2-byte IDs, 2 indicates 3-byte IDs and 3 indicates 4-byte IDs.
+The `is_large` flag for arrays and objects is used to indicate whether the number of elements is indicated using a one- or four-byte value.
 When more than 255 elements are present, `is_large` must be set to true. It is valid for an implementation to use a larger value than necessary for any of these fields (e.g. `is_large` may be true for an object with less than 256 elements).
diff --git a/VariantShredding.md b/VariantShredding.md
index fe579ed0c..2303bd826 100644
--- a/VariantShredding.md
+++ b/VariantShredding.md
@@ -148,7 +148,7 @@ Null elements must be encoded in `value` as Variant null: basic type 0 (primitiv
 The series of `tags` arrays `["comedy", "drama"], ["horror", null], ["comedy", "drama", "romance"], null` would be stored as:
-| Array | `value` | `typed_value `| `typed_value...value` | `typed_value...typed_value` |
+| Array | `value` | `typed_value` | `typed_value...value` | `typed_value...typed_value` |
 |----------------------------------|-------------|---------------|-----------------------|--------------------------------|
 | `["comedy", "drama"]` | null | non-null | [null, null] | [`comedy`, `drama`] |
 | `["horror", null]` | null | non-null | [null, `00`] | [`horror`, null] |
@@ -294,7 +294,7 @@ def construct_variant(metadata: Metadata, value: Variant, typed_value: Any) -> V
 # this is a shredded object
 object_fields = {
 name: construct_variant(metadata, field.value, field.typed_value)
- for (name, field) in typed_value
+ for (name, field) in typed_value.items()
 }
 if value is not None:
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 09c4e4095..b72674271 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -41,7 +41,7 @@ enum Type {
 }
 /**
- * DEPRECATED: Common types used by frameworks(e.g. hive, pig) using parquet.
+ * DEPRECATED: Common types used by frameworks (e.g. Hive, Pig) using parquet.
 * ConvertedType is superseded by LogicalType. This enum should not be extended.
 *
 * See LogicalTypes.md for conversion between ConvertedType and LogicalType.
@@ -431,7 +431,7 @@ enum EdgeInterpolationAlgorithm {
 /**
 * Embedded Geometry logical type annotation
 *
- * Geospatial features in the Well-Known Binary (WKB) format and edges interpolation
+ * Geospatial features in the Well-Known Binary (WKB) format and edge interpolation
 * is always linear/planar.
 *
 * A custom CRS can be set by the crs field. If unset, it defaults to "OGC:CRS84",
@@ -450,13 +450,13 @@ struct GeometryType {
 * Embedded Geography logical type annotation
 *
 * Geospatial features in the WKB format with an explicit (non-linear/non-planar)
- * edges interpolation algorithm.
+ * edge interpolation algorithm.
 *
 * A custom geographic CRS can be set by the crs field, where longitudes are
 * bound by [-180, 180] and latitudes are bound by [-90, 90]. If unset, the CRS
 * defaults to "OGC:CRS84".
 *
- * An optional algorithm can be set to correctly interpret edges interpolation
+ * An optional algorithm can be set to correctly interpret edge interpolation
 * of the geometries. If unset, the algorithm defaults to SPHERICAL.
 *
 * Allowed for physical type: BYTE_ARRAY.
@@ -680,7 +680,7 @@ struct DataPageHeader {
 /**
 * Number of values, including NULLs, in this data page.
 *
- * If a OffsetIndex is present, a page must begin at a row
+ * If an OffsetIndex is present, a page must begin at a row
 * boundary (repetition_level = 0). Otherwise, pages may begin
 * within a row (repetition_level > 0).
**/ @@ -820,7 +820,7 @@ struct PageHeader { /** The 32-bit CRC checksum for the page, to be calculated as follows: * * - The standard CRC32 algorithm is used (with polynomial 0x04C11DB7, - * the same as in e.g. GZip). + * the same as in e.g. GZIP). * - All page types can have a CRC (v1 and v2 data pages, dictionary pages, * etc.). * - The CRC is computed on the serialization binary representation of the page
LogicalType ConvertedType