DRILL-5970: DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones #1047
vdiravka wants to merge 1 commit into apache:master
Conversation
- Added support for specifying the DataMode of the data type in MapWriter;
- Use nullable vectors when the parent nested data type has OPTIONAL repetition.
@sachouche can you please review this?
paul-rogers left a comment
The design approach on this one needs reconsideration. Please see the comment in the diffs that explains the issue and offers alternatives.
```java
Float8Writer float8(String name);
BitWriter bit(String name);
VarBinaryWriter varBinary(String name);
UInt1Writer uInt1(String name, TypeProtos.DataMode dataMode);
```
Really not sure we want to do this. These writers are also used in JSON, and are used for every field in every object. Now, every request to get a writer will have to pass the mode. This seems like we are making the problem far, far more complex than necessary.
The current code has rules for the type to choose. In general, the type is OPTIONAL for single (scalar) values and REPEATED for repeated (array) values.
The mode passed here cannot be REPEATED: just won't work. So, can we pass REQUIRED?
Let's think about how these are used in JSON. In JSON, we discover fields as we read each object. A field need not appear in the first object, it might appear 20 objects in. Then, after the 25th object, it may not ever appear again.
So, in JSON, we can never use REQUIRED; we must use OPTIONAL. Key reason: JSON provides no schema and Drill cannot predict what the schema will turn out to be.
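The JSON constraint can be sketched in a few lines. This is a hypothetical illustration, not Drill code: it just walks a stream of JSON-like records and shows why any field discovered mid-stream can only be given the OPTIONAL mode.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (not Drill code): with no schema up front, every
// discovered field must be OPTIONAL, because it may be absent from
// any earlier or later record.
public class JsonModeDiscovery {
    enum DataMode { REQUIRED, OPTIONAL }

    static Map<String, DataMode> discoverModes(List<Map<String, Object>> records) {
        Map<String, DataMode> modes = new LinkedHashMap<>();
        for (Map<String, Object> record : records) {
            for (String field : record.keySet()) {
                // A field first seen here was missing from earlier records,
                // so it cannot be REQUIRED; OPTIONAL is the only safe choice.
                modes.putIfAbsent(field, DataMode.OPTIONAL);
            }
        }
        return modes;
    }

    public static void main(String[] args) {
        // "b" is absent from the first and last records, so no reader
        // could have declared it REQUIRED up front.
        List<Map<String, Object>> records = List.of(
            Map.of("a", 1),
            Map.of("a", 2, "b", "x"),
            Map.of("a", 3));
        System.out.println(discoverModes(records));
    }
}
```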
Now, let's move to Parquet. Parquet does have a schema. In fact, with Parquet, we know the schema before we read the first row. And, in Parquet, a scalar column can be REQUIRED or OPTIONAL.
All of this suggests a better solution. (Indeed, this is the solution implemented in the new column writer layer.) Allow Parquet to declare an "early" schema: that is, prior to the first row, call methods that declare each column with its cardinality.
Then, when reading the field, always pass the name as today. If the field is new, it must be OPTIONAL. Otherwise, it will use whatever was used before.
Let's say this another way. Below these methods is a call to an addOrGet() method. In Parquet, call those methods before the first row to do the "add" part. Then, later, the method will do only the "get" part.
The result is that you won't have to modify so many files, won't have to complicate the APIs and won't have to worry about a client passing in REPEATED mode for a scalar column.
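The early-schema split described above could look roughly like this. The class and method names (declareColumn(), writerFor()) are illustrative, not Drill's real API; the point is only the shape of the addOrGet() split.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "early schema" alternative: Parquet declares
// each column's mode before the first row, so per-row lookups never need
// a mode argument.
public class EarlySchemaWriter {
    enum DataMode { REQUIRED, OPTIONAL }

    private final Map<String, DataMode> columns = new LinkedHashMap<>();

    // The "add" half of addOrGet(): called once per column, before the
    // first row, with the cardinality taken from the Parquet file schema.
    public void declareColumn(String name, DataMode mode) {
        columns.put(name, mode);
    }

    // The "get" half: called per field while reading, with just the name.
    // A column never declared up front (the JSON case) defaults to OPTIONAL.
    public DataMode writerFor(String name) {
        return columns.computeIfAbsent(name, n -> DataMode.OPTIONAL);
    }
}
```

With this split, Parquet gets REQUIRED columns where its schema says so, while JSON keeps today's behavior and the call sites keep their single-argument form.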
> The mode passed here cannot be REPEATED: just won't work. So, can we pass REQUIRED?
Yes, we can with my improvements: for the OPTIONAL dataMode we will use nullable vectors, and the usual ones for REQUIRED.
The proposed alternative is interesting; I will think about how best to implement it. Maybe replacing SingleMapWriter with an OptionalMapWriter and a RequiredMapWriter could work.
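The SingleMapWriter split could be sketched as below. Only the OptionalMapWriter and RequiredMapWriter names come from the suggestion above; the rest of the hierarchy is illustrative, not Drill's real one. The mode becomes a property of the writer class, so the scalar factory methods keep their one-argument form.

```java
// Hypothetical sketch: two concrete map writers share one abstract base,
// and the DataMode is fixed by the writer class rather than passed per call.
public class MapWriterSketch {
    enum DataMode { REQUIRED, OPTIONAL }

    abstract static class AbstractMapWriter {
        abstract DataMode mode();

        // OPTIONAL writers back their fields with nullable vectors;
        // REQUIRED writers use the plain (non-nullable) ones.
        boolean usesNullableVectors() {
            return mode() == DataMode.OPTIONAL;
        }
    }

    static class OptionalMapWriter extends AbstractMapWriter {
        DataMode mode() { return DataMode.OPTIONAL; }
    }

    static class RequiredMapWriter extends AbstractMapWriter {
        DataMode mode() { return DataMode.REQUIRED; }
    }
}
```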
Note: this PR is also useful for its refactoring of some generated code and of the logic for choosing the necessary reader.