DRILL-5970: DrillParquetReader always builds the schema with "OPTIONAL" dataMode columns instead of "REQUIRED" ones #1047
vdiravka wants to merge 1 commit into apache:master
Conversation
- Added support for specifying the DataMode of the data type in MapWriter;
- Use nullable vectors when the parent nested data type has OPTIONAL repetition.
@sachouche can you please review this?
paul-rogers left a comment
The design approach on this one needs reconsideration. Please see the comment in the diffs that explains the issue and offers alternatives.
```java
Float8Writer float8(String name);
BitWriter bit(String name);
VarBinaryWriter varBinary(String name);
UInt1Writer uInt1(String name, TypeProtos.DataMode dataMode);
```
Really not sure we want to do this. These writers are also used in JSON, and are used for every field in every object. Now, every request to get a writer will have to pass the mode. This seems like we are making the problem far, far more complex than necessary.
The current code has rules for the type to choose. In general, the type is OPTIONAL for single (scalar) values and REPEATED for repeated (array) values.
The mode passed here cannot be REPEATED: just won't work. So, can we pass REQUIRED?
Let's think about how these are used in JSON. In JSON, we discover fields as we read each object. A field need not appear in the first object, it might appear 20 objects in. Then, after the 25th object, it may not ever appear again.
So, in JSON, we can never use REQUIRED; we must use OPTIONAL. Key reason: JSON provides no schema and Drill cannot predict what the schema will turn out to be.
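The JSON constraint can be sketched in a few lines. This is a hypothetical illustration, not Drill code: it just walks a stream of JSON-like records and shows why any field discovered mid-stream can only be given the OPTIONAL mode.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (not Drill code): with no schema up front, every
// discovered field must be OPTIONAL, because it may be absent from
// any earlier or later record.
public class JsonModeDiscovery {
    enum DataMode { REQUIRED, OPTIONAL }

    static Map<String, DataMode> discoverModes(List<Map<String, Object>> records) {
        Map<String, DataMode> modes = new LinkedHashMap<>();
        for (Map<String, Object> record : records) {
            for (String field : record.keySet()) {
                // A field first seen here was missing from earlier records,
                // so it cannot be REQUIRED; OPTIONAL is the only safe choice.
                modes.putIfAbsent(field, DataMode.OPTIONAL);
            }
        }
        return modes;
    }

    public static void main(String[] args) {
        // "b" is absent from the first and last records, so no reader
        // could have declared it REQUIRED up front.
        List<Map<String, Object>> records = List.of(
            Map.of("a", 1),
            Map.of("a", 2, "b", "x"),
            Map.of("a", 3));
        System.out.println(discoverModes(records));
    }
}
```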
Now, let's move to Parquet. Parquet does have a schema. In fact, with Parquet, we know the schema before we read the first row. And, in Parquet, a scalar column can be REQUIRED or OPTIONAL.
All of this suggests a better solution. (Indeed, this is the solution implemented in the new column writer layer.) Allow Parquet to declare an "early" schema: that is, prior to the first row, call methods that declare each column with its cardinality.
Then, when reading the field, always pass the name as today. If the field is new, it must be OPTIONAL. Otherwise, it will use whatever was used before.
Let's say this another way. Below these methods is a call to an addOrGet() method. In Parquet, call those methods before the first row to do the "add" part. Then, later, the method will do only the "get" part.
The result is that you won't have to modify so many files, won't have to complicate the APIs and won't have to worry about a client passing in REPEATED mode for a scalar column.
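The early-schema split described above could look roughly like this. The class and method names (declareColumn(), writerFor()) are illustrative, not Drill's real API; the point is only the shape of the addOrGet() split.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "early schema" alternative: Parquet declares
// each column's mode before the first row, so per-row lookups never need
// a mode argument.
public class EarlySchemaWriter {
    enum DataMode { REQUIRED, OPTIONAL }

    private final Map<String, DataMode> columns = new LinkedHashMap<>();

    // The "add" half of addOrGet(): called once per column, before the
    // first row, with the cardinality taken from the Parquet file schema.
    public void declareColumn(String name, DataMode mode) {
        columns.put(name, mode);
    }

    // The "get" half: called per field while reading, with just the name.
    // A column never declared up front (the JSON case) defaults to OPTIONAL.
    public DataMode writerFor(String name) {
        return columns.computeIfAbsent(name, n -> DataMode.OPTIONAL);
    }
}
```

With this split, Parquet gets REQUIRED columns where its schema says so, while JSON keeps today's behavior and the call sites keep their single-argument form.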
> The mode passed here cannot be REPEATED: just won't work. So, can we pass REQUIRED?
Yes, we can with my improvements: for the OPTIONAL dataMode we will use nullable vectors, and the usual ones for REQUIRED.
The proposed alternative is interesting; I will think about how best to implement it. Maybe replacing SingleMapWriter with an OptionalMapWriter and a RequiredMapWriter could work.
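The SingleMapWriter split could be sketched as below. Only the OptionalMapWriter and RequiredMapWriter names come from the suggestion above; the rest of the hierarchy is illustrative, not Drill's real one. The mode becomes a property of the writer class, so the scalar factory methods keep their one-argument form.

```java
// Hypothetical sketch: two concrete map writers share one abstract base,
// and the DataMode is fixed by the writer class rather than passed per call.
public class MapWriterSketch {
    enum DataMode { REQUIRED, OPTIONAL }

    abstract static class AbstractMapWriter {
        abstract DataMode mode();

        // OPTIONAL writers back their fields with nullable vectors;
        // REQUIRED writers use the plain (non-nullable) ones.
        boolean usesNullableVectors() {
            return mode() == DataMode.OPTIONAL;
        }
    }

    static class OptionalMapWriter extends AbstractMapWriter {
        DataMode mode() { return DataMode.OPTIONAL; }
    }

    static class RequiredMapWriter extends AbstractMapWriter {
        DataMode mode() { return DataMode.REQUIRED; }
    }
}
```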
Note: this PR is also useful for its refactoring of some generated code and of the logic for choosing the necessary reader.