-
Notifications
You must be signed in to change notification settings - Fork 985
DRILL-4842: SELECT * on JSON data results in NumberFormatException #594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,4 @@ | ||
| /** | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
|
|
@@ -17,12 +17,15 @@ | |
| */ | ||
| package org.apache.drill.exec.vector.complex.fn; | ||
|
|
||
| import com.google.common.collect.ImmutableList; | ||
| import com.google.common.collect.Maps; | ||
| import io.netty.buffer.DrillBuf; | ||
|
|
||
| import java.io.IOException; | ||
| import java.io.InputStream; | ||
| import java.util.BitSet; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
|
|
||
| import org.apache.drill.common.exceptions.UserException; | ||
| import org.apache.drill.common.expression.PathSegment; | ||
|
|
@@ -55,13 +58,20 @@ public class JsonReader extends BaseJsonProcessor { | |
| private final ListVectorOutput listOutput; | ||
| private final boolean extended = true; | ||
| private final boolean readNumbersAsDouble; | ||
| private List<String> path = Lists.newArrayList(); | ||
|
|
||
| /** | ||
| * Collection for tracking empty array writers during reading | ||
| * and storing them for initializing empty arrays | ||
| */ | ||
| private final List<ListWriter> emptyArrayWriters = Lists.newArrayList(); | ||
|
|
||
| /** | ||
| * Collection for tracking nullable fields during reading | ||
| * and storing them for creating default typed vectors | ||
| */ | ||
| private Map<List<String>, MapWriter> fieldPathWriter = Maps.newHashMap(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, in JSON, all scalar types can be null. Because JSON has no schema, any field can have any type. Drill assumes that a given field has a single type. But, in JSON semantics, that single type can be null. That is, the following is always valid: {code} JSON, but not Drill, differentiates between "not present" and null. Given this, we don't need a special map for nullable fields: all JSON fields are nullable. Of course, JSON allows a null map, which Drill does not handle, but let's ignore that here... |
||
|
|
||
| /** | ||
| * Describes whether or not this reader can unwrap a single root array record | ||
| * and treat it like a set of distinct records. | ||
|
|
@@ -149,11 +159,15 @@ public void ensureAtLeastOneField(ComplexWriter writer) { | |
| for (int j = 0; j < fieldPathList.size(); j++) { | ||
| BaseWriter.MapWriter fieldWriter = writerList.get(j); | ||
| PathSegment fieldPath = fieldPathList.get(j); | ||
| if (emptyStatus.get(j)) { | ||
| if (allTextMode) { | ||
| fieldWriter.varChar(fieldPath.getNameSegment().getPath()); | ||
| } else { | ||
| fieldWriter.integer(fieldPath.getNameSegment().getPath()); | ||
| if (emptyStatus.get(j) && !checkNullFields(fieldPathList)) { | ||
| initializeFieldWriter(fieldWriter, fieldPath.getNameSegment().getPath()); | ||
| } | ||
| } | ||
|
|
||
| if (checkNullFields(fieldPathList)) { | ||
| for (Map.Entry<List<String>, MapWriter> fieldPath : fieldPathWriter.entrySet()) { | ||
| if(fieldPath.getValue() != null) { | ||
| initializeFieldWriter(fieldPath.getValue(), fieldPath.getKey().get(fieldPath.getKey().size() - 1)); | ||
| } | ||
| } | ||
| } | ||
|
|
@@ -170,6 +184,21 @@ public void ensureAtLeastOneField(ComplexWriter writer) { | |
| } | ||
| } | ||
|
|
||
| private void initializeFieldWriter(MapWriter fieldWriter, String path) { | ||
| if (allTextMode) { | ||
| fieldWriter.varChar(path); | ||
| } else { | ||
| fieldWriter.integer(path); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At the Drill Hangout it was mentioned that this fix handles nulls for non-Varchar fields. Note that it does so using the same mechanism that Drill uses elsewhere: assume the field is of type int. However, we have many, many bugs that result from that assumption. There is simply no guarantee that, in a later batch, when we see the field, that it will, in fact, be an int. I'm not sure whether it is OK to simply continue to propagate that well-known error here (as we have done) or to take another path that avoids the error. (Since doing so requires a design change that has, so far, always been beyond our ability to accomplish.) |
||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Check query having a '*' and existing nullable fields in result | ||
| */ | ||
| private boolean checkNullFields(List<PathSegment> fieldPathList) { | ||
| return (fieldPathList.size() == 1) && fieldPathList.get(0).getNameSegment().getPath().equals("*") && !fieldPathWriter.isEmpty(); | ||
| } | ||
|
|
||
| public void setSource(int start, int end, DrillBuf buf) throws IOException { | ||
| setSource(DrillBufInputStream.getStream(start, end, buf)); | ||
| } | ||
|
|
@@ -229,6 +258,7 @@ public ReadState write(ComplexWriter writer) throws IOException { | |
| } | ||
| } catch (com.fasterxml.jackson.core.JsonParseException ex) { | ||
| if (ignoreJSONParseError()) { | ||
| path.clear(); | ||
| if (processJSONException() == JsonExceptionProcessingState.END_OF_STREAM) { | ||
| return ReadState.JSON_RECORD_PARSE_EOF_ERROR; | ||
| } else { | ||
|
|
@@ -367,12 +397,19 @@ private void writeData(MapWriter map, FieldSelection selection, | |
| t = parser.getCurrentToken(); | ||
| moveForward = true; | ||
| } | ||
| if (t == JsonToken.NOT_AVAILABLE || t == JsonToken.END_OBJECT) { | ||
|
|
||
| if (t == JsonToken.NOT_AVAILABLE) { | ||
| return; | ||
| } | ||
|
|
||
| if (t == JsonToken.END_OBJECT) { | ||
| if (path.size() > 0) { | ||
| path.remove(path.size() - 1); | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| assert t == JsonToken.FIELD_NAME : String.format( | ||
| "Expected FIELD_NAME but got %s.", t.name()); | ||
| assert t == JsonToken.FIELD_NAME : String.format("Expected FIELD_NAME but got %s.", t.name()); | ||
|
|
||
| final String fieldName = parser.getText(); | ||
| this.currentFieldName = fieldName; | ||
|
|
@@ -388,6 +425,7 @@ private void writeData(MapWriter map, FieldSelection selection, | |
| break; | ||
| case START_OBJECT: | ||
| if (!writeMapDataIfTyped(map, fieldName)) { | ||
| path.add(fieldName); | ||
| writeData(map.map(fieldName), childSelection, false); | ||
| } | ||
| break; | ||
|
|
@@ -403,7 +441,7 @@ private void writeData(MapWriter map, FieldSelection selection, | |
| break; | ||
| } | ||
| case VALUE_NULL: | ||
| // do nothing as we don't have a type. | ||
| putFieldPath(fieldName, map); | ||
| break; | ||
| case VALUE_NUMBER_FLOAT: | ||
| map.float8(fieldName).writeFloat8(parser.getDoubleValue()); | ||
|
|
@@ -446,12 +484,19 @@ private void writeDataAllText(MapWriter map, FieldSelection selection, | |
| t = parser.getCurrentToken(); | ||
| moveForward = true; | ||
| } | ||
| if (t == JsonToken.NOT_AVAILABLE || t == JsonToken.END_OBJECT) { | ||
|
|
||
| if (t == JsonToken.NOT_AVAILABLE) { | ||
| return; | ||
| } | ||
|
|
||
| assert t == JsonToken.FIELD_NAME : String.format( | ||
| "Expected FIELD_NAME but got %s.", t.name()); | ||
| if (t == JsonToken.END_OBJECT) { | ||
| if (path.size() > 0) { | ||
| path.remove(path.size() - 1); | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| assert t == JsonToken.FIELD_NAME : String.format("Expected FIELD_NAME but got %s.", t.name()); | ||
|
|
||
| final String fieldName = parser.getText(); | ||
| this.currentFieldName = fieldName; | ||
|
|
@@ -467,6 +512,7 @@ private void writeDataAllText(MapWriter map, FieldSelection selection, | |
| break; | ||
| case START_OBJECT: | ||
| if (!writeMapDataIfTyped(map, fieldName)) { | ||
| path.add(fieldName); | ||
| writeDataAllText(map.map(fieldName), childSelection, false); | ||
| } | ||
| break; | ||
|
|
@@ -482,7 +528,7 @@ private void writeDataAllText(MapWriter map, FieldSelection selection, | |
| handleString(parser, map, fieldName); | ||
| break; | ||
| case VALUE_NULL: | ||
| // do nothing as we don't have a type. | ||
| putFieldPath(fieldName, map); | ||
| break; | ||
|
|
||
| default: | ||
|
|
@@ -492,7 +538,19 @@ private void writeDataAllText(MapWriter map, FieldSelection selection, | |
| } | ||
| } | ||
| map.end(); | ||
| } | ||
|
|
||
| /** | ||
| * Puts copy of field path list to fieldPathWriter map. | ||
| * @param fieldName | ||
| */ | ||
| private void putFieldPath(String fieldName, MapWriter map) { | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you have the performance data that you had collected earlier with different percentage of NULLs, after this change ? I still feel there could be some further optimization needed.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I attached results of testing on the last fix to the Jira. |
||
| path.add(fieldName); | ||
| if (!fieldPathWriter.containsKey(path)) { | ||
| List<String> fieldPath = ImmutableList.copyOf(path); | ||
| fieldPathWriter.put(fieldPath, map); | ||
| } | ||
| path.remove(path.size() - 1); | ||
| } | ||
|
|
||
| /** | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are adding a stateful variable 'path', could you add a test with some error json items injected in the middle to make sure it can still recover and have good status when the option introduced in DRILL-4653 is enabled ('store.json.reader.skip_invalid_records') which will skip invalid tokens.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it and added test.