DRILL-4842: SELECT * on JSON data results in NumberFormatException by Serhii-Harnyk · Pull Request #594 · apache/drill

Serhii-Harnyk · 2016-09-21T15:13:43Z

No description provided.

chunhui-shi · 2016-10-03T19:03:48Z

+          }
        } else {
-          fieldWriter.integer(fieldPath.getNameSegment().getPath());
+          if (checkNullFields(fieldPathList)) {


Your fix is not targeted to change behavior under non-allTextMode, right? If this is the case, is this change in "else" needed? If you want to fix behavior under non-allTextMode as well, then you should also add unit test for this mode as well.

chunhui-shi · 2016-10-03T19:11:33Z

          continue outside;
        }

+        nullableFields.remove(fieldName);


Is this line needed? Since it is a set.

chunhui-shi · 2016-10-03T19:16:25Z

        continue outside;
      }

+      nullableFields.remove(fieldName);


Is this line needed? Since it is a set.

chunhui-shi · 2016-10-03T19:29:54Z

      if (emptyStatus.get(j)) {
        if (allTextMode) {
-          fieldWriter.varChar(fieldPath.getNameSegment().getPath());
+          if (checkNullFields(fieldPathList)) {


This is in the loop of going through fieldPathList, if there are two fields in emptyStatus, will it result in calling checkNullFields twice and then fieldWriter.varChar(fieldName) twice for the same field?
To make sure here we are doing the right thing, I think testing more on hierarchical json could be helpful, what is your opinion? Like this example:
{"c0":{"c11": "I am not NULL", "c1": null}, "c1": "I am not NULL", "c11":null}.

Added fixes for nested fields with handling "path" of the field.

chunhui-shi · 2016-10-03T20:01:45Z

+  /**
+   * Collection for tracking nullable fields during reading
+   * and storing them for creating default typed vectors
+   */


I don't see the reason you need it to be a LinkedHashSet. And if there is no particular reason for it, making it LinkedHashSet but not HashSet adds extra cost.

We need to store an order of nullable fields.

Serhii-Harnyk · 2016-10-21T16:11:20Z

Added fixes according to review

Serhii-Harnyk · 2016-11-01T17:25:17Z

@chunhui-shi what about second commit?

chunhui-shi

Since this change appies on the code path executed against every records, we need to have performance test to make sure no negative impact.

chunhui-shi · 2016-11-01T17:14:15Z

  private final ListVectorOutput listOutput;
  private final boolean extended = true;
  private final boolean readNumbersAsDouble;
+  private List<String> path = Lists.newArrayList();


Since you are adding a stateful variable 'path', could you add a test with some error json items injected in the middle to make sure it can still recover and have good status when the option introduced in DRILL-4653 is enabled ('store.json.reader.skip_invalid_records') which will skip invalid tokens.

I fixed it and added test.

chunhui-shi · 2016-11-01T17:49:45Z

   * and storing them for creating default typed vectors
   */
-  private final Set<String> nullableFields  = Sets.newLinkedHashSet();
+  private final Set<List<String>> nullableFields  = Sets.newLinkedHashSet();


Do you mean the iterating on line 163 needs the order to be preserved? I doubt it. Could you please double check?

I did not see update on this.

Yes, you are right. I fixed it.

Serhii-Harnyk · 2016-11-09T14:32:23Z

@chunhui-shi
Results of performance test

| % of nullable records | with fix | master | %(t-t0)/t0 |
| 0.001 | 13986,77778 | 13752,11111 | 1,71% |
| 0.1 | 13873,55556 | 13081,33333 | 6,06% |
| 0.5 | 11345,55556 | 7552,444444 | 50,22% |
| 0.7 | 12699,11111 | 5753,444444 | 120,72% |
| 0.99 | 14544,22222 | 494 | 2844,17% |

So increase of performance depends on % of nullable fields in dataset

jinma1978 · 2016-11-17T11:40:30Z

@chunhui-shi I tried to play with different java structures for column path, but no real advance in performance.

Serhii-Harnyk · 2016-12-09T16:21:21Z

@chunhui-shi, could you please review new changes?

chunhui-shi · 2016-12-14T18:04:03Z

      case VALUE_NUMBER_FLOAT:
      case VALUE_NUMBER_INT:
      case VALUE_STRING:
+        removeNotNullColumn(fieldName);


I feel uncomfortable since the fix added 'removeNotNullColumn(fieldName)' on the code path that is supposed to be hotspot. Could the reader just clean 'path' if it knows it is recovering from some errors?

To clean 'path' if it knows it is recovering from some errors, was added this line: https://github.com/Serhii-Harnyk/drill/blob/9ce6d56b46bcd540697c85cd2f280831dd50b277/exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/fn/JsonReader.java#L243. Call of the method removeNotNullColumn() is a try to optimize the final size of the map fieldPathWriter, by removing fields, which were added to the map and initializing at current iteration.

But in the case, when null fields will be at the end of the file, it wouldn't work. So I removed it.

Serhii-Harnyk · 2016-12-16T13:14:21Z

@chunhui-shi, I have made some changes, could you take a look?

chunhui-shi · 2017-02-06T17:25:12Z

+1. LGTM. Need to address conflict before ready to commit.

Serhii-Harnyk · 2017-02-07T12:11:04Z

Squashed all changes into single commit, rebased on master and resolved conflicts.

close apache#594

amansinha100 · 2017-02-11T01:01:50Z

+   * @param fieldName
+   */
+  private void putFieldPath(String fieldName, MapWriter map) {
+    List<String> fieldPath = Lists.newArrayList(path);


Before allocating the fieldPath list, should we first check if it is already present in the fieldPathWriter ? This function is called for every null value which makes it quite expensive (as shown by your performance experiments). If there are 1000 records of type {'x': null}, ideally we want to call this method only for field 'x' once.

Thanks, fixed.

Serhii-Harnyk · 2017-02-14T12:34:31Z

@amansinha100, could you please review new changes?

amansinha100 · 2017-02-16T20:56:28Z

+   * Puts copy of field path list to fieldPathWriter map.
+   * @param fieldName
+   */
+  private void putFieldPath(String fieldName, MapWriter map) {


Do you have the performance data that you had collected earlier with different percentage of NULLs, after this change ? I still feel there could be some further optimization needed.

I attached results of testing on the last fix to the Jira.
https://drive.google.com/open?id=1l3Dg0DCV3p-OhwA0v6qdl2wGcezXDl5qaUqv56EEBn8
It should be mentioned that a lot of time goes to the creating fields and them writers. So times of execution also compared with time of querying files, in which nulls replaced by varchar.

paul-rogers · 2017-02-18T02:30:24Z

The bug here is fundamental to the way Drill works with JSON. We already had an extensive discussion around this area in another PR. The problem is that JSON supports a null type which is independent of all other types. In JSON, a null is not a "null int" or a "null string" -- it is just null.

Drill must infer a type for a field. This leads to all kinds of grief when a file contains a run of nulls before the real value:

{code}
{ id: 1, b: null }
...
{ id: 80000, b: "gee, I'm a string!" }
{code}

Drill must do something with the leading values. "b" is a null... what? Int? String?

We've had many bugs in this area. The bugs are not just code bugs, they represent a basic incompatibility between Drill and JSON.

This fix is yet another attempt to work around the limitation, but cannot overcome the basic incompatibility.

What we are doing, it seems, is building a list of fields that have seen only null values, deferring action on those fields until later. That works fine if "later" occurs in the same record batch. It is not clear what happens if we get to the end of the batch (as in the example above), but have never seen the type of the field: what type of vector do we create?

There are several solutions.

One is to have a "null" type in Drill. When we see the initial run of nulls, we simply create a field of the "null" type. We have type conversion rules that say that a "null" vector can be coerced into any other type when we ultimately see the type. (And, if we don't see a type in one batch, we can pass the null vector along upstream for later reconciliation.) This is a big change; too big for a bug fix.

Another solution, used here, is to keep track of "null only" fields, to defer the decision for later. That has a performance impact.

A third solution is to go ahead and create a vector of any type, keep setting its values to null (as if we had already seen the field type), but be ready to discard that vector and convert it to the proper type once we see that type. In this way, we treat null fields just as any other up to the point where we realize we have a type conflict. Only then do we check the "null only" map and decide we can quietly convert the vector type to the proper type.

These are the initial thoughts. I'll add more nuanced comments as I review the code in more detail.

paul-rogers · 2017-02-19T00:54:39Z

We have three cases for nulls:

See a typed field, followed by nulls. Here, we know the type, just mark the current field as null.
Run of nulls at start of file, followed by a non-null value within the same record batch. We can transform the null field into a typed field. That is what this fix tries to do.
Run of nulls in one record batch, followed by a non-null value in the next batch. This must trigger a schema change. The question is, does it trigger based on adding a new field, or assuming a type in the first batch, then changing the type in the second?

The code presumably handles the first case. Case 3 is beyond the scope of this fix. So, the question is, how best to handle the second case? Is the present fix sufficient?

paul-rogers · 2017-02-19T00:58:11Z

Note that Drill does have a vector that can possibly used to represent a run of nulls: the {{ZeroVector}}. Using this, we can:

On the first record where we see a field, if that field is null, add a {{ZeroVector}} to the record batch.
On subsequent records, if the value is still null, do nothing.
If the value is non-null, and the current vector is a {{ZeroVector}}, replace it with the proper Nullable vector, with all (0..i-1) values set to null, and the ith value set to the current column.
At end of batch, if any {{ZeroVector}}s remain, simply remove them so that the column does not appear in the batch output.

The result of this is that we need not do two map lookups for a null value, we just do one: the one to find the column value vector as we'd do for an int or string.

paul-rogers · 2017-02-20T23:58:32Z

Three general rules to keep in mind in the current JSON reader implementation:

Drill can remember the past. (Once a type as been seen for a column, Drill will remember that type.)
Drill cannot predict the future. (If a type has not been seen for a column by the end of a record batch, Drill cannot predict what type will appear in some later batch.)
Drill can amend the past within a single record batch. (If a batch starts with nulls, but later a type is seen, the previous values are automatically filled with nulls.)

Actual implementation of the JSON reader, and the value writers that form the implementation, is complex. As we read JSON values, we ask a type-specific writer to set that value into the value vector. Each writer marks the column as non-null, then adds the value. Any values not so set will default to null.

Consider a file with five null "c1" values followed by a string value "foo" for that field. The five nulls are ignored. When we see the non-null c1, the code creates a VarChar vector and sets the 6th value to the string "foo". Doing so automatically marks the previous five column values as null.

Suppose we have a file with a single string value "foo" for column "c1", followed by five nulls. In this case, the first value creates and sets the VarChar vector as before. Later, at the end of reading the record batch, the reader sets the record count for the vectors. This action, on the VarChar vector, has the effect of setting the trailing five column values to null.

Since values default to null, we get this behavior, and the previous, for free. The result is that if a record batch contains even a single non-null value for a field, that column will be fully populated with nulls for all other records in the same batch.

This gets us back to the same old problem in Drill: if all we see are nulls, Drill needs to know, "null of what type" while in JSON the value is just null. The JIRA tickets linked to this ticket all related to that same underlying issue.

There is a long history of this issue: DRILL-5033, DRILL-1256, DRILL-4479, DRILL-3806 and more.

This fix affects only "all text mode." This means that, regardless of the JSON type, create a VarChar column. Doing so provides a very simple fix. Since all columns are VarChar, when we see a new column, with a null value, just create a VarChar column. (No need to set the column to null.)

That is, we can "predict the future" for nulls because all columns are VarChar -- so there is not much to predict.

Otherwise, we have to stick with Jacques' design decision in DRILL-1256: "Drill's perspective is a non-existent column and a column with no value are equivalent." A record batch of all nulls, followed by a record batch with a non-null value, will cause a schema change.

Again, Drill needs a "null" type that is compatible with all other types in order to support JSON semantics. (And, needs to differentiate between value-exists-and-is-null and value-does-not-exist.)

Yet another solution is to have the user tell us their intent. The JSON Schema project provides a way to express the expected schema so that Drill would know up front the type of each column (and whether the column is really nullable.)

paul-rogers · 2017-02-21T00:09:58Z

Given all the above, there is very simple fix to the particular case that this bug covers.

{code}
private void writeDataAllText(MapWriter map, FieldSelection selection,
...
case VALUE_NULL:
// Here we do have a type. This is a null VarChar.
handleNullString(map, fieldName);
break;
...

/**

Create a VarChar column. No need to explicitly set a
null value; nulls are the default.
Note: This only works for all-text mode because we can
predict that, if we ever see an actual value, it will be
treated as a VarChar. This trick will not work for the
general case because we cannot predict the actual column
type.
@param writer
@param fieldName
*/

private void handleNullString(MapWriter writer, String fieldName) {
writer.varChar(fieldName);
}
{code}

The above simply leverages the existing mechanism for mapping columns to types, and for filling in missing null values.

Output, when printing {{tooManyNulls.json}} to CSV:

{code}
4096 row(s):
c1
null
...
null
1 row(s):
c1
Hello World
Total rows returned : 4097. Returned in 242ms.
{code}

Performance here will be slower than master because we now do a field lookup for each null column where in the past we did not. The performance of null columns, however, should be identical to non-null columns. And, performance of the above fix should be identical to the fix proposed in this PR: but the code here is simpler.

paul-rogers

+0

This change does not fix the underlying problem, but it is no more broken than doing nothing. I have no objection to merging it; but I don't think that tinkering around the edges will help us solve the actual, underlying design issue with how Drill handles nulls.

paul-rogers · 2017-02-18T02:15:47Z

+  /**
+   * Collection for tracking nullable fields during reading
+   * and storing them for creating default typed vectors
+   */


Actually, in JSON, all scalar types can be null. Because JSON has no schema, any field can have any type. Drill assumes that a given field has a single type. But, in JSON semantics, that single type can be null. That is, the following is always valid:

{code}
{ a: 10, b: null }
{ a: null, b: "foo" }
{ a: 20 }
{code}

JSON, but not Drill, differentiates between "not present" and null.

Given this, we don't need a special map for nullable fields: all JSON fields are nullable.

Of course, JSON allows a null map, which Drill does not handle, but let's ignore that here...

paul-rogers · 2017-02-28T20:41:23Z

+    if (allTextMode) {
+      fieldWriter.varChar(path);
+    } else {
+      fieldWriter.integer(path);


At the Drill Hangout it was mentioned that this fix handles nulls for non-Varchar fields. Note that it does so using the same mechanism that Drill uses elsewhere: assume the field is of type int. However, we have many, many bugs that result from that assumption. There is simply no guarantee that, in a later batch, when we see the field, that it will, in fact, be an int.

I'm not sure whether it is OK to simply continue to propagate that well-known error here (as we have done) or to take another path that avoids the error. (Since doing so requires a design change that has, so far, always been beyond our ability to accomplish.)

chunhui-shi suggested changes Oct 3, 2016

View reviewed changes

Serhii-Harnyk force-pushed the DRILL-4842 branch from 190a69a to 473e5fb Compare October 20, 2016 17:28

chunhui-shi reviewed Nov 1, 2016

View reviewed changes

Serhii-Harnyk force-pushed the DRILL-4842 branch from 473e5fb to 9ce6d56 Compare December 9, 2016 14:20

chunhui-shi reviewed Dec 14, 2016

View reviewed changes

Serhii-Harnyk force-pushed the DRILL-4842 branch from 9ce6d56 to 1a67f9f Compare December 16, 2016 13:00

Serhii-Harnyk force-pushed the DRILL-4842 branch from 1a67f9f to 4ee4366 Compare February 7, 2017 10:21

amansinha100 pushed a commit to amansinha100/drill that referenced this pull request Feb 10, 2017

DRILL-4842: SELECT * on JSON data results in NumberFormatException

52923a7

close apache#594

amansinha100 pushed a commit to amansinha100/drill that referenced this pull request Feb 10, 2017

DRILL-4842: SELECT * on JSON data results in NumberFormatException

aff8d5a

close apache#594

amansinha100 reviewed Feb 11, 2017

View reviewed changes

Serhii-Harnyk force-pushed the DRILL-4842 branch from 4ee4366 to ce3591c Compare February 14, 2017 12:31

DRILL-4842: SELECT * on JSON data results in NumberFormatException

ce3591c

amansinha100 reviewed Feb 16, 2017

View reviewed changes

paul-rogers reviewed Feb 28, 2017

View reviewed changes

ashevchuk123 pushed a commit to mapr/incubator-drill that referenced this pull request Oct 28, 2025

MD-6373: Filter is not pushed down in hive queries (apache#594)

0e30c91

Conversation

Serhii-Harnyk commented Sep 21, 2016

Uh oh!

chunhui-shi Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chunhui-shi Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chunhui-shi Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chunhui-shi Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Serhii-Harnyk commented Oct 21, 2016

Uh oh!

Serhii-Harnyk commented Nov 1, 2016

Uh oh!

chunhui-shi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Serhii-Harnyk Dec 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Serhii-Harnyk commented Nov 9, 2016

Uh oh!

jinma1978 commented Nov 17, 2016

Uh oh!

Serhii-Harnyk commented Dec 9, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Serhii-Harnyk commented Dec 16, 2016

Uh oh!

chunhui-shi commented Feb 6, 2017

Uh oh!

Serhii-Harnyk commented Feb 7, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Serhii-Harnyk commented Feb 14, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paul-rogers commented Feb 18, 2017

Uh oh!

chunhui-shi Oct 3, 2016 •

edited

Loading

chunhui-shi Oct 3, 2016 •

edited

Loading

chunhui-shi Oct 3, 2016 •

edited

Loading

chunhui-shi Oct 3, 2016 •

edited

Loading

Serhii-Harnyk Dec 9, 2016 •

edited

Loading