HIVE-29516 Fix NPE in StatsUtils.updateStats when column statistics a…#6382
HIVE-29516 Fix NPE in StatsUtils.updateStats when column statistics a…#6382shubhluck wants to merge 3 commits intoapache:masterfrom
Conversation
soumyakanti3578
left a comment
There was a problem hiding this comment.
This doesn't address the root cause mentioned in the PR description: getColumnStats.
Ideally it should return an empty list instead of returning null. But I understand that it is difficult to change it there as it's used in many other places. One solution could be to add a new method similar to getColumnStats:
public List<ColStatistics> getColumnStatsOrEmpty() {
if (columnStats != null) {
return Lists.newArrayList(columnStats.values());
}
return Collections.emptyList();
}
This method can be used in the three places that were changed in the PR.
Another thing that I noticed in updateStats is because now colStats is not null, we will set empty column stats in:
stats.setColumnStats(colStats);
long newDataSize = StatsUtils.getDataSizeFromColumnStats(newNumRows, colStats);
stats.setDataSize(StatsUtils.getMaxIfOverflow(newDataSize));
This is a subtle change of this PR as earlier we would get an NPE instead. Do you think we should at least add a LOG.warn when useColStats = true and colStats is empty? Maybe it's a better idea to go to the else block when this happens?
There might be other changes like this which we need to be wary of.
Additionally, I looked into the other places which calls the method. I think there is a risk of NPE from SemanticAnalyzer.getMaterializedTableStats() too, so maybe we can call the new method there too. I think the other places are fine but it would help if you could investigate those too.
Finally, I think we should add some unit tests for the methods with changes.
|
@soumyakanti3578 Thanks for the review! Addressed all feedback: Added getColumnStatsOrEmpty() - returns empty list instead of null. Now used in all 4 locations (3 original + SemanticAnalyzer.getMaterializedTableStats()). Fixed updateStats() behavior - when useColStats=true but stats are empty, we now fall back to ratio-based calculation instead of setting empty stats. Added LOG.warn for visibility. Investigated other callers - all safe: SortedDynPartitionOptimizer, StatsRulesProcFactory (UDTF), GenericUDAFBloomFilter, GroupingSetOptimizer, and ConvertJoinMapJoin already have null handling. Added unit tests for getColumnStatsOrEmpty() and the updated behavior in updateStats()/getColStatisticsUpdatingTableAlias(). |
zabetak
left a comment
There was a problem hiding this comment.
The bug appears during a very specific use-case where semijoin removal takes place. Ideally, it would be nice to have a .q test that can reproduce the NPE. Note, that you can fake statistics by using ALTERA TABLE xx UPDATE STATISTICS commands (examples in reduce_deduplicate_stat_based.q).
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q
Outdated
Show resolved
Hide resolved
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q
Outdated
Show resolved
Hide resolved
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q
Outdated
Show resolved
Hide resolved
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q
Outdated
Show resolved
Hide resolved
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q
Outdated
Show resolved
Hide resolved
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
Outdated
Show resolved
Hide resolved
zabetak
left a comment
There was a problem hiding this comment.
Now that the fix in TezCompiler is in place the only thing pending is getting the semijoin_stats_missing_colstats.q test right so that it actually goes through the problematic code.
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q
Outdated
Show resolved
Hide resolved
963e305 to
f9074cc
Compare
…unavailable Check column stats availability before passing useColStats=true in TezCompiler.removeSemijoinOptimizationByBenefit() to avoid NPE when column statistics are not present.
In order to trigger the NPE there are two important requirements: * `hive.stats.fetch.column.stats` must be set to false * the big table must be partitioned All configuration properties that are not necessary or using the default values are dropped.
|



…re unavailable
Added null checks before iterating over column statistics in:
What changes were proposed in this pull request?
This PR adds null checks before iterating over column statistics in three locations to prevent NullPointerException:
stats.getColumnStats()before the for-each loop, defaulting to empty list when nullThe root cause is that
Statistics.getColumnStats()returnsnull(not an empty list) when no column statistics are available:Why are the changes needed?
Query compilation fails with NullPointerException during semijoin optimization when column statistics are unavailable:
This issue is particularly prevalent with:
Large TPC-DS datasets (100GB+) where statistics collection may be incomplete
Tables where column-level statistics have not been computed
Complex queries where intermediate operators lack column statistics
The fix ensures graceful handling when column statistics are unavailable, allowing the optimizer to continue using row-based statistics instead of failing.
Does this PR introduce any user-facing change?
No. This is a bug fix that prevents query compilation failures. Previously failing queries will now compile and execute successfully. There is no change to query results or behavior for queries that were already working.
How was this patch tested?
To reproduce the original issue: