Add NaN and NULL count to MIN/MAX stats #3139
Label error. Requires exactly 1 of: patch, minor, major. Found:
| pl.Series("v1_NAT_COUNT(ts_col)", [0, 1, 2, 1], dtype=pl.UInt64), | ||
| ) | ||
|
|
||
| column_stats = lib.read_column_stats(sym) |
The new test only covers writing-then-reading with the same (new) client. Please add (or update an existing test) so that the all-NaN / all-NaT segments verify the new v1_NAN_COUNT / v1_NAT_COUNT values directly (e.g. extend test_column_stats_only_nat_values), and add a test for backwards compatibility — column stats written before this change (where only v1_MIN/v1_MAX columns exist) must still be readable, droppable, and mergeable by the new code path.
| return {ColumnStatTypeInternal::MIN_V1, ColumnStatTypeInternal::MAX_V1}; | ||
| return {ColumnStatTypeInternal::MIN_V1, | ||
| ColumnStatTypeInternal::MAX_V1, | ||
| ColumnStatTypeInternal::NAN_COUNT_V1, |
external_to_internal(MINMAX) now unconditionally returns 4 internal stat types. drop() calls this to construct the list of column names to remove (v1_MIN, v1_MAX, v1_NAN_COUNT, v1_NAT_COUNT). For column stats segments that were written by an older client (only v1_MIN and v1_MAX columns exist), dropping will produce names for columns that aren't in the segment.
Please verify what the downstream consumer of dropped_names does when asked to drop a non-existent column. If it raises, this is a backwards-compatibility break (new client, old data) that needs handling; if it silently ignores the missing names, please add a test that creates column stats in the old format and then drops them with the new client.
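One way to make the drop path tolerant of older segments is to intersect the generated names with the columns actually present before dropping. The sketch below is only an illustration: it models a column-stats segment as a plain dict, and `minmax_column_names` and `drop_present_stats` are hypothetical helpers, not ArcticDB functions. Only the `v1_<STAT>(<column>)` naming pattern is taken from the test snippet in this PR.

```python
# Hypothetical model: a column-stats segment is a dict of column name -> values.
MINMAX_OPERATORS = ("v1_MIN", "v1_MAX", "v1_NAN_COUNT", "v1_NAT_COUNT")


def minmax_column_names(column):
    """Names the new client would generate for a MINMAX stat on `column`."""
    return [f"{operator}({column})" for operator in MINMAX_OPERATORS]


def drop_present_stats(segment, column):
    """Drop only the MINMAX stat columns that exist.

    Returns (remaining_segment, dropped_names). Names that the new client
    generates but that are absent from the segment are skipped, not raised on.
    """
    requested = minmax_column_names(column)
    dropped = [name for name in requested if name in segment]
    remaining = {k: v for k, v in segment.items() if k not in dropped}
    return remaining, dropped


# Segment shaped like one an older client would have written: min and max only.
old_segment = {
    "v1_MIN(col)": [1.0, 4.0],
    "v1_MAX(col)": [3.0, 9.0],
    "v1_MIN(other)": [0, 5],
}
remaining, dropped = drop_present_stats(old_segment, "col")
print(dropped)
print(sorted(remaining))
```

Whether the real consumer of dropped_names should filter like this or raise is a design decision for the PR; the sketch only shows the shape a regression test would need to cover.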
| seg.add_column(scalar_field(min_col->type().data_type(), output_column_names[0].value), min_col); | ||
| seg.add_column(scalar_field(max_col->type().data_type(), output_column_names[1].value), max_col); | ||
| seg.add_column(scalar_field(DataType::UINT64, output_column_names[2].value), nan_count_col); | ||
| seg.add_column(scalar_field(DataType::UINT64, output_column_names[3].value), nat_count_col); |
finalize now unconditionally writes 4 columns (min, max, nan_count, nat_count) and 4 header entries for the MINMAX stat. This changes the on-disk shape of every MINMAX column-stats segment.
Two concerns that should be addressed before merge:
- Backwards-compatibility test missing. There is no test that creates column stats with a pinned older ArcticDB version and then reads/drops/recreates them with this new client. Old segments only contain 2 stat columns; merge_column_stats_segments and ColumnStats::ColumnStats(header, tsd) should be exercised against that shape.
- Old client reading new data. Although the proto comment relies on the UNKNOWN = 0 fallback in proto3 for forward-compat, the new client also adds two extra real columns to the segment. Please confirm that an older release reading a segment with the extra v1_NAN_COUNT / v1_NAT_COUNT fields doesn't fail descriptor/field-count assertions, and document the result in the PR description.
ArcticDB Code Review Summary
API & Compatibility
Testing
PR Title and Description
Documentation
|
| // Sparse-map gaps are real nulls (e.g. from Arrow validity bitmaps) that the dense | ||
| // for_each below never visits. Count them from metadata so they reach null_count_. | ||
| if (input_column.column_->is_sparse()) { | ||
| const auto sparse_gap_count = | ||
| input_column.column_->last_row() + 1 - input_column.column_->row_count(); | ||
| if (sparse_gap_count > 0) { | ||
| null_count_ += static_cast<uint64_t>(sparse_gap_count); | ||
| } | ||
| } |
This sparse-map gap counting is a new behavioural code path, but no test in this PR exercises it — test_column_stats_nan_and_null_counts only uses dense pandas frames with in-band NaN/NaT, and test_column_stats_dynamic_schema_missing_data covers slices where the column is wholly absent (a different mechanism — the column isn't aggregated at all, so the expected stats are None rather than non-zero counts). Please add a test that produces a single segment with a sparse column containing gaps (e.g. via an Arrow input with a validity bitmap, or whatever existing fixture produces column->is_sparse() == true with last_row() + 1 > row_count()) and asserts the resulting v1_NULL_COUNT matches the number of gaps.
Also, please double-check the formula: last_row() + 1 - row_count() assumes last_row() (i.e. last_logical_row_) reflects the full logical length of the slice. If the trailing rows of a segment are null, last_logical_row_ may be set to the last present row rather than the segment's logical length, which would under-count trailing nulls. The test should cover that case explicitly.
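The trailing-null concern can be made concrete with a little arithmetic. The sketch below is a pure-Python model, not ArcticDB code: `gap_count` reproduces the formula from the diff, and `describe` is a hypothetical helper that evaluates it under the two possible meanings of last_row(). Which meaning applies in ArcticDB is exactly the open question raised above.

```python
def gap_count(last_row, row_count):
    """Formula from the diff: last_row() + 1 - row_count()."""
    return last_row + 1 - row_count


def describe(validity, last_row_tracks_logical_length):
    """Return (counted_gaps, true_nulls) for a validity bitmap.

    validity: list of bools, True where a value is present.
    last_row_tracks_logical_length: the assumption under test. If True,
    last_row is len(validity) - 1; if False, it is the index of the last
    present value (or -1 when nothing is present).
    """
    row_count = sum(validity)  # number of physically stored values
    true_nulls = len(validity) - row_count
    if last_row_tracks_logical_length:
        last_row = len(validity) - 1
    else:
        present = [i for i, ok in enumerate(validity) if ok]
        last_row = present[-1] if present else -1
    return gap_count(last_row, row_count), true_nulls


# Six logical rows: one interior null (index 1) and two trailing nulls.
validity = [True, False, True, True, False, False]
print(describe(validity, True))   # (3, 3): formula matches the true null count
print(describe(validity, False))  # (1, 3): the two trailing nulls are missed
```

A test input with both interior and trailing gaps, as here, distinguishes the two behaviours; an input with only interior gaps would pass under either.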
Add NaN and null counts to MINMAX column stats
Summary
Extends the MINMAX column statistic to also record per-segment counts of NaN (floating point) and null values (NaT timestamps, plus sparse-map gaps such as those from Arrow validity bitmaps), alongside the existing min/max values.
Changes
- Added NAN_COUNT_V1 = 3 and NULL_COUNT_V1 = 4 entries to the ColumnStatsType enum in column_stats.proto.
- unsorted_aggregation.{hpp,cpp}: MinMaxAggregatorData now tracks nan_count_ (floating-point NaNs) and null_count_ (NaT cells plus sparse-map gaps). For sparse columns, the gap count last_row() + 1 - row_count() is added to null_count_ so that validity-bitmap nulls the dense iterator never visits are still counted. MinMaxAggregator takes two additional output column names, and finalize produces 4 output columns (MIN, MAX, NAN_COUNT, NULL_COUNT) instead of 2.
- column_stats.cpp: the MINMAX external stat now maps to the four internal stat types; the v1_NAN_COUNT / v1_NULL_COUNT operator strings and segment column naming are wired through.
- Added test_column_stats_nan_and_null_counts, covering multi-segment symbols with mixed NaN, NaT, and valid values across floating-point and timestamp columns. Widened the shared assert_stats_equal helper to subselect the received frame to the columns the test cares about (so existing min/max-only tests don't have to spell out all-zero count columns), and updated test_column_stats_header_metadata for the 4-entries-per-column header layout.
Backwards compatibility
The new fields are written under new proto enum values (NAN_COUNT_V1 = 3, NULL_COUNT_V1 = 4). Existing MINMAX stats written by older clients (only MIN_V1 + MAX_V1) remain readable. Stats written by this version cannot be read by older clients that don't recognise the new enum values or that expect exactly 2 output columns from MinMaxAggregatorData::finalize.
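To illustrate what "remain readable" implies for a reader, the sketch below groups stat column names by data column and reports whether the count columns are present, so old two-column and new four-column layouts can be told apart. It is a hypothetical model only: `group_stats`, `has_counts`, and the example column names (including `start_index` and `end_index`) are assumptions for illustration, and the `v1_<STAT>(<column>)` pattern is inferred from the names used in this PR.

```python
import re

# Matches names such as "v1_MIN(price)" or "v1_NULL_COUNT(price)".
STAT_NAME = re.compile(r"^(v1_[A-Z_]+)\((.+)\)$")


def group_stats(column_names):
    """Map each data column to the set of stat operators present for it."""
    grouped = {}
    for name in column_names:
        match = STAT_NAME.match(name)
        if match is None:
            continue  # not a stat column (e.g. an index column)
        operator, column = match.groups()
        grouped.setdefault(column, set()).add(operator)
    return grouped


def has_counts(operators):
    """True when both count columns from the new layout are present."""
    return {"v1_NAN_COUNT", "v1_NULL_COUNT"} <= operators


# Hypothetical layouts: one as an older client might write, one as this PR writes.
old_layout = ["start_index", "end_index", "v1_MIN(price)", "v1_MAX(price)"]
new_layout = old_layout + ["v1_NAN_COUNT(price)", "v1_NULL_COUNT(price)"]

print(has_counts(group_stats(old_layout)["price"]))
print(has_counts(group_stats(new_layout)["price"]))
```

A reader built along these lines would treat missing count columns as "not recorded" rather than as zero, which keeps old segments distinguishable from segments that genuinely contain no NaN or null values.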