Skip to content

Add NaN and NULL count to MIN/MAX stats#3139

Open
markovskipetar wants to merge 2 commits into
masterfrom
add_nan_and_nat_counter_to_col_stats
Open

Add NaN and NULL count to MIN/MAX stats#3139
markovskipetar wants to merge 2 commits into
masterfrom
add_nan_and_nat_counter_to_col_stats

Conversation

@markovskipetar
Copy link
Copy Markdown
Collaborator

@markovskipetar markovskipetar commented May 28, 2026

Add NaN and null counts to MINMAX column stats

Summary

Extends the MINMAX column statistic to also record per-segment counts of NaN (floating point) and null values (NaT timestamps, plus sparse-map gaps such as those from Arrow validity bitmaps), alongside the existing min/max values.

Changes

  • Proto: Added NAN_COUNT_V1 = 3 and NULL_COUNT_V1 = 4 entries to the ColumnStatsType enum in column_stats.proto.
  • Aggregation (unsorted_aggregation.{hpp,cpp}): MinMaxAggregatorData now tracks nan_count_ (floating-point NaNs) and null_count_ (NaT cells plus sparse-map gaps). For sparse columns, the gap count last_row()+1 - row_count() is added to null_count_ so validity-bitmap nulls that the dense iterator never visits are still counted. MinMaxAggregator takes two additional output column names and finalize produces 4 output columns (MIN, MAX, NAN_COUNT, NULL_COUNT) instead of 2.
  • Pipeline (column_stats.cpp): MINMAX external stat now maps to the four internal stat types; v1_NAN_COUNT / v1_NULL_COUNT operator strings and segment column naming wired through.
  • Tests: Added test_column_stats_nan_and_null_counts covering multi-segment symbols with mixed NaN, NaT, and valid values across floating-point and timestamp columns. Widened the shared assert_stats_equal helper to subselect the received frame to the columns the test cares about (so existing min/max-only tests don't have to spell out all-zero count columns), and updated test_column_stats_header_metadata for the 4-entries-per-column header layout.

Backwards compatibility

The new fields are written under new proto enum values (NAN_COUNT_V1=3, NULL_COUNT_V1=4). Existing MINMAX stats written by older clients (only MIN_V1 + MAX_V1) remain readable. Stats written by this version cannot be read by older clients that don't recognise the new enum values or that expect exactly 2 output columns from MinMaxAggregatorData::finalize.

@github-actions
Copy link
Copy Markdown

Label error. Requires exactly 1 of: patch, minor, major. Found:

pl.Series("v1_NAT_COUNT(ts_col)", [0, 1, 2, 1], dtype=pl.UInt64),
)

column_stats = lib.read_column_stats(sym)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test only covers writing-then-reading with the same (new) client. Please add (or update an existing test) so that the all-NaN / all-NaT segments verify the new v1_NAN_COUNT / v1_NAT_COUNT values directly (e.g. extend test_column_stats_only_nat_values), and add a test for backwards compatibility — column stats written before this change (where only v1_MIN/v1_MAX columns exist) must still be readable, droppable, and mergeable by the new code path.

return {ColumnStatTypeInternal::MIN_V1, ColumnStatTypeInternal::MAX_V1};
return {ColumnStatTypeInternal::MIN_V1,
ColumnStatTypeInternal::MAX_V1,
ColumnStatTypeInternal::NAN_COUNT_V1,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

external_to_internal(MINMAX) now unconditionally returns 4 internal stat types. drop() calls this to construct the list of column names to remove (v1_MIN, v1_MAX, v1_NAN_COUNT, v1_NAT_COUNT). For column stats segments that were written by an older client (only v1_MIN and v1_MAX columns exist), dropping will produce names for columns that aren't in the segment.

Please verify what the downstream consumer of dropped_names does when asked to drop a non-existent column — if it raises, this is a forward-compatibility break that needs handling; if it silently ignores, please add a test that creates column stats with the old format and then drops them with the new client.

seg.add_column(scalar_field(min_col->type().data_type(), output_column_names[0].value), min_col);
seg.add_column(scalar_field(max_col->type().data_type(), output_column_names[1].value), max_col);
seg.add_column(scalar_field(DataType::UINT64, output_column_names[2].value), nan_count_col);
seg.add_column(scalar_field(DataType::UINT64, output_column_names[3].value), nat_count_col);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finalize now unconditionally writes 4 columns (min, max, nan_count, nat_count) and 4 header entries for the MINMAX stat. This changes the on-disk shape of every MINMAX column-stats segment.

Two concerns that should be addressed before merge:

  1. Backwards-compatibility test missing. There is no test that creates column stats with a pinned older ArcticDB version and then reads/drops/recreates them with this new client. Old segments only contain 2 stat columns; merge_column_stats_segments and ColumnStats::ColumnStats(header, tsd) should be exercised against that shape.
  2. Old client reading new data. Although the proto comment relies on the UNKNOWN = 0 fallback in proto3 for forward-compat, the new client also adds two extra real columns to the segment. Please confirm that an older release reading a segment with the extra v1_NAN_COUNT/v1_NAT_COUNT fields doesn't fail descriptor/field-count assertions, and document the result in the PR description.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 28, 2026

ArcticDB Code Review Summary

API & Compatibility

  • On-disk format change is now called out in the updated PR description (Backwards compatibility section), and the proto enum is additive.
  • PR description references the stale NAT_COUNT_V1 name. The latest commit (change nat_count to null_count) renamed enum value 4 to NULL_COUNT_V1, but the description still says "NAN_COUNT_V1=3, NAT_COUNT_V1=4". Update the description to match the merged identifier.
  • Forward compatibility with old data still not exercised. external_to_internal(MINMAX) returns 4 internal types and MinMaxAggregatorData::finalize unconditionally produces 4 columns; behaviour against column stats segments written by previous releases (only v1_MIN / v1_MAX) still needs a test (read, drop, merge).

Testing

  • New sparse-map gap counting in MinMaxAggregatorData::aggregate is untested. The new is_sparse() block added in this commit (unsorted_aggregation.cpp lines 28–36) increments null_count_ from last_row() + 1 - row_count(). No test in this PR drives a sparse column through this path — test_column_stats_nan_and_null_counts uses dense pandas frames; test_column_stats_dynamic_schema_missing_data covers whole-column-missing slices (a different mechanism whose expected stats are None, not non-zero counts). Add a test producing a sparse segment (Arrow input with a validity bitmap) and verifying v1_NULL_COUNT equals the gap count, including the trailing-null edge case.
  • ❌ No backwards-compatibility test that reads/drops/merges pre-existing column stats segments with the new 4-column code path.
  • ❌ Existing test_column_stats_only_nat_values / test_column_stats_only_nan_values still don't assert the new v1_NAN_COUNT / v1_NULL_COUNT columns. The new assert_stats_equal subselects received down to the columns listed in expected, so these tests silently no longer validate the count columns at all — the all-NaN / all-NaT degenerate cases are the most natural places to verify the counts.

PR Title and Description

  • ✅ PR description is now filled in; user-visible 2→4 column change is acknowledged.
  • ❌ No enhancement (or other) label set. This is user-visible and needs release notes; no-release-notes is not appropriate.
  • ❌ Description should be updated to reflect the NULL_COUNT_V1 rename (see API & Compatibility above) and to explain the float-vs-timestamp semantic split (in-band NaN → v1_NAN_COUNT, in-band NaT → v1_NULL_COUNT, sparse/validity-bitmap nulls → v1_NULL_COUNT for both).

Documentation

  • ❌ No update to user-facing column-stats docs or tutorials describing the new v1_NAN_COUNT / v1_NULL_COUNT columns. Please update the relevant docstrings / mkdocs page so users know these columns are present in the output and what the NAN-vs-NULL split means.

@markovskipetar markovskipetar changed the title Add NaN and NaT count to MIN/MAX stats Add NaN and NULL count to MIN/MAX stats May 29, 2026
Comment on lines +28 to +36
// Sparse-map gaps are real nulls (e.g. from Arrow validity bitmaps) that the dense
// for_each below never visits. Count them from metadata so they reach null_count_.
if (input_column.column_->is_sparse()) {
const auto sparse_gap_count =
input_column.column_->last_row() + 1 - input_column.column_->row_count();
if (sparse_gap_count > 0) {
null_count_ += static_cast<uint64_t>(sparse_gap_count);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sparse-map gap counting is a new behavioural code path, but no test in this PR exercises it — test_column_stats_nan_and_null_counts only uses dense pandas frames with in-band NaN/NaT, and test_column_stats_dynamic_schema_missing_data covers slices where the column is wholly absent (a different mechanism — the column isn't aggregated at all, so the expected stats are None rather than non-zero counts). Please add a test that produces a single segment with a sparse column containing gaps (e.g. via an Arrow input with a validity bitmap, or whatever existing fixture produces column->is_sparse() == true with last_row() + 1 > row_count()) and asserts the resulting v1_NULL_COUNT matches the number of gaps.

Also, please double-check the formula: last_row() + 1 - row_count() assumes last_row() (i.e. last_logical_row_) reflects the full logical length of the slice. If the trailing rows of a segment are null, last_logical_row_ may be set to the last present row rather than the segment's logical length, which would under-count trailing nulls. The test should cover that case explicitly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant