Propagate SortedValue to Polars set_sorted() on index column #3049

Open
G-D-Petrov wants to merge 4 commits into master from gpetrov/set_order

@G-D-Petrov
Collaborator

Reference Issues/PRs

Monday ticket ref: 10659398340

What does this implement or fix?

When reading data from ArcticDB with output_format="polars", the resulting Polars DataFrame had no sorted flag set on the index column, even though ArcticDB tracks sort order internally via SortedValue in the C++ StreamDescriptor. As a result, Polars could not take advantage of optimized operations (e.g., fast joins, binary search filtering) that rely on knowing a column is sorted.

Changes

C++ layer — propagate SortedValue through the read path:

  • cpp/arcticdb/entity/read_result.hpp — Added a SortedValue sorted field to both NodeReadResult and ReadResult, defaulting to UNKNOWN.
  • cpp/arcticdb/pipeline/pipeline_utils.hpp — In create_python_read_result, extracts the sorted value from the stream descriptor and passes it into NodeReadResult and the top-level ReadResult.
  • cpp/arcticdb/python/adapt_read_dataframe.hpp — Includes sorted (cast to int) in the tuple returned to Python from adapt_read_df.
  • cpp/arcticdb/python/python_utils.hpp — Includes sorted in the tuples produced by node_results_to_python_list and adapt_read_dfs.

Python layer — receive sorted value and apply Polars flag:

  • python/arcticdb/version_store/read_result.py: ReadResult and NodeReadResult now accept and store a sorted parameter (converted from int to SortedValue enum).
  • python/arcticdb/version_store/_store.py:
    • Added _get_index_col_from_norm() helper to extract the index column name from normalization metadata (handles pandas DataFrames, Series, and experimental Arrow formats).
    • Added _apply_polars_sorted_flag_to_index() static method that calls polars.col(...).set_sorted() on the index column when the stored sort order is ASCENDING or DESCENDING.
    • Integrated the flag application into _adapt_read_res() so it runs automatically on every read.
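
The Python-side flow described in this list can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `SortOrder` enum, its integer values, and the `apply_sorted_flag` helper are hypothetical stand-ins for `SortedValue` and `_apply_polars_sorted_flag_to_index`. Only `pl.col(...).set_sorted()`, `DataFrame.with_columns()` and `Series.flags` are actual Polars calls.

```python
from enum import IntEnum

try:
    import polars as pl
except ImportError:  # Polars is optional, mirroring the `pl is None` guard in the PR
    pl = None


class SortOrder(IntEnum):
    """Hypothetical stand-in for SortedValue; the integer values are assumed."""
    UNKNOWN = 0
    UNSORTED = 1
    ASCENDING = 2
    DESCENDING = 3


def apply_sorted_flag(data, index_col, sort_order):
    """Return `data` with the sorted flag set on `index_col` when the order is known."""
    if pl is None or not isinstance(data, pl.DataFrame):
        return data  # pandas and other outputs pass through untouched
    if index_col is None or index_col not in data.columns:
        return data  # e.g. a RangeIndex that is not physically stored
    order = SortOrder(sort_order)  # accepts the raw int handed over from C++
    if order is SortOrder.ASCENDING:
        return data.with_columns(pl.col(index_col).set_sorted())
    if order is SortOrder.DESCENDING:
        return data.with_columns(pl.col(index_col).set_sorted(descending=True))
    return data  # UNKNOWN and UNSORTED leave the flags alone


if pl is not None:
    df = pl.DataFrame({"__index__": [1, 2, 3], "price": [9.0, 7.5, 8.1]})
    flagged = apply_sorted_flag(df, "__index__", 2)
    # Only the index column is expected to carry the flag.
    print(flagged["__index__"].flags)
    print(flagged["price"].flags)
```

Note that `set_sorted()` does not verify the data; it only records the caller's claim, which is why the flag should be set only when the stored sort order is trustworthy.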

Tests:

  • python/tests/unit/arcticdb/version_store/test_polars_set_sorted.py — New test file with 6 tests covering:
    • Unnamed DatetimeIndex → __index__ column gets SORTED_ASC
    • Named DatetimeIndex → named column gets SORTED_ASC
    • RangeIndex (not physically stored) → no sorted flag set
    • Pandas output path still works without issues
    • Stage + finalize workflow preserves sorted flag
    • Value columns do not get the sorted flag (only the index does)

Benchmark Results

Before vs After (stock ArcticDB 6.13.0 vs branch dev build)

The benchmark script reads 1M rows as Polars and measures various operations. On stock ArcticDB, the index has no sorted flag (SORTED_ASC: False). With the branch's dev build, the sorted flag is automatically applied (SORTED_ASC: True).

Comparing stock (6.13.0) vs branch (dev) on key index operations:

| Benchmark | Stock 6.13.0 (ms) | Branch dev (ms) | Speedup |
|---|---|---|---|
| sort_by_index | 18.90 | 3.06 | 6.18x |
| unique_index | 3.71 | 0.69 | 5.38x |
| upsample | 25.20 | 8.32 | 3.03x |
| filter_range | 1.93 | 1.89 | 1.02x |
| filter_after | 1.30 | 1.27 | 1.02x |
| group_by_dynamic | 9.40 | 8.83 | 1.06x |
| rolling_mean | 25.59 | 24.87 | 1.03x |

The branch delivers significant speedups on sort_by_index (6.2x), unique_index (5.4x), and upsample (3x). These are operations where Polars can skip work entirely when it knows the column is sorted.
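
The Speedup column is the stock time divided by the branch time. A short script, using only the timings reported in the index benchmark, reproduces it:

```python
# (stock ms, branch ms) copied from the index benchmark results
timings = {
    "sort_by_index": (18.90, 3.06),
    "unique_index": (3.71, 0.69),
    "upsample": (25.20, 8.32),
    "filter_range": (1.93, 1.89),
    "filter_after": (1.30, 1.27),
    "group_by_dynamic": (9.40, 8.83),
    "rolling_mean": (25.59, 24.87),
}


def speedup(stock_ms, branch_ms):
    """Ratio of stock time to branch time; values above 1 mean the branch is faster."""
    return stock_ms / branch_ms


speedups = {name: round(speedup(*pair), 2) for name, pair in timings.items()}

# Print the benchmarks from largest to smallest gain.
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {ratio:.2f}x")
```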

Branch dev build: detailed benchmark (Polars 1.35.2, 1M rows, 10 iters)

Note: In the branch build, "Current (no flag)" already has SORTED_ASC: True on the index (the branch's feature), so applying set_sorted() again is a no-op. This confirms the feature is working automatically.

Part 2: Value Columns — Future Work

These measure the potential benefit of also setting set_sorted() on value columns, which would require populating FieldStats.sorted_ in C++. Not implemented in this branch.

| Benchmark | Current/branch (ms) | + index + values (ms) | Speedup | Notes |
|---|---|---|---|---|
| sort_by_price | 22.71 | 3.42 | 6.63x | Big win |
| filter_price_range | 0.91 | 0.79 | 1.14x | |
| unique_price | 4.07 | 0.70 | 5.82x | Big win |
| search_sorted_price | 0.11 | 0.10 | 1.10x | |
| join_asof_on_price | 5.81 | 6.42 | 0.90x | |
| sort_by_volume | 20.77 | 4.19 | 4.96x | Big win |
| unique_volume | 3.31 | 0.64 | 5.14x | Big win |

Overhead

The set_sorted() call adds negligible overhead (~0.3 ms) relative to the read cost (~399 ms debug build):

| Operation | Time (ms) |
|---|---|
| set_sorted (index only) | 0.28 |
| set_sorted (index + 2 cols) | 0.65 |

-from arcticdb_ext.version_store import PandasOutputFrame
+from arcticdb_ext.version_store import PandasOutputFrame, SortedValue
 from arcticdb.version_store._normalization import FrameData

Contributor

sorted shadows the Python built-in sorted(). This can cause subtle bugs in any method that needs to call the built-in while also accessing self.sorted. Suggest renaming to sort_order or sort_value throughout (including the ReadResult counterpart and the C++-side parameter names).

Suggested change
def __init__(self, sym, frame_data, norm, sort_order=SortedValue.UNKNOWN):
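
To make the hazard concrete: the problem is the parameter name rather than the attribute, because inside a function whose parameter is called `sorted` the built-in is hidden. The two classes below are hypothetical and exist only to show the failure and the effect of the rename.

```python
class NodeResultShadowing:
    """Hypothetical class whose parameter hides the built-in sorted()."""

    def __init__(self, columns, sorted=0):
        self.sort_flag = sorted
        self.error = None
        try:
            # `sorted` here is the int argument, not the built-in function.
            self.columns = sorted(columns)
        except TypeError as exc:
            self.columns = None
            self.error = str(exc)


class NodeResultRenamed:
    """Same shape, with the parameter renamed as suggested in the review."""

    def __init__(self, columns, sort_order=0):
        self.sort_order = sort_order
        self.columns = sorted(columns)  # the built-in is still reachable


broken = NodeResultShadowing(["b", "a"])
fixed = NodeResultRenamed(["b", "a"])
print(broken.error)
print(fixed.columns)
```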

assert isinstance(result, pl.DataFrame)
assert result["__index__"].flags["SORTED_ASC"] is True
assert result["sorted_val"].flags["SORTED_ASC"] is False
assert result["another"].flags["SORTED_ASC"] is False

Contributor

The DESCENDING code path in _apply_polars_sorted_flag_to_index is exercised by production code but not covered by any test here. A simple case — write data with a descending datetime index (or use update() with sort_on_index=False to produce a reverse-sorted symbol) — would verify that result["__index__"].flags["SORTED_DESC"] is True and SORTED_ASC is False. Without this, a regression in the descending=True branch would go undetected.

elif input_type in ("df", "series"):
    common = norm.df.common if input_type == "df" else norm.series.common
    index_type = common.WhichOneof("index_type")
    if index_type == "index" and common.index.is_physically_stored:

Contributor

WhichOneof("index_type") can also return "multi_index" for MultiIndex DataFrames. In that case index_type != "index" and the function returns None, so no sorted flag is set — which is probably the right conservative choice since multi-level index sorting is more complex. Worth adding a comment here (or in the PR description) acknowledging this as a known limitation so future maintainers don't think it was accidentally missed.

@claude
Contributor

claude Bot commented Apr 20, 2026

ArcticDB Code Review Summary

PR propagates ArcticDB's internal SortedValue through the read path to call Polars' set_sorted() on the index column, enabling Polars to use sort-aware optimisations automatically. All previously flagged issues have been resolved; the "Fix tests" commit (72e11c9) brings the branch to a clean state.

API & Compatibility

  • ✅ No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
  • ✅ Deprecation protocol followed for removed/renamed parameters
  • ✅ On-disk format unchanged
  • ✅ Protobuf schema backwards-compatible
  • ✅ Key types and ValueType enum unchanged

Memory & Safety

  • ✅ RAII used for all resource management
  • ✅ No use-after-move or use-after-free patterns
  • ✅ Buffer size calculations correct
  • ✅ GIL correctly managed at Python-C++ boundary
  • ✅ No accidental copies of large objects

Correctness

  • sorted parameter renamed to sort_order throughout ReadResult, NodeReadResult, and _adapt_read_res — Python built-in shadowing resolved
  • _get_index_col_from_norm documents why multi_index is intentionally excluded
  • read_result_from_single_frame correctly propagates sorted value via create_python_read_result (no additional change required)
  • batch_read path flows through _post_process_dataframe → _adapt_read_res → _apply_polars_sorted_flag_to_index, so the sorted flag is applied for batch reads too

Code Quality

  • ✅ No duplicated logic
  • ✅ C++ and Python implementations consistent
  • ✅ Tests not duplicating existing scenarios

Testing

  • ✅ Main behavioural paths covered (unnamed index, named index, RangeIndex, pandas output)
  • SortedValue.DESCENDING code path tested (test_descending_sort_order_sets_sorted_desc)
  • SortedValue.UNKNOWN code path tested (test_unknown_sort_order_does_not_set_sorted_flag)
  • ✅ Integration test for stage + finalize workflow included
  • ✅ Tuple length assertion in test_library_tool.py updated to match 7-element tuple

Build & Dependencies

  • ✅ No new source files requiring CMakeLists.txt changes
  • ✅ No dependency changes
  • ✅ Cross-platform compatibility maintained

Security

  • ✅ No hardcoded credentials or secrets
  • ✅ No buffer overflow potential

PR Title & Description

  • ✅ Title reads "Propagate SortedValue to Polars set_sorted() on index column" — accurate and concise
  • ✅ Description explains what changed and why, with benchmark results
  • ✅ PR labelled enhancement and patch

Documentation

  • ✅ New behaviour documented in library.py read and read_batch docstrings (Polars sorted-flag note added to output_format parameter)
  • docs/claude/python/NATIVE_VERSION_STORE.md updated to reflect sort_order field propagation and _apply_polars_sorted_flag_to_index logic

@G-D-Petrov added the patch (Small change, should increase patch version) and enhancement (New feature or request) labels Apr 20, 2026
@G-D-Petrov changed the title from "Apply set_order for polars dataframes" to "Propagate SortedValue to Polars set_sorted() on index column" Apr 20, 2026
assert result[col].flags["SORTED_ASC"] is False, f"Column {col} should not have SORTED_ASC"


def test_sorted_flag_not_set_for_pandas_output(lmdb_library):

Collaborator

Do we need this test? We already have plenty of other coverage of pandas reads.


result = NativeVersionStore._apply_polars_sorted_flag_to_index(data, read_result)

assert result["__index__"].flags["SORTED_ASC"] is False

Collaborator

A meta point worth discussing is whether we should treat UNKNOWN as SORTED_ASC.

For all other purposes we treat UNKNOWN (written by old ArcticDB versions, before sortedness was introduced) as sorted (e.g. for update we use this code).

I'm personally in favor of treating UNKNOWN consistently and setting the sorted flag even when the sort order is UNKNOWN.

We should instead add a test for UNSORTED.
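
The two policies under discussion differ only in how UNKNOWN is mapped. A small sketch makes the difference explicit; the `SortOrder` enum and `polars_flag_for` helper are hypothetical and not part of ArcticDB.

```python
from enum import Enum


class SortOrder(Enum):
    """Hypothetical stand-in for SortedValue."""
    UNKNOWN = "unknown"
    UNSORTED = "unsorted"
    ASCENDING = "ascending"
    DESCENDING = "descending"


def polars_flag_for(sort_order, treat_unknown_as_ascending=False):
    """Return 'asc', 'desc', or None (meaning: leave the Polars flags unset)."""
    if sort_order is SortOrder.ASCENDING:
        return "asc"
    if sort_order is SortOrder.DESCENDING:
        return "desc"
    if sort_order is SortOrder.UNKNOWN and treat_unknown_as_ascending:
        return "asc"  # the policy proposed in this comment
    return None  # UNSORTED never gets a flag under either policy


for order in SortOrder:
    conservative = polars_flag_for(order)
    consistent = polars_flag_for(order, treat_unknown_as_ascending=True)
    print(f"{order.name:10s} conservative={conservative} consistent={consistent}")
```

Under either policy UNSORTED must leave the flags unset, which is the case the suggested test would pin down.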


 class ReadResult:
-    def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results):
+    def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results, sort_order=SortedValue.UNKNOWN):

Collaborator

I'm slightly worried that, by using a default value, we might have missed some place where this is used incorrectly.
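
One way to address this concern, sketched here under the assumption that every call site can be updated, is to make the argument keyword-only and required, so a forgotten call site fails loudly instead of silently getting UNKNOWN. `ReadResultSketch` is a hypothetical, trimmed-down class, not the real `ReadResult`.

```python
class ReadResultSketch:
    """Hypothetical, trimmed-down ReadResult with a mandatory sort order."""

    def __init__(self, version, frame_data, *, sort_order):
        self.version = version
        self.frame_data = frame_data
        self.sort_order = sort_order


def build_without_sort_order():
    """Simulates a call site that was not updated to pass sort_order."""
    try:
        ReadResultSketch(1, None)
    except TypeError as exc:
        return str(exc)
    return None


# The stale call site raises; the updated one works.
print(build_without_sort_order())
print(ReadResultSketch(1, None, sort_order="ASCENDING").sort_order)
```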

Output format for the returned dataframe.
If `None`, uses the output format from the `Library` instance.
See `OutputFormat` documentation for details on available formats.
When using ``POLARS`` output format, the index column (if physically stored) will automatically have

Collaborator

We should probably move these docs to the OutputFormat docs. The same description applies to methods like head and tail, all of which reference the OutputFormat docs.

return True


def _get_index_col_from_norm(norm):

Collaborator

This will get drastically simpler after #3037 as the index column will always be first.


@staticmethod
def _apply_polars_sorted_flag_to_index(data, read_result):
    if pl is None or not isinstance(data, pl.DataFrame):

Collaborator

Nit: I think it would be slightly cleaner if we call this only for polars DataFrames here, then this check would not be needed.

elif input_type in ("df", "series"):
    common = norm.df.common if input_type == "df" else norm.series.common
    index_type = common.WhichOneof("index_type")
    # multi_index intentionally excluded: multi-level sorting semantics are more complex

Collaborator

We probably still want to include multi_index, as long as the first index level is sorted.
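
A rough sketch of what this suggestion could look like, assuming the stored sort order describes the first level of a MultiIndex (the premise of this comment, not something confirmed in the thread). `IndexInfo` and `index_col_for_sorted_flag` are hypothetical simplifications; the real normalization metadata is a protobuf message with a different shape.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class IndexInfo:
    """Hypothetical, simplified view of the normalization metadata."""
    kind: str                       # "index" or "multi_index"
    names: Sequence[str]            # stored column name for each index level
    is_physically_stored: bool = True


def index_col_for_sorted_flag(info: IndexInfo, include_multi_index: bool = True) -> Optional[str]:
    """Pick the column that should receive the sorted flag, or None."""
    if not info.names:
        return None
    if info.kind == "index":
        # A RangeIndex that is not physically stored has no column to flag.
        return info.names[0] if info.is_physically_stored else None
    if info.kind == "multi_index" and include_multi_index:
        # Only the first level can be claimed sorted; deeper levels are not.
        return info.names[0]
    return None


print(index_col_for_sorted_flag(IndexInfo("index", ["__index__"])))
print(index_col_for_sorted_flag(IndexInfo("multi_index", ["date", "ticker"])))
print(index_col_for_sorted_flag(IndexInfo("index", ["__index__"], is_physically_stored=False)))
```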

pyuser_meta,
multi_key_meta,
std::move(node_results),
static_cast<int>(ret.sorted)

Collaborator

Why cast to int? Can't we expose SortedValue in python bindings?
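
For context, the int round trip looks roughly like this on the Python side. `SortedValueSketch`, its integer values, and `sort_order_from_wire` are assumptions made for illustration; the real enum lives in the C++ extension. Exposing the enum directly through the bindings would remove both the conversion step and the need to decide what an unrecognised integer means.

```python
from enum import IntEnum


class SortedValueSketch(IntEnum):
    """Hypothetical mirror of the C++ enum; the numeric values are assumed."""
    UNKNOWN = 0
    UNSORTED = 1
    ASCENDING = 2
    DESCENDING = 3


def sort_order_from_wire(raw):
    """Convert the int received from C++ back into an enum member."""
    try:
        return SortedValueSketch(raw)
    except ValueError:
        # An unexpected value is treated as UNKNOWN rather than raising.
        return SortedValueSketch.UNKNOWN


print(sort_order_from_wire(2).name)
print(sort_order_from_wire(99).name)
```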

 NodeReadResult(
         const StreamId& symbol, OutputFrame&& frame_data,
-        arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta
+        arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta, SortedValue sorted = SortedValue::UNKNOWN

Collaborator

Nit: again, I'm slightly worried about this having a default value; we might have missed some constructions that should specify a sorted value.

This only matters if we decide to treat UNKNOWN as sorted. Otherwise it's probably fine.

Labels

enhancement (New feature or request), patch (Small change, should increase patch version)
