Propagate SortedValue to Polars set_sorted() on index column by G-D-Petrov · Pull Request #3049 · man-group/ArcticDB

G-D-Petrov · 2026-04-20T09:40:43Z

Reference Issues/PRs

Monday ticket ref: 10659398340

What does this implement or fix?

When reading data from ArcticDB with output_format="polars", the resulting Polars DataFrame had no sorted flag set on the index column, even though ArcticDB tracks sort order internally via SortedValue in the C++ StreamDescriptor. As a result, Polars could not take advantage of optimized operations (e.g., fast joins, binary search filtering) that rely on knowing a column is sorted.

Changes

C++ layer — propagate SortedValue through the read path:

cpp/arcticdb/entity/read_result.hpp — Added a SortedValue sorted field to both NodeReadResult and ReadResult, defaulting to UNKNOWN.
cpp/arcticdb/pipeline/pipeline_utils.hpp — In create_python_read_result, extracts the sorted value from the stream descriptor and passes it into NodeReadResult and the top-level ReadResult.
cpp/arcticdb/python/adapt_read_dataframe.hpp — Includes sorted (cast to int) in the tuple returned to Python from adapt_read_df.
cpp/arcticdb/python/python_utils.hpp — Includes sorted in the tuples produced by node_results_to_python_list and adapt_read_dfs.

Python layer — receive sorted value and apply Polars flag:

python/arcticdb/version_store/read_result.py — ReadResult and NodeReadResult now accept and store a sorted parameter (converted from int to SortedValue enum).
python/arcticdb/version_store/_store.py:
- Added _get_index_col_from_norm() helper to extract the index column name from normalization metadata (handles pandas DataFrames, Series, and experimental Arrow formats).
- Added _apply_polars_sorted_flag_to_index() static method that calls polars.col(...).set_sorted() on the index column when the stored sort order is ASCENDING or DESCENDING.
- Integrated the flag application into _adapt_read_res() so it runs automatically on every read.
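
A minimal sketch of the flag-selection step described above, using only the standard library. `SortOrder` is a hypothetical stand-in for the `SortedValue` enum, and its numeric values are assumptions for illustration rather than values taken from the C++ source. `set_sorted_kwargs` is likewise a hypothetical helper: it returns the keyword arguments that would be passed to Polars' `set_sorted()`, or `None` when no flag should be set.

```python
from enum import IntEnum
from typing import Optional


class SortOrder(IntEnum):
    """Hypothetical stand-in for SortedValue; the numeric values are assumed."""
    UNKNOWN = 0
    UNSORTED = 1
    ASCENDING = 2
    DESCENDING = 3


def set_sorted_kwargs(raw_sort_order: int) -> Optional[dict]:
    """Convert the int received from C++ and decide how to flag the index."""
    sort_order = SortOrder(raw_sort_order)  # raises ValueError on unexpected ints
    if sort_order is SortOrder.ASCENDING:
        return {"descending": False}
    if sort_order is SortOrder.DESCENDING:
        return {"descending": True}
    return None  # UNKNOWN and UNSORTED: leave the column unflagged


# One decision per possible sort order, keyed by enum member name.
decisions = {order.name: set_sorted_kwargs(int(order)) for order in SortOrder}
```

In this sketch only ASCENDING and DESCENDING produce a flag, matching the behaviour listed for `_apply_polars_sorted_flag_to_index()`.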

Tests:

python/tests/unit/arcticdb/version_store/test_polars_set_sorted.py — New test file with 6 tests covering:
- Unnamed DatetimeIndex → __index__ column gets SORTED_ASC
- Named DatetimeIndex → named column gets SORTED_ASC
- RangeIndex (not physically stored) → no sorted flag set
- Pandas output path still works without issues
- Stage + finalize workflow preserves sorted flag
- Value columns do not get the sorted flag (only the index does)

Benchmark Results

Before vs After (stock ArcticDB 6.13.0 vs branch dev build)

The benchmark script reads 1M rows as Polars and measures various operations. On stock ArcticDB, the index has no sorted flag (SORTED_ASC: False). With the branch's dev build, the sorted flag is automatically applied (SORTED_ASC: True).

Comparing stock (6.13.0) vs branch (dev) on key index operations:

| Benchmark | Stock 6.13.0 (ms) | Branch dev (ms) | Speedup |
| --- | --- | --- | --- |
| sort_by_index | 18.90 | 3.06 | 6.18x |
| unique_index | 3.71 | 0.69 | 5.38x |
| upsample | 25.20 | 8.32 | 3.03x |
| filter_range | 1.93 | 1.89 | 1.02x |
| filter_after | 1.30 | 1.27 | 1.02x |
| group_by_dynamic | 9.40 | 8.83 | 1.06x |
| rolling_mean | 25.59 | 24.87 | 1.03x |

The branch delivers significant speedups on sort_by_index (6.2x), unique_index (5.4x), and upsample (3x). These are operations where Polars can skip work entirely when it knows the column is sorted.
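
To illustrate what "skip work entirely" means, here is a toy model in plain Python. It is not how Polars is implemented; `FlaggedColumn` and its methods are hypothetical names used only to show why a trusted sorted flag turns a sort into a no-op and lets deduplication run as a single linear pass.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlaggedColumn:
    """Toy column carrying a 'known sorted ascending' flag (not Polars internals)."""
    values: List[int]
    sorted_asc: bool = False

    def sort(self) -> "FlaggedColumn":
        if self.sorted_asc:
            return self  # nothing to do: the flag guarantees the order
        return FlaggedColumn(sorted(self.values), sorted_asc=True)

    def unique(self) -> List[int]:
        if self.sorted_asc:
            # Duplicates are adjacent in sorted data, so one linear pass suffices.
            out: List[int] = []
            for v in self.values:
                if not out or out[-1] != v:
                    out.append(v)
            return out
        # General path: hash-based dedup (ordered here only for a stable result).
        return sorted(set(self.values))


flagged = FlaggedColumn([1, 1, 2, 3, 3], sorted_asc=True)
unflagged = FlaggedColumn([1, 1, 2, 3, 3])
```

Both columns hold the same data, but only `flagged.sort()` returns without touching the values.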

Branch dev build: detailed benchmark (Polars 1.35.2, 1M rows, 10 iters)

Note: In the branch build, "Current (no flag)" already has SORTED_ASC: True on the index (the branch's feature), so applying set_sorted() again is a no-op. This confirms the feature is working automatically.

Part 2: Value Columns — Future Work

These measure the potential benefit of also setting set_sorted() on value columns, which would require populating FieldStats.sorted_ in C++. Not implemented in this branch.

| Benchmark | Current/branch (ms) | + index + values (ms) | Speedup | Notes |
| --- | --- | --- | --- | --- |
| sort_by_price | 22.71 | 3.42 | 6.63x | Big win |
| filter_price_range | 0.91 | 0.79 | 1.14x | |
| unique_price | 4.07 | 0.70 | 5.82x | Big win |
| search_sorted_price | 0.11 | 0.10 | 1.10x | |
| join_asof_on_price | 5.81 | 6.42 | 0.90x | |
| sort_by_volume | 20.77 | 4.19 | 4.96x | Big win |
| unique_volume | 3.31 | 0.64 | 5.14x | Big win |

Overhead

The set_sorted() call adds negligible overhead (~0.3 ms for the index alone, under 0.1% of the ~399 ms read cost measured on a debug build):

| Operation | Time (ms) |
| --- | --- |
| `set_sorted` (index only) | 0.28 |
| `set_sorted` (index + 2 cols) | 0.65 |

claude · 2026-04-20T09:44:53Z

-from arcticdb_ext.version_store import PandasOutputFrame
+from arcticdb_ext.version_store import PandasOutputFrame, SortedValue
 from arcticdb.version_store._normalization import FrameData



sorted shadows the Python built-in sorted(). This can cause subtle bugs in any method that needs to call the built-in while also accessing self.sorted. Suggest renaming to sort_order or sort_value throughout (including the ReadResult counterpart and the C++-side parameter names).

Suggested change

def __init__(self, sym, frame_data, norm, sort_order=SortedValue.UNKNOWN):
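
A small standalone reproduction of the concern, with hypothetical function names. The attribute `self.sorted` is harmless by itself, but a parameter named `sorted` rebinds the name for the whole function body, so any later call to the built-in inside that function fails.

```python
def describe_shadowed(values, sorted=None):
    # Inside this function `sorted` refers to the parameter, not the built-in.
    try:
        return sorted(values)
    except TypeError as exc:
        return f"TypeError: {exc}"


def describe_renamed(values, sort_order=None):
    # With the parameter renamed, the built-in is reachable again.
    return sorted(values)


shadowed_result = describe_shadowed([3, 1, 2])
renamed_result = describe_renamed([3, 1, 2])
```

The first call hits a `TypeError` because `None` is not callable; the second returns the sorted list as expected.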

claude · 2026-04-20T09:44:58Z

+    assert isinstance(result, pl.DataFrame)
+    assert result["__index__"].flags["SORTED_ASC"] is True
+    assert result["sorted_val"].flags["SORTED_ASC"] is False
+    assert result["another"].flags["SORTED_ASC"] is False


The DESCENDING code path in _apply_polars_sorted_flag_to_index is exercised by production code but not covered by any test here. A simple case — write data with a descending datetime index (or use update() with sort_on_index=False to produce a reverse-sorted symbol) — would verify that result["__index__"].flags["SORTED_DESC"] is True and SORTED_ASC is False. Without this, a regression in the descending=True branch would go undetected.

claude · 2026-04-20T09:45:01Z

+    elif input_type in ("df", "series"):
+        common = norm.df.common if input_type == "df" else norm.series.common
+        index_type = common.WhichOneof("index_type")
+        if index_type == "index" and common.index.is_physically_stored:


WhichOneof("index_type") can also return "multi_index" for MultiIndex DataFrames. In that case index_type != "index" and the function returns None, so no sorted flag is set — which is probably the right conservative choice since multi-level index sorting is more complex. Worth adding a comment here (or in the PR description) acknowledging this as a known limitation so future maintainers don't think it was accidentally missed.
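
The conservative behaviour described here can be sketched without protobuf. `IndexInfo` is a hypothetical, simplified view of the normalization metadata, and `index_column_name` is an illustrative helper rather than the code in the PR; the `__index__` fallback for an unnamed index follows the test descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexInfo:
    """Hypothetical, simplified view of the normalization metadata."""
    index_type: str            # e.g. "index" or "multi_index"
    is_physically_stored: bool
    name: Optional[str] = None


def index_column_name(info: IndexInfo) -> Optional[str]:
    # Known limitation: "multi_index" falls through and returns None,
    # so no sorted flag is applied to MultiIndex frames.
    if info.index_type == "index" and info.is_physically_stored:
        return info.name if info.name else "__index__"
    return None


unnamed = index_column_name(IndexInfo("index", True))
named = index_column_name(IndexInfo("index", True, "ts"))
range_index = index_column_name(IndexInfo("index", False))
multi = index_column_name(IndexInfo("multi_index", True, "ts"))
```

A comment like the one inside `index_column_name` is the kind of note that would make the exclusion visibly intentional.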

claude · 2026-04-20T09:45:33Z

ArcticDB Code Review Summary

PR propagates ArcticDB's internal SortedValue through the read path to call Polars' set_sorted() on the index column, enabling Polars to use sort-aware optimisations automatically. All previously flagged issues have been resolved; the "Fix tests" commit (72e11c9) brings the branch to a clean state.

API & Compatibility

✅ No breaking changes to public Python API (Arctic, Library, NativeVersionStore)
✅ Deprecation protocol followed for removed/renamed parameters
✅ On-disk format unchanged
✅ Protobuf schema backwards-compatible
✅ Key types and ValueType enum unchanged

Memory & Safety

✅ RAII used for all resource management
✅ No use-after-move or use-after-free patterns
✅ Buffer size calculations correct
✅ GIL correctly managed at Python-C++ boundary
✅ No accidental copies of large objects

Correctness

✅ sorted parameter renamed to sort_order throughout ReadResult, NodeReadResult, and _adapt_read_res — Python built-in shadowing resolved
✅ _get_index_col_from_norm documents why multi_index is intentionally excluded
✅ read_result_from_single_frame correctly propagates sorted value via create_python_read_result (no additional change required)
✅ batch_read path flows through _post_process_dataframe → _adapt_read_res → _apply_polars_sorted_flag_to_index — sorted flag is applied for batch reads too

Code Quality

✅ No duplicated logic
✅ C++ and Python implementations consistent
✅ Tests not duplicating existing scenarios

Testing

✅ Main behavioural paths covered (unnamed index, named index, RangeIndex, pandas output)
✅ SortedValue.DESCENDING code path tested (test_descending_sort_order_sets_sorted_desc)
✅ SortedValue.UNKNOWN code path tested (test_unknown_sort_order_does_not_set_sorted_flag)
✅ Integration test for stage + finalize workflow included
✅ Tuple length assertion in test_library_tool.py updated to match 7-element tuple

Build & Dependencies

✅ No new source files requiring CMakeLists.txt changes
✅ No dependency changes
✅ Cross-platform compatibility maintained

Security

✅ No hardcoded credentials or secrets
✅ No buffer overflow potential

PR Title & Description

✅ Title reads "Propagate SortedValue to Polars set_sorted() on index column" — accurate and concise
✅ Description explains what changed and why, with benchmark results
✅ PR labelled enhancement and patch

Documentation

✅ New behaviour documented in library.py read and read_batch docstrings (Polars sorted-flag note added to output_format parameter)
✅ docs/claude/python/NATIVE_VERSION_STORE.md updated to reflect sort_order field propagation and _apply_polars_sorted_flag_to_index logic

IvoDD · 2026-04-21T14:09:59Z

+        assert result[col].flags["SORTED_ASC"] is False, f"Column {col} should not have SORTED_ASC"
+
+
+def test_sorted_flag_not_set_for_pandas_output(lmdb_library):


Do we need this test? We already have a lot of other test coverage of pandas reads.

IvoDD · 2026-04-21T14:37:32Z

+
+    result = NativeVersionStore._apply_polars_sorted_flag_to_index(data, read_result)
+
+    assert result["__index__"].flags["SORTED_ASC"] is False


A meta point worth discussing is whether we should treat UNKNOWN as SORTED_ASC.

For all other purposes we treat UNKNOWN (written by old ArcticDB versions before sortedness was introduced) as sorted (e.g. for update we use this code).

I'm personally in favor of treating UNKNOWN consistently and setting the sorted flag even if the sort order is unknown.

We should instead add a test for UNSORTED.
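
The two policies under discussion can be compared side by side in a small sketch. `SortOrder` is a hypothetical stand-in for `SortedValue`, and `should_flag_ascending` with its `unknown_is_sorted` switch is an illustrative helper, not code from the PR.

```python
from enum import Enum


class SortOrder(Enum):
    """Hypothetical stand-in for SortedValue."""
    UNKNOWN = "unknown"
    UNSORTED = "unsorted"
    ASCENDING = "ascending"
    DESCENDING = "descending"


def should_flag_ascending(sort_order: SortOrder, unknown_is_sorted: bool) -> bool:
    if sort_order is SortOrder.ASCENDING:
        return True
    if sort_order is SortOrder.UNKNOWN:
        return unknown_is_sorted  # the policy under discussion
    return False  # UNSORTED (and DESCENDING, which gets its own flag)


# Current behaviour in the PR: UNKNOWN is left unflagged.
strict = {o.name: should_flag_ascending(o, unknown_is_sorted=False) for o in SortOrder}
# Proposed behaviour: UNKNOWN is treated like ASCENDING.
lenient = {o.name: should_flag_ascending(o, unknown_is_sorted=True) for o in SortOrder}
```

Under either policy UNSORTED stays unflagged, which is the case a dedicated test would pin down. Polars' documentation for `set_sorted` warns that flagging data that is not actually sorted can lead to incorrect results, so the lenient policy relies on UNKNOWN symbols really being sorted.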

IvoDD · 2026-04-21T14:39:17Z


 class ReadResult:
-    def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results):
+    def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results, sort_order=SortedValue.UNKNOWN):


I'm slightly worried that by using a default value we might have missed some place where this is used incorrectly.
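
One possible way to surface missed call sites, sketched with a hypothetical trimmed-down class: making `sort_order` keyword-only with no default means any construction that forgets it raises immediately instead of silently falling back to UNKNOWN.

```python
class ReadResultSketch:
    """Hypothetical, trimmed-down stand-in for ReadResult."""

    def __init__(self, version, frame_data, *, sort_order):
        # Keyword-only and without a default: every construction site must choose.
        self.version = version
        self.frame_data = frame_data
        self.sort_order = sort_order


explicit = ReadResultSketch(1, object(), sort_order="ASCENDING")

try:
    ReadResultSketch(1, object())  # sort_order omitted on purpose
    missing_detected = False
except TypeError:
    missing_detected = True
```

The trade-off is that every existing caller has to be updated, which is exactly the audit the default value lets one skip.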

IvoDD · 2026-04-21T14:42:51Z

            Output format for the returned dataframe.
            If `None`, uses the output format from the `Library` instance.
            See `OutputFormat` documentation for details on available formats.
+            When using ``POLARS`` output format, the index column (if physically stored) will automatically have


We should probably move these docs to OutputFormat docs. The same description applies to methods like head and tail all of which reference the OutputFormat docs.

IvoDD · 2026-04-21T14:44:19Z

        return True


+def _get_index_col_from_norm(norm):


This will get drastically simpler after #3037 as the index column will always be first.

IvoDD · 2026-04-21T14:47:25Z


+    @staticmethod
+    def _apply_polars_sorted_flag_to_index(data, read_result):
+        if pl is None or not isinstance(data, pl.DataFrame):


Nit: I think it would be slightly cleaner if we call this only for polars DataFrames here, then this check would not be needed.

IvoDD · 2026-04-21T14:48:03Z

+    elif input_type in ("df", "series"):
+        common = norm.df.common if input_type == "df" else norm.series.common
+        index_type = common.WhichOneof("index_type")
+        # multi_index intentionally excluded: multi-level sorting semantics are more complex


We probably still want to include multi_index as long as the first index level is sorted.

IvoDD · 2026-04-21T14:49:29Z

+            pyuser_meta,
+            multi_key_meta,
+            std::move(node_results),
+            static_cast<int>(ret.sorted)


Why cast to int? Can't we expose SortedValue in python bindings?

IvoDD · 2026-04-21T14:52:16Z

    NodeReadResult(
            const StreamId& symbol, OutputFrame&& frame_data,
-            arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta
+            arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta, SortedValue sorted = SortedValue::UNKNOWN


Nit: Again slightly worried about this having a default value, we might have missed some constructions which should specify a sorted value.

This only matters if we decide to treat UNKNOWN as sorted. Otherwise it's probably fine.

Apply set_order for polars dataframes

07a36e5

G-D-Petrov requested review from IvoDD, alexowens90 and poodlewars as code owners April 20, 2026 09:40

claude Bot reviewed Apr 20, 2026

Fix test failures

f0c2189

G-D-Petrov added the patch (Small change, should increase patch version) and enhancement (New feature or request) labels Apr 20, 2026

G-D-Petrov changed the title ~~Apply set_order for polars dataframes~~ Propagate SortedValue to Polars set_sorted() on index column Apr 20, 2026

Georgi Petrov added 2 commits April 20, 2026 13:27

Address PR comments

870cd35

Fix tests

72e11c9

IvoDD reviewed Apr 21, 2026

