Skip to content

feat: dataframe support via narwhals#3912

Open
zilto wants to merge 9 commits into
develfrom
feat/dataframe-support-via-narwhals
Open

feat: dataframe support via narwhals#3912
zilto wants to merge 9 commits into
develfrom
feat/dataframe-support-via-narwhals

Conversation

@zilto
Copy link
Copy Markdown
Collaborator

@zilto zilto commented May 5, 2026

Summary

Instead of implementing logic per dataframe library, centralize logic around Narwhals.

Context

Narwhals is a dependency-free library that implements a (large) subset of the Polars API with the ability to delegate execution of the expression to many different backends (pandas, polars, duckdb, Ibis, PySpark, cuDF).

Benefits

Single logic to maintain for dlt and support more libraries for end-users. Testing and features are less brittle and better tested. Avoid Python plugin dependency hell (lazy import in Python 3.15 should help; see you in 5 years)

@zilto zilto self-assigned this May 5, 2026
@zilto zilto added the enhancement New feature or request label May 5, 2026
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented May 5, 2026

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
docs a62a381 Commit Preview URL

Branch Preview URL
May 11 2026, 09:22 PM

@zilto zilto requested a review from rudolfix May 5, 2026 17:15
@zilto zilto marked this pull request as ready for review May 5, 2026 17:15
@zilto
Copy link
Copy Markdown
Collaborator Author

zilto commented May 5, 2026

Test failures are unrelated

Copy link
Copy Markdown
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! we can add narwhals as main dep as this is pure python without any hard deps (nice!)

my main issue here is: changes touch a critical path - every extracted item goes through type check to tell the tabular data vs. Python objects. right now everything goes through narwhals which is ~50x slower than old method on devel. I include output from optimize-code skill below.

tell me what you want to do:

  1. I can just apply implementation plan that will fix do the speedups mentioned below
  2. you want to work on it further.

overall there's no problem with overall concept and structure. some minimal tests for ie. duckdb relation passed as tabular data would be nice PoC

@rudolfix
Copy link
Copy Markdown
Collaborator

rudolfix commented May 8, 2026

Benchmark: per-item overhead of narwhals.from_native

I benchmarked the new is_dataframe / df_to_arrow against the prior direct-isinstance / direct-converter path on this branch (narwhals 2.20.0, Python 3.13). Results below — is_dataframe runs on every yielded item via wrap_additional_type, Incremental._get_transform, and get_data_item_format, so the per-dict number is the headline.

Type detection (per-item, n = 1,000,000)

item devel (3-check) this PR (narwhals naive) PR vs devel
dict 265 ns 5,928 ns 22.4× slower
int 263 ns 5,794 ns 22.0× slower
None 265 ns 5,857 ns 22.1× slower
pandas.DataFrame 179 ns 11,468 ns 64× slower
polars.DataFrame 253 ns 5,976 ns 23.6× slower
polars.LazyFrame 259 ns 6,293 ns 24.3× slower
pyarrow.Table 90 ns 90 ns par (short-circuits on is_arrow_object)

Bare-process scenario (no DF library imported — most JSON/REST resources): old=155 ns/item → PR=5,566 ns/item → 36× regression.

Cost decomposition: narwhals.from_native(d, allow_series=False, pass_through=True) itself is ~6 µs/op; the isinstance after it is only 55 ns. Almost all the cost is from_native.

Real-world impact for a resource yielding 10M plain dict rows on the extract path:

  • devel: 0.155 µs × 10M ≈ 1.6 s
  • this PR: 5.6 µs × 10M ≈ 56 s (~54 s pure regression on detection alone)

Frame conversion (per-batch)

kind rows direct (devel) narwhals (PR) PR vs devel
pandas DataFrame 10 62 µs 96 µs 1.55×
pandas DataFrame 1,000 79 µs 103 µs 1.30×
pandas DataFrame 100,000 78 µs 107 µs 1.36×
polars DataFrame 10 9.5 µs 17.5 µs 1.83×
polars DataFrame 1,000 11.0 µs 18.3 µs 1.66×
polars DataFrame 100,000 13.3 µs 19.4 µs 1.46×
polars LazyFrame 100,000 13.2 µs 23.6 µs 1.79×

df_to_arrow adds 30–96 % overhead per pandas/polars batch — narwhals routes through extra indirection rather than calling df.to_arrow() directly.

Proposed optimizations

1. Type-cache is_dataframe (per-item path)

# dlt/common/libs/narwhals.py
_BUILTIN_NON_FRAME = (dict, list, str, int, float, bool, tuple, bytes, type(None), set, frozenset)
_TYPE_IS_FRAME: dict[type, bool] = {}

def is_dataframe(obj: Any) -> bool:
    t = type(obj)
    if t in _BUILTIN_NON_FRAME:
        return False
    cached = _TYPE_IS_FRAME.get(t)
    if cached is not None:
        return cached
    maybe = narwhals.from_native(obj, allow_series=False, pass_through=True)
    result = isinstance(maybe, (narwhals.DataFrame, narwhals.LazyFrame))
    _TYPE_IS_FRAME[t] = result
    return result

2. Type-dispatched df_to_arrow (per-batch path)

_CONVERTERS: dict[type, Callable[[Any], "pyarrow.Table"]] = {}

def df_to_arrow(df):
    t = type(df)
    fn = _CONVERTERS.get(t)
    if fn is None:
        mod = t.__module__
        if mod.startswith("pandas."):
            from dlt.common.libs.pandas import pandas_to_arrow
            fn = pandas_to_arrow
        elif mod.startswith("polars."):
            from dlt.common.libs.polars import polars_to_arrow
            fn = polars_to_arrow
        else:
            fn = _df_to_arrow_via_narwhals
        _CONVERTERS[t] = fn
    return fn(df)

def _df_to_arrow_via_narwhals(df):
    nw_df = narwhals.from_native(df, allow_series=False)
    if isinstance(nw_df, narwhals.LazyFrame):
        nw_df = nw_df.collect()
    return nw_df.to_arrow()

Results with both patches applied (warm cache)

item devel this PR proposed proposed vs PR proposed vs devel
dict 265 ns 5,928 ns 126 ns 47× faster 2.1× faster
int 263 ns 5,794 ns 138 ns 42× faster 1.9× faster
pandas.DataFrame 179 ns 11,468 ns 187 ns 61× faster par
polars.DataFrame 253 ns 5,976 ns 171 ns 35× faster 1.5× faster
conversion direct (devel) narwhals (PR) dispatched dispatched vs PR
pandas (100k rows) 78 µs 107 µs 67 µs 1.60× faster
polars (100k rows) 13.3 µs 19.4 µs 9.9 µs 1.96× faster
polars LazyFrame (100k rows) 13.2 µs 23.6 µs 12.0 µs 1.97× faster

Both patches together are ~25 LoC, preserve PR semantics, fully restore parity with devel for pandas/polars, and keep narwhals as the fallback for the new backends this PR is aiming to unlock.

@zilto
Copy link
Copy Markdown
Collaborator Author

zilto commented May 11, 2026

Thanks for the review. I agree with the performance issues and we should find a better solution.

I would spend more time trying to improve the PR (next week) if that's ok. I believe this will help with incremental logic

Type guard clause for dataframes

Main branch uses a cheap is_pyarrow(), is_pandas(), etc. but narwhals doesn't provide such util (by design). This clashes with our "guard clause" and dispatch code pattern

The try/except with .from_native(passthrough=True) was the solution recommended by narwhals maintainers. But this requires that we use the converted object directly instead of converting it again later.

Suggestions

There are some extra layers (e.g., wrap_additional_types()) that were required because we handled dataframe libraries one-by-one. Now, this layer is costly because it does a redundant narwhals conversion in is_dataframe(). By removing this layer, we can reduce redundant operations

@MarcoGorelli
Copy link
Copy Markdown

Main branch uses a cheap is_pyarrow(), is_pandas(), etc. but narwhals doesn't provide such util (by design). This clashes with our "guard clause" and dispatch code pattern

not sure if i've misunderstood, but would the is_* functions from https://narwhals-dev.github.io/narwhals/api-reference/dependencies/ work for you?

@zilto
Copy link
Copy Markdown
Collaborator Author

zilto commented May 11, 2026

@MarcoGorelli Thanks! Will check the is_into_dataframe() function!

aside: did the docs get an overhaul since last week?

@MarcoGorelli
Copy link
Copy Markdown

😄 yeah moved to zensical

@zilto zilto force-pushed the feat/dataframe-support-via-narwhals branch from fd2ae4a to 5df5b4b Compare May 11, 2026 13:16
@zilto
Copy link
Copy Markdown
Collaborator Author

zilto commented May 11, 2026

@rudolfix can you run the performance analysis? Switching to the native narwhals type checking should fix all performance concerns.

Also, leaving this for follow-up task (can assign it to me):

Future work would be to change some logic to be Narwhals-first (e.g., ArrowIncremental). This would allow us to prepare data using the native representation and vectorized operations available on the engine (e.g., I see that ArrowIncremental requires converting the full column to native Python object to hash them). This would remove the need for a separate ModelIncremental

@zilto zilto requested a review from rudolfix May 11, 2026 13:26
Copy link
Copy Markdown

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @zilto - I hope it's ok for me to comment here 😇
I raised a couple of concerns, one for the type checking, one for the difference with current behavior

Comment thread dlt/common/libs/narwhals.py Outdated
Comment thread dlt/extract/extractors.py
Comment on lines -44 to -45
if is_polars_frame(item):
from dlt.common.libs.polars import polars_to_arrow
Copy link
Copy Markdown

@FBruzzesi FBruzzesi May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that unlike is_polars_frame, narwhals.dependencies.is_into_dataframe returns False for a polars LazyFrame (and we don't have a lazy equivalent of is_into_dataframe - we can consider adding it to be honest, @MarcoGorelli 👀)

Copy link
Copy Markdown
Collaborator Author

@zilto zilto May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to you and Marco for the reviews! It's saving us a lot of time actually.

@FBruzzesi it would be nice to have a type guard to that catches LazyFrame. Though, it doesn't block this PR. This polars support hasn't reach master branch and isn't released yet

@rudolfix IMO, the "eager arrow code path" shouldn't support lazyframes and raise "Received LazyFrame, call .collect() inside your @dlt.resource". Currently, we're hiding an expensive operation for minimal convenience. AFAIK, we're not returning LazyFrame objects back for downstream @dlt.transformer.

Lazy objects should be supported via the "lazy model code path" where we have Ibis expressions, SQLGlot, etc. for now.

Hopefully, we can unify both by implementing load package preparations and incremental logic via Narwhals

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think calling collect on behalf of your users would only be justified if you were doing a lot of operations lazily before reaching collect which would be difficult for the user to do outside of dlt

instead of calling collect first thing, a clear error message like your one looks like a nice solution 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants