feat: dataframe support via `narwhals` by zilto · Pull Request #3912 · dlt-hub/dlt

zilto · 2026-05-05T02:47:50Z

Summary

Instead of implementing logic per dataframe library, centralize logic around Narwhals.

Context

Narwhals is a dependency-free library that implements a (large) subset of the Polars API with the ability to delegate execution of the expression to many different backends (pandas, polars, duckdb, Ibis, PySpark, cuDF).

Benefits

Single logic to maintain for dlt and support more libraries for end-users. Testing and features are less brittle and better tested. Avoid Python plugin dependency hell (lazy import in Python 3.15 should help; see you in 5 years)

cloudflare-workers-and-pages · 2026-05-05T02:57:20Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	docs	`a62a381`	Commit Preview URL Branch Preview URL	May 11 2026, 09:22 PM

zilto · 2026-05-05T17:15:46Z

Test failures are unrelated

rudolfix

I like it! we can add narwhals as main dep as this is pure python without any hard deps (nice!)

my main issue here is: changes touch a critical path - every extracted item goes through type check to tell the tabular data vs. Python objects. right now everything goes through narwhals which is ~50x slower than old method on devel. I include output from optimize-code skill below.

tell me what you want to do:

I can just apply implementation plan that will fix do the speedups mentioned below
you want to work on it further.

overall there's no problem with overall concept and structure. some minimal tests for ie. duckdb relation passed as tabular data would be nice PoC

rudolfix · 2026-05-08T09:48:35Z

Benchmark: per-item overhead of `narwhals.from_native`

I benchmarked the new is_dataframe / df_to_arrow against the prior direct-isinstance / direct-converter path on this branch (narwhals 2.20.0, Python 3.13). Results below — is_dataframe runs on every yielded item via wrap_additional_type, Incremental._get_transform, and get_data_item_format, so the per-dict number is the headline.

Type detection (per-item, n = 1,000,000)

item	devel (3-check)	this PR (narwhals naive)	PR vs devel
`dict`	265 ns	5,928 ns	22.4× slower
`int`	263 ns	5,794 ns	22.0× slower
`None`	265 ns	5,857 ns	22.1× slower
`pandas.DataFrame`	179 ns	11,468 ns	64× slower
`polars.DataFrame`	253 ns	5,976 ns	23.6× slower
`polars.LazyFrame`	259 ns	6,293 ns	24.3× slower
`pyarrow.Table`	90 ns	90 ns	par (short-circuits on `is_arrow_object`)

Bare-process scenario (no DF library imported — most JSON/REST resources): old=155 ns/item → PR=5,566 ns/item → 36× regression.

Cost decomposition: narwhals.from_native(d, allow_series=False, pass_through=True) itself is ~6 µs/op; the isinstance after it is only 55 ns. Almost all the cost is from_native.

Real-world impact for a resource yielding 10M plain dict rows on the extract path:

devel: 0.155 µs × 10M ≈ 1.6 s
this PR: 5.6 µs × 10M ≈ 56 s (~54 s pure regression on detection alone)

Frame conversion (per-batch)

kind	rows	direct (devel)	narwhals (PR)	PR vs devel
pandas DataFrame	10	62 µs	96 µs	1.55×
pandas DataFrame	1,000	79 µs	103 µs	1.30×
pandas DataFrame	100,000	78 µs	107 µs	1.36×
polars DataFrame	10	9.5 µs	17.5 µs	1.83×
polars DataFrame	1,000	11.0 µs	18.3 µs	1.66×
polars DataFrame	100,000	13.3 µs	19.4 µs	1.46×
polars LazyFrame	100,000	13.2 µs	23.6 µs	1.79×

df_to_arrow adds 30–96 % overhead per pandas/polars batch — narwhals routes through extra indirection rather than calling df.to_arrow() directly.

Proposed optimizations

1. Type-cache `is_dataframe` (per-item path)

# dlt/common/libs/narwhals.py
_BUILTIN_NON_FRAME = (dict, list, str, int, float, bool, tuple, bytes, type(None), set, frozenset)
_TYPE_IS_FRAME: dict[type, bool] = {}

def is_dataframe(obj: Any) -> bool:
    t = type(obj)
    if t in _BUILTIN_NON_FRAME:
        return False
    cached = _TYPE_IS_FRAME.get(t)
    if cached is not None:
        return cached
    maybe = narwhals.from_native(obj, allow_series=False, pass_through=True)
    result = isinstance(maybe, (narwhals.DataFrame, narwhals.LazyFrame))
    _TYPE_IS_FRAME[t] = result
    return result

2. Type-dispatched `df_to_arrow` (per-batch path)

_CONVERTERS: dict[type, Callable[[Any], "pyarrow.Table"]] = {}

def df_to_arrow(df):
    t = type(df)
    fn = _CONVERTERS.get(t)
    if fn is None:
        mod = t.__module__
        if mod.startswith("pandas."):
            from dlt.common.libs.pandas import pandas_to_arrow
            fn = pandas_to_arrow
        elif mod.startswith("polars."):
            from dlt.common.libs.polars import polars_to_arrow
            fn = polars_to_arrow
        else:
            fn = _df_to_arrow_via_narwhals
        _CONVERTERS[t] = fn
    return fn(df)

def _df_to_arrow_via_narwhals(df):
    nw_df = narwhals.from_native(df, allow_series=False)
    if isinstance(nw_df, narwhals.LazyFrame):
        nw_df = nw_df.collect()
    return nw_df.to_arrow()

Results with both patches applied (warm cache)

item	devel	this PR	proposed	proposed vs PR	proposed vs devel
`dict`	265 ns	5,928 ns	126 ns	47× faster	2.1× faster
`int`	263 ns	5,794 ns	138 ns	42× faster	1.9× faster
`pandas.DataFrame`	179 ns	11,468 ns	187 ns	61× faster	par
`polars.DataFrame`	253 ns	5,976 ns	171 ns	35× faster	1.5× faster

conversion	direct (devel)	narwhals (PR)	dispatched	dispatched vs PR
pandas (100k rows)	78 µs	107 µs	67 µs	1.60× faster
polars (100k rows)	13.3 µs	19.4 µs	9.9 µs	1.96× faster
polars LazyFrame (100k rows)	13.2 µs	23.6 µs	12.0 µs	1.97× faster

Both patches together are ~25 LoC, preserve PR semantics, fully restore parity with devel for pandas/polars, and keep narwhals as the fallback for the new backends this PR is aiming to unlock.

zilto · 2026-05-11T12:47:37Z

Thanks for the review. I agree with the performance issues and we should find a better solution.

I would spend more time trying to improve the PR (next week) if that's ok. I believe this will help with incremental logic

Type guard clause for dataframes

Main branch uses a cheap is_pyarrow(), is_pandas(), etc. but narwhals doesn't provide such util (by design). This clashes with our "guard clause" and dispatch code pattern

The try/except with .from_native(passthrough=True) was the solution recommended by narwhals maintainers. But this requires that we use the converted object directly instead of converting it again later.

Suggestions

There are some extra layers (e.g., wrap_additional_types()) that were required because we handled dataframe libraries one-by-one. Now, this layer is costly because it does a redundant narwhals conversion in is_dataframe(). By removing this layer, we can reduce redundant operations

MarcoGorelli · 2026-05-11T12:50:42Z

Main branch uses a cheap is_pyarrow(), is_pandas(), etc. but narwhals doesn't provide such util (by design). This clashes with our "guard clause" and dispatch code pattern

not sure if i've misunderstood, but would the is_* functions from https://narwhals-dev.github.io/narwhals/api-reference/dependencies/ work for you?

zilto · 2026-05-11T13:08:36Z

@MarcoGorelli Thanks! Will check the is_into_dataframe() function!

aside: did the docs get an overhaul since last week?

MarcoGorelli · 2026-05-11T13:10:09Z

😄 yeah moved to zensical

zilto · 2026-05-11T13:25:51Z

@rudolfix can you run the performance analysis? Switching to the native narwhals type checking should fix all performance concerns.

Also, leaving this for follow-up task (can assign it to me):

Future work would be to change some logic to be Narwhals-first (e.g., ArrowIncremental). This would allow us to prepare data using the native representation and vectorized operations available on the engine (e.g., I see that ArrowIncremental requires converting the full column to native Python object to hash them). This would remove the need for a separate ModelIncremental

FBruzzesi

Hey @zilto - I hope it's ok for me to comment here 😇
I raised a couple of concerns, one for the type checking, one for the difference with current behavior

FBruzzesi · 2026-05-11T19:04:30Z

-    if is_polars_frame(item):
-        from dlt.common.libs.polars import polars_to_arrow


Notice that unlike is_polars_frame, narwhals.dependencies.is_into_dataframe returns False for a polars LazyFrame (and we don't have a lazy equivalent of is_into_dataframe - we can consider adding it to be honest, @MarcoGorelli 👀)

Thanks to you and Marco for the reviews! It's saving us a lot of time actually.

@FBruzzesi it would be nice to have a type guard to that catches LazyFrame. Though, it doesn't block this PR. This polars support hasn't reach master branch and isn't released yet

@rudolfix IMO, the "eager arrow code path" shouldn't support lazyframes and raise "Received LazyFrame, call .collect() inside your @dlt.resource". Currently, we're hiding an expensive operation for minimal convenience. AFAIK, we're not returning LazyFrame objects back for downstream @dlt.transformer.

Lazy objects should be supported via the "lazy model code path" where we have Ibis expressions, SQLGlot, etc. for now.

Hopefully, we can unify both by implementing load package preparations and incremental logic via Narwhals

Yeah I think calling collect on behalf of your users would only be justified if you were doing a lot of operations lazily before reaching collect which would be difficult for the user to do outside of dlt

instead of calling collect first thing, a clear error message like your one looks like a nice solution 👍

zilto self-assigned this May 5, 2026

zilto added the enhancement New feature or request label May 5, 2026

zilto requested a review from rudolfix May 5, 2026 17:15

zilto marked this pull request as ready for review May 5, 2026 17:15

rudolfix requested changes May 8, 2026

View reviewed changes

zilto added 7 commits May 11, 2026 08:51

add narwhals deps

a597301

remove pandas and polars from dlt/extract/incremental/__init__

1408db5

remove pandas and polars from dlt/extract/extractors.py

a6ad157

remove pandas and polars from dlt/extract/wrappers.py

b2adf3d

remove pandas and polars from dlt/extract/utils.py

7b55df9

bumped narwhals version

35c803a

handle pyarrow recordbatch

61e510f

use native narwhals.dependencies.is_into_dataframe()

5df5b4b

zilto force-pushed the feat/dataframe-support-via-narwhals branch from fd2ae4a to 5df5b4b Compare May 11, 2026 13:16

zilto requested a review from rudolfix May 11, 2026 13:26

FBruzzesi reviewed May 11, 2026

View reviewed changes

narwhals typing fix

a62a381

FBruzzesi mentioned this pull request May 11, 2026

feat: Add is_into_lazyframe narwhals-dev/narwhals#3612

Closed

		if is_polars_frame(item):
		from dlt.common.libs.polars import polars_to_arrow

Conversation

zilto commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Benefits

Uh oh!

cloudflare-workers-and-pages Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying with Cloudflare Workers

Uh oh!

zilto commented May 5, 2026

Uh oh!

rudolfix left a comment

Choose a reason for hiding this comment

Uh oh!

rudolfix commented May 8, 2026

Benchmark: per-item overhead of narwhals.from_native

Type detection (per-item, n = 1,000,000)

Frame conversion (per-batch)

Proposed optimizations

1. Type-cache is_dataframe (per-item path)

2. Type-dispatched df_to_arrow (per-batch path)

Results with both patches applied (warm cache)

Uh oh!

zilto commented May 11, 2026

Type guard clause for dataframes

Suggestions

Uh oh!

MarcoGorelli commented May 11, 2026

Uh oh!

zilto commented May 11, 2026

Uh oh!

MarcoGorelli commented May 11, 2026

Uh oh!

zilto commented May 11, 2026

Uh oh!

FBruzzesi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

FBruzzesi May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zilto May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

zilto commented May 5, 2026 •

edited

Loading

cloudflare-workers-and-pages Bot commented May 5, 2026 •

edited

Loading

Benchmark: per-item overhead of `narwhals.from_native`

1. Type-cache `is_dataframe` (per-item path)

2. Type-dispatched `df_to_arrow` (per-batch path)

FBruzzesi May 11, 2026 •

edited

Loading

zilto May 11, 2026 •

edited

Loading