feat: Introduce testing.constructors module and a pytest plugin#3552
feat: Introduce testing.constructors module and a pytest plugin#3552
testing.constructors module and a pytest plugin#3552Conversation
|
Just a quick one from me RE:
Is it possible to add something like this, and keep the imports in place (temporarily)?: # tests.utils.py
from narwhals.testing.typing import Constructor as Constructor , ConstructorEager as ConstructorEagerI really REALLY want this PR, but the diff 😳 (yes yes, I know I'm guilty of this too 😉) |
|
@dangotbanned reverted such commit 😉 Coverage is not picked up in CI though 🥹 |
|
@MarcoGorelli @dangotbanned @camriddell I would say this PR is ready to review as is - yet not ready to be merged because all the open points + conflicts with plotly & fairlearn pytest fixtures (#3556 addresses this last issue) |
camriddler
left a comment
There was a problem hiding this comment.
As a first pass I focused on the testing.constructors.py since that is where the implementation details are.
The current registration mechanism has implicit conditions (e.g. name in cls.__dict__) that can cause subclasses to fail to register without error (I believe this is used to implicitly differentiate leaf classes). Combined with metadata split between class-body attributes and subclass keyword arguments, this increases contributor error surface and makes registration behavior harder to reason about.
The above observations increase the chance that downstream users will need to read the source code implementation if they are cut by either of above details. (e.g. "why is my class not registering? Oh I forgot to add a name=...)
Open for discussion on this, but I'm still a fan of the decorator + function based registration for a few reasons:
- Constructing instances is easier to validate/invalidate than initializing subclasses which may have some documentation-only fields (e.g. anything you can put into the class body is fair game)
- The registration mechanism is an explicit "opt-in" (either the decorator is there or it is not) instead of being tucked behind a class definition + specific field
nameavailablility - The functions can keep their same name as they originally had and we don't need concern with the
legacy_namefield at all while we transition away from thestr(constructor)stuff.
from dataclasses import dataclass, field
@dataclass(frozen=True)
class Constructor:
name: str
is_eager: bool
# omitting fields for brevity
func: Callable[[...], ...]
_registry = {} # could be a global or class var
def __call__(self, *args, **kwargs):
return self.func(*args, **kwargs)
@classmethod
def from_func(cls, name, is_eager):
def decorator(func):
inst = cls(name=name, is_eager=is_eager, func=func)
cls._registry[name] = inst
return inst
return decorator
@Constructor.from_func("pandas", is_eager=True)
def pandas_constructor(data): # can keep the original function calls, str(...) behavior does not change
import pandas as pd
return pd.DataFrame(data)
print(f'{Constructor._registry = }')
# constructors are the same as before, but have metadata attached directly to them
print(f'{pandas_constructor = }')
print(f'{pandas_constructor.is_eager = }')outputs
Constructor._registry = {'pandas': Constructor(name='pandas', is_eager=True)}
pandas_constructor = Constructor(name='pandas', is_eager=True)
pandas_constructor.is_eager = TrueThe caveat of course being that the function pandas_constructor is no longer a function, it is an instance of Constructor that has an identical __call__ behavior to the function it decorated.
| cls.needs_gpu = needs_gpu | ||
|
|
||
| if "name" not in cls.__dict__: | ||
| return |
There was a problem hiding this comment.
Does this if check this only exist to catch classes that don't have a name to them? Thus meaning "if you don't have a name specified, you don't get inserted in to the registration mechanism? Seems surprising that if I forget to add a name= to my class that I get neither the registration behavior nor an error telling me I forgot something.
| implementation: Implementation | None = None, | ||
| requirements: tuple[str, ...] = (), | ||
| legacy_name: str = "", | ||
| is_non_nullable: bool = False, |
There was a problem hiding this comment.
I'd personally prefer is_nullable instead of is_non_nullable (if we don't have a backwards compat to worry about)? I find the boolean logic to be easier to understand when making assertions about the positive case as something like not is_non_nullable is a touch trickier to reason about than not is_nullable.
There was a problem hiding this comment.
Absolutely I fully agree
| needs_gpu: Whether the backend requires GPU hardware. | ||
| **kwargs: Forwarded to `super().__init_subclass__`. | ||
| """ | ||
| super().__init_subclass__(**kwargs) |
There was a problem hiding this comment.
Is there a suspected case for this super? I can't imagine nesting classes that override this logic.
There was a problem hiding this comment.
Fair enough - I kept it as "you never know what users want to do"
There was a problem hiding this comment.
That's fair, and this won't hurt us so I'm +1 to keep it in. Was curious if you had something in mind :)
| raise NotImplementedError | ||
|
|
||
|
|
||
| class LazyFrameConstructor(FrameConstructor): |
There was a problem hiding this comment.
Do these Lazy|Eager... sub-classes help anything? They set a default class variable for their children, but it feels like boilerplate outside of that.
There was a problem hiding this comment.
Useful for type checking (especially constructor_eager) + setting one class variable.
| ): | ||
| """Constructor backed by `pandas.DataFrame` with default NumPy dtypes.""" | ||
|
|
||
| name = "pandas" |
There was a problem hiding this comment.
Feels odd that this argument needs to be attached to the class body instead of passing into the class construction mechanism.
There was a problem hiding this comment.
It was an hack to allow to create the two intermediate EagerFrameConstructor and LazyFrameConstructor without passing a name (hence without registering the two)
|
|
||
| ALL_CPU_CONSTRUCTORS: frozenset[str] = frozenset( | ||
| name for name, c in FrameConstructor._registry.items() if not c.needs_gpu | ||
| ) |
There was a problem hiding this comment.
Can ALL_CPU_CONSTRUCTORS be a module property or function so that it is kept it in sync in-case the registry is updated after this variable is created?
Exactly as what was done with available_constructors below.
There was a problem hiding this comment.
Makes sense, thanks for suggesting it
| def prepare_constructors( | ||
| *, include: Iterable[str] | None = None, exclude: Iterable[str] | None = None | ||
| ) -> list[FrameConstructor]: | ||
| """Return available constructors, optionally filtered. |
There was a problem hiding this comment.
- We should work more with the underlying types when possible instead of strings. A tricky bug will probably be
prepare_constructors(include=["pndas"])and you don't get anything back (not even an error). Whereasprepare_constructors(include=testing.PndasConstructor) provides an easy to understand error about what was mistyped. - Needs a comment that the
excludeis given precedence here since it is the last conditions that are checked (e.g. it is currently allowed to doprepare_constructors(include=["pandas"], exclude=["pandas"])and becomes ambiguous as to what one should expect to return when perhaps this should be an errant case?
There was a problem hiding this comment.
I think I am not even using it anymore now - if we are going to expose something like this, then I am definitely going to add better type checking. Sure will add better docstrings if it stays
| ) | ||
|
|
||
|
|
||
| class FrameConstructor(Protocol): |
There was a problem hiding this comment.
What is gained from making this a typing.Protocol?
There was a problem hiding this comment.
I think it has to do with the two EagerFrameConstructor and LazyFrameConstructor but it was a bunch of commits ago... I swear I already forgot 😭
There was a problem hiding this comment.
If it isn't needed can we remove it? At least in my mind (and someone please update my way of thinking), but a Protocol shouldn't have any logic embedded into its methods just scaffolding for attribute & method names/signatures and types. But that's more of a conceptual framework rather than something enforced strongly at runtime.
FBruzzesi
left a comment
There was a problem hiding this comment.
Thanks for the review @camriddell - there are a lot of great points made there! I replied to single comments
RE: #3552 (review)
My TL;DR is that I don't have a strong preference.
The current registration mechanism has implicit conditions (e.g.
name in cls.__dict__) that can cause subclasses to fail to register without error (I believe this is used to implicitly differentiate leaf classes). Combined with metadata split between class-body attributes and subclass keyword arguments, this increases contributor error surface and makes registration behavior harder to reason about.
That's totally valid, and yes we should pick one way of declaring what we need, and avoid having to declare both a class variable and metadata, it should be either one or the other. Also we should not fail silently.
- The functions can keep their same name as they originally had and we don't need concern with the
legacy_namefield at all while we transition away from thestr(constructor)stuff.
Oh man, that's a killer feature!
I will sleep on how to do it a bit longer, thanks for providing example code as well!
The caveat of course being that the function
pandas_constructoris no longer a function, it is an instance ofConstructorthat has an identical__call__behavior to the function it decorated.
I would not worry too much about this to be honest
| def prepare_constructors( | ||
| *, include: Iterable[str] | None = None, exclude: Iterable[str] | None = None | ||
| ) -> list[FrameConstructor]: | ||
| """Return available constructors, optionally filtered. |
There was a problem hiding this comment.
I think I am not even using it anymore now - if we are going to expose something like this, then I am definitely going to add better type checking. Sure will add better docstrings if it stays
|
|
||
| ALL_CPU_CONSTRUCTORS: frozenset[str] = frozenset( | ||
| name for name, c in FrameConstructor._registry.items() if not c.needs_gpu | ||
| ) |
There was a problem hiding this comment.
Makes sense, thanks for suggesting it
| ): | ||
| """Constructor backed by `pandas.DataFrame` with default NumPy dtypes.""" | ||
|
|
||
| name = "pandas" |
There was a problem hiding this comment.
It was an hack to allow to create the two intermediate EagerFrameConstructor and LazyFrameConstructor without passing a name (hence without registering the two)
| raise NotImplementedError | ||
|
|
||
|
|
||
| class LazyFrameConstructor(FrameConstructor): |
There was a problem hiding this comment.
Useful for type checking (especially constructor_eager) + setting one class variable.
| needs_gpu: Whether the backend requires GPU hardware. | ||
| **kwargs: Forwarded to `super().__init_subclass__`. | ||
| """ | ||
| super().__init_subclass__(**kwargs) |
There was a problem hiding this comment.
Fair enough - I kept it as "you never know what users want to do"
| implementation: Implementation | None = None, | ||
| requirements: tuple[str, ...] = (), | ||
| legacy_name: str = "", | ||
| is_non_nullable: bool = False, |
There was a problem hiding this comment.
Absolutely I fully agree
| ) | ||
|
|
||
|
|
||
| class FrameConstructor(Protocol): |
There was a problem hiding this comment.
I think it has to do with the two EagerFrameConstructor and LazyFrameConstructor but it was a bunch of commits ago... I swear I already forgot 😭
|
@FBruzzesi want to correct an earlier statement, (yeah this one woke me up in the middle of the night)
This wasn't actually true, we would need to override the class print(f'{str(pandas_constructor) = }') # Constructor(name='pandas', is_eager=True, ...)
print(f'{str(pandas_constructor.func) = }') # '<function pandas_constructor at 0x7f6f96213b60>'The modifications: class Constructor:
....
def __str__(self):
return self.func.__name__ # or str(self.func) for a more literal backwards compat |
|
@camriddell I moved to the registry pattern in e1d09e5, looking forward to your feedback! |
Description
This PR introduces two new features:
tests/conftest.pyintonarwhals/testing/constructorsnw.from_nativeright after.nw.from_nativekwargs -> For this it's enough to provide it as different proper arguments in__call__, e.g.nw_kwargs,backend_kwargs.Version.Xor the namespace{nw, nw_v1, nw_v2}? If we do so, I would argue that we must enforce it and never assume for the user. In my opinionnw_frame_constructor(data, version=nw)would still look better thannw.from_native(nw_frame_constructor(data))constructorandconstructor_eageris not really intuitive. I am happy to brainstorm better names for downstream users. And similarly the flags--all-cpu-constructors,--constructors. (*)Update: refactor: Rename exposed fixtures and pytest options #3556
nw_lazy_constructorfor "free"nw_series_constructor). I know it's possible to make a dataframe and then get a column. But it would definitely improve users ergonomics. However, that's feature request/improvement and out of scope of this PR(*) Renaming proposal
nw_frame_constructor,nw_eager_frame_constructor--nw-all-backends(misleading that has no CPU in it?),--nw-backends=.... I think the term constructor might be be too vague, and there is no other case in the codebase. While we have a bunch of cases in which we allow to pass a backendNow implemented in #3556, targeting this branch
pyest-cov changes
The
[project.entry-points.pytest11]registers the pytest plugin which we dogfood in the testsuite here. Whenpyteststarts, it loads all entry-point plugins beforepytest-covinstalls its coverage tracer.Importing
narwhals.testing.pytest_pluginforces Python to loadnarwhals/__init__.py(to resolve the package), which eagerly imports the entire public API.On main, no entry point existed, so no narwhals code loaded before coverage started.
The fix:
pytest-covdependencypyproject.tomlto track subprocess, execv and fork (the latter two are not available on windows, hence multiple hacks are performed in GithubAction's to dynamically populate the fields inpyproject.toml)coverage run -m pytest ...,coverage combineandcoverage report, so that coverage run starts the tracer at Python startup before any importsThanks to @EdAbati and @dangotbanned - I mostly adjusted from your work, not much more than that!
What type of PR is this? (check all applicable)
Related issues
testingmodule to enable downstream libraries testing #739 (not marking as closing, since we should fully dogfood in the test suite)Checklist