adding the variance threshold#2155
Conversation
|
There's a formatting issue that Skrub's Pixi environment can handle automatically. I'm going to correct it on my end straight away. |
rcap107
left a comment
There was a problem hiding this comment.
Hi @jannesantana, thanks for the PR. I left a few specific comments, and aside from that there should be an entry in the docstring of DropUninformative for the variance_threshold parameter
| def _drop_if_constant(self, column): | ||
| if self.drop_if_constant: | ||
| if (sbd.n_unique(column) == 1) and (self._null_count == 0): | ||
| if sbd.is_numeric(column) == 1 and ( |
There was a problem hiding this comment.
is_numeric returns false if the column has dtype bool
| if sbd.is_numeric(column) == 1 and ( | |
| if (sbd.is_numeric(column) or sbd.is_bool(column)) and ( |
also there is no need to check that it's == 1
| drop_if_constant=False, | ||
| drop_if_unique=False, | ||
| drop_null_fraction=1.0, | ||
| threshold=0.0, |
There was a problem hiding this comment.
| threshold=0.0, | |
| variance_threshold=0.0, |
it's better to be explicit here
the parameter should be renamed everywhere
| return True | ||
| else: | ||
| return False | ||
| elif (sbd.n_unique(column) == 1) and (self._null_count == 0): |
There was a problem hiding this comment.
the condition can be simplified a bit by moving the _null_count clause outside
if self.drop_if_constant and self._null_count == 0:
...
if sbd.is_numeric(column) or sbd.is_bool(column):
...
elif sbd.n_unique(column) == 1
...| :pr:`2096` by :user:`Ayesha Siddiqua <siddiqua-tamk>`. | ||
| - The :class:`TableReport` can now be exported in markdown format with ``.markdown``. | ||
| :pr:`2048` by :user:`Riccardo Cappuzzo <rcap107>`. | ||
| - The :class:`DropUninformative` was improved so that `drop_if_constant` becomes a variance |
There was a problem hiding this comment.
This entry should be updated slightly. Something like:
DropUninformative now has the variance_threshold parameter, which allows to drop numeric and boolean columns whose variance is lower than the given threshold. The default behavior is unchanged.
Bug Fix Pull Request
Description
Before, drop_if_constant would only drop columns that were constant. We added a variance threshold for numeric and boolean columns types.
The next step would be to change the name to drop_var_threshold, for example.
Addresses #2109
Checklist
How Has This Been Tested?
We've created a testing dataframe that covers different cases.
AI Disclosure