adding the variance threshold#2155

Draft
jannesantana wants to merge 7 commits into
skrub-data:mainfrom
jannesantana:variance_threshold
Conversation

@jannesantana

@jannesantana jannesantana commented Jun 10, 2026

Bug Fix Pull Request

Description

Previously, drop_if_constant only dropped columns that were constant. This PR adds a variance threshold for numeric and boolean column types.
A possible next step is to rename the parameter, for example to drop_var_threshold.

Addresses #2109

Checklist

  • I have read the contributing guidelines
  • I have added tests that verify the bug fix (the tests have been made separately)
  • I have added an entry to CHANGES.rst describing the fix
  • My code follows the code style of this project
  • I have checked my code and corrected any misspellings

How Has This Been Tested?

We created a test dataframe that covers the different cases.

AI Disclosure

  • This PR contains AI-generated code
    • I have tested the code generated in my PR
    • I have read and understood every line that has been generated by the AI agent
    • I can explain what the AI-generated code does

@MarieSacksick MarieSacksick added the CFM sprint June 2026 For PRs opened during the CFM sprint in June 2026 label Jun 10, 2026
@jannesantana jannesantana marked this pull request as draft June 10, 2026 12:47
@emassoulie

Contributor

There's a formatting issue that Skrub's Pixi environment can handle automatically. I'm going to correct it on my end straight away.

@rcap107 rcap107 left a comment

Member

Hi @jannesantana, thanks for the PR. I left a few specific comments. Aside from those, the docstring of DropUninformative needs an entry for the variance_threshold parameter.

def _drop_if_constant(self, column):
if self.drop_if_constant:
if (sbd.n_unique(column) == 1) and (self._null_count == 0):
if sbd.is_numeric(column) == 1 and (

Member

is_numeric returns False if the column has dtype bool, so boolean columns need a separate is_bool check.

Suggested change
if sbd.is_numeric(column) == 1 and (
if (sbd.is_numeric(column) or sbd.is_bool(column)) and (

Also, there is no need to check that the result is == 1; the boolean return value can be used directly.

drop_if_constant=False,
drop_if_unique=False,
drop_null_fraction=1.0,
threshold=0.0,

Member

Suggested change
threshold=0.0,
variance_threshold=0.0,

It's better to be explicit here.

The parameter should be renamed everywhere.

return True
else:
return False
elif (sbd.n_unique(column) == 1) and (self._null_count == 0):

Member

The condition can be simplified a bit by moving the _null_count clause to the outer if:

if self.drop_if_constant and self._null_count == 0:
    ...
    if sbd.is_numeric(column) or sbd.is_bool(column):
        ...
    elif sbd.n_unique(column) == 1:
        ...
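
The restructuring above could look roughly like the following sketch. It uses plain Python lists in place of dataframe columns, so it is not skrub's implementation: the function name should_drop, the None-based null count, and the choice of <= for the comparison (so that the default of 0.0 still drops constant columns) are all assumptions made for illustration.

```python
from statistics import pvariance


def should_drop(values, drop_if_constant=True, variance_threshold=0.0):
    """Hypothetical stand-in for the check discussed above, on plain lists."""
    # The null check is hoisted to the outer condition, as suggested in the review.
    null_count = sum(v is None for v in values)
    if not (drop_if_constant and null_count == 0):
        return False
    if not values:
        return False
    # bool is a subclass of int in Python, so this branch covers
    # both numeric and boolean values.
    if all(isinstance(v, (bool, int, float)) for v in values):
        # Assumption: "<=" so that the default 0.0 still drops constant columns.
        return pvariance([float(v) for v in values]) <= variance_threshold
    # Other columns keep the original "exactly one distinct value" rule.
    return len(set(values)) == 1
```

With the default threshold this behaves like the original constant-column check, and a column containing a null is never dropped by this rule.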

Comment thread CHANGES.rst
:pr:`2096` by :user:`Ayesha Siddiqua <siddiqua-tamk>`.
- The :class:`TableReport` can now be exported in markdown format with ``.markdown``.
:pr:`2048` by :user:`Riccardo Cappuzzo <rcap107>`.
- The :class:`DropUninformative` was improved so that `drop_if_constant` becomes a variance

Member

This entry should be updated slightly. Something like:

DropUninformative now has the variance_threshold parameter, which allows dropping numeric and boolean columns whose variance is lower than the given threshold. The default behavior is unchanged.
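
To give a sense of scale for boolean columns, here is a small standard-library sketch. It assumes the population variance is the quantity being compared, which the PR does not state. For 0/1 data that variance equals p * (1 - p), where p is the fraction of True values.

```python
from statistics import pvariance

# A boolean column with a single True among 100 values (p = 0.01).
column = [True] + [False] * 99
p = sum(column) / len(column)

# Convert to floats before computing the population variance.
variance = pvariance([float(v) for v in column])

# For 0/1 data the population variance equals p * (1 - p).
assert abs(variance - p * (1 - p)) < 1e-12
```

Under that assumption, a threshold of 0.01 would be enough to flag a boolean column that is True in only 1% of rows.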
