3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -41,6 +41,7 @@ New Features
regular numeric format (``(432)`` becomes ``-432``). :pr:`1772` by :user:`Gabriela
Gómez Jiménez <gabrielapgomezji>`.


Changes
-------
- :meth:`choose_from` now transparently converts `outcomes` to a list when it is another type of sequence. :pr:`2100` by
@@ -54,6 +55,8 @@ Changes
:pr:`2096` by :user:`Ayesha Siddiqua <siddiqua-tamk>`.
- The :class:`TableReport` can now be exported in markdown format with ``.markdown``.
:pr:`2048` by :user:`Riccardo Cappuzzo <rcap107>`.
- The :class:`TableReport` can now be exported the estimated memory usage in TableReport when display data.

Suggested change
- The :class:`TableReport` can now be exported the estimated memory usage in TableReport when display data.
- The :class:`TableReport` can now display the estimated memory usage of
the data it is applied to.

:pr:`2153` by :user:`Salam AlKaissi <salam-alkaissi>` and :user:`Sanae Janati Idrissi <sana-22>`.

Bugfixes
--------
7 changes: 7 additions & 0 deletions skrub/_reporting/_data/templates/dataframe-sample.html
@@ -86,6 +86,13 @@
<strong>{{ summary.n_rows | format_number }}</strong> rows ✕
<strong data-manager="ColumnFilterMatchCount"
data-test="n-columns-display">{{ summary.n_columns | format_number }}</strong> columns
{%- if summary.get("memory_usage_kb") is not none -%}
<span data-test="memory-usage-display"> (estimated memory usage: {{ "%.1f" | format(summary.get("memory_usage_kb")) }} KB)
{%- if summary.get("memory_estimate_unreliable") -%}
<em class="memory-warning">— estimate may be inaccurate for complex objects</em>
{%- endif -%}
</span>
{%- endif -%}
{% if 'is_subsampled' in summary %}
(subsampled from more rows)
{% endif %}
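
Jinja's `format` filter applies Python's printf-style `%` formatting, so the numeric part of the fragment above can be previewed in plain Python. The sketch below mirrors only the `is not none` branch and the `"%.1f"` format; `format_memory_usage` is a hypothetical helper written for illustration and is not part of skrub.

```python
def format_memory_usage(summary):
    """Preview the text the template would render for the memory estimate."""
    memory_usage_kb = summary.get("memory_usage_kb")
    if memory_usage_kb is None:
        # Mirrors the template: nothing is rendered when the value is absent.
        return ""
    # Same printf-style format string as the template expression.
    return "(estimated memory usage: %.1f KB)" % memory_usage_kb


print(format_memory_usage({"memory_usage_kb": 2048 / 1024}))
print(format_memory_usage({"memory_usage_kb": 40 / 1024}))
print(repr(format_memory_usage({})))
```

One consequence of the fixed one-decimal format is that a frame of only a few dozen bytes is shown as `0.0 KB`.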
6 changes: 6 additions & 0 deletions skrub/_reporting/_data/templates/report.md
@@ -2,6 +2,12 @@

The provided dataframe uses the {{ summary.dataframe_module }} library. It has
**shape** {{ summary.n_rows }} rows × {{ summary.n_columns }} columns.
{% if summary.get("memory_usage_kb") is not none %}
**memory usage** {{ "%.1f" | format(summary.get("memory_usage_kb")) }} KB.
{% if summary.get("memory_estimate_unreliable") %}
_Note: memory estimate may be inaccurate for complex object columns._
{% endif %}
{% endif %}

Columns are marked as "high cardinality" if they contain more than
{{ summary.cardinality_threshold }} unique values.
37 changes: 37 additions & 0 deletions skrub/_reporting/_summarize.py
@@ -2,6 +2,9 @@

import sys

from skrub._dataframe._common import raise_dispatch_unregistered_type
from skrub._dispatch import dispatch

from .. import _column_associations, _config
from .. import _dataframe as sbd
from . import _plotting, _sample_table, _utils
@@ -18,6 +21,34 @@
_N_TOP_ASSOCIATIONS = 1000


@dispatch
def _memory_usage_kb(obj):
    raise_dispatch_unregistered_type(obj)


@_memory_usage_kb.specialize("pandas")
def _memory_usage_pandas(obj):
    memory_usage_bytes = obj.memory_usage(deep=False).sum()
    return memory_usage_bytes / 1024


@_memory_usage_kb.specialize("polars")
def _memory_usage_polars(obj):
    memory_usage_bytes = obj.estimated_size()
    return memory_usage_bytes / 1024


def _has_complex_objects(df):
    """Return True when pandas has object-dtype columns.

    The memory estimate is less reliable for object-dtype columns, so we warn
    as soon as any are present.
    """
    if sbd.dataframe_module_name(df) != "pandas":
        return False
    return any(dtype == object for dtype in df.dtypes)
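
The `@dispatch` / `.specialize(...)` pattern above selects an implementation based on the dataframe library. As a rough analogy only, the same shape can be sketched with the standard library's `functools.singledispatch`; this is not skrub's `_dispatch` module, and `FakePandasFrame` and `FakePolarsFrame` are hypothetical stand-ins that simply carry a byte count.

```python
from functools import singledispatch


class FakePandasFrame:
    """Hypothetical stand-in for a pandas DataFrame."""

    def __init__(self, n_bytes):
        self.n_bytes = n_bytes


class FakePolarsFrame:
    """Hypothetical stand-in for a polars DataFrame."""

    def __init__(self, n_bytes):
        self.n_bytes = n_bytes


@singledispatch
def memory_usage_kb(obj):
    # Fallback for unregistered types, playing the role of
    # raise_dispatch_unregistered_type in the diff.
    raise TypeError(f"Unsupported dataframe type: {type(obj).__name__}")


@memory_usage_kb.register
def _(obj: FakePandasFrame):
    # Convert bytes to kilobytes, as the pandas specialization does.
    return obj.n_bytes / 1024


@memory_usage_kb.register
def _(obj: FakePolarsFrame):
    # Same conversion for the second backend.
    return obj.n_bytes / 1024


print(memory_usage_kb(FakePandasFrame(2048)))
print(memory_usage_kb(FakePolarsFrame(512)))
```

Passing any other type, such as a plain list, reaches the fallback and raises `TypeError`.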


def summarize_dataframe(
df,
*,
@@ -77,6 +108,7 @@ def summarize_dataframe(
"dataframe_module": sbd.dataframe_module_name(df),
"n_rows": n_rows,
"n_columns": n_columns,
"memory_usage_kb": _memory_usage_kb(df),
"columns": [],
"dataframe_is_empty": not n_rows or not n_columns,
"plots_skipped": not with_plots,
@@ -90,6 +122,11 @@ def summarize_dataframe(
    }
    if title is not None:
        summary["title"] = title
    # detect complex objects that make memory estimates unreliable
    # try:
    summary["memory_estimate_unreliable"] = _has_complex_objects(df)
    # except Exception:
    # summary["memory_estimate_unreliable"] = False
    if order_by is not None:
        df = sbd.sort(df, by=order_by)
        summary["order_by"] = order_by
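
The `memory_estimate_unreliable` flag exists because a shallow measurement (`deep=False` in the pandas specialization) counts the references held by an object-dtype column but not the Python objects they point to. The plain-Python sketch below illustrates the same gap with a list of strings and `sys.getsizeof`; it is an analogy under that assumption, not a measurement of a real dataframe.

```python
import sys

# 100 distinct strings, standing in for the cells of an object-dtype column.
values = [str(i) * 200 for i in range(100)]

# Shallow: the list object itself, i.e. its array of references.
shallow_bytes = sys.getsizeof(values)

# Deep: additionally count every string the list refers to.
deep_bytes = shallow_bytes + sum(sys.getsizeof(v) for v in values)

print(f"shallow: {shallow_bytes / 1024:.1f} KB")
print(f"deep:    {deep_bytes / 1024:.1f} KB")
```

The deep figure is much larger than the shallow one, which is why a shallow estimate can understate the footprint of columns holding strings or other Python objects.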
2 changes: 1 addition & 1 deletion skrub/_reporting/_table_report.py
@@ -240,7 +240,7 @@ class TableReport:

>>> j = TableReport(df, plot_distributions=False).json()
>>> print(j)
{"dataframe_module": "pandas", "n_rows": 2, "n_columns": 3, "columns": ...
{"dataframe_module": "pandas", "n_rows": 2, "n_columns": 3, ...}


Advanced configuration: you can add custom column filters that will appear
1 change: 1 addition & 0 deletions skrub/_reporting/tests/test_markdown_template.py
@@ -37,6 +37,7 @@ def test_markdown_report_structure_and_titles(df_module):
assert "# " in markdown_default # Header should exist
# Shape info should be present
assert "**shape** 3 rows × 3 columns" in markdown
assert "**memory usage**" in markdown
# Unique values should be present (default value)
assert "40 unique values." in markdown
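
The substring assertion above only confirms that the label is present. A stricter variant could also check the number and unit; the sketch below assumes the rendered line follows the `report.md` template format (`**memory usage** <value> KB.`) and uses a hypothetical excerpt in place of a real report.

```python
import re

# Hypothetical excerpt of rendered markdown; a real test would use the
# string produced by the report instead.
rendered = "**shape** 3 rows × 3 columns.\n**memory usage** 0.3 KB.\n"

# One or more digits, a dot, exactly one decimal digit, then the unit.
match = re.search(r"\*\*memory usage\*\* (\d+\.\d) KB\.", rendered)

assert match is not None
assert float(match.group(1)) >= 0
```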

1 change: 1 addition & 0 deletions skrub/_reporting/tests/test_table_report.py
@@ -46,6 +46,7 @@ def test_report(air_quality):
assert "With nulls" in html
assert "First 10" in html
assert "First 2" in html
assert "memory usage:" in html
for col_name in sbd.column_names(air_quality):
assert col_name in html
report_id = get_report_id(html)