Changes from 69 commits
91 commits
f7fdcd7
Adding the SessionEncoder
rcap107 Feb 24, 2026
65be83a
more work
rcap107 Feb 25, 2026
a6caeb7
adding tests
rcap107 Feb 25, 2026
cc14c55
changelog
rcap107 Feb 25, 2026
dba476f
adding drop cols, various improvements
rcap107 Feb 25, 2026
d46090d
adding a test
rcap107 Feb 25, 2026
28b958a
simplifying tests and code
rcap107 Feb 25, 2026
ac7389d
docstrings
rcap107 Feb 25, 2026
8a61ea0
adding support for multiple by columns
rcap107 Feb 26, 2026
a70006f
fixing optional by
rcap107 Feb 26, 2026
429f155
ddocs
rcap107 Feb 26, 2026
23c4111
fixing a compatibility problem
rcap107 Feb 26, 2026
acb091f
changelog
rcap107 Feb 26, 2026
d69fb9e
improving tests
rcap107 Feb 26, 2026
e0199f5
renaming session id column
rcap107 Feb 26, 2026
603a57b
fixing some broken tests
rcap107 Feb 26, 2026
9d2c577
doctest
rcap107 Feb 26, 2026
8d869b2
testing error dispatch
rcap107 Feb 26, 2026
01fed5f
addressing some of the coments
rcap107 Feb 27, 2026
746109b
addressing more comments
rcap107 Feb 27, 2026
71302b1
fixing test
rcap107 Feb 27, 2026
ea99353
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Feb 27, 2026
bca2db2
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Mar 27, 2026
868d529
changelo
rcap107 Mar 27, 2026
e27b23c
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 Apr 9, 2026
25ce541
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Apr 10, 2026
280d6f1
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 Apr 13, 2026
b37bec7
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Apr 22, 2026
bd64559
reordering rows after adding session id
rcap107 Apr 22, 2026
3758a43
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 Apr 28, 2026
c8e91c9
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 May 22, 2026
8ddd651
fixing changelog after merge
rcap107 May 22, 2026
dcc1369
implementing a fix from review
rcap107 May 24, 2026
87371e6
reordering columns so that the session id is added as last col
rcap107 May 26, 2026
b336d9b
more fixes
rcap107 May 26, 2026
9fbe79d
_
rcap107 May 26, 2026
1680162
example
rcap107 May 27, 2026
091a66c
docstrings
rcap107 May 28, 2026
8248f50
changing to seconds
rcap107 May 28, 2026
6c2a92c
more tests
rcap107 May 28, 2026
542042b
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 May 28, 2026
8c6d6a3
ensuring that columns do not get overwritten
rcap107 May 28, 2026
b834395
renaming a parameter
rcap107 May 28, 2026
091c122
_
rcap107 May 28, 2026
6de367c
fixing a bug on windows
rcap107 May 28, 2026
08d2b11
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 May 29, 2026
528006b
adding new generator and example
rcap107 May 29, 2026
d26222f
adding comments
rcap107 May 29, 2026
26e4436
fixing inconsistency
rcap107 May 29, 2026
fecc532
fixing possible bug
rcap107 May 29, 2026
bf58d53
comments
rcap107 May 29, 2026
2aa11cc
grr
rcap107 May 29, 2026
3c79333
Apply suggestions from code review
rcap107 Jun 1, 2026
8b30ebd
Update skrub/_session_encoder.py
rcap107 Jun 1, 2026
e561d9b
improvements and changes from review
rcap107 Jun 1, 2026
d9012dc
more improvements
rcap107 Jun 1, 2026
d73c45b
adding plain example
rcap107 Jun 1, 2026
b77a59e
rewording
rcap107 Jun 1, 2026
8ee2bee
cleanup docstring
rcap107 Jun 1, 2026
981134e
Update examples/data_ops/1170_session_encoder.py
rcap107 Jun 2, 2026
d5aeecd
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Jun 2, 2026
f62a605
Merge branch 'feat-session-encoder' of github.com:rcap107/skrub into …
rcap107 Jun 2, 2026
6b03b14
fixing timezone
rcap107 Jun 2, 2026
89c79b7
doc cleanup
rcap107 Jun 2, 2026
1f5fe6f
more on docs
rcap107 Jun 2, 2026
ecf5e0f
_
rcap107 Jun 2, 2026
f694e15
_
rcap107 Jun 2, 2026
36e6585
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 Jun 3, 2026
fdab72a
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Jun 8, 2026
e033cdf
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Jun 8, 2026
2de71c4
addressing some of the comments from the review
rcap107 Jun 8, 2026
1dc32c2
clean up test
rcap107 Jun 8, 2026
d199ec2
addressing more comments
rcap107 Jun 8, 2026
aada173
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Jun 9, 2026
d22774c
changelog
rcap107 Jun 9, 2026
f728604
example
rcap107 Jun 9, 2026
f215bdb
slight rewording
rcap107 Jun 9, 2026
679077a
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 Jun 15, 2026
5f04c14
Apply suggestions from code review
rcap107 Jun 15, 2026
efada37
fixing doctest
rcap107 Jun 15, 2026
73adcfb
reworking docstring, renaming attr
rcap107 Jun 15, 2026
74f5991
moving error checking, more work on docstring
rcap107 Jun 15, 2026
49a4583
simplifying part of the code
rcap107 Jun 15, 2026
77942d2
addressing comments
rcap107 Jun 15, 2026
05c7bf7
removing factorizer
rcap107 Jun 15, 2026
b389575
docstring
rcap107 Jun 15, 2026
9675950
doc fixes
rcap107 Jun 15, 2026
0787468
pandas grrr
rcap107 Jun 15, 2026
2951486
fixing test on min deps
rcap107 Jun 16, 2026
326dbb9
adding a comment
rcap107 Jun 16, 2026
e04fd29
changelog
rcap107 Jun 16, 2026
7 changes: 6 additions & 1 deletion CHANGES.rst
@@ -31,7 +31,12 @@ New Features
:meth:`DataOp.skb.eval`, :meth:`SkrubLearner.predict`, etc., or in
:meth:`DataOp.skb.find` or :meth:`SkrubLearner.truncated_after`. :pr:`2062` by
:user:`Jérôme Dockès <jeromedockes>`.
- The :class:`SessionEncoder` is now available. This encoder takes a dataframe with
a timestamp column and computes sessions based on the given inactivity gap
(``session_gap``). Additionally, it is possible to provide a ``split_by`` column or
list of columns (e.g., user ID or (user ID, user device)) to compute sessions for
each grouping value. A new synthetic dataset generator,
:func:`~skrub.datasets.make_retail_events`, has also been added.
:pr:`1930` by :user:`Riccardo Cappuzzo <rcap107>`.
- The :class:`DropSimilar` transformer has been added, for removing columns in a
dataframe that present high correlation with other columns. :pr:`2023` by
:user:`Eloi Massoulié <emassoulie>`.
- :class:`ToFloat32` now allows users to specify ``decimal`` and ``thousand``
4 changes: 4 additions & 0 deletions doc/api_reference.py
@@ -89,6 +89,7 @@
"SimilarityEncoder",
"ToCategorical",
"DatetimeEncoder",
"SessionEncoder",
"ToDatetime",
"ToFloat",
],
@@ -338,6 +339,9 @@
"datasets.get_data_dir",
"datasets.make_deduplication_data",
"datasets.toy_orders",
"datasets.toy_products",
"datasets.toy_cities",
"datasets.make_retail_events",
],
}
],
62 changes: 62 additions & 0 deletions doc/modules/multi_column_operations/sessionization.rst
@@ -0,0 +1,62 @@
.. _sessionization:

.. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder`
.. |BaseEstimator| replace:: :class:`~sklearn.base.BaseEstimator`
.. |TransformerMixin| replace:: :class:`~sklearn.base.TransformerMixin`


Detecting sessions in timestamped data with the SessionEncoder
----------------------------------------------------------------

When dealing with timestamped data (data that includes at least a timestamp column),
it may be beneficial to try and identify groups of events through **sessionization**.

Sessionization is the process of grouping a sequence of events (like user
interactions) into meaningful sessions. A new session typically starts with a user's
first event or after a period of inactivity.

For example, in an online retail context, you might define a new session whenever
more than 30 minutes pass with no activity from a user. On a website, a session may
be defined as a sequence of requests made by a single end user within a certain time window.

While definitions may vary depending on the specific use case, being able to detect
such "bursts" of activity by a user can help with building features that often have
greater predictive power than raw individual events.

The |SessionEncoder| helps address this problem by detecting sessions based on
a timestamp column, other "session columns" (passed as ``split_by``, e.g., user and device) that should be

Contributor review comment:
I get a little lost in this sentence. Maybe something like 'a timestamp column, other session-related columns (e.g. ...)'?

Member review comment:
session columns = split_by right?

used to distinguish between sessions, and a ``session_gap`` expressed in seconds. A session is then
defined as a sequence of events that share the same values in the "session columns"
and in which consecutive events are closer to each other than the ``session_gap``.
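
The definition above can be illustrated with plain pandas. This is a minimal sketch of gap-based sessionization on a hand-made table, not the |SessionEncoder| implementation: the toy data, the ``session_id`` column name and the strict ``>`` comparison against the gap are assumptions made for illustration.

```python
import pandas as pd

# Toy event log (hypothetical data); column names mirror the ones used on this page.
events = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00:00",
                "2024-01-01 10:10:00",
                "2024-01-01 11:00:00",
                "2024-01-01 10:05:00",
                "2024-01-01 10:20:00",
            ]
        ),
    }
)

session_gap = 30 * 60  # inactivity threshold in seconds (30 minutes)

# Sort so that the events of each user are contiguous and ordered in time.
events = events.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# Seconds elapsed since the previous event of the same user (NaN for the first one).
gap = events.groupby("user_id")["timestamp"].diff().dt.total_seconds()

# A new session starts at a user's first event or after a gap above the threshold.
new_session = gap.isna() | (gap > session_gap)

# Counting session starts cumulatively yields one integer ID per session.
events["session_id"] = new_session.cumsum() - 1
print(events)
```

Here user ``a`` gets two sessions, because 50 minutes separate the second and third events, while user ``b`` gets a single one.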

>>> from skrub import SessionEncoder
>>> from skrub.datasets import make_retail_events
>>> events = make_retail_events(n_events=100, random_state=0)
>>> X, y = events.X, events.y

Once the necessary columns are provided, the |SessionEncoder|
returns a dataframe with an additional session ID column (``timestamp_session_id`` in the
example below), which holds an integer ID for each session:

>>> se = SessionEncoder(timestamp_col="timestamp", split_by="user_id", session_gap=30 * 60)
>>> res = se.fit_transform(X)
>>> res.head(5) # doctest: +SKIP
     user_id                         timestamp  device_type  page_category  event_type  time_on_page  price_viewed  timestamp_session_id
0  user_0164  2024-01-01 03:29:07.708922+00:00       mobile        fashion   page_view         134.1        309.80                    59
1  user_0164  2024-01-01 03:29:42.185048+00:00       tablet          books      search         103.4         11.00                    59
2  user_0164  2024-01-01 03:32:38.352703+00:00      desktop           home    wishlist         180.3          4.80                    59
3  user_0008  2024-01-02 10:49:56.974375+00:00       mobile          books   page_view           7.0         33.94                     2
4  user_0149  2024-01-04 10:00:15.882835+00:00      desktop    electronics   page_view         108.5          4.44                    49

Once the session ID is available, it becomes possible to compute aggregations on
each session, for example to find the duration of a session, or the number of sessions
by a user.
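
As a rough illustration of these two aggregations, the sketch below uses pandas on a small hand-made table that stands in for the encoder output. The data and the ``duration_s`` and ``n_events`` names are hypothetical; only the ``timestamp_session_id`` column name is taken from the output shown above.

```python
import pandas as pd

# Hand-made stand-in for a sessionized table (hypothetical data).
res = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00:00",
                "2024-01-01 10:10:00",
                "2024-01-01 11:00:00",
                "2024-01-01 10:05:00",
                "2024-01-01 10:20:00",
            ]
        ),
        "timestamp_session_id": [0, 0, 1, 2, 2],
    }
)

# One row per session: owner, number of events, first and last timestamp.
per_session = res.groupby("timestamp_session_id").agg(
    user_id=("user_id", "first"),
    n_events=("timestamp", "count"),
    start=("timestamp", "min"),
    end=("timestamp", "max"),
)

# Session duration in seconds (0 for single-event sessions).
per_session["duration_s"] = (
    per_session["end"] - per_session["start"]
).dt.total_seconds()

# Number of sessions for each user.
sessions_per_user = per_session.groupby("user_id").size()

print(per_session)
print(sessions_per_user)
```

Note that running such aggregations on a full table, as done here, is exactly the situation the warning below is about: inside a model pipeline they should be computed separately on training and test data.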

.. warning::

    Aggregation can introduce data leakage! Records should only be aggregated from
    within the training set at training time and the test set at predict time. To
    ensure this is the case, any code that performs aggregation can be wrapped in a
    scikit-learn |BaseEstimator| (as shown in the
    :ref:`SessionEncoder example <sphx_glr_auto_examples_0110_session_encoder.py>`),
    or the pipeline should use the skrub :ref:`Data Ops framework<user_guide_data_ops_plan>`.
1 change: 1 addition & 0 deletions doc/multi_column_operations.rst
@@ -15,3 +15,4 @@ multiple columns.
modules/multi_column_operations/selectors
modules/multi_column_operations/type_of_selectors
modules/multi_column_operations/advanced_selectors
modules/multi_column_operations/sessionization
171 changes: 171 additions & 0 deletions examples/0110_session_encoder.py
@@ -0,0 +1,171 @@
"""

Sessions in time-based data: Predicting user purchases with the SessionEncoder
===============================================================================

.. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder`
.. |make_retail_events| replace:: :func:`~skrub.datasets.make_retail_events`
.. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline`
.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |DummyClassifier| replace:: :class:`~sklearn.dummy.DummyClassifier`
.. |TimeSeriesSplit| replace:: :class:`~sklearn.model_selection.TimeSeriesSplit`
.. |BaseEstimator| replace:: :class:`~sklearn.base.BaseEstimator`
.. |TransformerMixin| replace:: :class:`~sklearn.base.TransformerMixin`

This example shows how to use |SessionEncoder| in a scikit-learn pipeline to
create session-level features (sessionization) for conversion prediction, that is,
predicting whether a user session will eventually lead to a purchase.

.. topic:: What is sessionization?

    Sessionization is the process of grouping a sequence of events (like user
    interactions) into meaningful sessions. A session typically starts fresh or
    after a period of inactivity. For example, in an online retail context, you
    might define a new session whenever more than 30 minutes pass with no activity
    from a user. This allows you to extract session-level features (like the total
    number of events in a session or the dominant device type used) which often have
    greater predictive power than raw individual events.

We will:

1. Use |make_retail_events| to generate synthetic retail event data
2. Build a baseline classifier on raw event-level features with the |tabular_pipeline|
3. Add session-level and historical features with |SessionEncoder|
4. Train the same model again and compare ROC-AUC

The data includes columns such as event type, device type, viewed price, and
timestamp. The target is binary: whether the session eventually contains a
purchase event or not.

Member review comment:
as I mentioned I find it weird that we have session-level targets to predict but still need to construct session ids from the events. we can revisit the example and dataset in a later PR


"""

# %%
# Since this is temporal data, we use a time-aware CV strategy with
# |TimeSeriesSplit| to avoid leakage. We reuse the same splitter for all evaluations.
from sklearn.model_selection import TimeSeriesSplit

splitter = TimeSeriesSplit(n_splits=5)
# %%
# We begin by generating the data with |make_retail_events| and defining our
# features and target.
from skrub import TableReport
from skrub.datasets import make_retail_events

events = make_retail_events(n_users=20, n_events=5000, random_state=0)
X, y = events.X, events.y
TableReport(X)
# %%
# The data contains 5000 events from 20 users, where each event is timestamped.
# Other columns include the event type, device used by the user, page category,
# time spent on page and price of the item. The target variable indicates whether
# a user session eventually contains a purchase event: all events in that session
# will have a target value of 1 if a purchase happens, and 0 otherwise.

# %%
# Sanity check: evaluate a DummyClassifier on raw event data
# ---------------------------------------------------------------
# We begin by evaluating a |DummyClassifier| on the original event data
# (without session features). Since it's a |DummyClassifier|, we expect
# chance-level performance (ROC-AUC of 0.5).
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy = DummyClassifier(strategy="most_frequent")

scores = cross_val_score(dummy, X, y, cv=splitter, scoring="roc_auc")
print(f"ROC-AUC with DummyClassifier: {scores.mean():.3f}")

# %%
# First attempt: training a model without using session-level features
# --------------------------------------------------------------------
# We first use the |tabular_pipeline| on raw event-level data, without any session
# encoding or aggregation. This serves as a baseline to compare against the enriched
# model later.
# Remember that the |tabular_pipeline| will automatically add a |TableVectorizer|
# to perform feature engineering, so the model can still learn from the raw event
# features. However, it won't be able to directly capture session-level patterns.
from skrub import tabular_pipeline

model = tabular_pipeline("classification")

scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc")
print(f"ROC-AUC without session encoding: {scores.mean():.3f}")
# %%
# The model is not performing much better than the DummyClassifier, which suggests
# that raw event-level features are not sufficient for good conversion prediction.
# This baseline is limited because it cannot directly use session-level behavior
# (for example, whether "add_to_cart" happened in the same session).

# %%
# A better approach: session encoding and aggregation
# ------------------------------------------------------
# Next, we use the |SessionEncoder| to create session-level features that we can
# aggregate over. We define a session boundary as "a user has been inactive for
# more than 30 minutes". The |SessionEncoder| will create a new column
# ``timestamp_session_id`` that assigns a unique session ID to each session detected.
# The parameter ``session_gap=30 * 60`` specifies the inactivity threshold in
# seconds (30 minutes).
#
# Note that session-based features involve aggregations, which must be performed
# only on the training data within each fold to avoid leakage. In a scikit-learn
# pipeline, we can achieve this by using |SessionEncoder| followed by a custom
# transformer that computes session aggregates, and ensuring that the pipeline is
# properly fitted within each fold of cross-validation.
# %%
from skrub import SessionEncoder, tabular_pipeline

se = SessionEncoder("timestamp", split_by="user_id", session_gap=30 * 60)
# Here we fit the SessionEncoder on the entire dataset for demonstration purposes
X_sessions = se.fit_transform(X)
X_sessions.head()

# %%
# Defining a custom transformer for session-level aggregation
# -----------------------------------------------------------
# To avoid data leakage and maintain a clean pipeline, we can create a custom
# transformer that inherits from |BaseEstimator| and |TransformerMixin| and
# computes session-level aggregates within a scikit-learn pipeline.
# This transformer will be fitted and applied separately within each fold of
# cross-validation, ensuring that session features are computed only on the training
# data of each fold.

from sklearn.base import BaseEstimator, TransformerMixin


class SessionAggregator(BaseEstimator, TransformerMixin):

Member review comment:
We should consider, later, adding this as a transformer in Skrub.

Let's "sleep on it" (a few times)

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn at fit time
        return self

    def transform(self, X):
        # Compute session-level aggregates
        session_agg = X.groupby("timestamp_session_id").agg(
            session_has_add_to_cart=("event_type", lambda x: "add_to_cart" in x.values),
            session_n_events=("event_type", "count"),
            session_mean_price=("price_viewed", "mean"),
            session_dominant_device=("device_type", lambda x: x.mode()[0]),
        )
        # Join back to the original data
        return X.join(session_agg, on="timestamp_session_id")


# %%
# Then, we create a pipeline that includes the |SessionEncoder|, our custom
# ``SessionAggregator``, and the |tabular_pipeline| for classification. This
# pipeline will be used in cross-validation to evaluate the model
# with session features.
from sklearn.pipeline import make_pipeline

model = make_pipeline(se, SessionAggregator(), tabular_pipeline("classification"))
scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc")
print(f"ROC-AUC with session encoding: {scores.mean():.3f}")

# %%
# As expected, the model with session encoding performs much better than the baseline
# without session features, demonstrating the value of sessionization for conversion
# prediction.
#
# The fact that we are working with aggregation means that it was necessary to
# create a custom transformer to compute session-level features. This situation
# can be avoided by using the skrub DataOps workflow, which allows for more
# flexible data transformations without needing to fit everything within a
# scikit-learn pipeline.
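
The aggregate-then-join pattern used in ``SessionAggregator.transform`` can be checked in isolation on a small table. The sketch below reuses the same column names and aggregations as the example, but the three-row dataframe is made up for illustration and does not come from ``make_retail_events``.

```python
import pandas as pd

# Made-up events: two in session 0, one in session 1.
X_toy = pd.DataFrame(
    {
        "timestamp_session_id": [0, 0, 1],
        "event_type": ["page_view", "add_to_cart", "search"],
        "price_viewed": [10.0, 30.0, 5.0],
        "device_type": ["mobile", "mobile", "desktop"],
    }
)

# One row per session, with the same aggregates as in the example.
session_agg = X_toy.groupby("timestamp_session_id").agg(
    session_has_add_to_cart=("event_type", lambda x: "add_to_cart" in x.values),
    session_n_events=("event_type", "count"),
    session_mean_price=("price_viewed", "mean"),
    session_dominant_device=("device_type", lambda x: x.mode()[0]),
)

# Broadcast the session-level values back onto every event of the session.
out = X_toy.join(session_agg, on="timestamp_session_id")
print(out)
```

Every event keeps its original columns and gains the four session-level ones, so the number of rows is unchanged and the downstream model still receives one row per event.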