skrub-data · rcap107 · Feb 24, 2026 · Feb 25, 2026 · Feb 25, 2026 · Feb 25, 2026
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -31,6 +31,15 @@ New Features
   :meth:`DataOp.skb.eval`, :meth:`SkrubLearner.predict`, etc., or in
   :meth:`DataOp.skb.find` or :meth:`SkrubLearner.truncated_after`. :pr:`2062` by
   :user:`Jérôme Dockès <jeromedockes>`.
+- The :class:`SessionEncoder` is now available. This encoder adds a ``session_id``
+  column, which groups together events that occur within the given session gap.
+  Additionally, it is possible to provide a ``split_by`` column or list of columns
+  (e.g., user ID or (user ID, user device)) to compute sessions for each grouping
+  value.
+  :pr:`1930` by :user:`Riccardo Cappuzzo <rcap107>`.
+- A new synthetic dataset generator for timestamped data and session-based
+  operations has been added: :func:`~skrub.datasets.make_retail_events`.
+  :pr:`1930` by :user:`Riccardo Cappuzzo <rcap107>`.
 - The :class:`DropSimilar` transformer has been added, for removing columns in a
   dataframe that present high correlation with other columns. :pr:`2023` by
   :user:`Eloi Massoulié <emassoulie>`.

diff --git a/doc/api_reference.py b/doc/api_reference.py
@@ -90,6 +90,7 @@
                     "SimilarityEncoder",
                     "ToCategorical",
                     "DatetimeEncoder",
+                    "SessionEncoder",
                     "ToDatetime",
                     "ToFloat",
                 ],
@@ -339,6 +340,9 @@
                     "datasets.get_data_dir",
                     "datasets.make_deduplication_data",
                     "datasets.toy_orders",
+                    "datasets.toy_products",
+                    "datasets.toy_cities",
+                    "datasets.make_retail_events",
                 ],
             }
         ],

diff --git a/doc/modules/multi_column_operations/sessionization.rst b/doc/modules/multi_column_operations/sessionization.rst
@@ -0,0 +1,101 @@
+.. _sessionization:
+
+.. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder`
+.. |BaseEstimator| replace:: :class:`~sklearn.base.BaseEstimator`
+.. |TransformerMixin| replace:: :class:`~sklearn.base.TransformerMixin`
+
+
+Detecting sessions in timestamped data with the SessionEncoder
+----------------------------------------------------------------
+
+When dealing with timestamped data (data that includes at least a timestamp column),
+it may be beneficial to try and identify groups of events as
+`"sessions" <https://en.wikipedia.org/wiki/Session_(web_analytics)>`_,
+through **sessionization**.
+
+Sessionization is the process of grouping a sequence of events (like user
+interactions) into meaningful sessions.
+For example, in an online retail context you might define a new session whenever
+more than 30 minutes pass with no activity from the user. On a website, a session may
+define a sequence of requests made by a single end-user within a certain time duration.
+
+While definitions may vary depending on the specific use case, being able to detect
+such "bursts" of activity by a user can often help with building features that have
+greater predictive power than raw individual events, such as number of sessions or
+average session duration.
+
+The |SessionEncoder| addresses this problem by detecting sessions based on
+a timestamp column, other session-related columns (e.g., user and device) that should be
+used to distinguish between sessions, and a ``session_gap``. Session-related columns,
+identified by the ``split_by`` parameter, make it possible to split sessions by
+the values in those columns, for example to group user actions only if they were
+performed on the same device.
+
+A session is then defined as a sequence of events that share the same values in the
+``split_by`` columns, and in which consecutive events are closer to each other than the
+``session_gap``.
+
+>>> from skrub import SessionEncoder
+>>> from skrub.datasets import make_retail_events
+>>> events = make_retail_events(n_events=100, random_state=0)
+>>> X, y = events.X, events.y
+
+Once the necessary features are provided, the |SessionEncoder| returns a dataframe with a
+``timestamp_session_id`` column, which holds a monotonically increasing integer ID per session:
+
+>>> se = SessionEncoder(timestamp_col="timestamp", split_by="user_id", session_gap=30 * 60)
+>>> res = se.fit_transform(X)
+>>> res.head(5) # doctest: +SKIP
+     user_id                        timestamp device_type page_category event_type  time_on_page  price_viewed  timestamp_session_id
+0  user_0164 2024-01-01 03:29:07.708922+00:00      mobile       fashion  page_view         134.1        309.80                    59
+1  user_0164 2024-01-01 03:29:42.185048+00:00      tablet         books     search         103.4         11.00                    59
+2  user_0164 2024-01-01 03:32:38.352703+00:00     desktop          home   wishlist         180.3          4.80                    59
+3  user_0008 2024-01-02 10:49:56.974375+00:00      mobile         books  page_view           7.0         33.94                     2
+4  user_0149 2024-01-04 10:00:15.882835+00:00     desktop   electronics  page_view         108.5          4.44                    49
+
+With the session ID, it becomes possible to compute aggregations on
+each session, for example to find the duration or number of sessions
+by a user.
+
+.. warning::
+
+    Aggregation can introduce data leakage. Records should only be aggregated from
+    within the training set at training time, and from the test set at predict time.
+    To ensure this, any code that performs aggregation can be wrapped in a
+    scikit-learn |BaseEstimator| (as shown in the
+    :ref:`SessionEncoder example <sphx_glr_auto_examples_0110_session_encoder.py>`).
+    Alternatively, the pipeline can use the skrub :ref:`Data Ops framework <user_guide_data_ops_plan>`.
+
+The |SessionEncoder| includes the ``suffix`` parameter (by default
+``suffix="session_id"``), which is appended to the name of the timestamp column to form the
+name of the new column. This helps when creating multiple session IDs from the same timestamp.
+For example, we might want to create sessions based on users, and based on users
+and their device:
+
+>>> se = SessionEncoder(timestamp_col="timestamp",
+... split_by="user_id",
+... session_gap=30 * 60,
+... suffix="user"
+... )
+>>> res = se.fit_transform(X)
+>>> res.head(5) # doctest: +SKIP
+     user_id                        timestamp  ... price_viewed timestamp_user
+0  user_0164 2024-01-01 03:29:07.708922+00:00  ...       309.80             59
+1  user_0164 2024-01-01 03:29:42.185048+00:00  ...        11.00             59
+2  user_0164 2024-01-01 03:32:38.352703+00:00  ...         4.80             59
+3  user_0008 2024-01-02 10:49:56.974375+00:00  ...        33.94              2
+4  user_0149 2024-01-04 10:00:15.882835+00:00  ...         4.44             49
+
+>>> se = SessionEncoder(timestamp_col="timestamp",
+... split_by=["user_id", "device_type"],
+... session_gap=30 * 60,
+... suffix="user_device"
+... )
+>>> res = se.fit_transform(X)
+>>> res.head(5) # doctest: +SKIP
+     user_id                        timestamp  ... price_viewed timestamp_user_device
+0  user_0164 2024-01-01 03:29:07.708922+00:00  ...       309.80                    75
+1  user_0164 2024-01-01 03:29:42.185048+00:00  ...        11.00                    76
+2  user_0164 2024-01-01 03:32:38.352703+00:00  ...         4.80                    74
+3  user_0008 2024-01-02 10:49:56.974375+00:00  ...        33.94                     2
+4  user_0149 2024-01-04 10:00:15.882835+00:00  ...         4.44                    59
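Illustrative sketch of the session definition documented above (same ``split_by`` values, consecutive events closer than ``session_gap``), written in plain pandas. It is not the SessionEncoder implementation: the helper name `assign_session_ids`, the toy data, and the choice to start a new session only when the gap is strictly greater than `session_gap` are assumptions made for illustration, and the IDs it produces need not match those of the encoder.

```python
import pandas as pd


def assign_session_ids(df, timestamp_col, split_by, session_gap):
    """Hypothetical helper: gap-based session IDs (not skrub's implementation)."""
    split_by = [split_by] if isinstance(split_by, str) else list(split_by)
    # Sort so that the events of each group are consecutive and ordered in time.
    ordered = df.sort_values(split_by + [timestamp_col])
    # Time elapsed since the previous event of the same group (NaT for the first).
    gap = ordered.groupby(split_by)[timestamp_col].diff()
    # Assumption: a session starts at the first event of a group, or when the
    # gap is strictly greater than session_gap (given in seconds).
    new_session = gap.isna() | (gap > pd.Timedelta(seconds=session_gap))
    # Counting session starts cumulatively yields one integer ID per session.
    session_id = new_session.cumsum() - 1
    # Restore the original row order.
    return session_id.reindex(df.index)


events = pd.DataFrame(
    {
        "user_id": ["a", "a", "b", "a", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00",
                "2024-01-01 10:10",
                "2024-01-01 10:15",
                "2024-01-01 11:00",
                "2024-01-01 10:20",
            ]
        ),
    }
)
events["session_id"] = assign_session_ids(
    events, "timestamp", "user_id", session_gap=30 * 60
)
print(events)
```

In this toy frame, user "a" gets two sessions (the 11:00 event comes 50 minutes after the previous one) and user "b" gets one.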
diff --git a/doc/multi_column_operations.rst b/doc/multi_column_operations.rst
@@ -15,3 +15,4 @@ multiple columns.
    modules/multi_column_operations/selectors
    modules/multi_column_operations/type_of_selectors
    modules/multi_column_operations/advanced_selectors
+   modules/multi_column_operations/sessionization
diff --git a/examples/0110_session_encoder.py b/examples/0110_session_encoder.py
@@ -0,0 +1,174 @@
+"""
+
+Sessions in time-based data: Predicting user purchases with the SessionEncoder
+===============================================================================
+
+.. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder`
+.. |make_retail_events| replace:: :func:`~skrub.datasets.make_retail_events`
+.. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline`
+.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
+.. |DummyClassifier| replace:: :class:`~sklearn.dummy.DummyClassifier`
+.. |TimeSeriesSplit| replace:: :class:`~sklearn.model_selection.TimeSeriesSplit`
+.. |BaseEstimator| replace:: :class:`~sklearn.base.BaseEstimator`
+.. |TransformerMixin| replace:: :class:`~sklearn.base.TransformerMixin`
+
+This example shows how to use |SessionEncoder| in a scikit-learn pipeline to
+create session-level features (sessionization) for conversion prediction, that is,
+predicting whether a user session will eventually lead to a purchase.
+
+.. topic:: What is sessionization?
+
+    Sessionization is the process of grouping a sequence of events (like user
+    interactions) into meaningful sessions. A new session typically starts with a
+    user's first event or after a period of inactivity. For example, in an online
+    retail context, you might define a new session whenever more than 30 minutes
+    pass with no activity from a user. This allows you to extract session-level
+    features (like the total number of events in a session or the dominant device
+    type used), which often have greater predictive power than raw individual events.
+
+We will:
+
+1. Use |make_retail_events| to generate synthetic retail event data
+2. Build a baseline classifier on raw event-level features with the |tabular_pipeline|
+3. Add session-level and historical features with |SessionEncoder|
+4. Train the same model again and compare ROC-AUC
+
+The data includes columns such as event type, device type, viewed price, and
+timestamp. The target is binary: whether the session eventually contains a
+purchase event or not.
+
+"""
+
+# %%
+# Since this is temporal data, we use a time-aware CV strategy with
+# |TimeSeriesSplit| to avoid leakage. We reuse the same splitter for all evaluations.
+# The dataset is sorted by timestamp, so the training set will always contain only
+# past data relative to the test set.
+from sklearn.model_selection import TimeSeriesSplit
+
+splitter = TimeSeriesSplit(n_splits=5)
+# %%
+# We begin by generating the data with |make_retail_events| and defining our
+# features and target.
+from skrub import TableReport
+from skrub.datasets import make_retail_events
+
+events = make_retail_events(n_users=20, n_events=5000, random_state=0)
+X, y = events.X, events.y
+TableReport(X)
+# %%
+# The data contains 5000 events from 20 users, where each event is timestamped.
+# Other columns include the event type, device used by the user, page category,
+# time spent on page and price of the item. The target variable indicates whether
+# a user session eventually contains a purchase event: all events in that session
+# will have a target value of 1 if a purchase happens, and 0 otherwise.
+
+# %%
+# Sanity check: evaluate a DummyClassifier on raw event data
+# ---------------------------------------------------------------
+# We begin by evaluating a |DummyClassifier| on the original event data
+# (without session features). Since it always predicts the most frequent class,
+# we expect chance-level performance (ROC-AUC of 0.5).
+from sklearn.dummy import DummyClassifier
+from sklearn.model_selection import cross_val_score
+
+dummy = DummyClassifier(strategy="most_frequent")
+
+scores = cross_val_score(dummy, X, y, cv=splitter, scoring="roc_auc")
+print(f"ROC-AUC with DummyClassifier: {scores.mean():.3f}")
+
+# %%
+# First attempt: training a model without using session-level features
+# --------------------------------------------------------------------
+# We first use the |tabular_pipeline| on raw event-level data, without any session
+# encoding or aggregation. This serves as a baseline to compare against the enriched
+# model later.
+# Remember that the |tabular_pipeline| will automatically add a |TableVectorizer|
+# to perform feature engineering, so the model can still learn from the raw event
+# features. However, it won't be able to directly capture session-level patterns.
+from skrub import tabular_pipeline
+
+model = tabular_pipeline("classification")
+
+scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc")
+print(f"ROC-AUC without session encoding: {scores.mean():.3f}")
+# %%
+# The model is not performing much better than the DummyClassifier, which suggests
+# that raw event-level features are not sufficient for good conversion prediction.
+# This baseline is limited because it cannot directly use session-level behavior
+# (for example, whether "add_to_cart" happened in the same session).
+
+# %%
+# A better approach: session encoding and aggregation
+# ------------------------------------------------------
+# Next, we use the |SessionEncoder| to create session-level features that we can
+# aggregate over. We define a session boundary as "a user has been inactive for
+# more than 30 minutes". The |SessionEncoder| will create a new column
+# ``timestamp_session_id`` that assigns a unique session ID to each session detected.
+# The parameter ``session_gap=30 * 60`` specifies the inactivity threshold in
+# seconds (30 minutes).
+#
+# Note that session-based features involve aggregations, which must not mix training
+# and test records of a fold, to avoid leakage. In a scikit-learn pipeline, we can
+# achieve this by using |SessionEncoder| followed by a custom transformer that
+# computes session aggregates, so that the whole pipeline is fitted and applied
+# within each fold of cross-validation.
+
+# %%
+from skrub import SessionEncoder, tabular_pipeline
+
+se = SessionEncoder("timestamp", split_by="user_id", session_gap=30 * 60)
+# Here we fit the SessionEncoder on the entire dataset for demonstration purposes
+X_sessions = se.fit_transform(X)
+X_sessions.head()
+
+# %%
+# Defining a custom transformer for session-level aggregation
+# -----------------------------------------------------------
+# To avoid data leakage and maintain a clean pipeline, we can create a custom
+# transformer that inherits from |BaseEstimator| and |TransformerMixin| and
+# computes session-level aggregates within a scikit-learn pipeline.
+# This transformer will be fitted and applied separately within each fold of
+# cross-validation, ensuring that session features for the training and test data
+# of each fold are computed independently of each other.
+
+from sklearn.base import BaseEstimator, TransformerMixin
+
+
+class SessionAggregator(BaseEstimator, TransformerMixin):
+    def fit(self, X, y=None):
+        return self
+
+    def transform(self, X):
+        # Compute session-level aggregates
+        session_agg = X.groupby("timestamp_session_id").agg(
+            session_has_add_to_cart=("event_type", lambda x: "add_to_cart" in x.values),
+            session_n_events=("event_type", "count"),
+            session_mean_price=("price_viewed", "mean"),
+            session_dominant_device=("device_type", lambda x: x.mode()[0]),
+        )
+        # Join back to the original data
+        return X.join(session_agg, on="timestamp_session_id")
+
+
+# %%
+# Then, we create a pipeline that includes the |SessionEncoder|, our custom
+# ``SessionAggregator``, and the |tabular_pipeline| for classification. This
+# pipeline will be used in cross-validation to evaluate the model
+# with session features.
+from sklearn.pipeline import make_pipeline
+
+model = make_pipeline(se, SessionAggregator(), tabular_pipeline("classification"))
+scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc")
+print(f"ROC-AUC with session encoding: {scores.mean():.3f}")
+
+# %%
+# As expected, the model with session encoding performs much better than the baseline
+# without session features, demonstrating the value of sessionization for conversion
+# prediction.
+#
+# Because we are working with aggregations, it was necessary to create a custom
+# transformer to compute session-level features. However, this can be avoided
+# entirely by using the skrub DataOps workflow, which allows for more flexible
+# data transformations without needing to fit everything within a
+# scikit-learn pipeline.
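Illustrative sketch of the kind of aggregation a session ID column enables, such as the number of sessions and the mean session duration per user. It uses plain pandas on a small hand-made frame; the data and the `timestamp_session_id` values are made up for illustration and were not produced by the SessionEncoder. As in the example above, such aggregates should be computed separately on the training and test data.

```python
import pandas as pd

# Toy events with a precomputed (hypothetical) session ID column.
events = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00",
                "2024-01-01 10:10",
                "2024-01-01 11:00",
                "2024-01-01 10:15",
                "2024-01-01 10:20",
            ]
        ),
        "timestamp_session_id": [0, 0, 1, 2, 2],
    }
)

# One row per session: first event, last event and number of events.
sessions = (
    events.groupby(["user_id", "timestamp_session_id"])["timestamp"]
    .agg(start="min", end="max", n_events="count")
    .reset_index()
)
# Session duration in seconds (0 for single-event sessions).
sessions["duration_s"] = (sessions["end"] - sessions["start"]).dt.total_seconds()

# One row per user: number of sessions and mean session duration.
per_user = sessions.groupby("user_id").agg(
    n_sessions=("timestamp_session_id", "nunique"),
    mean_duration_s=("duration_s", "mean"),
)
print(per_user)
```

The two-step aggregation (events to sessions, then sessions to users) keeps session-level quantities such as duration explicit before they are summarized per user.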