Adding the SessionEncoder #1930
base: main
Changes from 68 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| .. _sessionization: | ||
|
|
||
| .. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder` | ||
| .. |BaseEstimator| replace:: :class:`~sklearn.base.BaseEstimator` | ||
| .. |TransformerMixin| replace:: :class:`~sklearn.base.TransformerMixin` | ||
|
|
||
|
|
||
| Detecting sessions in timestamped data with the SessionEncoder | ||
| ---------------------------------------------------------------- | ||
|
|
||
| When dealing with timestamped data (data that includes at least a timestamp column), | ||
| it can be useful to identify groups of events through **sessionization**. | ||
|
|
||
| Sessionization is the process of grouping a sequence of events (like user | ||
| interactions) into meaningful sessions. A new session typically starts with a user's | ||
| first event or after a period of inactivity. | ||
|
|
||
| For example, in an online retail context, you might define a new session whenever | ||
| more than 30 minutes pass with no activity from a user. On a website, a session may | ||
| be defined as a sequence of requests made by a single end user within a given time window. | ||
|
|
||
| While definitions vary with the use case, detecting such "bursts" of activity by a | ||
| user helps build features that often have greater predictive power than raw | ||
| individual events. | ||
|
|
||
| The |SessionEncoder| helps address this problem by detecting sessions based on | ||
| a timestamp column, other "session columns" (e.g., user and device) that should be | ||
|
Contributor
I get a little lost in this sentence. Maybe something like 'a timestamp column, other session-related columns (e.g. ...)'?
Member
session columns = split_by right? |
||
| used to distinguish between sessions, and a ``session_gap``. A session is then | ||
| defined as a sequence of events that share the same value in the "session columns" | ||
| and in which consecutive events are separated by less than the ``session_gap``. | ||
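The session definition above can be sketched by hand with pandas. This is a generic gap-based sessionization on made-up data, not the `SessionEncoder` implementation: the column name `session_id`, the toy values, and the choice of a strict "greater than" comparison against the gap are all assumptions made for illustration.

```python
import pandas as pd

# Toy event log (made-up values): two users, timestamps in UTC.
events = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00",
                "2024-01-01 10:10",
                "2024-01-01 11:00",
                "2024-01-01 10:05",
                "2024-01-01 10:20",
            ],
            utc=True,
        ),
    }
)

# Assumed inactivity threshold: 30 minutes.
session_gap = pd.Timedelta(minutes=30)

# Sort so that each user's events are contiguous and in time order.
events = events.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# Time elapsed since the previous event of the same user
# (missing for each user's first event).
elapsed = events.groupby("user_id")["timestamp"].diff()

# A session starts at a user's first event or after a gap longer than session_gap.
starts_session = elapsed.isna() | (elapsed > session_gap)

# Counting session starts cumulatively yields one integer ID per session.
events["session_id"] = starts_session.cumsum() - 1
print(events)
```

Grouping by `user_id` before taking the difference is what keeps one user's inactivity from being measured against another user's events.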
|
|
||
| >>> from skrub import SessionEncoder | ||
| >>> from skrub.datasets import make_retail_events | ||
| >>> events = make_retail_events(n_events=100, random_state=0) | ||
| >>> X, y = events.X, events.y | ||
|
|
||
| Once the necessary features are provided, the |SessionEncoder| | ||
| returns a dataframe with an added session ID column (``timestamp_session_id`` in the | ||
| example below), which holds a monotonically increasing integer ID for each session: | ||
|
|
||
| >>> se = SessionEncoder(timestamp_col="timestamp", split_by="user_id", session_gap=30 * 60) | ||
| >>> res = se.fit_transform(X) | ||
| >>> res.head(5) # doctest: +SKIP | ||
|      user_id                         timestamp  device_type  page_category  event_type  time_on_page  price_viewed  timestamp_session_id | ||
| 0  user_0164  2024-01-01 03:29:07.708922+00:00       mobile        fashion   page_view         134.1        309.80                    59 | ||
| 1  user_0164  2024-01-01 03:29:42.185048+00:00       tablet          books      search         103.4         11.00                    59 | ||
| 2  user_0164  2024-01-01 03:32:38.352703+00:00      desktop           home    wishlist         180.3          4.80                    59 | ||
| 3  user_0008  2024-01-02 10:49:56.974375+00:00       mobile          books   page_view           7.0         33.94                     2 | ||
| 4  user_0149  2024-01-04 10:00:15.882835+00:00      desktop    electronics   page_view         108.5          4.44                    49 | ||
|
|
||
| Once the session ID is available, it becomes possible to compute aggregations on | ||
| each session, for example to find the duration of a session or the number of | ||
| sessions per user. | ||
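As a rough illustration of the two aggregations just mentioned, the sketch below computes session durations and per-user session counts with pandas. The table is made up by hand (it is not output from `SessionEncoder`); only the `timestamp_session_id` column name is taken from the example output above. Everything is computed on one toy table, with no train/test split involved.

```python
import pandas as pd

# Hand-made sessionized events (hypothetical values).
res = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b", "b"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00",
                "2024-01-01 10:10",
                "2024-01-01 11:00",
                "2024-01-01 10:05",
                "2024-01-01 10:20",
            ],
            utc=True,
        ),
        "timestamp_session_id": [0, 0, 1, 2, 2],
    }
)

# Duration of each session: time between its first and last event.
bounds = res.groupby("timestamp_session_id")["timestamp"].agg(["min", "max"])
session_duration = bounds["max"] - bounds["min"]

# Number of distinct sessions per user.
sessions_per_user = res.groupby("user_id")["timestamp_session_id"].nunique()

print(session_duration)
print(sessions_per_user)
```

A session made of a single event has a duration of zero under this definition, which may or may not be what a given use case wants.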
|
|
||
| .. warning:: | ||
|
|
||
|     Aggregation can introduce data leakage! At training time, aggregate only records | ||
|     from the training set; at prediction time, aggregate only records from the test | ||
|     set. To ensure this, either wrap any code that performs aggregation in a | ||
|     scikit-learn |BaseEstimator| (as shown in the | ||
|     :ref:`SessionEncoder example <sphx_glr_auto_examples_0110_session_encoder.py>`) | ||
|     or build the pipeline with the skrub :ref:`Data Ops framework<user_guide_data_ops_plan>`. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,171 @@ | ||
| """ | ||
|
|
||
| Sessions in time-based data: Predicting user purchases with the SessionEncoder | ||
| =============================================================================== | ||
|
|
||
| .. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder` | ||
| .. |make_retail_events| replace:: :func:`~skrub.datasets.make_retail_events` | ||
| .. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline` | ||
| .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` | ||
| .. |DummyClassifier| replace:: :class:`~sklearn.dummy.DummyClassifier` | ||
| .. |TimeSeriesSplit| replace:: :class:`~sklearn.model_selection.TimeSeriesSplit` | ||
| .. |BaseEstimator| replace:: :class:`~sklearn.base.BaseEstimator` | ||
| .. |TransformerMixin| replace:: :class:`~sklearn.base.TransformerMixin` | ||
|
|
||
| This example shows how to use |SessionEncoder| in a scikit-learn pipeline to | ||
| create session-level features (sessionization) for conversion prediction, that is, | ||
| predicting whether a user session will eventually lead to a purchase. | ||
|
|
||
| .. topic:: What is sessionization? | ||
|
|
||
|     Sessionization is the process of grouping a sequence of events (like user | ||
|     interactions) into meaningful sessions. A new session typically starts with a | ||
|     user's first event or after a period of inactivity. For example, in an online | ||
|     retail context, you might define a new session whenever more than 30 minutes | ||
|     pass with no activity from a user. This allows you to extract session-level | ||
|     features (like the total number of events in a session or the dominant device | ||
|     type used), which often have greater predictive power than raw individual events. | ||
|
|
||
| We will: | ||
|
|
||
| 1. Use |make_retail_events| to generate synthetic retail event data | ||
| 2. Build a baseline classifier on raw event-level features with the |tabular_pipeline| | ||
| 3. Add session-level and historical features with |SessionEncoder| | ||
| 4. Train the same model again and compare ROC-AUC | ||
|
|
||
| The data includes columns such as event type, device type, viewed price, and | ||
| timestamp. The target is binary: whether the session eventually contains a | ||
| purchase event or not. | ||
|
Member
as I mentioned I find it weird that we have session-level targets to predict but still need to construct session ids from the events. we can revisit the example and dataset in a later PR |
||
|
|
||
| """ | ||
|
|
||
| # %% | ||
| # Since this is temporal data, we use a time-aware CV strategy with | ||
| # |TimeSeriesSplit| to avoid leakage. We reuse the same splitter for all evaluations. | ||
| from sklearn.model_selection import TimeSeriesSplit | ||
|
|
||
| splitter = TimeSeriesSplit(n_splits=5) | ||
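To make the time-aware strategy concrete, here is a small sketch of what `TimeSeriesSplit` produces on twelve placeholder samples (the array is made up and stands in for rows already sorted by time). `TimeSeriesSplit` splits by row position, so it relies on the rows being in chronological order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder samples, assumed to be already sorted by time.
X_toy = np.arange(12).reshape(-1, 1)

toy_splitter = TimeSeriesSplit(n_splits=5)
folds = list(toy_splitter.split(X_toy))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Each training set ends before its test set begins, and grows from
    # one fold to the next.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Because every test fold lies strictly after its training fold, the model is never evaluated on events that precede the ones it was trained on.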
| # %% | ||
| # We begin by generating the data with |make_retail_events| and defining our | ||
| # features and target. | ||
| from skrub import TableReport | ||
| from skrub.datasets import make_retail_events | ||
|
|
||
| events = make_retail_events(n_users=20, n_events=5000, random_state=0) | ||
| X, y = events.X, events.y | ||
| TableReport(X) | ||
| # %% | ||
| # The data contains 5000 events from 20 users, where each event is timestamped. | ||
| # Other columns include the event type, device used by the user, page category, | ||
| # time spent on page and price of the item. The target variable indicates whether | ||
| # a user session eventually contains a purchase event: all events in that session | ||
| # will have a target value of 1 if a purchase happens, and 0 otherwise. | ||
|
|
||
| # %% | ||
| # Sanity check: evaluate a DummyClassifier on raw event data | ||
| # --------------------------------------------------------------- | ||
| # We begin by evaluating a |DummyClassifier| on the original event data | ||
| # (without session features). Since it's a |DummyClassifier|, we expect | ||
| # chance-level performance (ROC-AUC of 0.5). | ||
| from sklearn.dummy import DummyClassifier | ||
| from sklearn.model_selection import cross_val_score | ||
|
|
||
| dummy = DummyClassifier(strategy="most_frequent") | ||
|
|
||
| scores = cross_val_score(dummy, X, y, cv=splitter, scoring="roc_auc") | ||
| print(f"ROC-AUC with DummyClassifier: {scores.mean():.3f}") | ||
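The expected chance-level score can be checked on a small made-up example: with `strategy="most_frequent"`, every sample receives the same predicted probability, so positives cannot be ranked above negatives and the ROC-AUC is 0.5. The labels below are hypothetical and unrelated to the retail dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Made-up imbalanced labels; DummyClassifier ignores the feature values.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
X_toy = np.zeros((len(y_toy), 1))

toy_dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# Predicted probability of the positive class: identical for every sample.
toy_scores = toy_dummy.predict_proba(X_toy)[:, 1]

# With fully tied scores, the ROC curve is the diagonal.
toy_auc = roc_auc_score(y_toy, toy_scores)
print(toy_auc)
```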
|
|
||
| # %% | ||
| # First attempt: training a model without using session-level features | ||
| # -------------------------------------------------------------------- | ||
| # We first use the |tabular_pipeline| on raw event-level data, without any session | ||
| # encoding or aggregation. This serves as a baseline to compare against the enriched | ||
| # model later. | ||
| # Remember that the |tabular_pipeline| will automatically add a |TableVectorizer| | ||
| # to perform feature engineering, so the model can still learn from the raw event | ||
| # features. However, it won't be able to directly capture session-level patterns. | ||
| from skrub import tabular_pipeline | ||
|
|
||
| model = tabular_pipeline("classification") | ||
|
|
||
| scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc") | ||
| print(f"ROC-AUC without session encoding: {scores.mean():.3f}") | ||
| # %% | ||
| # The model is not performing much better than the DummyClassifier, which suggests | ||
| # that raw event-level features are not sufficient for good conversion prediction. | ||
| # This baseline is limited because it cannot directly use session-level behavior | ||
| # (for example, whether "add_to_cart" happened in the same session). | ||
|
|
||
| # %% | ||
| # A better approach: session encoding and aggregation | ||
| # ------------------------------------------------------ | ||
| # Next, we use the |SessionEncoder| to create session-level features that we can | ||
| # aggregate over. We define a session boundary as "a user has been inactive for | ||
| # more than 30 minutes". The |SessionEncoder| will create a new column | ||
| # ``timestamp_session_id`` that assigns a unique session ID to each session detected. | ||
| # The parameter ``session_gap=30 * 60`` specifies the inactivity threshold in | ||
| # seconds (30 minutes). | ||
| # | ||
| # Note that session-based features involve aggregations, which must be performed | ||
| # only on the training data within each fold to avoid leakage. In a scikit-learn | ||
| # pipeline, we can achieve this by using |SessionEncoder| followed by a custom | ||
| # transformer that computes session aggregates, and ensuring that the pipeline is | ||
| # properly fitted within each fold of cross-validation. | ||
| # %% | ||
| from skrub import SessionEncoder, tabular_pipeline | ||
|
|
||
| se = SessionEncoder("timestamp", split_by="user_id", session_gap=30 * 60) | ||
| # Here we fit the SessionEncoder on the entire dataset for demonstration purposes | ||
| X_sessions = se.fit_transform(X) | ||
| X_sessions.head() | ||
|
|
||
| # %% | ||
| # Defining a custom transformer for session-level aggregation | ||
| # ----------------------------------------------------------- | ||
| # To avoid data leakage and maintain a clean pipeline, we can create a custom | ||
| # transformer that inherits from |BaseEstimator| and |TransformerMixin| and | ||
| # computes session-level aggregates within a scikit-learn pipeline. | ||
| # This transformer will be fitted and applied separately within each fold of | ||
| # cross-validation, ensuring that session features are computed only on the training | ||
| # data of each fold. | ||
|
|
||
| from sklearn.base import BaseEstimator, TransformerMixin | ||
|
|
||
|
|
||
| class SessionAggregator(BaseEstimator, TransformerMixin): | ||
|
Member
We should consider, later, adding this as a transformer in Skrub. Let's "sleep on it" (a few times) |
||
|     def fit(self, X, y=None): | ||
|         # Nothing to learn: aggregates are recomputed on the data given to transform | ||
|         return self | ||
|
|
||
|     def transform(self, X): | ||
|         # Compute session-level aggregates | ||
|         session_agg = X.groupby("timestamp_session_id").agg( | ||
|             session_has_add_to_cart=("event_type", lambda x: "add_to_cart" in x.values), | ||
|             session_n_events=("event_type", "count"), | ||
|             session_mean_price=("price_viewed", "mean"), | ||
|             session_dominant_device=("device_type", lambda x: x.mode()[0]), | ||
|         ) | ||
|         # Join back to the original data | ||
|         return X.join(session_agg, on="timestamp_session_id") | ||
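To see what the aggregator produces, the sketch below applies the same logic to a tiny hand-made table. The class body is repeated so that the snippet runs on its own; the toy rows and the pre-filled `timestamp_session_id` values are made up and do not come from `SessionEncoder`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SessionAggregator(BaseEstimator, TransformerMixin):
    # Same logic as the class in the example, repeated for self-containment.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        session_agg = X.groupby("timestamp_session_id").agg(
            session_has_add_to_cart=("event_type", lambda x: "add_to_cart" in x.values),
            session_n_events=("event_type", "count"),
            session_mean_price=("price_viewed", "mean"),
            session_dominant_device=("device_type", lambda x: x.mode()[0]),
        )
        return X.join(session_agg, on="timestamp_session_id")


# Hypothetical events: three in session 0, one in session 1.
toy = pd.DataFrame(
    {
        "timestamp_session_id": [0, 0, 0, 1],
        "event_type": ["page_view", "add_to_cart", "search", "page_view"],
        "price_viewed": [10.0, 20.0, 30.0, 5.0],
        "device_type": ["mobile", "mobile", "desktop", "tablet"],
    }
)

out = SessionAggregator().fit_transform(toy)
print(out)
```

The output keeps one row per event: every event in a session carries that session's aggregates, so the downstream model still works at the event level.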
|
|
||
|
|
||
| # %% | ||
| # Then, we create a pipeline that includes the |SessionEncoder|, our custom | ||
| # ``SessionAggregator``, and the |tabular_pipeline| for classification. This | ||
| # pipeline will be used in cross-validation to evaluate the model | ||
| # with session features. | ||
| from sklearn.pipeline import make_pipeline | ||
|
|
||
| model = make_pipeline(se, SessionAggregator(), tabular_pipeline("classification")) | ||
| scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc") | ||
| print(f"ROC-AUC with session encoding: {scores.mean():.3f}") | ||
|
|
||
| # %% | ||
| # As expected, the model with session encoding performs much better than the baseline | ||
| # without session features, demonstrating the value of sessionization for conversion | ||
| # prediction. | ||
| # | ||
| # Because session-level features rely on aggregation, we had to write a custom | ||
| # transformer to compute them. This can be avoided by using the skrub Data Ops | ||
| # framework, which allows for more flexible data transformations without needing | ||
| # to fit everything within a scikit-learn pipeline. | ||