Adding the SessionEncoder #1930
Open: rcap107 wants to merge 91 commits into skrub-data:main from rcap107:feat-session-encoder
Changes from 59 commits
f7fdcd7
Adding the SessionEncoder
rcap107 65be83a
more work
rcap107 a6caeb7
adding tests
rcap107 cc14c55
changelog
rcap107 dba476f
adding drop cols, various improvements
rcap107 d46090d
adding a test
rcap107 28b958a
simplifying tests and code
rcap107 ac7389d
docstrings
rcap107 8a61ea0
adding support for multiple by columns
rcap107 a70006f
fixing optional by
rcap107 429f155
ddocs
rcap107 23c4111
fixing a compatibility problem
rcap107 acb091f
changelog
rcap107 d69fb9e
improving tests
rcap107 e0199f5
renaming session id column
rcap107 603a57b
fixing some broken tests
rcap107 9d2c577
doctest
rcap107 8d869b2
testing error dispatch
rcap107 01fed5f
addressing some of the coments
rcap107 746109b
addressing more comments
rcap107 71302b1
fixing test
rcap107 ea99353
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 bca2db2
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 868d529
changelo
rcap107 e27b23c
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 25ce541
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 280d6f1
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 b37bec7
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 bd64559
reordering rows after adding session id
rcap107 3758a43
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 c8e91c9
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 8ddd651
fixing changelog after merge
rcap107 dcc1369
implementing a fix from review
rcap107 87371e6
reordering columns so that the session id is added as last col
rcap107 b336d9b
more fixes
rcap107 9fbe79d
_
rcap107 1680162
example
rcap107 091a66c
docstrings
rcap107 8248f50
changing to seconds
rcap107 6c2a92c
more tests
rcap107 542042b
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 8c6d6a3
ensuring that columns do not get overwritten
rcap107 b834395
renaming a parameter
rcap107 091c122
_
rcap107 6de367c
fixing a bug on windows
rcap107 08d2b11
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 528006b
adding new generator and example
rcap107 d26222f
adding comments
rcap107 26e4436
fixing inconsistency
rcap107 fecc532
fixing possible bug
rcap107 bf58d53
comments
rcap107 2aa11cc
grr
rcap107 3c79333
Apply suggestions from code review
rcap107 8b30ebd
Update skrub/_session_encoder.py
rcap107 e561d9b
improvements and changes from review
rcap107 d9012dc
more improvements
rcap107 d73c45b
adding plain example
rcap107 b77a59e
rewording
rcap107 8ee2bee
cleanup docstring
rcap107 981134e
Update examples/data_ops/1170_session_encoder.py
rcap107 d5aeecd
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 f62a605
Merge branch 'feat-session-encoder' of github.com:rcap107/skrub into …
rcap107 6b03b14
fixing timezone
rcap107 89c79b7
doc cleanup
rcap107 1f5fe6f
more on docs
rcap107 ecf5e0f
_
rcap107 f694e15
_
rcap107 36e6585
Merge remote-tracking branch 'upstream/main' into feat-session-encoder
rcap107 fdab72a
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 e033cdf
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 2de71c4
addressing some of the comments from the review
rcap107 1dc32c2
clean up test
rcap107 d199ec2
addressing more comments
rcap107 aada173
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 d22774c
changelog
rcap107 f728604
example
rcap107 f215bdb
slight rewording
rcap107 679077a
Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder
rcap107 5f04c14
Apply suggestions from code review
rcap107 efada37
fixing doctest
rcap107 73adcfb
reworking docstring, renaming attr
rcap107 74f5991
moving error checking, more work on docstring
rcap107 49a4583
simplifying part of the code
rcap107 77942d2
addressing comments
rcap107 05c7bf7
removing factorizer
rcap107 b389575
docstring
rcap107 9675950
doc fixes
rcap107 0787468
pandas grrr
rcap107 2951486
fixing test on min deps
rcap107 326dbb9
adding a comment
rcap107 e04fd29
changelog
rcap107
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,172 @@ | ||
| """ | ||
|
|
||
| .. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder` | ||
| .. |make_retail_events| replace:: :func:`~skrub.datasets.make_retail_events` | ||
| .. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline` | ||
| .. |skrub.X| replace:: :func:`~skrub.X` | ||
| .. |skrub.y| replace:: :func:`~skrub.y` | ||
| .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` | ||
| .. |DummyClassifier| replace:: :class:`~sklearn.dummy.DummyClassifier` | ||
| .. |TimeSeriesSplit| replace:: :class:`~sklearn.model_selection.TimeSeriesSplit` | ||
| .. |cross_validate| replace:: :func:`~skrub.cross_validate` | ||
| .. |apply_func| replace:: :func:`~skrub.DataOp.skb.apply_func` | ||
|
|
||
| Sessions in time-based data: Predicting conversion with the |SessionEncoder| | ||
| ============================================================================ | ||
|
|
||
| This example shows how to use |SessionEncoder| in a scikit-learn pipeline to | ||
| create session-level features (sessionization) for conversion prediction, that is, | ||
| predicting whether a user session will eventually lead to a purchase. | ||
|
|
||
| .. note:: | ||
|
|
||
|    **What is sessionization?** | ||
|
|
||
|    Sessionization is the process of grouping a sequence of events (like user | ||
|    interactions) into meaningful sessions. A new session typically starts with a | ||
|    user's first event or after a period of inactivity. For example, in an online | ||
|    retail context, you might define a new session whenever more than 30 minutes | ||
|    pass with no activity from a user. This allows you to extract session-level | ||
|    features (like the total number of events in a session or the dominant device | ||
|    type used), which often have greater predictive power than raw individual events. | ||
|
|
||
| We will: | ||
|
|
||
| 1. Use |make_retail_events| to generate synthetic retail event data | ||
| 2. Build a baseline classifier on raw event-level features with the |tabular_pipeline| | ||
| 3. Add session-level features with |SessionEncoder| and a custom aggregation step | ||
| 4. Train the same model again and compare ROC-AUC | ||
|
|
||
| The data includes columns such as event type, device type, viewed price, and | ||
| timestamp. The target is binary: whether the session eventually contains a | ||
| purchase event or not. | ||
|
|
||
|
|
||
| .. note:: | ||
|
|
||
|    A version of this example that uses the skrub DataOps workflow instead of a | ||
|    scikit-learn pipeline is available in :ref:`examples/data_ops/1170_session_encoder`. | ||
| """ | ||
|
|
||
| # %% | ||
| # Since this is temporal data, we use a time-aware CV strategy with | ||
| # |TimeSeriesSplit| to avoid leakage. We reuse the same splitter for all evaluations. | ||
| from sklearn.model_selection import TimeSeriesSplit | ||
|
|
||
| splitter = TimeSeriesSplit(n_splits=5) | ||
|
|
||
| # %% | ||
| # We begin by generating the data with |make_retail_events| and extracting the | ||
| # feature table ``X`` and the target ``y``. | ||
|
|
||
| from skrub.datasets import make_retail_events | ||
|
|
||
| events = make_retail_events(n_users=20, n_events=5000, random_state=0) | ||
| X, y = events.X, events.y | ||
| X | ||
| # %% | ||
| # Sanity check: evaluate a DummyClassifier on raw event data | ||
| # --------------------------------------------------------------- | ||
| # We begin by evaluating a |DummyClassifier| on the original event data | ||
| # (without session features). Since it's a |DummyClassifier|, we expect | ||
| # chance-level performance (ROC-AUC of 0.5). | ||
| from sklearn.dummy import DummyClassifier | ||
| from sklearn.model_selection import cross_val_score | ||
|
|
||
| dummy = DummyClassifier(strategy="most_frequent") | ||
|
|
||
| scores = cross_val_score(dummy, X, y, cv=splitter, scoring="roc_auc") | ||
| print(f"ROC-AUC with DummyClassifier: {scores.mean():.3f}") | ||
|
|
||
| # %% | ||
| # First attempt: training a model without using session-level features | ||
| # -------------------------------------------------------------------- | ||
| # We first use the |tabular_pipeline| on raw event-level data, without any session | ||
| # encoding or aggregation. This serves as a baseline to compare against the enriched | ||
| # model later. | ||
| # Remember that the |tabular_pipeline| will automatically add a |TableVectorizer| | ||
| # to perform feature engineering, so the model can still learn from the raw event | ||
| # features. However, it won't be able to directly capture session-level patterns. | ||
| from skrub import tabular_pipeline | ||
|
|
||
| model = tabular_pipeline("classification") | ||
|
|
||
| scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc") | ||
| print(f"ROC-AUC without session encoding: {scores.mean():.3f}") | ||
| # %% | ||
| # The model is not performing much better than the DummyClassifier, which suggests | ||
| # that raw event-level features are not sufficient for good conversion prediction. | ||
| # This baseline is limited because it cannot directly use session-level behavior | ||
| # (for example, whether "add_to_cart" happened in the same session). | ||
|
|
||
| # %% | ||
| # A better approach: session encoding and aggregation | ||
| # ------------------------------------------------------ | ||
| # Next, we use the |SessionEncoder| to create session-level features that we can | ||
| # aggregate over. We define a session boundary as "a user has been inactive for | ||
| # more than 30 minutes". The |SessionEncoder| will create a new column | ||
| # ``timestamp_session_id`` that assigns a unique session ID to each session detected. | ||
| # The parameter ``session_gap=30 * 60`` specifies the inactivity threshold in | ||
| # seconds (30 minutes). | ||
| # | ||
| # Note that session-based features involve aggregations, which must be performed | ||
| # only on the training data within each fold to avoid leakage. In a scikit-learn | ||
| # pipeline, we can achieve this by using |SessionEncoder| followed by a custom | ||
| # transformer that computes session aggregates, and ensuring that the pipeline is | ||
| # properly fitted within each fold of cross-validation. | ||
|
|
||
| # %% | ||
|
|
||
| from skrub import SessionEncoder, tabular_pipeline | ||
|
|
||
| se = SessionEncoder("timestamp", split_by="user_id", session_gap=30 * 60) | ||
| # Here we fit the SessionEncoder on the entire dataset for demonstration purposes | ||
| X_sessions = se.fit_transform(X) | ||
| X_sessions.head() | ||
|
|
||
| # %% | ||
| # To avoid data leakage and maintain a clean pipeline, we can create a custom | ||
| # transformer that computes session-level aggregates within a scikit-learn pipeline. | ||
| # This transformer will be fitted and applied separately within each fold of | ||
| # cross-validation, ensuring that session features are computed only on the training | ||
| # data of each fold. | ||
|
|
||
| from sklearn.base import BaseEstimator, TransformerMixin | ||
|
|
||
|
|
||
| class SessionAggregator(BaseEstimator, TransformerMixin): | ||
|
Member
We should consider, later, adding this as a transformer in Skrub. Let's "sleep on it" (a few times) |
||
|     def fit(self, X, y=None): | ||
|         return self | ||
|
|
||
|     def transform(self, X): | ||
|         # Compute session-level aggregates | ||
|         session_agg = X.groupby("timestamp_session_id").agg( | ||
|             session_has_add_to_cart=("event_type", lambda x: "add_to_cart" in x.values), | ||
|             session_n_events=("event_type", "count"), | ||
|             session_mean_price=("price_viewed", "mean"), | ||
|             session_dominant_device=("device_type", lambda x: x.mode()[0]), | ||
|         ) | ||
|         # Join back to the original data | ||
|         return X.join(session_agg, on="timestamp_session_id") | ||
|
|
||
|
|
||
| # %% | ||
| # Then, we create a pipeline that includes the |SessionEncoder|, our custom | ||
| # ``SessionAggregator``, and the |tabular_pipeline| for classification. This | ||
| # pipeline will be used in cross-validation to evaluate the model | ||
| # with session features. | ||
| from sklearn.pipeline import make_pipeline | ||
|
|
||
| model = make_pipeline(se, SessionAggregator(), tabular_pipeline("classification")) | ||
| scores = cross_val_score(model, X, y, cv=splitter, scoring="roc_auc") | ||
| print(f"ROC-AUC with session encoding: {scores.mean():.3f}") | ||
|
|
||
| # %% | ||
| # As expected, the model with session encoding performs much better than the baseline | ||
| # without session features, demonstrating the value of sessionization for conversion | ||
| # prediction. | ||
| # | ||
| # Because the session-level features rely on aggregation, we had to write a custom | ||
| # transformer to compute them inside the scikit-learn pipeline. The skrub DataOps | ||
| # workflow avoids this: it allows more flexible data transformations without | ||
| # needing to fit every step into a scikit-learn pipeline. For an example of how to | ||
| # do this with DataOps, see :ref:`examples/data_ops/1170_session_encoder`. | ||
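To make the session boundary rule used in this example concrete ("a user has been inactive for more than 30 minutes"), here is a minimal plain-Python sketch of gap-based sessionization. It is not the SessionEncoder implementation: the helper `assign_session_ids` and the toy events are hypothetical, and the sketch assumes a strict "more than `session_gap` seconds" rule applied per user.

```python
from datetime import datetime, timedelta


def assign_session_ids(events, session_gap=30 * 60):
    """Return one session id per (user_id, timestamp) event, in input order.

    Hypothetical helper: a new session starts when the user changes or when
    more than ``session_gap`` seconds separate two consecutive events of the
    same user.
    """
    gap = timedelta(seconds=session_gap)
    # Visit events sorted by user, then by time, but remember input positions.
    order = sorted(range(len(events)), key=lambda i: (events[i][0], events[i][1]))
    session_ids = [None] * len(events)
    current_id = -1
    prev_user, prev_time = None, None
    for i in order:
        user, time = events[i]
        # The first event of a user always opens a session; the gap is only
        # checked when the previous event belongs to the same user.
        if user != prev_user or time - prev_time > gap:
            current_id += 1
        session_ids[i] = current_id
        prev_user, prev_time = user, time
    return session_ids


start = datetime(2024, 1, 1, 9, 0)
events = [
    ("u1", start),
    ("u1", start + timedelta(minutes=10)),
    ("u1", start + timedelta(minutes=50)),  # 40 minutes of inactivity
    ("u2", start + timedelta(minutes=5)),
]
session_ids = assign_session_ids(events)
print(session_ids)
```

With the toy events, the third event of `u1` opens a new session because it comes 40 minutes after the previous one, and `u2` always gets its own session.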
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,186 @@ | ||
| """ | ||
|
|
||
| .. |SessionEncoder| replace:: :class:`~skrub.SessionEncoder` | ||
| .. |make_retail_events| replace:: :func:`~skrub.datasets.make_retail_events` | ||
| .. |tabular_pipeline| replace:: :func:`~skrub.tabular_pipeline` | ||
| .. |skrub.X| replace:: :func:`~skrub.X` | ||
| .. |skrub.y| replace:: :func:`~skrub.y` | ||
| .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` | ||
| .. |DummyClassifier| replace:: :class:`~sklearn.dummy.DummyClassifier` | ||
| .. |TimeSeriesSplit| replace:: :class:`~sklearn.model_selection.TimeSeriesSplit` | ||
| .. |cross_validate| replace:: :func:`~skrub.cross_validate` | ||
| .. |apply_func| replace:: :func:`~skrub.DataOp.skb.apply_func` | ||
|
|
||
| Sessions in time-based data: Using SessionEncoder in a rich DataOps pipeline | ||
| ============================================================================ | ||
|
|
||
| This example shows how to use |SessionEncoder| in a skrub DataOps workflow to | ||
| create session-level features (sessionization) for conversion prediction, that is, | ||
| predicting whether a user session will eventually lead to a purchase. | ||
|
|
||
| .. note:: | ||
|
|
||
|    **What is sessionization?** | ||
|
|
||
|    Sessionization is the process of grouping a sequence of events (like user | ||
|    interactions) into meaningful sessions. A new session typically starts with a | ||
|    user's first event or after a period of inactivity. For example, in an online | ||
|    retail context, you might define a new session whenever more than 30 minutes | ||
|    pass with no activity from a user. This allows you to extract session-level | ||
|    features (like the total number of events in a session or the dominant device | ||
|    type used), which often have greater predictive power than raw individual events. | ||
|
|
||
| We will: | ||
|
|
||
| 1. Use |make_retail_events| to generate synthetic retail event data | ||
| 2. Build a baseline classifier on raw event-level features with the |tabular_pipeline| | ||
| 3. Add session-level features with |SessionEncoder| and |apply_func| | ||
| 4. Train the same model again and compare ROC-AUC | ||
|
|
||
| The data includes columns such as event type, device type, viewed price, and | ||
| timestamp. The target is binary: whether the session eventually contains a | ||
| purchase event or not. | ||
| """ | ||
|
|
||
| # %% | ||
| # Since this is temporal data, we use a time-aware CV strategy with | ||
| # |TimeSeriesSplit| to avoid leakage. We reuse the same splitter for all evaluations. | ||
| from sklearn.model_selection import TimeSeriesSplit | ||
|
|
||
| splitter = TimeSeriesSplit(n_splits=5) | ||
|
|
||
| # %% | ||
| # We begin by generating the data with |make_retail_events| and marking feature | ||
| # and target data with |skrub.X| and |skrub.y| so they can be used | ||
| # in a DataOps workflow. | ||
|
|
||
| import skrub | ||
| from skrub.datasets import make_retail_events | ||
|
|
||
| events = make_retail_events(n_users=20, n_events=5000, random_state=0) | ||
| X, y = skrub.X(events.X), skrub.y(events.y) | ||
| X | ||
| # %% | ||
| # Sanity check: evaluate a DummyClassifier on raw event data | ||
| # --------------------------------------------------------------- | ||
| # We begin by evaluating a |DummyClassifier| on the original event data | ||
| # (without session features). Since it's a |DummyClassifier|, we expect | ||
| # chance-level performance (ROC-AUC of 0.5). | ||
| from sklearn.dummy import DummyClassifier | ||
|
|
||
| dummy = DummyClassifier(strategy="most_frequent") | ||
| dummy_pred = X.skb.apply(dummy, y=y) | ||
| dummy_learner = dummy_pred.skb.make_learner() | ||
| dummy_results = skrub.cross_validate( | ||
|     dummy_learner, environment=dummy_pred.skb.get_data(), cv=splitter, scoring="roc_auc" | ||
| ) | ||
| print(f"ROC-AUC with DummyClassifier: {dummy_results['test_score'].mean():.3f}") | ||
|
|
||
| # %% | ||
| # First attempt: training a model without using session-level features | ||
| # -------------------------------------------------------------------- | ||
| # We first use the |tabular_pipeline| on raw event-level data, without any session | ||
| # encoding or aggregation. This serves as a baseline to compare against the enriched | ||
| # model later. | ||
| # Remember that the |tabular_pipeline| will automatically add a |TableVectorizer| | ||
| # to perform feature engineering, so the model can still learn from the raw event | ||
| # features. However, it won't be able to directly capture session-level patterns. | ||
| from skrub import tabular_pipeline | ||
|
|
||
| model = tabular_pipeline("classification") | ||
|
|
||
| pred = X.skb.apply(model, y=y) | ||
| learner = pred.skb.make_learner() | ||
| results = skrub.cross_validate( | ||
|     learner, environment=pred.skb.get_data(), cv=splitter, scoring="roc_auc" | ||
| ) | ||
| print(f"ROC-AUC without session encoding: {results['test_score'].mean():.3f}") | ||
|
|
||
| # %% | ||
| # The model is not performing much better than the DummyClassifier, which suggests | ||
| # that raw event-level features are not sufficient for good conversion prediction. | ||
| # This baseline is limited because it cannot directly use session-level behavior | ||
| # (for example, whether "add_to_cart" happened in the same session). | ||
|
|
||
| # %% | ||
| # A better approach: session encoding and aggregation | ||
| # ------------------------------------------------------ | ||
| # Next, we use the |SessionEncoder| to create session-level features that we can | ||
| # aggregate over. We define a session boundary as "a user has been inactive for | ||
| # more than 30 minutes". The |SessionEncoder| will create a new column | ||
| # ``timestamp_session_id`` that assigns a unique session ID to each session detected. | ||
| # The parameter ``session_gap=30 * 60`` specifies the inactivity threshold in | ||
| # seconds (30 minutes). | ||
|
|
||
| # %% | ||
| from skrub import SessionEncoder | ||
|
|
||
| se = SessionEncoder("timestamp", split_by="user_id", session_gap=30 * 60) | ||
| X_sessions = X.skb.apply(se) | ||
| X_sessions | ||
|
|
||
| # %% | ||
| # ``timestamp_session_id`` identifies the session of each event. | ||
| # We use it to compute session-level aggregates and join them back to event-level rows. | ||
| # | ||
| # .. admonition:: Session-level feature engineering | ||
| #    :collapsible: closed | ||
| # | ||
| #    We will compute the following session-level features: | ||
| # | ||
| #    - ``session_has_add_to_cart``: whether the session includes at least one | ||
| #      "add_to_cart" event | ||
| #    - ``session_n_events``: the total number of events in the session | ||
| #    - ``session_mean_price``: the mean price viewed during the session | ||
| #    - ``session_dominant_device``: the most frequently used device type in the session | ||
|
|
||
|
|
||
| def most_frequent(series): | ||
|     # mode() can return multiple values; use the first one | ||
|     # for a deterministic tie-break. | ||
|     return series.mode().iat[0] | ||
|
|
||
|
|
||
| def compute_session_features(df): | ||
|     session_agg = df.groupby("timestamp_session_id").agg( | ||
|         session_has_add_to_cart=("event_type", lambda x: "add_to_cart" in x.values), | ||
|         session_n_events=("event_type", "count"), | ||
|         session_mean_price=("price_viewed", "mean"), | ||
|         session_dominant_device=("device_type", most_frequent), | ||
|     ) | ||
|     df = df.join(session_agg, on="timestamp_session_id") | ||
|     return df | ||
|
|
||
|
|
||
| # %% | ||
| # We use |apply_func| to apply this feature engineering function to the data | ||
| # with session IDs. | ||
| X_enriched = X_sessions.skb.apply_func(compute_session_features) | ||
| X_enriched | ||
| # %% | ||
| # Now we can train the same model on the enriched data with session-level features | ||
| # and see if the performance improves. | ||
| model = tabular_pipeline("classification") | ||
| pred_enriched = X_enriched.skb.apply(model, y=y) | ||
| learner_enriched = pred_enriched.skb.make_learner() | ||
| results_enriched = skrub.cross_validate( | ||
|     learner_enriched, | ||
|     environment=pred_enriched.skb.get_data(), | ||
|     cv=splitter, | ||
|     scoring="roc_auc", | ||
| ) | ||
| print(f"ROC-AUC with session encoding: {results_enriched['test_score'].mean():.3f}") | ||
|
|
||
| # %% | ||
| # The enriched model clearly outperforms the baseline, showing the value of | ||
| # session-level context for conversion prediction. | ||
|
|
||
| # %% | ||
| # Discussion | ||
| # ----------- | ||
| # In DataOps, the aggregation step is part of the learner, so it is re-evaluated | ||
| # on the rows of each split. With a time-aware splitter such as |TimeSeriesSplit| | ||
| # and rows ordered by time, the features of the training rows are computed without | ||
| # access to the later test rows, which helps prevent leakage across folds. | ||
| # | ||
| # This example focuses on |SessionEncoder| usage, so we intentionally keep modeling | ||
| # simple (no hyperparameter tuning and only a small set of engineered features). | ||
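The role of the time-aware splitter in the discussion above can be illustrated with a small plain-Python sketch of an expanding-window split. This is a simplified stand-in written for illustration, not scikit-learn's TimeSeriesSplit code: the function `expanding_window_splits` is hypothetical, and it assumes the rows are already sorted by timestamp.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) for rows sorted by time.

    Hypothetical helper: each test block comes strictly after its training
    rows, and the training window grows from one split to the next.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Not enough samples for the requested number of splits.")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(expanding_window_splits(n_samples=12, n_splits=5))
for train, test in splits:
    print(f"train rows 0..{train[-1]}, test rows {test}")
```

Because every training index precedes every test index, any aggregate computed while fitting on the training rows cannot see events from the test period.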
As I mentioned, I find it weird that we have session-level targets to predict but still need to construct session IDs from the events. We can revisit the example and dataset in a later PR.