Adding the SessionEncoder #1930

Open
rcap107 wants to merge 91 commits into skrub-data:main from rcap107:feat-session-encoder
Conversation

@rcap107 rcap107 commented Feb 24, 2026

Member

Very early draft of an encoder that adds session IDs and session statistics to a given dataframe.

This one isn't a SingleColumnTransformer because it needs two separate columns.

To decide:

  • What granularity do we want for the session duration: minutes, seconds, or a user-selected unit?
  • What statistics should be added?
    ...

Example:
Given this dataset (randomly generated)

             timestamp  user_id     value
0  2024-01-01 00:00:00      101 -1.072157
1  2024-01-01 00:02:00      101 -2.281613
2  2024-01-01 00:04:00      102  0.462635
3  2024-01-01 00:06:00      101  1.929639
4  2024-01-01 00:08:00      101  1.285300
5  2024-01-11 00:00:00      101  1.279716
6  2024-01-11 00:02:00      101  0.958734
7  2024-01-11 00:04:00      102  0.041920
8  2024-01-11 00:06:00      102  0.383161
9  2024-01-11 00:08:00      101 -0.021126
10 2024-01-21 00:00:00      101 -1.605621
11 2024-01-21 00:02:00      102 -0.076007
12 2024-01-21 00:04:00      101  1.609458
13 2024-01-21 00:06:00      102 -0.137637
14 2024-01-21 00:08:00      101 -0.317564

We should be able to call fit_transform on the SessionEncoder like so:

se = SessionEncoder(add_duration=True, add_sessions_per_user=True, add_session_time=True)
se.fit_transform(df)

and get something like this:

             timestamp  user_id     value  session_id  session_duration  sessions_per_user  total_session_time
0  2024-01-01 00:00:00      101 -1.072157           0                 4                 10            12000000
1  2024-01-01 00:02:00      101 -2.281613           0                 4                 10            12000000
2  2024-01-01 00:06:00      101  1.929639           0                 4                 10            12000000
3  2024-01-01 00:08:00      101  1.285300           0                 4                 10            12000000
4  2024-01-11 00:00:00      101  1.279716           1                 3                 10            12000000
5  2024-01-11 00:02:00      101  0.958734           1                 3                 10            12000000
6  2024-01-11 00:08:00      101 -0.021126           1                 3                 10            12000000
7  2024-01-21 00:00:00      101 -1.605621           2                 3                 10            12000000
8  2024-01-21 00:04:00      101  1.609458           2                 3                 10            12000000
9  2024-01-21 00:08:00      101 -0.317564           2                 3                 10            12000000
10 2024-01-01 00:04:00      102  0.462635           3                 1                  5             6000000
11 2024-01-11 00:04:00      102  0.041920           4                 2                  5             6000000
12 2024-01-11 00:06:00      102  0.383161           4                 2                  5             6000000
13 2024-01-21 00:02:00      102 -0.076007           5                 2                  5             6000000
14 2024-01-21 00:06:00      102 -0.137637           5                 2                  5             6000000
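
For readers who want to see how the session_id column above could be derived, here is a minimal pandas sketch. It is not the code in this PR: the helper name add_session_id, its parameters, and the 30-minute gap threshold are assumptions made for illustration. It sorts events per user, measures the time since that user's previous event, and starts a new session whenever the gap exceeds the threshold.

```python
import pandas as pd


def add_session_id(df, timestamp="timestamp", by="user_id", gap="30min"):
    """Hypothetical helper (not the PR's API): number sessions per user."""
    out = df.sort_values([by, timestamp]).reset_index(drop=True)
    # Time elapsed since the previous event of the same user (NaT for the first).
    delta = out.groupby(by)[timestamp].diff()
    # A session starts at a user's first event or after a gap longer than `gap`.
    new_session = delta.isna() | (delta > pd.Timedelta(gap))
    out["session_id"] = new_session.cumsum() - 1
    return out


# Same timestamps and user ids as the example dataset (values omitted).
timestamps = [
    pd.Timestamp(f"2024-01-{day:02d}") + pd.Timedelta(minutes=m)
    for day in (1, 11, 21)
    for m in (0, 2, 4, 6, 8)
]
user_ids = [101, 101, 102, 101, 101, 101, 101, 102, 102, 101, 101, 102, 101, 102, 101]
df = pd.DataFrame({"timestamp": timestamps, "user_id": user_ids})

result = add_session_id(df)
print(result)
```

With these inputs the sketch assigns sessions 0 to 2 to user 101 and sessions 3 to 5 to user 102, which matches the session_id column shown above.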

GaelVaroquaux commented Feb 24, 2026 via email

Member

rcap107 commented Feb 25, 2026

Member Author

I added an example with some possible parameters that we could add to the encoder.

I've been using this dataset as an example: https://www.kaggle.com/datasets/mylesoneill/warcraft-avatar-history?select=wowah_data.csv

I've already noticed a pretty big difference in performance between pandas and polars.

rcap107 commented Feb 25, 2026

Member Author

I think the code can already be reviewed for early comments. I've added tests and doctests, and so far the basic approach for sessionization is working.

In the end I decided to avoid adding more features because those only make sense after aggregation, and that should be done by the user with the data ops.

Something that needs to be decided is the resolution of the session duration: for now, it's in minutes, with 30 minutes being the default value. We might want to change that to seconds, or add a parameter so the user can decide the resolution by themselves.

rcap107 commented Feb 25, 2026

Member Author

I also just realized that the "by" column is not necessary: the sessions may simply be a sequence of operations executed by a single user, in which case we only care that a gap of a certain duration between actions marks the start of a new session.

I'll update the code to reflect that.

@jeromedockes

Member

> I also just realized that the "by" column is not necessary:

not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

rcap107 commented Feb 25, 2026

Member Author

> > I also just realized that the "by" column is not necessary:
>
> not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

I could have a dataset where I only have the timestamp, like a logfile that has only the events. Then, I may still want to group the events so that sessions are delimited by periods of activity. Like, the single user has been connected for this long, then they disconnected, then they connected again. If there is only a single user we care about, then there's no need to group by that.

@jeromedockes

Member

sorry, I had misunderstood the "is not necessary" as "we can remove this parameter". yes I agree it can be optional :)
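
To make the two readings in this exchange concrete, here is a small pandas sketch with an optional grouping argument. The function name session_ids and its signature are hypothetical and are not taken from the PR. Without "by", the whole table is treated as a single event log; with "by", each user gets their own log, so an action from user B does not prolong the session of user A.

```python
import pandas as pd


def session_ids(df, timestamp, by=None, gap="30min"):
    """Hypothetical helper: `by` may be None, a column name, or a list of names."""
    keys = [] if by is None else ([by] if isinstance(by, str) else list(by))
    out = df.sort_values(keys + [timestamp]).reset_index(drop=True)
    if keys:
        # One log per group: gaps are measured within each group only.
        delta = out.groupby(keys)[timestamp].diff()
    else:
        # Single log: gaps are measured between consecutive rows.
        delta = out[timestamp].diff()
    new_session = delta.isna() | (delta > pd.Timedelta(gap))
    return out.assign(session_id=new_session.cumsum() - 1)


log = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00",
                "2024-01-01 10:20",
                "2024-01-01 10:40",
                "2024-01-01 11:30",
            ]
        ),
        "user_id": ["A", "B", "A", "B"],
    }
)

single_log = session_ids(log, "timestamp")
per_user = session_ids(log, "timestamp", by="user_id")
print(single_log)
print(per_user)
```

In the single-log reading the first three events are at most 20 minutes apart and fall into one session. Grouped by user, every event is more than 30 minutes away from the same user's previous one, so each event starts its own session.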

@rcap107 rcap107 marked this pull request as ready for review February 26, 2026 10:56

rcap107 commented Feb 26, 2026

Member Author

The example is still missing, but the rest of the code is ready for review.

There are two points that still need to be discussed:

  • How do we define the resolution of the session gap? for the moment it's in minutes. Should I add a parameter to let the user decide the granularity?
  • What should I use as the name for the session id column? At the moment I just have "session_id", but it should probably be something like "TIMESERIES_NAME_session_id" instead

I also need to see if the thing works when it's put in a pipeline or if I'm missing something.

@jeromedockes

Member

> How do we define the resolution of the session gap? for the moment it's in minutes. Should I add a parameter to let the user decide the granularity?

I would say by default any duration passed as a number is usually in seconds, and we can also allow strings like "2s", "2m", "2h"
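
One way this convention could look is sketched below, purely as an illustration. The helper gap_to_seconds is hypothetical and not part of the PR. It treats bare numbers as seconds and accepts only the three string suffixes mentioned above ("s", "m", "h").

```python
import re

# Seconds per unit for the suffixes mentioned in the comment above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def gap_to_seconds(gap):
    """Hypothetical helper: convert a session gap specification to seconds."""
    if isinstance(gap, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("gap must be a number or a string, not a bool")
    if isinstance(gap, (int, float)):
        # Bare numbers are interpreted as seconds.
        return float(gap)
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([smh])\s*", gap)
    if match is None:
        raise ValueError(f"Cannot parse duration: {gap!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


print(gap_to_seconds(30), gap_to_seconds("2m"), gap_to_seconds("2h"))
```

Normalizing to seconds once, at fit time, would keep the rest of the sessionization logic independent of how the user wrote the gap.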

Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
  user_id            timestamp  action  timestamp_session_id
0   alice  2024-01-01 10:00:00   login                     0
1   alice  2024-01-01 10:05:00    view                     0
2   alice  2024-01-01 11:00:00  logout                     1

Member

maybe pick something other than 'logout', because in this example it looks like the logout was in reality probably the last event of the first session

Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
if self.by is not None
else [self.timestamp]
)
X_sorted = sbd.sort(X, by=sort_by)

Member

maybe we could keep only the columns we need before sorting, so that X_sorted will be smaller in case X is large

Member Author

Yes that's a good idea
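
The suggestion can be illustrated in plain pandas; this sketch does not use skrub's internal dataframe helpers, and the function name sort_key_columns is hypothetical. Selecting the key columns before sorting means the sorted copy holds only those columns, while the preserved index still allows results to be aligned back to the full table.

```python
import pandas as pd


def sort_key_columns(X, timestamp, by=None):
    """Hypothetical helper: sort only the columns needed to detect sessions.

    `by` is expected to be None or a list of column names.
    """
    by = [] if by is None else list(by)
    sort_by = by + [timestamp]
    # Select first, then sort: the sorted copy contains just the key columns.
    # The original index is kept so results can be joined back onto X.
    return X[sort_by].sort_values(sort_by)


X = pd.DataFrame(
    {
        "user_id": ["b", "a", "a"],
        "timestamp": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 11:00", "2024-01-01 10:05"]
        ),
        "payload": ["x", "y", "z"],  # stands in for many wide columns
    }
)

X_sorted = sort_key_columns(X, "timestamp", by=["user_id"])
print(X_sorted)
```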

Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated

@jeromedockes jeromedockes left a comment

Member

thanks @rcap107 !! this is a big one but we are getting close :)

Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread skrub/_session_encoder.py Outdated
greater predictive power than raw individual events.

The |SessionEncoder| helps addressing this problem by detecting sessions based on
a timestamp column, other "session columns" (e.g., user and device) that should be

Member

session columns = split_by right?

Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py Outdated
Comment thread skrub/_session_encoder.py

The data includes columns such as event type, device type, viewed price, and
timestamp. The target is binary: whether the session eventually contains a
purchase event or not.

Member

as I mentioned I find it weird that we have session-level targets to predict but still need to construct session ids from the events. we can revisit the example and dataset in a later PR

Comment thread examples/0110_session_encoder.py
Comment thread examples/0110_session_encoder.py

@lisaleemcb lisaleemcb left a comment

Contributor

Small comments but otherwise looks really great :)

Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated
Comment thread examples/0110_session_encoder.py Outdated
Comment thread examples/0110_session_encoder.py
Comment thread examples/0110_session_encoder.py Outdated
Comment thread examples/0110_session_encoder.py Outdated
Comment thread examples/0110_session_encoder.py Outdated
@rcap107 rcap107 requested a review from jeromedockes June 16, 2026 09:13

rcap107 commented Jun 16, 2026

Member Author

I think the PR is complete now

jeromedockes commented Jun 16, 2026 via email

Member

Development

Successfully merging this pull request may close these issues.

FEAT - Adding a transformer to sessionize a table

4 participants