Adding the SessionEncoder by rcap107 · Pull Request #1930 · skrub-data/skrub

rcap107 · 2026-02-24T16:48:25Z

Very early draft of an encoder that adds session IDs and statistics to a given dataframe

This one isn't a SingleColumnTransformer because it needs two separate columns.

To decide:

What granularity do we want for the session duration: minutes, seconds, or should the user choose?
What statistics should be added?
...

Example:
Given this dataset (randomly generated)

             timestamp  user_id     value
0  2024-01-01 00:00:00      101 -1.072157
1  2024-01-01 00:02:00      101 -2.281613
2  2024-01-01 00:04:00      102  0.462635
3  2024-01-01 00:06:00      101  1.929639
4  2024-01-01 00:08:00      101  1.285300
5  2024-01-11 00:00:00      101  1.279716
6  2024-01-11 00:02:00      101  0.958734
7  2024-01-11 00:04:00      102  0.041920
8  2024-01-11 00:06:00      102  0.383161
9  2024-01-11 00:08:00      101 -0.021126
10 2024-01-21 00:00:00      101 -1.605621
11 2024-01-21 00:02:00      102 -0.076007
12 2024-01-21 00:04:00      101  1.609458
13 2024-01-21 00:06:00      102 -0.137637
14 2024-01-21 00:08:00      101 -0.317564

We should be able to call fit_transform on the SessionEncoder like so:

se = SessionEncoder(add_duration=True, add_sessions_per_user=True, add_session_time=True)
se.fit_transform(df)

and get something like this:

             timestamp  user_id     value  session_id  session_duration  \
0  2024-01-01 00:00:00      101 -1.072157           0                 4   
1  2024-01-01 00:02:00      101 -2.281613           0                 4   
2  2024-01-01 00:06:00      101  1.929639           0                 4   
3  2024-01-01 00:08:00      101  1.285300           0                 4   
4  2024-01-11 00:00:00      101  1.279716           1                 3   
5  2024-01-11 00:02:00      101  0.958734           1                 3   
6  2024-01-11 00:08:00      101 -0.021126           1                 3   
7  2024-01-21 00:00:00      101 -1.605621           2                 3   
8  2024-01-21 00:04:00      101  1.609458           2                 3   
9  2024-01-21 00:08:00      101 -0.317564           2                 3   
10 2024-01-01 00:04:00      102  0.462635           3                 1   
11 2024-01-11 00:04:00      102  0.041920           4                 2   
12 2024-01-11 00:06:00      102  0.383161           4                 2   
13 2024-01-21 00:02:00      102 -0.076007           5                 2   
14 2024-01-21 00:06:00      102 -0.137637           5                 2   

    sessions_per_user  total_session_time  
0                  10            12000000  
1                  10            12000000  
2                  10            12000000  
3                  10            12000000  
4                  10            12000000  
5                  10            12000000  
6                  10            12000000  
7                  10            12000000  
8                  10            12000000  
9                  10            12000000  
10                  5             6000000  
11                  5             6000000  
12                  5             6000000  
13                  5             6000000  
14                  5             6000000
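For readers who want to see the session_id column derived step by step, here is a minimal sketch in plain pandas. It is not the code in this PR. It assumes a new session starts whenever the user changes or more than 30 minutes separate two consecutive events of the same user (the threshold is an assumption), and it reuses only the timestamps and user ids from the input table above.

```python
import pandas as pd

# Timestamps and user ids copied from the example input above
# (the random "value" column is left out).
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                f"2024-01-{day} 00:0{minute}:00"
                for day in ("01", "11", "21")
                for minute in (0, 2, 4, 6, 8)
            ]
        ),
        "user_id": [101, 101, 102, 101, 101,
                    101, 101, 102, 102, 101,
                    101, 102, 101, 102, 101],
    }
)

# Assumed inactivity threshold; the real parameter name and unit may differ.
gap = pd.Timedelta(minutes=30)

# One contiguous, time-ordered log per user.
ordered = df.sort_values(["user_id", "timestamp"], kind="stable").reset_index(drop=True)

# A row opens a new session if it belongs to a different user than the
# previous row, or if too much time has passed since the previous row.
same_user = ordered["user_id"].eq(ordered["user_id"].shift())
elapsed = ordered["timestamp"].diff()
new_session = ~same_user | (elapsed > gap)

# Cumulative count of session starts gives ids 0, 1, 2, ...
ordered["session_id"] = new_session.cumsum() - 1
print(ordered)
```

With this rule the ids come out as 0 to 2 for user 101 and 3 to 5 for user 102, in the same row order as the session_id column shown above.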

GaelVaroquaux · 2026-02-24T17:10:21Z

Can you do a small example (even one that does not work currently) to showcase a bit how you are thinking to use such an object? Thanks!!

rcap107 · 2026-02-25T09:06:22Z

I added an example with some possible parameters that we could add to the encoder.

I've been using this dataset as an example https://www.kaggle.com/datasets/mylesoneill/warcraft-avatar-history?select=wowah_data.csv

I've already noticed a pretty big difference in performance between pandas and polars.

rcap107 · 2026-02-25T14:53:09Z

I think the code can already be reviewed for early comments. I've added tests and doctests, and so far the basic approach for sessionization is working.

In the end I decided to avoid adding more features because those only make sense after aggregation, and that should be done by the user with the data ops.

Something that needs to be decided is the resolution of the session duration: for now, it's in minutes, with 30 minutes being the default value. We might want to change that to seconds, or add a parameter so the user can decide the resolution by themselves.
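To illustrate why per-session statistics belong to a later aggregation step, the sketch below starts from events that already carry a session id and aggregates them with a plain pandas groupby. The column names and the data are made up for the illustration, and durations are reported in seconds only as one possible choice of resolution.

```python
import pandas as pd

# Hypothetical events that already have a session id attached.
events = pd.DataFrame(
    {
        "user_id": ["alice", "alice", "alice", "bob", "bob"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 10:00:00",
                "2024-01-01 10:05:00",
                "2024-01-01 11:00:00",
                "2024-01-01 09:00:00",
                "2024-01-01 09:10:00",
            ]
        ),
        "session_id": [0, 0, 1, 2, 2],
    }
)

# One row per session: owner, number of events, first and last timestamp.
per_session = (
    events.groupby("session_id")
    .agg(
        user_id=("user_id", "first"),
        n_events=("timestamp", "count"),
        start=("timestamp", "min"),
        end=("timestamp", "max"),
    )
    .reset_index()
)

# Session duration in seconds, from first to last event.
per_session["duration_s"] = (per_session["end"] - per_session["start"]).dt.total_seconds()

# Number of distinct sessions per user.
sessions_per_user = per_session.groupby("user_id")["session_id"].nunique()
print(per_session)
print(sessions_per_user)
```

Statistics such as `duration_s` or `sessions_per_user` are only defined once rows are grouped, which is why they fit more naturally after the encoder than inside it.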

rcap107 · 2026-02-25T16:36:08Z

I also just realized that the "by" column is not necessary: maybe the sessions are just a sequence of operations executed by the user, and we only care that there is a gap of a certain duration between actions to mark a new session.

I'll update the code to reflect that.

jeromedockes · 2026-02-25T16:49:59Z

> I also just realized that the "by" column is not necessary:

not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

rcap107 · 2026-02-25T16:57:17Z

> > I also just realized that the "by" column is not necessary:
>
> not sure I understand -- we need to know which column(s) identify users, to have one log per user, and then sessionize that: an action from user B does not prolong the session of user A

I could have a dataset where I only have the timestamp, like a logfile that has only the events. Then, I may still want to group the events so that sessions are delimited by periods of activity. Like, the single user has been connected for this long, then they disconnected, then they connected again. If there is only a single user we care about, then there's no need to group by that.

jeromedockes · 2026-02-25T16:59:47Z

sorry, I had misunderstood the "is not necessary" as "we can remove this parameter". yes I agree it can be optional :)
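A minimal sketch of what an optional grouping could look like, written as a standalone pandas helper. The function name `add_session_id`, its parameters, and the "session_id" output column are hypothetical and are not taken from the PR.

```python
import pandas as pd


def add_session_id(df, timestamp, by=None, gap="30min"):
    """Hypothetical helper: add a "session_id" column to a copy of df.

    `by` may be None (single log), a column name, or a list of column names.
    """
    by = [] if by is None else ([by] if isinstance(by, str) else list(by))
    ordered = df.sort_values([*by, timestamp], kind="stable")

    # A long enough pause always opens a new session.
    new_session = ordered[timestamp].diff() > pd.Timedelta(gap)
    if by:
        # So does a change in any of the grouping columns.
        changed = (ordered[by] != ordered[by].shift()).any(axis=1)
        new_session = new_session | changed

    ids = new_session.cumsum()
    ids = ids - ids.iloc[0]  # make the first session id 0 in both cases
    # assign() aligns on the index, so the original row order is preserved.
    return df.assign(session_id=ids)


# A log with only timestamps: no grouping column is needed.
log = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 10:05", "2024-01-01 11:00", "2024-01-01 11:10"]
        )
    }
)
print(add_session_id(log, "timestamp"))
```

In the timestamp-only case the 55-minute pause between the second and third event is the only thing that separates the two sessions.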

rcap107 · 2026-02-26T11:00:08Z

The example is still missing, but the rest of the code is ready for review.

There are two points that still need to be discussed:

How do we define the resolution of the session gap? for the moment it's in minutes. Should I add a parameter to let the user decide the granularity?
What should I use as the name for the session id column? At the moment I just have "session_id", but it should probably be something like "TIMESERIES_NAME_session_id" instead

I also need to see if the thing works when it's put in a pipeline or if I'm missing something.

jeromedockes · 2026-02-26T15:52:27Z

> How do we define the resolution of the session gap? for the moment it's in minutes. Should I add a parameter to let the user decide the granularity?

I would say by default any duration passed as a number is usually in seconds, and we can also allow strings like "2s", "2m", "2h"
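One way this convention could be handled is sketched below: bare numbers are read as seconds, and strings carry an "s", "m", or "h" suffix. The helper name `gap_to_seconds` and the accepted grammar are assumptions for illustration, not the PR's API.

```python
import re

# Seconds per supported unit suffix.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def gap_to_seconds(gap):
    """Hypothetical helper: convert a session gap to a number of seconds."""
    if isinstance(gap, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("gap must be a number or a string such as '2m'")
    if isinstance(gap, (int, float)):
        # A bare number is interpreted as seconds.
        return float(gap)
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([smh])\s*", gap)
    if match is None:
        raise ValueError(f"cannot parse session gap: {gap!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


print(gap_to_seconds(30))
print(gap_to_seconds("2s"), gap_to_seconds("2m"), gap_to_seconds("2h"))
```

Normalizing everything to seconds up front keeps the comparison against timestamp differences independent of how the user wrote the gap.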

jeromedockes · 2026-02-26T16:06:02Z

+       user_id           timestamp   action  timestamp_session_id
+    0    alice 2024-01-01 10:00:00    login                     0
+    1    alice 2024-01-01 10:05:00     view                     0
+    2    alice 2024-01-01 11:00:00   logout                     1


maybe pick something else than 'logout' because in this example it looks like the logout was in reality probably the last event of the first session

jeromedockes · 2026-02-26T16:09:17Z

+            if self.by is not None
+            else [self.timestamp]
+        )
+        X_sorted = sbd.sort(X, by=sort_by)


maybe we could keep only the columns we need before sorting, so that X_sorted will be smaller in case X is large

Yes that's a good idea
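The suggestion can be sketched in plain pandas as follows. The PR itself goes through the `sbd` helpers shown in the diff, so this is only an illustration of the idea; the frame and its "payload" column are made up.

```python
import pandas as pd

# Hypothetical input: two key columns plus a payload column that is
# irrelevant for detecting sessions.
X = pd.DataFrame(
    {
        "user_id": ["bob", "alice", "alice", "bob"],
        "timestamp": pd.to_datetime(
            ["2024-01-01 09:00", "2024-01-01 10:05", "2024-01-01 10:00", "2024-01-01 09:10"]
        ),
        "payload": ["w", "x", "y", "z"],
    }
)
by, timestamp = ["user_id"], "timestamp"

# Keep only the columns needed for sessionization, then sort that narrow
# frame: the sorted copy never contains the payload columns.
keys_sorted = X[[*by, timestamp]].sort_values([*by, timestamp], kind="stable")

# The index still identifies the original rows, so anything computed on
# keys_sorted can be aligned back to X afterwards.
print(keys_sorted)
print(X.loc[keys_sorted.index, "payload"].tolist())
```

The saving grows with the number and width of the columns that are left out of the sort.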

jeromedockes

thanks @rcap107 !! this is a big one but we are getting close :)

jeromedockes · 2026-06-08T14:01:59Z

+greater predictive power than raw individual events.
+
+The |SessionEncoder| helps addressing this problem by detecting sessions based on
+a timestamp column, other "session columns" (e.g., user and device) that should be


session columns = split_by right?

jeromedockes · 2026-06-08T14:57:49Z

+
+The data includes columns such as event type, device type, viewed price, and
+timestamp. The target is binary: whether the session eventually contains a
+purchase event or not.


as I mentioned I find it weird that we have session-level targets to predict but still need to construct session ids from the events. we can revisit the example and dataset in a later PR

lisaleemcb

Small comments but otherwise looks really great :)

rcap107 · 2026-06-16T09:13:36Z

I think the PR is complete now

jeromedockes · 2026-06-16T09:56:48Z

> I think the PR is complete now

thanks i will do another review

Adding the SessionEncoder

f7fdcd7

rcap107 added 7 commits February 25, 2026 11:51

more work

65be83a

adding tests

a6caeb7

changelog

cc14c55

adding drop cols, various improvements

dba476f

adding a test

d46090d

simplifying tests and code

28b958a

docstrings

ac7389d

rcap107 added 5 commits February 26, 2026 11:11

adding support for multiple by columns

8a61ea0

fixing optional by

a70006f

ddocs

429f155

fixing a compatibility problem

23c4111

changelog

acb091f

rcap107 marked this pull request as ready for review February 26, 2026 10:56

rcap107 added 5 commits February 26, 2026 13:35

improving tests

d69fb9e

renaming session id column

e0199f5

fixing some broken tests

603a57b

doctest

9d2c577

testing error dispatch

8d869b2

jeromedockes reviewed Feb 26, 2026

addressing some of the coments

01fed5f

lisaleemcb reviewed Jun 8, 2026

Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated

lisaleemcb reviewed Jun 8, 2026

Comment thread doc/modules/multi_column_operations/sessionization.rst Outdated

jeromedockes reviewed Jun 8, 2026

rcap107 added 4 commits June 8, 2026 17:12

Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder

e033cdf

addressing some of the comments from the review

2de71c4

clean up test

1dc32c2

addressing more comments

d199ec2

lisaleemcb reviewed Jun 8, 2026

rcap107 and others added 18 commits June 9, 2026 14:33

Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder

aada173

changelog

d22774c

example

f728604

slight rewording

f215bdb

Merge remote-tracking branch 'upstream/HEAD' into feat-session-encoder

679077a

Apply suggestions from code review

5f04c14

Co-authored-by: Lisa <lisaleemcb@gmail.com>

fixing doctest

efada37

reworking docstring, renaming attr

73adcfb

moving error checking, more work on docstring

74f5991

simplifying part of the code

49a4583

addressing comments

77942d2

removing factorizer

05c7bf7

docstring

b389575

doc fixes

9675950

pandas grrr

0787468

fixing test on min deps

2951486

adding a comment

326dbb9

changelog

e04fd29

rcap107 requested a review from jeromedockes June 16, 2026 09:13
