Update notebooks for TimeSeries MasterClass by snath-xoc · Pull Request #3 · probabl-ai/forecasting

snath-xoc · 2026-05-03T19:21:36Z

Relevant to https://github.com/probabl-ai/labs-orga-private/issues/6, I have checked the new skrub compatibility with the time series forecasting masterclass. Changes in this branch include:

Option in the feature_engineering.py file to directly load in the electricity data. Once the data is within skrub we can switch to using at, as already in progress here: FEAT - Add a fetcher for the electricity forecasting dataset skrub-data/skrub#2013 (P.S. @jeromedockes )
TableReport in skrub seems to not work when calling open, for now I added , max_plot_columns=0 as a workaround but this is sub-optimal
Multiple-horizon forecasting now uses direct forecasting such that a separate model is fit for individual time horizons as according to: https://github.com/probabl-ai/skore-example-electricity-load. @jeromedockes this is mainly from your work so would be good to get your feedback. I probably need to add in more details and descriptions anyways so feel free to critique. The notebook runs super slow now compared to before :/ (GAH)
I added in the TabularFoundational model notebook as according to here: https://tabicl.readthedocs.io/en/latest/tutorials/time_series_forecasting.html added some edits to make it more "skrubby", probably can improve on this and the descriptions.
I messed around with skrubs and within the prediction_interval.py used the skrub.choose_float and ``skrub.choose_int``` options when initialising the HGBR hyperparameters, although this will only kick in if a hyperparameter search is done so does not change default behaviour.

@ogrisel, @glemaitre this was my first attempt at shaping the tutorial together. Skrub seems to work well apart from the TableReport issue (see 2 above), let me know your initial thoughts.

snath-xoc · 2026-05-03T19:24:14Z

Hmmmm running into fail to open in web browser issues, it worked locally though.

jeromedockes

here are a few questions from a quick glance but I will find time to take a deeper look!

I think the main issue with the current approach (in the main branch, not specific to this pr) is that the feature engineering is anchored around the prediction time, and the target is shifted (as seen here), rather than anchoring the feature engineering to the target time. this makes it harder to have fixed features like 'at the same time on the previous day' which is why prediction is worse for some horizons (as seen on the bell shape of MAPE curves here )

also I think maybe we could try to separate more EDA and plotting of results (which don't need skrub) from the actual predictive pipeline, but I'm not sure

jeromedockes · 2026-05-12T07:13:48Z


-TableReport(electricity_lagged.skb.eval())
-
+TableReport(electricity_lagged.skb.eval(), max_plot_columns=0).open()


open() uses a subprocess to actually open the report as a web page in the user's web browser. in environments where we cannot open a browser such as jupyterlite or a remote server it will not work. however in notebooks the report is displayed inline when it is the output of a cell. so if the class is mostly meant to happen in a jupyterlite notebook the easiest thing to do is probably just remove the .open() call

jeromedockes · 2026-05-12T07:27:51Z

+prediction_timestamps = pd.to_datetime(to_numpy_1d(prediction_time.skb.eval()), utc=True)
+
+
+class DateBasedSplitter(BaseCrossValidator):


detail: I don't think you need the BaseCrossValidator. the only thing it brings here would be the __repr__ but we don't seem to print a splitter anywhere

jeromedockes · 2026-05-12T07:31:59Z

+        return sum(1 for _ in self.split(X=X, y=y, groups=groups))
+
+
+def make_direct_predictions(estimator_factory):


out of curiosity: why a factory rather than passing an estimator directly?

jeromedockes · 2026-05-12T07:33:24Z

+    data = {}
+    for horizon, prediction in predictions_by_horizon.items():
+        target_column_name = target_column_name_pattern.format(horizon=horizon)
+        data[f"predicted_{target_column_name}"] = to_numpy_1d(prediction.skb.eval())


i'm not sure I understand this part -- here we are collecting the predictions on the full (not splitted data). and then why to we convert back to a DataOp?

jeromedockes · 2026-05-12T07:38:12Z

 plot_horizon_forecast(
    targets,
-    named_predictions,
+    named_predictions_hgbr,


this is not introduced by this PR but I find it a bit strange to have a plot as a DataOp. is it done like this to handle subsampling?

jeromedockes · 2026-05-12T08:44:27Z

I think the main issue with the current approach (in the main branch, not specific to this pr) is that the feature engineering is anchored around the prediction time, and the target is shifted

or I guess this may be the most natural way for predicting multiple horizons, but in this case the feature engineering should be different for each horizon and be aware of the horizon it is preparing features for, so that it can make the most appropriate shifts

ogrisel · 2026-05-21T14:03:47Z

@snath-xoc let me know when you have consolidated the work of Jerome to modernize our use of skrub and skore in this material, in particular w.r.t. per-horizon feature engineering.

Once done I can review the PR and we can discuss if we want to split into additional shorter notebooks or not.

ogrisel · 2026-05-21T14:06:09Z

Also, the CI is broken because TableReport seems to attempt to open a webbrowser. There is probably a way to tell it not to do that in headless CI environment.

Let's try to always get the CI green as it is necessary to keep a working set of notebooks.

jeromedockes · 2026-05-21T14:50:24Z

Also, the CI is broken because `TableReport` seems to attempt to open a webbrowser. There is probably a way to tell it not to do that in headless CI environment.

it opens a browser tab when we call TableReport.open() . however in a notebook that should not be necessary, because when the report is the output of a cell it is displayed directly in the notebook

snath-xoc · 2026-06-01T09:17:48Z

Hi all,

After discussion with @jeromedockes and @ogrisel I implemented changes to match here

Some issues:

I couldn't get .skb.with_scoring working for skrub versions less than 0.9.0, is this just me?
Optuna with skrub was throwing some errors I need to check this
When creating the TimeSeriesSplitter (in next_horizon_prediction.py), the splits are returned with an extra column allow_object so I constantly have to use .drop("allow_object"), this may just be a bug on my side not sure why it appears

I will carry on investigating the optuna side of things but the feature_engineering and next/mutliple horizon prediction should be O.K. for a first feedback pass if you want @jeromedockes and @ogrisel. I also included the %%writefile feature so we don't have to constantly rewrite functions, keen to hear your feedback on its current implementation @ogrisel

jeromedockes

thank you @snath-xoc ! here are a few comments on the first notebook

jeromedockes · 2026-06-03T12:40:35Z

these 2 variables don't seem to be used anymore

jeromedockes · 2026-06-03T12:45:29Z

-    time_zone="UTC",
-):
-    """Define an historical time range shared by all data sources."""
+%%writefile time_range.py


I wonder if there is a way to use this magic command but still keep the python file runnable as a script without a syntax error 🤔

jeromedockes · 2026-06-03T12:53:36Z

+        allow_object=True,
    )

+# %%


one potential drawback of the writefile is that the function definition and the line that calls it have to be in separate cells. I don't know this is a nuisance for notebook power users to have to execute 2 cells when they edit the function to see the updated output. however in any case in this portion of the tutorial they are mostly reading it and not editing anyway?

also not sure if it is interesting but I guess it should be possible to define a magic command as shown here that both writes and executes the cell if you wanted to avoid the imports. doing so would also probably allow writing all the functions in a single file with %%writefile -a if you wanted to avoid multiple small files

Ooooh interesting am playing around with this.

:) still I think it's maybe simpler to write the notebook normally and scrape the functions out of it before moving on to the second notebook as here

jeromedockes · 2026-06-03T12:59:58Z

+from data_paths import data_dir
+from time_range import time_range
+
+def load_electricity_load_data(data_dir=data_dir):


maybe the function should be called get_data_dir instead of just data_dir? and the parameter here could be an actual path rather than a function that needs to be called?

jeromedockes · 2026-06-03T13:26:32Z

+        how="left",
+        maintain_order="left",
+    )
+def fetch_city_weather(city, data_dir=data_dir):


I wonder if it would help to split the cell and show the intermediate results after adding lags, weather and holidays to keep the iterative / interactive aspect?

jeromedockes · 2026-06-03T13:38:29Z

I couldn't get .skb.with_scoring working for skrub versions less than 0.9.0, is this just me?

this is normal, it is a new feature. if requesting skrub >= 0.9.0 is an issue, here with_scoring is not really necessary, so we can pass the splitter when calling cross_validate() as you did

jeromedockes · 2026-06-03T13:40:09Z

When creating the TimeSeriesSplitter (in next_horizon_prediction.py), the splits are returned with an extra column allow_object so I constantly have to use .drop("allow_object"), this may just be a bug on my side not sure why it appears

I think it is due to the added with_columns(..., allow_object=True) here. removing that line should fix the problem

jeromedockes · 2026-06-03T14:28:43Z

maybe an alternative to writefile could be to run a script during the building of the book to extract all the functions defined in the feature engineering python file, to write a module from which they can be imported. then students can be told where to find the module and how it is written if they are curious, but it removes the clutter due to writefile, allows splitting the cells without constraints, and keeps the python file executable as a script rather than a notebook. it could be done with something like this

import ast
from pathlib import Path


def extract_defs(source):
    """Extract imports & function and class definitions, discarding other statements."""
    lines = source.splitlines(keepends=True)
    tree = ast.parse(source)
    import_snippets = []
    def_snippets = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            import_snippets.append("".join(lines[node.lineno - 1 : node.end_lineno]))
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            lineno = node.decorator_list[0].lineno if node.decorator_list else node.lineno
            def_snippets.append("".join(lines[lineno - 1 : node.end_lineno]))
    out = "".join(sorted(set(import_snippets))) + "\n\n" + "\n\n".join(def_snippets)
    return out


if __name__ == "__main__":
    fname = "feature_engineering.py"
    source = Path(fname).read_text("utf-8")
    print(extract_defs(source))

jeromedockes · 2026-06-03T15:39:06Z

-# follows:
+        while current_test_start < max_date:
+            test_start_dates.append(current_test_start)
+            # advance by 3 months (quarter)


in this case do you want to align the first date with a january 1st or something like that?

I added in a method to align the start_date to the 1st of a month... was not sure if I should stick it onto January as this may lead to jumping forward a whole year.

jeromedockes · 2026-06-03T15:45:55Z

-target = targets[target_column_name].skb.mark_as_y()
-target
+TIME_HORIZON = 1 # Focus on next step prediction
+load_electricity_load_history = skrub.as_data_op(load_electricity_load_data).skb.set_name(


maybe there could be a function to create those objects to avoid duplicating that code across notebooks; not sure if it would make things simpler or more complicated

jeromedockes · 2026-06-03T16:05:47Z

I think this part needs to be updated too, for example features_with_dropped_cols does not exist anymore.

I also wonder if we should chain apply() calls instead of using a pipeline. it allows inspecting the intermediate results in reports, but here it would require discussing a subtlety that for the spline transformer it would need to be apply(SplineTransformer(sparse_output=True), no_wrap=True) because otherwise we get an error trying to put the sparse array in a dataframe. so maybe it is better to use the pipeline

we have discussed removing the need for no_wrap=True in this situation (sparse output + cols is None) so we can revisit that if that change gets done in skrub

Added a modification, check if you agree?

jeromedockes · 2026-06-03T16:12:26Z

+
+# %%
+TIME_HORIZONS = (1,12,24)
+load_electricity_load_history = skrub.as_data_op(load_electricity_load_data).skb.set_name(


here also maybe we can remove some of the duplication with other notebooks 🤔

Co-authored-by: Jérôme Dockès <jerome@dockes.org>

add tabicl and modify multiple forecast

5118c98

snath-xoc moved this to In progress in Labs May 6, 2026

snath-xoc added this to Labs May 6, 2026

snath-xoc added 2 commits May 10, 2026 23:24

fix typos

52f7497

fix typos

2dff41f

jeromedockes reviewed May 12, 2026

View reviewed changes

snath-xoc marked this pull request as draft June 1, 2026 09:09

add horizon specific pipeline and rename single horizon

e4bf28a

jeromedockes reviewed Jun 3, 2026

View reviewed changes

snath-xoc and others added 3 commits June 11, 2026 18:17

Update content/python_files/feature_engineering.py

c9a5196

Co-authored-by: Jérôme Dockès <jerome@dockes.org>

add reviews

0f42095

fix typos

5428f3d


		TableReport(electricity_lagged.skb.eval())

		TableReport(electricity_lagged.skb.eval(), max_plot_columns=0).open()

		prediction_timestamps = pd.to_datetime(to_numpy_1d(prediction_time.skb.eval()), utc=True)


		class DateBasedSplitter(BaseCrossValidator):

		return sum(1 for _ in self.split(X=X, y=y, groups=groups))


		def make_direct_predictions(estimator_factory):

Conversation

snath-xoc commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

snath-xoc commented May 3, 2026

Uh oh!

jeromedockes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeromedockes commented May 12, 2026

Uh oh!

ogrisel commented May 21, 2026

Uh oh!

ogrisel commented May 21, 2026

Uh oh!

jeromedockes commented May 21, 2026 via email

Uh oh!

snath-xoc commented Jun 1, 2026

Uh oh!

jeromedockes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jeromedockes commented Jun 3, 2026

Uh oh!

jeromedockes commented Jun 3, 2026

Uh oh!

jeromedockes commented Jun 3, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

snath-xoc commented May 3, 2026 •

edited

Loading