Update notebooks for TimeSeries MasterClass#3
Conversation
|
Hmmmm running into fail to open in web browser issues, it worked locally though. |
jeromedockes
left a comment
There was a problem hiding this comment.
here are a few questions from a quick glance but I will find time to take a deeper look!
I think the main issue with the current approach (in the main branch, not specific to this pr) is that the feature engineering is anchored around the prediction time, and the target is shifted (as seen here), rather than anchoring the feature engineering to the target time. this makes it harder to have fixed features like 'at the same time on the previous day' which is why prediction is worse for some horizons (as seen on the bell shape of MAPE curves here )
also I think maybe we could try to separate more EDA and plotting of results (which don't need skrub) from the actual predictive pipeline, but I'm not sure
|
|
||
| TableReport(electricity_lagged.skb.eval()) | ||
|
|
||
| TableReport(electricity_lagged.skb.eval(), max_plot_columns=0).open() |
There was a problem hiding this comment.
open() uses a subprocess to actually open the report as a web page in the user's web browser. in environments where we cannot open a browser such as jupyterlite or a remote server it will not work. however in notebooks the report is displayed inline when it is the output of a cell. so if the class is mostly meant to happen in a jupyterlite notebook the easiest thing to do is probably just remove the .open() call
| prediction_timestamps = pd.to_datetime(to_numpy_1d(prediction_time.skb.eval()), utc=True) | ||
|
|
||
|
|
||
| class DateBasedSplitter(BaseCrossValidator): |
There was a problem hiding this comment.
detail: I don't think you need the BaseCrossValidator. the only thing it brings here would be the __repr__ but we don't seem to print a splitter anywhere
| return sum(1 for _ in self.split(X=X, y=y, groups=groups)) | ||
|
|
||
|
|
||
| def make_direct_predictions(estimator_factory): |
There was a problem hiding this comment.
out of curiosity: why a factory rather than passing an estimator directly?
| data = {} | ||
| for horizon, prediction in predictions_by_horizon.items(): | ||
| target_column_name = target_column_name_pattern.format(horizon=horizon) | ||
| data[f"predicted_{target_column_name}"] = to_numpy_1d(prediction.skb.eval()) |
There was a problem hiding this comment.
i'm not sure I understand this part -- here we are collecting the predictions on the full (not splitted data). and then why to we convert back to a DataOp?
| plot_horizon_forecast( | ||
| targets, | ||
| named_predictions, | ||
| named_predictions_hgbr, |
There was a problem hiding this comment.
this is not introduced by this PR but I find it a bit strange to have a plot as a DataOp. is it done like this to handle subsampling?
or I guess this may be the most natural way for predicting multiple horizons, but in this case the feature engineering should be different for each horizon and be aware of the horizon it is preparing features for, so that it can make the most appropriate shifts |
|
@snath-xoc let me know when you have consolidated the work of Jerome to modernize our use of skrub and skore in this material, in particular w.r.t. per-horizon feature engineering. Once done I can review the PR and we can discuss if we want to split into additional shorter notebooks or not. |
|
Also, the CI is broken because Let's try to always get the CI green as it is necessary to keep a working set of notebooks. |
|
Also, the CI is broken because `TableReport` seems to attempt to open a webbrowser. There is probably a way to tell it not to do that in headless CI environment.
it opens a browser tab when we call TableReport.open() . however in a notebook that should not be necessary, because when the report is the output of a cell it is displayed directly in the notebook
|
|
Hi all, After discussion with @jeromedockes and @ogrisel I implemented changes to match here Some issues:
I will carry on investigating the optuna side of things but the feature_engineering and next/mutliple horizon prediction should be O.K. for a first feedback pass if you want @jeromedockes and @ogrisel. I also included the %%writefile feature so we don't have to constantly rewrite functions, keen to hear your feedback on its current implementation @ogrisel |
jeromedockes
left a comment
There was a problem hiding this comment.
thank you @snath-xoc ! here are a few comments on the first notebook
There was a problem hiding this comment.
these 2 variables don't seem to be used anymore
| time_zone="UTC", | ||
| ): | ||
| """Define an historical time range shared by all data sources.""" | ||
| %%writefile time_range.py |
There was a problem hiding this comment.
I wonder if there is a way to use this magic command but still keep the python file runnable as a script without a syntax error 🤔
| allow_object=True, | ||
| ) | ||
|
|
||
| # %% |
There was a problem hiding this comment.
one potential drawback of the writefile is that the function definition and the line that calls it have to be in separate cells. I don't know this is a nuisance for notebook power users to have to execute 2 cells when they edit the function to see the updated output. however in any case in this portion of the tutorial they are mostly reading it and not editing anyway?
also not sure if it is interesting but I guess it should be possible to define a magic command as shown here that both writes and executes the cell if you wanted to avoid the imports. doing so would also probably allow writing all the functions in a single file with %%writefile -a if you wanted to avoid multiple small files
There was a problem hiding this comment.
Ooooh interesting am playing around with this.
There was a problem hiding this comment.
:) still I think it's maybe simpler to write the notebook normally and scrape the functions out of it before moving on to the second notebook as here
| from data_paths import data_dir | ||
| from time_range import time_range | ||
|
|
||
| def load_electricity_load_data(data_dir=data_dir): |
There was a problem hiding this comment.
maybe the function should be called get_data_dir instead of just data_dir? and the parameter here could be an actual path rather than a function that needs to be called?
| how="left", | ||
| maintain_order="left", | ||
| ) | ||
| def fetch_city_weather(city, data_dir=data_dir): |
There was a problem hiding this comment.
I wonder if it would help to split the cell and show the intermediate results after adding lags, weather and holidays to keep the iterative / interactive aspect?
this is normal, it is a new feature. if requesting skrub >= 0.9.0 is an issue, here with_scoring is not really necessary, so we can pass the splitter when calling cross_validate() as you did |
I think it is due to the added |
|
maybe an alternative to writefile could be to run a script during the building of the book to extract all the functions defined in the feature engineering python file, to write a module from which they can be imported. then students can be told where to find the module and how it is written if they are curious, but it removes the clutter due to writefile, allows splitting the cells without constraints, and keeps the python file executable as a script rather than a notebook. it could be done with something like this import ast
from pathlib import Path
def extract_defs(source):
"""Extract imports & function and class definitions, discarding other statements."""
lines = source.splitlines(keepends=True)
tree = ast.parse(source)
import_snippets = []
def_snippets = []
for node in tree.body:
if isinstance(node, (ast.Import, ast.ImportFrom)):
import_snippets.append("".join(lines[node.lineno - 1 : node.end_lineno]))
elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
lineno = node.decorator_list[0].lineno if node.decorator_list else node.lineno
def_snippets.append("".join(lines[lineno - 1 : node.end_lineno]))
out = "".join(sorted(set(import_snippets))) + "\n\n" + "\n\n".join(def_snippets)
return out
if __name__ == "__main__":
fname = "feature_engineering.py"
source = Path(fname).read_text("utf-8")
print(extract_defs(source)) |
| # follows: | ||
| while current_test_start < max_date: | ||
| test_start_dates.append(current_test_start) | ||
| # advance by 3 months (quarter) |
There was a problem hiding this comment.
in this case do you want to align the first date with a january 1st or something like that?
There was a problem hiding this comment.
I added in a method to align the start_date to the 1st of a month... was not sure if I should stick it onto January as this may lead to jumping forward a whole year.
| target = targets[target_column_name].skb.mark_as_y() | ||
| target | ||
| TIME_HORIZON = 1 # Focus on next step prediction | ||
| load_electricity_load_history = skrub.as_data_op(load_electricity_load_data).skb.set_name( |
There was a problem hiding this comment.
maybe there could be a function to create those objects to avoid duplicating that code across notebooks; not sure if it would make things simpler or more complicated
There was a problem hiding this comment.
I think this part needs to be updated too, for example features_with_dropped_cols does not exist anymore.
I also wonder if we should chain apply() calls instead of using a pipeline. it allows inspecting the intermediate results in reports, but here it would require discussing a subtlety that for the spline transformer it would need to be apply(SplineTransformer(sparse_output=True), no_wrap=True) because otherwise we get an error trying to put the sparse array in a dataframe. so maybe it is better to use the pipeline
we have discussed removing the need for no_wrap=True in this situation (sparse output + cols is None) so we can revisit that if that change gets done in skrub
There was a problem hiding this comment.
Added a modification, check if you agree?
|
|
||
| # %% | ||
| TIME_HORIZONS = (1,12,24) | ||
| load_electricity_load_history = skrub.as_data_op(load_electricity_load_data).skb.set_name( |
There was a problem hiding this comment.
here also maybe we can remove some of the duplication with other notebooks 🤔
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
Relevant to https://github.com/probabl-ai/labs-orga-private/issues/6, I have checked the new skrub compatibility with the time series forecasting masterclass. Changes in this branch include:
, max_plot_columns=0as a workaround but this is sub-optimalskrub.choose_floatand ``skrub.choose_int``` options when initialising the HGBR hyperparameters, although this will only kick in if a hyperparameter search is done so does not change default behaviour.@ogrisel, @glemaitre this was my first attempt at shaping the tutorial together. Skrub seems to work well apart from the
TableReportissue (see 2 above), let me know your initial thoughts.