2 changes: 1 addition & 1 deletion .github/workflows/check_stub_files_diff.yaml
@@ -17,7 +17,7 @@ jobs:
 - uses: actions/checkout@v6
 - uses: prefix-dev/setup-pixi@v0.9.6
   with:
-    pixi-version: v0.59.0
+    pixi-version: v0.68.0
     frozen: true

 - name: Check stub file for `_data_ops.py` is up-to-date
2 changes: 1 addition & 1 deletion .github/workflows/run-code-format-checks.yaml
@@ -17,7 +17,7 @@ jobs:
 - uses: actions/checkout@v6
 - uses: prefix-dev/setup-pixi@v0.9.6
   with:
-    pixi-version: v0.59.0
+    pixi-version: v0.68.0
     frozen: true

 - name: Run tests
2 changes: 1 addition & 1 deletion .github/workflows/test-javascript.yml
@@ -15,7 +15,7 @@ jobs:
 - uses: actions/checkout@v6
 - uses: prefix-dev/setup-pixi@v0.9.6
   with:
-    pixi-version: v0.59.0
+    pixi-version: v0.68.0
     environments: ci-py314-latest-optional-deps
     # we can freeze the environment and manually bump the dependencies to the
     # latest version time to time.
4 changes: 2 additions & 2 deletions .github/workflows/testing.yml
@@ -27,7 +27,7 @@ jobs:
 - uses: actions/checkout@v6
 - uses: prefix-dev/setup-pixi@v0.9.6
   with:
-    pixi-version: v0.59.0
+    pixi-version: v0.68.0
     environments: ${{ matrix.environment }}
     # we can freeze the environment and manually bump the dependencies to the
     # latest version time to time.
@@ -63,7 +63,7 @@ jobs:
 - uses: actions/checkout@v6
 - uses: prefix-dev/setup-pixi@v0.9.6
   with:
-    pixi-version: v0.59.0
+    pixi-version: v0.68.0
     environments: ci-nightly-deps
     # we can freeze the environment and manually bump the dependencies to the
     # latest version time to time.
2 changes: 1 addition & 1 deletion .github/workflows/update_pixi_lock_files.yml
@@ -24,7 +24,7 @@ jobs:
 - uses: actions/checkout@v6
 - uses: prefix-dev/setup-pixi@v0.9.6
   with:
-    pixi-version: v0.59.0
+    pixi-version: v0.68.0
     run-install: false

 - name: Remove the current lock file
1 change: 1 addition & 0 deletions doc/conf.py
@@ -70,6 +70,7 @@
     "sphinx.ext.linkcode",
     "sphinx.ext.autodoc.typehints",
     # contrib
+    "sphinx_design",
     "numpydoc",
     "sphinx_issues",
     "sphinx_copybutton",
86 changes: 55 additions & 31 deletions doc/data_ops.rst
@@ -1,36 +1,60 @@
.. _user_guide_data_ops_index:

-Complex multi-table pipelines with Data Ops
-===========================================
-
-Skrub provides an easy way to build complex, flexible machine learning pipelines.
-There are several needs that are not easily addressed with standard scikit-learn
-tools such as :class:`~sklearn.pipeline.Pipeline` and
-:class:`~sklearn.compose.ColumnTransformer`, and for which the skrub DataOps offer
-a solution:
-
-- Multiple tables: We often have several tables of different shapes (for
-  example, "Customers", "Orders", and "Products" tables) that need to be
-  processed and assembled into a design matrix ``X``. The target ``y`` may also
-  be the result of some data processing. Standard scikit-learn estimators do not
-  support this, as they expect right away a single design matrix ``X`` and a
-  target array ``y``, with one row per observation.
-- DataFrame wrangling: Performing typical DataFrame operations such as
-  projections, joins, and aggregations should be possible and allow leveraging
-  the powerful and familiar APIs of `Pandas <https://pandas.pydata.org>`_ or
-  `Polars <https://docs.pola.rs/>`_.
-- Hyperparameter tuning: Choices of estimators, hyperparameters, and even
-  the pipeline architecture can be guided by validation scores. Specifying
-  ranges of possible values outside of the pipeline itself (as in
-  :class:`~sklearn.model_selection.GridSearchCV`) is difficult in complex
-  pipelines.
-- Iterative development: Building a pipeline step by step while inspecting
-  intermediate results allows for a short feedback loop and early discovery of
-  errors.
-
-In this section we cover all about the skrub Data Ops, from starting out with a
-simple example, to more advanced concepts like parameter tuning and pipeline
-validation.
+.. currentmodule:: skrub
+
+Building complete pipelines with DataOps
+========================================
+
+A skrub DataOp is a complete machine learning pipeline, from data loading and
+wrangling to the final prediction, in a single object that can be fitted, tuned,
+cross-validated, and saved in a file like any scikit-learn estimator.
+
+By integrating all of the data processing, DataOps help to validate pipelines
+while **avoiding data leakage**, to **tune complex modelling choices**, and to keep
+track of important **fitted (learned) state**.
+
+To solve a machine-learning task we often need to combine multiple operations
+such as loading and filtering data, joining tables and computing aggregations,
+extracting numerical features, and fitting a classifier or regressor.
+
+**Storing state**  Each of those operations may need to be fitted: to learn some
+information from training data and reuse it to apply consistent transformations
+to new data. This is the case for transformers like the
+:class:`~sklearn.preprocessing.StandardScaler` and :class:`TableVectorizer` and
+estimators like :class:`~sklearn.ensemble.RandomForestClassifier`.
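
The idea of fitted state can be sketched in plain Python. This is an illustrative toy under stated assumptions, not skrub or scikit-learn code: ``MiniScaler`` and the sample values are hypothetical names and data made up for the example.

```python
from statistics import mean, pstdev


class MiniScaler:
    """Toy stand-in for a fitted transformer (hypothetical, not a library class)."""

    def fit(self, values):
        # Learn the statistics from the training data only and store them.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # fall back to 1.0 for constant input
        return self

    def transform(self, values):
        # Reuse the stored statistics; nothing is re-learned here.
        return [(v - self.mean_) / self.scale_ for v in values]


train = [10.0, 12.0, 14.0]
new = [12.0, 16.0]

scaler = MiniScaler().fit(train)
scaled_new = scaler.transform(new)
```

The new values are scaled with the mean and spread learned from ``train``, so the same transformation is applied consistently no matter which data arrives later.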

+**Tuning**  Moreover, each processing step may involve decisions that need to be
+tuned (*tuning* means finding the value that gives the best predictive
+performance), for example: what weather forecast features should I include to
+predict the load on an electric grid? How should I encode a product description
+to help predict the product's category? What learning rate should I set on a
+:class:`~sklearn.ensemble.HistGradientBoostingRegressor`?
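
Tuning can be sketched in a few lines of plain Python: try each candidate value, fit on training data, and keep the value that scores best on held-out data. The one-dimensional ridge model, the helper names, and the data below are assumptions chosen for illustration; they are not part of skrub.

```python
def fit_slope(xs, ys, alpha):
    """Fit y = w * x with an L2 penalty `alpha` (one-dimensional ridge regression)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mean_squared_error(w, xs, ys):
    """Average squared difference between the predictions w * x and the targets."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data, split into a training part and a held-out validation part.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9]
x_val, y_val = [5.0, 6.0], [5.1, 5.9]

# Candidate values for the tunable choice.
candidates = [0.0, 1.0, 10.0]

# Fit on the training data only, then score on data the model has not seen.
scores = {
    alpha: mean_squared_error(fit_slope(x_train, y_train, alpha), x_val, y_val)
    for alpha in candidates
}
best_alpha = min(scores, key=scores.get)
```

Scoring on the validation part rather than on the training part is what keeps the comparison between candidates honest.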

+**Validation**  Finally, the quality of predictions must be evaluated on
+held-out data (with a train/test split or cross-validation), taking care to
+**avoid leakage** of test data into the training set.

Review comment from Member, on the **Validation** paragraph:

I wonder if the section on leakage should be moved further up.

I also think there should be a mention of leakage at the very start, because it's really important and it may come a bit late (even though it's not that far down the page).

Reply from Member Author:

OK, I added a (bold font) mention of data leakage at the very start. For the paragraphs that follow, I think the chronological order of when you meet problems is roughly this one (building a pipeline at all, making modelling choices, validation), but that is indeed debatable.

+Separating the data wrangling from the fitted estimator prevents us from
+handling the tasks above correctly. Skrub DataOps help by binding an arbitrary
+set of transformations of any number of inputs in a single estimator. These
+transformations can be easily parametrized with tunable choices. The resulting
+objects have built-in methods for cross-validation and tuning with either Optuna
+or scikit-learn, and for inspecting runs and intermediate results. Once fitted,
+they can be saved in a file, loaded, and applied to new data as easily as a single
+:class:`~sklearn.linear_model.LogisticRegression`.

+.. dropdown:: Going beyond the scikit-learn Pipeline
+    :color: primary
+
+    To some extent, the DataOps exist for the same reasons as the simpler
+    scikit-learn :class:`~sklearn.pipeline.Pipeline` used in other parts of this
+    documentation. However, the Pipeline is too limited for many real-world
+    problems: it can only represent a linear sequence of scikit-learn
+    transformers; the design matrix and target variables must be constructed and
+    divided into training and testing sets outside of the pipeline; the number of
+    rows cannot change; only a single table can be handled; hyperparameter
+    choices are difficult to define; and so on. Skrub DataOps remove those
+    limitations and add several useful features such as interactive previews and
+    integration with Optuna.

Data Ops basic concepts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~