72 changes: 49 additions & 23 deletions docs/api/datapipes/physicsnemo.datapipes.rst
@@ -4,21 +4,23 @@ PhysicsNeMo Datapipes
.. automodule:: physicsnemo.datapipes
.. currentmodule:: physicsnemo.datapipes

The PhysicsNeMo datapipes consist of two separate components.

Prior to version 2.0 of PhysicsNeMo, each datapipe was largely
independent of all others, targeted at very specific datasets and applications,
and broadly not extensible. Those datapipes, preserved in v2.0 for compatibility,
are described in the climate, cae, gnn, and
benchmark subsections.

In PhysicsNeMo v2.0, the datapipes API has been redesigned from scratch to focus
on key factors to enable scientific machine learning training and inference.
This document describes the architecture and design philosophy.

Refer to the examples of PhysicsNeMo for runnable datapipe tutorials to
get started.


Datapipes Philosophy
--------------------

The PhysicsNeMo datapipe structure is built on several key design decisions
@@ -27,27 +29,27 @@
that are specifically made to enable diverse scientific machine learning datasets:
- GPU First: data preprocessing is done on the GPU, not the CPU.
- Isolation of roles: reading data is separate from transforming data, which is
separate from pipelining data for training, which is separate from threading
and stream management. Changing data sources, or preprocessing pipelines,
should require no intervention in other areas.
- Composability and Extensibility: We aim to provide a toolkit and examples that
let you easily build what you need yourself if it's not already here.
- Datapipes as configuration: Changing a pipeline shouldn't require source code
modification; the registry system in PhysicsNeMo datapipes enables Hydra instantiation
of datapipes at runtime for version-controlled, runtime-configured datapipes.
You can also register and instantiate custom components.
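The registry pattern described above can be sketched in plain Python. This is a minimal, illustrative stand-in, not the actual PhysicsNeMo registry API; all names here (``register``, ``instantiate``, ``Normalize``) are hypothetical:

```python
# Minimal sketch of a component registry for config-driven instantiation.
# Illustrative only; not the PhysicsNeMo registry API.
REGISTRY = {}

def register(name):
    """Decorator that records a class under a string key."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap

@register("normalize")
class Normalize:
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

def instantiate(config):
    """Build a registered component from a config dict, as Hydra would."""
    kwargs = dict(config)
    cls = REGISTRY[kwargs.pop("name")]
    return cls(**kwargs)

# The config dict could come from a version-controlled YAML file at runtime.
transform = instantiate({"name": "normalize", "mean": 0.0, "std": 1.0})
```

Because pipelines are built from such config dicts, swapping a transform becomes a configuration change rather than a source-code change.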

Data flows through a PhysicsNeMo datapipe in a consistent path:

1. A ``reader`` will bring the data from storage to CPU memory.
2. An optional series of one or more transformations applies on-the-fly
manipulations to each instance of data.
3. Several instances of data will be collated into a batch (customizable,
just like in PyTorch).
4. The batched data is ready for use in a model.
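The four steps above can be sketched end to end with toy pure-Python stand-ins (these functions are illustrative, not the PhysicsNeMo ``Reader``, ``Transform``, or collator classes):

```python
# End-to-end sketch of the read -> transform -> collate flow.
# Toy stand-ins for illustration; not the PhysicsNeMo classes.

def read(index, storage):
    """Step 1: bring one sample from 'storage' into memory."""
    return dict(storage[index])

def transform(sample):
    """Step 2: per-sample, on-the-fly manipulation (here: doubling)."""
    sample["pressure"] = [p * 2.0 for p in sample["pressure"]]
    return sample

def collate(samples):
    """Step 3: merge several samples into one batch, field by field."""
    return {key: [s[key] for s in samples] for key in samples[0]}

storage = [{"pressure": [1.0, 2.0]}, {"pressure": [3.0, 4.0]}]
batch = collate([transform(read(i, storage)) for i in range(len(storage))])
# Step 4: 'batch' is now ready for use in a model.
```

In the real pipeline the reader runs on the CPU while transforms run on the GPU, and the stages are isolated so each can be swapped independently.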

At the highest level, ``physicsnemo.datapipes.DataLoader`` has a similar API and
model as ``torch.utils.data.DataLoader``, which enables a drop-in replacement in many
cases. However, PhysicsNeMo has a very different computation orchestration.

Quick Start
-----------
@@ -92,8 +94,8 @@ Quick Start
predictions = model(batch["pressure"], batch["coordinates"])


The best place to see the PhysicsNeMo datapipes in action, and to learn how
they work and how to use them, is to start with the examples located in the
`examples directory <https://github.com/NVIDIA/physicsnemo/tree/main/examples/minimal/datapipes>`_.


@@ -153,35 +155,59 @@
the ``Dataset`` is responsible for the threaded execution of ``Reader``s and
:members:
:show-inheritance:

MultiDataset
^^^^^^^^^^^^

The ``MultiDataset`` combines two or more ``Dataset`` instances behind a single
index space (concatenation). Each sub-dataset can have its own Reader and
transforms. Global indices are mapped to the owning sub-dataset and local index;
metadata is enriched with ``dataset_index`` so that batches can identify the source.
Use ``MultiDataset`` when you want to train on multiple datasets with the same
DataLoader and, optionally, enforce that all outputs share the same TensorDict keys
for collation. Refer to :const:`physicsnemo.datapipes.multi_dataset.DATASET_INDEX_METADATA_KEY`
for the metadata key added to each sample.

To properly collate and stack outputs from different datasets, you
can set ``output_strict=True`` in the constructor of a ``MultiDataset``. After
construction, it will load the first batch from every passed dataset and test
that the TensorDict produced by the ``Reader`` and ``Transform`` pipeline has
consistent keys. Because the exact collation details differ by dataset, the
``MultiDataset`` checks nothing beyond output-key consistency.
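The global-to-local index mapping described above can be sketched in a few lines. This is an illustrative restatement of the idea, not the actual ``MultiDataset`` implementation; the function name ``locate`` is hypothetical:

```python
import bisect

# Sketch of mapping a global index onto (dataset_index, local_index)
# across concatenated datasets. Illustrative; not the actual code.
def locate(global_index, lengths):
    # Cumulative end offsets, e.g. lengths [3, 5] -> ends [3, 8].
    ends = []
    total = 0
    for n in lengths:
        total += n
        ends.append(total)
    if not 0 <= global_index < total:
        raise IndexError(global_index)
    # The first cumulative end strictly greater than the index
    # identifies the owning sub-dataset.
    dataset_index = bisect.bisect_right(ends, global_index)
    start = ends[dataset_index - 1] if dataset_index else 0
    return dataset_index, global_index - start
```

For lengths ``[3, 5]``, global indices 0 to 2 resolve to the first dataset and indices 3 to 7 to the second; the resolved ``dataset_index`` is what would be recorded in the sample's metadata.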

.. autoclass:: physicsnemo.datapipes.multi_dataset.MultiDataset
:members:
:show-inheritance:


Readers
^^^^^^^

Readers are the data-ingestion layer. Each one loads individual samples from a
specific storage format (HDF5, Zarr, NumPy, VTK) and returns CPU tensors
in a uniform dict interface. Refer to :doc:`physicsnemo.datapipes.readers` for the
base class API and all built-in readers.
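A reader fitting this contract, serving one sample at a time as a uniform dict, can be sketched as follows. The class and field names here are hypothetical; refer to the linked page for the real base class:

```python
# Sketch of a reader that serves samples from an in-memory store and
# returns a uniform dict per sample. Hypothetical; not the real base class.
class ListReader:
    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        # A real reader would load from HDF5, Zarr, NumPy, or VTK here
        # and return CPU tensors rather than Python lists.
        record = self.records[index]
        return {"coordinates": record["xyz"], "pressure": record["p"]}

reader = ListReader([{"xyz": [0.0, 0.0, 0.0], "p": [101.3]}])
sample = reader[0]
```

Keeping the output interface uniform is what lets downstream transforms and collators stay independent of the storage format.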

Transforms
^^^^^^^^^^

Transforms are composable, device-agnostic operations applied to each sample
after it is loaded and transferred to the target device. The ``Compose``
container chains multiple transforms into a single callable. Refer to
:doc:`physicsnemo.datapipes.transforms` for the base class API, ``Compose``,
and all built-in transforms.
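The chaining behavior of ``Compose`` can be sketched as below. This is an illustrative reimplementation, not the library's own class, and the two sample transforms are hypothetical:

```python
# Sketch of Compose: chain transforms into a single callable.
# Illustrative only; refer to the transforms page for the real API.
class Compose:
    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        # Apply each transform in order, feeding its output to the next.
        for transform in self.transforms:
            sample = transform(sample)
        return sample

def shift(sample):
    return {k: [v + 1.0 for v in vals] for k, vals in sample.items()}

def scale(sample):
    return {k: [v * 2.0 for v in vals] for k, vals in sample.items()}

pipeline = Compose([shift, scale])
out = pipeline({"pressure": [1.0, 2.0]})  # shift first, then scale
```

Order matters: ``Compose([shift, scale])`` and ``Compose([scale, shift])`` generally produce different results.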

Collation
^^^^^^^^^

Combining a set of TensorDict objects into a batch of data can, at times,
require special care. For example, collating graph datasets for Graph Neural
Networks requires different merging of batches than concatenation along a batch
dimension. For this reason, PhysicsNeMo datapipes offers custom collation functions
as well as an interface to write your own collator. If the dataset you are
trying to collate cannot be accommodated here, open an issue on GitHub.

For an example of a custom collation function that produces a batch of PyG graph data,
refer to the datapipes examples on GitHub.
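A simple custom collator can be sketched as below; plain dicts stand in for TensorDicts and nested lists for stacked tensors, so this is illustrative rather than the library's collator:

```python
# Sketch of a custom collator. Plain dicts stand in for TensorDict and
# nested lists for stacked tensors. Illustrative only.
def dict_collate(samples):
    keys = set(samples[0])
    if any(set(s) != keys for s in samples):
        raise ValueError("samples must share the same keys to collate")
    # Stack each field along a new leading batch dimension.
    return {key: [s[key] for s in samples] for key in keys}

batch = dict_collate([
    {"pressure": [1.0], "coordinates": [0.0, 0.0]},
    {"pressure": [2.0], "coordinates": [1.0, 1.0]},
])
```

A graph collator would instead merge node and edge tensors and offset edge indices, which is exactly why collation is a customization point rather than hard-coded stacking.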

.. autoclass:: physicsnemo.datapipes.collate.Collator
:members:
2 changes: 2 additions & 0 deletions physicsnemo/datapipes/__init__.py
@@ -40,6 +40,7 @@
)
from physicsnemo.datapipes.dataloader import DataLoader
from physicsnemo.datapipes.dataset import Dataset
from physicsnemo.datapipes.multi_dataset import MultiDataset
from physicsnemo.datapipes.readers import (
HDF5Reader,
NumpyReader,
@@ -84,6 +85,7 @@
"TensorDict", # Re-export from tensordict
"Dataset",
"DataLoader",
"MultiDataset",
# Transforms - Base
"Transform",
"Compose",
4 changes: 3 additions & 1 deletion physicsnemo/datapipes/dataloader.py
@@ -168,7 +168,9 @@ def __len__(self) -> int:
int
Number of batches in the dataloader.
"""
n_samples = (
len(self.sampler) if hasattr(self.sampler, "__len__") else len(self.dataset)
)
if self.drop_last:
return n_samples // self.batch_size
return (n_samples + self.batch_size - 1) // self.batch_size
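The batch-count arithmetic in ``__len__`` can be checked in isolation. This is a stand-alone restatement of the logic above, not the class itself:

```python
# Stand-alone restatement of the __len__ batch-count arithmetic.
def num_batches(n_samples, batch_size, drop_last):
    if drop_last:
        # A partial final batch is dropped.
        return n_samples // batch_size
    # Ceiling division: the partial final batch still counts.
    return (n_samples + batch_size - 1) // batch_size

# 10 samples with batch size 4: two full batches, plus one partial
# batch of 2 when drop_last is False.
```

The PR's change makes ``n_samples`` come from the sampler when it defines ``__len__``, so subset or distributed samplers report the correct batch count instead of the full dataset's.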