72 changes: 49 additions & 23 deletions docs/api/datapipes/physicsnemo.datapipes.rst
@@ -4,21 +4,23 @@ PhysicsNeMo Datapipes
.. automodule:: physicsnemo.datapipes
.. currentmodule:: physicsnemo.datapipes

The PhysicsNeMo datapipes consist of two separate components.

Prior to version 2.0 of PhysicsNeMo, each datapipe was largely
independent of all others, targeted at very specific datasets and applications,
and broadly not extensible. Those datapipes, preserved in v2.0 for compatibility,
are described in the climate, cae, gnn, and
benchmark subsections.

In PhysicsNeMo v2.0, the datapipes API has been redesigned from scratch to focus
on key factors to enable scientific machine learning training and inference.
This document describes the architecture and design philosophy.

Refer to the examples of PhysicsNeMo for runnable datapipe tutorials to
get started.


Datapipes Philosophy
--------------------

The PhysicsNeMo datapipe structure is built on several key design decisions
@@ -27,27 +29,27 @@
that are specifically made to enable diverse scientific machine learning datasets:
- GPU First: data preprocessing is done on the GPU, not the CPU.
- Isolation of roles: reading data is separate from transforming data, which is
separate from pipelining data for training, which is separate from threading
and stream management. Changing data sources, or preprocessing pipelines,
should require no intervention in other areas.
- Composability and Extensibility: We aim to provide a toolkit and examples that
let you easily build what you need yourself if it's not already here.
- Datapipes as configuration: Changing a pipeline shouldn't require source code
modification; the registry system in PhysicsNeMo datapipes enables Hydra instantiation
of datapipes at runtime for version-controlled, runtime-configured datapipes.
You can also register and instantiate custom components.
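The registry pattern described above can be sketched in plain Python. This is a minimal, illustrative stand-in, not the actual PhysicsNeMo registry API; all names here (``register``, ``instantiate``, ``Normalize``) are hypothetical:

```python
# Minimal sketch of a component registry for config-driven instantiation.
# Illustrative only; not the PhysicsNeMo registry API.
REGISTRY = {}

def register(name):
    """Decorator that records a class under a string key."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap

@register("normalize")
class Normalize:
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

def instantiate(config):
    """Build a registered component from a config dict, as Hydra would."""
    kwargs = dict(config)
    cls = REGISTRY[kwargs.pop("name")]
    return cls(**kwargs)

# The config dict could come from a version-controlled YAML file at runtime.
transform = instantiate({"name": "normalize", "mean": 0.0, "std": 1.0})
```

Because pipelines are built from such config dicts, swapping a transform becomes a configuration change rather than a source-code change.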

Data flows through a PhysicsNeMo datapipe in a consistent path:

1. A ``reader`` will bring the data from storage to CPU memory.
2. An optional series of one or more transformations applies on-the-fly
manipulations to each instance of data.
3. Several instances of data will be collated into a batch (customizable,
just like in PyTorch).
4. The batched data is ready for use in a model.
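The four steps above can be sketched end to end with toy pure-Python stand-ins (these functions are illustrative, not the PhysicsNeMo ``Reader``, ``Transform``, or collator classes):

```python
# End-to-end sketch of the read -> transform -> collate flow.
# Toy stand-ins for illustration; not the PhysicsNeMo classes.

def read(index, storage):
    """Step 1: bring one sample from 'storage' into memory."""
    return dict(storage[index])

def transform(sample):
    """Step 2: per-sample, on-the-fly manipulation (here: doubling)."""
    sample["pressure"] = [p * 2.0 for p in sample["pressure"]]
    return sample

def collate(samples):
    """Step 3: merge several samples into one batch, field by field."""
    return {key: [s[key] for s in samples] for key in samples[0]}

storage = [{"pressure": [1.0, 2.0]}, {"pressure": [3.0, 4.0]}]
batch = collate([transform(read(i, storage)) for i in range(len(storage))])
# Step 4: 'batch' is now ready for use in a model.
```

In the real pipeline the reader runs on the CPU while transforms run on the GPU, and the stages are isolated so each can be swapped independently.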

At the highest level, ``physicsnemo.datapipes.DataLoader`` has a similar API and
model as ``torch.utils.data.DataLoader``, which enables a drop-in replacement in many
cases. However, PhysicsNeMo has a very different computation orchestration.

Quick Start
-----------
@@ -92,8 +94,8 @@ Quick Start
predictions = model(batch["pressure"], batch["coordinates"])


The best place to see the PhysicsNeMo datapipes in action, and to learn how
they work and how to use them, is to start with the examples located in the
`examples directory <https://github.com/NVIDIA/physicsnemo/tree/main/examples/minimal/datapipes>`_.


@@ -153,35 +155,59 @@
the ``Dataset`` is responsible for the threaded execution of ``Reader``s and
:members:
:show-inheritance:

MultiDataset
^^^^^^^^^^^^

The ``MultiDataset`` combines two or more ``Dataset`` instances behind a single
index space (concatenation). Each sub-dataset can have its own Reader and
transforms. Global indices are mapped to the owning sub-dataset and local index;
metadata is enriched with ``dataset_index`` so that batches can identify the source.
Use ``MultiDataset`` when you want to train on multiple datasets with the same
DataLoader and, optionally, enforce that all outputs share the same TensorDict keys
for collation. Refer to :const:`physicsnemo.datapipes.multi_dataset.DATASET_INDEX_METADATA_KEY`
for the metadata key added to each sample.

To properly collate and stack outputs from different datasets, you
can set ``output_strict=True`` in the constructor of a ``MultiDataset``. After
construction, it will load the first batch from every passed dataset and test
that the TensorDict produced by the ``Reader`` and ``Transform`` pipeline has
consistent keys. Because the exact collation details differ by dataset, the
``MultiDataset`` checks nothing beyond output-key consistency.
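The global-to-local index mapping described above can be sketched in a few lines. This is an illustrative restatement of the idea, not the actual ``MultiDataset`` implementation; the function name ``locate`` is hypothetical:

```python
import bisect

# Sketch of mapping a global index onto (dataset_index, local_index)
# across concatenated datasets. Illustrative; not the actual code.
def locate(global_index, lengths):
    # Cumulative end offsets, e.g. lengths [3, 5] -> ends [3, 8].
    ends = []
    total = 0
    for n in lengths:
        total += n
        ends.append(total)
    if not 0 <= global_index < total:
        raise IndexError(global_index)
    # The first cumulative end strictly greater than the index
    # identifies the owning sub-dataset.
    dataset_index = bisect.bisect_right(ends, global_index)
    start = ends[dataset_index - 1] if dataset_index else 0
    return dataset_index, global_index - start
```

For lengths ``[3, 5]``, global indices 0 to 2 resolve to the first dataset and indices 3 to 7 to the second; the resolved ``dataset_index`` is what would be recorded in the sample's metadata.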

.. autoclass:: physicsnemo.datapipes.multi_dataset.MultiDataset
:members:
:show-inheritance:


Readers
^^^^^^^

Readers are the data-ingestion layer. Each one loads individual samples from a
specific storage format (HDF5, Zarr, NumPy, VTK) and returns CPU tensors
in a uniform dict interface. Refer to :doc:`physicsnemo.datapipes.readers` for the
base class API and all built-in readers.
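A reader fitting this contract, serving one sample at a time as a uniform dict, can be sketched as follows. The class and field names here are hypothetical; refer to the linked page for the real base class:

```python
# Sketch of a reader that serves samples from an in-memory store and
# returns a uniform dict per sample. Hypothetical; not the real base class.
class ListReader:
    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        # A real reader would load from HDF5, Zarr, NumPy, or VTK here
        # and return CPU tensors rather than Python lists.
        record = self.records[index]
        return {"coordinates": record["xyz"], "pressure": record["p"]}

reader = ListReader([{"xyz": [0.0, 0.0, 0.0], "p": [101.3]}])
sample = reader[0]
```

Keeping the output interface uniform is what lets downstream transforms and collators stay independent of the storage format.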

Transforms
^^^^^^^^^^

Transforms are composable, device-agnostic operations applied to each sample
after it is loaded and transferred to the target device. The ``Compose``
container chains multiple transforms into a single callable. Refer to
:doc:`physicsnemo.datapipes.transforms` for the base class API, ``Compose``,
and all built-in transforms.
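The chaining behavior of ``Compose`` can be sketched as below. This is an illustrative reimplementation, not the library's own class, and the two sample transforms are hypothetical:

```python
# Sketch of Compose: chain transforms into a single callable.
# Illustrative only; refer to the transforms page for the real API.
class Compose:
    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        # Apply each transform in order, feeding its output to the next.
        for transform in self.transforms:
            sample = transform(sample)
        return sample

def shift(sample):
    return {k: [v + 1.0 for v in vals] for k, vals in sample.items()}

def scale(sample):
    return {k: [v * 2.0 for v in vals] for k, vals in sample.items()}

pipeline = Compose([shift, scale])
out = pipeline({"pressure": [1.0, 2.0]})  # shift first, then scale
```

Order matters: ``Compose([shift, scale])`` and ``Compose([scale, shift])`` generally produce different results.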

Collation
^^^^^^^^^

Combining a set of TensorDict objects into a batch of data can, at times,
require special care. For example, collating graph datasets for Graph Neural
Networks requires different merging of batches than concatenation along a batch
dimension. For this reason, PhysicsNeMo datapipes offers custom collation functions
as well as an interface to write your own collator. If the dataset you are
trying to collate cannot be accommodated here, open an issue on GitHub.

For an example of a custom collation function that produces a batch of PyG graph data,
refer to the datapipes examples on GitHub.
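A simple custom collator can be sketched as below; plain dicts stand in for TensorDicts and nested lists for stacked tensors, so this is illustrative rather than the library's collator:

```python
# Sketch of a custom collator. Plain dicts stand in for TensorDict and
# nested lists for stacked tensors. Illustrative only.
def dict_collate(samples):
    keys = set(samples[0])
    if any(set(s) != keys for s in samples):
        raise ValueError("samples must share the same keys to collate")
    # Stack each field along a new leading batch dimension.
    return {key: [s[key] for s in samples] for key in keys}

batch = dict_collate([
    {"pressure": [1.0], "coordinates": [0.0, 0.0]},
    {"pressure": [2.0], "coordinates": [1.0, 1.0]},
])
```

A graph collator would instead merge node and edge tensors and offset edge indices, which is exactly why collation is a customization point rather than hard-coded stacking.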

.. autoclass:: physicsnemo.datapipes.collate.Collator
:members:
2 changes: 2 additions & 0 deletions physicsnemo/datapipes/__init__.py
@@ -40,6 +40,7 @@
)
from physicsnemo.datapipes.dataloader import DataLoader
from physicsnemo.datapipes.dataset import Dataset
from physicsnemo.datapipes.multi_dataset import MultiDataset
from physicsnemo.datapipes.readers import (
HDF5Reader,
NumpyReader,
@@ -84,6 +85,7 @@
"TensorDict", # Re-export from tensordict
"Dataset",
"DataLoader",
"MultiDataset",
# Transforms - Base
"Transform",
"Compose",
4 changes: 3 additions & 1 deletion physicsnemo/datapipes/dataloader.py
@@ -168,7 +168,9 @@ def __len__(self) -> int:
int
Number of batches in the dataloader.
"""
n_samples = (
len(self.sampler) if hasattr(self.sampler, "__len__") else len(self.dataset)
)
if self.drop_last:
return n_samples // self.batch_size
return (n_samples + self.batch_size - 1) // self.batch_size
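The batch-count arithmetic in ``__len__`` can be checked in isolation. This is a stand-alone restatement of the logic above, not the class itself:

```python
# Stand-alone restatement of the __len__ batch-count arithmetic.
def num_batches(n_samples, batch_size, drop_last):
    if drop_last:
        # A partial final batch is dropped.
        return n_samples // batch_size
    # Ceiling division: the partial final batch still counts.
    return (n_samples + batch_size - 1) // batch_size

# 10 samples with batch size 4: two full batches, plus one partial
# batch of 2 when drop_last is False.
```

The PR's change makes ``n_samples`` come from the sampler when it defines ``__len__``, so subset or distributed samplers report the correct batch count instead of the full dataset's.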