
[stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly#1505

Merged
coreyjadams merged 5 commits into NVIDIA:main from coreyjadams:multi-dataset-clean on Mar 18, 2026
Conversation

@coreyjadams
Collaborator

PhysicsNeMo Pull Request

This PR adds multiple datasets to the PhysicsNeMo datapipe scheme.

Datasets are added dynamically at creation and viewed as one logical dataset. If a sampler is passed for random access, the length is tracked via the sampler, not the datasets; this lets you squish a bunch of data together and then randomly take 80/20 splits of the whole thing, for example.

Transforms are per dataset, not uniform, which deliberately lets you have unique pipelines for unique data. Outputs can come out all funky at the end if you have batch size 1 and don't need to collate data. If you do want everything happy and collatable, there is a helper debug setting in the MultiDataset class that takes the first item from every dataset at construction, runs the pipeline, and at least verifies the keys match. Shapes can't be aligned without a collation function, and since that can vary per dataset, the check has to be deferred. But it's something.
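To make the composition concrete, here is a minimal, hypothetical sketch of the idea described above (a single index space over several datasets, per-dataset transforms, and an 80/20 split over the combined indices). `MultiDatasetSketch` is an illustration only, not the actual PhysicsNeMo implementation:

```python
import random

# Hypothetical sketch of the composition described above; names and
# structure are assumptions, not the real MultiDataset class.
class MultiDatasetSketch:
    def __init__(self, *datasets, transforms=None):
        self.datasets = datasets
        # One transform pipeline per dataset; they may all differ.
        self.transforms = transforms or [None] * len(datasets)
        # Cumulative offsets map a global index to (dataset, local index).
        self.offsets, total = [], 0
        for ds in datasets:
            self.offsets.append(total)
            total += len(ds)
        self.total = total

    def __len__(self):
        return self.total

    def __getitem__(self, idx):
        if idx < 0:
            idx += self.total
        if not 0 <= idx < self.total:
            raise IndexError(idx)
        # Linear scan is fine for a handful of datasets.
        for ds_index in reversed(range(len(self.datasets))):
            if idx >= self.offsets[ds_index]:
                sample = self.datasets[ds_index][idx - self.offsets[ds_index]]
                t = self.transforms[ds_index]
                return t(sample) if t else sample

# Random 80/20 split over the *combined* index space, as described above.
combined = MultiDatasetSketch(list(range(10)), list(range(100, 106)))
indices = list(range(len(combined)))
random.seed(0)
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
```

A sampler built from `train_idx` or `test_idx` would then drive random access, so the split crosses dataset boundaries transparently.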

There are more PRs coming in this stack, but it's still difficult to pile them on top with the tools I'm using. Next up:

  • some transform work to clean up the pipelines for an example.
  • A new example in the Darcy flow space enabling the Transolver example with multiple datasets at once.

There is a parallel PR for GeoTransolver to enable it for 2D data. With that in, we can run GeoTransolver on Darcy with multiple datasets too.

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@greptile-apps
Contributor

greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR introduces MultiDataset, a new class that composes multiple Dataset instances behind a single concatenated index space, enabling training on heterogeneous data sources with a single DataLoader. Each sub-dataset retains its own Reader and transforms pipeline, and samples are enriched with a dataset_index metadata key for source identification. The DataLoader.__len__ method is updated to prefer sampler length over dataset length, supporting custom samplers for train/test splits over the combined dataset.

  • New MultiDataset class (physicsnemo/datapipes/multi_dataset.py): Well-designed with proper index mapping, prefetch delegation, strict output validation, context manager support, and comprehensive docstrings.
  • DataLoader.__len__ fix (physicsnemo/datapipes/dataloader.py): Now uses sampler length when available, enabling custom samplers to control effective dataset size — important for train/test splits over MultiDataset.
  • Type hint gap: DataLoader.__init__ declares dataset: Dataset, but MultiDataset is not a subclass of Dataset. This works at runtime via duck typing but will produce false-positive type errors for users with static type checkers. Consider a Protocol or Union type.
  • Comprehensive tests: 17 tests covering basic functionality, strict validation, prefetching, error handling, DataLoader integration, and string representation.
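The `DataLoader.__len__` change summarized above can be sketched roughly as follows; `LoaderSketch` is a hypothetical stand-in for illustration, not the actual PhysicsNeMo DataLoader:

```python
# Simplified sketch (an assumption, not the real implementation) of the
# behavior described above: prefer the sampler's length when one is set,
# so a custom sampler controls the effective dataset size.
class LoaderSketch:
    def __init__(self, dataset, sampler=None):
        self.dataset = dataset
        self.sampler = sampler

    def __len__(self):
        # A sampler (e.g. an 80% train split) shrinks the reported length.
        if self.sampler is not None:
            return len(self.sampler)
        return len(self.dataset)

data = list(range(100))
full_loader = LoaderSketch(data)                        # no sampler
split_loader = LoaderSketch(data, sampler=range(80))    # 80% split sampler
```

This is what lets a train/test split over a combined MultiDataset report the split length rather than the full concatenated length.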

Important Files Changed

Filename Overview
physicsnemo/datapipes/multi_dataset.py New MultiDataset class composing multiple Datasets behind a single index space. Well-structured with proper delegation, validation, context manager support, and comprehensive docstrings. Minor style note on linear index lookup.
physicsnemo/datapipes/dataloader.py Small change to __len__ to prefer sampler length over dataset length. Logic is correct and supports custom samplers for train/test splits. Type hint for dataset parameter doesn't include MultiDataset.
physicsnemo/datapipes/__init__.py Adds MultiDataset to the package's public imports and __all__. Clean, no issues.
docs/api/datapipes/physicsnemo.datapipes.rst Adds documentation section for MultiDataset with usage guidance and autoclass directive. Clear and well-written.
test/datapipes/core/test_multi_dataset.py Comprehensive test suite covering basic functionality, strict validation, prefetching, error cases, DataLoader integration, and repr. Good coverage of edge cases including negative indexing and out-of-range handling.

Comments Outside Diff (1)

  1. physicsnemo/datapipes/dataloader.py, line 81 (link)

    Type hint excludes MultiDataset

    DataLoader.__init__ declares dataset: Dataset, but MultiDataset does not inherit from Dataset. This means static type checkers (mypy, pyright) will flag errors when users pass a MultiDataset to DataLoader, even though it works at runtime via duck typing.

    Consider widening the type hint to a Protocol or Union type that captures the required interface (e.g., __len__, __getitem__, prefetch, cancel_prefetch), or make MultiDataset a formal subclass/mixin. As-is, the tests demonstrate the runtime compatibility, but consumers with strict type checking will get false-positive errors.

    Note: you would also need to add the import from physicsnemo.datapipes.multi_dataset import MultiDataset (or use from __future__ import annotations and a TYPE_CHECKING guard). A Protocol-based approach would be cleaner long-term.
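One possible shape for the Protocol suggested above; the exact method set is an assumption based on the interface named in the comment, not the actual PhysicsNeMo code:

```python
from typing import Any, Protocol, runtime_checkable

# Hypothetical structural type capturing the duck-typed interface the
# comment describes; method names are assumptions from the review text.
@runtime_checkable
class IndexableDataset(Protocol):
    def __len__(self) -> int: ...
    def __getitem__(self, index: int) -> Any: ...
    def prefetch(self, indices: Any) -> None: ...
    def cancel_prefetch(self) -> None: ...

# DataLoader.__init__ could then declare `dataset: IndexableDataset`,
# and both Dataset and MultiDataset would satisfy it structurally,
# with no import cycle between the two modules.
```

With `@runtime_checkable`, `isinstance` checks only verify that the methods exist, but static checkers (mypy, pyright) verify signatures structurally, which is the false-positive fix the comment asks for.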

Last reviewed commit: 916ea03

@coreyjadams coreyjadams changed the title [stacked] Adding most of the multi-dataset cleanly [stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly Mar 15, 2026
@coreyjadams
Collaborator Author

/blossom-ci

Collaborator

@laserkelvin laserkelvin left a comment


General questions, and depending on your answers I'll be ready to approve:

  • Is there a line of sight into how this will work with DDP and more? If that's coming in future PRs, please ignore.
  • How should users expect to do some kind of data balancing, e.g. making sure that each batch contains a similar composition from each dataset so as to not bias training dynamics?
  • output_strict optionally checks against fields, but is the intention solely for concatenation? The scenario in chemistry/materials is that in some datasets you have labels available for some data but not others, yet you might want to train with multiple heads on whatever data you have and just learn from the shared embedding. Naively, the equivalent in aero would be having drag force only for some datasets, with others having vector fields or something like that.
  • earth

Collaborator

@megnvidia megnvidia left a comment


Just validate that the text I added to complete the partial sentence is technically correct. (Also, sorry: I had time today, so I took the opportunity to edit the whole file.)

After that I can approve.

@coreyjadams
Collaborator Author

@laserkelvin thanks for taking a look!

  • Yes, there is a line of sight towards DDP. [stacked 3/3, multi-datapipes] darcy flow multi dataset #1507 includes this PR + a training example. I didn't build DDP into it, but can (and now will!). tl;dr: it's not hard; we use the usual machinery of distributed samplers to coordinate over ranks, and it works as usual. Datasets are exposed as "one" unit, so the sampler indexes into the unit as a whole.
  • Users should expect to balance batch composition via samplers, if they want that behavior. The idea here was that if you compose datasets into a multi-dataset, they become "one" and your sampling treats them as one. Perhaps you think we should support some sort of custom sampler? We run the risk of biasing the training of course in that case...
  • The intention is not solely concatenation. In fact, for many datasets, collation via concatenation is not correct, so I can't do anything more sophisticated than check "hey, does every dataset produce the same keys?" It is not strictly mandatory, though; you could happily have a different output from each dataset if you are not collating (i.e., batch size 1).
  • yes
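The distributed-sampler idea described above ("datasets exposed as one unit, sharded across ranks") can be sketched in plain Python; `shard_indices` is a hypothetical helper illustrating the strided sharding a DistributedSampler-style component performs, not PhysicsNeMo code:

```python
import random

# Hypothetical sketch: each rank takes a strided, disjoint slice of the
# combined index space, so the composed datasets are sharded as "one" unit.
def shard_indices(total, num_replicas, rank, seed=0):
    indices = list(range(total))
    # Seeded identically on every rank so all ranks agree on the shuffle.
    random.Random(seed).shuffle(indices)
    # Strided slice: rank r gets indices r, r + num_replicas, ...
    return indices[rank::num_replicas]

# Two "ranks" sharding a 16-sample combined dataset.
shards = [shard_indices(16, num_replicas=2, rank=r) for r in range(2)]
```

Because the sharding happens over the combined index space, each rank transparently draws from all sub-datasets.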

@laserkelvin
Collaborator

  • Users should expect to balance batch composition via samplers, if they want that behavior. The idea here was that if you compose datasets into a multi-dataset, they become "one" and your sampling treats them as one. Perhaps you think we should support some sort of custom sampler? We run the risk of biasing the training of course in that case...

Yeah, nah, I don't think we necessarily need to provide the custom sampler; that will depend a lot on the dataset compositions, and we generally can't know those a priori, I don't think. But I think it is important to highlight that aspect, maybe in an example when it comes to it. I imagine there are going to be a lot of imbalanced datasets that, if naively just shuffled, will end up yielding training signals skewed towards the bigger datasets. As long as there's a path forward for that, I'm happy.
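One common way to address the imbalance concern above is inverse-size weighting, so each sub-dataset contributes equal probability mass to a weighted sampler. `balanced_weights` is a hypothetical illustration, assuming a WeightedRandomSampler-style draw:

```python
# Hypothetical sketch: weight each sample by 1/len(its dataset), so every
# dataset contributes equal total mass regardless of its size.
def balanced_weights(dataset_lengths):
    weights = []
    for n in dataset_lengths:
        # Each of the n samples in this dataset gets weight 1/n,
        # so the dataset's total mass is exactly 1.0.
        weights.extend([1.0 / n] * n)
    return weights

# A 1000-sample dataset and a 10-sample dataset, combined: without
# weighting, shuffled draws would hit the big dataset ~99% of the time.
w = balanced_weights([1000, 10])
```

Feeding `w` to a weighted sampler over the combined index space would then draw from both datasets with equal frequency, at the cost of re-sampling the small one.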

Another thing came to mind as well: how do you envision dealing with different splits?

Collaborator

@laserkelvin laserkelvin left a comment


I'll approve assuming you want to make that one stylistic comment I had

@coreyjadams
Collaborator Author

/blossom-ci

@coreyjadams coreyjadams enabled auto-merge March 18, 2026 16:38
@coreyjadams coreyjadams added this pull request to the merge queue Mar 18, 2026
Merged via the queue into NVIDIA:main with commit 2b677e3 Mar 18, 2026
4 checks passed
coreyjadams added a commit to coreyjadams/physicsnemo that referenced this pull request Mar 19, 2026
[stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly (NVIDIA#1505)

* Adding most of the multi-dataset cleanly

* Refine documentation for PhysicsNeMo Datapipes

Updated the documentation for PhysicsNeMo Datapipes to improve clarity and consistency. Adjusted wording and structure for better readability.

* update api docs.

* Update multi-dataset interface to accept an unpacked tuple instead of a list, etc.

---------

Co-authored-by: megnvidia <mmiranda@nvidia.com>
nbren12 pushed a commit to nbren12/modulus that referenced this pull request Mar 24, 2026
(NVIDIA#1505; same commit message as above)

4 participants