
[stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly#1505

Merged
coreyjadams merged 5 commits into NVIDIA:main from coreyjadams:multi-dataset-clean on Mar 18, 2026
Conversation

@coreyjadams
Collaborator

PhysicsNeMo Pull Request

This PR adds multiple datasets to the PhysicsNeMo datapipe scheme.

Datasets are added dynamically at creation and viewed as one logical dataset. If a sampler is passed for random access, the length is tracked via the sampler, not the datasets; this lets you squish a bunch of data together and then randomly take 80/20 splits of the whole thing, for example.

Transforms are per dataset, not uniform, which deliberately lets you have unique pipelines for unique data. Outputs can come out all funky at the end if you have batch size 1 and don't need to collate data. If you do want everything happy and collatable, there is a helper debug setting in the MultiDataset class that takes the first item from every dataset at construction, runs the pipeline, and at least verifies the keys match. Shapes can't be aligned without a collation function, and since that can vary per dataset, the check has to be deferred. But it's something.
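To make the composition concrete, here is a minimal, hypothetical sketch of the idea described above (a single index space over several datasets, per-dataset transforms, and an 80/20 split over the combined indices). `MultiDatasetSketch` is an illustration only, not the actual PhysicsNeMo implementation:

```python
import random

# Hypothetical sketch of the composition described above; names and
# structure are assumptions, not the real MultiDataset class.
class MultiDatasetSketch:
    def __init__(self, *datasets, transforms=None):
        self.datasets = datasets
        # One transform pipeline per dataset; they may all differ.
        self.transforms = transforms or [None] * len(datasets)
        # Cumulative offsets map a global index to (dataset, local index).
        self.offsets, total = [], 0
        for ds in datasets:
            self.offsets.append(total)
            total += len(ds)
        self.total = total

    def __len__(self):
        return self.total

    def __getitem__(self, idx):
        if idx < 0:
            idx += self.total
        if not 0 <= idx < self.total:
            raise IndexError(idx)
        # Linear scan is fine for a handful of datasets.
        for ds_index in reversed(range(len(self.datasets))):
            if idx >= self.offsets[ds_index]:
                sample = self.datasets[ds_index][idx - self.offsets[ds_index]]
                t = self.transforms[ds_index]
                return t(sample) if t else sample

# Random 80/20 split over the *combined* index space, as described above.
combined = MultiDatasetSketch(list(range(10)), list(range(100, 106)))
indices = list(range(len(combined)))
random.seed(0)
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
```

A sampler built from `train_idx` or `test_idx` would then drive random access, so the split crosses dataset boundaries transparently.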

There are more PRs coming in this stack, but it's still difficult to pile them on top with the tools I'm using. Next up:

  • some transform work to clean up the pipelines for an example.
  • A new example in the Darcy flow space enabling the Transolver example with multiple datasets at once.

There is a parallel PR for GeoTransolver to enable it for 2D data. With that in, we can run GeoTransolver on Darcy with multiple datasets too.

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@greptile-apps
Contributor

greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR introduces MultiDataset, a new class that composes multiple Dataset instances behind a single concatenated index space, enabling training on heterogeneous data sources with a single DataLoader. Each sub-dataset retains its own Reader and transforms pipeline, and samples are enriched with a dataset_index metadata key for source identification. The DataLoader.__len__ method is updated to prefer sampler length over dataset length, supporting custom samplers for train/test splits over the combined dataset.

  • New MultiDataset class (physicsnemo/datapipes/multi_dataset.py): Well-designed with proper index mapping, prefetch delegation, strict output validation, context manager support, and comprehensive docstrings.
  • DataLoader.__len__ fix (physicsnemo/datapipes/dataloader.py): Now uses sampler length when available, enabling custom samplers to control effective dataset size — important for train/test splits over MultiDataset.
  • Type hint gap: DataLoader.__init__ declares dataset: Dataset, but MultiDataset is not a subclass of Dataset. This works at runtime via duck typing but will produce false-positive type errors for users with static type checkers. Consider a Protocol or Union type.
  • Comprehensive tests: 17 tests covering basic functionality, strict validation, prefetching, error handling, DataLoader integration, and string representation.
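The `DataLoader.__len__` change summarized above can be sketched roughly as follows; `LoaderSketch` is a hypothetical stand-in for illustration, not the actual PhysicsNeMo DataLoader:

```python
# Simplified sketch (an assumption, not the real implementation) of the
# behavior described above: prefer the sampler's length when one is set,
# so a custom sampler controls the effective dataset size.
class LoaderSketch:
    def __init__(self, dataset, sampler=None):
        self.dataset = dataset
        self.sampler = sampler

    def __len__(self):
        # A sampler (e.g. an 80% train split) shrinks the reported length.
        if self.sampler is not None:
            return len(self.sampler)
        return len(self.dataset)

data = list(range(100))
full_loader = LoaderSketch(data)                        # no sampler
split_loader = LoaderSketch(data, sampler=range(80))    # 80% split sampler
```

This is what lets a train/test split over a combined MultiDataset report the split length rather than the full concatenated length.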

Important Files Changed

Filename Overview
physicsnemo/datapipes/multi_dataset.py New MultiDataset class composing multiple Datasets behind a single index space. Well-structured with proper delegation, validation, context manager support, and comprehensive docstrings. Minor style note on linear index lookup.
physicsnemo/datapipes/dataloader.py Small change to __len__ to prefer sampler length over dataset length. Logic is correct and supports custom samplers for train/test splits. Type hint for dataset parameter doesn't include MultiDataset.
physicsnemo/datapipes/__init__.py Adds MultiDataset to the package's public imports and __all__. Clean, no issues.
docs/api/datapipes/physicsnemo.datapipes.rst Adds documentation section for MultiDataset with usage guidance and autoclass directive. Clear and well-written.
test/datapipes/core/test_multi_dataset.py Comprehensive test suite covering basic functionality, strict validation, prefetching, error cases, DataLoader integration, and repr. Good coverage of edge cases including negative indexing and out-of-range handling.

Comments Outside Diff (1)

  1. physicsnemo/datapipes/dataloader.py, line 81 (link)

    Type hint excludes MultiDataset

    DataLoader.__init__ declares dataset: Dataset, but MultiDataset does not inherit from Dataset. This means static type checkers (mypy, pyright) will flag errors when users pass a MultiDataset to DataLoader, even though it works at runtime via duck typing.

    Consider widening the type hint to a Protocol or Union type that captures the required interface (e.g., __len__, __getitem__, prefetch, cancel_prefetch), or make MultiDataset a formal subclass/mixin. As-is, the tests demonstrate the runtime compatibility, but consumers with strict type checking will get false-positive errors.

    Note: you would also need to add the import from physicsnemo.datapipes.multi_dataset import MultiDataset (or use from __future__ import annotations and a TYPE_CHECKING guard). A Protocol-based approach would be cleaner long-term.
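One possible shape for the Protocol suggested above; the exact method set is an assumption based on the interface named in the comment, not the actual PhysicsNeMo code:

```python
from typing import Any, Protocol, runtime_checkable

# Hypothetical structural type capturing the duck-typed interface the
# comment describes; method names are assumptions from the review text.
@runtime_checkable
class IndexableDataset(Protocol):
    def __len__(self) -> int: ...
    def __getitem__(self, index: int) -> Any: ...
    def prefetch(self, indices: Any) -> None: ...
    def cancel_prefetch(self) -> None: ...

# DataLoader.__init__ could then declare `dataset: IndexableDataset`,
# and both Dataset and MultiDataset would satisfy it structurally,
# with no import cycle between the two modules.
```

With `@runtime_checkable`, `isinstance` checks only verify that the methods exist, but static checkers (mypy, pyright) verify signatures structurally, which is the false-positive fix the comment asks for.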

Last reviewed commit: 916ea03

@coreyjadams coreyjadams changed the title [stacked] Adding most of the multi-dataset cleanly [stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly Mar 15, 2026
@coreyjadams
Collaborator Author

/blossom-ci

Collaborator

@laserkelvin laserkelvin left a comment


General questions, and depending on your answers I'll be ready to approve:

  • Is there a line of sight into how this will work with DDP and more? If that's coming in future PRs, please ignore.
  • How should users expect to do some kind of data balancing, e.g. making sure that each batch contains a similar composition from each dataset so as to not bias training dynamics?
  • output_strict optionally checks against fields, but is the intention solely for concatenation? The scenario in chemistry/materials is that in some datasets you have labels available for some data but not others, yet you might want to train with multiple heads on whatever data you have and just learn from the shared embedding. Naively, the equivalent in aero would be having drag force only for some datasets, with others having vector fields or something like that.
  • earth

Collaborator

@megnvidia megnvidia left a comment


Just validate that the text I added to complete the partial sentence is technically correct. (Also, sorry: I had time today, so I took the opportunity to edit the whole file.)

After that I can approve.

@coreyjadams
Collaborator Author

@laserkelvin thanks for taking a look!

  • Yes, there is a line of sight towards DDP. [stacked 3/3, multi-datapipes] darcy flow multi dataset #1507 includes this PR + a training example. I didn't build DDP into it, but can (and now will!). tl;dr: it's not hard; we use the usual machinery of distributed samplers to coordinate over ranks, and it works as usual. Datasets are exposed as "one" unit, so the sampler indexes into the unit as a whole.
  • Users should expect to balance batch composition via samplers, if they want that behavior. The idea here was that if you compose datasets into a multi-dataset, they become "one" and your sampling treats them as one. Perhaps you think we should support some sort of custom sampler? We run the risk of biasing the training of course in that case...
  • The intention is not solely concatenation. In fact, for many datasets, collation via concatenation is not correct, so I can't do anything more sophisticated than check "hey, does every dataset produce the same keys?" It is not strictly mandatory, though; you could happily have a different output from each dataset if you are not collating (i.e., batch size 1).
  • yes
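The distributed-sampler idea described above ("datasets exposed as one unit, sharded across ranks") can be sketched in plain Python; `shard_indices` is a hypothetical helper illustrating the strided sharding a DistributedSampler-style component performs, not PhysicsNeMo code:

```python
import random

# Hypothetical sketch: each rank takes a strided, disjoint slice of the
# combined index space, so the composed datasets are sharded as "one" unit.
def shard_indices(total, num_replicas, rank, seed=0):
    indices = list(range(total))
    # Seeded identically on every rank so all ranks agree on the shuffle.
    random.Random(seed).shuffle(indices)
    # Strided slice: rank r gets indices r, r + num_replicas, ...
    return indices[rank::num_replicas]

# Two "ranks" sharding a 16-sample combined dataset.
shards = [shard_indices(16, num_replicas=2, rank=r) for r in range(2)]
```

Because the sharding happens over the combined index space, each rank transparently draws from all sub-datasets.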

@laserkelvin
Collaborator

  • Users should expect to balance batch composition via samplers, if they want that behavior. The idea here was that if you compose datasets into a multi-dataset, they become "one" and your sampling treats them as one. Perhaps you think we should support some sort of custom sampler? We run the risk of biasing the training of course in that case...

Yeah, nah, I don't think we necessarily need to provide the custom sampler; that will depend a lot on the dataset compositions, and we generally can't know those a priori, I don't think. But I think it is important to highlight that aspect, maybe in an example when it comes to it. I imagine there are going to be a lot of imbalanced datasets that, if naively just shuffled, will end up yielding training signals skewed towards the bigger datasets. As long as there's a path forward for that, I'm happy.
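One common way to address the imbalance concern above is inverse-size weighting, so each sub-dataset contributes equal probability mass to a weighted sampler. `balanced_weights` is a hypothetical illustration, assuming a WeightedRandomSampler-style draw:

```python
# Hypothetical sketch: weight each sample by 1/len(its dataset), so every
# dataset contributes equal total mass regardless of its size.
def balanced_weights(dataset_lengths):
    weights = []
    for n in dataset_lengths:
        # Each of the n samples in this dataset gets weight 1/n,
        # so the dataset's total mass is exactly 1.0.
        weights.extend([1.0 / n] * n)
    return weights

# A 1000-sample dataset and a 10-sample dataset, combined: without
# weighting, shuffled draws would hit the big dataset ~99% of the time.
w = balanced_weights([1000, 10])
```

Feeding `w` to a weighted sampler over the combined index space would then draw from both datasets with equal frequency, at the cost of re-sampling the small one.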

Another thing came to mind as well: how do you envision dealing with different splits?

Collaborator

@laserkelvin laserkelvin left a comment


I'll approve assuming you want to make that one stylistic comment I had

@coreyjadams
Collaborator Author

/blossom-ci

@coreyjadams coreyjadams enabled auto-merge March 18, 2026 16:38
@coreyjadams coreyjadams added this pull request to the merge queue Mar 18, 2026
Merged via the queue into NVIDIA:main with commit 2b677e3 Mar 18, 2026
4 checks passed
coreyjadams added a commit to coreyjadams/physicsnemo that referenced this pull request Mar 19, 2026
[stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly (NVIDIA#1505)

* Adding most of the multi-dataset cleanly

* Refine documentation for PhysicsNeMo Datapipes

Updated the documentation for PhysicsNeMo Datapipes to improve clarity and consistency. Adjusted wording and structure for better readability.

* update api docs.

* Update multi-dataset interface to accept an unpacked tuple instead of a list, etc.

---------

Co-authored-by: megnvidia <mmiranda@nvidia.com>
nbren12 pushed a commit to nbren12/modulus that referenced this pull request Mar 24, 2026
(NVIDIA#1505; same commit message as above)

4 participants