[stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly #1505
coreyjadams merged 5 commits into NVIDIA:main
Conversation
/blossom-ci
laserkelvin left a comment
General questions, and depending on your answers I'll be ready to approve:
- Is there a line of sight into how this will work with DDP and more? If that's coming in future PRs, please ignore.
- How should users expect to do some kind of data balancing, e.g. making sure that each batch contains a similar composition from each dataset, so as not to bias training dynamics?
- `output_strict` optionally checks against fields, but is the intention for this solely for concatenation? The scenario in chemistry/materials is that in some datasets you have labels available for some data but not others, but you might want to train with multiple heads on whatever data you have and just learn based on the shared embedding. Naively, the equivalent in aero would be like only having drag force for some datasets, and others having vector fields or something to that effect.
Updated the documentation for PhysicsNeMo Datapipes to improve clarity and consistency. Adjusted wording and structure for better readability.
megnvidia left a comment
Just validate that the text I added to complete the partial sentence is technically correct. (Also, sorry, today I had time, so I took the opportunity to edit the whole file.)
After that I can approve.
@laserkelvin thanks for taking a look!
Yeah nah, I don't think we need to necessarily provide the custom sampler - that will depend a lot on the dataset compositions, and we can't know that a priori, generally, I don't think. But I think it is important to highlight that aspect, maybe in an example or something when it comes to it. I imagine there are going to be a lot of imbalanced datasets that, if naively just shuffled, will end up yielding training signals skewed towards the bigger datasets. As long as there's a path forward for them, I'm happy with that. Another thing came to mind as well: how do you envision dealing with different splits?
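Not part of the PR, but one plain-Python sketch of the balancing concern raised above: weight each sample inversely to its dataset's size so draws land roughly evenly across datasets, which is the same idea as a weighted random sampler. The function name and API here are hypothetical, not anything from the PhysicsNeMo codebase.

```python
import random

def balanced_indices(dataset_sizes, num_draws, seed=0):
    # Hypothetical sketch: draw flat indices into a concatenated dataset,
    # weighting each sample by 1/len(its dataset) so small datasets are
    # not drowned out by large ones.
    weights = []
    for size in dataset_sizes:
        weights.extend([1.0 / size] * size)
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_draws)

# A 900-sample dataset next to a 100-sample one: naive shuffling yields
# roughly 9:1 draws, inverse-size weighting yields roughly 1:1.
idx = balanced_indices([900, 100], num_draws=10_000)
from_small = sum(1 for i in idx if i >= 900)
```

In practice one would hand the resulting indices (or the per-sample weights) to whatever sampler the training loop uses, rather than materializing them up front.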
laserkelvin left a comment
I'll approve, assuming you want to act on that one stylistic comment I had.
/blossom-ci
[stacked 1/3, multi-datasets] Adding most of the multi-dataset cleanly (NVIDIA#1505)

* Adding most of the multi-dataset cleanly
* Refine documentation for PhysicsNeMo Datapipes: updated the documentation for PhysicsNeMo Datapipes to improve clarity and consistency. Adjusted wording and structure for better readability.
* update api docs.
* Update multidata set interface to accept an unpacked tuple instead of a list, etc.

Co-authored-by: megnvidia <mmiranda@nvidia.com>
PhysicsNeMo Pull Request
This PR adds multiple datasets to the PhysicsNeMo datapipe scheme.
Datasets are added dynamically at creation and are viewed as one logical dataset. If used with a sampler for random access, the length is tracked via the sampler, not the datasets; this lets you squish a bunch of data together and then randomly take 80/20 splits of the whole thing, for example.
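The "one logical dataset plus a sampler-driven split" behavior described above can be sketched in plain Python. This `MultiDataset` is a hypothetical stand-in, not the PR's actual class; it only illustrates flat indexing over several datasets and an 80/20 split taken over the combined length.

```python
import random

class MultiDataset:
    # Hypothetical sketch: view several map-style datasets as one
    # logical dataset via cumulative lengths.
    def __init__(self, *datasets):
        self.datasets = datasets
        self.cumulative, total = [], 0
        for d in datasets:
            total += len(d)
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, idx):
        lo = 0
        for ds_i, hi in enumerate(self.cumulative):
            if idx < hi:
                return self.datasets[ds_i][idx - lo]
            lo = hi
        raise IndexError(idx)

# Squish two toy datasets together, then take a random 80/20 split
# over the combined length, as the description suggests.
combined = MultiDataset(list(range(100)), list(range(1000, 1060)))
indices = list(range(len(combined)))
random.Random(0).shuffle(indices)
cut = int(0.8 * len(combined))
train_idx, val_idx = indices[:cut], indices[cut:]
```

A real sampler would wrap `train_idx`/`val_idx`; the point is that the split is taken over the combined index space, not per dataset.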
Transforms are per dataset, not uniform. This deliberately lets you have unique pipelines for unique data. Items can come out all funky at the end if you have batch size 1 and don't need to collate data. If you do want everything happy and collatable, there is a helper debug setting in the MultiDataset class that will take the first item from every dataset at construction, run the pipeline, and at least verify the keys match. The shapes can't be aligned without a collation function too, and since that can vary per dataset, we have to defer. But it's something.
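The key-verification debug step described above amounts to something like the following plain-Python sketch (the function name and transform signatures are hypothetical, not the PR's API): pull the first item from each dataset, run that dataset's own transform, and compare key sets only, leaving shapes alone since collation may differ per dataset.

```python
def check_output_keys(datasets, transforms):
    # Hypothetical sketch of the debug check: run each dataset's own
    # transform on its first item and verify the resulting key sets
    # match. Shapes are deliberately not compared, since collation
    # can vary per dataset.
    reference = None
    for ds, transform in zip(datasets, transforms):
        keys = set(transform(ds[0]).keys())
        if reference is None:
            reference = keys
        elif keys != reference:
            raise ValueError(f"key mismatch: {sorted(keys)} vs {sorted(reference)}")
    return reference

# Two toy datasets with different raw layouts but matching output keys
# after their per-dataset transforms.
ds_a = [{"u": 1.0}]
ds_b = [(2.0,)]
keys = check_output_keys(
    [ds_a, ds_b],
    [lambda item: {"u": item["u"]}, lambda item: {"u": item[0]}],
)
```

Checking only the first item at construction is cheap and catches outright schema mismatches early, while deferring anything that depends on the collation function.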
There are more PRs coming in this stack, but it's still difficult to pile them on top with the tools I'm using. Next up:
There is a parallel PR for GeoTransolver to enable it for 2D data. With that in, we can do GeoTransolver on Darcy with multiple datasets too.
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.