Skip to content

[WIP - DO NOT REVIEW] Diffusion SDA recipe#1511

Open
CharlelieLrt wants to merge 32 commits intoNVIDIA:mainfrom
CharlelieLrt:diffusion-sda-recipe
Open

[WIP - DO NOT REVIEW] Diffusion SDA recipe#1511
CharlelieLrt wants to merge 32 commits intoNVIDIA:mainfrom
CharlelieLrt:diffusion-sda-recipe

Conversation

@CharlelieLrt
Copy link
Copy Markdown
Collaborator

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR introduces a new Diffusion SDA (Score-based Data Assimilation) recipe for HRRR surface weather data, adding a complete training pipeline alongside new reusable library components in physicsnemo/diffusion/. The recipe includes zarr-based data loading, an async ETL pipeline for fetching HRRR data from S3, a statistics collection/aggregation workflow, a SongUNet-based backbone with temporal conditioning, and a distributed training script using MultiDiffusionModel2D with patch-based EDM loss.

Key changes:

  • examples/weather/diffusion_sda/: New end-to-end recipe — data.py (dataset), etl.py (async data ingestion), nn.py (model backbone), train.py (distributed training), plot_hrrr.py (visualization), and stats/ (normalization statistics pipeline).
  • physicsnemo/diffusion/multi_diffusion/losses.py: New MultiDiffusionMSEDSMLoss and MultiDiffusionWeightedMSEDSMLoss classes for patch-based denoising score matching, with robust DDP/torch.compile unwrapping.
  • physicsnemo/diffusion/utils/model_wrappers.py: New ConcatConditionWrapper that adapts channel-concatenated and vector conditioning for multiple backbone types.
  • Notable issue in etl.py: shutil.rmtree(ds.cache) is placed inside the try block of the fetch retry loop — if cleanup fails (e.g., when cache=False and no directory was created), the exception is caught and incorrectly triggers retries of the already-completed fetch operation.
  • Minor issues: An unused store = root.store variable (lines 373/424 of etl.py), a trailing comma that produces an unintentional tuple at line 486 of etl.py, misleading mean_val/std_val variable names in collect_stat_moments.py (they hold sums, not statistics), and a mutable default list argument in nn.py.

Important Files Changed

Filename Overview
examples/weather/diffusion_sda/data.py New HRRRSurfaceDataset with zarr-backed HRRR surface data loading, normalization, and spatial/temporal conditioning. Minor issue: invalid_values is computed on OOB check but never included in the error message.
examples/weather/diffusion_sda/etl.py New async ETL pipeline for fetching HRRR/HRRR_FX data and writing to a zarr store. Has a logic bug where shutil.rmtree inside the retry try-block causes fetch retries on cleanup failure; also contains unused store variable and an unintentional trailing comma that creates a tuple.
examples/weather/diffusion_sda/nn.py New HRRRUnconditionalUNet backbone wrapping SongUNet with temporal embedding and spatial conditioning. Minor: mutable default list for channel_mult parameter.
examples/weather/diffusion_sda/train.py New distributed training script for the diffusion SDA recipe using MultiDiffusionModel2D, EDM loss, and InfiniteSampler. Previously reported issues (iter() wrapping, rank-0 counter) are fixed.
examples/weather/diffusion_sda/stats/collect_stat_moments.py Collects per-day spatial sum and sum-of-squares statistics from the zarr store. Docstring and loop variable names are misleading (say mean/std but are actually sums); the math is correct.
physicsnemo/diffusion/multi_diffusion/losses.py New MultiDiffusionMSEDSMLoss and MultiDiffusionWeightedMSEDSMLoss classes implementing patch-based denoising score matching. Well-documented with doctests; DDP/compile unwrapping logic is robust.
physicsnemo/diffusion/utils/model_wrappers.py New ConcatConditionWrapper handling channel-concatenated and vector conditioning for SongUNet, DhariwalUNet, and DiT backbones. Well-validated with compile-aware guards and good docstring coverage.

Comments Outside Diff (1)

  1. examples/weather/diffusion_sda/etl.py, line 357-368 (link)

    shutil.rmtree inside try block causes spurious retries

    shutil.rmtree(ds.cache) is inside the try block that guards the fetch. If the cleanup raises (e.g., FileNotFoundError when cache=False and no directory was created, or a permissions error), the except Exception handler catches it and re-attempts the fetch — even though the fetch itself succeeded. This silently discards a completed download.

    The cleanup should be separated from the fetch retry logic:

    retries = 4
    for attempt in range(retries + 1):
        try:
            da = await ds.fetch(batch_t, variable)
            break  # fetch succeeded
        except Exception:
            logger.error("Failed to download data, re-attempting...")
            if attempt < retries:
                await asyncio.sleep(4 * attempt + 1)
            else:
                raise
    # cleanup is outside the retry loop
    try:
        shutil.rmtree(ds.cache)
    except FileNotFoundError:
        pass  # cache disabled or already gone

    Note also that pull_hrrr_fx_data never calls shutil.rmtree at all, which is inconsistent — leaving temp files behind if that code path is used.

Last reviewed commit: 2bae3ff

Comment thread examples/weather/diffusion_sda/data.py Outdated
Comment thread examples/weather/diffusion_sda/train.py
Comment thread examples/weather/diffusion_sda/train.py Outdated
Comment thread examples/weather/diffusion_sda/stats/aggregate_stats.py Outdated
Comment thread examples/weather/diffusion_sda/data.py Outdated
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
…/diffusion/multi_diffusion/losses.py

Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
@CharlelieLrt
Copy link
Copy Markdown
Collaborator Author

@greptile I fixed the bugs you flagged. Please review again the PR and hunt for any potential bug I may I left

Comment thread examples/weather/diffusion_sda/stats/collect_stat_moments.py
Comment thread examples/weather/diffusion_sda/data.py
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: CharlelieLrt <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: CharlelieLrt <claurent@nvidia.com>
…DDP order in train.py

Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
CharlelieLrt and others added 9 commits March 18, 2026 16:12
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Signed-off-by: CharlelieLrt <claurent@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: CharlelieLrt <claurent@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant