
Fix Model Sigma Data #1534

Open
albertocarpentieri wants to merge 5 commits into NVIDIA:main from albertocarpentieri:fix/stormcast-sigma-data-precond

Conversation

@albertocarpentieri (Contributor)

PhysicsNeMo Pull Request

Description

  • The EDM preconditioner (EDMPrecond) defaults to sigma_data=0.5, ignoring the value set in training.loss.sigma_data. This causes a mismatch between the loss weighting and the preconditioning coefficients (c_skip, c_out, c_in), leading to systematic bias in diffusion model predictions.
  • Forward training.loss.sigma_data into the model hyperparameters when building the diffusion network, so the preconditioner and loss use the same value.
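The forwarding described above can be sketched with plain dicts; the names `loss_cfg`, `model_hparams`, and `build_model_hparams` are hypothetical stand-ins for the actual config objects in the StormCast trainer, not its real API:

```python
# Minimal sketch (hypothetical config shapes) of forwarding the loss-side
# sigma_data into the model hyperparameters while preserving overrides.

def build_model_hparams(loss_cfg: dict, model_hparams: dict) -> dict:
    """Inject loss_cfg['sigma_data'] unless the user set it explicitly."""
    hparams = dict(model_hparams)  # do not mutate the caller's dict
    if "sigma_data" in loss_cfg:
        # setdefault keeps any explicit model-level override intact
        hparams.setdefault("sigma_data", loss_cfg["sigma_data"])
    return hparams

# No override: the loss value propagates to the preconditioner.
print(build_model_hparams({"sigma_data": 1.0}, {}))
# An explicit model-level override wins.
print(build_model_hparams({"sigma_data": 1.0}, {"sigma_data": 0.5}))
```

The `setdefault` call is what preserves a deliberate mismatch when the user sets `sigma_data` directly in `model.hyperparameters`.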


@greptile-apps (Contributor)

greptile-apps Bot commented Mar 25, 2026

Greptile Summary

This PR fixes a real bug in the StormCast trainer: EDMPrecond was silently defaulting to sigma_data=0.5 regardless of the value configured in training.loss.sigma_data, creating a mismatch between the EDM loss weighting and the preconditioner coefficients (c_skip, c_out, c_in). The fix reads sigma_data from the loss config at model build time and injects it into model_hparams via setdefault, correctly preserving any explicit user override in model.hyperparameters.

Key observations:

  • The scalar sigma_data case is handled correctly and the use of setdefault is appropriate.
  • Per-channel sigma_data: LossConfig.sigma_data can be a list[float]. In _setup_loss this list is converted to a (1, C, 1, 1) tensor for broadcasting. In the new code the raw list is forwarded to EDMPrecond, which declares sigma_data: float and stores torch.tensor(sigma_data) as a buffer — producing a 1D (C,) tensor that will broadcast incorrectly (or error) in the preconditioner's coefficient arithmetic. The per-channel case needs a guard or explicit documentation.
  • The info log fires unconditionally even when a model-level override causes the preconditioner and loss to intentionally use different values, which may be confusing.
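The per-channel broadcasting hazard flagged above can be reproduced with NumPy (shapes and values are illustrative; `EDMPrecond`'s actual buffer handling is in PyTorch, which follows the same broadcasting rules):

```python
import numpy as np

B, C, H, W = 2, 3, 8, 8
x = np.zeros((B, C, H, W))            # a batch in NCHW layout
sigma_list = [0.5, 1.0, 2.0]          # hypothetical per-channel sigma_data

# Loss path: reshaped to (1, C, 1, 1), so it aligns with the channel axis.
per_channel = np.asarray(sigma_list).reshape(1, C, 1, 1)
assert (x + per_channel).shape == (B, C, H, W)

# Preconditioner path: a raw 1-D (C,) buffer aligns with the *last* axis (W).
raw = np.asarray(sigma_list)          # shape (3,)
try:
    x + raw                           # W=8 vs C=3: shape mismatch
except ValueError as e:
    print("broadcast error:", e)

# Worse: if C happens to equal W, the addition succeeds silently, varying
# the values along width instead of channels.
y = np.zeros((B, C, H, C))
wrong = y + raw                       # no error, but wrong semantics
print(wrong.shape)
```

This is why a guard (or an explicit reshape to `(1, C, 1, 1)` before constructing the buffer) is needed for the list case.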

Important Files Changed

examples/weather/stormcast/utils/trainer.py: Forwards training.loss.sigma_data into the model's hyperparameters so the EDMPrecond preconditioner and the EDM loss share the same value. The scalar case is handled correctly; however, when sigma_data is a per-channel list[float], the raw list is passed to EDMPrecond, which expects a float, potentially causing incorrect preconditioner coefficients or a runtime shape error.

Reviews (1): Last reviewed commit: "fix model sigma data"

Comment thread: examples/weather/stormcast/utils/trainer.py
Comment thread: examples/weather/stormcast/utils/trainer.py (outdated)
@jleinonen (Collaborator)

As per the Greptile comment, this doesn't address the case of sigma_data passed as a list. Should we implement it now or leave it for later?

@jleinonen (Collaborator)

Can we add a test that makes sure the loss and the model have the same sigma_data?
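One possible shape for such a test, sketched with stand-in objects: in the real test the loss and model would come from the StormCast trainer's build step, and the attribute names here are assumptions, not the trainer's actual API:

```python
# Stand-ins for the configured loss and preconditioned model; in a real
# test these would be produced by building the trainer from a config.
class FakeLoss:
    def __init__(self, sigma_data):
        self.sigma_data = sigma_data

class FakePrecondModel:
    def __init__(self, sigma_data):
        self.sigma_data = sigma_data

def check_sigma_data_consistency(loss, model):
    # The core assertion: loss weighting and preconditioner coefficients
    # must be derived from the same sigma_data.
    assert loss.sigma_data == model.sigma_data, (
        f"loss sigma_data={loss.sigma_data} != model sigma_data={model.sigma_data}"
    )

check_sigma_data_consistency(FakeLoss(1.0), FakePrecondModel(1.0))
print("sigma_data consistent")
```

A per-channel variant would additionally check that the list has been reshaped consistently on both sides before comparing.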

@jleinonen requested a review from pzharrington on March 25, 2026 19:15
@pzharrington (Collaborator)

pzharrington commented Mar 25, 2026

Did this bug only apply to EDMPrecond (used with songunet) and not EDMPreconditioner? If so, these changes may be superseded fairly soon. I started working on full adoption of the new physicsnemo.diffusion interfaces, thus eliminating use of EDMPrecond; I still need to figure out the specific changes to support channel-wise sigma_data, though.

@jleinonen (Collaborator)

Did this bug only apply to EDMPrecond (used with songunet) and not EDMPreconditioner? If so, these changes may be superseded fairly soon. I started working on full adoption of the new physicsnemo.diffusion interfaces, thus eliminating use of EDMPrecond; I still need to figure out the specific changes to support channel-wise sigma_data, though.

It seems to affect both and this change should fix both, since both have a sigma_data parameter in their call signature.

@pzharrington (Collaborator)

Ah ok, I was just going off the Greptile summary. Regarding the channel-wise sigma_data, how impactful has it been to tune that rather than use a scalar or the default 0.5? I don't think StormScope uses per-channel sigma_data, IIRC.

@jleinonen (Collaborator)

jleinonen commented Mar 25, 2026

Ah ok, I was just going off the Greptile summary. Regarding the channel-wise sigma_data, how impactful has it been to tune that rather than use a scalar or the default 0.5? I don't think StormScope uses per-channel sigma_data, IIRC.

It's mostly relevant for regression-diffusion models. For those the regression net often has very different errors for the channels and then we should choose the RMSE of the regression for each channel as sigma_data. Whereas for pure diffusion models we usually have the data normalized to unit variance, in which case sigma_data = 1.0 should be the most mathematically justified choice.
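The rule of thumb above can be sketched numerically: for a regression-diffusion setup, take the regression net's per-channel residual RMSE as sigma_data. The data below is synthetic and the per-channel error scales are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic targets and regression predictions, shape (N, C, H, W).
target = rng.normal(size=(16, 3, 16, 16))
true_err = np.array([0.2, 0.5, 1.0]).reshape(1, 3, 1, 1)  # per-channel error scale
regression = target + rng.normal(size=target.shape) * true_err

# Per-channel RMSE of the regression residual -> candidate sigma_data, shape (C,).
residual = regression - target
sigma_data = np.sqrt((residual ** 2).mean(axis=(0, 2, 3)))
print(sigma_data)  # roughly recovers the per-channel error scales

# For a pure diffusion model trained on unit-variance data, the scalar
# choice sigma_data = 1.0 is the mathematically justified default.
sigma_data_scalar = 1.0
```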

root and others added 4 commits April 7, 2026 05:37
Signed-off-by: root <root@pool0-01762.cm.cluster>
Signed-off-by: root <root@pool0-01101.cm.cluster>
Signed-off-by: root <root@pool0-01523.cm.cluster>
Signed-off-by: root <root@pool0-01102.cm.cluster>
@albertocarpentieri force-pushed the fix/stormcast-sigma-data-precond branch from f4e3310 to bcb2432 on April 7, 2026 12:40
Signed-off-by: root <root@pool0-01814.cm.cluster>