Add callback hooks to the simulation stepper #370
Conversation
inducer left a comment:
Thanks for working on this! Some thoughts below.
```python
if (np.isnan(op.nodal_sum(discr, "vol", dependent_vars.pressure))
        or np.isnan(op.nodal_sum(discr, "vol", dependent_vars.temperature))):
```
Need to synchronize the error condition across all ranks before raising; otherwise the code may hang. Possibly useful snippet.
Good point. Question though: Is this okay if op.nodal_sum and other reductions are distributed? Once inducer/grudge#117 lands, these (and similar reductions like norms) will do parallel reductions. Or would we rather do things rank-local in these callbacks?
Oh, neat. Yeah, I think if op.nodal_sum is global then it should be fine. It's probably going to be slower than doing a single reduction, but we can always optimize it later if it becomes a problem.
@majosm I've opted to keep it rank-local for now, since exceptions are treated globally, right? The error conditions now call out to rank-local grudge routines. Does this sound okay? I just felt doing 3 allreduces in a callback might be a bit much.
It is my understanding that exception handling is not global. I think that's the main crux of our struggle with parallel/asynchronous error handling. It appears to me that if your current version of the healthcheck fails, then the simulation will hang or behave pathologically (as an exception gets tossed on one rank only, that rank goes into the handling branch, the rest don't).
I don't think we need a bunch of reductions though, really only one:
```python
def fluid_healthcheck(...):
    nans_present = check_for_local_naninf()
    if nans_present:
        message += "NANs or INFs present in solution"
    health_problem = check_for_whatever_local_problem(...)
    if health_problem:
        message += "Some other health issue."
    unhealthy = nans_present | health_problem | yadda | yadda | yadda
    unhealthy = comm.allreduce(unhealthy, MPI.LOR)
    if unhealthy:
        log(message)
        raise TheException(message)
```

We could probably do better with the actual message. Consider that we could have Nranks disparate messages, each with valid (and potentially useful) error content. If we use `log` or maybe a similar utility that we ourselves roll to be parallel-smart, then we can capture such messages.
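To make the single-reduction idea above concrete, here is a minimal runnable sketch. A plain callable stands in for `comm.allreduce(..., op=MPI.LOR)` so it can run without MPI; the names `collective_healthcheck` and `SimulationHealthError` are hypothetical illustrations, not MIRGE-Com API.

```python
class SimulationHealthError(RuntimeError):
    """Hypothetical exception type for failed health checks."""


def collective_healthcheck(local_checks, allreduce_lor):
    """Run rank-local checks, then do ONE reduction before raising.

    local_checks: list of (failed: bool, message: str) pairs from
        rank-local inspections (NaN scan, pressure bounds, ...).
    allreduce_lor: stand-in for the MPI logical-OR allreduce; with
        mpi4py this would be ``lambda v: comm.allreduce(v, MPI.LOR)``.
    """
    messages = [msg for failed, msg in local_checks if failed]
    unhealthy = bool(messages)
    # Single collective: every rank agrees on the outcome, so either
    # all ranks raise or none do -- no hangs in later collectives.
    if allreduce_lor(unhealthy):
        raise SimulationHealthError("; ".join(messages)
                                    or "health check failed on another rank")
```

Note that ranks with no local failure still raise (with a generic message), which is what keeps the error handling synchronized.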
> It is my understanding that exception handling is not global. I think that's the main crux of our struggle with parallel/asynchronous error handling. It appears to me that if your current version of the healthcheck fails, then the simulation will hang or behave pathologically (as an exception gets tossed on one rank only, that rank goes into the handling branch, the rest don't).
Yeah. One minor clarification for @thomasgibson: if an exception is raised and doesn't get handled anywhere, it will bubble up to the top and cause mpi4py to call MPI_Abort (this is what python -m mpi4py does). In that situation it's fine for exceptions to occur locally. It's only when we try to handle them and do I/O, etc. prior to exiting that we need to synchronize globally.
```python
t += dt

if post_step_callback is not None:
    post_step_callback(state=state, step=istep, t=t, dt=dt)
```
I would advocate for having `post_step_callback` (and maybe `pre_step_callback` too?) return a boolean indicating whether to terminate timestepping. It gives the user some more flexibility in deciding when the simulation should stop. Then the timestepping loop can be a `while not done:` and the `t_final` stuff can be removed from the interface. I don't know if we want to actually make that change to the stepper in this PR, but in any case, just something to think about.
I like this idea. How does a call signature like this look? `state, stop = callback(...)`, where `stop` is a bool?
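A toy sketch of what that `while not done:` loop with a bool-returning post-step callback might look like. `euler_step` and this `advance_state` are illustrative placeholders, not the actual MIRGE-Com stepper interface.

```python
def euler_step(state, t, dt, rhs):
    # Toy forward-Euler stand-in for the real stepper (e.g. an RK4 step).
    return state + dt * rhs(t, state)


def advance_state(rhs, state, t, dt, post_step_callback):
    """Timestepping loop terminated by the callback rather than t_final.

    post_step_callback returns (state, stop); stop=True ends the loop.
    """
    done = False
    istep = 0
    while not done:
        state = euler_step(state, t, dt, rhs)
        t += dt
        istep += 1
        state, done = post_step_callback(state=state, step=istep, t=t, dt=dt)
    return istep, t, state
```

With this shape the caller decides the stopping criterion, e.g. a time limit, a step budget, or a health check.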
What about callbacks that raise exceptions though? That's my only real concern.
I would advocate for placing the simulation termination in the hands of the user at the callback level. Consider this callback:
```python
def my_callback(state, step, t, dt):
    is_healthy = my_healthcheck(state, step, t, dt)
    is_healthy = comm.allreduce(is_healthy, MPI.LAND)
    if not is_healthy:
        write_viz(...)
        write_restart(...)
        do_my_other_biddings(...)
        raise SimError("Simulation health check failed.")
```

This is safe because the callback routine is already collective. If we are concerned with users doing collective operations (I don't see why we should be) - then we can add a wrapper. I like this a lot because it places healthcheck at the driver level, places full control over how to handle a health issue in the hands of the user (correctly imo), and forces users to opt-in to use it.
> I would advocate for placing the simulation termination in the hands of the user at the callback level. [...] I like this a lot because it places healthcheck at the driver level, places full control over how to handle a health issue in the hands of the user (correctly imo), and forces users to opt-in to use it.
I think I mostly agree with this. My only complaint is that when doing it this way the user won't get any information back from `my_healthcheck` about what exactly failed. My inclination would be to make `my_healthcheck` collective and then either:

- Print out what failed on rank 0 inside `my_healthcheck`
- Do something like:

```python
def my_callback(state, step, t, dt):
    try:
        my_healthcheck(state, step, t, dt)
    except HealthCheckError:
        write_viz(...)
        write_restart(...)
        do_my_other_biddings(...)
        raise
```

(i.e., whatever is contained in the `HealthCheckError` will bubble back up to the user.)
> What about callbacks that raise exceptions though? That's my only real concern.
I think exceptions from callbacks are OK if the user handles it right. Callbacks are collective already. Clearly, the user can toss an exception from only one rank - but hey: that's on the user.
> If we catch the exception using what @majosm proposed (`try`-`except` clause), then we can yank out the error message from the exception and print that out on rank 0. How does that sound?
I think that is OK, but imo not as powerful as forming a status string and letting the user deal with the error. It means that the healthcheck does more than just check the health - it also does I/O on rank 0 - which may or may not be what the user wants to do there.
> If we catch the exception using what @majosm proposed (`try`-`except` clause), then we can yank out the error message from the exception and print that out on rank 0. How does that sound?
>
> I think that is OK, but imo not as powerful as forming a status string and letting the user deal with the error. It means that the healthcheck does more than just check the health - it also does I/O on rank 0 - which may or may not be what the user wants to do there.
Sorry, I meant doing one or the other, i.e., either have my_healthcheck print, or do the try/except thing (my preference is the latter). With the latter approach you could extract the error message by doing something like:
```python
try:
    my_healthcheck(state, step, t, dt)
except HealthCheckError as e:
    msg = e.message
    # ... do more processing with msg
```
> Sorry, I meant doing one or the other, i.e., either have `my_healthcheck` print, or do the `try`/`except` thing (my preference is the latter). [...]
I like this OK. What I would like to avoid is needing to pass "comm" as an argument to the healthcheck utility. So if my_healthcheck essentially calls a library routine underneath but still does synchronization (i.e. collective error handling) at driver level, then I think that is better than doing collective health check and collective exception in the lower layer (i.e. in the library or simutils). [but am not opposed to being talked out of this notion if it is flawed]
(force-pushed b914d3d to 42de632)
(force-pushed 30d6541 to 4a76d59)
(force-pushed 22a7056 to 22ad5c4)
```python
try:
    # Check the health of the simulation
    cfd_healthcheck(discr, eos, state,
                    step=step, t=t, freq=ncheck)
    # Perform checkpointing
    sim_checkpoint(discr, eos, state,
                   step=step, t=t, dt=dt, freq=nstatus,
                   constant_cfl=constant_cfl)
    # Visualize
    sim_visualization(discr, eos, state,
                      visualizer, vizname=casename,
                      step=step, t=t, freq=nviz,
                      viz_fields=viz_fields)
except StepperCrashError as err:
```
Cool, thanks @thomasgibson. I like how you've split these functions out. This work interacts strongly with that of #257.
- I dunno about StepperCrashError for these utilities. I do see @majosm's comment about this earlier, but these aren't really stepper errors. I think it is a little confusing to toss stepper errors from these functions that are (on the surface) unrelated to the actual stepping and stepper.
- I'm not a giant fan of having these utilities check whether they will do something on the inside (via step/freq). I think it is a fairly good policy to make the routine very dumb; if it is called, it does the thing. Let the caller worry about if it is time to invoke the function. The upshot/question is: Any objections to putting the `check_step` call on the outside?
- I would be in favor of renaming these utilities. I like `cfd_healthcheck` but would maybe change that to `fluid_healthcheck`. The other two are fluid-specific too. As long as we're changing them, we may as well change `sim_visualization` to `fluid_visualization`. `sim_checkpoint` is basically eliminated by this change; we can go back later and reshape it to be general in Refactor checkpoint/simutils (attempt #2) #257, or banish it forever. If we banish, we need to add soln watch via `logpyle` to the examples that lack it.
We do need to write some tests that expose some of the stuff that we are concerned about for error handling. I don't have any specific good ideas about how to test these atm, but some of the stuff we've talked about:
- Asynchronous errors during stepping
- Viz file collisions/overwrites
- Graceful exit after any exceptions
- Handling exact soln mismatch collectively
Some of this may be firmly in #257 territory.
To give some background here @thomasgibson: the main reason sim_checkpoint exists is to reduce the amount of duplicated code in the drivers. Ideally we want the code inside my_checkpoint to be pretty simple (unless the user wants to do something extra that's specific to their simulation). So, with that motivation in mind, I wonder if there's a way we could move the exception handling part inside sim_checkpoint? Maybe something along the lines of:
```python
# driver.py
def healthcheck(step, t, state):
    return cfd_healthcheck(...)

def write_vis(step, t, state):
    return sim_visualization(...)

def checkpoint(step, t, dt, state):
    return sim_checkpoint(..., healthcheck=healthcheck, write_vis=write_vis)
```

```python
# simutil.py
def sim_checkpoint(..., healthcheck=None, write_vis=None):
    try:
        if check_step(...):
            # ... write status ...
        if check_step(...) and healthcheck is not None:
            healthcheck(step, t, state)
        if check_step(...) and write_vis is not None:
            write_vis(step, t, state)
    except StepperCrashError as e:
        # ... log error ...
        if write_vis is not None:
            write_vis(step, t, state)
        raise
```
> - I dunno about StepperCrashError for these utilities. I do see @majosm's comment about this earlier, but these aren't really stepper errors. I think it is a little confusing to toss stepper errors from these functions that are (on the surface) unrelated to the actual stepping and stepper.
IMO the name made sense when the try/except was wrapped around `advance_state`, but now that it's inside the checkpoint I think it could be named something else. Maybe the base class can just be something low level that indicates that it's raised globally? `SynchronizedError`? 🤷♂️
> - I'm not a giant fan of having these utilities check whether they will do something on the inside (via step/freq). I think it is a fairly good policy to make the routine very dumb; if it is called, it does the thing. Let the caller worry about if it is time to invoke the function. The upshot/question is: Any objections to putting the `check_step` call on the outside?
I think this depends on whether the code in question is designed to maximize flexibility or simplicity. sim_checkpoint is more intended to do the latter, so I think it makes sense to wrap the check_step calls inside it.
> - I would be in favor of renaming these utilities. I like `cfd_healthcheck` but would maybe change that to `fluid_healthcheck`. The other two are fluid-specific too. As long as we're changing them, we may as well change `sim_visualization` to `fluid_visualization`.
I think I would vote for just calling it `sim_healthcheck` for the time being and fixing it in #257.
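For concreteness, here is a self-contained toy version of the wrapper shape sketched above: a dumb `sim_checkpoint` that runs injected callbacks on their intervals and tries a crash viz dump before re-raising. The names (`sim_checkpoint`, `check_step`, `StepperCrashError`) mirror the discussion, but the bodies are illustrative stand-ins, not the real simutil code.

```python
class StepperCrashError(RuntimeError):
    """Stand-in for the globally-raised error type discussed above."""


def check_step(step, freq):
    # Fire every `freq` steps; freq <= 0 disables the action.
    return freq > 0 and step % freq == 0


def sim_checkpoint(step, t, state, *, nstatus=0, ncheck=0, nviz=0,
                   healthcheck=None, write_vis=None, log=print):
    """Run each optional action when its interval comes up; on a crash,
    attempt a final viz dump before re-raising to the caller."""
    try:
        if check_step(step, nstatus):
            log(f"step {step}, t = {t}")
        if healthcheck is not None and check_step(step, ncheck):
            healthcheck(step, t, state)
        if write_vis is not None and check_step(step, nviz):
            write_vis(step, t, state)
    except StepperCrashError:
        if write_vis is not None:
            write_vis(step, t, state)   # crash dump
        raise
```

Because the exception is re-raised, the driver keeps the option of wrapping the call in its own `try`/`except` for additional handling.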
> I think this depends on whether the code in question is designed to maximize flexibility or simplicity. `sim_checkpoint` is more intended to do the latter, so I think it makes sense to wrap the `check_step` calls inside it.
imo, removing the check both simplifies and increases flexibility (the way it was before). I agree that if the sim_checkpoint ends up wrapping all the checks, that sim_checkpoint needs to perform the step checks, but I'd still argue that the call to check_step should be at the call site to sim_healthcheck and not down inside of it. But... this is also a very mild and minor complaint. It has been down inside the routine all this time.
> - I would be in favor of renaming these utilities. I like `cfd_healthcheck` but would maybe change that to `fluid_healthcheck`. The other two are fluid-specific too. As long as we're changing them, we may as well change `sim_visualization` to `fluid_visualization`.
>
> I think I would vote for just calling it `sim_healthcheck` for the time being and fixing it in #257.
👍
Is there a good reason that `sim_checkpoint` does not make a crash viz dump at exception time for serial runs?

I'm not going to pursue this complaint very far - but I think that we are going backwards a bit here. This work essentially strips the user of a good portion of the control over what happens when the health check fails. Health checks are meant to be a user-defined facility. Imo, MIRGE-Com should provide simple utilities to perform health checks, but it should be up to the user to handle what happens if any of those checks are failed. For me, it is a step backwards to roll a new monolithic "checkpoint driver" that tries to do and automate too much.

Edit: actually maybe this adding of callback routines to the stepper has had a bit of scope creep, too. Refactoring the checkpoint junk is a good deal of the business in #257.
> Is there a good reason that `sim_checkpoint` does not make a crash viz dump at exception time for serial runs?
?
> I'm not going to pursue this complaint very far - but I think that we are going backwards a bit here. This work essentially strips the user of a good portion of the control over what happens when the health check fails. Health checks are meant to be a user-defined facility. Imo, MIRGE-Com should provide simple utilities to perform health checks, but it should be up to the user to handle what happens if any of those checks are failed. For me, it is a step backwards to roll a new monolithic "checkpoint driver" that tries to do and automate too much.
Sounds like we don't have a consensus about what we're actually trying to do here... i.e., are we trying to maximize user control, or minimize code? They conflict with each other to an extent. I was under the impression that we were aiming more for the latter.
I think it would make sense to have the user pass in a health check function to sim_checkpoint. I'm less sure I see the value in having them handle errors manually. Can you give an example?
`sim_checkpoint` in its current state here does raise the exceptions again after handling them, so I guess theoretically a user could do:

```python
try:
    sim_checkpoint(...)
except HealthCheckError as e:
    # ... additional handling ...
```

> Edit: actually maybe this adding of callback routines to the stepper has had a bit of scope creep, too. Refactoring the checkpoint junk is a good deal of the business in #257.
I'm not sure if the original goal of this PR was to add a basic health check or just the callback functionality and nothing else, but if we want to add a health check here I think we have to deal with at least some of the stuff from #257 to make it work.
> Is there a good reason that `sim_checkpoint` does not make a crash viz dump at exception time for serial runs?
>
> ?

Never mind, I see it now:

```python
if comm is None:
    if terminate:
        raise exception
    return
```

Yeah, need a vis dump here @thomasgibson. Or maybe you could do something like:
```python
if comm is not None:
    from mpi4py import MPI
    terminate = comm.allreduce(terminate, MPI.LOR)
if terminate:
    # ...
```

to avoid multiple returns?
> Sounds like we don't have a consensus about what we're actually trying to do here... i.e., are we trying to maximize user control, or minimize code? They conflict with each other to an extent. I was under the impression that we were aiming more for the latter.
I agree with that. My suggestion to add the callback hooks was centered on the need to give the user more freedom to implement what they want in the callbacks - one of those things likely to be health checking. I wanted to let the user decide what to do once they had evaluated the state.
> I think it would make sense to have the user pass in a health check function to `sim_checkpoint`. I'm less sure I see the value in having them handle errors manually. Can you give an example?
Health check is exactly the example, I think. The user should be able to compose whatever health check they want. In fact, let's say the user wants to monitor specific volume (because, for example, it tends to go wild at an injection inlet or something). Then the user writes that health check function (or uses a built-in utility), and then ... if the check triggers, the user can either give up and quit, or turn on relaxation source terms to combat the issue.
> I'm not sure if the original goal of this PR was to add a basic health check or just the callback functionality and nothing else, but if we want to add a health check here I think we have to deal with at least some of the stuff from #257 to make it work.
It was my understanding that this PR was just trying to add the callbacks, however I see how its scope expanded, and I'm not necessarily opposed. I also want to make clear that my complaint about the structure and user control is not really a major one (from my perspective). We can make use of it how it is, but it has evolved to the point that it doesn't really accomplish the things I had in mind for introducing the callbacks, and imo tries to do too much with automated health checking.
(force-pushed 22ad5c4 to ca7c37b)
(force-pushed 900566f to 8284d8a)
```python
if terminate:
    # Log crash error message
    if rank == 0:
        logger.info(str(exception))
```
I think this will need to be tweaked, because exception doesn't necessarily exist on rank 0.
```python
    visualizer, vizname=vizname,
    step=step, t=t,
    viz_fields=viz_fields)
raise exception
```
This too, since exception doesn't exist on all ranks.
This one is a little trickier to deal with I think... Seems like we can either:
- Only re-raise the exception on the rank that originally raised it. I don't know if I like this since it makes any subsequent handling have to redo the MPI reduction.
- Try to pickle the exception and broadcast it to all ranks. I vaguely remember trying this and running into issues, but unfortunately I can't recall the details.
- Synchronize the error state before raising and then do the `raise` on all ranks.
Thoughts?
I think option 3 (which I believe the code now sort of does)
To do option 3 properly I think we would need to synchronize before raising for each check inside sim_healthcheck, so that in the case of an error, every rank creates and raises the exception. Currently it's possible that only a subset of ranks will raise inside sim_healthcheck, and then afterwards in sim_checkpoint we end up raising None on some of the ranks since the exception object was never created there.
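A minimal sketch of doing option 3 properly, with a callable standing in for the MPI logical-OR allreduce so it runs without MPI. The point is that every rank synchronizes first and then constructs and raises its own exception, so no rank is left holding `None`; `SyncedError` and `raise_synchronized` are hypothetical names.

```python
class SyncedError(RuntimeError):
    """Hypothetical globally-raised error (cf. SynchronizedError above)."""


def raise_synchronized(local_message, allreduce_lor):
    """Synchronize the error state, then raise on EVERY rank.

    local_message: this rank's error string, or None if this rank is fine.
    allreduce_lor: stand-in for ``comm.allreduce(..., MPI.LOR)``.
    """
    if allreduce_lor(local_message is not None):
        # Ranks with no local error still create and raise an exception,
        # so no rank proceeds alone into a later collective call and hangs.
        raise SyncedError(local_message or "error detected on another rank")
```

Each check inside `sim_healthcheck` could funnel through a helper like this instead of raising rank-locally.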
> To do option 3 properly I think we would need to synchronize before raising for each check inside `sim_healthcheck` (...)
Why not something along the lines of this that we talked about earlier?
```python
def fluid_healthcheck(...):
    nans_present = check_for_local_naninf()
    if nans_present:
        message += "NANs or INFs present in solution"
    health_problem = check_for_whatever_local_problem(...)
    if health_problem:
        message += "Some other health issue."
    unhealthy = nans_present | health_problem | yadda | yadda | yadda
    unhealthy = comm.allreduce(unhealthy, MPI.LOR)
    if unhealthy:
        log(message)
        raise TheException(message)
```
> To do option 3 properly I think we would need to synchronize before raising for each check inside `sim_healthcheck` (...)
>
> Why not something along the lines of this that we talked about earlier? [...]
Oops, yeah, I didn't mean to exclude this. I just meant that the error state needs to be synchronized before exceptions are raised.
closing in favor of #381
Adds pre/post-step callbacks to `advance_state`. This also removes the `checkpoint` argument since it is technically a callback. All examples have been updated. I also threw in a health-check callback that is exercised in the examples. Will write a short unit test for it as well.
cc @MTCam @anderson2981