Add callback hooks to the simulation stepper #370
Conversation
inducer left a comment:
Thanks for working on this! Some thoughts below.
```python
if (np.isnan(op.nodal_sum(discr, "vol", dependent_vars.pressure))
        or np.isnan(op.nodal_sum(discr, "vol", dependent_vars.temperature))):
```
Need to synchronize the error condition across all ranks before raising; otherwise the code may hang. Possibly useful snippet.
Good point. Question though: Is this okay if op.nodal_sum and other reductions are distributed? Once inducer/grudge#117 lands, these (and similar reductions like norms) will do parallel reductions. Or would we rather do things rank-local in these callbacks?
Oh, neat. Yeah, I think if op.nodal_sum is global then it should be fine. It's probably going to be slower than doing a single reduction, but we can always optimize it later if it becomes a problem.
@majosm I've opted to keep it rank-local for now, since exceptions are treated globally, right? The error conditions now call out to rank-local grudge routines. Does this sound okay? I just felt doing 3 allreduces in a callback might be a bit much.
It is my understanding that exception handling is not global. I think that's the main crux of our struggle with parallel/asynchronous error handling. It appears to me that if your current version of the healthcheck fails, then the simulation will hang or behave pathologically (as an exception gets tossed on one rank only, that rank goes into the handling branch, the rest don't).
I don't think we need a bunch of reductions though, really only one:
```python
def fluid_healthcheck(...):
    nans_present = check_for_local_naninf()
    if nans_present:
        message += "NANs or INFs present in solution"
    health_problem = check_for_whatever_local_problem(...)
    if health_problem:
        message += "Some other health issue."
    unhealthy = nans_present | health_problem | yadda | yadda | yadda
    unhealthy = comm.allreduce(unhealthy, MPI.LOR)
    if unhealthy:
        log(message)
        raise TheException(message)
```

We could probably do better with the actual message. Consider that we could have Nranks disparate messages, each with valid (and potentially useful) error content. If we use `log` or maybe a similar utility that we ourselves roll to be parallel-smart, then we can capture such messages.
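To make the single-reduction idea above concrete, here is a minimal runnable sketch. A plain callable stands in for `comm.allreduce(..., op=MPI.LOR)` so it can run without MPI; the names `collective_healthcheck` and `SimulationHealthError` are hypothetical illustrations, not MIRGE-Com API.

```python
class SimulationHealthError(RuntimeError):
    """Hypothetical exception type for failed health checks."""


def collective_healthcheck(local_checks, allreduce_lor):
    """Run rank-local checks, then do ONE reduction before raising.

    local_checks: list of (failed: bool, message: str) pairs from
        rank-local inspections (NaN scan, pressure bounds, ...).
    allreduce_lor: stand-in for the MPI logical-OR allreduce; with
        mpi4py this would be ``lambda v: comm.allreduce(v, MPI.LOR)``.
    """
    messages = [msg for failed, msg in local_checks if failed]
    unhealthy = bool(messages)
    # Single collective: every rank agrees on the outcome, so either
    # all ranks raise or none do -- no hangs in later collectives.
    if allreduce_lor(unhealthy):
        raise SimulationHealthError("; ".join(messages)
                                    or "health check failed on another rank")
```

Note that ranks with no local failure still raise (with a generic message), which is what keeps the error handling synchronized.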
> It is my understanding that exception handling is not global. I think that's the main crux of our struggle with parallel/asynchronous error handling. It appears to me that if your current version of the healthcheck fails, then the simulation will hang or behave pathologically (as an exception gets tossed on one rank only, that rank goes into the handling branch, the rest don't).
Yeah. One minor clarification for @thomasgibson: if an exception is raised and doesn't get handled anywhere, it will bubble up to the top and cause mpi4py to call MPI_Abort (this is what python -m mpi4py does). In that situation it's fine for exceptions to occur locally. It's only when we try to handle them and do I/O, etc. prior to exiting that we need to synchronize globally.
```python
t += dt

if post_step_callback is not None:
    post_step_callback(state=state, step=istep, t=t, dt=dt)
```
I would advocate for having `post_step_callback` (and maybe `pre_step_callback` too?) return a boolean indicating whether to terminate timestepping. It gives the user some more flexibility in deciding when the simulation should stop. Then the timestepping loop can be a `while not done:` and the `t_final` stuff can be removed from the interface. I don't know if we want to actually make that change to the stepper in this PR, but in any case, just something to think about.
I like this idea. How does a call signature like this look? `state, stop = callback(...)`, where `stop` is a bool?
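A toy sketch of what that `while not done:` loop with a bool-returning post-step callback might look like. `euler_step` and this `advance_state` are illustrative placeholders, not the actual MIRGE-Com stepper interface.

```python
def euler_step(state, t, dt, rhs):
    # Toy forward-Euler stand-in for the real stepper (e.g. an RK4 step).
    return state + dt * rhs(t, state)


def advance_state(rhs, state, t, dt, post_step_callback):
    """Timestepping loop terminated by the callback rather than t_final.

    post_step_callback returns (state, stop); stop=True ends the loop.
    """
    done = False
    istep = 0
    while not done:
        state = euler_step(state, t, dt, rhs)
        t += dt
        istep += 1
        state, done = post_step_callback(state=state, step=istep, t=t, dt=dt)
    return istep, t, state
```

With this shape the caller decides the stopping criterion, e.g. a time limit, a step budget, or a health check.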
What about callbacks that raise exceptions though? That's my only real concern.
I would advocate for placing the simulation termination in the hands of the user at the callback level. Consider this callback:
```python
def my_callback(state, step, t, dt):
    is_healthy = my_healthcheck(state, step, t, dt)
    is_healthy = comm.allreduce(is_healthy, MPI.LAND)
    if not is_healthy:
        write_viz(...)
        write_restart(...)
        do_my_other_biddings(...)
        raise SimError("Simulation health check failed.")
```

This is safe because the callback routine is already collective. If we are concerned with users doing collective operations (I don't see why we should be) - then we can add a wrapper. I like this a lot because it places healthcheck at the driver level, places full control over how to handle a health issue in the hands of the user (correctly imo), and forces users to opt-in to use it.
> I would advocate for placing the simulation termination in the hands of the user at the callback level. [...] I like this a lot because it places healthcheck at the driver level, places full control over how to handle a health issue in the hands of the user (correctly imo), and forces users to opt-in to use it.
I think I mostly agree with this. My only complaint is that when doing it this way the user won't get any information back from `my_healthcheck` about what exactly failed. My inclination would be to make `my_healthcheck` collective and then either:

- Print out what failed on rank 0 inside `my_healthcheck`
- Do something like:

```python
def my_callback(state, step, t, dt):
    try:
        my_healthcheck(state, step, t, dt)
    except HealthCheckError:
        write_viz(...)
        write_restart(...)
        do_my_other_biddings(...)
        raise
```

(i.e., whatever is contained in the `HealthCheckError` will bubble back up to the user.)
> What about callbacks that raise exceptions though? That's my only real concern.
I think exceptions from callbacks are OK if the user handles it right. Callbacks are collective already. Clearly, the user can toss an exception from only one rank - but hey: that's on the user.
> If we catch the exception using what @majosm proposed (`try`-`except` clause), then we can yank out the error message from the exception and print that out on rank 0. How does that sound?
I think that is OK, but imo not as powerful as forming a status string and letting the user deal with the error. It means that the healthcheck does more than just check the health - it also does I/O on rank 0 - which may or may not be what the user wants to do there.
> If we catch the exception using what @majosm proposed (`try`-`except` clause), then we can yank out the error message from the exception and print that out on rank 0. How does that sound?
>
> I think that is OK, but imo not as powerful as forming a status string and letting the user deal with the error. It means that the healthcheck does more than just check the health - it also does I/O on rank 0 - which may or may not be what the user wants to do there.
Sorry, I meant doing one or the other, i.e., either have my_healthcheck print, or do the try/except thing (my preference is the latter). With the latter approach you could extract the error message by doing something like:
```python
try:
    my_healthcheck(state, step, t, dt)
except HealthCheckError as e:
    msg = e.message
    # ... do more processing with msg
```
> Sorry, I meant doing one or the other, i.e., either have `my_healthcheck` print, or do the `try`/`except` thing (my preference is the latter). [...]
I like this OK. What I would like to avoid is needing to pass "comm" as an argument to the healthcheck utility. So if my_healthcheck essentially calls a library routine underneath but still does synchronization (i.e. collective error handling) at driver level, then I think that is better than doing collective health check and collective exception in the lower layer (i.e. in the library or simutils). [but am not opposed to being talked out of this notion if it is flawed]
(force-pushed b914d3d to 42de632)
(force-pushed 30d6541 to 4a76d59)
(force-pushed 22a7056 to 22ad5c4)
```python
try:
    # Check the health of the simulation
    cfd_healthcheck(discr, eos, state,
                    step=step, t=t, freq=ncheck)
    # Perform checkpointing
    sim_checkpoint(discr, eos, state,
                   step=step, t=t, dt=dt, freq=nstatus,
                   constant_cfl=constant_cfl)
    # Visualize
    sim_visualization(discr, eos, state,
                      visualizer, vizname=casename,
                      step=step, t=t, freq=nviz,
                      viz_fields=viz_fields)
except StepperCrashError as err:
```
Cool, thanks @thomasgibson. I like how you've split these functions out. This work interacts strongly with that of #257.
- I dunno about StepperCrashError for these utilities. I do see @majosm's comment about this earlier, but these aren't really stepper errors. I think it is a little confusing to toss stepper errors from these functions that are (on the surface) unrelated to the actual stepping and stepper.
- I'm not a giant fan of having these utilities check whether they will do something on the inside (via step/freq). I think it is a fairly good policy to make the routine very dumb; if it is called, it does the thing. Let the caller worry about if it is time to invoke the function. The upshot/question is: Any objections to putting the `check_step` call on the outside?
- I would be in favor of renaming these utilities. I like `cfd_healthcheck` but would maybe change that to `fluid_healthcheck`. The other two are fluid-specific too. As long as we're changing them, we may as well change `sim_visualization` to `fluid_visualization`. `sim_checkpoint` is basically eliminated by this change; we can go back later and reshape it to be general in Refactor checkpoint/simutils (attempt #2) #257, or banish it forever. If we banish, we need to add soln watch via `logpyle` to the examples that lack it.
We do need to write some tests that expose some of the stuff that we are concerned about for error handling. I don't have any specific good ideas about how to test these atm, but some of the stuff we've talked about:
- Asynchronous errors during stepping
- Viz file collisions/overwrites
- Graceful exit after any exceptions
- Handling exact soln mismatch collectively
Some of this may be firmly in #257 territory.
To give some background here @thomasgibson: the main reason sim_checkpoint exists is to reduce the amount of duplicated code in the drivers. Ideally we want the code inside my_checkpoint to be pretty simple (unless the user wants to do something extra that's specific to their simulation). So, with that motivation in mind, I wonder if there's a way we could move the exception handling part inside sim_checkpoint? Maybe something along the lines of:
```python
# driver.py
def healthcheck(step, t, state):
    return cfd_healthcheck(...)

def write_vis(step, t, state):
    return sim_visualization(...)

def checkpoint(step, t, dt, state):
    return sim_checkpoint(..., healthcheck=healthcheck, write_vis=write_vis)
```

```python
# simutil.py
def sim_checkpoint(..., healthcheck=None, write_vis=None):
    try:
        if check_step(...):
            # ... write status ...
        if check_step(...) and healthcheck is not None:
            healthcheck(step, t, state)
        if check_step(...) and write_vis is not None:
            write_vis(step, t, state)
    except StepperCrashError as e:
        # ... log error ...
        if write_vis is not None:
            write_vis(step, t, state)
        raise
```
> - I dunno about StepperCrashError for these utilities. I do see @majosm's comment about this earlier, but these aren't really stepper errors. I think it is a little confusing to toss stepper errors from these functions that are (on the surface) unrelated to the actual stepping and stepper.
IMO the name made sense when the try/except was wrapped around `advance_state`, but now that it's inside the checkpoint I think it could be named something else. Maybe the base class can just be something low level that indicates that it's raised globally? `SynchronizedError`? 🤷♂️
> - I'm not a giant fan of having these utilities check whether they will do something on the inside (via step/freq). I think it is a fairly good policy to make the routine very dumb; if it is called, it does the thing. Let the caller worry about if it is time to invoke the function. The upshot/question is: Any objections to putting the `check_step` call on the outside?
I think this depends on whether the code in question is designed to maximize flexibility or simplicity. sim_checkpoint is more intended to do the latter, so I think it makes sense to wrap the check_step calls inside it.
> - I would be in favor of renaming these utilities. I like `cfd_healthcheck` but would maybe change that to `fluid_healthcheck`. The other two are fluid-specific too. As long as we're changing them, we may as well change `sim_visualization` to `fluid_visualization`.
I think I would vote for just calling it `sim_healthcheck` for the time being and fixing it in #257.
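For concreteness, here is a self-contained toy version of the wrapper shape sketched above: a dumb `sim_checkpoint` that runs injected callbacks on their intervals and tries a crash viz dump before re-raising. The names (`sim_checkpoint`, `check_step`, `StepperCrashError`) mirror the discussion, but the bodies are illustrative stand-ins, not the real simutil code.

```python
class StepperCrashError(RuntimeError):
    """Stand-in for the globally-raised error type discussed above."""


def check_step(step, freq):
    # Fire every `freq` steps; freq <= 0 disables the action.
    return freq > 0 and step % freq == 0


def sim_checkpoint(step, t, state, *, nstatus=0, ncheck=0, nviz=0,
                   healthcheck=None, write_vis=None, log=print):
    """Run each optional action when its interval comes up; on a crash,
    attempt a final viz dump before re-raising to the caller."""
    try:
        if check_step(step, nstatus):
            log(f"step {step}, t = {t}")
        if healthcheck is not None and check_step(step, ncheck):
            healthcheck(step, t, state)
        if write_vis is not None and check_step(step, nviz):
            write_vis(step, t, state)
    except StepperCrashError:
        if write_vis is not None:
            write_vis(step, t, state)   # crash dump
        raise
```

Because the exception is re-raised, the driver keeps the option of wrapping the call in its own `try`/`except` for additional handling.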
> I think this depends on whether the code in question is designed to maximize flexibility or simplicity. `sim_checkpoint` is more intended to do the latter, so I think it makes sense to wrap the `check_step` calls inside it.
imo, removing the check both simplifies and increases flexibility (the way it was before). I agree that if the sim_checkpoint ends up wrapping all the checks, that sim_checkpoint needs to perform the step checks, but I'd still argue that the call to check_step should be at the call site to sim_healthcheck and not down inside of it. But... this is also a very mild and minor complaint. It has been down inside the routine all this time.
> - I would be in favor of renaming these utilities. I like `cfd_healthcheck` but would maybe change that to `fluid_healthcheck`. The other two are fluid-specific too. As long as we're changing them, we may as well change `sim_visualization` to `fluid_visualization`.
>
> I think I would vote for just calling it `sim_healthcheck` for the time being and fixing it in #257.
👍
Is there a good reason that `sim_checkpoint` does not make a crash viz dump at exception time for serial runs?

I'm not going to pursue this complaint very far - but I think that we are going backwards a bit here. This work essentially strips the user of a good portion of the control over what happens when the health check fails. Health checks are meant to be a user-defined facility. Imo, MIRGE-Com should provide simple utilities to perform health checks, but it should be up to the user to handle what happens if any of those checks are failed. For me, it is a step backwards to roll a new monolithic "checkpoint driver" that tries to do and automate too much.

Edit: actually maybe this adding of callback routines to the stepper has had a bit of scope creep, too. Refactoring the checkpoint junk is a good deal of the business in #257.
> Is there a good reason that `sim_checkpoint` does not make a crash viz dump at exception time for serial runs?
?
> I'm not going to pursue this complaint very far - but I think that we are going backwards a bit here. This work essentially strips the user of a good portion of the control over what happens when the health check fails. Health checks are meant to be a user-defined facility. Imo, MIRGE-Com should provide simple utilities to perform health checks, but it should be up to the user to handle what happens if any of those checks are failed. For me, it is a step backwards to roll a new monolithic "checkpoint driver" that tries to do and automate too much.
Sounds like we don't have a consensus about what we're actually trying to do here... i.e., are we trying to maximize user control, or minimize code? They conflict with each other to an extent. I was under the impression that we were aiming more for the latter.
I think it would make sense to have the user pass in a health check function to sim_checkpoint. I'm less sure I see the value in having them handle errors manually. Can you give an example?
`sim_checkpoint` in its current state here does raise the exceptions again after handling them, so I guess theoretically a user could do:

```python
try:
    sim_checkpoint(...)
except HealthCheckError as e:
    # ... additional handling ...
```

> Edit: actually maybe this adding of callback routines to the stepper has had a bit of scope creep, too. Refactoring the checkpoint junk is a good deal of the business in #257.
I'm not sure if the original goal of this PR was to add a basic health check or just the callback functionality and nothing else, but if we want to add a health check here I think we have to deal with at least some of the stuff from #257 to make it work.
> Is there a good reason that `sim_checkpoint` does not make a crash viz dump at exception time for serial runs?
>
> ?

Never mind, I see it now:

```python
if comm is None:
    if terminate:
        raise exception
    return
```

Yeah, need a vis dump here @thomasgibson. Or maybe you could do something like:
```python
if comm is not None:
    from mpi4py import MPI
    terminate = comm.allreduce(terminate, MPI.LOR)
if terminate:
    # ...
```

to avoid multiple returns?
> Sounds like we don't have a consensus about what we're actually trying to do here... i.e., are we trying to maximize user control, or minimize code? They conflict with each other to an extent. I was under the impression that we were aiming more for the latter.
I agree with that. My suggestion to add the callback hooks was centered on the need to give the user more freedom to implement what they want in the callbacks - one of those things likely to be health checking. I wanted to let the user decide what to do once they had evaluated the state.
> I think it would make sense to have the user pass in a health check function to `sim_checkpoint`. I'm less sure I see the value in having them handle errors manually. Can you give an example?
Health check is exactly the example, I think. The user should be able to compose whatever health check they want. In fact, let's say the user wants to monitor specific volume (because, for example, it tends to go wild at an injection inlet or something). Then the user writes that health check function (or uses a built-in utility), and then ... if the check triggers, the user can either give up and quit, or turn on relaxation source terms to combat the issue.
> I'm not sure if the original goal of this PR was to add a basic health check or just the callback functionality and nothing else, but if we want to add a health check here I think we have to deal with at least some of the stuff from #257 to make it work.
It was my understanding that this PR was just trying to add the callbacks, however I see how its scope expanded, and I'm not necessarily opposed. I also want to make clear that my complaint about the structure and user control is not really a major one (from my perspective). We can make use of it how it is, but it has evolved to the point that it doesn't really accomplish the things I had in mind for introducing the callbacks, and imo tries to do too much with automated health checking.
(force-pushed 22ad5c4 to ca7c37b)
(force-pushed 900566f to 8284d8a)
```python
if terminate:
    # Log crash error message
    if rank == 0:
        logger.info(str(exception))
```
I think this will need to be tweaked, because exception doesn't necessarily exist on rank 0.
```python
    visualizer, vizname=vizname,
    step=step, t=t,
    viz_fields=viz_fields)
raise exception
```
This too, since exception doesn't exist on all ranks.
This one is a little trickier to deal with I think... Seems like we can either:
- Only re-raise the exception on the rank that originally raised it. I don't know if I like this since it makes any subsequent handling have to redo the MPI reduction.
- Try to pickle the exception and broadcast it to all ranks. I vaguely remember trying this and running into issues, but unfortunately I can't recall the details.
- Synchronize the error state before raising and then do the `raise` on all ranks.
Thoughts?
I think option 3 (which I believe the code now sort of does)
To do option 3 properly I think we would need to synchronize before raising for each check inside sim_healthcheck, so that in the case of an error, every rank creates and raises the exception. Currently it's possible that only a subset of ranks will raise inside sim_healthcheck, and then afterwards in sim_checkpoint we end up raising None on some of the ranks since the exception object was never created there.
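A minimal sketch of doing option 3 properly, with a callable standing in for the MPI logical-OR allreduce so it runs without MPI. The point is that every rank synchronizes first and then constructs and raises its own exception, so no rank is left holding `None`; `SyncedError` and `raise_synchronized` are hypothetical names.

```python
class SyncedError(RuntimeError):
    """Hypothetical globally-raised error (cf. SynchronizedError above)."""


def raise_synchronized(local_message, allreduce_lor):
    """Synchronize the error state, then raise on EVERY rank.

    local_message: this rank's error string, or None if this rank is fine.
    allreduce_lor: stand-in for ``comm.allreduce(..., MPI.LOR)``.
    """
    if allreduce_lor(local_message is not None):
        # Ranks with no local error still create and raise an exception,
        # so no rank proceeds alone into a later collective call and hangs.
        raise SyncedError(local_message or "error detected on another rank")
```

Each check inside `sim_healthcheck` could funnel through a helper like this instead of raising rank-locally.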
> To do option 3 properly I think we would need to synchronize before raising for each check inside `sim_healthcheck` (...)
Why not something along the lines of this that we talked about earlier?
```python
def fluid_healthcheck(...):
    nans_present = check_for_local_naninf()
    if nans_present:
        message += "NANs or INFs present in solution"
    health_problem = check_for_whatever_local_problem(...)
    if health_problem:
        message += "Some other health issue."
    unhealthy = nans_present | health_problem | yadda | yadda | yadda
    unhealthy = comm.allreduce(unhealthy, MPI.LOR)
    if unhealthy:
        log(message)
        raise TheException(message)
```
> To do option 3 properly I think we would need to synchronize before raising for each check inside `sim_healthcheck` (...)
>
> Why not something along the lines of this that we talked about earlier? [...]
Oops, yeah, I didn't mean to exclude this. I just meant that the error state needs to be synchronized before exceptions are raised.
closing in favor of #381
Adds pre/post-step callbacks to `advance_state`. This also removes the `checkpoint` argument since it is technically a callback. All examples have been updated. I also threw in a health-check callback that is exercised in the examples. Will write a short unit test for it as well.
cc @MTCam @anderson2981