Tweaks to error handling in #370 #376
Conversation
thomasgibson
left a comment
Thanks! I like the error handling a lot. A few initial comments
```diff
 sim_checkpoint(discr, visualizer, eos, q=state,
-               exact_soln=initializer, vizname=casename, step=step,
-               t=t, dt=dt, nstatus=nstatus, nviz=nviz,
+               exact_soln=initializer, step=step,
+               t=t, dt=dt, nstatus=nstatus, healthcheck_callback=healthcheck,
+               nviz=nviz, vis_callback=write_vis,
                exittol=exittol, constant_cfl=constant_cfl)
```
To be honest, this feels weird passing these very specific callbacks as kwargs to `sim_checkpoint`. Is there any way we can rework this to include an arbitrary list of callables, and just call out to those?
Yeah I was wondering about that. Could pass a list of "actions" with a frequency associated with each, or something like that. Might be best to save that for #257 though.
```python
state, stop = exception_callback(istep, t, state,
                                 synced_exception.exception)
if stop:
    return istep, t, state
```
Why is this not break like here: https://github.com/illinois-ceesd/mirgecom/pull/376/files#diff-e13bce10f3dc4e997e0c56c91b3d0309d0ebf5ceaecd41ae16bca424c91ff24fR125?
The diff makes it hard to see, but this section is actually outside of the timestepping loop.
This probably won't come as a surprise given my objections to other developments along these lines, but imo this is becoming far too complicated. Imo, the stepper should not be in the business of handling exceptions, like ever. Callbacks are collective - once the callback is called, all control has passed back to the driver. The user has an optional chance to evaluate the state here however he chooses, and to exit the simulation if he wants. The control never needs to be handed back to the stepper if the plan is to exit, but in any case tossing an exception for the stepper to handle is just very backwards to me. There is a lot of effort going into this exception tossing and handling by the various components, all of which (imo) are becoming too complicated by these changes that do not seem to support user need.
If we want to get back the partially-stepped results after an error occurs, the stepper and error handling become intertwined one way or another. If we don't handle the exceptions inside the stepper, we have to do like we did before and store the last good stepper state inside the exceptions, which in general is pretty awkward.
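To make the contrast concrete, here is a minimal sketch of the "store the last good state inside the exception" approach being described. All names (`SteppingError`, `run_stepper`) and the toy forward-Euler step are hypothetical stand-ins, not mirgecom's actual API:

```python
# Hypothetical sketch (not mirgecom's actual code): the exception carries
# the last good stepper state so the driver can recover it after a failure.
import math


class SteppingError(RuntimeError):
    def __init__(self, message, last_step, last_t, last_state):
        super().__init__(message)
        self.last_step = last_step
        self.last_t = last_t
        self.last_state = last_state


def run_stepper(rhs, state, t, dt, nsteps):
    istep = -1
    try:
        for istep in range(nsteps):
            new_state = state + dt * rhs(t, state)  # toy forward-Euler step
            if math.isnan(new_state):
                # A NaN check stands in for a real health check here.
                raise ValueError("solution diverged")
            state, t = new_state, t + dt
    except Exception as err:
        # Wrap and re-raise with the last *good* state attached.
        raise SteppingError(str(err), istep, t, state) from err
    return istep, t, state
```

The driver then catches `SteppingError` and reads `.last_state`. The awkwardness is that every failure path has to remember to wrap itself this way before the state escapes the stepper.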
I'm not sure I follow what you're describing here. Elaborate?
In what sense is this not supporting user need?
I don't follow this. If the stepper calls the callback with the state, which it does, then the user already has the state. The user can then check the state, and if the state is in some condition that requires action then the user can take that action: exit, or decide something else to do. Full stop. Nothing else is required.
If the user wants to exit, then just exit. The user already has full control and all ranks present in the callback. If the user wants to modify the state, then he does, and control (and the state) go back to the stepper. No exception handling is needed there.
The user needs a mechanism for checking the state during stepping. The callback hooks provide that. The user might need a mechanism for modifying the state during stepping - the callback that returns the state provides that. Imo, that ends the user need section. If the user wants to do I/O in the callback, he does it. If he wants to do healthchecking (by whatever the user determines indicates the health of his simulation), then he does it. If he wants to exit before stepping is done, just exit. In my opinion we are trying to automate too much, and assume way too much about what the user needs in their checkpointing or health-checking.
I wanted to say just one more thing that might clarify my perspective on this... If we provided callbacks only and no other infrastructure (i.e. no health checking, no checkpointing, no status messaging etc), then someone like @anderson2981 would use the callback to do the things that he needs to do - he'd add a check on specific volume, say, or maybe mass fractions, he'd check to see if he wants to do a viz dump and then do it. Imo, we should let the users define what it is they need to do in their particular simulation, and these things are so wildly varied that rolling some automated system for handling all that is just awkward, not very flexible, and will almost always be less than or grossly more than what the user needs. We could however provide several utilities that users can use to roll their own checkpoint/callback procedures - and this would be very useful to them. Those utilities may throw exceptions, or whatever - and I think that is fine - but handling of those (imo) is squarely in the domain of user option.
To elaborate a little on what I was trying to say previously: I was contrasting the approach laid out here vs. the current approach in main. Option 1 (here): handle the exception in the stepper via a callback; the user decides whether to continue or stop. Option 2 (main): store the last good state inside the exception and let the driver recover it. Both involve intertwining the stepping with error handling to some degree.
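A toy sketch of the Option 1 flow, assuming names that loosely follow the PR text (`advance_state`, `exception_callback`); the stepping itself is a stand-in, not mirgecom's actual `advance_state`:

```python
# Illustrative only: the stepper catches the error and defers to a
# user-supplied exception_callback, which returns a (state, stop) pair.
def advance_state(rhs, state, t, dt, nsteps, exception_callback=None):
    istep = -1
    for istep in range(nsteps):
        try:
            state = state + dt * rhs(t, state)  # toy explicit step
            t += dt
        except Exception as err:
            if exception_callback is None:
                raise  # no handler: propagate to the driver as before
            # The user decides: patch up the state and continue, or stop.
            state, stop = exception_callback(istep, t, state, err)
            if stop:
                break
    return istep, t, state
```

A callback that just wants the partial results returns `(state, True)`; returning `(state, False)` resumes stepping from the repaired state.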
Hang on -- which callback are we talking about here? The checkpoint? If I'm understanding correctly and you're saying that we should just handle errors that occur in the checkpoint immediately as they come up; I guess my only dislike about this is that the user would have to add separate error handling for errors that occur in other parts of the stepping (I'm thinking timestep size computation, RHS, maybe elsewhere if we have more sophisticated timestepping through Leap?).
I don't think this precludes any of that.
Currently in main, there is no exception thrown. The stepper calls something called … Anyway, this discussion is getting too long to respond to piece-by-piece. It would be terrific if we could meet to talk about this stuff.
I'm talking about the pre-step callback called from the stepper. That's really the only callback that we currently use, and the only one I can really think of that we need. We already wrap the call to the stepper in a try/catch, so we already catch anything thrown by anything that the stepper calls.
I think that

```python
def my_callback(state, istep, time, dt, whatever):
    if check_step(istep, time, vizfreq):
        do_viz_dump(...)
    if check_step(istep, time, statusfreq):
        do_my_status(...)
    spec_vol = 1 / state.mass
    if discr.nodal_max(spec_vol) > spec_vol_limit:
        do_viz_dump(...)
        log.info("Quitting due to specific volume violation.")
        exit
```

is OK, the callback is already collective.
I'm talking about this and this.
As I said, this is ok if you only care about errors that occur inside the checkpoint. If errors can occur elsewhere (timestep computation, RHS, timestepper itself?) then this doesn't help.
This requires the user to add separate error handling for checkpointing errors vs. other errors. I don't see why we necessarily need to treat them separately.
I repeat: whether …
Exceptions are the standard way to handle errors in Python. I don't see a compelling reason to avoid them. There really isn't much of a practical difference between:

```python
log.info("Quitting due to specific volume violation.")
exit
```

and

```python
raise RuntimeError("Specific volume violation.")
```

In principle doing the above after the …
Both of these come from an exception tossed by the checkpoint routine - which should be eliminated. I understand you don't want to do it here, or in TG's PR, but my argument is that we shouldn't spend time coding around something that shouldn't even exist.
Why is the try/catch around stepping not sufficient to catch any exceptions tossed by anything inside stepping (or that stepping calls)? I would argue that no exceptions should ever be thrown by any library component unless it is thrown collectively. This is a good strategy for parallel codes. Library routines should behave the same regardless of domain-decomp. If one rank tosses in some library routine, then they all should. If we adopt this policy, many of our issues surrounding exception handling can be mitigated.
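A minimal sketch of that "collective raise" policy. The allreduce-style primitive is passed in as a callable so the sketch runs without MPI; with mpi4py it would be something like `comm.allreduce(flag, op=MPI.LOR)`. The function name is hypothetical:

```python
# Sketch of a collective raise: if any rank hit an error, every rank
# raises together, so no rank is left hanging in a later collective.
def collective_check(local_error, allreduce_or):
    """local_error: an Exception instance, or None, on this rank.
    allreduce_or: callable computing a logical OR across all ranks."""
    if allreduce_or(local_error is not None):
        # Raised on all ranks, including those whose local call succeeded.
        raise RuntimeError(f"failure on some rank: {local_error}")
```

This keeps library routines behaving identically regardless of domain decomposition: either every rank returns normally, or every rank raises.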
I disagree. The user often needs to know which components fail. And if the user does not want to call all the stuff done by (the current checkpoint, or healthcheck) then he doesn't have to.
I'm not advocating for avoiding all exceptions, I'm only pointing out that no exceptions are needed to do the things we need the simulation to do in this callback example. No callback should ever need to throw an exception to the stepper - it just is not needed. My main point is that all of this work being done to encapsulate health checking and checkpointing, and handling all their errors automatically by the library components is unnecessary and does not address a simulation need; so why do it?
I'm fine with this. I don't care how the user exits or causes the simulation to exit, only that doing so is in the hands of the user, not the library.
I still don't buy it. The only reason this additional machinery is needed from my perspective is in the processing of errors tossed by routines that should not exist to begin with. I'm definitely just fine with the prospect of seeing a lot of code duplication in the drivers. Drivers are the user's business, and almost everything should be done in the driver, including error handling.
I guess we're moving over to #381, but just to address a few lingering things here...
A try/catch around …
This isn't really feasible. Any subpackage can raise an exception at any time, and we don't have control over this.
Note that …
It is though? The user is choosing to call …
All of this extra machinery can be applied to #381, too, and IMO would still be nice to have.
🤷♂️ I don't really share this view.
I didn't recognize the subtle points about the state before you explained this. (at least it's subtle to me). This is a fair point. As for getting the state back if a sub-package fails or throws an exception: I think we don't care much about that case. The main failure modes are: …
I totally agree and recognize this point. I only mean MIRGE-Com library. We just have to accept it if a subpackage fails and kills the sim. This is a rare failure mode beyond initialization and first step.
It turns out that the structure where we use callbacks into the (I would argue overly restrictive) utility drivers (like …) … I totally agree that many simple situations can be encapsulated inside some ready-to-use constructs, and these constructs can be arbitrarily complex, accepting callbacks, etc. They should not be required or necessary, however. I think it is important for us to hang all this out in the driver (for now) for the user (and us), and then have a think about how we can encapsulate these functionalities in a way we can all get on board with. The prospect of more callbacks and the circular error handling issues surrounding the retention of stuff like …
Let's duke it out in #257? FWIW, I really like these discussions.
I wouldn't say I was thinking so much about subpackages, mainly just other parts of mirgecom. But I'll admit I don't have an entirely clear idea of whether anything outside of timestep computation and checkpointing would ever fail mid-stepping. @inducer Can you think of anything?
That's fair. A 'reset' could be helpful. Let's continue in #381. |
This PR tweaks the error handling setup in #370 in the following ways:
- Changed `SynchronizedError` from a base class to a type that wraps another exception. `SynchronizedException` now indicates to the handling code that the contained exception has been synchronized across MPI ranks (via `bcast_from_lowest_rank`).
- Moved the `try/except` from `sim_checkpoint` into `advance_state`.
- Added an `exception_callback` to `advance_state`. This function can either re-raise the exception after handling, or it can return a `state, stop` tuple causing stepping to either resume or halt according to what the user decides.

In the process I also added a couple of callbacks to `sim_checkpoint` (`healthcheck_callback` and `vis_callback`).

(Note: I only changed `vortex-mpi` for now, so the CI will probably fail.)

Thoughts @thomasgibson, @MTCam?
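For readers following along, a hedged sketch of how the wrapper described above might look. The class and function names follow the PR text, but the body (especially the serial `bcast_from_lowest_rank` stand-in in the test) is an assumption, not mirgecom's actual code:

```python
# Illustrative only: wrap an exception to mark it as already synchronized
# across MPI ranks, so handlers know every rank agrees on the failure.
class SynchronizedException(Exception):
    def __init__(self, inner):
        super().__init__(f"synchronized: {inner!r}")
        self.exception = inner  # matches the .exception access in the diff


def synchronize_exception(local_exc, bcast_from_lowest_rank):
    """bcast_from_lowest_rank is assumed to pick the exception (or None)
    reported by the lowest failing rank and broadcast it to all ranks."""
    exc = bcast_from_lowest_rank(local_exc)
    return SynchronizedException(exc) if exc is not None else None
```

Handling code can then treat a `SynchronizedException` as safe to act on collectively, since every rank is guaranteed to hold the same wrapped exception.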