Skip to content

Recallbacks#381

Merged
MTCam merged 85 commits into
mainfrom
recallbacks
Jul 9, 2021
Merged

Recallbacks#381
MTCam merged 85 commits into
mainfrom
recallbacks

Conversation

@MTCam
Copy link
Copy Markdown
Member

@MTCam MTCam commented Jun 11, 2021

Continued from PR (#370):

This builds on @thomasgibson callbacks and uses the opportunity to clean up some cumbersome junk in the infrastructure that was gumming up the use of the code in production. Instead of building the callback and health checking facilities on old junky code that we didn't like, we refactored it a bit here, to give the user more flexibility, and responsibility, for the behavior of the simulation application.

  • Updates the old callback (checkpoint) to the explicitly named (pre_step_callback)
  • Takes apart the clunky sim_checkpoint, sim_vizualization, for a more direct, much less "fluid-specific" interface
  • Allows full customization of status messages, viz fields, and health checking
  • Implements minimally collective health-check (only one collective comm to sync)
  • Optionally dumps viz at health fail (automatically collective)
  • Correctly places control over what to do with errors in the hands of the user/driver

@MTCam MTCam marked this pull request as ready for review June 11, 2021 17:41
Comment thread examples/autoignition-mpi.py Outdated
Comment thread examples/autoignition-mpi.py Outdated
Co-authored-by: Thomas H. Gibson <gibsonthomas1120@hotmail.com>
@MTCam MTCam requested review from majosm and thomasgibson July 9, 2021 01:38
@MTCam
Copy link
Copy Markdown
Member Author

MTCam commented Jul 9, 2021

Ready for another peak. @majosm and @thomasgibson.

Comment thread examples/autoignition-mpi.py Outdated
Copy link
Copy Markdown
Collaborator

@majosm majosm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I left a few minor comments.

Comment thread examples/autoignition-mpi.py
Comment thread mirgecom/steppers.py Outdated
Comment thread mirgecom/steppers.py Outdated
Comment thread mirgecom/steppers.py Outdated
Comment thread mirgecom/steppers.py Outdated
Comment thread mirgecom/steppers.py Outdated
Comment thread mirgecom/steppers.py Outdated
Comment thread examples/autoignition-mpi.py Outdated
@MTCam
Copy link
Copy Markdown
Member Author

MTCam commented Jul 9, 2021

I've built-in a health failure to autoignition example, so CI should fail that example. Don't be alarmed :)

Comment thread examples/autoignition-mpi.py Outdated
Comment on lines +372 to +373
if step == 5: # quick test
health_errors = True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was left-over testing that can be removed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - just wanting to demonstrate the health error while we have this PR cracked open. I can certainly remove it right away if it is distracting.

Comment thread examples/autoignition-mpi.py Outdated
Comment on lines +369 to +371
health_errors = my_health_check(dv)
if comm is not None:
health_errors = comm.allreduce(health_errors, op=MPI.LOR)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it's worth adding in a utility function in simutils that does this reduction across ranks? Just to clean this up. I think @majosm has some code somewhere that does this. This is not a functional request, just aesthetics.

Copy link
Copy Markdown
Member Author

@MTCam MTCam Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health_errors = allsync(my_health_check(dv), op=MPI.LOR) ?
Oh woops, we'd need to pass in comm too. Hrm...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, comm is already being used around so is it that bad pass it? I don't see the problem.

Copy link
Copy Markdown
Collaborator

@majosm majosm Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the code snippet @thomasgibson was referring to:

mirgecom/mirgecom/mpi.py

Lines 161 to 166 in e47d3d0

def comm_any(comm, local_value):
"""Return True if *local_value* is True on any rank in *comm*."""
reduced_value = np.array([int(local_value)])
from mpi4py import MPI
comm.Allreduce(MPI.IN_PLACE, reduced_value, op=MPI.MAX)
return bool(reduced_value[0])

(Don't copy the implementation, it's dumb. Just use MPI.LOR.)

Edit: I'm not sure this adds anything as-is. You might be able to stick the if comm is not None: check in here though.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem at all!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just real quick before I implement..... @thomasgibson and @majosm

How would we feel about this utility:

def allsync(local_value(s), comm, op=MPI.MAX):
   if comm is None:
     return local_value(s)
   return comm.allreduce(local_values, op=op)
 
global_value(s) = allsync(local_value(s), comm, op=MPIOP)

This way the user can use the allsync utility to do syncing operations on arrays of data, or other objects as supported by comm, and control over what operation is performed.

Copy link
Copy Markdown
Contributor

@thomasgibson thomasgibson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MTCam I just left some minor comments, but overall I'm happy with the auto-ignition example. I think once the changes are applied to the other examples, we can finally merge this.

@MTCam
Copy link
Copy Markdown
Member Author

MTCam commented Jul 9, 2021

I think this is ready for your eyes again @thomasgibson and @majosm. The only changes were the ones we discussed (5306e30), (5bd30d8).

Copy link
Copy Markdown
Collaborator

@majosm majosm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. 👍

@MTCam MTCam merged commit acf4458 into main Jul 9, 2021
@MTCam MTCam deleted the recallbacks branch July 9, 2021 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants