Start tracking bad nodes#69

Open
saforem2 wants to merge 2 commits into main from saforem2/track-bad-nodes

Conversation

@saforem2
Member

Copilot Summary

This pull request adds a new markdown document, nodes.md, to track nodes that have been identified as problematic based on recent benchmarking. The document organizes nodes into those that are slow and those that are hanging, and provides references to the source of this information.

New documentation for problematic nodes:

  • Created nodes.md to maintain a running list of nodes with performance or reliability issues, categorized by slow performance and communication hangs.
  • Included references and context from benchmarking reports, with direct links and details for further investigation.

Updated node references and added new symbols for categorization.
@TApplencourt
Contributor

At first glance it seems to be a good idea!

I know at some point Ben and others got a canvas going in one channel. Nodes go down and come back up pretty fast, so I'm not sure a Markdown document is the best place for such info. But I will let other people chime in.

Also, will you maintain this page? Track whether nodes are still up and still bad? I'm always afraid of "wrong / stale public-facing information".


BTW, the title is wrong ("feat: Create saforem2/track-bad-nodes branch #69"). I suppose you want to merge your branch into main, not create a new branch from your fork in this repo.

@saforem2 saforem2 changed the title feat: Create saforem2/track-bad-nodes branch Start tracking bad nodes Sep 8, 2025
@colleeneb

I'm OK to merge it in! (Sorry for the delay.) Will you be able to maintain it? The Thursday “Aurora status and issues” meeting would be a good place to point it out and ask people to add to it if needed.

@TApplencourt
Contributor

The problem is that Intel changes dozens (hundreds) of blades each day, so 🤷🏽

@saforem2
Member Author

saforem2 commented Sep 16, 2025

No, these are both really good points, actually.

I suppose if nodes are being replaced that frequently, then this is probably a bit less useful.

@saforem2
Member Author

saforem2 commented Sep 16, 2025

My initial thought was something along the lines of:

Say your large scale application is cruising along when, unexpectedly, it crashes with some hard-to-immediately-parse error that you suspect is hardware related.

  1. Are any of the hosts in my "${PBS_NODEFILE}" present in the table from nodes.md?
    1. 🚨 Match found: create a new hostfile without them and re-run [1]. Does it now run successfully?
      1. ✅ Application completes successfully.
      2. ❌ Application crashes again.
        1. Identify the problematic node [2] and ADD IT TO THE BAD NODE LIST (with a date!)
    2. ⚠️ No match found. Try to identify the problematic node [2] and ADD IT TO THE BAD NODE LIST (with a date!)

Then (maybe, what I would imagine?) when a blade is replaced, remove that entry from the table (or edit it to read something like "HARDWARE_REPLACED: m/d/yyyy").

Though I guess this makes me wonder: what are the criteria for determining whether specific hardware needs to be replaced? I guess that is what I was trying to accomplish, though maybe this already exists 🤷🏻‍♂️
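Step 1 of the checklist above could be sketched roughly like this. Everything here is a hypothetical illustration: the table layout, the hostnames, and the helper names are assumptions, not the actual nodes.md format.

```python
# Hypothetical sketch: drop hosts listed in a nodes.md-style table from the
# contents of $PBS_NODEFILE before re-running. Assumes one hostname in the
# first column of each markdown table row; the real nodes.md may differ.
import re

def parse_bad_nodes(nodes_md: str) -> set[str]:
    """Collect hostnames from the first column of markdown table rows."""
    bad = set()
    for line in nodes_md.splitlines():
        # Require a word character first so the |---| separator row is skipped.
        m = re.match(r"\|\s*(\w[\w.-]*)\s*\|", line)
        if m and m.group(1).lower() not in {"node", "host", "hostname"}:
            bad.add(m.group(1))
    return bad

def filter_hostfile(hostfile_text: str, bad: set[str]) -> list[str]:
    """Return the hosts from the hostfile that are NOT on the bad list."""
    return [h for h in hostfile_text.split() if h not in bad]

# Made-up example data standing in for nodes.md and $PBS_NODEFILE.
nodes_md = """\
| Node          | Added      | Issue |
|---------------|------------|-------|
| x4305c7s2b0n0 | 2025-09-08 | slow  |
"""
hosts = "x4305c7s2b0n0\nx4306c0s5b0n0\nx4311c1s2b0n0\n"
good = filter_hostfile(hosts, parse_bad_nodes(nodes_md))
print(good)  # → ['x4306c0s5b0n0', 'x4311c1s2b0n0']
```

Writing `good` out to a new file and passing it as the hostfile would then correspond to the "re-run without them" step (subject to the over-provisioning caveat in footnote 1).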

Footnotes

  1. Though this would likely require manually over-provisioning nodes, etc., depending on the application.

  2. At least what I do is binary search $PBS_NODEFILE to try to identify the problematic node (?? 🤷🏻‍♂️)
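The binary search in footnote 2 could look roughly like the sketch below. It assumes exactly one node causes the failure, and `run_job` is a hypothetical stand-in for "submit the job on these hosts and report whether it completed"; here it is faked for illustration.

```python
# Hypothetical sketch: bisect the host list from $PBS_NODEFILE to isolate
# a single problematic node, assuming exactly one node makes the job fail.
def find_bad_node(hosts: list[str], run_job) -> str:
    """Repeatedly run the job on half of `hosts` to narrow down the bad node."""
    while len(hosts) > 1:
        mid = len(hosts) // 2
        left = hosts[:mid]
        # If the left half fails, the bad node is in it; otherwise it must
        # be in the right half (single-fault assumption).
        hosts = left if not run_job(left) else hosts[mid:]
    return hosts[0]

# Made-up hostnames; the fake runner "fails" whenever the bad node is included.
bad = "x4311c1s2b0n0"
hosts = sorted(f"x43{i:02d}c0s0b0n0" for i in range(8)) + [bad]
print(find_bad_node(hosts, lambda hs: bad not in hs))  # → x4311c1s2b0n0
```

Each iteration costs one job launch, so isolating one node out of N takes about log2(N) re-runs; with intermittent failures or multiple bad nodes this simple bisection is no longer reliable.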
