Start tracking bad nodes#69

Open
saforem2 wants to merge 2 commits into main from saforem2/track-bad-nodes

Conversation

@saforem2
Member

Copilot Summary

This pull request adds a new markdown document, nodes.md, to track nodes that have been identified as problematic based on recent benchmarking. The document organizes nodes into those that are slow and those that are hanging, and provides references to the source of this information.

New documentation for problematic nodes:

  • Created nodes.md to maintain a running list of nodes with performance or reliability issues, categorized by slow performance and communication hangs.
  • Included references and context from benchmarking reports, with direct links and details for further investigation.

Updated node references and added new symbols for categorization.
@TApplencourt
Contributor

At first glance it seems to be a good idea!

I know at some point Ben and others got a canvas going in one channel. Nodes go down and come back up pretty fast, so I'm not sure a Markdown document is the best place for such info. But I will let other people chime in.

Also, will you maintain this page? Track whether nodes are still up and still bad? I'm always afraid of "wrong / stale public-facing information".


BTW, the title is wrong ("feat: Create saforem2/track-bad-nodes branch #69"). I suppose you want to merge your branch into main, not create a new branch from your fork in this repo.

@saforem2 saforem2 changed the title feat: Create saforem2/track-bad-nodes branch Start tracking bad nodes Sep 8, 2025
@colleeneb

I'm OK to merge it in! (Sorry for the delay.) Will you be able to maintain it? The Thursday “Aurora status and issues” meeting would be a good place to point it out and ask people to add to it if needed.

@TApplencourt
Contributor

The problem is that Intel changes dozens (hundreds) of blades each day, so 🤷🏽

@saforem2
Member Author

saforem2 commented Sep 16, 2025

No, these are both really good points, actually.

I suppose if nodes are being replaced that frequently, then this is probably a bit less useful.

@saforem2
Member Author

saforem2 commented Sep 16, 2025

My initial thought was something along the lines of:

Say your large scale application is cruising along when, unexpectedly, it crashes with some hard-to-immediately-parse error that you suspect is hardware related.

  1. Are any of the hosts in my "${PBS_NODEFILE}" present in the table from nodes.md?
    1. 🚨 Match found: create a new hostfile without them and re-run [1]. Does it now run successfully?
      1. ✅ Application completes successfully.
      2. ❌ Application crashes again.
        1. Identify the problematic node [2] and ADD IT TO THE BAD NODE LIST (with a date!)
    2. ⚠️ No match found. Try to identify the problematic node [2] and ADD IT TO THE BAD NODE LIST (with a date!)

Then (maybe, what I would imagine?) when a blade is replaced, remove that entry from the table (or edit it to read something like "HARDWARE_REPLACED: m/d/yyyy").

Though I guess this makes me wonder: what are the criteria for determining whether specific hardware needs to be replaced? I guess that is what I was trying to accomplish, though maybe this already exists 🤷🏻‍♂️
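Step 1 of the checklist above could be sketched roughly like this. Everything here is a hypothetical illustration: the table layout, the hostnames, and the helper names are assumptions, not the actual nodes.md format.

```python
# Hypothetical sketch: drop hosts listed in a nodes.md-style table from the
# contents of $PBS_NODEFILE before re-running. Assumes one hostname in the
# first column of each markdown table row; the real nodes.md may differ.
import re

def parse_bad_nodes(nodes_md: str) -> set[str]:
    """Collect hostnames from the first column of markdown table rows."""
    bad = set()
    for line in nodes_md.splitlines():
        # Require a word character first so the |---| separator row is skipped.
        m = re.match(r"\|\s*(\w[\w.-]*)\s*\|", line)
        if m and m.group(1).lower() not in {"node", "host", "hostname"}:
            bad.add(m.group(1))
    return bad

def filter_hostfile(hostfile_text: str, bad: set[str]) -> list[str]:
    """Return the hosts from the hostfile that are NOT on the bad list."""
    return [h for h in hostfile_text.split() if h not in bad]

# Made-up example data standing in for nodes.md and $PBS_NODEFILE.
nodes_md = """\
| Node          | Added      | Issue |
|---------------|------------|-------|
| x4305c7s2b0n0 | 2025-09-08 | slow  |
"""
hosts = "x4305c7s2b0n0\nx4306c0s5b0n0\nx4311c1s2b0n0\n"
good = filter_hostfile(hosts, parse_bad_nodes(nodes_md))
print(good)  # → ['x4306c0s5b0n0', 'x4311c1s2b0n0']
```

Writing `good` out to a new file and passing it as the hostfile would then correspond to the "re-run without them" step (subject to the over-provisioning caveat in footnote 1).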

Footnotes

  1. Though this would likely require manually over-provisioning nodes, etc., depending on the application.

  2. At least what I do is binary search $PBS_NODEFILE to try to identify the problematic node (?? 🤷🏻‍♂️)
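The binary search in footnote 2 could look roughly like the sketch below. It assumes exactly one node causes the failure, and `run_job` is a hypothetical stand-in for "submit the job on these hosts and report whether it completed"; here it is faked for illustration.

```python
# Hypothetical sketch: bisect the host list from $PBS_NODEFILE to isolate
# a single problematic node, assuming exactly one node makes the job fail.
def find_bad_node(hosts: list[str], run_job) -> str:
    """Repeatedly run the job on half of `hosts` to narrow down the bad node."""
    while len(hosts) > 1:
        mid = len(hosts) // 2
        left = hosts[:mid]
        # If the left half fails, the bad node is in it; otherwise it must
        # be in the right half (single-fault assumption).
        hosts = left if not run_job(left) else hosts[mid:]
    return hosts[0]

# Made-up hostnames; the fake runner "fails" whenever the bad node is included.
bad = "x4311c1s2b0n0"
hosts = sorted(f"x43{i:02d}c0s0b0n0" for i in range(8)) + [bad]
print(find_bad_node(hosts, lambda hs: bad not in hs))  # → x4311c1s2b0n0
```

Each iteration costs one job launch, so isolating one node out of N takes about log2(N) re-runs; with intermittent failures or multiple bad nodes this simple bisection is no longer reliable.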
