Conversation
Updated node references and added new symbols for categorization.
|
At first glance it seems to be a good idea! I know at some point Ben and stuff got a Also, will you maintain this page? Will you track whether nodes are still up and still bad? I'm always afraid of wrong or stale public-facing information. BTW, the title is wrong: "feat: Create saforem2/track-bad-nodes branch #69". I suppose you want to merge your branch into main, not create a new branch from your fork in this repo. |
saforem2/track-bad-nodes branch|
I'm OK to merge it in! (Sorry for the delay.) Will you be able to maintain it? The Thursday “aurora status and issues” meeting would be a good place to point it out and ask people to add to it if needed. |
|
The problem is that Intel changes dozens (hundreds) of blades each day, so 🤷🏽 |
|
No, these are both really good points, actually. I suppose if nodes are being replaced that frequently, then this is probably a bit less useful. |
|
My initial thought was something along the lines of: Say your large scale application is cruising along when, unexpectedly, it crashes with some hard-to-immediately-parse error that you suspect is hardware related.
Then (or at least this is what I would imagine?), when a blade is replaced, remove that entry from the table (or maybe edit it to read something like "HARDWARE_REPLACED: m/d/yyyy"). Though I guess this makes me wonder: what are the criteria for determining whether specific hardware needs to be replaced? I guess that is what I was trying to accomplish, though maybe this already exists 🤷🏻‍♂️ |
Copilot Summary
This pull request adds a new markdown document, nodes.md, to track nodes that have been identified as problematic based on recent benchmarking. The document organizes nodes into those that are slow and those that are hanging, and provides references to the source of this information.
New documentation for problematic nodes:
nodes.md to maintain a running list of nodes with performance or reliability issues, categorized by slow performance and communication hangs.