37 changes: 37 additions & 0 deletions nodes.md
@@ -0,0 +1,37 @@
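# ⚠️ Running List of (Potentially) Problematic Nodes

Legend: 🚨 = node hung in communication calls on every attempt; 🚧 = node ran more than 10% slower than the others in benchmarking.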


| Node | Status and reference |
| :-------------: | :------------------------: |
| `x4002c2s1b0n0` | 🚨[^slow-vaino-2025-08-23] |
| `x4717c7s2b0n0` | 🚨[^slow-vaino-2025-08-23] |
| `x4514c6s1b0n0` | 🚧[^slow-vaino-2025-08-23] |
| `x4208c0s5b0n0` | 🚧[^slow-vaino-2025-08-23] |
| `x4311c0s5b0n0` | 🚧[^slow-vaino-2025-08-23] |
| `x4102c4s5b0n0` | 🚧[^slow-vaino-2025-08-23] |
| `x4314c7s0b0n0` | 🚧[^slow-vaino-2025-08-23] |
| `x4608c3s1b0n0` | 🚧[^slow-vaino-2025-08-23] |
| `x4711c3s6b0n0` | 🚧[^slow-vaino-2025-08-23] |
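
One way to act on this list is to drop the flagged nodes from a job's host list before launching. The sketch below is only an illustration: it assumes a hostfile with one hostname per line (possibly fully qualified), and `filter_hosts` is a hypothetical helper, not part of any scheduler or launcher.

```python
# Node IDs copied from the table above.
PROBLEM_NODES = {
    "x4002c2s1b0n0", "x4717c7s2b0n0",                   # hanging in communication calls
    "x4514c6s1b0n0", "x4208c0s5b0n0", "x4311c0s5b0n0",  # >10% slower than others
    "x4102c4s5b0n0", "x4314c7s0b0n0", "x4608c3s1b0n0",
    "x4711c3s6b0n0",
}


def filter_hosts(hosts, excluded=PROBLEM_NODES):
    """Return the hosts whose short name is not in `excluded`.

    Assumption: each entry is one hostname, optionally fully qualified;
    the short name is the text before the first '.'.
    """
    kept = []
    for line in hosts:
        host = line.strip()
        if not host:
            continue  # skip blank lines
        if host.split(".")[0] in excluded:
            continue  # skip flagged nodes
        kept.append(host)
    return kept
```

For example, `filter_hosts(open(path))` would return the usable hosts from a hostfile at `path`, where `path` is whatever file your job setup provides.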

[^slow-vaino-2025-08-23]:

<details closed><summary>[<b>2025-08-23</b>] (Väinö Hatanpää, on Slack):</summary>

- [Report](https://cels-anl.slack.com/archives/C058HKVJ0QL/p1755957183678639):

> I did some full machine PyTorch benchmarking and there were some
> troublemaker nodes.
> These were >10% slower than others (sample size 1-3, could be randomness,
> but some were reoccurring):
>
> ```bash
> x4514c6s1b0n0 x4208c0s5b0n0 x4311c0s5b0n0 x4102c4s5b0n0 x4314c7s0b0n0 x4608c3s1b0n0 x4711c3s6b0n0
> ```
>
> And these two were hanging with some communication calls, every time I tried:
>
> ```bash
> x4002c2s1b0n0 x4717c7s2b0n0
> ```

</details>
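
The ">10% slower than others" criterion in the report could be checked along these lines. This is a sketch under assumptions, not the method actually used: the baseline is taken to be the median of per-node mean runtimes, and `flag_slow_nodes` and its `timings` input format are hypothetical.

```python
from statistics import median


def flag_slow_nodes(timings, threshold=0.10):
    """Flag nodes whose mean runtime exceeds the baseline by more than `threshold`.

    timings: dict mapping node name -> list of runtimes (e.g. seconds).
    Assumption: the baseline is the median of the per-node mean runtimes.
    """
    # Average the samples per node, ignoring nodes with no samples.
    means = {node: sum(t) / len(t) for node, t in timings.items() if t}
    if not means:
        return []
    baseline = median(means.values())
    return sorted(node for node, m in means.items() if m > baseline * (1 + threshold))
```

With only 1-3 samples per node, as in the report, a single flag may be noise; nodes that are flagged across repeated runs are the more reliable signal.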