Hi,
The mlpstorage training run --model=retinanet --accelerator-type b200 .. job has considerably lower I/O throughput than expected, around 600 KB/s, with all 8 pt_data_worker processes 100% CPU busy.
Even with just a single accelerator, the mlpstorage training run --model=retinanet job has been running for 47 hours in the epoch 1 phase, and it is not clear when it can be expected to finish.
$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=retinanet --exec-type=mpi --param dataset.num_files_train=4106030 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/retinanet_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-13 14:11:55|INFO: Environment validation passed
2026-04-13 14:11:55|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/retinanet/run/20260413_141154
2026-04-13 14:11:55|INFO: Created benchmark run: training_run_retinanet_20260413_141154
2026-04-13 14:11:55|STATUS: Verifying benchmark run for training_run_retinanet_20260413_141154
2026-04-13 14:11:55|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-13 14:11:55|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 4106030 (Parameter: Overrode Parameters)
2026-04-13 14:11:55|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 4111155, Actual: 4106030)
2026-04-13 14:11:55|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='retinanet', run_datetime='20260413_141154')])
2026-04-13 14:11:55|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2026-04-13 14:11:56|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=retinanet_b200 ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/retinanet/run/20260413_141154 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=4106030 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/retinanet_b200/retinanet --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-13T14:12:00.200868 Running DLIO [Training] with 1 process(es)
[OUTPUT] 2026-04-13T14:13:45.267283 Max steps per epoch: 171084 = 1 * 4106030 / 24 / 1 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-13T14:14:06.659418 Starting epoch 1: 171084 steps expected
[OUTPUT] 2026-04-13T14:14:06.659952 Starting block 1
⠸ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 1 day, 23:12:53
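As a sanity check against the log above, the step count and a rough epoch ETA follow from back-of-envelope arithmetic (a sketch only; it assumes the observed ~600 KB/s is sustained aggregate read throughput and that each 176 KiB file is read once per epoch):

$ echo $((4106030 / 24 / 1))              # 171084 steps = num files / batch size / comm size, matching the log
$ echo $((4106030 * 176 / 600 / 86400))   # => 13, i.e. roughly two weeks per epoch at ~600 KiB/s

That estimate is at least consistent with the run still sitting in epoch 1 after 47 hours.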
From performance profiling during the retinanet training run, we observe most of the time being spent in PyUnicode_FromFormatV, with all 8 pt_data_worker processes 100% CPU busy but with minimal I/O to the underlying storage system hosting the training dataset (4.1 million JPEG files).
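For reference, a sketch of how such a profile can be captured on the client host (assuming perf and py-spy are available; <pid> is a placeholder for one of the pt_data_worker PIDs):

$ perf top -p <pid>                                        # native symbol view; this is where PyUnicode_FromFormatV showed up
$ py-spy record --native --pid <pid> -o worker.svg -d 30   # mixed Python/native flame graph sampled over 30 s
$ py-spy dump --pid <pid>                                  # one-shot Python stack of the worker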
FWIW, the mlpstorage training datagen --hosts=srt017-e0 --model=retinanet --exec-type=mpi --param dataset.num_files_train=4106030 run generated 4106031 JPEG files, each 176 KiB in size.
# Training retinanet generated dataset
# 4106031 x 176KB JPEG files
$ find . -print | wc -l
4106031
$ ls -lh img_000001*_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000010_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000011_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000012_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000013_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000014_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000015_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000016_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000017_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000018_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000019_of_4106030.jpeg
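One caveat on the count above: find . -print also counts the top-level directory entry itself, so a file-only count is the safer cross-check:

$ find . -type f -name '*.jpeg' | wc -l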