Hi,
The mlpstorage training run --model=retinanet --accelerator-type b200 .. job has considerably lower I/O throughput than expected, around 600 KB/s, with all 8 pt_data_worker processes 100% CPU busy.
Even with just a single accelerator, the mlpstorage training run --model=retinanet job has been running for 47 hours in the epoch 1 phase, and it is not clear when it can be expected to finish.
$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=retinanet --exec-type=mpi --param dataset.num_files_train=4106030 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/retinanet_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
2026-04-13 14:11:55|INFO: Environment validation passed
2026-04-13 14:11:55|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/retinanet/run/20260413_141154
2026-04-13 14:11:55|INFO: Created benchmark run: training_run_retinanet_20260413_141154
2026-04-13 14:11:55|STATUS: Verifying benchmark run for training_run_retinanet_20260413_141154
2026-04-13 14:11:55|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-13 14:11:55|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 4106030 (Parameter: Overrode Parameters)
2026-04-13 14:11:55|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 4111155, Actual: 4106030)
2026-04-13 14:11:55|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='retinanet', run_datetime='20260413_141154')])
2026-04-13 14:11:55|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
2026-04-13 14:11:56|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=retinanet_b200 ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/retinanet/run/20260413_141154 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=4106030 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/retinanet_b200/retinanet --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-13T14:12:00.200868 Running DLIO [Training] with 1 process(es)
[OUTPUT] 2026-04-13T14:13:45.267283 Max steps per epoch: 171084 = 1 * 4106030 / 24 / 1 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-13T14:14:06.659418 Starting epoch 1: 171084 steps expected
[OUTPUT] 2026-04-13T14:14:06.659952 Starting block 1
⠸ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 1 day, 23:12:53
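As a sanity check against the log above, the step count and a rough epoch ETA follow from back-of-envelope arithmetic (a sketch only; it assumes the observed ~600 KB/s is sustained aggregate read throughput and that each 176 KiB file is read once per epoch):

$ echo $((4106030 / 24 / 1))              # 171084 steps = num files / batch size / comm size, matching the log
$ echo $((4106030 * 176 / 600 / 86400))   # => 13, i.e. roughly two weeks per epoch at ~600 KiB/s

That estimate is at least consistent with the run still sitting in epoch 1 after 47 hours.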
From performance profiling during the retinanet training run, we observe most of the time being spent in PyUnicode_FromFormatV, with all 8 pt_data_worker processes 100% CPU busy but with minimal I/O to the underlying storage system hosting the training dataset (4.1 million JPEG files).
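For reference, a sketch of how such a profile can be captured on the client host (assuming perf and py-spy are available; <pid> is a placeholder for one of the pt_data_worker PIDs):

$ perf top -p <pid>                                        # native symbol view; this is where PyUnicode_FromFormatV showed up
$ py-spy record --native --pid <pid> -o worker.svg -d 30   # mixed Python/native flame graph sampled over 30 s
$ py-spy dump --pid <pid>                                  # one-shot Python stack of the worker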
FWIW, the mlpstorage training datagen --hosts=srt017-e0 --model=retinanet --exec-type=mpi --param dataset.num_files_train=4106030 run generated 4106031 JPEG files, each 176 KiB in size.
# Training retinanet generated dataset
# 4106031 x 176KB JPEG files
$ find . -print | wc -l
4106031
$ ls -lh img_000001*_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000010_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000011_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000012_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000013_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000014_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000015_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000016_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000017_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000018_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000019_of_4106030.jpeg
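One caveat on the count above: find . -print also counts the top-level directory entry itself, so a file-only count is the safer cross-check:

$ find . -type f -name '*.jpeg' | wc -l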