mlpstorage training run --model=retinanet has considerably lower I/O Throughput and higher CPU utilization #337

@ddn-kums

Hi,

The mlpstorage training run --model=retinanet --accelerator-type b200 .. job shows very low I/O throughput, about 600 KB/s, while all 8 pt_data_worker processes are 100% CPU busy.

Even with a single accelerator, the mlpstorage training run --model=retinanet job has been running for 47 hours and is still in the epoch 1 phase, and it is not clear when it is expected to finish.

$ mlpstorage training run --hosts=srt017-e0 --client-host-memory-in-gb 247 --num-accelerators 1 --num-client-hosts 1 --accelerator-type b200 --model=retinanet --exec-type=mpi --param dataset.num_files_train=4106030 --file --results-dir=/work/kums/mlstorage_v3/results --data-dir=/mnt/redfs/mlstorage_dd/retinanet_b200
Setting attr from num_accelerators to 1
Hosts is: ['srt017-e0']
Hosts is: ['srt017-e0']
⠙ Validating environment... 0:00:00
2026-04-13 14:11:55|INFO: Environment validation passed
2026-04-13 14:11:55|STATUS: Benchmark results directory: /work/kums/mlstorage_v3/results/training/retinanet/run/20260413_141154
2026-04-13 14:11:55|INFO: Created benchmark run: training_run_retinanet_20260413_141154
2026-04-13 14:11:55|STATUS: Verifying benchmark run for training_run_retinanet_20260413_141154
2026-04-13 14:11:55|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-13 14:11:55|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 4106030 (Parameter: Overrode Parameters)
2026-04-13 14:11:55|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 4111155, Actual: 4106030)
2026-04-13 14:11:55|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='retinanet', run_datetime='20260413_141154')])
2026-04-13 14:11:55|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
⠴ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-13 14:11:56|STATUS: Running benchmark command:: mpirun -n 1 -host srt017-e0:1 --bind-to none --map-by socket /work/kums/mlstorage_v3/storage/.venv/bin/dlio_benchmark workload=retinanet_b200 ++hydra.run.dir=/work/kums/mlstorage_v3/results/training/retinanet/run/20260413_141154 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=4106030 ++workload.dataset.data_folder=/mnt/redfs/mlstorage_dd/retinanet_b200/retinanet --config-dir=/work/kums/mlstorage_v3/storage/configs/dlio
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] 2026-04-13T14:12:00.200868 Running DLIO [Training] with 1 process(es)
[OUTPUT] 2026-04-13T14:13:45.267283 Max steps per epoch: 171084 = 1 * 4106030 / 24 / 1 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-13T14:14:06.659418 Starting epoch 1: 171084 steps expected
[OUTPUT] 2026-04-13T14:14:06.659952 Starting block 1
⠸ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 1 day, 23:12:53

Performance profiling during the retinanet training run shows most of the time being spent in PyUnicode_FromFormatV, with all 8 pt_data_worker processes 100% CPU busy but with MINIMAL I/O to the underlying storage system hosting the training dataset (4.1 million JPEG files).
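
For anyone who wants to cross-check the "CPU busy, storage idle" pattern without a profiler, the per-process counters that Linux exposes under /proc can be sampled directly. The sketch below is only an illustration and is not part of mlpstorage or DLIO: the helper names are hypothetical, it assumes a Linux host where /proc/<pid>/io and /proc/<pid>/stat are readable for the worker process, and looking up the pt_data_worker PIDs is left to the reader. It uses rchar (bytes requested through read-style system calls), which also counts reads served from the page cache.

```python
import os
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def cpu_seconds(stat_text, ticks_per_second):
    """Return user + system CPU seconds from the text of /proc/<pid>/stat."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so only split what follows the last ')'.
    fields = stat_text.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 overall
    return (utime + stime) / ticks_per_second


def sample(pid, interval=5.0):
    """Return (read bytes per second, CPU percent) for one process."""
    ticks = os.sysconf("SC_CLK_TCK")

    def snapshot():
        with open(f"/proc/{pid}/io") as f:
            io_counters = parse_proc_io(f.read())
        with open(f"/proc/{pid}/stat") as f:
            cpu = cpu_seconds(f.read(), ticks)
        return io_counters["rchar"], cpu

    bytes0, cpu0 = snapshot()
    time.sleep(interval)
    bytes1, cpu1 = snapshot()
    return (bytes1 - bytes0) / interval, 100.0 * (cpu1 - cpu0) / interval
```

Calling sample(pid) for each worker PID should show CPU near 100% alongside a very small read rate if the workers are compute-bound rather than waiting on storage.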

[Three images attached]

FWIW, mlpstorage training datagen --hosts=srt017-e0 --model=retinanet --exec-type=mpi --param dataset.num_files_train=4106030 generated 4106031 JPEG files, each 176 KiB in size.

# Training retinanet generated dataset
# 4106031 x 176KB JPEG files
$ find . -print | wc -l
4106031

$ ls -lh img_000001*_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000010_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000011_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000012_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000013_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000014_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000015_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000016_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000017_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000018_of_4106030.jpeg
-rw-rw---- 1 nodeadmin nodeadmin 176K Apr 12 15:19 img_0000019_of_4106030.jpeg
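
As a rough back-of-the-envelope check on how long epoch 1 could take, the numbers reported above can be combined as in the sketch below. It is an estimate under stated assumptions, not anything DLIO itself reports: it assumes every file is read in full exactly once per epoch, that throughput stays at the observed ~600 KB/s, and that KB means 1000 bytes. The function names are hypothetical.

```python
NUM_FILES = 4_106_030          # dataset.num_files_train from the run command
FILE_SIZE_BYTES = 176 * 1024   # 176 KiB per JPEG, per the ls listing
BATCH_SIZE = 24                # from the DLIO "Max steps per epoch" log line
THROUGHPUT_BPS = 600 * 1000    # ~600 KB/s observed (assumes KB = 1000 bytes)


def steps_per_epoch(num_files, batch_size, samples_per_file=1, comm_size=1):
    """Mirror the formula printed in the DLIO log, using integer division."""
    return samples_per_file * num_files // batch_size // comm_size


def epoch_hours(num_files, file_size_bytes, throughput_bps):
    """Hours needed to read every file once at a constant throughput."""
    return num_files * file_size_bytes / throughput_bps / 3600


if __name__ == "__main__":
    print("steps per epoch:", steps_per_epoch(NUM_FILES, BATCH_SIZE))
    print("estimated hours per epoch:",
          round(epoch_hours(NUM_FILES, FILE_SIZE_BYTES, THROUGHPUT_BPS), 1))
```

Under these assumptions the step count matches the 171084 in the log, and a single epoch comes out at roughly 340 hours (about two weeks), so 47 hours in epoch 1 would be consistent with the observed throughput rather than a hang.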
