I have set up the certificates and verified that the S3 bucket is accessible. However, tests/object-store/dlio_s3torch_checkpoint.sh fails with the following error: [Writer] ERROR: Failed to create storage writer: Client error: Unknown CRT error: CRT error 1029: aws-c-io: AWS_IO_TLS_ERROR_NEGOTIATION_FAILURE, TLS (SSL) negotiation failed.
FWIW, with the s3dlio library, CHECKPOINTS=1 NP=2 bash tests/object-store/dlio_s3dlio_checkpoint.sh runs successfully on the same setup.
cc: @russfellows
- Details:
$ echo $AWS_CA_BUNDLE
/home/nodeadmin/ca.pem
$ python3 -c "import s3dlio; print(s3dlio.list('s3://s3torch-chckpt-test1/', recursive=False))"
[]
$ CHECKPOINTS=1 NP=2 bash tests/object-store/dlio_s3torch_checkpoint.sh
[env] Loading credentials from .env
════════════════════════════════════════════════════════
DLIO Checkpoint — s3torchconnector + MinIO (llama3-8b)
════════════════════════════════════════════════════════
Bucket : s3torch-chckpt-test1
Objects at : s3://s3torch-chckpt-test1/s3torch/llama3-8b/
Endpoint : https://192.168.7.124:8111
MPI ranks : 2 (default=1; full run: NP=8 bash tests/object-store/dlio_s3torch_checkpoint.sh)
Checkpoints : 1 write + 1 read
Per-rank : ~13.1 GB per checkpoint (ZeRO-3, 8 ranks)
Run dir : /tmp/dlio-s3torch-checkpoint-20260409_153733
════════════════════════════════════════════════════════
Checking bucket reachability: s3://s3torch-chckpt-test1/ ...
Bucket accessible — 0 top-level entries
--------------------------------------------------------------------------
..
..
--------------------------------------------------------------------------
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
storage_type = <StorageType.S3: 's3'>
storage_root = 's3torch-chckpt-test1'
storage_options= {'endpoint_url': 'https://192.168.7.124:8111', 'region': 'us-east-1', 'storage_library': 's3torchconnector'}
data_folder = './data/'
framework = <FrameworkType.PYTORCH: 'pytorch'>
num_files_train= 8
record_length = 65536
generate_data = False
do_train = False
do_checkpoint = True
epochs = 1
batch_size = 1
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
storage_type = <StorageType.S3: 's3'>
storage_root = 's3torch-chckpt-test1'
storage_options= {'endpoint_url': 'https://192.168.7.124:8111', 'region': 'us-east-1', 'storage_library': 's3torchconnector'}
data_folder = './data/'
framework = <FrameworkType.PYTORCH: 'pytorch'>
num_files_train= 8
record_length = 65536
generate_data = False
do_train = False
do_checkpoint = True
epochs = 1
batch_size = 1
[OUTPUT] 2026-04-09T15:37:37.559203 Running DLIO [Checkpointing] with 2 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT] dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT] dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
..
..
[Main] Creating 4 buffers...
================================================================================
STREAMING CHECKPOINT - Producer-Consumer Pattern
================================================================================
Output: s3://s3torch-chckpt-test1/s3torch/llama3-8b/global_epoch1_step1/zero_pp_rank_0_mp_rank_0_model_states.pt
Backend: s3torchconnector
Total size: 7.48 GB
Buffer size: 32 MB
Buffer pool: 4 × 32 MB = 0.12 GB
Direct I/O: False
Use dgen-py: True
================================================================================
[Main] Creating 4 buffers...
[Main] Buffer pool ready: 0.12 GB
[Main] Initializing dgen-py (MPI world_size=2, threads=24/48 CPUs)...
[Main] Buffer pool ready: 0.12 GB
[Main] Initializing dgen-py (MPI world_size=2, threads=24/48 CPUs)...
[Main] Generator ready
[Main] Generator ready
[Main] Writer process started (PID=2952630)
[Main] Starting producer at 15368685.043s
[Main] Starting producer (buffer pool reuse pattern)...
[Writer] Starting (PID=2952630)
[Writer] DEBUG: AWS_ACCESS_KEY_ID = PSEZ***
[Writer] DEBUG: AWS_ENDPOINT_URL = https://192.168.7.124:8111
[Writer] Attached to 4 buffers (32 MB each)
[Main] Writer process started (PID=2952631)
[Main] Starting producer at 15368685.047s
[Main] Starting producer (buffer pool reuse pattern)...
[Writer] Starting (PID=2952631)
[Writer] DEBUG: AWS_ACCESS_KEY_ID = PSEZ***
[Writer] DEBUG: AWS_ENDPOINT_URL = https://192.168.7.124:8111
[Writer] Attached to 4 buffers (32 MB each)
[Writer] ERROR: Failed to create storage writer: Client error: Unknown CRT error: CRT error 1029: aws-c-io: AWS_IO_TLS_ERROR_NEGOTIATION_FAILURE, TLS (SSL) negotiation failed
[Writer] ERROR: Failed to create storage writer: Client error: Unknown CRT error: CRT error 1029: aws-c-io: AWS_IO_TLS_ERROR_NEGOTIATION_FAILURE, TLS (SSL) negotiation failed
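To help narrow this down independently of either storage library, the handshake can be attempted with Python's standard ssl module against the same endpoint and CA bundle. This is only a minimal sketch: the helper names split_endpoint and try_handshake are hypothetical, and the fallback endpoint is the one shown in the log above. It goes through Python's OpenSSL-based stack rather than the AWS CRT stack that reports error 1029, so a successful handshake here only shows that the server certificate validates against the bundle under OpenSSL. It does not prove that the CRT client will accept it.

```python
import os
import socket
import ssl
from urllib.parse import urlparse


def split_endpoint(url):
    """Return (host, port) for an endpoint URL, defaulting to 443 for https."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def try_handshake(url, ca_bundle=None, timeout=5.0):
    """Attempt a TLS handshake; return (ok, detail).

    ca_bundle is a path to a PEM file (for example the value of
    AWS_CA_BUNDLE); None falls back to the system trust store.
    """
    host, port = split_endpoint(url)
    try:
        # Build the context first so a bad bundle path fails before connecting.
        ctx = ssl.create_default_context(cafile=ca_bundle)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Assumption: the same variables the test script exports are set here.
    endpoint = os.environ.get("AWS_ENDPOINT_URL", "https://192.168.7.124:8111")
    bundle = os.environ.get("AWS_CA_BUNDLE")
    print(try_handshake(endpoint, bundle))
```

If this reports a certificate verification failure, the problem is likely in the bundle or the server certificate (for example, a missing IP address entry in the subject alternative names). If it succeeds, the difference is more likely in how the s3torchconnector path picks up the CA bundle or negotiates TLS.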