dlio_s3torch_checkpoint fails with TLS (SSL) negotiation failed errors #327

@ddn-kums

I have set up the certificates and verified that the S3 bucket is accessible. However, tests/object-store/dlio_s3torch_checkpoint.sh fails when the writer process creates the storage writer: [Writer] ERROR: Failed to create storage writer: Client error: Unknown CRT error: CRT error 1029: aws-c-io: AWS_IO_TLS_ERROR_NEGOTIATION_FAILURE, TLS (SSL) negotiation failed.

FWIW, with the s3dlio library, CHECKPOINTS=1 NP=2 bash tests/object-store/dlio_s3dlio_checkpoint.sh runs successfully on the same setup.
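Since one library succeeds and the other fails against the same endpoint, a check that is independent of both may help narrow this down. The sketch below uses only the Python standard library to attempt a TLS handshake against the endpoint twice: once trusting only the CA bundle, and once trusting only the system trust store. The helper names `endpoint_host_port` and `try_handshake` are hypothetical, and whether the CRT-based client in s3torchconnector reads AWS_CA_BUNDLE at all is an assumption this report does not establish.

```python
import os
import socket
import ssl
from urllib.parse import urlsplit


def endpoint_host_port(endpoint_url):
    """Split an endpoint URL into (host, port); default to 443 for https, 80 otherwise."""
    parts = urlsplit(endpoint_url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def try_handshake(endpoint_url, ca_file=None, timeout=5.0):
    """Attempt a verified TLS handshake and return (ok, detail).

    With ca_file set, only that bundle is trusted.
    With ca_file=None, the system default trust store is used.
    """
    host, port = endpoint_host_port(endpoint_url)
    ctx = ssl.create_default_context(cafile=ca_file)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except (ssl.SSLError, OSError) as exc:
        # Covers certificate verification failures, refused connections and timeouts.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    endpoint = os.environ.get("AWS_ENDPOINT_URL")
    bundle = os.environ.get("AWS_CA_BUNDLE")
    if endpoint:
        if bundle:
            print("CA bundle only    :", try_handshake(endpoint, ca_file=bundle))
        print("system trust store:", try_handshake(endpoint))
    else:
        print("AWS_ENDPOINT_URL is not set; nothing to check")
```

If the handshake succeeds with the bundle but fails with the system trust store, the certificate itself is probably fine, and the open question becomes how the failing client is told which CA to trust. If both fail, the detail string (for example a hostname or IP mismatch) is likely more informative than CRT error 1029.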

cc: @russfellows

Details:

$ echo $AWS_CA_BUNDLE
/home/nodeadmin/ca.pem

$ python3 -c "import s3dlio; print(s3dlio.list('s3://s3torch-chckpt-test1/', recursive=False))"
[]
$ CHECKPOINTS=1 NP=2 bash tests/object-store/dlio_s3torch_checkpoint.sh 
[env] Loading credentials from .env

════════════════════════════════════════════════════════
  DLIO Checkpoint — s3torchconnector + MinIO  (llama3-8b)
════════════════════════════════════════════════════════
  Bucket      : s3torch-chckpt-test1
  Objects at  : s3://s3torch-chckpt-test1/s3torch/llama3-8b/
  Endpoint    : https://192.168.7.124:8111
  MPI ranks   : 2   (default=1; full run: NP=8 bash tests/object-store/dlio_s3torch_checkpoint.sh)
  Checkpoints : 1 write + 1 read
  Per-rank    : ~13.1 GB per checkpoint  (ZeRO-3, 8 ranks)
  Run dir     : /tmp/dlio-s3torch-checkpoint-20260409_153733
════════════════════════════════════════════════════════

Checking bucket reachability: s3://s3torch-chckpt-test1/ ...
  Bucket accessible — 0 top-level entries

--------------------------------------------------------------------------
..
..
--------------------------------------------------------------------------
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
  storage_type   = <StorageType.S3: 's3'>
  storage_root   = 's3torch-chckpt-test1'
  storage_options= {'endpoint_url': 'https://192.168.7.124:8111', 'region': 'us-east-1', 'storage_library': 's3torchconnector'}
  data_folder    = './data/'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 8
  record_length  = 65536
  generate_data  = False
  do_train       = False
  do_checkpoint  = True
  epochs         = 1
  batch_size     = 1
[DEBUG DLIOBenchmark.__init__] After LoadConfig:
  storage_type   = <StorageType.S3: 's3'>
  storage_root   = 's3torch-chckpt-test1'
  storage_options= {'endpoint_url': 'https://192.168.7.124:8111', 'region': 'us-east-1', 'storage_library': 's3torchconnector'}
  data_folder    = './data/'
  framework      = <FrameworkType.PYTORCH: 'pytorch'>
  num_files_train= 8
  record_length  = 65536
  generate_data  = False
  do_train       = False
  do_checkpoint  = True
  epochs         = 1
  batch_size     = 1
[OUTPUT] 2026-04-09T15:37:37.559203 Running DLIO [Checkpointing] with 2 process(es)
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
[OUTPUT] ================================================================================
[OUTPUT] Data Generation Method: DGEN (default)
[OUTPUT]   dgen-py zero-copy BytesView — 155x faster than NumPy, 0 MB overhead
[OUTPUT] ================================================================================
..
..
[Main] Creating 4 buffers...
================================================================================
STREAMING CHECKPOINT - Producer-Consumer Pattern
================================================================================
Output:      s3://s3torch-chckpt-test1/s3torch/llama3-8b/global_epoch1_step1/zero_pp_rank_0_mp_rank_0_model_states.pt
Backend:     s3torchconnector
Total size:  7.48 GB
Buffer size: 32 MB
Buffer pool: 4 × 32 MB = 0.12 GB
Direct I/O:  False
Use dgen-py: True
================================================================================

[Main] Creating 4 buffers...
[Main] Buffer pool ready: 0.12 GB
[Main] Initializing dgen-py (MPI world_size=2, threads=24/48 CPUs)...
[Main] Buffer pool ready: 0.12 GB
[Main] Initializing dgen-py (MPI world_size=2, threads=24/48 CPUs)...
[Main] Generator ready
[Main] Generator ready

[Main] Writer process started (PID=2952630)
[Main] Starting producer at 15368685.043s
[Main] Starting producer (buffer pool reuse pattern)...
[Writer] Starting (PID=2952630)
[Writer] DEBUG: AWS_ACCESS_KEY_ID = PSEZ***
[Writer] DEBUG: AWS_ENDPOINT_URL = https://192.168.7.124:8111
[Writer] Attached to 4 buffers (32 MB each)

[Main] Writer process started (PID=2952631)
[Main] Starting producer at 15368685.047s
[Main] Starting producer (buffer pool reuse pattern)...
[Writer] Starting (PID=2952631)
[Writer] DEBUG: AWS_ACCESS_KEY_ID = PSEZ***
[Writer] DEBUG: AWS_ENDPOINT_URL = https://192.168.7.124:8111
[Writer] Attached to 4 buffers (32 MB each)
[Writer] ERROR: Failed to create storage writer: Client error: Unknown CRT error: CRT error 1029: aws-c-io: AWS_IO_TLS_ERROR_NEGOTIATION_FAILURE, TLS (SSL) negotiation failed
[Writer] ERROR: Failed to create storage writer: Client error: Unknown CRT error: CRT error 1029: aws-c-io: AWS_IO_TLS_ERROR_NEGOTIATION_FAILURE, TLS (SSL) negotiation failed
