DAOS-18882 vos: avoid heap_curr_allocated underflow #18103

Open
grom72 wants to merge 3 commits into master from grom72/DAOS-18882

@grom72 (Contributor) commented Apr 24, 2026

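Update PMDK to incorporate the following fixes:

  • fix "The pool was not closed" message (no ADR failure) daos-stack/pmdk#36
  • recalculate curr_allocated on underflow daos-stack/pmdk#37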

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions (bot) commented Apr 24, 2026

Ticket title is '0 size SCM on pool with no containers'
Status is 'Awaiting Verification'
https://daosio.atlassian.net/browse/DAOS-18882

grom72 changed the title from "DAOS-18882 pmdk: avoid heap_curr_allocated underflow" to "DAOS-18882 vos: avoid heap_curr_allocated underflow" on Apr 24, 2026
grom72 force-pushed the grom72/DAOS-18882 branch from 735b1f2 to b6c997b on April 24, 2026 10:08
Update PMDK to incorporate the following fixes:
- fix "The pool was not closed" message (no ADR failure) daos-stack/pmdk#36
- recalculate curr_allocated on underflow daos-stack/pmdk#37

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

Priority: 2

Allow-unstable-test: true

Focus validation on PMem version

Skip-func-hw-test-medium: false
Skip-func-hw-test-medium-md-on-ssd: true
Skip-func-hw-test-medium-vmd: false
Skip-func-hw-test-medium-verbs-provider: false
Skip-func-hw-test-medium-verbs-provider-md-on-ssd: true
Skip-func-hw-test-large: false
Skip-func-hw-test-large-md-on-ssd: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
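
The second fix listed in the commit message above, recalculating curr_allocated on underflow, concerns a counter that wraps around when more is subtracted from it than it currently holds. The following is a minimal Python sketch of that failure mode and of one possible guard. It is not the PMDK implementation: it assumes the counter is an unsigned 64-bit value, and every name in it (`sub_wrapping`, `sub_guarded`, the `recount` callback) is hypothetical.

```python
U64_MASK = (1 << 64) - 1  # assumption: the counter is an unsigned 64-bit integer


def sub_wrapping(counter: int, size: int) -> int:
    """Subtract the way unsigned 64-bit arithmetic does: wrap on underflow."""
    return (counter - size) & U64_MASK


def sub_guarded(counter: int, size: int, recount) -> int:
    """Subtract, but fall back to a fresh recount if the result would underflow.

    `recount` is a hypothetical callback that recomputes the allocated
    size from scratch instead of trusting the stale counter.
    """
    if size > counter:
        return recount()
    return counter - size


# A counter that has drifted below the size being freed.
stale_counter = 4096
freed = 8192

wrapped = sub_wrapping(stale_counter, freed)
guarded = sub_guarded(stale_counter, freed, recount=lambda: 0)

print(hex(wrapped))  # a value just below 2**64
print(guarded)       # the recounted value supplied by the callback
```

A wrapped value of this size makes any statistic derived from the counter meaningless, which is why some form of guard or recalculation is needed once the counter can no longer be trusted.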
grom72 force-pushed the grom72/DAOS-18882 branch from b6c997b to 91a8209 on April 24, 2026 10:15
grom72 added a commit that referenced this pull request Apr 24, 2026
Update PMDK to incorporate the following fixes:
- fix "The pool was not closed" message (no ADR failure) daos-stack/pmdk#36
- recalculate curr_allocated on underflow daos-stack/pmdk#37

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

Priority: 2

Allow-unstable-test: true

Skip-func-hw-test-medium: false
Skip-func-hw-test-large: false
Signed-off-by: Oksana Salyk <oksana.salyk@hpe.com>
osalyk marked this pull request as ready for review on April 24, 2026 14:18
osalyk requested a review from a team as a code owner on April 24, 2026 14:18
@daosbuild3 (Collaborator)

Comment thread on utils/build.config (outdated)
[patch_versions]
spdk=0001_3428322b812fe31cc3e1d0308a7f5bd4b06b9886.diff,0002_spdk_rwf_nowait.patch,0003_external_isal.patch
mercury=0001_dep_versions.patch,0002_ofi_counters.patch,0003_ofi_auth_key.patch
pmdk=https://github.com/daos-stack/pmdk/commit/69925cf455ef672c4cbdbdb13bef7ae581e67045.diff
Contributor:

I made the change, but did not test locally.

Signed-off-by: Ryon Jensen <ryon.jensen@hpe.com>
mchaarawi requested a review from soumagne on April 24, 2026 19:41
@grom72 (Contributor, Author) commented Apr 24, 2026

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18103/6/testReport/

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18103/5/testReport/

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18103/6/execution/node/565/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18103/5/execution/node/1224/log

@phender (Contributor) commented Apr 25, 2026

Functional test failures in build 5 and build 6:

  • (build 5) 1-./dfuse/daos_build.py:DaosBuild.test_dfuse_daos_build_wb
  • (build 5) 20-./daos_test/suite.py:DaosCoreTest.test_daos_extend_simple
    • Test timeout - seemingly caused by pool connect issues:
    2026-04-25 02:11:54,990 process          L0416 DEBUG| [stdout] [1,0]<stdout>:setup: connecting to pool eb84a26d-bebe-4132-bdfe-f0341c1cace1
    2026-04-25 02:57:11,662 stacktrace       L0039 ERROR| 
    
    2026/04/25 02:11:54.991408 hdr-240 DAOS[150491/150491/0] pool DBUG src/pool/cli.c:1152 dc_pool_connect() eb84a26d-bebe-4132-bdfe-f0341c1cace1: connecting: hdl=c898e89a-d725-4c7b-b133-50a7648fcfd8 flags=2
    ...
    2026/04/25 02:12:54.996649 hdr-240 DAOS[150491/150491/0] rpc  WARN src/cart/crt_context.c:1270 crt_context_timeout_check(0x56179c38b520) [opc=0x2070001 (DAOS_POOL_MODULE:POOL_CONNECT) rpcid=0x4339bfd500005dd8 rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds) [deadline: 1777083175], target (2:0)
    2026/04/25 02:12:54.996708 hdr-240 DAOS[150491/150491/0] rpc  INFO src/cart/crt_context.c:1196 crt_req_timeout_hdlr(0x56179c38b520) [opc=0x2070001 (DAOS_POOL_MODULE:POOL_CONNECT) rpcid=0x4339bfd500005dd8 rank:tag=2:0] aborting in-flight to group daos_server, rank 2, tgt_uri ofi+verbs;ofi_rxm://192.168.88.13:31317
    2026/04/25 02:12:54.996741 hdr-240 DAOS[150491/150491/0] hg   WARN src/cart/crt_hg.c:1491 crt_hg_req_send_cb(0x56179c38b520) [opc=0x2070001 (DAOS_POOL_MODULE:POOL_CONNECT) rpcid=0x4339bfd500005dd8 rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
    
  • (build 6) 1-./dfuse/daos_build.py:DaosBuild.test_dfuse_daos_build_wb
    2026-04-24 23:55:49,698 run_utils        L0352 DEBUG|       - Curl error (28): Timeout was reached for https://artifactory.daos.hpc.amslabs.hpecorp.net/artifactory/mellanox-proxy/doca/3.2.1/rhel9/x86_64/repodata/repomd.xml [Operation timed out after 30000 milliseconds with 0 out of 0 bytes received]
    
  • (build 6) 24-./daos_test/suite.py:DaosCoreTest.test_daos_rebuild_ec
    • Test timeout after creating a pool
    2026-04-25 04:23:44,610 process          L0416 DEBUG| [stdout] [1,0]<stdout>:setup: creating pool, SCM size=8 GB, NVMe size=16 GB
    2026-04-25 04:24:32,499 stacktrace       L0039 ERROR|
    
    • Control log:
    2026/04/25 04:13:43.870827 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:184 run_cmd() dmg cmd: dmg -j -d --log-file=/tmp/suite_dmg.log -o /var/tmp/daos_testing/configs/daos_control.yml pool create --ranks=0,1,2,3,4,5,6,7 --user=jenkins --group=jenkins --scm-size=8589934592b --nvme-size=17179869184b --properties=rd_fac:0 --properties=space_rb:0 test_dKXBMt --nsvc=5
    2026/04/25 04:13:43.870844 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:205 run_cmd() forking to run dmg command
    2026/04/25 04:13:44.240550 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:236 run_cmd() waiting for dmg to finish executing
    2026/04/25 04:23:44.539126 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:241 run_cmd() dmg command finished
    2026/04/25 04:23:44.539174 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:312 daos_dmg_json_pipe() reading json from stdout
    2026/04/25 04:23:44.539192 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:334 daos_dmg_json_pipe() read 150 bytes
    2026/04/25 04:23:44.539246 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:382 daos_dmg_json_pipe() parsed output:
    {
      "response": null,
      "error": "client: code = 510 description = \"the *control.PoolCreateReq request timed out after 10m0s\"",
      "status": -1025
    }
    2026/04/25 04:23:44.539251 hdr-241 DAOS[152511/152511/0] daos ERR  src/common/tests_dmg_helpers.c:391 daos_dmg_json_pipe() dmg error: client: code = 510 description = "the *control.PoolCreateReq request timed out after 10m0s"
    2026/04/25 04:23:44.539266 hdr-241 DAOS[152511/152511/0] daos ERR  src/common/tests_dmg_helpers.c:888 dmg_pool_create() dmg failed
    2026/04/25 04:23:44.539332 hdr-241 DAOS[152511/152511/0] hg   DBUG src/cart/crt_hg.c:732 crt_get_info_string() iface_idx:0 context:9 domain_str=mlx5_0 iface_str=ib0 info_str=ofi+verbs;ofi_rxm://mlx5_0/ib0
    2026/04/25 04:23:44.610116 hdr-241 DAOS[152511/152511/0] rpc  DBUG src/cart/crt_context.c:352 crt_context_provider_create() created context (idx 9, self_uri ofi+verbs;ofi_rxm://192.168.88.2:35564)
    2026/04/25 04:23:44.610354 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:184 run_cmd() dmg cmd: dmg -j -d --log-file=/tmp/suite_dmg.log -o /var/tmp/daos_testing/configs/daos_control.yml pool create --ranks=0,1,2,3,4,5,6,7 --user=jenkins --group=jenkins --scm-size=8589934592b --nvme-size=17179869184b --properties=rd_fac:0 --properties=space_rb:0 test_mSXjCJ --nsvc=5
    2026/04/25 04:23:44.610371 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:205 run_cmd() forking to run dmg command
    2026/04/25 04:23:45.018781 hdr-241 DAOS[152511/152511/0] daos DBUG src/common/tests_dmg_helpers.c:236 run_cmd() waiting for dmg to finish executing
    
    • This failure caused the subsequent tests to fail due to failures destroying the pool:
      • 25-./daos_test/suite.py:DaosCoreTest.test_daos_aggregate_ec
      • 26-./daos_test/suite.py:DaosCoreTest.test_daos_degraded_ec
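
The two timeouts reported above can be cross-checked against the quoted log timestamps. In build 5, `dc_pool_connect()` is logged at 02:11:54.991408 and the RPC timeout at 02:12:54.996649. In build 6, the `dmg pool create` command is started at 04:13:43.870827 and reported finished at 04:23:44.539126. A minimal sketch of that arithmetic follows; the helper name `elapsed_seconds` is hypothetical and is not part of any DAOS tooling.

```python
from datetime import datetime

# Timestamp layout used by the DAOS client log lines quoted above.
FMT = "%Y/%m/%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Build 5: pool connect issued, then the RPC reported as timed out.
connect = elapsed_seconds("2026/04/25 02:11:54.991408",
                          "2026/04/25 02:12:54.996649")

# Build 6: dmg pool create started, then reported as finished.
create = elapsed_seconds("2026/04/25 04:13:43.870827",
                         "2026/04/25 04:23:44.539126")

print(f"pool connect: {connect:.3f} s")
print(f"pool create:  {create:.3f} s")
```

Both intervals land just past a round limit: about 60 seconds for the connect RPC and about 600 seconds for the pool create. That is consistent with the "timed out (60 seconds)" and "timed out after 10m0s" messages in the logs, so both operations appear to have run until their timeout rather than failing early.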

Comment thread on utils/rpms/daos.spec
 Version: 2.9.100
-Release: 2%{?relval}%{?dist}
+Release: 3%{?relval}%{?dist}
 Summary: DAOS Storage Engine
Collaborator:

There is no need to bump the DAOS rpm version

Contributor (Author):

What should we do if we want to add information to the daos.changelog but no version has been changed?

Contributor:

well maybe you shouldn't add this to the DAOS changelog?

@grom72 (Contributor, Author) commented Apr 27, 2026

@grom72 (Contributor, Author) commented Apr 27, 2026

Substituted by: #18108

7 participants