Skip to content

[agent][redisreply] Add null check in checkStatus() to prevent segfault#1166

Open
rustiqly wants to merge 3 commits into
sonic-net:masterfrom
rustiqly:fix/null-deref-checkstatus
Open

[agent][redisreply] Add null check in checkStatus() to prevent segfault#1166
rustiqly wants to merge 3 commits into
sonic-net:masterfrom
rustiqly:fix/null-deref-checkstatus

Conversation

@rustiqly

Copy link
Copy Markdown
Contributor

What I did

Add null check for m_reply->str before strcmp() in RedisReply::checkStatus().

How I did it

Check !m_reply->str before calling strcmp. Log "(null)" when str is NULL for debug clarity.

How to verify it

Call checkStatus() on a Redis reply with NULL str field — should throw instead of segfault.

Which release branch to backport

master

Description for the changelog

Fix potential segfault in RedisReply::checkStatus() when m_reply->str is NULL.

Fixes: #1164

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

rustiqly commented Apr 4, 2026

Copy link
Copy Markdown
Contributor Author

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly rustiqly requested review from StormLiangMS and yxieca April 6, 2026 16:17
@yxieca

yxieca commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

@rustiqly can you try to add unit test coverage for this change?

@rustiqly

Copy link
Copy Markdown
Contributor Author

@yxieca acknowledged, working on adding unit tests.

rustiqly added 2 commits May 11, 2026 02:15
strcmp(m_reply->str, status) dereferences m_reply->str without
checking for NULL. Redis error replies or type mismatches can
leave str as NULL, causing a segfault.

Add null check before strcmp and use "(null)" in the error log
for clarity.

Fixes: sonic-net#1164

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
@rustiqly rustiqly force-pushed the fix/null-deref-checkstatus branch from 5866860 to 0cbe042 Compare May 11, 2026 09:17
@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@rustiqly

Copy link
Copy Markdown
Contributor Author

Added unit coverage in 0cbe042 for the null status string path in RedisReply::checkStatus().\n\nLocal compile note: direct compile is blocked on this host by missing system header hiredis/hiredis.h; CI should exercise the new tests/redisreply_ut.cpp through the normal test target.

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
@rustiqly

Copy link
Copy Markdown
Contributor Author

Pushed follow-up b72de84 for #1166 to fix the Analyze (cpp) failure.

Root cause from GitHub Actions:

tests/redisreply_ut.cpp:16:22: error: declaration of 'redisReply' shadows a global declaration [-Werror=shadow]

The test variable is now renamed to statusReply to avoid shadowing the hiredis redisReply typedef.

Validation: inspected the failed Actions log for #1166 and applied the exact warning fix; local full compile remains blocked by missing system hiredis/hiredis.h on this host.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run

1 similar comment
@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@mssonicbld

Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

Rebased/checked CI. The current required Azure failure appears unrelated to this PR's null-check change: vstest failed with Redis/DVS startup errors, e.g. RuntimeError: Unable to connect to redis (unix-socket) - No such file or directory(1): Cannot assign requested address across route/tunnel tests. This matches an infra/runtime startup failure rather than the touched redisreply code. Requesting Azure re-run.\n\nAzure Pipelines, please retry

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

Rechecked CI for #1166.

The current required failure is Azure.sonic-swss-common (Test vstest) from Azure build 1110940 / GitHub run https://github.com/sonic-net/sonic-swss-common/runs/75422030403. The failures look like the same unrelated Redis/DVS startup issue, not this checkStatus() null-check change:

  • repeated RuntimeError: Unable to connect to redis (unix-socket) - No such file or directory(1): Cannot assign requested address
  • broad failures across P4RT/VRF tests after Redis socket connection failures
  • build logs also show runner disk pressure (Free disk space on / is lower than 5%)
  • all build jobs passed; only vstest failed

Requesting a fresh Azure run.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI update for #1166:

Latest Azure.sonic-swss-common Test vstest failure in build 1111045 is from test cleanup after all selected tests had passed:

  • test_vnet.py: 41 passed, 1 skipped
  • p4rt: 35 passed, 4 skipped
  • failure happened during ApplDbValidator.__del__, raising PytestUnraisableExceptionWarning because Redis unix socket was already gone:
    RuntimeError: Unable to connect to redis (unix-socket) - No such file or directory(1): Cannot assign requested address
  • job then exited with code 123

This looks unrelated to the RedisReply::checkStatus() null-check change; re-running Azure.

/azp run Azure.sonic-swss-common

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI failure classification for #1166:

Azure.sonic-swss-common build 1112076 failed in Test vstest. The failures are broad VS/MUX/P4RT instability, not related to the RedisReply::checkStatus() null-check change in this PR:

  • 34 failures in test_mux_prefixroute.py::TestMuxTunnel, mostly unexpected extra ASIC_STATE:SAI_OBJECT_TYPE_NEXT_HOP objects and neighbor mode mismatch (Expected neighbor_mode 'host-route', got 'prefix-route').
  • Multiple teardown/runtime messages show Redis unix socket connection failures (RuntimeError: Unable to connect to redis (unix-socket)).
  • The changed files in this PR are limited to RedisReply null handling/tests and do not touch MUX, P4RT, route orchestration, or VS neighbor behavior.

Requesting a fresh Azure run.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI failure classification for #1166:

Azure.sonic-swss-common build 1112248 failed required Build amd64 during the Python swss-common unit test phase. The C++ unit suite passed all 243 tests, then pytest-3 --cov=. --cov-report=xml failed one existing Redis table test:

tests/test_redis_ut.py::test_ProducerStateTable
AssertionError: assert '' == 'DEL'

This looks unrelated to the #1166 change, which only adds a null guard in RedisReply::checkStatus(). The failure is in ProducerStateTable.delete()/ConsumerStateTable.pop() behavior and appears to be a flaky Redis/swss-common unit-test issue in the CI environment.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI triage for #1166:

Azure build 1112298 failed required Azure.sonic-swss-common (Test vstest). The failure is in broad VS test environment instability, not in the RedisReply::checkStatus() null-check change:

  • repeated RuntimeError: Unable to connect to redis (unix-socket) - No such file or directory(1): Cannot assign requested address
  • unrelated failures in test_portchannel.py (lacpkey, TPID expected 37376 but got 0, netdev oper status down), test_route.py, test_storm_control.py, test_switch.py, test_soft_bfd.py, test_watermark.py, MUX tests, and P4RT L3 tests

Log evidence saved locally at /tmp/pr-logs/swss-common-1166-build1112298-vstest-run-vs-tests.log.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI failure in Azure build 1112472 looks unrelated to this RedisReply null-check change.

Failed required check: Azure.sonic-swss-common (Test vstest): https://github.com/sonic-net/sonic-swss-common/runs/75769297882

The log shows broad VS/DVS instability rather than a RedisReply code failure: repeated Redis unix-socket connection errors, portchannel LACP/TPID failures, VNET VR object assertion failures, and P4RT/DASH failures:

  • RuntimeError: Unable to connect to redis (unix-socket) - No such file or directory(1): Cannot assign requested address
  • test_portchannel.py::TestPortchannel::test_Portchannel_lacpkey / test_Portchannel_tpid
  • test_vnet.py::TestVnetOrch::test_vnet_orch_* with The VR objects are not created
  • p4rt/test_p4rt_tunnel_decap.py::TestP4RTunnelDecap::test_TunnelDecapGroupAddAndDelete

Saved the inspected log as /tmp/pr-ci-logs/swss-common-1166-build1112472-log213-vstest.txt. Requesting a rerun.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI note for #1166: the required Azure.sonic-swss-common (Test vstest) failure in Azure build 1112627 appears unrelated to this RedisReply::checkStatus() null-check change.

The log shows broad VS/DVS instability across unrelated test areas:

  • repeated Redis unix-socket connection failures (RuntimeError: Unable to connect to redis (unix-socket))
  • portchannel/LACP/TPID failures in test_portchannel.py
  • MUX neighbor assertion in test_mux.py::TestMuxTunnel::test_neighbor_learned_before_mux_config
  • P4RT/DASH/VNET failures, including OSError: [Errno 24] Too many open files

The build and coverage jobs passed; re-running Azure to get a clean VS test leg.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI note for #1166: the required Azure.sonic-swss-common failure in Azure build 1112739 is unrelated to the RedisReply::checkStatus() null guard change.

The failing vstest job stopped during dependency setup before tests ran because the Azure worker could not fetch an Ubuntu package:

E: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/pool/main/b/boost1.74/libboost-serialization1.74.0_1.74.0-14ubuntu3_amd64.deb  Could not connect to azure.archive.ubuntu.com:80 (20.106.104.242), connection timed out
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

The build and coverage checks completed green, so this looks like transient Azure/Ubuntu mirror connectivity rather than a PR regression. Re-running Azure.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI failure looks unrelated to this PR (#1166):

  • Azure.sonic-swss-common (Test vstest) failed in Azure build 1112810.
  • All build/coverage jobs passed; the failure is in VS/DVS tests.
  • Logs show broad testbed instability: test_mux.py::TestMuxTunnel::test_neighbor_miss[active->ping_serv->standby->resolve_entry->delete_entry-IPv6] failed, with repeated RuntimeError: Unable to connect to redis (unix-socket) - No such file or directory(1): Cannot assign requested address during cleanup and DASH/MUX test runs.
  • This is not related to the PR's RedisReply::checkStatus() null-guard change.

Requesting an Azure rerun.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI failure in Azure build 1112879 looks unrelated to this PR.

Azure.sonic-swss-common (Test vstest) failed while all build/coverage checks passed. The VS run failed in existing ACL tests (test_acl.py::TestAcl::test_V6AclRuleVlanId and test_acl.py::TestAcl::test_AclRuleIcmpV6) with missing ACL counter mappings, plus repeated Redis unix-socket cleanup warnings:

RuntimeError: Unable to connect to redis (unix-socket) - No such file or directory(1): Cannot assign requested address

This PR only adds a null guard in RedisReply::checkStatus() and does not touch ACL orchestration/counter mapping or VS test setup. Requesting a fresh Azure run.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI failure classification for #1166:

Azure build 1113069 failed required Azure.sonic-swss-common / Azure.sonic-swss-common (Test vstest).

The failed job is unrelated to the RedisReply::checkStatus() null-guard change. Build/coverage jobs passed, and the vstest log shows broad VS/DVS instability across unrelated suites:

  • many VLAN tests failed together (test_AddPortChannelToVlanRaceCondition, test_VlanMemberTaggingMode, test_VlanDbData, etc.) with unexpected DB state assertions;
  • almost the entire MUX prefix-route shard failed (30 failures in one shard);
  • cleanup emitted repeated Redis unix-socket errors (Unable to connect to redis (unix-socket) ... Cannot assign requested address);
  • reruns within the same job later passed several of the previously failed shards, which points to infrastructure/flakiness rather than this PR.

Saved logs locally for traceability:

  • /tmp/pr-ci-logs/swss-common-1166-build1113069-log213.txt
  • /tmp/pr-ci-logs/swss-common-1166-build1113069-log221.txt

Re-running Azure.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@rustiqly

Copy link
Copy Markdown
Contributor Author

CI classification for #1166: Azure build 1113228 failed required Azure.sonic-swss-common (Test vstest). The failure is broad VS/P4RT instability: multiple P4RT L3/multicast tests failed after repeated Redis unix-socket connection errors (Unable to connect to redis (unix-socket) ... Cannot assign requested address). Build and coverage jobs passed, and the failing tests are unrelated to the RedisReply::checkStatus() null guard change. Saved logs under /tmp/pr-ci-logs/swss-common-1166-build1113228-log194-vstest.txt. Rerunning Azure.

@rustiqly

Copy link
Copy Markdown
Contributor Author

/azp run Azure.sonic-swss-common

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[redisreply] Null pointer dereference in checkStatus() when m_reply->str is NULL

3 participants