Move ZMQ socket bind to poll thread to prevent orch blocking #1157

Open
ypcisco wants to merge 1 commit into sonic-net:master from ypcisco:zmqserver_move_bind_to_poll_thread

ypcisco commented Mar 6, 2026

Why I did it

  • In scale scenarios, zmq_bind() can take a significant amount of time to complete
  • The orchestration agent (orchagent) is blocked during ZmqServer initialization
  • A synchronous bind in the main thread delays startup and hurts system responsiveness

How I did it

  • Moved the zmq_bind() call from ZmqServer::bind() to mqPollThread()
  • Socket creation and configuration remain in the main thread
  • The actual bind operation now happens in the background poll thread

Signed-off-by: Yash Pandit <ypcisco@gmail.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

prsunny requested a review from qiluo-msft on March 6, 2026 23:23

anish-n left a comment

Contributor

Approving, but I would ask Qi or someone else with more context on this area to review as well.

prsunny commented Mar 18, 2026

Contributor

@ypcisco, can you fix coverage? @qiluo-msft, please review.

@qiluo-msft

Contributor

Could you provide evidence (logs, profiling data) showing that zmq_bind() itself is the bottleneck?

Comment thread common/zmqserver.cpp
    SWSS_LOG_NOTICE("Attempting to bind to zmq endpoint: %s", m_endpoint.c_str());
    if (zmq_bind(m_socket, m_endpoint.c_str()) != 0)
    {
        SWSS_LOG_THROW("zmq_bind failed on endpoint: %s, zmqerrno: %d",

qiluo-msft Mar 18, 2026

Contributor

This thread has NO catch-all block. Throwing an exception here will crash the process immediately.

ypcisco Mar 19, 2026

Author

I think that was the intent of the original code: terminate the process rather than just log the message.
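
For context on this exchange: in standard C++, an exception that escapes the function running in a `std::thread` causes `std::terminate` to be called, which is the crash the reviewer describes. If that behavior were not wanted, one possible alternative (not what this PR does) is to capture the exception in the worker and hand it back to the owning thread. The sketch below uses hypothetical helper names, `runAndCapture` and `describe`, purely for illustration.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Runs `work` on a new thread and returns whatever exception it threw,
// or a null exception_ptr if it completed normally. Without the
// catch-all below, an escaping exception would call std::terminate.
std::exception_ptr runAndCapture(const std::function<void()>& work)
{
    std::exception_ptr error;
    std::thread worker([&]() {
        try
        {
            work();
        }
        catch (...)
        {
            error = std::current_exception();
        }
    });
    worker.join(); // join() synchronizes, so reading `error` afterwards is safe
    return error;
}

// Turns a captured exception into a message the owning thread can act on.
std::string describe(const std::exception_ptr& error)
{
    if (!error)
    {
        return "ok";
    }
    try
    {
        std::rethrow_exception(error);
    }
    catch (const std::exception& e)
    {
        return e.what();
    }
    catch (...)
    {
        return "unknown error";
    }
}
```

With this pattern the owning thread decides whether a bind failure is fatal, instead of the process terminating from inside the worker.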

ypcisco commented Mar 19, 2026

Author

> Could you provide evidence (logs, profiling data) showing that zmq_bind() itself is the bottleneck?

Here are some logs showing the ZmqServer bind blocking orchagent for about 4 minutes 10 seconds (15:19:12 to 15:23:22).

Syslog:

2025 Jun 26 15:19:12.294637 sonic NOTICE swss#orchagent: :- main: Attempting to bind ZMQ server to tcp://127.0.0.1:8100...  >>>> This log was added for debugging just before zmq_server->bind() call itself
2025 Jun 26 15:23:22.548010 sonic NOTICE swss#orchagent: :- main: ZMQ channel on the northbound side of Orchagent successfully bound: tcp://127.0.0.1:8100, 

swss.rec:

2025-06-26.15:18:42.506924|recording started
2025-06-26.15:23:22.549016|CRM|Config|SET|acl_counter_high_threshold:85|acl_counter_low_threshold:70|acl_counter_threshold_type:percentage|acl_entry_high_threshold:85|acl_entry_low_threshold:70|acl_entry_threshold_type:percentage|acl....
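
The stall can be read directly off the timestamps above. As a small sketch, the helpers below (hypothetical names `toMicros` and `elapsedMicros`) compute the gap between two same-day `HH:MM:SS.ffffff` timestamps; they assume exactly six fractional digits, as in these logs.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Converts "HH:MM:SS.ffffff" (six fractional digits) to microseconds
// since midnight.
std::int64_t toMicros(const std::string& timeOfDay)
{
    int h = 0, m = 0, s = 0, us = 0;
    if (std::sscanf(timeOfDay.c_str(), "%2d:%2d:%2d.%6d", &h, &m, &s, &us) != 4)
    {
        throw std::invalid_argument("expected HH:MM:SS.ffffff: " + timeOfDay);
    }
    return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

// Elapsed microseconds between two timestamps on the same day.
std::int64_t elapsedMicros(const std::string& start, const std::string& end)
{
    return toMicros(end) - toMicros(start);
}
```

Applied to the two syslog lines, `elapsedMicros("15:19:12.294637", "15:23:22.548010")` gives 250253373 microseconds, roughly 250 seconds between the bind attempt and the success message. The first swss.rec entry after "recording started" arrives about a millisecond after the bind completes.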

5 participants