Improve cluster cleanup for in-memory integTest nodes#6127
cwperks wants to merge 1 commit into opensearch-project:main
Signed-off-by: Craig Perkins <cwperx@amazon.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #6127 +/- ##
==========================================
+ Coverage 74.78% 74.85% +0.06%
==========================================
Files 447 447
Lines 28467 28474 +7
Branches 4328 4330 +2
==========================================
+ Hits 21289 21313 +24
+ Misses 5184 5167 -17
Partials 1994 1994
Description
Fixes test infrastructure issues that cause cascading flaky failures, specifically thread leaks and port conflicts caused by incomplete cleanup after partial cluster startup failures.
Root cause: When a test cluster node failed to start (e.g., `BindHttpException: Address already in use`), already-started nodes were not shut down. Their thread pools and port bindings leaked into subsequent tests, causing `ThreadLeakError` and further `BindHttpException` failures.

Example: https://github.com/opensearch-project/security/actions/runs/25217779716/job/73942275556
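The cleanup-on-partial-failure pattern behind this root cause can be sketched in isolation. This is a minimal illustration, not the code in `ClusterHelper.java`: the `Node` interface, `FakeNode`, `startAll`, and `closeAll` are hypothetical names introduced only for this example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PartialStartupCleanupSketch {

    /** Hypothetical stand-in for a test node; not the plugin's real node type. */
    interface Node {
        void start() throws Exception;
        void close() throws Exception;
    }

    /** Test double: optionally fails on start and records whether close() ran. */
    static final class FakeNode implements Node {
        final boolean failOnStart;
        boolean closed;

        FakeNode(boolean failOnStart) {
            this.failOnStart = failOnStart;
        }

        @Override
        public void start() throws Exception {
            if (failOnStart) {
                throw new IOException("Address already in use");
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /**
     * Starts nodes in order. If any start fails, every node that already
     * started is closed before the failure is rethrown.
     */
    static List<Node> startAll(List<Node> nodes) throws Exception {
        List<Node> started = new ArrayList<>();
        try {
            for (Node node : nodes) {
                node.start();
                started.add(node);
            }
            return started;
        } catch (Exception startupFailure) {
            closeAll(started, startupFailure);
            throw startupFailure;
        }
    }

    /** Closes nodes in reverse start order; close errors are attached to the primary failure. */
    static void closeAll(List<Node> started, Exception primary) {
        for (int i = started.size() - 1; i >= 0; i--) {
            try {
                started.get(i).close();
            } catch (Exception closeFailure) {
                primary.addSuppressed(closeFailure);
            }
        }
    }
}
```

Attaching close errors as suppressed exceptions keeps the original startup failure as the reported cause, which is one reasonable choice among several.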
Changes
ClusterHelper.java:
- Call `closeAllNodes()` before throwing when `startCluster()` fails partway through. Previously, nodes that started successfully before the failure were abandoned, and their management thread pools and port bindings leaked.
- Increase the `awaitClose` timeout from 250ms to 5 seconds. The previous timeout was too short for thread pools to drain, causing silent thread leaks even during normal shutdown.

SingleClusterTest.java:
- Always call `clusterHelper.stopCluster()` in `tearDown()`, regardless of whether `clusterInfo` was set. Previously, cleanup was skipped entirely after a failed startup because `clusterInfo` is only assigned on success.
- Defer `Assert.fail` until after both clusters are cleaned up. Previously, if the remote cluster stop failed, `Assert.fail` threw immediately and prevented local cluster cleanup from running.

Testing
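The deferred-failure ordering described for SingleClusterTest.java can be illustrated with a small standalone sketch. It assumes nothing about the real test class: `StopAction` and `stopAll` are hypothetical names, and a plain `AssertionError` stands in for `Assert.fail`.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredTeardownSketch {

    /** Hypothetical stand-in for "stop one cluster"; may throw. */
    interface StopAction {
        void stop() throws Exception;
    }

    /**
     * Runs every stop action, collecting failures along the way, and
     * only fails once all actions have had a chance to run.
     */
    static void stopAll(List<StopAction> actions) {
        List<Exception> failures = new ArrayList<>();
        for (StopAction action : actions) {
            try {
                action.stop();
            } catch (Exception e) {
                // Remember the failure but keep going so later cleanup still runs.
                failures.add(e);
            }
        }
        if (!failures.isEmpty()) {
            AssertionError error =
                new AssertionError("Failed to stop " + failures.size() + " cluster(s)");
            failures.forEach(error::addSuppressed);
            throw error;
        }
    }
}
```

With this shape, a failure while stopping the remote cluster no longer prevents the local cluster from being stopped; the test still fails, just after both attempts.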
Check List