Skip to content

Improve cluster cleanup for in-memory integTest nodes#6127

Open
cwperks wants to merge 1 commit intoopensearch-project:mainfrom
cwperks:ssl-test-flaky
Open

Improve cluster cleanup for in-memory integTest nodes#6127
cwperks wants to merge 1 commit intoopensearch-project:mainfrom
cwperks:ssl-test-flaky

Conversation

@cwperks
Copy link
Copy Markdown
Member

@cwperks cwperks commented May 1, 2026

Description

Fixes test infrastructure issues that cause cascading flaky failures — specifically thread leaks and port conflicts caused by incomplete cleanup after partial cluster startup failures.

Root cause: When a test cluster node fails to start (e.g., BindHttpException: Address already in use), already-started nodes were not being shut down. Their thread pools and port bindings leaked into subsequent tests, causing ThreadLeakError and further BindHttpException failures.

Example: https://github.com/opensearch-project/security/actions/runs/25217779716/job/73942275556

Changes

ClusterHelper.java:

  • Call closeAllNodes() before throwing when startCluster() fails partway through. Previously, nodes that started successfully before the failure were abandoned — their management thread pools and port bindings leaked.
  • Increase awaitClose timeout from 250ms to 5 seconds. The previous timeout was too short for thread pools to drain, causing silent thread leaks even during normal shutdown.

SingleClusterTest.java:

  • Always call clusterHelper.stopCluster() in tearDown(), regardless of whether clusterInfo was set. Previously, cleanup was skipped entirely after a failed startup because clusterInfo is only assigned on success.
  • Defer Assert.fail until after both clusters are cleaned up. Previously, if remote cluster stop failed, Assert.fail threw immediately and prevented local cluster cleanup from running.

Testing

  • Verified compilation
  • Changes are to test infrastructure only — no production code modified

Check List

  • Commits are signed per the DCO using --signoff

Signed-off-by: Craig Perkins <cwperx@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Fix tearDown to always stop cluster and defer Assert.fail

Relevant files:

  • src/test/java/org/opensearch/security/test/SingleClusterTest.java

Sub-PR theme: Fix ClusterHelper node cleanup on startup failure and increase awaitClose timeout

Relevant files:

  • src/test/java/org/opensearch/security/test/helper/cluster/ClusterHelper.java

⚡ Recommended focus areas for review

Exception Swallowed

In closeNode, both node.close() and node.awaitClose(5, TimeUnit.SECONDS) are inside the same try-catch that silently ignores all Throwables. If node.close() throws, awaitClose will never be called, and the exception is silently swallowed. Consider separating the two calls or at least logging the error.

private static void closeNode(Node node) {
    try {
        node.close();
        node.awaitClose(5, TimeUnit.SECONDS);
    } catch (Throwable e) {
        // ignore
    }
Cleanup in Error Path

When err.get() != null, closeAllNodes() is called before throwing. However, closeAllNodes() itself can throw an Exception. If it does, the original error stored in err.get() will be lost and replaced by the cleanup exception. Consider logging or suppressing the cleanup exception to preserve the original error context.

if (err.get() != null) {
    closeAllNodes();
    throw new RuntimeException("Could not start all nodes " + err.get(), err.get());
Incomplete Failure Message

The Assert.fail message only includes firstFailure.getMessage(), which may be null or lack context. Consider using firstFailure.toString() or including the cluster name for better diagnostics.

if (firstFailure != null) {
    Assert.fail("Cluster cleanup failed: " + firstFailure.getMessage());
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard cluster stop call against null state

The clusterHelper.stopCluster() is now called unconditionally, even when clusterInfo
is null. The original code guarded this call with if (clusterInfo != null). Calling
stopCluster() when no cluster was started could cause a NullPointerException or
unexpected behavior.

src/test/java/org/opensearch/security/test/SingleClusterTest.java [182-188]

-try {
-    clusterHelper.stopCluster();
-} catch (Exception e) {
-    log.error("Failed to stop cluster.", e);
-    if (firstFailure == null) firstFailure = e;
+if (clusterInfo != null) {
+    try {
+        clusterHelper.stopCluster();
+    } catch (Exception e) {
+        log.error("Failed to stop cluster.", e);
+        if (firstFailure == null) firstFailure = e;
+    }
+    clusterInfo = null;
 }
-clusterInfo = null;
Suggestion importance[1-10]: 6

__

Why: The original code guarded clusterHelper.stopCluster() with if (clusterInfo != null), but the new code calls it unconditionally. This could cause issues if no cluster was started, though it depends on stopCluster()'s implementation handling null/uninitialized state gracefully.

Low
General
Avoid null message in failure assertion

firstFailure.getMessage() may return null for some exception types, resulting in the
unhelpful message "Cluster cleanup failed: null". Use firstFailure.toString() or
provide a fallback to include the exception class name and message.

src/test/java/org/opensearch/security/test/SingleClusterTest.java [190-192]

 if (firstFailure != null) {
-    Assert.fail("Cluster cleanup failed: " + firstFailure.getMessage());
+    Assert.fail("Cluster cleanup failed: " + firstFailure);
 }
Suggestion importance[1-10]: 4

__

Why: Exception.getMessage() can return null for some exception types, making the failure message unhelpful. Using firstFailure.toString() would include the exception class name and is a minor but valid improvement.

Low

@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.85%. Comparing base (ea6087c) to head (4d37c1f).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6127      +/-   ##
==========================================
+ Coverage   74.78%   74.85%   +0.06%     
==========================================
  Files         447      447              
  Lines       28467    28474       +7     
  Branches     4328     4330       +2     
==========================================
+ Hits        21289    21313      +24     
+ Misses       5184     5167      -17     
  Partials     1994     1994              

see 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant