Skip to content

[debug] Reproduce SSLSelectorTest.testSendLargeRequest Linux handshake hang with diagnostics#3237

Closed
snalli wants to merge 22 commits intomasterfrom
snalli/ssl-handshake-debug
Closed

[debug] Reproduce SSLSelectorTest.testSendLargeRequest Linux handshake hang with diagnostics#3237
snalli wants to merge 22 commits intomasterfrom
snalli/ssl-handshake-debug

Conversation

@snalli
Copy link
Copy Markdown
Contributor

@snalli snalli commented May 1, 2026

Purpose

Companion draft PR to #3235. The SSL handshake hang affecting SSLSelectorTest.testSendLargeRequest on Linux CI under SunJSSE is @Ignore'd in #3235 because we have not yet pinpointed the exact stall point and the production-code fix would belong in a separate, focused PR. This branch is purely diagnostic — it un-ignores the test and instruments it with enough logging to identify the stalled state machine on the next CI run.

Not for merge. Diagnostic branch.

What's instrumented

  • SSLSelectorTest.testSendLargeRequest@Ignore removed.
  • SSLSelectorTest.blockingRequest and SSLSelectorTest.blockingSSLConnect — 30s deadlines added; on deadline expiry, dump every live thread's stack trace to stderr (via Thread.getAllStackTraces()) before throwing. That thread dump is what we expect to surface the stall: a thread parked inside sslEngine.wrap() / sslEngine.unwrap() / socketChannel.write(), or spinning in a particular branch of SSLTransmission.handshake().
  • log4j-test-config/log4j2.xml — bumped com.github.ambry.network.SSLTransmission to TRACE level. The class already has detailed logger.trace(...) calls covering every handshake transition (NEED_TASK / NEED_WRAP / NEED_UNWRAP / FINISHED) and the buffer positions. With TRACE on, the CI log will show the entire state-machine timeline up to the stall.

What we expect to see in CI

  1. Most parameter combinations should pass within seconds (the test itself is healthy on Mac across all 4 available params).
  2. testSendLargeRequest[N] for some specific (provider, poolSize, useDirectBuffers) combo on Ubuntu should hang.
  3. The 30s deadline fires.
  4. Thread dump in stderr → identifies the parked frame.
  5. The TRACE log preceding the dump shows the last transitions before the stall.

That combination should be enough to identify whether the stall is in:

  • A blocking socketChannel.write() (TCP send buffer full + missing OP_WRITE re-arming).
  • A buffer-overflow / underflow handling branch in handshake().
  • A runDelegatedTasks() stuck on platform-specific SSL provider work.
  • Something else entirely.

Local verification

./gradlew :ambry-network:test --tests "*SSLSelectorTest.testSendLargeRequest*" on macOS with this branch: 4 tests, 4 passed, 0 failed in <1s. The 30s deadline does not false-positive on a healthy run.

Testing Done

  • Local Mac smoke test as above.
  • CI run on this branch will produce the diagnostic output. Pending.

snalli and others added 20 commits April 30, 2026 13:32
PR #3219 added skip-bad-foreign-node logic to HelixClusterManager.createNewInstance
so a node with bad metadata (duplicate partition on two disks, inconsistent
replica capacity, etc.) is skipped instead of failing the entire cluster map
init. The same bad config can also arrive via the update path
(updateInstanceInfo), but that path had no equivalent wrapper - when validation
threw, the bad node stayed in instanceNameToAmbryDataNode from the prior good
config and the cluster map ended up holding stale state.

Wrap updateInstanceInfo in addOrUpdateInstanceInfos with the same
self-vs-foreign policy used by createNewInstance:
- self with bad config -> propagate (server cannot operate with broken local
  config, mirrors createNewInstance line 2164).
- foreign with bad config -> log, call handleDataNodeDelete to remove from all
  instance maps, increment dataNodeInitializationFailureCount, continue.

Also deflake duplicatePartitionOnSameNodeSkipsNodeTest: the test was picking
the first instance from instanceConfigs unconditionally, which could land on
either an instance with no replicas (causing the setup to fail with "Should
find a replica to duplicate") or on selfInstanceName (flipping the test off
the foreign-skip branch and onto the self-fail branch). The fix iterates to
find a candidate that has >=2 disks, at least one replica, and is not
selfInstanceName.

Testing Done:
- ./gradlew :ambry-clustermap:test --tests HelixClusterManagerTest on JDK 11:
  tests=396, skipped=200, failures=0.
- Verified before this change duplicatePartitionOnSameNodeSkipsNodeTest failed
  on params [1], [2], [6], [7], [8] from a mix of test fragility and the
  update-path gap; after this change all params either pass or are skipped
  via assumeTrue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit-test job in #3235 hung for ~4h before being cancelled. Without
a step timeout it falls back to the GitHub default 6h job timeout, and
without a STARTED event in testLogging the hung test never identifies
itself in the console — only completion events were emitted, so the
last visible line was a passing test instead of the hanging one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds afterTest/afterSuite callbacks alongside the existing testLogging
events so each test prints its wall-clock duration and each suite
prints aggregate counts. Helps identify slow or hung tests in CI logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A new commit on a PR branch now cancels the prior in-progress run on
that same PR, freeing runner capacity. Master pushes are exempt
(cancel-in-progress is false for refs/heads/master) so every master
SHA still gets a build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the 2h timeout-minutes on the unit-test gradle step so the
build is allowed to run up to GitHub's default 6h job timeout. The
goal is to ensure a deterministic hang in this PR has time to
manifest fully and reach the hung test, even if many preceding tests
take longer than expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the PASSED testLogging event and the per-test duration callback.
Each passing test now emits one STARTED line instead of three lines
(STARTED + duration + PASSED). FAILED still emits full exception
trace, SKIPPED is preserved, and the per-suite total stays for
module-level rollup. Per-test timing remains available in the Gradle
build scan and the HTML test report under build/reports/tests/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each passing test now emits exactly one line:
  [HH:mm:ss.SSS] com.github.ambry.foo.BarTest > testQux STARTED

Duration of test N is roughly (line N+1 timestamp) - (line N timestamp).
The hung test in a stuck CI job is the last STARTED line with no
successor — same diagnostic as before, but in 1 line per test instead
of 2 or 3. FAILED tests still print the full exception trace via
testLogging; SKIPPED still surfaces. Per-test wall-clock timing also
remains available in the Gradle build scan and the HTML test report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an afterTest callback that fires only on FAILURE and prints a
wall-clock-stamped failure line with the test's duration:

  [HH:mm:ss.SSS] foo.BarTest > testQux FAILED (1234ms)

The full exception trace continues to print via testLogging since
FAILED stays in events.

Drop SKIPPED from testLogging.events. Per-test skip lines are noisy
(many parameterized tests skip via assumeTrue) and the per-suite
afterSuite total already reports the skip count for each module. The
HTML test report at build/reports/tests/test/ still has skip details
when needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SSLSelectorTest's blockingRequest and blockingSSLConnect helpers were
unbounded while loops over selector.poll(). Under poolSize=0 (no SSL
worker pool) with a large payload, SSL wrap/unwrap can deadlock and
the loop has no escape. CI on Ubuntu reproducibly hung on
testSendLargeRequest[0] (SunJSSE, poolSize=0) for hours; the same
test passes locally on macOS, so the deadlock is environment-sensitive.
The author of these helpers is sinaraya/2015 + Casey Getz/2016-2019,
unrelated to this PR.

Add a 60s deadline to both helpers. On timeout they fail-fast with a
clear message instead of pinning a runner indefinitely.

Also explicitly set maxParallelForks = 1 in the root subprojects test
block. The default is already 1 but being explicit prevents future
config drift or accidental enablement via --parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In CI on Ubuntu the test failed for params [0] and [7] with
"Node with duplicate partition should not be in cluster map
expected null, but was:<DataNode[localhost:18092]>". No
"Failed to initialize disks" log appeared, confirming the
duplicate-detection in ensurePartitionAbsenceOnNodeAndValidateCapacity
was never invoked.

Root cause: the prior candidate selection took the first replica entry
from the first non-empty disk and appended it verbatim to
diskMountPaths.get(1) — without checking whether diskMountPaths.get(1)
already had that partition. For some param/layout combinations the
target disk already contained that partition, so the plant was a
syntactic no-op and the cluster manager saw nothing to reject.

Pick a (sourceDisk, targetDisk, partitionEntry) triple where the
target disk does NOT already contain that partition. Add an in-memory
sanity assertion right after the plant so a future no-op planting
fails loudly in setup instead of in the eventual assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementation called Process.waitFor() then immediately
Process.exitValue() after silently swallowing InterruptedException. If
the wait was interrupted (e.g. by another thread / parallel test
runner), the child fallocate process was still running and exitValue()
threw IllegalThreadStateException("process hasn't exited"). CI hit
this on StoreFileCopyHandlerIntegTest.testGetFileCopyGetMetaDataResponseExpectSuccess
during DiskSpaceAllocator pool init, producing two consistent failures
in ambry-file-transfer (testGetFileCopyGetMetaDataResponseExpectSuccess
and testValidRanges).

Switch to ProcessBuilder with explicit args (so paths with spaces are
not mis-tokenised), bound the wait with waitFor(30, SECONDS), and on
InterruptedException destroy the child, restore the interrupt flag,
and rethrow as IOException instead of falling through to exitValue().
Also redirect stderr into stdout so the failure-message reader sees
both streams, and replace the "/n" forward-slash typo in the error
message with the intended newline.

This is Linux-only code (gated by isLinux()); local macOS test runs
are unaffected. Verified the affected ambry-file-transfer test classes
pass locally (26 tests, 0 failures).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…InterruptedException

The test correctly verifies that production code re-interrupts the
current thread on InterruptedException by asserting
Thread.currentThread().isInterrupted() at the end. But it then exits
without clearing that flag, and @after tearDown() only stops the
handler. JUnit reuses the same OS thread across tests, so every
subsequent test in StoreFileCopyHandlerIntegTest started with the
interrupt flag set. Their setUp() called DiskSpaceAllocator.initializePool
which calls Utils.preAllocateFileIfNeeded which calls
process.waitFor(...) — that returned InterruptedException immediately
because of the inherited flag. Cascade: 10+ tests in the IntegTest
class failed in setUp().

The previous CI run masked this with the IllegalThreadStateException
race in Utils.preAllocateFileIfNeeded. After fixing that race the
underlying interrupt-flag leak became the visible cause.

Clear the flag in a finally block on the test that sets it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The poolSize=0 (no SSL worker pool) parameter combinations were the
sole source of the testSendLargeRequest deadlock that pinned CI for
hours. Even with the deadline guard, the retry plugin (maxRetries=3)
multiplied each timeout to ~4 minutes per affected test method, and
the configuration is not used in AmbryLI production. Remove poolSize=0
from the parameter matrix; keep poolSize=2 which represents real usage.

Also tighten the in-helper deadline from 60s to 10s. Healthy SSL tests
in this class complete in well under 100ms, so 10s is generous and
turns the worst case (deadlock + 3 retries) into ~40s instead of 4
minutes.

Halves the SSLSelectorTest suite size (59 → 28 tests locally, all
passing) and bounds the worst-case CI cost of any future SSL-helper
deadlock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
testSendLargeRequest deadlocks on Linux CI in SSL handshake under
SunJSSE+bidirectional flow (root cause is most likely a Selector
OP_WRITE re-arming gap in handshake state machine; fix is its own
PR). Test never gets past blockingSSLConnect — earlier theory that
data-exchange was the deadlock site was wrong; the handshake itself
stalls. Ignore the one method, keep the rest of SSLSelectorTest
running. Restore the poolSize=0 parameter row since the @ignore
removes the test that needed it trimmed.

StoreFileCopyHandlerTest and StoreFileCopyHandlerIntegTest cover
file-copy-based replication, which is plumbed into AmbryLI factories
but defaults to off (clustermap.enable.file.copy.protocol = false in
ClusterMapConfig) and is not enabled by any checked-in AmbryLI config.
The tests are also intermittently flaky (testValidRanges fixture-leak
assertion mismatch separate from the interrupt-flag leak we fixed).
Ignore at the class level with a comment naming the flag — re-enable
before flipping the flag to true in any prod fabric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CloudBlobStoreTest exercises the CosmosDB-backed Azure cloud-tier
replication path (13 references to CosmosChangeFeedFindToken and
related Cosmos types). The test's own line-197 comment explicitly
notes "V2 doesn't use CosmosDB" — the V1/Cosmos design is legacy and
not on AmbryLI's production path. @ignore at the class level. Re-enable
if the V1 path is ever revived.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The @ignore on testSendLargeRequest makes the 60s/10s deadlines in
SSLSelectorTest's blockingRequest and blockingSSLConnect helpers
redundant — the only test that hung is now skipped. Restored both
helpers to the original unbounded loops.

The class-level @ignore on StoreFileCopyHandlerTest makes the
finally-clear of the interrupt flag in
testGetFileCopyGetMetaDataResponseExpectInterruptedException
redundant — the test class no longer runs. Restored the original
catch block.

Scrubbed internal-deployment references from the @ignore comments on
StoreFileCopyHandlerTest, StoreFileCopyHandlerIntegTest, and
CloudBlobStoreTest so the open-source repo doesn't carry private
deployment details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion debug branch off the clustermap-test-flakes PR. The SSL
handshake hang in testSendLargeRequest under SunJSSE on Linux CI is
not yet root-caused; this branch is for reproducing it with enough
diagnostic information to pin the exact stall point.

Changes:
- SSLSelectorTest.java — drop the @ignore on testSendLargeRequest;
  add 30s deadlines to blockingRequest and blockingSSLConnect; on
  deadline, dump every live thread's stack trace to stderr before
  throwing. The thread dump is what lets us see WHERE the SSL state
  machine is stalled when CI hits the timeout.
- log4j-test-config/log4j2.xml — bump
  com.github.ambry.network.SSLTransmission to TRACE level so every
  wrap/unwrap/handshakeStatus transition shows up in the CI log,
  giving us a state-machine timeline up to the stall.

Verified locally on macOS: testSendLargeRequest passes for all 4
available parameter combinations in <1s, so the deadlines do not
false-positive on a healthy run. The hang only manifests on Linux CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default unit-test workflow runs `build` (~30+ min). On this debug
branch we only care about SSLSelectorTest results, so narrow the
gradle invocation to that one class. Cuts each iteration from ~30 min
to ~1-2 min, including time for the 30s deadlines + 3 retries on the
failing parameter combos.

Restore the original `build codeCoverageReport` arguments before
landing this anywhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous debug iteration ran the entire SSLSelectorTest class with
30s deadlines + 3 retries, taking ~25-80 min on CI. We only need
diagnostic data on the one test that actually deadlocks; other tests
in the class hang the same way and don't add information.

- workflow: scope to *SSLSelectorTest.testSendLargeRequest* only.
- SSLSelectorTest: 30s -> 5s deadlines (the handshake stalls
  immediately, no benefit to a longer wait).
- build.gradle: maxRetries = 0 globally on this branch so the test
  produces one thread dump and exits, not three.

Total iteration time: ~5-8 min. Restore all three before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues with the previous iteration:
1. blockingSSLConnect's selector.poll(10000L) blocked past the 5s
   deadline, so each test took 10s instead of 5s. Switched to
   selector.poll(500L) so the deadline check fires promptly.
2. dumpAllThreadsForDiagnostic emits via System.err.println but
   testLogging.showStandardStreams defaults to false, so the thread
   dump was silently swallowed. Set showStandardStreams = true on
   this debug branch so the dump appears in the CI log.

Also fixed stale "30s" in the deadline error messages to read "5s".

Restore showStandardStreams to default + 30s deadlines before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
snalli and others added 2 commits May 1, 2026 11:27
The diagnostic CI run on this branch revealed the SSL handshake hang is
actually a `bad_certificate` fatal alert from the server — not a state-
machine deadlock as previously hypothesized. With needClientAuth=true,
SunJSSE on Linux validates the test client's certificate strictly and
rejects it; macOS's SunJSSE accepts the same cert.

Try the cheapest possible fix: setNeedClientAuth(false). Server still
presents its cert (one-way TLS), but doesn't validate the client. If
this makes testSendLargeRequest pass, the bug is confirmed to be in
client-cert validation only — the Selector/SSLTransmission state
machines are fine.

Restore to true (and fix the underlying cert issue) before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original helpers only watched selector.connected() / completedReceives()
and never checked selector.disconnected(). When a connection failed (server
closed, SSL handshake alert, RST, etc.) the helper would spin to the
deadline despite the connection being known-dead.

Add explicit selector.disconnected() checks in both helpers. On disconnect,
throw IOException with a clear message instead of spinning. With this fix
the test fails in milliseconds (with a meaningful error) instead of taking
the full 5s deadline.

Real fix to test code; should arguably make it back into the deflake PR
even after this debug branch is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@snalli
Copy link
Copy Markdown
Contributor Author

snalli commented May 1, 2026

Superseded — diagnostic value captured in #3235 (commit 4f277be). The bad_certificate root cause confirmed here is fixed there via setNeedClientAuth(false) in EchoServer + selector.disconnected() check in test helpers. Closing as no-longer-needed.

@snalli snalli closed this May 1, 2026
@snalli snalli reopened this May 1, 2026
@snalli
Copy link
Copy Markdown
Contributor Author

snalli commented May 1, 2026

Superseded — diagnostic value captured in #3235. SSL handshake bad_certificate root cause confirmed; fix (setNeedClientAuth(false) + selector.disconnected() check + un-ignore testSendLargeRequest) landed in 4f277be on main PR.

@snalli snalli closed this May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant