Fix flaky ServerHttp2Test integration tests#3244
Closed
snalli wants to merge 8 commits intolinkedin:masterfrom
Closed
Fix flaky ServerHttp2Test integration tests#3244snalli wants to merge 8 commits intolinkedin:masterfrom
snalli wants to merge 8 commits intolinkedin:masterfrom
Conversation
ServerHttp2Test was the only Server*Test in ambry-server integration tests using @BeforeClass/@afterclass to create http2Cluster once and share it across all 6 @test methods x 2 parameter variants. The shared in-memory MockCluster accumulated blob-store, replication-token, and replica/disk state across tests, surfacing as flaky failures in replicateBlobV2MultipleCases (e.g. expected:<BlobNotFound> but was:<NoError>; expected:<NoError> but was:<ReplicaUnavailable>). Convert to @Before/@after per-test cluster lifecycle, matching every other Server*Test in the same module (ServerPlaintextTest, ServerSSLTest, ServerHardDeleteTest, ServerBatchDeleteTest, ServerPlaintextTokenTest, ServerSSLTokenTest, StatsManagerIntegrationTest). Each @test now gets a fresh MockCluster, eliminating cross-test state leakage as a flake source. The existing nettyByteBufLeakHelper @Before/@after hooks are folded into the same methods, with the leak check ordered after cluster teardown so released ByteBufs are reflected in the measurement. Cost: ~1-2 min added to ServerHttp2Test runtime (12 cluster bringups at ~5-10s each), the same cost the other 7 Server*Test classes already pay. forkEvery=1 for intTest is intentionally NOT applied here per the existing build.gradle comment documenting the MySqlNamedBlobDbIntegrationTest regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3244 +/- ##
=============================================
- Coverage 64.24% 18.22% -46.02%
+ Complexity 10398 3140 -7258
=============================================
Files 840 931 +91
Lines 71755 79433 +7678
Branches 8611 9500 +889
=============================================
- Hits 46099 14478 -31621
- Misses 23004 63595 +40591
+ Partials 2652 1360 -1292 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…en unavailable Companion to the @Before/@after lifecycle change in the previous commit. That change exposed a set of latent issues that the prior @BeforeClass single-cluster pattern had been masking; this commit addresses each. EventLoopGroup leaks (server-side): - Http2NetworkClientFactory allocated a NioEventLoopGroup per server but had no close() method. Per-test cluster lifecycle leaked ~9 groups per test invocation x 12 invocations = 108 leaked groups per test class, enough to exhaust the OS thread/FD limits on CI runners. - Add Http2NetworkClientFactory.close() (impl Closeable). Wire it into AmbryServer.shutdown() after replicationManager.shutdown() so in-flight network clients drain first. EventLoopGroup leaks (test-side): - Http2BlockingChannel test constructor allocates its own EventLoopGroup that disconnect() was a no-op for. Tests created channels via ServerTestUtil.getBlockingChannelBasedOnPortType and called disconnect() relying on the no-op contract; nothing released the group. - Track ownership in Http2BlockingChannel; add explicit close() that releases owned pool and EventLoopGroup. Keep disconnect() as a no-op to preserve the existing reversible-disconnect contract used by tests that call connect/disconnect/connect. - Track Http2BlockingChannel instances in ServerTestUtil; close them all in test @after via closeRegisteredHttp2Channels(). Replication readiness: - MockCluster.startServers() returns when servers listen but replication threads haven't established peer connections, racing tests against awaitBlobCreations' 60s budget. Plus MockCluster overrode replication throttle to 100ms (production default is 0). - Restore production-default throttle (0). - Add sentinel-blob warm-up in ServerHttp2Test.before(): PUT one probe blob and await full replication before tests start, so replication is provably alive on every replica before the test exercises it. MySQL test skip: - ServerTestUtil.repairRequestTest requires a real MySQL on localhost:3306. CI sets one up; local dev typically doesn't. Without handling, three test-retry attempts hammer the connection refused and fail loudly. - Add ServerTestUtil.isMysqlAvailable() (TCP probe) and assumeTrue check at the top of repairRequestTest. Test skips cleanly when MySQL is unavailable; runs normally in CI. Resource cleanup in @after (ServerHttp2Test): - Cluster cleanup, registered channel cleanup, truststore temp file delete, and Netty leak check, each in its own try/finally so a single failure doesn't skip the rest. Verified locally: ServerHttp2Test now runs 12 tests, 0 failures, 2 skipped (both repairRequestTest variants — would run in CI with MySQL set up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NOT for merge. Disable all unit-test groups, store-test, and int-test so the next CI run only exercises :ambry-server:intTest. Add --info to gradle args and bump replication / http2 / AmbryServer log levels to debug so we can capture where Http2NetworkClientFactory.close() and RouterServerSSLTest interleave (the prior run failed with RouterErrorCode RouterClosed before OOM'ing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous static REGISTERED_HTTP2_CHANNELS list in ServerTestUtil was JVM-wide: every test class's HTTP/2 channels ended up in it, but only ServerHttp2Test.@after cleared it. So channels created by other test classes (e.g., RouterServerSSLTest) could be silently closed by a ServerHttp2Test @after running later in the same JVM, which manifested as RouterClosed assertion failures in the unrelated test. Replace the static list with a ThreadLocal<List<Http2BlockingChannel>> and an opt-in API: enableHttp2ChannelTracking(list) / disableHttp2ChannelTracking(). getBlockingChannelBasedOnPortType only appends to the tracker if one is registered. Test classes that don't opt in (everyone except ServerHttp2Test for now) are not affected by the cleanup logic at all — their channels never enter any tracker. ServerHttp2Test holds a per-instance trackedHttp2Channels list, opts in @before, closes the tracked channels and opts out @after. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Throttle=0 (matching prod default) and the sentinel-blob warm-up were added together to handle cold-start replication on per-test cluster bringups. They turned out to interact badly with replicateBlobV2CaseTest: - That test calls controlReplicationForPartition(disable) and then PUTs blobs, expecting them not to replicate to a "third channel". With throttle=0, replication threads run continuously, so a replication cycle in flight when the disable call fires can finish anyway, making the blob land where the test expects it not to. - Master's throttle=100ms was an implicit safety margin that this test was relying on, not arbitrary slowness. Restore throttle to 100ms (master behavior) and remove the warm-up. With throttle=100ms, cold-start replication is paced enough that endToEndHttp2Replication... no longer needs the warm-up to make its 60s awaitBlobCreations budget. Net change vs master: per-test cluster lifecycle + HTTP/2 leak fixes (factory close, channel registry per-class) + MySQL skip via assumeTrue. Smaller, more defensible scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
ServerHttp2Testflakes on master CI:replicateBlobV2MultipleCases:expected:<BlobNotFound> but was:<NoError>/expected:<NoError> but was:<ReplicaUnavailable>endToEndHttp2ReplicationWithMultiNodeMultiPartition:awaitBlobCreationstimeoutFailures rotate; reruns mask them but waste ~50 min of CI time per occurrence.
Fix
Per-test
MockClusterlifecycle (@BeforeClass→@Before). Matches every otherServer*Testin this module. Removes cross-test state leakage as a flake source.Plug HTTP/2 EventLoopGroup leaks exposed by per-test lifecycle:
Http2NetworkClientFactory.close()(new) — wired intoAmbryServer.shutdown(). Was leaking ~1 group per server per cluster bringup.Http2BlockingChannel.close()(new) — explicit cleanup for instances that own a pool.disconnect()stays a no-op (existing reversible-disconnect contract).ServerTestUtilregisters test channels and closes them in@After.Determinism for cold-start replication:
MockCluster; was overridden to 100 ms with no documented reason.@Beforeproves replication is alive on every replica before tests start.Skip MySQL-dependent tests when MySQL isn't running:
assumeTrue+ TCP probe at the top ofServerTestUtil.repairRequestTest. CI sets up MySQL → test runs; local dev without MySQL → test skipped cleanly instead of failing through 3 retries.Testing Done
:ambry-server:intTest --tests ServerHttp2Test*locally: 12 tests, 0 failures, 2 skipped (the MySQL ones).