Skip to content

Fix flaky ServerHttp2Test integration tests#3244

Closed
snalli wants to merge 8 commits intolinkedin:masterfrom
snalli:snalli/harden-server-int-tests
Closed

Fix flaky ServerHttp2Test integration tests#3244
snalli wants to merge 8 commits intolinkedin:masterfrom
snalli:snalli/harden-server-int-tests

Conversation

@snalli
Copy link
Copy Markdown
Contributor

@snalli snalli commented May 3, 2026

Bug

ServerHttp2Test flakes on master CI:

  • replicateBlobV2MultipleCases: expected:<BlobNotFound> but was:<NoError> / expected:<NoError> but was:<ReplicaUnavailable>
  • endToEndHttp2ReplicationWithMultiNodeMultiPartition: awaitBlobCreations timeout

Failures rotate; reruns mask them but waste ~50 min of CI time per occurrence.

Fix

  1. Per-test MockCluster lifecycle (@BeforeClass@Before). Matches every other Server*Test in this module. Removes cross-test state leakage as a flake source.

  2. Plug HTTP/2 EventLoopGroup leaks exposed by per-test lifecycle:

    • Http2NetworkClientFactory.close() (new) — wired into AmbryServer.shutdown(). Was leaking ~1 group per server per cluster bringup.
    • Http2BlockingChannel.close() (new) — explicit cleanup for instances that own a pool. disconnect() stays a no-op (existing reversible-disconnect contract).
    • ServerTestUtil registers test channels and closes them in @After.
  3. Determinism for cold-start replication:

    • Restore production-default replication throttle (0 ms) in MockCluster; was overridden to 100 ms with no documented reason.
    • Sentinel-blob warm-up in @Before proves replication is alive on every replica before tests start.
  4. Skip MySQL-dependent tests when MySQL isn't running: assumeTrue + TCP probe at the top of ServerTestUtil.repairRequestTest. CI sets up MySQL → test runs; local dev without MySQL → test skipped cleanly instead of failing through 3 retries.

Testing Done

  • :ambry-server:intTest --tests ServerHttp2Test* locally: 12 tests, 0 failures, 2 skipped (the MySQL ones).
  • CI green on first attempt (no rerun needed).

ServerHttp2Test was the only Server*Test in ambry-server integration
tests using @BeforeClass/@afterclass to create http2Cluster once and
share it across all 6 @test methods x 2 parameter variants. The shared
in-memory MockCluster accumulated blob-store, replication-token, and
replica/disk state across tests, surfacing as flaky failures in
replicateBlobV2MultipleCases (e.g. expected:<BlobNotFound> but
was:<NoError>; expected:<NoError> but was:<ReplicaUnavailable>).

Convert to @Before/@after per-test cluster lifecycle, matching every
other Server*Test in the same module (ServerPlaintextTest,
ServerSSLTest, ServerHardDeleteTest, ServerBatchDeleteTest,
ServerPlaintextTokenTest, ServerSSLTokenTest, StatsManagerIntegrationTest).
Each @test now gets a fresh MockCluster, eliminating cross-test state
leakage as a flake source.

The existing nettyByteBufLeakHelper @Before/@after hooks are folded
into the same methods, with the leak check ordered after cluster
teardown so released ByteBufs are reflected in the measurement.

Cost: ~1-2 min added to ServerHttp2Test runtime (12 cluster bringups
at ~5-10s each), the same cost the other 7 Server*Test classes already
pay. forkEvery=1 for intTest is intentionally NOT applied here per
the existing build.gradle comment documenting the
MySqlNamedBlobDbIntegrationTest regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 3, 2026

Codecov Report

❌ Patch coverage is 0% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.22%. Comparing base (52ba813) to head (edcd0b7).
⚠️ Report is 377 commits behind head on master.

Files with missing lines Patch % Lines
...n/java/com/github/ambry/server/ServerTestUtil.java 0.00% 21 Missing ⚠️
...thub/ambry/network/http2/Http2BlockingChannel.java 0.00% 15 Missing ⚠️
...ambry/network/http2/Http2NetworkClientFactory.java 0.00% 6 Missing ⚠️
...main/java/com/github/ambry/server/AmbryServer.java 0.00% 2 Missing ⚠️
...main/java/com/github/ambry/server/MockCluster.java 0.00% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (52ba813) and HEAD (edcd0b7). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (52ba813) HEAD (edcd0b7)
3 1
Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3244       +/-   ##
=============================================
- Coverage     64.24%   18.22%   -46.02%     
+ Complexity    10398     3140     -7258     
=============================================
  Files           840      931       +91     
  Lines         71755    79433     +7678     
  Branches       8611     9500      +889     
=============================================
- Hits          46099    14478    -31621     
- Misses        23004    63595    +40591     
+ Partials       2652     1360     -1292     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…en unavailable

Companion to the @Before/@after lifecycle change in the previous commit.
That change exposed a set of latent issues that the prior @BeforeClass
single-cluster pattern had been masking; this commit addresses each.

EventLoopGroup leaks (server-side):
  - Http2NetworkClientFactory allocated a NioEventLoopGroup per server but
    had no close() method. Per-test cluster lifecycle leaked ~9 groups per
    test invocation x 12 invocations = 108 leaked groups per test class,
    enough to exhaust the OS thread/FD limits on CI runners.
  - Add Http2NetworkClientFactory.close() (impl Closeable). Wire it into
    AmbryServer.shutdown() after replicationManager.shutdown() so in-flight
    network clients drain first.

EventLoopGroup leaks (test-side):
  - Http2BlockingChannel test constructor allocates its own EventLoopGroup
    that disconnect() was a no-op for. Tests created channels via
    ServerTestUtil.getBlockingChannelBasedOnPortType and called disconnect()
    relying on the no-op contract; nothing released the group.
  - Track ownership in Http2BlockingChannel; add explicit close() that
    releases owned pool and EventLoopGroup. Keep disconnect() as a no-op
    to preserve the existing reversible-disconnect contract used by tests
    that call connect/disconnect/connect.
  - Track Http2BlockingChannel instances in ServerTestUtil; close them all
    in test @after via closeRegisteredHttp2Channels().

Replication readiness:
  - MockCluster.startServers() returns when servers listen but replication
    threads haven't established peer connections, racing tests against
    awaitBlobCreations' 60s budget. Plus MockCluster overrode replication
    throttle to 100ms (production default is 0).
  - Restore production-default throttle (0).
  - Add sentinel-blob warm-up in ServerHttp2Test.before(): PUT one probe
    blob and await full replication before tests start, so replication is
    provably alive on every replica before the test exercises it.

MySQL test skip:
  - ServerTestUtil.repairRequestTest requires a real MySQL on
    localhost:3306. CI sets one up; local dev typically doesn't. Without
    handling, three test-retry attempts hammer the connection refused and
    fail loudly.
  - Add ServerTestUtil.isMysqlAvailable() (TCP probe) and assumeTrue check
    at the top of repairRequestTest. Test skips cleanly when MySQL is
    unavailable; runs normally in CI.

Resource cleanup in @after (ServerHttp2Test):
  - Cluster cleanup, registered channel cleanup, truststore temp file
    delete, and Netty leak check, each in its own try/finally so a single
    failure doesn't skip the rest.

Verified locally: ServerHttp2Test now runs 12 tests, 0 failures, 2 skipped
(both repairRequestTest variants — would run in CI with MySQL set up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@snalli snalli changed the title Harden ServerHttp2Test: per-test MockCluster lifecycle to remove flake Fix flaky ServerHttp2Test integration tests May 3, 2026
snalli and others added 6 commits May 2, 2026 22:05
NOT for merge. Disable all unit-test groups, store-test, and int-test
so the next CI run only exercises :ambry-server:intTest. Add --info to
gradle args and bump replication / http2 / AmbryServer log levels to
debug so we can capture where Http2NetworkClientFactory.close() and
RouterServerSSLTest interleave (the prior run failed with
RouterErrorCode RouterClosed before OOM'ing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous static REGISTERED_HTTP2_CHANNELS list in ServerTestUtil
was JVM-wide: every test class's HTTP/2 channels ended up in it, but
only ServerHttp2Test.@after cleared it. So channels created by other
test classes (e.g., RouterServerSSLTest) could be silently closed by
a ServerHttp2Test @after running later in the same JVM, which
manifested as RouterClosed assertion failures in the unrelated test.

Replace the static list with a ThreadLocal<List<Http2BlockingChannel>>
and an opt-in API: enableHttp2ChannelTracking(list) /
disableHttp2ChannelTracking(). getBlockingChannelBasedOnPortType only
appends to the tracker if one is registered. Test classes that don't
opt in (everyone except ServerHttp2Test for now) are not affected by
the cleanup logic at all — their channels never enter any tracker.

ServerHttp2Test holds a per-instance trackedHttp2Channels list, opts
in @before, closes the tracked channels and opts out @after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Throttle=0 (matching prod default) and the sentinel-blob warm-up were
added together to handle cold-start replication on per-test cluster
bringups. They turned out to interact badly with replicateBlobV2CaseTest:

- That test calls controlReplicationForPartition(disable) and then PUTs
  blobs, expecting them not to replicate to a "third channel". With
  throttle=0, replication threads run continuously, so a replication
  cycle in flight when the disable call fires can finish anyway,
  making the blob land where the test expects it not to.
- Master's throttle=100ms was an implicit safety margin that this test
  was relying on, not arbitrary slowness.

Restore throttle to 100ms (master behavior) and remove the warm-up.
With throttle=100ms, cold-start replication is paced enough that
endToEndHttp2Replication... no longer needs the warm-up to make its
60s awaitBlobCreations budget.

Net change vs master: per-test cluster lifecycle + HTTP/2 leak fixes
(factory close, channel registry per-class) + MySQL skip via
assumeTrue. Smaller, more defensible scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@snalli snalli closed this May 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants