Fix semaphore permit drain, burst gate race, migration issues #204
Merged
Migration fixes:
- Client ID collision: fetch_max counter after reconstruct
- SCRAM passthrough: serialize BackendAuthMethod in migration v2

Adaptive anticipation budget:
- Static 300-500ms replaced with xact_p99 × 2 ± 20% jitter
- Cold start default: 100ms ± 20%
- Clamped [5ms, 500ms]

Session mode stats:
- xact_time recorded per-transaction via session_xact_start field
- query_start_at not leaked through function signatures

Checkout diagnostics:
- Timeout logs include phase (semaphore/burst_gate/coordinator/create)
- Slow checkout (>500ms) logs pool state snapshot
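The budget rule in this commit message can be sketched as a small pure function. This is a minimal illustration, not the code from this PR: the function name `anticipation_budget`, the constant names, and the choice to pass the jitter in as a factor (rather than drawing it from a random source) are all assumptions made so the example is deterministic. It also assumes the jitter is applied before the clamp, which the commit message does not spell out.

```rust
use std::time::Duration;

// Bounds and cold-start default taken from the commit message.
// The constant names themselves are hypothetical.
const MIN_BUDGET: Duration = Duration::from_millis(5);
const MAX_BUDGET: Duration = Duration::from_millis(500);
const COLD_START_BUDGET: Duration = Duration::from_millis(100);

/// Hypothetical helper. `xact_p99` is `None` while no transaction has been
/// observed yet (cold start). `jitter` is a factor the caller would normally
/// draw at random from [-0.2, 0.2]; it is a parameter here to keep the
/// sketch deterministic.
fn anticipation_budget(xact_p99: Option<Duration>, jitter: f64) -> Duration {
    // Treat a non-finite jitter as "no jitter", otherwise keep it in ±20%.
    let jitter = if jitter.is_finite() {
        jitter.clamp(-0.2, 0.2)
    } else {
        0.0
    };
    // Base value: twice the observed p99, or the cold-start default.
    let base = match xact_p99 {
        Some(p99) => p99.saturating_mul(2),
        None => COLD_START_BUDGET,
    };
    // Assumption: jitter first, clamp last, so the result never leaves
    // the [5ms, 500ms] window.
    base.mul_f64(1.0 + jitter).clamp(MIN_BUDGET, MAX_BUDGET)
}
```

Under these assumptions, a 1ms p99 yields the 5ms floor and a 1s p99 yields the 500ms ceiling, regardless of jitter.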
added 2 commits
April 21, 2026 14:29
What was needed: pools stabilized at 3-4 connections out of 40 under load, with 800ms checkout latency and 50+ waiting clients. The pool could not grow because the semaphore that gates timeout_get was permanently drained.

What changed: return_object now restores the semaphore permit on both the handoff (deliver to waiter) and idle-queue paths. Previously only the idle path called add_permits(1), so each handoff permanently consumed one permit. After max_size handoffs the semaphore was empty, blocking all new checkouts.

Additional fixes:
- biased select in burst gate prevents silent connection loss on tokio::select! races
- try_recv drain recovers orphaned connections without double-counting permits
- pre_replace_one no longer adds a compensating permit (the drain it compensated for is gone)

Migration protocol v2 preserves SCRAM passthrough state. Session-mode xact_time recorded per-transaction. Adaptive anticipation budget scales with xact_p99. Diagnostic logging on slow checkouts and phase-level failures.
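The permit accounting described above can be illustrated with a toy, single-threaded counter model. This is a sketch under stated assumptions, not the pool's code: `PermitModel`, `ReturnPath`, and `permits_after` are hypothetical names, a plain `usize` stands in for the async semaphore, and `checkout` models "acquire a permit and forget it" as a simple decrement.

```rust
/// Which branch a returned connection takes (hypothetical names).
#[derive(Clone, Copy)]
enum ReturnPath {
    /// Connection is delivered straight to a waiting client.
    Handoff,
    /// Connection is pushed onto the idle queue.
    IdleQueue,
}

/// Toy stand-in for the pool: only the permit count is modeled.
struct PermitModel {
    permits: usize,
    /// `false` reproduces the old behavior, `true` the fixed behavior.
    restore_on_handoff: bool,
}

impl PermitModel {
    /// Models acquiring a permit and forgetting it: the permit only comes
    /// back if the return path explicitly restores it.
    fn checkout(&mut self) -> bool {
        if self.permits == 0 {
            return false;
        }
        self.permits -= 1;
        true
    }

    fn return_object(&mut self, path: ReturnPath) {
        match path {
            // The idle path always restored the permit.
            ReturnPath::IdleQueue => self.permits += 1,
            // The handoff path restores it only in the fixed variant.
            ReturnPath::Handoff => {
                if self.restore_on_handoff {
                    self.permits += 1;
                }
            }
        }
    }
}

/// Runs one checkout/return cycle per entry in `returns` and reports how
/// many permits are left. Stops early once a checkout fails.
fn permits_after(max_size: usize, returns: &[ReturnPath], restore_on_handoff: bool) -> usize {
    let mut pool = PermitModel { permits: max_size, restore_on_handoff };
    for &path in returns {
        if !pool.checkout() {
            break;
        }
        pool.return_object(path);
    }
    pool.permits
}
```

In this model, a run of handoff-only cycles leaves the permit count at `max_size` with the fix and at zero without it, which is the invariant the commit message describes.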
- restore_backend_auth_if_pending: &Option<T> → Option<&T>
- crate::config::BackendAuthMethod → use import
- Magic 50ms in burst gate → BURST_GATE_EXHAUSTED_BACKOFF constant
Regression test for the permit drain bug. 200 clients share 10 slots for 5 seconds with 5ms hold time — thousands of handoff cycles. The pool must remain at full size throughout; if the semaphore drains, the pool shrinks and p99 latency exceeds the 600ms bound.
added 4 commits
April 21, 2026 15:11
Replace Rust internals (VecDeque, return_object, slots.size, yield_now, tokio::select!, AtomicU64) with behavioral descriptions. Translate all section headings. Add Russian-first definitions for foreign terms at first use: anticipation, burst gate, thundering herd, direct handoff, eviction, jitter, timeout cliff. Fix unnatural Russian: дропается, протаймаутился, recycl'ится, апгрейднуть, спайк, backend-spawn'ов. Rewrite glossary with behavioral descriptions.
Same bug as xact_time: query_start_at was set once before the inner loop and never reset. In session mode the inner loop runs for the entire session, so each query reported cumulative elapsed time instead of individual query duration (observed as query_ms p50=102s). Fix: reset query_start_at at the top of each inner-loop iteration in session mode. One if-statement, same pattern as the xact_time fix.
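The effect of the missing reset can be reproduced with a simulated clock. The sketch below is an illustration only: `reported_query_ms` is a hypothetical function, time is a plain millisecond counter, and it assumes an inner-loop iteration begins when the next query arrives (so client idle time falls between iterations).

```rust
/// Each entry in `queries` is (idle_ms before the query, exec_ms of the query).
/// Returns the duration that would be reported for each query.
/// `reset_each_iteration = false` models the old behavior, where the start
/// timestamp was set once before the inner loop.
fn reported_query_ms(queries: &[(u64, u64)], reset_each_iteration: bool) -> Vec<u64> {
    let mut now: u64 = 0; // simulated clock in milliseconds
    let mut query_start_at = now; // set once before the inner loop
    let mut reported = Vec::with_capacity(queries.len());

    for &(idle_ms, exec_ms) in queries {
        now += idle_ms; // client is idle until the next query arrives
        if reset_each_iteration {
            // The fix: restart the stopwatch at the top of each iteration.
            query_start_at = now;
        }
        now += exec_ms; // query runs
        reported.push(now - query_start_at);
    }
    reported
}
```

For three 5ms queries separated by one second of idle time, the model reports 5, 5, 5 with the reset and 5, 1010, 2015 without it: the cumulative growth matches the shape of the bug described above.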
New step 'server query_p99 is below N ms' reads the address stats histogram directly. The scenario runs 20 session-mode clients for 3s with 5ms hold time. Without the query_start_at reset fix, p99 would be in the 100-second range. Bound set at 500ms to catch the bug while tolerating scheduling noise.
Summary
- Semaphore permit drain on direct handoff. Each `return_object` handoff permanently consumed one semaphore permit. After `max_size` handoffs the pool could not create connections and stabilized at 3-4 out of 40 servers. Root cause: `wrap_checkout` calls `permit.forget()`, but the handoff path never called `add_permits(1)`. Now both handoff and idle-queue paths restore the permit. Compensating `add_permits` in `pre_replace_one` removed.
- Burst gate `tokio::select!` race. Without `biased;`, the select randomly picked among ready branches, so a connection delivered via oneshot could be silently dropped when `create_done` or the backoff timer won. Fixed with `biased;` (oneshot first) and a `try_recv` drain that pushes orphaned connections to idle without double-counting permits.
- Client ID collision after migration. New process counter started at 0, colliding with migrated IDs. Now advances past the highest migrated ID via `fetch_max`.
- SCRAM passthrough state preserved. Migration payload v2 (backward compatible) carries the ClientKey so the new process skips `ScramPending` fallback.
- Session mode `xact_time` fix. Previously recorded the entire session duration. Now per-transaction at each `ReadyForQuery(Idle)`.
- Adaptive anticipation budget. Scales with `xact_p99 * 2 +/- 20%` jitter, clamped to [5ms, 500ms]. Cold start: 100ms.
- Diagnostic logging. Slow checkout warnings include full pool state. Phase-specific warnings for semaphore, burst gate, coordinator, and create failures.
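The client ID fix in the summary relies on an atomic maximum. A minimal sketch with the standard library's `AtomicU64::fetch_max` follows. The helper names `advance_past_migrated` and `next_client_id` are hypothetical, and the sketch assumes the counter stores the next ID to hand out; the PR may store the last issued ID instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: after migrated clients are reconstructed, move the
/// counter past the highest migrated ID. `fetch_max` never lowers the
/// counter, so calling this on an already-advanced counter is harmless.
fn advance_past_migrated(counter: &AtomicU64, migrated_ids: &[u64]) {
    if let Some(&max_id) = migrated_ids.iter().max() {
        // Assumption: the counter holds the NEXT ID to issue, hence +1.
        counter.fetch_max(max_id.saturating_add(1), Ordering::Relaxed);
    }
}

/// Hypothetical helper: hand out the next client ID.
fn next_client_id(counter: &AtomicU64) -> u64 {
    counter.fetch_add(1, Ordering::Relaxed)
}
```

With a fresh counter at 0 and migrated IDs 3, 17, and 9, the next issued ID in this sketch is 18, so it cannot collide with any migrated client.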
Test plan
- `cargo test --lib pool::inner::tests`: 45 tests pass (18 new: semaphore invariant across handoff/idle/mixed/concurrent paths, negative test for the old drain, pre_replace inflation guard, burst gate try_recv drain, anticipation budget edge cases)
- `cargo clippy -- --deny warnings`: clean
- `servers=40, wait=0, avg_wait=0.1ms, qps=5000` (was: `servers=3, wait=50, avg_wait=800ms, qps=400`)
- `servers=40, wait=0, antic_notify=61k`