13 changes: 12 additions & 1 deletion Cargo.lock


2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "pg_doorman"
version = "3.5.1"
version = "3.5.2"
edition = "2021"
rust-version = "1.87.0"
license = "MIT"
30 changes: 30 additions & 0 deletions documentation/en/src/changelog.md
@@ -1,5 +1,35 @@
# Changelog

### 3.5.2 <small>Apr 21, 2026</small>

#### Semaphore permit leak on direct handoff

Each `return_object` handoff (delivering a connection to a waiting client via oneshot channel) permanently consumed one semaphore permit. After `max_size` handoffs the pool semaphore was fully drained, blocking all new `timeout_get` callers. The pool could not create connections and stabilized at whatever size it reached during cold start (typically 4-8 out of 40).

Root cause: `wrap_checkout` calls `permit.forget()`, and the handoff path in `return_object` skipped `add_permits(1)`. Now `return_object` restores the permit on both the handoff and idle-queue paths. The compensating `add_permits(1)` in `pre_replace_one` has been removed (no longer needed).
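
A std-only model of the permit accounting illustrates the leak and the fix (the `AtomicUsize` stands in for the tokio semaphore; `checkout` and `return_object` are simplified sketches, not the real signatures):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for the pool semaphore: `permits` counts free slots.
struct Pool {
    permits: AtomicUsize,
}

impl Pool {
    /// Models `wrap_checkout`: acquire a permit and `forget()` it,
    /// so the permit is consumed rather than auto-released on drop.
    fn checkout(&self) -> bool {
        self.permits
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .is_ok()
    }

    /// Models `return_object`. Before the fix, the direct-handoff path
    /// skipped `add_permits(1)`, leaking one permit per handoff.
    fn return_object(&self, handoff: bool, fixed: bool) {
        if !handoff || fixed {
            self.permits.fetch_add(1, Ordering::Release); // add_permits(1)
        }
    }
}
```

After `max_size` handoffs the unfixed pool is fully drained; with the fix the permit count stays stable no matter how many handoffs occur.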

#### Burst gate select race

The `tokio::select!` in the burst gate loop randomly picked among ready branches. When `sleep(5ms)` or `create_done` won over an already-delivered oneshot, the connection was silently dropped, inflating `slots.size` without a live server. Fixed with `biased;` (oneshot checked first) and a `try_recv` drain that pushes orphaned connections to idle without double-counting the permit.
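
The drain can be modeled with a std channel (std `mpsc` stands in for the tokio oneshot; the function name is hypothetical):

```rust
use std::sync::mpsc;

/// After losing the select race, drain the handoff channel so a
/// connection that was delivered concurrently is parked in the idle
/// queue instead of being dropped (which would inflate `slots.size`).
fn drain_orphaned_handoff(rx: &mpsc::Receiver<u32>, idle: &mut Vec<u32>) {
    // try_recv is non-blocking: it only picks up an already-sent value.
    if let Ok(conn) = rx.try_recv() {
        idle.push(conn); // to idle, without touching the permit again
    }
}
```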

#### Migration fixes

- **Client ID collision after migration.** The new process started its connection counter at 0, colliding with migrated client IDs. Now the counter advances past the highest migrated ID.

- **SCRAM passthrough state preserved.** The ClientKey from the first client's SCRAM handshake is serialized in the migration payload (v2 format, backward compatible). The new process skips the `ScramPending` fallback to `server_password`.
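
The counter fix amounts to the following (a hedged sketch; the real code works on the migration payload, and the helper name is invented):

```rust
/// Advance the new process's connection-ID counter past every migrated
/// client ID so freshly assigned IDs cannot collide with migrated ones.
fn initial_client_id(migrated_ids: &[u32]) -> u32 {
    migrated_ids.iter().copied().max().map_or(0, |max| max + 1)
}
```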

#### Session mode statistics fix

`xact_time` percentiles in session mode showed the entire session duration instead of individual transaction time. Now recorded per-transaction at each `ReadyForQuery(Idle)`, matching transaction mode semantics.
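
The per-transaction timing can be sketched as follows (std `Instant` stands in for the `quanta::Instant` used by the source; the struct and method names are hypothetical):

```rust
use std::time::{Duration, Instant};

/// Session-mode transaction timer: armed when the server enters a
/// transaction, consumed when it returns to idle.
#[derive(Default)]
struct XactTimer {
    start: Option<Instant>,
}

impl XactTimer {
    /// Call on every ReadyForQuery; `status` is the transaction status
    /// byte. Returns the transaction duration when the transaction ends.
    fn on_ready_for_query(&mut self, status: u8) -> Option<Duration> {
        match status {
            // In transaction ('T') or failed transaction ('E'): arm once.
            b'T' | b'E' => {
                self.start.get_or_insert_with(Instant::now);
                None
            }
            // Idle ('I'): transaction ended, record per-transaction time.
            b'I' => self.start.take().map(|s| s.elapsed()),
            _ => None,
        }
    }
}
```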

#### Adaptive anticipation budget

Anticipation wait (formerly a fixed 300-500ms) scales with real transaction latency: `xact_p99 × 2 ± 20%` jitter, clamped to [5ms, 500ms]. Cold-start default: 100ms.
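
The budget computation can be sketched as follows (a std-only sketch; the constant and function names are hypothetical):

```rust
use std::time::Duration;

const MIN_BUDGET: Duration = Duration::from_millis(5);
const MAX_BUDGET: Duration = Duration::from_millis(500);
const COLD_START: Duration = Duration::from_millis(100);

/// `jitter` is a uniform draw from [-0.2, 0.2]; `xact_p99` is None
/// until the first transactions complete (cold start).
fn anticipation_budget(xact_p99: Option<Duration>, jitter: f64) -> Duration {
    let base = match xact_p99 {
        Some(p99) => p99 * 2,  // scale with observed latency
        None => COLD_START,    // cold-start default
    };
    base.mul_f64(1.0 + jitter).clamp(MIN_BUDGET, MAX_BUDGET)
}
```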

#### Diagnostic logging

Slow checkout warnings (>500ms) now include pool state: `size`, `avail`, `waiting`, `inflight`, `creates`, `gate_waits`, `antic_ok`, `antic_to`, `fallback`. Phase-specific warnings added for semaphore timeout, burst gate timeout, coordinator exhaustion, and create failure.

### 3.5.1 <small>Apr 20, 2026</small>

#### systemd Type=notify support
31 changes: 20 additions & 11 deletions documentation/en/src/tutorials/pool-pressure.md
@@ -163,14 +163,21 @@ dropped. `return_object` detects the dropped receiver (`send` returns
This way timed-out waiters are cleaned up lazily without a separate
garbage-collection pass.
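
The lazy cleanup can be modeled with a std channel (std `mpsc` stands in for the tokio oneshot; the function name is hypothetical):

```rust
use std::sync::mpsc;

/// Direct handoff with lazy waiter cleanup: if the waiter timed out and
/// dropped its receiver, `send` hands the connection back in `Err`, and
/// it is parked in the idle queue instead of being lost.
fn handoff_or_idle(tx: mpsc::Sender<u32>, conn: u32, idle: &mut Vec<u32>) {
    if let Err(mpsc::SendError(conn)) = tx.send(conn) {
        idle.push(conn); // waiter gone: recycle via the idle queue
    }
}
```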

The deadline is adaptive: `min(query_wait_timeout - 500 ms, adaptive_cap)`
where `adaptive_cap` is derived from real transaction latency:

| Pool state | Budget | Example |
|------------|--------|---------|
| Cold start (no stats) | 100ms ± 20% jitter | 80-120ms |
| Steady state | xact_p99 × 2 ± 20% jitter | p99=0.7ms → 5ms (min); p99=50ms → 100ms |
| High latency | Capped at 500ms | p99=300ms → 500ms |

The budget is measured against a timestamp captured at the top of
`timeout_get`. Phase 1/2 semaphore wait consumes from the same budget,
so the cumulative wait across phases cannot exceed the caller's
`query_wait_timeout`.
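
Putting the pieces together, the Phase 4 deadline can be sketched as follows (a std-only sketch with hypothetical names; `start` is the timestamp captured at the top of `timeout_get`):

```rust
use std::time::{Duration, Instant};

/// Remaining Phase 4 budget: the adaptive cap, further limited by what
/// is left of the caller's query_wait_timeout (minus a 500 ms safety
/// margin), minus whatever Phase 1/2 already consumed since `start`.
fn phase4_deadline(
    start: Instant,
    query_wait_timeout: Duration,
    adaptive_cap: Duration,
) -> Duration {
    let reserved = query_wait_timeout.saturating_sub(Duration::from_millis(500));
    let budget = reserved.min(adaptive_cap);
    // Phases 1/2 draw from the same budget.
    budget.saturating_sub(start.elapsed())
}
```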

The ±20% jitter prevents a **timeout cliff**: without it, N clients that
entered Phase 4 at the same instant all exit simultaneously and
stampede into the burst gate, creating N new backend connections for a
pool that needs far fewer. With jitter, clients exit in staggered
@@ -183,9 +190,11 @@ to the idle queue for recycling.
connects. If a slot is free, take it and call `connect()` against
PostgreSQL. If all slots are full, register a direct-handoff oneshot
waiter and also listen for `create_done` (another in-flight create
finishing). The `select!` uses `biased;` to always check the oneshot
first, preventing a race where `create_done` or the 5 ms backoff timer
wins and silently drops the delivered connection. If a connection
arrives via the oneshot channel, recycle it and return. Otherwise,
re-try the recycle and the gate after the wake.

**Phase 6 — Backend connect.** Run `connect()`, authenticate, hand the
connection to the client. The burst slot is released automatically when
51 changes: 34 additions & 17 deletions documentation/ru/pool-pressure.md
@@ -185,16 +185,22 @@ recycle 10 times in a tight `yield_now` loop (controlled
This way stale waiters are cleaned up lazily, without a separate
garbage-collection pass.

The deadline is adaptive: `min(query_wait_timeout - 500 ms, adaptive_cap)`,
where `adaptive_cap` is computed from real transaction latency:

| Pool state | Budget | Example |
|------------|--------|---------|
| Cold start (no stats) | 100ms ± 20% jitter | 80-120ms |
| Steady state | xact_p99 × 2 ± 20% jitter | p99=0.7ms → 5ms (min); p99=50ms → 100ms |
| High latency | Capped at 500ms | p99=300ms → 500ms |

Phase 1/2 semaphore wait draws from the same budget, so the cumulative
wait cannot take the client past its `query_wait_timeout`.

The ±20% jitter prevents a **timeout cliff**: without it, N clients
that entered Phase 4 at the same instant all exit simultaneously and
stampede into the burst gate, creating N new backend connections for a
pool that needs far fewer. With jitter, clients exit in staggered
batches: the first ones create connections, and by the time the last
ones exit, those connections have already been used and returned to
the idle queue for reuse.

@@ -207,12 +213,22 @@ gate, creating N new backend connections for a pool
default 2) tasks. If a slot is free, the task takes it and goes on to
call `connect()`. If all slots are busy, the task registers a oneshot
waiter for direct handoff and listens for `create_done` (another
task's create finishing). The `select!` uses `biased;` so the oneshot
is checked first; without it, `create_done` or the `sleep(5 ms)` timer
could win the race and lose an already-delivered connection. If a
connection arrives via the oneshot channel, it is recycled and
returned. Otherwise the task retries the recycle and the gate. This
limits precisely the birth rate of new backend connections within a
single pool, not the size of the pool itself.

**Adaptive burst gate timeout.** The burst gate loop is bounded by an
adaptive budget: `xact_p99 × 2 ± 20%` jitter (min 20 ms, max 500 ms).
If a task has spent longer than the budget in the loop, it stops
registering as a handoff waiter and proceeds to grab a burst gate slot
directly. Without this mechanism the pool could get stuck at
`warm_threshold` forever: clients kept receiving recycled connections
through the shared handoff queue, and `try_acquire` (creating a new
connection) was never reached. The `burst_gate_budget_exhausted`
counter tracks how often this fires.
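
The fallback decision can be sketched as follows (hypothetical names; the real loop interleaves recycle attempts with waiter registration):

```rust
use std::time::{Duration, Instant};

enum Step {
    RegisterWaiter,
    AcquireSlotDirectly,
}

/// Once a task has been in the burst-gate loop longer than its adaptive
/// budget, it stops registering as a handoff waiter and goes straight
/// for a burst-gate slot, so the pool can grow past warm_threshold.
fn next_step(loop_entered: Instant, budget: Duration) -> Step {
    if loop_entered.elapsed() > budget {
        Step::AcquireSlotDirectly // counted as burst_gate_budget_exhausted
    } else {
        Step::RegisterWaiter
    }
}
```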

**Phase 6 — backend connect.** Run `connect()`, authenticate, and hand
the connection to the client. The burst slot is released automatically when
@@ -1025,6 +1041,7 @@ eviction rotates connections between users
| `pg_doorman_pool_scaling{type="inflight_creates"}` | gauge | `user`, `database` | `inflight` from `SHOW POOL_SCALING` |
| `pg_doorman_pool_scaling_total{type="creates_started"}` | counter | `user`, `database` | `creates` |
| `pg_doorman_pool_scaling_total{type="burst_gate_waits"}` | counter | `user`, `database` | `gate_waits` |
| `pg_doorman_pool_scaling_total{type="burst_gate_budget_exhausted"}` | counter | `user`, `database` | `gate_budget_ex`: adaptive timeout fired, the client moved on to creating a connection |
| `pg_doorman_pool_scaling_total{type="anticipation_wakes_notify"}` | counter | `user`, `database` | `antic_notify` |
| `pg_doorman_pool_scaling_total{type="anticipation_wakes_timeout"}` | counter | `user`, `database` | `antic_timeout` |
| `pg_doorman_pool_scaling_total{type="create_fallback"}` | counter | `user`, `database` | `create_fallback` |
@@ -1299,7 +1316,7 @@ psql -h 127.0.0.1 -p 6432 -U admin pgdoorman \
```

Field positions in `awk` match the column order described above:
for `POOL_SCALING` this is `user|database|inflight|creates|gate_waits|gate_budget_ex|antic_notify|antic_timeout|create_fallback|replenish_def`,
for `POOL_COORDINATOR` this is `database|max_db_conn|current|reserve_size|reserve_used|evictions|reserve_acq|exhaustions`.

## Comparison with PgBouncer
2 changes: 2 additions & 0 deletions src/admin/show.rs
@@ -653,6 +653,7 @@ where
("inflight", DataType::Numeric),
("creates", DataType::Numeric),
("gate_waits", DataType::Numeric),
("gate_budget_ex", DataType::Numeric),
("antic_notify", DataType::Numeric),
("antic_timeout", DataType::Numeric),
("create_fallback", DataType::Numeric),
@@ -675,6 +676,7 @@ where
snapshot.inflight_creates.to_string(),
snapshot.creates_started.to_string(),
snapshot.burst_gate_waits.to_string(),
snapshot.burst_gate_budget_exhausted.to_string(),
snapshot.anticipation_wakes_notify.to_string(),
snapshot.anticipation_wakes_timeout.to_string(),
snapshot.create_fallback.to_string(),
13 changes: 10 additions & 3 deletions src/app/server.rs
@@ -815,7 +815,7 @@ async fn binary_upgrade_and_shutdown(
// restart the service when we exit.
if let Err(e) = sd_notify::notify(
false,
&[sd_notify::NotifyState::MainPid(child_pid.into())],
&[sd_notify::NotifyState::MainPid(child_pid)],
) {
warn!("sd_notify MAINPID failed: {e}. systemd may restart the service after old process exits.");
}
@@ -834,8 +834,15 @@ async fn binary_upgrade_and_shutdown(
let _ = MIGRATION_TX.set(tx);
// MIGRATION_IN_PROGRESS already set above (before SHUTDOWN)
let (shutdown_tx, shutdown_rx) = tokio::sync::oneshot::channel();
let sender_handle = tokio::spawn(migration_sender_task(migration_parent_fd, rx, shutdown_rx));
migration_handles = Some(MigrationHandles { shutdown_tx, sender_handle });
let sender_handle = tokio::spawn(migration_sender_task(
migration_parent_fd,
rx,
shutdown_rx,
));
migration_handles = Some(MigrationHandles {
shutdown_tx,
sender_handle,
});
info!("Client migration enabled");
}

5 changes: 5 additions & 0 deletions src/client/core.rs
@@ -358,6 +358,11 @@ pub struct Client<S, T> {
/// Connected to server
pub(crate) connected_to_server: bool,

/// Session mode: transaction start timestamp for per-transaction xact_time.
/// Set when server transitions into a transaction (ReadyForQuery 'T'/'E').
/// Consumed when transaction ends (ReadyForQuery 'I').
pub(crate) session_xact_start: Option<quanta::Instant>,

/// Name of the server pool for this client (This comes from the database name in the connection string)
pub(crate) pool_name: String,
