Add CDC file cleanup watchdog #29

Draft
mble wants to merge 9 commits into main from mble/cdc-file-cleanup

mble commented Apr 15, 2026

Summary

During long-running CDC with high write volume, pgcopydb accumulates .json and .sql files in the CDC directory indefinitely after they've been applied. This can exhaust disk space.

This PR adds a cleanup watchdog subprocess that periodically deletes applied CDC files based on a configurable size threshold and minimum age floor:

  • --cleanup-threshold (e.g. 10GB): maximum total size of applied CDC files to retain. When the total exceeds it, the oldest applied files are deleted first. Set to 0 to disable (default).
  • --cleanup-min-age (e.g. 15m): minimum age before an applied file is eligible for deletion. Defaults to 15 minutes when a threshold is set. The floor is overridden under disk pressure, that is, when deleting only old-enough files cannot bring the total under the threshold.
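
To make the accepted `--cleanup-min-age` values concrete, here is a minimal sketch of a duration parser that follows the rules stated in this PR ("30s", "15m", "2h", bare number means seconds, false on error). The name `parse_duration_seconds` and its signature are hypothetical; this is not the `cli_parse_duration` implementation from the PR, and whether the real parser accepts other suffixes is not stated here.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "30s", "15m", "2h" or a bare number of
 * seconds into *seconds. Returns false on any parse error.
 */
bool
parse_duration_seconds(const char *str, uint64_t *seconds)
{
    /* require a leading digit: rejects NULL, "", signs and whitespace */
    if (str == NULL || !isdigit((unsigned char) str[0]))
    {
        return false;
    }

    char *end = NULL;
    errno = 0;
    unsigned long long value = strtoull(str, &end, 10);

    if (errno != 0)
    {
        return false;           /* out of range */
    }

    uint64_t multiplier = 1;

    if (end[0] != '\0')
    {
        if (end[1] != '\0')
        {
            return false;       /* more than one trailing character */
        }

        switch (end[0])
        {
            case 's': multiplier = 1; break;
            case 'm': multiplier = 60; break;
            case 'h': multiplier = 3600; break;
            default: return false;  /* unknown suffix */
        }
    }

    if (value > UINT64_MAX / multiplier)
    {
        return false;           /* multiplication would overflow */
    }

    *seconds = (uint64_t) value * multiplier;
    return true;
}
```

With this sketch, "15m" yields 900 seconds and "45" yields 45, while inputs such as "10x" or "15mm" are rejected.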

How it works

A new cleanup FollowSubProcess runs alongside prefetch/transform/catchup. Every 30 seconds it:

  1. Reads replay_lsn from the sentinel table
  2. Scans the CDC directory for .json/.sql files whose WAL segment LSN is below replay_lsn (fully applied)
  3. If total applied file bytes exceed the threshold, deletes oldest files first
  4. Respects the minimum age floor unless old-enough files alone can't bring total under threshold (disk pressure override)
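
The decision made in steps 2 to 4 can be sketched as a pure function over an in-memory file list. This is one reading of the description above, not the code in ld_cleanup.c: the `CdcFile` struct, the `plan_cleanup` name, and the assumption that the array is already ordered oldest first are all hypothetical. Nothing is unlinked; the function only marks candidates.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one CDC file; not pgcopydb's actual struct. */
typedef struct CdcFile
{
    uint64_t lsn;       /* LSN of the WAL segment the file covers */
    uint64_t bytes;     /* file size on disk */
    int64_t mtime;      /* modification time, seconds since the epoch */
    bool applied;       /* output: lsn < replay_lsn */
    bool doomed;        /* output: selected for deletion */
} CdcFile;

/*
 * Marks the files to delete and returns how many bytes that would free.
 * Assumes files[] is ordered oldest first. A threshold of 0 disables
 * cleanup, matching the flag's documented default.
 */
uint64_t
plan_cleanup(CdcFile *files, size_t count, uint64_t replay_lsn,
             uint64_t threshold, int64_t min_age, int64_t now)
{
    uint64_t total = 0;         /* bytes held by applied files */
    uint64_t old_enough = 0;    /* applied bytes past the age floor */
    uint64_t freed = 0;

    for (size_t i = 0; i < count; i++)
    {
        files[i].applied = files[i].lsn < replay_lsn;
        files[i].doomed = false;

        if (!files[i].applied)
        {
            continue;
        }

        total += files[i].bytes;

        if (now - files[i].mtime >= min_age)
        {
            old_enough += files[i].bytes;
        }
    }

    if (threshold == 0 || total <= threshold)
    {
        return 0;
    }

    /* disk pressure: old-enough files alone cannot get under threshold */
    bool pressure = (total - old_enough) > threshold;

    for (size_t i = 0; i < count && total - freed > threshold; i++)
    {
        if (!files[i].applied)
        {
            continue;
        }

        if (!pressure && now - files[i].mtime < min_age)
        {
            continue;           /* age floor still respected */
        }

        files[i].doomed = true;
        freed += files[i].bytes;
    }

    return freed;
}
```

Keeping the planning step free of I/O makes the age-floor override easy to unit test; a caller would then unlink each file marked `doomed`.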

Safety

  • Only deletes files whose segment LSN is behind replay_lsn, which is only advanced after apply finishes with a file
  • Transform necessarily completes before apply on a given segment, so cleanup cannot race with either process
  • The 15-minute default age floor provides a debugging/safety window
  • --cleanup-threshold 0 (default) disables the watchdog entirely — backward compatible
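
The first safety rule depends on mapping a CDC file name to a segment LSN. The sketch below shows one way to do that, under assumptions this PR does not confirm: that CDC files are named after the 24-hex-digit WAL segment file name plus a `.json` or `.sql` extension, that the segment size is the PostgreSQL default of 16 MiB, and that the segment's start LSN is what gets compared with replay_lsn. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: PostgreSQL's default WAL segment size of 16 MiB. */
#define WAL_SEGMENT_SIZE (16ULL * 1024 * 1024)

/*
 * Hypothetical helper: derive the start LSN of the WAL segment from a
 * name such as "000000010000000000000002.json". The 24 hex digits are
 * timeline (8), log id (8) and segment number within the log (8).
 */
bool
segment_start_lsn(const char *filename, uint64_t *lsn)
{
    unsigned int timeline = 0, log = 0, seg = 0;

    /* exactly 24 leading hex digits, then a known extension */
    if (strspn(filename, "0123456789ABCDEFabcdef") != 24)
    {
        return false;
    }

    if (strcmp(filename + 24, ".json") != 0 &&
        strcmp(filename + 24, ".sql") != 0)
    {
        return false;
    }

    if (sscanf(filename, "%8X%8X%8X", &timeline, &log, &seg) != 3)
    {
        return false;
    }

    uint64_t segments_per_log = 0x100000000ULL / WAL_SEGMENT_SIZE;

    *lsn = ((uint64_t) log * segments_per_log + seg) * WAL_SEGMENT_SIZE;
    return true;
}

/* A file is a deletion candidate only when its segment is behind replay_lsn. */
bool
segment_is_applied(const char *filename, uint64_t replay_lsn)
{
    uint64_t lsn = 0;

    return segment_start_lsn(filename, &lsn) && lsn < replay_lsn;
}
```

Files whose names do not parse are never reported as applied, so unrelated files in the CDC directory would be left alone by a caller that uses this check.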

Test plan

  • Run pgcopydb follow --cleanup-threshold 100MB --cleanup-min-age 1m against a write-heavy source (e.g. pgbench); confirm CDC files are cleaned up after apply advances past them
  • Verify ps aux | grep pgcopydb shows pgcopydb: follow cleanup subprocess
  • Kill and resume mid-CDC with --resume; confirm correct pickup without data loss
  • Verify --cleanup-threshold 0 (default) does not fork the cleanup subprocess
  • Run tests/cdc-cleanup/ integration test
  • Run existing tests/cdc-wal2json/ to confirm no regressions

mble and others added 9 commits April 13, 2026 15:51
Parses duration strings like "30s", "15m", "2h" into seconds. A bare
number with no suffix is treated as seconds. Returns false on parse
error. This will be used by the --cleanup-min-age CLI flag for the
CDC file cleanup watchdog.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add CDC file cleanup configuration options to CopyDBOptions struct and
wire them into cli_copy_db_getopts (clone/follow) and cli_stream_getopts
(stream subcommands). These flags accept human-readable values using the
existing cli_parse_bytes_pretty and cli_parse_duration parsers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add the cleanup subprocess to the processArray in both
follow_wait_subprocesses and follow_terminate_subprocesses so it gets
proper signal handling and waitpid management. Start the cleanup
watchdog in followDB after the catchup subprocess, gated on
cleanupThresholdBytes > 0. Subprocesses with pid <= 0 are skipped
automatically, so an unconfigured cleanup process does not interfere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ld_cleanup.c/h with the core cleanup logic that runs as a forked
subprocess. The watchdog periodically scans the CDC directory, identifies
applied .json/.sql files (LSN < replay_lsn), and deletes the oldest
first when total applied file bytes exceed the configured threshold.
Respects a minimum age floor unless disk pressure requires overriding it.

Replace the follow_start_cleanup stub in follow.c with a call to
cdc_cleanup_loop. The Makefile picks up the new source automatically
via its wildcard pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies that pgcopydb follow works correctly with --cleanup-threshold
and --cleanup-min-age flags, that the cleanup subprocess doesn't crash
or interfere with the apply pipeline, and that follow reaches endpos
and exits cleanly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Apply citus_indent formatting (move && to end of line, add braces,
  fix argument alignment)
- Add IGNORE-BANNED for qsort() in ld_cleanup.c
- Regenerate clone.rst and follow.rst with new cleanup options

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace qsort (banned API) with repeated linear min-scan to find
the oldest file each iteration. The I/O cost of unlink dominates,
so the O(n*k) scan cost is negligible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
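
The repeated linear min-scan described in the last commit can be sketched as follows. The `Candidate` struct and `find_oldest` name are hypothetical and only illustrate the technique; the actual code in ld_cleanup.c may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical candidate record; not the struct used in ld_cleanup.c. */
typedef struct Candidate
{
    int64_t mtime;      /* modification time, seconds since the epoch */
    bool deleted;       /* already removed in an earlier iteration */
} Candidate;

/*
 * Returns the index of the oldest candidate not yet deleted, or -1 when
 * none remain. Each call is one O(n) pass; calling it once per deletion
 * gives the O(n*k) total cost for k deletions.
 */
ptrdiff_t
find_oldest(const Candidate *files, size_t count)
{
    ptrdiff_t oldest = -1;

    for (size_t i = 0; i < count; i++)
    {
        if (files[i].deleted)
        {
            continue;
        }

        if (oldest < 0 || files[i].mtime < files[oldest].mtime)
        {
            oldest = (ptrdiff_t) i;
        }
    }

    return oldest;
}
```

A caller would loop while the total is above the threshold: pick the index returned by `find_oldest`, unlink that file, and set its `deleted` flag before the next scan.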
mble marked this pull request as ready for review April 20, 2026 14:52
mble marked this pull request as draft May 7, 2026 14:09