
fix(l1): prune Docker build cache and handle disk-full errors in multisync monitor#6500

Open
avilagaston9 wants to merge 2 commits into main from fix/multisync-disk-cleanup

Conversation

@avilagaston9
Contributor

Motivation

The multisync monitor on srv9 crashed with OSError: [Errno 28] No space left on device after completing Run #9 successfully. The Docker build cache was growing unbounded across runs (750 GB accumulated), eventually filling the 1.8 TB disk entirely. When the monitor tried to write run logs, ENOSPC killed the process, leaving containers running unattended with no Slack notifications for 42+ hours.

Description

Two changes to docker_monitor.py:

  1. Prune Docker build cache after each image build — runs docker builder prune -f after every successful docker build, preventing cache from accumulating across multisync cycles. The prune is non-fatal: if it fails, a warning is printed and the monitor continues.

  2. Handle disk-full errors gracefully in log writing — save_all_logs and log_run_result now catch OSError so a full disk degrades to a warning instead of crashing the monitor loop. On write failure, the run result is printed to stdout so it's still captured in tmux history.

Checklist

  • No code compilation needed (Python script)
  • Already deployed and running on srv9

azteca1998 and others added 2 commits April 16, 2026 18:54
…6486)

## Summary
- The Hive consume-engine Amsterdam tests for EIP-7778 and EIP-8037 were
failing because ethrex's per-tx gas limit checks were incompatible with
Amsterdam's new gas accounting rules.
- **EIP-7778** uses pre-refund gas for block accounting, so cumulative
pre-refund gas can exceed the block gas limit even when a block builder
correctly included all transactions.
- **EIP-8037** introduces 2D gas accounting (`block_gas = max(regular,
state)`), meaning cumulative total gas (regular + state) can legally
exceed the block gas limit.
- The fix skips the per-tx cumulative gas check for Amsterdam and adds a
**post-execution** block-level overflow check using `max(sum_regular,
sum_state)` in all three execution paths (sequential, pipeline,
parallel).
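The block-level rule being checked can be illustrated with a toy sketch (hypothetical names; ethrex's actual implementation lives in Rust across the three execution paths):

```python
def amsterdam_block_gas_ok(tx_gas: list[tuple[int, int]], block_gas_limit: int) -> bool:
    """Post-execution check: per-tx cumulative checks are skipped; the
    block is valid iff max(sum_regular, sum_state) fits the limit."""
    sum_regular = sum(regular for regular, _state in tx_gas)
    sum_state = sum(state for _regular, state in tx_gas)
    return max(sum_regular, sum_state) <= block_gas_limit
```

Note that under 2D accounting the cumulative total (regular + state) may legally exceed the limit, which is exactly why the old per-tx cumulative check had to be skipped.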

## Local test results
- **200/201** EIP-7778 + EIP-8037 Hive consume-engine tests pass
- **105/105** EIP-7778 + EIP-8037 EF blockchain tests pass (4 + 101)
- The single remaining Hive failure
(`test_block_regular_gas_limit[exceed=True]`) expects
`TransactionException.GAS_ALLOWANCE_EXCEEDED` but we return
`BlockException.GAS_USED_OVERFLOW` — the block is correctly rejected,
just with a different error classification.

## Test plan
- [x] All EIP-7778 EF blockchain tests pass locally
- [x] All EIP-8037 EF blockchain tests pass locally
- [x] 200/201 Hive consume-engine Amsterdam tests pass locally
- [ ] Full CI Amsterdam Hive suite passes

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
errors gracefully in the multisync monitor. The build cache was growing
unbounded across runs (750 GB on srv9), eventually filling the disk and
crashing the monitor with ENOSPC when writing run logs. The monitor loop
would then exit, leaving containers running unattended with no Slack
notifications.

The build cache prune runs after every successful docker build. Log
writing (save_all_logs, log_run_result) now catches OSError so a full
disk degrades to a warning instead of killing the monitor.
@github-actions github-actions bot added the L1 Ethereum client label Apr 17, 2026
@avilagaston9 avilagaston9 marked this pull request as ready for review April 17, 2026 14:33
@avilagaston9 avilagaston9 requested a review from a team as a code owner April 17, 2026 14:33
Copilot AI review requested due to automatic review settings April 17, 2026 14:33
@ethrex-project-sync ethrex-project-sync bot moved this to In Review in ethrex_l1 Apr 17, 2026
@github-actions

🤖 Kimi Code Review

Review of tooling/sync/docker_monitor.py:

Operational Robustness Improvements
The changes add necessary hardening for production monitoring: disk space management via cache pruning, graceful degradation when log directories are unavailable, and stdout fallback for critical run results.

Specific Issues:

  1. Fragile output parsing (line 161)

    reclaimed = result.stdout.strip().split("\n")[-1]

    Docker CLI output format is not guaranteed stable across versions. If docker builder prune returns no output or changes format, this may display empty or misleading info (note that str.split always returns at least one element, so an IndexError cannot occur here). Consider defensive parsing:

    reclaimed = ""
    if result.stdout:
        lines = [l for l in result.stdout.strip().split("\n") if l]
        if lines:
            reclaimed = lines[-1]
  2. Overly broad exception handling (line 166)

    except Exception as e:

    Catching Exception masks programming errors (e.g., NameError). Since this is cleanup code that shouldn't fail the build, this is acceptable, but consider logging the full traceback for debugging: traceback.print_exc().

  3. Log saving granularity (lines 425-435)
    The try-except wraps the entire loop. If saving logs for the first instance fails (e.g., disk full), remaining instances are skipped. Consider handling per-instance:

    for inst in instances:
        try:
            save_container_logs(inst.container, run_id)
            consensus_container = inst.container.replace("ethrex-", "consensus-")
            save_container_logs(consensus_container, run_id)
        except OSError as e:
            print(f"⚠️ Failed to save logs for {inst.container}: {e}", flush=True)
  4. Container name assumption (line 431)

    consensus_container = inst.container.replace("ethrex-", "consensus-")

    This assumes a strict naming convention. If a container is named my-ethrex-test, it becomes my-consensus-test. Document this convention or validate the replacement occurred (check if name actually changed).

Minor:

  • Line 161: If docker builder prune returns success with empty stdout, reclaimed becomes empty string, resulting in output: "🧹 Build cache pruned. " (trailing space). Consider conditional formatting.
  • Line 511: text already ends with \n (line 498), so print(text) adds an extra newline. Use print(text, end="", flush=True) to avoid double-spacing.

Security: No vulnerabilities detected. Commands use list format (safe from injection), and the prune operation uses -f as required for automation.

Verdict: Accept with minor fixes for output parsing robustness (Item 1) and potentially per-instance error handling (Item 3).


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Claude Code Review

Here is my review of PR #6500:


PR #6500 — Fix: Prune Docker Build Cache & Handle Disk-Full Errors in Multisync Monitor

Overview

This PR fixes a production incident on srv9 where 750 GB of accumulated Docker build cache filled a 1.8 TB disk, crashing the multisync monitor mid-run via ENOSPC and leaving containers unattended for 42+ hours. Two changes address this: automatic cache pruning after each build, and graceful OSError handling on log writes.


Correctness

save_all_logs — OSError catch may be a no-op

The try/except OSError wraps calls to save_container_logs, but without seeing that function's implementation it's unclear whether OS-level write errors propagate out of it. If save_container_logs uses subprocess.run(stdout=subprocess.PIPE) and then writes to a file internally without re-raising, the outer catch in save_all_logs will never fire. Worth verifying whether this protection actually works end-to-end.

log_run_result — stdout fallback is solid

The stdout dump on write failure (print(text, flush=True)) is a good degraded-mode fallback. The flush=True ensures tmux/journal capture before a potential subsequent crash.

OSError message is slightly misleading

Both catch blocks print "disk full?" but OSError covers much more (permission denied, too many open files, etc.). Checking e.errno == errno.ENOSPC would distinguish actual disk-full from other OS errors and give a more precise diagnostic message.

import errno
except OSError as e:
    if e.errno == errno.ENOSPC:
        print(f"⚠️ Disk full — failed to write run log: {e}", flush=True)
    else:
        print(f"⚠️ OS error writing run log: {e}", flush=True)

Performance / Operational Considerations

docker builder prune -f nukes the entire cache

Every successful build now discards all cached layers before the next run. This prevents disk growth but also means every subsequent build starts cold. If the multisync cycles run frequently and share base image layers, this significantly increases build time per cycle.

A less aggressive option: docker builder prune -f --keep-storage 20GB (Docker 23+) or --filter until=24h to retain recent layers while capping growth. However, given the severity of the incident (750 GB / full disk), a conservative full-prune-every-time approach is reasonable for now.
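The gentler variants could be sketched in the monitor's subprocess style like this (flag availability depends on the Docker version; verify on srv9 before adopting either):

```python
import subprocess

# Cap total cache size instead of discarding everything:
CAP_CACHE = ["docker", "builder", "prune", "-f", "--keep-storage", "20GB"]
# Or keep only layers used within the last 24 hours:
KEEP_RECENT = ["docker", "builder", "prune", "-f", "--filter", "until=24h"]

def try_prune(cmd: list[str]) -> bool:
    """Run a prune command non-fatally, mirroring the PR's approach."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True, timeout=120)
        return True
    except Exception as e:
        print(f"⚠️ Prune failed (non-fatal): {e}", flush=True)
        return False
```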

120-second prune timeout

For very large caches (the original 750 GB), the first prune after deploying this fix could take longer than 120 seconds. Since the prune is non-fatal (except Exception), a timeout just skips the prune for that cycle, which is acceptable — the cache will shrink incrementally over subsequent runs. No action required, just worth being aware of.


Minor Issues

except Exception as e vs specific exceptions

In the prune block, catching Exception is broad. Using (subprocess.TimeoutExpired, subprocess.SubprocessError, OSError) would be more explicit about expected failure modes, though for a non-critical housekeeping step this is low priority.

No Slack notification on degraded disk-full state

The PR description notes that the original crash caused missed Slack alerts for 42+ hours. The fix prevents the crash, but a disk-full degraded write still generates no Slack alert — a human won't know unless they check stdout/tmux. Out of scope for this PR, but worth a follow-up issue.


Summary

The fix is correct in intent and appropriately scoped. The non-fatal handling for both the prune and the log writes is the right design — a monitoring script should degrade gracefully, not crash. The main concern to verify is whether the OSError in save_all_logs actually propagates from save_container_logs, since that protection could be dead code. The errno.ENOSPC refinement is a nice-to-have but not blocking.


Automated review by Claude (Anthropic) · sonnet · custom prompt

Contributor

Copilot AI left a comment


Pull request overview

Improves the reliability of the multisync Docker monitor by proactively limiting Docker build cache growth and preventing the monitor loop from crashing when the disk is full, so runs don’t continue unattended without logging/notifications.

Changes:

  • Prunes Docker build cache (docker builder prune -f) after successful image builds to avoid unbounded cache accumulation across runs.
  • Adds OSError handling around log persistence so ENOSPC (disk full) degrades to warnings and stdout output instead of crashing.
Comments suppressed due to low confidence (1)

tooling/sync/docker_monitor.py:443

  • log_run_result catches OSError for file writes, but ensure_logs_dir() runs before the try block. If the logs directory needs to be created while the disk is full (or inode exhausted), LOGS_DIR.mkdir(...) will still raise OSError and crash the monitor. Consider moving ensure_logs_dir() inside the existing try or wrapping it in its own except OSError so disk-full degrades to the same warning/stdout behavior as the writes.
def log_run_result(run_id: str, run_count: int, instances: list[Instance], hostname: str, branch: str, commit: str, build_profile: str = ""):
    """Append run result to the persistent log file."""
    ensure_logs_dir()
    all_success = all(i.status == "success" for i in instances)


@github-actions

🤖 Codex Code Review

  1. [docker_monitor.py:159](/home/runner/work/ethrex/ethrex/tooling/sync/docker_monitor.py#L159) Unconditionally running docker builder prune -f after every successful build is a performance regression. It prunes daemon-wide BuildKit cache, not just cache for this repo/image, so the next --auto-update build loses incremental layers and other jobs on the same Docker host can lose their cache too. I’d make this opt-in, or at least scope it with filters/labels so it does not evict unrelated cache.

  2. [docker_monitor.py:427](/home/runner/work/ethrex/ethrex/tooling/sync/docker_monitor.py#L427) The new try/except OSError in save_all_logs() does not actually catch the disk-full path it is trying to handle, because [docker_monitor.py:401](/home/runner/work/ethrex/ethrex/tooling/sync/docker_monitor.py#L401) save_container_logs() already catches Exception and returns False. As written, per-container write failures still get swallowed there, and save_all_logs() can still print Logs saved... even if all writes failed. If you want meaningful ENOSPC handling, move the OSError logic into save_container_logs() or aggregate return values and only print success when at least one write succeeded.
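The aggregation suggested in item 2 could look like the following sketch. The saver is injected here so the snippet is self-contained; in the script it would call save_container_logs directly, and LOGS_DIR is the module constant:

```python
from pathlib import Path

LOGS_DIR = Path("logs")  # stand-in for the script's constant

def save_all_logs(containers: list[str], run_id: str, save_container_logs) -> bool:
    """Only report success when at least one log write actually succeeded."""
    results = []
    for name in containers:
        results.append(save_container_logs(name, run_id))
        # Convention from the script: consensus-{network}
        results.append(save_container_logs(name.replace("ethrex-", "consensus-"), run_id))
    if any(results):
        print(f"📁 Logs saved to {LOGS_DIR}/run_{run_id}/", flush=True)
        return True
    print("⚠️ No logs could be saved (disk full?)", flush=True)
    return False
```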

No Ethereum-specific correctness or consensus concerns in this PR; the change is confined to local tooling.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps bot commented Apr 17, 2026

Greptile Summary

This PR addresses a production outage on srv9 where the Docker build cache grew to 750 GB and filled a 1.8 TB disk, killing the monitor with ENOSPC and leaving containers running unattended for 42+ hours. The fix prunes the build cache after every successful image build and wraps log-writing calls in OSError handlers so a full disk degrades to a warning instead of a crash.

Confidence Score: 5/5

Safe to merge; the primary root-cause fix (build-cache prune) is correct, and the OSError handling prevents crashes even if the disk fills again.

The only finding is a P2 inconsistency in OSError coverage inside save_all_logs — disk-full errors are still handled (the monitor doesn't crash) just via a different code path. All changes are confined to a single Python monitoring script and have already been validated in production on srv9.

tooling/sync/docker_monitor.py — minor OSError propagation inconsistency between save_container_logs and save_all_logs.

Important Files Changed

Filename Overview
tooling/sync/docker_monitor.py Adds docker builder prune after each successful image build and wraps log-writing calls in OSError handlers to survive disk-full conditions; prune logic is sound but the OSError guard in save_all_logs has partial coverage due to save_container_logs's internal broad exception handler.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start auto-update run] --> B[git pull latest]
    B --> C{pull OK?}
    C -- No --> D[sys.exit]
    C -- Yes --> E[docker build]
    E --> F{build OK?}
    F -- No --> G[sys.exit]
    F -- Yes --> H["docker builder prune -f NEW"]
    H --> I{prune OK?}
    I -- No --> J[print warning, continue]
    I -- Yes --> K[print reclaimed space]
    J --> L[restart containers]
    K --> L
    L --> M[monitor instances]
    M --> N{all done?}
    N -- No --> M
    N -- Yes --> O["save_all_logs OSError guarded"]
    O --> P{OSError?}
    P -- Yes --> Q[print warning, continue]
    P -- No --> R["log_run_result OSError guarded"]
    R --> S{OSError?}
    S -- Yes --> T[print warning + stdout fallback]
    S -- No --> U[slack_notify]
    T --> U
    Q --> R
    U --> V{all success?}
    V -- Yes --> A
    V -- No --> W[sys.exit 1]

Comment on lines +427 to +437
try:
    for inst in instances:
        # Save ethrex logs
        save_container_logs(inst.container, run_id)
        # Save consensus logs (convention: consensus-{network})
        consensus_container = inst.container.replace("ethrex-", "consensus-")
        save_container_logs(consensus_container, run_id)

    print(f"📁 Logs saved to {LOGS_DIR}/run_{run_id}/\n")
except OSError as e:
    print(f"⚠️ Failed to save some logs (disk full?): {e}", flush=True)

P2 except OSError in save_all_logs has incomplete coverage

The OSError guard here only catches errors from log_file.parent.mkdir() in save_container_logs, because that call sits outside save_container_logs's inner try block. However, write-time ENOSPC from log_file.write_text(logs) is swallowed by save_container_logs's broad except Exception handler, so the user-friendly "disk full?" message in save_all_logs is never printed for the most common failure scenario. The monitor won't crash either way (both paths print a warning), but the coverage is inconsistent.

A straightforward fix is to re-raise OSError from inside save_container_logs so it can propagate to the caller:

# save_container_logs – change the broad handler to only silence non-OSError exceptions
except OSError:
    raise  # let the caller (save_all_logs) handle disk-full
except Exception as e:
    print(f"  ⚠️ Error saving logs for {container}: {e}")
    return False


Labels

L1 Ethereum client

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants