fix(l1): prune Docker build cache and handle disk-full errors in multisync monitor#6500
avilagaston9 wants to merge 2 commits into main from
Conversation
…6486)

## Summary
- The Hive consume-engine Amsterdam tests for EIP-7778 and EIP-8037 were failing because ethrex's per-tx gas limit checks were incompatible with Amsterdam's new gas accounting rules.
- **EIP-7778** uses pre-refund gas for block accounting, so cumulative pre-refund gas can exceed the block gas limit even when a block builder correctly included all transactions.
- **EIP-8037** introduces 2D gas accounting (`block_gas = max(regular, state)`), meaning cumulative total gas (regular + state) can legally exceed the block gas limit.
- The fix skips the per-tx cumulative gas check for Amsterdam and adds a **post-execution** block-level overflow check using `max(sum_regular, sum_state)` in all three execution paths (sequential, pipeline, parallel).

## Local test results
- **200/201** EIP-7778 + EIP-8037 Hive consume-engine tests pass
- **105/105** EIP-7778 + EIP-8037 EF blockchain tests pass (4 + 101)
- The single remaining Hive failure (`test_block_regular_gas_limit[exceed=True]`) expects `TransactionException.GAS_ALLOWANCE_EXCEEDED` but we return `BlockException.GAS_USED_OVERFLOW` — the block is correctly rejected, just with a different error classification.

## Test plan
- [x] All EIP-7778 EF blockchain tests pass locally
- [x] All EIP-8037 EF blockchain tests pass locally
- [x] 200/201 Hive consume-engine Amsterdam tests pass locally
- [ ] Full CI Amsterdam Hive suite passes

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Handles disk-full errors gracefully in the multisync monitor. The build cache was growing unbounded across runs (750 GB on srv9), eventually filling the disk and crashing the monitor with ENOSPC when writing run logs. The monitor loop would then exit, leaving containers running unattended with no Slack notifications. The build-cache prune now runs after every successful `docker build`, and log writing (`save_all_logs`, `log_run_result`) catches `OSError` so a full disk degrades to a warning instead of killing the monitor.
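The non-fatal prune step described above can be sketched roughly as follows (a minimal sketch; the function name and messages are illustrative, not the monitor's actual code):

```python
import subprocess

def prune_build_cache(timeout: int = 120) -> None:
    """Prune the Docker build cache after a successful build; never raise,
    so a prune failure cannot take down the monitor loop."""
    try:
        result = subprocess.run(
            ["docker", "builder", "prune", "-f"],
            capture_output=True, text=True, timeout=timeout,
        )
        if result.returncode == 0:
            # docker prints a reclaimed-space summary on stdout
            lines = result.stdout.strip().splitlines()
            print(lines[-1] if lines else "Build cache pruned", flush=True)
        else:
            print(f"⚠️ builder prune failed (rc={result.returncode}), continuing", flush=True)
    except (subprocess.TimeoutExpired, OSError) as e:
        # docker missing, daemon unreachable, or prune too slow: warn and move on
        print(f"⚠️ builder prune error: {e}, continuing", flush=True)
```

Because every failure path only prints, the caller can invoke this unconditionally after a build without any surrounding `try`.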
🤖 Kimi Code Review

Review of Operational Robustness Improvements

Specific Issues:

Minor:

Security: No vulnerabilities detected. Commands use list format (safe from injection), and the prune operation uses

Verdict: Accept with minor fixes for output parsing robustness (Item 1) and potentially per-instance error handling (Item 3).

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
🤖 Claude Code Review

Here is my review of PR #6500:

PR #6500 — Fix: Prune Docker Build Cache & Handle Disk-Full Errors in Multisync Monitor

Overview

This PR fixes a production incident on srv9 where 750 GB of accumulated Docker build cache filled a 1.8 TB disk, crashing the multisync monitor mid-run via

Correctness

The

The stdout dump on write failure (

OSError message is slightly misleading

Both catch blocks print

```python
import errno

try:
    ...  # write run log
except OSError as e:
    if e.errno == errno.ENOSPC:
        print(f"⚠️ Disk full — failed to write run log: {e}", flush=True)
    else:
        print(f"⚠️ OS error writing run log: {e}", flush=True)
```

Performance / Operational Considerations

Every successful build now discards all cached layers before the next run. This prevents disk growth but also means every subsequent build starts cold. If the multisync cycles run frequently and share base image layers, this significantly increases build time per cycle. A less aggressive option:

120-second prune timeout

For very large caches (the original 750 GB), the first prune after deploying this fix could take longer than 120 seconds. Since the prune is non-fatal (

Minor Issues

In the prune block, catching

No Slack notification on degraded disk-full state

The PR description notes that the original crash caused missed Slack alerts for 42+ hours. The fix prevents the crash, but a disk-full degraded write still generates no Slack alert — a human won't know unless they check stdout/tmux. Out of scope for this PR, but worth a follow-up issue.

Summary

The fix is correct in intent and appropriately scoped. The non-fatal handling for both the prune and the log writes is the right design — a monitoring script should degrade gracefully, not crash. The main concern to verify is whether the

Automated review by Claude (Anthropic) · sonnet · custom prompt
Pull request overview
Improves the reliability of the multisync Docker monitor by proactively limiting Docker build cache growth and preventing the monitor loop from crashing when the disk is full, so runs don’t continue unattended without logging/notifications.
Changes:
- Prunes Docker build cache (`docker builder prune -f`) after successful image builds to avoid unbounded cache accumulation across runs.
- Adds `OSError` handling around log persistence so ENOSPC (disk full) degrades to warnings and stdout output instead of crashing.
Comments suppressed due to low confidence (1)
tooling/sync/docker_monitor.py:443
`log_run_result` catches `OSError` for file writes, but `ensure_logs_dir()` runs before the `try` block. If the logs directory needs to be created while the disk is full (or inodes are exhausted), `LOGS_DIR.mkdir(...)` will still raise `OSError` and crash the monitor. Consider moving `ensure_logs_dir()` inside the existing `try` or wrapping it in its own `except OSError` so disk-full degrades to the same warning/stdout behavior as the writes.
```python
def log_run_result(run_id: str, run_count: int, instances: list[Instance], hostname: str, branch: str, commit: str, build_profile: str = ""):
    """Append run result to the persistent log file."""
    ensure_logs_dir()
    all_success = all(i.status == "success" for i in instances)
```
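Copilot's point above can be addressed by moving directory creation inside the guard. A self-contained sketch of that variant (function name and paths are illustrative, not the monitor's actual code):

```python
from pathlib import Path

def log_run_result_safe(logs_dir: Path, run_id: str, line: str) -> bool:
    """Keep directory creation inside the OSError guard, so mkdir failures
    (disk full, inodes exhausted) degrade to a warning instead of crashing."""
    try:
        logs_dir.mkdir(parents=True, exist_ok=True)  # was outside the try in the PR
        with (logs_dir / f"run_{run_id}.log").open("a") as f:
            f.write(line + "\n")
        return True
    except OSError as e:
        print(f"⚠️ Failed to write run log (disk full?): {e}", flush=True)
        return False
```

With this shape, any `OSError` on the whole persistence path, including `mkdir`, takes the same warning branch.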
🤖 Codex Code Review
No Ethereum-specific correctness or consensus concerns in this PR; the change is confined to local tooling. Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Greptile Summary

This PR addresses a production outage on srv9 where the Docker build cache grew to 750 GB and filled a 1.8 TB disk, killing the monitor with

Confidence Score: 5/5

Safe to merge; the primary root-cause fix (build-cache prune) is correct, and the OSError handling prevents crashes even if the disk fills again. The only finding is a P2 inconsistency in OSError coverage inside `save_all_logs`; disk-full errors are still handled (the monitor doesn't crash), just via a different code path. All changes are confined to a single Python monitoring script and have already been validated in production on srv9.

tooling/sync/docker_monitor.py — minor OSError propagation inconsistency between `save_container_logs` and `save_all_logs`.
| Filename | Overview |
|---|---|
| tooling/sync/docker_monitor.py | Adds `docker builder prune` after each successful image build and wraps log-writing calls in OSError handlers to survive disk-full conditions; prune logic is sound but the OSError guard in `save_all_logs` has partial coverage due to `save_container_logs`'s internal broad exception handler. |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Start auto-update run] --> B[git pull latest]
B --> C{pull OK?}
C -- No --> D[sys.exit]
C -- Yes --> E[docker build]
E --> F{build OK?}
F -- No --> G[sys.exit]
F -- Yes --> H["docker builder prune -f NEW"]
H --> I{prune OK?}
I -- No --> J[print warning, continue]
I -- Yes --> K[print reclaimed space]
J --> L[restart containers]
K --> L
L --> M[monitor instances]
M --> N{all done?}
N -- No --> M
N -- Yes --> O["save_all_logs OSError guarded"]
O --> P{OSError?}
P -- Yes --> Q[print warning, continue]
P -- No --> R["log_run_result OSError guarded"]
R --> S{OSError?}
S -- Yes --> T[print warning + stdout fallback]
S -- No --> U[slack_notify]
T --> U
Q --> R
U --> V{all success?}
V -- Yes --> A
V -- No --> W[sys.exit 1]
```
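The control flow charted above can be condensed into a Python sketch. Each step is an injected callable so the failure handling is testable; the function and names are illustrative, not the real `docker_monitor.py`:

```python
def run_cycle(pull, build, prune, monitor, save_logs, log_result, notify) -> bool:
    """One monitor cycle mirroring the flowchart: fatal pull/build failures,
    non-fatal prune, OSError-guarded log persistence, then notification."""
    if not pull():
        raise SystemExit(1)   # git pull failed
    if not build():
        raise SystemExit(1)   # docker build failed
    try:
        prune()               # non-fatal: a failed prune only warns
    except Exception as e:
        print(f"⚠️ builder prune failed: {e}, continuing", flush=True)
    all_ok = monitor()        # wait for all instances to finish
    for step in (save_logs, log_result):
        try:
            step()            # OSError-guarded log persistence
        except OSError as e:
            print(f"⚠️ log step failed (disk full?): {e}", flush=True)
    notify()                  # Slack notification still goes out
    return all_ok
```

The key property is that only `pull` and `build` failures stop the cycle; everything after a successful build degrades to warnings.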
Prompt To Fix All With AI
This is a comment left during a code review.
Path: tooling/sync/docker_monitor.py
Line: 427-437
Comment:
**`except OSError` in `save_all_logs` has incomplete coverage**
The `OSError` guard here only catches errors from `log_file.parent.mkdir()` in `save_container_logs`, because that call sits *outside* `save_container_logs`'s inner `try` block. However, write-time `ENOSPC` from `log_file.write_text(logs)` is swallowed by `save_container_logs`'s broad `except Exception` handler, so the user-friendly `"disk full?"` message in `save_all_logs` is never printed for the most common failure scenario. The monitor won't crash either way (both paths print a warning), but the coverage is inconsistent.
A straightforward fix is to re-raise `OSError` from inside `save_container_logs` so it can propagate to the caller:
```python
# save_container_logs – change the broad handler to only silence non-OSError exceptions
except OSError:
raise # let the caller (save_all_logs) handle disk-full
except Exception as e:
print(f" ⚠️ Error saving logs for {container}: {e}")
return False
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Prune Docker build cache after each imag..."
```python
try:
    for inst in instances:
        # Save ethrex logs
        save_container_logs(inst.container, run_id)
        # Save consensus logs (convention: consensus-{network})
        consensus_container = inst.container.replace("ethrex-", "consensus-")
        save_container_logs(consensus_container, run_id)

    print(f"📁 Logs saved to {LOGS_DIR}/run_{run_id}/\n")
except OSError as e:
    print(f"⚠️ Failed to save some logs (disk full?): {e}", flush=True)
```
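The propagation the review asks for can be demonstrated in isolation. A sketch with an injected `write` callable standing in for the real log capture and `log_file.write_text` (names are illustrative):

```python
def save_container_logs(container: str, write) -> bool:
    """Shape of the reviewed helper: re-raise OSError so the caller's
    disk-full handler sees it; swallow any other per-container error."""
    try:
        write(container)  # stands in for docker-logs capture + log_file.write_text
        return True
    except OSError:
        raise  # let save_all_logs print the "disk full?" warning
    except Exception as e:
        print(f"  ⚠️ Error saving logs for {container}: {e}")
        return False

def save_all_logs(containers, write) -> None:
    """Caller-level guard: one 'disk full?' warning for the whole batch."""
    try:
        for container in containers:
            save_container_logs(container, write)
    except OSError as e:
        print(f"⚠️ Failed to save some logs (disk full?): {e}", flush=True)
```

With this split, a per-container parse or docker error skips only that container, while ENOSPC reaches the batch-level handler exactly once.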
Motivation
The multisync monitor on srv9 crashed with `OSError: [Errno 28] No space left on device` after completing Run #9 successfully. The Docker build cache was growing unbounded across runs (750 GB accumulated), eventually filling the 1.8 TB disk entirely. When the monitor tried to write run logs, ENOSPC killed the process, leaving containers running unattended with no Slack notifications for 42+ hours.

Description
Two changes to `docker_monitor.py`:

1. **Prune Docker build cache after each image build** — runs `docker builder prune -f` after every successful `docker build`, preventing cache from accumulating across multisync cycles. The prune is non-fatal: if it fails, a warning is printed and the monitor continues.
2. **Handle disk-full errors gracefully in log writing** — `save_all_logs` and `log_run_result` now catch `OSError` so a full disk degrades to a warning instead of crashing the monitor loop. On write failure, the run result is printed to stdout so it's still captured in tmux history.

Checklist