Atari fixes + benchmarks: memory, life-loss, per-game metric, README table #128
Merged
…ames

Three of CleanRL's 'PPO 37 details' that were missing, flagged when the 5M and 10M Breakout runs both plateaued at per-game ~75 with entropy stuck around 0.8 (policy wasn't sharpening, clip rarely activating):

- Linear LR anneal from 2.5e-4 -> 0 across the run; lets late updates fine-tune instead of bouncing.
- Value-function loss clipping around the old prediction (CLIP_COEF), matching the policy clipping range; stabilizes value targets.
- Advantage normalization moved inside the minibatch loop instead of once per batch.

Also bumps TOTAL_FRAMES 5M -> 10M to match the CleanRL Atari budget so runs are directly comparable to their published curves. lr now logged to wandb so the anneal is visible.
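For readers unfamiliar with the three PPO details named in this commit, here is a minimal dependency-free sketch of each. It is not the PR's code: the function names are hypothetical, `CLIP_COEF = 0.1` is an assumed value (the commit does not state it), and the normalization uses the population standard deviation for simplicity, which may differ from what the training script does on tensors.

```python
import math

LEARNING_RATE = 2.5e-4  # initial LR, taken from the commit message
CLIP_COEF = 0.1         # assumed value; the commit does not state it


def annealed_lr(update: int, num_updates: int, lr0: float = LEARNING_RATE) -> float:
    """Linear anneal: lr0 at update 0, shrinking toward 0 by the last update."""
    frac = 1.0 - update / num_updates
    return frac * lr0


def normalize_advantages(advantages: list[float], eps: float = 1e-8) -> list[float]:
    """Zero-mean, unit-std normalization, meant to be called once per minibatch."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(var)  # population std (a simplification)
    return [(a - mean) / (std + eps) for a in advantages]


def clipped_value_loss(new_values, old_values, returns, clip: float = CLIP_COEF) -> float:
    """0.5 * mean(max(unclipped, clipped)) squared error.

    The new value prediction may move at most `clip` away from the
    prediction recorded at rollout time before the loss stops rewarding it.
    """
    total = 0.0
    for v, v_old, ret in zip(new_values, old_values, returns):
        v_clipped = v_old + max(-clip, min(clip, v - v_old))
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return 0.5 * total / len(new_values)
```

In a training loop, `annealed_lr` would be evaluated once per update and written into the optimizer, while `normalize_advantages` would be applied to each minibatch slice rather than to the full rollout batch.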
- ReplayBuffer stores single frames and stacks 4 at sample time (~28GB -> ~7GB).
- LifeLossTerminalEnv signals terminal on life loss but defers real reset to game-over, so noop_max + FIRE no longer fire every life and GAE/Q chains break only at the right boundary.
- DQN: BATCH_SIZE 64 -> 32, TARGET_UPDATE_EVERY 2500 -> 250 train steps (~1k env frames), EPSILON_END 0.1 -> 0.01.
- Log per-life and per-game returns separately (DQN and PPO).
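The single-frame storage scheme can be illustrated with the index logic alone. The sketch below is an assumption about how such a buffer might rebuild a 4-stack at sample time, not the PR's `ReplayBuffer`: `stack_indices` and `build_stack` are hypothetical names, `dones[j]` is assumed to mean "the episode ended at frame `j`", and ring-buffer wrap-around is ignored.

```python
def stack_indices(idx, dones, stack=4):
    """Slots for the stack ending at `idx`, oldest first.

    None marks a frame that belongs to an earlier episode (or lies before
    the start of the buffer) and should be zero-filled.
    """
    slots = [idx]
    crossed = False
    # Walk backwards from the newest frame; once an episode boundary is
    # crossed, every older frame is masked.
    for j in range(idx - 1, idx - stack, -1):
        if j < 0 or dones[j]:
            crossed = True
        slots.append(None if crossed else j)
    return slots[::-1]


def build_stack(frames, dones, idx, stack=4, blank=None):
    """Assemble the stacked observation, substituting `blank` for masked slots."""
    return [blank if j is None else frames[j]
            for j in stack_indices(idx, dones, stack)]
```

In practice `frames` would hold `(84, 84)` uint8 arrays and `blank` an all-zero frame; plain Python objects work the same way for checking the masking logic.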
- README: add Atari to algorithms list, new Benchmarks section with hardware notes, per-algo row (params, train time, score, RAM, CPU/GPU, W&B report).
- DQN buffer 1M -> 500k (~3.5GB) so the default run no longer swaps on 8GB unified memory, as the 1M-capacity buffer did.
- moviepy added for the local eval/recording script.
- .gitignore: exclude scripts/ and docs/ (local-only working dirs).
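The buffer sizes quoted in these commits follow from simple arithmetic. The snippet below reproduces them under two assumptions that the PR does not spell out: frames are stored as 8-bit grayscale (one byte per pixel), and only observation storage is counted, with 1 GB taken as 10^9 bytes.

```python
FRAME_BYTES = 84 * 84  # one 84x84 frame at 1 byte per pixel (assumed uint8)


def buffer_gb(capacity: int, frames_per_slot: int) -> float:
    """Approximate observation storage in GB (10^9 bytes)."""
    return capacity * frames_per_slot * FRAME_BYTES / 1e9


stacked = buffer_gb(1_000_000, 4)  # old layout: a full 4-stack per slot, ~28 GB
single = buffer_gb(1_000_000, 1)   # new layout: one frame per slot, ~7 GB
default = buffer_gb(500_000, 1)    # new default capacity, ~3.5 GB
```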
Summary
Originally a PPO hyperparameter tuning PR; while running the longer 10M-frame schedule, a chain of issues surfaced, all fixed here. Also adds a README benchmarks section now that there are real numbers to report.
Replay buffer OOM (DQN). The buffer stored full `(4, 84, 84)` stacks per slot, so capacity 1M occupied ~28 GB and killed the laptop. Switched to single-frame storage with on-the-fly 4-stack reconstruction at sample time. Episode boundaries inside the stack are masked using stored `done` flags.

8 GB-friendly capacity. Even with the 4× memory cut, 1M × `(84, 84)` is ~7 GB, borderline on an 8 GB unified-memory MacBook (swap starts, training output gets noisy). Default capacity is now 500k (~3.5 GB); bump back to 1M on machines with headroom.

Life loss triggered full game reset. `terminal_on_life_loss=True` combined with the main loop's `env.reset()` made every life loss restart the game, burning frames on `noop_max=30` + FIRE and breaking long-horizon credit assignment. Added a `LifeLossTerminalEnv` wrapper that emits `terminated=True` on life loss but only resets the underlying env on real game-over. AtariPreprocessing's built-in flag is turned off so the wrapper owns the logic. Applies to both DQN and PPO via `env.py`.

DQN hyperparameters re-aligned with modern v5 defaults. `BATCH_SIZE` 64 → 32, `TARGET_UPDATE_EVERY` 2500 train steps → 250 (≈ 1k env frames, hard update), `EPSILON_END` 0.1 → 0.01.

Per-game return metric. Because life-loss now ends a logged episode, `recent_mean_return` reports per-life score. Added `recent_mean_game_return` that accumulates across all 5 lives and resets only on real game-over (signaled via `info["game_over"]` from `LifeLossTerminalEnv`). Logged to stdout and W&B in both DQN and PPO.

README benchmarks section. New "Benchmarks" block with a per-algorithm table (params, train time, final mean score, peak RAM, CPU/GPU, W&B report link). Hardware footprint is a MacBook Pro 14" (M3, 8 GB, MPS); CPU/GPU are read off Activity Monitor on the `python3.11` process. Scores live in publicly shared W&B Reports.

Misc. `moviepy` dep added for a local-only eval/recording script (kept out of git via `scripts/`). `.gitignore` excludes `scripts/`, `docs/`, `logs/`, all local-only working dirs.

Test plan

`LifeLossTerminalEnv` (rerun pending; previous run predates the fix)
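As a rough illustration of the behavior the summary attributes to `LifeLossTerminalEnv` and the per-game metric, here is a duck-typed sketch. It is not the wrapper from `env.py`. It assumes a Gymnasium-style five-tuple `step()`, an `info["lives"]` entry as ALE environments report, and that a `reset()` call after a mere life loss returns the cached observation instead of touching the underlying env. `ReturnTracker` is a hypothetical helper name.

```python
class LifeLossTerminalEnv:
    """Sketch: report terminated=True on life loss, reset only on game-over."""

    def __init__(self, env):
        self.env = env
        self.lives = 0
        self.game_over = True  # forces a real reset on the first call
        self.last_obs = None
        self.last_info = {}

    def reset(self, **kwargs):
        if self.game_over:
            obs, info = self.env.reset(**kwargs)
        else:
            # Life lost but the game continues: hand back the cached state.
            obs, info = self.last_obs, self.last_info
        self.game_over = False
        self.lives = info.get("lives", 0)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.game_over = terminated or truncated
        lives = info.get("lives", 0)
        life_lost = lives < self.lives
        self.lives = lives
        info = dict(info, game_over=self.game_over)
        self.last_obs, self.last_info = obs, info
        return obs, reward, terminated or life_lost, truncated, info


class ReturnTracker:
    """Accumulates per-life and per-game returns separately."""

    def __init__(self):
        self.life_return = 0.0
        self.game_return = 0.0
        self.life_returns = []
        self.game_returns = []

    def update(self, reward, terminated, truncated, info):
        self.life_return += reward
        self.game_return += reward
        if terminated or truncated:  # a logged episode ended (one life)
            self.life_returns.append(self.life_return)
            self.life_return = 0.0
        if info.get("game_over", False):  # the real game ended
            self.game_returns.append(self.game_return)
            self.game_return = 0.0
```

A check of the rerun could assert that the underlying env's reset count matches the number of completed games rather than the number of lives lost.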