Atari fixes + benchmarks: memory, life-loss, per-game metric, README table by dnddnjs · Pull Request #128 · rlcode/reinforcement-learning

dnddnjs · 2026-05-17T22:00:08Z

Summary

Originally a PPO hyperparameter tuning PR; while running the longer 10M-frame schedule, a chain of issues surfaced, and they are fixed here. Also adds a README benchmarks section now that there are real numbers to report.

Replay buffer OOM (DQN). The buffer stored full (4, 84, 84) stacks per slot, so capacity 1M occupied ~28 GB and killed the laptop. Switched to single-frame storage with on-the-fly 4-stack reconstruction at sample time. Episode boundaries inside the stack are masked using stored done flags.
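A minimal sketch of the single-frame layout described above, not the code in this PR. The class name `FrameReplayBuffer`, the method names, and the convention that `dones[i]` marks the last frame of an episode are assumptions made for illustration. For brevity it ignores the discontinuity at the ring buffer's write head, which a real sampler would have to skip.

```python
import numpy as np


class FrameReplayBuffer:
    """Hypothetical buffer: one uint8 frame per slot, stack rebuilt on demand."""

    def __init__(self, capacity, frame_shape=(84, 84), stack=4):
        self.capacity = capacity
        self.stack = stack
        self.frames = np.zeros((capacity, *frame_shape), dtype=np.uint8)
        self.dones = np.zeros(capacity, dtype=bool)
        self.pos = 0   # next slot to write
        self.size = 0  # number of valid slots

    def add(self, frame, done):
        """Store one frame; done=True means this frame ended an episode."""
        self.frames[self.pos] = frame
        self.dones[self.pos] = done
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def stacked(self, idx):
        """Return the (stack, H, W) observation whose newest frame is slot idx."""
        out = np.zeros((self.stack, *self.frames.shape[1:]), dtype=np.uint8)
        out[-1] = self.frames[idx]
        for k in range(1, self.stack):
            prev = idx - k
            # Stop at the buffer start or at the final frame of an earlier
            # episode; the remaining (older) slots of the stack stay zero.
            if prev < 0 or self.dones[prev]:
                break
            out[-1 - k] = self.frames[prev]
        return out
```

Zero-filling the masked slots is one possible choice; repeating the first frame of the episode would be another.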

8 GB-friendly capacity. Even with the 4× memory cut, 1M × (84,84) is ~7 GB — borderline on an 8 GB unified-memory MacBook (swap starts, training output gets noisy). Default capacity is now 500k (~3.5 GB); bump back to 1M on machines with headroom.
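The memory figures quoted here can be checked with a few lines of arithmetic. This assumes one byte per pixel (uint8 grayscale frames) and decimal gigabytes, and counts observation storage only; actions, rewards, and done flags add comparatively little. The helper name `buffer_gb` is hypothetical.

```python
FRAME_BYTES = 84 * 84  # one uint8 grayscale frame
STACK = 4


def buffer_gb(capacity, frames_per_slot):
    """Observation storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return capacity * frames_per_slot * FRAME_BYTES / 1e9


print(f"{buffer_gb(1_000_000, STACK):.1f} GB")  # stacked slots, 1M capacity: 28.2 GB
print(f"{buffer_gb(1_000_000, 1):.1f} GB")      # single frames, 1M capacity: 7.1 GB
print(f"{buffer_gb(500_000, 1):.1f} GB")        # single frames, 500k capacity: 3.5 GB
```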

Life loss triggered full game reset. terminal_on_life_loss=True combined with the main loop's env.reset() made every life loss restart the game — burning frames on noop_max=30 + FIRE and breaking long-horizon credit assignment. Added a LifeLossTerminalEnv wrapper that emits terminated=True on life loss but only resets the underlying env on real game-over. AtariPreprocessing's built-in flag is turned off so the wrapper owns the logic. Applies to both DQN and PPO via env.py.
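The wrapper's behaviour can be sketched roughly as follows. This is an illustration, not the implementation in env.py: it is written as a plain class around any object with a Gymnasium-style `reset()`/`step()` API so that it stays dependency-free, it assumes the life count is available as `info["lives"]`, and returning the cached observation on a soft reset is a simplification chosen here.

```python
class LifeLossTerminalEnv:
    """Sketch: report terminated=True on life loss, reset for real only on game-over."""

    def __init__(self, env):
        self.env = env
        self.lives = 0
        self.game_over = True  # forces a real reset on the first call
        self._last_obs = None
        self._last_info = {}

    def reset(self, **kwargs):
        if self.game_over:
            # Real game-over (or first call): reset the underlying env.
            obs, info = self.env.reset(**kwargs)
            self.game_over = False
        else:
            # Life lost but the game continues: keep the emulator state.
            obs, info = self._last_obs, dict(self._last_info)
        self.lives = info.get("lives", 0)
        self._last_obs, self._last_info = obs, info
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        info = dict(info)
        # Remember whether the underlying episode really ended.
        self.game_over = terminated or truncated
        info["game_over"] = self.game_over
        lives = info.get("lives", 0)
        if lives < self.lives:
            terminated = True  # life lost: end the logged episode only
        self.lives = lives
        self._last_obs, self._last_info = obs, info
        return obs, reward, terminated, truncated, info
```

With this shape the training loop can keep calling `reset()` whenever `terminated` is set; the wrapper decides whether that call reaches the emulator.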

DQN hyperparameters re-aligned with modern v5 defaults. BATCH_SIZE 64 → 32, TARGET_UPDATE_EVERY 2500 train steps → 250 (≈ 1k env frames, hard update), EPSILON_END 0.1 → 0.01.

Per-game return metric. Because life-loss now ends a logged episode, recent_mean_return reports per-life score. Added recent_mean_game_return that accumulates across all 5 lives and resets only on real game-over (signaled via info["game_over"] from LifeLossTerminalEnv). Logged to stdout and W&B in both DQN and PPO.
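One way to keep the two metrics apart is sketched below. The class name `ReturnTracker` and the 100-episode window are assumptions; only the metric names and the `info["game_over"]` signal come from the description above.

```python
from collections import deque


class ReturnTracker:
    """Sketch: track per-life and per-game returns over a sliding window."""

    def __init__(self, window=100):  # window size is an assumption
        self.life_returns = deque(maxlen=window)
        self.game_returns = deque(maxlen=window)
        self._life = 0.0
        self._game = 0.0

    def update(self, reward, terminated, truncated, info):
        self._life += reward
        self._game += reward
        if terminated or truncated:
            # A logged episode ended (life loss or real game-over).
            self.life_returns.append(self._life)
            self._life = 0.0
        if info.get("game_over", False):
            # All lives are spent: close out the full-game return.
            self.game_returns.append(self._game)
            self._game = 0.0

    @staticmethod
    def _mean(values):
        return sum(values) / len(values) if values else float("nan")

    @property
    def recent_mean_return(self):
        return self._mean(self.life_returns)

    @property
    def recent_mean_game_return(self):
        return self._mean(self.game_returns)
```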

README benchmarks section. New "Benchmarks" block with a per-algorithm table (params, train time, final mean score, peak RAM, CPU/GPU, W&B report link). Hardware footprint is measured on a MacBook Pro 14" (M3, 8 GB, MPS); CPU/GPU usage is read off Activity Monitor for the python3.11 process. Scores live in publicly shared W&B Reports.

Misc.

- moviepy dep added for a local-only eval/recording script (kept out of git via scripts/).
- .gitignore excludes scripts/, docs/, logs/ (all local-only working dirs).

Test plan

- DQN 10M-frame run finishes within 8 GB RAM budget (5.27 GB peak)
- DQN per-game mean reaches ~94 (was plateaued at ~12 per-life before)
- PPO 10M-frame run with the new LifeLossTerminalEnv (rerun pending; previous run predates the fix)

…ames

Three of CleanRL's 'PPO 37 details' that were missing, flagged when the 5M and 10M Breakout runs both plateaued at per-game ~75 with entropy stuck around 0.8 (policy wasn't sharpening, clip rarely activating):

- Linear LR anneal from 2.5e-4 -> 0 across the run; lets late updates fine-tune instead of bouncing.
- Value-function loss clipping around the old prediction (CLIP_COEF), matching the policy clipping range; stabilizes value targets.
- Advantage normalization moved inside the minibatch loop instead of once per batch.

Also bumps TOTAL_FRAMES 5M -> 10M to match the CleanRL Atari budget so runs are directly comparable to their published curves. lr now logged to wandb so the anneal is visible.
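The three details in this commit message can be written out on scalars as a rough illustration. This is plain Python rather than the tensor code in the repository; the function names are hypothetical, and using the sample standard deviation with a small epsilon for normalization is an assumption.

```python
import statistics


def linear_lr(update, num_updates, lr0=2.5e-4):
    """Linear anneal: lr0 at update 0, approaching 0 at the last update."""
    return (1.0 - update / num_updates) * lr0


def clipped_value_loss(new_v, old_v, ret, clip_coef):
    """Value loss where the new prediction may move at most clip_coef from the old one."""
    unclipped = (new_v - ret) ** 2
    v_clipped = old_v + max(-clip_coef, min(clip_coef, new_v - old_v))
    clipped = (v_clipped - ret) ** 2
    # Taking the larger of the two losses gives the pessimistic estimate.
    return 0.5 * max(unclipped, clipped)


def normalize_advantages(minibatch_advs, eps=1e-8):
    """Normalize within one minibatch (needs at least two values)."""
    mean = statistics.fmean(minibatch_advs)
    std = statistics.stdev(minibatch_advs)
    return [(a - mean) / (std + eps) for a in minibatch_advs]
```

Calling `normalize_advantages` on each minibatch, rather than once on the whole rollout, is what "inside the minibatch loop" refers to.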

- ReplayBuffer stores single frames and stacks 4 at sample time (~28GB -> ~7GB).
- LifeLossTerminalEnv signals terminal on life loss but defers real reset to game-over, so noop_max + FIRE no longer fire every life and GAE/Q chains break only at the right boundary.
- DQN: BATCH_SIZE 64 -> 32, TARGET_UPDATE_EVERY 2500 -> 250 train steps (~1k env frames), EPSILON_END 0.1 -> 0.01.
- Log per-life and per-game returns separately (DQN and PPO).

- README: add Atari to algorithms list, new Benchmarks section with hardware notes, per-algo row (params, train time, score, RAM, CPU/GPU, W&B report).
- DQN buffer 1M -> 500k (~3.5GB) so a 1M-capacity run stops swapping on 8GB unified memory.
- moviepy added for the local eval/recording script.
- .gitignore: exclude scripts/ and docs/ (local-only working dirs).

dnddnjs added 2 commits May 18, 2026 06:59

dnddnjs changed the title ~~PPO tuning: LR anneal, value clipping, per-minibatch adv norm~~ Atari fixes: DQN memory, life-loss episodes, per-game metric May 23, 2026

dnddnjs added 2 commits May 24, 2026 11:58

Ignore local logs/ directory

804e0f2

dnddnjs changed the title ~~Atari fixes: DQN memory, life-loss episodes, per-game metric~~ Atari fixes + benchmarks: memory, life-loss, per-game metric, README table May 24, 2026

dnddnjs merged commit 54ffaeb into master May 24, 2026

dnddnjs deleted the atari-ppo-tuning branch May 24, 2026 03:03

dnddnjs merged 4 commits into master from atari-ppo-tuning
