01 // THE SETUP
After building RL Pong with Q-Learning, DQN, and REINFORCE, the natural next step was the algorithm family that actually runs the modern world: Actor-Critic methods. A2C is how DeepMind's agents learn. PPO is how ChatGPT was aligned via RLHF. And if you read When GANs Meet RL, you already know the punchline — the actor is the generator, the critic is the discriminator. Same adversarial game, different costume.
Lunar Lander was the obvious vehicle. It fits the space theme, has intuitive physics anyone can understand (gravity goes down, thrust goes up), and the 6-action space is rich enough to show real behavioral differences between algorithms. Most importantly, the actor-critic split is visually demonstrable: you can watch the critic’s confidence rise as the lander approaches the pad, and the actor concentrate probability mass onto main-thrust.
The goal: implement DQN (baseline visitors know from Pong), A2C (the core teaching algorithm), and PPO (the modern standard) — all training live in the browser with a real-time actor-critic visualization panel.
02 // THE THREE ALGORITHMS
DQN — The Baseline
DQN is the same Double DQN from Pong, scaled up to 6 actions. One network estimates $Q(s, a)$ for all actions; the agent picks $\arg\max_a Q(s, a)$ (with $\epsilon$-greedy exploration). The target network stabilizes learning:
$$Q(s, a) \leftarrow r + \gamma \, Q_{\text{target}}(s', \arg\max_{a'} Q_{\text{online}}(s', a'))$$
Architecture: MiniNet2(8, 64, 48, 6) = 3,990 parameters. It works, but it only learns values; there’s no explicit policy. This is where actor-critic enters.
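As a rough illustration of the Double DQN target above, here is a minimal sketch. The helper names (`argmax`, `doubleDqnTarget`) and the `done` flag that drops the bootstrap term on terminal transitions are assumptions for illustration, not the project's actual code.

```javascript
// Hypothetical helper: index of the largest entry in an array of Q-values.
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Double DQN target: the online network picks the next action,
// the target network scores it.
// Assumption: `done` removes the bootstrap term on terminal transitions.
function doubleDqnTarget(reward, done, qOnlineNext, qTargetNext, gamma) {
  if (done) return reward;
  const aStar = argmax(qOnlineNext);
  return reward + gamma * qTargetNext[aStar];
}

// Online net prefers action 1; the target net's estimate for action 1 is 0.3.
const target = doubleDqnTarget(1.0, false, [0.2, 0.9, 0.1], [0.5, 0.3, 0.8], 0.99);
console.log(target); // 1 + 0.99 * 0.3
```

Note that the target uses 0.3 (the target network's score for the online network's choice), not 0.8 (the target network's own maximum). Decoupling selection from evaluation is what reduces overestimation.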
A2C — Advantage Actor-Critic
A2C splits the brain in two. The Actor $\pi_\theta(a|s)$ is a policy network that outputs a probability distribution over actions. The Critic $V_\phi(s)$ is a value network that estimates how good the current state is. The key insight: instead of using raw returns as in REINFORCE, we use the advantage:
$$A(s, a) = r + \gamma \, V_\phi(s') - V_\phi(s)$$
This is the TD (temporal difference) advantage: “how much better was this outcome than what the critic expected?” A positive advantage means “do more of this,” negative means “do less.” The actor’s policy gradient becomes:
$$\nabla_\theta J \approx \nabla_\theta \log \pi_\theta(a|s) \cdot A(s, a)$$
And the critic trains via MSE on the TD target:
$$\mathcal{L}_{\text{critic}} = \Big( r + \gamma \, V_\phi(s') - V_\phi(s) \Big)^2$$
Two networks, two loss functions, one elegant loop. Actor: MiniNet2(8, 64, 48, 6) = 3,990 params. Critic: MiniNet2(8, 64, 48, 1) = 3,745 params. Total: 7,735 parameters.
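To make the two losses concrete, here is a small sketch that computes the advantage, actor loss, and critic loss for a single transition. The function name `a2cTransition` and the `done` handling are illustrative assumptions. The advantage is treated as a constant in the actor loss, which is the usual convention.

```javascript
// One-step A2C quantities for a single transition (illustrative only).
// probs: actor output pi(.|s); action: index of the action taken;
// vS, vNext: critic estimates for s and s'.
function a2cTransition(probs, action, reward, vS, vNext, done, gamma) {
  // Assumption: no bootstrapping past a terminal state.
  const tdTarget = reward + (done ? 0 : gamma * vNext);
  const advantage = tdTarget - vS; // A(s, a)
  // Minimizing this loss ascends log pi(a|s) * A(s, a).
  const actorLoss = -Math.log(probs[action]) * advantage;
  const criticLoss = advantage * advantage; // squared TD error
  return { advantage, actorLoss, criticLoss };
}

const out = a2cTransition([0.1, 0.6, 0.3], 1, 0.5, 1.0, 1.2, false, 0.99);
console.log(out.advantage.toFixed(3)); // 0.688: the outcome beat the critic's estimate
```

A positive advantage makes the actor loss push up the probability of the taken action, while the critic loss pulls $V_\phi(s)$ toward the TD target.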
PPO — Proximal Policy Optimization
PPO is A2C with guardrails. The problem with vanilla policy gradients: a single bad update can destroy a good policy. PPO’s fix is the clipped surrogate objective:
$$L^{\text{PPO}} = \mathbb{E} \Big[ \min\!\big( r_t(\theta) \, A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \, A_t \big) \Big]$$
where the importance ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ measures how much the policy has changed. If the ratio drifts too far from 1 (policy changed too much), the clipping kicks in and stops the gradient. This is why PPO is the algorithm of choice for RLHF: it keeps every policy update small enough that one bad batch can’t wreck a good policy.
PPO also uses Generalized Advantage Estimation (GAE) instead of single-step TD:
$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \, \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Here $T$ is the trajectory length, so the sum runs over the TD errors from step $t$ to the last step $T-1$. The $\lambda$ parameter (0.95 here) interpolates between high-bias/low-variance (TD, $\lambda=0$) and low-bias/high-variance (Monte Carlo, $\lambda=1$). Same architecture as A2C: 7,735 parameters.
03 // THE PHYSICS
Everything runs in an 800×600 logical coordinate space (same as Pong). The physics are deliberately simple — just enough to make landing non-trivial:
- Gravity: 0.04 px/frame² downward (gentle moon gravity)
- Main thrust: 0.15 px/frame² along the lander’s up axis
- Side thrusters: 0.05 px/frame² for lateral translation
- Rotation: ±0.04 rad/frame, with angular damping of 0.98
- Wind gusts: random, max 0.02 px/frame² (enabled at curriculum level 3)
These values went through iteration. We originally used gravity 0.05 and thrust 0.12, giving a thrust-to-gravity ratio of 2.4×. The agent needed to fire thrust 42% of the time just to hover — too demanding for a random exploration policy that only picks each of 6 actions ~17% of the time. Bumping to gravity 0.04 and thrust 0.15 (ratio 3.75×) means hovering requires thrust only 27% of the time, giving the agent enough control authority to accidentally discover braking.
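The hover arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch that assumes an upright lander (so main thrust points straight up) and ignores rotation and side thrusters. `hoverDutyCycle` is a name made up for this example.

```javascript
// Fraction of frames main thrust must fire so that average
// upward acceleration cancels gravity (upright lander assumed).
function hoverDutyCycle(gravity, thrust) {
  return gravity / thrust;
}

const before = hoverDutyCycle(0.05, 0.12); // ≈ 0.417 with the original constants
const after = hoverDutyCycle(0.04, 0.15);  // ≈ 0.267 with the tuned constants
const uniformRate = 1 / 6;                 // ≈ 0.167: one of 6 equally likely actions

console.log(before.toFixed(3), after.toFixed(3), uniformRate.toFixed(3));
```

Even with the tuned constants, a uniform random policy fires main thrust less often than hovering requires, but the gap is much smaller than with the original values.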
Terrain is generated via midpoint displacement — a fractal algorithm that produces realistic mountainous profiles. An 80px-wide landing pad is placed randomly in the middle third of the terrain, painted neon green with glow.
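For readers unfamiliar with midpoint displacement, here is a minimal 1D sketch. The function name, parameters (`roughness`, `amplitude`), and the injectable `rand` are assumptions for illustration. The game's actual generator and its pad placement may differ.

```javascript
// 1D midpoint displacement: returns 2^levels + 1 terrain heights.
// `rand` is injectable so the sketch can be made deterministic.
function midpointDisplacement(levels, startY, endY, roughness, amplitude, rand = Math.random) {
  let heights = [startY, endY];
  let amp = amplitude;
  for (let level = 0; level < levels; level++) {
    const next = [];
    for (let i = 0; i < heights.length - 1; i++) {
      // Midpoint of each segment, nudged by a random offset in [-amp, amp).
      const mid = (heights[i] + heights[i + 1]) / 2 + (rand() * 2 - 1) * amp;
      next.push(heights[i], mid);
    }
    next.push(heights[heights.length - 1]);
    heights = next;
    amp *= roughness; // smaller displacements at finer scales
  }
  return heights;
}

const terrain = midpointDisplacement(6, 500, 500, 0.5, 80);
console.log(terrain.length); // 65 samples across the terrain width
```

A `roughness` below 1 shrinks the displacement at each subdivision, which is what gives the profile large mountains with progressively smaller bumps on top.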
The lander itself is a triangle with two legs, rendered in neon cyan with animated thrust flames (particle system). Landing criteria: both leg tips on pad, $|\theta| < 20°$, $|v_y| < 1.5$, $|v_x| < 0.8$. We relaxed these from the initial values ($15°$, $1.0$, $0.5$) after discovering that the 4-pixel landing window was too tight for small networks to hit reliably.
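The relaxed landing criteria translate into a short predicate. This is a sketch, not the game's code: `isSafeLanding` and the two leg-contact booleans are hypothetical names, and the angle is assumed to be stored in radians.

```javascript
const MAX_ANGLE = (20 * Math.PI) / 180; // 20 degrees, in radians
const MAX_VY = 1.5;
const MAX_VX = 0.8;

// lander: { angle, vx, vy } with angle in radians (assumed).
// leftLegOnPad / rightLegOnPad: hypothetical booleans from collision detection.
function isSafeLanding(lander, leftLegOnPad, rightLegOnPad) {
  return (
    leftLegOnPad &&
    rightLegOnPad &&
    Math.abs(lander.angle) < MAX_ANGLE &&
    Math.abs(lander.vy) < MAX_VY &&
    Math.abs(lander.vx) < MAX_VX
  );
}

console.log(isSafeLanding({ angle: 0.1, vx: 0.2, vy: 1.0 }, true, true)); // true
console.log(isSafeLanding({ angle: 0.1, vx: 0.2, vy: 2.0 }, true, true)); // false: too fast
```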
// 6 discrete actions
const ACTIONS = {
  NOOP: 0,
  THRUST_MAIN: 1,
  THRUST_LEFT: 2,
  THRUST_RIGHT: 3,
  ROTATE_LEFT: 4,
  ROTATE_RIGHT: 5
};
04 // THE REWARD THAT BROKE EVERYTHING (THREE TIMES)
The state is an 8-dimensional vector, all normalized to roughly [-1, 1]:
const state = [
  (lander.x - padCx) / W,   // pad-relative X
  (lander.y - padCy) / H,   // pad-relative Y
  lander.vx / 5,            // normalized horizontal velocity
  lander.vy / 5,            // normalized vertical velocity
  lander.angle / Math.PI,   // normalized angle
  lander.angVel / 0.2,      // normalized angular velocity
  leftLegContact,           // binary: left leg on pad
  rightLegContact           // binary: right leg on pad
];
The reward function went through five iterations before things worked. Three of them failed for reasons the textbooks don’t warn you about.
Failure 1: Rewarding gravity (0% landing)
The textbook approach: reward closeness to the pad, penalize crashes.
reward = -distance_to_pad * 0.5; // per frame
if (landed) reward = +100; // terminal
0% landing after 200 episodes. Why? The distance reward increases as the lander falls — gravity does the work for free. The agent gets rewarded for doing nothing. By the time it’s near the pad, it’s going too fast to land.
Lesson: reward shaping that correlates with natural dynamics is useless. If gravity already makes you closer to the pad, rewarding closeness teaches nothing. You have to reward the controlled approach — being close AND slow simultaneously:
// What actually works: closeness × slowness product
reward = closeness * Math.max(0, 3 - speed) * 0.5;
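To see why the product fixes the "rewarding gravity" problem, here is a small sketch. The post does not define `closeness`, so the linear falloff in `closenessTo` and the 400 px `maxDist` are assumptions made only for this example.

```javascript
// Assumption: closeness falls linearly from 1 at the pad to 0 at maxDist pixels.
function closenessTo(dx, dy, maxDist) {
  return Math.max(0, 1 - Math.hypot(dx, dy) / maxDist);
}

// Closeness x slowness product from the post; speed is the velocity magnitude.
function shapedReward(dx, dy, vx, vy, maxDist = 400) {
  const closeness = closenessTo(dx, dy, maxDist);
  const speed = Math.hypot(vx, vy);
  return closeness * Math.max(0, 3 - speed) * 0.5;
}

const fastNearPad = shapedReward(0, 10, 0, 3.5);  // 0: close but too fast
const slowNearPad = shapedReward(0, 10, 0, 0.5);  // 0.975 * 2.5 * 0.5
const slowFarAway = shapedReward(0, 400, 0, 0.5); // 0: slow but nowhere near the pad

console.log(fastNearPad, slowNearPad, slowFarAway);
```

Free-falling onto the pad earns nothing here, because the slowness factor is zero by the time the lander arrives. Only a braked approach collects reward.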
Failure 2: The 240× dynamic range (DQN 14%, A2C/PPO 2%)
The closeness×slowness product worked — DQN hit 14%. But A2C and PPO were stuck at 2%. The problem: per-frame rewards were ~1 while terminal rewards were ±100. A 240× dynamic range.
A tiny neural network trying to predict $V(s)$ with MSE loss sees gradients that vary by two orders of magnitude. The critic spends all its capacity modeling the rare terminal transition and learns nothing about the approach. The advantages become noise.
DQN survived because its replay buffer averages over thousands of transitions, dampening the spikes. A2C and PPO, which train on recent experience directly, were destroyed.
Lesson: this is arguably the single most important practical lesson in RL engineering. Reward scale matters as much as reward shape. If your terminal reward is 100× your step reward, the critic will learn nothing useful. We scaled terminal rewards to ±20 and added a separate 0.1× multiplier for on-policy methods. A2C jumped from 2% to ~15%.
Failure 3: Uniform exploration fights the physics (PPO stuck at 2%)
Even after fixing reward scale, PPO barely learned. The answer was embarrassingly simple: $\epsilon$-greedy exploration picks uniformly from 6 actions. THRUST_MAIN fires 17% of the time. But hovering requires thrust 27% of the time (gravity / thrust = 0.04 / 0.15). Random exploration was biased against the correct behavior.
Fix: weight THRUST_MAIN at 40% of random actions. This isn’t cheating — it’s an informed prior. We know the lander needs thrust to fight gravity; we’re telling the exploration policy, not the learned policy.
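function biasedRandomAction() {
  if (Math.random() < 0.40) return 1; // THRUST_MAIN
  // remaining 60% split across the 5 other actions (uniform split assumed here)
  const others = [0, 2, 3, 4, 5];
  return others[Math.floor(Math.random() * others.length)];
}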
PPO went from 2% to 23%. The single biggest improvement we made.
Lesson: if your domain has a physics prior (gravity demands thrust, friction demands steering), encode it in the exploration distribution. The agent still learns when to thrust; you’re just ensuring it discovers that thrust matters before the episode budget runs out.
What else mattered
Two more structural changes each contributed ~5 percentage points:
- Two hidden layers instead of one. The value function for landing depends on interactions between position and velocity (close + slow = good, close + fast = bad). ReLU networks are piecewise linear at any depth, so they can only approximate products, and a single hidden layer of this width approximates them coarsely. With two layers, the first can compute features (closeness, speed) and the second can combine them non-linearly, which fits the interaction far better at this parameter budget.
- Curriculum spawn altitude. Level 0 originally spawned at 120–200px. Free-fall from 160px builds $v_y = 3.2$ by impact — impossible to land. Dropping to 35–60px lets the agent discover braking within 50 episodes, bootstrapping the learning signal.
05 // THE TRAINING LOOP
Lunar Lander is single-player, so difficulty comes from curriculum learning via spawn randomization:
- Level 0 (Easy): 35–60px directly above pad, zero velocity
- Level 1 (Medium): 100–180px altitude, ±40px offset
- Level 2 (Hard): 160–300px altitude, ±100px offset, initial velocity
- Level 3 (Full): 200–400px altitude, ±150px offset, wind gusts
The level advances when the rolling landing rate (last 20 episodes) exceeds 60%. The warmup boot pre-trains each agent for 600–1,000 episodes in non-blocking chunks (~2–3 seconds on page load).
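The advancement rule can be sketched as a small rolling-window tracker. The names (`createCurriculum`, `recordEpisode`) are hypothetical, and clearing the window after a level-up is an assumption. The post does not say whether the window resets.

```javascript
const WINDOW = 20;        // rolling window of episodes
const ADVANCE_RATE = 0.6; // advance when landing rate exceeds 60%
const MAX_LEVEL = 3;

function createCurriculum() {
  return { level: 0, recent: [] };
}

// Record one episode outcome and return the (possibly advanced) level.
function recordEpisode(cur, landed) {
  cur.recent.push(landed ? 1 : 0);
  if (cur.recent.length > WINDOW) cur.recent.shift();
  if (cur.recent.length === WINDOW && cur.level < MAX_LEVEL) {
    const rate = cur.recent.reduce((a, b) => a + b, 0) / WINDOW;
    if (rate > ADVANCE_RATE) {
      cur.level += 1;
      cur.recent = []; // assumption: start a fresh window at the new level
    }
  }
  return cur.level;
}

const cur = createCurriculum();
for (let i = 0; i < 7; i++) recordEpisode(cur, false);
for (let i = 0; i < 13; i++) recordEpisode(cur, true);
console.log(cur.level); // 1: 13 of the last 20 episodes landed (65%)
```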
06 // WHAT EACH ALGORITHM ACTUALLY DID
DQN: The Replay Buffer Wins at Small Scale
DQN was the most resilient performer — 14% even with bad rewards. The replay buffer is the reason: (1) gradient spikes from terminal rewards get diluted by hundreds of per-frame transitions, (2) the agent sees diverse experiences regardless of its current policy, (3) rare near-landings get replayed many times.
Interestingly, the Adam optimizer made DQN worse. Adam’s momentum accumulates bias from non-stationary Q-targets: the target network changes every 200 steps, so the optimization landscape shifts underneath the momentum. In our runs, DQN did best with plain SGD, which has no memory of stale gradients.
A2C: Fast but Fragile
A2C was the most sensitive to reward design: it went from 2% to ~15% purely based on reward normalization. Without a replay buffer, the critic sees only recent transitions. If the reward scale is unstable, the critic’s MSE loss oscillates wildly, the advantages become noise, and the actor gets random gradient updates.
A2C benefits from the Adam optimizer (unlike DQN) because policy gradients are naturally noisy and momentum smooths them. The entropy bonus ($\beta = 0.04$) is also critical: without it, A2C collapses to “always fire main thrust,” which is locally optimal but can’t handle lateral corrections.
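The entropy bonus is easy to state in code. This is a generic sketch of the standard formulation, with hypothetical function names. It shows the loss term only, not how the gradient is wired into MiniNet.

```javascript
// Shannon entropy of a discrete policy, in nats.
function entropy(probs) {
  let h = 0;
  for (const p of probs) {
    if (p > 0) h -= p * Math.log(p);
  }
  return h;
}

// Actor loss with entropy bonus: subtracting beta * H means that
// minimizing the loss also favors spread-out (exploratory) policies.
function actorLossWithEntropy(logProbTaken, advantage, probs, beta = 0.04) {
  return -logProbTaken * advantage - beta * entropy(probs);
}

const uniform = entropy([1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 6]); // ln 6 ≈ 1.792
const collapsed = entropy([0, 1, 0, 0, 0, 0]);                      // 0
console.log(uniform.toFixed(3), collapsed);
```

A policy that has collapsed onto main thrust has zero entropy and earns no bonus, so the extra term keeps some probability on the lateral and rotation actions.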
PPO: The Most Consistent Learner
PPO lands 22–24% across all random seeds, the tightest spread of any algorithm (2 percentage points). This is what the theory predicts: clipping limits how far a single update can move the policy, so once PPO finds a decent policy, it rarely loses it. A2C can reach higher peaks but also has worse valleys.
PPO’s weakness: it trains less frequently than A2C (every 64 steps vs every 16), and each update is clipped. In production RLHF, this is solved with large batch sizes. In a browser with under 8K parameters, it just needs patience.
The Scoreboard
| Algorithm | Params | Landing rate @ 1K episodes | Range across seeds |
|---|---|---|---|
| DQN (SGD) | 3,990 | 21% | 19–23% |
| A2C (Adam) | 7,735 | 14% | 12–19% |
| PPO (SGD) | 7,735 | 23% | 22–24% |
What didn’t work
Some lessons come from negative results:
- Frame skipping (decide every 4 physics frames, like Atari DQN): helped A2C but destroyed DQN and PPO. At Atari scale (10K+ frames/episode), frame skip reduces redundancy. At our scale (35 frames), it just cuts training data, starving replay-based and trajectory-based methods.
- Step-function rewards (bonus jumps from 0 to +3 at a threshold): neural networks hate discontinuities. Replaced with smooth $\text{closeness}^2 \times (3 - \text{speed})/3$ for stable gradients everywhere.
- Hyperparameter tuning (batch size, epochs, learning rate): we wasted time tuning knobs when the real problems were structural. Every run has high variance — a lucky seed makes any configuration look good.
The meta-lesson: the changes that actually mattered were all structural — exploration bias, reward scale, network depth, spawn difficulty. If your RL agent isn’t learning, resist the urge to tune. Look for a mismatch between your assumptions and your problem.
07 // SEEING ACTOR-CRITIC
This is the teaching payoff. When you select A2C or PPO, a visualization panel appears below the game canvas with two components:
The Critic Confidence Meter: a vertical bar that shows $V(s)$ in real time. It goes from red (low value — “we’re going to crash”) through yellow to green (high value — “we’re about to land”). Watch it during a trained agent’s descent: it starts low, rises as the lander aligns with the pad, and spikes green right before touchdown.
The Actor Action Probabilities: a 6-bar chart showing $\pi(a|s)$ for each action. During early training, all bars are roughly equal (random policy). As the agent learns, you see probability mass concentrate: main-thrust dominates during descent, rotation adjusts for wind, NOOP appears during stable glides. The top-probability action glows green.
If you’ve read the GAN-RL blog post, this should feel viscerally familiar. The actor is the generator — it produces actions (samples). The critic is the discriminator — it evaluates states (scores). The actor tries to take actions that the critic approves of. The critic tries to accurately predict outcomes. Same adversarial game, same convergent dynamic.
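In RLHF for LLMs, the actor is the language model (generating tokens). The reward model scores finished responses and supplies the reward signal, while a learned value function plays the critic’s role inside PPO (InstructGPT initializes it from the reward model). PPO is the training algorithm. The pieces of this visualization map closely onto how ChatGPT was trained.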
08 // TECHNICAL DETAILS
trainPPO() — The New MiniNet Method
The PPO clipped objective requires a new training method on MiniNet. The key difference from trainPG(): we need the old policy’s probabilities to compute the importance ratio.
// Importance ratio for the taken action
const ratio = newP / oldP;
// Clipped surrogate
const clipped = Math.max(1 - clipEps, Math.min(1 + clipEps, ratio));
const surr1 = ratio * advantage;
const surr2 = clipped * advantage;
// Gradient flows through the binding term
const useClipped = (surr1 > surr2) ? 1 : 0;
When the advantage is positive (action was good), the ratio is clipped from above at $1 + \epsilon$ — the policy can’t increase the probability of this action too much. When the advantage is negative, it’s clipped from below at $1 - \epsilon$. This creates a “trust region” around the old policy.
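The four cases (positive or negative advantage, ratio inside or outside the clip range) are easier to see with a self-contained version of the fragment above. `ppoSurrogate` is a hypothetical wrapper, and `clipEps = 0.2` is an assumed default. The post does not state the value used.

```javascript
// Clipped surrogate for one sample. Returns the objective value and
// d(objective)/d(ratio), which is zero when the clipped term is binding.
function ppoSurrogate(newP, oldP, advantage, clipEps = 0.2) {
  const ratio = newP / oldP;
  const clipped = Math.max(1 - clipEps, Math.min(1 + clipEps, ratio));
  const surr1 = ratio * advantage;
  const surr2 = clipped * advantage;
  const objective = Math.min(surr1, surr2);
  // Gradient flows only when the unclipped term is the minimum.
  const gradWrtRatio = surr1 <= surr2 ? advantage : 0;
  return { ratio, objective, gradWrtRatio };
}

// Good action whose probability already doubled: clipped, no further push.
const tooFarUp = ppoSurrogate(0.5, 0.25, 1.0);
// Good action whose probability dropped: not clipped, gradient still flows.
const wrongWay = ppoSurrogate(0.1, 0.25, 1.0);
console.log(tooFarUp.gradWrtRatio, wrongWay.gradWrtRatio); // 0 1
```

The `min` makes the clipping one-sided: updates that would move further in the already-rewarded direction are cut off, while updates that correct a move in the wrong direction are left alone.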
GAE Implementation
GAE computes advantages in a single backward pass over the trajectory:
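// values holds T + 1 critic estimates: values[t] = V(s[t]), and values[T]
// is the bootstrap value for the final state (0 if the episode terminated).
let gae = 0;
for (let t = T - 1; t >= 0; t--) {
  const delta = reward[t] + gamma * values[t + 1] - values[t]; // TD error
  gae = delta + gamma * lambda * gae;
  advantages[t] = gae;
  returns[t] = gae + values[t]; // regression target for the critic
}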
Each advantage is a weighted sum of all future TD errors, with exponential decay controlled by $\gamma \lambda$. At $\lambda = 0$ you get single-step TD (high bias, low variance). At $\lambda = 1$ you get Monte Carlo returns (low bias, high variance). The sweet spot at 0.95 gives smooth advantage estimates that help PPO learn stably.
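The two limits are easy to verify numerically. The sketch below wraps the backward pass in a function and runs it on a toy trajectory. The name `computeGae` and the convention that `values` carries one extra bootstrap entry are assumptions for this example.

```javascript
// rewards: length T. values: length T + 1, where values[T] is the bootstrap
// value of the final state (0 when the episode terminated).
function computeGae(rewards, values, gamma, lambda) {
  const T = rewards.length;
  const advantages = new Float64Array(T);
  const returns = new Float64Array(T);
  let gae = 0;
  for (let t = T - 1; t >= 0; t--) {
    const delta = rewards[t] + gamma * values[t + 1] - values[t];
    gae = delta + gamma * lambda * gae;
    advantages[t] = gae;
    returns[t] = gae + values[t];
  }
  return { advantages, returns };
}

// Toy trajectory: reward 1 per step, an untrained critic that outputs 0.
const rewards = [1, 1, 1];
const values = [0, 0, 0, 0];
const td = computeGae(rewards, values, 1.0, 0.0); // lambda = 0: single-step TD
const mc = computeGae(rewards, values, 1.0, 1.0); // lambda = 1: Monte Carlo
console.log(Array.from(td.advantages)); // [1, 1, 1]
console.log(Array.from(mc.advantages)); // [3, 2, 1]
```

With $\lambda = 0$ each step only sees its own TD error. With $\lambda = 1$ each step's advantage is the full remaining return minus the critic's estimate.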
Parameter Counts
Every parameter is a Float64 in typed arrays — no framework, no tensors, just arithmetic. We use 2-hidden-layer networks (MiniNet2) for the capacity to represent non-linear value functions:
- DQN: MiniNet2(8, 64, 48, 6) → (8×64+64) + (64×48+48) + (48×6+6) = 576+3120+294 = 3,990 — SGD optimizer
- A2C/PPO Actor: MiniNet2(8, 64, 48, 6) → 3,990
- A2C/PPO Critic: MiniNet2(8, 64, 48, 1) → (8×64+64) + (64×48+48) + (48×1+1) = 576+3120+49 = 3,745
- A2C/PPO Total: 3,990 + 3,745 = 7,735 — A2C uses Adam, PPO uses SGD
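The counts above follow from weights plus biases per layer, which a few lines can confirm. `countParams` is a helper written for this check, not part of MiniNet2.

```javascript
// Parameters of a fully connected network: for each pair of adjacent
// layers, (inputs x outputs) weights plus one bias per output.
function countParams(layerSizes) {
  let total = 0;
  for (let i = 0; i < layerSizes.length - 1; i++) {
    total += layerSizes[i] * layerSizes[i + 1] + layerSizes[i + 1];
  }
  return total;
}

const actorParams = countParams([8, 64, 48, 6]);  // 3990
const criticParams = countParams([8, 64, 48, 1]); // 3745
console.log(actorParams, criticParams, actorParams + criticParams); // 3990 3745 7735
```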
09 // TRY IT YOURSELF
Start with DQN (the algorithm you know from Pong). Hit AUTO-TRAIN for 100 episodes and watch the landing rate climb. Then switch to A2C — notice how the Actor-Critic panel appears. Train it and watch the critic meter track the lander’s altitude. Finally, try PPO and compare the learning curves. The action probability bars tell the whole story.
All three agents persist in localStorage. Refresh the page and they keep their learned weights. Clear your browser data to start fresh.
10 // REFERENCES
- Human-level control through deep reinforcement learning. Nature, 2015.
- Asynchronous Methods for Deep Reinforcement Learning (A3C). ICML, 2016.
- Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR, 2016.
- Reinforcement Learning: An Introduction (2nd ed.). MIT Press, 2018.
- Training language models to follow instructions with human feedback (InstructGPT). NeurIPS, 2022.
- When GANs Meet RL: The Adversarial Game Behind Generative AI. Blog post, 2026.