01 // THE SETUP

I wanted to build something visitors could feel. Not a static chart showing a loss curve. Not a pre-trained demo where the AI is already good. I wanted people to watch reinforcement learning happen -- to see an agent start dumb, flail around, and gradually figure out how to hit a ball.

The result is the RL Pong Lab: a playable Pong game where you face three different RL algorithms that train in real time, right in your browser. No server. No Python. No GPU. Just pure JavaScript, a tiny neural network with 579 parameters, and some hard-won lessons about what actually makes RL work on a budget.

This post is the build log. Not the cleaned-up textbook version -- the real sequence of things that didn't work, why they didn't work, and what we changed. If you've read about Q-learning and DQN in theory but never watched them fail in practice, this is for you.

"In theory, theory and practice are the same. In practice, they are not."

02 // THE THREE ALGORITHMS

Before the war stories, here's what we're working with. We implemented three fundamentally different approaches to the same problem [5]: move a paddle up, down, or stay still to return a bouncing ball. Same game, same state, same actions -- different learning philosophies.

Q-Learning: The Lookup Table

The simplest possible approach. Discretize the game state into bins and build a giant table mapping every possible state to the value of each action. The update rule is the standard Q-learning step, derived from the Bellman optimality equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]$$

We discretized the state into 1,728 bins: 8 ball-x positions × 6 ball-y × 2 horizontal velocity signs × 3 vertical velocity buckets × 6 paddle-y positions. With 3 actions, that's 5,184 Q-values. The entire "model" fits in a JavaScript object literal.

Exploration: ε-greedy. Start with ε = 1.0 (pure random), decay by a factor each episode, floor at 0.05 so it never stops exploring entirely.
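As a rough sketch of how the table, the update rule, and ε-greedy selection could fit together: the bin counts below come from the text, but the helper names (bin, stateIndex, chooseAction, qUpdate), the vertical-velocity threshold, and the α and γ defaults are illustrative assumptions, not taken from pong-rl.js.

```javascript
// Hypothetical sketch: bin counts follow the text; names and constants are assumed.
const BINS = { bx: 8, by: 6, vx: 2, vy: 3, py: 6 };
const N_STATES = BINS.bx * BINS.by * BINS.vx * BINS.vy * BINS.py; // 1,728
const N_ACTIONS = 3; // up, stay, down
const Q = new Float64Array(N_STATES * N_ACTIONS); // 5,184 Q-values, all zero

// Map a value in [0, 1] to an integer bin in [0, n - 1].
function bin(x, n) {
    return Math.min(n - 1, Math.max(0, Math.floor(x * n)));
}

// Flatten the five binned features into a single table row index.
function stateIndex(ballX, ballY, vx, vy, paddleY) {
    const vxBin = vx > 0 ? 1 : 0;
    const vyBin = vy < -0.1 ? 0 : vy > 0.1 ? 2 : 1; // 0.1 threshold is assumed
    let i = bin(ballX, BINS.bx);
    i = i * BINS.by + bin(ballY, BINS.by);
    i = i * BINS.vx + vxBin;
    i = i * BINS.vy + vyBin;
    i = i * BINS.py + bin(paddleY, BINS.py);
    return i;
}

// Index of the highest-valued action in state s.
function greedyAction(s) {
    let best = 0;
    for (let a = 1; a < N_ACTIONS; a++) {
        if (Q[s * N_ACTIONS + a] > Q[s * N_ACTIONS + best]) best = a;
    }
    return best;
}

// ε-greedy: random action with probability epsilon, greedy otherwise.
function chooseAction(s, epsilon, rand = Math.random) {
    if (rand() < epsilon) return Math.floor(rand() * N_ACTIONS);
    return greedyAction(s);
}

// One Q-learning step: Q(s,a) += α [r + γ max_a' Q(s',a') - Q(s,a)].
function qUpdate(s, a, r, sNext, done, alpha = 0.1, gamma = 0.95) {
    let maxNext = 0; // terminal states have no future value
    if (!done) maxNext = Q[sNext * N_ACTIONS + greedyAction(sNext)];
    const idx = s * N_ACTIONS + a;
    Q[idx] += alpha * (r + gamma * maxNext - Q[idx]);
}
```

A flat Float64Array is used here only for compactness; a plain object keyed by state index would behave the same way.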

DQN: The Neural Network

Instead of discretizing, feed the continuous state directly into a neural network that outputs Q-values for each action. The architecture is deliberately tiny:

Input (8) → Dense(48, ReLU) → Dense(3, linear)
Total: 8×48 + 48 + 48×3 + 3 = 579 parameters

DQN adds two critical stabilization tricks from the DeepMind paper [2]:

  • Experience replay: Store transitions $(s, a, r, s', \text{done})$ in a buffer of 10,000. Sample random mini-batches of 32 for training. This breaks temporal correlation and reuses data efficiently.
  • Target network: A frozen copy of the Q-network, synced every 200 steps. TD targets use this frozen copy so the network isn't chasing a moving target.
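The two tricks above could be sketched roughly as follows. The class and function names (ReplayBuffer, maybeSyncTarget) are hypothetical, and the sketch assumes each network's weights live in one flat Float64Array; only the sizes (10,000 / 32 / 200) come from the text.

```javascript
// Hypothetical ring-buffer replay memory; names are assumed, sizes follow the text.
class ReplayBuffer {
    constructor(capacity = 10000) {
        this.capacity = capacity;
        this.items = [];
        this.next = 0; // slot to overwrite once the buffer is full
    }

    // Store one transition, overwriting the oldest when at capacity.
    push(transition) {
        if (this.items.length < this.capacity) {
            this.items.push(transition);
        } else {
            this.items[this.next] = transition;
        }
        this.next = (this.next + 1) % this.capacity;
    }

    // Uniform sampling with replacement.
    sample(batchSize = 32, rand = Math.random) {
        const batch = [];
        for (let i = 0; i < batchSize; i++) {
            batch.push(this.items[Math.floor(rand() * this.items.length)]);
        }
        return batch;
    }

    get size() {
        return this.items.length;
    }
}

// Copy online weights into the frozen target copy every `every` steps.
function maybeSyncTarget(step, onlineWeights, targetWeights, every = 200) {
    if (step % every === 0) targetWeights.set(onlineWeights);
}
```

Sampling with replacement keeps the code short; with 32 draws from thousands of transitions, duplicates are rare enough not to matter much.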

The network is implemented from scratch -- no TensorFlow, no WebGL. Just Float64Array matrix multiplies, manual backprop, and Xavier initialization via Box-Muller sampling.

REINFORCE: The Policy Gradient

Q-Learning [1] and DQN learn action values (how good is each action in this state?). REINFORCE [3] takes a fundamentally different approach: it directly learns a policy -- a probability distribution over actions.

Same 8→48→3 network, but the output goes through a softmax to produce action probabilities. The agent samples from this distribution, plays an entire episode, then updates the network to make high-reward actions more likely:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t \right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted return from time $t$. We normalize returns by subtracting the mean and dividing by std (baseline normalization) to reduce variance.
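The return computation and normalization can be illustrated with a short sketch. The function names and the γ default are assumptions for illustration; the small eps term guards against division by zero when every return in an episode is identical.

```javascript
// Discounted return G_t for every step, computed backwards in one pass.
function discountedReturns(rewards, gamma = 0.99) {
    const G = new Float64Array(rewards.length);
    let running = 0;
    for (let t = rewards.length - 1; t >= 0; t--) {
        running = rewards[t] + gamma * running;
        G[t] = running;
    }
    return G;
}

// Subtract the mean and divide by the standard deviation.
function normalize(values, eps = 1e-8) {
    const n = values.length;
    let mean = 0;
    for (const v of values) mean += v;
    mean /= n;
    let variance = 0;
    for (const v of values) variance += (v - mean) ** 2;
    const std = Math.sqrt(variance / n);
    return Float64Array.from(values, v => (v - mean) / (std + eps));
}
```

After normalization, steps with above-average returns get a positive weight and below-average steps a negative one, which is what lets the update distinguish good actions from bad ones within a single episode.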

03 // THE STATE REPRESENTATION

This is the single most important design decision, and it's where the browser constraint actually helps us make the right choice.

DeepMind's famous DQN paper [2] trained on raw pixels -- 84×84 grayscale frames stacked 4 deep. That's 28,224 input dimensions requiring a convolutional network, millions of frames of experience, and hours of GPU time. Karpathy's "Pong from Pixels" [4] blog post needed 3 days on a CPU.

We don't have 3 days. We have maybe 10 seconds of the visitor's patience. So we use engineered features -- 8 numbers that capture everything the agent needs to know:

const state = [
    ball.x / W,              // Ball X position (0 to 1)
    ball.y / H,              // Ball Y position (0 to 1)
    ball.vx / BALL_SPEED_MAX, // Ball horizontal velocity (-1 to 1)
    ball.vy / BALL_SPEED_MAX, // Ball vertical velocity (-1 to 1)
    aiY / H,                  // AI paddle Y position (0 to 1)
    playerY / H,              // Opponent paddle Y position (0 to 1)
    (ball.y - aiY) / H,       // Relative Y: "go up or down?"
    ball.vx > 0 ? ball.x/W : 0 // Ball distance when approaching
];

All values normalized to roughly [-1, 1]. The last two features are derived: the relative Y distance tells the agent directly whether the ball is above or below the paddle (no need to learn subtraction from raw positions), and the approach distance is nonzero only when the ball is heading toward the AI, giving a natural urgency signal. These two features alone cut convergence time nearly in half for the neural network agents.

The tradeoff: Feature-based state makes the problem trivially solvable (hundreds of episodes instead of millions) but requires that we have access to the game's internal state. Since we built the engine, we do. In a real-world setting where you only have pixels, you'd need the full DeepMind treatment.

04 // THE REWARD THAT BROKE EVERYTHING

Here's where the textbook diverges from reality. The standard Pong reward is elegant:

  • +1 when the AI scores
  • -1 when the opponent scores
  • 0 otherwise

This is mathematically clean. It's also practically useless for fast training.

The problem is sparsity. A Pong episode runs for hundreds of frames. The ball bounces back and forth multiple times before anyone scores. During all those frames, the reward is exactly zero. The agent is flying blind. A random agent might play 50 episodes before stumbling into enough +1 signals to learn anything. In a browser with a 10-second patience budget, that's a death sentence.

Version 1 of our reward added distance shaping:

// Every frame: penalize being far from ball
reward = -0.01 * Math.abs(aiY - ball.y) / H;

This helped the agent learn to track the ball. But it created a new problem: the agent would park itself near the ball's Y coordinate and never actually hit it. The tracking reward was dense and immediate; the scoring reward was sparse and delayed. The agent optimized for what it could measure easily.

Version 2 (the breakthrough) added a ball-hit reward:

if (aiHitBall) {
    reward = 0.3;             // Successfully returned the ball
} else if (aiScored) {
    reward = 1.0;             // Scored a point
} else if (opponentScored) {
    reward = -1.0;            // Lost a point
} else {
    reward = -0.01 * dist;   // Small tracking incentive (dist = |aiY - ball.y| / H, as in Version 1)
    // Directional bonus: reward moving toward the ball
    if (ballApproaching && movingCorrectly) reward += 0.005;
}

The +0.3 for hitting the ball is the single highest-impact change we made. It transforms the reward landscape from "long stretches of near-zero with rare spikes" to "frequent positive feedback for doing the right thing." The directional bonus (+0.005 for moving toward the ball when it's approaching) is small but gives the agent an immediate signal about which direction is correct, even before it reaches the ball. The agent learns the chain: move toward ball (+0.005) → intercept → hit ball (+0.3) → maybe score (+1.0).

The calibration matters. If the hit reward is too large (say, 1.0), the agent learns to hit the ball but doesn't care about scoring -- it's already getting rewarded. Too small (0.05) and it gets drowned out by noise. 0.3 hit a sweet spot where the agent learns to rally first, then gradually optimizes for scoring as ε decays.

05 // THE TRAINING LOOP EVOLUTION

The training mechanism went through three iterations, each solving a real problem.

Iteration 1: Train-While-You-Play (Bad)

The original design: the AI only trains when a human is playing. Every frame, the agent observes the state, takes an action, gets a reward, and updates. Elegant in theory. In practice:

  • Training speed is capped at 60 FPS (the frame rate). That's 60 training steps per second.
  • A visitor needs to play for 5-10 minutes before the AI shows improvement.
  • Nobody plays a browser game for 10 minutes with a stupid opponent.

Iteration 2: Auto-Train With a Fixed Bot (Better, But Broken)

We added an "AUTO-TRAIN" button that replaces the human with a bot and runs the game at high speed (200 physics steps per animation frame -- 200x acceleration). Now training takes seconds instead of minutes.

But the bot was too good. We set it to strength 0.7 (on a 0-1 scale). The bot returned almost everything. The AI agent could barely score, so it only received negative reward signals. It learned to be less bad -- to lose slowly -- but never learned to win.

The deeper issue: the ε decay was designed for interactive play. Q-Learning's epsilonDecay = 0.9995 means that after 100 episodes, $\varepsilon = 0.9995^{100} \approx 0.95$. Still almost completely random after 100 rounds of training. The agent literally never gets to exploit what it's learned.

Iteration 3: Adaptive Curriculum (Current)

Two key fixes:

1. Adaptive bot difficulty. Instead of a fixed-strength opponent, the bot scales to match the AI's current ability. We track a rolling win rate over the last 20 episodes:

function adaptiveBotStrength() {
    var wr = rollingWinRate();
    if (wr < 0.15) return 0.10;  // AI losing badly → very weak bot
    if (wr < 0.30) return 0.25;
    if (wr < 0.45) return 0.40;
    if (wr < 0.60) return 0.55;
    if (wr < 0.75) return 0.70;
    return 0.85;                  // AI winning → strong bot
}

This is curriculum learning. When the AI is weak, the bot is weak too -- the AI can actually score, gets +1 rewards, and learns that scoring is good. As the AI improves, the bot gets tougher, forcing the AI to develop real strategies instead of exploiting a weak opponent.
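The snippet above calls rollingWinRate() without showing it. One way such a helper could look, assuming a plain array of per-episode outcomes (recordEpisode and recentResults are hypothetical names, and returning 0 before any data exists is a design choice made here so that training starts against the weakest bot):

```javascript
// Hypothetical helper: sliding window over the last 20 episode outcomes.
const WINDOW = 20;
const recentResults = []; // true = the AI won that episode

// Call once at the end of every episode.
function recordEpisode(aiWon) {
    recentResults.push(aiWon);
    if (recentResults.length > WINDOW) recentResults.shift();
}

// Fraction of wins in the window; 0 when no episodes have been recorded.
function rollingWinRate() {
    if (recentResults.length === 0) return 0;
    let wins = 0;
    for (const won of recentResults) {
        if (won) wins++;
    }
    return wins / recentResults.length;
}
```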

2. Aggressive warmup hyperparameters. During the initial warmup (which runs automatically on first page load), we temporarily override the ε decay:

// Normal play: ε decays 0.9995 per episode (very slow, for stability)
// Warmup:      ε decays 0.97 per episode (aggressive, for speed)
// After 100 warmup episodes: ε = 0.97^100 ≈ 0.048 (exploiting)

After warmup, we restore normal decay rates. The agent starts with a strong learned policy and continues to refine it during human play.
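The gap between the two schedules is easy to check numerically. The helpers below are an illustrative sketch (names assumed), using the starting ε of 1.0 and the 0.05 floor described for Q-Learning:

```javascript
// ε after a number of episodes of multiplicative decay, clamped at the floor.
function epsilonAfter(episodes, decay, start = 1.0, floor = 0.05) {
    return Math.max(floor, start * Math.pow(decay, episodes));
}

// Smallest number of episodes after which ε is at or below `target`.
function episodesToReach(target, decay, start = 1.0) {
    return Math.ceil(Math.log(target / start) / Math.log(decay));
}

console.log(epsilonAfter(100, 0.9995).toFixed(3));  // slow schedule after 100 episodes
console.log(epsilonAfter(100, 0.97).toFixed(3));    // warmup schedule after 100 episodes
console.log(episodesToReach(0.05, 0.9995));         // episodes needed at the slow rate
console.log(episodesToReach(0.05, 0.97));           // episodes needed at the warmup rate
```

Under these assumptions the slow schedule needs roughly 6,000 episodes to bring ε down to 0.05, while the warmup schedule gets there in about 99.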

06 // ALGORITHM-SPECIFIC LESSONS

Q-Learning: Fast But Brittle

The good: Q-Learning converges fastest because table lookups have zero generalization error. State 437 is state 437, and its Q-values are exactly right for that state. No function approximation noise. With 1,728 states and aggressive ε decay, it produces a competent player in ~100 episodes.

The bad: Discretization throws away information. A ball at y=0.31 and y=0.34 might land in different bins and the agent treats them as completely unrelated. The agent also can't generalize to states it hasn't visited -- if it's never seen a particular ball-x/paddle-y combination, it has literally zero knowledge.

The surprising: Despite these limitations, Q-Learning is the best "first 50 episodes" performer. The table fills in fast, there's no gradient instability, and the agent goes from random to competent in a smooth curve.

DQN: Slow Start, Smooth Ride

Three changes made DQN competitive:

  • Training every 4 steps instead of once per episode. Our first implementation trained a single batch of 32 at episode end only. Switching to a batch every 4 steps meant 200x more gradient updates per episode.
  • Per-step epsilon decay instead of per-episode. With ε decaying by 0.99997 each step (~0.41 after 30K steps, reaching 0.05 after roughly 100K steps), exploration transitions smoothly instead of jerking down once per episode.
  • Double DQN. Instead of using the target network for both action selection and evaluation, the online network selects the best action and the target network evaluates its Q-value. This reduces Q-value overestimation — a known instability in vanilla DQN where the max operator causes systematically inflated value estimates.
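The difference between the vanilla and Double DQN targets fits in a few lines. This is a sketch with assumed function names and an assumed γ default; qOnlineNext and qTargetNext stand for the Q-value arrays that the online and target networks produce for the next state s'.

```javascript
// Index of the largest entry.
function argmax(values) {
    let best = 0;
    for (let i = 1; i < values.length; i++) {
        if (values[i] > values[best]) best = i;
    }
    return best;
}

// Vanilla DQN: the target network both picks and evaluates the next action.
function vanillaDqnTarget(reward, done, qTargetNext, gamma = 0.99) {
    if (done) return reward;
    return reward + gamma * qTargetNext[argmax(qTargetNext)];
}

// Double DQN: the online network picks the action, the target network scores it.
function doubleDqnTarget(reward, done, qOnlineNext, qTargetNext, gamma = 0.99) {
    if (done) return reward;
    const aStar = argmax(qOnlineNext);
    return reward + gamma * qTargetNext[aStar];
}
```

When the two networks disagree about the best action, the Double DQN target can only be equal to or lower than the vanilla one, which is where the reduced overestimation comes from.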

The advantage over Q-Learning: generalization. DQN handles the continuous state space naturally. A ball at y=0.51 and y=0.52 produce similar network activations and get similar Q-values. The agent can interpolate between states it's seen.

REINFORCE: The Noisiest Learner

Policy gradient methods have high variance by nature. REINFORCE collects an entire trajectory, then updates the network once. If the episode was good, all actions get reinforced. If it was bad, all actions get penalized. There's no per-step credit assignment like Q-Learning or DQN.

The gradient sign bug. Our first REINFORCE implementation didn't learn at all. After hours of debugging hyperparameters and reward shaping, the real culprit turned out to be a sign error in the policy gradient backprop. The gradient of $\log \pi(a|s)$ w.r.t. logit $k$ is $(\mathbf{1}_{k=a} - \pi_k)$. We want gradient ascent (maximize reward), but our weight update uses w -= lr * dOut (descent convention). So we need $\text{dOut}_k = -\text{advantage} \cdot (\mathbf{1}_{k=a} - \pi_k)$. The original code computed this correctly, then negated it again "for gradient ascent", which turned it back into gradient descent on reward. The agent was being trained to make good actions less likely. Removing one line fixed everything.
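A small sketch makes the sign convention concrete. The function names are assumptions, and the code covers only the output-layer gradient, not the full backprop:

```javascript
// Numerically stable softmax over an array of logits.
function softmax(logits) {
    const max = Math.max(...logits);
    const exps = Array.from(logits, z => Math.exp(z - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
}

// Output-layer gradient for an optimizer that does `w -= lr * grad`.
// The leading minus converts "ascend on reward" into "descend on -reward".
function policyGradOut(logits, action, advantage) {
    const pi = softmax(logits);
    return pi.map((p, k) => -advantage * ((k === action ? 1 : 0) - p));
}

// Sanity check: a positive advantage should make the taken action more likely.
const logits = [0, 0, 0];
const dOut = policyGradOut(logits, 1, 2.0);
const updated = logits.map((z, k) => z - 0.1 * dOut[k]);
console.log(softmax(updated)[1] > softmax(logits)[1]); // true
```

A check like the last three lines, run once against the real backprop code, is the kind of test that would have caught the double negation immediately.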

Baseline normalization is non-optional. Without subtracting the mean return and dividing by std, the gradient signal is dominated by the absolute magnitude of returns rather than which actions were better than average. With it, REINFORCE converges — slowly, noisily, but it converges.

The exploration problem. Even with the sign fix, REINFORCE would often start with the paddle stuck at one position and never move. The issue: unlike Q-Learning and DQN, which have ε-greedy exploration, vanilla REINFORCE samples from its own policy. If the initial random weights slightly favor STAY, the paddle parks itself, never hits the ball, never gets the +0.3 reward, and the tiny distance penalty (-0.01) gets washed out by baseline normalization. Two fixes were needed:

  • ε-greedy exploration (ε=0.3, decaying): 30% of the time, pick a random action regardless of the policy. This guarantees the paddle moves, discovers ball hits, and generates diverse trajectories for learning.
  • Entropy regularization (β=0.02): add $\beta \cdot H(\pi)$ to the objective, where $H(\pi) = -\sum_k \pi_k \log \pi_k$. This gradient pushes action probabilities away from 0 or 1, preventing the policy from collapsing to a single action. The agent stays stochastic enough to explore even as ε decays.
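For the entropy term, the gradient with respect to the logits has a closed form: $\partial H / \partial z_j = -\pi_j (\log \pi_j + H)$. The sketch below (function names assumed) computes it from the action probabilities; with a descent-style update, the contribution added to the output gradient would be $-\beta$ times this vector.

```javascript
// Shannon entropy H(π) = -Σ π_k log π_k of a probability vector.
function entropy(pi) {
    let h = 0;
    for (const p of pi) {
        if (p > 0) h -= p * Math.log(p);
    }
    return h;
}

// Gradient of H with respect to the softmax logits: -π_j (log π_j + H).
function entropyGradLogits(pi) {
    const h = entropy(pi);
    return pi.map(p => (p > 0 ? -p * (Math.log(p) + h) : 0));
}

// A policy that strongly prefers action 0: ascending this gradient
// lowers logit 0 and raises the others, pulling the policy back toward uniform.
console.log(entropyGradLogits([0.8, 0.1, 0.1]));
```

For a uniform policy the gradient is zero, so the term only pushes back once the policy starts collapsing.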

Learning rate matters more here. DQN trains every 4 steps — thousands of gradient updates per episode. REINFORCE trains once at episode end, on a subsample of up to 300 trajectory steps. With lr=0.001 (same as DQN), the weights barely moved. Bumping to lr=0.005 gave the single end-of-episode update enough magnitude to actually shift the policy.

The payoff — and the honest caveat. REINFORCE is the weakest performer on this problem, and that's not a bug. For Pong with 3 discrete actions and an 8D state, Q-Learning's lookup table and DQN's per-step training are simply better fits. Q-Learning updates every frame — instant feedback. DQN trains every 4 steps — thousands of gradient updates per episode. REINFORCE trains once at the end of each episode on a single trajectory. The sample efficiency gap is fundamental.

So why include it? Because REINFORCE is the foundation of modern RL. PPO [6], A3C, SAC — every policy optimization algorithm that powers robotics, game-playing agents, and RLHF for LLMs descends from this idea. It shines in problems with continuous action spaces (where Q-tables can't enumerate actions) or high-dimensional outputs (where DQN would need an impractically large output layer). For simple Pong, it's a sledgehammer for a nail — but seeing it struggle here is itself instructive. Algorithms have niches. Matching the algorithm to the problem structure matters more than any amount of hyperparameter tuning.

The Scoreboard: Algorithm vs. Problem Fit

After hundreds of training episodes against the same adaptive bot, the performance ranking is consistent and reveals something important about algorithm selection:

  • Q-Learning -- fastest learner. ~50-100 episodes to competent. Zero approximation error, instant per-frame updates. The right tool for a small discrete state space.
  • DQN -- close second. ~100-150 episodes. Generalizes across continuous states, handles unseen positions well. Slight overhead from neural net training but the replay buffer compensates.
  • REINFORCE -- slowest, noisiest, ~200+ episodes. One gradient update per episode vs. thousands for DQN. High variance means inconsistent improvement. For this problem, it's fundamentally less sample-efficient.

The lesson: the best RL algorithm depends on the problem structure, not on how modern it sounds. For 3 actions and 1,728 states, a lookup table beats a neural network.

07 // THE NEURAL NETWORK IN 150 LINES

Building a neural network from scratch in JavaScript is an exercise in appreciating what frameworks do for you. Our MiniNet class handles:

  • Xavier-style initialization (scale = $\sqrt{2/n_\text{in}}$, strictly the He variant, which suits ReLU layers) via Box-Muller random normals
  • Forward pass: matrix multiply → ReLU → matrix multiply → output
  • MSE backprop for DQN: compute $\partial L / \partial w$ and apply gradient descent
  • Policy gradient backprop for REINFORCE: gradient of $\log \pi(a|s)$ weighted by advantage
  • Softmax for action probability distribution
  • Serialization to/from JSON for localStorage persistence

Everything uses Float64Array for precision. The entire implementation is ~150 lines. No dependencies, no build step, no bundler. You can read every line in pong-rl.js.
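To give a feel for what those lines contain, here is a rough sketch of just the initialization and forward pass. It is not the MiniNet source: the function names and the row-major weight layout are assumptions, and backprop is left out.

```javascript
// Standard normal sample via the Box-Muller transform.
function randn(rand = Math.random) {
    const u1 = 1 - rand(); // in (0, 1], avoids log(0)
    const u2 = rand();
    return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// One dense layer: weights scaled by sqrt(2 / nIn), biases at zero.
function initLayer(nIn, nOut, rand = Math.random) {
    const scale = Math.sqrt(2 / nIn);
    const W = new Float64Array(nIn * nOut); // row-major: W[i * nOut + j]
    for (let k = 0; k < W.length; k++) W[k] = randn(rand) * scale;
    return { W, b: new Float64Array(nOut), nIn, nOut };
}

// y = x·W + b, optionally followed by ReLU.
function dense(layer, x, relu) {
    const out = new Float64Array(layer.nOut);
    for (let j = 0; j < layer.nOut; j++) {
        let sum = layer.b[j];
        for (let i = 0; i < layer.nIn; i++) {
            sum += x[i] * layer.W[i * layer.nOut + j];
        }
        out[j] = relu ? Math.max(0, sum) : sum;
    }
    return out;
}

// Hidden layer with ReLU, then a linear output layer.
function forward(net, x) {
    return dense(net.l2, dense(net.l1, x, true), false);
}

// The 8 → 48 → 3 shape used throughout the post.
const net = { l1: initLayer(8, 48), l2: initLayer(48, 3) };
```

For DQN the three outputs are read directly as Q-values; for REINFORCE they would be passed through a softmax first.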

Performance: A forward pass through the 8→48→3 network takes roughly 0.01ms on modern hardware. That's 100,000 forward passes per second. During auto-training at 200 steps per frame, the bottleneck is DQN's batch training (32 forward+backward passes every 4 steps), not the game physics.

08 // WHAT WOULD WE DO DIFFERENTLY

If we rebuilt this from scratch:

  • Self-play instead of bot opponent. Two RL agents playing each other produce a natural curriculum -- neither is ever too strong or too weak for the other. The challenge is bootstrapping: two random agents rarely score, so the reward signal is sparse early on. You'd need the ball-hit reward to break the deadlock.
  • PPO instead of REINFORCE. Proximal Policy Optimization [6] clips the policy update to prevent catastrophic changes, which would reduce the noise we see in REINFORCE learning curves. But PPO is significantly more complex to implement from scratch.
  • Web Worker for training. Currently, training runs on the main thread. During auto-train, the 200 physics steps per frame can occasionally cause dropped frames. Moving the game simulation + training to a Web Worker would keep the UI perfectly smooth.
  • Finer discretization for Q-Learning. 1,728 states is fairly coarse. With 12 ball-x × 10 ball-y × 2 vx × 3 vy × 10 paddle-y = 7,200 states, Q-Learning would have finer resolution and probably outperform both neural network approaches for this simple game.

09 // TRY IT YOURSELF

The RL Pong Lab is live. Switch between algorithms with the tabs, hit AUTO-TRAIN to watch the AI practice at 200x speed, then take over and play against it. Your agent's learned weights persist in localStorage -- come back tomorrow and it'll remember everything.
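Persistence of this kind takes very little code. The sketch below is an assumption about how it could be done, not the code in pong-ui.js; the function names, the version field, and the storage key are all hypothetical. One detail worth knowing: JSON.stringify turns a Float64Array into an object with index keys rather than an array, so the weights are converted to a plain array first.

```javascript
// Pack a flat weight vector into a JSON string.
function serializeWeights(weights) {
    return JSON.stringify({ version: 1, weights: Array.from(weights) });
}

// Rebuild the Float64Array from a string produced by serializeWeights.
function deserializeWeights(json) {
    const parsed = JSON.parse(json);
    return Float64Array.from(parsed.weights);
}

// In the browser (the key "pong-dqn" is made up for this example):
//   localStorage.setItem("pong-dqn", serializeWeights(weights));
//   const saved = localStorage.getItem("pong-dqn");
//   if (saved !== null) weights = deserializeWeights(saved);
```

JSON cannot represent NaN or Infinity (both become null), so a diverged network would not survive the round trip intact; checking for finite values before saving is a cheap safeguard.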

The entire implementation is four files totaling ~600 lines of JavaScript:

  • pong-engine.js -- game physics, rendering, collision detection
  • pong-rl.js -- MiniNet neural network, Q-Learning, DQN, REINFORCE
  • pong-ui.js -- glue code, HUD, auto-training, localStorage
  • pong.html + pong.css -- page shell and styling

No build step. No dependencies. View source and you'll see everything.

10 // REFERENCES

  1. Watkins, C. J. C. H. & Dayan, P. Q-Learning. Machine Learning, 8(3-4), 279-292, 1992.
  2. Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533, 2015.
  3. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256, 1992.
  4. Karpathy, A. Deep Reinforcement Learning: Pong from Pixels. Blog post, 2016.
  5. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. 2nd ed. MIT Press, 2018.
  6. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.