01 // INTRODUCTION

Generative Adversarial Networks (GANs) and Reinforcement Learning (RL) appear, at first glance, to inhabit different corners of machine learning. GANs generate images by playing a game between a generator and a discriminator. RL trains agents to maximize cumulative reward in an environment. Different problem statements, different communities, different conferences.

But look closer and a striking structural parallel emerges. Both frameworks learn a generative process (generator / policy) by optimizing against an evaluative process (discriminator / reward). Both can be formalized as two-player games. And both suffer from eerily similar failure modes -- mode collapse in GANs mirrors reward hacking in RL.

This connection isn't just an analogy. It's a mathematical equivalence that has been formalized by several independent lines of work [2][3][4], and it has become practically relevant now that the dominant paradigm for post-training large language models -- RLHF (Reinforcement Learning from Human Feedback) [7] -- sits squarely at the intersection.

In this post, we'll build the mathematical framework to see GANs and RL as instances of the same abstract game, trace the formal connections through game theory and f-divergence minimization, and examine how RLHF for LLMs completes the picture.

"The generator is the actor, the discriminator is the critic. The rest is notation."

02 // THE GAN GAME

Let's start with the GAN formulation [1]. A GAN consists of two neural networks locked in a minimax game:

Generator $G_\theta: \mathcal{Z} \to \mathcal{X}$ maps random noise $\mathbf{z} \sim p_z(\mathbf{z})$ to data-like samples $G_\theta(\mathbf{z})$.

Discriminator $D_\phi: \mathcal{X} \to [0, 1]$ outputs the probability that an input came from the real data distribution $p_{\text{data}}$ rather than from $G_\theta$.

The Value Function

The game is defined by the value function:

$$\min_\theta \max_\phi \; V(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D_\phi(G_\theta(\mathbf{z})))]$$

The discriminator wants to maximize $V$ -- correctly classifying real samples as real ($D_\phi(\mathbf{x}) \to 1$) and generated samples as fake ($D_\phi(G_\theta(\mathbf{z})) \to 0$). The generator wants to minimize $V$ -- producing samples that the discriminator classifies as real.

Optimal Discriminator

For a fixed generator $G_\theta$ inducing distribution $p_g$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$

Substituting $D^*$ back into $V$ yields the Jensen-Shannon divergence (up to constants):

$$V(G, D^*) = 2 \, D_{\text{JS}}(p_{\text{data}} \| p_g) - 2\log 2$$

So training the generator against the optimal discriminator is equivalent to minimizing the JS-divergence between the generated and real distributions. At the Nash equilibrium, $p_g = p_{\text{data}}$ and $D^*(\mathbf{x}) = \frac{1}{2}$ everywhere -- the discriminator can no longer tell real from fake.
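To make the identity concrete, here is a minimal numerical check on a three-outcome distribution. The probabilities and all function names are illustrative assumptions, not part of the original GAN paper; the point is only that plugging $D^*$ into $V$ reproduces $2\,D_{\text{JS}} - 2\log 2$.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def gan_value(p_data, p_g, d):
    """V(G, D) = E_{p_data}[log D(x)] + E_{p_g}[log(1 - D(x))]."""
    real_term = sum(pd * math.log(dx) for pd, dx in zip(p_data, d))
    fake_term = sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d))
    return real_term + fake_term

def kl(p, q):
    """KL(p || q) on a finite support, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative distributions over three outcomes (assumed values).
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

d_star = optimal_discriminator(p_data, p_g)
value_at_optimum = gan_value(p_data, p_g, d_star)
identity_rhs = 2 * js_divergence(p_data, p_g) - 2 * math.log(2)

# Any other discriminator should score lower than D*.
d_perturbed = [min(0.99, dx + 0.05) for dx in d_star]
value_perturbed = gan_value(p_data, p_g, d_perturbed)
```

Here `value_at_optimum` and `identity_rhs` agree to floating-point precision, and `value_perturbed` is strictly smaller, which is what "optimal discriminator" means for a fixed generator.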

THE MINIMAX THEOREM

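Von Neumann's minimax theorem guarantees that for convex-concave games over compact convex strategy sets, $\max_\phi \min_\theta V = \min_\theta \max_\phi V$. In that setting the order of optimization doesn't matter and the equilibrium value is well-defined. Goodfellow et al. [1] proved convergence for the non-parametric case (infinite-capacity $G$ and $D$), where the game is played directly over distributions. With neural-network parameters the objective is generally not convex-concave in $(\theta, \phi)$, so this guarantee does not carry over automatically to practical training.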

03 // THE RL GAME

Now let's set up the RL framework. A standard RL problem is defined by a Markov Decision Process (MDP): $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ with states $\mathcal{S}$, actions $\mathcal{A}$, transition dynamics $P(s'|s,a)$, reward function $r(s,a)$, and discount factor $\gamma$.

A policy $\pi_\theta(a|s)$ maps states to action distributions. The objective is to maximize expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory sampled under $\pi_\theta$.

Policy Gradient

The REINFORCE estimator [12] gives us the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\right]$$

where $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{t'}, a_{t'})$ is the return from time $t$. In the actor-critic variant, we replace $R_t$ with a learned value function $Q_\phi(s_t, a_t)$ or advantage $A_\phi(s_t, a_t)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot A_\phi(s, a)\right]$$

Already we can see the structure: an actor (the policy $\pi_\theta$) and a critic (the value function $Q_\phi$), trained in tandem. The actor generates behavior; the critic evaluates it.
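The return $R_t$ that weights each log-probability gradient is usually computed with a single backward pass over the rewards. A minimal sketch (the function name is our own, and the rewards are made-up numbers):

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_T] with R_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the return of the step after it:
    # R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three steps of reward 1 with gamma = 0.5.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

In an actor-critic method these Monte Carlo returns are what the learned critic replaces: the critic trades the variance of sampled returns for the bias of a learned estimate.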

04 // THE EQUIVALENCE

Pfau and Vinyals [2] made the connection explicit: a GAN is a single-step actor-critic RL problem. Consider a degenerate MDP with one state, where the "action" is the generated sample $G_\theta(\mathbf{z})$ and the "reward" is derived from the discriminator's evaluation.

The Mapping

The component-by-component correspondence is:

| GAN | RL (Actor-Critic) |
| --- | --- |
| Generator $G_\theta$ | Actor / Policy $\pi_\theta$ |
| Discriminator $D_\phi$ | Critic $Q_\phi$ / Reward $r_\phi$ |
| Noise $\mathbf{z}$ | Initial state / randomness |
| Generated sample $G_\theta(\mathbf{z})$ | Action $a$ |
| $\log D_\phi(G_\theta(\mathbf{z}))$ | Reward signal $r(a)$ |
| Discriminator loss | Bellman residual / TD error |
| Generator loss | Policy gradient objective |

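The generator's gradient, $\nabla_\theta \mathbb{E}_z[\log D_\phi(G_\theta(\mathbf{z}))]$, has the form of a policy gradient in which the discriminator's log-output serves as the reward. One difference in how it is estimated: a GAN generator typically backpropagates through $D_\phi$ into $G_\theta$ (a pathwise, or reparameterized, gradient), whereas REINFORCE uses the score function $\nabla_\theta \log \pi_\theta$. Both are estimators of the gradient of the same expected-reward objective. The discriminator, in turn, is trained to be a better critic, estimating how "real" each generated sample is, analogous to a value function learning to evaluate states.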
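As a sketch of the single-step view, consider a toy "generator" that picks one of three candidate samples from a softmax over logits, with a fixed discriminator scoring each candidate. Everything here (the three-sample setup, the scores, the function names) is an assumption for illustration. Because the samples are discrete, we use the score-function form of the gradient and check it against finite differences.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(logits, rewards):
    """J = E_{a ~ pi}[r(a)] for a softmax policy over a finite action set."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient(logits, rewards):
    """Exact score-function gradient: sum_a pi(a) * grad log pi(a) * r(a).
    For a softmax policy, d log pi(a) / d logit_i = 1[a == i] - pi(i)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for i in range(len(logits)):
            score = (1.0 if a == i else 0.0) - probs[i]
            grad[i] += p_a * score * r_a
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical discriminator outputs D(a) for three candidate samples.
discriminator_scores = [0.9, 0.5, 0.1]
rewards = [math.log(d) for d in discriminator_scores]  # reward = log D(a)
logits = [0.0, 0.0, 0.0]

pg = policy_gradient(logits, rewards)
fd = finite_difference_gradient(logits, rewards)
```

The gradient pushes probability toward the sample the discriminator rates as most real (`pg[0] > 0`) and away from the one it rates as most fake (`pg[2] < 0`), which is the generator update read as a policy update.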

Key Structural Difference

The difference lies in where the reward comes from:

In standard RL, the reward function $r(s, a)$ is given by the environment -- it's fixed, external, and not learned. The agent optimizes against a static objective.

In a GAN, the reward function (discriminator) is itself a learned model that co-evolves with the generator. The "environment" is adversarial -- the goalposts move during training. This makes GANs a simultaneous game rather than a standard optimization problem.

FIXED vs. CO-TRAINED REWARDS

This is the essential distinction: RL optimizes against a fixed or designed reward signal, while GANs build their reward on the fly through the discriminator's co-training. One plays against a static environment; the other plays against a moving target. Both are games, but with fundamentally different dynamics. As we'll see, RLHF sits somewhere in between.

05 // THE GAME THEORY

Both frameworks can be unified under the language of game theory. The abstract formulation is a two-player game:

$$\min_G \max_D \; \mathcal{F}(G, D)$$

where $G$ is a generative process, $D$ is an evaluative process, and $\mathcal{F}$ is the game's value function. The specifics depend on the instantiation:

GAN Instantiation

$$\mathcal{F}_{\text{GAN}}(G, D) = \mathbb{E}_{p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{p_g}[\log(1 - D(\mathbf{x}))]$$

This is a zero-sum game: the discriminator's gain is the generator's loss. Nash equilibrium: $p_g = p_{\text{data}}$, $D = \frac{1}{2}$.

IRL / GAIL Instantiation

Finn et al. [3] and Ho & Ermon [4] showed that Inverse RL and GANs share the same structure. In Generative Adversarial Imitation Learning (GAIL):

$$\mathcal{F}_{\text{GAIL}}(\pi, D) = \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s, a))] - \lambda H(\pi)$$

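where $\pi_E$ is the expert policy (analogous to real data) and $H(\pi)$ is a causal entropy regularizer. The discriminator $D(s, a)$ now operates on state-action pairs. Note that the labels are flipped relative to the GAN value function above: here $D$ is pushed toward 1 on policy samples and toward 0 on expert samples, so $\log D(s, a)$ acts as a cost and $-\log D(s, a)$ serves as the reward signal for policy gradient updates.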

The key insight from Finn et al.: training a GAN is equivalent to performing inverse reinforcement learning on a single-step decision problem, where expert demonstrations are the real data and the discriminator is the learned reward function.

Three-Way Equivalence

This leads to a triangle of equivalences between three seemingly different problems:

THE EQUIVALENCE TRIANGLE

GANs (learn to generate data indistinguishable from real) $\iff$ Inverse RL (recover reward from demonstrations, then optimize policy) $\iff$ Energy-Based Models (learn energy function $E(\mathbf{x})$, sample via MCMC).

The mapping: discriminator $\leftrightarrow$ reward function $\leftrightarrow$ energy function. Generator $\leftrightarrow$ policy $\leftrightarrow$ sampler.

06 // F-DIVERGENCE FRAMEWORK

The deepest mathematical thread connecting GANs and RL runs through f-divergence minimization and convex duality [5].

f-Divergences

An f-divergence between distributions $p$ and $q$ is defined as:

$$D_f(p \| q) = \mathbb{E}_{q}\left[f\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right)\right]$$

where $f$ is a convex function with $f(1) = 0$. Different choices of $f$ yield familiar divergences:

| $f(u)$ | Divergence |
| --- | --- |
| $u \log u$ | KL divergence |
| $-\log u$ | Reverse KL |
| $(u-1)^2$ | Pearson $\chi^2$ |
| $u\log u - (u+1)\log\frac{u+1}{2}$ | Jensen-Shannon |
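The table can be checked numerically on a small discrete example. The sketch below evaluates $\mathbb{E}_q[f(p/q)]$ for each generator function and compares it with the divergence computed directly. The distributions and helper names are assumptions for illustration. One detail worth noting: with the Jensen-Shannon $f$ listed above, the result is twice the usual $\tfrac{1}{2}\text{KL} + \tfrac{1}{2}\text{KL}$ definition, consistent with the factor of 2 in $V(G, D^*)$ earlier.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = E_q[f(p / q)] on a finite support with p, q > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def f_kl(u):
    return u * math.log(u)

def f_reverse_kl(u):
    return -math.log(u)

def f_pearson(u):
    return (u - 1.0) ** 2

def f_js(u):
    return u * math.log(u) - (u + 1.0) * math.log((u + 1.0) / 2.0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Illustrative distributions (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
js_standard = 0.5 * kl(p, midpoint) + 0.5 * kl(q, midpoint)
pearson_direct = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

kl_via_f = f_divergence(p, q, f_kl)
reverse_kl_via_f = f_divergence(p, q, f_reverse_kl)
pearson_via_f = f_divergence(p, q, f_pearson)
js_via_f = f_divergence(p, q, f_js)
```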

Variational Lower Bound

By the Fenchel conjugate (convex dual) of $f$, any f-divergence admits a variational representation [5]:

$$D_f(p \| q) = \sup_T \left\{ \mathbb{E}_p[T(\mathbf{x})] - \mathbb{E}_q[f^*(T(\mathbf{x}))] \right\}$$

where $f^*$ is the Fenchel conjugate of $f$, and $T: \mathcal{X} \to \mathbb{R}$ is an arbitrary function (parameterized by a neural network in practice). This is the f-GAN framework: the discriminator $T$ maximizes this bound, the generator minimizes the divergence.
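A small worked case shows the bound tightening. For the KL generator $f(u) = u \log u$, the conjugate is $f^*(t) = e^{t-1}$ and the maximizing function is $T^*(\mathbf{x}) = 1 + \log\frac{p(\mathbf{x})}{q(\mathbf{x})}$. The sketch below, on assumed toy distributions, checks that the bound equals the KL divergence at $T^*$ and falls below it for randomly perturbed $T$.

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def variational_bound(p, q, t):
    """E_p[T] - E_q[f*(T)] with f*(t) = exp(t - 1), the conjugate of u log u."""
    expect_p = sum(pi * ti for pi, ti in zip(p, t))
    expect_q = sum(qi * math.exp(ti - 1.0) for qi, ti in zip(q, t))
    return expect_p - expect_q

# Illustrative distributions (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

# Optimal "discriminator" for the KL case: T*(x) = 1 + log(p(x) / q(x)).
t_star = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]
bound_at_optimum = variational_bound(p, q, t_star)

# Suboptimal choices of T can only give a smaller value.
rng = random.Random(0)
perturbed_bounds = [
    variational_bound(p, q, [ti + rng.uniform(-1.0, 1.0) for ti in t_star])
    for _ in range(1000)
]
```

In an f-GAN the network $T$ climbs toward `bound_at_optimum` while the generator moves $q$ to shrink it.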

Now here's the connection to RL. In Maximum Entropy Inverse RL [13], the objective for recovering a reward function from expert demonstrations is:

$$\max_r \left\{ \mathbb{E}_{\pi_E}[r(\tau)] - \log Z(r) \right\}$$

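where $Z(r) = \int \exp(r(\tau)) d\tau$ is the partition function. Compare this with the f-divergence variational bound. With $T = r$, the first term is the same expectation under the reference distribution, and $\log Z(r)$ plays the role of the conjugate term. Strictly, this is the Donsker-Varadhan form of the KL bound, $\sup_T \{\mathbb{E}_p[T] - \log \mathbb{E}_q[e^T]\}$, because $\log Z$ is the log of an integral rather than an expectation of a pointwise $f^*$. The IRL objective is also the expected log-likelihood of expert trajectories under the model $p_r(\tau) \propto \exp(r(\tau))$, so maximizing it minimizes the KL divergence $D_{\text{KL}}(p_{\pi_E} \| p_r)$ from the expert's trajectory distribution to the learned one.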

THE UNIFYING PRINCIPLE

Both GANs and (inverse) RL are variational f-divergence minimization problems. The discriminator/reward function is the variational function $T$ that tightens the bound. The generator/policy is the distribution $q$ being optimized. The real data/expert demonstrations are the reference distribution $p$. The only difference is the choice of $f$ and whether generation is one-shot or sequential.

07 // RLHF: THE BRIDGE

With this framework in place, we can now examine Reinforcement Learning from Human Feedback [7], the dominant paradigm for post-training large language models. RLHF sits precisely at the intersection of GANs and RL, borrowing structural elements from both.

The Three-Phase Pipeline

Phase 1: Supervised Fine-Tuning (SFT). A pretrained LLM is fine-tuned on demonstration data to produce a reference policy $\pi_{\text{ref}}$.

Phase 2: Reward Model Training. A reward model $R_\phi(x, y)$ is trained on human pairwise comparisons using the Bradley-Terry model:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(R_\phi(x, y_w) - R_\phi(x, y_l)\right)\right]$$

where $y_w \succ y_l$ indicates that response $y_w$ was preferred over $y_l$ for prompt $x$.
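A minimal sketch of this loss, operating on scalar reward-model outputs that are already computed. The function names are our own, and a real implementation would run inside an autodiff framework rather than plain Python.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(R(x, y_w) - R(x, y_l)) over preference pairs."""
    margins = [rw - rl for rw, rl in zip(chosen_rewards, rejected_rewards)]
    return -sum(log_sigmoid(m) for m in margins) / len(margins)

# When the reward model cannot tell the pair apart, the loss is log 2.
loss_tied = bradley_terry_loss([0.0], [0.0])
# Ranking the preferred response higher lowers the loss.
loss_correct = bradley_terry_loss([2.0], [0.0])
# Ranking it lower is penalized roughly linearly in the margin.
loss_wrong = bradley_terry_loss([0.0], [2.0])
```

Only the difference $R_\phi(x, y_w) - R_\phi(x, y_l)$ enters the loss, so the reward model is identified only up to a per-prompt additive constant.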

Phase 3: RL Fine-Tuning (PPO [8]). The policy $\pi_\theta$ is optimized to maximize reward while staying close to the reference:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\left[R_\phi(x, y)\right] - \beta \, D_{\text{KL}}\!\left(\pi_\theta \| \pi_{\text{ref}}\right)$$

The hyperparameter $\beta$ controls the strength of the KL penalty.

RLHF as a GAN

The structural parallel to GANs is immediate:

| GAN | RLHF |
| --- | --- |
| Generator $G_\theta$ | LLM policy $\pi_\theta$ |
| Discriminator $D_\phi$ | Reward model $R_\phi$ |
| Real data $\mathbf{x} \sim p_{\text{data}}$ | Human-preferred responses $y_w$ |
| Generated data $G(\mathbf{z})$ | Model responses $y \sim \pi_\theta$ |
| $\min_\theta \max_\phi V(G, D)$ | $\min_\phi \mathcal{L}(\phi)$; then $\max_\theta J(\theta)$ |
| Gradient penalty / spectral norm | KL penalty $\beta \, D_{\text{KL}}(\pi_\theta \Vert \pi_{\text{ref}})$ |

But there is a crucial structural difference: in vanilla RLHF, the reward model is frozen after Phase 2. The generator (LLM) optimizes against a static evaluator. In contrast, GANs co-train the discriminator and generator simultaneously.

In game-theoretic terms, vanilla RLHF is a Stackelberg game (leader-follower: the reward model commits first, the policy best-responds) rather than a Nash game (simultaneous moves, as in GANs).

The KL Penalty Decomposition

The KL penalty in RLHF connects to maximum entropy RL. Expanding:

$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) = \mathbb{E}_{\pi_\theta}\!\left[\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)\right] = -H(\pi_\theta) - \mathbb{E}_{\pi_\theta}\!\left[\log \pi_{\text{ref}}(y|x)\right]$$

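The first term is the negative entropy of the policy. Because the KL is subtracted from the objective, the policy is rewarded for keeping its entropy high, which encourages exploration and guards against mode collapse. The second term is the cross-entropy of the policy relative to the reference, which anchors $\pi_\theta$ to responses that $\pi_{\text{ref}}$ considers likely. This mirrors the entropy regularization in MaxEnt RL [13] and SAC [14], where $H(\pi)$ is added to the reward to prevent premature convergence.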

In GAN terms, this plays a role loosely analogous to gradient penalties in WGAN-GP [11]. The gradient penalty regularizes the critic rather than the generator, but the effect is related: it limits how easily the generator can collapse onto a narrow set of outputs that exploit the evaluator's blind spots.

08 // THE OPTIMAL POLICY AND THE BOLTZMANN CONNECTION

The RLHF objective has a beautiful closed-form solution. Given a fixed reward model $R_\phi$, the optimal policy $\pi^*$ that maximizes $\mathbb{E}_{\pi}[R_\phi(x, y)] - \beta \, D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ is [9]:

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \, \exp\!\left(\frac{R_\phi(x, y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp(R_\phi(x,y)/\beta)$ is the partition function.

This is a Boltzmann distribution -- the same form as the energy-based models in Finn et al.'s triangle [3]. The reward model plays the role of the negative energy, the reference policy is the base measure, and $\beta$ is the temperature.
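For a prompt with a small, enumerable set of responses, the closed form can be verified directly. The sketch below uses made-up numbers for $\pi_{\text{ref}}$, $R_\phi$ and $\beta$, and hypothetical helper names. It checks two facts: no randomly drawn policy beats $\pi^*$ on the KL-regularized objective, and the optimal value equals $\beta \log Z(x)$.

```python
import math
import random

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(R(y) / beta); also returns Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(pi, pi_ref, rewards, beta):
    """E_pi[R] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl(pi, pi_ref)

# Illustrative values for one prompt with three candidate responses.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -1.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best_value = rlhf_objective(pi_star, pi_ref, rewards, beta)

rng = random.Random(0)

def random_policy(n):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

random_values = [
    rlhf_objective(random_policy(3), pi_ref, rewards, beta)
    for _ in range(1000)
]
```

Lowering `beta` sharpens `pi_star` toward the highest-reward response; raising it pulls `pi_star` back toward `pi_ref`, which is the temperature behavior of a Boltzmann distribution.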

Inverting this relationship gives us the implicit reward:

$$R_\phi(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

This is exactly the insight behind Direct Preference Optimization (DPO) [9]: since the optimal policy implicitly encodes the reward, we can bypass the explicit reward model entirely and optimize the policy directly on preference data. The policy is the reward model -- or in GAN terms, the generator is the discriminator.

DPO: COLLAPSING THE GAME

DPO eliminates the two-player game entirely. Instead of training a separate reward model and then running PPO, DPO reparameterizes the reward in terms of the policy ratio $\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$. The $\beta \log Z(x)$ term depends only on the prompt, so it cancels in the Bradley-Terry difference between $y_w$ and $y_l$, and the loss can be optimized directly:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

No discriminator. No reward model. No RL. Just supervised learning on preference pairs. The adversarial game collapses into a single-objective optimization.
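A per-pair version of the DPO loss can be sketched from sequence log-probabilities alone. The function and argument names below are our own, and the numbers are illustrative. The second half checks the claim the reparameterization rests on: for the closed-form optimal policy, the scaled log-ratio difference recovers the reward difference exactly, because the partition function cancels.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """-log sigmoid(beta * [(log-ratio of y_w) - (log-ratio of y_l)])."""
    log_ratio_w = policy_logp_w - ref_logp_w
    log_ratio_l = policy_logp_l - ref_logp_l
    return -log_sigmoid(beta * (log_ratio_w - log_ratio_l))

# With the policy equal to the reference, the implicit margin is zero.
loss_at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0, beta=0.1)
# Raising the chosen response's log-probability relative to the reference
# (and lowering the rejected one's) reduces the loss.
loss_improved = dpo_loss(-0.5, -3.0, -1.0, -2.0, beta=0.1)

# Check that Z(x) cancels: build pi* from illustrative rewards and compare
# the implicit reward margin with the true reward margin.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -1.0]
beta = 0.5
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]
implicit_margin = beta * (
    math.log(pi_star[0] / pi_ref[0]) - math.log(pi_star[2] / pi_ref[2])
)
true_margin = rewards[0] - rewards[2]
```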

09 // FAILURE MODES: SAME GAME, SAME PATHOLOGIES

If GANs and RL are instances of the same game, we should expect similar failure modes. And indeed we do.

Mode Collapse / Reward Hacking

In GANs, mode collapse occurs when the generator concentrates on a few samples that fool the discriminator, ignoring large regions of the data distribution.

In RLHF, reward hacking (or reward overoptimization [10]) occurs when the policy exploits the reward model's blind spots -- producing outputs that score high on the proxy reward but are low quality by true human judgment.

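Gao et al. [10] showed this follows a characteristic curve: as KL divergence from the reference policy increases, true quality first improves then degrades. Writing $d = \sqrt{D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})}$, they fit the true ("gold") reward of RL-optimized policies with:

$$R_{\text{RL}}(d) = d \left(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d\right)$$

and that of best-of-$n$ sampling with $R_{\text{bon}}(d) = d \left(\alpha_{\text{bon}} - \beta_{\text{bon}} \, d\right)$. The $\alpha$ and $\beta$ here are fitted coefficients, unrelated to the KL penalty weight $\beta$ used earlier.

The $\alpha$ term is genuine improvement from aligning with human preferences. The $\beta$ term is overoptimization: exploiting the gap between proxy reward and true quality, which eventually dominates as $d$ grows. This is Goodhart's Law in mathematical form: "when a measure becomes a target, it ceases to be a good measure."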
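Gao et al. [10] report their fits in terms of $d = \sqrt{D_{\text{KL}}}$, with the form $d(\alpha - \beta \log d)$ for RL and $d(\alpha - \beta d)$ for best-of-$n$. As a rough sketch of what these curves imply, the code below locates the value of $d$ where true reward peaks. Setting the derivative to zero gives $d^* = e^{\alpha/\beta - 1}$ for the RL form and $d^* = \alpha / (2\beta)$ for best-of-$n$. The coefficients are made-up placeholders, not values from the paper, and the function names are our own.

```python
import math

def gold_reward_bon(d, alpha, beta):
    """Best-of-n form: R(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d, alpha, beta):
    """RL form: R(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * math.log(d))

def peak_bon(alpha, beta):
    return alpha / (2.0 * beta)

def peak_rl(alpha, beta):
    return math.exp(alpha / beta - 1.0)

def grid_argmax(fn, lo, hi, steps=100000):
    """Brute-force maximizer on a uniform grid, used only as a sanity check."""
    best_d, best_v = lo, fn(lo)
    for i in range(1, steps + 1):
        d = lo + (hi - lo) * i / steps
        v = fn(d)
        if v > best_v:
            best_d, best_v = d, v
    return best_d

# Placeholder coefficients (assumed, NOT fitted values from the paper).
ALPHA, BETA = 1.0, 0.5

bon_peak_numeric = grid_argmax(lambda d: gold_reward_bon(d, ALPHA, BETA), 0.01, 10.0)
rl_peak_numeric = grid_argmax(lambda d: gold_reward_rl(d, ALPHA, BETA), 0.01, 10.0)
```

Past the peak, further optimization against the proxy moves the policy farther from the reference while true reward falls, which is the regime the KL penalty is meant to keep the policy out of.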

Training Instability

GANs are notoriously hard to train -- oscillating gradients, vanishing discriminator signal, sensitivity to hyperparameters. RL suffers analogous issues: high variance policy gradients, reward sparsity, sensitivity to learning rates.

The stabilization techniques also transfer:

| GAN Stabilizer | RL Analog |
| --- | --- |
| Gradient penalty (WGAN-GP) | Trust region / clipping (PPO) |
| Spectral normalization | Reward normalization |
| Label smoothing | Reward shaping / clipping |
| Progressive growing | Curriculum learning |
| Two-timescale learning rates | Separate actor/critic learning rates |

10 // SEQUENTIAL GENERATION: SeqGAN

Standard GANs generate samples in a single forward pass -- image pixels all at once. But language models generate tokens sequentially. This makes the RLHF-LLM setting fundamentally multi-step, closer to RL's MDP formulation than to vanilla GANs.

SeqGAN [6] bridges this gap by applying the GAN framework to sequential generation using RL. At each timestep, the generator (policy) selects the next token $y_t$. Since the discriminator can only evaluate complete sequences, Monte Carlo rollouts are used to estimate the intermediate reward:

$$Q_{D_\phi}(s = y_{1:t-1},\, a = y_t) = \begin{cases} \frac{1}{N}\sum_{n=1}^{N} D_\phi(y_{1:T}^{(n)}) & \text{if } t < T \\ D_\phi(y_{1:T}) & \text{if } t = T \end{cases}$$

where $y_{1:T}^{(n)}$ are complete sequences obtained by keeping the prefix $y_{1:t}$ fixed and rolling out the generator from position $t+1$ to $T$. The generator is then updated with the REINFORCE policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y_{1:T} \sim \pi_\theta}\!\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t | y_{1:t-1}) \cdot Q_{D_\phi}(y_{1:t-1}, y_t)\right]$$

This makes the connection explicit: sequential GAN training is RL with the discriminator as the reward function. SeqGAN was, in a sense, a precursor to RLHF for language models -- both use policy gradients to optimize a sequence generator against a learned evaluator.
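The rollout estimate of $Q_{D_\phi}$ can be sketched with a toy setup. Everything below is a stand-in chosen for illustration, not SeqGAN's actual models: a binary vocabulary, a rollout policy that samples tokens uniformly, and a "discriminator" that scores a sequence by its fraction of 1-tokens.

```python
import random

VOCAB = [0, 1]
SEQ_LEN = 6

def toy_generator_step(prefix, rng):
    """Hypothetical rollout policy: ignores the prefix, samples uniformly."""
    return rng.choice(VOCAB)

def toy_discriminator(sequence):
    """Hypothetical D(y_{1:T}) in [0, 1]: the fraction of tokens equal to 1."""
    return sum(sequence) / len(sequence)

def rollout_q_value(prefix, action, n_rollouts, rng):
    """Monte Carlo estimate of Q(s = prefix, a = action)."""
    partial = list(prefix) + [action]
    if len(partial) == SEQ_LEN:
        # The sequence is complete, so the discriminator scores it directly.
        return toy_discriminator(partial)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(partial)
        # Complete the sequence from the next position with the rollout policy.
        while len(seq) < SEQ_LEN:
            seq.append(toy_generator_step(seq, rng))
        total += toy_discriminator(seq)
    return total / n_rollouts

rng = random.Random(0)
prefix = [1, 0]
q_token_1 = rollout_q_value(prefix, 1, n_rollouts=2000, rng=rng)
q_token_0 = rollout_q_value(prefix, 0, n_rollouts=2000, rng=rng)
q_terminal = rollout_q_value([1, 1, 1, 1, 1], 1, n_rollouts=10, rng=rng)
```

With this toy discriminator the exact values are $7/12$ for choosing token 1 and $5/12$ for token 0, so the estimates give the generator a per-token learning signal even though the discriminator only ever sees complete sequences.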

11 // THE UNIFIED VIEW

Let's step back and see the full picture. We can describe a single abstract framework that encompasses GANs, RL, inverse RL, and RLHF:

$$\min_G \max_D \; \underbrace{\mathbb{E}_{p_{\text{ref}}}[\ell_D(D(\mathbf{x}))]}_{\text{evaluate reference}} + \underbrace{\mathbb{E}_{p_G}[\ell_G(D(\mathbf{x}))]}_{\text{evaluate generated}} - \underbrace{\lambda \, \Omega(G)}_{\text{regularize generator}}$$

where $\ell_D, \ell_G$ are evaluation losses and $\Omega(G)$ is a generator regularizer. Each framework instantiates this differently:

| Framework | $p_{\text{ref}}$ | $D$ | $\Omega(G)$ | Game Type |
| --- | --- | --- | --- | --- |
| GAN | $p_{\text{data}}$ | Co-trained | None | Nash (simultaneous) |
| GAIL | $\pi_E$ | Co-trained | $H(\pi)$ | Nash (alternating) |
| RL | N/A | Fixed ($r$) | None ($H(\pi)$ in MaxEnt RL) | Single-player |
| RLHF | Human prefs | Pre-trained, frozen | $D_{\text{KL}}(\pi \Vert \pi_{\text{ref}})$ | Stackelberg |
| DPO | Human prefs | Implicit in $\pi$ | $D_{\text{KL}}(\pi \Vert \pi_{\text{ref}})$ | Collapsed (single-player) |

The spectrum runs from a full two-player simultaneous game (GAN) through a leader-follower game (RLHF) to a collapsed single-player optimization (DPO). As the evaluative model becomes more tightly coupled with the generative model, the adversarial dynamics simplify -- but at the cost of less adaptive evaluation.

12 // CONCLUSION

The connection between GANs and RL is not a loose analogy -- it is a formal mathematical equivalence rooted in game theory, f-divergence minimization, and convex duality. The generator is the actor; the discriminator is the critic. What differs is the source and dynamics of the evaluative signal.

GANs build their reward on the fly through a co-trained discriminator -- the evaluator evolves with the generator in a simultaneous game. RL (and RLHF) uses a fixed or pre-trained reward signal -- the evaluator is external and static. This seemingly small difference has profound consequences for training dynamics, stability, and failure modes.

The modern RLHF pipeline for LLMs sits at the intersection: the reward model is a learned evaluator (like a discriminator), but it is trained separately and frozen (like an environment reward). DPO collapses the game entirely, revealing that the adversarial structure was always implicit in the preference data.

Understanding this unified view is more than academic. It tells us why RLHF suffers from reward hacking (same root cause as mode collapse), why KL penalties help (same role as gradient penalties and entropy regularization), and where to look for better algorithms -- in the rich toolbox that both communities have built for stabilizing adversarial optimization.

"Every generative model is an agent playing against an evaluator. The question is whether the evaluator plays back."

REFERENCES

  1. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. Generative Adversarial Nets. NeurIPS 2014.
  2. Pfau, D. & Vinyals, O. Connecting Generative Adversarial Networks and Actor-Critic Methods. arXiv:1611.02163, 2016.
  3. Finn, C., Christiano, P., Abbeel, P. & Levine, S. A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models. NeurIPS 2016 Workshop.
  4. Ho, J. & Ermon, S. Generative Adversarial Imitation Learning. NeurIPS 2016.
  5. Nowozin, S., Cseke, B. & Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. NeurIPS 2016.
  6. Yu, L., Zhang, W., Wang, J. & Yu, Y. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
  7. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.
  8. Schulman, J., Wolski, F., Dhariwal, P., Klimov, O. & Radford, A. Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.
  9. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
  10. Gao, L., Schulman, J. & Hilton, J. Scaling Laws for Reward Model Overoptimization. ICML 2023.
  11. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. Improved Training of Wasserstein GANs. NeurIPS 2017.
  12. Williams, R. J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4), 229-256, 1992.
  13. Ziebart, B. D., Maas, A., Bagnell, J. A. & Dey, A. K. Maximum Entropy Inverse Reinforcement Learning. AAAI 2008.
  14. Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.