Score Matching Explained

01 // INTRODUCTION

If you've been following the generative AI space, you've probably heard of diffusion models -- the family of models behind DALL-E 2, Stable Diffusion, and many other image generators. But at the mathematical core of these models lies a beautifully elegant concept: score matching.

The story of modern diffusion models has two intertwined threads. One comes from the score matching perspective -- learning the gradient of the log-density. The other comes from the DDPM (Denoising Diffusion Probabilistic Models) line of work -- a variational approach rooted in hierarchical latent variable models. These two threads turn out to be deeply connected, and understanding both gives you the full picture.

In this post, we'll build up both perspectives from scratch, show exactly where they meet, and trace the evolution from the original ideas to modern practice. No hand-waving -- we're going full math mode.

"The score function is the gradient of the log-density. That's it. That's the tweet."

02 // THE SCORE FUNCTION

Given a probability density $p(\mathbf{x})$, the score function is defined as the gradient of the log-density:

\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})

Why is this useful? The score function tells us the direction in which the log-density increases most rapidly. It points "uphill" toward regions of high probability, without requiring us to know the normalizing constant of $p(\mathbf{x})$.

For a concrete example, if $p(\mathbf{x})$ is a Gaussian $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the score is:

\nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})

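It's a vector field that pulls every point back toward the mean -- exactly what you'd expect. For isotropic $\boldsymbol{\Sigma}$ it points straight at $\boldsymbol{\mu}$; for a general covariance it is reshaped by $\boldsymbol{\Sigma}^{-1}$ but still makes an acute angle with the direction to the mean. At any point, it tells you "go this way to reach higher probability."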
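To sanity-check the closed form, here is a minimal NumPy sketch that compares it against a finite-difference gradient of the log-density. The function names and the particular mean, covariance, and test point are illustrative choices made for this example, not part of any library.

```python
import numpy as np

def gaussian_log_density(x, mu, cov):
    """Log-density of N(mu, cov) at x, including the normalizing constant."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet + d * np.log(2 * np.pi))

def gaussian_score(x, mu, cov):
    """Closed-form score: -cov^{-1} (x - mu)."""
    return -np.linalg.solve(cov, x - mu)

def finite_difference_score(x, mu, cov, h=1e-5):
    """Central-difference approximation of the gradient of the log-density."""
    grad = np.zeros_like(x)
    for i in range(x.shape[0]):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (gaussian_log_density(x + step, mu, cov)
                   - gaussian_log_density(x - step, mu, cov)) / (2 * h)
    return grad

# Illustrative (made-up) parameters with a non-isotropic covariance.
mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([0.5, 0.5])

analytic = gaussian_score(x, mu, cov)
numeric = finite_difference_score(x, mu, cov)

# The score need not be parallel to (mu - x), but their inner product is positive.
points_uphill = analytic @ (mu - x) > 0
```

The two gradients should agree up to floating-point error, and the normalizing constant inside `gaussian_log_density` has no effect on either one.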

KEY INSIGHT

Unlike the density $p(\mathbf{x})$ itself, the score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ doesn't depend on the partition function $Z$. If $p(\mathbf{x}) = \frac{1}{Z}\tilde{p}(\mathbf{x})$, then $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x})$ since $\nabla_{\mathbf{x}} \log Z = 0$. This is a big deal -- computing $Z$ is intractable for most interesting distributions.

Once you have the score function, you can generate samples via Langevin dynamics. Starting from noise $\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, iterate:

\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\eta}{2} \nabla_{\mathbf{x}} \log p(\mathbf{x}_t) + \sqrt{\eta}\, \mathbf{z}_t, \quad \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

As $\eta \to 0$ and the number of steps $\to \infty$, the distribution of $\mathbf{x}_t$ converges to $p(\mathbf{x})$. The score function is literally all you need to sample.
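Here is a small sketch of that update rule in NumPy, run on a 1-D Gaussian target whose score we know in closed form. The target parameters, step size, and step count are arbitrary choices for illustration; with a finite step size the sampler is only approximately correct.

```python
import numpy as np

def langevin_sample(score_fn, x0, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (eta/2) * score(x) + sqrt(eta) * z."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# Toy target: N(3, 0.5^2). Only its score is ever evaluated, never its density.
target_mean, target_std = 3.0, 0.5

def target_score(x):
    return -(x - target_mean) / target_std**2

rng = np.random.default_rng(0)
x_init = rng.standard_normal(5000)          # 5000 independent chains started from N(0, 1)
samples = langevin_sample(target_score, x_init, step_size=0.01, n_steps=1000, rng=rng)
```

After enough steps, `samples.mean()` and `samples.std()` should sit close to the target's mean and standard deviation, even though the chains started from standard normal noise.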

03 // SCORE MATCHING OBJECTIVE

The goal of score matching (Hyvärinen, 2005) is to learn a model $\mathbf{s}_\theta(\mathbf{x})$ that approximates the true score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$. The naive approach would minimize the Fisher divergence:

J(\theta) = \frac{1}{2} \mathbb{E}_{p(\mathbf{x})} \left[ \left\| \mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p(\mathbf{x}) \right\|^2 \right]

But we don't know $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ -- that's the whole point. Through integration by parts (assuming mild boundary conditions), this objective can be rewritten without the unknown score:

J(\theta) = \mathbb{E}_{p(\mathbf{x})} \left[ \text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})) + \frac{1}{2} \left\| \mathbf{s}_\theta(\mathbf{x}) \right\|^2 \right] + \text{const}

This is implicit score matching (the Fisher-divergence form above, which needs the true score, is the explicit version). The first term is the trace of the Jacobian of the score model -- it depends only on our model, not on the unknown data score. Brilliant in theory, but computing that Jacobian trace for a deep neural network is expensive ($O(d)$ backward passes for $d$-dimensional data).
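To see the integration-by-parts objective at work without any neural network, consider a toy 1-D linear score model $s_\theta(x) = ax + b$. Its Jacobian trace is just $a$, so the objective becomes $\mathbb{E}[a + \frac{1}{2}(ax + b)^2]$, which has a closed-form minimizer. The sketch below is an assumption-laden illustration (linear model, Gaussian data, hypothetical function names), not how score matching is done at scale.

```python
import numpy as np

def hyvarinen_loss(a, b, x):
    """Monte Carlo estimate of E[ds/dx + 0.5 * s(x)^2] for s(x) = a * x + b."""
    return np.mean(a + 0.5 * (a * x + b) ** 2)

def fit_linear_score(x):
    """Closed-form minimizer of the loss above.

    Setting the gradient to zero gives a = -1 / Var[x] and b = E[x] / Var[x].
    """
    var = np.var(x)
    return -1.0 / var, np.mean(x) / var

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=20000)   # samples only; no density is evaluated

a_hat, b_hat = fit_linear_score(data)

# For Gaussian data the true score is -(x - mean) / std^2, i.e. a line with these coefficients.
true_a, true_b = -1.0 / 1.5**2, 2.0 / 1.5**2
```

The fitted coefficients recover the true score of the data distribution using only samples, which is the whole point of the rewritten objective.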

Sliced Score Matching

Song et al. (2020) proposed sliced score matching to avoid the Jacobian trace. The idea: project the score onto random directions $\mathbf{v}$ and match the projected scores:

J_{SSM}(\theta) = \mathbb{E}_{p(\mathbf{v})} \mathbb{E}_{p(\mathbf{x})} \left[ \mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x}) \mathbf{v} + \frac{1}{2} \left( \mathbf{v}^\top \mathbf{s}_\theta(\mathbf{x}) \right)^2 \right]

This reduces the cost to a single vector-Jacobian product, computable efficiently via backpropagation. But in practice, an even simpler approach won out.
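The trick underneath the sliced objective is that random projections with $\mathbb{E}[\mathbf{v}\mathbf{v}^\top] = \mathbf{I}$ give an unbiased estimate of the trace. The sketch below checks this numerically for a linear score model $\mathbf{s}(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$, whose Jacobian is simply $\mathbf{A}$. The Rademacher projections and the linear model are assumptions made to keep the example self-contained; with a neural network, $\mathbf{v}^\top \nabla_{\mathbf{x}} \mathbf{s}_\theta \mathbf{v}$ would come from automatic differentiation instead.

```python
import numpy as np

def exact_loss(A, b, x):
    """tr(Jacobian) + 0.5 * ||s||^2, averaged over data, for s(x) = A x + b."""
    s = x @ A.T + b
    return np.trace(A) + 0.5 * np.mean(np.sum(s**2, axis=1))

def sliced_loss(A, b, x, rng):
    """Sliced estimate using one random projection per data point."""
    s = x @ A.T + b
    v = rng.choice([-1.0, 1.0], size=x.shape)    # Rademacher directions, E[v v^T] = I
    v_jac_v = np.sum((v @ A.T) * v, axis=1)      # v^T A v for each sample
    v_dot_s = np.sum(v * s, axis=1)              # v^T s for each sample
    return np.mean(v_jac_v + 0.5 * v_dot_s**2)

rng = np.random.default_rng(0)
d = 5
A = -np.eye(d) + 0.1 * rng.standard_normal((d, d))   # arbitrary illustrative model
b = np.zeros(d)
x = rng.standard_normal((20000, d))

loss_exact = exact_loss(A, b, x)
loss_sliced = sliced_loss(A, b, x, rng)
```

The two losses agree up to Monte Carlo noise, but the sliced version never forms the full Jacobian trace.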

04 // DENOISING SCORE MATCHING

Denoising score matching (Vincent, 2011) is the approach that actually powers modern diffusion models. The idea is elegant: instead of matching the score of $p(\mathbf{x})$, match the score of a noise-corrupted version $p_\sigma(\tilde{\mathbf{x}})$.

If we corrupt data with Gaussian noise $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then the conditional distribution is $p(\tilde{\mathbf{x}} | \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})$, and its score has a simple closed form:

\nabla_{\tilde{\mathbf{x}}} \log p(\tilde{\mathbf{x}} | \mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2} = -\frac{\boldsymbol{\epsilon}}{\sigma}

Vincent (2011) proved that minimizing the denoising objective is equivalent to minimizing the Fisher divergence to the noisy data distribution $p_\sigma(\tilde{\mathbf{x}}) = \int p(\tilde{\mathbf{x}} | \mathbf{x}) p(\mathbf{x}) d\mathbf{x}$:

J_{DSM}(\theta) = \frac{1}{2} \mathbb{E}_{p(\mathbf{x})} \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \left\| \mathbf{s}_\theta(\mathbf{x} + \sigma\boldsymbol{\epsilon}, \sigma) + \frac{\boldsymbol{\epsilon}}{\sigma} \right\|^2 \right]

No Jacobian traces, no integration by parts -- just predict which direction the noise came from. The catch? This only gives you the score of the noisy distribution. If $\sigma$ is too small, the score estimate is inaccurate in low-density regions. If $\sigma$ is too large, the noisy distribution is too far from the real data.
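A quick numerical check of Vincent's equivalence, under toy assumptions: 1-D Gaussian data and a linear score model fitted by ordinary least squares to the denoising target $-\boldsymbol{\epsilon}/\sigma$. The specific numbers are made up for illustration. If the equivalence holds, the fit should recover the score of the noisy marginal $\mathcal{N}(\mu, s^2 + \sigma^2)$, not the score of the conditional.

```python
import numpy as np

rng = np.random.default_rng(0)
data_mean, data_std, sigma = 1.0, 0.5, 0.3     # illustrative values
n = 200000

x = rng.normal(data_mean, data_std, size=n)    # clean data
eps = rng.standard_normal(n)
x_noisy = x + sigma * eps                      # corrupted data
target = -eps / sigma                          # score of p(x_noisy | x)

# Least-squares fit of s(x_noisy) = a * x_noisy + b to the denoising target.
a_hat, b_hat = np.polyfit(x_noisy, target, deg=1)

# Score of the noisy marginal N(data_mean, data_std^2 + sigma^2) is also linear.
noisy_var = data_std**2 + sigma**2
a_true, b_true = -1.0 / noisy_var, data_mean / noisy_var
```

Each individual target is extremely noisy, yet the regression averages them into the marginal score. That averaging is what the network does implicitly during training.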

INTUITION

Denoising score matching says: "train a network to predict the direction back to the clean data from its noisy version." The optimal denoiser points in the direction of the score. Denoising is score estimation.
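One way to make this precise is the identity commonly known as Tweedie's formula, stated here for the Gaussian corruption $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\boldsymbol{\epsilon}$ used above. It is a standard result quoted for context, not something derived in this post:

```latex
\mathbb{E}[\mathbf{x} \mid \tilde{\mathbf{x}}]
  = \tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}})
\quad\Longleftrightarrow\quad
\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}})
  = \frac{\mathbb{E}[\mathbf{x} \mid \tilde{\mathbf{x}}] - \tilde{\mathbf{x}}}{\sigma^2}
```

In words: the displacement from a noisy point to its best mean-squared-error reconstruction is the score of the noisy distribution, scaled by $\sigma^2$.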

05 // NCSN: MULTI-SCALE SCORE MATCHING

Song & Ermon (2019) resolved the noise level dilemma with Noise Conditional Score Networks (NCSN). The key insight: don't pick one $\sigma$ -- use many. Define a geometric sequence of noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$ and train a single score network $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ across all of them:

\mathcal{L}_{NCSN}(\theta) = \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i) \, \mathbb{E}_{p(\mathbf{x})} \mathbb{E}_{\tilde{\mathbf{x}} \sim \mathcal{N}(\mathbf{x}, \sigma_i^2 \mathbf{I})} \left[ \left\| \mathbf{s}_\theta(\tilde{\mathbf{x}}, \sigma_i) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma_i^2} \right\|^2 \right]

where $\lambda(\sigma_i) = \sigma_i^2$ ensures roughly equal loss magnitude across scales. At sampling time, they use annealed Langevin dynamics -- start with the largest $\sigma_1$ (where the score landscape is smooth and easy to navigate), and gradually reduce to $\sigma_L$ (where the score captures fine details):

\mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\eta_i}{2} \mathbf{s}_\theta(\mathbf{x}_t, \sigma_i) + \sqrt{\eta_i}\, \mathbf{z}_t
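As a rough sketch of the annealing loop, the code below uses a 1-D Gaussian data distribution so that the score at every noise level is available in closed form (standing in for a trained $\mathbf{s}_\theta$). The step-size rule $\eta_i = \eta_{\text{base}} \cdot \sigma_i^2 / \sigma_L^2$ follows the scaling used by Song & Ermon; the data parameters, `base_step`, and the number of steps per level are assumptions picked for this toy problem.

```python
import numpy as np

data_mean, data_std = 2.0, 0.5      # toy data distribution N(2, 0.5^2)

def noisy_score(x, sigma):
    """Exact score of the data distribution convolved with N(0, sigma^2)."""
    return -(x - data_mean) / (data_std**2 + sigma**2)

def annealed_langevin(score_fn, x, sigmas, base_step, steps_per_level, rng):
    """Run Langevin dynamics at each noise level, from largest sigma to smallest."""
    for sigma in sigmas:
        eta = base_step * (sigma / sigmas[-1]) ** 2    # larger steps at larger noise
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * eta * score_fn(x, sigma) + np.sqrt(eta) * z
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(1.0, 0.01, num=10)    # geometric sequence, sigma_1 > ... > sigma_L
x_init = rng.standard_normal(5000)
samples = annealed_langevin(noisy_score, x_init, sigmas,
                            base_step=5e-5, steps_per_level=100, rng=rng)
```

The large-noise levels do the heavy lifting of moving samples to the right region, and the small-noise levels only refine them.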

This was the first score-based model to generate high-quality images, and it set the stage for everything that followed. But around the same time, a parallel line of work was reaching similar conclusions from a very different angle.

06 // THE DDPM LINE OF WORK

While the score matching community was thinking about gradients of log-densities, another thread was developing diffusion models from the perspective of hierarchical variational autoencoders.

Deep Unsupervised Learning (Sohl-Dickstein et al., 2015)

The original diffusion model paper framed the problem as learning to reverse a fixed corruption process. Define a forward process that gradually destroys data by adding noise over $T$ steps:

q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I})

where $\{\beta_t\}_{t=1}^T$ is a variance schedule. After enough steps, $q(\mathbf{x}_T) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$. The generative model learns the reverse process:

p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})

The model is trained by maximizing a variational lower bound (ELBO) on $\log p_\theta(\mathbf{x}_0)$. The idea was sound, but the results in 2015 weren't competitive with GANs. It took five more years for someone to make it work.

DDPM (Ho et al., 2020)

Ho et al. (2020) took this framework and made several key choices that turned it into a powerhouse. First, define cumulative noise parameters:

\alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

A critical property: you can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ without iterating through all intermediate steps:

q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})

Or equivalently: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
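The closed form is easy to check empirically. The sketch below uses the linear schedule from Ho et al. ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$, $T = 1000$) and compares one-shot sampling against applying the single-step kernel $t$ times. The helper names and the choice of a constant $\mathbf{x}_0$ are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # alpha_bars[t - 1] holds bar-alpha_t

def q_sample(x0, t, eps):
    """Draw x_t from q(x_t | x_0) in one shot; t is a 1-based timestep."""
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def q_iterate(x0, t, rng):
    """Draw x_t by applying q(x_s | x_{s-1}) for s = 1, ..., t."""
    x = np.array(x0, dtype=float)
    for s in range(t):
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.full(20000, 1.5)                # many copies of the same clean value
t = 300
direct = q_sample(x0, t, rng.standard_normal(x0.shape))
stepped = q_iterate(x0, t, rng)
```

Both routes produce samples with the same mean and standard deviation up to Monte Carlo noise, which is why training can jump straight to any timestep.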

The ELBO decomposes into a sum of KL divergences between Gaussians. The posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ is tractable and Gaussian:

q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})

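where $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$ is the posterior variance and the posterior mean is: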

\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t

Ho et al.'s crucial insight: instead of parameterizing the model to predict $\boldsymbol{\mu}_\theta$ directly, reparameterize to predict the noise $\boldsymbol{\epsilon}$. Substituting $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon})$ into the posterior mean gives:

\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)
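The substitution is pure algebra, so it can be verified numerically: plug the true $\boldsymbol{\epsilon}$ into the reparameterized mean and it should match the posterior mean exactly. A small sketch, assuming the linear schedule from Ho et al. and hypothetical helper names:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean(x0, xt, t):
    """tilde-mu_t(x_t, x_0) for a 1-based timestep t >= 2."""
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    coef_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x0 + coef_xt * xt

def mean_from_eps(xt, eps, t):
    """The same mean written in terms of the noise instead of x_0."""
    ab_t = alpha_bars[t - 1]
    return (xt - betas[t - 1] / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(alphas[t - 1])

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
t = 500
xt = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

mu_posterior = posterior_mean(x0, xt, t)
mu_reparam = mean_from_eps(xt, eps, t)
```

The key simplification is $\beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$, which collapses the coefficient on $\mathbf{x}_t$ to $1/\sqrt{\alpha_t}$.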

The simplified training objective becomes stunningly simple:

\mathcal{L}_{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}, \, t) \right\|^2 \right]

That's it. Sample a timestep, sample noise, corrupt data, predict the noise. This simple objective -- dropping the ELBO weighting terms -- actually produced better samples in practice.
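That recipe translates almost line for line into code. The sketch below estimates $\mathcal{L}_{simple}$ for any callable `eps_model(xt, t)`. To keep it self-contained there is no neural network: the two stand-in "models" are hypothetical, one that always predicts zero and one oracle that is exact when all the data sit at a single known point.

```python
import numpy as np

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def simple_loss(eps_model, x0, rng):
    """Monte Carlo estimate of L_simple for a batch x0 of shape (n, d)."""
    n = x0.shape[0]
    t = rng.integers(1, T + 1, size=n)              # sample a timestep, uniform on {1..T}
    ab = alpha_bars[t - 1][:, None]
    eps = rng.standard_normal(x0.shape)             # sample noise
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps # corrupt data
    residual = eps - eps_model(xt, t)               # predict the noise
    return np.mean(np.sum(residual**2, axis=1))

point = np.array([0.5, -1.0, 2.0])                  # toy dataset: every sample equals this

def oracle_eps(xt, t):
    """Exact noise recovery, possible only because x_0 is known to equal `point`."""
    ab = alpha_bars[t - 1][:, None]
    return (xt - np.sqrt(ab) * point) / np.sqrt(1.0 - ab)

def zero_eps(xt, t):
    """Trivial baseline that ignores its input."""
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
x0 = np.tile(point, (10000, 1))
loss_oracle = simple_loss(oracle_eps, x0, rng)
loss_zero = simple_loss(zero_eps, x0, rng)
```

The oracle drives the loss to zero, while the zero predictor scores about $d$ (here 3), the expected squared norm of standard Gaussian noise. A real model lands somewhere in between.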

THE CONNECTION

Here's the punchline: DDPM's noise prediction $\boldsymbol{\epsilon}_\theta$ and NCSN's score estimation $\mathbf{s}_\theta$ are the same thing up to scaling. Specifically: $$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$ Predicting the noise is estimating the score. The two lines of work converged to the same mathematical object from completely different starting points.
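The scaling factor falls straight out of the Gaussian transition kernel $q(\mathbf{x}_t | \mathbf{x}_0)$ defined earlier. Taking the gradient of its log-density and substituting $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$:

```latex
\begin{aligned}
\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0)
  &= -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0}{1 - \bar{\alpha}_t} \\
  &= -\frac{\sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}}{1 - \bar{\alpha}_t}
   = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}
\end{aligned}
```

This is the denoising score matching target with noise scale $\sqrt{1 - \bar{\alpha}_t}$ in place of $\sigma$, so a network regressing onto $\boldsymbol{\epsilon}$ is regressing onto the score up to that factor.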

Improved DDPM (Nichol & Dhariwal, 2021)

Nichol & Dhariwal improved upon DDPM with several practical contributions:

Learned variance: Instead of fixing $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$, they parameterize $\sigma_t^2$ as an interpolation $\exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)$ where $v$ is a learned output, enabling the model to adapt the noise level per timestep.
Cosine noise schedule: Replaced the linear $\beta_t$ schedule with $\bar{\alpha}_t = \frac{f(t)}{f(0)}$ where $f(t) = \cos\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)^2$, which avoids destroying too much information too quickly at early timesteps.
Hybrid objective: Combined the simple $\mathcal{L}_{simple}$ (for $\boldsymbol{\epsilon}$-prediction quality) with the full variational $\mathcal{L}_{vlb}$ (for learning variances), preventing training instabilities.
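The first two items are simple enough to sketch in a few lines. The offset $s = 0.008$ and the clipping of $\beta_t$ at 0.999 are the values reported by Nichol & Dhariwal rather than anything stated in this post, and the function names are made up for the example.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """bar-alpha_t = f(t) / f(0) for t = 0, ..., T, with the squared-cosine f."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover beta_t = 1 - bar-alpha_t / bar-alpha_{t-1}, clipped near t = T."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

def interpolated_variance(v, beta_t, beta_tilde_t):
    """Log-space interpolation between the two fixed variance choices, v in [0, 1]."""
    return np.exp(v * np.log(beta_t) + (1.0 - v) * np.log(beta_tilde_t))

T = 1000
alpha_bar = cosine_alpha_bar(T)
betas = betas_from_alpha_bar(alpha_bar)

# At v = 1 the variance is beta_t, at v = 0 it is tilde-beta_t (illustrative values).
upper = interpolated_variance(1.0, 0.02, 0.01)
lower = interpolated_variance(0.0, 0.02, 0.01)
middle = interpolated_variance(0.5, 0.02, 0.01)
```

Interpolating in log space keeps the predicted variance positive and bounded between the two extremes without constraining the network output beyond $v \in [0, 1]$.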

Classifier-Free Guidance (Ho & Salimans, 2022)

For conditional generation (e.g., text-to-image), classifier-free guidance became the standard approach. The idea: train a single model that handles both conditional and unconditional generation by randomly dropping the conditioning signal $\mathbf{c}$ during training. At inference, combine the conditional and unconditional predictions, extrapolating away from the unconditional one:

\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, \mathbf{c}) = (1 + w) \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c}) - w \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)

where $w > 0$ is the guidance scale. In score function language, this is equivalent to:

\hat{\mathbf{s}}(\mathbf{x}_t, \mathbf{c}) = \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) + (1 + w) \nabla_{\mathbf{x}_t} \log p_t(\mathbf{c} | \mathbf{x}_t)

Higher guidance scale = more faithful to the condition, at the cost of diversity. This technique is used in virtually every modern text-to-image model.
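Both views of guidance fit in a few lines. The sketch below uses random vectors as stand-ins for the two network outputs (an assumption, since no model is trained here) and checks that the $\boldsymbol{\epsilon}$-space rule and the score-space rule agree once you convert with $\mathbf{s} = -\boldsymbol{\epsilon}/\sqrt{1 - \bar{\alpha}_t}$.

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance in noise-prediction space."""
    return (1.0 + w) * eps_cond - w * eps_uncond

def eps_to_score(eps, noise_std):
    """Convert a noise prediction to a score; noise_std = sqrt(1 - bar-alpha_t)."""
    return -eps / noise_std

rng = np.random.default_rng(0)
eps_c = rng.standard_normal(4)      # stand-in for eps_theta(x_t, c)
eps_u = rng.standard_normal(4)      # stand-in for eps_theta(x_t, null)
noise_std = 0.7                     # arbitrary illustrative value
w = 2.0

eps_hat = guided_eps(eps_c, eps_u, w)

# Score view: conditional minus unconditional score is grad log p(c | x_t) by Bayes' rule.
s_c = eps_to_score(eps_c, noise_std)
s_u = eps_to_score(eps_u, noise_std)
implicit_classifier_grad = s_c - s_u
score_hat = s_u + (1.0 + w) * implicit_classifier_grad
```

Setting `w = 0` returns the plain conditional prediction, and the cost of guidance is two network evaluations per step instead of one.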

07 // THE SDE UNIFICATION

Song et al. (2021) provided the grand unification. They showed that both NCSN and DDPM are discretizations of stochastic differential equations. The forward process is a continuous-time SDE:

d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}

where $\mathbf{w}$ is a Wiener process. Different choices of $\mathbf{f}$ and $g$ recover different models:

VE SDE (Variance Exploding): $\mathbf{f} = \mathbf{0}$, $g(t) = \sqrt{\frac{d[\sigma^2(t)]}{dt}}$ -- recovers NCSN
VP SDE (Variance Preserving): $\mathbf{f} = -\frac{1}{2}\beta(t)\mathbf{x}$, $g(t) = \sqrt{\beta(t)}$ -- recovers DDPM
sub-VP SDE: a variant with less variance at intermediate times

The remarkable result from Anderson (1982) is that the reverse-time SDE is:

d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right] dt + g(t)\,d\bar{\mathbf{w}}

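Here $\bar{\mathbf{w}}$ is a Wiener process running backward in time and $dt$ is a negative infinitesimal timestep. The only unknown is $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ -- the score at time $t$. Estimate the score with a neural network trained via denoising score matching, then solve the reverse SDE to generate samples.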

Even better, there's a deterministic counterpart -- the probability flow ODE:

d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right] dt

This ODE defines the same marginal distributions $p_t(\mathbf{x})$ as the SDE but without stochasticity. It enables exact likelihood computation, deterministic encoding, and faster sampling with adaptive ODE solvers.
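To make the reverse-time SDE and the probability flow ODE concrete, here is a sketch that integrates both with plain Euler steps for the VP SDE. It assumes a linear $\beta(t)$ from 0.1 to 20 on $t \in [0, 1]$ (the values used by Song et al.) and 1-D Gaussian data, so the exact score is available and no network is needed. The function names and step count are illustrative.

```python
import numpy as np

beta_min, beta_max = 0.1, 20.0
data_mean, data_std = 2.0, 0.5          # toy data distribution

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha_bar(t):
    """exp(-integral of beta from 0 to t): the squared signal scale at time t."""
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def score(x, t):
    """Exact score of p_t when the data are N(data_mean, data_std^2)."""
    ab = alpha_bar(t)
    mean = np.sqrt(ab) * data_mean
    var = ab * data_std**2 + (1.0 - ab)
    return -(x - mean) / var

def integrate_backward(x, n_steps, rng=None):
    """Euler steps from t = 1 to t = 0.

    With rng: Euler-Maruyama on the reverse SDE. Without: the probability flow ODE.
    """
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur                     # negative: time runs backward
        f = -0.5 * beta(t_cur) * x              # VP drift
        g2 = beta(t_cur)                        # squared diffusion coefficient
        if rng is None:
            x = x + (f - 0.5 * g2 * score(x, t_cur)) * dt
        else:
            z = rng.standard_normal(x.shape)
            x = x + (f - g2 * score(x, t_cur)) * dt + np.sqrt(g2 * -dt) * z
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(5000)                 # prior samples at t = 1
sde_samples = integrate_backward(x_T, n_steps=1000, rng=rng)
ode_samples = integrate_backward(x_T, n_steps=1000)
```

Both samplers turn the same prior samples into approximately the data distribution. The ODE version is a deterministic map of `x_T`, which is what makes encoding and likelihood evaluation possible.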

08 // THE TRAINING OBJECTIVE (UNIFIED)

The unified training objective for a time-conditional score network $\mathbf{s}_\theta(\mathbf{x}, t)$ is:

\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0, T)} \, \mathbb{E}_{\mathbf{x}(0) \sim p_0} \, \mathbb{E}_{\mathbf{x}(t) \sim p_{0t}(\cdot|\mathbf{x}(0))} \left[ \lambda(t) \left\| \mathbf{s}_\theta(\mathbf{x}(t), t) - \nabla_{\mathbf{x}(t)} \log p_{0t}(\mathbf{x}(t) | \mathbf{x}(0)) \right\|^2 \right]

For the VP SDE (DDPM), the transition kernel is $p_{0t}(\mathbf{x}(t) | \mathbf{x}(0)) = \mathcal{N}(\mathbf{x}(t); \sqrt{\bar{\alpha}(t)}\,\mathbf{x}(0), (1-\bar{\alpha}(t))\mathbf{I})$ with $\bar{\alpha}(t) = \exp\left(-\int_0^t \beta(s)\,ds\right)$, and the conditional score is $-\boldsymbol{\epsilon}/\sqrt{1-\bar{\alpha}(t)}$. Three equivalent parameterizations exist:

$\boldsymbol{\epsilon}$-prediction: predict the noise $\boldsymbol{\epsilon}$ (DDPM style)
Score prediction: predict $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ (NCSN style)
$\mathbf{x}_0$-prediction: predict the clean data directly

They're all equivalent up to a time-dependent scaling factor. The choice of parameterization (and weighting $\lambda(t)$) affects sample quality vs. likelihood in practice, but the underlying math is the same.
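The conversions between the three parameterizations are one-liners. A minimal sketch, with hypothetical function names and `ab` standing for $\bar{\alpha}_t$ at the timestep in question:

```python
import numpy as np

def x0_from_eps(xt, eps, ab):
    """Clean-data estimate implied by a noise prediction."""
    return (xt - np.sqrt(1.0 - ab) * eps) / np.sqrt(ab)

def eps_from_x0(xt, x0, ab):
    """Noise estimate implied by a clean-data prediction."""
    return (xt - np.sqrt(ab) * x0) / np.sqrt(1.0 - ab)

def score_from_eps(eps, ab):
    """Score estimate implied by a noise prediction."""
    return -eps / np.sqrt(1.0 - ab)

def eps_from_score(score, ab):
    """Noise estimate implied by a score prediction."""
    return -np.sqrt(1.0 - ab) * score

rng = np.random.default_rng(0)
ab = 0.3                                        # arbitrary illustrative bar-alpha_t
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0_roundtrip = x0_from_eps(xt, eps, ab)
eps_roundtrip = eps_from_score(score_from_eps(eps, ab), ab)
```

Note that the scaling factors blow up at the extremes: dividing by $\sqrt{\bar{\alpha}_t}$ is ill-conditioned near pure noise, and dividing by $\sqrt{1 - \bar{\alpha}_t}$ is ill-conditioned near clean data, which is one reason the choice matters in practice.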

09 // FAST SAMPLING: DDIM AND BEYOND

The original DDPM requires ~1000 denoising steps to generate a sample. Song et al. (2021b) introduced DDIM (Denoising Diffusion Implicit Models), which defines a family of non-Markovian forward processes that share the same marginals $q(\mathbf{x}_t | \mathbf{x}_0)$ but allow deterministic sampling:

\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } \mathbf{x}_0} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}_{\text{direction pointing to } \mathbf{x}_t} + \underbrace{\sigma_t \boldsymbol{\epsilon}_t}_{\text{random noise}}

When $\sigma_t = 0$, the process is fully deterministic -- this is the DDIM sampler, which is essentially a discretization of the probability flow ODE. It enables sampling in as few as 10-50 steps with reasonable quality.
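Here is a sketch of the update above used as a 50-step sampler on a 1000-step schedule. As a stand-in for a trained $\boldsymbol{\epsilon}_\theta$, it uses the exact optimal noise prediction for 1-D Gaussian data. That assumption, along with the helper names and the evenly spaced timestep subsequence, is an illustrative choice.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
# Index 0 is bar-alpha_0 = 1 (clean data); index t is bar-alpha_t.
alpha_bars = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

data_mean, data_std = 2.0, 0.5

def optimal_eps(xt, t):
    """Exact E[eps | x_t] when the data are N(data_mean, data_std^2)."""
    ab = alpha_bars[t]
    var = ab * data_std**2 + (1.0 - ab)
    return np.sqrt(1.0 - ab) * (xt - np.sqrt(ab) * data_mean) / var

def ddim_step(xt, t, t_prev, eps_model, sigma_t=0.0, rng=None):
    """One DDIM update from timestep t to t_prev (sigma_t = 0 is deterministic)."""
    ab, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = eps_model(xt, t)
    x0_pred = (xt - np.sqrt(1.0 - ab) * eps) / np.sqrt(ab)      # predicted x_0
    direction = np.sqrt(1.0 - ab_prev - sigma_t**2) * eps       # direction pointing to x_t
    noise = sigma_t * rng.standard_normal(xt.shape) if sigma_t > 0 else 0.0
    return np.sqrt(ab_prev) * x0_pred + direction + noise

def ddim_sample(x_T, eps_model, n_steps):
    """Deterministic sampling over an evenly spaced subsequence of timesteps."""
    timesteps = np.linspace(T, 0, n_steps + 1).round().astype(int)
    x = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_step(x, t, t_prev, eps_model)
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(5000)
samples = ddim_sample(x_T, optimal_eps, n_steps=50)
```

With only 50 steps the samples land close to the data distribution, though the coarse discretization leaves a small bias in the spread. Because no noise is injected, the same `x_T` always maps to the same output.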

This opened the floodgates for fast sampling research: DPM-Solver, consistency models, progressive distillation, and other techniques that further reduce the number of required function evaluations.

10 // SUMMARY

The full picture, from 30,000 feet:

Score function = gradient of log-density (avoids normalizing constants)
Score matching = learn the score without knowing the true density
Denoising score matching = learning to denoise is learning the score
NCSN = denoising score matching across multiple discrete noise levels + annealed Langevin sampling
DDPM = variational approach that independently arrived at noise prediction (= score estimation)
SDE framework = continuous-time unification showing NCSN and DDPM are discretizations of the same family of SDEs
DDIM / fast sampling = deterministic sampling via probability flow ODE, enabling practical generation speeds

Two research communities -- one thinking about score functions and Langevin dynamics, the other about variational bounds and hierarchical latent variables -- converged on the same mathematical object. The score function is the bridge, and denoising is the universal language they both speak.

11 // REFERENCES

Hyvärinen, A. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6, 695-709, 2005.
Vincent, P. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7), 1661-1674, 2011.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015.
Song, Y. & Ermon, S. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019.
Ho, J., Jain, A. & Abbeel, P. Denoising Diffusion Probabilistic Models. NeurIPS 2020.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S. & Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
Nichol, A. & Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. ICML 2021.
Song, J., Meng, C. & Ermon, S. Denoising Diffusion Implicit Models. ICLR 2021.
Ho, J. & Salimans, T. Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications; arXiv preprint, 2022.
Anderson, B. D. O. Reverse-time Diffusion Equation Models. Stochastic Processes and their Applications, 12(3), 313-326, 1982.
