A look at the recent Score-Based Generative Modeling through Stochastic Differential Equations paper by Yang Song et al. . Made by Sayantan Das using W&B

- Why Score Based?
- Score Estimation and Score Matching
- Sample Generation and Langevin Dynamics
- Gaussian Perturbations of data
- Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021)
- Experiments
- Conclusions
- Appendix

Before we get into score-based modeling, it's important to first understand what a "score" is in this context. Given a probability density function p(x) the 'score' is defined as

\nabla_\mathbf{x} \log p(\mathbf{x}),

or the gradient of the log-likelihood of the object x with respect to the input dimensions x. In this case, the object x is input dimensions.

Instead of working with a probability density function (pdf) , we work with the gradients and the gradients here are with respect to the input dimensions and notably not w.r.t the model parameters \theta.

In this report, we will assume that pdfs are continuous random variables.

The score is a vector field of the gradient at any point x. This gradient of \log p(x) tells us the directions in which to move if we want to increase the likelihood as much as possible.

Score-based generative models are trained to estimate \nabla_\mathbf{x} \log p(\mathbf{x}).

Unlike Likelihood-based models like Normalizing Flows or Autoregressive models, these models are easier to parameterize and do not need to be normalized.

Moreover, \nabla_\mathbf{x} \log p(\mathbf{x}) behaves like an unconstrained function so modeling it is easier.

In EBMs, we treat p_\theta(x) as a function

p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}

where f_\theta(x) is an energy function parameterized by \theta and Z_\theta is the normalization constant aka partition function, which is introduced here for the same reasons.

\mathbb{E}_{p_{data}}[-\log p_\theta (x)] = \mathbb{E}_{p_{data}}[\log f_\theta (x) - \log Z_\theta]

where \log Z_\theta is intractable.

\nabla_\mathbf{x} \log p(\mathbf{x}) = - \nabla_\mathbf{x} \log f(\mathbf{x}) - \sout{ \nabla_\mathbf{x} \log Z_\theta}

Idea: Learn \theta by fitting \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) to \nabla_\mathbf{x} \log p_{data}(\mathbf{x}).

This is also called score estimation.

CelebA sampling. Extracted from https://github.com/ermongroup/ncsn

CIFAR10 sampling. Extracted from https://github.com/ermongroup/ncsn

Left: \nabla_x \log p_{data} (x) , Right: S_\theta(x). The data density is encoded using an orange colormap: darker colour implies higher density. Red rectangles highlight regions where model scores are close to the data scores. Source: https://arxiv.org/abs/1907.05600

The caveat to all of this is : we don't have p_{data}. In this article,

we assume that

\{{x1,x2,...x_N}\}
\stackrel{iid}{\sim} p_{data}(x) = p(x).

Task: Estimate the score \nabla_x \log p(x)

Score Model: A vector-valued function S_\theta(x) : \R^D \rightarrow \R^D.

Objective: How to compare two vector fields of scores?

Average Euclidean distance over the whole space.

\frac{1}{2} \mathbb{E}_{p(x)}[\lVert\nabla_x \log p(x) - S_\theta(x)\rVert_2^2] which is also referred to as Fisher Divergence.

This expression is simplified using Integration By Parts method which gives us

\mathbb{E}_{p(x)}[\frac{1}{2}\lVert S_\theta(x)\rVert_2^2 + \mathrm{tr}(\nabla_x S_\theta(x))]

This was derived by Aapo Hyvarinen in 2005. On further simplification by lifting the Expectation terms, we obtain

\approx \frac{1}{2} \sum_{i=1}^{N}[\frac{1}{2}\lVert S_\theta(x_i)\rVert_2^2 + \mathrm{tr}(\nabla_x S_\theta(x_i))],

\{x_1,x_2,...,x_N\} \stackrel{iid}{\sim} p(x).

Some takeaways:

- Trace of the jacobian of the score is used in order to make local maxima.
- The score objective says "Choose \theta such that every data point in the training set becomes a local maxima of our estimated density".

Vanilla Score Matching has its drawbacks:

- "Not Scalable"

Let's take an example of a simple score model that computes \lVert S_\theta (x)\rVert^2_2 on forward pass.

Score Matching is not Scalable

The second of our score matching objective requires O(D) backpropagations which is expensive in practice.

A remedy that the community has come up with: Sliced Score Matching

- Project the vectors onto random directions and attach an Expectation over these directions over the original objective.

This is cheaper in practice than the vanilla formulation. This sliced fisher divergence is as follows:

\frac{1}{2}\mathbb{E}_{p_v}\mathbb{E}_{p_{data}}[(v^T\nabla_x \log p_\theta(x)-v^T\nabla_x \log p_{data}(x))^2 ],

where v is a random direction based on some distribution p_v. It has been found that Multivariate Rademacher Distribution is a good candidate for this distribution other than Multivariate Standard Normal.

In computational statistics and recently in generative modeling, Langevin sampling has had great success. Langevin Monte Carlo is a Markov Chain Monte Carlo (MCMC) method for obtaining random samples from probability distributions for which direct sampling is difficult. The goal is to "follow the gradient but add a bit of noise" so as to not get stuck at the local optima regions and thus we are able to explore the distribution and sample from it.

- Sample from p(x) using only the score \nabla_x \log p(x).
- Initialize \stackrel{\sim}{x_0} \sim \pi(x).
- Repeat for t \leftarrow 1,2,...,T:
- z_t \sim \mathcal{N}(0,1)
- \stackrel{\sim}{x_t}\leftarrow\stackrel{\sim}{x}_{t-1} + \frac{\epsilon}{2} \nabla_x \log p(\stackrel{\sim}{x}_{t-1}) + \sqrt[]{\epsilon}z_t

gradient gaussian noise

Created by Sayantan Das

Score Matching has various pitfalls:

Pitfall #1: Manifold Hypothesis

This hypothesis states that high dimensional data often tends to lie in a low dimensional manifold -- which makes \nabla_x \log p_{data}(x) undefined in some regions.

Pitfall #2: Inaccurate Score Estimation in Low Data-Density regions

When most of the samples will be around the modes of the distribution, the vector field can be misled around low density regions where not enough samples can guide the correct direction to the score vector.

Pitfall #3: Slow Mixing of Langevin Dynamics between data modes

When the distribution has disconnected supports -- or is a mixture of two disjoint components with a weighting coefficient \pi, scores cannot recover this coefficient and is invariant towards mode weights. This is a failure mode of Langevin Dynamics as the density is not supported over the whole space.

This is also known as "Annealed Langevin Dynamics"

- Sample using \sigma_1,\sigma_2,...,\sigma_L sequentially with Langevin Dynamics(LD) where,

\sigma_1>\sigma_2>...>\sigma_L

- Run LD_1 with \sigma_1
- Run LD_2 with particles from the previous LD run i.e LD_1and \sigma_2
- and so on.

This was presented in the paper - Noise Conditional Score Networks (NeurIPS 2019).

As we saw in the previous section, gaussian perturbations corrupt the data into random noise. In order to generate samples with score-based models, we need to consider a diffusion process. Scores arise when we reverse this diffusion process leading to sample generation. Let x(t) be a diffusion process, indexed by a continuous random variable of time t:0,...,T. A diffusion process is governed by a stochastic differential equation (SDE), in the following form

d\textbf{x} = \textbf{f}(x,t)dt + g(t)d\textbf{w},

where \textbf{f}(.,t) is the drift coefficient of the SDE , g(t) is the diffusion coefficient, w represents the standard brownian motion. It is this w that makes SDEs the stochastic generalization of an Ordinary Differential Equation (ODE) -- since the particles not only follow the deterministic drift guided by f(x,t), but are also affected by the random noise coming from g(t)dw.

For score based generative modeling, the diffusion process needs to be bounded i.e. x(0) \sim p_0 and x(T) \sim p_T where p_t(x) is used to denote the distribution for x(t). Here p_0 is the data distribution where we have a dataset of i.i.d. samples, and p_T is the prior distribution that has a tractable form and easy to sample from. The noise perturbation by the diffusion process is large enough to ensure p_T does not depend on p_0 .

By starting out with a sample from p_T and reversing this diffusion SDE, we can obtain samples from our data distribution p_0. The reverse-time SDE is given as:

d\textbf{x} = [\textbf{f}(\textbf{x},t) - g^2(t)\nabla_x \log p_t(x)]dt + g(t)d\bar{\textbf{w}}

where \bar{w} is the brownian motion in the reverse time direction and dt is the infinitesimal negative time step.

Extracted from the paper.

A time-dependent score function s_\theta(\textbf{x},t) is required to approximate \nabla_x \log p_t(x) in order to numerically solve the reverse-time SDE. This score model is trained using denoising objective:

\min_\theta \mathbb{E}_{t\sim \mathcal{U}(0, T)} [\lambda(t) \mathbb{E}_{\mathbf{x}(0) \sim p_0(\mathbf{x})}\mathbf{E}_{\mathbf{x}(t) \sim p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))}[ \|s_\theta(\mathbf{x}(t), t) - \nabla_{\mathbf{x}(t)}\log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))\|_2^2]]

where \mathcal{U}(0,T) is a uniform distribution over [0, T], p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) denotes the transition probability from \mathbf{x}(0) to \mathbf{x}(t), and \lambda(t) \in \mathbb{R}_{>0} denotes a positive weighting function.

In the objective, the expectation over \mathbf{x}(0) can be estimated with empirical means over data samples from p_0. The expectation over \mathbf{x}(t) can be estimated by sampling from p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)), which is efficient when the drift coefficient \mathbf{f}(\mathbf{x}, t) is affine. The weight function \lambda(t) is typically chosen to be inverse proportional to \mathbb{E}[\|\nabla_{\mathbf{x}}\log p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) \|_2^2].

The authors discuss various sampling methods in their paper which are as follows:

- Sampling using Numerical Solvers -- such as the Euler-Maruyama approach where dt is approximated as \Delta t.
- Sampling using Predictor-Corrector Methods : These methods use one step of Numerical Solvers like the one discussed above to obtain \textbf{x}_{t-\Delta t} from \textbf{x}_{t} which is called the 'predictor' step. This is followed by applying several steps of Langevin MCMC to refine \textbf{x}_{t} such that it becomes a more accurate sample from p_{t-\Delta t}(x). This is the 'corrector' step as the MCMC helps reduce the error of the numerical SDE solver.

Reason this is popular is that Score-based MCMC approaches can produce samples from underlying distributions when \nabla_x \log p(x) is known -- which in our case is approximated well by the time-dependent score-model.

3. Sampling with Numerical ODE Solvers: For any SDE of the form

d \mathbf{x} = \mathbf{f}(\mathbf{x}, t) d t + g(t) d \mathbf{w},

there exists an associated ordinary differential equation (ODE)

d \mathbf{x} = \bigg[\mathbf{f}(\mathbf{x}, t) - \frac{1}{2}g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})\bigg] dt

such that their trajectories have the same mariginal probability density 𝑝𝑡(𝐱) . Therefore, by solving this ODE in the reverse time direction, we can sample from the same distribution as solving the reverse-time SDE. We call this ODE the probability flow ODE.

This approach is also very popular since when the formulation turns into an ODE there is a possibility to estimate the true likelihood of data or in our case p_0(\textbf{x}), using the change of variable formula. More on this in the ICLR 21 paper.

We reproduce experiments for Controllable Generation from the score_sde repository.

More details about them can be found in Appendix 1.2 in the ICLR paper.

This report presents and summarizes the latest developments in score-based generative models -- with a goal to enable better understanding of existing approaches, new sampling algorithms, exact likelihood computation and conditional generation abilities to the family of score based generative models.

\int_{-\infty}^{\infty} p(x) \,dx = 1