Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon.

Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, 2021.

@article{song2021maximum,
  title={Maximum likelihood training of score-based diffusion models},
  author={Song, Yang and Durkan, Conor and Murray, Iain and Ermon, Stefano},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  pages={1415--1428},
  year={2021}
}

TL;DR: Choosing the weighting function $\lambda(t) = \sigma(t)^2$ turns the score matching objective into an upper bound on $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$, so minimizing it performs (approximate) maximum likelihood training of score-based diffusion models.

Introduction

Efficient Maximum Likelihood Training

As mentioned in [Song et al. ICLR 2021], the weighting function $\lambda(t)$ can be chosen in a theoretically principled way. In particular, it can be chosen so that minimizing the score matching objective (approximately) maximizes the likelihood of the data.

The key observation is a connection between the Kullback–Leibler (KL) divergence and the score matching objective: with the likelihood weighting $\lambda(t) = \sigma(t)^2$, the weighted score matching loss upper-bounds the KL divergence from the data distribution to the model distribution.

$$ \mathrm{KL}(p_{\text{data}} \,\|\, p_\theta) \leq \frac{T}{2} \mathbb{E}_{t \sim \text{Uniform}[0, T]} \left[ \sigma(t)^2 \, \mathbb{E}_{p_t(\mathbf{x})} \left[ \| \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - s_\theta(\mathbf{x}, t) \|_2^2 \right] \right] + \mathrm{KL}(p_T \,\|\, \pi) $$

Here $\pi$ is the prior distribution at the terminal time $T$. The second term does not depend on $\theta$, so minimizing the weighted score matching loss minimizes an upper bound on the KL divergence, and hence maximizes a lower bound on the log-likelihood.
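To make the weighting concrete, below is a minimal sketch (in PyTorch) of a denoising score matching loss with the likelihood weighting $\lambda(t) = \sigma(t)^2$. This is an illustrative sketch assuming a VE-type SDE with a Gaussian perturbation kernel, not the paper's reference implementation; `score_model`, `marginal_std`, and `diffusion_coeff` are hypothetical stand-ins for the score network, the perturbation-kernel standard deviation, and the SDE's diffusion coefficient $\sigma(t)$.

```python
import torch

def likelihood_weighted_dsm_loss(score_model, x0, marginal_std, diffusion_coeff,
                                 T=1.0, t_eps=1e-5):
    """Denoising score matching loss with the likelihood weighting
    lambda(t) = sigma(t)^2 from the bound above.

    Hypothetical callables (assumptions, not the paper's API):
      score_model(x, t)  -> s_theta(x, t), the estimated score of p_t
      marginal_std(t)    -> std of the perturbation kernel p_t(x | x0)
      diffusion_coeff(t) -> sigma(t), diffusion coefficient of the forward SDE
    """
    batch = x0.shape[0]
    # t ~ Uniform[t_eps, T]; t_eps avoids numerical blow-up near t = 0.
    t = torch.rand(batch, device=x0.device) * (T - t_eps) + t_eps

    std = marginal_std(t).view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    xt = x0 + std * noise  # sample from p_t(x | x0) for a VE-type SDE

    # The score of the Gaussian perturbation kernel is -noise / std, so the
    # intractable true score in the bound can be replaced by this target.
    residual = score_model(xt, t) + noise / std

    weight = diffusion_coeff(t) ** 2  # the likelihood weighting sigma(t)^2
    per_example = weight * residual.flatten(start_dim=1).pow(2).sum(dim=1)
    return 0.5 * per_example.mean()
```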

Because this score matching objective is efficient to optimize, it provides an efficient route to (approximate) maximum likelihood training of score-based diffusion models. Using this likelihood weighting, the paper reports improved log-likelihoods on several benchmark datasets.
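As a usage sketch, the weighting only requires $\sigma(t)$ in closed form. Below is a hypothetical instantiation with the geometric noise schedule of the VE SDE from [Song et al. ICLR 2021]; the values of `sigma_min` and `sigma_max`, and the `score_model`/`optimizer` objects, are illustrative assumptions.

```python
import math

# Geometric schedule: sigma(t) = sigma_min * (sigma_max / sigma_min)**t, so
# the squared diffusion coefficient is d[sigma(t)^2]/dt
#   = 2 * sigma(t)^2 * log(sigma_max / sigma_min).
sigma_min, sigma_max = 0.01, 50.0

def marginal_std(t):
    # Std of the perturbation kernel p_t(x | x0), approximating sigma(0) ~ 0.
    return sigma_min * (sigma_max / sigma_min) ** t

def diffusion_coeff(t):
    # sigma(t) in the bound above; its square is the likelihood weighting.
    return marginal_std(t) * math.sqrt(2.0 * math.log(sigma_max / sigma_min))

# One hypothetical training step (score_model, optimizer, x0 assumed to exist):
# loss = likelihood_weighted_dsm_loss(score_model, x0, marginal_std, diffusion_coeff)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```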