Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, 2020.

@inproceedings{song2020sliced,
  title={Sliced score matching: A scalable approach to density and score estimation},
  author={Song, Yang and Garg, Sahaj and Shi, Jiaxin and Ermon, Stefano},
  booktitle={Uncertainty in Artificial Intelligence},
  pages={574--584},
  year={2020},
  organization={PMLR}
}

PDF: song20a.pdf (mlr.press)

Background

In short, we want to estimate the probability distribution of the data.

We want to find the parameters of a model such that the model distribution is close to the data distribution. The model represents a parametrized probability distribution over the data, which we call the model distribution.

Our dataset contains $N$ samples, where $x_i$ denotes the $i$-th data point. From the family of models with parameter space $\Theta$, we want to find a single distribution $p_\theta$ with $\theta \in \Theta$ by minimizing the distance between $p_{data}$ and $p_\theta$; once fitted, we can generate new samples from $p_\theta$.
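
As a concrete toy example (an illustrative sketch, not from the paper): fitting a one-dimensional Gaussian model $p_\theta$ by minimizing the average negative log-likelihood of the samples, which for a fixed dataset is equivalent to minimizing the KL divergence between $p_{data}$ and $p_\theta$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy dataset: N samples x_i drawn from some unknown data distribution.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

def neg_log_likelihood(theta, x):
    """Average negative log-likelihood of a Gaussian model p_theta."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # parametrize sigma > 0
    return np.mean(0.5 * ((x - mu) / sigma) ** 2 + log_sigma
                   + 0.5 * np.log(2 * np.pi))

# Minimizing D_KL(p_data || p_theta) over theta amounts to maximizing
# the expected log-likelihood, estimated on the samples.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # close to 2.0 and 0.5

# Once theta is fitted, we can generate samples from p_theta:
samples = rng.normal(mu_hat, sigma_hat, size=10)
```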

BUT the data distribution is very complex for high-dimensional data.

We will start from a Gaussian distribution, which can be viewed as a computational graph with 2 layers: the data points as inputs, and a single unit that outputs the density of the probability distribution of those points.
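
As a minimal sketch of that two-layer graph (the helper name here is hypothetical):

```python
import numpy as np

def gaussian_density(x, mu=0.0, sigma=1.0):
    """Single output 'unit' computing the Gaussian pdf of a data point x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

print(gaussian_density(0.0))  # 1/sqrt(2*pi) ~ 0.3989
```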

This distribution is too simple to model high-dimensional data, so we need to add more layers and build a deeper computational graph, i.e. a neural network, to model the probability distribution $p_\theta$, where $\theta$ denotes the weights of the network.
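
For illustration, a minimal sketch of such a deeper graph: a small multilayer perceptron $f_\theta: \mathbb{R}^d \to \mathbb{R}$ in PyTorch (the architecture is an arbitrary assumption, not one from the paper):

```python
import torch
import torch.nn as nn

# A deeper computational graph f_theta: R^d -> R, where theta are the
# network weights. The architecture below is an arbitrary illustration.
d = 32
f_theta = nn.Sequential(
    nn.Linear(d, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

x = torch.randn(16, d)  # a batch of 16 high-dimensional inputs
out = f_theta(x)        # shape (16, 1): one scalar per input
```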

BUT, how do we build a deep neural network that models a probability distribution? A neural network converts a high-dimensional input into a simple one-dimensional output $f_\theta(x)$. However, $f_\theta(x)$ might not be positive everywhere, which means we cannot directly use the network output as a probability density.
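
A standard remedy, and the setting that motivates score matching, is the energy-based construction $p_\theta(x) = e^{f_\theta(x)} / Z_\theta$: the exponential makes the density positive by design, but the normalizing constant $Z_\theta$ is intractable for deep networks. The score $\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x)$ does not involve $Z_\theta$ at all, which is exactly what score matching exploits. A minimal sketch (same kind of hypothetical network as above):

```python
import torch
import torch.nn as nn

# f_theta(x) can be negative, so it is not a density itself. Define
#   p_theta(x) = exp(f_theta(x)) / Z_theta,
# which is positive by construction. Z_theta is intractable, but the
# score grad_x log p_theta(x) = grad_x f_theta(x) never touches it,
# so we can compute it with autograd.
d = 32
f_theta = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))

x = torch.randn(16, d, requires_grad=True)
log_p_unnormalized = f_theta(x).sum()  # sum over the batch for autograd
score = torch.autograd.grad(log_p_unnormalized, x)[0]  # shape (16, d)
```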