Collapsed Gibbs Sampling for Dirichlet Process Gaussian Mixture Models

I really enjoyed the pedagogy of Edwin Chen’s introduction to infinite mixture models, but I was a little disappointed that it does not go as far as presenting the details of the Dirichlet process Gaussian mixture model (DPGMM), as he uses sklearn’s variational Bayes DPGMM implementation.

For this reason, I will try and give here sufficient information to implement a DPGMM with collapsed Gibbs sampling. This is not an in-depth evaluation of which conjugate priors to use, nor an analysis of the parameters and hyper-parameters (that should have their own priors! ;)).

Prerequisites

On Dirichlet processes, Chinese Restaurant processes, Indian Buffet processes, there is the excellent blog post by Edwin Chen. Another excellent introduction to Dirichlet processes is provided by Frigyik, Kapila and Gupta.

If you lack some knowledge about clustering or density estimation (unsupervised learning), you can read Chapters 20 (p. 284) to (at least) 22 of MacKay’s ITILA, that you can find as a free ebook; or chapter 9 of Bishop’s PRML. As a refresher, the Wikipedia article on mixture models, and the sklearn documentation on GMM are more efficient.

DPGMM: the model

Let’s say we have $N$ observations and $K$ clusters, $i \in [1\dots N]$ is the indice for the observations, while $k \in [1\dots K]$ is the indice for the clusters. With $z_i$ the cluster assignment of observation $x_i$, and $\theta_k$ the parameter of mixture $k$:

So, the generative story of a DPGMM is as follows:

$\pi \sim Stick(\alpha)$ (mixing rates)
$z_i \sim \pi$ (cluster assignments)
$\theta_k \sim H(\lambda)$ (parameters)
$x_i \sim F(\theta_{z_i})$ (values)

Fitting the data

Notation:

Let’s decompose the probability that the observation $i$ belongs to cluster $k$ into its two independent factors:

Then:

is the marginal likelihood of all the data assigned to cluster $k$, including $i$.

If $z_i = k^*$ (new cluster) then:

Conjugate priors

Now we should choose $H$ for it to be conjugate to $F$ and have easy to compute parameters posterior. As we want $F$ to be multivariate normal: we can look on Wikipedia’s page of conjugate priors under multivariate normal with unknown $\mu$ and $\Sigma$ to see that $H$ should be normal-inverse-Wishart with prior parameters:

• $\mu_0$ initial mean guess [In my code further, I set it to the mean of whole the dataset.]
• $\kappa_0$ mean fraction (smoothing parameter) [A common value is 1. I set it to 0.]
• $\nu_0$ degrees of freedom [I set it to the number of dimensions.]
• $\Psi_0$ pairwise deviation product (matrix) [I set it to $10 \times I_d$ ($I_d$ is the $d\times d$ identity matrix). Indentity matrix makes this prior Gaussian circular, the $10$ factor should be dependant on the dataset, for instance on the mean distance between points.]

This gives us MAP estimates on parameters, for one of the clusters:

with $\tilde{x}$ the sample mean and $C=\sum_{i=1}^n (x_i-\tilde{x})(x_i-\tilde{x})^T$.

Set $\kappa_{0} = 0$ to have no effect of the prior on the posterior mean. This reduces to MLE estimates if:

So now we can compute the posterior predictive for cluster $k$ evaluated at $x_i$

Collapsed Gibbs sampling

Here is the pseudo-code of collapsed Gibbs sampling adapted from algorithm 3 of Neal’s seminal paper:

while (not converged on mus and sigmas):
for each i = 1 : N in random order do:
remove x[i]'s sufficient statistics from old cluster z[i]
if any cluster is empty, remove it and decrease K
for each k = 1 : K do
compute P_k(x[i]) = P(x[i] | x[-i]=k)
N[k,-i] = dim(x[-i]=k)
compute P(z[i]=k | z[-i], Data) = N[k,-i] / (alpha + N - 1)
compute P*(x[i]) = P(x[i] | lambda)
compute P(z[i]=* | z[-i], Data) = alpha / (alpha + N - 1)
normalize P(z[i] | ...)
sample z[i] from P(z[i] | ...)
add x[i]'s sufficient statistics to new cluster z[i]
(possibly increase K)


Results

Here is the result of our implementation of collapsed Gibbs sampling DPGMM compared to scikit-learn’s implementation of variational Bayes DPGMM:

Code

Here is my quick-and-dirty code implementing this version of Gibbs sampling for DPGMM. You may want to comment out scikit-learn (that I used for the comparison above) if you do not have it installed.

 
 
 From Hacks to Bayesian Probability Mar 8th, 2013 In which we look at two pragmatic hacks that lead to the Bayesian approach of probabilities, when pushed further and added as constraints. Coinflips Let’s say we have a coin, and we want to decide if it’s fair. We throw it $N$ times and we get $m$ heads, we can code heads=1, tails=0. With $\mu$ the ratio of heads: P(m | N, \mu) = Binomial(m|N,\mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m} Maximum likelihood How do we set $\mu$? We could maximize the probability of the data that we saw under our model, that is maximizing the likelihood. Let’s say that $D = {x_1 \dots x_N}$, then we have: P(D|\mu) = \prod_{n=1}^N P(x_n|\mu) = \prod_{n=1}^N \mu^{x_n}(1-\mu)^{1-x_n} The maximum of this function of $\mu$ is reached for $\mu= \frac{m}{N}$. The problem arises if we have little data (in fact, when we have data that does not cover the whole space of possible data). If $D=(1,1,1)$, the maximum likelihood estimate of $\mu$ will be $1.0$. It means that we predict that all the tosses will land on heads, after only three observations! Smoothing A classical hack is to smooth the maximum likelihood estimate by adding “fake data”, we could consider that we already saw the coin land on heads and tails once, before getting our data. This way, before (“prior to”) the experiment, we would have $\mu=1/2=0.5$. After (posterior to) our experiment, taking the data into account, we would have $\mu = (3+1)/(3+2) = 0.8$. How do we set the these prior coin flips (smoothing parameters)? Maximum A Posteriori The right way to encode this prior knowledge is to put a probability distribution on the parameter $\mu$. As $\mu$ is a ratio, we should have a continuous distribution on $[0, 1]$ that can represent a whole range of prior belief on what the coin’s ratio of heads is. For these reasons, a sensible choice is the Beta distribution: Beta(\mu|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1}(1-\mu)^{b-1} On Wikpedia, we can check how the $Beta(x|\alpha, \beta)$ distribution looks like: Now we can compute again what is the posterior value of $\mu$ knowing the data $D$ and the prior Beta ($\propto$ means “proportional to”): % <![CDATA[ \begin{align} P(\mu | a_0, b_0, D) & \propto P(D|\mu)P(\mu|a_0, b_0)\\ & \propto \left( \prod_{n=1}^N \mu^{x_n} (1-\mu)^{1-x_n} \right) Beta(\mu | a_0, b_0) \end{align} %]]> Hopefully, the Beta distribution is the conjugate prior for the Bernouilli and binomial distributions, and thus a bit of calculus reduces it to: P(\mu | a_0, b_0, D) \propto Beta(\mu|a_N, b_N)\\ a_N = a_0 + m\\ b_N = b_0 + (N-m) We can compute that, when $N \rightarrow \infty$, the expectation of $\mu$: $\mathbb{E}[\mu] = \mu_{ML}$, as: \mathbb{E} [\mu | a_0, b_0, D] = \frac{a_N}{b_N} First conclusion This approach of using a prior on the parameters of the distributions that are essential to our model (the predicting distribution) is central to the Bayesian approach of building models. It makes the model robust to what can happen, even though we had few data. It makes it easier to reason about our prior assumptions that simply “adding unseen data”, and it yields in the presence of more data. If you’re interested about Bayesian modeling, there are plenty of very good textbooks. My prefered gradual introduction is MacKay’s ITILA, that you can find as a free ebook. Causality Now here is another hack for logical reasoning, that leads to Bayesian probabilities. Let’s say that you want to express that an event $A$ entails an event $B$, in logic you would write $A \Rightarrow B$. We will be abusing the notation $A=[A=true]$ and $\neg A=[A=false]$. Now with the modus ponens, you can deduce $B$ whenever $A$ is true. \frac{[A\Rightarrow B] \wedge A}{B} Plausible reasoning Now, we want to extend prepositional logic to plausible reasoning, in which we can have degrees of probability that rules are true; or degrees of belief in these rules and facts. A pragmatic way to do that is to introduce the variable $C$ which represents $A \Rightarrow B$, that is: if $P(C)=p$, there is a probability $p$ that $A \Rightarrow B$. Then, this previous modus ponens translates to: P(B|A,C) = \frac{P(A|B,C)P(B|C)}{P(A|C)}\ (Bayes'\ theorem)\\ P(B|A,C)=\frac{P(A,B|C)}{P(A|C)}\ (Product\ rule) And actually, as $P(A,B|C)=P(A|C)$, we have $P(B|A,C)=1$, which corresponds to the strong syllogism of modus ponens. So now, if we are only 80% sure of $C$, we can write $P(C) = 0.8$ and seek for $P(B|A)$ (we are 100% sure of A): % <![CDATA[ \begin{align} P(B|A) = \frac{\sum_{C\in\{false,true\}} P(B|A,C)P(A)P(C)}{P(A)} & = P(\neg C)P(B|A,\neg C) + P(C)P(B|A,C)\\ & = 0.2*x(\in [0,1]) + 0.8*1.0 \geq 0.8 \end{align} %]]> Which means that $B$ has 80% chances to be true by following the strong syllogism of modus ponens, but it can also be true even though $C=false$. Finally, contrary to prepositional logic, we also get the weak syllogism (and I’ll let you think it through): \frac{[A\Rightarrow B] \wedge B}{A\ becomes\ more\ plausible} A similar derivation and observation can be done for modus tollens. Cox-Jaynes theorem A reasoning mechanism needs to be consistent (one cannot prove $A$ and $\neg A$ at the same time). For plausible reasoning, consistency means: a) all the possible ways to reach a conclusion leads to the same result, b) information cannot be ignored, c) two equal states of knowledge have the same plausibilities. Adding consistency to plausible reasoning leads to Cox’s theorem, which derives the laws of probability (the product-rule and the sum-rule). So, the degrees of belief of any consistent induction mechanism verify Kolmogorov’s axioms. Second and last conclusion With plausible reasoning, we get all the benefits of prepositional logic, but we can also reason with/about facts and rules that are not 100% true. We have another example of how a pragmatical (sensical) hack to extend logic to “degrees of beliefs” (probabilities) leads to Bayesian probabilities. If you are interested by learning about plausible reasonning, you can look at my thesis, or, better yet, read it directly from one of the masters in Jayne’s Probability Theory: The Logic of Science for which the pre-print is there. Blog Archives 
 Recent Posts Collapsed Gibbs Sampling for Dirichlet Process Gaussian Mixture Models From hacks to Bayesian probability GitHub Repos Status updating… @SnippyHolloW on GitHub $.domReady(function(){ if (!window.jXHR){ var jxhr = document.createElement('script'); jxhr.type = 'text/javascript'; jxhr.src = '/javascripts/libs/jXHR.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(jxhr, s); } github.showRepos({ user: 'SnippyHolloW', count: 4, skip_forks: true, target: '#gh_repos' }); }); Latest Tweets Status updating…$.domReady(function(){ getTwitterFeed("syhw", 4, false); }); Follow @syhw My Pinboard Fetching linkroll… My Pinboard Bookmarks » var linkroll = 'pinboard_linkroll'; //id target for pinboard list var pinboard_user = "syhw"; //id target for pinboard list var pinboard_count = 5; //id target for pinboard list (function(){ var pinboardInit = document.createElement('script'); pinboardInit.type = 'text/javascript'; pinboardInit.async = true; pinboardInit.src = '/javascripts/pinboard.js'; document.getElementsByTagName('head')[0].appendChild(pinboardInit); })(); 
 
 Copyright © 2013 - syhw - Powered by Octopress (function(d, s, id) { var js, fjs = d.getElementsByTagName(s)[0]; if (d.getElementById(id)) {return;} js = d.createElement(s); js.id = id; js.src = "//connect.facebook.net/en_US/all.js#appId=212934732101925&xfbml=1"; fjs.parentNode.insertBefore(js, fjs); }(document, 'script', 'facebook-jssdk')); (function() { var script = document.createElement('script'); script.type = 'text/javascript'; script.async = true; script.src = 'https://apis.google.com/js/plusone.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(script, s); })(); (function(){ var twitterWidgets = document.createElement('script'); twitterWidgets.type = 'text/javascript'; twitterWidgets.async = true; twitterWidgets.src = 'http://platform.twitter.com/widgets.js'; document.getElementsByTagName('head')[0].appendChild(twitterWidgets); })();