# From Logistic Regression to Backprop (and Beyond)

I’ve started a repo where I’d like to put some very basic but also very didactic (i.e. python+numpy and/or C) code about all stuff machine learning.

So the first thing I did about it is a practical for my labmates at LSCP, “From Logistic Regression to Deep Nets, a Crash Course”. It’s very imperfect, but it’s “out there™”.

E.g. it has a very basic intro to stochastic gradient descent:

Now go play with it and break stuff. :)

# Classic Speech Recognition Features in One Picture

Quite some time ago (when I was coding some of this myself for educational purposes), I made a diagram of the succesive transformations that are applied to “raw” (wave files) audio for a classic speech recognition pipeline. I think sharing it would be interesting to some of you, pay close attention to the X-axis labels, and you shouldn’t get lost!

# So You Wanna Try Deep Learning?

I’m keeping this post quick and dirty, but at least it’s out there. The gist of this post is that I put out a one file gist that does all the basics, so that you can play around with it yourself. First of all, I would say that deep learning is simply kernel machines whose kernel we learn. That’s gross but that’s not totally false. Second of all, there is nothing magical about deep learning, just that we can efficiently train (GPUs, clusters) large models (millions of weights, billions if you want to make a Wired headline) on large datasets (millions of images, thousands of hours of speech, more if you’re GOOG/FB/AAPL/MSFT/NSA). I think a good part of the success of deep learning comes from the fact that practitionners are not affraid to go around beautiful mathematical principles to have their model work on whatever dataset and whatever task. But I disgress…

## What is a deep neural network?

A series of matrix multiplications and non-linearities. You take your input $x$ in your features space, multiply it by a matrix $W$ (add biases $b$), apply a non-linearity (Rectified Linear Unit is fashionable these days, that’s $max(0, output)$, but $sigmoid$ and $tanh$ are OK too) and keep on doing that with other layers until you reach a classifier. For instance, you have a 3 layers ReLUs-based neural network with a softmax classifier on top? That gives:

There are all sorts of different mammals, with very strong specificities, but I think I just described a rat (or is it an euarchontoglires?).

I’m just dumping here a collection of links that I think everybody with an interest in deep learning should at least skim:

## Stuff you’ll learn

There I’m getting totally subjective, because I’m telling you stuff that I learned the hard way.

#### Generic

• For all gradient descent related stuff, first RTFM.
• When do we stop the training? Almost everybody does it but nobody speaks about it: early stopping on a validation set.
• If you use $tanh$ or $sigmoid$ activation units, initialize them well, respectively with uniform weights in $[-\sqrt{\frac{6}{\mathrm{fan}_{in} + \mathrm{fan}_{out}}}, \sqrt{\frac{6}{\mathrm{fan}_{in} + \mathrm{fan}_{out}}}]$ or $4$ times that.

## Practice

I’d advise to start by using either Torch (Lua) or Theano (Python), both nice libraries that do automatic differentiation.

I put together a single file simple deep neural network working on small datasets (Python), more for pedagogical purposes than production ready, but it runs relatively fast on GPUs thanks to Theano. So if you want to run it, install Theano (I use the bleeding edge). If you want to play around with it, look for TODO in the code and change values there. There are several datasets that you can use. Also, you should play around with the parameters of this function, and maybe try against the SVMs from scikit-learn. Finally, if you use Dropout, you will see improvement only on large-enough networks (> 1000 units / layer, > 3-4 layers). Here is the result on running this file (python dnn.py) with a small ($784\times200\times200\times10$) ReLU-based L2-regularized network on MNIST:

If your GPU can handle it, you want to try Dropout on MNIST with 4 (or more) layers of 2000 units. ;-)

## Conclusion

I didn’t talk about convolutional neural networks, nor recurrent neural networks, nor other beasts. That should be the next step for the passionate reader. This was just a primer on raw facts for basic deep learning. Depending on what people want, I can either explain function by function the file that I provided here, talk about different loss functions (learning embeddings, e.g. as word2vec), recurrent neural nets, etc.

# A Random Thought About ReLUs

We know ReLU rock, they’re fast to compute, they’re fast to converge (train), they combine well with dropout… When I transitioned to ReLU for good, I found out that in phones recognition the “hard sigmoids” (piecewise approximation with 5 pieces) are doing almost as well as ReLUs (e.g. $\approx$24% phone error rate on TIMIT for a given architecture vs 23.5% IIRC), and much better than “smooth” sigmoids (26%). I’ve been wondering for a few months about how much of the good performance of ReLUs comes from the fact that they have a hard 0, that propagates to the upper layer and makes the higher level activations sparser and sparser. Is it well known? Is this random thought in the bad part of the random walk on the posterior? ;-)

# Visualizing Phn2vec With Biclustering

Following my previous blog post on phn2vec, I used scikit-learn’s biclustering to make the similarity matrices more readable. So here are some quick results for TIMIT:

## Phonetic annotation

### 2 biclusters

We clearly see consonants vs. vowels.

### 4 biclusters

We clearly see a separation in the place (+nasals) in the consonants. Silences get their own cluster.

### 6 biclusters

Fricatives and nasals get their own clusters.

# Phn2vec Embeddings

Several months ago, I started thinking in terms of embeddings for everything, let’s forget about discrete/categorical values and replace everything with vector spaces that behave as we ask of them!1

### xyz2vec

A few months ago, I toyed with “word2vec” (Mikolov et al. 2013) in gensim on a lot of stuff. One of them were phonetic annotations of speech corpora. Basically, a word2vec model is a one hidden layer neural network trained with backpropagation of a loss based on a) either predicting the central word given its neighbors (continuous bag-of-word), or b) predicting the neighbors given the central word (skip-gram). This can be applied to corpora of continuous text of words, but anything that has neightboring structure really. So I ran word2vec on phonetic and phonemic datasets (TIMIT and Buckeye), with a window of 5 phone(me)s (+/- 2 around the central one) and both skip-grams (SG) and continuous bag-of-words (CBOW). For all the following results, I used an embedding dimension of 10, so it is “contractive” compared to the number of phone(me)s (39). I tried with 100 dimensions and this gave very similar results, so this does not seem to matter. All the code to reproduce these results is here.

## phone2vec

Using TIMIT phonetic annotations here are the similarity matrices of phones (SG left and CBOW right):

If we do a 2 dimensional isomap projection of the skip-grams, we can see 3 clusters of vowels, (mainly) plosives (stop consonants) and other consonants (some fricatives, nasals..).2

An isomap of the CBOW gives roughly the same clusters:

Contrary to the TIMIT corpus (read sentences that were designed for phonetic variability and effects), the Buckeye is a corpus of conversational speech. We find the same (but a little bit weaker) clusters, e.g. in an isomap of skip-grams:

## read speech vs. conversational speech

To the extent that the Buckeye and TIMIT corpus have slightly different phonetic annotations (and different annotations quality too), we can try and compare read speech vs. conversational speech. Here we plot the difference of the similarity matrices between one and the other (SG left, CBOW right). The biggest difference is in silences vs. stop-consonants:

## phoneme2vec

What about phonemic annotation (phonemes from the word-level transcription)? Here are the similarity matrices (SG left, CBOW right):

We still have the clusters of consonants vs vowels in the isomap of the skip-gram:

and of the CBOW

but we have much lesser distinctions between front and back consonants as well as nasals and stop. That’s obvious, because phonotactics are not accounted for in the phonemic transcription that I did.

## phonetic vs. phonemic

We already know that speech (phonetics) and this “higher level” (phonemic) representation differ, how do they differ in this embedding?

That’s all folks! Currently, I worked only on English datasets. That would be fun to see what comes up for other languages.

1. There are several interesting papers about turning text into emdeddings that have useful properties. Picking a few: from classical nature language processing tasks done only with vector spaces (and neural networks) (Collobert et al. 2011), to semantic spaces for multi-relational data (Bordes et al. 2013). The recent “word2vec” (Mikolov et al. 2013) from Google spiked interest from the NLP community, and it quickly got implemented in gensim (if you want to geek out the implementation, I recommend the excellent blog post about its optimisation).

# Collapsed Gibbs Sampling for Dirichlet Process Gaussian Mixture Models

I really enjoyed the pedagogy of Edwin Chen’s introduction to infinite mixture models, but I was a little disappointed that it does not go as far as presenting the details of the Dirichlet process Gaussian mixture model (DPGMM), as he uses sklearn’s variational Bayes DPGMM implementation.

For this reason, I will try and give here sufficient information to implement a DPGMM with collapsed Gibbs sampling. This is not an in-depth evaluation of which conjugate priors to use, nor an analysis of the parameters and hyper-parameters (that should have their own priors! ;)).

### Prerequisites

On Dirichlet processes, Chinese Restaurant processes, Indian Buffet processes, there is the excellent blog post by Edwin Chen. Another excellent introduction to Dirichlet processes is provided by Frigyik, Kapila and Gupta.

If you lack some knowledge about clustering or density estimation (unsupervised learning), you can read Chapters 20 (p. 284) to (at least) 22 of MacKay’s ITILA, that you can find as a free ebook; or chapter 9 of Bishop’s PRML. As a refresher, the Wikipedia article on mixture models, and the sklearn documentation on GMM are more efficient.

### DPGMM: the model

Let’s say we have $N$ observations and $K$ clusters, $i \in [1\dots N]$ is the indice for the observations, while $k \in [1\dots K]$ is the indice for the clusters. With $z_i$ the cluster assignment of observation $x_i$, and $\theta_k$ the parameter of mixture $k$:

So, the generative story of a DPGMM is as follows:

$\pi \sim Stick(\alpha)$ (mixing rates)
$z_i \sim \pi$ (cluster assignments)
$\theta_k \sim H(\lambda)$ (parameters)
$x_i \sim F(\theta_{z_i})$ (values)

### Fitting the data

Notation:

Let’s decompose the probability that the observation $i$ belongs to cluster $k$ into its two independent factors:

Then:

is the marginal likelihood of all the data assigned to cluster $k$, including $i$.

If $z_i = k^*$ (new cluster) then:

### Conjugate priors

Now we should choose $H$ for it to be conjugate to $F$ and have easy to compute parameters posterior. As we want $F$ to be multivariate normal: we can look on Wikipedia’s page of conjugate priors under multivariate normal with unknown $\mu$ and $\Sigma$ to see that $H$ should be normal-inverse-Wishart with prior parameters:

• $\mu_0$ initial mean guess [In my code further, I set it to the mean of whole the dataset.]
• $\kappa_0$ mean fraction (smoothing parameter) [A common value is 1. I set it to 0.]
• $\nu_0$ degrees of freedom [I set it to the number of dimensions.]
• $\Psi_0$ pairwise deviation product (matrix) [I set it to $10 \times I_d$ ($I_d$ is the $d\times d$ identity matrix). Indentity matrix makes this prior Gaussian circular, the $10$ factor should be dependant on the dataset, for instance on the mean distance between points.]

This gives us MAP estimates on parameters, for one of the clusters:

with $\tilde{x}$ the sample mean and $C=\sum_{i=1}^n (x_i-\tilde{x})(x_i-\tilde{x})^T$.

Set $\kappa_{0} = 0$ to have no effect of the prior on the posterior mean. This reduces to MLE estimates if:

So now we can compute the posterior predictive for cluster $k$ evaluated at $x_i$

### Collapsed Gibbs sampling

Here is the pseudo-code of collapsed Gibbs sampling adapted from algorithm 3 of Neal’s seminal paper:

while (not converged on mus and sigmas):
for each i = 1 : N in random order do:
remove x[i]'s sufficient statistics from old cluster z[i]
if any cluster is empty, remove it and decrease K
for each k = 1 : K do
compute P_k(x[i]) = P(x[i] | x[-i]=k)
N[k,-i] = dim(x[-i]=k)
compute P(z[i]=k | z[-i], Data) = N[k,-i] / (alpha + N - 1)
compute P*(x[i]) = P(x[i] | lambda)
compute P(z[i]=* | z[-i], Data) = alpha / (alpha + N - 1)
normalize P(z[i] | ...)
sample z[i] from P(z[i] | ...)
add x[i]'s sufficient statistics to new cluster z[i]
(possibly increase K)

### Results

Here is the result of our implementation of collapsed Gibbs sampling DPGMM compared to scikit-learn’s implementation of variational Bayes DPGMM:

### Code

Here is my quick-and-dirty code implementing this version of Gibbs sampling for DPGMM. You may want to comment out scikit-learn (that I used for the comparison above) if you do not have it installed.

# From Hacks to Bayesian Probability

In which we look at two pragmatic hacks that lead to the Bayesian approach of probabilities, when pushed further and added as constraints.

## Coinflips

Let’s say we have a coin, and we want to decide if it’s fair. We throw it $N$ times and we get $m$ heads, we can code heads=1, tails=0. With $\mu$ the ratio of heads:

### Maximum likelihood

How do we set $\mu$? We could maximize the probability of the data that we saw under our model, that is maximizing the likelihood. Let’s say that $D = {x_1 \dots x_N}$, then we have:

The maximum of this function of $\mu$ is reached for $\mu= \frac{m}{N}$. The problem arises if we have little data (in fact, when we have data that does not cover the whole space of possible data). If $D=(1,1,1)$, the maximum likelihood estimate of $\mu$ will be $1.0$. It means that we predict that all the tosses will land on heads, after only three observations!

### Smoothing

A classical hack is to smooth the maximum likelihood estimate by adding “fake data”, we could consider that we already saw the coin land on heads and tails once, before getting our data. This way, before (“prior to”) the experiment, we would have $\mu=1/2=0.5$. After (posterior to) our experiment, taking the data into account, we would have $\mu = (3+1)/(3+2) = 0.8$. How do we set the these prior coin flips (smoothing parameters)?

### Maximum A Posteriori

The right way to encode this prior knowledge is to put a probability distribution on the parameter $\mu$. As $\mu$ is a ratio, we should have a continuous distribution on $[0, 1]$ that can represent a whole range of prior belief on what the coin’s ratio of heads is. For these reasons, a sensible choice is the Beta distribution:

On Wikpedia, we can check how the $Beta(x|\alpha, \beta)$ distribution looks like:

Now we can compute again what is the posterior value of $\mu$ knowing the data $D$ and the prior Beta ($\propto$ means “proportional to”):

Hopefully, the Beta distribution is the conjugate prior for the Bernouilli and binomial distributions, and thus a bit of calculus reduces it to:

We can compute that, when $N \rightarrow \infty$, the expectation of $\mu$: $\mathbb{E}[\mu] = \mu_{ML}$, as:

### First conclusion

This approach of using a prior on the parameters of the distributions that are essential to our model (the predicting distribution) is central to the Bayesian approach of building models. It makes the model robust to what can happen, even though we had few data. It makes it easier to reason about our prior assumptions that simply “adding unseen data”, and it yields in the presence of more data.

If you’re interested about Bayesian modeling, there are plenty of very good textbooks. My prefered gradual introduction is MacKay’s ITILA, that you can find as a free ebook.

## Causality

Now here is another hack for logical reasoning, that leads to Bayesian probabilities. Let’s say that you want to express that an event $A$ entails an event $B$, in logic you would write $A \Rightarrow B$. We will be abusing the notation $A=[A=true]$ and $\neg A=[A=false]$. Now with the modus ponens, you can deduce $B$ whenever $A$ is true.

### Plausible reasoning

Now, we want to extend prepositional logic to plausible reasoning, in which we can have degrees of probability that rules are true; or degrees of belief in these rules and facts. A pragmatic way to do that is to introduce the variable $C$ which represents $A \Rightarrow B$, that is: if $P(C)=p$, there is a probability $p$ that $A \Rightarrow B$. Then, this previous modus ponens translates to:

And actually, as $P(A,B|C)=P(A|C)$, we have $P(B|A,C)=1$, which corresponds to the strong syllogism of modus ponens.

So now, if we are only 80% sure of $C$, we can write $P(C) = 0.8$ and seek for $P(B|A)$ (we are 100% sure of A):

Which means that $B$ has 80% chances to be true by following the strong syllogism of modus ponens, but it can also be true even though $C=false$.

Finally, contrary to prepositional logic, we also get the weak syllogism (and I’ll let you think it through):

A similar derivation and observation can be done for modus tollens.

### Cox-Jaynes theorem

A reasoning mechanism needs to be consistent (one cannot prove $A$ and $\neg A$ at the same time). For plausible reasoning, consistency means: a) all the possible ways to reach a conclusion leads to the same result, b) information cannot be ignored, c) two equal states of knowledge have the same plausibilities. Adding consistency to plausible reasoning leads to Cox’s theorem, which derives the laws of probability (the product-rule and the sum-rule). So, the degrees of belief of any consistent induction mechanism verify Kolmogorov’s axioms.

### Second and last conclusion

With plausible reasoning, we get all the benefits of prepositional logic, but we can also reason with/about facts and rules that are not 100% true. We have another example of how a pragmatical (sensical) hack to extend logic to “degrees of beliefs” (probabilities) leads to Bayesian probabilities.

If you are interested by learning about plausible reasonning, you can look at my thesis, or, better yet, read it directly from one of the masters in Jayne’s Probability Theory: The Logic of Science for which the pre-print is there.