I’m keeping this post quick and dirty, but at least it’s out there. The gist of this post is that I put out a one-file gist that does all the basics, so that you can play around with it yourself. First of all, I would say that deep learning is simply kernel machines whose kernel we learn. That’s gross but that’s not totally false. Second of all, there is nothing magical about deep learning, just that we can efficiently train (GPUs, clusters) large models (millions of weights, billions if you want to make a Wired headline) on large datasets (millions of images, thousands of hours of speech, more if you’re GOOG/FB/AAPL/MSFT/NSA). I think a good part of the success of deep learning comes from the fact that practitioners are not afraid to go around beautiful mathematical principles to have their model work on whatever dataset and whatever task. But I digress…
What is a deep neural network?
A series of matrix multiplications and non-linearities. You take your input $x$ in your feature space, multiply it by a matrix $W$ (add biases $b$), apply a non-linearity (the Rectified Linear Unit is fashionable these days, that’s $max(0, output)$, but $sigmoid$ and $tanh$ are OK too) and keep on doing that with other layers until you reach a classifier. For instance, you have a 3-layer ReLU-based neural network with a softmax classifier on top? That gives:
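Assuming one weight matrix $W_i$ and bias $b_i$ per layer (the indices are added here for illustration), the network computes $softmax(W_4 \cdot max(0, W_3 \cdot max(0, W_2 \cdot max(0, W_1 x + b_1) + b_2) + b_3) + b_4)$. The same forward pass as a minimal NumPy sketch; the layer sizes, the random weights and the function names are all made up here, and none of it is taken from the gist:

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: max(0, z), element-wise
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params):
    """params is a list of (W, b) pairs; the last pair feeds the softmax."""
    h = x
    for W, b in params[:-1]:
        # Rows of x are examples, so this is x @ W rather than W x
        h = relu(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

# Hypothetical sizes: 784 inputs, three hidden ReLU layers, 10 classes
rng = np.random.default_rng(0)
sizes = [784, 200, 200, 200, 10]
params = [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.random((5, 784))      # a fake mini-batch of 5 examples
probs = forward(x, params)    # one row of class probabilities per example
print(probs.shape)
```

Each row of `probs` is a probability distribution over the 10 classes; training is then "just" adjusting the `W` and `b` so that the right class gets the mass.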
There are all sorts of different mammals, with very strong specificities, but I think I just described a rat (or is it a euarchontoglires?).
Links and Papers
I’m just dumping here a collection of links that I think everybody with an interest in deep learning should at least skim:
- First, you should of course start with the deeplearning.net tutorials, even though they’re pretty old. Overall, these are very good foundations nevertheless.
- Get an intuition for how NNs fold space with non-linearities, and try an online demo to play around with this concept.
- These online demos were nice, right? They’re done by a guy who also wrote a pretty interesting personal history that concurs with my point of view on feature learning.
- I’m going to advise you against it in a bit, but if you want to do RBM pre-training, this paper is a must-read.
- If you want to do anything that has to deal with images, start here and there.
- If you want to do anything that has to deal with speech (I assume you know about speech coding, otherwise I did a crash course), start here and there.
- If you want to do NLP with deep learning, there are lots of hot papers right now, but you could start with NLP (almost) from scratch.
- In any case, you should learn practical stuff about SGD (must-read), learn about momentum, and you can geek out about extensions (I’m fond of Adadelta). You should learn about Dropout, and maybe geek out about the variants (fast dropout, dropconnect…).
- If you like videos, Optimization I and Leon Bottou’s MLSS class (parts 1, 2 and 3) are good introductions.
- Finally, if you want more, you can have a look at my non-exhaustive collection of links on deep learning.
Stuff you’ll learn
Here I’m getting totally subjective, because I’m telling you stuff that I learned the hard way.
- Always answer “Do you want more data?” with “Yes, please.”
- If something feels wrong, check your gradients with finite differences.
- For all gradient descent related stuff, first RTFM.
- When do we stop the training? Almost everybody does it but nobody speaks about it: early stopping on a validation set.
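- If you use $tanh$ or $sigmoid$ activation units, initialize them well, respectively with uniform weights in $\left[-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right]$ (with $n_{in}$ and $n_{out}$ the number of units feeding into and out of the layer) or $4$ times that.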
- “What is unsupervised pre-training?” Using un-annotated data to initialize the network’s weights.
- What is unsupervised pre-training doing? “unsupervised pre-training guides the learning towards basins of attraction of minima that are better in terms of the underlying data distribution; the evidence from these results supports a regularization explanation for the effect of pre-training.”
- This is not needed if you have enough data.
- “How do we approach a problem with the deep learning mindset?” You design an under-constrained over-capacity over-fitting hog (by being deep and wide, just barely tractable efficiently on your hardware), and you keep it in check by using Dropout.
- “What is Dropout?” Dropping hidden units randomly (usually each with probability 0.5) during training so that the network learns to be “robust” and doesn’t learn stupid co-activations of units (a way to tell the network to not just learn to compress the training set).
- “What is Dropout doing exactly?” “the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix”, “Dropout performs gradient descent on-line with respect to both the training examples and the ensemble of all possible subnetworks. (…) The regularization term is the usual weight decay or Gaussian prior term based on the square of the weights to prevent overfitting. Dropout provides immediately the magnitude of the regularization term which is scaled by the inputs and by the variance of the dropout variables.”
- “Sorry, what?” You know about L2 regularization right? So you know about this picture, where regularization means inflating the L2 (or L1) ball until it intersects your feasible set. Now imagine an ellipse that has its moments matching the ones of the inverse of the Fisher information matrix of the data. You now have a picture of “kinda” what Dropout is doing.
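About checking gradients with finite differences, from the list above: here is a minimal sketch of what I mean, on a toy L2-regularized least-squares loss rather than a real network. The loss, the data and the names (`numerical_grad`, `analytic_grad`) are made up for the illustration.

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    """Central finite differences: (f(w + eps) - f(w - eps)) / (2 * eps), per coordinate."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + eps
        f_plus = f(w)
        w.flat[i] = old - eps
        f_minus = f(w)
        w.flat[i] = old            # restore the weight before moving on
        grad.flat[i] = (f_plus - f_minus) / (2 * eps)
    return grad

# Toy problem (hypothetical data): L2-regularized least squares
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
lam = 0.1

def loss(w):
    r = X @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w

def analytic_grad(w):
    # The gradient you derived by hand (or that your framework gives you)
    return X.T @ (X @ w - y) + lam * w

w = rng.normal(size=3)
num = numerical_grad(loss, w)
ana = analytic_grad(w)
rel_err = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print(rel_err)
```

If the relative error is not tiny, suspect your analytic gradient first. On a real network, check a handful of randomly chosen weights instead of all of them, because the loop costs two loss evaluations per weight.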
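Early stopping on a validation set, from the same list, fits in a few lines. This is a sketch with a "patience" rule, which is one common way to do it and not necessarily what the gist does; `early_stopping`, `patience` and the validation curve are invented for the example.

```python
def early_stopping(val_errors, patience=3):
    """Scan validation errors epoch by epoch and stop once `patience`
    epochs have passed without a new best. Returns (best_epoch, best_error)."""
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            # New best: this is where you would snapshot the weights
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_err

# Made-up validation curve: improves, then starts over-fitting
val_curve = [0.30, 0.22, 0.18, 0.17, 0.19, 0.20, 0.21, 0.16]
best_epoch, best_err = early_stopping(val_curve, patience=3)
print(best_epoch, best_err)
```

In real training `val_errors` would be a generator that runs one epoch and yields the validation error, so breaking out of the loop actually stops the training. Note that the rule gives up before seeing the last value of the made-up curve: that is the price of a small patience.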
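For the initialization of $tanh$ and $sigmoid$ units, here is a sketch assuming the interval meant is the one from Glorot and Bengio, that is uniform in $\pm\sqrt{6/(n_{in}+n_{out})}$ for $tanh$ and $4$ times that for $sigmoid$. The function name and its arguments are hypothetical.

```python
import numpy as np

def init_weights(n_in, n_out, activation="tanh", rng=None):
    """Uniform init in [-bound, bound] with bound = sqrt(6 / (n_in + n_out)),
    multiplied by 4 for sigmoid units (assumed Glorot and Bengio recipe)."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.sqrt(6.0 / (n_in + n_out))
    if activation == "sigmoid":
        bound *= 4.0
    return rng.uniform(-bound, bound, size=(n_in, n_out))

# Same seed for both, so the only difference is the scale
W_tanh = init_weights(784, 200, "tanh", np.random.default_rng(0))
W_sigmoid = init_weights(784, 200, "sigmoid", np.random.default_rng(0))
print(np.abs(W_tanh).max(), np.abs(W_sigmoid).max())
```

Biases are usually just initialized to zero.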
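And Dropout itself is a one-liner mask on the hidden activations. The sketch below uses the "inverted" variant, which rescales the kept units during training so that nothing has to change at test time; the original formulation rescales at test time instead, and I am not claiming this is what the gist does. The names `dropout`, `p_drop` and `train` are made up.

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p_drop during
    training and divide the survivors by the keep probability."""
    if not train or p_drop == 0.0:
        return h                      # test time: identity
    rng = np.random.default_rng() if rng is None else rng
    keep = 1.0 - p_drop
    mask = rng.random(h.shape) < keep  # True for the units we keep
    return h * mask / keep

h = np.ones((1000, 100))              # fake hidden activations
out = dropout(h, p_drop=0.5, train=True, rng=np.random.default_rng(0))
print(out.mean())                     # close to 1: the expected activation is preserved
```

A fresh mask is drawn for every mini-batch, so each update trains a different sub-network sharing the same weights.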
I put together a single file simple deep neural network working on small datasets (Python), more for pedagogical purposes than production ready, but it runs relatively fast on GPUs thanks to Theano. So if you want to run it, install Theano (I use the bleeding edge). If you want to play around with it, look for TODO in the code and change values there. There are several datasets that you can use. Also, you should play around with the parameters of this function, and maybe try against the SVMs from scikit-learn. Finally, if you use Dropout, you will see improvement only on large-enough networks (> 1000 units / layer, > 3-4 layers). Here is the result on running this file (python dnn.py) with a small ($784\times200\times200\times10$) ReLU-based L2-regularized network on MNIST:
If your GPU can handle it, you want to try Dropout on MNIST with 4 (or more) layers of 2000 units. ;-)
I didn’t talk about convolutional neural networks, nor recurrent neural networks, nor other beasts. That should be the next step for the passionate reader. This was just a primer on raw facts for basic deep learning. Depending on what people want, I can explain the file that I provided here function by function, talk about different loss functions (learning embeddings, e.g. as in word2vec), cover recurrent neural nets, etc.