# So You Wanna Try Deep Learning?

I’m keeping this post quick and dirty, but at least it’s out there. The gist of this post is that I put out a one-file gist that does all the basics, so that you can play around with it yourself. First of all, I would say that deep learning is simply kernel machines whose kernel we learn. That’s crude, but it’s not totally false. Second of all, there is nothing magical about deep learning, just that we can efficiently train (GPUs, clusters) large models (millions of weights, billions if you want to make a Wired headline) on large datasets (millions of images, thousands of hours of speech, more if you’re GOOG/FB/AAPL/MSFT/NSA). I think a good part of the success of deep learning comes from the fact that practitioners are not afraid to go around beautiful mathematical principles to make their model work on whatever dataset and whatever task. But I digress…

## What is a deep neural network?

A series of matrix multiplications and non-linearities. You take your input $x$ in your feature space, multiply it by a matrix $W$ (add biases $b$), apply a non-linearity (the Rectified Linear Unit is fashionable these days, that’s $\max(0, \mathrm{output})$, but $\mathrm{sigmoid}$ and $\tanh$ are OK too) and keep on doing that with other layers until you reach a classifier. For instance, you have a 3-layer ReLU-based neural network with a softmax classifier on top? That gives:

$\mathrm{softmax}(W_4 \max(0, W_3 \max(0, W_2 \max(0, W_1 x + b_1) + b_2) + b_3) + b_4)$
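
To make the chain of matrix multiplications and non-linearities concrete, here is a minimal NumPy sketch of a forward pass through three ReLU layers with a softmax classifier on top. The layer sizes, the random weights and the names `relu`, `softmax` and `forward` are assumptions for illustration and are not taken from the gist. The code also uses the row-vector convention (`x @ W`) instead of $Wx$.

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row-wise max before exponentiating, for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, params):
    """Forward pass: ReLU hidden layers, then a softmax classifier.

    x is (batch, n_in); params is a list of (W, b) pairs, one per layer.
    """
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)               # matrix multiply, add bias, non-linearity
    W_out, b_out = params[-1]
    return softmax(h @ W_out + b_out)     # class probabilities, one row per example

# Hypothetical sizes: 784 inputs, three hidden layers of 200 units, 10 classes
rng = np.random.default_rng(0)
sizes = [784, 200, 200, 200, 10]
params = [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.random((5, 784))                  # a fake batch of 5 "images"
probs = forward(x, params)
print(probs.shape)                        # (5, 10)
```

Each row of `probs` sums to 1, so it can be read as a distribution over the 10 classes.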

There are all sorts of different mammals, with very strong specificities, but I think I just described a rat (or is it a euarchontoglires?).

I’m just dumping here a collection of links that I think everybody with an interest in deep learning should at least skim:

## Stuff you’ll learn

Here I’m getting totally subjective, because I’m telling you stuff that I learned the hard way.

#### Generic

• Always answer “Do you want more data?” with “Yes, please.”
• If something feels wrong, check your gradients with finite differences.
• For all gradient descent related stuff, first RTFM.
• When do we stop the training? Almost everybody does it but nobody speaks about it: early stopping on a validation set.
• If you use $\tanh$ or $\mathrm{sigmoid}$ activation units, initialize them well: uniform weights in $[-r, r]$ with $r = \sqrt{\frac{6}{\mathrm{fan}_{in} + \mathrm{fan}_{out}}}$ for $\tanh$, and $4$ times that range for $\mathrm{sigmoid}$.
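
Two of the points above, the initialization range and the finite-difference check, fit in one small NumPy sketch. It initializes a single $\tanh$ layer with the uniform range from the list, then compares a hand-written gradient with central finite differences. The toy squared-error loss, the tiny layer size and the helper names (`glorot_uniform`, `numeric_grad`, etc.) are assumptions for illustration, not code from the gist.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng, activation="tanh"):
    # Uniform in [-r, r] with r = sqrt(6 / (fan_in + fan_out)); 4x wider for sigmoid
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if activation == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_in, fan_out))

def loss(W, x, y):
    # Toy loss: half squared error of a single tanh layer (no bias, for brevity)
    return 0.5 * np.sum((np.tanh(x @ W) - y) ** 2)

def analytic_grad(W, x, y):
    # Backprop by hand: dL/dW = x^T [(h - y) * (1 - h^2)], with h = tanh(x W)
    h = np.tanh(x @ W)
    return x.T @ ((h - y) * (1.0 - h ** 2))

def numeric_grad(f, W, eps=1e-5):
    # Central finite differences, one weight at a time (slow, for debugging only)
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f(W)
        W[idx] = old - eps
        f_minus = f(W)
        W[idx] = old                      # restore the weight
        g[idx] = (f_plus - f_minus) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
W = glorot_uniform(4, 3, rng)
x = rng.normal(size=(5, 4))
y = rng.normal(size=(5, 3))

g_analytic = analytic_grad(W, x, y)
g_numeric = numeric_grad(lambda w: loss(w, x, y), W)

# Relative error between the two gradients; a large value suggests a bug
rel_err = (np.abs(g_analytic - g_numeric).max()
           / (np.abs(g_analytic) + np.abs(g_numeric)).max())
print(rel_err)
```

If the relative error is many orders of magnitude above the floating-point noise floor, the analytic gradient is the first suspect.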
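
Early stopping on a validation set can be sketched in a few lines. The example below trains a logistic regression (standing in for a real network) on synthetic data and stops once the validation loss has not improved for a fixed number of epochs. The data, the `patience` value and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (hypothetical, just to have something to fit)
X = rng.normal(size=(300, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + rng.normal(scale=2.0, size=300) > 0).astype(float)
X_train, y_train = X[:200], y[:200]
X_valid, y_valid = X[200:], y[200:]

def nll(w, X, y):
    # Mean negative log-likelihood of logistic regression
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def grad(w, X, y):
    # Gradient of the mean negative log-likelihood with respect to w
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

w = np.zeros(20)
best_w, best_valid = w.copy(), nll(w, X_valid, y_valid)
patience, bad_epochs = 10, 0              # stop after 10 epochs without improvement
max_epochs, lr = 1000, 0.5

for epoch in range(max_epochs):
    w -= lr * grad(w, X_train, y_train)   # one full-batch gradient step
    valid = nll(w, X_valid, y_valid)
    if valid < best_valid:
        # New best validation loss: remember these weights, reset the counter
        best_valid, best_w, bad_epochs = valid, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                         # validation loss stopped improving

print(epoch, best_valid)
```

The model you keep is `best_w`, not the final `w`: the last few epochs before stopping are by construction the ones that did not help on the validation set.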

## Practice

I’d advise you to start with either Torch (Lua) or Theano (Python), both nice libraries that do automatic differentiation.

I put together a simple, single-file deep neural network that works on small datasets (Python). It is meant more for pedagogical purposes than for production, but it runs relatively fast on GPUs thanks to Theano. So if you want to run it, install Theano (I use the bleeding edge). If you want to play around with it, look for TODO in the code and change values there. There are several datasets that you can use. Also, you should play around with the parameters of this function, and maybe compare against the SVMs from scikit-learn. Finally, if you use Dropout, you will see an improvement only on large enough networks (> 1000 units / layer, > 3-4 layers). Here is the result of running this file (python dnn.py) with a small ($784\times200\times200\times10$) ReLU-based L2-regularized network on MNIST:

If your GPU can handle it, you’ll want to try Dropout on MNIST with 4 (or more) layers of 2000 units. ;-)
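
For readers who have not met Dropout yet, here is a minimal NumPy sketch of what it does to one layer's activations. It uses the "inverted" variant, which rescales the surviving units during training so that nothing changes at test time; this is an assumption for illustration and not necessarily how the gist implements it. The function name `dropout` and the drop probability of 0.5 are likewise hypothetical.

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    """Inverted dropout on a batch of activations h.

    During training each unit is zeroed with probability p_drop and the
    survivors are scaled by 1 / (1 - p_drop); at test time h is returned as-is.
    """
    if not train or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop  # True for the units that are kept
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((4, 2000))                    # fake activations of a 2000-unit layer
h_train = dropout(h, 0.5, rng, train=True)
h_test = dropout(h, 0.5, rng, train=False)

print((h_train == 0).mean())              # fraction of dropped units, close to 0.5
print(h_train.mean())                     # close to 1.0 on average
```

With so many units dropped at each step, a small network has little capacity left, which is one way to see why the benefit only shows up on large layers.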

## Conclusion

I didn’t talk about convolutional neural networks, nor recurrent neural networks, nor other beasts. That should be the next step for the passionate reader. This was just a primer on raw facts for basic deep learning. Depending on what people want, I can either explain the file that I provided here function by function, or talk about different loss functions (learning embeddings, e.g. as in word2vec), recurrent neural nets, etc.