by Li Yang Ku (Gooly)
Well, right, nowadays it is just hard not to talk about Deep Learning and Convolutional Neural Networks (CNN) in the field of Computer Vision. Since 2012 when the neural network trained by two of Geoffrey Hinton’s students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet Challenge by a large margin, neural networks have quickly become mainstream and made probably the greatest comeback ever in the history of AI.
So what are Deep Learning and CNNs? According to the 2014 RSS keynote speech by Andrew Ng, Deep Learning is more or less a brand name for all work related to a class of approaches that try to learn high-level abstractions in data by using multiple layers. One of my favorite pre-2012 works is the deep belief net by Geoffrey Hinton, Simon Osindero and Yee-Whye Teh, where a multi-layer neural network is used to learn handwritten digits. While I was still at UCLA, Geoffrey demonstrated this network during his visit in 2010. What is interesting is that this network not only classifies digits but can also be used to generate digits in a top-down fashion. See the talk he gave below for this work.
On the other hand, a Convolutional Neural Network (CNN) is a specific type of multi-layer model. One of the most famous pre-2012 works was a network for classifying images of handwritten digits, introduced by Yann LeCun and his colleagues while he was at Bell Laboratories. This specific CNN, now called LeNet, shares the same filter weights across different locations in its first two layers, which greatly reduces the number of parameters to learn compared to a fully connected neural network. The underlying concept is fairly simple: if a filter that acts like an edge detector is useful in the left corner, then it is probably also useful in the right corner.
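To make the weight-sharing idea concrete, here is a minimal sketch (not LeNet itself) of a single convolutional filter: the same 9 weights of a hypothetical edge-detecting kernel are reused at every image location, instead of learning a separate weight for each input-output pair as a fully connected layer would.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image (valid convolution, no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # The same 3x3 weights are applied at every (y, x) location.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A simple vertical-edge detector: useful anywhere in the image,
# so it makes sense to share it across all locations.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

image = np.zeros((8, 8))
image[:, 4:] = 1.0                  # right half bright: a vertical edge
response = conv2d(image, edge_kernel)
print(response.shape)               # (6, 6)
```

The shared filter here has only 9 parameters, while a fully connected layer mapping the same 8x8 input to a 6x6 output would need 64 x 36 weights; that is the parameter saving the paragraph above describes.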
Neither Deep Learning nor CNNs are new. Deep Learning concepts such as using multiple layers date all the way back to 1974, when back propagation, an algorithm for learning the weights of a multi-layer neural network, was first introduced by Paul Werbos. CNNs can likewise be traced back to the 1980s, when neural networks were popular; LeNet itself was work done around 1989. So why are Deep Learning and CNNs suddenly gaining fame faster than any pop singer in the field of Computer Vision? The short answer is because it works, or more precisely, it works better than traditional approaches. A more interesting question is why it works now but not before. The answer can be narrowed down to three reasons. 1) Data: thanks to people posting cat images on the internet and to Amazon Mechanical Turk, we have millions of labeled images for training neural networks, such as ImageNet. 2) Hardware: GPUs allow us to train multi-layer neural networks on millions of examples within a few weeks by exploiting the parallelism in neural networks. 3) Algorithms: new techniques such as dropout and better loss functions have been developed to help train better networks.
One of the advantages of Deep Learning is that it bundles feature detection and classification. Traditional approaches, which I have talked about in my past posts, usually consist of two parts: a feature detector such as SIFT and a classifier such as a support vector machine. Deep Learning, on the other hand, trains both together, which allows better features to be learned directly from the raw data, driven by the classification results through back propagation. Note that even though sparse coding approaches also learn features from raw images, they are not trained end to end. It was also shown that by using dropout, an approach that simply randomly drops units to prevent co-adaptation, such deep neural networks don't seem to suffer from overfitting the way other machine learning approaches do. However, the biggest challenge lies in the fact that a deep network works like a black box, and there is no proven theory yet on why back propagation on deep neural networks doesn't get stuck in a bad local minimum (or it might be converging to a local minimum and we just don't know).
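Dropout itself is simple enough to sketch in a few lines. Below is one common variant ("inverted" dropout, which scales the surviving units at training time; the original paper instead scaled the weights at test time), shown here only as an illustration of the random-dropping idea, not as anyone's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, train=True):
    """Zero each unit with probability p during training.

    Inverted-dropout scaling (dividing by 1 - p) keeps the expected
    activation the same, so nothing special is needed at test time.
    """
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p   # keep each unit with prob 1 - p
    return activations * mask / (1.0 - p)

h = np.ones(10)                    # pretend hidden-layer activations
h_train = dropout(h, p=0.5)        # roughly half the units zeroed, rest scaled to 2.0
h_test = dropout(h, train=False)   # identity at test time
```

Because each unit can vanish at any training step, no unit can rely on a specific partner being present, which is the co-adaptation the paragraph above refers to.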
Many are excited about this recent trend in Deep Learning and associate it with how our own brain works. As excited as I am, being a big fan of Neuroscience, we have to keep in mind that such neural networks are proven to be able to approximate any continuous function by the universal approximation theorem. Therefore, black box as it is, we should not be surprised that it has the capacity to be a great classifier. Besides, the fact that an object recognition algorithm works well doesn't mean it corresponds to how the brain works, not to mention that deep learning only works well with supervised data and is therefore quite different from how humans learn. According to Jeff Hawkins, the current neural network model also acts quite differently from how our neurons work, not to mention that there are a large number of motor neurons going top down at every layer in our brain that are not captured in these neural networks. Having said that, I am still embracing Deep Learning in my own research and will go through other aspects of it in the following posts.