by Li Yang Ku (Gooly)
Convolutional Neural Network (CNN)-based features will likely replace engineered representations such as SIFT and HOG, yet we know little on what it represents. In this post I will go through a few papers that dive deeper into CNN-based features and discuss whether CNN feature vectors tend to be more like grandmother cells, where most information resides in a small set of filter responses, or distributed code, where most filter responses carry information equally. The content of this post is mostly taken from the following three papers:
- Agrawal, Pulkit, Ross Girshick, and Jitendra Malik. “Analyzing the performance of multilayer neural networks for object recognition.” Computer Vision–ECCV 2014. Springer International Publishing, 2014. 329-344.
- Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
- Dosovitskiy, Alexey, and Thomas Brox. “Inverting convolutional networks with convolutional networks.” arXiv preprint arXiv:1506.02753 (2015).
So why do we want to take insights from convolutional neural networks (CNN)? Like what I talked about in my previous post, In 2012, University of Toronto’s CNN implementation won the ImageNet challenge by a large margin, 15.3% and 26.6% in classification and detection by the nearest competitor. Since then CNN approaches have been leaders in most computer vision benchmarks. Although CNN doesn’t work like the brain, the characteristic that makes it work well might be also true in the brain.
The grandmother cell is a hypothetical neuron that represents a complex but specific concept or object proposed by cognitive scientist Jerry Letvin in 1969. Although it is mostly agreed that the original concept of grandmother cell which suggests that each person or object one recognizes is associated with a single cell is biological implausible (see here for more discussion), the less extreme idea of grandmother cell is now explained as sparse coding.
Before diving into CNN features we look into existing computer vision algorithms and see which camp they belong to. Traditional object recognition algorithms either are part-based approaches that use mid-level patches or use a bag of local descriptors such as SIFT. One of the well know part-based approaches is the deformable part model which uses HOG to model parts and a score on respective location and deformation to model their spatial relationship. Each part is a mid-level patch that can be seen as a feature that fires to specific visual patterns and mid-level patch discovery can be viewed as the search for a set of grandmother cell templates.
On the other hand, unlike mid-level patches, SIFT like features represent low level edges and corners. This bag of descriptors approach uses a distributed code; a single feature by itself is not discriminative, but a group of features taken together is.
There were many attempts to understand CNN more. One of the early work done by Zeiler and Fergus find locally optimal visual inputs for individual filters. However this does not characterize the distribution of images that cause a filter to activate. Agrawal et al. claimed that a grandmother cell can be seen as a filter with high precision and recall. Therefore for each conv-5 filter in the CNN trained on ImageNet they calculate the average precision for classifying images. They showed that grandmother cell like ﬁlters exist for only a few classes, such as bicycle, person, cars, and cats. The number of filters required to recognize objects of a class is also measured. For classes such as persons, cars, and cats few filters are required, but most classes require 30 to 40 filters.
In the work done by Hinton et al. a concept called distillation is introduced. Distillation transfers the knowledge of a cumbersome model to a small model. For a cumbersome model, the training objective is to maximize the probability of the correct answer. A side effect is that it also assigns probabilities to incorrect answers. Instead of training on the correct answer, distillation train on soft targets, which is the probabilities of all answers generated from the cumbersome model. They showed that the small model performs better when trained on these soft targets versus when trained on the correct answer. This result suggests that the relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.
On the other hand, Dosovitskiy et al. tried to understand CNN features through inverting the CNN. They claim that inverting CNN features allows us to see which information of the input image is preserved in the features. Applying inverse to a perturbed feature vector yields further insight into the structure of the feature space. Interestingly, when they discard features in the FC8 layer they found most information is contained in small probabilities of those classes instead of the top-5 activation. This result is consistent with the result of the distillation experiment mentioned previously.
These findings suggest that a combination of distributed code and some grandmother like cells may be closer to how CNN features work and might also be how our brain encodes visual inputs.