Life is a game, take it seriously

Distributed Code or Grandmother Cells: Insights From Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Sparse Coding on January 23, 2016 at 1:31 pm

by Li Yang Ku (Gooly)


Convolutional Neural Network (CNN)-based features will likely replace engineered representations such as SIFT and HOG, yet we know little about what they represent. In this post I will go through a few papers that dive deeper into CNN-based features and discuss whether CNN feature vectors tend to be more like grandmother cells, where most information resides in a small set of filter responses, or a distributed code, where most filter responses carry information equally. The content of this post is mostly taken from the following three papers:

  1. Agrawal, Pulkit, Ross Girshick, and Jitendra Malik. “Analyzing the performance of multilayer neural networks for object recognition.” Computer Vision–ECCV 2014. Springer International Publishing, 2014. 329-344.
  2. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
  3. Dosovitskiy, Alexey, and Thomas Brox. “Inverting convolutional networks with convolutional networks.” arXiv preprint arXiv:1506.02753 (2015).

So why do we want to take insights from convolutional neural networks (CNN)? As I mentioned in my previous post, in 2012 the University of Toronto’s CNN implementation won the ImageNet challenge by a large margin: a 15.3% classification error rate versus 26.6% for the nearest competitor. Since then, CNN approaches have been the leaders in most computer vision benchmarks. Although CNNs don’t work like the brain, the characteristics that make them work well might also hold in the brain.


The grandmother cell is a hypothetical neuron that represents a complex but specific concept or object, proposed by cognitive scientist Jerry Lettvin in 1969. Although it is mostly agreed that the original concept of the grandmother cell, which suggests that each person or object one recognizes is associated with a single cell, is biologically implausible (see here for more discussion), a less extreme version of the grandmother cell idea is now explained as sparse coding.

Deformable Part Model

Before diving into CNN features, let's look at existing computer vision algorithms and see which camp they belong to. Traditional object recognition algorithms are either part-based approaches that use mid-level patches or approaches that use a bag of local descriptors such as SIFT. One of the well-known part-based approaches is the deformable part model, which uses HOG templates to model parts and a score over their relative locations and deformations to model their spatial relationship. Each part is a mid-level patch that can be seen as a feature that fires on specific visual patterns, and mid-level patch discovery can be viewed as a search for a set of grandmother-cell templates.
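To make the part-plus-deformation scoring concrete, here is a minimal sketch of how a deformable-part-model-style hypothesis could be scored. The function name and the quadratic deformation penalty are simplifications for illustration; the real model uses learned per-part HOG filters and a richer deformation term.

```python
def dpm_score(appearance_scores, anchors, locations, deform_weights):
    """Score one object hypothesis in a simplified deformable part model:
    the sum of per-part appearance scores (e.g. HOG filter responses)
    minus a quadratic penalty for each part's displacement from its anchor."""
    score = 0.0
    for s, a, l, w in zip(appearance_scores, anchors, locations, deform_weights):
        dx, dy = l[0] - a[0], l[1] - a[1]
        score += s - w * (dx * dx + dy * dy)
    return score

# Two parts: one exactly at its anchor, one displaced by (1, 0).
score = dpm_score(
    appearance_scores=[2.0, 1.5],
    anchors=[(0, 0), (3, 0)],
    locations=[(0, 0), (4, 0)],
    deform_weights=[0.5, 0.5],
)
print(score)  # 2.0 + (1.5 - 0.5) = 3.0
```

Each part behaves like a template that fires on its visual pattern, and the deformation term keeps the parts in a plausible spatial arrangement.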

SIFT

On the other hand, unlike mid-level patches, SIFT-like features represent low-level edges and corners. This bag-of-descriptors approach uses a distributed code; a single feature by itself is not discriminative, but a group of features taken together is.
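A minimal sketch of the bag-of-descriptors idea: each local descriptor is quantized to its nearest codebook word, and the image is represented by the resulting word histogram. The toy 2-D "descriptors" here stand in for 128-D SIFT vectors.

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Quantize each local descriptor (e.g. SIFT) to its nearest codebook
    word and return the normalized word-count histogram for the image."""
    # Squared distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy 2-D "descriptors" and a 3-word codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [2.1, 2.0]])
print(bag_of_words(descriptors, codebook))  # [0.25 0.5 0.25]
```

No single histogram bin identifies the object; it is the whole histogram, a distributed code, that a classifier consumes.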

There have been many attempts to understand CNNs better. One early work, by Zeiler and Fergus, finds locally optimal visual inputs for individual filters; however, this does not characterize the distribution of images that cause a filter to activate. Agrawal et al. argued that a grandmother cell can be seen as a filter with high precision and high recall. Therefore, for each conv-5 filter in a CNN trained on ImageNet, they calculated the average precision for classifying images of each class. They showed that grandmother-cell-like filters exist for only a few classes, such as bicycles, people, cars, and cats. They also measured the number of filters required to recognize objects of each class: for classes such as people, cars, and cats only a few filters are required, but most classes require 30 to 40 filters.
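The precision-and-recall test can be sketched as follows: rank images by a single filter's response and compute average precision for one class. A filter with AP near 1.0 behaves like a grandmother cell for that class; a filter whose responses are uninformative for the class does not. The toy data and filter names below are made up for illustration.

```python
import numpy as np

def average_precision(labels, scores):
    """AP of ranking images by one filter's response: the mean of the
    precision at each rank where a positive image is retrieved."""
    order = np.argsort(-scores)
    ranked = labels[order]
    hits = np.cumsum(ranked)
    ranks = np.arange(1, len(ranked) + 1)
    return (hits[ranked == 1] / ranks[ranked == 1]).mean()

# Toy setup: 6 images, 2 of class "cat"; one response per image per filter.
labels = np.array([1, 0, 1, 0, 0, 0])                          # 1 = cat
grandmother_filter = np.array([0.9, 0.1, 0.8, 0.2, 0.0, 0.1])  # fires on cats
distributed_filter = np.array([0.5, 0.6, 0.4, 0.7, 0.2, 0.3])  # uninformative
print(average_precision(labels, grandmother_filter))  # 1.0
print(average_precision(labels, distributed_filter))  # well below 1.0
```

In the paper's terms, a class is grandmother-cell-like when a single filter already ranks its images almost perfectly; for most classes, no individual filter comes close and 30 to 40 filters are needed jointly.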


In the work by Hinton et al., a concept called distillation is introduced. Distillation transfers the knowledge of a cumbersome model to a small model. For the cumbersome model, the training objective is to maximize the probability of the correct answer; a side effect is that it also assigns probabilities to the incorrect answers. Instead of training on the correct answers, distillation trains on soft targets, the probabilities over all answers generated by the cumbersome model. They showed that the small model performs better when trained on these soft targets than when trained on the correct answers. This result suggests that the relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.
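The soft targets in the paper are produced by a softmax with a raised temperature, which keeps the relative probabilities of the incorrect answers visible instead of squashing them toward zero. A minimal sketch (the logits and class names are invented for illustration):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T softens the distribution,
    exposing the relative probabilities of the incorrect answers."""
    z = logits / T
    z = z - z.max()          # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Cumbersome model's logits for one image (classes: cat, dog, car).
teacher_logits = np.array([6.0, 4.0, -2.0])

hard_like = softmax(teacher_logits, T=1.0)  # nearly one-hot on "cat"
soft_target = softmax(teacher_logits, T=4.0)  # "dog" vs "car" ratio survives
print(hard_like)
print(soft_target)
```

The small model is then trained to match the soft targets; the information that "this cat looks more like a dog than like a car" is exactly what is lost when training only on the one-hot correct answer.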

Inverting CNN Features

On the other hand, Dosovitskiy et al. tried to understand CNN features by inverting the CNN. They argue that inverting CNN features lets us see which information in the input image is preserved in the features, and that applying the inverse to a perturbed feature vector yields further insight into the structure of the feature space. Interestingly, when they discarded features in the FC8 layer, they found that most information is contained in the small probabilities of the non-top classes rather than in the top-5 activations. This result is consistent with the distillation experiment mentioned above.
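The perturbation can be sketched as splitting an FC8 vector into its top-5 entries and the rest, and feeding each half to the inverting network separately. This is only an illustration of the masking step, not their inversion network:

```python
import numpy as np

def split_top_k(activations, k=5):
    """Split a feature vector into its top-k entries and the rest,
    mimicking the perturbation used when inverting FC8 features."""
    top = np.argsort(-activations)[:k]
    top_only = np.zeros_like(activations)
    top_only[top] = activations[top]
    rest_only = activations - top_only
    return top_only, rest_only

rng = np.random.default_rng(0)
fc8 = rng.random(10)          # stand-in for a 1000-way FC8 activation
top_only, rest_only = split_top_k(fc8, k=5)
# The two parts partition the original vector exactly.
print(np.allclose(top_only + rest_only, fc8))  # True
```

The surprising finding is that reconstructions from the `rest_only` part preserve more of the input image than reconstructions from `top_only`.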

Top-5 vs rest feature in FC8

These findings suggest that a combination of a distributed code and some grandmother-cell-like filters may be closer to how CNN features work, and might also be how our brain encodes visual inputs.

 

Deep Learning and Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Uncategorized on November 22, 2015 at 8:17 pm

by Li Yang Ku (Gooly)


Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng

Well, right, nowadays it is just hard not to talk about Deep Learning and Convolutional Neural Networks (CNN) in the field of Computer Vision. Since 2012 when the neural network trained by two of Geoffrey Hinton’s students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet Challenge by a large margin, neural networks have quickly become mainstream and made probably the greatest comeback ever in the history of AI.


So what are Deep Learning and CNNs? According to the 2014 RSS keynote speech by Andrew Ng, Deep Learning is more or less a brand name for all work related to this class of approaches that try to learn high-level abstractions in data by using multiple layers. One of my favorite pre-2012 works is the deep belief nets by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh, where a multi-layer neural network is used to learn handwritten digits. While I was still at UCLA, Geoffrey demonstrated this neural network during his visit in 2010. What is interesting is that this network not only classifies digits but can also be used to generate digits in a top-down fashion. See a talk he did on this work below.


On the other hand, a Convolutional Neural Network (CNN) is a specific type of multi-layer model. One of the most famous pre-2012 works was on classifying images (handwritten digits), introduced by Yann LeCun and his colleagues while he was at Bell Laboratories. This specific CNN, now called LeNet, uses the same weights for the same filter across different locations in the first two layers, which greatly reduces the number of parameters that need to be learned compared to a fully connected neural network. The underlying concept is fairly simple: if a filter that acts like an edge detector is useful in the left corner, then it is probably also useful in the right corner.
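The saving from weight sharing is easy to quantify. A rough back-of-the-envelope comparison (the layer sizes below are loosely modeled on LeNet's first layer, not its exact configuration):

```python
def conv_params(k, in_ch, out_ch):
    """Parameters of one conv layer: each of the out_ch filters has a
    k x k x in_ch kernel plus one bias, shared across all image locations."""
    return out_ch * (k * k * in_ch + 1)

def fully_connected_params(in_h, in_w, in_ch, out_units):
    """Parameters of a dense layer over the same input: every output unit
    has its own weight for every input pixel, with no sharing."""
    return out_units * (in_h * in_w * in_ch + 1)

# 32x32 grayscale input, six 5x5 feature maps of size 28x28.
print(conv_params(5, 1, 6))                            # 156
print(fully_connected_params(32, 32, 1, 6 * 28 * 28))  # 4821600
```

Sharing each 5x5 filter across locations needs 156 parameters, while producing the same number of outputs with a fully connected layer would need roughly 4.8 million, which is why LeNet was trainable on 1989-era hardware.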


Both Deep Learning and CNNs are not new. Deep Learning concepts such as using multiple layers date all the way back to 1975, when back propagation, an algorithm for learning the weights of a multi-layer neural network, was first introduced by Paul Werbos. CNNs can likewise be traced back to around the 1980s, when neural networks were popular; LeNet was work done around 1989. So why are Deep Learning and CNNs suddenly gaining fame faster than any pop singer in the field of Computer Vision? The short answer is that they work, or more precisely, they work better than traditional approaches. A more interesting question is why they work now but not before. The answer can be narrowed down to three reasons. 1) Data: thanks to people posting cat images on the internet and to Amazon Mechanical Turk, we have millions of labeled images for training neural networks, such as ImageNet. 2) Hardware: GPUs allow us to train multi-layer neural networks on millions of images within a few weeks by exploiting the parallelism in neural networks. 3) Algorithms: new approaches such as dropout and better loss functions have been developed that help train better networks.


One of the advantages of Deep Learning is that it bundles feature detection and classification. Traditional approaches, which I have talked about in a past post, usually consist of two parts: a feature detector such as the SIFT detector and a classifier such as a support vector machine. Deep Learning, on the other hand, trains both together, which allows better features to be learned directly from the raw data based on the classification results through back propagation. Note that even though sparse coding approaches also learn features from raw images, they are not trained end to end. It was also shown that by using dropout, an approach that simply randomly drops units to prevent co-adaptation, such deep neural networks don't seem to suffer from overfitting the way other machine learning approaches do. However, the biggest challenge lies in the fact that a deep network works like a black box, and there is no proven theory yet on why back propagation on deep neural networks doesn't converge to a bad local minimum (or it might be converging to a local minimum and we just don't know).
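Dropout itself is only a few lines. A minimal sketch in the common "inverted dropout" form, which rescales the surviving units at training time so that the test-time layer is simply the identity:

```python
import numpy as np

def dropout(activations, p, rng, train=True):
    """Inverted dropout: randomly zero units with probability p during
    training and rescale the survivors so the expected activation is
    unchanged; at test time the layer passes activations through unchanged."""
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, p=0.5, rng=rng)
print(y.mean())  # close to 1.0: about half the units are 0, the rest are 2
```

Because each unit is randomly removed, no unit can rely on specific other units being present, which is the co-adaptation the technique prevents.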


Many are excited about this recent trend in Deep Learning and associate it with how our own brain works. As excited as I am, being a big fan of neuroscience, we have to keep in mind that such neural networks are proven to be able to approximate any continuous function by the universal approximation theorem; black box as it is, we should not be surprised that it has the capacity to be a great classifier. Besides, an object recognition algorithm that works well doesn't mean it correlates with how brains work, not to mention that deep learning only works well with supervised data and is therefore quite different from how humans learn. The current neural network model also acts quite differently from how our neurons work according to Jeff Hawkins, not to mention that there are a large number of motor neurons going top down in every layer of our brain that are not captured in these neural networks. Having said that, I am still embracing Deep Learning in my own research and will go through other aspects of it in the following posts.

 

 

A Tale of Two Visual Pathways

In Computer Vision, Neural Science, Visual Illusion on May 14, 2015 at 7:53 pm

by Li Yang Ku (Gooly)


The idea that our brain encodes visual stimuli in two separate regions, based on whether the information concerns an object's location or its identity, was first proposed by Schneider in 1969. In 1982, Ungerleider and Mishkin further proposed the two-visual-pathway hypothesis, which suggests that two areas, the inferotemporal cortex and the posterior parietal cortex, receive independent sets of projections from the striate cortex (also named the visual cortex and often referred to as V1; this is where many people think Gabor-like filters reside). According to their original account, the ventral stream, which starts from V1, passes through V2 and V4, and ends in the inferotemporal cortex, plays a critical role in identifying objects, while the dorsal stream, which starts from V1, passes through V5 and V6, and ends in the posterior parietal cortex, encodes the spatial location of those same objects. Lesion experiments on monkeys at the time fitted well with this hypothesis: monkeys with lesions of the inferotemporal cortex were impaired in recognition tasks but still capable of using visual cues to determine which location was rewarded, and the opposite was observed in monkeys with posterior parietal lesions. This hypothesis is often known as the distinction of 'what' and 'where' between the two visual pathways.


However, later findings showed that this hypothesis, that the two visual pathways encode spatial location and object identity separately, doesn't quite capture the whole picture. Subjects with lesions in the posterior parietal region not only have difficulty reaching in the right direction but also in positioning their fingers or adjusting the orientation of their hands. In 1992, Goodale and Milner proposed an alternative perspective on the functionality of these two visual pathways based on many observations made with patient DF. Instead of making distinctions on the internal representation, Goodale and Milner suggested taking more account of output requirements, and introduced a separation between the two visual pathways based on 'what' and 'how' instead of 'what' and 'where'.


Patient DF is unique in the sense that she developed a profound visual form agnosia due to anoxic damage to her ventral stream. Despite DF's inability to recognize the shape, size, and orientation of visual objects, she is capable of grasping those very same objects with accurate hand and finger movements. When DF was asked to indicate the width of a cube with her thumb and index finger, her matches bore no relationship to the actual size of the cube; however, when she was asked to reach out and pick up the cube, the distance between her thumb and index finger matched the dimensions of the cube systematically. In a series of experiments, DF was capable of adjusting her fingers to pick up objects of different scales even though she was unable to perceive the dimensions of those objects. Based on these observations, Goodale and Milner proposed that the dorsal pathway provides action-relevant information about the structural characteristics and orientation of objects, and not just about their position.


This two-visual-pathway hypothesis, often referred to as the perception-action model, has received significant attention in the field of neuropsychology and has influenced thousands of studies since 1992. However, several aspects of this model are questioned by recent findings. In 2011, Hesse et al. showed that the opposite experimental results between patients with lesions in the dorsal stream and in the ventral stream are affected by whether the subject fixates on the target, and are not as complementary as previously thought. Several experiments have also shown that the supposed functional independence between action and perception might overlook conditions in which perception and action actually interact. In 1998, Deubel et al. found that participants' ability to discriminate a visual target increases when they point to the target location. In 2005, Linnell et al. further found that this increase in discrimination ability happens even before the pointing action is performed; simply intending to perform an action may change perceptual capability. These findings suggest that the ventral and dorsal visual pathways are not as independent as previously thought and may 'talk' to one another when actions are programmed.

