Life is a game, take it seriously

Posts Tagged ‘brain’

Distributed Code or Grandmother Cells: Insights From Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Sparse Coding on January 23, 2016 at 1:31 pm

by Li Yang Ku (Gooly)


Convolutional Neural Network (CNN)-based features will likely replace engineered representations such as SIFT and HOG, yet we know little about what they represent. In this post I will go through a few papers that dive deeper into CNN-based features and discuss whether CNN feature vectors tend to be more like grandmother cells, where most of the information resides in a small set of filter responses, or distributed code, where all filter responses carry information more or less equally. The content of this post is mostly taken from the following three papers:

  1. Agrawal, Pulkit, Ross Girshick, and Jitendra Malik. “Analyzing the performance of multilayer neural networks for object recognition.” Computer Vision–ECCV 2014. Springer International Publishing, 2014. 329-344.
  2. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
  3. Dosovitskiy, Alexey, and Thomas Brox. “Inverting convolutional networks with convolutional networks.” arXiv preprint arXiv:1506.02753 (2015).

So why do we want to take insights from convolutional neural networks (CNN)? As I mentioned in my previous post, in 2012 the University of Toronto's CNN implementation won the ImageNet challenge by a large margin, with an error rate of 15.3% versus 26.6% for the nearest competitor. Since then CNN approaches have led most computer vision benchmarks. Although a CNN doesn't work like the brain, the characteristics that make it work well might also hold in the brain.


The grandmother cell, proposed by cognitive scientist Jerry Lettvin in 1969, is a hypothetical neuron that represents a complex but specific concept or object. Although it is mostly agreed that the original concept, which suggests that each person or object one recognizes is associated with a single cell, is biologically implausible (see here for more discussion), a less extreme version of the grandmother cell idea is now explained as sparse coding.

Deformable Part Model

Before diving into CNN features, let's look at existing computer vision algorithms and see which camp they belong to. Traditional object recognition algorithms either are part-based approaches that use mid-level patches or use a bag of local descriptors such as SIFT. One of the well-known part-based approaches is the deformable part model, which uses HOG to model parts and scores their locations and deformations to model their spatial relationship. Each part is a mid-level patch that can be seen as a feature that fires on specific visual patterns, and mid-level patch discovery can be viewed as a search for a set of grandmother-cell templates.
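The deformable part model's scoring scheme can be sketched as a toy computation: a detection's score is the root filter response plus the part responses, minus a quadratic penalty for how far each part drifted from its anchor position. All numbers and weights below are made up for illustration; the real model uses HOG filter responses and learned deformation weights.

```python
def dpm_score(root_score, part_scores, displacements, deform_weights):
    """Toy deformable-part-model score: root filter response plus each
    part's filter response, penalized by a quadratic deformation cost
    for how far the part moved from its anchor."""
    score = root_score
    for s, (dx, dy), (wx, wy) in zip(part_scores, displacements, deform_weights):
        score += s - (wx * dx**2 + wy * dy**2)
    return score

# A detection whose parts sit near their anchors scores higher than one
# whose parts had to deform a lot to match the image.
good = dpm_score(1.0, [0.8, 0.9], [(0, 1), (1, 0)], [(0.1, 0.1), (0.1, 0.1)])
bad = dpm_score(1.0, [0.8, 0.9], [(3, 4), (5, 2)], [(0.1, 0.1), (0.1, 0.1)])
```

Each part filter here plays the role of one "grandmother cell template" that fires on a specific visual pattern.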


On the other hand, unlike mid-level patches, SIFT-like features represent low-level edges and corners. This bag-of-descriptors approach uses a distributed code: a single feature by itself is not discriminative, but a group of features taken together is.
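A minimal sketch of the bag-of-descriptors idea, using random vectors in place of real SIFT descriptors and a made-up visual vocabulary (which would normally come from k-means over descriptors from many training images):

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocabulary):
    """Assign each local descriptor (e.g. a SIFT vector) to its nearest
    visual word and return the normalized histogram of word counts.
    No single word is discriminative; the whole histogram is."""
    # Pairwise distances: (num_descriptors, num_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
vocab = rng.normal(size=(5, 8))   # 5 visual words, 8-D toy descriptors
desc = rng.normal(size=(20, 8))   # 20 local descriptors from one image
h = bag_of_words_histogram(desc, vocab)
```

The resulting histogram is the distributed code: classification looks at all bins together rather than at any single response.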

There have been many attempts to understand CNNs better. One early work, by Zeiler and Fergus, finds locally optimal visual inputs for individual filters; however, this does not characterize the distribution of images that cause a filter to activate. Agrawal et al. argued that a grandmother cell can be seen as a filter with high precision and recall. Therefore, for each conv-5 filter in a CNN trained on ImageNet, they calculated the average precision for classifying images. They showed that grandmother-cell-like filters exist for only a few classes, such as bicycles, persons, cars, and cats. They also measured the number of filters required to recognize objects of a class: for classes such as persons, cars, and cats only a few filters are required, but most classes require 30 to 40 filters.
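The per-filter measurement can be sketched as follows: score each image by one filter's maximum response and compute the average precision of the resulting ranking. The responses and labels below are random placeholders, not data from the paper; a filter whose scores rank the class's images on top would behave like a grandmother cell.

```python
import numpy as np

def average_precision(scores, labels):
    """AP of the ranking induced by scores; labels are 1 for images
    of the target class, 0 otherwise."""
    order = np.argsort(-scores)
    labels = labels[order]
    cum_pos = np.cumsum(labels)
    precision = cum_pos / np.arange(1, len(labels) + 1)
    return precision[labels == 1].mean()

rng = np.random.default_rng(1)
responses = rng.normal(size=100)              # hypothetical max conv-5 responses
labels = (rng.random(100) < 0.2).astype(int)  # hypothetical class labels
ap = average_precision(responses, labels)
```

A filter with AP near 1.0 for some class is grandmother-cell-like for that class; most filters in the paper's measurements were not.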


In the work by Hinton et al., a concept called distillation is introduced. Distillation transfers the knowledge of a cumbersome model to a small model. For the cumbersome model, the training objective is to maximize the probability of the correct answer; a side effect is that it also assigns probabilities to incorrect answers. Instead of training on the correct answer, distillation trains on soft targets, the probabilities over all answers generated by the cumbersome model. They showed that the small model performs better when trained on these soft targets than when trained on the correct answers alone. This result suggests that the relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.
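The soft targets can be illustrated with a temperature-scaled softmax, following the paper's formulation. The logits below are invented for a hypothetical digit classifier: raising the temperature exposes the relative probabilities of the incorrect classes.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits of a cumbersome model for one image of a "2".
logits = np.array([0.1, 0.2, 9.0, 3.5, 0.1, 0.3, 0.1, 2.0, 0.2, 0.1])

hard_target = softmax(logits, temperature=1.0)  # nearly one-hot
soft_target = softmax(logits, temperature=5.0)  # reveals e.g. that a "3"
                                                # is more plausible than an "8"
# Distillation trains the small model to match soft_target (e.g. by
# minimizing cross-entropy against it at the same high temperature).
```

The near-zero probabilities on incorrect classes in `hard_target` are exactly the information the higher temperature makes visible.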

Inverting CNN Features

On the other hand, Dosovitskiy et al. tried to understand CNN features by inverting the CNN. They claim that inverting CNN features lets us see which information from the input image is preserved in the features, and that applying the inverse to a perturbed feature vector yields further insight into the structure of the feature space. Interestingly, when they discarded features in the FC8 layer, they found that most of the information is contained in the small probabilities of the non-top classes rather than in the top-5 activations. This result is consistent with the distillation experiment mentioned above.

Top-5 vs rest feature in FC8
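The two conditions compared in that experiment can be sketched on a made-up FC8 activation vector: keep only the top-5 entries, or discard only the top-5 and keep the rest.

```python
import numpy as np

rng = np.random.default_rng(2)
fc8 = rng.normal(size=1000)      # hypothetical FC8 activations (1000 classes)

top5 = np.argsort(-fc8)[:5]      # indices of the five largest activations

keep_only_top5 = fc8.copy()      # zero everything except the top 5
mask = np.ones(1000, dtype=bool)
mask[top5] = False
keep_only_top5[mask] = 0.0

discard_top5 = fc8.copy()        # zero only the top 5, keep the rest
discard_top5[top5] = 0.0
# Per the paper's finding, inverting the `discard_top5` vector preserves
# far more of the input image than inverting `keep_only_top5`.
```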

These findings suggest that a combination of distributed code and some grandmother-like cells may be closer to how CNN features work, and might also be how our brain encodes visual inputs.


Deep Learning and Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Uncategorized on November 22, 2015 at 8:17 pm

by Li Yang Ku (Gooly)


Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng

Well, right, nowadays it is just hard not to talk about Deep Learning and Convolutional Neural Networks (CNN) in the field of Computer Vision. Since 2012 when the neural network trained by two of Geoffrey Hinton’s students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet Challenge by a large margin, neural networks have quickly become mainstream and made probably the greatest comeback ever in the history of AI.


So what are Deep Learning and CNNs? According to Andrew Ng's 2014 RSS keynote speech, Deep Learning is more or less a brand name for the class of approaches that try to learn high-level abstractions in data by using multiple layers. One of my favorite pre-2012 works is the deep belief nets by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh, where a multi-layer neural network is used to learn handwritten digits. While I was still at UCLA, Geoffrey demonstrated this network during his visit in 2010. What is interesting is that the network not only classifies digits but can also be used to generate digits in a top-down fashion. See his talk below for this work.

Convolutional Neural Networks (CNNs), on the other hand, are a specific type of multi-layer model. One of the most famous pre-2012 works is on classifying images of handwritten digits, introduced by Yann LeCun and his colleagues while he was at Bell Laboratories. This specific CNN, now called LeNet, uses the same weights for the same filter across different locations in the first two layers, which largely reduces the number of parameters that need to be learned compared to a fully connected neural network. The underlying concept is fairly simple: if a filter that acts like an edge detector is useful in the left corner, it is probably also useful in the right corner.
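A back-of-the-envelope comparison shows why weight sharing matters. The layer sizes below are made-up, LeNet-flavored numbers, not the actual LeNet architecture:

```python
# Parameter counts for one layer on a 32x32 grayscale input.
input_pixels = 32 * 32                   # 1024 input values

# Fully connected: every input wired to each of 1024 output units.
fc_params = input_pixels * input_pixels  # 1,048,576 weights

# Convolutional: six 5x5 filters slid across the image, with the same
# 25 weights (plus one bias) reused at every location.
conv_params = 6 * (5 * 5 + 1)            # 156 parameters

ratio = fc_params // conv_params         # thousands of times fewer parameters
```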


Neither Deep Learning nor CNNs are new. Deep Learning concepts such as using multiple layers can be dated all the way back to 1975, when back propagation, an algorithm for learning the weights of a multi-layer neural network, was first introduced by Paul Werbos. CNNs can be traced back to the 1980s, when neural networks were popular, and LeNet itself is work from around 1989. So why are Deep Learning and CNNs suddenly gaining fame faster than any pop singer in the field of Computer Vision? The short answer is that they work; or more precisely, they work better than traditional approaches. A more interesting question is why they work now but not before. The answer can be narrowed down to three reasons. 1) Data: thanks to people posting cat images on the internet and to Amazon Mechanical Turk, we have millions of labeled images, such as ImageNet, for training neural networks. 2) Hardware: GPUs allow us to train multi-layer neural networks on millions of examples within a few weeks by exploiting the parallelism in neural networks. 3) Algorithms: new approaches such as dropout and better loss functions were developed to help train better networks.


One of the advantages of Deep Learning is that it bundles feature detection and classification. Traditional approaches, which I have talked about in a past post, usually consist of two parts: a feature detector such as SIFT and a classifier such as a support vector machine. Deep Learning trains both together, which allows better features to be learned directly from the raw data, driven by the classification results through back propagation. Note that even though sparse coding approaches also learn features from raw images, they are not trained end to end. It was also shown that by using dropout, an approach that randomly drops units to prevent co-adaptation, such deep neural networks don't seem to suffer from overfitting the way other machine learning approaches do. However, the biggest challenge lies in the fact that a deep network works like a black box, and there are no proven theories yet on why back propagation on deep neural networks doesn't converge to a bad local minimum (or it might be converging to a local minimum and we just don't know).
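A minimal sketch of dropout, assuming the common "inverted dropout" formulation in which surviving units are rescaled at training time so the layer needs no change at test time:

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: at training time, randomly zero a fraction
    p_drop of the units and scale survivors by 1/(1 - p_drop), so the
    expected activation is unchanged and test time is a no-op."""
    if not train:
        return activations
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(3)
h = np.ones(10000)                        # toy layer of 10,000 unit activations
h_train = dropout(h, p_drop=0.5, rng=rng)
# Roughly half the units are zeroed, but the mean activation stays near 1.
```

Randomly removing units on each forward pass prevents them from co-adapting, since no unit can rely on any particular other unit being present.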


Many are excited about this recent trend in Deep Learning and associate it with how our own brain works. As excited as I am, being a big fan of neuroscience, we have to keep in mind that such neural networks are proven to be able to approximate any continuous function, by the universal approximation theorem. So, black box as it is, we should not be surprised that it has the capacity to be a great classifier. Besides, an object recognition algorithm that works well doesn't necessarily correlate with how brains work, not to mention that deep learning only works well with supervised data and is therefore quite different from how humans learn. The current neural network models also act quite differently from how our neurons work according to Jeff Hawkins, not to mention that there are a large number of neurons projecting top-down into every layer of our brain that are not captured in these networks. Having said that, I am still embracing Deep Learning in my own research and will go through other aspects of it in the following posts.



A Tale of Two Visual Pathways

In Computer Vision, Neural Science, Visual Illusion on May 14, 2015 at 7:53 pm

by Li Yang Ku (Gooly)


The idea that our brain encodes visual stimuli in two separate regions, based on whether the information concerns an object's location or its identity, was first proposed by Schneider in 1969. In 1982 Ungerleider and Mishkin further proposed the two-visual-pathway hypothesis, which suggests that two areas, the inferotemporal cortex and the posterior parietal cortex, receive independent sets of projections from the striate cortex (also called the visual cortex and often referred to as V1; this is where many people think Gabor-like filters reside). According to their original account, the ventral stream that starts in V1, passes through V2 and V4, and ends in the inferotemporal cortex plays a critical role in identifying objects, while the dorsal stream that starts in V1, passes through V5 and V6, and ends in the posterior parietal cortex encodes the spatial location of those same objects. Lesion experiments on monkeys at the time fit this hypothesis well: monkeys with lesions of the inferotemporal cortex were impaired in recognition tasks but still capable of using visual cues to determine which location was rewarded, and the opposite results were observed in monkeys with posterior parietal lesions. This hypothesis is often known as the 'what' versus 'where' distinction between the two visual pathways.


However, further findings showed that the hypothesis that the two visual pathways encode spatial location and object identity separately doesn't quite capture the whole picture. Subjects with lesions in the posterior parietal region not only have difficulty reaching in the right direction but also in positioning their fingers and adjusting the orientation of their hands. In 1992, Goodale and Milner proposed an alternative perspective on the functionality of these two visual pathways based on many observations made with patient DF. Instead of drawing the distinction at the level of internal representation, Goodale and Milner suggested taking the output requirements into account, and introduced a separation between the two visual pathways based on 'what' and 'how' instead of 'what' and 'where'.


Patient DF is unique in that she developed a profound visual form agnosia due to anoxic damage to her ventral stream. Despite DF's inability to recognize the shape, size, and orientation of visual objects, she is capable of grasping those very same objects with accurate hand and finger movements. When DF was asked to indicate the width of a cube with her thumb and index finger, her matches bore no relationship to the actual size of the cube; however, when she was asked to reach out and pick up the cube, the distance between her thumb and index finger matched the dimensions of the cube systematically. In a series of experiments, DF was capable of adjusting her fingers to pick up objects of different scales even though she was unable to perceive the dimensions of those objects. Based on these observations, Goodale and Milner proposed that the dorsal pathway provides action-relevant information about the structural characteristics and orientation of objects, and not just about their position.


This two-visual-pathway hypothesis, often referred to as the perception-action model, received significant attention in the field of neuropsychology and has influenced thousands of studies since 1992. However, several aspects of this model have been questioned by recent findings. In 2011, Hesse et al. showed that the opposite experimental results between patients with lesions in the dorsal stream and the ventral stream are affected by whether the subject fixates on the target, and are not as complementary as previously thought. Several experiments have also shown that the apparent functional independence between action and perception might overlook conditions in which perception and action actually interact. In 1998, Deubel et al. found that participants' ability to discriminate a visual target increases when they point to the target location. In 2005, Linnell et al. further found that this increase in discrimination ability happens even before the pointing action is performed; merely the intention to perform an action may change perceptual capability. These findings suggest that the ventral and dorsal visual pathways are not as independent as previously thought and may 'talk' to one another when actions are programmed.


Human vision, top down or bottom up?

In Computer Vision, Neural Science, Paper Talk on February 9, 2014 at 6:42 pm

by Li Yang Ku (Gooly)


How our brain handles visual input remains a mystery. When Hubel and Wiesel discovered Gabor-filter-like neurons in the cat's V1 area, several feed-forward model theories appeared. These models view our brain as a hierarchical classifier that extracts features layer by layer; Poggio's papers "A feedforward architecture accounts for rapid categorization" and "Hierarchical models of object recognition in cortex" are good examples. These kinds of structures are called discriminative models. Although this new type of model helped the community take a step forward, it doesn't solve the problem. Part of the reason is that there are ambiguities when you only view part of an image locally, and a feed-forward-only structure can't achieve global consistency.
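The Gabor-like receptive fields mentioned above are easy to generate: a sinusoid at some orientation windowed by a Gaussian. The parameter values below are arbitrary, chosen only for illustration.

```python
import numpy as np

def gabor_kernel(size=21, sigma=3.0, theta=0.0, wavelength=6.0):
    """A Gabor filter: a cosine grating at orientation theta, windowed
    by a Gaussian envelope, similar to the receptive fields Hubel and
    Wiesel observed in V1 simple cells."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the grating
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / wavelength)
    return envelope * carrier

k = gabor_kernel()
# Convolving an image with k responds strongly to bars and edges
# oriented at theta, which is why a bank of such filters at different
# orientations makes a plausible model of the first feed-forward layer.
```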

Feedforward Vision

Therefore, the idea that some kind of feedback model has to exist gradually emerged. Some early work in the computer science community had come up with models that rely on feedback, such as Geoffrey Hinton's Boltzmann Machine, invented back in the 80's, which developed into so-called deep learning around the late 2000s. However, it was only around the early 2000s that David Mumford clearly addressed the importance of feedback, in the paper "Hierarchical Bayesian inference in the visual cortex". Around the same time, Wu and others successfully combined feedback and feedforward models on textures in the paper "Visual learning by integrating descriptive and generative methods". Since then the computer vision community has partly embraced the idea that the brain is more like a generative model, which in addition to categorizing inputs is capable of generating images. An example of humans having generative skills is drawing images out of imagination.


Slightly before David Mumford addressed the importance of the generative model, Lamme in the neuroscience community started a series of studies on recurrent processing in the visual system. His paper "The distinct modes of vision offered by feedforward and recurrent processing", published in 2000, addressed why recurrent (feedback) processing might be associated with conscious vision (recognizing objects). In the same year, the paper "Competition for consciousness among visual events: the psychophysics of reentrant visual processes", published in the field of psychology, also addressed the reentrant (feedback) visual process and proposed a model in which conscious vision is associated with reentrant visual processing.


While both the neuroscience and psychology fields have results that suggest a brain model composed of feedforward and feedback processing, where the feedback mechanism is associated with conscious vision, a recent paper, "Detecting meaning in RSVP at 13 ms per picture", shows that humans are able to recognize the high-level concept of an image within 13 ms, a gap too short for the brain to complete a reentrant (feedback) visual process. This conflicting result could suggest that conscious vision is not the result of feedback processing, or that there are still missing pieces we haven't discovered. This reminds me of one of Jeff Hawkins' brain theories: he said that solving the mystery of consciousness is like figuring out that the world is round, not flat; it's easy to understand but hard to accept. He believes that consciousness does not reside in one part of the brain but is simply the combination of all the neurons firing, from top to bottom.

Paper Talk: Untangling Invariant Object Recognition

In Computer Vision, Neural Science, Paper Talk on September 29, 2013 at 7:31 pm

by Gooly (Li Yang Ku)

Untangle Invariant Object Recognition

In one of my previous posts I talked about the big picture of object recognition, which can be divided into two parts: 1) transforming the image space, and 2) classifying and grouping. In this post I am gonna talk about a paper that clarifies object recognition, along with some of its pretty cool graphs explaining how our brains might transform the input image space. The paper also talks about why idealized classification might not be what we want.

Let's start by explaining what a manifold is.

image space manifolds

An object manifold is the set of images projected by one object into the image space. Since each image is a point in the image space, and an object can project similar images with infinitesimally small differences, these points form a continuous surface in the image space. This continuous surface is the object's manifold. Figure (a) above is an idealized manifold generated by a specific face; when the face is viewed from different angles, the projected point moves around on the continuous manifold. Although the graph is drawn in 3D, one should keep in mind that it actually lives in a much higher-dimensional space: a 576,000,000-dimensional space if we consider the human eye to be 576 megapixels. Figure (b) shows the manifold of another face in a space where the two individuals can be separated easily by a plane, while Figure (c) shows a space in which the two faces would be hard to separate. Note that these are ideal spaces, possibly transformed from the original image space by our cortex. If the shapes were that simple, object recognition would be easy. In reality what we get is Figure (d): the manifolds of two objects are usually tangled and intersect in multiple spots. However, the two manifolds are not identical, so it is possible that through some nonlinear operation we can transform Figure (d) into something more like Figure (c).
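The untangling idea can be sketched with a toy example: two "manifolds" that no plane separates in the original space become separable after a simple nonlinear transform. Concentric circles stand in here for the tangled manifolds of Figure (d).

```python
import numpy as np

# Two toy "object manifolds" in a 2-D image space: points from object A
# lie on a circle of radius 1, points from object B on a circle of
# radius 2. No straight line separates A from B in this space.
rng = np.random.default_rng(4)
angles = rng.uniform(0, 2 * np.pi, size=200)
a = np.stack([np.cos(angles[:100]), np.sin(angles[:100])], axis=1)
b = 2 * np.stack([np.cos(angles[100:]), np.sin(angles[100:])], axis=1)

# A simple nonlinear transform, mapping each point to its squared
# distance from the origin, untangles the manifolds: now a single
# threshold (a "plane" in the transformed space) separates them.
f = lambda points: (points**2).sum(axis=1)
separable = f(a).max() < 2.5 < f(b).min()
```

The cortex's transformation would of course be far richer, but the point is the same: the work is in reshaping the space, not in the final linear cut.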

classification: manifold or point?

One interesting point this paper makes is that the traditional view that a single neuron represents an object is probably wrong. Instead of having a grandmother cell (yes, that's what they call it) that represents your grandma, our brain might actually represent her with a manifold. Neurologically speaking, a manifold could be a set of neurons with a certain firing pattern. This is related to the sparse coding I talked about before and is consistent with Jeff Hawkins' brain theory. (See his talk about sparse distributed representations around 17:15.)

Figures (b) and (c) above compare a manifold representation with a single-cell representation. What is being emphasized is that object recognition is more a task of transforming the space than of classification.