Life is a game, take it seriously

How objects are represented in human brain? Structural description models versus Image-based models

In Computer Vision, Neural Science, Paper Talk on October 30, 2014 at 9:06 pm

by Li Yang Ku (Gooly)


A few years ago while I was still back in UCLA, Tomaso Poggio came to give a talk about the object recognition work he did with 2D templates. After the talk some student asked about whether he thought about using a 3D model to help recognizing objects from different viewpoints. “The field seems to agree that models are stored as 2D images instead of 3D models in human brain” was the short answer Tomaso replied. Since then I took it as a fact and never had a second thought of it till a few month ago when I actually need to argue against storing a 3D model to people in robotics.

70s fashion

To get the full story we have to first go back to the late 70s. The study of visual object recognition is often motivated by the problem of recognizing 3D objects while only receiving 2D patterns of light on our retina. The question was whether our object representations is more similar to abstract three-dimensional descriptions, or are they tied more closely to the two-dimensional image of an object? A commonly held solution at that time, popularized by Marr was that the goal of vision is to reconstruct 3D. In the paper “Representation and recognition of the spatial organization of three-dimensional shapes” published in 1978 Marr and Nishihara assumes that at the end of the reconstruction process, viewer centered descriptions are mapped into object centered representations. This is based on the hypothesis that object representation should be invariant over changes in the retinal image. Based on this object centered theory, Biederman introduced the recognition by component (RBC) model in 1987 which proposes that objects are represented as a collection of volumes or parts. This quite influential model explains how object recognition can be viewpoint invariant and is often referred to as a structural description model.

The structural description model or object centered theory was the dominant theory of visual object understanding around that time and it can correctly predict the view-independent recognition of familiar objects. On the other hand, the viewer centered models, which store a set of 2D images instead of one single 3D model, are usually considered implausible because of the amount of memory a system would require to store all discriminable views of many objects.


However, between late 1980’s to early 1990’s a wide variety of psychophysical and neurophysiological experiments surprisingly showed that human object recognition performance is strongly viewpoint dependent across rotation in depth. Before jumping into late 80’s I wanna first introduce some work done by Palmer, Rosch, and Chase in 1981. In their work they discovered that commonplace objects such as houses or cars can be hard or easy to recognize, depending on the attitude of the object with respect to the viewer. Subjects tended to respond quicker when the stimulus was shown from a good or canonical perspective. These observations was important in forming the viewer centered theory.

Paper clip like objects used in Bulthoff's experiments

Paper clip like objects used in Bulthoff’s experiments

In 1991 Bulthoff conducted an experiment on understanding these two theories. Subjects are shown sequences of animations where a paper clip like object is rotating. Given these sequences, the subjects have enough information to reconstruct a 3D model of the object. The subjects are then given a single image of a paper clip like object and are asked to identify whether it is the same object. Different viewing angles of the object are tested. The assumption is that if only one single complete 3D model of this object exists in our brain then recognizing it from all angles should be equally easy. However, according to Bulthoff when given every opportunity to form 3D, the subjects performed as if they have not done so.

Bulthoff 1991

In 1992 Edelman further showed that canonical perspectives arise even when all the views in question are shown equally often and the objects posses no intrinsic orientation that might lead to the advantage of some views.

Edelman 1992

Error rate from different viewpoint shown in Edelman’s experiment

In 1995 Tarr confirmed the discoveries using block like objects. Instead of showing a sequence of views of the object rotating, subjects are trained to learn how to build these block structures by manually placing them through an interface with fixed angle. The result shows that response times increased proportionally to the angular distance from the training viewpoint. With extensive practice, performance became nearly equivalent at all familiar viewpoints; however practice at familiar viewpoints did not transfer to unfamiliar viewpoints.

Tarr 1995

Based on these past observations, Logothetis, Pauls, and Poggio raised the question “If monkeys are extensively trained to identify novel 3D objects, would one find neurons in the brain that respond selectively to particular views of such object?” The results published in 1995 was clear. By conducting the same paper clip object recognition task on monkeys, they found 11.6% of the isolated neurons sampled in the IT region, which is the region that known to represent objects, responded selectively to a subset of views of one of the known target object. The response of these individual neurons decrease when the shown object rotate in all 4 axis from the canonical view which the neurons represent. The experiment also shows that these view specific neurons are scale and position invariant up to certain degree.

Logothetis 1995

Viewpoint specific neurons

These series of findings from human psychophysics and neurophysiolog research provided converging evidence for ‘image-based’ models in which objects are represented as collections of viewpoint-specific local features. A series of work in computer vision also shown that by allowing each canonical view to represent a range of images the model is no longer unfeasible. However despite a large amount of research, most of the detail mechanisms are still unknown and require further research.

Check out these papers visually in my other website

References not linked in post:

Tarr, Michael J., and Heinrich H. Bülthoff. “Image-based object recognition in man, monkey and machine.” Cognition 67.1 (1998): 1-20.

Palmeri, Thomas J., and Isabel Gauthier. “Visual object understanding.” Nature Reviews Neuroscience 5.4 (2004): 291-303.

Book it: Computer Vision Metrics

In Book It, Computer Vision on September 21, 2014 at 5:45 pm

by Li Yang Ku

computer vision metrics

I was asked to review a computer vision book again recently. The 500 page book “Computer Vision Metrics” is written by Scott Krig and, surprisingly, can be downloaded for free through Apress. It is a pretty nice book for people that are not completely new to Computer Vision but want to find research topics that they would be interested in. I would recommend to go through the first 4 chapters, specially the 3rd and the 4th chapter which gives a pretty complete overview on the most active research areas in recent years.


Topics I talked about in my blog such as Sparse Coding and Hierarchical  Matching Pursuit are also discussed in the book. The section that did a comparison between some of the relatively new descriptors FREAK, Brisk, ORB, and BREIF should also be pretty helpful.

Sparse Coding in a Nutshell

In Computer Vision, Neural Science, Sparse Coding on May 24, 2014 at 7:24 pm

by Li Yang Ku (Gooly)


I’ve been reading some of Dieter Fox’s publications recently and a series of work on Hierarchical Matching Pursuit (HMP) caught my eye. There are three papers that is based on HMP, “Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms”, “Unsupervised feature learning for RGB-D based object recognition” and “Unsupervised Feature Learning for 3D Scene Labeling”. In all 3 of these publications, the HMP algorithm is what it is all about. The first paper, published in 2011, deals with scene classification and object recognition on gray scale images; the second paper, published in 2012, takes RGBD image as input for object recognition; while the third paper, published in 2014, further extends the application to scene recognition using point cloud input. The 3 figures below are the feature dictionaries used in these 3 papers in chronicle order.


One of the center concept of HMP is to learn low level and mid level features instead of using hand craft features like SIFT feature. In fact the first paper claims that it is the first work to show that learning features from the pixel level significantly outperforms those approaches built on top of SIFT. Explaining it in a sentence, HMP is an algorithm that builds up a sparse dictionary and encodes it hierarchically such that meaningful features preserves. The final classifier is simply a linear support vector machine, so the magic is mostly in sparse coding. To fully understand why sparse coding might be a good idea we have to go back in time.

Back in the 50’s, Hubel and Wiesel’s work on discovering Gabor filter like neurons in the cat brain really inspired a lot of people. However, the community thought the Gabor like filters are some sort of edge detectors. This discovery leads to a series of work done on edge detection in the 80’s when digital image processing became possible on computers. Edge detectors such as Canny, Harris, Sobel, Prewitt, etc are all based on the concept of detecting edges before recognizing objects. More recent algorithms such as Histogram of Oriented Gradient (HOG) are an extension of these edge detectors. An example of HOG is the quite successful paper on pedestrian detection “Histograms of oriented gradients for human detection” (See figure below).

hog and sift

If we move on to the 90’s and 2000’s, SIFT like features seems to have dominated a large part of the Computer Vision world. These hand-craft features works surprisingly well and lead to many real applications. These type of algorithms usually consist of two steps, 1) detect interesting feature points (yellow circles in the figure above) , 2) generate an invariant descriptor around it (green check boards in the figure above). One of the reasons it works well is that SIFT only cares interest points, therefore lowering the dimension of the feature significantly. This allows classifiers to require less training samples before it can make reasonable predictions. However, throwing away all those geometry and texture information is unlikely how we humans see the world and will fail in texture-less scenarios.

In 1996, Olshausen showed that by adding a sparse constraint, gabor like filters are the codes that best describe natural images. What this might suggest is that Filters in V1 (Gabor filters) are not just edge detectors, but statistically the best coding for natural images under the sparse constraint. I regard this as the most important proof that our brain uses sparse coding and the reason it works better in many new algorithms that uses sparse coding such as the HMP. If you are interested in why evolution picked sparse coding, Jeff Hawkins has a great explanation in one of his talks (at 17:33); besides saving energy, it also helps generalizing and makes comparing features easy. Andrew Ng also has a paper “The importance of encoding versus training with sparse coding and vector quantization” on analyzing which part of sparse coding leads to better result.


Get every new post delivered to your Inbox.

Join 148 other followers