Life is a game, take it seriously

A Tale of Two Visual Pathways

In Computer Vision, Neural Science, Visual Illusion on May 14, 2015 at 7:53 pm

by Li Yang Ku (Gooly)

The idea that our brain encodes visual stimuli in two separate regions, depending on whether the information concerns an object's location or its identity, was first proposed by Schneider in 1969. In 1982 Ungerleider and Mishkin further developed this two-visual-pathway hypothesis, suggesting that two areas, the inferotemporal cortex and the posterior parietal cortex, receive independent sets of projections from the striate cortex (also called the visual cortex, often referred to as V1; this is where many people think Gabor-like filters reside). According to their original account, the ventral stream, which starts from V1, passes through V2 and V4, and ends in the inferotemporal cortex, plays a critical role in identifying objects, while the dorsal stream, which starts from V1, passes through V5 and V6, and ends in the posterior parietal cortex, encodes the spatial location of those same objects. Lesion experiments on monkeys at that time fitted well with this hypothesis. Monkeys with lesions of the inferotemporal cortex were impaired in recognition tasks but still capable of using visual cues to determine which location was rewarded. The opposite pattern was observed in monkeys with posterior parietal lesions. This hypothesis is often known as the 'what' versus 'where' distinction between the two visual pathways.

However, later findings showed that this hypothesis, in which the two visual pathways encode spatial location and object identity separately, doesn't quite capture the whole picture. Subjects with lesions in the posterior parietal region not only have difficulty reaching in the right direction but also in positioning their fingers or adjusting the orientation of their hands. In 1992, Goodale and Milner proposed an alternative perspective on the functionality of these two visual pathways based on many observations made with patient DF. Instead of drawing the distinction at the level of internal representation, Goodale and Milner suggested taking more account of output requirements and introduced a separation between the two visual pathways based on 'what' and 'how' instead of 'what' and 'where'.

Figures from Goodale and Milner's book Sight Unseen

Patient DF is unique in the sense that she developed a profound visual form agnosia due to anoxic damage to her ventral stream. Despite DF's inability to recognize the shape, size, and orientation of visual objects, she is capable of grasping those very same objects with accurate hand and finger movements. When DF was asked to indicate the width of a cube with her thumb and index finger, her matches bore no relationship to the actual size of the cube. However, when she was asked to reach out and pick up the cube, the distance between her thumb and index finger matched the dimensions of the cube systematically. In a series of experiments, DF was capable of adjusting her fingers to pick up objects of different sizes even though she was unable to perceive the dimensions of those objects. Based on these observations, Goodale and Milner proposed that the dorsal pathway provides action-relevant information about the structural characteristics and orientation of objects, and not just about their position.

This two-visual-pathway hypothesis, often referred to as the perception-action model, received significant attention in the field of neuropsychology and has influenced thousands of studies since 1992. However, several aspects of this model have been questioned by recent findings. In 2011, Hesse et al. showed that the opposing experimental results between patients with lesions in the dorsal stream and the ventral stream are affected by whether the subject fixates on the target, and are not as complementary as previously thought. Several experiments have also shown that the assumed functional independence between action and perception might overlook conditions in which perception and action actually interact. In 1998, Deubel et al. found that participants' ability to discriminate a visual target increases when they point to the target location. In 2005, Linnell et al. further found that this increase in discrimination ability happens even before the pointing action is performed; the mere intention to perform an action may change perceptual capability. These findings suggest that the ventral and dorsal visual pathways are not as independent as previously thought and may 'talk' to one another when actions are programmed.

References are here

Local Distance Learning in Object Recognition

In Computer Vision, Paper Talk on February 8, 2015 at 11:59 am

by Li Yang Ku (Gooly)

Unsupervised clustering algorithms such as K-means are often used in computer vision as a tool for feature learning, and they can be applied at different stages of the visual pathway. Running K-means on small pixel patches might result in finding many patches with edges of different orientations, while running K-means on larger HOG features might result in finding contours of meaningful object parts, such as faces if your training data consists of selfies. However, convenient and simple as it seems, we have to keep in mind that these unsupervised clustering algorithms are all based on the assumption that a meaningful distance metric is provided. Without such a metric, clustering suffers from the "no right answer" problem: whether the algorithm should group a set of images into clusters of objects of the same type or objects of the same color is ambiguous and not well defined. This is especially true when your observation vectors consist of values representing different types of properties.
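
To make the patch-clustering idea concrete, here is a minimal sketch of K-means dictionary learning on pixel patches, assuming grayscale images stored as NumPy arrays and using scikit-learn. The function name, patch size, cluster count, and patches-per-image limit are illustrative choices of mine, not taken from any particular paper.

```python
# Minimal sketch: learn a small "dictionary" of patch filters with K-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.image import extract_patches_2d

def learn_patch_dictionary(images, patch_size=(8, 8), n_clusters=64, seed=0):
    """Cluster small pixel patches; centroids often look like oriented edges."""
    rng = np.random.RandomState(seed)
    patches = []
    for img in images:
        p = extract_patches_2d(img, patch_size, max_patches=200, random_state=rng)
        patches.append(p.reshape(len(p), -1))
    X = np.vstack(patches).astype(np.float64)

    # Remove each patch's mean intensity and normalize its contrast so that
    # clusters reflect structure (edges) rather than overall brightness.
    X -= X.mean(axis=1, keepdims=True)
    X /= X.std(axis=1, keepdims=True) + 1e-8

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    return kmeans.cluster_centers_  # each row is a learned "filter"
```

Note that the whole procedure silently relies on Euclidean distance between normalized patches being a meaningful metric, which is exactly the assumption discussed above.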

This is where distance learning comes into play. In the paper "Distance Metric Learning, with Application to Clustering with Side-Information" by Eric Xing, Andrew Ng, Michael Jordan, and Stuart Russell, a matrix A that represents the distance metric is learned through convex optimization, using user-provided examples of which points should be grouped together. This matrix A can be either full or diagonal. When learning a diagonal matrix, the values simply represent the weights of the individual features: if the goal is to group objects with similar color, features that represent color will receive higher weights in the matrix. This metric learning approach was shown to improve clustering on UCI data sets.
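
Below is a small gradient-based sketch of the diagonal case, learning non-negative per-feature weights from "similar" and "dissimilar" pairs. It captures the spirit of the formulation but is not the paper's actual optimization procedure; the objective, step size, and normalization are my own simplifications, and the features are assumed to be roughly standardized.

```python
# Sketch: learn a diagonal metric (one weight per feature) from side information.
import numpy as np

def learn_diagonal_metric(X, similar_pairs, dissimilar_pairs,
                          n_iters=500, lr=0.01):
    """Return non-negative weights w defining d(x, y) = sqrt(sum_k w_k (x_k - y_k)^2)."""
    n_features = X.shape[1]
    w = np.ones(n_features) / n_features
    for _ in range(n_iters):
        grad = np.zeros(n_features)
        # Pull similar pairs together: penalize their squared weighted distance.
        for i, j in similar_pairs:
            grad += (X[i] - X[j]) ** 2
        # Push dissimilar pairs apart: reward their (unsquared) weighted distance.
        for i, j in dissimilar_pairs:
            diff2 = (X[i] - X[j]) ** 2
            dist = np.sqrt(np.dot(w, diff2)) + 1e-8
            grad -= diff2 / (2.0 * dist)
        w = np.maximum(w - lr * grad, 0.0)  # gradient step, then keep weights non-negative
        w /= w.sum() + 1e-8                 # fix the overall scale of the metric
    return w
```

If the similar pairs are "same color" examples, the color features end up with larger weights at the expense of the others, which is the diagonal-A behavior described above.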

In another work, "Recognition by Association via Learning Per-exemplar Distances" by Tomasz Malisiewicz and Alexei Efros, the object recognition problem is posed as data association. A region in the image is classified by associating it with a small set of exemplars based on visual similarity. The authors suggest that the central question for recognition might not be "What is it?" but "What is it like?". In this work, 14 different types of features across four categories (shape, color, texture, and location) are used. Unlike the single distance metric learned in the previous work, a separate distance function that specifies the weights placed on these 14 feature types is learned for each exemplar. Some exemplars, like cars, are not as sensitive to color as exemplars like sky or grass, so having a different distance metric for each exemplar becomes advantageous in such situations. This class of work, which defines separate distance metrics, is called local distance learning.
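
A rough sketch of what a per-exemplar distance might look like is given below; the feature grouping, data layout, and fixed association threshold are my assumptions, not details taken from the paper.

```python
# Sketch: each exemplar carries its own weights over elementary feature distances.
import numpy as np

FEATURE_TYPES = ["shape", "color", "texture", "location"]  # illustrative grouping

def elementary_distances(query_feats, exemplar_feats):
    """One distance per feature type; plain Euclidean distance per block here."""
    return np.array([np.linalg.norm(query_feats[f] - exemplar_feats[f])
                     for f in FEATURE_TYPES])

def exemplar_distance(query_feats, exemplar):
    """exemplar = {'feats': dict of arrays, 'weights': per-feature-type weights, 'label': str}."""
    d = elementary_distances(query_feats, exemplar["feats"])
    return float(np.dot(exemplar["weights"], d))

def associate(query_feats, exemplars, threshold=1.0):
    """Label a region by the exemplars whose learned distance falls below a threshold
    (the paper calibrates a decision boundary per exemplar; one global value is used here)."""
    return [e["label"] for e in exemplars
            if exemplar_distance(query_feats, e) < threshold]
```

Because the weights live with the exemplar, a car exemplar can learn to nearly ignore the color distance while a grass or sky exemplar can rely heavily on it.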

In a more recent work, "Sparse Distance Learning for Object Recognition Combining RGB and Depth Information" by Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox, a new approach called instance distance learning is introduced, where an instance refers to a single object. When classifying a view, the view-to-object distance is computed against all views of an object simultaneously, rather than with a nearest-neighbor approach. Besides learning weight vectors over features, weights over views are also learned. In addition, an L1 regularization is used instead of an L2 regularization in the Lagrangian, which yields a sparse weight vector with zero entries for most views. This is quite interesting in the sense that the approach finds a small subset of representative views for each instance. In fact, as shown in the image below, similar decision boundaries can be achieved with just 8% of the exemplar data. This is consistent with what I talked about in my last post: the human brain doesn't store all possible views of an object, nor does it store a 3D model of the object; instead it stores a subset of views that are representative enough to recognize the same object. This work demonstrates one possible way of finding such a subset of views.

Decision boundaries learned with instance distance learning
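
The sparsity effect of the L1 term can be illustrated with a toy example. The synthetic data and the use of scikit-learn's Lasso and Ridge below are my stand-ins for the paper's actual formulation, purely to show why an L1 penalty leaves most view weights at exactly zero while an L2 penalty does not.

```python
# Toy illustration: L1 regularization produces a sparse set of view weights.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
n_examples, n_views = 200, 50

# Each column plays the role of a score against one stored view;
# only a handful of views are actually informative about the target.
X = rng.randn(n_examples, n_views)
true_w = np.zeros(n_views)
true_w[[3, 17, 41]] = [1.5, 1.0, 0.8]
y = X @ true_w + 0.1 * rng.randn(n_examples)

l1 = Lasso(alpha=0.05).fit(X, y)
l2 = Ridge(alpha=1.0).fit(X, y)

print("non-zero view weights (L1):", np.sum(np.abs(l1.coef_) > 1e-6))  # only a few
print("non-zero view weights (L2):", np.sum(np.abs(l2.coef_) > 1e-6))  # nearly all
```

The few views that keep non-zero weight under L1 play the role of the small set of representative views mentioned above.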

 

How are objects represented in the human brain? Structural description models versus image-based models

In Computer Vision, Neural Science, Paper Talk on October 30, 2014 at 9:06 pm

by Li Yang Ku (Gooly)

A few years ago, while I was still back at UCLA, Tomaso Poggio came to give a talk about the object recognition work he did with 2D templates. After the talk, a student asked whether he had thought about using a 3D model to help recognize objects from different viewpoints. "The field seems to agree that models are stored as 2D images instead of 3D models in the human brain" was the short answer Tomaso gave. Since then I took it as a fact and never gave it a second thought, until a few months ago when I actually needed to argue against storing a 3D model to people in robotics.

To get the full story we have to first go back to the late 70s. The study of visual object recognition is often motivated by the problem of recognizing 3D objects while only receiving 2D patterns of light on our retina. The question was whether our object representations are more similar to abstract three-dimensional descriptions, or whether they are tied more closely to the two-dimensional image of an object. A commonly held solution at that time, popularized by Marr, was that the goal of vision is to reconstruct 3D. In the paper "Representation and recognition of the spatial organization of three-dimensional shapes", published in 1978, Marr and Nishihara assume that at the end of the reconstruction process, viewer-centered descriptions are mapped into object-centered representations. This is based on the hypothesis that object representations should be invariant over changes in the retinal image. Building on this object-centered theory, Biederman introduced the recognition-by-components (RBC) model in 1987, which proposes that objects are represented as a collection of volumes or parts. This quite influential model explains how object recognition can be viewpoint invariant and is often referred to as a structural description model.

The structural description model, or object-centered theory, was the dominant theory of visual object understanding around that time, and it correctly predicts the view-independent recognition of familiar objects. On the other hand, viewer-centered models, which store a set of 2D images instead of a single 3D model, were usually considered implausible because of the amount of memory a system would require to store all discriminable views of many objects.

However, between the late 1980s and early 1990s, a wide variety of psychophysical and neurophysiological experiments surprisingly showed that human object recognition performance is strongly viewpoint dependent across rotations in depth. Before jumping into the late 80s, I want to first introduce some work done by Palmer, Rosch, and Chase in 1981. They discovered that commonplace objects such as houses or cars can be hard or easy to recognize depending on the attitude of the object with respect to the viewer: subjects tended to respond more quickly when the stimulus was shown from a good, or canonical, perspective. These observations were important in forming the viewer-centered theory.

Paper clip like objects used in Bulthoff’s experiments

In 1991 Bulthoff conducted an experiment to test these two theories. Subjects were shown animation sequences in which a paper clip like object rotates; these sequences give the subjects enough information to reconstruct a 3D model of the object. The subjects were then shown a single image of a paper clip like object and asked to identify whether it was the same object, with different viewing angles of the object being tested. The assumption is that if a single complete 3D model of the object existed in our brain, then recognizing it from all angles should be equally easy. However, according to Bulthoff, even when given every opportunity to form a 3D model, the subjects performed as if they had not done so.

In 1992 Edelman further showed that canonical perspectives arise even when all the views in question are shown equally often and the objects possess no intrinsic orientation that might give some views an advantage.

Error rates at different viewpoints in Edelman's experiment

In 1995 Tarr confirmed these discoveries using block-like objects. Instead of being shown a sequence of views of a rotating object, subjects were trained to build the block structures by manually placing blocks through an interface with a fixed viewing angle. The results show that response times increased proportionally to the angular distance from the training viewpoint. With extensive practice, performance became nearly equivalent at all familiar viewpoints; however, practice at familiar viewpoints did not transfer to unfamiliar viewpoints.

Based on these past observations, Logothetis, Pauls, and Poggio raised the question: if monkeys are extensively trained to identify novel 3D objects, would one find neurons in the brain that respond selectively to particular views of such objects? The results, published in 1995, were clear. Running the same paper clip object recognition task on monkeys, they found that 11.6% of the isolated neurons sampled in the IT region, the region known to represent objects, responded selectively to a subset of views of one of the known target objects. The response of these individual neurons decreased as the object was rotated away from the canonical view that the neuron represents, along any of the four tested axes. The experiments also show that these view-specific neurons are scale and position invariant up to a certain degree.

Viewpoint-specific neurons (Logothetis et al., 1995)

This series of findings from human psychophysics and neurophysiology research provided converging evidence for 'image-based' models in which objects are represented as collections of viewpoint-specific local features. A series of works in computer vision has also shown that, by allowing each canonical view to represent a range of images, such models are no longer infeasible. However, despite a large amount of research, most of the detailed mechanisms are still unknown and require further study.

Check out these papers visually on my other website, EatPaper.org

References not linked in post:

Tarr, Michael J., and Heinrich H. Bülthoff. “Image-based object recognition in man, monkey and machine.” Cognition 67.1 (1998): 1-20.

Palmeri, Thomas J., and Isabel Gauthier. “Visual object understanding.” Nature Reviews Neuroscience 5.4 (2004): 291-303.
