Life is a game, take it seriously

Deep Learning and Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Uncategorized on November 22, 2015 at 8:17 pm

by Li Yang Ku (Gooly)

Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng

Well, right, nowadays it is just hard not to talk about Deep Learning and Convolutional Neural Networks (CNN) in the field of Computer Vision. Since 2012, when the neural network trained by two of Geoffrey Hinton’s students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet Challenge by a large margin, neural networks have quickly become mainstream and made probably the greatest comeback ever in the history of AI.


So what are Deep Learning and CNNs? According to the 2014 RSS keynote speech by Andrew Ng, Deep Learning is more or less a brand name for all work related to the class of approaches that try to learn high-level abstractions in data by using multiple layers. One of my favorite pre-2012 works is the deep belief nets by Geoffrey Hinton, Simon Osindero and Yee-Whye Teh, where basically a multi-layer neural network is used to learn handwritten digits. While I was still at UCLA, Geoffrey demonstrated this neural network during his visit in 2010. What is interesting is that this network not only classifies digits but can also be used to generate digits in a top-down fashion. See the talk he gave on this work below.

On the other hand, a Convolutional Neural Network (CNN) is a specific type of multi-layer model. One of the most famous pre-2012 works was on classifying images (handwritten digits), introduced by Yann LeCun and his colleagues while he was at Bell Laboratories. This specific CNN, now called LeNet, uses the same weights for the same filter across different locations in the first two layers, and therefore greatly reduces the number of parameters that need to be learned compared to a fully connected neural network. The underlying concept is fairly simple: if a filter that acts like an edge detector is useful in the left corner, then it is probably also useful in the right corner.
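To get a feel for how much weight sharing saves, here is a minimal sketch that counts parameters for a convolutional layer and for a fully connected layer producing the same number of outputs. The sizes (a 32x32 grayscale input and six 5x5 filters) are hypothetical choices for illustration and are not taken from the original LeNet paper.

```python
def fully_connected_params(in_h, in_w, out_units):
    """Weights plus biases for a dense layer over a flattened one-channel image."""
    return in_h * in_w * out_units + out_units


def conv_params(kernel_h, kernel_w, in_channels, num_filters):
    """Weights plus biases for a convolutional layer with shared filters.

    The count does not depend on the image size, because each filter is
    reused at every location.
    """
    return kernel_h * kernel_w * in_channels * num_filters + num_filters


def conv_output_size(in_size, kernel_size):
    """Output width (or height) of a stride-1 convolution without padding."""
    return in_size - kernel_size + 1


# Hypothetical sizes: 32x32 grayscale input, six 5x5 filters.
out_side = conv_output_size(32, 5)
dense = fully_connected_params(32, 32, 6 * out_side * out_side)
shared = conv_params(5, 5, 1, 6)
print("fully connected:", dense, "convolutional:", shared)
```

With these assumed sizes the shared filters need a few hundred parameters, while the dense layer with the same output size needs millions.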


Neither Deep Learning nor CNNs are new. Deep Learning concepts such as using multiple layers can be dated all the way back to 1975, when back propagation, an algorithm for learning the weights of a multi-layer neural network, was first introduced by Paul Werbos. CNNs can also be traced back to around the 1980s, when neural networks were popular. LeNet was also developed around 1989. So why are Deep Learning and CNNs suddenly gaining fame faster than any pop singer in the field of Computer Vision? The short answer is because they work. Or more precisely, they work better than traditional approaches. A more interesting question would be why they work now but not before. The answer to this question can be narrowed down to three reasons. 1) Data: thanks to people posting cat images on the internet and Amazon Mechanical Turk, we have datasets such as ImageNet with millions of labeled images for training neural networks. 2) Hardware: GPUs allow us to train multi-layer neural networks on millions of examples within a few weeks by exploiting the parallelism in neural networks. 3) Algorithms: new approaches such as dropout and better loss functions have been developed to help train better networks.


One of the advantages of Deep Learning is that it bundles feature detection and classification. Traditional approaches, which I have talked about in my past post, usually consist of two parts: a feature detector such as the SIFT detector and a classifier such as a support vector machine. Deep Learning, on the other hand, trains both of these together. This allows better features to be learned directly from the raw data based on the classification results through back propagation. Note that even though sparse coding approaches also learn features from raw images, they are not trained end to end. It was also shown that by using dropout, an approach that simply randomly drops units to prevent co-adaptation, such deep neural networks don’t seem to suffer from overfitting the way other machine learning approaches do. However, the biggest challenge lies in the fact that it works like a black box, and there is no proven theory yet on why back propagation on deep neural networks doesn’t converge to a local minimum (or it might be converging to a local minimum but we just don’t know).
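The dropout idea mentioned above can be sketched in a few lines. This is a minimal illustration that assumes the common "inverted dropout" variant, which rescales the surviving units during training; the original formulation instead rescales the weights at test time. The function name and parameters are my own and do not come from any particular library.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout over a list of activations.

    During training each unit is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob), so the expected activation
    is unchanged. At test time the input is returned as is.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10
train_out = dropout(hidden, 0.5, rng, training=True)
test_out = dropout(hidden, 0.5, rng, training=False)
print(train_out)
print(test_out)
```

Because a different random subset of units is dropped on every training pass, no unit can rely on a specific partner being present, which is the co-adaptation that dropout tries to prevent.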


Many are excited about this recent trend in Deep Learning and associate it with how our own brain works. As excited as I am, being a big fan of Neuroscience, we also have to keep in mind that such neural networks are proven to be able to approximate any continuous function by the universal approximation theorem. Therefore, black box as it is, we should not be surprised that it has the capability to be a great classifier. Besides, an object recognition algorithm working well doesn’t mean that it corresponds to how brains work, not to mention that deep learning only works well with supervised data and is therefore quite different from how humans learn. The current neural network model also acts quite differently from how our neurons work according to Jeff Hawkins, not to mention the fact that there are a large number of motor neurons going top down in every layer of our brain that are not captured in these neural networks. Having said that, I am still embracing Deep Learning in my own research and will go through other aspects of it in the following posts.



A Tale of Two Visual Pathways

In Computer Vision, Neural Science, Visual Illusion on May 14, 2015 at 7:53 pm

by Li Yang Ku (Gooly)


The idea that our brain encodes visual stimuli in two separate regions, based on whether they contain information about object location or identification, was first proposed by Schneider in 1969. In 1982, Ungerleider and Mishkin further proposed the two visual pathways hypothesis, which suggests that two areas, the inferotemporal cortex and the posterior parietal cortex, receive independent sets of projections from the striate cortex (also called the primary visual cortex and often referred to as V1; this is where many people think Gabor-like filters reside). According to their original account, the ventral stream, which starts from V1, passes through V2 and V4 and ends in the inferotemporal cortex, plays a critical role in identifying objects, while the dorsal stream, which starts from V1, passes through V5 and V6 and ends in the posterior parietal cortex, encodes the spatial location of those same objects. Lesion experiments on monkeys at that time fitted well with this hypothesis. Monkeys with lesions of the inferotemporal cortex were impaired in recognition tasks but still capable of using visual cues to determine which location is rewarded. Opposite results were observed in monkeys with posterior parietal lesions. This hypothesis is often known as the distinction of ‘what’ and ‘where’ between the two visual pathways.


However, later findings showed that this hypothesis, that the two visual pathways encode spatial location and object identification separately, doesn’t quite capture the whole picture. Subjects with lesions in the posterior parietal region not only have difficulty reaching in the right direction but also in positioning their fingers or adjusting the orientation of their hand. In 1992, Goodale and Milner proposed an alternative perspective on the functionality of these two visual pathways based on many observations made with patient DF. Instead of making distinctions based on the internal representation, Goodale and Milner suggested taking more account of output requirements and introduced a separation between the two visual pathways based on ‘what’ and ‘how’ instead of ‘what’ and ‘where’.

Patient DF is unique in the sense that she developed a profound visual form agnosia due to anoxic damage to her ventral stream. Despite DF’s inability to recognize the shape, size and orientation of visual objects, she is capable of grasping the very same objects with accurate hand and finger movements. When DF was asked to indicate the width of a cube with her thumb and index finger, her matches bore no relationship to the actual size of the cube. However, when she was asked to reach out and pick up the cube, the distance between her thumb and index finger matched the dimension of the cube systematically. In a series of experiments, DF was able to adjust her fingers to pick up objects of different scales even though she was unable to perceive the dimensions of those objects. Based on these observations, Goodale and Milner proposed that the dorsal pathway provides action-relevant information about the structural characteristics and orientation of objects and not just about their position.


This two visual pathways hypothesis, often referred to as the perception-action model, has received significant attention in the field of Neuropsychology and has influenced thousands of studies since 1992. However, several aspects of this model have been questioned by recent findings. In 2011, Hesse et al. showed that the opposite experimental results between patients with lesions in the dorsal stream and the ventral stream are affected by whether the subject fixates on the target and are not as complementary as previously thought. Several experiments have also shown that the assumed functional independence between action and perception might have overlooked conditions in which perception and action actually interact. In 1998, Deubel et al. found that participants’ ability to discriminate a visual target increases when the participants point to the target location. In 2005, Linnell et al. further found that this increase in discrimination ability happens even before the pointing action is performed; simply the intention to perform an action may change perception capability. These findings suggest that the ventral and dorsal visual pathways are not as independent as previously thought and may ‘talk’ to one another when actions are programmed.

References are here

Local Distance Learning in Object Recognition

In Computer Vision, Paper Talk on February 8, 2015 at 11:59 am

by Li Yang Ku (Gooly)

Unsupervised clustering algorithms such as K-means are often used in computer vision as a tool for feature learning. They can be used at different stages in the visual pathway. Running K-means on small pixel patches might result in finding a lot of patches with edges of different orientations, while running K-means on larger HOG features might result in finding contours of meaningful parts of objects, such as faces if your training data consists of selfies. However, convenient and simple as they seem, we have to keep in mind that these unsupervised clustering algorithms are all based on the assumption that a meaningful metric is provided. Without such a metric, clustering suffers from the “no right answer” problem. Whether the algorithm should group a set of images into clusters that contain objects of the same type or of the same color is ambiguous and not well defined. This is especially true when your observation vectors consist of values representing different types of properties.

This is where Distance Learning comes into play. In the paper “Distance Metric Learning, with Application to Clustering with Side-Information” written by Eric Xing, Andrew Ng, Michael Jordan and Stuart Russell, a matrix A that represents the distance metric is learned through convex optimization from user inputs that specify example groupings. This matrix A can be either full or diagonal. When learning a diagonal matrix, the values simply represent the weight of each feature. If the goal is to group objects with similar color, features that represent color will have a higher weight in the matrix. This metric learning approach was shown to improve clustering on UCI data sets.
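To make the diagonal case concrete, here is a minimal sketch of a distance under a diagonal metric, where each feature difference is scaled by its own weight. The weights below are made up by hand rather than learned through the convex optimization in the paper, and the two features (color, shape) are hypothetical.

```python
import math


def diag_metric_distance(x, y, weights):
    """Distance sqrt(sum_i w_i * (x_i - y_i)^2) under a diagonal metric."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# Hypothetical 2-D feature vectors: (color, shape).
query = (0.0, 0.0)
candidates = {
    "same_color": (0.1, 1.0),  # close in color, far in shape
    "same_shape": (1.0, 0.1),  # far in color, close in shape
}

color_metric = (10.0, 0.1)  # hand-picked weights that emphasize color
shape_metric = (0.1, 10.0)  # hand-picked weights that emphasize shape


def nearest(metric):
    """Name of the candidate closest to the query under the given weights."""
    return min(candidates,
               key=lambda name: diag_metric_distance(query, candidates[name], metric))


print(nearest(color_metric), nearest(shape_metric))
```

The same two candidates swap places as the nearest neighbor depending on the weights, which is why a clustering result is only as meaningful as the metric behind it.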

In another work, “Recognition by Association via Learning Per-exemplar Distances” written by Tomasz Malisiewicz and Alexei Efros, the object recognition problem is posed as data association. A region in the image is classified by associating it with a small set of exemplars based on visual similarity. The authors suggested that the central question for recognition might not be “What is it?” but “What is it like?”. In this work, 14 different types of features under 4 categories (shape, color, texture and location) are used. Unlike the single distance metric learned in the previous work, a separate distance function that specifies the weights put on these 14 different types of features is learned for each exemplar. Some exemplars like cars will not be as sensitive to color as exemplars like sky or grass, so having a different distance metric for each exemplar becomes advantageous in such situations. This class of work that defines separate distance metrics is called Local Distance Learning.
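The per-exemplar idea can be sketched as follows: every exemplar carries its own weight vector, and a region is associated with an exemplar when that exemplar's weighted distance falls below a threshold. This is only an illustration under my own assumptions; the weights, the four features (one per category) and the threshold of 1.0 are hypothetical, and nothing here is learned as it is in the paper.

```python
def exemplar_distance(weights, elementary_distances):
    """Weighted sum of per-feature distances between a region and an exemplar."""
    return sum(w * d for w, d in zip(weights, elementary_distances))


def associate(region_distances, exemplar_weights, threshold=1.0):
    """Return labels of exemplars whose own distance function accepts the region.

    region_distances[label] holds the per-feature distances between the region
    and that exemplar; exemplar_weights[label] holds that exemplar's weights.
    """
    return sorted(
        label for label, dists in region_distances.items()
        if exemplar_distance(exemplar_weights[label], dists) < threshold
    )


# Hypothetical features: (shape, color, texture, location).
exemplar_weights = {
    "car": (2.0, 0.1, 0.5, 0.5),    # mostly shape, nearly color-blind
    "grass": (0.2, 2.0, 1.0, 0.5),  # mostly color and texture
}
# A red, car-shaped region: its shape matches the car, its color matches neither.
region_distances = {
    "car": (0.1, 0.9, 0.3, 0.2),
    "grass": (0.8, 0.9, 0.7, 0.4),
}
matches = associate(region_distances, exemplar_weights)
print(matches)
```

The large color distance barely hurts the car exemplar because its color weight is small, while the same color distance rules out the grass exemplar.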

In a more recent work, “Sparse Distance Learning for Object Recognition Combining RGB and Depth Information” by Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox, a new approach called Instance Distance Learning is introduced, where an instance refers to a single object. When classifying a view, the view-to-object distance is computed over all views of an object simultaneously instead of using a nearest neighbor approach. Besides learning a weight vector over the features, weights on views are also learned. In addition, an L1 regularization is used instead of an L2 regularization in the Lagrange function. This generates a sparse weight vector with zero terms for most views. This is quite interesting in the sense that this approach finds a small subset of representative views for each instance. In fact, as shown in the image below, similar decision boundaries can be achieved with just 8% of the exemplar data. This is consistent with what I talked about in my last post: the human brain doesn’t store all the possible views of an object, nor does it store a 3D model of the object; instead it stores a subset of views that are representative enough to recognize the same object. This work demonstrates one possible way of finding such a subset of views.

instance distance learning decision boundaries
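Why an L1 penalty produces zeros while an L2 penalty does not can be seen in a one-dimensional toy problem. The sketch below is not the optimization used in the paper; it simply applies the closed-form minimizers of a squared error plus an L1 or L2 penalty to a handful of made-up view weights.

```python
def l1_shrink(value, lam):
    """Minimizer of 0.5 * (w - value)**2 + lam * abs(w) (soft thresholding)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def l2_shrink(value, lam):
    """Minimizer of 0.5 * (w - value)**2 + 0.5 * lam * w**2."""
    return value / (1.0 + lam)


# Hypothetical unregularized weights for five views of one object.
view_weights = [2.0, 0.5, -0.25, 1.5, 0.125]
lam = 0.5
sparse = [l1_shrink(v, lam) for v in view_weights]
dense = [l2_shrink(v, lam) for v in view_weights]
print("L1:", sparse)
print("L2:", dense)
```

The L1 penalty sets every weight whose magnitude is at most lam to exactly zero, leaving only a few views, whereas the L2 penalty just scales all weights down and keeps every view.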


