Life is a game, take it seriously

Posts Tagged ‘machine learning’

Deep Learning and Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Uncategorized on November 22, 2015 at 8:17 pm

by Li Yang Ku (Gooly)

Yann LeCun Geoff Hinton Yoshua Bengio Andrew Ng

Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng

Well, right, nowadays it is just hard not to talk about Deep Learning and Convolutional Neural Networks (CNN) in the field of Computer Vision. Since 2012 when the neural network trained by two of Geoffrey Hinton’s students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet Challenge by a large margin, neural networks have quickly become mainstream and made probably the greatest comeback ever in the history of AI.


So what is Deep Learning and CNN? According to the 2014 RSS keynote speech by Andrew Ng , Deep Learning is more or less a brand name for all works related to this class of approaches that try to learn high-level abstractions in data by using multiple layers. One of my favorite pre-2012 work is the deep belief nets done by Geoffrey Hinton, Simon Osindero and Yee-Why Teh, where basically a multi-layer neural network is used to learn hand written digits. While I was still in UCLA, Geoffrey demonstrated this neural network during his visit in 2010. What is interesting is that this network not only classifies digits but can also be used to generate digits in a top down fashion. See a talk he did below for this work.

On the other hand, Convolutional Neural Networks (CNN) is a specific type of multi-layer model. One of the most famous work pre-2012 was on classifying images (hand written digits) introduced by Yann LeCun and his colleagues while he was at Bell Laboratories. This specific CNN, which is called the LeNet now, uses the same weights for the same filter across different locations in the first two layers, therefore largely reduces the number of parameters needed to be learned compared to a fully connected neural network. The underlying concept is fairly simple; if a filter that acts like an edge detector is useful in the left corner then it is probably also useful in the right corner.


Both Deep Learning and CNNs are not new. Deep Learning concepts such as using multiple layers can be dated all the way back to 1975 when back propagation, an algorithm for learning the weights of a multi-layer neural network, was first introduced by Paul Werbos. CNNs on the other hand can also be traced back to around 1980s when neural network was popular. The LeNet was also work done around 1989. So why are Deep Learning and CNN suddenly gaining fame faster than any pop song singer in the field of Computer Vision? The short answer is because it works. Or more precisely, it works better than traditional approaches. A more interesting question would be why it works now but not before? The answer of this question can be narrowed down to three reasons. 1) Data: thanks to people posting cat images on the internet and the Amazon Mechanical Turk we have millions of labeled images for training neural networks such as the ImageNet. 2) Hardware: GPUs allow us to train multi-layer neural networks with millions of data within a few weeks through exploiting parallelism in neural networks. 3) Algorithms: new approaches such as dropout and better loss functions are developed to help train better networks.


One of the advantages of Deep Learning is that it bundles feature detection and classification. Traditional approaches, which I have talked about in my past post, usually consist of two parts, a feature detector such as the SIFT detector and a classifier such as the support vector machine. On the other hand, Deep Learning trains both of these together. This allows better features to be learned directly from the raw data based on the classification results through back propagation. Note that even though sparse coding approaches also learns features from raw images they are not trained end to end. It was also shown that through using dropout, an approach that simply randomly drops units to prevent co-adapting, such deep neural networks doesn’t seem to suffer an over fitting problem like other machine learning approaches. However, the biggest challenge lies in the fact that it works like a black box and there are no proven theories on why back propagation on deep neural networks doesn’t converge to a local minima yet. (or it might be converging to a local minima but we just don’t know.)


Many are excited about this recent trend in Deep Learning and associate it with how our own brain works. As exciting as I am, being a big fan of Neuroscience, we have to also keep in mind that such neural networks are proven to be able to approximate any continuous function based on the universal approximation theory. Therefore a black box as it is we should not be surprised that it has the capability to be a great classifier. Besides, an object recognition algorithm that works well doesn’t mean that it correlates to how brains work, not to mention that deep learning only works well with supervised data and therefore quite different from how humans learn. The current neural network model also acts quite differently from how our neurons work according to Jeff Hawkins, not to mention the fact that there are a large amount of motor neurons going top down in every layer in our brain that is not captured in these neural networks. Having said that, I am still embracing Deep Learning in my own research and will go through other aspects of it in the following posts.



Local Distance Learning in Object Recognition

In Computer Vision, Paper Talk on February 8, 2015 at 11:59 am

by Li Yang Ku (Gooly)

learning distance

Unsupervised clustering algorithms such as K-means are often used in computer vision as a tool for feature learning. It can be used in different stages in the visual pathway. Running K-means algorithm on a small region of pixel patches might result in finding a lot of patches with edges of different orientation while running K-means on a larger HOG feature might result in finding contours of meaningful parts of objects such as faces if your training data consists of selfies.  However, although convenient and simple as it seems, we have to keep in mind that these unsupervised clustering algorithms are all based on the assumption that a meaningful metric is provided. Without this criteria, clustering suffers from the “no right answer” problem. Whether the algorithm should group a set of images into clusters that contain objects with the same type or the same color is ambiguous and not well defined. This is especially true when your observation vectors are consists of values representing different types of properties.

distance learning

This is where Distance Learning comes into play. In the paper “Distance Metric Learning, with Application to Clustering with Side-Information” written by Eric Xing, Andrew Ng, Michael Jordan and Stuart Russell, a matrix A that represents the distance metric is learned through convex optimization using user inputs specifying grouping examples. This matrix A can either be full or diagonal. When learning a diagonal matrix, the values simply represent the weights of each feature. If the goal is to group objects with similar color, features that can represent color will have a higher weight in the matrix. This metric learning approach was shown to improve clustering on the UCI data set.

visual association

In another work “Recognition by Association via Learning Per-exemplar Distances” written by Tomasz Malisiewicz and Alexei Efros, the object recognition problem is posed as data association. A region in the image is classified by associating it with a small set of exemplars based on visual similarity. The authors suggested that the central question for recognition might not be “What is it?” but “What is it like?”. In this work, 14 different type of features under 4 categories, shape, color, texture and location are used. Unlike the single distance metric learned in the previous work, a separate distance function that specifies the weights put on these 14 different type of features is learned for each exemplar. Some exemplars like cars will not be as sensitive to color as exemplars like sky or grass, therefore having a different distance metric for each exemplar becomes advantageous in such situations. These class of work that defines separate distance metrics are called Local Distance Learning.

instance distance learning

In a more recent work “Sparse Distance Learning for Object Recognition Combining RGB and Depth Information” by Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox, a new approach called Instance Distance Learning is introduced, which instance is referred to one single object. When classifying a view, the view to object distance is compared simultaneously to all views of an object instead of a nearest neighbor approach. Besides learning weight vectors on each feature, weights on views are also learned. In addition, a L1 regularization is used instead of a L2 regularization in the Lagrange function. This generates a sparse weight vector which has a zero term on most views. This is quite interesting in the sense that this approach finds a small subset of representative views for each instance. In fact as shown in the image below, with just 8% of the exemplar data a similar decision boundaries can be achieved. This is consistent to what I talked about in my last post; human brain doesn’t store all the possible views of an object nor does it store a 3D model of the object, instead it stores a subset of views that are representing enough to recognize the same object. This work demonstrates one possible way of finding such subset of views.

instance distance learning decision boundaries


One dollar classifier? NEIL, the never ending image learner

In Computer Vision, Machine Learning on November 27, 2013 at 5:18 pm

by Li Yang Ku (Gooly)

NEIL never ending image learner

I had the chance to chat with Abhinav Gupta, a research professor at CMU, in person when he visited UMass Amherst about a month ago. Abhinav presented NEIL, the never ending image learner in his talk at Amherst. To give a short intro, the following is from Abhinav

“NEIL (Never Ending Image Learner) is a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. It is an effort to build the world’s largest visual knowledge base with minimum human labeling effort – one that would be useful to many computer vision and AI efforts.” 

NEIL never ending image learner clusters

One of the characteristic that distinguishes NEIL from other object recognition algorithms that are trained and tested on large web image data set such as the ImageNet or LFW is that NEIL is trying to recognize images that are in a set that has unlimited data and unlimited category. At first glance this might look like a problem too hard to solve. But NEIL approaches this problem in a smart way. Instead of trying to label images one by one on the internet, NEIL start from labeling just the easy ones. Since given a keyword the number of images returned are so large using Google Image Search, NEIL simply picks the ones it feels most certain, which are the ones that share the most common HOG like features. This step also helps refining the query result. Say we searched for cars on Google Image, it is very likely that out of every 100 images there is one image that has nothing to do with cars (very likely some sexy photo of girls with file name girl_love_cars.jpg ). These outliers won’t share the same visual features as the other car clusters and will not be labeled. By doing so NEIL can gradually build  up a very large labeled data set from one word to another.


NEIL also learns the relationships between images and is connected with NELL, never ending language learning. More details should be released in future papers. During the talk Abhinav said he plan to set up a system where you can submit the category you wanna train on and with just $1, NEIL will give you a set of HOG classifiers in that category in 1 day.

NEIL relationship