Sparse Coding in a Nutshell

by Li Yang Ku (Gooly)

I’ve been reading some of Dieter Fox’s publications recently and a series of work on Hierarchical Matching Pursuit (HMP) caught my eye. There are three papers that is based on HMP, “Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms”, “Unsupervised feature learning for RGB-D based object recognition” and “Unsupervised Feature Learning for 3D Scene Labeling”. In all 3 of these publications, the HMP algorithm is what it is all about. The first paper, published in 2011, deals with scene classification and object recognition on gray scale images; the second paper, published in 2012, takes RGBD image as input for object recognition; while the third paper, published in 2014, further extends the application to scene recognition using point cloud input. The 3 figures below are the feature dictionaries used in these 3 papers in chronicle order.

One of the center concept of HMP is to learn low level and mid level features instead of using hand craft features like SIFT feature. In fact the first paper claims that it is the first work to show that learning features from the pixel level signiﬁcantly outperforms those approaches built on top of SIFT. Explaining it in a sentence, HMP is an algorithm that builds up a sparse dictionary and encodes it hierarchically such that meaningful features preserves. The final classifier is simply a linear support vector machine, so the magic is mostly in sparse coding. To fully understand why sparse coding might be a good idea we have to go back in time.

Back in the 50’s, Hubel and Wiesel’s work on discovering Gabor filter like neurons in the cat brain really inspired a lot of people. However, the community thought the Gabor like filters are some sort of edge detectors. This discovery leads to a series of work done on edge detection in the 80’s when digital image processing became possible on computers. Edge detectors such as Canny, Harris, Sobel, Prewitt, etc are all based on the concept of detecting edges before recognizing objects. More recent algorithms such as Histogram of Oriented Gradient (HOG) are an extension of these edge detectors. An example of HOG is the quite successful paper on pedestrian detection “Histograms of oriented gradients for human detection” (See figure below).

If we move on to the 90’s and 2000’s, SIFT like features seems to have dominated a large part of the Computer Vision world. These hand-craft features works surprisingly well and lead to many real applications. These type of algorithms usually consist of two steps, 1) detect interesting feature points (yellow circles in the figure above) , 2) generate an invariant descriptor around it (green check boards in the figure above). One of the reasons it works well is that SIFT only cares interest points, therefore lowering the dimension of the feature significantly. This allows classifiers to require less training samples before it can make reasonable predictions. However, throwing away all those geometry and texture information is unlikely how we humans see the world and will fail in texture-less scenarios.

In 1996, Olshausen showed that by adding a sparse constraint, gabor like filters are the codes that best describe natural images. What this might suggest is that Filters in V1 (Gabor filters) are not just edge detectors, but statistically the best coding for natural images under the sparse constraint. I regard this as the most important proof that our brain uses sparse coding and the reason it works better in many new algorithms that uses sparse coding such as the HMP. If you are interested in why evolution picked sparse coding, Jeff Hawkins has a great explanation in one of his talks (at 17:33); besides saving energy, it also helps generalizing and makes comparing features easy. Andrew Ng also has a paper “The importance of encoding versus training with sparse coding and vector quantization” on analyzing which part of sparse coding leads to better result.

Posted

May 24, 2014

Computer Vision, Neural Science, Sparse Coding

Gooly

Tags:

computer vision, HMP, sparse coding

Comments

10 responses to “Sparse Coding in a Nutshell”

Alireza

June 11, 2014

I just want to add some extra information (maybe you already know but I am not sure about other viewers).

– SIFT finds blobs in the image. Blob: The bright disk with dark halo around it (with gaussian interpretation (in more precise manner: Difference of Gaussian)) and vice versa. It seems there are some similarities between they way SIFT works and our visionary systems.

– Dieter Fox is the 3rd guy from the German gang (all studied in USA and at the same university (not sure)) who wrote the book “Probabilistic Robotics” (which is like a bible in the robotic community). The first one (Sebastian Thrun: http://robots.stanford.edu/ ) is like a hero and founder of Udacity (thanks for his bright thinking), and the second one (Wolfram Burgard) has some courses (with videos) online here (http://ais.informatik.uni-freiburg.de/teaching/ss13/robotics/) they are great source of materials for lazy people (like as me) to watch and learn. Dieter Fox teaching are available here (http://www.cs.washington.edu/people/faculty/fox/teaching/)

p.s.1 accept my apologies for using lots of parentheses and grammatical errors.
p.s.2 fan of your site since 2 years ago. I’m new in ROS and 3D sensing and RGBD …

LikeLike

Reply
1. Gooly
  
  June 11, 2014
  
  Cool, thanks for the information. I haven’t really thought about the relation between SIFT and our vision system, but you might be right.
  
  I like how you describe them as the German gang. LOL. My unpublished current research is actually partly inspired by a chapter in the book, hopefully I will be able to write about it soon.
  
  And thanks for the support, readers like you are what keeps me writing.
  
  LikeLike
  
  Reply
Aruni RC

June 20, 2014

Really informative Li, great blog you have here!

The similarity between the mammalian visual cortex and Gabor filters is the basis for a lot of approaches to recognition or coding. Also interesting is the similarity between Gabor filters, the visualizations of sparse encoders (or “sparse filtering” as Andrew Ng describes it) and the weights in Convolutional Neural networks.

In fact, the best performances in object recognition (Pascal VOC or ImageNet) have mostly used ConvNets or some form of sparse coding before a linear SVM as mentioned. Maybe that has become the de-facto standard of image description now? Instead of SIFT being the go-to descriptor, nowadays we would end up using sparse filtering or features from the pre-final layers of pre-trained deep neural networks: http://arxiv.org/pdf/1403.6382v3.pdf

Hope you found it interesting,
Aruni

LikeLike

Reply
Sparse Coding in a Nutshell | Ragnarok Connection

July 10, 2014

[…] Base from: the Serious Computer Vision Blog […]

LikeLike

Reply
CCTC

July 17, 2014

The final classifier is simply a linear support vector machine, so the magic is mostly in sparse coding. To fully understand why sparse coding might be a good idea we have to go back in time. I am not sure about other viewers but visit at CCTC

LikeLike

Reply
Book it: Computer Vision Metrics | the Serious Computer Vision Blog

September 21, 2014

[…] I talked about in my blog such as Sparse Coding and Hierarchical Matching Pursuit are also discussed in the book. The section that did a comparison between some of the relatively […]

LikeLike

Reply
A Tale of Two Visual Pathways | the Serious Computer Vision Blog

May 14, 2015

[…] striate cortex (also named the visual cortex, often referred as V1. This is where many people think Gabor like filters reside). According to their original account, the ventral stream that starts from V1, bypassing […]

LikeLike

Reply
Deep Learning and Convolutional Neural Networks | the Serious Computer Vision Blog

November 22, 2015

[…] the raw data based on the classification results through back propagation. Note that even though sparse coding approaches also learns features from raw images they are not trained end to end. It was also shown that […]

LikeLike

Reply
Distributed Code or Grandmother Cells: Insights From Convolutional Neural Networks | the Serious Computer Vision Blog

January 23, 2016

[…] The grandmother cell is a hypothetical neuron that represents a complex but specific concept or object proposed by cognitive scientist Jerry Letvin in 1969. Although it is mostly agreed that the original concept of grandmother cell which suggests that each person or object one recognizes is associated with a single cell is biological implausible (see here for more discussion), the less extreme idea of grandmother cell is now explained as sparse coding. […]

LikeLike

Reply
Unsupervised learning | Bits and pieces

May 7, 2016

[…] Sparse coding (2014), Back to basics (2013) […]

LikeLike

Reply