Life is a game, take it seriously

Sparse Coding in a Nutshell

In Computer Vision, Neural Science, Sparse Coding on May 24, 2014 at 7:24 pm

by Li Yang Ku (Gooly)


I’ve been reading some of Dieter Fox’s publications recently, and a series of work on Hierarchical Matching Pursuit (HMP) caught my eye. Three papers are based on HMP: “Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms”, “Unsupervised feature learning for RGB-D based object recognition”, and “Unsupervised Feature Learning for 3D Scene Labeling”. In all three of these publications, the HMP algorithm is the centerpiece. The first paper, published in 2011, deals with scene classification and object recognition on grayscale images; the second paper, published in 2012, takes RGB-D images as input for object recognition; the third paper, published in 2014, further extends the application to scene labeling using point cloud input. The three figures below are the feature dictionaries used in these papers, in chronological order.


One of the central concepts of HMP is to learn low-level and mid-level features instead of using hand-crafted features like SIFT. In fact, the first paper claims it is the first work to show that learning features from the pixel level significantly outperforms approaches built on top of SIFT. Explained in a sentence, HMP is an algorithm that builds up a sparse dictionary and encodes inputs hierarchically so that meaningful features are preserved. The final classifier is simply a linear support vector machine, so the magic is mostly in the sparse coding. To fully understand why sparse coding might be a good idea, we have to go back in time.
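The encoding half of such a pipeline can be made concrete. Matching pursuit greedily picks, at each step, the dictionary atom most correlated with the current residual. Below is a minimal sketch over a random dictionary; it captures only the general idea, and all names, sizes, and the sparsity level are my own illustration, not taken from the papers:

```python
import numpy as np

def matching_pursuit(x, D, n_nonzero=5):
    """Greedily approximate x as a sparse combination of dictionary atoms.

    D is a (dim, n_atoms) dictionary with unit-norm columns; returns a
    coefficient vector with at most n_nonzero nonzero entries."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        scores = D.T @ residual          # correlation with every atom
        k = np.argmax(np.abs(scores))    # best-matching atom
        coeffs[k] += scores[k]
        residual -= scores[k] * D[:, k]  # explain away that component
    return coeffs

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = rng.normal(size=64)
code = matching_pursuit(x, D)
print(np.count_nonzero(code), np.linalg.norm(x - D @ code))
```

In an HMP-style pipeline, such sparse codes would be computed for every image patch, pooled spatially, and the pooled features fed to the linear SVM.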

Back in the 50s, Hubel and Wiesel’s work discovering Gabor-filter-like neurons in the cat brain inspired a lot of people. However, the community thought these Gabor-like filters were some sort of edge detector. This view led to a series of work on edge detection in the 80s, when digital image processing became practical on computers. Edge detectors such as Canny, Harris, Sobel, and Prewitt are all based on the concept of detecting edges before recognizing objects. More recent algorithms such as the Histogram of Oriented Gradients (HOG) are extensions of these edge detectors. A quite successful example of HOG is the pedestrian detection paper “Histograms of oriented gradients for human detection” (see figure below).
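To illustrate that lineage, here is a toy, single-cell version of the HOG idea: Sobel filters give per-pixel gradients, which are then binned into an orientation histogram weighted by gradient magnitude. Real HOG adds overlapping blocks and contrast normalization; everything below is my own simplified sketch:

```python
import numpy as np

def sobel_gradients(img):
    """Convolve with Sobel kernels to get x/y gradients (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    return gx, gy

def orientation_histogram(gx, gy, n_bins=9):
    """Bin gradient orientations (0-180 degrees), weighted by magnitude."""
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m
    return hist

# A vertical step edge: the gradient points along x, so all of the
# histogram energy falls in the first orientation bin.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
gx, gy = sobel_gradients(img)
hist = orientation_histogram(gx, gy)
print(np.argmax(hist))  # 0
```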

hog and sift

Moving on to the 90s and 2000s, SIFT-like features seem to have dominated a large part of the computer vision world. These hand-crafted features work surprisingly well and have led to many real applications. These types of algorithms usually consist of two steps: 1) detect interesting feature points (yellow circles in the figure above), and 2) generate an invariant descriptor around each point (green checkerboards in the figure above). One of the reasons SIFT works well is that it only cares about interest points, which lowers the dimension of the feature significantly. This allows classifiers to need fewer training samples before they can make reasonable predictions. However, throwing away all that geometry and texture information is unlikely to be how we humans see the world, and it fails in texture-less scenarios.
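As a concrete (and much simplified) illustration of step 1, the sketch below finds interest points as local maxima of a difference-of-Gaussians image, the same idea SIFT’s detector is built on; step 2 would then attach a gradient-based descriptor around each detected point. The parameter values are my own toy choices, not SIFT’s:

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with same-size output via reflect padding."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2.0 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode="reflect")
    # Blur along rows, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def detect_keypoints(img, sigma1=1.0, sigma2=2.0, thresh=0.05):
    """Interest points = local maxima of the difference-of-Gaussians image."""
    dog = gaussian_blur(img, sigma1) - gaussian_blur(img, sigma2)
    points = []
    for i in range(1, dog.shape[0] - 1):
        for j in range(1, dog.shape[1] - 1):
            patch = dog[i - 1:i + 2, j - 1:j + 2]
            if dog[i, j] == patch.max() and dog[i, j] > thresh:
                points.append((i, j))
    return points

# A single bright blob should yield one keypoint at its center.
img = np.zeros((16, 16))
img[8, 8] = 1.0
pts = detect_keypoints(img)
print(pts)
```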

In 1996, Olshausen showed that when a sparsity constraint is added, Gabor-like filters are the codes that best describe natural images. What this might suggest is that the filters in V1 (Gabor filters) are not just edge detectors, but statistically the best coding for natural images under a sparsity constraint. I regard this as the most important evidence that our brain uses sparse coding, and the reason many new algorithms that use sparse coding, such as HMP, work better. If you are interested in why evolution picked sparse coding, Jeff Hawkins has a great explanation in one of his talks (at 17:33): besides saving energy, it also helps with generalization and makes comparing features easy. Andrew Ng also has a paper, “The importance of encoding versus training with sparse coding and vector quantization”, analyzing which part of sparse coding leads to better results.
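The modern form of this objective minimizes reconstruction error plus a sparsity penalty on the coefficients, min_a ½‖x − Da‖² + λ‖a‖₁, and a standard way to solve the coding step for a fixed dictionary is iterative shrinkage-thresholding (ISTA). The sketch below is my own illustration with arbitrary sizes; Olshausen’s original work used a different penalty and optimization scheme:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam=0.05, n_iter=200):
    """ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1 (fixed dictionary D)."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)   # gradient of the reconstruction term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(1)
D = rng.normal(size=(32, 128))
D /= np.linalg.norm(D, axis=0)     # unit-norm atoms
# A signal built from 3 atoms: the recovered code should be sparse.
x = D[:, [5, 40, 99]] @ np.array([1.0, -2.0, 1.5])
a = ista(x, D)
print(np.count_nonzero(np.abs(a) > 1e-3))
```

In dictionary learning, this coding step alternates with updating D itself; here D is just random for illustration.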

Creating 3D mesh models using Asus xtion with RGBDemo and Meshlab on Ubuntu 12.04

In Computer Vision, Kinect on March 12, 2014 at 5:15 pm

by Li Yang Ku (Gooly)


Creating 3D models simply by scanning an object with low-cost sensors is something that sounds futuristic but isn’t. Although models scanned with a Kinect or Asus xtion aren’t as pretty as CAD models or laser-scanned models, they can actually be helpful in robotics research: a not-so-perfect model scanned by the same sensor mounted on the robot is closer to what the robot perceives. In this post I’ll go through the steps of creating a polygon mesh model by scanning a coke can with the xtion sensor. The process consists of three parts: compiling RGBDemo, scanning the object, and converting the scanned vertices to a polygon mesh in MeshLab.


RGBDemo is a great piece of open-source software that can help you scan objects into a single PLY file with the help of some AR tags. If you are using a Windows machine, running the compiled binary should be the easiest way to get started. However, if you are running on an Ubuntu machine, the following are the steps I took. (I had compile errors following the official instructions, but they might still be worth a try.)

  1. Make sure you have OpenNI installed. I use the old version of OpenNI instead of OpenNI2. See my previous post about installing OpenNI on Ubuntu if you haven’t.
  2. Make sure you have PCL and OpenCV installed. For PCL I use the one that comes with ROS (ros-fuerte-pcl) and for OpenCV I have libcv2.3 installed.
  3. Download RGBDemo from Github
    git clone --recursive
  4. Modify the file under the rgbdemo folder. Add the following line among the other options so that it won’t use OpenNI2.
        -DNESTK_USE_OPENNI2=0 \
  5. Modify rgbdemo/scan-markers/ModelAcquisitionWindow.cpp. Comment out lines 57 to 61. (For the compile error: ‘const class ntk::RGBDImage’ has no member named ‘withDepthDataAndCalibrated’)
        void ModelAcquisitionWindow::on_saveMeshButton_clicked()
        {
            //if (!m_controller.modelAcquisitionController()->currentImage().withDepthDataAndCalibrated())
            //    ntk_dbg(1) << "No image already processed.";
            QString filename = QFileDialog::getSaveFileName
  6. cmake and build
  7. The binary files should be built under build/bin/.

turtle mesh

To create a 3D mesh model, we first capture a model (a PLY file) that consists only of vertices using RGBDemo.

  1. Print out the AR tags located in the folder ./scan-markers/data/, and stick them on a flat board such that the numbers are close to each other. Put your target object at the center of the board.
  2. Run the binary ./build/bin/rgbd-scan-markers
  3. Two windows should pop up: RGB-D Capture and 3D View. Point the camera toward the object on the board and click "Add current frame" in the 3D View window. Move the camera around the object to fill in the missing pieces of the model.
  4. Click on the RGB-D Capture window and click Capture -> Pause in the menu at the top of the screen. Click "Remove floor plane" in the 3D View window to remove most of the board.
  5. Click “Save current mesh” to save the vertices into a ply file.


The following steps convert the model captured with RGBDemo into a 3D mesh model in MeshLab (MeshLab can be installed through the Ubuntu Software Center).

  1. Import the ply file created in the last section.
  2. Remove unwanted vertices in the model. (select and delete, let me know if you can’t figure out how to do this)
  3. Click "Filters -> Point Set -> Surface Reconstruction: Poisson". This will pop up a dialog; applying the default settings will generate a mesh with an estimated surface. If you check "View -> Show Layer Dialog" you should be able to see two layers: the original point set and the newly constructed mesh.
  4. To transfer color to the new mesh, click "Filters -> Sampling -> Vertex Attribute Transfer". Select mesh.ply as the source and the Poisson mesh as the target. This should transfer the colors on the vertices to the mesh.
  5. Note that MeshLab has some problems when saving to the Collada (dae) format.

Human vision, top down or bottom up?

In Computer Vision, Neural Science, Paper Talk on February 9, 2014 at 6:42 pm

by Li Yang Ku (Gooly)

top-down bottom-up

How our brain handles visual input is a mystery. When Hubel and Wiesel discovered the Gabor-filter-like neurons in the cat’s V1 area, several feedforward model theories appeared. These models view our brain as a hierarchical classifier that extracts features layer by layer. Poggio’s papers “A feedforward architecture accounts for rapid categorization” and “Hierarchical models of object recognition in cortex” are good examples. These kinds of structures are called discriminative models. Although this new type of model helped the community leap one step forward, it doesn’t solve the problem. Part of the reason is that there are ambiguities when you only view part of the image locally, and a feedforward-only structure can’t achieve global consistency.

Feedforward Vision

Therefore the idea that some kind of feedback model has to exist gradually emerged. Some early works in the computer science community came up with models that rely on feedback, such as Geoffrey Hinton’s Boltzmann Machine invented back in the 80s, which developed into so-called deep learning around the late 2000s. However, it was only around the early 2000s that David Mumford clearly addressed the importance of feedback, in the paper “Hierarchical Bayesian inference in the visual cortex”. Around the same time, Wu and others also successfully combined feedback and feedforward models on textures in the paper “Visual learning by integrating descriptive and generative methods”. Since then, the computer vision community has partly embraced the idea that the brain is more like a generative model, which in addition to categorizing inputs is capable of generating images. An example of humans having generative skills is drawing images from imagination.


Slightly before David Mumford addressed the importance of the generative model, Lamme in the neuroscience community started a series of research on recurrent processes in the vision system. His paper “The distinct modes of vision offered by feedforward and recurrent processing”, published in 2000, addressed why recurrent (feedback) processing might be associated with conscious vision (recognizing objects). In the same year, the paper “Competition for consciousness among visual events: the psychophysics of reentrant visual processes”, published in the field of psychology, also addressed the reentrant (feedback) visual process and proposed a model in which conscious vision is associated with it.


While both the neuroscience and psychology fields have research results suggesting a brain model composed of feedforward and feedback processing, where the feedback mechanism is associated with conscious vision, a recent paper, “Detecting meaning in RSVP at 13 ms per picture”, shows that humans are able to recognize the high-level concept of an image within 13 ms, a gap too short to allow the brain to complete a reentrant (feedback) visual process. This conflicting result could suggest that conscious vision is not the result of feedback processing, or that there are still missing pieces we haven’t discovered. This reminds me of one of Jeff Hawkins’ brain theories, in which he said that solving the mystery of consciousness is like figuring out that the world is round, not flat: easy to understand but hard to accept. He believes that consciousness does not reside in one part of the brain but is simply the combination of all the firing neurons from top to bottom.

