Life is a game, take it seriously

Posts Tagged ‘computer vision’

Distributed Code or Grandmother Cells: Insights From Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Sparse Coding on January 23, 2016 at 1:31 pm

by Li Yang Ku (Gooly)

grandmother-cell

Convolutional Neural Network (CNN)-based features will likely replace engineered representations such as SIFT and HOG, yet we know little about what they represent. In this post I will go through a few papers that dive deeper into CNN-based features and discuss whether CNN feature vectors tend to be more like grandmother cells, where most information resides in a small set of filter responses, or distributed code, where most filter responses carry information equally. The content of this post is mostly taken from the following three papers:

  1. Agrawal, Pulkit, Ross Girshick, and Jitendra Malik. “Analyzing the performance of multilayer neural networks for object recognition.” Computer Vision–ECCV 2014. Springer International Publishing, 2014. 329-344.
  2. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
  3. Dosovitskiy, Alexey, and Thomas Brox. “Inverting convolutional networks with convolutional networks.” arXiv preprint arXiv:1506.02753 (2015).

So why do we want to take insights from convolutional neural networks (CNN)? Like I mentioned in my previous post, in 2012 the University of Toronto's CNN implementation won the ImageNet challenge by a large margin: a 15.3% error rate versus 26.6% for the nearest competitor. Since then CNN approaches have been the leaders in most computer vision benchmarks. Although CNNs don't work like the brain, the characteristics that make them work well might also hold true in the brain.

faceselectiv

The grandmother cell is a hypothetical neuron that represents a complex but specific concept or object, proposed by cognitive scientist Jerry Lettvin in 1969. Although it is mostly agreed that the original concept of the grandmother cell, which suggests that each person or object one recognizes is associated with a single cell, is biologically implausible (see here for more discussion), a less extreme version of the grandmother cell idea is now explained as sparse coding.

Deformable Part Model

Before diving into CNN features, let's look at existing computer vision algorithms and see which camp they belong to. Traditional object recognition algorithms are either part-based approaches that use mid-level patches or approaches that use a bag of local descriptors such as SIFT. One of the well-known part-based approaches is the deformable part model, which uses HOG to model parts and a score over their relative locations and deformations to model their spatial relationship. Each part is a mid-level patch that can be seen as a feature that fires to specific visual patterns, and mid-level patch discovery can be viewed as the search for a set of grandmother cell templates.

SIFT

On the other hand, unlike mid-level patches, SIFT-like features represent low-level edges and corners. This bag-of-descriptors approach uses a distributed code; a single feature by itself is not discriminative, but a group of features taken together is.

There have been many attempts to understand CNNs better. One of the early works, by Zeiler and Fergus, finds locally optimal visual inputs for individual filters. However, this does not characterize the distribution of images that cause a filter to activate. Agrawal et al. claimed that a grandmother cell can be seen as a filter with high precision and recall. Therefore, for each conv-5 filter in a CNN trained on ImageNet, they calculate the average precision for classifying images. They showed that grandmother-cell-like filters exist for only a few classes, such as bicycles, persons, cars, and cats. The number of filters required to recognize objects of a class is also measured. For classes such as persons, cars, and cats only a few filters are required, but most classes require 30 to 40 filters.
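
To make the precision/recall interpretation concrete, here is a minimal sketch (my own illustration, not the authors' code) of how one could score a single conv-5 filter as a detector for one class: rank images by the filter's strongest response and compute average precision. The array names, shapes, and the use of scikit-learn are assumptions.

    import numpy as np
    from sklearn.metrics import average_precision_score

    # Hypothetical data: conv5_maps has shape (n_images, n_filters, H, W),
    # labels[i] is 1 if image i contains the class of interest (e.g. "bicycle").
    n_images, n_filters, H, W = 1000, 256, 13, 13
    conv5_maps = np.random.rand(n_images, n_filters, H, W)
    labels = np.random.randint(0, 2, size=n_images)

    def filter_average_precision(filter_idx):
        # Score each image by the filter's strongest response anywhere in its map.
        scores = conv5_maps[:, filter_idx].max(axis=(1, 2))
        return average_precision_score(labels, scores)

    # A filter with both high precision and high recall (high AP) behaves like a
    # grandmother cell for that class; for most classes no single filter does.
    aps = [filter_average_precision(f) for f in range(n_filters)]
    print("best single-filter AP:", max(aps))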

convolutional-neural-networks-top-9-layer-4-5

In the work done by Hinton et al., a concept called distillation is introduced. Distillation transfers the knowledge of a cumbersome model to a small model. For a cumbersome model, the training objective is to maximize the probability of the correct answer. A side effect is that it also assigns probabilities to incorrect answers. Instead of training on the correct answer, distillation trains on soft targets, which are the probabilities of all answers generated by the cumbersome model. They showed that the small model performs better when trained on these soft targets than when trained on the correct answer alone. This result suggests that the relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.
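
A minimal sketch of the soft-target idea (not Hinton et al.'s code): the cumbersome model's logits are softened with a temperature T and the small model is trained to match those softened probabilities. The logit values and the temperature here are made up for illustration.

    import numpy as np

    def softmax(logits, T=1.0):
        z = logits / T
        z = z - z.max(axis=-1, keepdims=True)      # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    # Hypothetical logits for one image over 5 classes.
    teacher_logits = np.array([9.0, 3.0, 1.0, 0.5, -2.0])   # cumbersome model
    student_logits = np.array([4.0, 2.5, 0.0, 0.2, -1.0])   # small model

    T = 4.0                                        # temperature > 1 softens targets
    soft_targets = softmax(teacher_logits, T)      # keeps info about wrong classes
    student_probs = softmax(student_logits, T)

    # Cross-entropy between the soft targets and the student's softened
    # predictions; this is the soft-target loss the small model minimizes.
    distill_loss = -np.sum(soft_targets * np.log(student_probs + 1e-12))
    print(soft_targets, distill_loss)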

Inverting CNN Features

On the other hand, Dosovitskiy et al. tried to understand CNN features by inverting the CNN. They claim that inverting CNN features allows us to see which information of the input image is preserved in the features. Applying the inverse to a perturbed feature vector yields further insight into the structure of the feature space. Interestingly, when they discarded features in the FC8 layer they found that most information is contained in the small probabilities of the remaining classes rather than in the top-5 activations. This result is consistent with the distillation experiment mentioned previously.

Top-5 vs rest feature in FC8
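
The FC8 observation can be stated as a tiny sketch (my own paraphrase of the setup, with made-up numbers): zero out either the top-5 entries of the class probability vector or everything except the top-5, and compare what the inverse network can reconstruct from each version.

    import numpy as np

    probs = np.random.dirichlet(np.ones(1000))     # hypothetical FC8 probabilities

    top5 = np.argsort(probs)[-5:]

    only_top5 = np.zeros_like(probs)
    only_top5[top5] = probs[top5]                  # keep only the 5 largest entries

    without_top5 = probs.copy()
    without_top5[top5] = 0.0                       # keep only the small probabilities

    # According to the result described above, inverting `without_top5`
    # preserves more of the input image than inverting `only_top5`:
    # the small probabilities carry most of the information.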

These findings suggest that a combination of distributed code and some grandmother-like cells may be closer to how CNN features work, and it might also be how our brain encodes visual inputs.

 

Deep Learning and Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Uncategorized on November 22, 2015 at 8:17 pm

by Li Yang Ku (Gooly)


Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng

Well, right, nowadays it is just hard not to talk about Deep Learning and Convolutional Neural Networks (CNN) in the field of Computer Vision. Since 2012 when the neural network trained by two of Geoffrey Hinton’s students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet Challenge by a large margin, neural networks have quickly become mainstream and made probably the greatest comeback ever in the history of AI.

alexnet

So what are Deep Learning and CNNs? According to the 2014 RSS keynote speech by Andrew Ng, Deep Learning is more or less a brand name for all work related to this class of approaches that try to learn high-level abstractions in data by using multiple layers. One of my favorite pre-2012 works is the deep belief nets by Geoffrey Hinton, Simon Osindero and Yee-Whye Teh, where basically a multi-layer neural network is used to learn handwritten digits. While I was still at UCLA, Geoffrey demonstrated this neural network during his visit in 2010. What is interesting is that this network not only classifies digits but can also be used to generate digits in a top-down fashion. See his talk below on this work.


On the other hand, a Convolutional Neural Network (CNN) is a specific type of multi-layer model. One of the most famous pre-2012 works was on classifying images (handwritten digits), introduced by Yann LeCun and his colleagues while he was at Bell Laboratories. This specific CNN, now called LeNet, uses the same weights for the same filter across different locations in the first two layers, therefore largely reducing the number of parameters that need to be learned compared to a fully connected neural network. The underlying concept is fairly simple; if a filter that acts like an edge detector is useful in the left corner then it is probably also useful in the right corner.
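
A minimal NumPy sketch of the weight-sharing idea (an illustration, not LeNet itself): one 5x5 kernel is slid over every location of the input, so the layer has 25 weights instead of one weight per input-output pair. The image and kernel here are random placeholders.

    import numpy as np

    def conv2d_valid(image, kernel):
        """Slide one shared kernel over every location of the image."""
        H, W = image.shape
        k = kernel.shape[0]
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
        return out

    image = np.random.rand(28, 28)       # e.g. a handwritten digit
    kernel = np.random.randn(5, 5)       # 25 shared weights

    feature_map = conv2d_valid(image, kernel)
    print(feature_map.shape)             # (24, 24): the same 25 weights reused at
                                         # 576 locations, instead of the
                                         # 28*28*24*24 weights a fully connected
                                         # layer of the same size would need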

imagenet

Both Deep Learning and CNNs are not new. Deep Learning concepts such as using multiple layers can be dated all the way back to 1975 when back propagation, an algorithm for learning the weights of a multi-layer neural network, was first introduced by Paul Werbos. CNNs, on the other hand, can be traced back to around the 1980s when neural networks were popular; LeNet was also work done around 1989. So why are Deep Learning and CNNs suddenly gaining fame faster than any pop singer in the field of Computer Vision? The short answer is because it works. Or more precisely, it works better than traditional approaches. A more interesting question would be why it works now but not before. The answer can be narrowed down to three reasons. 1) Data: thanks to people posting cat images on the internet and the Amazon Mechanical Turk, we have millions of labeled images, such as ImageNet, for training neural networks. 2) Hardware: GPUs allow us to train multi-layer neural networks on millions of examples within a few weeks by exploiting the parallelism in neural networks. 3) Algorithms: new approaches such as dropout and better loss functions have been developed to help train better networks.

ladygaga

One of the advantages of Deep Learning is that it bundles feature detection and classification. Traditional approaches, which I have talked about in my past post, usually consist of two parts: a feature detector such as the SIFT detector and a classifier such as the support vector machine. Deep Learning, on the other hand, trains both of these together. This allows better features to be learned directly from the raw data based on the classification results through back propagation. Note that even though sparse coding approaches also learn features from raw images, they are not trained end to end. It was also shown that by using dropout, an approach that simply drops units at random to prevent co-adaptation, such deep neural networks don't seem to suffer from overfitting the way other machine learning approaches do. However, the biggest challenge lies in the fact that it works like a black box and there are no proven theories yet on why back propagation on deep neural networks doesn't get stuck in a bad local minimum (or it might be converging to a local minimum but we just don't know).
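
A minimal sketch of dropout as it is commonly implemented (the "inverted dropout" variant, not taken from any specific paper's code): during training each unit is kept with probability p and the surviving activations are rescaled; at test time the layer is used unchanged. The layer shape and keep probability are placeholders.

    import numpy as np

    def dropout(activations, p_keep=0.5, training=True):
        """Inverted dropout: randomly zero units during training and rescale."""
        if not training:
            return activations                    # no dropout at test time
        mask = np.random.rand(*activations.shape) < p_keep
        return activations * mask / p_keep        # rescale so expectations match

    hidden = np.random.rand(4, 10)                # a hypothetical hidden layer
    print(dropout(hidden, p_keep=0.5))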

funny_brain_heart_fight

Many are excited about this recent trend in Deep Learning and associate it with how our own brain works. As excited as I am, being a big fan of Neuroscience, we also have to keep in mind that such neural networks are proven to be able to approximate any continuous function by the universal approximation theorem. Therefore, black box as it is, we should not be surprised that it has the capability to be a great classifier. Besides, an object recognition algorithm that works well doesn't mean that it correlates with how brains work, not to mention that deep learning only works well with supervised data and is therefore quite different from how humans learn. The current neural network model also acts quite differently from how our neurons work according to Jeff Hawkins, not to mention the fact that there is a large number of motor neurons going top down in every layer of our brain that is not captured in these neural networks. Having said that, I am still embracing Deep Learning in my own research and will go through other aspects of it in the following posts.

 

 

Local Distance Learning in Object Recognition

In Computer Vision, Paper Talk on February 8, 2015 at 11:59 am

by Li Yang Ku (Gooly)

learning distance

Unsupervised clustering algorithms such as K-means are often used in computer vision as a tool for feature learning. They can be used at different stages of the visual pathway. Running K-means on small pixel patches might result in finding a lot of patches with edges of different orientations, while running K-means on larger HOG features might result in finding contours of meaningful object parts, such as faces if your training data consists of selfies. However, convenient and simple as it seems, we have to keep in mind that these unsupervised clustering algorithms are all based on the assumption that a meaningful metric is provided. Without this criterion, clustering suffers from the "no right answer" problem. Whether the algorithm should group a set of images into clusters that contain objects of the same type or of the same color is ambiguous and not well defined. This is especially true when your observation vectors consist of values representing different types of properties.
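
A minimal sketch of the patch-clustering idea described above (the image source, patch size, and number of clusters are my own choices): sample small patches, run K-means, and treat the cluster centers as learned filters. Note that the result depends entirely on the Euclidean metric that K-means implicitly assumes.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical grayscale images; in practice these come from your data set.
    images = np.random.rand(100, 64, 64)

    def sample_patches(images, patch_size=8, n_patches=5000):
        patches = []
        for _ in range(n_patches):
            img = images[np.random.randint(len(images))]
            y = np.random.randint(img.shape[0] - patch_size)
            x = np.random.randint(img.shape[1] - patch_size)
            patches.append(img[y:y+patch_size, x:x+patch_size].ravel())
        return np.array(patches)

    patches = sample_patches(images)
    patches -= patches.mean(axis=1, keepdims=True)   # per-patch normalization

    # On natural images the 64 cluster centers typically look like oriented edges.
    kmeans = KMeans(n_clusters=64, n_init=10).fit(patches)
    filters = kmeans.cluster_centers_.reshape(64, 8, 8)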

distance learning

This is where Distance Learning comes into play. In the paper "Distance Metric Learning, with Application to Clustering with Side-Information" by Eric Xing, Andrew Ng, Michael Jordan and Stuart Russell, a matrix A that represents the distance metric is learned through convex optimization using user inputs that specify grouping examples. This matrix A can be either full or diagonal. When learning a diagonal matrix, the values simply represent the weights of each feature. If the goal is to group objects with similar color, features that represent color will have a higher weight in the matrix. This metric learning approach was shown to improve clustering on the UCI data set.
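
A minimal sketch of what a learned diagonal metric does (the weights below are hand-set for illustration, not the result of the authors' optimization): the distance has the Mahalanobis-style form d_A(x, y) = sqrt((x - y)^T A (x - y)), and with a diagonal A each entry simply weights one feature. The feature layout is hypothetical.

    import numpy as np

    def metric_distance(x, y, A):
        """Mahalanobis-style distance parameterized by a PSD matrix A."""
        d = x - y
        return np.sqrt(d @ A @ d)

    # Hypothetical feature layout: [color_hue, color_sat, shape_1, shape_2]
    x = np.array([0.9, 0.8, 0.1, 0.3])
    y = np.array([0.2, 0.1, 0.1, 0.3])

    # A diagonal A learned for "group by color" would up-weight color features,
    # while one learned for "group by shape" would do the opposite.
    A_color = np.diag([5.0, 5.0, 0.1, 0.1])
    A_shape = np.diag([0.1, 0.1, 5.0, 5.0])

    print(metric_distance(x, y, A_color))   # large: the two items differ in color
    print(metric_distance(x, y, A_shape))   # small: they share the same shape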

visual association

In another work, "Recognition by Association via Learning Per-exemplar Distances" by Tomasz Malisiewicz and Alexei Efros, the object recognition problem is posed as data association. A region in the image is classified by associating it with a small set of exemplars based on visual similarity. The authors suggested that the central question for recognition might not be "What is it?" but "What is it like?". In this work, 14 different types of features under 4 categories (shape, color, texture and location) are used. Unlike the single distance metric learned in the previous work, a separate distance function that specifies the weights put on these 14 different types of features is learned for each exemplar. Some exemplars like cars will not be as sensitive to color as exemplars like sky or grass, so having a different distance metric for each exemplar becomes advantageous in such situations. This class of work that defines separate distance metrics is called Local Distance Learning.

instance distance learning

In a more recent work, "Sparse Distance Learning for Object Recognition Combining RGB and Depth Information" by Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox, a new approach called Instance Distance Learning is introduced, where an instance refers to a single object. When classifying a view, the view-to-object distance is computed against all views of an object simultaneously instead of using a nearest neighbor approach. Besides learning weight vectors on each feature, weights on views are also learned. In addition, an L1 regularization is used instead of an L2 regularization in the Lagrange function. This generates a sparse weight vector which is zero for most views. This is quite interesting in the sense that the approach finds a small subset of representative views for each instance. In fact, as shown in the image below, with just 8% of the exemplar data similar decision boundaries can be achieved. This is consistent with what I talked about in my last post; the human brain doesn't store all possible views of an object nor does it store a 3D model of the object, instead it stores a subset of views that are representative enough to recognize the same object. This work demonstrates one possible way of finding such a subset of views.
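
A minimal sketch of why the L1 penalty yields a sparse set of representative views (a generic illustration using scikit-learn's Lasso and Ridge, not the authors' formulation): fit weights over per-view distance features with L1 versus L2 regularization and count how many weights come out exactly zero. The data and dimensions are made up.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)

    # Hypothetical setup: 200 training comparisons, each described by its
    # distances to 50 stored views of one object instance.
    X = rng.rand(200, 50)
    y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(200)   # only 5 views matter here

    l1 = Lasso(alpha=0.05).fit(X, y)
    l2 = Ridge(alpha=0.05).fit(X, y)

    print("nonzero view weights with L1:", np.sum(l1.coef_ != 0))   # a handful
    print("nonzero view weights with L2:", np.sum(l2.coef_ != 0))   # nearly all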

instance distance learning decision boundaries

 

Book it: Computer Vision Metrics

In Book It, Computer Vision on September 21, 2014 at 5:45 pm

by Li Yang Ku

computer vision metrics

I was asked to review a computer vision book again recently. The 500-page book "Computer Vision Metrics" is written by Scott Krig and, surprisingly, can be downloaded for free through Apress. It is a pretty nice book for people who are not completely new to Computer Vision but want to find research topics they would be interested in. I would recommend going through the first 4 chapters, especially the 3rd and 4th chapters, which give a pretty complete overview of the most active research areas in recent years.

wavelet

Topics I talked about in my blog such as Sparse Coding and Hierarchical Matching Pursuit are also discussed in the book. The section that compares some of the relatively new descriptors FREAK, BRISK, ORB, and BRIEF should also be pretty helpful.

Sparse Coding in a Nutshell

In Computer Vision, Neural Science, Sparse Coding on May 24, 2014 at 7:24 pm

by Li Yang Ku (Gooly)

nutshell

I've been reading some of Dieter Fox's publications recently and a series of work on Hierarchical Matching Pursuit (HMP) caught my eye. There are three papers that are based on HMP: "Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms", "Unsupervised feature learning for RGB-D based object recognition" and "Unsupervised Feature Learning for 3D Scene Labeling". In all 3 of these publications, the HMP algorithm is what it is all about. The first paper, published in 2011, deals with scene classification and object recognition on gray scale images; the second paper, published in 2012, takes RGB-D images as input for object recognition; while the third paper, published in 2014, further extends the application to scene recognition using point cloud input. The 3 figures below are the feature dictionaries used in these 3 papers in chronological order.

hmp

One of the central concepts of HMP is to learn low-level and mid-level features instead of using hand-crafted features like SIFT. In fact, the first paper claims that it is the first work to show that learning features from the pixel level significantly outperforms approaches built on top of SIFT. Explained in a sentence, HMP is an algorithm that builds up a sparse dictionary and encodes it hierarchically such that meaningful features are preserved. The final classifier is simply a linear support vector machine, so the magic is mostly in the sparse coding. To fully understand why sparse coding might be a good idea we have to go back in time.
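
A minimal sketch of a sparse-coding-plus-linear-SVM pipeline in the spirit of HMP (my own simplified illustration, not the HMP implementation; the dictionary here is random rather than learned, and all shapes are placeholders): encode patches against a dictionary with orthogonal matching pursuit, pool the codes per image, and feed them to a linear SVM.

    import numpy as np
    from sklearn.decomposition import sparse_encode
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)

    # Hypothetical data: 200 images, each summarized by 20 flattened 8x8 patches.
    n_images, n_patches, patch_dim, n_atoms = 200, 20, 64, 128
    patches = rng.rand(n_images, n_patches, patch_dim)
    labels = rng.randint(0, 2, size=n_images)

    # In HMP the dictionary is learned from data; here it is random and normalized.
    dictionary = rng.randn(n_atoms, patch_dim)
    dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

    features = []
    for img in patches:
        # Sparse codes: each patch is described by only a few dictionary atoms.
        codes = sparse_encode(img, dictionary, algorithm='omp', n_nonzero_coefs=5)
        features.append(np.abs(codes).max(axis=0))   # simple max pooling over patches
    features = np.array(features)

    # The classifier itself stays simple: a linear SVM on the pooled sparse codes.
    clf = LinearSVC().fit(features, labels)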

Back in the 50's, Hubel and Wiesel's work on discovering Gabor-filter-like neurons in the cat brain really inspired a lot of people. However, the community thought these Gabor-like filters were some sort of edge detector. This discovery led to a series of work on edge detection in the 80's when digital image processing became possible on computers. Edge detectors such as Canny, Harris, Sobel, Prewitt, etc. are all based on the concept of detecting edges before recognizing objects. More recent algorithms such as the Histogram of Oriented Gradients (HOG) are an extension of these edge detectors. An example of HOG is the quite successful paper on pedestrian detection, "Histograms of oriented gradients for human detection" (see figure below).

hog and sift

If we move on to the 90's and 2000's, SIFT-like features seem to have dominated a large part of the Computer Vision world. These hand-crafted features work surprisingly well and have led to many real applications. These types of algorithms usually consist of two steps: 1) detect interesting feature points (yellow circles in the figure above), 2) generate an invariant descriptor around each of them (green checker boards in the figure above). One of the reasons they work well is that SIFT only cares about interest points, therefore lowering the dimension of the feature significantly. This allows classifiers to require fewer training samples before they can make reasonable predictions. However, throwing away all that geometry and texture information is unlikely to be how we humans see the world, and it will fail in texture-less scenarios.

In 1996, Olshausen showed that by adding a sparse constraint, Gabor-like filters are the codes that best describe natural images. What this might suggest is that the filters in V1 (Gabor filters) are not just edge detectors, but statistically the best coding for natural images under a sparse constraint. I regard this as the most important proof that our brain uses sparse coding and the reason it works better in many new algorithms that use sparse coding, such as HMP. If you are interested in why evolution picked sparse coding, Jeff Hawkins has a great explanation in one of his talks (at 17:33); besides saving energy, it also helps with generalization and makes comparing features easy. Andrew Ng also has a paper, "The importance of encoding versus training with sparse coding and vector quantization", analyzing which part of sparse coding leads to better results.
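
Olshausen's result can be summarized by the objective being optimized (written here in a common modern form with an L1 penalty; the original paper used a more general sparsity cost): reconstruct each image patch x_i as a linear combination of basis functions D while keeping the coefficients a_i sparse,

    \min_{D,\,\{a_i\}} \; \sum_i \Big( \lVert x_i - D a_i \rVert_2^2 \;+\; \lambda \lVert a_i \rVert_1 \Big).

Minimizing the reconstruction error alone would give something closer to PCA; it is the sparsity term weighted by lambda that pushes the learned basis functions toward localized, oriented, Gabor-like filters.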

One dollar classifier? NEIL, the never ending image learner

In Computer Vision, Machine Learning on November 27, 2013 at 5:18 pm

by Li Yang Ku (Gooly)

NEIL never ending image learner

I had the chance to chat with Abhinav Gupta, a research professor at CMU, in person when he visited UMass Amherst about a month ago. Abhinav presented NEIL, the Never Ending Image Learner, in his talk at Amherst. To give a short intro, the following is from Abhinav:

“NEIL (Never Ending Image Learner) is a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. It is an effort to build the world’s largest visual knowledge base with minimum human labeling effort – one that would be useful to many computer vision and AI efforts.” 

NEIL never ending image learner clusters

One of the characteristics that distinguishes NEIL from other object recognition algorithms that are trained and tested on large web image data sets such as ImageNet or LFW is that NEIL is trying to recognize images from a set with unlimited data and unlimited categories. At first glance this might look like a problem too hard to solve, but NEIL approaches it in a smart way. Instead of trying to label images on the internet one by one, NEIL starts by labeling just the easy ones. Since the number of images returned by Google Image Search for a given keyword is so large, NEIL simply picks the ones it is most certain about, which are the ones that share the most common HOG-like features. This step also helps refine the query result. Say we search for cars on Google Image; it is very likely that out of every 100 images there is one that has nothing to do with cars (very likely some sexy photo of girls with the file name girl_love_cars.jpg). These outliers won't share the same visual features as the other car clusters and will not be labeled. By doing so, NEIL can gradually build up a very large labeled data set, moving from one keyword to the next.
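
A minimal sketch of the "label only the easy ones" idea (a generic illustration, not NEIL's actual pipeline; the descriptors, cluster count, and threshold are made up): cluster HOG-like descriptors of the returned images and keep only the images that sit close to a cluster center, leaving the outliers unlabeled.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)

    # Hypothetical HOG-like descriptors for images returned by a "car" query.
    descriptors = rng.rand(500, 1024)

    kmeans = KMeans(n_clusters=10, n_init=10).fit(descriptors)
    dists = np.linalg.norm(
        descriptors - kmeans.cluster_centers_[kmeans.labels_], axis=1)

    # Keep only the most confident images: those closest to their cluster center.
    threshold = np.percentile(dists, 30)
    confident = descriptors[dists < threshold]    # these get labeled "car"
    outliers = descriptors[dists >= threshold]    # girl_love_cars.jpg stays unlabeled
    print(len(confident), "images labeled,", len(outliers), "left unlabeled")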

girl_love_car

NEIL also learns the relationships between images and is connected with NELL, the Never Ending Language Learner. More details should be released in future papers. During the talk Abhinav said he plans to set up a system where you can submit the category you want to train on, and with just $1, NEIL will give you a set of HOG classifiers for that category in 1 day.

NEIL relationship

Book it: OpenNI Cookbook

In Book It, Computer Vision, Kinect, Point Cloud Library on November 20, 2013 at 7:53 pm

by Li Yang Ku (Gooly)

OpenNI Cookbook

 

I was asked to help review a technical book, "OpenNI Cookbook", about the OpenNI library for Kinect-like sensors recently. This is the kind of book that would be helpful if you just started developing OpenNI applications in Windows. Although I did all my OpenNI research in Linux, that was mostly because I needed it to work with robots that use ROS (Robot Operating System), which was only supported in Ubuntu. OpenNI was always more stable and better supported in Windows than in Linux. However, if you plan to use PCL (Point Cloud Library) with OpenNI, you might still want to consider Linux.

OpenNI Skeleton Tracking

The book covers topics from the basics to advanced applications such as getting the raw sensor data, hand tracking and skeleton tracking. It also contains sections that people don't usually talk about but that are crucial for actual software development, such as listening to connect and disconnect events. The code in this book uses the OpenNI 2 library, which is the latest version of OpenNI. Note that although OpenNI is open source, the NITE library for hand tracking and human tracking used in the book isn't (but it is free under certain licenses).

You can also buy the book on Amazon.

 

 

Paper Talk: Untangling Invariant Object Recognition

In Computer Vision, Neural Science, Paper Talk on September 29, 2013 at 7:31 pm

by Gooly (Li Yang Ku)

Untangle Invariant Object Recognition

In one of my previous posts I talked about the big picture of object recognition, which can be divided into two parts: 1) transforming the image space, 2) classifying and grouping. In this post I am gonna talk about a paper that clarifies object recognition and some of its pretty cool graphs explaining how our brains might transform the input image space. The paper also talks about why the idealized classification might not be what we want.

Let's start by explaining what a manifold is.

image space manifolds

An object manifold is the set of images projected by one object in the image space. Since each image is a point in the image space and an object can project similar images with infinitely small differences, these points form a continuous surface in the image space. This continuous surface is the object's manifold. Figure (a) above is an idealized manifold generated by a specific face. When the face is viewed from different angles, the projected point moves around on the continuous manifold. Although the graph is drawn in 3D, one should keep in mind that it actually lives in a much higher dimensional space; a 576,000,000-dimensional space if we consider the human eye to be 576 megapixels. Figure (b) shows another manifold from another face; in this space the two individuals can be separated easily by a plane. Figure (c) shows another space in which the two faces would be hard to separate. Note that these are ideal spaces that are possibly transformed from the original image space by our cortex. If the shapes were that simple, object recognition would be easy. However, the actual situation we get is Figure (d). The object manifolds of two objects are usually tangled and intersect in multiple spots. However, the two manifolds are not identical, so it is possible that through some non-linear operation we can transform Figure (d) into something more like Figure (c).

classification: manifold or point?

One interesting point this paper makes is that the traditional view that there is a single neuron that represents an object is probably wrong. Instead of having a grandmother cell (yes, that's what they call it) that represents your grandma, our brain might actually represent her with a manifold. Neurologically speaking, a manifold could be a set of neurons that have a certain firing pattern. This is related to the sparse encoding I talked about before and is consistent with Jeff Hawkins' brain theory. (See his talk about sparse distributed representations around 17:15.)

Figures (b) and (c) above compare a manifold representation with a single-cell representation. What is being emphasized is that object recognition is more a task of transforming the space than of classification.

Why Visual Illusions: Illusory Contours and Checkerboard Illusion

In Computer Vision, Paper Talk, Visual Illusion on September 16, 2013 at 6:33 pm

by Gooly (Li Yang Ku)

I talked about some visual illusions in my previous post but didn't mention why they are important to computer vision and what the advantages of seeing visual illusions are. In this post I am gonna talk about the advantages of having two of the most commonly known visual illusions: illusory contours and the checkerboard illusion.

Illusory Contours:

Kanizsa's Triangle

The Kanizsa triangle, invented by Gaetano Kanizsa, is a very good example of illusory contours. Even though the center upside-down triangle doesn't exist, you are forced to see it because of the clues given by the other parts. If you gradually cover up some of the circles and corners, at some point you will be able to see the pac-men and the angles as individual objects and the illusory contours will disappear. This illusion is a side effect of how we perceive objects and shows that we see edges using various clues instead of just light differences. Because our eyes receive noisy real-world inputs, illusory contours actually help us fill in missing contours caused by lighting, shading, or occlusion. This also explains why a purely bottom-up vision system won't work in many situations. In the paper "Hierarchical Bayesian inference in the visual cortex" by Lee and Mumford, a Kanizsa square is used to test whether monkeys perceive illusory contours in V1. The result is positive but with a delayed response compared to V2. This suggests that the information about illusory contours is possibly generated in V2 and propagated back to V1.

Checkerboard Illusion:

checker board illusion

The checkerboard illusion above is by Edward H. Adelson. In the book "Perception as Bayesian Inference", Adelson wrote a chapter discussing how we perceive objects under different lighting conditions, in other words, how we achieve "lightness constancy". The illusion above should be easy to understand. At first sight, in the left image square A on the checkerboard seems to be darker than square B, although they actually have the same brightness. By breaking the 3D structure, the right image shows that the two squares indeed have the same brightness. We perceive A and B differently in the left image because our vision system is trying to achieve lightness constancy. In fact, if the cylinder were removed, square A would be darker than square B, so lightness constancy actually gives us the correct brightness when the lighting is constant. This allows us to recognize the same object even under large lighting changes, which I would argue is an important ability for survival. In the paper "Recovering reflectance and illumination in a world of painted polyhedra" by Sinha and Adelson, how we construct 3D structure from 2D drawings and shading is further discussed. Understanding an object's 3D structure is crucial for obtaining lightness constancy as in the checkerboard illusion above. As in the image below, by removing certain types of junction clues, a 3D drawing can easily be seen as flat. However, as mentioned in the paper, more complex global strategies are needed to cover all cases.

Figure from "Recovering Reflectance and Illumination in a World of Painted Polyhedra" by Pawan Sinha & Edward Adelson

I was gonna post this a few months ago but was delayed by my Los Angeles to Boston road trip (and numerous goodbye parties), but I am now officially back in school at UMass Amherst for a PhD program. Not totally settled down yet, but enough to make a quick post.

The most cited papers in Computer Vision

In Computer Vision, Paper Talk on February 10, 2012 at 11:10 pm

by gooly (Li Yang Ku)

Although it's not always the case that a paper cited more contributes more to the field, a highly cited paper usually indicates that something interesting has been discovered. The following are the papers that, to my knowledge, are cited the most in Computer Vision (updated on 11/24/2013). If you want your "friend's" paper listed here, just comment below.

Cited by 21528 + 6830 (Object recognition from local scale-invariant features)

Distinctive image features from scale-invariant keypoints

DG Lowe – International journal of computer vision, 2004

Cited by 22181

A threshold selection method from gray-level histograms

N Otsu – Automatica, 1975

Cited by 17671

A theory for multiresolution signal decomposition: The wavelet representation

SG Mallat – Pattern Analysis and Machine Intelligence, IEEE …, 1989

Cited by 17611

A computational approach to edge detection

J Canny – Pattern Analysis and Machine Intelligence, IEEE …, 1986

Cited by 15422

Snakes: Active contour models

M Kass, A Witkin, Demetri Terzopoulos – International journal of computer …, 1988

Cited by 15188

Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images

Geman and Geman – Pattern Analysis and Machine …, 1984

Cited by 11630 + 4138 (Face Recognition using Eigenfaces)

Eigenfaces for Recognition

Turk and Pentland, Journal of cognitive neuroscience Vol. 3, No. 1, Pages 71-86, 1991 (9358 citations)

Cited by 8788

Determining optical flow

B.K.P. Horn and B.G. Schunck, Artificial Intelligence, vol 17, pp 185-203, 1981

Cited by 8559

Scale-space and edge detection using anisotropic diffusion

P Perona, J Malik

Pattern Analysis and Machine Intelligence, IEEE Transactions on 12 (7), 629-639

Cited by 8432 + 5901 (Robust real time face detection)

Rapid object detection using a boosted cascade of simple features

P Viola, M Jones

Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the …

Cited by 8351

Active contours without edges

TF Chan, LA Vese – IEEE Transactions on image processing, 2001

Cited by 7517

An iterative image registration technique with an application to stereo vision

B. D. Lucas and T. Kanade (1981), An iterative image registration technique with an application to stereo vision. Proceedings of Imaging Understanding Workshop, pages 121–130

Cited by 7979

Normalized cuts and image segmentation

J Shi, J Malik

Pattern Analysis and Machine Intelligence, IEEE Transactions on 22 (8), 888-905

Cited by 6658

Histograms of oriented gradients for human detection

N Dalal… – … 2005. CVPR 2005. IEEE Computer Society …, 2005

Cited by 6528

Mean shift: A robust approach toward feature space analysis

D Comaniciu, P Meer – … Analysis and Machine Intelligence, …, 2002

Cited by 5130

The Laplacian pyramid as a compact image code

Burt and Adelson, – Communications, IEEE Transactions on, 1983

Cited by 4987

Performance of optical flow techniques

JL Barron, DJ Fleet, SS Beauchemin – International journal of computer vision, 1994

Cited by 4870

Condensation—conditional density propagation for visual tracking

M Isard and Blake – International journal of computer vision, 1998

Cited by 4884

Good features to track

Shi and Tomasi , 1994. Proceedings CVPR’94., 1994 IEEE, 1994

Cited by 4875

A model of saliency-based visual attention for rapid scene analysis

L Itti, C Koch, E Niebur, Analysis and Machine Intelligence, 1998

Cited by 4769

A performance evaluation of local descriptors

K Mikolajczyk, C Schmid

Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 (10 ..

Cited by 4070

Fast approximate energy minimization via graph cuts

Y Boykov, O Veksler, R Zabih

Pattern Analysis and Machine Intelligence, IEEE Transactions on 23 (11 .

Cited by 3634

Surf: Speeded up robust features

H Bay, T Tuytelaars… – Computer Vision–ECCV 2006, 2006

Cited by 3702

Neural network-based face detection

HA Rowley, S Baluja, Takeo Kanade – Pattern Analysis and …, 1998

Cited by 2869

Emergence of simple-cell receptive field properties by learning a sparse code for natural images

BA Olshausen – Nature, 1996

Cited by 3832

Shape matching and object recognition using shape contexts

S Belongie, J Malik, J Puzicha

Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (4), 509-522

Cited by 3271

Shape modeling with front propagation: A level set approach

R Malladi, JA Sethian, BC Vemuri – Pattern Analysis and Machine Intelligence, 1995

Cited by 2547

The structure of images

JJ Koenderink – Biological cybernetics, 1984 – Springer

Cited by 2361

Shape and motion from image streams under orthography: a factorization method

Tomasi and Kanade – International Journal of Computer Vision, 1992

Cited by 2632

Active appearance models

TF Cootes, GJ Edwards… – Pattern Analysis and …, 2001

Cited by 2704

Scale & affine invariant interest point detectors

K Mikolajczyk, C Schmid

International journal of computer vision 60 (1), 63-86

Cited by 2025

Modeling and rendering architecture from photographs: A hybrid geometry-and image-based approach

PE Debevec, CJ Taylor, J Malik

Proceedings of the 23rd annual conference on Computer graphics and …

Cited by 1978

Feature extraction from faces using deformable templates

AL Yuille, PW Hallinan… – International journal of computer …, 1992

Cited by 2048

Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation

SC Zhu, A Yuille

Pattern Analysis and Machine Intelligence, IEEE Transactions on 18 (9), 884-900

Cited by 2948

Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories

S Lazebnik, C Schmid, J Ponce

Computer Vision and Pattern Recognition, 2006 IEEE Computer Society …

Cited by 2206

Face detection in color images

RL Hsu, M Abdel-Mottaleb, AK Jain – IEEE transactions on pattern …, 2002

Cited by 2148

Efficient graph-based image segmentation

PF Felzenszwalb… – International Journal of Computer …, 2004

Cited by 2112

Visual categorization with bags of keypoints

G Csurka, C Dance, L Fan, J Willamowski, C Bray – Workshop on statistical …, 2004

Cited by 1868

Object class recognition by unsupervised scale-invariant learning

R Fergus, P Perona, A Zisserman

Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE …

Cited by 1945

Recovering high dynamic range radiance maps from photographs

PE Debevec, J Malik

ACM SIGGRAPH 2008 classes, 31

Cited by 1896

A comparison of affine region detectors

K Mikolajczyk, T Tuytelaars, C Schmid, A Zisserman, J Matas, F Schaffalitzky …

International journal of computer vision 65 (1), 43-72

Cited by 1880

A bayesian hierarchical model for learning natural scene categories

L Fei-Fei… – Computer Vision and Pattern …, 2005

Note that the papers I listed here are just the ones that came up to my mind, let me know if I missed any important publications; I would be happy to make the list more complete. Also check out the website I made for organizing papers visually.