Life is a game, take it seriously

Archive for the ‘Paper Talk’ Category

The most cited papers in computer vision and deep learning

In Computer Vision, deep learning, Paper Talk on June 19, 2016 at 1:18 pm

by Li Yang Ku (Gooly)


In 2012 I started a list of the most cited papers in the field of computer vision. I try to keep the list focused on research related to understanding the visual world and to avoid image processing, surveys, and purely statistical work. However, the computer vision world has changed a lot since 2012, when deep learning techniques started a trend in the field and outperformed traditional approaches on many computer vision benchmarks. Whether or not this deep learning trend lasts, I think these techniques deserve their own list.

As I mentioned in the previous post, it’s not always the case that a paper cited more contributes more to the field. However, a highly cited paper usually indicates that something interesting has been discovered. The following are, to my knowledge, the most cited papers in Computer Vision and Deep Learning (note that it is “and”, not “or”). If you want a certain paper listed here, just comment below.

Cited by 5518

Imagenet classification with deep convolutional neural networks

A Krizhevsky, I Sutskever, GE Hinton, 2012

Cited by 1868

Caffe: Convolutional architecture for fast feature embedding

Y Jia, E Shelhamer, J Donahue, S Karayev…, 2014

Cited by 1681

Backpropagation applied to handwritten zip code recognition

Y LeCun, B Boser, JS Denker, D Henderson…, 1989

Cited by 1516

Rich feature hierarchies for accurate object detection and semantic segmentation

R Girshick, J Donahue, T Darrell…, 2014

Cited by 1405

Very deep convolutional networks for large-scale image recognition

K Simonyan, A Zisserman, 2014

Cited by 1169

Improving neural networks by preventing co-adaptation of feature detectors

GE Hinton, N Srivastava, A Krizhevsky…, 2012

Cited by 1160

Going deeper with convolutions

C Szegedy, W Liu, Y Jia, P Sermanet…, 2015

Cited by 977

Handwritten digit recognition with a back-propagation network

Y LeCun, B Boser, JS Denker, D Henderson…, 1990

Cited by 907

Visualizing and understanding convolutional networks

MD Zeiler, R Fergus, 2014

Cited by 839

Dropout: a simple way to prevent neural networks from overfitting

N Srivastava, GE Hinton, A Krizhevsky…, 2014

Cited by 839

Overfeat: Integrated recognition, localization and detection using convolutional networks

P Sermanet, D Eigen, X Zhang, M Mathieu…, 2013

Cited by 818

Learning multiple layers of features from tiny images

A Krizhevsky, G Hinton, 2009

Cited by 718

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang…, 2014

Cited by 691

Deepface: Closing the gap to human-level performance in face verification

Y Taigman, M Yang, MA Ranzato…, 2014

Cited by 679

Deep Boltzmann Machines

R Salakhutdinov, GE Hinton, 2009

Cited by 670

Convolutional networks for images, speech, and time series

Y LeCun, Y Bengio, 1995

Cited by 570

CNN features off-the-shelf: an astounding baseline for recognition

A Sharif Razavian, H Azizpour, J Sullivan…, 2014

Cited by 549

Learning hierarchical features for scene labeling

C Farabet, C Couprie, L Najman…, 2013

Cited by 510

Fully convolutional networks for semantic segmentation

J Long, E Shelhamer, T Darrell, 2015

Cited by 469

Maxout networks

IJ Goodfellow, D Warde-Farley, M Mirza, AC Courville…, 2013

Cited by 453

Return of the devil in the details: Delving deep into convolutional nets

K Chatfield, K Simonyan, A Vedaldi…, 2014

Cited by 445

Large-scale video classification with convolutional neural networks

A Karpathy, G Toderici, S Shetty, T Leung…, 2014

Cited by 347

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei, 2015

Cited by 342

Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

K He, X Zhang, S Ren, J Sun, 2015

Cited by 334

Learning and transferring mid-level image representations using convolutional neural networks

M Oquab, L Bottou, I Laptev, J Sivic, 2014

Cited by 333

Convolutional networks and applications in vision

Y LeCun, K Kavukcuoglu, C Farabet, 2010

Cited by 332

Learning deep features for scene recognition using places database

B Zhou, A Lapedriza, J Xiao, A Torralba…, 2014

Cited by 299

Spatial pyramid pooling in deep convolutional networks for visual recognition

K He, X Zhang, S Ren, J Sun, 2014

Cited by 268

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks…, 2015

Cited by 261

Two-stream convolutional networks for action recognition in videos

K Simonyan, A Zisserman, 2014

 

Local Distance Learning in Object Recognition

In Computer Vision, Paper Talk on February 8, 2015 at 11:59 am

by Li Yang Ku (Gooly)

learning distance

Unsupervised clustering algorithms such as K-means are often used in computer vision as a tool for feature learning, and they can be used at different stages of the visual pathway. Running K-means on small pixel patches might yield many patches containing edges of different orientations, while running K-means on larger HOG features might yield contours of meaningful object parts, such as faces if your training data consists of selfies. However, convenient and simple as it seems, we have to keep in mind that these unsupervised clustering algorithms are all based on the assumption that a meaningful distance metric is provided. Without this criterion, clustering suffers from the “no right answer” problem: whether the algorithm should group a set of images into clusters that contain objects of the same type or of the same color is ambiguous and not well defined. This is especially true when your observation vectors consist of values representing different types of properties.
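As a concrete illustration of the first point, here is a minimal sketch of patch-level K-means feature learning, assuming scikit-learn is available; the patch size, number of clusters, and normalization are arbitrary illustrative choices rather than settings from any particular paper. Note that plain K-means implicitly trusts Euclidean distance on the patch vectors, which is exactly the metric assumption questioned above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.image import extract_patches_2d

def learn_patch_centroids(image, patch_size=(8, 8), n_clusters=64, seed=0):
    # Sample small patches and flatten each one into a vector (a point in "patch space").
    patches = extract_patches_2d(image, patch_size, max_patches=5000, random_state=seed)
    X = patches.reshape(len(patches), -1).astype(np.float64)
    # Remove mean brightness so clustering is driven by structure (edges) rather than intensity.
    X -= X.mean(axis=1, keepdims=True)
    X /= X.std(axis=1, keepdims=True) + 1e-8
    # K-means with plain Euclidean distance; the centroids typically look like oriented edges.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X).cluster_centers_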

distance learning

This is where distance learning comes into play. In the paper “Distance Metric Learning, with Application to Clustering with Side-Information” by Eric Xing, Andrew Ng, Michael Jordan, and Stuart Russell, a matrix A that represents the distance metric is learned through convex optimization from user-provided examples of which items should be grouped together. This matrix A can be either full or diagonal. When learning a diagonal matrix, the values simply represent the weight of each feature; if the goal is to group objects with similar color, features that represent color will have higher weights in the matrix. This metric learning approach was shown to improve clustering on UCI data sets.
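For concreteness, the learned metric has the Mahalanobis-style form d_A(x, y) = sqrt((x - y)^T A (x - y)). The sketch below only evaluates this distance with hand-picked diagonal weights to show how A decides what “similar” means; it does not reproduce the convex optimization from the paper, and the feature layout is invented for illustration.

import numpy as np

def metric_distance(x, y, A):
    # d_A(x, y) = sqrt((x - y)^T A (x - y)); with diagonal A the entries are per-feature weights.
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ A @ d))

# Hypothetical 5-D descriptors: the first 3 dimensions encode color, the last 2 encode shape.
A_color = np.diag([1.0, 1.0, 1.0, 0.01, 0.01])   # metric for grouping by color
A_shape = np.diag([0.01, 0.01, 0.01, 1.0, 1.0])  # metric for grouping by shape
x = np.array([0.9, 0.1, 0.1, 0.3, 0.7])
y = np.array([0.8, 0.2, 0.1, 0.9, 0.2])
print(metric_distance(x, y, A_color))  # small: x and y have similar color
print(metric_distance(x, y, A_shape))  # large: x and y have different shape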

visual association

In another work, “Recognition by Association via Learning Per-exemplar Distances” by Tomasz Malisiewicz and Alexei Efros, the object recognition problem is posed as data association. A region in the image is classified by associating it with a small set of exemplars based on visual similarity. The authors suggest that the central question for recognition might not be “What is it?” but “What is it like?”. In this work, 14 different types of features in 4 categories (shape, color, texture, and location) are used. Unlike the single distance metric learned in the previous work, a separate distance function that specifies the weights put on these 14 types of features is learned for each exemplar. Some exemplars, like cars, will not be as sensitive to color as exemplars like sky or grass, so having a different distance metric for each exemplar becomes advantageous in such situations. This class of work, which defines separate distance metrics, is called local distance learning.
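A minimal sketch of this per-exemplar idea follows; the feature categories match the four named above, but the weight values and distances are invented purely to illustrate how two exemplars can judge the same query region differently.

import numpy as np

def exemplar_distance(feature_dists, weights):
    # Per-exemplar distance: a weighted sum of elementary feature distances
    # between a query region and this particular exemplar.
    return float(np.dot(weights, feature_dists))

# Elementary distances between one query region and an exemplar: shape, color, texture, location.
dists = np.array([0.2, 0.9, 0.4, 0.1])
w_car = np.array([0.60, 0.05, 0.25, 0.10])    # a "car" exemplar mostly ignores color
w_grass = np.array([0.10, 0.60, 0.20, 0.10])  # a "grass" exemplar relies heavily on color
print(exemplar_distance(dists, w_car))    # ~0.28: the region looks close to the car exemplar
print(exemplar_distance(dists, w_grass))  # ~0.65: the region looks far from the grass exemplar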

instance distance learning

In a more recent work, “Sparse Distance Learning for Object Recognition Combining RGB and Depth Information” by Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox, a new approach called instance distance learning is introduced, where an instance refers to a single object. When classifying a view, the view-to-object distance is computed against all views of an object simultaneously instead of using a nearest neighbor approach. Besides learning weight vectors over features, weights over views are also learned. In addition, an L1 regularization is used instead of an L2 regularization in the Lagrange function. This produces a sparse weight vector with zero weight on most views. This is quite interesting in the sense that the approach finds a small subset of representative views for each instance. In fact, as shown in the image below, similar decision boundaries can be achieved with just 8% of the exemplar data. This is consistent with what I talked about in my last post: the human brain doesn’t store all possible views of an object, nor does it store a 3D model of the object; instead it stores a subset of views that are representative enough to recognize the same object. This work demonstrates one possible way of finding such a subset of views.

instance distance learning decision boundaries
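The sparsifying effect of the L1 penalty described above can be reproduced on toy data. The sketch below compares L1 (Lasso) and L2 (Ridge) regularized weights over 50 hypothetical views; the data is random and has nothing to do with the RGB-D setup of Lai et al., it only shows why the L1 choice leaves most view weights at exactly zero.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_views = 200, 50
X = rng.normal(size=(n_samples, n_views))      # per-view similarity features (toy data)
true_w = np.zeros(n_views)
true_w[:4] = 1.0                               # only a few views actually matter
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

w_l1 = Lasso(alpha=0.05).fit(X, y).coef_       # L1 penalty: most weights become exactly zero
w_l2 = Ridge(alpha=0.05).fit(X, y).coef_       # L2 penalty: weights shrink but stay nonzero
print("nonzero L1 view weights:", int(np.sum(np.abs(w_l1) > 1e-6)))
print("nonzero L2 view weights:", int(np.sum(np.abs(w_l2) > 1e-6)))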

 

How are objects represented in the human brain? Structural description models versus image-based models

In Computer Vision, Neural Science, Paper Talk on October 30, 2014 at 9:06 pm

by Li Yang Ku (Gooly)


A few years ago, while I was still back at UCLA, Tomaso Poggio came to give a talk about the object recognition work he did with 2D templates. After the talk a student asked whether he had thought about using a 3D model to help recognize objects from different viewpoints. “The field seems to agree that models are stored as 2D images instead of 3D models in the human brain” was Tomaso’s short reply. Since then I took it as a fact and never gave it a second thought, until a few months ago when I actually needed to argue against storing a 3D model to people in robotics.


To get the full story we have to first go back to the late 70s. The study of visual object recognition is often motivated by the problem of recognizing 3D objects while only receiving 2D patterns of light on our retina. The question was whether our object representations are more like abstract three-dimensional descriptions, or tied more closely to the two-dimensional images of an object. A commonly held solution at that time, popularized by Marr, was that the goal of vision is to reconstruct 3D. In the paper “Representation and recognition of the spatial organization of three-dimensional shapes”, published in 1978, Marr and Nishihara assume that at the end of the reconstruction process, viewer-centered descriptions are mapped into object-centered representations. This is based on the hypothesis that object representations should be invariant over changes in the retinal image. Building on this object-centered theory, Biederman introduced the recognition-by-components (RBC) model in 1987, which proposes that objects are represented as a collection of volumes or parts. This quite influential model explains how object recognition can be viewpoint invariant and is often referred to as a structural description model.

The structural description model, or object-centered theory, was the dominant theory of visual object understanding around that time, and it correctly predicts the view-independent recognition of familiar objects. On the other hand, viewer-centered models, which store a set of 2D images instead of one single 3D model, were usually considered implausible because of the amount of memory a system would require to store all discriminable views of many objects.


However, between the late 1980s and early 1990s, a wide variety of psychophysical and neurophysiological experiments surprisingly showed that human object recognition performance is strongly viewpoint dependent across rotation in depth. Before jumping into the late 80s I wanna first introduce some work done by Palmer, Rosch, and Chase in 1981. In their work they discovered that commonplace objects such as houses or cars can be hard or easy to recognize depending on the attitude of the object with respect to the viewer. Subjects tended to respond more quickly when the stimulus was shown from a good or canonical perspective. These observations were important in forming the viewer-centered theory.

Paper clip like objects used in Bulthoff’s experiments

In 1991 Bulthoff conducted an experiment to test these two theories. Subjects are shown sequences of animations in which a paper-clip-like object rotates. Given these sequences, the subjects have enough information to reconstruct a 3D model of the object. The subjects are then given a single image of a paper-clip-like object and are asked to identify whether it is the same object. Different viewing angles of the object are tested. The assumption is that if one single complete 3D model of this object exists in our brain, then recognizing it from all angles should be equally easy. However, according to Bulthoff, when given every opportunity to form a 3D model, the subjects performed as if they had not done so.

Bulthoff 1991

In 1992 Edelman further showed that canonical perspectives arise even when all the views in question are shown equally often and the objects possess no intrinsic orientation that might give some views an advantage.

Edelman 1992

Error rate from different viewpoint shown in Edelman’s experiment

In 1995 Tarr confirmed the discoveries using block-like objects. Instead of being shown a sequence of views of a rotating object, subjects were trained to build these block structures by manually placing the blocks through an interface with a fixed viewing angle. The result shows that response times increased proportionally to the angular distance from the training viewpoint. With extensive practice, performance became nearly equivalent at all familiar viewpoints; however, practice at familiar viewpoints did not transfer to unfamiliar viewpoints.

Tarr 1995

Based on these past observations, Logothetis, Pauls, and Poggio raised the question: “If monkeys are extensively trained to identify novel 3D objects, would one find neurons in the brain that respond selectively to particular views of such objects?” The results, published in 1995, were clear. By conducting the same paper clip object recognition task on monkeys, they found that 11.6% of the isolated neurons sampled in the IT region, the region known to represent objects, responded selectively to a subset of views of one of the known target objects. The response of these individual neurons decreases as the shown object rotates, along any of the four axes tested, away from the canonical view that the neuron represents. The experiment also shows that these view-specific neurons are scale and position invariant up to a certain degree.

Logothetis 1995

Viewpoint specific neurons

This series of findings from human psychophysics and neurophysiology research provided converging evidence for “image-based” models in which objects are represented as collections of viewpoint-specific local features. A series of works in computer vision has also shown that by allowing each canonical view to represent a range of images, such a model is no longer infeasible. However, despite a large amount of research, most of the detailed mechanisms are still unknown and require further study.

Check out these papers visually on my other website, EatPaper.org

References not linked in post:

Tarr, Michael J., and Heinrich H. Bülthoff. “Image-based object recognition in man, monkey and machine.” Cognition 67.1 (1998): 1-20.

Palmeri, Thomas J., and Isabel Gauthier. “Visual object understanding.” Nature Reviews Neuroscience 5.4 (2004): 291-303.

Human vision, top down or bottom up?

In Computer Vision, Neural Science, Paper Talk on February 9, 2014 at 6:42 pm

by Li Yang Ku (Gooly)

top-down bottom-up

How our brain handles visual input remains a mystery. When Hubel and Wiesel discovered Gabor-filter-like neurons in the cat’s V1 area, several feedforward model theories appeared. These models view our brain as a hierarchical classifier that extracts features layer by layer. Poggio’s papers “A feedforward architecture accounts for rapid categorization” and “Hierarchical models of object recognition in cortex” are good examples. These kinds of structures are called discriminative models. Although this new type of model helped the community leap forward one step, it doesn’t solve the problem. Part of the reason is that there are ambiguities when you only view part of the image locally, and a feedforward-only structure can’t achieve global consistency.
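For reference, the Gabor-filter-like receptive fields mentioned above are easy to write down directly. The sketch below builds a small bank of oriented Gabor kernels and convolves an image with them, which is roughly what the first feedforward layer in these models computes; all parameter values are illustrative choices, not values from the cited papers.

import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=15, sigma=3.0, theta=0.0, wavelength=6.0, gamma=0.5):
    # A Gabor kernel: an (anisotropic) Gaussian envelope times an oriented cosine carrier.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

def simple_cell_responses(image, n_orientations=4):
    # One response map per orientation, like an idealized layer of V1 simple cells.
    thetas = np.linspace(0, np.pi, n_orientations, endpoint=False)
    return [convolve2d(image, gabor_kernel(theta=t), mode="same") for t in thetas]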

Feedforward Vision

Therefore the idea that some kind of feedback model has to exist gradually emerged. Some early works in the computer science community came up with models that rely on feedback, such as Geoffrey Hinton’s Boltzmann machine invented back in the 80s, which developed into so-called deep learning in the late 2000s. However, it was only in the early 2000s that David Mumford clearly addressed the importance of feedback in the paper “Hierarchical Bayesian inference in the visual cortex“. Around the same time Wu and others also combined feedback and feedforward models successfully on textures in the paper “Visual learning by integrating descriptive and generative methods“. Since then the computer vision community has partly embraced the idea that the brain is more like a generative model which, in addition to categorizing inputs, is capable of generating images. An example of humans having generative skills is drawing images from imagination.


Slightly before David Mumford addressed the importance of the generative model, Lamme in the neuroscience community started a series of studies on recurrent processing in the visual system. His paper “The distinct modes of vision offered by feedforward and recurrent processing”, published in 2000, addressed why recurrent (feedback) processing might be associated with conscious vision (recognizing objects). In the same year the paper “Competition for consciousness among visual events: the psychophysics of reentrant visual processes”, published in the field of psychology, also addressed reentrant (feedback) visual processing and proposed a model in which conscious vision is associated with it.


While both the neuroscience and psychology fields have results suggesting a brain model composed of feedforward and feedback processing in which the feedback mechanism is associated with conscious vision, a recent paper, “Detecting meaning in RSVP at 13 ms per picture”, shows that humans are able to recognize the high-level concept of an image within 13 ms, a very short gap that won’t allow the brain to do a complete reentrant (feedback) visual pass. This conflicting result could suggest that conscious vision is not the result of feedback processing, or that there are still missing pieces we haven’t discovered. This kind of reminds me of one of Jeff Hawkins’ brain theories, in which he said that solving the mystery of consciousness is like figuring out that the world is round, not flat: easy to understand but hard to accept. He believes that consciousness does not reside in one part of the brain but is simply the combination of all firing neurons from top to bottom.

Paper Talk: Untangling Invariant Object Recognition

In Computer Vision, Neural Science, Paper Talk on September 29, 2013 at 7:31 pm

by Gooly (Li Yang Ku)

Untangle Invariant Object Recognition

In one of my previous posts I talked about the big picture of object recognition, which can be divided into two parts: 1) transforming the image space, and 2) classifying and grouping. In this post I am gonna talk about a paper that clarifies object recognition and some of its pretty cool graphs explaining how our brains might transform the input image space. The paper also talks about why the idealized classification might not be what we want.

Let’s start by explaining what a manifold is.

image space manifolds

An object manifold is the set of images projected by one object in the image space. Since each image is a point in the image space and an object can project similar images with infinitely small differences, the points form a continuous surface in the image space. This continuous surface is the object’s manifold. Figure (a) above is an idealized manifold generated by a specific face. When the face is viewed from different angles, the projected point moves around on the continuous manifold. Although the graph is drawn in 3D, one should keep in mind that the actual space has far more dimensions: a 576,000,000-dimensional space if we consider the human eye to be 576 megapixels. Figure (b) shows another manifold from another face; in this space the two individuals can be separated easily by a plane. Figure (c) shows another space in which the two faces would be hard to separate. Note that these are idealized spaces, possibly produced by our cortex transforming the original image space. If the shapes were that simple, object recognition would be easy. However, what we actually get is Figure (d): the manifolds of two objects are usually tangled and intersect in multiple spots. Still, the two manifolds are not identical, so it is possible that through some nonlinear operation we can transform Figure (d) into something more like Figure (c).
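To make the “each image is a point” picture concrete, the toy sketch below renders one simple shape at many in-plane rotations, treats each rendering as a point in pixel space, and projects those points with PCA; the projected points trace a smooth low-dimensional curve, a crude stand-in for an object manifold. The shape and parameters are arbitrary illustrative choices.

import numpy as np
from scipy.ndimage import rotate
from sklearn.decomposition import PCA

def toy_views(n_views=60, size=32):
    # A simple bar-shaped "object" rendered at many in-plane rotation angles.
    img = np.zeros((size, size))
    img[12:20, 6:26] = 1.0
    views = [rotate(img, angle, reshape=False, order=1)
             for angle in np.linspace(0, 180, n_views, endpoint=False)]
    return np.stack([v.ravel() for v in views])   # each row is one point in pixel space

embedded = PCA(n_components=3).fit_transform(toy_views())
print(embedded.shape)   # (60, 3): a rotation manifold embedded in a 1024-dimensional pixel space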

classification: manifold or point?

One interesting point this paper makes is that the traditional view that a single neuron represents an object is probably wrong. Instead of having a grandmother cell (yes, that’s what it’s called) that represents your grandma, our brain might actually represent her with a manifold. Neurologically speaking, a manifold could be a set of neurons with a certain firing pattern. This is related to the sparse encoding I talked about before and is consistent with Jeff Hawkins’ brain theory. (See his talk about sparse distributed representation around 17:15.)

Figures (b) and (c) above compare a manifold representation with a single-cell representation. What is being emphasized is that object recognition is more a task of transforming the space than of classification.

Why Visual Illusions: Illusory Contours and Checkerboard Illusion

In Computer Vision, Paper Talk, Visual Illusion on September 16, 2013 at 6:33 pm

by Gooly (Li Yang Ku)

I talked about some visual illusions in my previous post but didn’t mention why they are important to computer vision or what the advantages of seeing visual illusions are. In this post I am gonna talk about the advantages of two of the most commonly known visual illusions: illusory contours and the checkerboard illusion.

Illusory Contours:

Kanizsa's Triangle

The Kanizsa triangle, invented by Gaetano Kanizsa, is a very good example of illusory contours. Even though the center upside-down triangle doesn’t exist, you are forced to see it because of the clues given by the other parts. If you gradually cover up some of the circles and corners, at some point you will be able to see the pac-man shapes and the angles as individual objects, and the illusory contours will disappear. This illusion is a side effect of how we perceive objects and shows that we see edges using various clues instead of just light differences. Because our eyes receive noisy real-world inputs, illusory contours actually help us fill in missing contours caused by lighting, shading, or occlusion. It also explains why a purely bottom-up vision system won’t work in many situations. In the paper “Hierarchical Bayesian inference in the visual cortex” written by Lee and Mumford, a Kanizsa square is used to test whether monkeys perceive illusory contours in V1. The result is positive, but the response is delayed compared to V2. This suggests that illusory contour information is possibly generated in V2 and propagated back to V1.

Checkerboard Illusion:

checker board illusion

The checkerboard illusion above is by Edward H. Adelson. In the book “Perception as Bayesian Inference” Adelson wrote a chapter discussing how we perceive objects under different lighting conditions, in other words, how we achieve “lightness constancy”. The illusion above should be easy to understand. At first sight, in the left image, square A on the checkerboard seems to be darker than square B, although they actually have the same brightness. By breaking the 3D structure, the right image shows that the two squares indeed have the same brightness. We perceive A and B differently in the left image because our vision system is trying to achieve lightness constancy. In fact, if the cylinder and its shadow were removed, square A would be darker than square B, so lightness constancy actually gives us the brightness the squares would have under constant lighting. This allows us to recognize the same object even under large lighting changes, which I would argue is an important ability for survival. In the paper “Recovering reflectance and illumination in a world of painted polyhedra” by Sinha and Adelson, how we construct 3D structure from 2D drawings and shading is further discussed. Understanding an object’s 3D structure is crucial for obtaining lightness constancy as in the checkerboard illusion above. As in the image below, by removing certain types of junction clues, a 3D drawing can easily be seen as flat. However, as mentioned in the paper, more complex global strategies are needed to cover all cases.

Figure from “Recovering Reflectance and Illumination in a World of Painted Polyhedra” by Pawan Sinha and Edward Adelson

I was gonna post this a few months ago but was delayed by my Los Angeles to Boston road trip (and numerous goodbye parties). I am now officially back in school at UMass Amherst for a PhD program. Not totally settled down yet, but enough to make a quick post.

Back to Basics: Sparse Coding?

In Computer Vision, Neural Science, Paper Talk on May 4, 2013 at 9:04 pm

by Gooly (Li Yang Ku)

Gabor like filters

It’s always good to go back once in a while to the reason that lured you into computer vision. Mine was to understand the brain, after I astonishingly realized, while studying EE in undergrad, that computers have no intelligence. In fact, if my mother language used the translation “computer” instead of “electrical brain”, I would probably have been better off.

Anyway, I am currently revisiting some of the first few computer vision papers I read, and to tell the truth I still learn a lot from reading stuff I have read several times before, which you could also interpret as meaning that I never actually understood a paper the first time.

So back to the papers,

Simoncelli, Eero P., and Bruno A. Olshausen. “Natural image statistics and neural representation.” Annual review of neuroscience 24.1 (2001): 1193-1216.

Olshausen, Bruno A., and David J. Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?.” Vision research 37.23 (1997): 3311-3326.

Olshausen, Bruno A., and David J. Field. “Emergence of simple-cell receptive field properties by learning a sparse code for natural images.” Nature 381.6583 (1996): 607-609.

These three papers are essentially the same; the first two are spin-offs of the third, which was published in Nature. I personally prefer the second paper for reading.

Brain Sparse Coding

In this paper, Bruno explains statistically why overcomplete sparse coding is essential for human vision. The goal is to obtain a set of basis functions that can be used to regenerate an image (the basis functions are filters). This can be viewed as an image encoding problem, but instead of having an encoder that compresses the image to the minimum size, the goal is also to retain sparsity, which means only a small number of basis functions are used compared to the whole pool. Sparsity has obvious biological advantages, such as saving energy, but Bruno conjectured that sparsity is also essential to vision and originates from the sparse structure of natural images.

In order to obtain this set of sparse basis functions, a sparsity penalty is added to the energy function being optimized. The final result is a set of basis functions (image atop) that interestingly look very similar to the Gabor filters found in the visual cortex. This suggests that sparseness played an essential role in the evolution of human vision.
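The recipe can be tried at home. Here is a minimal sketch, assuming scikit-learn is available: minimize reconstruction error plus an L1 sparsity penalty on the coefficients (roughly E = ||I - sum_i a_i * phi_i||^2 + lambda * sum_i |a_i|) over patches of a natural image. The patch size, dictionary size, and penalty weight below are illustrative rather than the values used by Olshausen and Field; on real natural-image patches the learned atoms tend to come out Gabor-like.

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

def learn_sparse_basis(image, patch_size=(8, 8), n_atoms=100, alpha=1.0, seed=0):
    # Sample natural-image patches and remove their mean (DC) component.
    patches = extract_patches_2d(image, patch_size, max_patches=10000, random_state=seed)
    X = patches.reshape(len(patches), -1).astype(np.float64)
    X -= X.mean(axis=1, keepdims=True)
    # Overcomplete dictionary (100 atoms for 64-D patches) fit with an L1 sparsity penalty.
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=alpha, random_state=seed)
    return dico.fit(X).components_   # each row is one learned basis function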

Organizing Publications Visually: EatPaper

In Paper Talk, Serious Stuffs on January 13, 2013 at 2:44 pm

by Gooly (Li Yang Ku)

eatpaper.org

I always hoped there would be a good app to organize the papers I read and show them graphically. I did Google for one but couldn’t find anything that fits my needs, so I started to build one about a year ago (EatPaper.org). I stopped working on it several times for various reasons, but now it’s finally functional (not perfect, but enough for now). The website works a little bit like Pinterest: both use a bookmarklet, a browser bookmark that executes JavaScript, to fetch your current webpage. The following are the typical steps to use EatPaper.

1. Search on Google scholar.

2. Click the bookmark button I made. (You can get the button by clicking “Add a Node” in EatPaper.org)

3. A dialog pops up and you can store the publication information you found as a node in your graph.

I also made a Chrome extension that has the exact same function.

eatpaper.org

The website is built using Google App Engine + Google Web Toolkit. If it turns out to be a little slow occasionally, please be patient; I have to admit that I don’t have any funding and only pay the minimum amount needed to host the server. Please share it with your friends if you like it. I can probably get more resources if more people use it.

You can leave a message here if you have any opinions or problems, or if you found a bug.

The most cited papers in Computer Vision

In Computer Vision, Paper Talk on February 10, 2012 at 11:10 pm

by gooly (Li Yang Ku)

Although it’s not always the case that a paper cited more contributes more to the field, a highly cited paper usually indicates that something interesting has been discovered. The following are, to my knowledge, the most cited papers in Computer Vision (updated on 11/24/2013). If you want your “friend’s” paper listed here, just comment below.

Cited by 21528 + 6830 (Object recognition from local scale-invariant features)

Distinctive image features from scale-invariant keypoints

DG Lowe – International journal of computer vision, 2004

Cited by 22181

A threshold selection method from gray-level histograms

N Otsu – Automatica, 1975

Cited by 17671

A theory for multiresolution signal decomposition: The wavelet representation

SG Mallat – Pattern Analysis and Machine Intelligence, IEEE …, 1989

Cited by 17611

A computational approach to edge detection

J Canny – Pattern Analysis and Machine Intelligence, IEEE …, 1986

Cited by 15422

Snakes: Active contour models

M Kass, A Witkin, Demetri Terzopoulos – International journal of computer …, 1988

Cited by 15188

Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images

Geman and Geman – Pattern Analysis and Machine …, 1984

Cited by 11630+ 4138 (Face Recognition using Eigenfaces)

Eigenfaces for Recognition

Turk and Pentland, Journal of cognitive neuroscience Vol. 3, No. 1, Pages 71-86, 1991 (9358 citations)

Cited by 8788

Determining optical flow

B.K.P. Horn and B.G. Schunck, Artificial Intelligence, vol 17, pp 185-203, 1981

Cited by 8559

Scale-space and edge detection using anisotropic diffusion

P Perona, J Malik

Pattern Analysis and Machine Intelligence, IEEE Transactions on 12 (7), 629-639

Cited by 8432 + 5901 (Robust real time face detection)

Rapid object detection using a boosted cascade of simple features

P Viola, M Jones

Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the …

Cited by 8351

Active contours without edges

TF Chan, LA Vese – IEEE Transactions on image processing, 2001

Cited by 7517

An iterative image registration technique with an application to stereo vision

B. D. Lucas and T. Kanade (1981), An iterative image registration technique with an application to stereo vision. Proceedings of Imaging Understanding Workshop, pages 121–130

Cited by 7979

Normalized cuts and image segmentation

J Shi, J Malik

Pattern Analysis and Machine Intelligence, IEEE Transactions on 22 (8), 888-905

Cited by 6658

Histograms of oriented gradients for human detection

N Dalal… – … 2005. CVPR 2005. IEEE Computer Society …, 2005

Cited by 6528

Mean shift: A robust approach toward feature space analysis

D Comaniciu, P Meer – … Analysis and Machine Intelligence, …, 2002

Cited by 5130

The Laplacian pyramid as a compact image code

Burt and Adelson, – Communications, IEEE Transactions on, 1983

Cited by 4987

Performance of optical flow techniques

JL Barron, DJ Fleet, SS Beauchemin – International journal of computer vision, 1994

Cited by 4870

Condensation—conditional density propagation for visual tracking

M Isard and Blake – International journal of computer vision, 1998

Cited by 4884

Good features to track

Shi and Tomasi – Proceedings CVPR ’94, IEEE, 1994

Cited by 4875

A model of saliency-based visual attention for rapid scene analysis

L Itti, C Koch, E Niebur, Analysis and Machine Intelligence, 1998

Cited by 4769

A performance evaluation of local descriptors

K Mikolajczyk, C Schmid

Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 (10 ..

Cited by 4070

Fast approximate energy minimization via graph cuts

Y Boykov, O Veksler, R Zabih

Pattern Analysis and Machine Intelligence, IEEE Transactions on 23 (11 .

Cited by 3634

Surf: Speeded up robust features

H Bay, T Tuytelaars… – Computer Vision–ECCV 2006, 2006

Cited by 3702

Neural network-based face detection

HA Rowley, S Baluja, Takeo Kanade – Pattern Analysis and …, 1998

Cited by 2869

Emergence of simple-cell receptive field properties by learning a sparse code for natural images

BA Olshausen – Nature, 1996

Cited by 3832

Shape matching and object recognition using shape contexts

S Belongie, J Malik, J Puzicha

Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (4), 509-522

Cited by 3271

Shape modeling with front propagation: A level set approach

R Malladi, JA Sethian, BC Vemuri – Pattern Analysis and Machine Intelligence, 1995

Cited by 2547

The structure of images

JJ Koenderink – Biological cybernetics, 1984 – Springer

Cited by 2361

Shape and motion from image streams under orthography: a factorization method

Tomasi and Kanade – International Journal of Computer Vision, 1992

Cited by 2632

Active appearance models

TF Cootes, GJ Edwards… – Pattern Analysis and …, 2001

Cited by 2704

Scale & affine invariant interest point detectors

K Mikolajczyk, C Schmid

International journal of computer vision 60 (1), 63-86

Cited by 2025

Modeling and rendering architecture from photographs: A hybrid geometry-and image-based approach

PE Debevec, CJ Taylor, J Malik

Proceedings of the 23rd annual conference on Computer graphics and …

Cited by 1978

Feature extraction from faces using deformable templates

AL Yuille, PW Hallinan… – International journal of computer …, 1992

Cited by 2048

Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation

SC Zhu, A Yuille

Pattern Analysis and Machine Intelligence, IEEE Transactions on 18 (9), 884-900

Cited by 2948

Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories

S Lazebnik, C Schmid, J Ponce

Computer Vision and Pattern Recognition, 2006 IEEE Computer Society …

Cited by 2206

Face detection in color images

RL Hsu, M Abdel-Mottaleb, AK Jain – IEEE transactions on pattern …, 2002

Cited by 2148

Efficient graph-based image segmentation

PF Felzenszwalb… – International Journal of Computer …, 2004

Cited by 2112

Visual categorization with bags of keypoints

G Csurka, C Dance, L Fan, J Willamowski, C Bray – Workshop on statistical …, 2004

Cited by 1868

Object class recognition by unsupervised scale-invariant learning

R Fergus, P Perona, A Zisserman

Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE …

Cited by 1945

Recovering high dynamic range radiance maps from photographs

PE Debevec, J Malik

ACM SIGGRAPH 2008 classes, 31

Cited by 1896

A comparison of affine region detectors

K Mikolajczyk, T Tuytelaars, C Schmid, A Zisserman, J Matas, F Schaffalitzky …

International journal of computer vision 65 (1), 43-72

Cited by 1880

A bayesian hierarchical model for learning natural scene categories

L Fei-Fei… – Computer Vision and Pattern …, 2005

Note that the papers listed here are just the ones that came to my mind; let me know if I missed any important publications and I would be happy to make the list more complete. Also check out the website I made for organizing papers visually.

Object recognition with limited (< 6) training images

In Computer Vision, Paper Talk on December 11, 2011 at 10:51 pm

by gooly

If you read my last post, you know I am working on a social app; it turned out that the social app didn’t work as we imagined due to some false assumptions we made, so we came up with a slightly different idea and are still testing it. In the meantime, I decided to post some vision work I did.

The goal of this project is to recognize objects with limited training images, even under slightly different viewing angles. Using only a few images has a lot of advantages, especially for researchers who are too lazy to collect images and don’t have the patience to wait several hours or days for training. The concept is simple: we look at the only 4 training images we have and try to find what is common. Then we take the common structure and appearance and turn them into a model.

So first, in order to be rotation and scale invariant, we find the SURF points of all training images.

 Then we find the ones that have similar appearance and also form the same structure.

We build the structure by combining SURF points into a chain of triangles using dynamic programming, and then it is done. For a test image, we simply match the model to its SURF points. The results are fairly good across different objects and different angles. You can download the full paper here (A Probabilistic Model for Object Matching).
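For readers who want to experiment with the first step, here is a rough sketch of detecting keypoints in two training images and keeping appearance matches, assuming OpenCV is installed. The post used SURF; the sketch substitutes ORB only because SURF lives in the non-free opencv-contrib build, and it does not reproduce the triangle-chain structure model described above.

import cv2

def appearance_matches(img_a, img_b, max_matches=50):
    # Detect keypoints and binary descriptors in both (grayscale) training images.
    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    # Keep mutually consistent appearance matches, best (smallest distance) first.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    return kp_a, kp_b, matches[:max_matches]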