Life is a game, take it seriously

Generative Adversarial Nets: Your Enemy is Your Best Friend?

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 20, 2017 at 7:10 pm

By Li Yang Ku (gooly)

Generating realistic images with machines was always one of the top items on my list of difficult tasks. Past attempts in the Computer Vision community were only able to get a blurry image at best. The well publicized Google Deepdream project was able to generate some interesting artsy images, however they were modified from existing images and were designed more to make you feel like on drugs then realistic. Recently (2016), a work that combines the generative adversarial network framework with convolutional neural networks (CNNs) generated some results that look surprisingly good. (A non vision person would likely not be amazed though.) This approach was quickly accepted by the community and was referenced more then 200 times in less then a year.

This work is based on an interesting concept first introduced by Goodfellow et al. in the paper “Generative Adversarial Nets” at NIPS 2014 ( The idea was to have two neural networks compete with each other. One would try to generate images as realistic as it can and the other network would try to distinguish them from real images at its best. By theory this competition will reach a global optimum where the generated image and the real image will belong to the same distribution (Could be a lot trickier in practice though). This work in 2014 got some pretty good results on digits and faces but the generated natural images are still quite blurry (see figure above).

In the more recent work “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford, Metz, and Chintala, convolutional neural networks and the generative adversarial net framework are successfully combined with a few techniques that help stabilize the training ( Through this approach, the generated images are sharp and surprisingly realistic at first glance. The figures above are some of the generated bedroom images. Notice that if you look closer some of them may be weird.

The authors further explored what the latent variables represents. Ideally the generator (neural network that generates image) should disentangle independent features and each latent variable should represent a meaningful concept. By modifying these variables, images that have different characteristics can be generated. Note that these latent variables are what given to the neural network that generates images and is randomly sampled from a uniform distribution in the previous examples. In the figure above is an example where the authors show that the latent variables do represent meaningful concepts through arithmetic operations. If you subtract the average latent variables of men without glasses from the average latent variables of men with glasses and add the average latent variables of women without glasses, you obtain a latent variable that result in women with glasses when passed through the generator. This process identifies the latent variables that represent glasses.




Convolutional Neural Network Features for Robot Manipulation

In Computer Vision, deep learning, Robotics on October 24, 2016 at 6:30 am

by Li Yang Ku (Gooly)


In my previous post, I mentioned the obstacles when applying deep learning techniques directly to robotics. First, training data is harder to acquire; Second, interacting with the world is not just a classification problem. In this post, I am gonna talk about a really simple approach that treats convolutional neural networks (CNNs) as a feature extractor that generates a set of features similar to traditional features such as SIFT. This idea is applied to grasping on Robonaut 2 and published in arXiv (Associating Grasp Configurations with Hierarchical Features in Convolutional Neural Networks) with more details. The ROS package called ros-deep-vision that generates such features using a RGB-D sensor is also public.

Hierarchical CNN Features


When we look at these deep models such as CNNs, we should keep in mind that these models work well because how the layers stack up hierarchically matches how the data is structured. Our observed world is also hierarchical, there are common shared structures such as edges that can be used to represent more complex structures such as squares and cubes when combined in meaningful ways. A simple view of CNN is just a tree structure, where a higher level neuron is a combination of neurons in the previous layer. For example, a neuron that represents cuboids is a combination of neurons that represent the corners and edges of the cuboid. The figures above show such examples on neurons that found to activate consistently on cuboids and cylinders.

Deep Learning for Robotics

By taking advantage of this hierarchical nature of CNN, we can turn a CNN into a feature extractor that generates features that represents local structures of a higher level structure. For example, such hierarchical features can represent the left edge of the top face of a box while traditional edge detectors would find all edges in the scene. Instead of representing a feature with a single filter (neuron) in one of the CNN layers, this feature, which we call hierarchical CNN feature, uses a tuple of filters from different layers. Using backpropagation that restricts activation to one filter per layer allows us to locate the location of such feature precisely. By finding features such as the front and back edge of the top face of a box we can learn where to place robot fingers relative to these hierarchical CNN features in order to manipulate the object.

robonaut 2 grasping


The most cited papers in computer vision and deep learning

In Computer Vision, deep learning, Paper Talk on June 19, 2016 at 1:18 pm

by Li Yang Ku (Gooly)

paper citation

In 2012 I started a list on the most cited papers in the field of computer vision. I try to keep the list focus on researches that relate to understanding this visual world and avoid image processing, survey, and pure statistic works. However, the computer vision world have changed a lot since 2012 when deep learning techniques started a trend in the field and outperformed traditional approaches on many computer vision benchmarks. No matter if this trend on deep learning lasts long or not I think these techniques deserve their own list.

As I mentioned in the previous post, it’s not always the case that a paper cited more contributes more to the field. However, a highly cited paper usually indicates that something interesting have been discovered. The following are the papers to my knowledge being cited the most in Computer Vision and Deep Learning (note that it is “and” not “or”). If you want a certain paper listed here, just comment below.

Cited by 5518

Imagenet classification with deep convolutional neural networks

A Krizhevsky, I Sutskever, GE Hinton, 2012

Cited by 1868

Caffe: Convolutional architecture for fast feature embedding

Y Jia, E Shelhamer, J Donahue, S Karayev…, 2014

Cited by 1681

Backpropagation applied to handwritten zip code recognition

Y LeCun, B Boser, JS Denker, D Henderson…, 1989

Cited by 1516

Rich feature hierarchies for accurate object detection and semantic segmentation

R Girshick, J Donahue, T Darrell…, 2014

Cited by 1405

Very deep convolutional networks for large-scale image recognition

K Simonyan, A Zisserman, 2014

Cited by 1169

Improving neural networks by preventing co-adaptation of feature detectors

GE Hinton, N Srivastava, A Krizhevsky…, 2012

Cited by 1160

Going deeper with convolutions

C Szegedy, W Liu, Y Jia, P Sermanet…, 2015

Cited by 977

Handwritten digit recognition with a back-propagation network

BB Le Cun, JS Denker, D Henderson…, 1990

Cited by 907

Visualizing and understanding convolutional networks

MD Zeiler, R Fergus, 2014

Cited by 839

Dropout: a simple way to prevent neural networks from overfitting

N Srivastava, GE Hinton, A Krizhevsky…, 2014

Cited by 839

Overfeat: Integrated recognition, localization and detection using convolutional networks

P Sermanet, D Eigen, X Zhang, M Mathieu…, 2013

Cited by 818

Learning multiple layers of features from tiny images

A Krizhevsky, G Hinton, 2009

Cited by 718

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang…, 2014

Cited by 691

Deepface: Closing the gap to human-level performance in face verification

Y Taigman, M Yang, MA Ranzato…, 2014

Cited by 679

Deep Boltzmann Machines

R Salakhutdinov, GE Hinton, 2009

Cited by 670

Convolutional networks for images, speech, and time series

Y LeCun, Y Bengio, 1995

Cited by 570

CNN features off-the-shelf: an astounding baseline for recognition

A Sharif Razavian, H Azizpour, J Sullivan…, 2014

Cited by 549

Learning hierarchical features for scene labeling

C Farabet, C Couprie, L Najman…, 2013

Cited by 510

Fully convolutional networks for semantic segmentation

J Long, E Shelhamer, T Darrell, 2015

Cited by 469

Maxout networks

IJ Goodfellow, D Warde-Farley, M Mirza, AC Courville…, 2013

Cited by 453

Return of the devil in the details: Delving deep into convolutional nets

K Chatfield, K Simonyan, A Vedaldi…, 2014

Cited by 445

Large-scale video classification with convolutional neural networks

A Karpathy, G Toderici, S Shetty, T Leung…, 2014

Cited by 347

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei, 2015

Cited by 342

Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

K He, X Zhang, S Ren, J Sun, 2015

Cited by 334

Learning and transferring mid-level image representations using convolutional neural networks

M Oquab, L Bottou, I Laptev, J Sivic, 2014

Cited by 333

Convolutional networks and applications in vision

Y LeCun, K Kavukcuoglu, C Farabet, 2010

Cited by 332

Learning deep features for scene recognition using places database

B Zhou, A Lapedriza, J Xiao, A Torralba…,2014

Cited by 299

Spatial pyramid pooling in deep convolutional networks for visual recognition

K He, X Zhang, S Ren, J Sun, 2014

Cited by 268

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks…, 2015

Cited by 261

Two-stream convolutional networks for action recognition in videos

K Simonyan, A Zisserman, 2014