Life is a game, take it seriously

Convolutional Neural Networks in Robotics

In Computer Vision, deep learning, Machine Learning, Neural Science, Robotics on April 10, 2016 at 1:29 pm

by Li Yang Ku (Gooly)

robot using tools

As I mentioned in my previous post, Deep Learning and Convolutional Neural Networks (CNNs) have gained a lot of attention in the field of computer vision and outperformed other algorithms on many benchmarks. However, applying these techniques to robotics is non-trivial for two reasons. First, training large neural networks requires a lot of training data, and collecting it on robots is hard. Not only do research robots easily suffer network or hardware failures after many trials, the time and resources needed to collect millions of data points are also significant. The trained neural network is also robot-specific and cannot be used directly on a different type of robot, which limits the incentive to train such a network. Second, CNNs are good at classification, but classification has no direct relationship to interacting with a dynamic environment. Knowing you are seeing a lightsaber gives no indication of how to interact with it. Of course you can hard code this information, but that would just be using Deep Learning in computer vision instead of robotics.

Despite these difficulties, a few groups did make it through and successfully applied Deep Learning and CNNs in robotics; I will talk about three of these interesting works.

  • Levine, Sergey, et al. “End-to-end training of deep visuomotor policies.” arXiv preprint arXiv:1504.00702 (2015). 
  • Finn, Chelsea, et al. “Deep Spatial Autoencoders for Visuomotor Learning.” arXiv preprint arXiv:1509.06113 (2015). 
  • Pinto, Lerrel, and Abhinav Gupta. “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours.” arXiv preprint arXiv:1509.06825 (2015).

Deep Learning in Robotics

Traditional policy search approaches in reinforcement learning usually take the output of a “computer vision system” and send commands to low-level controllers such as a PD controller. In the paper “End-to-end training of deep visuomotor policies”, Levine et al. try to learn a policy that takes low-level observations (image and joint angles) and outputs joint torques directly. The overall architecture is shown in the figure above. As you can tell, this is ambitious and cannot be easily achieved without a few tricks. The authors first initialize the first layer with weights pre-trained on ImageNet, then train the vision layers with object pose information through pose regression. This pose information is obtained by having the robot hold the object with its hand covered by a cloth similar to the background (see figure below).

robot collecting pose information

In addition, using the pose information of the object, a trajectory can be learned with an approach called guided policy search. This trajectory is then used to train the motor control layers, which take the visual layer output plus the joint configuration as input and output joint torques. The results are better shown than described; see video below.

The second paper, “Deep Spatial Autoencoders for Visuomotor Learning”, is done by the same group at Berkeley. In this work, the authors try to learn a state space for reinforcement learning. Reinforcement learning requires a detailed representation of the state; in most work, however, this state is manually designed. This work automates the state space construction from camera images, using a deep spatial autoencoder to acquire features that represent the positions of objects. The architecture is shown in the figure below.

Deep Autoencoder in Robotics

The deep spatial autoencoder maps full-resolution RGB images to a down-sampled, grayscale version of the input image. All information in the image is forced to pass through a bottleneck of spatial features, which forces the network to learn important low-dimensional representations. The position is then extracted from the bottleneck layer and combined with joint information to form the state representation. The result is tested on several tasks shown in the figure below.
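To make the "position extracted from the bottleneck" idea a bit more concrete, here is a minimal sketch of a spatial softmax, which turns one feature channel into an expected (x, y) image position. This is my own illustration in plain Python, not the authors' code; the function name and the tiny activation map are made up for the example.

```python
import math

def spatial_softmax_keypoint(feature_map):
    """Return the expected (x, y) position of one feature channel.

    feature_map is a 2D list of activations (rows = y, columns = x).
    """
    # Softmax over every pixel of the channel (shifted by the max for stability).
    peak = max(max(row) for row in feature_map)
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)

    # The feature point is the probability-weighted mean pixel coordinate.
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y

# Hypothetical 3x4 activation map with a strong response at column 2, row 1.
activation = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 20.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
x, y = spatial_softmax_keypoint(activation)
```

With one such (x, y) pair per channel, the bottleneck becomes a short list of image coordinates, which is what gets combined with the joint information to form the state.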

Experiments on Deep Auto Encoder

As I mentioned earlier, gathering a large amount of training data in robotics is hard; in the paper “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours” the authors try to show that it is possible. Although still not comparable to datasets in the vision community such as ImageNet, gathering 50 thousand tries in robotics is significant if not unprecedented. The data is gathered using Baxter, a two-armed robot that is (relatively) mass produced compared to most research robots.

Baxter Grasping

 

The authors then use the collected data to train a CNN initialized with weights trained on ImageNet. The final output is one out of 18 different orientations of the gripper, assuming the robot always grabs from the top. The architecture is shown in the figure below.
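As a rough sketch of what "one out of 18 orientations" could look like in code: given one score per orientation bin, pick the best bin and convert it back to a gripper angle. I am assuming here that the 18 bins evenly divide 180 degrees (10 degrees each); the function name and the scores are hypothetical.

```python
def best_grasp_angle(scores, span_degrees=180.0):
    """Pick the gripper angle whose bin has the highest predicted score."""
    # Assumption: the bins evenly divide span_degrees, starting at 0.
    bin_width = span_degrees / len(scores)
    best_bin = max(range(len(scores)), key=lambda i: scores[i])
    return best_bin, best_bin * bin_width

# Hypothetical network output: one graspability score per orientation bin.
scores = [0.1] * 18
scores[4] = 0.9
best_bin, angle = best_grasp_angle(scores)
```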

Grasping with Deep Learning

Distributed Code or Grandmother Cells: Insights From Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Sparse Coding on January 23, 2016 at 1:31 pm

by Li Yang Ku (Gooly)

grandmother-cell

Convolutional Neural Network (CNN)-based features will likely replace engineered representations such as SIFT and HOG, yet we know little about what they represent. In this post I will go through a few papers that dive deeper into CNN-based features and discuss whether CNN feature vectors tend to be more like grandmother cells, where most information resides in a small set of filter responses, or distributed code, where most filter responses carry information equally. The content of this post is mostly taken from the following three papers:

  1. Agrawal, Pulkit, Ross Girshick, and Jitendra Malik. “Analyzing the performance of multilayer neural networks for object recognition.” Computer Vision–ECCV 2014. Springer International Publishing, 2014. 329-344.
  2. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
  3. Dosovitskiy, Alexey, and Thomas Brox. “Inverting convolutional networks with convolutional networks.” arXiv preprint arXiv:1506.02753 (2015).

So why do we want to take insights from convolutional neural networks (CNNs)? As I talked about in my previous post, in 2012 the University of Toronto’s CNN implementation won the ImageNet challenge by a large margin (15.3% versus 26.6% for the nearest competitor). Since then CNN approaches have been leaders in most computer vision benchmarks. Although a CNN doesn’t work like the brain, the characteristics that make it work well might also hold true in the brain.

faceselectiv

The grandmother cell is a hypothetical neuron that represents a complex but specific concept or object, proposed by cognitive scientist Jerry Lettvin in 1969. Although it is mostly agreed that the original concept of the grandmother cell, which suggests that each person or object one recognizes is associated with a single cell, is biologically implausible (see here for more discussion), the less extreme idea of the grandmother cell is now explained as sparse coding.

Deformable Part Model

Before diving into CNN features we look into existing computer vision algorithms and see which camp they belong to. Traditional object recognition algorithms are either part-based approaches that use mid-level patches or approaches that use a bag of local descriptors such as SIFT. One of the well-known part-based approaches is the deformable part model, which uses HOG to model parts and a score on respective location and deformation to model their spatial relationship. Each part is a mid-level patch that can be seen as a feature that fires on specific visual patterns, and mid-level patch discovery can be viewed as the search for a set of grandmother cell templates.

SIFT

On the other hand, unlike mid-level patches, SIFT-like features represent low-level edges and corners. This bag of descriptors approach uses a distributed code; a single feature by itself is not discriminative, but a group of features taken together is.

There have been many attempts to understand CNNs better. One of the early works, by Zeiler and Fergus, finds locally optimal visual inputs for individual filters. However, this does not characterize the distribution of images that cause a filter to activate. Agrawal et al. claimed that a grandmother cell can be seen as a filter with high precision and recall. Therefore, for each conv-5 filter in the CNN trained on ImageNet, they calculate the average precision for classifying images. They showed that grandmother-cell-like filters exist for only a few classes, such as bicycle, person, cars, and cats. The number of filters required to recognize objects of a class is also measured. For classes such as persons, cars, and cats few filters are required, but most classes require 30 to 40 filters.
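The "filter as a classifier" idea can be sketched as follows: rank images by a single filter's response and compute the average precision of that ranking against the class labels. This is a simplified illustration of the measure, not the evaluation code from the paper; the function name, responses and labels are made up.

```python
def average_precision(scores, labels):
    """Average precision of ranking images by one filter's response.

    scores: filter response per image; labels: True if the image contains the class.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

# Hypothetical responses of two filters on the same five images.
labels = [True, True, False, False, False]
selective_filter = [0.9, 0.8, 0.2, 0.1, 0.3]
unselective_filter = [0.2, 0.9, 0.8, 0.1, 0.7]
ap_selective = average_precision(selective_filter, labels)
ap_unselective = average_precision(unselective_filter, labels)
```

A filter that behaves like a grandmother cell for a class would score close to 1 on its own, like the first filter here, while a filter that is only one part of a distributed code would score much lower by itself.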

convolutional-neural-networks-top-9-layer-4-5

In the work done by Hinton et al., a concept called distillation is introduced. Distillation transfers the knowledge of a cumbersome model to a small model. For a cumbersome model, the training objective is to maximize the probability of the correct answer. A side effect is that it also assigns probabilities to incorrect answers. Instead of training on the correct answer, distillation trains on soft targets, which are the probabilities of all answers generated by the cumbersome model. They showed that the small model performs better when trained on these soft targets than when trained on the correct answer. This result suggests that the relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.
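Here is a small sketch of the difference between a hard target and a soft target. I am assuming a softmax with a temperature parameter to soften the cumbersome model's output, which the paragraph above does not spell out; the logits, class names and helper functions are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, predictions):
    """Cross entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, predictions) if t > 0)

# Hypothetical logits for the classes (cat, dog, car).
teacher_logits = [6.0, 4.0, -2.0]   # cumbersome model
student_logits = [3.0, 1.0, 0.5]    # small model

hard_target = [1.0, 0.0, 0.0]                          # only the correct answer
soft_target = softmax(teacher_logits, temperature=4.0)  # keeps "dog is closer than car"

hard_loss = cross_entropy(hard_target, softmax(student_logits))
soft_loss = cross_entropy(soft_target, softmax(student_logits, temperature=4.0))
```

The soft target still ranks the correct class first, but it also tells the small model that a cat looks more like a dog than like a car, which the hard target cannot express.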

Inverting CNN Features

On the other hand, Dosovitskiy et al. tried to understand CNN features by inverting the CNN. They claim that inverting CNN features allows us to see which information of the input image is preserved in the features. Applying the inverse to a perturbed feature vector yields further insight into the structure of the feature space. Interestingly, when they discard features in the FC8 layer, they find that most information is contained in the small probabilities of the remaining classes rather than in the top-5 activations. This result is consistent with the result of the distillation experiment mentioned previously.
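The perturbation in the FC8 experiment can be sketched like this: split the output vector into a version that keeps only the top-5 entries and a version that keeps everything else. The inversion network itself is not shown, and the helper name and the 8-class vector are only for illustration.

```python
def split_top_k(values, k=5):
    """Split an output vector into its top-k part and the remainder."""
    top = set(sorted(range(len(values)),
                     key=lambda i: values[i], reverse=True)[:k])
    top_only = [v if i in top else 0.0 for i, v in enumerate(values)]
    rest_only = [0.0 if i in top else v for i, v in enumerate(values)]
    return top_only, rest_only

# Hypothetical 8-class output; a real FC8 layer would have 1000 entries.
probs = [0.50, 0.20, 0.10, 0.08, 0.05, 0.04, 0.02, 0.01]
top_only, rest_only = split_top_k(probs, k=5)
```

Each of the two vectors would then be fed to the inversion network to compare how much of the image can be reconstructed from it.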

Top-5 vs rest feature in FC8

These findings suggest that a combination of distributed code and some grandmother-like cells may be closer to how CNN features work, and might also be how our brain encodes visual inputs.

 

Deep Learning and Convolutional Neural Networks

In Computer Vision, deep learning, Machine Learning, Neural Science, Uncategorized on November 22, 2015 at 8:17 pm

by Li Yang Ku (Gooly)

Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng

Well, right, nowadays it is just hard not to talk about Deep Learning and Convolutional Neural Networks (CNN) in the field of Computer Vision. Since 2012 when the neural network trained by two of Geoffrey Hinton’s students, Alex Krizhevsky and Ilya Sutskever, won the ImageNet Challenge by a large margin, neural networks have quickly become mainstream and made probably the greatest comeback ever in the history of AI.

alexnet

So what are Deep Learning and CNNs? According to the 2014 RSS keynote speech by Andrew Ng, Deep Learning is more or less a brand name for all works related to this class of approaches that try to learn high-level abstractions in data by using multiple layers. One of my favorite pre-2012 works is the deep belief nets done by Geoffrey Hinton, Simon Osindero and Yee-Whye Teh, where basically a multi-layer neural network is used to learn handwritten digits. While I was still at UCLA, Geoffrey demonstrated this neural network during his visit in 2010. What is interesting is that this network not only classifies digits but can also be used to generate digits in a top-down fashion. See a talk he did below for this work.


On the other hand, a Convolutional Neural Network (CNN) is a specific type of multi-layer model. One of the most famous pre-2012 works was on classifying images (handwritten digits), introduced by Yann LeCun and his colleagues while he was at Bell Laboratories. This specific CNN, which is now called LeNet, uses the same weights for the same filter across different locations in the first two layers, which largely reduces the number of parameters that need to be learned compared to a fully connected neural network. The underlying concept is fairly simple; if a filter that acts like an edge detector is useful in the left corner then it is probably also useful in the right corner.
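To get a feeling for how much weight sharing saves, here is a back-of-the-envelope count. The layer sizes (a 32x32 grayscale input and six 5x5 filters producing 28x28 maps) are chosen for illustration, and the helper functions are my own.

```python
def conv_params(kernel, in_channels, num_filters):
    """Weights plus one bias per filter; the same filter is reused at every location."""
    return kernel * kernel * in_channels * num_filters + num_filters

def fully_connected_params(num_inputs, num_outputs):
    """One weight per input-output pair plus one bias per output."""
    return num_inputs * num_outputs + num_outputs

# Illustrative sizes: 32x32 grayscale input, six 5x5 filters, 28x28 output maps.
conv = conv_params(kernel=5, in_channels=1, num_filters=6)
dense = fully_connected_params(32 * 32, 6 * 28 * 28)
ratio = dense / conv
```

Here the convolutional layer needs 156 parameters, while a fully connected layer producing the same number of outputs would need 4,821,600.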

imagenet

Neither Deep Learning nor CNNs are new. Deep Learning concepts such as using multiple layers can be dated all the way back to 1975 when back propagation, an algorithm for learning the weights of a multi-layer neural network, was first introduced by Paul Werbos. CNNs on the other hand can also be traced back to around the 1980s when neural networks were popular. LeNet was also work done around 1989. So why are Deep Learning and CNNs suddenly gaining fame faster than any pop singer in the field of Computer Vision? The short answer is because they work. Or more precisely, they work better than traditional approaches. A more interesting question would be why they work now but not before. The answer to this question can be narrowed down to three reasons. 1) Data: thanks to people posting cat images on the internet and to Amazon Mechanical Turk, we have datasets with millions of labeled images, such as ImageNet, for training neural networks. 2) Hardware: GPUs allow us to train multi-layer neural networks on millions of examples within a few weeks by exploiting parallelism in neural networks. 3) Algorithms: new approaches such as dropout and better loss functions have been developed to help train better networks.

ladygaga

One of the advantages of Deep Learning is that it bundles feature detection and classification. Traditional approaches, which I have talked about in my past post, usually consist of two parts, a feature detector such as the SIFT detector and a classifier such as the support vector machine. Deep Learning, on the other hand, trains both of these together. This allows better features to be learned directly from the raw data based on the classification results through back propagation. Note that even though sparse coding approaches also learn features from raw images, they are not trained end to end. It was also shown that by using dropout, an approach that simply randomly drops units to prevent co-adapting, such deep neural networks don’t seem to suffer from overfitting like other machine learning approaches. However, the biggest challenge lies in the fact that it works like a black box and there is not yet a proven theory of why back propagation on deep neural networks doesn’t converge to a local minimum (or it might be converging to a local minimum but we just don’t know).
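Dropout really is as simple as it sounds. Below is a minimal sketch in plain Python; note that I am using the "inverted" variant, which rescales the surviving units during training so nothing has to change at test time. That rescaling is my choice for the example, and the function and variable names are hypothetical.

```python
import random

def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Randomly zero units during training; pass them through unchanged otherwise."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Surviving units are scaled by 1/keep_prob so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [1.0] * 10
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
unchanged = dropout(hidden, training=False)
```

Because a different random subset of units is dropped on every training step, a unit cannot rely on any particular other unit being present, which is the co-adaptation that dropout is meant to prevent.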

funny_brain_heart_fight

Many are excited about this recent trend in Deep Learning and associate it with how our own brain works. As excited as I am, being a big fan of Neuroscience, we also have to keep in mind that such neural networks are proven to be able to approximate any continuous function by the universal approximation theorem. Therefore, black box as it is, we should not be surprised that it has the capability to be a great classifier. Besides, an object recognition algorithm that works well doesn’t necessarily correlate with how brains work, not to mention that deep learning only works well with supervised data and is therefore quite different from how humans learn. The current neural network model also acts quite differently from how our neurons work according to Jeff Hawkins, not to mention the fact that there are a large number of motor neurons going top down in every layer in our brain that are not captured in these neural networks. Having said that, I am still embracing Deep Learning in my own research and will go through other aspects of it in the following posts.

 

 
