Life is a game, take it seriously

Convolutional Neural Network Features for Robot Manipulation

In Computer Vision, deep learning, Robotics on October 24, 2016 at 6:30 am

by Li Yang Ku (Gooly)


In my previous post, I mentioned the obstacles to applying deep learning techniques directly to robotics: first, training data is harder to acquire; second, interacting with the world is not just a classification problem. In this post, I am going to talk about a really simple approach that treats a convolutional neural network (CNN) as a feature extractor that generates a set of features similar to traditional features such as SIFT. This idea was applied to grasping on Robonaut 2 and is described in more detail in the arXiv paper "Associating Grasping with Convolutional Neural Network Features". The ROS package called ros-deep-vision, which generates such features using an RGB-D sensor, is also publicly available.

[Figure: Hierarchical CNN Features]


When we look at deep models such as CNNs, we should keep in mind that these models work well because the way the layers stack up hierarchically matches how the data is structured. Our observed world is also hierarchical: there are common shared structures, such as edges, that can be combined in meaningful ways to represent more complex structures such as squares and cubes. A simple view of a CNN is a tree structure, where a higher level neuron is a combination of neurons in the previous layer. For example, a neuron that represents cuboids is a combination of neurons that represent the corners and edges of the cuboid. The figures above show examples of neurons that were found to activate consistently on cuboids and cylinders.

[Figure: Deep Learning for Robotics]

By taking advantage of this hierarchical nature of CNNs, we can turn a CNN into a feature extractor that generates features representing local structures of a higher level structure. For example, such a hierarchical feature can represent the left edge of the top face of a box, whereas traditional edge detectors would find all edges in the scene. Instead of representing a feature with a single filter (neuron) in one of the CNN layers, this feature, which we call a hierarchical CNN feature, uses a tuple of filters from different layers. Backpropagation restricted to one filter per layer allows us to locate such a feature precisely. By finding features such as the front and back edges of the top face of a box, we can learn where to place robot fingers relative to these hierarchical CNN features in order to manipulate the object.

[Figure: Robonaut 2 grasping]
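To make the idea a bit more concrete, here is a minimal sketch in PyTorch of the restricted backpropagation described above. This is not the implementation used in the paper or in ros-deep-vision; the layer indices and filter indices are illustrative assumptions on top of torchvision's AlexNet, and the hierarchical CNN feature is just a tuple of one filter per layer.

```python
# A minimal sketch, assuming torchvision's AlexNet: localize a hierarchical
# CNN feature (a tuple of filters from different layers) by backpropagating
# the top filter's peak activation while zeroing the gradients of all other
# filters at the lower layers of the tuple.
import torch
import torchvision

cnn = torchvision.models.alexnet(weights="IMAGENET1K_V1").features.eval()

# (layer index in cnn, filter index), higher layer first -- hypothetical values.
feature_tuple = [(10, 12), (8, 34), (6, 56)]   # e.g. conv5, conv4, conv3 filters

def localize(image):
    """Return an input-space saliency map that localizes the feature."""
    image = image.detach().clone().requires_grad_(True)
    acts, handles = {}, []

    def make_hook(layer_idx, filt, is_top):
        def hook(module, inputs, output):
            acts[layer_idx] = output
            if not is_top:
                # Keep only the chosen filter's gradient at this lower layer.
                def mask(grad, filt=filt):
                    kept = torch.zeros_like(grad)
                    kept[:, filt] = grad[:, filt]
                    return kept
                output.register_hook(mask)
        return hook

    for i, (layer_idx, filt) in enumerate(feature_tuple):
        handles.append(cnn[layer_idx].register_forward_hook(
            make_hook(layer_idx, filt, is_top=(i == 0))))
    cnn(image)
    for h in handles:
        h.remove()

    top_idx, top_filt = feature_tuple[0]
    acts[top_idx][0, top_filt].max().backward()      # restricted backprop
    return image.grad[0].abs().sum(dim=0)            # saliency over pixels
```

The peak of the returned map gives the 2D location of the feature; with an RGB-D sensor that pixel can be mapped to a 3D point relative to which finger placements can be learned.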


The most cited papers in computer vision and deep learning

In Computer Vision, deep learning, Paper Talk on June 19, 2016 at 1:18 pm

by Li Yang Ku (Gooly)


In 2012 I started a list of the most cited papers in the field of computer vision. I try to keep the list focused on research that relates to understanding the visual world, and to avoid image processing, surveys, and purely statistical work. However, the computer vision world has changed a lot since 2012, when deep learning techniques started a trend in the field and outperformed traditional approaches on many computer vision benchmarks. Whether or not this deep learning trend lasts, I think these techniques deserve their own list.

As I mentioned in the previous post, it's not always the case that a paper cited more contributes more to the field. However, a highly cited paper usually indicates that something interesting has been discovered. The following are, to my knowledge, the most cited papers in Computer Vision and Deep Learning (note that it is "and", not "or"). If you want a certain paper listed here, just comment below.

Cited by 5518

Imagenet classification with deep convolutional neural networks

A Krizhevsky, I Sutskever, GE Hinton, 2012

Cited by 1868

Caffe: Convolutional architecture for fast feature embedding

Y Jia, E Shelhamer, J Donahue, S Karayev…, 2014

Cited by 1681

Backpropagation applied to handwritten zip code recognition

Y LeCun, B Boser, JS Denker, D Henderson…, 1989

Cited by 1516

Rich feature hierarchies for accurate object detection and semantic segmentation

R Girshick, J Donahue, T Darrell…, 2014

Cited by 1405

Very deep convolutional networks for large-scale image recognition

K Simonyan, A Zisserman, 2014

Cited by 1169

Improving neural networks by preventing co-adaptation of feature detectors

GE Hinton, N Srivastava, A Krizhevsky…, 2012

Cited by 1160

Going deeper with convolutions

C Szegedy, W Liu, Y Jia, P Sermanet…, 2015

Cited by 977

Handwritten digit recognition with a back-propagation network

Y LeCun, B Boser, JS Denker, D Henderson…, 1990

Cited by 907

Visualizing and understanding convolutional networks

MD Zeiler, R Fergus, 2014

Cited by 839

Dropout: a simple way to prevent neural networks from overfitting

N Srivastava, GE Hinton, A Krizhevsky…, 2014

Cited by 839

Overfeat: Integrated recognition, localization and detection using convolutional networks

P Sermanet, D Eigen, X Zhang, M Mathieu…, 2013

Cited by 818

Learning multiple layers of features from tiny images

A Krizhevsky, G Hinton, 2009

Cited by 718

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang…, 2014

Cited by 691

Deepface: Closing the gap to human-level performance in face verification

Y Taigman, M Yang, MA Ranzato…, 2014

Cited by 679

Deep Boltzmann Machines

R Salakhutdinov, GE Hinton, 2009

Cited by 670

Convolutional networks for images, speech, and time series

Y LeCun, Y Bengio, 1995

Cited by 570

CNN features off-the-shelf: an astounding baseline for recognition

A Sharif Razavian, H Azizpour, J Sullivan…, 2014

Cited by 549

Learning hierarchical features for scene labeling

C Farabet, C Couprie, L Najman…, 2013

Cited by 510

Fully convolutional networks for semantic segmentation

J Long, E Shelhamer, T Darrell, 2015

Cited by 469

Maxout networks

IJ Goodfellow, D Warde-Farley, M Mirza, AC Courville…, 2013

Cited by 453

Return of the devil in the details: Delving deep into convolutional nets

K Chatfield, K Simonyan, A Vedaldi…, 2014

Cited by 445

Large-scale video classification with convolutional neural networks

A Karpathy, G Toderici, S Shetty, T Leung…, 2014

Cited by 347

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei, 2015

Cited by 342

Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

K He, X Zhang, S Ren, J Sun, 2015

Cited by 334

Learning and transferring mid-level image representations using convolutional neural networks

M Oquab, L Bottou, I Laptev, J Sivic, 2014

Cited by 333

Convolutional networks and applications in vision

Y LeCun, K Kavukcuoglu, C Farabet, 2010

Cited by 332

Learning deep features for scene recognition using places database

B Zhou, A Lapedriza, J Xiao, A Torralba…, 2014

Cited by 299

Spatial pyramid pooling in deep convolutional networks for visual recognition

K He, X Zhang, S Ren, J Sun, 2014

Cited by 268

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks…, 2015

Cited by 261

Two-stream convolutional networks for action recognition in videos

K Simonyan, A Zisserman, 2014


Convolutional Neural Networks in Robotics

In Computer Vision, deep learning, Machine Learning, Neural Science, Robotics on April 10, 2016 at 1:29 pm

by Li Yang Ku (Gooly)

[Figure: robot using tools]

As I mentioned in my previous post, Deep Learning and Convolutional Neural Networks (CNNs) have gained a lot of attention in the field of computer vision and outperformed other algorithms on many benchmarks. However, applying these techniques to robotics is non-trivial for two reasons. First, training large neural networks requires a lot of training data, and collecting it on robots is hard. Not only do research robots easily suffer network or hardware failures after many trials, the time and resources needed to collect millions of data points are also significant. The trained neural network is also robot specific and cannot be used directly on a different type of robot, which limits the incentive to train such a network. Second, CNNs are good for classification, but there is no direct relationship between classification and interacting with a dynamic environment. Knowing you are seeing a lightsaber gives no indication of how to interact with it. Of course you can hard code this information, but that would just be using Deep Learning for computer vision instead of robotics.

Despite these difficulties, a few groups did make it through and successfully applied Deep Learning and CNNs to robotics; I will talk about three of these interesting works.

  • Levine, Sergey, et al. “End-to-end training of deep visuomotor policies.” arXiv preprint arXiv:1504.00702 (2015). 
  • Finn, Chelsea, et al. “Deep Spatial Autoencoders for Visuomotor Learning.” arXiv preprint arXiv:1509.06113 (2015). 
  • Pinto, Lerrel, and Abhinav Gupta. “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours.” arXiv preprint arXiv:1509.06825 (2015).

[Figure: Deep Learning in Robotics – overall architecture]

Traditional policy search approaches in reinforcement learning usually use the output of a “computer vision system” and send commands to low-level controllers such as a PD controller. In the paper “End-to-End Training of Deep Visuomotor Policies”, Levine et al. try to learn a policy from low-level observations (images and joint angles) and output joint torques directly. The overall architecture is shown in the figure above. As you can tell, this is ambitious and cannot be easily achieved without a few tricks. The authors first initialize the first layer with weights pre-trained on ImageNet, then train the vision layers with object pose information through pose regression. This pose information is obtained by having the robot hold the object with its hand covered by a cloth similar to the background (see figure below).

[Figure: robot collecting pose information]

In addition, using the pose information of the object, a trajectory can be learned with an approach called guided policy search. This trajectory is then used to train the motor control layers, which take the visual layer output plus the joint configuration as input and output joint torques. The results are better shown than described; see the video below.
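To make the structure easier to picture, here is a rough sketch in PyTorch of the two-stage idea described above. It is not the paper's actual network: the layer sizes, the pooling, and the pose dimension are illustrative assumptions, and the real architecture additionally passes the conv features through a spatial softmax (sketched later for the second paper). The point is only that the vision layers are first trained with pose regression, and the motor layers are then trained on top of the vision features plus the joint configuration to output joint torques.

```python
# A simplified sketch, assuming made-up layer sizes, of a visuomotor policy
# trained in two stages: pose regression to pre-train the vision layers, then
# torque supervision (from guided policy search trajectories) for the rest.
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    def __init__(self, num_joints=7, pose_dim=6):
        super().__init__()
        self.vision = nn.Sequential(                     # vision layers
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
        )
        self.pose_head = nn.Linear(64, pose_dim)         # stage 1: pose regression
        self.motor = nn.Sequential(                      # stage 2: motor layers
            nn.Linear(64 + num_joints, 64), nn.ReLU(),
            nn.Linear(64, num_joints),                   # output joint torques
        )

    def predict_pose(self, image):
        # Stage 1: pre-train the vision layers by regressing the object pose.
        return self.pose_head(self.vision(image))

    def forward(self, image, joint_angles):
        # Stage 2: map image features + joint configuration to joint torques.
        feat = self.vision(image)
        return self.motor(torch.cat([feat, joint_angles], dim=1))

# Stage 1 minimizes the error between predict_pose(image) and the recorded
# object pose; stage 2 supervises forward(image, joints) with the torques
# from trajectories found by guided policy search.
```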

The second paper, “Deep Spatial Autoencoders for Visuomotor Learning”, is by the same group at Berkeley. In this work, the authors try to learn a state space for reinforcement learning. Reinforcement learning requires a detailed representation of the state; in most work, however, such a state is manually designed. This work automates the state space construction from camera images, where a deep spatial autoencoder is used to acquire features that represent the positions of objects. The architecture is shown in the figure below.

[Figure: Deep Autoencoder in Robotics – architecture]

The deep spatial autoencoder maps full-resolution RGB images to a down-sampled, grayscale version of the input image. All information in the image is forced to pass through a bottleneck of spatial features, which forces the network to learn important low-dimensional representations. The positions are then extracted from the bottleneck layer and combined with joint information to form the state representation. The result is tested on several tasks shown in the figure below.

[Figure: Experiments on the deep spatial autoencoder]
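Below is a minimal sketch of this kind of deep spatial autoencoder, assuming a standard spatial softmax bottleneck; the layer sizes, number of filters, and output resolution are illustrative, not the paper's exact values.

```python
# A minimal sketch of a deep spatial autoencoder: conv layers, a spatial
# softmax bottleneck that keeps only the expected 2D position of each filter
# response, and a linear decoder that reconstructs a down-sampled grayscale
# version of the input image.  All sizes below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Expected (x, y) image location of each channel's softmaxed response."""
    def forward(self, x):
        b, c, h, w = x.shape
        probs = F.softmax(x.view(b, c, h * w), dim=-1).view(b, c, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device)
        ys = torch.linspace(-1, 1, h, device=x.device)
        ex = (probs.sum(dim=2) * xs).sum(dim=-1)         # expected x, (b, c)
        ey = (probs.sum(dim=3) * ys).sum(dim=-1)         # expected y, (b, c)
        return torch.cat([ex, ey], dim=1)                # feature points, (b, 2c)

class DeepSpatialAutoencoder(nn.Module):
    def __init__(self, num_filters=16, out_size=60):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, 5), nn.ReLU(),
            nn.Conv2d(32, num_filters, 5), nn.ReLU(),
        )
        self.bottleneck = SpatialSoftmax()
        # Decoder: feature points -> down-sampled grayscale reconstruction.
        self.decoder = nn.Linear(2 * num_filters, out_size * out_size)
        self.out_size = out_size

    def forward(self, image):
        points = self.bottleneck(self.encoder(image))    # low-dimensional state
        recon = self.decoder(points).view(-1, 1, self.out_size, self.out_size)
        return points, recon
```

Training minimizes the reconstruction error against a down-sampled grayscale copy of the input; the learned feature points, combined with joint information, then form the state used by the reinforcement learning algorithm.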

As I mentioned earlier, gathering a large amount of training data in robotics is hard; in the paper “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours” the authors show that it is nevertheless possible. Although still not comparable to datasets in the vision community such as ImageNet, gathering 50 thousand grasp attempts in robotics is significant if not unprecedented. The data is gathered using the two-armed robot Baxter, which is (relatively) mass produced compared to most research robots.

[Figure: Baxter grasping]


The authors then use the collected data to train a CNN initialized with weights trained on ImageNet. The final output is one of 18 different orientations of the gripper, assuming the robot always grasps from the top. The architecture is shown in the figure below.

[Figure: Grasping with Deep Learning – network architecture]
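As a rough sketch of what such an 18-orientation output could look like (this is a simplification under my own assumptions, not the authors' exact network or training loss), one could replace the last layer of an ImageNet-pretrained backbone with an 18-way head, e.g. 10-degree bins spanning 180 degrees:

```python
# A hedged sketch, not the authors' exact network: an ImageNet-pretrained
# AlexNet whose last layer is replaced by an 18-way output over gripper
# orientations (assumed here to be 10-degree bins over 180 degrees) for a
# top-down grasp centered on the input image patch.
import torch.nn as nn
import torchvision

NUM_ANGLE_BINS = 18                                         # 18 x 10 deg = 180 deg

backbone = torchvision.models.alexnet(weights="IMAGENET1K_V1")
backbone.classifier[6] = nn.Linear(4096, NUM_ANGLE_BINS)    # new output layer

def predict_grasp_angle(patch):
    """patch: (1, 3, 224, 224) image patch centered on the candidate grasp."""
    bin_idx = backbone(patch).argmax(dim=1).item()
    return bin_idx * (180.0 / NUM_ANGLE_BINS)               # gripper angle in degrees
```

Each of the roughly 50 thousand self-supervised attempts then provides a training example: the image patch around the grasp point, the attempted angle bin, and whether the grasp succeeded.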