by Li Yang Ku (Gooly)
As I mentioned in my previous post, Deep Learning and Convolutional Neural Networks (CNNs) have gained a lot of attention in the field of computer vision and outperformed other algorithms on many benchmarks. However, applying these technics to robotics is non-trivial for two reasons. First, training large neural networks requires a lot of training data and collecting them on robots is hard. Not only do research robots easily have network or hardware failures after many trials, the time and resource needed to collect millions of data is also significant. The trained neural network is also robot specific and cannot be used on a different type of robot directly, therefore limiting the incentive of training such network. Second, CNNs are good for classification but when we are talking about interacting with a dynamic environment there is no direct relationship. Knowing you are seeing a lightsaber gives no indication on how to interact with it. Of course you can hard code this information, but that would just be using Deep Learning in computer vision instead of robotics.
Despite these difficulties, a few groups did make it through and successfully applied Deep Learning and CNNs in robotics; I will talk about three of these interesting works.
- Levine, Sergey, et al. “End-to-end training of deep visuomotor policies.” arXiv preprint arXiv:1504.00702 (2015).
- Finn, Chelsea, et al. “Deep Spatial Autoencoders for Visuomotor Learning.” reconstruction 117.117 (2015): 240.
- Pinto, Lerrel, and Abhinav Gupta. “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours.” arXiv preprint arXiv:1509.06825 (2015).
Traditional policy search approaches in reinforcement learning usually use the output of a “computer vision systems” and send commands to low-level controllers such as a PD controller. In the paper “end-to-end training of deep visuomotor policies”, Sergey, et al. try to learn a policy from low-level observations (image and joint angles) and output joint torques directly. The overall architecture is shown in the figure above. As you can tell this is ambitious and cannot be easily achieved without a few tricks. The authors first initialize the first layer with weights pre-trained on the ImageNet, then train vision layers with object pose information through pose regression. This pose information is obtained by having the robot holding the object with its hand covered by a cloth similar to the back ground (See figure below).
In addition to that, using the pose information of the object, a trajectory can be learned with an approach called guided policy search. This trajectory is then used to train the motor control layers that takes the visual layer output plus joint configuration as input and output joint torques. The results is better shown then described; see video below.
The second paper, “Deep Spatial Autoencoders for Visuomotor Learning”, is done by the same group in Berkeley. In this work, the authors try to learn a state space for reinforcement learning. Reinforcement learning requires a detailed representation of the state; in most work such state is however usually manually designed. This work automates this state space construction from camera image where the deep spatial autoencoder is used to acquire features that represent the position of objects. The architecture is shown in the figure below.
The deep spatial autoencoder maps full-resolution RGB images to a down-sampled, grayscale version of the input image. All information in the image is forced to pass through a bottleneck of spatial features therefore forcing the network to learn important low dimension representations. The position is then extracted from the bottleneck layer and combined with joint information to form the state representation. The result is tested on several tasks shown in the figure below.
As I mentioned earlier gathering a large amount of training data in robotics is hard, while in the paper “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours” the authors try to show that it is possible. Although still not comparable to datasets in the vision community such as ImageNet, gathering 50 thousand tries in robotics is significant if not unprecedented. The data is gathered using this two arm robot Baxter that is (relatively) mass produced compared to most research robots.
The authors then use these collected data to train a CNN initialized with weights trained on ImageNet. The final output is one out of 18 different orientation of the gripper, assuming the robot always grab from the top. The architecture is shown in the figure below.