Life is a game, take it seriously

Archive for the ‘Robotics’ Category

Convolutional Neural Network Features for Robot Manipulation

In Computer Vision, deep learning, Robotics on October 24, 2016 at 6:30 am

by Li Yang Ku (Gooly)


In my previous post, I mentioned the obstacles to applying deep learning techniques directly to robotics: first, training data is harder to acquire; second, interacting with the world is not just a classification problem. In this post, I am going to talk about a fairly simple approach that treats a convolutional neural network (CNN) as a feature extractor, generating a set of features similar to traditional features such as SIFT. This idea is applied to grasping on Robonaut 2 and published on arXiv (Associating Grasping with Convolutional Neural Network Features) with more details. The ROS package, called ros-deep-vision, that generates such features using an RGB-D sensor is also public.

Hierarchical CNN Features


When we look at deep models such as CNNs, we should keep in mind that these models work well because how the layers stack up hierarchically matches how the data is structured. Our observed world is also hierarchical: there are common shared structures, such as edges, that can represent more complex structures, such as squares and cubes, when combined in meaningful ways. A simple view of a CNN is a tree structure, where a higher-level neuron is a combination of neurons in the previous layer. For example, a neuron that represents cuboids is a combination of neurons that represent the corners and edges of the cuboid. The figures above show examples of neurons found to activate consistently on cuboids and cylinders.

Deep Learning for Robotics

By taking advantage of this hierarchical nature of CNNs, we can turn a CNN into a feature extractor that generates features representing local structures of a higher-level structure. For example, such a hierarchical feature can represent the left edge of the top face of a box, while traditional edge detectors would find all edges in the scene. Instead of representing a feature with a single filter (neuron) in one of the CNN layers, this feature, which we call a hierarchical CNN feature, uses a tuple of filters from different layers. Backpropagation that restricts activation to one filter per layer allows us to locate such a feature precisely. By finding features such as the front and back edges of the top face of a box, we can learn where to place robot fingers relative to these hierarchical CNN features in order to manipulate the object.
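The restricted backpropagation idea can be sketched roughly as follows. This is a toy two-layer network, not the paper's implementation, and the filter indices (2 in the lower layer, 5 in the higher layer) are hypothetical; the point is only to show how masking all but one filter per layer lets the gradient localize a filter tuple in the input image.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 8, 3, padding=1)   # lower layer
conv2 = nn.Conv2d(8, 16, 3, padding=1)  # higher layer
image = torch.rand(1, 3, 16, 16, requires_grad=True)

a1 = torch.relu(conv1(image))
# restrict the pathway: keep only filter 2 in the lower layer
mask1 = torch.zeros_like(a1)
mask1[:, 2] = 1.0
a2 = torch.relu(conv2(a1 * mask1))

# backpropagate the peak activation of the chosen higher-layer filter (5)
a2[:, 5].max().backward()

# the pixel with the strongest gradient localizes the (filter 5, filter 2) feature
heat = image.grad.abs().sum(dim=1)[0]           # (16, 16) saliency map
row, col = divmod(int(heat.argmax()), heat.shape[1])
```

The actual system works on RGB-D input and deeper networks, but the mechanism is the same: zeroing all other channels before the next layer restricts the backward pass to one filter per layer.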

robonaut 2 grasping


Convolutional Neural Networks in Robotics

In Computer Vision, deep learning, Machine Learning, Neural Science, Robotics on April 10, 2016 at 1:29 pm

by Li Yang Ku (Gooly)

robot using tools

As I mentioned in my previous post, Deep Learning and Convolutional Neural Networks (CNNs) have gained a lot of attention in the field of computer vision and outperformed other algorithms on many benchmarks. However, applying these techniques to robotics is non-trivial for two reasons. First, training large neural networks requires a lot of training data, and collecting it on robots is hard. Not only do research robots easily suffer network or hardware failures after many trials, the time and resources needed to collect millions of examples are also significant. The trained neural network is also robot-specific and cannot be used directly on a different type of robot, which limits the incentive to train such a network. Second, CNNs are good for classification, but a class label gives no direct answer for interacting with a dynamic environment. Knowing you are seeing a lightsaber gives no indication of how to interact with it. Of course you can hard-code this information, but that would just be using Deep Learning in computer vision instead of robotics.

Despite these difficulties, a few groups did make it through and successfully applied Deep Learning and CNNs in robotics; I will talk about three of these interesting works.

  • Levine, Sergey, et al. “End-to-end training of deep visuomotor policies.” arXiv preprint arXiv:1504.00702 (2015). 
  • Finn, Chelsea, et al. “Deep Spatial Autoencoders for Visuomotor Learning.” arXiv preprint arXiv:1509.06113 (2015). 
  • Pinto, Lerrel, and Abhinav Gupta. “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours.” arXiv preprint arXiv:1509.06825 (2015).

Deep Learning in Robotics

Traditional policy search approaches in reinforcement learning usually take the output of a “computer vision system” and send commands to low-level controllers such as a PD controller. In the paper “End-to-end training of deep visuomotor policies”, Levine et al. try to learn a policy from low-level observations (images and joint angles) and output joint torques directly. The overall architecture is shown in the figure above. As you can tell, this is ambitious and cannot be easily achieved without a few tricks. The authors first initialize the first layer with weights pre-trained on ImageNet, then train the vision layers with object pose information through pose regression. This pose information is obtained by having the robot hold the object with its hand covered by a cloth similar to the background (see figure below).

robot collecting pose information
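The pose-regression pretraining step might look roughly like the sketch below. The network here is a toy stand-in for the ImageNet-initialized vision layers, and the pose target is made up; the real architecture, losses, and training data differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# toy stand-in for the ImageNet-initialized convolutional vision layers
vision_layers = nn.Sequential(
    nn.Conv2d(3, 32, 7, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(2), nn.Flatten(),
)
pose_head = nn.Linear(64 * 4, 3)  # regress the (x, y, z) position of the held object

image = torch.rand(1, 3, 96, 96)                 # camera image of the robot holding the object
target_pose = torch.tensor([[0.4, -0.1, 0.7]])   # made-up ground-truth pose
pred = pose_head(vision_layers(image))
loss = nn.functional.mse_loss(pred, target_pose)
loss.backward()  # gradients flow into the vision layers, pretraining them on pose
```

After this supervised warm-up, the vision layers have learned to encode object position, which is exactly the information the downstream motor layers need.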

In addition, using the pose information of the object, a trajectory can be learned with an approach called guided policy search. This trajectory is then used to train the motor control layers, which take the visual layer output plus the joint configuration as input and output joint torques. The results are better shown than described; see the video below.

The second paper, “Deep Spatial Autoencoders for Visuomotor Learning”, is by the same group at Berkeley. In this work, the authors try to learn a state space for reinforcement learning. Reinforcement learning requires a detailed representation of the state; in most work, however, such a state is manually designed. This work automates the state space construction from camera images, where a deep spatial autoencoder is used to acquire features that represent the positions of objects. The architecture is shown in the figure below.

Deep Autoencoder in Robotics

The deep spatial autoencoder maps full-resolution RGB images to a down-sampled, grayscale version of the input image. All information in the image is forced to pass through a bottleneck of spatial features, forcing the network to learn important low-dimensional representations. The positions are then extracted from the bottleneck layer and combined with joint information to form the state representation. The approach is tested on several tasks, shown in the figure below.
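The spatial bottleneck can be sketched as a spatial softmax: each feature map is normalized into a probability distribution over pixels, and its expected (x, y) coordinate becomes one low-dimensional "feature point". This is a minimal sketch consistent with the description above, with my own coordinate conventions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def spatial_softmax(feature_maps):
    """Map (n, c, h, w) feature maps to (n, c, 2) expected pixel coordinates."""
    n, c, h, w = feature_maps.shape
    # softmax over all pixels of each feature map
    probs = F.softmax(feature_maps.view(n, c, h * w), dim=-1).view(n, c, h, w)
    ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1)
    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w)
    ex = (probs * xs).sum(dim=(2, 3))  # expected x per feature map
    ey = (probs * ys).sum(dim=(2, 3))  # expected y per feature map
    return torch.stack([ex, ey], dim=-1)

points = spatial_softmax(torch.randn(1, 16, 32, 32))  # 16 feature points in [-1, 1]^2
```

Because the decoder must reconstruct the image from these few coordinates, the encoder is pushed to place feature points on task-relevant objects.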

Experiments on Deep Auto Encoder

As I mentioned earlier, gathering a large amount of training data in robotics is hard, but in the paper “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours” the authors try to show that it is possible. Although still not comparable to datasets in the vision community such as ImageNet, gathering 50 thousand tries in robotics is significant if not unprecedented. The data is gathered using the two-arm robot Baxter, which is (relatively) mass-produced compared to most research robots.

Baxter Grasping


The authors then use the collected data to train a CNN initialized with weights trained on ImageNet. The final output is one of 18 different orientations of the gripper, assuming the robot always grasps from the top. The architecture is shown in the figure below.
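The classification head can be sketched as below. This is a toy CNN, not the paper's ImageNet-initialized architecture, assuming the 18 outputs correspond to 10-degree orientation bins over 180 degrees for a top-down grasp:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# toy patch classifier: image patch -> one of 18 gripper orientation bins
net = nn.Sequential(
    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 18),  # one logit per orientation bin
)

patch = torch.rand(1, 3, 64, 64)       # image patch centered on the grasp point
logits = net(patch)
angle_bin = logits.argmax(dim=1)       # best orientation index in [0, 17]
angle_deg = angle_bin.item() * 10.0    # assumed 10-degree bins
```

During training, each of the 50K grasp attempts supplies a patch, the tried orientation bin, and a success/failure label, so the robot's own trial-and-error serves as supervision.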

Grasping with Deep Learning

Light Weighted Vision Algorithms for a Light Weighted Aerial Vehicle

In Computer Vision, Robotics on January 15, 2013 at 10:50 pm

by Gooly (Li Yang Ku)

Recently, lightweight quadrotors equipped with a camera, such as the AR Drone, have become cheap and interesting enough to be used in research, even for poor students. By combining vision algorithms with a quadrotor you can really expand your imagination, even to robot fishing.

quadrotor fishing

SLAM (Simultaneous Localization and Mapping) has been a pretty hot topic in recent years; since in many situations a detailed map is not available to robots, it would be largely preferred if a robot could navigate in an unknown environment and draw the map itself, like we humans do. Previously, most SLAM research used mobile robots with a laser scanner, building up a decent map by combining laser results with dead reckoning. However, a laser scanner weighs quite a lot and is not ideal to mount on a lightweight aerial vehicle. This is where visual SLAM comes in. Visual SLAM is the name for all SLAM techniques that use mainly visual inputs (mostly a monocular camera only).

One of the early works on visual SLAM is MonoSLAM, done at Oxford. The video above shows how a 3D map is built by recognizing the relative positions of features using merely a 2D camera. PTAM further expands this concept and uses two threads to improve performance. The authors provide open-source code on their website, and it has been widely tested. The code was further ported to ROS by ETH Zurich and tested in the European SFly project on a few UAVs.

ROS also has a quadrotor simulator package where you can test your algorithms before crashing your real quadrotor.

All the publications and websites you should know about quadrotors are organized here.

RVIZ: a good reason to implement a vision system in ROS

In Computer Vision, Point Cloud Library, Robotics on November 18, 2012 at 2:33 pm

by Gooly (Li Yang Ku)

It might seem illogical to implement a vision system in ROS (Robot Operating System) if you are working on pure vision; however, after messing with ROS and PCL for a year, I can see the advantages of doing so. To clarify, we started to use ROS only because we needed it to communicate with Robonaut 2, but the RVIZ package in ROS is truly so helpful that I would recommend it even if no robots are involved.

(Keynote speech about Robonaut 2 and ROS from the brilliant guy I work for)


RVIZ is a ROS package that visualizes robots, point clouds, etc. Although PCL does provide a visualizer for point clouds, it only provides the most basic visualization functions. It is really not comparable with what RVIZ can give you.

  1. RVIZ is perfect for figuring out what went wrong in a vision system. The list on the left has a check box for each item, so you can show or hide any visual information instantly.
  2. RVIZ provides 3D visualization that you can navigate with just your mouse. At first I preferred the kind of navigation found in Microsoft Robotics Studio or Counter-Strike, but once you get used to it, it is pretty handy. Since I already have two keyboards and two mice, it's quite convenient to move around with my left mouse without taking my right hand off my right mouse.
  3. The best part of RVIZ is the interactive marker. This is where you can be really creative. It makes selecting a certain area in 3D relatively easy. You can therefore adjust your vision system manually while it is still running, such as selecting a certain area as your workspace and ignoring other regions.
  4. You can have multiple vision processes showing data in the same RVIZ. You simply have to publish the point cloud or shape you want to show using the ROS publishing mechanism. Visualizing is relatively painless once you get used to it.

Try not to view ROS as an operating system like Windows or Linux. It is more like the internet, where RVIZ is just one service, like Google Maps, and you can write your own app that queries the map as long as you use the communication protocol provided by ROS.