by Li Yang Ku (Gooly)
In my previous post, I mentioned the obstacles to applying deep learning techniques directly to robotics: first, training data is harder to acquire; second, interacting with the world is not just a classification problem. In this post, I am going to talk about a really simple approach that treats a convolutional neural network (CNN) as a feature extractor that generates a set of features similar to traditional features such as SIFT. This idea is applied to grasping on Robonaut 2 and is published on arXiv (Associating Grasping with Convolutional Neural Network Features) with more details. The ROS package ros-deep-vision, which generates such features using an RGB-D sensor, is also public.
When we look at deep models such as CNNs, we should keep in mind that they work well because the way the layers stack up hierarchically matches how the data is structured. Our observed world is also hierarchical: there are common shared structures, such as edges, that can be combined in meaningful ways to represent more complex structures such as squares and cubes. A simple view of a CNN is a tree structure, where a higher-level neuron is a combination of neurons in the previous layer. For example, a neuron that represents cuboids is a combination of neurons that represent the corners and edges of the cuboid. The figures above show examples of neurons that were found to activate consistently on cuboids and cylinders.
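To make the tree view a bit more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the three lower-level "neurons" (corner, straight edge, curved edge), their activations, and the weights are hypothetical numbers, not values taken from a trained network.

```python
import numpy as np

def neuron(inputs, weights, bias=0.0):
    """A higher-level neuron: weighted combination of lower-level neurons, then ReLU."""
    return max(0.0, float(np.dot(weights, inputs) + bias))

# Hypothetical lower-level activations, ordered as
# [corner, straight_edge, curved_edge].
box_view = np.array([0.9, 0.8, 0.1])   # a box shows corners and straight edges
can_view = np.array([0.1, 0.3, 0.9])   # a can shows mostly curved edges

# Hypothetical weights: each higher-level neuron is one node of the tree,
# and the weights say which children it combines.
cuboid_weights = np.array([1.0, 1.0, -1.0])
cylinder_weights = np.array([-1.0, 0.5, 1.0])

cuboid_on_box = neuron(box_view, cuboid_weights)
cuboid_on_can = neuron(can_view, cuboid_weights)
cylinder_on_box = neuron(box_view, cylinder_weights)
cylinder_on_can = neuron(can_view, cylinder_weights)

print("cuboid neuron:  ", cuboid_on_box, cuboid_on_can)
print("cylinder neuron:", cylinder_on_box, cylinder_on_can)
```

In a real CNN the children are spatial feature maps and the weights are learned, but the idea is the same: the cuboid neuron only fires when the right lower-level parts are present together.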
By taking advantage of this hierarchical nature of CNNs, we can turn a CNN into a feature extractor that generates features representing local structures of a higher-level structure. For example, such a hierarchical feature can represent the left edge of the top face of a box, while a traditional edge detector would find all edges in the scene. Instead of representing a feature with a single filter (neuron) in one of the CNN layers, this feature, which we call a hierarchical CNN feature, uses a tuple of filters from different layers. Backpropagation that restricts activation to one filter per layer allows us to localize such a feature precisely. By finding features such as the front and back edges of the top face of a box, we can learn where to place robot fingers relative to these hierarchical CNN features in order to manipulate the object.
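Here is a rough NumPy sketch of what backpropagation restricted to one filter per layer could look like on a tiny two-layer network. This is not the code from ros-deep-vision or the paper: the hand-written filters `w1` and `w2`, the 8x8 toy image, the helper names, and the choice of reading the location off as the centroid of the positive gradient-times-input map are all assumptions made for illustration.

```python
import numpy as np

def conv2d(x, w):
    """Valid cross-correlation. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    h, wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = x[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def conv2d_backward_input(grad_out, w, in_shape):
    """Gradient of conv2d with respect to its input."""
    k = w.shape[2]
    grad_in = np.zeros(in_shape)
    for i in range(grad_out.shape[1]):
        for j in range(grad_out.shape[2]):
            # Each output pixel sends gradient back to the patch it looked at.
            grad_in[:, i:i + k, j:j + k] += np.tensordot(grad_out[:, i, j], w, axes=(0, 0))
    return grad_in

def localize(image, w1, w2, f1, f2):
    """Locate the feature (f1, f2): layer-1 filter f1 as a part of layer-2 filter f2."""
    x = image[None, :, :]
    z1 = conv2d(x, w1)
    a1 = np.maximum(z1, 0.0)
    z2 = conv2d(a1, w2)
    a2 = np.maximum(z2, 0.0)

    # Start from the strongest response of the chosen top-level filter only.
    peak = np.unravel_index(np.argmax(a2[f2]), a2[f2].shape)
    grad2 = np.zeros_like(a2)
    grad2[f2, peak[0], peak[1]] = a2[f2, peak[0], peak[1]]
    grad2 *= (z2 > 0)

    # Backpropagate to layer 1, then keep only the chosen layer-1 filter.
    grad1 = conv2d_backward_input(grad2, w2, a1.shape)
    restricted = np.zeros_like(grad1)
    restricted[f1] = grad1[f1]
    restricted *= (z1 > 0)

    # Backpropagate to the image and read off a location (assumed: centroid).
    grad0 = conv2d_backward_input(restricted, w1, x.shape)[0]
    saliency = np.maximum(grad0 * image, 0.0)
    total = saliency.sum()
    if total == 0:
        return None  # the chosen filter pair did not respond on this image
    rows, cols = np.indices(image.shape)
    return (rows * saliency).sum() / total, (cols * saliency).sum() / total

# Toy scene: a bright square (say, the top face of a box) on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

# Hypothetical layer-1 filters: 0 = left edge (dark to bright), 1 = top edge.
left = np.array([[-1.0, 0.0, 1.0]] * 3)
w1 = np.stack([left, left.T])[:, None, :, :]   # shape (2, 1, 3, 3)
# Hypothetical layer-2 filter that combines both edge types over a 3x3 window.
w2 = np.ones((1, 2, 3, 3))

left_edge_loc = localize(image, w1, w2, f1=0, f2=0)
top_edge_loc = localize(image, w1, w2, f1=1, f2=0)
print("left edge of the square near (row, col):", left_edge_loc)
print("top edge of the square near (row, col): ", top_edge_loc)
```

The same top-level filter gives two different locations depending on which lower-level filter we restrict to, which is what lets a tuple of filters point at one specific part of the object instead of every edge in the scene.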