Life is a game, take it seriously

Paper Picks: ICRA 2017

In Computer Vision, deep learning, Machine Learning, Paper Talk, Robotics on July 31, 2017 at 1:04 pm

by Li Yang Ku (Gooly)

I was at ICRA (International Conference on Robotics and Automation) in Singapore this June to present some of my work. Surprisingly, the computer vision track seemed to draw a lot of interest from the robotics community. The four computer vision sessions were the most crowded of all the sessions that I attended. The following are a few papers related to computer vision and deep learning that I found quite interesting.

a) Schmidt, Tanner, Richard Newcombe, and Dieter Fox. “Self-supervised visual descriptor learning for dense correspondence.”

In this work, a self-supervised learning approach is introduced for generating dense visual descriptors with convolutional neural networks. Given a set of RGB-D videos of Schmidt, the first author, wandering around, training data can be generated automatically by using Kinect Fusion to track feature points between frames. A pixel-wise contrastive loss is used so that two points that belong to the same model point have similar descriptors, while points that belong to different model points have dissimilar ones.

Kinect Fusion cannot associate points between videos. However, with training data from within the same video only, the authors show that the learned descriptors of the same model point (such as the tip of the nose) are similar across videos. This can be explained by the hypothesis that with enough data, a model point trajectory will inevitably come near the trajectory of the same model point in another video. By chaining these trajectories, clusters of the same model point can be separated even without labels. The figure above visualizes the learned features with colors. Note that the network learns a similar mapping across videos despite having no training signal across videos.
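To make the pixel-wise contrastive loss a bit more concrete, here is a minimal NumPy sketch of the generic form of such a loss. This is my own illustration and not the authors' code: the function name, the `margin` value, and the toy descriptors are all assumptions, and the real training samples pixel pairs from dense descriptor images produced by a CNN.

```python
import numpy as np

def pixel_contrastive_loss(desc_a, desc_b, same_point, margin=0.5):
    """Generic contrastive loss over pairs of per-pixel descriptors.

    desc_a, desc_b: (N, D) descriptors sampled at N pixel pairs from two frames.
    same_point:     (N,) booleans, True if the pair tracks the same model point.
    margin:         hypothetical hyperparameter; non-matching pairs are only
                    penalized when they are closer than this distance.
    """
    dist = np.linalg.norm(desc_a - desc_b, axis=1)
    match_term = dist ** 2                                  # pull matches together
    nonmatch_term = np.maximum(0.0, margin - dist) ** 2     # push non-matches apart
    return float(np.mean(np.where(same_point, match_term, nonmatch_term)))

# Toy example: two matching pairs with identical descriptors and one
# non-matching pair whose descriptors are still too close to each other.
desc_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
desc_b = np.array([[0.0, 0.0], [1.0, 0.0], [0.1, 0.0]])
same_point = np.array([True, True, False])
loss = pixel_contrastive_loss(desc_a, desc_b, same_point)
print(loss)
```

In this sketch only the third pair contributes to the loss, since its descriptors are closer than the margin even though the two pixels are different model points.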

b) Pavlakos, Georgios, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. “6-dof object pose from semantic keypoints.”

In this work, semantic keypoints predicted by convolutional neural networks are combined with a deformable shape model to estimate the pose of object instances or objects of the same class. Given a single RGB image of an object, a set of class-specific keypoints is first identified through a CNN that is trained on labeled feature point heat maps. A fitting problem that maps these keypoints to keypoints on the 3D model is then solved using a deformable model that captures shape variability. The figure above shows some pretty good results on recognizing the same features on objects of the same class.

The CNN used in this work is the stacked hourglass architecture, where two hourglass modules are stacked together. The hourglass module was introduced in the paper “Newell, Alejandro, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.” An hourglass module is similar to a fully convolutional neural network but uses residual modules, which the authors claim makes it more balanced between downsampling and upsampling. Stacking multiple hourglass modules allows repeated bottom-up, top-down inference, which improves on state-of-the-art performance.
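As a small illustration of the step between the CNN and the fitting problem, the sketch below reads one keypoint per heat map by taking the location of the maximum response and using the peak value as a confidence. This is a common way to decode heat maps and is an assumption on my part, not code from the paper; the function name and the toy heat maps are made up for the example.

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """Decode one keypoint per heat map.

    heatmaps: (K, H, W) array, one heat map per semantic keypoint.
    Returns a (K, 2) array of (row, col) peak locations and a (K,) array
    with the peak value of each map, used here as a confidence score.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                       # flat index of each peak
    rows, cols = np.unravel_index(idx, (h, w))      # back to 2D coordinates
    conf = flat[np.arange(k), idx]
    return np.stack([rows, cols], axis=1), conf

# Toy example: two 4x5 heat maps, each with a single peak.
heatmaps = np.zeros((2, 4, 5))
heatmaps[0, 1, 3] = 0.9
heatmaps[1, 2, 0] = 0.7
coords, conf = keypoints_from_heatmaps(heatmaps)
print(coords, conf)
```

The 2D locations and confidences produced this way are the kind of input a model fitting step would then match against keypoints on the 3D shape model.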

c) Sung, Jaeyong, Ian Lenz, and Ashutosh Saxena. “Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories.”

In this work, point cloud, natural language, and manipulation trajectory data are mapped to a shared embedding space using a neural network. For example, given the point cloud of an object and a set of instructions as input, the neural network should map them to a region in the embedding space that is close to the trajectory that performs the instructed action. Instead of taking the whole point cloud as input, a segmentation process that decides which part of the object to manipulate based on the instruction is executed first. At test time, the trajectory closest to where the input point cloud and language are mapped in this shared embedding space can then be executed.

In order to learn a semantically meaningful embedding space, a loss-augmented cost that considers the similarity between different types of trajectories is used. The results show that the network puts similar groups of actions, such as pushing a bar and moving a cup to a nozzle, close to each other in the embedding space.
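The test-time lookup in the shared embedding space can be sketched as a nearest-neighbor search. The snippet below is only an illustration under my own assumptions: it uses cosine similarity as the closeness measure, and the embeddings are hand-written 2D vectors rather than outputs of the trained network.

```python
import numpy as np

def nearest_trajectory(query_embedding, trajectory_embeddings):
    """Pick the stored trajectory closest to the query in embedding space.

    query_embedding:       (D,) embedding of the (point cloud, language) input.
    trajectory_embeddings: (M, D) embeddings of M candidate trajectories.
    Returns the index of the best trajectory and all cosine similarities.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    t = trajectory_embeddings / np.linalg.norm(
        trajectory_embeddings, axis=1, keepdims=True
    )
    sims = t @ q                      # cosine similarity to every trajectory
    return int(np.argmax(sims)), sims

# Toy example with three candidate trajectories in a 2D embedding space.
trajectory_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([0.2, 0.9])
best, sims = nearest_trajectory(query, trajectory_embeddings)
print(best)
```

Here the query lands closest to the second trajectory, which would be the one handed to the robot for execution.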

d) Finn, Chelsea, and Sergey Levine. “Deep visual foresight for planning robot motion.”

In this work, a video prediction model based on a convolutional LSTM (long short-term memory) predicts the pixel flow transformation from the current frame to the next frame for a non-prehensile manipulation task. The model takes the input image, the end-effector pose, and a future action to predict the image at the next time step. The predicted image is then fed back into the network recursively to generate the following image. The network is trained on 50,000 pushing examples of hundreds of objects collected from 10 robots.

For each test, the user specifies where certain pixels on an object should move to. The robot then uses the model to determine the actions that will most likely reach the target, using an optimization algorithm that samples actions for several iterations. Some of the results are shown in the figure above. The first column shows the interface where the user specifies the goal. The red markers are the starting pixel positions and the green markers of the same shape are the goal positions. Each row shows a sequence of actions taken to reach the specified target.
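The sampling-based planning loop can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: I assume a cross-entropy-method style optimizer, and `toy_predict` is a hypothetical stand-in that simply shifts the tracked pixel by the action, whereas the real system rolls out the learned video prediction model. All parameter values are made up.

```python
import numpy as np

def toy_predict(pixels, actions):
    """Hypothetical stand-in for the learned model: the pixel moves by the action."""
    return pixels + actions

def plan_actions(start, goal, horizon=3, samples=200, elites=20, iters=8, seed=0):
    """Sample action sequences, keep the best ones, refit, and repeat."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, 2))
    std = np.ones((horizon, 2))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        actions = rng.normal(mean, std, size=(samples, horizon, 2))
        # Roll the model forward recursively, one predicted step at a time.
        pixels = np.tile(start, (samples, 1)).astype(float)
        for t in range(horizon):
            pixels = toy_predict(pixels, actions[:, t])
        # Score each sequence by how far the pixel ends up from the goal.
        cost = np.linalg.norm(pixels - goal, axis=1)
        elite = actions[np.argsort(cost)[:elites]]
        # Refit the sampling distribution to the best sequences.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

start = np.array([10.0, 10.0])   # user-selected pixel (red marker)
goal = np.array([13.0, 8.0])     # where it should end up (green marker)
plan = plan_actions(start, goal)
final = start + plan.sum(axis=0)
print(plan, final)
```

With the toy model, the planned actions move the pixel close to the goal after a few iterations. On the robot, a planner like this would typically execute only the first action and then replan from the new image.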

