Life is a game, take it seriously

Talk Picks: IROS 2017

In deep learning, Machine Learning, Robotics on February 10, 2018 at 1:06 pm

by Li Yang Ku (Gooly)

I was at IROS (International Conference on Intelligent Robots and Systems) in Vancouver recently (September 2017; this post took way too long to finish) to present work I did almost two years ago. Interestingly, there were four deep learning related sessions this year, and quite a few papers caught my interest; however, the talks at IROS were what I found the most inspiring. I am going to talk about three of them in the following.

a) “Toward Unifying Model-Based and Learning-Based Robotics”, plenary talk by Dieter Fox.  

In my previous post, I talked about how the machine learning field differs from the robotics field: machine learning learns from data, while robotics designs models that describe the environment. In this talk, Dieter tries to glue both worlds together. The 50-minute talk is posted below; for those who don't have 50 minutes, I describe it briefly in the following.

Dieter first described a list of work his lab did (robot localization, RGB-D matching, real-time tracking, etc.) using model-based approaches. Model-based approaches match models to data streams and control the robot by finding actions that reach the desired state. One of the benefits of such an approach is that our own knowledge of how the physical world works can be injected into the model. Dieter then gave a brief introduction to deep learning and to one of his students' work on learning visual descriptors in a self-supervised way, which I covered in a previous post. Based on the recent success of deep learning, Dieter suggested that there are ways to incorporate model-based approaches into a deep learning framework, and showed an example of how we can add knowledge of rigid body motion into a network by forcing it to output segmentations and their poses. The overall conclusion is that 1) model-based approaches are accurate within a local basin of attraction in which the models match the environment, 2) deep learning provides a larger basin of attraction within the trained regime, and 3) unifying both approaches gives you a more powerful system.
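The core of the rigid-body-motion example Dieter showed is that the network outputs a soft segmentation mask and one rigid transform per segment, and these are composed to predict how the observed points move. Below is a minimal numpy sketch of just that composition step; the function name and the toy inputs are my own, not from the talk.

import numpy as np

def apply_rigid_motion(points, masks, rotations, translations):
    """Blend K rigid-body motions over a point cloud.

    points:       (N, 3) current 3D points
    masks:        (N, K) soft segmentation weights, rows sum to 1
    rotations:    (K, 3, 3) rotation matrix per segment
    translations: (K, 3) translation per segment
    returns:      (N, 3) predicted next point cloud
    """
    # Transform every point with every segment's rigid motion: (K, N, 3)
    moved = np.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    # Weight each transformed copy by the point's segment membership
    return np.einsum('nk,kni->ni', masks, moved)

# Toy usage: two segments, one rotating about z, one translating
points = np.random.rand(100, 3)
masks = np.zeros((100, 2)); masks[:50, 0] = 1; masks[50:, 1] = 1
theta = np.pi / 8
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
rotations = np.stack([Rz, np.eye(3)])
translations = np.array([[0., 0., 0.], [0.1, 0., 0.]])
pred = apply_rigid_motion(points, masks, rotations, translations)

Because the motion is forced through rotation matrices and translations, whatever the network predicts for each segment is guaranteed to be a rigid motion; the learning problem is reduced to finding the segmentation and the per-segment poses.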


b) “Robotics as the Path to Intelligence”, keynote talk by Oliver Brock

Oliver Brock gave an exciting interactive talk on understanding intelligence in one of the IROS keynote sessions. Unfortunately it was not recorded and the slides cannot be distributed, so I posted the most similar talk he has given below instead. It is also a pretty good talk; some of the content overlaps, but it is under a different topic.

In the IROS talk, Oliver made a few points. First, he started out with AlphaGo by DeepMind, stating that its success in the game of Go is very similar to IBM's Deep Blue, which beat the chess world champion in 1997. In both cases, despite the system's superior game play, it needed a human to make the moves for it. A lot of things that humans are good at are usually difficult for our current approaches to artificial intelligence. How we define intelligence is crucial because it will shape our research directions and how we solve problems. By performing an interactive experiment with the audience, Oliver then showed that defining intelligence is non-trivial and has to do with what we perceive. He then talked about his work on integrating cross-modal perception and action, the importance of manipulation towards intelligence, and soft hands that can solve hard manipulation problems.


c) “The Power of Procrastination”, special event talk by Jorge Cham

This was probably the most popular of all the IROS talks. The speaker, Jorge Cham, is the author of the popular PHD Comics (which I may have posted on my blog without permission) and has a PhD in robotics from Stanford University. The following is not the exact talk he gave at IROS but is very similar.



Machine Learning, Computer Vision, and Robotics

In Computer Vision, Machine Learning, Robotics on December 6, 2017 at 2:32 pm

By Li Yang Ku (Gooly)

Having TA'd for Machine Learning this semester and worked in the fields of Computer Vision and Robotics for the past few years, I always have this feeling that the more I learn the less I know. Therefore, it's sometimes good to just sit back and look at the big picture. This post will talk about how I see the relations between these three fields at a high level.

First of all, Machine Learning is more a brand than a name. Just like Deep Learning and AI, the name is used for getting funding when the previous name is out of hype; in this case, the name was popularized after AI projects failed in the 70s. Therefore, Machine Learning covers a wide range of problems and approaches that may look quite different at first glance. Adaboost and support vector machines were the hot topics in Machine Learning when I was doing my master's degree, but now it is deep neural networks that get all the attention.

Despite the wide variety of research in Machine Learning, it usually rests on the common assumption that a set of data exists. The goal is then to learn a model based on this set of data. There is a wide range of variations here: the data could be labeled or not labeled, resulting in supervised or unsupervised approaches; the data could be labeled with a category or a real number, resulting in classification or regression problems; the model can be limited to a certain form, such as a class of probability models, or can have fewer constraints, as in the case of deep neural networks. Once the model is learned, there is also a wide range of possible uses. It can be used for predicting outputs given new inputs, filling in missing data, generating new samples, or providing insights on hidden relationships between data entries. Data is so fundamental in Machine Learning that people in the field don't really ask why we should learn from data. Many datasets from different fields are collected or labeled, and the learned models are compared based on accuracy, computation speed, generalizability, etc. Therefore, Machine Learning people often consider Computer Vision and Robotics as areas for applying Machine Learning techniques.
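As a concrete example of the supervised, classification flavor of this setup, here is a minimal scikit-learn sketch (my own toy example, not from the post): labeled data goes in, a model comes out, and the model then predicts labels for new inputs.

import numpy as np
from sklearn.svm import SVC

# Toy labeled dataset: 2D points, label 1 if x + y > 1, else 0
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1).astype(int)

# Learn a model from the data (supervised classification)
model = SVC(kernel='rbf').fit(X[:150], y[:150])

# Use the learned model to predict labels for unseen inputs
accuracy = model.score(X[150:], y[150:])
print('held-out accuracy:', accuracy)

Swapping the labels for real numbers and the SVM for a regressor turns the same recipe into regression; dropping the labels entirely turns it into clustering or density estimation. The shape of the recipe, data then model then prediction, stays the same.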

Robotics, on the other hand, comes from a very different background. There is usually no data to start with in robotics: if you cannot control your robot, or if your robot crashes itself at its first move, how are you going to collect any data? Therefore, classical robotics is about designing models based on physics and geometry. You build models that describe how the control input changes the robot state given the current observation. Based on this model, you can infer the input that will safely drive the robot to a certain state.
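As a toy illustration of this model-based recipe, the sketch below uses a hand-written model of a 1D cart (a double integrator) and a simple proportional-derivative rule to pick inputs that drive the state to a goal. The model, gains, and numbers are all my own made-up example, not anything from the post.

import numpy as np

# Model: 1D double integrator, state = [position, velocity], input = acceleration
dt = 0.05
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([0.5 * dt**2, dt])

def controller(state, goal):
    # Use the model's structure to pick an input: push toward the goal
    # position while damping velocity (a PD law with hand-tuned gains)
    kp, kd = 4.0, 2.5
    return kp * (goal - state[0]) - kd * state[1]

state = np.array([0.0, 0.0])
goal = 1.0
for _ in range(200):
    u = controller(state, goal)
    state = A @ state + B * u   # propagate the hand-written model
print('final state:', state)    # should settle near [1, 0]

No dataset is involved anywhere; everything the controller knows about the world is written down by hand in A, B, and the gains.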

Once you can command your robot to reach a certain state, a wide variety of problems emerge. The robot will then have to do obstacle avoidance and path planning to reach a certain goal. You may need to find a goal state that satisfies a set of restrictions while optimizing a set of properties. Simultaneous localization and mapping (SLAM) may be needed if no map is given. In addition, sensor fusion is required when multiple sensors with different properties are used. There may also be uncertainties in the robot state, where belief space planning may be helpful. For robots with a gripper, you may also need to identify stable grasps and recognize the type and pose of an object for manipulation. And of course, there is a whole different set of problems on designing the mechanics and hardware of the robot. Unlike in Machine Learning, a lot of these problems are solved without a set of data. However, most of these robotics problems (excluding mechanical and hardware problems) share a common goal of determining the robot input based on feedback. (Some) Roboticists view robotics as the field with the ultimate goal of creating machines that act like humans, and Machine Learning and Computer Vision as fields that can provide methods to help accomplish such a goal.

The field of Computer Vision started under AI in the 60s with the goal of helping robots achieve intelligent behaviors, but left that goal behind after the internet era, when tons of images on the internet were waiting to be classified. In this age, computer vision applications are no longer restricted to physical robots. In the past decade, the field of Computer Vision has been driven by datasets. The implicit agreement on evaluation against standardized datasets helped the field advance at a reasonably fast pace (at the cost of millions of grad student hours spent tweaking models to get a 1% improvement.) Given these datasets, the field of Computer Vision inevitably left the Robotics community and embraced data-driven Machine Learning approaches. Most Computer Vision problems have a common goal of learning models for visual data. The model is then used to do classification, clustering, sample generation, etc. on images or videos. The big picture of Computer Vision can be seen in my previous post. Some Computer Vision scientists consider vision different from other senses and believe that the development of vision is fundamental to the evolution of intelligence (which could be true… experiments do suggest that about 50% of our brain neurons are vision related.) Nowadays, Computer Vision and Machine Learning are deeply entangled; Machine Learning techniques help foster Computer Vision solutions, while successful models in Computer Vision contribute back to the field of Machine Learning. For example, the success story of Deep Learning started with Machine Learning models being applied to the ImageNet challenge and ended up with a wide range of architectures that can be applied to other problems in Machine Learning. On the other hand, Robotics is a field that Computer Vision folks are gradually moving back to. Several well-known Computer Vision scientists, such as Jitendra Malik, have started to consider how Computer Vision can help the field of Robotics based on the recent success of data-driven approaches, since their conversations with Robotics colleagues were mostly about vision not working.

Paper Picks: ICRA 2017

In Computer Vision, deep learning, Machine Learning, Paper Talk, Robotics on July 31, 2017 at 1:04 pm

by Li Yang Ku (Gooly)

I was at ICRA (International Conference on Robotics and Automation) in Singapore this June to present one of my works. Surprisingly, the computer vision track seems to have gained a lot of interest in the robotics community: the four computer vision sessions were the most crowded among all the sessions I attended. The following are a few papers related to computer vision and deep learning that I found quite interesting.

a) Schmidt, Tanner, Richard Newcombe, and Dieter Fox. “Self-supervised visual descriptor learning for dense correspondence.”

In this work, a self-supervised learning approach is introduced for generating dense visual descriptors with convolutional neural networks. Given a set of RGB-D videos of Schmidt, the first author, wandering around, a set of training data can be automatically generated by using Kinect Fusion to track feature points between frames. A pixel-wise contrastive loss is used such that two pixels that belong to the same model point would have similar descriptors.
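Below is a minimal numpy sketch of a pixel-wise contrastive loss of this flavor; the margin value and exact functional form are my guesses and not necessarily the paper's formulation.

import numpy as np

def pixel_contrastive_loss(desc_a, desc_b, match, margin=0.5):
    """desc_a, desc_b: (N, D) descriptors for N pixel pairs, one pixel from
    each of two frames; match: (N,) 1 if the pair tracks the same model
    point, 0 otherwise."""
    d = np.linalg.norm(desc_a - desc_b, axis=1)
    # Matched pairs are pulled together, non-matches pushed past a margin
    pos = match * d**2
    neg = (1 - match) * np.maximum(0.0, margin - d)**2
    return np.mean(pos + neg)

# Toy usage with random 16-dimensional descriptors
rng = np.random.default_rng(1)
a, b = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
labels = rng.integers(0, 2, size=32)
print(pixel_contrastive_loss(a, b, labels))

In the actual system the descriptors come from a fully convolutional network evaluated at the tracked pixel locations, and the match labels come for free from the Kinect Fusion tracking, which is what makes the training self-supervised.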

Kinect Fusion cannot associate points between videos; however, with just the training data within each video, the authors show that the learned descriptors of the same model point (such as the tip of the nose) are similar across videos. This can be explained by the hypothesis that, with enough data, a model point trajectory will inevitably come near the trajectory of the same model point in another video. By chaining these trajectories, clusters of the same model point can be separated even without labels. The figure above visualizes the learned features with colors. Note that the network learns a similar mapping across videos despite having no training signal across videos.

b) Pavlakos, Georgios, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. “6-dof object pose from semantic keypoints.”

In this work, semantic keypoints predicted by convolutional neural networks are combined with a deformable shape model to estimate the pose of object instances or objects of the same class. Given a single RGB image of an object, a set of class-specific keypoints is first identified through a CNN that is trained on labeled feature point heat maps. A fitting problem that maps these keypoints to keypoints on the 3D model is then solved using a deformable model that captures shape variability. The figure above shows some pretty good results on recognizing the same features across objects of the same class.

The CNN used in this work is the stacked hourglass architecture, in which two hourglass modules are stacked together. The hourglass module was introduced in the paper “Newell, Alejandro, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.” An hourglass module is similar to a fully convolutional neural network but with residual modules, which the authors claim makes it more balanced between downsampling and upsampling. Stacking multiple hourglass modules allows repeated bottom-up, top-down inference, which improves on state-of-the-art performance.
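Below is a minimal PyTorch sketch of a single hourglass module followed by a keypoint heatmap head; the depth, channel counts, and residual block design are simplifications of my own, not the exact Newell et al. architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    """3x3 conv residual block that keeps the channel count fixed."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class Hourglass(nn.Module):
    """Repeated downsampling then upsampling, with a residual skip branch
    kept at every resolution and added back on the way up."""
    def __init__(self, ch, depth=3):
        super().__init__()
        self.depth = depth
        self.down = nn.ModuleList([Residual(ch) for _ in range(depth)])
        self.up = nn.ModuleList([Residual(ch) for _ in range(depth)])
        self.skip = nn.ModuleList([Residual(ch) for _ in range(depth)])
        self.bottom = Residual(ch)

    def forward(self, x):
        skips = []
        for i in range(self.depth):
            skips.append(self.skip[i](x))           # keep high-res features
            x = self.down[i](F.max_pool2d(x, 2))    # go down one resolution
        x = self.bottom(x)
        for i in reversed(range(self.depth)):
            x = skips[i] + F.interpolate(self.up[i](x), scale_factor=2)
        return x

# Predict one heatmap per semantic keypoint from the hourglass output
num_keypoints = 8
net = nn.Sequential(Hourglass(ch=32), nn.Conv2d(32, num_keypoints, 1))
heatmaps = net(torch.randn(1, 32, 64, 64))   # shape (1, 8, 64, 64)

Stacking means feeding the output features of one such module (plus intermediate heatmap predictions) into a second identical module, which is what gives the repeated bottom-up, top-down passes.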

c) Sung, Jaeyong, Ian Lenz, and Ashutosh Saxena. “Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories.”

In this work, point cloud, natural language, and manipulation trajectory data are mapped to a shared embedding space using a neural network. For example, given the point cloud of an object and a set of instructions as input, the neural network should map it to a region in the embedding space that is close to the trajectory that performs such an action. Instead of taking the whole point cloud as input, a segmentation process that decides which part of the object to manipulate based on the instruction is first executed. Based on this shared embedding space, the trajectory closest to where the input point cloud and language map to can be executed at test time.

In order to learn a semantically meaningful embedding space, a loss-augmented cost that considers the similarity between different types of trajectories is used. The results show that the network puts similar groups of actions, such as pushing a bar and moving a cup to a nozzle, close to each other in the embedding space.
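Below is a minimal PyTorch sketch of the shared-embedding idea; the encoders, feature dimensions, and the plain margin-based ranking loss are my own simplifications, not the paper's exact loss-augmented formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 32

def mlp(in_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))

# One encoder per modality, all mapping into the same embedding space
point_cloud_enc = mlp(256)   # e.g. features of the segmented object part
language_enc = mlp(128)      # e.g. features of the instruction text
trajectory_enc = mlp(90)     # e.g. flattened waypoints of a trajectory

def ranking_loss(pc, lang, traj_pos, traj_neg, margin=1.0):
    """Pull the matching trajectory closer to the (point cloud, language)
    embedding than a non-matching trajectory, by at least a margin."""
    query = F.normalize(point_cloud_enc(pc) + language_enc(lang), dim=1)
    d_pos = (query - F.normalize(trajectory_enc(traj_pos), dim=1)).norm(dim=1)
    d_neg = (query - F.normalize(trajectory_enc(traj_neg), dim=1)).norm(dim=1)
    return F.relu(margin + d_pos - d_neg).mean()

# Toy batch of random features
loss = ranking_loss(torch.randn(4, 256), torch.randn(4, 128),
                    torch.randn(4, 90), torch.randn(4, 90))
loss.backward()

At test time the same encoders embed the segmented point cloud and the instruction, and the stored trajectory whose embedding is nearest to that query is the one executed.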

d) Finn, Chelsea, and Sergey Levine. “Deep visual foresight for planning robot motion.”

In this work, a video prediction model that uses a convolutional LSTM (long short-term memory) is used to predict the pixel flow transformation from the current frame to the next frame for a non-prehensile manipulation task. This model takes the input image, the end-effector pose, and a future action, and predicts the image at the next time step. The predicted image is then fed back into the network recursively to generate the next image. This network is learned from 50,000 pushing examples of hundreds of objects collected from 10 robots.
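Below is a minimal PyTorch sketch of a convolutional LSTM cell, the generic building block such a predictive model is built around; this is the textbook ConvLSTM recurrence, not the paper's full architecture.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed with convolutions, so the hidden
    state keeps its spatial layout (useful for predicting images)."""
    def __init__(self, in_ch, hidden_ch, kernel=5):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

# Toy rollout: feed frames in one at a time, carrying the spatial state along
cell = ConvLSTMCell(in_ch=3, hidden_ch=16)
h = c = torch.zeros(1, 16, 64, 64)
for t in range(5):
    frame = torch.randn(1, 3, 64, 64)
    h, c = cell(frame, (h, c))

In the paper's setting the recurrence also conditions on the end-effector pose and the candidate action, and the hidden state is decoded into the pixel flow that warps the current frame into the predicted next frame.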

For each test, the user specifies where certain pixels on an object should move to; the robot then uses the model to determine the actions that will most likely reach the target, using an optimization algorithm that samples actions over several iterations. Some of the results are shown in the figure above. The first column shows the interface where the user specifies the goal: the red markers are the starting pixel positions, and the green markers of the same shape are the goal positions. Each row shows a sequence of actions taken to reach the specified target.
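Below is a minimal numpy sketch of this kind of sampling-based action optimization, with a stand-in predict function in place of the learned video prediction model; the cross-entropy-method-style refinement is my reading of "samples actions over several iterations," not necessarily the paper's exact optimizer.

import numpy as np

def predict_pixel(pixel, action):
    # Stand-in for the learned model: here a push simply shifts the pixel.
    # In the real system this would be the video prediction network.
    return pixel + action

def plan_push(start_pixel, goal_pixel, iters=5, samples=64, top_k=8):
    """Sample candidate actions, score them with the predictive model, and
    refit the sampling distribution to the best ones (CEM-style)."""
    rng = np.random.default_rng(0)
    mean, std = np.zeros(2), np.ones(2) * 10.0
    for _ in range(iters):
        actions = rng.normal(mean, std, size=(samples, 2))
        predicted = np.array([predict_pixel(start_pixel, a) for a in actions])
        costs = np.linalg.norm(predicted - goal_pixel, axis=1)
        elite = actions[np.argsort(costs)[:top_k]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean  # current best action estimate

action = plan_push(np.array([10.0, 20.0]), np.array([25.0, 5.0]))
print('planned pixel displacement:', action)

The appeal of this recipe is that the cost is defined entirely in pixel space (distance between predicted and desired pixel positions), so the robot never needs an explicit object model; the learned video prediction stands in for it.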