Life is a game, take it seriously

Archive for the ‘Machine Learning’ Category

RSS 2018 Highlights

In Machine Learning, Paper Talk, Robotics on July 10, 2018 at 3:18 pm

by Li Yang Ku (Gooly)

I was at RSS (Robotics: Science and Systems) in Pittsburgh a few weeks ago. The conference was held in the Carnegie Music Hall, and the conference badge could also be used to visit the two Carnegie museums next to it. (The Eskimo and Native American exhibition on the third floor is a must see. Just in case you don’t know, an igloo can be built within 1.5 hours by just two Inuits, and there is a video of it.)

RSS is a relatively small conference compared to IROS and ICRA. With only a single track, you get to see every accepted paper from many different fields, ranging from robotic whiskers to surgical robots. I would however argue that the highlights of this year’s RSS were the keynote talks by Bernardine Dias and Chad Jenkins. Unlike most keynote talks I’ve been to, these two talks were less about new technologies and more about humanity and diversity. In this post, I am going to talk about both talks plus a few interesting papers at RSS.

a) Bernardine Dias, “Robotics technology for underserved communities: challenges, rewards, and lessons learned.”

Bernardine’s group focuses on adapting technologies so that they are accessible to communities that are left behind. One of the technologies developed was a tool for helping blind students learn braille, and it has had significant impact among blind communities across the globe. Bernardine gave an amazing talk at RSS. However, the video of her talk is not public yet (not sure if it will be), and surprisingly not many videos of her are on the internet. The closest content I can find is a really nice audio interview with Bernardine. There is also a short video describing their work below, but what this talk is really about is not the technology or design but the lessons learned through helping these underserved communities.

When roboticists talk about helping society, many of them focus on the technology and leave the actual application to the future. Bernardine’s group is different in that they actually travel to these underserved communities to understand what they need and integrate their feedback into the design process directly. This is easier said than done. You have to understand each community before your visit; some acts are considered good in one culture but an insult in another. Giving without understanding often results in waste. Bernardine mentioned in her talk that one of the schools in an underserved community they collaborated with received a large one-time donation for buying computers. It was a large event where important people came, and it was broadcast on the news. However, to accommodate this hardware, the two-classroom school had to give up one of its classrooms and therefore reduce the number of classes it could teach. Ironically, the school had neither the resources to power these computers nor people to teach students or teachers how to use them. The donation actually resulted in more harm than help to the community.

b) Odest Chadwicke (Chad) Jenkins, “Robotics: Making the World a Better Place through Minimal Message-oriented Transport Layers.”

While Bernardine tries to adapt technologies for underserved communities, Chad tries to design interfaces that help people with disabilities by deploying robots to their homes. Chad showed some of the work done by Charlie Kemp’s group and his own lab with Henry Evans. Henry Evans was a successful financial officer in Silicon Valley until he had a stroke that left him paralyzed and mute. However, Henry did not give up on living fully and became a strong advocate of robots for people with disabilities. Henry’s story is inspiring and an example of how robots can help people with disabilities live freely. The Robots for Humanity project is the result of these successful collaborations. Since then, Henry has given three TED talks through robots, and the one below shows how Chad helped him fly a quadrotor.

 

However, the highlight of Chad’s talk was when he called for more diversity in the community. Minorities, especially African Americans and Latinos, are severely underrepresented in the robotics community in the U.S. The issue of diversity is usually not something roboticists or computer scientists think of or list as a priority. Based on Chad’s numbers, past robotics conferences, including RSS, were not immune to this kind of negligence. This is not hard to see: among the thousands of conference talks I’ve been to, there were probably no more than three talks by African American speakers. Although there are no obvious solutions to this problem yet, having the community become aware of or agree that this is a problem is an important first step. Chad urged people to pay attention to whether everyone is given equal opportunities, and simply being friendly to isolated minorities at a conference may make a difference in the long run.

c) Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. “Differentiable Particle Filters.”

This work introduces a differentiable particle filter (DPF) that can be trained end to end. The DPF is composed of an action sampler that generates action samples, an observation encoder, a particle proposer that learns to generate new particles based on observations, and an observation likelihood estimator that weights each particle. These four components are feedforward networks that can be learned from training data. What I found interesting is that the authors made comments similar to the authors of the paper Deep Image Prior: deep learning approaches work not just because of learning but also because of engineered structures, such as convolutional layers, that encode priors. This motivated the authors to look for architectures that can encode prior knowledge of algorithms into the neural network.
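
To make the structure concrete, below is a minimal PyTorch sketch of one DPF update step. The network sizes, state dimension, and the way I mix proposed particles back in are my own assumptions for illustration, not the authors’ implementation.

```python
# A minimal sketch of one differentiable particle filter (DPF) step. Network sizes,
# the state dimension, and the particle mixing ratio are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, OBS_DIM, OBS_FEAT = 3, 64, 32   # e.g. (x, y, heading) and an observation embedding

obs_encoder = nn.Sequential(nn.Linear(OBS_DIM, OBS_FEAT), nn.ReLU())   # observation encoder
proposer    = nn.Linear(OBS_FEAT, STATE_DIM)                           # particle proposer
likelihood  = nn.Linear(STATE_DIM + OBS_FEAT, 1)                       # observation likelihood estimator
motion_net  = nn.Linear(STATE_DIM + 2, STATE_DIM)                      # learned noisy motion ("action sampler")

def dpf_step(particles, action, observation):
    """particles: (N, STATE_DIM); action: (2,); observation: (OBS_DIM,)."""
    n = particles.shape[0]
    # 1) Action sampler: move every particle with a learned, noisy motion update.
    act = action.expand(n, -1)
    particles = particles + motion_net(torch.cat([particles, act], dim=1))
    # 2) Observation encoder + particle proposer: inject some particles from the observation.
    feat = obs_encoder(observation).unsqueeze(0)              # (1, OBS_FEAT)
    proposed = proposer(feat.expand(n // 4, -1))              # replace a quarter of the particles
    particles = torch.cat([particles[: n - n // 4], proposed], dim=0)
    # 3) Observation likelihood estimator: reweight each particle against the observation.
    scores = likelihood(torch.cat([particles, feat.expand(n, -1)], dim=1)).squeeze(1)
    log_weights = torch.log_softmax(scores, dim=0)
    state_estimate = (log_weights.exp().unsqueeze(1) * particles).sum(dim=0)  # weighted mean
    return particles, log_weights, state_estimate
```

Because every operation here is differentiable, the weighted-mean state estimate can be compared against ground truth and the error backpropagated through all four components at once, which is the point of making the filter differentiable.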

d) Marc Toussaint, Kelsey R. Allen, Kevin A. Smith, and Joshua B. Tenenbaum. “Differentiable Physics and Stable Modes for Tool-Use and Manipulation Planning.”

Task and Motion Planning (TAMP) approaches combine symbolic task planners and geometric motion planners hierarchically. Symbolic task planners can be helpful in solving task sequences based on high-level logic, while geometric planners operate on detailed specifications of the world state. This work is an extension that further considers dynamic physical interactions. The whole robot action sequence is modeled as a sequence of modes connected by switches. Modes represent durations that have constant contact or can be modeled by kinematic abstractions. The task can therefore be written in the form of a Logic-Geometric Program in which the whole sequence is jointly optimized. The video above shows that such an approach can solve tasks that the authors call physical puzzles. This work also won the best paper award at RSS.

Paper Picks: CVPR 2018

In Computer Vision, deep learning, Machine Learning, Neural Science, Paper Talk on July 2, 2018 at 9:08 pm

by Li Yang Ku (Gooly)

I was at CVPR in Salt Lake City. This year there were more than 6,500 attendees and a record-high number of accepted papers. People were definitely struggling to see them all. It was a little disappointing that there were no keynote speakers, but among the 9 major conferences I have been to, this one had the best dance party (see image below). You never know how many computer scientists can dance until you give them unlimited alcohol.

In this post I am going to talk about a few papers that were not the most popular ones but were what I personally found interesting. If you want to know the papers that the reviewers thought were interesting instead, you can look into the best paper “Taskonomy: Disentangling Task Transfer Learning” and the four other honorable mentions, including “SPLATNet: Sparse Lattice Networks for Point Cloud Processing”, a collaboration between Nvidia and the vision lab at UMass Amherst, which I am part of.

a) Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. “Learning and Using the Arrow of Time.”

I am quite fond of work that explores cues in the world that may be useful for unsupervised learning. Traditional deep learning approaches require large amounts of labeled training data, but we humans seem to be able to learn just from interacting with the world in an unsupervised fashion. In this paper, the direction of time is used as a cue. The authors train a neural network to distinguish the direction of time and show that such a network can be helpful in action recognition tasks.
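
The self-supervision signal is cheap to construct: reverse some clips along the time axis and train a classifier to tell forward from backward. Below is a minimal PyTorch sketch of that idea; the tiny 3D-conv backbone is a placeholder, not the architecture used in the paper.

```python
# Minimal sketch of the "arrow of time" self-supervision: reverse half the clips
# along the time axis and train a classifier on that free forward/backward label.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # clips are (B, 3, T, H, W)
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(16, 2))                                        # logits: forward vs. backward

def make_batch(clips):
    """clips: (B, 3, T, H, W). Returns clips with half of them time-reversed, plus labels."""
    b = clips.shape[0]
    labels = torch.zeros(b, dtype=torch.long)
    labels[b // 2:] = 1
    clips = clips.clone()
    clips[b // 2:] = clips[b // 2:].flip(dims=[2])           # flip the time dimension
    return clips, labels

clips, labels = make_batch(torch.randn(4, 3, 8, 32, 32))     # toy batch of 4 clips
loss = nn.CrossEntropyLoss()(model(clips), labels)
loss.backward()                                              # no human labels were needed
```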

b) Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. “Learning to Localize Sound Source in Visual Scenes.”

This is another example of using cues available in the world. In this work, the authors ask whether a machine can learn the correspondence between a visual scene and sound, and localize the sound source only by observing sound and visual scene pairs, the way humans do. This is done by using a triplet network that tries to minimize the difference between the visual feature of a video frame and the sound feature generated in a similar time window, while maximizing the difference between the same visual feature and a random sound feature. As you can see in the figure above, the network is able to associate different sounds with different visual regions.
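
Below is a minimal sketch of just the triplet objective described above (the attention mechanism that produces the localization map is omitted); the visual and audio embeddings are assumed to come from two hypothetical embedding networks that are not shown.

```python
# Sketch of the triplet objective: pull a frame embedding toward the sound feature
# from the same time window, push it away from a random sound feature.
import torch
import torch.nn.functional as F

def sound_triplet_loss(frame_feat, paired_sound_feat, random_sound_feat, margin=1.0):
    """All inputs: (B, D) embedding batches."""
    pos = F.pairwise_distance(frame_feat, paired_sound_feat)   # distance to the matching sound
    neg = F.pairwise_distance(frame_feat, random_sound_feat)   # distance to a mismatched sound
    return F.relu(pos - neg + margin).mean()                   # matching pair should win by a margin

v, a_pos, a_neg = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(sound_triplet_loss(v, a_pos, a_neg))
```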

c) Edward Kim, Darryl Hannan, and Garrett Kenyon. “Deep Sparse Coding for Invariant Multimodal Halle Berry Neurons.”

This work is inspired by experiments done by Quiroga et al. that found a single neuron in one human subject’s brain that fires on both pictures of Halle Berry and text of Halle Berry’s name. In this paper, the authors show that training a deep sparse coding network that takes a face image and a text image of the corresponding name results in learning a multimodal invariant neuron that fires on both Halle Berry’s face and name. When one modality is missing, the missing image or text can be generated. In this network, each sparse coding layer is learned through the Locally Competitive Algorithm (LCA), which uses principles of thresholding and local competition between neurons. Top-down feedback is also used in this work by propagating reconstruction errors downward. The authors show interesting results where adding information to one modality changes the belief of the other modality. The figure above shows that this Halle Berry neuron in the sparse coding network can distinguish between Catwoman played by Halle Berry and Catwoman played by Anne Hathaway or Michelle Pfeiffer.

d) Assaf Shocher, Nadav Cohen, and Michal Irani. “Zero-Shot Super-Resolution using Deep Internal Learning.”

Super resolution is a task that tries to increase the resolution of an image. The typical approach nowadays is to learn it with a neural network. However, the authors show that this approach only works well if the downsampling process from the high resolution image to the low resolution image is similar in training and testing. In this work, no training is needed beforehand. Given a test image, training examples are generated from the test image itself by downsampling patches of this same image. The fundamental idea of this approach is the fact that natural images have strong internal data repetition. Therefore, from the same image you can infer the high resolution structure of a low resolution patch by observing other parts of the image that have higher resolution and similar structure. The image above shows their results (top row) versus state of the art results (bottom row).
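
Below is a rough sketch of how such internal training pairs could be generated from the test image alone; the crop sizes and the bicubic downsampling are assumptions for illustration, not the authors’ exact procedure.

```python
# Rough sketch of building "internal" training pairs from the test image itself:
# random high-resolution crops become targets, their downsampled versions become inputs.
import torch
import torch.nn.functional as F

def make_internal_pairs(test_img, scale=2, crops=16, size=32):
    """test_img: (1, 3, H, W). Returns (low-res, high-res) patch pairs from the same image."""
    _, _, h, w = test_img.shape
    highs = []
    for _ in range(crops):
        y = torch.randint(0, h - size, (1,)).item()
        x = torch.randint(0, w - size, (1,)).item()
        highs.append(test_img[:, :, y:y + size, x:x + size])
    high = torch.cat(highs, dim=0)                                # (crops, 3, size, size)
    low = F.interpolate(high, scale_factor=1 / scale, mode='bicubic', align_corners=False)
    return low, high        # train a small net so that net(upscale(low)) reconstructs high

low, high = make_internal_pairs(torch.rand(1, 3, 128, 128))
print(low.shape, high.shape)    # torch.Size([16, 3, 16, 16]) torch.Size([16, 3, 32, 32])
```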

e) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.”

Most modern approaches for denoising, super resolution, or inpainting tasks use an image generation network trained on a large dataset that consists of pairs of images before and after the effect. This work shows that these nice outcomes are not just the result of learning but also the effect of the convolutional structure. The authors take an image generation network, feed random noise as input, and then update the network using the error between the outcome and the test image, such as the left image shown above for inpainting. After many iterations, the network magically generates an image that fills the gap, such as the right image above. What this work says is that, contrary to the common belief that deep learning approaches for image restoration learn image priors better than engineered priors, the deep structure itself is just a better engineered prior.
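
Below is a minimal sketch of this optimization loop for inpainting, assuming a toy convolutional generator instead of the hourglass network used in the paper.

```python
# Sketch of the deep-image-prior loop for inpainting: fix a random noise input, and
# fit an untrained generator so that its output matches the corrupted image only
# where pixels are known. The tiny generator here is a placeholder.
import torch
import torch.nn as nn

def deep_image_prior_inpaint(corrupted, mask, steps=2000, lr=0.01):
    """corrupted: (1, 3, H, W); mask: (1, 1, H, W) with 1 for known pixels, 0 for holes."""
    net = nn.Sequential(                       # stand-in for the U-Net/hourglass used in the paper
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid())
    z = torch.randn(1, 32, *corrupted.shape[-2:])      # fixed noise input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        out = net(z)
        loss = ((out - corrupted) ** 2 * mask).mean()  # error only on known pixels
        loss.backward()
        opt.step()
    return net(z).detach()                             # holes are filled by the conv prior
```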

Deep Learning Approaches For Object Detection

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 25, 2018 at 3:16 pm

by Li Yang Ku

In this post I am going to talk about the progression of a few deep learning approaches for object detection. I will start from R-CNN and OverFeat (2013), then gradually move to more recent approaches such as RetinaNet, which won the best student paper award at ICCV 2017. Object detection here refers to the task of identifying a limited set of object classes (20 ~ 200) in a given image by giving each identified object a bounding box and a label. This is one of the mainstream challenges in Computer Vision, which requires algorithms to output the locations of multiple objects in addition to their corresponding classes. Some of the most well known datasets are the PASCAL visual object classes challenge (2005-2012) funded by the EU (20 classes, ~10k images), the ImageNet object detection challenge (2013 ~ present) sponsored by Stanford, UNC, Google, and Facebook (200 classes, ~500k images), and the COCO dataset (2015 ~ present) first started by Microsoft (80 classes, ~200k images). These datasets provide hand-labeled bounding boxes and class labels of objects in images for training. Challenges for these datasets happen yearly; teams from all over the world submit their code to compete on an undisclosed test set.

In December 2012, the success of Alexnet on the ImageNet classification challenge was published. While many computer vision scientists around the world were still scratching their heads trying to understand this result, several groups quickly harvested techniques implemented in Alexnet and tested them out. Based on the success of Alexnet, in November 2013 the vision group at Berkeley published (on arXiv) an approach for solving the object detection problem. The proposed R-CNN is a simple extension of Alexnet, which was designed to solve the classification problem, to handle the detection problem. R-CNN is composed of 3 parts: 1) region proposal, where selective search is used to generate around 2000 possible object location bounding boxes; 2) feature extraction, where Alexnet is used to generate features; 3) classification, where an SVM (support vector machine) is trained for each object class. This hybrid approach successfully outperformed previous algorithms on the PASCAL dataset by a significant margin. A short sketch of the three stages is given after the architecture figure below.

R-CNN architecture
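
The three stages can be summarized in a sketch like the one below; selective_search, crop_and_warp, cnn_features, and the per-class SVMs are hypothetical stand-ins for the actual components, so this is pseudocode-level illustration rather than the authors’ code.

```python
# Pseudocode-level sketch of the three R-CNN stages. The helper callables passed in
# (selective_search, crop_and_warp, cnn_features, per-class SVMs) are hypothetical.
def rcnn_detect(image, svms, cnn_features, selective_search, crop_and_warp, threshold=0.5):
    detections = []
    for box in selective_search(image):            # 1) ~2000 class-agnostic region proposals
        patch = crop_and_warp(image, box)          #    warp each region to the CNN input size
        feat = cnn_features(patch)                 # 2) Alexnet (fc7) features for the region
        for cls, svm in svms.items():              # 3) one binary SVM per object class
            score = svm.decision_function([feat])[0]
            if score > threshold:
                detections.append((box, cls, score))
    return detections                              # (non-maximum suppression omitted)
```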

Around the same time (December 2013), the NYU team (Yann LeCun, Rob Fergus) published an approach called OverFeat. OverFeat is based on the idea that convolutions can be done efficiently on dense image locations in a sliding window fashion. The fully connected layers in Alexnet can be seen as 1×1 convolution layers. Therefore, instead of generating a classification confidence for a cropped fixed-size image, OverFeat generates a map of confidences over the whole image. To predict the bounding box, a regressor network is added after the convolution layers. OverFeat finished in 4th place in the 2013 ImageNet object detection challenge but claimed a better-than-first-place result with longer training, which wasn’t ready in time for the competition.
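
The sketch below illustrates the trick with toy layer sizes: once the fully connected head is rewritten as convolutions (the first FC becomes a 7×7 convolution, later FCs become 1×1 convolutions), the same weights produce a single score vector on a small input and a spatial map of scores on a larger one.

```python
# Sketch of the OverFeat trick: a fully connected head over a 7x7 feature map can be
# rewritten as convolutions, so the same weights produce a spatial map of scores on
# larger inputs. Channel sizes are toy values.
import torch
import torch.nn as nn

features = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())

# "Fully connected" head expressed as convolutions:
#   fc1: Linear(32*7*7 -> 128)  becomes  Conv2d(32, 128, kernel_size=7)
#   fc2: Linear(128 -> n_class) becomes  Conv2d(128, n_class, kernel_size=1)
conv_head = nn.Sequential(nn.Conv2d(32, 128, 7), nn.ReLU(), nn.Conv2d(128, 10, 1))

small = torch.randn(1, 3, 28, 28)    # feature map is 7x7 -> head outputs a single score vector
large = torch.randn(1, 3, 112, 112)  # feature map is 28x28 -> head outputs a 22x22 map of scores
print(conv_head(features(small)).shape)   # torch.Size([1, 10, 1, 1])
print(conv_head(features(large)).shape)   # torch.Size([1, 10, 22, 22])
```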

Since then, a lot of research has expanded on concepts introduced in these works. SPP-net is an approach that speeds up R-CNN by up to 100x by performing the convolution operations just once on the whole image (note that OverFeat does convolution on images of different scales). SPP-net adds a spatial pyramid pooling layer before the fully connected layers. This spatial pyramid pooling layer transforms an arbitrary-size feature map into a fixed-size input by pooling from areas separated by grids of different scales. However, similar to R-CNN, SPP-net requires multistep training for feature extraction and the SVM classification. Fast R-CNN was introduced to address this problem. Similar to R-CNN, Fast R-CNN uses selective search to generate a set of possible region proposals, and by adopting the idea of SPP-net, the feature map is generated once on the whole image and an ROI pooling layer extracts fixed-size features for each region proposal. A multi-task loss is also used so that the whole network can be trained together in one stage. Fast R-CNN speeds up R-CNN by up to 200x and produces better accuracy.

Fast R-CNN architecture
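
A minimal sketch of the ROI pooling idea is shown below, using adaptive max pooling over each proposal region of a shared feature map; the feature stride and sizes are assumptions for illustration, not the exact Fast R-CNN implementation.

```python
# Minimal sketch of ROI pooling: compute the feature map once, then pool each region
# proposal into a fixed-size grid that the fully connected layers can consume.
import torch
import torch.nn.functional as F

def roi_pool(feature_map, rois, output_size=7, stride=16):
    """feature_map: (1, C, H, W); rois: list of (x1, y1, x2, y2) in image coordinates."""
    pooled = []
    for x1, y1, x2, y2 in rois:
        # map image coordinates onto the (stride-times smaller) feature map
        fx1, fy1 = int(x1 / stride), int(y1 / stride)
        fx2, fy2 = max(fx1 + 1, int(x2 / stride)), max(fy1 + 1, int(y2 / stride))
        region = feature_map[:, :, fy1:fy2, fx1:fx2]
        pooled.append(F.adaptive_max_pool2d(region, output_size))  # fixed 7x7 grid per ROI
    return torch.cat(pooled, dim=0)          # (num_rois, C, 7, 7) -> fed to the FC layers

feats = torch.randn(1, 256, 38, 50)          # e.g. conv5 features of a 600x800 image
out = roi_pool(feats, [(48, 64, 320, 400), (100, 100, 500, 500)])
print(out.shape)                             # torch.Size([2, 256, 7, 7])
```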

At this point, the region proposal process had become the computational bottleneck for Fast R-CNN. As a result, the “Faster” R-CNN addresses this issue by introducing a region proposal network that generates region proposals based on the same feature map used for classification. This requires a four-stage training procedure that alternates between these two networks, but achieves a speed of 5 frames per second.

The image pyramid, where images of multiple scales are created for feature extraction, was a common approach used with features such as SIFT to handle scale invariance. So far, most R-CNN based approaches do not use image pyramids due to the computation and memory cost during training. The feature pyramid network (FPN) shows that since deep convolutional neural networks are by nature multi-scale, a similar effect can be achieved with little extra cost. This is done by combining top-down information with lateral information at each convolution layer, as shown in the figure below. By restricting the feature maps to have the same dimension, the same classification network can be used for all scales; this has a similar flavor to traditional approaches that use the same detector on images of different scales in the image pyramid.
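
Below is a minimal sketch of one top-down plus lateral merge step in an FPN-style pyramid; the channel counts are illustrative.

```python
# Sketch of one feature pyramid network (FPN) merge step: upsample the coarser
# top-down map and add a 1x1-convolved lateral map, so every pyramid level ends up
# with the same channel dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    def __init__(self, lateral_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(lateral_channels, out_channels, 1)     # 1x1 lateral connection
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, top_down, bottom_up):
        up = F.interpolate(top_down, size=bottom_up.shape[-2:], mode='nearest')
        return self.smooth(up + self.lateral(bottom_up))

# e.g. merge a coarse 256-channel map with the finer 1024-channel conv4 map
merge = FPNMerge(lateral_channels=1024)
p5 = torch.randn(1, 256, 13, 13)        # top-down feature
c4 = torch.randn(1, 1024, 25, 25)       # bottom-up (lateral) feature
p4 = merge(p5, c4)                      # (1, 256, 25, 25), same channel count as p5
```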

Until 2017, most of the high accuracy approaches to object detection were extensions of R-CNN that have a region proposal module separate from classification. Single-stage approaches, although faster, were not able to match them in accuracy. The paper “Focal Loss for Dense Object Detection” published at ICCV 2017 identifies the problem with single-stage approaches and proposes an elegant solution that results in faster and more accurate models. The lower accuracy among single-stage approaches was a consequence of the imbalance between foreground and background training examples. By replacing the cross entropy loss with the focal loss, which down-weights examples the network already classifies with high confidence, the network improves substantially in accuracy. The figure below shows the difference between the cross entropy loss (CE) and the focal loss (FL). A larger gamma parameter puts less weight on high confidence examples.
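
The focal loss is essentially a one-line change to the cross entropy: FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. Below is a minimal binary version as a sketch.

```python
# Minimal binary focal loss sketch: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
# Setting gamma=0 reduces it to an (alpha-weighted) cross entropy loss.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits, targets: (N,) tensors; targets are 0/1 labels."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')  # -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)             # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits, targets = torch.randn(16), torch.randint(0, 2, (16,)).float()
print(focal_loss(logits, targets), focal_loss(logits, targets, gamma=0.0, alpha=0.5))
```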

The references for the approaches I mentioned are listed below. Note that I only talked about a small part of a large body of work on object detection, and progress on object detection has been moving at a rapid pace. If you look at the current leaderboard for the COCO dataset, the numbers have already surpassed the best approach I mentioned by a substantial margin.

  • Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for accurate object detection and semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
  • Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).
  • He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” In european conference on computer vision, pp. 346-361. Springer, Cham, 2014.
  • Girshick, Ross. “Fast R-CNN.” arXiv preprint arXiv:1504.08083 (2015).
  • Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object detection with region proposal networks.” In Advances in neural information processing systems, pp. 91-99. 2015.
  • Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. “Feature pyramid networks for object detection.” In CVPR, vol. 1, no. 2, p. 4. 2017.
  • Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. “Focal loss for dense object detection.” arXiv preprint arXiv:1708.02002 (2017).

 

Talk Picks: IROS 2017

In deep learning, Machine Learning, Robotics on February 10, 2018 at 1:06 pm

by Li Yang Ku (Gooly)

I was at IROS (International Conference on Intelligent Robots and Systems) in Vancouver recently (September 2017; this post took way too long to finish) to present one of my works done almost two years ago. Interestingly, there were four deep learning related sessions this year and quite a few papers that I found interesting; however, the talks at IROS were what I found the most inspiring. I am going to talk about three of them in the following.

a) “Toward Unifying Model-Based and Learning-Based Robotics”, plenary talk by Dieter Fox.  

In my previous post, I talked about how the machine learning field differs from the robotics field: machine learning learns from data, while robotics designs models that describe the environment. In this talk, Dieter tries to glue both worlds together. The 50-minute talk is posted below. For those who don’t have 50 minutes, I describe the talk briefly in the following.

Dieter first described a list of work his lab did (robot localization, RGB-D matching, real-time tracking, etc.) using model-based approaches. Model-based approaches match models to data streams and control the robot by finding actions that reach the desired state. One of the benefits of such an approach is that our own knowledge of how the physical world works can be injected into the model. Dieter then gave a brief introduction to deep learning and to one of his students’ work on learning visual descriptors in a self-supervised way, which I covered in a previous post. Based on the recent success in deep learning, Dieter suggested that there are ways to incorporate model-based approaches into a deep learning framework, and showed an example of how we can add knowledge of rigid body motion into a network by forcing it to output segmentations and their poses. The overall conclusion is that 1) model-based approaches are accurate within a local basin of attraction in which the models match the environment, 2) deep learning provides a larger basin of attraction within the trained regime, and 3) unifying both approaches gives you more powerful systems.

 

b) “Robotics as the Path to Intelligence”, keynote talk by Oliver Brock

Oliver Brock gave an exciting interactive talk on understanding intelligence in one of the IROS keynote sessions. Unfortunately it was not recorded and the slides cannot be distributed, so I posted the most similar talk he gave below instead. It is also a pretty good talk with some overlapping content, but under a different topic.

In the IROS talk, Oliver made a few points. First, he started out with AlphaGo by DeepMind, stating that its success in the game of Go is very similar to IBM’s Deep Blue, which beat the chess champion in 1997. In both cases, despite the system’s superior game play, it needs a human to make the moves for it. A lot of things that humans are good at are usually difficult for our current approaches to artificial intelligence. How we define intelligence is crucial because it will shape our research direction and how we solve problems. Oliver then showed that defining intelligence is non-trivial and has to do with what we perceive, by performing an interactive experiment with the audience. He then talked about his work on integrating cross-modal perception and action, the importance of manipulation towards intelligence, and soft hands that can solve hard manipulation problems.

 

c) “The Power of Procrastination”, special event talk by Jorge Cham

This is probably the most popular talk of all the IROS talks. The speaker Jorge Cham is the author of the popular PHD Comics (which I may have posted on my blog without permission) and has a PhD in robotics from Stanford University. The following is not the exact same talk he gave at IROS but is very similar.

 

Machine Learning, Computer Vision, and Robotics

In Computer Vision, Machine Learning, Robotics on December 6, 2017 at 2:32 pm

By Li Yang Ku (Gooly)

Having TA’d for Machine Learning this semester and worked in the fields of Computer Vision and Robotics for the past few years, I always have this feeling that the more I learn the less I know. Therefore, it’s sometimes good to just sit back and look at the big picture. This post will talk about how I see the relations between these three fields at a high level.

First of all, Machine Learning is more a brand than a name. Just like Deep Learning and AI, this name is used for getting funding when the previous name is out of hype. In this case, the name became popular after AI projects failed in the 70s. Therefore, Machine Learning covers a wide range of problems and approaches that may look quite different at first glance. Adaboost and support vector machines were the hot topics in Machine Learning when I was doing my master’s degree, but now it is deep neural networks that get all the attention.

Despite the wide variety of research in Machine Learning, it usually rests on the common assumption that a set of data exists. The goal is then to learn a model based on this set of data. There is a wide range of variations here: the data could be labeled or not labeled, resulting in supervised or unsupervised approaches; the data could be labeled with a category or a real number, resulting in classification or regression problems; the model can be limited to a certain form, such as a class of probability models, or can have fewer constraints, as in the case of deep neural networks. Once the model is learned, there is also a wide range of possible uses. It can be used for predicting outputs given new inputs, filling in missing data, generating new samples, or providing insights on hidden relationships between data entries. Data is so fundamental in Machine Learning that people in the field don’t really ask why we learn from data. Many datasets from different fields are collected or labeled, and the learned models are compared based on accuracy, computation speed, generalizability, etc. Therefore, Machine Learning people often consider Computer Vision and Robotics as areas for applying Machine Learning techniques.

Robotics, on the other hand, comes from a very different background. There is usually no data to start with in robotics. If you cannot control your robot, or if your robot crashes itself at its first move, how are you going to collect any data? Therefore, classical robotics is about designing models based on physics and geometry. You build models that describe how the input and the current observation of the robot change the robot state. Based on this model you can infer the input that will safely control the robot to reach a certain state.

Once you can command your robot to reach a certain state, a wide variety of problems emerge. The robot will then have to do obstacle avoidance and path planning to reach a certain goal. You may need to find a goal state that satisfies a set of restrictions while optimizing a set of properties. Simultaneous localization and mapping (SLAM) may be needed if no maps are given. In addition, sensor fusion is required when multiple sensors with different properties are used. There may also be uncertainty in robot states, where belief space planning may be helpful. For robots with a gripper, you may also need to be able to identify stable grasps and recognize the type and pose of an object for manipulation. And of course, there is a whole different set of problems in designing the mechanics and hardware of the robot. Unlike Machine Learning, a lot of these problems are solved without a set of data. However, most of these robotics problems (excluding mechanical and hardware problems) share a common goal of determining the robot input based on feedback. (Some) Roboticists view robotics as the field that has the ultimate goal of creating machines that act like humans, and Machine Learning and Computer Vision are fields that can provide methods to help accomplish such a goal.

The field of Computer Vision started under AI in the 60s with the goal of helping robots achieve intelligent behaviors, but left that goal behind in the internet era, when tons of images on the internet were waiting to be classified. In this age, computer vision applications are no longer restricted to physical robots. In the past decade, the field of Computer Vision has been driven by datasets. The implicit agreement on evaluation based on standardized datasets helped the field advance at a reasonably fast pace (at the cost of millions of grad student hours spent tweaking models to get a 1% improvement.) Given these datasets, the field of Computer Vision inevitably left the Robotics community and embraced data-driven Machine Learning approaches. Most Computer Vision problems have a common goal of learning models for visual data. The model is then used to do classification, clustering, sample generation, etc. on images or videos. The big picture of Computer Vision can be seen in my previous post. Some Computer Vision scientists consider vision different from other senses and believe that the development of vision is fundamental to the evolution of intelligence (which could be true… experiments do show that about 50% of our brain neurons are vision related.) Nowadays, Computer Vision and Machine Learning are deeply entangled; Machine Learning techniques help foster Computer Vision solutions, while successful models in Computer Vision contribute back to the field of Machine Learning. For example, the success story of Deep Learning started with Machine Learning models being applied to the ImageNet challenge and ended up with a wide range of architectures that can be applied to other problems in Machine Learning. On the other hand, Robotics is a field that Computer Vision folks are gradually moving back to. Several well known Computer Vision scientists, such as Jitendra Malik, have started to consider how Computer Vision can help the field of Robotics based on the recent success of data-driven approaches in Computer Vision, since their conversations with Robotics colleagues were mostly about vision not working.

Paper Picks: ICRA 2017

In Computer Vision, deep learning, Machine Learning, Paper Talk, Robotics on July 31, 2017 at 1:04 pm

by Li Yang Ku (Gooly)

I was at ICRA (International Conference on Robotics and Automation) in Singapore to present one of my works this June. Surprisingly, the computer vision track seemed to gain a lot of interest in the robotics community. The four computer vision sessions were the most crowded ones among all the sessions that I attended. The following are a few papers related to computer vision and deep learning that I found quite interesting.

a) Schmidt, Tanner, Richard Newcombe, and Dieter Fox. “Self-supervised visual descriptor learning for dense correspondence.”

In this work, a self-supervised learning approach is introduced for generating dense visual descriptors with convolutional neural networks. Given a set of RGB-D videos of Schmidt, the first author, wandering around, a set of training data can be automatically generated by using Kinect Fusion to track feature points between frames. A pixel-wise contrastive loss is used such that two pixels that belong to the same model point have similar descriptors.

Kinect Fusion cannot associate points across videos; however, with just training data from within the same video, the authors show that the learned descriptors of the same model point (such as the tip of the nose) are similar across videos. This can be explained by the hypothesis that with enough data, a model point trajectory will inevitably come near to the trajectory of the same model point in another video. By chaining these trajectories, clusters of the same model point can be separated even without labels. The figure above visualizes the learned features with colors. Note that the network learns a similar mapping across videos despite having no training signal across videos.
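
A minimal sketch of the pixel-wise contrastive loss mentioned above is given below; the margin value and the way match/non-match pixel pairs are sampled are assumptions, not the exact formulation in the paper.

```python
# Sketch of a pixel-wise contrastive loss for dense descriptors: pixels tracked to the
# same model point should have nearby descriptors, other pairs should be pushed at
# least a margin apart.
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b, non_matches_b, margin=0.5):
    """desc_*: (H*W, D) descriptor maps flattened over pixels;
       matches_a[i] in image A corresponds to matches_b[i] in image B,
       while (matches_a[i], non_matches_b[i]) are non-corresponding pairs."""
    d_match = F.pairwise_distance(desc_a[matches_a], desc_b[matches_b])
    d_non = F.pairwise_distance(desc_a[matches_a], desc_b[non_matches_b])
    match_loss = (d_match ** 2).mean()                       # pull matches together
    non_match_loss = (F.relu(margin - d_non) ** 2).mean()    # push non-matches beyond the margin
    return match_loss + non_match_loss

# toy descriptors for two frames and some index pairs
da, db = torch.randn(100, 16), torch.randn(100, 16)
idx = torch.randint(0, 100, (3, 20))
print(pixelwise_contrastive_loss(da, db, idx[0], idx[1], idx[2]))
```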

b) Pavlakos, Georgios, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. “6-dof object pose from semantic keypoints.”

In this work, semantic keypoints predicted by convolutional neural networks are combined with a deformable shape model to estimate the pose of object instances or objects of the same class. Given a single RGB image of an object, a set of class-specific keypoints is first identified through a CNN that is trained on labeled feature point heat maps. A fitting problem that maps these keypoints to keypoints on the 3D model is then solved using a deformable model that captures shape variability. The figure above shows some pretty good results on recognizing the same features on objects of the same class.

The CNN used in this work is the stacked hourglass architecture, where two hourglass modules are stacked together. The hourglass module was introduced in the paper “Newell, Alejandro, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.” An hourglass module is similar to a fully convolutional neural network but with residual modules, which the authors claim makes it more balanced between downsampling and upsampling. Stacking multiple hourglass modules allows repeated bottom-up, top-down inference, which improves on state-of-the-art performance.

c) Sung, Jaeyong, Ian Lenz, and Ashutosh Saxena. “Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories.”

In this work, point cloud, natural language, and manipulation trajectory data are mapped to a shared embedding space using a neural network. For example, given the point cloud of an object and a set of instructions as input, the neural network should map it to a region in the embedding space that is close to the trajectory that performs such an action. Instead of taking the whole point cloud as input, a segmentation process that decides which part of the object to manipulate based on the instruction is first executed. Based on this shared embedding space, the trajectory closest to where the input point cloud and language map to can be executed at test time.

In order to learn a semantically meaningful embedding space, a loss-augmented cost that considers the similarity between different types of trajectories is used. The results show that the network puts similar groups of actions, such as pushing a bar and moving a cup to a nozzle, close to each other in the embedding space.

d) Finn, Chelsea, and Sergey Levine. “Deep visual foresight for planning robot motion.”

In this work, a video prediction model that uses a convolutional LSTM (long short-term memory) is used to predict the pixel flow transformation from the current frame to the next frame for a non-prehensile manipulation task. The model takes the input image, the end-effector pose, and a future action and predicts the image of the next time step. The predicted image is then fed back into the network recursively to generate the next image. This network is learned from 50,000 pushing examples of hundreds of objects collected from 10 robots.

For each test, the user specifies where certain pixels on an object should move to; the robot then uses the model to determine the actions that will most likely reach the target, using an optimization algorithm that samples actions over several iterations. Some of the results are shown in the figure above; the first column indicates the interface where the user specifies the goal. The red markers are the starting pixel positions, and the green markers of the same shape are the goal positions. Each row shows a sequence of actions taken to reach the specified target.

Generative Adversarial Nets: Your Enemy is Your Best Friend?

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 20, 2017 at 7:10 pm

by Li Yang Ku (gooly)

Generating realistic images with machines was always one of the top items on my list of difficult tasks. Past attempts in the Computer Vision community were only able to produce a blurry image at best. The well-publicized Google Deepdream project was able to generate some interesting artsy images; however, they were modified from existing images and were designed more to make you feel like you were on drugs than to look realistic. Recently (2016), a work that combines the generative adversarial network framework with convolutional neural networks (CNNs) generated some results that look surprisingly good. (A non-vision person would likely not be amazed though.) This approach was quickly accepted by the community and has been referenced more than 200 times in less than a year.

This work is based on an interesting concept first introduced by Goodfellow et al. in the paper “Generative Adversarial Nets” at NIPS 2014 (http://papers.nips.cc/paper/5423-generative-adversarial-nets). The idea was to have two neural networks compete with each other. One would try to generate images as realistic as it can, and the other network would try its best to distinguish them from real images. In theory, this competition reaches a global optimum where the generated images and the real images belong to the same distribution (it could be a lot trickier in practice though). This work in 2014 got some pretty good results on digits and faces, but the generated natural images were still quite blurry (see figure above).
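
A minimal sketch of this two-player training loop is given below, with toy MLPs and toy data; real GANs differ mainly in the networks used and the tricks needed to stabilize training.

```python
# Minimal sketch of the two-network GAN game: the discriminator D is trained to tell
# real from generated samples, the generator G is trained to fool D. Toy MLPs, toy data.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))    # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))     # sample -> "real" logit
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for _ in range(100):
    real = torch.randn(32, 2) + 3.0                  # stand-in "real" data distribution
    z = torch.randn(32, 16)
    # Discriminator step: real samples -> label 1, generated samples -> label 0
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
    d_loss.backward()
    opt_d.step()
    # Generator step: make D label the fakes as real (the non-saturating form)
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```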

In the more recent work “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford, Metz, and Chintala, convolutional neural networks and the generative adversarial net framework are successfully combined with a few techniques that help stabilize the training (https://arxiv.org/abs/1511.06434). Through this approach, the generated images are sharp and surprisingly realistic at first glance. The figures above are some of the generated bedroom images. Notice that if you look closer, some of them may look weird.

The authors further explored what the latent variables represent. Ideally, the generator (the neural network that generates images) should disentangle independent features, and each latent variable should represent a meaningful concept. By modifying these variables, images that have different characteristics can be generated. Note that these latent variables are what is given to the generator as input; in the previous examples they were randomly sampled from a uniform distribution. The figure above shows an example where the authors demonstrate, through arithmetic operations, that the latent variables do represent meaningful concepts. If you subtract the average latent vector of men without glasses from the average latent vector of men with glasses and add the average latent vector of women without glasses, you obtain a latent vector that results in a woman with glasses when passed through the generator. This process identifies the latent direction that represents glasses.
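
The arithmetic itself is just averaging and adding latent vectors, as in the sketch below; `generator` and the z_* collections are hypothetical placeholders for a trained DCGAN generator and latent codes grouped by attribute.

```python
# Sketch of the latent arithmetic described above. All names are hypothetical:
# `generator` is assumed to be a trained DCGAN generator, and each z_* tensor holds
# latent codes whose generated samples were judged to show that attribute.
import torch

def glasses_arithmetic(z_men_glasses, z_men_noglasses, z_women_noglasses):
    """Each input: (N, latent_dim) latent codes. Returns a code for 'woman with glasses'."""
    z = (z_men_glasses.mean(dim=0)
         - z_men_noglasses.mean(dim=0)
         + z_women_noglasses.mean(dim=0))
    return z.unsqueeze(0)                     # (1, latent_dim), ready for the generator

# image = generator(glasses_arithmetic(zg, zn, zw))   # should depict a woman with glasses
```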

 

 

 

Convolutional Neural Network Features for Robot Manipulation

In Computer Vision, deep learning, Robotics on October 24, 2016 at 6:30 am

by Li Yang Ku (Gooly)

In my previous post, I mentioned the obstacles to applying deep learning techniques directly to robotics. First, training data is harder to acquire; second, interacting with the world is not just a classification problem. In this post, I am going to talk about a really simple approach that treats convolutional neural networks (CNNs) as a feature extractor that generates a set of features similar to traditional features such as SIFT. This idea is applied to grasping on Robonaut 2 and published on arXiv (Associating Grasp Configurations with Hierarchical Features in Convolutional Neural Networks) with more details. The ROS package called ros-deep-vision that generates such features using an RGB-D sensor is also public.

Hierarchical CNN Features

 

When we look at deep models such as CNNs, we should keep in mind that these models work well because the way the layers stack up hierarchically matches how the data is structured. Our observed world is also hierarchical; there are common shared structures, such as edges, that when combined in meaningful ways can represent more complex structures such as squares and cubes. A simple view of a CNN is just a tree structure, where a higher-level neuron is a combination of neurons in the previous layer. For example, a neuron that represents cuboids is a combination of neurons that represent the corners and edges of the cuboid. The figures above show such examples of neurons that were found to activate consistently on cuboids and cylinders.

Deep Learning for Robotics

By taking advantage of this hierarchical nature of CNNs, we can turn a CNN into a feature extractor that generates features representing local structures of a higher-level structure. For example, such a hierarchical feature can represent the left edge of the top face of a box, while traditional edge detectors would find all edges in the scene. Instead of representing a feature with a single filter (neuron) in one of the CNN layers, this feature, which we call a hierarchical CNN feature, uses a tuple of filters from different layers. Using backpropagation that restricts activation to one filter per layer allows us to locate such a feature precisely. By finding features such as the front and back edges of the top face of a box, we can learn where to place robot fingers relative to these hierarchical CNN features in order to manipulate the object.

robonaut 2 grasping

 

The most cited papers in computer vision and deep learning

In Computer Vision, deep learning, Paper Talk on June 19, 2016 at 1:18 pm

by Li Yang Ku (Gooly)

paper citation

In 2012 I started a list of the most cited papers in the field of computer vision. I try to keep the list focused on research that relates to understanding this visual world and to avoid image processing, survey, and purely statistical works. However, the computer vision world has changed a lot since 2012, when deep learning techniques started a trend in the field and outperformed traditional approaches on many computer vision benchmarks. Whether or not this deep learning trend lasts long, I think these techniques deserve their own list.

As I mentioned in the previous post, it’s not always the case that a paper cited more contributes more to the field. However, a highly cited paper usually indicates that something interesting has been discovered. The following are the papers that, to my knowledge, are cited the most in Computer Vision and Deep Learning (note that it is “and” not “or”). If you want a certain paper listed here, just comment below.

Cited by 5518

Imagenet classification with deep convolutional neural networks

A Krizhevsky, I Sutskever, GE Hinton, 2012

Cited by 1868

Caffe: Convolutional architecture for fast feature embedding

Y Jia, E Shelhamer, J Donahue, S Karayev…, 2014

Cited by 1681

Backpropagation applied to handwritten zip code recognition

Y LeCun, B Boser, JS Denker, D Henderson…, 1989

Cited by 1516

Rich feature hierarchies for accurate object detection and semantic segmentation

R Girshick, J Donahue, T Darrell…, 2014

Cited by 1405

Very deep convolutional networks for large-scale image recognition

K Simonyan, A Zisserman, 2014

Cited by 1169

Improving neural networks by preventing co-adaptation of feature detectors

GE Hinton, N Srivastava, A Krizhevsky…, 2012

Cited by 1160

Going deeper with convolutions

C Szegedy, W Liu, Y Jia, P Sermanet…, 2015

Cited by 977

Handwritten digit recognition with a back-propagation network

BB Le Cun, JS Denker, D Henderson…, 1990

Cited by 907

Visualizing and understanding convolutional networks

MD Zeiler, R Fergus, 2014

Cited by 839

Dropout: a simple way to prevent neural networks from overfitting

N Srivastava, GE Hinton, A Krizhevsky…, 2014

Cited by 839

Overfeat: Integrated recognition, localization and detection using convolutional networks

P Sermanet, D Eigen, X Zhang, M Mathieu…, 2013

Cited by 818

Learning multiple layers of features from tiny images

A Krizhevsky, G Hinton, 2009

Cited by 718

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang…, 2014

Cited by 691

Deepface: Closing the gap to human-level performance in face verification

Y Taigman, M Yang, MA Ranzato…, 2014

Cited by 679

Deep Boltzmann Machines

R Salakhutdinov, GE Hinton, 2009

Cited by 670

Convolutional networks for images, speech, and time series

Y LeCun, Y Bengio, 1995

Cited by 570

CNN features off-the-shelf: an astounding baseline for recognition

A Sharif Razavian, H Azizpour, J Sullivan…, 2014

Cited by 549

Learning hierarchical features for scene labeling

C Farabet, C Couprie, L Najman…, 2013

Cited by 510

Fully convolutional networks for semantic segmentation

J Long, E Shelhamer, T Darrell, 2015

Cited by 469

Maxout networks

IJ Goodfellow, D Warde-Farley, M Mirza, AC Courville…, 2013

Cited by 453

Return of the devil in the details: Delving deep into convolutional nets

K Chatfield, K Simonyan, A Vedaldi…, 2014

Cited by 445

Large-scale video classification with convolutional neural networks

A Karpathy, G Toderici, S Shetty, T Leung…, 2014

Cited by 347

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei, 2015

Cited by 342

Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

K He, X Zhang, S Ren, J Sun, 2015

Cited by 334

Learning and transferring mid-level image representations using convolutional neural networks

M Oquab, L Bottou, I Laptev, J Sivic, 2014

Cited by 333

Convolutional networks and applications in vision

Y LeCun, K Kavukcuoglu, C Farabet, 2010

Cited by 332

Learning deep features for scene recognition using places database

B Zhou, A Lapedriza, J Xiao, A Torralba…,2014

Cited by 299

Spatial pyramid pooling in deep convolutional networks for visual recognition

K He, X Zhang, S Ren, J Sun, 2014

Cited by 268

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks…, 2015

Cited by 261

Two-stream convolutional networks for action recognition in videos

K Simonyan, A Zisserman, 2014

 

Convolutional Neural Networks in Robotics

In Computer Vision, deep learning, Machine Learning, Neural Science, Robotics on April 10, 2016 at 1:29 pm

by Li Yang Ku (Gooly)

robot using tools

As I mentioned in my previous post, Deep Learning and Convolutional Neural Networks (CNNs) have gained a lot of attention in the field of computer vision and outperformed other algorithms on many benchmarks. However, applying these techniques to robotics is non-trivial for two reasons. First, training large neural networks requires a lot of training data, and collecting it on robots is hard. Not only do research robots easily have network or hardware failures after many trials, but the time and resources needed to collect millions of data points are also significant. The trained neural network is also robot specific and cannot be used on a different type of robot directly, therefore limiting the incentive to train such a network. Second, CNNs are good for classification, but when we are talking about interacting with a dynamic environment there is no direct relationship. Knowing you are seeing a lightsaber gives no indication of how to interact with it. Of course you can hard code this information, but that would just be using Deep Learning in computer vision instead of robotics.

Despite these difficulties, a few groups did make it through and successfully applied Deep Learning and CNNs in robotics; I will talk about three of these interesting works.

  • Levine, Sergey, et al. “End-to-end training of deep visuomotor policies.” arXiv preprint arXiv:1504.00702 (2015). 
  • Finn, Chelsea, et al. “Deep Spatial Autoencoders for Visuomotor Learning.” arXiv preprint (2015).
  • Pinto, Lerrel, and Abhinav Gupta. “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours.” arXiv preprint arXiv:1509.06825 (2015).

Deep Learning in Robotics

Traditional policy search approaches in reinforcement learning usually use the output of a “computer vision system” and send commands to low-level controllers such as a PD controller. In the paper “End-to-end training of deep visuomotor policies”, Levine et al. try to learn a policy from low-level observations (image and joint angles) and output joint torques directly. The overall architecture is shown in the figure above. As you can tell, this is ambitious and cannot be easily achieved without a few tricks. The authors first initialize the first layer with weights pre-trained on ImageNet, then train the vision layers with object pose information through pose regression. This pose information is obtained by having the robot hold the object with its hand covered by a cloth similar to the background (see figure below).

robot collecting pose information

In addition, using the pose information of the object, a trajectory can be learned with an approach called guided policy search. This trajectory is then used to train the motor control layers, which take the visual layer output plus the joint configuration as input and output joint torques. The results are better shown than described; see the video below.

The second paper, “Deep Spatial Autoencoders for Visuomotor Learning”, is done by the same group at Berkeley. In this work, the authors try to learn a state space for reinforcement learning. Reinforcement learning requires a detailed representation of the state; in most work, however, such a state is manually designed. This work automates the state space construction from camera images, where a deep spatial autoencoder is used to acquire features that represent the positions of objects. The architecture is shown in the figure below.

Deep Autoencoder in Robotics

The deep spatial autoencoder maps full-resolution RGB images to a down-sampled, grayscale version of the input image. All information in the image is forced to pass through a bottleneck of spatial features, therefore forcing the network to learn important low-dimensional representations. The positions are then extracted from the bottleneck layer and combined with joint information to form the state representation. The result is tested on several tasks shown in the figure below.

Experiments on Deep Auto Encoder
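
One common way to implement such a spatial feature bottleneck is a spatial softmax followed by the expected pixel coordinate of each feature channel, sketched below; the details here differ from the paper’s exact architecture and are only meant to show how a feature map can be reduced to a few position-like numbers.

```python
# Sketch of a spatial-softmax bottleneck: each feature channel is turned into a
# probability map and reduced to its expected (x, y) coordinate, giving a small set
# of learned "feature points" that can encode object positions. Sizes are illustrative.
import torch
import torch.nn.functional as F

def spatial_soft_argmax(features):
    """features: (B, C, H, W) -> (B, C, 2) expected (x, y) in [-1, 1] per channel."""
    b, c, h, w = features.shape
    probs = F.softmax(features.view(b, c, h * w), dim=-1).view(b, c, h, w)
    ys = torch.linspace(-1.0, 1.0, h).view(1, 1, h, 1)
    xs = torch.linspace(-1.0, 1.0, w).view(1, 1, 1, w)
    expected_x = (probs * xs).sum(dim=(2, 3))   # (B, C)
    expected_y = (probs * ys).sum(dim=(2, 3))   # (B, C)
    return torch.stack([expected_x, expected_y], dim=-1)

feature_points = spatial_soft_argmax(torch.randn(1, 16, 60, 80))   # (1, 16, 2)
# state = torch.cat([feature_points.flatten(1), joint_angles], dim=1)  # hypothetical RL state
```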

As I mentioned earlier, gathering a large amount of training data in robotics is hard, but in the paper “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours” the authors try to show that it is possible. Although still not comparable to datasets in the vision community such as ImageNet, gathering 50 thousand tries in robotics is significant if not unprecedented. The data is gathered using the two-arm robot Baxter, which is (relatively) mass produced compared to most research robots.

Baxter Grasping

 

The authors then use the collected data to train a CNN initialized with weights trained on ImageNet. The final output is one of 18 different orientations of the gripper, assuming the robot always grasps from the top. The architecture is shown in the figure below.

Grasping with Deep Learning