Life is a game, take it seriously

RSS 2018 Highlights

In Machine Learning, Paper Talk, Robotics on July 10, 2018 at 3:18 pm

by Li Yang Ku (Gooly)

I was at RSS (Conference on Robotics Science and System) in Pittsburgh a few weeks ago. The conference was held in the Carnegie music hall and the conference badge can also be used to visit the two Carnegie museums next to it. (The Eskimo and native American exhibition on the third floor is a must see. Just in case you don’t know, an igloo can be built within 1.5 hours by just two Inuits and there is a video of it.)

RSS is a relatively small conference compared to IROS and ICRA. With only one single track, you get to see every accepted paper from many different fields ranging from robotic whiskers to surgical robots. I would however argue that the highlights of this year’s RSS are the Keynote talks by Bernardine Dias and Chad Jenkins. Unlike most keynote talks I’ve been to, these two talks were less about new technologies but about humanity and diversity. In this post, I am going to talk about both talks plus a few interesting papers in RSS.

a) Bernardine Dias, “Robotics technology for underserved communities: challenges, rewards, and lessons learned.”

Bernadine’s group focuses on changing technologies so that they can be accessible to communities that are left behind. One of the technologies developed was a tool for helping blind students learn braille and it had significant impact among blind communities across the globe. Bernadine gave an amazing talk at RSS. However, the video of her talk is not public yet (not sure if it will be) and surprisingly not many videos of her are on the internet. The closest content I can find is a really nice audio interview with Bernardine. There is also a short video describing their work below, but what this talk is really about is not the technology or design but the lessons learned through helping these underserved communities.

When roboticist talk about helping the society, many of them focus on the technology and left the actual application to the future. Bernadine’s group are different in that they actually travel to these underserved communities to understand what they need and integrate their feedbacks to the design process directly. This is easier said then done. You have to understand each community before your visit, some acts are considered good in one culture but an insult in another. Giving without understanding often results in waste. Bernardine mentioned in her talk that one of the schools in an underserved community they collaborated with received a large one-time donations for buying computers. It was a large event where important people came and was broadcasted on the news. However, to accommodate these hardwares, this two classroom school has to give up one of there classrooms and therefore reduce the number of classes they can teach. Ironically, the school does not have resources to power these computers nor people to teach students or teachers how to use them. The donation actually result in more harm then help to the community.

b) Odest Chadwicke (Chad) Jenkins, “Robotics: Making the World a Better Place through Minimal Message-oriented Transport Layers .”

While Bernardine tries to change technologies for underserved communities, Chad tries to design interfaces for helping people with disability by deploying robots to their home. Chad showed some of the work done by Charlie Kemp’s group and his lab with Henry Evans. Henry Evans was a successful financial officer at silicon valley until he had a stroke that caused him paralyzed and mute. However, Henry did not give up living fully and strived in advocating robots for people with disability. Henry’s story is inspiring and an example of how robots can help people with disability live freely. The robot for humanity project is the result of these successful collaborations. Since then, Henry gave three Ted talks through robots and the one below shows how Chad helped him fly a quadrotor.


However, the highlight of Chad’s talk was when he called out for more diversity in the community. Minorities, especially African Americans and Latinos, are way underrepresented in robotics communities in the U.S. The issue of diversity is usually not what roboticist or computer scientist would thought of or list as a priority. Based on Chad’s numbers, past robotics conferences including RSSs were not immune to these kind of negligence. This not hard to see, among the thousands of conference talks I’ve been to there were probably no more then three talks by African American speakers. Although there are no obvious solutions to solve this problem yet, having the community aware or agree that this is a problem is an important first step. Chad urges people to be aware of whether everyone is given equal opportunities and simply being friendly to isolated minorities in a conference may make a difference in the long run.

c) Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. “Differential Particle Filters.”

This work introduces a differentiable particle filter (DPF) that can be trained end to end. The DPF is composed of a action sampler that generates action samples, an observation encoder, a particle proposer that learns to generate new particles based on observations, and an observation likelihood estimator that weights each particle. These four components are feedforward networks that can be learned through training data. What I found interesting is that the authors made comments similar to the authors of the paper Deep Image Prior; deep learning approaches work not just because of learning but also because of the engineered structure such as convolutional layers that encode priors. This motivated the authors to look for architectures that can encode prior knowledge of algorithms into the neural network.

d) Marc Toussaint, Kelsey R. Allen, Kevin A. Smith, and Joshua B. Tenenbaum. “Differentiable Physics and Stable Modes for Tool-Use and Manipulation Planning.”

Task and Motion Planning (TAMP) approaches are about combining symbolic task planners and geometric motion planners hierarchically. Symbolic task planners can be helpful in solving tasks sequences based on high level logic, while geometric planners operate in detailed specifications of the world state. This work is an extension that further considers dynamic physical interactions. The whole robot action sequence is modeled as a sequence of modes connected by switches. Modes represent durations that have constant contact or can be modeled by kinematic abstractions. The task can therefore be written in the form of a Logic-Geometric Program where the whole sequence can be jointly optimized. The video above show that such approach can solve tasks that the authors call physical puzzles. This work also won the best paper at RSS.


Paper Picks: CVPR 2018

In Computer Vision, deep learning, Machine Learning, Neural Science, Paper Talk on July 2, 2018 at 9:08 pm

by Li Yang Ku (Gooly)

I was at CVPR in salt lake city. This year there were more then 6500 attendances and a record high number of accepted papers. People were definitely struggling to see them all. It was a little disappointing that there were no keynote speakers, but among the 9 major conferences I have been to, this one has the best dance party (see image below). You never know how many computer scientists can dance until you give them unlimited alcohol.

In this post I am going to talk about a few papers that were not the most popular ones but were what I personally found interesting. If you want to know the papers that the reviewers though were interesting instead, you can look into the best paper “Taskonomy: Disentangling Task Transfer Learning” and four other honorable mentions including the “SPLATNet: Sparse Lattice Networks for Point Cloud Processing” from collaborations between Nvidia and some people in the vision lab at UMass Amherst which I am in.

a) Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. “Learning and Using the Arrow of Time.”

I am quite fond of works that explore cues in the world that may be useful for unsupervised learning. Traditional deep learning approaches requires large amount of labeled training data but we humans seem to be able to learn from just interacting with the world in an unsupervised fashion. In this paper, the direction of time is used as a clue. The authors train a neural network to distinguish the direction of time and show that such network can be helpful in action recognition tasks.

b) Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. “Learning to Localize Sound Source in Visual Scenes.”

This is another example of using cues available in the world. In this work, the authors ask whether a machine can learn the correspondence between visual scene and sound, and localize the sound source only by observing sound and visual scene pairs like humans? This is done by using a triplet network that tries to minimize the difference between visual feature of a video frame and the sound feature generated in a similar time window, while maximizing the difference between the same visual feature and a random sound feature. As you can see in the figure above, the network is able to associate different sounds with different visual regions.

c) Edward Kim, Darryl Hannan, and Garrett Kenyon. “Deep Sparse Coding for Invariant Multimodal Halle Berry Neurons.”

This work is inspired by experiments done by Quiroga et al. that found a single neuron in one human subject’s brain that fires on both pictures of Halle Berry and texts of Halle Berry’s name. In this paper, the authors show that training a deep sparse coding network that takes a face image and a text image of the corresponding name results in learning a multimodal invariant neuron that fires on both Halle Berry’s face and name. When certain modality is missing, the missing image or text can be generated. In this network, each sparse coding layer is learned through the Locally Competitive Algorithm (LCA) that uses principles of thresholding and local competition between neurons. Top down feedback is also used in this work through propagating reconstruction error downwards. The authors show interesting results where adding information to one modality changes the belief of the other modality. The figure above shows that this Halle Berry neuron in the sparse coding network can distinguish between cat women acted by Halle Berry versus cat women acted by Anne Hathaway and Michele Pfeiffer.

d) Assaf Shocher, Nadav Cohen, and Michal Irani. “Zero-Shot Super-Resolution using Deep Internal Learning.”

Super resolution is a task that tries to increase the resolution of an image. The typical approach nowaday is to learn it through a neural network. However, the author showed that this approach only works well if the down sampling process from the high resolution to the low resolution image is similar in training and testing. In this work, no training is needed beforehand. Given a test image, training examples are generated from the test image by down sampling patches of this same image. The fundamental idea of this approach is the fact that natural images have strong internal data repetition. Therefore, from the same image you can infer high resolution structures of lower resolution patches by observing other parts of the image that have higher resolution and similar structure. The image above shows their results (top row) versus state of the art results (bottom row).

e) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.”

Most modern approaches for denoising, super resolution, or inpainting tasks use an image generation network that trains on a large dataset that consist of pairs of images before and after the affect. This work shows that these nice outcomes are not just the result of learning but also the effect of the convolutional structure. The authors take an image generation network, feed random noise as input, and then update the network using the error between the outcome and the test image, such as the left image shown above for inpainting. After many iterations, the network magically generates an image that fills the gap, such as the right image above. What this works says is that unlike common belief that deep learning approaches for image restoration learns image priors better then engineered priors, the deep structure itself is just a better engineered prior.

Deep Learning Approaches For Object Detection

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 25, 2018 at 3:16 pm

by Li Yang Ku

In this post I am going to talk about the progression of a few deep learning approaches for object detection. I will start from R-CNN and OverFeat (2013) then gradually move to more recent approaches such as the RetinaNet which won the best student paper in ICCV 2017. Object detection here refers to the task of identifying a limited set of object classes (20 ~ 200) in a given image by giving each identified object a bounding box and a label. This is one of the main stream challenges in Computer Vision which requires algorithms to output the locations of multiple object in addition to corresponding class. Some of the most well known datasets are the PASCAL visual object classes challenge (2005-2012) funded by the EU (20 classes ~10k images), the ImageNet object detection challenge (2013 ~ present) sponsored by Stanford, UNC, Google, and Facebook (200 classes ~500k images) , and the COCO dataset (2015 ~ current) first started by Microsoft (80 classes ~200K images). These datasets provide hand labeled bounding boxes and class labels of objects in images for training. Challenges for these datasets happen yearly; teams from all over the world submit their code to compete on an undisclosed test set.

In December 2012, the success of Alexnet on the ImageNet classification challenge was published. While many computer vision scientist around the world were still scratching their head trying to understand this result, several groups quickly harvested techniques implemented in Alexnet and tested it out. Based on the success of Alexnet, in November 2013 the vision group in Berkeley published (on arxiv) an approach for solving the object detection problem. This proposed R-CNN is a simple extension that extends the Alexnet that was designed to solve the classification problem to handle the detection problem. R-CNN is composed of 3 parts, 1) region proposal: where selective search is used to generate around 2000 possible object location bounding boxes, 2) feature extraction: Alexnet is used to generate features, 3) classification: a SVM (support vector machine) is trained for each object class. This hybrid approach successfully outperformed previous algorithms on the PASCAL dataset by a significant margin.

R-CNN architecture

Around the same time (December 2013), the NYU team (Yann LeCun, Rob Fergus) published an approach called OverFeat. OverFeat is based on the idea that convolutions can be done efficiently on dense image locations in a sliding window fashion. The fully connected layers in the Alexnet can be seen as 1×1 convolution layers. Therefore, instead of generating a classification confidence for a cropped fix size image, OverFeat generates a map of confidence on the whole image. To predict the bounding box a regressor network is added after the convolution layers. OverFeat was at the 4th place during the 2013 ImageNet object detection challenge but claimed to have better then 1st place result with longer training time which wasn’t ready in time for the competition.

Since then, a lot of researches expanded based on concepts introduced in these work. The SPP-net is an approach that speeds up the R-CNN approach up to 100x by performing the convolution operations just once on the whole image. (note that OverFeat does convolution on images of different scale) The SPP-net adds a spatial pyramid pooling layer before the fully connected layers. This spatial pyramid pooling layer transforms an arbitrary size feature map into a fixed size input by pooling from areas separated by grids of different scale. However, similar to R-CNN, SPP-net requires multistep training on feature extraction and the SVM classification. Fast R-CNN came across to address this problem. Similar to R-CNN, Fast R-CNN uses selective search to generate a set of possible region proposals and by adapting the idea of SPP-net, feature map is generated once on the whole image and a ROI pooling layers extracts a fixed size features for each region proposal. A multi task loss is also used so that the whole network can be trained together in one stage. The Fast R-CNN can speed up R-CNN up to 200x and produce better accuracy.

Fast R-CNN architecture

At this point, the region proposal process have become the computation bottleneck for Fast R-CNN. As a result, the “Faster” R-CNN addresses this issue by introducing the region proposal network that generates region proposals based on the same feature map used for classification. This requires a four stage training that alternates between these two networks but achieves a 5 frames per second speed.

Image pyramid where images of multiple scales are created for feature extraction was a common approach used in features such as SIFT features to handle scale invariant. So far, most R-CNN based approaches does not use image pyramids due to the computation and memory cost during training. The feature pyramid network shows that since deep convolution neural networks are by natural multi-scale, a similar effect can be achieved with little extra cost. This is done by combining top-down information with lateral information for each convolution layer as shown in the figure below. By restricting the feature maps to have the same dimension, the same classification network can be used for all scales; this has a similar flavor to traditional approaches that use the same detector for images of different scales in the image pyramid.

Till 2017, most of the high accuracy approaches on object detection are extensions of R-CNN that have a region proposal module separate from classification. Single stage approaches although faster, were not able to out perform in accuracy. The paper “Focal Loss for Dense Object Detection” published in ICCV 2017 discovers the problem with single stage approaches and proposed an elegant solution that results in faster and more accurate models. The lower accuracy among single stage approaches was a consequence of imbalance between foreground and background training examples. By replacing the cross entropy loss with the focal loss that down weights examples the network already has high confidence, the network improves substantially on accuracy. The figure below shows the difference between the cross entropy loss (CE) and the focal loss (FL). A larger gamma parameter puts less weight on high confidence examples.

The references of approaches I mentioned is listed below. Note that I only talked about a small part of a large body of work on object detection and the current progress on object detection have been moving in a rapid speed. If you look at the current leader board for the COCO dataset, the numbers have already surpassed the best approach I have mentioned by a substantial margin.

  • Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for accurate object detection and semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
  • Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).
  • He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” In european conference on computer vision, pp. 346-361. Springer, Cham, 2014.
  • Girshick, Ross. “Fast r-cnn.” arXiv preprint arXiv:1504.08083(2015).
  • Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object detection with region proposal networks.” In Advances in neural information processing systems, pp. 91-99. 2015.
  • Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. “Feature pyramid networks for object detection.” In CVPR, vol. 1, no. 2, p. 4. 2017.
  • Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. “Focal loss for dense object detection.” arXiv preprint arXiv:1708.02002 (2017).