Life is a game, take it seriously

Deep Learning Approaches For Object Detection

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 25, 2018 at 3:16 pm

by Li Yang Ku

In this post I am going to talk about the progression of a few deep learning approaches for object detection. I will start from R-CNN and OverFeat (2013) and gradually move to more recent approaches such as RetinaNet, which won the best student paper award at ICCV 2017. Object detection here refers to the task of identifying a limited set of object classes (20 ~ 200) in a given image by giving each identified object a bounding box and a label. This is one of the mainstream challenges in Computer Vision, which requires algorithms to output the locations of multiple objects in addition to their corresponding classes. Some of the most well-known datasets are the PASCAL visual object classes challenge (2005-2012) funded by the EU (20 classes, ~10k images), the ImageNet object detection challenge (2013 ~ present) sponsored by Stanford, UNC, Google, and Facebook (200 classes, ~500k images), and the COCO dataset (2015 ~ current) first started by Microsoft (80 classes, ~200k images). These datasets provide hand-labeled bounding boxes and class labels of objects in images for training. Challenges for these datasets happen yearly; teams from all over the world submit their code to compete on an undisclosed test set.

In December 2012, the success of Alexnet on the ImageNet classification challenge was published. While many computer vision scientists around the world were still scratching their heads trying to understand this result, several groups quickly adopted the techniques implemented in Alexnet and tested them out. Building on the success of Alexnet, in November 2013 the vision group at Berkeley published (on arXiv) an approach for solving the object detection problem. The proposed R-CNN is a simple extension of Alexnet, which was designed to solve the classification problem, to the detection problem. R-CNN is composed of 3 parts: 1) region proposal, where selective search is used to generate around 2000 possible object location bounding boxes; 2) feature extraction, where Alexnet is used to generate features; 3) classification, where an SVM (support vector machine) is trained for each object class. This hybrid approach outperformed previous algorithms on the PASCAL dataset by a significant margin.
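The three-stage pipeline can be sketched as follows. The function bodies are hypothetical toy stand-ins (fixed boxes instead of selective search, hand-made features instead of Alexnet features, fixed weights instead of trained SVMs); they only illustrate how the stages connect, not the actual method.

```python
# A minimal sketch of the three-stage R-CNN pipeline. All function bodies
# below are toy placeholders, not the real components.

def propose_regions(image):
    # Stage 1: selective search would return ~2000 candidate boxes;
    # here we just return a few fixed (x, y, w, h) boxes.
    return [(0, 0, 32, 32), (16, 16, 64, 64), (8, 8, 48, 48)]

def extract_features(image, box):
    # Stage 2: Alexnet would produce a feature vector for the cropped,
    # warped region; here we fake a 2-D feature from the box geometry.
    x, y, w, h = box
    return [w * h, w / h]

def svm_scores(features):
    # Stage 3: one linear SVM per class scores the feature vector;
    # here we use fixed toy weights for two made-up classes.
    weights = {"cat": [0.001, 0.5], "dog": [0.002, -0.5]}
    return {cls: sum(w * f for w, f in zip(ws, features))
            for cls, ws in weights.items()}

def detect(image):
    detections = []
    for box in propose_regions(image):
        feats = extract_features(image, box)
        scores = svm_scores(feats)
        best = max(scores, key=scores.get)
        detections.append((box, best, scores[best]))
    return detections

dets = detect(image=None)  # toy placeholder instead of a real image
```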

R-CNN architecture

Around the same time (December 2013), the NYU team (Yann LeCun, Rob Fergus) published an approach called OverFeat. OverFeat is based on the idea that convolutions can be done efficiently on dense image locations in a sliding window fashion. The fully connected layers in Alexnet can be seen as 1×1 convolution layers. Therefore, instead of generating a classification confidence for a cropped fixed-size image, OverFeat generates a map of confidence over the whole image. To predict the bounding box, a regressor network is added after the convolution layers. OverFeat placed 4th in the 2013 ImageNet object detection challenge but claimed to have a better result than the 1st-place entry with longer training time, which wasn’t ready in time for the competition.
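The observation that a fully connected layer is a 1×1 convolution can be checked directly with numpy: applying the same weights at every spatial location of a feature map yields a confidence map instead of a single score. Shapes and weights below are arbitrary toy values, not OverFeat’s.

```python
import numpy as np

# A fully connected layer applied to a C-dim feature at every spatial
# location is the same as a 1x1 convolution, so the classifier produces
# a confidence map over the whole image instead of a single score.

rng = np.random.default_rng(0)
C, K, H, W = 8, 3, 5, 7          # input channels, classes, map height/width
feature_map = rng.standard_normal((C, H, W))
fc_weights = rng.standard_normal((K, C))   # the "fully connected" weights

# Treat the FC layer as a 1x1 convolution: apply it at every (h, w) location.
conf_map = np.einsum('kc,chw->khw', fc_weights, feature_map)

# Equivalent "sliding window" computation, one location at a time.
sliding = np.empty((K, H, W))
for h in range(H):
    for w in range(W):
        sliding[:, h, w] = fc_weights @ feature_map[:, h, w]

assert np.allclose(conf_map, sliding)
```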

Since then, a lot of research has built on concepts introduced in these works. SPP-net is an approach that speeds up R-CNN by up to 100x by performing the convolution operations just once on the whole image (note that OverFeat does convolution on images of different scales). SPP-net adds a spatial pyramid pooling layer before the fully connected layers. This spatial pyramid pooling layer transforms an arbitrary-size feature map into a fixed-size input by pooling from areas separated by grids of different scales. However, similar to R-CNN, SPP-net requires multistep training for feature extraction and the SVM classification. Fast R-CNN was introduced to address this problem. Similar to R-CNN, Fast R-CNN uses selective search to generate a set of possible region proposals, and by adopting the idea of SPP-net, the feature map is generated once on the whole image and an ROI pooling layer extracts fixed-size features for each region proposal. A multi-task loss is also used so that the whole network can be trained together in one stage. Fast R-CNN speeds up R-CNN by up to 200x and produces better accuracy.
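A minimal sketch of the spatial pyramid pooling idea, assuming max pooling over 1×1, 2×2, and 4×4 grids (the bin rounding here is simplified compared to the actual SPP-net layer):

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over 1x1, 2x2, and 4x4 grids to
    get a fixed-length vector regardless of H and W. A simplified sketch
    of the SPP layer (real implementations round the bins differently)."""
    C, H, W = feature_map.shape
    pooled = []
    for n in levels:
        # Split rows/cols into n roughly equal bins and max-pool each cell.
        row_edges = np.linspace(0, H, n + 1).astype(int)
        col_edges = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, row_edges[i]:row_edges[i + 1],
                                      col_edges[j]:col_edges[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)  # length C * (1 + 4 + 16)

vec_a = spp(np.random.rand(8, 13, 17))   # arbitrary input sizes...
vec_b = spp(np.random.rand(8, 30, 21))
assert vec_a.shape == vec_b.shape == (8 * 21,)  # ...same output length
```

This fixed output length is what lets a feature map of any size feed into the fully connected layers that follow.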

Fast R-CNN architecture

At this point, the region proposal process has become the computation bottleneck for Fast R-CNN. As a result, the “Faster” R-CNN addresses this issue by introducing a region proposal network that generates region proposals based on the same feature map used for classification. This requires a four-stage training that alternates between these two networks, but achieves a speed of 5 frames per second.
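The region proposal network scores a fixed set of anchor boxes laid over the shared feature map. A sketch of such an anchor grid is below, with toy stride, scale, and aspect-ratio choices rather than the paper’s exact settings:

```python
import numpy as np

# At every feature-map location, anchor boxes of several scales and aspect
# ratios are centered; a small network then scores each one as object vs.
# background and regresses box offsets. The stride/scales/ratios here are
# illustrative toy values.

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this feature-map cell in image coordinates.
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    # Width/height with aspect ratio r, area s*s preserved.
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    # (feat_h * feat_w * len(scales) * len(ratios), 4) boxes as (x1,y1,x2,y2)
    return np.array(anchors)

anchors = make_anchors(4, 5)
```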

The image pyramid, in which images at multiple scales are created for feature extraction, was a common approach for handling scale invariance in features such as SIFT. So far, most R-CNN based approaches do not use image pyramids due to the computation and memory cost during training. The feature pyramid network shows that since deep convolutional neural networks are by nature multi-scale, a similar effect can be achieved with little extra cost. This is done by combining top-down information with lateral information for each convolution layer as shown in the figure below. By restricting the feature maps to have the same dimension, the same classification network can be used for all scales; this has a similar flavor to traditional approaches that use the same detector on images of different scales in the image pyramid.
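The top-down pathway can be sketched in a few lines of numpy, assuming nearest-neighbor upsampling and omitting the lateral 1×1 convolutions and 3×3 smoothing convolutions of the actual network:

```python
import numpy as np

# Sketch of the feature pyramid merge step: each backbone level's feature
# map is added to an upsampled version of the coarser level above it, so
# every pyramid level has the same channel count and can share a detector.

def upsample2x(x):
    # (C, H, W) -> (C, 2H, 2W) by repeating each pixel (nearest neighbor).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def build_pyramid(backbone_maps):
    """backbone_maps: fine-to-coarse list of (C, H, W) arrays with the
    same C, with H and W halving at each level. Returns the merged pyramid
    in the same fine-to-coarse order."""
    merged = [backbone_maps[-1]]            # start from the coarsest map
    for lateral in reversed(backbone_maps[:-1]):
        merged.append(lateral + upsample2x(merged[-1]))  # top-down + lateral
    return list(reversed(merged))

C = 8
maps = [np.random.rand(C, 32, 32), np.random.rand(C, 16, 16),
        np.random.rand(C, 8, 8)]
pyramid = build_pyramid(maps)
assert [p.shape for p in pyramid] == [(8, 32, 32), (8, 16, 16), (8, 8, 8)]
```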

Until 2017, most of the high-accuracy approaches to object detection were extensions of R-CNN that have a region proposal module separate from classification. Single-stage approaches, although faster, were not able to compete in accuracy. The paper “Focal Loss for Dense Object Detection” published in ICCV 2017 identifies the problem with single-stage approaches and proposes an elegant solution that results in faster and more accurate models. The lower accuracy of single-stage approaches was a consequence of the imbalance between foreground and background training examples. By replacing the cross entropy loss with the focal loss, which down-weights examples on which the network already has high confidence, the network improves substantially in accuracy. The figure below shows the difference between the cross entropy loss (CE) and the focal loss (FL). A larger gamma parameter puts less weight on high-confidence examples.
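The focal loss FL(p_t) = -(1 - p_t)^γ log(p_t) reduces to the cross entropy when γ = 0; a small numpy sketch (omitting the paper’s α-balancing term) shows how it down-weights high-confidence examples:

```python
import numpy as np

# FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted
# probability of the true class. With gamma = 0 this is the usual cross
# entropy; larger gamma shrinks the loss of already-confident examples.

def focal_loss(p, y, gamma=2.0):
    """p: predicted foreground probabilities, y: labels in {0, 1}."""
    p_t = np.where(y == 1, p, 1 - p)        # probability of the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.9, 0.9, 0.6])
y = np.array([1, 0, 1])

ce = focal_loss(p, y, gamma=0.0)            # plain cross entropy
fl = focal_loss(p, y, gamma=2.0)

# A confident, correct prediction (p_t = 0.9) keeps only (1 - 0.9)^2 = 1%
# of its cross entropy loss, while a hard example (p_t = 0.6) keeps 16%,
# so easy background examples no longer dominate the gradient.
assert fl[0] / ce[0] < fl[2] / ce[2]
```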

The references for the approaches I mentioned are listed below. Note that I only talked about a small part of a large body of work on object detection, and the current progress on object detection has been moving at a rapid pace. If you look at the current leaderboard for the COCO dataset, the numbers have already surpassed the best approach I have mentioned by a substantial margin.

  • Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for accurate object detection and semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
  • Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).
  • He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” In European conference on computer vision, pp. 346-361. Springer, Cham, 2014.
  • Girshick, Ross. “Fast r-cnn.” arXiv preprint arXiv:1504.08083 (2015).
  • Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object detection with region proposal networks.” In Advances in neural information processing systems, pp. 91-99. 2015.
  • Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. “Feature pyramid networks for object detection.” In CVPR, vol. 1, no. 2, p. 4. 2017.
  • Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. “Focal loss for dense object detection.” arXiv preprint arXiv:1708.02002 (2017).



Talk Picks: IROS 2017

In deep learning, Machine Learning, Robotics on February 10, 2018 at 1:06 pm

by Li Yang Ku (Gooly)

I was at IROS (International Conference on Intelligent Robots and Systems) in Vancouver recently (September 2017, this post took way too long to finish) to present a piece of work I did almost two years ago. Interestingly, there were four deep learning related sessions this year, and there were quite a few papers that I found interesting; however, the talks at IROS were what I found the most inspiring. I am going to talk about three of them in the following.

a) “Toward Unifying Model-Based and Learning-Based Robotics”, plenary talk by Dieter Fox.  

In my previous post, I talked about how the machine learning field differs from the robotics field: machine learning learns from data, while robotics designs models that describe the environment. In this talk, Dieter tries to glue both worlds together. The 50-minute talk is posted below. For those who don’t have 50 minutes, I describe the talk briefly in the following.

Dieter first described a list of work his lab did (robot localization, RGB-D matching, real-time tracking, etc.) using model-based approaches. Model-based approaches match models to data streams and control the robot by finding actions that reach the desired state. One of the benefits of such an approach is that our own knowledge of how the physical world works can be injected into the model. Dieter then gave a brief introduction to deep learning and to one of his students’ work on learning visual descriptors in a self-supervised way, which I covered in a previous post. Based on the recent success in deep learning, Dieter suggested that there are ways to incorporate model-based approaches into a deep learning framework, and showed an example of how we can add knowledge of rigid body motion into a network by forcing it to output segmentations and their poses. The overall conclusion is that 1) model-based approaches are accurate within a local basin of attraction in which the models match the environment, 2) deep learning provides a larger basin of attraction in the trained regime, and 3) unifying both approaches gives you more powerful systems.


b) “Robotics as the Path to Intelligence”, keynote talk by Oliver Brock

Oliver Brock gave an exciting interactive talk on understanding intelligence in one of the IROS keynote sessions. Unfortunately it was not recorded and the slides cannot be distributed, so I posted the most similar talk he gave below instead. It is also a pretty good talk with some overlapping content, but under a different topic.

In the IROS talk, Oliver made a few points. First, he started out with AlphaGo by DeepMind, stating that its success in the game of Go is very similar to IBM’s Deep Blue beating the chess champion in 1997. In both cases, despite the system’s superior game play, it needs a human to play for it. A lot of things that humans are good at are usually difficult for our current approaches to artificial intelligence. How we define intelligence is crucial because it will shape our research direction and how we solve problems. Oliver then showed, through an interactive experiment with the audience, that defining intelligence is non-trivial and has to do with what we perceive. He then talked about his work on integrating cross-modal perception and action, the importance of manipulation for intelligence, and soft hands that can solve hard manipulation problems.


c) “The Power of Procrastination”, special event talk by Jorge Cham

This is probably the most popular talk of all the IROS talks. The speaker, Jorge Cham, is the author of the popular PHD Comics (which I may have posted on my blog without permission) and has a PhD in robotics from Stanford University. The following is not the exact same talk he gave at IROS, but it is very similar.


Machine Learning, Computer Vision, and Robotics

In Computer Vision, Machine Learning, Robotics on December 6, 2017 at 2:32 pm

By Li Yang Ku (Gooly)

Having TA’d for Machine Learning this semester and worked in the field of Computer Vision and Robotics for the past few years, I always have this feeling that the more I learn the less I know. Therefore, it’s sometimes good to just sit back and look at the big picture. This post will talk about how I see the relations between these three fields at a high level.

First of all, Machine Learning is more a brand than a name. Just like Deep Learning and AI, this name is used for getting funding when the previous name has gone out of hype. In this case, the name became popular after AI projects failed in the 70s. Therefore, Machine Learning covers a wide range of problems and approaches that may look quite different at first glance. AdaBoost and support vector machines were the hot topics in Machine Learning when I was doing my master’s degree, but now it is deep neural networks that get all the attention.

Despite the wide variety of research in Machine Learning, it usually carries the common assumption of the existence of a set of data. The goal is then to learn a model based on this set of data. There is a wide range of variations here: the data could be labeled or not labeled, resulting in supervised or unsupervised approaches; the data could be labeled with a category or a real number, resulting in classification or regression problems; the model can be limited to a certain form, such as a class of probability models, or can have fewer constraints, as in the case of deep neural networks. Once the model is learned, there is also a wide range of possible usages. It can be used for predicting outputs given new inputs, filling in missing data, generating new samples, or providing insights on hidden relationships between data entries. Data is so fundamental in Machine Learning that people in the field don’t really ask the question of why we learn from data. Many datasets from different fields are collected or labeled, and the learned models are compared based on accuracy, computation speed, generalizability, etc. Therefore, Machine Learning people often consider Computer Vision and Robotics as areas for applying Machine Learning techniques.

Robotics, on the other hand, comes from a very different background. There is usually no data to start with in robotics. If you cannot control your robot, or if your robot crashes itself at its first move, how are you going to collect any data? Therefore, classical robotics is about designing models based on physics and geometry. You build models that describe how the input and current observation of the robot change the robot state. Based on this model, you can infer the input that will safely control the robot to reach a certain state.

Once you can command your robot to reach a certain state, a wide variety of problems emerge. The robot then has to do obstacle avoidance and path planning to reach a certain goal. You may need to find a goal state that satisfies a set of restrictions while optimizing a set of properties. Simultaneous localization and mapping (SLAM) may be needed if no maps are given. In addition, sensor fusion is required when multiple sensors with different properties are used. There may also be uncertainties in robot states, where belief space planning may be helpful. For robots with a gripper, you may also need to be able to identify stable grasps and recognize the type and pose of an object for manipulation. And of course, there is a whole different set of problems in designing the mechanics and hardware of the robot. Unlike Machine Learning, many of these robotics problems (excluding mechanical and hardware problems) are solved without a set of data. However, most of them share a common goal of determining the robot input based on feedback. (Some) roboticists view robotics as the field that has the ultimate goal of creating machines that act like humans, and see Machine Learning and Computer Vision as fields that can provide methods to help accomplish such a goal.

The field of Computer Vision started under AI in the 60s with the goal of helping robots achieve intelligent behaviors, but left that goal behind after the internet era, when tons of images on the internet were waiting to be classified. In this age, computer vision applications are no longer restricted to physical robots. In the past decade, the field of Computer Vision has been driven by datasets. The implicit agreement on evaluation based on standardized datasets helped the field advance at a reasonably fast pace (at the cost of millions of grad student hours spent tweaking models to get a 1% improvement.) Given these datasets, the field of Computer Vision inevitably left the Robotics community and embraced the data-driven Machine Learning approaches. Most Computer Vision problems have a common goal of learning models for visual data. The model is then used to do classification, clustering, sample generation, etc. on images or videos. The big picture of Computer Vision can be seen in my previous post. Some Computer Vision scientists consider vision different from other senses and believe that the development of vision is fundamental to the evolution of intelligence (which could be true… experiments do show 50% of our brain neurons are vision related.) Nowadays, Computer Vision and Machine Learning are deeply entangled; Machine Learning techniques help foster Computer Vision solutions, while successful models in Computer Vision contribute back to the field of Machine Learning. For example, the success story of Deep Learning started with Machine Learning models being applied to the ImageNet challenge, and ended up with a wide range of architectures that can be applied to other problems in Machine Learning. On the other hand, Robotics is a field that Computer Vision folks are gradually moving back to. Based on the recent success of data-driven approaches in Computer Vision, several well-known Computer Vision scientists, such as Jitendra Malik, have started to consider how Computer Vision can help the field of Robotics, since their conversations with Robotics colleagues were mostly about vision not working.