Life is a game, take it seriously

Deep Learning Approaches For Object Detection

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 25, 2018 at 3:16 pm

by Li Yang Ku

In this post I am going to talk about the progression of a few deep learning approaches for object detection. I will start from R-CNN and OverFeat (2013), then gradually move to more recent approaches such as RetinaNet, whose paper won the best student paper award at ICCV 2017. Object detection here refers to the task of identifying a limited set of object classes (20 ~ 200) in a given image by giving each identified object a bounding box and a label. This is one of the mainstream challenges in Computer Vision, and it requires algorithms to output the locations of multiple objects in addition to their corresponding classes. Some of the most well known datasets are the PASCAL visual object classes challenge (2005-2012) funded by the EU (20 classes, ~10k images), the ImageNet object detection challenge (2013 ~ present) sponsored by Stanford, UNC, Google, and Facebook (200 classes, ~500k images), and the COCO dataset (2015 ~ present) first started by Microsoft (80 classes, ~200k images). These datasets provide hand labeled bounding boxes and class labels of objects in images for training. Challenges for these datasets happen yearly; teams from all over the world submit their code to compete on an undisclosed test set.

In December 2012, the success of AlexNet on the ImageNet classification challenge was published. While many computer vision scientists around the world were still scratching their heads trying to understand this result, several groups quickly harvested the techniques implemented in AlexNet and tested them out. Based on the success of AlexNet, in November 2013 the vision group at Berkeley published (on arXiv) an approach for solving the object detection problem. The proposed R-CNN is a simple extension of AlexNet, which was designed for classification, to the detection problem. R-CNN is composed of 3 parts: 1) region proposal, where selective search is used to generate around 2000 candidate bounding boxes for possible object locations; 2) feature extraction, where AlexNet is used to generate features for each proposed region; 3) classification, where an SVM (support vector machine) is trained for each object class. This hybrid approach outperformed previous algorithms on the PASCAL dataset by a significant margin.

R-CNN architecture

Around the same time (December 2013), the NYU team (Yann LeCun, Rob Fergus) published an approach called OverFeat. OverFeat is based on the idea that convolutions can be computed efficiently at dense image locations in a sliding window fashion. The fully connected layers in AlexNet can be viewed as convolution layers (1×1 convolutions for all but the first one). Therefore, instead of generating a classification confidence for a cropped fixed-size image, OverFeat generates a map of confidences over the whole image. To predict the bounding box, a regressor network is added after the convolution layers. OverFeat placed 4th in the 2013 ImageNet object detection challenge, but the authors claimed a better than 1st place result from a longer training run that wasn’t ready in time for the competition.
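
To make the "fully connected layer as convolution" idea concrete, here is a minimal sketch in plain Python. It assumes a single-channel feature map and a classifier with one output unit, and the names `fc_score` and `dense_scores` are my own for illustration, not anything from the OverFeat code. Sliding the same weights over a larger feature map gives one score per window position, which is the confidence map described above.

```python
def fc_score(patch, weights, bias):
    """Fully connected layer with one output unit, applied to one k x k patch."""
    return bias + sum(
        patch[r][c] * weights[r][c]
        for r in range(len(weights))
        for c in range(len(weights[0]))
    )


def dense_scores(feature_map, weights, bias):
    """Reuse the same weights as a convolution kernel over the whole map.

    Each output cell equals fc_score() on the crop that starts at that cell.
    """
    k_h, k_w = len(weights), len(weights[0])
    out_h = len(feature_map) - k_h + 1
    out_w = len(feature_map[0]) - k_w + 1
    return [
        [
            fc_score(
                [row[x:x + k_w] for row in feature_map[y:y + k_h]],
                weights,
                bias,
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Hypothetical toy values: a 3 x 4 feature map and a classifier trained on 2 x 2 inputs.
feature_map = [
    [1, 0, 2, 1],
    [0, 1, 3, 0],
    [2, 1, 0, 1],
]
weights = [[1, -1], [0, 2]]
bias = 0.5

score_map = dense_scores(feature_map, weights, bias)
print(score_map)  # a 2 x 3 map of confidences, one per window position
```

A real implementation would use a convolution routine from a deep learning library, so that the computation shared between overlapping windows is done only once.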

Since then, a lot of research has built on the concepts introduced in these works. SPP-net is an approach that speeds up R-CNN by up to 100x by performing the convolution operations just once on the whole image (note that OverFeat does convolution on images of different scales). SPP-net adds a spatial pyramid pooling layer before the fully connected layers. This layer transforms an arbitrary size feature map into a fixed size input by pooling over areas separated by grids of different scales. However, similar to R-CNN, SPP-net requires multistep training for feature extraction and SVM classification. Fast R-CNN came along to address this problem. Similar to R-CNN, Fast R-CNN uses selective search to generate a set of region proposals, and by adapting the idea of SPP-net, the feature map is generated once on the whole image and an ROI pooling layer extracts a fixed size feature for each region proposal. A multi-task loss is also used so that the whole network can be trained together in one stage. Fast R-CNN can speed up R-CNN by up to 200x and produces better accuracy.
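
The pooling step is easier to see in code. Below is a rough sketch of spatial pyramid pooling on a single-channel feature map, written in plain Python. The function name, the choice of max pooling, the 1x1 and 2x2 grid levels, and the way bin boundaries are rounded are all my assumptions for illustration. The point is only that the output length depends on the grid levels and not on the input size.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2D feature map over n x n grids and concatenate the results.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for gy in range(n):
            # Bin boundaries: floor for the start, ceil for the end,
            # so neighbouring bins may overlap by one row or column.
            y0 = (gy * height) // n
            y1 = math.ceil((gy + 1) * height / n)
            for gx in range(n):
                x0 = (gx * width) // n
                x1 = math.ceil((gx + 1) * width / n)
                pooled.append(
                    max(
                        feature_map[y][x]
                        for y in range(y0, y1)
                        for x in range(x0, x1)
                    )
                )
    return pooled


# Two hypothetical feature maps of different sizes.
small_map = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
large_map = [[r * 6 + c for c in range(6)] for r in range(5)]

small_vec = spatial_pyramid_pool(small_map)
large_vec = spatial_pyramid_pool(large_map)
print(len(small_vec), len(large_vec))  # both have 1 + 4 = 5 entries
```

ROI pooling in Fast R-CNN can be thought of as the same operation with a single grid level, applied to the part of the feature map that falls inside each region proposal.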

Fast R-CNN architecture

At this point, the region proposal process has become the computation bottleneck for Fast R-CNN. “Faster” R-CNN addresses this issue by introducing the region proposal network, which generates region proposals based on the same feature map used for classification. This requires a four stage training that alternates between these two networks, but achieves a speed of 5 frames per second.

Image pyramids, where images of multiple scales are created for feature extraction, were a common approach used with features such as SIFT to handle scale invariance. So far, most R-CNN based approaches do not use image pyramids due to the computation and memory cost during training. The feature pyramid network shows that since deep convolutional neural networks are by nature multi-scale, a similar effect can be achieved with little extra cost. This is done by combining top-down information with lateral information for each convolution layer as shown in the figure below. By restricting the feature maps at all levels to have the same channel dimension, the same classification network can be used for all scales; this has a similar flavor to traditional approaches that use the same detector for images of different scales in the image pyramid.
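
The top-down and lateral merge can be sketched in a few lines of plain Python. This is a simplified illustration and not the paper's implementation. It assumes single-channel maps where each level has exactly half the resolution of the previous one, uses nearest neighbor upsampling, and leaves out the 1×1 convolution on the lateral path and the 3×3 convolution applied after merging. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest neighbor upsampling: repeat every value twice along both axes."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two maps of the same size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_pyramid(bottom_up_maps):
    """Merge bottom-up maps (ordered fine to coarse) into pyramid levels.

    The coarsest map is used as is; every finer level is the sum of its
    lateral (bottom-up) map and the upsampled level above it.
    """
    merged = [bottom_up_maps[-1]]
    for lateral in reversed(bottom_up_maps[:-1]):
        merged.append(add_maps(upsample2x(merged[-1]), lateral))
    return merged[::-1]  # back to fine-to-coarse order


# Hypothetical bottom-up feature maps at 4 x 4, 2 x 2, and 1 x 1 resolution.
c2 = [[1] * 4 for _ in range(4)]
c3 = [[1, 2], [3, 4]]
c4 = [[10]]

p2, p3, p4 = top_down_pyramid([c2, c3, c4])
print(p3)  # the coarse value 10 is added to every cell of c3
```

Every output level now carries information from all coarser levels, which is what lets one shared detector head run on each of them.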

Until 2017, most of the high accuracy approaches to object detection were extensions of R-CNN that have a region proposal module separate from classification. Single stage approaches, although faster, were not able to match them in accuracy. The paper “Focal Loss for Dense Object Detection” published at ICCV 2017 identified the problem with single stage approaches and proposed an elegant solution that results in faster and more accurate models. The lower accuracy of single stage approaches was a consequence of the imbalance between foreground and background training examples. By replacing the cross entropy loss with the focal loss, which down-weights examples that the network already classifies with high confidence, the network improves substantially in accuracy. The figure below shows the difference between the cross entropy loss (CE) and the focal loss (FL). A larger gamma parameter puts less weight on high confidence examples.
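
For a single example, the focal loss in the paper is FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the probability the network assigns to the true class. With gamma = 0 it reduces to the cross entropy loss. The snippet below is a minimal sketch of that formula in plain Python. It leaves out the optional alpha class-balancing weight from the paper, and the function names are mine.

```python
import math


def cross_entropy(p_true):
    """Cross entropy loss for one example, given the true-class probability."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross entropy scaled by (1 - p_true) ** gamma.

    gamma=2.0 is the default value reported in the paper; gamma=0 gives
    back plain cross entropy.
    """
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Compare an easy example (p = 0.9) with a hard one (p = 0.1).
for p in (0.9, 0.1):
    print(p, cross_entropy(p), focal_loss(p))
```

With gamma = 2, the easy example's loss is scaled by (1 - 0.9)^2 = 0.01 while the hard example's loss is scaled by (1 - 0.1)^2 = 0.81, so the large number of easy background examples no longer dominates the total loss.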

The references for the approaches I mentioned are listed below. Note that I only talked about a small part of a large body of work on object detection, and progress on object detection has been moving at a rapid pace. If you look at the current leaderboard for the COCO dataset, the numbers have already surpassed the best approach I have mentioned by a substantial margin.

  • Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for accurate object detection and semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
  • Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. “OverFeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).
  • He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” In European Conference on Computer Vision, pp. 346-361. Springer, Cham, 2014.
  • Girshick, Ross. “Fast R-CNN.” arXiv preprint arXiv:1504.08083 (2015).
  • Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. “Faster R-CNN: Towards real-time object detection with region proposal networks.” In Advances in Neural Information Processing Systems, pp. 91-99. 2015.
  • Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. “Feature pyramid networks for object detection.” In CVPR, vol. 1, no. 2, p. 4. 2017.
  • Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. “Focal loss for dense object detection.” arXiv preprint arXiv:1708.02002 (2017).