CVPR is virtual this year for obvious reasons, and if you did not pay the $325 registration fee to attend this ‘prerecorded’ live event, you can now have a similar experience by watching all the recorded videos on their YouTube channel for free. Of course it’s not exactly the same, since you are losing out on the virtual chat room networking experience, but honestly speaking, computer vision parties are often awkward in person already and I can’t imagine you missing much. Before we go through my paper picks, let’s look at the trend first. The graph below shows the accepted paper counts by topic this year.
And the following are the stats for CVPR 2019:
These numbers cannot be directly compared since the categories are not exactly the same. For example, deep learning, which had the most submissions in 2019, is no longer a category (it isn’t going to be a very useful category when every paper is about deep learning). The distributions of these two graphs look quite similar. However, if I had to analyze it at gunpoint, I would say the following:
- Recognition is still the most popular application for computer vision.
- The new category “Transfer/Low-shot/Semi/Unsupervised Learning” is the most popular problem to solve with deep networks.
- Despite being a controversial technology, more people are working on face recognition. For some countries, this is probably still where most of the money goes.
- The new category “Efficient training and inference methods for networks” shows that there is an effort to push for practical use of neural networks.
- Based on other statistics, it seems that the keywords ‘graph’, ‘representation’, and ‘cloud’ doubled from last year. This is consistent with my observation that people are exploring 3D data more, since the research space on 2D images is the most crowded and competitive.
Now for my random paper picks:
a) Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. “CvxNet: Learnable Convex Decomposition” (video)
This Google Research paper introduces a new representation for 3D shapes that can be learned by neural networks and used by physics engines directly. In the paper, the authors mention that there are two types of 3D representations: 1) explicit representations, such as meshes, which can be used directly in applications such as physics simulations because they contain information about the surface, but are hard to learn with neural networks; and 2) implicit representations, such as voxel grids, which can be learned by neural networks since the problem can be framed as classifying each voxel as empty or not, but turning a voxel grid into a mesh is quite expensive. The authors therefore introduce a convex decomposition representation that represents a 3D shape as a union of convex parts. Since a convex shape can be described by a set of hyperplanes that draw the boundary of the shape, learning it becomes a classification problem while retaining information about the shape boundary. This representation is therefore both implicit and explicit. The authors also demonstrate that a learned CvxNet is able to generate 3D shapes from 2D images with much better success compared to other approaches, as shown below.
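To make the idea concrete, here is a minimal sketch of how a convex part can be evaluated as a soft classifier: each convex is a set of hyperplanes, a smooth max over the per-hyperplane signed distances approximates the convex’s signed distance, and a sigmoid turns that into an occupancy label. The smoothing constants and the union-by-max step below are my own simplifications for illustration, not the paper’s exact implementation.

```python
import numpy as np
from scipy.special import expit, logsumexp


def soft_convex_indicator(points, normals, offsets, delta=0.05, sigma=75.0):
    """Soft occupancy of one convex defined by hyperplanes n·x + d <= 0.

    points : (N, 3) query points
    normals: (H, 3) hyperplane normals
    offsets: (H,)   hyperplane offsets
    delta, sigma: smoothing hyperparameters (illustrative values only).
    Returns values near 1 inside the convex and near 0 outside.
    """
    # Signed distance of each point to each hyperplane: (N, H)
    plane_dist = points @ normals.T + offsets
    # Smooth max over hyperplanes approximates the convex's signed distance
    approx_sdf = delta * logsumexp(plane_dist / delta, axis=1)
    # Sigmoid turns the signed distance into a soft inside/outside label
    return expit(-sigma * approx_sdf)


def union_indicator(points, convex_params):
    """Union of convex parts = max over the per-convex indicators."""
    return np.max(
        [soft_convex_indicator(points, n, d) for n, d in convex_params], axis=0
    )
```

Because the hyperplane parameters are plain tensors, a network can regress them directly and be supervised with an occupancy loss, which is what makes this representation learnable in the first place.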
b) Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman. “Ego-Topo: Environment Affordances From Egocentric Video” (video)
This paper on predicting an environment’s affordances is a collaboration between UT Austin’s computer vision group and Facebook AI Research. This paper caught my eye since my dissertation was also about affordances, using a graph-like structure. If you are not familiar with the word “affordance”, it’s a controversial word made up to describe what action/function an object/environment affords a person/robot.
In this work, the authors argue that the space in which an action takes place is important for understanding first-person videos. Traditional approaches to classifying actions in videos usually just take a chunk of the video and generate a representation for classification, while SLAM (simultaneous localization and mapping) approaches that try to reconstruct the exact 3D structure of the environment often fail when humans move too fast. Instead, this work learns a network that classifies whether two views belong to the same space. Based on this information, a graph can be created where each node represents a space and holds the corresponding video clips, and the edges between nodes represent the action sequences that happened between these spaces. The videos within a node can then be used to predict what an environment affords. The authors further trained a graph convolutional network that takes neighboring nodes into account to predict the next action in the video. The authors showed that taking the underlying space into account benefits both tasks.
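As a rough illustration of the bookkeeping involved, the sketch below groups frames into “zone” nodes and links zones that are visited consecutively. The `same_space_score` function is a hypothetical stand-in for the paper’s learned view-matching network, and this is only the graph-building skeleton, not the paper’s actual localization or clustering procedure.

```python
import networkx as nx


def build_topological_graph(frames, same_space_score, threshold=0.8):
    """Group video frames into zone nodes and link consecutive zones.

    `same_space_score(a, b)` is assumed to return a score in [0, 1] for
    whether two views show the same physical space.
    """
    graph = nx.Graph()
    zones = []  # each zone keeps a list of frame indices

    for idx, frame in enumerate(frames):
        # Assign the frame to the best-matching existing zone, if any
        best, best_score = None, threshold
        for zone_id, members in enumerate(zones):
            score = max(same_space_score(frame, frames[m]) for m in members)
            if score > best_score:
                best, best_score = zone_id, score
        if best is None:
            best = len(zones)
            zones.append([])
            graph.add_node(best, frames=zones[best])
        zones[best].append(idx)

        # An edge between temporally adjacent zones records the transition
        if idx > 0 and prev_zone != best:
            graph.add_edge(prev_zone, best)
        prev_zone = best

    return graph
```

The resulting graph is what the downstream models consume: affordance prediction pools the clips stored inside each node, and the graph convolutional network reasons over the edges.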
c) Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta. “Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects” (video)
This paper would probably win the best title award for this conference if there were one. This work is about estimating the forces humans apply to objects in a video. Arguably, if robots could estimate the forces applied to objects, it would be quite useful for performing tasks and predicting human intentions. However, I personally don’t think this is how humans understand the world, and it may be solving a harder problem than needed. Having said that, this is still an interesting paper worth discussing.
The difficulty of this task is that the ground-truth forces applied to objects cannot be easily obtained. Instead of figuring out how to obtain this data, the authors use a physics simulator to simulate the outcome of applying the force, and then use the keypoints annotated in the next frame, compared to the keypoint locations of the simulated outcome, as a signal to train the network. Contact points are also predicted by a separate network trained with annotated data. The figure above shows this training scheme. Note that estimating gradients through a non-differentiable physics simulator is possible by looking at how the result changes when each dimension is perturbed a little bit. The authors show this approach is able to obtain reasonable results on a collected dataset and can be extended to novel objects.
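That gradient trick is essentially finite differences through a black-box simulator: perturb each force dimension slightly, rerun the simulation, and measure how the keypoint loss changes. Here is a minimal sketch, where `loss_fn` is a hypothetical wrapper that runs the simulator with a given force and returns the keypoint distance to the annotated next frame:

```python
import numpy as np


def finite_difference_grad(loss_fn, force, eps=1e-3):
    """Estimate d(loss)/d(force) through a non-differentiable simulator.

    `loss_fn(force)` is assumed to run the simulator with the given force
    vector and return a scalar keypoint-distance loss.
    """
    grad = np.zeros_like(force)
    base = loss_fn(force)
    for i in range(force.size):
        bumped = force.copy()
        bumped[i] += eps  # nudge one dimension a little bit
        # One extra simulator call per dimension gives the directional slope
        grad[i] = (loss_fn(bumped) - base) / eps
    return grad
```

The estimated gradient can then be passed back to the force-predicting network as if the simulator were just another (expensive) layer.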
d) Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song. “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization” (video)
This is a Google Brain paper that tries to find a better architecture for object detection, a task that would benefit from more spatial information. For segmentation tasks, the typical architecture has an hourglass-shaped encoder-decoder structure that first downscales the resolution and then scales it back up to predict pixel-wise results. The authors argue that this kind of neural network, with its scale-decreasing backbone, may not be the best solution for tasks in which localization is also important.

The idea is then to permute the order of the layers of an existing network such as ResNet and see if this results in a better architecture. To avoid having to try out all combinations, the authors used Neural Architecture Search (basically another network) to learn which architectures would be better. The result is an architecture with mixed resolutions and many longer-range skip connections (image above). The authors showed that with this architecture they were able to outperform prior state-of-the-art results, and the same network was also able to achieve good results on datasets other than the one it was trained on.
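For intuition, here is a toy sketch of what a scale-permuted backbone looks like structurally: blocks appear in an arbitrary scale order, and each block merges two earlier feature maps after resampling them to its own resolution. The block ordering and wiring below are invented for illustration; SpineNet’s actual connectivity comes out of the architecture search, and the real network uses residual bottleneck blocks rather than single convolutions.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ScalePermutedNet(nn.Module):
    """Toy scale-permuted backbone with cross-scale connections."""

    def __init__(self, channels=64):
        super().__init__()
        # Each entry: (output stride, index of input A, index of input B).
        # Note the strides are NOT monotonically decreasing in resolution.
        self.spec = [(4, 0, 0), (8, 0, 1), (4, 1, 2), (16, 2, 3), (8, 3, 4)]
        self.stem = nn.Conv2d(3, channels, 3, stride=2, padding=1)
        self.blocks = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in self.spec]
        )

    def forward(self, x):
        feats = [self.stem(x)]          # feats[0] is at stride 2
        base = feats[0].shape[-2:]
        for (stride, a, b), conv in zip(self.spec, self.blocks):
            target = (base[0] * 2 // stride, base[1] * 2 // stride)
            # Cross-scale connections: resample both inputs to this block's scale
            merged = sum(
                F.interpolate(feats[i], size=target, mode="nearest")
                for i in (a, b)
            )
            feats.append(torch.relu(conv(merged)))
        return feats  # multi-scale features for a detection head
```

The search space is essentially which earlier feature maps each block connects to and at what scale it operates, which is why the found architectures keep high-resolution features alive much deeper into the network than a standard scale-decreasing backbone.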