Life is a game, take it seriously

Paper Picks: CVPR 2018

In Computer Vision, deep learning, Machine Learning, Neural Science, Paper Talk on July 2, 2018 at 9:08 pm

by Li Yang Ku (Gooly)

I was at CVPR in salt lake city. This year there were more then 6500 attendances and a record high number of accepted papers. People were definitely struggling to see them all. It was a little disappointing that there were no keynote speakers, but among the 9 major conferences I have been to, this one has the best dance party (see image below). You never know how many computer scientists can dance until you give them unlimited alcohol.

In this post I am going to talk about a few papers that were not the most popular ones but were what I personally found interesting. If you want to know the papers that the reviewers though were interesting instead, you can look into the best paper “Taskonomy: Disentangling Task Transfer Learning” and four other honorable mentions including the “SPLATNet: Sparse Lattice Networks for Point Cloud Processing” from collaborations between Nvidia and some people in the vision lab at UMass Amherst which I am in.

a) Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. “Learning and Using the Arrow of Time.”

I am quite fond of works that explore cues in the world that may be useful for unsupervised learning. Traditional deep learning approaches requires large amount of labeled training data but we humans seem to be able to learn from just interacting with the world in an unsupervised fashion. In this paper, the direction of time is used as a clue. The authors train a neural network to distinguish the direction of time and show that such network can be helpful in action recognition tasks.

b) Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. “Learning to Localize Sound Source in Visual Scenes.”

This is another example of using cues available in the world. In this work, the authors ask whether a machine can learn the correspondence between visual scene and sound, and localize the sound source only by observing sound and visual scene pairs like humans? This is done by using a triplet network that tries to minimize the difference between visual feature of a video frame and the sound feature generated in a similar time window, while maximizing the difference between the same visual feature and a random sound feature. As you can see in the figure above, the network is able to associate different sounds with different visual regions.

c) Edward Kim, Darryl Hannan, and Garrett Kenyon. “Deep Sparse Coding for Invariant Multimodal Halle Berry Neurons.”

This work is inspired by experiments done by Quiroga et al. that found a single neuron in one human subject’s brain that fires on both pictures of Halle Berry and texts of Halle Berry’s name. In this paper, the authors show that training a deep sparse coding network that takes a face image and a text image of the corresponding name results in learning a multimodal invariant neuron that fires on both Halle Berry’s face and name. When certain modality is missing, the missing image or text can be generated. In this network, each sparse coding layer is learned through the Locally Competitive Algorithm (LCA) that uses principles of thresholding and local competition between neurons. Top down feedback is also used in this work through propagating reconstruction error downwards. The authors show interesting results where adding information to one modality changes the belief of the other modality. The figure above shows that this Halle Berry neuron in the sparse coding network can distinguish between cat women acted by Halle Berry versus cat women acted by Anne Hathaway and Michele Pfeiffer.

d) Assaf Shocher, Nadav Cohen, and Michal Irani. “Zero-Shot Super-Resolution using Deep Internal Learning.”

Super resolution is a task that tries to increase the resolution of an image. The typical approach nowaday is to learn it through a neural network. However, the author showed that this approach only works well if the down sampling process from the high resolution to the low resolution image is similar in training and testing. In this work, no training is needed beforehand. Given a test image, training examples are generated from the test image by down sampling patches of this same image. The fundamental idea of this approach is the fact that natural images have strong internal data repetition. Therefore, from the same image you can infer high resolution structures of lower resolution patches by observing other parts of the image that have higher resolution and similar structure. The image above shows their results (top row) versus state of the art results (bottom row).

e) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.”

Most modern approaches for denoising, super resolution, or inpainting tasks use an image generation network that trains on a large dataset that consist of pairs of images before and after the affect. This work shows that these nice outcomes are not just the result of learning but also the effect of the convolutional structure. The authors take an image generation network, feed random noise as input, and then update the network using the error between the outcome and the test image, such as the left image shown above for inpainting. After many iterations, the network magically generates an image that fills the gap, such as the right image above. What this works says is that unlike common belief that deep learning approaches for image restoration learns image priors better then engineered priors, the deep structure itself is just a better engineered prior.

Advertisements
  1. […] What I found interesting is that the authors made comments similar to the authors of the paper Deep Image Prior; deep learning approaches work not just because of learning but also because of the engineered […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: