Life is a game, take it seriously

Paper Picks: RSS 2020

In AI, Paper Talk, Robotics on December 13, 2020 at 8:58 pm

by Li Yang Ku

Just like CVPR, RSS (Robotics: Science and Systems) is virtual this year and all the videos are free of charge. You can find all the papers here and the corresponding videos on the RSS YouTube page once you have finished bingeing Netflix, Hulu, Amazon Prime, and Disney+.

In this post, I am going to talk about a few RSS papers I found interesting. The best talk I have watched so far was, however (unsurprisingly), the keynote given by Josh Tenenbaum, who is probably one of the most charismatic speakers in the field of AI. Even though I am not a big fan of his recent “brain is a physics engine” work, it sounds less absurd and even a bit reasonable when he says it. The talk is an inspiring high-level walkthrough of many AI research projects that try to tackle the problem of understanding intuitive physics and other aspects of the human mind. My favorite part of the talk was when Josh Tenenbaum showed a video of a baby trying to stack cylinders on top of a cat. Josh argued that machine learning approaches that fit parameters to data will not be able to generalize to an unbounded number of tasks (such as placing cylinders on cats), and that this is quite different from how our minds model the world.

a) Tasbolat Taunyazov, Weicong Sng, Brian Lim, Hian Hian See, Jethro Kuan, Abdul Fatir Ansari, Benjamin Tee, Harold Soh. “Event-Driven Visual-Tactile Sensing and Learning for Robots” (video)

If you’ve been to a computer vision or robotics conference in the past 10 years, you have probably seen one of these event cameras (also called neuromorphic cameras) that have super low latency but only detect changes in brightness. A typical demo is to point the camera at a rotating fan and show that it can capture the individual blades. It was advertised as having great potential, but people still haven’t quite figured out how to use it. In this paper, the authors not only used an event camera but also developed a low latency “event” tactile sensor, and used both to distinguish objects with different weights by grasping them.

In this work, the event camera and event tactile sensor outputs are fed into a spiking neural network (SNN). A spiking neural network is an artificial neural network inspired by biological neurons, which become active when the spikes they receive within a time window exceed a threshold. In an SNN, information is passed through spike trains in parallel, and the timing and frequency of spikes play a large role in the final outcome. Similar to convolutional neural networks (CNNs), neurons can be stacked in layers, but they also perform convolution in the time dimension. Training is, however, a lot harder than for CNNs since the derivative of a spike is not well defined. Read this paper if you are interested in more details on SNNs.
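To give a rough feel for how an SNN differs from a regular network, here is a minimal leaky integrate-and-fire neuron written from scratch. This is a generic textbook-style sketch, not the neuron model or architecture used in the paper, and all parameter values are made up.

```python
import numpy as np

def lif_neuron(input_spikes, weights, tau=20.0, threshold=1.0, dt=1.0):
    """Minimal leaky integrate-and-fire neuron (illustrative only).

    input_spikes: (T, N) binary array of presynaptic spikes over T time steps.
    weights:      (N,) synaptic weights.
    Returns a length-T binary array of output spikes.
    """
    v = 0.0                                  # membrane potential
    out = np.zeros(len(input_spikes))
    for t, spikes in enumerate(input_spikes):
        v += dt * (-v / tau) + np.dot(weights, spikes)  # leak + weighted input
        if v >= threshold:                   # fire and reset when threshold is crossed
            out[t] = 1.0
            v = 0.0
    return out

# Toy usage: a neuron driven by 3 presynaptic spike trains over 100 time steps.
rng = np.random.default_rng(0)
pre = (rng.random((100, 3)) < 0.1).astype(float)
post = lif_neuron(pre, weights=np.array([0.6, 0.4, 0.5]))
print(int(post.sum()), "output spikes")
```

Because the output depends on when spikes arrive, not just how many, the same set of input spikes delivered with different timing can produce a different response, which is the property that makes event-based sensors a natural fit.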

The classification is then based on which output neuron spikes the most. In the figure above we can see the accuracy increase over time as more information is observed. With just vision input, the robot can distinguish objects that look different but not objects that look the same but have different weights. Once the tactile sensor receives more feedback while lifting the object, the combined SNN reaches a higher accuracy than either modality alone.
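As a toy illustration of this decision rule, the sketch below accumulates output spikes from the two modalities and picks the class whose neuron has spiked the most so far. The array shapes and the simple additive fusion are my assumptions rather than details from the paper.

```python
import numpy as np

def classify_by_spike_count(vision_spikes, tactile_spikes):
    """vision_spikes, tactile_spikes: (T, num_classes) binary spike trains from
    each modality's output layer (hypothetical shapes). Returns the predicted
    class at every time step using cumulative counts, so the prediction can
    keep improving as more spikes arrive."""
    counts = np.cumsum(vision_spikes + tactile_spikes, axis=0)  # running spike counts
    return counts.argmax(axis=1)                                # most active output neuron
```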

b) Adam Allevato, Elaine Schaertl Short, Mitch Pryor, Andrea Thomaz. “Learning Labeled Robot Affordance Models Using Simulations and Crowdsourcing” (video)

Affordance can be defined as the functions an object affords an agent (see my paper for my definition). A lot of research in this field tries to learn to identify affordances from data labeled by experts. In this work, the authors instead try to ground affordances to language through crowdsourcing. They first tried to collect data by having subjects observe a real robot performing straight-line movements toward a random location relative to an object; the subjects then had to enter the action the robot might be performing. The data collected turned out to be too noisy, so what the authors did instead was to take the verbs describing these actions collected on the real robot and use them as options for multiple-choice questions on Mechanical Turk with a simulated robot.

By counting how often two labels were chosen for the same robot action in the collected data, the authors came up with a way to define a hierarchical relationship between these action labels based on conditional probability. The following are two hierarchies built with different thresholds. Some of them kind of make sense; for example, the generated hierarchies below show that tip is a kind of touch and flip is a kind of tip.
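To make the conditional-probability idea concrete, here is a minimal sketch of how such a hierarchy could be derived from co-selected labels. The data format, the thresholding rule, and the toy labels are my assumptions; the authors' actual procedure may differ.

```python
from collections import Counter
from itertools import permutations

def build_hierarchy(trials, threshold=0.8):
    """Sketch of deriving 'X is a kind of Y' edges from co-selected labels.

    trials: list of sets, each containing the labels different annotators
            chose for the same robot action (assumed data format).
    Returns (child, parent) pairs where P(parent | child) >= threshold.
    """
    single, pair = Counter(), Counter()
    for labels in trials:
        for a in labels:
            single[a] += 1
        for a, b in permutations(labels, 2):
            pair[(a, b)] += 1
    edges = []
    for (child, parent), n in pair.items():
        if n / single[child] >= threshold:   # conditional probability P(parent | child)
            edges.append((child, parent))
    return edges

# Toy usage: 'tip' trials almost always also get labeled 'touch'.
trials = [{"tip", "touch"}, {"tip", "touch"}, {"touch"}, {"tip", "touch", "flip"}]
print(build_hierarchy(trials))
```

With this toy data the output includes ("tip", "touch") and ("flip", "tip"), i.e. tip is a kind of touch and flip is a kind of tip, mirroring the kind of structure shown in the paper's hierarchies.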

The authors also trained classifiers that take the robot arm motion and the resulting object pose change as input and output the most likely label. They showed that classifiers trained on the effect on the object perform better than classifiers trained on the robot arm motion. The authors claimed that this result suggests humans may consider affordance primarily as a function of the effect on the object rather than of the action itself.

c) Hong Jun, Dylan Losey, Dorsa Sadigh. “Shared Autonomy with Learned Latent Actions” (video)

For some people with disabilities, a robot that can be easily teleoperated through a joystick would be quite helpful in daily life. However, if you have ever tried to control a robot with a joystick you would know it is no easy task. Shared autonomy tries to solve this problem by guessing what the user is trying to achieve and helping the user finish the intended action. Although this approach is convenient in settings in which the robot can easily interpret the user's plan, it does not provide options for more detailed manipulation preferences, such as where to cut a piece of tofu. The authors try to address this by combining shared autonomy with latent actions.

In this work, shared autonomy is used at the start of the teleoperation; once the robot is more confident about the action the user intends to execute, it gradually switches to a 2-dimensional latent-space control (e.g. z is the latent space in the figure above). This latent space is trained with an autoencoder on training data consisting of (state, action, belief) tuples, where belief is the robot's belief over a set of candidate goals. The autoencoder is conditioned on state and belief, both of which are provided to the decoder at run time, as shown below.
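Below is a rough sketch of what such a conditioned autoencoder could look like in PyTorch; the layer sizes, dimensions, and variable names are made up for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ConditionedAutoencoder(nn.Module):
    """Rough sketch of a latent-action autoencoder conditioned on state and
    belief (dimensions and layer sizes are illustrative assumptions)."""

    def __init__(self, state_dim=7, belief_dim=3, action_dim=7, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + belief_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + state_dim + belief_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim))

    def forward(self, state, belief, action):
        # Compress the demonstrated action into a low-dimensional latent z,
        # then reconstruct it given the same state and belief.
        z = self.encoder(torch.cat([state, belief, action], dim=-1))
        recon = self.decoder(torch.cat([z, state, belief], dim=-1))
        return recon, z

# At run time, the human drives the 2D latent z directly (e.g. with a joystick)
# and the decoder maps (z, state, belief) to a full robot action.
model = ConditionedAutoencoder()
state, belief = torch.zeros(1, 7), torch.zeros(1, 3)
z = torch.tensor([[0.3, -0.1]])          # joystick input interpreted as a latent action
action = model.decoder(torch.cat([z, state, belief], dim=-1))
```

The appeal of conditioning on the belief is that the same joystick deflection can mean different fine-grained adjustments depending on which goal the robot currently thinks the user is pursuing.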

The authors tested on two interesting tasks: 1) an entree task in which the robot has to cut the tofu and move it to a plate, and 2) a dessert task in which the robot has to stab the marshmallow, scoop it in icing, then dip it in rice. They showed that their approach required less time and had fewer errors compared to a latent-space-only or shared-autonomy-only approach. You can see the whole task sequence in this video.

Paper Picks: CVPR 2020

In AI, Computer Vision, deep learning, Paper Talk, vision on September 7, 2020 at 6:30 am
by Li Yang Ku (Gooly)

CVPR is virtual this year for obvious reasons, and if you did not pay the $325 registration fee to attend this ‘prerecorded’ live event, you can now have a similar experience by watching all the recorded videos on their YouTube channel for free. Of course it's not exactly the same, since you are losing out on the virtual chat room networking experience, but honestly speaking, computer vision parties are often awkward in person already and I can't imagine you missing much. Before we go through my paper picks, let's look at the trend first. The graph below shows the accepted paper counts by topic this year.

CVPR 2020 stats

And the following are the stats for CVPR 2019:

CVPR 2019 stats

These numbers cannot be directly compared since the categories are not exactly the same; for example, deep learning, which had the most submissions in 2019, is no longer a category (it wouldn't be a very useful category when every paper is about deep learning). The distribution of these two graphs looks quite similar. However, if I have to analyze it at gunpoint, I would say the following:

  1. Recognition is still the most popular application for computer vision.
  2. The new category “Transfer/Low-shot/Semi/Unsupervised Learning” is the most popular problem to solve with deep networks.
  3. Despite being a controversial technology, more people are working on face recognition. For some countries this is probably still where most money is distributed.
  4. The new category "Efficient training and inference methods for networks" shows that there is an effort to push for practical use of neural networks.
  5. Based on these other statistics, it seems that the keywords 'graph', 'representation', and 'cloud' doubled from last year. This is consistent with my observation that people are exploring 3D data more, since the research space on 2D images is the most crowded and competitive.

Now for my random paper picks:

a) Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. “CvxNet: Learnable Convex Decomposition” (video)

This Google Research paper introduces a new representation for 3D shapes that can be learned by neural networks and used by physics engines directly. In the paper, the authors mention that there are two types of 3D representations: 1) explicit representations such as meshes, which can be used directly in many applications such as physics simulations because they contain information about the surface, but are hard to learn with neural networks; and 2) implicit representations such as voxel grids, which can be learned by neural networks since labeling each voxel as empty or not can be considered a classification problem, but are quite expensive to turn into a mesh. The authors therefore introduce a convex decomposition representation that represents a 3D shape as a union of convex parts. Since a convex shape can be represented by a set of hyperplanes that draw the boundary of the shape, learning it remains a classification problem while retaining the benefit of having information about the shape boundary. This representation is therefore both implicit and explicit. The authors also demonstrated that a learned CvxNet is able to generate 3D shapes from 2D images with much better success compared to other approaches, as shown below.
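To make the hyperplane idea concrete, here is a rough sketch of a soft inside/outside test for a single convex defined by hyperplanes. The smoothing functions and constants are my guesses at the general idea, not the exact formulation or parameters used in the paper.

```python
import numpy as np

def convex_indicator(points, normals, offsets, sharpness=75.0, blend=20.0):
    """Soft inside/outside test for one convex made of hyperplanes (a rough
    sketch of the idea; the exact smoothing used in CvxNet may differ).

    points:  (M, 3) query points.
    normals: (K, 3) hyperplane normals.
    offsets: (K,)   hyperplane offsets.
    A point is inside when every signed distance n_k . x + d_k is <= 0.
    """
    h = points @ normals.T + offsets                      # (M, K) signed distances
    # Soft maximum over hyperplanes approximates "the worst violated constraint".
    soft_max = np.log(np.exp(blend * h).sum(axis=1)) / blend
    return 1.0 / (1.0 + np.exp(sharpness * soft_max))     # ~1 inside, ~0 outside

# Toy usage: a unit cube as 6 axis-aligned half-spaces.
normals = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                    [0, -1, 0], [0, 0, 1], [0, 0, -1]], float)
offsets = -0.5 * np.ones(6)
print(convex_indicator(np.array([[0.0, 0, 0], [1.0, 0, 0]]), normals, offsets))
```

Because everything here is differentiable (unlike a hard inside/outside test), a network can regress the hyperplane parameters of each convex part directly from an image or point cloud.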

b) Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman. “Ego-Topo: Environment Affordances From Egocentric Video” (video)

Environment Affordance

This paper on predicting an environment's affordances is a collaboration between UT Austin's computer vision group and Facebook AI Research. This paper caught my eye since my dissertation was also about affordances using a graph-like structure. If you are not familiar with the word "affordance", it's a controversial word made up to describe what action/function an object/environment affords a person/robot.

In this work, the authors argue that the space an action takes place in is important for understanding first-person videos. Traditional approaches to classifying actions in videos usually just take a chunk of the video and generate a representation for classification, while SLAM (simultaneous localization and mapping) approaches that try to recover the exact 3D structure of the environment often fail when humans move too fast. Instead, this work learns a network that classifies whether two views belong to the same space. Based on this information, a graph can be created in which each node represents a space and stores the corresponding video clips, and the edges between nodes represent the action sequences that happened between these spaces. The videos within a node can then be used to predict what an environment affords. The authors further trained a graph convolutional network that takes neighboring nodes into account to predict the next action in the video. They showed that taking the underlying space into account benefited both tasks.
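Here is a small sketch of how such a topological graph could be assembled once a same-space classifier is available. The `same_space` function is a hypothetical stand-in for the trained network, and the grouping logic is simplified compared to the paper.

```python
import networkx as nx

def build_topological_graph(frames, same_space):
    """Group video frames into 'zone' nodes using a learned same-space test.

    frames: ordered list of frame features from an egocentric video.
    same_space(a, b): hypothetical function returning True if two frames
                      show the same physical space.
    """
    graph = nx.Graph()
    zones = []                      # one list of member frames per zone node
    prev_zone = None
    for frame in frames:
        # Assign the frame to the first existing zone it matches, else a new zone.
        zone = next((i for i, members in enumerate(zones)
                     if same_space(members[0], frame)), None)
        if zone is None:
            zone = len(zones)
            zones.append([frame])
            graph.add_node(zone)
        else:
            zones[zone].append(frame)
        # Edges record transitions between spaces, where action sequences happen.
        if prev_zone is not None and prev_zone != zone:
            graph.add_edge(prev_zone, zone)
        prev_zone = zone
    return graph, zones
```

For a kitchen video, for instance, the sink area and the stove area would end up as separate nodes, connected by edges whenever the person walks from one to the other.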

c) Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta. “Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects” (video)


This paper would probably have won the best title award for this conference if there were one. This work is about estimating the forces humans apply to objects in a video. Arguably, if robots could estimate the forces applied to objects, it would be quite useful for performing tasks and predicting human intentions. However, I personally don't think this is how humans understand the world, and it may be solving a harder problem than needed. Having said that, this is still an interesting paper worth discussing.

Estimating force and contact points

The difficulty of this task is that the ground truth forces applied to objects cannot be easily obtained. Instead of figuring out how to obtain this data, the authors use a physics simulator to simulate the outcome of applying the predicted force, and then use the distance between keypoints annotated in the next frame and the keypoint locations of the simulated outcome as the signal to train the network. Contact points are also predicted by a separate network trained with annotated data. The figure above shows this training scheme. Note that estimating gradients through a non-differentiable physics simulator is possible by looking at how the result changes when each dimension is perturbed a little bit. The authors show this approach is able to obtain reasonable results on a collected dataset and can be extended to novel objects.
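The gradient trick is essentially finite differences through a black-box simulator. The sketch below shows the generic version with a hypothetical `loss_fn` that runs the simulator and compares keypoints; this is my simplification of the idea rather than the authors' exact implementation.

```python
import numpy as np

def finite_difference_grad(loss_fn, force, eps=1e-3):
    """Estimate d(loss)/d(force) through a non-differentiable simulator by
    perturbing one dimension of the force vector at a time.

    loss_fn: hypothetical function that runs the simulator with a force vector
             and returns a scalar loss, e.g. the distance between simulated and
             annotated keypoints in the next frame.
    """
    grad = np.zeros_like(force)
    base = loss_fn(force)
    for i in range(len(force)):
        bumped = force.copy()
        bumped[i] += eps                          # nudge one force dimension
        grad[i] = (loss_fn(bumped) - base) / eps  # slope along that dimension
    return grad
```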

d) Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song. “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization” (video)

This is a Google Brain paper that tries to find a better architecture for object detection tasks that would benefit from more spatial information. For segmentation tasks, the typical architecture has an hourglass-shaped encoder-decoder structure that first downscales the resolution and then scales it back up to predict pixel-wise results. The authors argue that this type of neural network, with its scale-decreasing backbone, may not be the best solution for tasks in which localization is also important.

Left: ResNet, Right: Permute last 10 blocks

The idea is then to permute the order of the layers of an existing network such as ResNet and see if this results in a better architecture. To avoid having to try out all combinations, the authors used Neural Architecture Search (basically another network) to learn which architectures would be better. The result is an architecture with mixed resolutions and many long-range skip connections (image above). The authors showed that with this architecture they were able to outperform prior state-of-the-art results, and the same network also achieved good results on datasets other than the one it was trained on.
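As a toy illustration of the search space, the sketch below uses plain random search over block permutations with a hypothetical `evaluate` function. The actual paper uses reinforcement-learning-based NAS and also searches the cross-scale connections, so treat this only as a cartoon of the idea.

```python
import random

def search_block_permutation(evaluate, num_blocks=16, keep_first=6, trials=100):
    """Toy stand-in for the SpineNet search: shuffle the order of the later
    backbone blocks and keep the best-scoring ordering.

    evaluate: hypothetical function that trains/scores a candidate ordering
              on the detection task and returns a scalar score.
    """
    best_perm, best_score = None, float("-inf")
    for _ in range(trials):
        tail = list(range(keep_first, num_blocks))
        random.shuffle(tail)                       # permute the later blocks
        perm = list(range(keep_first)) + tail      # keep the stem blocks fixed
        score = evaluate(perm)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score
```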

 

Why the Idea That Our Brain has an Area Dedicated Just for Faces is Likely Wrong

In brain, Neural Science, Serious Stuffs, vision on August 17, 2020 at 8:59 pm

by Li Yang Ku (Gooly)


I was reading a neuroscience textbook recently and came across a paragraph about the discovery of the Fusiform Face Area (FFA) in the human brain. It was not news to me that there is a face area in the brain that activates when a face is observed, but I was quite surprised by what the researcher actually claimed. Nancy Kanwisher, the scientist who named the Fusiform Face Area in 1997, has the hypothesis that this area is selectively responsive to faces, and argues that this discovery supports the idea that the brain contains a few specialized components dedicated to solving very specific tasks. In the paragraph, she also mentioned quite a few experiments that back her hypothesis, including research done at the Japan Science and Technology Agency which reported that monkeys raised without seeing faces showed adult-like ability in face discrimination, suggesting that experience with faces may not be needed to process faces. These observations seem to contradict the cat experiments I previously talked about. If vision is acquired rather than innate, and even seeing something as simple as horizontal lines needs to be learned, the conclusion that our brain dedicates a pre-wired area just to faces seems counterintuitive.

The simplest way to show evidence that no area in the brain is dedicated to a specific function is probably to look at blind people's Fusiform Face Area. A quick search led me to an interesting podcast about some experiments done at Georgetown University in 2014. In these experiments, blind people were trained to recognize faces through a device that scans faces and converts them into a sequence of tones of different frequencies. The researchers showed that when blind people try to recognize faces through this device, the same Fusiform Face Area becomes activated. This result suggests that this face area can be activated not just by visual input of faces but also by an auditory representation of faces, which kind of strengthens the argument that it is a dedicated face area. However, I wasn't able to find the experimental details in publications backing the claims recorded in the podcast.

A deeper search gave me a more informative experimental result from researchers at Johns Hopkins and MIT in 2015. In this work, "Visual Cortex Responds to Spoken Language in Blind Children", scientists measured the activity of the Fusiform Face Area of blind children while they listened to stories and suggested that their Fusiform Face Area is taken over by language functionality instead. These experiments showed that the Fusiform Face Area is not dedicated just to faces and that the plasticity of the brain during childhood allows the area to learn different functions. However, one can still argue that this is a special case and doesn't invalidate the claim that the Fusiform Face Area is dedicated just to faces under normal circumstances.

Apparently I am not the only one surprised by Kanwisher's claim. In 2017, Isabel Gauthier, a professor at Vanderbilt University, wrote an article titled 'The Quest for the FFA led to the Expertise Account of its Specialization' as a direct response/protest to Kanwisher's article 'The Quest for the FFA and Where It Led'. Gauthier's article isn't just a scientific article but also a letter that exposed the bitter story of a derailed collaboration between Kanwisher and Gauthier dating back 20 years. In 1997, when Kanwisher published her famous article claiming to have found an area dedicated to face recognition, Gauthier was working on her PhD thesis at Yale, titled "Dissecting face recognition: The role of expertise and level of categorization in object recognition". Gauthier had a very different hypothesis about face recognition, which suggested that the ability to recognize faces is a skill learned just like other expert skills that involve visually distinguishing similar objects. Kanwisher's publication at the time reached the opposite conclusion from Gauthier's thesis. However, as naive as it sounds, Gauthier held the belief that it would be advantageous for her to work with someone with whom she strongly disagreed, and proposed to work as Kanwisher's postdoc with her own funding. Instead of drawing conclusions from existing experiments, the proposal was to come up with an experiment both of them could agree on.

Prior to this, Gauthier had done experiments on training people to distinguish the objects called Greebles shown above. Her results showed that subjects who learned to distinguish these Greebles had a stronger response in the Fusiform Face Area when shown an upright Greeble versus an upside-down Greeble. Kanwisher argued that these Greebles might be too similar to faces and therefore activate the same area, so they agreed that the experiment should be done on experts in things that are very different from faces; in this case they picked birds and cars. Gauthier then recruited about 20 bird and car experts and scanned their brains with MRI while showing them pictures of cars and birds. The results were clear: when bird experts saw an image of a bird, the same Fusiform Face Area showed stronger activity. Bird experts only had strong FFA activity when bird images were shown but not when car images were shown, and car experts showed the opposite pattern. These results seem to indicate that the FFA is an area used to distinguish similar objects of the same class rather than an area dedicated to faces.


Kanwisher, however, decided not to put her name on this paper that contradicted her original idea, and hired another postdoc to redo the experiment with different settings. The results came out similar, and again Kanwisher refused to be listed on the paper, titled "Revisiting the role of the fusiform face area in visual expertise", published in 2005. This could have been the end of the conflict between Kanwisher and Gauthier, and just like most disputes in academia, these stories would probably only be known among a small group of grad students. Nancy Kanwisher continued her academic career at MIT, won quite a few awards, and even gave a TED talk in 2014. Isabel Gauthier became a professor at Vanderbilt University, also won a lot of awards, and continued with several lines of research that strengthened her expertise theory. However, in 2017 Kanwisher published the article 'The Quest for the FFA and Where It Led', in which she not only didn't mention her involvement in the early expertise experiments but also claimed the effect shown in Gauthier's work to be small and not replicable. She further claimed that her original paper that discovered this specialized brain area drew fire from Gauthier and many others because it found evidence bearing on the century-old debate on domain specificity in the brain. Ironically, in this article Kanwisher also argues that the field has a replication crisis and that researchers should work harder to replicate their own results before publishing them. This seemingly harmless article must have kept Gauthier awake for nights, and led to the tell-all article 'The Quest for the FFA led to the Expertise Account of its Specialization' that unveiled what would otherwise be another untold story in the ivory tower of academia.