Life is a game, take it seriously

The Deep Learning Not That Smart List

In AI, Computer Vision, deep learning, Machine Learning, Paper Talk on May 27, 2019 at 12:00 pm

by Li Yang Ku (Gooly)

Deep learning is one of the most successful scientific stories in modern history, attracting billions of dollars of investment in half a decade. However, there is always another side to the story, where people discover the less magical parts of deep learning. This post is about a few pieces of research (quite a few published this year) showing that deep learning might not be as smart as you think (most of the time the authors also come up with a way to fix the problem, since it used to be forbidden to accept a paper without a deep learning improvement.) This is just a short list; please comment below on other papers that also belong.

a) Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. “Intriguing properties of neural networks.”, ICLR 2014

The first non-magical discovery in deep learning has to go to the finding of adversarial examples. It was discovered that adding certain imperceptible perturbations to an image can cause a deep network to produce mysterious misclassifications. Although technically the first publication of this discovery belongs to "Evasion Attacks against Machine Learning at Test Time" by Battista Biggio et al., published in September 2013 at ECML PKDD, the paper that really caught people's attention is this one, posted on arXiv in December 2013 and published at ICLR 2014. In addition to having bigger names on the author list, this paper also shows adversarial examples on more colorful images that clearly demonstrate the problem (see image below.) Since this discovery, there has been a continuous battle between the camp that tries to strengthen defenses against attacks and the camp that tries to break them (such as "Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples" by Athalye et al.), leading to a recent ICLR 2019 paper, "Are adversarial examples inevitable?" by Shafahi et al., that asks from a theoretical standpoint whether a deep network can ever be free of adversarial examples.
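To make the idea concrete, here is a minimal NumPy sketch in the spirit of the fast-gradient-sign trick from the follow-up Goodfellow et al. paper ("Explaining and Harnessing Adversarial Examples"); the "network" is just a toy logistic-regression model, and the weights and input are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: logistic regression on
# 100-dimensional "images" with pixel values in [0, 1].
w = rng.normal(size=100)
x = rng.uniform(size=100)

def logit(v):
    return w @ v

p_clean = 1.0 / (1.0 + np.exp(-logit(x)))

# Gradient-sign perturbation: nudge every pixel by eps in the
# direction that pushes the input against the current prediction.
# The gradient of the logit with respect to the input is simply w.
eps = 0.1
direction = 1.0 if p_clean > 0.5 else -1.0
x_adv = np.clip(x - direction * eps * np.sign(w), 0.0, 1.0)
p_adv = 1.0 / (1.0 + np.exp(-logit(x_adv)))

# Each pixel moved by at most 0.1, yet the confidence shifts sharply:
# in high dimensions, many tiny aligned changes add up.
print(p_clean, p_adv)
```

Real attacks do the same thing against a deep network's loss gradient, which is why the perturbation stays invisible per pixel but still flips the prediction.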

b) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.” CVPR 2018

This is not a paper intended to expose flaws of deep learning; in fact, its result is one of the most magical deep learning results I've seen. The authors showed that deep networks are able to fill in cropped-out regions of images in a very reasonable way (see image below; left input, right output.) However, it also unveils some less magical parts of deep learning. Deep learning's success was mostly advertised as learning from data, and it was claimed to work better than traditionally engineered visual features because it learns from large amounts of data. This work, however, uses no data and no pre-trained weights. It shows that convolution and the specific layered network architecture (which may be the outcome of millions of grad student hours of trial and error) played a significant role in the success. In other words, we are still engineering visual features, just in a more subtle way. It also raises the question of what made deep learning so successful: is it really the learning? Or is it that thousands of grad students tried all kinds of architectures, loss functions, and training procedures, and some combinations turned out to be great?
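The spirit of the result can be sketched with a deliberately tiny stand-in. Instead of the paper's deep CNN, let the "architecture" be a fixed linear upsampling of a low-resolution parameter grid, fit only to the observed samples of a 1-D signal. Everything here (the signal, the mask, the upsampler) is made up; the point is that the missing chunk gets filled in by the structure alone, with no training data and no pre-trained weights:

```python
import numpy as np

# Target "image": a smooth 1-D signal with a chunk cropped out.
n = 64
x = np.sin(np.linspace(0, 3 * np.pi, n))
mask = np.ones(n, bool)
mask[24:40] = False           # the cropped-out region to inpaint

# "Architecture as prior": the output is a fixed linear upsampling
# of k low-resolution parameters -- a structural bias toward smooth
# outputs, standing in for the CNN's bias in the paper.
k = 8
U = np.zeros((n, k))          # linear-interpolation upsampling matrix
pos = np.linspace(0, k - 1, n)
lo = np.floor(pos).astype(int).clip(0, k - 2)
frac = pos - lo
U[np.arange(n), lo] = 1 - frac
U[np.arange(n), lo + 1] = frac

# Fit the parameters to the OBSERVED samples only (least squares).
theta, *_ = np.linalg.lstsq(U[mask], x[mask], rcond=None)
recon = U @ theta             # the missing chunk is filled in smoothly

err = np.abs(recon[~mask] - x[~mask]).max()
print("max error in the inpainted region:", err)
```

The reconstruction never saw the masked samples; they come out reasonable only because the "architecture" can't express anything but smooth signals, which is exactly the paper's point about convolutional structure.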

c) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” ICLR 2019.

It was widely accepted in the deep learning community that CNNs recognize objects by combining lower-level filters that represent features such as edges into more complex shapes layer by layer. In this recent work, the authors noticed that, contrary to what the community believes, existing deep learning models seem to have a strong bias toward textures. For example, a cat with elephant texture is often recognized as an elephant. Instead of learning what a cat looks like, CNNs seem to take a shortcut and just try to recognize cat fur. You can find a detailed blog post about this work here.

d) Wieland Brendel, and Matthias Bethge. “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet.” ICLR 2019.

This is a paper from the same group as the previous one. Based on the same observations, it claims that CNNs are not that different from bag-of-features approaches that classify based on local features. The authors created a network that only looks at local patches in an image, without high-level spatial information, and were able to achieve pretty good results on ImageNet. They further shuffled features in an image and found that existing deep learning models seem insensitive to these changes. Again, CNNs seem to be taking shortcuts by classifying based on local features alone. More on this work can be found in this post.
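The bag-of-local-features idea reduces to a very short sketch: classify each patch independently, average the evidence, and never look at where the patches are. The patch classifier below is a random linear map on made-up data, just to show the bookkeeping; the punchline is that shuffling the patches cannot change the prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def patches(img, p=4):
    """Split an image into non-overlapping p x p patches."""
    h, w = img.shape
    return [img[i:i + p, j:j + p].ravel()
            for i in range(0, h, p) for j in range(0, w, p)]

# Toy patch-level classifier: random linear logits over 3 classes.
W = rng.normal(size=(3, 16))

def bag_of_features_predict(img):
    """Average per-patch logits -- no spatial information used."""
    logits = np.mean([W @ v for v in patches(img)], axis=0)
    return int(np.argmax(logits))

img = rng.normal(size=(16, 16))

# Shuffle the patches: a spatial model would care; this one cannot.
ps = patches(img)
rng.shuffle(ps)
shuffled = np.block([[ps[4 * i + j].reshape(4, 4) for j in range(4)]
                     for i in range(4)])

print(bag_of_features_predict(img), bag_of_features_predict(shuffled))
```

The paper's surprise is that a network built this way still does well on ImageNet, and that regular CNNs behave suspiciously like it.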

e) Azulay, Aharon, and Yair Weiss. “Why do deep convolutional networks generalize so poorly to small image transformations?.” rejected by ICLR 2019.

This paper discovered that modern deep networks may fail to recognize images shifted by a single pixel, but it got rejected because reviewers didn't quite buy the experiments or the explanation. (The authors made the big mistake of not providing an improved deep network in the paper.) The paper showed that when an image is shifted slightly, or when a sequence of frames from a video is given to a modern deep network, jaggedness appears in the detection results (see example below, where the posterior probability of recognizing the polar bear varies a lot frame by frame.) The authors further created a dataset from ImageNet with the same images embedded at a random location in a larger frame, and showed that performance dropped about 30% when the embedding frame is twice the width of the original image. This work shows that despite modern networks getting close to human performance on ImageNet classification, they might not generalize to the real world as well as we hoped.
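The paper attributes much of this jaggedness to subsampling (strided convolution and pooling), which breaks shift equivariance. That effect can be isolated in a few lines of NumPy on a made-up 1-D "feature map", along with the classical signal-processing remedy of blurring (anti-aliasing) before subsampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D stand-in for a feature map.
x = rng.normal(size=64)

def subsample(v):
    return v[::2]                       # stride-2, as in pooling layers

# Shift the input by ONE sample: stride-2 subsampling now returns the
# other half of the samples, so the output changes substantially.
diff_raw = np.abs(subsample(x) - subsample(np.roll(x, 1))).mean()

# Anti-aliasing: blur before subsampling, so a one-sample shift
# changes the output far less.
def blur_then_subsample(v):
    return np.convolve(v, [0.25, 0.5, 0.25], "same")[::2]

diff_blur = np.abs(blur_then_subsample(x)
                   - blur_then_subsample(np.roll(x, 1))).mean()

print(diff_raw, diff_blur)
```

Stacking several strided layers, as modern CNNs do, compounds this instability, which is consistent with the frame-by-frame wobble the authors observed.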

f) Nalisnick, Eric, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. “Do Deep Generative Models Know What They Don’t Know?.” ICLR 2019

This work from DeepMind looks into the problem that, when tested on data with a distribution different from training, a deep neural network can give wrong results with high confidence. For example, in the paper "Multiplicative Normalizing Flows for Variational Bayesian Neural Networks" by Louizos and Welling, it was discovered that on the MNIST dataset a trained network can be highly confident but wrong when the input digit is tilted. This makes deploying deep learning to critical tasks quite problematic. Deep generative models were thought to be a solution to such problems: since they also model the distribution of the samples, they should be able to reject an anomaly that does not belong to the same distribution as the training samples. However, the authors' short answer to the title question is no; even for very distinct datasets, such as digits versus images of horses and trucks, anomalies cannot be identified, and in many cases the model even wrongly assigns higher confidence to them than to samples that do come from the training distribution. The authors therefore "urge caution when using these models with out-of-training-distribution inputs or in unprotected user-facing systems."
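The failure is not even specific to deep models. A toy Gaussian density, fit perfectly to its training distribution, already assigns an out-of-distribution point higher likelihood than any typical training sample (the paper's models are flows, VAEs, and PixelCNNs, but the high-dimensional geometry below is the same flavor of problem; the numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                         # dimensionality, e.g. flattened images

# "Training" distribution: standard normal in d dimensions. The
# density model below is exact, not a bad fit.
def log_density(x):
    return -0.5 * (x @ x) - 0.5 * d * np.log(2 * np.pi)

in_dist = rng.normal(size=(1000, d))    # typical training samples
ood = np.zeros(d)                       # an all-zeros "image", unlike
                                        # anything the model was fit to

ll_in = np.array([log_density(x) for x in in_dist])
ll_ood = log_density(ood)

# The OOD point gets a HIGHER likelihood than every single training
# sample: a likelihood threshold would happily accept it.
print(ll_ood, ll_in.max())
```

In high dimensions, typical samples live on a thin shell away from the density peak, so "high likelihood" and "looks like the training data" come apart; the paper documents the analogous effect in real deep generative models.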


Revisiting Behavior-Based Robotics

In AI, Robotics on February 28, 2019 at 9:10 pm

by Li Yang Ku (Gooly)

The maker of the well-known Baxter robot, Rethink Robotics, closed its doors last October. The Baxter robot, although not perfect, plays an important role in robot history. Its low price tag ($22,000 instead of $100,000) and human-safe features (it won't be able to kill grad students) made these robots among the most common in the robotics research community. Unfortunately, that was not enough to survive in the market.

Many of you may have heard of Rodney Brooks, the founder and CTO of Rethink Robotics, who was also the director of MIT's Computer Science & Artificial Intelligence Laboratory (CSAIL) and one of the founders of iRobot. To me, however, it is behavior-based robotics that best describes him. In this post, I am going to revisit Rodney Brooks' research on behavior-based robotics and explain why it was a big deal back then.

To fully understand behavior-based robotics, we have to go back in time and look at what was happening in the research world before Rodney Brooks started advocating for behavior-based robotics in the 80s. This was right around the time of the early AI winter and before the collapse of the expert system industry. An expert system stores a huge knowledge base of logical rules describing facts about the world, entered by experts; at query time, an inference engine tries to find a solution based on the given rules. It is not hard to imagine that the robots designed at that time were also based on this kind of thinking. Shakey, the famous robot built by Stanford Research Institute in the late 60s, used logic to solve tasks based on a symbolic model of the environment. Despite its national fame, Shakey was designed for an experimental environment consisting of big blocks and, as you might know, it was not the technology breakthrough that led to household robots.

In the late 70s, Rodney Brooks was especially frustrated with these symbolic approaches that try to model the world in detail. Computers were not fast at that time, and trying to estimate a world model with uncertainty is even more time consuming. During a trip in which he was stuck in Thailand, Rodney observed that insects seem to be much more capable than his robots despite having tiny nervous systems. The realization was that there is no need to model the world, because the world is always there; the robot can always sense the world and use it as its own model. This simple idea is basically the core concept of behavior-based robotics.

Rodney went on to propose the subsumption architecture, which is composed of different layers of state machines, in which higher layers subsume lower layers to create more complicated behaviors. Brooks claimed that this approach is radically different from traditional approaches that follow the sense-model-plan-act framework. The subsumption architecture is capable of reacting to the world in real time, since the lower layers can produce outputs directly. Instead of executing actions in a pre-planned sequence, the next action can simply be activated by new observations of the world. Rodney argued that this new approach has a very different decomposition compared to the traditional sequential information flow: in the subsumption architecture, each layer itself connects sensing to action. Higher layers may rely on lower layers, but they do not call lower layers as subroutines. Several robots were built based on this architecture, including the robot Allen, which can move to a goal while avoiding obstacles; the robot Herbert, which can pick up soda cans; the insect-like robot Genghis; and more.
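A minimal sketch of the idea (illustrative, not Brooks' actual implementation; the robot, sensors, and actions below are hypothetical): each layer maps sensing directly to an action, a layer with no opinion defers, and a higher layer's output takes precedence over the layers it subsumes. Nothing builds a world model and nothing plans:

```python
def avoid(sensors):
    """Middle layer: turn away from nearby obstacles."""
    if sensors["obstacle_distance"] < 1.0:
        return "turn_left"
    return None                     # no opinion -> defer

def wander(sensors):
    """Bottom layer: by default, keep moving."""
    return "go_forward"

def seek_goal(sensors):
    """Top layer: head toward the goal when the path is clear."""
    if sensors["goal_visible"] and sensors["obstacle_distance"] >= 1.0:
        return "turn_toward_goal"
    return None

# Highest layer first; the first layer with an opinion wins, i.e. it
# subsumes the outputs of the layers beneath it. Crucially, every
# layer reads the *world* (sensors) directly -- there is no shared
# model and no plan.
LAYERS = [seek_goal, avoid, wander]

def act(sensors):
    for layer in LAYERS:
        action = layer(sensors)
        if action is not None:
            return action

print(act({"obstacle_distance": 0.5, "goal_visible": True}))  # turn_left
```

The real architecture wires layers as asynchronous state machines with explicit suppression and inhibition of signals rather than a priority loop, but the decomposition is the same: competence by competence, each one sensing and acting on its own.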

These works were quite influential and provided a very different perspective on how to approach AI. Unlike other robots at that time, robots under the subsumption architecture could react in real time in a human environment. Rodney went on to promote this concept and published a series of papers (with some of the best titles), such as "Planning is just a way of avoiding figuring out what to do next" and "Elephants don't play chess." Two crucial ideas were emphasized in these papers: 1) Situatedness: the robots should not deal with abstract descriptions, but with the environment that directly influences them; and 2) Embodiment: the robots should experience the world directly, so that their actions have immediate feedback on their own sensations. These are the central ideas that led to behavior-based solutions.

Today, computers are much faster, and robots are now capable of running the good old-fashioned sense-model-plan-act sequence close to, if not yet in, real time. Model-heavy approaches such as physics-based methods are among the most popular topics, and planning algorithms are ubiquitous among robot arms and self-driving cars. So is behavior-based robotics still relevant in 2019? Some of the concepts still exist in many robots, but in a more hybrid fashion, such as having a lower-level loop that lets the robot react quickly underneath a high-level AI planning layer. Although behavior-based robotics is not mentioned as often nowadays, I am pretty sure we will revisit it when the sense-model-plan-act approach fails again.


  • Brooks, Rodney A. “New approaches to robotics.” Science 253, no. 5025 (1991): 1227-1232.
  • Brooks, Rodney A. “Elephants don’t play chess.” Robotics and autonomous systems 6, no. 1-2 (1990): 3-15.
  • Brooks, Rodney A. “Planning is just a way of avoiding figuring out what to do next.” (1987).
  • Talking Robots Podcast with Rodney Brooks
  • Wikipedia: subsumption architecture

Paper Picks: IROS 2018

In AI, deep learning, Paper Talk, Robotics on December 30, 2018 at 4:18 pm

By Li Yang Ku (Gooly)

I was at IROS in Madrid this October presenting some fan manipulation work I did earlier (see video below), and the King of Spain also attended (see figure above.) When even the King is talking about deep learning, you know what the hype trend in robotics is. Madrid is a fabulous city, so I was only able to pick a few papers below to share.


a) Roberto Lampariello, Hrishik Mishra, Nassir Oumer, Phillip Schmidt, Marco De Stefano, Alin Albu-Schaffer, “Tracking Control for the Grasping of a Tumbling Satellite with a Free-Floating Robot”

This is work done by folks at DLR (the German Aerospace Center). The goal is to grasp a tumbling satellite with a robotic arm on another free-floating satellite. As you can tell, this is a challenging task, and this work presents progress that extends a series of previous efforts by different space agencies. Research on related grasping tasks can be roughly classified into feedback control methods, which solve a regulation control problem, and optimal control approaches, which compute a feasible optimal trajectory in an open-loop fashion. In this work, the authors propose a system that combines both feedback and optimal control. This is achieved by using a motion planner, generated off-line with all relevant constraints, to provide visual servoing with a reference trajectory. Servoing will deviate from the original plan, but the gross motion is maintained to avoid violating motion constraints (such as singularities.) The approach is tested in a zero-gravity facility. If you haven't seen one of these zero-gravity devices, they are quite common among space agencies and are used to turn off gravity (see figure above.)

b) Josh Tobin, Lukas Biewald , Rocky Duan , Marcin Andrychowicz, Ankur Handa, Vikash Kumar, Bob McGrew, Alex Ray, Jonas Schneider, Peter Welinder, Wojciech Zaremba, Pieter Abbeel, “Domain Randomization and Generative Models for Robotic Grasping.”

This is work done (mostly) at OpenAI that tries to tackle grasping with deep learning. Previous works on grasping with deep learning are usually trained on at most thousands of unique objects, which is relatively small compared to datasets for image classification such as ImageNet. In this work, a new data generation pipeline is proposed that cuts up meshes and combines them randomly in simulation. With this approach the authors generated a million unrealistic training objects and showed that they can be used to learn grasping on realistic objects with accuracy similar to the state of the art. The proposed architecture is shown above: α is a convolutional neural network, β is an autoregressive model that generates n different grasps (n = 20), and γ is another neural network, trained separately, that evaluates each grasp using the likelihood of success computed by the autoregressive model plus another observation from the in-hand camera. The autoregressive model is an interesting choice, which the authors claim is advantageous because it can directly compute the likelihood of its samples.
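Why an autoregressive model gives exact sample likelihoods is easy to show in miniature: factor the grasp distribution as p(g) = p(g1) p(g2|g1) p(g3|g1,g2) over discretized coordinates, and the likelihood of any sampled grasp is just the product of the conditionals you sampled from. In the paper those conditionals come from a neural network; the random lookup tables below are a made-up stand-in, purely to show the bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 8, 3                    # 8 bins per coordinate, 3 coordinates
                               # (say x, y, wrist angle -- hypothetical)

def conditional(prefix):
    """p(next coordinate | previous coordinates) -- a made-up table,
    deterministic in the prefix so repeated queries agree."""
    local = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    p = local.random(K)
    return p / p.sum()

def sample_grasp():
    """Sample a grasp AND return its exact likelihood."""
    grasp, likelihood = [], 1.0
    for _ in range(D):
        p = conditional(grasp)
        i = int(rng.choice(K, p=p))
        grasp.append(i)
        likelihood *= p[i]     # exact, by the chain rule
    return grasp, likelihood

grasps = [sample_grasp() for _ in range(20)]   # n = 20 as in the paper
best = max(grasps, key=lambda gl: gl[1])       # rank candidates by likelihood
print(best)
```

A GAN or VAE sampler would give you the 20 candidates but not their exact probabilities, which is the advantage the authors point to when ranking grasps.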

c) Barrett Ames, Allison Thackston, George Konidaris, “Learning Symbolic Representations for Planning with Parameterized Skills.”

This is a planning work (by folks I know) that combines parameterized motor skills with higher-level planning. At each state the robot needs to select both an action and how to parameterize it. This work introduces a discrete abstract representation for this kind of planning and demonstrates it on Angry Birds and a coffee-making task (see figure above.) The authors showed that the approach is capable of generating a state representation that requires very few symbols (here symbols are used to describe preconditions and state estimates), therefore allowing an off-the-shelf probabilistic planner to plan faster. Only 16 symbols are needed for the Angry Birds task (not the real Angry Birds, a simpler version), and a plan can be found in 4.5 ms. One observation is that the only parameter settings that need to be represented by a symbol are the ones that maximize the probability of reaching the next state on the path to the goal.