Life is a game, take it seriously

Transformer for Vision

In Computer Vision, deep learning, Machine Learning, Paper Talk, vision on October 9, 2021 at 11:04 pm

By Li Yang Ku

In my previous post I talked about this web app I made that can generate rap lyrics using the transformer network. The transformer is currently the most popular approach for natural language tasks (I am counting OpenAI’s GPT-3 as a transformer extension.) In this post I am going to talk about a few different works that try to apply it to vision-related tasks.

If six years ago you told me that the next big thing in computer vision would be a model developed for natural language processing, I would have laughed and assumed you were joking. Language, which is normally received over a period of time, seems so different from image recognition, where information is spatial; it’s hard to imagine any model designed for language being any good at vision. Because of these differences, directly applying the transformer to vision tasks is non-trivial (we will go through a few different approaches in this post.) However, because the transformer is based on ideas that are quite general, applying them to vision tasks actually makes sense.

The principal idea of the transformer is learning attention: to understand a sentence, we often have to associate words with other words in the same sentence, and these relations are where we put our attention. Take the sentence “Imagine your parents comparing you to Dr. Jonny Kim”: when we look at the word “comparing” we would probably pay attention to “you” and “Dr. Jonny Kim”, the two things being compared. And when we focus on “you” the association might be “your parents”. In the transformer, we treat the first word as a query, the other words we pay attention to as keys, and each key has a corresponding value that represents the meaning of the association. By stacking these attention blocks, we transform the sentence into a vector that contains higher-level information that can be used for different language tasks such as translation, sentence completion, question answering, etc. Associating different features is also quite important for vision: to recognize a cat, you might want to check for a cat head and a cat body at the right relative positions. If the cat body is far away from the head, something is probably wrong.
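As a concrete sketch of the query/key/value idea, here is minimal single-head scaled dot-product self-attention in NumPy. This is my own simplified toy version (no learned projection matrices, no multi-head), just to show where the attention weights come from:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query attends to all keys,
    and the output is a weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # one attention distribution per query
    return weights @ V                   # (n_queries, d_v)

# Toy example: 4 "words", each an 8-dimensional embedding.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = attention(x, x, x)   # self-attention: queries, keys, values from same input
```

In a real transformer, `Q`, `K`, and `V` are linear projections of the input and several such heads run in parallel, but the weighted-sum structure is the same.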

One of the earlier attempts to use the transformer for vision tasks was published in the 2018 paper “Image Transformer”. One of the main problems with using transformers for vision is computation. An image may be worth a thousand words, but it can also take hundreds if not thousands of times more memory. Computing relations between every pixel and every other pixel is therefore infeasible. In this work, Parmar et al. addressed this issue by keeping attention within a set of neighboring pixels. The Image Transformer was applied to vision tasks such as image completion and super resolution. Since it generates one pixel at a time in a top-down, left-to-right order, image completion is the task most similar to sentence completion and the most natural fit for the transformer. On these tasks the Image Transformer outperformed state-of-the-art approaches that mostly used GANs (Generative Adversarial Networks). The results actually look pretty good, see Table 1. below.
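The local-attention idea can be sketched by simply masking out attention scores outside a neighborhood. This is a 1-D simplification of the paper’s 2-D local attention, with window size and dimensions chosen by me for illustration:

```python
import numpy as np

def local_attention_mask(n, window):
    # Boolean mask: position i may only attend to positions within `window` of i.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(Q, K, V, mask):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(mask, scores, -1e9)   # block attention outside the window
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 4))           # 16 "pixels" (flattened), 4-dim features
mask = local_attention_mask(16, window=2)  # each pixel sees at most 5 neighbors
out = masked_attention(x, x, x, mask)
```

The cost per position now depends on the window size rather than the image size, which is what makes attention over pixels tractable.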

Another attempt at applying transformers to vision tasks is the work “Scaling Autoregressive Video Models” by Weissenborn et al. In this work, the authors tackle the problem of video generation, which also closely resembles the sentence completion task. For transformers to handle videos, computation and memory become an even bigger problem due to their quadratic cost with respect to input size. To tackle this, the video is divided into smaller non-overlapping sub-blocks, and the attention layers are applied to each block individually. The problem with this approach is that there is no communication between blocks. To address it, the blocks are split differently at each layer, and some blocks extend across an entire axis. For example, the block sizes used for the first 4 layers in the network are (4, 8, 4), (4, 4, 8), (1, 32, 4), and (1, 4, 32), where each tuple represents sizes of (time, height, width) and the input video is subscaled to size 4 x 32 x 32. The following images are results trained on the BAIR Robot Pushing dataset, in which the first image on the left is given and the rest are generated.
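The block partition itself is just a reshape. A minimal sketch, using one of the layer-wise block shapes and the subscaled video size from the paper (the function name and the NumPy formulation are mine):

```python
import numpy as np

def split_into_blocks(video, bt, bh, bw):
    """Partition a (T, H, W, C) video into non-overlapping (bt, bh, bw) blocks;
    attention is then run inside each block independently."""
    T, H, W, C = video.shape
    assert T % bt == 0 and H % bh == 0 and W % bw == 0
    v = video.reshape(T // bt, bt, H // bh, bh, W // bw, bw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)     # group the block axes together
    return v.reshape(-1, bt * bh * bw, C)    # (num_blocks, block_length, C)

video = np.zeros((4, 32, 32, 3))             # the subscaled size used in the paper
blocks = split_into_blocks(video, 1, 32, 4)  # one of the first-layer block shapes
print(blocks.shape)                          # (32, 128, 3)
```

Because the (1, 32, 4) block spans the full height axis, information can travel across the whole height in one layer even though each block is small.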

So far we’ve been talking about vision applications that are in some way similar to language tasks, where the problem can be seen as generating a new “token” given already generated “tokens”. In the paper “Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, the transformer is used to solve more traditional vision tasks such as classification and segmentation. As in the previous approaches, one of the main concerns when applying the transformer is reducing computation. The authors proposed factorizing the 2D attention into two 1D attentions, along height first and then width. They call this the axial attention block (figure below) and use it to replace the convolution layers in a ResNet (a type of convolutional neural network that won the ImageNet competition in 2015.) This Axial-ResNet can be used just like a ResNet for image classification tasks, or can be combined with a conditional random field to produce segmentation outputs. The authors showed that this approach achieved state-of-the-art results on a few segmentation benchmarks.
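The axial factorization can be sketched as two 1-D self-attentions, one along height and one along width. This is a stripped-down version of my own making, without the learned projections and positional terms the paper uses, just to show the cost saving:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def axial_attention(x):
    """x: (H, W, C) feature map. Self-attention along the height axis, then
    along the width axis. Cost is O(H*W*(H+W)) instead of the O((H*W)^2)
    of full 2-D attention."""
    H, W, C = x.shape
    out = np.empty_like(x)
    for w in range(W):                  # height axis: attend within each column
        col = x[:, w, :]
        out[:, w, :] = softmax(col @ col.T / np.sqrt(C)) @ col
    out2 = np.empty_like(out)
    for h in range(H):                  # width axis: attend within each row
        row = out[h]
        out2[h] = softmax(row @ row.T / np.sqrt(C)) @ row
    return out2

rng = np.random.default_rng(0)
y = axial_attention(rng.standard_normal((8, 8, 4)))
```

After the two passes, every position has (indirectly) attended to every other position, since the width pass mixes the outputs of the height pass.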

This next work I am going to talk about was published this June (2021), and its goal is to show that a pure transformer network can do as well as (or even better than) CNNs (Convolutional Neural Networks) on image classification tasks when pre-trained on large enough data. In this paper, “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, the authors introduced the Vision Transformer, which has basically the same structure as the language transformer. Each image is cut into non-overlapping patches that are tokenized just like words in natural language processing tasks; these tokens are then fed into the Vision Transformer in a fixed order (see figure below.) Since we are considering a classification task, only the encoder part of a typical transformer is needed.
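The patch tokenization step is simple enough to sketch directly (my own NumPy version; the real model also applies a learned linear projection and adds position embeddings to each token):

```python
import numpy as np

def image_to_patches(img, p):
    """Cut an (H, W, C) image into non-overlapping p x p patches and flatten
    each patch into a token vector, as in the Vision Transformer."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)        # (num_tokens, token_dim)

tokens = image_to_patches(np.zeros((224, 224, 3)), p=16)
print(tokens.shape)   # (196, 768): 14*14 tokens, each of dimension 16*16*3
```

A 224 x 224 image becomes a “sentence” of 196 tokens, which is why 16 x 16 patches make the quadratic attention cost manageable.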

What I found quite interesting is that the authors mention that when trained and tested on mid-sized datasets, the Vision Transformer produced modest results a few percentage points lower than a ResNet, but when pre-trained on larger datasets, it obtained state-of-the-art results on these mid-sized datasets. The reason CNNs perform better on mid-sized datasets seems to be their convolutional structure, which enforces translation invariance and locality, properties quite useful for vision tasks. The Vision Transformer does not have these constraints, so it requires more data to learn them; but when enough data is given, it can learn a more flexible structure and therefore achieve higher accuracy. This conjecture kind of makes sense given what was published in another work, “On The Relationship Between Self-Attention and Convolutional Layers.” In that paper, the authors first proved that a self-attention layer in a transformer can simulate a convolutional layer. They then looked into a trained transformer and showed that the attention layers in the network did learn a structure similar to a convolutional layer. The network, however, did not learn a uniform convolution, but a version in which the kernel size varies between layers. This seems to explain nicely why the transformer outperforms CNNs when given enough data.

Task and Motion Planning

In AI, Robotics on June 1, 2021 at 10:33 pm

By Li Yang Ku

In this post I’ll briefly go through the problem of Task and Motion Planning (TAMP) and talk about some recent works that try to tackle it. One of the main motivations for solving the TAMP problem is to allow robots to solve household tasks like the robot Rosey in the cartoon Jetsons. We are talking about more “complicated” tasks, such as grabbing a mug at the back of your shelf, a task most Roombas would just give up on. Tasks like these usually can’t be achieved in one motion, and the robot might need to move things around before being able to reach the target. Task and Motion Planning (TAMP) refers to this set of tasks that require multiple sequences of planned motions.

Traditionally, the AI community focused more on symbolic task planners that use first-order logic to answer questions. It was once thought that by combining all this logic, a machine would be as intelligent as humans; in fact, a whole industry of expert systems was once built on these assumptions (which reminds me of the current self-driving car industry.) Motion planning, on the other hand, is a robotics problem, and many approaches were invented that search for a path in the robot joint space based on heuristics. Both fields have decent solutions to their individual problems, but simply combining them won’t solve the TAMP problem. One of the reasons is that a task planner without knowledge of whether the robot can reach an object cannot plan symbolically.

It’s hard to say who coined the name TAMP or when, since it is not a problem that was hard to discover but one that sticks out if the goal is to build a robot that solves tasks. The exact words, however, seem to first appear in the title of the paper “Hierarchical Task and Motion Planning in the Now” [1], published in 2011 by Leslie Kaelbling and Tomás Lozano-Pérez, the MIT duo most famous for their work on planning in partially observable environments. In this paper, a goal is iteratively divided into subgoals based on the preconditions of the goal. For example, to place object O at region R, two preconditions are needed: 1) the robot needs to be holding O, and 2) the path to R needs to be cleared. The goal is divided until we reach a leaf goal, at which point motion planning is run and the robot executes the action. Preconditions for goals are labeled with abstraction levels to help determine which precondition should be split into subgoals first. Once the robot executes the action, the whole planning process is repeated. This approach was tested on a 2D robot in simulation on household tasks. Its main limitation seems to be that it requires actions to have well-defined preconditions.

In the paper “Combined Task and Motion Planning Through an Extensible Planner-Independent Interface Layer” [2], published in 2014, Srivastava et al. proposed an interface between off-the-shelf task planners and motion planners. One of the main difficulties in combining task and motion planning is bridging the symbolic space used in task planning with the continuous space used in motion planning. In this work, pose generators are used to create a finite set of pose references that can be used by the task planner, while the motion planner uses these generated poses to plan trajectories. The interface further updates the task planner when there are no valid trajectories. The authors demonstrated this approach on a task in which the robot PR2 has to grasp a target object on a densely cluttered table. The image below shows PR2 trying to grasp the grey object among a pile of blocking objects.

Before we go further, we need to first introduce the Planning Domain Definition Language (PDDL). PDDL was invented in 1998 to provide a platform for the International Planning Competition. It was inspired by two earlier, more action-focused languages (STRIPS, ADL) designed for robot planning. Actions are defined with pre-conditions and post-conditions, and a planning problem consists of a domain description and a problem description. The domain description describes the environment, while the problem description states what goal the planner should try to reach; the planner’s job is to find a sequence of actions defined in the domain description that satisfies the problem description. The task planner described in the previous paper [2] is a planner in the PDDL domain. Since PDDL 2.1, fluents have been part of the language. Fluents are numerically valued variables in the environment that can be modified by actions. (Fluents will be mentioned in the last paper of this post.) A lot of extensions have been proposed since the invention of PDDL, such as Probabilistic PDDL (PPDDL), which takes uncertainty into account.
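The action model PDDL inherits from STRIPS can be sketched in a few lines of Python. This is my own toy encoding (sets of ground facts, dictionaries for actions), not PDDL syntax, but the precondition/add/delete structure is the core idea:

```python
# A minimal STRIPS-style action: preconditions plus add/delete effects,
# the action model that PDDL builds on. Toy encoding with my own names.
def applicable(state, action):
    # An action can fire only if all its preconditions hold in the state.
    return action["pre"] <= state

def apply_action(state, action):
    # Post-conditions: remove the delete effects, then add the add effects.
    return (state - action["del"]) | action["add"]

pickup = {"pre": {"hand_empty", "reachable(mug)"},
          "add": {"holding(mug)"},
          "del": {"hand_empty"}}

state = {"hand_empty", "reachable(mug)"}
if applicable(state, pickup):
    state = apply_action(state, pickup)   # the robot is now holding the mug
```

A task planner searches over sequences of such actions until the goal facts are contained in the resulting state.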

Jumping back to 2020, I am going to talk about two related recent papers tackling the TAMP problem. The first paper, “PDDLStream: Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning” [3] by Garrett et al., is a publication from the same MIT group run by Leslie Kaelbling and Tomás Lozano-Pérez. It can be considered a PDDL extension, and the name comes from the newly added conditional generator called a “stream”, which can generate a sequence of outputs that satisfy given conditions. An example of a stream is a grasp stream that generates a sequence of grasp poses given an object pose. One of the main differences between this approach and the previous paper [2] by Srivastava et al. is that [2] bridges PDDL task planning and motion planning through an interface, while [3] proposes a new language that wraps off-the-shelf PDDL task planners as a subroutine. The authors of the PDDLStream paper argue that their approach offers a domain-agnostic language which can solve problems in different domains (such as an arm robot versus multi-robot rovers) without redesigning the interface. The way PDDLStream works is that at each iteration a certain number of new facts are generated by the streams from existing facts; these new facts, the existing facts, and the domain information are then fed into a PDDL solver that tries to find a solution, and the process iterates until a solution is found. To reduce the search time, levels are assigned to facts to describe how many stream evaluations are needed to certify them; this information is used to guide the stream generation process away from more complicated solutions. The authors also included a few variants that delay the evaluation of certain costly streams to speed up the search, which I will not go into detail on in this post.
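A stream in this sense maps naturally onto a Python generator. Here is a hypothetical grasp stream of my own invention (the approach-from-a-random-angle geometry is made up for illustration, not taken from the paper):

```python
import numpy as np

def grasp_stream(obj_pose, rng=None):
    """A 'stream' in the PDDLStream sense: a conditional generator that,
    given an object pose (x, y, theta), yields candidate grasp poses
    one at a time, on demand."""
    rng = rng or np.random.default_rng(0)
    x, y, theta = obj_pose
    while True:
        # Hypothetical geometry: approach from a random angle, 0.1 m away.
        angle = rng.uniform(0, 2 * np.pi)
        yield (x + 0.1 * np.cos(angle), y + 0.1 * np.sin(angle), angle)

g = grasp_stream((0.5, 0.2, 0.0))
candidates = [next(g) for _ in range(3)]   # the planner pulls more as needed
```

The key property is laziness: the planner can optimistically assume a stream will produce a usable output and only evaluate it when the candidate plan actually needs the value.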

The last paper I am going to talk about in this post is what first got my attention. This paper, “Online Replanning in Belief Space for Partially Observable Task and Motion Problems”, also by Garrett et al., builds on the PDDLStream paper [3] and is a collaboration between the MIT group and Dieter Fox, who is one of my favorite robotics professors and now director of robotics research at Nvidia (Nvidia set up a whole branch in Seattle for him, so I guess they love him more.) This paper was also advertised by Josh Tenenbaum in his keynote talk at RSS 2020. The difference from [3] is that this work tries to solve a set of partially observable TAMP problems called “Stochastic Shortest Path Problems” (SSPP). SSPP is a type of belief-space planning problem, in which planning is done over distributions of states rather than states themselves in order to handle stochasticity. An example of a belief-space planning problem would be stating that the object is currently 60% behind box A and 40% behind box B, and the goal is to execute a sequence of actions such that the object ends up 95% behind box C. This seems a rather simple example, but it can get complicated if, say, the room is dark and there is an additional option to go turn on the light so that you would be less likely to mix the object up with a similar-looking one. To solve a probabilistic SSPP with the deterministic PDDLStream solver, the authors implemented a particle-based belief representation (it’s a particle filter.) A particle represents an object pose associated with the probability of that pose being the true pose. These particles are represented by fluents, the variables that can be modified by actions in the PDDL language.
By having an update-belief stream that certifies valid belief updates and a detect action that updates the distributions over poses through a Bayesian update, probabilistic representations are symbolized and become solvable by a deterministic solver. For example, in order to find an object that is not visible, the robot may remove the object it thinks is most likely blocking it. It would then perform a detect action, and the observation would update the existing particles that represent the poses of the object. In the left image above, each cross or asterisk is a particle representing a possible pose of the green block; green marks represent higher probability while black represents lower probability. The location is uncertain because the robot is viewing from an angle at which the green block is not visible. To make the whole system work, replanning that follows previous plans and ways to defer heavy computations are also introduced in this work, which I will not dive into. The authors tested this framework in simulation and also demonstrated it on a robot in a kitchen environment, which you can see videos of here.
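The Bayesian update behind a detect action is easy to sketch with the particle representation. The sensor model below (a detector that misses a present object 10% of the time) is a made-up number for illustration:

```python
import numpy as np

def bayes_update(particles, weights, likelihood):
    """particles: candidate object poses; weights: current belief over them.
    likelihood(pose) -> p(observation | pose). A standard Bayesian update
    followed by renormalization, as in a particle filter."""
    w = weights * np.array([likelihood(p) for p in particles])
    return w / w.sum()

# Toy belief: the object is 60% behind box A, 40% behind box B.
particles = ["behind_A", "behind_B"]
belief = np.array([0.6, 0.4])

# A detect action looks behind box A and sees nothing; assume the detector
# misses a present object 10% of the time (hypothetical sensor model).
likelihood = lambda p: 0.1 if p == "behind_A" else 1.0
belief = bayes_update(particles, belief, likelihood)
# The belief shifts heavily toward behind_B.
```

In the paper these updates are certified by streams so that the symbolic planner can reason about them like any other precondition.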

In summary, there has been good progress in solving TAMP. My only complaint is that these approaches are all built on the traditional frameworks of task planning and motion planning. TAMP is a hard problem, so building on existing approaches makes it more manageable. However, given enough resources, it would be interesting to re-imagine a framework that doesn’t separate task and motion planning but treats them as the same kind of problem at different scales.

Paper Picks: RSS 2020

In AI, Paper Talk, Robotics on December 13, 2020 at 8:58 pm

by Li Yang Ku

Just like CVPR, RSS (Robotics: Science and Systems) is virtual this year and all the videos are free of charge. You can find all the papers here and the corresponding videos on the RSS YouTube page once you’ve finished bingeing Netflix, Hulu, Amazon Prime, and Disney+.

In this post, I am going to talk about a few RSS papers I found interesting. The best talk I have watched so far was, however (unsurprisingly), the keynote given by Josh Tenenbaum, who is probably one of the most charismatic speakers in the field of AI. Even though I am not a big fan of his recent “the brain is a physics engine” work, it sounds less absurd and even a bit reasonable when he says it. The talk is an inspiring high-level walkthrough of many AI research projects that try to tackle the problem of understanding intuitive physics and other aspects of the human mind. My favorite part was when Josh Tenenbaum showed a video of a baby trying to stack cylinders on top of a cat. Josh argued that machine learning approaches that fit parameters to data will not be able to generalize to an infinite number of tasks (such as placing cylinders on cats) and are quite different from how our minds model the world.

a) Tasbolat Taunyazov, Weicong Sng, Brian Lim, Hian Hian See, Jethro Kuan, Abdul Fatir Ansari, Benjamin Tee, Harold Soh. “Event-Driven Visual-Tactile Sensing and Learning for Robots” (video)

If you’ve been to a computer vision or robotics conference in the past 10 years, you have probably seen one of these event cameras (also called neuromorphic cameras) that have super low latency but only detect changes in brightness. A typical demo is to point the camera at a rotating fan and show that it can capture the individual blades. They were advertised as having great potential, but people still haven’t quite figured out how to use them. In this paper, the authors not only used an event camera but also developed a low-latency “event” tactile sensor, and used the two together to distinguish objects with different weights by grasping them.

In this work, the event camera and event tactile sensor outputs are fed into a spiking neural network (SNN). A spiking neural network is an artificial neural network inspired by biological neurons, which become active when the spikes they receive within a time window exceed a threshold. In an SNN, information is passed through spike trains in parallel, and the timing and frequency of spikes play a large role in the final outcome. Similar to convolutional neural networks (CNNs), neurons can be stacked in layers, but they perform convolution in the time dimension as well. Training is, however, a lot harder than for CNNs, since the derivative of a spike is not well defined. Read this paper if you are interested in more details on SNNs.
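The basic unit of an SNN can be sketched with a leaky integrate-and-fire neuron. This is a generic textbook model, not the specific neuron used in the paper, and the time constant and threshold below are arbitrary:

```python
import numpy as np

def lif_neuron(input_current, tau=10.0, threshold=1.0, dt=1.0):
    """Leaky integrate-and-fire neuron: the membrane potential leaks toward
    zero, accumulates incoming current, and emits a spike (then resets)
    when it crosses the threshold."""
    v, out = 0.0, []
    for s in input_current:
        v = v * np.exp(-dt / tau) + s   # leak, then integrate the input
        if v >= threshold:
            out.append(1)
            v = 0.0                      # reset after firing
        else:
            out.append(0)
    return out

# A burst of closely spaced inputs drives the neuron above threshold;
# isolated inputs leak away before a spike can form.
spikes = lif_neuron([0.4, 0.4, 0.4, 0.0, 0.0, 0.4])
```

The timing sensitivity shown here (only a dense burst triggers a spike) is exactly what makes SNNs a natural fit for low-latency event-based sensors.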

The classification is then based on which neuron spikes the most. In the figure above we can see the accuracy increase over time as more information is observed. With vision input alone, the robot can distinguish objects that look different, but not objects that look the same but have different weights. Once the haptic sensor receives more feedback while lifting the object, the combined SNN reaches a higher accuracy than either single modality.

b) Adam Allevato, Elaine Schaertl Short, Mitch Pryor, Andrea Thomaz. “Learning Labeled Robot Affordance Models Using Simulations and Crowdsourcing” (video)

Affordance can be defined as the functions an object affords an agent (see my paper for my definition.) A lot of research in this field tries to learn to identify affordances from data labeled by experts. In this work, the authors instead try to ground affordances in language through crowdsourcing. They first tried to collect data by having subjects observe a real robot performing straight-line movements toward a random location relative to an object; the subjects then had to enter the action the robot might be performing. The data collected turned out to be too noisy, so the authors instead took the verbs describing these actions collected on the real robot and used them as options in multiple-choice questions on Mechanical Turk with a simulated robot.

By counting the percentage of times two labels were chosen for the same robot action in the collected data, the authors defined a hierarchical relationship between these labeled actions based on conditional probability. The following are two hierarchies built with different thresholds. Some of them kind of make sense: for example, the generated hierarchies below show that tip is a kind of touch and flip is a kind of tip.
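One simplified reading of this construction: say “a is a kind of b” when, among the motions labeled a by someone, the fraction also labeled b exceeds a threshold. The sketch below is my own toy formalization with made-up data, not the paper’s exact procedure:

```python
from collections import Counter

def build_hierarchy(instances, threshold=0.8):
    """instances: one set of labels chosen (by different annotators) for each
    robot motion. Declare 'a is a kind of b' when the conditional frequency
    P(b chosen | a chosen) >= threshold."""
    count_a = Counter()
    count_ab = Counter()
    for labels in instances:
        for a in labels:
            count_a[a] += 1
            for b in labels:
                if a != b:
                    count_ab[(a, b)] += 1
    return {(a, b) for (a, b), n in count_ab.items()
            if n / count_a[a] >= threshold}

# Toy data: every motion labeled "tip" was also labeled "touch", but not
# vice versa, so tip ends up below touch in the hierarchy.
data = [{"tip", "touch"}, {"tip", "touch"}, {"touch"}, {"flip", "tip", "touch"}]
edges = build_hierarchy(data)
```

Varying the threshold adds or removes edges, which is how the paper obtains its two hierarchies.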

The authors also trained classifiers that take the robot arm motion and the resulting object pose change as input and output the most likely label. They showed that classifiers trained on the effect on the object perform better than classifiers trained on the robot arm motion. The authors claim this result suggests that humans may consider affordance primarily as a function of the effect on the object rather than of the action itself.

c) Hong Jun, Dylan Losey, Dorsa Sadigh. “Shared Autonomy with Learned Latent Actions” (video)

For some people with disabilities, a robot that can be easily teleoperated through a joystick would be quite helpful in daily life. However, if you have ever tried to control a robot with a joystick, you know it is no easy task. Shared autonomy tries to solve this problem by guessing what the user is trying to achieve and helping the user finish the intended action. Although this approach is convenient in settings where the robot can easily interpret the user’s plan, it does not provide options for more detailed manipulation preferences, such as where to cut a tofu. The authors address this by combining shared autonomy with latent actions.

In this work, shared autonomy is used at the start of teleoperation; once the robot is more confident about the action the user intends to execute, it gradually switches to a 2-dimensional latent-space control (e.g. z is the latent space in the figure above.) This latent space is trained with an autoencoder on training data consisting of (state, action, belief) tuples, where the belief is the robot’s belief over a set of candidate goals. The autoencoder is conditioned on state and belief, both of which are provided to the decoder at run time as shown below.
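The run-time side of this idea is a decoder that maps the user’s low-dimensional input to a full arm action, conditioned on state and belief. The sketch below is a hypothetical linear stand-in for the learned decoder (random weights, toy dimensions of my choosing), just to show the interface:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical linear decoder weights: 7-DoF action from a
# 2-D latent + 3-D state + 4-D belief input (all toy dimensions).
W = rng.standard_normal((7, 2 + 3 + 4)) * 0.1

def decode(z, state, belief):
    """Map the user's 2-D joystick input z to a 7-DoF arm action,
    conditioned on the robot state and its belief over goals."""
    return W @ np.concatenate([z, state, belief])

action = decode(np.array([0.5, -0.2]),           # 2-D joystick input
                np.zeros(3),                      # robot state (toy)
                np.array([0.7, 0.2, 0.1, 0.0]))   # belief over 4 goals
```

Because the decoder is conditioned on state and belief, the same joystick direction can mean different fine adjustments depending on which goal the robot currently believes the user is pursuing.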

The authors tested two interesting tasks: 1) an entree task in which the robot has to cut the tofu and move it to a plate, and 2) a dessert task in which the robot has to stab the marshmallow, scoop icing onto it, then dip it in rice. They showed that their approach required less time and produced fewer errors than a latent-action-only or shared-autonomy-only approach. You can see the whole task sequence in this video.