Life is a game, take it seriously

Task and Motion Planning

In AI, Robotics on June 1, 2021 at 10:33 pm

By Li Yang Ku

In this post I’ll briefly go through the problem of Task and Motion Planning (TAMP) and talk about some recent works that try to tackle it. One of the main motivations for solving the TAMP problem is to allow robots to solve household tasks like the robot Rosey in the cartoon The Jetsons. We are talking about more “complicated” tasks such as grabbing a mug at the back of your shelf, a task most Roombas would just give up on. Tasks like these usually can’t be achieved in one motion, and the robot might need to move things around before being able to reach the target. Task and Motion Planning (TAMP) refers to this set of tasks that require multiple sequences of planned motions.

Traditionally, the AI community focused more on symbolic task planners that use first-order logic to answer questions. It was once thought that by combining all this logic, a machine would be as intelligent as humans. In fact, a whole industry of expert systems was once built on these assumptions (which reminds me of the current self-driving car industry.) On the other hand, motion planning is a robotics problem, and many approaches that search for a path in the robot joint space based on heuristics were invented. Both fields have decent solutions to their individual problems, but simply combining them together won’t solve the TAMP problem. One of the reasons is that a task planner cannot plan symbolically without knowing whether the robot can reach an object.

It’s hard to say who coined the name TAMP or when, since it is not a problem that is hard to discover but one that sticks out as soon as the goal is to build a robot that solves tasks. The exact words, however, seem to first appear in the title of the paper “Hierarchical Task and Motion Planning in the Now” [1] published in 2011 by Leslie Kaelbling and Tomás Lozano-Pérez, the MIT duo most famous for their work on planning in partially observable environments. In this paper, a goal is iteratively divided into subgoals based on the preconditions of the goal. For example, to place object O at region R, two preconditions are needed: 1) the robot needs to be holding O, and 2) the path to R needs to be cleared. The goal is divided until we reach a leaf goal, at which point motion planning is run and the robot executes the action. Preconditions for goals are labeled with abstraction levels to help determine which precondition should be split into subgoals first. Once the robot executes the action, the whole planning process is run again. This approach was tested on a 2D robot in simulation on household tasks. The main limitation of this approach seems to be that it requires actions to have well-defined preconditions.
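To make the decomposition concrete, below is a minimal, self-contained sketch (with hypothetical names like Goal and plan_in_the_now, not the paper's actual code) of the recursive idea: achieve the most abstract unsatisfied precondition first, and only run motion planning at the leaves.

```python
# A minimal sketch of the recursive goal decomposition described above.
# Goals keep a list of preconditions tagged with abstraction levels; achieving a
# goal once its preconditions hold is where motion planning would run.

class Goal:
    def __init__(self, name, level=0, preconditions=()):
        self.name = name
        self.level = level              # abstraction level used to order subgoals
        self.preconditions = list(preconditions)

    def satisfied(self, state):
        return self.name in state

def plan_in_the_now(goal, state):
    """Recursively achieve preconditions, then achieve the goal as a leaf action."""
    if goal.satisfied(state):
        return state
    # Tackle the most abstract unsatisfied precondition first.
    for pre in sorted(goal.preconditions, key=lambda g: g.level, reverse=True):
        state = plan_in_the_now(pre, state)
    # Leaf step: in the real system this is where motion planning runs and the
    # robot executes the trajectory before replanning from the new state.
    print(f"motion-plan and execute: {goal.name}")
    return state | {goal.name}

# Example from the text: "place O at R" needs "holding O" and "path to R cleared".
holding = Goal("holding(O)", level=2)
cleared = Goal("path(R) cleared", level=1)
place = Goal("at(O, R)", level=0, preconditions=[holding, cleared])
plan_in_the_now(place, state=set())
```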

In the paper “Combined Task and Motion Planning Through an Extensible Planner-Independent Interface Layer” [2] published in 2014, Srivastava et al. proposed an interface between off-the-shelf task planners and motion planners. One of the main difficulties in combining task and motion planning is bridging the symbolic space used in task planning with the continuous space used in motion planning. In this work, pose generators are used to create a finite set of candidate poses that can be referenced in task planning, while the motion planner uses these generated poses to plan trajectories. The interface further updates the task planner when there are no valid trajectories. The authors demonstrated this approach on a task in which the PR2 robot has to grasp a target object on a densely cluttered table. The image below shows the PR2 trying to grasp the grey object among a pile of blocking objects.
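Here is a rough sketch, under made-up names, of what such an interface might look like: a pose generator turns a symbolic reference into a finite set of concrete candidate poses, and the interface reports back to the task planner when none of them yields a feasible trajectory.

```python
# A rough sketch of the interface idea with placeholder geometry and names.
import random

def pose_generator(obj, n=5, seed=0):
    """Generate a finite set of candidate grasp poses (random placeholders here)."""
    rng = random.Random(seed)
    return [(obj, rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]

def motion_plan(pose):
    """Stand-in motion planner: succeeds only for 'reachable' poses."""
    _, x, y = pose
    return f"trajectory to ({x:.2f}, {y:.2f})" if x > 0 else None

def refine_symbolic_action(obj):
    """Try each candidate pose; return a trajectory or a failure for the task planner."""
    for pose in pose_generator(obj):
        traj = motion_plan(pose)
        if traj is not None:
            return traj
    return None  # the interface tells the task planner this action is infeasible as stated

print(refine_symbolic_action("grey_object"))
```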

Before we go further we need to first introduce the Planning Domain Definition Language (PDDL). PDDL was invented in 1998 to provide a platform for the International Planning Competition. It was inspired by two earlier, more action-focused languages designed for robotics planning (STRIPS and ADL). Actions are defined with preconditions and postconditions, and a planning problem consists of a domain description and a problem description. The domain description describes the environment while the problem description states what goal the planner should try to reach. The goal is to find a sequence of actions described in the domain description that satisfies the problem description. The task planner described in the previous paper [2] is a planner in the PDDL domain. Since PDDL 2.1, fluents have been part of the language. Fluents are numerically valued variables in the environment that can be modified by actions. (Fluents will be mentioned in the last paper of this post.) A lot of extensions have been proposed since the invention of PDDL, such as Probabilistic PDDL (PPDDL), which takes uncertainty into account.
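To make these concepts concrete, here is a Python-flavored sketch (not actual PDDL syntax; all predicate and action names are illustrative) of an action with preconditions and effects, a domain description, and a problem description.

```python
# A Python-flavored sketch of the PDDL concepts above: an action with
# preconditions and effects, a domain (the available actions), and a problem
# (initial state plus goal). Predicates are plain strings for simplicity.

pick = {
    "name": "pick",
    "parameters": ["?obj"],
    "preconditions": {"hand-empty", "reachable(?obj)"},
    "effects_add": {"holding(?obj)"},
    "effects_del": {"hand-empty"},
}

domain = {"actions": [pick]}                    # what the robot can do
problem = {
    "init": {"hand-empty", "reachable(mug)"},   # how the world starts
    "goal": {"holding(mug)"},                   # what the planner must achieve
}

def applicable(action, state, binding):
    ground = lambda preds: {p.replace("?obj", binding) for p in preds}
    return ground(action["preconditions"]) <= state

def apply(action, state, binding):
    ground = lambda preds: {p.replace("?obj", binding) for p in preds}
    return (state - ground(action["effects_del"])) | ground(action["effects_add"])

state = problem["init"]
if applicable(pick, state, "mug"):
    state = apply(pick, state, "mug")
print(problem["goal"] <= state)   # True: the one-step plan [pick(mug)] reaches the goal
```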

Jumping back to 2020, I am going to talk about two related recent papers tackling the TAMP problem. The first paper, “PDDLStream: Integrating Symbolic Planners and Blackbox Samplers via Optimistic Adaptive Planning” [3] by Garrett et al., is a publication from the same MIT group led by Leslie Kaelbling and Tomás Lozano-Pérez. It can be considered a PDDL extension, and the name comes from the newly added conditional generator called a “stream”, which can generate a sequence of outputs that satisfy conditions. An example of a stream is a grasp stream that generates sequences of grasp poses given an object pose. One of the main differences between this approach and the previous paper [2] by Srivastava et al. is that in [2] the authors bridge PDDL task planning and motion planning through an interface, while in [3] a new language is proposed that wraps off-the-shelf PDDL task planners as a subroutine. The authors of the PDDLStream paper argue that their approach offers a domain-agnostic language which can solve problems in different domains (such as an arm robot versus multi-robot rovers) without redesigning the interface. The way PDDLStream works is that at each iteration a certain number of new facts are generated by the streams from existing facts; these new facts, along with the existing facts and the domain information, are then fed into a PDDL solver that tries to find a solution. This process iterates until a solution is found. To reduce the search time, levels are assigned to facts describing how many stream evaluations are needed to certify them. This information is used to guide the stream generation process so it does not go for more complicated solutions first. The authors also included a few variants that delay the evaluation of certain costly streams to speed up the search process, which I will not go into in this post.
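The following is a loose sketch of the stream idea: a conditional generator that, given an existing fact such as an object pose, lazily yields new certified facts such as candidate grasps, with the planner pulling a few more outputs each iteration. The grasp math and names are placeholders, not the PDDLStream API.

```python
# A loose sketch of a "stream": a conditional generator from existing facts
# (an object pose) to new certified facts (candidate grasps around it).
import itertools, math

def grasp_stream(obj, obj_pose):
    """Yield an endless sequence of grasp poses around the object."""
    for i in itertools.count():
        angle = (i * math.pi / 4) % (2 * math.pi)
        grasp = (obj_pose[0] + 0.1 * math.cos(angle),
                 obj_pose[1] + 0.1 * math.sin(angle),
                 angle)
        # Each yielded value comes with the certified fact the planner can use.
        yield ("grasp", obj, grasp)

# Optimistic, level-by-level evaluation: pull a few facts per iteration, hand the
# growing fact set to an off-the-shelf PDDL planner, and stop once a plan is found.
facts = {("at", "mug", (0.5, 0.2))}
stream = grasp_stream("mug", (0.5, 0.2))
for level in range(3):
    facts.update(itertools.islice(stream, 2))    # evaluate two more stream outputs
    print(f"level {level}: {len(facts)} facts")  # a real system would call the planner here
```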

The last paper I am going to talk about in this post is what first got my attention. This paper, “Online Replanning in Belief Space for Partially Observable Task and Motion Problems”, also by Garrett et al., builds on the previous PDDLStream paper [3] and is a collaboration between the MIT group and Dieter Fox, who is one of my favorite robotics professors and now a director of robotics research at Nvidia (Nvidia set up a whole branch in Seattle for him, so I guess they love him more.) This paper was also advertised by Josh Tenenbaum in his keynote talk at RSS 2020. The difference versus [3] is that this work tries to solve a set of partially observable TAMP problems called “Stochastic Shortest Path Problems” (SSPP). SSPP is a type of belief space planning problem, in which, instead of planning over states, planning is done over distributions of states in order to handle stochasticity. An example of a belief space planning problem would be stating that the current state of an object is 60% behind box A and 40% behind box B, with the goal of executing a sequence of actions such that the object would be 95% behind box C. This may seem like a rather simple example, but it can get complicated if, say, the room is dark and there is an additional option to go turn on the light so that you would be less likely to mix up the object with a similar-looking one. To solve a probabilistic SSPP problem with a deterministic PDDLStream solver, the authors implemented a particle-based belief representation (it’s a particle filter.) A particle represents an object pose associated with the probability of this pose being the true pose. These particles are represented by fluents, the variables that can be modified by actions in the PDDL language. By having an update-belief stream that certifies valid belief updates and a detect action that results in distributions over poses being updated through a Bayesian update, probabilistic representations are symbolized and become solvable by a deterministic solver. For example, in order to find an object that is not visible, the robot may remove the object it thinks is most likely blocking it. It would then perform a detect action whose observation updates the existing particles that represent the poses of the object. In the left image above, each cross or asterisk represents a particle, a possible pose of the green block. Green marks represent higher probability while black ones represent lower probability. There is uncertainty about the location because the robot is viewing from an angle at which the green block is not visible. To make the whole system work, replanning that follows previous plans and ways to defer heavy computations are also introduced in the work, which I will not dive into. The authors tested this framework in simulation and also demonstrated it on a robot in a kitchen environment, for which you can see videos here.
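A small sketch of the particle-based belief and the Bayesian detect update might look like the following; the regions and the sensor model here are invented for illustration.

```python
# Each particle is a candidate pose of the hidden object with a weight, and a
# "detect" action does a Bayesian update of the weights given an observation.

particles = {          # pose -> probability of being the true pose of the object
    "behind_A": 0.6,
    "behind_B": 0.4,
}

def detect(particles, observed_region, p_hit=0.9, p_miss=0.1):
    """Bayesian update for the case where the object was NOT seen in observed_region."""
    likelihood = {pose: (p_miss if pose == observed_region else p_hit)
                  for pose in particles}
    unnormalized = {pose: w * likelihood[pose] for pose, w in particles.items()}
    total = sum(unnormalized.values())
    return {pose: w / total for pose, w in unnormalized.items()}

# The robot looks behind box A and does not find the object; belief shifts to B.
particles = detect(particles, observed_region="behind_A")
print(particles)   # roughly {'behind_A': 0.14, 'behind_B': 0.86}
```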

In summary, we have made some good progress in solving TAMP. My only complaint would be that these approaches seem to be based on traditional frameworks of task planning and motion planning. TAMP is a hard problem, so building on existing approaches makes the problem more manageable. However, if we do have enough resources, it would be interesting to re-imagine a framework that doesn’t separate task and motion planning but instead treats them as the same kind of problem at different scales.

Paper Picks: RSS 2020

In AI, Paper Talk, Robotics on December 13, 2020 at 8:58 pm

by Li Yang Ku

Just like CVPR, RSS (Robotics: Science and Systems) is virtual this year and all the videos are free of charge. You can find all the papers here and the corresponding videos on the RSS YouTube page once you have finished bingeing Netflix, Hulu, Amazon Prime, and Disney+.

In this post, I am going to talk about a few RSS papers I found interesting. The best talk I have watched so far was, however, (unsurprisingly) the keynote given by Josh Tenenbaum, who is probably one of the most charismatic speakers in the field of AI. Even though I am not a big fan of his recent “brain is a physics engine” work, it sounds less absurd and even a bit reasonable when he says it. This talk is an inspiring high-level walkthrough of many AI research projects that try to tackle the problem of understanding intuitive physics and other aspects of the human mind. My favorite part of this talk was when Josh Tenenbaum showed a video of a baby trying to stack cylinders on top of a cat. Josh argued that machine learning approaches that fit parameters to data will not be able to generalize to an infinite number of tasks (such as placing cylinders on cats) and are quite different from how our minds model the world.

a) Tasbolat Taunyazov, Weicong Sng, Brian Lim, Hian Hian See, Jethro Kuan, Abdul Fatir Ansari, Benjamin Tee, Harold Soh. “Event-Driven Visual-Tactile Sensing and Learning for Robots” (video)

If you’ve been to a computer vision or robotics conference in the past 10 years, you have probably seen one of these event cameras (also called neuromorphic cameras) that have super low latency but only detect changes in brightness. A typical demo would be to point the camera at a rotating fan and show that it can capture the individual blades. It was advertised as having great potential, but people still haven’t quite figured out how to use it yet. In this paper, the authors not only used an event camera but also developed a low-latency “event” tactile sensor, and used them to detect objects with different weights by grasping them.

In this work, the event camera and event tactile sensor outputs are fed into a spiking neural network (SNN). A spiking neural network is an artificial neural network inspired by biological neurons, which become active when they receive spikes exceeding a threshold within a time window. In an SNN, information is passed through spike trains in parallel, and the timing and frequency of spikes play a large role in the final outcome. Similar to convolutional neural networks (CNNs), neurons can be stacked in layers, but they also perform convolution in the time dimension. Training is, however, a lot harder than for CNNs since the derivative of a spike is not well defined. Read this paper if you are interested in knowing more details about SNNs.
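To make the description concrete, here is a toy leaky integrate-and-fire neuron (one common SNN neuron model; the parameters are arbitrary): incoming spikes accumulate on a decaying membrane potential, and the neuron fires whenever the potential crosses a threshold.

```python
# A toy leaky integrate-and-fire neuron: spikes accumulate on a decaying membrane
# potential, and the neuron emits a spike and resets when a threshold is crossed.

def lif_neuron(spike_train, threshold=1.0, decay=0.8, weight=0.4):
    potential, out = 0.0, []
    for spike_in in spike_train:          # one entry per time step (0 or 1)
        potential = decay * potential + weight * spike_in
        if potential >= threshold:
            out.append(1)                 # fire...
            potential = 0.0               # ...and reset
        else:
            out.append(0)
    return out

# Dense input spikes drive the neuron over threshold; sparse ones decay away first.
print(lif_neuron([1, 1, 1, 1, 1, 0, 0, 1, 0, 0]))
print(lif_neuron([1, 0, 0, 0, 1, 0, 0, 0, 1, 0]))
```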

The classification task is then based on which neuron has the most spikes. In the figure above we can see the accuracy increase over time as more information is observed. With just vision input, the robot can distinguish objects that look different but not objects that look the same but have different weights. Once the haptic sensor receives more feedback while lifting the object, the combined SNN reaches a higher accuracy than either single modality.

b) Adam Allevato, Elaine Schaertl Short, Mitch Pryor, Andrea Thomaz. “Learning Labeled Robot Affordance Models Using Simulations and Crowdsourcing” (video)

Affordance can be defined as the functions an object affords an agent (see my paper for my definition.) A lot of research in this field tries to learn to identify affordances based on data labeled by experts. In this work, the authors instead try to ground affordances to language through crowdsourcing. The authors first tried to collect data by having subjects observe a real robot performing straight-line movements toward a random location relative to an object. The subjects then had to enter the action the robot might be performing. The data collected turned out to be too noisy, so what the authors did instead was to take the verbs describing these actions collected on the real robot and use them as options for multiple-choice questions on Mechanical Turk with a simulated robot.

By counting how often a pair of labels was chosen for the same robot action in the collected data, the authors came up with a way to define a hierarchical relationship between these action labels based on conditional probability. The following are two hierarchies built with different thresholds. Some of them kind of make sense. For example, the generated hierarchies below show that tip is a kind of touch and flip is a kind of tip.
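A rough sketch of this kind of counting might look like the following, with invented annotation counts and a simplified rule: if label a is almost always accompanied by the more common label b on the same robot action, treat a as a kind of b.

```python
# Build a label hierarchy from co-occurrence counts: "a" is a kind of "b" if
# P(b | a) is above a threshold and "b" is the more frequent label overall.
from collections import Counter

# Each entry: the set of verbs annotators chose for one simulated robot action.
annotations = [
    {"touch", "tip"}, {"touch", "tip", "flip"}, {"touch"},
    {"touch", "tip"}, {"tip", "flip"}, {"touch"},
]

single = Counter(lbl for labels in annotations for lbl in labels)
pair = Counter((a, b) for labels in annotations for a in labels for b in labels if a != b)

def parents(threshold=0.6):
    edges = []
    for (a, b), count in pair.items():
        if count / single[a] >= threshold and single[b] > single[a]:
            edges.append((a, b))   # "a" is a kind of "b"
    return edges

print(parents())   # e.g. [('tip', 'touch'), ('flip', 'tip')]
```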

The authors also trained classifiers that take the robot arm motion and the resulting object pose change as input and output the most likely label. They showed that classifiers trained on the effect on the object perform better than classifiers trained on the robot arm motion. The authors claimed that this result suggests humans may consider affordance primarily as a function of the effect on the object rather than the action itself.

c) Hong Jun Jeon, Dylan Losey, Dorsa Sadigh. “Shared Autonomy with Learned Latent Actions” (video)

For some people with disabilities, a robot that can be easily teleoperated through a joystick would be quite helpful in daily life. However, if you have ever tried to control a robot with a joystick, you would know it is no easy task. Shared autonomy tries to solve this problem by guessing what the user is trying to achieve and helping the user finish the intended action. Although this approach is convenient in settings where the robot can easily interpret the user’s plan, it does not provide options for more detailed manipulation preferences, such as where to cut a piece of tofu. The authors try to address this by combining shared autonomy with latent actions.

In this work, shared autonomy is used at the start of the teleoperation; once the robot has higher confidence in the action the user intends to execute, it gradually switches to a 2-dimensional latent-space control (e.g., z is the latent space in the figure above.) This latent space is trained with an autoencoder using training data consisting of (state, action, belief) tuples, where belief is the robot’s belief over a set of candidate goals. The autoencoder is conditioned on state and belief, both of which are provided to the decoder at run time as shown below.
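A bare-bones sketch of such a conditional autoencoder, assuming PyTorch is available and with made-up dimensions, might look like this: the encoder maps (action, state, belief) to a 2-D latent z, and the decoder maps (z, state, belief) back to a full robot action.

```python
import torch
import torch.nn as nn

ACTION_DIM, STATE_DIM, BELIEF_DIM, LATENT_DIM = 7, 10, 3, 2   # made-up sizes

encoder = nn.Sequential(
    nn.Linear(ACTION_DIM + STATE_DIM + BELIEF_DIM, 32), nn.ReLU(),
    nn.Linear(32, LATENT_DIM),
)
decoder = nn.Sequential(
    nn.Linear(LATENT_DIM + STATE_DIM + BELIEF_DIM, 32), nn.ReLU(),
    nn.Linear(32, ACTION_DIM),
)

def reconstruct(action, state, belief):
    # Both the encoder and decoder are conditioned on state and belief.
    z = encoder(torch.cat([action, state, belief], dim=-1))
    return decoder(torch.cat([z, state, belief], dim=-1))

# Training minimizes reconstruction error; at run time only the decoder is used,
# with z coming from the user's 2-D joystick input.
action = torch.randn(1, ACTION_DIM)
state, belief = torch.randn(1, STATE_DIM), torch.randn(1, BELIEF_DIM)
loss = nn.functional.mse_loss(reconstruct(action, state, belief), action)
loss.backward()
```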

The authors tested on two interesting tasks: 1) an entree task in which the robot has to cut the tofu and move it to a plate, and 2) a dessert task in which the robot has to stab the marshmallow, scoop it in icing, then dip it in rice. They showed that their approach required less time and had fewer errors compared to a latent-space-only or shared-autonomy-only approach. You can see the whole task sequence in this video.

Paper Picks: CVPR 2020

In AI, Computer Vision, deep learning, Paper Talk, vision on September 7, 2020 at 6:30 am
by Li Yang Ku (Gooly)

CVPR is virtual this year for obvious reasons, and if you did not pay the $325 registration fee to attend this ‘prerecorded’ live event, you can now have a similar experience by watching all the recorded videos on their YouTube channel for free. Of course it’s not exactly the same, since you are losing out on the virtual chat room networking experience, but honestly speaking, computer vision parties are often awkward in person already and I can’t imagine you missing much. Before we go through my paper picks, let’s look at the trends first. The graph below shows the accepted paper counts by topic this year.

CVPR 2020 stats

And the following are the stats for CVPR 2019:

CVPR 2019 stats

These numbers cannot be directly compared since the categories are not exactly the same; for example, deep learning, which had the most submissions in 2019, is no longer a category (it isn’t going to be a very useful category when every paper is about deep learning.) The distributions of these two graphs look quite similar. However, if I had to analyze it at gunpoint, I would say the following:

  1. Recognition is still the most popular application for computer vision.
  2. The new category “Transfer/Low-shot/Semi/Unsupervised Learning” is the most popular problem to solve with deep networks.
  3. Despite being a controversial technology, more people are working on face recognition. For some countries this is probably still where most of the money is distributed.
  4. The new category “Efficient training and inference methods for networks” shows that there is an effort to push for practical use of neural networks.
  5. Based on this other statistic, it seems that the keywords ‘graph’, ‘representation’, and ‘cloud’ doubled from last year. This is consistent with my observation that people are exploring 3D data more, since the research space for 2D images is the most crowded and competitive.

Now for my random paper picks:

a) Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. “CvxNet: Learnable Convex Decomposition” (video)

This Google Research paper introduces a new representation for 3D shapes that can be learned by neural networks and used by physics engines directly. In the paper, the authors mention that there are two types of 3D representations: 1) explicit representations such as meshes. These representations can be used directly in many applications, such as physics simulations, because they contain information about the surface; explicit representations are however hard to learn with neural networks. The other type is 2) implicit representations such as voxel grids. Voxel grids can be learned by neural networks since the task can be considered a classification problem that labels each voxel empty or not; however, turning these voxel grids into a mesh is quite expensive. The authors therefore introduce this convex decomposition representation that represents a 3D shape with a union of convex parts. Since a convex shape can be represented by a set of hyperplanes that draw the boundary of the shape, learning it becomes a classification problem while retaining the benefit of having information about the shape boundary. This representation is therefore both implicit and explicit. The authors also demonstrated that a learned CvxNet is able to generate 3D shapes from 2D images with much better success compared to other approaches, as shown below.
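The following toy example (with a hand-written unit cube rather than learned hyperplanes) shows why a union of convexes is both implicit and explicit: checking whether a point is inside is just a few dot products, while the same hyperplanes bound the surface.

```python
# A convex as a set of hyperplanes: inside means n·x + d <= 0 for every plane.
import numpy as np

# Each row: (normal_x, normal_y, normal_z, offset) for one face of a unit cube.
cube = np.array([
    [ 1, 0, 0, -0.5], [-1, 0, 0, -0.5],
    [ 0, 1, 0, -0.5], [ 0,-1, 0, -0.5],
    [ 0, 0, 1, -0.5], [ 0, 0,-1, -0.5],
], dtype=float)

def inside_convex(point, planes):
    return np.all(planes[:, :3] @ point + planes[:, 3] <= 0)

def inside_shape(point, convexes):
    # A shape is the union of its convex parts.
    return any(inside_convex(point, c) for c in convexes)

print(inside_shape(np.array([0.2, 0.0, 0.3]), [cube]))   # True: inside the cube
print(inside_shape(np.array([0.9, 0.0, 0.0]), [cube]))   # False: outside
```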

b) Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman. “Ego-Topo: Environment Affordances From Egocentric Video” (video)

Environment Affordance

This paper on predicting an environment’s affordances is a collaboration between UT Austin’s computer vision group and Facebook AI Research. This paper caught my eye since my dissertation was also about affordances using a graph-like structure. If you are not familiar with the word “affordance”, it’s a controversial word made up to describe what action/function an object/environment affords a person/robot.

In this work, the authors argue that the space in which an action takes place is important to understanding first-person videos. Traditional approaches to classifying actions in videos usually just take a chunk of the video and generate a representation for classification, while SLAM (simultaneous localization and mapping) approaches that try to recover the exact 3D structure of the environment often fail when humans move too fast. Instead, this work learns a network that classifies whether two views belong to the same space. Based on this information, a graph can be created in which each node represents a space and stores the corresponding videos, and the edges between nodes represent the action sequences that happened between these spaces. The videos within a node can then be used to predict what an environment affords. The authors further trained a graph convolutional network that takes neighboring nodes into account to predict the next action in the video. The authors showed that taking the underlying space into account benefits both tasks.
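A much-simplified sketch of building such a topological graph might look like the following, where a stub same_space function stands in for the learned view-similarity network.

```python
# Views that match an existing node are merged into it, otherwise a new node is
# created; consecutive views in different nodes add action edges between them.

def same_space(view_a, view_b):
    return view_a["room"] == view_b["room"]      # placeholder for the learned classifier

def build_topo_graph(video):
    nodes, edges, prev = [], [], None
    for view in video:
        match = next((i for i, n in enumerate(nodes)
                      if same_space(view, n["views"][0])), None)
        if match is None:
            nodes.append({"views": [view]})
            match = len(nodes) - 1
        else:
            nodes[match]["views"].append(view)
        if prev is not None and prev != match:
            edges.append((prev, match, view["action"]))   # action taken between spaces
        prev = match
    return nodes, edges

video = [
    {"room": "counter", "action": "chop"},
    {"room": "sink", "action": "wash"},
    {"room": "counter", "action": "chop"},
]
nodes, edges = build_topo_graph(video)
print(len(nodes), edges)   # 2 nodes; edges counter->sink and sink->counter
```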

c) Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta. “Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects” (video)

Use the force, Luke (Yoda meme)

This paper would probably have won the best title award for this conference if there were one. This work is about estimating the forces humans apply to objects in a video. Arguably, if robots could estimate forces applied to objects, it would be quite useful for performing tasks and predicting human intentions. However, personally I don’t think this is how humans understand the world, and it may be solving a harder problem than needed. Having said that, this is still an interesting paper worth discussing.

Estimating force and contact points

The difficulty of this task is that the ground truth forces applied to objects cannot be easily obtained. Instead of figuring out how to obtain this data, the authors use a physics simulator to simulate the outcome of applying the force and then compare keypoints annotated in the next frame against the keypoint locations of the simulated outcome, using the difference as a signal to train the network. Contact points are also predicted by a separate network with annotated data. The figure above shows this training scheme. Note that estimating gradients through a non-differentiable physics simulator is possible by looking at how the result changes when each dimension is perturbed a little bit. The authors show this approach is able to obtain reasonable results on a collected dataset and can be extended to novel objects.
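Here is a tiny illustration of that finite-difference trick with a stand-in simulator and loss: the gradient with respect to each force dimension is estimated by nudging that dimension and re-simulating.

```python
# Finite-difference gradients through a black-box, non-differentiable simulator.
import numpy as np

def simulate(force):
    """Black-box stand-in: returns predicted keypoint positions for an applied force."""
    return np.array([force[0] * 0.5, force[1] * 0.5 - 0.1])

def keypoint_loss(force, target_keypoints):
    return np.sum((simulate(force) - target_keypoints) ** 2)

def finite_difference_grad(force, target_keypoints, eps=1e-4):
    grad = np.zeros_like(force)
    for i in range(len(force)):
        bumped = force.copy()
        bumped[i] += eps                      # perturb one force dimension
        grad[i] = (keypoint_loss(bumped, target_keypoints)
                   - keypoint_loss(force, target_keypoints)) / eps
    return grad

force = np.array([1.0, 2.0])
target = np.array([0.2, 0.6])                 # annotated keypoints in the next frame
print(finite_difference_grad(force, target))  # gradient fed back to train the network
```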

d) Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song. “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization” (video)

This is a Google Brain paper that tries to find a better architecture for object detection tasks, which would benefit from more spatial information. For segmentation tasks, the typical architecture has an hourglass-shaped encoder-decoder structure that first scales the resolution down and then scales it back up to predict pixel-wise results. The authors argue that this type of neural network with a scale-decreasing backbone may not be the best solution for tasks in which localization is also important.

Left: ResNet, Right: Permute last 10 blocks

The idea is then to permute the order of the layers of an existing network such as ResNet and see if this results in a better architecture. To avoid having to try out all combinations, the authors used Neural Architecture Search (basically another network) to learn which architecture would be better. The result is an architecture that has mixed resolutions and many skip connections that reach further (image above). The authors showed that with this architecture they were able to outperform prior state-of-the-art results, and the same network was also able to achieve good results on datasets other than the one it was trained on.