Life is a game, take it seriously

Machine Learning, Computer Vision, and Robotics

In Computer Vision, Machine Learning, Robotics on December 6, 2017 at 2:32 pm

By Li Yang Ku (Gooly)

Having TA'd for Machine Learning this semester and worked in the fields of Computer Vision and Robotics for the past few years, I always have this feeling that the more I learn the less I know. Therefore, it's sometimes good to just sit back and look at the big picture. This post will talk about how I see the relationships between these three fields at a high level.

First of all, Machine Learning is more a brand than a name. Just like Deep Learning and AI, the name is used for getting funding when the previous name is out of hype; in this case, it gained popularity after AI projects failed in the 1970s. Machine Learning therefore covers a wide range of problems and approaches that may look quite different at first glance. AdaBoost and support vector machines were the hot topics in Machine Learning when I was doing my master's degree, but now it is deep neural networks that get all the attention.

Despite the wide variety of research in Machine Learning, it usually rests on a common assumption: the existence of a set of data. The goal is then to learn a model based on this set of data. There is a wide range of variations here: the data could be labeled or unlabeled, resulting in supervised or unsupervised approaches; the data could be labeled with a category or a real number, resulting in classification or regression problems; the model can be limited to a certain form, such as a class of probability models, or can have fewer constraints, as in the case of deep neural networks. Once the model is learned, there is also a wide range of possible uses. It can be used for predicting outputs given new inputs, filling in missing data, generating new samples, or providing insights into hidden relationships between data entries. Data is so fundamental in Machine Learning that people in the field rarely ask why we should learn from data in the first place. Many datasets from different fields are collected or labeled, and the learned models are compared based on accuracy, computation speed, generalizability, etc. Machine Learning people therefore often consider Computer Vision and Robotics as areas for applying Machine Learning techniques.
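To make the supervised flavors concrete, here is a tiny sketch (using scikit-learn; the features and labels are made up purely for illustration) that fits a classifier and a regressor to the same inputs. Whether the labels are categories or real numbers is all that separates the two problems.

```python
# A minimal sketch of the data-first view of Machine Learning: the same
# set of data can drive a classification or a regression problem,
# depending only on how the labels are defined. All data here is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # 100 data points with 2 features each

# Labeled with a category -> classification (here, a support vector machine).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = SVC().fit(X, y_class)
print(clf.predict(X[:5]))

# Labeled with a real number -> regression.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:5]))
```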

Robotics, on the other hand, comes from a very different background. There is usually no data to start with in robotics; if you cannot control your robot, or if your robot crashes itself at its first move, how are you going to collect any data? Classical robotics is therefore about designing models based on physics and geometry. You build models that describe how the inputs and the current observations change the robot state; based on such a model, you can infer the input that will safely drive the robot to a desired state.
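As a toy illustration of this model-first recipe, the sketch below hand-designs a unicycle kinematics model and a simple proportional feedback law; the robot model and the gains are assumptions for illustration, not any particular system.

```python
# A minimal sketch of the classical robotics recipe: a hand-designed
# physics model of how inputs change the robot state, plus a feedback
# law that infers the input needed to reach a goal. No dataset required.
import numpy as np

def step(state, u, dt=0.1):
    """Unicycle kinematics: state = (x, y, heading), input u = (v, w)."""
    x, y, th = state
    v, w = u
    return np.array([x + v * np.cos(th) * dt,
                     y + v * np.sin(th) * dt,
                     th + w * dt])

def controller(state, goal, k_v=1.0, k_w=2.0):
    """Proportional feedback: drive toward the goal position."""
    dx, dy = goal[0] - state[0], goal[1] - state[1]
    heading_err = np.arctan2(dy, dx) - state[2]
    heading_err = np.arctan2(np.sin(heading_err), np.cos(heading_err))
    return np.array([k_v * np.hypot(dx, dy), k_w * heading_err])

state, goal = np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0])
for _ in range(100):
    state = step(state, controller(state, goal))
print(state)  # ends up near the goal
```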

Once you can command your robot to reach a certain state, a wide variety of problems emerge. The robot has to do obstacle avoidance and path planning to reach a given goal. You may need to find a goal state that satisfies a set of constraints while optimizing a set of properties. Simultaneous localization and mapping (SLAM) may be needed if no map is given. In addition, sensor fusion is required when multiple sensors with different properties are used. There may also be uncertainty in the robot state, where belief space planning can be helpful. For robots with a gripper, you may also need to identify stable grasps and recognize the type and pose of an object for manipulation. And of course, there is a whole different set of problems in designing the mechanics and hardware of the robot. Unlike in Machine Learning, many of these problems are solved without a set of data. However, most of them (excluding mechanical and hardware problems) share a common goal of determining the robot input based on feedback. Some roboticists view robotics as the field with the ultimate goal of creating machines that act like humans, and see Machine Learning and Computer Vision as fields that can provide methods to help accomplish that goal.

The field of Computer Vision started under AI in the 1960s with the goal of helping robots achieve intelligent behaviors, but left that goal behind in the internet era, when tons of images online were waiting to be classified. In this age, Computer Vision applications are no longer restricted to physical robots. In the past decade, the field has been driven by datasets. The implicit agreement to evaluate on standardized datasets helped the field advance at a reasonably fast pace (at the cost of millions of grad student hours spent tweaking models for a 1% improvement.) Given these datasets, the field of Computer Vision inevitably left the Robotics community and embraced data-driven Machine Learning approaches. Most Computer Vision problems share a common goal of learning models for visual data; the model is then used to do classification, clustering, sample generation, etc. on images or videos. The big picture of Computer Vision can be seen in my previous post. Some Computer Vision scientists consider vision different from the other senses and believe that the development of vision is fundamental to the evolution of intelligence (which could be true... experiments do show that 50% of our brain neurons are vision related.)

Nowadays, Computer Vision and Machine Learning are deeply entangled: Machine Learning techniques help foster Computer Vision solutions, while successful models in Computer Vision contribute back to the field of Machine Learning. For example, the success story of Deep Learning started with Machine Learning models being applied to the ImageNet challenge and ended up with a wide range of architectures that can be applied to other problems in Machine Learning. On the other hand, Robotics is a field Computer Vision folks are gradually moving back to. Based on the recent success of data-driven approaches, several well-known Computer Vision scientists, such as Jitendra Malik, have started to consider how Computer Vision can help the field of Robotics, since their conversations with Robotics colleagues were mostly about vision not working.


Paper Picks: ICRA 2017

In Computer Vision, deep learning, Machine Learning, Paper Talk, Robotics on July 31, 2017 at 1:04 pm

by Li Yang Ku (Gooly)

I was at ICRA (International Conference on Robotics and Automation) in Singapore this June to present one of my works. Surprisingly, the computer vision track seemed to draw a lot of interest from the robotics community: the four computer vision sessions were the most crowded of all the sessions I attended. The following are a few papers related to computer vision and deep learning that I found quite interesting.

a) Schmidt, Tanner, Richard Newcombe, and Dieter Fox. “Self-supervised visual descriptor learning for dense correspondence.”

In this work, a self-supervised learning approach is introduced for generating dense visual descriptors with convolutional neural networks. Given a set of RGB-D videos of Schmidt, the first author, wandering around, training data can be generated automatically by using Kinect Fusion to track feature points between frames. A pixel-wise contrastive loss is used so that two pixels belonging to the same model point have similar descriptors.
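A rough sketch of what such a pixel-wise contrastive loss could look like in PyTorch is below; the margin value and the way correspondences are fed in are my assumptions, not the authors' exact implementation.

```python
# A sketch of a pixel-wise contrastive loss: pull descriptors of matched
# pixels together, push non-matches apart up to a margin. The margin and
# data layout are illustrative assumptions.
import torch

def pixelwise_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """desc_a, desc_b: (N, D) descriptors sampled from two frames.
    matches / non_matches: index pairs (i, j) that tracking says do /
    do not correspond to the same model point."""
    loss = 0.0
    for i, j in matches:        # same model point: minimize descriptor distance
        loss = loss + (desc_a[i] - desc_b[j]).pow(2).sum()
    for i, j in non_matches:    # different points: hinge loss on the distance
        d = (desc_a[i] - desc_b[j]).norm()
        loss = loss + torch.clamp(margin - d, min=0.0).pow(2)
    return loss / (len(matches) + len(non_matches))

desc_a = torch.randn(8, 3, requires_grad=True)   # toy 3-D descriptors
desc_b = torch.randn(8, 3, requires_grad=True)
loss = pixelwise_contrastive_loss(desc_a, desc_b,
                                  matches=[(0, 0), (1, 1)],
                                  non_matches=[(0, 5), (2, 7)])
loss.backward()
```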

Kinect Fusion cannot associate points across videos; however, with training data from within each video alone, the authors show that the learned descriptors of the same model point (such as the tip of the nose) are similar across videos. This can be explained by the hypothesis that, with enough data, a model point trajectory will inevitably come near a trajectory of the same model point in another video. By chaining these trajectories, clusters of the same model point can be separated even without labels. The figure above visualizes the learned features with colors; note that a similar mapping is learned across videos despite there being no training signal across videos.

b) Pavlakos, Georgios, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. “6-dof object pose from semantic keypoints.”

In this work, semantic keypoints predicted by convolutional neural networks are combined with a deformable shape model to estimate the pose of object instances or objects of the same class. Given a single RGB image of an object, a set of class-specific keypoints is first identified through a CNN trained on labeled feature point heat maps. A fitting problem that maps these keypoints to keypoints on the 3D model is then solved using a deformable model that captures the shape variability within the class. The figure above shows some pretty good results on recognizing the same features on objects of the same class.
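To make the fitting step concrete, here is a simplified sketch that recovers a weak-perspective pose and shape coefficients from 2D keypoints with SciPy; the shapes, deformation basis, and optimizer setup are illustrative assumptions rather than the paper's exact formulation.

```python
# A simplified sketch of keypoint-to-model fitting: given CNN-detected
# 2D keypoints and a deformable 3D model (mean shape plus basis), solve
# for rotation, translation, scale, and shape coefficients. Toy values.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = 10                                  # number of semantic keypoints
mean_shape = np.random.randn(K, 3)      # class mean shape (toy values)
basis = np.random.randn(2, K, 3)        # 2 deformation directions
keypoints_2d = np.random.randn(K, 2)    # detected keypoints (toy values)

def residuals(params):
    rotvec, t, s, c = params[:3], params[3:5], params[5], params[6:]
    shape = mean_shape + np.tensordot(c, basis, axes=1)   # deformed 3D shape
    # Weak-perspective projection: rotate, drop depth, scale, translate.
    projected = s * Rotation.from_rotvec(rotvec).apply(shape)[:, :2] + t
    return (projected - keypoints_2d).ravel()

x0 = np.concatenate([np.zeros(3), np.zeros(2), [1.0], np.zeros(2)])
fit = least_squares(residuals, x0)
print(fit.cost)   # reprojection error after fitting
```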

The CNN used in this work is the stacked hourglass architecture, in which two hourglass modules are stacked together. The hourglass module was introduced in the paper "Newell, Alejandro, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016." An hourglass module is similar to a fully convolutional neural network but with residual modules, which the authors claim makes it more balanced between downsampling and upsampling. Stacking multiple hourglass modules allows repeated bottom-up, top-down inference, which improves on state-of-the-art performance.
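Below is a bare-bones PyTorch sketch of a single hourglass module, with residual skip branches kept at every resolution; the channel counts and depth are placeholders rather than the paper's configuration.

```python
# A bare-bones hourglass module: recursive downsampling and upsampling
# with a residual skip branch at every resolution, so the output has the
# same spatial size as the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return F.relu(x + self.conv(x))

class Hourglass(nn.Module):
    def __init__(self, ch, depth=4):
        super().__init__()
        self.skip = Residual(ch)                 # branch kept at this resolution
        self.down = Residual(ch)
        self.inner = Hourglass(ch, depth - 1) if depth > 1 else Residual(ch)
        self.up = Residual(ch)
    def forward(self, x):
        y = self.down(F.max_pool2d(x, 2))        # bottom-up (downsample)
        y = self.up(self.inner(y))
        y = F.interpolate(y, scale_factor=2)     # top-down (upsample)
        return self.skip(x) + y

hg = Hourglass(ch=16)
print(hg(torch.randn(1, 16, 64, 64)).shape)      # same spatial size out
```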

c) Sung, Jaeyong, Ian Lenz, and Ashutosh Saxena. “Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories.”

In this work, point cloud, natural language, and manipulation trajectory data are mapped to a shared embedding space using a neural network. For example, given the point cloud of an object and a set of instructions as input, the network should map them to a region in the embedding space that is close to the trajectory that performs the instructed action. Instead of taking the whole point cloud as input, a segmentation step that decides which part of the object to manipulate based on the instruction is first executed. Given this shared embedding space, the trajectory closest to where the input point cloud and language map to can be executed at test time.
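A toy sketch of this retrieval-by-embedding idea is below; the feature sizes and the simple fully connected encoders are stand-ins for the actual point cloud, language, and trajectory networks.

```python
# A toy sketch of a shared embedding space: separate encoders map the
# scene (point cloud segment + instruction features) and candidate
# trajectories into one space; the nearest trajectory is executed.
import torch
import torch.nn as nn

embed_dim = 16
scene_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, embed_dim))
traj_net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, embed_dim))

scene = torch.randn(1, 32)          # made-up scene + language features
trajectories = torch.randn(5, 20)   # 5 candidate manipulation trajectories

z_scene = scene_net(scene)
z_traj = traj_net(trajectories)
closest = torch.cdist(z_scene, z_traj).argmin()   # trajectory to execute
print(closest.item())
```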

In order to learn a semantically meaningful embedding space, a loss-augmented cost that considers the similarity between different types of trajectories is used. The results show that the network puts similar groups of actions, such as pushing a bar and moving a cup to a nozzle, close to each other in the embedding space.
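One hedged way to picture this loss-augmented cost: the margin by which an incorrect trajectory must be pushed away grows with how dissimilar it is from the correct one, so semantically similar trajectories are allowed to stay close. The sketch below is my own illustration of that idea, not the paper's exact formulation.

```python
# A rough sketch of a loss-augmented hinge cost: each wrong trajectory
# should be farther from the scene embedding than the correct one by a
# margin that scales with trajectory dissimilarity. All values are toy.
import torch

def loss_augmented(z_scene, z_traj, correct, traj_dissimilarity):
    d = torch.cdist(z_scene, z_traj).squeeze(0)    # distances to candidates
    viol = torch.clamp(d[correct] + traj_dissimilarity - d, min=0.0)
    viol[correct] = 0.0                            # no penalty on the match
    return viol.sum()

z_scene = torch.randn(1, 16)                       # scene embedding (toy)
z_traj = torch.randn(5, 16)                        # 5 trajectory embeddings
dissim = torch.tensor([0.0, 0.2, 0.2, 1.0, 1.0])   # similar pushes vs. unrelated
print(loss_augmented(z_scene, z_traj, correct=0, traj_dissimilarity=dissim))
```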

d) Finn, Chelsea, and Sergey Levine. “Deep visual foresight for planning robot motion.”

In this work, a video prediction model based on a convolutional LSTM (long short-term memory) is used to predict the pixel flow transformation from the current frame to the next frame for a non-prehensile manipulation task. The model takes the current image, the end-effector pose, and a future action, and predicts the image at the next time step. The predicted image is then fed back into the network recursively to generate subsequent images. The network is learned from 50,000 pushing examples of hundreds of objects collected from 10 robots.
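Schematically, the recursive rollout could look like the sketch below, where predict_next_frame stands in for the learned conv-LSTM flow model (not reproduced here) and the action's effect on the pose is an assumption for illustration.

```python
# A schematic sketch of the recursive rollout: the model's own predicted
# frame is fed back in to look several steps ahead under a candidate
# action sequence.
import numpy as np

def predict_next_frame(frame, pose, action):
    # Placeholder: the real model outputs pixel-flow maps used to warp
    # the current frame into the next one.
    return frame  # identity stand-in

def rollout(frame, pose, actions):
    frames = []
    for a in actions:
        frame = predict_next_frame(frame, pose, a)   # feed prediction back in
        pose = pose + a                              # assumed effect on pose
        frames.append(frame)
    return frames

frames = rollout(np.zeros((64, 64, 3)), np.zeros(2), actions=[np.ones(2)] * 5)
print(len(frames))
```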

At test time, the user specifies where certain pixels on an object should move to; the robot then uses the model to determine the actions most likely to reach the target, using an optimization algorithm that samples actions over several iterations. Some of the results are shown in the figure above: the first column shows the interface where the user specifies the goal, with red markers at the starting pixel positions and green markers of the same shape at the goal positions. Each row shows a sequence of actions taken to reach the specified target.
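The sampling loop could be pictured as the cross-entropy-method-style sketch below; the pixel-motion cost function is a stand-in for querying the video prediction model, and all numbers are made up.

```python
# A minimal sketch of sampling-based action optimization: sample action
# sequences, score them by how close the designated pixel is predicted
# to land to its goal, and refit the sampling distribution to the best
# samples for a few iterations.
import numpy as np

def predicted_pixel_cost(action_seq, start=np.array([5.0, 5.0]),
                         goal=np.array([8.0, 2.0])):
    # Stand-in: pretend each action displaces the designated pixel directly;
    # the real cost comes from the video prediction rollout.
    return np.linalg.norm(start + action_seq.sum(axis=0) - goal)

rng = np.random.default_rng(0)
horizon = 3
mean, std = np.zeros((horizon, 2)), np.ones((horizon, 2))
for _ in range(4):                                    # a few refit iterations
    samples = mean + std * rng.normal(size=(100, horizon, 2))
    costs = np.array([predicted_pixel_cost(s) for s in samples])
    elites = samples[np.argsort(costs)[:10]]          # keep the best 10
    mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
print(mean.sum(axis=0))                               # ~ goal - start = (3, -3)
```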

Looking Into Neuron Activities: Light Controlled Mice and Crystal Skulls

In brain, Neural Science, Paper Talk, Serious Stuffs on April 2, 2017 at 9:50 pm

by Li Yang Ku (Gooly)

It might feel like there hasn't been much progress in brain theories recently; we still know very little about how signals are processed in our brain. However, scientists have moved away from sticking electrical probes into cat brains and have become quite creative in monitoring brain activities.

Optogenetics techniques, first tested in the early 2000s, allow researchers to activate a neuron in a live brain with light. By controlling the light that activates motor neurons in a mouse, scientists can control its movement remotely, creating the "remote controlled mouse" you might have heard of in some not-so-popular sci-fi novels. This is achieved by taking the DNA segment of an alga that produces light-sensitive proteins and inserting it into specific brain neurons of the mouse using viral vectors. When light is shed on this protein, it opens its ion channel and activates the neuron. The result is pretty cool, but not as precise as your remote control car, yet. (see video below)

Besides Optogenetics techniques, which are used to understand the function of a neuron by actively triggering it, methods for monitoring neuron activities directly have also become quite exciting, such as genetically modified mice whose brain neurons glow when activated. These approaches, which use fluorescent markers to monitor the level of calcium in the cell, can be traced back to the green fluorescent protein introduced by Chalfie in 1994. With fluorescent indicators that bind with calcium, researchers could actually see brain activity for the first time. A lot of progress has been made on improving these markers since; in 2007 a group at Harvard introduced the "Brainbow," which can generate up to 90 different fluorescent colors. This allowed scientists to identify neuron connections much more easily and also helped them win a few photo contests.

To better observe these fluorescent protein sensors (calcium imaging), a 2016 publication further introduced the "crystal skull," an approach that replaces the top of the skull of a genetically modified mouse with a curved glass window. This quite fancy approach allows researchers to monitor the activity of half a million brain neurons in a live mouse by mounting a fluorescence macroscope on top of the crystal skull.

References:

Chalfie, Martin. “Green fluorescent protein as a marker for gene expression.” Trends in Genetics 10.5 (1994): 151.

Madisen, Linda, et al. “Transgenic mice for intersectional targeting of neural sensors and effectors with high specificity and performance.” Neuron 85.5 (2015): 942-958.

Huang, Z. Josh, and Hongkui Zeng. "Genetic approaches to neural circuits in the mouse." Annual Review of Neuroscience 36 (2013): 183-215.

Kim, Tony Hyun, et al. “Long-Term Optical Access to an Estimated One Million Neurons in the Live Mouse Cortex.” Cell Reports 17.12 (2016): 3385-3394.