Life is a game, take it seriously

Paper Picks: ICRA 2017

In Computer Vision, deep learning, Machine Learning, Paper Talk, Robotics on July 31, 2017 at 1:04 pm

by Li Yang Ku (Gooly)

I was at ICRA (International Conference on Robotics and Automation) in Singapore this June to present some of my work. Surprisingly, the computer vision track seems to have gained a lot of interest in the robotics community. The four computer vision sessions were the most crowded of all the sessions I attended. The following are a few papers related to computer vision and deep learning that I found quite interesting.

a) Schmidt, Tanner, Richard Newcombe, and Dieter Fox. “Self-supervised visual descriptor learning for dense correspondence.”

In this work, a self-supervised learning approach is introduced for generating dense visual descriptors with convolutional neural networks. Given a set of RGB-D videos of Schmidt, the first author, wandering around, training data can be generated automatically by using Kinect Fusion to track feature points between frames. A pixel-wise contrastive loss is used so that two points that belong to the same model point have similar descriptors.
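To make the pixel-wise contrastive loss a bit more concrete, here is a minimal sketch of the standard margin-based form for a single pair of pixels. The toy descriptors, the margin value, and the function names are my own assumptions for illustration; the exact formulation in the paper may differ.

```python
import math

def descriptor_distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(desc_a, desc_b, same_point, margin=0.5):
    """Contrastive loss for one pixel pair (margin is an assumed value).

    same_point=True  -> any distance is penalized (descriptors are pulled together).
    same_point=False -> only distances below the margin are penalized (pushed apart).
    """
    d = descriptor_distance(desc_a, desc_b)
    if same_point:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Toy 3-D descriptors for pixels sampled from two frames (hypothetical values).
nose_frame1 = [0.9, 0.1, 0.0]
nose_frame2 = [0.8, 0.2, 0.0]
ear_frame2 = [0.0, 0.1, 0.9]

# Matching pair: tracked to the same model point by the fusion step.
loss_match = contrastive_loss(nose_frame1, nose_frame2, same_point=True)
# Non-matching pair: different model points, already further apart than the margin.
loss_non_match = contrastive_loss(nose_frame1, ear_frame2, same_point=False)
print(loss_match, loss_non_match)
```

In training, this per-pair term would be summed over many matching and non-matching pixel pairs sampled from the tracked frames.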

Kinect Fusion cannot associate points between videos. However, even with training data drawn only from within the same video, the authors show that the learned descriptors of the same model point (such as the tip of the nose) are similar across videos. This can be explained by the hypothesis that, with enough data, a model point trajectory will inevitably come close to the trajectory of the same model point in another video. By chaining these trajectories, clusters of the same model point can be separated even without labels. The figure above visualizes the learned features with colors. Note that the network learns a similar mapping across videos despite having no training signal across videos.

b) Pavlakos, Georgios, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. “6-dof object pose from semantic keypoints.”

In this work, semantic keypoints predicted by convolutional neural networks are combined with a deformable shape model to estimate the pose of object instances or objects of the same class. Given a single RGB image of an object, a set of class-specific keypoints is first identified through a CNN that is trained on labeled feature point heat maps. A fitting problem that maps these keypoints to keypoints on the 3D model is then solved using a deformable model that captures shape variability. The figure above shows some pretty good results on recognizing the same features on objects of the same class.
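As a rough illustration of the first step, the sketch below reads one 2D keypoint out of each predicted heat map by taking the location of the strongest response. The keypoint names and heat map values are made up, and treating the peak score as a confidence for the later fitting step is my assumption, not something stated above.

```python
def heatmap_peak(heatmap):
    """Return (row, col, score) of the highest response in a 2-D heat map."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best[2]:
                best = (r, c, score)
    return best

def extract_keypoints(heatmaps):
    """Map each keypoint name to its peak location and peak score."""
    return {name: heatmap_peak(hm) for name, hm in heatmaps.items()}

# Hypothetical 3x3 heat maps, one per semantic keypoint of a car.
heatmaps = {
    "front_left_wheel": [[0.0, 0.1, 0.0],
                         [0.2, 0.9, 0.1],
                         [0.0, 0.1, 0.0]],
    "roof_corner": [[0.7, 0.1, 0.0],
                    [0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.0]],
}

keypoints = extract_keypoints(heatmaps)
for name, (row, col, score) in keypoints.items():
    print(name, row, col, score)
```

The resulting 2D locations are what the fitting step would then match against keypoints on the 3D deformable model.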

The CNN used in this work is the stacked hourglass architecture, where two hourglass modules are stacked together. The hourglass module was introduced in the paper “Newell, Alejandro, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. ECCV, 2016.” An hourglass module is similar to a fully convolutional neural network but with residual modules, which the authors claim make it more balanced between downsampling and upsampling. Stacking multiple hourglass modules allows repeated bottom-up, top-down inference, which improves on state-of-the-art performance.

c) Sung, Jaeyong, Ian Lenz, and Ashutosh Saxena. “Deep Multimodal Embedding: Manipulating Novel Objects with Point-clouds, Language and Trajectories.”

In this work, point cloud, natural language, and manipulation trajectory data are mapped to a shared embedding space using a neural network. For example, given the point cloud of an object and a set of instructions as input, the neural network should map them to a region in the embedding space that is close to the trajectory that performs the instructed action. Instead of taking the whole point cloud as input, a segmentation process that decides which part of the object to manipulate based on the instruction is executed first. Based on this shared embedding space, the trajectory closest to where the input point cloud and language map to can be executed at test time.
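The test-time lookup can be sketched as a nearest-neighbor search in the shared space. Everything below is a toy stand-in: the 3-D embeddings and trajectory names are invented, and cosine similarity is only an assumed choice of similarity measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_trajectory(query_embedding, trajectory_embeddings):
    """Return the name of the stored trajectory most similar to the query."""
    return max(
        trajectory_embeddings,
        key=lambda name: cosine_similarity(query_embedding, trajectory_embeddings[name]),
    )

# Hypothetical embeddings of previously demonstrated trajectories.
trajectory_embeddings = {
    "push_bar": [0.9, 0.1, 0.0],
    "turn_knob": [0.0, 1.0, 0.1],
    "pull_handle": [0.1, 0.0, 1.0],
}

# Hypothetical embedding of a new (point cloud, instruction) pair.
query = [0.8, 0.2, 0.1]

best = closest_trajectory(query, trajectory_embeddings)
print(best)
```

In the real system the query embedding would come from the network applied to the segmented point cloud and the instruction, rather than being written by hand.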

In order to learn a semantically meaningful embedding space, a loss-augmented cost that considers the similarity between different types of trajectories is used. The results show that the network puts similar groups of actions, such as pushing a bar and moving a cup to a nozzle, close to each other in the embedding space.

d) Finn, Chelsea, and Sergey Levine. “Deep visual foresight for planning robot motion.”

In this work, a video prediction model based on a convolutional LSTM (long short-term memory) is used to predict the pixel flow transformation from the current frame to the next frame for a non-prehensile manipulation task. The model takes the input image, the end-effector pose, and a future action, and predicts the image at the next time step. The predicted image is then fed back into the network recursively to generate the following image. The network is trained on 50,000 pushing examples of hundreds of objects collected from 10 robots.

For each test, the user specifies where certain pixels on an object should move to. The robot then uses the model to determine the actions that will most likely reach the target, using an optimization algorithm that samples actions over several iterations. Some of the results are shown in the figure above. The first column shows the interface where the user specifies the goal. The red markers are the starting pixel positions and the green markers of the same shape are the goal positions. Each row shows a sequence of actions taken to reach the specified target.
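Here is a small sketch of what such a sampling-based planner could look like. It is not the authors' implementation: the learned video prediction model is replaced by a trivial stand-in where the tracked pixel moves by the commanded displacement, and the cross-entropy-style refitting of the sampling distribution, along with all parameter values, is an assumption on my part.

```python
import random

def predict_pixel(position, action):
    """Stand-in for the learned prediction model (assumption for illustration):
    the tracked pixel simply moves by the commanded displacement."""
    return (position[0] + action[0], position[1] + action[1])

def rollout_cost(start, goal, actions):
    """Squared distance between the predicted final pixel position and the goal."""
    pos = start
    for action in actions:
        pos = predict_pixel(pos, action)
    return (pos[0] - goal[0]) ** 2 + (pos[1] - goal[1]) ** 2

def plan(start, goal, horizon=3, samples=200, elites=20, iterations=5, seed=0):
    """Sample action sequences, keep the best ones, refit the sampler, repeat."""
    rng = random.Random(seed)
    mean = [[0.0, 0.0] for _ in range(horizon)]
    std = [[2.0, 2.0] for _ in range(horizon)]
    best = None
    for _ in range(iterations):
        candidates = []
        for _ in range(samples):
            actions = [
                (rng.gauss(mean[t][0], std[t][0]), rng.gauss(mean[t][1], std[t][1]))
                for t in range(horizon)
            ]
            candidates.append((rollout_cost(start, goal, actions), actions))
        candidates.sort(key=lambda c: c[0])
        top = candidates[:elites]
        if best is None or top[0][0] < best[0]:
            best = top[0]
        # Refit the per-step Gaussian to the lowest-cost samples.
        for t in range(horizon):
            for k in range(2):
                values = [c[1][t][k] for c in top]
                m = sum(values) / len(values)
                var = sum((v - m) ** 2 for v in values) / len(values)
                mean[t][k] = m
                std[t][k] = max(var ** 0.5, 1e-3)
    return best

cost, actions = plan(start=(0.0, 0.0), goal=(3.0, -2.0))
print(cost, actions)
```

On the robot, only the first action of the best sequence would typically be executed before replanning from the new image, but that part is left out here.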

Looking Into Neuron Activities: Light Controlled Mice and Crystal Skulls

In brain, Neural Science, Paper Talk, Serious Stuffs on April 2, 2017 at 9:50 pm

by Li Yang Ku (Gooly)

It might feel like there hasn’t been much progress in brain theories recently; we still know very little about how signals are processed in our brain. However, scientists have moved away from sticking electrical probes into cat brains and have become quite creative at monitoring brain activity.

Optogenetic techniques, which were first tested in the early 2000s, allow researchers to activate a neuron in a live brain with light. By controlling the light that activates motor neurons in a mouse, scientists can control its movement remotely, creating a “remote controlled mouse” that you might have heard of in some not-so-popular sci-fi novels. This is achieved by taking the DNA segment of an alga that produces light-sensitive proteins and inserting it into specific brain neurons of the mouse using viral vectors. When light shines on this protein, it opens its ion channel and activates the neuron. The result is pretty cool, but not as precise as your remote control car, yet. (see video below)

Besides the optogenetic techniques that are used to understand the function of a neuron by actively triggering it, methods for monitoring neuron activity directly have also become quite exciting, such as using genetically modified mice with brain neurons that glow when activated. These approaches, which use fluorescent markers to monitor the level of calcium in the cell, can be traced back to the green fluorescent proteins introduced by Chalfie et al. in 1994. With fluorescent indicators that bind with calcium, researchers can actually see brain activity for the first time. A lot of progress has been made on improving these markers since; in 2007 a group at Harvard introduced the “Brainbow”, which can generate up to 90 different fluorescent colors. This allowed scientists to identify neuron connections a lot more easily and also helped them win a few photo contests.

To better observe these fluorescent protein sensors (calcium imaging), a publication in 2016 introduced the “crystal skull”, an approach that replaces the top of the skull of a genetically modified mouse with curved glass. This rather fancy approach allows researchers to monitor the activity of half a million neurons in a live mouse by mounting a fluorescence macroscope on top of the crystal skull.


Chalfie, Martin. “Green fluorescent protein as a marker for gene expression.” Trends in Genetics 10.5 (1994): 151.

Madisen, Linda, et al. “Transgenic mice for intersectional targeting of neural sensors and effectors with high specificity and performance.” Neuron 85.5 (2015): 942-958.

Huang, Z. Josh, and Hongkui Zeng. “Genetic approaches to neural circuits in the mouse.” Annual Review of Neuroscience 36 (2013): 183-215.

Kim, Tony Hyun, et al. “Long-Term Optical Access to an Estimated One Million Neurons in the Live Mouse Cortex.” Cell Reports 17.12 (2016): 3385-3394.


Generative Adversarial Nets: Your Enemy is Your Best Friend?

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 20, 2017 at 7:10 pm

by Li Yang Ku (Gooly)

Generating realistic images with machines has always been one of the top items on my list of difficult tasks. Past attempts in the computer vision community were only able to produce blurry images at best. The well-publicized Google DeepDream project was able to generate some interesting artsy images; however, they were modified from existing images and were designed more to make you feel like you are on drugs than to look realistic. Recently (2016), a work that combines the generative adversarial network framework with convolutional neural networks (CNNs) generated some results that look surprisingly good. (A non-vision person would likely not be amazed, though.) This approach was quickly accepted by the community and was referenced more than 200 times in less than a year.

This work is based on an interesting concept first introduced by Goodfellow et al. in the paper “Generative Adversarial Nets” at NIPS 2014. The idea was to have two neural networks compete with each other. One would try to generate images that are as realistic as possible, and the other would try its best to distinguish them from real images. In theory, this competition reaches a global optimum where the generated images and the real images belong to the same distribution (this can be a lot trickier in practice, though). The 2014 work got some pretty good results on digits and faces, but the generated natural images were still quite blurry (see figure above).
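The competition between the two networks is usually written as a two-player minimax game. Here G is the generator, D is the discriminator, z is the random input to the generator, p_data is the distribution of real images, and p_z is the distribution z is sampled from:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to push D(x) toward 1 for real images and D(G(z)) toward 0 for generated ones, while the generator tries to do the opposite for its own samples. At the global optimum the generated distribution matches p_data and the discriminator can do no better than output 1/2 everywhere.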

In the more recent work “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford, Metz, and Chintala, convolutional neural networks and the generative adversarial net framework are successfully combined with a few techniques that help stabilize the training. With this approach, the generated images are sharp and surprisingly realistic at first glance. The figures above are some of the generated bedroom images. Notice that if you look closer, some of them look weird.

The authors further explored what the latent variables represent. Ideally, the generator (the neural network that generates images) should disentangle independent features, and each latent variable should represent a meaningful concept. By modifying these variables, images with different characteristics can be generated. Note that these latent variables are the input given to the generator and were randomly sampled from a uniform distribution in the previous examples. The figure above shows an example where the authors demonstrate, through arithmetic operations, that the latent variables do represent meaningful concepts. If you subtract the average latent variables of men without glasses from the average latent variables of men with glasses and add the average latent variables of women without glasses, you obtain a latent vector that results in women with glasses when passed through the generator. This process identifies the latent variables that represent glasses.
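The arithmetic itself is just averaging and adding vectors. Below is a toy sketch with made-up 3-D latent vectors, where I pretend the first dimension encodes glasses and the second encodes gender; real latent vectors have many more dimensions and are not this cleanly separated. The generator is left out, since only the vector arithmetic is being illustrated.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    """Element-wise sum u + v."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical latent vectors that produced each kind of image.
men_with_glasses = [[1.0, 1.0, 0.2], [1.0, 1.0, 0.4]]
men_without_glasses = [[0.0, 1.0, 0.1], [0.0, 1.0, 0.5]]
women_without_glasses = [[0.0, -1.0, 0.2], [0.0, -1.0, 0.0]]

# (men with glasses) - (men without glasses) isolates the "glasses" direction.
glasses_direction = subtract(mean_vector(men_with_glasses),
                             mean_vector(men_without_glasses))

# Adding it to the average woman without glasses gives the new latent vector,
# which would then be passed through the generator.
women_with_glasses = add(glasses_direction, mean_vector(women_without_glasses))
print(glasses_direction, women_with_glasses)
```

Averaging over several samples before doing the arithmetic matters because a single latent vector also carries unrelated details, which the averaging helps to cancel out.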