Life is a game, take it seriously

Archive for the ‘Computer Vision’ Category

Vicarious Publications

In AI, Computer Vision, deep learning, Machine Learning, Neural Science, Paper Talk on January 22, 2023 at 9:26 am

by Li Yang Ku

I worked at Vicarious, a robotics AI startup, from mid 2018 until it was acquired by Alphabet in 2022. Vicarious was founded before the deep learning boom and approached AI through a more neuroscience-based graphical model path. Nowadays it is rare for an AI startup not to wave the deep learning flag, but Vicarious stuck with its own ideology despite all the recent successes of neural network approaches. This post is about a few research publications by my former colleagues at Vicarious and how they lie along the path to AGI (artificial general intelligence). Although Vicarious no longer exists, many authors of the following publications have joined DeepMind and are continuing the same line of research.

a) George, Dileep, Wolfgang Lehrach, Ken Kansky, Miguel Lázaro-Gredilla, Christopher Laan, Bhaskara Marthi, Xinghua Lou et al. “A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs.” Science 358, no. 6368 (2017)

This publication in Science was one of Vicarious’ key contributions. In this work, the authors showed that the recursive cortical network (RCN), a hierarchical graphical model that can model contours in an image, is much better at solving CAPTCHAs (those annoying letters you need to enter to prove you are human) than deep learning approaches. RCN is a template-based approach that models edges and how they connect with nearby edges using graphical models. This allows it to generalize to a variety of changes with very little data, whereas deep learning approaches are usually more data hungry and sensitive to variations they weren’t trained on. One benefit of using graphical models is that they can do inference on occlusions between digits through a series of forward and backward passes. In CAPTCHA tests there are usually local ambiguities. A single bottom-up forward pass can generate a bunch of proposals, but to resolve the conflicts between them, a top-down backward pass to the low-level features is needed. Although it is possible to expand this forward-backward iteration into a very long forward pass in a neural network (which we will talk about in the query training paper below), the graphical model approach is a lot more interpretable in general.

b) Kansky, Ken, Tom Silver, David A. Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. “Schema networks: Zero-shot transfer with a generative causal model of intuitive physics.” In International conference on machine learning. (2017)

This work can be seen as Vicarious’ response to DeepMind’s Deep Q-Networks (DQN) approach, which gained great publicity by beating Atari games. One of the weaknesses of DQN-like approaches is generalizing beyond their training experiences. The authors showed that DQN agents trained on the regular Breakout game failed to generalize to variations of the game, such as when the paddle is slightly higher than in the original game. The authors argue that this is because the agent lacks knowledge of the causality of the world it is operating in. This work introduces the Schema Network, which assumes the world is modeled by many entities, each with binary attributes representing its type and position. In these noiseless game environments, there are perfect causal rules that model how entities behave by themselves or interact with each other. These rules (schemas) can be iteratively identified through linear programming relaxation given a set of past experiences. With the learned rules, the schema network is a probabilistic model in which planning can be done by setting the future reward to 1 and performing belief propagation on the model. This approach was shown to generalize to variations of the Atari Breakout game on which state-of-the-art deep RL models failed.

c) Lázaro-Gredilla, Miguel, Wolfgang Lehrach, Nishad Gothoskar, Guangyao Zhou, Antoine Dedieu, and Dileep George. “Query training: Learning a worse model to infer better marginals in undirected graphical models with hidden variables.” In Proceedings of the AAAI Conference on Artificial Intelligence. (2021)

In this paper, a neural network is used to mimic the loopy belief propagation (LBP) algorithm that is commonly used to do inference on probabilistic graphical models. LBP calculates the marginals of each variable through a loopy message passing algorithm. At each time step, messages about the probability of each variable are passed between neighboring factors and variables. What is interesting is that LBP can be unrolled into a multi-layer feedforward neural network in which each layer represents one iteration of the algorithm. By training with different queries (partially observed evidence), the model learns to estimate the marginal probability of unobserved variables. This approach is based on the observation that there are two sources of error when using probabilistic graphical models: 1) error when learning the (factor) parameters of the model, and 2) error when doing inference on a learned model given partially observed evidence. The proposed approach, Query Training, optimizes the predicted marginals directly. Even though the learned parameters may result in a worse model, the predicted marginals can actually be better. Another major contribution of this work is a training process that considers the distribution of the queries. Hence, the learned model can be used to estimate the marginal probability of any variable given any partial evidence.
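To make the "unrolling" idea more concrete, here is a minimal sketch of sum-product message passing on a pairwise model, written so that each pass through the outer loop plays the role of one layer. The function names, the data layout, and the brute-force checker are my own illustration and not code from the paper; nothing is learned here, the sketch only shows the fixed computation that query training would make trainable.

```python
import itertools

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def belief_propagation(unary, pairwise, n_iters):
    """Sum-product message passing on a pairwise model.

    unary:    {var: [phi(0), phi(1), ...]}
    pairwise: {(i, j): table}, table[xi][xj] is the potential on edge i-j
    Each of the n_iters rounds corresponds to one unrolled layer.
    """
    # messages[(i, j)] is the message sent from variable i to variable j
    messages = {}
    for (i, j) in pairwise:
        messages[(i, j)] = [1.0] * len(unary[j])
        messages[(j, i)] = [1.0] * len(unary[i])

    def potential(i, j, xi, xj):
        if (i, j) in pairwise:
            return pairwise[(i, j)][xi][xj]
        return pairwise[(j, i)][xj][xi]

    for _ in range(n_iters):
        new_messages = {}
        for (i, j) in messages:
            out = []
            for xj in range(len(unary[j])):
                total = 0.0
                for xi in range(len(unary[i])):
                    # unary term times all messages into i, except the one from j
                    incoming = unary[i][xi]
                    for (k, target) in messages:
                        if target == i and k != j:
                            incoming *= messages[(k, i)][xi]
                    total += incoming * potential(i, j, xi, xj)
                out.append(total)
            new_messages[(i, j)] = normalize(out)
        messages = new_messages

    beliefs = {}
    for i in unary:
        b = list(unary[i])
        for (k, target) in messages:
            if target == i:
                b = [bx * m for bx, m in zip(b, messages[(k, i)])]
        beliefs[i] = normalize(b)
    return beliefs

def brute_force_marginals(unary, pairwise):
    """Exact marginals by enumerating every assignment (tiny models only)."""
    variables = sorted(unary)
    marginals = {v: [0.0] * len(unary[v]) for v in variables}
    ranges = [range(len(unary[v])) for v in variables]
    for assignment in itertools.product(*ranges):
        x = dict(zip(variables, assignment))
        weight = 1.0
        for v in variables:
            weight *= unary[v][x[v]]
        for (i, j), table in pairwise.items():
            weight *= table[x[i]][x[j]]
        for v in variables:
            marginals[v][x[v]] += weight
    return {v: normalize(m) for v, m in marginals.items()}
```

On a graph without loops (a chain, for example) the beliefs match the exact marginals once enough rounds have run; on loopy graphs they are only approximations, which is where learning the parameters for better marginals becomes attractive.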

d) George, Dileep, Rajeev V. Rikhye, Nishad Gothoskar, J. Swaroop Guntupalli, Antoine Dedieu, and Miguel Lázaro-Gredilla. “Clone-structured graph representations enable flexible learning and vicarious evaluation of cognitive maps.” Nature communications 12, no. 1 (2021)

This work introduces the clone-structured cognitive graph (CSCG), an extension of the cloned HMM introduced in another Vicarious work, “Learning higher-order sequential structure with cloned HMMs”, published in 2019. A cloned Hidden Markov Model (CHMM) is a Hidden Markov Model with an enforced sparsity structure that maps multiple hidden states (clones) to the same emission state. Clones of the same observation can help discover higher-order temporal structure. For example, you may have a room with two corners that look the same while their surrounding areas do not. Having two hidden states that each represent one of these corners can model what you would see when moving around much more accurately than having a single hidden state represent both observations. By pre-allocating a fixed number of clones per observation, the Expectation Maximization (EM) algorithm is able to learn how best to use these clones to model a sequence of observations. CSCG is simply a CHMM with actions. The chosen action becomes part of the transition function, and the model can then learn a spatial structure by simply observing sequential data and the corresponding action at each time step.
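Here is a small sketch of why clones help, using the forward algorithm on a hand-built cloned HMM. The transition tables are set by hand for illustration (in the papers they are learned with EM), and the function and variable names are my own. The key structural point is that every hidden state emits exactly one observation, so only the clones of the current observation can be active.

```python
import math

def chmm_log_likelihood(observations, transition, clone_to_obs, initial):
    """Forward algorithm for a cloned HMM.

    transition[i][j] = P(next hidden state j | current hidden state i)
    clone_to_obs[i]  = the single observation emitted by hidden state i
    initial[i]       = P(first hidden state is i)
    """
    n = len(clone_to_obs)
    # Only clones of the first observation can be active at t = 0.
    alpha = [initial[i] if clone_to_obs[i] == observations[0] else 0.0
             for i in range(n)]
    log_lik = 0.0
    for t, obs in enumerate(observations):
        if t > 0:
            alpha = [
                sum(alpha[i] * transition[i][j] for i in range(n))
                if clone_to_obs[j] == obs else 0.0
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

sequence = list("ABACABAC")  # what follows "A" depends on what came before it

# Two clones of "A": A1 -> B -> A2 -> C -> A1
cloned = chmm_log_likelihood(
    sequence,
    transition=[[0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0]],
    clone_to_obs=["A", "A", "B", "C"],
    initial=[0.25, 0.25, 0.25, 0.25],
)

# One hidden state per observation: after "A", B and C are equally likely
single = chmm_log_likelihood(
    sequence,
    transition=[[0, 0.5, 0.5], [1, 0, 0], [1, 0, 0]],
    clone_to_obs=["A", "B", "C"],
    initial=[1 / 3, 1 / 3, 1 / 3],
)
```

The cloned model assigns the sequence a much higher likelihood because each clone of "A" remembers which context it is in, which is the same mechanism as the two look-alike corners in the room example.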

What is interesting is that the activation of hidden states in a CSCG can explain place cell activations in rat experiments that were previously puzzling. Place cells in the hippocampus were named place cells because they were previously thought to represent specific locations in space. However, more recent experiments show that some place cells seem to encode routes toward goals instead of spatial locations. In an experiment in which rats were trained to circle a square maze for 4 laps before getting a reward, it was observed that the same locations in the maze are represented by different place cells. When a CSCG is trained on these sequences, it naturally allocates different clones to different laps. The activations of hidden states when circling the maze match nicely with the place cell firings observed in rats. The authors also showed that CSCG could explain the remapping phenomenon observed in place cells when the environment changes.

From the papers I picked above, you can probably tell that Vicarious’ vision towards AGI emphasizes more structured approaches instead of working towards a learn-it-all huge network. Generative models like probabilistic graphical models have the potential to be more robust at modeling the underlying causal relationships in an environment and have the benefit of not needing to be re-trained if the underlying relationships remain the same. While recent progress in neural network approaches such as transformers and large language models has surprised many with its capabilities, there still seems to be a gap between being able to reorganize opinions that originated from humans and having intelligence that can form novel thoughts. I have doubts about the claim that AGI is within a few years’ reach, which many people have made; the path to AGI may still be long and these published ideas might be needed one day to bridge the gap.

Visual Loop Machine

In AI, Computer Vision, deep learning, Machine Learning, Serious Stuffs on April 30, 2022 at 11:25 pm

by Li Yang Ku

Visual Loop Machine is my new side project since the Rap Machine I made that completes rap sentences. It is a tool that plays visual loops generated by StyleGAN2 along with music in real time. One of the reasons I started this project was that I’ve been waiting for visual effect/mixing software like Serato Video and MixEmergency to go on discount, and as a Taiwanese Hakka (a group known for being cheap) I couldn’t justify purchasing it at full price for my home DJ career. While waiting for the discount I came across some awesome visual loops generated by moving along the latent space of a Generative Adversarial Network. This inspired me to make a new type of video that I call Multiple Temporal Dimension (MTD) videos. While normal videos have a single temporal dimension and a fixed order of frames, MTD videos have multiple time dimensions and therefore contain multiple possible sequences. This makes the video file polynomially larger, but for the short visual loops that are often played at nightclubs this could be acceptable. The Visual Loop Machine is software that loads MTD videos and plays them based on audio feedback. The following video is an example:

Note that Visual Loop Machine is not a replacement for Serato Video or MixEmergency (which I will still purchase if there is a discount). Visual Loop Machine cannot play normal videos made by awesome visual loop artists like Beeple, nor can it mix between videos based on controls from DJ software. What is special about it is that it doesn’t rely on traditional visual effects being applied to the original videos to match the music. In some way, customized visual changes are already included in the MTD videos. Currently Visual Loop Machine uses the volume to adjust the changes and only supports two temporal dimensions. The MTD video continues to loop along the major temporal dimension while movement in the second temporal dimension is controlled by the relative volume of the audio. For people who want to test it out, I’ve shared some of the MTD videos I created here. I haven’t packaged the Visual Loop Machine into an install/executable file yet (update: executable files for Mac and Linux are now available: Linux, Mac, Mac with Apple silicon) but it is open source and I included some basic instructions on how to run it. Repository and instructions are here.
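The playback rule described above can be sketched in a few lines. This is a simplified illustration rather than the actual player code: the function name, the use of a recent peak volume to define "relative volume", and the linear mapping onto the second dimension are all assumptions made for the example.

```python
def select_frame(num_major, num_minor, elapsed_frames, volume, recent_peak):
    """Pick which frame of a two-dimensional MTD video to show.

    num_major, num_minor: frame counts along the two temporal dimensions
    elapsed_frames:       frames played since the start (drives the loop)
    volume, recent_peak:  current audio level and a recent maximum (assumed inputs)
    """
    # The major dimension simply loops.
    major = elapsed_frames % num_major
    # The minor dimension follows the volume relative to the recent peak.
    if recent_peak <= 0:
        relative = 0.0
    else:
        relative = min(max(volume / recent_peak, 0.0), 1.0)
    minor = min(int(relative * num_minor), num_minor - 1)
    return major, minor
```

For example, `select_frame(30, 8, 65, 0.5, 1.0)` is frame 5 of the loop at position 4 of the volume axis, so a louder beat pushes the picture further along the second dimension while the loop keeps running.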

You can generate an MTD video by manually drawing each frame for multiple temporal dimensions, but the easier way to generate one is to use a neural network. I used the StyleGAN2 network introduced by Nvidia to generate these videos. I added a function in my fork of the StyleGAN2 repository so anyone can generate an almost infinite number of MTD video variations using pretrained networks, which you can find here or by searching on the internet. I’ve also trained one network using photos I took during a trip to national parks in Arizona and southern California; you can see two of the MTD videos based on this network at the start of the video below and some of the generated images in the top figure of this post. (If you would like to train your own network, I would suggest subscribing to Google Colab Pro and following this colab example by Arthur Findelair.) Note that I am not the first one to associate images generated by StyleGAN with music (one example is this work done by Derrick Schultz, who also has a pretty cool class on making art with machine learning on YouTube). However, Visual Loop Machine is unique in that it is meant for reacting to music in real time and separates the image generation, which requires a lot of GPU power, from the player, which can be run on a normal laptop.

There are already quite a few posts about StyleGAN and StyleGAN2 on the internet so I am only going to talk about it briefly here. The main innovation of StyleGAN is a modification of the generator part of a typical Generative Adversarial Network (GAN). Instead of the traditional approach in which the latent code is fed into the generator network directly, StyleGAN maps the latent code to a separate space W and applies it at multiple places in the generation process. The authors showed that by mapping to this separate space W, the latent space can be disentangled from the training distribution and therefore generate more realistic images. Noise is also added at multiple locations in the generator. This allows the network to generate stochastic parts of an image (such as human hair) based on this noise instead of consuming network capacity on achieving pseudorandomness. The following is a figure of the architectures of a traditional GAN and a StyleGAN.

Comment below if you have any issues running the software and I will try to address them when I have time. This work is more of a proof of concept; for MTD videos to really work, a more general video format will need to be defined.

The Quest to Finding “The” Object Representation for Robot Manipulation

In AI, Computer Vision, deep learning, Paper Talk, Robotics on February 6, 2022 at 12:06 pm

By Li Yang Ku

For many researchers in the field of Computer Vision, coming up with “the” object representation is a lifetime goal. An object representation is the result of mapping an image to a feature space such that an agent can recognize or interact with the objects in it. The field has come a long way from edge/color/blob detection, weak classifiers used for AdaBoost, bag of features, and constellation models to the more recent last-layer features of deep learning models. While most of the work focuses on finding the representation that is best for classification tasks, for robotics applications an agent also needs to know how to interact with the object. There is a lot of work on learning the affordance of an object, but knowing the affordance may not be enough for manipulation. What is useful in robotic manipulation is being able to represent features associated with a point or part of an object that is useful for manipulation, and being able to generalize these features to novel items in the category. In fact, this was what I was trying to achieve in grad school. In this post, I will talk about more recent work that introduces models for this purpose.

a) Peter R. Florence, Lucas Manuelli, and Russ Tedrake, “Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation,” 2018.

In this work, the goal is to learn a deep learning model (ResNet is used here) which, given an image of a part of an object, outputs a descriptor of this location on the object. The hope is that this descriptor will remain the same when the object is viewed from a different angle and will also generalize to objects of the same class. What this means is that if a robot learns that the handle of a cup is where it wants to grab, it can compute the descriptor of this location on the cup, and when seeing another cup in a different pose, it can still identify the handle by finding the location that has the most similar descriptor. The following are some visualizations of the descriptors of a caterpillar toy in different poses. As you can see, the color pattern of the toy remains quite similar even after deformation.

The authors introduced a way to easily collect data automatically. Using an RGBD camera mounted on a robot arm, images of an object from many different angles can be captured automatically. The positive image pairs for training can then be easily labeled by reconstructing the 3D scene and assuming a static environment where the same 3D location is the same point on the object. A loss function that minimizes the distance between two matching descriptors is used to learn this neural network.
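A loss of this kind can be sketched as follows. The first term pulls matching descriptors together as described above. The second term, which pushes non-matching pixels apart by a margin, is the usual companion in contrastive losses and is included here as an assumption, as are the function name, the dictionary layout, and the margin value.

```python
import math

def pixelwise_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """desc_a, desc_b: {(u, v): descriptor} for two images of the same scene.

    matches:     list of (pixel_a, pixel_b) that show the same 3D point
    non_matches: list of (pixel_a, pixel_b) that show different 3D points
    margin:      hypothetical minimum distance wanted between non-matches
    """
    def distance(pa, pb):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(desc_a[pa], desc_b[pb])))

    # Matching pixels: penalize any distance between their descriptors.
    match_loss = sum(distance(pa, pb) ** 2 for pa, pb in matches) / len(matches)
    # Non-matching pixels: penalize only when closer than the margin.
    non_match_loss = sum(
        max(0.0, margin - distance(pa, pb)) ** 2 for pa, pb in non_matches
    ) / len(non_matches)
    return match_loss + non_match_loss
```

Without something like the second term, a network could trivially minimize the match distance by outputting the same descriptor everywhere, which is why I included it in the sketch.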

The results are quite impressive as you can see in the video above. The authors also showed that it can generalize to unseen objects in the same category and demonstrated a grasping task on the robot.
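Once such a network is trained, transferring a grasp point to a new image amounts to a nearest-neighbor search in descriptor space. Below is a minimal sketch of that lookup; the nested-list image layout and the function name are mine, and a real system would do this with array operations on the GPU.

```python
def best_match(descriptor_image, reference):
    """Find the pixel whose descriptor is closest to a reference descriptor.

    descriptor_image: rows of pixels, each pixel a list of floats (H x W x D)
    reference:        descriptor of the point of interest, e.g. a cup handle
    Returns ((u, v), squared_distance) with u the column and v the row.
    """
    best_pixel, best_dist = None, float("inf")
    for v, row in enumerate(descriptor_image):
        for u, desc in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(desc, reference))
            if dist < best_dist:
                best_pixel, best_dist = (u, v), dist
    return best_pixel, best_dist
```

Thresholding the returned distance is one simple way to decide that the point of interest is not visible in the new image at all.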

b) Lucas Manuelli, Wei Gao, Peter Florence, Russ Tedrake, “kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation,” 2019.

This paper is also from Russ Tedrake’s lab with mostly the same authors, but what I found interesting is that they took a slightly different approach to tackling a very similar problem. The authors mentioned that their previous work wasn’t able to solve the task of manipulating an object into a specific configuration, such as learning to hang a mug on a rack. One reason is that it is hard to use the previous approach to specify a position that is not on the surface, such as the center of the mug handle, which is important for completing this insertion task. Instead of learning descriptors on the surface of the object, this work learns 3D keypoints that can also be outside of the object. With these 3D keypoints, actions can be executed based on keypoint positions by formulating the task as an optimization problem. Some of the constraints used are 1) the distance between keypoints, 2) keypoints having to be above a plane such as the table, and 3) the distance between a keypoint and a plane, for placing an object on a table. The following is an example of a manipulation formulation that places the cup upright.

During test time, Mask R-CNN is used to crop out the object, and an Integral Network is then used to predict keypoints in the image plus their depth. Here the Integral Network is a ResNet where, instead of using a max operation on heat maps to get a single location, the expected location under the heat map is used. In this work, the keypoints are manually selected, but training images can be generated efficiently using an approach similar to the previous paper. By taking multiple images of the same scene and labeling one of them in 3D, the annotation can be propagated to all images of that scene. The authors demonstrated that with just a few annotations, the robot was able to manipulate novel objects of the same class. Some experimental results are shown in the video below.
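The difference between taking the max of a heat map and taking its expected location is easy to show in code. The sketch below normalizes the heat map with a softmax before taking the expectation, which is a common choice that I am assuming here; the function name and the nested-list layout are also just for illustration.

```python
import math

def expected_location(heatmap):
    """Return the expected (u, v) pixel location under a 2D heat map.

    heatmap: rows of unnormalized scores; u is the column, v is the row.
    Unlike an argmax, the result is sub-pixel and differentiable.
    """
    peak = max(max(row) for row in heatmap)
    # Softmax over all pixels (subtracting the peak for numerical stability).
    weights = [[math.exp(score - peak) for score in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    u = sum(w * col for row in weights for col, w in enumerate(row)) / total
    v = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return u, v
```

With a heat map that has two equal peaks, such as `[[5.0, 0.0, 5.0]]`, an argmax has to pick one of them while the expectation lands halfway in between at column 1. That is both the strength (smooth gradients) and the weakness (multi-modal heat maps) of this kind of readout.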

c) Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B. Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, Vincent Sitzmann, “Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation,” 2021

This more recent work is in some way a further extension of the two previous works I talked about. Similar to Dense Object Nets, this work tries to learn common descriptors across objects of the same category. However, to overcome the same difficulty of manipulating objects based on descriptors in the image space, this work tries to identify 3D keypoints like the previous paper kPAM, and on top of that also learns 3D poses. Unlike the two previous works, which use RGB images, this work uses point clouds instead.

This work introduces the Neural Point Descriptor Field, which is a network that takes in a point cloud and a 3D point coordinate then outputs a descriptor representing a point with respect to the object. The hope is that this descriptor will remain the same across meaningful locations, such as the handle of a mug, across objects of the same category at different poses. The Neural Point Descriptor Field first encodes the point cloud using a PointNet structure. The encoded point cloud is then concatenated with the point coordinate and then fed through another network that predicts the occupancy of that point (see figure below.)

The reason for using a network that predicts occupancy is that the training data can be easily collected using a dataset of point clouds. The authors suggested that a network that can predict the occupancy of a point would also include information about how far a point is from salient features of the object, and would therefore be useful for generating a descriptor for 3D keypoints. The features of this occupancy network at each layer are then concatenated to form the neural point descriptor. Note that in order to achieve rotation invariant descriptors, an occupancy network based on Vector Neurons is used. (Vector Neurons are quite interesting but I will not go into details since they deserve their own post.) Some of the results are shown in the figure below; points chosen from demonstrations and points that have the closest descriptor on test objects are marked in green. As you can see, the points in the mug example all correspond to the handle.

In the previous section we showed how to obtain descriptors of keypoints that are rotation invariant and can possibly generalize across objects of the same category. Here we are going to talk about getting a descriptor for poses. The idea is based on the fact that given 3 non-collinear points in a reference frame we can define a pose. In this work, the authors simply define a set of fixed 3D keypoint locations relative to the reference frame and concatenate the neural descriptors of these keypoints. By doing this, a grasp pose in a demonstration can be associated with the most similar grasp pose at test time using iterative optimization. This allowed the authors to show that the robot can learn simple tasks such as pick and place from just a few demonstrations and generalize to other objects of the same category. See more information and videos of the experiments below.
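The fact that three non-collinear points define a pose can be made concrete with a small construction: use the first point as the origin, the direction to the second point as one axis, and cross products to get the other two. This is a generic geometric sketch, not the procedure from the paper, and the helper names are my own.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    length = math.sqrt(sum(x * x for x in a))
    if length < 1e-12:
        raise ValueError("points are coincident or collinear")
    return tuple(x / length for x in a)

def frame_from_points(p0, p1, p2):
    """Build a right-handed frame (origin, (x, y, z) axes) from three points."""
    x_axis = unit(sub(p1, p0))
    # The normal of the plane through the three points; fails if collinear.
    z_axis = unit(cross(x_axis, sub(p2, p0)))
    y_axis = cross(z_axis, x_axis)
    return p0, (x_axis, y_axis, z_axis)
```

Two points would leave the rotation about the line between them undetermined, and three collinear points have the same problem, which is why the third point has to be off the line.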

BARS 2021 Paper Picks

In AI, Computer Vision, deep learning, Paper Talk, Robotics on November 3, 2021 at 10:30 pm

By Li Yang Ku


I was at the Bay Area Robotics Symposium (BARS) at Stanford in person last week. It’s nice to see real people even though there is a mask mandate (which could be a good thing since the audience won’t be biased by the speaker’s looks). Faculty talks can be found in the video below. My recommended talk would be the fascinating keynote by Rob Reich (which starts around 5:04 and should be the first talk if you use the player embedded below), and the most interesting comment would be Jitendra Malik saying the vision community should stop working on deep fakes. There were also quite a few spotlight talks (mostly by students) with poster sessions, and I picked a few that I found interesting below:

a) Zhe Cao, Ilija Radosavovic, Angjoo Kanazawa, Jitendra Malik, “Reconstructing Hand-Object Interactions in the Wild”

In this work, the authors try to estimate the hand gesture and the pose of the object the hand is interacting with, given a single RGB image. The RGB images tested on are not clean images collected in a lab but “wild” images collected from the internet, which makes the already challenging task even more difficult. The ground truth for this kind of data is also expensive to obtain, so a standard end-to-end deep learning approach is not quite doable. Instead, the authors leveraged several prior works and came up with an optimization-based procedure that achieved pretty decent results. So what is an optimization-based procedure? This was not clearly defined in the paper, but I found a cited work, “Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image”, to be quite helpful for understanding it. Here an optimization-based approach means we have a model of the object/hand/body and a differentiable renderer such as OpenDR. Given a single RGB image and a defined loss function between the generated image and the given RGB image, we can iteratively adjust the parameters of the model to minimize the loss. In the previously mentioned paper, Powell’s dogleg method is used to update the parameters.
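As a toy version of such a procedure, the sketch below fits a 2D translation so that model keypoints line up with observed keypoints by repeatedly stepping down the gradient of a squared-error loss. It stands in for the real thing only loosely: there is no renderer, the model has just two parameters, and I use plain gradient descent instead of the dogleg method. All names and default values are made up for the example.

```python
def fit_translation(model_points, observed_points, steps=200, lr=0.1):
    """Find (tx, ty) minimizing the mean squared distance between
    model_points shifted by (tx, ty) and observed_points."""
    tx, ty = 0.0, 0.0
    n = len(model_points)
    for _ in range(steps):
        # Analytic gradient of the mean squared error w.r.t. tx and ty.
        grad_x = sum(2 * (mx + tx - ox)
                     for (mx, _), (ox, _) in zip(model_points, observed_points)) / n
        grad_y = sum(2 * (my + ty - oy)
                     for (_, my), (_, oy) in zip(model_points, observed_points)) / n
        tx -= lr * grad_x
        ty -= lr * grad_y
    return tx, ty

model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
observed = [(3.0, -2.0), (4.0, -2.0), (3.0, -1.0)]  # model shifted by (3, -2)
tx, ty = fit_translation(model, observed)
```

The real problem has the same shape, just with far more parameters (hand articulation, object pose) and a loss that compares rendered output against the image, which is why a differentiable renderer is needed.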

Four optimization steps are proposed in this work. 1) The hand pose is estimated through optimization. This is achieved by minimizing the distance between the projected keypoints of a 3D hand model and the 2D hand keypoints estimated from the RGB image using prior work. 2) The object pose is estimated through optimization. This is achieved by using a given object mesh and a differentiable renderer that generates a mask to compare with the RGB image. 3) Joint optimization is performed between the hand and object poses. Three different loss functions that capture the depth difference, interaction, and penetration between hand and object are used. 4) Pose refinement is done by leveraging contact priors learned from a separate dataset and a small network that takes in hand parameters and the distance from hand vertices to the object. The image below shows some of the results.

b) Alexander Khazatsky, Ashvin Nair, Daniel Jing, Sergey Levine, “What Can I Do Here? Learning New Skills by Imagining Visual Affordances”

This paper is about achieving something close to zero-shot learning. Given a single goal image with novel objects, the robot has to manipulate the objects to match the goal image. The kinds of tasks tested are chosen from a fixed set that includes “opening drawer”, “put lid on pot”, “relocating object”, etc. A prior dataset of these tasks being performed is given, and before evaluating a new task the robot also has about 5 minutes to play with the novel items in the environment. I would say this additional information is quite reasonable and not too far off from what humans have when solving new problems. For example, we have a large amount of experience opening drawers and occasionally it may still take us a few minutes to open a new one.

The approach the authors proposed is called visuomotor affordance learning and it includes 4 steps. 1) Given the prior dataset, a latent space of the state (RGB image) is learned using a vector quantised variational autoencoder (VQ-VAE). 2) Given the prior dataset, what an environment can afford is learned using a conditional PixelCNN model. This is one of the core contributions of this paper. Affordance here means that given a latent state of the observation, the model learns the distribution of latent goal states. For example, if an image has a closed drawer, the goal state in which the drawer is open will have a high probability. This allows the robot to guess beforehand what the goal might be at test time and spend most of its online learning time on actions that are more relevant. 3) Offline learning is performed on the latent states of the prior dataset with advantage weighted actor critic. 4) Online behavior learning (the robot interacts with the new environment) is achieved by sampling target states using the affordance model learned in 2) and trying to learn to reach them. A good affordance model is beneficial here since it will help the robot learn a very similar task before the actual evaluation. This approach is then evaluated by giving a target image and counting how many times the robot succeeds in reaching the goal. The authors tested on both a real robot and in simulation and showed improvement over previous approaches. The figure above shows the whole approach, and the video below contains some examples.

c) Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, Chelsea Finn, “Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation”

In the previous paper, a goal image is provided to specify a robot task. In this work, the authors argue that language is a better way to communicate with the robot for a few reasons: 1) Providing a goal image is not very practical. If you need to finish the task first to show what you want to achieve, you probably don’t need the robot to do it anymore. 2) A goal image may over-specify, e.g. if your goal image has multiple objects, the robot may try to match all objects to the image. 3) A goal image cannot specify certain tasks that don’t have a final image, such as “keep moving to the right”.

First, a set of offline data that contains start images and final images is labeled with descriptions of the task through crowdsourcing. A binary classifier that looks at the start image, the current image, and a description is then trained to classify whether the difference in state can be described by the description. Second, an action-conditioned video prediction framework from prior work is used to learn a forward visual dynamics model which, given the current state and action, generates the next state (state here means RGB image). With these two models, at test time we can sample a set of action sequences and feed them into the forward visual dynamics model to get predicted future images. Each predicted image, along with the current image and the task description, is then fed into the trained binary classifier to obtain a score. The highest scoring action sequence is then executed (see figure below; note that here 3 cameras at different locations are used). This approach, which the authors call Language-conditioned Offline Reward Learning (LORL), is tested both in simulation and on a real robot.
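The test-time loop can be sketched as a simple sample-and-score planner. Here `dynamics` and `score` stand in for the learned video prediction model and the language-conditioned classifier; both are hypothetical callables passed in by the caller, and the uniform random sampling, horizon, and sample count are assumptions for the sketch rather than the settings from the paper.

```python
import random

def plan(current_state, dynamics, score, action_space,
         horizon=5, n_samples=100, seed=0):
    """Sample action sequences, roll them out, and keep the best scoring one.

    dynamics(state, action) -> predicted next state   (stand-in for the video model)
    score(start, predicted) -> float                  (stand-in for the classifier,
                                                       with the instruction baked in)
    """
    rng = random.Random(seed)
    best_sequence, best_score = None, float("-inf")
    for _ in range(n_samples):
        sequence = [rng.choice(action_space) for _ in range(horizon)]
        state = current_state
        for action in sequence:
            state = dynamics(state, action)
        value = score(current_state, state)
        if value > best_score:
            best_sequence, best_score = sequence, value
    return best_sequence, best_score

# Toy stand-ins: the state is a position on a line and the instruction is
# "move to the right", so the score is simply how far right we ended up.
toy_sequence, toy_score = plan(
    current_state=0,
    dynamics=lambda state, action: state + action,
    score=lambda start, predicted: predicted - start,
    action_space=[-1, 0, 1],
    horizon=3,
    n_samples=200,
)
```

Because the reward only compares the start and predicted images against a sentence, the same planner can be reused for a new instruction without retraining the dynamics model.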

d) Toki Migimatsu and Jeannette Bohg, “Grounding Predicates through Actions”

In a previous post, I talked about a few works that do planning using PDDL. PDDL is great at planning high-level actions but it requires knowing the symbolic state. In this work, the authors try to address part of this problem by learning the symbolic state given an image. An approach to label a large dataset with predicates automatically is proposed (an example of a predicate is whether a drawer is open or not). The argument is that labeling actions is easier than labeling predicates and there are existing datasets with action types labeled. Given the video and labeled actions, a convolutional neural network (CNN) is trained to automatically generate predicates. The input to this CNN is the image plus bounding boxes of detected objects, and the output is a vector of how likely each predicate is to be True. Here the bounding boxes represent the arguments of the predicate. For example, to know if in(hand, drawer) is True, the bounding box input needs to be in the order of hand then drawer, and the output would include the probability of all predicates that take the two arguments hand and drawer. In addition to the dataset, a PDDL definition that includes the pre-conditions and effects of each action in the dataset also needs to be provided. During training, the PDDL definition of the labeled action is compared with the predicate outputs of the CNN given the images before and after the action plus all combinations of bounding boxes. A loss function is then defined based on how much the network prediction agrees with the action definitions. The figure below is a good example of how it works. The authors labeled a large real-world dataset and verified the effectiveness on a toy environment.
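One way to picture such a loss: the pre-conditions of the labeled action say what must hold in the frame before it, the effects say what must hold in the frame after it, and the network is penalized wherever its predicted probabilities disagree. The sketch below uses a plain negative log-likelihood for this; the exact loss in the paper may differ, and the function name and dictionary layout are mine.

```python
import math

def action_consistency_loss(pred_before, pred_after, preconditions, effects):
    """pred_before, pred_after: {predicate: probability it is True}
    preconditions, effects:     {predicate: required truth value}
    Predicates the action says nothing about contribute no loss.
    """
    def nll(prob_true, required):
        p = prob_true if required else 1.0 - prob_true
        return -math.log(max(p, 1e-12))  # clamp to avoid log(0)

    loss = sum(nll(pred_before[name], value)
               for name, value in preconditions.items())
    loss += sum(nll(pred_after[name], value)
                for name, value in effects.items())
    return loss

# Hypothetical "open(drawer)" action: drawer must be closed before, open after.
loss = action_consistency_loss(
    pred_before={"closed(drawer)": 0.5},
    pred_after={"closed(drawer)": 0.5},
    preconditions={"closed(drawer)": True},
    effects={"closed(drawer)": False},
)
```

The appeal is that no one ever labels a predicate directly; the supervision comes entirely from the action label and its symbolic definition.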

Transformer for Vision

In Computer Vision, deep learning, Machine Learning, Paper Talk, vision on October 9, 2021 at 11:04 pm

By Li Yang Ku

In my previous post I talked about a web app I made that can generate rap lyrics using the transformer network. The transformer is currently the most popular approach for natural language related tasks (I am counting OpenAI’s GPT-3 as a transformer extension). In this post I am going to talk about a few different works that try to apply it to vision related tasks.

If six years ago you had told me that the next big thing in computer vision would be a model developed for natural language processing, I might have laughed and thought it was supposed to be funny. Language, which is normally received over a period of time, seems so different from image recognition, where information is spatial, that it’s hard to imagine any model designed for language would be any good at vision. Because of these differences, directly applying transformer to vision tasks is non-trivial (we will go through a few different approaches in this post.) However, because transformer is based on ideas that are quite general, applying them to vision tasks actually makes sense.

The principal idea of transformer is about learning attention; to understand a sentence, we often have to associate words with other words in the same sentence, and these relations are where we put our attention. For example, in the sentence “Imagine your parents comparing you to Dr. Jonny Kim”, when we look at the word “comparing” we would probably pay attention to “you” and “Dr. Jonny Kim”, which are the two things being compared. And when we focus on “you” the association might be “your parents”. In transformer, we consider this first word as a query and the other words that we pay attention to as keys, and each key has a corresponding value that represents the meaning of this association. By stacking these attention blocks, we transform the sentence into a vector that contains more high level information that can be used for different language tasks such as translation, sentence completion, question answering, etc. Associating different features is also quite important for vision; to recognize a cat, you might want to check for a cat head and a cat body that are at the right relative position. If the cat body is far away from the head, something is probably wrong.
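The query, key, and value idea can be sketched in a few lines. This is a minimal single-query version of scaled dot-product attention in plain Python; the toy vectors are made up for illustration, and a real transformer adds learned projections and multiple heads on top of this.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the attention-weighted average of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return output, weights


# Toy example: the query is more similar to the first key,
# so the first value dominates the output.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention(query, keys, values)
print(weights)
print(output)
```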

One of the earlier attempts to use transformer for vision tasks was published in the paper “Image Transformer” in 2018. One of the main problems of using transformer for vision tasks is computation. An image may be worth a thousand words, but it can also take hundreds if not thousands of times more memory. Computing relations between each pixel and all other pixels is therefore infeasible. In this work Parmar et al. addressed this issue by keeping attention within a set of neighboring pixels. This Image Transformer was used on vision tasks such as image completion and super resolution. By generating one pixel at a time in a top-down left-right order, image completion seems to be the task most similar to sentence completion and most suitable for applying transformer. On these tasks the Image Transformer outperformed state of the art approaches that mostly used GANs (Generative Adversarial Networks). The results actually look pretty good, see Table 1 below.

Another attempt at applying transformers to vision tasks is the work “Scaling Autoregressive Video Models” by Weissenborn et al. In this work, the authors try to tackle the problem of video generation, which also closely resembles the sentence completion task. For transformers to handle videos, computation and memory become an even bigger problem since attention consumes both quadratically with respect to input size. To tackle the problem, the video is divided into smaller non-overlapping sub blocks. The attention layers are then only applied to each block individually. The problem with this approach is that there is no communication between blocks. To address this problem, the blocks are split differently for each layer, and there will be blocks that extend to all parts of each axis. For example, the block sizes used for the first 4 layers in the network are (4, 8, 4), (4, 4, 8), (1, 32, 4), and (1, 4, 32), where each tuple represents sizes of (time, height, width) and the input video is subscaled to size 4 x 32 x 32. The following images are results trained on the BAIR Robot Pushing dataset, where the first image on the left is given and the rest are generated.
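To get a feel for how much the block splitting saves, here is a small back-of-the-envelope sketch that counts query-key pairs for the block sizes quoted above. The helper function is mine; it only does the arithmetic and ignores everything else in the model.

```python
def attention_pairs(shape, block):
    """Count query-key pairs when attention stays inside non-overlapping blocks.

    Every element attends to all elements of its own block, so the count is
    (number of elements) * (elements per block).
    """
    total_elements = 1
    block_elements = 1
    for size, block_size in zip(shape, block):
        if size % block_size != 0:
            raise ValueError("block size must divide the input size")
        total_elements *= size
        block_elements *= block_size
    return total_elements * block_elements


video_shape = (4, 32, 32)  # (time, height, width) after subscaling
layer_blocks = [(4, 8, 4), (4, 4, 8), (1, 32, 4), (1, 4, 32)]

full = attention_pairs(video_shape, video_shape)  # one block = whole video
print(full)  # 16777216
for block in layer_blocks:
    print(block, attention_pairs(video_shape, block))  # 524288 for each
```

All four block shapes contain 128 elements, so every layer costs 32 times fewer pairs than full attention, while the different shapes let information travel along time, height, and width across layers.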

So far we’ve been talking about vision applications that are in some way more similar to language tasks, where the problem can be seen as generating a new “token” given already generated “tokens”. In the paper “Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”, transformer is used to solve more traditional vision tasks such as classification and segmentation. Similar to previous approaches, one of the focuses when applying transformer is on reducing computation. The authors of this work proposed to factorize the 2D attention into two 1D attentions, along height first and then width. They call this the axial attention block (figure below) and use it to replace the convolution layers in a ResNet (a type of convolutional neural network that won the ImageNet competition in 2015.) This Axial-ResNet can be used just like ResNet for image classification tasks or can be combined with a conditional random field to provide segmentation outputs. The authors showed that this approach was able to achieve state of the art results on a few segmentation benchmarks.
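The saving from this factorization is easy to estimate. The following sketch just counts query-key pairs for global 2D attention versus height-then-width axial attention; the 128 x 128 feature map size is a number I picked for illustration.

```python
def full_attention_pairs(height, width):
    """Every pixel attends to every pixel: (H*W)^2 query-key pairs."""
    pixels = height * width
    return pixels * pixels


def axial_attention_pairs(height, width):
    """Each pixel attends along its column (H keys), then its row (W keys)."""
    return height * width * (height + width)


h, w = 128, 128  # hypothetical feature map size
print(full_attention_pairs(h, w))   # 268435456
print(axial_attention_pairs(h, w))  # 4194304
```

For a square map of side N the cost drops from N^4 to 2N^3, which is a factor of 64 in this example.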

This next work I am going to talk about was published this June (2021) and the goal is to show that a pure transformer network can do as well as (or even better than) CNNs (Convolutional Neural Networks) on image classification tasks when pre-trained on large enough data. In this paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, the authors introduced the Vision Transformer that has basically the same structure as the language transformer. Each image is cut into non-overlapping patches and tokenized just like words in natural language processing tasks; these tokens are then fed into the Vision Transformer in a fixed order (see figure below.) Since we are considering a classification task, only the encoder part of a typical transformer is needed.
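The patch cutting step can be sketched as follows. This is a plain Python toy on a 4 x 4 single channel image with 2 x 2 patches; the real model uses 16 x 16 patches on color images and then applies a learned linear projection and position embeddings, which I leave out here.

```python
def image_to_patches(image, patch):
    """Split a 2D image (list of rows) into flattened non-overlapping patches.

    Patches are returned in a fixed row-major order, top-left patch first.
    """
    height, width = len(image), len(image[0])
    if height % patch != 0 or width % patch != 0:
        raise ValueError("patch size must divide the image size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ])
    return tokens


# Toy 4 x 4 image whose pixel values are simply 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patches(image, 2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
print(tokens[3])    # [10, 11, 14, 15]
```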

What I found quite interesting is that the authors mentioned that when trained and tested on mid-sized datasets, the Vision Transformer achieved modest results a few percentage points lower than a ResNet. But when pre-trained on larger datasets, the Vision Transformer obtained state of the art results on these mid-sized datasets. The reason that CNNs performed better on mid-sized datasets seems to be their convolutional structure, which enforces translation invariance and locality that are quite useful for vision tasks. The Vision Transformer does not have these constraints so it requires more data to learn them; but when enough data is given, it can learn a more flexible structure and therefore reach higher accuracy. This conjecture makes sense given what was published in this other work “On The Relationship Between Self-Attention and Convolutional Layers.” In this paper, the authors first proved that a self-attention layer in a transformer can simulate a convolutional layer. The authors further looked into a trained transformer and showed that attention layers in the network did learn a structure similar to a convolutional layer. The network however did not learn a uniform convolution, but a version in which the kernel size varies between layers. This seems to explain nicely why transformer outperforms CNNs when given enough data.

Paper Picks: CVPR 2020

In AI, Computer Vision, deep learning, Paper Talk, vision on September 7, 2020 at 6:30 am
by Li Yang Ku (Gooly)

CVPR is virtual this year for obvious reasons, and if you did not pay the $325 registration fee to attend this ‘prerecorded’ live event, you can now have a similar experience by watching all the recorded videos on their YouTube channel for free. Of course it’s not exactly the same since you are losing out on the virtual chat room networking experience, but honestly speaking, computer vision parties are often awkward in person already and I can’t imagine you missing much. Before we go through my paper picks, let’s look at the trend first. The graph below shows the accepted paper counts by topic this year.

CVPR 2020 stats

And the following are the stats for CVPR 2019:

CVPR 2019 stats

These numbers cannot be directly compared since the categories are not exactly the same. For example, deep learning, which had the most submissions in 2019, is no longer a category (it isn’t going to be a very useful category when every paper is about deep learning.) The distributions of these two graphs look quite similar. However, if I had to analyze it at gunpoint, I would say the following:

  1. Recognition is still the most popular application for computer vision.
  2. The new category “Transfer/Low-shot/Semi/Unsupervised Learning” is the most popular problem to solve with deep networks.
  3. Despite being a controversial technology, more people are working on face recognition. For some countries this is probably still where most money is distributed.
  4. The new category “Efficient training and inference methods for networks” shows that there is an effort to push for practical use of the neural network.
  5. Based on this other statistical data, it seems that the keywords ‘graph’, ‘representation’, and ‘cloud’ doubled from last year. This is consistent with my observation that people are exploring 3D data more since the research space on 2D images is the most crowded and competitive.

Now for my random paper picks:

a) Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. “CvxNet: Learnable Convex Decomposition” (video)

This Google Research paper introduces a new representation for 3D shapes that can be learned by neural networks and used by physics engines directly. In the paper, the authors mention that there are two types of 3D representations. The first type is 1) explicit representations such as meshes. These representations can be used directly in many applications such as physics simulations because they contain information about the surface. Explicit representations are however hard to learn with neural networks. The other type is 2) implicit representations such as voxel grids. Voxel grids can be learned by neural networks since this can be considered as a classification problem that labels each voxel empty or not. However, turning these voxel grids into a mesh is quite expensive. The authors therefore introduce this convex decomposition representation that represents a 3D shape with a union of convex parts. Since a convex shape can be represented by a set of hyperplanes that draw the boundary of the shape, it becomes a learnable classification problem while retaining the benefit of having information about the shape boundary. This representation is therefore both implicit and explicit. The authors also demonstrated that a learned CvxNet is able to generate 3D shapes from 2D images with much better success compared to other approaches, as shown below.
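The “set of hyperplanes” idea is simple enough to sketch. Below is a minimal 2D version where each convex part is an intersection of half-planes and the shape is the union of the parts. I use a hard inside/outside test for clarity; as far as I understand, the paper uses a smooth, differentiable version of this test so it can be trained, and the box helper and L-shaped example are my own.

```python
def inside_convex(point, halfspaces):
    """A point is inside a convex part if n . x + d <= 0 for every hyperplane."""
    return all(
        sum(n_i * x_i for n_i, x_i in zip(normal, point)) + offset <= 0
        for normal, offset in halfspaces
    )


def inside_shape(point, convex_parts):
    """The full shape is the union of its convex parts."""
    return any(inside_convex(point, part) for part in convex_parts)


def box(xmin, xmax, ymin, ymax):
    """Hypothetical helper: an axis-aligned box as four half-planes (normal, offset)."""
    return [
        ((-1, 0), xmin),   # x >= xmin
        ((1, 0), -xmax),   # x <= xmax
        ((0, -1), ymin),   # y >= ymin
        ((0, 1), -ymax),   # y <= ymax
    ]


# An L shape is not convex, but it is the union of two convex boxes.
l_shape = [box(0, 2, 0, 1), box(0, 1, 0, 2)]
print(inside_shape((1.5, 0.5), l_shape))  # True
print(inside_shape((0.5, 1.5), l_shape))  # True
print(inside_shape((1.5, 1.5), l_shape))  # False
```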

b) Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman. “Ego-Topo: Environment Affordances From Egocentric Video” (video)

Environment Affordance

This paper on predicting an environment’s affordance is a collaboration between UT Austin’s computer vision group and Facebook AI Research. This paper caught my eye since my dissertation was also about affordances using a graph like structure. If you are not familiar with the word “affordance”, it’s a controversial word made up to describe what action/function an object/environment affords a person/robot.

In this work, the authors argue that the space in which an action takes place is important for understanding first person videos. Traditional approaches to classifying actions in videos usually just take a chunk of the video and generate a representation for classification, while SLAM (simultaneous localization and mapping) approaches that try to create the exact 3D structure of the environment often fail when humans move too fast. Instead, this work learns a network that classifies whether two views belong to the same space. Based on this information, a graph where each node represents a space and the corresponding videos can be created. The edges between nodes then represent the action sequences that happened between these spaces. The videos within a node can then be used to predict what an environment affords. The authors further trained a graph convolution network that takes into account neighboring nodes to predict the next action in the video. The authors showed that taking into account the underlying space benefited both tasks.

c) Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta. “Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects” (video)


This paper would probably win the best title award for this conference if there were one. This work is about estimating forces applied to objects by humans in a video. Arguably, if robots can estimate forces applied on objects, it would be quite useful for performing tasks and predicting human intentions. However, personally I don’t think this is how humans understand the world and it may be solving a harder problem than needed. Having said that, this is still an interesting paper worth discussing.

Estimating force and contact points

The difficulty of this task is that the ground truth forces applied on objects cannot be easily obtained. Instead of figuring out how to obtain this data, the authors use a physics simulator to simulate the outcome of applying the force, and then use the difference between the keypoints annotated in the next frame and the keypoint locations of the simulated outcome as a signal to train the network. Contact points are also predicted by a separate network with annotated data. The figure above shows this training schema. Note that estimating gradients through a non-differentiable physics simulator is possible by looking at how the result changes when each dimension is changed a little bit. The authors show this approach is able to obtain reasonable results on a collected dataset and can be extended to novel objects.
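Changing each dimension a little bit and looking at the result is just a finite difference estimate of the gradient. Here is a minimal sketch; the simulate function is a made-up stand-in for a black-box physics simulator that returns a keypoint error, not anything from the paper.

```python
def numerical_gradient(f, x, eps=1e-4):
    """Estimate the gradient of a black-box function with central differences."""
    gradient = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        gradient.append((f(up) - f(down)) / (2 * eps))
    return gradient


def simulate(force):
    """Hypothetical black box: keypoint error after applying a 2D force.

    The error is smallest when the force is (3, -1).
    """
    fx, fy = force
    return (fx - 3.0) ** 2 + (fy + 1.0) ** 2


gradient = numerical_gradient(simulate, [0.0, 0.0])
print(gradient)  # approximately [-6.0, 2.0]
```

The cost is two simulator calls per force dimension, which is fine when the force vector is small.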

d) Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song. “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization” (video)

This is a Google Brain paper that tries to find a better architecture for object detection tasks that would benefit from more spatial information. For segmentation tasks, the typical architecture has an hourglass shaped encoder decoder structure that first scales down the resolution and then scales it back up to predict pixel-wise results. The authors argue that this type of neural network with a scale decreasing backbone may not be the best solution for tasks where localization is also important.

Left: ResNet, Right: Permute last 10 blocks

The idea is then to permute the order of the layers of an existing network such as ResNet and see if this can result in a better architecture. To avoid having to try out all combinations, the authors used Neural Architecture Search (basically another network) to learn what architecture would be better. The result is an architecture that has mixed resolutions and many skip connections that go further (image above). The authors showed that with this architecture they were able to outperform prior state of the art results, and this same network was also able to achieve good results on datasets other than the one it was trained on.


Guest Post: How to Make a Posture Correction App

In AI, App, Computer Vision, deep learning on August 8, 2020 at 9:35 am

This is a guest post by Lila Mullany and Stephanie Casola from alwaysAI (in exchange they will post one of my articles in their company blog.) What this startup is developing might be useful to some of my readers that just want to implement deep learning vision apps without having to go through a steep learning curve. It’s also open source and free if you are just working on a home project. In the following, we’ll do a brief intro on alwaysAI and then Lila will talk about how to make a posture correction app with their library.

AlwaysAI is a startup located in San Diego that is making a deep learning computer vision platform that aims at making computer vision more accessible to developers. They provide a freemium version that could be quite useful to hobbyists as well as devs looking to build computer vision into commercial products. This platform is optimized to run on edge devices and can be an attractive option for anyone looking to build computer vision with resource constraints. One could easily create a computer vision app with alwaysAI and run it on a Raspberry Pi (a Raspberry Pi 4 costs about $80). If you know basic Python, you can sign up for a free account and create your own computer vision application with a few lines of code.

Typically, computer vision apps can take a lot of time to implement from scratch. With alwaysAI you can get going pretty quickly on object detection, object tracking, image classification, semantic segmentation, and pose estimation. Creating a computer vision application with alwaysAI starts with selecting a pre-trained model from their model catalog. If you want to train your own model you can sign up here for their closed beta model training program.

At this time, all models are open source and available to be freely used in your apps. As for distributing your app, the first device you run your app on is free. For a free account, just sign up here.


For more information you can look up their documentation, blog, and Youtube channel. They also do hackathons, webinars, and weekly “Hacky Hours”. You can find out more about these events on their community page.

So that’s the intro, below Lila will show you an example of how their library can be used to build your own posture corrector.

Many of us spend most of our days hunched over a desk, leaning forward looking at a computer screen, or slumped down in our chair. If you’re like me, you’re only reminded of your bad posture when your neck or shoulders hurt hours later, or you have a splitting migraine. Wouldn’t it be great if someone could remind you to sit up straight? The good news is, you can remind yourself! In this tutorial, we’ll build a posture corrector app using a pose estimation model available from alwaysAI.

To complete the tutorial, you must have:

  1. An alwaysAI account (it’s free!)
  2. alwaysAI set up on your machine (also free)
  3. A text editor such as sublime or an IDE such as PyCharm, both of which offer free versions, or whatever else you prefer to code in

All of the code from this tutorial is available on GitHub.

Let’s get started!

After you have your free account and have set up your developer environment, download the starter apps; do so using this link before proceeding with the rest of the tutorial. We’ll build the posture corrector by modifying the ‘realtime_pose_detector’ starter app. You may want to copy the contents into a new directory, so you retain the original code.

There will be three main parts to this tutorial:

  1. The configuration file
  2. The main application
  3. The utility class for detecting poor posture

1) Creation of the Configuration File

Create this file as specified in this tutorial. For this example app we need one configuration variable (and more if you want them): scale, which is an int and will be used to tailor the sensitivity of the posture functions.

Now the configuration is all set up!
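As a rough sketch of how the app might read this value with ‘json’: the JSON layout, the CONFIG_TEXT string, and the load_scale helper below are hypothetical and only illustrate the idea. Follow the linked tutorial for the actual file name and format.

```python
import json

# Hypothetical contents of the configuration file; only the "scale"
# variable itself comes from this tutorial.
CONFIG_TEXT = '{"scale": 1}'


def load_scale(config_text, default=1):
    """Parse the configuration text and return the integer scale."""
    config = json.loads(config_text)
    scale = config.get("scale", default)
    if isinstance(scale, bool) or not isinstance(scale, int):
        raise TypeError("scale must be an int")
    return scale


print(load_scale(CONFIG_TEXT))  # 1
```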

2) Creation of the App

Add the following import statements to the top of your file:

import os
import json
from posture import CheckPosture

We need ‘json’ to parse the configuration file, and ‘CheckPosture’ is the utility class for detecting poor posture, which we’ll define later in this tutorial.

NOTE: You can change the engine and the accelerator you use in this app depending on your deployment environment. Since I am developing on a Mac, I chose the engine to be ‘DNN’, and so I changed the engine parameter to be ‘edgeiq.Engine.DNN’. I also changed the accelerator to be ‘CPU’. You can read more about the accelerator options here, and more about the engine options here.

Next, remove the following lines from

text.append("Key Points:")
for key_point in pose.key_points:

Add the following lines to replace the ones you just removed (right under the ‘text.append’ statements):

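# update the instance key_points to check the posture
posture.set_key_points(pose.key_points)
# play a reminder if you are not sitting up straight
correct_posture = posture.correct_posture()
if not correct_posture:
    # show the user how to fix their posture
    text.append(posture.build_message())
    # make a sound to alert the user to improper posture
    print("\a")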

We used an unknown object type just there and called some functions on it that we haven’t defined yet. We’ll do that in the last section!

Move the following lines to directly follow the end of the above code (directly after the ‘for’ loop, and right before the ‘finally’):

streamer.send_data(results.draw_poses(frame), text)
if streamer.check_exit():
    break

3) Creating the Posture Utility Class

Create a new file called ‘’. Define the class using the line:

class CheckPosture:

Create the constructor for the class. We’ll have three instance variables: key_points, scale, and message.

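def __init__(self, scale=1, key_points=None):
    # use None as the default so instances never share one mutable dict
    self.key_points = key_points if key_points is not None else {}
    self.scale = scale
    self.message = ""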

We used defaults for scale and key_points, in case the user doesn’t provide them. We just initialize the variable message to hold an empty string, but this will store feedback that the user can use to correct their posture. You already saw the key_points variable get set in the main application section; this variable allows the functions in this class to make determinations about the user’s posture. Finally, the scale simply makes the calculations performed in this class either more or less sensitive when it is decreased or increased respectively.

Now we need to write some functions for the CheckPosture class.

Create a getter and setter for the key_points, message, and scale variables:

def set_key_points(self, key_points):
    self.key_points = key_points

def get_key_points(self):
    return self.key_points

def set_message(self, message):
    self.message = message

def get_message(self):
    return self.message

def set_scale(self, scale):
    self.scale = scale

def get_scale(self):
    return self.scale

Now we need functions to actually check the posture. My bad posture habits include leaning forward toward my computer screen, slouching down in my chair, and tilting my head down to look at notes, so I defined methods for detecting these use cases. You can use the same principle of coordinate comparison to define your own custom methods, if you prefer.

First, we’ll define the method to detect leaning forward, as shown in the image below. This method works by comparing an ear and a shoulder on the same side of the body. So first it detects whether the ear and shoulder are both visible (i.e. the coordinate we want to use is not -1) on either the left or right side, and then it checks whether the shoulder’s x-coordinate is greater than the ear’s x-coordinate.

def check_lean_forward(self):
    # left side: both key points must be visible (-1 means not detected)
    if self.key_points['Left Shoulder'].x != -1 \
        and self.key_points['Left Ear'].x != -1 \
        and self.key_points['Left Shoulder'].x >= \
        (self.key_points['Left Ear'].x + (self.scale * 150)):
        return False

    # right side: same comparison with the right shoulder and ear
    if self.key_points['Right Shoulder'].x != -1 \
        and self.key_points['Right Ear'].x != -1 \
        and self.key_points['Right Shoulder'].x >= \
        (self.key_points['Right Ear'].x + (self.scale * 160)):
        return False

    return True

NOTE: the coordinates for ‘alwaysai/human-pose’ are 0,0 at the upper left corner. Also, the frame size will differ depending on whether you are using a Streamer input video or images, and this will also impact the coordinates. I developed using a Streamer object and the frame size was (720, 1280). For all of these functions, you’ll most likely need to play around with the coordinate differences, or modify the scale, as every person will have a different posture baseline. The principle of coordinate arithmetic will remain the same, however, and can be used to change app behavior in other pose estimation use cases! You could also use angles or a percent of the frame, so as to not be tied to absolute numbers. Feel free to re-work these methods and submit a pull request to the GitHub repo!
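As a sketch of the “percent of the frame” idea from the note above, the lean check could be written against a fraction of the frame width instead of an absolute pixel count. The KeyPoint tuple, the leaning_forward function, and the 0.12 threshold (roughly 150 out of 1280 pixels, assuming 1280 is the frame width) are hypothetical and not part of the alwaysAI library.

```python
from collections import namedtuple

# Hypothetical stand-in for a pose key point with pixel coordinates;
# -1 means the key point was not detected.
KeyPoint = namedtuple("KeyPoint", ["x", "y"])


def leaning_forward(shoulder, ear, frame_width, max_fraction=0.12):
    """Return True if the shoulder is ahead of the ear by more than
    max_fraction of the frame width."""
    if shoulder.x == -1 or ear.x == -1:
        return False  # cannot judge posture without both key points
    return (shoulder.x - ear.x) / frame_width >= max_fraction


print(leaning_forward(KeyPoint(600, 400), KeyPoint(400, 300), 1280))  # True
print(leaning_forward(KeyPoint(450, 400), KeyPoint(400, 300), 1280))  # False
```

Because the threshold is relative, the same check should keep working if the frame size changes.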

Next, we’ll define the method for slouching down in a chair, such as in the image below.

In this method, we’ll use the y-coordinates of the neck and nose keypoints to detect when the nose gets too close to the neck, which happens when someone’s back is hunched down in a chair. For me, about 150 points was the minimum distance I wanted to allow. If my nose is less than 150 points from my neck, I want to be notified. Again, these hardcoded values can be scaled with the ‘scale’ factor or modified as suggested in the note above.

def check_slump(self):
    if self.key_points['Neck'].y != -1 \
       and self.key_points['Nose'].y != -1 \
       and (self.key_points['Nose'].y >= \
       self.key_points['Neck'].y - (self.scale * 150)):
       return False
    return True

Now, we’ll define the method to detect when a head is tilted down, as shown in the image below. This method will use the ear and eye key points to detect when the y-coordinate of a given eye is closer to the bottom of the image than the ear on the same side of the body.

def check_head_drop(self):
    if self.key_points['Left Eye'].y != -1 \
        and self.key_points['Left Ear'].y != -1 \
        and self.key_points['Left Eye'].y > \
        (self.key_points['Left Ear'].y + (self.scale * 15)):
        return False

    if self.key_points['Right Eye'].y != -1 \
        and self.key_points['Right Ear'].y != -1 \
        and self.key_points['Right Eye'].y > \
        (self.key_points['Right Ear'].y + (self.scale * 15)):
        return False

    return True

Now, we’ll just make a method that checks all the posture methods. This method works by using Python’s built-in all function, which returns True only if every element of the iterable passed to it is true. Since all of the posture methods we defined return False if poor posture is detected, the method we define now will return False if any one of those methods returns False.

def correct_posture(self):
    return all([self.check_slump(),
                self.check_head_drop(),
                self.check_lean_forward()])
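If you haven’t used all before, this tiny standalone example shows the behavior we rely on. The posture_ok function is just an illustrative wrapper and the lists stand in for the results of the three check methods.

```python
def posture_ok(check_results):
    """Posture is fine only if every individual check passed."""
    return all(check_results)


print(posture_ok([True, True, True]))   # True
print(posture_ok([True, False, True]))  # False
print(posture_ok([]))                   # True (all of nothing is True)
```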

And finally, we’ll build one method that returns a customized string that tells the user how they can modify their posture. This method is called in the main application and the result is displayed in the streamer’s text.

def build_message(self):
    current_message = ""
    if not self.check_head_drop():
        current_message += "Lift up your head!\n"

    if not self.check_lean_forward():
        current_message += "Lean back!\n"

    if not self.check_slump():
        current_message += "Sit up in your chair, you're slumping!\n"

    self.message = current_message
    return current_message

That’s it! Now you have a working posture correcting app. You can customize this app by creating your own posture detection methods, using different keypoint coordinates, making the build_message return different helpful hints, and creating your own custom audio file to use instead of the ‘print(“\a”)’.

If you want to run this app on a Jetson Nano, update your Dockerfile and the accelerator and engine arguments in your app as described in this article.

Now, just start your app (visit this page if you need a refresher on how to do this for your current set up), and open your web browser to ‘localhost:5000’ to see the posture corrector in action!

This posture corrector application development was also covered in one of the previous weekly “Hacky Hours”, you can watch the video recording of it on Youtube, just click here.

10 Facts About Human Vision a Computer Vision Scientist Might Not Know

In Computer Vision, Neural Science, Visual Illusion on June 7, 2020 at 3:14 pm

by Li Yang Ku (Gooly)

The one thing that all computer vision scientists can agree on is probably that, as of today, human vision is a lot better than computer vision algorithms (in the range of visible light) at understanding our surrounding world. However, most computer vision scientists don’t usually look into our vision system for inspiration since it is not part of the computer science curriculum. This post is about a few interesting facts about our own vision system that I consider less commonly known among computer vision folks (plus a few more commonly known facts to make it add up to ten.)

1) You transmit more signal when you don’t see:

You might think that the photoreceptors in our eyes are like light sensors that emit signals when photons hit the sensor. Well, it’s actually the opposite: the photoreceptors in our eyes depolarize and release more neurotransmitter when there is no light.

Visual Signals

2) Stars are smaller than they look:

I would argue that you can’t really see stars; when you look at the starry night sky you are seeing your eye’s “pixel” (the smallest dot in your visual field). This is because stars are too far away and are smaller than your eye’s resolution. Angular diameter is used to measure how large a circle appears in a view. For example, the star that has the largest angular diameter when viewed from earth is R Doradus, which has an angular diameter of 0.06 arc second (or 1.66667e-5 degree), but our eyes can at best tell apart points that are 28 arc seconds apart. Because the light emitted by stars is very strong, even though its light only hits a small portion of our photoreceptor, it is enough to cause that single neuron to hyperpolarize.

Stars are smaller than they look

(Note that because of the earth’s atmosphere and the imperfection of human eyes, when the star light hits your photoreceptors it will already be blurred and can be larger than 28 arc seconds if the star is bright; in this case brighter stars may appear larger than others) (relevant link) (relevant link 2)
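The arithmetic behind these numbers is short enough to check in a few lines. The variable names are mine; the two angular sizes are the ones quoted above.

```python
ARCSEC_PER_DEGREE = 3600  # 60 arc minutes per degree, 60 arc seconds per minute


def arcsec_to_degrees(arcsec):
    """Convert an angle from arc seconds to degrees."""
    return arcsec / ARCSEC_PER_DEGREE


star_diameter_arcsec = 0.06  # R Doradus, as quoted above
eye_resolution_arcsec = 28   # best case human eye resolution, as quoted above

print(arcsec_to_degrees(star_diameter_arcsec))  # about 1.66667e-05
ratio = eye_resolution_arcsec / star_diameter_arcsec
print(round(ratio))  # 467
```

So even the star with the largest angular diameter is hundreds of times smaller than what the eye can resolve.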

3) Visual illusions help survival:

Visual illusions aren’t just your vision system malfunctioning or some leftover trait from our ancestors; they are actually crucial for our survival. What we see when we open our eyes isn’t the raw information we get from our photoreceptors, it’s actually heavily post processed information. Visual illusions are merely the results of this post processing when the input is not something humans normally encounter in nature. For example, if you look at Kanizsa’s triangle shown below, you tend to see an upside down triangle even though there are no contours of one. This visual illusion is easy to notice in this image, but the same functionality is actually at work every moment you see. This is the reason you can easily identify different objects overlapping in your visual field. You might think you separate objects because of color or brightness, but if you actually take a digital picture and look at the pixel values, it is not always obvious where the contour is. If it were that easy, segmentation would be a solved computer vision problem already. (See my previous post on other visual illusions)

Kanizsa's Triangle

4) Some people can see more colors:

When I was a kid, one of my dreams was to discover a new kind of color. When I grew older I realized it was impossible since we can visualize all the colors in the visible light spectrum and no new color is left to discover. But I was actually wrong: color isn’t measurable externally, because it is an internal representation in our brain. So my childhood dream shouldn’t be to “discover” a new kind of color but to “sense” a new kind of color instead. So the remaining question is whether it is possible to sense a new kind of color.

People often disagree about colors; that’s because we all see colors a little bit differently. We typically have 3 different kinds of color sensors in our eyes that we call cones. These cones respond to light of different wavelengths and we associate these wavelengths with the colors we call red, green, and blue. If a light’s wavelength lies in between two of the cone types’ response ranges, both will fire and we see a different color. Your cones’ response ranges are slightly different from mine, therefore our representations of color would also be slightly different.

Some people can see more colors

Studies show that a percentage of humans (one study says 15% of women) have a fourth type of cone that responds to light with wavelengths between those of green and red light. This means that colors are actually sensed very differently by these people. People with four cone types may not realize they are sensing differently because color is an internal representation that cannot be compared. They may be seeing a new color normal people can’t see and getting responses like “Oh, that’s just a different shade of green”, while in fact they are having a totally different experience.

(Note that our screens, which fuse red, green, and blue light to simulate light of other colors, are designed only for people with the red, green, and blue cones. People with four cone types would probably find the colors on a display to be different from those of the real object.) (relevant link)

5) Cones are not evenly distributed

You might expect the color photoreceptors (cones) in your eyes to be evenly distributed on your retina, but that’s not true. You can find large areas in your eye with mostly one type of cone (link). Would this be a problem? It shouldn’t be once your brain post-processes the signal and fills in all the missing color. To demonstrate your brain’s color filling ability, the following image is actually a grayscale image with colored grid lines. You will notice your brain fills in the missing color if you look at it from a distance.

Your brain fills in colors

Image Source:

6) The photoreceptors are located close to the last layer in your eye

If I were to design a digital camera I would probably put my light sensors facing towards the lens and have the wires connected on the other side so that they wouldn’t block the light. This is however not how your eyes are designed. When light goes through the lens of your eye, it first has to pass the ganglion cells and their axons that transmit all the visual information to your brain, then another four layers that contain different neurons, before hitting the photoreceptors that respond to light. Luckily the five layers light has to pass through are mostly transparent, but this still seems to be a less than optimal design.

To understand the reason our eyes have this kind of structure we might have to look at the early eyes that first appeared on earth. The following sequence of images shows the evolution of eyes; the first version is just some photoreceptors on the skin. A cavity gradually formed because it creates a pinhole camera effect that gives more information about the outer world, which really helps if you are trying to eat prey or avoid becoming prey. After millions of years of evolution, the cavity closed and the lens formed to provide the ability to focus. Since in the early designs the layer of photoreceptors was flat, it might make sense that it was not located at the outermost layer so that it wouldn’t get damaged easily. (It could also be just due to how it was wired originally, but it is very likely a design due to evolution.)

The evolution of eye

Image source:

7) Car dashboard colors are not designed to match style

Your car’s dashboard may have a colored backlight at night. It may look cool, but the color choice was supposed to keep you safe, not to match your style. However, different car brands use different colors because designers can’t agree on which color is safer.

Why car dashboard light have different colors

There are two types of photoreceptors in our eyes: the cones that detect colors, which we described earlier, and the rods that don’t provide color information but are sensitive to brightness changes. When it’s dark we are mostly just using rods, therefore we normally don’t see much color at night. Although the rods don’t provide any color information, they are most sensitive to light with wavelengths close to those of blue and green light. Therefore, one argument is that a dim blue or green dashboard light can take advantage of the sensitivity of the rods so your dashboard would be more visible at night.

The other camp however suggests using bright red dash lights. The argument is that instead of having the rods do all the work, why not let the cones detect the dashboard light? Since rods are not sensitive to red, the bright red light wouldn’t affect the rods’ night vision. Both arguments sound reasonable; I guess the takeaway is that if you prefer a dim light use green or blue, but if you prefer a brighter dashboard use red.

8) You cannot see what you did not learn to see

Seeing the world around you happens so naturally that it is hard to imagine a person with a normal biological vision system not seeing something in front of them. However, this is something that can happen. If you did not experience vertical lines when you were learning to see, you might not be able to see vertical lines when exposed to a normal world. This is demonstrated in a series of experiments I talked about in my previous post; the short summary is that vision is not something you are born with but something you need experience to acquire.

cat experiment

9) The world becomes less colorful if you stop moving

Photoreceptors in your eyes gradually decrease their response to light even if the light level doesn’t change. So if you stopped moving (including your eyeballs) in a static world for long enough, the world you see wouldn’t be as colorful. However, since it usually requires a huge effort to not blink and not saccade, this isn’t normally a problem.

The reason to have this mechanism is to be adaptable to different environments. This is similar to the white balance and auto brightness adjustment option on a camera. If you are in a bright room, it’s probably better to be less sensitive to brightness. The side effect of this mechanism is that you see opposite colors if you look at a patch of colors too long. This side effect is actually used to help make Disney’s grass look greener.

Disneyland uses pink walkways to make grass look green

(More details: Photoreceptors that receive photons break down a messenger called cGMP, which causes sodium gates to close and the photoreceptor’s membrane potential to become more negative, but keeping the gates closed for too long also causes the calcium concentration to drop, which leads to more cGMP being produced and the gates reopening again.)

10) Vision regions in the brain can be repurposed for other senses

The current consensus among the neuroscience community is that our neocortex, which handles most of our visual processing and many other intelligent behaviors, mostly has the same structure across our brain. Studies show that areas normally dedicated to vision are repurposed for tactile or auditory senses in blind people. Because of this, with modern technology it is possible to allow blind people to see again through tactile senses. Brainport is a technology that uses an electrode array placed on the user’s tongue to allow blind people to see through a camera that is connected to this electrode array. The resolution is only 20×20, but the company mentioned that users can’t tell much difference when given a higher resolution.

Helping the blind to see

Another approach to make the blind see again is to use implants on the brain surface that generate electrical stimulations. One example is the Intracortical Visual Prosthesis Project; if done right, this approach should be able to provide visual information with higher resolution.

These are 10 facts about human vision, but probably not the 10 most interesting ones. See my post about visual pathways and subscribe to my blog for more interesting discoveries of human vision.

Talk the Talk: Optimization’s Untold Gift to Learning

In AI, Computer Vision, deep learning, Machine Learning on October 13, 2019 at 10:40 am

by Li Yang Ku (Gooly)

deep learning optimization

In this post I am going to talk about a fascinating talk by Nati Srebro at ICML this June. Srebro has given similar talks at many places but I think he really nailed it this time. This talk is interesting not only because he provided a different view of the role of optimization in deep learning but also because he clearly explained why many researchers’ arguments for why deep learning works don’t make sense.

Srebro first looks into what we know about deep learning (typical feed forward networks) based on three questions. The first question is regarding the capacity of the network. How many samples do we need to learn a certain network architecture? The short answer is that it should be proportional to the number of parameters in the network, which is the total number of edges. The second question is about the expressiveness of the network. What can we express with a certain model class? What type of questions can we learn? Since a two layer neural network is a universal approximator, it can approximate any continuous function. This is however not very useful information since it may require an exponentially large network and an exponential number of samples to learn. So the more interesting question is what can we express with a reasonably sized network? Much recent research more or less focuses on this question. However, Srebro argues that since there is another theory that says any function that can be executed within a reasonable amount of time can be captured by a network of reasonable size (please comment below if you know what theory this is), all problems that we expect to be solvable can be expressed by a reasonably sized network.
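As a small aside on the first question, the “number of edges” of a fully connected feed forward network is easy to count. The sketch below is only an illustration; the layer sizes are a hypothetical architecture I picked, and the `include_bias` flag is my own addition for counting bias terms as parameters too.

```python
def count_edges(layer_sizes, include_bias=False):
    """Count the weights (edges) of a fully connected feed forward network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    edges = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        edges += n_in * n_out  # one edge per (input unit, output unit) pair
        if include_bias:
            edges += n_out  # one bias parameter per output unit
    return edges


# Hypothetical architecture: 784 inputs, two hidden layers, 10 outputs.
layer_sizes = [784, 256, 128, 10]
print(count_edges(layer_sizes))  # 234752
```

By the rule of thumb above, the number of samples needed for this toy network would scale with a number in the hundreds of thousands.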

The third question is about computation. How hard is it to find optimal parameters? The bad news is that finding the weights for even tiny networks is NP-hard. Theories (link1 link2) show that even if the training data can be perfectly expressed by a small neural network, under standard complexity assumptions there is no polynomial time algorithm to find such a set of weights. This means that the expressiveness of neural networks described in question 2 doesn’t really do much good since we aren’t capable of finding the optimal solution. But we all know that in reality neural networks work pretty well; it seems that there is some magical property that allows us to learn neural networks. Srebro emphasizes that we still don’t know what the magical property is that makes neural networks learnable, but we do know it is not because we can represent the data well with the network. If you ask vision folks why neural networks work, they might say something like the lower layers of the network match low level visual features and the higher layers match higher level visual features. However, this answer is about the expressiveness of the network described in question 2, which is not sufficient for explaining why neural networks work and provides zero evidence since we already know neural networks have the power to express any problem.

Srebro then talked about the observed behavior that neural networks usually don’t overfit to the training data. This is an unexpected property quite similar to the behavior of AdaBoost, which was invented in 1997 and was quite popular in the 2000s. It was only after the invention that people discovered that the reason AdaBoost doesn’t overfit is that it is implicitly minimizing the L1 norm, which limits the complexity. So the question Srebro pointed out was whether the gradient descent algorithm for learning neural networks is also implicitly minimizing a certain complexity measure that would be beneficial in reaching a solution that generalizes. Given a set of training data, a neural network can have multiple optimal solutions that are global minima (zero training error). However, some of these global minima perform better than others on the test data. Srebro argues that the optimization algorithm might be doing something other than just minimizing the training error. Therefore, by changing the optimization algorithm we might observe a difference in how well a neural network generalizes to test data, and this is exactly what Srebro’s group discovered. In one experiment they showed that even though Adam achieves lower training error than stochastic gradient descent, it actually performs worse on the test data. What this means is that we might not be putting enough emphasis on optimization in the deep learning community where a typical paper looks like the following:

Deep Learning Paper Template

The contributions are on the model and loss function, while the optimization is just a brief mention. So the main point Srebro is trying to convey is that different optimization algorithms would lead to different inductive biases, and different inductive biases would lead to different generalization properties. “We need to understand optimization algorithm not just as reaching some global optimum, but as reaching a specific optimum.”
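A toy example of “reaching a specific optimum” that I find helpful (this is my own illustration, not an experiment from the talk): a single linear equation with two unknowns has infinitely many zero-error solutions, yet plain gradient descent started from zero always lands on the one with the smallest L2 norm, while a different starting point lands on a different global minimum. The equation, learning rate, and step count below are arbitrary choices.

```python
import math


def gradient_descent(w, a, b, lr=0.1, steps=200):
    """Minimize 0.5 * (a.w - b)^2 with plain gradient descent."""
    w = list(w)
    for _ in range(steps):
        residual = sum(ai * wi for ai, wi in zip(a, w)) - b
        # The gradient of the loss with respect to w is residual * a.
        w = [wi - lr * residual * ai for wi, ai in zip(w, a)]
    return w


def training_error(w, a, b):
    """Absolute error of the single training equation a.w = b."""
    return abs(sum(ai * wi for ai, wi in zip(a, w)) - b)


def l2_norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


# One equation, two unknowns: w1 + 2 * w2 = 5 has infinitely many solutions.
a, b = [1.0, 2.0], 5.0
w_from_zero = gradient_descent([0.0, 0.0], a, b)   # approaches (1, 2)
w_from_other = gradient_descent([3.0, 0.0], a, b)  # approaches (3.4, 0.8)

for w in (w_from_zero, w_from_other):
    print([round(wi, 4) for wi in w], training_error(w, a, b) < 1e-9, round(l2_norm(w), 4))
```

Both runs reach zero training error, but they are different solutions with different norms. Nothing in the loss asked for a small norm; it comes from how the optimization was run, which is the kind of implicit bias the talk is about.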

Srebro further talked about a few more works based on these observations. If you are interested by now, you should probably watch the whole video (you will need to fast forward a bit to get to the start). I am however going to put in a little bit of my own thoughts here. Srebro emphasizes the importance of optimization a lot in this talk and said the deep models we use now can basically express any problem we have, therefore the model is not what makes deep learning work. However, we also know that the model does matter based on the claims of many papers that invented new model architectures. So how could both of these claims be true? We have to remember that the model architecture is also part of the optimization process: it shapes the geometry that the optimization algorithm is optimizing on. Hence, if the neural network model provides a landscape that allows the optimization algorithm to reach a desired minimum more easily, it will also generalize better to the test data. In other words, the model and the optimization algorithm have to work together.

The Deep Learning Not That Smart List

In AI, Computer Vision, deep learning, Machine Learning, Paper Talk on May 27, 2019 at 12:00 pm

by Li Yang Ku (Gooly)

Deep learning is one of the most successful scientific stories in modern history, attracting billions in investment money in half a decade. However, there is always the other side of the story where people discover the less magical parts of deep learning. This post is about a few research papers (quite a few published this year) that show deep learning might not be as smart as you think (most of the time they would come up with a way to fix it, since it used to be forbidden to accept a paper without deep learning improvements.) This is just a short list; please comment below on other papers that also belong.

a) Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. “Intriguing properties of neural networks.”, ICLR 2014

The first non-magical discovery of deep learning has to go to the finding of adversarial examples. It was discovered that images with certain unnoticeable perturbations added can result in mysterious false detections by a deep network. Although technically the first publication of this discovery should go to the paper “Evasion Attacks against Machine Learning at Test Time” by Battista Biggio et al. published in September 2013 in ECML PKDD, the paper that really caught people’s attention is this one, which was put on arXiv in December 2013 and published in ICLR 2014. In addition to having bigger names on the author list, this paper also shows adversarial examples on more colorful images that clearly demonstrate the problem (see image below.) Since this discovery, there have been continuous battles between the band that tries to strengthen the defense against attacks and the band that tries to break it (such as “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples” by Athalye et al.), which leads to a recent paper in ICLR 2019 “Are adversarial examples inevitable?” by Shafahi et al. that questions whether it is possible that a deep network can be free of adversarial examples from a theoretical standpoint.
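To get a feel for how a tiny per-pixel change can flip a decision, here is a toy sketch on a linear classifier. This is a deliberate simplification and not the method of the paper, which searches for perturbations on real deep networks with an optimization procedure; here I just step each input against the sign of the gradient, and the weights, input, and `eps` are made-up values.

```python
def score(w, x):
    """Linear classifier score; a positive score means class 1."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


n = 100  # number of "pixels" in the toy input
w = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # made-up weights
x = [0.5 + 0.01 * wi for wi in w]                    # made-up input, score is about +1

# For a linear model the gradient of the score with respect to x is w,
# so stepping against sign(w) lowers the score as fast as possible
# for a given maximum change per pixel.
eps = 0.02
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

print(score(w, x) > 0, score(w, x_adv) > 0)
print(max(abs(xa - xi) for xa, xi in zip(x_adv, x)))
```

No pixel moves by more than `eps`, yet the effect adds up across all 100 dimensions and the predicted class flips. With real images the number of dimensions is far larger, so the per-pixel change can be far smaller.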

b) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.” CVPR 2018

This is not a paper intended to discover flaws of deep learning; in fact, the result of this paper is one of the most magical deep learning results I’ve seen. The authors showed that deep networks are able to fill in cropped out images in a very reasonable way (see image below, left input, right output.) However, it also unveils some less magical parts of deep learning. Deep learning’s success was mostly advertised as learning from data and claimed to work better than traditional engineered visual features because it learns from large amounts of data. This work, however, uses neither data nor pre-trained weights. It shows that convolution and the specific layered network architecture (which may be the outcome of millions of grad student hours through trial and error) played a significant role in the success. In other words, we are still engineering visual features but in a more subtle way. It also raises the question of what made deep learning so successful. Is it because of learning? Or because thousands of grad students tried all kinds of architectures, loss functions, and training procedures, and some combinations turned out to be great?

c) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” ICLR 2019.

It was widely accepted in the deep learning community that CNNs recognize objects by combining lower level filters that represent features such as edges into more complex shapes layer by layer. In this recent work, the authors noticed that contrary to what the community believes, existing deep learning models seem to have a strong bias towards textures. For example, a cat with elephant texture is often recognized as an elephant. Instead of learning what a cat looks like, CNNs seem to take the shortcut and just try to recognize cat fur. You can find a detailed blog post about this work here.

d) Wieland Brendel, and Matthias Bethge. “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet.” ICLR 2019.

This is a paper from the same group as the previous paper. Based on the same observations, this paper claims that CNNs are not that different from bag-of-features approaches that classify based on local features. The authors created a network that only looks at local patches in an image without high level spatial information and were able to achieve pretty good results on ImageNet. The authors further shuffled features in an image and found that existing deep learning models seem to be insensitive to these changes. Again CNNs seem to be taking shortcuts by making classifications based on just local features. More on this work can be found in this post.
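The reason a bag of local features cannot care about shuffling is easy to see in code. The sketch below is my own toy, not the network from the paper: the per-patch “evidence” is just the total intensity of a tile, a hypothetical stand-in for whatever a real model would compute from a local patch. Since the final score is a sum over patches, the positions of the patches never enter the result.

```python
import random


def split_into_patches(image, patch):
    """Cut a 2D list into non-overlapping patch x patch tiles (row-major order)."""
    height, width = len(image), len(image[0])
    tiles = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            tiles.append([row[c:c + patch] for row in image[r:r + patch]])
    return tiles


def patch_evidence(tile):
    """Hypothetical local feature: total intensity inside one tile."""
    return sum(sum(row) for row in tile)


def bag_of_features_score(tiles):
    """Sum the per-patch evidence; tile positions are never used."""
    return sum(patch_evidence(tile) for tile in tiles)


# A made-up 8x8 integer "image" cut into four 4x4 tiles.
image = [[(r * 8 + c) % 7 for c in range(8)] for r in range(8)]
tiles = split_into_patches(image, 4)

shuffled = tiles[:]
random.Random(0).shuffle(shuffled)  # scramble where each tile sits

print(bag_of_features_score(tiles) == bag_of_features_score(shuffled))
```

A model whose output barely changes under this kind of scrambling is, at least in that respect, behaving like this sum.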

e) Azulay, Aharon, and Yair Weiss. “Why do deep convolutional networks generalize so poorly to small image transformations?.” rejected by ICLR 2019.

This is a paper that discovered that modern deep networks may fail to recognize images shifted by just 1 pixel, but it got rejected because reviewers didn’t quite buy the experiments or the explanation. (The authors made the big mistake of not providing an improved deep network in the paper.) The paper showed that when the image is shifted slightly or when a sequence of frames from a video is given to a modern deep network, jaggedness appears in the detection result (see example below where the posterior probability of recognizing the polar bear varies a lot frame by frame.) The authors further created a dataset from ImageNet with the same images embedded in a larger image frame at a random location and showed that the performance dropped about 30% when the embedded frame is twice the width of the original image. This work shows that despite modern networks getting close to human performance on image classification tasks on ImageNet, they might not be able to generalize to the real world as well as we hoped.
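The embedding step of that dataset is simple to sketch. The helper below is hypothetical and only shows the idea of dropping an image into a larger frame at a random location; the constant `fill` value for the background and the nested-list image format are my assumptions, not details taken from the paper.

```python
import random


def embed_in_frame(image, frame_h, frame_w, rng, fill=0):
    """Place a 2D image at a random location inside a larger blank frame.

    Returns the new frame and the (top, left) offset that was chosen.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, frame_h - height)   # randint includes both ends
    left = rng.randint(0, frame_w - width)
    frame = [[fill] * frame_w for _ in range(frame_h)]
    for r in range(height):
        frame[top + r][left:left + width] = image[r]
    return frame, (top, left)


# A made-up 2x2 image embedded in a frame twice its width and height.
image = [[1, 2], [3, 4]]
rng = random.Random(0)
frame, (top, left) = embed_in_frame(image, 4, 4, rng)

for row in frame:
    print(row)
```

Feeding a classifier many such frames with different offsets is one way to check how much its prediction depends on where the object sits.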

f) Nalisnick, Eric, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. “Do Deep Generative Models Know What They Don’t Know?.” ICLR 2019

This work from DeepMind tackles the problem that, when tested on data with a distribution different from training, a deep neural network can give wrong results with high confidence. For example, in the paper “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks” by Louizos and Welling, it was discovered that on the MNIST dataset a trained network can be highly confident but wrong when the input digit is tilted. This makes deploying deep learning to critical tasks quite problematic. Deep generative models were thought to be a solution to such problems: since such a model also models the distribution of the samples, it can reject an input as an anomaly if it does not belong to the same distribution as the training samples. However, the authors’ short answer to the question is no; even for very distinct datasets such as digits versus images of horses and trucks, anomalies cannot be identified, and in many cases the models even wrongfully provide stronger confidence than for samples that do come from the training dataset. The authors therefore “urge caution when using these models with out-of-training-distribution inputs or in unprotected user-facing systems.”
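For reference, the rejection rule that people hoped would work looks roughly like the sketch below: fit a density to the training data, then flag any input whose likelihood is lower than anything seen in training. This is a toy with a 1D Gaussian and made-up numbers, and the thresholding rule is my own choice. It behaves as expected here, but the point of the paper is that this intuition can fail for deep generative models on images.

```python
import math
from statistics import NormalDist


def fit_density(train_values):
    """Fit a 1D Gaussian as a stand-in for a generative model."""
    return NormalDist.from_samples(train_values)


def log_likelihood(model, x):
    """Log density of x under the fitted Gaussian."""
    z = (x - model.mean) / model.stdev
    return -0.5 * z * z - math.log(model.stdev * math.sqrt(2 * math.pi))


def is_out_of_distribution(model, x, threshold):
    """Reject x when the model finds it less likely than the threshold."""
    return log_likelihood(model, x) < threshold


# Made-up training samples clustered around 5.
train = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
model = fit_density(train)

# Assumed rule: anything less likely than the least likely training sample is rejected.
threshold = min(log_likelihood(model, v) for v in train)

print(is_out_of_distribution(model, 5.05, threshold))
print(is_out_of_distribution(model, 9.0, threshold))
```

The rule is only as good as the density behind it. If the model assigns high likelihood to inputs that look nothing like the training data, the threshold lets them through.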
