Life is a game, take it seriously

Archive for the ‘Machine Learning’ Category

Training a Rap Machine

In AI, brain, deep learning, Machine Learning, Serious Stuffs on January 9, 2020 at 7:15 pm

by Li Yang Ku (Gooly)

(link to the rap machine if you prefer to try it out first)

In my previous post, I gave a short tutorial on how to use the Google AI platform for small garage projects. In this post, I am going to follow up and talk about how I built (or, more accurately, attempted to build) my holiday project: a machine that completes your rap lyrics using the “Transformer” neural network.

Transformer is a neural network model introduced by Google Brain, mostly for language-related tasks. What is interesting about this architecture is that instead of taking one word at a time, it takes in the whole input sentence at once and learns the relationship between each word. This allows transformers to learn useful relationships, such as what a pronoun refers to in a sentence. In the original paper “Attention is All You Need”, this ability to understand relations between words is referred to as attention, since the network can focus more on certain pairs of words. I will not go into the details of the Transformer since quite a few people have already explained it at great length in their blogs (such as this blog and this blog.) My rationale was that the Transformer’s ability to learn relationships between words in rap sentences should allow the network to learn which words rhyme well together or have the right flow.
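
As a rough illustration of the attention idea (not the full multi-head Transformer, and not the code I actually used), here is a minimal numpy sketch of scaled dot-product attention, where each output word representation becomes a weighted mix of all input words:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d) arrays of word representations
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise word-to-word relevance
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # every output mixes all input words

# toy example: a "sentence" of 5 words with 8-dimensional embeddings
x = np.random.randn(5, 8)
out = scaled_dot_product_attention(x, x, x)          # self-attention
```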

I collected rap lyrics from 14 artists, a total of around 180,000 lines of rap. These lyrics were further filtered down to around 50,000 lines that I considered to “rhyme” with another line. The first line is the input and the second line is the output. Pairs of these sentences are then split into training and evaluation sets (90:10 split.) The Transformer architecture I used is mostly based on this Google Cloud Platform tutorial on generating poetry. After a bit of hyperparameter tuning, I ended up with a Transformer with 2 hidden layers, 4 heads (the number of attention heads), and a hidden layer size of 512. This model was then trained for 75,000 iterations. Experimenting with different models through trial and error cost me a bit more than a hundred bucks (note that you are given $300 to start).
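
For reference, here is a hedged sketch of how such line pairs could be built and split; the file name and the rhyme heuristic are made up for illustration, and my actual filtering was messier:

```python
import random

def naive_rhyme(a, b):
    # made-up heuristic: the final words share their last two letters
    return a.split()[-1][-2:].lower() == b.split()[-1][-2:].lower()

lines = [l.strip() for l in open("lyrics.txt") if l.strip()]  # hypothetical file of rap lines
pairs = [(lines[i], lines[i + 1])                             # input line -> output line
         for i in range(len(lines) - 1)
         if naive_rhyme(lines[i], lines[i + 1])]

random.seed(0)
random.shuffle(pairs)
split = int(len(pairs) * 0.9)                                 # 90:10 train/eval split
train_pairs, eval_pairs = pairs[:split], pairs[split:]
```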

The final result was, well, entertaining but with limited success. Occasionally it will rhyme if your sentence ends with a common word, but most of the time it’s just a weirdo spitting out random sentences with a semi-rapper attitude. The following are a few example results:

Input: This thing I made raps like a machine

Output: It’s all fair in love with me

Input: It generates raps that run deep

Output: That’s what I mean

Input: No rap is free, this thing cost me a fee

Output: I got a whole lotta guap on me

Input: Consider this as a new year gift that raps

Output: Chillin’ on some wildin’ out of that

Input: You might say my input rap is bad

Output: I was gone but now I m back

Input: If you don’t like your output rap

Output: Where the fuck am I going back

You can try out the rap machine here yourself. Thank you all for reading this blog and wish you all an entertaining 2020!

Tool Tutorial: Google AI Platform for Hobbyist

In AI, App, deep learning, Machine Learning, Serious Stuffs on October 27, 2019 at 10:44 pm

by Li Yang Ku (Gooly)

In this post I am going to talk about the Google AI platform (previously called Google ML Engine) and how to use it if deep learning is just your after-work hobby. I will provide links to other tutorials and details at the end so that you can try it out, but the purpose of this post is to give you the big picture of how it works without having to read through all the marketing phrases targeting company decision makers.

Google AI platform is part of the Google cloud and provides computing power for training and deploying deep networks. So what’s the difference between this platform and any other cloud computing service such as AWS (Amazon Web Services)? Google AI platform is specialized for deep learning and is supposed to simplify the process. If you are using TensorFlow (also developed by Google) with a pretty standard neural network architecture, it should be a breeze to train and deploy your model for online applications. There is no need to set up servers; all you need is a few lines of gcloud commands and your model will be trained and deployed in the cloud. (You also get $300 of first-year credit for signing up on Google Cloud Platform, which is quite a lot for home projects.) Note that Google AI platform is not the only shop in town; take a look at Microsoft’s Azure AI if you’d like to shop around.

So how does it work? First of all, there are four ways to communicate with the Google AI platform. You can do it 1) locally, where you have all the code on your computer and communicate through commands directly; 2) on Google Colab: Colab is another Google project that is basically a Jupyter notebook in the cloud which you can share with others; 3) on an AI Platform notebook, which is similar to Colab but has more direct access to the platform and more powerful machines; and 4) on any other cloud server or Jupyter-notebook-like web service such as FloydHub. The main difference between using Colab and an AI Platform notebook is pricing. Colab is free (even with GPU access) but has limitations such as a maximum of 12 hours of run time and shutting down after 90 minutes of idle time. It provides you with about 12GB of RAM and 50GB of disk space (although the disk is half full when started due to preinstalled packages). After disconnecting, you can still reconnect with whatever you wrote in the notebook, but you will lose whatever is in RAM and on disk. For a home project, Colab is probably sufficient; the disk space is not a limitation since we can store training data in Google Cloud Storage. (Note that it is also possible to mount Google Drive in Colab so that you don’t need to start from scratch every time.) On the other hand, an AI Platform notebook can be pricey if you want to keep it running ($0.137/hour, about $99.89/month, for a non-GPU machine).

Before we move on, we also have to understand the difference between computation and storage on the Google AI platform. Unlike personal computers, where disk space and computation are tightly integrated, they are separated in the cloud. There are machines responsible for computation and machines responsible for storage. Here, the Google AI platform is responsible for the computation while Google Cloud Storage takes care of the stored data and code. Therefore, before we start using the platform we first need to create a storage space called a bucket. This can easily be done with a one-line command once you have created a Google Cloud account.
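
For example, creating a bucket with the gsutil command-line tool looks roughly like this (the bucket name and region below are placeholders):

```
gsutil mb -l us-central1 gs://my-rap-project-bucket
```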

If you are using Colab, you will also need the code for training your neural network downloaded to your Colab virtual machine. One common workflow is to keep your code in a version control service such as GitHub and just clone the files to Colab every time you start (see the snippet below). It makes more sense to use Colab if you are collaborating with others or want to share how you train your model; otherwise doing everything locally might be simpler.
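
In a Colab cell this is just a couple of lines (the repository URL here is hypothetical):

```
# "!" runs a shell command inside a Colab/Jupyter cell
!git clone https://github.com/your-username/your-project.git
%cd your-project
```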

So the whole training process looks like this:

  1. Create a Google Cloud project.
  2. Create a bucket that the Google AI platform can read from and write to.
  3. With a single command, upload your code to the bucket and request the AI platform to perform training (see the example command after this list).
  4. Optionally run hyperparameter tuning if needed.
  5. If you want the trained model locally, simply download it from the bucket through the user interface or a command.
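
As a rough example of step 3, a training job submission might look like the following; the job name, paths, and version numbers are placeholders, the arguments after the bare `--` go to your own trainer code, and you should check the current gcloud documentation for the exact flags:

```
gcloud ai-platform jobs submit training my_training_job_1 \
  --package-path trainer/ \
  --module-name trainer.task \
  --region us-central1 \
  --job-dir gs://my-rap-project-bucket/output \
  --runtime-version 1.14 \
  --python-version 3.5 \
  -- \
  --train_steps=75000
```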

A trained model is not very useful if not used. Google AI platform provides an easy way to deploy your model as a service in the cloud. Before continuing, we should clarify some Google terminology: on the AI platform, a “model” means an interface that solves a certain task, and a trained model is called a “version” of this “model” (reference). In the following, quotation marks are put around Google-specific terminology to avoid confusion.

The deployment and prediction process is then the following:

  1. Create a “model” at AI platform.
  2. Create a “version” of the “model” by providing the trained model stored in the bucket.
  3. Make predictions through one of the following approaches (an example follows after this list):
    • gcloud commands
    • Python interface
    • Java interface
    • REST API
      (the first three methods are just easier ways to generate a REST request)
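
For example, with the gcloud command-line approach, deploying and querying a trained model might look roughly like this (the model name, version, paths, and runtime version are placeholders):

```
gcloud ai-platform models create rap_machine --regions us-central1
gcloud ai-platform versions create v1 \
  --model rap_machine \
  --origin gs://my-rap-project-bucket/export/model_dir \
  --runtime-version 1.14
gcloud ai-platform predict --model rap_machine --version v1 \
  --json-instances instances.json
```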

And that’s all you need to give your homemade web application access to scalable deep learning prediction capabilities. You can run the whole process described above through this official tutorial in Colab, and more details on the tutorial can be found here. I will be posting follow-up posts on building specific applications on the Google AI platform, so stay tuned if you are interested.

Talk the Talk: Optimization’s Untold Gift to Learning

In AI, Computer Vision, deep learning, Machine Learning on October 13, 2019 at 10:40 am

by Li Yang Ku (Gooly)

deep learning optimization

In this post I am going to talk about a fascinating talk by Nati Srebro at ICML this June. Srebro has given similar talks at many places, but I think he really nailed it this time. This talk is interesting not only because he provides a different view of the role of optimization in deep learning but also because he clearly explains why many researchers’ arguments about why deep learning works don’t make sense.

Srebro first looks into what we know about deep learning (typical feed-forward networks) through three questions. The first question concerns the capacity of the network: how many samples do we need to learn a certain network architecture? The short answer is that it should be proportional to the number of parameters in the network, which is the total number of edges. The second question is about the expressiveness of the network: what can we express with a certain model class? What kinds of problems can we learn? Since a two-layer neural network is a universal approximator, it can learn any continuous function; this is, however, not very useful information, since it may require an exponentially large network and an exponential number of samples. So the more interesting question is what we can express with a reasonably sized network. Much recent research more or less focuses on this question. However, Srebro argues that since there is another theory that says any function that can be executed within a reasonable amount of time can be captured by a network of reasonable size (please comment below if you know what theory this is), all problems that we expect to be solvable can be expressed by a reasonably sized network.

The third question is about computation: how hard is it to find optimal parameters? The bad news is that finding the weights of even tiny networks is NP-hard. Theories (link1 link2) show that even if the training data can be perfectly expressed by a small neural network, there is no polynomial-time algorithm to find such a set of weights. This means that the expressiveness described in question 2 doesn’t really do much good, since we aren’t capable of finding the optimal solution. But we all know that in reality neural networks work pretty well; it seems there is some magical property that allows us to learn them. Srebro emphasizes that we still don’t know what the magical property that makes neural networks learnable is, but we do know it is not because we can represent the data well with the network. If you ask vision folks why neural networks work, they might say something like: the lower layers of the network match low-level visual features and the higher layers match higher-level visual features. However, this answer is about the expressiveness of the network described in question 2, which is not sufficient for explaining why neural networks work and provides zero evidence, since we already know neural networks have the power to express any problem.

Srebro then talked about the observed behavior that neural networks usually don’t overfit to the training data. This is an unexpected property quite similar to the behavior of AdaBoost, which was invented in 1997 and quite popular in the 2000s. It was only after its invention that people discovered the reason AdaBoost doesn’t overfit: it implicitly minimizes the L1 norm, which limits complexity. So the question Srebro raised was whether the gradient descent algorithms used for learning neural networks are also implicitly minimizing some complexity measure that helps reach a solution that generalizes. Given a set of training data, a neural network can have multiple optimal solutions that are all global minima (zero training error). However, some of these global minima perform better than others on the test data. Srebro argues that the optimization algorithm might be doing something other than just minimizing the training error. Therefore, by changing the optimization algorithm we might observe a difference in how well a neural network generalizes to test data, and this is exactly what Srebro’s group discovered. In one experiment they showed that even though Adam achieves lower training error than stochastic gradient descent, it actually performs worse on the test data. What this means is that we might not be putting enough emphasis on optimization in the deep learning community, where a typical paper looks like the following:

Deep Learning Paper Template

The contributions are on the model and the loss function, while the optimization gets only a brief mention. So the main point Srebro is trying to convey is that different optimization algorithms lead to different inductive biases, and different inductive biases lead to different generalization properties: “We need to understand optimization algorithms not just as reaching some global optimum, but as reaching a specific optimum.”
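
To make the comparison concrete, here is a small hedged sketch (in PyTorch, on a toy synthetic problem, not Srebro’s actual experiment) of the kind of test his group ran: train the exact same model with two optimizers and compare training versus test error:

```python
import torch
import torch.nn as nn

def train(optimizer_cls, **opt_kwargs):
    torch.manual_seed(0)
    # tiny synthetic binary classification problem, purely for illustration
    X = torch.randn(512, 20)
    y = (X[:, :2].sum(dim=1) > 0).float()
    X_test = torch.randn(512, 20)
    y_test = (X_test[:, :2].sum(dim=1) > 0).float()

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = optimizer_cls(model.parameters(), **opt_kwargs)
    loss_fn = nn.BCEWithLogitsLoss()

    for _ in range(500):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()

    with torch.no_grad():
        train_err = ((model(X).squeeze(-1) > 0).float() != y).float().mean()
        test_err = ((model(X_test).squeeze(-1) > 0).float() != y_test).float().mean()
    return train_err.item(), test_err.item()

print("SGD  (train err, test err):", train(torch.optim.SGD, lr=0.1))
print("Adam (train err, test err):", train(torch.optim.Adam, lr=1e-3))
```

On a toy problem like this the gap may or may not show up; the point is only that swapping the optimizer, with everything else fixed, is the experiment that exposes the implicit bias.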

Srebro further talked about a few more works based on these observations. If you are interested by now, you should probably watch the whole video (you will need to fast-forward a bit to the start.) I am, however, going to put in a little bit of my own thoughts here. Srebro emphasizes the importance of optimization a lot in this talk and says the deep models we use now can basically express any problem we have; therefore the model is not what makes deep learning work. However, we also know that the model does matter, based on the claims of many papers that introduce new model architectures. So how can both of these claims be true? We have to remember that the model architecture is also part of the optimization process: it shapes the geometry the optimization algorithm is optimizing over. Hence, if the neural network model provides a landscape that allows the optimization algorithm to reach a desired minimum more easily, it will also generalize better to the test data. In other words, the model and the optimization algorithm have to work together.

The Deep Learning Not That Smart List

In AI, Computer Vision, deep learning, Machine Learning, Paper Talk on May 27, 2019 at 12:00 pm

by Li Yang Ku (Gooly)

Deep learning is one of the most successful scientific stories in modern history, attracting billions of dollars of investment within half a decade. However, there is always the other side of the story, where people discover the less magical parts of deep learning. This post is about a few pieces of research (quite a few published this year) that show deep learning might not be as smart as you think (most of the time they also come up with a way to fix it, since it used to be forbidden to accept a paper without deep learning improvements.) This is just a short list; please comment below on other papers that also belong.

a) Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. “Intriguing properties of neural networks.”, ICLR 2014

The first non-magical discovery about deep learning has to go to the finding of adversarial examples. It was discovered that images with certain unnoticeable perturbations added can result in mysterious false detections by a deep network. Although technically the first publication of this discovery should go to the paper “Evasion Attacks against Machine Learning at Test Time” by Battista Biggio et al., published in September 2013 at ECML PKDD, the paper that really caught people’s attention is this one, which was put on arXiv in December 2013 and published at ICLR 2014. In addition to having bigger names on the author list, this paper also shows adversarial examples on more colorful images that clearly demonstrate the problem (see image below.) Since this discovery, there have been continuous battles between the camp that tries to strengthen defenses against attacks and the camp that tries to break them (such as “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples” by Athalye et al.), which led to a recent paper at ICLR 2019, “Are adversarial examples inevitable?” by Shafahi et al., that questions from a theoretical standpoint whether a deep network can ever be free of adversarial examples.

b) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.” CVPR 2018

This is not a paper intended to discover flaws in deep learning; in fact, the result of this paper is one of the most magical deep learning results I’ve seen. The authors showed that deep networks are able to fill in cropped-out parts of images in a very reasonable way (see image below, left input, right output.) However, it also unveils some less magical parts of deep learning. Deep learning’s success was mostly advertised as learning from data, and it was claimed to work better than traditional engineered visual features because it learns from large amounts of data. This work, however, uses no data nor pre-trained weights. It shows that convolution and the specific layered network architecture (which may be the outcome of millions of grad student hours of trial and error) played a significant role in the success. In other words, we are still engineering visual features, just in a more subtle way. It also raises the question of what made deep learning so successful: is it because of learning, or because thousands of grad students tried all kinds of architectures, loss functions, and training procedures, and some combinations turned out to be great?

c) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” ICLR 2019.

It was widely accepted in the deep learning community that CNNs recognize objects by combining lower-level filters that represent features such as edges into more complex shapes layer by layer. In this recent work, the authors noticed that, contrary to what the community believes, existing deep learning models seem to have a strong bias towards textures. For example, a cat with elephant texture is often recognized as an elephant. Instead of learning what a cat looks like, CNNs seem to take the shortcut and just try to recognize cat fur. You can find a detailed blog post about this work here.

d) Wieland Brendel, and Matthias Bethge. “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet.” ICLR 2019.

This is a paper from the same group as the previous one. Based on the same observations, this paper claims that CNNs are not that different from bag-of-features approaches that classify based on local features. The authors created a network that only looks at local patches in an image, without high-level spatial information, and were able to achieve pretty good results on ImageNet. The authors further shuffled features within an image and showed that existing deep learning models seem insensitive to these changes. Again, CNNs seem to be taking shortcuts by classifying based on just local features. More on this work can be found in this post.

e) Azulay, Aharon, and Yair Weiss. “Why do deep convolutional networks generalize so poorly to small image transformations?.” rejected by ICLR 2019.

This is a paper that discovered that modern deep networks may fail to recognize images shifted by a single pixel, but it got rejected because reviewers didn’t quite buy the experiments or the explanation. (The authors made the big mistake of not providing an improved deep network in the paper.) The paper showed that when an image is shifted slightly, or when a sequence of frames from a video is given to a modern deep network, jaggedness appears in the detection results (see the example below, where the posterior probability of recognizing the polar bear varies a lot frame by frame.) The authors further created a dataset from ImageNet with the same images embedded at a random location in a larger image frame and showed that performance dropped by about 30% when the embedding frame is twice the width of the original image. This work shows that despite modern networks getting close to human performance on ImageNet image classification, they might not generalize to the real world as well as we hoped.

f) Nalisnick, Eric, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. “Do Deep Generative Models Know What They Don’t Know?.” ICLR 2019

This work from DeepMind looks into the problem that, when tested on data with a distribution different from training, a deep neural network can give wrong results with high confidence. For example, in the paper “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks” by Louizos and Welling, it was discovered that on the MNIST dataset a trained network can be highly confident but wrong when the input digit is tilted. This makes deploying deep learning to critical tasks quite problematic. Deep generative models were thought to be a solution to such problems: since they also model the distribution of the samples, they should be able to reject anomalies that do not come from the same distribution as the training samples. However, the authors’ short answer to the question is no; even for very distinct datasets, such as digits versus images of horses and trucks, anomalies cannot be identified, and in many cases they are even wrongfully assigned higher confidence than samples that do come from the training distribution. The authors therefore “urge caution when using these models with out-of-training-distribution inputs or in unprotected user-facing systems.”

Paper Picks: IROS 2018

In AI, deep learning, Paper Talk, Robotics on December 30, 2018 at 4:18 pm

By Li Yang Ku (Gooly)

I was at IROS in Madrid this October presenting some fan manipulation work I did earlier (see video below), which the King of Spain also attended (see figure above.) When even the King is talking about deep learning, you know what the trend (or hype) in robotics is. Madrid is a fabulous city, so I was only able to pick a few papers below to share.

 

a) Roberto Lampariello, Hrishik Mishra, Nassir Oumer, Phillip Schmidt, Marco De Stefano, Alin Albu-Schaffer, “Tracking Control for the Grasping of a Tumbling Satellite with a Free-Floating Robot”

This is work done by folks at DLR (the German Aerospace Center). The goal is to grasp a tumbling satellite with a robot that is free-floating on another satellite. As you can tell, this is a challenging task, and this work presents progress extending a series of previous work done by different space agencies. Research on related grasping tasks can be roughly classified into feedback control methods that solve a regulation control problem, and optimal control approaches that compute a feasible optimal trajectory in an open-loop fashion. In this work, the authors propose a system that combines both feedback and optimal control. This is achieved by using a motion planner, generated off-line with all relevant constraints, to provide a reference trajectory for visual servoing. Servoing will deviate from the original plan, but the gross motion will be maintained to respect motion constraints (such as avoiding singularities.) This approach is tested in a zero-gravity facility. If you haven’t seen one of these zero-gravity devices, they are quite common among space agencies and are used to turn off gravity (see figure above.)

b) Josh Tobin, Lukas Biewald , Rocky Duan , Marcin Andrychowicz, Ankur Handa, Vikash Kumar, Bob McGrew, Alex Ray, Jonas Schneider, Peter Welinder, Wojciech Zaremba, Pieter Abbeel, “Domain Randomization and Generative Models for Robotic Grasping.”

This is work done (mostly) at OpenAI that tries to tackle grasping with deep learning. Previous works on grasping with deep learning were usually trained on at most thousands of unique objects, which is relatively small compared to datasets for image classification such as ImageNet. In this work, a new data generation pipeline is proposed that cuts meshes and combines them randomly in simulation. With this approach the authors generated a million unrealistic training samples and show that they can be used to learn grasping of realistic objects and achieve accuracy similar to the state of the art. The proposed architecture is shown above: α is a convolutional neural network, β is an autoregressive model that generates n different grasps (n=20), and γ is another neural network trained separately to evaluate each grasp using the likelihood of success calculated by the autoregressive model plus another observation from the in-hand camera. The use of an autoregressive model is an interesting choice, which the authors claim is advantageous since it can directly compute the likelihood of samples.

c) Barrett Ames, Allison Thackston, George Konidaris, “Learning Symbolic Representations for Planning with Parameterized Skills.”

This is a planning work (by folks I know) that combines parameterized motor skills with higher-level planning. At each state the robot needs to select both an action and how to parameterize it. This work introduces a discrete abstract representation for this kind of planning and demonstrates it on Angry Birds and a coffee-making task (see figure above.) The authors showed that the approach is capable of generating a state representation that requires very few symbols (here symbols are used to describe preconditions and state estimates), therefore allowing an off-the-shelf probabilistic planner to plan faster. Only 16 symbols are needed for the Angry Birds task (not the real Angry Birds, a simpler version), and a plan can be found in 4.5 ms. One of the observations is that the only parameter settings that need to be represented by a symbol are the ones that maximize the probability of reaching the next state on the path to the goal.

RSS 2018 Highlights

In Machine Learning, Paper Talk, Robotics on July 10, 2018 at 3:18 pm

by Li Yang Ku (Gooly)

I was at RSS (Robotics: Science and Systems) in Pittsburgh a few weeks ago. The conference was held in the Carnegie Music Hall, and the conference badge could also be used to visit the two Carnegie museums next to it. (The Eskimo and Native American exhibition on the third floor is a must-see. Just in case you don’t know, an igloo can be built within 1.5 hours by just two Inuit, and there is a video of it.)

RSS is a relatively small conference compared to IROS and ICRA. With only a single track, you get to see every accepted paper from many different fields, ranging from robotic whiskers to surgical robots. I would however argue that the highlights of this year’s RSS were the keynote talks by Bernardine Dias and Chad Jenkins. Unlike most keynote talks I’ve been to, these two talks were less about new technologies and more about humanity and diversity. In this post, I am going to talk about both talks plus a few interesting papers from RSS.

a) Bernardine Dias, “Robotics technology for underserved communities: challenges, rewards, and lessons learned.”

Bernardine’s group focuses on adapting technologies so that they can be accessible to communities that are left behind. One of the technologies developed was a tool for helping blind students learn braille, and it has had significant impact among blind communities across the globe. Bernardine gave an amazing talk at RSS. However, the video of her talk is not public yet (not sure if it will be) and, surprisingly, not many videos of her are on the internet. The closest content I can find is a really nice audio interview with Bernardine. There is also a short video describing their work below, but what this talk is really about is not the technology or design but the lessons learned through helping these underserved communities.

When roboticists talk about helping society, many of them focus on the technology and leave the actual application to the future. Bernardine’s group is different in that they actually travel to these underserved communities to understand what they need and integrate the communities’ feedback directly into the design process. This is easier said than done. You have to understand each community before your visit; some acts are considered good in one culture but an insult in another. Giving without understanding often results in waste. Bernardine mentioned in her talk that one of the schools in an underserved community they collaborated with received a large one-time donation to buy computers. It was a large event where important people came, and it was broadcast on the news. However, to accommodate the hardware, this two-classroom school had to give up one of its classrooms and therefore reduce the number of classes it could teach. Ironically, the school did not have the resources to power these computers, nor people to teach students or teachers how to use them. The donation actually resulted in more harm than help to the community.

b) Odest Chadwicke (Chad) Jenkins, “Robotics: Making the World a Better Place through Minimal Message-oriented Transport Layers .”

While Bernardine tries to adapt technologies for underserved communities, Chad tries to design interfaces that help people with disabilities by deploying robots in their homes. Chad showed some of the work done by Charlie Kemp’s group and his own lab with Henry Evans. Henry Evans was a successful financial officer in Silicon Valley until a stroke left him paralyzed and mute. However, Henry did not give up living fully and has become a strong advocate of robots for people with disabilities. Henry’s story is inspiring and an example of how robots can help people with disabilities live freely. The Robots for Humanity project is the result of these successful collaborations. Since then, Henry has given three TED talks through robots, and the one below shows how Chad helped him fly a quadrotor.

 

However, the highlight of Chad’s talk was when he called for more diversity in the community. Minorities, especially African Americans and Latinos, are severely underrepresented in robotics communities in the U.S. The issue of diversity is usually not something roboticists or computer scientists think of or list as a priority. Based on Chad’s numbers, past robotics conferences, including previous RSSs, were not immune to this kind of negligence. This is not hard to see: among the thousands of conference talks I’ve been to, there were probably no more than three talks by African American speakers. Although there are no obvious solutions to this problem yet, having the community be aware of, or agree, that this is a problem is an important first step. Chad urges people to be aware of whether everyone is given equal opportunities; simply being friendly to isolated minorities at a conference may make a difference in the long run.

c) Rico Jonschkowski, Divyam Rastogi, and Oliver Brock. “Differentiable Particle Filters.”

This work introduces a differentiable particle filter (DPF) that can be trained end to end. The DPF is composed of an action sampler that generates action samples, an observation encoder, a particle proposer that learns to generate new particles based on observations, and an observation likelihood estimator that weights each particle. These four components are feedforward networks that can be learned from training data. What I found interesting is that the authors made comments similar to the authors of the Deep Image Prior paper: deep learning approaches work not just because of learning but also because of engineered structure, such as convolutional layers, that encodes priors. This motivated the authors to look for architectures that encode prior knowledge of algorithms into the neural network.

d) Marc Toussaint, Kelsey R. Allen, Kevin A. Smith, and Joshua B. Tenenbaum. “Differentiable Physics and Stable Modes for Tool-Use and Manipulation Planning.”

Task and motion planning (TAMP) approaches combine symbolic task planners and geometric motion planners hierarchically. Symbolic task planners can be helpful in solving task sequences based on high-level logic, while geometric planners operate on detailed specifications of the world state. This work is an extension that further considers dynamic physical interactions. The whole robot action sequence is modeled as a sequence of modes connected by switches. Modes represent durations that have constant contact or can be modeled by kinematic abstractions. The task can therefore be written in the form of a logic-geometric program in which the whole sequence can be jointly optimized. The video above shows that such an approach can solve tasks the authors call physical puzzles. This work also won the best paper award at RSS.

Paper Picks: CVPR 2018

In Computer Vision, deep learning, Machine Learning, Neural Science, Paper Talk on July 2, 2018 at 9:08 pm

by Li Yang Ku (Gooly)

I was at CVPR in Salt Lake City. This year there were more than 6,500 attendees and a record-high number of accepted papers. People were definitely struggling to see them all. It was a little disappointing that there were no keynote speakers, but among the 9 major conferences I have been to, this one had the best dance party (see image below). You never know how many computer scientists can dance until you give them unlimited alcohol.

In this post I am going to talk about a few papers that were not the most popular ones but that I personally found interesting. If you want to know which papers the reviewers thought were interesting instead, you can look into the best paper “Taskonomy: Disentangling Task Transfer Learning” and the four other honorable mentions, including “SPLATNet: Sparse Lattice Networks for Point Cloud Processing”, a collaboration between Nvidia and some people in the vision lab at UMass Amherst, which I am in.

a) Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. “Learning and Using the Arrow of Time.”

I am quite fond of works that explore cues in the world that may be useful for unsupervised learning. Traditional deep learning approaches require large amounts of labeled training data, but we humans seem to be able to learn just from interacting with the world in an unsupervised fashion. In this paper, the direction of time is used as a cue. The authors train a neural network to distinguish the direction of time and show that such a network can be helpful in action recognition tasks.

b) Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. “Learning to Localize Sound Source in Visual Scenes.”

This is another example of using cues available in the world. In this work, the authors ask whether a machine can learn the correspondence between a visual scene and sound, and localize the sound source only by observing sound and visual scene pairs, like humans do. This is done using a triplet network that tries to minimize the difference between the visual feature of a video frame and the sound feature generated in a similar time window, while maximizing the difference between the same visual feature and a random sound feature. As you can see in the figure above, the network is able to associate different sounds with different visual regions.
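
A hedged sketch of that triplet objective (my own paraphrase in PyTorch, not the authors’ code; the visual and audio feature extractors are assumed to exist elsewhere):

```python
import torch
import torch.nn.functional as F

def audio_visual_triplet_loss(frame_feat, matching_sound_feat, random_sound_feat, margin=1.0):
    # frame_feat: visual feature of a video frame, shape (batch, d)
    # matching_sound_feat: sound feature from a similar time window
    # random_sound_feat: sound feature drawn from an unrelated clip
    d_pos = F.pairwise_distance(frame_feat, matching_sound_feat)
    d_neg = F.pairwise_distance(frame_feat, random_sound_feat)
    # pull the matching pair together, push the random pair at least `margin` further apart
    return F.relu(d_pos - d_neg + margin).mean()
```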

c) Edward Kim, Darryl Hannan, and Garrett Kenyon. “Deep Sparse Coding for Invariant Multimodal Halle Berry Neurons.”

This work is inspired by experiments done by Quiroga et al. that found a single neuron in one human subject’s brain that fires on both pictures of Halle Berry and text of Halle Berry’s name. In this paper, the authors show that training a deep sparse coding network that takes a face image and a text image of the corresponding name results in learning a multimodal invariant neuron that fires on both Halle Berry’s face and name. When one modality is missing, the missing image or text can be generated. In this network, each sparse coding layer is learned through the Locally Competitive Algorithm (LCA), which uses principles of thresholding and local competition between neurons. Top-down feedback is also used in this work by propagating reconstruction error downwards. The authors show interesting results where adding information to one modality changes the belief about the other modality. The figure above shows that this Halle Berry neuron in the sparse coding network can distinguish between Catwoman played by Halle Berry and Catwoman played by Anne Hathaway or Michelle Pfeiffer.

d) Assaf Shocher, Nadav Cohen, and Michal Irani. “Zero-Shot Super-Resolution using Deep Internal Learning.”

Super-resolution is a task that tries to increase the resolution of an image. The typical approach nowadays is to learn it with a neural network. However, the authors showed that this approach only works well if the downsampling process from the high-resolution to the low-resolution image is similar in training and testing. In this work, no training is needed beforehand. Given a test image, training examples are generated from the test image itself by downsampling patches of the same image. The fundamental idea behind this approach is that natural images have strong internal data repetition. Therefore, from the same image you can infer the high-resolution structure of lower-resolution patches by observing other parts of the image that have higher resolution and similar structure. The image above shows their results (top row) versus state-of-the-art results (bottom row).
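
A hedged sketch of the core idea, building (low-resolution, high-resolution) training pairs from the test image itself; the crop size, scale, and library calls are illustrative, not the authors’ implementation:

```python
import numpy as np
from PIL import Image

def make_internal_pairs(test_image, scale=2, num_pairs=32, crop=64):
    # test_image: a PIL image; each pair is (downsampled patch, original patch)
    rng = np.random.default_rng(0)
    w, h = test_image.size
    pairs = []
    for _ in range(num_pairs):
        x = int(rng.integers(0, w - crop))
        y = int(rng.integers(0, h - crop))
        hr = test_image.crop((x, y, x + crop, y + crop))               # treat patch as "high-res"
        lr = hr.resize((crop // scale, crop // scale), Image.BICUBIC)  # its downsampled version
        pairs.append((lr, hr))
    return pairs

# a network trained on these pairs can then upscale the whole test image
pairs = make_internal_pairs(Image.open("test.png"))  # hypothetical file name
```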

e) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.”

Most modern approaches for denoising, super-resolution, or inpainting tasks use an image generation network trained on a large dataset consisting of pairs of images before and after the effect. This work shows that these nice outcomes are not just the result of learning but also the effect of the convolutional structure. The authors take an image generation network, feed random noise as input, and then update the network using the error between the output and the test image, such as the left image shown above for inpainting. After many iterations, the network magically generates an image that fills the gap, such as the right image above. What this work says is that, contrary to the common belief that deep learning approaches for image restoration learn image priors better than engineered priors, the deep structure itself is just a better engineered prior.
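
A hedged sketch of that iteration loop for inpainting (a toy network and tensor sizes; the paper uses a much deeper encoder-decoder and longer optimization):

```python
import torch
import torch.nn as nn

# toy stand-in for the paper's image generation network
net = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)

corrupted = torch.rand(1, 3, 128, 128)   # placeholder for the image with a hole
mask = torch.ones(1, 1, 128, 128)        # 1 = known pixel, 0 = missing region
mask[:, :, 40:80, 40:80] = 0

z = torch.randn(1, 32, 128, 128)         # fixed random noise input
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    out = net(z)
    # only penalize the error on known pixels; the hole gets filled by the structural prior
    loss = ((out - corrupted) ** 2 * mask).mean()
    loss.backward()
    opt.step()

inpainted = net(z).detach()              # the network's output fills the gap
```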

Deep Learning Approaches For Object Detection

In Computer Vision, deep learning, Machine Learning, Paper Talk on March 25, 2018 at 3:16 pm

by Li Yang Ku

In this post I am going to talk about the progression of a few deep learning approaches for object detection. I will start from R-CNN and OverFeat (2013), then gradually move to more recent approaches such as RetinaNet, which won the best student paper award at ICCV 2017. Object detection here refers to the task of identifying a limited set of object classes (20 ~ 200) in a given image by giving each identified object a bounding box and a label. This is one of the mainstream challenges in computer vision, which requires algorithms to output the locations of multiple objects in addition to the corresponding classes. Some of the most well-known datasets are the PASCAL visual object classes challenge (2005-2012) funded by the EU (20 classes, ~10k images), the ImageNet object detection challenge (2013-present) sponsored by Stanford, UNC, Google, and Facebook (200 classes, ~500k images), and the COCO dataset (2015-present) first started by Microsoft (80 classes, ~200k images). These datasets provide hand-labeled bounding boxes and class labels of objects in images for training. Challenges for these datasets happen yearly; teams from all over the world submit their code to compete on an undisclosed test set.

In December 2012, the success of AlexNet on the ImageNet classification challenge was published. While many computer vision scientists around the world were still scratching their heads trying to understand this result, several groups quickly harvested the techniques implemented in AlexNet and tested them out. Based on the success of AlexNet, in November 2013 the vision group at Berkeley published (on arXiv) an approach for solving the object detection problem. The proposed R-CNN is a simple extension of AlexNet, which was designed to solve the classification problem, to handle the detection problem. R-CNN is composed of 3 parts: 1) region proposal, where selective search is used to generate around 2000 possible object location bounding boxes; 2) feature extraction, where AlexNet is used to generate features; 3) classification, where an SVM (support vector machine) is trained for each object class. This hybrid approach successfully outperformed previous algorithms on the PASCAL dataset by a significant margin.

R-CNN architecture

Around the same time (December 2013), the NYU team (Yann LeCun, Rob Fergus) published an approach called OverFeat. OverFeat is based on the idea that convolutions can be done efficiently on dense image locations in a sliding-window fashion. The fully connected layers in AlexNet can be seen as 1×1 convolution layers. Therefore, instead of generating a classification confidence for a cropped fixed-size image, OverFeat generates a map of confidences over the whole image. To predict the bounding box, a regressor network is added after the convolution layers. OverFeat finished 4th in the 2013 ImageNet object detection challenge but claimed to achieve better-than-first-place results with a longer training run that wasn’t ready in time for the competition.

Since then, a lot of research has expanded on the concepts introduced in these works. SPP-net is an approach that speeds up R-CNN by up to 100x by performing the convolution operations just once on the whole image (note that OverFeat does convolutions on images of different scales). SPP-net adds a spatial pyramid pooling layer before the fully connected layers. This spatial pyramid pooling layer transforms an arbitrary-size feature map into a fixed-size input by pooling from areas separated by grids of different scales. However, similar to R-CNN, SPP-net requires multi-step training for feature extraction and the SVM classification. Fast R-CNN came along to address this problem. Similar to R-CNN, Fast R-CNN uses selective search to generate a set of possible region proposals and, adapting the idea of SPP-net, the feature map is generated once on the whole image and an ROI pooling layer extracts a fixed-size feature for each region proposal. A multi-task loss is also used so that the whole network can be trained together in one stage. Fast R-CNN speeds up R-CNN by up to 200x and produces better accuracy.

Fast R-CNN architecture

At this point, the region proposal process had become the computational bottleneck of Fast R-CNN. As a result, “Faster” R-CNN addresses this issue by introducing a region proposal network that generates region proposals based on the same feature map used for classification. This requires a four-stage training procedure that alternates between the two networks, but it achieves a speed of 5 frames per second.

Image pyramids, where images of multiple scales are created for feature extraction, were a common way for features such as SIFT to handle scale invariance. So far, most R-CNN-based approaches do not use image pyramids due to the computation and memory cost during training. The feature pyramid network shows that since deep convolutional neural networks are by nature multi-scale, a similar effect can be achieved with little extra cost. This is done by combining top-down information with lateral information at each convolution layer, as shown in the figure below. By restricting the feature maps to have the same dimension, the same classification network can be used for all scales; this has a similar flavor to traditional approaches that use the same detector on images of different scales in the image pyramid.

Until 2017, most of the high-accuracy approaches to object detection were extensions of R-CNN with a region proposal module separate from classification. Single-stage approaches, although faster, were not able to compete in accuracy. The paper “Focal Loss for Dense Object Detection” published at ICCV 2017 identifies the problem with single-stage approaches and proposes an elegant solution that results in faster and more accurate models. The lower accuracy of single-stage approaches was a consequence of the imbalance between foreground and background training examples. By replacing the cross-entropy loss with the focal loss, which down-weights examples the network is already confident about, the network improves substantially in accuracy. The figure below shows the difference between the cross-entropy loss (CE) and the focal loss (FL). A larger gamma parameter puts less weight on high-confidence examples.
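
For reference, here is a hedged sketch of the binary focal loss as defined in the paper, FL(p_t) = -α_t (1 - p_t)^γ log(p_t); this is my own paraphrase, not the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # logits, targets: tensors of the same shape; targets are 0 (background) or 1 (foreground)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss of examples the network already gets right
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()
```

With gamma = 0 this reduces to the usual (alpha-weighted) cross entropy.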

The references for the approaches I mentioned are listed below. Note that I only covered a small part of a large body of work on object detection, and current progress on object detection has been moving at a rapid pace. If you look at the current leaderboard for the COCO dataset, the numbers have already surpassed the best approach I mentioned by a substantial margin.

  • Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for accurate object detection and semantic segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
  • Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. “Overfeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229 (2013).
  • He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” In european conference on computer vision, pp. 346-361. Springer, Cham, 2014.
  • Girshick, Ross. “Fast R-CNN.” arXiv preprint arXiv:1504.08083 (2015).
  • Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. “Faster r-cnn: Towards real-time object detection with region proposal networks.” In Advances in neural information processing systems, pp. 91-99. 2015.
  • Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. “Feature pyramid networks for object detection.” In CVPR, vol. 1, no. 2, p. 4. 2017.
  • Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. “Focal loss for dense object detection.” arXiv preprint arXiv:1708.02002 (2017).

 

Talk Picks: IROS 2017

In deep learning, Machine Learning, Robotics on February 10, 2018 at 1:06 pm

by Li Yang Ku (Gooly)

I was at IROS (International Conference on Intelligent Robots and Systems) in Vancouver recently (September 2017; this post took way too long to finish) to present work I did almost two years ago. Interestingly, there were four deep learning related sessions this year, and quite a few papers that I found interesting; however, the talks at IROS were what I found most inspiring. I am going to talk about three of them in the following.

a) “Toward Unifying Model-Based and Learning-Based Robotics”, plenary talk by Dieter Fox.  

In my previous post, I talked about how the machine learning field differs from the robotics field: machine learning learns from data, while robotics designs models that describe the environment. In this talk, Dieter tries to glue both worlds together. The 50-minute talk is posted below. For those who don’t have 50 minutes, I describe the talk briefly in the following.

Dieter first described a list of work his lab has done (robot localization, RGB-D matching, real-time tracking, etc.) using model-based approaches. Model-based approaches match models to data streams and control the robot by finding actions that reach the desired state. One of the benefits of such approaches is that our own knowledge of how the physical world works can be injected into the model. Dieter then gave a brief introduction to deep learning and to one of his students’ work on learning visual descriptors in a self-supervised way, which I covered in a previous post. Based on the recent success in deep learning, Dieter suggested that there are ways to incorporate model-based approaches into a deep learning framework and showed an example of how we can add knowledge of rigid body motion into a network by forcing it to output segmentations and their poses. The overall conclusion is that 1) model-based approaches are accurate within a local basin of attraction in which the models match the environment, 2) deep learning provides a larger basin of attraction in the trained regime, and 3) unifying both approaches gives you more powerful systems.

 

b) “Robotics as the Path to Intelligence”, keynote talk by Oliver Brock

Oliver Brock gave an exciting interactive talk on understanding intelligence in one of the IROS keynote sessions. Unfortunately it was not recorded and the slides cannot be distributed, so I posted the most similar talk he has given below instead. It is also a pretty good talk with some overlapping content, but under a different topic.

In the IROS talk, Oliver made a few points. First, he started out with AlphaGo by DeepMind, stating that its success in the game of Go is very similar to IBM’s Deep Blue beating the chess champion in 1996. In both cases, despite the system’s superior game-play performance, it needs a human to make the moves for it. A lot of things that humans are good at are usually difficult for our current approaches to artificial intelligence. How we define intelligence is crucial because it shapes our research direction and how we solve problems. Oliver then showed that defining intelligence is non-trivial and has to do with what we perceive, by performing an interactive experiment with the audience. He then talked about his work on integrating cross-modal perception and action, the importance of manipulation for intelligence, and soft hands that can solve hard manipulation problems.

 

c) “The Power of Procrastination”, special event talk by Jorge Cham

This was probably the most popular of all the IROS talks. The speaker, Jorge Cham, is the author of the popular PHD Comics (which I may have posted on my blog without permission) and has a PhD in robotics from Stanford University. The following is not the exact talk he gave at IROS, but it is very similar.

 

Machine Learning, Computer Vision, and Robotics

In Computer Vision, Machine Learning, Robotics on December 6, 2017 at 2:32 pm

By Li Yang Ku (Gooly)

Having TA’d for Machine Learning this semester and worked in the fields of Computer Vision and Robotics for the past few years, I always have this feeling that the more I learn the less I know. Therefore, it’s sometimes good to just sit back and look at the big picture. This post will talk about how I see the relations between these three fields at a high level.

First of all, Machine Learning is more a brand than a name. Just like Deep Learning and AI, the name is used for getting funding when the previous name is out of hype. In this case, the name was popularized after AI projects failed in the 70s. Therefore, Machine Learning covers a wide range of problems and approaches that may look quite different at first glance. AdaBoost and support vector machines were the hot topics in Machine Learning when I was doing my master’s degree, but now it is deep neural networks that get all the attention.

Despite the wide variety of research in Machine Learning, it usually shares the common assumption that a set of data exists. The goal is then to learn a model based on this set of data. There is a wide range of variations here: the data could be labeled or unlabeled, resulting in supervised or unsupervised approaches; the data could be labeled with a category or a real number, resulting in classification or regression problems; the model can be limited to a certain form, such as a class of probability models, or have fewer constraints, as in the case of deep neural networks. Once the model is learned, there is also a wide range of possible uses. It can be used for predicting outputs given new inputs, filling in missing data, generating new samples, or providing insight into hidden relationships between data entries. Data is so fundamental in Machine Learning that people in the field don’t really ask why we learn from data. Many datasets from different fields are collected or labeled, and the learned models are compared based on accuracy, computation speed, generalizability, etc. Therefore, Machine Learning people often consider Computer Vision and Robotics as areas for applying Machine Learning techniques.

Robotics, on the other hand, comes from a very different background. There is usually no data to start with in robotics. If you cannot control your robot, or if your robot crashes itself at its first move, how are you going to collect any data? Therefore, classical robotics is about designing models based on physics and geometry. You build models that describe how the input and the current observation of the robot change the robot state. Based on this model you can infer the input that will safely drive the robot to a certain state.

Once you can command your robot to reach a certain state, a wide variety of problems emerge. The robot then has to do obstacle avoidance and path planning to reach a certain goal. You may need to find a goal state that satisfies a set of restrictions while optimizing a set of properties. Simultaneous localization and mapping (SLAM) may be needed if no map is given. In addition, sensor fusion is required when multiple sensors with different properties are used. There may also be uncertainty in robot states, where belief space planning may be helpful. For robots with a gripper, you may also need to identify stable grasps and recognize the type and pose of an object for manipulation. And of course, there is a whole different set of problems in designing the mechanics and hardware of the robot. Unlike Machine Learning, a lot of these problems are solved without a set of data. However, most of these robotics problems (excluding mechanical and hardware problems) share a common goal of determining the robot input based on feedback. (Some) roboticists view robotics as the field with the ultimate goal of creating machines that act like humans, and Machine Learning and Computer Vision as fields that can provide methods to help accomplish such a goal.

The field of Computer Vision started under AI in the 60s with the goal of helping robots achieve intelligent behaviors, but left that goal behind after the internet era, when tons of images on the internet were waiting to be classified. In this age, computer vision applications are no longer restricted to physical robots. In the past decade, the field of Computer Vision has been driven by datasets. The implicit agreement on evaluation against standardized datasets helped the field advance at a reasonably fast pace (at the cost of millions of grad student hours spent tweaking models to get a 1% improvement.) Given these datasets, the field of Computer Vision inevitably left the Robotics community and embraced data-driven Machine Learning approaches. Most Computer Vision problems share a common goal of learning models for visual data. The model is then used to do classification, clustering, sample generation, etc. on images or videos. The big picture of Computer Vision can be seen in my previous post. Some Computer Vision scientists consider vision different from other senses and believe that the development of vision was fundamental to the evolution of intelligence (which could be true… experiments do show that about 50% of our brain’s neurons are vision related.) Nowadays, Computer Vision and Machine Learning are deeply entangled: Machine Learning techniques help foster Computer Vision solutions, while successful models in Computer Vision contribute back to the field of Machine Learning. For example, the success story of Deep Learning started with Machine Learning models being applied to the ImageNet challenge and ended up with a wide range of architectures that can be applied to other problems in Machine Learning. On the other hand, Robotics is a field that Computer Vision folks are gradually moving back to. Several well-known Computer Vision scientists, such as Jitendra Malik, have started to consider how Computer Vision can help the field of Robotics based on the recent success of data-driven approaches, since their conversations with Robotics colleagues were mostly about vision not working.