Life is a game, take it seriously

Tool Tutorial: Google AI Platform for Hobbyist

In AI, App, deep learning, Machine Learning, Serious Stuffs on October 27, 2019 at 10:44 pm

by Li Yang Ku (Gooly)

In this post I am going to talk about the Google AI platform (previously called Google ML engine) and how to use it if deep learning is just your after work hobby. I will provide links to other tutorials and details at the end so that you can try it out, but the purpose of this post is to give you a big picture of how it works without having to read through all the marketing phrases targeting company decision makers.

Google AI platform is part of the Google cloud and provides computing power for training and deploying deep networks. So what’s the difference between this platform and any other cloud computing services such as AWS (Amazon Web Services)? Google AI platform is specialized for deep learning and is suppose to simplify the process. If you are using TensorFlow (also developed by Google) with a pretty standard neural network architecture, it should be a breeze to train and deploy your model for online applications. There is no need to set up servers, all you need is a few lines of gcloud commands and your model will be trained and deployed in the cloud. (You also get a $300 dollar first year credit for signing up on Google Cloud Platform, which is quite a lot for home projects.) Note that Google AI platform is not the only shop in town, take a look at Microsoft’s Azure AI if you like to shop around.

So how does it work? First of all, there are four ways to communicate with Google AI platform. You can do it 1) locally: where you have all the code on your computer and communications are made through commands directly, 2) on Google Colab: Colab is another Google project that is basically a Jupyter notebook on the cloud which you can share with others, 3) on the AI platform notebook: which is similar to Colab but have more direct access to the platform and more powerful machines, and 4) on any other cloud server or jupyter notebook like webservice such as FloydHub. The main difference between using Colab versus AI platform notebook is pricing. Colab is free (even with GPU access), but has limitations such as up to 12 hours of run time and shuts down after 90 minutes of idle time. It provides you with about 12GB RAM and 50GB disk space (although the disk is half full when started due to preinstalled packages). After disconnected, you can still reconnect with whatever you wrote in the notebook, but you will lost whatever is in the RAM and disk. For a home project, Colab is probably sufficient, the disk space is not a limitation since we can store training data in google storage. (Note that it is also possible to connect Google drive in Colab so that you don’t need to start from scratch every time.) On the other hand, AI platform notebook could be pricey if you want to keep it running (0.137 / hour and 99.89 / month for a non-gpu machine).

Before we move on, we also have to understand the differences between computation and storage on the Google AI platform. Unlike personal computers where disk space and computation are tightly integrated, they are separated on the cloud. There are machines that are responsible for computation and machines that are responsible for storage. Here, Google AI platform is responsible for the computation while the Google Cloud Storage takes care of the stored data and code. Therefore, before we start using the platform we will need to first create a storage space called bucket. This can be easily done through a one line command once you created a Google Cloud account.

If you are using Colab, you will also need to have the code for training your neural network downloaded to your Colab virtual machine. One common work flow would be to use software version control services such as Github for your code and just clone the files to Colab every time you start. It makes more sense to use Colab if you are collaborating with others or want to share how you train your model, otherwise doing everything locally might be simpler.

So the whole training process looks like this:

  1. Create a Google Cloud Project.
  2. Create a bucket where the Google AI platform can perform computations on.
  3. With a single command, upload your code to the bucket and request the AI platform to perform training.
  4. Can also perform hyper parameter tuning if needed.
  5. If you want the trained model locally, you can simply download it from the bucket through a user interface or command.

A trained model is not very useful if not used. Google AI platform provides an easy way to deploy your model as a service in the cloud. Before continuing, we should clarify some Google terminology. At Google AI platform, a “model” means an interface that solves certain tasks and a trained model is named  a “version” of this “model” (reference). In the following, quotation marks will be put around Google specific terminologies to avoid confusion.

The deployment and prediction process is then the following:

  1. Create a “model” at AI platform.
  2. Create a “version” of the “model” by providing the trained model stored in the bucket.
  3. Make predictions through one of the following approaches:
    • gcloud commands
    • Python interface
    • Java interface
    • REST API
      (the first three methods are just easier ways to generate a REST request)

And that’s all you need to grant your home made web application access to scalable deep learning prediction capability. You can run this whole process I described above through this official tutorial in Colab and more descriptions of this tutorial can be found here. I will be posting follow up posts on building specific applications on Google AI platform, so stay tuned if you are interested.

References:

Talk the Talk: Optimization’s Untold Gift to Learning

In AI, Computer Vision, deep learning, Machine Learning on October 13, 2019 at 10:40 am

by Li Yang Ku (Gooly)

deep learning optimization

In this post I am going to talk about a fascinating talk by Nati Srebro at ICML this June. Srebro have given similar talks at many places but I think he really nailed it this time. This talk is interesting not only because he provided a different view of the role of optimization in deep learning but also because he clearly explained why many researcher’s argument on the reason that deep learning works doesn’t make sense.

Srebro first look into what we know about deep learning (typical feed forward network) based on three questions. The first question is regarding the capacity of the network. How many samples do we need to learn certain network architecture? The short answer is that it should be proportional to the number of parameters in the network, which is the total number of edges. The second question is about the expressiveness of the network. What can we express with certain model class? What type of questions can we learn? Since a two layer neural network is a universal approximator, it can learn any continuous function, this is however not a very useful information since it may require an exponentially large network and exponential amount of samples to learn. So the more interesting question is what can we express with a reasonable sized network? Many recent research more or less focuses on this question. However, Srebro argues that since there is another theory that says any function that can be executed within a reasonable amount of time can be captured by a network of reasonable size (please comment below if you know what theory this is), all problems that we expect to be solvable can be expressed by a reasonable sized network.

The third question is about computation. How hard is it to find optimal parameters? The bad news is that finding the weights for even tiny networks is NP-Hard. Theories (link1 link2) show that even if the training data can be perfectly expressed by a small neural network there are no polynomial time algorithm to find such set of weights. This means that neural network’s expressiveness described in question 2 doesn’t really do much good since we aren’t capable of finding the optimal solution. But we all know that in reality neural network works pretty well, it seems that there are some magical property that allows us to learn neural networks. Srebro emphasizes that we still don’t know what is the magical property that makes neural networks learnable, but we do know it is not because we can represent the data well with the network. If you ask vision folks why neural networks work, they might say something like the lower layers of the network matches low level visual features and the higher layers match higher level visual features. However, this answer is about the expressiveness of the network described in question 2 which is not sufficient for explaining why neural networks work and provides zero evidence since we already know neural networks have the power to express any problem.

Srebro then talked about the observed behavior that neural networks usually don’t overfit to the training data. This is an unexpected property quite similar to the behavior of Adaboost, which was invented in 1997 and quite popular in the 2000s. It was only after the invention that people discovered that the reason Adaboost doesn’t overfit is because it is implicitly minimizing the L-1 norm that limits the complexity. So the question Srebro pointed out was whether the gradient decent algorithm for learning neural networks are also implicitly minimizing certain complexity measure that would be beneficial in reaching a solution that would generalize. Given a set of training data, a neural network can have multiple optimal solutions that are global minima (zero training error). However, some of these global minima perform better than the others on the test data. Srebro argues that the optimization algorithm might be doing something else other than just minimizing the training error. Therefore, by changing the optimization algorithm we might observe a difference in how well can a neural network generalize to test data, and this is exactly what Srebro’s group discovered. In one experiment they showed that even though using Adam optimization achieves lower training error then stochastic gradient decent, it actually performs worse on the test data. What this means is that we might not be putting enough emphasize on optimization in the deep learning community where a typical paper looks like the following:

Deep Learning Paper TemplateThe contributions are on the model and loss function, while the optimization is just a brief mention. So the main point Srebro is trying to convey is that different optimization algorithms would lead to different inductive biases, and different inductive biases would lead to different generalization properties. “We need to understand optimization algorithm not just as reaching some global optimum, but as reaching a specific optimum.”

Srebro further talked about a few more works based on these observations. If you are interested by now, you should probably watch the whole video (You would need to fast forward a bit to start.) I am however going to put in a little bit of my own thoughts here. Srebro emphasizes the importance of optimization a lot in this talk and said the deep models we use now can basically express any problem we have, therefore the model is not what makes deep learning work. However, we also know that the model does matter based on claims of many papers that invented new model architectures. So how could both of these claims be true? We have to remember that the model architecture is also part of the optimization process that shapes the geometry which the optimization algorithm is optimizing on. Hence, if the nerual network model provides a landscape that allows the optimization algorithm to reach a desired minimum more easily, it will also generalize better to the test data. In other words, the model and the optimization algorithm have to work together.

The Deep Learning Not That Smart List

In AI, Computer Vision, deep learning, Machine Learning, Paper Talk on May 27, 2019 at 12:00 pm

by Li Yang Ku (Gooly)

Deep learning is one of the most successful scientific story in modern history, attracting billions of investment money in half a decade. However, there is always the other side of the story where people discover the less magical part of deep learning. This post is about a few research (quite a few published this year) that shows deep learning might not be as smart as you think (most of the time they would came up with a way to fix it, since it used to be forbidden to accept paper without deep learning improvements.) This is just a short list, please comment below on other papers that also belong.

a) Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. “Intriguing properties of neural networks.”, ICLR 2014

The first non-magical discovery of deep learning has to go to the finding of adversarial examples. It was discovered that images added with certain unnoticeable perturbations can result in mysterious false detections by a deep network. Although technically the first publication of this discovery should go to the paper “Evasion Attacks against Machine Learning at Test Time” by Battista Biggio et al. published in September 2013 in ECML PKDD, the paper that really caught people’s attention is this one that was put on arxiv in December 2013 and published in ICLR 2014. In addition to having bigger names on the author list, this paper also show adversarial examples on more colorful images that clearly demonstrates the problem (see image below.) Since this discover, there have been continuous battles between the band that tries to increase the defense against attacks and the band that tries to break it (such as “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples” by Athalye et al.), which leads to a recent paper in ICLR 2019 “Are adversarial examples inevitable?” by Shafahi et al. that questions whether it is possible that a deep network can be free of adversarial examples from a theoretical standpoint.

b) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Deep Image Prior.” CVPR 2018

This is not a paper intended to discover flaws of deep learning, in fact, the result of this paper is one of the most magical deep learning results I’ve seen. The authors showed that deep networks are able to fill in cropped out images in a very reasonable way (see image below, left input, right output) However, it also unveils some less magical parts of deep learning. Deep learning’s success was mostly advertised as learning from data and claimed to work better than traditional engineered visual features because it learns from large amount of data. This work, however, uses no data nor pre-trained weights. It shows that convolution and the specific layered network architecture, (which may be the outcome of millions of grad student hours through trial and error,) played a significant role in the success. In other words, we are still engineering visual features but in a more subtle way. It also raises the question of what made deep learning so successful, is it because of learning? or because thousands of grad students tried all kinds of architectures, lost functions, training procedures, and some combinations turned out to be great?

c) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” ICLR 2019.

It was widely accepted in the deep learning community that CNNs recognize objects by combining lower level filters that represent features such as edges into more complex shapes layer by layer. In this recent work, the authors noticed that contrary to what the community believes, existing deep learning models seems to have a strong bias towards textures. For example, a cat with elephant texture is often recognized as an elephant. Instead of learning how a cat looks like, CNNs seem to take the short cut and just try to recognize cat fur. You can find a detailed blog post about this work here.

d) Wieland Brendel, and Matthias Bethge. “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet.” ICLR 2019.

This is a paper from the same group as the previous paper. Based on the same observations, this paper claims that CNNs are not that different from bag of feature approaches that classifies based on local features. The authors created a network that only looks at local patches in an image without high level spatial information and was able to achieve pretty good result on ImageNet. The author further shuffled features in an image and existing deep learning models seems to be not sensitive to these changes. Again CNNs seem to be taking short cuts by making classifications based on just local features. More on this work can be found in this post.

e) Azulay, Aharon, and Yair Weiss. “Why do deep convolutional networks generalize so poorly to small image transformations?.” rejected by ICLR 2019.

This is a paper that discovered that modern deep networks may fail to recognize images shifted 1 pixel apart, but got rejected because reviewers don’t quite buy-in on the experiments nor the explanation. (the authors made a big mistake of not providing an improved deep network in the paper.) The paper showed that when the image is shifted slightly or if a sequence of frames from a video is given to a modern deep network, jaggedness appear in the detection result (see example below where the posterior probability of recognizing the polar bear varies a lot frame by frame.) The authors further created a dataset from ImageNet with the same images embedded in a larger image frame at a random location and showed that the performance dropped about 30% when the embedded frame is twice the width of the original image. This work shows that despite modern networks getting close to human performance on image classification tasks on ImageNet, it might not be able to generalize to the real world as well as we hoped.

f) Nalisnick, Eric, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. “Do Deep Generative Models Know What They Don’t Know?.” ICLR 2019

This work from DeepMind looks into tackling the problem that when tested on data with a distribution different from training, deep neural network can give wrong results with high confidence. For example, in the paper “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks” by Louizos and Welling, it was discovered that on the MNIST dataset a trained network can be highly confident but wrong when the input number is tilted. This makes deploying deep learning to critical tasks quite problematic. Deep generative models were thought to be a solution to such problems, since it also models the distribution of the samples, it can reject anomalies if it does not belong to the same distribution as the training samples. However, the authors short answer to the question is no; even for very distinct datasets such as digits versus images of horse and trucks, anomalies cannot be identified, and many cases even wrongfully provide stronger confidence than samples that does come from the trained dataset. The authors therefore “urge caution when using these models with out-of-training-distribution inputs or in unprotected user-facing systems.”