Life is a game, take it seriously

Paper Picks: CVPR 2020

In AI, Computer Vision, deep learning, Paper Talk, vision on September 7, 2020 at 6:30 am
by Li Yang Ku (Gooly)

CVPR is virtual this year for obvious reasons, and if you did not pay the $325 registration fee to attend this ‘prerecorded’ live event, you can now have a similar experience by watching all the recorded videos on their YouTube channel for free. Of course it’s not exactly the same since you are losing out on the virtual chat room networking experience, but honestly speaking, computer vision parties are often awkward in person already and I can’t imagine you missing much. Before we go through my paper picks, let’s look at the trend first. The graph below shows the accepted paper counts by topic this year.

CVPR 2020 stats

And the following are the stats for CVPR 2019:

CVPR 2019 stats

These numbers cannot be directly compared since the categories are not exactly the same. For example, deep learning, which had the most submissions in 2019, is no longer a category (it isn’t going to be a very useful category when every paper is about deep learning). The distributions in these two graphs look quite similar. However, if I had to analyze them at gunpoint, I would say the following:

  1. Recognition is still the most popular application for computer vision.
  2. The new category “Transfer/Low-shot/Semi/Unsupervised Learning” is the most popular problem to solve with deep networks.
  3. Despite being a controversial technology, more people are working on face recognition. For some countries this is probably still where most money is distributed.
  4. The new category “Efficient training and inference methods for networks” shows that there is an effort to push for practical use of neural networks.
  5. Based on this other set of statistics, it seems that occurrences of the keywords ‘graph’, ‘representation’, and ‘cloud’ doubled from last year. This is consistent with my observation that people are exploring 3D data more, since the research space on 2D images is the most crowded and competitive.

Now for my random paper picks:

a) Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. “CvxNet: Learnable Convex Decomposition” (video)

This Google Research paper introduces a new representation for 3D shapes that can be learned by neural networks and used by physics engines directly. In the paper, the authors mention that there are two types of 3D representations. The first is 1) explicit representations such as meshes. These can be used directly in many applications such as physics simulations because they contain information about the surface. Explicit representations are however hard to learn with neural networks. The other type is 2) implicit representations such as voxel grids. Voxel grids can be learned by neural networks since the problem can be treated as a classification problem that labels each voxel as empty or not. However, turning these voxel grids into a mesh is quite expensive. The authors therefore introduce a convex decomposition representation that represents a 3D shape as a union of convex parts. Since a convex shape can be represented by a set of hyperplanes that draw the boundary of the shape, learning it becomes a classification problem while retaining the benefit of having information about the shape boundary. This representation is therefore both implicit and explicit. The authors also demonstrate that a learned CvxNet is able to generate 3D shapes from 2D images with much better success compared to other approaches, as shown below.
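To make the hyperplane idea concrete, here is a minimal sketch (not the authors’ implementation) of testing whether points fall inside a union of convex parts, where each part is an intersection of half-spaces. The function names and the two-cube example are my own illustration. As I understand it, CvxNet learns the hyperplane parameters and replaces the hard max and threshold used here with smooth approximations so the whole thing stays differentiable.

```python
import numpy as np

def inside_convex(points, normals, offsets):
    """Return True for each point that lies inside one convex part.

    A convex part is the intersection of half-spaces n_i . x + d_i <= 0.
    points: (P, 3), normals: (H, 3), offsets: (H,)
    """
    signed = points @ normals.T + offsets   # (P, H) value of each half-space test
    return signed.max(axis=1) <= 0.0        # inside only if every test passes

def inside_union(points, parts):
    """A shape is a union of convex parts: inside if inside any part."""
    masks = [inside_convex(points, n, d) for n, d in parts]
    return np.any(masks, axis=0)

# Illustrative part: a unit cube centered at the origin (6 half-spaces).
cube_normals = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                         [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
cube_offsets = np.full(6, -0.5)

# Second part: the same cube shifted by t along x (offsets become d - n . t).
t = np.array([1.0, 0.0, 0.0])
shifted_cube = (cube_normals, cube_offsets - cube_normals @ t)

query = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
occupancy = inside_union(query, [(cube_normals, cube_offsets), shifted_cube])
```

The same hyperplanes give both views of the shape: evaluating them at a point answers the implicit “inside or not” question, while the planes themselves describe the boundary explicitly.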

b) Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman. “Ego-Topo: Environment Affordances From Egocentric Video” (video)

Environment Affordance

This paper on predicting an environment’s affordance is a collaboration between UT Austin’s computer vision group and Facebook AI Research. This paper caught my eye since my dissertation was also about affordances using a graph-like structure. If you are not familiar with the word “affordance”, it’s a controversial word made up to describe what action/function an object/environment affords a person/robot.

In this work, the authors argue that the space in which an action takes place is important to understanding first person videos. Traditional approaches to classifying actions in videos usually just take a chunk of the video and generate a representation for classification, while SLAM (simultaneous localization and mapping) approaches that try to create the exact 3D structure of the environment often fail when humans move too fast. Instead, this work learns a network that classifies whether two views belong to the same space. Based on this information, a graph can be created in which each node represents a space and its corresponding videos. The edges between nodes then represent the action sequences that happened between these spaces. The videos within a node can then be used to predict what an environment affords. The authors further trained a graph convolution network that takes neighboring nodes into account to predict the next action in the video. The authors showed that taking the underlying space into account benefited both tasks.
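The graph construction can be sketched roughly as follows. This is a simplified illustration rather than the paper’s algorithm: `same_space` is a hypothetical stand-in for the learned view classifier, and each frame is compared only against the first frame of every existing node.

```python
from collections import defaultdict

def build_topo_graph(frames, same_space):
    """Group a frame sequence into space nodes and record transitions.

    frames: sequence of per-frame descriptors (any type).
    same_space: callable(a, b) -> bool, standing in for the learned classifier.
    """
    nodes = []                   # each node is a list of frame indices
    edges = defaultdict(int)     # (from_node, to_node) -> transition count
    prev = None
    for i, frame in enumerate(frames):
        # Assign the frame to the first node whose representative matches.
        for node_id, members in enumerate(nodes):
            if same_space(frames[members[0]], frame):
                members.append(i)
                current = node_id
                break
        else:
            nodes.append([i])    # no match: open a new space node
            current = len(nodes) - 1
        # Moving from one space to another adds (or strengthens) an edge.
        if prev is not None and prev != current:
            edges[(prev, current)] += 1
        prev = current
    return nodes, dict(edges)

# Toy usage: string labels play the role of frames, equality the classifier.
nodes, edges = build_topo_graph(
    ["sink", "sink", "stove", "sink", "fridge"], lambda a, b: a == b)
```

Each node ends up holding every visit to the same space, even when the visits are far apart in time, which is what lets the clips inside a node be pooled for affordance prediction.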

c) Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta. “Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects” (video)

use the force luke - Yoda | Meme Generator

This paper would probably win the best title award for this conference if there were one. This work is about estimating forces applied to objects by humans in a video. Arguably, if robots could estimate forces applied to objects, it would be quite useful for performing tasks and predicting human intentions. However, personally I don’t think this is how humans understand the world, and it may be solving a harder problem than needed. Having said that, this is still an interesting paper worth discussing.

Estimating force and contact points

The difficulty of this task is that the ground truth forces applied to objects cannot be easily obtained. Instead of figuring out how to obtain this data, the authors use a physics simulator to simulate the outcome of applying the estimated force, and then use the difference between the keypoints annotated in the next frame and the keypoint locations in the simulated outcome as the signal to train the network. Contact points are also predicted by a separate network trained with annotated data. The figure above shows this training scheme. Note that estimating gradients through a non-differentiable physics simulator is possible by looking at how the result changes when each dimension is changed a little bit. The authors show this approach is able to obtain reasonable results on a collected dataset and can be extended to novel objects.
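The “change each dimension a little bit” trick is ordinary finite differencing. Below is a minimal sketch using central differences. The paper may use a different variant, and `toy_simulator_loss` is a made-up stand-in for “run the simulator, then compare keypoints”.

```python
import numpy as np

def finite_difference_grad(f, x, eps=1e-4):
    """Estimate the gradient of a black-box scalar function f at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps       # perturb one dimension at a time
        grad.flat[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

def toy_simulator_loss(force):
    """Hypothetical loss: squared distance to a force that matches the video."""
    target = np.array([1.0, -2.0, 0.5])
    return float(np.sum((force - target) ** 2))

grad = finite_difference_grad(toy_simulator_loss, np.zeros(3))
```

The cost is two simulator calls per input dimension, which is affordable when the quantity being perturbed is a low-dimensional force vector.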

d) Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song. “SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization” (video)

This is a Google Brain paper that tries to find a better architecture for object detection tasks that would benefit from more spatial information. For segmentation tasks, the typical architecture has an hourglass-shaped encoder-decoder structure that first scales the resolution down and then scales it back up to predict pixel-wise results. The authors argue that this type of neural network, with its scale-decreasing backbone, may not be the best solution for tasks in which localization is also important.

Left: ResNet, Right: Permute last 10 blocks

The idea is then to permute the order of the layers of an existing network such as ResNet and see if this can result in a better architecture. To avoid having to try out all combinations, the authors used Neural Architecture Search (basically another network) to learn what architecture would be better. The result is an architecture that has mixed resolutions and many skip connections that go further (image above). The authors showed that with this architecture they were able to outperform prior state-of-the-art results, and that the same network was also able to achieve good results on datasets other than the one it was trained on.


Why the Idea That Our Brain has an Area Dedicated Just for Faces is Likely Wrong

In brain, Neural Science, Serious Stuffs, vision on August 17, 2020 at 8:59 pm

by Li Yang Ku (Gooly)

Image for post

I was reading a neuroscience textbook recently and came across a paragraph about the discovery of the Fusiform Face Area (FFA) in the human brain. It was not news to me that there is a face area in the brain that activates when a face is observed, but I was quite surprised by what the researcher actually claimed. Nancy Kanwisher, the scientist who named the Fusiform Face Area in 1997, hypothesizes that this area is selectively responsive to faces and argues that this discovery supports the idea that the brain contains a few specialized components dedicated to solving very specific tasks. In the paragraph, she also mentioned quite a few experiments that back her hypothesis, including a study done at the Japan Science and Technology Agency which reported that monkeys raised without seeing faces were able to show adult-like ability in face discrimination, suggesting that experience with faces may not be needed to process faces. These observations seem to contradict the cat experiments I previously talked about. If vision is acquired instead of innate and even seeing something as simple as horizontal lines needs to be learned, the conclusion that our brain dedicates a pre-wired area just for faces seems counterintuitive.

The simplest way to show evidence that no area in the brain is dedicated to a specific function is probably to look at blind people’s Fusiform Face Area. A quick search led me to an interesting podcast about some experiments done at Georgetown University in 2014. In these experiments, blind people were trained to recognize faces through a device that scans faces and converts them into a sequence of tones with different frequencies. The researchers showed that when blind people try to recognize faces through this device, the same Fusiform Face Area becomes activated. This result suggests that this face area can be activated not just by visual input of faces but also by auditory representations of faces, which kind of strengthens the argument that it is a dedicated face area. However, I wasn’t able to find experiment details in publications regarding the claims made in the podcast.

A deeper search gave me a more informative study done by researchers at Johns Hopkins and MIT in 2015. In this work, “Visual Cortex Responds to Spoken Language in Blind Children”, scientists measured the activity of the Fusiform Face Area of blind children when they listened to stories and suggested that their Fusiform Face Area is taken over by language functions instead. These experiments showed that the Fusiform Face Area is not dedicated just to faces and that the plasticity of the brain during childhood allows the area to learn different functions. However, one can still argue that this is a special case and doesn’t invalidate the claim that the Fusiform Face Area is dedicated just to faces in normal circumstances.

Apparently I am not the only one surprised by Kanwisher’s claim. In 2017, Isabel Gauthier, a professor at Vanderbilt University, wrote an article titled ‘The Quest for the FFA led to the Expertise Account of its Specialization’ as a direct response/protest to Kanwisher’s article ‘The Quest for the FFA and Where It Led’. Gauthier’s article isn’t just a scientific article but also a letter that exposes the bitter story of a derailed collaboration between Kanwisher and Gauthier that dates back 20 years. In 1997, when Kanwisher published her famous article that claimed to have found an area dedicated to face recognition, Gauthier was working on her PhD thesis at Yale titled “Dissecting face recognition: The role of expertise and level of categorization in object recognition”. Gauthier had a very different hypothesis about face recognition, which suggested that the ability to recognize faces is a skill learned just like other expertise skills that involve distinguishing similar objects visually. Kanwisher’s publication at that time had the opposite conclusion from Gauthier’s thesis. However, as naive as it sounds, Gauthier held the belief that it would be advantageous for her to work with someone with whom she strongly disagreed, and proposed to work as Kanwisher’s postdoc with her own funding. Instead of drawing conclusions from existing experiments, the proposal was to come up with an experiment both of them could agree on.

Prior to this, Gauthier had done experiments on training people to distinguish the objects called Greebles shown above. Her results showed that subjects who learned to distinguish these Greebles had a stronger response in the Fusiform Face Area when shown a normal Greeble versus an upside-down Greeble. Kanwisher argued that these Greebles might be too similar to faces and therefore activate the same area. They therefore agreed that the experiment should be done on experts in things that are very different from faces; in this case they picked birds and cars. Gauthier then recruited about 20 bird and car experts and scanned their brains with MRI while showing them pictures of cars and birds. The results were clear: when bird experts saw an image of a bird, stronger activity was recorded in the same Fusiform Face Area. Bird experts only had strong FFA activity when bird images were shown but not when car images were shown, and the opposite results were observed with car experts. These results seem to indicate that the FFA is an area used to distinguish similar objects of the same class rather than an area dedicated to faces.

Car Talk Classics: Four Perfectly Good Hours: Magliozzi, Ray, Magliozzi, Tom: 9781598870992: Books

Kanwisher however decided not to put her name on this paper that contradicted her original idea and hired another postdoc to redo the experiment with different settings. The results came out similar, and again Kanwisher refused to be listed on the resulting paper, titled “Revisiting the role of the fusiform face area in visual expertise” and published in 2005. This could have been the end of the conflict between Kanwisher and Gauthier, and just like most disputes in academia, these stories were probably only known among a small group of grad students. Nancy Kanwisher continued her academic career at MIT, won quite a few awards, and even gave a TED talk in 2014. Isabel Gauthier became a professor at Vanderbilt University, also won a lot of awards, and continued with several studies that strengthened her expertise theory. However, in 2017 Kanwisher published the article ‘The Quest for the FFA and Where It Led’, in which she not only didn’t mention her involvement in the early expertise experiments but also claimed the effect shown in Gauthier’s work to be small and not replicable. She further claimed that her original paper that discovered this specialized brain area drew fire from Gauthier and many others because it found evidence for the century-old debate on domain specificity in the brain. Ironically, in this article Kanwisher also argues that the field has a replication crisis and researchers should work harder to replicate their own results before publishing them. This seemingly harmless article must have kept Gauthier awake for nights, which led to the tell-all article ‘The Quest for the FFA led to the Expertise Account of its Specialization’ that unveiled what would otherwise have been another untold story in the ivory tower of academia.

Guest Post: How to Make a Posture Correction App

In AI, App, Computer Vision, deep learning on August 8, 2020 at 9:35 am

This is a guest post by Lila Mullany and Stephanie Casola from alwaysAI (in exchange they will post one of my articles on their company blog). What this startup is developing might be useful to some of my readers who just want to implement deep learning vision apps without having to go through a steep learning curve. It’s also open source and free if you are just working on a home project. In the following, we’ll do a brief intro on alwaysAI and then Lila will talk about how to make a posture correction app with their library.

Sitting Up-Straight: A Personal Struggle for Proper… | Cirrus Insight

AlwaysAI is a startup located in San Diego that is making a deep learning computer vision platform that aims at making computer vision more accessible to developers. They provide a freemium version that could be quite useful to hobbyists as well as devs looking to build computer vision into commercial products. This platform is optimized to run on edge devices and can be an attractive option for anyone looking to build computer vision with resource constraints. One could easily create a computer vision app with alwaysAI and run it on a Raspberry Pi (a Raspberry Pi 4 costs about $80). If you know basic Python, you can sign up for a free account and create your own computer vision application with a few lines of code.

Typically, computer vision apps can take a lot of time to implement from scratch. With alwaysAI you can get going pretty quickly on object detection, object tracking, image classification, semantic segmentation, and pose estimation. Creating a computer vision application with alwaysAI starts with selecting a pre-trained model from their model catalog. If you want to train your own model, you can sign up here for their closed beta model training program.

At this time, all models are open source and available to be freely used in your apps. As for distributing your app, the first device you run your app on is free. For a free account, just sign up here.

open source | Funny Jokes and Laughs :)

For more information you can look up their documentation, blog, and Youtube channel. They also do hackathons, webinars, and weekly “Hacky Hours”. You can find out more about these events on their community page.

So that’s the intro. Below, Lila will show you an example of how their library can be used to build your own posture corrector.

Many of us spend most of our days hunched over a desk, leaning forward looking at a computer screen, or slumped down in our chair. If you’re like me, you’re only reminded of your bad posture when your neck or shoulders hurt hours later, or you have a splitting migraine. Wouldn’t it be great if someone could remind you to sit up straight? The good news is, you can remind yourself! In this tutorial, we’ll build a posture corrector app using a pose estimation model available from alwaysAI.

To complete the tutorial, you must have:

  1. An alwaysAI account (it’s free!)
  2. alwaysAI set up on your machine (also free)
  3. A text editor such as Sublime Text or an IDE such as PyCharm, both of which offer free versions, or whatever else you prefer to code in

All of the code from this tutorial is available on GitHub.

Let’s get started!

After you have your free account and have set up your developer environment, download the starter apps; do so using this link before proceeding with the rest of the tutorial. We’ll build the posture corrector by modifying the ‘realtime_pose_detector’ starter app. You may want to copy the contents into a new directory, so you retain the original code.

There will be three main parts to this tutorial:

  1. The configuration file
  2. The main application
  3. The utility class for detecting poor posture

1) Creation of the Configuration File

Create this file as specified in this tutorial. For this example app we need one configuration variable (and more if you want them): scale, which is an int and will be used to tailor the sensitivity of the posture functions.
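As a rough sketch of how the scale value might be read when the app starts: the file name `config.json`, the key name `scale`, and the helper `load_scale` are assumptions made for illustration, not names taken from the alwaysAI tutorial or its repository.

```python
import json
import os

def load_scale(path="config.json", default=1):
    """Read the 'scale' value from a JSON config file, or fall back to a default."""
    if not os.path.exists(path):
        return default           # no config file: keep the default sensitivity
    with open(path) as f:
        config = json.load(f)
    return int(config.get("scale", default))
```

Falling back to a default keeps the app runnable even when the configuration file is missing or does not define the variable.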

Now the configuration is all set up!

2) Creation of the App

Add the following import statements to the top of your file:

import os
import json
from posture import CheckPosture

We need ‘json’ to parse the configuration file, and ‘CheckPosture’ is the utility class for detecting poor posture, which we’ll define later in this tutorial.

NOTE: You can change the engine and the accelerator you use in this app depending on your deployment environment. Since I am developing on a Mac, I chose the engine to be ‘DNN’, and so I changed the engine parameter to be ‘edgeiq.Engine.DNN’. I also changed the accelerator to be ‘CPU’. You can read more about the accelerator options here, and more about the engine options here.

Next, remove the following lines from the main application file:

text.append("Key Points:")
for key_point in pose.key_points:

Add the following lines to replace the ones you just removed (right under the ‘text.append’ statements):

# update the instance key_points to check the posture
posture.set_key_points(pose.key_points)
# play a reminder if you are not sitting up straight
correct_posture = posture.correct_posture()
if not correct_posture:
    text.append(posture.build_message())
    # make a sound to alert the user to improper posture
    print("\a")

We used an unknown object type just there and called some functions on it that we haven’t defined yet. We’ll do that in the last section!

Move the following lines to directly follow the end of the above code (directly after the ‘for’ loop, and right before the ‘finally’):

streamer.send_data(results.draw_poses(frame), text)
if streamer.check_exit():
    break

3) Creating the Posture Utility Class

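Create a new file for the utility class. Since the main application imports it with ‘from posture import CheckPosture’, the module should be named ‘posture.py’. Define the class using the line:

class CheckPosture: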

Create the constructor for the class. We’ll have three instance variables: key_points, scale, and message.

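def __init__(self, scale=1, key_points=None):
    # avoid a shared mutable default: create a fresh dict per instance
    self.key_points = key_points if key_points is not None else {}
    self.scale = scale
    self.message = ""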

We used defaults for scale and key_points, in case the user doesn’t provide them. We initialize the variable message to hold an empty string; it will store feedback that the user can use to correct their posture. You already saw the key_points variable get set in the main application section; this variable allows the functions in the class to make determinations about the user’s posture. Finally, the scale makes the calculations performed in the class either more or less sensitive when it is decreased or increased respectively.

Now we need to write some functions for the CheckPosture class.

Create a getter and setter for the key_points, message, and scale variables:

def set_key_points(self, key_points):
    self.key_points = key_points

def get_key_points(self):
    return self.key_points

def set_message(self, message):
    self.message = message

def get_message(self):
    return self.message

def set_scale(self, scale):
    self.scale = scale

def get_scale(self):
    return self.scale

Now we need functions to actually check the posture. My bad posture habits include leaning forward toward my computer screen, slouching down in my chair, and tilting my head down to look at notes, so I defined methods for detecting these use cases. You can use the same principle of coordinate comparison to define your own custom methods, if you prefer.

First, we’ll define the method to detect leaning forward, as shown in the image below. This method works by comparing an ear and a shoulder on the same side of the body. So first it detects whether the ear and shoulder are both visible (i.e. the coordinate we want to use is not -1) on either the left or right side, and then it checks whether the shoulder’s x-coordinate is greater than the ear’s x-coordinate.

def check_lean_forward(self):
    if self.key_points['Left Shoulder'].x != -1 \
        and self.key_points['Left Ear'].x != -1 \
        and  self.key_points['Left Shoulder'].x >= \
        (self.key_points['Left Ear'].x + \
        (self.scale * 150)):
        return False
    if self.key_points['Right Shoulder'].x != -1 \
        and self.key_points['Right Ear'].x != -1 \
        and  self.key_points['Right Shoulder'].x >= \
        (self.key_points['Right Ear'].x + \
        (self.scale * 160)):
        return False
    return True

NOTE: the coordinates for ‘alwaysai/human-pose’ are 0,0 at the upper left corner. Also, the frame size will differ depending on whether you are using a Streamer input video or images, and this will also impact the coordinates. I developed using a Streamer object and the frame size was (720, 1280). For all of these functions, you’ll most likely need to play around with the coordinate differences, or modify the scale, as every person will have a different posture baseline. The principle of coordinate arithmetic will remain the same, however, and can be used to change app behavior in other pose estimation use cases! You could also use angles or a percent of the frame, so as to not be tied to absolute numbers. Feel free to re-work these methods and submit a pull request to the GitHub repo!
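As one possible take on the angle suggestion, here is a small sketch that measures how far the ear-shoulder line tilts from vertical, so the check does not depend on the frame size. The `KeyPoint` tuple, the function names, and the 25 degree threshold are all assumptions for illustration and are not part of the alwaysAI API. Unlike the method above, this version ignores which direction the lean goes.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a pose key point; -1 marks an undetected coordinate.
KeyPoint = namedtuple("KeyPoint", ["x", "y"])

def neck_angle_degrees(ear, shoulder):
    """Angle of the ear-shoulder line from vertical, or None if a point is missing."""
    if -1 in (ear.x, ear.y, shoulder.x, shoulder.y):
        return None
    dx = ear.x - shoulder.x
    dy = shoulder.y - ear.y      # image y grows downward, so flip the sign
    return math.degrees(math.atan2(abs(dx), dy))

def leaning_forward(ear, shoulder, max_angle=25.0):
    """True when the ear is tilted more than max_angle away from vertical."""
    angle = neck_angle_degrees(ear, shoulder)
    return angle is not None and angle > max_angle
```

Because an angle is a ratio of coordinate differences, the same threshold applies whether the frame is 720 pixels tall or 1080, though it will still need tuning for each person’s baseline.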

Next, we’ll define the method for slouching down in a chair, such as in the image below.

In this method, we’ll use the y-coordinates of the neck and nose keypoints to detect when the nose gets too close to the neck, which happens when someone’s back is hunched down in a chair. For me, about 150 points was the minimum distance I wanted to allow. If my nose is less than 150 points from my neck, I want to be notified. Again, these hardcoded values can be scaled with the ‘scale’ factor or modified as suggested in the note above.

def check_slump(self):
    if self.key_points['Neck'].y != -1 \
       and self.key_points['Nose'].y != -1 \
       and (self.key_points['Nose'].y >= \
       self.key_points['Neck'].y - (self.scale * 150)):
       return False
    return True

Now, we’ll define the method to detect when a head is tilted down, as shown in the image below. This method will use the ear and eye key points to detect when the y-coordinate of a given eye is closer to the bottom of the image than the ear on the same side of the body.

def check_head_drop(self):
    if self.key_points['Left Eye'].y != -1 \
        and self.key_points['Left Ear'].y != -1 \
        and self.key_points['Left Eye'].y > \
        (self.key_points['Left Ear'].y + (self.scale * 15)):
        return False

    if self.key_points['Right Eye'].y != -1 \
        and self.key_points['Right Ear'].y != -1 \
        and self.key_points['Right Eye'].y > \
        (self.key_points['Right Ear'].y + (self.scale * 15)):
        return False

    return True

Now, we’ll just make a method that checks all the posture methods. This method works by using Python’s built-in all function, which returns True only if every element of the iterable it is given is truthy. Since all of the posture methods we defined return False if poor posture is detected, the method we define now will return False if any one of those methods returns False.

def correct_posture(self):
    return all([self.check_slump(),
                self.check_head_drop(),
                self.check_lean_forward()])
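To see how all behaves in isolation, here is a tiny standalone sketch. The helper `combined_check` is hypothetical and not part of the tutorial code. It takes the checks as callables and uses a generator, so evaluation stops at the first failing check, whereas a list of already-called methods runs every check before all sees the results.

```python
def combined_check(checks):
    """Return True only when every posture check in the sequence passes."""
    return all(check() for check in checks)

# Stand-in checks: one passes, one reports poor posture.
good = lambda: True
bad = lambda: False
```

One detail worth knowing is that all returns True for an empty iterable, so an instance with no checks would always report correct posture.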

And finally, we’ll build one method that returns a customized string that tells the user how they can modify their posture. This method is called in the main application and the result is displayed in the streamer’s text.

def build_message(self):
    current_message = ""
    if not self.check_head_drop():
        current_message += "Lift up your head!\n"

    if not self.check_lean_forward():
        current_message += "Lean back!\n"

    if not self.check_slump():
        current_message += "Sit up in your chair, you're slumping!\n"

    self.message = current_message
    return current_message

That’s it! Now you have a working posture correcting app. You can customize this app by creating your own posture detection methods, using different keypoint coordinates, making the build_message return different helpful hints, and creating your own custom audio file to use instead of the ‘print(“\a”)’.

If you want to run this app on a Jetson Nano, update your Dockerfile and the accelerator and engine arguments in the main application file as described in this article.

Now, just start your app (visit this page if you need a refresher on how to do this for your current set up), and open your web browser to ‘localhost:5000’ to see the posture corrector in action!

This posture corrector application was also covered in one of the previous weekly “Hacky Hours”. You can watch the video recording of it on YouTube; just click here.