by Li Yang Ku

I worked at Vicarious, a robotics AI startup, from mid 2018 until it was acquired by Alphabet in 2022. Vicarious was founded before the deep learning boom and approached AI through a more neuroscience-based, graphical-model path. Nowadays it is definitely rare for an AI startup not to wave the deep learning flag, but Vicarious stuck with its own ideology despite all the recent successes of neural network approaches. This post is about a few research publications by my former colleagues at Vicarious and how they lie along the path to AGI (artificial general intelligence). Although Vicarious no longer exists, many authors of the following publications moved to DeepMind as part of the acquisition and are continuing the same line of research.
a) George, Dileep, Wolfgang Lehrach, Ken Kansky, Miguel Lázaro-Gredilla, Christopher Laan, Bhaskara Marthi, Xinghua Lou et al. “A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs.” Science 358, no. 6368 (2017)
This publication in Science was one of Vicarious' key contributions. In this work, the authors showed that the recursive cortical network (RCN), a hierarchical graphical model that models contours in an image, is much better at solving CAPTCHAs (those annoying letters you need to enter to prove you are human) than deep learning approaches. RCN is a template-based approach that models edges and how they connect with nearby edges using graphical models. This allows it to generalize to a variety of changes from very little data, whereas deep learning approaches are usually more data hungry and sensitive to variations they weren't trained on. One benefit of using graphical models is that they can do inference on occlusions between digits through a series of forward and backward passes. In CAPTCHA tests there are usually local ambiguities. A single bottom-up forward pass can generate a bunch of proposals, but to resolve the conflicts, a top-down backward pass to the low-level features is needed. Although it is possible to expand this forward-backward iteration into a very long forward pass in a neural network (which we will talk about in the query training paper below), the graphical model approach is a lot more interpretable in general.
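To make the forward-backward intuition concrete, here is a minimal toy sketch (my own, not the actual RCN): bottom-up evidence for one character is locally ambiguous, and a pairwise compatibility factor standing in for top-down context resolves it. All numbers and the two-variable model are made up for illustration.

```python
import numpy as np

# Two adjacent character hypotheses, each with two candidate letter states.
phi1 = np.array([0.55, 0.45])   # bottom-up evidence: node 1 slightly prefers state 0
phi2 = np.array([0.1, 0.9])     # node 2 strongly prefers state 1
psi = np.array([[0.5, 0.1],     # pairwise compatibility between neighbors:
                [0.1, 0.9]])    # state 1 next to state 1 is most plausible

# Exact joint and marginal, equivalent to one forward (bottom-up) and one
# backward (top-down) message pass on this tiny two-variable factor graph.
joint = phi1[:, None] * psi * phi2[None, :]
joint /= joint.sum()
marg1 = joint.sum(axis=1)       # belief over node 1 after context arrives

print(marg1)  # top-down context flips node 1's preference to state 1
```

Bottom-up evidence alone picked state 0 for node 1, but once the neighbor's strong evidence propagates back through the compatibility factor, state 1 wins.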
b) Kansky, Ken, Tom Silver, David A. Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. “Schema networks: Zero-shot transfer with a generative causal model of intuitive physics.” In International conference on machine learning. (2017)
This work can be seen as Vicarious' response to DeepMind's Deep Q-Network (DQN) approach, which gained great publicity by beating Atari games. One weakness of DQN-like approaches is generalizing beyond their training experiences. The authors showed that DQN agents trained on the regular Breakout game failed to generalize to variations of the game, such as when the paddle is slightly higher than in the original game. The authors argue that this is because the agent lacks knowledge of the causality of the world it is operating in. This work introduces the Schema Network, which assumes the world is modeled by many entities, each with binary attributes representing its type and position. In these noiseless game environments, there are exact causal rules that model how entities behave on their own or interact with each other. These rules (schemas) can be identified iteratively through linear programming relaxation given a set of past experiences. With the learned rules, the Schema Network is a probabilistic model in which planning can be done by setting the future reward to 1 and performing belief propagation on the model. This approach was shown to generalize to variations of Atari Breakout while state-of-the-art deep RL models failed.
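A toy sketch of how learned schemas predict the next state (my own simplification; the attribute names and rules below are hypothetical, not from the paper). Each schema is an AND over current binary attributes, and a predicted attribute becomes true if ANY of its schemas fires (OR over schemas):

```python
def schema_fires(state, preconditions):
    """A schema fires when all of its precondition attributes are true."""
    return all(state[p] for p in preconditions)

def predict_next(state, schemas):
    """Each predicted attribute is an OR over its schemas' AND-preconditions."""
    return {attr: any(schema_fires(state, pre) for pre in pres)
            for attr, pres in schemas.items()}

# Hypothetical Breakout-like rules: a brick is destroyed when the ball
# touches it; the ball bounces when it hits the paddle.
schemas = {
    "brick_destroyed": [("ball_at_brick",)],
    "ball_bounces": [("ball_at_paddle", "paddle_present")],
}

state = {"ball_at_brick": False, "ball_at_paddle": True,
         "paddle_present": True}
pred = predict_next(state, schemas)
print(pred)  # the ball bounces, but no brick is destroyed in this state
```

The AND-of-attributes, OR-of-schemas structure is what makes the rules amenable to the paper's linear programming relaxation, since each can be relaxed to linear constraints over binary variables.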
c) Lázaro-Gredilla, Miguel, Wolfgang Lehrach, Nishad Gothoskar, Guangyao Zhou, Antoine Dedieu, and Dileep George. “Query training: Learning a worse model to infer better marginals in undirected graphical models with hidden variables.” In Proceedings of the AAAI Conference on Artificial Intelligence. (2021)
In this paper, a neural network is used to mimic the loopy belief propagation (LBP) algorithm commonly used for inference on probabilistic graphical models. LBP estimates the marginals of each variable through an iterative message passing algorithm. At each time step, messages about the probability of each variable are passed between neighboring factors and variables. What is interesting is that LBP can be unrolled into a multi-layer feedforward neural network in which each layer represents one iteration of the algorithm. By training with different queries (partially observed evidence), the model learns to estimate the marginal probability of unobserved variables. This approach is based on the observation that there are two sources of error when using probabilistic graphical models: 1) error in learning the (factor) parameters of the model, and 2) error in doing inference on a learned model given partially observed evidence. The proposed approach, query training, tries to optimize predicting the marginals directly. Even though the learned parameters may correspond to a worse model, the predicted marginals can actually be better. Another major contribution of this work is a training process that considers the distribution of queries, so the learned model can be used to estimate the marginal probability of any variable given any partial evidence.
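The unrolling is easier to see in code. Below is a minimal sketch (my own, with made-up potentials) of parallel message passing on a small pairwise model, written as a fixed number of iterations; each pass of the outer loop is exactly the computation one "layer" of the unrolled network would perform, and query training would make the potentials learnable parameters:

```python
import numpy as np

unary = np.array([[0.6, 0.4],   # soft evidence at node 0
                  [0.5, 0.5],   # node 1 is unobserved (the "query" variable)
                  [0.2, 0.8]])  # soft evidence at node 2
pair = np.array([[0.8, 0.2],    # smoothness: neighboring nodes tend to agree
                 [0.2, 0.8]])
edges = [(0, 1), (1, 2)]        # a chain: 0 - 1 - 2

# msgs[(i, j)] is the message from node i to node j
msgs = {(i, j): np.ones(2) for a, b in edges for i, j in [(a, b), (b, a)]}

for _ in range(10):  # each iteration corresponds to one unrolled "layer"
    new = {}
    for i, j in msgs:
        # combine unary evidence with all incoming messages except j's
        h = unary[i] * np.prod([msgs[(k, l)] for (k, l) in msgs
                                if l == i and k != j], axis=0)
        m = pair.T @ h          # sum-product step through the pairwise factor
        new[(i, j)] = m / m.sum()
    msgs = new

beliefs = unary * np.array([np.prod([msgs[(k, l)] for (k, l) in msgs if l == i],
                                    axis=0) for i in range(3)])
beliefs /= beliefs.sum(axis=1, keepdims=True)
print(beliefs[1])  # estimated marginal of the unobserved middle node
```

On this loop-free chain LBP is exact, so the beliefs converge to the true marginals; on loopy graphs the same unrolled computation is only approximate, which is exactly the error source query training learns to compensate for.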
d) George, Dileep, Rajeev V. Rikhye, Nishad Gothoskar, J. Swaroop Guntupalli, Antoine Dedieu, and Miguel Lázaro-Gredilla. “Clone-structured graph representations enable flexible learning and vicarious evaluation of cognitive maps.” Nature communications 12, no. 1 (2021)
This work introduces the clone-structured cognitive graph (CSCG), an extension of the cloned HMM introduced in another Vicarious work, "Learning higher-order sequential structure with cloned HMMs," published in 2019. A cloned Hidden Markov Model (CHMM) is a Hidden Markov Model with an enforced sparsity structure that maps multiple hidden states (clones) to the same emission state. Clones of the same observation can help discover higher-order temporal structures. For example, a room may have two corners that look the same even though their surrounding areas do not; having two hidden states, each representing one of these corners, models what you would see when moving around much more accurately than having a single hidden state represent both observations. By pre-allocating a fixed number of clones per observation, the Expectation Maximization (EM) algorithm is able to learn to best use these clones to model a sequence of observations. CSCG is simply a CHMM with actions: the chosen action becomes part of the transition function, and the model can then learn a spatial structure simply by observing sequential data and the corresponding action at each time step.
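The two-corners example can be made concrete with a tiny hand-built model (my own toy, not from the paper): a CHMM with two clones of the "corner" observation predicts a room loop perfectly, while a first-order HMM with one state per observation cannot, because its single corner state must hedge between two different successors.

```python
import numpy as np

def forward_loglik(T, E, obs, pi):
    """Log-likelihood of an observation sequence under an HMM (forward algorithm)."""
    alpha = pi * E[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

# A toy room loop: wall(0) -> corner(1) -> door(2) -> corner(1) -> wall ...
# The two corners emit the same observation but have different successors.
obs = [0, 1, 2, 1] * 5

# CHMM: two clones of "corner" (hidden states 1 and 3) share emission symbol 1,
# so the emission matrix is fixed and deterministic by construction.
E_chmm = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]], dtype=float)
T_chmm = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    T_chmm[i, j] = 1.0
pi4 = np.array([1.0, 0, 0, 0])

# First-order HMM with one state per observation: the single "corner" state
# must split its outgoing probability between door and wall.
E_hmm = np.eye(3)
T_hmm = np.array([[0, 1, 0],
                  [0.5, 0, 0.5],
                  [0, 1, 0]], dtype=float)
pi3 = np.array([1.0, 0, 0])

print(forward_loglik(T_chmm, E_chmm, obs, pi4))  # 0.0: fully predictable
print(forward_loglik(T_hmm, E_hmm, obs, pi3))    # negative: ambiguous corners
```

In the paper the transition matrix is learned by EM rather than hand-built, and actions condition the transitions (making it a CSCG), but the likelihood gap above is the essence of why clones capture higher-order structure.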
What is interesting is that the activation of hidden states in a CSCG can explain place cell activations in rat experiments that were previously puzzling. Place cells in the hippocampus were so named because they were thought to represent specific locations in space. However, more recent experiments show that some place cells seem to encode routes toward goals instead of spatial locations. In a rat experiment in which rats are trained to circle a square maze for 4 laps before getting a reward, it was observed that the same locations in the maze are represented by different place cells on different laps. When a CSCG is trained on these sequences, it naturally allocates different clones to different laps, and the activations of hidden states when circling the maze match the place cell firings observed in rats nicely. The authors also showed that CSCG can explain the remapping phenomenon observed in place cells when the environment changes.
From the papers I picked above, you can probably tell that Vicarious' vision of AGI emphasizes more structured approaches instead of working toward one huge learn-it-all network. Generative models like probabilistic graphical models have the potential to be more robust at modeling the underlying causal relationships in an environment, and have the benefit of not needing to be re-trained as long as the underlying relationships remain the same. While recent progress in neural network approaches such as transformers and large language models has surprised many with its capabilities, there still seems to be a gap between reorganizing opinions that originated from humans and having intelligence that can form novel thoughts. I have doubts about the claim, which many people have made, that AGI is within a few years' reach; the path to AGI may still be long, and these published ideas might be needed one day to bridge the gap.