Introduction
From the moment we are born, we begin using our senses to build a coherent model of the world. As we grow, we constantly refine our model and access it effortlessly as we go about our lives. If we see a ball rolling onto the street, we might reason that a child could have kicked it there. When asked to pour a glass of wine, we wouldn’t search for a bottle opener if the wine is already in the decanter. If we are told, “Sally hammered the nail into the floor,” and asked whether the nail was vertical or horizontal, we can imagine the scenario with the appropriate level of detail to answer confidently: vertical [1]. In each of these cases, we are employing our unrivaled ability to make predictions and inferences about ordinary situations. This uniquely human capacity is what we call common sense [2].
Common sense arises from the distillation of past experience into a representation that can be accessed at an appropriate level of detail in any given scenario. A large portion of this knowledge is stored in our visual and motor cortices, and it serves as our internal model of the world [3]. For common sense to be effective, it must be able to answer a variety of hypotheticals – a faculty that we call imagination. This leads us to generative models, probabilistic representations, and reasoning algorithms.
What kind of generative model would suffice for common sense? One way to approach this is to instead ask: what kind of model does the human visual system build? In our recent Science paper, we take a step towards answering these questions by demonstrating how clues from the cortex can be incorporated into a computer vision model we call the Recursive Cortical Network (RCN) [4]. In this blog post, we describe RCN in the context of common sense, cortex, and our long-term research ambitions at Vicarious.
Are existing generative models suitable for common sense?
Modern research in machine learning and artificial intelligence is often reductionist. Researchers identify an aspect of intelligence, isolate its defining characteristics, and create a benchmark to evaluate progress on this narrow problem, while controlling for other variables as much as possible. The problem of common sense resists this sort of reduction, because it demands so many different aspects of intelligence from a single model. In the case of vision, once a common sense model has been built, it should be capable of object recognition, segmentation, imputation, generation, and a combinatorial number of queries that bind the represented variables in different ways, all without requiring retraining for each query type.
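To make the notion of "queries that bind the variables in different ways" concrete, here is a minimal sketch of flexible querying by exact enumeration over a toy joint distribution. The variable names ("edge", "corner", "digit") and all probability values are invented for illustration; the point is only that a single model can answer recognition-style and imputation-style queries alike, with no per-query retraining.

```python
from itertools import product

# A toy joint distribution over three binary variables. The numbers are
# made up -- what matters is the query mechanism, not the model itself.
VARS = ("edge", "corner", "digit")
JOINT = {}
for e, c, d in product((0, 1), repeat=3):
    p_e = 0.5                               # p(edge)
    p_c = 0.8 if c == e else 0.2            # corners tend to co-occur with edges
    p_d = 0.9 if (d == 1) == (e == 1 and c == 1) else 0.1  # digit needs both parts
    JOINT[(e, c, d)] = p_e * p_c * p_d

def query(target, evidence):
    """P(target=1 | evidence) by enumeration. Any subset of variables may
    serve as evidence, so every binding of the variables is answerable."""
    num = den = 0.0
    for assignment in product((0, 1), repeat=3):
        state = dict(zip(VARS, assignment))
        if any(state[k] != v for k, v in evidence.items()):
            continue
        den += JOINT[assignment]
        if state[target] == 1:
            num += JOINT[assignment]
    return num / den

# The same table serves queries that bind the variables differently:
print(query("digit", {"edge": 1, "corner": 1}))  # recognition-style query
print(query("edge", {"digit": 1}))               # imputation-style query
```

Enumeration is exponential in the number of variables, of course; real models need structure (factorization, hierarchy) to make such queries efficient, which is exactly the direction the rest of this post pursues.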
Research in generative models often focuses on narrow solutions that can answer specific questions, but does not offer a simple way to fully leverage the model’s knowledge via arbitrary probabilistic queries. For instance, in Variational Autoencoders (VAEs) [5], a by-product of training is a fast inference network. However, if the model is queried for imputation with a different set of observed variables each time, a different network must be retrained for each query, rendering the approach impractical. Furthermore, the overwhelming emphasis on optimizing the “evidence lower bound” (ELBO) of black-box models underemphasizes the importance of obtaining meaningful latent variables. An adequate generative structure (inductive biases) is beneficial both for interpretability and for rich integration into more complex systems, even if the price is a somewhat smaller ELBO. One of the strengths – but at the same time, limitations – of Generative Adversarial Networks (GANs) [6] is that they do not prescribe any inference mechanism, so even after successfully training a generative model, we must resort to a different technique to answer probabilistic queries. Even some tractable models, such as PixelRNNs [7], are defined following an ordering that makes some conditional queries trivially tractable while leaving others intractable.
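The ordering asymmetry in autoregressive models can be sketched with a toy chain of binary "pixels" – far simpler than a PixelRNN, and with invented probabilities, but exhibiting the same structural property: conditionals that follow the model's ordering are a single lookup, while conditionals against the ordering require summing over every unobserved intermediate variable.

```python
from itertools import product

# A minimal autoregressive chain over binary pixels x1..xN, where each
# conditional p(x_i | x_{<i}) depends only on the previous pixel.
# All probability values are invented for illustration.
def cond(prev, x):
    """p(x_i = x | x_{i-1} = prev); p(x_1) uses prev=None."""
    p1 = 0.5 if prev is None else (0.7 if prev == 1 else 0.3)
    return p1 if x == 1 else 1.0 - p1

def forward_query(prefix, x):
    """p(x_{n+1} = x | x_1..x_n = prefix): a single lookup, by construction."""
    return cond(prefix[-1] if prefix else None, x)

def reverse_query(n, x1, xn):
    """p(x_1 = x1 | x_n = xn): requires summing over all 2^(n-2) settings
    of the intermediate pixels -- tractable only when n is small."""
    def joint(xs):
        p, prev = 1.0, None
        for x in xs:
            p *= cond(prev, x)
            prev = x
        return p
    num = den = 0.0
    for mid in product((0, 1), repeat=n - 2):
        for first in (0, 1):
            p = joint((first,) + mid + (xn,))
            den += p
            if first == x1:
                num += p
    return num / den
```

The forward query costs the same regardless of chain length, while the reverse query's cost grows exponentially in the number of unobserved pixels between the evidence and the target – the sense in which the factorization ordering makes some conditionals tractable and others not.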
These individual generative models are powerful within the confines of their training regime, but they do not give rise to the coherent understanding of the world that we recognize as common sense. In search of principles for moving beyond these narrow successes, we turn to the only known successful implementation of common sense: the human brain.
What kind of generative model is the brain?
Decades of research in cognitive science and neuroscience have yielded great insight into the computational and statistical properties of the human brain. These properties suggest several functional requirements for generative models on the path towards general intelligence. In a nutshell, the generative models that we aim to build are compositional, factorized, hierarchical, and flexibly queryable. In Table 1, we list a sampling of the observations from neuroscience that inform our research.