Common Sense, Cortex, and CAPTCHA

Posted October 2017


From the moment we are born, we begin using our senses to build a coherent model of the world. As we grow, we constantly refine our model and access it effortlessly as we go about our lives. If we see a ball rolling onto the street, we might reason that a child could have kicked it there. When asked to pour a glass of wine, we wouldn’t search for a bottle opener if the wine is already in the decanter. If we are told, “Sally hammered the nail into the floor,” and asked whether the nail was vertical or horizontal, we can imagine the scenario with the appropriate level of detail to answer confidently: vertical [1]. In each of these cases, we are employing our unrivaled ability to make predictions and inferences about ordinary situations. This uniquely human capacity is what we call common sense [2].

Common sense arises from the distillation of past experience into a representation that can be accessed at an appropriate level of detail in any given scenario. A large portion of this knowledge is stored in our visual and motor cortices, and it serves as our internal model of the world [3]. For common sense to be effective, it must be able to answer a variety of hypotheticals – a faculty that we call imagination. This leads us to generative models, probabilistic representations, and reasoning algorithms.

What kind of generative model would suffice for common sense? One way to approach this is to instead ask: what kind of model does the human visual system build? In our recent Science paper, we take a step towards answering these questions by demonstrating how clues from the cortex can be incorporated into a computer vision model we call the Recursive Cortical Network (RCN) [4]. In this blog post, we describe RCN in the context of common sense, cortex, and our long-term research ambitions at Vicarious.

Are existing generative models suitable for common sense?

Modern research in machine learning and artificial intelligence is often reductionist. Researchers identify an aspect of intelligence, isolate its defining characteristics, and create a benchmark to evaluate progress on this narrow problem, while controlling for other variables as much as possible. The problem of common sense resists this sort of reduction, because it demands many different aspects of intelligence from the same model. In the case of vision, once a common sense model has been built, it should be capable of object recognition, segmentation, imputation, generation, and a combinatorial number of queries that bind the represented variables in different ways, without requiring retraining for each query type.
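To make "many queries, one model" concrete, here is a minimal sketch in Python. It is not RCN and not code from the paper: the variable names (label, pixel_a, pixel_b), the weights, and the query function are all made up for illustration. The point is only that an explicit joint distribution can answer a recognition-style query and an imputation-style query with the same stored knowledge and no retraining.

```python
# Hypothetical toy model: three binary variables with made-up weights.
VARS = ("label", "pixel_a", "pixel_b")

# Unnormalized weight for every joint assignment (label, pixel_a, pixel_b).
weights = {
    (0, 0, 0): 4.0, (0, 0, 1): 1.0, (0, 1, 0): 1.0, (0, 1, 1): 0.5,
    (1, 0, 0): 0.5, (1, 0, 1): 1.0, (1, 1, 0): 1.0, (1, 1, 1): 4.0,
}
total = sum(weights.values())
joint = {assignment: w / total for assignment, w in weights.items()}


def query(target, evidence):
    """Return P(target | evidence) as a dict {value: probability}.

    Assignments that contradict the evidence are skipped; all remaining
    unobserved variables are summed out.
    """
    if target in evidence:
        raise ValueError("target must not be part of the evidence")
    t = VARS.index(target)
    scores = {0: 0.0, 1: 0.0}
    for assignment, p in joint.items():
        consistent = all(
            assignment[VARS.index(name)] == value
            for name, value in evidence.items()
        )
        if consistent:
            scores[assignment[t]] += p
    z = sum(scores.values())
    return {value: s / z for value, s in scores.items()}


# "Recognition": infer the label from both pixels.
recognition = query("label", {"pixel_a": 1, "pixel_b": 1})
# "Imputation": infer a missing pixel from the label and the other pixel.
imputation = query("pixel_b", {"label": 1, "pixel_a": 0})
print(recognition, imputation)
```

Enumerating every assignment only works for a toy of this size. The design question for a real model is how to keep this flexibility when the joint is far too large to tabulate.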

Research in generative models often focuses on narrow solutions that can answer specific questions but do not offer a simple way to fully leverage the model’s knowledge via arbitrary probabilistic queries. For instance, in Variational Autoencoders (VAEs) [5], a by-product of training is a fast inference network. However, if the model is queried for imputation using a different set of observed variables each time, we need to train a different inference network per query, which makes the model impractical for such queries. Furthermore, the overwhelming emphasis on optimizing the “evidence lower bound” (ELBO) of black-box models underemphasizes the importance of obtaining meaningful latent variables. The use of an adequate generative structure (inductive biases) is beneficial both from an interpretability standpoint and to allow for rich integration into more complex systems, even if the price to pay is a somewhat smaller ELBO. One of the strengths – but at the same time, limitations – of Generative Adversarial Networks (GANs) [6] is that they do not prescribe any inference mechanism, so even after successfully training a generative model, we must resort to a different technique to answer probabilistic queries. Even some tractable models, such as Pixel RNNs [7], are defined following an ordering that makes some conditional queries easy to compute while leaving others intractable.
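The effect of a fixed ordering can be sketched with a toy autoregressive model in Python. This is an illustration under stated assumptions, not a Pixel RNN: the conditional function below is a hand-written stand-in for a learned network, and N, query_with_order, and query_against_order are hypothetical names. A query that follows the generation order costs one conditional evaluation, while a query that runs against the order has to sum the joint over every intermediate assignment, which grows exponentially with the number of variables when the model offers no further structure to exploit.

```python
import math
from itertools import product

N = 10  # binary variables, generated in the order x_0, x_1, ..., x_{N-1}


def conditional(i, value, prefix):
    """Stand-in for a learned P(x_i = value | x_0..x_{i-1} = prefix)."""
    # The chance of a 1 grows with the number of ones seen so far.
    p_one = (1 + sum(prefix)) / (2 + len(prefix))
    return p_one if value == 1 else 1.0 - p_one


def joint(x):
    """Chain-rule probability of a full assignment (product of N conditionals)."""
    return math.prod(conditional(i, x[i], x[:i]) for i in range(len(x)))


def query_with_order(prefix):
    """P(x_{N-1} = 1 | all earlier variables): one conditional evaluation."""
    return conditional(len(prefix), 1, prefix), 1


def query_against_order(last_value):
    """P(x_0 = 1 | x_{N-1} = last_value) by brute-force marginalization."""
    scores = {0: 0.0, 1: 0.0}
    terms = 0
    for first in (0, 1):
        for middle in product((0, 1), repeat=N - 2):
            scores[first] += joint((first, *middle, last_value))
            terms += 1
    return scores[1] / (scores[0] + scores[1]), terms


p_fwd, cost_fwd = query_with_order((1,) * (N - 1))
p_rev, cost_rev = query_against_order(1)
print(cost_fwd, cost_rev)  # 1 evaluation versus 2 ** (N - 1) joint terms
```

With N = 10 the reverse query is still cheap, but each added variable doubles the number of terms, so the same brute-force approach is hopeless at image scale.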

These individual generative models are powerful within the confines of their training regime, but they do not give rise to the coherent understanding of the world that we recognize as common sense. In search of principles for moving beyond these narrow successes, we turn to the only known successful implementation of common sense: the human brain.

What kind of generative model is the brain?

Decades of research in cognitive science and neuroscience have yielded great insight into the computational and statistical properties of the human brain. These properties suggest several functional requirements for generative models on the path towards general intelligence. In a nutshell, the generative models that we aim to build are compositional, factorized, hierarchical, and flexibly queryable. In Table 1, we list a sampling of the observations from neuroscience that inform our research.

The Recursive Cortical Network: Scaffolding Versus Tabula Rasa

In our recent Science publication [4], we describe the Recursive Cortical Network (RCN): a generative model that satisfies the functional requirements listed in Table 1 and achieves strong performance and high data efficiency on a diverse set of computer vision tasks. RCN represents a departure from the prevailing deep learning zeitgeist that prizes learning from scratch, tabula rasa. RCN begins with “scaffolding”, prior structure that facilitates model building. For example, while most CNNs and VAEs are whole-image models that assume very little about objects and images, RCN is an object-based model that assumes factorization of contours and surfaces, and objects and background. RCN also represents shape explicitly, and the presence of lateral connections allows it to pool across large transformations without losing specificity, thereby increasing its invariance. Compositionality allows RCN to represent scenes with multiple objects while only requiring explicit training on individual objects. All of these features of RCN derive from our assumption that evolution has endowed the neocortex with similar scaffolding that makes it easy to learn representations in our world compared to starting from a totally blank slate.

With the right scaffolding in place, learning and inference become far easier. During learning, RCN is much more data efficient than its tabula rasa counterparts – 300x more in the case of a scene text recognition benchmark. Where many models will overfit to extraneous details of their training set, RCN identifies the salient aspects of a scene, permitting strong generalization to other similar scenes. Moreover, in the RCN setting, classification, detection, segmentation, and occlusion reasoning are all different, interconnected queries on the same model that provide explanations for the evidence present in the image.

CAPTCHA: why the central problem in AI is to understand the letter ‘A’

In 2013, we announced an early success of RCN: its ability to break text-based CAPTCHAs like those illustrated below (left column). With one model, we achieve an accuracy rate of 66.6% on reCAPTCHAs, 64.4% on BotDetect, 57.4% on Yahoo, and 57.1% on PayPal, all significantly above the 1% rate at which CAPTCHAs are considered ineffective (see [4] for more details). When we optimize a single model for a specific style, we can achieve up to 90% accuracy. The Science article published this week reveals the details of the RCN model and its algorithms. After revealing the what and the how, we wanted to describe the why: why we chose the CAPTCHA benchmark first, and why it is still a very relevant benchmark for general AI.

The CAPTCHA-style letter A’s in the figure above (right column) illustrate the combinatorial number of ways in which this letter can be rendered and recognized by humans, without being explicitly trained on those kinds of variations. None of the public APIs for Optical Character Recognition (OCR) that we evaluated are able to capture this diversity, because this requires the recognition engine to generalize to distributions that are not represented in the training set. These methods are based on brute force pattern recognition. They have no notion of compositionality, and thus no mechanism to separate the letter A from its background. Furthermore, they have no understanding of objects, and therefore no way to reason about the shape and appearance of the letter A in isolation. As shown in the GIF below, deep learning methods like CNNs [24] trained on CAPTCHAs generalize poorly to small variations in the spacing of the individual letters [4]. In contrast, RCN remains robust as the letters spread. Note that performance in the animation is reported for CAPTCHA images that we created to evaluate the effect of spacing, separately from the reCAPTCHA dataset. Performance on reCAPTCHA and several other styles is reported along with more details in [4].

Douglas Hofstadter, an influential philosopher and AI researcher, famously quipped that the central problem in AI is to understand the letter A. Just like Hofstadter, we believe that “for any program to handle letterforms with the flexibility that human beings do, it would have to possess full-scale artificial intelligence.” Although the ‘super human’ accuracies on ImageNet classification or automatic caption generation systems can give the impression that perception is a solved problem, seemingly simple problems can provide enormous depth and insight towards developing human-like intelligence [25].

Our work in the paper is a small step toward enabling computers to understand letterforms with the flexibility and fluidity of human perception. Even with our advancements, we are still far from having solved Hofstadter’s seemingly simple challenge of detecting ‘A’s with the same ‘fluidity and dynamism’ as humans. We believe that many of the ideas that we explored in the paper will be important for building systems that can generalize beyond their training distributions like humans do.


The world around us is filled with organisms that employ complex behaviors to thrive within their niches. While ants have super-human tunneling ability and salmon might be unrivaled navigators, their brains tell us little about general intelligence. Similarly, deep learning has demonstrated many narrow super-human abilities in recognizing photos and playing games. It is important not to mistake the success of deep learning in creating a diversity of narrow intelligences for progress on the path toward general intelligence.

Miles Brundage said it well [26]:

Progress so far has largely been toward demonstrating general approaches for building narrow systems rather than general approaches for building general systems. Progress toward the former does not entail substantial progress toward the latter.


General systems are harder to evaluate and harder to build than their narrow counterparts, but we must confront this difficulty directly if we ever hope to achieve human-level intelligence with qualities like common sense. Our work on RCN is one small step on the long path towards more general intelligence, a path that will continue to be informed by feedback from neuroscience, shaped by computational constraints, and refined with input from the AI and neuroscience communities.

We are always looking for exceptional researchers, engineers, and business people to collaborate with us in our quest for human-like AI. Click here to see open positions at Vicarious.

Reference Code

A reference implementation of RCN is available on GitHub here.


We used the following datasets for the Science paper:

CAPTCHA Datasets

reCAPTCHA (from
BotDetect (from
PayPal (from
Yahoo (from

MNIST Datasets

Original (available at
With occlusions (by us)
With noise (by us)