Common Sense, Cortex, and CAPTCHA

Direct link to Science paper here.


From the moment we are born, we begin using our senses to build a coherent model of the world. As we grow, we constantly refine our model and access it effortlessly as we go about our lives. If we see a ball rolling onto the street, we might reason that a child could have kicked it there. When asked to pour a glass of wine, we wouldn’t search for a bottle opener if the wine is already in the decanter. If we are told, “Sally hammered the nail into the floor,” and asked whether the nail was vertical or horizontal, we can imagine the scenario with the appropriate level of detail to answer confidently: vertical [1]. In each of these cases, we are employing our unrivaled ability to make predictions and inferences about ordinary situations. This uniquely human capacity is what we call common sense [2].

Common sense arises from the distillation of past experience into a representation that can be accessed at an appropriate level of detail in any given scenario. A large portion of this knowledge is stored in our visual and motor cortices, and it serves as our internal model of the world [3]. For common sense to be effective, it must be able to answer a variety of hypotheticals, a faculty that we call imagination. This leads us to generative models, probabilistic representations, and reasoning algorithms.

What kind of generative model would suffice for common sense? One way to approach this is to instead ask: what kind of model does the human visual system build? In our recent Science paper, we take a step towards answering these questions by demonstrating how clues from the cortex can be incorporated into a computer vision model we call the Recursive Cortical Network (RCN) [4]. In this blog post, we describe RCN in the context of common sense, cortex, and our long-term research ambitions at Vicarious.

Are existing generative models suitable for common sense?

Modern research in machine learning and artificial intelligence is often reductionist. Researchers identify an aspect of intelligence, isolate its defining characteristics, and create a benchmark to evaluate progress on this narrow problem, while controlling for other variables as much as possible. The problem of common sense is resistant to this sort of reduction, as it involves so many different aspects of intelligence from the same model. In the case of vision, after a common sense model has been built, it should be capable of object recognition, segmentation, imputation, generation, and a combinatorial number of queries that bind the represented variables in different ways without requiring retraining for each of these query types.
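As a minimal sketch of what "different queries on the same model without retraining" means, consider a toy joint distribution over three binary variables. The variable names, the probabilities, and the `query` helper are all hypothetical and chosen only for illustration; this is brute-force enumeration on a tiny model, not how RCN performs inference.

```python
from itertools import product

# Hypothetical binary variables of a toy generative model.
VARIABLES = ("obj", "contour", "surface")


def joint_probability(obj, contour, surface):
    """P(obj) * P(contour | obj) * P(surface | obj), with made-up numbers."""
    p_obj = (0.7, 0.3)[obj]
    p_contour = ((0.9, 0.1), (0.2, 0.8))[obj][contour]
    p_surface = ((0.8, 0.2), (0.3, 0.7))[obj][surface]
    return p_obj * p_contour * p_surface


def query(target, evidence):
    """Return [P(target=0 | evidence), P(target=1 | evidence)] by enumeration."""
    scores = [0.0, 0.0]
    for values in product((0, 1), repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        # Keep only the assignments that agree with the observed variables.
        if all(assignment[name] == value for name, value in evidence.items()):
            scores[assignment[target]] += joint_probability(**assignment)
    total = sum(scores)
    return [score / total for score in scores]


# Three different queries answered by the same model, with no retraining.
recognition = query("obj", {"contour": 1, "surface": 1})  # infer the cause
imputation = query("surface", {"contour": 1})             # fill in a missing part
prediction = query("contour", {"obj": 1})                 # predict from the cause
```

The same `joint_probability` table serves all three calls; only the choice of target and evidence changes. Enumeration scales exponentially with the number of variables, which is why practical models need structured, approximate inference instead.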

Research in generative models often focuses on narrow solutions that can answer specific questions, but does not offer a simple way to fully leverage the model’s knowledge via arbitrary probabilistic queries. For instance, in Variational Autoencoders (VAEs) [5], a by-product of the training is a fast inference network. However, if the model is queried for imputation using a different observed variable set each time, we need to retrain a different network per query, rendering the model unusable. Furthermore, the overwhelming emphasis on the optimization of the “evidence lower bound” (ELBO) on black-box models underemphasizes the importance of obtaining meaningful latent variables. The use of an adequate generative structure (inductive biases) is beneficial both from an interpretability standpoint and to allow for rich integration into more complex systems, even if the price to pay is a somewhat smaller ELBO. One of the strengths (and, at the same time, limitations) of Generative Adversarial Networks (GANs) [6] is that they do not prescribe any inference mechanism, so even after successfully training a generative model, we must resort to a different technique to answer probabilistic queries. Even some tractable models, such as Pixel RNNs [7], are defined following a variable ordering that makes some conditional queries tractable while making others intractable.
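For reference, the ELBO mentioned above is usually written as follows for a VAE. The symbols are the standard ones from the VAE literature rather than notation from this post: a decoder p_θ(x | z), a prior p(z), and an inference network q_φ(z | x).

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

Note that the inference network q_φ(z | x) is trained with the full observation x as its input. This is one way to see why a query that observes only part of x is not served by the same network.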

These individual generative models are powerful within the confines of their training regime, but they do not give rise to the coherent understanding of the world that we recognize as common sense. In search of principles for moving beyond these narrow successes, we turn to the only known successful implementation of common sense: the human brain.

What kind of generative model is the brain?

Decades of research in cognitive science and neuroscience have yielded great insight into the computational and statistical properties of the human brain. These properties suggest several functional requirements for generative models on the path towards general intelligence. In a nutshell, the generative models that we aim to build are compositional, factorized, hierarchical, and flexibly queryable. In Table 1, we list a sampling of the observations from neuroscience that inform our research.

Table 1. Neuroscience observations (see [4] for additional explanation), their computational significance, and the corresponding representational choice in RCN.

Factorization of contour representations and surface representations
Neuroscience observation: Neuroscience evidence indicates that contours and surfaces are represented in a factored manner in the brain [8-11], which might be why people have no difficulty imagining a chair made of ice.
Computational significance: This kind of factorization can be an efficient way to model functions in two and three dimensions [12].
Representational choice in RCN: Surfaces are modeled as a Markov Random Field that enforces continuity of surface properties except when interrupted at locations of contours.

Lateral connections in the visual cortex
Neuroscience observation: Spatial lateral connections are a predominant feature of the visual cortex [13-16]. Laterals are known to play a role in enforcing contour continuity.
Computational significance: Pooling in a hierarchy loses information about the relative poses between features. Lateral connections provide a way to enforce these relative constraints.
Representational choice in RCN: Pool variables are connected by factors that enforce compatibility between the choices made in different pools.

Top-down object-based attention
Neuroscience observation: The visual cortex has the ability to separate out instances of objects even when they are highly overlapping and transparent. This is called top-down object-based attention. Neuroscientists have specified the requirements for a hierarchy to support top-down attention control [17-21].
Computational significance: The ability to support object-based attention is required for dealing with overlapping objects, and is required for object-background factorization and object-level compositionality.
Representational choice in RCN: Object-level top-down attention is possible as a combination of the non-negative weights, lateral connections, and explaining away in the model.

Message passing based approximate inference (and learning)
Neuroscience observation: Several pieces of neuroscience evidence suggest that cortex is using a message-passing-like algorithm, and that it is doing inference on the generative model itself, rather than using auxiliary networks for pre-specified queries [3, 22].
Computational significance: For probabilistic graphical models, message passing algorithms hold a lot of promise as a computationally simple mechanism for approximate inference. See also our work on using message passing for feature learning [23].
Representational choice in RCN: Many representational choices, like compositionality, feature-specific lateral connections, and sparsity of weights were also found to be beneficial for message passing inference.
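To make two of the ideas in Table 1 concrete, here is a minimal sketch of a surface model that enforces continuity except where a contour interrupts it, solved by message passing. It is a deliberately simplified assumption: a 1-D chain of pixels with min-sum (Viterbi-style) messages, whereas RCN itself is a richer, hierarchical model. The function name, the cost values, and the `smoothness` parameter are hypothetical.

```python
def map_surface_labels(unary_costs, contour_between, smoothness=2.0):
    """Min-sum message passing on a chain MRF of surface labels.

    unary_costs[i][k]: cost of assigning surface label k to pixel i.
    contour_between[i]: True if a contour separates pixel i and pixel i + 1,
    in which case the smoothness penalty on that edge is switched off.
    """
    num_labels = len(unary_costs[0])
    # Forward pass: best[i][k] is the cheapest cost of pixels 0..i given pixel i = k.
    best = [list(unary_costs[0])]
    back = []
    for i in range(1, len(unary_costs)):
        penalty = 0.0 if contour_between[i - 1] else smoothness
        row, pointers = [], []
        for k in range(num_labels):
            candidates = [best[i - 1][j] + (penalty if j != k else 0.0)
                          for j in range(num_labels)]
            j_star = min(range(num_labels), key=candidates.__getitem__)
            row.append(candidates[j_star] + unary_costs[i][k])
            pointers.append(j_star)
        best.append(row)
        back.append(pointers)
    # Backward pass: follow the stored pointers from the cheapest final label.
    labels = [min(range(num_labels), key=best[-1].__getitem__)]
    for pointers in reversed(back):
        labels.append(pointers[labels[-1]])
    labels.reverse()
    return labels


# Six pixels, two surface labels. Pixel 1 is noisy and weakly prefers label 1.
costs = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0],
         [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

# A contour between pixels 2 and 3 lets the surface label change there for free.
with_contour = map_surface_labels(costs, [False, False, True, False, False])

# Without the contour, smoothing spreads a single label across the boundary.
without_contour = map_surface_labels(costs, [False] * 5)
```

With the contour in place, the noisy pixel is smoothed into its neighbors and the two surfaces stay distinct; without it, the whole row collapses to a single label. Contours and surfaces are kept as separate factors here, so either can be changed without touching the other.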

The Recursive Cortical Network: scaffolding versus tabula rasa

In our recent Science publication [4], we describe the Recursive Cortical Network (RCN): a generative model that satisfies the functional requirements listed in Table 1 and achieves strong performance and high data efficiency on a diverse set of computer vision tasks. RCN represents a departure from the prevailing deep learning zeitgeist that prizes learning from scratch, tabula rasa. RCN begins with “scaffolding”, prior structure that facilitates model building. For example, while most CNNs and VAEs are whole-image models that assume very little about objects and images, RCN is an object-based model that assumes factorization of contours and surfaces, and objects and background. RCN also represents shape explicitly, and the presence of lateral connections allows it to pool across large transformations without losing specificity, thereby increasing its invariance. Compositionality allows RCN to represent scenes with multiple objects while only requiring explicit training on individual objects. All of these features of RCN derive from our assumption that evolution has endowed the neocortex with similar scaffolding that makes it easy to learn representations in our world compared to starting from a totally blank slate.

With the right scaffolding in place, learning and inference become far easier. During learning, RCN is much more data efficient than its tabula rasa counterparts – 300x more in the case of a scene text recognition benchmark. Where many models will overfit to extraneous details of their training set, RCN identifies the salient aspects of a scene, permitting strong generalization to other similar scenes. Moreover, in the RCN setting, classification, detection, segmentation, and occlusion reasoning are all different, interconnected queries on the same model that provide explanations for the evidence present in the image.

CAPTCHA: why the central problem in AI is to understand the letter ‘A’

In 2013, we announced an early success of RCN: its ability to break text-based CAPTCHAs like those illustrated below (left column). With one model, we achieve an accuracy rate of 66.6% on reCAPTCHAs, 64.4% on BotDetect, 57.4% on Yahoo, and 57.1% on PayPal, all significantly above the 1% rate at which CAPTCHAs are considered ineffective (see [4] for more details). When we optimize a single model for a specific style, we can achieve up to 90% accuracy. The Science article published this week reveals the details of the RCN model and its algorithms. After revealing the what and the how, we wanted to describe the why: why we chose the CAPTCHA benchmark first, and why it is still a very relevant benchmark for general AI.

The CAPTCHA-style letter A’s in the figure above (right column) illustrate the combinatorial number of ways in which this letter can be rendered and recognized by humans, without being explicitly trained on those kinds of variations. None of the public APIs for Optical Character Recognition (OCR) that we evaluated are able to capture this diversity, because this requires the recognition engine to generalize to distributions that are not represented in the training set. These methods are based on brute force pattern recognition. They have no notion of compositionality, and thus no mechanism to separate the letter A from its background. Furthermore, they have no understanding of objects, and therefore no way to reason about the shape and appearance of the letter A in isolation. As shown in the GIF below, deep learning methods like CNNs [24] trained on CAPTCHAs generalize poorly to small variations in the spacing of the individual letters [4]. In contrast, RCN remains robust as the letters spread. Note that performance in the animation is reported for CAPTCHA images that we created to evaluate the effect of spacing, separately from the reCAPTCHA dataset. Performance on reCAPTCHA and several other styles is reported along with more details in [4].

Douglas Hofstadter, an influential philosopher and AI researcher, famously quipped that the central problem in AI is to understand the letter A. Just like Hofstadter, we believe that “for any program to handle letterforms with the flexibility that human beings do, it would have to possess full-scale artificial intelligence.” Although the ‘super human’ accuracies on ImageNet classification or automatic caption generation systems can give the impression that perception is a solved problem, seemingly simple problems can provide enormous depth and insight towards developing human-like intelligence [25].

Our work in the paper is a small step toward endowing computers with the ability to understand letterforms with the flexibility and fluidity of human perception. Even with our advancements, we are still far from having solved Hofstadter’s seemingly simple challenge of detecting ‘A’s with the same ‘fluidity and dynamism’ as humans. We believe that many of the ideas that we explored in the paper will be important for building systems that can generalize beyond their training distributions like humans do.


The world around us is filled with organisms that employ complex behaviors to thrive within their niches. While ants have super-human tunneling ability and salmon might be unrivaled navigators, their brains tell us little about general intelligence. Similarly, deep learning has demonstrated many narrow super-human abilities in recognizing photos and playing games. It is important not to conflate the success of deep learning in creating a diversity of narrow intelligences with progress on the path toward general intelligence.

Miles Brundage said it well [26]:

[P]rogress so far has largely been toward demonstrating general approaches for building narrow systems rather than general approaches for building general systems. Progress toward the former does not entail substantial progress toward the latter.

General systems are harder to evaluate and harder to build than their narrow counterparts, but we must confront this difficulty directly if we ever hope to achieve human-level intelligence with qualities like common sense. Our work on RCN is one small step on the long path towards more general intelligence, one that will continue to be informed by feedback from neuroscience, shaped by computational constraints, and perfected with input from the AI and neuroscience communities.

We are always looking for exceptional researchers, engineers, and business people to collaborate with us in our quest for human-like AI. Click here to see open positions at Vicarious.

Reference Code

A reference implementation of RCN is available on GitHub here.


We used the following datasets for the Science paper:


[1] Zwaan, R. A., & Madden, C. J. (2005). Embodied sentence comprehension. Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking, 224–245.

[2] Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9), 92–103.

[3] Lee, T. S. (2015). The Visual System’s Internal Model of the World. Proceedings of the IEEE, 103(8), 1359–1378.

[4] George, D., Lehrach, W., Kansky, K., Lazaro-Gredilla, M., Laan, C., Marthi, B., Lou, X., Meng, Z., Liu, Y., Wang, H., Lavin, A., & Phoenix, D. S. (2017). A generative vision model that trains with high data-efficiency and breaks text-based CAPTCHAs. Science.

[5] Kingma, D. P., & Welling, M. (2014). Stochastic Gradient VB and the Variational Auto-Encoder. In 2nd International Conference on Learning Representations (ICLR).

[6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS).

[7] Oord, A. van den, Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel recurrent neural networks. arXiv Preprint arXiv:1601.06759.

[8] Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: evidence for a local association field. Vision Research, 33(2):173–193.

[9] Lamme, V. A. F., Rodriguez-Rodriguez, V., & Spekreijse, H. (1999). Separate processing dynamics for texture elements, boundaries and surfaces in primary visual cortex of the macaque monkey. Cerebral Cortex, 9(4):406–413.

[10] Lamme, V. A. F. & Roelfsema, P.R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences, 23(11):571–9.

[11] Huang, X., & Paradiso, M. A. (2008). V1 response timing and surface filling-in. Journal of Neurophysiology, 100(1):539–547.

[12] Chandrasekaran, V., Wakin, M. B., Baron, D., & Baraniuk, R. G. (2009). Representation and Compression of Multidimensional Piecewise Functions Using Surflets. IEEE Transactions on Information Theory, 55(1), 374-400.

[13] Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. The Journal of Neuroscience, 9(7):2432–2442.

[14] DeYoe, E. A. & Van Essen, D. C. (1988). Concurrent processing streams in monkey visual cortex. Trends in Neurosciences, 11(5):219–226.

[15] Zhou, H., Friedman, H.S., & Von Der Heydt, R. (2000). Coding of border ownership in monkey visual cortex. The Journal of Neuroscience, 20(17):6594–6611.

[16] Thomson, A. M. & Bannister, A. P. (2003). Interlaminar connections in the neocortex. Cerebral Cortex 13, 5–14.

[17] Gilbert, C. D., & Li, W. (2013). Top-down influences on visual processing. Nature Reviews: Neuroscience, 14(5):350–63.

[18] Roelfsema, P. R., Lamme, V. A. F., & Spekreijse, H. (1998). Object-based attention in the primary visual cortex of the macaque monkey. Nature, 395(6700):376–381.

[19] Cohen, E. H. & Tong, F. (2015). Neural mechanisms of object-based attention. Cerebral Cortex, 25(4):1080–1092.

[20] Craft, E., Schutze, H., Niebur, E., & Von Der Heydt, R. (2007). A neural model of figure-ground organization. Journal of Neurophysiology, 97(6):4310–4326.

[21] Zhou, H., Friedman, H.S., & Von Der Heydt, R. (2000). Coding of border ownership in monkey visual cortex. The Journal of Neuroscience, 20(17):6594–6611.

[22] Lee, T. S. & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. JOSA A, 20(7):1434–1448.

[23] Lázaro-Gredilla, M., Liu, Y., Phoenix, D. S., & George, D. (2016). Hierarchical compositional feature learning. arXiv Preprint arXiv:1611.02252.

[24] Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., & Shet, V. (2014). Multi-digit number recognition from street view imagery using deep convolutional neural networks. In International Conference on Learning Representations (ICLR).

[25] Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building machines that learn and think like people. Behavioral and Brain Sciences, 1–101.

[26] AlphaGo and AI Progress. Retrieved October 24, 2017, from