From Action to Abstraction: Learning Concepts through Sensorimotor Interactions

Posted February 2018

A Thought Experiment

Imagine that you wake up in a strange room. It’s not the nice bedroom you went to sleep in, but a dimly lit cell with a damp, cold floor. The walls are made out of cracking plaster and the only intended way in or out seems to be an imposing steel door that is padlocked from the inside. High above on one wall is a barred window that lets in the only light. If, after looking around, you come to the conclusion that you are trapped, that wouldn’t be unreasonable. Things do look dire.

Would that satisfy you, though? Probably not. You would want to explore the room a bit more, maybe give that padlock a tug to see how secure it really is. Or maybe you would test the strength of those cracking plaster walls. Perhaps a few well-targeted blows to those well-worn walls would create a hole you can slip through? And maybe, just maybe, those bars on the window are set wide enough apart that you could wiggle between them to freedom? Interacting with the environment gives you much more information than just passively observing it. Seeing might be believing, but to actually justify those beliefs, you need to interact with your surroundings.

Concept of a Concept

Containment is a concept. Dog is a concept too. So is running, or forests, or beauty, or green, or death. Concepts are abstractions that we derive from everyday interactions with the world. They form the reusable building blocks of knowledge that are essential to humans for making sense of the world.

When we have a conceptual understanding of something, we have in a way a mastery of that thing. In the case of containment, this mastery means that we can identify containers in the world, tell them apart from non-containers, put things into them, take things out of them, and anticipate what will happen when we interact with them. We can even begin to look at novel objects and see in them the potential to contain or be contained.

Common approaches to conceptual understanding in AI, including deep learning systems trained on datasets like ImageNet [1], appear to capture some of these abilities, but they lack the mastery that comes from interaction. Given an image or even a video, such approaches might be able to tell whether there is a specific kind of container in it (say, a cup, a house, or a bottle) and locate where in the image the container is. But they would likely fail in spectacular ways when encountering a previously unseen type of container. Asking such a system to contain itself would be met only with confusion, since it associates the container concept with a collection of visual features but lacks an active understanding of containment.

Concepts from Sensorimotor Contingencies

Henri Poincaré was among the first to emphasize the role of sensorimotor representations in human understanding. In Science and Hypothesis [2] he argued that a motionless being would never acquire the concept of 3D space. Recently, several cognitive scientists have proposed that conceptual representations arise from the integration of perception and action. To pick just one work, O’Regan and Noë [3] define sensorimotor contingencies as “the structure of the rules governing the sensory changes produced by various motor actions,” viewing vision as a “mode of exploration of the world that is mediated by knowledge of what we call sensorimotor contingencies.” Noë [4] goes on to elaborate that “(concepts) are themselves techniques or means for handling what there is.”

While the importance of sensorimotor contingencies has been appreciated in the cognitive science community, those ideas have resulted in only a few concrete computational models that explore their role in concept formation. In a paper that we presented recently at AAAI-18, we introduced a computational model that learns concepts by interacting with the environment.

What We Did

We set out to represent and learn two essential abilities that make up conceptual understanding: the ability to actively detect the presence of a concept, and the ability to actively bring about a concept. Further, we wanted to investigate the situations in which interactive abilities are preferable to passive approaches, and understand how the reuse of abilities learned for simple concepts might help with learning more complex concepts.

We started by developing a training ground for learning active concepts, an environment we call PixelWorld (available on GitHub). In PixelWorld, things are a bit simpler than they are in the real world. It is a discrete 2D grid environment inhabited by a pixel agent and one or more objects of different kinds, all composed of a few pixels (e.g., lines, blobs, and containers).

The agent has a simple embodiment: it perceives only a 3×3 window around itself, and can choose to move up, down, left, right, or stop and signal a bit of information. This embodiment requires it to learn even the most basic representations, such as the notion of an object, as interactive concepts. While this might seem like unnecessary sensory deprivation, eliminating a sophisticated visual perception system allows us to highlight the role of composing heterogeneous behaviors into meaningful concept representations.
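As a rough illustration of this embodiment, the sketch below shows one way a 3×3 egocentric observation and the four movement actions could be written for a small grid. It is not the PixelWorld code: the grid encoding (0 for empty, 1 for an object pixel), the choice to treat out-of-bounds cells as solid, and the function names `observe` and `step` are all assumptions made for this example.

```python
from typing import List, Tuple

Grid = List[List[int]]  # assumed encoding: 0 = empty, 1 = object pixel

# The four movement actions named in the post, as (row, column) offsets.
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def observe(grid: Grid, pos: Tuple[int, int]) -> Grid:
    """Return the 3x3 window centred on the agent.

    Cells outside the grid are reported as 1 (an assumption of this sketch).
    """
    r, c = pos
    rows, cols = len(grid), len(grid[0])
    window = []
    for dr in (-1, 0, 1):
        row = []
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < rows and 0 <= cc < cols
            row.append(grid[rr][cc] if inside else 1)
        window.append(row)
    return window


def step(grid: Grid, pos: Tuple[int, int], action: str) -> Tuple[int, int]:
    """Move one cell, unless blocked by an object pixel or the grid edge."""
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0:
        return (r, c)
    return pos
```

Because the agent sees only its immediate neighbours, anything it learns about an object farther away has to come from moving there, which is the point of the restricted embodiment.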

We trained agents for two different kinds of tasks. The first task was to explore the environment and signal whether the concept was present, e.g., whether the agent was contained. It was rewarded if it got the answer right. The second task was to bring about the concept, e.g., making itself contained. It was rewarded if it brought about the concept and correctly signaled that it had. We used reinforcement learning to train agents to solve these tasks.
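The two reward rules described above can be written down in a few lines. This is only a sketch: the post does not give reward magnitudes, so the values 1.0 and 0.0, the constant names `SIG0` and `SIG1`, and both function names are assumptions made for illustration.

```python
SIG0, SIG1 = 0, 1  # assumed encoding of the two signal actions


def detection_reward(signal: int, concept_present: bool) -> float:
    """Task 1: reward the agent when its signal matches the ground truth."""
    answered_yes = signal == SIG1
    return 1.0 if answered_yes == concept_present else 0.0


def bring_about_reward(signal: int, concept_holds_now: bool) -> float:
    """Task 2: reward only if the concept now holds AND the agent signals success."""
    return 1.0 if (signal == SIG1 and concept_holds_now) else 0.0
```

Note the asymmetry: in the detection task a correct "no" is rewarded, while in the bring-about task only achieving the concept and reporting it counts.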

For example, we trained an agent to detect whether it was (horizontally) contained. The animation below illustrates its behavior: it checks whether there is a wall on the right, then checks whether there is a wall on the left. Since both tests succeed, it signals that it is contained.
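The behavior described here (test the right side, then the left side, then signal) can be mimicked with a short hand-written routine. The trained agent acquired this strategy through reinforcement learning; the sketch below is only a scripted stand-in on a single row of cells, and the row encoding and the function name `horizontally_contained` are assumptions.

```python
from typing import List


def horizontally_contained(row: List[int], start: int) -> int:
    """Walk right, then left, from `start`; return 1 (SIG1) or 0 (SIG0).

    `row` uses 0 for an empty cell and 1 for a wall pixel (assumed encoding).
    """

    def wall_in_direction(step: int) -> bool:
        i = start + step
        while 0 <= i < len(row):
            if row[i] == 1:
                return True  # bumped into a wall
            i += step
        return False  # reached the edge of the row without meeting a wall

    if not wall_in_direction(+1):  # first test: wall on the right?
        return 0
    if not wall_in_direction(-1):  # second test: wall on the left?
        return 0
    return 1  # both tests succeeded, so signal "contained"
```

For instance, `horizontally_contained([1, 0, 0, 1], 1)` succeeds on both tests, while a row with a wall on only one side fails one of them.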

We trained the next agent to become contained in environments where it has an object on each side of it: an intact container and a container with a hole in it. The animation shows it climbing into the object to the right and checking whether the object is an intact container. It finds a hole, so it climbs out and enters the container on the left, ending by signaling that it has become contained.

We can understand what the agent is doing by inspecting a trace of its actions:

The figure above shows each of the actions taken by the agent in the animation above it. Each square represents an action, with time increasing to the right. “DOWN”, “RIGHT”, “UP”, and “LEFT” are the agent’s basic actions, and each “SMC” row represents a particular sensorimotor contingency that the agent can choose to perform. An SMC can be thought of as a small program that, when executed, chooses a sequence of basic actions until it decides to stop and perform one of two signal actions that denote either success (“SIG1”, green) or failure (“SIG0”, red). Each of these SMCs originated as an agent that was trained to solve a simpler conceptual task. For example, “SMC 3” was trained to bring about being in a potential container when starting on the floor next to it on the left, and that is the first thing the agent in the animation does in time steps 0 to 11. In this way, the agent can perform complex tasks, such as bringing about containment, by executing a sequence of appropriate lower-level SMCs.
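To make the idea of an SMC as "a small program" more concrete, here is a minimal sketch in which each SMC is a function that issues basic actions and then emits SIG1 or SIG0, and a higher-level SMC is built by running lower-level ones in sequence. Everything here is hypothetical: the one-dimensional `RowWorld`, the `wall_on` and `all_of` helpers, and the fixed sequencing rule stand in for policies that, in the work described above, were learned rather than hand-coded.

```python
from dataclasses import dataclass, field
from typing import Callable, List

SIG0, SIG1 = 0, 1


@dataclass
class RowWorld:
    """Hypothetical 1-D stand-in for PixelWorld: 0 = empty, 1 = wall."""
    cells: List[int]
    pos: int
    trace: List[str] = field(default_factory=list)

    def move(self, name: str) -> bool:
        """Attempt a basic action, record it, and report whether the agent moved."""
        step = {"LEFT": -1, "RIGHT": 1}[name]
        self.trace.append(name)
        nxt = self.pos + step
        if 0 <= nxt < len(self.cells) and self.cells[nxt] == 0:
            self.pos = nxt
            return True
        return False


SMC = Callable[[RowWorld], int]


def wall_on(side: str) -> SMC:
    """Low-level SMC: walk toward `side` until blocked, then signal if a wall did the blocking."""
    step = {"LEFT": -1, "RIGHT": 1}[side]

    def run(world: RowWorld) -> int:
        while world.move(side):
            pass
        nxt = world.pos + step
        hit_wall = 0 <= nxt < len(world.cells) and world.cells[nxt] == 1
        world.trace.append("SIG1" if hit_wall else "SIG0")
        return SIG1 if hit_wall else SIG0

    return run


def all_of(*smcs: SMC) -> SMC:
    """Higher-level SMC: run sub-SMCs in order and fail on the first SIG0."""

    def run(world: RowWorld) -> int:
        for smc in smcs:
            if smc(world) == SIG0:
                return SIG0
        return SIG1

    return run


contained = all_of(wall_on("RIGHT"), wall_on("LEFT"))
```

Running `contained(RowWorld([1, 0, 0, 1], pos=1))` leaves a trace of basic actions interleaved with the sub-SMCs' signals, loosely analogous to the rows of an action trace.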

Our concepts extended beyond containment to include concepts such as being on the top of an object or being to the left of two objects:

Training these agents in a single environment would not have been enough, since a variety of environments are needed in order to tell which aspects of the environment are relevant to the concept and which aren’t. Having many different types of environments also allows us to characterize the types where an active approach and reuse of behaviors have benefits over a passive approach.

To address this need, we used a notation based on first-order logic to specify datasets of environments, using logical expressions to both generate environments and label them according to the concept that was present. We constructed 96 different environment datasets, organized into curricula from simple to complex concepts. Both the notation and the environments we defined with it are available in our PixelWorld release.
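The sketch below illustrates the general idea of using logical expressions both to enumerate environments and to label them. It does not reproduce the notation shipped with PixelWorld: the predicates `wall_left` and `wall_right`, the combinators `AND` and `NOT`, and the one-row environment encoding are all invented for this example.

```python
import itertools
from typing import Callable, Dict, Iterator, List, Tuple

Env = Dict[str, object]  # hypothetical: {"cells": [0/1, ...], "agent": index}
Pred = Callable[[Env], bool]


def wall_left(env: Env) -> bool:
    """True if some wall pixel lies strictly to the left of the agent."""
    return 1 in env["cells"][: env["agent"]]


def wall_right(env: Env) -> bool:
    """True if some wall pixel lies strictly to the right of the agent."""
    return 1 in env["cells"][env["agent"] + 1 :]


def AND(*preds: Pred) -> Pred:
    return lambda env: all(p(env) for p in preds)


def NOT(pred: Pred) -> Pred:
    return lambda env: not pred(env)


# A concept is a logical expression over predicates.
contained = AND(wall_left, wall_right)


def enumerate_envs(width: int) -> Iterator[Env]:
    """Yield every wall layout with the agent placed on each empty cell."""
    for cells in itertools.product((0, 1), repeat=width):
        for agent, cell in enumerate(cells):
            if cell == 0:
                yield {"cells": list(cells), "agent": agent}


def build_dataset(concept: Pred, width: int) -> List[Tuple[Env, bool]]:
    """Label every enumerated environment by whether the concept holds in it."""
    return [(env, concept(env)) for env in enumerate_envs(width)]
```

The same expression serves as both generator filter and labeler, so positive and negative examples of a concept come from one specification, and varying the irrelevant parts of the layout is what lets a learner separate them from the relevant ones.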

What We Found

We compared our active approach to a passive approach, using a CNN trained to detect whether a concept held, based on a static view of the entire environment. For concepts involving containment, the interactive approach clearly outperforms the CNN. For concepts involving distinguishing objects of differing shapes or spatial relations, we found that the CNN performed better in some cases and worse in others. Note that passive approaches, by definition, cannot interact with the environment, so our baseline applies only to detecting whether a concept holds. Only our active approach could succeed in environments that required a concept to be brought about.

We found that reuse of behaviors improved performance for both tasks (detecting a concept and bringing about that concept), with the improvement most pronounced for concepts that involved multiple objects or that required complex sequences of behaviors.


Our work shows that interactive sensorimotor conceptual representations can be formalized and learned. While the experimental setup and embodiment of our paper helped to highlight the role of interactions, combining this approach with a generative vision system would be required to learn concepts from the real world. Moreover, combining sensorimotor representations with techniques similar to schema networks would allow the agent to have an internal representation of the external world that it can use for simulation and planning.

Though AI agents that break containment are currently a subject best left to sci-fi movies, we believe that deriving concepts from sensorimotor interactions is one of the keys to escaping the confines of current passive AI techniques.


[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei. “ImageNet: A large-scale hierarchical image database.” Computer Vision and Pattern Recognition, 2009.

[2] Henri Poincaré. “Science and Hypothesis.” 1905.

[3] J. Kevin O’Regan and Alva Noë. “A sensorimotor account of vision and visual consciousness.” Behavioral and Brain Sciences, 2001.

[4] Alva Noë. “Concept pluralism, direct perception, and the fragility of presence.” Open MIND, 2014.