Posted January 2019
The instructions for IKEA furniture are famously simple. There are no words, no crash courses in Allen wrenches, just a little cartoon figure slowly working its way through the same problem you are. And though there might be more swearing and sweating on your end, by the time you reach the last diagram in the series, you and that unflappable little cartoon have the same piece of furniture sitting in front of you.
Obviously a robot could never complete such a task without significant hard coding and engineering efforts. Let’s consider a simpler, but similar problem. Imagine you have a group of objects in a bin, some of which are red, and some of which are green. You want the robot to pick up those objects and arrange them along the edges of the bin as shown in the figure below. How do you convey this task to the robot? The number of objects of different colors, their shapes, and their placements in the bin vary from one bin to another, and you still want the robot to be able to generalize to those settings.
Almost any individual, even young children, could solve this problem without words: you would just need to show an image of the initial state (objects scattered in the bin) and an image of the final state (objects arranged along the edges of the bin) to convey the idea. We can understand diagrams like the ones shown in Figure 1A because we can infer the concept that is represented by the relationship between these image pairs. For harder problems, an individual might need two or three examples for the concept to become clear, but unlike many machine learning algorithms, you’d never need thousands.
A concept is an abstraction of everyday experience. Going upstairs is a concept, but so is sorting apples from oranges or tidying a room. Concepts are ways to divide up and make sense of the world, and they are often the easiest way to convey a desired outcome. While one could certainly describe the relationship between the above pair of pictures by simply noting how the pixels changed from one picture to the other, that description would not capture the concept that is being conveyed, and because of that, it would not generalize to new settings.
A better way to convey the desired outcome between these two images would be to say ‘stack green objects along the bottom, and red objects along the right’. But language is not necessary, and sometimes not even sufficient, for representing many concepts. People can understand concepts demonstrated in image pairs intuitively, no matter their cultural or educational background. This is why assembly instructions are so often given as a sequence of pictures. Why waste time translating your instructions into a dozen languages, when the wonders of the human brain allow your instructions to be conveyed just as easily, and much more universally, with pictures?
Learning concepts, deducing them from image pairs, and then applying them to new situations has been the domain of humans and something robots can only do in science fiction. Even just a rudimentary step towards this skill would dramatically increase the types of tasks that robots could perform.
How do humans acquire, represent, and infer concepts? Philosophers, cognitive scientists, and neuroscientists have pondered this question for decades. An influential idea from cognitive science is called ‘image schemas’, where concepts are schematic imaginations involving objects and actions. The idea is simple in principle: you understand a concept by being able to imagine it, but the imagination is ‘schematic’ because you leave out irrelevant details. Another influential and related cognitive science idea is Barsalou’s ‘Perceptual Symbol Systems’, where concepts are simulations (or programs) on a visual perception system that has the ability to represent objects and their parts, and the ability to steer attention to different parts as required. Ullman’s seminal work on ‘visual routines’ suggested that many visual concepts require carrying out a set of elementary visual operations in a sequence, just like on a computer.
Neuroscientists have found evidence for some of these ideas. For example, experiments have shown that monkeys use their visual cortex as a temporary buffer for imagining sequences of elementary operations, such as the mental manipulation of objects, in order to accomplish a complex task.
While several ideas have existed about the nature of concepts, influential ideas like image schemas and perceptual symbol systems have largely remained descriptive theories [6, 7]. In this project, we sought to formalize these ideas and bring them into the fold of machine learning. One important observation that guided our approach is that while visuo-spatial concepts are easy for humans, other types of concepts can prove to be much more difficult. For instance, predicting the next number in a sequence like 1, 1, 1, 1, 2, 3, 6 is difficult without years of training (answer here). From this we concluded that the concepts that are easy for humans, and that form the basis of common sense, must be a small subset of all possible concepts.
We further hypothesized that human concepts are programs on a very special computer architecture: the architectural constraints and inductive biases of this computer explain why some concepts are easy to grasp while others are not. In this view, building machines that can learn concepts like humans will need to confront the problem of the architecture of this computer, and then learn concepts as programs using an appropriate curriculum.
We developed a visual cognitive computer (VCC) architecture that formalizes the cognitive and neuroscience insights about concepts. We also created a dataset of 546 different concepts in the tabletop world, which correspond to arrangements of objects on a table. A subset of these concepts is shown in Figure 2. A concept is conveyed to the system as a set of image pairs like the one shown in Figure 1A. We developed induction methods that learn programs corresponding to the concept represented in the image pairs.
People utilize their prior knowledge to learn new concepts, and our program induction method works in a similar manner. We use a model to predict which programs are likely given the images for a particular concept, and then update this model as more programs are learned. As the system learns more concepts, the updated model guides the search for even more complex concepts. Our method learned 535 out of the 546 concepts with a search budget of 3 million programs.
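To make the idea of model-guided program search concrete, here is a minimal sketch in Python. It is an illustration under strong assumptions, not the induction method used for the VCC: the instruction set (`sort`, `reverse`, `double`, and so on) is a hypothetical toy that operates on tuples of integers rather than images, the "model" is just a smoothed count of how often each instruction appeared in previously learned programs, and the function names `induce` and `learn_curriculum` are invented for this example. The search examines the most probable programs first and stops when the budget runs out.

```python
import heapq
import itertools
import math
from collections import Counter

# Hypothetical toy instruction set (NOT the VCC's): each instruction maps a
# tuple of integers to a new tuple of integers.
INSTRUCTIONS = {
    "drop_first": lambda s: s[1:],
    "negate": lambda s: tuple(-x for x in s),
    "increment": lambda s: tuple(x + 1 for x in s),
    "sort": lambda s: tuple(sorted(s)),
    "reverse": lambda s: tuple(reversed(s)),
    "double": lambda s: tuple(2 * x for x in s),
}


def run(program, state):
    """Apply each named instruction of the program in order."""
    for name in program:
        state = INSTRUCTIONS[name](state)
    return state


def explains(program, examples):
    """True if the program maps every 'before' state to its 'after' state."""
    return all(run(program, before) == after for before, after in examples)


def induce(examples, counts, budget=1000, max_len=4):
    """Best-first search: cheapest (most probable) programs are tried first.

    `counts` is a Counter of instruction usage in earlier learned programs.
    Returns (program or None, number of programs examined).
    """
    total = sum(counts.values()) + len(INSTRUCTIONS)
    # Add-one smoothing so unseen instructions keep a nonzero probability.
    cost = {n: -math.log((counts[n] + 1) / total) for n in INSTRUCTIONS}
    tie = itertools.count()  # tie-breaker so the heap never compares programs
    frontier = [(0.0, next(tie), ())]
    tried = 0
    while frontier and tried < budget:
        c, _, program = heapq.heappop(frontier)
        tried += 1
        if explains(program, examples):
            return program, tried
        if len(program) < max_len:
            for name in INSTRUCTIONS:
                heapq.heappush(frontier, (c + cost[name], next(tie), program + (name,)))
    return None, tried


def learn_curriculum(concepts, budget=1000):
    """Learn concepts in order, updating the instruction counts after each."""
    counts, learned = Counter(), {}
    for name, examples in concepts:
        program, tried = induce(examples, counts, budget)
        if program is not None:
            counts.update(program)  # the "model update" step
            learned[name] = (program, tried)
    return learned, counts


# Each concept is a list of (before, after) pairs, easiest concepts first.
CONCEPTS = [
    ("descending", [((3, 1, 2), (3, 2, 1)), ((5, 4, 9), (9, 5, 4))]),
    ("doubled", [((1, 2), (2, 4)), ((3,), (6,))]),
    ("descending_doubled", [((3, 1, 2), (6, 4, 2)), ((5, 4, 9), (18, 10, 8))]),
]
learned, counts = learn_curriculum(CONCEPTS)
# Same final concept, searched without any prior knowledge, for comparison.
_, tried_uniform = induce(CONCEPTS[2][1], Counter())
```

In this toy setting, the last concept is found after examining fewer programs when the counts from the two easier concepts are available than when the search starts from a uniform model, which mirrors the role of a curriculum. A real system would condition the model on the input images as well, which this sketch does not attempt.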
The learned programs utilized the embodiment of the VCC in interesting ways. Figure 3 shows a program that utilized the gaze fixation mechanism as a working memory to remember the previous location of an object that was moved away. Some concepts require imagining objects that are not in the scene. See the paper for more examples.
When a robot is equipped with a VCC, tasks can be conveyed to it through image pairs. We show that a learned concept can be transferred to dramatically different situations, and even to different robot embodiments. This is very different from typical imitation learning setups where robots rote-learn from demonstrations, limiting their generalization. The following videos show how the robot can stack objects, arrange them in a circle, and sort limes from lemons in various settings, after inferring the corresponding concept from image pairs.
The Visual Cognitive Computer (VCC) is a cognitive architecture composed of a generative visual hierarchy (VH), an object-centric dynamics model, embodied and deictic representations, and joint attention mechanisms.
Generative Visual Hierarchy: The visual hierarchy (VH) can parse input scenes containing multiple objects and imagine objects, similar to the generative model we developed. The VH uses top-down object-based attention to segment out objects from the background and to bind object attributes to the object identity.
Object-centric dynamics model: The dynamics model, combined with the VH, lets the VCC predict the effects of imagined movements (e.g. collisions during movement), and write those results into an imagination blackboard.
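As a rough illustration of predicting the effect of an imagined movement, the Python sketch below slides an object across a small grid until it would collide with a wall or another object, and writes the result to a separate copy of the scene that plays the role of the blackboard. The grid world, the `imagine_move` function, and the dictionary-based scene format are all assumptions made for this example; the actual dynamics model works with the output of the visual hierarchy rather than with symbolic coordinates.

```python
# Unit steps for each direction, as (row delta, column delta).
DIRECTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}


def imagine_move(scene, name, direction, size):
    """Predict where `name` ends up if pushed in `direction` on a size x size grid.

    `scene` maps object names to (row, col) cells. The real scene is left
    untouched; the prediction is written to a new dict (the "blackboard").
    """
    d_row, d_col = DIRECTIONS[direction]
    blackboard = dict(scene)  # imagination happens on a copy
    row, col = blackboard[name]
    occupied = {pos for other, pos in scene.items() if other != name}
    while True:
        nxt = (row + d_row, col + d_col)
        inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
        if not inside or nxt in occupied:
            break  # blocked by a wall or by another object
        row, col = nxt
    blackboard[name] = (row, col)
    return blackboard


scene = {"red": (0, 0), "green": (0, 3)}
imagined = imagine_move(scene, "red", "right", size=5)
```

Because the prediction lives on the copy, a program can inspect the imagined outcome (for example, to check whether two objects would end up adjacent) before committing to any action in the world.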
Embodied and deictic representation: VCC uses orienting movements (fixations, attention changes, and pointing) to bind objects in the world to cognitive programs.
Joint attention using pointing: VCC utilizes a pointing mechanism to establish shared attention between a teacher and the learner, as a first step in instantiating the cultural aspect of concept learning.
A set of elementary operations for parsing a scene, moving objects, and directing gaze and attention is predefined and forms the instruction set of the computer. The different components of VCC interact through working memories that are local and structured. Local, controller-specific working memories make it easier to learn programs compared to architectures that use global random access memories.
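The following Python sketch shows what a small instruction set with controller-local working memories could look like. It is a toy under stated assumptions: the instruction names, the `Machine` class, and the symbolic scene format are hypothetical and are not taken from the VCC implementation. Each controller (attention, gaze, hand) owns a single memory slot, and instructions take no arguments; they can only read and write those slots. The example program uses the gaze fixation as a working memory to remember where an object used to be, in the spirit of the program described for Figure 3.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Machine:
    scene: dict                                   # object name -> (row, col)
    parsed: list = field(default_factory=list)    # vision: objects still to visit
    attended: Optional[str] = None                # attention controller memory
    fixation: Optional[tuple] = None              # gaze controller memory
    hand: Optional[tuple] = None                  # hand controller: position
    held: Optional[str] = None                    # hand controller: grasped object


# Hypothetical elementary operations; each touches only local memories.
def scene_parse(m):
    m.parsed = sorted(m.scene)

def next_object(m):
    m.attended = m.parsed.pop(0) if m.parsed else None

def fixate_attended(m):
    m.fixation = m.scene[m.attended]

def grab_attended(m):
    m.held, m.hand = m.attended, m.scene[m.attended]

def move_hand_to_corner(m):
    m.hand = (0, 0)

def move_hand_to_fixation(m):
    m.hand = m.fixation

def release(m):
    m.scene[m.held], m.held = m.hand, None


INSTRUCTION_SET = {f.__name__: f for f in (
    scene_parse, next_object, fixate_attended, grab_attended,
    move_hand_to_corner, move_hand_to_fixation, release)}


def run(program, scene):
    """Execute a program (a list of instruction names) on a copy of the scene."""
    machine = Machine(scene=dict(scene))
    for name in program:
        INSTRUCTION_SET[name](machine)
    return machine.scene


# "Move the first object to the corner, then put the second where the first was."
PROGRAM = [
    "scene_parse", "next_object",
    "fixate_attended",            # gaze now remembers the first object's place
    "grab_attended", "move_hand_to_corner", "release",
    "next_object", "grab_attended", "move_hand_to_fixation", "release",
]
result = run(PROGRAM, {"a": (2, 2), "b": (4, 1)})
```

Since programs are flat sequences of argument-free instructions, the space of candidate programs stays small and easy to search, which is one way to read the claim that local working memories make programs easier to learn than they would be with a global random access memory.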
Our work is relevant for many of the debates surrounding artificial intelligence, and we discuss some of them here.
Perceptual symbol systems, not amodal symbol manipulation: An important debate in AI is about the integration of symbolic systems with perception systems. While deep learning has led to dramatic improvements in pattern recognition, these systems lack the compositionality and reasoning abilities that humans demonstrate, shortcomings that are becoming more and more apparent with current research. Amodal symbol systems like those in good-old-fashioned AI have compositionality, but they are brittle, and it is unclear how those symbols can be grounded. Our work follows the solution path suggested by Barsalou, where symbols are attached to simulations in a perceptual system that is generative, componential, and amenable to top-down attention.
Interactions with the environment, not ‘downstream tasks’: A dogma in the current machine learning literature is that of creating ‘disentangled representations’ for further ‘downstream tasks’. Many unsupervised learning systems, like the varieties of variational autoencoders, seek to form ‘disentangled’ encodings of the input, which will then presumably be processed further downstream. We believe that the encoding view ignores the fundamentally interactive nature of perception and cognition. Rather than encoding everything about the world in an embedding space, our cognitive architecture takes the view that the details of the world can be accessed on demand, and that this pattern of access itself becomes part of how humans represent concepts.
Top-down attention and the binding problem: A cognitive architecture that uses parallel distributed representations, as VCC does, will have to confront the ‘binding problem’ where the attributes of different entities in a scene need to be assigned the correct ownership. In VCC, binding is achieved through the ability of the visual perception system (a Recursive Cortical Network) to control object-based top-down attention.
VCC offers a sketch of how different components (perception, object dynamics, actions, imagination, working memory, and program induction) need to come together in a cognitive architecture to learn concepts. The overall synthesis imposes important functional requirements on the components. For example, the visual perception system needs to be object- and part-based, should factorize shape and appearance, and should have the ability to steer top-down attention. We briefly discuss how our earlier work, which forms components of the VCC, met these functional requirements.
Recursive Cortical Networks (RCN) for generative vision: The representational choices of RCN were guided by the requirement for it to be part of a concept learning system. RCN is explicitly object-based and compositional: rather than reconstructing a whole scene in a holistic manner, it models object shapes, object appearances, and image backgrounds using different mechanisms and to varying fidelity. RCN also supports top-down object-based attention, another important requirement for concept learning.
Schema Networks for object dynamics: Schema Networks used an object-based model for predicting the interactions between objects, which allowed them to generalize to settings they had never seen before. VCC uses the combination of schema-like modeling and RCN to predict the effects of actions.
Creating abstractions from behavior libraries: Our previous work on sensorimotor-contingency based representations showed how concepts can be represented by combinations of behaviors in a reinforcement learning framework. VCC enhances the setting by incorporating a powerful perception system and working memories.
We are excited about the research directions opened up by this project. One avenue of work is to generalize beyond the tabletop world. This would require a more advanced perception system that handles 3D objects, and a more capable dynamics model that can handle 3D object dynamics and gravity. Allowing recursive function calls in VCC would increase the richness of concepts that can be represented. The concepts learned in the current incarnation of VCC can be thought of as pre-verbal. An exciting future direction is to connect VCC to language to build robots with grounded language understanding. Even though a picture is worth a thousand words, language currently doesn’t have much currency in the robotics world. We can’t wait to change that, and while we promise that a cursing, furniture-building robot is not in our plans, we are committed to building the sort of common sense necessary to make robots more common.
Thanks to the many people who contributed to this paper and blog post: Dileep George, Swaroop Guntupalli, Miguel Lázaro Gredilla, Dianhuan Lin, Nick Hay, and Carter Wendelken.
Cached emulator data for our datasets [required]
Training data and RCN parses [for reference only]
Inferred concepts [for reference only]
Generalization data and RCN parses [for reference only]