Toward Learning a Compositional Visual Representation

Direct link to CVPR paper here.


Humans understand the world as a sum of its parts. Shapes are composed of other shapes, objects can be broken down into pieces, and this sentence is nothing more than a series of words. When presented with a new shape, object, or sentence, people can decompose the novelty into familiar parts. Our knowledge representation is naturally compositional.

At Vicarious, we believe that compositionality is one of the key design principles for artificial intelligence. For example, we believe that a vision system’s representation of a coffee pot should be very similar regardless of whether it stands alone on a countertop, or inside of the appliance in the figure above. Combinations of familiar objects, like a coffee pot, toaster, and griddle, should be immediately recognized as the sum of their parts.

Compositionality has a long history in computer vision. Song-Chun Zhu and Alan Yuille, among others, have pioneered statistical modeling approaches to build hierarchical feature representations for vision [1-5]. More recent work by Lake et al. [6] demonstrates that hand-written symbols can be learned from only a few examples using a compositional representation of the strokes.

Recent progress in computer vision is due in large part to advances in deep learning, and in particular, Convolutional Neural Networks (CNNs). In our recent CVPR paper [7], we ask: are the visual representations learned by CNNs naturally compositional? And if not, can we “teach” them to be?

Are CNN activations compositional?

Convolutional Neural Networks (CNNs) have become the de facto approach for extracting visual representations for various tasks, such as image segmentation [8] and object detection [9]. While a trained CNN may excel within the confines of its training regime, it is not obvious whether its learned representations can be reused compositionally. For instance, could the activations extracted from a CNN trained to recognize coffee machines and ovens be used to recognize the combined coffee pot and oven above, or would this recombination confuse the network? Has the CNN learned to disentangle the natural parts that combine to make full images?

A CNN trained for classification, like the popular VGG network [10], takes an image as input and outputs beliefs that the image contains various common object categories, such as coffee mugs or airliners. Within the CNN, before the layer which outputs the final beliefs of what objects are in the image, there are many layers of activations that may be understood as abstract feature representations of the input image. These activations are the visual representations that we believe should exhibit some compositionality if they are to be more generally useful for AI.

Consider an image of an airliner and a coffee mug like the one below. A compositional vision system should represent this image in such a way that the representation of the airliner can be teased apart from the representation of the mug. CNNs maintain spatial information in their intermediate layers of activations. To probe the compositionality of the CNN’s representation, we can therefore apply a “mask” to the activations, zeroing out all activations outside the spatial location of the airliner, and compare these masked activations to the activations produced by an input image in which we erased the mug and show the airliner in isolation. In other words, if the CNN has compositional activations, we would expect the CNN to have the same activations in the region of the airliner regardless of what objects (in this case a coffee mug) appear adjacent to it.
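This probe can be sketched in a few lines of numpy. The shapes, the random activations, and the mask placement below are all hypothetical stand-ins, not values from the paper; only the comparison itself follows the procedure described above.

```python
import numpy as np

# Hypothetical activations: 64 channels over a 14 x 14 spatial grid.
# feats_scene: activations for the full scene (airliner + mug);
# feats_iso:   activations for the airliner shown in isolation.
rng = np.random.default_rng(0)
feats_scene = rng.normal(size=(64, 14, 14))
feats_iso = rng.normal(size=(64, 14, 14))

# Binary mask marking the airliner's spatial support, projected
# down to the feature map's resolution (placement is made up here).
mask = np.zeros((14, 14))
mask[3:10, 2:9] = 1.0

# Zero out activations outside the object region in both maps,
# then measure how much they disagree inside the mask.
diff = (feats_scene - feats_iso) * mask  # mask broadcasts over channels
score = np.sum(diff ** 2)

# For a perfectly compositional representation, score would be ~0:
# the airliner's activations would not depend on the adjacent mug.
```

For unrelated random activations, as here, the score is large; the experiment in the paper asks how large it is for a real, trained CNN.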

So how compositional are the activations of a CNN like VGG? In the animation below, we visualize a coffee mug moving closer to an airliner (left). The middle animated figure shows the activation map of the CNN at a high convolutional layer. The right animated figure shows, within the airliner’s mask, the difference between these activations and the activations of the airliner in isolation. If the activations were compositional, moving the coffee mug closer to the airliner should not shift activations in the region of the airliner; we would expect the right view to remain black. Unfortunately, this is far from the case. Convolutional neural networks are not inherently compositional.

Teaching compositionality to CNNs

Here we describe our approach to teaching CNNs compositional representations. In our CVPR ’17 paper [7], we focus on teaching a CNN to be compositional at the level of objects (e.g., airliners, coffee mugs) rather than at the level of object parts, although the same idea could be applied to achieve finer-grained compositionality.

In the airplane and coffee mug example above, we described two feature maps: one obtained from masking the input, and another derived from applying a mask in the feature space. We noted that these two feature maps should be approximately equal if the representation is to be called compositional. In other words, the sum of squared differences between the two feature maps should be nearly zero. Our central insight is that this formulation of compositionality gives rise to a loss term based on this sum of squared differences between the feature maps. We therefore add a term to the training objective which penalizes the CNN if activations in response to the masked image (i.e., object in isolation) differ from the masked activations (i.e., the activations with a mask applied).

A simplified diagram of our approach is shown below. We make two copies of the CNN that share weights. The original input image is fed to one CNN (red) and the masked input image is fed to the other (blue). The compositionality cost can be computed from the two feature maps and the projected mask. We also compute the classification loss for both copies of the CNN. The CNN that sees the masked image should correctly classify the single object in its input, whereas the CNN that sees the full image should correctly classify all objects in its input. The network is then trained to minimize the sum of all cost terms. Note that this compositional training scheme is agnostic to the original training objective; for instance, the compositional penalty could be applied to segmentation tasks as well as classification tasks.

The total loss for our CNN is the sum of the classification cost for all objects, the classification cost for the masked objects, and the compositionality cost.
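As a toy sketch of how these three terms combine, the snippet below uses random stand-ins for the two branches' outputs. The shapes, the multi-label cross-entropy cost, and the weighting `lam` are all assumptions for illustration; the paper's exact loss form and hyperparameters may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the two weight-sharing branches: feature maps from
# the full image and from the masked (single-object) image, plus a
# projected binary mask over the object's spatial support.
feats_full = rng.normal(size=(64, 14, 14))
feats_masked = rng.normal(size=(64, 14, 14))
mask = np.zeros((14, 14))
mask[3:10, 2:9] = 1.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_cost(logits, labels):
    # One independent sigmoid per category (hypothetical choice):
    # penalizes missing a present object or hallucinating an absent one.
    p = sigmoid(logits)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Class scores from each branch over 20 categories (random here).
logits_full = rng.normal(size=20)
logits_masked = rng.normal(size=20)
labels_full = np.zeros(20); labels_full[[0, 5]] = 1.0   # all objects in scene
labels_masked = np.zeros(20); labels_masked[0] = 1.0    # the single masked object

# Compositionality cost: inside the mask, the full-image activations
# should match the masked-image activations.
comp_cost = np.sum(((feats_full - feats_masked) * mask) ** 2)

lam = 0.1  # hypothetical weighting of the compositionality term
total_loss = (multilabel_cost(logits_full, labels_full)
              + multilabel_cost(logits_masked, labels_masked)
              + lam * comp_cost)
```

In practice all three terms would be minimized jointly by backpropagation through the two weight-sharing branches.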

We showed above that a vanilla CNN’s features are not inherently compositional. But are CNNs trained with our novel cost function indeed compositional? One way to answer this question is by backtracing a classification decision to a heat-map in the input image through a visualization technique known as guided backpropagation [11]. In the following figure we show some example “backtraces”: each column shows a backtrace of the class label (bottom) for an input image using networks trained with different approaches. Intuitively, the heat-map shows what in the input image was most important for the classification decision that the CNN made. Our method leads to heat-maps that are better concentrated inside the object region corresponding to the class label, whereas CNNs trained without the compositional penalty (baseline-aug and VGG) tend to use a lot of object context and seemingly irrelevant features of the input to inform their beliefs about what object classes are in the input.
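For readers unfamiliar with guided backpropagation [11]: it differs from standard backpropagation only in how gradients pass through ReLU layers. A minimal numpy sketch of that rule, with made-up example values:

```python
import numpy as np

def guided_relu_backward(grad_out, relu_in):
    """Guided-backprop rule at a ReLU layer: propagate a gradient only
    where the forward input was positive (standard ReLU backprop) AND
    the incoming gradient itself is positive (the 'guided' part, which
    keeps only evidence that increases the class score)."""
    return grad_out * (relu_in > 0) * (grad_out > 0)

# Example values (hypothetical): gradients survive only where both
# the forward activation and the incoming gradient are positive.
relu_in = np.array([-1.0, 2.0, 3.0, 0.5])
grad_out = np.array([0.7, -0.2, 0.4, 0.9])
grad_in = guided_relu_backward(grad_out, relu_in)
# grad_in == [0.0, 0.0, 0.4, 0.9]
```

Applying this rule at every ReLU while backpropagating from a single class score down to the pixels yields the heat-maps shown in the figure.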

The above backtraces clearly indicate that our compositional CNN responds less to object context than the baselines. While compositionality is a desirable property in and of itself, we wondered whether a CNN trained with our compositionality cost would actually perform better at classification than the “vanilla CNN” trained with only the classification cost, especially since our compositional CNN mostly disregards object context. We tested our CNN’s classification performance on a subset of MS-COCO [12]. We took the first 20 categories and ignored objects that are too small, focusing instead on medium and large object sizes where context is less important. The following plot (left) shows mAP (mean average precision over categories) of our method – “comp-full” – versus several baselines. The bar plot (right) shows AP for each category. We see significant and consistent improvement due to the compositionality cost across categories. We hypothesize that for medium and large sized object instances, training with compositionality induces a strong inductive bias that greatly helps performance.

This work demonstrates the promise of teaching compositionality to CNNs. However, CNNs still have much to learn. We look forward to continuing our research in learning compositional visual representations in CNNs and beyond. If you would like to contribute to these efforts, join us!


  1. Song-Chun Zhu and David Mumford. “A stochastic grammar of images.” Foundations and Trends® in Computer Graphics and Vision 2.4 (2007): 259-362.
  2. Zhangzhang Si and Song-Chun Zhu. “Learning and-or templates for object recognition and detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35.9 (2013): 2189-2205.
  3. Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. “Image parsing: Unifying segmentation, detection, and recognition.” International Journal of Computer Vision 63.2 (2005): 113-140.
  4. Long Zhu and Alan L. Yuille. “A hierarchical compositional system for rapid object detection.” Advances in Neural Information Processing Systems. 2006.
  5. Iasonas Kokkinos and Alan L. Yuille. “Inference and learning with hierarchical shape models.” International Journal of Computer Vision 93.2 (2011): 201-225.
  6. Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. “Human-level concept learning through probabilistic program induction.” Science 350.6266 (2015): 1332-1338.
  7. Austin Stone, Huayan Wang, Michael Stark, Yi Liu, D. Scott Phoenix, and Dileep George. “Teaching compositionality to CNNs.” Conference on Computer Vision and Pattern Recognition. 2017.
  8. Pedro O. Pinheiro, Ronan Collobert, and Piotr Dollár. “Learning to segment object candidates.” Advances in Neural Information Processing Systems. 2015.
  9. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster R-CNN: Towards real-time object detection with region proposal networks.” Advances in Neural Information Processing Systems. 2015.
  10. Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
  11. Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. “Striving for simplicity: The all convolutional net.” International Conference on Learning Representations-WS. 2015.
  12. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. “Microsoft COCO: Common objects in context.” European Conference on Computer Vision. 2014.