Generative Models for the Brain

Studying recognition with energy-based models


The goal of our project is to study how brains solve the task of recognition. Our hypothesis is that the brain recognizes an object by imagining identity-preserving transformations of it while trying to match it to another, already recognized object. We implemented this mechanism computationally by training a neural network that encodes input data into a latent space, together with an energy-based model that determines whether two latent representations share the same identity. To test our hypothesis, we used the energy-based model to guide the process of “imagining”, while keeping the transformations as identity-preserving as possible. Our hypothesis is supported if the proposed method generalizes better than a direct prediction based on the model’s output.


Artificial intelligence is transforming industries across a wide spectrum of areas, from autonomous driving to voice-powered personal assistants. However, these successes, powered by data science and deep learning, may be insufficient for AI to match human abilities in the long run, due to current limitations of the technology [7]. In this project we study practical aspects of imagination, a distinguishing characteristic of counterfactual and causal reasoning, in the context of recognition.

Recognition is a vital part of human learning. Retrieving a concept or an object from memory lets us model the implications of its presence in the environment more accurately than if it were unknown. The process also reinforces our memory of the given concept. To bring this capability to technology, it is important to teach machines to recognize the same object in various settings.

One task that humans solve easily but that still poses problems for machines is recognizing faces. Poor generalization is a major obstacle for current systems. For example, while it is quite easy for a human to recognize an acquaintance regardless of their skin tone, this remains hard for technology: software is sometimes incapable of recognizing people with very light or very dark skin. Lacking an alternative, however, governments resort to using this type of technology anyway.

The process of recognition has been previously studied through the task of mental rotation, introduced by Shepard and Metzler in 1971. In their experiment they presented pairs of drawings of three-dimensional, asymmetrical assemblages of cubes. The goal was to determine, as quickly as possible, whether one drawing was a rotated image of the other, or a mirror image of it. Their findings demonstrated that the time it took for recognition was strongly correlated with the amount of transformation (such as rotation) needed to get from one image to the other [5]. This research informs our main hypothesis that the process of recognition in the brain could be implemented as “imagining” identity-preserving transformations on one of the objects of recognition, while trying to match it to the other object.


Traditional approaches to classification using discriminative methods generally require that all categories are known in advance and that training examples are supplied for each of them. In the case of face recognition, the number of categories is very large, the number of samples per category is small, and only a subset of the categories is known at training time. A common approach to this problem is to use non-discriminative (generative) probabilistic methods in a reduced-dimension space, where the model for one category can be trained without using examples from other categories. To apply discriminative learning techniques to this kind of application, we must devise a method that extracts information about the problem from the available data, without requiring specific information about the categories. The solution presented previously [3] is to learn a similarity metric from data.

As an example, generative adversarial networks (GANs) can be used to train an image generator, which attempts to generate a realistic image for any random input. Studies have shown [4] that by modifying training images it is possible to discover corresponding transformations in the input space of the generator (the latent space). These latent transformations produce the same effect on the generated image as the original modification produced on the training images. Hypothetically, if such modifications of training images could be automated, making it possible to find various latent-space transformations while distinguishing identity-preserving from non-preserving ones, then the problem of identity recognition could be solved by finding a path consisting of only identity-preserving transformations between the latent representations of the two images.
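Such a latent-space edit amounts to a simple vector operation. The sketch below is illustrative only; the variable names, sizes, and the edit direction are hypothetical, not taken from [4]:

```python
import numpy as np

# Hypothetical sketch of "steering" in a GAN's latent space [4]: adding a
# learned direction w to a latent code z reproduces, in the generated image,
# the effect of an image-space edit (e.g. a small shift or zoom).
rng = np.random.default_rng(1)
z = rng.normal(size=128)   # latent code of some generated image
w = rng.normal(size=128)   # learned transformation direction
alpha = 0.5                # edit strength

z_edited = z + alpha * w   # generator(z_edited) would show the edited image
print(z_edited.shape)      # → (128,)
```

If the direction `w` is identity-preserving, `z_edited` should decode to a different image of the same person.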

Another type of models, energy-based models (EBMs) capture dependencies by associating a scalar energy (a measure of compatibility) to each configuration of the variables. To make a prediction, one provides the values of observed variables and finds values of the remaining variables that minimize the energy. Learning consists of finding an energy function that associates low energies to correct values of the remaining variables, and higher energies to incorrect values. A loss functional, minimized during learning, is used to measure the quality of the energy functions. Energy-based learning can be seen as an alternative to probabilistic estimation for prediction, classification or decision-making tasks [2].
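As a toy illustration (not the project's model), inference in an EBM amounts to minimizing the energy over the unobserved variable; the quadratic compatibility function here is a made-up stand-in:

```python
# Toy energy-based prediction: the energy measures compatibility between an
# observed value x and a candidate y; inference picks the lowest-energy y.
def energy(x, y):
    # hypothetical quadratic compatibility function
    return (x - y) ** 2

def predict(x, candidates):
    # minimize the energy over the remaining (unobserved) variable
    return min(candidates, key=lambda y: energy(x, y))

print(predict(3.2, [0, 1, 2, 3, 4]))  # → 3
```

Learning would then shape `energy` so that correct candidates sit in low-energy regions, as described above.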

While probabilistic models assign a normalized probability to every possible configuration of the variables being modeled, energy-based models assign an unnormalized energy to those configurations. A trainable similarity metric can be seen as associating an energy to pairs of input patterns.

In the context of our project, EBMs could be trained either directly to detect pairs of images that contain the same object, which requires more data, or to encode the input image into a representation in which it is easier to discover a sufficiently large set of identity-preserving transformations, which may succeed with less labelled data. In the case of face recognition, the label Y comes from a discrete and finite set, but one of high cardinality: it may contain tens of thousands of identities.

In this work we used the CelebA dataset [6]. It consists of 202,599 portraits of celebrities, each annotated with identity and feature information. There are 10,177 different identities in the dataset, an average of 19.9 photographs per person. The dataset is therefore particularly well suited for the task at hand, as it allows us to train an energy-based model for identity recognition.


As mentioned in the introduction, we probe whether recognition can be implemented by imagining directed identity-preserving transformations of the object to be recognized. Computationally, something akin to imagination, such as mental rotation, can be achieved by traversing the latent space of a neural network model, provided the encoding layers that map data into the latent space and the decoding layers that map it back out have been trained accordingly.

To conduct an experiment in latent-space traversal, we constructed an energy-based model with the following architecture. A pair of images serves as input to two parallel convolutional neural network (CNN) stacks. These CNN models A and B had been pre-trained on the CelebA dataset to detect certain features in the photographs of celebrities [8]. Using transfer learning, we removed their top layers and added a fully connected layer to adjust the composite model to our needs. The weights of models A and B are shared: both branches reuse the same weights, so a change in one is immediately reflected in the other.
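A minimal sketch of this two-branch setup, with a single hypothetical linear encoder standing in for the pre-trained CNN stacks:

```python
import numpy as np

# Minimal sketch of the two-branch model: one weight matrix W stands in for
# the pre-trained CNN stacks, and both branches call the same encode(), so
# the weights are shared by construction (changing W changes both branches).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))  # hypothetical encoder weights

def encode(x):
    return np.tanh(W @ x)     # input -> feature vector in latent space Z

def energy(x_a, x_b):
    # stand-in for the fully connected head: squared latent distance
    z_a, z_b = encode(x_a), encode(x_b)
    return float(np.sum((z_a - z_b) ** 2))

x = rng.normal(size=16)
print(energy(x, x))  # identical inputs → 0.0
```

In the real model the fully connected head is trained rather than a fixed distance, but the weight sharing works the same way.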

The model is trained on pairs of images from the image space X with the aim of learning an energy function E. A pair of photographs of the same person should yield a low energy value, while a pair of photographs of different people from the dataset should yield a high energy value. The output of each separate CNN model (A and B) is a feature vector in the latent space Z. These intermediate results are combined in a fully connected layer to produce the final output of the model: the energy value E.
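One standard objective for this kind of pairwise training is a contrastive loss in the spirit of [3]; the exact form and margin value below are illustrative choices, not the project's code:

```python
def contrastive_loss(e, same_identity, margin=1.0):
    # pull the energy toward zero for matching pairs; push it up toward the
    # margin for mismatched pairs (no penalty once e exceeds the margin)
    if same_identity:
        return e ** 2
    return max(0.0, margin - e) ** 2

print(contrastive_loss(0.5, True))    # matching pair, e too high  → 0.25
print(contrastive_loss(1.5, False))   # well-separated mismatch    → 0.0
```

Minimizing this over many labelled pairs shapes E exactly as described: low for same-identity pairs, high otherwise.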

Here we solve the problem of recognition by traversing the latent space between the two vectors in Z in a series of n steps. At every step, the energy value E is computed and compared against a predefined threshold. When the output energy does not exceed the threshold, we conclude that the result of the vector manipulation in latent space corresponds to the model recognizing both latent values as the same person.
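The walk can be sketched as follows, with a hypothetical squared-distance energy standing in for the trained model, linear interpolation standing in for the identity-preserving transformations, and the step energies averaged along the path:

```python
import numpy as np

def walk_recognize(z_a, z_b, energy, n_steps=10, threshold=0.5):
    # move z_a toward z_b in n_steps linear increments (a stand-in for the
    # identity-preserving transformations) and score each step against z_b
    energies = []
    for t in np.linspace(0.0, 1.0, n_steps):
        z = (1.0 - t) * z_a + t * z_b
        energies.append(energy(z, z_b))
    # low average energy along the path → recognized as the same identity
    return float(np.mean(energies)) < threshold

# illustrative energy: squared distance in latent space
toy_energy = lambda u, v: float(np.sum((u - v) ** 2))
near = walk_recognize(np.zeros(4), np.full(4, 0.1), toy_energy)
far = walk_recognize(np.zeros(4), np.full(4, 10.0), toy_energy)
print(near, far)  # → True False
```

The threshold, step count, and interpolation scheme are all assumptions here; the real model computes E with its trained fully connected head rather than a fixed distance.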


We based our code on the repository by Yilun Du et al., developed for experiments in EBM compositionality [8]. Due to limitations in time and computational resources, we loaded the weights of the CelebA attractive model from that paper, greatly reducing the need for thorough training. We continued training from the 22,000th iteration of CelebA attractive for another 700 iterations.

With this training setup we ran both the walk and the full model on 1,000 samples from the whole dataset. To make comparison easier, we modified the walk algorithm to compute the mean of the energies over all steps.

The code developed in the course of this project is available on GitHub:


From the results we can see that the scores of the two methods are very close. It is possible that the results would be equal if both decision-boundary energies were calibrated so that the number of predicted positives matched the number of actual positives.

Given the characteristics of the walk algorithm, it appears that the latent space between Za and Zb has low energy on average. If this were not the case, the walk algorithm would predict more false negatives.


  • Henri Laiho — developing methods, coding, writing
  • Elena Novikova — background research, writing
  • Rodion Krjutškov — reviewing, editing, writing


  1. Yilun Du, Shuang Li, Igor Mordatch (2020). Compositional Visual Generation and Inference with Energy Based Models. [accessed 27.01.2021]
  2. Yann LeCun, Sumit Chopra, Raia Hadsell (2006). A Tutorial on Energy-Based Learning. [accessed 27.01.2021]
  3. Sumit Chopra, Raia Hadsell, Yann LeCun (2005). Learning a Similarity Metric Discriminatively, with Application to Face Verification. [accessed 27.01.2021]
  4. Ali Jahanian, Lucy Chai, Phillip Isola (2020). On the “Steerability” of Generative Adversarial Networks. [accessed 27.01.2021]
  5. Toshihiko Sasama, Hiroshi Mitsumoto, Kazuyo Yoneda, Shinichi Tamura (2009). Mental Rotation by Neural Network. [accessed 27.01.2021]
  6. Ziwei Liu, Ping Luo, Xiaogang Wang, Xiaoou Tang (2015). Deep Learning Face Attributes in the Wild. [accessed 27.01.2021]
  7. Sridhar Mahadevan (2019). Imagination Machines: A New Challenge for Artificial Intelligence. [accessed 27.01.2021]
  8. Yilun Du, Shuang Li, Igor Mordatch (2020). Compositional Visual Generation with Energy Based Models. [accessed 27.01.2021]
