The attention mechanism has significantly changed deep learning.

Thanks to attention, models can achieve better results. The mechanism also inspired perceivers and transformer neural networks. Transformers, in turn, led to the development of models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), which are used in Natural Language Processing (NLP).

The attention mechanism was initially used in machine translation. Two attention mechanisms can be found in the TensorFlow framework, where they are implemented as the Attention layer (a.k.a. Luong-style attention) and the AdditiveAttention layer (a.k.a. Bahdanau-style attention). In this article, I’m going to focus on explaining these two attention mechanisms.


Bahdanau introduced an attention mechanism to improve machine translation from English to French. The whole mechanism is based on a recurrent neural network encoder-decoder architecture. First, the encoder produces a hidden state (an annotation) for each word of the input sentence. The hidden state of the decoder is then computed by a function that takes into account the previous hidden state, the previous output and a context vector, which is calculated separately for each output word from the annotations of the input.

Illustration of the attention mechanism from Bahdanau’s paper


The next step is creating an alignment model, which scores how well the input around position j matches the output at position i. The score is based on the previous hidden state of the decoder and the j-th annotation of the input sentence. The alignment model is a feedforward neural network which is jointly trained with all the other components of the system. After the alignment scores are obtained, the softmax function is applied to them, which turns them into attention weights that sum to 1.
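The scoring step can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming the additive form of the score, v_a · tanh(W_a s + U_a h_j); the names `additive_scores`, `W_a`, `U_a` and `v_a`, the dimensions and the random weights are assumptions made for this example, and in a real model the weights would be learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def additive_scores(s_prev, annotations, W_a, U_a, v_a):
    """Score every annotation h_j against the previous decoder state.

    s_prev:      (dec_dim,)          previous decoder hidden state
    annotations: (src_len, enc_dim)  one encoder annotation per input word
    W_a: (attn_dim, dec_dim), U_a: (attn_dim, enc_dim), v_a: (attn_dim,)
    """
    # Small feedforward network: project both inputs, add them, apply tanh.
    hidden = np.tanh(W_a @ s_prev + annotations @ U_a.T)  # (src_len, attn_dim)
    return hidden @ v_a                                   # (src_len,)

# Random stand-ins for learned parameters (illustrative only).
rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, attn_dim = 5, 8, 6, 4
annotations = rng.normal(size=(src_len, enc_dim))
s_prev = rng.normal(size=dec_dim)
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

scores = additive_scores(s_prev, annotations, W_a, U_a, v_a)
weights = softmax(scores)  # one attention weight per input word
```

Each entry of `weights` says how much the corresponding input word matters for the output word that is currently being generated.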

Next, the context vector is calculated as the weighted sum of the annotations (encoder outputs), where the weights are the softmax outputs. If the weight of an annotation is close to 1, that annotation has a high influence on the decoder output. Finally, the context vector is concatenated with the previous output and fed to the decoder model.
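As a small worked example of the weighted sum, here is a sketch with three hand-picked annotations and attention weights. The numbers and the helper name `context_vector` are made up for illustration.

```python
import numpy as np

def context_vector(weights, annotations):
    """Weighted sum of annotations.

    weights:     (src_len,)          attention weights that sum to 1
    annotations: (src_len, enc_dim)  encoder outputs
    """
    return weights @ annotations  # (enc_dim,)

annotations = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [2.0, 2.0]])
weights = np.array([0.05, 0.05, 0.90])  # the third word dominates

context = context_vector(weights, annotations)
# context is approximately [1.85, 1.85], i.e. close to the third annotation
```

Because the third weight is close to 1, the context vector ends up almost equal to the third annotation.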

Luong’s attention mechanism differs in where the attention is focused and how the alignment score is calculated. Luong’s method uses the hidden states at the top layers of both the encoder and the decoder, whereas Bahdanau’s method uses the concatenation of the forward and backward source hidden states in the bi-directional encoder and the target hidden states in the decoder. In Luong’s method there are also three ways of measuring the alignment score:

The first one (dot) involves the simple multiplication of the hidden states in the encoder with the hidden state in the decoder. The second one (general) involves multiplication as in the first version, but also includes a weight matrix. The last method (concat) is similar to the way that alignment scores are calculated in Bahdanau’s attention mechanism, but the decoder hidden state is concatenated with the encoder hidden states before being passed through the feedforward layer.
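The three scoring variants can be compared side by side. The sketch below is an illustration rather than a reference implementation: the function names, shapes and random weights are assumptions, with `h_t` standing for the decoder hidden state and `h_s` for the matrix of encoder hidden states.

```python
import numpy as np

def score_dot(h_t, h_s):
    """dot: h_t . h_s_j for every encoder state (needs equal dimensions)."""
    return h_s @ h_t  # (src_len,)

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s_j, with W_a of shape (dec_dim, enc_dim)."""
    return h_s @ (W_a.T @ h_t)  # (src_len,)

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a . tanh(W_a [h_t; h_s_j]), W_a of shape (attn_dim, dec_dim + enc_dim)."""
    tiled = np.broadcast_to(h_t, (h_s.shape[0], h_t.shape[0]))
    joined = np.concatenate([tiled, h_s], axis=1)  # (src_len, dec_dim + enc_dim)
    return np.tanh(joined @ W_a.T) @ v_a           # (src_len,)

# Random stand-ins for hidden states and learned weights (illustrative only).
rng = np.random.default_rng(1)
src_len, dim, attn_dim = 5, 4, 3
h_t = rng.normal(size=dim)
h_s = rng.normal(size=(src_len, dim))
W_general = rng.normal(size=(dim, dim))
W_concat = rng.normal(size=(attn_dim, 2 * dim))
v_a = rng.normal(size=attn_dim)

dot_scores = score_dot(h_t, h_s)
general_scores = score_general(h_t, h_s, W_general)
concat_scores = score_concat(h_t, h_s, W_concat, v_a)
```

One way to see the relation between the first two variants: if `W_a` is the identity matrix, the general score reduces to the dot score.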





Attention mechanisms have revolutionised deep learning, especially Natural Language Processing, where they have allowed us to build better models, like BERT, for embedding words or phrases. The mechanism is also used in computer vision, which will be discussed in the next part of this article. It helps the model to “remember” all of the inputs and to focus on specific parts of the input, e.g. concentrating on particular words when formulating a response or decision.


In the previous article, I described attention mechanisms using natural language processing as an example. Attention was first used in language processing, but that is not its only use. We can also apply attention mechanisms in another major field: computer vision.


Images can be represented as multichannel matrices. For example, an RGB image has three dimensions: height, width and three channels, one for each colour.
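To make the layout concrete, here is a tiny NumPy sketch of an RGB image as an array. The image size and pixel values are arbitrary, and the channels-first view is included only because the shapes in the rest of this article are written as channels x height x width.

```python
import numpy as np

height, width = 4, 6

# A small RGB image stored as height x width x channels.
image = np.zeros((height, width, 3), dtype=np.uint8)
image[..., 0] = 255  # fill the red channel: a pure red image

# Each colour channel is a 2D matrix of shape (height, width).
red, green, blue = image[..., 0], image[..., 1], image[..., 2]

# The same data in channels x height x width order.
channels_first = np.transpose(image, (2, 0, 1))
```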



An attention mechanism can be applied to the channels.

The Channel Attention Module (CAM) is a method that helps a model to decide “what to pay attention to”. This is possible because the module adaptively calculates a weight for each channel. The concept of channel attention was first presented in the article that introduced Squeeze-and-Excitation Networks.


In this type of network, the input is a feature map with three dimensions (channels x height x width). The first layer is Global Average Pooling (GAP), which reduces the feature map of each channel to a single value, i.e. a 1x1 spatial dimension. This layer produces a vector whose length is equal to the number of channels, so its shape is: channels x 1 x 1. This vector goes to a MultiLayer Perceptron (MLP), whose hidden layer has a number of neurons equal to the number of channels divided by the reduction ratio. The higher this ratio, the fewer neurons there are in the MLP. At the output of the MLP there is a sigmoid function that maps the values into the range of 0 to 1.
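The flow described above can be sketched in NumPy as follows. This is a simplified illustration and not the original implementation: biases are omitted, the ReLU between the two MLP layers is an assumption, and the names `se_block`, `W1` and `W2` as well as the random weights are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, W1, W2):
    """Squeeze-and-excitation style channel weighting.

    x:  (channels, height, width)  input feature map
    W1: (channels // ratio, channels)  first MLP layer (reduction)
    W2: (channels, channels // ratio)  second MLP layer (expansion)
    """
    squeezed = x.mean(axis=(1, 2))           # Global Average Pooling -> (channels,)
    hidden = np.maximum(0.0, W1 @ squeezed)  # reduced hidden layer, ReLU assumed
    weights = sigmoid(W2 @ hidden)           # one weight in (0, 1) per channel
    return x * weights[:, None, None], weights

# Random stand-ins for a feature map and learned weights (illustrative only).
rng = np.random.default_rng(2)
channels, height, width, ratio = 8, 5, 5, 4
x = rng.normal(size=(channels, height, width))
W1 = rng.normal(size=(channels // ratio, channels))
W2 = rng.normal(size=(channels, channels // ratio))

reweighted, channel_weights = se_block(x, W1, W2)
```

With 8 channels and a reduction ratio of 4, the hidden layer has only 2 neurons, which is what keeps the module cheap.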





The difference between a Channel Attention Module and an SE Network is that the pooling step generates not one, but two, vectors of shape (channels x 1 x 1). One vector is generated by Global Average Pooling, and the second vector by Global Max Pooling. The advantage of this solution is that more information is preserved. Max pooling can also provide features based on contextual information, such as edges, whereas average pooling loses this information because it has a smoothing effect. Both vectors are passed through a shared MLP, summed up and then passed through a sigmoid activation to generate the weights of the channels.
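A minimal NumPy sketch of this module is shown below. It assumes that the average-pooled and max-pooled vectors both go through the same (shared) two-layer MLP before they are summed; biases are left out, the ReLU in the MLP is an assumption, and the function name and random weights are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, W1, W2):
    """Channel attention with average and max pooling.

    x:  (channels, height, width)
    W1: (channels // ratio, channels), W2: (channels, channels // ratio)
    """
    avg_vec = x.mean(axis=(1, 2))  # Global Average Pooling -> (channels,)
    max_vec = x.max(axis=(1, 2))   # Global Max Pooling     -> (channels,)

    def shared_mlp(v):
        # The same weights are used for both pooled vectors.
        return W2 @ np.maximum(0.0, W1 @ v)

    weights = sigmoid(shared_mlp(avg_vec) + shared_mlp(max_vec))
    return x * weights[:, None, None], weights

# Random stand-ins for a feature map and learned weights (illustrative only).
rng = np.random.default_rng(3)
channels, height, width, ratio = 8, 6, 6, 4
x = rng.normal(size=(channels, height, width))
W1 = rng.normal(size=(channels // ratio, channels))
W2 = rng.normal(size=(channels, channels // ratio))

refined, channel_weights = channel_attention(x, W1, W2)
```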





Attention mechanisms can also be used to help the model to find “where to pay attention”. The Spatial Attention Module (SAM) is useful for this task. There are three steps in this method. In the first step, a pooling operation is applied across the channels: the input, with a shape of (channels x height x width), is reduced to two channels, which represent Max Pooling and Average Pooling across the channels of the image. Each pooling generates a feature map with the shape of: 1 x height x width. After that, there is a convolutional layer, followed by a batch norm layer for normalizing the output. At the end, just like in the attention module described above, a sigmoid function is used to map the values into the range of 0 to 1. The resulting spatial attention map is then applied to all the feature maps in the input tensor using a simple element-wise product.
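The three steps can be sketched in NumPy as below. This is a simplified illustration under a few assumptions: the batch norm layer is left out, the convolution is a naive zero-padded loop with a single output channel, and the 7x7 kernel size, the function names and the random kernel are choices made for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive zero-padded convolution with one output channel.

    x: (in_channels, height, width), kernel: (in_channels, k, k) with odd k.
    Returns an array of shape (height, width).
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(x, kernel):
    """x: (channels, height, width); kernel: (2, k, k)."""
    avg_map = x.mean(axis=0)                 # average pooling across channels
    max_map = x.max(axis=0)                  # max pooling across channels
    pooled = np.stack([avg_map, max_map])    # (2, height, width)
    attention = sigmoid(conv2d_same(pooled, kernel))  # (height, width)
    return x * attention[None, :, :], attention

# Random stand-ins for a feature map and a learned kernel (illustrative only).
rng = np.random.default_rng(4)
x = rng.normal(size=(8, 6, 6))
kernel = rng.normal(size=(2, 7, 7)) * 0.1

refined, attention_map = spatial_attention(x, kernel)
```

The same attention map is broadcast over every channel, so each spatial position is scaled by one shared value.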


Convolutional Block Attention Module


It is also possible to join those two methods together, which creates a Convolutional Block Attention Module (CBAM). This can be applied as a layer to every convolutional block in the model. It takes a feature map generated by a convolutional layer and applies first a CAM and then a SAM to it. The output is a refined feature map.
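The ordering of the two modules can be illustrated with a compact, self-contained sketch. As before, this is a simplification under stated assumptions: biases and batch norm are omitted, the ReLU in the MLP and the 7x7 kernel are assumptions, and all names and random weights are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_weights(x, W1, W2):
    """Channel attention: shared MLP over average- and max-pooled vectors."""
    def mlp(v):
        return W2 @ np.maximum(0.0, W1 @ v)
    return sigmoid(mlp(x.mean(axis=(1, 2))) + mlp(x.max(axis=(1, 2))))  # (channels,)

def spatial_weights(x, kernel):
    """Spatial attention: convolution over channel-wise average and max maps."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, height, width)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    conv = np.array([[np.sum(padded[:, i:i + k, j:j + k] * kernel)
                      for j in range(w)] for i in range(h)])
    return sigmoid(conv)  # (height, width)

def cbam(x, W1, W2, kernel):
    """Apply channel attention first, then spatial attention."""
    x = x * channel_weights(x, W1, W2)[:, None, None]
    x = x * spatial_weights(x, kernel)[None, :, :]  # uses the channel-refined map
    return x

# Random stand-ins for a feature map and learned parameters (illustrative only).
rng = np.random.default_rng(5)
channels, ratio = 8, 4
features = rng.normal(size=(channels, 6, 6))
W1 = rng.normal(size=(channels // ratio, channels))
W2 = rng.normal(size=(channels, channels // ratio))
kernel = rng.normal(size=(2, 7, 7)) * 0.1

refined = cbam(features, W1, W2, kernel)
```

Since the output has the same shape as the input, a block like this can be dropped in after a convolutional layer without changing the rest of the model.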


The attention mechanisms described in this article provide very effective and efficient methods of improving results in a wide range of tasks related to computer vision, such as image classification, object detection, image generation and super-resolution.



