ALL ARTICLES FOR Deep learning


 The attention mechanism made big changes in deep learning.

Thanks to this, models can achieve better results. This mechanism was also the inspiration for perceivers and for transformer neural networks. Transformers, in turn, led to the development of models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), which are used in Natural Language Processing (NLP).

The attention mechanism was initially used in machine translation. Two attention mechanisms can be found in the TensorFlow framework, implemented as the Attention layer (a.k.a. Luong-style attention) and the AdditiveAttention layer (a.k.a. Bahdanau-style attention). In this article, I’m going to focus on explaining these two attention mechanisms.


Bahdanau introduced an attention mechanism to improve machine translation from English to French. The whole mechanism is based on a recurrent neural network encoder-decoder. First, the encoder produces a hidden state (an annotation) for each word of the input to the model. Each decoder hidden state is then produced by a function that takes into account the previous hidden state, the previous output and a context vector calculated for each word.


Illustration of attention mechanism from Bahdanau’s paper



The next step is creating an alignment model, which scores how well the input around position j matches the output around position i. The score is based on the previous decoder hidden state and the j-th annotation of the input sentence. The alignment model is a feedforward neural network which is jointly trained with all the other components of the system. After the alignment scores have been obtained, the softmax function is applied to them to produce attention weights that sum to 1.

Next, the context vector is calculated as the weighted sum of the annotations (encoder outputs). If the attention weight of an annotation is close to 1, it means that this annotation has a high influence on the decoder output. Finally, the context vector is concatenated to the output and fed to the decoder model.
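The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration with random, untrained weights: the names `additive_attention`, `W_s`, `W_h` and `v`, as well as all of the dimensions, are assumptions made for the example and do not come from Bahdanau's paper or from TensorFlow.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(s_prev, annotations, W_s, W_h, v):
    """Bahdanau-style (additive) attention for one decoder step.

    s_prev:      previous decoder hidden state, shape (d_dec,)
    annotations: encoder hidden states, shape (T, d_enc)
    """
    # Alignment model: a small feedforward network scoring each annotation
    scores = np.tanh(s_prev @ W_s + annotations @ W_h) @ v  # shape (T,)
    # Softmax turns the scores into attention weights that sum to 1
    weights = softmax(scores)
    # Context vector: weighted sum of the annotations
    context = weights @ annotations  # shape (d_enc,)
    return weights, context

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 4, 3, 6
annotations = rng.normal(size=(T, d_enc))
s_prev = rng.normal(size=d_dec)
W_s = rng.normal(size=(d_dec, d_att))
W_h = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=d_att)

weights, context = additive_attention(s_prev, annotations, W_s, W_h, v)
print(weights, context)
```

In a real model the three parameter arrays would be learned jointly with the encoder and decoder rather than drawn at random.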

Luong’s attention mechanism differs in where the attention is focused and how the alignment score is calculated. With this method, you can use the concatenation of the forward and backward source hidden states in the bi-directional encoder and the target hidden states in the decoder. There are also three ways of measuring the alignment score.

The first one (dot) involves the simple multiplication of the hidden states in the encoder with the hidden state in the decoder. The second one (general) involves multiplication as in the first version, but also includes a weight matrix. The last method (concat) is similar to the way that alignment scores are calculated in Bahdanau’s attention mechanism, but the decoder hidden state is concatenated with the encoder hidden states before the weights are applied.
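As a rough sketch, the three scoring variants can be written as follows. The function name `luong_scores`, the parameter shapes and the random inputs are assumptions chosen for this example; the weights are untrained.

```python
import numpy as np

def luong_scores(h_t, h_s, method, W=None, v=None):
    """Alignment scores between one decoder state and all encoder states.

    h_t: decoder hidden state, shape (d,)
    h_s: encoder hidden states, shape (T, d)
    """
    if method == "dot":
        # Plain multiplication of encoder and decoder hidden states
        return h_s @ h_t
    if method == "general":
        # Same multiplication, with a weight matrix W of shape (d, d) in between
        return h_s @ (h_t @ W)
    if method == "concat":
        # Concatenate the decoder state with every encoder state,
        # then apply W of shape (d_att, 2d), tanh and a vector v of shape (d_att,)
        T = h_s.shape[0]
        stacked = np.concatenate([np.tile(h_t, (T, 1)), h_s], axis=1)
        return np.tanh(stacked @ W.T) @ v
    raise ValueError(f"unknown method: {method}")

# Hypothetical sizes and random values, for illustration only
rng = np.random.default_rng(1)
T, d, d_att = 4, 3, 5
h_t = rng.normal(size=d)
h_s = rng.normal(size=(T, d))

dot = luong_scores(h_t, h_s, "dot")
general = luong_scores(h_t, h_s, "general", W=rng.normal(size=(d, d)))
concat = luong_scores(h_t, h_s, "concat",
                      W=rng.normal(size=(d_att, 2 * d)), v=rng.normal(size=d_att))
print(dot, general, concat)
```

Each variant returns one score per encoder position; as with Bahdanau's mechanism, a softmax over these scores then gives the attention weights.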





Attention mechanisms created a big revolution, especially in Natural Language Processing, where they have allowed us to build better models, like BERT, for embedding words or phrases. This mechanism is also used in computer vision, which will be discussed in the next part of this article. It helps the model to “remember” all of the inputs and to focus on specific parts of the input, e.g. concentrating on particular words when formulating a response or decision.


In the previous article, I described attention mechanisms using an example from natural language processing. This method was first used in language processing, but that is not its only use. We can also use attention mechanisms in another major field: computer vision.


Images can be represented as multichannel matrices. For example, an RGB image has three dimensions: height, width and three channels, one for each colour.

RGB images


An attention mechanism can be applied to the channels.

Channel Attention Module (CAM) is a method which helps a model to decide “what to pay attention to”. This is possible because the module adaptively calculates a weight for each channel. The first concept of channel attention was presented in the article which introduced Squeeze-and-Excitation (SE) Networks.


In this type of network, the input is an image with three dimensions (channels x height x width). The first layer is Global Average Pooling (GAP), which reduces the feature map of each channel to a single value, i.e. to a 1x1 spatial dimension. This layer produces a vector whose length is equal to the number of channels; its shape is channels x 1 x 1. This vector goes to a multilayer perceptron (MLP) whose hidden layer has a size equal to the number of channels divided by the reduction ratio. The higher this ratio, the fewer neurons there are in the MLP. At the output of the MLP there is a sigmoid function that maps values to the range of 0 to 1.
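A minimal NumPy sketch of such a block is shown below. It assumes a two-layer MLP without biases and with a ReLU between the layers, as in the original SE design, and it multiplies the resulting weights back onto the input channels. The names `se_block`, `W1` and `W2`, the sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, W1, W2):
    """Squeeze-and-Excitation style channel weighting.

    x:  feature map, shape (C, H, W)
    W1: first MLP layer, shape (C // r, C)
    W2: second MLP layer, shape (C, C // r)
    """
    # Squeeze: global average pooling, one value per channel
    z = x.mean(axis=(1, 2))                 # shape (C,)
    # Excitation: bottleneck MLP followed by a sigmoid
    hidden = np.maximum(0.0, W1 @ z)        # shape (C // r,)
    weights = sigmoid(W2 @ hidden)          # shape (C,), values in (0, 1)
    # Rescale every channel of the input by its weight
    return x * weights[:, None, None], weights

# Hypothetical sizes: 8 channels, reduction ratio 4 -> 2 hidden neurons
rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4
x = rng.normal(size=(C, H, W))
W1 = rng.normal(size=(C // r, C))
W2 = rng.normal(size=(C, C // r))

out, weights = se_block(x, W1, W2)
print(weights)
```

With a reduction ratio of 4, the eight channel values are squeezed through only two hidden neurons, which is what keeps the module cheap.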



Shared MLP


The difference between a Channel Attention Module and an SE Network is that the pooling step generates not one, but two vectors of shape (channels x 1 x 1). One vector is generated by Global Average Pooling, and the second vector by Global Max Pooling. The advantage of this solution is that more information is available. Max pooling can also provide features based on contextual information, such as edges. Average pooling loses this information because it has more of a smoothing effect. Both vectors are passed through a shared MLP, summed up and passed through a sigmoid activation to generate the weights of the channels.
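The sketch below illustrates this module. It assumes, following the CBAM design, that each pooled vector first goes through the same (shared) two-layer MLP before the results are summed. The names `channel_attention`, `W1` and `W2`, the sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, W1, W2):
    """Channel attention with average- and max-pooled descriptors.

    x:  feature map, shape (C, H, W)
    W1: shared MLP, first layer, shape (C // r, C)
    W2: shared MLP, second layer, shape (C, C // r)
    """
    avg = x.mean(axis=(1, 2))   # Global Average Pooling, shape (C,)
    mx = x.max(axis=(1, 2))     # Global Max Pooling, shape (C,)

    def shared_mlp(z):
        # The same weights are used for both pooled vectors
        return W2 @ np.maximum(0.0, W1 @ z)

    # Sum the two MLP outputs and squash them into (0, 1)
    weights = sigmoid(shared_mlp(avg) + shared_mlp(mx))
    return x * weights[:, None, None], weights

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4
x = rng.normal(size=(C, H, W))
W1 = rng.normal(size=(C // r, C))
W2 = rng.normal(size=(C, C // r))

refined, weights = channel_attention(x, W1, W2)
print(weights)
```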


Convolution Layer



Attention mechanisms can also be used to help the model to find “where to pay attention”. Spatial Attention Module (SAM) is useful for this task. There are three steps in using this method. In the first step, there is a pooling operation across the channels, where an input of shape (channels x height x width) is reduced to two channels, which represent Max Pooling and Average Pooling across the channels in the image. Each pooling generates a feature map with the shape 1 x height x width. After that, there is a convolutional layer and a batch norm layer for normalizing the output. At the end, just like in the attention module described above, a sigmoid function is used to map values to the range of 0 to 1. The resulting spatial attention map is then applied to all the feature maps in the input tensor using a simple element-wise product.
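Here is a simplified NumPy sketch of these steps. To keep it short it leaves out the batch norm layer and uses a naive, loop-based convolution with zero padding and an odd kernel size. The helper `conv2d_same`, the function `spatial_attention`, the 7x7 kernel and the random weights are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive convolution with one output channel and zero padding.

    x:      input, shape (C_in, H, W)
    kernel: weights, shape (C_in, k, k) with odd k
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(x, kernel):
    # Step 1: pool across the channel axis -> two maps of shape (1, H, W)
    avg = x.mean(axis=0, keepdims=True)
    mx = x.max(axis=0, keepdims=True)
    pooled = np.concatenate([avg, mx], axis=0)      # shape (2, H, W)
    # Step 2: convolution (batch norm omitted in this sketch)
    # Step 3: sigmoid maps the result into (0, 1)
    attn = sigmoid(conv2d_same(pooled, kernel))     # shape (H, W)
    # Apply the same spatial map to every channel of the input
    return x * attn[None, :, :], attn

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6, 6))
kernel = rng.normal(size=(2, 7, 7)) * 0.1

refined, attn = spatial_attention(x, kernel)
print(attn.shape)
```

In practice the convolution would come from a deep learning framework; the explicit loops are only there to keep the example free of dependencies.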


Convolutional Block Attention Module


It is also possible to join those two methods together, which creates a Convolutional Block Attention Module (CBAM). This can be applied as a layer to every convolutional block in the model. It takes a feature map generated by a convolutional layer and applies first a CAM and then a SAM to it. The output consists of refined feature maps.


The attention mechanisms described in this article provide very effective and efficient methods of improving results in a wide range of tasks related to computer vision, such as image classification, object detection, image generation and super-resolution.





Nowadays, no one needs to be convinced of the power and usefulness of deep neural networks. AI solutions based on neural networks have revolutionised almost every area of technology, business, medicine, science and military applications. After the breakthrough win of Geoffrey Hinton's group in the ImageNet competition in 2012, neural networks became the most popular machine learning algorithm. Since then, 21st century technology has come to increasingly rely on AI applications. We encounter AI solutions in almost every step of our daily lives - in cutting-edge technologies, entertainment systems, business solutions, protective systems, the medical domain and many more areas. In many of these areas, AI solutions work in a way which is self-sufficient and under little or no human supervision.
















Image generated with Midjourney API, “Quantum Computing”



The blogs so far have been based on facts and knowledge backed by rigorous theory, empirical experience and evidence in the form of practical solutions. Today's article will be of a slightly different nature and will focus on a field that is still in development, or rather in its empirical infancy, and is still far from generating practical real-life solutions. We are talking about quantum computing (QC) - a field which the technological world on the one hand associates with computational salvation, and on the other fears for the irreversible change it may bring to some of the systems on which our technological civilization operates. It is worth quoting at this point the words of the great Polish visionary and science-fiction writer Stanisław Lem:

The risk can be of any magnitude, but the very fact of its existence implies the possibility of success

~ Stanisław Lem, The Magellanic Cloud


From a theoretical perspective, quantum computing offers enough computing power to break, in a short period of time, the ciphers that secure our money in the bank today. However, it is still uncertain whether the barrier between theory and practice will prove to be too high, in terms of physical possibilities, in the coming years. I believe that it is nevertheless worth turning your gaze to the future, even if it is uncertain. After all, even 30 years ago few would have believed in solutions that today we take for granted and perhaps even regard as a boring part of everyday life.

Known areas where quantum computing can lead to a revolution are, e.g., the design of new drugs and materials, the improvement of artificial intelligence and optimization tasks such as fleet management of taxis, trucks, ships, etc. As quantum computers are increasing in maturity, research on algorithms that are dedicated to utilising the power of quantum computers is moving from being a niche that only a few people look at theoretically, to an active area of research on a larger scale. More practically relevant application areas are expected in the future.

In today's blog we will focus on a particular field where QC is likely to find its application - quantum machine learning. It is an area combining quantum computing with classical machine learning, something we at Digica like the most :)


The basis of quantum computing - qubit as a carrier of information


Where does the computational power of quantum computing lie? A qubit, a bit of quantum information, has an unusual feature, derived from the laws of quantum mechanics: it can be not only in state 0 or in state 1, but also a little bit in 0 and a little bit in 1 at the same time - it is in a superposition of the two states. Similarly, eight qubits can be in a superposition of all states, from 0 to 255, at once. The implications of this are momentous. A classical byte holds only one of these 256 values at a time, so a classical processor has to work through them one after another. A register of qubits, which is actually a kind of probability cloud that determines the likelihood of each state, allows an operation to act on all these states simultaneously, although a measurement at the end returns only one of them. So we are dealing with a form of parallel processing, which in modern electronics would correspond to the use of multiple processors.
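The idea of a superposition can be made concrete with a small classical simulation. The sketch below stores a quantum state as a NumPy vector of amplitudes, which is a standard textbook representation; the variable names and the choice of the Hadamard gate to create the superposition are assumptions made for this example.

```python
import numpy as np

# A single qubit in state |0>, written as a vector of two amplitudes
ket0 = np.array([1.0, 0.0])

# The Hadamard gate puts |0> into an equal superposition of 0 and 1
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
plus = H @ ket0
single_probs = np.abs(plus) ** 2   # probability of reading 0 and of reading 1

# Eight such qubits: the joint state is the Kronecker product of the single
# qubit states, so it has 2**8 = 256 amplitudes, one per value from 0 to 255
n = 8
state = plus
for _ in range(n - 1):
    state = np.kron(state, plus)

probs = np.abs(state) ** 2
probs = probs / probs.sum()        # guard against floating-point drift

# A measurement returns only one of the 256 values, chosen at random
rng = np.random.default_rng(0)
outcome = rng.choice(len(state), p=probs)
print(len(state), outcome)
```

Note that the classical simulation needs a vector that doubles in length with every extra qubit, which is why such simulations quickly become impractical.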



Comparison of bit and qubit states.



The performance of quantum computers is not determined by clock speed in the way that the performance of classical processors is. Performance is determined primarily by the number of qubits. Each additional qubit doubles the size of the state space available for computation. In a single act of reading we would then receive information that a classical computer would need centuries to process. The juxtaposition is striking: in order to achieve performance exceeding the best modern supercomputer, it is “enough” to construct a device consisting of around 1 million qubits (equivalent to 2^1000000 classical bits).



IBM Q Quantum Computer; Photo by LarsPlougmann


When discussing the possibilities of quantum computing, we must also take into account the existing problems the technology faces in the empirical field. Quantum computers are exceedingly difficult to engineer, build and program. As a result, they are crippled by errors in the form of noise, faults and loss of quantum coherence, which is crucial to their operation and yet falls apart before any nontrivial program has a chance to run to completion. This loss of coherence (called decoherence), caused by vibrations, temperature fluctuations, electromagnetic waves and other interactions with the outside environment, ultimately destroys the exotic quantum properties of the computer. Given the current pervasiveness of decoherence and other errors, contemporary quantum computers are unlikely to return correct answers for programs of even modest execution time. While competing technologies and competing architectures are attacking these problems, no existing hardware platform can maintain coherence and provide the robust error correction required for large-scale computation. A breakthrough is probably several years away.


Quantum Machine Learning


Machine learning (ML) is a set of algorithms and statistical models that can extract information hidden in data. By learning a model from a dataset, one can make predictions on previously unseen data taken from the same probability distribution. For several decades, machine learning research has focused on models that can provide theoretical guarantees of their performance. But in recent years, heuristics-based methods have dominated, in part because of the abundance of data and computational resources. Deep learning is one such heuristic method that has been very successful.


Alongside the rapid development of deep learning, there has been a significant increase in interest in the ever-expanding field of quantum computing. Quantum computing involves the design and utilisation of quantum systems to perform specific computations, where quantum theory can be viewed as a generalisation of probability theory that introduces unique system behaviours such as superposition or quantum entanglement. Such novel system behaviours are difficult to simulate on classical computers, so one area of research that is receiving growing attention is the design of machine learning algorithms that would rely on quantum properties to accelerate their performance.


The ability to perform fast linear algebra on a state space that grows exponentially with the number of qubits has become a key feature motivating the application of quantum computers for machine learning. These quantum-accelerated linear algebra-based techniques for machine learning can be considered the first generation of quantum machine learning (QML) algorithms, which address a wide range of applications in both supervised and unsupervised learning, including principal component analysis, support vector machines, k-means clustering and recommender systems. The main deficiency that arises in these algorithms is that proper data preparation is necessary, which amounts to embedding classical data in quantum states. Not only is such a process poorly scalable, but it also deprives the data of the specific structure that gives advantages with classical algorithms, which calls the practicality of quantum acceleration into question.


Quantum Deep Learning


When we talk about quantum computers, we usually mean fault-tolerant devices. They will be able to run Shor's algorithm for factorization, as well as all the other algorithms that have been developed over the years. However, power comes at a price: in order to solve a factorization problem that is unfeasible for a classical computer, we will need many qubits. This overhead is needed for error correction, since most quantum algorithms we know are extremely sensitive to noise. Even so, programs running on devices larger than 50 qubits quickly become extremely difficult to simulate on classical computers. This opens up the possibility that devices of this size could be used to perform the first demonstration of a quantum computer doing something that is unfeasible for a classical computer. It will probably be a highly abstract task and useless for any practical purpose, but it will be a proof of principle nonetheless. It would be a stage when we know that devices can do things that classical computers can't, but they won't be big enough to provide fault-tolerant implementations of familiar algorithms. John Preskill coined the term "Noisy Intermediate-Scale Quantum" to describe this stage. "Noisy" because we don't have enough qubits for error correction, so we will have to directly exploit imperfect qubits in the physical layer. And "Intermediate-Scale" because of the small (but not too small) number of qubits.


By analogy, just as machine learning evolved into deep learning with the emergence of new computational capabilities, with the theoretical availability of Noisy Intermediate-Scale Quantum (NISQ) processors, a second generation of QML has emerged based on heuristic methods that can be studied empirically due to the increased computational capabilities of quantum systems. These are algorithms using parameterized quantum transformations called parameterized quantum circuits (PQCs) or quantum neural networks (QNNs). As in classical deep learning, the parameters of PQCs/QNNs are optimised against a cost function using black-box optimization heuristics or gradient-based methods to learn representations of the training data. Quantum processors in the near term will still be quite small and noisy, so distinguishing and generalising quantum data will not be possible using quantum processors alone. NISQ processors will therefore have to work with classical coprocessors to become effective.
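To give a feeling for how such a hybrid loop works, here is a toy sketch in which the "quantum" part is simulated classically with NumPy. It assumes the smallest possible PQC, a single qubit with one RY rotation, uses the expectation value of the Pauli-Z observable as the cost function, and estimates the gradient with the parameter-shift rule. All names, the starting angle and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

def ry(theta):
    # Single-qubit rotation around the Y axis
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def expectation_z(theta):
    # "Quantum" part: prepare RY(theta)|0> and evaluate <Z> (equals cos(theta))
    state = ry(theta) @ np.array([1.0, 0.0])
    return float(state @ Z @ state)

def parameter_shift_grad(theta):
    # Gradient obtained from two extra circuit evaluations at shifted angles
    shift = np.pi / 2
    return 0.5 * (expectation_z(theta + shift) - expectation_z(theta - shift))

# Classical part: plain gradient descent on the circuit parameter
theta = 0.1
learning_rate = 0.4
for step in range(100):
    theta -= learning_rate * parameter_shift_grad(theta)

print(theta, expectation_z(theta))
```

On real hardware, `expectation_z` would be estimated from repeated noisy measurements on the quantum processor, while the update of `theta` would still run on a classical coprocessor.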



Abstract pipeline for inference and training of a hybrid classical-quantum model





Let's answer the question in the title of the article - Is Quantum Machine Learning hot or not? The possibilities opened up by the laws of quantum mechanics when used for quantum computing are extremely enticing and promising. In the context of machine learning, the acceleration of many of the algorithms of both classical machine learning and deep learning instils excitement and seems to be a technology that offsets some of the pains that existing classical solutions encounter. Unfortunately, things that are possible in theory sometimes have a technological barrier in front of them, which is especially true for quantum computing. Nevertheless, in my opinion Quantum Machine Learning is hot, and although it may seem like a distant prospect, it does not detract from its advantages and the many solutions it carries. As is often the case in life, let time tell.







Review of Yoshua Bengio’s lecture at the Artificial General Intelligence 2021 Conference

At the 2021 Artificial General Intelligence Conference, one of the star keynote speakers was Yoshua Bengio. He has been one of the leading figures of deep learning with neural networks, for which he received the 2018 Turing Award.

Autoencoders have existed in Data Science for a long time. This type of model has three sequential parts: the input layer (encoder); the hidden layer; and the output layer (decoder). It seems like a simple concept, but it is very powerful. You use autoencoders as an unsupervised method, which means that labels are not used in their training process.

