

As humans, we have a visual system that allows us to see (extract and understand) shapes, colours and contours. So how do we distinguish one image from another? How do we know, for example, that a box in an image is, in reality, a box? And how do we know what a plane is, or what a bird is?


In the previous article, I described attention mechanisms using an example from natural language processing. Attention was first used in language processing, but that is not its only application: we can also use attention mechanisms in another major field, computer vision.


Images can be represented as multichannel matrices. For example, an RGB image has three dimensions: height, width and three channels, one per colour.

RGB images


An attention mechanism can be applied to the channels.

The Channel Attention Module (CAM) is a method which helps a model decide “what to pay attention to”. This is possible because the module adaptively calculates a weight for each channel. The concept of channel attention was first presented in the paper which introduced Squeeze-and-Excitation Networks:


In this type of network, the input is a feature map with three dimensions (channels x height x width). The first layer is Global Average Pooling (GAP), which reduces each channel's feature map to a single value, collapsing the spatial dimensions to 1x1. This layer produces a vector whose length equals the number of channels, with shape channels x 1 x 1. The vector then goes to a MultiLayer Perceptron (MLP) whose hidden layer has a size equal to the number of channels divided by a reduction ratio; the higher this ratio, the fewer neurons in the MLP. The output of the MLP is passed through a sigmoid function, which maps values into the range of 0 to 1.
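The squeeze-and-excitation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the matrices `w1` and `w2` stand in for the trained MLP weights, and the toy input has 4 channels with a reduction ratio of 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation: x has shape (channels, height, width)."""
    # Squeeze: global average pooling -> one value per channel
    s = x.mean(axis=(1, 2))              # shape (channels,)
    # Excitation: two-layer MLP with a bottleneck of channels // ratio units
    h = np.maximum(0.0, w1 @ s)          # ReLU, shape (channels // ratio,)
    weights = sigmoid(w2 @ h)            # shape (channels,), values in (0, 1)
    # Rescale each channel by its weight
    return x * weights[:, None, None]

# Toy example: 4 channels, reduction ratio of 2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1 = rng.standard_normal((2, 4)) * 0.1   # 4 -> 2 (bottleneck)
w2 = rng.standard_normal((4, 2)) * 0.1   # 2 -> 4
y = se_block(x, w1, w2)
print(y.shape)  # (4, 8, 8)
```

Because the sigmoid outputs lie strictly between 0 and 1, each channel of the output is a scaled-down copy of the corresponding input channel.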



Shared MLP


The difference between the Channel Attention Module and an SE network is that pooling generates not one, but two, vectors of shape (channels x 1 x 1): one is generated by GAP, and the second by Global Max Pooling. The advantage of this solution is that more information is retained. Max pooling also captures features based on contextual information, such as edges; average pooling loses this information because of its smoothing effect. Both vectors are passed through the shared MLP, summed, and then passed through a sigmoid activation to generate the channel weights.
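A minimal sketch of this dual-pooling channel attention follows the same pattern, again with toy matrices standing in for the trained parameters of the shared MLP:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Channel attention: x has shape (channels, height, width)."""
    avg = x.mean(axis=(1, 2))              # GAP descriptor, shape (channels,)
    mx = x.max(axis=(1, 2))                # Global Max Pooling descriptor

    # Both descriptors go through the same (shared) two-layer MLP
    def mlp(v):
        return w2 @ np.maximum(0.0, w1 @ v)

    # Sum the two MLP outputs, then squash to (0, 1) channel weights
    weights = sigmoid(mlp(avg) + mlp(mx))
    return x * weights[:, None, None]

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8, 8))
w1 = rng.standard_normal((2, 4)) * 0.1
w2 = rng.standard_normal((4, 2)) * 0.1
y = channel_attention(x, w1, w2)
print(y.shape)  # (4, 8, 8)
```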


Convolution Layer



Attention mechanisms can also be used to help the model find “where to pay attention”. The Spatial Attention Module (SAM) is useful for this task. There are three steps in this method. In the first step, there is a pooling operation across channels: the input, of shape (channels x height x width), is reduced to two channels, which represent Max Pooling and Average Pooling across the channels of the image. Each pooling operation generates a feature map of shape 1 x height x width. Next, a convolutional layer is applied, followed by a batch-norm layer to normalise the output. Finally, just like in the attention module described above, a sigmoid function maps values into the range of 0 to 1. The resulting spatial attention map is then applied to all the feature maps in the input tensor using a simple element-wise product.
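The three steps can be sketched as follows. This is an illustration only: the batch-norm layer is omitted for brevity, and `kernel` stands in for a trained convolutional filter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x, kernel):
    """Spatial attention: x is (channels, height, width);
    kernel is (2, k, k) -- one slice per pooled map."""
    # Step 1: pool across channels -> two maps of shape (height, width)
    avg = x.mean(axis=0)
    mx = x.max(axis=0)
    pooled = np.stack([avg, mx])           # shape (2, h, w)

    # Step 2: one convolutional filter over the 2-channel input ('same' padding)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(pooled, ((0, 0), (p, p), (p, p)))
    h, w = avg.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)

    # Step 3: sigmoid -> one weight per spatial location, applied element-wise
    attn = sigmoid(out)                    # shape (h, w)
    return x * attn[None, :, :]

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 8, 8))
kernel = rng.standard_normal((2, 3, 3)) * 0.1
y = spatial_attention(x, kernel)
print(y.shape)  # (4, 8, 8)
```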


Convolutional Block Attention Module


It is also possible to join these two methods together, creating a Convolutional Block Attention Module (CBAM). This can be applied as a layer to every convolutional block in the model. It takes a feature map generated by a convolutional layer, applies CAM first and then SAM, and outputs refined feature maps.


The attention mechanisms described in this article provide very effective and efficient methods of improving results in a wide range of tasks related to computer vision, such as image classification, object detection, image generation and super-resolution.





A practical use for object detection based on Convolutional Neural Networks is in devices which can support people with impaired vision. An embedded device which runs object-detection models can make everyday life easier for users with such a disability, for example by detecting any nearby obstructions.


Embedded Technology Enablers


However, so far we have only seen a limited use of embedded devices or “wearable” devices to deploy AI in order to support users directly. This is largely due to the resource limitations of embedded systems, the most significant of which are computing power and energy consumption.


Steady progress continues to be made in embedded device technology, especially in the most important element, which is miniaturisation. The current state-of-the-art is a three-nanometer process for MOSFET (metal-oxide-semiconductor field-effect transistor) devices. Such smaller devices allow for shorter signal propagation times, and therefore higher clock frequencies. The development of multi-core devices allows concurrent processing, which means that applications can run more quickly. The energy efficiency of devices has increased, and substantial improvements have been made in the energy density of modern Li-Ion and Li-Polymer batteries. Together, these factors now make it feasible to run computationally intensive tasks, such as machine-learning model inference, on modern embedded hardware.


As a result, AI-based embedded technology is now widely used to process, predict and visualise medical data in real time. An increasing number of devices have been FDA-approved. However, many more applications are not on the FDA regulatory pathway, including AI applications that aid operational efficiency or provide patients with some form of support. Several thousand such devices are in use today.


Support for the Visually Impaired


Digica has developed an AI-based object-detection system which runs on a portable embedded device and is intended to assist the blind and partially sighted. The embedded device is integrated with a depth-reading camera which is mounted on the user’s body.

The system detects obstacles using the depth camera and relays information to the user by a haptic (vibration) controller and a Bluetooth earpiece. For the initial prototype, we selected a Raspberry Pi 4 as the chosen embedded device.

The application passes each captured frame from the camera to a segmenter and object detectors. The initial segmentation stage recognises large, static surfaces such as roads or sidewalks.


Example of detected segmented output


Note that the segmented output shown above is not displayed by the application because no display is connected to the output device.

The subsequent detector stage is used for detecting dynamic, moving objects, such as vehicles and people. A crosswalk detector is implemented as the final stage in the pipeline. All detected items are prioritised based on proximity and potential hazard before being sent to the user.


Example of localised detection output


The segmentation and detection stages operate on RGB video data. Distance information is also provided by the stereo-depth camera. This information is used to alert the user to the proximity of detected objects by relaying information via an earpiece and through haptic feedback. To simplify presentation to the user, detected objects are identified as being on the left, on the right or straight ahead.

Detected objects are prioritised according to proximity and danger to the user. For each prioritised detection a complete set of information is presented to the user. This set of information refers to the classified object (for example, a car), the object’s location relative to the camera and the distance to the object.
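As an illustration only (the labels, hazard weights and angular thresholds below are hypothetical, not values from the deployed system), prioritisation by proximity and danger might be sketched like this:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # classified object, e.g. "car"
    distance_m: float  # from the stereo-depth camera
    angle_deg: float   # bearing relative to camera centre (negative = left)

# Hypothetical hazard weights -- the real system's values are not public
HAZARD = {"car": 3.0, "bicycle": 2.0, "person": 1.0, "crosswalk": 0.5}

def direction(angle_deg, fov_split=15.0):
    """Collapse the bearing into the three coarse directions given to the user."""
    if angle_deg < -fov_split:
        return "on the left"
    if angle_deg > fov_split:
        return "on the right"
    return "straight ahead"

def prioritise(detections):
    """Rank detections: nearer and more hazardous objects come first."""
    return sorted(detections,
                  key=lambda d: HAZARD.get(d.label, 1.0) / max(d.distance_m, 0.1),
                  reverse=True)

frame = [Detection("person", 6.0, -20.0),
         Detection("car", 3.0, 5.0)]
for d in prioritise(frame):
    print(f"{d.label} {direction(d.angle_deg)}, {d.distance_m:.0f} metres")
```

Here the nearby car straight ahead outranks the more distant person on the left, so it is announced first.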


Example of distance information for a prioritised object


The system uses TensorFlow and ONNX models for object detection. The target hardware is an ARM64-based Raspberry Pi, which means that the Arm NN SDK can be used to accelerate the development of AI features.


Significant advances in embedded technology have made it realistic to introduce Edge AI applications, such as the one described above. The technology is small, cheap and powerful enough to justify using it in mainstream development.

At Digica, our embedded software team works together with our team of AI experts to make such developments a reality.





Increasing yields has been a key goal for farmers since the dawn of agriculture. People have continually looked for ways to maximise food production from the land available to them. Until recently, land management techniques such as the use of fertilisers have been the primary tool for achieving this.


Challenges for Farmers


Whilst these techniques give a much improved chance of an increased yield, problems beyond the control of farmers have an enormous impact:


  1. Parasites - “rogue” plants growing amongst the crops may hinder growth; animals may destroy mature plants
  2. Weather - drought will prevent crops from flourishing, whilst heavy rain or prolonged periods of cold can be devastating for an entire season
  3. Human error - ramblers may trample on crops inadvertently, or farm workers may make mistakes
  4. Chance - sometimes it’s just the luck of the draw!


AI techniques can be used to reduce the element of randomness in farming. Identification of crop condition and the classification of likely causes of poor plant condition would allow remedial action to be taken earlier in the life cycle. This can also help prevent similar circumstances arising the following season.


Computer Vision to the Rescue


Computer vision is the most appropriate candidate technology for such systems. Images or video streams taken from fields could be fed into computer vision pipelines in order to detect features of interest.




A key issue in the development of computer vision systems is the availability of data; a potentially large number of images are required to train models. Ideal image datasets are often not available for public use; this is certainly the case in an agricultural context. Nor is the acquisition of such data a trivial exercise. Sample data is required over the entire life cycle of the plants - it takes many months for the plants to grow, and given the potential variation in environmental conditions, it could take years to gather a suitable dataset.


How Synthetic Data Can Help


The use of synthetic data offers a solution to this problem. The replication of nature synthetically poses a significant problem: the element of randomness. No two plants develop in the same way. The speed of growth, age, number and dimensions of plant features, and external factors such as sunlight, wind and precipitation all have an impact on the plant’s appearance.


Plant development can be modelled by the creation of L-systems for specific plants. These mathematical models can be implemented in tools such as Houdini. The Digica team used this approach to create randomised models of wheat plants.




The L-system we developed allowed many aspects of the wheat plants to be randomised, including height, stem segment length, and leaf location and orientation. The effects of gravity were applied randomly, and different textures were applied to modify plant colouration. The Houdini environment is scriptable using Python; this allowed us to easily generate a very large number of synthetic wheat plants and so model entire fields.
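As a rough illustration of the idea (the rules, symbols and parameter ranges here are invented for this sketch; the actual Houdini L-system is far richer), a randomised string-rewriting system can already produce per-plant variation:

```python
import random

# Minimal stochastic L-system: each stem segment "S" may sprout
# a leaf branch "[L]" when the string is rewritten.
RULES = {
    "S": ["S", "S[L]"],  # a stem segment may grow a leaf branch
    "L": ["L"],          # leaves are terminal
}

def grow(axiom="S", iterations=4, seed=None):
    """Rewrite the axiom, choosing a production at random for each symbol."""
    rng = random.Random(seed)
    s = axiom
    for _ in range(iterations):
        s = "".join(rng.choice(RULES[c]) if c in RULES else c for c in s)
    return s

# Randomised per-plant parameters, as in the article: segment length
# and lean (gravity/wind) vary between individual plants.
def random_plant(seed):
    rng = random.Random(seed)
    return {
        "structure": grow(seed=seed),
        "segment_length_cm": rng.uniform(8.0, 14.0),
        "lean_deg": rng.gauss(0.0, 4.0),
    }

field = [random_plant(i) for i in range(3)]
for p in field:
    print(p["structure"], round(p["segment_length_cm"], 1))
```

Each seed yields a different branching structure and geometry, which is exactly the kind of variation that makes a synthetic dataset useful for training.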


The synthetic data is now suitable for training computer vision models for the detection of healthy wheat, enabling applications such as:


  • filtering wheat from other plants
  • identifying damaged wheat
  • locating stunted and unhealthy wheat
  • calculating biomass
  • assessing maturity of wheat


With the planet’s food needs projected to grow by 50% by 2050, radical solutions are required. AI systems will provide a solution to many of these problems; the use of synthetic data is fundamental to successful deployments.


Digica’s team includes experts in the generation and use of synthetic data; we have worked with it in a variety of applications since our inception 5 years ago. We never imagined that it could be used in such complex, rich environments as agriculture. It seems that there are no limits for the use of synthetic data in the Machine Learning process! 


Application of Computer Vision in the Industrial Sector


Inventory management is a key process for all industrial companies, but the inventory process is both time-consuming and error-prone. Mistakes can be very costly, and it is highly undesirable to store more raw materials or fully completed and ready-to-ship products than are required at any given time. On the other hand, any shortfall in the components that make up a product may mean that customer orders are not fulfilled on time. In a warehouse which stores, for example, 10,000,000 items with an average value of $10, the loss of 0.1% of these items represents a cost of $100,000, and the annual cost of such losses may run into millions of dollars. An automated object-counting system based on computer vision (CV) could speed up the process, reduce errors and lower costs.


Why is Inventory Management so complex?


There are many complexities to the art of inventory management, including the following factors:

  • Range - the variety of stock keeping units (SKUs) to be tracked
  • Accessibility - objects may be placed on high shelves in warehouses, out of reach and perhaps out of direct sight of workers
  • Human error - objects may be miscounted or misrecorded in tracking systems
  • Time management - taking an inventory of SKUs at the optimal frequency


These problems can be solved using an automated object counting system which is based on CV. For such a system to be genuinely useful, it must display a high degree of accuracy. An appropriately designed and trained CV application can then significantly reduce the possibility of mistakes and the time taken to execute the process.


An Automated Object Counting System


Digica developed an object counting system based on CV that is both highly accurate and easily customisable. For example, the system is able to detect, classify and count objects by class when they are located on a pallet. The initial system was designed to count crates of bottles.


Example of detected crates when stacked on a pallet


A practical system deployed in a warehouse must be able to cope with a range of inconsistencies in the incoming data. It is unlikely that pallets are always placed in exactly the same locations or are always oriented in the same way. In the example above, all of the crates are detected in spite of the fact that the visible regions of the crates are not consistent. Crates are also recognised from both front and side views.


This system is clearly well suited for use with the CCTV systems which are typically installed in warehouse environments. However, the technology could be adapted to run on automated vehicles or drones, which are devices that often run an embedded operating system that is capable of running Machine Learning (ML) applications. This could lead to a fully automated inventory process in which humans are responsible only for controlling the work of the machines.


Note that this system does not need SKU-specific barcodes or QR codes, which simplifies the deployment of the system in existing warehouses. Therefore, existing processes do not require any modification, and it is not necessary to place objects so that any existing barcode is kept visible.


A Customisable System


This computer vision system is highly customisable. At its core is a pre-trained neural network which can be readily retrained to support a specific target environment. The possibilities are almost limitless! The system could be used for purposes such as:

  • Detecting and counting small objects, such as screws or nails on a conveyor belt
  • Detecting boxes on pallets during packing for the purposes of quality control prior to shipping
  • Aggregating information about certain objects across large physical areas such as shipping ports, for example by carrying out an inventory of shipping containers


Integration with a wider range of systems is also possible. As the system provides real-time inventory data, it is possible to automatically place orders for resources whose stocks are running low. Integration with other ML systems could allow predictive ordering to optimise prices. Sensor-fusion techniques can also be applied easily, for example by combining a CCTV signal with IR cameras for objects that present variable temperature spectra. Such a system makes it possible to monitor objects, such as batteries, which are at risk of overheating.
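For example, with purely hypothetical SKU names and thresholds (none of these values come from a deployed system), the reordering logic built on top of live counts could be as simple as:

```python
# Hypothetical reorder points and order quantities per SKU
REORDER_POINT = {"crate_large": 120, "crate_small": 200}
ORDER_QTY = {"crate_large": 300, "crate_small": 500}

def reorder_suggestions(counts):
    """Compare live counts from the vision system against reorder points
    and return the quantities to order for any SKU running low."""
    orders = {}
    for sku, count in counts.items():
        threshold = REORDER_POINT.get(sku)
        if threshold is not None and count < threshold:
            orders[sku] = ORDER_QTY[sku]
    return orders

# Live counts as produced by the object-counting system
live_counts = {"crate_large": 95, "crate_small": 480}
print(reorder_suggestions(live_counts))  # {'crate_large': 300}
```

In practice the thresholds themselves could be set by a predictive model rather than fixed by hand, which is where the integration with other ML systems comes in.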


This system was trained using a combination of publicly-available and self-generated data. Whilst this works well in a demonstration environment, training on target environment data will give a higher level of accuracy. Such target environment data may not be available, but the problem lends itself well to the use of synthetic data for training purposes. Furthermore, such data can be easily integrated into the training pipeline.


The Digica team has completed a large range of projects which make use of computer vision.  With the advent of Industry 4.0, the time has come to give to industries that rely on Inventory Management the technology upgrade that they need to stay competitive!

