ALL ARTICLES FOR Data Science

Data is all around us, and we don't even see it.

 

 

Data Scientists usually work on projects related to well-known topics in Data Science and Machine Learning, for example, projects that rely on Computer Vision, Natural Language Processing (NLP) and Predictive Maintenance. However, at Digica, we're working on a few projects that do not really focus on processing visual data, text or numbers. In fact, these unusual projects focus on types of data that are flowing around us all the time, but that nevertheless remain invisible to us because we cannot see them.

 

       1. WiFi           

WiFi router

WiFi technology generates a lot of waves around us, and this data can convey more information than you think. Having just a WiFi router and some mobile devices in a room is enough for us to detect what is happening in that room. Movement distorts the waves in such a way that we can then easily detect that movement, for example, when someone raises a hand.

 

The nature of WiFi itself makes it pretty easy to set up this technique. Firstly, as mentioned above, we don't even need to use any extra instruments or tools, such as cameras; it's enough to have just a router and some mobile devices. And secondly, this technique can even work through walls, which means we can use it throughout the whole house without thinking about cables or adding any extra equipment to each room.


Some articles have already been published on human gesture recognition using this type of wave, for example, this article.

In that and other articles, you can read about how the algorithm can generate a pretty detailed picture. For example, the algorithm can recognize a person's limbs one by one, and then construct a 3D skeleton of that person. In this way, it is possible to reproduce many elements of a person's posture and gestures. It's actually a really cool effect, as long as a stranger is not looking at someone else's data, which would be quite creepy!

 

         2. Microwaves

 

Microwave oven

 

I'm sure that you have used microwaves to heat up a meal or cook food from scratch. And you may also be familiar with the idea of medical breast imaging. However, you might not know that those two topics use the same technology, but in different ways. 

 

It turns out that, when microwaves are directed at breast tissue, the waves reflected back from healthy tissue look different from the waves reflected from malignant tissue. "So what", you may say, "we already have mammography for that." Yes, but mammograms involve a higher exposure to radiation. And it is really difficult to distinguish healthy tissue from malignant tissue in mammogram images of dense breasts, as described in this link. Microwaves were first studied in 1886 but, as you can see, they are now being put to new uses, such as revealing malignant tissue in a way that is completely non-invasive and harmless to people.

 

By the way, microwaves are also perfect for weather forecasting. This is because water droplets scatter microwaves, and using this concept helps us to recognize clouds in the sky!

 

            3. CO2

 

Last but not least, we have Carbon Dioxide. This chemical compound is actually a great carrier of information. Did you know that CO2 can very accurately indicate the number of people in a room? Well, it does make sense, because we generate CO2 all the time as a result of breathing. However, it's not at all obvious that we can estimate the number of people in a given room with around 88% accuracy!

 

When this approach is set up, we can seamlessly detect, for example, that a room is unoccupied, and therefore it would be a good idea to save money by switching off all the electronics in that room. So this can be a great add-on to every smart home or office.

 

You might think that the simplest way to find out if a given room is unoccupied is to employ hardware specifically for this purpose, such as cameras and RFID tags. However, such a high-tech approach entails additional costs and, most importantly these days, carries the risk of breaching people's privacy. On the other hand, as described above, the data is already there, and just needs to be found and utilised to achieve the required outcome.

 

In the simplest case, we just read the level of CO2 in a room, and plot that level against time. Sometimes, for this task, we can also track the temperature of the room, as in this experiment. However, note that temperature data is often already available, for example, from air-conditioning systems. We only need to read the existing data, and then analyse that data correctly in order to provide the insight that is required in the particular project.
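As a rough illustration of the idea, the sketch below reads a log of CO2 measurements, plots the level against time and flags the periods when the room is probably occupied. The file name, the column names and the 600 ppm threshold are illustrative assumptions, not values from the experiment mentioned above.

```python
# Minimal sketch: plot CO2 readings over time and flag likely occupancy.
# "co2_log.csv", the column names and the 600 ppm threshold are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

readings = pd.read_csv("co2_log.csv", parse_dates=["timestamp"])

# Smooth the raw sensor signal a little before thresholding.
readings["co2_smooth"] = readings["co2_ppm"].rolling(window=10, min_periods=1).mean()
readings["occupied"] = readings["co2_smooth"] > 600  # ppm, picked by eye for this room

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(readings["timestamp"], readings["co2_ppm"], label="CO2 [ppm]")
ax.fill_between(readings["timestamp"], 0, readings["co2_ppm"],
                where=readings["occupied"], alpha=0.3, label="probably occupied")
ax.set_xlabel("time")
ax.set_ylabel("CO2 concentration [ppm]")
ax.legend()
plt.show()
```

Estimating the exact number of people, rather than simple occupancy, would require a model trained against ground-truth counts, but the input data stays the same.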



There are many, many more types of data that are invisible to the human eye, but offer an amazing playground for Data Scientists. For example, there are radio waves, a type of electromagnetic wave with a wavelength longer than that of microwaves. There are infrared waves, which have a wavelength shorter than that of microwaves and are great for thermal imaging. And then there are sound waves, which we can use for echo-location (like bats). The above waves were the first ones that came to mind, but I'm sure that there are many other sources of invisible data that can be re-used for the purposes of Data Science.

 

 

 

 

One very important part of working as a Data Scientist is often overlooked. I'm referring to the part which involves visualising data and results. You probably don't think that this is an exciting task, but let me explain why it is so important. Usually, Data Scientists are fascinated by making complicated models and trying out new architectures. However, as the job title suggests, the main part of a Data Scientist's work is to do with data. For example, as a Data Scientist, at the very start of every project, you begin by familiarizing yourself with the datasets on which you will be working. Of course, the easiest way to do this is just to open the files and look at them. In the case of images, it is quite simple to understand the content of the image without using special tools or packages. On the other hand, visualisation is often the best option not only when it comes to gaining your first insights into data, but also when it comes to searching for and identifying the first set of patterns, trends or outliers.

Our brains are fascinating “machines” which can take in a great deal of information from just one picture. It is not without reason that we say that a picture is worth a thousand words. So that is why we need to create high quality visualisations. By high quality I'm not really referring to the choice of colours, but more to how the information is presented. Unfortunately, nearly every day, we see many graphs and plots that have been poorly designed. The problem is not that they look poor from an aesthetic point of view, but that the way in which information is presented is misleading or deceptive (sometimes even on purpose). For example, Reuters published the plot below, and I consider it to be one of the best examples of a misleading plot.

 

 

Above, we see a line plot with marked points and, at first sight, the plot looks correct. One of the points in the timeline has a description ("2005 Florida enacted …"), which seems to show that a change was made in the law and that, as a result, there was a drop in the number of murders. But wait … let's look carefully at the scale on the y-axis. From what we learned at school (that 0 is at the bottom left-hand corner and that larger values are placed progressively higher up the axis), we expect every graph to follow the same rules. However, in this case, the author inverted the y-axis (0 is at the top rather than at the bottom), which means that it is not immediately obvious how to interpret the plot. The simplest way to fix this is to flip the plot about the x-axis. After that, you will see that the red colour is no longer the background of the plot, but is really the area under the curve. As a result, you will come to conclusions based on the corrected plot that are quite different from any conclusions which you can draw from the original plot.

 

One type of plot is doomed to failure. You might be surprised to learn that I'm talking about the pie chart, which you know so well from many presentations and the media. Unfortunately, this chart has a fundamental problem with how it is defined. This is because human brains are not good at analysing angles. In a pie chart, we compare angles to say which group is bigger or smaller compared to the 100% data total. Well, we can say that element A has a bigger or smaller representation but, if the question is in the form of “how many times is part A bigger than part B”, then it is impossible to answer that question without listing the actual percentages for A and B. Another problem is that it seems to be impossible to compare two pie charts by just looking at the plots. Here is an example: 

 

Visualisation blog

If we want to know the difference between any part of the left plot and the same part of the right plot, then we can only guess, because we do not know what the underlying numbers are. I hope that you now understand why scientists of all types hate pie charts. John Tukey, who was a statistician, said "There is no data that can be displayed in a pie chart, that cannot be displayed BETTER in some other type of chart". And a 3D pie chart is even worse than a standard pie chart! Maybe the 3D version looks fancier to some people, but such a chart only makes it more difficult for readers to really understand the structure of the data.
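To see this for yourself, here is a small matplotlib sketch (with made-up numbers) that plots the same four categories as a pie chart and as a bar chart; judging which category is largest is much easier from the bars than from the wedges.

```python
# Same made-up data shown as a pie chart and as a bar chart.
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
values = [28, 24, 26, 22]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
ax_pie.pie(values, labels=labels)
ax_pie.set_title("Pie chart: angles are hard to compare")
ax_bar.bar(labels, values)
ax_bar.set_title("Bar chart: lengths are easy to compare")
plt.tight_layout()
plt.show()
```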

 

The choice of colour palette is another challenge when it comes to data visualisation. This especially applies to a choropleth map, in which different elements are assigned different colours. For example, you probably remember geography lessons in which you looked at a map which showed mountains, lowlands, seas, etc. in different colours. This approach can be helpful, especially when we look at a map and want to quickly find, for example, mountain ranges.

 

 

On the other hand, the example above shows how a colour palette can cause confusion in our understanding of data. When you look at the scale on the left side, you see that the minimum value is represented by purple and the maximum value by red. However, if we want to find out which colour represents the mean value in this range, we cannot figure out the answer to that question. Next, note that some parts of the map are coloured white. Usually white means that a value is 0 or that there is no information for a given area. However, that is not the case in this plot, where white is assigned to values around 200 metres, and 200 metres is not the mean value in this range. The final problem here is that, if you look at the values on the scale, you see that the intervals for each colour are about 20 metres wide, but the last interval spans about 3000 metres. So that's a big surprise.

 

Did you know that, if you want to cheat or hide some "inconvenient" data, bar plots are the best choice for that? Of course, I'm not suggesting that you would want to mislead anyone; I only want to help you not to be misled by others. One method of cheating is to create the illusion of a "big difference" by using a scale which does not begin at 0. When we don't start from zero, we only see the tip of the iceberg, with a starting point that was convenient for the author. A good example of that is the plot below. At first sight, it looks as if women in Latvia are several times taller than their counterparts in India. But this is not true! The real difference is about 5 inches, which is less than 10% of the average height across the data set.
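The sketch below reproduces the trick with two illustrative bars (the heights are approximate, not the exact figures from the original plot): the left panel truncates the y-axis and makes the gap look dramatic, while the right panel starts from 0 and shows how small the difference really is.

```python
# Two views of the same two bars: truncated y-axis vs. axis starting at 0.
# The heights are approximate, illustrative values.
import matplotlib.pyplot as plt

countries = ["Latvia", "India"]
height_in = [67, 62]  # average female height in inches (approximate)

fig, (ax_cheat, ax_fair) = plt.subplots(1, 2, figsize=(9, 4))

ax_cheat.bar(countries, height_in)
ax_cheat.set_ylim(60, 68)   # truncated axis exaggerates the gap
ax_cheat.set_title("Axis starts at 60: looks dramatic")
ax_cheat.set_ylabel("average height [inches]")

ax_fair.bar(countries, height_in)
ax_fair.set_ylim(0, 70)     # full axis shows a difference of under 10%
ax_fair.set_title("Axis starts at 0: honest picture")

plt.tight_layout()
plt.show()
```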

 

 

From the examples given above, you can see that it is crucial to present data and information in the best way, so that it is possible to understand them clearly and correctly. So, when you are at the stage of exploring a new set of data, I suggest that you try out various ways of presenting the data, and then see which form of presentation works best in achieving a particular outcome. At this stage of a project, make sure that you don't make any assumptions about the data, as this can lead to misunderstandings and then to making wrong decisions, for example, when choosing the right model.

 

Visualisation is also often used at the very end of a project, when you have your results and you want to present them to either your teammates or your clients. It's important that you don't make any mistakes at this stage when presenting data. If you make a mistake, then stakeholders may misunderstand the results, which can cost time and money, and can lead to them making poor decisions further down the line. Don't ever underestimate the full power of visualisations. As humans, we prefer to just look at pictures rather than read through all the data. However, that means that presenting data in this summary way carries with it important obligations. In conclusion, think carefully about how you present to readers any plot, graph or other graphical representation of data.

 

 

 

 

 

When you want to take the best possible care of your brain, certain things are recommended, such as eating fatty fish and vegetables, doing some brain exercises, and learning new things. But what about artificial neural networks? Of course, they don't need human food, but they do need data of the best quality and in the right quantity. For this purpose, you need to ask yourself exactly what task you want to solve, and where you can find relevant datasets. Are you searching for datasets that apply to specific tasks like "medical image segmentation" or "entity recognition"? Or maybe you are searching for datasets in a particular field of science, such as ion spectrometry or ECG signals.

Here are some possible sources of the best datasets for your machine learning project:

 

 1. Google Dataset Search

[https://datasetsearch.research.google.com/]

In 2018, Google launched a search engine for datasets. From my perspective as a Data Scientist, it was a game changer. Finally, I have a dedicated web search engine for datasets. You can search through datasets based on their descriptions on the relevant webpages. Many of the datasets have descriptions of the measurement process and the content. You also have direct links to the datasets and information about the file type (.zip, .xml, .pdf, .csv, etc.). Some of the datasets even have information about the type of license that applies to the dataset. As you know, license information is crucial, especially in commercial projects.

 

 

 2. Kaggle

[https://www.kaggle.com/]

 

I would say that every Data Scientist knows of this website, and has visited it at least once. You may have heard of Kaggle as a platform where you can compete against others on Data Science problems. But have you ever thought about it as a good source of datasets?

 

On Kaggle you can search for datasets based on main categories such as Computer Vision or Natural Language Processing, and also based on file type, dataset size or type of license. If an interesting dataset is in table format, you can look into a summary of each column to see, for example, a histogram of numerical values or a list of unique strings. You can even do basic filtering in the dataset or see a summary of the number of samples per class, so that you can check whether your potential dataset is balanced or not.
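If you prefer to pull datasets straight into your pipeline, Kaggle also has an official Python package. The sketch below is a minimal example; it assumes you have an API token stored in ~/.kaggle/kaggle.json, and the dataset slug is only an illustration.

```python
# Minimal sketch of downloading a Kaggle dataset with the official
# "kaggle" package; assumes an API token in ~/.kaggle/kaggle.json.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# The slug below is only an example; replace it with the dataset you want.
api.dataset_download_files("zynicide/wine-reviews", path="data", unzip=True)
```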

 

 

 

3. UCI Machine Learning Repository

[https://archive.ics.uci.edu/ml/index.php]

 

On this website, you can search for datasets based on a number of criteria: type of task (classification, regression, clustering), type of data attributes (categorical, numerical or mixed), data type (text, sequential, time-series, multivariate) or subject area. Each dataset has a description and links to download the data. Many of the datasets published on this website have already been used in scientific research, which means that they have high quality descriptions and also a list of papers in which they are mentioned. This is useful because, with this information, you can check whether someone has already done what you want to do. Unfortunately, there are only 622 datasets there.
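Many UCI datasets are plain CSV-like files, so you can often load them with nothing more than pandas and the file URL from the dataset page. A quick sketch, using the classic Iris dataset as an example:

```python
# Load a UCI dataset directly from its URL; Iris is used as an example.
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

iris = pd.read_csv(url, header=None, names=columns)
print(iris["class"].value_counts())  # quick check of class balance
```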

 

 

 4. Papers with Code

[https://paperswithcode.com/datasets]

 

For me, this is one of the best web pages for every Data Scientist, and I'm excited to explain to you why I feel that way. If you want to do some quick research on a specific topic, for example, image semantic segmentation, you can search this web page by using an existing tag or searching in the browser. You can search for papers with an implementation stored on GitHub, along with the datasets on which any models were trained. For the most common tasks, you have a plot with summary results and a list of the top five solutions, with links to the repository and the name of the deep learning framework (usually PyTorch or TensorFlow).

So, on this website, when you're looking for the best dataset, you can also search for the best architecture for the problem that you want to solve. I highly recommend this website if you are doing a research task and want a summary of existing methods with implementations.

 

 

 5. Hugging Face

[https://huggingface.co/docs/datasets/index]

 

Hugging Face is a platform with datasets, pre-trained models and even "spaces" where you can check how a solution works on some examples. Some of the solutions presented on the platform have links to a GitHub implementation. There are also tutorials and how-to guides, which makes this platform suitable for less experienced Data Scientists. I can especially recommend this website for projects related to NLP and audio processing, because there are so many datasets for multiple languages. However, if you are more interested in computer vision, then you can also find interesting things on this website.
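Datasets hosted there can be pulled with the Hugging Face "datasets" library. A minimal sketch, using the IMDB sentiment dataset purely as an illustration:

```python
# pip install datasets
from datasets import load_dataset

imdb = load_dataset("imdb")              # downloads and caches the dataset
print(imdb)                              # splits, number of rows, column names
print(imdb["train"][0]["text"][:200])    # peek at the first training review
```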

 

 

   

   6. Create your own dataset       

 

        a. API

 

Sometimes you want to use data from specific web services, for example, Spotify, or from websites such as social media, for example, Twitter. Some of these services or sites have an API, so you only need to create a developer account and download the relevant data. Using an API usually requires authorisation, in most cases OAuth, which gives you, as a developer, the required access token, access secret, etc. In Python, you also have special packages dedicated to specific platforms, such as Spotipy (to access Spotify), Tweepy (to access Twitter), or python-facebook-api. These packages have built-in functions for specific types of request, like "Find account names which contain a specific substring" or "Get the 10 most popular tweets about football". For each API, you need to review the documentation to be sure that you get the data which you want. You also need to remember to preprocess the data before using it in your machine learning or deep learning model.
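As a rough illustration, here is what pulling data through one of these wrappers can look like with Spotipy; the client ID and secret are placeholders for your own developer credentials, and the query is just an example.

```python
# Minimal Spotipy sketch; replace the placeholder credentials with your own.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

auth = SpotifyClientCredentials(client_id="YOUR_CLIENT_ID",
                                client_secret="YOUR_CLIENT_SECRET")
sp = spotipy.Spotify(auth_manager=auth)

results = sp.search(q="artist:Queen", type="track", limit=10)
for item in results["tracks"]["items"]:
    print(item["name"], "-", item["popularity"])
```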

 

        b. Web scraping

 

If the service from which you want to get data does not have its own API, or you just want to take data from, for example, a shopping website, you can do so-called web scraping. For that, you need to know a little about HTML, and it helps if you can write your own regular expressions. Web scraping is probably more time-consuming than the other approaches described above in this article, but sometimes web scraping is the only possible way to get the data that you want. For web scraping in Python, I recommend using the BeautifulSoup package.
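A minimal scraping sketch with requests and BeautifulSoup is shown below; the URL and the CSS selector are placeholders that you would adapt to the page you want to scrape (and always check the site's terms of service first).

```python
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every element matching a (placeholder) CSS selector.
names = [tag.get_text(strip=True) for tag in soup.select(".product-name")]
print(names)
```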

 

 

Summing up, as a Data Scientist you have access to many sources of datasets. Some sources contain well-prepared, high quality datasets, whilst other datasets are more challenging to source, but you often have more control over the latter sources of data. You will have to choose the strategy which works best in a given context. I hope that this article gives you the information that you need to make an informed decision about the best approach.

 

 

 

It’s been over 20 years since Test Driven Development (TDD) was introduced to the software development community. Since that time, TDD has become a standard best practice and, more importantly, it has saved countless hours for those who use it. I don’t think I’ve ever met a professional software developer who has never played with TDD.

 

Quantum Machine

Image generated with Midjourney API, “Quantum Computing”

 

 

The blogs so far have been based on facts and knowledge, backed by relentless theory, empirical experience and evidence in the form of practical solutions. Today's article will be of a slightly different nature and will focus on a field still in development, rather in its empirical infancy, which is still far from generating practical real-life solutions. We are talking about quantum computing (QC) - a field with which the technological world on the one hand associates computational salvation, and on the other fears an irreversible change to some of the systems on which our technological civilisation operates. It is worth quoting at this point the words of the great Polish visionary and science-fiction writer Stanisław Lem:

The risk can be of any magnitude, but the very fact of its existence implies the possibility of success

~ Stanisław Lem, The Magellanic Cloud

 

From a theoretical perspective, quantum computing offers the computing power to break, in a short period of time, today's ciphers securing our money in the bank, but there is still uncertainty over whether the barrier between theory and practice will prove too high, in terms of physical possibilities, in the coming years. I believe, however, that it is worth pointing your gaze to the future, even if it is uncertain. After all, even 30 years ago few would have believed in solutions that today we take for granted as perhaps even boring everyday life.

Known areas where quantum computing can lead to a revolution are, e.g., the design of new drugs and materials, the improvement of artificial intelligence and optimization tasks such as fleet management of taxis, trucks, ships, etc. As quantum computers are increasing in maturity, research on algorithms that are dedicated to utilising the power of quantum computers is moving from being a niche that only a few people look at theoretically, to an active area of research on a larger scale. More practically relevant application areas are expected in the future.

In today's blog we will focus on a particular field where QC is likely to find its application - quantum machine learning, an area combining quantum computing with classical machine learning, which is something we at Digica like the most :)

 

The basis of quantum computing - qubit as a carrier of information

 

Where does the computational power of quantum computing lie? A qubit, the quantum bit of information, has an unusual feature derived from the laws of quantum mechanics: it can be not only in state 0 or state 1, but also a little bit in 0 and a little bit in 1 at the same time - it is in a superposition of the two states. Similarly, eight qubits can be in all states, from 0 to 255, at once. The implications of this are momentous. A classical byte requires reading a sequence of bits, and the processor processes only one bit at a time. A qubit, which is actually a kind of probability cloud that determines the possibility of each state, allows all these states to be processed simultaneously. So we are dealing with parallel processing, which in modern electronics would correspond to the use of multiple processors.
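A toy numpy simulation makes the "eight qubits, 256 states" claim concrete: the state of n qubits is a vector of 2^n amplitudes, and applying a Hadamard gate to every qubit puts the register into an equal superposition of all of them. This is, of course, only a classical simulation of the mathematics, not a real quantum device.

```python
# Toy illustration: the state of 8 qubits is a vector of 2**8 = 256 amplitudes.
import numpy as np

H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)      # single-qubit Hadamard gate

n_qubits = 8
state = np.zeros(2 ** n_qubits)
state[0] = 1.0                            # all qubits start in |00000000>

# The Hadamard applied to every qubit is the 8-fold tensor (Kronecker) product.
H_all = H
for _ in range(n_qubits - 1):
    H_all = np.kron(H_all, H)
state = H_all @ state

print(state.shape)                           # (256,) - one amplitude per basis state
print(np.allclose(state, 1 / np.sqrt(256)))  # True: equal superposition of all 256 states
```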

 

Comparison of bit and qubit states.

[https://blog.sintef.com/digital-en/diving-deep-into-quantum-computing/]

 

The performance of quantum computers does not depend on any clock - there is none here at all. Performance is determined by the number of qubits. Adding each additional qubit pays off by doubling the speed of computation. In a single act of reading we would then receive information, for the processing of which a classical computer would consume centuries. The juxtaposition is striking: in order to achieve performance exceeding the best modern supercomputer, it is "enough" to construct a device consisting of around 1 million qubits (equivalent to 2^1,000,000 classical bits).

 

IBM Q Quantum Computer; Photo by LarsPlougmann [https://newrycorp.com/insights/blog/technology-readiness-of-quantum-computing/]

 

When discussing the possibilities of quantum computing, we must also take into account the existing problems the technology faces in the empirical field. Quantum computers are exceedingly difficult to engineer, build and program. As a result, they are crippled by errors in the form of noise, faults and loss of quantum coherence, which is crucial to their operation and yet falls apart before any nontrivial program has a chance to run to completion. This loss of coherence (called decoherence), caused by vibrations, temperature fluctuations, electromagnetic waves and other interactions with the outside environment, ultimately destroys the exotic quantum properties of the computer. Given the current pervasiveness of decoherence and other errors, contemporary quantum computers are unlikely to return correct answers for programs of even modest execution time. While competing technologies and competing architectures are attacking these problems, no existing hardware platform can maintain coherence and provide the robust error correction required for large-scale computation. A breakthrough is probably several years away.

 

Quantum Machine Learning

 

Machine learning (ML) is a set of algorithms and statistical models that can extract information hidden in data. By learning a model from a dataset, one can make predictions on previously unseen data taken from the same probability distribution. For several decades, machine learning research has focused on models that can provide theoretical guarantees of their performance. But in recent years, heuristics-based methods have dominated, in part because of the abundance of data and computational resources. Deep learning is one such heuristic method that has been very successful.

 

With the increasing development of deep machine learning, there has been a parallel, significant increase in interest in the ever-expanding field of quantum computing. Quantum computing involves the design and utilisation of quantum systems to perform specific computations, with quantum systems described by a generalisation of probability theory that brings unique behaviours such as superposition and quantum entanglement into play. Such behaviours are difficult to simulate on classical computers, so one area of research that has attracted growing attention is the design of machine learning algorithms that rely on quantum properties to accelerate their performance.

 

The ability to perform fast linear algebra on a state space that grows exponentially with the number of qubits has become a key feature motivating the application of quantum computers for machine learning. These quantum-accelerated linear algebra-based techniques for machine learning can be considered the first generation of quantum machine learning (QML) algorithms, which address a wide range of applications in both supervised and unsupervised learning, including principal component analysis, support vector machines, k-means clustering and recommender systems (https://arxiv.org/pdf/2003.02989.pdf). The main deficiency that arises in these algorithms is that proper data preparation is necessary, which amounts to embedding classical data in quantum states. Not only is such a process poorly scalable, but it also deprives the data of the specific structure that gives advantages with classical algorithms while questioning the practicality of quantum acceleration.

 

Quantum Deep Learning

 

When we talk about quantum computers, we usually mean fault-tolerant devices. They will be able to run Shor's algorithm for factorization (https://arxiv.org/abs/quant-ph/9508027), as well as all the other algorithms that have been developed over the years. However, power comes at a price: in order to solve a factorization problem that is unfeasible for a classical computer, we will need many qubits. This overhead is needed for error correction, since most quantum algorithms we know are extremely sensitive to noise. Even so, programs running on devices larger than 50 qubits quickly become extremely difficult to simulate on classical computers. This opens up the possibility that devices of this size could be used to perform the first demonstration of a quantum computer doing something that is unfeasible for a classical computer. It will probably be a highly abstract task and useless for any practical purpose, but it will be proof-of-principle nonetheless. It would be a stage when we know that devices can do things that classical computers can't, but they won't be big enough to provide fault-tolerant implementations of familiar algorithms. John Preskill coined the term "Noisy Intermediate-Scale Quantum" (https://arxiv.org/abs/1801.00862) to describe this stage. Noisy because we don't have enough qubits for error correction, so we will have to directly exploit imperfect qubits in the physical layer. And "Intermediate-Scale" because of the small (but not too small) number of qubits.

 

By analogy, just as machine learning evolved into deep learning with the emergence of new computational capabilities, so, with the theoretical availability of Noisy Intermediate-Scale Quantum (NISQ) processors, a second generation of QML has emerged, based on heuristic methods that can be studied empirically thanks to the increased computational capabilities of quantum systems. These are algorithms using parameterized quantum transformations called parameterized quantum circuits (PQCs) or quantum neural networks (QNNs). As in classical deep learning, the parameters of PQCs/QNNs are optimised against a cost function, using black-box optimization heuristics or gradient-based methods, to learn representations of the training data. Quantum processors in the near term will still be quite small and noisy, so distinguishing and generalising quantum data will not be possible using quantum processors alone; NISQ processors will therefore have to work with classical coprocessors to become effective.
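To make the idea of a hybrid classical-quantum model more tangible, here is a minimal sketch using the PennyLane simulator. The circuit layout, the toy data and the cost function are purely illustrative; a real QML model would be considerably more elaborate.

```python
# pip install pennylane - a toy parameterized quantum circuit (PQC/QNN)
# trained with a classical gradient-based optimizer.
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)   # classical simulator of 2 qubits

@qml.qnode(dev)
def circuit(params, x):
    # Encode the classical input into rotation angles...
    qml.RY(x[0], wires=0)
    qml.RY(x[1], wires=1)
    # ...then apply a trainable, entangling layer (the "quantum" part of the model).
    qml.RX(params[0], wires=0)
    qml.RX(params[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

def cost(params):
    # Toy objective: push the circuit output for one fixed input towards -1.
    x = np.array([0.3, 0.7])
    return (circuit(params, x) + 1.0) ** 2

params = np.array([0.1, 0.2], requires_grad=True)
opt = qml.GradientDescentOptimizer(stepsize=0.4)
for _ in range(50):
    params = opt.step(cost, params)        # classical optimizer, quantum circuit

print(float(cost(params)))                 # the cost decreases as the parameters are trained
```

The division of labour in the sketch mirrors the pipeline in the figure below: the quantum device only evaluates the circuit, while the parameter updates happen on a classical coprocessor.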

 

Abstract pipeline for inference and training of a hybrid classical-quantum model

[https://arxiv.org/pdf/2003.02989v2.pdf]

 

 

Conclusions

Let's answer the question in the title of the article - Is Quantum Machine Learning hot or not? The possibilities opened up by the laws of quantum mechanics when used for quantum computing are extremely enticing and promising. In the context of machine learning, the prospect of accelerating many of the algorithms of both classical machine learning and deep learning instils excitement, and quantum computing seems to be a technology that could offset some of the pains that existing classical solutions encounter. Unfortunately, things that are possible in theory sometimes have a technological barrier in front of them, which is especially true for quantum computing. Nevertheless, in my opinion Quantum Machine Learning is hot and, although it may seem like a distant prospect, that does not detract from its advantages and the many solutions it promises. As is often the case in life, let time tell.

 

 

 

 

 

 

 

Here comes the second edition of our special Christmas treat for you!

 

(The first one was about Santa Classification using classical Computer Vision techniques. Not only was it accurate - catching even the Grinch - but it also provided feedback about the exact pixels behind the classification. And its third advantage is that it showed me how bad I was at writing back then.)

 

This year we had to raise the stakes and show you a technology far less covered. What’s more, it has a far bigger business context than you might think.

 

You can’t detect Santa in the dark… or can you?

 

In our previous article, we didn't consider one thing: Santa detection will be quite challenging when there is very little light. And that's when Santa usually arrives… What technology should we use then?


Actually, it turns out that there is one technology that is both perfect for Santa classification AND something that my company specialises in… it's radars! Why?

 

  • As I mentioned before, they help detect objects in the dark. A radar produces, transmits and receives back electromagnetic waves. No need for light here! But there's more to it: even in daylight, typical computer vision analysis can be hard when there's rain, snow, or direct sunlight. Radars are resistant to that (hence their popularity in automotive technology and in Santa classification).
  • Santa will usually be far away. That's when radars also come into play: they can spot objects that are even a few kilometres away! It means we can even spot Santa visiting your neighbour and skipping your house.
  • Radar data helps keep a lot of information confidential. You don't keep images of the suspected objects on your disk, just their spectra. It obviously increases Santa's privacy. And if your kid finds your Santa classifier folder on the disk, they won't immediately recognize what you're doing.
  • Contrary to typical cameras, even the radar itself can be hidden from the human eye - it can be placed behind some opaque surface without any effect on the outputs. It means you don’t have to reveal to your neighbours what you’re up to!
  • Radars' output provides the azimuth of an object, which means we can get to know the direction of its movement. It's an obvious advantage in our case. We should expect Santa to arrive from the North Pole… This is what our radar detector looks like from our Polish office's perspective:

 

 

  • Radars provide speed information as well. I'm not sure of the exact speed of Santa's sleigh (do you?) but even a rough approximation can be handy to rule out planes and Superman.
  • There is another interesting feature to be obtained from radars, which is called Radar cross-section (RCS). In short, it says how easy the object is to observe. This is connected with the object's size, obviously, but also with the material it is made from (which determines, literally speaking, how much of the radio wave the object reflects back to the sensor). What's the takeaway for us? Let's see the image below:

 

Source: https://discovery.ucl.ac.uk/id/eprint/10134022/1/open_radar%20(2).pdf

 

As you can see, using the RCS feature is quite a hint when telling apart blue areas (vehicles) and red areas, like drones (represented by the "uav" category here - that is, "unmanned aerial vehicles"). People and bicycles are somewhere in the middle.

I don't know about you, but for me Santa will probably fall into the blue category. Don't be misled by the flying part of the "UAV" class - here we focus on how detectable an object is. In this case the size will actually be quite substantial when we take into account Santa and his sleigh… and the reindeer. Long story short, we'll probably look for something the size of a vehicle. Just in the sky. Hence the blue distribution.

How about the orange and green categories in the image? That's where, e.g., speed (the feature we mentioned earlier) can come into play. The actual spectra shapes will also help us then - but we'll mention them in a bit.

As I said, radar data analysis is something we do at Digica quite often. I don't want to shower you with all the details - it's Christmas after all (besides, you can always contact us about it). However, there are a few technical hints and fun* facts about radar data that are not so obvious.

*Please note, though, that these facts are "fun" for the author, who is a Data Scientist.

  • Contrary to typical visual data (which has, e.g., 3 colour channels), radar data typically has just one channel (we can imagine it as data in just "black and white").
  • It is much harder to work with radar data than with typical camera-based RGB images, because typical images are far more common. It means there are almost no models trained on radar data on the internet. However, it is possible to reuse models trained on… RGB data. Just take some typical pretrained model (like ResNet) as a base and customise it for your radar data, as if you were applying typical transfer learning. Remember the bit about radar data having just one channel, whereas RGB models expect 3? Just stack three identical copies of the radar channel on top of each other (see the sketch after this list). It works better than it sounds.
  • The reason such things can work is that radar outputs do indeed look like images. You can still find patterns there that could help you tell some objects apart, like in the spectra below:

 

Source: https://www.mdpi.com/1424-8220/21/1/210

For some weird reason I couldn't find similar sources with Santa examples.

  • Another atypical thing about radar data is that you cannot freely play with it in the augmentation phase, as you would with typical images. Take, for example, top-bottom modifications:

 

Source: https://towardsdatascience.com/image-augmentation-for-deep-learning-histogram-equalization-a71387f609b2

 

The parts where you get rid of the cat's head are a big no-no (in general it's inadvisable, but here for even more reasons). As the vertical axis often represents speed, such modifications might damage radar data by destroying very useful information. I mean - try it out if you'd like, but such data (cropped in the upper or lower part of the image) will never occur in real-life scenarios.

Shifts along the horizontal axis do make sense, though. Moving the signal from left to right usually just means changing when it appears in time.
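Here is the promised sketch of the channel-stacking trick with an ImageNet-pretrained ResNet in PyTorch. The shapes, the number of classes and the weights argument are illustrative (older torchvision versions use pretrained=True instead of the weights enum), and there is no training loop - it only shows how single-channel radar data can be fed into an RGB model.

```python
# Reusing an ImageNet-pretrained ResNet on single-channel radar spectra
# by repeating the channel three times. Shapes and classes are illustrative.
import torch
import torch.nn as nn
from torchvision import models

# Pretend batch of radar spectrograms: (batch, 1, height, width).
radar_batch = torch.randn(8, 1, 224, 224)

# Repeat the single radar channel so the input matches the 3-channel RGB
# layout that the pretrained network expects.
rgb_like = radar_batch.repeat(1, 3, 1, 1)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g. "Santa" vs "not Santa"

logits = model(rgb_like)
print(logits.shape)   # torch.Size([8, 2])
```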

As you can see, radar data is useful but has to be handled well. By the way, some people also appreciate radars for tasks like weather monitoring, military applications, geology, medicine or navigation - but let's be honest: who cares about that if you can focus on spotting Santa with his reindeer instead?

PS. This article wouldn't exist if it weren't for my fantastic colleague and radar expert in one person, Joanna Piwko. Or it would be much dumber and consist of 10 Santa memes.

If I were to point out the most common mistake of a rookie Data Scientist, it would be their focus on the model, not on the data.
