So why does the neural network sometimes detect a given shape and structure as a car, and sometimes as a plane? What other factors can we check to find out what makes the difference?
We could argue that there are differences between the cars themselves that lead the neural network to classify essentially the same objects as different types. That's true, but let's see what happens when we analyse the same object in different contexts.
For example, in the picture below, I have added a plane above the bird:
We can see that the plane is correctly detected as a plane.
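(If you want to reproduce these experiments yourself, any off-the-shelf detector will do. Below is a minimal sketch using torchvision's pretrained Faster R-CNN; this is a stand-in model, not necessarily the exact detector behind these screenshots, and the file path is a placeholder.)

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn_v2,
    FasterRCNN_ResNet50_FPN_V2_Weights,
)

# Load a pretrained COCO detector and its matching preprocessing.
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn_v2(weights=weights).eval()
preprocess = weights.transforms()

img = read_image("bird_with_plane.jpg")  # placeholder file name
with torch.no_grad():
    prediction = model([preprocess(img)])[0]

# Print the class names of reasonably confident detections.
labels = [weights.meta["categories"][i] for i in prediction["labels"]]
for label, score in zip(labels, prediction["scores"]):
    if score > 0.5:  # confidence threshold, chosen arbitrarily
        print(f"{label}: {score:.2f}")
```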
In the picture below, I have placed the plane below the bird.
Aha! So, when the plane is placed below the bird, it is not detected at all.
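(Compositing collages like these is straightforward with Pillow. A minimal sketch, assuming you have a plane crop with an alpha channel; the file names and coordinates are placeholders.)

```python
from PIL import Image

base = Image.open("bird.jpg").convert("RGBA")    # placeholder path
plane = Image.open("plane.png").convert("RGBA")  # crop with transparency

# Paste the plane below the bird; the coordinates are arbitrary.
x, y = 200, 450
base.alpha_composite(plane, dest=(x, y))
base.convert("RGB").save("bird_plane_below.jpg")
```

The later variant with six planes is the same paste, just repeated in a loop over several coordinates.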
In the next example, I have placed several copies of the plane all around the bird.
And now an object with the same features (a plane) is detected as a bird. Hmm, what has changed?
The context changed drastically from a plane being positioned below a bird to the same bird being surrounded by six objects. How often do we see planes very close to each other? And how often do we see birds close to each other either in small or large groups? What's more, how often do we see any birds flying above any planes?
It's starting to look as if the neural network is learning not only the individual features of an object, but also features of the object's context. In other words, the neural network takes into account elements of the scene as a whole.
In what other ways does the context change how a neural network detects objects?
Let's take a look at the image below.
And now let's add a second cat:
What differences can you see? On the face of it, everything seems to be fine. The second cat is correctly detected, but … what about detecting the two toothbrushes or the three bottles?
So adding a cat to the image has an effect on detecting a toothbrush! Isn't that weird?
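(One way to make this effect concrete is to count detections per class before and after the edit. A sketch that reuses model, preprocess, weights, read_image, and torch from the detection snippet above; the file names are placeholders.)

```python
from collections import Counter

def detect_counts(path, threshold=0.5):
    """Count confident detections per class name in one image."""
    img = read_image(path)
    with torch.no_grad():
        pred = model([preprocess(img)])[0]
    names = [weights.meta["categories"][i] for i in pred["labels"]]
    return Counter(n for n, s in zip(names, pred["scores"]) if s > threshold)

before = detect_counts("one_cat.jpg")   # placeholder paths
after = detect_counts("two_cats.jpg")
for cls in sorted(set(before) | set(after)):
    print(f"{cls}: {before[cls]} -> {after[cls]}")
```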
Let's conclude with one final example. What if we take the background out completely, and then try to detect any remaining objects?
Hmm, a bear is detected as a "bird". I did not see that coming!
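(Removing the background is a simple masking operation: keep the pixels inside a region of interest and zero out everything else. A sketch with NumPy and Pillow; the file name and box coordinates are placeholders.)

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("bear.jpg"))   # placeholder path
x0, y0, x1, y1 = 120, 80, 400, 360       # placeholder box around the bear

masked = np.zeros_like(img)                   # all-black canvas
masked[y0:y1, x0:x1] = img[y0:y1, x0:x1]      # keep only the bear region
Image.fromarray(masked).save("bear_black_bg.jpg")
```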
Let's paste a small part of another image (a sheep) next to the bear:
Oh dear! Our bear detector is really not doing a good job, is it? By adding a sheep to an image of a bear, the bear has become a sheep!
And what happens if we add noise to the background of the image?
Ooh, that's an interesting result. Finally, we detected a bear!
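(The noisy variant simply replaces the black pixels with random values. A sketch that continues from the masking snippet above, reusing img and the box coordinates.)

```python
rng = np.random.default_rng(0)
noise = rng.integers(0, 256, size=img.shape, dtype=np.uint8)

noisy = noise.copy()
noisy[y0:y1, x0:x1] = img[y0:y1, x0:x1]   # the bear region stays intact
Image.fromarray(noisy).save("bear_noisy_bg.jpg")
```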
So our results suggest that:
- A bear (on a black background) is a bird!
- A bear (on a black background, with a sheep beside it) is a sheep!
- And, finally, a bear (with a sheep beside it, placed on a noisy background) is a bear!
I don't know about you, but I'm rather confused. What is a bear? What does a bear look like? How can we correctly detect a bear?
To summarise:
What we have seen is that different contextual clues can make a crucial difference to how well a neural network detects objects. And we see that, when designing our models and solutions, we must take into account clues such as:
- the position of an object relative to other objects (a plane above or below a bird),
- the number of similar objects in the scene (a single plane versus a group of six),
- the presence of other objects in the image (adding a cat changed the toothbrush detections),
- the background itself (natural, black, or noisy).
As you can see, the object to be detected is not defined by the object itself. In addition to the features of the given object, neural networks also detect and try to process many relevant contextual clues. As Data Scientists, we do not yet fully understand which factors directly influence the final quality of detections, but it is clear that we must always "bear" (pun intended) in mind that a single (seemingly irrelevant) change in one part of the image can have a dramatic effect on the final quality of the detection.
Resources:
https://arxiv.org/pdf/2101.06278v3.pdf
https://arxiv.org/pdf/2202.05930.pdf
https://github.com/shivangi-aneja/COSMOS