

By Damian Ruck
SORTING YOUR HOLIDAY SNAPS
Artificial intelligence is changing the way we live. Mundane tasks that once ate up a whole tedious afternoon can now be completed in minutes by a machine, whilst you get on with life. One example: organizing photographs. Few want to spend their Sundays sorting holiday snaps of family members from images of landscapes. Fortunately, there are now computer vision APIs freely available on the internet that can categorize your images for you, or that a business can use to offer that service to you. For us, that’s interesting because we specialise in defending AI systems. To defend, you have to be able to attack, so let’s take a look at one of these commercial APIs.
The computer science literature is filled with cases of AI systems being fooled into misclassifying images. These attacks work because it’s possible to add carefully calibrated noise to an image of, say, a cat, in a way that makes an AI system fail to recognize it, even though any human can still see that the image contains a cat. A subset of these attacks works even when we only have access to the inputs and outputs of a model, not its inner workings — a so-called black-box attack.
ATTACKS IN THE WILD
These attacks may be possible in the lab, but we wanted to see if they worked in the wild — after all, Advai was set up to identify ways that AI systems can be tricked and manipulated in the real world. The image classifier we looked at was accessible via a free web-based API which allowed us to automatically upload images for classification, so long as we didn’t exceed our daily usage limit.
The image below is of Han Solo, Luke Skywalker and Princess Leia (pivotal characters from the Star Wars franchise, for those who are not fans). When we showed it to the online image classifier, it told us there was an 84% chance the image was of “people”. Our aim was to add carefully calibrated noise to the image so that it looked identical to the original to any human, but the classifier would think it was something totally different. Why? Well, there was a good technical reason (seeing how vulnerable the API was), but what if we wanted to help the Rebel Alliance upload photos without being flagged by the Sith’s automatic detection system? (Getting too geeky?) Okay, as a real-world problem this matters, because criminals will try to do the same thing to systems designed to detect malicious content, fraud, and so on.
As mentioned, a black-box attack uses just the inputs (the image) and the outputs (the probability that the image was recognized as “people”). We chose one of the simplest black-box attacks out there: the aptly named Simple Black-box Attack (SimBA).
In a nutshell, SimBA proceeds in steps, adding a small amount of randomly selected noise to the image at each step. The new image is retained if the classifier becomes less confident that the image is of “people”; if not, we keep the previous image and try another batch of random noise. This is a little like how complex species evolve: changes (mutations) are random, but we eventually end up with something highly non-random because of the selection process.
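To make that concrete, here is a minimal Python sketch of the greedy loop. It is not the exact code we ran against the API: the `query_classifier` function is a hypothetical stand-in for the real upload-and-classify call, the step size and query budget are illustrative, and the published SimBA algorithm samples perturbation directions from an orthonormal basis and tries both signs, whereas this sketch simply nudges one randomly chosen pixel per step.

```python
import numpy as np

def query_classifier(image: np.ndarray) -> float:
    """Hypothetical stand-in for the real API call: upload `image`
    and return the classifier's confidence that it shows "people"."""
    raise NotImplementedError("replace with a real API request")

def simba_attack(image: np.ndarray, epsilon: float = 0.05,
                 max_queries: int = 1000) -> np.ndarray:
    """Greedy SimBA-style loop: try one random perturbation per step
    and keep it only if the "people" confidence drops."""
    x = image.astype(np.float32).copy()
    best_conf = query_classifier(x)
    queries = 1

    while queries < max_queries:
        # Pick a random pixel and a random direction for the nudge.
        idx = tuple(np.random.randint(0, s) for s in x.shape)
        direction = np.random.choice([-1.0, 1.0])

        candidate = x.copy()
        candidate[idx] = np.clip(candidate[idx] + direction * epsilon * 255, 0, 255)

        conf = query_classifier(candidate)
        queries += 1

        if conf < best_conf:
            # Keep the change only if the classifier became less confident.
            x, best_conf = candidate, conf

    return x
```

Note the `max_queries` budget in the sketch: it matters for more than politeness, as the next point explains.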
Besides being simple, SimBA has the advantage of being query efficient, meaning we do not have to send many requests to the API to generate our adversarial image. That matters, because every interaction with the image classifier risks detection. Send too many queries and we exceed our daily limit, which blunts the attack; worse, we might trigger an alert inside the image classifier and get our account blocked. Being query efficient makes SimBA both faster and stealthier.
A LACK OF CONFIDENCE
The one drawback of SimBA is that it requires confidence scores. This is not unusual for image attack methodologies, but it is something certain online image classifiers do not have to provide. For example, the image classifier we were attacking told us it was 84% sure that our initial image was of “people”, but its designers could just as easily have reported simply “people” and left the percentage out. Without knowing that our perturbed image had an 83% chance of being “people”, compared with 84% for the original, we would have had no way of knowing whether the algorithm was on the right track. Without the confidence percentages, both images would simply have been classified as “people”; to us they would have looked the same, and SimBA would have been nullified. This is why one defense recommendation Advai makes is to never reveal confidence scores unless there is a good reason to do so.
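A pair of hypothetical API responses shows why the scores matter: with percentages, SimBA’s acceptance test has something to compare; with labels alone, the two responses are identical and the attack has no signal to follow.

```python
# Hypothetical classifier responses, for illustration only.

# With confidence scores, a tiny drop (84% -> 83%) tells SimBA it is on the right track.
original = {"label": "people", "confidence": 0.84}
perturbed = {"label": "people", "confidence": 0.83}
keep_change = perturbed["confidence"] < original["confidence"]   # True: progress detected

# With label-only output, both responses look the same, so there is nothing to select on.
original_label_only = {"label": "people"}
perturbed_label_only = {"label": "people"}
keep_change = perturbed_label_only != original_label_only        # False: no signal, SimBA stalls
```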
Our initial attempts at fooling the image classifier led us to discover an interesting quirk. For some reason, the first modification we made to the image always increased the algorithm’s confidence that it was looking at an image of “people” (from 84% to around 86%). This happened every time, which left SimBA spinning its wheels at square one.
Because we were supervising our attack, we found a work-around. We cut our losses on the first round of manipulations and accepted that, whatever we did, the first perturbation was always going to make the classifier more confident. By restarting the attack with this first perturbed image as our starting point, we could begin to drive the confidence down from the inflated 86%.
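In code, the work-around amounts to applying the first perturbation unconditionally and then re-running the attack with the perturbed image as the new baseline. This sketch reuses the hypothetical `query_classifier` and `simba_attack` helpers from the earlier sketch; the step size and query budget are illustrative only.

```python
import numpy as np

def attack_with_restart(image: np.ndarray, epsilon: float = 0.05) -> np.ndarray:
    x = image.astype(np.float32).copy()

    # Step 1: apply the first perturbation unconditionally and accept the
    # confidence bump (in our runs "people" rose from ~84% to ~86%).
    idx = tuple(np.random.randint(0, s) for s in x.shape)
    x[idx] = np.clip(x[idx] + epsilon * 255, 0, 255)

    # Step 2: restart the attack with this perturbed image as the new baseline,
    # so later changes are judged against the inflated confidence score.
    return simba_attack(x, epsilon=epsilon, max_queries=1000)
```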
As the animation below shows, SimBA started moving once we took the first perturbed image as the starting point. After that initial uptick in confidence, it took only 56 more iterations for the confidence to fall back to the level of the original image. Then, within 920 iterations, the image classifier had changed its mind and decided the image was now most likely a “landscape”.
But just how different was the new image? Perhaps, after 920 cycles, it really did look more like a landscape? As we can see below, however, the new image was indistinguishable to the human eye from the original, despite the image classifier seeing “people” in one and a “landscape” in the other. The only difference was a very small amount of noise, evolved through the iterative process set in motion by SimBA (the right-most image).
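Readers who want to check this kind of claim on their own images can simply subtract the original from the adversarial image and look at the size of the difference. A rough sketch, assuming Pillow and NumPy are installed and using hypothetical file names:

```python
import numpy as np
from PIL import Image

# Hypothetical file names; any two same-sized images work.
original = np.asarray(Image.open("original.png"), dtype=np.float32)
adversarial = np.asarray(Image.open("adversarial.png"), dtype=np.float32)

# The adversarial noise is just the pixel-wise difference.
noise = adversarial - original

print("max per-pixel change:", np.abs(noise).max())
print("mean per-pixel change:", np.abs(noise).mean())
print("L2 norm of perturbation:", np.linalg.norm(noise))

# Rescale the noise so it is visible when saved as an image.
visible = ((noise - noise.min()) / (np.ptp(noise) + 1e-8) * 255).astype(np.uint8)
Image.fromarray(visible).save("noise_rescaled.png")
```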
SO, WHAT IS THE RESULT?
We have shown that a popular online image classifier is vulnerable to being fooled. We had no access to its inner machinery, and we generated our attack without exceeding the daily query limit available to anyone free of charge. Making Star Wars characters look like a landscape seems whimsical — and it kind of is — but there is a serious side. What if the attacker were a terrorist and the image were recruitment propaganda? It would have been misclassified and slipped past the filters. Cases like this make the urgency of defending against these attacks apparent.
This raises the question: how do we protect image classifiers against these kinds of attack? That’s what Advai is here for.