An Introduction to Computer Vision, Part 4
Up until now we’ve discussed how computers “see” images, how algorithms detect objects within an image, and the ingenious shortcut Paul Viola and Michael Jones used 20 years ago to greatly reduce the time and processing power needed to train and run their algorithm. What we haven’t talked about is how algorithms learn to differentiate between objects. After all, an image to a computer is nothing more than an array of numbers. How is it able to tell that one section of numbers is a face and another section is the background? Well, it’s all in the training!
As always, we’re going to start with the intuition. As it turns out, training an image classifier (i.e. an algorithm or model that detects objects) is a lot like teaching an infant to speak and recognize objects. So let’s build our intuition off of that!
Baby’s First Words
Meet Morticia and Gomez Addams!
This beautiful couple is elated to see that their baby boy, Pubert, is on the cusp of saying his first word! With a little help from his parents, he finally manages to say, “Ma-ma”. Morticia and Gomez shower him with praise, and Pubert is just loving it!
He loves the praise so much that he starts saying it all of the time! He picks up a block and proudly exclaims, “Ma-ma!”
While finding it cute, Morticia gently says, “No, that’s a BLOCK,” then points to herself and says, “Ma-ma!”
Since Pubert is not being showered with praise, he instinctively realizes that he made a mistake. He looks at the block and looks at his mother. He sees that his mother is pointing at herself and repeating the one word in his vocabulary. He copies his mother by reaching out toward her and saying, “Ma-ma”. Morticia showers him with praise, and Pubert has now learned that “Ma-ma” isn’t a catch-all term for everything, but is instead a person.
Time passes, and one day Gomez walks by Pubert’s playroom. Pubert wants to get some praise from his father, so he makes some noise to get Gomez’s attention. He reaches out and proclaims, “Ma-ma!” Again, this doesn’t result in the praise he wanted.
Gomez playfully says, “No, not ‘Ma-ma’,” and, like Morticia, points to himself and says, “Da-da!” This confuses little Pubert. He knew that “Ma-ma” was a label for a person. Seeing that his father had all of the features of a person, he thought that the label “Ma-ma” would apply to him as well. So why is he not getting praised?
To help his son, Gomez picks up the little boy and calls for the child’s mother. He points at his wife and says, “Ma-ma,” then points at himself and repeats the new word, “Da-da”.
Pubert looks at his mother and says, “Ma-ma,” and gets praised. He studies her face and her form. He looks back at his father and does the same. He begins to notice that his mother and father have different features. Morticia is slender and has very long hair. His father has a medium build, short hair, and even hair on his face! Taking these differences into account, Pubert looks at his father and states, “Da-da!” His parents are overcome with joy and shower him with love. Not only has the boy added the word, “Da-da” to his growing lexicon, but he’s also learned that “Ma-ma” is a label for a slender person with long, dark hair, and “Da-da” is a label for a person with short hair and a moustache.
Time passes and Pubert sees his sister, Wednesday. She’s a slender girl with long, dark hair, so he calls out to her. “Ma-Ma!” he yells. Wednesday calls for their mother and together they teach him to say a primitive form of his sister’s name, “Wez-day.” He gets the praise he wants.
He then looks back at his mother and says, “Wez-day,” thinking that this new word was just another label for slender people with long, dark hair. Morticia corrects him saying, “No. Ma-ma!” She then directs his attention to his sister and, like before, says, “Wednesday.”
Just like when he learned “Da-da”, Pubert studies his sister and tries to discern which features are unique to Wednesday. He notices that Wednesday is shorter and looks younger than their mother. He calls out, “Wez-day,” and everyone celebrates!
Life continues, and Pubert has many more lessons like this until he learns that “Ma-ma” is a label reserved for his mother, “Da-da” for his father, and so on.
This was a rather long analogy, but let’s quickly break down what happened:
- Pubert learns the word “Ma-ma” but doesn’t really understand what it means. He just knows he gets rewarded for saying it.
- He calls a toy block “ma-ma”, because he doesn’t know any better. His mother corrects him and he learns that “ma-ma” is a term for people.
- Later, he calls his father “Ma-ma,” since his father has all of the qualities (features) of a person. Gomez and Morticia work together to teach him the word “Da-da” and, after some studying, Pubert learns that there is a difference between his mother and father and associates these differences with their respective labels.
- The features Pubert considers important for calling someone “Ma-ma” are still too general, and he ends up referring to his sister as such. He then has to figure out which features are unique to his sister and refine the features that define “Ma-ma”.
As Pubert is exposed to more people, he learns from his mistakes by studying what sets his mother apart from the people around him. This story probably sounds very familiar to parents, as this is a common part of teaching a child to speak. To a computer vision expert, this sounds familiar because it’s pretty close to what happens when we train an image classifier. We compile a large set of images of the target object we want the algorithm to recognize, as well as images that are not of the target. The algorithm studies the target images and, in a sense, compares the two sets to determine what sets the target apart from everything else. It repeats this process over and over again until it determines what the most important features of the target object are.
Before we go into the specifics, we should learn some key terminology in the context of image classification. These terms are:
- True-Positive: The classifier correctly recognizes and labels the target object (Pubert sees his mother and says, “Ma-ma”).
- True-Negative: The classifier correctly recognizes that the image is not of the target object (Pubert sees his father, knows he’s not his mother, and says, “Da-da”).
- False-Positive: The classifier mistakenly labels a non-target object as the target (Pubert sees his sister and calls her “Ma-ma”).
- False-Negative: The classifier mistakenly labels the target object as a non-target (Pubert calls his mother “Wez-day”).
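If it helps to see those four outcomes as code, here’s a minimal sketch (the function and the tiny example data are purely illustrative) that tallies them for a batch of predictions and derives the overall accuracy from the counts:

```python
def confusion_counts(predictions, labels):
    """Tally true/false positives and negatives for a batch of predictions.

    Both arguments are lists of booleans: True means "target object"
    (a face, or "Ma-ma"), False means "not the target object".
    """
    tp = fp = tn = fn = 0
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            tp += 1      # true-positive:  said "face" and it was a face
        elif predicted and not actual:
            fp += 1      # false-positive: said "face" but it wasn't
        elif not predicted and not actual:
            tn += 1      # true-negative:  correctly said "not face"
        else:
            fn += 1      # false-negative: said "not face" but it was a face
    return tp, fp, tn, fn

# Tiny made-up example: six images, four of them called correctly
tp, fp, tn, fn = confusion_counts(
    predictions=[True, True, False, False, True, False],
    labels=     [True, False, False, True, True, False],
)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # fraction of correct calls
print(tp, fp, tn, fn, accuracy)              # -> 2 1 2 1 0.666...
```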
With that out of the way, let’s dive into how image classifiers are trained.
You’re going to need images. A LOT of images. Specifically, you’ll need two sets of images. One set should be compiled of images of the target object you want your algorithm to detect, and you’ll want them to be featured in a variety of settings. When I say “settings”, I mean that the images shouldn’t ALL come from the same background, at the exact same angle, with the same lighting, and at the same distance from the camera. Generally, the settings should vary since, in practice, the target object will rarely, if ever, be photographed under the exact same circumstances. If all or most of your images were taken under such sterile, identical circumstances, it would have a huge impact on the training process, a point we’ll address shortly.
The second set should contain images of literally anything else. The entire point of this set is to help the algorithm learn what the target object isn’t and narrow down which features are unique to the target. If you really want to train your algorithm well, you should include many images that share similarities with the target object. For example, if you wanted to make an algorithm that could detect a Walmart, you would want to include pictures of Best Buy, Target, and other stores in this set. That way, the algorithm can learn that a Walmart isn’t just any large cement building.
Number of Images
As for the number of images in each set, it depends on how much time and resources you have. Image classifiers such as Google’s Xception model were trained on the ImageNet dataset, which contains over 14 million images. That’s a lot of images, and it would take a very long time to train on even with top-of-the-line hardware. The Viola-Jones algorithm was trained on nearly 5,000 face images and thousands of non-face images. If you want a reliable model, you’re going to want at least a few hundred images in each set. Having a roughly even distribution (e.g. about a 50–50 split of target and non-target images) is highly recommended.
Labeling
In each of these sets, the images should be labeled to indicate whether or not they’re an image of the target object (e.g. “Walmart” or “Not Walmart”). While this can be painstaking and mind-numbing, it is a crucial step since both of these sets will eventually be combined. Without labels to use as an answer key, the algorithm will have no way of evaluating whether its predictions were correct. Labeled data helps ensure that your algorithm’s predictions are reliable and that the number of false-positives and false-negatives stays low.
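In practice, a lot of the labeling effort comes for free from how you organize the files. Here’s a minimal sketch, assuming a made-up folder layout with a walmart/ directory of target images and a not_walmart/ directory of everything else, that builds a list of (file path, label) pairs:

```python
from pathlib import Path

def build_labeled_dataset(root):
    """Pair every image path with a label based on the folder it lives in.

    Assumes a (hypothetical) layout like:
        root/walmart/*.jpg      -> label 1 (target)
        root/not_walmart/*.jpg  -> label 0 (non-target)
    """
    dataset = []
    for folder, label in (("walmart", 1), ("not_walmart", 0)):
        for path in sorted((Path(root) / folder).glob("*.jpg")):
            dataset.append((path, label))
    return dataset

dataset = build_labeled_dataset("training_images")   # hypothetical directory
print(f"{len(dataset)} labeled images, "
      f"{sum(label for _, label in dataset)} of them are the target")
```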
Image Transforming
As we’ve said in previous lessons, each image is converted from color to grayscale. This is true both in training and in practice. It makes the math easier and helps the algorithm run faster.
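If you’re using a library like OpenCV, the conversion is a one-liner. A minimal sketch (the file name is just a placeholder):

```python
import cv2  # OpenCV

# Load a color image; OpenCV reads it as a height x width x 3 array (BGR order)
image = cv2.imread("clooney.jpg")            # placeholder file name

# Collapse the three color channels into a single intensity channel,
# so every pixel is one number instead of three
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

print(image.shape, "->", gray.shape)         # e.g. (480, 640, 3) -> (480, 640)
```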
Another transformation is rescaling the image to a lower resolution. This reduces the computational complexity by reducing the total possible number of Haar-like features that can fit within the image. Viola and Jones reduced their images to 24×24 pixels, and each image still contained over 180,000 possible features. This transformation is only done during training, however. Once the algorithm is trained, the images it runs on keep their original resolution; instead, the features and their weights (more on weights below) are scaled up to match.
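If you’re curious where a number that big comes from, the sketch below brute-force counts how many sizes and positions of the five basic Haar-like feature shapes fit inside a 24×24 window. It comes out a little under the figure quoted above (the exact total depends on which feature variants you count), but it shows why even a tiny window hides a huge set of features:

```python
def count_haar_features(window=24):
    """Count every size and position of the five basic Haar-like feature
    shapes that fits inside a square window of the given size."""
    # Base shape of each feature type, in "cells" (width, height)
    shapes = {
        "edge (horizontal)": (2, 1),   # two rectangles side by side
        "edge (vertical)":   (1, 2),   # two rectangles stacked
        "line (horizontal)": (3, 1),   # three rectangles side by side
        "line (vertical)":   (1, 3),   # three rectangles stacked
        "four-rectangle":    (2, 2),   # 2x2 checkerboard
    }
    total = 0
    for name, (dx, dy) in shapes.items():
        count = 0
        # Every scaled-up size of the base shape...
        for w in range(dx, window + 1, dx):
            for h in range(dy, window + 1, dy):
                # ...at every position where it still fits in the window
                count += (window - w + 1) * (window - h + 1)
        print(f"{name:18s}: {count:,}")
        total += count
    print(f"{'total':18s}: {total:,}")   # -> 162,336 for a 24x24 window

count_haar_features()
```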
Creating Training and Test Sets
After labeling all of the images, you shuffle the two sets together. Once shuffled, you split the data into a larger set for training and a smaller set for testing the algorithm. The reason for this is that once an algorithm or machine learning model has been trained, it recognizes the data it was trained with and, as a result, already knows what to predict. By reserving a small portion of the data (the validation or test set) for evaluation AFTER training is complete, we get a sense of how accurately the algorithm actually performs, because it has never seen this particular data before.
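As a rough sketch (the 80/20 split and the fixed random seed are just common conventions, not hard rules), the shuffle-and-split step can look like this:

```python
import random

def train_test_split(dataset, test_fraction=0.2, seed=42):
    """Shuffle the labeled examples and hold out a fraction for testing."""
    examples = list(dataset)
    random.Random(seed).shuffle(examples)    # fixed seed -> reproducible split
    cutoff = int(len(examples) * (1 - test_fraction))
    return examples[:cutoff], examples[cutoff:]

# Stand-in for the (image, label) pairs built during the labeling step
dataset = [(f"image_{i}.jpg", i % 2) for i in range(1000)]

train_set, test_set = train_test_split(dataset)
print(len(train_set), "training examples,", len(test_set), "held out for testing")
# -> 800 training examples, 200 held out for testing
```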
Let’s refer back to the Viola-Jones algorithm once more to discuss training. During training, the model is fed various face and non-face images. It maps out all of the possible Haar-like features in each image and makes a prediction on whether or not the image contains a face. Let’s say it starts with this image of George Clooney:
It’s the first image it has seen, and it has literally nothing else to go off of. It maps out the Haar-like features in the image and predicts “face” for no reason (kinda like a baby’s first words). It’s correct, and so the algorithm says to itself, “Okay, I am looking for something that looks exactly like this!” So it assigns a weight to every single Edge, Line, and Four-rectangle feature in this image. A “weight” is just a numerical value representing how important the feature is. The larger the number, the more important that feature is.
It should also be noted that when I say every feature, I mean EVERY feature! At this point, with an untrained algorithm, every single feature that can possibly be detected (even in the background) is fair game. In our hypothetical scenario, not only is Georgie boy’s face considered part of what makes a “face”, but so is the blurred-out background. Why? Because it doesn’t know any better. This is the only image it has seen so far. As such, it considers the background just as important as the actual face.
The next image it sees is this:
For some odd reason, the person compiling the sets included this image of a seagull. Since it shares very few of the Haar-like features from Clooney’s photo, the algorithm predicts “Not Face”. Since it’s correct, the algorithm doesn’t change its opinion of what a face is (currently, a single picture of George Clooney), and none of the weights are adjusted. This is exactly why the non-target images should share similarities with the target: it creates a higher chance of a false-positive, which forces the algorithm to adjust the feature weights and reconsider which features are the most important in order to avoid making that mistake again.
The next image it sees is this picture of Chris Pratt:
While this certainly has more features in common with Clooney’s picture than the seagull’s, remember that at this point the algorithm thinks a “face” is George Clooney with a blurry background (every feature in that picture is currently of equal importance). Because of this, let’s say that our algorithm predicts that this isn’t a face. BOOM! Our first false-negative! Now the algorithm looks at what similarities the two images have and adjusts the weights of the features it considers to be more important. Since both pictures have fuzzy backgrounds and a clear outline of the head and shoulders, we’ll say our algorithm notices that shared outline and thinks, “Alright, this shape must be a face.” It adjusts the weights of the features that make up the head and shoulders to make them more important.
Then it sees this picture:
We have all of the outline features that we saw before, and the algorithm now has a little bit of wiggle room for interpretation thanks to the adjusted weights, so let’s say that it predicts “face”. BOOM! False-positive! Its definition of a face as simply the outline of a human body is too general. Now the algorithm re-evaluates and goes back to the features that Clooney and Pratt shared. They both have eyes! Perhaps eyes are the most important feature! So it adjusts the weights accordingly and moves on to the next photo.
Well, this person has no eyes, so it can’t possibly be a face! It predicts “Not Face”. Another false-negative. Again, the algorithm re-evaluates and decides that, since all of the faces so far had noses, noses must be what defines a face!
Ok, so it’s not the nose. They all have lighter skin!
You can see where I’m going with this. This is why you need so many different images. With each image, the algorithm learns what is and is not a face and adjusts which combination of features it looks for to minimize the probability of a false-positive or false-negative. And just like the baby in our analogy, the algorithm discovers more and more features that are unique to the target object.
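The real weight math behind Viola-Jones is AdaBoost, which we’ll get to next week, but the mistake-driven loop described above can be sketched in a heavily simplified form. Everything in this sketch (the feature scores, the learning rate, the zero threshold) is made up purely to show the shape of the loop: predict, check the label, and only nudge the weights when the prediction was wrong:

```python
def train_toy_classifier(examples, num_features, learning_rate=0.1, passes=10):
    """Heavily simplified, made-up training loop (NOT actual AdaBoost).

    Each example is (feature_values, label): feature_values is a list of
    numbers (one per Haar-like feature), label is 1 for "face", 0 otherwise.
    """
    weights = [0.0] * num_features
    for _ in range(passes):
        for feature_values, label in examples:
            score = sum(w * f for w, f in zip(weights, feature_values))
            prediction = 1 if score > 0 else 0
            if prediction != label:
                # Wrong answer (a false-positive or false-negative): nudge the
                # weights toward the features that would have produced the
                # right answer, like Pubert revising what "Ma-ma" means after
                # being corrected.
                direction = 1 if label == 1 else -1
                weights = [w + direction * learning_rate * f
                           for w, f in zip(weights, feature_values)]
            # Correct answer: leave the weights alone (like the seagull image).
    return weights

# Tiny made-up dataset with three "feature values" per image
examples = [
    ([0.9, 0.1, 0.8], 1),   # face-like
    ([0.2, 0.9, 0.1], 0),   # not face-like
    ([0.8, 0.2, 0.7], 1),
    ([0.1, 0.8, 0.2], 0),
]
print(train_toy_classifier(examples, num_features=3))
```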
While there is a lot more math involved (something we’ll go over next week), the overall process of training an image classifying algorithm is very intuitive and almost human in nature.