
How Google is using your reCAPTCHA entries to train machine learning models

Google’s reCAPTCHA service is marketed as a means to protect websites from bots. If the system suspects a bot is trying to access a site, it puts up a test that only humans should be able to pass. If you spend enough time on the internet, you will have seen a version of it: a panel of images comes up and you have to select all the images that contain a fire hydrant, a car or a bridge. If you have completed one of these challenges while trying to reach your favourite website, congratulations: you have contributed to some Google machine learning model by labelling data for it. Deep inside Google’s reCAPTCHA webpages, this is what the company says about the use of data captured from this system:
reCAPTCHA also makes positive use of the human effort spent in solving CAPTCHAs by using the solutions to digitize text, annotate images, and build machine-learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.
Let’s take a look at how Google is doing this, speculate about the models we are helping to improve, and consider what I make of a system in which people are unwittingly improving some Alphabet Inc artificial intelligence model.
Quick overview of supervised machine learning
In a nutshell, supervised machine learning models attempt to classify data by learning the patterns, or features, that characterise the different classes. To do this, a supervised machine learning model is supplied with a lot of labelled data, called training data. Labelled data is data that comes with a tag identifying its class. A supervised ML algorithm learns the features associated with each class so that it can classify new data.
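To make the idea concrete, here is a minimal sketch of supervised classification in Python. It is only an illustration under stated assumptions: the features (an object's height and length in metres), the example values and the nearest-centroid method are all hypothetical choices made for brevity, and real image classifiers learn far richer features from pixels.

```python
from collections import defaultdict
from math import dist

# Hypothetical labelled training data: (height_m, length_m) -> class tag.
TRAINING_DATA = [
    ((0.9, 0.3), "fire hydrant"),
    ((0.7, 0.4), "fire hydrant"),
    ((1.5, 4.5), "car"),
    ((1.4, 4.0), "car"),
    ((20.0, 300.0), "bridge"),
    ((35.0, 500.0), "bridge"),
]


def train(examples):
    """Learn one centroid (the mean feature vector) per class."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(model, features):
    """Assign new, unlabelled data to the class with the nearest centroid."""
    return min(model, key=lambda label: dist(model[label], features))


model = train(TRAINING_DATA)
for unlabelled in [(1.6, 4.8), (0.85, 0.3), (25.0, 350.0)]:
    print(unlabelled, "->", classify(model, unlabelled))
```

The labelled pairs play the role of the training data, and `classify` is the step where the trained model is applied to data it has not seen before.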
So, to train an ML model to classify images of trains, planes or boats, for example, thousands of labelled images of each are fed into the algorithm, which uses features such as size, colour and shape to distinguish the classes. After training, one can pass in new, unlabelled images of boats, trains and planes, and the ML model will classify them based on what it learned from the training dataset. Now back to reCAPTCHA.
How is Google collecting data from reCAPTCHA?
As mentioned earlier, if the reCAPTCHA service suspects a bot is trying to interact with a website, it will present a test to confirm you are human. Sometimes it is a simple checkbox. Other times it is the more interesting challenge of selecting images from a set that fit a particular description. Once you have correctly identified the pictures that fit a description you are allowed to access the page you intended to visit. So, what you are doing on these challenges is providing some labelled data that will be used in a training dataset for some AI under the Alphabet Inc umbrella.
The obvious question is: how does Google know when a web user has selected all the images that fit the description? If the benefit for Google is users labelling data for an AI model, surely it doesn’t already know what every image contains. The answer is that when Google presents you with a panel of, say, six images, five of them are already labelled and one is not. You are graded only on the five images Google already has labels for; if you get those right, you pass, and your answer for the sixth, unknown image goes into the AI dataset.
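The mechanism described above can be sketched roughly as follows. Google has not published how it grades challenges or combines answers, so everything here is an assumption for illustration: the function names, the image ids, and the `min_votes` and `min_agreement` thresholds are all hypothetical.

```python
from collections import Counter, defaultdict


def grade_challenge(selected, known_labels):
    """Pass the user only if every already-labelled image is answered correctly.

    selected: set of image ids the user clicked.
    known_labels: image id -> True if the image matches the prompt.
    """
    return all(
        (image_id in selected) == matches
        for image_id, matches in known_labels.items()
    )


def record_vote(votes, unknown_id, selected, passed):
    """Keep the answer for the unlabelled image, but only from users who passed."""
    if passed:
        votes[unknown_id][unknown_id in selected] += 1


def consensus(votes, unknown_id, min_votes=3, min_agreement=0.8):
    """Return the agreed label, or None if there is too little agreement so far."""
    counts = votes[unknown_id]
    total = sum(counts.values())
    if total < min_votes:
        return None
    answer, top = counts.most_common(1)[0]
    return answer if top / total >= min_agreement else None


# Hypothetical panel for the prompt "select all images with a fire hydrant":
# five images with known answers and one unknown image.
KNOWN = {"img1": True, "img2": True, "img3": False, "img4": True, "img5": False}
UNKNOWN = "img6"

votes = defaultdict(Counter)
submissions = [
    {"img1", "img2", "img4", "img6"},
    {"img1", "img2", "img4", "img6"},
    {"img1", "img6"},  # misses two known images, so this answer is discarded
    {"img1", "img2", "img4", "img6"},
]
results = []
for selected in submissions:
    passed = grade_challenge(selected, KNOWN)
    record_vote(votes, UNKNOWN, selected, passed)
    results.append(passed)

label = consensus(votes, UNKNOWN)
print(results, label)
```

Collecting several answers per unknown image before trusting a label is one plausible way to guard against careless clicks; whether reCAPTCHA does this is not something the public documentation confirms.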
What the data is being used for
As for which artificial intelligence this data is being used to train, that is basically unknowable unless you are inside the company. But we can make some educated guesses based on the types of images we’ve been asked to identify. reCAPTCHA challenges tend to involve roads, traffic signals or cars. This may be a clue that the data goes to train some model used by Waymo, Alphabet Inc’s self-driving car company. Google mentions on its webpages that the data could be used to help improve maps, which also makes sense given the images we are presented with. Again, it is difficult to know where all that data ends up without being inside Alphabet Inc.
Final thoughts
I think most people would feel there is a sense of deception or dishonesty in the way Google uses the data we provide for a commercial endeavour without properly notifying users of what is happening. Here’s the thing: I don’t believe most people would be bothered if Google made it explicitly clear that some of the answers from reCAPTCHA will be used to train Google models in the future. I do think it is important, though, to inform people of what is happening and to give them the ability to opt out.
It is also worth noting that this system is only present in reCAPTCHA V2. Google now has a reCAPTCHA V3, which detects bots without interrupting users at all. Instead, reCAPTCHA V3 scores every visitor to the site based on a range of metrics: the lower the score, the more likely you are a bot. However, reCAPTCHA V2 is still active on some websites. I will conclude by saying that more transparency from technology companies should be encouraged. I can only assume the lack of transparency comes from a worry that users will choose not to comply, but that should be a decision for us users to make.