reCAPTCHA: The tradeoff between privacy and detection

Liel Strauch, director of cyber security research, PerimeterX

Anyone who uses the Internet to purchase goods or services will be familiar with reCAPTCHA, Google’s service that aims to differentiate between human users and automated ones. It was designed as a more effective version of the classic CAPTCHA tool, which serves the same purpose. The classic version has users type in a series of case-sensitive letters and numbers, whereas the Google version merely requires ticking an ‘I’m not a robot’ box to establish that the user is human.

Google developed these tools to combat the growing bot activity that continues to wreak havoc on the Internet, particularly in ecommerce, which faces a variety of bot threats. These include account takeover, scraping, scalping and checkout abuse. For example, bots can be used to buy valuable and scarce items, such as concert tickets or the latest pair of Kanye West’s Yeezy sneakers, so that their operators can resell them at hugely inflated prices once the items have sold out through official channels.

Mickey Alton, research team lead, PerimeterX

That said, new research on “Hacking Google reCAPTCHA v3 using Reinforcement Learning,” presented recently at the RLDM 2019 conference in Montreal, Canada, revealed that software using machine learning techniques can pass itself off as human more than 90 percent of the time against Google’s reCAPTCHA v3. In other words, Google’s attempts to oust the bots have not been as successful as the company had hoped.

Privacy protection methods can be mistaken for bot activity

The study shows that today’s CAPTCHAs are becoming easier for bots and harder for humans, instead of serving their original purpose of stopping bots. Human users who have taken unusual steps to protect their privacy may find that reCAPTCHA works against them instead of for them. Machines now solve CAPTCHAs with a success rate of over 90 percent, and CAPTCHA-solving services keep multiplying. A tool designed to show that a human user is present isn’t very efficient at blocking bots, and it even discriminates against humans.


The researchers are justified in claiming that this system would most definitely discriminate against humans and would not necessarily recognize bot behavior. For example, the research cited above discusses the different factors Google uses to determine whether a CAPTCHA solver is a human or a bot, and one of them is the use of Tor. Using Tor isn’t an indication of bot behavior, and yet it is treated as a major incriminating factor.

Moreover, Tor and other anonymity services are common ways to access websites that are blocked by a dictatorial state. Ironically, in such cases the visitor is blocked twice: once by the oppressive state, and once again by the egalitarian technology company. Using Tor does not trigger an automatic rejection in reCAPTCHA, but it is certainly a significant factor in people being blocked by it.

According to PerimeterX’s own data, at least 52.3 percent of the traffic coming from Tor is linked to humans, meaning the traffic doesn’t manifest automated behavior. Basing detection mechanisms on Tor-related traffic or users choosing stricter privacy settings would not be our preferred practice.
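To put that figure in concrete terms, here is a minimal arithmetic sketch. The 52.3 percent share comes from the data cited above; the function name and the count of 1,000 sessions are made up purely for illustration and do not describe any real product.

```python
# Illustrative arithmetic only. The 52.3 percent figure is the one cited
# in the article; the session count below is a made-up round number.
HUMAN_SHARE_OF_TOR = 0.523  # reported as a lower bound ("at least")


def humans_blocked(tor_sessions: int, human_share: float = HUMAN_SHARE_OF_TOR) -> int:
    """Humans rejected if every Tor session were treated as a bot."""
    return round(tor_sessions * human_share)


if __name__ == "__main__":
    # Out of 1,000 hypothetical Tor sessions, how many humans are turned away?
    print(humans_blocked(1000))
```

Under that assumption, a rule that rejected all Tor traffic would turn away at least 523 humans for every 1,000 Tor sessions, which is more people than bots.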

Are Google accounts a good way to authenticate users?

The Google approach of checking whether the user is authenticated to a Google service when defining a bot score is a valuable practice. However, the privacy invasion and the discrimination against users who try to stay anonymous (as in the case of Tor) might not be worth it. Having a Google account requires a valid phone number, which undermines the privacy these users want. Moreover, even the Google account requirement can be “breached” pretty easily with a fake Google account and burner numbers.

Source: PVACreator

For example, a quick search reveals many services (such as the one pictured above) that offer phone-verified account creation. Bots can easily use a fake Google account to authenticate themselves, while humans who seek anonymity by not using a Google service, and who are not necessarily aware of the above option, get discriminated against.

As demonstrated in the example above, putting the quality of bot detection aside, the major issue that is not being addressed enough is the damage to users’ privacy, whether they are aware of it or not.

The modern reality enforced by Google often leaves users no choice but to identify themselves with a Google account that holds sensitive personal information and could potentially be used to track browsing habits and history. And, as mentioned above, attackers can easily bypass this defense mechanism with minimal cost and effort. Can the tech giants be trusted with one of the most private assets we have, our identity?

Google’s own system might make it harder for humans to beat the reCAPTCHA system, and easier for bots, while making users pay a price—their privacy. Understanding the complexities of bot activity online, and the diversity of humans using the Internet, without invading humans’ privacy, is crucial to any attempts by companies like Google to fix the bot problem. Failure to do so could have dire results for human users and excellent results for bots and those who use them for nefarious purposes.

The ideal solution is a detection mechanism for bots based on their entire behavioral flow. Anything less makes this tool ineffective at its sole purpose.
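As a rough illustration of what scoring an entire behavioral flow could look like, here is a minimal sketch. Every signal, threshold and weight in it is a hypothetical assumption chosen for readability; it does not describe how reCAPTCHA or any PerimeterX product actually works.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Session:
    # All fields are hypothetical signals chosen for illustration.
    event_gaps_ms: list[float]  # time between successive user actions
    pages_visited: int
    via_tor: bool  # recorded, but deliberately not used in the score


def bot_score(session: Session) -> float:
    """Return 0.0, 0.5 or 1.0; higher means more bot-like (arbitrary scale)."""
    gaps = session.event_gaps_ms
    score = 0.0
    # Near-constant timing between actions hints at scripted input.
    if len(gaps) >= 3 and pstdev(gaps) < 5.0:
        score += 0.5
    # A very fast average pace sustained over a long flow is another hint.
    if gaps and sum(gaps) / len(gaps) < 50.0 and session.pages_visited > 20:
        score += 0.5
    return score


if __name__ == "__main__":
    human_on_tor = Session([850.0, 2300.0, 410.0, 1720.0], pages_visited=6, via_tor=True)
    scripted = Session([20.0, 20.0, 20.0, 20.0], pages_visited=30, via_tor=False)
    print(bot_score(human_on_tor), bot_score(scripted))
```

In this sketch the privacy-conscious human on Tor scores 0.0 and the scripted session scores 1.0, because the score depends only on how the session behaves and not on where it comes from.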

PerimeterX provides application security technology, including Bot Defender and Code Defender. Liel Strauch and Mickey Alton contributed to this report.
