Friday, August 18, 2023

"Leaked Yandex Code Breaks Open the Creepy Black Box of Online Advertising"

I'll use Yandex to find something that Google has relegated to page 50 or 100 which, of course, they no longer show you.

So it seemed natural to mentally substitute "Google" for each occurrence of "Yandex" and then multiply the yuck factor by 20 to account for the GOOG's much greater reach and far more sophisticated technology. And then multiply by another order of magnitude to account for the company getting rid of the "Don't be evil" motto in 2015. Yikes!

From Wired, August 18:

As the international tech giant moves toward Russian ownership, the leak raises concerns about the volume of data it has on its users.

If you live in Russia, there’s no avoiding Yandex. The tech giant—often referred to as “Russia’s Google”—is part of daily life for millions of people. It dominates online search, ride-hailing, and music streaming, while its maps, payment, email, and scores of other services are popular. But as with all tech giants, there’s a downside of Yandex being everywhere: It can gobble up huge amounts of data.

In January, Yandex suffered the unthinkable. It became the latest in a short list of high-profile firms to have its source code leaked. An anonymous user of the hacking site BreachForums publicly shared a downloadable 45-gigabyte cache of Yandex’s code. The trove, which is said to have come from a disgruntled employee, doesn’t include any user data but provides an unparalleled view into the operation of its apps and services. Yandex’s search engine, maps, AI voice assistant, taxi service, email app, and cloud services were all laid bare.

The leak also included code from two of Yandex’s key systems: its web analytics service, which captures details about how people browse, and its powerful behavioral analytics tool, which helps run its ad service that makes millions of dollars. This kind of advertising system underpins much of the modern web’s economy, with Google, Facebook, and thousands of advertisers relying on similar technologies. But the systems are largely black boxes.

Now, an in-depth analysis of the source code belonging to these two services, by Kaileigh McCrea, a privacy engineer at cybersecurity firm Confiant, is shedding light on how the systems work. Yandex’s technologies collect huge volumes of data about people, and this can be used to reveal their interests when it is “matched and analyzed” with all of the information the company holds, Confiant’s findings say.

McCrea says the Yandex code shows how the company creates household profiles for people who live together and predicts people’s specific interests. From a privacy perspective, she says, what she found is “deeply unsettling.” “There are a lot of creepy layers to this onion,” she says. The findings also reveal that Yandex has one technology in place to share some limited information with Rostelecom, the Russian-government-backed telecoms company.

Yandex’s chief privacy officer, Ivan Cherevko, in detailed written answers to WIRED’s questions, says the “fragments of code” are outdated, are different from the versions currently used, and that some of the source code was “never actually used” in its operations. “Yandex uses user data only to create new services and improve existing ones,” and it “never sells user data or discloses data to third parties without user consent,” he says.

However, the analysis comes as Russia’s tech giant is going through significant changes. Following Russia’s full-scale invasion of Ukraine in February 2022, Yandex is splitting its parent company, based in the Netherlands, from its Russian operations. Analysts believe the move could see Yandex in Russia become more closely connected to the Kremlin, with data being put at risk.

“They have been trying to maintain this image of a more independent and Western-oriented company that from time to time protested some repressive laws and orders, helping attract foreign investments and business deals,” says Natalia Krapiva, tech-legal counsel at digital rights nonprofit Access Now. “But in practice, Yandex has been losing its independence and caving in to the Russian government demands. The future of the company is uncertain, but it’s likely that the Russia-based part of the company will lose the remaining shreds of independence.”

Data Harvesting

The Yandex leak is huge. The 45 GB of source code covers almost all of Yandex’s major services, offering a glimpse into the work of its thousands of software engineers. The code appears to date from around July 2022, according to timestamps included within the data, and it mostly uses popular programming languages. It is written in English and Russian, but also includes racist slurs. (When it was leaked in January, Yandex said this was “deeply offensive and completely unacceptable,” and it detailed some ways that parts of the code broke its own company policies.)

McCrea manually inspected two parts of the code: Yandex Metrica and Crypta. Metrica is the firm’s equivalent of Google Analytics, software that places code on participating websites and in apps, through AppMetrica, that can track visitors, including down to every mouse movement. Last year, AppMetrica, which is embedded in more than 40,000 apps in 50 countries, caused national security concerns with US lawmakers after the Financial Times reported the scale of data it was sending back to Russia.

This data, McCrea says, is pulled into Crypta. The tool analyzes people’s online behavior to ultimately show them ads for things they’re interested in. More than 300 “factors” are analyzed, according to the company’s website, and machine learning algorithms group people based on their interests. “Every app or service that Yandex has, which is supposed to be over 90, is funneling data into Crypta for these advertising segments in one form or another,” McCrea says.

Some data collected by Yandex is handed over when people use its services, such as sharing their location to show where they are on a map. Other information is gathered automatically. Broadly, the company can gather information about someone’s device, location, search history, home location, work location, music listening and movie viewing history, email data, and more.

The source code shows AppMetrica collecting data on people’s precise location, including their altitude, direction, and the speed they may be traveling. McCrea questions how useful this is for advertising. It also grabs the names of the Wi-Fi networks people are connecting to. This is fed into Crypta, with the Wi-Fi network name being linked to a person’s overall Yandex ID, the researcher says. At times, its systems attempt to link multiple different IDs together....