50 corporations track third-party data from 88 percent of 1 million top websites

November 17, 2015

Percentage of sites tracked by top 50 corporations. These 50 corporations were monitoring the greatest percentage of third-party data from the Alexa top one million sites studied. (credit: Tim Libert)

A survey of 1 million top websites finds that 88 percent share user data with third parties, according to Tim Libert, a doctoral student at the Annenberg School for Communication at the University of Pennsylvania. The study was published in an open-access paper in the International Journal of Communication.

These websites, listed on Alexa, contact an average of nine external domains, indicating that the activity of a single person visiting a single site may be tracked by multiple entities.

Libert discovered that a handful of U.S. companies receive the vast bulk of user data worldwide, led by Google, which tracked users on nearly 80% of the websites studied. Other top trackers on the sites studied include Facebook (32%), Twitter (18%), ComScore (12%), Amazon (12%), and AppNexus (12%).

Elevated security risks

While this monitoring of your behavior doesn’t signal nefarious activity or a security breach, it does increase the risk of one taking place. “If your data is being sent to several companies,” says Libert, “that creates new potential points of failure where your data could be hacked or leaked.”

Using the contents of NSA documents leaked by Edward Snowden, Libert also determined that roughly 20% of websites are potentially vulnerable to known NSA spying techniques. And since a number of large companies are aggregating user data from countless smaller sites, this makes it even easier for government agencies to gather data.

This information may be used to serve you more relevant advertising or to deliver content more quickly, but the ways in which your data is used are not always transparent. And efforts to opt out of that data collection can prove frustrating.

Ostensibly there should be a solution: the browser’s “Do Not Track” (DNT) setting. However, Libert’s study found that, with the notable exception of Twitter, DNT requests are totally ignored.

“It’s up to regulators to work with companies to comply, because they’re not doing it on their own,” says Libert. “The infrastructure to respect people’s privacy preferences exists, and it works. But unless there are real financial consequences for corporations, they’re just going to ignore DNT. Self-regulation to date has been a big failure.”
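For readers curious about the mechanics: a browser with DNT enabled sends the request header `DNT: 1`, and a server can choose to check for it. The Python sketch below shows one way such a check might look. The function name `tracking_allowed` is hypothetical, and this is an illustration only, not how Twitter or any other company in the study handles the header.

```python
def tracking_allowed(headers):
    """Return False when the request carries a Do Not Track preference.

    `headers` is assumed to be a plain dict of HTTP request headers.
    This is an illustrative sketch, not any company's actual policy.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    # A value of "1" means the user has asked not to be tracked.
    return normalized.get("dnt") != "1"
```

A server that respected the setting would skip setting tracking cookies or logging the visit to an advertising profile whenever this check returns False.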

This study used Libert’s open-source software platform, webXray, a tool for detecting third-party HTTP requests on large numbers of web pages and matching them to the companies that receive user data. Libert has also made a dataset of the study’s 35 million third-party request records available on his website: timlibert.me.
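The general idea of spotting third-party requests and attributing them to companies can be sketched in a few lines of Python. This is not webXray’s code: the function names and the small `DOMAIN_OWNERS` table are hypothetical, and the base-domain logic is deliberately naive (it keeps the last two labels of the hostname, which mishandles suffixes such as .co.uk).

```python
from urllib.parse import urlsplit

# Hypothetical, hand-written domain-to-owner table for illustration only.
DOMAIN_OWNERS = {
    "google-analytics.com": "Google",
    "doubleclick.net": "Google",
    "facebook.net": "Facebook",
}


def base_domain(url):
    """Naively reduce a URL to its last two hostname labels."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def third_party_requests(page_url, request_urls):
    """Map each third-party domain contacted by a page to its owner."""
    page_domain = base_domain(page_url)
    found = {}
    for url in request_urls:
        domain = base_domain(url)
        # A request is third-party when its domain differs from the page's.
        if domain and domain != page_domain:
            found[domain] = DOMAIN_OWNERS.get(domain, "unknown")
    return found
```

Given a page URL and the list of URLs the page requested while loading, `third_party_requests` returns a dictionary of external domains and their owners, with "unknown" for any domain missing from the table.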

Abstract of Exposing the Invisible Web: An Analysis of Third-Party HTTP Requests on 1 Million Websites

This article provides a quantitative analysis of privacy-compromising mechanisms on 1 million popular websites. Findings indicate that nearly 9 in 10 websites leak user data to parties of which the user is likely unaware; more than 6 in 10 websites spawn third-party cookies; and more than 8 in 10 websites load Javascript code from external parties onto users’ computers. Sites that leak user data contact an average of nine external domains, indicating that users may be tracked by multiple entities in tandem. By tracing the unintended disclosure of personal browsing histories on the Web, it is revealed that a handful of U.S. companies receive the vast bulk of user data. Finally, roughly 1 in 5 websites are potentially vulnerable to known National Security Agency spying techniques at the time of analysis.