The fruits of web scraping – using code to collect data and information from websites – are all around us.
People build scrapers that can find every Applebee’s on the planet, collect congressional laws and votes, or track fancy watches for sale on fan websites. Businesses use scrapers to manage their online retail inventory and monitor competitors’ prices. Many well-known websites use scrapers to keep track of airline tickets and job listings, for example. Google is essentially one giant, crawling web scraper.
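At its simplest, a scraper fetches a page and pulls structured data out of its HTML. A minimal sketch using only Python’s standard library is below; the inline `PAGE` string is a stand-in for a real page, which a scraper would fetch with something like `urllib.request.urlopen` before parsing.

```python
from html.parser import HTMLParser

# Stand-in for a fetched page; a real scraper would download HTML
# from a live site (e.g., a restaurant chain's store locator).
PAGE = """
<html><body>
  <a href="/locations/1">Location #1</a>
  <a href="/locations/2">Location #2</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects the href attribute of every anchor tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed(PAGE)
print(parser.links)  # ['/locations/1', '/locations/2']
```

Real-world scrapers typically layer on an HTTP client, rate limiting, and a more forgiving parser, but the core loop is the same: request, parse, extract, repeat.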
Scrapers are also the tools of watchdogs and journalists, which is why The Markup filed an amicus letter this week with the US Supreme Court in a case that threatens to make scraping illegal.
The case itself – Van Buren v. United States – is not about scraping but about a legal question in the prosecution of a Georgia police officer, Nathan Van Buren, who was bribed to look up confidential information in a law enforcement database. Van Buren was prosecuted under the Computer Fraud and Abuse Act (CFAA), which prohibits unauthorized access to a computer network: think computer hacking, where someone breaks into a system to steal information (or, as dramatized in the classic 1980s movie “WarGames,” possibly start World War III).
In Van Buren’s case, the question is whether the court will broadly define his misconduct as “exceeding authorized access” to extract data – which would make it a crime under the CFAA even though he was allowed to access the database for his work. And it is precisely this definition that could affect journalists.
Or, as Justice Neil Gorsuch put it during Monday’s oral argument, risk “perhaps making a federal criminal of us all.”
“A law allowing powerful entities like the government or wealthy corporate actors to unilaterally criminalize news-gathering activities by blocking those efforts through the terms of service on their websites would violate the First Amendment,” The Markup wrote in our letter.
What kind of work is at risk? Here is a summary of the recent journalism made possible by web scraping:
- The Atlantic’s COVID Tracking Project collects and aggregates data from across the country on a daily basis, serving as a way to monitor where testing is happening, where the pandemic is surging, and the racial disparities in who is contracting and dying of the virus.
- This Reveal project scraped extremist Facebook groups and compared their membership lists with those of law enforcement groups on Facebook – and found a lot of overlap.
- The Markup’s recent investigation of Google search results showed that Google consistently favors its own products, leaving the very websites from which the web giant scrapes information fighting for visitors – and therefore ad revenue. The US Department of Justice cited the problem in its antitrust lawsuit against the company.
- USA Today’s Copy, Paste, Legislate project found a pattern of cookie-cutter laws, promoted by special interest groups and propagated through legislatures across the country.
This article was originally published on The Markup and republished under the Creative Commons Attribution-NonCommercial-NoDerivatives License.