Global: Enormous data pipelines powering major generative AI systems are rooted in mass invasions of privacy by design

Companies are extracting vast troves of online data through unlawful web scraping to build their generative artificial intelligence (AI) products in a way that is enabling a mass invasion of privacy, making these systems unlawful by design, Amnesty International said in a new briefing today.

The briefing, Unlawful by Design: Exposing the Human Rights Costs of Generative AI, documents serious risks in the large-scale data scraping and processing being used to build and train these systems, including violations of the right to privacy by design and adverse consequences for the environment and historically marginalized communities.

“Companies across the world are supplying generative AI products under the veneer of efficiency and sophistication, but in reality, these systems perpetuate mass invasions of privacy through unlawful web scraping: an automated process for extracting data from websites, including personal data, such as images and social media activity, to train AI models,” said Likhita Banerji, Head of the Algorithmic Accountability Lab, Amnesty International.

“The extractive data pipeline, the inherent design choices made by tech companies and the exploitative supply chains used to build generative AI systems have enabled a paradigm of technology development that opens up a risk of mass abuse of human rights.”

Amnesty International researched the models powering some of the most popular publicly available standalone generative AI tools, including GPT-3 by OpenAI, Google’s Gemini, Meta’s Llama and DeepSeek, as well as Midjourney and Stable Diffusion.

Such systems rely on extracting information from billions of public online posts and images, often without the explicit consent of the individuals appearing in or creating them. Not only does this infringe on privacy by design, but as the datasets powering AI models scale up, hateful and discriminatory content in their outputs is also amplified, along with negative stereotypes and prejudices, especially along racial and gendered lines.

Racial, gender and cultural biases are consistent features of generative AI systems, a product of the training data that is largely pulled from the web and therefore polluted with real-world biases which harm historically marginalized communities. Additionally, generative AI systems pose risks to the right to freedom of thought as they are capable of influencing users’ thoughts and shaping their personal beliefs through predictive suggestions. This is especially true for larger models reliant on expansive training data.

“These choices are not inevitable. We must challenge the design choices adopted by companies who build generative AI systems by relying on training data, including personal data, that is extracted non-consensually and on a grand scale,” said Likhita Banerji.

“This is one of the most egregious practices among AI companies operating with disregard for human rights and must urgently be addressed. A different trajectory of technology development is possible if authorities act urgently to course correct.”

Heavy environmental costs

As the scale and speed of development have picked up at generative AI companies, so have the infrastructure requirements and associated environmental costs.

The higher processing needs of larger models require more energy-intensive chips, larger data centres and, consequently, more energy and water to operate them. Generative AI production often has a negative impact on historically marginalized communities, as the lands and resources that belong to these communities are exploited to build data centres and fulfil processing requirements.

Google’s own sustainability report from 2024 noted a staggering 48 per cent increase in the company’s greenhouse gas emissions since 2019, attributable to data centre and supply chain emissions. Similarly, Microsoft’s emissions increased by 29 per cent between 2020 and 2024, attributable to data centres carrying out AI-supporting processes.

The intensive use of resources in generative AI production has led communities from Cerrillos in Chile and Querétaro in Mexico to Arizona in the United States of America to resist data centres in areas that are already heavily affected by droughts and electricity shortages.

As part of its research process, Amnesty International wrote to Google, OpenAI, Meta, Stability AI, Midjourney and DeepSeek, giving them an opportunity to respond to the findings of the research briefing, which states that their models are reliant on unlawful web scraping, among many other related human rights concerns.

Amnesty International also wrote to Intel and VMware specifically regarding the risks of discrimination, and to Google, Microsoft and Amazon about the environmental harms associated with their generative AI systems and related infrastructure. At the time of publication, only Microsoft, Amazon, Intel, OpenAI and Meta had responded to Amnesty International. A summary of their responses is included in the briefing.

Amnesty International is calling on states to prohibit standalone generative AI systems that have been built using unlawful web scraping, defined as the bulk and mass collection of training data through the web. Companies must immediately cease the practice of unlawful non-consensual web scraping of personal data for AI training purposes, and states must hold companies to account for their involvement in any human rights abuses linked to their design and business choices.

Background

The briefing provides a human rights analysis of the ‘data pipeline’ that powers generative AI products, including the stages of data capture, analysis, and processing that are critical to the overall functioning of these systems. Specifically, this involves zooming in on the parameters and implications of design choices made in relation to the training data of generative AI models, with a focus on methods and sources of data collection, data processing, model scaling and data outputs.

Amnesty International defines standalone generative AI tools as products that are developed, deployed and marketed for their generative AI capabilities solely and specifically, such as AI chatbots, image/video/audio/text generators, and so on. This does not include products where generative AI is an added feature or function in a larger suite of products, for example, word processing software with optional generative AI features.