The Dark Side of AI: How Big Tech Is Using Your Data to Train Models Without Telling You

Every time you type a message into ChatGPT, ask Gemini a question, or let Meta AI help you write a caption, you are doing two things at once. You are getting a response. And you are quite possibly training the next version of the model that just answered you. Most people do not know this. And that, according to regulators across Europe, Asia, and Latin America, is precisely the problem.

The AI boom of the last three years has been built on data. Enormous, unprecedented quantities of it: text scraped from websites, books, social media posts, private conversations, YouTube videos transcribed without creators’ consent, and the ongoing, real-time stream of user interactions that flows into the servers of every major AI company every second of every day. The companies building these models have largely treated this data as a free resource, available to be collected, processed, and learned from with minimal disclosure and even less meaningful consent.

That era is ending. Regulators are catching up. Fines are landing. Bans are being issued. And users are beginning to ask the question the industry has spent years hoping they would not: what exactly are you doing with my data?

“Every chatbot developer we analyzed retains some chat data indefinitely. No platform offers users a way to remove their personal data from existing training sets. Not one.” — AAAI Research, 2025

€15M

Fine imposed on OpenAI by Italy’s data regulator in December 2024 for GDPR breaches

0

Number of AI platforms that allow users to remove personal data from existing training sets

5 years

How long Anthropic retains your chat data if you haven’t opted out of training

7%

Maximum fine under the EU AI Act for the most serious violations, as a share of worldwide annual turnover

What Is Actually Happening to Your Data

Let’s be specific, because vagueness is part of how this works. When you use most AI chatbots, whatever you type is logged: your conversations, your questions, your personal details, your business strategies, your health concerns. Many platforms use those logs to improve their models, meaning your private inputs become part of the training data that shapes future AI behaviour. On most platforms, the default is that your data is used for training unless you actively opt out. And the opt-out options, where they exist, are buried in settings menus that require deliberate effort to find.

A peer-reviewed study published by the Association for the Advancement of Artificial Intelligence in 2025 analysed the data practices of every major AI platform in detail. The findings were stark. Every chatbot developer analysed retains some chat data indefinitely, primarily for what they describe as trust, safety, and quality assessment purposes. No platform offers users a way to remove their personal data from existing training sets once it has been used. The data is in the model. It cannot be extracted.

The variation between platforms is significant and worth knowing. According to an independent ranking by privacy research firm Incogni, Meta AI, Gemini (Google), and Microsoft Copilot are the most aggressive data collectors and the least transparent about their practices. Meta’s privacy policy states that videos and interactions can be used to train its products to recognise objects and activities. Microsoft’s policy implies user prompts may be shared with third-party advertising services. Google expanded Gemini to children under 13 in May 2025, and retains chats that human reviewers have examined for three years, even when they are disconnected from a user’s Google account.

At the other end of the ranking, Mistral AI’s Le Chat and OpenAI’s ChatGPT offer clearer opt-out pathways and more accessible privacy documentation. OpenAI and Microsoft disclose that user data may be used for model training while offering opt-out options. But even the better performers share prompts with third parties: service providers, legal authorities, and affiliated companies, under terms that most users never read.

“Meta AI, Gemini, and Microsoft Copilot ranked as the most aggressive data collectors and least transparent AI platforms in 2025. Meta’s privacy policy covers multiple products, runs to tens of thousands of words, and requires a college-level reading ability to parse.”

The Scraping Problem: Data That Was Never Yours to Take

The data privacy issue extends far beyond what users voluntarily type into chatbots. The foundational training data for most large language models was assembled through web scraping: automated systems that crawled the internet and collected text from websites, forums, news articles, academic papers, books, and social media platforms at massive scale.
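To make the mechanism concrete, here is a deliberately minimal, hypothetical sketch of such a crawler. It is illustrative only: production pipelines are distributed, filtered, and deduplicated at vastly larger scale, and the seed URL below is a placeholder. But the core loop (fetch a page, strip the markup, keep the text, follow the links) really is this simple.

```python
# Minimal illustrative web crawler of the kind used to assemble
# text corpora. NOT a production pipeline: no politeness delays,
# no deduplication, no filtering. The seed URL is a placeholder.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(seed_url: str, max_pages: int = 10) -> list[str]:
    """Breadth-first crawl that collects the visible text of each page."""
    queue, seen, corpus = deque([seed_url]), {seed_url}, []
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or erroring pages
        soup = BeautifulSoup(resp.text, "html.parser")
        # Strip the markup; keep only the human-written text.
        # Note what is absent here: at no point does the loop ask
        # whether the person who wrote that text agreed to its collection.
        corpus.append(soup.get_text(separator=" ", strip=True))
        # Follow outbound links to expand the crawl frontier.
        for link in soup.find_all("a", href=True):
            nxt = urljoin(url, link["href"])
            if nxt.startswith("http") and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return corpus


if __name__ == "__main__":
    pages = crawl("https://example.com")
    print(f"Collected {len(pages)} pages of text")
```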

This scraping was done largely without the knowledge or consent of the people who created the content being collected. Authors whose novels were scraped. Journalists whose articles were processed. Doctors whose medical Q&As were ingested. Artists whose work was used to train image generation models. None of them were asked. Many of them did not even know it was happening until the lawsuits began.

When OpenAI faced a data shortage in late 2021, researchers developed a speech recognition tool called Whisper to transcribe over one million hours of YouTube videos, a move that potentially violated YouTube’s terms of service, which prohibit using its content for independent applications. According to reporting by The New York Times, some OpenAI employees, including company president Greg Brockman, were aware of the legal grey area but proceeded anyway. Google, meanwhile, broadened its own terms of service to allow tapping into publicly available Google Docs, restaurant reviews, and other user-generated content for AI training purposes, a policy change that affected hundreds of millions of users who had never consented to having their content used this way.

X (formerly Twitter) updated its privacy policy to allow Grok, its AI chatbot, to train on all public posts on the platform. Hundreds of millions of tweets written years before Grok existed became training data without any notification to the people who wrote them.

The Regulatory Reckoning: Fines, Bans, and the GDPR Fight

Europe has been the most aggressive regulator of AI data practices, and the enforcement timeline of the last two years reads like a mounting indictment of the industry’s approach to consent.

In April 2023, Italy’s Garante, the national data protection authority, suspended ChatGPT entirely, forcing OpenAI to revise its privacy policies, implement age verification, and clarify the legal basis on which it was processing European users’ data. OpenAI complied and was allowed back. Then, in December 2024, the Garante issued a €15 million fine against OpenAI for multiple GDPR breaches, including failing to identify a lawful basis for training ChatGPT before launch and failing to notify Italian authorities of a data breach in March 2023.

On 30 January 2025, the Garante banned DeepSeek, the Chinese AI company, from operating in Italy entirely, citing deficiencies in transparency and the transfer of European personal data to China. Within days, South Korea’s data protection authority was investigating DeepSeek and directing app stores to remove it. By June 2025, Berlin’s Data Protection Commissioner had declared DeepSeek unlawful and requested its delisting from Apple’s and Google’s app stores.

Meta has had its own battles. The company voluntarily paused AI training on European users’ data in June 2024 following pressure from the Irish Data Protection Commission, the lead regulator for most major US tech companies under GDPR’s one-stop-shop system. The UK’s Information Commissioner’s Office (ICO) engaged with Meta on developing a compliance framework, though it stopped short of formally approving the company’s approach.

A peer-reviewed paper published by Oxford University Press on 30 March 2026, analysing 19 data protection authority guidelines and a series of global enforcement actions, found that regulators formally converge on the concept of legitimate interest as the legal basis for AI training but diverge sharply on how it should be applied in practice. The European Commission has since proposed explicit GDPR amendments through its Digital Omnibus initiative that would codify AI training as a legitimate interest, a move that would partially legalise practices enforcement bodies have been wrestling with for years. Critics argue it is a capitulation to industry lobbying. Supporters say it provides the clarity that both regulators and companies have been lacking.

The Children Problem

One dimension of the AI data privacy debate that deserves more attention than it receives is children’s data. The AAAI study found alarming variation in how major platforms handle data from users under 18.

Google expanded Gemini to children under 13 in May 2025 and will train its models on data from users aged 13 to 18, but only if those users opt in. Amazon, Meta, and OpenAI allow users 13 and older to create accounts without treating their data differently from adult users, which the researchers interpreted as indicating these companies likely train on children’s data by default. Only Anthropic explicitly states that it neither collects data from nor accepts users under 18.

The ethical dimension is clear: children cannot legally provide meaningful consent to their data being used for commercial AI training. The regulatory dimension is catching up: the EU AI Act’s prohibitions on certain high-risk AI uses involving children came into force in February 2025, with fines of up to 7 percent of worldwide annual turnover for non-compliance. For a company the size of Meta or Google, that is a fine measured in billions.

What You Can Actually Do About It

The honest answer is: not much, but not nothing.

Most platforms offer some form of opt-out from having your conversations used for training, but finding and activating these settings requires deliberate effort. On ChatGPT, you can turn off training in Settings → Data Controls → Improve the model for everyone. On Google’s Gemini, you can pause Gemini Apps Activity, though the precise effect on model training is less clearly disclosed. On Meta AI, the opt-out is more limited; the company’s position on social media data is essentially that it is fair game.

If you regularly input sensitive information (health data, business strategy, financial details, personal communications), the safest approach is to assume it could be used for training and to adjust what you share accordingly. Enterprise and API tiers of most platforms offer stronger privacy guarantees, with training opt-outs enabled by default rather than requiring users to find them.
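The difference is visible in practice. Below is a minimal, hypothetical sketch of routing a query through OpenAI’s API tier rather than the consumer app; under OpenAI’s published API policy, data submitted this way is excluded from training by default, with no flag to set. The model name and prompt are placeholders, and the endpoint and payload follow the public Chat Completions API.

```python
# Minimal sketch: querying a model through the API tier, where
# (per OpenAI's stated API policy) submitted data is not used for
# training by default. Assumes an OPENAI_API_KEY environment variable.
import os

import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Summarise this contract clause..."}
        ],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The point is not the specific vendor but the default: on the consumer app you must find the opt-out, while on the API and enterprise tiers exclusion from training is the starting position.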

The deeper issue is systemic, not individual. No amount of individual privacy hygiene fixes a model that was trained on scraped data before you even knew the product existed. The data from a billion users is already inside these systems. What the regulatory battles are really about is whether the companies that collected and used it without proper consent can be held accountable and whether the rules being written now will actually prevent the same thing from happening with the next generation of models.

The Bottom Line

Big Tech is not secretly evil. It is operating in a regulatory environment that, until recently, had no clear rules about how AI training data should be sourced and governed. The companies that pushed the boundaries did so because they could, and because their competitors were doing the same. The €15 million fine on OpenAI, the DeepSeek bans, and the Meta training pauses are the first signals that the permissive era is ending.

What comes next will be determined by the regulatory battles playing out in Europe, the US, South Korea, Brazil, and dozens of other jurisdictions simultaneously. The GDPR amendments currently being proposed could either strengthen protection or weaken it, depending on how they are drafted. The EU AI Act is in force but untested at scale. The US has no comprehensive federal AI privacy law.

In the meantime, every time you open a chatbot and type something you would not want to see in a training dataset, you are taking a risk that the current rules, such as they are, may not adequately protect you against. That is not paranoia. It is an accurate reading of where the law and the technology currently stand.