How Can Organizations Harness AI While Maintaining Strong Privacy Safeguards?

When the GDPR was adopted in 2016, AI systems were far less advanced than they are today. The first Large Language Model (LLM), GPT-1, appeared in 2018, but LLMs only captured mainstream attention around 2020 with the release of GPT-3. Since the GDPR predates this rapid AI development, some of its requirements now pose challenges for AI providers and operators. In response, EU DPAs (Data Protection Authorities) have issued guidelines to clarify how the GDPR applies to modern AI systems. This article summarizes the most relevant guidance and adds practical insights that could be helpful for both legal and technical teams. Most of the recommendations presented here come from the CNIL Recommendations on AI and GDPR, the EDPB Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models, and the EDPB Report of the work undertaken by the ChatGPT Taskforce.

Ensuring GDPR Compliance in AI Without Hindering Innovation

The GDPR is built on seven core principles for lawful personal data processing. However, these principles were developed before the rise of AI, especially general-purpose LLMs, and now create compliance challenges. Below, we explore how these challenges arise and how they might be addressed.

Is the AI System Within GDPR’s Scope?

The first question is: Does the AI model process personal data?

Some DPAs, such as those in Hamburg and Denmark, have argued that LLMs do not store personal data, as the training data is transformed into tokens. In other words, data included in the training dataset is transformed into abstract mathematical representations and probability weights; the system does not ‘memorize’ raw data. It was argued that this transformation renders the data anonymous and therefore places it outside the GDPR’s scope.
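To make the tokenization argument concrete, here is a minimal sketch, assuming the Hugging Face transformers package and the publicly available gpt2 tokenizer (chosen purely for illustration); the example sentence is fictitious:

```python
# Minimal illustration of tokenization: text containing personal data is turned
# into integer token IDs and sub-word pieces before any training takes place.
# Assumes the Hugging Face "transformers" package; the sentence is fictitious.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "Jane Doe's phone number is 555-0175."
token_ids = tokenizer.encode(sentence)

print(token_ids)                                    # a list of integers
print(tokenizer.convert_ids_to_tokens(token_ids))   # sub-word pieces, not database fields
```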

However, privacy experts, and the EDPB itself in Opinion 28/2024, caution against this assumption. They stress that tokenized data may still qualify as personal data, as the original training data could be revealed through targeted queries or model extraction techniques. Therefore, whether an AI model processes personal data must be assessed on a case-by-case basis, taking into account ‘all the means reasonably likely to be used’.

This stance is supported by technical studies. In 2020, Carlini et al. (a group of researchers led by Nicholas Carlini) demonstrated that GPT-2 could reveal individuals’ sensitive information present in its training data. They crafted targeted prompts, queries designed to coax the model into revealing memorized content, and the model occasionally output full names, email addresses, phone numbers, or passwords. According to more recent studies and user reports, the risk remains valid in 2025.
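As a rough, hedged sketch of what such a targeted-prompt probe can look like (using the public gpt2 model purely for illustration; the prefix and the regex are assumptions, not the exact prompts from the study):

```python
# Sketch of a targeted-prompt probe in the spirit of Carlini et al.: sample many
# completions for a suggestive prefix and scan them for contact-detail patterns.
# Model, prefix, and regex are illustrative assumptions only.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prefix = "You can reach John Smith by email at"   # hypothetical targeted prompt
completions = generator(prefix, max_new_tokens=30, num_return_sequences=20,
                        do_sample=True, temperature=0.8)

email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
for completion in completions:
    hits = email_pattern.findall(completion["generated_text"])
    if hits:
        print("Possible memorized contact detail:", hits)
```

Any hit would of course need manual verification before being treated as evidence of memorization.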

Legal Basis for Processing

A major question has been: what legal basis justifies the training and development of AI systems, especially on personal data scraped from the internet?

In the ChatGPT Taskforce report and Opinion 28/2024, the EDPB concluded that “legitimate interest” can serve as a valid legal basis for the development and deployment of AI systems—provided the controller meets the GDPR’s three-step balancing test. This is a significant clarification, as other legal bases (like consent or contractual necessity) are often unworkable in this context.

Purpose Limitation

The GDPR requires data to be collected for specific, explicit, and legitimate purposes. If the AI system is developed for a specific operational use, defined already at the development stage, purpose limitation should be easily achieved. However, for general-purpose AI models and foundation models that can be used by a wide variety of downstream applications, it may be challenging to define an explicit purpose at the development stage.

CNIL provides an example of an image classification model whose training dataset is made public. The model could be reused for various purposes: detection of intruders by an AI-powered camera, camera systems measuring attendance on station platforms, or detection of defects in images taken as part of product quality controls. The French DPA suggests that in such cases, it may be enough to describe the following, even if all specific uses can’t be anticipated upfront:

  • Type of system developed
  • System’s general capabilities


Accuracy

According to the GDPR, the accuracy principle requires that personal data processed be factually correct and reflect reality. However, as underlined by the ICO (UK Information Commissioner’s Office), the outputs of AI systems are “statistically informed guesses” rather than facts.

The ICO proposes the concept of “statistical accuracy”, explaining that an AI system does not need to be 100% statistically accurate to comply with the accuracy principle. To balance the potential lack of accuracy, both the ICO and the ChatGPT Taskforce urge developers to clearly communicate to users that the outputs are probabilistic and may be biased or made up. However, this should not release AI providers from taking remediation action if the system produces any misleading or factually incorrect personal data.

Data Minimization and Storage Limitation

The GDPR requires that personal data be limited to what is necessary in light of the defined objective. This is difficult to align with the vast datasets required to train AI systems, especially LLMs.

According to CNIL, the data minimization principle does not prevent AI developers from using large training datasets, as long as certain minimization safeguards are applied. The same has been suggested by the ChatGPT Taskforce, which highlighted the importance of the following safeguards (illustrated in the sketch after this list):

  • Defining precise collection criteria
  • Ensuring that certain data categories are not collected or that certain sources (such as public social media profiles) are excluded from data collection
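A minimal sketch of what such collection-stage safeguards could look like in code; the record structure, domain list, and length criteria are illustrative assumptions rather than regulatory requirements:

```python
# Illustrative collection-stage minimization filters: drop records from excluded
# sources and keep only records meeting pre-defined collection criteria.
# A special-category (sensitive data) filter could be layered in the same way.
from urllib.parse import urlparse

EXCLUDED_DOMAINS = {"facebook.com", "instagram.com", "x.com"}  # e.g. public social media profiles
MIN_LENGTH, MAX_LENGTH = 200, 20_000                           # example collection criteria

def keep_record(record: dict) -> bool:
    domain = urlparse(record["url"]).netloc.removeprefix("www.")
    if domain in EXCLUDED_DOMAINS:
        return False                                           # excluded source
    return MIN_LENGTH <= len(record["text"]) <= MAX_LENGTH     # defined collection criteria

scraped_records = [                                            # stand-in for a real crawl
    {"url": "https://www.example.org/article", "text": "..." * 200},
    {"url": "https://www.facebook.com/some.profile", "text": "..." * 200},
]
training_corpus = [r for r in scraped_records if keep_record(r)]
```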

The ChatGPT Taskforce also stated that sensitive data categories should be filtered out, either at the collection stage or by deleting them from the dataset prior to the system’s training. This has drawn some criticism: complying with the opinion would require LLM providers to filter out all sensitive data, including information about public figures, which could defeat the purpose of LLMs.

The system needs such data to respond to queries like “What are Taylor Swift’s religious beliefs?” Some commentators have therefore proposed that the ban on sensitive data collection should only apply to individuals who are not public figures.

Turning to storage limitation, controllers must also ensure that only data that is strictly necessary is stored. CNIL has attempted to define what this means in practice and determined that the storage limitation requirement is still complied with if the data is necessary for:

  • Maintenance operations
  • Model improvement

For example, the retention of AI system training data can be necessary to facilitate audits or bias detection. Moreover, a previous training dataset may be needed to improve the model in subsequent releases.

Fairness

Fairness is an overarching principle that requires personal data to be processed in a manner that is not unjustifiably detrimental or discriminatory to the data subject.

Despite all development efforts, AI systems are still not bias-free. A system trained on biased personal data may reinforce or amplify existing biases and lead to discrimination. Under the EU AI Act, bias prevention is obligatory for providers of high-risk AI systems.

Even when not required by the EU AI Act, controllers processing personal data as part of an AI system will still need to implement de-biasing practices in order to comply with the GDPR fairness principle. Both the ICO and CNIL recommend assessing, as part of a DPIA (Data Protection Impact Assessment), whether the chosen training dataset and algorithm both prevent outputs that could be discriminatory.
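One simple check that could feed such an assessment is comparing favourable-outcome rates across protected groups; the column names, synthetic data, and the four-fifths-style threshold below are illustrative assumptions, not a method prescribed by the ICO or CNIL:

```python
# Hedged sketch of a demographic-parity style check on model decisions,
# using synthetic data; column names and the 0.8 threshold are assumptions.
import pandas as pd

predictions = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "decision": [1,   1,   1,   0,   1,   0,   0,   0],   # 1 = favourable outcome
})

rates = predictions.groupby("group")["decision"].mean()
ratio = rates.min() / rates.max()

print(rates)
if ratio < 0.8:   # threshold loosely inspired by the "four-fifths rule"
    print(f"Potential disparity (ratio = {ratio:.2f}): review training data and model")
```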

Transparency

Controllers must inform individuals how and why their data is processed. For AI systems that make automated decisions with legal or similarly significant effects, “meaningful information about the logic involved” must also be provided.

This is especially challenging for deep learning systems (like deep neural networks), which are often seen as “black boxes.” Their complexity makes it hard to explain how outputs are generated: deep learning systems often have millions (if not billions) of parameters, each influencing the final output. To mitigate the risk, the ICO, in its guidance “Explaining decisions made with AI”, advises translating the rationale of the system’s results into “useable and easily understandable reasons”. This suggests that not every detail must be explained, but general information on how the system works should be provided in clear language. Similarly, the EU AI Act expects providers of high-risk AI systems to supply instructions that enable deployers to correctly interpret the system’s output.
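Neither the ICO nor the AI Act prescribes a specific tool, but feature-attribution libraries are one common way to obtain the raw material for such plain-language explanations. A minimal sketch, assuming scikit-learn and the shap package, with synthetic data used purely for illustration:

```python
# Hedged sketch: per-decision feature attributions with SHAP, which can then be
# translated into plain-language reasons. Data and model are synthetic stand-ins.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.Explainer(model.predict, X)   # model-agnostic explainer
explanation = explainer(X[:5])                 # attributions for five decisions

# Each row shows how much each feature pushed the decision up or down.
print(explanation.values)
```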

Another challenge is delivering privacy notices to individuals whose data is scraped from the web, where direct notification may not be feasible. Article 14(5)(b) GDPR allows for exemptions when notice delivery proves impossible or would involve a disproportionate effort. The EDPB confirmed that the exemption could apply, but stresses the need for a case-by-case analysis taking into account the specific context of the data processing.

Even if direct notice isn’t possible, organizations should still make the information publicly available, for instance by publishing it on their website or using model cards, FAQs, or annual transparency reports. CNIL recommends disclosing the URLs, or at least the categories of sites, from which the data was sourced. To balance the information asymmetry, the EDPB recommends some transparency measures that go beyond the GDPR requirements, such as releasing public and easily accessible communications with additional details about the collection criteria and all datasets used.

Data Subject Rights

AI systems rarely “memorize” raw personal data, making it difficult to verify whether an individual’s data was used in training. CNIL suggests allowing requestors to submit copies of their data for comparison, though technical experts have raised doubts about this approach, as the chances are low that the document is still in its original form. It could, however, work for AI systems producing images or videos, as the provider could compare the submitted file with the content used in training.
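One way such an image comparison could be implemented is perceptual hashing; the packages (Pillow, imagehash), file paths, and distance threshold below are illustrative assumptions, not a CNIL-endorsed procedure:

```python
# Hedged sketch: compare a requestor's image against training images via
# perceptual hashes. Paths and the distance threshold are assumptions.
from PIL import Image
import imagehash

submitted = imagehash.phash(Image.open("requestor_photo.jpg"))

for path in ["train/img_001.jpg", "train/img_002.jpg"]:       # stand-in for the training set
    distance = submitted - imagehash.phash(Image.open(path))  # Hamming distance between hashes
    if distance <= 5:                                          # small distance suggests a likely match
        print(f"Possible match in training data: {path} (distance={distance})")
```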

Another recommendation is that it may sometimes be possible for the controller to determine whether the model has learned information about a person by conducting simulated tests and attacks, such as membership inference, or simply by submitting targeted queries to the model.
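As a rough sketch of one simple, loss-based membership signal (using the public gpt2 model purely for illustration; the example strings are fictitious, and real membership-inference tests are considerably more involved):

```python
# Hedged sketch of a loss-based membership signal: a markedly lower loss on the
# candidate record than on a paraphrase can hint that the model saw the record
# during training. Model and example strings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_loss(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

candidate = "Jane Doe, born 12 March 1980, lives in Lyon."         # data subject's record
paraphrase = "A woman named Jane Doe, from Lyon, was born in 1980."

print(sequence_loss(candidate), sequence_loss(paraphrase))
```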

As for deletion requests, once a model is trained, the data cannot simply be deleted; removing it would require retraining, which is very costly for LLMs. Two solutions seem to be supported by both the EDPB and DPAs:

  • Schedule the retraining periodically and inform the requestor that the data will not be included in the new training set
  • In the meantime, apply the output filters that would block any outputs containing the requestor’s personal data

The second solution has gained much more support, as long as the filters are robust and effective.
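A minimal sketch of such an output filter, keyed to identifiers collected from erasure requests; the block list contents and the redaction behaviour are assumptions about one possible design, and a production filter would also need to handle spelling variants and indirect identifiers:

```python
# Hedged sketch of an output filter for erasure requests: suppress any model
# output containing identifiers from a block list built from those requests.
import re

BLOCKED_IDENTIFIERS = ["Jane Doe", "jane.doe@example.com"]   # populated from erasure requests

def filter_output(generated_text: str) -> str:
    for identifier in BLOCKED_IDENTIFIERS:
        if re.search(re.escape(identifier), generated_text, flags=re.IGNORECASE):
            return "I can't share personal information about this individual."
    return generated_text

print(filter_output("Jane Doe's address is 12 Rue Exemple."))   # blocked
print(filter_output("The GDPR has seven core principles."))     # passes through
```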

Since the proposed approaches are still not perfect, the EDPB proposes expanding data subject rights beyond the strict letter of the law. It suggests that:

  • The data erasure right should be granted even when the Article 17(1) criteria are not met
  • An opt-out list could be created, serving as a pre-emptive right to object. The list would “allow data subjects to object to the collection of their data on certain websites or online platforms by providing information that identifies them on those websites, including before the data collection occurs”

The idea may be inspired by the authors’ opt-out right (from the use of copyrighted content for AI training) granted under EU intellectual property law.

Data Protection Impact Assessment – Let’s Assess Compliance

Under the GDPR, a DPIA is required when innovative technologies are used for data processing. The CNIL has clarified that not all AI systems qualify as “innovative” in this context. Systems based on AI techniques that have been widely used and validated over time, such as regression or clustering algorithms, are no longer considered innovative. A different approach should be taken for systems employing newer techniques, such as deep learning, whose risks are still being understood and managed. In such cases, conducting a DPIA is necessary.

It has also been emphasized that while the development of AI systems often involves processing large amounts of data, this does not automatically constitute large-scale processing under the GDPR; what matters is whether the training dataset covers a very large number of individuals. At the same time, CNIL has determined that high-risk AI systems, as well as foundation models and general-purpose AI, should fall within the scope of a DPIA.

Responsibility for the DPIA depends on who acts as the data controller. If the provider controls the processing during deployment, a full DPIA covering the entire lifecycle is recommended. If the provider is not the controller but knows how the system will be used, it may provide a DPIA template. Still, the deploying organization, as controller, remains responsible for conducting the DPIA, possibly building on the provider’s template.

Conclusion

As part of a DPIA, organizations should assess compliance with the core data protection principles as well as AI-specific risks (such as allocative and representational harms, or loss of control over data made available online).

If the specific compliance measures discussed in this article are appropriately documented in the DPIA, the AI system should be well-positioned to meet the stringent requirements of EU data protection law.
