Data Leaks and Generative Models
Deep generative models are a class of algorithms that aim to capture the underlying structure and patterns of a dataset in order to produce new data samples based on them. These models have recently experienced incredible growth and popularity in the Natural Language Processing (NLP) field, with trending applications such as OpenAI's ChatGPT series.
The growing popularity of generative models has been made possible by exceptional advances in Deep Learning research from both the private and public sectors, together with the growth of computing capacity, in particular through cloud services. This has laid the foundations for massive adoption, for both personal and professional use, across use cases such as translation, coding assistance, content generation, and art creation.
Despite their outstanding capabilities and growing popularity, the adoption of generative models also raises concerns and potential cybersecurity risks, especially in terms of data leaks and privacy. As more and more individuals rely on these models, businesses increasingly rely on them too. However, using generative models in a business context carries serious risks that must be understood and addressed in order to use them efficiently and, above all, safely.
Security concerns related to Data Protection Frameworks
In April 2023, Italy blocked the use of ChatGPT for three main reasons:
Concerns over a potential personal data breach, such as the one that occurred in March 2023
Absence of an information notice to users
Absence of an age verification check
As this case demonstrates, there are already conflicts between generative models such as ChatGPT and data protection frameworks such as GDPR and CCPA.
Source: ChatGPT suffers first data breach
Further potential conflicts with data protection regulations, more specific to AI models, can arise from the use of generative models in multiple ways, which we will review in the following sections.
Data leak risks during the main stages of generative models
The lifecycle of deep learning models, to which generative models belong, consists of two essential stages:
Training, during which the model adjusts its parameters on a vast amount of data to learn the structure and patterns needed for a task
Inference, during which the model is exposed to new, real-world data and must perform the task it was trained for
Each of these stages comes with its own set of risks concerning data leaks and privacy.
Risks during the training stage
Generative models can be trained on vast amounts of potentially personal data, which can conflict with the GDPR's transparency and data minimization rules.
After training, models behave as black boxes: it is nearly impossible to determine what aspects of the data they retain in their internal representations, which makes it impossible to guarantee compliance with data retention limits and the right to erasure.
This lack of control over model representations creates a substantial risk of inadvertently generating personal, confidential, sensitive, or copyrighted information from the training dataset, causing conflicts with data protection regulations.
For instance, a developer identified large chunks of his copyrighted code in GitHub Copilot's output, even though he had explicitly stated that he did not allow GitHub to use his code.
Source: GitHub copilot under fire as dev claims it emits large chunks of my copyrighted code
To mitigate these risks, personal or sensitive information should be anonymized or filtered out of the training dataset before training, as sketched below.
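As an illustration, here is a minimal sketch of such a filtering step, assuming simple regular-expression patterns for emails and phone numbers; a production pipeline would typically rely on a dedicated PII-detection tool and human review rather than these illustrative rules.

```python
import re

# Illustrative patterns for two common kinds of personal data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def anonymize(text: str) -> str:
    """Replace obvious personal identifiers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

def clean_corpus(records: list[str]) -> list[str]:
    """Anonymize every record before it enters the training dataset."""
    return [anonymize(record) for record in records]

corpus = ["Contact Jane at jane.doe@example.com or +33 6 12 34 56 78."]
print(clean_corpus(corpus))
# ['Contact Jane at [EMAIL] or [PHONE].']
```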
Risks during the inference stage
During the inference stage, interacting with a generative model usually requires input from the end user. To continuously improve the model's performance, these inputs can be collected to expand the training dataset, a technique known as incremental training. However, when publicly available models are used without caution, end users can inadvertently provide confidential information, which may then leak into the model through further training.
For instance, by default OpenAI saves your ChatGPT conversations and prompts for future analysis, which can include retraining the model to improve its performance.
ChatGPT terms of use
We use data to make our models more helpful for people. ChatGPT, for instance, improves by further training on the conversations people have with it, unless you choose to disable training*.
Disabling training means that conversation history is no longer kept for retrieval after 30 days, so we give up some features in exchange for protecting our discussions.
Here, the risk cannot be fully mitigated with techniques such as data anonymization and filtering, because confidential real-world data can take any form. A culture of privacy awareness must be built within businesses to ensure data security and privacy.
Mitigation Strategies
To reduce data leaks and privacy issues with public generative models, multiple options can be deployed:
On-premise model deployment
Data cleaning
Access controls & usage monitoring
On-premise model deployment
For an organization, the most direct way to avoid the cybersecurity risks of using public generative models is to deploy and operate its own model. This solution gives full control over the data involved in both training and inference.
Additionally, this approach makes it possible to leverage confidential internal data to improve the model's performance on specific tasks, without risking a data leak.
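As a minimal sketch of this idea, an open-source model can be loaded and run entirely on local infrastructure with the Hugging Face transformers library. The small gpt2 model used here is only an example; a real deployment would involve a larger model, GPUs, and proper serving infrastructure.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# The model weights are downloaded once and run locally:
# prompts and outputs never leave the organization's infrastructure.
generator = pipeline("text-generation", model="gpt2")

prompt = "Summary of the internal incident report:"  # confidential text stays on-premise
result = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(result[0]["generated_text"])
```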
Data cleaning
Before sending inputs to a public generative model through an API, the call can be preceded by a data cleaning step that removes private or confidential data. However, while identifying personal information can be fairly straightforward, identifying business-related confidential information is trickier, as what counts as confidential is often specific to each organization and not easy to pin down.
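Below is a minimal sketch of such a cleaning step wrapped around an external call. The regex pattern, the confidential-term list, and the call_public_model() stub are all illustrative assumptions, not a real vendor API.

```python
import re

# Business-specific confidential terms are organization-dependent;
# this list is purely illustrative.
CONFIDENTIAL_TERMS = ["Project Phoenix", "ACME-Internal"]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean_prompt(prompt: str) -> str:
    """Redact personal and known confidential data before any external call."""
    prompt = EMAIL_RE.sub("[REDACTED_EMAIL]", prompt)
    for term in CONFIDENTIAL_TERMS:
        prompt = re.sub(re.escape(term), "[REDACTED]", prompt, flags=re.IGNORECASE)
    return prompt

def call_public_model(prompt: str) -> str:
    """Hypothetical stand-in for the vendor's API client."""
    return f"<model response to: {prompt}>"

def query_public_model(prompt: str) -> str:
    """Always clean the prompt before it leaves the organization."""
    return call_public_model(clean_prompt(prompt))

raw = "Draft an email to jane.doe@example.com about Project Phoenix pricing."
print(query_public_model(raw))
# <model response to: Draft an email to [REDACTED_EMAIL] about [REDACTED] pricing.>
```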
Access controls & usage monitoring
Controlling who has access to a public model through an API and monitoring its usage can reduce the risk of data leaks.
However, nothing prevents an employee from using a private account to perform tasks, so the risk of data leaks remains.
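For the sanctioned access path, a lightweight gatekeeping proxy can still enforce who is allowed to call the public model and keep an audit trail of usage. The sketch below is illustrative only, with a hard-coded allow-list and Python's standard logging module; a real setup would integrate with the organization's identity provider and monitoring stack.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("genai-proxy")

# Illustrative allow-list; in practice this would come from the identity provider.
AUTHORIZED_USERS = {"alice@acme.example", "bob@acme.example"}

def forward_to_model(user: str, prompt: str) -> str:
    """Gate and audit every request before it reaches the public model."""
    if user not in AUTHORIZED_USERS:
        log.warning("Blocked request from unauthorized user %s", user)
        raise PermissionError(f"{user} is not allowed to use the public model")

    # Audit trail: who asked, when, and how much text, without logging the content itself.
    log.info("%s | user=%s | prompt_chars=%d",
             datetime.now(timezone.utc).isoformat(), user, len(prompt))

    # Hypothetical stand-in for the actual vendor API call.
    return f"<model response to: {prompt}>"

print(forward_to_model("alice@acme.example", "Summarize this public press release."))
```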
The Importance of Developing Industry Standards and Guidelines
The capabilities of generative models, and their adoption, are growing at an exponential pace. As a consequence, it becomes increasingly essential to establish, share, and enforce best practices and guidelines for interacting with generative models.
These will help to build the foundation of standards shared by organizations to navigate potential risks, maintain compliance with data protection regulations and create a more secure environment for AI applications.
Sharing industry-wide guidelines
The use of generative models should be covered by a best-practice framework that ensures data privacy, security, and compliance. This framework should include best-practice considerations related to:
Data collection
Model training
Model deployment
Model monitoring
Collaborative efforts and knowledge sharing
Collaborative efforts between different stakeholders, including businesses, researchers, regulators, and industry experts, will be necessary to establish these standards across the industry. By working together and sharing knowledge and feedback, these stakeholders can develop a comprehensive understanding of the risks and challenges associated with generative models and create the most effective guidelines to address these concerns.
Encouraging transparency and accountability
AI service providers could benefit from being more transparent about their use of generative models, helping to build trust among users, customers, and partners. By adhering to guidelines, these providers can demonstrate their commitment to responsible AI practices and ensure accountability for their actions. It will also act as a safety guarantee for users, helping providers widen their customer base and deepen customer trust.
Facilitating regulatory compliance
Existing data protection regulations such as GDPR and CCPA could integrate these standards, which would help organizations simplify the process of achieving compliance. Indeed, these guidelines can serve as a roadmap for businesses to navigate the complex regulatory landscape and ensure they meet the requirements when using generative models.
Conclusion
In conclusion, generative language models are at a turning point: they will soon transform, if they have not already, the way people perform intellectual work, much as industrial machinery transformed the way people perform physical work.
However, greater capabilities call for greater caution. It is important to be aware of and assess the risks and drifts that can occur while using these models, and to build common standards shared by all actors in the field. This will help strike a balance that allows continued innovation with these tools without compromising data privacy and security.
Organizations should adopt this cautious and responsible approach when using these models, to leverage their capabilities without putting themselves at risk.
Written by
kevin sylla
Data Strategy Consultant with a strong interest in data governance, data quality, MLOps & ML in the cloud.