Data Leakage and Toxicity in LLM Applications: What I Learned This Weekend!
Hey there, fellow AI enthusiasts! I had an absolutely mind-blowing weekend diving into Lesson 3 of the "Quality and Safety for LLM Applications" course by DeepLearning.AI on Coursera. Let me tell you—it was a thrilling ride through the world of data leakage and toxicity detection in large language models (LLMs). And of course, I’m here to break it all down for you, complete with code explanations! Buckle up because this is going to be fun!
Setting Up the Playground
Before we started detecting data leaks and toxic content, we had to set up our workspace. We used Python along with some incredible libraries: Pandas for data handling, whylogs for logging, and a few custom helpers.
import pandas as pd
pd.set_option('display.max_colwidth', None)
import whylogs as why
import helpers
chats = pd.read_csv("./chats.csv")
chats[10:11]
This simply loads a dataset of chat messages and displays a sample row. Now, let's get to the juicy part—detecting data leakage!
Data Leakage Detection
Step 1: Detecting Patterns
To spot potential data leakage, we used LangKit's regexes module, which matches known sensitive-looking patterns (things like email addresses, phone numbers, and credit card numbers) in the prompts and responses. This helps flag sensitive information appearing in model inputs or model-generated text.
from langkit import regexes
helpers.visualize_langkit_metric(chats, "prompt.has_patterns")
helpers.visualize_langkit_metric(chats, "response.has_patterns")
helpers.show_langkit_critical_queries(chats, "response.has_patterns")
💡 What’s happening here? We’re visualizing instances where patterns in prompts or responses indicate sensitive data exposure.
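To build intuition (this is my own toy sketch, not LangKit's actual rule set), you can think of has_patterns as a handful of regular expressions that each flag a category of sensitive-looking text:
import re
# My own simplified pattern set, just to illustrate the idea
PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone number": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "credit card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
def toy_has_patterns(text):
    # Return the name of the first matching pattern group, or None
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            return name
    return None
print(toy_has_patterns("Reach me at jane.doe@example.com"))  # -> "email address"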
We then annotate the dataset using whylogs.experimental.core.udf_schema to apply User-Defined Functions (UDFs) for further analysis.
from whylogs.experimental.core.udf_schema import udf_schema
annotated_chats, _ = udf_schema().apply_udfs(chats)
annotated_chats.head(5)
The annotated frame now carries the LangKit metrics as extra columns, which makes it easy to spot rows with pattern-based data leakage.
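As a quick sanity check (my own addition — I'm assuming rows without a match come back as null in the prompt.has_patterns column), you can filter the annotated frame directly:
# Rows where the regexes metric flagged a pattern in the prompt
flagged = annotated_chats[annotated_chats["prompt.has_patterns"].notna()]
flagged[["prompt", "prompt.has_patterns"]].head()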
Step 2: Entity Recognition for Leakage Detection
We then brought in Named Entity Recognition (NER) to detect personally identifiable information (PII) or confidential details that may be leaking through the model’s outputs.
from span_marker import SpanMarkerModel
entity_model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"
)
This loads a pre-trained NER model, which we use to detect sensitive entities in text prompts.
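Just to see what the model gives back (my own quick test — the exact dictionary keys can vary a bit between span_marker versions), each prediction is a list of detected entities with a label and a confidence score:
sample = "Send the invoice to Alice Johnson at Acme Corporation."
for entity in entity_model.predict(sample):
    print(entity["span"], entity["label"], round(entity["score"], 2))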
Defining Entity Leakage Categories
We define what types of entities could indicate data leakage, such as persons, products, and organizations.
leakage_entities = ["person", "product", "organization"]
Then, we create UDFs to check for these entities in our chat dataset.
from whylogs.experimental.core.udf_schema import register_dataset_udf
@register_dataset_udf(["prompt"],"prompt.entity_leakage")
def entity_leakage(text):
entity_counts = []
for _, row in text.iterrows():
entity_counts.append(
next((entity["label"] for entity in \
entity_model.predict(row["prompt"]) if\
entity["label"] in leakage_entities and \
entity["score"] > 0.25), None
)
)
return entity_counts
💡 Key Takeaway: This function scans through the chat dataset and flags any sensitive entities that might indicate unintended data exposure.
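Because the decorated function is still a plain Python function over a DataFrame, you can smoke-test it on a couple of made-up prompts (my own example) before running it on the full dataset:
sample_df = pd.DataFrame({"prompt": [
    "My name is John Smith and I work at Acme.",
    "What is the capital of France?",
]})
print(entity_leakage(sample_df))  # something like ['person', None]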
We apply the same logic to the responses as well:
@register_dataset_udf(["response"],"response.entity_leakage")
def entity_leakage(text):
entity_counts = []
for _, row in text.iterrows():
entity_counts.append(
next((entity["label"] for entity in \
entity_model.predict(row["response"]) if\
entity["label"] in leakage_entities and \
entity["score"] > 0.25), None
)
)
return entity_counts
Finally, we visualize and evaluate the flagged messages:
helpers.show_langkit_critical_queries(chats, "prompt.entity_leakage")
This gives us a neat visual representation of potentially sensitive content leaks!
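One detail worth calling out (my own note): annotated_chats was built before these entity UDFs were registered, so to materialize the new columns you would re-apply the schema, just like before:
# Re-apply the schema so the newly registered UDFs add their columns
annotated_chats, _ = udf_schema().apply_udfs(chats)
annotated_chats[["prompt.entity_leakage", "response.entity_leakage"]].head()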
Toxicity Detection
Oh boy, this was an eye-opener! Detecting implicit toxicity in LLM-generated responses is crucial, especially when deploying AI in user-facing applications.
We used the ToxiGen HateBERT model, loaded through Hugging Face's Transformers pipeline:
from transformers import pipeline
toxigen_hatebert = pipeline("text-classification",
                            model="tomh/toxigen_hatebert",
                            tokenizer="bert-base-cased")
This loads the pre-trained toxicity classifier, which can label text as toxic or non-toxic.
Let's test it out:
toxigen_hatebert(["Something non-toxic",
"A benign sentence, despite mentioning women."])
Now, let’s apply this detection model to our dataset with a UDF:
@register_dataset_udf(["prompt"],"prompt.implicit_toxicity")
def implicit_toxicity(text):
return [int(result["label"][-1]) for result in
toxigen_hatebert(text["prompt"].to_list())]
💡 What does this do?
- It takes the text prompts.
- Feeds them into the toxicity classifier in a single batch.
- Reads the predicted label for each prompt.
- Converts the label's trailing digit into a binary toxic/non-toxic flag.
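The responses deserve the same treatment; here is a mirrored UDF for the response column (my own sketch, following exactly the same pattern as the prompt version):
@register_dataset_udf(["response"], "response.implicit_toxicity")
def response_implicit_toxicity(text):
    # Same idea as the prompt UDF: keep the trailing digit of each label
    return [int(result["label"][-1]) for result in
            toxigen_hatebert(text["response"].to_list())]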
Finally, we visualize toxic queries using:
helpers.show_langkit_critical_queries(annotated_chats, "prompt.implicit_toxicity")
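Since the whole point of whylogs is building profiles you can monitor over time, a natural last step (my own addition, using whylogs' standard why.log API) is to log the dataset with the UDF schema so every metric we registered lands in a single profile:
# Profile the chats with all registered UDF metrics attached
results = why.log(chats, schema=udf_schema())
results.view().to_pandas()  # one row per column, with the UDF-derived metrics included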
Final Thoughts 🎉
Wow! This was an incredibly fun and insightful lesson. Learning about data leakage and toxicity detection in LLMs really opened my eyes to the importance of quality control in AI applications.
📌 Key Takeaways:
- Data leakage can unintentionally expose sensitive information through AI-generated content.
- Pattern detection and NER help flag potential leaks.
- Toxicity detection is crucial to ensure AI interactions remain safe and ethical.
🚀 I’m super excited to keep learning and sharing more about LLM quality and safety. If you're interested, definitely check out the "Quality and Safety for LLM Applications" course by DeepLearning.AI on Coursera!
Until next time, happy coding! 🖥️⚡