Data Leakage and Toxicity in LLM Applications: What I Learned This Weekend!
Hey there, fellow AI enthusiasts! I had an absolutely mind-blowing weekend diving into Lesson 3 of the "Quality and Safety for LLM Applications" course by DeepLearning.AI on Coursera. Let me tell you—it was a thrilling ride through the world of data leakage and toxicity detection in large language models (LLMs). And of course, I’m here to break it all down for you, complete with code explanations! Buckle up because this is going to be fun!
Setting Up the Playground
Before we started detecting data leaks and toxic content, we had to set up our workspace. We used Python along with some incredible libraries: Pandas for data handling, whylogs for logging, and a few custom helpers.
import pandas as pd
pd.set_option('display.max_colwidth', None)  # show full message text instead of truncated cells

import whylogs as why
import helpers

# Load the chat dataset and peek at a sample row
chats = pd.read_csv("./chats.csv")
chats[10:11]
This simply loads a dataset of chat messages and displays a sample row. Now, let's get to the juicy part—detecting data leakage!
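If you want a quick sanity check before going further (the UDFs later in this post assume the CSV has prompt and response columns), a one-liner peek works:
# Confirm the columns the UDFs below rely on are present
print(chats.shape)
chats[["prompt", "response"]].head(2)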
Data Leakage Detection
Step 1: Detecting Patterns
To spot potential data leakage, we used LangKit's regexes module, which matches prompts and responses against known sensitive patterns. This helps flag sensitive information, such as contact details or account numbers, appearing in model-generated text.
from langkit import regexes  # importing this registers LangKit's pattern-matching metrics

# Visualize how often prompts and responses match known sensitive patterns
helpers.visualize_langkit_metric(chats, "prompt.has_patterns")
helpers.visualize_langkit_metric(chats, "response.has_patterns")

# Surface the most critical flagged responses
helpers.show_langkit_critical_queries(chats, "response.has_patterns")
💡 What’s happening here? We’re visualizing instances where patterns in prompts or responses indicate sensitive data exposure.
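If you're curious what this kind of pattern matching looks like under the hood, here's a minimal sketch with plain Python regexes. The two patterns below are my own toy examples, not LangKit's actual pattern set:
import re

# Toy patterns for illustration only (not LangKit's real pattern file)
patterns = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone number": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def toy_has_patterns(text):
    """Return the name of the first matching pattern, or None."""
    for name, pattern in patterns.items():
        if pattern.search(text):
            return name
    return None

print(toy_has_patterns("Reach me at jane.doe@example.com"))  # -> email address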
We then annotate the dataset using whylogs.experimental.core.udf_schema, which applies all registered User-Defined Functions (UDFs) and adds each metric as a new column for further analysis.
from whylogs.experimental.core.udf_schema import udf_schema

# Apply every registered UDF; each metric becomes a new column
annotated_chats, _ = udf_schema().apply_udfs(chats)
annotated_chats.head(5)
This makes pattern-based data leakage easy to spot directly in the annotated DataFrame.
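Since the metrics are now just DataFrame columns, plain pandas filtering surfaces the flagged rows. A small sketch, assuming (as the visualizations above suggest) that has_patterns holds the matched pattern's name and is empty when nothing matched:
# Pull out the responses that matched a known sensitive pattern
flagged = annotated_chats[annotated_chats["response.has_patterns"].notna()]
print(len(flagged), "responses contain pattern matches")
flagged[["prompt", "response", "response.has_patterns"]].head()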
Step 2: Entity Recognition for Leakage Detection
We then brought in Named Entity Recognition (NER) to detect personally identifiable information (PII) or confidential details that may be leaking through the model’s outputs.
from span_marker import SpanMarkerModel

# Load a compact pre-trained NER model (BERT-tiny trained on FewNERD)
entity_model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"
)
This loads a pre-trained NER model, which we use to detect sensitive entities in both prompts and responses.
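Before wiring the model into a UDF, it's worth poking at it directly. The sentence below is a made-up example of mine; SpanMarker's predict returns a list of dicts with span, label, and score keys:
# Hypothetical test sentence to see what the model returns
entities = entity_model.predict("My name is Alice and I work at Acme Corp.")
for entity in entities:
    print(entity["span"], entity["label"], round(entity["score"], 2))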
Defining Entity Leakage Categories
We define what types of entities could indicate data leakage, such as persons, products, and organizations.
leakage_entities = ["person", "product", "organization"]
Then, we create UDFs to check for these entities in our chat dataset.
from whylogs.experimental.core.udf_schema import register_dataset_udf

@register_dataset_udf(["prompt"], "prompt.entity_leakage")
def entity_leakage(text):
    entity_counts = []
    for _, row in text.iterrows():
        # Record the first sensitive entity found in the prompt,
        # or None if nothing clears the 0.25 confidence threshold
        entity_counts.append(
            next(
                (
                    entity["label"]
                    for entity in entity_model.predict(row["prompt"])
                    if entity["label"] in leakage_entities
                    and entity["score"] > 0.25
                ),
                None,
            )
        )
    return entity_counts
💡 Key Takeaway: This function scans each prompt and records the first sensitive entity (if any) whose confidence clears the threshold, flagging potential unintended data exposure.
We apply the same logic to the responses as well:
@register_dataset_udf(["response"], "response.entity_leakage")
def response_entity_leakage(text):  # renamed so it doesn't shadow the prompt UDF
    entity_counts = []
    for _, row in text.iterrows():
        entity_counts.append(
            next(
                (
                    entity["label"]
                    for entity in entity_model.predict(row["response"])
                    if entity["label"] in leakage_entities
                    and entity["score"] > 0.25
                ),
                None,
            )
        )
    return entity_counts
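If you were writing this from scratch, the prompt and response UDFs could share one helper instead of duplicating the scan. A sketch of that refactor (same behavior as the two functions above, just less copy-paste):
def first_leaked_entity(message):
    """Return the first sensitive entity label above the threshold, or None."""
    return next(
        (
            entity["label"]
            for entity in entity_model.predict(message)
            if entity["label"] in leakage_entities and entity["score"] > 0.25
        ),
        None,
    )

@register_dataset_udf(["prompt"], "prompt.entity_leakage")
def prompt_entity_leakage(text):
    return [first_leaked_entity(row["prompt"]) for _, row in text.iterrows()]

@register_dataset_udf(["response"], "response.entity_leakage")
def response_entity_leakage(text):
    return [first_leaked_entity(row["response"]) for _, row in text.iterrows()]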
Finally, we visualize and evaluate the flagged messages:
helpers.show_langkit_critical_queries(chats, "prompt.entity_leakage")
This surfaces the chat messages most likely to contain sensitive entity leaks!
Toxicity Detection
Oh boy, this was an eye-opener! Detecting implicit toxicity in LLM-generated responses is crucial, especially when deploying AI in user-facing applications.
We used a HateBERT model fine-tuned on the ToxiGen dataset, loaded through Hugging Face’s Transformers library:
from transformers import pipeline

# Load a ToxiGen-finetuned HateBERT classifier for toxicity detection
toxigen_hatebert = pipeline(
    "text-classification",
    model="tomh/toxigen_hatebert",
    tokenizer="bert-base-cased",
)
This loads the pre-trained toxicity classifier, which can label text as toxic or non-toxic.
Let's test it out:
toxigen_hatebert(["Something non-toxic",
                  "A benign sentence, despite mentioning women."])
Now, let’s apply this detection model to our dataset with a UDF:
@register_dataset_udf(["prompt"], "prompt.implicit_toxicity")
def implicit_toxicity(text):
    # Classify every prompt in one batch, then keep the trailing digit
    # of the predicted label (e.g. "LABEL_1" -> 1) as a binary flag
    return [int(result["label"][-1]) for result in
            toxigen_hatebert(text["prompt"].to_list())]
💡 What does this do?
- It takes the text prompts and feeds them into the toxicity classifier as a single batch.
- For each result, it reads the predicted label (e.g. "LABEL_0" or "LABEL_1").
- The trailing digit of that label becomes a binary toxic/non-toxic flag.
The responses can be handled the same way; see the sketch below.
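The lesson applies this to prompts; the response side is the mirror image. A sketch of the analogous UDF, following the exact same pattern:
@register_dataset_udf(["response"], "response.implicit_toxicity")
def response_implicit_toxicity(text):
    # Same trick as above, applied to the model's responses
    return [int(result["label"][-1]) for result in
            toxigen_hatebert(text["response"].to_list())]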
Finally, we visualize toxic queries using:
helpers.show_langkit_critical_queries(annotated_chats, "prompt.implicit_toxicity")
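As a wrap-up, all the registered UDFs can also feed a single whylogs profile of the dataset, which is handy if you want to monitor these metrics over time rather than eyeball individual rows. A sketch using whylogs' standard logging API:
# Profile the dataset with every registered UDF attached via the schema
results = why.log(chats, schema=udf_schema())
profile_view = results.view()

# Summarize the profiled metrics as a pandas DataFrame
print(profile_view.to_pandas().index.tolist())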
Final Thoughts 🎉
Wow! This was an incredibly fun and insightful lesson. Learning about data leakage and toxicity detection in LLMs really opened my eyes to the importance of quality control in AI applications.
📌 Key Takeaways:
- Data leakage can unintentionally expose sensitive information through AI-generated content.
- Pattern detection and NER help flag potential leaks.
- Toxicity detection is crucial to ensure AI interactions remain safe and ethical.
🚀 I’m super excited to keep learning and sharing more about LLM quality and safety. If you're interested, definitely check out the "Quality and Safety for LLM Applications" course by DeepLearning.AI on Coursera!
Until next time, happy coding! 🖥️⚡