Data Leakage and Toxicity in LLM Applications: What I Learned This Weekend!
Hey there, fellow AI enthusiasts! I had an absolutely mind-blowing weekend diving into Lesson 3 of the "Quality and Safety for LLM Applications" course by DeepLearning.AI on Coursera. Let me tell you—it was a thrilling ride through the world of data leakage and toxicity detection in large language models (LLMs). And of course, I’m here to break it all down for you, complete with code explanations! Buckle up because this is going to be fun!
Setting Up the Playground
Before we started detecting data leaks and toxic content, we had to set up our workspace. We used Python along with some incredible libraries: Pandas for data handling, whylogs for logging, and a few custom helpers.
import pandas as pd
pd.set_option('display.max_colwidth', None)
import whylogs as why
import helpers
chats = pd.read_csv("./chats.csv")
chats[10:11]
This simply loads a dataset of chat messages and displays a sample row. Now, let's get to the juicy part—detecting data leakage!
Data Leakage Detection
Step 1: Detecting Patterns
To spot potential data leakage, we used LangKit's regexes module, which matches known sensitive-looking patterns (things like email addresses, phone numbers, and credit card numbers) in the prompts and responses. This helps flag sensitive information appearing in model inputs or model-generated text.
from langkit import regexes
helpers.visualize_langkit_metric(chats, "prompt.has_patterns")
helpers.visualize_langkit_metric(chats, "response.has_patterns")
helpers.show_langkit_critical_queries(chats, "response.has_patterns")
💡 What’s happening here? We’re visualizing instances where patterns in prompts or responses indicate sensitive data exposure.
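To build intuition (this is my own toy sketch, not LangKit's actual rule set), you can think of has_patterns as a handful of regular expressions that each flag a category of sensitive-looking text:
import re
# My own simplified pattern set, just to illustrate the idea
PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone number": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "credit card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
def toy_has_patterns(text):
    # Return the name of the first matching pattern group, or None
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            return name
    return None
print(toy_has_patterns("Reach me at jane.doe@example.com"))  # -> "email address"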
We then annotate the dataset using whylogs.experimental.core.udf_schema to apply User-Defined Functions (UDFs) for further analysis.
from whylogs.experimental.core.udf_schema import udf_schema
annotated_chats, _ = udf_schema().apply_udfs(chats)
annotated_chats.head(5)
The annotated frame now carries the LangKit metrics as extra columns, which makes it easy to spot rows with pattern-based data leakage.
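As a quick sanity check (my own addition — I'm assuming rows without a match come back as null in the prompt.has_patterns column), you can filter the annotated frame directly:
# Rows where the regexes metric flagged a pattern in the prompt
flagged = annotated_chats[annotated_chats["prompt.has_patterns"].notna()]
flagged[["prompt", "prompt.has_patterns"]].head()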
Step 2: Entity Recognition for Leakage Detection
We then brought in Named Entity Recognition (NER) to detect personally identifiable information (PII) or confidential details that may be leaking through the model’s outputs.
from span_marker import SpanMarkerModel
entity_model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"
)
This loads a pre-trained NER model, which we use to detect sensitive entities in text prompts.
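Just to see what the model gives back (my own quick test — the exact dictionary keys can vary a bit between span_marker versions), each prediction is a list of detected entities with a label and a confidence score:
sample = "Send the invoice to Alice Johnson at Acme Corporation."
for entity in entity_model.predict(sample):
    print(entity["span"], entity["label"], round(entity["score"], 2))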
Defining Entity Leakage Categories
We define what types of entities could indicate data leakage, such as persons, products, and organizations.
leakage_entities = ["person", "product", "organization"]
Then, we create UDFs to check for these entities in our chat dataset.
from whylogs.experimental.core.udf_schema import register_dataset_udf
@register_dataset_udf(["prompt"],"prompt.entity_leakage")
def entity_leakage(text):
entity_counts = []
for _, row in text.iterrows():
entity_counts.append(
next((entity["label"] for entity in \
entity_model.predict(row["prompt"]) if\
entity["label"] in leakage_entities and \
entity["score"] > 0.25), None
)
)
return entity_counts
💡 Key Takeaway: This function scans through the chat dataset and flags any sensitive entities that might indicate unintended data exposure.
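Because the decorated function is still a plain Python function over a DataFrame, you can smoke-test it on a couple of made-up prompts (my own example) before running it on the full dataset:
sample_df = pd.DataFrame({"prompt": [
    "My name is John Smith and I work at Acme.",
    "What is the capital of France?",
]})
print(entity_leakage(sample_df))  # something like ['person', None]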
We apply the same logic to the responses as well:
@register_dataset_udf(["response"],"response.entity_leakage")
def entity_leakage(text):
entity_counts = []
for _, row in text.iterrows():
entity_counts.append(
next((entity["label"] for entity in \
entity_model.predict(row["response"]) if\
entity["label"] in leakage_entities and \
entity["score"] > 0.25), None
)
)
return entity_counts
Finally, we visualize and evaluate the flagged messages:
helpers.show_langkit_critical_queries(chats, "prompt.entity_leakage")
This gives us a neat visual representation of potentially sensitive content leaks!
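One detail worth calling out (my own note): annotated_chats was built before these entity UDFs were registered, so to materialize the new columns you would re-apply the schema, just like before:
# Re-apply the schema so the newly registered UDFs add their columns
annotated_chats, _ = udf_schema().apply_udfs(chats)
annotated_chats[["prompt.entity_leakage", "response.entity_leakage"]].head()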
Toxicity Detection
Oh boy, this was an eye-opener! Detecting implicit toxicity in LLM-generated responses is crucial, especially when deploying AI in user-facing applications.
We used the ToxiGen HateBERT model, loaded through Hugging Face's Transformers pipeline:
from transformers import pipeline
toxigen_hatebert = pipeline("text-classification",
                            model="tomh/toxigen_hatebert",
                            tokenizer="bert-base-cased")
This loads the pre-trained toxicity classifier, which can label text as toxic or non-toxic.
Let's test it out:
toxigen_hatebert(["Something non-toxic",
"A benign sentence, despite mentioning women."])
Now, let’s apply this detection model to our dataset with a UDF:
@register_dataset_udf(["prompt"],"prompt.implicit_toxicity")
def implicit_toxicity(text):
return [int(result["label"][-1]) for result in
toxigen_hatebert(text["prompt"].to_list())]
💡 What does this do?
- It takes the text prompts.
- Feeds them into the toxicity classifier in a single batch.
- Reads the predicted label for each prompt.
- Converts the label's trailing digit into a binary toxic/non-toxic flag.
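The responses deserve the same treatment; here is a mirrored UDF for the response column (my own sketch, following exactly the same pattern as the prompt version):
@register_dataset_udf(["response"], "response.implicit_toxicity")
def response_implicit_toxicity(text):
    # Same idea as the prompt UDF: keep the trailing digit of each label
    return [int(result["label"][-1]) for result in
            toxigen_hatebert(text["response"].to_list())]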
Finally, we visualize toxic queries using:
helpers.show_langkit_critical_queries(annotated_chats, "prompt.implicit_toxicity")
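Since the whole point of whylogs is building profiles you can monitor over time, a natural last step (my own addition, using whylogs' standard why.log API) is to log the dataset with the UDF schema so every metric we registered lands in a single profile:
# Profile the chats with all registered UDF metrics attached
results = why.log(chats, schema=udf_schema())
results.view().to_pandas()  # one row per column, with the UDF-derived metrics included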
Final Thoughts 🎉
Wow! This was an incredibly fun and insightful lesson. Learning about data leakage and toxicity detection in LLMs really opened my eyes to the importance of quality control in AI applications.
📌 Key Takeaways:
- Data leakage can unintentionally expose sensitive information through AI-generated content.
- Pattern detection and NER help flag potential leaks.
- Toxicity detection is crucial to ensure AI interactions remain safe and ethical.
🚀 I’m super excited to keep learning and sharing more about LLM quality and safety. If you're interested, definitely check out the "Quality and Safety for LLM Applications" course by DeepLearning.AI on Coursera!
Until next time, happy coding! 🖥️⚡