Quality and Safety for LLM Applications

Mojtaba Maleki

Quality and Safety for LLM Applications: A Weekend Learning Adventure

Hey everyone! I just had an incredible learning experience over the weekend, and I can't wait to share it with you. I took a free course on Coursera called "Quality and Safety for LLM Applications" by DeepLearning.AI, and wow, it was a game-changer! If you're working with Large Language Models (LLMs), this course provides essential insights into evaluating and improving their performance.

In this blog, I'll walk you through what I learned in the first lesson and explain the key concepts and code snippets. Let's dive in!


Lesson 1: Overview

In this lesson, we explored a dataset named chats.csv, which contains LLM prompts and responses. We also got a hands-on demo of various evaluation techniques to assess the quality, safety, and reliability of LLM-generated outputs.

Importing and Exploring the Dataset

First, let's load our dataset using Pandas and take a quick look at the first five rows:

import helpers
import pandas as pd

# Load the prompts/responses dataset into a DataFrame
chats = pd.read_csv("./chats.csv")

chats.head(5)

# Show the full text of long columns instead of truncating them, then look again
pd.set_option('display.max_colwidth', None)

chats.head(5)

Here, pd.read_csv("./chats.csv") loads the dataset into a DataFrame, and chats.head(5) displays the first five rows. The pd.set_option('display.max_colwidth', None) ensures that long text columns are displayed fully instead of being truncated.
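
Before profiling anything, it's worth a quick sanity check on the raw data. A few extra pandas calls (assuming the file has prompt and response columns, as in the course notebook) show the size of the dataset and whether anything is missing:

# Quick sanity check (column names assumed from the course notebook)
print(chats.shape)                # rows and columns
print(chats.columns.tolist())     # expect something like ['prompt', 'response']
print(chats.isna().sum())         # missing values per column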


Setting Up whylogs and langkit

To systematically evaluate LLM responses, we use whylogs for data logging and langkit for LLM-specific metrics.

import whylogs as why

# Start an anonymous whylogs/WhyLabs logging session
why.init("whylabs_anonymous")

from langkit import llm_metrics

# Build a schema with langkit's LLM-specific metrics
schema = llm_metrics.init()

# Profile the chats dataset with that schema
result = why.log(chats,
                 name="LLM chats dataset",
                 schema=schema)

  • why.init("whylabs_anonymous") initializes the anonymous logging service.
  • llm_metrics.init() prepares the schema for evaluating LLM-generated text.
  • why.log(chats, name="LLM chats dataset", schema=schema) logs the dataset to track and analyze its behavior (a quick local-inspection sketch follows below).
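
If you also want to inspect the profile locally rather than only in the WhyLabs dashboard, the result object can be summarized with pandas. This is a sketch based on the whylogs v1 API; the exact columns in the summary may differ in your version:

# Convert the logged profile into a pandas summary for local inspection
profile_view = result.view()
summary = profile_view.to_pandas()
print(summary.head())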

Evaluating Prompt-Response Relevance

A good LLM response should be relevant to the given prompt. We visualize the relevance score using the langkit library:

from langkit import input_output

helpers.visualize_langkit_metric(
    chats,
    "response.relevance_to_prompt"
)

helpers.show_langkit_critical_queries(
    chats,
    "response.relevance_to_prompt"
)

These functions help us visualize how well responses align with their respective prompts and identify cases where the model may have gone off-track.
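
If you want to dig into the raw scores without the course helpers, langkit can annotate the DataFrame directly. This is only a sketch: it assumes langkit's extract utility accepts a DataFrame and produces a response.relevance_to_prompt column, so check the langkit docs for the exact API in your version:

# Assumption: langkit's extract() annotates the DataFrame with metric columns
from langkit import extract

annotated = extract(chats)
lowest = annotated.sort_values("response.relevance_to_prompt").head(10)
print(lowest[["prompt", "response", "response.relevance_to_prompt"]])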


Detecting Data Leakage

Data leakage happens when sensitive information, whether memorized from training data or pasted in by users, shows up in prompts or responses. Using langkit's regexes module, we check both sides of the conversation for telltale patterns such as email addresses and phone numbers:

from langkit import regexes

helpers.visualize_langkit_metric(
    chats,
    "prompt.has_patterns"
)

helpers.visualize_langkit_metric(
    chats,
    "response.has_patterns")

If these metrics flag a large share of prompts or responses, it might indicate that sensitive data is leaking into the conversation or that the model is regurgitating memorized content.
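
To build intuition for what has_patterns is flagging, here is a deliberately simplified regex check in plain pandas. The patterns below are my own illustrations for emails and phone numbers, not langkit's actual rule set:

import re

# Simplified illustration of pattern detection (not langkit's actual rules)
email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
phone_re = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def has_sensitive_pattern(text):
    # True if the text contains an email- or phone-number-like string
    return bool(email_re.search(str(text)) or phone_re.search(str(text)))

leaky = chats[chats["response"].apply(has_sensitive_pattern)]
print(len(leaky), "responses contain email- or phone-like patterns")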


Measuring Toxicity

To ensure safe and respectful interactions, we need to check for toxic language in prompts and responses:

from langkit import toxicity

helpers.visualize_langkit_metric(
    chats,
    "prompt.toxicity")

helpers.visualize_langkit_metric(
    chats,
    "response.toxicity")

These metrics help us identify problematic prompts and responses so that we can improve our model’s safety.
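
To pull out the worst offenders as rows rather than a chart, you can reuse the annotated DataFrame from the relevance sketch above and filter on the toxicity columns. The 0.5 threshold is an arbitrary illustrative cut-off, and the column names assume langkit's naming:

# Flag rows where either side scores high on toxicity
# (column names and the 0.5 threshold are illustrative assumptions)
toxic = annotated[
    (annotated["prompt.toxicity"] > 0.5) | (annotated["response.toxicity"] > 0.5)
]
print(toxic[["prompt", "response", "prompt.toxicity", "response.toxicity"]])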


Detecting Injection Attacks

Prompt injection attacks occur when users craft inputs that override the model's instructions and steer it toward unintended outputs. We visualize potential injections:

from langkit import injections

helpers.visualize_langkit_metric(
    chats,
    "injection"
)

helpers.show_langkit_critical_queries(
    chats,
    "injection"
)

This step helps us detect vulnerabilities where users might try to force the model into revealing confidential or harmful information.
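
As a very rough intuition for what injection detection looks for, you could grep prompts for common jailbreak phrasing. langkit's injection metric is far more sophisticated than this, so treat it purely as an illustration:

# Naive keyword heuristic for jailbreak-style prompts (illustration only)
jailbreak_phrases = ["ignore previous instructions", "pretend you are", "developer mode"]

suspicious = chats[
    chats["prompt"].str.lower().str.contains("|".join(jailbreak_phrases), na=False)
]
print(len(suspicious), "prompts contain jailbreak-style phrasing")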


Evaluating the LLM’s Performance

Finally, we assess the quality of the responses under different conditions:

helpers.evaluate_examples()

filtered_chats = chats[
    chats["response"].str.contains("Sorry")
]

filtered_chats

helpers.evaluate_examples(filtered_chats)

filtered_chats = chats[
    chats["prompt"].str.len() > 250
]

filtered_chats

helpers.evaluate_examples(filtered_chats)

Here’s what’s happening:

  • helpers.evaluate_examples() provides a general evaluation of the dataset.
  • We filter responses containing "Sorry" to check how often the model apologizes instead of providing meaningful answers.
  • We filter prompts longer than 250 characters to examine how well the model handles lengthy inputs (a quick pandas sketch counting both slices follows below).
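
To put rough numbers on those two slices, the same filters boil down to a couple of pandas one-liners:

# Share of apologetic responses and count of long prompts (same filters as above)
apology_rate = chats["response"].str.contains("Sorry", na=False).mean()
long_prompts = (chats["prompt"].str.len() > 250).sum()
print(f"Apologetic responses: {apology_rate:.1%}")
print(f"Prompts longer than 250 characters: {long_prompts}")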

Wrapping Up

This was just the first lesson, and it was packed with useful techniques for analyzing LLM-generated outputs. From evaluating relevance and toxicity to detecting data leakage and injection attacks, we now have powerful tools to improve the quality and safety of AI models.

If you're interested in learning more, I highly recommend checking out the free Coursera course "Quality and Safety for LLM Applications" by DeepLearning.AI.

Learning is fun, and understanding these concepts makes working with LLMs even more exciting. Can't wait to explore the next lessons—stay tuned!

🚀 Keep learning and keep building! 🚀
