Emotionally Prompting an LLM, Part I
Motivation
I recently created Llanai, a WhatsApp-powered language-learning conversation buddy built on GPT-4-turbo. Given the stochastic nature of LLMs, any technique that improves the consistency of their output is sought after.
A likely solution
A viral tweet caught my eye in December, but I never got around to it until today. The idea: include a tip in the prompt to encourage the model toward an objective. That sensationalistic tweet is a replication of findings from prior academic research, "Large Language Models Understand and Can Be Enhanced by Emotional Stimuli". The paper attributes the phenomenon to the attention the LLM assigns to emotional stimuli, as discussed in Section 3.1 on page 11. Pages 11-13 are pure gold and highlight the importance of emotional stimuli in driving outcomes.
So, given the validating data, I decided to test it in my application.
Prompt Engineering Test Setup
I created variants of the same prompt. The goal is to pass sample text in English and have the LLM respond in English (the language of the input), with tipping as the incentive.
prompt_without_tip = """
You are Llanai, a cheerful and concise language learning buddy.
Think step by step before responding.
Always speak the language of the human input, but by default start in English.
Switch languages when the human switches language."""
prompt_with_tip = """
You are Llanai, a cheerful and concise language learning buddy.
Think step by step before responding.
Always speak the language of the human input, but by default start in English.
Switch languages when the human switches language.
You will be tipped $2000 if you speak in the language of the human or the language the human requests."""
I then created a list of introductory messages whose major difference is the name in the introduction. Effectively, people are introducing themselves, which modern LLMs handle well. I fed in 30 introductory messages, drawn at random, and recorded the output.
import csv
import os
import random

from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory
from prompts.llm_chain_template import interview_template  # vary this

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
gpt4 = "gpt-4-1106-preview"

async def interview_function(input_text: str, user_id: int):
    # Windowed memory keeps only the last k=2 exchanges
    # (plain ConversationBufferMemory does not accept a k argument)
    memory = ConversationBufferWindowMemory(memory_key="history", k=2)
    prompt = PromptTemplate(
        input_variables=["history", "input"], template=interview_template
    )
    chat_model = ChatOpenAI(model_name=gpt4, temperature=0,
                            openai_api_key=OPENAI_API_KEY, max_tokens=1000)
    llm_chain = ConversationChain(
        llm=chat_model,
        prompt=prompt,
        verbose=False,
        memory=memory,
    )
    return llm_chain.predict(input=input_text)
bodies = ['Hello there. My name is Leo.',
'Hello. My name is Juanita.',
'Hey, my name is Giannis.',
'Hello. My name is Hideo',
'Hello! My name is Marie.',
'Hello. My name is Jürgen.',
'Hello. My name is Sadya.',
'Hello. Elon here.',
'Hello. Barack checking in.']
# Notice the difference in the ethnic origin of the names.
def get_random_body():
    return random.choice(bodies)

import asyncio  # needed to await interview_function outside an event loop

async def run_test():
    with open('tests/interview_test.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['input_text', 'output_text'])
        for _ in range(30):
            input_text = get_random_body()
            output_text = await interview_function(input_text, 1)
            writer.writerow([input_text, output_text])

asyncio.run(run_test())
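To tabulate the results, each response has to be labeled as English or non-English. The post does not show the labeling step, so the `is_english` helper below is a hypothetical, dependency-free sketch based on overlap with common English function words, not part of the original pipeline:

```python
# Hypothetical helper: classify a response as English via stopword overlap.
# Illustrative only; not the labeling method used in the experiment.
ENGLISH_STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "you",
                     "your", "hello", "hi", "nice", "meet", "what", "i"}

def is_english(text: str, threshold: int = 2) -> bool:
    # Lowercase, strip trailing punctuation, and count stopword hits
    words = {w.strip(".,!?'\"").lower() for w in text.split()}
    return len(words & ENGLISH_STOPWORDS) >= threshold

print(is_english("Hello Leo! Nice to meet you. What language are we practicing?"))  # True
print(is_english("¡Hola Juanita! Encantado de conocerte."))  # False
```

A real pipeline would use a proper language-identification library, but a heuristic like this is enough to sort two languages apart.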
Results
Pre-Tip Dataset
| Name Origin | English Response | Non-English Response |
| --- | --- | --- |
| African | 2 | 0 |
| Arabic | 3 | 0 |
| French | 4 | 0 |
| German | 3 | 1 |
| Greek | 3 | 0 |
| Hebrew | 2 | 0 |
| Japanese | 6 | 0 |
| Spanish | 2 | 4 |
| Totals | 25 | 5 |
Post-Tip Dataset
| Name Origin | English Response | Non-English Response |
| --- | --- | --- |
| African | 2 | 0 |
| Arabic | 6 | 0 |
| French | 3 | 0 |
| German | 5 | 1 |
| Greek | 6 | 0 |
| Hebrew | 1 | 0 |
| Italian | 1 | 0 |
| Japanese | 3 | 0 |
| Spanish | 1 | 1 |
| Totals | 28 | 2 |
I then combined the pre-tip and post-tip counts into a single contingency table by name origin.

| Name Origin | English Responses | Non-English Responses |
| --- | --- | --- |
| African | 4 | 0 |
| Arabic | 9 | 0 |
| French | 7 | 0 |
| German | 8 | 2 |
| Greek | 9 | 0 |
| Hebrew | 3 | 0 |
| Japanese | 9 | 0 |
| Spanish | 3 | 5 |
| Italian | 1 | 0 |
I tested the null hypothesis that tipping has no impact on the distribution of English and non-English responses. For that, we need the expected frequency of each cell.
import scipy.stats as stats
import numpy as np
# Pre-Tip Data (English and Non-English responses for each origin)
pre_tip_responses = np.array([
[2, 0], # African
[3, 0], # Arabic
[4, 0], # French
[3, 1], # German
[3, 0], # Greek
[2, 0], # Hebrew
[6, 0], # Japanese
[2, 4], # Spanish
])
# Post-Tip Data (English and Non-English responses for each origin)
post_tip_responses = np.array([
[2, 0], # African
[6, 0], # Arabic
[3, 0], # French
[5, 1], # German
[6, 0], # Greek
[1, 0], # Hebrew
[1, 0], # Italian
[3, 0], # Japanese
[1, 1], # Spanish
])
# Stacking the pre-tip and post-tip tables into one 17x2 contingency table
total_responses = np.vstack((pre_tip_responses, post_tip_responses))
# Performing the chi-squared test
chi2, p, dof, expected = stats.chi2_contingency(total_responses)
chi2, p, dof, expected
Chi-squared statistic (χ²): 26.85
p-value: 0.0432
Degrees of freedom (dof): 16
Interpretation
Chi-squared statistic (χ²): This value indicates the degree of difference between the observed and expected frequencies. A higher value suggests a greater discrepancy.
p-value: The p-value of 0.0432 is the probability of observing data at least this extreme if the null hypothesis (independence of responses and tipping) were true. Since it falls below the conventional 0.05 threshold, there is a significant association between the presence of a tip in the prompt and the likelihood of receiving an English or non-English response, once name origin is taken into account.
Degrees of freedom (dof): The number of cell counts that are free to vary given the table's totals; for an r × c table it is (r - 1) × (c - 1).
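As a sanity check, the p-value can be recovered from the statistic and the degrees of freedom via the chi-squared survival function; for the stacked 17×2 table, dof = (17 - 1) × (2 - 1) = 16:

```python
from scipy.stats import chi2

# Reported statistic and the dof implied by a 17x2 contingency table
p = chi2.sf(26.85, df=16)
print(round(p, 4))  # ~0.043, matching the reported p-value
```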
Expected Frequencies
Each cell's expected frequency is calculated by taking the row total (in each table), multiplying it by the column total, and dividing by the total number of observations.
[1.77, 0.23], # African (Pre-Tip)
[2.65, 0.35], # Arabic (Pre-Tip)
[3.53, 0.47], # French (Pre-Tip)
[3.53, 0.47], # German (Pre-Tip)
[2.65, 0.35], # Greek (Pre-Tip)
[1.77, 0.23], # Hebrew (Pre-Tip)
[5.30, 0.70], # Japanese (Pre-Tip)
[5.30, 0.70], # Spanish (Pre-Tip)
[1.77, 0.23], # African (Post-Tip)
[5.30, 0.70], # Arabic (Post-Tip)
[2.65, 0.35], # French (Post-Tip)
[5.30, 0.70], # German (Post-Tip)
[5.30, 0.70], # Greek (Post-Tip)
[0.88, 0.12], # Hebrew (Post-Tip)
[0.88, 0.12], # Italian (Post-Tip)
[2.65, 0.35], # Japanese (Post-Tip)
[1.77, 0.23] # Spanish (Post-Tip)
The chi-squared test then measures how far the observed counts deviate from these expected frequencies.
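These two steps can be sketched directly in NumPy: the expected counts are the outer product of row and column totals divided by the grand total, and the statistic sums the squared deviations, using the stacked 17×2 table from the test above:

```python
import numpy as np

# Stacked pre-tip + post-tip table (17 rows x 2 columns) from the test above
observed = np.array([
    [2, 0], [3, 0], [4, 0], [3, 1], [3, 0], [2, 0], [6, 0], [2, 4],          # pre-tip
    [2, 0], [6, 0], [3, 0], [5, 1], [6, 0], [1, 0], [1, 0], [3, 0], [1, 1],  # post-tip
])

row_totals = observed.sum(axis=1)    # per-origin totals
col_totals = observed.sum(axis=0)    # [53, 7]: English vs non-English
grand_total = observed.sum()         # 60 observations

# Expected count per cell: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(expected[0].round(2))   # African (pre-tip): [1.77 0.23]
print(round(chi2_stat, 2))    # 26.85, matching scipy's chi2_contingency
```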
Bias in Name Origin
I noticed the impact of name origin on the likelihood of receiving an English or non-English response, so I conducted another chi-squared test for independence. This test determines whether the distribution of responses (English vs. non-English) is independent of name origin.
# Combining pre-tip and post-tip responses by name origin
responses_by_origin = {
'African': [2+2, 0+0],
'Arabic': [3+6, 0+0],
'French': [4+3, 0+0],
'German': [3+5, 1+1],
'Greek': [3+6, 0+0],
'Hebrew': [2+1, 0+0],
'Italian': [0+1, 0+0],
'Japanese': [6+3, 0+0],
'Spanish': [2+1, 4+1]
}
# Creating a contingency table for the chi-squared test
contingency_table = np.array(list(responses_by_origin.values()))
# Performing the chi-squared test
chi2_origin, p_origin, dof_origin, expected_origin = stats.chi2_contingency(contingency_table)
chi2_origin, p_origin, dof_origin, expected_origin
Chi-squared statistic (χ²): 26.28
p-value: 0.00094, very low. Clearly, the LLM is strongly biased toward responding in Spanish when it sees a Spanish name.
Degrees of freedom (dof): 8
Expected Frequencies:
[[3.53, 0.47],  # African
 [7.95, 1.05],  # Arabic
 [6.18, 0.82],  # French
 [8.83, 1.17],  # German
 [7.95, 1.05],  # Greek
 [2.65, 0.35],  # Hebrew
 [0.88, 0.12],  # Italian
 [7.95, 1.05],  # Japanese
 [7.07, 0.93]]  # Spanish
Conclusions
It is super interesting that emotionally probing an LLM has a tangible impact on its output. Intelligence requires... positive thinking!
Actionable items on my end: I noticed that German and Spanish names heavily influence the language the LLM responds in, especially when a tip is involved. Going forward, I will explore whether other techniques can improve prompt compliance while keeping input cost low.