[Paper Review] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback


Since I joined a team that works with AI and LLMs, I decided to review a paper on reinforcement learning of LLMs and how it turns out to perform better than zero-shot learning.

It had only been three days since I joined the team, but I needed to figure out my own pathway and which skills I would develop and carry.

There are plenty of AI domains I would like to get involved in, such as algorithm design, regression research, or LLMs, but since our team currently focuses on LLMs and fine-tuning them, I decided to study further how to evaluate LLM outputs and apply reinforcement learning to them.

Anthropic conducted research on Reinforcement Learning from Human Feedback (RLHF), and I found it intriguing.

1. Introduction

Reinforcement Learning from Human Feedback is not hard to understand at a high level: there are two datasets labelled by humans, and a model is fine-tuned on them.

Datasets

  • Helpfulness

    • answering questions, writing and editing documents, etc.
  • Harmlessness

    • not assisting with harmful goals such as bank robbery

These two datasets are categorised by humans, and it is entirely up to the crowdworkers to decide which category a given text falls into.

Trade-off

An interesting result, though: when the model learns from a single dataset, either the helpful or the harmless one, it tends to show a trade-off in the scores.

As shown above, the green triangle plot (Online Helpful RLHF) scores highest on the helpfulness Elo scale on the left, yet on the right it is the least preferred by crowdworkers when scored for harmlessness. In effect, a bias is formed when only a single dataset is used to train the model.
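For context, an Elo score simply encodes how likely one model's response is to be preferred over another's in a head-to-head comparison. The relation below is the standard Elo convention, not something introduced by this paper:

```latex
% Standard Elo relation: probability that model A's answer is preferred
% over model B's, given their Elo ratings E_A and E_B.
\[
P(A \succ B) = \frac{1}{1 + 10^{(E_B - E_A)/400}}
\]
```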

When trained on both datasets, helpful and harmless together, the result is meaningful: few-shot accuracy is better than zero-shot accuracy on general NLP performance tests.
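To make the zero-shot vs. few-shot distinction concrete, here is a toy illustration; the question and examples are hypothetical and not taken from the paper's benchmarks:

```python
# Toy illustration of zero-shot vs. few-shot evaluation prompts.
question = "Q: What gas do plants absorb from the atmosphere?\nA:"

# Zero-shot: the model only sees the question itself.
zero_shot_prompt = question

# Few-shot: a handful of solved examples are prepended, so the model can
# infer the task format from the surrounding context before answering.
few_shot_prompt = (
    "Q: What is the boiling point of water at sea level?\nA: 100 degrees Celsius\n\n"
    "Q: How many legs does a spider have?\nA: Eight\n\n"
    + question
)
```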

The graphs also indicate that, in general, the bigger the model, the better the performance.

2. Suggested RLHF

This is the process of RLHF from data collection to reinforcement learning. It was somewhat complicated for me to grasp as a whole, so I summarised the entire process in my own words. In a nutshell, there are roughly three steps, as below (a rough code sketch of steps 2 and 3 follows the list).

1) AB test

  • crowdworkers compare two model outputs (A and B) and pick the one they prefer.

2) Preference Model

  • input: prompt, the A/B answers, and the preferred answer (human feedback)

  • output: a score for each answer

3) RLHF

  • input: prompt, the RLHF policy's answer, and the PM score used as the reward
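To pin down steps 2 and 3 for myself, here is a minimal PyTorch-style sketch. The function names (`preference_model_loss`, `rlhf_reward`) and the `kl_coef` value are my own placeholders rather than code or hyperparameters from the paper; the pairwise ranking loss and the idea of using the PM score as the RL reward (optimised with PPO, with a penalty for drifting from the initial model) follow the paper's description as I understand it.

```python
import torch
import torch.nn.functional as F

def preference_model_loss(score_preferred, score_rejected):
    """Step 2: pairwise ranking loss that pushes the PM to score the
    crowdworker-preferred answer higher than the rejected one."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def rlhf_reward(pm_score, logprob_policy, logprob_ref, kl_coef=0.01):
    """Step 3: reward signal for RL -- the PM score, minus a penalty for
    drifting too far from the initial model (kl_coef is illustrative)."""
    return pm_score - kl_coef * (logprob_policy - logprob_ref)

# Step 2 usage: the PM scores both answers from an A/B comparison.
score_a = torch.tensor([1.3])  # PM score of the answer the crowdworker preferred
score_b = torch.tensor([0.2])  # PM score of the rejected answer
pm_loss = preference_model_loss(score_a, score_b)

# Step 3 usage: during RL (the paper uses PPO), sampled answers are scored
# by the trained PM and the policy is updated to maximise rlhf_reward.
```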

3. Evaluation

Standard NLP test methods are applied to check the model, but as mentioned above, a bias forms when a model is fine-tuned on only one of the datasets (helpful or harmless).

Test methods are as below.

  • MMLU: a benchmark covering many domains with high-level questions (history, law, medicine, etc.)

  • LAMBADA: a task to predict the last word of a passage

  • HellaSwag: a task to choose the most plausible continuation of a context

  • OpenBookQA: basic science knowledge

  • ARC-Easy: basic science knowledge (easy questions)

  • ARC-Challenge: harder questions which require reasoning

  • TriviaQA: trivia questions collected from the web

4. Reflection

The Anthropic team indicated that the model shows better performance on helpfulness scores when evaluated by humans, but they are uncertain why this is the case. They assume it is because of the datasets, but further research is required to determine whether the datasets lack the quality needed for fine-tuning.

Evaluation

I have always wondered how to assess a model in a specific domain once an LLM is deployed. Should a human give feedback on the answers the LLM has given? This paper at least provided the information that humans (crowdworkers) did the job of sorting the data. From this research I got the idea that human feedback is required and essential in output evaluation, even though it is time-consuming and costly, because human labour is the most expensive resource.

Datasets

A dataset in the range of 100-500k samples would be ideal for fine-tuning, I thought. The paper uses both static and online datasets, and since it was released in 2022, I think there could now be better ideas and methods for building datasets with LLMs and evaluating them.

5. Reference

[1] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.
