AI/ML - Langchain4j - Chat Memory
In the preceding article, we were introduced to AI/ML concepts and explored running a Large Language Model (LLM) locally with Ollama, and interacting with it from Java using JBang and Langchain4j.
Now, let's explore what "chat memory" is and how Langchain4j helps with the cumbersome task of maintaining it.
To begin with, let's discuss why chat memory is necessary. LLMs are stateless: they do not preserve any conversation state between calls, so supporting an extended conversation requires the application to carefully manage the dialogue context.
If we run the OllamaMistralExample from the previous article, we get the following responses from the model:
Please enter your question - 'exit' to quit: My name is Kevin, the minion. I work for Gru!
Hello Kevin the Minion! It's great to meet you, the dedicated and hardworking minion from Gru's team. I'm here to help answer any questions or provide information you may need. What can I assist you with today?
Please enter your question - 'exit' to quit: Who is my boss?
I cannot determine who your boss is as I don't have the ability to access or interpret real-world information. Your boss would be the person who has authority over you in your workplace, such as a manager or supervisor. If you are unsure, it may be best to ask someone in a position of seniority within your organization or consult your employment contract or HR department for clarification.
Please enter your question - 'exit' to quit: What is my name?
I am an artificial intelligence and do not have a name or personal identity. I exist to provide information and answer questions to the best of my ability. How may I assist you today?
From the responses above, we can clearly see that the model does not remember the context of the conversation; being stateless, it retains nothing between calls. Hence, the application interacting with the LLM has to manage the conversation messages sent to and received from the LLM.
For sending multiple messages, LangChain4j's ChatLanguageModel interface provides the following methods:
default Response<AiMessage> generate(ChatMessage... messages);
Response<AiMessage> generate(List<ChatMessage> messages);
default Response<AiMessage> generate(List<ChatMessage> messages, ToolSpecification toolSpecification);
default Response<AiMessage> generate(List<ChatMessage> messages, List<ToolSpecification> toolSpecifications);
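As a quick illustration of the second method, Response<AiMessage> generate(List<ChatMessage> messages), here is a minimal blocking sketch. It is not part of the original example; the class name and the hard-coded questions are only illustrative, and it assumes the same local Ollama endpoint and mistral model used throughout this series.
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import dev.langchain4j.model.output.Response;

class OllamaMistralBlockingSketch { // hypothetical class name, for illustration only

    public static void main(String[] args) {
        // Blocking model: generate(List<ChatMessage>) returns the full response in one go
        ChatLanguageModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("mistral")
                .timeout(Duration.ofSeconds(120))
                .build();

        // The conversation so far; we re-send the whole list on every call
        List<ChatMessage> messages = new ArrayList<>();
        messages.add(UserMessage.from("My name is Kevin, the minion. I work for Gru!"));
        Response<AiMessage> first = model.generate(messages);
        messages.add(first.content()); // keep the AI reply so the next call has context

        messages.add(UserMessage.from("Who is my boss?"));
        Response<AiMessage> second = model.generate(messages);
        System.out.println(second.content().text());
    }
}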
Now let's build a full, interactive example around the same idea. Since we want the response streamed token by token, we use the streaming variant of the model; its StreamingChatLanguageModel.generate(List<ChatMessage> messages, StreamingResponseHandler<AiMessage> handler) method is the streaming counterpart of the second method above and accepts the same list of messages.
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import java.io.Console;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.chat.StreamingChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;
import dev.langchain4j.model.output.Response;
class OllamaMistralBasicMemory {

    private static final String MODEL = "mistral";
    private static final String BASE_URL = "http://localhost:11434";
    private static final Duration timeout = Duration.ofSeconds(120);

    public static void main(String[] args) {
        beginChatWithBasicMemory();
    }

    static void beginChatWithBasicMemory() {
        Console console = System.console();
        // Holds the whole conversation: user questions and AI replies
        List<ChatMessage> messages = new ArrayList<>();
        StreamingChatLanguageModel model = OllamaStreamingChatModel.builder()
                .baseUrl(BASE_URL)
                .modelName(MODEL)
                .timeout(timeout)
                .temperature(0.0)
                .build();
        String question = console.readLine(
                "\n\nPlease enter your question - 'exit' to quit: ");
        while (!"exit".equalsIgnoreCase(question)) {
            messages.add(UserMessage.from(question));
            CompletableFuture<Response<AiMessage>> futureResponse = new CompletableFuture<>();
            model.generate(messages, new StreamingResponseHandler<AiMessage>() {

                @Override
                public void onNext(String token) {
                    System.out.print(token);
                }

                @Override
                public void onComplete(Response<AiMessage> response) {
                    // Remember the AI reply so it becomes part of the context of the next call
                    messages.add(response.content());
                    futureResponse.complete(response);
                }

                @Override
                public void onError(Throwable error) {
                    futureResponse.completeExceptionally(error);
                }
            });
            // Wait for the streaming response to finish before asking the next question
            futureResponse.join();
            question = console.readLine("\n\nPlease enter your question - 'exit' to quit: ");
        }
    }
}
The OllamaMistralBasicMemory class is a modified version of the OllamaMistralExample class from the previous article. We use the StreamingChatLanguageModel, which lets us receive each token as soon as it is generated rather than having to wait for the full response.
Here we use an ArrayList to store the UserMessage and AiMessage objects that are sent to the LLM every time we ask it to generate a response. After each user input, messages.add(UserMessage.from(question)) appends the question to the list; once the response has been fully received, the onComplete(Response<AiMessage> response) callback fires and appends the AI reply with messages.add(response.content()). The futureResponse.join() call blocks the loop until streaming has finished before prompting for the next question.
Now, try executing OllamaMistralBasicMemory. This time the responses align with what we expect and the model appears to keep track of the context. The following is the output for the same conversation as above:
Please enter your question - 'exit' to quit: My name is Kevin, the minion. I work for Gru!
Hello Kevin the Minion! It's great to meet you, the dedicated and hardworking minion from Gru's team. I'm here to help answer any questions or provide information you may need. What can I assist you with today?
Please enter your question - 'exit' to quit: What is my name?
I apologize for the confusion earlier, Kevin. You have introduced yourself as Kevin the Minion. So, your name is indeed Kevin! Is there something specific you would like to know or discuss related to Gru's lab or minion activities?
Please enter your question - 'exit' to quit: Who is my boss?
Your boss is Gru! He is the mastermind and leader of the evil organization that you and your fellow Minions work for. Gru is known for his cunning plans and schemes, and he relies on your help to carry them out. If you have any questions or need assistance with tasks related to Gru's plans, feel free to ask!
As we can see, the LLM now remembers the context and provides appropriate responses to the questions. However, there are a few problems with this implementation:
- First, LLMs possess a finite context window that accommodates a certain number of tokens at any given moment. Conversations have the potential to surpass this limit
- Second, each token comes with a cost, and that cost grows steadily as more and more tokens are sent to the LLM with every request
- Third, the resource usage increases considerably on both the LLM and the application over time as the list builds up
Managing ChatMessages manually is an arduous task. To simplify it, LangChain4j provides the ChatMemory interface: a container for ChatMessages, backed by a List, that offers additional features such as persistence (through ChatMemoryStore) and, crucially, an eviction policy. The eviction policy addresses the issues described above.
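As a side note on persistence, here is a minimal sketch, assuming the builder API of MessageWindowChatMemory and the bundled InMemoryChatMemoryStore, of how a ChatMemoryStore can be plugged in; a custom store could persist the messages to a database instead. The memory id used here is purely illustrative.
// Minimal sketch - assumes InMemoryChatMemoryStore (dev.langchain4j.store.memory.chat) is on the classpath
ChatMemory memory = MessageWindowChatMemory.builder()
        .id("kevin-session")                             // hypothetical id to keep separate conversations apart
        .maxMessages(10)
        .chatMemoryStore(new InMemoryChatMemoryStore())  // swap in a custom ChatMemoryStore for real persistence
        .build();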
LangChain4j currently implements two eviction-policy algorithms (a small construction sketch for the second one follows the list):
- MessageWindowChatMemory provides a sliding-window functionality, retaining the N most recent messages and evicting older ones once the list grows beyond the specified capacity N. However, the SystemMessage type of ChatMessage is retained and never evicted; only the other message types (UserMessage, AiMessage and ToolExecutionResultMessage) are candidates for eviction.
- TokenWindowChatMemory also provides a sliding-window functionality, but retains the N most recent tokens instead of messages. A Tokenizer needs to be specified to count the tokens in each ChatMessage. If there isn't enough space for a new message, the oldest one (or several) is evicted. Messages are indivisible: if a message doesn't fit, it is evicted completely. Like MessageWindowChatMemory, TokenWindowChatMemory does not evict SystemMessage.
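The full example below uses MessageWindowChatMemory, so here is a minimal sketch, under stated assumptions, of how a TokenWindowChatMemory could be constructed. It assumes the OpenAiTokenizer from the langchain4j-open-ai module (an extra dependency) as the Tokenizer implementation; any Tokenizer works, as it is only used to count tokens.
// Minimal sketch - assumes an extra dependency, e.g. //DEPS dev.langchain4j:langchain4j-open-ai:0.28.0
Tokenizer tokenizer = new OpenAiTokenizer("gpt-3.5-turbo");                // counts the tokens in each ChatMessage
ChatMemory memory = TokenWindowChatMemory.withMaxTokens(1000, tokenizer);  // retains roughly the most recent 1000 tokens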
Now, let's reimplement the OllamaMistralBasicMemory example using ChatMemory with the MessageWindowChatMemory eviction policy:
//JAVA 21
//DEPS dev.langchain4j:langchain4j:0.28.0
//DEPS dev.langchain4j:langchain4j-ollama:0.28.0
import java.io.Console;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.chat.StreamingChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;
import dev.langchain4j.model.output.Response;
class OllamaMistralChatMemory {

    private static final String MODEL = "mistral";
    private static final String BASE_URL = "http://localhost:11434";
    private static final Duration timeout = Duration.ofSeconds(120);

    public static void main(String[] args) {
        beginChatWithChatMemory();
        // Terminate the JVM explicitly so no lingering non-daemon threads keep it alive
        System.exit(0);
    }

    static void beginChatWithChatMemory() {
        Console console = System.console();
        // Sliding window of at most 3 messages; older ones are evicted automatically
        ChatMemory memory = MessageWindowChatMemory.withMaxMessages(3);
        StreamingChatLanguageModel model = OllamaStreamingChatModel.builder()
                .baseUrl(BASE_URL)
                .modelName(MODEL)
                .timeout(timeout)
                .temperature(0.0)
                .build();
        String question = console.readLine(
                "\n\nPlease enter your question - 'exit' to quit: ");
        while (!"exit".equalsIgnoreCase(question)) {
            memory.add(UserMessage.from(question));
            CompletableFuture<Response<AiMessage>> futureResponse = new CompletableFuture<>();
            model.generate(memory.messages(), new StreamingResponseHandler<AiMessage>() {

                @Override
                public void onNext(String token) {
                    System.out.print(token);
                }

                @Override
                public void onComplete(Response<AiMessage> response) {
                    // Let the memory manage the AI reply (and evict older messages if needed)
                    memory.add(response.content());
                    futureResponse.complete(response);
                }

                @Override
                public void onError(Throwable error) {
                    futureResponse.completeExceptionally(error);
                }
            });
            // Wait for the streaming response to finish before asking the next question
            futureResponse.join();
            question = console.readLine("\n\nPlease enter your question - 'exit' to quit: ");
        }
    }
}
Here we have set the maximum number of messages to 3 for the sake of testing quickly; a higher value can be set if needed. Therefore, at most 3 ChatMessages are retained, counting both questions (UserMessage) and responses (AiMessage).
Run the program, state your name first, and then ask a few more questions so that the message carrying your name falls outside the 3-message window. If you now ask the LLM for your name, it no longer has that context because MessageWindowChatMemory has evicted the older messages. This is where LangChain4j does the heavy lifting of managing the messages for us.
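To observe the eviction in isolation, here is a minimal sketch (not part of the original example; it reuses the imports of the class above) that adds four messages to a window of three and prints what remains:
// Minimal sketch: eviction in a MessageWindowChatMemory of capacity 3
ChatMemory memory = MessageWindowChatMemory.withMaxMessages(3);
memory.add(UserMessage.from("My name is Kevin, the minion."));  // 1
memory.add(AiMessage.from("Hello Kevin!"));                     // 2
memory.add(UserMessage.from("Who is my boss?"));                // 3
memory.add(AiMessage.from("Your boss is Gru!"));                // 4 - the oldest message is evicted
memory.messages().forEach(m -> System.out.println(m.type() + ": " + m.text()));
// The message introducing "Kevin" is gone, so "What is my name?" would now lack that context.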
ChatMemory is a low-level component for managing messages. LangChain4j also provides higher-level components, AiServices and ConversationalChain, which we will explore in the upcoming articles.
The code examples can be found here
Happy Coding!