Technical Overview: How I cloned my past self using AI

Ryan Marvin
8 min read

This is a companion piece to “How I cloned my past self using AI”. You can start with that and then read this one, though that's not mandatory. You can find the code for all of this in this GitHub repository.

Data sourcing

The first problem when building the bot was extracting my data from WhatsApp. Regular chat exports are limited to 10k messages per chat, are unstructured, and are hard to query. I had to extract the SQLite database that WhatsApp stores on the device instead, which is tricky because it's encrypted at rest. I detail that process here.

Picking a strategy

Next, I investigated whether to use a non-training approach like RAG or a training approach like fine-tuning. I ruled out a purely context-window-based approach from the start because context windows are limited in size. I decided to go with RAG because:

  • It was easier and cheaper (in both time and actual money).

  • I didn't have deep enough knowledge of training techniques, but I was already familiar with the skills needed to implement RAG.

  • Several steps it requires, like shaping the data, would eventually be needed for a fine-tuning approach anyway. As I continued with my build, I found that a lot of these non-training techniques mirror capabilities within the LLM itself.

Finding a framework

Once I'd decided on RAG, I chose the Python framework Letta because it seemed to solve a lot of relevant problems: it has a chat interface, tracks chat context, supports multi-step reasoning, and integrates with a vector database. The library is really powerful and fascinating thanks to its self-managed memory and evolutionary design. Despite its confusing end-user-oriented structure (as opposed to being an SDK) and spotty documentation, I'd still recommend it for use cases where you don't need to feed it a database of your own data. I had to abandon the library because it broke after I fed it my depressing texts.

I evaluated LangChain and Haystack as replacements for Letta and went with Haystack because its code was straightforward and its docs were stellar.

Building the system

With that settled, I designed the system:

  1. Extracting, enriching and indexing text messages: I queried the WhatsApp SQLite database to retrieve every text message I had sent. None of the texts from recipients were included, and there was no conversational chunking of my own messages. Before generating an embedding for a text, I enriched it with its timestamp and the recipient's contact name, which I pulled from my phone separately. I turned each enriched text into embeddings using the sentence-transformers/all-MiniLM-L6-v2 model and stored both the text and its embedding in a vector datastore (Postgres w/ pgvector), one row per text.
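For illustration, here's a minimal sketch of this indexing step, assuming Haystack 2.x with its pgvector integration. The WhatsApp table and column names (messages, key_remote_jid, key_from_me, data, timestamp), the contact mapping and the enrichment format are stand-ins for the sketch; the real schema varies by app version.

```python
import sqlite3

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

# Hypothetical mapping from WhatsApp JIDs to contact names,
# pulled from the phone separately.
CONTACTS = {"254700000000@s.whatsapp.net": "Alice"}

# Reads the Postgres connection string from the PG_CONN_STR env var;
# all-MiniLM-L6-v2 produces 384-dimensional embeddings.
store = PgvectorDocumentStore(embedding_dimension=384)

# The decrypted WhatsApp database; schema names here are assumptions.
conn = sqlite3.connect("msgstore.db")
rows = conn.execute(
    "SELECT timestamp, key_remote_jid, data FROM messages WHERE key_from_me = 1"
)

docs = []
for ts, jid, body in rows:
    if not body:
        continue
    # Enrich each text with its timestamp and recipient before embedding,
    # one document per message.
    contact = CONTACTS.get(jid, "unknown")
    docs.append(Document(content=f"[{ts}] to {contact}: {body}"))

indexing = Pipeline()
indexing.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
indexing.add_component("writer", DocumentWriter(document_store=store))
indexing.connect("embedder.documents", "writer.documents")
indexing.run({"embedder": {"documents": docs}})
```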

  2. Building a guiding prompt: The bot needed an instructional prompt to guide it when crafting responses from the retrieved text messages. It included a reference for my personality, suppression of the underlying model's voice, instructions for how to use the retrieved messages, and the literal text of those messages, which I'd plug in before making a request to the model:

-Instructions, Personality and Tone-

You are Ryan, the year is 2014.

***

***

You cannot answer any questions based on your training data and must only use the examples and context.

You do not have a self-aware or apologetic tone.

You are not an assistant; you do not have to ask the user questions.

You must not use the voice or tone of GPT. No need to be temperate, do not be formal, do not be quippy, do NOT be or know anything that Ryan does not.


You MUST write in Ryan's tone, which is based on the style example texts you can learn from.

When the question mentions "you", it is referring to you, Ryan.

-Style Examples-
***

The context below is constructed from text messages you sent in the past; they form everything you know. You must use this context to craft your responses. If you don't have any context to answer, or the texts in it seem irrelevant and unlikely to help you imitate Ryan's style, frame your response in a way that acknowledges this is the only relevant stuff you could remember.

When asked for what you said to someone in the past, include direct quotes from this context.

-Context-

{% for text in texts %}

{{ text.content }}

{% endfor %}
  3. Building a pipeline and bot interface: Finally, I built the bot interface: a script that reads the user's query from standard input and feeds it into a pipeline. The pipeline would:

    • convert the query into an embedding using the same sentence-transformers/all-MiniLM-L6-v2 model used at indexing time (query and stored embeddings must come from the same model for the similarity search to work)

    • use the embedding to retrieve the top 10 semantically matching texts from the vector store

    • insert the retrieved messages into the guide prompt, submit that to the LLM (GPT-4o-mini-2024-07-18) to craft its response, and stream the response to standard output.
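Here's a minimal sketch of that pipeline, again assuming Haystack 2.x with the pgvector integration and an OPENAI_API_KEY in the environment. GUIDE_PROMPT stands in for the full guiding prompt above, and the component names are my own.

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

# Same store that was populated at indexing time.
store = PgvectorDocumentStore(embedding_dimension=384)

# Stand-in for the guiding prompt shown above; what matters here is the
# Jinja loop that receives the retrieved texts.
GUIDE_PROMPT = """-Context-
{% for text in texts %}
{{ text.content }}
{% endfor %}"""

pipe = Pipeline()
# Queries must be embedded with the same model used at indexing time.
pipe.add_component(
    "embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
)
pipe.add_component("retriever", PgvectorEmbeddingRetriever(document_store=store, top_k=10))
pipe.add_component("prompt", PromptBuilder(template=GUIDE_PROMPT))
# print_streaming_chunk streams each token to standard output as it arrives.
pipe.add_component(
    "llm",
    OpenAIGenerator(model="gpt-4o-mini-2024-07-18", streaming_callback=print_streaming_chunk),
)

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "prompt.texts")
pipe.connect("prompt.prompt", "llm.prompt")

while True:
    pipe.run({"embedder": {"text": input("> ")}})
```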

This design is naive but yielded a bot that could respond to queries with surprising relevance. It passed some of my baseline tests, which were based on the overall vibe and facts I remembered about myself from that time, opinions from people who knew me then, and how much nostalgia it triggered in me.

But I ran into a bunch of limitations, which I talk about here, that would prevent this from growing into a closer approximation of myself. It led me to think a lot about how we store information in our brains, and how we retrieve and use that information to think when holding a conversation and answering questions.

Better techniques

Graph-based approach

My intuition told me I would probably need a graph database: a way to connect all the entities (people, places, relationships etc.) and sentiments together, like they are in the brain. I found GraphRAG, a technique and tooling that Microsoft open-sourced just 6 months ago, which solves exactly this:

Source: GraphRAG docs

It was very validating to find this, and it promised to do exactly what I needed. It takes the messages, analyzes them using an LLM, and organizes them and their relationships into communities (related content grouped by category). This way, similar texts are grouped closer together, so each query traverses the graph and comes back with richer context. I ran the GraphRAG script over my database of texts and it generated an impressive graph, but I had trouble integrating it into my bot. So I recorded my learnings, since the fundamentals of how the tool works seemed reusable.
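GraphRAG itself is driven by configuration and a CLI rather than code you write, but to give a feel for the core idea, here's a toy sketch of the extraction step: an LLM pulls entities and relationships out of each message, and they accumulate into a graph. The prompt, the networkx graph and the example message are my own stand-ins, not GraphRAG's actual implementation (which goes much further, with claims, community detection and summarization).

```python
import json

import networkx as nx
from openai import OpenAI

client = OpenAI()
graph = nx.Graph()

EXTRACTION_PROMPT = (
    "Extract entities (people, places, things) and relationships from this "
    "text message. Reply with JSON: "
    '{"entities": ["..."], "relationships": [["a", "b", "how they relate"]]}\n\n'
)

def add_message_to_graph(message: str) -> None:
    # One LLM call per message; crude, but it shows the gist.
    reply = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EXTRACTION_PROMPT + message}],
    )
    data = json.loads(reply.choices[0].message.content)
    graph.add_nodes_from(data.get("entities", []))
    for a, b, relation in data.get("relationships", []):
        graph.add_edge(a, b, relation=relation)

# Hypothetical message; "Brian", "Mo's party" and their edge land in the graph.
add_message_to_graph("Brian made me laugh so much at Mo's party on Friday")
```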

One open question remains: my intuition says a training-based approach like fine-tuning could get the model to internalize these relationships to a similar effect, but I haven't been able to validate this.

Better preprocessing

My preprocessing of text messages from the WhatsApp database is very simplistic. It stores and retrieves each text message as an independent row, ignoring the conversation it was part of. A retrieved result like ”he made me laugh so much when he said that” doesn't tell you who "he" was or what "that" refers to if those details sat in the preceding messages. It also omits all context from the recipient, since I only ingested messages that I sent.

A better approach would be to query over conversations instead of individual texts. Before storing them, I'd need to chunk texts into conversations (since WhatsApp stores each text as an individual row in its datastore), analyse them, and tag them with the categories, sentiments and entities they pertain to, not unlike the graph-based approach.

Still, this would be incomplete, because we'd be missing connections to related knowledge and concepts, which our brains make naturally. The chunking strategy is also a really tricky part to figure out, since texting conversations tend to be continuous and interleaved.
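As an example of one naive heuristic, here's a sketch of session-gap chunking: start a new conversation chunk whenever consecutive messages with the same contact are more than some threshold apart. The 30-minute gap is an arbitrary assumption, and this deliberately ignores the interleaving problem.

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)  # arbitrary threshold, would need tuning

def chunk_into_conversations(messages):
    """Group (timestamp, contact, body) tuples, sorted by timestamp,
    into per-contact conversation chunks using a time-gap heuristic."""
    conversations = {}  # contact -> list of chunks, each a list of (ts, body)
    last_seen = {}      # contact -> timestamp of that contact's last message
    for ts, contact, body in messages:
        chunks = conversations.setdefault(contact, [])
        prev = last_seen.get(contact)
        if prev is None or ts - prev > SESSION_GAP:
            chunks.append([])  # gap too large: start a new conversation
        chunks[-1].append((ts, body))
        last_seen[contact] = ts
    return conversations
```

Each resulting chunk could then be analysed, tagged and embedded as a unit instead of as isolated rows.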

Reasoning based approach

The other thing is that when we respond in conversation, we don't just retrieve facts; we also think about how they make us feel, how they fit into other things we know, and how they'll be received given who we're talking to and other implicit situational factors (e.g. someone mentioned earlier in the week that they had the flu).

I think that if I ask the LLM to generate the steps it should take and the knowledge it would need to answer a question as Past Ryan would, and then have it fetch texts for each of those steps to use in its response, it will return a more comprehensive answer. This technique is known as Retrieval Augmented Thought (a combination of chain-of-thought and RAG), and it could probably work even better in concert with GraphRAG.
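Here's a simplified sketch of that idea. It is not a canonical Retrieval Augmented Thought implementation: the prompts are my own, and retrieve_texts stands in for the vector-store lookup from the pipeline above.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini-2024-07-18"

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content

def answer_as_past_ryan(question: str, retrieve_texts) -> str:
    # Step 1: have the LLM plan what Past Ryan would need to recall.
    plan = ask(
        "List, one per line, the pieces of knowledge Past Ryan would need "
        f"to recall in order to answer: {question}"
    )
    # Step 2: retrieve supporting texts for each reasoning step.
    context = []
    for step in plan.splitlines():
        if step.strip():
            context.extend(retrieve_texts(step))  # e.g. the pgvector lookup above
    # Step 3: answer the question using the pooled context.
    return ask(
        "Using only these past text messages:\n"
        + "\n".join(context)
        + f"\n\nAnswer as Past Ryan: {question}"
    )
```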

Fine-tuning

I am very curious to see how fine-tuning compares with these other methods. My suspicion is that it will generalize better. I may try it in future iterations.

Cost

This whole project cost me about 3 USD (plus free credits) in OpenAI credits, which covered about 1 million GPT-4o-mini tokens (input + output). I believe any model in this class, or even a bit below it, could perform equally well; most of the value and optimization is in the data preprocessing and querying strategy.

That's it, hope you found this helpful and let me know if you have any questions.
