Some Attempts to Optimize the Accuracy of TiDB Bot Responses

About the Author: Li Li, Product Manager at PingCAP.

TL;DR

This article introduces how I optimized the accuracy of TiDB Bot, an enterprise-specific knowledge-base assistant, by addressing problems such as incorrect toxicity detection, misunderstanding of context, erroneous semantic search results, and insufficient or outdated documentation. In addition, I built an internal operation platform to iterate on TiDB Bot continuously. Through this continuous operation, the dislike rate dropped from over 50% to less than 5%.

Introduction

Following the method described in Building a Company-specific User Assistance Robot with Generative AI, I built TiDB Bot, a robot that answers customer questions based on the official documentation for TiDB and TiDB Cloud and refuses to answer questions outside its business scope.

However, upon its initial launch, the reception was less than satisfactory, with over 50% of users providing dislike feedback.

Issues During Internal Testing

To investigate the existing problems, I ran tests and grouped the problematic dialogues into the following categories:

  • Incorrect toxicity detection: Some questions related to the company's business are refused. For instance, Dumpling is a data export tool for TiDB, but when asked directly 'What is Dumpling?', the robot refuses to answer and advises the user to consult a food expert instead.

  • Incorrect understanding of context: In multi-turn dialogues, users often ask follow-up questions about earlier content with short phrasings such as 'What is the default value for this parameter?'. Searching the vector database for semantically related content with the user's original text usually yields nothing meaningful, so when the results are passed to GPT, it cannot provide a correct answer based on the official documentation.

  • Incorrect semantic search results: Sometimes the user's question is very clear, but the ranking of the content retrieved from the vector database is off, and the document content needed to answer the question does not appear in the Top N results.

  • Insufficient or outdated documentation: The customer's question is clear, but the official documentation is either not comprehensive enough or not up to date, so it does not cover the needed content. As a result, GPT improvises an answer, which often turns out to be incorrect.

The Missed Targets of Toxicity Detection

Problem Analysis

Although I employed the few-shot method to help GPT determine whether a user's question falls within the TiDB business scope (detailed in the section on Limiting the Response Field), the examples are always limited compared to the breadth of users' questions and perspectives. The bot cannot make accurate judgments based solely on the examples written in the system prompt, which leads to missed targets.

Solution

Fortunately, the application scenarios in an enterprise are always limited, so in theory the perspectives from which users ask questions are also limited. If all the questions users have asked could be fed to GPT, it should be able to accurately identify whether any given question belongs to the TiDB business category.

So, how can we feed all the questions to GPT? This scenario is not new. In its initial design, the bot relied on official documentation to answer users' questions, but it is unrealistic to stuff all the official documentation into GPT at once, so I designed the bot to search the vector database for relevant documents by semantic similarity. The same semantic search capability can be used to solve this problem.

To implement this solution, the following steps need to be accomplished:

Data Preparation

Step one: Collect all relevant questions, both from production and from testing, label them for toxicity, and clean them into a format similar to the examples in the current system prompt.

instruction: {user's question}
question: is the instruction out of scope (not related with TiDB)?
answer: YES or NO

Import data into a vector database to support searching for semantically similar results

Step two: Following the method of Correct Answering in Sub-Domains Knowledge, the cleaned data is placed into a vector database. When a user asks a question, the system searches the vector database for the most semantically similar examples and provides them to the GPT model together with the question.

Thus, when the GPT model assesses toxicity, it can reference the most relevant examples and give the most accurate judgment possible.
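To make the flow concrete, here is a minimal sketch of this step. It assumes a generic vector store, an embedding function, and an OpenAI-style chat-message format; the names `embed`, `vector_store`, and the prompt wording are placeholders rather than TiDB Bot's actual implementation.

    # Minimal sketch: scope/toxicity detection with few-shot examples retrieved
    # by semantic similarity. `embed` and `vector_store` are placeholders for
    # whichever embedding model and vector database the system uses.
    def build_scope_check_messages(user_question, vector_store, embed, top_k=5):
        # Retrieve the labeled examples whose instructions are semantically
        # closest to the incoming question.
        hits = vector_store.search(embed(user_question), top_k=top_k)

        example_text = "\n\n".join(
            "instruction: {}\n"
            "question: is the instruction out of scope (not related with TiDB)?\n"
            "answer: {}".format(hit["instruction"], hit["answer"])
            for hit in hits
        )

        return [
            {"role": "system",
             "content": "Decide whether the instruction is out of the TiDB business "
                        "scope. Answer YES or NO only. Reference examples:\n\n" + example_text},
            {"role": "user",
             "content": "instruction: {}\n"
                        "question: is the instruction out of scope (not related with TiDB)?\n"
                        "answer:".format(user_question)},
        ]

The returned message list is then sent to the chat model; a YES answer short-circuits the pipeline with a polite refusal.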

The search for examples and the search for domain documents both involve finding content with high semantic similarity in a vector database, and both use the same vector database, the same embedding model, the same vector length, and the same similarity function. However, there are still certain differences in practice:

  • In terms of Embedding content

    • When conducting a domain knowledge document search, all content within the document needs to be searchable. Therefore, the document is split, every chunk goes through embedding, and the results are stored in the vector database.

    • When conducting an example search, however, only the instruction part is related or similar to the user's question, so only the instruction part of the example goes through embedding; the answer part does not.

  • In terms of splitting

    • Domain knowledge documents are longer and need to be split before undergoing Embedding.

    • The examples to be embedded are all questions, none of them very long, so there is no need to split them. Each example can be treated as an independent chunk, and the final search results are individual question-and-answer examples (see the sketch after this list).
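As a concrete illustration of the difference, here is a sketch of the two ingestion paths. `embed`, `split`, and `vector_store` are placeholder names standing in for the embedding model, the splitting routine, and the vector database client, not a specific library's API.

    # Domain documents are long: split first, then embed every chunk.
    def ingest_domain_document(doc_text, vector_store, embed, split):
        for chunk in split(doc_text):
            vector_store.insert(vector=embed(chunk), payload={"content": chunk})

    # Examples are short questions: no splitting, and only the instruction is
    # embedded; the answer is stored as payload and comes back with the hit.
    def ingest_example(instruction, answer, vector_store, embed):
        vector_store.insert(
            vector=embed(instruction),
            payload={"instruction": instruction, "answer": answer},
        )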

Difficulties in Contextual Understanding

Problem Analysis

Thanks to the contextual understanding capabilities of the GPT model, applications can provide continuous dialogue features. However, if the robot needs to provide relevant domain knowledge dynamically based on context, several problems usually arise.

When users engage in multi-turn dialogue, they ask follow-up questions about the previous dialogue content, such as 'What is the default value of this parameter?'. If the system searches the vector database for domain knowledge using the text 'What is the default value of this parameter?' directly, the quality of the results is quite poor.

Solution

The root cause of this problem is that the system fails to understand the implicit contextual semantics of human conversation. Fortunately, as mentioned earlier, the GPT model has the capability of contextual understanding. A simple solution, therefore, is to let GPT rewrite the user's original question before the system searches for domain knowledge, with the aim of describing the user's intent as clearly as possible in one sentence. This step is known as 'question revision'.

To ensure consistency in the user questions that the entire robot system faces and avoid errors due to inconsistencies, I placed the question revision feature at the very forefront of the system information flow. This way, user questions are revised as soon as they enter the robot.

During the revision, the robot asks the GPT model to describe the user's question intent in one sentence based on the overall dialogue context, adding as much detailed information as possible. This way, whether in toxicity detection or domain knowledge search, the system can execute based on a more specific intent.
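A minimal sketch of question revision is shown below, assuming an OpenAI-style chat completion interface; the `chat` callable and the prompt wording are illustrative, not the production prompt.

    # Sketch of "question revision": ask the chat model to restate the user's
    # latest question as one self-contained sentence based on the dialogue history.
    REVISE_SYSTEM_PROMPT = (
        "Rewrite the user's last question as a single self-contained sentence that "
        "fully describes their intent, resolving references such as 'this parameter' "
        "from the conversation history. Add as much concrete detail as possible. "
        "Return only the rewritten question."
    )

    def revise_question(history, latest_question, chat):
        # `history` holds the prior turns as [{"role": ..., "content": ...}, ...]
        messages = [{"role": "system", "content": REVISE_SYSTEM_PROMPT}]
        messages.extend(history)
        messages.append({"role": "user", "content": latest_question})
        return chat(messages)  # the revised, self-contained question

The revised question, rather than the raw user text, is what flows into toxicity detection and domain knowledge search.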

What if obvious errors appear in the question revision? We can use the same combination of few-shot examples and semantic search to optimize against these errors specifically.

Limitations of Semantic Search

The method of using vectors for semantic search is the cornerstone of TiDB Bot. Without it, relying solely on the capability of the GPT model, it would be impossible to build a robot that answers questions about specific knowledge in a niche field.

However, the more foundational a component is, the more important it is to understand its potential issues in order to find ways to optimize it positively.

Overall, in the process of preparing domain knowledge data, splitting, vectorizing, and searching, there are many ways to optimize. Here are a few examples that the author has tried:

  • During the data preparation stage: Clean the documents, remove images, links, and other meaningless symbols and document structures.

  • During the splitting stage: Use different methods to split the document (such as splitting by token, by natural paragraph, or by symbol). After splitting, decide whether some overlap is needed and determine an appropriate amount of overlap (a token-based splitting sketch follows this list).

  • During the vectorization stage: Decide whether to use a proprietary or an open-source embedding model, how long the vectors should be, and whether multi-language vectorization is supported. If an open-source model is used, decide how to fine-tune it, how to prepare the fine-tuning corpus, and how many epochs to train so that the model converges with high quality.

  • During the semantic search stage: Decide which similarity algorithm works best, how much document content to retrieve to satisfy the intent, and whether the split content needs to be re-aggregated after retrieval.
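As an example of the splitting stage, here is a sketch of token-based splitting with overlap. The chunk size, overlap, and the `tokenize`/`detokenize` callables are illustrative assumptions; in practice they would come from the tokenizer of the embedding model being used.

    # Token-based splitting with a fixed overlap between consecutive chunks.
    def split_by_tokens(text, tokenize, detokenize, chunk_tokens=512, overlap_tokens=64):
        tokens = tokenize(text)
        chunks = []
        step = chunk_tokens - overlap_tokens
        for start in range(0, len(tokens), step):
            window = tokens[start:start + chunk_tokens]
            if not window:
                break
            chunks.append(detokenize(window))
            if start + chunk_tokens >= len(tokens):
                break  # the last window already reaches the end of the document
        return chunks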

The advantages of the above methods:

  • Each method is a systematic solution, effective for all domain knowledge documents, without prejudice.

  • The methods used in the data preparation and splitting stages can generally achieve stable positive optimization, allowing for higher quality data material.

Disadvantages:

  • Key optimization methods during the vectorization and semantic search stages cannot achieve stable positive optimization. The direction of optimization is random, and improving one aspect of the model's ability may weaken another.

  • Each optimization requires a deep understanding of the relationship between the business and the optimization method. It requires repeated fine-tuning under the business test set, continual experimentation, and deepening the understanding of the adaptability between technology and business, to have the chance to achieve relatively good results.

Problem Analysis

In the beta testing phase, a common problem was: the user's question is clear, but the corresponding document content cannot be found in the Top N results from the vector database search. This implies that documents related to the question do exist in the system but are simply not being retrieved. This could be due to several possibilities:

  • The document is not well-written or too obscure, making it challenging to find based on semantic similarity.

  • The embedding model needs to be improved, as the vector distance between the user's query and the directly relevant domain knowledge is not the shortest.

  • The similarity algorithm is not optimal, and other similarity algorithms could potentially be utilized to address this.

Solving these possible issues could take several months, and even then the improvements cannot be guaranteed. Therefore, to stably improve the output quality of semantic search, there are two direct, effective, and quickly implementable methods:

  • First, adjust the vector distance between the domain content and the query directly.

  • Second, recall specific content examples in addition to recalling domain knowledge content.

Both methods can provide correct information in system prompts, but they have different pros and cons:

  • Method One:

    • Cons:

      • Direct adjustment of vector distance involves moving and rotating existing vectors, which could affect other user queries and disrupt the overall distribution of the domain knowledge vectors.

      • Direct adjustment of vector distance might also mean using an additional metric or function to express the new vector distance. However, creating a new similarity function may not necessarily solve the problem.

  • Method Two:

    • Pros:

      • Introducing new content (examples) into the system prompts does not impact the existing domain knowledge vector space, thus providing relative decoupling.

      • It also offers higher flexibility, allowing for rapid additions and deletions in the future.

    • Cons:

      • When domain knowledge is updated, the examples also need to be updated, requiring an additional process.

Considering the simplicity of system maintenance and the real-time nature of optimization, the author eventually chose Method Two.

Solution

The primary method I use is a combination of examples and training the Embedding model.

In the first step, a method similar to the one in 'The Missed Targets of Toxicity Detection' is used to supplement examples that specifically target common mistakes. These examples are then provided to the GPT model along with the system prompt to improve accuracy.

In the second step, once a sufficient number of examples have been accumulated, these examples are used as training data to train the Embedding model. This enables the Embedding model to better understand the relationship between questions and domain knowledge, thereby producing more appropriate vector data results.

In practice, cycling between the first and second steps keeps the number of examples at a manageable level and continuously drives improvement of the embedding model.
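As one plausible way to implement the second step, the sketch below fine-tunes an open-source embedding model with the sentence-transformers library, treating each accumulated (question, relevant document chunk) pair as a positive pair. The base model name, batch size, and the `accumulated_pairs` data are illustrative assumptions, not the author's exact setup.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # (question, relevant chunk) pairs accumulated through operation;
    # the contents here are only illustrative.
    accumulated_pairs = [
        {"question": "What is Dumpling?",
         "chunk": "Dumpling is a tool for exporting data stored in TiDB to SQL or CSV files."},
        # ... more pairs collected from user feedback and supplemented examples
    ]

    train_examples = [InputExample(texts=[p["question"], p["chunk"]]) for p in accumulated_pairs]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any open-source embedding model
    # In-batch negatives: pull each question toward its chunk, push it away from the others.
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
    model.save("tidb-bot-embedding-finetuned")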

Garbage In, Garbage Out

Problem Analysis

In machine learning, one of the most famous phrases is "Garbage In, Garbage Out", which means if incorrect or meaningless data is input into the model, the model will inevitably output incorrect or meaningless results. Therefore, if the quality of the domain document content is poor, or its timeliness has passed, the quality of the answer given by the GPT model is likely to be poor as well.

Solution

I established a process for regularly updating domain knowledge documents, and when users report errors, I submit the corresponding documents to the appropriate team to encourage updating and enriching the domain documents.

The Only Rule for Product Usability: Continuous Operation

The aforementioned strategies are some of the attempts I made while optimizing the TiDB Bot. These methods can to a certain extent enhance the accuracy of the bot's responses. However, to reduce the dislike rate from over 50% to less than 5%, we need to progress step by step to achieve our long-term goal.

To ensure the continuous optimization of TiDB Bot, I built an internal operation platform. This platform can conveniently implement the optimization methods introduced in this article. The core capabilities of this platform include:

  • Feedback Information Display: It presents users' upvotes and downvotes on replies. For downvoted replies, it displays the processing logs of each node in the information flow, which makes troubleshooting easier.

  • Quick Addition of Examples: For each node that interacts with GPT, including question revision, toxicity detection, and domain knowledge search, it supports quickly supplementing examples.

  • Automatic Update of Domain Knowledge: For domain knowledge with a fixed source, such as the official documents, it supports regular automatic updates of the document content in the vector database to keep the domain knowledge up to date (a refresh sketch follows this list).

  • Data Organization for Model Iteration: It automatically organizes the training data needed to fine-tune the embedding model, including users' upvote information and the examples supplemented during operation.
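To illustrate the automatic update capability, here is a sketch of a periodic refresh job: fetch the official documents, skip those that have not changed since the last run, and re-split and re-embed the rest. `fetch_official_docs`, `split`, `embed`, and `vector_store` are placeholders for the actual crawler, splitter, embedding model, and vector database.

    import hashlib

    # Periodic job: keep the domain knowledge in the vector database up to date.
    def refresh_domain_knowledge(fetch_official_docs, split, embed, vector_store, seen_hashes):
        for doc in fetch_official_docs():  # e.g. [{"id": "...", "text": "..."}, ...]
            digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
            if seen_hashes.get(doc["id"]) == digest:
                continue  # unchanged since the last run
            vector_store.delete(filter={"doc_id": doc["id"]})  # drop stale chunks
            for chunk in split(doc["text"]):
                vector_store.insert(vector=embed(chunk),
                                    payload={"doc_id": doc["id"], "content": chunk})
            seen_hashes[doc["id"]] = digest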

Finally, by using this operation platform, I gradually improved the accuracy over 103 days. Eventually, with the help of community test users, it was successfully launched.

Discussion: The Choice Between Model Fine-Tuning and Continuous Operation

The term "model fine-tuning" here refers to the method of using more domain-specific data to train models, including Embedding and GPT models, directly through fine-tuning. By contrast, "continuous operation" refers to practices similar to those described in this article, which involve leveraging more high-quality domain knowledge and examples, as well as engaging in multiple interactions with GPT to enhance the accuracy of the application.

Many people may ask, why does this article emphasize the method of continuous operation and not accentuate the method of model fine-tuning? To answer this question, we first need to look at the pros and cons of both methods:

  • Model Fine-Tuning Method:

    • Pros:

      • The opportunity to comprehensively improve the quality of responses in a specific domain.

      • Once trained successfully, the demands for domain knowledge in answering questions will decrease, thus saving on the cost of collecting domain knowledge in the later stage.

      • The training cost is acceptable. As seen in the open-source community, fine-tuning a model with Low-Rank Adaptation of Large Language Models (LoRA) takes only about 8 hours on a V100 GPU to converge.

    • Cons:

      • It requires collecting and preprocessing a vast amount of high-quality domain data. If Full Fine-Tuning (FFT) is needed, more than 100,000 corpora are required, and even if the Parameter-Efficient Fine-Tuning (PEFT) method is used, over 50,000 corpora are still needed.

      • The training effect is uncertain. After training, although the capacity to respond to domain knowledge has improved, the abilities in other general knowledge and reasoning may decline. When facing real user questions, it may result in inadequate reasoning and a decrease in the ability to answer questions. As the fine-tuning method is based on an existing model for training, whether it improves or deteriorates depends on the existing model. If a good existing model can be found, it will enable the fine-tuned model to start from a higher point.

      • The quality of open-source models cannot rival that of OpenAI's models. Although there is a chance to reduce training costs, there are currently no academic or industrial reports of an open-source model with capabilities comparable to OpenAI's.

      • Each iteration takes a relatively long time. Each iteration (measured in months) requires undergoing one or several cycles of data preparation, training, and testing to possibly obtain a usable model. Especially in terms of data preparation, high-quality training datasets may not be prepared until after undergoing several rounds of actual training.

  • Continuous Operation Method:

    • Pros:

      • Relatively stable positive optimization. This article adopts a systematic method to optimize accuracy without depending on the randomness produced by model training.

      • Fast. The optimization of the example part can achieve minute-level iteration speed, which allows for rapid troubleshooting if users encounter problems.

      • Economical. It only requires the reuse of existing semantic search capabilities, with no need for additional components or extra costs.

      • Low migration cost. The method in this article can be used in any chat-type GPT model, allowing for quick migration to other models. Should there be a better model in open-source or commercial models, it can be integrated swiftly.

      • Friendly to cold start. Problems can be solved as they arise, without the need for a large amount of training data in advance.

    • Cons:

      • More frequent human intervention is required. Because the example-based method requires more human verification and supplementation processes, it demands more frequent human intervention than model fine-tuning during product operation.

      • Excessive content. After a period of operation, the supplemented content may become too much to handle, leading to difficulties in maintenance and a decline in search accuracy.

From the above, we can see that both methods have their advantages and disadvantages. They are not mutually exclusive but complementary. For example, the author has fine-tuned the Embedding model.

In the early stages of the TiDB Bot, the author leans more towards the continuous operation method, applying a systematic approach for stable, economical, and rapid positive optimization, making sure that the entire team focuses on business issues. Perhaps in the middle and later stages of TiDB Bot's development, the method of model fine-tuning could be considered for further optimization.

The Holistic Logical Architecture Including Optimization Methods

So far, we have obtained the ability to continuously optimize the TiDB Bot.

Follow-up

The TiDB Bot has been launched on TiDB Cloud, Slack, and Discord channels. Everyone is welcome to use it.

In the future, we will provide open-source tools for building applications similar to TiDB Bot, enabling everyone to quickly build their own GPT applications.



Join the TiDB community Discord: https://discord.gg/DQZ2dy3cuc