How to Build an Advanced AI-Powered Enterprise Content Pipeline Using Mixtral 8x7B and Qdrant

Akriti Upadhyay

Introduction

As the digital landscape rapidly evolves, enterprises face the challenge of managing and harnessing exponentially growing volumes of data to drive business success. As content grows in volume and complexity, traditional content management approaches fail to provide the agility and intelligence required to scale or to extract valuable insights.

The integration of vector databases with mixture-of-experts LLMs such as Mixtral 8x7B offers a transformative solution for enterprises seeking to unlock the full potential of their content pipelines. In this blog post, we will explore the essential components and strategies for building an advanced AI-powered enterprise content pipeline using Mixtral 8x7B and Qdrant, a vector database that uses the HNSW algorithm for approximate nearest neighbor search.

To build the advanced AI-powered pipeline, we’ll leverage a Retrieval Augmented Generation (RAG) pipeline by following these steps:

  1. Loading the Dataset using LlamaIndex

  2. Embedding Generation using Hugging Face

  3. Building the Model using Mixtral 8x7B

  4. Storing the Embedding in the Vector Store

  5. Building a Retrieval pipeline

  6. Querying the Retriever Query Engine

Enterprise Content Generation with Mixtral 8x7B

To build a RAG pipeline with Mixtral 8x7B, we’ll install the following dependencies (the imports throughout this post follow the llama-index 0.9.x module layout that matches this pin):

%pip install -q llama-index==0.9.3 qdrant-client transformers[torch]

Loading the Dataset Using LlamaIndex

For the dataset, we used the Diffbot Knowledge Graph API. Diffbot is a sophisticated web scraping and data extraction tool that utilizes artificial intelligence to automatically retrieve and structure data from web pages. Unlike traditional web scraping methods that rely on manual programming to extract specific data elements, Diffbot uses machine learning algorithms to comprehend and interpret web content much like a human would. This allows Diffbot to accurately identify and extract various types of data, including articles, product details, and contact information, from a wide range of websites.

One of the standout features of Diffbot is its Knowledge Graph Search, which organizes the extracted data into a structured database known as a knowledge graph. A knowledge graph is a powerful representation of interconnected data that enables efficient searching, querying, and analysis. Diffbot's Knowledge Graph Search not only extracts individual data points from web pages but also establishes relationships between them by creating a comprehensive network of information.

To get the URL, create an account on Diffbot. Go to Knowledge Graph, then Search. Here, we selected Organization in the visual query builder and filtered by Industries -> Pharmaceutical Companies.

Then we chose GSK, a renowned pharmaceutical company, and clicked Articles, which gave us the option to export the results as a CSV or make an API call.

We made the API call and used its URL to access the data in Python.

import requests
import json
import os

# The Diffbot API URL (replace with the export URL from your account)
url = "https://kg.diffbot.com/<your-url>"

# Make a GET request to the API and fail fast on HTTP errors
response = requests.get(url)
response.raise_for_status()

# Parse the response text as JSON
data = response.json()

# Make sure the target directory exists, then open a file in write mode
os.makedirs("json", exist_ok=True)
with open('json/response_text.json', 'w') as file:
    # Write the data to the file as JSON
    json.dump(data, file)

print("Response text has been saved to 'json/response_text.json'.")

The data is now saved in a JSON file. Using LlamaIndex’s SimpleDirectoryReader, we will load the data from the “json” directory.

from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("json").load_data()
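
As a quick optional check, you can confirm the documents loaded:

print(f"Loaded {len(documents)} document(s)")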

Now, it’s time to split the documents into chunks using the SentenceSplitter.

from llama_index.node_parser.text import SentenceSplitter

# Create a SentenceSplitter object with a specified chunk size
text_parser = SentenceSplitter(chunk_size=1024)

# Initialize empty lists to store text chunks and corresponding document indexes
text_chunks = []
doc_idxs = []

# Iterate over each document in the 'documents' list along with its index
for doc_idx, doc in enumerate(documents):
    # Split the text of the current document into smaller chunks using the SentenceSplitter
    cur_text_chunks = text_parser.split_text(doc.text)
   
    # Extend the list of text chunks with the chunks from the current document
    text_chunks.extend(cur_text_chunks)
   
    # Extend the list of document indexes with the index of the current document,
    # repeated for each corresponding text chunk
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
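
A quick optional check shows how many chunks the splitter produced:

print(f"Split {len(documents)} documents into {len(text_chunks)} chunks")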

After that, we will create a TextNode object for each chunk and assign the metadata of the source document to the node’s metadata attribute, so the relationship between chunks and their source documents is easy to track.

# Import the TextNode class from the llama_index schema module
from llama_index.schema import TextNode

# Initialize an empty list to store nodes
nodes = []

# Iterate over each index and text chunk in the text_chunks list
for idx, text_chunk in enumerate(text_chunks):
    # Create a new TextNode object with the current text chunk
    node = TextNode(text=text_chunk)
   
    # Retrieve the corresponding source document using the document index from doc_idxs
    src_doc = documents[doc_idxs[idx]]
   
    # Assign the metadata of the source document to the metadata attribute of the node
    node.metadata = src_doc.metadata
   
    # Append the node to the list of nodes
    nodes.append(node)

Embedding Generation Using Hugging Face

LlamaIndex supports many embedding integrations; here we use the Hugging Face embedding integration with the BAAI/bge-small-en model.

# Import the HuggingFaceEmbedding class from the llama_index embeddings module
from llama_index.embeddings import HuggingFaceEmbedding

# Initialize a HuggingFaceEmbedding object with the specified model name
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

# Iterate over each node in the nodes list
for node in nodes:
    # Get the content of the node along with its metadata
    content_with_metadata = node.get_content(metadata_mode="all")
   
    # Use the embedding model to get the text embedding for the node's content
    node_embedding = embed_model.get_text_embedding(content_with_metadata)
   
    # Assign the computed embedding to the embedding attribute of the node
    node.embedding = node_embedding
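
As an optional check, you can verify the embedding dimensionality; BAAI/bge-small-en produces 384-dimensional vectors:

print(len(nodes[0].embedding))  # expected: 384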

Building the Model Using Mixtral 8x7B

Mixtral 8x7B is a cutting-edge language model developed by Mistral AI. It is a sparse mixture-of-experts (MoE) model with open weights, released under the Apache 2.0 license. This model represents a significant advancement in natural language processing by providing a practical and accessible solution for various applications.

Mixtral 8x7B employs a Mixture of Experts (MoE) architecture and is a decoder-only model. In this architecture, each layer consists of 8 feedforward blocks, referred to as experts. During processing, a router network dynamically selects two experts for each token at every layer, which enables effective information processing and aggregation.
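
To make the routing concrete, here is a minimal, illustrative PyTorch sketch of a top-2 MoE feedforward layer. This is not Mistral AI's implementation; the hidden sizes and expert design are placeholder assumptions chosen for readability:

import torch
import torch.nn as nn

class Top2MoELayer(nn.Module):
    """Illustrative top-2 mixture-of-experts feedforward layer (not Mixtral's actual code)."""
    def __init__(self, dim=512, num_experts=8, hidden=2048):
        super().__init__()
        # Router scores every expert for every token
        self.router = nn.Linear(dim, num_experts)
        # 8 independent feedforward "experts"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, dim)
        logits = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = torch.topk(logits, k=2, dim=-1)  # pick the 2 best experts per token
        weights = torch.softmax(weights, dim=-1)        # normalize the two router scores
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out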

One of Mixtral 8x7B's standout features is its exceptional performance, characterized by high-quality outputs across diverse tasks. The model is pre-trained on multilingual data with a context size of 32k tokens. It outperforms Llama 2 70B and matches or exceeds GPT-3.5 on most benchmarks.

Mixtral 8x7B is also available in an Instruct form, which is supervised fine-tuned on an instruction-following dataset and optimized through Direct Preference Optimization (DPO) training. To learn more about Mixtral 8x7B, see the Mixtral paper linked in the References.

Using Hugging Face and LlamaIndex, we will load the model.

import torch
from llama_index.llms import HuggingFaceLLM

# Instantiate a HuggingFaceLLM object with specified parameters
llm = HuggingFaceLLM(
    context_window=4096,  # Maximum context window size
    max_new_tokens=256,  # Maximum number of new tokens to generate
    generate_kwargs={"temperature": 0.7, "do_sample": True},  # Sampling must be enabled for temperature to apply
    tokenizer_name="mistralai/Mixtral-8x7B-v0.1",  # Pre-trained tokenizer name
    model_name="mistralai/Mixtral-8x7B-v0.1",  # Pre-trained model name
    device_map="auto",  # Automatic device mapping
    stopping_ids=[50278, 50279, 50277, 1, 0],  # Token IDs that stop generation
    tokenizer_kwargs={"max_length": 4096},  # Tokenizer arguments
    model_kwargs={"torch_dtype": torch.float16}  # Load weights in half precision
)
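
As an optional sanity check (assuming the model fits on the available GPUs), you can prompt the LLM directly before wiring it into the pipeline:

print(llm.complete("In one sentence, what is a mixture-of-experts model?").text)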

After that, we will create a service context with the loaded LLM and the embedding model.

from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

Storing the Embedding in the Vector Store

Here, we have used the Qdrant Vector Database to store the embeddings. Qdrant is a high-performance open-source vector search engine designed to efficiently index and search through large collections of high-dimensional vectors. It's particularly well-suited for use cases involving similarity search, where the goal is to find items that are most similar to a query vector within a large dataset.

We will initialize the Qdrant client first and create a collection with hybrid search enabled; in hybrid mode, the vector store indexes a sparse text representation alongside the dense embeddings.

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index import StorageContext, VectorStoreIndex

# Initialize an in-memory Qdrant client (point it at a URL or path for persistence)
client = qdrant_client.QdrantClient(location=":memory:")

# Create a Qdrant vector store with hybrid (dense + sparse) search enabled
vector_store = QdrantVectorStore(client=client, collection_name="my_collection", enable_hybrid=True)

We will add the nodes to the vector store and create a storage context. We will also create an index from the documents, the service context, and the storage context.

# Add nodes to the vector store
vector_store.add(nodes)

# Create a storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, service_context=service_context)

We will then use a query string and the embedding model to create a query embedding, which we will use for retrieval.

query_str = "Can you update me about shingles vaccine?"
query_embedding = embed_model.get_query_embedding(query_str)

Using hybrid query mode, we will create a vector store query with LlamaIndex, run it with the query embedding, and save the query result.

from llama_index.vector_stores import VectorStoreQuery
query_mode = "hybrid"
vector_store_query = VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=2, mode=query_mode)
query_result = vector_store.query(vector_store_query)

Then, we will parse the query result into a list of nodes with scores.

from llama_index.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []

for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None and index < len(query_result.similarities):
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))
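
Optionally, inspect the retrieved chunks along with their similarity scores:

for n in nodes_with_scores:
    print(f"score={n.score}, text={n.node.get_content()[:200]!r}")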

Building a Retrieval Pipeline

To build the retrieval pipeline, we’ll wrap the query logic above in a retriever class.

from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List, Optional
from llama_index.vector_stores import VectorStoreQuery
from llama_index.schema import NodeWithScore

class VectorDBRetriever(BaseRetriever):
    """Retriever over a Qdrant vector store."""
    def __init__(self,
                 vector_store: QdrantVectorStore,
                 embed_model: Any,
                 query_mode: str = "hybrid",
                 similarity_top_k: int = 2) -> None:
        """Initialize parameters."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Embed the query, run it against the vector store, and return scored nodes."""
        query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)
        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None and index < len(query_result.similarities):
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))
        return nodes_with_scores


retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="hybrid", similarity_top_k=2
)
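
Before building the query engine, the retriever can be exercised directly; BaseRetriever.retrieve accepts a plain query string:

for result in retriever.retrieve("Can you update me about shingles vaccine?"):
    print(result.score, result.node.get_content()[:100])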

Finally, our query engine will be ready with the help of Retriever Query Engine.

from llama_index.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(
    retriever, service_context=service_context
)

Querying the Retriever Query Engine

Now that our query engine is ready, it’s time to pass in some queries and look at the results.

Question 1:

query_str = "Write a paragraph about GSK announcement about its shares."
response = query_engine.query(query_str)
print(str(response))

The response will be:

GSK plc announced the completion of its share consolidation on July 18, 2022. This followed the demerger of the Consumer Healthcare business from the GSK Group to form Haleon. The consolidation of GSK shares became effective at 8.00 a.m. on July 19, 2022. As part of the consolidation, a ratio of 4 new ordinary shares was applied for every 5 existing ordinary shares. Fractional entitlements that arose from the consolidation were aggregated and sold in the open market, with the net proceeds paid to each relevant shareholder according to their entitlement. Following the issuance and consolidation, the total number of voting rights in GSK as of July 19, 2022, was 4,067,352,076.

Question 2:

query_str = "Write a paragraph about GSK's RSV vaccine."
response = query_engine.query(query_str)
print(str(response))

The response will be:

GSK's Arexvy is the world's first respiratory syncytial virus (RSV) vaccine for older adults. The US Food and Drug Administration (FDA) approved Arexvy for the prevention of lower respiratory tract disease (LRTD) caused by RSV in individuals 60 years of age and older. This groundbreaking approval enables adults aged 60 years and older to be protected from RSV disease for the first time. The approval is based on data from the positive pivotal AReSVi-006 phase III trial that showed exceptional efficacy in older adults, including those with underlying medical conditions, and in those with severe RSV disease. The US launch was planned before the 2023/24 RSV season. RSV is a common, contagious virus that can lead to potentially serious respiratory illness. It causes approximately 177,000 hospitalizations and an estimated 14,000 deaths in the US in adults aged 65 years and older each year.

Question 3:

query_str = "Write a paragraph about GSK's Endrometrial Cancer Drug Development."
response = query_engine.query(query_str)
print(str(response))

The response will be:

GSK has made significant improvement in the development of drugs for endometrial cancer. Their drug, Jemperli (dostarlimab), has been approved by the European Commission and the US Food and Drug Administration (FDA) for the treatment of adult patients with mismatch repair-deficient (dMMR)/microsatellite instability-high (MSI-H) primary advanced or recurrent endometrial cancer. Jemperli, in combination with carboplatin and paclitaxel (chemotherapy), is the first and only frontline immuno-oncology treatment in the European Union for this type of endometrial cancer. The FDA has also granted accelerated approval for Jemperli as a monotherapy for treating adult patients with dMMR/MSI-H recurrent or advanced endometrial cancer that has progressed on or following prior treatment with a platinum-containing regimen. This approval is based on the results from the dMMR/MSI-H population of Part 1 of the RUBY/ENGOT-EN6/GOG3031/NSGO phase III trial. GSK continues to evaluate Jemperli in the hopes of further expansion for the drug as data mature.

Question 4:

query_str = "Write a paragraph about GSK's Hepatocellular Carcinoma Drug Development."
response = query_engine.query(query_str)
print(str(response))

The response will be:

GSK is making significant progress in the development of drugs for hepatocellular carcinoma (HCC). One of their drugs, Cobolimab, is currently in Phase II clinical trials for HCC. Cobolimab is a humanized monoclonal IgG4 antibody that inhibits T cell immunoglobulin mucin-3 (TIM-3), and is under development for the treatment of solid tumors including melanoma, squamous and non-squamous non-small cell lung carcinoma, HCC, and colorectal cancer. It is administered through the intravenous route. The drug's phase transition success rate (PTSR) and likelihood of approval (LoA) are being closely monitored. GSK's efforts in this area demonstrate their commitment to advancing treatments for HCC.

Question 5:

query_str = "Write a paragraph about GSK's Uncomplicated Cervical And Urethral Gonorrhea Drug Development."
response = query_engine.query(query_str)
print(str(response))

The response will be:

GSK is currently developing a potential first-in-class antibiotic, Gepotidacin, for the treatment of uncomplicated cervical and urethral gonorrhea. This drug is in Phase III of clinical development. Gepotidacin is the first in a new chemical class of antibiotics called triazaacenaphthylene bacterial topoisomerase inhibitors. It is being investigated for use in uncomplicated urinary tract infection and urogenital gonorrhea, two infections not addressed by new oral antibiotics in 20 years. The Phase III programme comprises two studies, EAGLE-1 and EAGLE-2, testing Gepotidacin in two common infections caused by bacteria identified as antibiotic-resistant threats. The development of Gepotidacin is the result of a successful public-private partnership between GSK, the US government's Biomedical Advanced Research and Development Authority (BARDA), and Defense Threat Reduction Agency (DTRA).

Final Words

With the help of the LlamaIndex framework, we used the Diffbot API to extract enterprise content related to a pharmaceutical company, GSK. Using Hugging Face embeddings, the Qdrant vector store, and Mixtral 8x7B, we built the retrieval pipeline. The results from the retriever query engine were quite compelling. Building an advanced AI-powered enterprise content pipeline has become straightforward with the help of Mixtral 8x7B.

References

https://arxiv.org/pdf/2401.04088.pdf

https://docs.diffbot.com/docs/getting-started-with-diffbot-knowledge-graph

https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval.html

This article was originally published here: https://blog.superteams.ai/how-to-build-an-advanced-ai-powered-enterprise-content-pipeline-using-mixtral-8x7b-and-qdrant-b01aa66e3884
