Build Your Own Smart Company Chatbot: RAG with FastAPI, Next.js, and ChromaDB

TemiTope Kayode

Ever wished your internal company chatbot actually knew things? Tired of generic answers or "Sorry, I can't help with that"? What if you could build a chatbot that uses your company's actual documentation, wikis, or knowledge base to provide accurate, context-aware answers?

That's exactly what we're going to build today! We'll leverage the power of Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), a speedy FastAPI backend, a sleek Next.js frontend, and the ChromaDB vector database.

This post will guide you step-by-step, providing code you can adapt and run. By the end, you'll have a functional RAG system that can answer questions based on your provided documents.

What We'll Cover:

  1. What is RAG? (A quick primer)

  2. System Architecture: How the pieces fit together.

  3. Setting up the Backend: FastAPI, ChromaDB, Langchain, and the LLM connection.

  4. Ingesting Data: Loading and preparing documents for retrieval.

  5. Building the RAG Chain: The core logic for retrieving and generating answers.

  6. Creating the API Endpoint: Exposing our RAG chain via FastAPI.

  7. Setting up the Frontend: A simple Next.js interface to interact with our chatbot.

  8. Putting it all Together: Running the system.

Prerequisites:

  • Python 3.8+ and pip installed.

  • Node.js and npm/yarn installed.

  • Basic understanding of Python, FastAPI, React (Next.js), and APIs.

  • An OpenAI API key (or access to another LLM provider supported by Langchain). We'll use OpenAI for this example due to its widespread use and Langchain integration. Remember that LLM usage can incur costs.

Ready? Let's dive in!

1. What is RAG? (The 60-Second Explanation)

RAG stands for Retrieval-Augmented Generation. It's a technique to make Large Language Models (LLMs) more factual and context-aware.

Instead of just relying on its pre-trained (and potentially outdated or generic) knowledge, an LLM in a RAG system works like this:

  1. Retrieve: When you ask a question, the system first searches a specific knowledge base (in our case, documents stored in ChromaDB) for relevant information snippets.

  2. Augment: These relevant snippets (the "context") are then added to your original question.

  3. Generate: This combined prompt (context + question) is sent to the LLM, which generates an answer based primarily on the provided context.
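
In plain Python, the whole loop is only a few lines of pseudocode. This is a conceptual sketch; vector_db and llm are hypothetical placeholders, and the real implementation later in this post uses Langchain, ChromaDB, and OpenAI.

# Conceptual RAG loop (sketch only; vector_db and llm are hypothetical placeholders)
def answer_with_rag(question: str) -> str:
    # 1. Retrieve: find the document chunks most similar to the question
    chunks = vector_db.similarity_search(question, k=3)
    # 2. Augment: stuff the retrieved text into the prompt alongside the question
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate: the LLM answers based primarily on the provided context
    return llm.generate(prompt)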

Why RAG?

  • Reduces Hallucinations: LLMs are less likely to make things up if they have relevant facts handy.

  • Uses Custom/Recent Data: Answers questions based on your specific, up-to-date documents.

  • More Control: You curate the knowledge base the LLM uses.

2. System Architecture

Here's a high-level view of our system:

+-----------+      +-----------+      +-----------+      +-------------+      +-------+      +-----------+
|   User    | ---> |  Next.js  | ---> |  FastAPI  | ---> |  ChromaDB   | ---> |  LLM  | ---> |  FastAPI  | ---> [Back to Next.js & User]
| (Browser) |      | Frontend  |      |  Backend  |      | (Vector DB) |      |       |      |  Backend  |
+-----------+      +-----------+      +-----------+      +-------------+      +-------+      +-----------+
                   user interface:    API endpoint:      document             generates      formats & returns
                   sends question     orchestrates RAG   retrieval            answer         the response

  1. User: Types a question into the Next.js frontend.

  2. Next.js Frontend: Sends the question to the FastAPI backend API endpoint.

  3. FastAPI Backend:

    • Receives the question.

    • Uses an embedding model to turn the question into a vector.

    • Queries ChromaDB with the question vector to find relevant document chunks.

    • Constructs a prompt containing the retrieved chunks (context) and the original question.

    • Sends the augmented prompt to the LLM (e.g., OpenAI's GPT).

    • Receives the generated answer from the LLM.

    • Formats the response (including a fallback "I don't know" if no context was found).

    • Sends the response back to the Next.js frontend.

  4. Next.js Frontend: Displays the answer received from the backend to the user.

  5. ChromaDB: Stores vector representations (embeddings) of our company documents, allowing efficient similarity search.

  6. LLM: The language model that generates the final answer based on the context provided by the retriever.
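
On the wire, the exchange between the frontend and the backend is a small JSON payload each way, matching the Pydantic models we define in the FastAPI app later (shown here as Python dict literals):

# Request body the frontend POSTs to /chat
{"question": "How do I request vacation?"}

# Response body the backend returns
{"answer": "Employees should submit vacation requests through the HR portal at least two weeks in advance..."}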

(Side Note: Could we use Gradio? Yes! Gradio is excellent for quickly building simple UIs for machine learning models, especially for demos. However, building separate FastAPI and Next.js components gives us a more scalable and customizable structure, closer to a production setup.)

3. Setting up the Backend (FastAPI, ChromaDB, Langchain)

Let's create our backend project. This single main.py will cover steps 3 through 6 of our outline: setting up the backend, ingesting data, building the RAG chain, and exposing the API endpoint.

# Create a project directory
mkdir rag-backend
cd rag-backend

# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

# Install dependencies
pip install fastapi uvicorn python-dotenv langchain langchain-openai chromadb tiktoken pydantic langchain-community

  • fastapi: The web framework.

  • uvicorn: The ASGI server to run FastAPI.

  • python-dotenv: To manage environment variables (like API keys).

  • langchain, langchain-openai, langchain-community: Core Langchain library, OpenAI integration, and community loaders/splitters.

  • chromadb: The vector database client.

  • tiktoken: Used by Langchain/OpenAI for token counting.

  • pydantic: For data validation (used by FastAPI).

Now, create a file named .env in the rag-backend directory to store your OpenAI API key:

# .env
OPENAI_API_KEY="your_openai_api_key_here"

Replace your_openai_api_key_here with your actual key.

Create a file named main.py. This will contain our core backend logic.

# main.py
import os
import chromadb
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from dotenv import load_dotenv

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import TextLoader  # Example loader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# --- Configuration & Setup ---

# Load environment variables (especially OPENAI_API_KEY)
load_dotenv()

# Check for API key
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found in environment variables. Please set it in the .env file.")

# Constants
CHROMA_DB_PATH = "./chroma_db"
COLLECTION_NAME = "company_docs"
MODEL_NAME = "gpt-4o-mini" # Or choose another model

# --- Data Loading & Processing (Run Once or as Needed) ---

# IMPORTANT: In a real app, you'd run this data loading part separately
# or trigger it based on updates. For this example, we'll run it simply.
# We will use a dummy text file for demonstration.

# Create a dummy document file
DUMMY_DOC_PATH = "company_info.txt"
if not os.path.exists(DUMMY_DOC_PATH):
    with open(DUMMY_DOC_PATH, "w") as f:
        f.write("""
Acme Corporation - Internal FAQ

Q: What is the standard process for requesting vacation time?
A: Employees should submit vacation requests through the HR portal at least two weeks in advance. Requests are subject to manager approval based on team availability.

Q: How do I reset my corporate password?
A: You can reset your password via the self-service tool at passwordreset.acme.com. If you encounter issues, please contact the IT Help Desk at x5555.

Q: What are the office hours?
A: Standard office hours are 9:00 AM to 5:00 PM, Monday through Friday. Some departments may have different schedules; please check with your manager.

Q: Does Acme Corp offer remote work options?
A: Acme Corporation offers hybrid and fully remote work options for eligible positions. Eligibility depends on the role, team needs, and manager approval. Refer to the Remote Work Policy document on the company intranet for details.
        """)

def load_and_store_documents():
    """Loads documents, splits them, creates embeddings, and stores them in ChromaDB."""
    print("Loading and processing documents...")
    loader = TextLoader(DUMMY_DOC_PATH) # Replace with your actual loader (PDF, Web, etc.)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    docs = text_splitter.split_documents(documents)

    print(f"Split into {len(docs)} chunks.")

    # Create embeddings
    embeddings = OpenAIEmbeddings() # Defaults to "text-embedding-ada-002"

    # Create Chroma vector store (or connect to existing)
    # It will save data to the specified directory
    vectorstore = Chroma.from_documents(
        docs,
        embeddings,
        collection_name=COLLECTION_NAME,
        persist_directory=CHROMA_DB_PATH
    )
    print(f"Documents stored in ChromaDB collection '{COLLECTION_NAME}' at {CHROMA_DB_PATH}")
    return vectorstore

# --- Initialize Vector Store ---
# Load the persisted vector store if it already exists, otherwise build it from the documents
if os.path.exists(CHROMA_DB_PATH):
    print("Loading existing vector store...")
    vectorstore = Chroma(
        persist_directory=CHROMA_DB_PATH,
        embedding_function=OpenAIEmbeddings(),
        collection_name=COLLECTION_NAME
    )
    print("Vector store loaded successfully.")
else:
    print("No existing vector store found. Creating a new one...")
    vectorstore = load_and_store_documents()

retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) # Retrieve top 3 relevant chunks

# --- Initialize LLM ---
llm = ChatOpenAI(model_name=MODEL_NAME, temperature=0) # Low temperature for factual answers

# --- Define RAG Chain ---
template = """You are an assistant for question-answering tasks for Acme Corporation.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer based on the context provided, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Context: {context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# The RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- FastAPI Application ---
app = FastAPI(
    title="Company RAG Chatbot API",
    description="API for answering questions based on company documents.",
)

# Add CORS middleware to allow requests from our Next.js frontend
from fastapi.middleware.cors import CORSMiddleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"], # Allow all origins for simplicity (restrict in production!)
    allow_credentials=True,
    allow_methods=["*"], # Allow all methods
    allow_headers=["*"], # Allow all headers
)


class QueryRequest(BaseModel):
    question: str

class QueryResponse(BaseModel):
    answer: str
    # Optional: Add sources/context if needed
    # sources: list[str] = []

@app.post("/chat", response_model=QueryResponse)
async def chat_endpoint(request: QueryRequest):
    """
    Receives a question, processes it through the RAG chain, and returns the answer.
    """
    question = request.question
    if not question:
        raise HTTPException(status_code=400, detail="Question cannot be empty")

    try:
        # Check if context is found *before* calling the full chain
        # This helps ensure we only answer based on retrieved docs
        retrieved_docs = retriever.invoke(question)

        if not retrieved_docs:
            print("No relevant documents found.")
            answer = "I couldn't find any relevant information in the provided documents to answer your question."
        else:
            print(f"Found {len(retrieved_docs)} relevant document chunks.")
            # If we found docs, proceed with the full chain
            answer = rag_chain.invoke(question)
            print(f"LLM Answer: {answer}")

        return QueryResponse(answer=answer)

    except Exception as e:
        print(f"Error processing question: {e}") # Log the error
        # Consider more specific error handling here
        raise HTTPException(status_code=500, detail="Internal server error while processing the question.")

# --- Run the Application (for local development) ---
if __name__ == "__main__":
    import uvicorn
    # The vector store is loaded (or built) by the module-level code above,
    # so we can go straight to serving requests.
    print("Starting FastAPI server...")
    uvicorn.run(app, host="0.0.0.0", port=8000)

Explanation:

  1. Imports & Setup: Import necessary libraries, load environment variables, define constants.

  2. Data Loading (load_and_store_documents):

    • Uses TextLoader (you can swap this for PDFMinerLoader, WebBaseLoader, etc., from langchain_community.document_loaders).

    • Uses RecursiveCharacterTextSplitter to break documents into smaller, manageable chunks.

    • Initializes OpenAIEmbeddings to convert text chunks into vectors.

    • Uses Chroma.from_documents to embed the chunks and store them in a persistent ChromaDB collection (./chroma_db directory).

  3. Initialize Vector Store: Loads the persisted ChromaDB collection if the ./chroma_db directory already exists; otherwise (e.g., on the first run) it calls load_and_store_documents to build it.

  4. Retriever: Creates a retriever from the vector store. search_kwargs={"k": 3} means it will fetch the top 3 most relevant document chunks for a given query (see the quick sanity-check script after this list).

  5. Initialize LLM: Sets up the ChatOpenAI model. temperature=0 makes the output more deterministic and factual.

  6. RAG Chain: This is the core Langchain Expression Language (LCEL) chain:

    • {"context": retriever | format_docs, "question": RunnablePassthrough()}: Takes the input question. Runs the retriever to get docs. Formats the docs into a single string. Passes the original question through unchanged. Creates a dictionary {"context": formatted_docs, "question": original_question}.

    • | prompt: Feeds this dictionary into our ChatPromptTemplate.

    • | llm: Sends the formatted prompt to the LLM.

    • | StrOutputParser(): Parses the LLM's chat message output into a simple string.

  7. FastAPI App:

    • Initialises the FastAPI app.

    • CORS Middleware: Crucial for allowing the Next.js frontend (running on a different port) to communicate with the backend. Note: allow_origins=["*"] is insecure for production. You should restrict it to your frontend's specific domain.

    • Pydantic Models: QueryRequest defines the expected input structure, QueryResponse defines the output.

    • /chat Endpoint:

      • Takes a QueryRequest.

      • Retrieves relevant documents first.

      • Crucially: If retrieved_docs is empty, it immediately returns the "I don't know" message without calling the LLM, enforcing the RAG principle.

      • If documents are found, it invokes the full rag_chain.

      • Includes basic error handling.

  8. Run: Standard Uvicorn setup to run the server locally.
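
If you want to poke at the retriever and the chain outside of the API, a small scratch script works well. This is a sketch (the file name scratch_test.py is just a suggestion); it relies on the fact that importing main runs the module-level setup above (vector store, retriever, rag_chain) but not the Uvicorn server, which only starts under python main.py.

# scratch_test.py - quick sanity check of the retriever and RAG chain (run from rag-backend/)
from main import retriever, rag_chain  # executes main.py's setup code, but not the server

question = "How do I reset my corporate password?"

# Inspect what the retriever pulls back from ChromaDB
for doc in retriever.invoke(question):
    print("CHUNK:", doc.page_content[:80].replace("\n", " "), "...")

# Run the full retrieve -> augment -> generate chain
print("ANSWER:", rag_chain.invoke(question))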

To Run the Backend:

  1. Make sure you're in the rag-backend directory with your virtual environment activated.

  2. Ensure you have your .env file with the OPENAI_API_KEY.

  3. Run: python main.py

The first time you run it, it should print messages about loading/processing documents and storing them in ChromaDB. Subsequent runs should load the existing database. You'll see INFO: Uvicorn running on http://0.0.0.0:8000.

You can test it using tools like curl or Postman/Insomnia:

curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question": "How do I request vacation?"}'

You should get a JSON response like: {"answer":"Employees should submit vacation requests through the HR portal at least two weeks in advance. Requests are subject to manager approval based on team availability."} (or similar, depending on the LLM).

Try asking something not in the document, like "What's the cafeteria menu?":

curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question": "What is the cafeteria menu today?"}'

You should get the fallback response: {"answer":"I couldn't find any relevant information in the provided documents to answer your question."}
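
If you prefer Python to curl, a tiny script does the same job. This is an optional extra; it assumes pip install requests, which is not part of our backend dependencies:

# smoke_test.py - send a couple of questions to the /chat endpoint
import requests

for q in ["How do I request vacation?", "What is the cafeteria menu today?"]:
    resp = requests.post(
        "http://localhost:8000/chat",
        json={"question": q},
        timeout=60,
    )
    print(q, "->", resp.json()["answer"])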

7. Setting up the Frontend (Next.js)

Now, let's create a simple interface to interact with our backend.

# Navigate outside the backend directory
cd ..

# Create a new Next.js app (using App Router and TypeScript recommended)
npx create-next-app@latest rag-frontend --typescript --tailwind --eslint --app

# Enter the frontend directory
cd rag-frontend

# Install axios (optional, you can use fetch)
npm install axios # or yarn add axios

Replace the content of app/page.tsx with the following:

// app/page.tsx
'use client'; // This directive marks the component as a Client Component

import { useState, FormEvent } from 'react';
import axios from 'axios'; // Using axios for API calls

export default function Home() {
  const [query, setQuery] = useState<string>('');
  const [answer, setAnswer] = useState<string>('');
  const [isLoading, setIsLoading] = useState<boolean>(false);
  const [error, setError] = useState<string | null>(null);

  const handleSubmit = async (event: FormEvent<HTMLFormElement>) => {
    event.preventDefault(); // Prevent default form submission
    if (!query.trim()) return; // Don't submit empty queries

    setIsLoading(true);
    setAnswer('');
    setError(null);

    // --- Make API Call ---
    // Ensure your FastAPI backend is running at http://localhost:8000
    const backendUrl = process.env.NEXT_PUBLIC_BACKEND_URL || 'http://localhost:8000/chat';

    try {
      const response = await axios.post(
        backendUrl,
        { question: query },
        { headers: { 'Content-Type': 'application/json' } }
      );

      if (response.data && response.data.answer) {
        setAnswer(response.data.answer);
      } else {
        setError('Received an unexpected response format from the server.');
      }
    } catch (err: any) {
      console.error('API call failed:', err);
      if (axios.isAxiosError(err) && err.response) {
        // Try to get error message from backend response
        setError(`Error: ${err.response.data?.detail || err.message}`);
      } else {
        setError(`An error occurred: ${err.message || 'Unknown error'}`);
      }
    } finally {
      setIsLoading(false);
    }
  };

  return (
    <main className="flex min-h-screen flex-col items-center justify-center p-12 bg-gray-50">
      <div className="w-full max-w-2xl bg-white p-8 rounded-lg shadow-md">
        <h1 className="text-3xl font-bold mb-6 text-center text-gray-800">
          Acme Corp Chatbot
        </h1>

        <form onSubmit={handleSubmit} className="mb-6">
          <input
            type="text"
            value={query}
            onChange={(e) => setQuery(e.target.value)}
            placeholder="Ask a question about Acme Corp..."
            className="w-full px-4 py-2 border border-gray-300 rounded-md focus:outline-none focus:ring-2 focus:ring-blue-500"
            disabled={isLoading}
          />
          <button
            type="submit"
            disabled={isLoading}
            className={`w-full mt-3 px-4 py-2 rounded-md text-white font-semibold transition-colors duration-200 ease-in-out ${
              isLoading
                ? 'bg-gray-400 cursor-not-allowed'
                : 'bg-blue-600 hover:bg-blue-700'
            }`}
          >
            {isLoading ? 'Thinking...' : 'Ask'}
          </button>
        </form>

        {error && (
          <div className="mt-4 p-3 bg-red-100 text-red-700 border border-red-300 rounded-md">
            {error}
          </div>
        )}

        {answer && (
          <div className="mt-6 p-4 bg-gray-100 border border-gray-200 rounded-md">
            <h2 className="text-lg font-semibold mb-2 text-gray-700">Answer:</h2>
            <p className="text-gray-800 whitespace-pre-wrap">{answer}</p>
          </div>
        )}
      </div>
       <footer className="mt-8 text-center text-gray-500 text-sm">
          Powered by RAG, FastAPI, Next.js & ChromaDB
      </footer>
    </main>
  );
}

Explanation:

  1. 'use client';: Necessary in Next.js App Router to use hooks like useState.

  2. State: Manages the user's input (query), the chatbot's answer, loading status (isLoading), and potential error messages.

  3. handleSubmit:

    • Triggered when the form is submitted.

    • Sets loading state.

    • Uses axios to make a POST request to our FastAPI backend with the user's question. The URL comes from NEXT_PUBLIC_BACKEND_URL if set, otherwise it falls back to http://localhost:8000/chat (see the configuration note after this list).

    • Handles the response, updating the answer state or setting an error.

    • Resets loading state in the finally block.

  4. UI:

    • A simple form with an input field and a submit button.

    • Conditionally displays loading indicators, error messages, and the final answer.

    • Basic styling using Tailwind CSS (included with create-next-app).
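
One configuration note: the component reads process.env.NEXT_PUBLIC_BACKEND_URL and falls back to http://localhost:8000/chat (the backendUrl line in handleSubmit). To point the frontend at a different backend, for example a deployed API, you can add an optional .env.local file in rag-frontend:

# .env.local (in rag-frontend/)
NEXT_PUBLIC_BACKEND_URL=http://localhost:8000/chat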

To Run the Frontend:

  1. Make sure you're in the rag-frontend directory.

  2. Run: npm run dev (or yarn dev)

  3. Open your browser to http://localhost:3000 (or whatever port Next.js indicates).

8. Putting it all Together

  1. Start the Backend:

    • Open a terminal in the rag-backend directory.

    • Activate the virtual environment (source venv/bin/activate or venv\Scripts\activate).

    • Run python main.py. Wait until you see Uvicorn running on http://0.0.0.0:8000.

  2. Start the Frontend:

    • Open a separate terminal in the rag-frontend directory.

    • Run npm run dev (or yarn dev).

  3. Interact:

    • Go to http://localhost:3000 in your browser.

    • Ask questions that are covered in the company_info.txt file (e.g., "How to reset password?", "What are office hours?"). You should get concise answers based only on that text.

    • Ask questions that are not covered (e.g., "Tell me about project Phoenix?"). You should get the "I couldn't find relevant information..." response.

Congratulations! You've built a basic, but functional, RAG system with a FastAPI backend and a Next.js frontend.

Limitations & Next Steps

This is a foundational example. Real-world applications would need more sophistication:

  • Better Document Loading: Handle various formats (PDF, DOCX, HTML), maybe from multiple sources.

  • Advanced Retrieval: Explore different retriever settings (e.g., MMR for diversity; see the small example after this list) or techniques like HyDE (Hypothetical Document Embeddings).

  • More Robust Error Handling: Implement more specific error catching and user feedback.

  • Chat History: Implement context windows to remember previous turns in the conversation.

  • Scalability: Consider asynchronous processing for data ingestion, more robust database solutions if ChromaDB's limits are reached, and proper deployment strategies.

  • UI/UX: Build a more feature-rich chat interface.
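
As a concrete example of the advanced retrieval point above, Langchain vector store retrievers support Maximal Marginal Relevance (MMR) out of the box. Swapping our plain similarity retriever for an MMR one is a small change to the as_retriever call in main.py (a sketch; fetch_k controls how many candidates are considered before the k most diverse are kept):

# Swap the plain similarity retriever for MMR to favour diverse chunks
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10},  # consider 10 candidates, keep the 3 most diverse
)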

In our next post, we'll explore how to make this chatbot even more powerful by adding Tools (also known as Function Calling). This allows the LLM to interact with other APIs or databases to fetch real-time information or perform actions, going beyond just retrieving static document content. Stay tuned!


I hope this detailed guide helps you build your own RAG-powered applications. Let me know in the comments if you have questions or built something cool with this! Happy coding!

#RAG #LLM #AI #FastAPI #Python #NextJS #React #JavaScript #ChromaDB #VectorDatabase #Chatbot #WebDevelopment
