Boardgame rules retrieval using RAG

Antonio Ropac
18 min read

In this article, I'll explore how to build a Retrieval-Augmented Generation (RAG) system specifically designed for board game rulebooks. This implementation uses Google's Gemini model and ChromaDB to create an application that can answer specific rule questions by automatically processing PDF rulebooks. Let's dive into the code and understand how each component works together.

Introduction

Board game rulebooks can be dense, complex, and sometimes confusing. This project aims to solve that problem by creating an assistant that answers specific rule questions by referencing the original rulebook text. The system uses a RAG architecture, which combines embeddings for retrieval with generative AI for producing meaningful answers.

Setting Up the Environment

# Remove any conflicting libraries
!pip uninstall -qqy jupyterlab


# Install only what's needed with compatible versions
!pip install -q requests numpy pandas pillow pdfplumber protobuf chromadb google-generativeai google-api-core

# Import necessary libraries
import time
import os
import re
import json
import hashlib
import pdfplumber
from collections import deque
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

# Configure Google Gemini API
import google.generativeai as genai
from kaggle_secrets import UserSecretsClient 
secret_label = "GOOGLE_API_KEY"
secret_value = UserSecretsClient().get_secret(secret_label) 
genai.configure(api_key=secret_value)

# Select Gemini model
model = genai.GenerativeModel('gemini-2.0-flash-lite')

# Import chromadb for vector database
import chromadb

This initial code block sets up our environment by installing and importing the necessary libraries. The time and os modules provide basic system functionality, while re handles regular expressions for text processing. The json module helps us parse responses from the Gemini API, and hashlib gives us a fallback method for generating vector representations. The pdfplumber library extracts text from PDF files, which is essential here since all rulebooks in the dataset are in PDF format. The collections and concurrent.futures modules provide data structures and parallel execution capabilities to improve performance.

We then configure the Google Gemini API by retrieving the API key from Kaggle's secrets, which is a secure way to handle API keys. The specific Gemini model we're using is 'gemini-2.0-flash-lite', chosen for its balance of performance and speed. Finally, we import ChromaDB, an open-source vector database that will store and retrieve our embedded text chunks, forming the foundation of our retrieval system.

PDF Text Extraction

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            total_pages = len(pdf.pages)
            text_content = ""

            for page_num, page in enumerate(pdf.pages, 1):
                page_text = page.extract_text() or ""
                # Add a newline so the last word of one page does not fuse with the first word of the next
                text_content += page_text + "\n"

                if page_num % 5 == 0 or page_num == total_pages:
                    print(f"Extracted text from page {page_num}/{total_pages}")

            text = text_content
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")

    # Post-process text for better handling of board game terminology
    if text:
        # Handle common board game terms and jargon
        text = re.sub(r'\bmeeples?\b', 'meeple(s)', text, flags=re.IGNORECASE)
        text = re.sub(r'\bvps?\b', 'Victory Point(s)', text, flags=re.IGNORECASE)

        # Fix common typos in boardgame rulebooks
        text = re.sub(r'\bplaver(s?)\b', r'player\1', text, flags=re.IGNORECASE)
        text = re.sub(r'\btum\b', 'turn', text, flags=re.IGNORECASE)

        # Handle hyphens at line breaks that might split words
        text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)

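        # Clean up extra whitespace, but keep line breaks:
        # chunk_text() relies on "\n" to detect rule boundaries
        text = re.sub(r'[ \t]+', ' ', text)
        text = re.sub(r'\n\s*\n+', '\n\n', text)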

    return text

The extract_text_from_pdf function handles the critical first step of our pipeline: converting PDF rulebooks into processable text. It uses pdfplumber to open the PDF file and extract text content page by page, printing a progress update every five pages. This gives us visibility into the extraction process, which is especially helpful for larger rulebooks that might take some time to process. Extraction is wrapped in a try/except block, so a problematic PDF produces an error message and an empty string rather than a crash.

What makes this function particularly suited for board game rulebooks is the post-processing section. After extraction, we apply several regular expressions to standardize board game terminology, converting "meeple" or "meeples" to the consistent form "meeple(s)" and expanding "VP" or "VPs" to "Victory Point(s)". We also fix common OCR errors found in scanned rulebooks, such as "plaver" for "player" and "tum" for "turn". Additionally, the function handles hyphenated words that span multiple lines, joining them back together to maintain the integrity of the text. Finally, it cleans up extra whitespace to produce a more uniform text representation that's easier to process in subsequent steps.

Chunking Text for Processing

def chunk_text(text, chunk_size=300, chunk_overlap=50):
    if not isinstance(text, str):
        text = str(text)

    # Improved rule boundary detection for board game rulebooks
    rule_chunks = re.split(r'(\n\s*[A-Z][A-Z\s]+:|\n\s*•|\n\s*\d+\.|\n\n|\n\s*[A-Z][A-Z\s]+\n|\n\s*-{3,})', text)

    # Add specific patterns for game setup, player turns, and scoring sections
    setup_pattern = re.compile(r'(SETUP|GAME SETUP|PREPARATION|SET UP):', re.IGNORECASE)
    turns_pattern = re.compile(r'(PLAYER TURN|GAMEPLAY|PLAYING THE GAME|TURN STRUCTURE|ON YOUR TURN):', re.IGNORECASE)
    scoring_pattern = re.compile(r'(SCORING|END OF GAME|GAME END|VICTORY POINTS|WINNING):', re.IGNORECASE)

    # Recombine the splitter with the content
    processed_chunks = []
    current_chunk = ""

    for i in range(0, len(rule_chunks)):
        chunk = rule_chunks[i]

        # Check if this is a section header for important game concepts
        is_important_header = (setup_pattern.search(chunk) or 
                              turns_pattern.search(chunk) or 
                              scoring_pattern.search(chunk))

        if is_important_header and current_chunk:
            # Always start a new chunk for important game sections
            processed_chunks.append(current_chunk.strip())
            current_chunk = chunk
        elif i > 0 and (i % 2 == 1 or (i > 1 and i % 2 == 0 and rule_chunks[i-1].strip())):
            # This is a header or the content after a header
            if len(current_chunk) + len(chunk) > chunk_size and current_chunk:
                processed_chunks.append(current_chunk.strip())
                current_chunk = chunk
            else:
                current_chunk += chunk
        else:
            # This is regular content
            if len(current_chunk) + len(chunk) > chunk_size and current_chunk:
                processed_chunks.append(current_chunk.strip())
                current_chunk = chunk
            else:
                current_chunk += chunk

    if current_chunk:
        processed_chunks.append(current_chunk.strip())

    # Add overlapping chunks for better context
    final_chunks = []
    for i in range(len(processed_chunks)):
        # Keep some context from previous chunk
        prev_context = ""
        if i > 0:
            prev_text = processed_chunks[i-1]
            prev_context = prev_text[-chunk_overlap:] if len(prev_text) > chunk_overlap else prev_text

        # Keep some context from next chunk
        next_context = ""
        if i < len(processed_chunks) - 1:
            next_text = processed_chunks[i+1]
            next_context = next_text[:chunk_overlap] if len(next_text) > chunk_overlap else next_text

        final_chunk = (prev_context + " " + processed_chunks[i] + " " + next_context).strip()
        final_chunks.append(final_chunk)

    return final_chunks

The chunk_text function is a text-splitting routine designed for board game rulebooks. Rather than splitting purely by character count, it first breaks the text at natural rule boundaries using a regular expression, then merges the pieces into chunks of roughly chunk_size characters. It recognizes uppercase section headings, bullet points, numbered lists, blank lines, and horizontal rules, which helps keep related information together within a chunk.
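The boundary detection relies on a detail of re.split: when the pattern is wrapped in a capturing group, the matched separators are kept in the result at the odd indices. The sketch below shows this with a shortened version of the pattern and a made-up rules snippet; BOUNDARY and the sample text are illustrative names only.

```python
import re

# Shortened boundary pattern: uppercase "HEADING:", numbered items, blank lines
BOUNDARY = re.compile(r'(\n\s*[A-Z][A-Z\s]+:|\n\s*\d+\.|\n\n)')

text = "Intro text.\nSETUP: Shuffle the deck.\n1. Deal cards.\n2. Pick a start player."
parts = BOUNDARY.split(text)

for index, part in enumerate(parts):
    kind = "separator" if index % 2 == 1 else "content"
    print(f"{index} {kind:9} {part!r}")

# Because the separators are captured, no text is lost by the split
assert "".join(parts) == text
```

Keeping the separators is what lets the chunker glue a heading such as "SETUP:" back onto the content that follows it.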

The function incorporates domain-specific knowledge by identifying important sections like setup instructions, player turn structures, and scoring rules through specialized pattern matching. When these critical sections are detected, the function ensures they start new chunks, preserving their context and making them easier to retrieve independently. This attention to the structure of rulebooks significantly improves the quality of the retrieved information.

To further enhance context awareness, the function implements chunk overlapping. Each final chunk includes a portion of text from both the preceding and following chunks, creating contextual bridges between segments. This overlapping technique helps maintain continuity of information across chunk boundaries, which is particularly valuable when rules reference earlier or later sections. The default values of 300 characters per chunk with 50-character overlaps strike a balance between granularity and context preservation, though these parameters can be adjusted based on specific needs.
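The overlap step can be isolated into a few lines. This is a simplified sketch of the idea rather than the notebook's exact code; add_overlap is a hypothetical helper name, and I added a guard for overlap=0 because a slice like s[-0:] returns the whole string in Python.

```python
def add_overlap(chunks, overlap=50):
    """Surround each chunk with the tail of its predecessor and the head of its successor."""
    result = []
    for i, chunk in enumerate(chunks):
        # Guard against overlap == 0, since s[-0:] would return all of s
        prev_tail = chunks[i - 1][-overlap:] if i > 0 and overlap > 0 else ""
        next_head = chunks[i + 1][:overlap] if i < len(chunks) - 1 else ""
        result.append(f"{prev_tail} {chunk} {next_head}".strip())
    return result

chunks = [
    "SETUP: Deal five cards.",
    "ON YOUR TURN: Draw one card.",
    "SCORING: Count your points.",
]
for chunk in add_overlap(chunks, overlap=10):
    print(repr(chunk))
```

Note that the overlap is added on top of the chunk itself, so a final chunk can be up to chunk_size plus twice chunk_overlap characters long.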

Embedding Generation

def generate_embeddings(texts):
    if not isinstance(texts, list):
        texts = [texts]

    print(f"Generating embeddings for {len(texts)} text chunks...")
    embeddings = []

    for text in texts:
        # Ensure text is a string before slicing or stripping it
        if not isinstance(text, str):
            text = str(text)

        text_preview = text[:50].replace('\n', ' ') + "..." if len(text) > 50 else text
        print(f"Processing text chunk: '{text_preview}'")

        if not text.strip():
            # Handle empty text
            embeddings.append([0.0] * 512)
            continue

        try:
            # Use Gemini to encode text semantically
            prompt = f"""
            Encode the following text into a 512-dimensional numerical vector that captures its semantic meaning.
            Output ONLY a JSON array of 512 float values.

            TEXT: {text[:800]}

            OUTPUT (JSON array only):
            """

            response = model.generate_content(prompt)
            response_text = response.text.strip()

            # Extract the JSON array
            if response_text.startswith('[') and response_text.endswith(']'):
                vector = json.loads(response_text)
            else:
                match = re.search(r'(\[[\d\.\-\,\s]+\])', response_text)
                if match:
                    vector = json.loads(match.group(1))
                else:
                    raise ValueError("Invalid embedding format")

            # Ensure we have exactly 512 dimensions
            if len(vector) != 512:
                if len(vector) > 512:
                    vector = vector[:512]
                else:
                    vector = vector + [0.0] * (512 - len(vector))

            embeddings.append(vector)

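        except Exception as e:
            # Fallback to a hash-based embedding: the 16-byte MD5 digest,
            # with each byte scaled to [-1, 1) and repeated 32 times (16 * 32 = 512)
            print(f"Embedding failed ({e}); using hash-based fallback")
            text_hash = hashlib.md5(text.encode()).digest()
            vector = []
            for b in text_hash:
                vector.extend([b / 128.0 - 1.0] * 32)
            embeddings.append(vector)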

    return embeddings

The generate_embeddings function uses the Gemini model itself to create vector representations of text chunks. Instead of calling a dedicated embedding API, it relies on the generative capabilities of the model. For each text chunk, it constructs a prompt that asks Gemini to write out a 512-dimensional vector representing the semantic meaning of the first 800 characters of the text. This approach asks a generative model to act as an embedding model; it is a workaround that demonstrates flexibility in working with available resources rather than a substitute for a dedicated embedding endpoint, since nothing guarantees that vectors produced this way are consistent from one call to the next.

The function includes parsing logic to extract the vector from the model's response, handling different response formats. If the response is not a bare JSON array, it uses a regular expression to find the array inside the surrounding text and then loads it as a Python list. To ensure consistency, the function truncates or zero-pads every vector to exactly 512 dimensions. This matters because all embeddings stored in one ChromaDB collection must have the same dimensionality.
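The parsing and padding steps can be tested without calling the model at all. Below is a small sketch under my own assumptions: parse_vector is a hypothetical helper, and unlike the pattern in the article its regular expression also accepts exponent notation such as 3e-1.

```python
import json
import re

def parse_vector(response_text: str, dim: int = 512) -> list:
    """Extract a JSON array of numbers from model output and force it to `dim` entries."""
    # Look for the first bracketed run of digits, signs, dots, commas and exponents
    match = re.search(r'\[[\d.\-,\seE+]*\]', response_text)
    if not match:
        raise ValueError("No JSON array found in response")
    vector = [float(x) for x in json.loads(match.group(0))]
    # Truncate long vectors and zero-pad short ones
    return vector[:dim] + [0.0] * max(0, dim - len(vector))

print(parse_vector("Here is the vector: [0.1, -0.2, 3e-1]", dim=5))
print(parse_vector("[1, 2, 3, 4]", dim=2))
```

Zero-padding keeps the pipeline running, but a heavily padded vector carries little information, so it may be worth logging how often padding happens.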

A particularly robust feature of this function is its fallback mechanism for handling failures. If the Gemini model fails to produce a usable embedding for any reason, the function doesn't crash but instead generates a deterministic hash-based representation of the text using MD5. While this fallback isn't semantically meaningful like the AI-generated embeddings, it ensures that every chunk gets some form of vector representation, maintaining the integrity of the processing pipeline. The function also includes detailed logging to track progress and identify any problematic chunks.
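The fallback is easy to verify in isolation. The sketch below reproduces the same arithmetic in a standalone function (hash_fallback_embedding is a name I chose for illustration): an MD5 digest is 16 bytes, and repeating each scaled byte 32 times yields 16 * 32 = 512 values.

```python
import hashlib

def hash_fallback_embedding(text: str) -> list:
    """Deterministic 512-dimensional stand-in vector derived from an MD5 digest."""
    digest = hashlib.md5(text.encode()).digest()  # always 16 bytes
    vector = []
    for b in digest:
        # Scale each byte from 0..255 into [-1.0, 1.0) and repeat it 32 times
        vector.extend([b / 128.0 - 1.0] * 32)
    return vector

first = hash_fallback_embedding("Roll two dice.")
second = hash_fallback_embedding("Roll two dice.")
print(len(first), first == second)
```

Because a hash changes completely when a single character changes, two nearly identical rules receive unrelated fallback vectors, which is why this path should be seen as a last resort.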

Processing Game Rulebooks

def process_game_rulebook(pdf_path, game_name, chroma_collection):
    print(f"Starting to process {game_name}...")

    try:
        # Step 1: Extract text from PDF
        rulebook_text = extract_text_from_pdf(pdf_path)
        print(f"[{game_name}] Text extraction done. Length: {len(rulebook_text)}")

        # Step 2: Break text into logical chunks
        text_chunks = chunk_text(rulebook_text, chunk_size=300, chunk_overlap=50)
        print(f"[{game_name}] Total chunks created: {len(text_chunks)}")

        # Step 3: Generate embeddings for all chunks
        chunk_embeddings = generate_embeddings(text_chunks)

        # Step 4: Add all chunks to the collection
        ids = [f"{game_name.lower().replace(' ', '_')}_chunk_{i}" for i in range(len(text_chunks))]
        metadatas = [{"game": game_name, "chunk_id": i} for i in range(len(text_chunks))]

        chroma_collection.add(
            embeddings=chunk_embeddings,
            documents=text_chunks,
            ids=ids,
            metadatas=metadatas
        )

        print(f"[{game_name}] All chunks processed successfully.")
        return True

    except Exception as e:
        print(f"Error processing {game_name}: {e}")
        return False

The process_game_rulebook function manages the entire rulebook processing pipeline, coordinating the extraction, chunking, embedding, and storage phases. It serves as the central workflow manager, ensuring that each step is executed in the proper sequence and that the results are correctly passed to the next stage. The function begins by extracting text from the PDF rulebook, then proceeds to chunk the text into semantically meaningful segments, generate embeddings for those chunks, and finally store everything in the vector database.

This function employs structured error handling using a try-except block that wraps the entire process, providing resilience against failures at any stage. If an error occurs during processing, the function catches it, logs the specific error message, and returns a failure status. This design prevents one problematic rulebook from crashing the entire system, allowing the application to continue processing other rulebooks. The function also includes detailed progress logging at each step, giving visibility into the processing status and helping identify bottlenecks.

When adding chunks to the ChromaDB collection, the function creates unique identifiers for each chunk by combining the game name (with spaces replaced by underscores) and a sequential chunk index. It also attaches metadata to each chunk, including the game name and chunk ID, which will be crucial for filtering queries later. This metadata enables the system to retrieve only the chunks relevant to a specific game when answering questions, improving both efficiency and accuracy. The combination of unique IDs and rich metadata makes the stored chunks easily retrievable and maintainable within the vector database.
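As a quick illustration of what ends up in the collection, this sketch builds the IDs and metadata for a made-up game with two chunks. The helper name build_ids_and_metadata is hypothetical; the formatting mirrors the list comprehensions above.

```python
def build_ids_and_metadata(game_name, n_chunks):
    """Create one ID and one metadata dict per chunk, as in process_game_rulebook."""
    slug = game_name.lower().replace(' ', '_')
    ids = [f"{slug}_chunk_{i}" for i in range(n_chunks)]
    metadatas = [{"game": game_name, "chunk_id": i} for i in range(n_chunks)]
    return ids, metadatas

ids, metadatas = build_ids_and_metadata("Ticket to Ride", 2)
print(ids)
print(metadatas)
```

The metadata keeps the original spelling of the game name, so later filters must use exactly that string.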

Retrieving Relevant Information

def retrieve_relevant_chunks(query, game_name, chroma_collection, top_n=3):
    try:
        # Generate embedding for the query
        query_embedding = generate_embeddings(query)[0]

        # Query the collection with game filter
        results = chroma_collection.query(
            query_embeddings=[query_embedding],
            n_results=top_n,
            include=["documents", "metadatas"],
            where={"game": game_name}  # Filter by game name
        )

        return results

    except Exception as e:
        print(f"Error retrieving chunks: {e}")
        return {"documents": [[]], "metadatas": [[]]}

def generate_response_with_rag(query, game_name, chroma_collection):
    try:
        # Get relevant chunks for the specified game
        results = retrieve_relevant_chunks(query, game_name, chroma_collection)

        # Check if we have any documents
        if not results or "documents" not in results or not results["documents"] or len(results["documents"][0]) == 0:
            return f"No relevant rules found for {game_name}."

        # Extract documents
        documents = results["documents"][0]

        # Prepare context
        context_text = "\n\n".join([doc for doc in documents if doc])

        # Enhanced prompt to handle game terminology better
        query_enhanced = query

        # Convert common abbreviations in the query to full form
        abbrev_mapping = {
            r'\bVP\b': 'Victory Points',
            r'\bPvP\b': 'Player versus Player',
            r'\bPvE\b': 'Player versus Environment',
            r'\bDM\b': 'Dungeon Master',
            r'\bGM\b': 'Game Master',
            r'\bCoop\b': 'Cooperative'
        }

        for abbr, full in abbrev_mapping.items():
            query_enhanced = re.sub(abbr, full, query_enhanced, flags=re.IGNORECASE)

        # Create prompt for the model
        prompt = f"""
        Based on these excerpts from the {game_name} rulebook:

{context_text}


        Please answer this question about {game_name}: "{query_enhanced}"

        Find the specific rule that answers the question. If found, quote it directly, then provide a brief explanation.
        If you encounter board game terms like "meeple", "worker placement", "deck building", etc., explain them if needed.
        If spelling errors or unusual terms appear in the rules, interpret them in the context of board games.
        """

        # Send to the model
        response = model.generate_content(prompt)

        # Process the response
        result = response.text if hasattr(response, 'text') and response.text else f"No specific rule found for {game_name}."

        # Clean up the response
        result = result.replace('"""', '"').replace('```', '')

        return result

    except Exception as e:
        print(f"Error generating response: {e}")
        return f"Error generating response for {game_name}: {str(e)}"

The retrieval system is implemented through two interconnected functions: retrieve_relevant_chunks and generate_response_with_rag. The first function, retrieve_relevant_chunks, handles the semantic search component of our RAG system. It converts the user's query into an embedding using the same process we used for rulebook chunks, ensuring compatibility. It then queries the ChromaDB collection to find the most semantically similar chunks, but with an important constraint—it filters results to only include chunks from the specified game using the "where" clause. This game-specific filtering ensures that users get answers relevant to the exact game they're asking about, even if similar rules exist in other games.
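Conceptually, a filtered query does two things: it keeps only the records whose metadata matches, then ranks them by distance to the query vector. The brute-force sketch below illustrates that idea in plain Python. It is not how ChromaDB is implemented (a vector database uses an index rather than a full scan), and the record layout, the function name, and the choice of Euclidean distance are my assumptions.

```python
import math

def top_n_for_game(query_vec, records, game, top_n=3):
    """Return the documents of the `top_n` records for `game` closest to `query_vec`.

    `records` is a list of dicts with 'game', 'document' and 'embedding' keys
    (a hypothetical layout used only for this illustration).
    """
    # Step 1: metadata filter, the counterpart of where={"game": game}
    candidates = [r for r in records if r["game"] == game]
    # Step 2: rank by Euclidean distance to the query vector
    candidates.sort(key=lambda r: math.dist(query_vec, r["embedding"]))
    return [r["document"] for r in candidates[:top_n]]

records = [
    {"game": "Catan", "document": "Roll dice", "embedding": [0.0, 1.0]},
    {"game": "Catan", "document": "Build roads", "embedding": [1.0, 0.0]},
    {"game": "Azul", "document": "Take tiles", "embedding": [1.0, 0.1]},
]
print(top_n_for_game([1.0, 0.0], records, "Catan", top_n=1))
```

The Azul record is close to the query vector but never appears in the result, because the filter removes it before ranking.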

The generate_response_with_rag function implements the complete Retrieval-Augmented Generation process. After retrieving relevant chunks using the previous function, it combines them into a coherent context for the language model. The function includes a sophisticated preprocessing step that expands common board game abbreviations in the user's query, converting shorthand like "VP" to "Victory Points" to improve matching with rulebook text. This preprocessing demonstrates domain-specific optimization that makes the system more user-friendly for board game enthusiasts who frequently use such abbreviations.
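The abbreviation expansion is simple enough to try on its own. This sketch uses two entries from the mapping above; expand_abbreviations is a hypothetical wrapper name. The word-boundary markers (\b) are what prevent "VP" from matching inside a longer word such as "VPN".

```python
import re

ABBREVIATIONS = {
    r'\bVP\b': 'Victory Points',
    r'\bGM\b': 'Game Master',
}

def expand_abbreviations(query: str) -> str:
    """Replace known abbreviations with their full form, ignoring case."""
    for pattern, full in ABBREVIATIONS.items():
        query = re.sub(pattern, full, query, flags=re.IGNORECASE)
    return query

print(expand_abbreviations("How many vp does the GM award?"))
# How many Victory Points does the Game Master award?
```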

The construction of the prompt deserves a closer look. It presents the retrieved excerpts as a separate block ahead of the question, clearly specifies the game being discussed, and gives detailed instructions to the model about how to respond. These instructions direct the model to quote relevant rules directly and explain board game terminology, so that responses are both grounded in the rulebook and educational. The prompt also acknowledges that rulebooks often contain spelling errors or unusual terms, instructing the model to interpret them within the context of board games.

Helper Functions for User Interface

def create_rulebook_index(rulebooks_folder):
    """Create an index of available rulebooks without processing them."""
    available_games = []

    try:
        pdf_files = [f for f in os.listdir(rulebooks_folder) if f.lower().endswith('.pdf')]
        print(f"Found {len(pdf_files)} PDF files in the folder")

        for filename in pdf_files:
            # Extract game name from filename (remove extension)
            game_name = os.path.splitext(filename)[0]
            available_games.append(game_name)

    except Exception as e:
        print(f"Error accessing rulebooks folder: {e}")

    return available_games

def is_game_processed(game_name, chroma_collection):
    """Check if a game has already been processed and added to the collection."""
    try:
        # Try to query the collection with game filter to see if entries exist
        results = chroma_collection.query(
            query_embeddings=[[0]*512],  # Dummy embedding
            n_results=1,
            where={"game": game_name}
        )

        # If we get any documents, the game has been processed
        return results and "documents" in results and results["documents"] and len(results["documents"][0]) > 0
    except Exception as e:
        print(f"Error checking if game is processed: {e}")
        return False

The create_rulebook_index function provides a quick way to discover available rulebooks without the computational expense of processing them. It scans the specified folder for files with a PDF extension, extracts game names from the filenames by removing the extension, and compiles them into a list. This approach allows the application to present users with a menu of available rulebooks right at startup, providing immediate value even before any processing has occurred. The function includes error handling to gracefully manage issues with file system access, returning an empty list rather than crashing if the folder cannot be accessed.

The is_game_processed function serves as an optimization mechanism that prevents redundant processing of rulebooks. It checks whether a game's data already exists in the ChromaDB collection by attempting to query for chunks associated with that game name. The function uses a clever technique of providing a dummy embedding (a vector of zeros) and limiting the results to just one document—the minimum needed to determine existence. If any documents are returned, it confirms that the game has been processed previously. This check allows the application to skip the time-consuming extraction, chunking, and embedding steps for rulebooks that have already been processed, significantly improving efficiency during repeated use of the system.
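An alternative worth considering is a metadata-only lookup, which avoids the dummy embedding entirely. The sketch below assumes the collection object exposes a get method that accepts where and limit arguments and returns a dict containing an "ids" list, as ChromaDB collections do; has_chunks_for_game is a hypothetical name, and I have not benchmarked it against the query-based version.

```python
def has_chunks_for_game(game_name, chroma_collection) -> bool:
    """Return True if at least one stored chunk carries this game's metadata."""
    try:
        # get() filters on metadata only, so no query embedding is required
        found = chroma_collection.get(where={"game": game_name}, limit=1)
        return len(found.get("ids", [])) > 0
    except Exception as e:
        print(f"Error checking if game is processed: {e}")
        return False
```

Returning a plain bool also makes the result easier to reason about than the chained "and" expression, which returns whatever its last evaluated operand happens to be.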

Main Application Loop

def main():
    """Main application loop for board game rulebook processing and querying."""
    # Initialize ChromaDB client
    chroma_client = chromadb.Client()
    print("Successfully connected to ChromaDB")

    # Define collection name and create/access it
    collection_name = "board_game_rules"
    chroma_collection = chroma_client.create_collection(
        name=collection_name, 
        get_or_create=True
    )
    print(f"Successfully created/accessed collection: {collection_name}")

    # Define the folder path containing all rulebooks
    rulebooks_folder = "/kaggle/input/rulebook1/"

    running = True

    # Get available rulebooks once at startup
    available_games = create_rulebook_index(rulebooks_folder)
    available_games.sort()

    while running:
        print("\n=== BOARD GAME RULEBOOK ASSISTANT ===")
        print("Type a rulebook name, 'list' to see available rulebooks, or 'exit' to quit:")
        choice = input("> ").strip()

        # Check for exit command
        if choice.lower() == 'exit':
            print("Goodbye!")
            running = False
            continue

        # Show rulebook list
        elif choice.lower() == 'list':
            if not available_games:
                print("No rulebooks found in the folder.")
                continue

            # Display sorted list of rulebooks without numbers
            print("\n=== AVAILABLE RULEBOOKS ===")
            for game in available_games:
                print(f"• {game}")

            print("\nType the name of a rulebook to process or 'back' to return:")
            rulebook_choice = input("> ").strip()

            if rulebook_choice.lower() == 'back':
                continue
        else:
            # User entered a rulebook name directly
            rulebook_choice = choice

        # Skip this part if user typed 'back' from the list view
        if rulebook_choice.lower() != 'back':
            # Find closest matching rulebook name
            selected_game = None
            closest_match = None
            highest_similarity = 0

            for game in available_games:
                # Simple similarity score - can be improved with more sophisticated methods
                # Current method checks for case-insensitive substring match
                if rulebook_choice.lower() in game.lower():
                    # Calculate similarity as percentage of matched characters
                    similarity = len(rulebook_choice) / len(game) if len(game) > 0 else 0
                    if similarity > highest_similarity:
                        highest_similarity = similarity
                        closest_match = game

                # Exact match
                if game.lower() == rulebook_choice.lower():
                    selected_game = game
                    break

            # If no exact match but a close match was found
            if not selected_game and closest_match:
                selected_game = closest_match
            elif not selected_game:
                print(f"No rulebook found matching '{rulebook_choice}'. Please try again.")
                continue

            # Get PDF path for selected game
            pdf_path = os.path.join(rulebooks_folder, f"{selected_game}.pdf")

            # Check if already processed
            is_processed = is_game_processed(selected_game, chroma_collection)

            print(f"Processing rulebook for {selected_game}...")

            if not is_processed:
                # Process the rulebook in sequence: extract text -> chunk text -> embed -> store
                process_success = process_game_rulebook(pdf_path, selected_game, chroma_collection)

                if not process_success:
                    print(f"Error processing {selected_game}. Please try another rulebook.")
                    continue

            # Enter query mode for this rulebook
            querying = True
            while querying:
                print(f"\nAsk a rule about {selected_game} or type 'back' to see the main menu or 'exit' to quit")
                query = input("> ").strip()

                if query.lower() == 'back':
                    querying = False
                    continue
                elif query.lower() == 'exit':
                    querying = False
                    running = False
                    print("Goodbye!")
                    continue

                print(f"Processing rulebook: {selected_game}")

                # Generate response to query
                response = generate_response_with_rag(query, selected_game, chroma_collection)
                print("\nResponse:")
                print(response)
                print("\n" + "-"*50)

if __name__ == "__main__":
    main()

The main function implements a command-line interface (CLI) for the board game rulebook assistant. It begins by initializing the ChromaDB client and creating or accessing a collection named "board_game_rules" using the get_or_create=True parameter, which returns the existing collection instead of raising an error when one with that name already exists. Note that chromadb.Client() creates an in-memory client, so the collection lasts only for the current session; keeping it across restarts would require a persistent client such as chromadb.PersistentClient with a storage path. The function then scans a directory of rulebook PDFs and generates an index of available games, which is sorted alphabetically for user convenience.

The core of the function is a nested loop structure that creates a hierarchical interface. The outer loop handles the main menu, allowing users to list available rulebooks, select a rulebook by name, or exit the application. If no exact match exists, the interface falls back to a case-insensitive substring match and picks the candidate for which the typed text covers the largest share of the rulebook name. This accommodates partial names, making the system more accessible to users who might not remember exact rulebook titles, although it does not correct typos.
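If typo tolerance is wanted, the standard library's difflib module is one option. The sketch below is my own extension, not part of the notebook: find_rulebook is a hypothetical helper that tries an exact match, then the substring rule described above, and finally difflib.get_close_matches with an assumed cutoff of 0.6.

```python
import difflib

def find_rulebook(user_input, available_games, cutoff=0.6):
    """Resolve user input to a rulebook name, or return None if nothing is close."""
    by_lower = {game.lower(): game for game in available_games}
    needle = user_input.strip().lower()

    # 1. Exact, case-insensitive match
    if needle in by_lower:
        return by_lower[needle]

    # 2. Substring match, preferring the name the input covers most of
    partial = [g for g in available_games if needle and needle in g.lower()]
    if partial:
        return max(partial, key=lambda g: len(needle) / len(g))

    # 3. Typo-tolerant match based on sequence similarity
    close = difflib.get_close_matches(needle, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None

games = ["Catan", "Ticket to Ride", "Carcassonne"]
print(find_rulebook("ticket", games))
print(find_rulebook("carcasonne", games))
```

The cutoff trades recall against false positives; a lower value accepts sloppier input but may select the wrong game.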

Once a rulebook is selected, the function checks if it has already been processed using the is_game_processed helper function. If not, it invokes the processing pipeline through process_game_rulebook. This check prevents redundant processing when the same rulebook is selected again within a session. After processing (or confirming previous processing), the function enters an inner loop for querying the selected rulebook. This nested structure creates a natural conversation flow where users can ask multiple questions about a single rulebook before returning to select another. Each query triggers the RAG system through generate_response_with_rag, which retrieves relevant chunks and generates a contextually appropriate answer. The dual-loop design mirrors how people naturally interact with rulebooks: select a game, then ask multiple questions about it.

Conclusion

This board game rulebook assistant demonstrates the power of combining PDF text extraction, intelligent text chunking, vector embeddings, and generative AI in a retrieval-augmented generation architecture. The system handles the complexity of board game rulebooks by preserving rule boundaries, recognizing important sections, and understanding domain-specific terminology.

The implementation utilizes several fundamental RAG techniques:

  1. Text extraction from PDFs to create processable content

  2. Text chunking to break down large documents into manageable segments

  3. Vector embeddings to represent text chunks in semantic space

  4. Vector database storage for efficient retrieval

  5. Semantic similarity search to find relevant information

  6. Context-enhanced prompt engineering to generate accurate responses

However, the system does have noteworthy limitations. It can only process text content from PDFs and cannot interpret images, diagrams, or tables that are often crucial in board game rulebooks. This is a significant drawback as many rulebooks rely heavily on visual aids to explain game mechanics, board layouts, and card interactions.

Another important limitation is the API rate limit of Google's Gemini model. At the time of writing, the free tier of Gemini 2.0 Flash Lite is restricted to 30 requests per minute, which creates a bottleneck when processing large rulebooks (every chunk costs one request for its embedding) or handling multiple user queries in quick succession. While the paid tier offers over 4,000 requests per minute, this represents a substantial cost increase that might be prohibitive for hobbyist or educational implementations.
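One way to stay under such a limit is a sliding-window rate limiter built on deque, which the article already imports. The class below is a minimal sketch under my own assumptions: SlidingWindowLimiter is a hypothetical name, the default of 30 calls per 60 seconds comes from the figure quoted above, and the clock and sleep parameters exist only so the behavior can be tested without waiting.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Block until a call is allowed under `max_calls` per `period` seconds."""

    def __init__(self, max_calls=30, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        while True:
            now = self.clock()
            # Forget calls that have left the time window
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                break
            # Window is full: sleep until the oldest call expires
            self.sleep(self.period - (now - self.calls[0]))
        self.calls.append(now)
```

In the pipeline, calling limiter.wait() immediately before each model.generate_content(...) request would pace both embedding generation and question answering.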

Additionally, the system's accuracy is highly dependent on the quality of the extracted text and the effectiveness of the chunking algorithm. OCR errors in scanned rulebooks or complex formatting can lead to degraded performance in both retrieval and answer generation phases.

Despite these limitations, this project demonstrates how RAG systems can be built with accessible tools and models to create specialized knowledge assistants for domain-specific applications, making complex information more accessible and usable.

Edit: For reference, see the code in the Kaggle notebook: https://www.kaggle.com/code/antonioropac/boardgame-rules-retrieval-using-rag

For any questions, connect with me on LinkedIn: https://www.linkedin.com/in/aropac57
