How I Extract PDF Text, Generate Embeddings, and Store Them in Chroma (Beginner-Friendly Guide)


In today’s world of AI and machine learning, turning raw text into something computers can truly “understand” is becoming essential — especially when working with large documents like PDFs. If you’ve ever wished you could search your PDFs for ideas and meanings — not just exact words — then this guide is for you.
In this post, I’ll walk you through the concepts and tools you need to make this happen. You’ll learn:
- Which Python packages to install
- What each package does
- How to extract text from a PDF
- How to split that text into smaller, usable chunks
- How to convert those chunks into embeddings (numerical representations of meaning)
- And finally, how to store and search those embeddings locally using a vector database
Rather than dumping a full script on you, I'll keep the code snippets small and focus on the why behind each step. By the end, you'll have a clear roadmap for setting up your own AI-powered PDF search tool and the confidence to experiment and build on your own.
Let’s get started!
1) Install the Required Packages
First, set up a project folder. It will hold your Python script, the PDF you want to use as data, and a plain-text file named requirements.txt with the following contents:

```
sentence-transformers
pymupdf
chromadb
```

Then install everything from your terminal:

```
pip install -r requirements.txt
```
Here is what each package does:

- Sentence-Transformers: makes it easy to create embeddings for sentences, paragraphs, or whole documents. It loads pre-trained models such as all-MiniLM-L6-v2 or paraphrase-MiniLM-L12-v2 and turns your text into dense vector representations.
- PyMuPDF: a fast PDF library (imported as fitz) that we use to open the PDF and extract its raw text.
- ChromaDB: the official Python client for the Chroma vector database, used for storing and querying embeddings locally (via PersistentClient).
2) Import Packages
Import all the packages we will use in this code:

```python
from sentence_transformers import SentenceTransformer
import fitz  # PyMuPDF
import chromadb
```
3) Extract PDF Text (PyMuPDF)
First, I wrote a function that opens a PDF file and extracts all its text using the PyMuPDF library (fitz). After extracting the text, I printed the first 500 characters to check that it worked. Next, I created a split_text function that breaks the text into smaller chunks of 10 words each. Finally, I generated the chunks and printed how many there are, along with a sample chunk. Chunking makes the text much easier to process for tasks like semantic search or summarization.
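Here is a minimal sketch of those two steps. The filename my_document.pdf is a placeholder for your own PDF, and the 10-word chunk size matches this walkthrough (real projects usually use larger chunks):

```python
import os

def extract_text(pdf_path):
    """Open a PDF with PyMuPDF and return all of its text as one string."""
    import fitz  # PyMuPDF; imported here so split_text below also works on its own
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

def split_text(text, chunk_size=10):
    """Break text into chunks of `chunk_size` words each."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

if os.path.exists("my_document.pdf"):  # demo runs only when the sample PDF exists
    text = extract_text("my_document.pdf")
    print(text[:500])                  # sanity check: first 500 characters
    chunks = split_text(text)
    print(f"{len(chunks)} chunks; sample: {chunks[0]!r}")
```

Splitting on a fixed word count is the simplest possible strategy; sentence- or paragraph-based splitting usually gives more coherent chunks once you start experimenting.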
4) Sentence-Transformers
An embedding turns text into a vector of numbers so that a computer can represent the meaning of the text mathematically and find similar text (semantic search).
After splitting my PDF text into chunks, I loaded a pre-trained SentenceTransformer model called all-MiniLM-L6-v2 to convert each chunk into a numerical vector, its embedding. Using the model's encode method, I transformed all my text chunks into embeddings and converted them to plain lists for easy handling. I then printed the total number of embeddings to confirm everything worked. Finally, I inspected one embedding to see the numeric representation of my text, ready for search or other AI tasks!
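A small sketch of this step. The model is passed in as a parameter so the helper itself needs nothing downloaded; all-MiniLM-L6-v2 is the model named in this post:

```python
def embed_chunks(chunks, model):
    """Encode text chunks into embeddings and return them as plain Python lists."""
    embeddings = model.encode(chunks)  # one dense vector per chunk
    return [list(map(float, vec)) for vec in embeddings]

# Typical usage (downloads the model on first run):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embeddings = embed_chunks(chunks, model)
#   print(len(embeddings), "embeddings of dimension", len(embeddings[0]))
```

all-MiniLM-L6-v2 produces 384-dimensional vectors; it is small and fast, which makes it a good default for local experiments.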
5) Store in Chroma
ChromaDB saves the embeddings (vectors) in a database so that we can later run a similarity search over them.
After creating my embeddings, I set up a ChromaDB PersistentClient with a local database path so my data is saved to disk permanently. I then created (or got) a collection named pdf_collection to keep everything organized. Using the add method, I stored my text chunks, their embeddings, and a unique ID for each chunk. Once the data was added, I printed a success message and checked how many items were saved, confirming that my entire PDF is now ready for quick, AI-powered search!
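A sketch of the storage step. The collection is passed in so the helper is easy to test; the chroma_db path and the pdf_collection name are just the examples from this walkthrough:

```python
def store_chunks(collection, chunks, embeddings):
    """Add text chunks, their embeddings, and generated unique IDs to a collection."""
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    collection.add(documents=chunks, embeddings=embeddings, ids=ids)
    return ids

# Typical usage:
#   import chromadb
#   client = chromadb.PersistentClient(path="chroma_db")
#   collection = client.get_or_create_collection(name="pdf_collection")
#   store_chunks(collection, chunks, embeddings)
#   print("Stored", collection.count(), "items")
```

PersistentClient writes everything under the given path, so the collection survives restarts; re-running the script with the same IDs will update the existing entries rather than create duplicates.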
Conclusion:
Basically, I took a PDF, broke it into smaller chunks, embedded them, and used ChromaDB to run a similarity search for the word ‘petals’. Chroma then pulled up the chunk where that word appears.
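That final search could be sketched like this. Because we added our own embeddings, we query with an embedding of the search term rather than raw text; the word "petals" is the example from this post:

```python
def search(collection, model, query, n_results=1):
    """Embed a query string and return the most similar stored chunks."""
    query_vec = model.encode([query])[0]
    results = collection.query(
        query_embeddings=[list(map(float, query_vec))],
        n_results=n_results,
    )
    return results["documents"][0]  # documents matching the first (only) query

# Typical usage:
#   hits = search(collection, model, "petals")
#   print(hits[0])  # the stored chunk most similar to "petals"
```

Note that this is semantic search: a query for "petals" can also surface chunks about flowers or blossoms even if the exact word never appears.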
Follow-Up Note:
This is just the beginning — now that your PDF is chunked, embedded, and stored in ChromaDB, you can build amazing applications on top of it. Try adding a simple question-answering system, a chatbot that cites sections of your document, or even connect multiple PDFs into one searchable knowledge base. Keep experimenting, improve your chunk sizes and embeddings, and explore more advanced models as you grow. The power of AI search is in your hands — happy building!
Written by Najmussahar kazmi