At my startup, we ran into two challenges with PDFs:

We needed to let users search for any keyword in a PDF and jump between results, just like in a professional PDF reader.
We also needed to highlight specific text chunks in the PDF: the passages our product surfaces as sources for its citations. These chunks might not match the PDF text exactly because of formatting or OCR errors, so we needed a similarity-based approach that could find and highlight them even when they spanned multiple text spans.

We explored several solutions. Some libraries offered basic PDF viewing, but robust, customizable searching and highlighting were either missing or locked behind expensive paywalls. For example, the popular React PDF viewer had keyword search and highlight features, but they were part of a paid package. We needed something more flexible and open.

So, I decided to build our own solution using open-source tools. This post is a deep dive into how I built a custom PDF search and multi-span highlight feature using pdf.js and react-pdf in React. If you’re building a product that needs to highlight arbitrary text chunks (not just keywords!) in PDFs, or want to add seamless search and navigation, this guide is for you.

User Story: Why Both Features Matter

Imagine you’re a researcher or a student using our platform. You upload a research paper, and our system automatically extracts citations and their source text. As you read the PDF, you want to:

Search for any keyword and jump between all matches, just like in Adobe Reader or Chrome’s PDF viewer.
See exactly where each citation’s source appears—highlighted right in the document, even if the text is split across multiple spans or has minor differences.

Our goal was to make this experience as smooth as possible:

Accurate highlighting, even when the text is split across multiple spans or slightly mismatched
Easy navigation between search results and highlighted chunks

Why Not Just Use the Browser’s Search?

When you render a PDF in the browser using pdf.js, the text is split into many small spans, and sometimes even into individual characters. The browser’s built-in search only sees the pages that are currently rendered, tends to miss phrases that cross span boundaries, and gives you no control over result navigation or highlight styling. To enable searching and highlighting, you need to extract the text yourself, search through it, and then highlight the results.

The Stack

pdf.js: Mozilla’s PDF rendering engine, used by many web-based PDF viewers, including Firefox’s built-in one.
react-pdf: A React wrapper for pdf.js, making it easy to render PDFs in React.
React: For state management and UI.

Step 1: Rendering the PDF

First, use react-pdf to render the PDF. This gives you access to the Document and Page components.

import { Document, Page, pdfjs } from "react-pdf";
pdfjs.GlobalWorkerOptions.workerSrc = `//unpkg.com/pdfjs-dist@${pdfjs.version}/build/pdf.worker.min.mjs`;

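// documentRef (a useRef) and setNumPages (a useState setter) are assumed to live in your component
<Document
  file={file}
  onLoadSuccess={(pdf) => { documentRef.current = pdf; setNumPages(pdf.numPages); }}
>
  <Page pageNumber={pageNumber} />
</Document>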

Step 2: Extracting Text from Each Page

To search or highlight, you need the text content of each page. pdf.js provides a getTextContent() method for this.

const extractTextFromPage = async (pageNumber) => {
  const page = await documentRef.current.getPage(pageNumber);
  const textContent = await page.getTextContent();
  // Return the text content so callers can search it or cache it
  return textContent;
};

You can cache the extracted text for performance, especially for large PDFs.
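One way to cache is to store a promise per page, so repeated or overlapping searches trigger at most one extraction per page. The sketch below is a hypothetical helper, not part of pdf.js or react-pdf: `createPageTextCache` and the `PageText` shape are names invented for illustration, and `extract` stands in for a function like `extractTextFromPage`.

```typescript
// Minimal shape of what the extractor resolves to (an assumption for this sketch).
type PageText = { items: { str: string }[] };

// Hypothetical helper: wraps an extractor so each page is extracted at most once.
function createPageTextCache(
  extract: (pageNumber: number) => Promise<PageText>,
): (pageNumber: number) => Promise<PageText> {
  const cache = new Map<number, Promise<PageText>>();

  return (pageNumber: number): Promise<PageText> => {
    const cached = cache.get(pageNumber);
    if (cached) return cached;

    // Store the promise itself so concurrent callers share one extraction.
    // If extraction fails, evict the entry so a later call can retry.
    const pending = extract(pageNumber).catch((err) => {
      cache.delete(pageNumber);
      throw err;
    });
    cache.set(pageNumber, pending);
    return pending;
  };
}
```

Usage would look like `const getPageText = createPageTextCache(extractTextFromPage);` followed by `await getPageText(pageNum)` wherever the text is needed. Remember to create a fresh cache when the user opens a different file.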

Step 3: Searching Across All Pages

When the user enters a search query:

Loop through all pages.
For each page, extract the text (if not already cached).
Check if the page contains the search term.
If yes, find the matching text items and store their positions.

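// Runs inside an async function. Lowercase the query once so it compares against the lowercased page text.
const searchTerm = searchQuery.toLowerCase();
for (let pageNum = 1; pageNum <= numPages; pageNum++) {
  const textContent = pageTexts[pageNum] ?? (await extractTextFromPage(pageNum));
  const pageText = textContent.items.map((item) => item.str).join(" ").toLowerCase();
  if (searchTerm && pageText.includes(searchTerm)) {
    // Find and store matching items
  }
}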
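The "find and store matching items" step can be sketched as a pure function that joins the items the same way as above (with single spaces), records where each item starts, and maps every match back to the items it overlaps. The names `findMatchesInItems`, `TextItemLike`, and `ItemMatch` are hypothetical, and the sketch assumes each item exposes a `str` field as in the loop above.

```typescript
interface TextItemLike { str: string }
interface ItemMatch { start: number; end: number; itemIndices: number[] }

// Hypothetical helper: finds every occurrence of `query` in the joined page text
// and reports which text items each occurrence touches.
function findMatchesInItems(items: TextItemLike[], query: string): ItemMatch[] {
  const needle = query.toLowerCase();
  if (!needle) return [];

  // Build the lowercased page text and remember each item's character range.
  const ranges: { start: number; end: number }[] = [];
  let pageText = "";
  items.forEach((item, i) => {
    if (i > 0) pageText += " ";
    const lowered = item.str.toLowerCase();
    ranges.push({ start: pageText.length, end: pageText.length + lowered.length });
    pageText += lowered;
  });

  const matches: ItemMatch[] = [];
  let from = pageText.indexOf(needle);
  while (from !== -1) {
    const end = from + needle.length;
    const itemIndices: number[] = [];
    ranges.forEach((range, i) => {
      // A non-empty item is involved when its range overlaps the match range.
      if (range.end > range.start && range.start < end && range.end > from) {
        itemIndices.push(i);
      }
    });
    matches.push({ start: from, end, itemIndices });
    from = pageText.indexOf(needle, from + 1);
  }
  return matches;
}
```

Because matching runs on the joined text, a phrase that is split across two items is still found, and `itemIndices` tells you which items to highlight. Storing `{ pageNumber, ...match }` for each hit gives you a flat result list to navigate later.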

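Applying a Custom Text Renderer for Keyword Highlights

react-pdf’s Page component accepts a customTextRenderer prop. It receives each text item and returns the string to render for it, so passing the function below to Page wraps every match on the current result’s page in a mark tag.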

const customTextRenderer = useCallback(
  (textItem: any) => {
    if (!searchQuery || searchResults.length === 0) return textItem.str;

    // Only highlight on the page that holds the current result
    const currentResult = searchResults[currentSearchIndex];
    if (!currentResult || currentResult.pageNumber !== currentPage)
      return textItem.str;

    const searchTerm = searchQuery.toLowerCase();
    const text = textItem.str;

    if (text.toLowerCase().includes(searchTerm)) {
      // Escape regex metacharacters so queries like "c++" or "(a)" don't throw
      const escapedQuery = searchQuery.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
      const parts = text.split(new RegExp(`(${escapedQuery})`, "gi"));
      return parts
        .map((part: string) =>
          part.toLowerCase() === searchTerm
            ? `<mark style="background-color: #ffff00; color: #000;">${part}</mark>`
            : part,
        )
        .join("");
    }

    return text;
  },
  [searchQuery, searchResults, currentSearchIndex, currentPage],
);
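Since the renderer returns markup, it is worth treating both the query and the PDF text as untrusted input. The sketch below pulls the marking logic into a standalone function that escapes regex metacharacters in the query and HTML-sensitive characters in the text. `markMatches`, `escapeRegExp`, and `escapeHtml` are hypothetical helper names, and the sketch assumes the returned string is inserted as HTML, which the mark tag above implies.

```typescript
// Escape characters that have special meaning inside a regular expression.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Escape characters that would otherwise be parsed as HTML.
const escapeHtml = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// Hypothetical helper: wraps case-insensitive matches of `query` in <mark> tags.
function markMatches(text: string, query: string): string {
  if (!query) return escapeHtml(text);

  // The capturing group keeps the matched parts in the split result.
  const parts = text.split(new RegExp(`(${escapeRegExp(query)})`, "gi"));
  const needle = query.toLowerCase();

  return parts
    .map((part) =>
      part.toLowerCase() === needle
        ? `<mark>${escapeHtml(part)}</mark>`
        : escapeHtml(part),
    )
    .join("");
}
```

Inside the renderer you could then return `markMatches(textItem.str, searchQuery)` and style the mark element with CSS instead of an inline style.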

Step 4: Highlighting Arbitrary Text Chunks (with Similarity Mapping)

Here’s the tricky part: PDF text is split into many spans (sometimes a single word or even a part of a word). To highlight a chunk (like a citation source) that may not match exactly, you need to:

Wait for the text layer to render (using a timeout or checking for the presence of spans).
Split the highlight string into words.
Walk through the spans, matching sequences of words to the highlight string.
Use a similarity function (e.g. fuzzy matching) to allow for minor mismatches.
When a match is found, highlight all the involved spans.

Example:

const applyChunkHighlighting = useCallback(
  (pageNumber) => {
    // ...clear previous highlights...
    setTimeout(() => {
      // Wait for text layer to be ready
      // Build a list of words with their span indices
      // For each highlight:
      //   - Find matching word sequences
      //   - Use similarity to allow for minor mismatches
      //   - For each match, highlight all involved spans
    }, 100);
  },
  [highlightTexts],
);

Similarity function:

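const calculateTextSimilarity = (text1, text2) => {
  // Normalize case and drop empty tokens so stray whitespace can't match everything
  const toWords = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);
  const words1 = toWords(text1);
  const words2 = toWords(text2);
  if (words1.length === 0 || words2.length === 0) return 0;
  // Count words in text1 that contain, or are contained in, some word of text2
  const intersection = words1.filter((word) =>
    words2.some((w2) => w2.includes(word) || word.includes(w2)),
  );
  return intersection.length / Math.max(words1.length, words2.length);
};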

Highlighting a span:

const highlightSpan = (span, color) => {
  span.style.backgroundColor = color;
  span.classList.add("highlighted-by-fina");
};
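To make the matching step concrete, here is a minimal sketch of the word-walking idea as a pure function that works on span texts rather than DOM nodes. It is a simplified stand-in, not our production code: it slides a fixed-size window of words across the page and scores each window by the fraction of positions where the words agree, so it tolerates substituted words (typical of OCR errors) but not inserted or dropped ones. The names `findChunkSpans` and `normalizeWords` and the 0.7 threshold are assumptions made for illustration.

```typescript
interface SpanWord { word: string; spanIndex: number }

// Lowercase, strip common punctuation, and split into non-empty words.
const normalizeWords = (text: string): string[] =>
  text.toLowerCase().replace(/[.,;:!?"'()\[\]]/g, "").split(/\s+/).filter(Boolean);

// Hypothetical helper: returns the indices of the spans covered by the
// best-matching word window, or [] when no window reaches the threshold.
function findChunkSpans(spanTexts: string[], chunk: string, threshold = 0.7): number[] {
  const target = normalizeWords(chunk);
  if (target.length === 0) return [];

  // Flatten all spans into one word list, remembering each word's span.
  const words: SpanWord[] = [];
  spanTexts.forEach((text, spanIndex) => {
    normalizeWords(text).forEach((word) => words.push({ word, spanIndex }));
  });

  let bestScore = 0;
  let bestStart = -1;
  for (let start = 0; start + target.length <= words.length; start++) {
    let hits = 0;
    for (let k = 0; k < target.length; k++) {
      if (words[start + k].word === target[k]) hits++;
    }
    const score = hits / target.length;
    if (score > bestScore) {
      bestScore = score;
      bestStart = start;
    }
  }
  if (bestStart < 0 || bestScore < threshold) return [];

  // Collect the distinct spans that the winning window touches, in order.
  const spanIndices: number[] = [];
  for (let k = 0; k < target.length; k++) {
    const idx = words[bestStart + k].spanIndex;
    if (spanIndices.indexOf(idx) === -1) spanIndices.push(idx);
  }
  return spanIndices;
}
```

In the browser you would build `spanTexts` from the text layer spans (for example via `textContent`), call `findChunkSpans`, and pass each returned span to `highlightSpan`. Swapping the per-position comparison for a function like `calculateTextSimilarity` gives a looser match at the cost of more false positives.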

Step 5: Navigating Search Results

Keep track of all search results and allow the user to jump between them. When navigating, scroll to the relevant page and highlight the current result.
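The index bookkeeping for "next" and "previous" can be sketched as a small pure function that wraps around at both ends. `stepResultIndex` is a hypothetical name, and the sketch assumes the current position is stored as an index into the results array, with -1 meaning that nothing is selected yet.

```typescript
// Hypothetical helper: returns the next or previous result index, wrapping around.
// Returns -1 when there are no results.
function stepResultIndex(current: number, total: number, direction: 1 | -1): number {
  if (total <= 0) return -1;
  // Nothing selected yet: start from the first or the last result.
  if (current < 0) return direction === 1 ? 0 : total - 1;
  return (current + direction + total) % total;
}
```

After computing the new index, set the current page to that result's page number and update the state that the text renderer reads, so the highlight follows the selection.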

Step 6: Clearing Highlights

Before applying new highlights, clear all previous highlights by removing the background color and custom class from all highlighted spans.

const clearAllHighlights = () => {
  const allHighlightedSpans = document.querySelectorAll(".highlighted-by-fina");
  allHighlightedSpans.forEach((span) => {
    const htmlSpan = span as HTMLElement;
    htmlSpan.style.backgroundColor = "";
    htmlSpan.classList.remove("highlighted-by-fina");
  });
};

Challenges and Tips

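Text Layer Timing: The text layer may not be ready immediately after rendering. Use timeouts or check for the presence of spans. react-pdf’s Page component also exposes an onRenderTextLayerSuccess callback, which lets you apply highlights once the layer is ready instead of guessing a delay.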
Word Splitting: PDF text is often split unpredictably. Use a robust word matching algorithm and allow for fuzzy matches if needed.
Performance: For large PDFs, cache extracted text and debounce search input.

TL;DR

You need to work with the text layer (DOM spans) to apply highlights.
getTextContent() is great for searching and extracting, but not for mapping to DOM elements.
For robust, visual highlighting, always operate on the rendered spans.

How You Can Do It

Set up react-pdf and pdf.js in your React project.
Extract text from each page using getTextContent().
Implement a search function that loops through all pages and finds matches.
Highlight results by manipulating the text layer spans, handling multi-span and similarity-based matches.
Allow navigation between results and clear highlights as needed.

Conclusion

With a bit of work, you can build a PDF viewer in React that supports robust searching and multi-span, similarity-based highlighting—just like professional PDF readers, and even more flexible for custom use cases. The key is understanding how pdf.js structures text and how to manipulate the rendered spans for highlighting.

This is not a final solution. We are still looking for better approaches and improving the experience.

Building a Powerful PDF Search and Multi-Span Highlight Feature in React with pdf.js and react-pdf