PDF Data Extraction with React.js: A Simple Guide to Building a PDF Text Editor

Working with PDFs in web applications can be challenging, especially when you want to extract and manipulate the content inside them. In this post, I’ll walk you through building a basic PDF text editor using React.js that lets you upload a PDF, extract its text and images, and edit the content directly in the browser.

Why Extract PDF Data in React?

PDFs are widely used for sharing documents, but they are not inherently easy to edit or extract data from on the web. By extracting PDF content into editable HTML, you can:

Edit text and images inline
Save or export the modified content
Build custom PDF viewers or editors

Tools and Libraries We’ll Use

React.js: For building the UI.
pdfjs-dist: Mozilla’s PDF.js library to parse and extract PDF content.
Jodit Editor: A rich text editor to edit extracted content.
jsPDF and html2canvas: To export edited content back to PDF.
Next.js dynamic import: To load Jodit Editor without server-side rendering.
Custom hooks and toast notifications: For user feedback.

Step 1: Setting Up PDF.js Worker

PDF.js requires a worker script to parse PDFs efficiently. We set this up in a React useEffect hook:

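import { useEffect } from "react";
import * as pdfjsLib from "pdfjs-dist";

// Runs once on mount. The worker version must match the installed pdfjs-dist version.
useEffect(() => {
  pdfjsLib.GlobalWorkerOptions.workerSrc =
    "https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.4.120/pdf.worker.min.js";
}, []);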
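PDF.js reports an error when the library and the worker come from different versions, so one option is to derive the worker URL from the installed version rather than hard-coding it. The sketch below is only an illustration: `buildWorkerSrc` is a hypothetical helper, and it assumes cdnjs publishes the version you installed under the same path layout as the URL above.

```javascript
// Hypothetical helper: builds a cdnjs worker URL for a given pdfjs-dist version.
// Assumes cdnjs hosts that version under the same path layout as 3.4.120.
function buildWorkerSrc(version) {
  if (!/^\d+\.\d+\.\d+$/.test(version)) {
    throw new Error(`Unexpected pdfjs-dist version: ${version}`);
  }
  return `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${version}/pdf.worker.min.js`;
}

// Inside the component, pdfjsLib.version reports the installed library version:
// pdfjsLib.GlobalWorkerOptions.workerSrc = buildWorkerSrc(pdfjsLib.version);
```

Other major versions may ship the worker under a different file name, so it is worth confirming that the generated URL actually resolves before relying on it.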

Step 2: Uploading and Validating PDF Files

We allow users to upload PDFs either by clicking or drag-and-drop. We validate the file type to ensure it’s a PDF:

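const handleFileChange = (e) => {
  const file = e.target.files?.[0]; // undefined when no file was selected
  // file.type is the browser's guess at the MIME type, not a check of the file's bytes.
  if (file && file.type === 'application/pdf') {
    setImageFile(file); // state holding the uploaded PDF
    setIsContinueClicked(false);
  } else {
    showToastError('Please upload a valid PDF file');
  }
};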
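The MIME type check above trusts what the browser reports, which is usually based on the file extension. If you want a slightly stronger check, you could also look for the `%PDF-` signature that PDF files start with. This is a minimal sketch, not part of the original component: `hasPdfHeader` is a hypothetical helper, and searching the first 1024 bytes (rather than only byte 0) is an assumption made to tolerate files with a little leading junk.

```javascript
// Hypothetical helper: returns true when the bytes contain the "%PDF-" signature
// near the start of the file. Takes a Uint8Array so it can be tested without a browser.
function hasPdfHeader(bytes, searchLimit = 1024) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  const end = Math.min(bytes.length, searchLimit) - signature.length;
  for (let start = 0; start <= end; start++) {
    if (signature.every((byte, offset) => bytes[start + offset] === byte)) {
      return true;
    }
  }
  return false;
}

// In the browser, read only the first bytes of the File before calling it:
// const head = new Uint8Array(await file.slice(0, 1024).arrayBuffer());
// if (!hasPdfHeader(head)) showToastError('Please upload a valid PDF file');
```

Reading a small slice keeps the check cheap even for large uploads.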

Step 3: Extracting Text and Images from PDF

Once a PDF is uploaded, we process it page by page:

Use pdfjsLib.getDocument to load the PDF.
For each page, extract text content with page.getTextContent().
Extract graphic operators (including images) to reconstruct the page layout.
Generate HTML that positions text and images absolutely to mimic the PDF layout.
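To make the graphic-operator step more concrete, here is a rough sketch of how image operators can be picked out of a page's operator list. It is not the `extractGraphicOperators` helper used in this post: `findImageObjectIds` is a hypothetical name, and the assumption that the first argument of a `paintImageXObject` operator is the image's object id should be checked against the PDF.js version you use.

```javascript
// Hypothetical helper: collects the object ids of images painted on a page.
// `operatorList` has the shape returned by page.getOperatorList():
// parallel arrays fnArray (operator codes) and argsArray (their arguments).
// `ops` is the operator-code table, pdfjsLib.OPS in a real app.
function findImageObjectIds(operatorList, ops) {
  const ids = [];
  operatorList.fnArray.forEach((fn, index) => {
    if (fn === ops.paintImageXObject) {
      // Assumption: the first argument is the image's object id.
      ids.push(operatorList.argsArray[index][0]);
    }
  });
  return ids;
}

// With PDF.js loaded:
// const operatorList = await page.getOperatorList();
// const imageIds = findImageObjectIds(operatorList, pdfjsLib.OPS);
```

Passing the operator table in as a parameter keeps the function free of a direct PDF.js import, which makes it easy to exercise with hand-written data.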

Here’s a simplified snippet of extracting text and images:

const extractContent = async (url, base64Images) => {
  const pdf = await pdfjsLib.getDocument({ url }).promise;
  setNumPages(pdf.numPages);
  const fullHtmlContent = [];

  // PDF.js page numbers start at 1.
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const graphicOperators = await extractGraphicOperators(page);
    const textContent = await page.getTextContent();
    // page.view is [x1, y1, x2, y2]; index 3 is the page height when y1 is 0.
    const pageHtml = generatePageHtml(graphicOperators, textContent, base64Images, page.view[3]);
    fullHtmlContent.push(pageHtml);
  }

  setHtmlContent(fullHtmlContent);
};
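The page height is passed to `generatePageHtml` because PDF coordinates start at the bottom-left corner, while CSS positions start at the top-left. The sketch below shows one way the text part of that conversion could look. It is a simplified stand-in, not the actual `generatePageHtml`: `textItemsToHtml` and `escapeHtml` are hypothetical names, and it assumes unrotated text and treats the baseline position as the bottom of the line.

```javascript
// Escape text so characters like "<" from the PDF cannot break the generated markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical stand-in for the text part of generatePageHtml.
// Each item mimics an entry of (await page.getTextContent()).items:
// `str` is the text and `transform` is [a, b, c, d, x, y] in PDF units.
function textItemsToHtml(items, pageHeight) {
  const spans = items
    .filter((item) => item.str.trim() !== "")
    .map((item) => {
      const [, , , scaleY, x, y] = item.transform;
      const fontSize = Math.abs(scaleY); // assumption: unrotated text
      // PDF y grows upward from the bottom edge; CSS top grows downward.
      const top = pageHeight - y - fontSize;
      return `<span style="position:absolute;left:${x}px;top:${top}px;font-size:${fontSize}px;white-space:pre">${escapeHtml(item.str)}</span>`;
    });
  return `<div style="position:relative;height:${pageHeight}px">${spans.join("")}</div>`;
}
```

For example, a 12-unit item at `x = 72, y = 700` on a 792-unit-tall page ends up at `left:72px; top:80px`. Real documents also need font mapping and rotation handling, which this sketch leaves out.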

Step 4: Editing Extracted Content with Jodit Editor

We use the Jodit rich text editor to allow users to edit the extracted HTML content. The editor is dynamically imported to avoid server-side rendering issues in Next.js:

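import dynamic from "next/dynamic";

// Load Jodit only in the browser; it depends on window and document.
const JoditEditor = dynamic(() => import("jodit-react"), { ssr: false });

<JoditEditor
  ref={editorRef}
  value={htmlContent[currentPage]}
  config={{
    readonly: false,
    toolbar: true,
    height: height,
    width: width,
    buttons: [
      "undo", "redo", "|", "bold", "italic", "underline", "|",
      "ul", "ol", "link", "image", "source"
    ],
  }}
  // Writes into the array in place, so React does not re-render on each change.
  onChange={(content) => { htmlContent[currentPage] = content; }}
/>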
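The `onChange` handler writes straight into the `htmlContent` array. If you would rather keep edits in React state without mutating it, and keep page navigation within bounds, two small helpers are enough. This is a sketch under assumptions: `replacePage` and `clampPageIndex` are hypothetical names, `setCurrentPage` is an assumed state setter, and updating on blur instead of on every change is a suggestion for avoiding re-renders while typing.

```javascript
// Hypothetical helper: returns a new pages array with one page replaced,
// leaving the original array untouched so React sees a new reference.
function replacePage(pages, pageIndex, html) {
  return pages.map((page, index) => (index === pageIndex ? html : page));
}

// Hypothetical helper: keeps a requested page index inside [0, pageCount - 1].
function clampPageIndex(requested, pageCount) {
  if (pageCount <= 0) return 0;
  return Math.min(Math.max(requested, 0), pageCount - 1);
}

// Possible wiring inside the component (setCurrentPage is an assumed setter):
// onBlur={(content) => setHtmlContent((prev) => replacePage(prev, currentPage, content))}
// onClick={() => setCurrentPage((p) => clampPageIndex(p + 1, htmlContent.length))}
```

Because `replacePage` never touches its input, the previous pages array stays available if you later want to add an undo feature.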

Step 5: Navigating Pages and Downloading Edited PDF

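Users can navigate between pages using Previous and Next buttons. After editing, the content can be exported back to PDF using jsPDF and html2canvas. Each page is rendered to a JPEG image and placed on an A4 page, so the text in the exported file is no longer selectable: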

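import { jsPDF } from 'jspdf';
import html2canvas from 'html2canvas';

const downloadPdf = async () => {
  const doc = new jsPDF(); // defaults to A4 portrait, units in mm
  for (let i = 0; i < htmlContent.length; i++) {
    const tempDiv = document.createElement('div');
    tempDiv.innerHTML = htmlContent[i];
    document.body.appendChild(tempDiv);

    try {
      // Render the page's HTML to a canvas at 2x for sharper output.
      const canvas = await html2canvas(tempDiv, { scale: 2 });
      const imgData = canvas.toDataURL('image/jpeg');

      if (i > 0) doc.addPage();
      // Stretch the image over the full A4 page (210 x 297 mm).
      doc.addImage(imgData, 'JPEG', 0, 0, 210, 297);
    } finally {
      // Remove the temporary node even if rendering fails.
      document.body.removeChild(tempDiv);
    }
  }
  doc.save('edited-document.pdf');
};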
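Placing every image at 210 x 297 mm stretches pages whose rendered size does not match the A4 aspect ratio. If that becomes a problem, you could compute the placement from the canvas size instead. The helper below is a hypothetical sketch (`fitToPage` is not part of jsPDF or html2canvas) that scales the image to fit inside the page and centers it.

```javascript
// Hypothetical helper: scales a canvas of the given pixel size to fit inside
// a page (in mm) without distortion, and centers it on that page.
function fitToPage(canvasWidth, canvasHeight, pageWidth = 210, pageHeight = 297) {
  const scale = Math.min(pageWidth / canvasWidth, pageHeight / canvasHeight);
  const width = canvasWidth * scale;
  const height = canvasHeight * scale;
  return {
    x: (pageWidth - width) / 2,
    y: (pageHeight - height) / 2,
    width,
    height,
  };
}

// In downloadPdf:
// const { x, y, width, height } = fitToPage(canvas.width, canvas.height);
// doc.addImage(imgData, 'JPEG', x, y, width, height);
```

A square canvas, for instance, would be drawn 210 mm wide and centered vertically, with blank margins above and below.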

Full Example Code

You can find the full React component code in the GitHub repository linked at the end of this post.

Conclusion

Extracting and editing PDF content in React is achievable with the right tools. By combining PDF.js for extraction, Jodit Editor for editing, and jsPDF/html2canvas for exporting, you can build powerful PDF editors tailored to your needs.

Feel free to customize the editor toolbar, improve image handling, or add more features like annotations and highlights.

Try It Yourself!

If you want to experiment, start by setting up a React app and installing the dependencies:

npm install pdfjs-dist jodit-react jspdf html2canvas

Then, use the code snippets above to build your own PDF text editor.

If you have questions or want to share your project, drop a comment below!

Happy coding! 🚀

P.S. If you want the full source code or help integrating this into your app, just ask! You can also check out my GitHub repository at https://github.com/VrajVyas11/PDF-Manipulator. The component from this post is at https://github.com/VrajVyas11/PDF-Manipulator/blob/main/components/Core/PDFEditor/PDFEditorSimpleText.jsx
