Reduce OCR extraction time by ~3x using parallelism with the Nutrient .NET SDK

Fabio De Rose

In this article, I’ll show you how to run OCR on multi-page PDF files efficiently using the Nutrient .NET SDK (formerly GdPicture.NET). The example processes the pages in parallel, extracts the text of each page, and saves it to its own .txt file. This approach is especially useful for batch jobs or when you need to process scanned PDFs quickly.

Let’s break it down step by step.


Step 1: Initialization and Setup

We start by defining our input file, output folder, and license key.

const string inputFile = @"C:\temp\input-file.pdf";
const string outputDir = @"C:\temp\ocr-results";
const string licenseKey = "";

We also make sure the output directory exists and register the SDK license:

Directory.CreateDirectory(outputDir);

LicenseManager lm = new LicenseManager();
lm.RegisterKEY(licenseKey);

Registering the license key is required to activate the Nutrient SDK, and creating the output directory up front ensures the per-page text files can be written later.


Step 2: Load PDF and Prepare for Parallel Execution

We load the input PDF and check if it’s valid:

using GdPicturePDF pdf = new();
GdPictureStatus status = pdf.LoadFromFile(inputFile);

if (status != GdPictureStatus.OK)
{
    Console.WriteLine("Error loading PDF: " + status);
    return;
}
int pageCount = pdf.GetPageCount();

Once the PDF is successfully loaded, we display how many pages will be processed:

Console.WriteLine($"Processing {pageCount} pages in parallel...");

To parallelize the job, we use Partitioner.Create and Parallel.ForEach, which help split the work into page ranges that can be handled simultaneously:

var pageNumbers = Partitioner.Create(1, pageCount + 1); // (inclusive, exclusive)
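To make the range splitting more concrete, here is a minimal sketch that only uses the .NET base class library (no SDK calls). The class and method names (PageRangeDemo, GetRanges) are hypothetical and not part of the original sample. It collects the ranges that Parallel.ForEach hands out so you can inspect them. How many ranges you get, and how large they are, is decided by the runtime and depends on the machine. Note that Partitioner.Create throws an ArgumentOutOfRangeException when the exclusive end is not greater than the start, so the sketch guards against a zero-page document.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class PageRangeDemo
{
    // Returns the [From, To) page ranges that Parallel.ForEach receives
    // for a document with pageCount pages, sorted by start page.
    public static List<(int From, int To)> GetRanges(int pageCount)
    {
        // Partitioner.Create(1, 1) would throw, so handle empty documents first.
        if (pageCount < 1)
            return new List<(int From, int To)>();

        var ranges = new ConcurrentBag<(int From, int To)>();
        var partitioner = Partitioner.Create(1, pageCount + 1); // (inclusive, exclusive)

        // Each invocation of the body receives one Tuple<int, int> range.
        Parallel.ForEach(partitioner, range => ranges.Add((range.Item1, range.Item2)));

        // The bag is unordered, so sort before returning.
        return ranges.OrderBy(r => r.From).ToList();
    }
}
```

Calling PageRangeDemo.GetRanges(pageCount) returns contiguous ranges that start at page 1 and end at pageCount + 1 (exclusive), so every page is visited exactly once.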

Step 3: Parallel OCR Processing

Now the real work begins. For each page, a worker thread will:

  • Load the input file into its own GdPicturePDF instance, so no SDK object is shared between threads

  • Select a specific page

  • Extract or render the page as an image

  • Run OCR with optimized speed settings

  • Save the extracted text to a file

Here’s what happens inside the loop:

Parallel.ForEach(pageNumbers, range =>
{
    for (int p = range.Item1; p < range.Item2; p++)
    {
        using GdPicturePDF localPdf = new();
        using GdPictureOCR ocr = new();
        using GdPictureImaging img = new();

        ocr.ResourcesFolder = @"C:\GdPicture.NET 14\Redist\OCR";
        ocr.Context = OCRContext.OCRContextDocument;
        ocr.OCRMode = OCRMode.FavorSpeed;

        var st = localPdf.LoadFromFile(inputFile);
        if (st != GdPictureStatus.OK)
        {
            // Error handling...
        }

        st = localPdf.SelectPage(p);
        // Additional error checks omitted here for readability...

Each page is either extracted directly as an image (if it’s already an image-based page) or rendered at 300 DPI (if it’s vector-based):

int imgId = localPdf.IsPageImage()
    ? localPdf.ExtractPageImage(1)
    : localPdf.RenderPageToGdPictureImage(300, true);

OCR is then performed using the built-in OCR engine. The resulting text is written to a .txt file named after the page number:

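// Status checks omitted here for readability (see the full code).
ocr.SetImage(imgId);
string resId = ocr.RunOCR();
string text = ocr.GetOCRResultText(resId);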
string outPath = Path.Combine(outputDir, $"ocr-result-{p}.txt");
File.WriteAllText(outPath, text);
Console.WriteLine($"[Page {p}] Done.");

Errors are captured in a thread-safe ConcurrentBag<string> named errors, which is declared before the loop (see the full code), so you can review any problems afterward.
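The error-collection pattern can be tried in isolation with the following minimal sketch. It uses only the .NET base class library. The processPage delegate is a hypothetical stand-in for the per-page OCR work, and the class and method names are assumptions made for illustration. Unlike the sample in this article, the sketch stores the page number next to each message, because ConcurrentBag does not guarantee any ordering and sorting by page makes the report easier to read.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ErrorCollectionSketch
{
    // Runs processPage for pages 1..pageCount in parallel and returns one
    // message per failed page, ordered by page number.
    public static List<string> ProcessAll(int pageCount, Action<int> processPage)
    {
        var errors = new ConcurrentBag<(int Page, string Message)>();

        Parallel.ForEach(Partitioner.Create(1, pageCount + 1), range =>
        {
            for (int p = range.Item1; p < range.Item2; p++)
            {
                try
                {
                    processPage(p);
                }
                catch (Exception ex)
                {
                    // A failure on one page must not stop the other pages.
                    errors.Add((p, $"[Page {p}] Exception: {ex.Message}"));
                }
            }
        });

        // Sort by page number, since the bag itself is unordered.
        return errors.OrderBy(e => e.Page).Select(e => e.Message).ToList();
    }
}
```

For example, ProcessAll(12, p => { if (p % 5 == 0) throw new InvalidOperationException("boom"); }) returns the messages for pages 5 and 10, in that order, while the remaining pages complete normally.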


Step 4: Final Reporting

Once the parallel loop completes, the program prints all error messages (if any):

if (errors.Count > 0)
{
    Console.WriteLine("\nSome errors occurred:");
    foreach (var err in errors)
        Console.WriteLine(err);
}

This is helpful to quickly debug failed pages or misconfigurations in the OCR process.
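For unattended batch runs, it can also help to turn the collected errors into a process exit code so that a scheduler or calling script can detect failures. The following is a minimal sketch under that assumption. The ErrorReportSketch class and its Report method are hypothetical and not part of the original sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class ErrorReportSketch
{
    // Writes the collected messages to output and returns an exit code:
    // 0 when every page succeeded, 1 when at least one page failed.
    public static int Report(IEnumerable<string> errors, TextWriter output)
    {
        var list = errors.ToList();
        if (list.Count == 0)
        {
            output.WriteLine("All pages processed without errors.");
            return 0;
        }

        output.WriteLine($"{list.Count} page(s) failed:");
        foreach (var err in list)
            output.WriteLine(err);
        return 1;
    }
}
```

Because Main returns void in this sample, one way to use it would be Environment.ExitCode = ErrorReportSketch.Report(errors, Console.Error); at the end of the program.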


Full Code


using System;
using System.IO;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Collections.Generic;
using GdPicture14;

class PdfOCRParallelOptimized
{
    public static void Main()
    {
        const string inputFile = @"C:\temp\input-file.pdf";
        const string outputDir = @"C:\temp\ocr-results";
        const string licenseKey = "";

        Directory.CreateDirectory(outputDir);

        LicenseManager lm = new LicenseManager();
        lm.RegisterKEY(licenseKey);

        using GdPicturePDF pdf = new();
        GdPictureStatus status = pdf.LoadFromFile(inputFile);

        if (status != GdPictureStatus.OK)
        {
            Console.WriteLine("Error loading PDF: " + status);
            return;
        }
        int pageCount = pdf.GetPageCount();

        Console.WriteLine($"Processing {pageCount} pages in parallel...");

        var pageNumbers = Partitioner.Create(1, pageCount + 1); // (inclusive, exclusive)
        var errors = new ConcurrentBag<string>();

        Parallel.ForEach(pageNumbers, range =>
        {
            for (int p = range.Item1; p < range.Item2; p++)
            {
                try
                {
                    using GdPicturePDF localPdf = new();
                    using GdPictureOCR ocr = new();
                    using GdPictureImaging img = new();

                    ocr.ResourcesFolder = @"C:\GdPicture.NET 14\Redist\OCR";
                    ocr.Context = OCRContext.OCRContextDocument;
                    ocr.OCRMode = OCRMode.FavorSpeed;
                    var st = localPdf.LoadFromFile(inputFile);
                    if (st != GdPictureStatus.OK)
                    {
                        errors.Add($"[Page {p}] Load error: {st}");
                        continue;
                    }
                    st = localPdf.SelectPage(p);
                    if (st != GdPictureStatus.OK)
                    {
                        errors.Add($"[Page {p}] SelectPage error: {st}");
                        continue;
                    }
                    int imgId = localPdf.IsPageImage()
                        ? localPdf.ExtractPageImage(1)
                        : localPdf.RenderPageToGdPictureImage(300, true);
                    if (imgId == 0)
                    {
                        errors.Add($"[Page {p}] Render error.");
                        continue;
                    }
                    try
                    {
                        st = ocr.SetImage(imgId);
                        if (st != GdPictureStatus.OK)
                        {
                            errors.Add($"[Page {p}] SetImage error: {st}");
                            continue;
                        }
                        string resId = ocr.RunOCR();
                        st = ocr.GetStat();
                        if (st != GdPictureStatus.OK)
                        {
                            errors.Add($"[Page {p}] RunOCR error: {st}");
                            continue;
                        }
                        string text = ocr.GetOCRResultText(resId);
                        st = ocr.GetStat();
                        if (st != GdPictureStatus.OK)
                        {
                            errors.Add($"[Page {p}] GetText error: {st}");
                            continue;
                        }
                        string outPath = Path.Combine(outputDir, $"ocr-result-{p}.txt");
                        File.WriteAllText(outPath, text);
                        Console.WriteLine($"[Page {p}] Done.");
                    }
                    finally
                    {
                        // Release the page image so its memory does not accumulate across pages.
                        img.ReleaseGdPictureImage(imgId);
                    }
                }
                catch (Exception ex)
                {
                    errors.Add($"[Page {p}] Exception: {ex.Message}");
                }
            }
        });

        Console.WriteLine("All pages processed.");

        if (errors.Count > 0)
        {
            Console.WriteLine("\nSome errors occurred:");
            foreach (var err in errors)
                Console.WriteLine(err);
        }
    }
}

Summary

This code sample is a great fit when:

  • You need to extract raw text from large PDFs quickly

  • You want to run OCR in headless or batch mode

  • You're dealing with image-based PDFs or scanned documents

Here’s what it demonstrates in practice:

  • Parallel processing of multi-page documents

  • Configuring GdPictureOCR with a speed-oriented mode (OCRMode.FavorSpeed) and a document context (OCRContext.OCRContextDocument)

  • Managing per-page processing in a scalable and isolated way

  • Exporting clean .txt results for further indexing or analysis

You can extend this example by adding support for different OCR languages, integrating text post-processing, or chaining this logic with redaction, search, or compliance checks.
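Before extending the sample, it is worth checking how much parallelism actually gains on your own documents and hardware, since the speedup depends on page content, core count, and available memory. The sketch below is one way to measure it, using only the .NET base class library. The processPage delegate is a hypothetical stand-in for the per-page OCR work, and the SpeedupSketch class is an assumption made for illustration. It also shows how ParallelOptions.MaxDegreeOfParallelism could cap the number of pages processed at once, which may be useful when rendering at 300 DPI puts pressure on memory.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class SpeedupSketch
{
    // Times processPage over pages 1..pageCount twice: once in a plain loop,
    // and once with Parallel.ForEach limited to maxDegreeOfParallelism
    // concurrent operations (must be -1 for "no limit" or greater than 0).
    public static (TimeSpan SequentialTime, TimeSpan ParallelTime) Measure(
        int pageCount, Action<int> processPage, int maxDegreeOfParallelism)
    {
        var sw = Stopwatch.StartNew();
        for (int p = 1; p <= pageCount; p++)
            processPage(p);
        TimeSpan sequential = sw.Elapsed;

        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };
        sw.Restart();
        Parallel.ForEach(Partitioner.Create(1, pageCount + 1), options, range =>
        {
            for (int p = range.Item1; p < range.Item2; p++)
                processPage(p);
        });
        TimeSpan parallel = sw.Elapsed;

        return (sequential, parallel);
    }
}
```

Dividing SequentialTime by ParallelTime gives the speedup factor for that run. Repeating the measurement with a few values of maxDegreeOfParallelism (for example, from 2 up to Environment.ProcessorCount) shows where the gains level off.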

If you want to integrate this in your project or need help tailoring it to your use case, feel free to reach out at fabio.derose@nutrient.io.

Cheers
Fabio
