Reduce OCR extraction time by ~3x using parallelism with the Nutrient .NET SDK


In this article, I’ll show you how to run OCR on multi-page PDF files efficiently using the Nutrient .NET SDK (formerly GdPicture.NET). This example uses parallel processing to extract the text of each page and save it to its own .txt file. It’s a clean and optimized approach, especially useful for batch jobs or when you need to process scanned PDFs quickly.
Let’s break it down step by step.
Step 1: Initialization and Setup
We start by defining our input file, output folder, and license key.
const string inputFile = @"C:\temp\input-file.pdf";
const string outputDir = @"C:\temp\ocr-results";
const string licenseKey = "";
We also make sure the output directory exists and register the SDK license:
Directory.CreateDirectory(outputDir);
LicenseManager lm = new LicenseManager();
lm.RegisterKEY(licenseKey);
Registering the license key activates the Nutrient SDK. Directory.CreateDirectory does nothing if the folder already exists, so it is safe to call on every run.
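Hardcoding the key is fine for a demo, but for a batch job you may prefer to keep it out of source control. Here is a minimal sketch, assuming a hypothetical environment variable named NUTRIENT_LICENSE_KEY (the name is my own choice for illustration, not something the SDK defines). When the variable is unset it falls back to an empty string, matching the empty key used above:

```csharp
using System;

// Hypothetical variable name, chosen for this sketch only; the SDK does not define it.
const string licenseKeyVariable = "NUTRIENT_LICENSE_KEY";

// Returns the key from the environment, or an empty string when the variable is unset.
string ReadLicenseKey() =>
    Environment.GetEnvironmentVariable(licenseKeyVariable) ?? string.Empty;

string licenseKey = ReadLicenseKey();
Console.WriteLine(licenseKey.Length == 0
    ? "No license key found in the environment."
    : "License key loaded from the environment.");
```

The value returned by ReadLicenseKey can then be passed to lm.RegisterKEY exactly as in the snippet above.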
Step 2: Load PDF and Prepare for Parallel Execution
We load the input PDF and check if it’s valid:
using GdPicturePDF pdf = new();
GdPictureStatus status = pdf.LoadFromFile(inputFile);
if (status != GdPictureStatus.OK)
{
    Console.WriteLine("Error loading PDF: " + status);
    return;
}
int pageCount = pdf.GetPageCount();
Once the PDF is successfully loaded, we display how many pages will be processed:
Console.WriteLine($"Processing {pageCount} pages in parallel...");
To parallelize the job, we use Partitioner.Create to split the pages into ranges and Parallel.ForEach to process those ranges simultaneously:
var pageNumbers = Partitioner.Create(1, pageCount + 1); // (inclusive, exclusive)
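To see what the partitioner actually hands to each worker, here is a small sketch that needs no SDK at all. It lists the (inclusive, exclusive) ranges produced for a given page count. The helper name GetRanges and the sample values (10 pages, range size 4) are mine, picked only for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Collects the (inclusive, exclusive) page ranges the partitioner would hand out.
List<Tuple<int, int>> GetRanges(int pageCount, int? rangeSize = null)
{
    OrderablePartitioner<Tuple<int, int>> partitioner = rangeSize is int size
        ? Partitioner.Create(1, pageCount + 1, size)   // fixed-size ranges
        : Partitioner.Create(1, pageCount + 1);        // range size chosen by the runtime
    return partitioner.GetDynamicPartitions().ToList();
}

// 10 pages with a fixed range size of 4, purely for illustration.
foreach (var range in GetRanges(10, 4))
{
    Console.WriteLine($"Pages {range.Item1} to {range.Item2 - 1}");
}
```

With 10 pages and a range size of 4, this prints pages 1 to 4, 5 to 8, and 9 to 10. The two-argument overload used in the article lets the runtime pick the range size, which is usually fine. The three-argument overload is handy if you want smaller ranges so that one slow page does not hold up a whole chunk.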
Step 3: Parallel OCR Processing
Now the real work begins. Each thread will:
Load the same input file into its own GdPicturePDF instance, so no SDK object is shared between threads
Select a specific page
Extract or render the page as an image
Run OCR with optimized speed settings
Save the extracted text to a file
Here’s what happens inside the loop:
Parallel.ForEach(pageNumbers, range =>
{
    for (int p = range.Item1; p < range.Item2; p++)
    {
        using GdPicturePDF localPdf = new();
        using GdPictureOCR ocr = new();
        using GdPictureImaging img = new();
        ocr.ResourcesFolder = @"C:\GdPicture.NET 14\Redist\OCR";
        ocr.Context = OCRContext.OCRContextDocument;
        ocr.OCRMode = OCRMode.FavorSpeed;
        var st = localPdf.LoadFromFile(inputFile);
        if (st != GdPictureStatus.OK)
        {
            // Error handling...
        }
        st = localPdf.SelectPage(p);
        // Additional error checks omitted here for readability...
Each page is either extracted directly as an image (if it’s already an image-based page) or rendered at 300 DPI (if it’s vector-based):
        int imgId = localPdf.IsPageImage()
            ? localPdf.ExtractPageImage(1)
            : localPdf.RenderPageToGdPictureImage(300, true);
The image is then handed to the built-in OCR engine with SetImage, and RunOCR returns a result ID. The text behind that ID is written to a .txt file named after the page number:
        ocr.SetImage(imgId);
        string resId = ocr.RunOCR();
        string text = ocr.GetOCRResultText(resId);
        string outPath = Path.Combine(outputDir, $"ocr-result-{p}.txt");
        File.WriteAllText(outPath, text);
        Console.WriteLine($"[Page {p}] Done.");
Errors are captured and stored in a thread-safe ConcurrentBag<string> so you can review any problems afterward.
Step 4: Final Reporting
Once the parallel loop completes, the program prints all error messages (if any):
if (errors.Count > 0)
{
    Console.WriteLine("\nSome errors occurred:");
    foreach (var err in errors)
        Console.WriteLine(err);
}
This makes it easy to spot failed pages or misconfigurations in the OCR process.
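One thing to keep in mind: ConcurrentBag makes no ordering guarantee, so the report can list pages out of order. Here is a minimal sketch of one way around that, assuming errors are stored as (page, message) tuples instead of plain strings. The simulated failure on every fourth page only stands in for a real OCR error and is not part of the sample, and the helper name ProcessPages is mine:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Runs a stand-in "OCR" over pages 1..pageCount and returns the errors sorted by page.
List<(int Page, string Message)> ProcessPages(int pageCount)
{
    var errors = new ConcurrentBag<(int Page, string Message)>();

    Parallel.ForEach(Partitioner.Create(1, pageCount + 1), range =>
    {
        for (int p = range.Item1; p < range.Item2; p++)
        {
            try
            {
                // Simulated failure; a real loop would call the OCR engine here.
                if (p % 4 == 0)
                    throw new InvalidOperationException("simulated OCR failure");
            }
            catch (Exception ex)
            {
                // Adding to the bag is thread-safe, so no lock is needed.
                errors.Add((p, ex.Message));
            }
        }
    });

    // Sort once, after the loop, so the report reads top to bottom.
    return errors.OrderBy(e => e.Page).ToList();
}

foreach (var (page, message) in ProcessPages(12))
{
    Console.WriteLine($"[Page {page}] {message}");
}
```

Keeping the page number as a separate field also makes it easy to retry just the failed pages later.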
Full Code
using System;
using System.IO;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Collections.Generic;
using GdPicture14;

class PdfOCRParallelOptimized
{
    public static void Main()
    {
        const string inputFile = @"C:\temp\input-file.pdf";
        const string outputDir = @"C:\temp\ocr-results";
        const string licenseKey = "";

        Directory.CreateDirectory(outputDir);
        LicenseManager lm = new LicenseManager();
        lm.RegisterKEY(licenseKey);

        using GdPicturePDF pdf = new();
        GdPictureStatus status = pdf.LoadFromFile(inputFile);
        if (status != GdPictureStatus.OK)
        {
            Console.WriteLine("Error loading PDF: " + status);
            return;
        }

        int pageCount = pdf.GetPageCount();
        Console.WriteLine($"Processing {pageCount} pages in parallel...");

        var pageNumbers = Partitioner.Create(1, pageCount + 1); // (inclusive, exclusive)
        var errors = new ConcurrentBag<string>();

        Parallel.ForEach(pageNumbers, range =>
        {
            for (int p = range.Item1; p < range.Item2; p++)
            {
                try
                {
                    // Every page gets its own SDK objects, so nothing is shared between threads.
                    using GdPicturePDF localPdf = new();
                    using GdPictureOCR ocr = new();
                    using GdPictureImaging img = new();
                    ocr.ResourcesFolder = @"C:\GdPicture.NET 14\Redist\OCR";
                    ocr.Context = OCRContext.OCRContextDocument;
                    ocr.OCRMode = OCRMode.FavorSpeed;

                    var st = localPdf.LoadFromFile(inputFile);
                    if (st != GdPictureStatus.OK)
                    {
                        errors.Add($"[Page {p}] Load error: {st}");
                        continue;
                    }

                    st = localPdf.SelectPage(p);
                    if (st != GdPictureStatus.OK)
                    {
                        errors.Add($"[Page {p}] SelectPage error: {st}");
                        continue;
                    }

                    // Use the embedded image for scanned pages, otherwise render at 300 DPI.
                    int imgId = localPdf.IsPageImage()
                        ? localPdf.ExtractPageImage(1)
                        : localPdf.RenderPageToGdPictureImage(300, true);
                    if (imgId == 0)
                    {
                        errors.Add($"[Page {p}] Render error: {localPdf.GetStat()}");
                        continue;
                    }

                    try
                    {
                        st = ocr.SetImage(imgId);
                        if (st != GdPictureStatus.OK)
                        {
                            errors.Add($"[Page {p}] SetImage error: {st}");
                            continue;
                        }

                        string resId = ocr.RunOCR();
                        st = ocr.GetStat();
                        if (st != GdPictureStatus.OK)
                        {
                            errors.Add($"[Page {p}] RunOCR error: {st}");
                            continue;
                        }

                        string text = ocr.GetOCRResultText(resId);
                        st = ocr.GetStat();
                        if (st != GdPictureStatus.OK)
                        {
                            errors.Add($"[Page {p}] GetText error: {st}");
                            continue;
                        }

                        string outPath = Path.Combine(outputDir, $"ocr-result-{p}.txt");
                        File.WriteAllText(outPath, text);
                        Console.WriteLine($"[Page {p}] Done.");
                    }
                    finally
                    {
                        // Image IDs returned by the SDK must be released explicitly,
                        // otherwise every page leaks its bitmap.
                        img.ReleaseGdPictureImage(imgId);
                    }
                }
                catch (Exception ex)
                {
                    errors.Add($"[Page {p}] Exception: {ex.Message}");
                }
            }
        });

        Console.WriteLine("All pages processed.");
        if (errors.Count > 0)
        {
            Console.WriteLine("\nSome errors occurred:");
            foreach (var err in errors)
                Console.WriteLine(err);
        }
    }
}
Summary
This code sample is a great fit when:
You need to extract raw text from large PDFs quickly
You want to run OCR in a headless or batch mode
You're dealing with image-based PDFs or scanned documents
Here’s what it demonstrates in practice:
Parallel processing of multi-page documents
Using GdPictureOCR with different OCR modes and contexts
Managing per-page processing in a scalable and isolated way
Exporting clean .txt results for further indexing or analysis
You can extend this example by adding support for different OCR languages, integrating text post-processing, or chaining this logic with redaction, search, or compliance checks.
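As one example of such an extension, here is a sketch that stitches the per-page files back into a single document for indexing. It assumes the ocr-result-{p}.txt naming used above. The helper name MergePageFiles, the merged.txt file name, and the form-feed page separator are my own choices. Sorting by the parsed page number matters because a plain alphabetical sort would put ocr-result-10.txt before ocr-result-2.txt:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Concatenates ocr-result-{p}.txt files from sourceDir into mergedPath, in page order.
// Returns the number of page files merged.
int MergePageFiles(string sourceDir, string mergedPath)
{
    var pattern = new Regex(@"^ocr-result-(\d+)\.txt$");

    var pages = Directory.EnumerateFiles(sourceDir, "ocr-result-*.txt")
        .Select(path => (FilePath: path, Match: pattern.Match(Path.GetFileName(path))))
        .Where(x => x.Match.Success)
        .OrderBy(x => int.Parse(x.Match.Groups[1].Value)) // numeric, not alphabetical
        .Select(x => x.FilePath)
        .ToList();

    using var writer = new StreamWriter(mergedPath);
    foreach (string path in pages)
    {
        writer.Write(File.ReadAllText(path));
        writer.Write('\f'); // form feed as a page separator (an arbitrary choice)
    }
    return pages.Count;
}

// Demo with throwaway files in a fresh temp folder (paths are illustrative only).
string demoDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(demoDir);
foreach (int p in new[] { 1, 2, 10 })
    File.WriteAllText(Path.Combine(demoDir, $"ocr-result-{p}.txt"), $"page {p}");

int merged = MergePageFiles(demoDir, Path.Combine(demoDir, "merged.txt"));
Console.WriteLine($"Merged {merged} pages.");
```

In a real run you would point MergePageFiles at the outputDir from the sample once the parallel loop has finished.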
If you want to integrate this in your project or need help tailoring it to your use case, feel free to reach out at fabio.derose@nutrient.io.
Cheers
Fabio
Written by Fabio De Rose