Redacting Sensitive Data with Regex Using the Nutrient .NET SDK

Fabio De Rose

In this article, I will walk you through a practical example of using the Nutrient .NET SDK (formerly GdPicture.NET) to scan a folder full of documents, perform OCR when needed, and redact sensitive 8-digit account numbers found in the documents. The code snippet is written in C# and demonstrates how to combine file processing, OCR, regex-based text search, and redaction in a simple and effective workflow.


Step 1: Setup and Initialization

First, we register the license key for the Nutrient SDK:

LicenseManager licenseManager = new LicenseManager();
licenseManager.RegisterKEY("");

This is required to unlock the SDK's features.

Next, we declare paths and prepare a regular expression pattern to detect account numbers, defined as exactly 8 consecutive digits:

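var pathFiles = "";
var pathOCR = "C:\\GdPicture.NET 14\\Redist\\OCR";
var regexAccNumber = @"\b\d{8}\b";
bool isValidPath = false;                     // set to true once a valid folder is entered
string[] filePaths = Array.Empty<string>();   // filled in Step 2
string extension; GdPictureStatus st;         // assigned inside the processing loop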

The OCR path points to the folder where the language data for OCR is stored. The regex will be used later to find all matching account numbers in the documents.

If you do not have the OCR dictionaries on your machine, you can download them here.
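Before running the pattern against real documents, it can help to check what it does and does not match on plain strings. The following is a minimal sketch that uses only System.Text.RegularExpressions; the sample strings and the AccountNumberPattern helper are made up for illustration and are not part of the SDK.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative samples (not taken from real documents).
string[] samples =
{
    "Account: 12345678.",   // exactly 8 digits between non-word characters: match
    "Ref 123456789",        // 9 digits: no match
    "IBAN FR7612345678",    // digits glued to letters and other digits: no match
    "Issued on 20240131",   // a compact date is also 8 digits: match
};

foreach (string sample in samples)
{
    string[] found = AccountNumberPattern.FindAll(sample);
    Console.WriteLine($"{sample} -> [{string.Join(", ", found)}]");
}

// Hypothetical helper wrapping the pattern used in this article.
static class AccountNumberPattern
{
    // Same pattern as regexAccNumber.
    public const string Pattern = @"\b\d{8}\b";

    // Returns every substring of the text that matches the pattern.
    public static string[] FindAll(string text) =>
        Regex.Matches(text, Pattern).Cast<Match>().Select(m => m.Value).ToArray();
}
```

The last sample shows that the pattern alone cannot tell an account number from any other 8-digit value such as a compact date, so documents containing both may need a stricter pattern or a review step.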


Step 2: User Input and File Collection

The program asks the user to enter the path of the folder containing the documents to process:

do
{
    Console.WriteLine("Enter the path of the folder with your documents (type 'exit' to leave):");
    pathFiles = Console.ReadLine() ?? "exit";   // ReadLine returns null when input ends
    if (pathFiles.Equals("exit", StringComparison.OrdinalIgnoreCase))
    {
        Console.WriteLine("Process terminated.");
        return;
    }
    if (Directory.Exists(pathFiles))
    {
        isValidPath = true;
        filePaths = Directory.GetFiles(pathFiles, "*", SearchOption.AllDirectories);
    }
    else
    {
        Console.WriteLine("Path invalid, try again.");
    }
} while (!isValidPath);

This loop ensures the user enters a valid folder path or exits the process. Once a valid folder is given, all files (recursively) are collected for processing.
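Because the "*" search pattern returns every file in the tree, files that the converter cannot load are only discovered, reported, and skipped later in the loop. As an optional variation, the list can be filtered up front. This is a minimal sketch, assuming a hypothetical allow-list of extensions; the DocumentCollector helper is not part of the SDK, and the real set of supported formats should be taken from the SDK documentation.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Demo: build a throwaway folder tree so the sketch runs anywhere.
string root = Path.Combine(Path.GetTempPath(), "redaction-demo-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(Path.Combine(root, "sub"));
File.WriteAllText(Path.Combine(root, "a.pdf"), "");
File.WriteAllText(Path.Combine(root, "sub", "b.TIF"), "");
File.WriteAllText(Path.Combine(root, "notes.txt"), "");

// notes.txt is filtered out; a.pdf and b.TIF are kept.
foreach (string path in DocumentCollector.Collect(root))
    Console.WriteLine(Path.GetFileName(path));

Directory.Delete(root, recursive: true);

static class DocumentCollector
{
    // Hypothetical allow-list, for illustration only.
    private static readonly HashSet<string> Allowed =
        new(StringComparer.OrdinalIgnoreCase) { ".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg" };

    // Recursively collects the files whose extension is in the allow-list.
    public static List<string> Collect(string folder) =>
        Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories)
                 .Where(p => Allowed.Contains(Path.GetExtension(p)))
                 .OrderBy(p => p, StringComparer.OrdinalIgnoreCase)
                 .ToList();
}
```

The case-insensitive comparer means b.TIF is accepted even though the allow-list entry is lowercase.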


Step 3: Processing Each Document

For each file, we load the document with the SDK's GdPicturePDF class. PDFs are loaded directly; images and other document formats are converted to PDF first:

foreach (string filePath in filePaths)
{
    Console.WriteLine($"------- Processing file : {Path.GetFileName(filePath)} -------");
    extension = Path.GetExtension(filePath);
    bool wasAnImage = false;

    using (GdPicturePDF pdf = new GdPicturePDF())
    using (MemoryStream str = new MemoryStream())
    {
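        // Path.GetExtension keeps the original casing, so compare case-insensitively.
        if (!extension.Equals(".pdf", StringComparison.OrdinalIgnoreCase))
        {
            // Dispose the converter as soon as the conversion is done.
            using GdPictureDocumentConverter conv = new();
            st = conv.LoadFromFile(filePath);
            if (st != GdPictureStatus.OK)
            {
                Console.WriteLine($"Error while loading the file: {st}");
                continue;
            }
            st = conv.SaveAsPDF(str);
            if (st != GdPictureStatus.OK)
            {
                Console.WriteLine($"Error while converting the file to PDF: {st}");
                continue;
            }
            wasAnImage = true;
        }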

        if (wasAnImage)
            pdf.LoadFromStream(str);
        else
            pdf.LoadFromFile(filePath);

If the file is not a PDF (for example, an image), it is converted to PDF in memory before loading. This unifies the processing to a PDF workflow.
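Since Path.GetExtension preserves the casing of the file name, the PDF check is best done case-insensitively; otherwise a file named REPORT.PDF would be routed through the converter. A minimal sketch of such a check follows. IsPdf is a hypothetical helper name, and testing the extension says nothing about the actual file content.

```csharp
using System;
using System.IO;

Console.WriteLine(FileKind.IsPdf("report.pdf"));   // True
Console.WriteLine(FileKind.IsPdf("REPORT.PDF"));   // True
Console.WriteLine(FileKind.IsPdf("scan.png"));     // False
Console.WriteLine(FileKind.IsPdf("README"));       // False (no extension)

// Hypothetical helper, not part of the SDK.
static class FileKind
{
    // True when the file name ends in ".pdf", ignoring case.
    public static bool IsPdf(string filePath) =>
        string.Equals(Path.GetExtension(filePath), ".pdf", StringComparison.OrdinalIgnoreCase);
}
```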


Step 4: Preparing the PDF for OCR and Redaction

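The PDF origin and measurement unit are set so that the coordinates returned by the text search and the coordinates passed to the redaction call use the same reference (top-left origin, inches):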

pdf.SetOrigin(PdfOrigin.PdfOriginTopLeft);
pdf.SetMeasurementUnit(PdfMeasurementUnit.PdfMeasurementUnitInch);

We then check with GetStat() that the last operation succeeded and get the number of pages:

if (pdf.GetStat() == GdPictureStatus.OK)
{
    var totalPages = pdf.GetPageCount();

Step 5: OCR Pages Needing Text Recognition

For each page, the code decides whether OCR is needed by checking if the page is an image or has no text:

for (int i = 1; i <= totalPages; i++)
{
    pdf.SelectPage(i);

    if (pdf.IsPageImage() || !pdf.PageHasText())
    {
        pdf.OcrPage("fra+deu", pathOCR, "", 300);
    }

OCR runs with the French and German language data ("fra+deu") from the OCR data folder defined earlier, at 300 dpi.


Step 6: Searching and Redacting Account Numbers

Using the regex for 8-digit numbers, the code searches the text of the current page one occurrence at a time and collects the bounding boxes of each match:

IEnumerable<GdPictureRectangleF> boundingRects;
int occurrenceAccNumber = 1;

// Ask for the 1st, 2nd, 3rd... occurrence until the search finds nothing more.
while (pdf.SearchTextRegex(regexAccNumber, occurrenceAccNumber, false, out boundingRects))
{
    foreach (var rect in boundingRects)
    {
        Console.WriteLine($"Account number found - Drawing a redaction at (Inches): X={rect.Left}, Y={rect.Top}, Width={rect.Width}, Height={rect.Height}");
        pdf.AddRedactionRegion((float)rect.Left, (float)rect.Top, (float)rect.Width, (float)rect.Height);
        if (pdf.GetStat() != GdPictureStatus.OK)
            Console.WriteLine("Error while drawing the redaction: " + pdf.GetStat());
    }
    occurrenceAccNumber++;
}

For each occurrence, a redaction rectangle is added to cover the sensitive information.


Step 7: Applying Redactions and Saving

After processing all pages, the redactions are applied:

pdf.ApplyRedaction();

If the original document was not a PDF and had to be converted, the redacted PDF is saved to a hard-coded output folder, with "redacted-" prepended and ".pdf" appended to the original file name:

if (wasAnImage)
    pdf.SaveToFile(@"C:\temp\output\redacted-" + Path.GetFileName(filePath) + ".pdf");
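Note that Path.GetFileName(filePath) + ".pdf" keeps the original extension, so scan.png is saved as redacted-scan.png.pdf, and the snippet assumes the output folder already exists. One possible variation is sketched below, using only System.IO. BuildOutputPath is a hypothetical helper, and the folder used in the demo is just a temporary location.

```csharp
using System;
using System.IO;

string output = OutputNaming.BuildOutputPath(
    Path.Combine("input", "scan.png"),
    Path.Combine(Path.GetTempPath(), "output"));

Console.WriteLine(Path.GetFileName(output));   // redacted-scan.pdf

// Hypothetical helper, not part of the SDK.
static class OutputNaming
{
    // Builds "<outputFolder>/redacted-<name>.pdf" and creates the folder if needed.
    public static string BuildOutputPath(string sourcePath, string outputFolder)
    {
        // No-op when the folder already exists.
        Directory.CreateDirectory(outputFolder);

        // Drop the source extension so "scan.png" does not become "scan.png.pdf".
        string baseName = Path.GetFileNameWithoutExtension(sourcePath);
        return Path.Combine(outputFolder, $"redacted-{baseName}.pdf");
    }
}
```

The trade-off is that scan.png and scan.tif now map to the same output name, as do files with the same name in different subfolders, so a real implementation would need a rule for resolving collisions.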

Summary

This C# snippet offers a clear example of how to:

  • Load and convert documents to PDF,

  • Perform OCR on pages without text,

  • Search for specific sensitive data using regular expressions,

  • Add redaction regions programmatically,

  • Apply the redactions and save the redacted documents.

Using Nutrient's .NET SDK, you can automate document security tasks efficiently, which is essential for compliance and privacy in many business contexts.


If you want me to help you extend this example or integrate it in your projects, do not hesitate to send me an email at fabio.derose@nutrient.io.

Cheers!
Fabio
