Extracting Structured Data from Credit Letters - LLM vs OCR

Sion Kim

Introduction

Credit rating letters are critical in the debt securities market, providing assessments of the creditworthiness of entities and their debt instruments. Manually extracting details such as company names, ratings, and security information is time-consuming and error-prone. While traditional OCR and rule-based systems like AWS Textract offer partial solutions, they often struggle with the variability in format, language, and structure inherent in these documents.

Large Language Models (LLMs) like Anthropic’s Claude Sonnet or Google’s Gemini present a powerful alternative, offering sophisticated natural language understanding and reasoning capabilities. This post explores how to leverage LLMs for robust, automated credit letter data extraction, focusing on input methods and prompting strategies, and compares this approach with specialized AWS services like Textract.

The Challenge: Semi-Structured Data Extraction

Credit letters, while they often contain similar types of information, lack a standardized machine-readable format. They are typically issued as PDFs and contain:

  1. Core Entity: The company being rated and the date the letter was prepared

  2. Rated Instrument Details: Often presented in tables or paragraphs, including descriptions of bond securities, rating types (e.g. Issuer Credit Rating, Senior Unsecured Debt), rating actions (e.g. Affirmed, Upgraded, Downgraded, Withdrawn), the assigned rating (rating agency specific; e.g. AA, Baa2, BB+), outlook or credit watch status (e.g. Stable, Positive, Negative, Developing), the effective date of the rating action, and notes.

The goal is to extract the information accurately and structure it into a usable format, specifically a JSON format.

Consider the following credit letter example (from https://ttc.lacounty.gov/wp-content/uploads/2022/07/Fitch-Rating-letter.pdf).

Target Output Format:

[
  {
    "company_name": "Example Corp Inc.",
    "prepared_date": "2025-05-04",
    "rated_securities": [
      {
        "bond_description": "Issuer Credit Rating",
        "rating_type": "Long-Term Issuer Credit Rating",
        "rating_action": "Affirmed",
        "rating": "A+",
        "outlook_watch": "Stable",
        "effective_date": "2025-05-04",
        "notes": "Affirmation reflects solid operating performance."
      },
      {
        "bond_description": "$500M 4.5% Senior Unsecured Notes due 2035",
        "rating_type": "Senior Unsecured Debt",
        "rating_action": "Affirmed",
        "rating": "A+",
        "outlook_watch": "Stable",
        "effective_date": "2025-05-04",
        "notes": null
      },
      ...
    ]
  }
]

Input Data Handling Techniques

Before prompting, the credit letter needs to be fed into the model. Several approaches exist:

  1. Text Parsing (Pre-processing)

    1. Method: Extract raw text from PDFs using libraries like pypdf (a minimal sketch follows this list)

    2. Pros: Simple for digitally generated PDFs (as opposed to scanned or handwritten ones), avoids OCR errors, potentially lower cost than sending images or full PDFs

    3. Cons: Loses formatting (tables, layout) which can contain contextual clues, ineffective for scanned image-based PDFs, requires robust text extraction logic

  2. Image Capture + OCR

    1. Method: Convert PDF pages (especially scanned ones) or document images into images, then feed them to the LLM to extract text

    2. Pros: Handles scanned documents and images. Can identify structural elements (tables, forms)

    3. Cons: Introduces potential OCR errors, adds an extra processing step and potential cost

  3. Direct PDF Upload (Utilizing Multimodality)

    1. Method: If the LLM supports direct PDF upload (e.g. Claude 3.7 Sonnet or Gemini 2.5), this can be the most streamlined approach. The model processes the document structure and text internally

    2. Pros: Simplifies the pipeline, potentially preserves layout context better than simple text extraction, leverages the model’s integrated understanding of document formats

    3. Cons: Dependent on API capabilities and limits (file size limit), potentially higher token usage
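
For the text-parsing route (option 1 above), here is a minimal sketch using the pypdf library; the file name `fitch_letter.pdf` is a hypothetical stand-in for whatever letter you are processing:

```python
from pypdf import PdfReader

def extract_letter_text(pdf_path: str) -> str:
    """Extract raw text from a digitally generated PDF, page by page."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

credit_letter_text = extract_letter_text("fitch_letter.pdf")  # hypothetical path
```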

Recommendation: Using multi-modal capabilities (direct PDF or image upload) is preferable, as it preserves document structure such as tables and forms. If dealing only with digitally generated PDFs, text parsing might be sufficient, but it requires careful evaluation. A sketch of the direct-upload approach follows.
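
Below is a minimal direct-upload sketch using Anthropic’s Messages API; the file path and model name are illustrative, and an `ANTHROPIC_API_KEY` is assumed to be set in the environment:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("fitch_letter.pdf", "rb") as f:  # hypothetical local copy of the letter
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative model name
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            # The PDF is passed as a base64 document block, so the model
            # sees layout and tables, not just extracted text.
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text",
             "text": "Extract the company name, prepared date, and rated "
                     "securities from this credit letter as JSON."},
        ],
    }],
)
print(response.content[0].text)
```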

Prompting Strategies

The key to successful LLM-based data extraction lies in effective prompting. The latest models are highly capable even with simple instructions, but refining the prompt improves accuracy and consistency.

Assumed Input: the credit letter above, with its extracted text referenced as {{credit_letter_text}}

  1. Zero-Shot Prompting

    1. Provide instructions without any examples. Relies entirely on the model’s pre-trained knowledge

    System: You are an AI assistant specialized in extracting financial data from credit rating letters. Your task is to carefully read the provided text and extract specific information into a structured JSON format.

    User: Please extract the following information from the credit letter text below:
    1.  The name of the company being rated (`company_name`).
    2.  The date the credit letter was prepared or published (`prepared_date`). Format as YYYY-MM-DD.
    3.  For EACH rated security or instrument mentioned, extract:
        * `bond_description`: The name or description of the security/instrument (e.g., "Issuer Credit Rating", "$500M Senior Notes due 2030").
        * `rating_type`: The category of the rating (e.g., "Long-Term Issuer Credit Rating", "Senior Unsecured Debt").
        * `rating_action`: The action taken (e.g., "Affirmed", "Upgraded", "Downgraded", "Placed on Watch Negative", "Withdrawn").
        * `rating`: The assigned rating (e.g., "AA-", "Baa1", "NR"). Use "N/A" if withdrawn or not applicable.
        * `outlook_watch`: The outlook or credit watch status (e.g., "Stable", "Positive", "Negative", "Watch Developing"). Use null if not specified or not applicable (e.g., for withdrawn ratings).
        * `effective_date`: The date the rating action is effective. Format as YYYY-MM-DD. Often the same as the prepared date if not specified otherwise.
        * `notes`: Any brief accompanying notes or reasons for the action, if provided. Use null if none.

    Output the result as a JSON array containing a single object representing the letter. Inside this object, include `company_name`, `prepared_date`, and a `rated_securities` array containing objects for each distinct rating action.

    Strictly adhere to the JSON format specified. Ensure all requested fields are present for each security, using null or "N/A" where appropriate.

    Here is the credit letter text: {{credit_letter_text}}
  2. One-Shot Prompting

    1. Provide one complete example of input and desired output to guide the model

       System: You are an AI assistant specialized in extracting financial data from credit rating letters. Your task is to carefully read the provided text and extract specific information into a structured JSON format, following the example provided.
      
       User: Please extract information from the credit letter text according to the specified JSON format.
      
       **Example:**
      
       *Input Text Snippet:*
       "Rating Agency Inc. - Report Date: 2024-11-15
       Subject: Alpha Widgets Corp.
       We have upgraded Alpha Widgets Corp.'s Long-Term Issuer Credit Rating to 'BBB+' from 'BBB'. The outlook is now Positive. This reflects improved debt metrics. The rating on the company's Senior Secured Notes was also upgraded to 'BBB+'. Both actions are effective November 15, 2024."
      
       *Desired JSON Output:*
       ```json
       [
         {
           "company_name": "Alpha Widgets Corp.",
           "prepared_date": "2024-11-15",
           "rated_securities": [
             {
               "bond_description": "Issuer Credit Rating",
               "rating_type": "Long-Term Issuer Credit Rating",
               "rating_action": "Upgraded",
               "rating": "BBB+",
               "outlook_watch": "Positive",
               "effective_date": "2024-11-15",
               "notes": "This reflects improved debt metrics."
             },
             {
               "bond_description": "Senior Secured Notes",
               "rating_type": "Senior Secured Debt",
               "rating_action": "Upgraded",
               "rating": "BBB+",
               "outlook_watch": null,
               "effective_date": "2024-11-15",
               "notes": null
             }
           ]
         }
       ]
       ```
       Now, process the following credit letter text: {{credit_letter_text}}
      
  3. Few-Shot Prompting

    1. Provide multiple examples (2-5) to cover more variations in input style, language, or edge cases. This approach yields the best results for complex or variable inputs

Recommendation: Start with zero-shot prompting, as newer models are very capable. If accuracy or format adherence isn’t consistently met, move to one-shot and then few-shot prompting, providing diverse examples that cover common variations. Few-shot prompting is generally the most robust approach for complex or variable inputs. A runnable zero-shot sketch follows.
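
As a concrete starting point, here is a minimal zero-shot sketch using the Anthropic SDK; the model name is illustrative, and the prompt is abbreviated from the full version shown earlier:

```python
import json
import anthropic

SYSTEM = ("You are an AI assistant specialized in extracting financial data "
          "from credit rating letters. Extract the requested fields into a "
          "structured JSON format.")

def extract_ratings(credit_letter_text: str) -> list[dict]:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # illustrative model name
        max_tokens=4096,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": "Extract company_name, prepared_date, and "
                       "rated_securities as a JSON array.\n\n"
                       + credit_letter_text,
        }],
    )
    raw = response.content[0].text.strip()
    # Models sometimes wrap JSON in a markdown fence; strip it before parsing.
    raw = raw.removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)
```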

Claude 3.5 Sonnet vs AWS Textract

The table below compares a high-reasoning LLM like Claude 3.5 Sonnet with a specialized AWS service like Textract:

| Feature | AWS Textract | Claude 3.5 Sonnet |
| --- | --- | --- |
| Primary Function | Document OCR; text, form, and table data extraction | Natural language understanding (context), generation, reasoning |
| Strengths | Purpose-built for document structures (lines, words, tables, forms). Predictable pricing per page/query type. Integrates with other AWS services | Superior understanding of language, synonyms, and context. Handles variations in phrasing and non-standard formats. Flexible prompting allows extracting complex/subtle details (e.g. notes) |
| Weaknesses | Struggles with unstructured text (especially with font variations) or complex narrative descriptions. Less adept at nuanced interpretation. Requires extensive configuration and setup | Potential for hallucination (though reduced in newer models). Less specialized for detecting table structure geometry. Requires careful prompt engineering for optimal results |
| Best For | Extracting data from highly structured forms/tables within documents. Precise location of text (bounding boxes) | Documents with narrative text and variable formats. Extracting information requiring contextual understanding or inference. When flexibility to adapt to new phrasing or formats is crucial. Extracting qualitative data (e.g. notes) or interpreting rating rationale |
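
For a sense of the Textract side of this comparison, here is a minimal sketch using boto3; it assumes AWS credentials are configured and uses a hypothetical single-page file, since the synchronous API does not accept multi-page PDFs:

```python
import boto3

textract = boto3.client("textract")

# Synchronous call; multi-page PDFs require the asynchronous
# StartDocumentAnalysis / GetDocumentAnalysis operations instead.
with open("letter_page1.pdf", "rb") as f:  # hypothetical single-page file
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],
    )

# Textract returns geometry-oriented Blocks (pages, lines, words, cells),
# not interpreted fields; mapping Blocks to rating fields is left to your code.
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print("\n".join(lines[:10]))
```

That hand-written Block-to-field mapping is precisely the step the LLM approach absorbs into the prompt.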

The Power of Reasoning: Why LLMs Excel at Credit Rating Letter Data Extraction

LLM capabilities go beyond simple keyword matching or template filling, enabling robust extraction from real-world credit letters.

  1. Contextual Disambiguation: It can understand that “The rating was affirmed” refers to the previously mentioned “A+ Senior Unsecured Notes”, even if the reference is not explicit in the same sentence.

  2. Synonym Handling: It recognizes “Stable Outlook”, “Outlook: Stable”, and “The outlook remains stable” as equivalent. Similarly, it treats “Upgraded”, “Raised”, and “Improved” as similar rating actions.

  3. Implicit Information Extraction: It can infer the rating type (e.g. “Senior Unsecured Debt”) from the bond description (e.g. “Senior Unsecured Notes maturing 2035”).

  4. Complex Instruction Handling: It can follow detailed instructions regarding the desired output structure, including JSON arrays and conditional logic.

  5. Handling Negations: It can correctly interpret phrases like “The outlook is unchanged” or “Rating remains on negative watch”.
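
These strengths come with the hallucination caveat noted in the comparison table, so it is worth validating the model’s output against a schema before loading it downstream. Here is a minimal sketch using pydantic, with field names mirroring the target format above:

```python
from datetime import date
from pydantic import BaseModel, TypeAdapter

class RatedSecurity(BaseModel):
    bond_description: str
    rating_type: str
    rating_action: str
    rating: str               # "N/A" is allowed per the prompt instructions
    outlook_watch: str | None
    effective_date: date      # rejects anything not formatted YYYY-MM-DD
    notes: str | None

class CreditLetter(BaseModel):
    company_name: str
    prepared_date: date
    rated_securities: list[RatedSecurity]

def validate_letters(raw_json: str) -> list[CreditLetter]:
    """Raise pydantic.ValidationError if the LLM returned malformed JSON."""
    return TypeAdapter(list[CreditLetter]).validate_json(raw_json)
```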

Conclusion

Automating data extraction from credit letters offers significant efficiency gains. While traditional OCR provides powerful tools for structured data, LLMs add exceptional flexibility and reasoning capability, making them well suited to the linguistic nuance and format variability common in these documents. By combining effective input handling (especially multimodality) with sound prompting strategies (starting with zero-shot and iterating toward few-shot), developers and analysts can build a system that accurately transforms unstructured credit letter data into the desired structured format.
