Get started with AWS Textract
Amazon Textract is a service provided by Amazon Web Services (AWS) designed to automatically extract text, handwriting, layout elements and data from scanned documents. Textract goes beyond traditional Optical Character Recognition (OCR) technology by using machine learning to recognize the layout and format of a document, such as the arrangement of text in columns, tables, and forms.
Image credits: AWS
All extracted data is returned with bounding box coordinates—polygon frames that encompass each piece of identified data, such as a word, a line, a table, or individual cells within a table.
Image credits: AWS
Amazon Textract also returns a confidence score for everything it identifies so you can make informed decisions about how to use the results.
Key Features of Amazon Textract
Text and Data Extraction: Textract can recognize and extract printed and handwritten text from documents, forms, and tables. Textract’s ML powered OCR can recognize text in various fonts, styles and types like PDFs, images, and scans. It can also handle noisy or distorted text. Learn more »
A document can be in JPEG, PNG, PDF, or TIFF format. With PDF and TIFF format files, you can process multipage documents. For information about how Amazon Textract represents documents as
Block
objects, see Text Detection and Document Analysis Response Objects.The following is an acceptable input document example.
Image credits: AWS
Form and Table Recognition: Textract is not just limited to extracting raw text. It can identify the structure of data in forms and tables, enabling it to extract information like key-value pairs and table data efficiently.
Handwriting Recognition: Unlike many OCR solutions, Textract can also read and process handwritten notes, which is particularly useful in industries where hand-filled forms are common.
Layout Recognition: Amazon Textract provides you with the ability to extract layout elements such as paragraphs, titles, lists, headers, footers, and more from documents.
Signature Detection: Amazon Textract provides the ability to detect signatures on any document or image. This makes it easy to automatically detect signatures on documents such as checks, loan application forms, and claims forms. The location of the signatures and associated confidence scores are included in the API response.
Learn more »Query based extraction: Amazon Textract provides you with the flexibility to specify the data you need to extract from documents using queries. You can specify the information you need in the form of natural language questions (e.g., “What is the customer name”) and receive the exact information (e.g., ”John Doe”) as part of the API response. You do not need to know the data structure in the document (table, form, implied field, nested data) or worry about variations across document versions and formats. Textract Queries are pre-trained on a large variety of documents including paystubs, bank statements, W-2s, loan application forms, mortgage notes, claims documents, and insurance cards. The flexibility that Textract Queries provides reduces the need to implement post processing, reliance on manual reviews of extracted data or the need to train ML models.
Learn more »Analyze Lending: Analyze Lending API is a managed, preconfigured intelligent document processing API that fully automates the extraction of information from loan packages. Customers simply upload their mortgage loan documents to the Analyze Lending API and its prebuilt machine learning models will classify and split the document package by document type.
Invoices and receipts: Invoices and receipts can have a wide variety of layouts, which makes it difficult and time-consuming to manually extract data at scale. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts and automatically extracts relevant data such as vendor name, invoice number, item prices, total amount, and payment terms.
Identity documents: Amazon Textract uses machine learning (ML) to understand the context of identity documents such as U.S. passports and driver’s licenses without the need for templates or configuration. You can automatically extract specific information such as date of expiry and date of birth, as well as intelligently identify and extract implied information such as name and address. Using Analyze ID, businesses providing ID verification services and those in finance, healthcare, and insurance can easily automate account creation, appointment scheduling, employment applications, and more by allowing customers to submit a picture or scan of their identity document.
Integration with Other AWS Services: Textract can be integrated with other AWS services like Amazon S3 for document storage, AWS Lambda for event-driven processing, and Amazon RDS or DynamoDB for storing the extracted data.
Document Analysis: Textract analyzes documents for relationships between detected items. For instance, it can understand the association between a question and its answer in a form.
Bounding Box Coordinates: It provides bounding box coordinates for each detected piece of text, enabling precise location within the original document for highlighted display or further processing.
Support for Multiple Languages: Textract supports various languages, enhancing its usability across different geographic regions and business needs.
Use Cases
Automated Document Processing: For processing large volumes of documents, like invoices, receipts, and forms, without manual data entry.
Content Management: Helps in organizing and indexing large libraries of documents for easy search and retrieval.
Compliance and Data Privacy: Automated extraction makes it easier to redact sensitive information and comply with data privacy laws.
Data Digitization: Converts paper documents into actionable, digital data.
Healthcare and Finance: For extracting patient information, financial data, and other critical information from forms and documents in these sectors.
Advantages
Accuracy: Uses advanced machine learning models to provide highly accurate extraction.
Scalability: Can handle large volumes of documents without the need for additional infrastructure.
Cost-Effective: There are no minimum fees and no upfront commitments. Amazon Textract charges only for pages processed whether you extract text, text with tables, form data, queries or process invoices and identity documents.
Time-Saving: Automates the time-consuming task of manual data entry and document analysis.
Accelerates business processes: Streamlines workflows that rely on document processing.
Unlocks data insights: Extracts valuable information from documents for analysis and decision-making.
Amazon Textract is a powerful tool for businesses and organizations looking to automate the extraction of text and data from various document types, significantly reducing manual effort and improving efficiency and accuracy in data handling.
Processing Documents with Synchronous Operations
In the context of Amazon Textract and more broadly in computer science, the term "synchronous" refers to operations that are conducted in a sequential, blocking manner. When you process documents with synchronous operations in Textract, it implies that the tasks are carried out one after the other, and each task must complete before the next one begins. It return results in near real time. This is in contrast to asynchronous operations, where tasks can be initiated and then run in the background, allowing other operations to proceed without waiting for their completion.
When you use Amazon Textract for synchronous operations, here's what typically happens:
Request-Response Model: You make a request to Textract to process a document, and then you wait for it to complete the processing. The response is not returned until the processing is finished . This is different from an asynchronous approach, where you would submit a job and then check back later for the results.
Real-Time Processing: Synchronous operations are often used for real-time processing needs. For example, if you have an application where a user uploads a document and expects immediate results, synchronous processing would be appropriate.
Blocking Calls: The application making the call to Textract will block, or wait, until Textract completes the processing of the document and returns the result. This means during this time, your application might not be able to perform other tasks.
Use Cases: Synchronous operations are best suited for scenarios where the document is relatively small, and the latency of processing is acceptable within the application's workflow. Examples include extracting text from single-page documents or reading data from small forms. You can use Amazon Textract synchronous operations for the following purposes:
Text detection
Text analysis
Invoice and receipt analysis
Identity document analysis
API Methods: In Amazon Textract, synchronous operations typically use API methods like
DetectDocumentText
(for plain text detection) andAnalyzeDocument
(for form and table analysis).
Implications of Synchronous Operations
Simplicity: Synchronous operations are generally easier to implement and understand because they follow a straightforward, linear flow of execution.
Resource Utilization: These operations can be less efficient in terms of resource utilization since the system waits for the completion of the task before moving on to the next one.
Latency: In a synchronous model, the response time includes the time taken to process the document, which might not be ideal for high-latency or batch processing scenarios.
Scalability: Synchronous processing might not scale as well as asynchronous processing, especially when dealing with a large number of documents or large-sized documents.
Synchronous operations in the context of Amazon Textract and other similar services are about processing tasks in a sequential and blocking manner. They are suitable for real-time, immediate processing needs but come with certain limitations in terms of scalability and resource utilization compared to asynchronous operations.
Processing Documents with Asynchronous Operations
Processing documents with asynchronous operations, particularly in the context of Amazon Textract, involves a different workflow compared to synchronous operations. Asynchronous processing is designed for handling larger documents or a batch of documents where immediate response is not required. This method is particularly useful when dealing with multi-page documents and when the processing time might be longer.
You can use Amazon Textract asynchronous operations for the following purposes:
Text detection
Text analysis
Expense analysis
Lending document analysis
Key Characteristics of Asynchronous Operations:
Non-Blocking Calls: In asynchronous processing, the initial request to start processing a document does not block the caller until the processing is complete. Instead, the service (like Amazon Textract) acknowledges the request and starts processing in the background.
Job IDs: When you initiate an asynchronous operation, Textract assigns a unique job identifier (Job ID) to the task. You use this Job ID to periodically check the status of the processing task.
Status Checks: After submitting the document for processing, you can periodically poll Textract to check the status of the job using the Job ID. The job status could be “IN_PROGRESS,” “SUCCEEDED,” or “FAILED,” among others.
Retrieving Results: Once the processing is complete (status is “SUCCEEDED”), you can retrieve the results of the processing. The results are typically stored in an Amazon S3 bucket or can be fetched directly depending on your setup.
Handling Larger Documents: Asynchronous operations are ideal for processing larger documents like multi-page PDFs or documents that require more complex analysis, which might take more time to process.
Scalability: This approach is more scalable as it allows your system to handle other tasks while waiting for the document processing to complete, rather than blocking on a single operation.
Workflow of Asynchronous Operations in Amazon Textract:
The following diagram shows the process for detecting document text in a document image stored in an Amazon S3 bucket. In the diagram, an Amazon SQS queue gets the completion status from the Amazon SNS topic.
Image credits: AWS
Submit Document: You start by submitting a document for processing using an API like
StartDocumentTextDetection
orStartDocumentAnalysis
. These APIs are used for different types of document analysis tasks.Receive Job ID: Amazon Textract responds with a Job ID for the submitted document.
Poll for Status: Your application or service periodically sends requests to Textract to check the status of the job using the provided Job ID.
Retrieve Results: Once the status indicates that processing is complete, you can then retrieve the results, which may include extracted text, form data, table data, etc.
Post-Processing: After retrieving the results, you can proceed with any post-processing steps, like data analysis, storage, or further business logic.
Advantages:
Efficient for Large Documents: Does not require keeping a connection open while the document is being processed, making it more efficient for large documents.
Better Resource Utilization: Allows your application to perform other operations while the document is being processed.
Scalable: More suitable for applications that need to process a high volume of documents or require processing complex documents.
Asynchronous processing in Amazon Textract and similar services offers a more flexible and scalable approach for document processing, especially when dealing with large-scale or resource-intensive tasks.
Tutorial: Detecting text with an AWS Lambda function and AWS Textract
GitHub repository: https://github.com/Brain2life/blog-aws-textract-lambda
AWS Lambda is a compute service that you can use to run code without provisioning or managing servers. You can call Amazon Textract API operations from within an AWS Lambda function. The following tutorial shows how to create a Lambda function in Python that calls DetectDocumentText.
The Lambda function returns a list of Block objects with information about the detected words and lines of text. The instructions include example Python code that shows you how to call the Lambda function with a document supplied from an Amazon S3 bucket or your local computer. Images stored in Amazon S3 must be in single-page PDF or TIFF document format, or in JPEG or PNG format. Local images must be in single-page PDF or TIFF format. The Python code returns part of the JSON response for each Block type detected in the document.
For an example that uses Lambda functions to process documents at a large scale, see Amazon Textract IDP CDK Constructs and Use machine learning to automate and process documents at scale.
Keep in mind that Amazon Textract is only available at specific regions at the moment of this article writing:
Step 1: Create an Amazon S3 bucket to store documents (console)
In this step, you create an Amazon S3 bucket to store your documents for analysis.
Go to the AWS Management Console.
Navigate to the S3 Service
In the S3 dashboard, click the “Create bucket” button.
Enter a unique name for your bucket. This name must be globally unique across all AWS accounts and regions.
Keep the rest settings as default and create the bucket.
Select the created S3 bucket and create folder to store the documents.
Upload PNG image to S3 bucket:
Step 2: Create an AWS Lambda function (console)
In this step, you create an empty AWS Lambda function and an IAM execution role that lets your function call the DetectDocumentText
operation. As we are supplying documents from Amazon S3, this step also shows us how to grant access to the bucket that stores our documents.
Later you add the source code and optionally add a layer to the Lambda function.
Sign in to the AWS Management Console and open the AWS Lambda console at https://console.aws.amazon.com/lambda/.
Choose Create function. For more information, see Create a Lambda Function with the Console.
Choose the following options:
Choose Author from scratch.
Enter a value for Function name.
For Runtime, choose Python 3.9.
For Architecture, choose x86_64.
Choose Create function to create the AWS Lambda function.
On the function page, choose the Configuration tab.
On the Permissions pane, under Execution role, choose the role name to open the role in the IAM console.
In the Permissions tab, choose Add permissions and then Create inline policy.
Choose the JSON tab and replace the policy with the following policy:
{ "Version": "2012-10-17", "Statement": [ { "Action": "textract:DetectDocumentText", "Resource": "*", "Effect": "Allow", "Sid": "DetectDocumentText" } ] }
Choose Review policy.
Enter a name for the policy, for example DetectDocumentTextAccessPolicy.
Choose Create policy.
As we are storing documents for analysis in an Amazon S3 bucket, we must add an Amazon S3 access policy. To do this, repeat steps 7 to 11 in the AWS Lambda console and make the following changes.
For step 8, use the following policy. Replace
bucket/folder path
with the Amazon S3 bucket and folder path to the documents that you want to analyze.{ "Version": "2012-10-17", "Statement": [ { "Sid": "S3Access", "Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::bucket/folder path/*" } ] }
For step 10, choose a different policy name, such as S3Bucket-access.
Step 3: (Optional) Create a layer (console)
To run this example, you don't need to perform this step. The DetectDocumentText
operation is included in the default Lambda Python environment as part of AWS SDK for Python (Boto3). If other parts of your Lambda function require recent AWS service updates that aren't in the default Lambda Python environment, then perform this step to add the most recent Boto3 SDK release as a layer to your function.
First, you create a zip file archive that contains the Boto3 SDK. Then, you create a layer and add the zip file archive to the layer. For more information, see Using layers with your Lambda function.
To create and add a layer (console)
Open a command prompt and enter the following commands to create a deployment package with the most recent version of the AWS SDK.
pip install boto3 --target python/. zip boto3-layer.zip -r python/
Note the name of the zip file (boto3-layer.zip), which you use in step 8 of this procedure.
Open the AWS Lambda console at https://console.aws.amazon.com/lambda/.
In the navigation pane, choose Layers.
Choose Create layer.
Enter values for Name and Description.
For Code entry type, choose Upload a .zip file and select Upload.
In the dialog box, choose the zip file archive (boto3-layer.zip) that you created in step 1 of this procedure.
For Compatible runtimes, choose Python 3.9.
Choose Create to create the layer.
Choose the navigation pane menu icon.
In the navigation pane, choose Functions.
In the resources list, choose the function that you created previously in Step 1: Create an AWS Lambda function (console).
Choose the Code tab.
In the Layers section, choose Add a layer.
Choose Custom layers.
In Custom layers, choose the layer name that you entered in step 6.
In Version choose the layer version, which should be 1.
Choose Add.
Step 4: Add Python code (console)
In this step, you add Python code to your Lambda function by using the Lambda console code editor. The code detects text in a document with DetectDocumentText
and returns a list of Block objects with information about the detected text. The document can be located in an Amazon S3 bucket or a local computer. Images stored in Amazon S3 must be single-page PDF or TIFF format documents or in JPEG or PNG format. Local images must be in single-page PDF or TIFF format.
To add Python code (console)
Navigate to the Code tab.
In the code editor, replace the code in
lambda_function.py
with the following code:https://github.com/Brain2life/blog-aws-textract-lambda/blob/main/lambda_function.py
Choose Deploy to deploy your Lambda function.
Step 5: Try your Lambda function
Now that you’ve created your Lambda function, you can invoke it to detect text in a document. In this step, you use Python code on your computer to pass a local document or a document in an Amazon S3 bucket to your Lambda function. Documents passed from a local computer must be smaller than 6291456 bytes. If your documents are larger, upload them to an Amazon S3 bucket and call the script with the Amazon S3 path to the image. For information about uploading image files to an Amazon S3 bucket, see Uploading objects.
Make sure you run the code in the same AWS Region in which you created the Lambda function. You can view the AWS Region for your Lambda function in the navigation bar of the function details page in the Lambda console.
If the AWS Lambda function returns a timeout error, extend the timeout period for the Lambda function. For more information, see Configuring function timeout (console).
For more information about invoking a Lambda function from your code, see Invoking AWS Lambda Functions.
To try your Lambda function
If you haven't already done so, do the following:
Make sure that the user has
lambda:InvokeFunction
permission. You can use the following policy:{ "Version": "2012-10-17", "Statement": [ { "Sid": "InvokeLambda", "Effect": "Allow", "Action": "lambda:InvokeFunction", "Resource": "ARN for lambda function" } ] }
You can get the ARN for your Lambda function from the function overview in the Lambda console.
To provide access, add permissions to your users, groups, or roles:
Users and groups in AWS IAM Identity Center:
Create a permission set. Follow the instructions in Create a permission set in the AWS IAM Identity Center User Guide.
Users managed in IAM through an identity provider:
Create a role for identity federation. Follow the instructions in Creating a role for a third-party identity provider (federation) in the IAM User Guide.
IAM users:
Create a role that your user can assume. Follow the instructions in Creating a role for an IAM user in the IAM User Guide.
(Not recommended) Attach a policy directly to a user or add a user to a user group. Follow the instructions in Adding permissions to a user (console) in the IAM User Guide.
Install and configure AWS SDK for Python. For more information, see Step 2: Set Up the AWS CLI and AWS SDKs.
Save the following code to a file named
client.py
:
https://github.com/Brain2life/blog-aws-textract-lambda/blob/main/client.py
Run the code. For the command line argument, supply the Lambda function name and the document that you want to analyze. You can supply a path to a local document, or you can use the Amazon S3 path to an document stored in an Amazon S3 bucket. For example:
python client.py function_name s3://bucket/path/document.jpg
If the document is in an Amazon S3 bucket. make sure that it is the same bucket that you specified previously in step 12 of Step 1: Create an AWS Lambda function (console).
If successful, your code returns a partial JSON response for each Block type detected in the document.
References:
Subscribe to my newsletter
Read articles from Maxat Akbanov directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Maxat Akbanov
Maxat Akbanov
Hey, I'm a postgraduate in Cyber Security with practical experience in Software Engineering and DevOps Operations. The top player on TryHackMe platform, multilingual speaker (Kazakh, Russian, English, Spanish, and Turkish), curios person, bookworm, geek, sports lover, and just a good guy to speak with!