Build a Serverless Workflow: AWS Lambda + Textract (Sync & Async with SNS + SQS) — Step‑by‑Step Guide


Extracting text and data from documents at scale is a common requirement in modern applications, from invoice processing to contract analysis. Amazon Textract, combined with AWS Lambda, provides a serverless approach to building automated pipelines for text extraction.
In this blog, we’ll walk through both synchronous and asynchronous integration approaches between Lambda and Textract, including how to use SNS and SQS for async jobs. By the end, you’ll have a clear roadmap to build a robust, serverless text-extraction pipeline.
Prerequisites
Before diving into the integration, make sure you have the following in place:
- AWS Account – with access to create and configure S3, Lambda, IAM, Textract, SNS, and SQS.
Architecture
Synchronous Architecture (S3 → Lambda → Textract → S3)
Sequence Diagram
sequenceDiagram
participant U as User
participant S3 as Amazon S3
participant L as AWS Lambda
participant T as Amazon Textract
U->>S3: Upload Document
S3-->>L: Trigger Event (Object Created)
L->>T: Invoke Textract (DetectDocumentText)
T-->>L: Return Extracted Text
L->>S3: Save Extracted Text (S3 prefix or separate bucket)
S3-->>U: Processed file available
Asynchronous Architecture (S3 → Lambda → Textract → SNS → SQS → Lambda → S3)
Sequence Diagram
sequenceDiagram
participant U as User
participant S3 as Amazon S3
participant L1 as Lambda (Start Job)
participant T as Amazon Textract
participant SNS as Amazon SNS
participant SQS as Amazon SQS
participant L2 as Lambda (Process Result)
U->>S3: Upload Document
S3-->>L1: Trigger Event (Object Created)
L1->>T: StartDocumentTextDetection (Async)
T-->>SNS: Job Completion Notification
SNS-->>SQS: Deliver Message
SQS-->>L2: Trigger Event (Message Received)
L2->>T: GetDocumentTextDetection (Retrieve Results)
L2->>S3: Save Extracted Text (with prefix or separate bucket)
S3-->>U: Processed file available
Synchronous
Step 1: Create an S3 bucket
When setting up Amazon S3 for this workflow, we have two main options:
Two Buckets (Recommended for Separation of Concerns)
Bucket 1: Used for uploading the input documents that need to be processed.
Bucket 2: Used for storing the processed output documents.
✅ Advantage: Provides a clean separation between raw and processed data, making it easier to manage permissions, lifecycle policies, and logging.
Single Bucket with Prefixes
Use one bucket, but organize files with two prefixes (folders):
- input/ for documents to be processed
- output/ for processed results
✅ Advantage: Easier to set up and manage since you only maintain one bucket.
⚠️ Consideration: Requires stricter access control and prefix-based policies to prevent accidental overwrites or permission leaks.
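If you go with the single-bucket layout, one way to enforce that separation is to scope the Lambda's S3 permissions to the prefixes instead of the whole bucket. A minimal sketch (the bucket name is a placeholder and the statement IDs are illustrative; adjust the actions to what your function actually needs):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadInputPrefixOnly",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::YOUR_S3_BUCKET/input/*"
    },
    {
      "Sid": "WriteOutputPrefixOnly",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YOUR_S3_BUCKET/output/*"
    }
  ]
}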
Step 2: Create an IAM customer-managed policy
Go to the IAM console and create a policy, then choose the JSON tab.
Paste the following JSON, replacing YOUR_S3_BUCKET, YOUR_AWS_ACCOUNT_ID, and YOUR_LAMBDA_FUNCTION_NAME with your own values:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "TextractPermissions",
"Effect": "Allow",
"Action": [
"textract:DetectDocumentText"
],
"Resource": "*"
},
{
"Sid": "S3Permissions",
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": "arn:aws:s3:::YOUR_S3_BUCKET/*"
},
{
"Sid": "CloudWatchLogGroupAccess",
"Effect": "Allow",
"Action": "logs:CreateLogGroup",
"Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
},
{
"Sid": "CloudWatchLogStreamAccess",
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/SyncLambdaTextractRole:*"
]
}
]
}
You can name it as you prefer. For this example, it will be named SyncLambdaTextractPolicy.
Step 3: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the SyncLambdaTextractPolicy to the permissions policies.
You can name it as you prefer. For this example, it will be named SyncLambdaTextractRole.
Step 4: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Runtime: Python
- Execution Role: SyncLambdaTextractRole (or the role name you created in Step 3)
Paste the following Python code in the Code tab:
import boto3
import urllib.parse  # for decoding URL-encoded S3 keys

def lambda_handler(event, context):
    textract = boto3.client('textract')

    # Bucket and key of the uploaded document from the S3 event
    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Synchronous call: the detected text comes back in the response
    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': s3_bucket,
                'Name': s3_key
            }
        }
    )

    # Keep only LINE blocks, which hold the detected lines of text
    text_lines = []
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            text_lines.append(item['Text'])

    print("Extracted Text:", "\n".join(text_lines))
    # Optionally, save the extracted text back to S3 here (as in the
    # architecture diagram); that also requires s3:PutObject in the policy.
    return {"status": "success", "lines": text_lines}
Sample test event for testing in the Lambda console:
{
"Records": [
{
"s3": {
"bucket": { "name": "your-bucket-name" },
"object": { "key": "your-file.pdf" }
}
}
]
}
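The synchronous flow also needs the S3 trigger itself (the Object Created event in the diagram). You can add it from the Lambda console's Add trigger button, or with the same CLI approach used later in the asynchronous walkthrough (Step 2.7). A rough sketch, assuming the function is named SyncLambdaTextract and the bucket is YOUR_S3_BUCKET_NAME:
aws lambda add-permission \
  --function-name SyncLambdaTextract \
  --statement-id AllowS3Invoke \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::YOUR_S3_BUCKET_NAME

aws s3api put-bucket-notification-configuration \
  --bucket YOUR_S3_BUCKET_NAME \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [
      {
        "LambdaFunctionArn": "arn:aws:lambda:us-east-2:YOUR_AWS_ACCOUNT_ID:function:SyncLambdaTextract",
        "Events": ["s3:ObjectCreated:Put"]
      }
    ]
  }'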
Asynchronous
Step 1: Create an S3 bucket
Use one bucket, but organize files with two prefixes (folders):
- input/ for documents to be processed
- output/ for processed results
✅ Advantage: Easier to set up and manage since you only maintain one bucket.
⚠️ Consideration: Requires stricter access control and prefix-based policies to prevent accidental overwrites or permission leaks.
Step 2: Lambda Start Textract Job
Step 2.1: Create an IAM customer-managed policy
Go to the IAM console, create a policy, and paste the following JSON. You can name it as you prefer; for this example, it will be named AsyncStartTextractJobPolicy.
Note: You can skip this step and use an inline policy on the role instead.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "TextractPermissions",
"Effect": "Allow",
"Action": [
"textract:StartDocumentTextDetection"
],
"Resource": "*"
},
{
"Sid": "S3ReadAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::YOUR_S3_BUCKET",
"arn:aws:s3:::YOUR_S3_BUCKET/*"
]
},
{
"Sid": "CloudWatchLogGroupAccess",
"Effect": "Allow",
"Action": "logs:CreateLogGroup",
"Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
},
{
"Sid": "CloudWatchLogStreamAccess",
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/AsyncStartTextractJob:*"
]
}
]
}
Step 2.2: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the AsyncStartTextractJobPolicy to the permissions policies.
You can name it as you prefer. For this example, it will be named AsyncStartTextractJobRole.
Step 2.3: Create an AWS SNS Topic
To handle notifications from AWS Textract, you need to create an SNS topic. You have two main options:
Option 1: Create the Topic via the AWS Console
Navigate to the SNS Console. Click Create topic, give it a name (e.g., textract-job-completion), and configure any additional settings such as access policies or delivery protocols.
✅ Advantage: Intuitive interface with easy configuration and management of subscriptions.
In the topic's information section, you will be able to see the Topic ARN.
Option 2: Create the Topic via AWS CLI
Use the AWS CloudShell or your local terminal to execute the following command:
aws sns create-topic --name textract-job-completion
Note: this command returns the TopicArn in its output:
{
"TopicArn": "arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion"
}
✅ Advantage: Quick and scriptable, ideal for automation or Infrastructure as Code.
Step 2.4: Create an IAM Role for Textract to Assume
Create an IAM Role for Textract
Go to the IAM Console → Roles → Create role.
Under Trusted entity type, select AWS service.
Choose Textract as the service that will use this role.
Attach Permissions
Add a policy that allows sns:Publish access to your specific SNS topic. Example policy (replace the ARN with your own SNS topic ARN):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "SNS:Publish",
"Resource": "arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion"
}
]
}
You can name it as you prefer. For this example, it will be named TextractSNSPublish.
We will need the ARN of this role. Once created, it will look something like arn:aws:iam::************:role/TextractSNSPublish
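When you pick Textract as the service in the console, the role's trust policy is created for you. For reference, it should look roughly like this (shown only so you can verify it if you create the role another way):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "textract.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}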
Step 2.5: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Name: AsyncStartTextractJob (this can be renamed, but remember to update the log group permissions in the policy)
- Runtime: Python
- Execution Role: AsyncStartTextractJobRole (or the role name you created in Step 2.2)
- Environment Variables:
  - SNS_TOPIC_ARN: arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion (from Step 2.3)
  - TEXTRACT_ROLE_ARN: arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/TextractSNSPublish (from Step 2.4)
Paste the following Python code in the Code tab:
import os
import boto3
import urllib.parse  # for decoding URL-encoded S3 keys

textract = boto3.client("textract")

SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
TEXTRACT_ROLE_ARN = os.environ['TEXTRACT_ROLE_ARN']

def start_textract_job(bucket, key, sns_topic_arn, textract_role_arn):
    # Kick off an asynchronous text-detection job; Textract notifies
    # the SNS topic when the job finishes
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        },
        NotificationChannel={
            'SNSTopicArn': sns_topic_arn,
            'RoleArn': textract_role_arn
        }
    )
    return response["JobId"]

def lambda_handler(event, context):
    # S3 event records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Decode URL-encoded key
        key = urllib.parse.unquote_plus(key)

        job_id = start_textract_job(bucket, key, SNS_TOPIC_ARN, TEXTRACT_ROLE_ARN)
        print(f"Started Textract job {job_id} for file {key} in bucket {bucket}")

    return {
        "statusCode": 200,
        "body": "Textract job(s) started successfully."
    }
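As with the synchronous function, you can exercise this handler from the Lambda console with a minimal S3-style test event (the bucket and key values below are placeholders):
{
  "Records": [
    {
      "s3": {
        "bucket": { "name": "your-bucket-name" },
        "object": { "key": "input/your-file.pdf" }
      }
    }
  ]
}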
Step 2.6: Create an SQS Queue & Subscribe It to the SNS Topic
When Amazon Textract completes a job, it sends a notification to the SNS topic you configured. To process these notifications reliably, you can subscribe an SQS queue to the SNS topic. This ensures that your application can consume the messages asynchronously and at its own pace.
You have two main options for creating the SQS queue:
Option 1: Create the Queue & subscribe via the AWS Console
Navigate to the Amazon SQS Console.
Choose Create queue and provide a name (e.g., textract-results-queue).
After creation, go to your SNS topic in the AWS Console and add a subscription to this SQS queue.
Copy the Topic ARN from the SNS topic information section when prompted.
✅ Advantage: Simple, user-friendly interface — ideal for quick setup and testing.
Option 2: Create the Queue & subscribe via AWS CLI
You can also create the queue programmatically:
aws sqs create-queue --queue-name textract-results-queue
Output
{
"QueueUrl": "https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue"
}
To get the queue ARN, run the following command:
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue \
--attribute-names QueueArn
Output
{
"Attributes": {
"QueueArn": "arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue"
}
}
After creating the queue, subscribe it to your SNS topic using the following command (replace with your Topic ARN and Queue ARN):
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion \
--protocol sqs \
--notification-endpoint arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue
✅ Advantage: Easy to script, automate, and reproduce across environments.
Notes:
Once the SQS queue is subscribed to the SNS topic, every Textract job completion notification will be delivered to the queue, where your application can consume and process it.
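Depending on how you create the subscription (especially via the CLI), you may also need to grant the SNS topic permission to send messages to the queue by attaching an access policy to the queue. A minimal sketch, assuming the topic and queue names used in this guide:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSNSPublishToQueue",
      "Effect": "Allow",
      "Principal": {
        "Service": "sns.amazonaws.com"
      },
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion"
        }
      }
    }
  ]
}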
Step 2.7: Add an S3 Upload Trigger to the AWS Lambda Function
Option 1: Create the Trigger Event via AWS CLI
Use the AWS CloudShell or your local terminal to execute the following command:
aws s3api put-bucket-notification-configuration \
--bucket YOUR_S3_BUCKET_NAME \
--notification-configuration '{
"LambdaFunctionConfigurations": [
{
"LambdaFunctionArn": "arn:aws:lambda:us-east-2:YOUR_AWS_ACCOUNT_ID:function:AsyncStartTextractJob",
"Events": ["s3:ObjectCreated:Put"]
}
]
}'
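Note that S3 can only deliver events to the function if the function's resource-based policy allows it; the console trigger flow adds this permission for you, but via the CLI you may first need something like:
aws lambda add-permission \
  --function-name AsyncStartTextractJob \
  --statement-id AllowS3Invoke \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::YOUR_S3_BUCKET_NAME \
  --source-account YOUR_AWS_ACCOUNT_ID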
Option 2: Using the AWS Console (UI)
Open the S3 Console → https://s3.console.aws.amazon.com/s3
Click your bucket name (your-bucket-name).
Go to the Properties tab.
Scroll down to Event notifications → click Create event notification.
Fill in the details:
- Name: PutTriggerLambda
- Event types: select PUT (under ObjectCreated)
- Destination: choose Lambda function → select AsyncStartTextractJob
Click Save changes.
Step 3: Lambda Process Textract Job Results
Step 3.1: Create an IAM customer-managed policy
Go to the IAM console, create a policy, and paste the following JSON. You can name it as you prefer; for this example, it will be named AsyncProcessTextractJobResultsPolicy.
Note: You can skip this step and use an inline policy on the role instead.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "logs:CreateLogGroup",
"Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/AsyncProcessTextractJobResults:*"
]
},
{
"Sid": "AllowTextractRead",
"Effect": "Allow",
"Action": [
"textract:GetDocumentTextDetection"
],
"Resource": "*"
},
{
"Sid": "AllowS3Write",
"Effect": "Allow",
"Action": [
"s3:PutObject"
],
"Resource": "arn:aws:s3:::YOUR_S3_BUCKET/output/*"
},
{
"Sid": "AllowSQSTrigger",
"Effect": "Allow",
"Action": [
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes"
],
"Resource": "arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue"
}
]
}
This IAM policy gives a Lambda function the ability to:
Write logs to CloudWatch (for monitoring).
Read the Textract job results.
Save processed data into S3.
Consume and delete messages from an SQS queue that delivers Textract job notifications.
Step 3.2: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the AsyncProcessTextractJobResultsPolicy to the permissions policies and save. You can name it as you prefer. For this example, it will be named AsyncProcessTextractJobResultsRole.
Step 3.3: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Name: AsyncProcessTextractJobResults (this can be renamed, but remember to update the log group permissions in the policy)
- Runtime: Python
- Execution Role: AsyncProcessTextractJobResultsRole (or the role name you created in Step 3.2)
- Environment Variables:
  - SQS_QUEUE_URL: https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue (from Step 2.6)
- Timeout: 30s (this can be increased, but make sure the SQS queue visibility timeout matches it)
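If you raise the function timeout later, you can keep the queue's visibility timeout aligned from the CLI; a sketch using the queue from Step 2.6 (the 60-second value is just an example):
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue \
  --attributes VisibilityTimeout=60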
Paste the following Python code in the Code tab:
import json
import boto3
import os

s3_client = boto3.client("s3")
textract_client = boto3.client("textract")
sqs_client = boto3.client("sqs")

SQS_QUEUE_URL = os.environ['SQS_QUEUE_URL']  # add this env var for your queue

def get_textract_job_info(record):
    # Get the Textract job info from the SNS/SQS event
    message = json.loads(record["body"])
    textract_message = json.loads(message["Message"])

    job_id = textract_message["JobId"]
    status = textract_message["Status"]
    object_name = textract_message["DocumentLocation"]["S3ObjectName"]
    bucket_name = textract_message["DocumentLocation"]["S3Bucket"]

    return {
        "job_id": job_id,
        "status": status,
        "bucket_name": bucket_name,
        "object_name": object_name
    }

def get_all_textract_blocks(job_id):
    # Get raw Textract output
    response = textract_client.get_document_text_detection(JobId=job_id)

    # Collect all text blocks, following pagination via NextToken
    raw_text = []
    while True:
        for block in response["Blocks"]:
            if block["BlockType"] == "LINE":
                raw_text.append(block["Text"])

        if "NextToken" in response:
            response = textract_client.get_document_text_detection(
                JobId=job_id, NextToken=response["NextToken"]
            )
        else:
            break

    return "\n".join(raw_text)

def get_base_filename(object_name):
    return os.path.splitext(os.path.basename(object_name))[0]

def delete_sqs_message(receipt_handle):
    # Delete message from SQS once done
    sqs_client.delete_message(
        QueueUrl=SQS_QUEUE_URL,
        ReceiptHandle=receipt_handle
    )

def lambda_handler(event, context):
    print("Event:", json.dumps(event))

    for record in event["Records"]:
        job_info = get_textract_job_info(record)
        job_id = job_info["job_id"]
        status = job_info["status"]
        bucket_name = job_info["bucket_name"]
        object_name = job_info["object_name"]

        if status != "SUCCEEDED":
            print(f"Textract job {job_id} failed with status: {status}")
            # Delete the message anyway, so it doesn't retry forever
            delete_sqs_message(record["receiptHandle"])
            continue

        text = get_all_textract_blocks(job_id)

        # Save result to S3 with the same filename but a .txt extension
        base_filename = get_base_filename(object_name)
        output_key = f"output/{base_filename}.txt"

        s3_client.put_object(
            Bucket=bucket_name,
            Key=output_key,
            Body=text.encode("utf-8"),
            ContentType="text/plain"
        )
        print(f"✅ Processed Textract result saved at s3://{bucket_name}/{output_key}")

        delete_sqs_message(record["receiptHandle"])
        print(f"🗑️ Deleted message from SQS: {record['messageId']}")

    return {"statusCode": 200, "body": "Processing complete"}
Step 3.4: Add an SQS Queue Trigger to the AWS Lambda Function
Option 1: Create the event source mapping via AWS CLI (SQS → Lambda)
aws lambda create-event-source-mapping \
--function-name AsyncProcessTextractJobResults \
--batch-size 10 \
--event-source-arn arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue
Explanation:
- batch-size: the maximum number of SQS messages passed per Lambda invocation (default 10, max 10,000).
- event-source-arn: the ARN of your SQS queue.
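You can confirm the mapping was created and is enabled with:
aws lambda list-event-source-mappings \
  --function-name AsyncProcessTextractJobResults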
Option 2: Using the AWS Console (UI)
Open the Lambda Console → https://console.aws.amazon.com/lambda
Select your Lambda function.
Go to the Configuration tab.
Under Triggers → click Add trigger.
Choose SQS from the list.
Select your SQS queue.
(Optional) Set Batch size (default 10).
Click Add.
When to use Sync vs Async
Synchronous (Detect/Analyze — direct reply)
Best for: small, single‑page images (JPG/PNG/TIFF) or very small PDFs where the user is waiting for a response.
Pros: simple, fast to wire up, easy to expose via API Gateway.
Trade‑offs: request/response timeouts, Lambda runtime limits, not great for multi‑page or large files.
Asynchronous (Start/Get with SNS+SQS)
Best for: multi‑page PDFs/TIFFs, batches, or workloads where you don’t need an immediate response.
Pros: resilient, scalable, cost‑effective; decoupled with retries and DLQs.
Trade‑offs: more moving parts (SNS, SQS, extra Lambda), eventual consistency.
Best Practices
Use SQS between SNS and Lambda for reliability.
Break large documents into smaller batches when possible.
Encrypt sensitive documents in S3.
Monitor Textract usage with CloudWatch Metrics.
Use DLQs (Dead Letter Queues) for failed message processing.
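For the DLQ recommendation, a rough CLI sketch (the queue name textract-results-dlq and the maxReceiveCount of 5 are just example values):
# Create the dead-letter queue
aws sqs create-queue --queue-name textract-results-dlq

# Point the main queue at it: after 5 failed receives, messages move to the DLQ
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-dlq\",\"maxReceiveCount\":\"5\"}"}'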
Conclusion
Integrating AWS Lambda with Amazon Textract enables the creation of powerful document-processing pipelines with minimal infrastructure management. For small documents, the synchronous flow is simple and effective. For larger workloads, the asynchronous flow, combined with SNS and SQS, ensures reliability and scalability.
In this example, we saved the extracted text as a file in Amazon S3, but the same output could just as easily be stored in a database, or even used directly as input to an Amazon Bedrock model for downstream tasks such as summarization, classification, or question answering.
By following the steps outlined in this guide, you now have the foundation to build automated document extraction systems tailored to your business needs.