Implementing Azure AI Vision with Python: OCR, Object Detection, Tagging & Captioning

Introduction

Computer Vision is a branch of Artificial Intelligence that allows machines to see, analyze, and understand images and videos. Azure AI Vision provides pre-built models through simple APIs, enabling developers to integrate features like object detection, OCR, tagging, and caption generation into their applications.

In this blog, we’ll walk through how to implement Azure AI Vision using Python, showcasing its major features with practical code examples.

Azure AI Vision & Use Cases

Instead of spending time training deep learning models from scratch, Azure AI Vision provides ready-to-use models that can:

  • Detect objects in images

  • Extract text using OCR

  • Automatically generate tags

  • Create human-like captions describing images

This makes it extremely useful for building real-world AI solutions quickly and at scale.

How It Helps LLMs

When combined with Large Language Models (LLMs), these visual capabilities make AI systems multimodal:

  • LLMs can reason not just on text, but also on what’s inside an image.

  • Example: Azure Vision extracts text from an invoice (OCR) → LLM interprets and summarizes it → business system updates automatically.

Real-World Use Case

Imagine a retail company managing thousands of product images:

  • Azure Vision tags and captions products for search and cataloging.

  • OCR extracts text from product labels.

  • LLMs use this extracted data to auto-generate product descriptions for e-commerce.

This saves time, improves accuracy, and makes systems more intelligent.

Python Implementation

1- Analyzing an Image with Azure AI Vision in Python (Image Tagging)

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential
import os
import json
from dotenv import load_dotenv

load_dotenv("env.env")

endpoint="https://az-computer-vision1.cognitiveservices.azure.com/"
key=os.getenv("AI_VISION_KEY")

client = ImageAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open("ai_vision_test.jpg", "rb") as image:
    image_details=image.read()

response=client.analyze(
    image_data=image_details,
    visual_features=[VisualFeatures.TAGS, VisualFeatures.CAPTION]

)

print(json.dumps(response.as_dict(), indent=4))

Code Walkthrough

  1. Importing Required Libraries

    • azure.ai.vision.imageanalysis provides the ImageAnalysisClient to interact with the Azure AI Vision service.

    • VisualFeatures defines which features (Tags, Captions, OCR, etc.) we want to extract.

    • AzureKeyCredential securely passes our API key to the client.

    • dotenv is used to safely load credentials stored in an .env file.

  2. Setting Up Authentication

    • The endpoint is the URL of your Azure AI Vision resource.

    • The key is stored in an environment file (env.env) for security, instead of hardcoding it into the script.

  3. Creating the Client

    • The ImageAnalysisClient connects to Azure’s AI Vision service using the endpoint and API key.
  4. Reading the Image

    • The image (ai_vision_test.jpg) is read in binary format because the API expects raw bytes of the image.
  5. Analyzing the Image

    • We call the analyze() method and pass two visual features:

      • VisualFeatures.TAGS → Returns a list of keywords that describe the image.

      • VisualFeatures.CAPTION → Generates a human-readable caption summarizing the image.

  6. Printing Results

    • The response is converted into a dictionary and printed in a nicely formatted JSON output using json.dumps().

    • This makes it easy to see what Azure Vision has detected in the image.

2- Object Detection with Azure AI Vision in Python

Another powerful feature of Azure AI Vision is object detection. It not only identifies what objects are present in an image but also returns their locations using bounding boxes. This is especially useful in applications like surveillance, manufacturing defect detection, and retail product recognition.

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential
import os
import json
from dotenv import load_dotenv

load_dotenv("env.env")

endpoint="https://az-computer-vision1.cognitiveservices.azure.com/"
key=os.getenv("AI_VISION_KEY")

client = ImageAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open("fruit_bucket.png", "rb") as image:
    image_details=image.read()

response=client.analyze(
    image_data=image_details,
    visual_features=[VisualFeatures.OBJECTS]    # detects objects in the image with bounding boxes

)

print(json.dumps(response.as_dict(), indent=4))

Code Walkthrough

  1. Importing Libraries
    We bring in the same core classes (ImageAnalysisClient, VisualFeatures, and AzureKeyCredential) used earlier for tagging and captions.

  2. Authentication

    • The endpoint points to your Azure AI Vision resource.

    • The key is securely loaded from an .env file.

  3. Image Input

    • We load the image fruit_bucket.png in binary format so it can be sent to the API.
  4. Object Detection Request

    • VisualFeatures.OBJECTS tells Azure Vision to detect all visible objects in the image.

    • The API responds with a list of objects, their confidence scores, and bounding box coordinates.

  5. Readable Output

    • The response.as_dict() method returns the structured result as a dictionary.

    • json.dumps() formats it neatly so we can see exactly what was detected.

3- Extracting Text from Images with Azure AI Vision (OCR)

Azure AI Vision provides powerful OCR (Optical Character Recognition) capabilities. With this feature, we can detect both printed and handwritten text from images and documents. This is especially useful in scenarios like digitizing scanned documents, extracting text from receipts, or reading quotes from images.

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential
import os
import json
from dotenv import load_dotenv

load_dotenv("env.env")

endpoint="https://az-computer-vision1.cognitiveservices.azure.com/"
key=os.getenv("AI_VISION_KEY")

client = ImageAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open("quote.jpg", "rb") as image:
    image_details=image.read()

response=client.analyze(
    image_data=image_details,
    visual_features=[VisualFeatures.READ]    # detects text in the image using Optical Character Recognition (OCR)

)

for line in response.read.blocks[0].lines:
    print(line. Text)

Code Walkthrough

  1. Authentication

    • The endpoint and key are loaded from the .env file to keep secrets secure.

    • ImageAnalysisClient is initialized to interact with Azure AI Vision.

  2. Reading the Image

    • The file quote.jpg is loaded in binary mode before sending it to the API.
  3. Performing OCR

    • The VisualFeatures.READ option tells Azure Vision to run OCR.

    • The response contains detected text blocks, lines, and even word-level details if needed.

  4. Extracting Results

    • We loop through response.read.blocks[0].lines and print each line of detected text.
0
Subscribe to my newsletter

Read articles from Ritesh Kumar Nayak directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Ritesh Kumar Nayak
Ritesh Kumar Nayak

Passionate about helping organizations build scalable infrastructure and DevOps solutions with cloud technologies. Experienced in designing robust systems, automating processes, and driving efficiency through innovative cloud solutions. Advocate for best practices in DevOps and cloud computing, committed to enabling teams to achieve their full potential.