API Timeouts:

Why should we implement timeouts?

When developing a backend system, you will likely need to integrate it with other internal backend systems or third-party APIs. Suppose the Orders API needs information from the Users API, but the Users service is unavailable or slow to respond. How long should the Orders API wait before giving up? If no timeout is set, the Orders API may wait indefinitely. Implementing API timeouts is crucial because it prevents long-running or unresponsive requests from degrading a system's performance and availability.

The Orders and Users services

For demonstration, I have created a dockerized FastAPI app with two services, one for Orders and the other one for Users. The Orders service calls the Users service to get the user data for an order.

I'll use the requests library in Python for making API calls.

The Orders service has an in-memory database which is a list of dictionaries that stores all the orders. Each order has an id, a user ID and the product title.

orders_db = [
    {'id': 1, 'user': 1, 'product': 'Apple IPhone'},
    {'id': 2, 'user': 2, 'product': 'Macbook'},
    {'id': 3, 'user': 1, 'product': 'Wireless Charger'},
    {'id': 4, 'user': 2, 'product': 'Book 1'}
]

The Orders service exposes the endpoint GET /orders/?user_id=<user_id>, implemented by get_orders(user_id: int) -> dict, which returns all the orders created by the specified user in the format below.

For example, the request GET /orders/?user_id=1 returns all orders placed by user ID 1.

{
    "user" : {
        "id": 1,
        "name": "Aaron Apple",
        "phone": "99992222"
    },
    "orders": [
        {
            "id": 1,
            "user": 1,
            "product": "Apple IPhone"
        },
        {
            "id": 3,
            "user": 1,
            "product": "Wireless Charger"
        }
    ]
}

How to implement a timeout?

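Implementing a timeout with the requests library is straightforward: pass the timeout parameter with a value in seconds. Note that this value limits how long the client waits for the server to connect and to send data; it is not a limit on the total download time.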

import requests
API_TIMEOUT_IN_SECS = 10
r = requests.get('<URL>', timeout=API_TIMEOUT_IN_SECS)

In our example, the call_api function wraps requests.get(). It has a parameter api_timeout_in_secs, which defaults to 10 seconds.

The get_user_data function takes the user_id as an argument. It builds the Users API URL and passes it to call_api along with an api_timeout_in_secs of 3 seconds.

import requests
from fastapi import FastAPI, HTTPException

app = FastAPI()

orders_db = [
    {'id': 1, 'user': 1, 'product': 'Apple IPhone'},
    {'id': 2, 'user': 2, 'product': 'Macbook'},
    {'id': 3, 'user': 1, 'product': 'Wireless Charger'},
    {'id': 4, 'user': 2, 'product': 'Book 1'}
]


class UserNotFound(HTTPException):
    status_code = 404

    def __init__(self, detail: str) -> None:
        super().__init__(self.status_code, detail=detail)


def call_api(url: str, api_timeout_in_secs: int = 10):
    return requests.get(url=url, timeout=api_timeout_in_secs)


def get_user_data(user_id: int) -> dict:
    try:
        response = call_api(
            url=f'http://user_api_1:8200/users/{user_id}?fromService=order',
            api_timeout_in_secs=3
        )
    except requests.exceptions.Timeout:
        # Timeout covers both ConnectTimeout and ReadTimeout
        raise UserNotFound('User not found: User API timed out')

    if response.status_code == 200:
        return response.json()

    if response.status_code == 404:
        return {'id': user_id}

    raise UserNotFound('User not found: User API failed')


@app.get('/orders/')
def get_orders(user_id: int) -> dict:
    orders = list(filter(lambda item: item['user'] == user_id, orders_db))
    if not orders:
        raise HTTPException(
            status_code=404, detail='No orders found for this user ID')
    return {
        'user': get_user_data(user_id=user_id),
        'orders': orders,
    }

The Orders service will now wait at most 3 seconds for the Users service to respond, after which requests raises requests.exceptions.ReadTimeout. The Orders service handles this exception gracefully by raising the custom exception UserNotFound('User not found: User API timed out'), which returns a 404 response to the caller.
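
The requests library distinguishes between a server that accepts the connection but responds too slowly (ReadTimeout) and a connection that cannot be established in time (ConnectTimeout). Both derive from requests.exceptions.Timeout. A minimal sketch of how these cases could be told apart follows; the classify_failure helper is hypothetical and not part of the project.

```python
import requests


def classify_failure(exc: Exception) -> str:
    """Map a requests exception to a short, human-readable reason."""
    # Check Timeout first: ConnectTimeout inherits from both
    # ConnectionError and Timeout.
    if isinstance(exc, requests.exceptions.Timeout):
        return 'timed out'
    if isinstance(exc, requests.exceptions.ConnectionError):
        return 'unreachable'
    return 'other'


print(classify_failure(requests.exceptions.ReadTimeout()))      # timed out
print(classify_failure(requests.exceptions.ConnectTimeout()))   # timed out
print(classify_failure(requests.exceptions.ConnectionError()))  # unreachable
```

Catching the Timeout base class is one way to cover both timeout cases with a single except clause.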

Benefits of implementing timeouts

Timeouts improve reliability, since the system no longer hangs on long-running or unresponsive requests.
Setting timeouts can help prevent the server from being tied up waiting for a request to complete. This can help free up resources for other requests, improving the overall efficiency of the system.
Timeouts can help prevent cascading failures that could impact the entire system.
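Setting a timeout for outgoing API requests limits how long a slow or unresponsive dependency can hold on to connections and workers. This makes it harder for a flood of slow requests, as in some denial-of-service scenarios, to exhaust the server's resources and leave it unresponsive.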

API Retries:

Why should we implement API retries?

API retries are important to implement because they can help improve the resilience and reliability of a system by handling transient errors that can occur during communication with other systems or services.

When a client sends a request to an API, there is a chance that the request may not be fulfilled due to various reasons, such as network latency, server load, or dependencies on external services. In some cases, these errors are transient, meaning they are temporary and can be resolved by retrying the request after a certain interval of time.
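
Independent of any particular HTTP library, the idea can be sketched as a small loop that retries a failing operation after a growing delay. This is only an illustration: the names TransientError, retry and flaky_lookup are hypothetical, and the delays are arbitrary values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation: Callable[[], T], attempts: int = 3,
          base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the wait after every failure: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError('attempts must be at least 1')


calls = {'count': 0}


def flaky_lookup() -> dict:
    """Fails twice with a transient error, then succeeds."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise TransientError('temporary failure')
    return {'id': 1}


waits = []  # record the delays instead of actually sleeping
result = retry(flaky_lookup, attempts=5, sleep=waits.append)
print(result)  # {'id': 1}
print(waits)   # [0.5, 1.0]
```

Passing the sleep function as a parameter keeps the sketch easy to test without real waiting.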

How to implement API retry using the requests library?

Create a class called RetryManager that builds an HTTPAdapter whose max_retries is set to a Retry instance, and returns a requests Session with that adapter mounted. Calling .get() or .post() on this session will automatically retry failed requests according to the Retry settings.

import requests
from requests.adapters import HTTPAdapter, Retry


class RetryManager:

    def __init__(self, num_retries: int = 3) -> None:
        self.num_retries = num_retries

    def get_session(self):
        session = requests.Session()

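        # With a factor of 1, the first retry is immediate and the following
        # ones wait 2s, 4s, 8s and 16s
        BACKOFF_FACTOR = 1
        # 404 is left out: a missing user is not a transient error
        STATUS_CODES_TO_RETRY_ON = [502, 503, 504]
        # POST is not idempotent; retry it only where repeating is safe
        METHODS_TO_RETRY_ON = ['GET', 'POST', 'OPTIONS']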

        retries = Retry(
            total=self.num_retries,
            backoff_factor=BACKOFF_FACTOR,
            status_forcelist=STATUS_CODES_TO_RETRY_ON,
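            # allowed_methods replaces method_whitelist, which was deprecated
            # in urllib3 1.26 and removed in 2.0
            allowed_methods=frozenset(METHODS_TO_RETRY_ON)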
        )

        adapter = HTTPAdapter(max_retries=retries)
        session.mount('http://', adapter)
        session.mount('https://', adapter)

        return session


def call_api_with_retry(url: str, api_timeout_in_secs: int = 10):
    retry_session = RetryManager(
        num_retries=5
    ).get_session()
    return retry_session.get(url=url, timeout=api_timeout_in_secs)

To test the retries, we will make the get_user API in the User service return a 502.

@app.get('/users/{user_id}')
def get_user(user_id: int):
    raise HTTPException(status_code=502, detail="Service down")

Now we update the get_user_data function in the Orders service to call the call_api_with_retry function.

def get_user_data(user_id: int) -> dict:
    try:
        response = call_api_with_retry(
            url=f'http://user_api_1:8200/users/{user_id}?fromService=order',
            api_timeout_in_secs=10
        )
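    except requests.exceptions.Timeout:
        # Timeout covers both ConnectTimeout and ReadTimeout
        raise UserNotFound('User not found: User API timed out')
    except (requests.exceptions.RetryError,
            requests.exceptions.ConnectionError):
        # RetryError: retries exhausted on a retryable status code;
        # ConnectionError: the service could not be reached
        raise UserNotFound('User not found: User service may be down')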

    if response.status_code == 200:
        return response.json()

    if response.status_code == 404:
        return {'id': user_id}

    raise UserNotFound('User not found: User API failed')

Now when we call GET /orders/?user_id=1, the Orders service internally calls the Users service. Since the Users service returns a 502 every time, the request is retried 5 times. The first retry is immediate, the second follows after 2 seconds, and the remaining retries after 4, 8 and 16 seconds.

Check the user service logs below (1 request + 5 retried requests).

api_1 | 01/05/2023 11:45:37 INFO - "GET /users/1?fromService=order HTTP/1.1" 502
api_1 | 01/05/2023 11:45:37 INFO - "GET /users/1?fromService=order HTTP/1.1" 502
api_1 | 01/05/2023 11:45:39 INFO - "GET /users/1?fromService=order HTTP/1.1" 502
api_1 | 01/05/2023 11:45:43 INFO - "GET /users/1?fromService=order HTTP/1.1" 502
api_1 | 01/05/2023 11:45:51 INFO - "GET /users/1?fromService=order HTTP/1.1" 502
api_1 | 01/05/2023 11:46:07 INFO - "GET /users/1?fromService=order HTTP/1.1" 502
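
The gaps between these log lines can be reproduced with a few lines of arithmetic. The sketch below assumes the behaviour visible in the logs: the first retry is sent immediately and each later retry waits backoff_factor * 2 ** (retry_number - 1) seconds. The backoff_schedule helper is hypothetical, and it ignores the upper limit that urllib3 places on long delays.

```python
def backoff_schedule(backoff_factor: float, num_retries: int) -> list:
    """Seconds slept before each retry, mirroring the delays seen in the logs."""
    delays = []
    for retry_number in range(1, num_retries + 1):
        if retry_number == 1:
            delays.append(0)  # the first retry is sent immediately
        else:
            delays.append(backoff_factor * 2 ** (retry_number - 1))
    return delays


schedule = backoff_schedule(backoff_factor=1, num_retries=5)
print(schedule)       # [0, 2, 4, 8, 16]
print(sum(schedule))  # 30
```

The total of 30 seconds matches the span between the first and last log entries (11:45:37 to 11:46:07).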

Benefits

By retrying API requests that fail due to transient errors, the system can increase the likelihood of requests being successfully fulfilled, improving overall reliability.
By handling temporary failures automatically through retries, the system can maintain availability and reduce downtime for users or customers.
Retrying failed API requests automatically can help reduce the need for manual intervention and troubleshooting, reducing operational costs and freeing up resources for other tasks.

To implement timeouts and retries effectively, it is essential to carefully determine the parameters, such as the timeout value in seconds, number of retries, backoff_factor, and other relevant factors, based on the specific requirements of your system.
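
One way to reason about these parameters together is to estimate how long a caller might wait in the worst case. The sketch below is a rough estimate under stated assumptions, not a guarantee: it assumes every attempt is cut off after about the timeout value, that the first retry is immediate, and that later retries wait backoff_factor * 2 ** (retry_number - 1) seconds. The worst_case_seconds helper is hypothetical.

```python
def worst_case_seconds(timeout: float, num_retries: int,
                       backoff_factor: float) -> float:
    """Rough upper bound on the time spent on one logical request."""
    attempts = num_retries + 1  # the original request plus every retry
    # The first retry is immediate, so backoff starts at the second retry.
    backoff = sum(backoff_factor * 2 ** (n - 1)
                  for n in range(2, num_retries + 1))
    return attempts * timeout + backoff


# Values from the retry example: 10s timeout, 5 retries, backoff factor 1
print(worst_case_seconds(timeout=10, num_retries=5, backoff_factor=1))  # 90

# Values from the timeout-only example: 3s timeout, no retries
print(worst_case_seconds(timeout=3, num_retries=0, backoff_factor=1))   # 3
```

If the resulting number is longer than the caller of the Orders API is willing to wait, the timeout, the number of retries, or the backoff factor probably needs to come down.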

Implementing API timeouts and retries can have a significant impact on a system, improving its reliability, performance, and user experience.

Link to the project: https://github.com/akshays94/python-api-retry

I hope you found this helpful. Thanks for reading!

Implementing API Timeouts and Retries in Python