Triton Response Cache for TensorRT models

aikic0der

Introduction

Triton Response Cache

Triton Response Cache (referred to as the Cache from now on) is a feature of NVIDIA’s Triton Inference Server that stores the response to a model inference request so that, when an identical request arrives again, the server can return the saved response immediately instead of re-running the inference.

Known Limitations

The official documentation lists several known limitations. The one we address in this article is that the Cache only supports models whose input and output tensors reside in CPU memory.

Methodology

Since a TensorRT model has both its input and output tensors located on the GPU, the Cache does not support it. If you are thinking we can work around this by wrapping the TensorRT model with a model that runs on the CPU, you guessed right. Triton Business Logic Scripting (BLS) to the rescue.

Triton Business Logic Scripting

Triton BLS supports intermixing custom logic with model execution. It can be implemented as a Triton Python model thanks to a set of utility functions, introduced in version 21.08, that enable executing inference requests on other models served by Triton as part of the Python model’s execution step.

Triton Response Cache for TensorRT models

Our strategy for supporting response caching for a TensorRT model is to wrap it with a Python model that has the exact same input and output configuration. Inside the Python model, incoming requests are passed through to the TensorRT model, and the responses from the TensorRT model are passed back as the Python model’s responses.

Demo

Building Triton server

We need to build a Triton server image with the TensorRT and Python backends and the response cache enabled for this demonstration.

$ export TRITON_BRANCH=r24.03    # git branch of the Triton server repository
$ export TRITON_VERSION=24.03    # matching container version
$ git clone -b "${TRITON_BRANCH}" https://github.com/triton-inference-server/server.git triton-server
$ cd triton-server
$ python3 compose.py \
  --output-name=triton-server:${TRITON_VERSION} \
  --backend=python \
  --backend=tensorrt \
  --enable-gpu \
  --repoagent=checksum \
  --cache=local \
  --container-version=${TRITON_VERSION}

Create model repository

Create a model folder for the TensorRT model and another for the BLS wrapper model. The model repository structure is as follows:

/path/to/model-repository/
|-- model-trt/
|   |-- 1/
|   |   |-- model.plan
|   |-- config.pbtxt
|-- model-bls/
|   |-- 1/
|   |   |-- model.py
|   |-- config.pbtxt

TensorRT model

We can use any TensorRT model. In this demo, we use the intfloat/multilingual-e5-large model, which is relatively large (hence caching is helpful in some use cases). Its model configuration file config.pbtxt is as follows:

platform: "tensorrt_plan"
max_batch_size: 8
dynamic_batching: {}

input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: [512]
}

input {
  name: "attention_mask"
  data_type: TYPE_INT32
  dims: [512]
}

output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [512, 1024]
}
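
The demo assumes a serialized engine (model.plan) already exists in model-trt/1/. If you need to produce one, here is a minimal sketch using the TensorRT Python API, assuming the model has already been exported to ONNX as model.onnx; the paths and optimization-profile shapes are illustrative and should be adapted to your setup.

import tensorrt as trt

ONNX_PATH = "model.onnx"              # hypothetical path to the ONNX export
PLAN_PATH = "model-trt/1/model.plan"  # destination inside the model repository

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model into a TensorRT network definition
with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Dynamic batch dimension (1..8 to match max_batch_size), fixed sequence length of 512
config = builder.create_builder_config()
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 512), (4, 512), (8, 512))
profile.set_shape("attention_mask", (1, 512), (4, 512), (8, 512))
config.add_optimization_profile(profile)

# Build and serialize the engine into the model repository
engine_bytes = builder.build_serialized_network(network, config)
with open(PLAN_PATH, "wb") as f:
    f.write(engine_bytes)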

BLS model

The BLS model config should be the same as that of the TensorRT model except for the backend and the response_cache configuration:

backend: "python"
max_batch_size: 8
dynamic_batching: {}

response_cache {
  enable: true
}

input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: [512]
}

input {
  name: "attention_mask"
  data_type: TYPE_INT32
  dims: [512]
}

output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [512, 1024]
}

The model.py, when executed, invokes the TensorRT model with the inputs it receives and returns the TensorRT model’s response as its own:

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Get the input_ids tensor
            in_0 = pb_utils.get_input_tensor_by_name(request, "input_ids")
            # Get the attention_mask tensor
            in_1 = pb_utils.get_input_tensor_by_name(request, "attention_mask")

            # Create an inference request for the wrapped TensorRT model
            infer_request = pb_utils.InferenceRequest(
                model_name="model-trt",
                requested_output_names=["last_hidden_state"],
                inputs=[in_0, in_1],
            )

            # Perform a synchronous, blocking inference request
            infer_response = infer_request.exec()
            # Raise an error if the TensorRT model returned one
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())

            # Pass the TensorRT model's outputs back as this model's response
            inference_response = pb_utils.InferenceResponse(
                output_tensors=infer_response.output_tensors()
            )
            responses.append(inference_response)
        return responses

Run triton-server with the model repository

$ docker run --gpus=all --rm \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /path/to/model-repository:/models \
    triton-server:${TRITON_VERSION} \
    tritonserver --model-repository=/models --log-verbose=2 --cache-config=local,size=104857600

Now we have the Triton server running, with responses for the wrapped TensorRT model cached in memory.
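
To verify the setup, you can send the same request twice to the BLS model: the first call runs the TensorRT engine, while the identical second call should be served from the cache. Below is a minimal sketch using the Triton HTTP client (tritonclient), with dummy inputs; the model name, tensor names, and port come from the demo above, and the timing is only for illustration.

import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy inputs matching the model config (batch of 1, sequence length 512)
input_ids = np.zeros((1, 512), dtype=np.int32)
attention_mask = np.ones((1, 512), dtype=np.int32)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT32"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

# The second, identical request should be noticeably faster (cache hit)
for i in range(2):
    start = time.perf_counter()
    result = client.infer("model-bls", inputs)
    elapsed = time.perf_counter() - start
    print(f"call {i + 1}: {elapsed * 1000:.1f} ms, output shape "
          f"{result.as_numpy('last_hidden_state').shape}")

Cache hit and miss counters are also exposed on the metrics endpoint (port 8002), such as the nv_cache_num_hits_per_model metric, which should increase after the second call.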
