Triton Response Cache for TensorRT models
Introduction
Triton Response Cache
Triton Response Cache (referred to as the Cache from now on) is a feature of NVIDIA’s Triton Inference Server that stores the response to a model inference request so that, if the same request comes in again, the server can return the saved response immediately without redoing the computation.
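A model opts into the Cache through the response_cache field of its model configuration, and the cache implementation itself is selected with the --cache-config flag when starting tritonserver; both are shown in the demo below.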
Known Limitations
The official documentation lists several known limitations; the one we address in this article is that the Cache only supports models whose input and output tensors are located in CPU memory.
Methodology
Since a TensorRT model has both its input and output tensors located on the GPU, the Cache does not support it. If you are thinking we can make the Cache support it by wrapping the TensorRT model with a model that runs on the CPU, then you guessed right. Triton Business Logic Scripting (BLS) to the rescue.
Triton Business Logic Scripting
Triton BLS supports intermixing custom logic with model execution. It can be implemented as a Triton Python model thanks to a set of utility functions, introduced in version 21.08, that enable executing inference requests on other models being served by Triton as part of the execution step of the Python model.
Triton Response Cache for TensorRT models
Our strategy for supporting response caching for a TensorRT model is to wrap it with a Python model that has the exact same input and output configuration. Inside the Python model, incoming requests are passed through to the TensorRT model, and the responses from the TensorRT model are passed back as the Python model’s responses.
Demo
Building Triton server
For this demonstration we need to build a Triton server image with the TensorRT and Python backends and the response cache enabled.
$ export TRITON_VERSION=24.03
$ git clone -b "r${TRITON_VERSION}" https://github.com/triton-inference-server/server.git triton-server
$ cd triton-server
$ python3 compose.py \
--output-name=triton-server:${TRITON_VERSION} \
--backend=python \
--backend=tensorrt \
--enable-gpu \
--repoagent=checksum \
--cache=local \
--container-version=${TRITON_VERSION}
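The result is a Docker image tagged triton-server:${TRITON_VERSION} that contains only the Python and TensorRT backends, the checksum repository agent, and the local cache implementation.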
Create model repository
Create one model folder for the TensorRT model and another for the BLS wrapper model. The model repository structure is as follows:
/path/to/model-repository/
|-- model-trt/
| |-- 1/
| | |-- model.plan
| |-- config.pbtxt
|-- model-bls/
| |-- 1/
| | |-- model.py
| |-- config.pbtxt
TensorRT model
We can use any TensorRT model. In this demo, we use the intfloat/multilingual-e5-large model, which is a relatively large model (hence caching is helpful in some use cases). Its model configuration file config.pbtxt is as follows:
platform: "tensorrt_plan"
max_batch_size: 8
dynamic_batching: {}
input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: [512]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT32
  dims: [512]
}
output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [512, 1024]
}
BLS model
The BLS model config should be the same as that of the TensorRT model except for the backend and the response_cache configuration:
backend: "python"
max_batch_size: 8
dynamic_batching: {}
response_cache {
  enable: true
}
input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: [512]
}
input {
  name: "attention_mask"
  data_type: TYPE_INT32
  dims: [512]
}
output {
  name: "last_hidden_state"
  data_type: TYPE_FP32
  dims: [512, 1024]
}
The model.py, when executed, invokes the TensorRT model with the inputs it receives and returns the TensorRT model’s response as its own:
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Get the input_ids tensor
            in_0 = pb_utils.get_input_tensor_by_name(request, "input_ids")
            # Get the attention_mask tensor
            in_1 = pb_utils.get_input_tensor_by_name(request, "attention_mask")
            # Create an inference request for the wrapped TensorRT model
            infer_request = pb_utils.InferenceRequest(
                model_name="model-trt",
                requested_output_names=["last_hidden_state"],
                inputs=[in_0, in_1],
            )
            # Perform a synchronous blocking inference request
            infer_response = infer_request.exec()
            # Propagate any error from the TensorRT model
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())
            # Pass the TensorRT model's output tensors back as this model's response
            inference_response = pb_utils.InferenceResponse(
                output_tensors=infer_response.output_tensors()
            )
            responses.append(inference_response)
        return responses
Run triton-server with the model repository
$ docker run --gpus=1 --rm \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /path/to/model-repository:/models \
triton-server:${TRITON_VERSION} \
tritonserver --model-repository=/models --log-verbose=2 --cache-config=local,size=104857600
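Note that --cache-config=local,size=104857600 gives the local cache 100 MiB of CPU memory. Each cached output of this model is roughly 512 × 1024 × 4 bytes ≈ 2 MiB per sequence, so the cache can hold on the order of 50 cached responses at a time.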
Now we have the Triton server running, with the responses of the BLS wrapper (and therefore of the TensorRT model it calls) cached in CPU memory.
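To verify the cache, we can send the identical request to model-bls twice; the second response should be served from the cache. The sketch below is not from the original article: it uses the tritonclient HTTP client (pip install tritonclient[http]) and all-ones placeholder token values purely for illustration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Fixed inputs so that both requests produce an identical cache key.
input_ids = np.ones((1, 512), dtype=np.int32)
attention_mask = np.ones((1, 512), dtype=np.int32)

inputs = [
    httpclient.InferInput("input_ids", [1, 512], "INT32"),
    httpclient.InferInput("attention_mask", [1, 512], "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

# The second call should be answered from the cache without running the TensorRT model.
for i in range(2):
    result = client.infer(model_name="model-bls", inputs=inputs)
    print(i, result.as_numpy("last_hidden_state").shape)
The cache counters exposed on the metrics endpoint (for example, curl localhost:8002/metrics | grep -i cache) should confirm that the second request was a cache hit.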