Some Notes About NVIDIA TensorRT

Winston Liu

Although I have been using the TensorRT framework for a few months, I haven't developed a systematic understanding of it as a whole.

So here it is: I am going to note down some facts about this fantastic machine learning inference tool.

What is TensorRT

NVIDIA TensorRT is an SDK that facilitates high-performance machine learning inference on NVIDIA hardware platforms, especially GPUs. It is designed to work in a complementary fashion with existing training frameworks such as TensorFlow, PyTorch, and MXNet.

The Programming Model

TensorRT provides both C++ and Python APIs. The C++ API is available on all supported GPU platforms, but the Python API is not.

TensorRT operates in two stages. In the first stage, usually performed offline, we provide TensorRT with a model definition and it optimizes the model for the target GPU. In the second stage, we use the optimized model to run inference.

The Build Stage

The highest-level interface of the build stage of TensorRT is the Builder, which is responsible for optimizing the model and producing an Engine.

To build a TensorRT engine, we first specify the network definition, then configure the Builder, and finally call the Builder to create the engine.

The NetworkDefinition interface is used to define the model. The most common path to bring a model into TensorRT is to export it from a framework in ONNX format, and then use TensorRT's ONNX parser to populate the network definition.
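As a minimal sketch with the Python API (assuming a local file named model.onnx and a recent TensorRT version), parsing an ONNX model into a network definition might look like this:

```python
import tensorrt as trt

# Create a logger, a builder, and an empty network definition.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Populate the network definition from an ONNX file (the path is hypothetical).
parser = trt.OnnxParser(network, logger)
if not parser.parse_from_file("model.onnx"):
    for i in range(parser.num_errors):
        print(parser.get_error(i))
    raise RuntimeError("Failed to parse the ONNX model")
```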

Of course, we can also construct a model step by step using TensorRT's Layer and Tensor interfaces.
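A hand-built network might look roughly like the following sketch; the layer parameters and weights here are placeholders made up for the example:

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Declare an input tensor, add a convolution followed by a ReLU activation,
# and mark the activation's output as the network output.
x = network.add_input("input", trt.float32, (1, 3, 224, 224))
w = trt.Weights(np.random.rand(16, 3, 3, 3).astype(np.float32))
b = trt.Weights(np.zeros(16, dtype=np.float32))
conv = network.add_convolution_nd(x, 16, (3, 3), w, b)
relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))
```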

The BuilderConfig interface is used to specify how TensorRT should optimize the model. Among the available options, we can control TensorRT's ability to reduce the precision of calculations, and trade off memory usage against runtime execution speed.
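For example (a rough sketch, continuing from the builder created above; exact flags depend on your TensorRT version), enabling FP16 and capping the builder's workspace memory might look like this:

```python
config = builder.create_builder_config()

# Allow TensorRT to use FP16 kernels where beneficial (reduced precision).
config.set_flag(trt.BuilderFlag.FP16)

# Limit the scratch memory the builder may use when selecting kernels,
# trading some optimization freedom for a smaller memory footprint.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
```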

In the final build stage, the builder eliminates dead computations, folds constants, and reorders and combines some operations to run more efficiently on the GPU.

The builder creates the engine in a serialized form called a plan, which can be deserialized immediately or saved to disk for later use.
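A minimal sketch of building a plan and writing it to disk (the file name is arbitrary):

```python
# Build the serialized engine (the "plan") and save it for later use.
plan = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(plan)
```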

The Runtime Stage

The highest-level interface of the execution phase is Runtime. When using the runtime, we first deserialize a plan to create an engine, and then create an execution context from the engine.
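A sketch of the runtime side, assuming the plan file produced in the build stage above:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the plan into an engine, then create an execution context.
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```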

With the execution context created, we can repeatedly populate the input buffers and invoke the engine to run inference.

The Engine interface represents an optimized model. We can query the engine for information about the network's input and output tensors, such as their expected dimensions, data types, formats, and so on.
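For example, with the TensorRT 8.5+ style of the Python API (a sketch; older versions use the binding-based accessors instead), we can enumerate the I/O tensors like this:

```python
# Enumerate the engine's I/O tensors and print their properties.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name,
          engine.get_tensor_mode(name),    # INPUT or OUTPUT
          engine.get_tensor_shape(name),   # expected dimensions
          engine.get_tensor_dtype(name))   # data type
```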

The ExecutionContext is the main interface for invoking inference. The execution context holds all of the state associated with a particular invocation. We can create multiple execution contexts from a single engine and run them in parallel.
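As a rough sketch, a synchronous inference call might look like the following; d_input and d_output are hypothetical device buffers that would have to be allocated beforehand with a CUDA library such as cuda-python or PyCUDA, in the order the engine expects its bindings:

```python
# Hypothetical device pointers (integers) from a CUDA allocator,
# ordered to match the engine's input/output bindings.
bindings = [int(d_input), int(d_output)]

# Run one synchronous inference; repeat with fresh input data as needed.
context.execute_v2(bindings)
```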
