Rust bindings for NVIDIA TensorRT-LLM's C++ Executor runtime.
This workspace contains two crates:
- tensorrt-llm-sys: raw FFI declarations plus a small C++ bridge that exposes a stable C ABI.
- tensorrt-llm: safe Rust wrappers for executor construction, request enqueueing, response polling, cancellation, and shutdown.
The binding does not expose TensorRT-LLM C++ classes directly. C++ ABI compatibility across compilers, standard libraries, PyTorch builds, and TensorRT-LLM releases is fragile, so the safe crate speaks to a narrow C shim instead. The shim owns all native C++ objects, copies request inputs into TensorRT-LLM request objects, and copies response data into Rust-owned vectors before returning to user code.
Implemented now:
- Decoder-only executor creation from an engine directory.
- Encoder-decoder executor creation from encoder and decoder engine directories.
- ExecutorConfig for the common high-level executor settings.
- SamplingConfig and OutputConfig for token-generation requests.
- Request enqueueing.
- Blocking or timed response polling for any request or a specific request id.
- Streaming and non-streaming token responses.
- Generated beam token ids, finish reasons, cumulative log probabilities, per-token log probabilities, decoding iteration metadata, client ids, and request errors.
- Cancellation, shutdown, can_enqueue_requests, is_participant, and native version string access.
Not implemented in this first cut:
- Tensor-valued fields such as context logits, generation logits, encoder outputs, multimodal embeddings, and custom additional outputs.
- LoRA, prompt tuning, guided decoding, speculative decoding, KV-cache retention, and custom logits processors.
- MPI/orchestrator configuration helpers.
Those features need additional typed C ABI shapes around TensorRT-LLM tensor/config classes. The crate layout is ready for them, but they are intentionally omitted from the safe surface until each feature can be copied and validated without leaking C++ ABI details.
You need TensorRT-LLM C++ headers and, for binaries/tests that call the native bridge, a built TensorRT-LLM C++ library plus CUDA and TensorRT development files matching that build.
If no local TensorRT-LLM checkout is configured, the build script shallow-clones NVIDIA's repository into target/trtllm-src/TensorRT-LLM and uses its headers:
cargo build
Pin or redirect the clone when needed:
export TRTLLM_RS_GIT_REPO=https://github.com/NVIDIA/TensorRT-LLM.git
export TRTLLM_RS_GIT_REF=main
export TRTLLM_RS_GIT_DIR=/opt/TensorRT-LLM
export TRTLLM_RS_GIT_UPDATE=1
When building TensorRT-LLM from source and TENSORRT_ROOT is not set, the build script also shallow-clones NVIDIA's TensorRT repository into target/trt-src/TensorRT and forwards its headers to TensorRT-LLM's CMake configure step:
export TENSORRT_RS_GIT_REPO=https://github.com/NVIDIA/TensorRT.git
export TENSORRT_RS_GIT_REF=main
export TENSORRT_RS_GIT_DIR=/opt/TensorRT
export TENSORRT_RS_GIT_UPDATE=1
# Use this when you want to require a local/binary TensorRT install instead.
export TENSORRT_RS_SKIP_GIT_CLONE=1
That clone is enough for compiling the Rust libraries and C++ bridge object, but runnable binaries still need libtensorrt_llm. To let this crate build TensorRT-LLM from the cloned checkout on a CUDA/TensorRT-capable machine:
export TRTLLM_RS_BUILD_FROM_SOURCE=1
export CUDA_HOME=/usr/local/cuda
export TENSORRT_ROOT=/usr/local/tensorrt
# Optional CMake controls.
export TRTLLM_RS_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Release"
export TRTLLM_RS_CUDA_ARCHITECTURES=80
export TRTLLM_RS_CMAKE_TARGET=tensorrt_llm
export TRTLLM_RS_CMAKE_BUILD_ARGS="--verbose"
cargo build --release
Building TensorRT-LLM itself is a large CUDA/CMake build and requires TensorRT 10+ development headers/libraries, including NvInfer.h, libnvinfer.so, and libnvonnxparser.so. The TensorRT GitHub repository provides OSS headers/components, but not NVIDIA's closed TensorRT runtime library libnvinfer.so; install TensorRT or point TENSORRT_ROOT at an extracted TensorRT package when that binary is not otherwise discoverable. The build script forwards TENSORRT_ROOT to upstream CMake.
TRTLLM_RS_CUDA_ARCHITECTURES defaults to 80 to avoid CMake's native GPU detection on hosts whose installed CUDA compiler cannot target the local GPU. For deployment, set it to the architecture you intend to run, for example 90 for H100 or 120 for Blackwell with a CUDA compiler new enough to support it. If you already have a built checkout or package, point this crate at it instead.
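The fallback described above can be sketched as follows. This is a minimal illustration, not the crate's actual build script: the helper names are hypothetical, and passing the value through CMake's CMAKE_CUDA_ARCHITECTURES variable is an assumption about how the setting reaches the upstream build.

```rust
use std::env;

/// Default used when the variable is unset, as stated in the text above.
const DEFAULT_CUDA_ARCH: &str = "80";

/// Hypothetical helper: pick the override when it is set and non-blank,
/// otherwise fall back to the default architecture.
fn resolve_cuda_architectures(override_value: Option<&str>) -> String {
    match override_value.map(str::trim) {
        Some(value) if !value.is_empty() => value.to_string(),
        _ => DEFAULT_CUDA_ARCH.to_string(),
    }
}

/// Hypothetical helper: format the value as a CMake cache definition.
/// Assumption: the build forwards it as CMAKE_CUDA_ARCHITECTURES.
fn cmake_arch_flag(architectures: &str) -> String {
    format!("-DCMAKE_CUDA_ARCHITECTURES={architectures}")
}

fn main() {
    let from_env = env::var("TRTLLM_RS_CUDA_ARCHITECTURES").ok();
    let architectures = resolve_cuda_architectures(from_env.as_deref());
    println!("{}", cmake_arch_flag(&architectures));
}
```

Using an explicit default instead of CMake's native detection keeps the configure step working on build hosts whose GPU is newer than the installed CUDA compiler.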
TensorRT-RTX packages such as TensorRT-RTX-1.5.0.114 use RTX-specific library names and do not ship regular headers. You can still point TENSORRT_ROOT at that directory; the build script will use libtensorrt_shim.so / libtensorrt_onnxparser_rtx.so from the package and regular headers from the TensorRT GitHub clone:
export TENSORRT_ROOT=$PWD/TensorRT-RTX-1.5.0.114
export TRTLLM_RS_BUILD_FROM_SOURCE=1
cargo build --release
Typical environment:
export TRTLLM_ROOT=/opt/TensorRT-LLM
export TENSORRT_ROOT=/usr/local/tensorrt
export CUDA_HOME=/usr/local/cuda
# Optional; defaults to tensorrt_llm.
export TRTLLM_RS_LINK_LIBS=tensorrt_llm
# Optional when matching a prebuilt C++ ABI, e.g. PyTorch/libstdc++ builds.
# export TRTLLM_RS_CXX11_ABI=1
cargo build --release
The build script looks for headers in $TRTLLM_ROOT/cpp/include and libraries in common TensorRT-LLM build directories. You can override discovery directly:
export TRTLLM_INCLUDE_DIR=/path/to/TensorRT-LLM/cpp/include
export TRTLLM_LIB_DIR=/path/to/TensorRT-LLM/cpp/build/tensorrt_llm
export TRTLLM_RS_EXTRA_INCLUDE_DIRS=/extra/include1:/extra/include2
export TRTLLM_RS_EXTRA_LIB_DIRS=/extra/lib1:/extra/lib2
export TRTLLM_RS_EXTRA_CXXFLAGS="-Wno-deprecated-declarations"
export TRTLLM_RS_EXTRA_LINK_LIBS="nvinfer,nvinfer_plugin,cudart"
For cargo check, docs, or CI without TensorRT-LLM installed:
TRTLLM_RS_SKIP_NATIVE=1 cargo check --workspace --all-targets --no-default-features
That mode is only for type-checking Rust code. Running code that calls FFI still requires the native bridge and TensorRT-LLM libraries.
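To make the override variables more concrete, here is a rough sketch of how a build script could turn them into Cargo link directives. It is not the crate's actual build.rs. The function names are hypothetical, and treating any value of TRTLLM_RS_SKIP_NATIVE as "skip" is an assumption. The colon-separated and comma-separated formats follow the examples above.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Split a platform path list (colon-separated on Unix), dropping empty entries.
fn split_dirs(value: &OsStr) -> Vec<PathBuf> {
    env::split_paths(value)
        .filter(|path| !path.as_os_str().is_empty())
        .collect()
}

/// Split a comma-separated library list, trimming whitespace and dropping empties.
fn split_libs(value: &str) -> Vec<String> {
    value
        .split(',')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(String::from)
        .collect()
}

/// Build the directives Cargo reads from a build script's standard output.
fn link_directives(lib_dirs: &[PathBuf], libs: &[String]) -> Vec<String> {
    let mut lines = Vec::new();
    for dir in lib_dirs {
        lines.push(format!("cargo:rustc-link-search=native={}", dir.display()));
    }
    for lib in libs {
        lines.push(format!("cargo:rustc-link-lib={lib}"));
    }
    lines
}

fn main() {
    // Assumption: any value of TRTLLM_RS_SKIP_NATIVE selects type-check-only mode.
    if env::var_os("TRTLLM_RS_SKIP_NATIVE").is_some() {
        return;
    }
    let dirs = env::var_os("TRTLLM_RS_EXTRA_LIB_DIRS")
        .map(|value| split_dirs(&value))
        .unwrap_or_default();
    let libs = env::var("TRTLLM_RS_EXTRA_LINK_LIBS")
        .map(|value| split_libs(&value))
        .unwrap_or_default();
    for line in link_directives(&dirs, &libs) {
        println!("{line}");
    }
}
```

When the skip variable is set, the sketch emits no link directives at all, which is why that mode can type-check Rust code but cannot produce a binary that calls into the bridge.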
use std::time::Duration;
use tensorrt_llm::{Executor, ExecutorConfig, OutputConfig, Request, ResponsePayload, Result};

fn main() -> Result<()> {
    let mut executor = Executor::new("/models/llama/trtllm-engine", &ExecutorConfig::default())?;

    // Tokenization is intentionally outside the crate. Pass token ids from
    // your tokenizer of choice.
    let prompt_token_ids = vec![1, 15043, 29892];
    let request = Request::builder(prompt_token_ids, 64)
        .output_config(OutputConfig {
            exclude_input_from_output: true,
            return_log_probs: true,
            ..OutputConfig::default()
        })
        .build()?;
    let request_id = executor.enqueue(&request)?;

    loop {
        for response in executor.await_responses_for(request_id, Some(Duration::from_millis(250)))? {
            let is_final = response.is_final();
            match response.payload {
                ResponsePayload::Result(result) => println!("{:?}", result.beams),
                ResponsePayload::Error(message) => eprintln!("request error: {message}"),
            }
            if is_final {
                return Ok(());
            }
        }
    }
}
Run the examples:
cargo run --release --example generate -- /path/to/engine 1 15043 29892
cargo run --release --example streaming -- /path/to/engine 1 15043 29892
- Rust never owns or references TensorRT-LLM C++ objects directly.
- Every fallible native call returns a status code and a thread-local error string.
- Request builders validate basic invariants before crossing FFI.
- Response data is copied into Rust-owned Vecs while the bridge response list is still alive.
- The safe Executor wrapper owns exactly one native executor handle and destroys it on drop.
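The status-code, thread-local error, and destroy-on-drop conventions listed above might look roughly like the sketch below. The native side is simulated in Rust so the example runs without TensorRT-LLM. Every demo_* function, DemoExecutor, and SafeExecutor is a hypothetical stand-in and does not reflect the real bridge's symbols or signatures.

```rust
use std::cell::{Cell, RefCell};
use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_int};
use std::ptr;

// Simulated bridge state. All names below are hypothetical.
thread_local! {
    static LAST_ERROR: RefCell<Option<CString>> = RefCell::new(None);
    static LIVE_OBJECTS: Cell<usize> = Cell::new(0);
}

#[repr(C)]
pub struct DemoExecutor { _private: u8 }

fn set_error(message: &str) {
    let text = CString::new(message).unwrap_or_default();
    LAST_ERROR.with(|slot| *slot.borrow_mut() = Some(text));
}

/// Returns 0 on success; on failure returns nonzero and records a message.
extern "C" fn demo_executor_create(path: *const c_char, out: *mut *mut DemoExecutor) -> c_int {
    if path.is_null() || out.is_null() {
        set_error("null argument");
        return 1;
    }
    // SAFETY: `path` is non-null and the caller passes a NUL-terminated string.
    if unsafe { CStr::from_ptr(path) }.to_bytes().is_empty() {
        set_error("engine directory is empty");
        return 1;
    }
    LIVE_OBJECTS.with(|n| n.set(n.get() + 1));
    // SAFETY: `out` is non-null and points to storage for one pointer.
    unsafe { *out = Box::into_raw(Box::new(DemoExecutor { _private: 0 })) };
    0
}

extern "C" fn demo_executor_destroy(handle: *mut DemoExecutor) {
    if !handle.is_null() {
        // SAFETY: `handle` came from `Box::into_raw` in `demo_executor_create`.
        drop(unsafe { Box::from_raw(handle) });
        LIVE_OBJECTS.with(|n| n.set(n.get() - 1));
    }
}

extern "C" fn demo_last_error() -> *const c_char {
    LAST_ERROR.with(|slot| slot.borrow().as_ref().map_or(ptr::null(), |s| s.as_ptr()))
}

/// Copy the thread-local message into an owned String right after a failure.
fn last_error() -> String {
    let message = demo_last_error();
    if message.is_null() {
        return "unknown native error".to_string();
    }
    // SAFETY: the pointer stays valid until the next failing call on this thread.
    unsafe { CStr::from_ptr(message) }.to_string_lossy().into_owned()
}

/// Safe wrapper that owns exactly one native handle.
pub struct SafeExecutor { handle: *mut DemoExecutor }

impl SafeExecutor {
    pub fn new(engine_dir: &str) -> Result<Self, String> {
        let path = CString::new(engine_dir).map_err(|e| e.to_string())?;
        let mut handle: *mut DemoExecutor = ptr::null_mut();
        if demo_executor_create(path.as_ptr(), &mut handle) != 0 {
            return Err(last_error());
        }
        Ok(Self { handle })
    }
}

impl Drop for SafeExecutor {
    fn drop(&mut self) {
        demo_executor_destroy(self.handle);
    }
}

fn live_objects() -> usize {
    LIVE_OBJECTS.with(|n| n.get())
}

fn main() {
    if let Err(message) = SafeExecutor::new("") {
        println!("native error: {message}");
    }
    let executor = SafeExecutor::new("/path/to/engine").expect("demo create");
    println!("live native objects: {}", live_objects());
    drop(executor);
    println!("live native objects after drop: {}", live_objects());
}
```

Copying the error string immediately after the failing call matters because a later failure on the same thread overwrites the stored message.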
This code targets the current TensorRT-LLM Executor API shape. TensorRT-LLM is evolving quickly, so native compile errors after upgrading TensorRT-LLM are most likely caused by C++ API signature drift in executor.h. Keep the Rust C ABI stable and update only crates/tensorrt-llm-sys/native/trtllm_rs.cpp when upstream C++ types change.