Serving Multiple LLMs on a Single H100 with TensorRT-LLM & Triton
01 Overview & Motivation
Running multiple AI models — a large language model, a reranker, and an embedding model — on a single GPU server is a common production challenge. The naive approach is to spin up three separate containers, each with its own GPU slice, using vLLM. That works, but it leaves performance gains on the table.
NVIDIA's TensorRT-LLM combined with Triton Inference Server offers a better-optimized path: compile each generative model into a GPU-architecture-specific .engine binary, serve all models under a single Triton process, and control memory allocation precisely via environment variables.
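This guide walks through the complete production workflow, from downloading raw weights to a running container, for a three-model stack on a single NVIDIA H100 SXM5 (80 GB HBM3): gpt-oss-20b (the LLM), bge-reranker-large (the reranker), and nomic-embed-text-v2-moe (the embedding model).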
02 Core Concepts
What is TensorRT-LLM?
TensorRT-LLM is NVIDIA's open-source library that optimises LLM inference using TensorRT. It provides Python APIs to define, compile, and run transformer models as TensorRT engines. Key capabilities include:
- In-flight batching (continuous batching) for high throughput
- Paged KV-cache to prevent memory fragmentation
- FP8, INT8, INT4 quantization (GPTQ, AWQ, SmoothQuant)
- Tensor parallelism & pipeline parallelism across multiple GPUs
- Flash Attention 2, chunked context processing
- Speculative decoding (draft-model acceleration)
What is Triton Inference Server?
NVIDIA Triton Inference Server is a production-grade serving framework that hosts multiple models simultaneously across different backends — TensorRT, TensorRT-LLM, Python, ONNX, TensorFlow, PyTorch. A single Triton process manages GPU scheduling, request batching, health endpoints, and Prometheus metrics for all loaded models.
The Two-Phase Workflow
Phase 1 (build) compiles the model weights into a GPU-architecture-specific .engine file. This build step happens once (typically in Docker stage 1 or as a pre-build), then the compiled artifact is reused for every container restart in phase 2 (serve).
03 vLLM vs TensorRT-LLM
| Feature | vLLM | TensorRT-LLM + Triton |
|---|---|---|
| Setup complexity | ✓ Simple | ✗ Complex (engine build step) |
| Cold start time | ✓ Fast (runtime compile) | ✗ Engine build is slow (one-time) |
| Throughput on H100 | ~ Good | ✓ Best-in-class (+20–40%) |
| Latency P99 | ~ Good | ✓ Lower (compiled kernels) |
| Multi-model serving | ✗ Separate processes | ✓ Single Triton process |
| GPU memory control | ~ Per-instance fraction | ✓ Fine-grained per-model |
| FP8 quantization (H100) | ✓ | ✓ Native |
| KV cache paging | ✓ PagedAttention | ✓ Paged KV cache |
| Embedding models | ✓ Direct | ~ Python backend |
| OpenAI-compatible API | ✓ Built-in | ✗ Needs a proxy or adapter |
| Metrics (Prometheus) | ✓ | ✓ Native |
| Portability across GPU arches | ✓ Works anywhere | ✗ Must rebuild per arch (SM80 → SM90) |
04 Architecture Design
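- gpt_oss_20b (tensorrtllm backend): /engines/gpt-oss-20b/*.engine, GPU: 80% (64 GB)
- bge_reranker (python backend): /models/bge-reranker-large, GPU: 10% (8 GB)
- nomic_embed (python backend): /models/nomic-embed-text-v2-moe, GPU: 5% (4 GB)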
Request Flow
An incoming inference request arrives at Triton on port 8000 (HTTP/REST) or 8001 (gRPC). Triton routes it to the correct model backend by model name. The TRT-LLM backend uses in-flight batching — new requests join the active generation loop without waiting for the current batch to finish, maximizing GPU utilization.
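To make the in-flight batching idea concrete, here is a toy iteration-level scheduler. It is only an illustration of the concept, not TRT-LLM's actual scheduler; the function name `simulate_inflight` and the "one token per step" cost model are assumptions made for this sketch.

```python
from collections import deque

def simulate_inflight(requests, max_batch_size):
    """Toy iteration-level scheduler.

    requests: list of (arrival_step, tokens_to_generate) tuples.
    Returns {request_index: step_at_which_it_finished}.
    """
    pending = deque(sorted(enumerate(requests), key=lambda item: item[1][0]))
    active = {}    # request index -> tokens still to generate
    finished = {}
    step = 0
    while pending or active:
        # Admit every request that has already arrived while a batch slot is free.
        while pending and pending[0][1][0] <= step and len(active) < max_batch_size:
            idx, (_, n_tokens) = pending.popleft()
            active[idx] = n_tokens
        # One decode iteration: each active sequence emits one token.
        for idx in list(active):
            active[idx] -= 1
            if active[idx] <= 0:
                finished[idx] = step + 1
                del active[idx]
        step += 1
    return finished

# A long request arrives at step 0, a short one at step 1.
result = simulate_inflight([(0, 5), (1, 2)], max_batch_size=2)
print(result)
```

In this run the short request joins the loop at step 1 and finishes at step 3, before the long request finishes at step 5. With static batching it would have had to wait for the first batch to complete.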
Project Directory Structure
# Your build context (rsync this to the H100 VM)
trtllm-project/
├── Dockerfile                     # Multi-stage build
├── entrypoint.sh                  # Patches configs + starts Triton
│
├── models/                        # HuggingFace weights (local)
│   ├── gpt-oss-20b/
│   ├── bge-reranker-large/
│   └── nomic-embed-text-v2-moe/
│
├── triton_repo/                   # Triton model repository
│   ├── gpt_oss_20b/
│   │   ├── config.pbtxt
│   │   └── 1/                     # (engine loaded from /engines/)
│   ├── bge_reranker/
│   │   ├── config.pbtxt
│   │   └── 1/
│   │       └── model.py
│   └── nomic_embed/
│       ├── config.pbtxt
│       └── 1/
│           └── model.py
│
└── backends/                      # Python backend scripts
    ├── bge_reranker.py
    └── nomic_embed.py
05 Prerequisites
The engine build (trtllm-build) requires the same GPU architecture as the target runtime. H100 = SM90. You cannot compile on your laptop and deploy on H100. Always build on the H100 VM.
On the H100 VM — Install NVIDIA Container Toolkit
# 1. Update and install docker
sudo apt-get update
sudo apt-get install -y docker.io
# 2. Add NVIDIA Container Toolkit repo
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# 3. Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# 4. Configure docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 5. Verify GPU is visible
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Pull the Base Image (large — ~20 GB)
# Log in to NVIDIA NGC (free account required)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key from https://ngc.nvidia.com/setup>
# Pull Triton + TRT-LLM runtime image
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
06 Step 1 — Download Model Weights
All three models must be downloaded as HuggingFace-format checkpoints before building the Docker image. The two embedding/reranker models are served as-is via the Python backend, so no compilation is needed for them.
# Install huggingface CLI
pip install -U huggingface_hub
# Authenticate (needed for gated models)
huggingface-cli login
# Paste your HF token when prompted
# Create model directory
mkdir -p ~/trtllm-project/models
# ── gpt-oss-20b (replace with your actual model ID) ──────────────
huggingface-cli download your-org/gpt-oss-20b \
--local-dir ~/trtllm-project/models/gpt-oss-20b \
--local-dir-use-symlinks False
# ── bge-reranker-large ────────────────────────────────────────────
huggingface-cli download BAAI/bge-reranker-large \
--local-dir ~/trtllm-project/models/bge-reranker-large \
--local-dir-use-symlinks False
# ── nomic-embed-text-v2-moe ───────────────────────────────────────
huggingface-cli download nomic-ai/nomic-embed-text-v2-moe \
--local-dir ~/trtllm-project/models/nomic-embed-text-v2-moe \
--local-dir-use-symlinks False
# Verify sizes
du -sh ~/trtllm-project/models/*
07 Step 2 — Build the TRT Engine
This is the most critical step. We use the Triton+TRT-LLM image as a builder environment, mount the model weights, and produce a compiled .engine file optimized for H100's SM90 architecture.
2a. Convert HuggingFace checkpoint to TRT-LLM format
docker run --rm --gpus all \
-v ~/trtllm-project/models:/models \
-v ~/trtllm-project/engines:/engines \
nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
python /app/tensorrt_llm/examples/gpt/convert_checkpoint.py \
--model_dir /models/gpt-oss-20b \
--output_dir /engines/trt_ckpt/gpt-oss-20b \
--dtype float16 \
--tp_size 1
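The --dtype float16 flag uses native FP16. For even better throughput on H100, consider FP8, which is natively accelerated by H100's Transformer Engine. Note that float8 is not a --dtype value for convert_checkpoint.py; FP8 checkpoints come from TensorRT-LLM's quantization workflow (the examples/quantization/quantize.py script with --qformat fp8), which replaces this conversion step. FP8 weights reduce the 20B model's VRAM footprint from ~40 GB to ~20 GB.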
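The ~40 GB and ~20 GB figures follow directly from bytes per parameter. A minimal sketch of the arithmetic, assuming a nominal 20 billion parameters and counting weights only (no KV cache, activations, or runtime overhead); `weight_footprint_gb` is a helper name invented for this example:

```python
def weight_footprint_gb(num_params, bytes_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * bytes_per_param / 1e9

PARAMS = 20e9  # nominal 20B parameters (assumption)

fp16_gb = weight_footprint_gb(PARAMS, 2)  # FP16 stores 2 bytes per weight
fp8_gb = weight_footprint_gb(PARAMS, 1)   # FP8 stores 1 byte per weight

print(f"FP16 ~ {fp16_gb:.0f} GB, FP8 ~ {fp8_gb:.0f} GB")
```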
2b. Compile the TRT engine with trtllm-build
docker run --rm --gpus all \
-v ~/trtllm-project/engines:/engines \
nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
trtllm-build \
--checkpoint_dir /engines/trt_ckpt/gpt-oss-20b \
--output_dir /engines/trt/gpt-oss-20b \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--max_batch_size 32 \
--max_input_len 4096 \
--max_seq_len 8192 \
--use_paged_context_fmha enable \
--workers 1
# Verify the engine was created
ls -lh ~/trtllm-project/engines/trt/gpt-oss-20b/
# Should see: rank0.engine, config.json
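To use FP8 on H100, build the engine from an FP8-quantized checkpoint rather than swapping float16 for float8 in the plugin flags; the attention plugin flag does not accept a float8 value. Benchmark both and compare the quality vs speed tradeoff for your specific use case.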
08 Step 3 — Triton Model Repository Configs
Each model needs a config.pbtxt file that tells Triton which backend to use, what inputs/outputs look like, and GPU parameters. The placeholder tokens (__VAR__) are replaced at container startup by entrypoint.sh.
triton_repo/gpt_oss_20b/config.pbtxt
name: "gpt_oss_20b"
backend: "tensorrtllm"
max_batch_size: __MAX_BATCH_SIZE_GPT__
model_transaction_policy { decoupled: true }
input [
{ name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] },
{ name: "input_lengths" data_type: TYPE_INT32 dims: [ 1 ] },
{ name: "request_output_len" data_type: TYPE_INT32 dims: [ 1 ] },
{ name: "streaming" data_type: TYPE_BOOL dims: [ 1 ], optional: true }
]
output [
{ name: "output_ids" data_type: TYPE_INT32 dims: [ -1 ] },
{ name: "sequence_length" data_type: TYPE_INT32 dims: [ 1 ] }
]
parameters {
# gpt_model_path is the TRT-LLM backend's parameter name for the engine directory
key: "gpt_model_path"
value: { string_value: "/engines/trt/gpt-oss-20b" }
}
parameters {
key: "executor_worker_path"
value: { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
parameters {
key: "kv_cache_free_gpu_mem_fraction"
value: { string_value: "__KV_CACHE_FRACTION__" }
}
parameters {
key: "max_num_sequences"
value: { string_value: "256" }
}
instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]
triton_repo/bge_reranker/config.pbtxt
name: "bge_reranker"
backend: "python"
max_batch_size: __MAX_BATCH_SIZE_RERANKER__
input [
{ name: "query", data_type: TYPE_STRING, dims: [ 1 ] },
{ name: "passages", data_type: TYPE_STRING, dims: [ -1 ] }
]
output [
{ name: "scores", data_type: TYPE_FP32, dims: [ -1 ] }
]
parameters {
key: "model_path"
value: { string_value: "/models/bge-reranker-large" }
}
instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]
triton_repo/nomic_embed/config.pbtxt
name: "nomic_embed"
backend: "python"
max_batch_size: __MAX_BATCH_SIZE_EMBED__
input [
{ name: "texts", data_type: TYPE_STRING, dims: [ -1 ] }
]
output [
{ name: "embeddings", data_type: TYPE_FP32, dims: [ -1, 768 ] }
]
parameters {
key: "model_path"
value: { string_value: "/models/nomic-embed-text-v2-moe" }
}
instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]
09 Step 4 — Python Backend Scripts
The reranker and embedding models don't use TRT-LLM — they're served through Triton's Python backend, which gives us full control over how PyTorch loads the model and handles requests. GPU memory is bounded via torch.cuda.set_per_process_memory_fraction().
backends/bge_reranker.py
import os, json
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        config = json.loads(args["model_config"])
        params = config.get("parameters", {})
        model_path = params.get("model_path", {}).get(
            "string_value", "/models/bge-reranker-large")
        # Clamp this backend process's PyTorch allocator to the configured fraction
        gpu_frac = float(os.environ.get("RERANKER_GPU_FRACTION", "0.10"))
        torch.cuda.set_per_process_memory_fraction(gpu_frac, device=0)
        self.device = torch.device("cuda:0")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for req in requests:
            # Inputs carry a leading batch dimension; this backend reads batch item 0
            query = pb_utils.get_input_tensor_by_name(req, "query"
                ).as_numpy()[0][0].decode()
            passages = [p.decode() for p in
                pb_utils.get_input_tensor_by_name(req, "passages").as_numpy()[0]]
            # Score each (query, passage) pair with the cross-encoder
            pairs = [[query, p] for p in passages]
            enc = self.tokenizer(pairs, padding=True, truncation=True,
                max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                scores = self.model(**enc).logits.squeeze(-1).float().cpu().numpy()
            # Restore the batch dimension so the output matches dims [ -1 ] in config.pbtxt
            out = pb_utils.Tensor("scores", scores.astype(np.float32).reshape(1, -1))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()
backends/nomic_embed.py
import os, json
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoTokenizer, AutoModel

class TritonPythonModel:
    def initialize(self, args):
        config = json.loads(args["model_config"])
        params = config.get("parameters", {})
        model_path = params.get("model_path", {}).get(
            "string_value", "/models/nomic-embed-text-v2-moe")
        # Clamp this backend process's PyTorch allocator to the configured fraction
        gpu_frac = float(os.environ.get("EMBED_GPU_FRACTION", "0.05"))
        torch.cuda.set_per_process_memory_fraction(gpu_frac, device=0)
        self.device = torch.device("cuda:0")
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path, trust_remote_code=True,
            torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for req in requests:
            # Inputs carry a leading batch dimension; this backend reads batch item 0
            texts = [t.decode() for t in
                pb_utils.get_input_tensor_by_name(req, "texts").as_numpy()[0]]
            enc = self.tokenizer(texts, padding=True, truncation=True,
                max_length=512, return_tensors="pt").to(self.device)
            with torch.no_grad():
                out = self.model(**enc)
            # Mean-pool over real tokens only; padding positions are masked out
            hidden = out.last_hidden_state.float()
            mask = enc["attention_mask"].unsqueeze(-1).float()
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
            embeddings = pooled.cpu().numpy()
            # Add the batch dimension expected by dims [ -1, 768 ] in config.pbtxt
            t = pb_utils.Tensor("embeddings", embeddings.astype(np.float32)[np.newaxis, ...])
            responses.append(pb_utils.InferenceResponse(output_tensors=[t]))
        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()
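Mean pooling over token embeddings is sensitive to padding: when texts of different lengths are padded to a common length, a plain mean also averages the padding positions. The sketch below shows mask-weighted pooling in plain Python on toy vectors. It is an illustration only (a real backend would use tensor operations), and `masked_mean_pool` is a name made up for this example.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sequence, skipping positions where mask == 0."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Two real tokens followed by one padding position (toy 2-dimensional vectors).
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
naive = [sum(vec[d] for vec in tokens) / len(tokens) for d in range(2)]

print("masked:", pooled)
print("naive: ", naive)
```

The masked result depends only on the two real tokens, while the naive mean is dominated by whatever value sits in the padding slot.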
10 Step 5 — The Dockerfile
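A two-stage Dockerfile: Stage 1 compiles the TRT engine inside the builder image, Stage 2 copies only the compiled engine and the serving files. Both stages use the same base image, so the saving is that the raw LLM weights and the intermediate checkpoint stay out of the final image. Note that docker build has no --gpus flag: RUN steps that need the GPU (the engine compile) generally require nvidia to be set as Docker's default runtime in /etc/docker/daemon.json; alternatively, build the engine beforehand with docker run as in Step 2.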
# ══════════════════════════════════════════════════════════════
# Stage 1 — Engine Builder
# Compiles gpt-oss-20b → TensorRT engine on H100 (SM90)
# ══════════════════════════════════════════════════════════════
FROM nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 AS engine-builder
RUN pip install --no-cache-dir \
transformers==4.44.0 \
huggingface_hub==0.24.0 \
sentencepiece einops
# Build-time arguments — override with --build-arg
# MODEL_SOURCE: "local" or "hub" (Dockerfile comments must start the line)
ARG MODEL_SOURCE=local
ARG HF_TOKEN=""
ARG GPT_MODEL_ID="your-org/gpt-oss-20b"
ARG MAX_BATCH_SIZE=32
ARG MAX_INPUT_LEN=4096
ARG MAX_SEQ_LEN=8192
# Option A: copy local weights (default)
COPY ./models/gpt-oss-20b /workspace/hf_models/gpt-oss-20b
# Option B: download from Hub at build time
RUN if [ "$MODEL_SOURCE" = "hub" ]; then \
huggingface-cli download ${GPT_MODEL_ID} \
--token ${HF_TOKEN} \
--local-dir /workspace/hf_models/gpt-oss-20b; \
fi
# Convert HF → TRT-LLM checkpoint
RUN python /app/tensorrt_llm/examples/gpt/convert_checkpoint.py \
--model_dir /workspace/hf_models/gpt-oss-20b \
--output_dir /workspace/trt_ckpt/gpt-oss-20b \
--dtype float16 \
--tp_size 1
# Compile TRT engine
RUN trtllm-build \
--checkpoint_dir /workspace/trt_ckpt/gpt-oss-20b \
--output_dir /workspace/trt_engines/gpt-oss-20b \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--max_batch_size ${MAX_BATCH_SIZE} \
--max_input_len ${MAX_INPUT_LEN} \
--max_seq_len ${MAX_SEQ_LEN} \
--use_paged_context_fmha enable \
--workers 1
# ══════════════════════════════════════════════════════════════
# Stage 2 — Runtime Server
# ══════════════════════════════════════════════════════════════
FROM nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
LABEL description="TRT-LLM: gpt-oss-20b + bge-reranker-large + nomic-embed-text-v2-moe"
RUN pip install --no-cache-dir \
transformers==4.44.0 \
sentence-transformers==3.0.1 \
einops numpy \
torch==2.3.0 \
sentencepiece
# Copy compiled engine from builder stage
COPY --from=engine-builder /workspace/trt_engines/gpt-oss-20b /engines/trt/gpt-oss-20b
# Copy HF weights for Python-backend models
COPY ./models/bge-reranker-large /models/bge-reranker-large
COPY ./models/nomic-embed-text-v2-moe /models/nomic-embed-text-v2-moe
# Copy Triton model repository + backend scripts
COPY ./triton_repo /triton_repo
COPY ./backends/bge_reranker.py /triton_repo/bge_reranker/1/model.py
COPY ./backends/nomic_embed.py /triton_repo/nomic_embed/1/model.py
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
# ── Default ENV (all overridable via docker run -e) ───────────
ENV GPT_GPU_FRACTION=0.80
ENV RERANKER_GPU_FRACTION=0.10
ENV EMBED_GPU_FRACTION=0.05
ENV MAX_BATCH_SIZE_GPT=32
ENV MAX_BATCH_SIZE_RERANKER=64
ENV MAX_BATCH_SIZE_EMBED=128
ENV KV_CACHE_FRACTION=0.90
ENV TRITON_HTTP_PORT=8000
ENV TRITON_GRPC_PORT=8001
ENV TRITON_METRICS_PORT=8002
ENV LOG_VERBOSE=0
EXPOSE 8000 8001 8002
ENTRYPOINT ["/entrypoint.sh"]
entrypoint.sh
#!/bin/bash
set -e
echo "======================================================"
echo " TRT-LLM Multi-Model Server Starting"
echo " gpt-oss-20b: ${GPT_GPU_FRACTION}"
echo " bge-reranker-large: ${RERANKER_GPU_FRACTION}"
echo " nomic-embed-text-v2-moe: ${EMBED_GPU_FRACTION}"
echo "======================================================"
# Patch config placeholders with runtime ENV values
sed -i \
-e "s|__GPT_GPU_FRACTION__|${GPT_GPU_FRACTION}|g" \
-e "s|__MAX_BATCH_SIZE_GPT__|${MAX_BATCH_SIZE_GPT}|g" \
-e "s|__KV_CACHE_FRACTION__|${KV_CACHE_FRACTION}|g" \
/triton_repo/gpt_oss_20b/config.pbtxt
sed -i \
-e "s|__RERANKER_GPU_FRACTION__|${RERANKER_GPU_FRACTION}|g" \
-e "s|__MAX_BATCH_SIZE_RERANKER__|${MAX_BATCH_SIZE_RERANKER}|g" \
/triton_repo/bge_reranker/config.pbtxt
sed -i \
-e "s|__EMBED_GPU_FRACTION__|${EMBED_GPU_FRACTION}|g" \
-e "s|__MAX_BATCH_SIZE_EMBED__|${MAX_BATCH_SIZE_EMBED}|g" \
/triton_repo/nomic_embed/config.pbtxt
export RERANKER_GPU_FRACTION EMBED_GPU_FRACTION
# Start Triton — load only the 3 known models
exec tritonserver \
--model-repository=/triton_repo \
--http-port=${TRITON_HTTP_PORT} \
--grpc-port=${TRITON_GRPC_PORT} \
--metrics-port=${TRITON_METRICS_PORT} \
--log-verbose=${LOG_VERBOSE} \
--model-control-mode=explicit \
--load-model=gpt_oss_20b \
--load-model=bge_reranker \
--load-model=nomic_embed
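The sed substitution silently leaves a placeholder in place if a variable name is misspelled or unset. As a sketch of a stricter alternative, the same __VAR__ patching can be done in Python with a check that every placeholder has a value. The function name `render_config` and the regular expression are assumptions for this example, not part of Triton.

```python
import re

# Matches tokens such as __MAX_BATCH_SIZE_GPT__ and captures the inner name.
PLACEHOLDER = re.compile(r"__([A-Z0-9_]+?)__")

def render_config(template, values):
    """Replace __NAME__ tokens with values[NAME]; fail loudly on unknown names."""
    missing = sorted({name for name in PLACEHOLDER.findall(template)
                      if name not in values})
    if missing:
        raise KeyError(f"no value for placeholders: {missing}")
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)

template = (
    "max_batch_size: __MAX_BATCH_SIZE_GPT__\n"
    'value: { string_value: "__KV_CACHE_FRACTION__" }'
)
rendered = render_config(
    template, {"MAX_BATCH_SIZE_GPT": 32, "KV_CACHE_FRACTION": "0.90"}
)
print(rendered)
```

In an entrypoint, `values` would typically be `os.environ`, and the rendered text would be written back to config.pbtxt before Triton starts.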
11 Step 6 — Build & Run
Sync project to the H100 VM
# From your local machine — sync everything except model weights
# (you already downloaded weights directly on the VM)
rsync -avz --exclude='models/' \
./trtllm-project/ \
user@h100-vm.example.com:/home/user/trtllm-project/
# SSH into the VM
ssh user@h100-vm.example.com
Build the Docker image
cd ~/trtllm-project
# Standard build (uses local model weights)
docker build \
-t trtllm-server:latest \
--progress=plain \
. 2>&1 | tee build.log
# Optional: build with HuggingFace Hub download
docker build \
--build-arg MODEL_SOURCE=hub \
--build-arg HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx \
--build-arg GPT_MODEL_ID=your-org/gpt-oss-20b \
--build-arg MAX_BATCH_SIZE=64 \
-t trtllm-server:latest \
.
# Verify image was created
docker images | grep trtllm-server
Run the container
# Environment groups: GPU memory allocation (*_GPU_FRACTION, KV_CACHE_FRACTION),
# batch sizes (MAX_BATCH_SIZE_*), ports (TRITON_*_PORT), debug (LOG_VERBOSE).
# Comment lines cannot sit between backslash-continued lines, so they are grouped here.
docker run --gpus all \
  --shm-size=16g \
  --name trtllm-server \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -e GPT_GPU_FRACTION=0.80 \
  -e RERANKER_GPU_FRACTION=0.10 \
  -e EMBED_GPU_FRACTION=0.05 \
  -e KV_CACHE_FRACTION=0.90 \
  -e MAX_BATCH_SIZE_GPT=32 \
  -e MAX_BATCH_SIZE_RERANKER=64 \
  -e MAX_BATCH_SIZE_EMBED=128 \
  -e TRITON_HTTP_PORT=8000 \
  -e TRITON_GRPC_PORT=8001 \
  -e TRITON_METRICS_PORT=8002 \
  -e LOG_VERBOSE=0 \
  trtllm-server:latest
12 ENV Variable Reference
| Variable | Default | Description |
|---|---|---|
| GPT_GPU_FRACTION | 0.80 | Intended share of H100 VRAM for gpt-oss-20b (the config.pbtxt shown does not reference it; the LLM's actual footprint is governed by KV_CACHE_FRACTION) |
| RERANKER_GPU_FRACTION | 0.10 | Share of VRAM for bge-reranker-large (PyTorch memory cap) |
| EMBED_GPU_FRACTION | 0.05 | Share of VRAM for nomic-embed-text-v2-moe (PyTorch memory cap) |
| KV_CACHE_FRACTION | 0.90 | Passed as kv_cache_free_gpu_mem_fraction: the fraction of GPU memory still free after the engine loads that TRT-LLM reserves for the paged KV cache |
| MAX_BATCH_SIZE_GPT | 32 | Max concurrent sequences for TRT-LLM in-flight batching |
| MAX_BATCH_SIZE_RERANKER | 64 | Max batch size for reranker inference |
| MAX_BATCH_SIZE_EMBED | 128 | Max batch size for embedding inference |
| TRITON_HTTP_PORT | 8000 | REST/HTTP endpoint for inference requests |
| TRITON_GRPC_PORT | 8001 | gRPC endpoint (lower latency, recommended for production) |
| TRITON_METRICS_PORT | 8002 | Prometheus metrics scrape endpoint |
| LOG_VERBOSE | 0 | 0 = off, 1 = verbose Triton debug logs |
GPT_GPU_FRACTION + RERANKER_GPU_FRACTION + EMBED_GPU_FRACTION must be ≤ 0.95 on an H100 80GB. The remaining 5% (~4 GB) is needed for CUDA context, kernel code, and Triton's own memory. Exceeding this will cause OOM errors at model load time.
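A quick sanity check of the configured fractions can catch an over-committed budget before the container starts. This is a sketch only: it validates the numbers and does not enforce anything on the GPU. The helper name `check_gpu_budget` is made up for this example; the 80 GB total and 0.95 limit are taken from the text above.

```python
def check_gpu_budget(fractions, total_gb=80.0, limit=0.95):
    """Return per-model GB budgets; raise if the fractions exceed the limit."""
    total = sum(fractions.values())
    if total > limit + 1e-9:  # small tolerance for floating-point rounding
        raise ValueError(
            f"fractions sum to {total:.2f}, above the {limit:.2f} limit")
    return {name: round(frac * total_gb, 2) for name, frac in fractions.items()}

budget = check_gpu_budget(
    {"gpt_oss_20b": 0.80, "bge_reranker": 0.10, "nomic_embed": 0.05}
)
print(budget)
```

With the default values this yields 64 GB, 8 GB, and 4 GB, leaving about 4 GB of headroom.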
13 Testing & Validation
Health Check
# Server health
curl http://localhost:8000/v2/health/ready
# List models in the repository and their load state
curl -s -X POST http://localhost:8000/v2/repository/index | python3 -m json.tool
# Individual model readiness
curl http://localhost:8000/v2/models/gpt_oss_20b/ready
curl http://localhost:8000/v2/models/bge_reranker/ready
curl http://localhost:8000/v2/models/nomic_embed/ready
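Loading a large engine can take a while, so a deployment script usually polls the readiness endpoint rather than checking once. A minimal sketch using only the standard library; the helper names `http_ready` and `wait_until_ready`, the attempt count, and the delay are assumptions to adjust for your setup.

```python
import time
import urllib.error
import urllib.request

def http_ready(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all mean "not ready yet".
        return False

def wait_until_ready(check, attempts=60, delay=5.0, sleep=time.sleep):
    """Call check() until it returns True; give up after `attempts` tries.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        sleep(delay)
    raise TimeoutError(f"not ready after {attempts} attempts")
```

Typical use would be `wait_until_ready(lambda: http_ready("http://localhost:8000/v2/health/ready"))`, and the same pattern applies to the per-model ready URLs.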
Test Inference — gpt_oss_20b
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient("localhost:8000")
# Tokenize your prompt (use your tokenizer)
input_ids = np.array([[1, 15043, 29892]], dtype=np.int32) # "Hello,"
lengths = np.array([[3]], dtype=np.int32)
out_len = np.array([[100]], dtype=np.int32)
inputs = [
httpclient.InferInput("input_ids", input_ids.shape, "INT32"),
httpclient.InferInput("input_lengths", lengths.shape, "INT32"),
httpclient.InferInput("request_output_len", out_len.shape, "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(lengths)
inputs[2].set_data_from_numpy(out_len)
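# gpt_oss_20b is configured with decoupled: true; Triton may reject a plain HTTP infer
# call for decoupled models, in which case use the gRPC streaming client (tritonclient.grpc)
result = client.infer("gpt_oss_20b", inputs)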
output_ids = result.as_numpy("output_ids")
print(output_ids)
Test Inference — bge_reranker
query = np.array([["What is TensorRT?"]], dtype=object)
passages = np.array([["TensorRT is NVIDIA's inference optimizer",
"TensorRT-LLM extends TensorRT for LLMs"]], dtype=object)
inputs = [
httpclient.InferInput("query", query.shape, "BYTES"),
httpclient.InferInput("passages", passages.shape, "BYTES"),
]
inputs[0].set_data_from_numpy(query)
inputs[1].set_data_from_numpy(passages)
result = client.infer("bge_reranker", inputs)
scores = result.as_numpy("scores")
print("Rerank scores:", scores) # raw relevance logits; higher means more relevant
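Cross-encoder rerankers such as bge-reranker-large return unbounded relevance logits. If the application needs scores in the 0 to 1 range, a sigmoid is the usual mapping. A small sketch with made-up logits; `normalize_scores` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def normalize_scores(logits):
    """Map raw logits to (0, 1) and return (passage_index, score), best first."""
    scored = [(i, sigmoid(v)) for i, v in enumerate(logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up logits for three passages.
ranked = normalize_scores([-1.2, 4.3, 0.0])
for index, score in ranked:
    print(index, round(score, 3))
```

The sigmoid is monotonic, so it changes the scale of the scores but never the ranking order.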
Prometheus Metrics
# Raw metrics
curl http://localhost:8002/metrics
# Key metrics to watch
curl -s http://localhost:8002/metrics | grep -E \
"nv_inference_request_success|nv_gpu_memory|nv_inference_queue_duration"
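For dashboards or alert scripts it can be handy to pull a few counters out of the metrics text without a full Prometheus stack. The sketch below parses the text exposition format with the standard library. It assumes one sample per line with no trailing timestamp, and the sample values are invented for illustration; `parse_metrics` is a hypothetical helper.

```python
def parse_metrics(text, wanted_prefixes):
    """Parse Prometheus text lines into {sample_name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field on the line.
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(tuple(wanted_prefixes)):
            samples[name_and_labels] = float(value)
    return samples

# Invented sample in the shape Triton's /metrics endpoint returns.
sample = """# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="gpt_oss_20b",version="1"} 42
nv_inference_queue_duration_us{model="gpt_oss_20b",version="1"} 1500
nv_cpu_utilization 0.25
"""

metrics = parse_metrics(
    sample, ["nv_inference_request_success", "nv_inference_queue_duration"]
)
for name, value in metrics.items():
    print(name, value)
```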
14 Performance Tuning Tips
For gpt-oss-20b Throughput
- Use FP8 instead of FP16: cuts VRAM ~50%, allows 2× batch size on H100
- Increase max_num_sequences in config.pbtxt to allow more in-flight requests
- Enable --use_paged_context_fmha (already included), which is crucial for long contexts
- Set KV_CACHE_FRACTION=0.92 to give more cache to the LLM at the cost of less headroom
- For throughput-heavy workloads, use --tp_size 1 (single GPU); for latency-heavy, consider tensor parallelism across 2× H100 if available
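How far a given KV cache budget stretches can be estimated from the model shape: each token stores one key and one value vector per layer and KV head. The sketch below uses a hypothetical model shape, not the real gpt-oss-20b configuration; substitute the values from your model's config.json. Both helper names are made up for this example.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys and values (factor 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def tokens_that_fit(cache_gb, per_token_bytes):
    """Whole tokens that fit in a cache of cache_gb gigabytes (1 GB = 1e9 bytes)."""
    return int(cache_gb * 1e9 // per_token_bytes)

# Hypothetical shape (NOT the real gpt-oss-20b values), FP16 cache entries.
per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128)
capacity = tokens_that_fit(cache_gb=20, per_token_bytes=per_token)

print(f"{per_token} bytes per token, about {capacity} tokens in 20 GB")
```

Dividing the token capacity by the expected sequence length gives a rough ceiling on how many sequences can be in flight before the cache fills.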
For Embedding / Reranker Throughput
- Increase MAX_BATCH_SIZE_EMBED=256 and MAX_BATCH_SIZE_RERANKER=128 if traffic is bursty
- Compile nomic-embed as a TRT ONNX backend for ~2× faster embedding inference
- Use torch.compile(model) in initialize() for Python backend models (PyTorch 2.x)
Monitor GPU Utilization Live
# Real-time GPU stats inside the container
docker exec -it trtllm-server watch -n 1 nvidia-smi
# Memory per process
docker exec trtllm-server nvidia-smi \
--query-compute-apps=pid,used_memory \
--format=csv
15 Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory | GPU fractions sum > 0.95 | Reduce one fraction. Try GPT_GPU_FRACTION=0.75 |
| engine file not found | Build stage failed silently | Check build.log; re-run with --progress=plain |
| unsupported SM version | Engine built on different GPU arch | Always build and run on the same GPU model |
| Model not ready on startup | Python backend import error | Run docker logs trtllm-server, check transformers version |
| tritonserver: not found | Wrong base image | Must use tritonserver:24.12-trtllm-python-py3 not plain CUDA image |
| High P99 latency spikes | KV cache eviction under load | Increase KV_CACHE_FRACTION or reduce max_num_sequences |
To debug a running container, run docker exec -it trtllm-server bash and then curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool to see exactly which models loaded and their status without restarting.