Deep Dive · MLOps · GPU Infrastructure

Serving Multiple LLMs on a Single H100 with TensorRT-LLM & Triton

📅 May 2026 ⏱ 18 min read 🧠 Advanced 🎯 Production Guide

01 Overview & Motivation

Running multiple AI models — a large language model, a reranker, and an embedding model — on a single GPU server is a common production challenge. The naive approach is to spin up three separate containers, each with its own GPU slice, using vLLM. That works, but it leaves performance gains on the table.

NVIDIA's TensorRT-LLM combined with Triton Inference Server offers a more efficient path: compile each generative model into a GPU-architecture-specific .engine binary, serve all models under a single Triton process, and tune each model's memory budget through environment variables.

This guide walks through the complete production workflow — from downloading raw weights to a running container — for a three-model stack on a single NVIDIA H100 SXM5 (80GB HBM3):

H100 80GB VRAM Allocation

  • gpt-oss-20b: 64 GB (80%)
  • bge-reranker-large: 8 GB (10%)
  • nomic-embed-text-v2-moe: 4 GB (5%)
  • CUDA overhead: ~4 GB (5%)

02 Core Concepts

What is TensorRT-LLM?

TensorRT-LLM is NVIDIA's open-source library that optimises LLM inference using TensorRT. It provides Python APIs to define, compile, and run transformer models as TensorRT engines. Key capabilities include:

  • In-flight batching (continuous batching) for high throughput
  • Paged KV-cache to prevent memory fragmentation
  • FP8, INT8, INT4 quantization (GPTQ, AWQ, SmoothQuant)
  • Tensor parallelism & pipeline parallelism across multiple GPUs
  • Flash Attention 2, chunked context processing
  • Speculative decoding (draft-model acceleration)

What is Triton Inference Server?

NVIDIA Triton Inference Server is a production-grade serving framework that hosts multiple models simultaneously across different backends — TensorRT, TensorRT-LLM, Python, ONNX, TensorFlow, PyTorch. A single Triton process manages GPU scheduling, request batching, health endpoints, and Prometheus metrics for all loaded models.

The Two-Phase Workflow

Key Insight
Unlike vLLM, which loads the weights and prepares its kernels when the server starts, TensorRT-LLM requires an offline compilation step that produces a GPU-architecture-specific .engine file. The build happens once (typically in Docker stage 1 or as a pre-build step), and the compiled artifact is then reused on every container restart.

03 vLLM vs TensorRT-LLM

Feature | vLLM | TensorRT-LLM + Triton
Setup complexity | ✓ Simple | ✗ Complex (engine build step)
Cold start time | ✓ Fast (runtime compile) | ✗ Engine build is slow (one-time)
Throughput on H100 | ~ Good | ✓ Best-in-class (+20–40%)
Latency P99 | ~ Good | ✓ Lower (compiled kernels)
Multi-model serving | ✗ Separate processes | ✓ Single Triton process
GPU memory control | ~ Per-instance fraction | ✓ Fine-grained per-model
FP8 quantization (H100) | ✓ Native SmoothQuant
KV cache paging | ✓ PagedAttention | ✓ Paged KV cache
Embedding models | ✓ Direct | ~ Python backend
OpenAI-compatible API | ✓ Built-in | ✗ Need vLLM or proxy
Metrics (Prometheus) | ✓ Native
Portability across GPU arches | ✓ Works anywhere | ✗ Must rebuild per arch (SM80 → SM90)
When to stay with vLLM
If your team needs OpenAI-compatible endpoints out of the box, frequent model swaps, or you're not latency-constrained at current traffic, vLLM is simpler to operate. TRT-LLM pays off at high concurrency (>50 RPS) or when squeezing the last 30% performance out of expensive H100 capacity.

04 Architecture Design

Request Flow

An incoming inference request arrives at Triton on port 8000 (HTTP/REST) or 8001 (gRPC). Triton routes it to the correct model backend by model name. The TRT-LLM backend uses in-flight batching — new requests join the active generation loop without waiting for the current batch to finish, maximizing GPU utilization.
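
To make the benefit of in-flight batching concrete, here is a toy scheduling model in Python. It is only a sketch under simplifying assumptions (one decode step produces one token per active sequence, and the function names are made up for illustration); it is not how the TRT-LLM backend is implemented.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """In-flight batching: a finished request frees its slot immediately,
    and a queued request takes that slot on the next decode step."""
    waiting = deque(lengths)
    active = []          # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Fill any free slots before running the next step
        while waiting and len(active) != batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that are done
        active = [left - 1 for left in active if left > 1]
    return steps


if __name__ == "__main__":
    output_lengths = [8, 2, 2, 2]   # tokens to generate per request
    print(static_batching_steps(output_lengths, batch_size=2))    # 10
    print(inflight_batching_steps(output_lengths, batch_size=2))  # 8
```

With one long request and three short ones, the static scheduler leaves a slot idle while the long request finishes, whereas the in-flight scheduler keeps both slots busy for as long as work is queued.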

Project Directory Structure

SHELL directory layout
# Your build context (rsync this to the H100 VM)
trtllm-project/
├── Dockerfile                      # Multi-stage build
├── entrypoint.sh                   # Patches configs + starts Triton
│
├── models/                          # HuggingFace weights (local)
│   ├── gpt-oss-20b/
│   ├── bge-reranker-large/
│   └── nomic-embed-text-v2-moe/
│
├── triton_repo/                     # Triton model repository
│   ├── gpt_oss_20b/
│   │   ├── config.pbtxt
│   │   └── 1/                       # (engine loaded from /engines/)
│   ├── bge_reranker/
│   │   ├── config.pbtxt
│   │   └── 1/
│   │       └── model.py
│   └── nomic_embed/
│       ├── config.pbtxt
│       └── 1/
│           └── model.py
│
└── backends/                        # Python backend scripts
    ├── bge_reranker.py
    └── nomic_embed.py

05 Prerequisites

Hardware Requirement
Engine compilation (trtllm-build) requires the same GPU architecture as the target runtime. H100 = SM90. You cannot compile on your laptop and deploy on H100. Always build on the H100 VM.
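
A quick way to confirm that a build host matches the target architecture is to read the GPU's compute capability. The sketch below assumes a driver recent enough for nvidia-smi to support the compute_cap query field; the helper names are illustrative.

```python
import subprocess


def sm_tag(compute_cap: str) -> str:
    """Turn a compute capability such as '9.0' into an SM tag such as 'SM90'."""
    major, minor = compute_cap.strip().split(".")
    return f"SM{int(major)}{int(minor)}"


def local_sm_tags() -> list[str]:
    """Query every visible GPU through nvidia-smi (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [sm_tag(line) for line in out.splitlines() if line.strip()]


def can_build_for(target: str, available: list[str]) -> bool:
    """True when at least one local GPU has the target architecture."""
    return target in available


if __name__ == "__main__":
    tags = local_sm_tags()
    print(tags, "OK" if can_build_for("SM90", tags) else "wrong build host")
```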

On the H100 VM — Install NVIDIA Container Toolkit

BASH VM setup
# 1. Update and install docker
sudo apt-get update
sudo apt-get install -y docker.io

# 2. Add NVIDIA Container Toolkit repo
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# 3. Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 4. Configure docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 5. Verify GPU is visible
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Pull the Base Image (large — ~20 GB)

BASH pull base
# Log in to NVIDIA NGC (free account required)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key from https://ngc.nvidia.com/setup>

# Pull Triton + TRT-LLM runtime image
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

06 Step 1 — Download Model Weights

All three models must be downloaded as HuggingFace-format checkpoints before building the Docker image. The two embedding/reranker models are served as-is via the Python backend, so no compilation is needed for them.

BASH download all models
# Install the Hugging Face CLI (the [cli] extra provides huggingface-cli)
pip install -U "huggingface_hub[cli]"

# Authenticate (needed for gated models)
huggingface-cli login
# Paste your HF token when prompted

# Create model directory
mkdir -p ~/trtllm-project/models

# ── gpt-oss-20b (replace with your actual model ID) ──────────────
huggingface-cli download your-org/gpt-oss-20b \
  --local-dir ~/trtllm-project/models/gpt-oss-20b \
  --local-dir-use-symlinks False

# ── bge-reranker-large ────────────────────────────────────────────
huggingface-cli download BAAI/bge-reranker-large \
  --local-dir ~/trtllm-project/models/bge-reranker-large \
  --local-dir-use-symlinks False

# ── nomic-embed-text-v2-moe ───────────────────────────────────────
huggingface-cli download nomic-ai/nomic-embed-text-v2-moe \
  --local-dir ~/trtllm-project/models/nomic-embed-text-v2-moe \
  --local-dir-use-symlinks False

# Verify sizes
du -sh ~/trtllm-project/models/*
Storage Requirement
A 20B parameter model in FP16 requires approximately 40 GB of disk space. Ensure your VM has at least 120 GB free before downloading all three models.
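
The 40 GB figure is just parameters times bytes per parameter. As a rough sketch (weights only, decimal gigabytes, ignoring tokenizer files and any duplicate checkpoint formats a repository may ship):

```python
# Bytes used to store one parameter in each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("float16", "fp8"):
        print(dtype, weight_size_gb(20e9, dtype), "GB")
```

The same arithmetic applies to VRAM: weights stored in FP16 occupy about as much GPU memory as they do on disk, before any KV cache is allocated.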

07 Step 2 — Build the TRT Engine

This is the most critical step. We use the Triton+TRT-LLM image as a builder environment, mount the model weights, and produce a compiled .engine file optimized for H100's SM90 architecture.

2a. Convert HuggingFace checkpoint to TRT-LLM format

BASH checkpoint conversion
docker run --rm --gpus all \
  -v ~/trtllm-project/models:/models \
  -v ~/trtllm-project/engines:/engines \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
  python /app/tensorrt_llm/examples/gpt/convert_checkpoint.py \
    --model_dir  /models/gpt-oss-20b \
    --output_dir /engines/trt_ckpt/gpt-oss-20b \
    --dtype      float16 \
    --tp_size    1
Architecture Note
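The --dtype float16 flag keeps the weights in native FP16. For higher throughput on H100, FP8 is accelerated natively by the Transformer Engine and roughly halves the 20B model's weight footprint, from ~40 GB to ~20 GB. FP8 is not a --dtype value, though: it comes from a separate post-training quantization pass (TensorRT-LLM ships examples/quantization/quantize.py with --qformat fp8 for this), which writes an FP8 checkpoint that takes the place of this step's output.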

2b. Compile the TRT engine with trtllm-build

BASH engine compilation (~20–40 min on H100)
docker run --rm --gpus all \
  -v ~/trtllm-project/engines:/engines \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
  trtllm-build \
    --checkpoint_dir         /engines/trt_ckpt/gpt-oss-20b \
    --output_dir             /engines/trt/gpt-oss-20b \
    --gemm_plugin            float16 \
    --gpt_attention_plugin   float16 \
    --max_batch_size         32 \
    --max_input_len          4096 \
    --max_seq_len            8192 \
    --use_paged_context_fmha enable \
    --workers                1

# Verify the engine was created
ls -lh ~/trtllm-project/engines/trt/gpt-oss-20b/
# Should see: rank0.engine, config.json
Optional — FP8 for Maximum H100 Throughput
FP8 is selected by the checkpoint rather than by the plugin flags: point --checkpoint_dir at an FP8-quantized checkpoint and keep --gpt_attention_plugin at float16. Benchmark both builds and compare the quality vs speed tradeoff for your specific use case.
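
Because the limits passed to trtllm-build are baked into the engine, it can help to compare them with the runtime settings before starting Triton. The sketch below assumes that the engine's config.json stores those limits under a "build_config" key with a "max_batch_size" field; check the file produced by your TensorRT-LLM version, since the layout is an assumption here and the helper names are illustrative.

```python
import json
from pathlib import Path


def load_build_config(engine_dir: str) -> dict:
    """Read the build limits from <engine_dir>/config.json.

    Assumption: the limits live under a top-level "build_config" key.
    """
    with open(Path(engine_dir) / "config.json") as fh:
        return json.load(fh).get("build_config", {})


def runtime_problems(build_config: dict, runtime_max_batch: int) -> list[str]:
    """Return human-readable problems; an empty list means the values agree."""
    problems = []
    built = build_config.get("max_batch_size")
    if built is None:
        problems.append("max_batch_size not found in build_config")
    elif runtime_max_batch > built:
        problems.append(
            f"runtime max_batch_size {runtime_max_batch} exceeds engine limit {built}"
        )
    return problems


if __name__ == "__main__":
    cfg = load_build_config("/engines/trt/gpt-oss-20b")
    print(runtime_problems(cfg, runtime_max_batch=32) or "limits are consistent")
```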

08 Step 3 — Triton Model Repository Configs

Each model needs a config.pbtxt file that tells Triton which backend to use, what inputs/outputs look like, and GPU parameters. The placeholder tokens (__VAR__) are replaced at container startup by entrypoint.sh.
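
The substitution itself is done with sed in entrypoint.sh. The Python sketch below mirrors that behaviour so a template can be checked offline; unlike sed, it fails loudly when a placeholder has no value. The function name and the sample template are illustrative.

```python
import re

# Matches tokens such as __MAX_BATCH_SIZE_GPT__ (upper-case words joined by "_")
PLACEHOLDER = re.compile(r"__([A-Z0-9]+(?:_[A-Z0-9]+)*)__")


def render_config(template: str, env: dict) -> str:
    """Replace every __NAME__ token with env["NAME"]; raise if one is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"no value for placeholder __{name}__")
        return str(env[name])
    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = 'max_batch_size: __MAX_BATCH_SIZE_GPT__\nvalue: "__KV_CACHE_FRACTION__"'
    print(render_config(template, {"MAX_BATCH_SIZE_GPT": 32, "KV_CACHE_FRACTION": "0.90"}))
```

Running a check like this in CI catches a renamed or misspelled placeholder before Triton rejects the config at startup.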

triton_repo/gpt_oss_20b/config.pbtxt

PROTOBUF gpt_oss_20b/config.pbtxt
name: "gpt_oss_20b"
backend: "tensorrtllm"
max_batch_size: __MAX_BATCH_SIZE_GPT__

model_transaction_policy { decoupled: true }

input [
  { name: "input_ids"          data_type: TYPE_INT32  dims: [ -1 ] },
  { name: "input_lengths"      data_type: TYPE_INT32  dims: [  1 ] },
  { name: "request_output_len" data_type: TYPE_INT32  dims: [  1 ] },
  { name: "streaming"          data_type: TYPE_BOOL   dims: [  1 ], optional: true }
]

output [
  { name: "output_ids"      data_type: TYPE_INT32  dims: [ -1 ] },
  { name: "sequence_length" data_type: TYPE_INT32  dims: [  1 ] }
]

parameters {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters {
  key: "gpt_model_path"
  value: { string_value: "/engines/trt/gpt-oss-20b" }
}
parameters {
  key: "executor_worker_path"
  value: { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
parameters {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "__KV_CACHE_FRACTION__" }
}
parameters {
  key: "max_num_sequences"
  value: { string_value: "256" }
}

instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]

triton_repo/bge_reranker/config.pbtxt

PROTOBUF bge_reranker/config.pbtxt
name: "bge_reranker"
backend: "python"
max_batch_size: __MAX_BATCH_SIZE_RERANKER__

input [
  { name: "query",    data_type: TYPE_STRING, dims: [ 1 ] },
  { name: "passages", data_type: TYPE_STRING, dims: [ -1 ] }
]

output [
  { name: "scores", data_type: TYPE_FP32, dims: [ -1 ] }
]

parameters {
  key: "model_path"
  value: { string_value: "/models/bge-reranker-large" }
}

instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]

triton_repo/nomic_embed/config.pbtxt

PROTOBUF nomic_embed/config.pbtxt
name: "nomic_embed"
backend: "python"
max_batch_size: __MAX_BATCH_SIZE_EMBED__

input [
  { name: "texts", data_type: TYPE_STRING, dims: [ -1 ] }
]

output [
  { name: "embeddings", data_type: TYPE_FP32, dims: [ -1, 768 ] }
]

parameters {
  key: "model_path"
  value: { string_value: "/models/nomic-embed-text-v2-moe" }
}

instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]

09 Step 4 — Python Backend Scripts

The reranker and embedding models don't use TRT-LLM — they're served through Triton's Python backend, which gives us full control over how PyTorch loads the model and handles requests. GPU memory is bounded via torch.cuda.set_per_process_memory_fraction().

backends/bge_reranker.py

PYTHON Triton Python backend — bge-reranker-large
import os, json
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class TritonPythonModel:

    def initialize(self, args):
        config = json.loads(args["model_config"])
        params = config.get("parameters", {})
        model_path = params.get("model_path", {}).get(
            "string_value", "/models/bge-reranker-large")

        # Clamp GPU memory to the configured fraction
        gpu_frac = float(os.environ.get("RERANKER_GPU_FRACTION", "0.10"))
        torch.cuda.set_per_process_memory_fraction(gpu_frac, device=0)

        self.device    = torch.device("cuda:0")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model     = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for req in requests:
            query    = pb_utils.get_input_tensor_by_name(req, "query"
                        ).as_numpy()[0][0].decode()
            passages = [p.decode() for p in
                        pb_utils.get_input_tensor_by_name(req, "passages").as_numpy()[0]]

            pairs = [[query, p] for p in passages]
            enc   = self.tokenizer(pairs, padding=True, truncation=True,
                        max_length=512, return_tensors="pt").to(self.device)

            with torch.no_grad():
                scores = self.model(**enc).logits.squeeze(-1).float().cpu().numpy()

            out = pb_utils.Tensor("scores", scores.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()
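
Note that the scores this backend returns are raw logits, not probabilities. If a client wants values between 0 and 1, one common choice is to apply a sigmoid and sort. A minimal client-side sketch (the helper names are illustrative):

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank_passages(passages, logits):
    """Pair each passage with its sigmoid score, best match first."""
    scored = [(p, sigmoid(l)) for p, l in zip(passages, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for passage, score in rank_passages(["doc A", "doc B"], [-1.2, 3.4]):
        print(f"{score:.3f}  {passage}")
```

The sigmoid is monotonic, so it never changes the ranking; it only rescales the scores for thresholds or display.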

backends/nomic_embed.py

PYTHON Triton Python backend — nomic-embed-text-v2-moe
import os, json
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoTokenizer, AutoModel

class TritonPythonModel:

    def initialize(self, args):
        config = json.loads(args["model_config"])
        params = config.get("parameters", {})
        model_path = params.get("model_path", {}).get(
            "string_value", "/models/nomic-embed-text-v2-moe")

        gpu_frac = float(os.environ.get("EMBED_GPU_FRACTION", "0.05"))
        torch.cuda.set_per_process_memory_fraction(gpu_frac, device=0)

        self.device    = torch.device("cuda:0")
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, trust_remote_code=True)
        self.model     = AutoModel.from_pretrained(
            model_path, trust_remote_code=True,
            torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for req in requests:
            texts = [t.decode() for t in
                     pb_utils.get_input_tensor_by_name(req, "texts").as_numpy()[0]]

            enc = self.tokenizer(texts, padding=True, truncation=True,
                     max_length=512, return_tensors="pt").to(self.device)

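            with torch.no_grad():
                out  = self.model(**enc)
                # Mean-pool over real tokens only, so padding does not dilute the vector
                mask = enc["attention_mask"].unsqueeze(-1).float()
                summed     = (out.last_hidden_state.float() * mask).sum(dim=1)
                embeddings = (summed / mask.sum(dim=1).clamp(min=1)).cpu().numpy()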

            t = pb_utils.Tensor("embeddings", embeddings.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[t]))
        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()

10 Step 5 — The Dockerfile

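A two-stage Dockerfile: Stage 1 compiles the TRT engine inside the builder image, and Stage 2 copies only the artifacts needed at runtime. Both stages start from the same base image, so what the final image leaves behind is not build tools but the raw gpt-oss-20b weights and the intermediate checkpoint. Because Stage 1 runs trtllm-build during docker build, the build itself needs GPU access; docker build has no --gpus flag, so this usually means making nvidia Docker's default runtime, or building the engine with the docker run commands from Step 2 and copying it in.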

DOCKERFILE Dockerfile (multi-stage)
# ══════════════════════════════════════════════════════════════
# Stage 1 — Engine Builder
# Compiles gpt-oss-20b → TensorRT engine on H100 (SM90)
# ══════════════════════════════════════════════════════════════
FROM nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 AS engine-builder

RUN pip install --no-cache-dir \
    transformers==4.44.0 \
    huggingface_hub==0.24.0 \
    sentencepiece einops

# Build-time arguments (override with --build-arg)
# MODEL_SOURCE is "local" or "hub"; Dockerfiles do not allow trailing comments on ARG lines
ARG MODEL_SOURCE=local
ARG HF_TOKEN=""
ARG GPT_MODEL_ID="your-org/gpt-oss-20b"
ARG MAX_BATCH_SIZE=32
ARG MAX_INPUT_LEN=4096
ARG MAX_SEQ_LEN=8192

# Option A: copy local weights (default)
COPY ./models/gpt-oss-20b /workspace/hf_models/gpt-oss-20b

# Option B: download from Hub at build time
RUN if [ "$MODEL_SOURCE" = "hub" ]; then \
      huggingface-cli download ${GPT_MODEL_ID} \
        --token ${HF_TOKEN} \
        --local-dir /workspace/hf_models/gpt-oss-20b; \
    fi

# Convert HF → TRT-LLM checkpoint
RUN python /app/tensorrt_llm/examples/gpt/convert_checkpoint.py \
      --model_dir  /workspace/hf_models/gpt-oss-20b \
      --output_dir /workspace/trt_ckpt/gpt-oss-20b \
      --dtype      float16 \
      --tp_size    1

# Compile TRT engine
RUN trtllm-build \
      --checkpoint_dir         /workspace/trt_ckpt/gpt-oss-20b \
      --output_dir             /workspace/trt_engines/gpt-oss-20b \
      --gemm_plugin            float16 \
      --gpt_attention_plugin   float16 \
      --max_batch_size         ${MAX_BATCH_SIZE} \
      --max_input_len          ${MAX_INPUT_LEN}  \
      --max_seq_len            ${MAX_SEQ_LEN}    \
      --use_paged_context_fmha enable \
      --workers                1


# ══════════════════════════════════════════════════════════════
# Stage 2 — Runtime Server
# ══════════════════════════════════════════════════════════════
FROM nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

LABEL description="TRT-LLM: gpt-oss-20b + bge-reranker-large + nomic-embed-text-v2-moe"

RUN pip install --no-cache-dir \
    transformers==4.44.0 \
    sentence-transformers==3.0.1 \
    einops numpy \
    torch==2.3.0 \
    sentencepiece

# Copy compiled engine from builder stage
COPY --from=engine-builder /workspace/trt_engines/gpt-oss-20b /engines/trt/gpt-oss-20b

# Copy HF weights for Python-backend models
COPY ./models/bge-reranker-large        /models/bge-reranker-large
COPY ./models/nomic-embed-text-v2-moe   /models/nomic-embed-text-v2-moe

# Copy Triton model repository + backend scripts
COPY ./triton_repo                      /triton_repo
COPY ./backends/bge_reranker.py         /triton_repo/bge_reranker/1/model.py
COPY ./backends/nomic_embed.py          /triton_repo/nomic_embed/1/model.py

COPY entrypoint.sh /entrypoint.sh
RUN  chmod +x /entrypoint.sh

# ── Default ENV (all overridable via docker run -e) ───────────
ENV GPT_GPU_FRACTION=0.80
ENV RERANKER_GPU_FRACTION=0.10
ENV EMBED_GPU_FRACTION=0.05
ENV MAX_BATCH_SIZE_GPT=32
ENV MAX_BATCH_SIZE_RERANKER=64
ENV MAX_BATCH_SIZE_EMBED=128
ENV KV_CACHE_FRACTION=0.90
ENV TRITON_HTTP_PORT=8000
ENV TRITON_GRPC_PORT=8001
ENV TRITON_METRICS_PORT=8002
ENV LOG_VERBOSE=0

EXPOSE 8000 8001 8002

ENTRYPOINT ["/entrypoint.sh"]

entrypoint.sh

BASH entrypoint.sh
#!/bin/bash
set -e

echo "======================================================"
echo " TRT-LLM Multi-Model Server Starting"
echo " gpt-oss-20b:              ${GPT_GPU_FRACTION}"
echo " bge-reranker-large:       ${RERANKER_GPU_FRACTION}"
echo " nomic-embed-text-v2-moe:  ${EMBED_GPU_FRACTION}"
echo "======================================================"

# Patch config placeholders with runtime ENV values
sed -i \
  -e "s|__GPT_GPU_FRACTION__|${GPT_GPU_FRACTION}|g" \
  -e "s|__MAX_BATCH_SIZE_GPT__|${MAX_BATCH_SIZE_GPT}|g" \
  -e "s|__KV_CACHE_FRACTION__|${KV_CACHE_FRACTION}|g" \
  /triton_repo/gpt_oss_20b/config.pbtxt

sed -i \
  -e "s|__RERANKER_GPU_FRACTION__|${RERANKER_GPU_FRACTION}|g" \
  -e "s|__MAX_BATCH_SIZE_RERANKER__|${MAX_BATCH_SIZE_RERANKER}|g" \
  /triton_repo/bge_reranker/config.pbtxt

sed -i \
  -e "s|__EMBED_GPU_FRACTION__|${EMBED_GPU_FRACTION}|g" \
  -e "s|__MAX_BATCH_SIZE_EMBED__|${MAX_BATCH_SIZE_EMBED}|g" \
  /triton_repo/nomic_embed/config.pbtxt

export RERANKER_GPU_FRACTION EMBED_GPU_FRACTION

# Start Triton — load only the 3 known models
exec tritonserver \
  --model-repository=/triton_repo \
  --http-port=${TRITON_HTTP_PORT} \
  --grpc-port=${TRITON_GRPC_PORT} \
  --metrics-port=${TRITON_METRICS_PORT} \
  --log-verbose=${LOG_VERBOSE} \
  --model-control-mode=explicit \
  --load-model=gpt_oss_20b \
  --load-model=bge_reranker \
  --load-model=nomic_embed

11 Step 6 — Build & Run

Sync project to the H100 VM

BASH from your laptop
# From your local machine — sync everything except model weights
# (you already downloaded weights directly on the VM)
rsync -avz --exclude='models/' \
  ./trtllm-project/ \
  user@h100-vm.example.com:/home/user/trtllm-project/

# SSH into the VM
ssh user@h100-vm.example.com

Build the Docker image

BASH on the H100 VM
cd ~/trtllm-project

# Standard build (uses local model weights)
docker build \
  -t trtllm-server:latest \
  --progress=plain \
  . 2>&1 | tee build.log

# Optional: build with HuggingFace Hub download
docker build \
  --build-arg MODEL_SOURCE=hub \
  --build-arg HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx \
  --build-arg GPT_MODEL_ID=your-org/gpt-oss-20b \
  --build-arg MAX_BATCH_SIZE=64 \
  -t trtllm-server:latest \
  .

# Verify image was created
docker images | grep trtllm-server

Run the container

BASH docker run — all ENV vars
# A comment between backslash-continued lines would cut the command short,
# so the variable groups are listed here instead:
#   GPU memory allocation: GPT_GPU_FRACTION, RERANKER_GPU_FRACTION, EMBED_GPU_FRACTION, KV_CACHE_FRACTION
#   Batch sizes:           MAX_BATCH_SIZE_GPT, MAX_BATCH_SIZE_RERANKER, MAX_BATCH_SIZE_EMBED
#   Ports:                 TRITON_HTTP_PORT, TRITON_GRPC_PORT, TRITON_METRICS_PORT
#   Debug:                 LOG_VERBOSE
docker run --gpus all \
  --shm-size=16g \
  --name trtllm-server \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -e GPT_GPU_FRACTION=0.80 \
  -e RERANKER_GPU_FRACTION=0.10 \
  -e EMBED_GPU_FRACTION=0.05 \
  -e KV_CACHE_FRACTION=0.90 \
  -e MAX_BATCH_SIZE_GPT=32 \
  -e MAX_BATCH_SIZE_RERANKER=64 \
  -e MAX_BATCH_SIZE_EMBED=128 \
  -e TRITON_HTTP_PORT=8000 \
  -e TRITON_GRPC_PORT=8001 \
  -e TRITON_METRICS_PORT=8002 \
  -e LOG_VERBOSE=0 \
  trtllm-server:latest

12 ENV Variable Reference

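Variable | Default | Description
GPT_GPU_FRACTION | 0.80 | Intended share of H100 VRAM for gpt-oss-20b. The config.pbtxt shown in Step 3 contains no __GPT_GPU_FRACTION__ placeholder, so this value is only logged at startup; the LLM's real footprint is set by the engine size plus KV_CACHE_FRACTION
RERANKER_GPU_FRACTION | 0.10 | Share of VRAM for bge-reranker-large (PyTorch memory cap)
EMBED_GPU_FRACTION | 0.05 | Share of VRAM for nomic-embed-text-v2-moe (PyTorch memory cap)
KV_CACHE_FRACTION | 0.90 | Passed to kv_cache_free_gpu_mem_fraction: the fraction of GPU memory still free after the engine loads that TRT-LLM reserves for the paged KV cache
MAX_BATCH_SIZE_GPT | 32 | Max concurrent sequences for TRT-LLM in-flight batching (must not exceed the --max_batch_size the engine was built with)
MAX_BATCH_SIZE_RERANKER | 64 | Max batch size for reranker inference
MAX_BATCH_SIZE_EMBED | 128 | Max batch size for embedding inference
TRITON_HTTP_PORT | 8000 | REST/HTTP endpoint for inference requests
TRITON_GRPC_PORT | 8001 | gRPC endpoint (lower latency, recommended for production)
TRITON_METRICS_PORT | 8002 | Prometheus metrics scrape endpoint
LOG_VERBOSE | 0 | 0 = off, 1 = verbose Triton debug logs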
Memory Budget Warning
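GPT_GPU_FRACTION + RERANKER_GPU_FRACTION + EMBED_GPU_FRACTION must be ≤ 0.95 on an H100 80GB. The remaining 5% (~4 GB) is needed for CUDA context, kernel code, and Triton's own memory. Exceeding this will cause OOM errors at model load time. The three fractions are not enforced jointly: KV_CACHE_FRACTION applies to whatever memory is free when the engine loads, so lower it if the Python-backend models fail to allocate.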
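
The budget rule is easy to check before launching the container. A minimal sketch, assuming the 5% reserve stated above (the function name and dictionary keys are illustrative):

```python
def vram_budget(total_gb: float, fractions: dict, reserve: float = 0.05) -> dict:
    """Convert per-model fractions to GB, rejecting budgets that eat the reserve."""
    used = sum(fractions.values())
    limit = 1.0 - reserve
    # Small tolerance so that 0.80 + 0.10 + 0.05 is not rejected by float rounding
    if used > limit + 1e-9:
        raise ValueError(f"fractions sum to {used:.2f}, limit is {limit:.2f}")
    return {name: total_gb * frac for name, frac in fractions.items()}


if __name__ == "__main__":
    plan = vram_budget(80, {"gpt": 0.80, "reranker": 0.10, "embed": 0.05})
    for name, gb in plan.items():
        print(f"{name:10s} {gb:5.1f} GB")
```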

13 Testing & Validation

Health Check

BASH
# Server health
curl http://localhost:8000/v2/health/ready

# List loaded models (the repository index is a POST endpoint)
curl -s -X POST http://localhost:8000/v2/repository/index | python3 -m json.tool

# Individual model readiness
curl http://localhost:8000/v2/models/gpt_oss_20b/ready
curl http://localhost:8000/v2/models/bge_reranker/ready
curl http://localhost:8000/v2/models/nomic_embed/ready
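
Loading a 20B engine takes a while, so a deployment script usually polls the readiness endpoint rather than calling it once. A minimal polling sketch using only the standard library (the helper names, retry count, and delay are illustrative choices):

```python
import time
import urllib.error
import urllib.request


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """True when the endpoint answers 200; any connection or HTTP error is 'not ready'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check, attempts: int = 60, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt != attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    url = "http://localhost:8000/v2/health/ready"
    ok = wait_until_ready(lambda: http_ready(url))
    print("server ready" if ok else "server did not become ready in time")
```

Passing the check as a callable keeps the retry loop testable without a running server.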

Test Inference — gpt_oss_20b

PYTHON
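import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

# gpt_oss_20b is configured as decoupled, so it has to be called over gRPC
# streaming; the plain HTTP infer call rejects decoupled models.
client = grpcclient.InferenceServerClient("localhost:8001")

# Placeholder token IDs: tokenize your prompt with the model's own tokenizer
input_ids = np.array([[1, 15043, 29892]], dtype=np.int32)   # "Hello,"
lengths   = np.array([[input_ids.shape[1]]], dtype=np.int32)
out_len   = np.array([[100]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("input_ids", input_ids.shape, "INT32"),
    grpcclient.InferInput("input_lengths", lengths.shape, "INT32"),
    grpcclient.InferInput("request_output_len", out_len.shape, "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(lengths)
inputs[2].set_data_from_numpy(out_len)

# Responses (or errors) arrive on the stream callback
responses = queue.Queue()

def on_response(q, result, error):
    q.put(error if error is not None else result)

client.start_stream(callback=partial(on_response, responses))
client.async_stream_infer("gpt_oss_20b", inputs)
client.stop_stream()   # closes the stream after pending responses arrive

while not responses.empty():
    item = responses.get()
    if isinstance(item, Exception):
        raise item
    print(item.as_numpy("output_ids"))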

Test Inference — bge_reranker

PYTHON
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

query    = np.array([["What is TensorRT?"]], dtype=object)
passages = np.array([["TensorRT is NVIDIA's inference optimizer",
                       "TensorRT-LLM extends TensorRT for LLMs"]], dtype=object)

inputs = [
    httpclient.InferInput("query",    query.shape,    "BYTES"),
    httpclient.InferInput("passages", passages.shape, "BYTES"),
]
inputs[0].set_data_from_numpy(query)
inputs[1].set_data_from_numpy(passages)

result = client.infer("bge_reranker", inputs)
scores = result.as_numpy("scores")
print("Rerank scores:", scores)  # raw logits, one per passage; higher = more relevant

Prometheus Metrics

BASH
# Raw metrics
curl http://localhost:8002/metrics

# Key metrics to watch
curl -s http://localhost:8002/metrics | grep -E \
  "nv_inference_request_success|nv_gpu_memory|nv_inference_queue_duration"

14 Performance Tuning Tips

For gpt-oss-20b Throughput

  • Use FP8 instead of FP16: roughly halves weight memory, which frees room for a larger KV cache and batch size on H100
  • Increase max_num_sequences in config.pbtxt to allow more in-flight requests
  • Enable --use_paged_context_fmha (already included) — crucial for long contexts
  • Set KV_CACHE_FRACTION=0.92 to give more cache to the LLM at the cost of less headroom
  • For throughput-heavy workloads, use --tp_size 1 (single GPU); for latency-heavy, consider tensor parallelism across 2× H100 if available
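
How far the KV cache stretches follows from simple arithmetic: every token stores one key and one value vector per layer. The sketch below uses hypothetical model dimensions chosen for illustration (24 layers, 8 KV heads, head size 64, FP16 cache); read the real values from your model's config.json.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (the factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(cache_gib: float, bytes_per_token: int) -> int:
    """How many tokens a cache of `cache_gib` GiB can hold across all sequences."""
    return int(cache_gib * 1024**3) // bytes_per_token


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=8, head_dim=64)
    print(per_token, "bytes per token")                 # 49152
    print(tokens_that_fit(20, per_token), "tokens fit in a 20 GiB cache")  # 436906
```

Dividing the token capacity by the expected sequence length gives a rough upper bound on how many requests can be in flight before the cache becomes the bottleneck.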

For Embedding / Reranker Throughput

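  • Raise MAX_BATCH_SIZE_EMBED (e.g. to 256) and MAX_BATCH_SIZE_RERANKER (e.g. to 128) if traffic is bursty
  • Export nomic-embed to ONNX and serve it through Triton's ONNX Runtime or TensorRT backend for faster embedding inference (around 2× is plausible, but benchmark it on your own traffic)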
  • Use torch.compile(model) in initialize() for Python backend models (PyTorch 2.x)

Monitor GPU Utilization Live

BASH
# Real-time GPU stats inside the container (watch needs a TTY, hence -it)
docker exec -it trtllm-server watch -n 1 nvidia-smi

# Memory per process
docker exec trtllm-server nvidia-smi \
  --query-compute-apps=pid,used_memory \
  --format=csv

15 Troubleshooting

Error | Cause | Fix
CUDA out of memory | GPU fractions sum > 0.95 | Reduce one fraction, e.g. GPT_GPU_FRACTION=0.75, or lower KV_CACHE_FRACTION
engine file not found | Build stage failed silently | Check build.log; re-run with --progress=plain
unsupported SM version | Engine built on different GPU arch | Always build and run on the same GPU architecture
Model not ready on startup | Python backend import error | Run docker logs trtllm-server, check transformers version
tritonserver: not found | Wrong base image | Must use tritonserver:24.12-trtllm-python-py3, not a plain CUDA image
High P99 latency spikes | KV cache eviction under load | Increase KV_CACHE_FRACTION or reduce max_num_sequences
Quick Debug Command
Run docker exec -it trtllm-server bash and then curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool to see exactly which models loaded and their state without restarting.