Deploying Multi-Model LLM Inference with TensorRT-LLM on NVIDIA H100

01 Overview & Motivation

Running multiple AI models — a large language model, a reranker, and an embedding model — on a single GPU server is a common production challenge. The naive approach is to spin up three separate containers, each with its own GPU slice, using vLLM. That works, but it leaves performance gains on the table.

NVIDIA's TensorRT-LLM combined with Triton Inference Server offers a more tightly optimized path: compile each generative model into a GPU-architecture-specific .engine binary, serve all models under a single Triton process, and tune memory allocation via environment variables.

This guide walks through the complete production workflow — from downloading raw weights to a running container — for a three-model stack on a single NVIDIA H100 SXM5 (80GB HBM3):

H100 80GB VRAM Allocation

Component	VRAM	Share
gpt-oss-20b	64 GB	80%
bge-reranker-large	8 GB	10%
nomic-embed-text-v2-moe	4 GB	5%
CUDA overhead	~4 GB	5%

02 Core Concepts

What is TensorRT-LLM?

TensorRT-LLM is NVIDIA's open-source library that optimises LLM inference using TensorRT. It provides Python APIs to define, compile, and run transformer models as TensorRT engines. Key capabilities include:

In-flight batching (continuous batching) for high throughput
Paged KV-cache to prevent memory fragmentation
FP8, INT8, INT4 quantization (GPTQ, AWQ, SmoothQuant)
Tensor parallelism & pipeline parallelism across multiple GPUs
Flash Attention 2, chunked context processing
Speculative decoding (draft-model acceleration)

What is Triton Inference Server?

NVIDIA Triton Inference Server is a production-grade serving framework that hosts multiple models simultaneously across different backends — TensorRT, TensorRT-LLM, Python, ONNX, TensorFlow, PyTorch. A single Triton process manages GPU scheduling, request batching, health endpoints, and Prometheus metrics for all loaded models.

The Two-Phase Workflow

Key Insight

Unlike vLLM, which loads the weights and prepares its kernels when the server starts, TensorRT-LLM requires an offline compilation step (phase one) that produces a GPU-architecture-specific .engine file. This build happens once, typically in Docker stage 1 or as a pre-build. The compiled artifact is then loaded by Triton at every container start (phase two).

03 vLLM vs TensorRT-LLM

Feature	vLLM	TensorRT-LLM + Triton
Setup complexity	✓ Simple	✗ Complex (engine build step)
Cold start time	✓ Fast (runtime compile)	✗ Engine build is slow (one-time)
Throughput on H100	~ Good	✓ Best-in-class (+20–40%)
Latency P99	~ Good	✓ Lower (compiled kernels)
Multi-model serving	✗ Separate processes	✓ Single Triton process
GPU memory control	~ Per-instance fraction	✓ Fine-grained per-model
FP8 quantization (H100)	✓	✓ Native (post-training quantization)
KV cache paging	✓ PagedAttention	✓ Paged KV cache
Embedding models	✓ Direct	~ Python backend
OpenAI-compatible API	✓ Built-in	✗ Needs a proxy or adapter layer
Metrics (Prometheus)	✓	✓ Native
Portability across GPU arches	✓ Works anywhere	✗ Must rebuild per arch (SM80 → SM90)

When to stay with vLLM

If your team needs OpenAI-compatible endpoints out of the box, frequent model swaps, or you're not latency-constrained at current traffic, vLLM is simpler to operate. TRT-LLM pays off at high concurrency (>50 RPS) or when squeezing the last 30% performance out of expensive H100 capacity.

04 Architecture Design

Docker Container on NVIDIA H100 SXM5 (80 GB), running Triton Inference Server v24.12

Model (backend)	Artifact path	GPU share
gpt_oss_20b (tensorrtllm)	/engines/trt/gpt-oss-20b/*.engine	80% (64 GB)
bge_reranker (python)	/models/bge-reranker-large	10% (8 GB)
nomic_embed (python)	/models/nomic-embed-text-v2-moe	5% (4 GB)

Ports: 8000 (HTTP), 8001 (gRPC), 8002 (Prometheus metrics)

Request flow: Client -> Triton Router -> Model Backend -> GPU Execution -> Response

Request Flow

An incoming inference request arrives at Triton on port 8000 (HTTP/REST) or 8001 (gRPC). Triton routes it to the correct model backend by model name. The TRT-LLM backend uses in-flight batching — new requests join the active generation loop without waiting for the current batch to finish, maximizing GPU utilization.
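The benefit of in-flight batching can be seen in a toy step-count model. The sketch below is not TensorRT-LLM's scheduler: it assumes every sequence produces one token per step, and it compares a static batch (which runs until its longest sequence finishes) with a batch that refills a slot as soon as a sequence completes. All function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch occupies the GPU until its longest sequence is done."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def inflight_batching_steps(output_lengths, batch_size):
    """A finished sequence frees its slot for the next waiting request."""
    waiting = deque(output_lengths)
    active = []          # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        # Fill any free slots before the next decoding step
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active sequence emits one token; drop the finished ones
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


lengths = [8, 2, 2, 2]   # tokens to generate per request (made-up workload)
static_steps = static_batching_steps(lengths, batch_size=2)
inflight_steps = inflight_batching_steps(lengths, batch_size=2)
print(static_steps, inflight_steps)   # 10 8
```

With one long request and several short ones, the static schedule idles a slot while the long sequence finishes, whereas the in-flight schedule keeps both slots busy.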

Project Directory Structure

SHELL directory layout

# Your build context (rsync this to the H100 VM)
trtllm-project/
├── Dockerfile                      # Multi-stage build
├── entrypoint.sh                   # Patches configs + starts Triton
│
├── models/                          # HuggingFace weights (local)
│   ├── gpt-oss-20b/
│   ├── bge-reranker-large/
│   └── nomic-embed-text-v2-moe/
│
├── triton_repo/                     # Triton model repository
│   ├── gpt_oss_20b/
│   │   ├── config.pbtxt
│   │   └── 1/                       # (engine loaded from /engines/)
│   ├── bge_reranker/
│   │   ├── config.pbtxt
│   │   └── 1/
│   │       └── model.py
│   └── nomic_embed/
│       ├── config.pbtxt
│       └── 1/
│           └── model.py
│
└── backends/                        # Python backend scripts
    ├── bge_reranker.py
    └── nomic_embed.py
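Before building, it can help to confirm that each model directory has the pieces Triton looks for. The following is a minimal sketch, assuming only the layout shown above (a config.pbtxt plus a version directory named 1 per model); check_model_repo is a hypothetical helper, not part of Triton.

```python
import tempfile
from pathlib import Path


def check_model_repo(repo_root, model_names):
    """Return a list of problems found in a Triton-style model repository."""
    problems = []
    for name in model_names:
        model_dir = Path(repo_root) / name
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{name}: missing config.pbtxt")
        if not (model_dir / "1").is_dir():
            problems.append(f"{name}: missing version directory 1/")
    return problems


# Demo on a throwaway directory: one complete model, one missing entirely
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "gpt_oss_20b" / "1").mkdir(parents=True)
    (Path(root) / "gpt_oss_20b" / "config.pbtxt").write_text('name: "gpt_oss_20b"\n')
    demo_problems = check_model_repo(root, ["gpt_oss_20b", "nomic_embed"])

print(demo_problems)
```

On the real project this would be called as check_model_repo("triton_repo", ["gpt_oss_20b", "bge_reranker", "nomic_embed"]); an empty list means the layout matches.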

05 Prerequisites

Hardware Requirement

Engine compilation (trtllm-build) requires the same GPU architecture as the target runtime. H100 = SM90. You cannot compile on your laptop and deploy on H100. Always build on the H100 VM.
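A quick way to confirm that the build host and the runtime host share an architecture is to compare their compute capabilities. The sketch below assumes a driver recent enough for nvidia-smi to support the compute_cap query field; sm_tag and local_sm_tags are illustrative names.

```python
import subprocess


def sm_tag(compute_cap):
    """Map a compute capability string such as '9.0' to a tag such as 'SM90'."""
    major, minor = compute_cap.strip().split(".")
    return f"SM{int(major)}{int(minor)}"


def local_sm_tags():
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [sm_tag(line) for line in out.splitlines() if line.strip()]


print(sm_tag("9.0"))   # SM90 (H100)
print(sm_tag("8.0"))   # SM80 (A100)
```

Running local_sm_tags() on both machines and comparing the results shows whether an engine built on one can be loaded on the other.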

On the H100 VM — Install NVIDIA Container Toolkit

BASH VM setup

# 1. Update and install docker
sudo apt-get update
sudo apt-get install -y docker.io

# 2. Add NVIDIA Container Toolkit repo
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# 3. Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 4. Configure docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 5. Verify GPU is visible
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Pull the Base Image (large — ~20 GB)

BASH pull base

# Log in to NVIDIA NGC (free account required)
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key from https://ngc.nvidia.com/setup>

# Pull Triton + TRT-LLM runtime image
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

06 Step 1 — Download Model Weights

All three models must be downloaded as HuggingFace-format checkpoints before building the Docker image. The two embedding/reranker models are served as-is via the Python backend, so no compilation is needed for them.

BASH download all models

# Install huggingface CLI
pip install -U huggingface_hub

# Authenticate (needed for gated models)
huggingface-cli login
# Paste your HF token when prompted

# Create model directory
mkdir -p ~/trtllm-project/models

# ── gpt-oss-20b (replace with your actual model ID) ──────────────
huggingface-cli download your-org/gpt-oss-20b \
  --local-dir ~/trtllm-project/models/gpt-oss-20b \
  --local-dir-use-symlinks False

# ── bge-reranker-large ────────────────────────────────────────────
huggingface-cli download BAAI/bge-reranker-large \
  --local-dir ~/trtllm-project/models/bge-reranker-large \
  --local-dir-use-symlinks False

# ── nomic-embed-text-v2-moe ───────────────────────────────────────
huggingface-cli download nomic-ai/nomic-embed-text-v2-moe \
  --local-dir ~/trtllm-project/models/nomic-embed-text-v2-moe \
  --local-dir-use-symlinks False

# Verify sizes
du -sh ~/trtllm-project/models/*

Storage Requirement

A 20B parameter model in FP16 requires approximately 40 GB of disk space. Ensure your VM has at least 120 GB free before downloading all three models.
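The 40 GB figure follows from parameter count times bytes per parameter. A minimal sketch of that arithmetic, counting weights only and using decimal gigabytes (the helper name is illustrative):

```python
# Bytes stored per parameter for common weight formats
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1}


def weight_size_gb(num_params, dtype):
    """Approximate checkpoint size in decimal GB (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(weight_size_gb(20e9, "float16"))  # 40.0
print(weight_size_gb(20e9, "fp8"))      # 20.0
```

Tokenizer files, optimizer-free metadata, and the two smaller models add a few more gigabytes on top, and the converted checkpoint plus compiled engine each need similar space again, which is why the 120 GB recommendation leaves headroom.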

07 Step 2 — Build the TRT Engine

This is the most critical step. We use the Triton+TRT-LLM image as a builder environment, mount the model weights, and produce a compiled .engine file optimized for H100's SM90 architecture.

2a. Convert HuggingFace checkpoint to TRT-LLM format

BASH checkpoint conversion

docker run --rm --gpus all \
  -v ~/trtllm-project/models:/models \
  -v ~/trtllm-project/engines:/engines \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
  python /app/tensorrt_llm/examples/gpt/convert_checkpoint.py \
    --model_dir  /models/gpt-oss-20b \
    --output_dir /engines/trt_ckpt/gpt-oss-20b \
    --dtype      float16 \
    --tp_size    1

Architecture Note

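The --dtype float16 flag keeps the weights in native FP16. FP8, which H100's Transformer Engine accelerates natively, is not selected through --dtype: it requires a separate post-training quantization step that produces an FP8 checkpoint (TensorRT-LLM ships a quantization example script for this). FP8 weights take roughly half the space of FP16, which reduces the 20B model's weight footprint from ~40 GB to ~20 GB.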

2b. Compile the TRT engine with `trtllm-build`

BASH engine compilation (~20–40 min on H100)

docker run --rm --gpus all \
  -v ~/trtllm-project/engines:/engines \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
  trtllm-build \
    --checkpoint_dir         /engines/trt_ckpt/gpt-oss-20b \
    --output_dir             /engines/trt/gpt-oss-20b \
    --gemm_plugin            float16 \
    --gpt_attention_plugin   float16 \
    --max_batch_size         32 \
    --max_input_len          4096 \
    --max_seq_len            8192 \
    --use_paged_context_fmha enable \
    --workers                1

# Verify the engine was created
ls -lh ~/trtllm-project/engines/trt/gpt-oss-20b/
# Should see: rank0.engine, config.json
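The same check can be scripted, for example as a guard before starting the server. This is a sketch that assumes the file names listed above and one rank<N>.engine file per tensor-parallel rank; engine_is_built is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def engine_is_built(engine_dir, tp_size=1):
    """True if config.json and one rank<N>.engine per rank are present."""
    root = Path(engine_dir)
    expected = ["config.json"] + [f"rank{rank}.engine" for rank in range(tp_size)]
    return all((root / name).is_file() for name in expected)


# Demo on a throwaway directory with empty stand-in files
with tempfile.TemporaryDirectory() as tmp:
    before = engine_is_built(tmp)
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "rank0.engine").write_bytes(b"")
    after = engine_is_built(tmp)

print(before, after)   # False True
```

On the VM the call would be engine_is_built(Path.home() / "trtllm-project/engines/trt/gpt-oss-20b").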

Optional — FP8 for Maximum H100 Throughput

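FP8 is driven by the checkpoint rather than by these two plugin flags: build the engine from an FP8-quantized checkpoint and keep --gpt_attention_plugin at float16, since float8 is not an accepted value for it. Benchmark FP16 against FP8 and compare the quality versus speed tradeoff for your specific use case.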

08 Step 3 — Triton Model Repository Configs

Each model needs a config.pbtxt file that tells Triton which backend to use, what inputs/outputs look like, and GPU parameters. The placeholder tokens (__VAR__) are replaced at container startup by entrypoint.sh.
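The substitution is a plain text replacement, so it is easy to miss a token. The sketch below mirrors the idea in Python and reports any placeholder left unresolved; it is an illustration of the mechanism, not the entrypoint.sh used in this guide, and the function names are hypothetical.

```python
import re

# Matches tokens of the form __NAME__ (upper case, digits, underscores)
PLACEHOLDER = re.compile(r"__([A-Z0-9_]+?)__")


def render_config(template, values):
    """Replace __NAME__ with values[NAME]; unknown tokens are left as-is."""
    return PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )


def unresolved(text):
    """Names of placeholders still present in the text."""
    return sorted(set(PLACEHOLDER.findall(text)))


template = (
    "max_batch_size: __MAX_BATCH_SIZE_GPT__\n"
    'value: { string_value: "__KV_CACHE_FRACTION__" }\n'
)
rendered = render_config(template, {"MAX_BATCH_SIZE_GPT": 32})
print(rendered)
print(unresolved(rendered))   # ['KV_CACHE_FRACTION']
```

Running unresolved() over each patched config.pbtxt before launching Triton catches a missing environment variable early, instead of as a protobuf parse error at load time.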

triton_repo/gpt_oss_20b/config.pbtxt

PROTOBUF gpt_oss_20b/config.pbtxt

name: "gpt_oss_20b"
backend: "tensorrtllm"
max_batch_size: __MAX_BATCH_SIZE_GPT__

model_transaction_policy { decoupled: true }

input [
  { name: "input_ids"          data_type: TYPE_INT32  dims: [ -1 ] },
  { name: "input_lengths"      data_type: TYPE_INT32  dims: [  1 ] },
  { name: "request_output_len" data_type: TYPE_INT32  dims: [  1 ] },
  { name: "streaming"          data_type: TYPE_BOOL   dims: [  1 ], optional: true }
]

output [
  { name: "output_ids"      data_type: TYPE_INT32  dims: [ -1 ] },
  { name: "sequence_length" data_type: TYPE_INT32  dims: [  1 ] }
]

# The TensorRT-LLM backend reads the engine location from "gpt_model_path"
parameters {
  key: "gpt_model_path"
  value: { string_value: "/engines/trt/gpt-oss-20b" }
}
parameters {
  key: "executor_worker_path"
  value: { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
parameters {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "__KV_CACHE_FRACTION__" }
}
parameters {
  key: "max_num_sequences"
  value: { string_value: "256" }
}

instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]

triton_repo/bge_reranker/config.pbtxt

PROTOBUF bge_reranker/config.pbtxt

name: "bge_reranker"
backend: "python"
max_batch_size: __MAX_BATCH_SIZE_RERANKER__

input [
  { name: "query",    data_type: TYPE_STRING, dims: [ 1 ] },
  { name: "passages", data_type: TYPE_STRING, dims: [ -1 ] }
]

output [
  { name: "scores", data_type: TYPE_FP32, dims: [ -1 ] }
]

parameters {
  key: "model_path"
  value: { string_value: "/models/bge-reranker-large" }
}

instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]

triton_repo/nomic_embed/config.pbtxt

PROTOBUF nomic_embed/config.pbtxt

name: "nomic_embed"
backend: "python"
max_batch_size: __MAX_BATCH_SIZE_EMBED__

input [
  { name: "texts", data_type: TYPE_STRING, dims: [ -1 ] }
]

output [
  { name: "embeddings", data_type: TYPE_FP32, dims: [ -1, 768 ] }
]

parameters {
  key: "model_path"
  value: { string_value: "/models/nomic-embed-text-v2-moe" }
}

instance_group [{ kind: KIND_GPU, gpus: [0], count: 1 }]

09 Step 4 — Python Backend Scripts

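The reranker and embedding models don't use TRT-LLM. They're served through Triton's Python backend, which gives us full control over how PyTorch loads the model and handles requests. Each Python-backend model runs in its own process, and torch.cuda.set_per_process_memory_fraction() caps how much GPU memory PyTorch's allocator may claim in that process; the CUDA context itself sits outside that cap.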

backends/bge_reranker.py

PYTHON Triton Python backend — bge-reranker-large

import os, json
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class TritonPythonModel:

    def initialize(self, args):
        config = json.loads(args["model_config"])
        params = config.get("parameters", {})
        model_path = params.get("model_path", {}).get(
            "string_value", "/models/bge-reranker-large")

        # Clamp GPU memory to the configured fraction
        gpu_frac = float(os.environ.get("RERANKER_GPU_FRACTION", "0.10"))
        torch.cuda.set_per_process_memory_fraction(gpu_frac, device=0)

        self.device    = torch.device("cuda:0")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model     = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for req in requests:
            query    = pb_utils.get_input_tensor_by_name(req, "query"
                        ).as_numpy()[0][0].decode()
            passages = [p.decode() for p in
                        pb_utils.get_input_tensor_by_name(req, "passages").as_numpy()[0]]

            pairs = [[query, p] for p in passages]
            enc   = self.tokenizer(pairs, padding=True, truncation=True,
                        max_length=512, return_tensors="pt").to(self.device)

            with torch.no_grad():
                scores = self.model(**enc).logits.squeeze(-1).float().cpu().numpy()

            # Restore the batch dimension: config.pbtxt declares scores as [batch, -1]
            out = pb_utils.Tensor("scores", scores.astype(np.float32)[np.newaxis, :])
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()

backends/nomic_embed.py

PYTHON Triton Python backend — nomic-embed-text-v2-moe

import os, json
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoTokenizer, AutoModel

class TritonPythonModel:

    def initialize(self, args):
        config = json.loads(args["model_config"])
        params = config.get("parameters", {})
        model_path = params.get("model_path", {}).get(
            "string_value", "/models/nomic-embed-text-v2-moe")

        gpu_frac = float(os.environ.get("EMBED_GPU_FRACTION", "0.05"))
        torch.cuda.set_per_process_memory_fraction(gpu_frac, device=0)

        self.device    = torch.device("cuda:0")
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, trust_remote_code=True)
        self.model     = AutoModel.from_pretrained(
            model_path, trust_remote_code=True,
            torch_dtype=torch.float16
        ).to(self.device)
        self.model.eval()

    def execute(self, requests):
        responses = []
        for req in requests:
            texts = [t.decode() for t in
                     pb_utils.get_input_tensor_by_name(req, "texts").as_numpy()[0]]

            enc = self.tokenizer(texts, padding=True, truncation=True,
                     max_length=512, return_tensors="pt").to(self.device)

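            with torch.no_grad():
                hidden = self.model(**enc).last_hidden_state.float()
                # Mean-pool over real tokens only so padding does not dilute the vector
                mask   = enc["attention_mask"].unsqueeze(-1).float()
                pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
                embeddings = pooled.cpu().numpy()

            # Restore the batch dimension: config.pbtxt declares embeddings as [batch, -1, 768]
            t = pb_utils.Tensor("embeddings", embeddings.astype(np.float32)[np.newaxis, ...])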
            responses.append(pb_utils.InferenceResponse(output_tensors=[t]))
        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()
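When texts in a batch are padded to a common length, mean pooling is usually restricted to real tokens through the attention mask; otherwise the padding positions pull the average away from the sentence's actual content. A pure-Python sketch of the masked average, with made-up two-dimensional vectors and an illustrative function name (a backend would do the same with tensors):

```python
def mean_pool(token_vectors, attention_mask):
    """Average only the vectors whose mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("sequence has no unmasked tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]   # last row stands in for padding
masked = mean_pool(tokens, [1, 1, 0])           # padding ignored
unmasked = mean_pool(tokens, [1, 1, 1])         # padding averaged in

print(masked)     # [2.0, 4.0]
print(unmasked)
```

The two results differ, and the gap grows with the amount of padding, so the same text would embed differently depending on what else happened to be in its batch.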

10 Step 5 — The Dockerfile

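A two-stage Dockerfile: Stage 1 compiles the TRT engine, and Stage 2 copies only the artifacts needed at runtime. Both stages start from the same base image, so the saving comes from leaving the raw gpt-oss-20b weights and the intermediate checkpoint out of the final image rather than from dropping build tools. Note that the RUN steps in Stage 1 need GPU access, which docker build does not provide by default. One common workaround is to make nvidia the default Docker runtime on the build host; another is to build the engine with docker run as in Step 2 and COPY the result from the build context.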

DOCKERFILE Dockerfile (multi-stage)

# ══════════════════════════════════════════════════════════════
# Stage 1 — Engine Builder
# Compiles gpt-oss-20b → TensorRT engine on H100 (SM90)
# ══════════════════════════════════════════════════════════════
FROM nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 AS engine-builder

RUN pip install --no-cache-dir \
    transformers==4.44.0 \
    huggingface_hub==0.24.0 \
    sentencepiece einops

# Build-time arguments — override with --build-arg
# MODEL_SOURCE is either "local" or "hub" (Dockerfiles do not allow trailing comments)
ARG MODEL_SOURCE=local
ARG HF_TOKEN=""
ARG GPT_MODEL_ID="your-org/gpt-oss-20b"
ARG MAX_BATCH_SIZE=32
ARG MAX_INPUT_LEN=4096
ARG MAX_SEQ_LEN=8192

# Option A: copy local weights (default)
COPY ./models/gpt-oss-20b /workspace/hf_models/gpt-oss-20b

# Option B: download from Hub at build time
RUN if [ "$MODEL_SOURCE" = "hub" ]; then \
      huggingface-cli download ${GPT_MODEL_ID} \
        --token ${HF_TOKEN} \
        --local-dir /workspace/hf_models/gpt-oss-20b; \
    fi

# Convert HF → TRT-LLM checkpoint
RUN python /app/tensorrt_llm/examples/gpt/convert_checkpoint.py \
      --model_dir  /workspace/hf_models/gpt-oss-20b \
      --output_dir /workspace/trt_ckpt/gpt-oss-20b \
      --dtype      float16 \
      --tp_size    1

# Compile TRT engine
RUN trtllm-build \
      --checkpoint_dir         /workspace/trt_ckpt/gpt-oss-20b \
      --output_dir             /workspace/trt_engines/gpt-oss-20b \
      --gemm_plugin            float16 \
      --gpt_attention_plugin   float16 \
      --max_batch_size         ${MAX_BATCH_SIZE} \
      --max_input_len          ${MAX_INPUT_LEN}  \
      --max_seq_len            ${MAX_SEQ_LEN}    \
      --use_paged_context_fmha enable \
      --workers                1


# ══════════════════════════════════════════════════════════════
# Stage 2 — Runtime Server
# ══════════════════════════════════════════════════════════════
FROM nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

LABEL description="TRT-LLM: gpt-oss-20b + bge-reranker-large + nomic-embed-text-v2-moe"

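# The base image already ships a PyTorch build matched to TensorRT-LLM,
# so torch is deliberately not reinstalled here.
RUN pip install --no-cache-dir \
    transformers==4.44.0 \
    sentence-transformers==3.0.1 \
    einops numpy \
    sentencepiece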

# Copy compiled engine from builder stage
COPY --from=engine-builder /workspace/trt_engines/gpt-oss-20b /engines/trt/gpt-oss-20b

# Copy HF weights for Python-backend models
COPY ./models/bge-reranker-large        /models/bge-reranker-large
COPY ./models/nomic-embed-text-v2-moe   /models/nomic-embed-text-v2-moe

# Copy Triton model repository + backend scripts
COPY ./triton_repo                      /triton_repo
COPY ./backends/bge_reranker.py         /triton_repo/bge_reranker/1/model.py
COPY ./backends/nomic_embed.py          /triton_repo/nomic_embed/1/model.py

COPY entrypoint.sh /entrypoint.sh
RUN  chmod +x /entrypoint.sh

# ── Default ENV (all overridable via docker run -e) ───────────
ENV GPT_GPU_FRACTION=0.80
ENV RERANKER_GPU_FRACTION=0.10
ENV EMBED_GPU_FRACTION=0.05
ENV MAX_BATCH_SIZE_GPT=32
ENV MAX_BATCH_SIZE_RERANKER=64
ENV MAX_BATCH_SIZE_EMBED=128
ENV KV_CACHE_FRACTION=0.90
ENV TRITON_HTTP_PORT=8000
ENV TRITON_GRPC_PORT=8001
ENV TRITON_METRICS_PORT=8002
ENV LOG_VERBOSE=0

EXPOSE 8000 8001 8002

ENTRYPOINT ["/entrypoint.sh"]

entrypoint.sh

BASH entrypoint.sh

#!/bin/bash
set -e

echo "======================================================"
echo " TRT-LLM Multi-Model Server Starting"
echo " gpt-oss-20b:              ${GPT_GPU_FRACTION}"
echo " bge-reranker-large:       ${RERANKER_GPU_FRACTION}"
echo " nomic-embed-text-v2-moe:  ${EMBED_GPU_FRACTION}"
echo "======================================================"

# Patch config placeholders with runtime ENV values
sed -i \
  -e "s|__GPT_GPU_FRACTION__|${GPT_GPU_FRACTION}|g" \
  -e "s|__MAX_BATCH_SIZE_GPT__|${MAX_BATCH_SIZE_GPT}|g" \
  -e "s|__KV_CACHE_FRACTION__|${KV_CACHE_FRACTION}|g" \
  /triton_repo/gpt_oss_20b/config.pbtxt

sed -i \
  -e "s|__RERANKER_GPU_FRACTION__|${RERANKER_GPU_FRACTION}|g" \
  -e "s|__MAX_BATCH_SIZE_RERANKER__|${MAX_BATCH_SIZE_RERANKER}|g" \
  /triton_repo/bge_reranker/config.pbtxt

sed -i \
  -e "s|__EMBED_GPU_FRACTION__|${EMBED_GPU_FRACTION}|g" \
  -e "s|__MAX_BATCH_SIZE_EMBED__|${MAX_BATCH_SIZE_EMBED}|g" \
  /triton_repo/nomic_embed/config.pbtxt

export RERANKER_GPU_FRACTION EMBED_GPU_FRACTION

# Start Triton — load only the 3 known models
exec tritonserver \
  --model-repository=/triton_repo \
  --http-port=${TRITON_HTTP_PORT} \
  --grpc-port=${TRITON_GRPC_PORT} \
  --metrics-port=${TRITON_METRICS_PORT} \
  --log-verbose=${LOG_VERBOSE} \
  --model-control-mode=explicit \
  --load-model=gpt_oss_20b \
  --load-model=bge_reranker \
  --load-model=nomic_embed

11 Step 6 — Build & Run

Sync project to the H100 VM

BASH from your laptop

# From your local machine — sync everything except model weights
# (you already downloaded weights directly on the VM)
rsync -avz --exclude='models/' \
  ./trtllm-project/ \
  user@h100-vm.example.com:/home/user/trtllm-project/

# SSH into the VM
ssh user@h100-vm.example.com

Build the Docker image

BASH on the H100 VM

cd ~/trtllm-project

# Standard build (uses local model weights)
docker build \
  -t trtllm-server:latest \
  --progress=plain \
  . 2>&1 | tee build.log

# Optional: build with HuggingFace Hub download
docker build \
  --build-arg MODEL_SOURCE=hub \
  --build-arg HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx \
  --build-arg GPT_MODEL_ID=your-org/gpt-oss-20b \
  --build-arg MAX_BATCH_SIZE=64 \
  -t trtllm-server:latest \
  .

# Verify image was created
docker images | grep trtllm-server

Run the container

BASH docker run — all ENV vars

# Comments cannot sit between backslash-continued lines, so the groups are
# described here: GPU memory allocation, batch sizes, ports, debug logging.
docker run --gpus all \
  --shm-size=16g \
  --name trtllm-server \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -e GPT_GPU_FRACTION=0.80 \
  -e RERANKER_GPU_FRACTION=0.10 \
  -e EMBED_GPU_FRACTION=0.05 \
  -e KV_CACHE_FRACTION=0.90 \
  -e MAX_BATCH_SIZE_GPT=32 \
  -e MAX_BATCH_SIZE_RERANKER=64 \
  -e MAX_BATCH_SIZE_EMBED=128 \
  -e TRITON_HTTP_PORT=8000 \
  -e TRITON_GRPC_PORT=8001 \
  -e TRITON_METRICS_PORT=8002 \
  -e LOG_VERBOSE=0 \
  trtllm-server:latest

12 ENV Variable Reference

Variable	Default	Description
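`GPT_GPU_FRACTION`	0.80	Intended share of H100 VRAM for gpt-oss-20b. The gpt_oss_20b config shown earlier has no matching placeholder, so this value is informational; the LLM's actual footprint is governed by `KV_CACHE_FRACTION`
`RERANKER_GPU_FRACTION`	0.10	Share of VRAM for bge-reranker-large (PyTorch memory cap)
`EMBED_GPU_FRACTION`	0.05	Share of VRAM for nomic-embed-text-v2-moe (PyTorch memory cap)
`KV_CACHE_FRACTION`	0.90	Passed to `kv_cache_free_gpu_mem_fraction`: the fraction of GPU memory still free after the engine loads that TRT-LLM reserves for the paged KV cache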
`MAX_BATCH_SIZE_GPT`	32	Max concurrent sequences for TRT-LLM in-flight batching
`MAX_BATCH_SIZE_RERANKER`	64	Max batch size for reranker inference
`MAX_BATCH_SIZE_EMBED`	128	Max batch size for embedding inference
`TRITON_HTTP_PORT`	8000	REST/HTTP endpoint for inference requests
`TRITON_GRPC_PORT`	8001	gRPC endpoint (lower latency, recommended for production)
`TRITON_METRICS_PORT`	8002	Prometheus metrics scrape endpoint
`LOG_VERBOSE`	0	0 = off, 1 = verbose Triton debug logs

Memory Budget Warning

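GPT_GPU_FRACTION + RERANKER_GPU_FRACTION + EMBED_GPU_FRACTION must be ≤ 0.95 on an H100 80GB. The remaining 5% (~4 GB) is needed for CUDA context, kernel code, and Triton's own memory. Exceeding this is likely to cause out-of-memory errors at model load time. Because KV_CACHE_FRACTION applies to whatever memory is free when the engine loads, a high value can also starve the two Python-backend models; lower it if they fail to load.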
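The rule lends itself to a preflight check run before the container starts. A minimal sketch, assuming the variable names and defaults from the table above; check_gpu_fractions is a hypothetical helper and the small tolerance only absorbs floating-point rounding.

```python
import os

LIMIT = 0.95   # leaves ~5% of VRAM for CUDA context and Triton itself
DEFAULTS = {
    "GPT_GPU_FRACTION": "0.80",
    "RERANKER_GPU_FRACTION": "0.10",
    "EMBED_GPU_FRACTION": "0.05",
}


def check_gpu_fractions(env=os.environ, limit=LIMIT):
    """Return (within_budget, total) for the three per-model fractions."""
    total = sum(float(env.get(name, default)) for name, default in DEFAULTS.items())
    return total <= limit + 1e-9, total


ok, total = check_gpu_fractions({})                           # defaults only
too_big, _ = check_gpu_fractions({"GPT_GPU_FRACTION": "0.90"})
print(ok, round(total, 2))   # True 0.95
print(too_big)               # False
```

Called with no arguments it reads the live environment, so it can run as the first step of a launch script and abort before Triton begins loading models.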

13 Testing & Validation

Health Check

BASH

# Server health (HTTP 200 with an empty body means ready; -i shows the status line)
curl -i http://localhost:8000/v2/health/ready

# List models in the repository and their load state
curl -s -X POST http://localhost:8000/v2/repository/index | python3 -m json.tool

# Individual model readiness
curl -i http://localhost:8000/v2/models/gpt_oss_20b/ready
curl -i http://localhost:8000/v2/models/bge_reranker/ready
curl -i http://localhost:8000/v2/models/nomic_embed/ready
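Loading a large engine takes a while, so scripts usually poll the readiness endpoint rather than checking once. The sketch below uses only the standard library; the attempt count and delay are arbitrary assumptions, and both function names are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check, attempts=30, delay=2.0, sleep=time.sleep):
    """Call check() until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Demo with a fake check that succeeds on the third call
fake_results = iter([False, False, True])
demo_attempt = wait_until_ready(lambda: next(fake_results), attempts=5,
                                delay=0, sleep=lambda _: None)
print(demo_attempt)   # 3
```

Against a running server the call would be wait_until_ready(lambda: http_ready("http://localhost:8000/v2/health/ready")).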

Test Inference — gpt_oss_20b

PYTHON

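import queue
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient

# HTTP client, reused by the reranker test that follows
client = httpclient.InferenceServerClient("localhost:8000")

# gpt_oss_20b is configured as decoupled, so it must be called through the
# gRPC streaming API; a plain HTTP infer() call is rejected for such models.
grpc_client = grpcclient.InferenceServerClient("localhost:8001")

# Tokenize your prompt with the model's own tokenizer; these IDs are placeholders
input_ids = np.array([[1, 15043, 29892]], dtype=np.int32)
lengths    = np.array([[3]], dtype=np.int32)
out_len    = np.array([[100]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("input_ids", list(input_ids.shape), "INT32"),
    grpcclient.InferInput("input_lengths", list(lengths.shape), "INT32"),
    grpcclient.InferInput("request_output_len", list(out_len.shape), "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(lengths)
inputs[2].set_data_from_numpy(out_len)

# The stream callback receives (result, error); collect whichever is set
responses = queue.Queue()
grpc_client.start_stream(callback=lambda result, error: responses.put(error or result))
grpc_client.async_stream_infer("gpt_oss_20b", inputs)
response = responses.get(timeout=120)   # without streaming, one response is expected
grpc_client.stop_stream()

if isinstance(response, Exception):
    raise response
print(response.as_numpy("output_ids"))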

Test Inference — bge_reranker

PYTHON

query    = np.array([["What is TensorRT?"]], dtype=object)
passages = np.array([["TensorRT is NVIDIA's inference optimizer",
                       "TensorRT-LLM extends TensorRT for LLMs"]], dtype=object)

inputs = [
    httpclient.InferInput("query",    query.shape,    "BYTES"),
    httpclient.InferInput("passages", passages.shape, "BYTES"),
]
inputs[0].set_data_from_numpy(query)
inputs[1].set_data_from_numpy(passages)

result = client.infer("bge_reranker", inputs)
scores = result.as_numpy("scores")
print("Rerank scores:", scores)  # raw relevance logits; higher means more relevant
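The backend returns the classifier's raw logits, which are fine for ordering passages but are not probabilities. A common convention for cross-encoder rerankers is to squash them with a sigmoid when a 0 to 1 score is wanted. The sketch below assumes that convention; the logits are made-up values and rank_passages is a hypothetical helper.

```python
import math


def sigmoid(x):
    """Map a raw logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_passages(passages, logits):
    """Pair each passage with its sigmoid score, best first."""
    scored = [(passage, sigmoid(logit)) for passage, logit in zip(passages, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = rank_passages(["a", "b", "c"], [0.0, 2.0, -1.0])
for passage, score in ranked:
    print(f"{score:.3f}  {passage}")
```

Because the sigmoid is monotonic, the ordering is identical to sorting by the raw logits; only the scale changes.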

Prometheus Metrics

BASH

# Raw metrics
curl http://localhost:8002/metrics

# Key metrics to watch
curl -s http://localhost:8002/metrics | grep -E \
  "nv_inference_request_success|nv_gpu_memory|nv_inference_queue_duration"

14 Performance Tuning Tips

For gpt-oss-20b Throughput

Use FP8 instead of FP16: this roughly halves weight memory, leaving room for a larger KV cache and batch size on H100
Increase max_num_sequences in config.pbtxt to allow more in-flight requests
Enable --use_paged_context_fmha (already included) — crucial for long contexts
Set KV_CACHE_FRACTION=0.92 to give more cache to the LLM at the cost of less headroom
For throughput-heavy workloads, use --tp_size 1 (single GPU); for latency-heavy, consider tensor parallelism across 2× H100 if available
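How far the KV cache stretches can be estimated from the model's shape: each cached token stores one key and one value vector per layer. The sketch below uses hypothetical dimensions (40 layers, 8 KV heads, head size 128, FP16 cache) purely for illustration; they are not gpt-oss-20b's actual configuration, so substitute the values from your model's config.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: keys plus values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(cache_bytes, **model_shape):
    """How many tokens fit in a cache of the given size."""
    return cache_bytes // kv_bytes_per_token(**model_shape)


shape = {"num_layers": 40, "num_kv_heads": 8, "head_dim": 128}   # hypothetical
per_token = kv_bytes_per_token(**shape)
tokens = max_cached_tokens(20 * 2**30, **shape)                  # 20 GiB of cache

print(per_token)   # 163840
print(tokens)      # 131072
```

Dividing the token budget by the typical sequence length gives a rough ceiling on concurrent sequences, which is a useful sanity check when raising max_num_sequences or KV_CACHE_FRACTION.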

For Embedding / Reranker Throughput

Increase MAX_BATCH_SIZE_EMBED=256 and MAX_BATCH_SIZE_RERANKER=128 if traffic is bursty
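Export nomic-embed to ONNX, if the model exports cleanly, and serve it through Triton's ONNX Runtime or TensorRT backend; this can speed up embedding inference, so benchmark to confirm the gain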
Use torch.compile(model) in initialize() for Python backend models (PyTorch 2.x)

Monitor GPU Utilization Live

BASH

# Real-time GPU stats inside the container
docker exec -it trtllm-server watch -n 1 nvidia-smi

# Memory per process
docker exec trtllm-server nvidia-smi \
  --query-compute-apps=pid,used_memory \
  --format=csv

15 Troubleshooting

Error	Cause	Fix
`CUDA out of memory`	GPU fractions sum > 0.95	Reduce one fraction. Try `GPT_GPU_FRACTION=0.75`
`engine file not found`	Build stage failed silently	Check `build.log`; re-run with `--progress=plain`
`unsupported SM version`	Engine built on different GPU arch	Always build and run on the same GPU model
Model not ready on startup	Python backend import error	Run `docker logs trtllm-server`, check `transformers` version
`tritonserver: not found`	Wrong base image	Must use `tritonserver:24.12-trtllm-python-py3` not plain CUDA image
High P99 latency spikes	KV cache eviction under load	Increase `KV_CACHE_FRACTION` or reduce `max_num_sequences`

Quick Debug Command

Run docker exec -it trtllm-server bash and then curl -s -X POST localhost:8000/v2/repository/index | python3 -m json.tool to see exactly which models loaded and their status without restarting.
