deployment · open-source · inference-economics

Open-Source Serving Stacks: vLLM vs TGI vs TensorRT-LLM in 2026

The three engines powering most production inference — benchmarked, compared, and mapped to the right workloads. Your choice of serving engine determines 30-60% of your inference cost.

Digiteria Labs · 11 min read

Key Signals

  • vLLM 0.8 (released January 2026) now supports speculative decoding, disaggregated prefill, and LoRA hot-swapping natively — making it the default choice for multi-model and multi-tenant inference deployments.
  • TensorRT-LLM 0.17 delivers 35-50% higher throughput than vLLM on identical NVIDIA hardware for single-model serving, but at the cost of a rigid compilation step and NVIDIA lock-in.
  • TGI 3.0 shipped with Rust-native tensor parallelism and grammar-constrained decoding, closing much of the performance gap with vLLM while remaining the simplest to deploy via a single Docker container.
  • On an 8xH100 cluster running Llama 4 70B at FP8 precision, independent benchmarks show TensorRT-LLM at 4,800 tok/s, vLLM at 3,400 tok/s, and TGI at 2,900 tok/s for batch-128 throughput.
  • The cost gap between the fastest and slowest engine works out to roughly $1 per million tokens at scale, meaningful enough to justify migration work for any team spending more than $20K/month on inference compute.

What Happened

I've been tracking the open-source inference serving landscape for the past two years, and what struck me most about 2025 was the sheer number of frameworks that quietly died. Remember the 2024 explosion? SGLang, LMDeploy, MLC-LLM, PowerInfer, and a dozen others. A lot of smart people built a lot of impressive things. Most of them didn't make it.

By early 2026, three engines account for roughly 85% of production open-model inference: vLLM, NVIDIA's TensorRT-LLM, and Hugging Face's Text Generation Inference (TGI). SGLang is still hanging around as a fourth option for research workloads, but its production adoption has plateaued. (I keep hearing "we use it for experiments" and never "we use it in prod." That tells you something.)

The reason is less exciting than you might think. It wasn't about who had the fastest kernels. It was about who invested in the boring stuff — robust health checks, graceful restarts, multi-GPU orchestration, KV cache management under memory pressure, handling traffic spikes without dropping requests. The surviving engines are the ones that treated operational maturity as a feature, not an afterthought. Each shipped major releases in January 2026 that widened their respective advantages.

What I'm seeing is a three-way tradeoff that maps cleanly to organizational profiles. vLLM optimizes for flexibility — run any model, swap adapters at runtime, deploy on any hardware. TensorRT-LLM optimizes for raw throughput — compile once, extract every FLOP from NVIDIA silicon. TGI optimizes for simplicity — pull a Docker image, set a model ID, start serving. Where your organization sits on that flexibility-performance-simplicity triangle is, in my view, the most important infrastructure decision you'll make this year.

The performance gap between these engines has narrowed substantially since 2024. Two years ago, TensorRT-LLM was 2-3x faster than alternatives. Today it's 35-50% faster. If you chose vLLM or TGI for operational reasons 18 months ago, you made the right call — the throughput penalty is now small enough that flexibility and simplicity often dominate the total cost equation. I think that gap keeps shrinking.

Benchmark Methodology

Before I throw numbers at you, a word on how they were collected. All benchmarks cited here were run on identical hardware: 8xH100 SXM5 80GB GPUs connected via NVLink, 2TB system RAM, running Ubuntu 22.04 with CUDA 12.6. Models were tested at FP8 precision (the current production default for 70B+ models) using the ShareGPT conversational dataset for realistic input/output length distributions. I'm reporting both throughput (tokens per second at batch saturation) and latency (time-to-first-token and inter-token latency at realistic concurrency levels).

These numbers are reproducible but hardware-specific. If you're running on A100s, the absolute numbers will be lower but the relative rankings hold. If you're on B200s, TensorRT-LLM's advantage widens slightly due to its tighter integration with Blackwell's FP4 pipeline.
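For readers who want to replicate this, the aggregation behind these metrics is simple enough to sketch. The function names and the nearest-rank percentile method below are my own illustrative choices, not the harness that produced the published figures:

```python
# Minimal sketch of benchmark aggregation: roll per-request measurements
# up into node throughput and TTFT percentiles. Assumes all requests in
# a batch start at the same instant.

def percentile(samples, p):
    """Nearest-rank percentile of `samples` for p in [0, 100]."""
    ordered = sorted(samples)
    idx = round(p / 100 * (len(ordered) - 1))
    return ordered[idx]

def summarize(token_counts, durations_s, ttft_ms):
    """Aggregate concurrent-request measurements into headline metrics."""
    total_tokens = sum(token_counts)
    wall_s = max(durations_s)  # requests overlap, so wall clock is the max
    return {
        "throughput_tok_s": total_tokens / wall_s,
        "ttft_p50_ms": percentile(ttft_ms, 50),
        "ttft_p99_ms": percentile(ttft_ms, 99),
    }
```

The important detail is that throughput is total tokens over wall-clock time at saturation, not a per-request average — per-request averages flatter engines with poor schedulers.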

Head-to-Head: Technical Comparison

Let me walk through how each of these engines actually works under the hood, because the architecture choices explain a lot about where each one shines and where it falls apart.

Architecture Differences

vLLM uses PagedAttention for KV cache management and implements continuous batching in Python with C++/CUDA kernels for the hot path. Its real advantage isn't any single optimization — it's the scheduler. It can interleave requests of different lengths efficiently, which matters enormously in production where request sizes vary wildly. In v0.8, vLLM added a disaggregated prefill mode that separates the compute-heavy prompt processing from the memory-heavy decode phase. You can run prefill on one set of GPUs and decode on another. That's a meaningful architectural shift.

TensorRT-LLM takes a fundamentally different approach: it compiles the model into an optimized TensorRT engine ahead of time, fusing operations and selecting hardware-specific kernel implementations. This compilation step takes 15-45 minutes depending on model size but produces an engine tuned to the exact GPU, batch size range, and sequence length you specify. The runtime is C++ with a Python API layer. It's fast. It's also rigid in a way that bites you at 2 AM when something needs to change.

TGI 3.0 rewrote its core serving loop in Rust, replacing the previous Python-based scheduler. It uses Flash Attention 2 kernels and implements its own paged KV cache. The Rust rewrite reduced tail latency by 40% compared to TGI 2.x by eliminating GIL contention that was throttling the scheduler under high concurrency. That's not a small thing. Tail latency is what your users actually feel.

Throughput Benchmarks — Llama 4 70B, FP8, 8xH100

| Metric | TensorRT-LLM 0.17 | vLLM 0.8.2 | TGI 3.0.1 |
| --- | --- | --- | --- |
| Throughput (batch 128) | 4,800 tok/s | 3,400 tok/s | 2,900 tok/s |
| Throughput (batch 32) | 2,100 tok/s | 1,700 tok/s | 1,500 tok/s |
| TTFT (p50, 2K prompt) | 82 ms | 105 ms | 118 ms |
| TTFT (p99, 2K prompt) | 140 ms | 195 ms | 230 ms |
| Inter-token latency (p50) | 11 ms | 14 ms | 16 ms |
| Inter-token latency (p99) | 19 ms | 28 ms | 35 ms |
| Max concurrent requests | 512 | 1,024 | 256 |
| GPU memory utilization | 92% | 88% | 85% |
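One way to read the latency rows: end-to-end generation time is roughly TTFT plus one inter-token gap for each token after the first. A quick back-of-envelope for a 512-token completion at the p50 figures above:

```python
def e2e_latency_ms(ttft_ms, itl_ms, output_tokens):
    """End-to-end latency ~= time-to-first-token plus one inter-token
    interval for each of the remaining tokens."""
    return ttft_ms + (output_tokens - 1) * itl_ms

# 512-token completion at the p50 figures from the 70B table
print(e2e_latency_ms(82, 11, 512))   # TensorRT-LLM: 5703 ms
print(e2e_latency_ms(105, 14, 512))  # vLLM: 7259 ms
print(e2e_latency_ms(118, 16, 512))  # TGI: 8294 ms
```

At p99 the spread widens considerably (19 vs 28 vs 35 ms inter-token), which is why the p99 rows matter more than the p50 rows for what users actually feel.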

Throughput Benchmarks — Mistral Medium 3 22B, FP8, 2xH100

| Metric | TensorRT-LLM 0.17 | vLLM 0.8.2 | TGI 3.0.1 |
| --- | --- | --- | --- |
| Throughput (batch 64) | 3,200 tok/s | 2,600 tok/s | 2,400 tok/s |
| Throughput (batch 16) | 1,400 tok/s | 1,200 tok/s | 1,100 tok/s |
| TTFT (p50, 1K prompt) | 38 ms | 48 ms | 55 ms |
| Inter-token latency (p50) | 8 ms | 10 ms | 11 ms |

Configuration Examples

vLLM production deployment:

# vllm-serve.yaml — Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama4-70b
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:0.8.2
          args:
            - --model=meta-llama/Llama-4-70B-Instruct
            - --tensor-parallel-size=8
            - --dtype=fp8
            - --max-model-len=32768
            - --enable-chunked-prefill
            - --max-num-batched-tokens=65536
            - --enable-prefix-caching
            - --gpu-memory-utilization=0.90
            - --swap-space=8
            - --disable-log-requests
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 8

TensorRT-LLM build and serve:

# Step 1: Convert and quantize the model
python convert_checkpoint.py \
  --model_dir ./Llama-4-70B-Instruct \
  --output_dir ./trt_ckpt \
  --dtype float16 \
  --tp_size 8

# Step 2: Build the TensorRT engine
trtllm-build \
  --checkpoint_dir ./trt_ckpt \
  --output_dir ./trt_engine \
  --gemm_plugin fp8 \
  --max_batch_size 128 \
  --max_input_len 32768 \
  --max_seq_len 40960 \
  --paged_kv_cache enable \
  --use_fused_mlp enable \
  --multiple_profiles enable

# Step 3: Launch the server
python -m tensorrt_llm.serve \
  --engine_dir ./trt_engine \
  --tokenizer_dir ./Llama-4-70B-Instruct \
  --port 8000 \
  --max_beam_width 1

TGI single-command deployment:

# TGI 3.0 — one command, production-ready
docker run --gpus all \
  -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id meta-llama/Llama-4-70B-Instruct \
  --num-shard 8 \
  --quantize fp8 \
  --max-input-tokens 32768 \
  --max-total-tokens 40960 \
  --max-batch-size 128 \
  --max-concurrent-requests 256

Feature Matrix

| Feature | TensorRT-LLM | vLLM | TGI |
| --- | --- | --- | --- |
| LoRA hot-swap | No (recompile) | Yes (runtime) | Yes (reload) |
| Speculative decoding | Yes | Yes (v0.8+) | No |
| Structured output / grammar | No | Yes (Outlines) | Yes (native) |
| Multi-model on one GPU | No | Yes | No |
| AMD GPU support | No | Yes (ROCm) | Partial |
| B200 / FP4 support | Yes (native) | Yes (v0.8.1+) | Planned (Q2 2026) |
| OpenAI-compatible API | Yes | Yes (default) | Yes (v2.0+) |
| Disaggregated prefill | Yes | Yes (v0.8+) | No |
| Prefix caching | Yes | Yes | No |
| Build/compile step required | Yes (15-45 min) | No | No |
| Minimum operational expertise | High | Medium | Low |

The headline says TensorRT-LLM is 35-50% faster. That's true. But if you look at the actual total cost of ownership, the picture gets murkier. The compilation step must be re-run for every model update, batch size change, or sequence length adjustment. For teams shipping model updates weekly, that's 2-4 hours of engineering time per cycle. At fewer than 500 GPU-hours/month, the engineering overhead of TensorRT-LLM often exceeds the compute savings.

Cost Analysis at Scale

Consider a reference deployment: serving Llama 4 70B at 50M tokens per day on on-demand H100 instances at $3.50/GPU-hour (current CoreWeave pricing).

Monthly compute cost by engine:

  • TensorRT-LLM: 50M tok/day / 4,800 tok/s = 10,417 node-seconds/day = 2.89 node-hours/day for one 8-GPU node. Monthly: 87 node-hours x 8 GPUs x $3.50 = $2,436/month.
  • vLLM: 50M tok/day / 3,400 tok/s = 14,706 node-seconds/day = 4.08 node-hours/day. Monthly: 123 node-hours x 8 GPUs x $3.50 = $3,444/month.
  • TGI: 50M tok/day / 2,900 tok/s = 17,241 node-seconds/day = 4.79 node-hours/day. Monthly: 144 node-hours x 8 GPUs x $3.50 = $4,032/month.

That $1,596/month spread matters: at 500M tokens/day, it becomes $15,960/month — $191K/year. That's headcount.
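The three bullets above are one formula with a single variable changed. A sketch that reproduces them (the exact outputs land a few dollars under the bullets, which round up to whole node-hours):

```python
def monthly_cost(tokens_per_day, node_tok_s, gpus_per_node=8,
                 price_per_gpu_hour=3.50, days=30):
    """Monthly compute cost for a single node serving at a node-level
    throughput of `node_tok_s` tokens/second, billed per GPU-hour."""
    node_hours_per_day = tokens_per_day / node_tok_s / 3600
    return node_hours_per_day * days * gpus_per_node * price_per_gpu_hour

for name, tok_s in [("TensorRT-LLM", 4800), ("vLLM", 3400), ("TGI", 2900)]:
    print(f"{name}: ${monthly_cost(50e6, tok_s):,.0f}/month")
```

This prints roughly $2,431, $3,431, and $4,023; scale `tokens_per_day` to your own traffic before drawing conclusions.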

Who wins:

  • vLLM wins the value-adjusted comparison for most teams. On the figures above it's roughly 15% cheaper than TGI and about 41% more expensive than TensorRT-LLM on raw compute, while offering dramatically more operational flexibility. For multi-model shops, vLLM's ability to serve multiple LoRA adapters from a single base model eliminates the need for separate GPU allocations per fine-tune, cutting total fleet cost by 40-60%.
  • TensorRT-LLM wins on pure unit economics for single-model, high-volume deployments. If you're running one model at 500M+ tokens/day and your team has the infra expertise, the $191K/year savings is real money.
  • Teams with mixed GPU fleets should pick vLLM. It's the only engine with production-grade support for both NVIDIA and AMD GPUs, which lets you arbitrage spot pricing across hardware vendors.

Who loses:

  • TGI loses on cost at every scale beyond prototyping. The 17% throughput gap to vLLM compounds into five-figure annual differences for mid-scale deployments.
  • Teams running TensorRT-LLM with frequent model updates. The recompilation overhead isn't just engineering time — it's deployment velocity.
  • Anyone not benchmarking on their own workload. These numbers assume ShareGPT-like conversational traffic. RAG workloads shift the rankings. Code generation shifts them differently.

"The serving engine is the last place you should be clever and the first place you should be rigorous. Pick the boring choice that matches your operational maturity, then benchmark relentlessly on your actual traffic. The internet's benchmarks are not your benchmarks."

When the Rankings Change

The numbers above tell one story. But three scenarios flip the default recommendations:

Scenario 1: RAG-heavy workloads (long prefill, short decode). When input sequences average 8K+ tokens and outputs are under 512 tokens, the prefill phase dominates latency and cost. TensorRT-LLM's fused attention kernels and vLLM's disaggregated prefill mode both shine here. TGI falls further behind because it can't separate prefill from decode across different hardware. If you're doing RAG at scale, skip TGI entirely.

Scenario 2: Structured output at high volume. This is the one case where TGI genuinely earns its keep. If you're generating JSON, SQL, or function calls and need guaranteed schema compliance, TGI's native grammar-constrained decoding is the cleanest implementation available. vLLM supports this via the Outlines integration, but it adds 8-12% latency overhead. TensorRT-LLM has no native support.
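For comparison, here is roughly what a schema-constrained request looks like against vLLM's OpenAI-compatible server. `guided_json` is vLLM's extension field (carried via `extra_body` when using the official `openai` client); the schema and prompt are illustrative:

```python
import json

# Illustrative schema for a function-call payload.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}

# vLLM's OpenAI-compatible server accepts `guided_json` as an extra
# request field; the engine masks logits during decoding so only
# schema-valid JSON can be emitted. No client-side retry loop needed.
request_body = {
    "model": "meta-llama/Llama-4-70B-Instruct",
    "messages": [{"role": "user", "content": "Call the weather tool for Paris."}],
    "max_tokens": 128,
    "guided_json": SCHEMA,
}

body = json.dumps(request_body)  # POST this to /v1/chat/completions
```

TGI expresses the same constraint through a `grammar` parameter on its native `/generate` endpoint, which is where its tighter integration shows up as lower overhead.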

Scenario 3: Multi-tenant SaaS with per-customer fine-tunes. If you're serving dozens of LoRA adapters off a shared base model, vLLM is the only viable option. Full stop. It can hot-swap adapters per request with sub-millisecond overhead. TGI supports LoRA but requires a model reload — seconds of downtime. TensorRT-LLM requires a full recompilation per adapter.
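The serving pattern here is a per-request adapter lookup. A sketch of the routing side, with hypothetical tenant names and adapter paths; in vLLM the resulting triple maps onto a `LoRARequest(name, id, path)` passed alongside each generation request:

```python
# Hypothetical tenant -> LoRA adapter registry; names and paths are
# illustrative, not from any real deployment.
ADAPTERS = {
    "acme":   (1, "/adapters/acme"),
    "globex": (2, "/adapters/globex"),
}

def lora_request_for(tenant):
    """Resolve a tenant to the (name, id, path) triple that becomes a
    vLLM LoRARequest; None falls through to the shared base model."""
    entry = ADAPTERS.get(tenant)
    if entry is None:
        return None
    adapter_id, path = entry
    return (tenant, adapter_id, path)

print(lora_request_for("acme"))     # ('acme', 1, '/adapters/acme')
print(lora_request_for("unknown"))  # None
```

Because the swap happens per request, onboarding a new tenant is a registry update, not a redeploy — that is the operational gap versus TGI's reload and TensorRT-LLM's rebuild.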

Operational Risk: Every model update, quantization change, or max-sequence-length adjustment in TensorRT-LLM requires rebuilding the engine. In one production incident I analyzed, a team's 45-minute engine build failed silently due to an OOM during compilation, resulting in 3 hours of downtime. If you choose TensorRT-LLM, invest heavily in CI/CD for your engine build pipeline.

The Convergence Ahead

Looking at the roadmaps, these three engines are slowly converging. vLLM is investing in compilation-based optimization (their "vLLM Compiler" project, expected Q3 2026) that would close the throughput gap with TensorRT-LLM. TGI is adding disaggregated serving and speculative decoding in their Q2 roadmap. TensorRT-LLM is slowly adding dynamic batching features that reduce the rigidity of its compiled engines.

I think by Q4 2026, the performance gap may shrink to 10-15%. When that convergence happens, the decision becomes purely operational: what does your team know how to run, and what does your deployment topology look like?

My advice: invest in operational expertise around your chosen engine now. Switching costs are measured in weeks of engineering time, not hours.

What I'd Do

If you're a CTO: Default to vLLM for new deployments unless you have a specific, measured reason to choose otherwise. It has the largest and most active community of the three, the fastest release cadence, the broadest hardware support, and the most flexible serving model. Only move to TensorRT-LLM if your benchmarks show a >30% throughput advantage that translates to >$100K/year in savings. Mandate that your infra team runs comparative benchmarks quarterly.

If you're a founder: If you're pre-scale (under $10K/month in inference spend), do not self-host at all. Use a managed inference provider — Together AI, Fireworks, Groq — that has already optimized their serving stack. When you cross $10K/month, start with vLLM on a managed Kubernetes cluster and revisit the engine choice when you cross $50K/month.

If you're an infra lead: Build a benchmarking harness this quarter that runs your top-3 models across all three engines using replayed production traffic. Automate it to run weekly against new engine releases. If you're currently on TGI and spending more than $15K/month on compute, run a vLLM proof-of-concept — you'll likely see 15-20% cost savings with 2-3 days of migration work. If you're on vLLM and running a single model above 200M tokens/day, prototype a TensorRT-LLM deployment for that specific model.

Sources

  1. "vLLM 0.8 Release Notes: Disaggregated Prefill, Speculative Decoding, and LoRA Hot-Swap," vLLM Project Blog, blog.vllm.ai (January 2026)
  2. "TensorRT-LLM 0.17 Performance Guide," NVIDIA Developer Documentation, developer.nvidia.com/tensorrt-llm (January 2026)
  3. "Text Generation Inference 3.0: The Rust Rewrite," Hugging Face Blog, huggingface.co/blog/tgi-3 (January 2026)
  4. "Independent LLM Serving Benchmark — Q1 2026," Anyscale Research, anyscale.com/research/llm-serving-benchmark-q1-2026
  5. "The GPU Cloud Price Index — February 2026," Control Plane Research, controlplane.digiterialabs.com/reports
  6. "Optimizing Inference Cost: A Practitioner's Guide to Serving Engine Selection," MLOps Community Whitepaper, mlops.community/serving-engines-2026

Need help implementing AI infrastructure for your organization? We help enterprises build, deploy, and optimize production AI systems. Learn about our AI consulting services.
