AI Compute Infrastructure & Deployment Guide 2026
Scale LLMs from prototype to production. Serverless vs dedicated GPUs, vLLM, TensorRT-LLM, Kubernetes, auto-scaling, and cost optimization patterns.
Moving an LLM from prototype to production is where many AI projects stall. A demo that works on a single GPU in a notebook is a long way from production, which means handling thousands of concurrent requests, maintaining sub-second latency, and keeping costs under control. This guide covers the compute infrastructure decisions that make or break production AI deployments, from choosing between serverless and dedicated GPUs to optimizing inference with specialized serving frameworks.
Deployment Options Overview
There are four main ways to run LLMs in production:
| Approach | Best For | Latency | Cost Model | Complexity |
|---|---|---|---|---|
| Managed API (OpenAI, etc.) | Most applications | Low | Per-token | None |
| Serverless GPU | Variable traffic, startups | Cold start | Per-second or per-token | Low |
| Dedicated GPU (cloud) | Steady traffic, cost control | Low | Per-hour | Medium |
| On-premise / Colo | Data sovereignty, large scale | Low | CapEx | High |
Start with managed APIs. Move to self-hosting only when you have clear evidence that it will save money or meet requirements that APIs can't. Most teams overestimate their savings from self-hosting and underestimate the operational burden.
When to Self-Host vs Use APIs
Managed APIs (OpenAI, Anthropic, Google) handle infrastructure, scaling, and reliability. Self-hosting gives you control but adds significant complexity. Here's how to decide:
Use Managed APIs When:
- Your monthly AI spend is under $5,000
- You need the latest model capabilities (proprietary frontier models can't be self-hosted)
- Your team has limited ML infrastructure experience
- You need multi-modal capabilities (vision, audio)
- Latency requirements are standard (<2s)
Consider Self-Hosting When:
- Your monthly AI spend exceeds $10,000 (break-even point for dedicated GPUs)
- You need sub-100ms latency for specific workloads
- You have strict data residency requirements (healthcare, finance)
- You need to run models that aren't available via API (fine-tuned models, specialized models)
- You want to avoid vendor lock-in
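The spend thresholds above can be sanity-checked with a small calculation. The following is a minimal sketch, not a pricing model: the function names, the 730 hours-per-month figure, and the flat `ops_overhead_usd` parameter are assumptions introduced here for illustration, and the engineering-time figure in the demo is hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_self_host_cost(gpu_hourly_usd: float, gpu_count: int = 1,
                           ops_overhead_usd: float = 0.0) -> float:
    """Cost of running dedicated GPUs around the clock, plus a flat ops overhead."""
    return gpu_hourly_usd * gpu_count * HOURS_PER_MONTH + ops_overhead_usd


def self_hosting_savings(api_spend_usd: float, gpu_hourly_usd: float,
                         gpu_count: int = 1, ops_overhead_usd: float = 0.0) -> float:
    """Monthly savings from self-hosting; a negative value means the API is cheaper."""
    return api_spend_usd - monthly_self_host_cost(gpu_hourly_usd, gpu_count, ops_overhead_usd)


if __name__ == "__main__":
    # Two replicas at $3.50/hr and a hypothetical $4,000/month of engineering time
    for spend in (5_000, 10_000, 20_000):
        savings = self_hosting_savings(spend, 3.50, gpu_count=2, ops_overhead_usd=4_000)
        print(f"API spend ${spend}: savings ${savings:.0f}")
```

With these inputs, self-hosting loses money at $5,000/month of API spend and only barely pays off at $10,000, which is why the operational overhead term matters as much as the GPU price.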
GPU Selection for LLM Inference
The GPU you choose determines the largest model you can serve, your throughput, and your cost. Here's the 2026 landscape, with approximate on-demand cloud prices:
| GPU | VRAM | Best For | Approx. Cost/hr |
|---|---|---|---|
| NVIDIA A10G | 24GB | 7B models in FP16, 13B quantized | $1.00 |
| NVIDIA A100 (40GB) | 40GB | 13B models in FP16, ~30B quantized | $2.50 |
| NVIDIA A100 (80GB) | 80GB | 30B models in FP16, 70B quantized | $3.50 |
| NVIDIA H100 | 80GB | 70B models (quantized or multi-GPU), high throughput | $4.50 |
| NVIDIA L4 | 24GB | Small models, cost-efficient | $0.80 |
| AMD MI300X | 192GB | Largest models, memory-bound | $4.00 |
Model Size vs GPU Memory
As a rule of thumb, FP16 weights take 2 bytes per parameter, so a model needs roughly 2GB of VRAM per billion parameters, plus headroom for the KV cache and activations:
# Approximate VRAM for model weights (FP16, 2 bytes per parameter)
# 7B model: ~14GB VRAM (fits on A10G)
# 13B model: ~26GB VRAM (needs A100 40GB)
# 30B model: ~60GB VRAM (needs A100 80GB)
# 70B model: ~140GB VRAM (needs 2x A100 80GB or 1x MI300X)
# With 4-bit quantization (~0.5 bytes per parameter), weights shrink to about a quarter:
# 70B model (4-bit): ~35GB VRAM (fits on A100 80GB with room for the KV cache)
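The sizing rule above is simple enough to script. The sketch below estimates weight memory from parameter count and precision, then picks the cheapest option from this guide's GPU table. The function names and the 1.2x headroom factor are assumptions made for illustration, and the selection ignores interconnect speed and throughput, so treat the result as a starting point rather than a recommendation.

```python
import math

# (name, VRAM in GB, approx. $/hr) copied from the GPU table in this guide
GPUS = [
    ("NVIDIA L4", 24, 0.80),
    ("NVIDIA A10G", 24, 1.00),
    ("NVIDIA A100 (40GB)", 40, 2.50),
    ("NVIDIA A100 (80GB)", 80, 3.50),
    ("AMD MI300X", 192, 4.00),
    ("NVIDIA H100", 80, 4.50),
]


def weight_vram_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """VRAM for the weights alone: parameters x bytes per parameter."""
    return params_billion * bits_per_param / 8


def cheapest_fit(params_billion: float, bits_per_param: int = 16, headroom: float = 1.2):
    """Return (gpu_name, gpu_count) with the lowest hourly cost that holds weights x headroom."""
    needed_gb = weight_vram_gb(params_billion, bits_per_param) * headroom
    options = []
    for name, vram_gb, price in GPUS:
        count = math.ceil(needed_gb / vram_gb)  # GPUs needed if the model is sharded evenly
        options.append((count * price, name, count))
    _, name, count = min(options)
    return name, count


if __name__ == "__main__":
    print(weight_vram_gb(70))        # FP16 weights for a 70B model
    print(weight_vram_gb(70, 4))     # 4-bit weights for the same model
    print(cheapest_fit(70))          # cheapest option for 70B in FP16
```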
Inference Engines: vLLM vs TensorRT-LLM
Naive PyTorch inference leaves much of the GPU idle under concurrent load. Specialized serving engines can raise throughput by as much as 10-20x, depending on the model and traffic pattern:
vLLM
The most popular open-source inference engine. Uses PagedAttention for efficient memory management:
# Install
# pip install vllm
# Offline batch inference with the Python API
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    tensor_parallel_size=1,  # Set to 2+ for multi-GPU
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=4096,
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)
# Batch inference
prompts = [
    "What is machine learning?",
    "Explain quantum computing.",
    "How does photosynthesis work?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
vLLM advantages: Easy to use, supports most HuggingFace models, continuous batching, automatic prefix caching
TensorRT-LLM
NVIDIA's optimized inference engine. Harder to set up but delivers the best performance on NVIDIA GPUs:
# TensorRT-LLM requires model conversion
# (convert_checkpoint.py and run.py come from the examples in the TensorRT-LLM repository)
# 1. Convert the Hugging Face checkpoint to TensorRT-LLM checkpoint format
python convert_checkpoint.py --model_dir ./llama-3.1-8b \
    --output_dir ./trt-engine \
    --dtype float16
# 2. Build the engine
trtllm-build --checkpoint_dir ./trt-engine \
    --output_dir ./engine-output \
    --gemm_plugin float16
# 3. Run inference
python run.py --engine_dir ./engine-output \
    --max_output_len 512 \
    --tokenizer_dir ./llama-3.1-8b \
    --input_text "What is AI?"
TensorRT-LLM advantages: Fastest inference on NVIDIA hardware, FP8 support, in-flight batching
Tradeoff: Only supports NVIDIA GPUs, model conversion required, less flexible than vLLM
Engine Comparison
| Feature | vLLM | TensorRT-LLM | Text Generation Inference (TGI) |
|---|---|---|---|
| Ease of use | Easy | Hard | Medium |
| Throughput | High | Highest | High |
| Model support | Wide | Limited | Wide |
| Multi-GPU | Yes | Yes | Yes |
| Quantization | AWQ, GPTQ | FP8, INT8 | BitsAndBytes |
| Streaming | Yes | Yes | Yes |
Serverless GPU Platforms
For variable traffic, serverless GPU platforms handle scaling automatically:
Replicate
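# Deploy any model with a simple API
# Requires REPLICATE_API_TOKEN in the environment
import replicate
output = replicate.run(
    "meta/meta-llama-3.1-8b-instruct",
    input={"prompt": "What is AI?"}
)
# Language models return the text in chunks, so join them
print("".join(output))
# Or deploy your own fine-tuned model
# Upload model weights, get an API endpoint automatically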
Together AI
# Requires TOGETHER_API_KEY in the environment
import together
client = together.Together()
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Serverless Comparison
| Platform | Cold Start | Pricing | Best For |
|---|---|---|---|
| Replicate | ~10s | Per-second GPU | Custom models, quick deploy |
| Together AI | ~5s | Per-token | High throughput, low latency |
| Fireworks AI | ~2s | Per-token | Fastest cold starts |
| Baseten | ~15s | Per-second GPU | Enterprise features |
Kubernetes Deployment
For large-scale production, Kubernetes with GPU operators provides the most control:
Basic vLLM Deployment
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - meta-llama/Llama-3.1-8B
            - --tensor-parallel-size
            - "1"
            - --gpu-memory-utilization
            - "0.9"
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
    - port: 8000
      targetPort: 8000
  type: LoadBalancer
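Once the Service is up, the vLLM container serves an OpenAI-compatible HTTP API on port 8000. The sketch below shows one way a client might call its `/v1/completions` route using only the standard library. The helper names and the default timeout are assumptions introduced here, and the base URL must match however the Service is reachable from your client.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, timeout_s: float = 30.0) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Works only from a host that can resolve and reach the Service
    print(complete("http://llm-service:8000", "meta-llama/Llama-3.1-8B", "What is AI?"))
```

Setting both `max_tokens` and a socket timeout keeps a slow or runaway generation from holding the client indefinitely.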
Auto-Scaling with KEDA
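# Scale based on request queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-autoscaler
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    # vLLM's /metrics endpoint serves Prometheus text rather than JSON,
    # so this trigger queries a Prometheus server that scrapes it.
    - type: prometheus
      metadata:
        # Assumption: replace with the address of your Prometheus service
        serverAddress: "http://prometheus.monitoring.svc:9090"
        query: "sum(vllm:num_requests_waiting)"
        threshold: "5"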
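To build intuition for how a queue-depth target turns into a replica count, the scaling rule can be approximated in a few lines. This is a simplified sketch of the proportional rule used by Kubernetes-style autoscalers, not KEDA's actual implementation, which also applies polling intervals, cooldowns, and stabilization. The function name is hypothetical; the defaults mirror the target of 5 and the 1-10 replica bounds in the manifest above.

```python
import math


def desired_replicas(queue_depth: float, target_per_replica: float = 5,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """ceil(total metric / per-replica target), clamped to the configured bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 4, 12, 500):
        print(depth, desired_replicas(depth))
```

Because GPU pods can take minutes to pull an image and load weights, the replica count this rule asks for lags the queue, which is one reason to keep `minReplicaCount` above zero for latency-sensitive services.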
Cost Optimization Strategies
1. Quantization
Reduce model precision to fit larger models on cheaper GPUs:
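# 4-bit quantization with vLLM
from vllm import LLM
llm = LLM(
    # Placeholder path: AWQ needs a checkpoint that was already quantized with AWQ
    model="path/to/llama-3.1-70b-awq",
    quantization="awq",  # or "gptq" for a GPTQ checkpoint
    gpu_memory_utilization=0.95,
)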
| Precision | VRAM vs. FP16 | Approx. Quality Loss | Approx. Speedup |
|---|---|---|---|
| FP16 (baseline) | 1x | 0% | 1x |
| INT8 | 0.5x | <1% | 1.5x |
| FP8 (Hopper/Ada or newer) | 0.5x | <1% | 2x |
| INT4 (AWQ/GPTQ) | 0.25x | 2-5% | 2-3x |
2. Request Batching
# Dynamic batching with vLLM
# vLLM batches concurrent requests automatically via continuous batching.
# For offline jobs, you can also chunk a large prompt list client-side.
# Assumes llm and sampling_params from the vLLM example earlier in this guide.
def batch_process(prompts, batch_size=8):
    """Process prompts in batches for higher throughput."""
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        # llm.generate is synchronous, so this is a plain function rather than async
        outputs = llm.generate(batch, sampling_params)
        results.extend(o.outputs[0].text for o in outputs)
    return results
3. Model Selection by Traffic
class TrafficRouter:
    """Route requests to different models based on complexity."""
    def __init__(self):
        self.small_model = "llama-3.1-8b"  # $0.50/hr
        self.large_model = "llama-3.1-70b"  # $3.50/hr
    def route(self, prompt, complexity_hint=None):
        # Simple heuristic: short prompts → small model
        if complexity_hint == "simple" or len(prompt) < 200:
            return self.small_model
        # Or use a classifier
        if self.is_complex(prompt):
            return self.large_model
        return self.small_model
    def is_complex(self, prompt):
        # Simple keyword-based classification
        complex_keywords = ["explain", "compare", "analyze", "reason"]
        return any(kw in prompt.lower() for kw in complex_keywords)
4. Spot/Preemptible Instances
Use spot instances for batch workloads (tolerant of interruptions):
# AWS EC2 spot instance for batch inference
# Save 60-90% compared to on-demand
# GCP preemptible VM
# Save ~80%
# Azure spot VMs
# Save 60-90%
# Important: Implement checkpointing for long training jobs
# Inference workloads can simply restart on a new instance
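For batch inference on spot capacity, "simply restart" works best when the job remembers what it already finished. The sketch below is one possible pattern, assuming results are small enough to keep in a single JSON file. The function name, the checkpoint layout, and the `generate` callable are all hypothetical stand-ins for your own inference call.

```python
import json
import os


def run_resumable(prompts, generate, checkpoint_path):
    """Process prompts in order, persisting results so a restart skips finished work."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            done = json.load(f)
    for index, prompt in enumerate(prompts):
        key = str(index)  # JSON object keys are always strings
        if key in done:
            continue  # finished before the previous interruption
        done[key] = generate(prompt)
        # Write to a temp file and rename, so an interruption mid-write
        # never leaves a corrupt checkpoint behind
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return [done[str(i)] for i in range(len(prompts))]
```

Checkpointing after every prompt is the simplest choice; for large jobs, writing every N prompts or to object storage would cut the I/O overhead at the cost of redoing a little work after a preemption.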
Latency Optimization
For interactive applications, latency is critical:
Time to First Token (TTFT) vs Time Per Output Token (TPOT)
| Metric | What It Measures | Target |
|---|---|---|
| TTFT | Time from request to first token | <500ms |
| TPOT | Time between consecutive tokens | <50ms |
| Total latency | TTFT + (TPOT × output tokens) | ~5.5s for 100 output tokens at the targets above |
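The total-latency formula in the table is worth working through with real numbers, because long outputs dominate quickly. This small sketch applies the formula in integer milliseconds to avoid floating-point rounding; the function names are introduced here for illustration.

```python
def total_latency_ms(ttft_ms: int, tpot_ms: int, output_tokens: int) -> int:
    """Total latency = TTFT + TPOT x output tokens, as in the table above."""
    return ttft_ms + tpot_ms * output_tokens


def max_tokens_within(budget_ms: int, ttft_ms: int, tpot_ms: int) -> int:
    """Largest output length that still fits inside a latency budget."""
    if budget_ms <= ttft_ms:
        return 0
    return (budget_ms - ttft_ms) // tpot_ms


if __name__ == "__main__":
    # At the TTFT and TPOT targets from the table (500ms and 50ms)
    print(total_latency_ms(500, 50, 100))    # 100 output tokens
    print(total_latency_ms(500, 50, 1000))   # 1,000 output tokens
    print(max_tokens_within(2000, 500, 50))  # tokens that fit in a 2s budget
```

At a 500ms TTFT and 50ms TPOT, 100 output tokens take 5.5 seconds and 1,000 tokens take 50.5 seconds, while a 2-second budget leaves room for only 30 tokens. For long responses, streaming and a tight `max_tokens` matter more than shaving TTFT.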
Techniques to Reduce Latency
- Use KV cache — vLLM and TensorRT-LLM cache key-value pairs to avoid recomputation
- Enable prefix caching — Cache common prompt prefixes (system prompts, few-shot examples)
- Use speculative decoding — Draft tokens with a small model, verify with the large model
- Reduce max tokens — Set tight limits to prevent runaway generation
- Use streaming — Send tokens to the client as they're generated
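Before applying any of these techniques, it helps to measure TTFT and TPOT on your own streamed responses. The sketch below times any iterable of tokens, such as the chunks from a streaming client. The function name and return shape are assumptions made here, and the clock is injectable so the logic can be checked without a live server.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float], int]:
    """Consume a token stream and return (ttft_s, mean_tpot_s, token_count)."""
    start = clock()
    first_at: Optional[float] = None
    last_at = start
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at  # arrival time of the first token
        count += 1
    if first_at is None:
        return None, None, 0  # the stream produced nothing
    ttft = first_at - start
    # Mean gap between consecutive tokens; undefined for a single token
    tpot = (last_at - first_at) / (count - 1) if count > 1 else None
    return ttft, tpot, count
```

Call it as `measure_stream(stream)` where `stream` is created lazily, so that the request is actually sent after the timer starts; otherwise the TTFT will be understated.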
Monitoring Production Deployments
# Key metrics to track
METRICS = {
    "throughput": "requests/second",
    "latency_p50": "median response time",
    "latency_p99": "99th percentile response time",
    "gpu_utilization": "GPU compute usage %",
    "gpu_memory": "VRAM usage %",
    "queue_depth": "Pending requests",
    "error_rate": "Failed requests %",
    "tokens_per_second": "Generation speed",
}
# vLLM exposes Prometheus metrics automatically
# Scrape endpoint: http://localhost:8000/metrics
# Example Grafana dashboard queries
# token throughput: rate(vllm:generation_tokens_total[1m])
# p99 latency: histogram_quantile(0.99, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))
# gpu_util (assumes the NVIDIA DCGM exporter): DCGM_FI_DEV_GPU_UTIL{gpu="0"}
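For quick checks without a full Prometheus and Grafana stack, the text served at `/metrics` can be read directly. The sketch below is a deliberately minimal parser for the Prometheus text format, assuming sample lines carry no trailing timestamp. The function names are hypothetical, and the sample payload is hand-written for illustration rather than captured from a real server.

```python
def parse_prometheus_text(text: str) -> dict:
    """Map each sample line ('name{labels} value') to a float, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, # HELP and # TYPE lines
        # Assumes no trailing timestamp, so the value is the last field
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # skip anything this simple parser does not understand
    return samples


def sum_metric(samples: dict, metric_name: str) -> float:
    """Sum one metric across all of its label sets."""
    return sum(
        value for series, value in samples.items()
        if series == metric_name or series.startswith(metric_name + "{")
    )


# Hand-written sample in Prometheus text format, for illustration only
SAMPLE = """# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="a"} 3.0
vllm:num_requests_waiting{model_name="b"} 2.0
vllm:num_requests_running{model_name="a"} 8.0
"""

if __name__ == "__main__":
    print(sum_metric(parse_prometheus_text(SAMPLE), "vllm:num_requests_waiting"))
```

For anything beyond ad hoc inspection, a maintained Prometheus client library is the safer choice, since the full exposition format has more cases than this sketch handles.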
Common Mistakes
- Self-hosting too early: Most teams save money using APIs until $10K+/month spend
- Not using an inference engine: Raw PyTorch can be 10-20x slower than vLLM or TensorRT-LLM under concurrent load
- Over-provisioning GPUs: Start with smaller instances and scale up based on metrics
- Ignoring cold starts: Serverless platforms have cold starts. Keep a warm instance for latency-sensitive apps
- No request timeouts: A generation can run until it hits the context limit. Always set max_tokens and timeouts
- Single point of failure: Run multiple replicas behind a load balancer
- Not monitoring GPU memory: Out-of-memory (OOM) kills are a common cause of production failures
Conclusion
Production LLM deployment is a spectrum. Start with managed APIs for simplicity, move to serverless GPUs as you grow, and only self-host on Kubernetes when you have the team and the scale to justify it. The key tools in 2026 are vLLM for ease of use, TensorRT-LLM for maximum NVIDIA performance, and quantization for cost reduction.
Measure everything — throughput, latency, GPU utilization, and cost per request. The optimal setup depends on your specific workload: a chatbot needs low latency, a batch processing pipeline needs high throughput, and a research tool might prioritize model capability over speed. There's no one-size-fits-all answer, but there is a right answer for your use case.