Guide May 14, 2026

AI Compute Infrastructure & Deployment Guide 2026

Scale LLMs from prototype to production. Serverless vs dedicated GPUs, vLLM, TensorRT-LLM, Kubernetes, auto-scaling, and cost optimization patterns.

Moving an LLM from prototype to production is where many AI projects stall. Your demo works fine on a single GPU in your notebook, but production means handling thousands of concurrent requests, maintaining sub-second latency, and keeping costs under control. This guide covers the compute infrastructure decisions that make or break production AI deployments, from choosing between serverless and dedicated GPUs to optimizing inference with specialized serving frameworks.

Deployment Options Overview

There are four main ways to run LLMs in production:

| Approach | Best For | Latency | Cost Model | Complexity |
| --- | --- | --- | --- | --- |
| Managed API (OpenAI, etc.) | Most applications | Low | Per-token | None |
| Serverless GPU | Variable traffic, startups | Cold start | Per-request | Low |
| Dedicated GPU (cloud) | Steady traffic, cost control | Low | Per-hour | Medium |
| On-premise / Colo | Data sovereignty, large scale | Low | CapEx | High |

Start with managed APIs. Move to self-hosting only when you have clear evidence that it will save money or meet requirements that APIs can't. Most teams overestimate their savings from self-hosting and underestimate the operational burden.

When to Self-Host vs Use APIs

Managed APIs (OpenAI, Anthropic, Google) handle infrastructure, scaling, and reliability. Self-hosting gives you control but adds significant complexity. Here's how to decide:

Use Managed APIs When:

  • Your monthly AI spend is under $5,000
  • You need the latest model capabilities (the most capable proprietary models are only available through APIs)
  • Your team has limited ML infrastructure experience
  • You need multi-modal capabilities (vision, audio)
  • Latency requirements are standard (<2s)

Consider Self-Hosting When:

  • Your monthly AI spend exceeds $10,000 (a rough break-even point for dedicated GPUs once operational overhead is included)
  • You need sub-100ms latency for specific workloads
  • You have strict data residency requirements (healthcare, finance)
  • You need to run models that aren't available via API (fine-tuned models, specialized models)
  • You want to avoid vendor lock-in
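
To make the spend thresholds above concrete, here is a rough break-even sketch. The $3.50/hr rate is the A100 80GB figure from the GPU table in this guide. The 730 hours per month and the $4,000 monthly operations overhead are illustrative assumptions, not measured values, so substitute your own numbers.

```python
HOURS_PER_MONTH = 730  # assumption: GPUs run 24/7 for an average month


def self_host_monthly_cost(gpu_hourly_rate, num_gpus, ops_overhead=0.0):
    """Monthly cost of always-on dedicated GPUs plus a fixed operations overhead."""
    return gpu_hourly_rate * num_gpus * HOURS_PER_MONTH + ops_overhead


def cheaper_to_self_host(api_monthly_spend, gpu_hourly_rate, num_gpus, ops_overhead=0.0):
    """True when the self-hosted estimate is below the current API bill."""
    cost = self_host_monthly_cost(gpu_hourly_rate, num_gpus, ops_overhead)
    return cost < api_monthly_spend


# Two A100 80GB GPUs at $3.50/hr plus a hypothetical $4,000/month of engineering time
cost = self_host_monthly_cost(3.50, 2, ops_overhead=4000)
print(f"Self-hosting estimate: ${cost:,.0f}/month")
for api_spend in (5_000, 10_000):
    verdict = "self-host" if cheaper_to_self_host(api_spend, 3.50, 2, 4000) else "stay on APIs"
    print(f"API spend ${api_spend:,}/month -> {verdict}")
```

The overhead term is the one teams most often leave out, and it dominates the result at low spend.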

GPU Selection for LLM Inference

The GPU you choose determines your model size, throughput, and cost. Here's the 2026 landscape:

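| GPU | VRAM | Best For (FP16 unless noted) | Approx. Cost/hr |
| --- | --- | --- | --- |
| NVIDIA A10G | 24GB | 7B models (13B with quantization), single user | $1.00 |
| NVIDIA A100 (40GB) | 40GB | 13B models (30B with quantization) | $2.50 |
| NVIDIA A100 (80GB) | 80GB | 30B models (70B with 4-bit quantization) | $3.50 |
| NVIDIA H100 | 80GB | High-throughput serving; 70B with quantization or multiple GPUs | $4.50 |
| NVIDIA L4 | 24GB | Small models, cost-efficient | $0.80 |
| AMD MI300X | 192GB | Largest models (70B in FP16 on one GPU), memory-bound workloads | $4.00 |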

Model Size vs GPU Memory

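As a rule of thumb, FP16 weights take about 2GB of VRAM per billion parameters. The figures below are for weights alone; the KV cache and activations need extra headroom that grows with batch size and context length, so provisioning 1.5-2x the weight size is a comfortable target for busy workloads: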

# VRAM requirements by model size (FP16)
# 7B model: ~14GB VRAM (fits on A10G)
# 13B model: ~26GB VRAM (needs A100 40GB)
# 30B model: ~60GB VRAM (needs A100 80GB)
# 70B model: ~140GB VRAM (needs 2x A100 80GB or 1x MI300X)

# With 4-bit quantization, weights shrink to about a quarter of FP16:
# 70B model (4-bit): ~35GB VRAM (fits on A100 80GB with room for the KV cache)
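
The weight figures follow from simple arithmetic: parameters times bits per parameter, divided by 8 bits per byte. The sketch below reproduces them. The `headroom` multiplier for KV cache and activations is an assumed placeholder (1.2 here), not a measured value, and the function names are illustrative.

```python
def weight_memory_gb(params_billion, bits_per_param=16):
    """VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def fits(params_billion, gpu_vram_gb, bits_per_param=16, headroom=1.2):
    """Check whether weights plus an assumed headroom multiplier fit on one GPU."""
    return weight_memory_gb(params_billion, bits_per_param) * headroom <= gpu_vram_gb


for size in (7, 13, 30, 70):
    fp16 = weight_memory_gb(size)
    int4 = weight_memory_gb(size, bits_per_param=4)
    print(f"{size}B: FP16 {fp16:.1f}GB, 4-bit {int4:.1f}GB")

print("70B FP16 on one 80GB GPU:", fits(70, 80))
print("70B 4-bit on one 80GB GPU:", fits(70, 80, bits_per_param=4))
```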

Inference Engines: vLLM vs TensorRT-LLM

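Raw PyTorch inference is slow because it processes requests without serving-oriented batching or memory management. Specialized serving engines can increase throughput by 10-20x, with the largest gains under high concurrency: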

vLLM

The most popular open-source inference engine. Uses PagedAttention for efficient memory management:

# Install
# pip install vllm

# Offline batch inference with the Python API
# (to expose an HTTP endpoint instead, run: vllm serve <model>)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    tensor_parallel_size=1,      # Set to 2+ for multi-GPU
    gpu_memory_utilization=0.9,   # Use 90% of GPU memory
    max_model_len=4096,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Batch inference
prompts = [
    "What is machine learning?",
    "Explain quantum computing.",
    "How does photosynthesis work?"
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

vLLM advantages: Easy to use, supports most HuggingFace models, continuous batching, automatic prefix caching
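
To see why continuous batching matters, the toy simulation below counts decode steps for a queue of requests with different output lengths. It is a simplified model, not vLLM's scheduler: each request is just a number of decode steps, prefill cost is ignored, and the function names are illustrative.

```python
import heapq


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes; short requests sit idle."""
    return sum(
        max(output_lengths[i:i + batch_size])
        for i in range(0, len(output_lengths), batch_size)
    )


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request's slot is handed to the next queued request immediately."""
    slot_free_at = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slot_free_at)
    finish = 0
    for length in output_lengths:
        start = heapq.heappop(slot_free_at)  # earliest available slot
        end = start + length
        heapq.heappush(slot_free_at, end)
        finish = max(finish, end)
    return finish


lengths = [3, 10, 2, 8, 4, 1]  # output tokens per request, in arrival order
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With these six requests and two slots, the static schedule needs 22 steps and the continuous one 14, because short requests no longer wait for the longest request in their batch.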

TensorRT-LLM

NVIDIA's optimized inference engine. Harder to set up but delivers the best performance on NVIDIA GPUs:

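# TensorRT-LLM requires model conversion.
# convert_checkpoint.py and run.py come from the examples in the TensorRT-LLM
# repository; their location and flags vary between releases.
# 1. Convert the Hugging Face weights to a TensorRT-LLM checkpoint
python convert_checkpoint.py --model_dir ./llama-3.1-8b \
    --output_dir ./trt-checkpoint \
    --dtype float16

# 2. Build the engine from that checkpoint
trtllm-build --checkpoint_dir ./trt-checkpoint \
    --output_dir ./engine-output \
    --gemm_plugin float16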

# 3. Run inference
python run.py --engine_dir ./engine-output \
    --max_output_len 512 \
    --tokenizer_dir ./llama-3.1-8b \
    --input_text "What is AI?"

TensorRT-LLM advantages: Fastest inference on NVIDIA hardware, FP8 support, in-flight batching

Tradeoff: Only supports NVIDIA GPUs, model conversion required, less flexible than vLLM

Engine Comparison

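| Feature | vLLM | TensorRT-LLM | Text Generation Inference (TGI) |
| --- | --- | --- | --- |
| Ease of use | Easy | Hard | Medium |
| Throughput | High | Highest | High |
| Model support | Wide | Limited | Wide |
| Multi-GPU | Yes | Yes | Yes |
| Quantization | AWQ, GPTQ | FP8, INT8 | BitsAndBytes |
| Streaming | Yes | Yes | Yes |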

Serverless GPU Platforms

For variable traffic, serverless GPU platforms handle scaling automatically:

Replicate

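# Run a hosted model through Replicate's API
# Requires the REPLICATE_API_TOKEN environment variable
import replicate

output = replicate.run(
    "meta/meta-llama-3.1-8b-instruct",
    input={"prompt": "What is AI?"}
)
# Language models return the text in chunks; join them into one string
print("".join(output))

# Or deploy your own fine-tuned model
# Upload model weights, get an API endpoint automatically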

Together AI

# Requires the TOGETHER_API_KEY environment variable
import together

client = together.Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Serverless Comparison

| Platform | Cold Start | Pricing | Best For |
| --- | --- | --- | --- |
| Replicate | ~10s | Per-second GPU | Custom models, quick deploy |
| Together AI | ~5s | Per-token | High throughput, low latency |
| Fireworks AI | ~2s | Per-token | Fastest cold starts |
| Baseten | ~15s | Per-second GPU | Enterprise features |
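
Cold starts only hurt the requests that hit them, so their impact depends on how often traffic arrives at a scaled-down endpoint. The sketch below estimates average latency from the cold-start figures in the table. The 1-second warm latency and the 5% cold-request share are assumptions chosen for illustration.

```python
def expected_latency_s(warm_latency_s, cold_start_s, cold_fraction):
    """Average latency when a fraction of requests also pays the cold-start penalty."""
    return warm_latency_s + cold_fraction * cold_start_s


# Approximate cold starts from the comparison table, in seconds
cold_starts = {"Replicate": 10, "Together AI": 5, "Fireworks AI": 2, "Baseten": 15}

for platform, cold_start in cold_starts.items():
    avg = expected_latency_s(warm_latency_s=1.0, cold_start_s=cold_start, cold_fraction=0.05)
    print(f"{platform}: ~{avg:.2f}s average, {1.0 + cold_start:.0f}s worst case")
```

Averages hide the tail: the unlucky 5% still wait for the full cold start, which is why latency-sensitive apps often keep one instance warm.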

Kubernetes Deployment

For large-scale production, Kubernetes with GPU operators provides the most control:

Basic vLLM Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - meta-llama/Llama-3.1-8B
          - --tensor-parallel-size
          - "1"
          - --gpu-memory-utilization
          - "0.9"
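        env:
        - name: HF_TOKEN              # Llama weights are gated on Hugging Face
          valueFrom:
            secretKeyRef:
              name: hf-token          # hypothetical Secret; create it separately
              key: token
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000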
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer

Auto-Scaling with KEDA

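# Scale based on request queue depth
# vLLM's /metrics endpoint returns Prometheus text, so this uses KEDA's
# prometheus trigger. It assumes a Prometheus server already scrapes the
# vLLM pods; the serverAddress below is a placeholder.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-autoscaler
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: 'sum(vllm:num_requests_waiting)'
      threshold: "5"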
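
KEDA passes the metric to the Kubernetes Horizontal Pod Autoscaler, which for an average-value target picks roughly the total metric divided by the per-replica target, rounded up and clamped to the replica bounds. The sketch below shows that arithmetic for the target of 5 pending requests used above. It is a simplification that ignores the HPA's tolerance band and stabilization windows.

```python
import math


def desired_replicas(total_pending, target_per_replica, min_replicas=1, max_replicas=10):
    """Approximate the autoscaler's replica count for an average-value target."""
    raw = math.ceil(total_pending / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


for pending in (0, 5, 12, 100):
    print(f"{pending} pending requests -> {desired_replicas(pending, 5)} replicas")
```

Because GPU pods can take minutes to pull images and load weights, the replica count that the formula asks for arrives later than the traffic that triggered it.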

Cost Optimization Strategies

1. Quantization

Reduce model precision to fit larger models on cheaper GPUs:

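# 4-bit quantization with vLLM
# quantization="awq" expects a checkpoint that was already quantized with AWQ;
# the model path below is a placeholder for such a checkpoint.
from vllm import LLM

llm = LLM(
    model="path/to/llama-3.1-70b-awq",  # placeholder, not a real repository name
    quantization="awq",  # or "gptq" for a GPTQ checkpoint
    gpu_memory_utilization=0.95,
)

| Precision | VRAM vs FP16 | Quality Loss | Speedup |
| --- | --- | --- | --- |
| FP16 (baseline) | 1x | 0% | 1x |
| INT8 | 0.5x | <1% | 1.5x |
| FP8 (Hopper and Ada GPUs, e.g. H100, L4) | 0.5x | <1% | 2x |
| INT4 (AWQ/GPTQ) | 0.25x | 2-5% | 2-3x |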
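
The idea behind the table can be shown with the simplest possible scheme: symmetric absmax quantization to 8-bit integers. This is a toy illustration in pure Python, not AWQ or GPTQ, which use calibration data and per-group scales to keep quality loss low at 4 bits.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", quantized)
print(f"scale: {scale:.5f}, max round-trip error: {max_error:.5f}")
```

Each weight now needs one byte instead of two, and the round-trip error is bounded by half the scale. Outlier weights inflate the scale for everything else, which is the problem the more sophisticated methods address.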

2. Request Batching

# Dynamic batching with vLLM
# vLLM batches concurrent requests automatically via continuous batching,
# so passing the full prompt list to llm.generate() is usually enough.
# Client-side chunking is mainly useful to bound memory or save progress.
# Assumes `llm` and `sampling_params` from the vLLM example above.

def batch_process(prompts, batch_size=8):
    """Process prompts in fixed-size chunks and collect the generated text."""
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        outputs = llm.generate(batch, sampling_params)  # blocking call
        results.extend([o.outputs[0].text for o in outputs])
    return results

3. Model Selection by Traffic

class TrafficRouter:
    """Route requests to different models based on complexity."""
    
    def __init__(self):
        self.small_model = "llama-3.1-8b"    # $0.50/hr
        self.large_model = "llama-3.1-70b"   # $3.50/hr
    
    def route(self, prompt, complexity_hint=None):
        # An explicit hint from the caller takes priority
        if complexity_hint == "complex":
            return self.large_model

        # Simple heuristic: short prompts → small model
        if complexity_hint == "simple" or len(prompt) < 200:
            return self.small_model
        
        # Otherwise fall back to a classifier
        if self.is_complex(prompt):
            return self.large_model
        return self.small_model
    
    def is_complex(self, prompt):
        # Simple keyword-based classification
        complex_keywords = ["explain", "compare", "analyze", "reason"]
        return any(kw in prompt.lower() for kw in complex_keywords)

4. Spot/Preemptible Instances

Use spot instances for batch workloads (tolerant of interruptions):

# AWS EC2 spot instance for batch inference
# Save 60-90% compared to on-demand

# GCP preemptible VM
# Save ~80%

# Azure spot VMs
# Save 60-90%

# Important: Implement checkpointing for long training jobs
# Inference workloads can simply restart on a new instance
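
For batch inference on spot capacity, restarting is only cheap if finished work is not repeated. One way to get that is to persist results after each prompt and skip them on restart. The sketch below assumes a local JSON file as the checkpoint and uses a stand-in `generate` function; a real job would call the model and write to durable storage that survives the instance.

```python
import json
import os
import tempfile


def run_batch(prompts, generate, checkpoint_path):
    """Process prompts in order, saving results so a restart resumes where it stopped."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    for index, prompt in enumerate(prompts):
        key = str(index)
        if key in results:
            continue  # already done before the interruption
        results[key] = generate(prompt)
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)  # atomic swap avoids half-written files
    return [results[str(i)] for i in range(len(prompts))]


class Preempted(Exception):
    """Stand-in for the instance being reclaimed."""


def flaky_generate(fail_on_call):
    calls = {"count": 0}

    def generate(prompt):
        calls["count"] += 1
        if calls["count"] == fail_on_call:
            raise Preempted()
        return prompt.upper()

    return generate


prompts = ["a", "b", "c", "d"]
path = os.path.join(tempfile.mkdtemp(), "progress.json")
try:
    run_batch(prompts, flaky_generate(fail_on_call=3), path)
except Preempted:
    pass  # two results were saved before the interruption
results = run_batch(prompts, str.upper, path)  # resumes at the third prompt
print(results)
```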

Latency Optimization

For interactive applications, latency is critical:

Time to First Token (TTFT) vs Time Per Output Token (TPOT)

| Metric | What It Measures | Target |
| --- | --- | --- |
| TTFT | Time from request to first token | <500ms |
| TPOT | Time between consecutive tokens | <50ms |
| Total latency | TTFT + (TPOT × token_count) | <5.5s for 100 output tokens at the targets above |
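
The total latency formula is worth working through with real numbers, because output length dominates quickly. The sketch below uses integer milliseconds to keep the arithmetic exact; the function names are illustrative.

```python
def total_latency_ms(ttft_ms, tpot_ms, output_tokens):
    """Total latency = TTFT + (TPOT x number of output tokens)."""
    return ttft_ms + tpot_ms * output_tokens


def max_tokens_within(budget_ms, ttft_ms, tpot_ms):
    """Largest number of output tokens that fits in a latency budget."""
    return max(0, (budget_ms - ttft_ms) // tpot_ms)


# At the targets of 500ms TTFT and 50ms TPOT:
for tokens in (30, 100, 1000):
    print(f"{tokens} tokens: {total_latency_ms(500, 50, tokens) / 1000:.1f}s")

print("Tokens that fit in a 2s budget:", max_tokens_within(2000, 500, 50))
```

At these targets a 2-second budget covers about 30 output tokens, which is why streaming and tight `max_tokens` limits matter for interactive use.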

Techniques to Reduce Latency

  1. Use KV cache — vLLM and TensorRT-LLM cache key-value pairs to avoid recomputation
  2. Enable prefix caching — Cache common prompt prefixes (system prompts, few-shot examples)
  3. Use speculative decoding — Draft tokens with a small model, verify with the large model
  4. Reduce max tokens — Set tight limits to prevent runaway generation
  5. Use streaming — Send tokens to the client as they're generated
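
Speculative decoding (technique 3) is the least intuitive item in this list, so here is a toy sketch of its greedy variant. The two "models" are hypothetical stand-in functions over integer tokens, and verification is written as a loop, whereas a real engine verifies all drafted tokens in one batched forward pass of the large model.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Draft k tokens cheaply, then keep the prefix the target model agrees with."""
    tokens = list(prompt)
    verify_rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens.
        context = list(tokens)
        draft = []
        for _ in range(k):
            draft.append(draft_next(context))
            context.append(draft[-1])
        # 2. The target model checks each proposed position.
        verify_rounds += 1
        context = list(tokens)
        for proposed in draft:
            expected = target_next(context)
            context.append(expected)  # always keep the target's own token
            if proposed != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            context.append(target_next(context))  # all accepted: one bonus token
        tokens = context
    return tokens[len(prompt):len(prompt) + max_new_tokens], verify_rounds


def target_next(context):  # hypothetical large model: counts upward modulo 10
    return (context[-1] + 1) % 10


def draft_next(context):  # hypothetical small model: same, but wrong after a 2
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10


output, rounds = speculative_decode(target_next, draft_next, [0], max_new_tokens=8)
print(output, "in", rounds, "verification rounds")
```

The output matches plain greedy decoding with the target model, but the target is consulted in far fewer rounds than there are tokens. The speedup depends on how often the draft model agrees with the target.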

Monitoring Production Deployments

# Key metrics to track
METRICS = {
    "throughput": "requests/second",
    "latency_p50": "median response time",
    "latency_p99": "99th percentile response time",
    "gpu_utilization": "GPU compute usage %",
    "gpu_memory": "VRAM usage %",
    "queue_depth": "Pending requests",
    "error_rate": "Failed requests %",
    "tokens_per_second": "Generation speed",
}

# vLLM exposes Prometheus metrics automatically
# Scrape endpoint: http://localhost:8000/metrics

# Example Grafana dashboard queries (PromQL)
# token throughput: rate(vllm:generation_tokens_total[1m])
# p99 latency: histogram_quantile(0.99, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))
# gpu_util: DCGM_FI_DEV_GPU_UTIL{gpu="0"}  (assumes the NVIDIA DCGM exporter is installed)
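
Median and p99 latency tell different stories, and it helps to see how they are computed from raw samples. The sketch below uses the nearest-rank method on a handful of made-up latencies; Prometheus histograms estimate the same quantities from bucket counts instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request latencies in seconds, including one slow outlier
latencies = [0.8, 1.1, 0.9, 4.2, 1.0]
print("p50:", percentile(latencies, 50))
print("p99:", percentile(latencies, 99))
```

The median looks healthy here while the p99 exposes the outlier, which is the request a user is most likely to complain about.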

Common Mistakes

  1. Self-hosting too early: Most teams save money using APIs until their spend reaches $10K+/month
  2. Not using an inference engine: Raw PyTorch can be 10-20x slower than vLLM or TensorRT-LLM
  3. Over-provisioning GPUs: Start with smaller instances and scale up based on metrics
  4. Ignoring cold starts: Serverless platforms have cold starts. Keep a warm instance for latency-sensitive apps
  5. No request timeouts: Generation can run until the context limit. Always set max_tokens and timeouts
  6. Single point of failure: Run multiple replicas behind a load balancer
  7. Not monitoring GPU memory: Out-of-memory (OOM) kills are a common cause of production failures
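
For mistake 5, a client-side timeout is a useful second line of defense next to `max_tokens`. The sketch below wraps any async generation call with `asyncio.wait_for`. The `slow_model` coroutine is a hypothetical stand-in for a real client call, and returning `None` on timeout is one possible policy among several.

```python
import asyncio


async def generate_with_timeout(generate, prompt, timeout_s):
    """Await generate(prompt), giving up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(generate(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # caller decides: retry, fall back to a smaller model, or error out


async def slow_model(prompt):
    """Hypothetical stand-in for an inference call that takes 0.2 seconds."""
    await asyncio.sleep(0.2)
    return f"response to {prompt!r}"


print(asyncio.run(generate_with_timeout(slow_model, "hi", timeout_s=0.05)))
print(asyncio.run(generate_with_timeout(slow_model, "hi", timeout_s=1.0)))
```

A client timeout only stops the caller from waiting. The server keeps generating unless the request is also cancelled or capped there, so set limits on both sides.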

Conclusion

Production LLM deployment is a spectrum. Start with managed APIs for simplicity, move to serverless GPUs as you grow, and only self-host on Kubernetes when you have the team and the scale to justify it. The key tools in 2026 are vLLM for ease of use, TensorRT-LLM for maximum NVIDIA performance, and quantization for cost reduction.

Measure everything — throughput, latency, GPU utilization, and cost per request. The optimal setup depends on your specific workload: a chatbot needs low latency, a batch processing pipeline needs high throughput, and a research tool might prioritize model capability over speed. There's no one-size-fits-all answer, but there is a right answer for your use case.