AI Model Distillation & Compression Guide 2026
Shrink LLMs without losing quality. Knowledge distillation, quantization, pruning, and deployment patterns for efficient AI systems.
Running GPT-5.5 costs $5 per million input tokens. Running a distilled 7B model costs about $0.10 for the same volume. For many applications (classification, summarization, structured extraction) the smaller model performs nearly as well while being 50x cheaper and 10x faster. Model distillation and compression are no longer research curiosities. They're production necessities. This guide covers the main techniques for making large models small without making them stupid.
Why Compress Models?
| Factor | Full Model | Compressed Model |
|---|---|---|
| Inference cost | $5-30 / 1M tokens | $0.10-1 / 1M tokens |
| Latency | 2-5 seconds | 200-500ms |
| Memory (GPU) | 40-80 GB | 4-16 GB |
| Power consumption | 300-700W | 50-150W |
| On-device deployment | Impossible | Feasible |
Knowledge Distillation
Distillation trains a small "student" model to mimic a large "teacher" model. The student learns not just from ground-truth labels but from the teacher's probability distribution — including its uncertainty.
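To make "including its uncertainty" concrete, here is a minimal sketch of how a temperature softens a teacher's output distribution. The logits and class names are made up for illustration, and the function is a plain-Python stand-in for what `F.softmax(logits / T)` does in PyTorch.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by a temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes [cat, dog, car]
teacher_logits = [6.0, 4.0, 1.0]

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])
```

At the higher temperature the top class loses probability mass to the others, so the student sees that "dog" is a far more plausible mistake than "car". A one-hot label cannot carry that information.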
Basic Distillation
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    """Combine hard labels with soft teacher predictions."""
    def __init__(self, temperature=2.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha  # Weight for distillation vs hard labels

    def forward(self, student_logits, teacher_logits, labels):
        # Soft targets from teacher
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
        soft_predictions = F.log_softmax(student_logits / self.temperature, dim=1)
        # KL divergence for distillation
        distillation_loss = F.kl_div(
            soft_predictions, soft_targets, reduction='batchmean'
        ) * (self.temperature ** 2)
        # Standard cross-entropy with hard labels
        hard_loss = F.cross_entropy(student_logits, labels)
        return self.alpha * distillation_loss + (1 - self.alpha) * hard_loss

# Training loop
teacher = load_large_model().eval()  # Frozen
student = SmallModel()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
criterion = DistillationLoss(temperature=2.0, alpha=0.7)
for batch in dataloader:
    inputs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = criterion(student_logits, teacher_logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
Chain-of-Thought Distillation
For reasoning tasks, distill the teacher's reasoning process, not just its final answer:
def generate_training_data(teacher, questions):
    """Generate reasoning traces from teacher for distillation."""
    training_data = []
    for question in questions:
        # Get teacher's reasoning
        response = teacher.generate(
            f"Solve this step by step: {question}",
            max_tokens=500,
            temperature=0.3
        )
        # Parse reasoning and answer
        reasoning, answer = parse_reasoning(response)
        training_data.append({
            "question": question,
            "reasoning": reasoning,
            "answer": answer
        })
    return training_data

# Train student on reasoning traces
def train_reasoning_student(student, training_data, optimizer):
    student.train()
    for example in training_data:
        prompt = f"Question: {example['question']}\nReasoning:"
        target = f" {example['reasoning']}\nAnswer: {example['answer']}"
        # Standard language modeling loss
        loss = compute_lm_loss(student, prompt, target)
        optimizer.zero_grad()  # clear gradients left over from the previous example
        loss.backward()
        optimizer.step()
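The code above calls `parse_reasoning` without defining it. One possible sketch follows, assuming the teacher ends its response with a line starting with `Answer:`. That marker is an assumption: the prompt shown above does not request it, so in practice the prompt would need to ask for that format.

```python
def parse_reasoning(response, marker="Answer:"):
    """Split a teacher response into (reasoning, answer) at the last marker.

    Hypothetical helper: the marker convention is an assumption, not
    something the teacher is guaranteed to follow.
    """
    head, sep, tail = response.rpartition(marker)
    if not sep:
        # No marker found: keep the text as reasoning and flag the missing answer.
        return response.strip(), None
    return head.strip(), tail.strip()

example = "Step 1: 12 * 3 = 36.\nStep 2: 36 + 4 = 40.\nAnswer: 40"
reasoning, answer = parse_reasoning(example)
print(reasoning)
print(answer)
```

Traces whose answer comes back as `None` are probably best dropped before training, since the student would otherwise learn from incomplete targets.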
Quantization
Quantization reduces the precision of model weights, typically along the path FP32 → FP16 → INT8 → INT4. Each step halves the memory needed for the weights and often speeds up inference.
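As a rough illustration of what "reducing precision" means, the sketch below applies symmetric (absmax) INT8 quantization to a handful of made-up weights and measures the round-trip error. Production libraries use more refined schemes such as per-channel or per-group scales and calibration. This is only the core idea.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization of a list of floats to int8 codes."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.003, 0.88, -0.51]  # illustrative values
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

The largest weight sets the scale, so the tiny 0.003 weight rounds to zero. That loss of small values next to large ones is why group-wise scales and outlier-aware methods exist.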
Post-Training Quantization (PTQ)
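# Using llama.cpp for GGUF quantization
# Step 1: convert the Hugging Face checkpoint to a 16-bit GGUF file
# (the converter does not emit K-quants such as Q4_K_M itself)
python convert_hf_to_gguf.py ./my-model \
  --outfile ./my-model-f16.gguf \
  --outtype f16
# Step 2: quantize with the llama-quantize binary built from llama.cpp
# (its location depends on how you built the project)
./llama-quantize ./my-model-f16.gguf ./my-model-q4.gguf Q4_K_M
# Quantization types:
# Q4_0 - 4-bit, fast, lower quality
# Q4_K_M - 4-bit with mixed precision, balanced
# Q5_K_M - 5-bit, better quality
# Q8_0 - 8-bit, near-lossless
# F16 - 16-bit, no quantization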
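# Load quantized model (Python, via the llama-cpp-python package)
from llama_cpp import Llama
model = Llama(
    model_path="./my-model-q4.gguf",
    n_ctx=4096,
    n_threads=8
)
output = model("Explain quantum computing:", max_tokens=200)
print(output["choices"][0]["text"])  # the completion text is in the first choice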
AWQ (Activation-Aware Weight Quantization)
# AWQ protects salient weights during quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-3.1-8B"
quant_path = "llama-3.1-8b-awq"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(
    tokenizer,
    quant_config={
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,
        "version": "GEMM"
    }
)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
# 4-bit AWQ typically stays within a few percent of FP16 quality
# while using about 75% less memory for the weights
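The `w_bit`, `q_group_size`, and `zero_point` settings above describe group-wise asymmetric quantization. The sketch below shows that storage format only, on made-up weights and with a group size of 4 instead of 128 to keep it readable. It does not include AWQ's activation-aware scaling, and the function names are illustrative rather than part of any library.

```python
def quantize_group(values, bits=4):
    """Asymmetric quantization of one group: returns (codes, scale, zero_point)."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = round(-lo / scale)  # the integer code that represents 0.0
    codes = [max(0, min(levels, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize_group(codes, scale, zero_point):
    """Recover approximate floats from the codes of one group."""
    return [(c - zero_point) * scale for c in codes]

def quantize_groupwise(weights, group_size=4, bits=4):
    """Quantize each consecutive group with its own scale and zero point."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]

# First group has small weights, second group has large ones (illustrative)
weights = [-0.02, 0.01, 0.03, -0.01, -1.5, 0.7, 1.2, -0.4]
packed = quantize_groupwise(weights, group_size=4, bits=4)
for codes, scale, zero_point in packed:
    print(codes, round(scale, 4), zero_point)
```

Because each group gets its own scale, the small weights in the first group keep fine resolution instead of being flattened by the large values in the second. Smaller groups improve accuracy but store more scales and zero points.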
GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)
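from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config
)
# Calibrate on sample data. quantize() expects tokenized examples
# (input_ids and attention_mask), not raw strings.
calibration_texts = ["Example text 1", "Example text 2", ...]  # "..." stands for the rest of your samples
examples = [tokenizer(text) for text in calibration_texts]
model.quantize(examples)
model.save_quantized("llama-3.1-8b-gptq")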
Pruning
Pruning removes weights that contribute least to the output. Structured pruning removes entire neurons; unstructured pruning removes individual weights.
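The difference between the two styles is easiest to see on a tiny weight matrix. In this plain-Python sketch with made-up values, each row stands for one neuron. Unstructured pruning keeps the matrix shape and zeroes scattered entries, while structured pruning drops a whole row.

```python
def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude individual weights across the matrix."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(sparsity * len(flat))
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]  # k-th smallest magnitude; ties may prune a few extra
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def structured_prune(matrix, rows_to_drop):
    """Remove the rows (neurons) with the smallest L2 norm."""
    norms = [sum(w * w for w in row) ** 0.5 for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[: len(matrix) - rows_to_drop])
    return [matrix[i][:] for i in keep]

weights = [
    [0.9, -0.05, 0.4],
    [0.02, 0.01, -0.03],
    [-0.7, 0.6, 0.08],
]
sparse = unstructured_prune(weights, sparsity=0.5)
smaller = structured_prune(weights, rows_to_drop=1)
print(sparse)   # same 3x3 shape, with zeros scattered through it
print(smaller)  # a genuinely smaller 2x3 matrix
```

The unstructured result only saves memory or time if the runtime has sparse kernels. The structured result is a smaller dense matrix that any hardware runs faster, which is why structured pruning is usually the more practical choice for deployment.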
Magnitude Pruning
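def magnitude_prune(model, sparsity=0.3):
    """Zero the smallest-magnitude weights in each layer and return the masks."""
    masks = {}
    for name, param in model.named_parameters():
        if 'weight' in name and len(param.shape) > 1:
            # Calculate threshold for this layer
            flat = param.detach().abs().flatten()
            k = int(sparsity * flat.numel())
            if k == 0:
                continue  # nothing to prune at this sparsity
            threshold = torch.kthvalue(flat, k).values
            # Create mask: keep weights strictly above the k-th smallest magnitude
            masks[name] = (param.detach().abs() > threshold).float()
            # Apply mask
            param.data *= masks[name]
    return model, masks

def apply_mask(model, masks):
    """Re-zero pruned weights after optimizer updates."""
    for name, param in model.named_parameters():
        if name in masks:
            param.data *= masks[name]
    return model

# Iterative pruning with fine-tuning
for sparsity in [0.1, 0.2, 0.3, 0.4]:
    model, masks = magnitude_prune(model, sparsity)
    # Fine-tune to recover accuracy
    for epoch in range(3):
        train(model, dataloader)
        # Re-apply mask (weights may regrow during training)
        model = apply_mask(model, masks)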
Structured Pruning (Head Pruning)
def prune_attention_heads(model, heads_to_prune):
    """Mask entire attention heads by zeroing their slice of the output projection.

    This silences the heads but does not shrink the weight matrices.
    """
    for layer_idx, head_indices in heads_to_prune.items():
        layer = model.model.layers[layer_idx].self_attn
        # Width of one head (hidden_size // num_attention_heads)
        head_size = layer.head_dim
        with torch.no_grad():  # in-place edits on parameters require no_grad
            for head_idx in sorted(head_indices, reverse=True):
                # o_proj consumes the concatenated head outputs, so columns
                # [start:end] carry this head's contribution
                start = head_idx * head_size
                end = (head_idx + 1) * head_size
                layer.o_proj.weight[:, start:end] = 0
    return model

# Determine which heads to prune based on importance scores
head_importance = compute_head_importance(model, eval_data)
heads_to_prune = select_least_important_heads(head_importance, ratio=0.2)
Techniques Compared
| Technique | Size Reduction (vs. FP32 original) | Quality Loss | Training Needed | Best For |
|---|---|---|---|---|
| FP16 conversion | 2x | None | No | Quick win |
| INT8 (PTQ) | 4x | <1% | No | Production deployment |
| INT4 (AWQ/GPTQ) | 8x | 1-3% | No | Edge devices |
| Distillation | 10-100x | 2-10% | Yes | Task-specific models |
| Pruning + fine-tune | 2-10x | 1-5% | Yes | Known architectures |
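The size-reduction figures for the quantization rows follow directly from bits per weight. A minimal sketch of that arithmetic for a 7B-parameter model, counting weights only and ignoring quantization metadata, activations, and the KV cache:

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(params_7b, bits):.1f} GB")
```

This gives 28, 14, 7, and 3.5 GB respectively. Real 4-bit files come out somewhat larger because each group also stores a scale and often a zero point.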
Deployment Patterns
Cascade Architecture
class CascadeClassifier:
    """Use cheap model first, escalate to expensive model only when needed."""
    def __init__(self):
        self.fast_model = load_quantized_model("distilled-1b-q4")
        self.accurate_model = load_api_model("gpt-5.4")
        self.confidence_threshold = 0.9
        self.fast_calls = 0  # requests answered by the fast model
        self.slow_calls = 0  # requests escalated to the accurate model

    def classify(self, text):
        # Try fast model first
        result = self.fast_model.classify(text)
        if result.confidence >= self.confidence_threshold:
            self.fast_calls += 1
            return result
        # Escalate to accurate model
        self.slow_calls += 1
        return self.accurate_model.classify(text)

    def get_stats(self):
        """Track how often we need the expensive model."""
        total = self.fast_calls + self.slow_calls
        # Every request pays for the fast model; escalated ones also pay for the slow one
        actual_cost = total * 0.10 + self.slow_calls * 2.50
        baseline_cost = total * 2.50  # cost if every request went to the slow model
        return {
            "fast_hits": self.fast_calls,
            "slow_hits": self.slow_calls,
            "cost_savings": 1 - actual_cost / baseline_cost if total else 0.0
        }
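To see what a cascade saves, here is the expected-cost arithmetic as a small sketch. It assumes every request runs through the fast model and a fixed fraction escalates. It uses the per-unit costs from the class above (0.10 and 2.50) and an assumed 20% escalation rate.

```python
def cascade_cost_per_request(escalation_rate, fast_cost, slow_cost):
    """Expected cost when every request hits the fast model and a fraction escalates."""
    return fast_cost + escalation_rate * slow_cost

def cascade_savings(escalation_rate, fast_cost, slow_cost):
    """Fraction saved relative to sending every request to the slow model."""
    expected = cascade_cost_per_request(escalation_rate, fast_cost, slow_cost)
    return 1 - expected / slow_cost

savings = cascade_savings(escalation_rate=0.2, fast_cost=0.10, slow_cost=2.50)
print(f"Savings vs. slow-only: {savings:.0%}")
```

With these numbers the cascade costs 0.60 per unit instead of 2.50, a saving of 76%. The result depends heavily on the escalation rate, so the confidence threshold is worth tuning against real traffic.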
Speculative Decoding
def speculative_decode(draft_model, target_model, prompt, max_tokens=100):
    """Use small model to draft tokens, large model to verify.

    Simplified: a draft token is accepted when the target model assigns it
    probability above 0.5. Exact speculative sampling instead accepts with
    probability min(1, p_target / p_draft).
    """
    tokens = tokenize(prompt)
    prompt_len = len(tokens)
    while len(tokens) - prompt_len < max_tokens:
        # Draft multiple tokens with small model
        draft_tokens = draft_model.generate(tokens, num_tokens=5)
        # Verify with large model in a single forward pass
        target_logits = target_model.forward(tokens + draft_tokens)
        # Logits at position j predict token j + 1, so the prediction for
        # draft token i sits at index base + i
        base = len(tokens) - 1
        all_accepted = True
        for i, draft_token in enumerate(draft_tokens):
            # Check if target model agrees
            target_probs = softmax(target_logits[base + i])
            if target_probs[draft_token] > 0.5:  # Threshold for acceptance
                tokens.append(draft_token)
            else:
                # Replace the rejected token with a sample from the target model
                tokens.append(sample(target_probs))
                all_accepted = False
                break
        if all_accepted:
            # Every draft was accepted: take one extra token from the target
            tokens.append(sample(softmax(target_logits[base + len(draft_tokens)])))
    return detokenize(tokens)
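The acceptance step is where speculative decoding gets its correctness guarantee. The sketch below shows the standard rejection-sampling rule for a single drafted token, written in plain Python over explicit probability lists. The function name and the toy distributions are illustrative.

```python
import random

def accept_or_resample(draft_token, draft_probs, target_probs, rng=random):
    """Return (token, accepted) for one drafted token.

    draft_probs and target_probs are full distributions over the vocabulary;
    draft_token must have been sampled from draft_probs.
    """
    p = target_probs[draft_token]
    q = draft_probs[draft_token]
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the normalized residual max(0, target - draft).
    residual = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False

draft = [0.7, 0.3]    # toy draft-model distribution
target = [0.4, 0.6]   # toy target-model distribution
rng = random.Random(0)
drafted = rng.choices([0, 1], weights=draft, k=1)[0]
print(accept_or_resample(drafted, draft, target, rng))
```

With this rule, the emitted token follows the target distribution exactly, so output quality matches the large model. The draft model only affects how many tokens are accepted per verification pass.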
Evaluating Compressed Models
def evaluate_compression(original_model, compressed_model, test_tasks):
    """Comprehensive evaluation of model compression quality."""
    results = {
        "original": {},
        "compressed": {},
        "relative": {}
    }
    for task_name, task_fn in test_tasks.items():
        orig_score = task_fn(original_model)
        comp_score = task_fn(compressed_model)
        results["original"][task_name] = orig_score
        results["compressed"][task_name] = comp_score
        results["relative"][task_name] = comp_score / orig_score
    # Perplexity comparison
    results["perplexity"] = {
        "original": compute_perplexity(original_model, eval_data),
        "compressed": compute_perplexity(compressed_model, eval_data)
    }
    # Speed comparison
    results["speed"] = {
        "original": benchmark_speed(original_model),
        "compressed": benchmark_speed(compressed_model)
    }
    # Memory comparison
    results["memory"] = {
        "original": get_model_size(original_model),
        "compressed": get_model_size(compressed_model)
    }
    return results
# Usage
test_tasks = {
    "classification": lambda m: evaluate_classifier(m, cls_data),
    "summarization": lambda m: evaluate_rouge(m, sum_data),
    "qa": lambda m: evaluate_exact_match(m, qa_data)
}
results = evaluate_compression(teacher, student, test_tasks)
print(f"Quality retention: {results['relative']['classification']:.1%}")
# Assumes benchmark_speed returns latency (lower is better); invert the ratio for tokens/sec
print(f"Speedup: {results['speed']['original'] / results['speed']['compressed']:.1f}x")
print(f"Size reduction: {results['memory']['original'] / results['memory']['compressed']:.1f}x")
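The evaluation above relies on a `compute_perplexity` helper that is not shown. As a hedged sketch of the underlying formula, perplexity is the exponential of the average negative log-likelihood per token. The function below takes per-token log-probabilities directly, since how to obtain them depends on the model API.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Illustrative: a model that gives every token probability 0.25
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

A model that spreads probability evenly over four choices scores a perplexity of 4. Lower is better, and a compressed model whose perplexity rises sharply on held-out text has probably lost general capability even if task scores hold up.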
Best Practices
- Start with quantization — PTQ (INT8/INT4) gives 4-8x size reduction with minimal quality loss and no training. Always try this first
- Distill for specific tasks — General-purpose distillation is hard. Task-specific distillation (classification, extraction, summarization) works much better
- Use calibration data representative of production — Quantization quality depends on calibration data distribution. Use real production samples
- Measure end-to-end latency, not just model speed — Tokenization, detokenization, and data transfer often dominate
- Consider cascade architectures — Route 80% of requests to a cheap model and 20% to an expensive one. The savings are massive
- Monitor quality degradation over time — Compressed models may drift as input distributions change. Set up evaluation pipelines
Conclusion
Model compression is about trading accuracy for efficiency. The key insight: you don't need GPT-5.5 quality for every task. A distilled 7B model can come close to GPT-5.5 on narrow tasks such as classification while being 50x cheaper. A quantized 4-bit model can run on a laptop where the full model needs an A100.
The practical workflow: quantize first (free win), then distill if you need more compression for a specific task, then prune if you have a known architecture. Measure everything — compression ratios are meaningless if quality drops below your use case's requirements.