Guide May 17, 2026

AI Model Distillation & Compression Guide 2026

Shrink LLMs without losing quality. Knowledge distillation, quantization, pruning, and deployment patterns for efficient AI systems.

Running GPT-5.5 costs $5 per million input tokens. Running a distilled 7B model costs $0.10. For many applications (classification, summarization, structured extraction) the smaller model performs nearly as well while being 50x cheaper and 10x faster. Model distillation and compression are no longer research curiosities; they are production necessities. This guide covers the main techniques for making large models small without making them stupid.

Why Compress Models?

| Factor | Full Model | Compressed Model |
| --- | --- | --- |
| Inference cost | $5-30 / 1M tokens | $0.10-1 / 1M tokens |
| Latency | 2-5 seconds | 200-500 ms |
| Memory (GPU) | 40-80 GB | 4-16 GB |
| Power consumption | 300-700 W | 50-150 W |
| On-device deployment | Impossible | Feasible |

Knowledge Distillation

Distillation trains a small "student" model to mimic a large "teacher" model. The student learns not just from ground-truth labels but from the teacher's probability distribution — including its uncertainty.
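To make "learning from the teacher's uncertainty" concrete, here is a minimal sketch in plain Python of how a temperature softens a probability distribution. The logits and class names are made up for illustration; only the temperature idea comes from the text.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (cat, dog, truck)
teacher_logits = [6.0, 4.0, 0.0]

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

At temperature 1 almost all of the probability mass sits on the top class. At temperature 4 the ranking is unchanged, but the runner-up classes receive visibly more mass, which is the extra signal the student gets beyond a one-hot label.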

Basic Distillation

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    """Combine hard labels with soft teacher predictions."""
    
    def __init__(self, temperature=2.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha  # Weight for distillation vs hard labels
    
    def forward(self, student_logits, teacher_logits, labels):
        # Soft targets from teacher
        soft_targets = F.softmax(teacher_logits / self.temperature, dim=1)
        soft_predictions = F.log_softmax(student_logits / self.temperature, dim=1)
        
        # KL divergence for distillation
        distillation_loss = F.kl_div(
            soft_predictions, soft_targets, reduction='batchmean'
        ) * (self.temperature ** 2)
        
        # Standard cross-entropy with hard labels
        hard_loss = F.cross_entropy(student_logits, labels)
        
        return self.alpha * distillation_loss + (1 - self.alpha) * hard_loss

# Training loop
teacher = load_large_model().eval()  # Frozen
student = SmallModel()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
criterion = DistillationLoss(temperature=2.0, alpha=0.7)

for batch in dataloader:
    inputs, labels = batch
    
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    
    student_logits = student(inputs)
    loss = criterion(student_logits, teacher_logits, labels)
    
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
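Written as a formula, the loss computed by DistillationLoss above looks like the following. The symbols are notation introduced here, not taken from the original: z_s and z_t are the student and teacher logits, σ is the softmax, T is the temperature, α is the mixing weight, and y is the hard label.

```latex
\mathcal{L}
  = \alpha \, T^{2} \,
    \mathrm{KL}\!\left( \sigma(z_t / T) \,\middle\|\, \sigma(z_s / T) \right)
  + (1 - \alpha) \, \mathrm{CE}\!\left( \sigma(z_s),\, y \right)
```

The factor T^2 compensates for the soft-target gradients shrinking roughly as 1/T^2 when the logits are divided by T. With the values used in the training loop (T = 2, α = 0.7), the KL term is weighted by 0.7 × 4 = 2.8 and the cross-entropy term by 0.3.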

Chain-of-Thought Distillation

For reasoning tasks, distill the teacher's reasoning process, not just its final answer:

def generate_training_data(teacher, questions):
    """Generate reasoning traces from teacher for distillation."""
    training_data = []
    
    for question in questions:
        # Get teacher's reasoning
        response = teacher.generate(
            f"Solve this step by step: {question}",
            max_tokens=500,
            temperature=0.3
        )
        
        # Parse reasoning and answer
        reasoning, answer = parse_reasoning(response)
        
        training_data.append({
            "question": question,
            "reasoning": reasoning,
            "answer": answer
        })
    
    return training_data

# Train student on reasoning traces
def train_reasoning_student(student, training_data, optimizer):
    for example in training_data:
        prompt = f"Question: {example['question']}\nReasoning:"
        target = f" {example['reasoning']}\nAnswer: {example['answer']}"
        
        # Standard language modeling loss on the target tokens
        loss = compute_lm_loss(student, prompt, target)
        
        optimizer.zero_grad()  # clear gradients left over from the previous example
        loss.backward()
        optimizer.step()
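The parse_reasoning helper is not defined above. One possible version is sketched here, under the assumption that the teacher ends each response with a line such as "Answer: 7". The filter_traces function and its reference_answers argument are also hypothetical; they show one way to drop traces whose final answer is wrong before training on them.

```python
def parse_reasoning(response, marker="Answer:"):
    """Split a teacher response into (reasoning, answer) at the last marker.

    Hypothetical helper: returns (response, None) if no marker is found.
    """
    head, sep, tail = response.rpartition(marker)
    if not sep:
        return response.strip(), None
    return head.strip(), tail.strip()


def filter_traces(training_data, reference_answers):
    """Keep only traces whose final answer matches a known reference answer."""
    kept = []
    for example in training_data:
        expected = reference_answers.get(example["question"])
        if example["answer"] is not None and example["answer"] == expected:
            kept.append(example)
    return kept


response = "12 apples minus 5 eaten leaves 7.\nAnswer: 7"
reasoning, answer = parse_reasoning(response)
print(reasoning)
print(answer)
```

Filtering only works when reference answers exist for the questions. Without them, the student simply inherits whatever mistakes the teacher makes.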

Quantization

Quantization reduces the numeric precision of model weights, for example FP32 → FP16 → INT8 → INT4. Each step halves the memory needed for the weights and often speeds up inference.
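The halving is easy to check with arithmetic. The sketch below estimates weight memory for a 7B-parameter model at each precision. It counts the weights only, so activations, the KV cache, and the small overhead of quantization scales are ignored.

```python
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9

for precision in BYTES_PER_WEIGHT:
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
# FP32: 28.0 GB
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```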

Post-Training Quantization (PTQ)

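# Using llama.cpp for GGUF quantization
# Step 1: convert the Hugging Face checkpoint to an FP16 GGUF file
# (the model directory is a positional argument)

python convert_hf_to_gguf.py ./my-model \
    --outfile ./my-model-f16.gguf \
    --outtype f16

# Step 2: quantize with the llama-quantize tool built alongside llama.cpp
# (its location depends on how you built the project)

llama-quantize ./my-model-f16.gguf ./my-model-q4.gguf Q4_K_M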

# Quantization types:
# q4_0 - 4-bit, fast, lower quality
# q4_k_m - 4-bit with mixed precision, balanced
# q5_k_m - 5-bit, better quality
# q8_0 - 8-bit, near-lossless
# f16 - 16-bit, no quantization

# Load quantized model
from llama_cpp import Llama

model = Llama(
    model_path="./my-model-q4.gguf",
    n_ctx=4096,
    n_threads=8
)

output = model("Explain quantum computing:", max_tokens=200)
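To show what post-training quantization does to the numbers themselves, here is a minimal sketch of symmetric (absmax) INT8 quantization in plain Python. It is a generic illustration, not the GGUF file format, which quantizes in blocks with more elaborate layouts. The weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization of a list of floats to int8 values."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.05, 0.9]  # illustrative values, not from a real model
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.4f}, max round-trip error={max_error:.6f}")
```

The round-trip error of any weight is at most half the scale, so a single large outlier in the list makes every other weight less precise. That is the motivation for the per-group and outlier-aware schemes covered next.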

AWQ (Activation-Aware Weight Quantization)

# AWQ protects salient weights during quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B"
quant_path = "llama-3.1-8b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(
    tokenizer,
    quant_config={
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,
        "version": "GEMM"
    }
)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# 4-bit AWQ typically stays within a few percent of FP16 quality
# while cutting weight memory by roughly 75%
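The quant_config fields above are easier to read with a toy version of group-wise quantization in hand. The sketch below is a generic illustration in plain Python of what w_bit, q_group_size, and a zero point control. It does not implement AWQ's activation-aware scaling, and the function names are hypothetical.

```python
def quantize_group(values, w_bit=4):
    """Asymmetric quantization of one group to integer codes in [0, 2**w_bit - 1]."""
    levels = 2 ** w_bit - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # sketch assumes a non-constant group
    zero_point = round(-lo / scale)  # integer code that represents 0.0
    codes = [max(0, min(levels, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize_group(codes, scale, zero_point):
    """Recover approximate floats from integer codes."""
    return [(c - zero_point) * scale for c in codes]

def quantize_groupwise(weights, q_group_size=128, w_bit=4):
    """Quantize a flat weight list in independent groups of q_group_size."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]

# Two groups with very different ranges (illustrative values, group size 4)
weights = [0.01, -0.02, 0.03, 0.00, 1.5, -2.0, 0.7, 2.5]
groups = quantize_groupwise(weights, q_group_size=4, w_bit=4)
for codes, scale, zero_point in groups:
    print(codes, round(scale, 4), zero_point)
```

Because each group gets its own scale and zero point, the small-valued first group keeps fine resolution even though the second group spans a range about a hundred times wider. Smaller groups track the weights more closely but store more scales.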

GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)

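from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config
)

# Calibrate on sample data
# quantize() expects tokenized examples (input_ids and attention_mask), not raw strings
texts = ["Example text 1", "Example text 2", ...]  # replace ... with more samples
examples = [tokenizer(text) for text in texts]
model.quantize(examples)

model.save_quantized("llama-3.1-8b-gptq")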

Pruning

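Pruning removes the weights that contribute least to the output. Unstructured pruning zeroes individual weights, which needs sparse kernels to turn into real speedups; structured pruning removes whole units such as neurons or attention heads, so the remaining model is smaller and still dense.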
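The difference between the two styles shows up clearly on a tiny matrix. The sketch below uses plain Python lists and made-up weights, with rows standing in for neurons. The function names are illustrative.

```python
def unstructured_prune(matrix, sparsity):
    """Zero the smallest-magnitude individual weights across the whole matrix."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(sparsity * len(flat))
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def structured_prune(matrix, num_rows_to_remove):
    """Drop the rows (neurons) with the smallest L2 norm, shrinking the matrix."""
    norms = [sum(w * w for w in row) ** 0.5 for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:len(matrix) - num_rows_to_remove])
    return [matrix[i][:] for i in keep]

weights = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
]
sparse = unstructured_prune(weights, sparsity=0.5)          # same shape, some zeros
smaller = structured_prune(weights, num_rows_to_remove=1)   # one row fewer

print(sparse)
print(smaller)
```

The unstructured result still has three rows and needs a sparse format or kernel to save anything. The structured result is a genuinely smaller dense matrix.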

Magnitude Pruning

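def magnitude_prune(model, sparsity=0.3):
    """Zero out the smallest-magnitude weights in each weight matrix."""
    masks = {}
    
    with torch.no_grad():
        for name, param in model.named_parameters():
            if 'weight' in name and param.dim() > 1:
                # Calculate threshold for this layer
                k = int(sparsity * param.numel())
                if k == 0:
                    continue  # nothing to prune (kthvalue requires k >= 1)
                threshold = torch.kthvalue(param.abs().flatten(), k).values
                
                # Create mask: keep only weights strictly above the threshold
                mask = (param.abs() > threshold).to(param.dtype)
                
                # Apply mask and remember it for later re-application
                param.mul_(mask)
                masks[name] = mask
    
    return model, masks

def apply_mask(model, masks):
    """Re-zero pruned weights after an optimizer update."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
    return model

# Iterative pruning with fine-tuning
for sparsity in [0.1, 0.2, 0.3, 0.4]:
    model, masks = magnitude_prune(model, sparsity)
    
    # Fine-tune to recover accuracy
    for epoch in range(3):
        train(model, dataloader)
        
        # Re-apply mask (weights may regrow during training)
        model = apply_mask(model, masks)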

Structured Pruning (Head Pruning)

def prune_attention_heads(model, heads_to_prune):
    """Mask entire attention heads in transformers by zeroing their output."""
    
    with torch.no_grad():  # in-place edits to parameters must not be tracked by autograd
        for layer_idx, head_indices in heads_to_prune.items():
            layer = model.model.layers[layer_idx].self_attn
            
            # Each query head owns head_dim consecutive input columns of o_proj
            head_size = layer.head_dim
            
            for head_idx in head_indices:
                # Mask the head in the output projection
                start = head_idx * head_size
                end = (head_idx + 1) * head_size
                layer.o_proj.weight[:, start:end] = 0
    
    # Note: this masks heads but does not shrink the matrices; memory and
    # speed only improve once the masked columns are physically removed
    return model

# Determine which heads to prune based on importance scores
head_importance = compute_head_importance(model, eval_data)
heads_to_prune = select_least_important_heads(head_importance, ratio=0.2)

Techniques Compared

| Technique | Size Reduction | Quality Loss | Training Needed | Best For |
| --- | --- | --- | --- | --- |
| FP16 conversion | 2x | None | No | Quick win |
| INT8 (PTQ) | 4x | <1% | No | Production deployment |
| INT4 (AWQ/GPTQ) | 8x | 1-3% | No | Edge devices |
| Distillation | 10-100x | 2-10% | Yes | Task-specific models |
| Pruning + fine-tune | 2-10x | 1-5% | Yes | Known architectures |

Deployment Patterns

Cascade Architecture

class CascadeClassifier:
    """Use cheap model first, escalate to expensive model only when needed."""
    
    def __init__(self):
        self.fast_model = load_quantized_model("distilled-1b-q4")
        self.accurate_model = load_api_model("gpt-5.4")
        self.confidence_threshold = 0.9
        self.fast_calls = 0  # requests answered by the fast model
        self.slow_calls = 0  # requests escalated to the accurate model
    
    def classify(self, text):
        # Try fast model first
        result = self.fast_model.classify(text)
        
        if result.confidence >= self.confidence_threshold:
            self.fast_calls += 1
            return result
        
        # Escalate to accurate model
        self.slow_calls += 1
        return self.accurate_model.classify(text)
    
    def get_stats(self):
        """Track how often we need the expensive model."""
        total = self.fast_calls + self.slow_calls
        # Every request pays for the fast model; escalated ones also pay for the slow one
        cascade_cost = total * 0.10 + self.slow_calls * 2.50
        baseline_cost = total * 2.50  # cost if everything went to the slow model
        return {
            "fast_hits": self.fast_calls,
            "slow_hits": self.slow_calls,
            "cost_savings": 1 - cascade_cost / baseline_cost if total else 0.0
        }
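A short calculation shows how the escalation rate drives the savings. The sketch below reuses the two prices from the class above ($0.10 and $2.50 per 1M tokens) and assumes every request pays for the fast model, with escalated requests paying for the slow model as well. The 20% escalation rate is illustrative.

```python
def cascade_cost(escalation_rate, fast_price, slow_price):
    """Expected cost per unit of traffic: all requests hit the fast model,
    and a fraction escalation_rate also hits the slow model."""
    return fast_price + escalation_rate * slow_price

def cascade_savings(escalation_rate, fast_price, slow_price):
    """Fraction saved relative to sending everything to the slow model."""
    return 1 - cascade_cost(escalation_rate, fast_price, slow_price) / slow_price

cost = cascade_cost(0.2, fast_price=0.10, slow_price=2.50)
savings = cascade_savings(0.2, fast_price=0.10, slow_price=2.50)
print(f"${cost:.2f} per 1M tokens, {savings:.0%} saved")
# $0.60 per 1M tokens, 76% saved
```

With these prices the cascade breaks even only when 96% of requests escalate, so the confidence threshold can be set for quality rather than tuned for cost.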

Speculative Decoding

def speculative_decode(draft_model, target_model, prompt, max_tokens=100):
    """Use small model to draft tokens, large model to verify.
    
    Simplified sketch: tokenize, detokenize, softmax, and sample are assumed
    helpers, and target_model.forward returns one row of next-token logits
    per input position.
    """
    
    tokens = tokenize(prompt)
    prompt_len = len(tokens)
    
    # Count generated tokens only, not the prompt
    while len(tokens) - prompt_len < max_tokens:
        # Draft multiple tokens with small model
        draft_tokens = draft_model.generate(tokens, num_tokens=5)
        
        # Verify with large model in a single forward pass
        target_logits = target_model.forward(tokens + draft_tokens)
        
        # Logits at position p predict the token at position p + 1
        base = len(tokens) - 1
        
        # Accept tokens until disagreement
        for i, draft_token in enumerate(draft_tokens):
            target_probs = softmax(target_logits[base + i])
            
            # Simplified threshold; full speculative sampling accepts with
            # probability min(1, p_target / p_draft)
            if target_probs[draft_token] > 0.5:
                tokens.append(draft_token)
            else:
                # Rejected: take the target model's token instead and stop
                tokens.append(sample(target_probs))
                break
        else:
            # Every draft accepted: the same pass yields one extra target token
            tokens.append(sample(softmax(target_logits[-1])))
    
    return detokenize(tokens[:prompt_len + max_tokens])
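The fixed acceptance threshold in the function above is a simplification. The rule usually used for speculative sampling accepts a drafted token with probability min(1, p_target / p_draft) and, on rejection, resamples from the normalized positive part of (target minus draft). The sketch below applies that rule to one token, with distributions given as plain dicts. The function name and the toy distributions are illustrative.

```python
import random

def accept_or_resample(draft_token, draft_probs, target_probs, rng=random):
    """Apply the speculative sampling rule to a single drafted token.

    draft_probs and target_probs map token -> probability.
    Returns (token, accepted).
    """
    p = target_probs.get(draft_token, 0.0)
    q = draft_probs[draft_token]  # > 0 because the draft model sampled this token
    if rng.random() < min(1.0, p / q):
        return draft_token, True

    # Rejected: resample from the normalized residual max(0, target - draft)
    residual = {t: max(0.0, target_probs[t] - draft_probs.get(t, 0.0))
                for t in target_probs}
    r = rng.random() * sum(residual.values())
    cumulative = 0.0
    for token, weight in residual.items():
        cumulative += weight
        if r < cumulative:
            return token, False
    return token, False  # guard against floating-point rounding

draft = {"a": 0.7, "b": 0.2, "c": 0.1}
target = {"a": 0.4, "b": 0.5, "c": 0.1}

# "b" is more likely under the target than under the draft, so it is always kept
print(accept_or_resample("b", draft, target))
```

The appeal of this rule is that the tokens it emits follow the target model's distribution exactly, so the draft model affects speed but not output quality.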

Evaluating Compressed Models

def evaluate_compression(original_model, compressed_model, test_tasks):
    """Comprehensive evaluation of model compression quality."""
    
    results = {
        "original": {},
        "compressed": {},
        "relative": {}
    }
    
    for task_name, task_fn in test_tasks.items():
        orig_score = task_fn(original_model)
        comp_score = task_fn(compressed_model)
        
        results["original"][task_name] = orig_score
        results["compressed"][task_name] = comp_score
        results["relative"][task_name] = comp_score / orig_score
    
    # Perplexity comparison
    results["perplexity"] = {
        "original": compute_perplexity(original_model, eval_data),
        "compressed": compute_perplexity(compressed_model, eval_data)
    }
    
    # Speed comparison
    results["speed"] = {
        "original": benchmark_speed(original_model),
        "compressed": benchmark_speed(compressed_model)
    }
    
    # Memory comparison
    results["memory"] = {
        "original": get_model_size(original_model),
        "compressed": get_model_size(compressed_model)
    }
    
    return results

# Usage
test_tasks = {
    "classification": lambda m: evaluate_classifier(m, cls_data),
    "summarization": lambda m: evaluate_rouge(m, sum_data),
    "qa": lambda m: evaluate_exact_match(m, qa_data)
}

results = evaluate_compression(teacher, student, test_tasks)
print(f"Quality retention: {results['relative']['classification']:.1%}")
# Assumes benchmark_speed returns latency (lower is better); invert the ratio for throughput
print(f"Speedup: {results['speed']['original'] / results['speed']['compressed']:.1f}x")
print(f"Size reduction: {results['memory']['original'] / results['memory']['compressed']:.1f}x")
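The compute_perplexity helper is left undefined above. The core calculation is small enough to sketch in plain Python, assuming you already have the natural-log probability that a model assigned to each token of the evaluation text. The function names and sample numbers here are illustrative.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference text."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def relative_increase(original_ppl, compressed_ppl):
    """Fractional perplexity increase caused by compression."""
    return (compressed_ppl - original_ppl) / original_ppl

# A model that assigns probability 0.25 to every token has perplexity 4
uniform = perplexity([math.log(0.25)] * 4)
print(round(uniform, 6))

# Hypothetical perplexities for an original and a compressed model
print(f"{relative_increase(5.0, 5.2):.1%} higher perplexity after compression")
```

Lower perplexity is better, so this ratio runs in the opposite direction from the task scores, where a relative value below 1.0 indicates a loss.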

Best Practices

  1. Start with quantization: PTQ (INT8/INT4) gives a 4-8x size reduction relative to FP32 with minimal quality loss and no training. Always try this first.
  2. Distill for specific tasks: General-purpose distillation is hard. Task-specific distillation (classification, extraction, summarization) works much better.
  3. Use calibration data representative of production: Quantization quality depends on the calibration data distribution, so use real production samples.
  4. Measure end-to-end latency, not just model speed: Tokenization, detokenization, and data transfer add overhead that a model-only benchmark misses.
  5. Consider cascade architectures: Route most requests to a cheap model and escalate only the hard ones. If 20% of requests escalate from a $0.10 model to a $2.50 model, the blended cost is about $0.60 per 1M tokens instead of $2.50.
  6. Monitor quality degradation over time: A compressed model's quality may degrade as input distributions shift. Set up evaluation pipelines.

Conclusion

Model compression is about trading a small amount of accuracy for a large gain in efficiency. The key insight: you don't need GPT-5.5 quality for every task. A distilled 7B model can come close to GPT-5.5 on narrow tasks such as classification while being 50x cheaper. A quantized 4-bit model can run on a laptop where the full-precision model needs an A100.

The practical workflow: quantize first (free win), then distill if you need more compression for a specific task, then prune if you have a known architecture. Measure everything — compression ratios are meaningless if quality drops below your use case's requirements.