Guide May 14, 2026

AI Fine-Tuning & Model Customization Guide 2026

When to fine-tune vs use RAG. SFT, DPO, RFT methods explained. Open-source fine-tuning with LoRA/QLoRA and production deployment patterns.

Fine-tuning a large language model used to be the domain of AI researchers with access to GPU clusters. In 2026, it's a production technique available to any developer with a few hundred examples and a cloud account. But fine-tuning is also overused — many teams jump to it before exhausting simpler approaches like prompt engineering and RAG. This guide covers when fine-tuning actually makes sense, the different methods available, how to do it efficiently, and how to deploy custom models in production.

Fine-Tuning vs RAG: Which Should You Use?

This is the most common question. The answer depends on what you're trying to achieve:

| Use Case | Best Approach | Why |
|---|---|---|
| Need domain knowledge (legal, medical) | RAG | Knowledge changes; RAG updates instantly |
| Need specific output format/style | Fine-tuning | Teaches the model how to respond, not what to know |
| Need to reduce prompt length | Fine-tuning | Examples baked into weights, not context |
| Need to handle novel input patterns | Fine-tuning | Generalizes beyond what's in context |
| Need real-time knowledge updates | RAG | No retraining needed |
| Need to teach reasoning patterns | Fine-tuning (RFT) | Reinforces chain-of-thought |

Rule of thumb: RAG is for knowledge. Fine-tuning is for behavior. If you need the model to know new facts, use RAG. If you need the model to respond in a specific way, format, or style, use fine-tuning.

Fine-Tuning Methods Explained

There are four main approaches to customizing LLMs in 2026:

1. Supervised Fine-Tuning (SFT)

The classic approach: provide input/output examples and train the model to mimic them.

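# Training data format (JSONL). Shown pretty-printed for readability;
# in the actual file, each example must occupy a single line.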
{"messages": [
    {"role": "system", "content": "You are a medical assistant."},
    {"role": "user", "content": "Patient has fever, cough, fatigue. Diagnosis?"},
    {"role": "assistant", "content": "Based on symptoms, possible influenza..."}
]}
{"messages": [
    {"role": "system", "content": "You are a medical assistant."},
    {"role": "user", "content": "Rash on arms, itchy, after hiking."},
    {"role": "assistant", "content": "Likely contact dermatitis or poison ivy..."}
]}

Best for: Classification, format compliance, translation, correcting instruction-following failures

Data needed: 50-500 high-quality examples

Cost: $0.008-0.080 per 1K tokens trained (OpenAI)
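
Because a malformed line can cause an upload to be rejected or quietly teach the wrong pattern, it can help to check the file before submitting it. The sketch below is one possible validator for the text-only chat format shown above; the function names and the specific rules (known roles, non-empty string content, assistant message last) are assumptions for illustration, not any platform's official checks.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_sft_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        # Text-only check; multimodal content (lists of parts) is out of scope here
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

def validate_sft_file(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; blank lines are skipped."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        if line.strip():
            problems = validate_sft_line(line)
            if problems:
                report[lineno] = problems
    return report
```

In practice you would pass an open file handle, for example `validate_sft_file(open("training_data.jsonl", encoding="utf-8"))`, and fix every reported line before training.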

2. Direct Preference Optimization (DPO)

Instead of showing the model the "right" answer, show it a pair of answers and tell it which is better.

# DPO training data format (illustrative; exact field names vary by
# platform and library, so check your provider's schema)
{
    "messages": [
        {"role": "user", "content": "Summarize this article about climate change."}
    ],
    "chosen": [
        {"role": "assistant", "content": "Climate change refers to... [concise, factual summary]"}
    ],
    "rejected": [
        {"role": "assistant", "content": "Well, climate change is a very complex topic that many people have opinions about... [verbose, unfocused]"}
    ]
}

Best for: Summarization quality, chat tone/style, ranking tasks

Data needed: 100-1,000 preference pairs

Advantage: Easier to collect preferences than perfect ground-truth outputs
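
To make the "which is better" signal concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It assumes you already have the summed log-probability of each response under the model being trained (the policy) and under a frozen reference copy; the function name and the default `beta=0.1` are illustrative choices, and real training libraries compute this over batches of tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # How much more (or less) likely each response became relative to the reference
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: no preference learned yet
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: lower loss
print(dpo_loss(-10.0, -20.0, -12.0, -15.0))
```

The loss falls as the policy raises the chosen response's likelihood and lowers the rejected one's, both measured against the reference model, which is what keeps the tuned model from drifting arbitrarily far from its starting point.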

3. Reinforcement Fine-Tuning (RFT)

Train a reasoning model to think better by grading its chain-of-thought and reinforcing high-scoring reasoning paths.

# RFT requires:
# 1. A prompt
# 2. A grader function that scores the model's reasoning
# 3. Multiple reasoning attempts per prompt

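# Example grader for math problems.
# `solve` is a placeholder for your own reference solver or answer-key lookup.
def grade_math_solution(problem, reasoning, answer):
    """Score 0-100 based on correctness and reasoning quality."""
    correct_answer = solve(problem)

    try:
        is_correct = abs(float(answer) - correct_answer) < 0.01
    except (TypeError, ValueError):
        is_correct = False  # Non-numeric answers count as wrong

    base_score = 80 if is_correct else 0

    # Crude bonus for visible step-by-step reasoning (keyword match only)
    if "step" in reasoning.lower() or "first" in reasoning.lower():
        base_score += 20

    return min(100, base_score)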

Best for: Complex reasoning tasks, medical diagnosis, legal analysis, math/science problems

Data needed: 100-1,000 prompts with expert graders

Only available on: Reasoning models (o4-mini)
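
How a managed platform turns grader scores into weight updates is not described here. As a rough illustration only, one scheme used in open-source RL fine-tuning (GRPO-style methods) compares each attempt to the other attempts at the same prompt, so that above-average reasoning is reinforced and below-average reasoning is discouraged. The function name and the example scores below are hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(scores, eps=1e-6):
    """Center and scale grader scores across attempts at one prompt."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps avoids division by zero when every attempt gets the same score
    return [(s - mu) / (sigma + eps) for s in scores]

# Four attempts at the same prompt, scored 0-100 by a grader
scores = [100, 80, 0, 0]
for score, adv in zip(scores, group_advantages(scores)):
    print(score, round(adv, 3))
```

Under this scheme, a prompt where every attempt receives the same score yields no learning signal, which is one reason a grader that distinguishes between attempts matters so much.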

4. Vision Fine-Tuning

Train the model to better understand specific types of images.

# Vision fine-tuning data
{
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/xray.jpg"}},
            {"type": "text", "text": "What abnormality do you see?"}
        ]},
        {"role": "assistant", "content": "There is a fracture visible in the distal radius..."}
    ]
}

Best for: Medical imaging, defect detection, document classification

Open-Source Fine-Tuning with LoRA

Full fine-tuning updates all model parameters and requires massive GPU resources. LoRA (Low-Rank Adaptation) updates only a small set of adapter weights, making fine-tuning accessible on consumer hardware.
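
The idea behind the adapter weights can be shown with a toy example. In LoRA, the frozen weight matrix W is left untouched and a low-rank product B·A, scaled by alpha/r, is added to it; only A and B are trained. The sketch below uses plain Python lists and made-up 3x3 values purely for illustration; real implementations operate on GPU tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight used once the adapter is applied."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: 3x3 frozen weight, rank-1 adapter
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen
A = [[1.0, 2.0, 3.0]]                                    # r x d_in, trainable
B = [[0.5], [0.0], [0.0]]                                # d_out x r, trainable
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
print(W_eff)  # [[2.0, 2.0, 3.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Here the adapter has 6 trainable values against 9 in the full matrix. The saving grows quickly with matrix size, because the adapter scales with r × (d_in + d_out) rather than d_in × d_out.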

LoRA/QLoRA Setup

# Install dependencies
# pip install transformers peft accelerate bitsandbytes trl datasets

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

# 1. Load base model (4-bit quantized for QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# 2. Prepare for training
model = prepare_model_for_kbit_training(model)

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,              # Rank (higher = more capacity, more params)
    lora_alpha=32,     # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Prints the summary itself and returns None
# With this config, the adapters are well under 1% of the model's parameters

# 4. Load dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

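# 5. Tokenize
# Assumes each record has a "text" field. Chat-format "messages" records must
# first be rendered to a single string, e.g. with tokenizer.apply_chat_template.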
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )

tokenized = dataset.map(tokenize_function, batched=True)

# 6. Train
training_args = TrainingArguments(
    output_dir="./lora-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
)

trainer.train()

# 7. Save adapter
model.save_pretrained("./lora-adapter")

Key LoRA parameters:

| Parameter | Description | Typical Values |
|---|---|---|
| r (rank) | Size of low-rank matrices | 8, 16, 32, 64 |
| lora_alpha | Scaling factor | 2*r (e.g., 32 for r=16) |
| target_modules | Which layers to adapt | q_proj, v_proj, k_proj, o_proj |
| lora_dropout | Regularization | 0.05-0.1 |
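
To see how the rank affects adapter size, the arithmetic can be sketched directly: each adapted matrix of shape d_in × d_out adds r × (d_in + d_out) trainable parameters. The layer shapes below are assumptions for an 8B Llama-style model with grouped-query attention (hidden size 4096, key/value width 1024, 32 layers); check your model's config for the real values.

```python
def lora_param_count(module_shapes, r, num_layers):
    """Count adapter parameters: r * (d_in + d_out) per adapted matrix."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers

# Assumed (d_in, d_out) shapes of the attention projections in one layer
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

for rank in (8, 16, 32, 64):
    print(f"r={rank}: {lora_param_count(shapes, rank, num_layers=32):,} trainable parameters")
```

Under these assumptions, r=16 gives about 13.6 million trainable parameters, and the count doubles each time the rank doubles.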

Loading a LoRA Adapter

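from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach the adapter, then merge it into the base weights for faster inference
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
model = model.merge_and_unload()

# Save a standalone checkpoint (weights + tokenizer) for serving
model.save_pretrained("./merged-lora-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B").save_pretrained("./merged-lora-model")

# Or keep the adapter separate for multi-tenant serving:
# skip merge_and_unload() and load a different adapter per customer
# model = PeftModel.from_pretrained(base_model, "./lora-adapter")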

Platform Comparison

| Platform | Methods | Cost | Best For |
|---|---|---|---|
| OpenAI | SFT, DPO, RFT, Vision | $0.008-0.080/1K tokens | Quick iteration, managed infrastructure |
| Google Vertex AI | SFT, RLHF | Pay per hour | Gemini-based customization |
| AWS Bedrock | Continued pre-training, fine-tuning | Pay per hour | Enterprise, AWS ecosystem |
| Hugging Face | Any (open-source) | Free (self-host) or cloud | Maximum flexibility, open models |
| Replicate | LoRA training | Per-minute GPU | Quick LoRA experiments |

Data Preparation Best Practices

The quality of your training data matters more than the quantity. A few hundred excellent examples beat thousands of mediocre ones.

Data Quality Checklist

  • Consistency: All examples follow the same format and style
  • Diversity: Cover edge cases, not just happy paths
  • Accuracy: Every output is correct — the model will learn your mistakes
  • Completeness: Include system prompts, user context, expected behavior
  • Balance: Don't over-represent any single scenario
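
Parts of this checklist can be automated. The following sketch covers two of the items, exact duplicates (a consistency and balance problem) and over-represented scenarios. It assumes records in the chat `messages` format and a caller-supplied `category_of` function; both helper names are made up for this example.

```python
import json
from collections import Counter

def find_duplicates(records):
    """Return indices of records whose messages exactly repeat an earlier one."""
    seen, dupes = set(), []
    for i, record in enumerate(records):
        key = json.dumps(record["messages"], sort_keys=True)
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

def category_shares(records, category_of):
    """Fraction of the dataset that falls in each category."""
    counts = Counter(category_of(r) for r in records)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

records = [
    {"messages": [{"role": "user", "content": "Refund policy?"},
                  {"role": "assistant", "content": "Refunds within 30 days."}]},
    {"messages": [{"role": "user", "content": "Refund policy?"},
                  {"role": "assistant", "content": "Refunds within 30 days."}]},
    {"messages": [{"role": "user", "content": "Reset my password"},
                  {"role": "assistant", "content": "Use the reset link."}]},
]

print(find_duplicates(records))
# Here the category is simply the first word of the user message
print(category_shares(records, lambda r: r["messages"][0]["content"].split()[0]))
```

Accuracy and diversity still need human review; checks like these only catch the mechanical problems.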

Data Format for Different Methods

# SFT format (conversational)
{"messages": [
    {"role": "system", "content": "You are a legal assistant."},
    {"role": "user", "content": "What is the statute of limitations for contract disputes in California?"},
    {"role": "assistant", "content": "In California, the statute of limitations for written contracts is 4 years (CCP 337), and for oral contracts is 2 years (CCP 339)."}
]}

# DPO format (preference pairs)
{
    "prompt": "Summarize this article",
    "chosen": "Concise, factual summary...",
    "rejected": "Verbose, opinionated summary..."
}

# Completion format (for base models)
{"prompt": "Translate English to French: 'Hello world'", "completion": "'Bonjour le monde'"}

Production Deployment

Serving Fine-Tuned Models

# Option 1: Use the platform's API (OpenAI, etc.)
# Fine-tuned models get their own model ID
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="ft:gpt-5:your-org:custom-model:abc123",  # Your fine-tuned model
    messages=[{"role": "user", "content": "Your prompt"}]
)

# Option 2: Self-host with vLLM (for open-source models)
from vllm import LLM, SamplingParams

# tensor_parallel_size is the number of GPUs to shard the model across
llm = LLM(model="./merged-lora-model", tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)

outputs = llm.generate(["Your prompt here"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

A/B Testing Fine-Tuned vs Base Model

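import random

# Assumes `client` is an initialized OpenAI client and `log_experiment`
# is your own logging function.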

def route_request(user_prompt, base_model="gpt-5", tuned_model="ft:gpt-5:..."):
    """Route 10% of traffic to fine-tuned model for comparison."""
    if random.random() < 0.1:
        model = tuned_model
        variant = "fine_tuned"
    else:
        model = base_model
        variant = "base"
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}]
    )
    
    # Log for comparison
    log_experiment(user_prompt, response, variant)
    return response
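
Purely random routing means the same user can bounce between variants from one request to the next, which makes per-user comparisons noisy. A common alternative, sketched here under the assumption that each request carries a stable user ID, is to derive the variant from a hash of that ID. The function name, the experiment label, and the 10% share are illustrative.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "ft-vs-base",
                   tuned_share: float = 0.1) -> str:
    """Deterministically map a user to 'fine_tuned' or 'base'."""
    # Including the experiment name gives each experiment its own split
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "fine_tuned" if bucket < tuned_share else "base"

# The same user always lands in the same variant
print(assign_variant("user-42") == assign_variant("user-42"))  # True
```

The returned variant can then select the model name in place of the random draw, and it should be logged alongside the response so results can be grouped per variant.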

Common Mistakes

  1. Not trying prompt engineering first — Fine-tuning is expensive. Exhaust prompt engineering and RAG before considering it
  2. Too few examples — Under 50 examples rarely produces meaningful improvements. Aim for 100-500
  3. Low-quality training data — The model learns your mistakes. Every incorrect example degrades performance
  4. Overfitting — Training too long causes the model to memorize examples instead of generalizing. Use validation loss to detect this
  5. Not evaluating properly — Run evals on held-out test data. Don't just look at training loss
  6. Mixing unrelated tasks — A model fine-tuned on both legal docs and cooking recipes will perform worse on both than two separate models
  7. Ignoring inference cost — Fine-tuned models often cost the same or more to run. Factor this into your decision
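
For the overfitting point in particular, the usual remedy is to watch validation loss and keep the checkpoint from just before it starts rising. The sketch below shows patience-based early stopping over a list of per-epoch validation losses; the function name, the loss values, and the patience of 2 are made up for illustration. Training frameworks often ship an equivalent (for example, `EarlyStoppingCallback` in Hugging Face Transformers).

```python
def best_checkpoint(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once `patience` epochs pass without the validation
    loss improving on the best value by more than `min_delta`.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Validation loss improves, then climbs as the model starts memorizing
val_losses = [1.20, 0.95, 0.90, 0.93, 0.97, 1.05]
print(best_checkpoint(val_losses))  # (2, 4): keep epoch 2, stop at epoch 4
```

Training loss would typically keep falling across all six epochs in a run like this, which is why it cannot be used on its own to decide when to stop.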

Conclusion

Fine-tuning is a powerful tool when used correctly — but it's not a substitute for good data, good prompts, or good RAG. Use it when you need to teach the model a specific behavior, format, or reasoning pattern that can't be expressed in a prompt. Start with SFT for most use cases, use DPO when you have preference data, and reserve RFT for complex reasoning tasks. For open-source models, LoRA/QLoRA makes fine-tuning accessible on consumer hardware.

The flywheel approach works best: write evals, try prompt engineering, collect failure cases, fine-tune on those cases, evaluate, repeat. Each iteration improves both your prompts and your training data.
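
One iteration of that loop can be sketched as a small eval harness that returns both a pass rate and the failing cases, which then become candidates for new training examples. Everything here is a stand-in: `generate` represents whatever calls your model, and each case supplies its own `check` function.

```python
def run_evals(generate, cases):
    """Run each case through `generate` and collect the failures."""
    failures = []
    for case in cases:
        output = generate(case["prompt"])
        if not case["check"](output):
            failures.append({"prompt": case["prompt"], "output": output})
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures

# Stub model for demonstration: always answers in upper case
def fake_generate(prompt):
    return prompt.upper()

cases = [
    {"prompt": "hello", "check": lambda out: out == "HELLO"},
    {"prompt": "reply in lower case", "check": lambda out: out.islower()},
]

pass_rate, failures = run_evals(fake_generate, cases)
print(pass_rate)                         # 0.5
print([f["prompt"] for f in failures])   # ['reply in lower case']
```

Running the same cases before and after each fine-tuning round shows whether the change helped, and keeping the case list separate from the training data guards against grading the model on examples it has already seen.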