AI Fine-Tuning & Model Customization Guide 2026
When to fine-tune vs use RAG. SFT, DPO, RFT methods explained. Open-source fine-tuning with LoRA/QLoRA and production deployment patterns.
Fine-tuning a large language model used to be the domain of AI researchers with access to GPU clusters. In 2026, it's a production technique available to any developer with a few hundred examples and a cloud account. But fine-tuning is also overused — many teams jump to it before exhausting simpler approaches like prompt engineering and RAG. This guide covers when fine-tuning actually makes sense, the different methods available, how to do it efficiently, and how to deploy custom models in production.
Fine-Tuning vs RAG: Which Should You Use?
This is the most common question. The answer depends on what you're trying to achieve:
| Use Case | Best Approach | Why |
|---|---|---|
| Need domain knowledge (legal, medical) | RAG | Knowledge changes; RAG updates instantly |
| Need specific output format/style | Fine-tuning | Teaches the model how to respond, not what to know |
| Need to reduce prompt length | Fine-tuning | Examples baked into weights, not context |
| Need to handle novel input patterns | Fine-tuning | Generalizes beyond what's in context |
| Need real-time knowledge updates | RAG | No retraining needed |
| Need to teach reasoning patterns | Fine-tuning (RFT) | Reinforces chain-of-thought |
Rule of thumb: RAG is for knowledge. Fine-tuning is for behavior. If you need the model to know new facts, use RAG. If you need the model to respond in a specific way, format, or style, use fine-tuning.
Fine-Tuning Methods Explained
There are four main approaches to customizing LLMs in 2026:
1. Supervised Fine-Tuning (SFT)
The classic approach: provide input/output examples and train the model to mimic them.
# Training data format (JSONL)
{"messages": [
{"role": "system", "content": "You are a medical assistant."},
{"role": "user", "content": "Patient has fever, cough, fatigue. Diagnosis?"},
{"role": "assistant", "content": "Based on symptoms, possible influenza..."}
]}
{"messages": [
{"role": "system", "content": "You are a medical assistant."},
{"role": "user", "content": "Rash on arms, itchy, after hiking."},
{"role": "assistant", "content": "Likely contact dermatitis or poison ivy..."}
]}
Best for: Classification, format compliance, translation, correcting instruction-following failures
Data needed: 50-500 high-quality examples
Cost: $0.008-0.080 per 1K tokens trained (OpenAI)
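Before uploading a dataset like the one above, it can help to check every record mechanically. The following is a minimal sketch of such a check, assuming the text-only chat layout shown above. The function name `validate_sft_record` and the specific rules (known roles, non-empty string content, a final assistant turn) are illustrative choices, not requirements of any particular platform.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_sft_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant's target reply")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
bad = '{"messages": [{"role": "user", "content": ""}]}'
print(validate_sft_record(good))  # no problems
print(validate_sft_record(bad))   # empty content, and no assistant reply
```

Running a check like this over the whole file before training catches malformed lines early, when they are cheap to fix.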
2. Direct Preference Optimization (DPO)
Instead of showing the model the "right" answer, show it a pair of answers and tell it which is better.
# DPO training data format (illustrative; exact field names vary by platform and library)
{
"messages": [
{"role": "user", "content": "Summarize this article about climate change."}
],
"chosen": [
{"role": "assistant", "content": "Climate change refers to... [concise, factual summary]"}
],
"rejected": [
{"role": "assistant", "content": "Well, climate change is a very complex topic that many people have opinions about... [verbose, unfocused]"}
]
}
Best for: Summarization quality, chat tone/style, ranking tasks
Data needed: 100-1,000 preference pairs
Advantage: Easier to collect preferences than perfect ground-truth outputs
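To see what "tell it which is better" means numerically, here is a small sketch of the standard DPO objective for a single preference pair. It assumes you already have the summed log-probabilities of each answer under the model being trained and under a frozen reference copy. The function name `dpo_loss` and the default `beta=0.1` are assumptions for illustration, and hosted platforms may implement the details differently.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed token log-probabilities."""
    # How much more the policy likes each answer than the reference does
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: loss is log(2)
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931
# Policy has shifted toward the chosen answer: loss drops
print(dpo_loss(-8.0, -12.0, -10.0, -10.0) < dpo_loss(-10.0, -10.0, -10.0, -10.0))  # True
```

The loss falls as the model raises the probability of the chosen answer relative to the rejected one, measured against the reference model rather than in absolute terms.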
3. Reinforcement Fine-Tuning (RFT)
Train a reasoning model to think better by grading its chain-of-thought and reinforcing high-scoring reasoning paths.
# RFT requires:
# 1. A prompt
# 2. A grader function that scores the model's reasoning
# 3. Multiple reasoning attempts per prompt

# Example grader for math problems.
# solve() is a placeholder for your own reference solver; it is not defined here.
def grade_math_solution(problem, reasoning, answer):
    """Score 0-100 based on correctness and reasoning quality."""
    correct_answer = solve(problem)
    try:
        is_correct = abs(float(answer) - correct_answer) < 0.01
    except (TypeError, ValueError):
        is_correct = False  # non-numeric answers count as wrong
    base_score = 80 if is_correct else 0
    # Bonus for clear step-by-step reasoning (a crude keyword check)
    if "step" in reasoning.lower() or "first" in reasoning.lower():
        base_score += 20
    return min(100, base_score)
Best for: Complex reasoning tasks, medical diagnosis, legal analysis, math/science problems
Data needed: 100-1,000 prompts with expert graders
Only available on: Reasoning models (o4-mini)
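One way to picture "reinforcing high-scoring reasoning paths" is to compare each attempt against the other attempts for the same prompt. The sketch below normalizes grader scores within a group so that above-average attempts get a positive weight and below-average ones a negative weight. This is a generic policy-gradient-style illustration, not a description of how any hosted RFT service works internally, and `relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev

def relative_advantages(scores, eps=1e-8):
    """Center and scale grader scores within one prompt's group of attempts."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps avoids division by zero when every attempt scored the same
    return [(s - mu) / (sigma + eps) for s in scores]

scores = [100, 80, 20, 0]  # grader output for four attempts at one prompt
advantages = relative_advantages(scores)
for s, a in zip(scores, advantages):
    print(f"score={s:3d} advantage={a:+.2f}")
```

If every attempt receives the same score, all advantages are zero and the prompt contributes no learning signal, which is one reason graders that can tell attempts apart matter.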
4. Vision Fine-Tuning
Train the model to better understand specific types of images.
# Vision fine-tuning data
{
"messages": [
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://example.com/xray.jpg"}},
{"type": "text", "text": "What abnormality do you see?"}
]},
{"role": "assistant", "content": "There is a fracture visible in the distal radius..."}
]
}
Best for: Medical imaging, defect detection, document classification
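When the images live on disk rather than at a public URL, one common option is to embed them as base64 data URLs. The helper below is a sketch that builds one training record in the same layout as above. The name `build_vision_example` is hypothetical, and whether your platform accepts data URLs in training files is an assumption to confirm in its documentation.

```python
import base64
import json

def build_vision_example(image_bytes: bytes, mime_type: str, question: str, answer: str) -> str:
    """Return one JSONL line with the image embedded as a base64 data URL."""
    if not mime_type.startswith("image/"):
        raise ValueError(f"expected an image MIME type, got {mime_type!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    record = {
        "messages": [
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
                {"type": "text", "text": question},
            ]},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record)

# Placeholder bytes stand in for the contents of a real image file.
line = build_vision_example(
    b"\xff\xd8\xff", "image/jpeg",
    "What abnormality do you see?", "No fracture is visible.",
)
print(line[:60])
```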
Open-Source Fine-Tuning with LoRA
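Full fine-tuning updates all model parameters and requires massive GPU resources. LoRA (Low-Rank Adaptation) freezes the base weights and trains only a small set of low-rank adapter matrices, which cuts memory use enough to make fine-tuning feasible on consumer hardware. QLoRA goes one step further: it loads the frozen base model in 4-bit precision and trains LoRA adapters on top of it.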
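To make the idea of adapter weights concrete, here is a toy sketch in plain Python. A frozen weight matrix W is combined with two small trainable matrices, B and A, so that the layer effectively applies W + (alpha / r) * B @ A. The tiny dimensions and the helper names are assumptions chosen for readability; real implementations operate on GPU tensors.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight a LoRA layer effectively applies."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d_out, d_in, r, alpha = 6, 8, 2, 4
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero init

# With B initialized to zero, the adapted layer starts out identical to the base layer.
print(lora_effective_weight(w, a, b, alpha, r) == w)  # True

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"full: {full_params} params, LoRA: {lora_params} params")  # full: 48 params, LoRA: 28 params
```

At this toy size the saving is modest, but the adapter grows with d_in + d_out while the full matrix grows with d_in * d_out, so the gap widens quickly at real model dimensions.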
LoRA/QLoRA Setup
# Install dependencies
# pip install transformers peft accelerate bitsandbytes datasets trl
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# 1. Load base model (4-bit quantized for QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,  # QLoRA: quantize to 4-bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

# 2. Prepare for training
model = prepare_model_for_kbit_training(model)

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank (higher = more capacity, more params)
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# print_trainable_parameters() prints its own report and returns None,
# so call it directly instead of wrapping it in print().
model.print_trainable_parameters()
# For this config, expect roughly 13.6M trainable parameters,
# well under 1% of the model.

# 4. Load dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

# 5. Tokenize
# Assumes each record has a "text" field. For "messages"-style records,
# render them to text first (for example with the tokenizer's chat
# template, if it defines one).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized = dataset.map(tokenize_function, batched=True)

# 6. Train
training_args = TrainingArguments(
    output_dir="./lora-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
)
trainer.train()

# 7. Save adapter (only the LoRA weights, not the base model)
model.save_pretrained("./lora-adapter")
Key LoRA parameters:
| Parameter | Description | Typical Values |
|---|---|---|
| r (rank) | Size of low-rank matrices | 8, 16, 32, 64 |
| lora_alpha | Scaling factor | 2*r (e.g., 32 for r=16) |
| target_modules | Which layers to adapt | q_proj, v_proj, k_proj, o_proj |
| lora_dropout | Regularization | 0.05-0.1 |
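The effect of the rank on adapter size is simple arithmetic: each adapted matrix of shape d_in by d_out adds r * (d_in + d_out) trainable parameters. The sketch below works this out for the four attention projections targeted above. The layer shapes are stated as assumptions about Llama-3.1-8B (hidden size 4096, 1024-wide key/value projections from grouped-query attention, 32 layers), and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, layer_shapes, num_layers):
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers

# Assumed (d_in, d_out) shapes for the attention projections of Llama-3.1-8B.
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

for r in (8, 16, 32, 64):
    n = lora_param_count(r, ATTENTION_SHAPES, num_layers=32)
    print(f"r={r:2d}: {n:,} trainable parameters")
# r=16 gives 13,631,488, a small fraction of the roughly 8 billion base parameters
```

The count scales linearly with r, so doubling the rank doubles both the adapter's capacity and its size on disk.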
Loading a LoRA Adapter
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
# Attach the trained adapter to the base model
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Merge the adapter into the base weights for faster inference
model = model.merge_and_unload()
# Or skip the merge and keep the adapter separate for multi-tenant serving,
# loading a different adapter for each customer:
# model = PeftModel.from_pretrained(base_model, "./lora-adapter")
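For the multi-tenant case, a server usually cannot keep every customer's adapter in memory at once. One possible pattern is a small least-recently-used cache in front of whatever function actually loads an adapter. The sketch below uses only the standard library; `AdapterCache` and `fake_load` are hypothetical names, and the loader stands in for a real call that attaches adapter weights to the base model.

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most `capacity` adapters loaded, evicting the least recently used."""

    def __init__(self, load_fn, capacity=4):
        self._load_fn = load_fn  # hypothetical: customer_id -> loaded adapter
        self._capacity = capacity
        self._cache = OrderedDict()

    def get(self, customer_id):
        if customer_id in self._cache:
            self._cache.move_to_end(customer_id)  # mark as most recently used
            return self._cache[customer_id]
        adapter = self._load_fn(customer_id)
        self._cache[customer_id] = adapter
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used
        return adapter

loads = []

def fake_load(customer_id):
    """Stand-in for real adapter loading; records each load for inspection."""
    loads.append(customer_id)
    return f"adapter-for-{customer_id}"

cache = AdapterCache(fake_load, capacity=2)
for cid in ["acme", "globex", "acme", "initech", "globex"]:
    cache.get(cid)
print(loads)  # ['acme', 'globex', 'initech', 'globex']
```

The second request for "acme" is served from the cache, while "globex" has to be reloaded because it was evicted to make room for "initech".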
Platform Comparison
| Platform | Methods | Cost | Best For |
|---|---|---|---|
| OpenAI | SFT, DPO, RFT, Vision | $0.008-0.080/1K tokens | Quick iteration, managed infrastructure |
| Google Vertex AI | SFT, RLHF | Pay per hour | Gemini-based customization |
| AWS Bedrock | Continued pre-training, fine-tuning | Pay per hour | Enterprise, AWS ecosystem |
| Hugging Face | Any (open-source) | Free (self-host) or cloud | Maximum flexibility, open models |
| Replicate | LoRA training | Per-minute GPU | Quick LoRA experiments |
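For per-token pricing, a rough budget is easy to estimate ahead of time. The sketch below assumes that billed tokens equal the dataset's tokens multiplied by the number of epochs, and it uses a hypothetical dataset of 500 examples averaging 400 tokens each. The two prices are the ends of the range quoted in the table; check current pricing before relying on the result.

```python
def training_cost_usd(num_examples, avg_tokens_per_example, epochs, price_per_1k_tokens):
    """Estimate training cost, assuming billing on tokens seen across all epochs."""
    trained_tokens = num_examples * avg_tokens_per_example * epochs
    return trained_tokens / 1000 * price_per_1k_tokens

for price in (0.008, 0.080):
    cost = training_cost_usd(500, 400, 3, price)
    print(f"${price:.3f} per 1K tokens -> ${cost:.2f}")
# $0.008 per 1K tokens -> $4.80
# $0.080 per 1K tokens -> $48.00
```

Hourly-billed platforms need a different estimate (GPU hours times the hourly rate), so the two pricing models are not directly comparable without a test run.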
Data Preparation Best Practices
The quality of your training data matters more than the quantity. A few hundred excellent examples beat thousands of mediocre ones.
Data Quality Checklist
- Consistency: All examples follow the same format and style
- Diversity: Cover edge cases, not just happy paths
- Accuracy: Every output is correct — the model will learn your mistakes
- Completeness: Include system prompts, user context, expected behavior
- Balance: Don't over-represent any single scenario
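Parts of this checklist can be automated. The sketch below counts exact duplicate records and flags any category that makes up more than half of the dataset. It assumes each record carries a `category` tag that you add yourself for auditing and strip before training. The function name `audit_dataset` and the 50% threshold are illustrative choices.

```python
import json
from collections import Counter

def audit_dataset(records, category_fn, max_share=0.5):
    """Report exact duplicates and categories that dominate the dataset."""
    if not records:
        raise ValueError("dataset is empty")
    # Serialize with sorted keys so key order does not hide duplicates
    canonical = [json.dumps(r, sort_keys=True) for r in records]
    duplicates = len(canonical) - len(set(canonical))
    shares = Counter(category_fn(r) for r in records)
    over = sorted(c for c, n in shares.items() if n / len(records) > max_share)
    return {"records": len(records), "duplicates": duplicates, "over_represented": over}

records = [
    {"category": "billing", "messages": [{"role": "user", "content": "Why was I charged twice?"}]},
    {"category": "billing", "messages": [{"role": "user", "content": "Why was I charged twice?"}]},
    {"category": "billing", "messages": [{"role": "user", "content": "Update my card."}]},
    {"category": "refund", "messages": [{"role": "user", "content": "I want my money back."}]},
]
report = audit_dataset(records, category_fn=lambda r: r["category"])
print(report)  # {'records': 4, 'duplicates': 1, 'over_represented': ['billing']}
```

Accuracy and consistency still need human review; a script like this only catches the mechanical problems.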
Data Format for Different Methods
# SFT format (conversational)
{"messages": [
{"role": "system", "content": "You are a legal assistant."},
{"role": "user", "content": "What is the statute of limitations for contract disputes in California?"},
{"role": "assistant", "content": "In California, the statute of limitations for written contracts is 4 years (CCP 337), and for oral contracts is 2 years (CCP 339)."}
]}
# DPO format (preference pairs; the prompt/chosen/rejected layout used by libraries such as Hugging Face TRL)
{
"prompt": "Summarize this article",
"chosen": "Concise, factual summary...",
"rejected": "Verbose, opinionated summary..."
}
# Completion format (for base models)
{"prompt": "Translate English to French: 'Hello world'", "completion": "'Bonjour le monde'"}
Production Deployment
Serving Fine-Tuned Models
# Option 1: Use the platform's API (OpenAI, etc.)
# Fine-tuned models get their own model ID
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="ft:gpt-5:your-org:custom-model:abc123",  # Your fine-tuned model
    messages=[{"role": "user", "content": "Your prompt"}],
)

# Option 2: Self-host with vLLM (for open-source models)
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model across two GPUs
llm = LLM(model="./merged-lora-model", tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)
outputs = llm.generate(["Your prompt here"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
A/B Testing Fine-Tuned vs Base Model
import random

# Assumes `client` is an initialized OpenAI client and log_experiment()
# is your own logging helper; neither is defined here.
def route_request(user_prompt, base_model="gpt-5", tuned_model="ft:gpt-5:..."):
    """Route 10% of traffic to fine-tuned model for comparison."""
    if random.random() < 0.1:
        model = tuned_model
        variant = "fine_tuned"
    else:
        model = base_model
        variant = "base"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    )
    # Log for comparison
    log_experiment(user_prompt, response, variant)
    return response
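Purely random routing means the same user can bounce between variants from one request to the next, which makes per-user comparisons noisy. A common alternative is to derive the variant from a hash of a stable user ID. The sketch below is one way to do that. The function name, the experiment label, and the assumption that a stable `user_id` is available are all illustrative.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "ft-vs-base", tuned_share: float = 0.1) -> str:
    """Deterministically map a user to 'fine_tuned' or 'base'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "fine_tuned" if bucket < tuned_share else "base"

# The same user always lands in the same variant
print(assign_variant("user-123") == assign_variant("user-123"))  # True

# Across many users, the split approaches the requested share
users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u) == "fine_tuned" for u in users) / len(users)
print(f"fine-tuned share: {share:.3f}")
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user in the treatment group of one test is not automatically in the treatment group of the next.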
Common Mistakes
- Not trying prompt engineering first: Fine-tuning is expensive. Exhaust prompt engineering and RAG before considering it.
- Too few examples: Fewer than 50 examples rarely produce meaningful improvements. Aim for 100-500.
- Low-quality training data: The model learns your mistakes. Every incorrect example degrades performance.
- Overfitting: Training for too long causes the model to memorize examples instead of generalizing. Watch validation loss to detect this.
- Not evaluating properly: Run evals on held-out test data. Don't just look at training loss.
- Mixing unrelated tasks: A model fine-tuned on both legal docs and cooking recipes will often perform worse on each than two separate models would.
- Ignoring inference cost: Fine-tuned models often cost the same or more to run than the base model. Factor this into your decision.
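The overfitting item above can be made concrete with patience-based early stopping: track validation loss after each epoch and stop once it has failed to improve for a set number of epochs. The following is a minimal sketch, assuming you record one validation loss per epoch. The function name `best_checkpoint` and the default patience of 2 are illustrative.

```python
def best_checkpoint(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-indexed, using early stopping."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_epoch, best_loss = epoch, loss  # new best: remember this checkpoint
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1

# Validation loss improves for three epochs, then climbs as the model overfits
val_losses = [1.20, 0.95, 0.88, 0.90, 0.97, 1.05]
print(best_checkpoint(val_losses))  # (2, 4)
```

Here training would stop at epoch 4 and the checkpoint from epoch 2 would be the one to deploy, even though training loss would likely still be falling at that point.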
Conclusion
Fine-tuning is a powerful tool when used correctly — but it's not a substitute for good data, good prompts, or good RAG. Use it when you need to teach the model a specific behavior, format, or reasoning pattern that can't be expressed in a prompt. Start with SFT for most use cases, use DPO when you have preference data, and reserve RFT for complex reasoning tasks. For open-source models, LoRA/QLoRA makes fine-tuning accessible on consumer hardware.
The flywheel approach works best: write evals, try prompt engineering, collect failure cases, fine-tune on those cases, evaluate, repeat. Each iteration improves both your prompts and your training data.
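One turn of that loop can be sketched in a few lines: run a set of eval cases, measure the pass rate, and keep the failures as candidates for the next round of prompt fixes or training data. Everything here is hypothetical scaffolding; `run_evals`, `stub_model`, and the case format are assumptions, and the stub stands in for a call to a real model.

```python
def run_evals(model_fn, cases):
    """Run each case through model_fn; return the pass rate and the failing cases."""
    if not cases:
        raise ValueError("no eval cases provided")
    failures = []
    for case in cases:
        output = model_fn(case["prompt"])
        if not case["check"](output):
            failures.append({"prompt": case["prompt"], "bad_output": output})
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures

def stub_model(prompt):
    """Stand-in for a real model call; only knows one answer."""
    return "42" if "6 x 7" in prompt else "I don't know"

cases = [
    {"prompt": "What is 6 x 7?", "check": lambda out: "42" in out},
    {"prompt": "What is 9 x 9?", "check": lambda out: "81" in out},
]
pass_rate, failures = run_evals(stub_model, cases)
print(f"pass rate: {pass_rate:.0%}")           # pass rate: 50%
print([f["prompt"] for f in failures])         # ['What is 9 x 9?']
```

Failures collected this way still need a corrected target output from a human or a trusted source before they become training examples.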