AI Data Preprocessing & Chunking Guide 2026
Optimize documents for LLMs. Text splitting strategies, chunk size optimization, overlap techniques, and production patterns for RAG pipelines.
The quality of a RAG system often depends more on how you chunk documents than on which embedding model you use. A 1000-page PDF dumped into a vector database as 100-character chunks will produce incoherent, out-of-context retrievals. A well-chunked document with semantic boundaries, appropriate overlap, and metadata enrichment gives the retriever a much better chance of returning precise, contextual answers. This guide covers the main chunking strategies in use in 2026, with code and decision frameworks.
Why Chunking Matters
LLMs have context limits. Even with 200K+ token windows, you can't feed an entire document corpus into every query. Chunking breaks documents into retrievable units, but bad chunking creates problems:
- Too small: Chunks lack context. "The president signed the bill" — which president? Which bill?
- Too large: Chunks exceed context limits or dilute relevance with irrelevant content
- Wrong boundaries: Splitting mid-sentence or mid-paragraph destroys meaning
- No overlap: Context at chunk boundaries is lost
- No metadata: Can't filter by source, date, or document type
Chunking Strategies Compared
| Strategy | How It Works | Best For | Trade-off |
|---|---|---|---|
| Fixed size | Every N characters/tokens | Simple docs, speed | May split sentences |
| Recursive | Split by hierarchy (paragraphs → sentences → words) | Most text | Slower than fixed size, better boundaries |
| Semantic | Split when embedding similarity drops | Complex documents | Topic-aware boundaries, needs an embedding per sentence |
| Agentic | LLM decides chunk boundaries | High-value content | Expensive, best quality |
| Markdown-aware | Respects headers, lists, code blocks | Technical docs | Format-dependent |
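The agentic row can be sketched without committing to any particular LLM API. In this minimal sketch, `choose_breaks` is a hypothetical callable that wraps whatever model you use: it receives the list of paragraphs and returns the indices of paragraphs that should start a new chunk. The helper only validates that output and groups the paragraphs; `fake_llm` is a stand-in for illustration.

```python
from typing import Callable, List


def agentic_chunks(paragraphs: List[str],
                   choose_breaks: Callable[[List[str]], List[int]]) -> List[str]:
    """Group paragraphs into chunks at the break points a model proposes.

    choose_breaks is hypothetical: it returns the indices of the
    paragraphs that should START a new chunk.
    """
    # Sanitize the model output: keep valid, unique, sorted indices only
    proposed = choose_breaks(paragraphs)
    breaks = sorted({i for i in proposed if 0 < i < len(paragraphs)})
    chunks, start = [], 0
    for b in breaks + [len(paragraphs)]:
        if b > start:
            chunks.append("\n\n".join(paragraphs[start:b]))
        start = b
    return chunks


def fake_llm(paragraphs):
    # Stand-in for a real LLM call; pretends the topic changes at paragraph 2
    # and also returns an out-of-range and a duplicate index
    return [2, 99, 2]


paras = ["Intro to billing.", "Billing details.", "Now about auth.", "Auth details."]
result = agentic_chunks(paras, fake_llm)
```

Validating the indices matters because model output can contain duplicates or out-of-range values; the grouping step itself stays deterministic and cheap to test.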
Fixed-Size Chunking
The simplest approach. Fast but naive:
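def fixed_size_chunks(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        # Clamp so the recorded end offset never points past the text
        end = min(start + chunk_size, len(text))
        chunks.append({
            "text": text[start:end],
            "start": start,
            "end": end,
            "index": len(chunks)
        })
        if end == len(text):
            break  # Avoid a trailing chunk made only of overlap
        # Move forward by chunk_size - overlap
        start += chunk_size - overlap
    return chunks

# Usage
with open("article.txt", encoding="utf-8") as f:
    document = f.read()
chunks = fixed_size_chunks(document, chunk_size=1500, overlap=300)
print(f"Created {len(chunks)} chunks")

# Store in vector DB (vector_db stands in for your vector store client)
for chunk in chunks:
    vector_db.add(
        text=chunk["text"],
        metadata={
            "source": "article.txt",
            "chunk_index": chunk["index"],
            "start_char": chunk["start"]
        }
    )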
Pros: Simple, fast, predictable chunk sizes
Cons: Ignores semantic boundaries, may cut sentences and even words in half
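One way to soften the mid-word problem while keeping the approach simple is to pull each chunk's end back to the nearest whitespace. The sketch below is an illustrative variant, not part of the function above; `word_boundary_chunks` is a made-up name. It only snaps chunk ends, so with a non-zero overlap a chunk can still start in the middle of a word.

```python
def word_boundary_chunks(text, chunk_size=1000, overlap=200):
    """Fixed-size chunks whose end is pulled back to the nearest whitespace."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Last space or newline inside the current window (-1 if none)
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            # Snap only if the window still advances past the overlap
            if cut > start + overlap:
                end = cut
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = end - overlap
    return chunks


sample = "alpha beta gamma delta epsilon"
pieces = word_boundary_chunks(sample, chunk_size=12, overlap=0)
```

Chunk sizes become slightly uneven, which is usually an acceptable price for never cutting a word at the end of a chunk.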
Recursive Chunking
Split by natural boundaries first, only falling back to smaller units when necessary:
import re

def recursive_chunks(text, chunk_size=1000, overlap=200):
    """Group paragraphs into chunks of roughly chunk_size characters."""
    # Level 1: Split by paragraphs
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_size = 0
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # If adding this paragraph exceeds chunk_size, finalize current chunk
        if current_size + len(paragraph) > chunk_size and current_chunk:
            chunks.append('\n\n'.join(current_chunk))
            # Carry trailing paragraphs forward while they fit the overlap budget
            carried = []
            for previous in reversed(current_chunk):
                if sum(len(p) for p in carried) + len(previous) > overlap:
                    break
                carried.insert(0, previous)
            current_chunk = carried
            current_size = sum(len(p) for p in current_chunk)
        current_chunk.append(paragraph)
        current_size += len(paragraph)
    # Don't forget the last chunk
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks
# Better version with sentence fallback
def recursive_chunks_v2(text, chunk_size=1000, overlap=200):
    """Split by paragraphs, then sentences if paragraphs are too long."""
    def split_sentences(paragraph):
        # Split after ., ! or ? so every sentence keeps its punctuation
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]

    # Build (separator, unit) pairs: whole paragraphs when they fit,
    # individual sentences when a paragraph alone exceeds chunk_size
    units = []
    for paragraph in text.split('\n\n'):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) > chunk_size:
            for n, sentence in enumerate(split_sentences(paragraph)):
                units.append(('\n\n' if n == 0 else ' ', sentence))
        else:
            units.append(('\n\n', paragraph))

    def render(chunk_units):
        # Join units with their own separators, dropping the leading one
        return ''.join(sep + unit for sep, unit in chunk_units).strip()

    chunks = []
    current = []
    current_size = 0
    for sep, unit in units:
        if current_size + len(unit) > chunk_size and current:
            chunks.append(render(current))
            # Carry trailing units forward while they fit the overlap budget
            carried = []
            for item in reversed(current):
                if sum(len(u) for _, u in carried) + len(item[1]) > overlap:
                    break
                carried.insert(0, item)
            current = carried
            current_size = sum(len(u) for _, u in current)
        current.append((sep, unit))
        current_size += len(unit)
    if current:
        chunks.append(render(current))
    return chunks
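The "hierarchy" idea can also be written as an actual recursion over a list of separators, from coarsest to finest. This is a minimal sketch under assumptions: `split_recursive` and its default separator tuple are illustrative choices, it has no overlap, and separators are dropped wherever two pieces end up in different chunks (so a sentence split on ". " can lose its final period at a chunk boundary).

```python
def split_recursive(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    merged = []
    for part in text.split(sep):
        # Parts that are still too long are split again with finer separators
        sub = split_recursive(part, chunk_size, finer)
        for n, piece in enumerate(sub):
            fits = bool(merged) and len(merged[-1]) + len(sep) + len(piece) <= chunk_size
            # Only the first piece of a part was adjacent to the previous
            # part in the original text, so only it may be merged back
            if n == 0 and fits:
                merged[-1] += sep + piece
            else:
                merged.append(piece)
    return merged


demo = "aaaa bbbb\n\ncccc dddd eeee"
demo_chunks = split_recursive(demo, chunk_size=10)
```

Because merging happens at every level, short neighbours are glued back together and chunks stay close to the size limit instead of fragmenting into single words.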
Semantic Chunking
Split when the meaning changes, not at arbitrary character counts:
import re
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_chunks(text, client, max_chunk_size=1500, similarity_threshold=0.85):
    """Split text where semantic similarity between sentences drops.

    The 0.85 default is only a starting point; tune it for your embedding model.
    """
    # Split into sentences
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if len(sentences) <= 1:
        # Same dict shape as the multi-sentence case
        return [{"text": text, "sentences": len(sentences)}]
    # Get embeddings for each sentence
    embeddings = []
    for sentence in sentences:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=sentence[:8000]
        )
        embeddings.append(response.data[0].embedding)

    def make_chunk(chunk_sentences, chunk_embeddings):
        # Every chunk carries the same keys, including the final one
        centroid = np.mean(chunk_embeddings, axis=0)
        return {
            "text": ' '.join(chunk_sentences),
            "sentences": len(chunk_sentences),
            "avg_similarity": float(np.mean([
                cosine_similarity(e, centroid) for e in chunk_embeddings
            ]))
        }

    chunks = []
    current_chunk = [sentences[0]]
    current_embeddings = [embeddings[0]]
    for i in range(1, len(sentences)):
        # Compare current sentence to average of current chunk
        chunk_avg = np.mean(current_embeddings, axis=0)
        similarity = cosine_similarity(embeddings[i], chunk_avg)
        # Split if similarity drops or adding the sentence would exceed the limit
        projected_size = len(' '.join(current_chunk)) + 1 + len(sentences[i])
        if similarity < similarity_threshold or projected_size > max_chunk_size:
            chunks.append(make_chunk(current_chunk, current_embeddings))
            current_chunk = [sentences[i]]
            current_embeddings = [embeddings[i]]
        else:
            current_chunk.append(sentences[i])
            current_embeddings.append(embeddings[i])
    # Final chunk
    if current_chunk:
        chunks.append(make_chunk(current_chunk, current_embeddings))
    return chunks
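The right similarity threshold depends on the embedding model and the corpus, so it helps to separate the grouping logic from the API calls and try thresholds offline on embeddings you have already computed. The sketch below assumes that setup; `split_on_similarity_drop` is an illustrative name and the two-dimensional vectors are hand-made toy values, not real embeddings.

```python
import numpy as np


def split_on_similarity_drop(sentences, embeddings, threshold=0.85):
    """Group sentences; start a new group when similarity to the running mean drops."""
    if not sentences:
        return []
    groups = [[0]]  # each group is a list of sentence indices
    for i in range(1, len(sentences)):
        centroid = np.mean([embeddings[j] for j in groups[-1]], axis=0)
        vector = np.asarray(embeddings[i], dtype=float)
        similarity = float(
            np.dot(vector, centroid)
            / (np.linalg.norm(vector) * np.linalg.norm(centroid))
        )
        if similarity < threshold:
            groups.append([i])
        else:
            groups[-1].append(i)
    return [" ".join(sentences[j] for j in group) for group in groups]


toy_sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy values for illustration
toy_groups = split_on_similarity_drop(toy_sentences, toy_vectors, threshold=0.5)
```

Running this over cached embeddings with several thresholds shows how many chunks each setting produces before you pay for a full re-index.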
Markdown-Aware Chunking
Preserve document structure for technical documentation:
import re

def markdown_chunks(text, chunk_size=1500):
    """Split markdown while preserving headers and structure."""
    # Split by headers; the capturing group keeps each header line in the result.
    # Note: '#' lines inside fenced code blocks are also treated as headers.
    header_pattern = r'^(#{1,6}\s.+)$'
    sections = re.split(f'(?m){header_pattern}', text)
    chunks = []
    current_header = ""
    current_content = []
    current_size = 0

    def flush():
        """Emit the accumulated paragraphs under the current header."""
        nonlocal current_content, current_size
        if current_content:
            chunks.append({
                "header": current_header,
                "text": '\n\n'.join(current_content),
                # Count only the leading '#' characters, not ones in the title
                "level": len(current_header) - len(current_header.lstrip('#'))
            })
        current_content = []
        current_size = 0

    for section in sections:
        if not section.strip():
            continue
        # Check if this is a header
        if re.match(header_pattern, section.strip()):
            # Save previous section
            flush()
            current_header = section.strip()
        else:
            # Split long sections by paragraphs
            for paragraph in section.split('\n\n'):
                paragraph = paragraph.strip('\n')
                if not paragraph.strip():
                    continue
                if current_size + len(paragraph) > chunk_size:
                    flush()
                current_content.append(paragraph)
                current_size += len(paragraph)
    # Final section
    flush()
    return chunks

# Usage with metadata (vector_db stands in for your vector store client)
for chunk in markdown_chunks(document):
    vector_db.add(
        text=chunk["text"],
        metadata={
            "header": chunk["header"],
            "header_level": chunk["level"],
            "type": "documentation"
        }
    )
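The header regex above matches any line that starts with `#`, which includes comment lines inside fenced code blocks. A hedged sketch of one way to avoid that is to scan line by line and track whether the scanner is currently inside a fence. `find_headers` is an illustrative helper, not part of the function above, and its fence handling is a simple toggle that does not cover nested or mismatched fences.

```python
import re

HEADER = re.compile(r"^#{1,6}\s+\S")
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")


def find_headers(markdown_text):
    """Return (line_number, header_line) pairs, skipping fenced code blocks."""
    headers = []
    in_fence = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence  # entering or leaving a code block
            continue
        if not in_fence and HEADER.match(line):
            headers.append((number, line.strip()))
    return headers


fence = "`" * 3  # built at runtime so this example contains no literal fence
sample_md = "\n".join([
    "# Title",
    "Some text.",
    fence + "python",
    "# a comment, not a header",
    fence,
    "## Section",
])
found = find_headers(sample_md)
```

The returned line numbers can then be used as split points, so code blocks stay intact inside their section.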
Chunk Size Optimization
There's no universal best chunk size. It depends on your use case:
| Use Case | Chunk Size | Overlap | Reason |
|---|---|---|---|
| FAQ / Q&A | 200-500 | 0-50 | Each chunk is self-contained |
| Technical docs | 500-1000 | 100-200 | Need surrounding context |
| Legal documents | 1000-2000 | 200-400 | Complex cross-references |
| Code repositories | Function-level | 0 | Functions are natural units |
| Books / long-form | 1500-3000 | 300-500 | Narrative flow matters |
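For the "Function-level" row, Python source can be split with the standard-library `ast` module. This is a minimal sketch that assumes Python 3.8+ (for `end_lineno`) and syntactically valid source; `python_function_chunks` is an illustrative name. It covers top-level functions and classes only, and decorator lines above a `def` are not included in the chunk text.

```python
import ast


def python_function_chunks(source):
    """Return one chunk per top-level function or class in Python source."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({
                "name": node.name,
                "text": text,
                "start_line": node.lineno,
            })
    return chunks


example_source = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
code_chunks = python_function_chunks(example_source)
```

The `name` and `start_line` fields make natural metadata for code search. Other languages need their own parser.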
Evaluating Chunk Quality
def evaluate_chunks(chunks, test_queries, client):
    """Evaluate chunking strategy using retrieval quality."""
    if not chunks or not test_queries:
        raise ValueError("chunks and test_queries must both be non-empty")

    def embed(text):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text[:8000]
        )
        return response.data[0].embedding

    # Embed every chunk once, not once per query
    chunk_embs = [embed(chunk["text"]) for chunk in chunks]
    scores = []
    for query in test_queries:
        # Embed query
        query_emb = embed(query)
        # Find best matching chunk
        similarities = [cosine_similarity(query_emb, e) for e in chunk_embs]
        best_index = max(range(len(chunks)), key=similarities.__getitem__)
        best_chunk = chunks[best_index]
        # Check if answer is actually in the chunk
        # (check_answer_in_chunk is not defined in this guide; supply your own)
        has_answer = check_answer_in_chunk(query, best_chunk["text"])
        scores.append({
            "query": query,
            "retrieval_score": float(similarities[best_index]),
            "has_answer": has_answer,
            "chunk_length": len(best_chunk["text"])
        })
    # Summary metrics
    avg_score = sum(s["retrieval_score"] for s in scores) / len(scores)
    answer_rate = sum(1 for s in scores if s["has_answer"]) / len(scores)
    return {
        "avg_retrieval_score": avg_score,
        "answer_containment_rate": answer_rate,
        "details": scores
    }

# Run evaluation (chunks: list of dicts with a "text" key; test_queries: list of strings)
results = evaluate_chunks(chunks, test_queries, client)
print(f"Answer containment: {results['answer_containment_rate']:.1%}")
print(f"Avg retrieval score: {results['avg_retrieval_score']:.3f}")
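The evaluation calls `check_answer_in_chunk`, which this guide does not define. One simple possibility, sketched here as an assumption rather than the intended implementation, is a lookup of hand-written gold answers followed by a normalized substring check. The `EXPECTED_ANSWERS` table and its single entry are hypothetical.

```python
# Hypothetical gold answers: one expected answer string per test query
EXPECTED_ANSWERS = {
    "What is the default chunk size?": "1000 characters",
}


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())


def check_answer_in_chunk(query, chunk_text, expected=EXPECTED_ANSWERS):
    """True if the expected answer for this query appears verbatim in the chunk."""
    answer = expected.get(query)
    if answer is None:
        return False  # no gold answer recorded for this query
    return normalize(answer) in normalize(chunk_text)


hit = check_answer_in_chunk(
    "What is the default chunk size?",
    "The default is 1000\n  characters per chunk.",
)
```

A substring check is strict: paraphrased answers count as misses, so treat the resulting containment rate as a lower bound.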
Metadata Enrichment
Chunks without metadata are hard to filter and rank:
def enrich_chunks(chunks, source_doc):
    """Add metadata to chunks (dicts with a "text" key) for better retrieval."""
    enriched = []
    for i, chunk in enumerate(chunks):
        # Basic metadata
        metadata = {
            "source": source_doc["filename"],
            "chunk_index": i,
            "total_chunks": len(chunks),
            "char_count": len(chunk["text"]),
            "word_count": len(chunk["text"].split()),
        }
        # Extract keywords (simple TF-based); strip punctuation first so that
        # "chunks," and "chunks" count as the same word
        words = [w.strip('.,;:!?()[]"\'') for w in chunk["text"].lower().split()]
        word_freq = {}
        for word in words:
            if len(word) > 4 and word.isalpha():
                word_freq[word] = word_freq.get(word, 0) + 1
        top_keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:5]
        metadata["keywords"] = [k for k, _ in top_keywords]
        # Document-level metadata
        if "title" in source_doc:
            metadata["document_title"] = source_doc["title"]
        if "date" in source_doc:
            metadata["document_date"] = source_doc["date"]
        if "category" in source_doc:
            metadata["category"] = source_doc["category"]
        # Position metadata (beginning/middle/end)
        position = i / len(chunks)
        if position < 0.2:
            metadata["position"] = "beginning"
        elif position > 0.8:
            metadata["position"] = "end"
        else:
            metadata["position"] = "middle"
        enriched.append({
            "text": chunk["text"],
            "metadata": metadata
        })
    return enriched
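Once chunks carry metadata in this shape, filtering becomes a plain equality check. The sketch below does it in memory to show the idea; `filter_chunks` is an illustrative helper, and most vector stores offer their own server-side filter syntax that you would normally use instead.

```python
def filter_chunks(enriched, **conditions):
    """Keep chunks whose metadata matches every key=value condition."""
    return [
        chunk for chunk in enriched
        if all(chunk["metadata"].get(key) == value
               for key, value in conditions.items())
    ]


sample_chunks = [
    {"text": "Intro", "metadata": {"category": "technical", "position": "beginning"}},
    {"text": "Body", "metadata": {"category": "technical", "position": "middle"}},
    {"text": "Terms", "metadata": {"category": "legal", "position": "beginning"}},
]
technical_intro = filter_chunks(sample_chunks, category="technical", position="beginning")
```

Here `technical_intro` contains only the first chunk, because it is the only one that satisfies both conditions.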
Production Pipeline
import re
import unicodedata

class DocumentProcessor:
    """End-to-end document processing for RAG."""
    def __init__(self, client, chunk_size=1000, overlap=200):
        self.client = client
        self.chunk_size = chunk_size
        self.overlap = overlap

    def process(self, document):
        """Process a document through the full pipeline."""
        # 1. Clean text
        cleaned = self._clean(document["text"])
        # 2. Chunk
        if document.get("format") == "markdown":
            chunks = markdown_chunks(cleaned, self.chunk_size)
        else:
            # recursive_chunks_v2 returns plain strings; wrap them in dicts
            # because enrich_chunks reads chunk["text"]
            pieces = recursive_chunks_v2(cleaned, self.chunk_size, self.overlap)
            chunks = [{"text": piece} for piece in pieces]
        # 3. Enrich with metadata
        enriched = enrich_chunks(chunks, document)
        # 4. Generate embeddings
        for chunk in enriched:
            response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=chunk["text"][:8000]
            )
            chunk["embedding"] = response.data[0].embedding
        return enriched

    def _clean(self, text):
        """Clean and normalize text without destroying line structure."""
        # Normalize unicode and line endings
        text = unicodedata.normalize('NFKC', text)
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        # Remove control characters (tab and newline are kept)
        text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
        # Collapse runs of spaces/tabs inside a line; leading indentation and
        # line breaks stay, because the chunkers rely on '\n\n' and on headers
        # starting a line
        text = re.sub(r'(?<=\S)[ \t]+', ' ', text)
        # Squeeze three or more newlines down to one blank line
        text = re.sub(r'\n{3,}', '\n\n', text)
        return text.strip()

# Usage
processor = DocumentProcessor(client, chunk_size=1500, overlap=300)
with open("api-docs.md", encoding="utf-8") as f:
    text = f.read()
document = {
    "filename": "api-docs.md",
    "title": "API Documentation",
    "text": text,
    "format": "markdown",
    "category": "technical"
}
chunks = processor.process(document)
print(f"Processed into {len(chunks)} chunks")

# Store in vector DB (vector_db stands in for your vector store client)
for chunk in chunks:
    vector_db.add(
        document=chunk["text"],
        embedding=chunk["embedding"],
        metadata=chunk["metadata"]
    )
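The pipeline above sends one embedding request per chunk, which gets slow for large documents. A hedged sketch of batching follows. `embed_batch` is a hypothetical callable that takes a list of strings and returns one vector per string in the same order; with an embeddings endpoint that accepts a list as `input`, it could wrap a single `client.embeddings.create` call, but check your provider's limits on batch size before relying on that.

```python
def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed texts in fixed-size batches, preserving input order.

    embed_batch is hypothetical: list of strings in, list of vectors out.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        vectors.extend(result)
    return vectors


calls = []


def fake_embed_batch(batch):
    # Stand-in for a real API call: records the batch size and
    # returns a one-dimensional "vector" holding each text's length
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
```

Five texts with a batch size of two result in three calls instead of five; the saving grows with larger batches.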
Common Pitfalls
- One-size-fits-all chunking: Different document types need different strategies. Don't chunk code the same way as legal text.
- No overlap: Context at chunk boundaries is lost. For prose, an overlap of 10-20% of the chunk size is a reasonable default; self-contained units such as FAQ entries or functions need little or none.
- Ignoring document structure: Headers, sections, and lists carry semantic meaning. Preserve them.
- Not cleaning input: Control characters, excessive whitespace, and encoding issues degrade embeddings.
- Missing metadata: Without source attribution, users can't verify answers. Always include document metadata.
- Not evaluating: Measure retrieval quality with test queries. Bad chunking won't be obvious until you test.
Conclusion
Chunking is the foundation of effective RAG. Start with recursive chunking for most text, use semantic chunking when retrieval quality is critical, and always add metadata. The best chunk size is the one that contains complete answers to your users' questions — test with real queries to find it.
Remember: chunking is not a one-time decision. As your document corpus grows and user queries evolve, revisit your chunking strategy. The goal is not perfect chunks, but chunks that enable accurate retrieval.