AI Data Preprocessing & Chunking Guide 2026 - Optimize Documents for LLMs

The quality of a retrieval-augmented generation (RAG) system often depends more on how you chunk documents than on which embedding model you use. A 1000-page PDF dumped into a vector database as 100-character chunks will produce incoherent, out-of-context retrievals. A well-chunked document with semantic boundaries, appropriate overlap, and metadata enrichment will produce precise, contextual answers. This guide covers the main chunking strategies in use in 2026, with code and decision frameworks.

Why Chunking Matters

LLMs have context limits. Even with 200K+ token windows, you can't feed an entire document corpus into every query. Chunking breaks documents into retrievable units, but bad chunking creates problems:

Too small: Chunks lack context. "The president signed the bill" — which president? Which bill?
Too large: Chunks exceed context limits or bury the relevant passage in unrelated content.
Wrong boundaries: Splitting mid-sentence or mid-paragraph destroys meaning.
No overlap: Context at chunk boundaries is lost.
No metadata: You can't filter by source, date, or document type.

Chunking Strategies Compared

Strategy	How It Works	Best For	Trade-off
Fixed size	Every N characters/tokens	Simple docs, speed	May split sentences
Recursive	Split by hierarchy (paragraphs → sentences → words)	Most text	Slower than fixed size, better boundaries
Semantic	Split when embedding similarity drops	Complex documents	Accurate topic boundaries; slow (one embedding per sentence)
Agentic	LLM decides chunk boundaries	High-value content	Most expensive; potentially the best quality
Markdown-aware	Respects headers, lists, code blocks	Technical docs	Format-dependent
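
The agentic row is the least self-explanatory of the five, so here is a minimal sketch of the idea. Everything in it is an assumption made for illustration: ask_llm is a hypothetical callable that takes a prompt string and returns the model's reply as text, the prompt wording is invented, and the reply is assumed to be a JSON list of paragraph indices. A real implementation would need retries and stricter validation.

```python
import json


def agentic_chunks(paragraphs, ask_llm, max_paragraphs_per_chunk=8):
    """Group paragraphs into chunks at boundaries proposed by an LLM.

    ask_llm is a hypothetical callable (prompt: str) -> str. Its reply is
    expected to be a JSON list of paragraph indices where a new chunk starts.
    """
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs))
    prompt = (
        "Below are numbered paragraphs. Reply with a JSON list of the "
        "indices where a new topic starts.\n\n" + numbered
    )

    try:
        proposed = json.loads(ask_llm(prompt))
    except (ValueError, TypeError):
        proposed = []  # unparseable reply: fall back to size-based grouping

    # Do not trust model output blindly: keep only in-range integer indices
    starts = set()
    if isinstance(proposed, list):
        for i in proposed:
            if isinstance(i, int) and not isinstance(i, bool) and 0 < i < len(paragraphs):
                starts.add(i)

    chunks, current = [], []
    for i, paragraph in enumerate(paragraphs):
        # Close the current chunk at a proposed boundary or at the size cap
        if current and (i in starts or len(current) >= max_paragraphs_per_chunk):
            chunks.append("\n\n".join(current))
            current = []
        current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Passing the model call in as a function keeps the sketch independent of any particular LLM client and makes it easy to test with a stub.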

Fixed-Size Chunking

The simplest approach. Fast but naive:

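def fixed_size_chunks(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    
    chunks = []
    start = 0
    
    while start < len(text):
        # Clamp so the recorded end offset never points past the text
        end = min(start + chunk_size, len(text))
        
        # Record the chunk together with its character offsets
        chunks.append({
            "text": text[start:end],
            "start": start,
            "end": end,
            "index": len(chunks)
        })
        
        # Stop once the end of the text is covered; otherwise the loop
        # would emit a trailing chunk that contains only overlap
        if end == len(text):
            break
        
        # Move forward by chunk_size - overlap
        start += chunk_size - overlap
    
    return chunks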

# Usage
with open("article.txt", encoding="utf-8") as f:
    document = f.read()
chunks = fixed_size_chunks(document, chunk_size=1500, overlap=300)
print(f"Created {len(chunks)} chunks")

# Store in vector DB (vector_db stands in for your vector store client)
for chunk in chunks:
    vector_db.add(
        text=chunk["text"],
        metadata={
            "source": "article.txt",
            "chunk_index": chunk["index"],
            "start_char": chunk["start"]
        }
    )

Pros: Simple, fast, predictable chunk sizes
Cons: Ignores semantic boundaries; may cut sentences, and even words, in the middle
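
One way to soften the mid-word problem while keeping the fixed-size idea is to pack whole words up to the size limit. The sketch below is an illustration, not part of the original approach: word_safe_chunks is a hypothetical name, words are simply runs of non-whitespace characters, and a single word longer than chunk_size is kept whole, so a chunk can exceed the limit in that case.

```python
import re


def word_safe_chunks(text, chunk_size=1000, overlap=200):
    """Fixed-size chunking that only starts and ends on word boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # (start, end) character offsets of every whitespace-delimited word
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    chunks = []
    i = 0
    while i < len(spans):
        start = spans[i][0]
        j = i
        # Take as many whole words as fit; always take at least one
        while j + 1 < len(spans) and spans[j + 1][1] - start <= chunk_size:
            j += 1
        end = spans[j][1]
        chunks.append({"text": text[start:end], "start": start, "end": end})
        if j == len(spans) - 1:
            break
        # Step back over whole words while they fit in the overlap budget,
        # but always advance by at least one word
        k = j + 1
        while k - 1 > i and end - spans[k - 1][0] <= overlap:
            k -= 1
        i = k
    return chunks
```

Because offsets are kept, each chunk can still be traced back to its position in the source text.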

Recursive Chunking

Split by natural boundaries first, only falling back to smaller units when necessary:

import re

def recursive_chunks(text, chunk_size=1000, overlap=200):
    """Group paragraphs into chunks, carrying trailing paragraphs over as overlap."""
    
    # Level 1: Split by paragraphs
    paragraphs = text.split('\n\n')
    
    chunks = []
    current_chunk = []
    current_size = 0
    
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        
        # If adding this paragraph exceeds chunk_size, finalize current chunk
        if current_size + len(paragraph) > chunk_size and current_chunk:
            chunks.append('\n\n'.join(current_chunk))
            
            # Carry trailing paragraphs into the next chunk as overlap,
            # as long as they fit within the overlap budget
            carried = []
            for prev in reversed(current_chunk):
                if sum(len(p) for p in carried) + len(prev) > overlap:
                    break
                carried.insert(0, prev)
            current_chunk = carried
            current_size = sum(len(p) for p in current_chunk)
        
        current_chunk.append(paragraph)
        current_size += len(paragraph)
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    
    return chunks

# Better version with sentence fallback
def recursive_chunks_v2(text, chunk_size=1000, overlap=200):
    """Split by paragraphs, then by sentences if a paragraph is too long.
    
    A single sentence longer than chunk_size is kept whole.
    """
    
    # Build a flat list of (unit, separator) pairs: whole paragraphs where
    # they fit, sentences where a paragraph exceeds chunk_size
    units = []
    for paragraph in text.split('\n\n'):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) > chunk_size:
            # Split after ., ? or ! so sentences keep their punctuation
            sentences = [s for s in re.split(r'(?<=[.!?])\s+', paragraph) if s]
            for i, sentence in enumerate(sentences):
                units.append((sentence, ' ' if i else '\n\n'))
        else:
            units.append((paragraph, '\n\n'))
    
    def join(parts):
        # Each unit stores the separator that precedes it inside a chunk
        return ''.join(unit if i == 0 else sep + unit
                       for i, (unit, sep) in enumerate(parts))
    
    chunks = []
    current = []
    current_size = 0
    
    for unit, sep in units:
        if current_size + len(unit) > chunk_size and current:
            chunks.append(join(current))
            
            # Carry trailing units forward while they fit in the overlap budget
            carried = []
            for prev in reversed(current):
                if sum(len(u) for u, _ in carried) + len(prev[0]) > overlap:
                    break
                carried.insert(0, prev)
            current = carried
            current_size = sum(len(u) for u, _ in current)
        
        current.append((unit, sep))
        current_size += len(unit)
    
    if current:
        chunks.append(join(current))
    
    return chunks
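
The functions above stop at two levels (paragraphs, then sentences). The full hierarchy described in the comparison table can be written as an actual recursion over a list of separators. The following is a sketch under stated assumptions: split_recursive and SEPARATORS are hypothetical names, the separator list is only one reasonable choice, overlap is left out for brevity, and text with no separators at all is cut hard as a last resort.

```python
import re

# Coarsest to finest; an assumption, adjust to your documents
SEPARATORS = ("\n\n", "\n", ". ", " ")


def split_recursive(text, chunk_size=1000, separators=SEPARATORS):
    """Split on the coarsest separator first and recurse only into
    pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    # Zero-width split after each separator, so every piece keeps its
    # trailing separator and no characters are dropped
    parts = re.split(f"(?<={re.escape(sep)})", text)

    pieces = []
    for part in parts:
        pieces.extend(split_recursive(part, chunk_size, finer))

    # Merge neighbouring pieces back together while they still fit
    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Chunks keep their trailing whitespace; call strip() on them before embedding if that matters for your store.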

Semantic Chunking

Split when the meaning changes, not at arbitrary character counts:

import re
import numpy as np

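def semantic_chunks(text, client, max_chunk_size=1500, similarity_threshold=0.85):
    """Split text where semantic similarity between sentences drops.
    
    Typical similarity values vary by embedding model, so treat
    similarity_threshold as a parameter to tune on your own data.
    """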
    
    # Split into sentences
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    
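    if len(sentences) <= 1:
        # Return the same dict shape as the multi-sentence path below
        return [{"text": text, "sentences": len(sentences)}]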
    
    # Get embeddings for each sentence
    embeddings = []
    for sentence in sentences:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=sentence[:8000]
        )
        embeddings.append(response.data[0].embedding)
    
    chunks = []
    current_chunk = [sentences[0]]
    current_embeddings = [embeddings[0]]
    
    for i in range(1, len(sentences)):
        # Compare current sentence to average of current chunk
        chunk_avg = np.mean(current_embeddings, axis=0)
        similarity = cosine_similarity(embeddings[i], chunk_avg)
        
        # Split if similarity drops or adding this sentence would exceed the size limit
        chunk_text = ' '.join(current_chunk)
        too_large = len(chunk_text) + 1 + len(sentences[i]) > max_chunk_size
        if similarity < similarity_threshold or too_large:
            chunks.append({
                "text": chunk_text,
                "sentences": len(current_chunk),
                "avg_similarity": float(np.mean([
                    cosine_similarity(e, chunk_avg) 
                    for e in current_embeddings
                ]))
            })
            current_chunk = [sentences[i]]
            current_embeddings = [embeddings[i]]
        else:
            current_chunk.append(sentences[i])
            current_embeddings.append(embeddings[i])
    
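    # Final chunk (same keys as the chunks emitted in the loop)
    if current_chunk:
        chunk_avg = np.mean(current_embeddings, axis=0)
        chunks.append({
            "text": ' '.join(current_chunk),
            "sentences": len(current_chunk),
            "avg_similarity": float(np.mean([
                cosine_similarity(e, chunk_avg)
                for e in current_embeddings
            ]))
        })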
    
    return chunks

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
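
A fixed threshold such as 0.85 is sensitive to the embedding model, because different models produce different ranges of similarity values. One alternative, sketched here as an assumption and not as the method used above, is a relative cutoff: compute the similarity between each pair of adjacent sentences and break at the lowest-scoring pairs. The name percentile_breakpoints and the default of 25 are illustrative, and the sketch uses plain Python so it runs without NumPy.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def percentile_breakpoints(embeddings, percentile=25):
    """Return the sentence indices at which a new chunk should start.

    A break is placed before sentence i+1 when the similarity between
    sentence i and sentence i+1 is at or below the given percentile of
    all adjacent-pair similarities (nearest-rank method).
    """
    sims = [
        cosine(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not sims:
        return []
    ranked = sorted(sims)
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    cutoff = ranked[rank - 1]
    return [i + 1 for i, s in enumerate(sims) if s <= cutoff]
```

The returned indices can be used to slice the sentence list into chunks. The size limit from semantic_chunks would still need to be enforced separately.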

Markdown-Aware Chunking

Preserve document structure for technical documentation:

import re

def markdown_chunks(text, chunk_size=1500):
    """Split markdown while preserving headers and structure.
    
    Limitation: lines starting with '#' inside fenced code blocks are
    also treated as headers by this simple pattern.
    """
    
    # Split by headers; the capture group keeps each header in the result
    header_pattern = r'^(#{1,6}\s.+)$'
    sections = re.split(f'(?m){header_pattern}', text)
    
    chunks = []
    current_header = ""
    current_content = []
    current_size = 0
    
    def flush():
        """Emit the accumulated paragraphs under the current header."""
        if current_content:
            chunks.append({
                "header": current_header,
                "text": '\n\n'.join(current_content),
                # Count only the leading '#' characters, not ones in the title
                "level": len(current_header) - len(current_header.lstrip('#'))
            })
    
    for section in sections:
        if not section.strip():
            continue
        
        # Check if this is a header
        if re.match(header_pattern, section.strip()):
            # Save previous section
            flush()
            current_header = section.strip()
            current_content = []
            current_size = 0
        else:
            # Split long sections by paragraphs
            for paragraph in section.split('\n\n'):
                # Drop surrounding blank lines but keep indentation
                paragraph = paragraph.strip('\n')
                if not paragraph.strip():
                    continue
                
                if current_size + len(paragraph) > chunk_size and current_content:
                    flush()
                    current_content = []
                    current_size = 0
                
                current_content.append(paragraph)
                current_size += len(paragraph)
    
    # Final section
    flush()
    
    return chunks

# Usage with metadata
for chunk in markdown_chunks(document):
    vector_db.add(
        text=chunk["text"],
        metadata={
            "header": chunk["header"],
            "header_level": chunk["level"],
            "type": "documentation"
        }
    )
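
The header pattern in markdown_chunks matches any line that starts with '#', which includes comment lines inside fenced code blocks. A possible way to handle this, shown as a sketch only, is to scan line by line and ignore everything between fence markers. find_headers is a hypothetical helper, and it covers only backtick and tilde fences, not indented code blocks.

```python
import re

FENCE = "`" * 3  # a backtick fence marker, built this way to keep the example readable
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+)$")
FENCE_RE = re.compile(r"^\s*(" + re.escape(FENCE) + r"|~~~)")


def find_headers(markdown_text):
    """Return (line_number, level, title) for headers outside fenced code."""
    headers = []
    open_fence = None  # marker that opened the current code block, if any
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        fence = FENCE_RE.match(line)
        if fence:
            marker = fence.group(1)
            if open_fence is None:
                open_fence = marker      # entering a code block
            elif marker == open_fence:
                open_fence = None        # leaving the code block
            continue
        if open_fence is None:
            match = HEADER_RE.match(line)
            if match:
                level = len(match.group(1))
                headers.append((number, level, match.group(2).strip()))
    return headers
```

The line numbers it returns can serve as split points in place of the regular-expression split.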

Chunk Size Optimization

There's no universal best chunk size. It depends on your use case:

Use Case	Chunk Size (characters)	Overlap (characters)	Reason
FAQ / Q&A	200-500	0-50	Each chunk is self-contained
Technical docs	500-1000	100-200	Need surrounding context
Legal documents	1000-2000	200-400	Complex cross-references
Code repositories	Function-level	0	Functions are natural units
Books / long-form	1500-3000	300-500	Narrative flow matters
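
For the code-repository row, function-level chunking can be sketched for Python sources with the standard-library ast module. This is an illustration under assumptions: python_function_chunks is a hypothetical name, only top-level functions and classes become chunks, module-level statements such as imports are skipped, and decorator lines are not part of the extracted text.

```python
import ast


def python_function_chunks(source):
    """Return one chunk per top-level function or class in Python source."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of the definition (Python 3.8+)
                "text": ast.get_source_segment(source, node),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return chunks
```

The name and line range are natural metadata fields for later filtering.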

Evaluating Chunk Quality

def evaluate_chunks(chunks, test_queries, client):
    """Evaluate chunking strategy using retrieval quality."""
    
    if not chunks or not test_queries:
        raise ValueError("need at least one chunk and one test query")
    
    def embed(text):
        return client.embeddings.create(
            model="text-embedding-3-small",
            input=text[:8000]
        ).data[0].embedding
    
    # Embed each chunk once up front instead of once per query
    chunk_embs = [embed(chunk["text"]) for chunk in chunks]
    
    scores = []
    
    for query in test_queries:
        # Embed query
        query_emb = embed(query)
        
        # Find best matching chunk (cosine similarity can be negative,
        # so pick the maximum instead of comparing against 0)
        similarities = [cosine_similarity(query_emb, emb) for emb in chunk_embs]
        best_index = max(range(len(chunks)), key=similarities.__getitem__)
        best_chunk = chunks[best_index]
        
        # Check if answer is actually in the chunk.
        # check_answer_in_chunk is not defined in this guide; supply your own
        # check, e.g. a match against a known answer or an LLM judge
        has_answer = check_answer_in_chunk(query, best_chunk["text"])
        
        scores.append({
            "query": query,
            "retrieval_score": float(similarities[best_index]),
            "has_answer": has_answer,
            "chunk_length": len(best_chunk["text"])
        })
    
    # Summary metrics
    avg_score = sum(s["retrieval_score"] for s in scores) / len(scores)
    answer_rate = sum(1 for s in scores if s["has_answer"]) / len(scores)
    
    return {
        "avg_retrieval_score": avg_score,
        "answer_containment_rate": answer_rate,
        "details": scores
    }

# Run evaluation (test_queries: a list of representative question strings)
results = evaluate_chunks(chunks, test_queries, client)
print(f"Answer containment: {results['answer_containment_rate']:.1%}")
print(f"Avg retrieval score: {results['avg_retrieval_score']:.3f}")
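
evaluate_chunks calls check_answer_in_chunk, which is not defined anywhere in this guide. As a placeholder only, here is a crude lexical version that reports True when most of the query's content words occur in the chunk. The stopword list and the min_overlap default are assumptions, and word overlap is a weak proxy: comparing against a labelled gold answer, or using an LLM judge, would be more reliable.

```python
import re

# Small illustrative stopword list; an assumption, not a standard resource
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "what", "which", "who",
    "how", "why", "when", "where", "do", "does", "did", "of", "to", "in",
    "on", "for", "and", "or", "it", "this", "that", "with", "by", "can",
}


def _terms(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_answer_in_chunk(query, chunk_text, min_overlap=0.6):
    """Crude proxy: do most of the query's content words occur in the chunk?"""
    query_terms = _terms(query) - STOPWORDS
    if not query_terms:
        return False
    hits = query_terms & _terms(chunk_text)
    return len(hits) / len(query_terms) >= min_overlap
```

With a labelled test set, the same signature could instead look up the expected answer for the query and test whether that string appears in the chunk.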

Metadata Enrichment

Chunks without metadata are hard to filter and rank:

def enrich_chunks(chunks, source_doc):
    """Add metadata to chunks for better retrieval."""
    
    enriched = []
    
    for i, chunk in enumerate(chunks):
        # Basic metadata
        metadata = {
            "source": source_doc["filename"],
            "chunk_index": i,
            "total_chunks": len(chunks),
            "char_count": len(chunk["text"]),
            "word_count": len(chunk["text"].split()),
        }
        
        # Extract keywords (simple TF-based)
        words = chunk["text"].lower().split()
        word_freq = {}
        for word in words:
            # Drop surrounding punctuation so "chunking," still counts
            word = word.strip(".,;:!?\"'()[]")
            if len(word) > 4 and word.isalpha():
                word_freq[word] = word_freq.get(word, 0) + 1
        
        top_keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:5]
        metadata["keywords"] = [k for k, _ in top_keywords]
        
        # Document-level metadata
        if "title" in source_doc:
            metadata["document_title"] = source_doc["title"]
        if "date" in source_doc:
            metadata["document_date"] = source_doc["date"]
        if "category" in source_doc:
            metadata["category"] = source_doc["category"]
        
        # Position metadata (beginning/middle/end)
        position = i / len(chunks)
        if position < 0.2:
            metadata["position"] = "beginning"
        elif position > 0.8:
            metadata["position"] = "end"
        else:
            metadata["position"] = "middle"
        
        enriched.append({
            "text": chunk["text"],
            "metadata": metadata
        })
    
    return enriched
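
Once chunks carry metadata, candidates can be narrowed before any similarity ranking. The helper below is a sketch that assumes the {"text", "metadata"} shape produced by enrich_chunks. filter_chunks is a hypothetical name, and most vector databases offer their own server-side metadata filters that would replace it in production.

```python
def filter_chunks(enriched, **criteria):
    """Keep chunks whose metadata matches every criterion.

    A criterion value may be a single value (equality) or a list, set or
    tuple (membership). List-valued metadata such as "keywords" matches
    when it contains at least one wanted value.
    """
    def matches(metadata, key, wanted):
        if key not in metadata:
            return False
        actual = metadata[key]
        if isinstance(wanted, (list, set, tuple)):
            wanted_values = list(wanted)
        else:
            wanted_values = [wanted]
        if isinstance(actual, list):
            return any(value in actual for value in wanted_values)
        return actual in wanted_values

    return [
        chunk for chunk in enriched
        if all(matches(chunk["metadata"], k, v) for k, v in criteria.items())
    ]
```

For example, filter_chunks(enriched, category="technical", position=["beginning", "middle"]) keeps only technical chunks from the first 80% of a document.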

Production Pipeline

class DocumentProcessor:
    """End-to-end document processing for RAG."""
    
    def __init__(self, client, chunk_size=1000, overlap=200):
        self.client = client
        self.chunk_size = chunk_size
        self.overlap = overlap
    
    def process(self, document):
        """Process a document through the full pipeline."""
        
        # 1. Clean text
        cleaned = self._clean(document["text"])
        
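        # 2. Chunk
        if document.get("format") == "markdown":
            chunks = markdown_chunks(cleaned, self.chunk_size)
        else:
            # recursive_chunks_v2 returns plain strings; wrap them so
            # enrich_chunks can read chunk["text"]
            chunks = [
                {"text": c}
                for c in recursive_chunks_v2(cleaned, self.chunk_size, self.overlap)
            ]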
        
        # 3. Enrich with metadata
        enriched = enrich_chunks(chunks, document)
        
        # 4. Generate embeddings
        for chunk in enriched:
            response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=chunk["text"][:8000]
            )
            chunk["embedding"] = response.data[0].embedding
        
        return enriched
    
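    def _clean(self, text):
        """Clean and normalize text while keeping line and paragraph breaks."""
        import unicodedata
        
        # Normalize unicode first (NFKC can itself produce plain spaces)
        text = unicodedata.normalize('NFKC', text)
        
        # Remove control characters (tab, newline and carriage return are kept)
        text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
        
        # Collapse spaces and tabs but keep newlines: the chunkers rely on
        # '\n\n' paragraph breaks and on headers at the start of a line
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        text = re.sub(r'[ \t]+', ' ', text)
        text = re.sub(r' +\n', '\n', text)
        text = re.sub(r'\n{3,}', '\n\n', text)
        
        return text.strip()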

# Usage
processor = DocumentProcessor(client, chunk_size=1500, overlap=300)

document = {
    "filename": "api-docs.md",
    "title": "API Documentation",
    "text": open("api-docs.md").read(),
    "format": "markdown",
    "category": "technical"
}

chunks = processor.process(document)
print(f"Processed into {len(chunks)} chunks")

# Store in vector DB (same placeholder client as in the earlier examples)
for chunk in chunks:
    vector_db.add(
        text=chunk["text"],
        embedding=chunk["embedding"],
        metadata=chunk["metadata"]
    )
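
The pipeline embeds one chunk per request, which becomes slow for large documents. The OpenAI embeddings endpoint also accepts a list of inputs, so requests can be batched. The helper below is a sketch: embed_in_batches and the batch size of 100 are assumptions, and client is expected to be an OpenAI-style client as in the rest of this guide. Check your provider's limits on inputs per request before relying on a particular batch size.

```python
def embed_in_batches(client, texts, model="text-embedding-3-small", batch_size=100):
    """Embed texts in batches; returns one vector per input, in input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        # Same character truncation as the single-request code in this guide
        batch = [t[:8000] for t in texts[start:start + batch_size]]
        response = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of its input; sort to be safe
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

Inside DocumentProcessor.process, step 4 could call this once with [chunk["text"] for chunk in enriched] and assign the returned vectors back to the chunks in order.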

Common Pitfalls

One-size-fits-all chunking: Different document types need different strategies. Don't chunk code the same way as legal text.
No overlap: Context at chunk boundaries is lost. For prose, 10-20% overlap is a reasonable default; self-contained units such as FAQ entries or functions may need none.
Ignoring document structure: Headers, sections, and lists carry semantic meaning. Preserve them.
Not cleaning input: Control characters, excessive whitespace, and encoding issues degrade embeddings.
Missing metadata: Without source attribution, users can't verify answers. Always include document metadata.
Not evaluating: Measure retrieval quality with test queries. Bad chunking won't be obvious until you test.

Conclusion

Chunking is the foundation of effective RAG. Start with recursive chunking for most text, use semantic chunking when retrieval quality is critical, and always add metadata. The best chunk size is the one that contains complete answers to your users' questions — test with real queries to find it.

Remember: chunking is not a one-time decision. As your document corpus grows and user queries evolve, revisit your chunking strategy. The goal is not perfect chunks, but chunks that enable accurate retrieval.