How to Implement Semantic Search with Vector Embeddings: Step-by-Step Guide 2026
Last Updated: 2026-05-14
Semantic search changes how applications retrieve information: instead of matching exact terms, as keyword search does, it matches on meaning and context. This guide shows how to implement semantic search using vector embeddings in 2026, with working code, model comparisons, and production tips.
What Is Semantic Search?
Semantic search uses vector embeddings to find content based on meaning, not just keywords. When you search for "how to reduce AI costs," a keyword search looks for those exact words. Semantic search understands you're asking about cost optimization for AI and returns relevant results even if they don't contain the exact phrase.
The core idea: convert text into high-dimensional vectors (embeddings) where similar meanings are close together in vector space. Search becomes a nearest-neighbor lookup.
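To make the nearest-neighbor idea concrete, here is a minimal sketch that ranks a few hand-made vectors by cosine similarity. The three-dimensional vectors and the names `toy_vectors` and `nearest` are invented for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings" (illustrative values only).
toy_vectors = {
    "reduce AI costs": [0.9, 0.1, 0.0],
    "cut spending on AI APIs": [0.8, 0.2, 0.1],
    "bake sourdough bread": [0.0, 0.1, 0.9],
}

def nearest(query_vec, vectors, top_k=2):
    """Rank stored vectors by cosine similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), name) for name, v in vectors.items()]
    return sorted(scored, reverse=True)[:top_k]

ranked = nearest([1.0, 0.0, 0.0], toy_vectors)
for score, name in ranked:
    print(f"{score:.3f} | {name}")
```

The two cost-related phrases rank above the unrelated one because their vectors point in nearly the same direction as the query. A vector database performs the same comparison, usually with approximate indexes so it does not have to scan every stored vector.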
When to Use Semantic Search
- Knowledge bases: Find relevant documentation across thousands of articles
- Customer support: Match user questions to FAQ answers
- E-commerce: Return relevant products beyond exact SKU matches
- Research: Discover related papers, articles, or code snippets
- RAG (Retrieval-Augmented Generation): Retrieve context for LLM-powered apps
Step 1: Choose an Embedding Model
Embedding models convert text into vectors. Your choice affects accuracy, cost, and latency. Here's a comparison of 2026's leading options:
| Model | Dimensions | Performance (MTEB) | Price (per 1M tokens) | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (adjustable) | 62.3% | $0.02 | Quick start, English-centric |
| OpenAI text-embedding-3-large | 3072 (adjustable) | 64.6% | $0.13 | High accuracy, multilingual |
| Cohere embed-english-v3.0 | 1024 | 64.1% | $0.10 | English semantic search |
| Cohere embed-multilingual-v3.0 | 1024 | 63.2% | $0.10 | Non-English content |
| all-MiniLM-L6-v2 (HuggingFace) | 384 | 56.4% | Free (self-hosted) | Budget projects, privacy |
| voyage-2 (Voyage AI) | 1024 | 65.1% | $0.06 | RAG applications |
Recommendation for 2026: Start with text-embedding-3-small for speed and cost-efficiency. Upgrade to text-embedding-3-large if you need maximum accuracy or multilingual support.
Step 2: Set Up a Vector Database
Vector databases store embeddings and enable fast similarity search. Here are the top options in 2026:
| Database | Type | Scaling | Free Tier | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Serverless | 1 index, 5M vectors | Production, zero ops |
| Weaviate Cloud | Managed | Serverless | 1 cluster, 10M vectors | Hybrid search (keyword + vector) |
| Qdrant Cloud | Managed | Scalable | 1GB storage | Filtering + vector search |
| Chroma | Self-hosted | In-memory | Open-source | Local dev, small datasets |
| pgvector (PostgreSQL) | Extension | Depends on PG | Open-source | Existing Postgres stack |
Step 3: Build the Search Pipeline
Here's the complete pipeline:
- Embed: Convert your documents into vectors using an embedding model
- Store: Insert vectors into a vector database with metadata
- Query: Embed the user's search query
- Retrieve: Find nearest neighbors in the vector database
- Return: Display matched documents to the user
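Before wiring up real services, the five steps above can be sketched with no external dependencies. Everything here is a stand-in: `toy_embed` is a word-count vector (it matches shared words, not meaning) and `InMemoryIndex` is a hypothetical brute-force store. An embedding model and a vector database take their places in a real pipeline.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Stand-in for an embedding model: a word-count vector over vocab."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemoryIndex:
    """Hypothetical brute-force store; a vector database replaces this."""

    def __init__(self):
        self.records = []  # list of (id, vector, metadata)

    def upsert(self, doc_id, vector, metadata):
        # Replace any existing record with the same id, then append.
        self.records = [r for r in self.records if r[0] != doc_id]
        self.records.append((doc_id, vector, metadata))

    def query(self, vector, top_k=3):
        scored = [(cosine(vector, vec), doc_id, meta)
                  for doc_id, vec, meta in self.records]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]

documents = {
    "1": "how to reduce ai api costs in production",
    "2": "best practices for prompt engineering",
    "3": "how to implement rag with langchain",
}
vocab = sorted({word for text in documents.values() for word in text.split()})

index = InMemoryIndex()
for doc_id, text in documents.items():                 # 1. Embed and 2. Store
    index.upsert(doc_id, toy_embed(text, vocab), {"text": text})

query_vector = toy_embed("reduce api costs", vocab)    # 3. Query
matches = index.query(query_vector, top_k=2)           # 4. Retrieve
for score, doc_id, meta in matches:                    # 5. Return
    print(f"{score:.3f} | {doc_id} | {meta['text']}")
```

The shape of the loop (embed, upsert with metadata, embed the query, fetch the top matches) is the same one the Pinecone example follows.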
Implementation Example (Python + OpenAI + Pinecone)
import os
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
# Initialize clients
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
# Create index (run once)
index_name = "semantic-search-demo"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)

# Step 1: Embed documents
def embed_text(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

documents = [
    {"id": "1", "text": "How to reduce AI API costs in production", "category": "optimization"},
    {"id": "2", "text": "Best practices for prompt engineering with GPT-4", "category": "prompting"},
    {"id": "3", "text": "How to implement RAG with LangChain", "category": "tutorial"},
]

# Embed and upsert
for doc in documents:
    embedding = embed_text(doc["text"])
    index.upsert([
        {
            "id": doc["id"],
            "values": embedding,
            "metadata": {"text": doc["text"], "category": doc["category"]}
        }
    ])
print(f"Indexed {len(documents)} documents")

# Step 2: Search
def semantic_search(query, top_k=3):
    query_embedding = embed_text(query)
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return results

# Test
results = semantic_search("How to save money on AI APIs")
for match in results["matches"]:
    print(f"Score: {match['score']:.3f} | Text: {match['metadata']['text']}")
Expected output:
Indexed 3 documents
Score: 0.847 | Text: How to reduce AI API costs in production
Score: 0.712 | Text: Best practices for prompt engineering with GPT-4
Score: 0.698 | Text: How to implement RAG with LangChain
JavaScript/TypeScript Example (Node.js + OpenAI + Pinecone)
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pc.index("semantic-search-demo");
async function embedText(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

async function semanticSearch(query, topK = 3) {
  const queryEmbedding = await embedText(query);
  const results = await index.query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true,
  });
  return results;
}

// Usage (top-level await requires an ES module)
const results = await semanticSearch("How to save money on AI APIs");
results.matches.forEach((match) => {
  console.log(`Score: ${match.score?.toFixed(3)} | Text: ${match.metadata?.text}`);
});
Step 4: Production Considerations
1. Chunking Strategy
Long documents must be split into chunks before embedding. Best practices for 2026:
- Chunk size: 500-1000 tokens (≈375-750 words)
- Overlap: 10-20% to preserve context across chunks
- Separators: Split on paragraphs, then sentences, then words
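from langchain.text_splitter import RecursiveCharacterTextSplitter
# Note: chunk_size and chunk_overlap are measured in characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(long_document)  # long_document: the full text of one document (str)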
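To see how chunk size and overlap interact, here is a dependency-free sketch of a sliding-window splitter. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption rather than how any particular tokenizer works, and `chunk_words` is a hypothetical helper, not part of LangChain.

```python
def chunk_words(text, chunk_size=800, overlap=100):
    """Split text into windows of chunk_size words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Small demo: 10 words, windows of 4 words with 1 word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` words after the previous one, so the last word of one chunk reappears as the first word of the next in the demo. Unlike the recursive splitter, this sketch ignores paragraph and sentence boundaries.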
2. Metadata Filtering
Combine vector search with metadata filters to improve relevance:
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "optimization"}},  # Only search optimization docs
    include_metadata=True
)
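Metadata filters are plain dictionaries, so they can be assembled from user input before the query runs. The sketch below builds a filter for one or several categories using the `$eq` and `$in` operators from Pinecone's filter syntax. The helper name `category_filter` and the `category` field are assumptions carried over from the sample documents.

```python
def category_filter(categories):
    """Build a Pinecone-style metadata filter for one or more categories."""
    if isinstance(categories, str):
        categories = [categories]  # allow a single category passed as a string
    categories = list(categories)
    if not categories:
        return None  # no filter: search the whole index
    if len(categories) == 1:
        return {"category": {"$eq": categories[0]}}
    return {"category": {"$in": categories}}

single = category_filter("optimization")
several = category_filter(["optimization", "tutorial"])
print(single)
print(several)
```

The result can be passed as the `filter` argument of `index.query`.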
3. Hybrid Search (Keyword + Vector)
For best results, combine keyword search (BM25) with vector search. Weaviate and Elasticsearch support this natively.
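If your stack does not fuse results for you, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration, not the method any particular database uses internally. The document IDs are made up, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # e.g. top results from BM25
vector_hits = ["doc1", "doc5", "doc3"]   # e.g. top results from the vector index
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear in both lists accumulate two contributions, so they rise above documents found by only one retriever. RRF uses ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities are on different scales.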
4. Latency Optimization
- Cache embeddings: Store pre-computed embeddings for frequent queries
- Reduce dimensions: Use the dimensions parameter of the OpenAI embeddings API to shrink vectors from 1536 to 512 (usually with only a small accuracy loss); the index must be created with the same dimension
- Batch embed: Send multiple texts in one API call
# Batch embedding (faster and cheaper)
responses = client.embeddings.create(
    model="text-embedding-3-small",
    input=["text1", "text2", "text3"]  # Batch
)
# One embedding per input text
embeddings = [item.embedding for item in responses.data]
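Caching can be sketched as a thin wrapper around whatever function produces embeddings. In this illustration the in-process dictionary stands in for Redis or Postgres, `fake_embed` is a placeholder for a real API call, and the lowercase and whitespace normalization is an assumption that only makes sense if you are willing to treat such variants as the same query.

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by a hash of (model, normalized text)."""

    def __init__(self, embed_fn, model):
        self.embed_fn = embed_fn  # callable: text -> list of floats
        self.model = model        # part of the key, so models never share entries
        self.store = {}           # swap for Redis/Postgres in production
        self.hits = 0
        self.misses = 0

    def _key(self, text):
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(f"{self.model}:{normalized}".encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

def fake_embed(text):
    """Placeholder for a real embedding call; returns a deterministic dummy vector."""
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed, model="text-embedding-3-small")
first = cache.get("How to save money on AI APIs")
second = cache.get("how to save money  on ai apis")  # same key after normalization
print(cache.hits, cache.misses)
```

Including the model name in the key matters: vectors from different models are not comparable, so a cache entry must never be reused after switching models.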
5. Cost Optimization
- Use smaller models: text-embedding-3-small is about 6.5x cheaper than text-embedding-3-large
- Cache aggressively: Store embeddings in Redis/Postgres
- Consider self-hosted: For >100M tokens/month, self-hosted models (BGE, E5) become cheaper
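A quick back-of-the-envelope calculation helps when weighing these options. The sketch below assumes the prices in the model comparison table are US dollars per 1M tokens, and uses the 100M tokens/month figure from the list above as the example volume. The function name `embedding_cost` is made up for illustration.

```python
# Assumed prices in USD per 1M tokens, taken from the model comparison table.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(num_tokens, model):
    """USD cost of embedding num_tokens with the given model."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

monthly_tokens = 100_000_000  # example volume: 100M tokens per month
small = embedding_cost(monthly_tokens, "text-embedding-3-small")
large = embedding_cost(monthly_tokens, "text-embedding-3-large")
print(f"small: ${small:.2f}, large: ${large:.2f}, ratio: {large / small:.1f}x")
```

API fees are only one side of the comparison. A self-hosted model adds compute and operations costs that this sketch does not model.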
Common Errors and Solutions
Problem: Input too long for the embedding model
Cause: Input text exceeds model's token limit (8192 for OpenAI embeddings)
Solution: Implement chunking (see Step 4.1). Split long documents into ≤8000-token chunks.
Problem: Vector dimension mismatch on upsert or query
Cause: Embedding dimension doesn't match index dimension
Solution: Check model dimensions. text-embedding-3-small = 1536, text-embedding-3-large = 3072. Create the index with the correct dimension.
Problem: Poor search relevance
Cause: Chunk size too large/small, or embedding model mismatch
Solution: Experiment with chunk sizes (500-1000 tokens). Try text-embedding-3-large for better accuracy. Add metadata filters.
Problem: Slow search or indexing
Cause: Network overhead, large batch sizes, or cold start
Solution: Use batch embedding, cache results, and choose a vector DB instance in a region close to your application.
Problem: Rate limit errors (HTTP 429)
Cause: Too many embedding requests
Solution: Implement exponential backoff. OpenAI rate limits for embeddings depend on your usage tier, so check your account's limits rather than assuming a fixed figure such as 3M tokens/minute. Use batch API for bulk indexing.
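Exponential backoff can be sketched as a small retry wrapper. This is a generic illustration rather than any client library's built-in behavior. The name `with_backoff`, its defaults, and the `flaky` demo function are all assumptions. With the OpenAI Python client you would typically pass `retry_on=(openai.RateLimitError,)` so that only rate-limit errors are retried.

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # up to 10% jitter

# Demo with a hypothetical flaky function that fails twice, then succeeds.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = with_backoff(flaky, sleep=waits.append)
print(result, len(waits))
```

The delay doubles after each failure (about 1s, then 2s, and so on, capped at `max_delay`), and the jitter keeps many clients from retrying at the same instant.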
Complete Working Example (Copy-Paste)
Here's a minimal, complete implementation you can run today:
"""
Semantic Search Implementation - Complete Working Example
Requires: openai, pinecone (formerly published as pinecone-client), python-dotenv
Setup:
1. pip install openai pinecone python-dotenv
2. Create .env file with OPENAI_API_KEY and PINECONE_API_KEY
3. Run: python semantic_search.py
"""
import os
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
from dotenv import load_dotenv
load_dotenv()
# Initialize
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index_name = "semantic-search-complete"
# Create index
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)

def embed(text):
    """Convert text to vector embedding."""
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding

def index_documents(docs):
    """Index documents into Pinecone."""
    vectors = []
    for doc in docs:
        embedding = embed(doc["text"])
        vectors.append({
            "id": doc["id"],
            "values": embedding,
            "metadata": {"text": doc["text"]}
        })
    index.upsert(vectors)
    print(f"✓ Indexed {len(vectors)} documents")

def search(query, top_k=3):
    """Search for similar documents."""
    query_vector = embed(query)
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )
    return results["matches"]

# Demo
if __name__ == "__main__":
    # Sample documents
    docs = [
        {"id": "1", "text": "How to reduce AI API costs in production using caching and model selection"},
        {"id": "2", "text": "Best practices for prompt engineering with GPT-4 and Claude"},
        {"id": "3", "text": "How to implement RAG with LangChain and Pinecone"},
        {"id": "4", "text": "Comparing open-source LLMs: Llama 3 vs Mistral vs Qwen"},
    ]
    # Index
    index_documents(docs)
    # Search
    print("\n🔍 Search results for: 'How to save money on AI'\n")
    results = search("How to save money on AI")
    for i, match in enumerate(results, 1):
        score = match['score']
        text = match['metadata']['text']
        print(f"{i}. Score: {score:.3f}")
        print(f" Text: {text}\n")
# Output:
# ✓ Indexed 4 documents
#
# 🔍 Search results for: 'How to save money on AI'
#
# 1. Score: 0.842
# Text: How to reduce AI API costs in production using caching and model selection
#
# 2. Score: 0.719
# Text: Best practices for prompt engineering with GPT-4 and Claude
#
# 3. Score: 0.701
# Text: How to implement RAG with LangChain and Pinecone
Next Steps
- Add hybrid search: Combine keyword (BM25) + vector search for better recall
- Implement re-ranking: Use Cohere Re-rank or LLM-based re-ranking for top results
- Monitor performance: Track search latency, accuracy, and user feedback
- Scale up: Move from Pinecone Starter to Serverless for production traffic
Further Reading
- OpenAI Embeddings Guide (Official docs, updated 2026)
- Pinecone: What is a Vector Database (Comprehensive guide)
- LangChain RAG Tutorial (Building RAG pipelines)
- MTEB Leaderboard (Compare embedding models)
Published: 2026-05-14 | Author: AI Tool Reviewer | Reading time: 12 minutes