Semantic Search Overview
Semantic search finds documents by meaning rather than by keywords alone. It uses embeddings to capture semantics, so it can understand query intent and retrieve relevant documents even when the exact words differ. For many tasks it works better than keyword search.
The pipeline converts queries and documents to embeddings, compares the embeddings using a similarity measure, ranks documents by semantic relevance, and returns the most relevant results.
The diagram shows the semantic search flow: the query is converted to an embedding, documents are converted to embeddings, and a similarity search finds the relevant documents.
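To make the flow concrete, here is a minimal end-to-end sketch using sentence-transformers and scikit-learn. The model name and sample documents are illustrative choices, not prescribed by this guide.

# Minimal end-to-end semantic search (illustrative sketch)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Machine learning is a subset of AI.",
    "Paris is the capital of France.",
    "Neural networks learn from data.",
]

# Embed the documents once, then embed each incoming query
doc_embeddings = model.encode(documents)
query_embedding = model.encode(["How do computers learn?"])

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_embedding, doc_embeddings)[0]
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {documents[i]}")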
Document Chunking Strategies
Document chunking splits documents into searchable pieces. Chunks must balance context and granularity: chunks that are too large lose precision, while chunks that are too small lose context.
Common strategies include fixed-size, sentence-based, and semantic chunking. Fixed-size chunking uses character or token limits, sentence-based chunking splits at sentence boundaries, and semantic chunking groups related content.
# Fixed-size chunking with overlap
def chunk_documents(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks

# Sentence-based chunking
from nltk.tokenize import sent_tokenize

def chunk_by_sentences(text, max_chunk_size=500):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_chunk_size:
            current_chunk += " " + sentence
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Example
text = "Machine learning is a subset of AI. It enables computers to learn. Deep learning uses neural networks."
chunks = chunk_by_sentences(text, max_chunk_size=100)
print("Chunks: " + str(chunks))
-- NeuronDB: Document Chunking
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT
);

-- Split each document into 500-character chunks with 50-character overlap
CREATE TABLE document_chunks AS
SELECT
    id,
    generate_substring(content, 1, 500, 50) AS chunk
FROM documents;

-- Add an embedding column and populate it for each chunk
ALTER TABLE document_chunks ADD COLUMN embedding vector(384);

UPDATE document_chunks
SET embedding = neurondb.embed(chunk, 'sentence-transformers/all-MiniLM-L6-v2');
Chunking directly affects search quality: good chunking preserves context and enables precise retrieval.
Detailed Chunking Strategies
Fixed-size chunking uses character or token limits. It is simple to implement and works for uniform documents, but it may split sentences or concepts; overlap helps preserve context across boundaries.
Sentence-based chunking splits at sentence boundaries, preserving sentence integrity. It works well for natural language, though it creates variable-sized chunks and requires sentence segmentation.
Semantic chunking groups related content, using embeddings to find topic boundaries. It creates coherent chunks and generally produces the highest quality, at the cost of more computation.
# Detailed Chunking Implementation
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class AdvancedChunking:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    def semantic_chunking(self, text, similarity_threshold=0.7, min_chunk_size=100, max_chunk_size=500):
        """Chunk based on semantic similarity"""
        sentences = self.split_sentences(text)
        if len(sentences) == 0:
            return [text]

        # Create embeddings
        embeddings = self.embedder.encode(sentences)

        chunks = []
        current_chunk = [sentences[0]]
        current_embeddings = [embeddings[0]]

        for i in range(1, len(sentences)):
            # Compute similarity between the sentence and the current chunk centroid
            chunk_embedding = np.mean(current_embeddings, axis=0)
            similarity = cosine_similarity(
                chunk_embedding.reshape(1, -1),
                embeddings[i].reshape(1, -1)
            )[0][0]

            # Start a new chunk when similarity drops or the chunk is full
            chunk_text = ' '.join(current_chunk)
            should_split = (similarity < similarity_threshold or
                            len(chunk_text) + len(sentences[i]) > max_chunk_size)
            if should_split and len(chunk_text) >= min_chunk_size:
                chunks.append(' '.join(current_chunk))
                current_chunk = [sentences[i]]
                current_embeddings = [embeddings[i]]
            else:
                current_chunk.append(sentences[i])
                current_embeddings.append(embeddings[i])

        if current_chunk:
            chunks.append(' '.join(current_chunk))
        return chunks

    def recursive_chunking(self, text, max_chunk_size=500, chunk_overlap=50):
        """Recursively chunk large documents"""
        if len(text) <= max_chunk_size:
            return [text]

        # Try to split at paragraph boundaries
        paragraphs = text.split('\n\n')

        # Base case: no paragraph boundary to split on; fall back to fixed-size chunks
        if len(paragraphs) == 1:
            return [text[i:i + max_chunk_size]
                    for i in range(0, len(text), max_chunk_size - chunk_overlap)]

        chunks = []
        current_chunk = ""
        for para in paragraphs:
            if len(current_chunk) + len(para) <= max_chunk_size:
                current_chunk += para + "\n\n"
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = ""
                # Recursively chunk large paragraphs
                if len(para) > max_chunk_size:
                    sub_chunks = self.recursive_chunking(para, max_chunk_size, chunk_overlap)
                    chunks.extend(sub_chunks)
                else:
                    current_chunk = para + "\n\n"

        if current_chunk:
            chunks.append(current_chunk.strip())
        return chunks

    def split_sentences(self, text):
        """Split text into sentences"""
        import re
        # Simple sentence splitting
        sentences = re.split(r'[.!?]+', text)
        return [s.strip() for s in sentences if s.strip()]

# Example
chunker = AdvancedChunking()
long_text = "Machine learning is a subset of AI. It enables computers to learn. Deep learning uses neural networks. Neural networks have multiple layers. Each layer processes information differently."
semantic_chunks = chunker.semantic_chunking(long_text)
print("Semantic chunks: " + str(semantic_chunks))
recursive_chunks = chunker.recursive_chunking(long_text * 10)
print("Recursive chunks count: " + str(len(recursive_chunks)))
Chunking Best Practices
Choose chunk size based on document type: short documents need smaller chunks, while long documents can use larger ones. Typical sizes are 200-500 tokens; test different sizes for your use case.
Overlap helps preserve context across chunk boundaries. A typical overlap is 10-20% of the chunk size. It reduces the chance that important information is split and improves retrieval quality, at the cost of extra storage.
Also consider document structure: respect paragraph boundaries when possible, preserve section headers, and maintain list formatting. All of these improve chunk quality.
# Chunking Best Practices
class BestPracticeChunking:
    def __init__(self):
        self.chunk_sizes = {
            'short': 200,
            'medium': 500,
            'long': 1000
        }

    def adaptive_chunking(self, text, document_type='medium'):
        """Adapt chunk size to document type"""
        chunk_size = self.chunk_sizes.get(document_type, 500)

        # Override based on detected document length
        if len(text) < 1000:
            chunk_size = self.chunk_sizes['short']
        elif len(text) > 10000:
            chunk_size = self.chunk_sizes['long']

        # Use a 10% overlap, following the guidance above
        return self.chunk_with_overlap(text, chunk_size, overlap=int(chunk_size * 0.1))

    def chunk_with_overlap(self, text, chunk_size, overlap=50):
        """Chunk with overlap"""
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunk = text[start:end]
            chunks.append(chunk)
            start = end - overlap
        return chunks

    def preserve_structure(self, text):
        """Chunk while preserving document structure"""
        # Split by paragraphs first
        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = ""
        for para in paragraphs:
            if len(current_chunk) + len(para) <= 500:
                current_chunk += para + "\n\n"
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = para + "\n\n"
        if current_chunk:
            chunks.append(current_chunk.strip())
        return chunks

# Example (long_text is defined in the previous example)
best_practice = BestPracticeChunking()
chunks = best_practice.adaptive_chunking(long_text, document_type='medium')
structured_chunks = best_practice.preserve_structure(long_text)
print("Adaptive chunks: " + str(len(chunks)))
print("Structured chunks: " + str(len(structured_chunks)))
The diagram compares the chunking strategies: fixed-size creates uniform chunks, sentence-based preserves sentence boundaries, and semantic chunking groups related content.
Query Processing
Query processing prepares queries for search. It normalizes queries, handles query expansion, and converts queries to embeddings, all of which improve search quality.
The main steps are normalization, expansion, and embedding: normalization standardizes the text, expansion adds related terms, and embedding converts the query to a vector.
# Query Processing
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def process_query(query):
    # Normalize
    query = query.lower().strip()
    # Convert to embedding
    embedding = model.encode(query)
    return embedding, query

# Example
query = "How does machine learning work?"
embedding, processed_query = process_query(query)
print("Query embedding shape: " + str(embedding.shape))
Query processing improves search by handling surface variations and capturing intent.
The diagram shows the embedding generation process: text is preprocessed and tokenized, the encoder processes the tokens, and the resulting embedding vector is normalized for similarity search.
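A short sketch of that final normalization step: sentence-transformers can L2-normalize embeddings at encode time via its normalize_embeddings flag, which makes dot product and cosine similarity interchangeable. The model name below is an illustrative choice.

# Normalized embeddings for similarity search (illustrative sketch)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# normalize_embeddings=True L2-normalizes each vector,
# so dot product equals cosine similarity
emb = model.encode(["How does machine learning work?"],
                   normalize_embeddings=True)
print("Vector norm: " + str(np.linalg.norm(emb[0])))  # approximately 1.0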
Ranking Algorithms
Ranking algorithms order search results by relevance, using similarity scores and often combining multiple signals.
Ranking methods include similarity-based ranking, learning-to-rank, and hybrid approaches. Similarity-based ranking uses embedding similarity directly, learning-to-rank uses machine learning, and hybrid methods combine multiple signals.
# Ranking Algorithms
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rank_documents(query_embedding, document_embeddings, top_k=10):
    similarities = cosine_similarity([query_embedding], document_embeddings)[0]
    ranked_indices = np.argsort(similarities)[::-1][:top_k]
    return ranked_indices, similarities[ranked_indices]

# Example
query_emb = np.random.randn(384)
doc_embs = np.random.randn(1000, 384)
indices, scores = rank_documents(query_emb, doc_embs, top_k=10)
print("Top documents: " + str(indices))
print("Scores: " + str(scores))
Ranking improves result quality by ordering results by relevance and combining multiple signals.
Detailed Ranking Algorithms
Similarity-based ranking uses embedding similarity directly. It is simple and fast and works well when the embeddings are good, but it may miss important signals.
Learning-to-rank uses machine learning models trained on features such as query-document similarity, document length, position, and query characteristics. The models learn feature weights that improve ranking quality, but they require labeled training data.
BM25 is a probabilistic ranking function that combines term frequency and inverse document frequency. It works well for keyword search and can be combined with semantic scores.
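To make the term-frequency/inverse-document-frequency combination concrete, here is a minimal from-scratch BM25 scorer. It is a sketch with the common k1 and b defaults; the rank_bm25 library used in the next example packages the same idea.

# Minimal BM25 scorer (illustrative sketch; common k1/b defaults)
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                      # term frequency in this document
        df = sum(1 for d in corpus if term in d)  # document frequency across corpus
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

# Example on a tokenized corpus
corpus = [doc.lower().split() for doc in
          ["machine learning tutorial", "deep learning guide", "AI introduction"]]
query = "learning tutorial".split()
for doc in corpus:
    print(doc, round(bm25_score(query, doc, corpus), 3))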
# Detailed Ranking Implementation
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi

class AdvancedRanking:
    def __init__(self):
        self.ltr_model = RandomForestRegressor(n_estimators=100)
        self.bm25 = None

    def similarity_ranking(self, query_emb, doc_embeddings, top_k=10):
        """Simple similarity-based ranking"""
        similarities = cosine_similarity([query_emb], doc_embeddings)[0]
        ranked_indices = np.argsort(similarities)[::-1][:top_k]
        return ranked_indices, similarities[ranked_indices]

    def bm25_ranking(self, query, documents, top_k=10):
        """BM25 ranking for keyword matching"""
        if self.bm25 is None:
            tokenized_docs = [doc.lower().split() for doc in documents]
            self.bm25 = BM25Okapi(tokenized_docs)
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        ranked_indices = np.argsort(scores)[::-1][:top_k]
        return ranked_indices, scores[ranked_indices]

    def extract_features(self, query, documents, query_emb, doc_embeddings):
        """Feature vectors for learning-to-rank"""
        features = []
        similarities = cosine_similarity([query_emb], doc_embeddings)[0]
        for i, doc in enumerate(documents):
            features.append([
                similarities[i],  # Semantic similarity
                len(doc),         # Document length
                i,                # Initial position
                len(query),       # Query length
                doc.count(' '),   # Word count
                len(set(doc.lower().split()) & set(query.lower().split()))  # Term overlap
            ])
        return features

    def learned_to_rank(self, query, documents, query_emb, doc_embeddings, top_k=10):
        """Learning-to-rank with multiple features (train the model first)"""
        features = self.extract_features(query, documents, query_emb, doc_embeddings)
        relevance_scores = self.ltr_model.predict(features)
        ranked_indices = np.argsort(relevance_scores)[::-1][:top_k]
        return ranked_indices, relevance_scores[ranked_indices]

    def train_ltr_model(self, queries, documents_list, query_embeddings, doc_embeddings_list, relevance_labels):
        """Train the learning-to-rank model from per-query labeled examples"""
        X_train = []
        y_train = []
        for query, docs, q_emb, d_embs, labels in zip(
                queries, documents_list, query_embeddings, doc_embeddings_list, relevance_labels):
            X_train.extend(self.extract_features(query, docs, q_emb, d_embs))
            y_train.extend(labels)
        self.ltr_model.fit(X_train, y_train)
        return self.ltr_model

# Example
ranker = AdvancedRanking()
query = "machine learning"
documents = ["ML tutorial", "Deep learning guide", "AI introduction"]
query_emb = np.random.randn(384)
doc_embs = np.random.randn(3, 384)

# Similarity ranking
sim_indices, sim_scores = ranker.similarity_ranking(query_emb, doc_embs)
print("Similarity ranking: " + str(sim_indices))

# BM25 ranking
bm25_indices, bm25_scores = ranker.bm25_ranking(query, documents)
print("BM25 ranking: " + str(bm25_indices))
The diagram shows the ranking process: similarity scores are computed, results are ordered by score, and the top results are returned.
Keyword vs Semantic Search
Keyword search matches exact words. It is fast and simple but misses synonyms and related concepts. Semantic search matches meaning, so it finds related content and works better for many queries.
Keyword search uses inverted indexes: it matches query terms and ranks by term frequency. Semantic search uses embeddings: it matches meaning and ranks by semantic similarity. A sketch of an inverted index follows.
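An inverted index is simply a map from each term to the documents that contain it. This is a minimal illustration, not a production index (no stemming, stop words, or positional data).

# Minimal inverted index (illustrative sketch)
from collections import defaultdict

documents = ["Machine learning tutorial", "Deep learning guide", "AI introduction"]

# Map each term to the IDs of the documents containing it
inverted_index = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for term in doc.lower().split():
        inverted_index[term].add(doc_id)

# Lookup: documents matching any query term
query_terms = "learning tutorial".split()
matches = set().union(*(inverted_index.get(t, set()) for t in query_terms))
print("Matching documents: " + str(sorted(matches)))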
# Keyword vs Semantic Search
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

# Keyword search
tfidf = TfidfVectorizer()
documents = ["Machine learning tutorial", "Deep learning guide", "AI introduction"]
tfidf_matrix = tfidf.fit_transform(documents)
query = "neural networks"
query_vector = tfidf.transform([query])
keyword_scores = (query_vector * tfidf_matrix.T).toarray()[0]

# Semantic search
model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(documents)
query_embedding = model.encode([query])
semantic_scores = cosine_similarity(query_embedding, doc_embeddings)[0]

print("Keyword scores: " + str(keyword_scores))
print("Semantic scores: " + str(semantic_scores))
Semantic search works better for meaning-based queries, while keyword search works for exact matches. Hybrid approaches combine both.
The diagram compares the two search methods: keyword search matches terms, semantic search matches meaning.
Hybrid Approaches
Hybrid search combines keyword and semantic search, using both exact matches and meaning. It improves result quality and handles diverse query types.
Hybrid methods include score fusion, reranking, and multi-stage retrieval. Score fusion combines the two similarity scores, reranking uses semantic scores to reorder keyword results, and multi-stage retrieval uses keyword search for recall and semantic search for precision; sketches of both variants follow.
# Hybrid Search (score fusion)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def compute_keyword_scores(query, documents):
    # TF-IDF scores for exact-term matching
    tfidf = TfidfVectorizer()
    doc_matrix = tfidf.fit_transform(documents)
    query_vector = tfidf.transform([query])
    return (query_vector @ doc_matrix.T).toarray()[0]

def compute_semantic_scores(query, documents):
    # Embedding similarity scores for meaning-based matching
    doc_embeddings = model.encode(documents)
    query_embedding = model.encode([query])
    return cosine_similarity(query_embedding, doc_embeddings)[0]

def hybrid_search(query, documents, alpha=0.5):
    # Keyword scores
    keyword_scores = compute_keyword_scores(query, documents)
    # Semantic scores
    semantic_scores = compute_semantic_scores(query, documents)
    # Combine scores (alpha weights keyword vs. semantic)
    hybrid_scores = alpha * keyword_scores + (1 - alpha) * semantic_scores
    # Rank by combined score
    ranked_indices = np.argsort(hybrid_scores)[::-1]
    return ranked_indices

# Example
query = "machine learning tutorial"
documents = ["ML guide", "Deep learning intro", "AI basics"]
results = hybrid_search(query, documents, alpha=0.6)
print("Hybrid search results: " + str(results))
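Score fusion is shown above. The sketch below illustrates the multi-stage variant: BM25 provides a cheap first-pass recall set, and semantic similarity reranks only those candidates. The function name and parameters are illustrative.

# Multi-stage retrieval: BM25 recall, then semantic rerank (illustrative sketch)
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def multi_stage_search(query, documents, recall_k=100, top_k=10):
    # Stage 1: cheap keyword recall with BM25
    bm25 = BM25Okapi([d.lower().split() for d in documents])
    scores = bm25.get_scores(query.lower().split())
    candidates = np.argsort(scores)[::-1][:recall_k]

    # Stage 2: semantic rerank of the candidates only
    cand_docs = [documents[i] for i in candidates]
    sem_scores = cosine_similarity(model.encode([query]),
                                   model.encode(cand_docs))[0]
    return candidates[np.argsort(sem_scores)[::-1][:top_k]]

# Example
print(multi_stage_search("machine learning tutorial",
                         ["ML guide", "Deep learning intro", "AI basics"],
                         recall_k=3, top_k=2))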
Hybrid search improves quality by combining the strengths of both methods, and it handles diverse queries.
Query Expansion
Query expansion adds related terms to queries. It improves recall by handling vocabulary mismatches between queries and documents, finding more relevant results.
Expansion methods include synonym expansion, embedding-based expansion, and feedback-based expansion. Synonym expansion adds synonyms, embedding-based expansion adds terms with similar embeddings, and feedback-based expansion uses relevance feedback.
# Query Expansion
import gensim.downloader as api

# Load pretrained word vectors (a gensim KeyedVectors model;
# the vector set here is an illustrative choice)
word_embeddings = api.load('glove-wiki-gigaword-50')

def expand_query(query, word_embeddings, top_k=3):
    query_words = query.split()
    expanded_terms = set(query_words)
    for word in query_words:
        if word in word_embeddings:
            similar = word_embeddings.most_similar(word, topn=top_k)
            expanded_terms.update([w for w, _ in similar])
    expanded_query = " ".join(expanded_terms)
    return expanded_query

# Example
expanded = expand_query("machine learning", word_embeddings, top_k=2)
print("Expanded query: " + str(expanded))
Query expansion improves recall, finding more relevant documents despite vocabulary variations. Feedback-based expansion can also work without explicit user input, as the next sketch shows.
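Pseudo-relevance feedback assumes the top results of an initial retrieval are relevant and mines them for expansion terms. This is a minimal sketch using TF-IDF for the initial pass; the function name and parameters are illustrative.

# Pseudo-relevance feedback expansion (illustrative sketch)
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def prf_expand(query, documents, top_docs=2, top_terms=3):
    # Initial retrieval with TF-IDF
    tfidf = TfidfVectorizer()
    doc_matrix = tfidf.fit_transform(documents)
    scores = (tfidf.transform([query]) @ doc_matrix.T).toarray()[0]
    best = scores.argsort()[::-1][:top_docs]

    # Assume the top documents are relevant; mine their frequent new terms
    counts = Counter()
    for i in best:
        counts.update(w for w in documents[i].lower().split()
                      if w not in query.lower().split())
    expansion = [w for w, _ in counts.most_common(top_terms)]
    return query + " " + " ".join(expansion)

# Example
documents = ["machine learning tutorial for beginners",
             "deep learning and neural networks guide",
             "introduction to cooking pasta"]
print(prf_expand("machine learning", documents))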
The diagram shows query expansion methods: synonym expansion uses word embeddings, while LLM expansion generates related terms. Both improve retrieval coverage.
Summary
Semantic search finds documents by meaning. Document chunking splits documents into appropriately sized pieces, query processing prepares queries, and ranking algorithms order the results. Semantic search works better than keyword search for many queries, hybrid approaches combine the strengths of both, and query expansion improves recall. Together, these techniques enable better information retrieval.