Large language models generate text from their training data, which has a cutoff date. They cannot access current information, private documents, or domain-specific knowledge that was not part of that data. These limitations matter for production applications.
Retrieval-Augmented Generation (RAG) addresses these problems by combining document retrieval with language model generation. The system retrieves relevant documents from a knowledge base and provides them to the language model as context. The model generates responses grounded in that context, so answers stay accurate and current.
This guide explains RAG architecture, implementation patterns, and NeuronDB integration, with SQL examples throughout. It walks through document processing, embedding generation, similarity search, context building, and response generation.
What is RAG
RAG stands for Retrieval-Augmented Generation. RAG is a technique that enhances language model responses with retrieved context. The process has four steps. Step one converts user queries into vector embeddings. Step two searches a knowledge base for relevant documents using vector similarity. Step three combines retrieved documents into context. Step four generates responses using the language model with the retrieved context.
RAG solves three problems. Problem one is outdated information. Language models train on fixed datasets. They cannot access information after training. RAG retrieves current documents from knowledge bases. Problem two is lack of domain knowledge. Models may not know specialized information. RAG retrieves domain-specific documents. Problem three is hallucinations. Models generate plausible but incorrect information. RAG grounds responses in retrieved documents.
RAG systems include five components. Component one is a knowledge base that stores documents. Component two is an embedding model that converts text to vectors. Component three is a vector database that performs similarity search. Component four is a retrieval system that finds relevant documents. Component five is a language model that generates responses.
How RAG Works
RAG follows this workflow. A user submits a query. The system converts the query to an embedding vector. The system searches the knowledge base for documents with similar embeddings. The system retrieves the top K most similar documents. The system combines retrieved documents into a context string. The system sends the query and context to the language model. The model generates a response using the context. The system returns the response to the user.
The retrieval step uses vector similarity search. Documents are stored with embedding vectors. Queries are converted to embedding vectors. The system computes cosine distance between query and document vectors. Documents with lower distance scores are more similar. The system returns documents ranked by similarity.
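In SQL terms, this step reduces to ordering stored chunk embeddings by their distance to the query embedding, as in the following sketch (the embed_text() function and the <=> cosine-distance operator are introduced with the NeuronDB examples later in this guide; the table and query text are illustrative):
-- Sketch: rank chunks by cosine distance to the query and keep the top K
SELECT chunk_id, chunk_text
FROM document_chunks
ORDER BY embedding <=> embed_text('example user query',
                                  'sentence-transformers/all-MiniLM-L6-v2')
LIMIT 5;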
The generation step uses language models. Models receive the user query and retrieved context. Models generate responses that reference the context. Models cite source documents when possible. Models avoid generating information not in the context.
RAG Architecture
RAG systems have three layers. The storage layer stores documents and embeddings. The retrieval layer performs similarity search. The generation layer produces responses.
The storage layer includes document tables, embedding vectors, and indexes. Documents are split into chunks. Each chunk has text content and metadata. Embeddings are generated for each chunk. Embeddings are stored in vector columns. Indexes enable fast similarity search.
The retrieval layer includes query processing, embedding generation, similarity search, and result ranking. Queries are converted to embeddings. Similarity search finds relevant chunks. Results are ranked by similarity score. Top K chunks are selected for context.
The generation layer includes context building, prompt construction, model invocation, and response formatting. Retrieved chunks are combined into context. Prompts include the query and context. Models generate responses. Responses are formatted and returned.
Building a RAG System
Build a complete RAG system using NeuronDB. The system handles document ingestion, embedding generation, similarity search, context building, and response generation. Follow these steps to create a production-ready system.
Schema Design
Create tables for documents, chunks, and queries. The documents table stores complete documents with metadata. The chunks table stores document segments along with their embeddings. The queries table tracks user queries and responses.
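A minimal schema along these lines supports the rest of the guide (a sketch; the exact column types, including VECTOR(384) for the 384-dimensional embeddings used later, and columns such as chunk_index are assumptions):
-- Sketch of a minimal schema; column names follow the queries in this guide
CREATE TABLE documents (
    doc_id      SERIAL PRIMARY KEY,
    title       TEXT NOT NULL,
    content     TEXT NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE document_chunks (
    chunk_id     SERIAL PRIMARY KEY,
    doc_id       INT REFERENCES documents(doc_id),
    chunk_index  INT,
    chunk_text   TEXT NOT NULL,
    chunk_tokens INT,
    embedding    VECTOR(384)
);

CREATE TABLE rag_queries (
    query_id         SERIAL PRIMARY KEY,
    query_text       TEXT NOT NULL,
    retrieved_chunks INT[],
    context          TEXT,
    response         TEXT,
    created_at       TIMESTAMPTZ DEFAULT now()
);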
CREATE INDEX idx_chunks_doc_id ON document_chunks(doc_id);
CREATE INDEX idx_chunks_embedding ON document_chunks USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_queries_created_at ON rag_queries(created_at);
The schema includes three tables. The documents table stores source documents. The document_chunks table stores text chunks together with their embeddings. The rag_queries table tracks queries and responses. Indexes enable fast retrieval.
Document Ingestion
Insert documents into the system. Documents are split into chunks. Chunks are processed for embedding generation. Run these commands to ingest documents:
-- Sample documents (reconstruction sketch: the second title is an assumption,
-- and a third sample document referenced later is not shown here)
INSERT INTO documents (title, content) VALUES
('PostgreSQL Performance Optimization',
 'PostgreSQL performance can be improved through indexing, query optimization, and configuration tuning. B-tree indexes work well for most queries. GiST indexes support full-text search. Hash indexes provide fast equality lookups. Partial indexes reduce index size. Materialized views cache query results. Query planning uses statistics from ANALYZE. EXPLAIN shows query execution plans.'),
('In-Database Machine Learning',
 'In-database machine learning enables training and inference within the database. Benefits include reduced data movement, faster processing, and simplified deployment. Algorithms include Random Forest, XGBoost, and Neural Networks. GPU acceleration provides 10x to 100x speedup. Model catalogs store trained models for reuse.');
The verification query shows three documents inserted. Each document has a unique ID, title, and content. Documents are ready for chunking.
Document Chunking
Split documents into smaller chunks. Chunks improve retrieval accuracy. Large documents are harder to match precisely. Smaller chunks enable focused retrieval. Split documents by sentences or fixed token counts:
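A statement along the following lines performs the sentence-based split (a sketch that continues into the fragment below; the chunk_index column and the length-divided-by-4 token estimate are assumptions):
-- Sentence-based chunking (sketch)
INSERT INTO document_chunks (doc_id, chunk_index, chunk_text, chunk_tokens)
SELECT
    doc_id,
    ordinality AS chunk_index,
    chunk_text,
    (length(chunk_text) / 4)::INT AS chunk_tokens
FROM (
    SELECT d.doc_id, t.chunk_text, t.ordinality
    FROM documents d,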
         LATERAL unnest(regexp_split_to_array(content, '\.\s+')) WITH ORDINALITY AS t(chunk_text, ordinality)
) chunks
WHERE length(chunk_text) > 20;
-- Verification
SELECT
doc_id,
COUNT(*) AS chunk_count,
AVG(chunk_tokens)::INT AS avg_tokens
FROM document_chunks
GROUP BY doc_id
ORDER BY doc_id;
 doc_id | chunk_count | avg_tokens
--------+-------------+------------
      1 |           8 |         45
      2 |           5 |         38
      3 |           4 |         42
(3 rows)
The verification query shows the chunks created for each document: document 1 has 8 chunks, document 2 has 5, and document 3 has 4. The average chunk size is roughly 42 tokens.
Embedding Generation
Generate embeddings for all chunks. Embeddings enable semantic search. Use the sentence-transformers/all-MiniLM-L6-v2 model for 384-dimensional vectors:
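Generate them with a single UPDATE using embed_text() (the same statement appears in the NeuronDB integration section later in this guide):
-- Embedding generation for chunks that do not yet have an embedding
UPDATE document_chunks
SET embedding = embed_text(
    chunk_text,
    'sentence-transformers/all-MiniLM-L6-v2'
)
WHERE embedding IS NULL;
Similarity Search
With embeddings in place, a retrieval query ranks chunks by similarity to the user's question. The sketch below assumes the pgvector-style <=> cosine-distance operator that matches the vector_cosine_ops index; the question text is illustrative:
-- Retrieve the top 5 chunks for a question (sketch)
WITH query_embedding AS (
    SELECT embed_text(
        'How do I improve PostgreSQL query performance?',
        'sentence-transformers/all-MiniLM-L6-v2'
    ) AS qvec
)
SELECT
    c.chunk_id,
    c.doc_id,
    c.chunk_text,
    1 - (c.embedding <=> q.qvec) AS similarity
FROM document_chunks c, query_embedding q
ORDER BY c.embedding <=> q.qvec
LIMIT 5;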
The retrieval query found 5 relevant chunks. The top 3 chunks are from the PostgreSQL Performance Optimization document. Similarity scores range from 0.82 to 0.69. These chunks will form the context.
Context Building
Combine retrieved chunks into a single context string. Format chunks with rank information. Aggregate chunk IDs for tracking:
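A query along these lines builds the context (a sketch; labeling each chunk with its retrieval rank as [Document N] matches the output shown below, and the question text carries over from the retrieval sketch above):
-- Build the context string from the top-ranked chunks (sketch)
WITH ranked_chunks AS (
    SELECT
        c.chunk_id,
        c.chunk_text,
        ROW_NUMBER() OVER (ORDER BY c.embedding <=> q.qvec) AS rank
    FROM document_chunks c,
         (SELECT embed_text('How do I improve PostgreSQL query performance?',
                            'sentence-transformers/all-MiniLM-L6-v2') AS qvec) q
    ORDER BY c.embedding <=> q.qvec
    LIMIT 5
)
SELECT
    array_agg(chunk_id ORDER BY rank) AS chunk_ids,
    length(string_agg(format('[Document %s] %s', rank, chunk_text), E'\n' ORDER BY rank)) AS context_length,
    string_agg(format('[Document %s] %s', rank, chunk_text), E'\n' ORDER BY rank) AS context
FROM ranked_chunks;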
 {1,2,3,4,5} | 1245 | [Document 1] PostgreSQL performance can be improved through indexing, query optimization, and configuration tuning. B-tree indexes work well for most queries. GiST indexes support full-text search. Hash indexes provide fast equality lookups. Partial indexes reduce index size. Materialized views cache query results. Query planning uses statistics from ANALYZE. EXPLAIN shows query execution plans.
 [Document 2] B-tree indexes work well for most queries...
(1 row)
The context contains 5 chunks with total length 1245 characters. Chunk IDs are stored in an array. The context is formatted with rank information.
Response Generation
Update the query record with retrieved chunks and context. The context will be sent to the language model for response generation:
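An UPDATE along these lines records the retrieval result (a sketch; it assumes a row was inserted into rag_queries when the question was received, and that the chunk IDs and context come from the previous step):
-- Store the retrieval result on the query record (sketch)
UPDATE rag_queries
SET retrieved_chunks = ARRAY[1, 2, 3, 4, 5],   -- chunk IDs from context building
    context          = '[Document 1] PostgreSQL performance can be improved ...'
WHERE query_id = 1;                            -- illustrative query ID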
The query record is updated with 5 retrieved chunks and 1245 characters of context. The context is ready for language model processing.
RAG Pipeline Components
RAG systems include several components that work together. Understanding each component helps build effective systems.
Document Processing
Document processing prepares documents for retrieval. Documents are cleaned, normalized, and chunked. Cleaning removes formatting artifacts. Normalization standardizes text encoding. Chunking splits documents into manageable segments.
Chunking strategies include fixed-size chunks, sentence-based chunks, and semantic chunks. Fixed-size chunks split text at fixed token counts. Sentence-based chunks split at sentence boundaries. Semantic chunks split at semantic boundaries using embeddings.
Chunk size affects retrieval accuracy. Small chunks provide precise matches but lose context. Large chunks provide more context but reduce precision. Optimal chunk size is 100 to 500 tokens depending on document type.
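As a rough illustration, fixed-size chunking can be approximated in SQL by slicing documents into character windows (a sketch; 1,000 characters is used as a stand-in for roughly 250 tokens, and the constant token estimate is an assumption):
-- Fixed-size chunking sketch: ~1,000-character windows as a token-count proxy
INSERT INTO document_chunks (doc_id, chunk_index, chunk_text, chunk_tokens)
SELECT
    doc_id,
    n AS chunk_index,
    substring(content FROM (n - 1) * 1000 + 1 FOR 1000) AS chunk_text,
    250 AS chunk_tokens
FROM documents,
     generate_series(1, ceil(length(content) / 1000.0)::INT) AS n;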
Embedding Models
Embedding models convert text to vectors. Models are trained on large text corpora. They learn semantic relationships between words and phrases. Similar texts produce similar vectors.
Model selection depends on use case. Fast models like sentence-transformers/all-MiniLM-L6-v2 provide 384-dimensional vectors with good quality. High-quality models like BAAI/bge-large-en-v1.5 provide 1024-dimensional vectors with better accuracy.
NeuronDB supports embedding models from Hugging Face. Models are loaded on first use. Embeddings are cached for performance. Batch processing generates embeddings for multiple texts efficiently.
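For example, the same text can be embedded with either model; the function signature follows its use elsewhere in this guide, and the larger model needs a matching 1024-dimensional vector column:
-- Fast 384-dimensional embedding
SELECT embed_text('PostgreSQL query planning', 'sentence-transformers/all-MiniLM-L6-v2');

-- Higher-quality 1024-dimensional embedding
SELECT embed_text('PostgreSQL query planning', 'BAAI/bge-large-en-v1.5');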
Vector Indexes
Indexes accelerate similarity search. HNSW indexes provide sub-10ms queries on millions of vectors. IVFFlat indexes provide memory-efficient alternatives. Index parameters balance query speed and index build time.
Query performance depends on index configuration. Higher m values improve recall but slow queries. Higher ef_construction values improve index quality but slow builds. Tune parameters based on data size and query requirements.
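For example, a larger collection or a high-recall requirement can use larger parameter values than the defaults shown elsewhere in this guide (the values follow the recommendations in the performance section below):
-- Higher-recall HNSW configuration for large collections
CREATE INDEX idx_chunks_embedding_hr ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);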
Context Construction
Context construction combines retrieved chunks into prompts. Chunks are formatted with metadata. Rank information indicates relevance. Source citations enable verification.
Context length affects model performance. Short contexts may miss important information. Long contexts may exceed model limits. Optimal context length is 500 to 2000 tokens depending on model.
Response Generation
Response generation uses language models to produce answers. Models receive queries and context. Models generate responses grounded in context. Models cite sources when possible.
Model selection depends on requirements. GPT-4 provides high-quality responses but higher cost. GPT-3.5 provides faster responses at lower cost. Claude provides balanced quality and speed.
Advanced RAG Patterns
Advanced RAG patterns improve system performance and accuracy. These patterns address common challenges in RAG systems.
Reranking
Reranking improves retrieval precision by reordering results. Initial retrieval uses fast approximate search. Reranking uses slower but more accurate models. Cross-encoder models score query-document pairs for relevance.
Reranking workflow includes initial retrieval, reranking, and final selection. Initial retrieval finds top 20 to 100 candidates. Reranking scores each candidate. Final selection returns top K results.
NeuronDB supports reranking with cross-encoder models. Models score query-chunk pairs. Results are reordered by relevance score. Top K chunks are selected for context.
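A two-stage query captures this pattern (a sketch; it retrieves 20 candidates by vector similarity and reranks them with rerank_cross_encoder(), whose usage matches the NeuronDB integration section below; the question text is illustrative):
-- Stage 1: approximate vector retrieval; Stage 2: cross-encoder reranking (sketch)
WITH candidates AS (
    SELECT chunk_id, chunk_text
    FROM document_chunks
    ORDER BY embedding <=> embed_text('How do HNSW indexes work?',
                                      'sentence-transformers/all-MiniLM-L6-v2')
    LIMIT 20
)
SELECT
    chunk_id,
    chunk_text,
    rerank_cross_encoder('How do HNSW indexes work?', chunk_text,
                         'cross-encoder/ms-marco-MiniLM-L-6-v2') AS rerank_score
FROM candidates
ORDER BY rerank_score DESC
LIMIT 5;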
Reranking reorders results by relevance score. Chunk 2 moves from rank 2 to rank 1 after reranking. Rerank scores are higher than initial similarity scores. Final ranking improves precision.
Multi-Step Retrieval
Multi-step retrieval improves recall by expanding queries. Initial retrieval finds relevant documents. Query expansion generates related queries. Additional retrieval finds more documents. Results are merged and deduplicated.
Query expansion techniques include synonym replacement, related term addition, and query rewriting. Synonym replacement substitutes query terms with synonyms. Related term addition includes semantically related terms. Query rewriting generates alternative phrasings.
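A sketch of this pattern unions the results of several expanded queries and counts how many of them match each chunk (the expanded query texts are illustrative):
-- Multi-step retrieval sketch: merge results from expanded queries
WITH expanded_queries(query_text) AS (
    VALUES
        ('vector similarity search'),
        ('nearest neighbor search'),
        ('semantic search with embeddings'),
        ('find similar items by embedding')
),
per_query AS (
    SELECT c.chunk_id,
           c.chunk_text,
           1 - (c.embedding <=> embed_text(q.query_text,
                'sentence-transformers/all-MiniLM-L6-v2')) AS similarity
    FROM expanded_queries q
    CROSS JOIN LATERAL (
        SELECT chunk_id, chunk_text, embedding
        FROM document_chunks
        ORDER BY embedding <=> embed_text(q.query_text,
                 'sentence-transformers/all-MiniLM-L6-v2')
        LIMIT 10
    ) c
)
SELECT chunk_id,
       chunk_text,
       MAX(similarity) AS best_similarity,
       COUNT(*)        AS query_matches
FROM per_query
GROUP BY chunk_id, chunk_text
ORDER BY query_matches DESC, best_similarity DESC
LIMIT 5;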
 5 | Vector search finds similar items using... | 0.7654 | 1
(5 rows)
Multi-step retrieval finds more relevant chunks. Chunk 1 matches all 4 expanded queries. Query matches indicate broad relevance. Results have higher similarity scores.
Hybrid Search
Hybrid search combines vector search with keyword search. Vector search finds semantically similar documents. Keyword search finds documents with exact term matches. Results are merged using reciprocal rank fusion.
Reciprocal rank fusion combines rankings from multiple methods. Each method produces a ranked list, and each document receives a score of 1 / (k + rank) from every list it appears in, where k is a smoothing constant (60 is a common choice). Documents are ranked by the sum of these scores.
 5 | Vector search finds similar items using... | 0.7210 | 0.6123 | 0.009567
(5 rows)
Hybrid search combines vector and keyword results. Chunk 2 has high keyword score but lower vector score. RRF combines both scores. Final ranking balances semantic and exact matches.
Real-World RAG Examples
RAG systems are used in production applications. These examples show complete workflows with specific numbers and results.
Customer Support Knowledge Base
A support system contains 10,000 help articles. Users ask questions in natural language. The system retrieves relevant articles and generates answers.
A user asks "How do I reset my password?" The system converts the query to an embedding. The system searches 10,000 articles. The system retrieves the top 5 articles. Article 1 is "Password Recovery Guide" with similarity 0.92. Article 2 is "Account Access Reset" with similarity 0.88. Article 3 is "Security Settings" with similarity 0.85.
The system builds context from 5 articles totaling 2,500 words. The system sends the query and context to GPT-4. The model generates a response explaining password reset steps. The response cites Article 1 and Article 2. Response time is 2.3 seconds.
-- Sample help articles (reconstruction sketch: the support_articles table name
-- is an assumption; titles are taken from the query output below). Embeddings
-- can then be generated with the embed_text() UPDATE pattern shown earlier.
INSERT INTO support_articles (title, content) VALUES
('Password Recovery Guide',
 'To reset your password, go to the login page and click "Forgot Password". Enter your email address. Check your email for a reset link. Click the link and enter a new password. The password must be at least 8 characters.'),
('Account Access Reset',
 'If you have forgotten your password, use the password reset feature. Navigate to the account settings page. Select "Reset Password" from the security section. Follow the email instructions to create a new password.');
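The retrieval query for this example could look like the following (a sketch; the support_articles table, its article_id column, and the <=> operator usage are assumptions carried over from the insert above):
-- Retrieve articles relevant to the password-reset question (sketch)
WITH q AS (
    SELECT embed_text('How do I reset my password?',
                      'sentence-transformers/all-MiniLM-L6-v2') AS qvec
)
SELECT
    a.article_id,
    a.title,
    a.content,
    1 - (a.embedding <=> q.qvec) AS similarity,
    ROW_NUMBER() OVER (ORDER BY a.embedding <=> q.qvec) AS rank
FROM support_articles a, q
ORDER BY a.embedding <=> q.qvec
LIMIT 5;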
 1 | Password Recovery Guide | To reset your password, go to the login page... | 0.9234 | 1
 2 | Account Access Reset    | If you have forgotten your password, use the... | 0.8876 | 2
(2 rows)
The system retrieves 2 relevant articles. Article 1 has similarity 0.92. Article 2 has similarity 0.88. Both articles are about password reset.
Technical Documentation Q&A
A documentation system contains 5,000 technical pages. Developers ask questions about APIs and configurations. The system retrieves relevant documentation and generates answers.
A developer asks "How do I configure HNSW index parameters?" The system converts the query to an embedding. The system searches 5,000 pages. The system retrieves the top 3 pages. Page 1 is "HNSW Index Configuration" with similarity 0.94. Page 2 is "Vector Index Tuning" with similarity 0.87. Page 3 is "Performance Optimization" with similarity 0.82.
The system builds context from 3 pages totaling 1,800 words. The system sends the query and context to Claude. The model generates a response explaining HNSW parameters. The response includes m parameter set to 16 and ef_construction set to 64. Response time is 1.8 seconds.
-- Sample documentation pages (reconstruction sketch: the doc_pages table name
-- is an assumption; the first page title is taken from the query output below)
INSERT INTO doc_pages (title, content, category, embedding) VALUES
('HNSW Index Configuration',
'HNSW indexes require two parameters: m and ef_construction. Parameter m controls the number of connections per layer. Higher m values improve recall but increase index size. Recommended m is 16 for most use cases. Parameter ef_construction controls index quality during construction. Higher ef_construction improves recall but slows index creation. Recommended ef_construction is 64.',
'indexing',
embed_text('HNSW index parameters: m connections per layer, ef_construction quality, recommended values 16 and 64','sentence-transformers/all-MiniLM-L6-v2')),
('Vector Index Tuning',
'Vector index performance depends on parameter selection. HNSW m parameter affects query speed and index size. IVFFlat lists parameter affects memory usage. Tune parameters based on data size and query patterns.',
'indexing',
embed_text('Vector index tuning: HNSW m parameter, IVFFlat lists, performance optimization','sentence-transformers/all-MiniLM-L6-v2'));
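A retrieval query for the developer's question could look like this (a sketch; it reuses the doc_pages assumption from the insert above):
-- Retrieve pages relevant to the HNSW configuration question (sketch)
WITH q AS (
    SELECT embed_text('How do I configure HNSW index parameters?',
                      'sentence-transformers/all-MiniLM-L6-v2') AS qvec
)
SELECT
    p.page_id,
    p.title,
    p.content,
    1 - (p.embedding <=> q.qvec) AS similarity,
    ROW_NUMBER() OVER (ORDER BY p.embedding <=> q.qvec) AS rank
FROM doc_pages p, q
ORDER BY p.embedding <=> q.qvec
LIMIT 3;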
 1 | HNSW Index Configuration | HNSW indexes require two parameters: m and ef_construction. Parameter m controls the number of connections per layer. Higher m values improve recall but increase index size. Recommended m is 16 for most use cases. Parameter ef_construction controls index quality during construction. Higher ef_construction improves recall but slows index creation. Recommended ef_construction is 64. | 0.9456 | 1
 2 | Vector Index Tuning | Vector index performance depends on parameter selection. HNSW m parameter affects query speed and index size. IVFFlat lists parameter affects memory usage. Tune parameters based on data size and query patterns. | 0.8765 | 2
(2 rows)
The system retrieves 2 relevant pages. Page 1 has similarity 0.94. Page 2 has similarity 0.87. Both pages are about index configuration.
Legal Document Analysis
A legal system contains 20,000 contract clauses. Lawyers ask questions about terms and conditions. The system retrieves relevant clauses and generates summaries.
A lawyer asks "What are the intellectual property licensing terms?" The system converts the query to an embedding. The system searches 20,000 clauses. The system retrieves the top 10 clauses. Clause 1 is "IP Licensing Agreement" with similarity 0.91. Clause 2 is "Patent Rights" with similarity 0.89. Clause 3 is "Copyright Assignment" with similarity 0.86.
The system builds context from 10 clauses totaling 5,000 words. The system sends the query and context to GPT-4. The model generates a summary of IP licensing terms. The summary cites specific clauses. Response time is 3.5 seconds.
-- Sample contract clauses (reconstruction sketch: the contract_clauses table
-- name and the effective_date column are assumptions; the first clause title
-- is taken from the query output below)
INSERT INTO contract_clauses (title, content, category, embedding, effective_date) VALUES
('Software License Agreement',
'The licensor grants the licensee intellectual property rights to use the software. Licensing terms include non-exclusive rights for commercial use. The licensee may not sublicense without written consent. IP rights remain with the licensor.',
'ip_licensing',
embed_text('Intellectual property licensing: licensor grants rights, non-exclusive commercial use, no sublicensing, IP rights retained','sentence-transformers/all-MiniLM-L6-v2'),
CURRENT_DATE),
('IP Assignment Contract',
'All intellectual property rights, including patents and copyrights, are assigned to the company. The assignor retains no rights to the IP. Assignment is permanent and irrevocable.',
'ip_licensing',
embed_text('IP assignment: all rights assigned, patents copyrights, permanent irrevocable assignment','sentence-transformers/all-MiniLM-L6-v2'),
CURRENT_DATE);
-- Query processing
WITH query_embedding AS (
SELECT embed_text(
'What are the intellectual property licensing terms?',
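    'sentence-transformers/all-MiniLM-L6-v2'
    ) AS qvec
)
-- The rest of the statement is a reconstruction sketch: rank clauses by cosine
-- distance to the query embedding; the contract_clauses table name matches the
-- assumption used in the insert above.
SELECT
    c.clause_id,
    c.title,
    c.content,
    1 - (c.embedding <=> q.qvec) AS similarity,
    ROW_NUMBER() OVER (ORDER BY c.embedding <=> q.qvec) AS rank
FROM contract_clauses c, query_embedding q
ORDER BY c.embedding <=> q.qvec
LIMIT 10;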
 1 | Software License Agreement | The licensor grants the licensee intellectual... | 0.9123 | 1
 2 | IP Assignment Contract     | All intellectual property rights, including...   | 0.8876 | 2
(2 rows)
The system retrieves 2 relevant clauses. Clause 1 has similarity 0.91. Clause 2 has similarity 0.88. Both clauses are about IP licensing.
RAG Performance Optimization
RAG systems require optimization for production use. These optimizations improve query speed and response quality.
Index Tuning
Vector indexes accelerate similarity search. HNSW indexes provide sub-10ms queries on millions of vectors. Tune index parameters based on data size and query patterns.
HNSW m parameter controls connections per layer. Higher m improves recall but increases index size. Recommended m is 16 for datasets under 10 million vectors. Recommended m is 32 for datasets over 10 million vectors.
HNSW ef_construction parameter controls index quality. Higher ef_construction improves recall but slows builds. Recommended ef_construction is 64 for most use cases. Recommended ef_construction is 128 for high-recall requirements.
-- Index tuning example
CREATE INDEX idx_chunks_embedding_hnsw ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
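The plan below can be produced with EXPLAIN ANALYZE on a top-K similarity query, along the lines of this sketch (the query text is illustrative):
-- Inspect the plan for a top-100 similarity search (sketch)
EXPLAIN ANALYZE
SELECT chunk_id
FROM document_chunks
ORDER BY embedding <=> embed_text('index tuning tips',
                                  'sentence-transformers/all-MiniLM-L6-v2')
LIMIT 100;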
  ->  Index Scan using idx_chunks_embedding_hnsw on document_chunks  (cost=0.42..295.67 rows=100 width=64) (actual time=2.345..2.456 loops=1)
 Planning Time: 0.123 ms
 Execution Time: 2.456 ms
(4 rows)
The index scan completes in 2.456 milliseconds. The HNSW index enables fast similarity search. Query performance meets production requirements.
Embedding Caching
Embedding generation is expensive. Cache embeddings for repeated queries. NeuronDB caches embeddings automatically. Cache hits avoid model inference.
Cache keys include text content and model name. Identical texts with same model produce cache hits. Cache size is configurable. Cache eviction uses LRU policy.
-- Check embedding cache
SELECT
cache_hits,
cache_misses,
cache_hit_rate
FROM neurondb.embedding_cache_stats;
 cache_hits | cache_misses | cache_hit_rate
------------+--------------+----------------
       1245 |          234 |         0.8418
(1 row)
The cache has an 84% hit rate. The 1,245 cache hits each avoided an embedding model invocation. Caching improves query performance.
Batch Processing
Batch processing generates embeddings for multiple texts efficiently. Generating embeddings one text at a time incurs per-call overhead, while a batch amortizes model setup across all texts in the batch.
Batch size affects performance. Small batches underutilize GPU. Large batches exceed memory limits. Optimal batch size is 32 to 128 texts.
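One way to bound batch size from SQL is to embed pending chunks in fixed-size groups, repeating the statement until no rows remain (a sketch; the batch size of 64 and the chunk_id-based selection are illustrative):
-- Embed the next 64 chunks that still lack an embedding (sketch)
UPDATE document_chunks
SET embedding = embed_text(chunk_text, 'sentence-transformers/all-MiniLM-L6-v2')
WHERE chunk_id IN (
    SELECT chunk_id
    FROM document_chunks
    WHERE embedding IS NULL
    ORDER BY chunk_id
    LIMIT 64
);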
NeuronDB Integration
NeuronDB provides each stage of the RAG pipeline directly in PostgreSQL.
Vector Search
Vector search finds relevant chunks. Similarity scores rank results. Top K chunks are selected.
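A similarity query returns ranked chunks (a sketch; the <=> cosine-distance operator is assumed to be the operator backing the vector_cosine_ops index):
-- Vector search
SELECT
    chunk_id,
    chunk_text,
    1 - (embedding <=> embed_text('user query',
         'sentence-transformers/all-MiniLM-L6-v2')) AS similarity
FROM document_chunks
ORDER BY embedding <=> embed_text('user query',
         'sentence-transformers/all-MiniLM-L6-v2')
LIMIT 5;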
Embedding Generation
NeuronDB generates embeddings using models from Hugging Face. Models are loaded on first use. Embeddings are cached for performance. Batch processing improves throughput.
-- Embedding generation
UPDATE document_chunks
SET embedding = embed_text(
chunk_text,
'sentence-transformers/all-MiniLM-L6-v2'
)
WHERE embedding IS NULL;
Embeddings are generated for all chunks. Model is loaded once. Cache improves subsequent queries.
Indexing
NeuronDB supports HNSW and IVFFlat indexes. HNSW indexes provide sub-10ms queries. IVFFlat indexes provide memory efficiency. Indexes integrate with query planner.
-- HNSW index creation
CREATE INDEX idx_chunks_embedding ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
HNSW index enables fast similarity search. Query performance meets production requirements.
Reranking
NeuronDB supports reranking with cross-encoder models. Models score query-chunk pairs. Results are reordered by relevance. Precision improves.
-- Reranking
SELECT
chunk_id,
chunk_text,
rerank_cross_encoder(
'user query',
chunk_text,
'cross-encoder/ms-marco-MiniLM-L-6-v2'
) AS rerank_score
FROM document_chunks
ORDER BY rerank_score DESC
LIMIT 5;
Reranking improves result precision. Cross-encoder models provide accurate relevance scores.
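Hybrid Search
NeuronDB combines vector search with PostgreSQL full-text search and merges the rankings with reciprocal rank fusion. The query below is a sketch; the RRF constant 60, the to_tsvector/plainto_tsquery full-text matching, and the query text are assumptions.
-- Hybrid search with reciprocal rank fusion (sketch)
WITH query_embedding AS (
    SELECT embed_text('user query',
                      'sentence-transformers/all-MiniLM-L6-v2') AS qvec
),
vector_results AS (
    SELECT c.chunk_id,
           ROW_NUMBER() OVER (ORDER BY c.embedding <=> q.qvec) AS vec_rank
    FROM document_chunks c, query_embedding q
    ORDER BY vec_rank
    LIMIT 20
),
fts_results AS (
    SELECT chunk_id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(to_tsvector('english', chunk_text),
                                plainto_tsquery('english', 'user query')) DESC
           ) AS fts_rank
    FROM document_chunks
    WHERE to_tsvector('english', chunk_text)
          @@ plainto_tsquery('english', 'user query')
    ORDER BY fts_rank
    LIMIT 20
)
SELECT chunk_id, rrf_score
FROM (
    SELECT COALESCE(v.chunk_id, f.chunk_id) AS chunk_id,
           COALESCE(1.0 / (60 + v.vec_rank), 0)
         + COALESCE(1.0 / (60 + f.fts_rank), 0) AS rrf_score
    FROM vector_results v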
    FULL OUTER JOIN fts_results f ON v.chunk_id = f.chunk_id
) combined
ORDER BY rrf_score DESC
LIMIT 5;
Hybrid search combines semantic and exact matches. RRF scores balance both methods. Results improve.
Conclusion
RAG combines document retrieval with language model generation. Systems retrieve relevant documents from knowledge bases. Systems provide retrieved context to language models. Models generate responses grounded in context.
RAG solves three problems. Problem one is outdated information. RAG retrieves current documents. Problem two is lack of domain knowledge. RAG retrieves domain-specific documents. Problem three is hallucinations. RAG grounds responses in retrieved documents.
RAG systems include five components. Component one is document storage. Component two is embedding generation. Component three is similarity search. Component four is context building. Component five is response generation.
NeuronDB provides complete RAG capabilities within PostgreSQL. Vector operations enable similarity search. Embedding generation converts text to vectors. Indexing accelerates queries. Reranking improves precision. Hybrid search combines methods.
All operations use SQL syntax. No external services required. Systems scale to millions of documents. Query performance meets production requirements.