NeuronDB Documentation

Advanced Features

Overview

Advanced Features provide production-ready capabilities for building sophisticated AI applications. These features combine multiple techniques to deliver superior search relevance, intelligent caching, and end-to-end RAG (Retrieval Augmented Generation) pipelines.

Key Capabilities

  • Hybrid Search: Combine vector similarity with full-text search for best results
  • Reranking: Improve relevance with cross-encoder and LLM-based reranking
  • RAG Pipelines: End-to-end retrieval augmented generation workflows
  • Intelligent Caching: Semantic caching for embeddings, queries, and results

Production Benefits

  • Higher Relevance: Hybrid search and reranking improve result quality
  • Lower Latency: Intelligent caching reduces computation and API calls
  • Better UX: RAG pipelines provide accurate, context-aware responses
  • Cost Efficiency: Caching reduces API costs and improves throughput

Reranking

Reranking improves search relevance by re-scoring initial results using more sophisticated models. This two-stage approach (retrieve then rerank) provides better accuracy than single-stage retrieval.

Reranking Approaches

Cross-Encoder Reranking

Cross-encoders process query-document pairs together, providing more accurate relevance scores than bi-encoders (used in initial retrieval).

Cross-encoder reranking

-- Retrieve initial candidates
WITH candidates AS (
  SELECT id, content
  FROM documents
  ORDER BY embedding <=> embed_text('PostgreSQL replication', 'text-embedding-ada-002')
  LIMIT 50
)
-- Rerank with cross-encoder
SELECT 
  id,
  content,
  neurondb_rerank(
    'cross-encoder/ms-marco-MiniLM-L-6-v2',
    content,
    'PostgreSQL replication'
  ) AS relevance_score
FROM candidates
ORDER BY relevance_score DESC
LIMIT 10;

LLM-Based Reranking

Use large language models to rerank results based on semantic understanding and reasoning.

LLM reranking

-- Rerank with LLM
SELECT 
  id,
  content,
  neurondb_llm_rerank(
    'gpt-4',
    content,
    'PostgreSQL replication',
    criteria => 'relevance, accuracy, completeness'
  ) AS llm_score
FROM (
  SELECT id, content
  FROM documents
  ORDER BY embedding <=> embed_text('PostgreSQL replication', 'text-embedding-ada-002')
  LIMIT 20
) candidates
ORDER BY llm_score DESC
LIMIT 5;

Ensemble Reranking

Combine multiple reranking models for improved accuracy and robustness.

Ensemble reranking

-- Retrieve initial candidates
WITH candidates AS (
  SELECT id, content
  FROM documents
  ORDER BY embedding <=> embed_text('PostgreSQL replication', 'text-embedding-ada-002')
  LIMIT 50
)
SELECT 
  id,
  content,
  -- Weighted combination of multiple rerankers
  neurondb_rerank('cross-encoder/model1', content, 'PostgreSQL replication') * 0.5 +
  neurondb_rerank('cross-encoder/model2', content, 'PostgreSQL replication') * 0.3 +
  neurondb_llm_rerank('gpt-4', content, 'PostgreSQL replication') * 0.2 AS ensemble_score
FROM candidates
ORDER BY ensemble_score DESC
LIMIT 10;

Reranking Best Practices

  • Retrieve 20-100 candidates, then rerank them to select the final top 10-20
  • Use cross-encoders for speed, LLMs for accuracy
  • Cache reranking results for common queries, as sketched below
  • Monitor reranking latency and adjust candidate count
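
A minimal sketch of result caching, assuming a hypothetical rerank_cache table (not a built-in NeuronDB object); the schema and TTL policy are illustrative only:

-- Hypothetical cache table for cross-encoder scores
CREATE TABLE IF NOT EXISTS rerank_cache (
  query      TEXT,
  doc_id     INT,
  score      FLOAT8,
  created_at TIMESTAMPTZ DEFAULT now(),
  PRIMARY KEY (query, doc_id)
);

-- Score the top candidates once and upsert into the cache;
-- repeat queries can then read scores from rerank_cache instead of rescoring
INSERT INTO rerank_cache (query, doc_id, score)
SELECT
  'PostgreSQL replication',
  id,
  neurondb_rerank('cross-encoder/ms-marco-MiniLM-L-6-v2', content, 'PostgreSQL replication')
FROM (
  SELECT id, content
  FROM documents
  ORDER BY embedding <=> embed_text('PostgreSQL replication', 'text-embedding-ada-002')
  LIMIT 50
) candidates
ON CONFLICT (query, doc_id)
  DO UPDATE SET score = EXCLUDED.score, created_at = now();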

RAG Pipelines

Retrieval Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from your database before generating answers. This grounds LLM outputs in your actual data, reducing hallucinations and improving accuracy.

RAG Workflow

  1. User Question: "What is PostgreSQL replication?"
  2. Retrieve: Find relevant documents using hybrid search
  3. Rerank: Score and sort results by relevance
  4. Generate: LLM creates answer using retrieved context
  5. Response: Return answer with source citations

Complete RAG Pipeline

End-to-end RAG pipeline

-- 1. Document ingestion with embeddings
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(1536),
  metadata JSONB
);

INSERT INTO documents (content, embedding, metadata)
SELECT 
  content,
  embed_text(content, 'text-embedding-ada-002'),
  jsonb_build_object('source', 'docs', 'timestamp', now())
FROM source_documents;

-- 2. RAG query function
CREATE OR REPLACE FUNCTION rag_query(user_query TEXT)
RETURNS TABLE(answer TEXT, sources JSONB) AS $$
DECLARE
  query_embedding vector(1536);
  retrieved_docs TEXT;
BEGIN
  -- Generate query embedding
  query_embedding := embed_text(user_query, 'text-embedding-ada-002');
  
  -- Retrieve relevant documents (hybrid search)
  SELECT string_agg(content, E'\n\n')
  INTO retrieved_docs
  FROM (
    SELECT content
    FROM documents
    ORDER BY embedding <=> query_embedding
    LIMIT 5
  ) top_docs;
  
  -- Generate answer with context
  RETURN QUERY
  SELECT 
    neurondb_llm_generate(
      'gpt-4',
      format(E'Context: %s\n\nQuestion: %s\n\nAnswer:', retrieved_docs, user_query)
    ) AS answer,
    (
      SELECT jsonb_agg(jsonb_build_object('id', id, 'content', content))
      FROM (
        SELECT id, content
        FROM documents
        ORDER BY embedding <=> query_embedding
        LIMIT 5
      ) sources
    ) AS sources;
END;
$$ LANGUAGE plpgsql;

-- 3. Use RAG pipeline
SELECT * FROM rag_query('What is PostgreSQL replication?');

RAG Best Practices

  • Retrieve 3-10 relevant documents for context
  • Use hybrid search for better retrieval quality
  • Rerank results before passing them to the LLM, as shown in the sketch after this list
  • Include source citations in responses
  • Monitor answer quality and adjust retrieval parameters
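
As a sketch of the rerank-before-generation step, the query below reuses the functions from the examples above; the prompt format and model names are illustrative assumptions:

-- Retrieve a wide candidate set, rerank it, then pass only the top results to the LLM
WITH candidates AS (
  SELECT id, content
  FROM documents
  ORDER BY embedding <=> embed_text('What is PostgreSQL replication?', 'text-embedding-ada-002')
  LIMIT 20
),
reranked AS (
  SELECT content,
         neurondb_rerank('cross-encoder/ms-marco-MiniLM-L-6-v2', content,
                         'What is PostgreSQL replication?') AS score
  FROM candidates
  ORDER BY score DESC
  LIMIT 5
)
SELECT neurondb_llm_generate(
  'gpt-4',
  format(E'Context: %s\n\nQuestion: %s\n\nAnswer:',
         string_agg(content, E'\n\n' ORDER BY score DESC),
         'What is PostgreSQL replication?')
) AS answer
FROM reranked;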

Intelligent Caching

Intelligent caching reduces computation, API calls, and latency by storing frequently accessed data, embeddings, and query results. NeuronDB provides multiple caching layers for optimal performance.

Embedding Cache

Cache generated embeddings to avoid redundant API calls and computation.

Embedding cache

-- Configure embedding cache
SET neurondb.embedding_cache_size = 10000;  -- Cache 10K embeddings
SET neurondb.embedding_cache_ttl = 86400;   -- 24 hour TTL

-- Cached embeddings are automatically used
SELECT embed_text('sample text', 'text-embedding-ada-002');  -- First call: API
SELECT embed_text('sample text', 'text-embedding-ada-002');  -- Second call: Cache

-- Check cache statistics
SELECT * FROM neurondb_embedding_cache_stats();

Query Result Cache

Cache query results for identical or similar queries using semantic similarity.

Query result cache

-- Enable query result cache
SET neurondb.query_cache_enabled = on;
SET neurondb.query_cache_size = 1000;
SET neurondb.query_cache_ttl = 3600;  -- 1 hour

-- Cache is automatically used for identical queries
SELECT * FROM documents
WHERE embedding <=> embed_text('PostgreSQL', 'text-embedding-ada-002') < 0.3
ORDER BY embedding <=> embed_text('PostgreSQL', 'text-embedding-ada-002')
LIMIT 10;

Semantic Cache

Cache results for semantically similar queries, not just identical ones.

Semantic cache

-- Enable semantic cache
SET neurondb.semantic_cache_enabled = on;
SET neurondb.semantic_cache_threshold = 0.95;  -- 95% similarity threshold

-- Similar queries use cached results
SELECT * FROM documents
WHERE embedding <=> embed_text('PostgreSQL database', 'text-embedding-ada-002') < 0.3
ORDER BY embedding <=> embed_text('PostgreSQL database', 'text-embedding-ada-002')
LIMIT 10;
-- If results for 'PostgreSQL' are already cached and query similarity exceeds 0.95, the cache is used

Model Output Cache

Cache model inference results to avoid redundant computation.

Model output cache

-- Cache ONNX model outputs
SET neurondb.model_cache_enabled = on;
SET neurondb.model_cache_size = 5000;

-- Cached predictions are reused
SELECT predict_classification('model', features) FROM data;
SELECT predict_classification('model', features) FROM data;  -- Uses cache

Caching Best Practices

  • Enable caching for frequently accessed embeddings and queries
  • Set appropriate TTLs based on data freshness requirements
  • Monitor cache hit rates and adjust sizes accordingly (see the sketch below)
  • Use semantic caching for similar but not identical queries
  • Clear cache when data or models are updated
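
As a monitoring sketch, the query below derives a hit rate from neurondb_embedding_cache_stats(); the hits and misses column names are assumptions, so check the function's actual output on your installation:

-- Assumes the stats function exposes hits and misses counters
SELECT hits,
       misses,
       round(hits::numeric / NULLIF(hits + misses, 0), 3) AS hit_rate
FROM neurondb_embedding_cache_stats();

A persistently low hit rate usually means the cache is too small or the TTL too short for the workload.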

Production Patterns

Complete Search Pipeline

Production search pipeline

-- 1. Hybrid retrieval
WITH vector_results AS (
  -- Vector search (70% weight)
  SELECT id, content,
         (1.0 - (embedding <=> embed_text('query', 'text-embedding-ada-002'))) * 0.7 AS score
  FROM documents
  ORDER BY embedding <=> embed_text('query', 'text-embedding-ada-002')
  LIMIT 50
),
text_results AS (
  -- Full-text search (30% weight)
  SELECT id, content,
         ts_rank(to_tsvector('english', content), to_tsquery('english', 'query')) * 0.3 AS score
  FROM documents
  WHERE to_tsvector('english', content) @@ to_tsquery('english', 'query')
  LIMIT 50
),
hybrid_results AS (
  -- Sum the scores of documents found by both methods
  -- (content is identical per id, so min() simply picks it)
  SELECT id, min(content) AS content, sum(score) AS hybrid_score
  FROM (SELECT * FROM vector_results UNION ALL SELECT * FROM text_results) combined
  GROUP BY id
  ORDER BY hybrid_score DESC
  LIMIT 20
)
-- 2. Reranking
SELECT 
  id,
  content,
  neurondb_rerank('cross-encoder/model', content, 'query') AS final_score
FROM hybrid_results
ORDER BY final_score DESC
LIMIT 10;

RAG with Caching

Cached RAG pipeline

-- Cache table the function below expects (schema shown here for completeness)
CREATE TABLE IF NOT EXISTS rag_cache (
  query_embedding vector(1536),
  answer          TEXT,
  sources         JSONB,
  created_at      TIMESTAMPTZ DEFAULT now()
);

-- RAG with intelligent caching
CREATE OR REPLACE FUNCTION cached_rag_query(user_query TEXT)
RETURNS TABLE(answer TEXT, sources JSONB) AS $$
DECLARE
  cached_answer TEXT;
BEGIN
  -- Check semantic cache first; <=> is cosine distance, so a distance
  -- below 0.05 corresponds to roughly 95% similarity
  SELECT rag_cache.answer INTO cached_answer
  FROM rag_cache
  WHERE rag_cache.query_embedding <=> embed_text(user_query, 'text-embedding-ada-002') < 0.05
  ORDER BY created_at DESC
  LIMIT 1;

  IF cached_answer IS NOT NULL THEN
    RETURN QUERY SELECT cached_answer, '[]'::jsonb;
    RETURN;
  END IF;

  -- Generate a new answer and cache it
  -- (column references are table-qualified to avoid clashes with the OUT parameters)
  RETURN QUERY
  WITH generated AS (
    SELECT * FROM rag_query(user_query)
  )
  INSERT INTO rag_cache (query_embedding, answer, sources)
  SELECT embed_text(user_query, 'text-embedding-ada-002'), g.answer, g.sources
  FROM generated g
  RETURNING rag_cache.answer, rag_cache.sources;
END;
$$ LANGUAGE plpgsql;

Related Documentation