GPU Feature Matrix

Feature parity, known gaps, and current state for CUDA, ROCm, and Metal GPU backends in NeuronDB.

Feature Parity Matrix

Feature	CUDA	ROCm	Metal	CPU Fallback
Vector Distance Calculations
L2 distance	✅	✅	✅	✅
Cosine distance	✅	✅	✅	✅
Inner product	✅	✅	✅	✅
Manhattan distance	✅	⚠️ Partial	❌	✅
Hamming distance	✅	⚠️ Partial	❌	✅
Index Operations
HNSW search	✅	✅	⚠️ Limited	✅
IVF search	✅	✅	❌	✅
Index build	⚠️ CPU only	⚠️ CPU only	⚠️ CPU only	✅
Batch Operations
Batch distance	✅	✅	✅	✅
Batch embedding	✅	✅	⚠️ Limited	✅
Memory Management
GPU memory pool	✅	✅	⚠️ Basic	N/A
Zero-copy transfers	✅	✅	❌	N/A
Memory monitoring	✅	⚠️ Partial	❌	N/A
Performance
Query acceleration	✅ 2-3x	✅ 2-3x	✅ 1.5-2x	Baseline
Batch acceleration	✅ 5-10x	✅ 5-10x	✅ 3-5x	Baseline

Legend:

✅ Fully supported
⚠️ Partial support or known limitations
❌ Not supported

CUDA Support

Supported Features

All vector distance metrics: L2, Cosine, Inner Product, Manhattan, Hamming
HNSW and IVF search: Full acceleration
Batch operations: Optimized for large batches
Memory management: Advanced pooling and monitoring
Multi-GPU: Support for multiple CUDA devices

Performance Characteristics

Query QPS: 2-3x improvement over CPU
Batch operations: 5-10x improvement for large batches
Memory efficiency: Optimized memory transfers
Scalability: Supports up to 8 GPUs

Known Limitations

Index build: Currently CPU-only (GPU build planned)
Mixed precision: FP16 support experimental
Tensor cores: Not yet utilized (future optimization)

Requirements

CUDA Version: 11.0+ (12.1+ recommended)
GPU: NVIDIA GPUs with Compute Capability 7.0+
Drivers: NVIDIA driver 450.80.02+
Libraries: cuBLAS, cuDNN (optional)

ROCm Support

Supported Features

Core distance metrics: L2, Cosine, Inner Product
HNSW search: Full acceleration
IVF search: Full acceleration
Batch operations: Optimized for AMD GPUs
Memory management: Basic pooling

Performance Characteristics

Query QPS: 2-3x improvement over CPU
Batch operations: 5-10x improvement
Memory efficiency: Good for AMD GPUs
Scalability: Supports multiple AMD GPUs

Known Limitations

Manhattan/Hamming: Partial support (slower than CUDA)
Memory monitoring: Limited metrics compared to CUDA
Index build: Currently CPU-only
Mixed precision: Not yet supported

Requirements

ROCm Version: 5.0+ (5.7+ recommended)
GPU: AMD GPUs with RDNA2+ or CDNA architecture
Drivers: ROCm driver stack
Libraries: rocBLAS, rocRAND

Metal Support (macOS)

Supported Features

Core distance metrics: L2, Cosine, Inner Product
HNSW search: Limited acceleration (some operations CPU-bound)
Batch operations: Basic support
Memory management: Basic pooling

Performance Characteristics

Query QPS: 1.5-2x improvement over CPU
Batch operations: 3-5x improvement
Memory efficiency: Good for Apple Silicon
Power efficiency: Excellent on Apple Silicon

Known Limitations

IVF search: Not yet supported
Manhattan/Hamming: Not supported
Index build: Currently CPU-only
Memory monitoring: Limited (no system-level APIs)
Multi-GPU: Limited (Apple Silicon typically single GPU)
Known bugs: Some Metal-specific issues with large batches

Requirements

macOS Version: 13.0 (Ventura) or later
Hardware: Apple Silicon (M1, M2, M3, or later)
Metal Version: Metal 3.0+
Libraries: Metal Performance Shaders

Known Metal Bugs

Large batch size: Batches > 1000 may cause memory issues
- Workaround: Use smaller batches (500-800)
- Status: Under investigation
Concurrent queries: Some race conditions with concurrent Metal operations
- Workaround: Limit concurrent GPU queries
- Status: Partial fix in progress
Memory leaks: Occasional memory leaks in long-running processes
- Workaround: Restart PostgreSQL periodically
- Status: Being addressed

CPU Fallback

When neurondb.gpu_fail_open = true (the default), GPU operations automatically fall back to CPU if either of the following holds:

GPU is not available
GPU operation fails

Fallback behavior:

Automatic and transparent
No error is raised (unless neurondb.gpu_fail_open = false)
Performance degrades to CPU baseline
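
To surface GPU failures instead of silently degrading to CPU speed, for example while benchmarking, fail-open can be switched off for a session. This is a minimal sketch that uses only the setting named in this section. SHOW and RESET are standard PostgreSQL commands, and the exact error raised on a GPU failure is not documented on this page.

```sql
-- Inspect the current fail-open setting (default is true per this page)
SHOW neurondb.gpu_fail_open;

-- Strict mode: GPU errors are raised instead of falling back to CPU
SET neurondb.gpu_fail_open = false;

-- Restore the default fallback behavior for the rest of the session
RESET neurondb.gpu_fail_open;
```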

Performance Comparison

Query Performance (QPS)

Dataset	Dimensions	CPU	CUDA	ROCm	Metal
SIFT-128	128	1,200	3,500	3,200	2,000
GIST-960	960	450	1,800	1,600	800
GloVe-100	100	1,500	4,200	3,800	2,400

Note: Actual performance varies based on hardware, dataset, and configuration.
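
The speedup implied by each row is the GPU QPS divided by the CPU QPS. The query below is a small worked example that recomputes those ratios from the figures in the table. It uses plain PostgreSQL with no NeuronDB functions, and the column aliases are illustrative only.

```sql
-- QPS figures copied from the table above; speedup = GPU QPS / CPU QPS
WITH qps(dataset, cpu, cuda, rocm, metal) AS (
    VALUES
        ('SIFT-128',  1200, 3500, 3200, 2000),
        ('GIST-960',   450, 1800, 1600,  800),
        ('GloVe-100', 1500, 4200, 3800, 2400)
)
SELECT dataset,
       round(cuda::numeric  / cpu, 2) AS cuda_speedup,
       round(rocm::numeric  / cpu, 2) AS rocm_speedup,
       round(metal::numeric / cpu, 2) AS metal_speedup
FROM qps;
-- SIFT-128:  2.92, 2.67, 1.67
-- GIST-960:  4.00, 3.56, 1.78
-- GloVe-100: 2.80, 2.53, 1.60
```

On these figures the high-dimensional GIST-960 row comes out above the 2-3x range quoted for CUDA and ROCm in the feature matrix, while the other rows fall inside the quoted ranges.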

Batch Performance (Throughput)

Operation	Batch Size	CPU	CUDA	ROCm	Metal
Distance calc	1,000	500 ops/s	5,000 ops/s	4,500 ops/s	2,000 ops/s
Embedding gen	100	10 texts/s	100 texts/s	90 texts/s	40 texts/s

Configuration

Enable GPU


-- Enable GPU mode
SET neurondb.compute_mode = true;

-- Select GPU device (CUDA/ROCm)
SET neurondb.gpu_device = 0;

-- Configure batch size
SET neurondb.gpu_batch_size = 1000;

-- Enable fail-open (fall back to CPU on error)
SET neurondb.gpu_fail_open = true;
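
To confirm which values a session is actually using after the SET commands above, the standard PostgreSQL settings view can be queried. This is a sketch that assumes the NeuronDB library is loaded, so that its parameters are registered and visible in pg_settings.

```sql
-- List NeuronDB settings visible to this session and where each value came from
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'neurondb.%'
ORDER BY name;

-- Read a single setting; the second argument (missing_ok) returns NULL
-- instead of raising an error if the parameter does not exist
SELECT current_setting('neurondb.gpu_batch_size', true);
```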

Check GPU Status


-- Check GPU availability
SELECT * FROM neurondb_gpu_info();

-- Check if GPU is enabled
SELECT neurondb.gpu_enabled();

-- Get GPU device count
SELECT neurondb.gpu_device_count();

Migration Between GPU Backends

CUDA → ROCm

Process:

1. Install ROCm drivers and libraries
2. Rebuild NeuronDB with ROCm support
3. Update configuration:

SET neurondb.gpu_backend = 'rocm';
SET neurondb.gpu_device = 0;

4. Restart PostgreSQL

CUDA → Metal (macOS)

Process:

1. Ensure macOS 13+ and Apple Silicon
2. Rebuild NeuronDB with Metal support
3. Update configuration:

SET neurondb.gpu_backend = 'metal';

4. Restart PostgreSQL

Note: Some features may not be available on Metal (see limitations above).
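
One caveat applies to both procedures: in PostgreSQL a plain SET lasts only for the current session, so on its own it would not survive the restart in the final step. The sketch below persists the backend choice instead. It assumes neurondb.gpu_backend is an ordinary server setting that accepts ALTER SYSTEM, which this page does not confirm. Setting the same parameter in postgresql.conf would be the equivalent manual route.

```sql
-- Write the setting to postgresql.auto.conf so it survives a restart
-- (requires superuser; assumes the NeuronDB library is loaded)
ALTER SYSTEM SET neurondb.gpu_backend = 'rocm';   -- or 'metal'

-- After restarting PostgreSQL, confirm the value in a new session
SHOW neurondb.gpu_backend;
```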

Roadmap

Planned Features

Q2 2025:

GPU-accelerated index build (CUDA)
Mixed precision support (FP16)
Enhanced Metal support (fix known bugs)

Q3 2025:

Tensor core utilization (CUDA)
ROCm index build acceleration
Multi-GPU load balancing

Q4 2025:

Metal IVF support
Unified GPU API across backends
Advanced memory management

Troubleshooting

GPU Not Detected


-- Check GPU info
SELECT * FROM neurondb_gpu_info();

-- If empty, check:
-- 1. GPU drivers installed
-- 2. PostgreSQL has GPU access
-- 3. Correct backend compiled
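
Checks 1 and 2 happen outside the database, but the third can be partly narrowed down from SQL. This sketch uses only standard PostgreSQL catalogs. The extension name neurondb is an assumption based on the setting prefix used on this page.

```sql
-- Is the extension installed in this database, and which version?
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'neurondb';   -- assumed extension name

-- Which libraries are preloaded at server start?
-- (whether NeuronDB requires preloading is not stated on this page)
SHOW shared_preload_libraries;

-- Which GPU-related settings does the running build expose?
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'neurondb.gpu%';
```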

Performance Issues


-- Check GPU utilization
SELECT * FROM neurondb.llm_gpu_utilization();

-- Monitor memory usage
SELECT neurondb.gpu_memory_usage();

-- Adjust batch size if needed
SET neurondb.gpu_batch_size = 500;  -- Reduce if OOM

Metal-Specific Issues


-- Reduce batch size for Metal
SET neurondb.gpu_batch_size = 500;  -- Lower than CUDA/ROCm

-- Disable Metal if issues persist
SET neurondb.compute_mode = false;  -- Fall back to CPU

Related Documentation

CUDA Support - CUDA setup and configuration
ROCm Support - ROCm setup and configuration
Metal Support - Metal setup and configuration
Benchmark Results - Performance numbers
Configuration Guide - All GPU settings
