Observability Stack | Prometheus, Grafana & Jaeger

Overview

The NeuronDB observability stack provides comprehensive monitoring, visualization, and distributed tracing for the entire ecosystem. The stack includes:

Prometheus - Metrics collection, alerting, and querying
Grafana - Pre-configured dashboards and visualization
Jaeger - Distributed tracing for request flows
Alertmanager - Alert routing and notification management

Key Features

Complete Coverage: All modules and variants monitored (NeuronDB, NeuronAgent, NeuronMCP, NeuronDesktop)
Detailed Metrics: Module-specific metrics with proper labeling
Comprehensive Alerts: 40+ alert rules for all critical failure modes
Performance Optimization: Recording rules for common queries
Production Ready: Alertmanager integration with notification routing
Pre-configured: Grafana dashboards and Prometheus rules included

Prometheus

Prometheus collects metrics from all NeuronDB ecosystem components and provides a query language (PromQL) for monitoring and alerting.

Configuration Files

The Prometheus configuration is located in the prometheus/ directory:

prometheus.yml - Main Prometheus configuration
alerts.yml - Alert rules (organized by module)
recording_rules.yml - Pre-computed metrics for performance
alertmanager.yml - Alertmanager configuration
postgres_exporter.yml - PostgreSQL exporter custom queries
service_discovery.yml - Service discovery reference

Quick Start

Start Prometheus with Docker Compose

# Start Prometheus
docker compose -f docker-compose.observability.yml up -d prometheus

# Access Prometheus UI
# http://localhost:9090

# Check targets
# http://localhost:9090/targets

Metrics Endpoints

All services expose Prometheus-compatible metrics:

NeuronDB: Via PostgreSQL exporter at :9187/metrics
NeuronAgent: :8080/metrics
NeuronDesktop API: :8081/metrics
Infrastructure: Node exporter (:9100/metrics), cAdvisor (:8080/metrics)
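
To spot-check what one of these endpoints returns, the text exposition format can be parsed with a few lines of Python. The following is a minimal sketch, not part of the NeuronDB tooling: the function names and the sample values are hypothetical, and the parser covers only plain sample lines (it does not unescape label values or handle exemplars).

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Download a /metrics page and parse it (requires the service to be running)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Illustrative payload only; the values are made up.
SAMPLE = """\
# HELP neurondb_agent_http_requests_total Total HTTP requests
# TYPE neurondb_agent_http_requests_total counter
neurondb_agent_http_requests_total{method="GET",endpoint="/health",status="200"} 42
neurondb_agent_database_connections_active 3
"""
```

With the stack running, fetch_metrics("http://localhost:8080/metrics") would return the NeuronAgent samples in the same shape, assuming the port is published on the host.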

📋 Complete Prometheus Documentation: See prometheus/README.md for detailed configuration, metrics reference, and alert rules.

Grafana

Grafana provides pre-configured dashboards for visualizing NeuronDB ecosystem metrics, performance data, and health status.

Quick Start

Start Grafana with Docker Compose

# Start Grafana
docker compose -f docker-compose.observability.yml up -d grafana

# Access Grafana UI
# http://localhost:3001
# Default credentials: admin/admin

# Grafana will automatically provision:
# - Prometheus datasource
# - Pre-configured dashboards

Pre-configured Dashboards

Grafana includes dashboards for:

NeuronDB: Database health, query performance, index health, cache metrics
NeuronAgent: Service availability, error rates, latency, execution metrics
NeuronDesktop: API availability, error rates, connection metrics
NeuronMCP: Service availability, tool execution, connection pool
Infrastructure: System resources, container health, network metrics

Dashboard Provisioning

Grafana dashboards are automatically provisioned from the grafana/provisioning/dashboards/ directory. The Prometheus datasource is configured in grafana/provisioning/datasources/prometheus.yml.

Custom Dashboards

Create custom dashboards in the Grafana UI, or add JSON files to the grafana/dashboards/ directory.
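
A dashboard JSON file can also be generated rather than written by hand. The sketch below builds a minimal dashboard with one time series panel per PromQL query. The dashboard title, uid, and panel layout are hypothetical, and the sketch assumes the provisioned Prometheus datasource is Grafana's default, since the panels do not name a datasource explicitly.

```python
import json


def build_dashboard(title, uid, panels):
    """Return a minimal Grafana dashboard dict.

    panels is a list of (panel_title, promql_expression) pairs; panels are
    laid out two per row on Grafana's 24-column grid.
    """
    return {
        "title": title,
        "uid": uid,
        "time": {"from": "now-1h", "to": "now"},
        "refresh": "30s",
        "panels": [
            {
                "id": index + 1,
                "title": panel_title,
                "type": "timeseries",
                "gridPos": {"h": 8, "w": 12, "x": (index % 2) * 12, "y": (index // 2) * 8},
                "targets": [{"expr": expr, "refId": "A"}],
            }
            for index, (panel_title, expr) in enumerate(panels)
        ],
    }


# Example panels built from metrics listed in the Metrics Reference.
EXAMPLE_PANELS = [
    ("Query rate by type", "sum by (query_type) (rate(neurondb_queries_total[5m]))"),
    ("Errors by type", "sum by (error_type) (rate(neurondb_errors_total[5m]))"),
]

if __name__ == "__main__":
    dashboard = build_dashboard("NeuronDB Custom Overview", "neurondb-custom", EXAMPLE_PANELS)
    print(json.dumps(dashboard, indent=2))
```

Redirecting the output into a file under grafana/dashboards/ (for example a hypothetical neurondb-custom.json) should make the dashboard available the next time Grafana loads its provisioned dashboards.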

Jaeger

Jaeger provides distributed tracing for request flows across all NeuronDB ecosystem components.

Quick Start

Start Jaeger with Docker Compose

# Start Jaeger
docker compose -f docker-compose.observability.yml up -d jaeger

# Access Jaeger UI
# http://localhost:16686

# Jaeger endpoints:
# - UI: :16686
# - OTLP gRPC: :4317
# - OTLP HTTP: :4318

Features

Distributed Tracing: Track requests across all services
Service Map: Visualize service dependencies
Trace Analysis: Identify bottlenecks and slow operations
Performance Insights: Understand request latency breakdown
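
To confirm that traces reach Jaeger, a single test span can be posted to the OTLP HTTP endpoint listed in the quick start. This is a hedged sketch using only the Python standard library and the OTLP JSON encoding; the function names, service name, and scope name are hypothetical. Real services would normally use an OpenTelemetry SDK instead of building payloads by hand.

```python
import json
import secrets
import time
import urllib.request


def build_span_payload(service_name, span_name, duration_ms, now_ns=None):
    """Build an OTLP/JSON trace payload containing one finished span."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }


def send_span(payload, endpoint="http://localhost:4318/v1/traces"):
    """POST the payload to the OTLP HTTP endpoint and return the HTTP status."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling send_span(build_span_payload("trace-smoke-test", "ping", 250)) against a running stack should make a service named trace-smoke-test appear in the Jaeger UI at http://localhost:16686.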

Docker Compose Setup

Use the docker-compose.observability.yml file to run the complete observability stack:

Start observability stack

# Start all observability services
docker compose -f docker-compose.observability.yml up -d

# Check status
docker compose -f docker-compose.observability.yml ps

# View logs
docker compose -f docker-compose.observability.yml logs -f

# Stop services
docker compose -f docker-compose.observability.yml down

Access URLs

Prometheus: http://localhost:9090
Grafana: http://localhost:3001 (admin/admin)
Jaeger: http://localhost:16686
Alertmanager: http://localhost:9093 (if enabled)
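
After starting the stack, the URLs above can be probed in one pass. The sketch below is an illustration under stated assumptions: it uses the /-/healthy endpoints of Prometheus and Alertmanager and the /api/health endpoint of Grafana, and it treats an HTTP 200 from the Jaeger UI root as "up". The function names are hypothetical.

```python
import urllib.error
import urllib.request

HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3001/api/health",
    "jaeger": "http://localhost:16686/",  # assumption: the UI root answers 200 when up
    "alertmanager": "http://localhost:9093/-/healthy",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (urllib.error.URLError, OSError):
        return None


def check_stack(urls=HEALTH_URLS, probe=http_status):
    """Map each service name to True when its endpoint answers with HTTP 200."""
    return {name: probe(url) == 200 for name, url in urls.items()}


if __name__ == "__main__":
    for service, healthy in check_stack().items():
        print(f"{service}: {'up' if healthy else 'DOWN'}")
```

The probe parameter exists so the check can be exercised without a running stack, for example by passing a function that returns canned status codes.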

Kubernetes Setup

The Helm chart includes the complete observability stack. Enable it in your values file:

Enable observability in Helm values

# values.yaml
monitoring:
  enabled: true
  prometheus:
    enabled: true
    retention: "30d"
    persistence:
      enabled: true
      size: "20Gi"
  grafana:
    enabled: true
    adminPassword: "change-me"  # Change in production!
    persistence:
      enabled: true
      size: "10Gi"
  jaeger:
    enabled: true

Access Services in Kubernetes

Port-forward to observability services

# Grafana
kubectl port-forward svc/neurondb-grafana 3001:3000 -n neurondb
# Access at: http://localhost:3001

# Prometheus
kubectl port-forward svc/neurondb-prometheus 9090:9090 -n neurondb
# Access at: http://localhost:9090

# Jaeger
kubectl port-forward svc/neurondb-jaeger 16686:16686 -n neurondb
# Access at: http://localhost:16686

Service Discovery

Kubernetes deployments use ServiceMonitor resources (a Prometheus Operator custom resource) for automatic service discovery, so Prometheus discovers and scrapes all NeuronDB ecosystem services without manually configured targets.
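
For a custom service that should be scraped alongside the bundled ones, a ServiceMonitor manifest can be generated as JSON and applied with kubectl apply -f. The sketch below shows the general shape of such a resource. It is not taken from the Helm chart: the resource name, the label selector, and the port name are hypothetical and must match your own Service.

```python
import json


def service_monitor(name, namespace, match_labels, port="metrics", path="/metrics", interval="30s"):
    """Return a ServiceMonitor manifest as a dict.

    match_labels must equal the labels on the target Service, and port must be
    the *name* of a port defined on that Service.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": match_labels},
            "namespaceSelector": {"matchNames": [namespace]},
            "endpoints": [{"port": port, "path": path, "interval": interval}],
        },
    }


if __name__ == "__main__":
    manifest = service_monitor(
        name="my-custom-service",                                # hypothetical
        namespace="neurondb",
        match_labels={"app.kubernetes.io/name": "my-custom-service"},  # hypothetical
    )
    print(json.dumps(manifest, indent=2))
```

Whether Prometheus picks up the resource also depends on the selector configured on the Prometheus instance, which this sketch does not cover.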

Metrics Reference

Key metrics exposed by each component:

NeuronDB Metrics

neurondb_queries_total - Total number of queries (by query_type, index_type)
neurondb_query_duration_seconds - Query duration histogram (by query_type)
neurondb_index_size_bytes - Index size in bytes (by index_name, index_type)
neurondb_vector_count - Number of vectors (by table_name)
neurondb_cache_hits_total - Cache hits (by cache_type)
neurondb_cache_misses_total - Cache misses (by cache_type)
neurondb_worker_status - Worker status (by worker_id, status)
neurondb_errors_total - Total errors (by error_type)
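
These metrics are typically combined in PromQL rather than read individually. The sketch below builds two common expressions (cache hit ratio and a latency quantile) and the URL for an instant query against the Prometheus HTTP API. The helper names are hypothetical, and the quantile query assumes the histogram exposes the conventional _bucket series with an le label.

```python
from urllib.parse import urlencode


def cache_hit_ratio(window="5m"):
    """PromQL for hits / (hits + misses) per cache_type over the given window."""
    hits = f"sum by (cache_type) (rate(neurondb_cache_hits_total[{window}]))"
    misses = f"sum by (cache_type) (rate(neurondb_cache_misses_total[{window}]))"
    return f"{hits} / ({hits} + {misses})"


def query_latency_quantile(quantile=0.95, window="5m"):
    """PromQL for a query-duration quantile per query_type."""
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le, query_type) (rate(neurondb_query_duration_seconds_bucket[{window}])))"
    )


def instant_query_url(expr, base_url="http://localhost:9090"):
    """URL for Prometheus' instant query endpoint, /api/v1/query."""
    params = urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{params}"


if __name__ == "__main__":
    print(cache_hit_ratio())
    print(instant_query_url(query_latency_quantile()))
```

The printed expressions can be pasted into the Prometheus UI at http://localhost:9090, and the printed URL can be fetched with any HTTP client to get the result as JSON.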

NeuronAgent Metrics

neurondb_agent_http_requests_total - Total HTTP requests (by method, endpoint, status)
neurondb_agent_http_request_duration_seconds - HTTP request duration (by method, endpoint)
neurondb_agent_executions_total - Agent executions (by agent_id, status)
neurondb_agent_execution_duration_seconds - Execution duration (by agent_id)
neurondb_agent_llm_calls_total - LLM API calls (by model, status)
neurondb_agent_llm_tokens_total - LLM tokens (by model, type)
neurondb_agent_memory_chunks_stored_total - Memory chunks stored (by agent_id)
neurondb_agent_tool_executions_total - Tool executions (by tool_name, status)
neurondb_agent_database_connections_active - Active DB connections

NeuronDesktop Metrics

neurondesktop_api_requests_total - Total API requests (by endpoint, method)
neurondesktop_api_errors_total - API errors (by endpoint, error_type)
neurondesktop_api_request_duration_seconds - Request duration (by endpoint)
neurondesktop_active_connections - Active connections
neurondesktop_active_mcp_connections - Active MCP connections
neurondesktop_active_neurondb_connections - Active NeuronDB connections
neurondesktop_active_agent_connections - Active agent connections

📋 Complete Metrics Reference: See prometheus/README.md for all available metrics with descriptions and labels.

Alert Rules

Prometheus includes 40+ alert rules organized by module, covering all critical failure modes:

NeuronDB Alerts

NeuronDBServiceDown (Critical) - Service down > 1m
NeuronDBConnectionFailure (Critical) - >5 failures in 5m
NeuronDBHighQueryLatency (Warning) - P95 > 1s for 5m
NeuronDBIndexHealthDegraded (Warning) - Health < 80% for 5m
NeuronDBCacheHitRateLow (Warning) - Hit rate < 70% for 5m
NeuronDBConnectionPoolExhausted (Critical) - Utilization > 90% for 5m

NeuronAgent Alerts

NeuronAgentServiceDown (Critical) - Service down > 1m
NeuronAgentHighErrorRate (Critical) - Error rate > 5% for 5m
NeuronAgentHighLatency (Warning) - P95 > 1s for 5m
NeuronAgentExecutionFailure (Critical) - >10 failures in 5m
NeuronAgentDatabaseConnectionIssue (Warning) - >5 errors in 5m

Infrastructure Alerts

HighCPUUsage (Warning) - CPU > 80% for 5m
HighMemoryUsage (Warning) - Memory > 85% for 5m
HighDiskUsage (Warning) - Disk > 85% for 5m
PrometheusTargetDown (Critical) - Target down > 2m
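
Conditions such as "CPU > 80% for 5m" fire only after the threshold has been exceeded continuously for the stated duration; until then the alert is pending, and a single evaluation below the threshold resets it. The following simplified simulation illustrates that behavior. It is a teaching sketch with a hypothetical function name, not the rule evaluation code Prometheus runs.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each evaluation of a 'value > threshold for N' rule.

    samples is a time-ordered list of (timestamp_seconds, value) pairs, one per
    rule evaluation. States are "inactive", "pending", or "firing".
    """
    states = []
    active_since = None  # timestamp of the first evaluation in the current breach
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append((timestamp, "firing" if held >= for_seconds else "pending"))
        else:
            # Any evaluation at or below the threshold resets the timer.
            active_since = None
            states.append((timestamp, "inactive"))
    return states


if __name__ == "__main__":
    # CPU percentage sampled every 60 seconds; rule: CPU > 80 for 5m (300s).
    cpu = [(0, 75), (60, 85), (120, 79), (180, 90), (240, 90),
           (300, 90), (360, 90), (420, 90), (480, 90)]
    for timestamp, state in alert_states(cpu, threshold=80, for_seconds=300):
        print(f"t={timestamp:>3}s  {state}")
```

In this run the brief spike at t=60s never fires because the value drops again at t=120s; the sustained breach that starts at t=180s fires at t=480s.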

📋 Complete Alert Rules: See prometheus/alerts.yml for all alert rules with conditions and descriptions.

Additional Resources

Prometheus README (prometheus/README.md) - Complete Prometheus documentation
Alert Rules (prometheus/alerts.yml) - All alert definitions
Prometheus Config (prometheus/prometheus.yml) - Main configuration file
Prometheus Documentation - Official Prometheus docs
Grafana Documentation - Official Grafana docs
Jaeger Documentation - Official Jaeger docs