
Add health checks to the databases #1181

@Vasilije1990


Issue Description

The current /health endpoint in cognee/api/client.py only returns a basic 200 status code without checking the actual health of backend components. For production deployments, container orchestration, and monitoring systems, we need a comprehensive health check that verifies all critical backend services are accessible and functioning properly.

Current State

Existing Health Endpoint (/health):

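from fastapi import FastAPI, Response

app = FastAPI()  # the real module defines `app` elsewhere; shown so the excerpt runs

@app.get("/health")
def health_check():
    """Health check endpoint that returns the server status."""
    return Response(status_code=200)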

Problems:

  • No actual health verification of backend services
  • Cannot detect database connectivity issues
  • Cannot identify LLM provider failures
  • No differentiation between critical and non-critical failures
  • Limited monitoring and observability data
  • No startup readiness verification

Requirements

1. Backend Components to Health Check

Critical Services (failure should return 503):

  • Relational Database: SQLite/PostgreSQL connectivity and schema validation
  • Vector Database: LanceDB/Qdrant/PGVector/FalkorDB/ChromaDB connectivity
  • Graph Database: Kuzu/Neo4j/FalkorDB/Memgraph connectivity and schema validation
  • File Storage: Local filesystem/S3 accessibility and permissions
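As a rough illustration of what "connectivity and schema validation" and "accessibility and permissions" could mean for the two simplest backends, the sketch below probes a SQLite file and a local storage directory using only the Python standard library. The function names (`check_sqlite`, `check_local_storage`), the 5-second timeout, and the `expected_tables` parameter are assumptions for illustration; cognee's own database adapters, and the PostgreSQL, vector, graph and S3 backends, would each need their own probe.

```python
import sqlite3
import tempfile
import time
from pathlib import Path
from typing import Iterable


def _result(status: str, provider: str, start: float, details: str) -> dict:
    # Shape mirrors one entry of the "components" object in the response format.
    return {
        "status": status,
        "provider": provider,
        "response_time_ms": round((time.perf_counter() - start) * 1000),
        "details": details,
    }


def check_sqlite(db_path: str, expected_tables: Iterable[str] = ()) -> dict:
    """Hypothetical probe: open an existing SQLite file and confirm expected tables exist."""
    start = time.perf_counter()
    # mode=rw refuses to create a missing file, so a wrong path is reported
    # as unhealthy instead of silently producing an empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=rw"
    try:
        conn = sqlite3.connect(uri, uri=True, timeout=5)
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            ).fetchall()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return _result("unhealthy", "sqlite", start, f"Connection failed: {exc}")
    missing = set(expected_tables) - {name for (name,) in rows}
    if missing:
        return _result("unhealthy", "sqlite", start, f"Missing tables: {sorted(missing)}")
    return _result("healthy", "sqlite", start, "Connection successful")


def check_local_storage(root: str) -> dict:
    """Hypothetical probe: create, write, read back and delete a small file under root."""
    start = time.perf_counter()
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=root, prefix=".healthcheck-") as f:
            f.write(b"ok")
            f.flush()
            f.seek(0)
            if f.read() != b"ok":
                raise OSError("read-back mismatch")
    except OSError as exc:
        return _result("unhealthy", "local", start, f"Storage check failed: {exc}")
    return _result("healthy", "local", start, "Storage accessible")
```

Opening with `mode=rw` matters for SQLite in particular: a plain `sqlite3.connect(path)` would create a fresh empty database at a mistyped path and report success.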

Non-Critical Services (failure should return 200 with warnings):

  • LLM Provider: OpenAI/Ollama/Anthropic/Custom/Gemini API connectivity
  • Embedding Service: Embedding engine responsiveness
  • Cloud Storage: S3/Azure/GCS extended connectivity (if configured)
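The critical/non-critical split maps directly onto the overall status. A minimal sketch of that aggregation, assuming component results shaped like the response format below and hypothetical component keys: any failing critical component yields `unhealthy` with HTTP 503, a failing non-critical one yields `degraded` with HTTP 200, and the names of failing components are returned as the warnings.

```python
from typing import List, Mapping, Tuple

# Hypothetical keys matching the "components" object of the response format.
# Any component not listed here (llm_provider, embedding_service, cloud storage, ...)
# is treated as non-critical.
CRITICAL = frozenset({"relational_db", "vector_db", "graph_db", "file_storage"})


def overall_status(
    components: Mapping[str, Mapping[str, object]],
) -> Tuple[str, int, List[str]]:
    """Return (overall status, HTTP status code, sorted names of failing components)."""
    # Anything other than "healthy" (including "degraded") counts as a failure here.
    failing = sorted(
        name for name, result in components.items() if result.get("status") != "healthy"
    )
    if any(name in CRITICAL for name in failing):
        return "unhealthy", 503, failing
    if failing:
        return "degraded", 200, failing
    return "healthy", 200, []
```

Returning every failing name, critical ones included, keeps a 503 response useful for debugging rather than only signalling that something is wrong.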

2. Health Check Endpoints

Primary Endpoints:

  • GET /health - Basic liveness probe (existing, enhanced)
  • GET /health/ready - Readiness probe for Kubernetes
  • GET /health/detailed - Comprehensive health status with component details
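The endpoints differ mainly in which checks they run. In Kubernetes, a failing liveness probe causes the kubelet to kill the container (which is then restarted according to its restart policy), while a failing readiness probe only removes the pod from Service endpoints; one common design therefore keeps liveness cheap and puts the backend checks behind readiness. The sketch below assumes hypothetical async check functions (any zero-argument coroutine returning a component dict), runs them concurrently so one slow backend does not delay the others, and bounds each with a timeout. Framework wiring, such as FastAPI route handlers calling these functions, is left out.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Mapping, Tuple

# A check is any zero-argument coroutine function returning a component dict.
Check = Callable[[], Awaitable[dict]]


async def _run_one(check: Check, timeout_s: float) -> dict:
    start = time.perf_counter()
    try:
        result = dict(await asyncio.wait_for(check(), timeout=timeout_s))
    except asyncio.TimeoutError:
        result = {"status": "unhealthy", "details": f"Timed out after {timeout_s}s"}
    except Exception as exc:  # a crashing probe is reported, not propagated
        result = {"status": "unhealthy", "details": f"{type(exc).__name__}: {exc}"}
    # Keep a probe-reported timing if present; otherwise record the wall time here.
    result.setdefault("response_time_ms", round((time.perf_counter() - start) * 1000))
    return result


async def run_checks(checks: Mapping[str, Check], timeout_s: float = 2.0) -> Dict[str, dict]:
    """Run all checks concurrently; each one is bounded by its own timeout."""
    names = list(checks)
    results = await asyncio.gather(*(_run_one(checks[n], timeout_s) for n in names))
    return dict(zip(names, results))


def liveness() -> int:
    # GET /health: touches no backend, so a slow database cannot trigger restarts.
    return 200


async def readiness(
    critical: Mapping[str, Check], timeout_s: float = 2.0
) -> Tuple[int, Dict[str, dict]]:
    # GET /health/ready: 503 until every critical backend answers healthily in time.
    components = await run_checks(critical, timeout_s)
    ready = all(r.get("status") == "healthy" for r in components.values())
    return (200 if ready else 503), components
```

A `/health/detailed` handler could call `run_checks` with both critical and non-critical checks and return the full component map; the 2-second default timeout is an arbitrary placeholder and would likely need to be longer for LLM and embedding calls.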

Response Format:

{
  "status": "healthy|degraded|unhealthy",
  "timestamp": "2024-01-15T10:30:45Z",
  "version": "1.0.0",
  "uptime": 3600,
  "components": {
    "relational_db": {
      "status": "healthy|unhealthy",
      "provider": "sqlite|postgres",
      "response_time_ms": 45,
      "details": "Connection successful"
    },
    "vector_db": {
      "status": "healthy|unhealthy", 
      "provider": "lancedb|qdrant|pgvector|falkordb|chromadb",
      "response_time_ms": 120,
      "details": "Index accessible"
    },
    "graph_db": {
      "status": "healthy|unhealthy",
      "provider": "kuzu|neo4j|falkordb|memgraph",
      "response_time_ms": 89,
      "details": "Schema validated"
    },
    "file_storage": {
      "status": "healthy|unhealthy",
      "provider": "local|s3|azure|gcs",
      "response_time_ms": 156,
      "details": "Storage accessible"
    },
    "llm_provider": {
      "status": "healthy|unhealthy|degraded",
      "provider": "openai|ollama|anthropic|custom|gemini",
      "response_time_ms": 1250,
      "details": "API responding"
    },
    "embedding_service": {
      "status": "healthy|unhealthy",
      "provider": "openai|huggingface|custom",
      "response_time_ms": 890,
      "details": "Embedding generation working"
    }
  }
}
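The response format leaves two details implicit: the timestamp appears to be UTC in ISO 8601 form with a `Z` suffix, and `uptime` is presumably whole seconds (the example shows 3600). A small sketch of the top-level envelope under those assumptions; `build_health_response` and its `version` argument are hypothetical, and measuring uptime from module import with `time.monotonic()` is one choice among several.

```python
import time
from datetime import datetime, timezone
from typing import Dict, Optional

# Captured once at import; monotonic time is unaffected by wall-clock changes.
_STARTED = time.monotonic()


def build_health_response(
    status: str,
    components: Dict[str, dict],
    version: str,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble the top-level body in the shape of the response format above."""
    # Aware datetimes are converted to UTC; astimezone() treats a naive one as local time.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "status": status,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "uptime": int(time.monotonic() - _STARTED),  # whole seconds since import
        "components": components,
    }
```

Passing `now` explicitly makes the timestamp deterministic in tests, for example `datetime(2024, 1, 15, 10, 30, 45, tzinfo=timezone.utc)` renders as `2024-01-15T10:30:45Z`.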
