Skip to content

feat: Add OceanBase database support #851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

yuanoOo
Copy link

@yuanoOo yuanoOo commented Apr 18, 2025

This PR introduces native support for OceanBase distributed database, enabling Vanna.AI to work seamlessly with OceanBase instances.

@yuanoOo
Copy link
Author

yuanoOo commented Apr 18, 2025

@wgong PTAL.

Copy link

@llamapreview llamapreview bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto Pull Request Review from LlamaPReview

1. Overview

1.1 Core Changes

  • Primary purpose and scope: Introduces native support for the OceanBase distributed database within Vanna.AI.
  • Key components modified: The VannaBase class in src/vanna/base/base.py is extended with a new connection method connect_to_oceanbase and a corresponding SQL execution function run_sql_oceanbase.
  • Cross-component impacts: Extends the base class functionality, allowing Vanna instances to connect to and query OceanBase databases using the existing run_sql interface. A new dialect type "OceanBase/MySQL" is introduced.
  • Business value alignment: Enables users of Vanna.AI to leverage their OceanBase infrastructure, expanding the tool's applicability to environments using this distributed database.

1.2 Technical Architecture

  • System design modifications: Adds a new database connector method following the established pattern for other databases within VannaBase.
  • Component interaction changes: Utilizes the pymysql library to interact with OceanBase via its MySQL-compatible protocol.
  • Integration points impact: Introduces a new dependency (pymysql) required specifically for OceanBase connections. Defines a new dialect identifier.
  • Dependency changes and implications: Adds pymysql as an optional dependency, which users will need to install if they intend to connect to OceanBase.

2. Critical Findings

2.1 Must Fix (P0🔴)

Issue: Port parameter assignment incorrectly uses the database variable.

  • Analysis Confidence: High
  • Impact: The port variable is assigned the value of the database variable instead of the intended port parameter or environment variable. This will lead to incorrect port numbers being used and guaranteed connection failures.
  • Resolution: Correct the variable assignment logic: port = port or os.getenv("OCEANBASE_PORT").

Issue: Port number obtained from environment variables is not converted to an integer.

  • Analysis Confidence: High
  • Impact: Environment variables are read as strings. The pymysql.connect function expects the port argument to be an integer. Passing a string will cause a TypeError during connection attempts.
  • Resolution: Convert the port variable to an integer after retrieving it, including error handling for invalid formats:
  try:
      port = int(port)
  except (ValueError, TypeError):
      raise ImproperlyConfigured(f"Invalid port number format: {port}")

2.2 Should Fix (P1🟡)

Issue: Persistent database connection without proper lifecycle management.

  • Analysis Confidence: High
  • Impact: The current implementation establishes a single connection when connect_to_oceanbase is called and reuses it for all subsequent run_sql calls. This connection can become stale or time out, leading to query failures. It also doesn't handle potential network interruptions or database restarts gracefully.
  • Suggested Solution: Implement a more robust connection handling strategy. Options include:
    1. Per-Query Connections: Establish a new connection within the run_sql_oceanbase function for each query and close it afterwards. This ensures freshness but adds overhead.
    2. Connection Pooling: Use a connection pooling library (like DBUtils or SQLAlchemy's pooling) to manage a pool of connections, handling reuse, testing, and recreation automatically.
    3. Basic Reconnection Logic: Add try/except blocks around connection usage in run_sql_oceanbase to attempt reconnection if a connection error occurs.

Issue: Lack of support for secure SSL/TLS connections.

  • Analysis Confidence: High
  • Impact: Connections are made without encryption by default. Sensitive data (credentials, query results) transmitted over the network are vulnerable to eavesdropping, especially in untrusted environments.
  • Suggested Solution: Allow SSL parameters to be passed via **kwargs to connect_to_oceanbase and forward them to pymysql.connect. Update documentation to show how to configure SSL (e.g., providing certificate paths).
  # Example within connect_to_oceanbase
  ssl_params = {k: v for k, v in kwargs.items() if k.startswith('ssl_')}
  conn = pymysql.connect(
      # ... other params ...
      **ssl_params # Pass SSL arguments
  )

Issue: Missing unit and integration tests for the new connector.

  • Analysis Confidence: High
  • Impact: Without tests, regressions or bugs in the OceanBase connection logic might go unnoticed. Error handling scenarios (invalid credentials, network issues, malformed SQL) are not verified.
  • Suggested Solution: Add unit tests mocking pymysql to verify connection parameter handling, environment variable fallback, error raising, and basic execution flow. Add integration tests (if infrastructure allows) against a test OceanBase instance to verify real connection and query execution.

2.3 Consider (P2🟢)

Area: Input validation for connection parameters.

  • Analysis Confidence: Medium
  • Improvement Opportunity: Enhance validation beyond simple existence checks. For instance, validate the host parameter format (e.g., using regex for hostname or IP address patterns) to provide earlier feedback on configuration errors.

Area: Error handling standardization.

  • Analysis Confidence: Medium
  • Improvement Opportunity: The code currently raises ValidationError for some database errors and generic Exception for others. Consider defining more specific custom exceptions (e.g., OceanBaseConnectionError, OceanBaseQueryError) inheriting from a base VannaDBError for more consistent error handling and reporting across different database types.

Area: Connection Metadata logging/storage.

  • Analysis Confidence: Low
  • Improvement Opportunity: Optionally retrieve and store/log the connected OceanBase server version upon successful connection. This can be useful for debugging and ensuring compatibility with specific OceanBase features. This could involve running a query like SELECT @@version; after connecting.

2.4 Summary of Action Items

  1. P0 (Must Fix): Correct the port variable assignment and add integer type conversion for the port before merge.
  2. P1 (Should Fix): Implement robust connection management (e.g., per-query connections or pooling), add SSL/TLS support, and introduce unit tests for the OceanBase connector. These should be addressed soon, potentially in follow-up PRs if not in this one.
  3. P2 (Consider): Enhance input validation, standardize error handling, and potentially add version detection as future improvements.

3. Technical Analysis

3.1 Code Logic Analysis

📁 src/vanna/base/base.py - connect_to_oceanbase

  • Submitted PR Code:
    # ... imports and class definition ...
          def connect_to_oceanbase(
            self,
            host: str = None,
            user: str = None,
            password: str = None,
            database: str = None,
            port: int = None, # Type hint is int, but env var is str
            **kwargs
          ):
            # ... docstring ...
            try:
              import pymysql  # OceanBase uses MySQL protocol
            except ImportError:
              # ... raise DependencyError ...

            # Get parameters from environment variables if not provided
            host = host or os.getenv("OCEANBASE_HOST")
            user = user or os.getenv("OCEANBASE_USER")
            password = password or os.getenv("OCEANBASE_PASSWORD")
            database = database or os.getenv("OCEANBASE_DATABASE")
            port = database or os.getenv("OCEANBASE_PORT") # CRITICAL BUG HERE

            # Parameter validation
            if not all([host, user, password, database, port]):
              raise ImproperlyConfigured("Missing OceanBase connection parameters")

            try:
              # Establish connection
              conn = pymysql.connect(
                host=host,
                user=user,
                password=password,
                database=database,
                port=port, # Potential TypeError if port is str from env
                cursorclass=pymysql.cursors.DictCursor,
                **kwargs # No explicit SSL handling
              )
            except pymysql.Error as e:
              raise ValidationError(f"Connection failed: {e}")

            # Define SQL execution function (uses persistent 'conn')
            def run_sql_oceanbase(sql: str) -> pd.DataFrame:
                # ... implementation using 'conn' ...

            # Set class properties
            self.dialect = "OceanBase/MySQL"
            self.run_sql = run_sql_oceanbase # Assigns the nested function
            self.run_sql_is_set = True
  • Analysis:
    • Current logic incorrectly assigns the database value to port if the port argument is None.
    • The port number read from the environment variable (os.getenv) is a string, but pymysql.connect expects an integer, leading to a potential TypeError.
    • Connection parameters (host, user, password, database, port) are validated only for existence, not format.
    • The connection (conn) is established once and stored implicitly via the closure of run_sql_oceanbase. This connection is reused for all subsequent queries, risking staleness or timeouts (P1 issue).
    • No explicit support for SSL/TLS encryption is provided (P1 issue).
    • Error handling for connection uses ValidationError, while the nested run_sql_oceanbase uses ValidationError for pymysql.Error but re-raises generic Exception for others.
  • LlamaPReview Suggested Improvements:
    # ... imports and class definition ...
          def connect_to_oceanbase(
            self,
            host: str = None,
            user: str = None,
            password: str = None,
            database: str = None,
            port: int = None,
            # Consider adding explicit ssl_options: dict = None or similar
            **kwargs
          ):
            # ... docstring ...
            try:
              import pymysql
            except ImportError:
              raise DependencyError("Required package: pip install pymysql")

            # Get parameters from environment variables if not provided
            host = host or os.getenv("OCEANBASE_HOST")
            user = user or os.getenv("OCEANBASE_USER")
            password = password or os.getenv("OCEANBASE_PASSWORD")
            database = database or os.getenv("OCEANBASE_DATABASE")
            # Correctly fetch port from argument or environment variable
            port_str = str(port) if port is not None else os.getenv("OCEANBASE_PORT") # Fetch as string first

            # Parameter validation (basic existence)
            if not all([host, user, password, database, port_str]):
              raise ImproperlyConfigured("Missing OceanBase connection parameters (host, user, password, database, port)")

            # Validate and convert port to integer (MUST FIX)
            try:
                port_int = int(port_str)
            except (ValueError, TypeError):
                raise ImproperlyConfigured(f"Invalid port number format: {port_str}")

            # Store connection parameters for potential later use (e.g., per-query connections)
            self._ob_conn_params = {
                "host": host,
                "user": user,
                "password": password,
                "database": database,
                "port": port_int,
                "cursorclass": pymysql.cursors.DictCursor,
                **kwargs # Pass through other kwargs like SSL options
            }

            # Define SQL execution function (Example: using per-query connection)
            def run_sql_oceanbase(sql: str) -> pd.DataFrame:
                try:
                    # Establish a fresh connection for each query
                    with pymysql.connect(**self._ob_conn_params) as conn:
                        with conn.cursor() as cursor:
                            cursor.execute(sql)
                            result = cursor.fetchall()
                            # It's often better practice to commit SELECTs in some DBs,
                            # but typically not needed for pymysql default autocommit mode.
                            # conn.commit() # Usually not needed for SELECT
                            df = pd.DataFrame(result)
                            return df
                except pymysql.Error as e:
                    # Consider logging the error and SQL statement here
                    # conn.rollback() # Not needed as 'with' handles context
                    raise ValidationError(f"SQL execution error: {e}")
                except Exception as e:
                    # Consider logging the error and SQL statement here
                    raise e # Re-raise other exceptions

            # Set class properties
            self.dialect = "OceanBase/MySQL"
            self.run_sql = run_sql_oceanbase
            self.run_sql_is_set = True
            # Optionally test connection here once
            try:
                self.run_sql("SELECT 1")
                # Optionally log successful connection or version
            except Exception as e:
                # Warn user if initial test connection fails
                print(f"Warning: Initial connection test to OceanBase failed: {e}")
  • Improvement rationale:
    • Technical benefits: Fixes critical bugs related to port handling. Adopts a safer per-query connection pattern (addressing P1). Allows passing arbitrary kwargs (like SSL options) to pymysql. Stores connection parameters for reuse. Includes an optional initial connection test.
    • Business value: Increases reliability by preventing connection failures due to bugs and stale connections. Enables secure connections (if SSL args are passed).
    • Risk assessment: The per-query connection pattern adds latency compared to a persistent connection but significantly improves robustness. Connection pooling would be a more advanced alternative balancing performance and robustness.

3.2 Key Quality Aspects

  • System scalability considerations: The current persistent connection model does not scale well, as a single Vanna instance holds one connection. Per-query connections or pooling are better for scalability and concurrent use.
  • Performance bottlenecks and optimizations: Establishing a new connection per query introduces overhead. For high-frequency query scenarios, connection pooling (P1 suggestion) would be a necessary optimization.
  • Testing strategy and coverage: Currently missing (P1 issue). A comprehensive test suite is needed to ensure reliability and prevent regressions.
  • Documentation needs: The docstring should be updated to mention the pymysql dependency, environment variable names, and how to pass SSL options (once supported). Examples should be provided.

4. Overall Evaluation

  • Technical assessment: The PR introduces the necessary components for OceanBase support using a compatible library (pymysql). However, it contains critical bugs (port handling) and lacks robustness in connection management and security (SSL). The core approach aligns with existing patterns but needs refinement.
  • Business impact: Positive, as it expands Vanna.AI's database compatibility to include OceanBase, potentially unlocking new user segments.
  • Risk evaluation: High risk in its current state due to the critical P0 bugs. Medium risk regarding reliability and security due to P1 issues (connection management, SSL, testing).
  • Notable positive aspects and good practices: Follows the existing connector pattern in VannaBase. Uses environment variables for configuration. Correctly identifies pymysql as the appropriate driver.
  • Implementation quality: Requires improvement. Critical bugs must be fixed, and P1 issues should be addressed for production readiness.
  • Final recommendation: Request Changes
    • The P0 issues (port assignment and type conversion) must be fixed before merging.
    • Addressing the P1 issues (connection management, SSL support, testing) is strongly recommended, either in this PR or immediate follow-ups, to ensure a reliable and secure integration.

💡 LlamaPReview Community
Have feedback on this AI Code review tool? Join our GitHub Discussions to share your thoughts and help shape the future of LlamaPReview.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant