-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: Add OceanBase database support #851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@wgong PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto Pull Request Review from LlamaPReview
1. Overview
1.1 Core Changes
- Primary purpose and scope: Introduces native support for the OceanBase distributed database within Vanna.AI.
- Key components modified: The
VannaBase
class insrc/vanna/base/base.py
is extended with a new connection methodconnect_to_oceanbase
and a corresponding SQL execution functionrun_sql_oceanbase
. - Cross-component impacts: Extends the base class functionality, allowing Vanna instances to connect to and query OceanBase databases using the existing
run_sql
interface. A new dialect type "OceanBase/MySQL" is introduced. - Business value alignment: Enables users of Vanna.AI to leverage their OceanBase infrastructure, expanding the tool's applicability to environments using this distributed database.
1.2 Technical Architecture
- System design modifications: Adds a new database connector method following the established pattern for other databases within
VannaBase
. - Component interaction changes: Utilizes the
pymysql
library to interact with OceanBase via its MySQL-compatible protocol. - Integration points impact: Introduces a new dependency (
pymysql
) required specifically for OceanBase connections. Defines a new dialect identifier. - Dependency changes and implications: Adds
pymysql
as an optional dependency, which users will need to install if they intend to connect to OceanBase.
2. Critical Findings
2.1 Must Fix (P0🔴)
Issue: Port parameter assignment incorrectly uses the database
variable.
- Analysis Confidence: High
- Impact: The
port
variable is assigned the value of thedatabase
variable instead of the intendedport
parameter or environment variable. This will lead to incorrect port numbers being used and guaranteed connection failures. - Resolution: Correct the variable assignment logic:
port = port or os.getenv("OCEANBASE_PORT")
.
Issue: Port number obtained from environment variables is not converted to an integer.
- Analysis Confidence: High
- Impact: Environment variables are read as strings. The
pymysql.connect
function expects theport
argument to be an integer. Passing a string will cause aTypeError
during connection attempts. - Resolution: Convert the
port
variable to an integer after retrieving it, including error handling for invalid formats:
try:
port = int(port)
except (ValueError, TypeError):
raise ImproperlyConfigured(f"Invalid port number format: {port}")
2.2 Should Fix (P1🟡)
Issue: Persistent database connection without proper lifecycle management.
- Analysis Confidence: High
- Impact: The current implementation establishes a single connection when
connect_to_oceanbase
is called and reuses it for all subsequentrun_sql
calls. This connection can become stale or time out, leading to query failures. It also doesn't handle potential network interruptions or database restarts gracefully. - Suggested Solution: Implement a more robust connection handling strategy. Options include:
- Per-Query Connections: Establish a new connection within the
run_sql_oceanbase
function for each query and close it afterwards. This ensures freshness but adds overhead. - Connection Pooling: Use a connection pooling library (like
DBUtils
orSQLAlchemy
's pooling) to manage a pool of connections, handling reuse, testing, and recreation automatically. - Basic Reconnection Logic: Add
try/except
blocks around connection usage inrun_sql_oceanbase
to attempt reconnection if a connection error occurs.
- Per-Query Connections: Establish a new connection within the
Issue: Lack of support for secure SSL/TLS connections.
- Analysis Confidence: High
- Impact: Connections are made without encryption by default. Sensitive data (credentials, query results) transmitted over the network are vulnerable to eavesdropping, especially in untrusted environments.
- Suggested Solution: Allow SSL parameters to be passed via
**kwargs
toconnect_to_oceanbase
and forward them topymysql.connect
. Update documentation to show how to configure SSL (e.g., providing certificate paths).
# Example within connect_to_oceanbase
ssl_params = {k: v for k, v in kwargs.items() if k.startswith('ssl_')}
conn = pymysql.connect(
# ... other params ...
**ssl_params # Pass SSL arguments
)
Issue: Missing unit and integration tests for the new connector.
- Analysis Confidence: High
- Impact: Without tests, regressions or bugs in the OceanBase connection logic might go unnoticed. Error handling scenarios (invalid credentials, network issues, malformed SQL) are not verified.
- Suggested Solution: Add unit tests mocking
pymysql
to verify connection parameter handling, environment variable fallback, error raising, and basic execution flow. Add integration tests (if infrastructure allows) against a test OceanBase instance to verify real connection and query execution.
2.3 Consider (P2🟢)
Area: Input validation for connection parameters.
- Analysis Confidence: Medium
- Improvement Opportunity: Enhance validation beyond simple existence checks. For instance, validate the
host
parameter format (e.g., using regex for hostname or IP address patterns) to provide earlier feedback on configuration errors.
Area: Error handling standardization.
- Analysis Confidence: Medium
- Improvement Opportunity: The code currently raises
ValidationError
for some database errors and genericException
for others. Consider defining more specific custom exceptions (e.g.,OceanBaseConnectionError
,OceanBaseQueryError
) inheriting from a baseVannaDBError
for more consistent error handling and reporting across different database types.
Area: Connection Metadata logging/storage.
- Analysis Confidence: Low
- Improvement Opportunity: Optionally retrieve and store/log the connected OceanBase server version upon successful connection. This can be useful for debugging and ensuring compatibility with specific OceanBase features. This could involve running a query like
SELECT @@version;
after connecting.
2.4 Summary of Action Items
- P0 (Must Fix): Correct the port variable assignment and add integer type conversion for the port before merge.
- P1 (Should Fix): Implement robust connection management (e.g., per-query connections or pooling), add SSL/TLS support, and introduce unit tests for the OceanBase connector. These should be addressed soon, potentially in follow-up PRs if not in this one.
- P2 (Consider): Enhance input validation, standardize error handling, and potentially add version detection as future improvements.
3. Technical Analysis
3.1 Code Logic Analysis
📁 src/vanna/base/base.py - connect_to_oceanbase
- Submitted PR Code:
# ... imports and class definition ...
def connect_to_oceanbase(
self,
host: str = None,
user: str = None,
password: str = None,
database: str = None,
port: int = None, # Type hint is int, but env var is str
**kwargs
):
# ... docstring ...
try:
import pymysql # OceanBase uses MySQL protocol
except ImportError:
# ... raise DependencyError ...
# Get parameters from environment variables if not provided
host = host or os.getenv("OCEANBASE_HOST")
user = user or os.getenv("OCEANBASE_USER")
password = password or os.getenv("OCEANBASE_PASSWORD")
database = database or os.getenv("OCEANBASE_DATABASE")
port = database or os.getenv("OCEANBASE_PORT") # CRITICAL BUG HERE
# Parameter validation
if not all([host, user, password, database, port]):
raise ImproperlyConfigured("Missing OceanBase connection parameters")
try:
# Establish connection
conn = pymysql.connect(
host=host,
user=user,
password=password,
database=database,
port=port, # Potential TypeError if port is str from env
cursorclass=pymysql.cursors.DictCursor,
**kwargs # No explicit SSL handling
)
except pymysql.Error as e:
raise ValidationError(f"Connection failed: {e}")
# Define SQL execution function (uses persistent 'conn')
def run_sql_oceanbase(sql: str) -> pd.DataFrame:
# ... implementation using 'conn' ...
# Set class properties
self.dialect = "OceanBase/MySQL"
self.run_sql = run_sql_oceanbase # Assigns the nested function
self.run_sql_is_set = True
- Analysis:
- Current logic incorrectly assigns the
database
value toport
if theport
argument isNone
. - The port number read from the environment variable (
os.getenv
) is a string, butpymysql.connect
expects an integer, leading to a potentialTypeError
. - Connection parameters (
host
,user
,password
,database
,port
) are validated only for existence, not format. - The connection (
conn
) is established once and stored implicitly via the closure ofrun_sql_oceanbase
. This connection is reused for all subsequent queries, risking staleness or timeouts (P1 issue). - No explicit support for SSL/TLS encryption is provided (P1 issue).
- Error handling for connection uses
ValidationError
, while the nestedrun_sql_oceanbase
usesValidationError
forpymysql.Error
but re-raises genericException
for others.
- Current logic incorrectly assigns the
- LlamaPReview Suggested Improvements:
# ... imports and class definition ...
def connect_to_oceanbase(
self,
host: str = None,
user: str = None,
password: str = None,
database: str = None,
port: int = None,
# Consider adding explicit ssl_options: dict = None or similar
**kwargs
):
# ... docstring ...
try:
import pymysql
except ImportError:
raise DependencyError("Required package: pip install pymysql")
# Get parameters from environment variables if not provided
host = host or os.getenv("OCEANBASE_HOST")
user = user or os.getenv("OCEANBASE_USER")
password = password or os.getenv("OCEANBASE_PASSWORD")
database = database or os.getenv("OCEANBASE_DATABASE")
# Correctly fetch port from argument or environment variable
port_str = str(port) if port is not None else os.getenv("OCEANBASE_PORT") # Fetch as string first
# Parameter validation (basic existence)
if not all([host, user, password, database, port_str]):
raise ImproperlyConfigured("Missing OceanBase connection parameters (host, user, password, database, port)")
# Validate and convert port to integer (MUST FIX)
try:
port_int = int(port_str)
except (ValueError, TypeError):
raise ImproperlyConfigured(f"Invalid port number format: {port_str}")
# Store connection parameters for potential later use (e.g., per-query connections)
self._ob_conn_params = {
"host": host,
"user": user,
"password": password,
"database": database,
"port": port_int,
"cursorclass": pymysql.cursors.DictCursor,
**kwargs # Pass through other kwargs like SSL options
}
# Define SQL execution function (Example: using per-query connection)
def run_sql_oceanbase(sql: str) -> pd.DataFrame:
try:
# Establish a fresh connection for each query
with pymysql.connect(**self._ob_conn_params) as conn:
with conn.cursor() as cursor:
cursor.execute(sql)
result = cursor.fetchall()
# It's often better practice to commit SELECTs in some DBs,
# but typically not needed for pymysql default autocommit mode.
# conn.commit() # Usually not needed for SELECT
df = pd.DataFrame(result)
return df
except pymysql.Error as e:
# Consider logging the error and SQL statement here
# conn.rollback() # Not needed as 'with' handles context
raise ValidationError(f"SQL execution error: {e}")
except Exception as e:
# Consider logging the error and SQL statement here
raise e # Re-raise other exceptions
# Set class properties
self.dialect = "OceanBase/MySQL"
self.run_sql = run_sql_oceanbase
self.run_sql_is_set = True
# Optionally test connection here once
try:
self.run_sql("SELECT 1")
# Optionally log successful connection or version
except Exception as e:
# Warn user if initial test connection fails
print(f"Warning: Initial connection test to OceanBase failed: {e}")
- Improvement rationale:
- Technical benefits: Fixes critical bugs related to port handling. Adopts a safer per-query connection pattern (addressing P1). Allows passing arbitrary
kwargs
(like SSL options) topymysql
. Stores connection parameters for reuse. Includes an optional initial connection test. - Business value: Increases reliability by preventing connection failures due to bugs and stale connections. Enables secure connections (if SSL args are passed).
- Risk assessment: The per-query connection pattern adds latency compared to a persistent connection but significantly improves robustness. Connection pooling would be a more advanced alternative balancing performance and robustness.
- Technical benefits: Fixes critical bugs related to port handling. Adopts a safer per-query connection pattern (addressing P1). Allows passing arbitrary
3.2 Key Quality Aspects
- System scalability considerations: The current persistent connection model does not scale well, as a single Vanna instance holds one connection. Per-query connections or pooling are better for scalability and concurrent use.
- Performance bottlenecks and optimizations: Establishing a new connection per query introduces overhead. For high-frequency query scenarios, connection pooling (P1 suggestion) would be a necessary optimization.
- Testing strategy and coverage: Currently missing (P1 issue). A comprehensive test suite is needed to ensure reliability and prevent regressions.
- Documentation needs: The docstring should be updated to mention the
pymysql
dependency, environment variable names, and how to pass SSL options (once supported). Examples should be provided.
4. Overall Evaluation
- Technical assessment: The PR introduces the necessary components for OceanBase support using a compatible library (
pymysql
). However, it contains critical bugs (port handling) and lacks robustness in connection management and security (SSL). The core approach aligns with existing patterns but needs refinement. - Business impact: Positive, as it expands Vanna.AI's database compatibility to include OceanBase, potentially unlocking new user segments.
- Risk evaluation: High risk in its current state due to the critical P0 bugs. Medium risk regarding reliability and security due to P1 issues (connection management, SSL, testing).
- Notable positive aspects and good practices: Follows the existing connector pattern in
VannaBase
. Uses environment variables for configuration. Correctly identifiespymysql
as the appropriate driver. - Implementation quality: Requires improvement. Critical bugs must be fixed, and P1 issues should be addressed for production readiness.
- Final recommendation: Request Changes
- The P0 issues (port assignment and type conversion) must be fixed before merging.
- Addressing the P1 issues (connection management, SSL support, testing) is strongly recommended, either in this PR or immediate follow-ups, to ensure a reliable and secure integration.
💡 LlamaPReview Community
Have feedback on this AI Code review tool? Join our GitHub Discussions to share your thoughts and help shape the future of LlamaPReview.
This PR introduces native support for OceanBase distributed database, enabling Vanna.AI to work seamlessly with OceanBase instances.