-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
Great Expectations Bug Report
Describe the bug
When using row_condition
with a PostgreSQL datasource, Great Expectations fails to properly apply the condition filter to the SQL query. The generated SQL query for retrieving unexpected rows contains syntax errors (duplicate WHERE clauses) and completely ignores the row_condition filters. This results in all rows being evaluated instead of only the rows that match the specified condition.
To Reproduce
Code to reproduce the issue:
import great_expectations as gx
# Connection string
connection_string = "postgresql+psycopg://postgres:password@localhost:5432/testdb"
# Create GX context
context = gx.get_context()
# Add datasource
datasource = context.data_sources.add_postgres(
name="postgres_datasource",
connection_string=connection_string
)
# Define the result format configuration
result_format_config = {
"result_format": "COMPLETE",
"unexpected_index_column_names": ["customer_id"],
"return_unexpected_index_query": True,
"partial_unexpected_count": 1000,
"exclude_unexpected_values": False,
"include_unexpected_rows": True
}
# SQL query - Sample customer orders data
sql_query = """
SELECT
c.customer_id,
c.customer_name,
c.customer_type,
c.registration_date,
o.order_id,
o.order_amount,
o.order_status,
o.payment_method,
EXTRACT(YEAR FROM AGE(CURRENT_DATE, c.registration_date)) as years_as_customer,
COUNT(o.order_id) OVER (PARTITION BY c.customer_id) as total_orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
"""
# Add data asset
data_asset = datasource.add_query_asset(
name="customer_orders_analysis",
query=sql_query
)
# Get batch
batch_definition = data_asset.add_batch_definition("default_batch")
batch = batch_definition.get_batch()
# Create suite
suite = context.suites.add(gx.ExpectationSuite(name="test_suite"))
# Add expectation with row_condition
# This should only validate payment methods for premium customers with more than 5 years
expectation = gx.expectations.ExpectColumnValuesToNotBeInSet(
column="payment_method",
value_set=["check", "cash"],
row_condition='col("years_as_customer") > 5 AND col("customer_type") == "PREMIUM" AND col("total_orders") > 10',
condition_parser="great_expectations"
)
suite.add_expectation(expectation)
# Validate
validation_result = batch.validate(suite, result_format=result_format_config)
print(validation_result)
Generated SQL with syntax error:
The unexpected_index_query
in the validation result shows:
SELECT customer_id, payment_method
FROM SELECT *
FROM (SELECT c.customer_id, c.customer_name, ...) AS anon_1
WHERE true
WHERE payment_method IS NOT NULL AND payment_method IN ('check'::VARCHAR, 'cash'::VARCHAR);
Note the duplicate WHERE clauses (WHERE true WHERE
) and the complete absence of the row_condition filters (years_as_customer > 5
, customer_type == "PREMIUM"
, total_orders > 10
).
Expected behavior
The row_condition should be properly translated to SQL and applied as a filter. The expectation should only evaluate rows where:
years_as_customer > 5
customer_type == "PREMIUM"
total_orders > 10
Environment (please complete the following information):
- Operating System: MacOS
- Great Expectations Version: 1.5.1
- Data Source: PostgreSQL (using psycopg)
- Python Version: 3.9
Additional context
- The issue occurs specifically when using
row_condition
with SQL datasources - Tested with multiple condition combinations (using AND operators) and all exhibit the same behavior
- The validation incorrectly reports all rows as "unexpected" because the row_condition is not being applied
- This appears to be a bug in how Great Expectations translates the row_condition parser output to SQL for database sources
- The issue makes it impossible to use conditional expectations with SQL databases, forcing users to either pre-filter data in their queries or load entire datasets into memory