Skip to content

row_condition not applied correctly with PostgreSQL/SQL datasources #11319

@cfpalacio

Description

@cfpalacio

Great Expectations Bug Report

Describe the bug
When using row_condition with a PostgreSQL datasource, Great Expectations fails to properly apply the condition filter to the SQL query. The generated SQL query for retrieving unexpected rows contains syntax errors (duplicate WHERE clauses) and completely ignores the row_condition filters. This results in all rows being evaluated instead of only the rows that match the specified condition.

To Reproduce
Code to reproduce the issue:

import great_expectations as gx

# Connection string
connection_string = "postgresql+psycopg://postgres:password@localhost:5432/testdb"

# Create GX context
context = gx.get_context()

# Add datasource
datasource = context.data_sources.add_postgres(
    name="postgres_datasource",
    connection_string=connection_string
)

# Define the result format configuration
result_format_config = {
    "result_format": "COMPLETE",
    "unexpected_index_column_names": ["customer_id"],
    "return_unexpected_index_query": True,
    "partial_unexpected_count": 1000,
    "exclude_unexpected_values": False,
    "include_unexpected_rows": True
}

# SQL query - Sample customer orders data
sql_query = """
    SELECT 
        c.customer_id,
        c.customer_name,
        c.customer_type,
        c.registration_date,
        o.order_id,
        o.order_amount,
        o.order_status,
        o.payment_method,
        EXTRACT(YEAR FROM AGE(CURRENT_DATE, c.registration_date)) as years_as_customer,
        COUNT(o.order_id) OVER (PARTITION BY c.customer_id) as total_orders
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
"""

# Add data asset
data_asset = datasource.add_query_asset(
    name="customer_orders_analysis",
    query=sql_query
)

# Get batch
batch_definition = data_asset.add_batch_definition("default_batch")
batch = batch_definition.get_batch()

# Create suite
suite = context.suites.add(gx.ExpectationSuite(name="test_suite"))

# Add expectation with row_condition
# This should only validate payment methods for premium customers with more than 5 years
expectation = gx.expectations.ExpectColumnValuesToNotBeInSet(
    column="payment_method",
    value_set=["check", "cash"],
    row_condition='col("years_as_customer") > 5 AND col("customer_type") == "PREMIUM" AND col("total_orders") > 10',
    condition_parser="great_expectations"
)

suite.add_expectation(expectation)

# Validate
validation_result = batch.validate(suite, result_format=result_format_config)
print(validation_result)

Generated SQL with syntax error:
The unexpected_index_query in the validation result shows:

SELECT customer_id, payment_method 
FROM SELECT * 
FROM (SELECT c.customer_id, c.customer_name, ...) AS anon_1 
WHERE true 
WHERE payment_method IS NOT NULL AND payment_method IN ('check'::VARCHAR, 'cash'::VARCHAR);

Note the duplicate WHERE clauses (WHERE true WHERE) and the complete absence of the row_condition filters (years_as_customer > 5, customer_type == "PREMIUM", total_orders > 10).

Expected behavior
The row_condition should be properly translated to SQL and applied as a filter. The expectation should only evaluate rows where:

  • years_as_customer > 5
  • customer_type == "PREMIUM"
  • total_orders > 10

Environment (please complete the following information):

  • Operating System: MacOS
  • Great Expectations Version: 1.5.1
  • Data Source: PostgreSQL (using psycopg)
  • Python Version: 3.9

Additional context

  • The issue occurs specifically when using row_condition with SQL datasources
  • Tested with multiple condition combinations (using AND operators) and all exhibit the same behavior
  • The validation incorrectly reports all rows as "unexpected" because the row_condition is not being applied
  • This appears to be a bug in how Great Expectations translates the row_condition parser output to SQL for database sources
  • The issue makes it impossible to use conditional expectations with SQL databases, forcing users to either pre-filter data in their queries or load entire datasets into memory

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions