
ExpectColumnValueLengthsToEqual is failing/raising exception when applied on a column having null values as well #10917

Open
suchintakp5 opened this issue Feb 6, 2025 · 2 comments

Comments

@suchintakp5

Describe the bug
ExpectColumnValueLengthsToEqual raises an exception in version 1.3.5 when applied to a column that also contains null values. This did not fail in version 0.18.

To Reproduce

import great_expectations as gx
import great_expectations.expectations as gxe

# Retrieve your Data Context
data_context = gx.get_context(mode="ephemeral")
# Define the Data Source name
data_source_name = "source_system_name_spark_dataframe"
# Add the Data Source to the Data Context
data_source = data_context.data_sources.add_spark(name=data_source_name)
# Define the Data Asset name
data_asset_name = "dataset_name"
# Add a Data Asset to the Data Source
data_asset = data_source.add_dataframe_asset(name=data_asset_name)
# Define the Batch Definition name
batch_definition_name = "dataset_batch_definition"

# Add a Batch Definition to the Data Asset
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)

df = <A pyspark dataframe containing a few null values in a string column>

batch_parameters = {"dataframe": df}
# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)

test = gxe.ExpectColumnValueLengthsToEqual(column=<column_name>, value=<length>)
# Test the Expectation
validation_results = batch.validate(test, result_format="COMPLETE")
print(validation_results)

Expected behavior
No exception should be raised. For null values, either the length should be treated as zero, or the null values should not be considered part of the expectation result.
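A plain-Python sketch (not GX internals) of the expected null-handling semantics described above: null values are excluded from the length check, and only non-null values count toward the result. The sample values and column length are made up for illustration.

```python
# Illustrative sketch of the expected behavior, assuming nulls are excluded
# from the check (as in version 0.18) rather than raising an exception.
values = ["1234567", None, "1234567", None, "123"]  # hypothetical column values
expected_length = 7

# Nulls are filtered out before the length comparison.
non_null = [v for v in values if v is not None]
# Only non-null values can be "unexpected".
unexpected = [v for v in non_null if len(v) != expected_length]
success = len(unexpected) == 0

print(success, unexpected)  # -> False ['123']
```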

Environment (please complete the following information):

  • Great Expectations Version: [1.3.5]
  • Data Source: [Spark dataframe created from a csv file]
  • Cloud environment: [Azure Databricks]

Additional context

{
  "success": false,
  "expectation_config": {
    "type": "expect_column_value_lengths_to_equal",
    "kwargs": {
      "column": "ID Number",
      "value": 7.0,
      "batch_id": "source_system_name_spark_dataframe-dataset_name"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "('column_values.value_length.map', '0464e137b2cdb1dd819e7ee85c081f95', ())": {
      "exception_traceback": "Traceback (most recent call last):\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 532, in _process_direct_and_bundled_metric_computation_configurations\n    metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/expectations/metrics/metric_provider.py\", line 99, in inner_func\n    return metric_fn(*args, **kwargs)\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/column_function_partial.py\", line 239, in inner_func\n    ) = execution_engine.get_compute_domain(\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/execution_engine/sparkdf_execution_engine.py\", line 800, in get_compute_domain\n    data: pyspark.DataFrame = self.get_domain_records(domain_kwargs=domain_kwargs)\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/execution_engine/sparkdf_execution_engine.py\", line 689, in get_domain_records\n    data = data.filter(filter_condition.condition)\n  File \"/databricks/spark/python/pyspark/instrumentation_utils.py\", line 48, in wrapper\n    res = func(*args, **kwargs)\n  File \"/databricks/spark/python/pyspark/sql/dataframe.py\", line 3123, in filter\n    jdf = self._jdf.filter(condition)\n  File \"/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py\", line 1321, in __call__\n    return_value = get_return_value(\n  File \"/databricks/spark/python/pyspark/errors/exceptions.py\", line 234, in deco\n    raise converted from 
None\npyspark.errors.exceptions.ParseException: \n[PARSE_SYNTAX_ERROR] Syntax error at or near 'IS'.(line 1, pos 10)\n\n== SQL ==\nID Number IS NOT NULL\n----------^^^\n\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 276, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 279, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/local_disk0/.ephemeral_nfs/envs/pythonEnv-70771ece-6841-4d7b-a9e8-4a8bc864ed04/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 537, in _process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: \n[PARSE_SYNTAX_ERROR] Syntax error at or near 'IS'.(line 1, pos 10)\n\n== SQL ==\nID Number IS NOT NULL\n----------^^^\n\n",
      "exception_message": "\n[PARSE_SYNTAX_ERROR] Syntax error at or near 'IS'.(line 1, pos 10)\n\n== SQL ==\nID Number IS NOT NULL\n----------^^^\n",
      "raised_exception": true
    }
  }
}
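The traceback above shows the failing filter condition is the raw string `ID Number IS NOT NULL`: the column name contains a space and is passed to `DataFrame.filter` unquoted, so Spark's SQL parser chokes at `IS`. A minimal sketch of backtick-quoting, which is how Spark SQL escapes such identifiers (the helper name is hypothetical, not a GX or Spark API):

```python
# Hypothetical helper illustrating the suspected root cause: a column name
# containing a space must be backtick-quoted before being embedded in a
# Spark SQL expression such as a filter condition.
def quote_spark_identifier(name: str) -> str:
    """Backtick-quote a Spark SQL identifier, escaping embedded backticks."""
    return "`" + name.replace("`", "``") + "`"

# Unquoted, this is the string Spark failed to parse in the traceback:
print("ID Number IS NOT NULL")
# Quoted, the same condition parses cleanly:
print(quote_spark_identifier("ID Number") + " IS NOT NULL")
# -> `ID Number` IS NOT NULL
```

As a workaround until this is fixed, renaming the column to remove the space (e.g. `ID Number` to `ID_Number`) before validating should avoid the parse error.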
@adeola-ak
Contributor

Hi there. What is your Databricks Runtime version? I'm not able to replicate this. Can you also check whether the issue occurs in both single-user and multi-user clusters? Have you tested this outside of Databricks?

@suchintakp5
Author

This issue is reproducible in DBR version 15.4 in both single-user and multi-user clusters.
