Description
Describe the bug
Data type checks only generate failure cases when nullable is set to false.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandera.
- (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import pyspark.pandas as ps
import pandera.pyspark as pa

schema = pa.DataFrameSchema.from_yaml(
    "./test_schema.yaml"
)
print(schema, "<<< schema")

df = ps.DataFrame(
    {
        "distance": ["15.2", "test", "13", "14", "NY", "GA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": ["test", "12", "True", "16", "20", "18"],
    }
)
print(df)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases, "<<< failure_cases")
# test_schema.yaml
schema_type: dataframe
version: 0.25.0
columns:
  distance:
    title: null
    description: null
    dtype: float
    nullable: true
    checks: null
    unique: false
    coerce: true
    required: true
    regex: false
  city:
    title: null
    description: null
    dtype: string[python]
    nullable: true
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
  price:
    title: null
    description: null
    dtype: Int64
    nullable: true
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
checks: null
index:
- title: null
  description: null
  dtype: int64
  nullable: true
  checks: null
  name: null
  unique: false
  coerce: false
dtype: null
coerce: false
strict: false
name: null
ordered: false
unique: null
report_duplicates: all
unique_column_names: false
add_missing_columns: false
title: null
description: null
Expected behavior
As stated in the docs (https://pandera.readthedocs.io/en/stable/dtype_validation.html#how-data-types-interact-with-nullable), Pandera considers type checks a first-class check that's distinct from any downstream check that we may want to apply to the data.
I expected that having nullable: true in the schema (e.g. for the distance column) would still produce failure cases when encountering wrong data types. However, even with coerce: true, no failure cases are generated. They only appear when coerce is set to true and nullable is set to false.
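To make the expectation concrete, here is a minimal pure-Python sketch of the semantics I would expect for the distance column. It is not pandera's implementation: the helper `float_coercion_failures` and the record keys are hypothetical names chosen for illustration, and the sketch assumes that a value fails the type check whenever Python's built-in `float()` cannot parse it.

```python
from typing import Any


def float_coercion_failures(values: list[Any], nullable: bool) -> list[dict]:
    """Hypothetical helper: one failure record per value that is not a valid float.

    The nullable flag only decides whether None is reported. It has no
    effect on whether non-numeric strings are reported.
    """
    failures = []
    for index, value in enumerate(values):
        if value is None:
            # Nulls are a separate concern from the dtype check.
            if not nullable:
                failures.append(
                    {"index": index, "failure_case": None, "check": "not_nullable"}
                )
            continue
        try:
            float(value)
        except (TypeError, ValueError):
            # A wrong-typed value is a failure regardless of nullable.
            failures.append(
                {"index": index, "failure_case": value, "check": "coerce_to_float"}
            )
    return failures


# Same values as the "distance" column in the code sample.
distance = ["15.2", "test", "13", "14", "NY", "GA"]
nullable_failures = float_coercion_failures(distance, nullable=True)
strict_failures = float_coercion_failures(distance, nullable=False)
```

Under these assumptions both calls report the same three values ("test", "NY", "GA"), because the column contains no nulls. That is the behavior I expected from the schema above with nullable: true.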
Desktop (please complete the following information):
- OS: Windows
- Version: ^0.25.0
Screenshots
This is the expected behavior, but it only occurs when nullable is set to false.