Data type checks only generate failure cases when nullable is set to false (pyspark.pandas) #2150

@kukuhsetiawan-ec

Describe the bug
Data type checks only generate failure cases when nullable is set to false.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pyspark.pandas as ps
import pandera.pyspark as pa


schema = pa.DataFrameSchema.from_yaml(
    "./test_schema.yaml"
)
print(schema, "<<< schema")

df = ps.DataFrame(
    {
        "distance": ["15.2", "test", "13", "14", "NY", "GA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": ["test", "12", "True", "16", "20", "18"],
    }
)
print(df)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases, "<<< failure_cases")

Contents of ./test_schema.yaml:

schema_type: dataframe
version: 0.25.0
columns:
  distance:
    title: null
    description: null
    dtype: float
    nullable: true
    checks: null
    unique: false
    coerce: true
    required: true
    regex: false
  city:
    title: null
    description: null
    dtype: string[python]
    nullable: true
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
  price:
    title: null
    description: null
    dtype: Int64
    nullable: true
    checks: null
    unique: false
    coerce: false
    required: true
    regex: false
checks: null
index:
  - title: null
    description: null
    dtype: int64
    nullable: true
    checks: null
    name: null
    unique: false
    coerce: false
dtype: null
coerce: false
strict: false
name: null
ordered: false
unique: null
report_duplicates: all
unique_column_names: false
add_missing_columns: false
title: null
description: null

Expected behavior

As stated in the docs (https://pandera.readthedocs.io/en/stable/dtype_validation.html#how-data-types-interact-with-nullable), Pandera considers type checks a first-class check that’s distinct from any downstream check that we may want to apply to the data.

I expected that setting nullable: true in the schema (e.g. for the distance column) would still produce failure cases when values have the wrong data type. However, even with coerce: true, no failure cases are generated. They only appear when coerce is set to true and nullable is set to false.
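
To make the expectation concrete, here is a minimal plain-Python sketch of the semantics I had in mind for the distance column. It does not use pandera and is not how pandera implements its checks; the function name `float_failure_cases` and the reason labels are hypothetical and exist only for illustration. The point is that values that cannot be coerced to float are reported whether or not the column is nullable, while nullable only decides whether missing values count as failures.

```python
from typing import Any, List, Optional, Tuple


def float_failure_cases(
    values: List[Optional[Any]], nullable: bool
) -> List[Tuple[int, Any, str]]:
    """Return (index, value, reason) for every value that fails a float check.

    Hypothetical helper for illustration only; not part of pandera.
    """
    failures = []
    for index, value in enumerate(values):
        if value is None:
            # Missing values are failures only when the column is not nullable.
            if not nullable:
                failures.append((index, value, "not_nullable"))
            continue
        try:
            float(value)
        except (TypeError, ValueError):
            # A non-coercible value is a failure regardless of nullable.
            failures.append((index, value, "cannot_coerce_to_float"))
    return failures


# Same values as the distance column in the code sample above.
distance = ["15.2", "test", "13", "14", "NY", "GA"]

nullable_true = float_failure_cases(distance, nullable=True)
nullable_false = float_failure_cases(distance, nullable=False)

print(nullable_true)
print(nullable_false)
```

Under these assumed semantics both calls report the same three values ("test", "NY", "GA"), which is what I expected pandera to do for nullable: true as well.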

Desktop (please complete the following information):

  • OS: Windows
  • Version: ^0.25.0

Screenshots

This is the expected behavior, but it only occurs when nullable is set to false.
[Screenshot: failure cases reported when nullable is set to false]

Metadata

    Labels

    bug (Something isn't working)
