Replies: 1 comment
-
You probably have to use another heuristic to infer which column is which type. E.g. try to parse the first n values as dates: if most of them can be parsed as a date, it's probably a date column. Then try to parse them as floats: if most or all of them can be parsed as floats, it's numeric. The rest are strings. You can use some knowledge of what your data typically looks like to help figure these out. Define a pandera check for each type of column and apply the check corresponding to the inferred type to the appropriate columns.
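For example, a minimal sketch along those lines. The helper names (`parsable_fraction`, `infer_column_type`, `column_for`) and the `n`/`threshold` knobs are made up, it assumes pandera's polars integration (`pandera.polars`) and ISO-formatted dates, and the checks are written against the string representation since everything is loaded as strings:

```python
import polars as pl
import pandera.polars as pa
from datetime import date


def parsable_fraction(sample: list[str], parse) -> float:
    """Fraction of sampled values that the given parser accepts."""
    ok = 0
    for value in sample:
        try:
            parse(value)
            ok += 1
        except ValueError:
            pass
    return ok / len(sample)


def infer_column_type(values: list[str | None], n: int = 100, threshold: float = 0.9) -> str:
    """Try dates first, then floats; anything else stays a string column."""
    sample = [v for v in values[:n] if v is not None]
    if not sample:
        return "string"
    if parsable_fraction(sample, date.fromisoformat) >= threshold:
        return "date"
    if parsable_fraction(sample, float) >= threshold:
        return "float"
    return "string"


def column_for(inferred_type: str) -> pa.Column:
    """Build a pandera Column for an inferred type (the data is still all strings)."""
    if inferred_type == "date":
        return pa.Column(pl.Utf8, pa.Check.str_matches(r"^\d{4}-\d{2}-\d{2}$"))
    if inferred_type == "float":
        return pa.Column(pl.Utf8, pa.Check.str_matches(r"^-?\d+(\.\d+)?$"))
    return pa.Column(pl.Utf8)


df = pl.read_csv("some.csv", infer_schema=False)  # everything loads as strings
inferred = {col: infer_column_type(df[col].to_list()) for col in df.columns}
schema = pa.DataFrameSchema({col: column_for(kind) for col, kind in inferred.items()})
schema.validate(df)
```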
-
Hi, I've recently started using pandera and I really love the ability to quickly validate data; it saves a lot of tedious manual SQL/polars checks.
One issue I have encountered is that sometimes I will receive data from a client and the column names might not match my pandera schema perfectly, but I'd still like to run some basic validation checks. In my workflow I would often load a CSV dump with all columns as strings (`pl.read_csv("some.csv", infer_schema=False)`) and then want to validate it anyway. This would save a ton of time renaming columns to the right values, or having a back and forth with the client so that they send the files with the right column names. I understand that there needs to be something that tells pandera which columns should be what type, though.
With problematic CSV dumps I often find that polars will give an error when loading the file and trying to infer the schema, but that error won't be as comprehensive as the data validation provided by pandera. Ideally I would like to immediately know which rows are problematic, which currently I cannot achieve with either pandera or loading via polars directly. A rough sketch of the workflow I have in mind is below.
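To make the goal concrete, here is roughly what I am after, assuming pandera's polars integration (`pandera.polars`) and using made-up column names (`order_date`, `amount`); the catch is that the schema still needs the exact column names and types up front:

```python
import polars as pl
import pandera.polars as pa
from pandera.errors import SchemaErrors

# Load everything as strings so polars never fails on schema inference;
# let pandera do the actual validation afterwards.
df = pl.read_csv("some.csv", infer_schema=False)

schema = pa.DataFrameSchema(
    {
        # Illustrative columns: dates and numbers checked via their string form.
        "order_date": pa.Column(pl.Utf8, pa.Check.str_matches(r"^\d{4}-\d{2}-\d{2}$")),
        "amount": pa.Column(pl.Utf8, pa.Check.str_matches(r"^-?\d+(\.\d+)?$")),
    }
)

try:
    # lazy=True collects all failures instead of stopping at the first one.
    schema.validate(df, lazy=True)
except SchemaErrors as err:
    # failure_cases lists each offending value with the column and check it
    # failed, which is the "which rows are problematic" report I am after.
    print(err.failure_cases)
```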