Replies: 1 comment
-
You probably have to use another heuristic to infer which column is which type. E.g. try to parse the first n values as dates: if most of them can be parsed as a date, it's probably a date column. Then try to parse them as floats: if most or all of them can be parsed as floats, it's numeric. The rest are strings. You can use some knowledge of what your data typically looks like to help figure these out. Define a pandera check for each type of column and apply the check corresponding to the inferred type to the appropriate columns.
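For example, a minimal sketch along those lines. The helper names (`parsable_fraction`, `infer_column_type`, `column_for`) and the `n`/`threshold` knobs are made up, it assumes pandera's polars integration (`pandera.polars`) and ISO-formatted dates, and the checks are written against the string representation since everything is loaded as strings:

```python
import polars as pl
import pandera.polars as pa
from datetime import date


def parsable_fraction(sample: list[str], parse) -> float:
    """Fraction of sampled values that the given parser accepts."""
    ok = 0
    for value in sample:
        try:
            parse(value)
            ok += 1
        except ValueError:
            pass
    return ok / len(sample)


def infer_column_type(values: list[str | None], n: int = 100, threshold: float = 0.9) -> str:
    """Try dates first, then floats; anything else stays a string column."""
    sample = [v for v in values[:n] if v is not None]
    if not sample:
        return "string"
    if parsable_fraction(sample, date.fromisoformat) >= threshold:
        return "date"
    if parsable_fraction(sample, float) >= threshold:
        return "float"
    return "string"


def column_for(inferred_type: str) -> pa.Column:
    """Build a pandera Column for an inferred type (the data is still all strings)."""
    if inferred_type == "date":
        return pa.Column(pl.Utf8, pa.Check.str_matches(r"^\d{4}-\d{2}-\d{2}$"))
    if inferred_type == "float":
        return pa.Column(pl.Utf8, pa.Check.str_matches(r"^-?\d+(\.\d+)?$"))
    return pa.Column(pl.Utf8)


df = pl.read_csv("some.csv", infer_schema=False)  # everything loads as strings
inferred = {col: infer_column_type(df[col].to_list()) for col in df.columns}
schema = pa.DataFrameSchema({col: column_for(kind) for col, kind in inferred.items()})
schema.validate(df)
```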
-
Hi, I've recently started using pandera and I really love the ability to quickly validate data; it saves a lot of tedious manual SQL/polars checks.
One issue I have encountered is that sometimes I will receive data from a client and the column names might not match my pandera schema perfectly, but I'd still like to run some basic validation checks. In my workflow I would often load a CSV dump with all columns as strings (`pl.read_csv("some.csv", infer_schema=False)`) and then want to validate it anyway. This would save a ton of time renaming columns to the right values, or having a back and forth with the client so that they send the files with the right column names. I understand that there needs to be something that tells pandera which columns should be what type, though.
With problematic CSV dumps I often find that polars will give an error when loading the file and trying to infer the schema, but that error won't be as comprehensive as the data validation provided by pandera. Ideally I would like to immediately know which rows are problematic, which currently I cannot achieve with either pandera or loading via polars directly. A rough sketch of the workflow I have in mind is below.
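To make the goal concrete, here is roughly what I am after, assuming pandera's polars integration (`pandera.polars`) and using made-up column names (`order_date`, `amount`); the catch is that the schema still needs the exact column names and types up front:

```python
import polars as pl
import pandera.polars as pa
from pandera.errors import SchemaErrors

# Load everything as strings so polars never fails on schema inference;
# let pandera do the actual validation afterwards.
df = pl.read_csv("some.csv", infer_schema=False)

schema = pa.DataFrameSchema(
    {
        # Illustrative columns: dates and numbers checked via their string form.
        "order_date": pa.Column(pl.Utf8, pa.Check.str_matches(r"^\d{4}-\d{2}-\d{2}$")),
        "amount": pa.Column(pl.Utf8, pa.Check.str_matches(r"^-?\d+(\.\d+)?$")),
    }
)

try:
    # lazy=True collects all failures instead of stopping at the first one.
    schema.validate(df, lazy=True)
except SchemaErrors as err:
    # failure_cases lists each offending value with the column and check it
    # failed, which is the "which rows are problematic" report I am after.
    print(err.failure_cases)
```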