Validating tall / long / tidy data #5288
aberges-grd
started this conversation in
Community Use Cases
Replies: 0 comments
What is tidy data?
As you may know, tidy data is a pretty common format in data science, with few columns and lots of rows. Essentially, you encode semantics in the column names and let categorical columns do the indexing/grouping. This is particularly well suited to e.g. time series from industrial settings, where variables usually have time dependence and a hierarchical structure (a variable comes from a sensor ∈ machine part ∈ machine ∈ manufacturing group ∈ factory), so you usually organize the data as a timestamp column, a value column, and one or more "name" columns.
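To make the layout concrete, here is a minimal pandas sketch of such a table. The column names (`timestamp`, `machine`, `sensor`, `value`) and the readings are made up for illustration; they are assumptions, not part of any real dataset.

```python
import pandas as pd

# Hypothetical sensor readings in tidy (long) form: one row per observation.
tidy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 00:00",
                "2024-01-01 00:00",
                "2024-01-01 00:01",
                "2024-01-01 00:01",
            ]
        ),
        "machine": ["milling", "milling", "milling", "milling"],
        "sensor": ["temperature", "vibration", "temperature", "vibration"],
        "value": [71.5, 0.02, 72.1, 0.03],
    }
)

# The categorical "name" columns do the indexing:
# each (machine, sensor) pair identifies one time series.
series_sizes = tidy.groupby(["machine", "sensor"]).size()
print(series_sizes)
```

Adding another sensor or machine only adds rows, never columns, which is what makes this shape convenient for hierarchical data.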
What do I want?
I want Great Expectations to either fully support tidy data as the "go-to" format for hierarchical data structures such as the ones arising in industry, OR provide a way to "pivot" the data (e.g. via pandas) prior to passing it to a suite. Ideally, it would also support multi-index dataframes, so I can do table checks on levels of the multi-index (e.g. check that all the sensors needed are present).
What would it mean to support tidy data?
Essentially: being able to apply "column expectations" inside GroupBy objects. So, for example, I could check that the group [temperature sensor, milling machine] (which would define a time series) has values within MIN and MAX (security / operational levels) and a median within a nominal value ± tolerance.
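For illustration, this is roughly what those checks look like in plain pandas today. It is only a sketch of the desired behaviour, not a Great Expectations feature: the `LIMITS` table, the `check_group` helper, the column names and all the numbers are hypothetical. The `pivot` call with a list of columns assumes pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical tidy data: one row per (timestamp, machine, sensor) reading.
tidy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:02"] * 2
        ),
        "machine": ["milling"] * 6,
        "sensor": ["temperature"] * 3 + ["vibration"] * 3,
        "value": [71.5, 72.1, 70.8, 0.02, 0.03, 0.02],
    }
)

# Hypothetical per-group limits: MIN/MAX plus nominal value and tolerance.
LIMITS = {
    ("milling", "temperature"): {"min": 20.0, "max": 90.0, "nominal": 70.0, "tol": 5.0},
}


def check_group(values: pd.Series, spec: dict) -> dict:
    """Apply 'column expectation'-style checks to one group's values."""
    return {
        "within_range": bool(values.between(spec["min"], spec["max"]).all()),
        "median_ok": bool(abs(values.median() - spec["nominal"]) <= spec["tol"]),
    }


# Run the checks inside each GroupBy group that has limits defined.
results = {}
for key, group in tidy.groupby(["machine", "sensor"]):
    spec = LIMITS.get(key)
    if spec is not None:
        results[key] = check_group(group["value"], spec)

# Alternative: pivot to wide form, giving MultiIndex columns (machine, sensor).
wide = tidy.pivot(index="timestamp", columns=["machine", "sensor"], values="value")

# Table-level check on a MultiIndex level: are all required sensors present?
required_sensors = {"temperature", "vibration"}
missing_sensors = required_sensors - set(wide.columns.get_level_values("sensor"))

print(results)
print(missing_sensors)
```

The loop over groups is the part I would like to express as expectations in a suite, and the last check is the kind of "table check on a multi-index level" mentioned above.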