Validating tall / long / tidy data #5288

aberges-grd · 2022-06-10T11:29:34Z


What is tidy data?

As you may know, tidy (also called tall or long) data is a common format in data science: few columns and lots of rows. Essentially, you encode semantics in the column names and let categorical columns do the indexing/grouping. This is particularly well suited to e.g. time series from industrial settings, where variables usually depend on time and have a hierarchical structure (a variable comes from a sensor ∈ machine part ∈ machine ∈ manufacturing group ∈ factory). So you usually organize the data as a timestamp column, a value column, and one or more "name" columns.

What do I want?

I want Great Expectations to either fully support tidy data as the "go-to" format for hierarchical data structures such as the ones that arise in industry, OR provide a way to "pivot" the data (e.g. via pandas) before passing it to a suite. Ideally, add support for multi-index dataframes too, so I can run table-level checks on the levels of the MultiIndex (e.g. check that all required sensors are present).
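To make the "pivot" option concrete, here is a minimal sketch in plain pandas (not a Great Expectations API). The column names `timestamp`, `machine`, `sensor` and `value`, the sample rows, and the `required_sensors` set are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical long-format sample: one row per (timestamp, machine, sensor).
long_df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2022-06-10 00:00", "2022-06-10 00:00",
             "2022-06-10 00:01", "2022-06-10 00:01"]
        ),
        "machine": ["mill", "mill", "mill", "mill"],
        "sensor": ["temperature", "pressure", "temperature", "pressure"],
        "value": [71.5, 2.1, 72.0, 2.0],
    }
)

# Pivot to wide format: one row per timestamp, one column per
# (machine, sensor) pair. The columns become a two-level MultiIndex.
wide_df = long_df.pivot(
    index="timestamp", columns=["machine", "sensor"], values="value"
)

# Table-level check on a MultiIndex level: are all required sensors present?
required_sensors = {"temperature", "pressure"}  # assumed requirement
present_sensors = set(wide_df.columns.get_level_values("sensor"))
missing_sensors = required_sensors - present_sensors

# Flattened copy with plain string column names, in case the downstream
# validation tool only accepts single-level columns.
flat_df = wide_df.copy()
flat_df.columns = ["__".join(col) for col in wide_df.columns]
```

After the pivot, each time series is an ordinary column, so ordinary column expectations could be applied to `flat_df`. The cost is that every new sensor adds a column, and therefore a new set of expectations.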

What would it mean to support tidy data? Essentially, being able to apply "column expectations" inside GroupBy objects. For example, I could check that the group [temperature sensor, milling machine] (which defines one time series) has all its values between MIN and MAX (security / operational levels) and a median within a nominal value ± tolerance.
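The kind of per-group check described above can be sketched in plain pandas as follows. This is not an existing Great Expectations feature; the function `check_groups`, the `limits` dictionary layout, the column names, and the sample values are all hypothetical.

```python
import pandas as pd

def check_groups(df, keys, limits, value_col="value"):
    """Check min/max range and median tolerance for each group.

    `limits` maps a group key tuple to a dict with the entries
    "min", "max", "nominal" and "tol" (an assumed layout).
    """
    stats = df.groupby(keys)[value_col].agg(["min", "max", "median"])
    results = {}
    for key, lim in limits.items():
        if key not in stats.index:
            # The expected group has no rows at all.
            results[key] = {"present": False, "in_range": False, "median_ok": False}
            continue
        row = stats.loc[key]
        results[key] = {
            "present": True,
            # Every value lies within [min, max].
            "in_range": bool(row["min"] >= lim["min"] and row["max"] <= lim["max"]),
            # The median lies within nominal ± tol.
            "median_ok": bool(abs(row["median"] - lim["nominal"]) <= lim["tol"]),
        }
    return results

# Hypothetical sample data and limits.
df = pd.DataFrame(
    {
        "machine": ["mill"] * 6,
        "sensor": ["temperature"] * 3 + ["pressure"] * 3,
        "value": [70.0, 71.0, 75.0, 2.0, 2.1, 9.5],
    }
)
limits = {
    ("mill", "temperature"): {"min": 60.0, "max": 80.0, "nominal": 71.0, "tol": 2.0},
    ("mill", "pressure"): {"min": 1.0, "max": 5.0, "nominal": 2.0, "tol": 0.5},
}
results = check_groups(df, ["machine", "sensor"], limits)
```

In this sample, the pressure group fails the range check because of the single 9.5 reading, while its median still passes. That is the per-series granularity a grouped expectation would need to report.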

Replies: 0 comments
