feat(api): dataset fields statistics #1360

MFori · 2024-12-18T10:20:44Z

Solves https://github.com/apify/apify-core/issues/18807. Proposes a new API endpoint, /v2/datasets/{datasetId}/field-statistics, that returns dataset field statistics.

mtrunkat · 2024-12-18T15:13:32Z

apify-api/openapi/components/schemas/datasets/GetDatasetFieldStatisticsResponse.yaml

+        type: array
+        items:
+          type: string
+        description: 'Keys of the fields for which the statistics are provided.'


Is this the list of all fields from dataset.fields in the DB, or really the list of fields for which we have statistics? If the latter, the question is whether we need to return this redundant information, as

response.data.fields === Object.keys(response.data.statistics)

It's the list of all fields defined in dataset.schema.fields.

I was thinking the same, but because we store it separately and the docs contain this sentence: "When you configure the dataset fields schema, we generate a field list and measure the following statistics", I thought there was some reason for it.

But from the implementation it looks like all fields will always be present in the statistics, which means

response.data.fields === Object.keys(response.data.statistics)

If a field defined in the dataset schema never appears in the dataset itself, it will have emptyCount=number_of_items.
And if a field is in the dataset but isn't defined in the dataset schema, it won't be in the statistics.

So I assume you are right and it's redundant.
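To make the reasoning above concrete, here is a minimal sketch of the behavior as described in this thread, not the actual implementation. The function name `computeFieldStatistics`, the `Item` type, and the rule that "empty" means missing, `null`, or `undefined` are all assumptions made for illustration; only `emptyCount`, `min`, and `max` come from the discussion.

```typescript
// Hypothetical types: only emptyCount, min and max appear in the thread.
type Item = Record<string, unknown>;

interface FieldStatistics {
  emptyCount: number;
  min?: number;
  max?: number;
}

// Assumption: a value counts as "empty" when it is missing, null or undefined.
function isEmpty(value: unknown): boolean {
  return value === undefined || value === null;
}

// Statistics are keyed by the schema fields only, so a field that exists in
// the items but not in the schema never shows up in the result.
function computeFieldStatistics(
  schemaFields: string[],
  items: Item[],
): Record<string, FieldStatistics> {
  const statistics: Record<string, FieldStatistics> = {};
  for (const field of schemaFields) {
    const stats: FieldStatistics = { emptyCount: 0 };
    for (const item of items) {
      const value = item[field];
      if (isEmpty(value)) {
        stats.emptyCount += 1;
      } else if (typeof value === "number") {
        stats.min = stats.min === undefined ? value : Math.min(stats.min, value);
        stats.max = stats.max === undefined ? value : Math.max(stats.max, value);
      }
    }
    statistics[field] = stats;
  }
  return statistics;
}

const schemaFields = ["name", "price", "neverPresent"];
const items: Item[] = [
  { name: "a", price: 100, extra: true },
  { name: null, price: 200 },
];
const statistics = computeFieldStatistics(schemaFields, items);
```

Under these assumptions, `neverPresent` gets an `emptyCount` equal to the number of items, `extra` is absent from the result, and the keys of `statistics` match `schemaFields`, which is why a separate `fields` array carries no extra information.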

In that case, I'd remove it; we can always add new properties later, but we can never remove them, as that would break existing integrations.

I already updated it, but now I wonder whether it would be better to have the fields right under the data object, e.g. { "data": { "name": { "emptyCount": 100 } } }

Or keep the data.statistics object so that we have a place to add other things under data in the future 🤔, such as
{ "data": { "statistics": { "name": { "emptyCount": 100 } } } }

@netmilk and @fnesveda, what do you think?

/v2/datasets/{datasetId}/field-statistics

{ "data":{ "statistics":{ "someValue":{ "emptyCount":100 }, "anotherValue":{ "min":100, "max":200, "emptyCount":0 } } } }

vs simply

{ "data":{ "someValue":{ "emptyCount":100 }, "anotherValue":{ "min":100, "max":200, "emptyCount":0 } } }

The former one is extensible and the latter one is simpler.

How about this?

/v2/datasets/{datasetId}/statistics

{ "data":{ "fieldStatistics":{ "someValue":{ "emptyCount":100 }, "anotherValue":{ "min":100, "max":200, "emptyCount":0 } } } }

That way, if we want to add more statistics about datasets later on, we can do it in the same endpoint.
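As a rough illustration of how a client might consume the shape proposed above, here is a sketch in TypeScript. The type and function names (`DatasetStatisticsResponse`, `datasetStatisticsPath`, `fieldsWithEmptyValues`) are hypothetical and not part of the PR. Treating `fieldStatistics` as possibly `null` is an assumption based on the later "fieldStatistics nullable" commit.

```typescript
// Hypothetical client-side types for the proposed response shape.
interface FieldStatistics {
  emptyCount: number;
  min?: number;
  max?: number;
}

interface DatasetStatisticsResponse {
  data: {
    // Assumed nullable, e.g. when no statistics exist for the dataset.
    fieldStatistics: Record<string, FieldStatistics> | null;
    // Other kinds of dataset statistics could be added here later
    // without changing the endpoint.
  };
}

// Builds the path of the proposed endpoint for a given dataset ID.
function datasetStatisticsPath(datasetId: string): string {
  return `/v2/datasets/${encodeURIComponent(datasetId)}/statistics`;
}

// Returns the names of fields that have at least one empty value.
function fieldsWithEmptyValues(response: DatasetStatisticsResponse): string[] {
  const fieldStatistics = response.data.fieldStatistics ?? {};
  return Object.entries(fieldStatistics)
    .filter(([, stats]) => stats.emptyCount > 0)
    .map(([field]) => field);
}

// The example payload from the comment above.
const example: DatasetStatisticsResponse = {
  data: {
    fieldStatistics: {
      someValue: { emptyCount: 100 },
      anotherValue: { min: 100, max: 200, emptyCount: 0 },
    },
  },
};

const path = datasetStatisticsPath("abc");
const emptyFields = fieldsWithEmptyValues(example);
```

Because the per-field objects sit under a named `fieldStatistics` key, a client like this keeps working if sibling properties are added under `data` later.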

The ticket https://github.com/apify/apify-core/issues/18807 had a note about this:

Basically we want the API to return the output of the data from the dataset statistics collection.
And the endpoint could potentially be /<datasetId>/stats or /<datasetId>/validation-statistics if it's the first one then we might want to add also the normal dataset statistics there, so that might be confusing...

I am thinking about which is better. We could split it, but then we would have a few more endpoints in the docs. Or we can go with /stats and have these properties. Considering we don't plan to add much anytime soon, I'd go with a single endpoint for stats for simplicity.

What would you prefer, @netmilk?

I prefer @fnesveda's proposal: the URI .../statistics with the per-field statistics objects under a fieldStatistics property, especially if you foresee additional types of statistics being returned in the response in the future.

Naming the key just statistics doesn't provide any additional semantic meaning; for now it would only introduce an extra level of nesting, and it would lead to the term being overloaded in the future.

OK, I used @fnesveda's approach; please take a look, @mtrunkat @netmilk.

feat(api): dataset fields statistics

f8342c2

MFori added the t-console label Dec 18, 2024

MFori requested review from netmilk and mtrunkat December 18, 2024 10:20

MFori self-assigned this Dec 18, 2024

update link in description

a8237f4

github-actions bot added this to the 105th sprint - Console team - Christmas milestone Dec 18, 2024

mtrunkat requested changes Dec 18, 2024

MFori added 3 commits December 18, 2024 18:13

remove redundant fields

060b245

update endpoint description

fb34858

update response description

425bbbc

MFori requested a review from mtrunkat December 18, 2024 17:19

MFori added 2 commits December 20, 2024 22:40

use /v2/datasets/{datasetId}/statistics approach

4cc52db

fix indent

2fb2806

mtrunkat approved these changes Dec 20, 2024

fieldStatistics nullable

c431a15
