Skip to content

Make nrows a method separate from schema #137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ablaom opened this issue Jul 21, 2021 · 5 comments
Closed

Make nrows a method separate from schema #137

ablaom opened this issue Jul 21, 2021 · 5 comments
Labels

Comments

@ablaom
Copy link
Member

ablaom commented Jul 21, 2021

Recall that ScientificTypes.schema is essentially the same as Tables.schema but with scitypes and nrows added as fields.

There are two reasons for separating out nrows:

  1. Computing nrows for a general table is typically more expensive than computing the schema. In particular, assuming progress towards Towards a more efficient schema methods for row-based tables  #127, the compute times are significant in the extreme case of out-of-memory tables, for example. Tables.schema does not include anything about the number of rows for this reason.

  2. With nrows gone, we can view a schema as a table (implement the Tables.jl interface for it) which would be convenient. For example, for a large table DataFrame(schema(X)) would then work.

As an aside, the current implementation of schema(X).nrows is poor for row-based tables and @OkonSamuel improved this in MLJBase, where there is already a separate nrows method (extended from MLJModelInterface). This implementation could be ported to ScientificTypes to replace the current schema(X).nrows.

Sadly, this is breaking. I would propose adding the nrows method and deprecating the nrows field of Schema. We could already implement Schema as a table, simply ignoring this field.

Are there any objections or other thoughts regarding this plan?

@OkonSamuel @quinnj @juliohm @bkamins

@bkamins
Copy link

bkamins commented Jul 21, 2021

If you want to introduce nrows method then probably both nrow and ncol should go to DataAPI.jl and then I would recommend using nrow function name, so that we have consistency with DataFrames.jl (and as a bonus it will be more easy for R users to discover).

@ablaom
Copy link
Member Author

ablaom commented Jul 22, 2021

@bkamins That's great feedback. Will overload DataAPI.jl nrow.

@bkamins
Copy link

bkamins commented Jul 22, 2021

We need to add nrow and ncol to DataAPI.jl. I will open a PR there to discuss.

@ablaom
Copy link
Member Author

ablaom commented Jul 26, 2021

@bkamins
Copy link

bkamins commented Jul 26, 2021

In case JuliaData/DataAPI.jl#40 takes long, probably you can expose nrow on your side, and only later make it a method of DataAPI.nrow (when it is there).

ablaom referenced this issue in OkonSamuel/ScientificTypes.jl Aug 6, 2021
@ablaom ablaom closed this as completed Dec 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants