Make nrows a method separate from schema #137

Closed
@ablaom

Recall that ScientificTypes.schema is essentially the same as Tables.schema but with scitypes and nrows added as fields.

There are two reasons for separating out nrows:

  1. Computing nrows for a general table is typically more expensive than computing the schema. In particular, assuming progress on #127 (Towards a more efficient schema methods for row-based tables), the cost of computing nrows is significant, for example in the extreme case of out-of-memory tables. Tables.schema does not include anything about the number of rows for this reason.

  2. With nrows gone, we can view a schema as a table (by implementing the Tables.jl interface for it), which would be convenient. For example, for a large table X, DataFrame(schema(X)) would then work.
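
To make the two points concrete, here is a rough sketch, not the ScientificTypes implementation. The first part uses plain Julia with a lazy row iterator to show that names and types can be read off a single row (assuming homogeneous rows), whereas counting rows needs a full pass. The second part uses `Tables.schema` as a stand-in for `ScientificTypes.schema` and builds a small column table from it by hand; the names `colnames`, `coltypes` and `schema_table` are made up for illustration.

```julia
using Tables

# Part 1: a lazy, row-based source. Rows are produced on demand, so the
# number of rows is unknown until every row has been visited.
rows = ((x = i, y = sqrt(i)) for i in 1:1_000)

# Names and types can be read from the first row alone
# (assumption: all rows share the same names and types).
first_row = first(rows)
colnames = keys(first_row)
coltypes = map(typeof, values(first_row))

# Counting rows requires iterating over the whole source.
n = count(_ -> true, rows)

# Part 2: a schema viewed as a table. A NamedTuple of vectors is a valid
# Tables.jl column table, so we collect the schema's names and types into one.
X = (a = [1, 2, 3], b = ["x", "y", "z"])
sch = Tables.schema(X)
schema_table = (names = collect(sch.names), types = collect(sch.types))

Tables.istable(schema_table)  # the schema can now be consumed like any table
```

With no row count stored in it, each column of `schema_table` has one entry per column of `X`, so a sink such as a DataFrame constructor could consume it directly.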

As an aside, the current implementation of schema(X).nrows is poor for row-based tables. @OkonSamuel improved on this in MLJBase, where there is already a separate nrows method (extended from MLJModelInterface). That implementation could be ported to ScientificTypes to replace the current schema(X).nrows.
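
For discussion, a standalone row-count function might look roughly like the sketch below. This is not the MLJBase implementation; `nrows_sketch` is a hypothetical name, and the code relies only on the public Tables.jl accessors.

```julia
using Tables

# Hypothetical helper: count rows without going through a schema.
function nrows_sketch(X)
    Tables.istable(X) || throw(ArgumentError("argument is not a table"))
    if Tables.columnaccess(X)
        # Column-based table: the length of any one column is the row count.
        cols = Tables.columns(X)
        colnames = Tables.columnnames(cols)
        isempty(colnames) && return 0
        return length(Tables.getcolumn(cols, first(colnames)))
    else
        # Row-based table: in general we must iterate over every row.
        return count(_ -> true, Tables.rows(X))
    end
end

column_table = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])   # NamedTuple of vectors
row_table = [(a = 1, b = 4.0), (a = 2, b = 5.0)]      # Vector of NamedTuples

nrows_sketch(column_table)
nrows_sketch(row_table)
```

The row-based branch is the expensive one. A real implementation would presumably shortcut to `length` when the row iterator supports it.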

Sadly, this is breaking. I would propose adding the nrows method and deprecating the nrows field of Schema. We could already implement Schema as a table, simply ignoring this field.
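
One way the deprecation could be staged is sketched below. `SchemaSketch` is a made-up stand-in for the real Schema type, and the ordinary Julia types in `scitypes` are placeholders for actual scitypes. The idea is only that the field stays readable while emitting a deprecation warning.

```julia
# Hypothetical stand-in for the Schema type, keeping the nrows field for now.
struct SchemaSketch
    names::Vector{Symbol}
    scitypes::Vector{Any}   # placeholders for real scitypes
    nrows::Int
end

# Intercept property access so that reading `nrows` still works but warns.
# `getfield` bypasses this method, so there is no recursion.
function Base.getproperty(s::SchemaSketch, name::Symbol)
    if name === :nrows
        Base.depwarn(
            "`schema(X).nrows` is deprecated; use `nrows(X)` instead.",
            :getproperty,
        )
    end
    return getfield(s, name)
end

s = SchemaSketch([:a, :b], [Int, String], 3)
s.nrows   # still returns the stored value during the deprecation period
```

Note that `Base.depwarn` is silent by default in recent Julia versions unless Julia is started with `--depwarn=yes`, so users would only see the message when they opt in or run tests.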

Are there any objections or other thoughts regarding this plan?

@OkonSamuel @quinnj @juliohm @bkamins
