Description
Recall that ScientificTypes.schema
is essentially the same as Tables.schema
but with scitypes
and nrows
added as fields.
There are two reasons for separating out nrows
:
-
Computing
nrows
for a general table is typically more expensive than computing theschema
. In particular, assuming progress towards Towards a more efficientschema
methods for row-based tables #127, the compute times are significant in the extreme case of out-of-memory tables, for example.Tables.schema
does not include anything about the number of rows for this reason. -
With
nrows
gone, we can view aschema
as a table (implement the Tables.jl interface for it) which would be convenient. For example, for a large tableDataFrame(schema(X))
would then work.
As an aside, the current implementation of schema(X).nrows
is poor for row-based tables and @OkonSamuel improved this in MLJBase, where there is already a separate nrows
method (extended from MLJModelInterface). This implementation could be ported to ScientificTypes to replace the current schema(X).nrows
.
Sadly, this is breaking. I would propose adding the nrows
method and deprecating the nrows
field of Schema
. We could already implement Schema
as a table, simply ignoring this field.
Are there any objections or other thoughts regarding this plan?