To help us think clearly about exactly what types of data Data Package is intended to support, either now or in the future, it could be helpful to create a repo with example datasets of different types.
This will help both when thinking about what the spec should look like to properly represent all of the intended types of data, and provide a useful resource for writing test code.
For new users coming to Frictionless and wondering whether it supports their data type, this could also be a good way to get started.
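For instance (just a sketch, not a commitment to any particular tooling; it assumes frictionless-py's Package/validate API and that descriptors are named datapackage.yml), a test suite could simply loop over every example and check that it validates:

```python
# test_examples.py - hypothetical sketch of a validation test over all examples
from pathlib import Path

import pytest
from frictionless import Package  # pip install frictionless

# collect every example descriptor in the repo
EXAMPLES = sorted(Path(__file__).parent.rglob("datapackage.yml"))


@pytest.mark.parametrize("descriptor", EXAMPLES, ids=lambda p: str(p.parent))
def test_example_package_is_valid(descriptor):
    report = Package(str(descriptor)).validate()
    assert report.valid, report.flatten(["type", "message"])
```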
Repo structure
My first thought was to organize the repo by data type/modality (e.g. table, image, spectral, etc.), but actually, it might be better to do it by domain?
This way things are grouped together logically, in a way one is more likely to encounter them in the wild, and it would allow people working in the various domains to see, at a glance, which of the data types they tend to work with are represented?
astro/
bio/
    biodiversity/
    omics/
        rna/
            rnaseq/
                fastq/
                    sample1.fq
                    datapackage.yml
                    README.md
                counts/
                    counts.tsv
                    datapackage.yml
                    README.md
        multiomics/
econ/
earth/
finance/
etc/
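For concreteness, the datapackage.yml for the counts example above might look roughly like this (the name, license, and fields are made up purely for illustration):

```yaml
# hypothetical descriptor for bio/omics/rna/rnaseq/counts/
name: rnaseq-counts-example
title: Example RNA-seq count matrix
licenses:
  - name: CC0-1.0
resources:
  - name: counts
    path: counts.tsv
    format: tsv
    schema:
      fields:
        - name: gene_id
          type: string
        - name: sample1
          type: integer
```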
There are obviously a bunch of different ways one could organize things. There may also be existing taxonomies of data types / domains that we could work off of.
I would be careful not to worry too much about getting "the" right structure, because I don't think there is going to be a single one that works well for everything. Instead, let's just get something started, and then iterate on it and improve it as our understanding of the intended scope of the project evolves.
Dataset directory contents
- example data, formulated as a data pkg, if possible
- README with data info/source
- for data types which are not yet well-supported, this can be indicated in the README (e.g. "status: xx")
How to go about creating the repo?
Possible approach:
- each working group member could come up with a list of data types they think are relevant, and a possible directory structure/naming convention for how they can be organized
- we can combine these into a single "fake" directory listing to show what the combined structure would look like
- meet & discuss / decide if changes need to be made
- once we are happy with the planned structure, a repo can be created, and users can start adding representative datasets with the appropriate licensing
- it's probably also worth adding a script/hook to automatically generate an index/ToC for all of the datasets (a rough sketch of what this could look like follows this list)
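A minimal sketch of such an index script, assuming it lives at the repo root, that each dataset directory has a datapackage.yml with a title, and that PyYAML is available (all of which are just assumptions at this point):

```python
#!/usr/bin/env python3
"""Walk the repo and write a simple markdown index of all example datasets."""
from pathlib import Path

import yaml  # pip install pyyaml

ROOT = Path(__file__).parent  # assumes the script sits at the repo root


def collect_datasets(root: Path):
    """Yield (relative directory, descriptor dict) for every datapackage.yml."""
    for descriptor in sorted(root.rglob("datapackage.yml")):
        with descriptor.open() as f:
            meta = yaml.safe_load(f) or {}
        yield descriptor.parent.relative_to(root), meta


def build_index(root: Path) -> str:
    """Render the collected datasets as a markdown list."""
    lines = ["# Dataset index", ""]
    for rel_dir, meta in collect_datasets(root):
        title = meta.get("title", rel_dir.name)
        lines.append(f"- [{title}]({rel_dir.as_posix()}/)")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    (ROOT / "INDEX.md").write_text(build_index(ROOT))
```

This could run as a pre-commit hook or a CI step that commits the regenerated INDEX.md.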
Other considerations
- only include data with an open license & provide information on the source
- for large datasets, subsample (a rough sketch for the tabular case follows this list)
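For tabular data, subsampling could be as simple as the following (paths and sizes are placeholders; for sequence data something like seqtk would probably be a better fit):

```python
# hypothetical sketch: deterministically subsample a large table before committing it
import pandas as pd

SEED = 0        # fixed seed so the subsample is reproducible
N_ROWS = 1000   # arbitrary target size for the committed example file

full = pd.read_csv("counts_full.tsv", sep="\t")
subset = full.sample(n=min(N_ROWS, len(full)), random_state=SEED)
subset.to_csv("counts.tsv", sep="\t", index=False)
```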
@khusmann I couldn't pull up your comments from our Slack discussion about this a few months back, but I know you had some different ideas on this so please feel free to comment/share how you were thinking about this.
Anyone else is welcome to chime in, too, obviously. This is really just intended to get the ball rolling.