data: Intermediate layer for versioning of datasets #1675
Conversation
Now implemented for the Worldbank Urban Population dataset. This dataset uses the same method for retrieving from WB as for retrieving from Zenodo (sandbox link for now); the structure also lends itself to providing the upstream information in the …
@euronion Interesting idea, I think some of that aligns neatly with the data catalogue. I'll follow you developing that and maybe propose a slight variation today or tomorrow. The …
Thanks @coroa, you can now have a closer look! I've updated the code; now …
Finally, …
Recording an idea: e.g. for any data that we store on Zenodo, we can add them to the Zenodo repo. For datasets that we don't put on Zenodo or that are from …
Noting @lkstrp that this structure should also allow us to easily exchange "Zenodo" for any other type of data repo that allows for direct access to files, including S3 buckets.
Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data which is retrieved from the data which is stored within the repo/upstream, so that we don't mix that up within …
We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.
Seconding the latter idea. Rather than splitting it, removing it from the repo.
source: "archive" # "archive" or "build" | ||
version: "latest" # specific version from 'data/versions.csv', 'latest' or if 'source: "build"' is used, then use "upstream" here |
What are the different versions when source is `build`? If there are none, we could also merge that into a single `version` entry. Or do you plan to add the option of some fancy build from a specific commit hash or something?
EDIT: See main review
"source_name","version","recency","url","description","license" | ||
"osm","0.1","old","https://zenodo.org/records/12799202","Pre-built data of high-voltage transmission grid in Europe from OpenStreetMap.","ODbL" |
Why "source_name" and not "dataset_name"?
Can we use "tag" instead of "recency" and just remove the "old" labels?
"not supported" could also be a tag
EDIT: See general restructure in main review.
This is lovely @euronion!
I have a couple of thoughts on the general schema:
Sources
> (1) a rule for retrieving the `upstream` version, which I called `source: "build"` in the `config.default.yaml`, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files that are then consumed by the model.
There is also a third rule for some datasets which can be built with extra rules within the workflow, for example OSM or the atlite cutouts. So we could expand the schema to:

(1) `source: "primary"` (retrieve from primary source)
(2) `source: "archive"` (retrieve from our archive)
(3) `source: "build"` (build from workflow)
While a dataset could never be both (1) and (3) at the same time, merging them wouldn't be ideal. Another option would be to split the data into internal (archive and build) and external (archive and primary) data; I'm not sure about the naming, though. I wouldn't merge (1) and (3), and for sure not call "retrieve the upstream version" `source: "build"`. A third option would be to ignore the fact that we can build internal data from within the workflow. But if we are setting this up from scratch, I would prefer to keep that in the schema.
Versions
I would expect the `version` key to be consistent across all sources. Technically, any version could be built or retrieved from the primary source or the archive. Whether a combination exists/can be retrieved is then just validated in a first step.
But if (1) does not have versions and is always just a primary URL, and (3) does not have versions and is always a build via the current local checkout, we can also merge that all into one `version` tag for simplicity:

- `version: "primary"` -> retrieve from primary source
- `version: "build"` -> build via workflow
- `version: "latest"` -> latest from archive
- `version: "v0.3.0"` -> v0.3.0 from archive
But that would mean we don't get versions for both primary sources and our archive in the schema (e.g. GEM). I think I like the idea of having mirrored versions across all source types more.
For sure I would mirror the version values and not use `upstream` as a version value. It's better to stick to classic version tags also for primary sources: `latest` and `nightly`, or `stable` and `latest`, since the primary source will in most cases not be stable.
`data/versions.csv` can then just be indexed by `dataset_name`, `source` and `version`. This would already be enough for validation.
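A minimal sketch of that validation step, assuming the `dataset_name`/`source`/`version` columns proposed above (the helper itself is hypothetical, not part of the PR):

```python
# Hypothetical validation helper for data/versions.csv, indexed by the
# three columns proposed above.
import pandas as pd

versions = pd.read_csv("data/versions.csv").set_index(
    ["dataset_name", "source", "version"]
)

def validate(dataset: str, source: str, version: str) -> None:
    """Fail early if the requested combination cannot be retrieved."""
    if (dataset, source, version) not in versions.index:
        raise ValueError(
            f"No entry for ({dataset}, {source}, {version}) in data/versions.csv"
        )
```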
Oh yes, if we aim big, let's remove them. But I assume they have a reason to be there in the first place, and we cannot just move them all to Zenodo, since the update frequency needs to be much lower there (?). The repo data is updated quite often, and moving that to Zenodo will just be a huge pain and not the final solution. Either we come up with some S3 storage in here already, keep it in here for now, or use some temp solution (repo/nextcloud).
This was not originally about update frequency; it was more about convenience and size constraints. Here you have the updates of all files in the data directory over the last year: …
And this is too frequent for Zenodo.
The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?
5 commits for `data/transmission_projects`. Some of the 15 are deletions, some are within the span of a week. Still too many to handle manually, I guess.
I don't think we should be handling any of it manually anyway. I was thinking of writing a small CLI script that helps create new versions on Zenodo, not only to make it easier, but also to avoid mistakes slipping in.
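For illustration, a rough sketch of what such a script could do via Zenodo's REST API (the deposition `newversion` action and the file bucket are documented Zenodo endpoints; the function name and the `ZENODO_TOKEN` environment variable are assumptions):

```python
# Rough sketch only: create a draft for a new version of an existing
# Zenodo record and upload a single file to it. Publishing is left manual.
import os
import sys

import requests

API = "https://zenodo.org/api"
TOKEN = {"access_token": os.environ["ZENODO_TOKEN"]}  # assumed env variable

def new_version(deposition_id: str, path: str) -> str:
    r = requests.post(
        f"{API}/deposit/depositions/{deposition_id}/actions/newversion", params=TOKEN
    )
    r.raise_for_status()
    draft = requests.get(r.json()["links"]["latest_draft"], params=TOKEN).json()
    with open(path, "rb") as f:
        up = requests.put(
            f"{draft['links']['bucket']}/{os.path.basename(path)}",
            data=f,
            params=TOKEN,
        )
    up.raise_for_status()
    return draft["links"]["html"]  # review and publish the draft manually

if __name__ == "__main__":
    print(new_version(sys.argv[1], sys.argv[2]))
```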
Files like parameter_corrections or NEP plans deserve to be version-controlled, since they are hand-written rather than imported. So tracking them in a git repository would still be good practice for them. It maybe does not have to be directly in this repository, but it also does not hurt. Maybe a sub-directory like: …
A small CLI script sounds good, and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish/not just be a storage dump, so we shouldn't bloat it up now. We need to reupload the whole directory for any new version on Zenodo; Zenodo cannot just update a single file of a bundle. So, if only one of 20 datasets needs an update, we have to reupload them all. This alone is an unpleasant misuse already. But all 20 of them get a new version tag as well, even if for 19 there is no difference between versions. So the whole purpose of versioning datasets is also gone. As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support …
Agreed. I wasn't thinking of moving the data from the repo into the data bundle. I was thinking about moving the data from the repo into dedicated Zenodo datasets: one Zenodo URL per standalone dataset, not what we are doing now with the databundle.
Keeping aside the tags, Zenodo is not built for having a single record contain multiple datasets. What I would do is create a dedicated record per dataset. In that case, Zenodo serves our purpose nicely. And since we use the …
Ok. If we create a single record for each dataset on Zenodo, I would still argue that this is unnecessary overhead, but if you want to go for it, I'll give up my resistance. As you say, we can easily switch then 👍
Thanks for the feedback @lkstrp - what I understand is that you only have concerns about the schema, but no comments or concerns about the implementation. Is that correct?

Naming

About your schema concerns: I wasn't very happy with my suggestions either, so happy to change them. Indexing in `data/versions.csv`:
| dataset | source | version | recency |
|---|---|---|---|
| GEM_GSPT | primary | Febuly-2999-V1 | unstable / nightly / untested |
| GEM_GSPT | primary | April-2024-V1 | latest |
| GEM_GSPT | primary | January-1970-V1 | deprecated |
| GEM_GSPT | primary | January-2000-V1 | outdated |
| GEM_GSPT | archive | April-2024-V1 | latest |
| GEM_GSPT | archive | January-1970-V1 | deprecated |
| GEM_GSPT | archive | January-2000-V1 | outdated |
| ... | ... | ... | ... |
| OSM | build | build | unstable / nightly / untested |
| OSM | archive | 0.7 | unstable / nightly / untested |
| OSM | archive | 0.6 | latest |
| OSM | archive | 0.1 | deprecated |
| ... | ... | ... | ... |
| WDPA | primary | primary | unstable / nightly / untested / we don't have anything better or an archived version |
- all datasets are downloaded to `data/<dataset>/<version>/`
- `config.yaml` will have:

```yaml
datasets:
  <dataset>:
    source: "primary" | "archive" | "build"
    version: "<a version from versions.csv>" | ""  # either version or recency needs to be specified
    recency: "" | "latest" | "nightly"             # either version or recency needs to be specified
```
Update after some discussions: For …
Further: …
This PR suggests an implementation method for an intermediate data layer that allows retrieving a specific version of a dataset, rather than always the latest.
The current status is implemented for the OSM dataset, as this one was already tracking versions and working very similarly. I'll expand it to two more datasets in the upcoming days.
Motivation
Upstream data dependencies are not always versioned or fixed, meaning they may change unexpectedly without a way to revert to a different version. This causes reproducibility issues.
Approach
This PR solves the problem in the following way:

- All dataset versions and their retrieval URLs are tracked in `data/versions.csv`. This allows us to switch to a different data platform or provider if necessary, or use a versioned URL directly from a data provider if available.
- `data/versions.csv` also records the license and the description for the dataset. I plan on utilising this information to automatically create new versions of datasets and distribute the license text + metadata information based on it, as well as utilise this file for the documentation.
- Each dataset has (1) a rule for retrieving the `upstream` version, which I called `source: "build"` in the `config.default.yaml`, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files that are then consumed by the model (see the sketch below).
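As a sketch of the shared lookup both rules rely on (assuming the `source_name` column from the current `data/versions.csv`, and that `url` points directly at a downloadable file, which a Zenodo record URL would still need to be resolved to; the function itself is illustrative):

```python
# Illustrative download step: look up the URL for (dataset, version) in
# data/versions.csv and fetch the file into data/<dataset>/<version>/.
from pathlib import Path

import pandas as pd
import requests

def retrieve(dataset: str, version: str, filename: str) -> Path:
    versions = pd.read_csv("data/versions.csv").set_index(["source_name", "version"])
    url = versions.loc[(dataset, version), "url"]  # assumes a direct file URL
    target = Path("data") / dataset / version / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(requests.get(url, timeout=60).content)
    return target
```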
TODO

- Move `upstream` to `archive` (small script/CLI tool)
- Move data from `data/` to Zenodo; keep files in `data` that are manual specifications/inputs, e.g. as `data/manual`
- `data/versions.csv` …
Comments are already welcome!
Checklist
- Changed dependencies are added to `envs/environment.yaml`.
- Changes in configuration options are added in `config/config.default.yaml`.
- Changes in configuration options are documented in `doc/configtables/*.csv`.
- Newly added data sources are documented in `doc/data_sources.rst`.
- A release note `doc/release_notes.rst` is added.