
data: Intermediate layer for versioning of datasets #1675


Draft · euronion wants to merge 8 commits into master

Conversation

@euronion euronion commented May 7, 2025

This PR suggests an implementation approach for an intermediate data layer that allows retrieving a specific version of a dataset, rather than always the latest.

The current state is implemented for the OSM dataset, as this one was already tracking versions and working very similarly. I'll extend it to two more datasets in the upcoming days.

Motivation

Upstream data dependencies are not always versioned or fixed, meaning they may change unexpectedly without a way to revert to a different version. This causes reproducibility issues.

Approach

This PR solves the problem in the following way:

  • We create archived versions of all external datasets (where licensing allows) on e.g. Zenodo
  • The URL for retrieving each combination of (dataset x version) is stored in data/versions.csv. This allows us to switch to a different data platform or provider if necessary, or to use a versioned URL directly from a data provider if available.
  • data/versions.csv also records the license and the description of each dataset. I plan on utilising this information to automatically create new versions of datasets and distribute the license text and metadata alongside them, as well as utilise this file for the documentation.
  • I imagine that all externally retrieved data will get two rules: (1) a rule for retrieving the upstream version, which I called source: "build" in the config.default.yaml, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files, which are then consumed by the model (see the sketch below).
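
A minimal sketch of the dual-rule pattern, assuming a pandas lookup into data/versions.csv (rule names, file paths and config keys here are illustrative assumptions, not the PR's final implementation):

    import pandas as pd

    versions = pd.read_csv("data/versions.csv").set_index(["source_name", "version"])
    osm_version = config["osm"]["version"]

    if config["osm"]["source"] == "archive":

        rule retrieve_osm_archive:
            # Fetch a fixed, archived copy of the dataset, e.g. from Zenodo.
            params:
                url=versions.loc[("osm", osm_version), "url"],
            output:
                f"data/osm/{osm_version}/raw.json",
            shell:
                "curl -sSL '{params.url}' -o {output}"

    else:  # source: "build"

        rule build_osm_from_upstream:
            # Build the same output file from the live upstream source instead,
            # so downstream rules are unaffected by the choice of source.
            output:
                f"data/osm/{osm_version}/raw.json",
            script:
                "scripts/build_osm.py"

Both variants yield the same output path, so switching between "archive" and "build" stays transparent to every consuming rule.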

TODO

  • Implement for two more datasets
  • Helper and instructions for creating a new version of a dataset from upstream to archive (small script/CLI tool)
  • Unbundle the data bundle on Zenodo
  • Move datasets from data/ to Zenodo; keep files in data/ that are manual specifications/inputs, e.g. under data/manual
  • Update documentation to utilise data/versions.csv

Comments are already welcome!

Checklist

  • I tested my contribution locally and it works as intended.
  • Code and workflow changes are sufficiently documented.
  • Changed dependencies are added to envs/environment.yaml.
  • Changes in configuration options are added in config/config.default.yaml.
  • Changes in configuration options are documented in doc/configtables/*.csv.
  • Sources of newly added data are documented in doc/data_sources.rst.
  • A release note doc/release_notes.rst is added.

euronion commented May 7, 2025

Now implemented for the Worldbank Urban Population dataset.

This dataset uses the same method for retrieving from the World Bank as for retrieving from Zenodo (sandbox link for now), and its structure also lends itself to providing the upstream information in the data/versions.csv file.

coroa commented May 8, 2025

@euronion Interesting idea, I think some of that aligns neatly with the data catalogue. I'll follow your development of it and maybe propose a slight variation today or tomorrow. The data/versions.csv you are mentioning is not part of the branch yet. Can you please add it?

euronion commented May 8, 2025

Thanks @coroa, you can now have a closer look!
Let's also have another chat on how it overlaps with your activities.

I've updated the code, now data/versions.csv is included.
It also includes a third dataset, the GEM GSPT (Global Steel Plant Tracker by GEM), which I chose as an example because:

  • It shows how the structure allows us to accommodate external data versioning activities, i.e. GEM provides dedicated links to different versions of a dataset, so we don't necessarily need to create a Zenodo mirror (for other reasons we still should for this dataset)
  • It shows how we can track unsupported versions of a dataset in data/versions.csv, i.e. I have added the newest version of the GSPT as "upstream" and "not supported", because the file format of the data changed in the new version and is no longer compatible with the current workflow. This can also be used to mark datasets that are no longer compatible as "deprecated"

Finally,

  • I've opted to rename the output file of the GSPT, such that the version is only encoded in the folder, not in the file name (for easier switching to new versions):

    xlsx=f"data/gem_gspt/{GEM_GSPT_VERSION}/Global-Steel-Plant-Tracker.xlsx",

  • And to showcase how we can avoid clutter/bugs with the ever more complicated dependencies between rules, I've used the rules.<rule_name>.output["<output_name>"] reference for the GEM GSPT instead of specifying the file name explicitly. We could use this approach instead of manually specifying the paths + versions each time:

    gem_gspt=rules.retrieve_gem_steel_plant_tracker.output["xlsx"],

euronion commented May 9, 2025

@euronion Interesting idea, I think some of that aligns neatly with the data catalogue. I'll follow your development of it and maybe propose a slight variation today or tomorrow. The data/versions.csv you are mentioning is not part of the branch yet. Can you please add it?

Recording an idea:
With the structure I propose above, we have a dedicated folder for each data input and version. This would be a good place to store a copy of the LICENSE for that particular dataset as well as a metadata.json.

E.g.

data/worldbank_urban_population
├── 2025-05-07
│   ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.csv
│   ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.zip
│   ├── LICENSE
│   ├── Metadata_Country_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│   ├── Metadata_Indicator_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│   └── metadata.json
│    ...

For any data that we store on Zenodo, we can add these files to the Zenodo record. For datasets that we don't put on Zenodo or that come from upstream, we'd need a different solution for getting/storing the metadata and LICENSE.
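
A hypothetical metadata.json for such a version folder might record just enough to reproduce the retrieval (all field names and values here are assumptions for illustration):

    {
        "dataset": "worldbank_urban_population",
        "version": "2025-05-07",
        "source": "archive",
        "url": "https://sandbox.zenodo.org/...",
        "license": "CC-BY-4.0",
        "retrieved": "2025-05-07"
    }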

Noting @lkstrp that this structure should also allow us to easily exchange "Zenodo" for any other type of data repo that allows for direct access to files, including S3 buckets.

lkstrp commented May 9, 2025

Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data that is retrieved from the data that is stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

euronion commented May 9, 2025

Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data that is retrieved from the data that is stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

coroa commented May 14, 2025

Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data that is retrieved from the data that is stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

Seconding the latter idea. Rather than splitting it, removing it from the repo.

Comment on lines +78 to +79
source: "archive" # "archive" or "build"
version: "latest" # specific version from 'data/versions.csv', 'latest' or if 'source: "build"' is used, then use "upstream" here

What are the different versions when source is "build"? If there are none, we could also merge that into a single version entry. Or do you plan to add the option of some fancy build from a specific commit hash or something?
EDIT: See main review

Comment on lines +1 to +2
"source_name","version","recency","url","description","license"
"osm","0.1","old","https://zenodo.org/records/12799202","Pre-built data of high-voltage transmission grid in Europe from OpenStreetMap.","ODbL"

Why "source_name" and not "dataset_name"?
Can we use "tag" instead of "recency" and just remove the "old" labels?
"not supported" could also be a tag

EDIT: See general restructure in main review.

@lkstrp lkstrp left a comment

This is lovely @euronion !

I have a couple of thoughts on the general schema:

Sources

(1) a rule for retrieving the upstream version, which I called source: "build" in the config.default.yaml, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files that are then consumed by the model.

There is also a third option: some datasets can be built with extra rules in the workflow, for example OSM or the atlite cutouts. So we could expand the schema to:
(1) source: "primary" (retrieve from primary source)
(2) source: "archive" (retrieve from our archive)
(3) source: "build" (build from workflow)

A dataset could never be both (1) and (3) at the same time, though, which wouldn't be ideal. Another option would be to split the data into internal (archive and build) and external (archive and primary) data. I'm not sure about the naming, though. I wouldn't merge (1) and (3), and definitely not call "retrieve the upstream version" source: "build". A third option would be to ignore the fact that we can build internal data from within the workflow. But if we are setting this up from scratch, I would prefer to keep that in the schema.

Versions

I would expect the version key to be consistent across all sources. Technically, any version could be built, or retrieved from the primary source or the archive. Whether a combination exists/can be retrieved is then just validated in a first step.
But if
(1) does not have versions and is always just a primary URL, and
(3) does not have versions and is always a build via the current local checkout,
we can also merge that all into one "version" tag for simplicity.
version: "primary" -> retrieve from primary source
version: "build" -> build via workflow
version: "latest" -> latest version from archive
version: "v0.3.0" -> v0.3.0 from archive

But that would mean we don't get versions for both primary sources and our archive in the schema (e.g. GEM). I think I like the idea of having mirrored versions across all source types more.

For sure I would mirror the version values and not use upstream as version value. It's better to stick to classic version tags also for primary sources: latest and nightly or stable and latest, since the primary source will in most cases not be stable.

data/versions.csv can then just be indexed by dataset_name, source and version. This would already be enough for validation.
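
A validation step along those lines could be as small as this sketch (column names taken from the proposed schema; the helper name is hypothetical):

    import pandas as pd

    versions = pd.read_csv("data/versions.csv").set_index(
        ["dataset_name", "source", "version"]
    )

    def validate(dataset: str, source: str, version: str) -> str:
        # Fail early if the requested combination is not listed; otherwise return its URL.
        try:
            return versions.loc[(dataset, source, version), "url"]
        except KeyError:
            raise ValueError(
                f"No entry for {dataset} ({source}, {version}) in data/versions.csv"
            )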

lkstrp commented May 15, 2025

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

Seconding the latter idea. Rather than splitting it, removing it from the repo.

Oh yes, if we aim big, let's remove them. But I assume they have a reason to be there in the first place, and we cannot just move them all to Zenodo since the update frequency needs to be much lower there (?). The repo data is updated quite often, and moving it to Zenodo would just be a huge pain and not the final solution. Either we come up with some S3 storage here already, keep it in the repo for now, or use some temporary solution (repo/Nextcloud).

coroa commented May 15, 2025

This was not originally about update frequency; it was more about convenience and size constraints. Here are the update counts for all files in the data directory over the last year:

data/agg_p_nom_minmax.csv, 1
data/ammonia_plants.csv, 2
data/attributed_ports.json, 0
data/biomass_transport_costs_supplychain1.csv, 1
data/biomass_transport_costs_supplychain2.csv, 1
data/cement-plants-noneu.csv, 1
data/ch_cantons.csv, 0
data/ch_industrial_production_per_subsector.csv, 1
data/custom_extra_functionality.py, 1
data/custom_powerplants.csv, 0
data/district_heat_share.csv, 2
data/egs_costs.json, 1
data/eia_hydro_annual_capacity.csv, 1
data/eia_hydro_annual_generation.csv, 1
data/entsoegridkit/README.md, 0
data/entsoegridkit/buses.csv, 0
data/entsoegridkit/converters.csv, 0
data/entsoegridkit/generators.csv, 0
data/entsoegridkit/lines.csv, 0
data/entsoegridkit/links.csv, 1
data/entsoegridkit/transformers.csv, 0
data/existing_infrastructure/existing_heating_raw.csv, 1
data/gr-e-11.03.02.01.01-cc.csv, 0
data/heat_load_profile_BDEW.csv, 0
data/hydro_capacities.csv, 0
data/links_p_nom.csv, 1
data/nuclear_p_max_pu.csv, 1
data/parameter_corrections.yaml, 1
data/refineries-noneu.csv, 1
data/retro/comparative_level_investment.csv, 0
data/retro/data_building_stock.csv, 0
data/retro/electricity_taxes_eu.csv, 0
data/retro/floor_area_missing.csv, 0
data/retro/retro_cost_germany.csv, 0
data/retro/u_values_poland.csv, 0
data/retro/window_assumptions.csv, 0
data/switzerland-new_format-all_years.csv, 0
data/transmission_projects/manual/new_links.csv, 2
data/transmission_projects/nep/new_lines.csv, 2
data/transmission_projects/nep/new_links.csv, 3
data/transmission_projects/template/new_lines.csv, 1
data/transmission_projects/template/new_links.csv, 1
data/transmission_projects/template/upgraded_lines.csv, 1
data/transmission_projects/template/upgraded_links.csv, 1
data/transmission_projects/tyndp2020/new_lines.csv, 1
data/transmission_projects/tyndp2020/new_links.csv, 2
data/transmission_projects/tyndp2020/upgraded_lines.csv, 1
data/transmission_projects/tyndp2020/upgraded_links.csv, 1
data/unit_commitment.csv, 0

generated by:

for i in $(git ls-files data); do echo $i, $(git log --oneline --since="1 year ago" ${i} | wc -l); done

lkstrp commented May 15, 2025

And this is too frequent for Zenodo.

coroa commented May 15, 2025

And this is too frequent for Zenodo.

The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?

coroa commented May 15, 2025

And this is too frequent for Zenodo.

The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?

5 commits for data/transmission_projects
and
15 for the rest (out of which 2 should live in technology-data, I guess).

Some of the 15 are deletions, and some are within the span of a week. Still too many to handle manually, I guess.

@euronion

I don't think we should be handling any of it manually anyway. I was thinking of writing a small CLI script that helps create new versions on Zenodo.

Not only to make it easier, but also to avoid mistakes slipping in.

coroa commented May 15, 2025

Files like parameter_corrections or the NEP plans deserve to be version-controlled, since they are hand-written rather than imported.

So tracking them in a git repository would still be good practice. It maybe does not have to be directly in this repository, but it also does not hurt.

Maybe a sub-directory like data/manual, or a pypsa-eur-data-manual repository, but then this also needs to be maintained and version-synced.

lkstrp commented May 15, 2025

A small CLI script sounds good, and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish/not just be a storage dump. So we shouldn't bloat it up now.

We need to reupload the whole directory for any new version on Zenodo. Zenodo cannot just update a single file of a bundle. So, if only one of 20 datasets needs an update, we have to reupload them all. This alone is already an unpleasant misuse. But all 20 of them get a new version tag as well, even if for 19 there is no difference between versions. So the whole purpose of versioning datasets is also gone.

As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.

@euronion

Small CLI script sounds good

  • added as open TODO

and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish/not just be a storage dump. So we shouldn't bloat it up now.

Agreed. I wasn't thinking of moving the data from the repo into the data bundle. I was thinking about moving the data from the repo into dedicated Zenodo datasets. One Zenodo URL per standalone dataset. Not what we are doing now with the databundle.

We need to reupload the whole directory for any new version on Zenodo.

Yes, and I don't want to repeat that either if we just want to update parts of the data.

As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.

Leaving aside the tags, Zenodo is not built for having a single record contain multiple datasets. What I would be doing is create a dedicated record per dataset. In that case Zenodo serves our purpose nicely. And since we use the storage(...) provider from Snakemake, we can always just provide a different URL if we want to switch to a storage bucket or another archive; the provider only needs to offer version-specific direct URLs for accessing the datasets. A sketch of what that could look like is below.
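
For illustration, a retrieve rule using Snakemake's storage(...) with a version-specific direct URL could look roughly like this (the Zenodo record ID is a placeholder and the rule body is an assumption, not this PR's implementation; it requires the HTTP storage plugin):

    rule retrieve_gem_steel_plant_tracker:
        input:
            # Any provider works here, as long as it serves a stable, version-specific direct URL.
            storage("https://zenodo.org/records/<record_id>/files/Global-Steel-Plant-Tracker.xlsx"),
        output:
            xlsx=f"data/gem_gspt/{GEM_GSPT_VERSION}/Global-Steel-Plant-Tracker.xlsx",
        shell:
            "cp {input} {output.xlsx}"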

lkstrp commented May 15, 2025

OK. If we create a dedicated record for each dataset on Zenodo, I would still argue that this is unnecessary overhead, but if you want to go for it, I'll give up my resistance. As you say, we can easily switch then 👍

@euronion

This is lovely @euronion !

I have a couple of thoughts on the general schema:
[...]

Thanks for the feedback @lkstrp. What I understand is that you only have concerns about the schema, but no comments or concerns about the implementation. Is that correct?

Naming

About your schema concerns: I wasn't very happy about my suggestions either, so happy to change them.

Indexing in data/versions.csv

I'm fine with indexing through (dataset (or dataset_name), source, version); that's the status quo anyway, just with renamed columns.

Sources

On the source: I indeed intentionally merged your (1) and (3) into "build", given that I don't know of any data source that provides both at the same time. But I see that it is clearer to separate them and accept that most datasets will have (2) and either (1) or (3), but not all of (1), (2) and (3).

Versions

The only benefit I see from having consistent version keys across sources is being able to get rid of recency, especially since we don't want to bump the version numbers of all datasets simultaneously, i.e. some datasets would be at v1.0.0, others at v1.0.1 or v4.0.0.

  • The downside, I believe, is that it requires more effort to compare our data with the primary source's version names, e.g. if we rename GEM's April-2024-V1 to v1.0.0, we are obfuscating their version number.

I'd rather keep the primary source's version names.

Recency

I introduced this column to help find the "latest" version of a dataset, since the versions are not guaranteed to be sortable or to follow semantic versioning, given the different naming methods primary data providers may use.
Then I realised that it has additional value: it can mark whether the model is still compatible with a dataset, flag e.g. "old" or "deprecated/incompatible" versions, and indicate the currently intended/supported version.
I.e. you can keep "latest" in the config.yaml and get an auto-update of a dataset when you upgrade between PyPSA-Eur versions, without having to check whether a new version of the dataset is available and whether you need to update your config file.

I think it would be nice to keep it, for lookup purposes only and not for indexing the file:
instead of specifying the version in the config file, one provides the recency.
Happy to rename it, just not to "tag"; that does not seem descriptive enough to me.
What do you think?

To summarize ...

I'd go with something like this:

data/versions.csv:

dataset    source    version           recency
GEM_GSPT   primary   Febuly-2999-V1    unstable / nightly / untested
GEM_GSPT   primary   April-2024-V1     latest
GEM_GSPT   primary   January-1970-V1   deprecated
GEM_GSPT   primary   January-2000-V1   outdated
GEM_GSPT   archive   April-2024-V1     latest
GEM_GSPT   archive   January-1970-V1   deprecated
GEM_GSPT   archive   January-2000-V1   outdated
...        ...       ...               ...
OSM        build     build             unstable / nightly / untested
OSM        archive   0.7               unstable / nightly / untested
OSM        archive   0.6               latest
OSM        archive   0.1               deprecated
...        ...       ...               ...
WDPA       primary   primary           unstable / nightly / untested (we don't have anything better or an archived version)

  • all datasets are downloaded to data/<dataset>/<version>/
  • config.yaml will have
datasets:
  <dataset>:
    source: "primary" | "archive" | "build"
    version: "<a version from versions.csv>" | "" # either version or recency need to be specified
    recency: "" | "latest" | "nightly"                        # either version or recency need to be specified
    

@euronion

Update after some discussions:

For data/versions.csv we will go with six columns:

  • dataset: name of the dataset
  • source: one of primary | build | archive, determining whether the dataset is retrieved from the original data provider (primary), built based on the original data source, e.g. OSM (build), or retrieved as an archived version from our mirror on e.g. Zenodo (archive)
  • version: Name of the version following the versioning schema of the original data provider. If the original data provider does not have a versioning schema, we'll go with a pragmatic version name, e.g. the date YYYY-MM-DD the data was retrieved and the archived version was created.
  • tags: a list of different tags that we support. For now, the only one is latest-supported, which refers to the latest version of a dataset that is supported by the model. latest-supported needs to be bumped when creating a new version of a dataset and adding it to the file. Tag options envisioned for the future are e.g. nightly or latest.
  • supported: a flag, either TRUE or FALSE, indicating whether the current model version supports this dataset. We'll not actively monitor or test for compatibility; the intention is to indicate, when a new version of a dataset is added, whether the previous version is just outdated or whether the data schema/contents changed such that it is no longer compatible with and supported by the model.
  • URL: URL pointing to the resource for download.

Further:

  • Downloaded data will be located in dedicated subfolders data/<dataset>/<source>/<version>/, allowing for a clear separation between datasets.
  • If the primary or build source allows for downloading continuously updated data without a versioning schema, e.g. OSM, then the version to use by convention is 'unknown'.
  • In the config file, we specify the data using source and version for each dataset. version is a valid version from the .csv, with the special version name latest-supported that gets resolved to the version of the dataset carrying this tag (see the sketch below). This version should be the default for most users, as this way they always get the newest data that is compatible with the model after upgrades, without losing previous datasets should they want to switch back or compare.
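
A hedged sketch of how latest-supported could be resolved against data/versions.csv (column names as listed above; the helper name is hypothetical):

    import pandas as pd

    def resolve_version(dataset: str, source: str, version: str) -> pd.Series:
        # Return the matching row; "latest-supported" is resolved via the
        # "tags" column rather than the "version" column.
        df = pd.read_csv("data/versions.csv")
        df = df[(df["dataset"] == dataset) & (df["source"] == source)]
        if version == "latest-supported":
            df = df[df["tags"].fillna("").str.contains("latest-supported")]
        else:
            df = df[df["version"] == version]
        if df.empty:
            raise ValueError(f"No entry for {dataset} ({source}, {version}) in data/versions.csv")
        return df.iloc[0]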

@euronion euronion self-assigned this May 21, 2025
@fneum fneum mentioned this pull request May 25, 2025