
data: Intermediate layer for versioning of datasets #1675


Draft · euronion wants to merge 8 commits into master

Conversation

@euronion euronion commented May 7, 2025

This PR suggests an implementation approach for an intermediate data layer that allows retrieving a specific version of a dataset, rather than always the latest.

The current state is implemented for the OSM dataset, as this one was already tracking versions and working very similarly. I'll extend it to two more datasets in the upcoming days.

Motivation

Upstream data dependencies are not always versioned or fixed, meaning they may change unexpectedly without a way to revert to a different version. This causes reproducibility issues.

Approach

This PR solves the problem in the following way:

  • We create archived versions of all external datasets (where licensing allows) on e.g. Zenodo
  • The URL for retrieving each combination of (dataset x version) is stored in data/versions.csv. This allows us to switch to a different data platform or provider if necessary, or to use a versioned URL directly from a data provider if available.
  • data/versions.csv also records the license and the description of each dataset. I plan on utilising this information to automatically create new versions of datasets and distribute the license text and metadata alongside them, as well as utilise this file for the documentation.
  • I imagine that all externally retrieved data will get two rules: (1) a rule for retrieving the upstream version, which I called source: "build" in the config.default.yaml, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files, which are then consumed by the model (see the sketch below).
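
A minimal sketch of the dual-rule pattern, assuming a pandas lookup into data/versions.csv (rule names, file paths and config keys here are illustrative assumptions, not the PR's final implementation):

    import pandas as pd

    versions = pd.read_csv("data/versions.csv").set_index(["source_name", "version"])
    osm_version = config["osm"]["version"]

    if config["osm"]["source"] == "archive":

        rule retrieve_osm_archive:
            # Fetch a fixed, archived copy of the dataset, e.g. from Zenodo.
            params:
                url=versions.loc[("osm", osm_version), "url"],
            output:
                f"data/osm/{osm_version}/raw.json",
            shell:
                "curl -sSL '{params.url}' -o {output}"

    else:  # source: "build"

        rule build_osm_from_upstream:
            # Build the same output file from the live upstream source instead,
            # so downstream rules are unaffected by the choice of source.
            output:
                f"data/osm/{osm_version}/raw.json",
            script:
                "scripts/build_osm.py"

Both variants yield the same output path, so switching between "archive" and "build" stays transparent to every consuming rule.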

TODO

  • Implement for two more datasets
  • Helper and instructions for creating a new version of a dataset from upstream to archive (small script/CLI tool)
  • Unbundle the data bundle on Zenodo
  • Move datasets from data/ to Zenodo; keep files in data/ that are manual specifications/inputs, e.g. under data/manual
  • Update documentation to utilise data/versions.csv

Comments are already welcome!

Checklist

  • I tested my contribution locally and it works as intended.
  • Code and workflow changes are sufficiently documented.
  • Changed dependencies are added to envs/environment.yaml.
  • Changes in configuration options are added in config/config.default.yaml.
  • Changes in configuration options are documented in doc/configtables/*.csv.
  • Sources of newly added data are documented in doc/data_sources.rst.
  • A release note doc/release_notes.rst is added.

euronion commented May 7, 2025

Now implemented for the Worldbank Urban Population dataset.

This dataset uses the same method for retrieving from the World Bank as for retrieving from Zenodo (sandbox link for now), and its structure also lends itself to providing the upstream information in the data/versions.csv file.

coroa commented May 8, 2025

@euronion Interesting idea, I think some of that aligns neatly with the data catalogue. I'll follow your development of it and maybe propose a slight variation today or tomorrow. The data/versions.csv you are mentioning is not part of the branch yet. Can you please add it?

euronion commented May 8, 2025

Thanks @coroa, you can now have a closer look!
Let's also have another chat on how it overlaps with your activities.

I've updated the code, now data/versions.csv is included.
It also includes a third dataset, the GEM GSPT (Global Steel Plant Tracker by GEM), which I chose as an example because:

  • It shows how the structure allows us to accommodate external data versioning activities, i.e. GEM provides dedicated links to different versions of a dataset, so we don't necessarily need to create a Zenodo mirror (for other reasons we still should for this dataset)
  • It shows how we can track unsupported versions of a dataset in data/versions.csv, i.e. I have added the newest version of the GSPT as "upstream" and "not supported", because the file format of the data changed in the new version and is no longer compatible with the current workflow. This can also be used to mark datasets that are no longer compatible as "deprecated"

Finally,

  • I've opted to rename the output file of the GSPT, such that the version is only encoded in the folder, not in the file name (for easier switching to new versions):

    xlsx=f"data/gem_gspt/{GEM_GSPT_VERSION}/Global-Steel-Plant-Tracker.xlsx",

  • And to showcase how we can avoid clutter/bugs with the ever more complicated dependencies between rules, I've used the rules.<rule_name>.output["<output_name>"] reference for the GEM GSPT instead of specifying the file name explicitly. We could use this approach instead of manually specifying the paths + versions each time:

    gem_gspt=rules.retrieve_gem_steel_plant_tracker.output["xlsx"],

euronion commented May 9, 2025

@euronion Interesting idea, I think some of that aligns neatly with the data catalogue. I'll follow your development of it and maybe propose a slight variation today or tomorrow. The data/versions.csv you are mentioning is not part of the branch yet. Can you please add it?

Recording an idea:
With the structure I propose above, we have a dedicated folder for each data input and version. This would be a good place to store a copy of the LICENSE for that particular dataset as well as a metadata.json.

E.g.

data/worldbank_urban_population
├── 2025-05-07
│   ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.csv
│   ├── API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2.zip
│   ├── LICENSE
│   ├── Metadata_Country_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│   ├── Metadata_Indicator_API_SP.URB.TOTL.IN.ZS_DS2_en_csv_v2_86733.csv
│   └── metadata.json
│    ...

For any data that we store on Zenodo, we can add these files to the Zenodo record. For datasets that we don't put on Zenodo or that come from upstream, we'd need a different solution for getting/storing the metadata and LICENSE.
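
A hypothetical metadata.json for such a version folder might record just enough to reproduce the retrieval (all field names and values here are assumptions for illustration):

    {
        "dataset": "worldbank_urban_population",
        "version": "2025-05-07",
        "source": "archive",
        "url": "https://sandbox.zenodo.org/...",
        "license": "CC-BY-4.0",
        "retrieved": "2025-05-07"
    }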

Noting @lkstrp that this structure should also allow us to easily exchange "Zenodo" for any other type of data repo that allows for direct access to files, including S3 buckets.

lkstrp commented May 9, 2025

Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data that is retrieved from the data that is stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

euronion commented May 9, 2025

Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data that is retrieved from the data that is stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

coroa commented May 14, 2025

Sounds beautiful. Before I take a look, I'll just leave a note/request here: can we also split the data that is retrieved from the data that is stored within the repo, so that we don't mix them up within ./data? Probably by moving all repo data to ./data/repo or ./data/git or similar.

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

Seconding the latter idea. Rather than splitting it, removing it from the repo.

Comment on lines +78 to +79
source: "archive" # "archive" or "build"
version: "latest" # specific version from 'data/versions.csv', 'latest' or if 'source: "build"' is used, then use "upstream" here

What are the different versions when source is "build"? If there are none, we could also merge that into a single version entry. Or do you plan to add the option of some fancy build from a specific commit hash or something?
EDIT: See main review

Comment on lines +1 to +2
"source_name","version","recency","url","description","license"
"osm","0.1","old","https://zenodo.org/records/12799202","Pre-built data of high-voltage transmission grid in Europe from OpenStreetMap.","ODbL"

Why "source_name" and not "dataset_name"?
Can we use "tag" instead of "recency" and just remove the "old" labels?
"not supported" could also be a tag

EDIT: See general restructure in main review.

@lkstrp lkstrp left a comment

This is lovely @euronion !

I have a couple of thoughts on the general schema:

Sources

(1) a rule for retrieving the upstream version, which I called source: "build" in the config.default.yaml, and (2) a rule to retrieve an archived version of the data. Both rules yield the same files that are then consumed by the model.

There is also a third option: some datasets can be built with extra rules in the workflow, for example OSM or the atlite cutouts. So we could expand the schema to:
(1) source: "primary" (retrieve from primary source)
(2) source: "archive" (retrieve from our archive)
(3) source: "build" (build from workflow)

A dataset could never be both (1) and (3) at the same time, though, which wouldn't be ideal. Another option would be to split the data into internal (archive and build) and external (archive and primary) data. I'm not sure about the naming, though. I wouldn't merge (1) and (3), and definitely not call "retrieve the upstream version" source: "build". A third option would be to ignore the fact that we can build internal data from within the workflow. But if we are setting this up from scratch, I would prefer to keep that in the schema.

Versions

I would expect the version key to be consistent across all sources. Technically, any version could be built, or retrieved from the primary source or the archive. Whether a combination exists/can be retrieved is then just validated in a first step.
But if
(1) does not have versions and is always just a primary URL, and
(3) does not have versions and is always a build via the current local checkout,
we can also merge that all into one "version" tag for simplicity.
version: "primary" -> retrieve from primary source
version: "build" -> build via workflow
version: "latest" -> latest version from archive
version: "v0.3.0" -> v0.3.0 from archive

But that would mean we don't get versions for both primary sources and our archive in the schema (e.g. GEM). I think I like the idea of having mirrored versions across all source types more.

For sure I would mirror the version values and not use upstream as version value. It's better to stick to classic version tags also for primary sources: latest and nightly or stable and latest, since the primary source will in most cases not be stable.

data/versions.csv can then just be indexed by dataset_name, source and version. This would already be enough for validation.
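
A validation step along those lines could be as small as this sketch (column names taken from the proposed schema; the helper name is hypothetical):

    import pandas as pd

    versions = pd.read_csv("data/versions.csv").set_index(
        ["dataset_name", "source", "version"]
    )

    def validate(dataset: str, source: str, version: str) -> str:
        # Fail early if the requested combination is not listed; otherwise return its URL.
        try:
            return versions.loc[(dataset, source, version), "url"]
        except KeyError:
            raise ValueError(
                f"No entry for {dataset} ({source}, {version}) in data/versions.csv"
            )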

lkstrp commented May 15, 2025

We could also consider removing it completely from the repo and putting it on Zenodo? I.e. retrieve all data.

Seconding the latter idea. Rather than splitting it, removing it from the repo.

Oh yes, if we aim big, let's remove them. But I assume they have a reason to be there in the first place, and we cannot just move them all to Zenodo since the update frequency needs to be much lower there (?). The repo data is updated quite often, and moving it to Zenodo would just be a huge pain and not the final solution. Either we come up with some S3 storage here already, keep it in the repo for now, or use some temporary solution (repo/Nextcloud).

coroa commented May 15, 2025

This was not originally about update frequency; it was more about convenience and size constraints. Here are the update counts for all files in the data directory over the last year:

data/agg_p_nom_minmax.csv, 1
data/ammonia_plants.csv, 2
data/attributed_ports.json, 0
data/biomass_transport_costs_supplychain1.csv, 1
data/biomass_transport_costs_supplychain2.csv, 1
data/cement-plants-noneu.csv, 1
data/ch_cantons.csv, 0
data/ch_industrial_production_per_subsector.csv, 1
data/custom_extra_functionality.py, 1
data/custom_powerplants.csv, 0
data/district_heat_share.csv, 2
data/egs_costs.json, 1
data/eia_hydro_annual_capacity.csv, 1
data/eia_hydro_annual_generation.csv, 1
data/entsoegridkit/README.md, 0
data/entsoegridkit/buses.csv, 0
data/entsoegridkit/converters.csv, 0
data/entsoegridkit/generators.csv, 0
data/entsoegridkit/lines.csv, 0
data/entsoegridkit/links.csv, 1
data/entsoegridkit/transformers.csv, 0
data/existing_infrastructure/existing_heating_raw.csv, 1
data/gr-e-11.03.02.01.01-cc.csv, 0
data/heat_load_profile_BDEW.csv, 0
data/hydro_capacities.csv, 0
data/links_p_nom.csv, 1
data/nuclear_p_max_pu.csv, 1
data/parameter_corrections.yaml, 1
data/refineries-noneu.csv, 1
data/retro/comparative_level_investment.csv, 0
data/retro/data_building_stock.csv, 0
data/retro/electricity_taxes_eu.csv, 0
data/retro/floor_area_missing.csv, 0
data/retro/retro_cost_germany.csv, 0
data/retro/u_values_poland.csv, 0
data/retro/window_assumptions.csv, 0
data/switzerland-new_format-all_years.csv, 0
data/transmission_projects/manual/new_links.csv, 2
data/transmission_projects/nep/new_lines.csv, 2
data/transmission_projects/nep/new_links.csv, 3
data/transmission_projects/template/new_lines.csv, 1
data/transmission_projects/template/new_links.csv, 1
data/transmission_projects/template/upgraded_lines.csv, 1
data/transmission_projects/template/upgraded_links.csv, 1
data/transmission_projects/tyndp2020/new_lines.csv, 1
data/transmission_projects/tyndp2020/new_links.csv, 2
data/transmission_projects/tyndp2020/upgraded_lines.csv, 1
data/transmission_projects/tyndp2020/upgraded_links.csv, 1
data/unit_commitment.csv, 0

generated by:

for i in $(git ls-files data); do echo $i, $(git log --oneline --since="1 year ago" ${i} | wc -l); done

lkstrp commented May 15, 2025

And this is too frequent for Zenodo.

coroa commented May 15, 2025

And this is too frequent for Zenodo.

The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?

coroa commented May 15, 2025

And this is too frequent for Zenodo.

The data bundle alone received about 10 versions in the same time span. Are you talking about the cumulative amount of updates if you bundle them up together?

5 commits for data/transmission_projects
and
15 for the rest (out of which 2 should live in technology-data, I guess).

Some of the 15 are deletions, and some are within the span of a week. Still too many to handle manually, I guess.

@euronion

I don't think we should be handling any of it manually anyway. I was thinking of writing a small CLI script that helps create new versions on Zenodo.

Not only to make it easier, but also to avoid mistakes slipping in.

coroa commented May 15, 2025

Files like parameter_corrections or the NEP plans deserve to be version-controlled, since they are hand-written rather than imported.

So tracking them in a git repository would still be good practice. It maybe does not have to be directly in this repository, but it also does not hurt.

Maybe a sub-directory like data/manual, or a pypsa-eur-data-manual repository, but then this also needs to be maintained and version-synced.

lkstrp commented May 15, 2025

A small CLI script sounds good, and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish/not just be a storage dump. So we shouldn't bloat it up now.

We need to reupload the whole directory for any new version on Zenodo. Zenodo cannot just update a single file of a bundle. So, if only one of 20 datasets needs an update, we have to reupload them all. This alone is already an unpleasant misuse. But all 20 of them get a new version tag as well, even if for 19 there is no difference between versions. So the whole purpose of versioning datasets is also gone.

As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.

@euronion

Small CLI script sounds good

  • added as open TODO

and the numbers also don't sound too high, but I am just against using Zenodo for this. In the long term the data bundle should vanish/not just be a storage dump. So we shouldn't bloat it up now.

Agreed. I wasn't thinking of moving the data from the repo into the data bundle. I was thinking about moving the data from the repo into dedicated Zenodo datasets. One Zenodo URL per standalone dataset. Not what we are doing now with the databundle.

We need to reupload the whole directory for any new version on Zenodo.

Yes, and I don't want to repeat that either if we just want to update parts of the data.

As discussed above, the end goal of a data layer needs to provide a version tag per dataset, with two sources: 'archive' and 'primary', while primary may just support latest/nightly. Zenodo is just not designed for this.

Leaving aside the tags, Zenodo is not built for having a single record contain multiple datasets. What I would be doing is create a dedicated record per dataset. In that case Zenodo serves our purpose nicely. And since we use the storage(...) provider from Snakemake, we can always just provide a different URL if we want to switch to a storage bucket or another archive; the provider only needs to offer version-specific direct URLs for accessing the datasets. A sketch of what that could look like is below.
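
For illustration, a retrieve rule using Snakemake's storage(...) with a version-specific direct URL could look roughly like this (the Zenodo record ID is a placeholder and the rule body is an assumption, not this PR's implementation; it requires the HTTP storage plugin):

    rule retrieve_gem_steel_plant_tracker:
        input:
            # Any provider works here, as long as it serves a stable, version-specific direct URL.
            storage("https://zenodo.org/records/<record_id>/files/Global-Steel-Plant-Tracker.xlsx"),
        output:
            xlsx=f"data/gem_gspt/{GEM_GSPT_VERSION}/Global-Steel-Plant-Tracker.xlsx",
        shell:
            "cp {input} {output.xlsx}"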

lkstrp commented May 15, 2025

OK. If we create a dedicated record for each dataset on Zenodo, I would still argue that this is unnecessary overhead, but if you want to go for it, I'll give up my resistance. As you say, we can easily switch then 👍

@euronion

This is lovely @euronion !

I have a couple of thoughts on the general schema:
[...]

Thanks for the feedback @lkstrp. What I understand is that you only have concerns about the schema, but no comments or concerns about the implementation. Is that correct?

Naming

About your schema concerns: I wasn't very happy about my suggestions either, so happy to change them.

Indexing in data/versions.csv

I'm fine with indexing through (dataset (or dataset_name), source, version); that's the status quo anyway, just with renamed columns.

Sources

On the source: I indeed intentionally merged your (1) and (3) into "build", given that I don't know of any data source that provides both at the same time. But I see that it is clearer to separate them and accept that most datasets will have (2) and either (1) or (3), but not all of (1), (2) and (3).

Versions

The only benefit I see from having consistent version keys across sources is being able to get rid of recency, especially since we don't want to bump the version numbers of all datasets simultaneously, i.e. some datasets would be at v1.0.0, others at v1.0.1 or v4.0.0.

  • The downside, I believe, is that it requires more effort to compare our data with the primary source's version names, e.g. if we rename GEM's April-2024-V1 to v1.0.0, we are obfuscating their version number.

I'd rather keep the primary source's version names.

Recency

I introduced this column to help find the "latest" version of a dataset, since the versions are not guaranteed to be sortable or to follow semantic versioning, given the different naming methods primary data providers may use.
Then I realised that it has additional value: it can mark whether the model is still compatible with a dataset, flag e.g. "old" or "deprecated/incompatible" versions, and indicate the currently intended/supported version.
I.e. you can keep "latest" in the config.yaml and get an auto-update of a dataset when you upgrade between PyPSA-Eur versions, without having to check whether a new version of the dataset is available and whether you need to update your config file.

I think it would be nice to keep it, for lookup purposes only and not for indexing the file:
instead of specifying the version in the config file, one provides the recency.
Happy to rename it, just not to "tag"; that does not seem descriptive enough to me.
What do you think?

To summarize ...

I'd go with something like this:

data/versions.csv:

dataset    source    version           recency
GEM_GSPT   primary   Febuly-2999-V1    unstable / nightly / untested
GEM_GSPT   primary   April-2024-V1     latest
GEM_GSPT   primary   January-1970-V1   deprecated
GEM_GSPT   primary   January-2000-V1   outdated
GEM_GSPT   archive   April-2024-V1     latest
GEM_GSPT   archive   January-1970-V1   deprecated
GEM_GSPT   archive   January-2000-V1   outdated
...        ...       ...               ...
OSM        build     build             unstable / nightly / untested
OSM        archive   0.7               unstable / nightly / untested
OSM        archive   0.6               latest
OSM        archive   0.1               deprecated
...        ...       ...               ...
WDPA       primary   primary           unstable / nightly / untested (we don't have anything better or an archived version)

  • all datasets are downloaded to data/<dataset>/<version>/
  • config.yaml will have
datasets:
  <dataset>:
    source: "primary" | "archive" | "build"
    version: "<a version from versions.csv>" | "" # either version or recency need to be specified
    recency: "" | "latest" | "nightly"                        # either version or recency need to be specified
    

@euronion

Update after some discussions:

For data/versions.csv we will go with six columns:

  • dataset: name of the dataset
  • source: one of primary | build | archive, determining whether the dataset is retrieved from the original data provider (primary), built based on the original data source, e.g. OSM (build), or retrieved as an archived version from our mirror on e.g. Zenodo (archive)
  • version: Name of the version following the versioning schema of the original data provider. If the original data provider does not have a versioning schema, we'll go with a pragmatic version name, e.g. the date YYYY-MM-DD the data was retrieved and the archived version was created.
  • tags: a list of different tags that we support. For now, the only one is latest-supported, which refers to the latest version of a dataset that is supported by the model. latest-supported needs to be bumped when creating a new version of a dataset and adding it to the file. Tag options envisioned for the future are e.g. nightly or latest.
  • supported: a flag, either TRUE or FALSE, indicating whether the current model version supports this dataset. We'll not actively monitor or test for compatibility; the intention is to indicate, when a new version of a dataset is added, whether the previous version is just outdated or whether the data schema/contents changed such that it is no longer compatible with and supported by the model.
  • URL: URL pointing to the resource for download.

Further:

  • Downloaded data will be located in dedicated subfolders data/<dataset>/<source>/<version>/, allowing for a clear separation between datasets.
  • If the primary or build source allows for downloading continuously updated data without a versioning schema, e.g. OSM, then the version to use by convention is 'unknown'.
  • In the config file, we specify the data using source and version for each dataset. version is a valid version from the .csv, with the special version name latest-supported that gets resolved to the version of the dataset carrying this tag (see the sketch below). This version should be the default for most users, as this way they always get the newest data that is compatible with the model after upgrades, without losing previous datasets should they want to switch back or compare.
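
A hedged sketch of how latest-supported could be resolved against data/versions.csv (column names as listed above; the helper name is hypothetical):

    import pandas as pd

    def resolve_version(dataset: str, source: str, version: str) -> pd.Series:
        # Return the matching row; "latest-supported" is resolved via the
        # "tags" column rather than the "version" column.
        df = pd.read_csv("data/versions.csv")
        df = df[(df["dataset"] == dataset) & (df["source"] == source)]
        if version == "latest-supported":
            df = df[df["tags"].fillna("").str.contains("latest-supported")]
        else:
            df = df[df["version"] == version]
        if df.empty:
            raise ValueError(f"No entry for {dataset} ({source}, {version}) in data/versions.csv")
        return df.iloc[0]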

@euronion euronion self-assigned this May 21, 2025
@fneum fneum mentioned this pull request May 25, 2025