
Fetch datasets as release assets (instead of Git LFS pull) #190

Open
@anthonyfok

Description


Large datasets, mostly CSV files, are currently fetched directly via Git LFS, which incurs significant Git LFS bandwidth costs.

Fetching these datasets as pre-compressed release assets will reduce download time and eliminate most GitHub Git LFS bandwidth costs. Thanks to @jvanulde for the idea and @DamonU2 for the pioneering work.
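As a rough sketch of the idea (the tag and asset names here are hypothetical examples, not the final naming scheme), a pre-compressed release asset can be fetched with plain curl against GitHub's standard release-download URL:

```shell
#!/bin/sh
# Sketch: fetch a pre-compressed dataset from a GitHub release
# instead of pulling it through Git LFS.
# NOTE: the tag and asset names below are hypothetical examples.

# Build the standard release-asset download URL:
#   https://github.com/<owner>/<repo>/releases/download/<tag>/<asset>
release_asset_url() {
  owner="$1" repo="$2" tag="$3" asset="$4"
  printf 'https://github.com/%s/%s/releases/download/%s/%s\n' \
    "$owner" "$repo" "$tag" "$asset"
}

fetch_release_asset() {
  url=$(release_asset_url "$@")
  asset="$4"
  # -L follows the S3 redirect GitHub issues for release assets
  curl -sSL -o "$asset" "$url"
  # Decompress in place if the asset is xz- or zstd-compressed
  case "$asset" in
    *.xz)  xz -d -f "$asset" ;;
    *.zst) zstd -d -f --rm "$asset" ;;
  esac
}

# Example (hypothetical tag and asset name):
release_asset_url OpenDRR openquake-inputs v1.0.0 exposure.csv.xz
# → https://github.com/OpenDRR/openquake-inputs/releases/download/v1.0.0/exposure.csv.xz
```

Unlike the Git LFS endpoint, release-asset downloads do not count against the repository's LFS bandwidth quota.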

This approach is, I think, easier to implement and maintain, and thus more robust and less error-prone, than my earlier unimplemented "XZ-compressed copies of repos" idea.

Data source repos:

  • OpenDRR/openquake-inputs
  • OpenDRR/model-inputs
  • OpenDRR/canada-srm2
  • OpenDRR/earthquake-scenarios

Scripts that fetch from these repos include (but may not be limited to):

  • python/add_data.sh (OpenDRR/opendrr-api)
  • scripts/DSRA_outputs2postgres_lfs.py (OpenDRR/model-factory)

Compare, for example, these commands found in add_data.sh:

fetch_csv openquake-inputs ...
fetch_csv model-inputs ...
curl -L https://api.github.com/repos/OpenDRR/canada-srm2/contents/cDamage/output?ref=tieg_natmodel2021
curl -L https://api.github.com/repos/OpenDRR/earthquake-scenarios/contents/FINISHED
python3 DSRA_outputs2postgres_lfs.py --dsraModelDir=$DSRA_REPOSITORY --columnsINI=DSRA_outputs2postgres.ini --eqScenario="$eqscenario"
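One possible migration path for add_data.sh is a drop-in wrapper that tries the release asset first and falls back to the existing Git LFS fetch. Everything below is a hypothetical sketch: the release tag, the asset-naming scheme, and fetch_csv_lfs (standing in for the script's current LFS-based helper) are all assumptions, not the actual implementation:

```shell
#!/bin/sh
# Sketch: try a pre-compressed release asset first, fall back to the
# existing Git LFS fetch. The release tag and asset naming scheme
# are hypothetical; fetch_csv_lfs stands in for the current helper.

RELEASE_TAG=${RELEASE_TAG:-v1.0.0}   # hypothetical default tag

fetch_csv_lfs() {
  # Placeholder for the existing LFS-based fetch in add_data.sh
  echo "falling back to Git LFS fetch for $1/$2" >&2
}

fetch_csv() {
  repo="$1" path="$2"
  asset="$(basename "$path").xz"
  url="https://github.com/OpenDRR/$repo/releases/download/$RELEASE_TAG/$asset"
  # -f makes curl fail (non-zero exit) on a 404, triggering the fallback
  if curl -fsSL -o "$asset" "$url"; then
    xz -d -f "$asset"    # leaves the plain CSV in the working directory
  else
    fetch_csv_lfs "$repo" "$path"
  fi
}

# Usage (hypothetical path): fetch_csv openquake-inputs exposure/file.csv
```

Keeping the fetch_csv name unchanged would let the rest of add_data.sh, and callers like DSRA_outputs2postgres_lfs.py's inputs, stay as they are during the transition.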

Open question: XZ or Zstd compression? (a trade-off between compressed file size and decompression speed)
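The trade-off could be measured directly on the real CSV files before deciding. A minimal sketch, using a small synthetic CSV as a stand-in (the usual rule of thumb, to be verified on the actual datasets, is that xz yields smaller files while zstd decompresses considerably faster):

```shell
#!/bin/sh
# Sketch: compare XZ vs Zstandard compression on a sample CSV.
# The synthetic data below is a stand-in; the comparison should be
# rerun on the real OpenDRR CSV files before choosing a format.

sample=sample.csv
# Generate a small synthetic CSV purely for the comparison
seq 1 10000 | awk '{print $1 ",name-" $1 "," $1 * 3}' > "$sample"

for tool in xz zstd; do
  if command -v "$tool" >/dev/null 2>&1; then
    case "$tool" in
      # -k keeps the original; output suffix defaults to .xz
      xz)   xz -9 -k -f "$sample" ;;
      # zstd keeps the original by default; -19 = near-maximum ratio
      zstd) zstd -19 -f -q "$sample" -o "$sample.zst" ;;
    esac
  else
    echo "$tool not installed; skipping" >&2
  fi
done

# Show the resulting sizes for whichever tools ran
ls -l "$sample"*
```

Timing decompression (e.g. with `time xz -dc` vs. `time zstd -dc` into /dev/null) on the largest real files would complete the picture.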
