Goals include:
- Reduce download time, build time, disk usage...
- Increase robustness / resilience (e.g. recovering from interrupted download)
- ... (to be continued)
- add_data.sh: Preliminary reorganization (PR #68)
- Add header comment, make comments easier to read, and similar trivial/pedantic changes
- Use `jq` to simplify JSON parsing (see https://cameronnokes.com/blog/working-with-json-in-bash-using-jq/)
- Move repetitive calls into functions (first round):
- run_psql()
- run_ogr2ogr()
- fetch_csv() (for CSV files hosted on GitHub, esp. LFS files)
- Add `apt-get install -y jq` to python/Dockerfile
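The first-round helpers listed above could look roughly like this; the function bodies, flags, and connection variables are illustrative assumptions, not the actual add_data.sh code:

```shell
# Sketch of the proposed first-round helpers (assumed defaults/flags).

# Run a SQL command against the project database.
run_psql() {
    psql -v ON_ERROR_STOP=1 \
         -h "${DB_HOST:-localhost}" -U "${DB_USER:-postgres}" "$@"
}

# Run ogr2ogr with common options factored out (placeholder options).
run_ogr2ogr() {
    ogr2ogr -f PostgreSQL "PG:dbname=${DB_NAME:-gis}" "$@"
}

# Fetch a CSV file hosted on GitHub (incl. LFS files, which are
# served via redirects that curl's -L follows).
fetch_csv() {
    local url="$1" dest="$2"
    curl -fsSL --retry 3 -o "$dest" "$url"
}
```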
- Make download more fault tolerant and maybe faster
- add_data.sh: make curl commands retry on transient errors during transfer (Issue #66, solved in PR #82: add --retry option to curl download)
- Download compressed CSV files instead of uncompressed CSV through Git LFS (Issue #91: Create XZ-compressed Git repos and download from them)
- Verify checksum (Issue add_data.sh - verify checksum of downloaded files #83)
- Abort and restart curl download if speed too slow and/or fails (Issue Abort and restart curl download if speed too slow and/or fails #90)
- Parallel downloads with `xargs -P`? And/or use a pre-generated tarball to group hundreds of CSV files in one go?
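Several of the items above (retry, abort-on-slow-transfer, checksum verification) could combine into one download helper. curl's `--retry`, `--speed-limit`/`--speed-time` options and `sha256sum -c` are real; the helper itself is only a sketch with hypothetical names:

```shell
# Sketch of a fault-tolerant download: retry transient failures,
# abort if the transfer crawls, verify a SHA-256 checksum.
download_verified() {
    local url="$1" dest="$2" sha256="$3"
    # Retry up to 5 times; give up if slower than 10 KiB/s for 30 s.
    curl -fsSL --retry 5 --retry-delay 2 \
         --speed-limit 10240 --speed-time 30 \
         -o "$dest" "$url"
    # Fail loudly if the checksum does not match.
    echo "$sha256  $dest" | sha256sum -c -
}
```

Adding `-C -` would additionally let curl resume a partially completed download after an abort.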
- add_data.sh: flexible data loading (Issue model-factory#53: add_data.sh related Python scripts - flexible data loading)
- Delay pygeoapi (or even Elasticsearch and Kibana) start (Issue #93, was: Investigate pygeoapi container restart problem)
- ShellCheck (PR #89: Add .github/workflows/shellcheck.yml)
- Add GitHub workflow to run ShellCheck GitHub action, see https://github.com/marketplace/actions/shellcheck
- Fix ShellCheck errors and warnings
- Move repetitive calls into functions (second round)
- fetch_dir() (for listings (directories) hosted on GitHub)
- run_git_clone() (?)
- Benchmark and profiling
Future tasks (that have yet to be turned into GitHub issues):
- Use of e.g. `/usr/bin/time -v` for profiling
- `docker-compose logs -f -t` provides a log with timestamps
- Some kind of DEBUG variable? E.g. make the psql flag `-a` / `--echo-all` optional unless in DEBUG mode, for a more concise log
- Add an option to delete downloaded *.gpkg and *.csv files as soon as they have been imported, to save space
- etc.
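The DEBUG-variable idea above could be sketched as a small helper that emits psql's `--echo-all` flag only in debug mode; the variable name and convention are assumptions:

```shell
# Emit extra psql flags only when DEBUG=1 (assumed convention),
# keeping the normal log concise.
psql_flags() {
    if [ "${DEBUG:-0}" = "1" ]; then
        echo "--echo-all"    # echo every statement for debugging
    fi
}
# hypothetical usage: psql $(psql_flags) -f load_tables.sql
```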
Maybe in Round 2 of refactoring? Or this round? Need to discuss with Drew first:
- Leave the model-factory/scripts/* files where they are instead of copying them?
- Use e.g. _build and _data directories to separate our code from downloaded data and temporary build files?
Random ideas, questions, etc.
- Make add_data.sh capable of being run over and over again ("incremental build", build stamp, etc.)
- ogr2ogr, if run repeatedly with the same data: `-append`, `-update`, or `-overwrite`?
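The "incremental build" and repeated-ogr2ogr ideas could meet in a stamp-file check: the first run imports normally, later runs switch to `-overwrite` (or `-append`/`-update`, depending on the data). The stamp-file convention and names are assumptions:

```shell
# Pick the ogr2ogr mode for repeated runs of add_data.sh
# (hypothetical stamp-file convention).
ogr_mode() {
    local stamp="$1"
    if [ -f "$stamp" ]; then
        echo "-overwrite"    # layer already imported once
    fi                       # first run: no extra flag, plain create
}
# hypothetical usage:
# ogr2ogr $(ogr_mode /tmp/parcels.stamp) -f PostgreSQL "PG:dbname=gis" parcels.gpkg
# touch /tmp/parcels.stamp
```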
- Use Backblaze B2 for large file storage, for speed and reduced cost? https://nickb.dev/blog/backblaze-b2-as-a-cheaper-alternative-to-githubs-git-lfs
  - GitHub data packs: storage $0.1/GB/month, download $0.1/GB (TODO: verify)
  - Amazon S3: TODO
  - Backblaze B2: storage $0.005/GB/month, download $0.01/GB
- Use eatmydata with PostgreSQL for speed? Or use `fsync=off`, `synchronous_commit=off` and `full_page_writes=off` instead; see Issue #77 (Speed up database writes with synchronous_commit=off (and full_page_write=off and fsync=off?))
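For reference, the non-durable settings from Issue #77 would go in postgresql.conf (or be passed as `-c` flags to the postgres container). These trade crash safety for write speed, so they are only appropriate for a database that can be rebuilt from scratch:

```
# postgresql.conf fragment: durability off (rebuildable data only)
fsync = off
synchronous_commit = off
full_page_writes = off
```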