Goals include:
- Reduce download time, build time, disk usage...
- Increase robustness / resilience (e.g. recovering from interrupted download)
- ... (to be continued)
- add_data.sh: Preliminary reorganization (PR #68)
- Add header comment, make comments easier to read, and similar trivial/pedantic changes
- Use `jq` to simplify JSON parsing (see https://cameronnokes.com/blog/working-with-json-in-bash-using-jq/)
- Move repetitive calls into functions (first round):
- run_psql()
- run_ogr2ogr()
- fetch_csv() (for CSV files hosted on GitHub, esp. LFS files)
- Add `apt-get install -y jq` to python/Dockerfile
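The first-round helpers listed above could look roughly like this; the function bodies, flags, and connection variables are illustrative assumptions, not the actual add_data.sh code:

```shell
# Sketch of the proposed first-round helpers (assumed defaults/flags).

# Run a SQL command against the project database.
run_psql() {
    psql -v ON_ERROR_STOP=1 \
         -h "${DB_HOST:-localhost}" -U "${DB_USER:-postgres}" "$@"
}

# Run ogr2ogr with common options factored out (placeholder options).
run_ogr2ogr() {
    ogr2ogr -f PostgreSQL "PG:dbname=${DB_NAME:-gis}" "$@"
}

# Fetch a CSV file hosted on GitHub (incl. LFS files, which are
# served via redirects that curl's -L follows).
fetch_csv() {
    local url="$1" dest="$2"
    curl -fsSL --retry 3 -o "$dest" "$url"
}
```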
- Make download more fault tolerant and maybe faster
- add_data.sh: make curl commands retry on transient errors during transfer (Issue #66, solved in PR #82: add --retry option to curl download)
- Download compressed CSV files instead of uncompressed CSV through Git LFS (Issue #91: Create XZ-compressed Git repos and download from them)
- Verify checksum (Issue add_data.sh - verify checksum of downloaded files #83)
- Abort and restart curl download if speed too slow and/or fails (Issue Abort and restart curl download if speed too slow and/or fails #90)
- Parallel downloads with `xargs -P`? And/or use a pre-generated tarball to group hundreds of CSV files in one go?
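Several of the items above (retry, abort-on-slow-transfer, checksum verification) could combine into one download helper. curl's `--retry`, `--speed-limit`/`--speed-time` options and `sha256sum -c` are real; the helper itself is only a sketch with hypothetical names:

```shell
# Sketch of a fault-tolerant download: retry transient failures,
# abort if the transfer crawls, verify a SHA-256 checksum.
download_verified() {
    local url="$1" dest="$2" sha256="$3"
    # Retry up to 5 times; give up if slower than 10 KiB/s for 30 s.
    curl -fsSL --retry 5 --retry-delay 2 \
         --speed-limit 10240 --speed-time 30 \
         -o "$dest" "$url"
    # Fail loudly if the checksum does not match.
    echo "$sha256  $dest" | sha256sum -c -
}
```

Adding `-C -` would additionally let curl resume a partially completed download after an abort.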
- add_data.sh: flexible data loading (Issue model-factory#53: add_data.sh related Python scripts - flexible data loading)
- Delay pygeoapi (or even Elasticsearch and Kibana) start (Issue #93, was: Investigate pygeoapi container restart problem)
- ShellCheck (PR #89: Add .github/workflows/shellcheck.yml)
- Add GitHub workflow to run ShellCheck GitHub action, see https://github.com/marketplace/actions/shellcheck
- Fix ShellCheck errors and warnings
- Move repetitive calls into functions (second round)
- fetch_dir() (for listings (directories) hosted on GitHub)
- run_git_clone() (?)
- Benchmark and profiling
Future tasks (that have yet to be turned into GitHub issues):
- Use of e.g. `/usr/bin/time -v` for profiling
- `docker-compose logs -f -t` provides a log with timestamps
- Some kind of DEBUG variable? E.g. make the psql flag `-a` / `--echo-all` optional unless in DEBUG mode, for a more concise log
- Add an option to delete downloaded *.gpkg and *.csv files as soon as they have been imported, to save space
- etc.
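The DEBUG-variable idea above could be sketched as a small helper that emits psql's `--echo-all` flag only in debug mode; the variable name and convention are assumptions:

```shell
# Emit extra psql flags only when DEBUG=1 (assumed convention),
# keeping the normal log concise.
psql_flags() {
    if [ "${DEBUG:-0}" = "1" ]; then
        echo "--echo-all"    # echo every statement for debugging
    fi
}
# hypothetical usage: psql $(psql_flags) -f load_tables.sql
```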
Maybe in Round 2 of refactoring? Or this round? Need to discuss with Drew first:
- Leave the model-factory/scripts/* files where they are instead of copying them?
- Use e.g. _build and _data directories to separate our code from downloaded data and temporary build files?
Random ideas, questions, etc.
- Make add_data.sh capable of being run over and over again ("incremental build", build stamp, etc.)
- ogr2ogr, if run repeatedly with the same data: `-append`, `-update`, or `-overwrite`?
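The "incremental build" and repeated-ogr2ogr ideas could meet in a stamp-file check: the first run imports normally, later runs switch to `-overwrite` (or `-append`/`-update`, depending on the data). The stamp-file convention and names are assumptions:

```shell
# Pick the ogr2ogr mode for repeated runs of add_data.sh
# (hypothetical stamp-file convention).
ogr_mode() {
    local stamp="$1"
    if [ -f "$stamp" ]; then
        echo "-overwrite"    # layer already imported once
    fi                       # first run: no extra flag, plain create
}
# hypothetical usage:
# ogr2ogr $(ogr_mode /tmp/parcels.stamp) -f PostgreSQL "PG:dbname=gis" parcels.gpkg
# touch /tmp/parcels.stamp
```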
- Use Backblaze B2 for large file storage, for speed and reduced cost? https://nickb.dev/blog/backblaze-b2-as-a-cheaper-alternative-to-githubs-git-lfs
  - GitHub data packs: storage $0.1/GB/month, download $0.1/GB (TODO: verify)
  - Amazon S3: TODO
  - Backblaze B2: storage $0.005/GB/month, download $0.01/GB
- Use eatmydata with PostgreSQL for speed? Or use `fsync=off`, `synchronous_commit=off` and `full_page_writes=off` instead; see Issue #77 (Speed up database writes with synchronous_commit=off (and full_page_write=off and fsync=off?))
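For reference, the non-durable settings from Issue #77 would go in postgresql.conf (or be passed as `-c` flags to the postgres container). These trade crash safety for write speed, so they are only appropriate for a database that can be rebuilt from scratch:

```
# postgresql.conf fragment: durability off (rebuildable data only)
fsync = off
synchronous_commit = off
full_page_writes = off
```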