Optimize download amount of PyTest pre-commit check #3322

@m-hau

Description

When running the PyTest pre-commit check with the default configuration of USE_LOCAL_KNOWN_SLUGS = False, it fetches an up-to-date copy of the slug pickles from GitHub. To get this ~400KiB of needed data, it does a fresh full clone of this devicetype-library git repository into a temporary directory, which comes to a total download size of ~610MiB (and about double that in used disk/RAM space). That's more than 1,500 times the data actually needed (an overhead of over 150,000%), for every single invocation of the hook!

My first thought was: easy fix, make the clone shallow and limit it to the last commit only. But even then the download is ~526MiB, which is better but still far too much. Looking at the shallow clone, ~95% of the space is taken up by elevation images. So to get the download size down further, the clone also needs to be a sparse one that only fetches the relevant tests sub-directory of the repo.
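A minimal sketch of what a shallow plus sparse clone could look like, assuming git 2.25 or newer on the client and a server that supports partial clone. Note that a sparse checkout on its own only limits the working tree; it is the `--filter=blob:none` partial-clone filter that keeps the unwanted blobs out of the download. The helper names `sparse_clone_commands` and `sparse_clone` are hypothetical, and the repository URL and sub-directory are left as parameters.

```python
import subprocess
from typing import List


def sparse_clone_commands(repo_url: str, dest: str, subdir: str) -> List[List[str]]:
    """Build the git invocations for a shallow, blob-filtered, sparse clone."""
    return [
        # Last commit only, no file contents up front, sparse checkout enabled.
        ["git", "clone", "--depth=1", "--filter=blob:none", "--sparse", repo_url, dest],
        # Restrict the working tree to one sub-directory; git then fetches
        # only the blobs needed for that directory.
        ["git", "-C", dest, "sparse-checkout", "set", subdir],
    ]


def sparse_clone(repo_url: str, dest: str, subdir: str) -> None:
    """Run the commands above, raising CalledProcessError if git fails."""
    for cmd in sparse_clone_commands(repo_url, dest, subdir):
        subprocess.run(cmd, check=True)
```

Calling `sparse_clone(url, tmp_dir, "tests")` would replace the current full clone. I have not measured the resulting download size, so treat the savings as something to verify.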

Cloning with --local or --shared could also speed things up, but there is no guarantee that the pre-commit check is run from an actual git repository; it could also be run from a shell inside an unpacked source archive.

At this point I took a step back and asked myself: why even bother with a git clone, if all you need is the latest version of two small files? It may be easier to replace the whole repo clone with a simple HTTPS download of these two files. If you don't want to pull in another dependency like requests for this, it can also be done using the Python stdlib (urllib.request and friends). Doing it naively by fetching the raw files from the master branch, you have no guarantee that both files belong to the same commit (imagine a new commit is pushed while you are downloading the first file). But this can be prevented by doing three simple HTTP requests: get the current commit ID of the master branch, get file A from that exact commit, and then get file B from that same commit.
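The three-request idea could be sketched roughly like this, using only the stdlib. It assumes the GitHub REST endpoint `GET /repos/{owner}/{repo}/commits/{ref}` (whose JSON response carries a `sha` field) and `raw.githubusercontent.com` for the file contents. The function names, the `User-Agent` string, and the injectable `get` parameter are hypothetical, and the actual pickle paths are left to the caller.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, Tuple

API_URL = "https://api.github.com/repos/{repo}/commits/{branch}"
RAW_URL = "https://raw.githubusercontent.com/{repo}/{sha}/{path}"


def http_get(url: str) -> bytes:
    """Plain stdlib GET; the User-Agent value is an arbitrary placeholder."""
    req = urllib.request.Request(url, headers={"User-Agent": "slug-fetcher"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def fetch_files_at_head(
    repo: str,
    branch: str,
    paths: Iterable[str],
    get: Callable[[str], bytes] = http_get,
) -> Tuple[str, Dict[str, bytes]]:
    """Resolve the branch to a commit once, then fetch every path from that commit."""
    # Request 1: pin the commit so that later pushes cannot mix file versions.
    sha = json.loads(get(API_URL.format(repo=repo, branch=branch)))["sha"]
    # Requests 2..n: fetch each file by commit ID rather than by branch name.
    files = {p: get(RAW_URL.format(repo=repo, sha=sha, path=p)) for p in paths}
    return sha, files
```

With two paths this issues exactly three requests. Passing `get` as a parameter is only there so the logic can be exercised without network access; unauthenticated API calls are also rate limited, which may matter for CI runs.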

Any thoughts on this?
