Action to Check for Any Large Files in the Git History #9

@zhuchcn

Although every repo should have a .gitignore file, large files can still be committed by accident. Sometimes the author notices and removes them in a later commit, but the blobs remain in the history. During code review, the only way to spot them is to inspect every commit manually. Once the PR is merged, these files stay in the repo forever.

So I propose adding a GitHub Action that performs this check. I used the command below to list all blobs in the repo, sorted by size in ascending order.

git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sed -n 's/^blob //p' \
  | sort --numeric-sort --key=2 \
  | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Example output (last 10 lines):

1a98da8ccb1a1330e717215b970b3fd5601a40fe   12KiB src/cptac_luad/processing/create_metapipeline_dna_input_from_dataset_registry_yaml.py
d283a8677377c57d0247ba29360f9fc8b5b4b02c   13KiB src/cptac_luad/processing/create_metapipeline_dna_input_from_dataset_registry_yaml.py
8e392c90d8731ae09d23f7c292ba7f352cf88911   13KiB src/cptac_luad/processing/create_metapipeline_dna_input_from_dataset_registry_yaml.py
64bdb3bc4a54840130341e05bf20c407545c731f   14KiB src/cptac_luad/processing/create_metapipeline_dna_input_from_dataset_registry_yaml.py
5ca6502471ffae73e76942392973d813599aaad7   14KiB src/cptac_luad/processing/create_metapipeline_dna_input_from_dataset_registry_yaml.py
3365f0b7c3efedfc22555e1ac9351db7762dfb92   14KiB src/cptac_luad/processing/create_metapipeline_dna_input_from_dataset_registry_yaml.py
2e4d1fe710899f1941251178ea67f2fd9a476634   17KiB .pylintrc
59436aec7b1cbe9c208edd5f290c625ebc3792e8   17KiB .pylintrc
d159169d1050894d3ea3b98e1c965c4058208fe1   18KiB LICENSE.md
8e8e72a88324670c5abd321803477f3e419d36a6   18KiB .pylintrc

This way, we can catch these "hidden" large or otherwise unwanted files in the PR branch's commit history.
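As a rough sketch of how the check step could work (not a finished action), the same pipeline can be turned into a pass/fail test. The function names, the 1 MiB default limit, and the `origin/main..HEAD` range in the usage note are all assumptions for illustration, not something we have agreed on.

```shell
#!/usr/bin/env bash
# Sketch only: all names and the default limit below are hypothetical.

# Assumed default limit of 1 MiB; override through the environment.
LARGE_BLOB_LIMIT_BYTES="${LARGE_BLOB_LIMIT_BYTES:-1048576}"

# Reads "<type> <sha> <size> <path>" lines on stdin and prints only the
# blob lines whose size is strictly greater than the limit given as $1.
filter_large_blobs() {
  awk -v limit="$1" '$1 == "blob" && ($3 + 0) > (limit + 0) { print }'
}

# Lists every object reachable from the given revisions, using the same
# rev-list / cat-file pipeline as the command above.
list_objects() {
  git rev-list --objects "$@" \
    | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'
}

# Returns 0 if no blob exceeds the limit, 1 if any does, 2 if git failed.
# Usage: check_large_blobs <limit-in-bytes> <rev-list arguments...>
check_large_blobs() {
  local limit="$1"
  shift
  local offenders
  # pipefail (bash) makes a git failure visible instead of looking like "no offenders".
  offenders="$(set -o pipefail; list_objects "$@" | filter_large_blobs "$limit")" || return 2
  if [ -n "$offenders" ]; then
    echo "Blobs larger than ${limit} bytes:" >&2
    echo "$offenders" >&2
    return 1
  fi
  return 0
}
```

In a workflow step this might be called as `check_large_blobs "$LARGE_BLOB_LIMIT_BYTES" origin/main..HEAD`, so that only commits on the PR branch are scanned. That assumes the job has the full history available (for example, `fetch-depth: 0` with `actions/checkout`), since a shallow clone would hide the earlier commits we are trying to inspect.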

@aholmes @nwiltsie What do you think?
