image pull fail from corrupted Crane cache #3194

Open
AustinAbro321 opened this issue Nov 6, 2024 · 15 comments · May be fixed by #3559

Comments

@AustinAbro321
Contributor

Describe what should be investigated or refactored

Seeing a flake in the test-external workflow. Images are failing to be saved.

Workflow run
Relevant logs:

  •  Fetching info for 9 images. This step may take several seconds to complete.
  •  Fetched info for 9 images
  •  Pulling 9 images (0.00 Byte of 243.74 MBs)

 WARNING  Failed to save images in parallel, falling back to sequential save: All attempts fail:
          #1: error writing layer: expected blob size 3419706, but only wrote 3207362
          #2: error writing layer: expected blob size 3419706, but only wrote 3207362
     ERROR:  failed to create package: All attempts fail:
             #1: error writing layer: expected blob size 3419706, but only wrote 3207362
             #2: error writing layer: expected blob size 3419706, but only wrote 3207362
    common.go:33:
        Error Trace:    /home/runner/work/zarf/zarf/src/test/external/common.go:33
                        /home/runner/work/zarf/zarf/src/test/external/ext_in_cluster_test.go:165
@AustinAbro321
Contributor Author

I've validated that this is not caused by disk space, as the error in that case looks different:

failed to create package: All attempts fail:
             #1: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296153189335322:
             no space left on device
             #2: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296152478328656:
             no space left on device
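For anyone triaging a similar failure, a quick way to rule disk space in or out before suspecting the cache is a couple of standard commands; this is only a sketch, and the cache path below is an assumption (the default unless --zarf-cache is set):

    # confirm the staging filesystem is not the problem before blaming the cache
    df -h /tmp                 # the logs above show the package being staged under /tmp/zarf-<random>
    du -sh ~/.zarf-cache       # size of the default Zarf cache (assumed path; override with --zarf-cache)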

@AustinAbro321
Contributor Author

I ran a script that builds the podinfo-flux package (what the test flakes on) 100 times, in two different terminals in parallel. I was not able to reproduce this error.

@RothAndrew reported that a similar error happens to him during his day-to-day work with a separate private package. It does not happen to him when the images are not already in the Zarf cache. This fits, because in our usual e2e tests we delete the Zarf cache right away for storage purposes; that is likely why the flake only appears in the test-external workflow.

@RothAndrew
Contributor

It happens so persistently for me that I ended up doing this pretty much anywhere I’m making zarf packages now. https://github.com/defenseunicorns-partnerships/wfapi/blob/main/scripts/build_zarf_package.sh
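For reference, a minimal sketch of that kind of wrapper (the linked script does more; this only shows the clear-then-build pattern, using the zarf tools clear-cache command discussed later in this thread):

    #!/usr/bin/env bash
    # rough sketch: clear the Zarf image cache before every build so a corrupted
    # blob left over from a previous run cannot poison this one
    set -euo pipefail

    PACKAGE_DIR="${1:-.}"             # directory containing zarf.yaml

    zarf tools clear-cache            # drop any previously cached layers
    zarf package create "$PACKAGE_DIR" --confirm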

@RothAndrew
Contributor

I ran a script that builds the podinfo-flux package (what the test flakes on) 100 times, in two different terminals in parallel. I was not able to reproduce this error.

I wonder if the image size or number of layers makes a difference when trying to reproduce it. Podinfo is much smaller than most of the images I work with.

@AustinAbro321
Contributor Author

AustinAbro321 commented Dec 20, 2024

@RothAndrew Has it ever happened with only one image?

Every failure I looked at in test-external (https://github.com/zarf-dev/zarf/actions/workflows/test-external.yml?query=is%3Afailure) fails with expected blob size 3419706. Decompressing the package and running find . -type f -size 3419706c on the layers returns ./94c7366c1c3058fbc60a5ea04b6d13199a592a67939a043c41c051c4bfcd117a, which is the base layer for these six images in the package. Possibly having multiple images grabbing the same layer makes this flake more likely; a quick way to check the shared layer is sketched after the list.

  - ghcr.io/fluxcd/helm-controller:v1.1.0
  - ghcr.io/fluxcd/image-automation-controller:v0.39.0
  - ghcr.io/fluxcd/image-reflector-controller:v0.33.0
  - ghcr.io/fluxcd/kustomize-controller:v1.4.0
  - ghcr.io/fluxcd/notification-controller:v1.4.0
  - ghcr.io/fluxcd/source-controller:v1.4.1
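To sanity-check the shared-base-layer theory, something along these lines should print the same first-layer digest for all six images. This is a sketch only: it assumes the upstream crane CLI and jq are installed, and the --platform usage is from memory and may differ between crane versions.

    # print the first layer digest of each flux image to confirm they share a base layer
    for img in helm-controller:v1.1.0 image-automation-controller:v0.39.0 \
               image-reflector-controller:v0.33.0 kustomize-controller:v1.4.0 \
               notification-controller:v1.4.0 source-controller:v1.4.1; do
      echo -n "$img: "
      crane manifest --platform linux/amd64 "ghcr.io/fluxcd/$img" | jq -r '.layers[0].digest'
    done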

@RothAndrew
Contributor

Has it ever happened with only one image?

I'm not sure. I feel like it definitely happens more when there are multiple images, when the images are large, or when the registry being pulled from is slow.

@AustinAbro321
Contributor Author

Pretty sure I found the issue: Zarf was not properly deleting invalid layers from the cache when they occurred. @RothAndrew feel free to test out #3358; either way, the team will see in time whether the flake disappears.
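In the meantime, if a cache is already suspect, something along these lines can prune it by hand. This is only a sketch: the default cache path (~/.zarf-cache) and the images/blobs/sha256 sub-path are assumptions inferred from the error paths earlier in this thread; adjust if you set --zarf-cache.

    # delete any cached blob whose content no longer matches the digest in its filename
    cache="${HOME}/.zarf-cache/images/blobs/sha256"
    for blob in "$cache"/*; do
      [ -f "$blob" ] || continue
      name="$(basename "$blob")"
      sum="$(sha256sum "$blob" | cut -d' ' -f1)"
      if [ "$name" != "$sum" ]; then
        echo "removing corrupted blob $name"
        rm -f "$blob"
      fi
    done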

@AustinAbro321
Contributor Author

While #3358 definitely did solve some incorrect logic, we are still seeing this error - https://github.com/zarf-dev/zarf/actions/runs/12752633384/job/35542502740?pr=3398.

I am unable to reproduce it locally. Even when I place an invalid layer directly in the Zarf cache, it now gets cleaned up properly.
@RothAndrew Are you still seeing this error in v0.46.0? If so, do you have a public package I can test with?

@RothAndrew
Contributor

Not sure. I’ll keep an eye out.

@CafeLungo

CafeLungo commented Feb 11, 2025

On Zarf v0.46.0:
We are seeing this often in a GitLab pipeline, but we cannot reproduce it locally. We tried setting --oci-concurrency 1 and that seemed to help some, but even with that setting it still happens sometimes. This has been a tricky one to track down for us.

It appears the error comes from here: https://github.com/google/go-containerregistry/blame/main/pkg/v1/layout/write.go#L243

Other refs:

@AustinAbro321
Contributor Author

Yup, we are working on switching off of crane to a library that handles concurrency natively; see #3434. Note that --oci-concurrency is only relevant when pulling and pushing Zarf packages.

@AustinAbro321 linked a pull request on Mar 7, 2025 that will close this issue
@RothAndrew
Contributor

RothAndrew commented Mar 7, 2025

@AustinAbro321 would you be open to updating the title and description on this issue to reflect the more widespread problem? At first glance this looks like a simple test flake, but it's actually the issue tracking the widespread crane cache problem (expected blob size ..., but only wrote ...), which is the reason for replacing crane.

EDIT: one of the reasons for replacing crane

@AustinAbro321 changed the title from "flake: failing during image pull when building podinfo-flux package in test-external" to "image pull fail from corrupted Crane cache" on Mar 7, 2025
@NunoSav

NunoSav commented Mar 13, 2025

We're facing this issue in one of our pipelines. Is there a way to work around it? Thanks

@AustinAbro321
Contributor Author

I believe the temporary solution while we work on a permanent fix in #3559 is to either use zarf tools clear-cache before running zarf package create or set the --zarf-cache directory to a random temporary directory each run.
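For example, in a CI job either of these shapes should work (a sketch only; the package path and extra flags will vary):

    # option 1: clear the cache before every build
    zarf tools clear-cache
    zarf package create . --confirm

    # option 2: give each run its own throwaway cache
    zarf package create . --confirm --zarf-cache "$(mktemp -d)"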

@NunoSav

NunoSav commented Mar 13, 2025

I believe the temporary solution while we work on a permanent fix in #3559 is to either use zarf tools clear-cache before running zarf package create or set the --zarf-cache directory to a random temporary directory each run.

Worked. 👍
