image pull fail from corrupted Crane cache #3194

Open
AustinAbro321 opened this issue Nov 6, 2024 · 15 comments · May be fixed by #3559

Comments

@AustinAbro321
Contributor

Describe what should be investigated or refactored

Seeing a flake in the test-external workflow. Images are failing to be saved.

Workflow run
Relevant logs:

  •  Fetching info for 9 images. This step may take several seconds to complete.
  •  Fetched info for 9 images
  •  Pulling 9 images (0.00 Byte of 243.74 MBs)

 WARNING  Failed to save images in parallel, falling back to sequential save: All attempts fail:
          #1: error writing layer: expected blob size 3419706, but only wrote 3207362
          #2: error writing layer: expected blob size 3419706, but only wrote 3207362
     ERROR:  failed to create package: All attempts fail:
             #1: error writing layer: expected blob size 3419706, but only wrote 3207362
             #2: error writing layer: expected blob size 3419706, but only wrote 3207362
    common.go:33:
        Error Trace:    /home/runner/work/zarf/zarf/src/test/external/common.go:33
                        /home/runner/work/zarf/zarf/src/test/external/ext_in_cluster_test.go:165
@AustinAbro321
Contributor Author

I've validated that this is not caused by disk space, as the error in that case looks different:

failed to create package: All attempts fail:
             #1: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296153189335322:
             no space left on device
             #2: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296152478328656:
             no space left on device
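For anyone triaging a similar failure, a quick way to rule disk space in or out before suspecting the cache is a couple of standard commands; this is only a sketch, and the cache path below is an assumption (the default unless --zarf-cache is set):

    # confirm the staging filesystem is not the problem before blaming the cache
    df -h /tmp                 # the logs above show the package being staged under /tmp/zarf-<random>
    du -sh ~/.zarf-cache       # size of the default Zarf cache (assumed path; override with --zarf-cache)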

@AustinAbro321
Contributor Author

I ran a script that builds the podinfo-flux package (what the test flakes on) 100 times, in two different terminals in parallel. I was not able to reproduce this error.

@RothAndrew reported that a similar error happens to him during his day-to-day work with a separate private package. It does not happen to him when the images are not already in the Zarf cache. This fits, because in our usual e2e tests we delete the Zarf cache right away for storage purposes; that is likely why the flake only appears in the test-external workflow.

@RothAndrew
Contributor

It happens so persistently for me that I ended up doing this pretty much anywhere I’m making zarf packages now. https://github.com/defenseunicorns-partnerships/wfapi/blob/main/scripts/build_zarf_package.sh
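For reference, a minimal sketch of that kind of wrapper (the linked script does more; this only shows the clear-then-build pattern, using the zarf tools clear-cache command discussed later in this thread):

    #!/usr/bin/env bash
    # rough sketch: clear the Zarf image cache before every build so a corrupted
    # blob left over from a previous run cannot poison this one
    set -euo pipefail

    PACKAGE_DIR="${1:-.}"             # directory containing zarf.yaml

    zarf tools clear-cache            # drop any previously cached layers
    zarf package create "$PACKAGE_DIR" --confirm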

@RothAndrew
Contributor

I ran a script that builds the podinfo-flux package (what the test flakes on) 100 times, in two different terminals in parallel. I was not able to reproduce this error.

I wonder if the image size or number of layers makes a difference when trying to reproduce it. Podinfo is much smaller than most of the images I work with.

@AustinAbro321
Contributor Author

AustinAbro321 commented Dec 20, 2024

@RothAndrew Has it ever happened with only one image?

Every failure I looked at in test-external (https://github.com/zarf-dev/zarf/actions/workflows/test-external.yml?query=is%3Afailure) fails with expected blob size 3419706. Decompressing the package and running find . -type f -size 3419706c on the layers returns ./94c7366c1c3058fbc60a5ea04b6d13199a592a67939a043c41c051c4bfcd117a, which is the base layer for these six images in the package. Possibly having multiple images grabbing the same layer makes this flake more likely; a quick way to check the shared layer is sketched after the list.

  - ghcr.io/fluxcd/helm-controller:v1.1.0
  - ghcr.io/fluxcd/image-automation-controller:v0.39.0
  - ghcr.io/fluxcd/image-reflector-controller:v0.33.0
  - ghcr.io/fluxcd/kustomize-controller:v1.4.0
  - ghcr.io/fluxcd/notification-controller:v1.4.0
  - ghcr.io/fluxcd/source-controller:v1.4.1
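To sanity-check the shared-base-layer theory, something along these lines should print the same first-layer digest for all six images. This is a sketch only: it assumes the upstream crane CLI and jq are installed, and the --platform usage is from memory and may differ between crane versions.

    # print the first layer digest of each flux image to confirm they share a base layer
    for img in helm-controller:v1.1.0 image-automation-controller:v0.39.0 \
               image-reflector-controller:v0.33.0 kustomize-controller:v1.4.0 \
               notification-controller:v1.4.0 source-controller:v1.4.1; do
      echo -n "$img: "
      crane manifest --platform linux/amd64 "ghcr.io/fluxcd/$img" | jq -r '.layers[0].digest'
    done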

@RothAndrew
Contributor

Has it ever happened with only one image?

I'm not sure. I feel like it definitely happens more when there are multiple images, when the images are large, or when the registry being pulled from is slow.

@AustinAbro321
Contributor Author

Pretty sure I found the issue: Zarf was not properly deleting invalid layers from the cache when they occurred. @RothAndrew feel free to test out #3358; either way, the team will see in time whether the flake disappears.
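In the meantime, if a cache is already suspect, something along these lines can prune it by hand. This is only a sketch: the default cache path (~/.zarf-cache) and the images/blobs/sha256 sub-path are assumptions inferred from the error paths earlier in this thread; adjust if you set --zarf-cache.

    # delete any cached blob whose content no longer matches the digest in its filename
    cache="${HOME}/.zarf-cache/images/blobs/sha256"
    for blob in "$cache"/*; do
      [ -f "$blob" ] || continue
      name="$(basename "$blob")"
      sum="$(sha256sum "$blob" | cut -d' ' -f1)"
      if [ "$name" != "$sum" ]; then
        echo "removing corrupted blob $name"
        rm -f "$blob"
      fi
    done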

@AustinAbro321
Contributor Author

While #3358 definitely did solve some incorrect logic, we are still seeing this error - https://github.com/zarf-dev/zarf/actions/runs/12752633384/job/35542502740?pr=3398.

I am unable to reproduce it locally. Even when I place an invalid layer directly in the Zarf cache, it now gets cleaned up properly.
@RothAndrew Are you still seeing this error in v0.46.0? If so, do you have a public package I can test with?

@RothAndrew
Contributor

Not sure. I’ll keep an eye out.

@CafeLungo

CafeLungo commented Feb 11, 2025

On Zarf v0.46.0:
We are seeing this often in a GitLab pipeline, but we cannot reproduce it locally. We tried setting --oci-concurrency 1 and that seemed to help some, but even with that setting it still happens sometimes. This has been a tricky one to track down for us.

It appears the error comes from here: https://github.com/google/go-containerregistry/blame/main/pkg/v1/layout/write.go#L243

Other refs:

@AustinAbro321
Contributor Author

Yup, we are working on switching off of crane to a library that handles concurrency natively; see #3434. Note that --oci-concurrency is only relevant when pulling and pushing Zarf packages.

@AustinAbro321 linked a pull request on Mar 7, 2025 that will close this issue
@RothAndrew
Contributor

RothAndrew commented Mar 7, 2025

@AustinAbro321 would you be open to updating the title and description on this issue to reflect the more widespread problem? At first glance this looks like a simple test flake, but it's actually the issue tracking the widespread crane cache problem (expected blob size ..., but only wrote ...), which is the reason for replacing crane.

EDIT: one of the reasons for replacing crane

@AustinAbro321 changed the title from "flake: failing during image pull when building podinfo-flux package in test-external" to "image pull fail from corrupted Crane cache" on Mar 7, 2025
@NunoSav

NunoSav commented Mar 13, 2025

We're facing this issue in one of our pipelines. Is there a way to work around it? Thanks

@AustinAbro321
Contributor Author

I believe the temporary solution while we work on a permanent fix in #3559 is to either use zarf tools clear-cache before running zarf package create or set the --zarf-cache directory to a random temporary directory each run.
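For example, in a CI job either of these shapes should work (a sketch only; the package path and extra flags will vary):

    # option 1: clear the cache before every build
    zarf tools clear-cache
    zarf package create . --confirm

    # option 2: give each run its own throwaway cache
    zarf package create . --confirm --zarf-cache "$(mktemp -d)"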

@NunoSav

NunoSav commented Mar 13, 2025

I believe the temporary solution while we work on a permanent fix in #3559 is to either use zarf tools clear-cache before running zarf package create or set the --zarf-cache directory to a random temporary directory each run.

Worked. 👍
