Fix #22575: Use stable image IDs instead of full metadata for Docker build context hashes #22588
Conversation
benjyw left a comment:
```python
if request.build_upstream_images and isinstance(
    getattr(field_set, "source", None), DockerImageSourceField
):
    docker_metadata_gets.append(Get(DigestContents, Digest, built_package.digest))
```
We are moving off `Get`/`MultiGet` in favor of call-by-name. Sorry, this isn't properly documented yet. See #18905 for context.
So instead of `Get(DigestContents, Digest, built_package.digest)` you'd use `get_digest_contents(built_package.digest)` (with `from pants.engine.intrinsics import get_digest_contents`), and then you'd replace `MultiGet` with `concurrently`, imported from `pants.engine.internals.selectors`.
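For illustration, a minimal sketch of that migration, reusing the names from the quoted snippet (the loop that populates `docker_packages` is assumed from the surrounding rule):

```python
from pants.engine.internals.selectors import concurrently
from pants.engine.intrinsics import get_digest_contents

# Inside the async @rule. Old style:
#   docker_metadata_contents = await MultiGet(
#       Get(DigestContents, Digest, pkg.digest) for pkg in docker_packages)
# Call-by-name style: invoke the intrinsic directly and await the batch.
docker_metadata_contents = await concurrently(
    get_digest_contents(built_package.digest) for built_package in docker_packages
)
```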
```python
# For Docker images, we need to extract the metadata filename and create stable digests
docker_metadata_gets = []
docker_package_indices = []
for i, built_package in enumerate(embedded_pkgs):
```
How about `for field_set, built_package in zip(pkgs_wanting_embedding, embedded_pkgs)`, maintaining a list of `docker_packages` instead of `docker_package_indices`? The index is one extra thing for the reader to keep track of, and we could get rid of it.
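In other words, a sketch of the suggested restructuring (names like `request`, `embedded_pkgs`, and `pkgs_wanting_embedding` are taken from the quoted snippets and assumed to pair up one-to-one):

```python
# Get, DigestContents, Digest, and DockerImageSourceField come from the
# surrounding rule in docker_build_context.py.
docker_metadata_gets = []
docker_packages = []
for field_set, built_package in zip(pkgs_wanting_embedding, embedded_pkgs):
    if request.build_upstream_images and isinstance(
        getattr(field_set, "source", None), DockerImageSourceField
    ):
        docker_metadata_gets.append(Get(DigestContents, Digest, built_package.digest))
        # Track the package itself instead of its index into embedded_pkgs.
        docker_packages.append(built_package)
```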
```python
# Extract the original filename from the metadata
if metadata_contents:
    original_filename = next(iter(metadata_contents)).path
```
This is the name of the metadata file (some path ending in `docker-info.json`). I recommend munging the name, so it's clear when debugging that this is redacted data. E.g., `docker-info.stable.json`.
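For example, a sketch of that munging (the `docker-info.json` suffix is taken from the comment above; the `stable_filename` variable is illustrative):

```python
# Rename e.g. ".../docker-info.json" -> ".../docker-info.stable.json" so the
# redacted metadata is recognizable when debugging.
original_filename = next(iter(metadata_contents)).path
stable_filename = original_filename.replace("docker-info.json", "docker-info.stable.json")
```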
```python
for metadata_contents, pkg_index in zip(docker_metadata_contents, docker_package_indices):
    built_package = embedded_pkgs[pkg_index]

    # Extract the original filename from the metadata
```
This comment is unnecessary. The code two lines below it is already perfectly clear thanks to the good variable names you chose.
I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different docker image hash each time, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them). Also, it would be nice to avoid special-casing the build process for docker (I didn't look at the implementation, only the PR description). One idea could be (possibly as an option?) to filter the image tags out of the pants hash calculation.
@kaos @benjyw thank you for reviewing. But it would be nice to know what contributes to the changing hashes/digests.
Sure, but with this change we're no worse off, right? And since the base image would be cached against its inputs, we wouldn't rebuild it in the first place, so the downstream images wouldn't be invalidated.
That would mean filtering the tags out of the metadata file at image build time, which I think means we would never know about them for any other purpose...
Yes and no, I think. As long as you have the previous build cache, yes. But if you don't, and docker rebuilds the image, you may end up with a new docker image digest despite the inputs being the same, resulting in rebuilding everything downstream unnecessarily.
We only need to filter it out at the point where we calculate the pants hash for the target, so no need to affect anything besides the hash value, I think. Edit: but these are the concerns I can see. I have no real issue with going in either direction here, up to you :)
Right, but that's where we are today already wrt tags. So presumably this change is still a move in the right direction?
At that point the metadata is just a Digest, so some code deep in the engine core would have to know to filter that Digest, which introduces a lot of new machinery. This is a pretty good way to get the same effect entirely within the docker backend.
@kaos @benjyw at the moment the Pants design relies heavily on the build cache for docker building, assuming buildkit will just quickly rebuild the image. While ideal in theory, this is often inefficient in several ways, and it's currently one of the biggest issues we have using the Docker backend (which is awesome, btw). I think this needs a proper discussion outside of this PR, but just to give you an intuition of the impact of the problem: even using the split-pex approach highlighted in https://www.pantsbuild.org/blog/2022/08/02/optimizing-python-docker-deploys-using-pants#multiple-images-and-tagging, for a big dependencies pex (anything with torch, for instance) you quickly end up with a 40GB context that has to be uploaded to buildkit even if nothing changes in the deps pex, just so that buildkit can assess whether caching is needed. In our case, this leads to a delay of up to 20+ seconds on every change, even if the change only affects the src files. If you combine that with the fact that the base image might trigger another change to the pants hash (because the tags change), this leads to a very poor experience. I have a proposal for that as well, which I will share perhaps in another issue/PR.
@benjyw I have addressed your comments |
benjyw left a comment:
Thanks for the fix! Just one more small change.
Oh, and this needs a notes update in `docs/notes/2.29.x.md` under the docker backend, and also in the highlights, since I think this is pretty important.
@ademariag any updates on whether you'll continue pursuing #22588? I am also implementing Optimizing Python + Docker Deploys using Pants and similarly want to use …
Yes, will wrap this up soon.
@benjyw sorry for the long delay! I think I have done everything that was requested! Let me know if you need anything else |
benjyw left a comment:
Looks good! Just need to move that release note and we can merge once CI is green.
```markdown
### Backends

#### Docker

[Fixed](https://github.com/pantsbuild/pants/issues/22575) a bug that would make docker hashes unstable when using dynamic tags.
```
We've had a couple of releases since we last went around, so this needs to be in `2.31.x.md` now ... :)
Summary
Fixes #22575 by ensuring Docker image build context hashes remain stable when upstream images have dynamic tags.
Problem
When building Docker images that depend on other Docker images, the build context hash was incorrectly including the full metadata (including tags) of upstream images. This caused downstream images to rebuild unnecessarily whenever upstream image tags changed, even if the actual image content was identical.
Example scenario that was broken: every build would produce different hashes for `app:image`, because the timestamp in `base:image`'s tags was included in the hash calculation.
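A hypothetical BUILD-file sketch of that scenario (the target layout and the interpolated timestamp tag are illustrative assumptions, not copied from the PR):

```python
# base/BUILD (illustrative only)
docker_image(
    name="image",
    image_tags=["latest", "{build_args.BUILD_TIME}"],  # dynamic tag: new value every build
)

# app/BUILD
docker_image(
    name="image",
    dependencies=["base:image"],  # the Dockerfile uses `FROM base:latest`
)
```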
Solution
This PR modifies the build context creation to use only the stable Docker image ID (SHA256 content hash) for upstream Docker images, rather than the full package digest that includes metadata like tags.
Key changes: use the stable image ID from the `BuiltDockerImage` artifact instead of the full metadata digest.
Implementation Details
The fix modifies `create_docker_build_context` in `docker_build_context.py` to detect upstream Docker image packages and replace their metadata digest with a stable one containing only the image ID. This ensures that downstream build context hashes stay stable when only upstream tags change.
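A rough sketch of that substitution, based on the review discussion above (the metadata JSON field name and the helper's shape are assumptions, not the exact implementation):

```python
import json

from pants.engine.fs import CreateDigest, Digest, DigestContents, FileContent
from pants.engine.intrinsics import create_digest


async def stable_metadata_digest(metadata_contents: DigestContents) -> Digest:
    file_content = next(iter(metadata_contents))
    metadata = json.loads(file_content.content)
    # Keep only the stable image ID (the sha256 content hash); drop tags etc.
    stable = {"image_id": metadata["image_id"]}  # field name assumed
    # Munge the filename so the redacted data is obvious when debugging.
    stable_path = file_content.path.replace("docker-info.json", "docker-info.stable.json")
    return await create_digest(
        CreateDigest([FileContent(stable_path, json.dumps(stable).encode())])
    )
```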
Testing
Manual Testing