
Conversation

@ademariag

Summary

Fixes #22575 by ensuring Docker image build context hashes remain stable when upstream images have dynamic tags.

Problem

When building Docker images that depend on other Docker images, the build context hash was incorrectly including the full metadata (including tags) of upstream images. This caused downstream images to rebuild unnecessarily whenever upstream image tags changed, even if the actual image content was identical.

Example scenario that was broken:

# BUILD file with dynamic tags
__defaults__({
    "docker_image": dict(
        image_tags=[
            "{pants.hash}",
            env("TIMESTAMP"),  # e.g., "20240116-123456"
        ],
    ),
})

docker_image(name="base", source="Dockerfile.base")
docker_image(name="app", source="Dockerfile")  # FROM base:image

Every build would produce different hashes for app:image because the timestamp in base:image's tags was included in the hash calculation.

Solution

This PR modifies the build context creation to use only the stable Docker image ID (SHA256 content hash) for upstream Docker images, rather than the full package digest that includes metadata like tags.

Key changes:

  1. Separate handling for Docker images: When processing embedded packages, Docker images are now handled differently from other package types
  2. Extract stable image ID: For Docker images, we extract the image ID from the BuiltDockerImage artifact
  3. Create stable digest: A new digest is created using only the image ID as content, preserving the original filename structure

Implementation Details

The fix modifies create_docker_build_context in docker_build_context.py to:

# Excerpt from the modified rule (imports elided; e.g.
# from pants.engine.fs import CreateDigest, Digest, FileContent).
# For Docker images, use only the image ID to ensure hash stability.
for artifact in built_package.artifacts:
    if isinstance(artifact, BuiltDockerImage):
        # The image ID is the content-addressed SHA256 of the built image,
        # so it changes only with image content, never with tags.
        stable_content = artifact.image_id.encode()
        stable_digest = await Get(
            Digest, CreateDigest([FileContent(original_filename, stable_content)])
        )
        embedded_pkgs_digest.append(stable_digest)

This ensures that:

  • ✅ Hash remains stable when tags change but content doesn't
  • ✅ Hash changes when actual image content changes
  • ✅ Build context structure remains compatible
  • ✅ Caching works correctly in CI/CD pipelines

Testing

Manual Testing

# Before fix: Different hashes each time
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: abc123...
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: def456... ❌

# After fix: Stable hashes
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: abc123...
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: abc123... ✅


@benjyw benjyw left a comment


Thanks for the report and the fix! I think this is a good solution, and I can't see why using the image ID would not be sufficient. Just a few comments, and also it would be great to test this.

cc @kaos and @huonw in case they have any thoughts.

if request.build_upstream_images and isinstance(
    getattr(field_set, "source", None), DockerImageSourceField
):
    docker_metadata_gets.append(Get(DigestContents, Digest, built_package.digest))


We are moving off Get/MultiGet in favor of call-by-name. Sorry, this isn't properly documented yet. See #18905 for context.

So instead of Get(DigestContents, Digest, built_package.digest) you'd use await get_digest_contents(built_package.digest) (with from pants.engine.intrinsics import get_digest_contents)

and then you'd replace MultiGet with concurrently, imported from pants.engine.internals.selectors.
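A minimal sketch of that migration, using names from the hunks in this review (docker_packages stands in for however the Docker-image packages get collected):

from pants.engine.internals.selectors import concurrently
from pants.engine.intrinsics import get_digest_contents

# Old style:
#   gets = [Get(DigestContents, Digest, pkg.digest) for pkg in docker_packages]
#   docker_metadata_contents = await MultiGet(gets)
# Call-by-name style: call the rule helper directly and await the
# resulting coroutines in parallel.
docker_metadata_contents = await concurrently(
    get_digest_contents(pkg.digest) for pkg in docker_packages
)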

# For Docker images, we need to extract the metadata filename and create stable digests
docker_metadata_gets = []
docker_package_indices = []
for i, built_package in enumerate(embedded_pkgs):


How about for field_set, built_package in zip(pkgs_wanting_embedding, embedded_pkgs), and maintain a list of docker_packages instead of docker_package_indices? The index is one extra thing to keep track of for the reader, and we could get rid of it.
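A sketch of that shape, with names taken from the hunks in this review:

docker_packages = []
for field_set, built_package in zip(pkgs_wanting_embedding, embedded_pkgs):
    if request.build_upstream_images and isinstance(
        getattr(field_set, "source", None), DockerImageSourceField
    ):
        docker_packages.append(built_package)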


# Extract the original filename from the metadata
if metadata_contents:
    original_filename = next(iter(metadata_contents)).path


This is the name of the metadata file (some path ending in docker-info.json). I recommend munging the name, so it's clear when debugging that this is redacted data. E.g., docker-info.stable.json.
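For example (a hypothetical one-liner implementing that suggestion):

stable_filename = original_filename.replace("docker-info.json", "docker-info.stable.json")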

for metadata_contents, pkg_index in zip(docker_metadata_contents, docker_package_indices):
    built_package = embedded_pkgs[pkg_index]

    # Extract the original filename from the metadata


This comment is unnecessary. The code two lines below it is already perfectly clear thanks to the good variable names you chose.

@kaos commented Aug 26, 2025

I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).

Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).

One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?
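A rough sketch of that idea (entirely hypothetical; it assumes the image metadata is JSON with a top-level "tags" field, which is not confirmed here):

import json

def strip_tags_for_hash(metadata_bytes: bytes) -> bytes:
    # Drop the volatile tags field so only stable metadata feeds the hash.
    metadata = json.loads(metadata_bytes)
    metadata.pop("tags", None)
    return json.dumps(metadata, sort_keys=True).encode()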

@ademariag

@kaos @benjyw thank you for reviewing.
An alternative could be to allow the digest generation to be customized, so that for each package we can decide which elements contribute to the actual digest (in this case ignoring everything but the image ID), but I think that might require more work.

But it would be nice to know what contributes to changing hashes/digests.

@benjyw commented Aug 26, 2025

> I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).

Sure, but with this change we're no worse off, right? And since the base image would be cached against its inputs, we wouldn't rebuild it in the first place, so the downstream images wouldn't be invalidated.

> Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).
>
> One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?

That would mean filtering the tags out of the metadata file at image build time, which I think means we would never know about them for any other purpose...

@kaos commented Aug 29, 2025

> > I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).
>
> Sure, but with this change we're no worse off, right? And since the base image would be cached against its inputs, we wouldn't rebuild it in the first place, so the downstream images wouldn't be invalidated.

Yes and no, I think. As long as you have the previous build cache, yes. But if you don't, and Docker rebuilds the image, you may end up with a new Docker image digest despite the inputs being the same, resulting in rebuilding everything downstream unnecessarily.

> > Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).
> > One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?
>
> That would mean filtering the tags out of the metadata file at image build time, which I think means we would never know about them for any other purpose...

We only need to filter it out at the point where we calculate the pants hash for the target, so no need to affect anything besides the hash value, I think.

Edit: But these are the concerns I can see. I have no real issue with going in either direction here, up to you :)

@benjyw commented Aug 29, 2025

> > > I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).
> >
> > Sure, but with this change we're no worse off, right? And since the base image would be cached against its inputs, we wouldn't rebuild it in the first place, so the downstream images wouldn't be invalidated.
>
> Yes and no, I think. As long as you have the previous build cache, yes. But if you don't, and Docker rebuilds the image, you may end up with a new Docker image digest despite the inputs being the same, resulting in rebuilding everything downstream unnecessarily.

Right, but that's where we are today already wrt tags. So presumably this change is still a move in the right direction?

> > > Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).
> > > One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?
> >
> > That would mean filtering the tags out of the metadata file at image build time, which I think means we would never know about them for any other purpose...
>
> We only need to filter it out at the point where we calculate the pants hash for the target, so no need to affect anything besides the hash value, I think.

At that point the metadata is just a Digest, so some code deep in the engine core would have to know to filter that Digest, which introduces a lot of new machinery. This is a pretty good way to get the same effect entirely within the docker backend.

> Edit: But these are the concerns I can see. I have no real issue with going in either direction here, up to you :)

@ademariag commented Sep 1, 2025

@kaos @benjyw at the moment Pants' design relies heavily on the build cache for Docker builds, assuming BuildKit will just quickly rebuild the image. IMHO, while ideal, this is often inefficient in several ways, and it's currently one of the biggest issues we have using the Docker backend (which is awesome, btw).
So I am trying at all costs to make sure the pants.hash doesn't change without good reason.

I think this needs a proper discussion outside of this PR, but just to give you an intuition of the impact of the problem:

...even using the split-pex approach highlighted in https://www.pantsbuild.org/blog/2022/08/02/optimizing-python-docker-deploys-using-pants#multiple-images-and-tagging, for a big dependencies pex (anything with torch, for instance) you quickly end up with a 40GB context that has to be uploaded to BuildKit even if nothing changes in the deps pex, just so that BuildKit can assess whether caching applies.

In our case, this adds a delay of up to 20+ seconds to every change, even if the change only affects the src files.

If you combine that with the fact that the base image might trigger another change to the pants hash (because the tags change), this leads to a very poor experience. I have a proposal for that as well, which I will perhaps share in another issue/PR.

@ademariag

@benjyw I have addressed your comments


@benjyw benjyw left a comment


Thanks for the fix! Just one more small change.

@benjyw commented Sep 2, 2025

Oh, and this needs a notes update in docs/notes/2.29.x.md under the docker backend, and also the highlights, since I think this is pretty important.

@marco-tortolani-datadog

@ademariag any updates on whether you'll continue pursuing #22588? I am also implementing the approach from Optimizing Python + Docker Deploys using Pants and similarly want to use pants.hash to avoid rebuilding Docker images.

@ademariag

Yes, will wrap this up soon.

@ademariag

@benjyw sorry for the long delay! I think I have done everything that was requested! Let me know if you need anything else


@benjyw benjyw left a comment


Looks good! Just need to move that release note and we can merge once CI is green.

### Backends

#### Docker
[Fixed](https://github.com/pantsbuild/pants/issues/22575) a bug that would make docker hashes unstable when using dynamic tags.


We've had a couple of releases since we last went around, so this needs to be in 2.31.x.md now ... :)
