
Conversation

@ademariag

Summary

Fixes #22575 by ensuring Docker image build context hashes remain stable when upstream images have dynamic tags.

Problem

When building Docker images that depend on other Docker images, the build context hash was incorrectly including the full metadata (including tags) of upstream images. This caused downstream images to rebuild unnecessarily whenever upstream image tags changed, even if the actual image content was identical.

Example scenario that was broken:

# BUILD file with dynamic tags
__defaults__({
    "docker_image": dict(
        image_tags=[
            "{pants.hash}",
            env("TIMESTAMP"),  # e.g., "20240116-123456"
        ],
    ),
})

docker_image(name="base", source="Dockerfile.base")
docker_image(name="app", source="Dockerfile")  # FROM base:image

Every build would produce different hashes for app:image because the timestamp in base:image's tags was included in the hash calculation.

Solution

This PR modifies the build context creation to use only the stable Docker image ID (SHA256 content hash) for upstream Docker images, rather than the full package digest that includes metadata like tags.

Key changes:

  1. Separate handling for Docker images: When processing embedded packages, Docker images are now handled differently from other package types
  2. Extract stable image ID: For Docker images, we extract the image ID from the BuiltDockerImage artifact
  3. Create stable digest: A new digest is created using only the image ID as content, preserving the original filename structure

Implementation Details

The fix modifies create_docker_build_context in docker_build_context.py to:

# Excerpt from the modified rule (imports elided; e.g.
# from pants.engine.fs import CreateDigest, Digest, FileContent).
# For Docker images, use only the image ID to ensure hash stability.
for artifact in built_package.artifacts:
    if isinstance(artifact, BuiltDockerImage):
        # The image ID is the content-addressed SHA256 of the built image,
        # so it changes only with image content, never with tags.
        stable_content = artifact.image_id.encode()
        stable_digest = await Get(
            Digest, CreateDigest([FileContent(original_filename, stable_content)])
        )
        embedded_pkgs_digest.append(stable_digest)

This ensures that:

  • ✅ Hash remains stable when tags change but content doesn't
  • ✅ Hash changes when actual image content changes
  • ✅ Build context structure remains compatible
  • ✅ Caching works correctly in CI/CD pipelines

Testing

Manual Testing

# Before fix: Different hashes each time
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: abc123...
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: def456... ❌

# After fix: Stable hashes
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: abc123...
TIMESTAMP=$(date +%s) pants package //app:image  # Hash: abc123... ✅


@benjyw benjyw left a comment


Thanks for the report and the fix! I think this is a good solution, and I can't see why using the image ID would not be sufficient. Just a few comments, and also it would be great to test this.

cc @kaos and @huonw in case they have any thoughts.

if request.build_upstream_images and isinstance(
    getattr(field_set, "source", None), DockerImageSourceField
):
    docker_metadata_gets.append(Get(DigestContents, Digest, built_package.digest))


We are moving off Get/MultiGet in favor of call-by-name. Sorry, this isn't properly documented yet. See #18905 for context.

So instead of Get(DigestContents, Digest, built_package.digest) you'd use await get_digest_contents(built_package.digest) (with from pants.engine.intrinsics import get_digest_contents)

and then you'd replace MultiGet with concurrently, imported from pants.engine.internals.selectors.
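A minimal sketch of that migration, using names from the hunks in this review (docker_packages stands in for however the Docker-image packages get collected):

from pants.engine.internals.selectors import concurrently
from pants.engine.intrinsics import get_digest_contents

# Old style:
#   gets = [Get(DigestContents, Digest, pkg.digest) for pkg in docker_packages]
#   docker_metadata_contents = await MultiGet(gets)
# Call-by-name style: call the rule helper directly and await the
# resulting coroutines in parallel.
docker_metadata_contents = await concurrently(
    get_digest_contents(pkg.digest) for pkg in docker_packages
)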

# For Docker images, we need to extract the metadata filename and create stable digests
docker_metadata_gets = []
docker_package_indices = []
for i, built_package in enumerate(embedded_pkgs):


How about for field_set, built_package in zip(pkgs_wanting_embedding, embedded_pkgs), and maintain a list of docker_packages instead of docker_package_indices? The index is one extra thing to keep track of for the reader, and we could get rid of it.
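A sketch of that shape, with names taken from the hunks in this review:

docker_packages = []
for field_set, built_package in zip(pkgs_wanting_embedding, embedded_pkgs):
    if request.build_upstream_images and isinstance(
        getattr(field_set, "source", None), DockerImageSourceField
    ):
        docker_packages.append(built_package)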


# Extract the original filename from the metadata
if metadata_contents:
    original_filename = next(iter(metadata_contents)).path


This is the name of the metadata file (some path ending in docker-info.json). I recommend munging the name, so it's clear when debugging that this is redacted data. E.g., docker-info.stable.json.
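For example (a hypothetical one-liner implementing that suggestion):

stable_filename = original_filename.replace("docker-info.json", "docker-info.stable.json")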

for metadata_contents, pkg_index in zip(docker_metadata_contents, docker_package_indices):
    built_package = embedded_pkgs[pkg_index]

    # Extract the original filename from the metadata


This comment is unnecessary. The code two lines below it is already perfectly clear thanks to the good variable names you chose.

@kaos commented Aug 26, 2025

I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).

Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).

One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?
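A rough sketch of that idea (entirely hypothetical; it assumes the image metadata is JSON with a top-level "tags" field, which is not confirmed here):

import json

def strip_tags_for_hash(metadata_bytes: bytes) -> bytes:
    # Drop the volatile tags field so only stable metadata feeds the hash.
    metadata = json.loads(metadata_bytes)
    metadata.pop("tags", None)
    return json.dumps(metadata, sort_keys=True).encode()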

@ademariag

@kaos @benjyw thank you for reviewing.
An alternative could be to allow the digest generation to be customized, so that for each package we can decide which elements contribute to the actual digest (in this case ignoring everything but the image ID), but I think that might require more work.

But it would be nice to know what contributes to changing hashes/digests.

@benjyw commented Aug 26, 2025

> I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).

Sure, but with this change we're no worse off, right? And since the base image would be cached against its inputs, we wouldn't rebuild it in the first place, so the downstream images wouldn't be invalidated.

> Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).
>
> One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?

That would mean filtering the tags out of the metadata file at image build time, which I think means we would never know about them for any other purpose...

@kaos commented Aug 29, 2025

> > I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).
>
> Sure, but with this change we're no worse off, right? And since the base image would be cached against its inputs, we wouldn't rebuild it in the first place, so the downstream images wouldn't be invalidated.

Yes and no, I think. As long as you have the previous build cache, yes. But if you don't, and Docker rebuilds the image, you may end up with a new Docker image digest despite the inputs being the same, resulting in rebuilding everything downstream unnecessarily.

> > Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).
> > One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?
>
> That would mean filtering the tags out of the metadata file at image build time, which I think means we would never know about them for any other purpose...

We only need to filter it out at the point where we calculate the pants hash for the target, so no need to affect anything besides the hash value, I think.

Edit: But these are the concerns I can see. I have no real issue with going in either direction here, up to you :)

@benjyw commented Aug 29, 2025

> > > I worry this could regress other scenarios. Consider that when you build your Docker image, the exact same inputs may result in a different Docker image hash each time too, depending on what the Dockerfile is doing (such as compiling binaries with, yet again, a timestamp in them).
> >
> > Sure, but with this change we're no worse off, right? And since the base image would be cached against its inputs, we wouldn't rebuild it in the first place, so the downstream images wouldn't be invalidated.
>
> Yes and no, I think. As long as you have the previous build cache, yes. But if you don't, and Docker rebuilds the image, you may end up with a new Docker image digest despite the inputs being the same, resulting in rebuilding everything downstream unnecessarily.

Right, but that's where we are today already wrt tags. So presumably this change is still a move in the right direction?

> > > Also, it would be nice to avoid special-casing the build process for Docker (didn't look at the implementation, only the PR description).
> > > One idea could be to (possibly as an option?) filter out the image tags part from the pants hash calculation?
> >
> > That would mean filtering the tags out of the metadata file at image build time, which I think means we would never know about them for any other purpose...
>
> We only need to filter it out at the point where we calculate the pants hash for the target, so no need to affect anything besides the hash value, I think.

At that point the metadata is just a Digest, so some code deep in the engine core would have to know to filter that Digest, which introduces a lot of new machinery. This is a pretty good way to get the same effect entirely within the docker backend.

> Edit: But these are the concerns I can see. I have no real issue with going in either direction here, up to you :)

@ademariag commented Sep 1, 2025

@kaos @benjyw at the moment Pants' design relies heavily on the build cache for Docker builds, assuming BuildKit will just quickly rebuild the image. IMHO, while ideal, this is often inefficient in several ways, and it's currently one of the biggest issues we have using the Docker backend (which is awesome, btw).
So I am trying at all costs to make sure the pants.hash doesn't change without good reason.

I think this needs a proper discussion outside of this PR, but just to give you an intuition of the impact of the problem:

...even using the split-pex approach highlighted in https://www.pantsbuild.org/blog/2022/08/02/optimizing-python-docker-deploys-using-pants#multiple-images-and-tagging, for a big dependencies pex (anything with torch, for instance) you quickly end up with a 40GB context that has to be uploaded to BuildKit even if nothing changes in the deps pex, just so that BuildKit can assess whether caching applies.

In our case, this adds a delay of up to 20+ seconds to every change, even if the change only affects the src files.

If you combine that with the fact that the base image might trigger another change to the pants hash (because the tags change), this leads to a very poor experience. I have a proposal for that as well, which I will perhaps share in another issue/PR.

@ademariag

@benjyw I have addressed your comments


@benjyw benjyw left a comment


Thanks for the fix! Just one more small change.

@benjyw commented Sep 2, 2025

Oh, and this needs a notes update in docs/notes/2.29.x.md under the docker backend, and also the highlights, since I think this is pretty important.

@marco-tortolani-datadog

@ademariag any updates on whether you'll continue pursuing #22588? I am also implementing the approach from Optimizing Python + Docker Deploys using Pants and similarly want to use pants.hash to avoid rebuilding Docker images.

@ademariag

Yes, will wrap this up soon.

@ademariag

@benjyw sorry for the long delay! I think I have done everything that was requested! Let me know if you need anything else


@benjyw benjyw left a comment


Looks good! Just need to move that release note and we can merge once CI is green.

### Backends

#### Docker
[Fixed](https://github.com/pantsbuild/pants/issues/22575) a bug that would make docker hashes unstable when using dynamic tags.


We've had a couple of releases since we last went around, so this needs to be in 2.31.x.md now ... :)
