Skip to content
This repository has been archived by the owner on Nov 7, 2024. It is now read-only.

feat: Add external input support for container encapsulation #652

Merged
merged 16 commits into from
Sep 3, 2024

Conversation

antheas
Copy link
Collaborator

@antheas antheas commented Aug 2, 2024

This PR adds a --contentmeta option to ostree-rs-ext container encapsulate and makes the required changes to support it.

The --contentmeta file receives the path of a JSON file as input which has the following format (shown in yaml):

# The version of the file protocol (ignored)
version: 1

# The created at field in the imgcfg file
# optional, if provided overrides automatic generation
created: 2024-08-01T22:16:56Z

# The annotation field of the OCI and Docker standards
# Placed in both to ensure support from both podman and Docker (e.g., ghcr) ecosystems
labels:
  tag1: value1
  tag2: value2

# The layers that should be created in the order they should be created,
# provided with a human parse-able name (currently placed in "ostree.components")
layers:
    dedi:meta:wine: "dedi:meta:wine"
    rechunk_layer004: "jupiter-hw-support-btrfs,flite",
  ...

# OSTree hash to layer mappings
# Currently not ordered, but perhaps should be
mapping:    
    000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba: rechunk_layer027
    000c9e21926935eaf55daf34dd2dcbbd50e0082429368eae3b7809315b27bf15: dedi:meta:qt6-base
    000d29f48958065a804cf92b36943d5f59108f901ae3c1b56e2f543e061b5507: dedi:meta:kde

Then, ostree-rs-ext uses this file as a base to export the provided commit as an OCI image.

The implementation here hijacks and turns off the rechunking component of ostree-rs-ext in an unclean way, so it needs to be refactored. For example, ostree-rs-ext adds 1 additional extra layer for "new-packages"

In addition, the following changes had to be made:

  • Use ordered unserialization of JSON
    • Important for having a deterministic output and having the output layers look pretty
  • Switch from HashMap to BTreeMap
    • Same as above
  • Removes the canonical json exporter from ocidir-rs
    • Violates the JSON standard by encoding the characters \n etc literally
    • If included in a label, the whole Red Hat family of tools refuses to read the image
    • However, it seems that Github truncates \n in image descriptions anyway, so its unclear how important this is.
  • Places the image labels both at the outer and inner configs
    • This makes them visible in GHCR while preserving them in rpm-ostree status (example)

Overall, this PR makes it possible to remove rpm-ostree as a backend in image creation, and use something like rechunk for the packaging aspect. This opens up the possibility of including support for more distros (e.g., arch) as their packaging format can now be scripted.

In addition, since creating the JSON file is decoupled from producing the OCI image, it becomes possible to simulate different packaging scenarios without having to wait and analyze the output OCI image.

Performance

As a sidenote, an attempt was made to speed up ostree-rs-ext writing the output image. Currently, it is single threaded and takes around 5 minutes to export an image, which rounds out to around 10 minutes doubling the time it takes to produce an image. It would be nice if it took e.g., 1 minute and had a progress bar.

We could not make multithreading work due to the multiple references of ark-string and the mutable reference to the OSTree object, which only holds a dir. It would take a bit of restructuring.

Removing rpm-ostree from the process saved around 2 minutes in calculating the layers. Most of that time is due to OSTree file lookups, which is relatively easy to fix.

For example, rechunk uses a hash memory map extracted from OSTree, which takes around 20 seconds to calculate and allows for instant lookups.

Note that the command used, ostree ls is not currently machine parseable (size and hard link numbers overflow), so it would be nice to fix with e.g., a --csv tag.

@antheas
Copy link
Collaborator Author

antheas commented Aug 2, 2024

Here is an example contentmeta file:
contentmeta.json

@antheas
Copy link
Collaborator Author

antheas commented Aug 2, 2024

Worthwhile to mention that the "ostree.components" layer annotation is thrown away by podman and friends, so dev.hhd.rechunk.info was implemented as a substitute for avoiding layer shifts (see containers/podman#23463).

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for filing this! Just a quick initial look.

lib/Cargo.toml Outdated
@@ -36,7 +36,7 @@ once_cell = "1.9"
libc = "0.2.92"
libsystemd = "0.7.0"
openssl = "0.10.33"
ocidir = "0.1.0"
ocidir = { version = "0.2.0", git = "https://github.com/hhd-dev/ocidir-rs" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah...you have changes to that too!

The bump to 0.2 is in #653 at least.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind doing (again) at least a draft PR to the repo with hhd-dev/ocidir-rs@9f6095c and provide a bit of reproducer instructions around how skopeo is broken?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh I see it's in the PR text:

Violates the JSON standard by encoding the characters \n etc literally

Hmmmm...ok. I will look.

Copy link
Collaborator Author

@antheas antheas Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.rs/olpc-cjson/latest/olpc_cjson/

(specifically, ASCII control characters 0x00–0x1f are printed literally, which is not valid JSON). Therefore, serde_json cannot necessarily deserialize JSON produced by this formatter.

error: Querying manifest after push: Fetching manifest: Failed to invoke skopeo proxy method GetManifest: remote error: invalid character '\n' in string literal

Hopefully this helps

Only occurs if \n is included in a label. It would have been nice to generate fancy descriptions. Although the only place that reads them (ghcr) omits \n so new lines are not shown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, right thanks. This is probably then best tracked at containers/ocidir-rs#10 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since I do not think my fork is a proper fix it did not make sense to PR it.

@cgwalters
Copy link
Member

The --contentmeta file receives the path of a JSON file as input which has the following format (shown in yaml):

The code was very much written with the idea to do something like this, awesome that you were to get it to work!

@antheas
Copy link
Collaborator Author

antheas commented Aug 2, 2024

One of the problems of the spec I made is that serde does not preserve order in JSON dictionaries. Which is why indexmap is included.

Rust developers would suggest something like this to preserve ordering:

layers:
    - key: dedi:meta:wine
      friendly: "dedi:meta:wine"
    - key: dedi:meta:wine
      friendly: "dedi:meta:wine"

mapping:    
    - key: 000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba
      layer: rechunk_layer027
    - key: 000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba
      layer: dedi:meta:kde

Which I am not sure how I feed about. Other than that, I think it makes sense as an API.

lib/src/cli.rs Outdated Show resolved Hide resolved
lib/src/cli.rs Outdated
Comment on lines 714 to 717
/// When the image was created. Sync it with the io.container.image.created label.
pub created: Option<String>,
/// Top level labels, to be prefixed to the ones with --label
pub labels: Option<BTreeMap<String, String>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK having these here, but it seems like it'd make more sense for them to be separate CLI arguments?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its valuable to have 2 sources for this, as currently rechunk will also generate the labels and the created tag and harmonize them. Since it pulls the rpm database it can do fancy stuff such as variable substitution. So this acts as a way of passing them through the 2 commands cleanly.

Having to refeed them into arguments would be hell.

I expect if anyone else tried to extend this they would agree.

Here is how the action example for the layers looks right now:
https://github.com/hhd-dev/rechunk/blob/496f4b84aced656b9c2c0f176f24323fe13129ad/.github/workflows/online_test_deck.yml#L51-L93

CLI may override the file.

lib/src/cli.rs Outdated Show resolved Hide resolved
lib/src/cli.rs Outdated Show resolved Hide resolved
lib/src/cli.rs Outdated
change_frequency: if k == "unpackaged" { std::u32::MAX } else { 1 },
change_time_offset: 1,
},
size: 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Size of 1 seems odd here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is what was required to make it work without restructuring the code.

A proper implementation would remove this, incl. the dangling "Reserved for New Packages" layer at the end.

lib/src/cli.rs Outdated
sizes: raw
.layers
.into_iter()
.map(|(k, v)| ObjectSourceMetaSized {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we just directly parse this data from the input JSON?

Copy link
Collaborator Author

@antheas antheas Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of this fork only reuses ostree-rs-ext as an exporter. If we wanted to use it with its grouping algorithm, it would make sense to do it that.

Right now this section is fighting with the existing code to make it work as an exporter and by exposing a cleaner API for the json file.

@antheas
Copy link
Collaborator Author

antheas commented Aug 22, 2024

"should be fixed" as in future me will have to fix it :)

I will have a look over this PR this weekend and try to fix the comments. Incl. those that I "justified".

As for something you did not comment on: IndexMap. Rust treats JSON maps as unordered, but the spec I suggested requires them being ordered. This necessitates the use of the IndexMap package.

Do we keep or change the spec? In this case, to a list of tuples.

In addition, the mapping field should be treated as both ordered and non-ordered depending on perhaps an input argument ordered being True.

This would allow re-ordering the tar stream so that something like zstd:chunked can perform less range requests.

Currently, it is non-ordered, with a BTree or something inside ostree-rs-ext defining the ordering. For backward compatibility reasons and having fewer ways partitioning algorithms that are not that deterministic can fail, this should remain as the default ordering. With partitioning algorithms that care about the order of the tar stream being able to change that.

@antheas
Copy link
Collaborator Author

antheas commented Aug 22, 2024

For reordering the tar stream, it would be required to "stash" somewhere in the image the order of the current tar stream (essentially the mapping contents), so that the next image can reference it.

Otherwise, it would have to be done out-of-band, which is too messy imo. I'd rather all metadata be part of the image.

Compressed, this is around 10MB. Since ostree-rs-ext creates the image, it also has the opportunity to stash that into e.g., the commit layer, which is small enough and can be pulled before creating the next image.

The alternative would be the inner config, which skopeo reads. However, adding an 8MB blob might cause issues for certain registries.

@antheas
Copy link
Collaborator Author

antheas commented Aug 22, 2024

If we extract the mapping var from the contentmeta.json above and convert it in a \n separated list string, it is 12.3MB.

After GZIP compression, it becomes 6.9MB.

If the compressed array is then converted using base85 to be inserted into e.g., a label, it becomes 8.7MB.

@cgwalters
Copy link
Member

For reordering the tar stream, it would be required to "stash" somewhere in the image the order of the current tar stream (essentially the mapping contents), so that the next image can reference it.

Otherwise, it would have to be done out-of-band, which is too messy imo. I'd rather all metadata be part of the image.

That's forcing all consumers to download this though.

I think in the general case production build systems can maintain "out of band" caches as separate images in the registry. Look at how e.g. cosign/sigstore work - there's an OCI artifact uploaded to the registry that refers to the base manifest.

@cgwalters
Copy link
Member

Currently, it is non-ordered, with a BTree or something inside ostree-rs-ext defining the ordering.

A BTree is ordered. The output ordering of each chunk is sorted by the checksum.

Which yes...if we wanted to optimize zstd:chunked range requests would need to be configurable. I can very much see the argument that actually by default we should order the output of chunks first by their associated contentID, and only then by checksum.

@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

Which yes...if we wanted to optimize zstd:chunked range requests would need to be configurable. I can very much see the argument that actually by default we should order the output of chunks first by their associated contentID, and only then by checksum.

This takes a bit of thinking, but you can derive that the chunking of the image is irrelevant as far as the zstd:chunked ordering is concerned.

If the ordering of the mapping variable is respected and the new hashes are placed on its end then it does not matter how it is partitioned into layers. All new files will be on the end of the resulting layers, in respect to users of previous versions.

Regardless of how the previous versions or this version was structured into layers.

we should order the output of chunks first by their associated contentID

The above mean that this would only fragment the layers and be counterproductive.

The reason the layers variable is forced to be ordered in this PR is so that the resulting manifests have the same layer order, which is prettier and makes comparing manifests easier.

@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

That's forcing all consumers to download this though.

I think in the general case production build systems can maintain "out of band" caches as separate images in the registry.

6.7Mb of a 6.5 Gb image is oh, 0.1%? For fedora core, it would be 2MB out of 2GB, so it scales pretty well with the image size to be 0.1% to 0.2%.

I do not think that warrants an out-of-band store. But it also needs not be part of the spec, so enterprise users that want to squeeze out that 0.1% can.

Look at how e.g. cosign/sigstore work - there's an OCI artifact uploaded to the registry that refers to the base manifest.

Referencing the bazzite image store, in my opinion forcing out-of-band storage for signatures is a wart of the OCI spec. It clouds both skopeo manifests and the image store.

For most images, signing happens during build time. So the happy path would be to bundle the initial sigs with the image itself.

@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

A BTree is ordered. The output ordering of each chunk is sorted by the checksum.

I would refer to that as deterministic. At best, unless there are stable ABI guarantees in the Rust spec.

If the checksum algorithm changes for any reason, all existing images produced with ostree-rs-ext would be cache busted. This includes changing the string type ostree-rs-ext uses and Rust updates.

Ordered here would mean the spec would respect the ordering provided by mapping. False is provided as a sane default such that users of the API do not have to care about ordering if they do not have to, and for backwards compatibility.

@cgwalters
Copy link
Member

If the checksum algorithm changes for any reason, all existing images produced with ostree-rs-ext would be cache busted.

This is really an unimportant side discussion but just for reference: If we changed how ostree computes checksums, it would also duplicate on-disk state, etc. It's not going to happen. What we should be doing is getting ostree out of the container images and orienting around composefs and fsverity, probably moving to xattrs in the tar stream, etc. which would be the more useful format break.

@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

I meant the checksumming which causes ordering of the BTree within rust.

Looking at the Rust documentation, it is not clear how it is determined.

There are references to the Ord trait so perhaps in the current version it is alphabetical? I am unsure.

I am also not sure on how the final hash ordering is currently determined. But I would rather it not break unless it is done intentionally. I know the hashes themselves will not change.

@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

What we should be doing is getting ostree out of the container images and orienting around composefs and fsverity, probably moving to xattrs in the tar stream, etc. which would be the more useful format break.

I agree on the other points though. I am not aware of Fedora's plans, but for us I'd prefer we maintain interop with rpm-ostree as it ships today for the next 1 year at least on images shipped. And in the much sorter term fixing the xattrs issue, as it appears we will need it soon. Any image derived with rechunk cannot use it again because it loses xattrs.

That means extending ostree-rs-ext and depending on ostree for the time being.

@cgwalters
Copy link
Member

I've been thinking about this more. I think we can simplify things a lot if instead of going from container image -> container image, we go from:

  • base image
  • additional package set
  • zero or more merged artifact images

It's basically the same thing I'm proposing here https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2078106687

The big simplification here vs what you have is that each artifact image becomes its own chunk by default (unless specified to be merged). Again something like this:

from: quay.io/fedora/fedora-bootc:40
packages:
  - libvirt
  - cowsay
copy-images:
  - quay.io/exampleos/systemd-units@sha256:1234...
  - quay.io/exampleos/custom-agent@sha256:ab89..

There's a strong relationship here to the (relatively) new COPY --link (that's not implemented in podman yet) for the latter two bits - it's basically what we're doing.

@cgwalters
Copy link
Member

Well except actually, we need in the general case to do selinux labeling on artifacts when enabled, and canonicalize to ostree format short term, so it won't actually be COPY --link, but anyways the larger point is preserving the "one thing in, one thing out" in a reproducible fashion.

lib/src/cli.rs Outdated Show resolved Hide resolved
@cgwalters
Copy link
Member

Removes the canonical json exporter from ocidir-rs
...
However, it seems that Github truncates \n in image descriptions anyway, so its unclear how important this is.

I agree this is a design flaw that needs fixing, but if it's not actually really useful right now can we then drop the newline requirement and the need to change ocidir-rs in the short term? I think that's the thing that's conflicting in the PR at the moment.

@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

I agree this is a design flaw that needs fixing, but if it's not actually really useful right now can we then drop the newline requirement and the need to change ocidir-rs in the short term? I think that's the thing that's conflicting in the PR at the moment.

Reverted and tested it does not conflict. I will make sure to filter \n before merging the changes to rechunk.

@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

I do not know why my local version did not catch the fixture error. Perhaps there needs to be a bit more testing there.

There is also still the following issue. ostree ext adds a last layer dedicated for new packages that will not be used.

        {
            "MIMEType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "Digest": "sha256:5f9d3dcf5281c5f6512471366be68bee46c2485eddf4fd1887da6b240712be5f",
            "Size": 2340,
            "Annotations": {
                "ostree.components": ""
            }
        }

Changing it would require making basic_packing aware of skipping it and perhaps it is for a future PR.

@antheas antheas marked this pull request as ready for review August 30, 2024 20:32
@antheas
Copy link
Collaborator Author

antheas commented Aug 30, 2024

Well except actually, we need in the general case to do selinux labeling on artifacts when enabled, and canonicalize to ostree format short term, so it won't actually be COPY --link, but anyways the larger point is preserving the "one thing in, one thing out" in a reproducible fashion.

Yeah, for the next 1-2 years this will probably be the way to go. For the vast majority of images, the time downside is negligible (4-7m) and the end result is compatible with rpm-ostree.

There is no size or performance benefit to the alternative for OS images.

There is a huge benefit however for AI and OCI images for not using ostree, which brings us to below or perhaps the topic of next Tuesday.

I've been thinking about this more. I think we can simplify things a lot if instead of going from container image -> container image, we go from:

  • base image
  • additional package set
  • zero or more merged artifact images

It's basically the same thing I'm proposing here https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2078106687

People like using OCI images because the contents are inspectable, extendable, and iterative.

They would rather paste a bunch of dnf install ..., so I say let them do it.

I also like the idea of having an optional postprocessing step like rechunk that fixes the image only when it is meant for distribution (as in it does not have to be used during development). And to have that tool handle all the quirks that come with shifting packages around so that the image is optimized for downloading.

I received a lot of positive feedback about the fact that rechunk is a drop-in addition and after that you get 3x smaller downloads for free. Rechunk contains a new partitioning algorithm which has quite a bit better bandwidth savings compared to the one in ostree-rs-ext. And requires no changes to existing images or workflow.

The big simplification here vs what you have is that each artifact image becomes its own chunk by default (unless specified to be merged). Again something like this:

from: quay.io/fedora/fedora-bootc:40
packages:
  - libvirt
  - cowsay
copy-images:
  - quay.io/exampleos/systemd-units@sha256:1234...
  - quay.io/exampleos/custom-agent@sha256:ab89..

I would rather the chunks be predetermined as that is what is more optimized if zstd:chunked is not used. Even if it is, it is still great for registries. This accounts for most of the gain rechunk has over rpm-ostree chunking right now.

However, the chunks you mention are not meant for optimizing downloads. They are actually extensions.

I have been thinking a lot while building rechunk about how extensions would look and researching sysext as well.

I do not think that sysext is appropriate for the bootc model, but it makes a great point: the extensions are married to the image. They update as part of it.

My idea would be to have the distro build a huge master container image that bundles all OS extensions together. They have 100 layers to do it, and for most applications that is broadly enough. Then, they use a little yaml file to say which package is part of which extension. Rechunk handles the rest and places the extension files to separate layers, while having the correct directory structure due to using ostree.

Then, during deployment, the user has tied a feature mask to their update tag, which specifies which extensions should be enabled.

Depending on which extensions the user has selected, certain layers get shed, saving space and update bandwidth.

As the image is built at the same time with all extensions, there is 0 update drift and only one update channel. It also has the benefit that images with extensions are able to then be used as base images using from and they work as expected.

There's a strong relationship here to the (relatively) new COPY --link (that's not implemented in podman yet) for the latter two bits - it's basically what we're doing.

This copy --link is essential for rechunking AI images, as it is prohibitively expensive to copy the model files.

Currently, rechunk copies the files oh, only 3 times:

  • podman mount to ostree
  • ostree to oci dir
  • oci dir to podman (esp. for zstd:chunked conversion)

It is not that big of an issue even with an image that is 16GB (7GB compressed). But if an image is 50-100GB it is.

For applying something like rechunk to general OCI images, OSTree will have to be removed from the equation and buildah needs to be used instead. Also needs reflinks to avoid thrashing the disk with large container images.

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have liked to merge this with tests but not going to block on it in the immediate term.

Maybe one thing we can try to directly do is "reverse dependency testing", run rechunk's tests at least optionally on PRs here?

@cgwalters cgwalters merged commit 60e2a40 into ostreedev:main Sep 3, 2024
10 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants