feat: Add external input support for container encapsulation #652

antheas · 2024-08-02T13:04:59Z

This PR adds a --contentmeta option to ostree-rs-ext container encapsulate and makes the required changes to support it.

The --contentmeta option receives the path of a JSON file as input, which has the following format (shown in YAML):

# The version of the file protocol (ignored)
version: 1

# The created at field in the imgcfg file
# optional, if provided overrides automatic generation
created: 2024-08-01T22:16:56Z

# The annotation field of the OCI and Docker standards
# Placed in both to ensure support from both podman and Docker (e.g., ghcr) ecosystems
labels:
  tag1: value1
  tag2: value2

# The layers that should be created in the order they should be created,
# provided with a human parse-able name (currently placed in "ostree.components")
layers:
    dedi:meta:wine: "dedi:meta:wine"
    rechunk_layer004: "jupiter-hw-support-btrfs,flite"
    ...

# OSTree hash to layer mappings
# Currently not ordered, but perhaps should be
mapping:    
    000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba: rechunk_layer027
    000c9e21926935eaf55daf34dd2dcbbd50e0082429368eae3b7809315b27bf15: dedi:meta:qt6-base
    000d29f48958065a804cf92b36943d5f59108f901ae3c1b56e2f543e061b5507: dedi:meta:kde

Then, ostree-rs-ext uses this file as a base to export the provided commit as an OCI image.
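
A minimal sketch of how a tool could produce such a file, assuming Python on the generating side. The function name, the placeholder hash, and the check that every mapped object points at a declared layer are assumptions for illustration, not part of this PR. Python dicts and json.dumps keep insertion order, so the layer order written here is the order that ends up in the file.

```python
import json


def build_contentmeta(layers, mapping, labels=None, created=None):
    """Assemble a contentmeta document in the format described above.

    layers:  dict of layer key -> human-readable name, in creation order
    mapping: dict of OSTree object hash -> layer key
    """
    # Assumption: every mapped object should point at a declared layer.
    unknown = sorted(set(mapping.values()) - set(layers))
    if unknown:
        raise ValueError(f"mapping refers to undeclared layers: {unknown}")

    meta = {"version": 1}
    if created is not None:
        meta["created"] = created  # overrides automatic generation
    meta["labels"] = dict(labels or {})
    meta["layers"] = dict(layers)  # dicts keep insertion order
    meta["mapping"] = dict(mapping)
    return meta


meta = build_contentmeta(
    layers={
        "dedi:meta:wine": "dedi:meta:wine",
        "rechunk_layer004": "jupiter-hw-support-btrfs,flite",
    },
    # Placeholder hash, not a real OSTree object.
    mapping={"0" * 64: "rechunk_layer004"},
    labels={"tag1": "value1"},
    created="2024-08-01T22:16:56Z",
)

# json.dumps writes keys in insertion order, so the layer order is preserved.
text = json.dumps(meta, indent=2)
```

Writing `text` to a file would give something that could be passed to --contentmeta, assuming the field names above match what the parser expects.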

The implementation here hijacks and turns off the rechunking component of ostree-rs-ext in an unclean way, so it needs to be refactored. For example, ostree-rs-ext still adds one extra layer for "new-packages".

In addition, the following changes had to be made:

- Use ordered deserialization of JSON
  - Important for deterministic output and for keeping the output layers looking tidy
- Switch from HashMap to BTreeMap
  - Same as above
- Remove the canonical JSON exporter from ocidir-rs
  - It violates the JSON standard by encoding characters such as \n literally
  - If such a character is included in a label, the whole Red Hat family of tools refuses to read the image
  - However, it seems that GitHub truncates \n in image descriptions anyway, so it's unclear how important this is.
- Place the image labels in both the outer and inner configs
  - This makes them visible in GHCR while preserving them in rpm-ostree status (example)

Overall, this PR makes it possible to remove rpm-ostree as a backend in image creation, and use something like rechunk for the packaging aspect. This opens up the possibility of including support for more distros (e.g., Arch), as their packaging format can now be scripted.

In addition, since creating the JSON file is decoupled from producing the OCI image, it becomes possible to simulate different packaging scenarios without having to wait and analyze the output OCI image.

Performance

As a side note, an attempt was made to speed up ostree-rs-ext writing the output image. Currently, it is single-threaded and takes around 5 minutes to export an image, which brings the total to around 10 minutes, doubling the time it takes to produce an image. It would be nice if it took, e.g., 1 minute and had a progress bar.

We could not make multithreading work due to the multiple references of ark-string and the mutable reference to the OSTree object, which only holds a dir. It would take a bit of restructuring.

Removing rpm-ostree from the process saved around 2 minutes in calculating the layers. Most of that time is due to OSTree file lookups, which is relatively easy to fix.

For example, rechunk uses a hash memory map extracted from OSTree, which takes around 20 seconds to calculate and allows for instant lookups.

Note that the output of the command used, ostree ls, is not currently machine-parseable (the size and hard link numbers overflow), so it would be nice to fix this with, e.g., a --csv flag.


antheas · 2024-08-02T13:06:03Z

Here is an example contentmeta file:
contentmeta.json

antheas · 2024-08-02T13:07:35Z

Worthwhile to mention that the "ostree.components" layer annotation is thrown away by podman and friends, so dev.hhd.rechunk.info was implemented as a substitute for avoiding layer shifts (see containers/podman#23463).

cgwalters

Thanks for filing this! Just a quick initial look.

cgwalters · 2024-08-02T17:27:00Z

lib/Cargo.toml

@@ -36,7 +36,7 @@ once_cell = "1.9"
 libc = "0.2.92"
 libsystemd = "0.7.0"
 openssl = "0.10.33"
-ocidir = "0.1.0"
+ocidir = { version = "0.2.0", git = "https://github.com/hhd-dev/ocidir-rs" }


Ah...you have changes to that too!

The bump to 0.2 is in #653 at least.

Would you mind doing (again) at least a draft PR to the repo with hhd-dev/ocidir-rs@9f6095c and provide a bit of reproducer instructions around how skopeo is broken?

Ohh I see it's in the PR text:

Violates the JSON standard by encoding the characters \n etc literally

Hmmmm...ok. I will look.

https://docs.rs/olpc-cjson/latest/olpc_cjson/

(specifically, ASCII control characters 0x00–0x1f are printed literally, which is not valid JSON). Therefore, serde_json cannot necessarily deserialize JSON produced by this formatter.

error: Querying manifest after push: Fetching manifest: Failed to invoke skopeo proxy method GetManifest: remote error: invalid character '\n' in string literal

Hopefully this helps

Only occurs if \n is included in a label. It would have been nice to generate fancy descriptions. Although the only place that reads them (ghcr) omits \n so new lines are not shown.
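
A small sketch of the failure mode, using Python's json module as a stand-in for a strict parser (the label text and the `parses` helper are made up for illustration). A conforming encoder escapes the newline, while writing the control character literally produces text that a strict parser rejects, which matches the kind of error skopeo reports above.

```python
import json

label = "first line\nsecond line"

# A conforming encoder escapes the newline as the two characters '\' and 'n'.
valid = json.dumps({"description": label})

# Writing the control character literally, as the olpc-cjson docs quoted above
# describe, gives text that is not valid JSON.
invalid = '{"description": "' + label + '"}'


def parses(text):
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


accepted_valid = parses(valid)
accepted_invalid = parses(invalid)
```

Here `accepted_valid` is True and `accepted_invalid` is False, since json.loads rejects unescaped control characters inside strings by default.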

OK, right thanks. This is probably then best tracked at containers/ocidir-rs#10 ?

Yes, since I do not think my fork is a proper fix it did not make sense to PR it.

cgwalters · 2024-08-02T18:58:18Z

The --contentmeta option receives the path of a JSON file as input, which has the following format (shown in YAML):

The code was very much written with the idea to do something like this, awesome that you were able to get it to work!

antheas · 2024-08-02T19:02:59Z

One of the problems of the spec I made is that serde does not preserve order in JSON dictionaries. Which is why indexmap is included.

Rust developers would suggest something like this to preserve ordering:

layers:
    - key: dedi:meta:wine
      friendly: "dedi:meta:wine"
    - key: rechunk_layer004
      friendly: "jupiter-hw-support-btrfs,flite"

mapping:
    - key: 000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba
      layer: rechunk_layer027
    - key: 000d29f48958065a804cf92b36943d5f59108f901ae3c1b56e2f543e061b5507
      layer: dedi:meta:kde

Which I am not sure how I feel about. Other than that, I think it makes sense as an API.
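
To make the trade-off concrete, here is a hedged sketch of converting between the two shapes in Python. The helper names are hypothetical; the "key", "friendly" and "layer" field names follow the list form shown above. The list form keeps its order under any JSON parser, while the map form depends on the parser preserving object order.

```python
import json


def to_entry_list(ordered_map, value_field):
    """Turn {key: value} into [{"key": key, value_field: value}, ...]."""
    return [{"key": k, value_field: v} for k, v in ordered_map.items()]


def from_entry_list(entries, value_field):
    """Inverse of to_entry_list; rejects duplicate keys."""
    result = {}
    for entry in entries:
        key = entry["key"]
        if key in result:
            raise ValueError(f"duplicate key: {key}")
        result[key] = entry[value_field]
    return result


layers = {
    "dedi:meta:wine": "dedi:meta:wine",
    "rechunk_layer004": "jupiter-hw-support-btrfs,flite",
}

as_list = to_entry_list(layers, "friendly")
# Round-trip through JSON text: arrays are ordered by definition.
restored = from_entry_list(json.loads(json.dumps(as_list)), "friendly")
```

After the round trip, `restored` has the same keys in the same order as `layers`.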

cgwalters · 2024-08-20T17:35:28Z

lib/src/cli.rs

+    /// When the image was created. Sync it with the io.container.image.created label.
+    pub created: Option<String>,
+    /// Top level labels, to be prefixed to the ones with --label
+    pub labels: Option<BTreeMap<String, String>>,


I'm OK having these here, but it seems like it'd make more sense for them to be separate CLI arguments?

I think it's valuable to have 2 sources for this, as currently rechunk will also generate the labels and the created tag and harmonize them. Since it pulls the rpm database, it can do fancy stuff such as variable substitution. So this acts as a way of passing them between the 2 commands cleanly.

Having to refeed them into arguments would be hell.

I expect if anyone else tried to extend this they would agree.

Here is how the action example for the layers looks right now:
https://github.com/hhd-dev/rechunk/blob/496f4b84aced656b9c2c0f176f24323fe13129ad/.github/workflows/online_test_deck.yml#L51-L93

CLI may override the file.
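
As a sketch of that precedence rule (not the actual Rust implementation), the merge could look like the following in Python. The function name and sample labels are hypothetical, and "CLI wins on duplicate keys" is an assumption based on the sentence above.

```python
def merge_labels(file_labels, cli_labels):
    """Combine labels from the contentmeta file with those given via --label.

    File labels come first; a CLI label with the same key replaces the
    value from the file (assumed precedence).
    """
    merged = dict(file_labels or {})
    merged.update(cli_labels or {})
    return merged


merged = merge_labels(
    {"tag1": "value1", "tag2": "value2"},  # from the contentmeta file
    {"tag2": "override", "tag3": "value3"},  # from --label arguments
)
```

With these inputs, `tag1` keeps the file value, `tag2` takes the CLI value, and `tag3` is added at the end.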

cgwalters · 2024-08-20T17:46:43Z

lib/src/cli.rs

+                            change_frequency: if k == "unpackaged" { std::u32::MAX } else { 1 },
+                            change_time_offset: 1,
+                        },
+                        size: 1,


Size of 1 seems odd here?

This section is what was required to make it work without restructuring the code.

A proper implementation would remove this, incl. the dangling "Reserved for New Packages" layer at the end.

cgwalters · 2024-08-20T17:49:09Z

lib/src/cli.rs

+                sizes: raw
+                    .layers
+                    .into_iter()
+                    .map(|(k, v)| ObjectSourceMetaSized {


Couldn't we just directly parse this data from the input JSON?

The use of this fork only reuses ostree-rs-ext as an exporter. If we wanted to use it with its grouping algorithm, it would make sense to do it that way.

Right now this section is fighting with the existing code to make it work as an exporter while exposing a cleaner API for the JSON file.

antheas · 2024-08-22T10:11:38Z

"should be fixed" as in future me will have to fix it :)

I will have a look over this PR this weekend and try to fix the comments. Incl. those that I "justified".

As for something you did not comment on: IndexMap. Rust treats JSON maps as unordered, but the spec I suggested requires them being ordered. This necessitates the use of the IndexMap package.

Do we keep or change the spec? In this case, to a list of tuples.

In addition, the mapping field should be treated as both ordered and non-ordered depending on perhaps an input argument ordered being True.

This would allow re-ordering the tar stream so that something like zstd:chunked can perform fewer range requests.

Currently, it is non-ordered, with a BTree or something inside ostree-rs-ext defining the ordering. For backward compatibility, and so that partitioning algorithms that are not fully deterministic have fewer ways to fail, this should remain the default ordering, with partitioning algorithms that care about the order of the tar stream being able to change it.

antheas · 2024-08-22T10:16:22Z

For reordering the tar stream, it would be required to "stash" somewhere in the image the order of the current tar stream (essentially the mapping contents), so that the next image can reference it.

Otherwise, it would have to be done out-of-band, which is too messy imo. I'd rather all metadata be part of the image.

Compressed, this is around 10MB. Since ostree-rs-ext creates the image, it also has the opportunity to stash that into e.g., the commit layer, which is small enough and can be pulled before creating the next image.

The alternative would be the inner config, which skopeo reads. However, adding an 8MB blob might cause issues for certain registries.

antheas · 2024-08-22T10:31:23Z

If we extract the mapping var from the contentmeta.json above and convert it into a \n-separated list string, it is 12.3MB.

After GZIP compression, it becomes 6.9MB.

If the compressed array is then converted using base85 to be inserted into e.g., a label, it becomes 8.7MB.
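
The base85 step can be sanity-checked independently of the real data: base85 encodes every 4 bytes as 5 characters, so the encoded form is about 25% larger than the compressed input, which is roughly consistent with going from 6.9MB to 8.7MB. A small sketch with made-up hashes follows; the line format and layer names are assumptions, not the real mapping.

```python
import base64
import gzip
import hashlib

# Deterministic stand-in hashes; the real mapping holds OSTree object hashes.
hashes = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(2000)]

# Hypothetical line format: "<hash> <layer>", one entry per line.
raw = "\n".join(
    f"{h} rechunk_layer{i % 40:03d}" for i, h in enumerate(hashes)
).encode()

packed = gzip.compress(raw)
encoded = base64.b85encode(packed)

# Base85 maps 4 input bytes to 5 output characters.
ratio = len(encoded) / len(packed)
```

For any input of this size, `ratio` comes out very close to 1.25.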

cgwalters · 2024-08-29T22:13:37Z

For reordering the tar stream, it would be required to "stash" somewhere in the image the order of the current tar stream (essentially the mapping contents), so that the next image can reference it.

Otherwise, it would have to be done out-of-band, which is too messy imo. I'd rather all metadata be part of the image.

That's forcing all consumers to download this though.

I think in the general case production build systems can maintain "out of band" caches as separate images in the registry. Look at how e.g. cosign/sigstore work - there's an OCI artifact uploaded to the registry that refers to the base manifest.

cgwalters · 2024-08-29T22:16:35Z

Currently, it is non-ordered, with a BTree or something inside ostree-rs-ext defining the ordering.

A BTree is ordered. The output ordering of each chunk is sorted by the checksum.

Which yes...if we wanted to optimize zstd:chunked range requests would need to be configurable. I can very much see the argument that actually by default we should order the output of chunks first by their associated contentID, and only then by checksum.
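
For illustration, the two orderings discussed here can be compared on a handful of made-up objects. This is a Python sketch, not the ostree-rs-ext code; the short checksums and content IDs are hypothetical.

```python
# (checksum, content ID) pairs; both values are made up for the example.
objects = [
    ("c3", "kde"),
    ("a1", "wine"),
    ("b2", "kde"),
    ("d4", "wine"),
]

# Current behaviour as described above: sorted by checksum only.
by_checksum = sorted(objects, key=lambda o: o[0])

# Proposed default: group by content ID first, then sort by checksum.
by_content_then_checksum = sorted(objects, key=lambda o: (o[1], o[0]))

checksum_order = [checksum for checksum, _ in by_checksum]
grouped_order = [checksum for checksum, _ in by_content_then_checksum]
```

Sorting by checksum alone interleaves objects from different content IDs, while the second key keeps each content ID's objects adjacent.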

antheas · 2024-08-30T06:44:53Z

Which yes...if we wanted to optimize zstd:chunked range requests would need to be configurable. I can very much see the argument that actually by default we should order the output of chunks first by their associated contentID, and only then by checksum.

This takes a bit of thinking, but you can derive that the chunking of the image is irrelevant as far as the zstd:chunked ordering is concerned.

If the ordering of the mapping variable is respected and the new hashes are placed at its end, then it does not matter how it is partitioned into layers. All new files will be at the end of the resulting layers, with respect to users of previous versions.

Regardless of how the previous versions or this version was structured into layers.
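
A minimal sketch of that ordering rule, assuming the previous image's order is available as a list of hashes. The function name and the single-letter hashes are hypothetical.

```python
def next_order(previous_order, current_hashes):
    """Order the current objects so that surviving ones keep their old
    relative order and new ones are appended at the end."""
    current = set(current_hashes)
    # Objects that still exist keep the position they had before.
    kept = [h for h in previous_order if h in current]
    seen = set(kept)
    # New objects go last, in the order they were supplied (deduplicated).
    added = [h for h in dict.fromkeys(current_hashes) if h not in seen]
    return kept + added


previous = ["a", "b", "c"]
current = ["d", "c", "a", "e"]  # "b" was removed, "d" and "e" are new
order = next_order(previous, current)
```

Here `order` is `["a", "c", "d", "e"]`: the surviving objects stay in front in their old order, and the new ones follow, independent of how the objects are later split into layers.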

we should order the output of chunks first by their associated contentID

The above means that this would only fragment the layers and be counterproductive.

The reason the layers variable is forced to be ordered in this PR is so that the resulting manifests have the same layer order, which is prettier and makes comparing manifests easier.

antheas · 2024-08-30T07:06:31Z

That's forcing all consumers to download this though.

I think in the general case production build systems can maintain "out of band" caches as separate images in the registry.

6.7MB of a 6.5GB image is, oh, 0.1%? For fedora core, it would be 2MB out of 2GB, so it scales pretty well with the image size, staying at 0.1% to 0.2%.

I do not think that warrants an out-of-band store. But it also need not be part of the spec, so enterprise users that want to squeeze out that 0.1% can.

Look at how e.g. cosign/sigstore work - there's an OCI artifact uploaded to the registry that refers to the base manifest.

Referencing the bazzite image store, in my opinion forcing out-of-band storage for signatures is a wart of the OCI spec. It clouds both skopeo manifests and the image store.

For most images, signing happens during build time. So the happy path would be to bundle the initial sigs with the image itself.

antheas · 2024-08-30T07:15:45Z

A BTree is ordered. The output ordering of each chunk is sorted by the checksum.

I would refer to that as deterministic at best, unless there are stable ABI guarantees in the Rust spec.

If the checksum algorithm changes for any reason, all existing images produced with ostree-rs-ext would be cache busted. This includes changing the string type ostree-rs-ext uses and Rust updates.

Ordered here would mean the spec would respect the ordering provided by mapping. False is provided as a sane default such that users of the API do not have to care about ordering if they do not have to, and for backwards compatibility.

cgwalters · 2024-08-30T12:49:29Z

If the checksum algorithm changes for any reason, all existing images produced with ostree-rs-ext would be cache busted.

This is really an unimportant side discussion but just for reference: If we changed how ostree computes checksums, it would also duplicate on-disk state, etc. It's not going to happen. What we should be doing is getting ostree out of the container images and orienting around composefs and fsverity, probably moving to xattrs in the tar stream, etc. which would be the more useful format break.

antheas · 2024-08-30T13:06:29Z

I meant the checksumming which causes ordering of the BTree within rust.

Looking at the Rust documentation, it is not clear how it is determined.

There are references to the Ord trait so perhaps in the current version it is alphabetical? I am unsure.

I am also not sure on how the final hash ordering is currently determined. But I would rather it not break unless it is done intentionally. I know the hashes themselves will not change.

antheas · 2024-08-30T13:38:59Z

What we should be doing is getting ostree out of the container images and orienting around composefs and fsverity, probably moving to xattrs in the tar stream, etc. which would be the more useful format break.

I agree on the other points though. I am not aware of Fedora's plans, but for us I'd prefer we maintain interop with rpm-ostree as it ships today for at least the next year on images shipped. And in the much shorter term, fix the xattrs issue, as it appears we will need it soon. Any image derived with rechunk cannot use it again because it loses xattrs.

That means extending ostree-rs-ext and depending on ostree for the time being.

cgwalters · 2024-08-30T18:55:28Z

I've been thinking about this more. I think we can simplify things a lot if instead of going from container image -> container image, we go from:

base image
additional package set
zero or more merged artifact images

It's basically the same thing I'm proposing here https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2078106687

The big simplification here vs what you have is that each artifact image becomes its own chunk by default (unless specified to be merged). Again something like this:

from: quay.io/fedora/fedora-bootc:40
packages:
  - libvirt
  - cowsay
copy-images:
  - quay.io/exampleos/systemd-units@sha256:1234...
  - quay.io/exampleos/custom-agent@sha256:ab89..

There's a strong relationship here to the (relatively) new COPY --link (that's not implemented in podman yet) for the latter two bits - it's basically what we're doing.

cgwalters · 2024-08-30T19:03:36Z

Well except actually, we need in the general case to do selinux labeling on artifacts when enabled, and canonicalize to ostree format short term, so it won't actually be COPY --link, but anyways the larger point is preserving the "one thing in, one thing out" in a reproducible fashion.

lib/src/cli.rs

cgwalters · 2024-08-30T19:09:07Z

Removes the canonical json exporter from ocidir-rs
...
However, it seems that GitHub truncates \n in image descriptions anyway, so it's unclear how important this is.

I agree this is a design flaw that needs fixing, but if it's not actually really useful right now can we then drop the newline requirement and the need to change ocidir-rs in the short term? I think that's the thing that's conflicting in the PR at the moment.

antheas · 2024-08-30T19:51:08Z

I agree this is a design flaw that needs fixing, but if it's not actually really useful right now can we then drop the newline requirement and the need to change ocidir-rs in the short term? I think that's the thing that's conflicting in the PR at the moment.

Reverted and tested it does not conflict. I will make sure to filter \n before merging the changes to rechunk.

antheas · 2024-08-30T20:30:03Z

I do not know why my local version did not catch the fixture error. Perhaps there needs to be a bit more testing there.

There is also still the following issue: ostree-rs-ext adds a final layer dedicated to new packages, which will not be used.

        {
            "MIMEType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "Digest": "sha256:5f9d3dcf5281c5f6512471366be68bee46c2485eddf4fd1887da6b240712be5f",
            "Size": 2340,
            "Annotations": {
                "ostree.components": ""
            }
        }

Changing it would require making basic_packing aware of skipping it and perhaps it is for a future PR.

antheas · 2024-08-30T21:02:11Z

Well except actually, we need in the general case to do selinux labeling on artifacts when enabled, and canonicalize to ostree format short term, so it won't actually be COPY --link, but anyways the larger point is preserving the "one thing in, one thing out" in a reproducible fashion.

Yeah, for the next 1-2 years this will probably be the way to go. For the vast majority of images, the time downside is negligible (4-7m) and the end result is compatible with rpm-ostree.

There is no size or performance benefit to the alternative for OS images.

There is a huge benefit however for AI and OCI images for not using ostree, which brings us to below or perhaps the topic of next Tuesday.

I've been thinking about this more. I think we can simplify things a lot if instead of going from container image -> container image, we go from:

base image

additional package set

zero or more merged artifact images

It's basically the same thing I'm proposing here https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2078106687

People like using OCI images because the contents are inspectable, extendable, and iterative.

They would rather paste a bunch of dnf install ..., so I say let them do it.

I also like the idea of having an optional postprocessing step like rechunk that fixes the image only when it is meant for distribution (as in it does not have to be used during development). And to have that tool handle all the quirks that come with shifting packages around so that the image is optimized for downloading.

I received a lot of positive feedback about the fact that rechunk is a drop-in addition and after that you get 3x smaller downloads for free. Rechunk contains a new partitioning algorithm which has quite a bit better bandwidth savings compared to the one in ostree-rs-ext. And requires no changes to existing images or workflow.

The big simplification here vs what you have is that each artifact image becomes its own chunk by default (unless specified to be merged). Again something like this:
from: quay.io/fedora/fedora-bootc:40
packages:
  - libvirt
  - cowsay
copy-images:
  - quay.io/exampleos/systemd-units@sha256:1234...
  - quay.io/exampleos/custom-agent@sha256:ab89..

I would rather the chunks be predetermined as that is what is more optimized if zstd:chunked is not used. Even if it is, it is still great for registries. This accounts for most of the gain rechunk has over rpm-ostree chunking right now.

However, the chunks you mention are not meant for optimizing downloads. They are actually extensions.

I have been thinking a lot while building rechunk about how extensions would look and researching sysext as well.

I do not think that sysext is appropriate for the bootc model, but it makes a great point: the extensions are married to the image. They update as part of it.

My idea would be to have the distro build a huge master container image that bundles all OS extensions together. They have 100 layers to do it, and for most applications that is generally enough. Then, they use a little YAML file to say which package is part of which extension. Rechunk handles the rest and places the extension files in separate layers, while having the correct directory structure due to using ostree.

Then, during deployment, the user has tied a feature mask to their update tag, which specifies which extensions should be enabled.

Depending on which extensions the user has selected, certain layers get shed, saving space and update bandwidth.

As the image is built at the same time with all extensions, there is 0 update drift and only one update channel. It also has the benefit that images with extensions are able to then be used as base images using from and they work as expected.

There's a strong relationship here to the (relatively) new COPY --link (that's not implemented in podman yet) for the latter two bits - it's basically what we're doing.

This COPY --link is essential for rechunking AI images, as it is prohibitively expensive to copy the model files.

Currently, rechunk copies the files oh, only 3 times:

podman mount to ostree
ostree to oci dir
oci dir to podman (esp. for zstd:chunked conversion)

It is not that big of an issue even with an image that is 16GB (7GB compressed). But if an image is 50-100GB it is.

For applying something like rechunk to general OCI images, OSTree will have to be removed from the equation and buildah needs to be used instead. Also needs reflinks to avoid thrashing the disk with large container images.

cgwalters

I'd have liked to merge this with tests but not going to block on it in the immediate term.

Maybe one thing we can try to directly do is "reverse dependency testing", run rechunk's tests at least optionally on PRs here?

antheas added 9 commits June 28, 2024 17:12

add contentmeta option with determinism

f300504

add label support to top level

ad24d8c

force ordering in internal mappings (does not seem to make a difference/remove)

050916e

fix layer concatenation bug

92b12be

add annotation setting support

8c94e05

add created timestamp support

ff6a6da

Revert "add annotation setting support"

33b2606

This reverts commit 8c94e05.

add annotation support

05f13f8

use ocidir fork to fix newlines

1061e93

antheas mentioned this pull request Aug 2, 2024

submit changes to ostree-rs-ext hhd-dev/rechunk#1

Closed

cgwalters reviewed Aug 2, 2024

cgwalters mentioned this pull request Aug 2, 2024

olpc-cjson != JSON containers/ocidir-rs#10

Open

cgwalters reviewed Aug 20, 2024

antheas added 2 commits August 27, 2024 23:28

remove option<> from rawmeta creation

e855c51

improve contentmeta docs and fix label nit

dbef2af

cgwalters reviewed Aug 30, 2024

Revert "use ocidir fork to fix newlines"

f3162f9

This reverts commit 1061e93.

antheas added 4 commits August 30, 2024 22:03

fix formatting

cb33dad

make version field mandatory

3ea0f76

fix fixture using the wrong entry type

862c1ec

lower indexmap version to fix c9s

8fda049

antheas marked this pull request as ready for review August 30, 2024 20:32

cgwalters approved these changes Sep 3, 2024

cgwalters merged commit 60e2a40 into ostreedev:main Sep 3, 2024
10 checks passed
