feat: Add external input support for container encapsulation #652
Conversation
This reverts commit 8c94e05.
Here is an example contentmeta file:
Worthwhile to mention that the
Thanks for filing this! Just a quick initial look.
lib/Cargo.toml
Outdated
@@ -36,7 +36,7 @@ once_cell = "1.9"
 libc = "0.2.92"
 libsystemd = "0.7.0"
 openssl = "0.10.33"
-ocidir = "0.1.0"
+ocidir = { version = "0.2.0", git = "https://github.com/hhd-dev/ocidir-rs" }
Ah...you have changes to that too!
The bump to 0.2 is in #653 at least.
Would you mind doing (again) at least a draft PR to the repo with hhd-dev/ocidir-rs@9f6095c and providing a bit of reproducer instructions around how skopeo is broken?
Ohh I see it's in the PR text:
Violates the JSON standard by encoding the characters \n etc literally
Hmmmm...ok. I will look.
https://docs.rs/olpc-cjson/latest/olpc_cjson/
(specifically, ASCII control characters 0x00–0x1f are printed literally, which is not valid JSON). Therefore, serde_json cannot necessarily deserialize JSON produced by this formatter.
error: Querying manifest after push: Fetching manifest: Failed to invoke skopeo proxy method GetManifest: remote error: invalid character '\n' in string literal
Hopefully this helps.
It only occurs if \n is included in a label. It would have been nice to generate fancy descriptions, although the only place that reads them (ghcr) omits \n, so new lines are not shown.
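For a concrete picture of the failure, here is a minimal standalone reproduction of the same class of error (my own sketch, not code from this PR): a spec-compliant parser such as serde_json rejects a literal newline inside a JSON string, which is the condition skopeo trips over above. The label name is only illustrative.

```rust
// Sketch: a raw 0x0A byte inside a JSON string value is invalid JSON.
fn main() {
    let bad = "{\"org.opencontainers.image.description\":\"line one\nline two\"}";
    match serde_json::from_str::<serde_json::Value>(bad) {
        Ok(v) => println!("parsed: {v}"),
        // serde_json reports a control-character error here, analogous to
        // skopeo's "invalid character '\n' in string literal".
        Err(e) => eprintln!("error: {e}"),
    }
}
```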
OK, right, thanks. This is probably best tracked at containers/ocidir-rs#10 then?
Yes; since I do not think my fork is a proper fix, it did not make sense to PR it.
The code was very much written with the idea of doing something like this; awesome that you were able to get it to work!
One of the problems of the spec I made is that serde does not preserve order in JSON dictionaries, which is why indexmap is included. Rust developers would suggest something like this to preserve ordering:

layers:
  - key: dedi:meta:wine
    friendly: "dedi:meta:wine"
  - key: dedi:meta:wine
    friendly: "dedi:meta:wine"
mapping:
  - key: 000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba
    layer: rechunk_layer027
  - key: 000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba
    layer: dedi:meta:kde

Which I am not sure how I feel about. Other than that, I think it makes sense as an API.
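For reference, a minimal sketch (type and field names are mine, not the PR's actual structs) of how a map-shaped spec can keep its order through serde by deserializing into IndexMap rather than a plain HashMap/BTreeMap; this assumes indexmap is built with its serde feature:

```rust
use indexmap::IndexMap;
use serde::{Deserialize, Serialize};

// Illustrative model of the contentmeta file: both maps keep the order
// in which keys appear in the JSON document.
#[derive(Serialize, Deserialize)]
struct ContentMeta {
    layers: IndexMap<String, String>,  // layer id -> friendly name
    mapping: IndexMap<String, String>, // object checksum -> layer id
}

fn main() -> Result<(), serde_json::Error> {
    let json = r#"{
        "layers": { "dedi:meta:wine": "dedi:meta:wine" },
        "mapping": { "000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba": "rechunk_layer027" }
    }"#;
    let meta: ContentMeta = serde_json::from_str(json)?;
    for (checksum, layer) in &meta.mapping {
        // Iteration follows document order, unlike HashMap.
        println!("{checksum} -> {layer}");
    }
    Ok(())
}
```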
lib/src/cli.rs
Outdated
/// When the image was created. Sync it with the io.container.image.created label.
pub created: Option<String>,
/// Top level labels, to be prefixed to the ones with --label
pub labels: Option<BTreeMap<String, String>>,
I'm OK having these here, but it seems like it'd make more sense for them to be separate CLI arguments?
I think it's valuable to have two sources for this, as currently rechunk will also generate the labels and the created tag and harmonize them. Since it pulls the rpm database, it can do fancy stuff such as variable substitution. So this acts as a way of passing them through the two commands cleanly.
Having to refeed them into arguments would be hell.
I expect if anyone else tried to extend this they would agree.
Here is how the action example for the layers looks right now:
https://github.com/hhd-dev/rechunk/blob/496f4b84aced656b9c2c0f176f24323fe13129ad/.github/workflows/online_test_deck.yml#L51-L93
The CLI may override the file.
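To illustrate that override rule, here is a rough sketch of how the merge could work (my reading of the intent, not the PR's actual code): labels from the contentmeta file form the base, and --label arguments win on conflict. The label key and values are only examples.

```rust
use std::collections::BTreeMap;

// File-provided labels are the base; CLI-provided labels override them.
fn merge_labels(
    from_file: Option<BTreeMap<String, String>>,
    from_cli: &[(String, String)],
) -> BTreeMap<String, String> {
    let mut labels = from_file.unwrap_or_default();
    for (k, v) in from_cli {
        labels.insert(k.clone(), v.clone()); // CLI wins on conflicts
    }
    labels
}

fn main() {
    let file = Some(BTreeMap::from([(
        "org.opencontainers.image.created".to_string(),
        "2024-01-01T00:00:00Z".to_string(),
    )]));
    let cli = [(
        "org.opencontainers.image.created".to_string(),
        "2024-06-01T00:00:00Z".to_string(),
    )];
    let merged = merge_labels(file, &cli);
    println!("{merged:?}"); // the CLI value replaces the file value
}
```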
lib/src/cli.rs
Outdated
change_frequency: if k == "unpackaged" { std::u32::MAX } else { 1 },
change_time_offset: 1,
},
size: 1,
Size of 1 seems odd here?
This section is what was required to make it work without restructuring the code.
A proper implementation would remove this, incl. the dangling "Reserved for New Packages" layer at the end.
lib/src/cli.rs
Outdated
sizes: raw
    .layers
    .into_iter()
    .map(|(k, v)| ObjectSourceMetaSized {
Couldn't we just directly parse this data from the input JSON?
The use of this fork only reuses ostree-rs-ext as an exporter. If we wanted to use it with its grouping algorithm, it would make sense to do it that way.
Right now this section is fighting with the existing code to make it work as an exporter while exposing a cleaner API for the JSON file.

"Should be fixed" as in future me will have to fix it :) I will have a look over this PR this weekend and try to fix the comments, incl. those that I "justified".

As for something you did not comment on: IndexMap. Rust treats JSON maps as unordered, but the spec I suggested requires them to be ordered, which necessitates the IndexMap package. Do we keep or change the spec? In this case, to a list of tuples.

In addition, this would allow re-ordering the tar stream so that something like zstd:chunked can perform fewer range requests. Currently, it is non-ordered, with a BTree or something inside ostree-rs-ext defining the ordering. For backward compatibility, and to have fewer ways in which not-fully-deterministic partitioning algorithms can fail, this should remain the default ordering, with partitioning algorithms that care about the order of the tar stream being able to change it.
For reordering the tar stream, it would be required to "stash" the order of the current tar stream somewhere in the image. Otherwise, it would have to be done out-of-band, which is too messy imo; I'd rather all metadata be part of the image. Compressed, this is around 10MB. The alternative would be the inner config, which skopeo reads. However, adding an 8MB blob might cause issues for certain registries.
After GZIP compression, it becomes 6.9MB. If the compressed array is then converted using base85 so it can be inserted into e.g. a label, it becomes 8.7MB.
That's forcing all consumers to download this though. I think in the general case production build systems can maintain "out of band" caches as separate images in the registry. Look at how e.g. cosign/sigstore work - there's an OCI artifact uploaded to the registry that refers to the base manifest.
A BTree is ordered. The output ordering of each chunk is sorted by the checksum. Which, yes...if we wanted to optimize zstd:chunked range requests, that would need to be configurable. I can very much see the argument that by default we should actually order the output of chunks first by their associated contentID, and only then by checksum.
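Making that configurable could be as simple as swapping the sort key; a hypothetical sketch (struct and field names invented, not ostree-rs-ext's types) of ordering objects first by content ID and only then by checksum:

```rust
#[derive(Debug)]
struct Obj {
    contentid: String, // owning package / component id
    checksum: String,
}

// Group objects of the same component together, then order within the group
// by checksum; checksum-only order is what a checksum-keyed BTree gives today.
fn sort_for_stream(objects: &mut [Obj]) {
    objects.sort_by(|a, b| {
        a.contentid
            .cmp(&b.contentid)
            .then_with(|| a.checksum.cmp(&b.checksum))
    });
}

fn main() {
    let mut objs = vec![
        Obj { contentid: "wine".into(), checksum: "ff00…".into() },
        Obj { contentid: "kde".into(), checksum: "000c…".into() },
        Obj { contentid: "kde".into(), checksum: "9f60…".into() },
    ];
    sort_for_stream(&mut objs);
    println!("{objs:?}"); // kde objects first, then wine
}
```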
This takes a bit of thinking, but you can derive that the chunking of the image is irrelevant as far as the zstd:chunked ordering is concerned. If the ordering of the mapping variable is respected and new hashes are placed at its end, then it does not matter how it is partitioned into layers: all new files will be at the end of the resulting layers with respect to users of previous versions, regardless of how the previous versions or this version were structured into layers.
The above means that this would only fragment the layers and be counterproductive. The reason the layers variable is forced to be ordered in this PR is so that the resulting manifests have the same layer order, which is prettier and makes comparing manifests easier.
6.7MB of a 6.5GB image is, oh, 0.1%? For Fedora CoreOS, it would be 2MB out of 2GB, so it scales pretty well with the image size, at 0.1% to 0.2%. I do not think that warrants an out-of-band store. But it also need not be part of the spec, so enterprise users that want to squeeze out that 0.1% can.
Referencing the bazzite image store, in my opinion forcing out-of-band storage for signatures is a wart of the OCI spec. It clouds both skopeo manifests and the image store. For most images, signing happens during build time. So the happy path would be to bundle the initial sigs with the image itself.
I would refer to that as deterministic, at best, unless there are stable ABI guarantees in the Rust spec. If the checksum algorithm changes for any reason, all existing images produced with ostree-rs-ext would be cache-busted. This includes changing the string type ostree-rs-ext uses and Rust updates. Ordered here would mean the spec would respect the ordering provided by the input mapping.
This is really an unimportant side discussion but just for reference: If we changed how ostree computes checksums, it would also duplicate on-disk state, etc. It's not going to happen. What we should be doing is getting ostree out of the container images and orienting around composefs and fsverity, probably moving to xattrs in the tar stream, etc., which would be the more useful format break.
I meant the checksumming which causes the ordering of the BTree within Rust. Looking at the Rust documentation, it is not clear how it is determined. There are references to the Ord trait, so perhaps in the current version it is alphabetical? I am unsure. I am also not sure how the final hash ordering is currently determined, but I would rather it not break unless it is done intentionally. I know the hashes themselves will not change.
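For what it's worth, the map ordering itself is specified: BTreeMap iterates in the key's Ord order, and Ord for str/String is documented as byte-wise lexicographic comparison, so hex checksums come out in plain ASCII order regardless of insertion order. A tiny illustration (not the actual ostree-rs-ext code), which says nothing about how the final hash ordering upstream is derived:

```rust
use std::collections::BTreeMap;

fn main() {
    let mut chunks = BTreeMap::new();
    chunks.insert("ff00aa".to_string(), "layer-b");
    chunks.insert("000c61".to_string(), "layer-a");
    chunks.insert("9f6095".to_string(), "layer-c");
    // Iteration order is byte-wise lexicographic on the key, independent of
    // insertion order: 000c61, 9f6095, ff00aa.
    for (checksum, layer) in &chunks {
        println!("{checksum} {layer}");
    }
}
```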
I agree on the other points though. I am not aware of Fedora's plans, but for us I'd prefer we maintain interop with rpm-ostree as it ships today for at least the next year on shipped images, and in the much shorter term fix the xattrs issue, as it appears we will need it soon. Any image derived with rechunk cannot use it again because it loses xattrs. That means extending ostree-rs-ext and depending on ostree for the time being.
I've been thinking about this more. I think we can simplify things a lot if instead of going from container image -> container image, we go from:
It's basically the same thing I'm proposing here: https://gitlab.com/fedora/bootc/tracker/-/issues/32#note_2078106687. The big simplification here vs what you have is that each artifact image becomes its own chunk by default (unless specified to be merged). Again, something like this:
There's a strong relationship here to the (relatively) new COPY --link (that's not implemented in podman yet) for the latter two bits - it's basically what we're doing.
Well, except actually, we need in the general case to do SELinux labeling on artifacts when enabled, and canonicalize to ostree format short term, so it won't actually be
I agree this is a design flaw that needs fixing, but if it's not actually really useful right now, can we then drop the newline requirement and the need to change ocidir-rs in the short term? I think that's the thing that's conflicting in the PR at the moment.
This reverts commit 1061e93.
Reverted and tested that it does not conflict. I will make sure to filter \n before merging the changes to rechunk.
I do not know why my local version did not catch the fixture error; perhaps there needs to be a bit more testing there. There is also still the following issue: ostree-ext adds a last layer dedicated to new packages that will not be used.
Changing it would require making
Yeah, for the next 1-2 years this will probably be the way to go. For the vast majority of images, the time downside is negligible (4-7 minutes) and the end result is compatible with rpm-ostree. There is no size or performance benefit to the alternative for OS images. There is a huge benefit, however, for AI and OCI images in not using ostree, which brings us to below, or perhaps the topic of next Tuesday.
People like using OCI images because the contents are inspectable, extendable, and iterative. They would rather paste a bunch of

I also like the idea of having an optional postprocessing step like rechunk that fixes the image only when it is meant for distribution (as in, it does not have to be used during development), and to have that tool handle all the quirks that come with shifting packages around so that the image is optimized for downloading. I received a lot of positive feedback about the fact that rechunk is a drop-in addition and that after it you get 3x smaller downloads for free.

Rechunk contains a new partitioning algorithm which has quite a bit better bandwidth savings compared to the one in ostree-rs-ext, and requires no changes to existing images or workflows.
I would rather the chunks be predetermined, as that is what is more optimized if zstd:chunked is not used. Even if it is, it is still great for registries. This accounts for most of the gain rechunk has over rpm-ostree chunking right now. However, the chunks you mention are not meant for optimizing downloads. They are actually extensions.

I have been thinking a lot while building rechunk about how extensions would look, and researching sysext as well. I do not think that sysext is appropriate for the bootc model, but it makes a great point: the extensions are married to the image; they update as part of it.

My idea would be to have the distro build a huge master container image that bundles all OS extensions together. They have 100 layers to do it, and for most applications that is broadly enough. Then, they use a little yaml file to say which package is part of which extension. Rechunk handles the rest and places the extension files into separate layers, while keeping the correct directory structure due to using ostree.

Then, during deployment, the user has tied a feature mask to their update tag, which specifies which extensions should be enabled. Depending on which extensions the user has selected, certain layers get shed, saving space and update bandwidth (see the sketch below). As the image is built at the same time with all extensions, there is 0 update drift and only one update channel. It also has the benefit that images with extensions are able to then be used as base images using
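To make the layer-shedding step concrete, here is a rough sketch of the idea (all names and the per-layer extension tag are invented for illustration): given the set of extensions a user enabled, only the base layers plus the matching extension layers would be pulled.

```rust
use std::collections::HashSet;

struct Layer {
    digest: String,
    // Hypothetical per-layer tag naming the extension it carries, if any.
    extension: Option<String>,
}

// Base layers are always kept; extension layers only when selected.
fn layers_to_pull(all: Vec<Layer>, enabled: &HashSet<String>) -> Vec<Layer> {
    all.into_iter()
        .filter(|l| l.extension.as_ref().map_or(true, |e| enabled.contains(e)))
        .collect()
}

fn main() {
    let enabled: HashSet<String> = ["dedi:meta:wine".to_string()].into_iter().collect();
    let layers = vec![
        Layer { digest: "sha256:base…".into(), extension: None },
        Layer { digest: "sha256:wine…".into(), extension: Some("dedi:meta:wine".into()) },
        Layer { digest: "sha256:kde…".into(), extension: Some("dedi:meta:kde".into()) },
    ];
    for l in layers_to_pull(layers, &enabled) {
        println!("pulling {}", l.digest); // base and wine; the kde layer is shed
    }
}
```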
This copy --link is essential for rechunking AI images, as it is prohibitively expensive to copy the model files. Currently, rechunk copies the files oh, only 3 times:
It is not that big of an issue even with an image that is 16GB (7GB compressed), but if an image is 50-100GB it is. For applying something like rechunk to general OCI images, OSTree will have to be removed from the equation and buildah needs to be used instead. It also needs reflinks to avoid thrashing the disk with large container images.
I'd have liked to merge this with tests, but I'm not going to block on it in the immediate term.
Maybe one thing we can try to do directly is "reverse dependency testing": run rechunk's tests, at least optionally, on PRs here?
This PR adds a `--contentmeta` option to `ostree-rs-ext container encapsulate` and makes the required changes to support it. The `--contentmeta` option receives the path of a JSON file as input, which has the following format (shown in yaml):

Then, `ostree-rs-ext` uses this file as a base to export the provided commit as an OCI image.

The implementation here hijacks and turns off the rechunking component of `ostree-rs-ext` in an unclean way, so it needs to be refactored. For example, `ostree-rs-ext` adds one extra layer for "new-packages".

In addition, the following changes had to be made:

- `ocidir-rs`: it violates the JSON standard by encoding the characters `\n` etc. literally. ghcr omits `\n` in image descriptions anyway, so it's unclear how important this is.
- `rpm-ostree status` (example)

Overall, this PR makes it possible to remove `rpm-ostree` as a backend in image creation, and use something like rechunk for the packaging aspect. This opens up the possibility of including support for more distros (e.g., arch) as their packaging format can now be scripted.

In addition, since creating the JSON file is decoupled from producing the OCI image, it becomes possible to simulate different packaging scenarios without having to wait for and analyze the output OCI image.
Performance
As a sidenote, an attempt was made to speed up `ostree-rs-ext` writing the output image. Currently, it is single threaded and takes around 5 minutes to export an image, which rounds out to around 10 minutes, doubling the time it takes to produce an image. It would be nice if it took e.g. 1 minute and had a progress bar.

We could not make multithreading work due to the multiple references to the Arc strings and the mutable reference to the OSTree object, which only holds a dir. It would take a bit of restructuring.

Removing `rpm-ostree` from the process saved around 2 minutes in calculating the layers. Most of that time is due to OSTree file lookups, which is relatively easy to fix. For example, `rechunk` uses a hash memory map extracted from OSTree, which takes around 20 seconds to calculate and allows for instant lookups.

Note that the command used, `ostree ls`, is not currently machine parseable (size and hard link numbers overflow), so it would be nice to fix that with e.g. a `--csv` flag.
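As a generic sketch of the hash-memory-map idea (the data source here is stubbed; rechunk's real map is extracted from the OSTree repo, and the checksum/size values are illustrative): pay the walk cost once, then every per-object size lookup is O(1) instead of an ostree query per file.

```rust
use std::collections::HashMap;

// Build the map once up front from (checksum, size) pairs...
fn build_size_map(objects: impl IntoIterator<Item = (String, u64)>) -> HashMap<String, u64> {
    objects.into_iter().collect()
}

fn main() {
    // Stub data standing in for a walk of the OSTree repository.
    let map = build_size_map(vec![(
        "000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba".to_string(),
        4096,
    )]);
    // ...then lookups during layer packing are constant-time.
    if let Some(size) = map.get("000c6113a238d5792f25f097f37b9394bc2a735ed40542dd70826c4b9f7393ba") {
        println!("object size: {size}");
    }
}
```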