@kit-ty-kate kit-ty-kate commented Jul 31, 2025

Fixes #5741
Fixes #5648
Fixes #5484
Fixes #5346
Fixes #5559
cc @hannesm to check if it works for conex (2.1 worked with tar.gz files already so i'm not too scared about breakages)

Reasoning

#5741 shows that assumptions that hold true "most of the time" on some Unix platforms, such as "it is OK to scan a large tree structure of files and directories", don't hold true on other platforms. Windows, network filesystems, busy shared servers whose disk is constantly in use, hard drives, … all suffer from this.

In opam we can have 3 types of repositories:

  1. HTTP (the default): where we download a tar.gz
  2. VCS (aka. mostly git these days): where we use the vcs command line to get the files
  3. local/ssh: where we either use our own copy primitives or use rsync

Out of the three, the most critical for first-time users is the first one. It is also the one that suffers the most from these issues, as currently we:

  1. untar it
  2. diff with the current repository (used for conex and for #6614 "opam update: load only changed opam files")
  3. patch the changed files
  4. remove the directory
  5. rescan the whole repository

VCS repositories do not have step 4, and steps 1, 2 and 3 are built in and heavily optimized. Only step 5 is left, which should be improved by #6614 and which we can improve further later by using git cat-file or even by parsing PACK files using ocaml-git.

Local/ssh repositories are the ones where there is very little we can do. #5966 should help, but beyond that, we might want to require that people use git even for local repositories.

For HTTP though, the untarring (which takes 1+ minute) is the main issue. Thus this here PR.

Design decisions

Instead of untarring we simply use the tar.gz as-is and use ocaml-tar to read it on the fly.
The new update steps are:

  1. diff the two tar.gz
  2. remove the old tar.gz
  3. move the new tar.gz in its place
  4. scan what has changed (requires #6614 "opam update: load only changed opam files")
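To make the new steps concrete, here is a minimal sketch in Python rather than opam's OCaml, using the standard `tarfile` module in place of ocaml-tar. The function names (`archive_digests`, `diff_archives`, `update_repository`) are hypothetical and are not part of this PR; the sketch only illustrates comparing two tar.gz files member by member without unpacking either one, then swapping the new archive into place.

```python
import hashlib
import os
import tarfile


def archive_digests(path):
    """Map each regular file in a tar.gz to the SHA-256 of its contents."""
    digests = {}
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, symlinks, etc.
            with archive.extractfile(member) as stream:
                digests[member.name] = hashlib.sha256(stream.read()).hexdigest()
    return digests


def diff_archives(old_path, new_path):
    """Return (added, removed, changed) member names between two tar.gz files."""
    old, new = archive_digests(old_path), archive_digests(new_path)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    return added, removed, changed


def update_repository(current_path, downloaded_path):
    """Diff first (step 1), then replace the old archive with the new one (steps 2 and 3)."""
    result = diff_archives(current_path, downloaded_path)
    # os.replace overwrites the destination, so removing the old archive and
    # moving the new one happen in a single call.
    os.replace(downloaded_path, current_path)
    return result  # step 4 would rescan only these paths
```

Nothing is ever written to disk except the final rename, which is where the savings over untarring tens of thousands of small files would come from. A real implementation would probably avoid hashing every member on each update, for instance by comparing sizes or cached digests first.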

Given the ubiquity of the use of OpamFilename.Dir.t to mean both any random directory and a repository directory, i chose to first abstract over it in a new OpamRepositoryRoot module, and work with the help of the type checker from there. Its interface helps to see which actions opam performs on repositories. While i'd rather keep them, the Tar and Dir submodules can be removed when everything is done.
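As an illustration of the idea only (the real OpamRepositoryRoot is an OCaml module whose interface is not shown here), the following Python sketch puts a directory-backed and a tar-backed repository behind the same two operations. The class and method names (`DirRoot`, `TarRoot`, `list_files`, `read_file`, `opam_files`) are made up for this example.

```python
import os
import tarfile


class DirRoot:
    """Repository stored as a plain directory tree."""

    def __init__(self, path):
        self.path = path

    def list_files(self):
        found = []
        for base, _dirs, names in os.walk(self.path):
            for name in names:
                relative = os.path.relpath(os.path.join(base, name), self.path)
                found.append(relative.replace(os.sep, "/"))
        return sorted(found)

    def read_file(self, name):
        with open(os.path.join(self.path, *name.split("/")), "rb") as stream:
            return stream.read()


class TarRoot:
    """Repository stored as a tar.gz that is never unpacked on disk."""

    def __init__(self, path):
        self.path = path

    def list_files(self):
        with tarfile.open(self.path, "r:gz") as archive:
            return sorted(m.name for m in archive if m.isfile())

    def read_file(self, name):
        with tarfile.open(self.path, "r:gz") as archive:
            with archive.extractfile(archive.getmember(name)) as stream:
                return stream.read()


def opam_files(root):
    """Works on either backend because it only uses the shared operations."""
    return [name for name in root.list_files() if name.endswith("/opam")]
```

Callers such as `opam_files` never learn which backend they are talking to, which is the property that lets the rest of the code base be migrated one call site at a time.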

The REPOSITORYTARRING environment variable is removed by this work, given the repositories are tarred already.

I had to simplify opam var pkg:opamfile for this work. Previously it would point to the file in the repository. However, this isn't what it's supposed to be doing. Instead it should point to the <switch>/.opam-switch/packages/ directory, which actually reflects the opam file that was used for the installation. Otherwise the opam file can change between before and after the user has called opam update etc.

TODO

There are a number of assert false (* TODO *) in this draft PR. Those are to be fixed before undrafting but i felt reasonably confident with the rest of them to open this draft PR in this state to put more eyes on this work and to increase my self-motivation.

  • The main thing to do is to implement the diff function between two tar files and between a directory and a tar.
  • The other thing is to fill OpamRepositoryState.get_repo_files: a function which extracts a limited number of files from the tar.gz to a new cache directory.
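For the second item, the behaviour described could look roughly like the sketch below. It is written in Python with the standard `tarfile` module and is not the OCaml implementation of OpamRepositoryState.get_repo_files; the function name `extract_selected` and its parameters are assumptions made for the example.

```python
import os
import tarfile


def extract_selected(archive_path, wanted, cache_dir):
    """Copy only the members named in `wanted` from a tar.gz into cache_dir."""
    wanted = set(wanted)
    root = os.path.abspath(cache_dir)
    written = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if member.name not in wanted or not member.isfile():
                continue
            # Build the destination by hand and refuse names that would
            # escape the cache directory (for example "../something").
            target = os.path.abspath(os.path.join(root, *member.name.split("/")))
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f"unsafe member name: {member.name}")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as source, open(target, "wb") as sink:
                sink.write(source.read())
            written.append(target)
    return written
```

Only the requested files touch the disk, so the cost scales with the number of files a command actually needs rather than with the size of the repository.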

Some of these changes should probably be extracted to separate PRs but let's do that at the end when we have something that actually works.

While an early form of this work started a year and a half ago, i believe the cruft left over from that time should be minimal, after 6 different branches. The final rebase and split into smaller PRs shouldn't be too painful.

Future work

In the future we can use ocaml-tar, which we now depend on, to replace some of the uses of the tar command. This should allow us to have better behaviours with things like symlinks on Windows or even add new features such as excluding some directories (see ocaml/ocaml#14152).

As mentioned above we can also improve local and git/vcs repositories with or without ocaml-git.

hannesm commented Aug 1, 2025

Hey, thanks for your work on that. Since you asked me directly:

> cc @hannesm to check if it works for conex

> Instead of untarring we simply use the tar.gz as-is and use ocaml-tar to read it on the fly.
> The new update steps are:
>
>   1. diff the two tar.gz
>   2. remove the old tar.gz
>   3. move the new tar.gz in its place

This should be fine from the design point of view for conex.

Conex will need to interject between step 0 (you downloaded the tarball) and step 2 (remove the old tar.gz). Currently, conex requires a diff file on disk, and the old repository as a directory. But we can revise that interface, and conex could just as well work on two tar.gz files (and/or on two directories).

I guess you have a clear understanding of the current update process, and since you mention the different kinds (http, git, local) -- maybe we should re-think how opam and conex should interact, to avoid a burden on people not using conex and to avoid duplicating computations in both opam and conex. The latter may require including conex as a library in opam.

I'm away for the next 10 days (back on August 10th), but am happy to discuss this afterwards, especially since I plan to revive my work on conex thereafter.

hannesm commented Aug 1, 2025

To be more precise, given that

  • (a) local repositories (rsync) aren't really worth verifying (under the assumption that whoever has access to the repository can install arbitrary packages anyway) [which may be revisited if there's the NFS use case or a shared server]
  • (b) VCS (git): we could do the git fetch and provide conex with the local repository and the commit that the update should be to
  • (c) http: provide old and new tarball

Then conex could do what is needed (compute the set of changed opam files, verify signatures; exit 0 on success); and could even report back the set of changed files to opam (I suspect this is what #6614 depends on) - using a file, or a socket, or if integrated with opam, this will be much simpler (using shared memory).

For opam itself, I guess that #6349 and #5553 will improve a lot of updates already.
