Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: working Dockerfile \o/ #123

Merged
merged 12 commits into from
Feb 15, 2023
Merged

fix: working Dockerfile \o/ #123

merged 12 commits into from
Feb 15, 2023

Conversation

bloodearnest
Copy link
Member

@bloodearnest bloodearnest commented Jan 27, 2023

This change fixes the longstanding issue of not being able to rebuild
our R image, which was causing many issues.

It solves the problem of reproducing the R library versions as in the
current R docker image by using renv to manage a lockfile of versions.

This is bit nuanced (renv is not really designed for this use case)
but it does work. Check the Dockerfile comments for details. With
a cold cache, it is very slow to build all the initial packages (i.e.
1-2 hours). But the build process keeps a local cache of them on the
host, so rebuilds are fast.

Whilst there are no R library version changes, there are some system
library changes, as we moved from 18.04 to 20.04 as a base image. This
was necessary because a) 18.04 is EOL shortly and b) we don't maintain
a base 18.04 image. The file 20.04-library-changes.txt contains
a summary of differences, basically a bunch of uprades to system
libraries. Also, we upgraded to the latest point release of R 4.0
series, 4.0.2 -> 4.0.5.

Additionally, many unused components have been removed from the image:

  • all build toolchain/-dev packages
  • java/python integration
  • various x11/gtk dependencies

These made the original image heavier, and to the best of our knowledge,
have never been used by any projects. If this turns out to be
mistaken, we can add things back easily enough. The old image was 4.5G
from dockers

  • the new one is 1.9G.

Testing:

An R image build with this process was used to test ~30 OpenSAFELY
project.yamls with R code in, and it all worked, so we have some
confidence we've not broken anything.

We'll keep the old image unter r:legacy tag or something, so we can
easily access it if needed.

Note: this does not add a publishing workflow to the repo, and thus does
not change the current image by landing.

Whether we can build it in GH remains to be seen. With no cache, I'd
expect it to take multiple hours.

@bloodearnest bloodearnest mentioned this pull request Jan 27, 2023
@remlapmot
Copy link
Contributor

This is super impressive Simon.

The multi-stage build and the caching is superb.

My comments are:

  • At the beginning of a build there's a warning

    just build
    WARN[0000] The "PACKAGE" variable is not set. Defaulting to a blank string. 
    

    however, on Git Bash on Windows 11 and on bash and zsh my WSL v2 Ubuntu this seems to cause the build to fail and exit. On zsh on macOS the build succeeded. I can't explain why the difference occurs.

  • On line 40 of the Dockerfile I recommend installing renv at a specific version in a reproducible way (in case your cache gets deleted), and this is what the renv docs site recommends
    https://rstudio.github.io/renv/articles/docker.html#creating-docker-images-with-renv
    You could implement their approach of using the remotes package or install from a Posit Package Manager snapshot (or there are other ways to achieve this), e.g. in my legacy-03 branch I did it with

    install.packages('renv',  repos = 'https://packagemanager.posit.co/all/__linux__/focal/2023-01-12+BkwhZFgY')
  • I think the add-package section of the Dockerfile may need a little bit of work or at least better explanation. This is because as far as I understand it the Michael Rutter PPA does not contain all CRAN packages. I remember reading it had about 6,000 packages (not sure how to check this), whereas all of CRAN is approx 20,000 packages. So I think that I can make the just add-package error by passing it the name of a package that's on CRAN but not in the PPA (I think - or maybe the below error occurs because CRAN package names can contain capital letters), here's an example using one of my CRAN packages, I tried different syntaxes

    just add-package OneSampleMR
    just add-package onesamplemr
    just add-package r-cran-onesamplemr
    

    which all failed.

    The first syntax fails with

    ./add-package.sh OneSampleMR
    Attempting to build and install OneSampleMR
    [+] Building 0.0s (0/0)                                                                                                             
    invalid tag "r-OneSampleMR": repository name must be lowercase
    Builing OneSampleMR failed.
    You may need to add build dependencies (e.g. -dev packages) to build-dependencies.txt
    error: Recipe `add-package` failed on line 22 with exit code 1
    

    and the second and third syntaxes fail with the last three lines of the above message.

    Additionally adding another of my packages, mrbayes appeared to work but the tests errored with

    just add-package mrbayes
    ./test.sh "r"
    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    Error: unexpected symbol in "-e library"
    Execution halted
    error: Recipe `test` failed on line 27 with exit code 1
    error: Recipe `add-package` failed on line 22 with exit code 1
    

    I think this is simply fixed by the syntax for the " before library on line 4 of test.sh, i.e., maybe you need to escape the quotes or use a different quote syntax given there are nested quotes on that line.

@remlapmot
Copy link
Contributor

It's great to see the locale issue fixed.

It turns out that C.UTF-8 is what Posit Cloud set their machines to. This seems to return the same string sort ordering for the 26 letter alphabet as when the locale is set to en_GB.UTF-8 (as it likely is on all/most users machines).

sort(c(head(letters), head(LETTERS)))
## [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"

So I'm pretty sure that most graphs with axes sorted by a string variable will now be the same whether produced in the container or on a local machine (that would be a graph made locally using summary data, obvs).

@bloodearnest
Copy link
Member Author

It's great to see the locale issue fixed.

This came for free by switching to use our standard base-action image, which sets up a bunch of sensible defaults for all our images 👍

It turns out that C.UTF-8 is what Posit Cloud set their machines to. This seems to return the same string sort ordering for the 26 letter alphabet as when the locale is set to en_GB.UTF-8 (as it likely is on all/most users machines).

sort(c(head(letters), head(LETTERS)))
## [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"

So I'm pretty sure that most graphs with axes sorted by a string variable will now be the same whether produced in the container or on a local machine (that would be a graph made locally using summary data, obvs).

Right, that makes sense, thanks for digging into it.

I think we're ok with from a reproducability PoV if re-running old action with this image if the sort order ends up different.

@bloodearnest
Copy link
Member Author

This is super impressive Simon.

The multi-stage build and the caching is superb.

My comments are:

  • At the beginning of a build there's a warning

    just build
    WARN[0000] The "PACKAGE" variable is not set. Defaulting to a blank string. 
    

    however, on Git Bash on Windows 11 and on bash and zsh my WSL v2 Ubuntu this seems to cause the build to fail and exit. On zsh on macOS the build succeeded. I can't explain why the difference occurs.

Gah, seems docker issue the warning even if it's not going to build that stage in the Dockerfile. We can give it a default arg, that should stop that warning. I'll check it out.

Yes, that would probably be better. It's hard to pin the version of the package manager that understands how to install specific version on a platform that doesn't natively support that :)

You could implement their approach of using the remotes package or install from a Posit Package Manager snapshot

Yeah, I shied away from remotes, as it enables installing from random github repos, which may be ok, but feels like a policy change that needs further discussion before implementing

(or there are other ways to achieve this), e.g. in my legacy-03 branch I did it with

install.packages('renv',  repos = 'https://packagemanager.posit.co/all/__linux__/focal/2023-01-12+BkwhZFgY')

That might be worth doing.

However, given that it's just one package, and the version shouldn't matter to users of this image, I'm inclined to keep it unpinned atm. If a new version of renv comes out that breaks things, well, the Dockerfile build will break, and we'll need to fix it. We can always pin it then.

  • I think the add-package section of the Dockerfile may need a little bit of work or at least better explanation.

Yeah, it is not pretty. We could just do it via a docker run invocation. But that would mean building the package(s) twice, as the cache built in the image via docker run would not be preserved.

This is because as far as I understand it the Michael Rutter PPA does not contain all CRAN packages. I remember reading it had about 6,000 packages (not sure how to check this), whereas all of CRAN is approx 20,000 packages. So I think that I can make the just add-package error by passing it the name of a package that's on CRAN but not in the PPA

So, we're not installing any CRAN packages from the PPA, AFAIK. Those are packaged in apt with the names like r-cran-$NAME. My understanding was that renv::install("foo@version") would fetch the sources from upstream CRAN, and build locally, rather than using prebuilt binaries from the PPA.

(I think - or maybe the below error occurs because CRAN package names can contain capital letters), here's an example using one of my CRAN packages, I tried different syntaxes

just add-package OneSampleMR
just add-package onesamplemr
just add-package r-cran-onesamplemr

which all failed.
The first syntax fails with

./add-package.sh OneSampleMR
Attempting to build and install OneSampleMR
[+] Building 0.0s (0/0)                                                                                                             
invalid tag "r-OneSampleMR": repository name must be lowercase
Builing OneSampleMR failed.
You may need to add build dependencies (e.g. -dev packages) to build-dependencies.txt
error: Recipe `add-package` failed on line 22 with exit code 1

Bah. Stupid docker. If we lowercase the build image tag, it works, fix incoming.

and the second and third syntaxes fail with the last three lines of the above message.

I expect if failed because it couldn't find the 2nd two names on CRAN. But it would be nice to make that issue error early with a helpful error message.

FTR, just add-package doesn't support installing prebuilt r-cran apt pacakges at the moment. The thinking here was a) the original built from source, so we're doing the same in this rebuild and b) version pinning apt packages to specific version not fun, as you well know!

I'd expect in a future version r:4.2 image, we could try installing as much as possible from prebuilt r-cran packages in the ppa. There are various ways to pin/hold apt packages to specific versions. But they're not really designed for this use case (e.g. old versions of the deb will be removed, making rebuilding hard). We may end up always building from source anyway (which we'd have to support for those packages that are not in the ppa anyway). Rather than have 2 ways to install a package, it'd be simpler to have one. And the local build cache means it's at least tenable. If you had to rebuild everything every build, it clearly would not be!

Additionally adding another of my packages, mrbayes appeared to work but the tests errored with

just add-package mrbayes
./test.sh "r"
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Error: unexpected symbol in "-e library"
Execution halted
error: Recipe `test` failed on line 27 with exit code 1
error: Recipe `add-package` failed on line 22 with exit code 1

I think this is simply fixed by the syntax for the " before library on line 4 of test.sh, i.e., maybe you need to escape the quotes or use a different quote syntax given there are nested quotes on that line.

Yep, that sounds right, that script is bit hacky, probably worth doing it properly. Just building mrbayes now to test.

@bloodearnest
Copy link
Member Author

Yep, that sounds right, that script is bit hacky, probably worth doing it properly. Just building mrbayes now to test.

mrbayes worked first time for me.

I suspect this is an issue with not specifying bash properly, perhaps. I've added a commit to attempt to fix that.

however, on Git Bash on Windows 11 and on bash and zsh my WSL v2 Ubuntu this seems to cause > the build to fail and exit. On zsh on macOS the build succeeded. I can't explain why the
difference occurs.

Revisiting this, I'm not surprised it fails on git-bash using Docker for Windows. I'm not sure how well the build mount stuff works there. These images are designed to be built and published in a linux VM in Github Actions, and we've never tested building them with Docker for Windows.

I would expect it to work in WSL2, however. When I get chance, I'll reboot and see if there's any easy fix.

@remlapmot
Copy link
Contributor

thanks Simon

(sorry I can't read Docker error messages properly)

I agree renv is unlikely to break in a future version.

I reran from the new commits and build still fails on both Git Bash and WSL 2 with a similar message about IMAGE_TAG not being set (I guess Docker on Windows and from WSL 2 is using the same WSL 2 backend, whether there's a Docker setting I can change I'm not sure) - but anyway maybe I should actually use a proper Linux VM to test this in.

$ just build
WARN[0000] The "IMAGE_TAG" variable is not set. Defaulting to a blank string.
docker endpoint for "default" not found
error: Recipe `build` failed with exit code 1

And on my Mac the tests after adding a package still fail because of the quote.

just add-package OneSampleMR
bash ./test.sh "r"
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Error: unexpected symbol in "-e library"
Execution halted
error: Recipe `test` failed on line 28 with exit code 1
error: Recipe `add-package` failed on line 23 with exit code 1

Also you might want to commit into the repo the file renv/.gitignore with contents

library/
local/
cellar/
lock/
python/
sandbox/
staging/

because that seems to be created whilst adding a package.
Screenshot 2023-01-31 at 06 50 24

@remlapmot
Copy link
Contributor

I fixed test.sh on my Mac - I had the wrong reason for the failure - it wasn't the quotes - it was the -e and \n syntax that wasn't required. The -e and \n were being written into the .tests.R file which meant that contained invalid R code. Also putting quotes around the package name in a library() call is optional so I removed those, and semi-colon wasn't needed as the xargs seems to put each invocation of library() on a new line. So the version which runs for me is

#!/usr/bin/env bash
set -eu
IMAGE=${1:-ghcr.io/opensafely-core/r}
python3 -c 'import json; print("\n".join(json.load(open("renv.lock"))["Packages"]))' | xargs -I {} echo "library({}, warn.conflicts = FALSE)" > .tests.R
docker run --rm -v "$PWD:/workspace" "$IMAGE" .tests.R

@remlapmot
Copy link
Contributor

I had good progress on my Windows machine - the build succeeded under WSL 2.

It turned out that as per this Stack Overflow post, something was wrong in my C:\Users\%username%\.docker directory. Deleting that and restarting Docker was the fix.

And then just add-package ... ran successfully, both with test.sh unmodified and also modified as above (so the failure of the unmodified version on Mac is still a bit of a mystery to me).

@remlapmot
Copy link
Contributor

Two more small comments.

The only other output that was different on my machines was from line 34 of add-package.sh. I'd say it's safer to run sort with its -f flag. On both my WSL and Mac I obtained packages.csv sorted in a different order. Without the -f the packages beginning with uppercase letters were sorted first in the list followed by those starting with lowercase letters. With -f specified I obtained the same order as you. (I checked my WSL locale shows the same as the container locale so I can't explain this difference either).

Also I ran a few more add-package examples and find that

just add-package healthyR.ts

fails because it seems dependency R package rlang requires updating. The output was

... lots of output ...
#0 35.96 Installing clock [0.6.1] ...
#0 150.4        FAILED
#0 150.4 Error installing package 'clock':
#0 150.4 =================================
... lots of output ...
#0 150.4 installing to /renv/renv/staging/1/00LOCK-clock/00new/clock/libs
#0 150.4 ** R
#0 150.4 ** data
#0 150.4 *** moving datasets to lazyload DB
#0 150.4 ** inst
#0 150.4 ** byte-compile and prepare package for lazy loading
#0 150.4 Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
#0 150.4   namespace ‘rlang’ 0.4.10 is being loaded, but >= 1.0.4 is required
#0 150.4 Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
#0 150.4 Execution halted
#0 150.4 ERROR: lazy loading failed for package ‘clock’
#0 150.4 * removing ‘/renv/renv/staging/1/clock’
#0 150.4 Error: install of package 'clock' failed [error code 1]
#0 150.4 Traceback (most recent calls last):
#0 150.4 11: renv::install("healthyR.ts")
#0 150.4 10: renv_install_impl(records)
#0 150.4  9: renv_install_staged(records)
#0 150.4  8: renv_install_default(records)
#0 150.4  7: handler(package, renv_install_package(record))
#0 150.4  6: renv_install_package(record)
#0 150.4  5: withCallingHandlers(renv_install_package_impl(record), error = function(e) {
#0 150.4         vwritef("\tFAILED")
#0 150.4         writef(e$output)
#0 150.4     })
#0 150.4  4: renv_install_package_impl(record)
#0 150.4  3: r_cmd_install(package, path)
#0 150.4  2: r_exec_error(package, output, "install", status)
#0 150.4  1: stop(error)
#0 150.4 Execution halted
------
failed to solve: executor failed running [/bin/sh -c bash -c "R -e 'renv::activate(); renv::install(\"$PACKAGE\"); renv::snapshot(type=\"all\")'"]: exit code: 1
Builing healthyR.ts failed.
You may need to add build dependencies (e.g. -dev packages) to build-dependencies.txt

@bloodearnest
Copy link
Member Author

I fixed test.sh on my Mac - I had the wrong reason for the failure - it wasn't the quotes - it was the -e and \n syntax that wasn't required. The -e and \n were being written into the .tests.R file which meant that contained invalid R code. Also putting quotes around the package name in a library() call is optional so I removed those, and semi-colon wasn't needed as the xargs seems to put each invocation of library() on a new line. So the version which runs for me is

#!/usr/bin/env bash
set -eu
IMAGE=${1:-ghcr.io/opensafely-core/r}
python3 -c 'import json; print("\n".join(json.load(open("renv.lock"))["Packages"]))' | xargs -I {} echo "library({}, warn.conflicts = FALSE)" > .tests.R
docker run --rm -v "$PWD:/workspace" "$IMAGE" .tests.R

Nice, have pushed this change, and silenced the IMAGE_TAG warning.

If you have any ideas for some more tests above just importing libraries, I'm all ears :)

@bloodearnest
Copy link
Member Author

Two more small comments.

The only other output that was different on my machines was from line 34 of add-package.sh. I'd say it's safer to run sort with its -f flag. On both my WSL and Mac I obtained packages.csv sorted in a different order. Without the -f the packages beginning with uppercase letters were sorted first in the list followed by those starting with lowercase letters. With -f specified I obtained the same order as you. (I checked my WSL locale shows the same as the container locale so I can't explain this difference either).

Huh. That seems like a good idea. The ordering is just really to make the diffs more understandable when reviewing changes, e.g. we can spot an unintended version bump more easily. Will add.

Also I ran a few more add-package examples and find that

just add-package healthyR.ts

fails because it seems dependency R package rlang requires updating. The output was

Yeah, not much we can do about that, as our general policy is no version bumps, to avoid breaking backwards compat.

@remlapmot
Copy link
Contributor

Thanks for these Simon.

I just noticed that I think you need a little fix when using the PACKAGE@VERSION syntax, i.e., my guess from the error below is that Docker image tag names can't include an @ symbol. The error was from (I was starting to work backwards through previous versions of the healthyR.ts package)

$ just add-package [email protected]
bash ./add-package.sh [email protected]
Attempting to build and install [email protected]
[+] Building 0.0s (0/0)
invalid tag "[email protected]": invalid reference format
Builing [email protected] failed.
You may need to add build dependencies (e.g. -dev packages) to build-dependencies.txt
error: Recipe `add-package` failed on line 23 with exit code 1

You could just crop the image tag name from before the @ maybe.

(With that fixed) I would perhaps add a comment to the last section of the README that if a package fails to install at its current version on CRAN then it is recommended to look up the prior version numbers of the package from its CRAN old sources page, which has a URL of the form https://cran.r-project.org/src/contrib/Archive/{PACKAGE}/, e.g. https://cran.r-project.org/src/contrib/Archive/healthyR.ts/, and attempt to install each prior version in turn using the just add-package PACKAGE@VERSION syntax.

@remlapmot
Copy link
Contributor

I fixed test.sh on my Mac - I had the wrong reason for the failure - it wasn't the quotes - it was the -e and \n syntax that wasn't required. The -e and \n were being written into the .tests.R file which meant that contained invalid R code. Also putting quotes around the package name in a library() call is optional so I removed those, and semi-colon wasn't needed as the xargs seems to put each invocation of library() on a new line. So the version which runs for me is

#!/usr/bin/env bash
set -eu
IMAGE=${1:-ghcr.io/opensafely-core/r}
python3 -c 'import json; print("\n".join(json.load(open("renv.lock"))["Packages"]))' | xargs -I {} echo "library({}, warn.conflicts = FALSE)" > .tests.R
docker run --rm -v "$PWD:/workspace" "$IMAGE" .tests.R

Nice, have pushed this change, and silenced the IMAGE_TAG warning.

If you have any ideas for some more tests above just importing libraries, I'm all ears :)

Sorry I don't. This is a really neat test. Although you can make a case for saying it's not really necessary because during the process of installing a package, R checks whether that package can be successfully loaded. However, given this test is really quick it does no harm so I'd leave it in.

@remlapmot
Copy link
Contributor

Huh. That seems like a good idea. The ordering is just really to make the diffs more understandable when reviewing changes, e.g. we can spot an unintended version bump more easily. Will add.

thanks - yes I realised that. I did exactly the same at the end of my 3 legacy branches, I just did my sorting in R.

@bloodearnest
Copy link
Member Author

Thanks for these Simon.

I just noticed that I think you need a little fix when using the PACKAGE@VERSION syntax, i.e., my guess from the error below is that Docker image tag names can't include an @ symbol. The error was from (I was starting to work backwards through previous versions of the healthyR.ts package)

$ just add-package [email protected]
bash ./add-package.sh [email protected]
Attempting to build and install [email protected]
[+] Building 0.0s (0/0)
invalid tag "[email protected]": invalid reference format
Builing [email protected] failed.
You may need to add build dependencies (e.g. -dev packages) to build-dependencies.txt
error: Recipe `add-package` failed on line 23 with exit code 1

You could just crop the image tag name from before the @ maybe.

Done, thanks

(With that fixed) I would perhaps add a comment to the last section of the README that if a package fails to install at its current version on CRAN then it is recommended to look up the prior version numbers of the package from its CRAN old sources page, which has a URL of the form https://cran.r-project.org/src/contrib/Archive/{PACKAGE}/, e.g. https://cran.r-project.org/src/contrib/Archive/healthyR.ts/, and attempt to install each prior version in turn using the just add-package PACKAGE@VERSION syntax.

Good idea, added a section to the README on this.

@remlapmot
Copy link
Contributor

thanks for those commits Simon.

  • I think you probably want to commit the renv/.gitignore file I mentioned above into the repo, or at least you'll likely need to do so on the first real run of just add-package ....
  • I forgot to say, you could add a pointer to the new Trouble shooting section of the README to the error message when just add-package fails, i.e., to add-package.sh you could add a new line 15 such as
    echo "Alternatively, you may need to install an older version of $PACKAGE. Please see the Trouble shooting section of the README."
    
  • On line 48 of the README
    • suggest fixing typo "Find and previous ..." to "Find a previous ..."
    • perhaps render the URL as code as I did in my message above because it's not valid until {PACKAGE} has been substituted with a package name. As is, GitHub will render this as a hyperlink which may confuse people if they click it because that goes to a 404 without the substitution.

(After these I think that's probably me out of comments.)

Also, I had a look at adding butcher. I think you need to go back to version 0.1.5 of that such that its dependencies that are already in the container meet its requirements (this is going back quite a few versions but the version of rlang in this container is quite old).

@StevenMaude
Copy link
Contributor

It's great that this is getting improved! 👍

With
a cold cache, it is very slow to build all the initial packages (i.e.
1-2 hours). But the build process keeps a local cache of them on the
host, so rebuilds are fast.

These issues might be worth tracking for possible improvements to this in future:

rstudio/renv#459
rstudio/renv#907
r-lib/pak#343

add-package.sh Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
@remlapmot remlapmot mentioned this pull request Feb 9, 2023
@remlapmot
Copy link
Contributor

I just noticed there is an unlucky accidental error in e4f1961 . The problem is that the row with the column names, "Package" and "Version", do not remain at the top of the file, but they are sorted with the rest of the values. Hence they were accidentally moved to line 237 of the file. And after yesterday's PR they are now on line 20. They're better as line 1 of the file or deleted.

@remlapmot
Copy link
Contributor

Also I checked yesterday's package, forestploter, and

just add-package forestploter

works beautifully in the current build of this image.

@remlapmot
Copy link
Contributor

It's great that this is getting improved! 👍

With
a cold cache, it is very slow to build all the initial packages (i.e.
1-2 hours). But the build process keeps a local cache of them on the
host, so rebuilds are fast.

These issues might be worth tracking for possible improvements to this in future:

rstudio/renv#459 rstudio/renv#907 r-lib/pak#343

You're absolutely right, once renv::restore() can use pak, using parallel downloads will bring an incredible speedup to downloading the package files (it seems renv::install() can already use pak but that's no help to us).

@bloodearnest
Copy link
Member Author

I just noticed there is an unlucky accidental error in e4f1961 . The problem is that the row with the column names, "Package" and "Version", do not remain at the top of the file, but they are sorted with the rest of the values. Hence they were accidentally moved to line 237 of the file. And after yesterday's PR they are now on line 20. They're better as line 1 of the file or deleted.

Bah, yes. Ok, so, is there a way for the R that generates the csv data to sort, rather than sorting in bash?

@remlapmot
Copy link
Contributor

You could either use stringr (stringr::sort()/stringr::order()/stringr::rank()) or data.table (data.table::setorder()).

In my branches I used stringr but I think you might want to go with data.table. This is because stringr sorts using an en locale (producing capital letters interspersed within the lowercase), whereas data.table sorts using a C locale which is more similar to what you did originally (capital letters sorted first then lowercase sorted).

Your alternative syntaxes are approximately as follows.

For stringr

library(dplyr)
readr::read_csv("packages.csv",
 col_types = "cc",
 col_names = c("Package", "Version")
) %>%
  filter(Package != "Package") %>%
  arrange(stringr::str_rank(Package)) %>%
  readr::write_csv("packages-alphabetical.csv", quote = "all")

For data.table

library(data.table)
dat <- read.csv("packages.csv", header = FALSE)
colnames(dat) <- c("Package", "Version")
# Remove accidental inclusion of c("Package", "Version") as a row
dat <- subset(dat, Package != "Package")
dat <- data.table(dat)
# data.table setorder() sorts in C-locale
setorder(dat, Package)
write.csv(dat, file = "packages-alphabetical.csv", row.names = FALSE)

@bloodearnest
Copy link
Member Author

bloodearnest commented Feb 13, 2023

just noticed there is an unlucky accidental error in e4f1961 . The problem is that the row with the column names, "Package" and "Version", do not remain at the top of the file, but they are sorted with the rest of the values. Hence they were accidentally moved to line 237 of the file. And after yesterday's PR they are now on line 20. They're better as line 1 of the file or deleted.

Fixed in #126, will update this PR to use the default sorting once that's landed.

Went with the simplest approach, of just using the same locale for generating the csv.

bloodearnest and others added 9 commits February 14, 2023 14:25
This change fixes the longstanding issue of not being able to rebuild
our R image, which was causing many issues.

It solves the problem of reproducing the R library versions as in the
current R docker image by using renv to manage a lockfile of versions.

This is bit nuanced - renv is not really designed for this use case
- but it does work. Check the Dockerfile comments for details. With
a cold cache, it is very slow to build all the initial packages (i.e.
1-2 hours). But the build process keeps a local cache of them on the
host, so rebuilds are fast.

Whilst there are no R library version changes, there are some system
library changes, as we moved from 18.04 to 20.04 as a base image. This
was necessary because a) 18.04 is EOL shortly and b) we don't maintain
a base 18.04 image.  The file `20.04-library-changes.txt` contains
a summary of differences, basically a bunch of uprades to system
libraries. Also, we upgraded to the latest point release of R 4.0
series, 4.0.2 -> 4.0.5.

Additionally, many unused components have been removed from the image:
 - all build toolchain/-dev packages
 - java/python integration
 - various x11/gtk dependencies

These made the original image heavier, and to the best of our knowledge,
have never been used by any projects. If this turns out to be
mistaken, we can add things back easily enough. The old image was 4.5G
from dockers
- the new one is 1.9G.

Testing:

An R image build with this process was used to test ~30 OpenSAFELY
project.yamls with R code in, and it all worked, so we have some
confidence we've not broken anything.

We'll keep the old image unter r:legacy tag or something, so we can
easily access it if needed.

Note: this does not add a publishing workflow to the repo, and thus does
not change the current image by landing.

Whether we can build it in GH remains to be seend. With no cache, I'd
expect it to take multiple hours.
 - make sure buildkit is used for everything

 - supply a nonsense default arg for PACKAGE to silenc docker warning
Also silence the IMAGE_TAG warning from docker-compose
Co-authored-by: Steven Maude <[email protected]>
Whenever we run R in the container, it activates renv, which AFAICT will
*always* create an `renv` directory in the current.

The test.sh script was mounting this directory in, thus creating an renv
dir in it, so we stop doing that.

Additionally, add renv/ dir to the gitignore.

In the processs, realised this dir creation created a sandbox dir, which
could be instead done once in /renv, so made that change too.

Ideally, we'd tell renv to create the dir somewhere else, but there
doesn't seem to be away to do that, it's hardcoded to be `./renv`.

Also, apply suggested docs fix from Tom.
@bloodearnest
Copy link
Member Author

Right, I think this is ready to go, all outstanding issue dealt with.

Will merge and publish later today if no one shouts!

@StevenMaude
Copy link
Contributor

@bloodearnest: is this something to explain about in Platform News, maybe? Either that or a blog post?

Also, should we be advising users how to clean up the old image, either there or elsewhere, once they have asserted that they definitely don't need the legacy image? If this new image doesn't share any layers, the old one will be occupying 4.5 GiB of their disk. Docker doesn't typically garbage collect images and containers.

@StevenMaude
Copy link
Contributor

(I don't think this ⬆️ blocks merging the PR.)

@bloodearnest
Copy link
Member Author

@bloodearnest: is this something to explain about in Platform News, maybe? Either that or a blog post?

Yep, there's a couple of pieces of news blocked on publishing this

Also, should we be advising users how to clean up the old image, either there or elsewhere, once they have asserted that they definitely don't need the legacy image? If this new image doesn't share any layers, the old one will be occupying 4.5 GiB of their disk. Docker doesn't typically garbage collect images and containers.

It doesn't share any layers, but fortunately, opensafely pull will do this for them, as it runs docker image prune after pulling.

And the plan is that we decide to switch r:latest to be this new build - the users should notice nothing except for a big download. This is not a semantic change - all the library versions should be identical, and if it goes well, invisible to users.

Of course, many things could go wrong. If we need to roll back, I've already published the current image as r:legacy, so we can do that easily enough. But you couldn't explicitly run r:legacy in production atm, as it's not tagged like that on the server, and not planning to do so.

@remlapmot
Copy link
Contributor

Right, I think this is ready to go, all outstanding issue dealt with.

Will merge and publish later today if no one shouts!

Great (I am travelling today, so I haven't tested the new commits) apart from putting backticks around the URL at the end of the README (to stop people accidentally clicking it and thinking it's invalid) this looks great to me.

It has been great to learn about caching - thanks Simon

@StevenMaude
Copy link
Contributor

it runs docker image prune after pulling.

TIL; great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants