Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

locale is set to nothing leading to different than expected sorting of vectors of character strings in R #99

Closed
remlapmot opened this issue Jun 14, 2022 · 5 comments

Comments

@remlapmot
Copy link
Contributor

(sorry for the long report below and that it took me so long to spot this)

I'm sure by accident there the locale is set to nothing (is this referred to as unset? sorry I am not really familiar with locale grammar) in the r-docker container.

Unfortunately the sorting of character strings in I think the base R
sort() and order() functions and hence dplyr::arrange() are all
affected by locale.

Examples

  • check locale is unset
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "Sys.getenv('LANG')"
## [1] ""
  • check with locale and which are generated/setup (did this
    interactively)
% docker run --platform linux/amd64 --entrypoint /bin/bash -it ghcr.io/opensafely-core/r:latest

root@d446d1ea5760:/workspace# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

root@d446d1ea5760:/workspace# locale -a
C
C.UTF-8
POSIX
  • Show different sorting of character string (compare with output from
    next bullet)
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "sort(c(head(letters), head(LETTERS)))"
##  [1] "A" "B" "C" "D" "E" "F" "a" "b" "c" "d" "e" "f"
  • Whereas on say a machine with locale en_US.UTF-8 we get
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
    Rscript -e "Sys.getenv('LANG'); \
      sort(c(head(letters), head(LETTERS)))"
## [1] "en_US.UTF-8"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
    Rscript -e "Sys.setenv(LANG='en_GB.UTF-8'); \
      Sys.getenv('LANG'); \
      sort(c(head(letters), head(LETTERS)))"
## [1] "en_GB.UTF-8"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"

Fixes

  • A suboptimal quick fix for users is to use stringr::str_sort()
    instead of the other functions mentioned
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "stringr::str_sort(c(head(letters), head(LETTERS)))"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"

Once the locales have been generated the other fixes are

  • Users can set a locale at the top of each R script/session (although
    this probably not best practise)
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "Sys.setenv(LANG='en_GB.UTF-8'); sort(c(head(letters), head(LETTERS)))"
  • Set LANG="en_GB.UTF-8" in a /workspace/.Renviron file (might
    also be worth setting LC_CTYPE to same value as well)

  • Set LANG="en_GB.UTF-8" in the global Renviron.site file in
    /usr/lib/R/etc (might also be worth setting LC_CTYPE to same value as well)

  • (And when the Dockerfile is running again of course could just set
    it in that with)

ENV LANG="en_GB.UTF-8"

In R see the locales and sort helpfiles for more info

?locales
?sort
@inglesp
Copy link
Contributor

inglesp commented Aug 3, 2022

Hi @remlapmot -- to help us prioritise fixing this, can you let me know what impact this has on users and their research?

@inglesp inglesp removed the status in Data Team Aug 3, 2022
@remlapmot
Copy link
Contributor Author

This doesn't have much impact @inglesp.

The only effect of this I have seen in OpenSAFELY repos is that graphs can be produced with categories in different orders compared to when the locale is set. So nothing incorrect is produced, but the plot has the x- or y-axis categories in a different order compared to what the user was expecting. This gives the user a few seconds of confusion, but they either ignore it or regenerate the plot locally to obtain the preferred/expected sort ordering.

It affects the sort ordering of vectors of character strings with certain functions. Unfortunately those are the functions users are most likely to use, i.e., sort() and order() from base R and dplyr::arrange() but it can be overcome, without setting the locale, by using stringr::str_sort()/stringr::str_order() instead - although this requires the user to remember this. You can see the description of the locale argument to those stringr functions here.

In the Rocker containers they set the locale to en_US.UTF-8, e.g. in r-ver here and in r-base e.g. here. You can see their original Rocker issue about this here.

I went into a few other OpenSAFELY and datalab containers interactively and I couldn't really find one where the locale has been set. The locales were either empty or C.UTF-8.

@inglesp
Copy link
Contributor

inglesp commented Aug 3, 2022

Thanks Tom. I expect we'll come to this when we revisit how the R docker image is built.

@remlapmot
Copy link
Contributor Author

That's great.

Since I made the reverse-engineer fork of r-docker I have been rebasing it everytime the container has been updated (only 3 times since I made it) - so it's still up-to-date.

https://github.com/opensafely/reverse-engineer-r-docker/branches

I have made 4 active branches with slightly different approaches. Two of these are more stable, these are the 2 with names ending -fixed-urls because they use the locked package data URLs from RSPM.

https://packagemanager.rstudio.com/client/#/repos/1/overview

rspm-2

I had also been thinking about making another branch just using MRAN dated snapshot repos - but then all the packages would have to be built from source and so the Dockerfile would probably take several hours to build. The existing 4 branches take approx 30 mins to build because they download prebuilt Linux binary packages from RSPM.

@remlapmot
Copy link
Contributor Author

Closing as fixed by #123 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

3 participants