Description
(Sorry for the long report below, and that it took me so long to spot this.)
I'm sure it is by accident, but the locale is set to nothing in the r-docker container (is this referred to as unset? Sorry, I am not really familiar with locale terminology).
Unfortunately, the sorting of character strings in, I think, the base R sort() and order() functions, and hence dplyr::arrange(), is affected by locale.
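The underlying effect is not specific to R and can be reproduced with coreutils sort, which also follows the collation locale. The following is only a minimal sketch, assuming a POSIX shell with a standard sort available; the function name sort_posix is made up for illustration. Under the C/POSIX locale, strings are compared by byte value, so all uppercase ASCII letters come before all lowercase ones.

```shell
# Hypothetical helper: sort stdin using the C/POSIX collation order,
# i.e. plain byte-value comparison ("A" < "B" < "a" < "b").
sort_posix() {
  LC_ALL=C sort
}

# Mixed-case input comes back with uppercase letters first.
printf '%s\n' a A b B | sort_posix
# prints: A, B, a, b (one per line)
```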
Examples
- check locale is unset
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
-e "Sys.getenv('LANG')"
## [1] ""
- check with locale, and see which locales are generated/set up with locale -a (did this interactively)
% docker run --platform linux/amd64 --entrypoint /bin/bash -it ghcr.io/opensafely-core/r:latest
root@d446d1ea5760:/workspace# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
root@d446d1ea5760:/workspace# locale -a
C
C.UTF-8
POSIX
- Show different sorting of character strings (compare with the output from the next bullet)
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
-e "sort(c(head(letters), head(LETTERS)))"
## [1] "A" "B" "C" "D" "E" "F" "a" "b" "c" "d" "e" "f"
- Whereas on, say, a machine with locale en_US.UTF-8 we get
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
Rscript -e "Sys.getenv('LANG'); \
sort(c(head(letters), head(LETTERS)))"
## [1] "en_US.UTF-8"
## [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
Rscript -e "Sys.setenv(LANG='en_GB.UTF-8'); \
Sys.getenv('LANG'); \
sort(c(head(letters), head(LETTERS)))"
## [1] "en_GB.UTF-8"
## [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"
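Before relying on a locale such as en_GB.UTF-8 it may help to confirm that it appears in the locale -a listing for the image. The following is a rough sketch, assuming a shell where the locale command is installed; has_locale is a hypothetical helper name. It normalises case and hyphens because locale -a commonly prints names like en_GB.utf8 rather than en_GB.UTF-8.

```shell
# Hypothetical helper: succeed if the named locale is listed by `locale -a`.
# Comparison ignores case and hyphens, so "en_GB.UTF-8" matches "en_GB.utf8".
has_locale() {
  want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-')
  locale -a 2>/dev/null \
    | tr '[:upper:]' '[:lower:]' \
    | tr -d '-' \
    | grep -qxF "$want"
}

# Example check; the message depends on what the machine has generated.
if has_locale en_GB.UTF-8; then
  echo "en_GB.UTF-8 is available"
else
  echo "en_GB.UTF-8 is not generated"
fi
```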
Fixes
- A suboptimal quick fix for users is to use stringr::str_sort() instead of the other functions mentioned
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
-e "stringr::str_sort(c(head(letters), head(LETTERS)))"
## [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"
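Once the locales have been generated, the other fixes are:
- Users can set a locale at the top of each R script/session with Sys.setlocale() (although this is probably not best practice); changing LANG with Sys.setenv() after R has started does not alter the collation order already in effect
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
-e "invisible(Sys.setlocale('LC_COLLATE', 'en_GB.UTF-8')); sort(c(head(letters), head(LETTERS)))"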
- Set LANG="en_GB.UTF-8" in a /workspace/.Renviron file (might also be worth setting LC_CTYPE to the same value as well)
- Set LANG="en_GB.UTF-8" in the global Renviron.site file in /usr/lib/R/etc (might also be worth setting LC_CTYPE to the same value as well)
- (And when the Dockerfile is running again, it could of course just be set there with)
ENV LANG="en_GB.UTF-8"
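As an illustration of the .Renviron options above, the file could be generated along these lines. This is only a sketch under assumptions: write_renviron is a hypothetical helper, it overwrites any existing .Renviron in the target directory, and the chosen locale still has to be generated in the image for R to be able to use it.

```shell
# Hypothetical helper: write LANG and LC_CTYPE into <dir>/.Renviron.
# Usage: write_renviron <dir> [locale]   (locale defaults to en_GB.UTF-8)
# Note: this replaces any existing .Renviron in <dir>.
write_renviron() {
  dir=$1
  loc=${2:-en_GB.UTF-8}
  printf 'LANG="%s"\nLC_CTYPE="%s"\n' "$loc" "$loc" > "$dir/.Renviron"
}

# Example (not run here): write_renviron /workspace
```

R reads .Renviron at startup, so the setting only applies to sessions started after the file is written.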
In R, see the locales and sort help files for more info:
?locales
?sort