locale is set to nothing leading to different than expected sorting of vectors of character strings in R #99

Closed
@remlapmot

Description

(sorry for the long report below and that it took me so long to spot this)

I'm sure it is by accident, but the locale is set to nothing (is this referred to as unset? Sorry, I am not really familiar with locale terminology) in the r-docker container.

Unfortunately, the sorting of character strings in (I think) the base R
sort() and order() functions, and hence dplyr::arrange(), is
affected by the locale.
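
The same effect can be seen outside R, since the coreutils sort command also collates according to the locale. A minimal sketch, assuming a POSIX shell with sort and paste available; the variable name sorted_c is only for illustration:

```shell
# Sort mixed-case letters under the C/POSIX locale.
# LC_ALL overrides LANG and every LC_* variable for this one command.
sorted_c=$(printf '%s\n' a A b B c C | LC_ALL=C sort | paste -sd ' ' -)

# In the C locale strings are compared by byte value, so all upper-case
# ASCII letters come before all lower-case ones.
echo "$sorted_c"   # prints: A B C a b c
```

This matches the "A" "B" ... "a" "b" ordering reported for the container in the examples.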

Examples

  • check locale is unset
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "Sys.getenv('LANG')"
## [1] ""
  • check the current settings with locale, and which locales are
    generated/set up with locale -a (I did this interactively)
% docker run --platform linux/amd64 --entrypoint /bin/bash -it ghcr.io/opensafely-core/r:latest

root@d446d1ea5760:/workspace# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

root@d446d1ea5760:/workspace# locale -a
C
C.UTF-8
POSIX
  • Show different sorting of character string (compare with output from
    next bullet)
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "sort(c(head(letters), head(LETTERS)))"
##  [1] "A" "B" "C" "D" "E" "F" "a" "b" "c" "d" "e" "f"
  • Whereas on, say, a machine with locale en_US.UTF-8 we get
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
    Rscript -e "Sys.getenv('LANG'); \
      sort(c(head(letters), head(LETTERS)))"
## [1] "en_US.UTF-8"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"
docker run --platform linux/amd64 rocker/r-ver:4.0.2 \
    Rscript -e "Sys.setenv(LANG='en_GB.UTF-8'); \
      Sys.getenv('LANG'); \
      sort(c(head(letters), head(LETTERS)))"
## [1] "en_GB.UTF-8"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"
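
The locale output above can be read with the usual POSIX precedence in mind: LC_ALL overrides the individual LC_* variables, which in turn override LANG, and with all of them empty the POSIX locale applies. A rough sketch of that lookup for collation (the function name effective_collate is hypothetical, not part of any tool):

```shell
# Hypothetical helper: report which setting decides the collation locale,
# following the precedence LC_ALL > LC_COLLATE > LANG > POSIX default.
effective_collate() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "${LC_COLLATE:-}" ]; then
    printf '%s\n' "$LC_COLLATE"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    # Nothing set, as in the container: fall back to POSIX.
    printf '%s\n' "POSIX"
  fi
}
```

With LANG and LC_ALL both empty, as in the container, this yields POSIX, which is why sort() there falls back to byte-order collation.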

Fixes

  • A suboptimal quick fix for users is to use stringr::str_sort()
    instead of the other functions mentioned
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "stringr::str_sort(c(head(letters), head(LETTERS)))"
##  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F"

Once the locales have been generated the other fixes are

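  • Users can set a locale at the top of each R script/session (although
    this is probably not best practice). Changing LANG with Sys.setenv()
    after R has started does not reset the collation locale, so
    Sys.setlocale() is used here
docker run --platform linux/amd64 ghcr.io/opensafely-core/r:latest \
    -e "Sys.setlocale('LC_COLLATE', 'en_GB.UTF-8'); sort(c(head(letters), head(LETTERS)))"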
  • Set LANG="en_GB.UTF-8" in a /workspace/.Renviron file (might
    also be worth setting LC_CTYPE to same value as well)

  • Set LANG="en_GB.UTF-8" in the global Renviron.site file in
    /usr/lib/R/etc (might also be worth setting LC_CTYPE to same value as well)

  • (And when the Dockerfile is running again, it could of course just
    be set in that with)

ENV LANG="en_GB.UTF-8"
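
All of the fixes above depend on the locale having been generated first. A hedged sketch of checking that before writing the settings to an Renviron-style file; the helper names has_locale and write_renviron are hypothetical, and the normalisation assumes glibc-style locale -a output such as en_GB.utf8:

```shell
# Hypothetical helper: succeed if the named locale appears in `locale -a`.
# Names are compared case-insensitively and without hyphens, so that
# "en_GB.UTF-8" matches a listed "en_GB.utf8".
has_locale() {
  want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
  locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed 's/-//g' \
    | grep -Fqx "$want"
}

# Hypothetical helper: append LANG and LC_CTYPE lines to an Renviron file.
# $1 = locale name, $2 = target file (for example /workspace/.Renviron)
write_renviron() {
  printf 'LANG=%s\nLC_CTYPE=%s\n' "$1" "$1" >> "$2"
}

target_locale="en_GB.UTF-8"
if has_locale "$target_locale"; then
  echo "$target_locale is available"
else
  echo "$target_locale has not been generated yet" >&2
fi
```

Usage would then be something like has_locale en_GB.UTF-8 && write_renviron en_GB.UTF-8 /workspace/.Renviron, so that nothing is written while only C, C.UTF-8 and POSIX are available.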

In R, see the locales and sort help files for more info:

?locales
?sort
