What is the current status of locale support (particularly en_US.UTF-8 and C)?
#23010
-
To address your background question: Bionic libc is the C library (on device) that the packages in this repo link with. The runtime is whatever version is shipped on the user's device, but the system headers used for building the Termux environment (or building code within it) are a modified version of the Android NDK. Bionic libc does provide reasonably compliant While the default locale is the Termux somewhat papered over this by:
This issue was later somewhat fixed for Android 8.0 (API 26). Now, Anyway, I have no insight into why Termux took the path above for dealing with this. What I do currently for myself is:
/* Implement the algorithm from
   https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap08.html */
#include <stdlib.h> /* getenv */
static inline const char* __locale_from_env(int cat) {
    /* The order must match the numeric values of the LC_* category macros. */
    static const char * const cat_names[] = {
        "LC_CTYPE", "LC_NUMERIC", "LC_TIME", "LC_COLLATE", "LC_MONETARY", "LC_MESSAGES",
        "LC_ALL", "LC_PAPER", "LC_NAME", "LC_ADDRESS", "LC_TELEPHONE", "LC_MEASUREMENT",
        "LC_IDENTIFICATION" };
    const char *name;
    /* Bounds-check against the element count, not the byte size of the array. */
    if (cat < 0 || (size_t)cat >= sizeof cat_names / sizeof cat_names[0])
        return "";
    /* Precedence: LC_ALL, then the category's own variable, then LANG. */
    if ((name = getenv("LC_ALL")) && *name)
        return name;
    if (cat != LC_ALL && (name = getenv(cat_names[cat])) && *name)
        return name;
    if ((name = getenv("LANG")) && *name)
        return name;
    return ""; /* Or "C.UTF-8" for Android <= 7 support */
}
/* Only an empty locale name ("") triggers the environment lookup. */
static char* _Nullable __termux_repl_setlocale(int __category, const char* _Nullable __locale_name) {
    if (__locale_name && *__locale_name == '\0')
        __locale_name = __locale_from_env(__category);
    return setlocale(__category, __locale_name);
}
#define setlocale __termux_repl_setlocale
I'm glad you started this topic because I'd be very curious to know if there's any obvious issue with my approach, and I might make a PR to integrate something like it and drop the various workarounds in the packages.
-
In addition to the above issue, the behavior of gawk is also broken in Termux with
$ echo | LC_ALL=C gawk $'/(\xE3\x81\x82)/' # note: this is equivalent to '/(あ)/'
gawk: cmd. line:1: fatal: unbalanced (
The gawk version in Termux is 5.3.1. This doesn't happen with gawk on other systems (I checked the behavior of 5.3.0 and 5.3.1 on Fedora 41, and 5.3.2 on Arch).
-
(Continuing discussion from #25149 (comment), cc: @TomJo2000) One thing I neglected to mention is that for some programs (including So there are three total header changes needed:
213c213,214
< #define MB_CUR_MAX 4
---
> size_t __ctype_get_mb_cur_max(void);
> #define MB_CUR_MAX __ctype_get_mb_cur_max()
146c146
< if (item == CODESET) return "UTF-8";
---
> if (item == CODESET) return (MB_CUR_MAX == 1) ? "ASCII" : "UTF-8";
After making these changes, a fresh build of I'd be happy to work on a PR for this in the near future.
-
I probably don't understand much of this, but it seems nice. Good job.
-
Although the default locale in Termux seems to be `en_US.UTF-8`, the locale support of Termux appears to be incomplete. Here, I'd like to know the latest information about how it is incomplete (what's available and what's not) and how we could better work around issues related to incomplete locale support. There doesn't seem to be any official documentation about the locale support.
Original issue
Suppose one wants to manipulate bytes in binary data (which does not contain NUL) stored in a shell variable within a Bash script. Usually, one can achieve this by setting `LC_CTYPE=C` and then counting the number of bytes with `${#data}` or accessing a byte with `${data:index:1}`. However, this doesn't seem to work in Termux. For example, you can see the issue with the following example:
Although we expect `3` as the result, Bash returns `1` in Termux. In all the other environments, `3` is obtained as expected.
Past discussions
There is an old discussion from 2020:
The issue asked whether the locale `C` is available. The answer was that Termux doesn't support locales. However, it is unclear what happens when no locales are supported. If locales were literally not supported at all, many of the basic C APIs would be unavailable (e.g., `printf`, `isalpha`, `tolower`, `strftime`, etc. all depend on the current locale). Thus, it is reasonable to think that some locale is assumed for the operations that rely on one. What is that?
A StackOverflow question from 2021
states that
which implies that `en_US.UTF-8` would have been introduced between 2020 and 2021.
There is also a comment in a discussion from 2022:
The comment says
which contradicts the first piece of information from 2020. Does this mean that Termux/Bionic introduced some support for the locales `en_US.UTF-8` and `C` between 2020 and 2022?
However, as of 2025, the locale `C` is incomplete, as illustrated in the first example. Even for `en_US.UTF-8`, another issue from 2023 reports that `en_US.UTF-8` is unsupported (or not complete enough to pass the tests):
Those four statements in past discussions don't seem to be consistent with each other, so I think some of them (or all) are untrustworthy. If all of them are somewhat correct, I guess it would mean Termux supports neither `en_US.UTF-8` nor `C`, but an unspecified amalgam of `en_US.UTF-8` and `C`. Or it might be switching back and forth between `en_US.UTF-8` and `C` every single year.
Bionic libc
The third statement mentioned Bionic libc, so can I assume that Termux packages adopt Bionic as the C standard library? I also tried to look up information in Bionic. However, Bionic doesn't seem to have a place to report an issue or ask questions. Instead, I found the following comment in `/libc/bionic/locale.cpp` of the Bionic codebase:
This seems to imply that Bionic supports both `C` and `en_US.UTF-8` (a synonym of `C.UTF-8`) separately. This comment has existed at least since 2016, which is inconsistent with the observation above.
I also found a mention of locale in the documentation (boldfaced by me):
This part of the documentation seems to have been introduced by commit aosp-mirror/platform_bionic@046fe15, whose commit message says
So it seems to imply that Bionic actually only supports `en_US.UTF-8` (a synonym of `C.UTF-8`). If this is true, it seems to me that the first piece of information, the code comment in Bionic's `locale.cpp`, would be wrong. Or the support for the `C` locale might have been dropped at some point between 2015 and 2022.
I'm not sure which information I should believe. In either case, the behavior is not consistent with the past reports for Termux. One possibility would be that the upstream Bionic and the Bionic used by Termux are actually different versions. Another possibility would be that Termux only uses Bionic partially, and the locale part might have extensions/modifications.
Timeline
To summarize the timeline, we could make the following table for the locale support:
`C` and `en_US.UTF-8`; `en_US.UTF-8`; `C` or `en_US.UTF-8`; `en_US.UTF-8`; (broken) `en_US.UTF-8`; (broken) `C`
Every piece of information is inconsistent, so I'm confused about which information is really trustworthy, and what the relationship is between the C library used in Termux packages and the upstream Bionic.
Questions
`C` locale (which is separate from `C.UTF-8`/`en_US.UTF-8`)? If not, would it be supported in the future?