Skip to content

Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding #448

@sammo3182

Description

@sammo3182

stri_detect_regex looks not recognizing Chinese characters correctly when it is treated as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:

Sys.setlocale(, "Chinese")
library(stringi)

stri_detect_fixed("昌平区", "") # Works fine
#> [1] FALSE
stri_detect_regex("昌平区", "") # TRUE
#> [1] TRUE
grepl("", "昌平区") # FALSE
#> [1] FALSE

Another example:

library(dplyr)
library(rvest)
library(stringi)

link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>% 
  html_nodes("p") %>%
    html_text

stri_detect_regex(tx_xi, "同志们")  #Note that these are the very first three characters of the speech

#> [1] FALSE
sessionInfo()
#> R Under development (unstable) (2021-05-17 r80314)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#>  [1] LC_COLLATE=Chinese (Simplified)_China.936 
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> system code page: 65001
#>
#> attached base packages:
#>  [1] stats     graphics  grDevices utils     datasets  methods  
#> [7] base     
#>
#> other attached packages:
#>   [1] stringi_1.7.3
#>
#> loaded via a namespace (and not attached):
#>   [1] compiler_4.2.0 tools_4.2.0    parallel_4.2.0

The issue was submitted to stringr (tidyverse/stringr#386 (comment)), but it looks like a stringi problem?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions