Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding #448

`stri_detect_regex` does not seem to recognize Chinese characters correctly when they are used as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:

```r
Sys.setlocale("LC_ALL", "Chinese")
library(stringi)

stri_detect_fixed("昌平区", "县") # Works fine
#> [1] FALSE
stri_detect_regex("昌平区", "县") # Unexpectedly returns TRUE
#> [1] TRUE
grepl("县", "昌平区") # Base R correctly returns FALSE
#> [1] FALSE
```
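A possible way to narrow this down is to check how the strings are encoded and to repeat the comparison with explicitly UTF-8 input. This is only a diagnostic sketch, not verified on the affected setup. The variable names `text_utf8` and `pattern_utf8` are made up for illustration, and the `\u` escapes are assumed to correspond to the same characters as in the example above.

```r
library(stringi)

# Hypothetical names; same strings as above, written with Unicode escapes
# so that they do not depend on the source file or console encoding.
text_utf8    <- "\u660c\u5e73\u533a"  # "昌平区"
pattern_utf8 <- "\u53bf"              # "县"

# How base R and stringi mark the encoding of each string
Encoding(c(text_utf8, pattern_utf8))
stri_enc_mark(c(text_utf8, pattern_utf8))
stri_enc_get()  # stringi's current default (native) encoding

# Same comparison as above, but with explicitly UTF-8 input
stri_detect_fixed(text_utf8, pattern_utf8)
stri_detect_regex(text_utf8, pattern_utf8)

# Inspect the code points stringi sees for literals typed directly
stri_escape_unicode(c("昌平区", "县"))
```

If the escaped versions behave correctly while the typed literals do not, that would point to how the native-encoded (code page 936) input is converted rather than to the regex matching itself.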

Another example:

```r
library(dplyr)
library(rvest)
library(stringi)

link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>%
  html_nodes("p") %>%
  html_text()

# Note: these are the very first three characters of the speech
stri_detect_regex(tx_xi, "同志们")
#> [1] FALSE
```

``` r
sessionInfo()
#> R Under development (unstable) (2021-05-17 r80314)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#>  [1] LC_COLLATE=Chinese (Simplified)_China.936 
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> system code page: 65001
#>
#> attached base packages:
#>  [1] stats     graphics  grDevices utils     datasets  methods  
#> [7] base     
#>
#> other attached packages:
#>   [1] stringi_1.7.3
#>
#> loaded via a namespace (and not attached):
#>   [1] compiler_4.2.0 tools_4.2.0    parallel_4.2.0
```

The issue was submitted to `stringr` (https://github.com/tidyverse/stringr/issues/386#issue-894992244), but it looks like a `stringi` problem?
