stri_detect_regex does not seem to recognize Chinese characters correctly when they are used as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:
Sys.setlocale("LC_ALL", "Chinese")
library(stringi)
stri_detect_fixed("昌平区", "县") # Works fine
#> [1] FALSE
stri_detect_regex("昌平区", "县") # Unexpectedly returns TRUE
#> [1] TRUE
grepl("县", "昌平区") # FALSE
#> [1] FALSE
Another example:
library(dplyr)
library(rvest)
library(stringi)
link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"
tx_xi <- read_html(link_speech) %>%
html_nodes("p") %>%
html_text()
stri_detect_regex(tx_xi, "同志们") # Note that these are the very first three characters of the speech
#> [1] FALSE
sessionInfo()
#> R Under development (unstable) (2021-05-17 r80314)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.936
#> [2] LC_CTYPE=Chinese (Simplified)_China.936
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.936
#> system code page: 65001
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods
#> [7] base
#>
#> other attached packages:
#> [1] stringi_1.7.3
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.2.0 tools_4.2.0 parallel_4.2.0
The issue was originally submitted to stringr (tidyverse/stringr#386 (comment)), but it looks like a stringi problem?
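One possible way to narrow this down, as a sketch only and not something verified on the affected machine: write both strings with \u escapes so that R marks them as UTF-8 regardless of the native code page, then check how stringi sees their encoding before matching. The names x and p are just illustrative, and the assumption here is that the mismatch comes from how the literals are encoded on input rather than from the regex engine itself.

```r
library(stringi)

# Same strings as in the first example, written with Unicode escapes so
# that they do not depend on the encoding of the script or the console.
x <- "\u660c\u5e73\u533a"  # 昌平区
p <- "\u53bf"              # 县

# How stringi classifies each string ("UTF-8", "native", "ASCII", ...)
stri_enc_mark(c(x, p))

# Code points that actually reach the matcher
stri_escape_unicode(x)
stri_escape_unicode(p)

# Compare the two search modes on the escaped strings
stri_detect_fixed(x, p)
stri_detect_regex(x, p)
```

If the escaped version behaves correctly while the literal version does not, that would point at the native-encoding conversion of the literals rather than at the regex matching.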