Have you considered implementing an ALTREP string class? Done properly, I think it could improve performance across the board, for several reasons:
- Simpler data structures compared to R's heavy CHARSXP and R's global string cache
- Short string optimization
- The possibility of true multithreading (you can't multithread R internals)
If there's interest, I'd be happy to develop and work on it.
To flesh it out a bit, I think you could use an ALTREP class that's backed by standard STL structures:
std::vector<std::string>
You don't need to keep track of encoding if you can assume UTF-8.
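A minimal sketch of what that backing store might look like, to make the idea concrete. The class name `Utf8StringStore` and its methods are hypothetical, and the NA mask is my own assumption about how R's missing-string value could be modelled, since `std::string` has no NA of its own. The ALTREP registration itself is left out so the sketch compiles without R headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical backing store for an ALTREP string vector.
// Every element is assumed to be UTF-8, so no per-element encoding tag is kept.
class Utf8StringStore {
public:
    explicit Utf8StringStore(std::size_t n) : data_(n), is_na_(n, false) {}

    std::size_t size() const { return data_.size(); }

    // Store a value. Short strings typically live inside the std::string
    // object itself (short string optimization), avoiding a heap allocation.
    void set(std::size_t i, std::string value) {
        assert(i < data_.size());
        data_[i] = std::move(value);
        is_na_[i] = false;
    }

    // Assumption: missing values are tracked in a side mask.
    void set_na(std::size_t i) {
        assert(i < data_.size());
        data_[i].clear();
        is_na_[i] = true;
    }

    bool is_na(std::size_t i) const {
        assert(i < is_na_.size());
        return is_na_[i];
    }

    const std::string& get(std::size_t i) const {
        assert(i < data_.size());
        return data_[i];
    }

private:
    std::vector<std::string> data_;
    std::vector<bool> is_na_;
};
```

Nothing here touches R's global string cache, which is what would make it plausible to fill or read such a store from worker threads.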
You'd probably want some global configuration parameter:
stri_use_alt_rep(bool)
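As a rough sketch, the switch could be a single process-wide flag. Only the name `stri_use_alt_rep` comes from the proposal above; the `stri_alt_rep_enabled` query and the choice to return the previous setting are assumptions for illustration. A real version would be exported to R and take a `SEXP` argument.

```cpp
#include <atomic>

namespace {
// Process-wide switch; atomic so it can be read safely from worker threads.
std::atomic<bool> g_use_alt_rep{false};
}

// Turns the ALTREP representation on or off.
// Assumption: returns the previous setting so callers can restore it later.
bool stri_use_alt_rep(bool enable) {
    return g_use_alt_rep.exchange(enable);
}

// Hypothetical query used by internal code to pick a code path.
bool stri_alt_rep_enabled() {
    return g_use_alt_rep.load();
}
```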
Every interaction with R string memory would then have to be replaced with a conditional on that setting:
- CHAR
- SET_STRING_ELT
- STRING_ELT
- mkChar*
- Rf_allocVector(STRSXP,...)
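One way to keep those conditionals manageable would be to hide the branch behind a small wrapper, so call sites never touch the R macros directly. The sketch below is only an illustration: `StringVector`, `RBackedStrings` and `AltRepStrings` are hypothetical names, and the R-backed path is a stand-in so the example compiles without R headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for an ordinary STRSXP. In a real build this path would call
// CHAR(STRING_ELT(x, i)) and SET_STRING_ELT(x, i, Rf_mkCharLenCE(...)).
struct RBackedStrings {
    std::vector<std::string> cells;
};

// The proposed representation: plain STL storage, UTF-8 assumed.
struct AltRepStrings {
    std::vector<std::string> cells;
};

// Hypothetical wrapper: the representation is chosen once at construction,
// and every accessor branches on it internally.
class StringVector {
public:
    // Takes the place of Rf_allocVector(STRSXP, n).
    StringVector(std::size_t n, bool use_alt_rep) : use_alt_rep_(use_alt_rep) {
        cells().resize(n);
    }

    std::size_t size() const { return cells().size(); }

    // Takes the place of CHAR(STRING_ELT(x, i)).
    const char* c_str(std::size_t i) const {
        assert(i < size());
        return cells()[i].c_str();
    }

    // Takes the place of SET_STRING_ELT(x, i, mkChar*(...)).
    void set(std::size_t i, const std::string& value) {
        assert(i < size());
        cells()[i] = value;
    }

private:
    std::vector<std::string>& cells() {
        return use_alt_rep_ ? alt_.cells : r_.cells;
    }
    const std::vector<std::string>& cells() const {
        return use_alt_rep_ ? alt_.cells : r_.cells;
    }

    bool use_alt_rep_;
    RBackedStrings r_;
    AltRepStrings alt_;
};
```

With something like this, the branch lives in one place instead of being repeated at every call site.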
You'd also need to replace any comparison of addresses used to test string equality (I'm not sure whether stringi does this).
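The reason is that R's global cache gives identical strings a single shared CHARSXP, so a pointer comparison works there, whereas two `std::string` objects with the same contents live at different addresses. A small sketch of a content-based check (the helper name `same_string` is hypothetical):

```cpp
#include <string>

// Content-based equality for STL-backed strings.
// The address check is only a fast path; it can no longer decide equality
// on its own, because equal strings are not deduplicated.
bool same_string(const std::string& a, const std::string& b) {
    return &a == &b || a == b;
}
```

For example, two separately constructed strings holding "stringi" have different addresses, but `same_string` still reports them as equal.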
There are probably things I'm forgetting, and it's a lot of work, but I think the scope is clearly defined.