TextCleaner could send its output in UTF-32 so that the RoughTokenizer doesn't have to re-decode it from UTF-8. Since the TextCleaner must also be able to output UTF-8 for the Classifier stage (reading annotated data and aligning), the TextCleaner class would have to be heavily templated. The performance gain probably wouldn't be very high.
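
To give a rough idea of what templating on the output encoding might involve, here is a minimal sketch. All names in it (`append_code_point`, `clean`) and the carriage-return rule are hypothetical stand-ins, not the actual TextCleaner interface; it only assumes the cleaner works on decoded code points internally and picks the output code unit type at compile time.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: these names do not come from the real TextCleaner code.

// UTF-32 output: one code unit per code point, nothing to encode.
void append_code_point(std::u32string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    out.push_back(cp);
}

// UTF-8 output: encode the code point into 1 to 4 bytes.
// (Surrogate code points are not rejected in this sketch.)
void append_code_point(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Stand-in for a cleaner templated on the output code unit type:
// CharT = char32_t yields UTF-32, CharT = char yields UTF-8.
template <typename CharT>
std::basic_string<CharT> clean(const std::u32string& code_points) {
    std::basic_string<CharT> out;
    for (char32_t cp : code_points) {
        if (cp == U'\r') continue;  // placeholder cleaning rule (assumption)
        append_code_point(out, cp);
    }
    return out;
}
```

With something like this, the RoughTokenizer path would call `clean<char32_t>(...)` and the Classifier path `clean<char>(...)`. Every member that touches the output buffer would need the same treatment, which is where the heavy templating comes from.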