Let's consider the emoji 4️⃣: it consists of three separate code points (U+0034 DIGIT FOUR, U+FE0F VARIATION SELECTOR-16, U+20E3 COMBINING ENCLOSING KEYCAP). How should we treat it when building an n-gram index, for example?
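
To make the question concrete, here is a minimal Python sketch comparing two options: n-grams over raw code points versus n-grams over user-perceived characters. The helper names (`code_points`, `simple_clusters`, `ngrams`) are hypothetical, and `simple_clusters` is only a rough approximation that attaches any character in a Unicode mark category (Mn, Mc, Me) to the preceding base character. That happens to be enough for the keycap sequence, but it is not full UAX #29 grapheme segmentation: it does not handle ZWJ sequences, flags, or Hangul, for instance.

```python
import unicodedata


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def simple_clusters(text):
    """Rough approximation of grapheme clusters (an assumption, not UAX #29).

    Any character whose general category starts with "M" (Mn, Mc, Me)
    is glued to the preceding cluster; everything else starts a new one.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def ngrams(units, n):
    """Build n-grams from a list of units (code points or clusters)."""
    return ["".join(units[i:i + n]) for i in range(len(units) - n + 1)]


keycap_four = "\u0034\ufe0f\u20e3"  # the emoji 4️⃣
text = "a" + keycap_four + "b"

print(code_points(keycap_four))          # the three code points
print(ngrams(list(text), 2))             # bigrams over raw code points
print(ngrams(simple_clusters(text), 2))  # bigrams over approximate clusters
```

With raw code points, the index ends up containing bigrams that split the emoji (for example the variation selector paired with the enclosing keycap), while the cluster-based variant treats the whole emoji as one unit. Which behavior is preferable probably depends on whether a search for a plain `4` should also match the keycap emoji.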