🎉 First version of merge-tokenizers
This is the first version of merge-tokenizers. It contains:
- Five aligners (DTW, Fast-DTW, greedy, word_ids, and Tamuhey) for aligning different tokenizations and merging token-level features.
- Examples and a benchmark.
- Implementations of token distances.
- Heuristics to avoid useless computations.
- Base class to implement custom aligners.
- Brief documentation in the README.
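
To give a feel for what aligning two tokenizations with DTW and a token distance involves, here is a minimal, self-contained sketch. It does not use the merge-tokenizers API: the names `token_distance`, `dtw_align`, and `merge_features` are hypothetical, and both the distance (built on `difflib.SequenceMatcher`) and the mean-pooling of features are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def token_distance(a: str, b: str) -> float:
    """Toy distance in [0, 1]: 0 for identical tokens, 1 for nothing in common.

    Assumption: common subword markers ("##", "▁", "Ġ") are stripped first.
    """
    a = a.lower().strip("#▁Ġ")
    b = b.lower().strip("#▁Ġ")
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def dtw_align(tokens_a: List[str], tokens_b: List[str]) -> List[Tuple[int, int]]:
    """Classic O(n*m) dynamic time warping; returns aligned index pairs (i, j)."""
    n, m = len(tokens_a), len(tokens_b)
    inf = float("inf")
    # cost[i][j] = cheapest cost of aligning the first i tokens of A
    # with the first j tokens of B.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = token_distance(tokens_a[i - 1], tokens_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def merge_features(
    path: List[Tuple[int, int]], features_a: List[float], n_tokens_b: int
) -> List[float]:
    """Project per-token features of A onto B by averaging over aligned tokens.

    A DTW path over two non-empty sequences touches every index of B,
    so no bucket is empty.
    """
    buckets = [[] for _ in range(n_tokens_b)]
    for i, j in path:
        buckets[j].append(features_a[i])
    return [sum(bucket) / len(bucket) for bucket in buckets]


# Two tokenizations of the same text, with one feature value per token of A.
tokens_a = ["token", "##ization", "works"]
tokens_b = ["tok", "en", "iza", "tion", "works"]
features_a = [1.0, 3.0, 5.0]

path = dtw_align(tokens_a, tokens_b)
merged = merge_features(path, features_a, len(tokens_b))
print(path)
print(merged)
```

The full cost matrix makes this quadratic in the number of tokens, which is the kind of cost that approximations such as Fast-DTW and heuristics for skipping unnecessary computations aim to reduce.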