
🎉 First version of merge-tokenizers

@jogonba2 released this 21 Mar 16:38 · 9 commits to main since this release

This is the first version of merge-tokenizers. It contains:

  • Five aligners (DTW, Fast-DTW, greedy, word_ids, and Tamuhey) for aligning different tokenizations and merging token-level features; see the usage sketch after this list.
  • Examples and a benchmark.
  • Implementations of token distances.
  • Heuristics to avoid unnecessary computation.
  • A base class for implementing custom aligners.
  • Brief documentation in the README.
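
As a rough illustration of what aligning two tokenizations and merging token-level features might look like, here is a minimal sketch. The names `DTWAligner`, `TokenizedPair`, and `align`, and the shape of the returned alignment, are assumptions made for this example; refer to the README for the actual API.

```python
# Hypothetical usage sketch: `DTWAligner`, `TokenizedPair`, and the structure
# of `alignment` are illustrative assumptions, not the confirmed API.
from merge_tokenizers import DTWAligner, TokenizedPair  # assumed imports

# Two tokenizations of the same text, e.g. produced by different tokenizers.
tokens_a = ["New", "York", "is", "big"]
tokens_b = ["New York", "is", "big"]

# Token-level features computed on the first tokenization (e.g. per-token
# scores) that we want to project onto the second tokenization.
features_a = [0.1, 0.2, 0.3, 0.4]

aligner = DTWAligner()
alignment = aligner.align(TokenizedPair(tokens_a, tokens_b))

# Assuming the alignment is a sequence of (position_in_a, position_in_b)
# pairs, features can be merged by averaging per target token.
merged = {}
for pos_a, pos_b in alignment:
    merged.setdefault(pos_b, []).append(features_a[pos_a])
merged_features = [sum(v) / len(v) for _, v in sorted(merged.items())]
```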