-
I posted this in TorchSharp to link up the discussions: dotnet/TorchSharp#248. @NiklasGustafsson has been doing work there on samples that require tokenization.
-
Is there any way to collaborate with the Curiosity AI folks on this one? They pride themselves on their tokenization, as far as I can tell. Or would their implementation be too opinionated? https://github.com/curiosity-ai/catalyst
-
We are now sorting this out here; please join the discussion. The list of tokenizers considered in TorchText.Data Utils.cs is:
-
Not to hijack this discussion, but is there any interest in using BlingFire and making it available more end-to-end through TorchSharp?
-
Data preparation in NLP usually involves some sort of tokenization. State-of-the-art model architectures like transformers have their own way of preprocessing text. Libraries like those from HuggingFace have simplified not only the modeling process for complex NLP models, but also the preprocessing pipeline.
The idea is to create a tokenization library for .NET whose output is simple to use downstream for modeling with other libraries such as TensorFlow.NET, TensorFlow.Keras, TorchSharp, DiffSharp, ML.NET, etc.
Two potential approaches: