This is the common output structure of labeling tools like Label Studio or Labelbox, because it's simple and human-interpretable.

HuggingFace format refers to the BIO/BILOU/BIOES tagging format commonly used for fine-tuning transformers. The input text is tokenized, and each token
is given a tag denoting whether it is part of a label and, if so, its position (Beginning, Inside, etc.). Here's an example: https://huggingface.co/datasets/wikiann
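
To make the tagging scheme concrete, here is a minimal sketch of a BIO-tagged sentence and a helper that groups the tags back into entities. The sentence, the `PER`/`LOC` labels, and the `bio_to_entities` helper are hypothetical illustrations and not part of this tool; tokens are split on whitespace only to keep the example readable.

```python
# Hypothetical example: one tag per token, using the BIO scheme.
tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
ner_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]


def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens back into (label, text) pairs."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_entity:
            # Close any open entity, then start a new one.
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


entities = bio_to_entities(tokens, ner_tags)
```

BILOU and BIOES follow the same idea but add explicit tags for the last token of an entity and for single-token entities.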
@@ -37,7 +37,7 @@ For more information about this tagging system, see [wikipedia](https://en.wikip

This format is tricky, though, because it depends entirely on the tokenizer used. Tokens are not simply space-separated words: each tokenizer has its own vocabulary of tokens and breaks words down into sub-words. So moving from character-level spans to token-level tags is a very manual process. That's a core reason I built this tool.
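
As a rough illustration of that conversion, the sketch below maps character-level spans onto token-level BIO tags using token offsets. It is not this tool's implementation: the regex tokenizer is a toy stand-in for a real sub-word tokenizer, and the `start`/`end`/`label` span keys and both function names are assumptions made for the example.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer: words and punctuation with (start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def spans_to_bio(text, spans):
    """Convert character-level spans into one BIO tag per token."""
    tokens = tokenize_with_offsets(text)
    tags = ["O"] * len(tokens)
    for span in spans:
        inside = False
        for i, (_, start, end) in enumerate(tokens):
            # A token belongs to the span if their character ranges overlap.
            if start < span["end"] and end > span["start"]:
                tags[i] = ("I-" if inside else "B-") + span["label"]
                inside = True
    return [tok for tok, _, _ in tokens], tags


text = "Barack Obama visited Paris."
spans = [
    {"start": 0, "end": 12, "label": "PER"},   # "Barack Obama"
    {"start": 21, "end": 26, "label": "LOC"},  # "Paris"
]
tokens, tags = spans_to_bio(text, spans)
```

With a real sub-word tokenizer the token boundaries change, so the same spans produce a different tag sequence. HuggingFace fast tokenizers can return the character offsets needed for this overlap check via `return_offsets_mapping=True`.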

This will return your data as a HuggingFace `Dataset` and will automatically string-index your `ner_tags` into a `ClassLabel` object.
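
String-indexing means replacing each tag string with an integer id and keeping the list of tag names alongside it. The pure-Python sketch below shows the idea only; the `string_index` helper is hypothetical, and the alphabetical ordering of the names is an assumption, so the ids this tool assigns may differ.

```python
def string_index(tag_sequences):
    """Map string tags to integer ids, returning the name list and encoded tags."""
    # Assumption: names are ordered alphabetically; a real pipeline may order them differently.
    label_names = sorted({tag for seq in tag_sequences for tag in seq})
    label2id = {name: i for i, name in enumerate(label_names)}
    encoded = [[label2id[tag] for tag in seq] for seq in tag_sequences]
    return label_names, encoded


label_names, encoded = string_index([["B-PER", "I-PER", "O"], ["O", "B-LOC"]])
# Decoding an id back to its tag name is a list lookup.
decoded_first = [label_names[i] for i in encoded[0]]
```

In the `datasets` library, a `ClassLabel` feature plays the role of `label_names` here: it stores the names and converts between strings and ids with `str2int` and `int2str`.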
## Project Setup
Project setup is credited to [@anthonycorletti](https://github.com/anthonycorletti) and his awesome [project template repo](https://github.com/anthonycorletti/python-project-template).