This adds `DocumentStatistics`, an abstract class derived from `DocumentMetric`, to easily collect statistics over datasets. Subclasses must implement a method `_collect(doc: Document) -> Any`, which calculates arbitrary values over the document and can return primitive types (such as `int`, `float`, `str`), dicts, or lists.

`_compute()` aggregates the results of `_collect()` as follows: primitive values are collected into a list, lists are concatenated, and dictionaries are flattened, their values are handled like primitive values or lists, respectively, and the result is unflattened again. The unflattening is needed because calling the `DocumentMetric` on a `DatasetDict` produces a dictionary with the split names as keys and the metric results as values, so the result would otherwise be a semi-flattened dict.

This also implements the following document statistics:
- `TokenCountCollector` (formerly `DocumentTokenCounter`): Collects the token count of a field when tokenizing its content with a Hugging Face tokenizer.
- `FieldLengthCollector` (formerly `DocumentFieldLengthCounter`): Collects the length of a field, e.g. to collect the number of characters in the input text.
- `SubFieldLengthCollector` (formerly `DocumentSubFieldLengthCounter`): Collects the length of a subfield in a field, e.g. to collect the number of arguments of N-ary relations.
- `LabeledSpanLengthCollector` (formerly `DocumentSpanLengthCounter`): Collects the length of spans in a field per label, e.g. to collect the length of entities per type.
- `DummyCollector` (formerly `DummyCounter`): A dummy collector that always returns 1, e.g. to count the number of documents.
- `LabelCountCollector` (formerly `LabelCounter`): Collects the number of field entries per label, e.g. to collect the number of entities per type.
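
The aggregation that `_compute()` performs, as described above, can be illustrated with a small standalone sketch. This is not the code from this PR: the helper names `flatten`, `unflatten`, and `aggregate` are hypothetical, the sketch uses only the standard library, and it assumes that every document returns a value of the same shape.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Path = Tuple[str, ...]


def flatten(d: Dict[str, Any], prefix: Path = ()) -> Dict[Path, Any]:
    """Flatten a nested dict into {key_path: leaf_value} (hypothetical helper)."""
    flat: Dict[Path, Any] = {}
    for key, value in d.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat: Dict[Path, Any]) -> Dict[str, Any]:
    """Rebuild a nested dict from {key_path: value} (hypothetical helper)."""
    nested: Dict[str, Any] = {}
    for path, value in flat.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested


def aggregate(collected: List[Any]) -> Any:
    """Aggregate per-document values (assumption: all values share one shape).

    Primitive values are appended to a list, lists are concatenated, and
    dicts are flattened, aggregated per key path, and unflattened again.
    """
    result: Dict[Path, List[Any]] = defaultdict(list)
    for value in collected:
        # Non-dict values are stored under the empty key path.
        flat = flatten(value) if isinstance(value, dict) else {(): value}
        for path, leaf in flat.items():
            if isinstance(leaf, list):
                result[path].extend(leaf)
            else:
                result[path].append(leaf)
    if () in result:
        return result[()]
    return unflatten(dict(result))


print(aggregate([1, 2, 3]))                    # [1, 2, 3]
print(aggregate([[1, 2], [3]]))                # [1, 2, 3]
print(aggregate([{"PER": 2}, {"PER": 5, "ORG": [1, 4]}]))
# {'PER': [2, 5], 'ORG': [1, 4]}
```

Flattening first means nested dictionaries of any depth can be aggregated with the same two rules (append or concatenate) before the original nesting is restored.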