implement DocumentStatistics #312

Merged
merged 10 commits into from
Aug 6, 2023
Conversation

ArneBinder
Owner

@ArneBinder ArneBinder commented Aug 6, 2023

This adds DocumentStatistics, an abstract class derived from DocumentMetric, to make it easy to collect statistics over datasets. Subclasses must implement a method _collect(doc: Document) -> Any, which computes arbitrary values over a single document and can return native types (such as int, float, str), dicts, or lists.

_compute() aggregates the results of _collect() as follows: native types are gathered into a list, lists are concatenated, and dictionaries are flattened, their values handled like native types or lists, respectively, and then unflattened again. The final unflattening is needed because calling a DocumentMetric on a DatasetDict produces a dictionary with the split names as keys and the metric results as values; without it, the result would be a semi-flattened dict.
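To illustrate the aggregation rule described above, here is a minimal, self-contained sketch in plain Python. It is not the code from this PR: the helper names `flatten`, `unflatten`, and `aggregate` are hypothetical, and flattening to key-path tuples is an assumption made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Path = Tuple[str, ...]


def flatten(d: Dict[str, Any], prefix: Path = ()) -> Dict[Path, Any]:
    """Flatten a nested dict into a dict keyed by key-path tuples (hypothetical helper)."""
    flat: Dict[Path, Any] = {}
    for key, value in d.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat: Dict[Path, Any]) -> Dict[str, Any]:
    """Rebuild the nested dict from key-path tuples (hypothetical helper)."""
    nested: Dict[str, Any] = {}
    for path, value in flat.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested


def aggregate(collected: List[Any]) -> Any:
    """Aggregate per-document results: scalars are listed, lists are
    concatenated, dicts are flattened, merged per key path, and unflattened."""
    if collected and all(isinstance(x, dict) for x in collected):
        merged: Dict[Path, List[Any]] = defaultdict(list)
        for result in collected:
            for path, value in flatten(result).items():
                if isinstance(value, list):
                    merged[path].extend(value)  # lists are concatenated
                else:
                    merged[path].append(value)  # scalars are gathered into a list
        return unflatten(dict(merged))
    values: List[Any] = []
    for x in collected:
        if isinstance(x, list):
            values.extend(x)
        else:
            values.append(x)
    return values


# Per-document span lengths keyed by label, as a per-label collector might return them.
per_doc = [{"PER": [4, 7], "ORG": [12]}, {"PER": [5]}]
print(aggregate(per_doc))  # {'PER': [4, 7, 5], 'ORG': [12]}
```

In this sketch the values under each key path end up as one list per leaf, which can then be summarized (mean, max, histogram, and so on) per split.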

This also implements the following document statistics:

  • TokenCountCollector (formerly DocumentTokenCounter): Collects the token count of a field when tokenizing its content with a Hugging Face tokenizer.
  • FieldLengthCollector (formerly DocumentFieldLengthCounter): Collects the length of a field, e.g. to collect the number of characters in the input text.
  • SubFieldLengthCollector (formerly DocumentSubFieldLengthCounter): Collects the length of a subfield in a field, e.g. to collect the number of arguments of N-ary relations.
  • LabeledSpanLengthCollector (formerly DocumentSpanLengthCounter): Collects the length of spans in a field per label, e.g. to collect the length of entities per type.
  • DummyCollector (formerly DummyCounter): A dummy collector that always returns 1, e.g. to count the number of documents.
  • LabelCountCollector (formerly LabelCounter): Collects the number of field entries per label, e.g. to collect the number of entities per type.
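The _collect() contract behind these collectors can be sketched without the library. The example below is only an illustration under stated assumptions: documents are modeled as plain dicts, and `StatisticSketch`, `FieldLengthSketch`, and `LabelCountSketch` are hypothetical stand-ins, not the classes added in this PR.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

# Assumption: a document is modeled as a plain dict for this sketch.
Doc = Dict[str, Any]


class StatisticSketch(ABC):
    """Hypothetical stand-in for the abstract statistic class:
    subclasses only implement _collect()."""

    def __init__(self) -> None:
        self._values: List[Any] = []

    @abstractmethod
    def _collect(self, doc: Doc) -> Any:
        """Return an int, float, str, list, or dict for one document."""

    def update(self, doc: Doc) -> None:
        # Store the raw per-document result; aggregation would happen later.
        self._values.append(self._collect(doc))

    def values(self) -> List[Any]:
        return list(self._values)


class FieldLengthSketch(StatisticSketch):
    """Collects the length of one field, e.g. the number of characters in a text."""

    def __init__(self, field: str) -> None:
        super().__init__()
        self.field = field

    def _collect(self, doc: Doc) -> int:
        return len(doc[self.field])


class LabelCountSketch(StatisticSketch):
    """Collects the number of entries per label in one annotation field."""

    def __init__(self, field: str) -> None:
        super().__init__()
        self.field = field

    def _collect(self, doc: Doc) -> Dict[str, int]:
        counts: Dict[str, int] = {}
        for annotation in doc[self.field]:
            label = annotation["label"]
            counts[label] = counts.get(label, 0) + 1
        return counts


docs = [
    {"text": "Jane works at ACME.", "entities": [{"label": "PER"}, {"label": "ORG"}]},
    {"text": "Bob left.", "entities": [{"label": "PER"}]},
]
length_stat = FieldLengthSketch(field="text")
label_stat = LabelCountSketch(field="entities")
for doc in docs:
    length_stat.update(doc)
    label_stat.update(doc)
```

The first collector returns a scalar per document and the second a dict per document, which are two of the return shapes the aggregation step has to handle.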

@ArneBinder ArneBinder linked an issue Aug 6, 2023 that may be closed by this pull request
@ArneBinder ArneBinder merged commit 102051d into main Aug 6, 2023
6 checks passed
@ArneBinder ArneBinder deleted the document_statistics branch August 6, 2023 17:08
@ArneBinder ArneBinder added the enhancement New feature or request label Aug 14, 2023
Development

Successfully merging this pull request may close these issues.

add DocumentStatistic