implement DocumentStatistics #312

ArneBinder · 2023-08-06T16:11:10Z

This adds DocumentStatistic, an abstract class derived from DocumentMetric, to make it easy to collect statistics over datasets. Subclasses must implement a method _collect(doc: Document) -> Any, which computes arbitrary values for a single document and can return simple types (such as int, float, str), dicts, or lists.

_compute() aggregates the results of _collect() as follows: simple values are collected in a list, lists are concatenated, and dictionaries are flattened, their values are handled like simple values or lists, respectively, and the result is finally unflattened again. The unflattening is needed because calling a DocumentMetric on a DatasetDict produces a dictionary with the split names as keys and the metric results as values; without it, the overall result would be a semi-flattened dict.
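
The aggregation rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from this PR: the helper names `flatten`, `unflatten`, and `aggregate` are hypothetical, and key paths are represented as tuples purely for the sake of the example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

KeyPath = Tuple[str, ...]


def flatten(d: Dict[str, Any], prefix: KeyPath = ()) -> Dict[KeyPath, Any]:
    """Turn a nested dict into a flat dict keyed by key paths (hypothetical helper)."""
    flat: Dict[KeyPath, Any] = {}
    for key, value in d.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def unflatten(flat: Dict[KeyPath, Any]) -> Dict[str, Any]:
    """Inverse of flatten: rebuild the nested dict from key paths."""
    nested: Dict[str, Any] = {}
    for path, value in flat.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested


def aggregate(collected: List[Any]) -> Any:
    """Aggregate per-document values (illustrative, not the PR's _compute())."""
    if collected and all(isinstance(v, dict) for v in collected):
        merged: Dict[KeyPath, List[Any]] = defaultdict(list)
        for value in collected:
            for path, leaf in flatten(value).items():
                if isinstance(leaf, list):
                    merged[path].extend(leaf)  # lists are concatenated
                else:
                    merged[path].append(leaf)  # simple values go into a list
        return unflatten(dict(merged))
    result: List[Any] = []
    for value in collected:
        if isinstance(value, list):
            result.extend(value)
        else:
            result.append(value)
    return result
```

For example, aggregating `{"PER": [3, 5]}` and `{"PER": [2]}` under these assumptions yields one list of span lengths per label, which is the shape a downstream histogram or mean/std computation would expect.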

This also implements the following document statistics:

TokenCountCollector (formerly DocumentTokenCounter): Collects the token count of a field when tokenizing its content with a Hugging Face tokenizer.
FieldLengthCollector (formerly DocumentFieldLengthCounter): Collects the length of a field, e.g. to collect the number of characters in the input text.
SubFieldLengthCollector (formerly DocumentSubFieldLengthCounter): Collects the length of a subfield in a field, e.g. to collect the number of arguments of N-ary relations.
LabeledSpanLengthCollector (formerly DocumentSpanLengthCounter): Counts the length of spans in a field per label, e.g. to collect the length of entities per type.
DummyCollector (formerly DummyCounter): A dummy collector that always returns 1, e.g. to count the number of documents.
LabelCountCollector (formerly LabelCounter): Collects the number of field entries per label, e.g. to collect the number of entities per type.
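
To make the per-document values concrete, here is a rough sketch of what some of these collectors compute for a single document. It does not use the pytorch-ie API: a plain dict stands in for a Document, the function names are hypothetical, and span length is assumed to be measured in characters as `end - start`.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical stand-in for a Document: a text field plus labeled spans.
doc: Dict[str, Any] = {
    "text": "Alice works at ACME in Berlin.",
    "entities": [
        {"start": 0, "end": 5, "label": "PER"},
        {"start": 15, "end": 19, "label": "ORG"},
        {"start": 23, "end": 29, "label": "LOC"},
    ],
}


def collect_field_length(doc: Dict[str, Any], field: str) -> int:
    """In the spirit of FieldLengthCollector: the length of a field's content."""
    return len(doc[field])


def collect_label_counts(doc: Dict[str, Any], field: str) -> Dict[str, int]:
    """In the spirit of LabelCountCollector: number of field entries per label."""
    return dict(Counter(entry["label"] for entry in doc[field]))


def collect_labeled_span_lengths(doc: Dict[str, Any], field: str) -> Dict[str, List[int]]:
    """In the spirit of LabeledSpanLengthCollector: span lengths grouped by label."""
    lengths: Dict[str, List[int]] = {}
    for entry in doc[field]:
        lengths.setdefault(entry["label"], []).append(entry["end"] - entry["start"])
    return lengths


def collect_dummy(doc: Dict[str, Any]) -> int:
    """In the spirit of DummyCollector: always 1, so the values add up to the document count."""
    return 1
```

Each function returns one of the value kinds that _collect() is allowed to produce (a simple value, or a dict of simple values or lists), so the results for many documents can be aggregated per key.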

ArneBinder added 10 commits August 6, 2023 17:43

implement DocumentStatistic

428d8f8

implement specific document statistics and tests

edc934e

fix: re-add unflatten

77c8937

improve documentation for statistics

699d884

add more tests cases

7d43fd3

fix docstring

fb7416a

rename statistics

bd0ca85

fix docstring

9747531

improve docstring

cf68960

rename

e8147d0

ArneBinder linked an issue Aug 6, 2023 that may be closed by this pull request

add DocumentStatistic #303

Closed

ArneBinder merged commit 102051d into main Aug 6, 2023

ArneBinder deleted the document_statistics branch August 6, 2023 17:08

This was referenced Aug 6, 2023

use document statistics from pytorch-ie ArneBinder/pytorch-ie-hydra-template-1#111

Merged

aggregate the result of DocumentStatistic #313

Merged

ArneBinder added the enhancement New feature or request label Aug 14, 2023

ArneBinder mentioned this pull request Aug 23, 2023

integrate statistics from pie template repo ArneBinder/pie-utils#39

Closed
