implement DatasetDict #296

ArneBinder · 2023-07-27T23:23:40Z

This PR implements a PIE DatasetDict similar to the HF version. However, it works as a container for PIE Datasets and IterableDatasets, and some methods differ slightly. List of implemented methods:

__getitem__ *: returns Datasets or IterableDatasets
from_hf_dataset: converts a HF DatasetDict or HF IterableDatasetDict to a PIE DatasetDict (requires a document_type)
to_json *: converts all documents with .asdict() and dumps them with json.dump() to JSON Lines files
from_json *: Loads the content dumped with to_json
document_type (property): return the document type. Raises an error if no splits are available or the splits have different document types.
dataset_type (property): return the dataset type, i.e. either Dataset or IterableDataset. Raises an error if no splits are available or the splits have different dataset types.
map *: similar to the HF version, but the mapping callable (function) receives and must return a document. It also checks whether function is an object derived from one of the following mixins and, if so, calls the respective logic:
- EnterDatasetMixin: calls function.enter_dataset(dataset_split, split_name) before processing the split
- ExitDatasetMixin: calls function.exit_dataset(processed_dataset_split, split_name) after processing the split
- EnterDatasetDictMixin: calls function.enter_dataset_dict(dataset_dict) before processing any split (and before handling any EnterDatasetMixin)
- ExitDatasetDictMixin: calls function.exit_dataset_dict(processed_dataset_dict) after processing all splits (and after handling any ExitDatasetMixin)
select: similar to the HF version, but adds the parameters start, stop, and step, which are used to create indices if provided, and split, which indicates which split should be modified
rename_splits: rename the splits with a mapping dict
add_test_split: cut a target_split out of a source_split by using train_test_split() from HF
drop_splits: Drops splits from the dataset
concat_splits: concatenate selected splits into a new split
filter: filter a split via a function by using filter from HF. IMPORTANT: In contrast to map, the filter function gets the dict instead of a document as input because the PIE variant of Dataset.filter() is not yet implemented!
move_to_new_split: similar to add_test_split, but the moved documents can be selected by a list of ids or via a filter function (uses PIE DatasetDict.select internally).
cast_document_type: casts all dataset splits to a new_document_type
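
The hook order described for map can be sketched without any PIE or HF dependency. This is a minimal illustration, not the code from this PR: the mixin classes are local stand-ins that only reuse the names from the list above, splits are plain lists of dicts, and map_splits and Uppercase are hypothetical names introduced for the example.

```python
from typing import Dict, List, Optional


# Local stand-ins named after the mixins in the PR description.
# They are NOT imported from pytorch_ie; the signatures are assumptions.
class EnterDatasetMixin:
    def enter_dataset(self, dataset, name: Optional[str] = None) -> None:
        pass


class ExitDatasetMixin:
    def exit_dataset(self, dataset, name: Optional[str] = None) -> None:
        pass


class EnterDatasetDictMixin:
    def enter_dataset_dict(self, dataset_dict) -> None:
        pass


class ExitDatasetDictMixin:
    def exit_dataset_dict(self, dataset_dict) -> None:
        pass


def map_splits(splits: Dict[str, List[dict]], function) -> Dict[str, List[dict]]:
    """Apply `function` to each document of each split (hypothetical helper)."""
    # Dict-level enter hook fires first, before any split-level hook.
    if isinstance(function, EnterDatasetDictMixin):
        function.enter_dataset_dict(splits)
    result: Dict[str, List[dict]] = {}
    for split_name, split in splits.items():
        if isinstance(function, EnterDatasetMixin):
            function.enter_dataset(split, split_name)
        processed = [function(doc) for doc in split]
        if isinstance(function, ExitDatasetMixin):
            function.exit_dataset(processed, split_name)
        result[split_name] = processed
    # Dict-level exit hook fires last, after all split-level hooks.
    if isinstance(function, ExitDatasetDictMixin):
        function.exit_dataset_dict(result)
    return result


class Uppercase(EnterDatasetDictMixin, EnterDatasetMixin, ExitDatasetMixin, ExitDatasetDictMixin):
    """Hypothetical mapping callable that records the order in which hooks fire."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def enter_dataset_dict(self, dataset_dict) -> None:
        self.events.append("enter_dataset_dict")

    def enter_dataset(self, dataset, name: Optional[str] = None) -> None:
        self.events.append(f"enter_dataset:{name}")

    def exit_dataset(self, dataset, name: Optional[str] = None) -> None:
        self.events.append(f"exit_dataset:{name}")

    def exit_dataset_dict(self, dataset_dict) -> None:
        self.events.append("exit_dataset_dict")

    def __call__(self, doc: dict) -> dict:
        return {**doc, "text": doc["text"].upper()}


splits = {"train": [{"text": "a"}], "test": [{"text": "b"}]}
fn = Uppercase()
mapped = map_splits(splits, fn)
print(fn.events)
```

Checking for the mixins with isinstance keeps plain functions usable as mapping callables, since they simply match none of the hooks.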

IMPORTANT: Methods marked with (*) differ from the HF syntax and semantics!
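
As an illustration of the to_json / from_json round trip, the following sketch writes one JSON Lines file per split and reads it back. It relies on assumptions that the description does not confirm: Doc is a hypothetical minimal document type, dump_splits and load_splits are hypothetical helper names, and the one-file-per-split layout (`<split>.jsonl`) is chosen only for the example.

```python
import dataclasses
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Type


@dataclasses.dataclass
class Doc:
    """Hypothetical minimal document type with an asdict() method."""

    id: str
    text: str

    def asdict(self) -> dict:
        return dataclasses.asdict(self)


def dump_splits(splits: Dict[str, List[Doc]], path: str) -> None:
    """Write each split to `<path>/<split>.jsonl` (layout is an assumption)."""
    root = Path(path)
    root.mkdir(parents=True, exist_ok=True)
    for split_name, docs in splits.items():
        with open(root / f"{split_name}.jsonl", "w", encoding="utf-8") as f:
            for doc in docs:
                # One JSON object per line.
                f.write(json.dumps(doc.asdict()) + "\n")


def load_splits(path: str, document_type: Type[Doc]) -> Dict[str, List[Doc]]:
    """Read every `*.jsonl` file in `path` back into documents."""
    splits: Dict[str, List[Doc]] = {}
    for file in sorted(Path(path).glob("*.jsonl")):
        with open(file, encoding="utf-8") as f:
            splits[file.stem] = [
                document_type(**json.loads(line)) for line in f if line.strip()
            ]
    return splits


original = {"train": [Doc("1", "hello")], "test": [Doc("2", "world")]}
with tempfile.TemporaryDirectory() as tmp:
    dump_splits(original, tmp)
    restored = load_splits(tmp, Doc)
```

Loading needs the document type as an argument because the dumped JSON carries only field values, not the class they came from.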

This PR also adds utils.hydra.resolve_target().

ArneBinder added 2 commits July 28, 2023 01:22

add utils.hydra.resolve_target()

398b484

implement DatasetDict

b64ea18

ArneBinder added the enhancement New feature or request label Jul 27, 2023

ArneBinder added 4 commits July 28, 2023 02:56

add documentation to methods

2d899ac

rename from_hf_dataset to from_hf and allow HF Dataset or HF Iterable…

0645999

…Dataset as input

improve documentation

c327675

fix tests

167382c

ArneBinder merged commit ca80fe8 into main Jul 28, 2023

ArneBinder deleted the dataset_dict branch July 28, 2023 13:50

This was referenced Jul 28, 2023

Update pytorch ie 0.17.0 ArneBinder/pie-utils#48

Merged

fix src.dataset.select for non-existing split ArneBinder/pytorch-ie-hydra-template-1#96

Closed
