implement DatasetDict #296

Merged: 6 commits, Jul 28, 2023
@ArneBinder ArneBinder commented Jul 27, 2023

This PR implements a PIE DatasetDict similar to the HF version. However, it works as a container for PIE Datasets and IterableDatasets, and some methods behave slightly differently. List of implemented methods:

  • __getitem__ *: returns Datasets or IterableDatasets
  • from_hf_dataset: converts a HF DatasetDict or HF IterableDatasetDict to a PIE DatasetDict (requires a document_type)
  • to_json *: converts all documents with .asdict() and dumps them with json.dump() to JSON Lines files
  • from_json *: Loads the content dumped with to_json
  • document_type (property): returns the document type. Raises an error if no splits are available or the splits have different document types.
  • dataset_type (property): returns the dataset type, i.e. either Dataset or IterableDataset. Raises an error if no splits are available or the splits have different dataset types.
  • map *: similar to the HF version, but the mapping callable (function) receives a document and needs to return a document. It also checks whether function is an object derived from one of the following mixins and, if so, calls the respective logic:
    • EnterDatasetMixin: calls function.enter_dataset(dataset_split, split_name) before processing the split
    • ExitDatasetMixin: calls function.exit_dataset(processed_dataset_split, split_name) after processing the split
    • EnterDatasetDictMixin: calls function.enter_dataset_dict(dataset_dict) before processing any split (and before handling any EnterDatasetMixin)
    • ExitDatasetDictMixin: calls function.exit_dataset_dict(processed_dataset_dict) after processing all splits (and after handling any ExitDatasetMixin)
  • select: similar to the HF version, but adds the parameters start, stop, and step, which are used to create indices if provided, and split, which indicates which split should be modified
  • rename_splits: rename the splits with a mapping dict
  • add_test_split: cut a target_split out of a source_split by using train_test_split() from HF
  • drop_splits: Drops splits from the dataset
  • concat_splits: concatenate selected splits into a new split
  • filter: filter a split via a function by using filter from HF. IMPORTANT: In contrast to map, the filter function gets the dict instead of a document as input because the PIE variant of Dataset.filter() is not yet implemented!
  • move_to_new_split: similar to add_test_split, but the moved documents can be selected by a list of ids or via a filter function (uses PIE DatasetDict.select internally).
  • cast_document_type: casts all dataset splits to a new_document_type
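
The hook order described for map can be illustrated with a small, self-contained sketch. This is not the PIE implementation: the mixin classes below are empty stand-ins that only reuse the names and hook methods from the list above, documents are plain dicts, and `map_splits` and `Uppercase` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List


# Stand-in mixins: only the class and method names come from the PR
# description; the bodies are illustrative no-ops.
class EnterDatasetMixin:
    def enter_dataset(self, dataset, name=None):
        pass


class ExitDatasetMixin:
    def exit_dataset(self, dataset, name=None):
        pass


class EnterDatasetDictMixin:
    def enter_dataset_dict(self, dataset_dict):
        pass


class ExitDatasetDictMixin:
    def exit_dataset_dict(self, dataset_dict):
        pass


def map_splits(dataset_dict: Dict[str, List[dict]], function: Callable) -> Dict[str, List[dict]]:
    """Hypothetical helper: apply function to each document, firing hooks in the described order."""
    if isinstance(function, EnterDatasetDictMixin):
        function.enter_dataset_dict(dataset_dict)  # before any split is touched
    result = {}
    for split_name, split in dataset_dict.items():
        if isinstance(function, EnterDatasetMixin):
            function.enter_dataset(split, split_name)  # before processing this split
        processed = [function(doc) for doc in split]
        if isinstance(function, ExitDatasetMixin):
            function.exit_dataset(processed, split_name)  # after processing this split
        result[split_name] = processed
    if isinstance(function, ExitDatasetDictMixin):
        function.exit_dataset_dict(result)  # after all splits are done
    return result


class Uppercase(EnterDatasetDictMixin, EnterDatasetMixin, ExitDatasetMixin, ExitDatasetDictMixin):
    """Hypothetical processor that records the order in which its hooks fire."""

    def __init__(self):
        self.events = []

    def enter_dataset_dict(self, dataset_dict):
        self.events.append("enter_dict")

    def enter_dataset(self, dataset, name=None):
        self.events.append(f"enter:{name}")

    def exit_dataset(self, dataset, name=None):
        self.events.append(f"exit:{name}")

    def exit_dataset_dict(self, dataset_dict):
        self.events.append("exit_dict")

    def __call__(self, doc):
        # Takes a document and returns a document, as map requires.
        return {**doc, "text": doc["text"].upper()}


processor = Uppercase()
splits = {"train": [{"text": "a"}, {"text": "b"}], "test": [{"text": "c"}]}
mapped = map_splits(splits, processor)
print(processor.events)
```

A stateful processor like this could, for example, collect statistics per split in enter_dataset/exit_dataset and report totals in exit_dataset_dict.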

IMPORTANT: Methods marked with (*) differ from the HF syntax and semantics!
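
As one example of a starred method pair, the to_json / from_json round trip described above (documents converted to dicts, one JSON object per line) might look roughly like the following sketch. It rests on several assumptions: `ToyDocument` is a hypothetical stand-in for a PIE document, `dataclasses.asdict` plays the role of .asdict(), `json.dumps` is used per line instead of json.dump(), and the one-file-per-split layout (`<split>.jsonl`) is chosen for illustration and may not match what the PR actually writes.

```python
import dataclasses
import json
import tempfile
from pathlib import Path
from typing import Dict, List


@dataclasses.dataclass
class ToyDocument:
    """Hypothetical stand-in for a PIE document."""

    text: str
    id: str


def dump_splits(splits: Dict[str, List[ToyDocument]], path: str) -> None:
    """Write each split to <path>/<split>.jsonl (assumed layout), one document per line."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, docs in splits.items():
        with open(out_dir / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for doc in docs:
                f.write(json.dumps(dataclasses.asdict(doc)) + "\n")


def load_splits(path: str) -> Dict[str, List[ToyDocument]]:
    """Read every *.jsonl file in path back into a split of documents."""
    splits = {}
    for file in sorted(Path(path).glob("*.jsonl")):
        with open(file, encoding="utf-8") as f:
            splits[file.stem] = [ToyDocument(**json.loads(line)) for line in f if line.strip()]
    return splits


original = {
    "train": [ToyDocument(text="hello", id="1")],
    "test": [ToyDocument(text="world", id="2")],
}
with tempfile.TemporaryDirectory() as tmp:
    dump_splits(original, tmp)
    restored = load_splits(tmp)
print(restored == original)
```

Note that loading needs to know the document class to rebuild documents from dicts, which is presumably why from_hf_dataset, too, requires a document_type.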

This PR also adds utils.hydra.resolve_target().
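
Returning to select from the method list: the way start, stop, and step could be turned into indices can be sketched as below. This is only an illustration under assumptions; `build_indices` is a hypothetical helper, and how the PR handles missing or out-of-range values is not stated in the description, so Python slice semantics are used here as a plausible default.

```python
from typing import List, Optional


def build_indices(
    num_rows: int,
    start: Optional[int] = None,
    stop: Optional[int] = None,
    step: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper: create row indices from start/stop/step.

    slice.indices() fills in defaults for missing values and clamps the
    bounds to the number of rows (assumed behavior, not taken from the PR).
    """
    return list(range(*slice(start, stop, step).indices(num_rows)))


rows = ["d0", "d1", "d2", "d3", "d4", "d5"]
indices = build_indices(len(rows), start=1, stop=5, step=2)
selected = [rows[i] for i in indices]
print(selected)  # ['d1', 'd3']
```

The resulting index list is the kind of argument an indices-based select, such as the HF one, accepts.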

@ArneBinder ArneBinder added the enhancement New feature or request label Jul 27, 2023
@ArneBinder ArneBinder merged commit ca80fe8 into main Jul 28, 2023
6 checks passed
@ArneBinder ArneBinder deleted the dataset_dict branch July 28, 2023 13:50