implement DatasetDict #296
Merged
This PR implements a PIE `DatasetDict` similar to the HF version. However, it works as a container for PIE `Dataset`s and `IterableDataset`s, and some methods are slightly different. List of implemented methods:

- `__getitem__`*: returns `Dataset`s or `IterableDataset`s
- `from_hf_dataset`: converts a HF `DatasetDict` or HF `IterableDatasetDict` to a PIE `DatasetDict` (requires a `document_type`)
- `to_json`*: converts all documents with `.asdict()` and dumps them with `json.dump()` to JSON Lines files
- `from_json`*: loads the content dumped with `to_json`
- `document_type` (property): returns the document type. Raises an error if no splits are available or the splits have different document types.
- `dataset_type` (property): returns the dataset type, i.e. either `Dataset` or `IterableDataset`. Raises an error if no splits are available or the splits have different dataset types.
- `map`*: similar to the HF version, but the mapping callable (`function`) gets and needs to return a document. Also checks whether `function` is an object derived from one of the following mixins and then calls the respective logic:
  - `EnterDatasetMixin`: calls `function.enter_dataset(dataset_split, split_name)` before processing the split
  - `ExitDatasetMixin`: calls `function.exit_dataset(processed_dataset_split, split_name)` after processing the split
  - `EnterDatasetDictMixin`: calls `function.enter_dataset_dict(dataset_dict)` before processing any split (and before handling any `EnterDatasetMixin`)
  - `ExitDatasetDictMixin`: calls `function.exit_dataset_dict(processed_dataset_dict)` after processing all splits (and after handling any `ExitDatasetMixin`)
- `select`: similar to the HF version, but adds the parameters `start`, `stop`, and `step`, which are used to create indices, if available, and `split` to indicate which split should be modified
- `rename_splits`: renames the splits with a mapping dict
- `add_test_split`: cuts a `target_split` out of a `source_split` by using `train_test_split()` from HF
- `drop_splits`: drops splits from the dataset
- `concat_splits`: concatenates selected splits into a new split
- `filter`: filters a `split` via a function by using `filter` from HF. IMPORTANT: In contrast to `map`, the filter function gets the dict instead of a document as input because the PIE variant of `Dataset.filter()` is not yet implemented!
- `move_to_new_split`: similar to `add_test_split`, but the moved documents can be selected by a list of `ids` or via a filter function (uses PIE `DatasetDict.select` internally)
- `cast_document_type`: casts all dataset splits to a `new_document_type`

IMPORTANT: Methods marked with (*) differ from the HF syntax and semantics!
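The hook order described for `map` can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the four mixin classes are local stand-ins that only borrow the names and call signatures listed above, `map_splits` and `Recorder` are hypothetical helpers, and splits are plain lists of strings rather than PIE datasets.

```python
# Local stand-ins for the mixins named in the description (NOT the PIE classes).
class EnterDatasetMixin:
    def enter_dataset(self, dataset_split, split_name):
        pass


class ExitDatasetMixin:
    def exit_dataset(self, processed_dataset_split, split_name):
        pass


class EnterDatasetDictMixin:
    def enter_dataset_dict(self, dataset_dict):
        pass


class ExitDatasetDictMixin:
    def exit_dataset_dict(self, processed_dataset_dict):
        pass


def map_splits(dataset_dict, function):
    """Hypothetical helper: apply `function` to every document of every split,
    calling the optional hooks in the order given in the description."""
    if isinstance(function, EnterDatasetDictMixin):
        function.enter_dataset_dict(dataset_dict)  # before any split
    processed_dict = {}
    for split_name, split in dataset_dict.items():
        if isinstance(function, EnterDatasetMixin):
            function.enter_dataset(split, split_name)  # before this split
        processed = [function(doc) for doc in split]
        if isinstance(function, ExitDatasetMixin):
            function.exit_dataset(processed, split_name)  # after this split
        processed_dict[split_name] = processed
    if isinstance(function, ExitDatasetDictMixin):
        function.exit_dataset_dict(processed_dict)  # after all splits
    return processed_dict


class Recorder(
    EnterDatasetMixin, ExitDatasetMixin, EnterDatasetDictMixin, ExitDatasetDictMixin
):
    """Toy mapping callable that records which hooks were called."""

    def __init__(self):
        self.events = []

    def __call__(self, doc):
        return doc.upper()

    def enter_dataset(self, dataset_split, split_name):
        self.events.append(f"enter_dataset:{split_name}")

    def exit_dataset(self, processed_dataset_split, split_name):
        self.events.append(f"exit_dataset:{split_name}")

    def enter_dataset_dict(self, dataset_dict):
        self.events.append("enter_dataset_dict")

    def exit_dataset_dict(self, processed_dataset_dict):
        self.events.append("exit_dataset_dict")


recorder = Recorder()
processed = map_splits({"train": ["a", "b"], "test": ["c"]}, recorder)
print(recorder.events)
```

A callable that derives from none of the mixins would pass through the same helper with no hooks being called, which is presumably why the hooks are modeled as opt-in mixins.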
This PR also adds `utils.hydra.resolve_target()`.
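As an illustration of the `select` item in the method list, the sketch below shows one way `start`, `stop`, and `step` could be turned into indices for a single split. It is an assumption-laden toy, not the implementation from this PR: `select_split` is a hypothetical helper, splits are plain lists, and Python slice semantics are assumed for building the index range.

```python
def select_split(dataset_dict, split, start=None, stop=None, step=None, indices=None):
    """Hypothetical helper: return a copy of `dataset_dict` in which only
    `split` is reduced to the selected entries; other splits stay untouched."""
    if any(value is not None for value in (start, stop, step)):
        if indices is not None:
            raise ValueError("pass either indices or start/stop/step, not both")
        size = len(dataset_dict[split])
        # Assumption: resolve start/stop/step like a Python slice over the split.
        indices = range(*slice(start, stop, step).indices(size))
    if indices is None:
        raise ValueError("nothing to select: pass indices or start/stop/step")
    result = dict(dataset_dict)
    result[split] = [dataset_dict[split][i] for i in indices]
    return result


data = {"train": list("abcdef"), "test": ["x"]}
selected = select_split(data, split="train", start=1, stop=5, step=2)
print(selected["train"])
```

Only the split named by `split` changes here; whether the actual method returns a new container or modifies the existing one is not stated in the description.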