This PR implements a PIE `DatasetDict` similar to the HF version. However, it works as a container for PIE `Dataset`s and `IterableDataset`s, and some methods are slightly different. List of implemented methods:

- `__getitem__`*: returns `Dataset`s or `IterableDataset`s
- `from_hf_dataset`: converts a HF `DatasetDict` or HF `IterableDatasetDict` to a PIE `DatasetDict` (requires a `document_type`)
- `to_json`*: converts all documents with `.asdict()` and dumps them with `json.dump()` to JSON lines files
- `from_json`*: loads the content dumped with `to_json`
- `document_type` (property): returns the document type. Raises an error if no splits are available or the splits have different document types.
- `dataset_type` (property): returns the dataset type, i.e. either `Dataset` or `IterableDataset`. Raises an error if no splits are available or the splits have different dataset types.
- `map`*: similar to the HF version, but the mapping callable (`function`) gets and needs to return a document. It also checks whether `function` is an object derived from one of the following mixins and then calls the respective logic:
  - `EnterDatasetMixin`: calls `function.enter_dataset(dataset_split, split_name)` before processing the split
  - `ExitDatasetMixin`: calls `function.exit_dataset(processed_dataset_split, split_name)` after processing the split
  - `EnterDatasetDictMixin`: calls `function.enter_dataset_dict(dataset_dict)` before processing any split (and before handling any `EnterDatasetMixin`)
  - `ExitDatasetDictMixin`: calls `function.exit_dataset_dict(processed_dataset_dict)` after processing all splits (and after handling any `ExitDatasetMixin`)
- `select`: similar to the HF version, but adds the parameters `start`, `stop`, `step`, which are used to create indices if available, and `split` to indicate which split should be modified
- `rename_splits`: renames the splits with a mapping dict
- `add_test_split`: cuts a `target_split` out of a `source_split` by using `train_test_split()` from HF
- `drop_splits`: drops splits from the dataset
- `concat_splits`: concatenates selected splits into a new split
- `filter`: filters a `split` via a function by using `filter` from HF. IMPORTANT: In contrast to `map`, the filter function gets the dict instead of a document as input because the PIE variant of `Dataset.filter()` is not yet implemented!
- `move_to_new_split`: similar to `add_test_split`, but the moved documents can be selected by a list of `ids` or via a filter function (uses PIE `DatasetDict.select` internally)
- `cast_document_type`: casts all dataset splits to a `new_document_type`

IMPORTANT: Methods marked with (*) differ from the HF syntax and semantics!
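
The hook order described for `map` can be sketched as follows. This is a minimal illustration, not the code from this PR: the four mixin classes are stand-ins that only reuse the names and call signatures listed above, `map_splits` and `UppercaseWithLog` are hypothetical, and plain strings in a `dict` of lists stand in for documents and dataset splits.

```python
from typing import Any, Callable, Dict, List


# Stand-in mixins: they only mirror the names and signatures from the
# description above; the real classes may look different.
class EnterDatasetMixin:
    def enter_dataset(self, dataset_split, split_name):
        pass


class ExitDatasetMixin:
    def exit_dataset(self, processed_dataset_split, split_name):
        pass


class EnterDatasetDictMixin:
    def enter_dataset_dict(self, dataset_dict):
        pass


class ExitDatasetDictMixin:
    def exit_dataset_dict(self, processed_dataset_dict):
        pass


def map_splits(
    splits: Dict[str, List[Any]], function: Callable[[Any], Any]
) -> Dict[str, List[Any]]:
    """Hypothetical helper: apply `function` to each document of each split."""
    # Dict-level enter hook runs before any split is touched.
    if isinstance(function, EnterDatasetDictMixin):
        function.enter_dataset_dict(splits)
    processed: Dict[str, List[Any]] = {}
    for split_name, split in splits.items():
        # Split-level hooks wrap the processing of a single split.
        if isinstance(function, EnterDatasetMixin):
            function.enter_dataset(split, split_name)
        processed_split = [function(document) for document in split]
        if isinstance(function, ExitDatasetMixin):
            function.exit_dataset(processed_split, split_name)
        processed[split_name] = processed_split
    # Dict-level exit hook runs after all splits are done.
    if isinstance(function, ExitDatasetDictMixin):
        function.exit_dataset_dict(processed)
    return processed


class UppercaseWithLog(
    EnterDatasetDictMixin, EnterDatasetMixin, ExitDatasetMixin, ExitDatasetDictMixin
):
    """Hypothetical mapping callable that records which hooks were called."""

    def __init__(self):
        self.events: List[str] = []

    def enter_dataset_dict(self, dataset_dict):
        self.events.append("enter_dataset_dict")

    def enter_dataset(self, dataset_split, split_name):
        self.events.append(f"enter_dataset:{split_name}")

    def exit_dataset(self, processed_dataset_split, split_name):
        self.events.append(f"exit_dataset:{split_name}")

    def exit_dataset_dict(self, processed_dataset_dict):
        self.events.append("exit_dataset_dict")

    def __call__(self, document: str) -> str:
        return document.upper()


function = UppercaseWithLog()
result = map_splits({"train": ["a", "b"], "test": ["c"]}, function)
```

A plain function without any of the mixins passes through the same helper untouched by the hooks, which is why the checks use `isinstance` rather than requiring a common base class.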
This PR also adds `utils.hydra.resolve_target()`.
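
The description does not say what `resolve_target()` does. Assuming it turns a dotted path string into the Python object it names, similar to how Hydra handles `_target_` entries, the idea can be sketched with the standard library alone. `resolve_dotted_path` is a hypothetical name used for illustration and is not the function added in this PR.

```python
import importlib
from typing import Any, Callable, Union


def resolve_dotted_path(target: Union[str, Callable[..., Any]]) -> Callable[..., Any]:
    """Hypothetical sketch: return the callable named by a `module.attribute` string.

    Callables are returned unchanged. Only the simple `module.attribute` form is
    handled here; nested attributes such as `module.Class.method` are not.
    """
    if callable(target):
        return target
    module_name, _, attribute = target.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path like 'module.attribute', got {target!r}")
    resolved = getattr(importlib.import_module(module_name), attribute)
    if not callable(resolved):
        raise TypeError(f"{target!r} does not resolve to a callable")
    return resolved


loads = resolve_dotted_path("json.loads")
parsed = loads('{"split": "train"}')
```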