Is your feature request related to a problem? Please describe.
Currently, data is masked inside tokenizer_masking, in batchify_source and batchify_target. There is no dedicated class for masking, which we need in order to implement different masking strategies, vary them during training, and so on.
Describe the solution you'd like
The first step is a bare-bones implementation. Create a masker class that implements a couple of simple masking strategies and is instantiated in multi_stream_data_sampler before the tokenizer is instantiated. The class should take input data and return the masked data, and it must do this for both source and target data.
tokenizer_masking should then be adjusted to do tokenisation only, and no longer perform masking at the same time.
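
To make the proposal concrete, here is a rough sketch of what such a masker could look like. Everything in it is an assumption for discussion, not existing code: the class name Masker, the strategy names "random" and "block", the masking_rate and seed parameters, and the method names are all hypothetical. It also assumes that the source keeps the unmasked tokens and the target keeps the masked ones, which may not match what batchify_source and batchify_target do today.

```python
import random
from typing import List, Sequence


class Masker:
    """Hypothetical masker that decides which tokens are hidden.

    It is independent of tokenisation: it only sees a sequence of
    already-formed items and returns the selected ones.
    """

    def __init__(self, strategy: str = "random", masking_rate: float = 0.5, seed: int = 0):
        if strategy not in ("random", "block"):
            raise ValueError(f"unknown masking strategy: {strategy}")
        if not 0.0 <= masking_rate <= 1.0:
            raise ValueError("masking_rate must be in [0, 1]")
        self.strategy = strategy
        self.masking_rate = masking_rate
        self.rng = random.Random(seed)

    def make_mask(self, num_tokens: int) -> List[bool]:
        """Return one boolean per token; True marks a masked token."""
        num_masked = round(self.masking_rate * num_tokens)
        mask = [False] * num_tokens
        if self.strategy == "random":
            # Mask num_masked positions chosen uniformly without replacement.
            for idx in self.rng.sample(range(num_tokens), num_masked):
                mask[idx] = True
        else:
            # "block": mask one contiguous run of num_masked positions.
            start = self.rng.randint(0, num_tokens - num_masked)
            for idx in range(start, start + num_masked):
                mask[idx] = True
        return mask

    def mask_source(self, tokens: Sequence, mask: Sequence[bool]) -> list:
        """Assumed convention: the source keeps only the unmasked tokens."""
        return [t for t, m in zip(tokens, mask) if not m]

    def mask_target(self, tokens: Sequence, mask: Sequence[bool]) -> list:
        """Assumed convention: the target keeps only the masked tokens."""
        return [t for t, m in zip(tokens, mask) if m]


# Usage: build one mask per sample and apply it to source and target.
masker = Masker(strategy="block", masking_rate=0.25, seed=42)
tokens = list(range(8))
mask = masker.make_mask(len(tokens))
source = masker.mask_source(tokens, mask)
target = masker.mask_target(tokens, mask)
```

Sharing one mask between the source and target calls keeps the two views consistent. Varying the strategy during training could then be as simple as changing masker.strategy or masker.masking_rate between batches, though a scheduling interface is left open here.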
Describe alternatives you've considered
No response
Additional context
No response
Organisation
No response