Masking class and separation of tokenizer_masking #380

Open
@shmh40

Description

Is your feature request related to a problem? Please describe.

Masking currently happens inside tokenizer_masking, in batchify_source and batchify_target. There is no dedicated masking class, which we need in order to implement different masking strategies, vary them during training, and so on.

Describe the solution you'd like

The first solution is a bare-bones implementation. Create a masker class that implements a couple of simple masking strategies and is instantiated in multi_stream_data_sampler before the tokenizer is instantiated. The class should take input data and return the masked data, and this needs to happen for both source and target data.

tokenizer_masking should then be adjusted to do only tokenisation, and no longer do masking at the same time.
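
To make the proposal concrete, here is a rough sketch of what such a masker could look like. Everything in it is hypothetical: the class name `Masker`, the strategy names `"random"` and `"block"`, the `rate` parameter, and the `split` method are illustrative only and do not reflect existing code. It also assumes that the source keeps the unmasked items and the target receives the masked ones, which may not match how batchify_source and batchify_target divide the data today.

```python
import numpy as np


class Masker:
    """Hypothetical masker: decides which items of a sequence are hidden."""

    def __init__(self, strategy="random", rate=0.5, seed=None):
        self.strategy = strategy  # illustrative names: "random" or "block"
        self.rate = rate          # fraction of items to mask, in [0, 1]
        self.rng = np.random.default_rng(seed)

    def make_mask(self, n):
        """Return a boolean array of length n; True marks a masked item."""
        n_masked = int(round(self.rate * n))
        mask = np.zeros(n, dtype=bool)
        if self.strategy == "random":
            # Mask n_masked positions chosen uniformly without replacement.
            idx = self.rng.choice(n, size=n_masked, replace=False)
            mask[idx] = True
        elif self.strategy == "block":
            # Mask one contiguous run of n_masked positions.
            start = int(self.rng.integers(0, n - n_masked + 1))
            mask[start:start + n_masked] = True
        else:
            raise ValueError(f"unknown masking strategy: {self.strategy!r}")
        return mask

    def split(self, data, mask):
        """Assumed convention: source = unmasked items, target = masked items."""
        return data[~mask], data[mask]


# Usage: mask first, then hand each part to a tokenisation-only step.
masker = Masker(strategy="random", rate=0.25, seed=0)
data = np.arange(8)
mask = masker.make_mask(len(data))
source, target = masker.split(data, mask)
```

Because the strategy is held on the instance, it could be changed between batches or epochs (for example `masker.strategy = "block"`) without touching the tokenizer, which is the separation this issue asks for.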

Describe alternatives you've considered

No response

Additional context

No response

Organisation

No response

Labels

enhancement (New feature or request)
