
Translation DataPreProcess

translation_datapreprocess

Overview

Component that preprocesses data for a translation task. See the docs to learn more.

Version: 0.0.32

View in Studio: https://ml.azure.com/registries/azureml/components/translation_datapreprocess/version/0.0.32

Inputs

task arguments

sample input

{"en":"Others have dismissed him as a joke.","ro":"Al\u021bii l-au numit o glum\u0103."}

If the dataset follows the pattern above, source_lang is en and target_lang is ro.
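As a rough illustration of how the language keys relate to a record, the sketch below checks that one JSON Lines record contains both keys. The helper name `check_record` is hypothetical and not part of the component; it assumes each line of the dataset is valid JSON.

```python
import json


def check_record(line: str, source_lang: str, target_lang: str) -> dict:
    """Parse one JSON Lines record and verify it has both language keys.

    Hypothetical helper for illustration; the component's own validation
    may differ.
    """
    record = json.loads(line)
    missing = [key for key in (source_lang, target_lang) if key not in record]
    if missing:
        raise KeyError(f"record is missing language key(s): {missing}")
    return record


# The sample record from above, with source_lang="en" and target_lang="ro".
sample = '{"en": "Others have dismissed him as a joke.", "ro": "Al\\u021bii l-au numit o glum\\u0103."}'
record = check_record(sample, source_lang="en", target_lang="ro")
print(record["en"])
print(record["ro"])
```

Running a check like this over a few lines of the training file before submitting a job can catch a mismatch between the dataset keys and the values passed as source_lang and target_lang.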

source language codes

t5 - English (en)

mbart - Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese, Simplified (zh_CN)

target language codes

t5 - French (fr), German (de), Romanian (ro)

mbart - Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese, Simplified (zh_CN)

Name	Description	Type	Default	Optional
source_lang	Key for the source-language text in an example. The key should be the abbreviated language code understood by the tokenizer; check the selected model's language codes above when setting it.	string		False
target_lang	Key for the target-language text in an example. The key should be the abbreviated language code understood by the tokenizer; check the selected model's language codes above when setting it.	string		False
batch_size	Number of examples to batch before calling the tokenization function	integer	1000	True
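To make the role of batch_size concrete, here is a minimal sketch of grouping examples into batches before a tokenization function is called on each batch. The generator `iter_batches` is a hypothetical name used only for illustration; how the component batches internally is not stated here.

```python
from typing import Iterable, Iterator, List


def iter_batches(examples: Iterable[dict], batch_size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` examples (illustrative sketch)."""
    batch: List[dict] = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The final batch may be smaller than batch_size.
        yield batch


# Five toy examples grouped with batch_size=2 give batches of 2, 2 and 1.
toy = [{"en": f"sentence {i}", "ro": f"propozitie {i}"} for i in range(5)]
batch_sizes = [len(b) for b in iter_batches(toy, batch_size=2)]
print(batch_sizes)
```

A larger batch_size means fewer calls to the tokenization function at the cost of more memory per call.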

Tokenization params

Name	Description	Type	Default	Optional	Enum
pad_to_max_length	If set to True, the returned sequences will be padded according to the model's padding side and padding index, up to their `max_seq_length`. If no `max_seq_length` is specified, the padding is done up to the model's max length.	string	false	True	['true', 'false']
max_seq_length	Maximum length to pad to when the pad_to_max_length parameter is set to `true`. The default of -1 pads up to the model's max length; any other value pads to `max_seq_length`.	integer	-1	True
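The interaction between the two tokenization parameters can be summarized with a small sketch. The function `resolve_pad_length` and the `model_max_length` argument are hypothetical names introduced for illustration; the sketch only restates the rules in the table and is not the component's implementation.

```python
from typing import Optional


def resolve_pad_length(pad_to_max_length: str, max_seq_length: int,
                       model_max_length: int) -> Optional[int]:
    """Return the length sequences are padded to, or None if no fixed padding.

    `model_max_length` stands in for the model's max length (an assumption;
    in practice it would come from the tokenizer or model config).
    """
    # pad_to_max_length is a string enum: 'true' or 'false'.
    if pad_to_max_length.lower() != "true":
        return None
    # -1 means "no explicit length": fall back to the model's max length.
    if max_seq_length == -1:
        return model_max_length
    return max_seq_length


print(resolve_pad_length("false", 128, 512))
print(resolve_pad_length("true", -1, 512))
print(resolve_pad_length("true", 128, 512))
```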

Data inputs

Either train_file_path or train_mltable_path must be passed. If both are passed, the mltable path takes precedence. The validation and test paths are optional; if they are not passed, the missing sets are split automatically from the train data. If both the validation and test files are missing, 10% of the train data is assigned to each of them and the remaining 80% is used for training. If only one of them is missing, 20% of the train data is assigned to it and the remaining 80% is used for training.
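The automatic split rules can be sketched as follows. The function names are hypothetical, and the rounding behaviour (integer division, with any remainder left in the train set) is an assumption, since the exact rounding is not specified here.

```python
def split_percentages(has_validation: bool, has_test: bool) -> dict:
    """Percentages of the train data assigned to each split (sketch)."""
    if not has_validation and not has_test:
        return {"train": 80, "validation": 10, "test": 10}
    if not has_validation:
        return {"train": 80, "validation": 20, "test": 0}
    if not has_test:
        return {"train": 80, "validation": 0, "test": 20}
    # Both files were provided, so nothing is carved out of the train data.
    return {"train": 100, "validation": 0, "test": 0}


def split_counts(n_train: int, has_validation: bool, has_test: bool) -> dict:
    """Number of train examples per split; rounding here is an assumption."""
    pct = split_percentages(has_validation, has_test)
    n_validation = n_train * pct["validation"] // 100
    n_test = n_train * pct["test"] // 100
    return {
        "train": n_train - n_validation - n_test,
        "validation": n_validation,
        "test": n_test,
    }


print(split_counts(1000, has_validation=False, has_test=False))
print(split_counts(1000, has_validation=True, has_test=False))
```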

Name	Description	Type	Optional
train_file_path	Path to the registered training data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file	True
validation_file_path	Path to the registered validation data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file	True
test_file_path	Path to the registered test data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file	True
train_mltable_path	Path to the registered training data asset in `mltable` format.	mltable	True
validation_mltable_path	Path to the registered validation data asset in `mltable` format.	mltable	True
test_mltable_path	Path to the registered test data asset in `mltable` format.	mltable	True
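The precedence between the file and mltable inputs can be illustrated with a short sketch. The function `resolve_input` and the example paths are hypothetical; the logic only mirrors the stated rule that the mltable path wins when both are given and that a training input is required.

```python
from typing import Optional, Tuple


def resolve_input(file_path: Optional[str], mltable_path: Optional[str],
                  required: bool = False) -> Optional[Tuple[str, str]]:
    """Pick which input to use for one split (illustrative sketch).

    Returns ("mltable", path) or ("file", path), or None when an optional
    split has no input at all.
    """
    if mltable_path:
        # The mltable path takes precedence when both are passed.
        return ("mltable", mltable_path)
    if file_path:
        return ("file", file_path)
    if required:
        raise ValueError("either a file path or an mltable path must be passed")
    return None


# Hypothetical paths, for illustration only.
train = resolve_input("data/train.jsonl", "data/train_mltable", required=True)
validation = resolve_input("data/validation.jsonl", None)
test = resolve_input(None, None)
print(train, validation, test)
```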

Dataset parameters

Name	Description	Type	Default	Optional	Enum
model_selector_output	output folder of model selector containing model metadata like config, checkpoints, tokenizer config	uri_folder		False

Outputs

Name	Description	Type
output_dir	The folder contains the tokenized output of the train, validation and test data along with the tokenizer files used to tokenize the data	uri_folder

Environment

azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/41
