components text_generation_datapreprocess

Text Generation DataPreProcess

text_generation_datapreprocess

Overview

Component to preprocess data for the text generation task.

Version: 0.0.79

View in Studio: https://ml.azure.com/registries/azureml/components/text_generation_datapreprocess/version/0.0.79

Inputs

Text Generation task arguments

Name	Description	Type	Default	Optional
text_key	Key for the text in an example. Format your data keeping in mind that, during fine-tuning, the text is concatenated with the ground truth in the form text + ground_truth. For example, "text"="knock knock\n", "ground_truth"="who's there" is treated as "knock knock\nwho's there".	string		False
ground_truth_key	Key for the ground truth in an example. A separate column is used for the ground truth to support use cases such as summarization, translation, and question answering, which can be recast as text generation where both text and ground truth are needed. This separation is also useful for calculating metrics. For example, "text"="Summarize this dialog:\n{input_dialogue}\nSummary:\n", "ground_truth"="{summary of the dialogue}".	string		True
batch_size	Number of examples to batch before calling the tokenization function	integer	1000	True
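
The concatenation described for `text_key` and `ground_truth_key`, and the grouping implied by `batch_size`, can be sketched as follows. This is only an illustration of the documented behavior, not the component's actual code: the helper names `build_training_text` and `batched` are hypothetical, and the column names `text` and `ground_truth` are taken from the examples in the table.

```python
from typing import Dict, Iterator, List, Optional


def build_training_text(example: Dict[str, str], text_key: str,
                        ground_truth_key: Optional[str] = None) -> str:
    """Hypothetical helper: join text and ground truth with no separator."""
    text = example[text_key]
    if ground_truth_key is None:
        # ground_truth_key is optional, so the text may be used on its own.
        return text
    return text + example[ground_truth_key]


def batched(examples: List[Dict[str, str]],
            batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Hypothetical helper: yield slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = [
    {"text": "knock knock\n", "ground_truth": "who's there"},
    {"text": "Summarize this dialog:\nA: hi\nB: hello\nSummary:\n",
     "ground_truth": "Two people greet each other."},
]

joined = [build_training_text(e, "text", "ground_truth") for e in examples]
batch_sizes = [len(b) for b in batched(examples, batch_size=1)]
```

Because nothing is inserted between the two fields, any separator you want (a newline, a space, a "Summary:" cue) has to be part of the text itself, as in the "knock knock\n" example.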

Tokenization params

Name	Description	Type	Default	Optional	Enum
pad_to_max_length	If set to true, the returned sequences are padded according to the model's padding side and padding index, up to `max_seq_length`. If no `max_seq_length` is specified, padding is done up to the model's max length.	string	false	True	['true', 'false']
max_seq_length	Length to which sequences are padded. The default, -1, means padding is done up to the model's max length; any other value pads to `max_seq_length`.	integer	-1	True
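
How the two tokenization parameters interact can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the component's implementation: `resolve_pad_length` and `pad_right` are hypothetical helpers, the model max length of 2048 is an invented value, and right-side padding with pad id 0 is assumed purely for the example (the component uses the model's own padding side and padding index). The source does not say what happens when `max_seq_length` exceeds the model's max length, so the sketch does not handle that case.

```python
from typing import List, Optional


def parse_flag(value: str) -> bool:
    """pad_to_max_length is passed as the string 'true' or 'false'."""
    return value.strip().lower() == "true"


def resolve_pad_length(pad_to_max_length: str, max_seq_length: int,
                       model_max_length: int) -> Optional[int]:
    """Hypothetical helper: return the target length, or None if no padding."""
    if not parse_flag(pad_to_max_length):
        return None
    if max_seq_length == -1:
        # -1 is the documented default: fall back to the model's max length.
        return model_max_length
    return max_seq_length


def pad_right(token_ids: List[int], target: Optional[int],
              pad_id: int = 0) -> List[int]:
    """Hypothetical helper: pad on the right (an assumption for this sketch)."""
    if target is None or len(token_ids) >= target:
        return list(token_ids)
    return token_ids + [pad_id] * (target - len(token_ids))


MODEL_MAX_LENGTH = 2048  # assumed value, for illustration only

default_target = resolve_pad_length("true", -1, MODEL_MAX_LENGTH)
explicit_target = resolve_pad_length("true", 512, MODEL_MAX_LENGTH)
no_padding = resolve_pad_length("false", 512, MODEL_MAX_LENGTH)
padded = pad_right([5, 6, 7], 5)
```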

Inputs

Name	Description	Type	Optional
train_file_path	Path to the registered training data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file	True
validation_file_path	Path to the registered validation data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file	True
test_file_path	Path to the registered test data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file	True
train_mltable_path	Path to the registered training data asset in `mltable` format.	mltable	True
validation_mltable_path	Path to the registered validation data asset in `mltable` format.	mltable	True
test_mltable_path	Path to the registered test data asset in `mltable` format.	mltable	True
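
As an illustration of one of the supported file formats, the sketch below writes and reads a small `jsonl` file using only the standard library. It assumes the usual JSON Lines convention of one JSON object per line; the column names `text` and `ground_truth` are examples and should match whatever you pass as `text_key` and `ground_truth_key`. The file name `train.jsonl` and the helpers `write_jsonl` and `read_jsonl` are hypothetical, and registering the file as a data asset is not shown.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_jsonl(records: List[Dict[str, str]], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, str]]:
    """Read the file back, skipping blank lines."""
    with path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


records = [
    {"text": "knock knock\n", "ground_truth": "who's there"},
    {"text": "Summarize this dialog:\nA: hi\nB: hello\nSummary:\n",
     "ground_truth": "Two people greet each other."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    train_file = Path(tmp_dir) / "train.jsonl"  # hypothetical file name
    write_jsonl(records, train_file)
    loaded = read_jsonl(train_file)
    line_count = len(train_file.read_text(encoding="utf-8").splitlines())
```

`json.dumps` escapes embedded newlines as `\n`, so each record stays on a single physical line even when the text itself contains line breaks.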

Dataset parameters

Name	Description	Type	Default	Optional	Enum
model_selector_output	Output folder of the model selector, containing model metadata such as the config, checkpoints, and tokenizer config.	uri_folder		False

Validation parameters

Name	Description	Type	Default	Optional	Enum
system_properties	Validation parameters propagated from pipeline.	string		True

Outputs

Name	Description	Type
output_dir	Folder containing the tokenized train, validation, and test data, along with the tokenizer files used to tokenize the data.	uri_folder

Environment

azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/105
