Fine-tune whisper for translation task "speech in custom dialect to arabic " using a cutom dataset #2311

mrlabied · 2024-08-24T14:28:52Z

mrlabied
Aug 24, 2024

Hello,

I have successfully fine-tuned Whisper-Tiny for transcription tasks on my custom speech dataset. However, when I attempted to fine-tune the model for a translation task using the same dataset, I encountered an issue. I selected only a few items from my dataset for the initial training to ensure the pipeline was functioning correctly before adjusting the training arguments.

The model training completed successfully, but when I use the fine-tuned model for inference, the output is always in English (which is incorrect). I need the output to be in Arabic script.

I would greatly appreciate any help in figuring out what I might have missed during the fine-tuning process that could be causing this issue.

Thank you!

fine-tuning code:

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
from datasets import load_dataset, load_metric, Audio, DatasetDict, Dataset


my_voice_dataset_train = DatasetDict()
my_voice_dataset_test = DatasetDict()

fileList = ["wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav","wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav","wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav","wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav",]
sentenceList = ["واش كملتي؟", "مسا الخير" , "جفنه","واش كملتي؟", "مسا الخير" , "جفنه","واش كملتي؟", "مسا الخير" , "جفنه","واش كملتي؟", "مسا الخير" , "جفنه"]
sentenceTranslationList = ["هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس","هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس","هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس","هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس"]
import torchaudio

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch)
    return {"array":speech_array[0].numpy(),"sampling_rate":sampling_rate}

common_voice["train"] = Dataset.from_dict({"path": fileList, "sentence": sentenceList, "translation":sentenceTranslationList, 'audio':[ speech_file_to_array_fn(f)  for f in fileList]})
common_voice["test"]= Dataset.from_dict({"path": fileList, "sentence": sentenceList, "translation":sentenceTranslationList, 'audio':[ speech_file_to_array_fn(f)  for f in fileList]})
 

"""## Prepare Feature Extractor, Tokenizer and Data

We'll go through details for setting-up the feature extractor and tokenizer one-by-one!

### Load WhisperFeatureExtractor
 

We'll load the feature extractor from the pre-trained checkpoint with the default values:
"""

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

"""### Load WhisperTokenizer 
"""

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")

"""### Combine To Create A WhisperProcessor

To simplify using the feature extractor and tokenizer, we can _wrap_
both into a single `WhisperProcessor` class. This processor object
inherits from the `WhisperFeatureExtractor` and `WhisperProcessor`,
and can be used on the audio inputs and model predictions as required.
In doing so, we only need to keep track of two objects during training:
the `processor` and the `model`:
"""

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")

"""### Prepare Data

Let's print the first example of the Common Voice dataset to see
what form the data is in:
"""


from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

"""Re-loading the first audio sample in the Common Voice dataset will resample
it to the desired sampling rate:
"""

print(common_voice["train"][0])

"""Now we can write a function to prepare our data ready for the model:
1. We load and resample the audio data by calling `batch["audio"]`. As explained above, 🤗 Datasets performs any necessary resampling operations on the fly.
2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.
3. We encode the transcriptions to label ids through the use of the tokenizer.
"""

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["translation"]).input_ids
    return batch



"""We can apply the data preparation function to all of our training examples using dataset's `.map` method. The argument `num_proc` specifies how many CPU cores to use. Setting `num_proc` > 1 will enable multiprocessing. If the `.map` method hangs with multiprocessing, set `num_proc=1` and process the dataset sequentially."""

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])

"""## Training and Evaluation
 
Once we've fine-tuned the model, we will evaluate it on the test data to verify that we have correctly trained it
to translate speech in Darija.

### Load a Pre-Trained Checkpoint

We'll start our fine-tuning run from the pre-trained Whisper `small` checkpoint,
the weights for which we need to load from the Hugging Face Hub. Again, this
is trivial through use of 🤗 Transformers!
"""

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.generation_config.language = "Arabic"
model.generation_config.task = "translate"
model.generation_config.forced_decoder_ids =None

"""### Define a Data Collator """



import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

"""Let's initialise the data collator we've just defined:"""

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

"""### Evaluation Metrics

We'll use the word error rate (WER) metric, the 'de-facto' metric for assessing
ASR systems. For more information, refer to the WER [docs](https://huggingface.co/metrics/wer). We'll load the WER metric from 🤗 Evaluate:
"""

import evaluate

metric = evaluate.load("bleu")

"""We then simply have to define a function that takes our model
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:
"""

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # Replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    # Decode predictions and labels
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Tokenize the outputs for BLEU metric
    pred_str = [p.split() for p in pred_str]
    label_str = [[l.split()] for l in label_str]

    # Calculate BLEU score
    bleu = metric.compute(predictions=pred_str, references=label_str)
    return {"bleu": bleu["bleu"]}

"""### Define the Training Configuration

In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).
"""

from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-darija-translate",  # change to a repo name of your choice
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=25,
    max_steps=25,
    gradient_checkpointing=False,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=5,
    eval_steps=5,
    logging_steps=5,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
    greater_is_better=False,
    push_to_hub=False,
)


"""**Note**: if one does not want to upload the model checkpoints to the Hub,
set `push_to_hub=False`.

We can forward the training arguments to the 🤗 Trainer along with our model,
dataset, data collator and `compute_metrics` function:
"""

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


"""We'll save the processor object once before starting training. Since the processor is not trainable, it won't change over the course of training:"""

processor.save_pretrained(training_args.output_dir)

"""### Training
To launch training, simply execute:
"""

trainer.train()


kwargs = {
    "dataset_tags": "darija-c",
    "dataset": "Darija-C",  # a 'pretty' name for the training dataset
    "dataset_args": "config: ar, split: test",
    "language": "ar",
    "model_name": "Whisper tiny darija translate",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}



trainer.save_model("./whisper-tiny-darija-translate")

"""## Building a Demo

Now that we've fine-tuned our model we can build a demo to show
off its Translate capabilities! We'll make use of 🤗 Transformers
`pipeline`, which will take care of the entire ASR pipeline,
right from pre-processing the audio inputs to decoding the
model predictions.

take the first audio of our dataset then input it to
our fine-tuned Whisper model to translate the corresponding text:
"""

from transformers import pipeline
import gradio as gr
from transformers import WhisperTokenizer
from transformers import pipeline, WhisperForConditionalGeneration, WhisperProcessor

def test_model():
    model = WhisperForConditionalGeneration.from_pretrained("whisper-tiny-darija-translate")
    processor = WhisperProcessor.from_pretrained("whisper-tiny-darija-translate")
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")
    pipe = pipeline("automatic-speech-recognition", model=model, tokenizer=tokenizer,
                    feature_extractor=processor.feature_extractor)
    audio=fileList[0]
    text = pipe(audio)["text"]
    print("Reference input : ",sentenceTranslationList[0])
    print("translation output : ",text)

test_model()

console output

Answered by ryanheise

Aug 28, 2024

That would indicate training was not successful. But the issue here is likely that the translation task was intended for translation into English, and trained for that, so this is not a minor finetuning, you would need a lot of training to get it to forget what it's already trained to do. (How much data do you have?)

Others have observed that Whisper can sometimes translate into other languages when using the transcribe task instead of the translate task and setting the language to be the target translation language. While that behaviour probably wasn't intended, you might have better success finetuning under those parameters, again with sufficient data.

View full answer

glangford · 2024-08-24T15:24:15Z

glangford
Aug 24, 2024

The model training completed successfully, but when I use the fine-tuned model for inference, the output is always in English (which is incorrect). I need the output to be in Arabic script.

Not sure about the fine tuning process and the resulting model, but in test_model() should this have task="transcribe" rather than translate?

    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Arabic", task="transcribe")

1 reply

mrlabied Aug 25, 2024
Author

No, because I fine-tuned Whisper for translation, not transcription. However, I gave it a try, and I still get results in English.

ryanheise · 2024-08-26T03:59:27Z

ryanheise
Aug 26, 2024

The model training completed successfully

Do you mean that your evaluation metrics achieved a desirable level during training? If so, what were they?

3 replies

mrlabied Aug 27, 2024
Author

Unfortunately, the evaluation metrics did not achieve a desirable level during the training process. Specifically, the BLEU score remained at 0.0, indicating that the model's translations did not match the reference translations.

ryanheise Aug 28, 2024

That would indicate training was not successful. But the issue here is likely that the translation task was intended for translation into English, and trained for that, so this is not a minor finetuning, you would need a lot of training to get it to forget what it's already trained to do. (How much data do you have?)

Others have observed that Whisper can sometimes translate into other languages when using the transcribe task instead of the translate task and setting the language to be the target translation language. While that behaviour probably wasn't intended, you might have better success finetuning under those parameters, again with sufficient data.

Answer selected by mrlabied

mrlabied Aug 31, 2024
Author

Yes for your answer, After changing the training parameters the generated model can successfully translate speech into text: from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
output_dir="whisper-tiny-darija-translate", # change to a repo name of your choice
per_device_train_batch_size=1,
gradient_accumulation_steps=4, # increase by 2x for every 2x decrease in batch size
learning_rate=1e-5,
warmup_steps=40,
max_steps=400,
gradient_checkpointing=False,
fp16=True,
evaluation_strategy="steps",
per_device_eval_batch_size=1,
predict_with_generate=True,
generation_max_length=225,
save_steps=100,
eval_steps=100,
logging_steps=25,
report_to=["tensorboard"],
load_best_model_at_end=True,
metric_for_best_model="bleu",
greater_is_better=False,
push_to_hub=False,
)

Ayman-AITACHOUR · 2024-10-17T07:54:08Z

Ayman-AITACHOUR
Oct 17, 2024

Please, what is the size of your dataset?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fine-tune whisper for translation task "speech in custom dialect to arabic " using a cutom dataset #2311

{{title}}

Replies: 3 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Fine-tune whisper for translation task "speech in custom dialect to arabic " using a cutom dataset #2311

mrlabied Aug 24, 2024

fine-tuning code:

console output

Replies: 3 comments · 4 replies

glangford Aug 24, 2024

mrlabied Aug 25, 2024 Author

ryanheise Aug 26, 2024

mrlabied Aug 27, 2024 Author

ryanheise Aug 28, 2024

mrlabied Aug 31, 2024 Author

Ayman-AITACHOUR Oct 17, 2024

mrlabied
Aug 24, 2024

Replies: 3 comments 4 replies

glangford
Aug 24, 2024

mrlabied Aug 25, 2024
Author

ryanheise
Aug 26, 2024

mrlabied Aug 27, 2024
Author

mrlabied Aug 31, 2024
Author

Ayman-AITACHOUR
Oct 17, 2024