Description
Hello, I have a 20-minute audio file and I'm trying to align my text at a specific time. It seems like the original PyTorch implementation is more accurate for my trimmed audio, although your implementation is way faster.
I trimmed my audio with this:
audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)
x0 = int((start[1]-1) * 16000)
x1 = int((end[1]+1) * 16000)
trim_waveform = audio_waveform[x0:x1]
start[1] is the starting timestamp and end[1] is the ending timestamp, both in seconds.
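For example (with hypothetical timestamps), if start[1] = 10.0 and end[1] = 16.0, then x0 = int(9.0 * 16000) = 144000 and x1 = int(17.0 * 16000) = 272000, so the trim keeps one extra second of padding on each side of the segment.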
This repo's implementation:
text = 'halo halo assalamualaikum warahmatullahi wabarakatuh nah di video ini saya akan sharing'
emissions, stride = generate_emissions(
    alignment_model, trim_waveform, batch_size=batch_size
)
tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)
segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)
spans = get_spans(tokens_starred, segments, blank_token)
word_timestamps = postprocess_results(text_starred, spans, stride, scores)
PyTorch implementation:
waveform, sr = torchaudio.load(audio_path)
waveform = F.resample(waveform, sr, 16000)  # resample to 16 kHz before trimming
x0 = int((start[1]-1) * 16000)
x1 = int((end[1]+1) * 16000)
trim_waveform = waveform[:, x0:x1]
transcript = 'halo halo assalamualaikum warahmatullahi wabarakatuh nah di video ini saya akan sharing'.split()
tokens = tokenizer(transcript)
emission1, token_spans1 = compute_alignments1(trim_waveform, transcript)
num_frames = emission1.size(1)
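(compute_alignments1 isn't shown above; it is essentially the helper from the torchaudio MMS_FA forced-alignment tutorial. A minimal sketch, assuming the standard MMS_FA bundle rather than my exact code:)

import torch
from torchaudio.pipelines import MMS_FA as bundle

device = "cuda" if torch.cuda.is_available() else "cpu"
model = bundle.get_model().to(device)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

def compute_alignments1(waveform, transcript):
    # run the acoustic model, then align the tokenized transcript against the emission
    with torch.inference_mode():
        emission, _ = model(waveform.to(device))
        token_spans = aligner(emission[0], tokenizer(transcript))
    return emission, token_spans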
This repo's scores:
(0) 00:01 - 00:01: halo (-9.54248046875)
(1) 00:01 - 00:01: halo (-9.9814453125)
(2) 00:01 - 00:02: assalamualaikum (-30.47718048095703)
(3) 00:02 - 00:02: warahmatullahi (-33.127105712890625)
(4) 00:02 - 00:03: wabarakatuh (-24.040283203125)
(5) 00:04 - 00:04: nah (-6.6703033447265625)
(6) 00:04 - 00:04: di (-0.25970458984375)
(7) 00:04 - 00:04: video (-2.6971397399902344)
(8) 00:04 - 00:05: ini (-5.02679443359375)
(9) 00:05 - 00:05: saya (-11.3028564453125)
(10) 00:05 - 00:05: akan (-1.8131332397460938)
(11) 00:06 - 00:06: sharing (-23.826171875)
PyTorch scores:
(0) 00:01 - 00:01: halo (28.99)
(1) 00:01 - 00:01: halo (1.32)
(2) 00:01 - 00:02: assalamualaikum (53.66)
(3) 00:02 - 00:02: warahmatullahi (30.17)
(4) 00:02 - 00:04: wabarakatuh (34.84)
(5) 00:04 - 00:04: nah (29.71)
(6) 00:04 - 00:04: di (71.25)
(7) 00:04 - 00:04: video (72.95)
(8) 00:04 - 00:04: ini (65.45)
(9) 00:04 - 00:05: saya (56.08)
(10) 00:05 - 00:05: akan (70.26)
(11) 00:05 - 00:05: sharing (49.96)
These 4 words:
(2) 00:01 - 00:02: assalamualaikum (-30.47718048095703)
(3) 00:02 - 00:02: warahmatullahi (-33.127105712890625)
(4) 00:02 - 00:03: wabarakatuh (-24.040283203125)
(11) 00:06 - 00:06: sharing (-23.826171875)
should have high confidence scores. Correct me if I'm wrong, but in your implementation a very negative score means low confidence, right? Why is this happening?
I also trimmed my audio differently: for your implementation I use trim_waveform = audio_waveform[x0:x1], but for the PyTorch implementation I use trim_waveform = waveform[:, x0:x1]. Is that related to this problem? Thank you.
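For what it's worth, here is a quick shape check (hypothetical, assuming both waveforms are torch tensors) to rule out a slicing mismatch: if load_audio returns a 1-D tensor (samples,), audio_waveform[x0:x1] slices time as intended, but if it returns a 2-D tensor (channels, samples), that indexing would slice the channel axis instead.

print(audio_waveform.shape)  # e.g. torch.Size([N]) or torch.Size([C, N])
print(waveform.shape)        # torchaudio.load gives (channels, samples)

# slicing the last axis explicitly works for either shape
trim_waveform = audio_waveform[..., x0:x1]
print(trim_waveform.shape)   # expect about (end[1] - start[1] + 2) * 16000 samples on the last axis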