Task: Language Identification #85

ibevers · 2024-07-04T14:07:01Z

fabiocat93 · 2024-07-04T20:44:03Z

Correct. We may want to have two modules: one in senselab.text and one in senselab.audio.

fabiocat93 · 2024-11-15T22:36:09Z

For now, I have implemented audio-based language identification using SpeechBrain's models. These models assume that each clip contains only one language. In the future, we may want to integrate the Whisper model. This should be easy to implement, since we already use the same model for speech-to-text, and it would allow identifying multiple languages in the same clip. Here is a first draft of how this works with Whisper:

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-tiny"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# return_language=True adds the detected language to each returned chunk.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    return_language=True,
)

# Long-form LibriSpeech sample used as test audio.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
# Each entry in result["chunks"] carries a "language" field; print the first one.
print(result["chunks"][0]["language"])
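Since the point of using Whisper here is to handle clips with more than one language, a natural next step is to summarize the per-chunk languages instead of reading only the first chunk. The following is a minimal sketch, not part of the implementation: `summarize_chunk_languages` is a hypothetical helper, and `example_chunks` is made-up data shaped like `result["chunks"]` from the draft above.

```python
from collections import Counter
from typing import Dict, List


def summarize_chunk_languages(chunks: List[Dict]) -> Dict[str, int]:
    """Count how many chunks were tagged with each language.

    Chunks without a language value are skipped.
    """
    return dict(Counter(c["language"] for c in chunks if c.get("language")))


# Made-up data shaped like result["chunks"]; not real model output.
example_chunks = [
    {"language": "english", "text": " Hello there."},
    {"language": "english", "text": " How are you?"},
    {"language": "french", "text": " Bonjour."},
]

counts = summarize_chunk_languages(example_chunks)
print(counts)
```

A clip could then be treated as multilingual when `counts` has more than one key, although very short chunks may be mislabeled, so some minimum count per language is probably worth considering.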

fabiocat93 · 2024-11-20T15:49:15Z

@ibevers I have implemented the speech-based version of this. Can you do the same for text-based language identification? Simply integrating Hugging Face models for this should be more than fine for now (https://huggingface.co/models?search=language%20detection).
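As a rough starting point for the text side, something along these lines might work. This is a sketch under assumptions, not the senselab API: the function names are hypothetical, and `papluca/xlm-roberta-base-language-detection` is just one candidate model from the Hugging Face Hub rather than a decided choice. The classifier is passed in as an argument so the helper can be exercised without downloading a model.

```python
from typing import Callable, Dict, List


def build_text_language_classifier(
    model_id: str = "papluca/xlm-roberta-base-language-detection",  # assumed candidate model
) -> Callable:
    """Create a Hugging Face text-classification pipeline (downloads the model)."""
    # Imported lazily so identify_text_language has no hard dependency on transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def identify_text_language(texts: List[str], classifier: Callable) -> List[Dict]:
    """Return the top language label and its score for each input text.

    `classifier` is expected to map a list of strings to a list of
    {"label": ..., "score": ...} dicts, one per input.
    """
    predictions = classifier(texts)
    return [
        {"text": text, "language": pred["label"], "score": pred["score"]}
        for text, pred in zip(texts, predictions)
    ]


# Usage with a real model (requires transformers and a model download):
# classifier = build_text_language_classifier()
# print(identify_text_language(["Bonjour tout le monde"], classifier))
```

Keeping model construction separate from the prediction helper would also make it easy to swap in a different Hub model later.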

ibevers · 2024-11-22T14:25:30Z

@fabiocat93 Thank you for your patience. I can take a look at this after next week.

fabiocat93 · 2024-12-23T17:12:14Z

> @fabiocat93 Thank you for your patience. I can take a look at this after next week

I will extend your deadline one more time (from Dec 16 to Jan 14). Please let me know if you face any blockers.

fabiocat93 added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Aug 9, 2024

fabiocat93 added this to senselab Nov 14, 2024

fabiocat93 moved this to Todo in senselab Nov 15, 2024

fabiocat93 changed the title ~~Task: Language Detection~~ Task: Language Identification Nov 20, 2024

fabiocat93 self-assigned this Nov 20, 2024

fabiocat93 moved this from Todo to In Progress in senselab Nov 20, 2024

fabiocat93 assigned ibevers Nov 20, 2024

fabiocat93 linked a pull request Nov 20, 2024 that will close this issue

Adding language identification from text and speech #207

Draft

1 task


ibevers commented Jul 4, 2024 (edited by fabiocat93)

Description

Tasks

Audio

Text

Freeform Notes
