Task: Language Identification #85

Open
5 of 10 tasks
ibevers opened this issue Jul 4, 2024 · 5 comments · May be fixed by #207
Assignees
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@ibevers
Collaborator

ibevers commented Jul 4, 2024

Description

As far as I understand, this task should take an Audio or a ScriptLine as input and output a Language object.

Tasks

Audio

  • Create a general API
  • Select and implement a default model
  • Get working with the default model
  • Tutorial
  • Documentation

Text

  • Create a general API
  • Select and implement a default model
  • Get working with the default model
  • Tutorial
  • Documentation

Freeform Notes

We might want examples that cover a wide range of languages, or we could just trust the model developer. Ideally, the output should support multiple labels, so that if a given input includes more than one language, the output reflects that.
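A minimal sketch of the interface described above, assuming stand-in types. `Language`, `LanguagePrediction`, and `identify_languages` are hypothetical names used for illustration and are not senselab's actual classes or functions. The `scorer` argument is a placeholder for whichever default model ends up being selected, and the threshold value is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Language:
    """Hypothetical stand-in for the Language object mentioned above."""

    code: str  # e.g. a language code such as "en"


@dataclass(frozen=True)
class LanguagePrediction:
    """One detected language together with the model's score for it."""

    language: Language
    score: float


def identify_languages(
    item: object,
    scorer: Callable[[object], Dict[str, float]],
    threshold: float = 0.2,
) -> List[LanguagePrediction]:
    """Return every language whose score reaches `threshold`, best first.

    `scorer` is a placeholder for the default model: it maps an input
    (audio or text) to a {language_code: score} dictionary.
    """
    scores = scorer(item)
    kept = [
        LanguagePrediction(Language(code), score)
        for code, score in scores.items()
        if score >= threshold
    ]
    return sorted(kept, key=lambda p: p.score, reverse=True)
```

Returning a list rather than a single Language is one way to let the output reflect inputs that mix languages; a single-language model would simply return a list of length one.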

@fabiocat93
Collaborator

Correct. We may want to have two modules, one in senselab.text and one in senselab.audio.

@fabiocat93 fabiocat93 added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Aug 9, 2024
@fabiocat93 fabiocat93 moved this to Todo in senselab Nov 15, 2024
@fabiocat93
Collaborator

fabiocat93 commented Nov 15, 2024

For now, I have implemented audio-based language identification using SpeechBrain's models. These models assume that each clip contains only one language. In the future, we may want to integrate the Whisper model. This should be easy to implement, since we already use the same model for speech-to-text, and it allows the identification of multiple languages in the same clip. Here is a first draft of how this works with Whisper:

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-tiny"

# Load the Whisper checkpoint and move it to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
model.to(device)

# The processor bundles the tokenizer and the feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

# return_language=True asks the pipeline to attach a language to each chunk.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    return_language=True,
)

# Run on one long-form sample and print the language of the first chunk.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["chunks"][0]["language"])
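Since the draft above only reads the language of the first chunk, here is a hedged sketch of how the per-chunk labels could be summarized into a multi-language result. It assumes `result["chunks"]` is a list of dictionaries that each carry a "language" entry, as the indexing above suggests; `summarize_chunk_languages` is a hypothetical helper, not part of transformers or senselab.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_chunk_languages(chunks: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return the fraction of labeled chunks assigned to each language.

    Chunks whose "language" entry is missing or None are skipped.
    """
    labels = [c["language"] for c in chunks if c.get("language")]
    if not labels:
        return {}
    counts = Counter(labels)
    # most_common() orders languages from most to least frequent.
    return {lang: n / len(labels) for lang, n in counts.most_common()}
```

With the pipeline output from the draft, this would be called as `summarize_chunk_languages(result["chunks"])`. Counting chunks is a crude proxy for duration; weighting by chunk timestamps would be a natural refinement if they are requested from the pipeline.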

@fabiocat93 fabiocat93 changed the title from "Task: Language Detection" to "Task: Language Identification" Nov 20, 2024
@fabiocat93 fabiocat93 self-assigned this Nov 20, 2024
@fabiocat93 fabiocat93 moved this from Todo to In Progress in senselab Nov 20, 2024
@fabiocat93
Collaborator

@ibevers I have implemented the speech-based version of this. Can you do the same for text-based language identification? Simply integrating Hugging Face models should be more than fine for now (https://huggingface.co/models?search=language%20detection).
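A rough sketch of what the text-based version might look like with a Hugging Face text-classification pipeline. The model id `papluca/xlm-roberta-base-language-detection` is only an assumption for illustration (no model has been chosen in this thread), and `build_default_classifier` and `identify_text_languages` are hypothetical names, not senselab functions. The classifier is passed in as an argument so the logic can be exercised without downloading a model.

```python
from typing import Callable, Dict, List, Sequence

# A classifier takes a list of texts and returns, for each text, a list of
# {"label": ..., "score": ...} dictionaries.
Classifier = Callable[..., List[List[Dict[str, object]]]]


def build_default_classifier(
    model_id: str = "papluca/xlm-roberta-base-language-detection",
) -> Classifier:
    """Load a Hugging Face text-classification pipeline (downloads the model).

    The default model id is an assumption made for illustration only.
    """
    from transformers import pipeline  # imported lazily; heavy dependency

    return pipeline("text-classification", model=model_id)


def identify_text_languages(
    texts: Sequence[str], classifier: Classifier, top_n: int = 1
) -> List[List[Dict[str, object]]]:
    """Return the `top_n` highest-scoring language labels for each text."""
    # top_k=None asks the pipeline for scores over all labels.
    all_scores = classifier(list(texts), top_k=None)
    return [
        sorted(scores, key=lambda s: s["score"], reverse=True)[:top_n]
        for scores in all_scores
    ]
```

A call would look like `identify_text_languages(["Bonjour tout le monde"], build_default_classifier())`. Setting `top_n` above 1 gives a simple way to surface more than one candidate language per input.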

@fabiocat93 fabiocat93 linked a pull request Nov 20, 2024 that will close this issue
1 task
@ibevers
Collaborator Author

ibevers commented Nov 22, 2024

@fabiocat93 Thank you for your patience. I can take a look at this after next week.

@fabiocat93
Collaborator

> @fabiocat93 Thank you for your patience. I can take a look at this after next week

I will extend your deadline one more time (from Dec 16 to Jan 14). Please let me know if you face any blockers.

Projects
Status: In Progress
2 participants