metadata.json missing for DALI train/dev split #4

Open
scaperothian opened this issue Mar 11, 2024 · 7 comments

Comments

@scaperothian

Hello, thank you for publishing your work on this research topic. I am interested in using your repo to fine-tune WAV2VEC with DALI and other data. When I run the dali_prepare.py script at DALI/LM/dali_prepare.py:

python dali_prepare.py --data_folder=/path/to/DALI_v2.0/

it returns the following:

Traceback (most recent call last):
  File "dali_prepare.py", line 84, in <module>
    prepare_text_dali(root=args.data_folder, save_folder=args.save_folder)
  File "dali_prepare.py", line 38, in prepare_text_dali
    with open(anno_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../../../DALI_v2.0/metadata.json' 

The metadata.json file is not included in the DALI dataset on Zenodo, nor on the DALI GitHub page.

I could recreate it based on the relative hours of data reported in your paper, but I would prefer to use your exact JSON and modify it as needed (e.g. based on connectivity).

Thanks again for your contributions to this field.

@brenzjam

brenzjam commented May 1, 2024

Hi guxm2021,

This is very interesting work. I was delighted to read your paper and eager to experiment with this repository. I have a similar question to scaperothian. Could you tell us where to find the metadata.json, or post an example json so we can recreate the format?

Many thanks,
Brens

@Sonata165
Collaborator

Thank you so much for your interest in this project!

The metadata.json is a new file that we generate during data processing. It contains the text annotation and the path to the audio for each utterance-level sample in the dataset. We have processed all datasets into a similar format (a metadata file plus a folder of utterance-level samples). I'm sorry for the delay in uploading this part of the code and the corresponding instructions in the readme. I'll try to clean up this part of the code and post it to GitHub before next week.
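For anyone who wants to experiment before the official code is available, here is a minimal sketch of what such a file could look like and how it might be written and read back. The layout is an assumption, not the repository's actual format: the utterance ids, the `path` and `text` keys, and the helper names `write_metadata` and `load_metadata` are all hypothetical, chosen only to match the description above (one text annotation and one audio path per utterance-level sample).

```python
import json
from pathlib import Path

# Hypothetical keys: the real metadata.json may use different names.
REQUIRED_KEYS = {"path", "text"}


def write_metadata(samples, out_path):
    """Write utterance-level samples to a JSON file.

    `samples` maps a (hypothetical) utterance id to a dict holding the
    audio path and the lyric text for that utterance.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)


def load_metadata(path):
    """Load the JSON file and check that every entry has the assumed keys."""
    with open(path, "r", encoding="utf-8") as f:
        meta = json.load(f)
    for utt_id, entry in meta.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"{utt_id} is missing keys: {sorted(missing)}")
    return meta
```

Once the official data processing code is published, the key names here should be replaced with whatever that code actually produces.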

@brenzjam

brenzjam commented May 3, 2024

Lovely to hear from you Longshen,

No problem at all. In fact I found your response rather fast! So just to make sure I've interpreted correctly: you segment the audio tracks into individual tracks for each utterance before making the metadata.json? And by utterance, do you mean phoneme, word or phrase/line?

Thanks again! I don't know why your repo hasn't gotten more attention. It looks pretty cool.
Brendan

@Sonata165
Collaborator

We've updated the data processing code here. Please follow the Readme.md inside that dir to prepare data.

Hi Brendan, yes: for each dataset, the audio was separated into utterances, and metadata.json was created for the utterance-level version of the dataset. By utterance I mean one line of lyrics in the song.
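To illustrate the idea of line-level utterances, here is a rough sketch of how line annotations could be turned into utterance entries. It is not the repository's processing code. It assumes each lyric line comes with start and end times in seconds, and the function names, the id scheme, and the `path`, `text`, `start_sample` and `end_sample` keys are all hypothetical. The actual audio slicing and file writing are left out.

```python
def line_to_sample_range(start_sec, end_sec, sample_rate, num_samples):
    """Convert a time span in seconds to a sample range clipped to the track."""
    start = max(0, int(round(start_sec * sample_rate)))
    end = min(num_samples, int(round(end_sec * sample_rate)))
    return start, end


def build_utterances(song_id, lines, sample_rate, num_samples):
    """Create one metadata entry per lyric line.

    `lines` is a list of dicts with (assumed) keys "start", "end" and "text".
    Lines that map to an empty sample range are skipped.
    """
    utterances = {}
    for idx, line in enumerate(lines):
        start, end = line_to_sample_range(
            line["start"], line["end"], sample_rate, num_samples
        )
        if end <= start:
            continue  # nothing to cut for this line
        utt_id = f"{song_id}_{idx:04d}"  # hypothetical id scheme
        utterances[utt_id] = {
            "path": f"{song_id}/{utt_id}.wav",  # hypothetical output location
            "text": line["text"],
            "start_sample": start,
            "end_sample": end,
        }
    return utterances
```

Each entry's sample range can then be used to cut the corresponding segment out of the full track with whichever audio library you prefer.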

@Sonata165
Collaborator

Btw, if you need access to the full audio of DALI v2 instead of downloading it from YouTube (some of the URLs have become invalid over the years), please send an email to [email protected] from any Outlook email address, and I can then share the audio (currently stored in my OneDrive) with you. Thanks for your patience.

@brenzjam

Hi @Sonata165, thanks for the response. Unfortunately, my emails to you are not going through; I have been sending them from my university's Outlook account. Is this address definitely correct, or is there another email I can reach you at?

@guxm2021
Owner

guxm2021 commented Jun 3, 2024

I think his email address is "[email protected]".
