Some mp3 files in cv corpus 4 are empty #31
Comments
Could you please share from which locale these are?
Thanks for your comments. @HarikalarKutusu Those empty files are located in v4.0/en/clips.
I'll check these to confirm, and also check v5.1 to see if the problem persists. As English is the largest dataset, it might take some time. I'm not saying that this is not a "bug", it sure is; this is the reason I also post them as issues. But one thing: it is not unusual to have some bad data in datasets. It could be caused during recording, or it might be a technical issue during saving and/or packaging the dataset. As consumers of the dataset, we usually check the validity of the data and leave invalid items out of training, such as:
These are a few among tens of thousands, but without these checks the whole program will error out, or the training will get worse. The problem is: once a dataset is out and people have downloaded it, a new version will not come out unless there is a systematic error. In the past Common Voice had such re-versioning: v2 -> v3, v5.0 -> v5.1, v6.0 -> v6.1, v16.0 -> v16.1... So, as there is no v4.1, I think one should take care of these in pre-processing...
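A pre-processing check of the kind described above could look like the following minimal sketch. It only covers the zero-byte case reported in this issue. The function names `find_empty_clips` and `drop_empty_clips` are hypothetical, and the sketch assumes the clips sit directly inside one directory (such as the `clips` folder of a release).

```python
from pathlib import Path


def find_empty_clips(clips_dir):
    """Return the sorted names of zero-byte .mp3 files directly under clips_dir."""
    return sorted(
        p.name
        for p in Path(clips_dir).glob("*.mp3")
        if p.is_file() and p.stat().st_size == 0
    )


def drop_empty_clips(clip_names, clips_dir):
    """Keep only the clip names whose file exists and is not empty."""
    root = Path(clips_dir)
    kept = []
    for name in clip_names:
        path = root / name
        # Missing files and zero-byte files are both left out of training.
        if path.is_file() and path.stat().st_size > 0:
            kept.append(name)
    return kept
```

For example, `find_empty_clips("v4.0/en/clips")` would list the empty files, and `drop_empty_clips` could be applied to the clip names read from a split file before training. A size check does not catch clips that are non-empty but undecodable; those would need an actual decode attempt.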
OK, here is what I found:
I checked some of the zero-sized ones, and I found that these are in the […]. As Common Voice releases everything recorded, I don't think that it is a bug either. Those files are actually zero length and were invalidated by volunteers. If there are some instances that passed validation (although they are invalid), that would need the fourth option in my previous post, I think...
Thank you so much for your comment! @HarikalarKutusu I realized that some of these invalid mp3 files can be fixed using corpus v5.0 (there are valid versions of these files in corpus v5.0).
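The repair mentioned above could be sketched roughly as follows, assuming that a clip with the same file name in the other release is the same recording. The function name `restore_from_other_release` and both directory arguments are hypothetical; this is not an official Common Voice tool.

```python
import shutil
from pathlib import Path


def restore_from_other_release(broken_dir, donor_dir):
    """Overwrite zero-byte .mp3 files in broken_dir with non-empty,
    same-named files from donor_dir.

    Returns (restored, still_empty) as two sorted lists of file names.
    """
    restored, still_empty = [], []
    for path in sorted(Path(broken_dir).glob("*.mp3")):
        if path.stat().st_size != 0:
            continue  # this clip already has content, leave it alone
        donor = Path(donor_dir) / path.name
        if donor.is_file() and donor.stat().st_size > 0:
            shutil.copyfile(donor, path)
            restored.append(path.name)
        else:
            still_empty.append(path.name)
    return restored, still_empty
```

Files that remain in `still_empty` have no usable copy in the other release and would still need to be excluded during pre-processing.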
Here is a complete list of 504 zero-length (corrupt) clips for future reference (sorted by lc, ver, clip):
I noticed that some files in corpus 4 (e.g., common_voice_en_146651.mp3, common_voice_en_130054.mp3) are zero bytes in size. Has anyone else had this problem?