User story

As a researcher, I frequently create data visualisations based on the validated.tsv file of a language/release. Currently the only way to obtain this file is to download the whole dataset or the delta.

I want to be able to get just the .tsv files related to a release, without downloading the clips, so that I can produce data visualisations faster.
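For context, the analyses this unblocks need only the metadata, not the audio. A minimal sketch of the kind of tally a researcher might run over one column of validated.tsv (stdlib only; column names such as `gender` are assumed from the current CV metadata schema and may differ by release):

```python
import csv
from collections import Counter

def tally_column(tsv_path, column):
    """Count rows in a Common Voice .tsv file, grouped by one column's values.

    Assumes a tab-separated file with a header row, as CV releases use.
    Blank values are grouped under "(blank)" so they remain visible in plots.
    """
    counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            counts[row.get(column) or "(blank)"] += 1
    return counts
```

The resulting `Counter` feeds straight into any plotting library, without ever touching the clips.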
Acceptance criteria

The files
- clip_durations.tsv
- invalidated.tsv
- other.tsv
- reported.tsv
- validated.tsv

are available
- for each language in the CV corpus (about 103 at the time of writing)
- for each version, including delta releases

from the CV datasets download page, in the same way as we currently download the .tar.gz formatted datasets.
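Until per-release .tsv bundles exist, one stopgap is to pull just the metadata out of an already-downloaded archive without unpacking the clips. A minimal sketch (the function name and the flattened output layout are my own; archives for several languages would need per-language output directories to avoid filename collisions):

```python
import tarfile
from pathlib import Path

def extract_tsv_only(archive_path, out_dir):
    """Extract only the .tsv members of a CV .tar.gz release, skipping clips.

    Members are written flat into out_dir (internal archive paths dropped),
    so this assumes one language per archive.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".tsv"):
                # Drop the archive's internal directory prefix before extracting
                member.name = Path(member.name).name
                tar.extract(member, path=out)
```

This still costs the full download, of course, which is exactly what this issue asks to avoid; it only saves disk space and unpack time.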
Thank you for posting this @KathyReid...
I have raised this request many times, on Discourse, in meetings, and in one-to-one talks, ever since I created the CV Metadata Viewer and CV Dataset Analyzer webapps. To keep those apps up to date, I download every dataset (now 615 GB on disk) every 3 months, which takes 2-3 days: a waste of bandwidth and an unnecessary carbon footprint. I only train on Turkic languages, so over 100 of the 114 language downloads are wasted (114 - 11 > 100).
Two notes:
- Default splits should also be included (train.tsv, dev.tsv, test.tsv).