User story

As a researcher, I frequently create data visualisations based on the validated.tsv file of a language/release. Currently the only way to obtain this file is to download the whole dataset or the delta.

I want to be able to get just the .tsv files related to a release, without downloading the clips, so that I can produce data visualisations faster.
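For context, the analyses this unblocks need only the metadata, not the audio. A minimal sketch of the kind of tally a researcher might run over one column of validated.tsv (stdlib only; column names such as `gender` are assumed from the current CV metadata schema and may differ by release):

```python
import csv
from collections import Counter

def tally_column(tsv_path, column):
    """Count rows in a Common Voice .tsv file, grouped by one column's values.

    Assumes a tab-separated file with a header row, as CV releases use.
    Blank values are grouped under "(blank)" so they remain visible in plots.
    """
    counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            counts[row.get(column) or "(blank)"] += 1
    return counts
```

The resulting `Counter` feeds straight into any plotting library, without ever touching the clips.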
Acceptance criteria

The files
- clip_durations.tsv
- invalidated.tsv
- other.tsv
- reported.tsv
- validated.tsv

are available
- for each language in the CV corpus (about 103 at the time of writing)
- for each version, including delta releases

from the CV datasets download page, in the same way as we currently download the .tar.gz formatted datasets.
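Until per-release .tsv bundles exist, one stopgap is to pull just the metadata out of an already-downloaded archive without unpacking the clips. A minimal sketch (the function name and the flattened output layout are my own; archives for several languages would need per-language output directories to avoid filename collisions):

```python
import tarfile
from pathlib import Path

def extract_tsv_only(archive_path, out_dir):
    """Extract only the .tsv members of a CV .tar.gz release, skipping clips.

    Members are written flat into out_dir (internal archive paths dropped),
    so this assumes one language per archive.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".tsv"):
                # Drop the archive's internal directory prefix before extracting
                member.name = Path(member.name).name
                tar.extract(member, path=out)
```

This still costs the full download, of course, which is exactly what this issue asks to avoid; it only saves disk space and unpack time.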
Thank you for posting this @KathyReid...
I have raised this request many times, on Discourse, in meetings, and in one-to-one talks, ever since I created the CV Metadata Viewer and CV Dataset Analyzer webapps. To keep those apps up to date, I download every dataset (now 615 GB on disk) every 3 months, which takes 2-3 days: a waste of bandwidth and an unnecessary carbon footprint. I only train on Turkic languages, so over 100 of the 114 language downloads are wasted (114 - 11 > 100).
Two notes:
- Default splits should also be included (train.tsv, dev.tsv, test.tsv).