Description
Hi,
Thanks for your efforts in creating/curating these datasets! These are priceless and greatly advance NLP for Indian languages.
I tried adding them to mtdata: thammegowda/mtdata#81
Since the README says your datasets are still growing, I am wondering what the best long-term strategy is for keeping in sync.
For now, I can `grep -i -o 'http[^ ]*zip' README.md`, but the immediate concern is consistency in determining the name, version, and languages of a dataset from its URL.
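For illustration, here is a minimal sketch of that URL-scraping step in Python (the README path and the pattern mirroring the grep command above are assumptions on my part; a real downloader like mtdata may do this differently):

```python
import re

def extract_zip_urls(readme_text: str) -> list[str]:
    """Find all http(s) URLs ending in 'zip', like `grep -i -o 'http[^ ]*zip'`."""
    return sorted(set(re.findall(r'http[^ ]*zip', readme_text, flags=re.IGNORECASE)))

# Miniature stand-in for the real README content:
sample = (
    "See https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/oneindia_20210320_en_ml.zip\n"
    "and https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/wikipedia-en-ml-20210201.zip\n"
)
print(extract_zip_urls(sample))
```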
The way the current files are named (the filenames act as IDs for the corpora) is a bit inconsistent. For example, consider these:
1) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/oneindia_20210320_en_ml.zip
2) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/pibarchives_2014_2016_en_ml.zip
3) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/wikipedia-en-ml-20210201.zip
- Item (1): we can easily split by `_` and get `(name, version, lang1, lang2)`, so this is great. We can see `oneindia` is the name, `20210320` is the version, and `en_ml` are the languages.
- Item (2): seems okay; we can call `2014_2016` the version, though it would have been nice to have `2014to2016v1` as the version, so that splitting by `_` would give exactly `(name, version, lang1, lang2)` as in item (1).
- Item (3): seems abnormal, as it doesn't fit `(name, version, lang1, lang2)`. There are more datasets matching the item (1) pattern than the item (3) pattern, so I am inclined to call this one abnormal.
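To make the consistency point concrete, here is a small sketch (the function name and the strict four-field rule are my own, not mtdata's actual parser) of how an automated downloader might derive `(name, version, lang1, lang2)` from a filename:

```python
def parse_dataset_id(url: str):
    """Split a corpus zip URL into (name, version, lang1, lang2).

    Assumes the filename follows name_version_lang1_lang2.zip, where each
    field is a single underscore-free token (e.g. oneindia_20210320_en_ml).
    Returns None when the filename does not fit that convention.
    """
    fname = url.rsplit('/', 1)[-1]
    stem = fname[:-len('.zip')] if fname.endswith('.zip') else fname
    parts = stem.split('_')
    if len(parts) != 4:
        return None  # items (2) and (3) above fail this strict split
    name, version, lang1, lang2 = parts
    return name, version, lang1, lang2

base = 'https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/'
print(parse_dataset_id(base + 'oneindia_20210320_en_ml.zip'))
print(parse_dataset_id(base + 'wikipedia-en-ml-20210201.zip'))  # None
```

Note that under this strict split, item (2) also fails (its version contains an underscore), which is exactly why a single-token version like `2014to2016v1` would help.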
Could you please consider adopting a consistent format for dataset IDs? It would greatly help automated downloaders such as mtdata.
Otherwise, do you really want your users to manually download 196 zip files via a browser, and then extract and merge them? :)
Thanks.