Description
Hi,
Thanks for your efforts in creating/curating these datasets! These are priceless and greatly advance NLP for Indian languages.
I tried adding them to mtdata: thammegowda/mtdata#81
Since the README says your datasets are still growing, I am wondering what the best long-term strategy is for keeping in sync.
For now, I can `grep -i -o 'http[^ ]*zip' README.md`, but the immediate concern is consistency in determining the name, version, and languages of a dataset from its URL.
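For illustration, here is a minimal sketch of that URL-scraping step in Python (the README path and the pattern mirroring the grep command above are assumptions on my part; a real downloader like mtdata may do this differently):

```python
import re

def extract_zip_urls(readme_text: str) -> list[str]:
    """Find all http(s) URLs ending in 'zip', like `grep -i -o 'http[^ ]*zip'`."""
    return sorted(set(re.findall(r'http[^ ]*zip', readme_text, flags=re.IGNORECASE)))

# Miniature stand-in for the real README content:
sample = (
    "See https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/oneindia_20210320_en_ml.zip\n"
    "and https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/wikipedia-en-ml-20210201.zip\n"
)
print(extract_zip_urls(sample))
```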
The way the current files are named (the filenames act as IDs for the corpora) is a bit inconsistent. For example, consider these:
1) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/oneindia_20210320_en_ml.zip
2) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/pibarchives_2014_2016_en_ml.zip
3) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/wikipedia-en-ml-20210201.zip
- Item (1): we can easily split by `_` and get `(name, version, lang1, lang2)`, so this is great. We can see `oneindia` is the name, `20210320` is the version, and `en_ml` are the languages.
- Item (2): seems okay; we can call `2014_2016` the version, though it would have been nice to have `2014to2016v1` as the version, so that splitting by `_` would give exactly `(name, version, lang1, lang2)` as in item (1).
- Item (3): seems abnormal, as it doesn't fit `(name, version, lang1, lang2)`. There are more datasets matching the item (1) pattern than the item (3) pattern, so I am inclined to call this one abnormal.
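To make the consistency point concrete, here is a small sketch (the function name and the strict four-field rule are my own, not mtdata's actual parser) of how an automated downloader might derive `(name, version, lang1, lang2)` from a filename:

```python
def parse_dataset_id(url: str):
    """Split a corpus zip URL into (name, version, lang1, lang2).

    Assumes the filename follows name_version_lang1_lang2.zip, where each
    field is a single underscore-free token (e.g. oneindia_20210320_en_ml).
    Returns None when the filename does not fit that convention.
    """
    fname = url.rsplit('/', 1)[-1]
    stem = fname[:-len('.zip')] if fname.endswith('.zip') else fname
    parts = stem.split('_')
    if len(parts) != 4:
        return None  # items (2) and (3) above fail this strict split
    name, version, lang1, lang2 = parts
    return name, version, lang1, lang2

base = 'https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/'
print(parse_dataset_id(base + 'oneindia_20210320_en_ml.zip'))
print(parse_dataset_id(base + 'wikipedia-en-ml-20210201.zip'))  # None
```

Note that under this strict split, item (2) also fails (its version contains an underscore), which is exactly why a single-token version like `2014to2016v1` would help.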
Could you please consider adopting a consistent format for dataset IDs? It would greatly help automated downloaders such as mtdata.
Otherwise, do you really want your users to manually download 196 zip files via a browser, and then extract and merge them? :)
Thanks.