
Inconsistent IDs #1

Open
@thammegowda


Hi,

Thanks for your efforts in creating/curating these datasets! These are priceless and greatly advance NLP for Indian languages.

I tried adding them to mtdata (thammegowda/mtdata#81).
Since the README says your datasets are still growing, I am wondering what the best long-term strategy is for keeping the two in sync.

For now, I can run grep -i -o 'http[^ ]*zip' README.md to collect the download links, but the immediate concern is how to consistently determine the name, version, and languages of each dataset from its URL.
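
For anyone who wants the same extraction without the shell, here is a minimal Python sketch of that grep. The function name find_zip_urls is hypothetical, and the pattern only mirrors the grep expression above; it is not how mtdata actually discovers datasets.

```python
import re

# Same idea as: grep -i -o 'http[^ ]*zip' README.md
# [^ \n] keeps a match on one line, as grep does when it scans line by line.
ZIP_URL = re.compile(r"http[^ \n]*zip", flags=re.IGNORECASE)


def find_zip_urls(text):
    """Return every substring that starts with 'http' and ends with 'zip'."""
    return ZIP_URL.findall(text)


if __name__ == "__main__":
    # Assumes a README.md in the current working directory.
    with open("README.md", encoding="utf-8") as handle:
        for url in find_zip_urls(handle.read()):
            print(url)
```

Like the grep, this is greedy: two links on one line with no space between them would come out as a single match, so the result may need a quick check.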

The current file names (which act as the corpus IDs) are a bit inconsistent. For example, consider these:

1) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/oneindia_20210320_en_ml.zip
2) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/pibarchives_2014_2016_en_ml.zip
3) https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/wikipedia-en-ml-20210201.zip
  • Item (1): splitting on _ gives (name, version, lang1, lang2) directly, which is great. oneindia is the name, 20210320 is the version, and en and ml are the languages.
  • Item (2) seems okay: we can treat 2014_2016 as the version, though something like 2014to2016v1 would have been nicer, since splitting on _ would then give exactly (name, version, lang1, lang2) as in item (1).
  • Item (3) does not fit (name, version, lang1, lang2): it uses - instead of _ as the separator and puts the version after the languages. More datasets match the pattern of item (1) than that of item (3), so I am inclined to call item (3) the anomaly.
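
To make the difference concrete, here is a rough Python sketch of the kind of parsing a downloader would want to do on the three URLs above. The function name parse_dataset_id and the rule "first field is the name, last two fields are the languages, everything in between is the version" are my assumptions for illustration, not mtdata's actual code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_dataset_id(url):
    """Split a zip URL into (name, version, lang1, lang2), or return None.

    Assumed convention: name_version_lang1_lang2.zip, where the version
    itself may contain underscores (e.g. 2014_2016).
    """
    stem = PurePosixPath(urlparse(url).path).stem  # file name without .zip
    parts = stem.split("_")
    if len(parts) < 4:
        return None  # does not follow the underscore-separated pattern
    name = parts[0]
    lang1, lang2 = parts[-2], parts[-1]
    version = "_".join(parts[1:-2])  # whatever sits between name and langs
    return name, version, lang1, lang2


base = "https://anuvaad-parallel-corpus.s3-us-west-2.amazonaws.com/"
for file_name in (
    "oneindia_20210320_en_ml.zip",
    "pibarchives_2014_2016_en_ml.zip",
    "wikipedia-en-ml-20210201.zip",
):
    print(file_name, "->", parse_dataset_id(base + file_name))
```

With this rule, items (1) and (2) parse cleanly, while item (3) returns None and would need a special case.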

Could you please consider using a consistent format for dataset IDs? It would greatly help automated downloaders such as mtdata.
Otherwise, do you really want your users to manually download 196 zip files via browser, and extract and merge them? :)

Thanks.

P.S. For reference, the mtdata dataset ID format: https://github.com/thammegowda/mtdata#dataset-id
