Malformed Chinese mMARCO dataset #161

@sissilab

First of all, thanks for this excellent tool!

I'm going to calculate metrics on some Chinese datasets, such as mMARCO. I downloaded the Chinese mMARCO via this link (https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/mmarco.zip). When I opened queries.jsonl, I found that the "text" field is malformed in entries such as the following:

// text wrapped in stray {\\fn华文楷体\\fs16\\1cHE0E0E0} style tags
{"_id": "224811", "text": "{\\fn华文楷体\\fs16\\1cHE0E0E0}萤火虫怎么点亮的 {\\fn华文楷体\\fs16\\1cHE0E0E0}"}

// many repeated 每桶
{"_id": "473204", "text": "房客建房每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶"}

// meaningless and ungrammatical
{"_id": "880877", "text": "何国籍何为姓甘?"}

// meaningless, repeated $
{"_id": "319885", "text": "$$$$$$$$$$$$$$$ $$$ $$$ $$$ $$$ $$ $$ $ $$ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $"}

// ????
{"_id": "1035441", "text": "???????"}
