First of all, thanks for this excellent tool!
I'm going to compute metrics on some Chinese datasets, such as mMARCO. I downloaded the Chinese mMARCO via this link (https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/mmarco.zip). When I opened queries.jsonl, I found that the text field is malformed in entries such as the following:
// query wrapped in stray formatting tags: {\\fn华文楷体\\fs16\\1cHE0E0E0} before and after the text
{"_id": "224811", "text": "{\\fn华文楷体\\fs16\\1cHE0E0E0}萤火虫怎么点亮的 {\\fn华文楷体\\fs16\\1cHE0E0E0}"}
// 每桶 repeated many times
{"_id": "473204", "text": "房客建房每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶每桶"}
// meaningless and ungrammatical
{"_id": "880877", "text": "何国籍何为姓甘?"}
// meaningless and repeated $
{"_id": "319885", "text": "$$$$$$$$$$$$$$$ $$$ $$$ $$$ $$$ $$ $$ $ $$ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $"}
// nothing but question marks
{"_id": "1035441", "text": "???????"}