I just found a small issue that affects all the dagw sub-datasets:
Here Peter specified source as dagw-xxx, but the validator checks whether the source equals the name of the dataset folder: dataset_name = document_file.parent.parent.name, which is dagw.
As a result, validation produces:
Checking dataset: dagw: 86%|████████████████████████████████████████████▉ | 19/22 [00:09<00:02, 1.46it/s]
ERROR:__main__:--- Dataset dagw failed validation ------------
ERROR:__main__:Datasheet dagw does not exist.
Error reading datasheet dagw: [Errno 2] No such file or directory: '/work/github/danish-foundation-models/docs/datasheets/dagw'
Error in document file dagw-retsinformationdk.jsonl.gz: Source should be dagw, but is dagw-retsinformationdk
Error in document file dagw-ep.jsonl.gz: Source should be dagw, but is dagw-ep
Error in document file dagw-hest.jsonl.gz: Source should be dagw, but is dagw-hest
The same applies to the check for whether a datasheet exists: it only checks whether dagw.md exists.
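To make the mismatch concrete, here is a minimal sketch of the comparison described above. Only the expression document_file.parent.parent.name comes from the validator; the function names expected_source and check_source and the example path are hypothetical, and the real validator may be structured differently.

```python
from pathlib import Path
from typing import Optional


def expected_source(document_file: Path) -> str:
    # Same expression as quoted from the validator: the folder two levels
    # above the document file is treated as the dataset name.
    return document_file.parent.parent.name


def check_source(document_file: Path, source: str) -> Optional[str]:
    # Hypothetical helper: returns an error message on mismatch, else None.
    dataset_name = expected_source(document_file)
    if source != dataset_name:
        return (
            f"Error in document file {document_file.name}: "
            f"Source should be {dataset_name}, but is {source}"
        )
    return None


# Assumed layout: every sub-dataset file sits inside the shared dagw folder.
doc = Path("dataset_folder/dagw/documents/dagw-ep.jsonl.gz")
message = check_source(doc, "dagw-ep")
print(message)
```

With the shared dagw folder, the derived name is always dagw, so any source of the form dagw-xxx fails the comparison.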
So I guess we have to separate each sub-dagw dataset into its own folder, like:
dataset_folder
│
└── dataset_name
    │
    ├── documents
    │   └── dataset_name.jsonl.gz
    │
    └── attributes  # OPTIONAL: folder containing annotations from dataset cleaning
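As a rough sketch of what that reorganisation could look like, the function below computes where each existing file would land in the per-dataset layout. The name target_path is hypothetical, and it assumes the file name without .jsonl.gz is the desired dataset name (as the dagw-ep and dagw-hest files in the log suggest). It only computes paths and does not move anything.

```python
from pathlib import Path

SUFFIX = ".jsonl.gz"


def target_path(document_file: Path) -> Path:
    # Map <root>/<old_dataset>/documents/<name>.jsonl.gz
    # to  <root>/<name>/documents/<name>.jsonl.gz
    if not document_file.name.endswith(SUFFIX):
        raise ValueError(f"Unexpected file name: {document_file.name}")
    dataset_name = document_file.name[: -len(SUFFIX)]
    root = document_file.parent.parent.parent
    return root / dataset_name / "documents" / document_file.name


old = Path("dataset_folder/dagw/documents/dagw-hest.jsonl.gz")
new = target_path(old)
print(new)
```

After such a move, document_file.parent.parent.name would evaluate to dagw-hest for this file, so it would match the source value already in the data. Each sub-dataset would presumably also need its own datasheet.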
Originally posted by @TTTTao725 in #266 (comment)