Issues in sub-dagws #273

@KennethEnevoldsen

Description

Just found a tiny issue for all sub-dagws:

Here Peter specified source as dagw-xxx, but the validator checks that the source equals the name of the dataset folder, dataset_name = document_file.parent.parent.name, which is dagw:

Therefore we will have:

Checking dataset: dagw:  86%|████████████████████████████████████████████▉       | 19/22 [00:09<00:02,  1.46it/s]
ERROR:__main__:--- Dataset dagw failed validation ------------
ERROR:__main__:Datasheet dagw does not exist.
Error reading datasheet dagw: [Errno 2] No such file or directory: '/work/github/danish-foundation-models/docs/datasheets/dagw'
Error in document file dagw-retsinformationdk.jsonl.gz: Source should be dagw, but is dagw-retsinformationdk
Error in document file dagw-ep.jsonl.gz: Source should be dagw, but is dagw-ep
Error in document file dagw-hest.jsonl.gz: Source should be dagw, but is dagw-hest

The same applies to the check for dataset sheets: it only checks whether dagw.md exists.
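The mismatch can be reproduced with a minimal sketch. Only the expression `document_file.parent.parent.name` and the error wording come from this issue; the function names `check_source` and `datasheet_filename` are hypothetical and are not the validator's actual API.

```python
from pathlib import Path
from typing import Optional


def check_source(document_file: Path, source: str) -> Optional[str]:
    """Return an error message if `source` differs from the dataset folder name."""
    # Expression quoted in the issue: the dataset name is the folder
    # two levels above the document file.
    dataset_name = document_file.parent.parent.name
    if source != dataset_name:
        return (
            f"Error in document file {document_file.name}: "
            f"Source should be {dataset_name}, but is {source}"
        )
    return None


def datasheet_filename(document_file: Path) -> str:
    """Hypothetical helper: the datasheet looked up for a document file."""
    # Assumption based on the issue text: the datasheet is keyed on the same name.
    return f"{document_file.parent.parent.name}.md"


# Current layout: every sub-dagw file lives under the shared dagw folder.
shared = Path("dataset_folder/dagw/documents/dagw-ep.jsonl.gz")
# Proposed layout: one folder per sub-dagw.
split = Path("dataset_folder/dagw-ep/documents/dagw-ep.jsonl.gz")

print(check_source(shared, "dagw-ep"))  # reports the mismatch
print(check_source(split, "dagw-ep"))   # no error
print(datasheet_filename(shared), datasheet_filename(split))
```

With the shared folder, both the source check and the datasheet lookup resolve to dagw; with one folder per sub-dagw, they resolve to the sub-dataset's own name.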

So I guess we have to separate each sub-dagw into its own folder, like:

dataset_folder
│
└── dataset_name
    │
    ├── documents
    │   └── dataset_name.jsonl.gz  
    │
    └── attributes   # OPTIONAL: folder containing annotations from dataset cleaning
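One way to move to the layout above is sketched here. It assumes every sub-dagw file currently sits in `dagw/documents/` and is named `<dataset_name>.jsonl.gz`; the function name `split_into_folders` is hypothetical, and the optional attributes folder is not handled.

```python
import shutil
from pathlib import Path
from typing import List

SUFFIX = ".jsonl.gz"


def split_into_folders(dataset_dir: Path) -> List[Path]:
    """Move each documents/<name>.jsonl.gz into <name>/documents/ next to dataset_dir."""
    moved = []
    # sorted() materializes the file list before anything is moved.
    for document_file in sorted((dataset_dir / "documents").glob(f"*{SUFFIX}")):
        dataset_name = document_file.name[: -len(SUFFIX)]
        target_dir = dataset_dir.parent / dataset_name / "documents"
        target = target_dir / document_file.name
        if target == document_file:
            continue  # file already lives in a folder matching its name
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(document_file), str(target))
        moved.append(target)
    return moved
```

Calling `split_into_folders(Path("dataset_folder/dagw"))` would place `dagw-ep.jsonl.gz` under `dataset_folder/dagw-ep/documents/`, so that `document_file.parent.parent.name` matches the source.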

Originally posted by @TTTTao725 in #266 (comment)
