Metadata for subcorpora #1135

@nschneid

Currently the format docs indicate how to specify metadata for document and paragraph units, but not super-document units like source texts or genres within a larger corpus.

Is there a recommendation or emerging convention for this?

English-EWT is composed of 5 genres, and sentences are grouped by genre within each of the 3 conllu files. The genre can be recovered from the sentence ID but is not formally established in a separate field. E.g., the beginning of the dev file is:

# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# newpar id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-p0001
# text = From the AP comes this story :
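To illustrate what "recovered from the sentence ID" means in practice, here is a minimal sketch. It assumes the genre is whatever precedes the first hyphen in the `sent_id`, as in the example above; the function name `genre_from_sent_id` is made up for illustration, and the prefix convention is specific to this treebank's IDs, not something the format guarantees.

```python
def genre_from_sent_id(sent_id: str) -> str:
    """Return the text before the first hyphen of a sentence ID.

    Assumption: the ID starts with the genre label followed by a hyphen,
    e.g. "weblog-blogspot.com_...-0001". This is a naming convention,
    not a formally declared field.
    """
    return sent_id.split("-", 1)[0]


if __name__ == "__main__":
    example = "weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001"
    print(genre_from_sent_id(example))
```

This works only as long as the ID convention holds, which is the reason for wanting an explicit field.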

I could imagine adding lines before newdoc:

# newsubcorpus id = weblog
# genre = web

This genre declaration is meant to scope over all documents and sentences within the subcorpus. In general, the genre field would take its values from the official UD genre list (the one used for READMEs). See also https://universaldependencies.org/contributing/genres.html, which proposes a way to map genres to sentence ranges.

There may be other subcorpus-level properties worth declaring. E.g. if certain annotations were performed differently, something like

# feats = converted

for one subcorpus and

# feats = manual

for another.

When loading sentences in a .conllu reader, it would be helpful to be able to access all pieces of metadata defined at the sentence, paragraph, or subcorpus level.
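A rough sketch of how a reader might expose this, under stated assumptions: `# newsubcorpus` is the hypothetical directive proposed above (it is not part of the current format), any `key = value` comment between it and the next `# newdoc`, `# newpar`, or `# sent_id` line is treated as subcorpus-level, and the function name and the shape of the returned dictionaries are invented for illustration.

```python
from typing import Dict, Iterable, Iterator


def read_with_scoped_metadata(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per sentence with metadata from every enclosing scope.

    Assumes the hypothetical '# newsubcorpus' comment proposed in this issue.
    """
    subcorpus: Dict[str, str] = {}
    document: Dict[str, str] = {}
    paragraph: Dict[str, str] = {}
    sentence: Dict[str, str] = {}
    tokens = []
    scope = "sentence"  # where plain 'key = value' comments are stored

    def emit() -> dict:
        # Copy the outer scopes so later updates do not alter earlier results.
        return {
            "subcorpus": dict(subcorpus),
            "document": dict(document),
            "paragraph": dict(paragraph),
            "sentence": sentence,
            "tokens": tokens,
        }

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence, if there is one.
            if tokens:
                yield emit()
                sentence, tokens = {}, []
            continue
        if not line.startswith("#"):
            tokens.append(line)
            continue

        key, _, value = line[1:].partition("=")
        key, value = key.strip(), value.strip()
        head = key.split()[0] if key.split() else ""

        if head == "newsubcorpus":
            # A new subcorpus resets all narrower scopes.
            subcorpus = {"id": value} if value else {}
            document, paragraph = {}, {}
            scope = "subcorpus"
        elif head == "newdoc":
            document = {"id": value} if value else {}
            paragraph = {}
            scope = "sentence"
        elif head == "newpar":
            paragraph = {"id": value} if value else {}
            scope = "sentence"
        else:
            if key == "sent_id":
                scope = "sentence"
            target = subcorpus if scope == "subcorpus" else sentence
            target[key] = value

    if tokens:  # the file may not end with a blank line
        yield emit()
```

With something like this, each yielded sentence could be queried as `sent["subcorpus"].get("genre")` or `sent["subcorpus"].get("feats")` without parsing the sentence ID.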
