Currently the format docs indicate how to specify metadata for document and paragraph units, but not super-document units like source texts or genres within a larger corpus.
Is there a recommendation or emerging convention for this?
English-EWT is composed of 5 genres, and sentences are grouped by genre within each of the 3 .conllu files. The genre can be recovered from the sentence ID but is not formally declared in a separate field. E.g., the beginning of the dev file is:
# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# newpar id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-p0001
# text = From the AP comes this story :
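For illustration, recovering the genre from the ID today might look like the minimal sketch below. It assumes the genre is whatever precedes the first hyphen of the sent_id, as in the example above; the function name is hypothetical and not part of any UD tooling.

```python
def genre_from_sent_id(sent_id: str) -> str:
    """Return the prefix before the first hyphen of a sentence ID.

    Assumption: the ID starts with the genre name followed by "-",
    as in the English-EWT example shown above.
    """
    return sent_id.split("-", 1)[0]


example_id = "weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001"
print(genre_from_sent_id(example_id))
```

This works only as long as the ID naming convention holds, which is why an explicit declaration would be more robust.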
I could imagine adding lines before newdoc:
# newsubcorpus id = weblog
# genre = web
This genre declaration is meant to scope over all documents and sentences within the subcorpus. In general, the genre field would have values from the official UD genre list (the one used for READMEs). See also https://universaldependencies.org/contributing/genres.html, which proposes a way to map genres to sentence ranges.
There may be other subcorpus-level properties worth declaring. E.g. if certain annotations were performed differently, something like
# feats = converted
for one subcorpus and
# feats = manual
for another.
When loading sentences in a .conllu reader, it would be helpful to be able to access all pieces of metadata defined at the sentence, paragraph, document, or subcorpus level.
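To make the scoping idea concrete, here is a rough sketch of how a reader could carry metadata down to each sentence. It assumes the `# newsubcorpus` line proposed above (which is not part of the current CoNLL-U format) and treats any key = value comments between it and the next newdoc, newpar, or sentence boundary as subcorpus-level. The function name and the doc_id / par_id / subcorpus_id keys are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def read_with_scoped_metadata(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per sentence: {'meta': merged metadata, 'tokens': rows}."""
    subcorpus: Dict[str, str] = {}      # persists until the next newsubcorpus
    doc_id: Optional[str] = None        # persists until the next newdoc
    par_id: Optional[str] = None        # persists until the next newpar/newdoc
    sent_meta: Dict[str, str] = {}      # reset after every sentence
    tokens: List[List[str]] = []
    in_subcorpus_header = False         # True right after "# newsubcorpus"

    def build() -> dict:
        # Sentence-level keys override subcorpus-level keys on conflict.
        meta = {**subcorpus, "doc_id": doc_id, "par_id": par_id, **sent_meta}
        return {"meta": meta, "tokens": tokens}

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():            # blank line ends a sentence
            if tokens:
                yield build()
            sent_meta, tokens = {}, []
            in_subcorpus_header = False
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            key, value = key.strip(), value.strip()
            if key.startswith("newsubcorpus"):
                subcorpus = {"subcorpus_id": value} if sep else {}
                doc_id = par_id = None
                in_subcorpus_header = True
            elif key.startswith("newdoc"):
                in_subcorpus_header = False
                doc_id, par_id = (value if sep else None), None
            elif key.startswith("newpar"):
                in_subcorpus_header = False
                par_id = value if sep else None
            elif sep and in_subcorpus_header:
                subcorpus[key] = value
            elif sep:
                sent_meta[key] = value
            continue
        tokens.append(line.split("\t"))  # token row, columns split on tabs
    if tokens:                           # file may lack a final blank line
        yield build()


sample = [
    "# newsubcorpus id = weblog",
    "# genre = web",
    "# newdoc id = doc1",
    "# sent_id = doc1-0001",
    "# newpar id = doc1-p0001",
    "# text = From the AP comes this story :",
    "1\tFrom\tfrom",
    "",
    "# sent_id = doc1-0002",
    "# text = Second sentence",
    "1\tSecond\tsecond",
    "",
]
sentences = list(read_with_scoped_metadata(sample))
```

With this approach, the second sentence still reports the genre, document, and paragraph declared earlier, even though its own comment block only has sent_id and text.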