Metadata for subcorpora

Currently the [format docs](https://universaldependencies.org/format.html#paragraph-and-document-boundaries) indicate how to specify metadata for document and paragraph units, but not super-document units like source texts or genres within a larger corpus.

Is there a recommendation or emerging convention for this?

English-EWT is composed of 5 genres, and sentences are grouped by genre within each of the 3 conllu files. The genre can be recovered from the sentence ID but is not formally established in a separate field. E.g. the beginning of the [dev file](https://github.com/UniversalDependencies/UD_English-EWT/blob/master/en_ewt-ud-dev.conllu) is

```
# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# newpar id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-p0001
# text = From the AP comes this story :
```
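For reference, a minimal sketch of how a reader currently has to recover the genre, assuming the part of the sentence ID before the first hyphen names the source genre (as in the example above). The helper name `genre_prefix` is hypothetical and not part of any UD tool:

```python
def genre_prefix(sent_id: str) -> str:
    """Return the part of a sentence ID before the first hyphen.

    Assumption: that prefix identifies the source genre, as in the
    dev-file example above. This is a naming habit, not a formal field.
    """
    prefix, sep, _ = sent_id.partition("-")
    if not sep:
        raise ValueError(f"no hyphen-delimited prefix in sent_id: {sent_id!r}")
    return prefix


sent_id = "weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001"
print(genre_prefix(sent_id))  # weblog
```

This depends entirely on an ID naming convention, which is the fragility that an explicit field would avoid.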

I could imagine adding lines before `newdoc`:

```
# newsubcorpus id = weblog
# genre = web
```

This `genre` declaration is meant to scope over all documents and sentences within the subcorpus. In general, the `genre` field would take its values from the official [UD genre list](https://github.com/UniversalDependencies/docs-automation/blob/master/genre_symbols.json) (the one used for READMEs). See also https://universaldependencies.org/contributing/genres.html, which proposes a way to map genres to sentence ranges.

There may be other subcorpus-level properties worth declaring. E.g. if certain annotations were performed differently, something like

```
# feats = converted
```

for one subcorpus and

```
# feats = manual
```

for another.

When loading sentences in a .conllu reader, it would be helpful to be able to access all pieces of metadata defined at the sentence, paragraph, or subcorpus level.
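To make the idea concrete, here is a rough sketch of what such a reader could look like. It rests on several assumptions of mine: `# newsubcorpus` is only the hypothetical marker proposed in this issue, not an existing UD convention; any `key = value` comment between a `# newsubcorpus` line and the next `# newdoc`, `# newpar`, or `# sent_id` line is treated as subcorpus-level; and the function name `iter_sentence_metadata` is made up for illustration.

```python
from typing import Dict, Iterable, Iterator

Meta = Dict[str, str]


def iter_sentence_metadata(lines: Iterable[str]) -> Iterator[Dict[str, Meta]]:
    """Yield, for each sentence, the metadata in scope at every level.

    Token lines are only detected, not parsed. The '# newsubcorpus'
    marker is hypothetical (proposed in this issue, not in the UD spec).
    """
    subcorpus: Meta = {}
    document: Meta = {}
    paragraph: Meta = {}
    sentence: Meta = {}
    in_subcorpus_header = False
    seen_token = False

    def snapshot() -> Dict[str, Meta]:
        # Copy each level so later updates do not alter earlier results.
        return {
            "subcorpus": dict(subcorpus),
            "document": dict(document),
            "paragraph": dict(paragraph),
            "sentence": dict(sentence),
        }

    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes a sentence once tokens have been seen.
            if seen_token:
                yield snapshot()
                sentence, seen_token = {}, False
            continue
        if not line.startswith("#"):
            seen_token = True
            continue

        key, _, value = line[1:].partition("=")
        key, value = key.strip(), value.strip()

        if key.startswith("newsubcorpus"):
            # A new subcorpus resets every narrower scope.
            subcorpus, document, paragraph = {"id": value}, {}, {}
            in_subcorpus_header = True
        elif key.startswith("newdoc"):
            document, paragraph = {"id": value}, {}
            in_subcorpus_header = False
        elif key.startswith("newpar"):
            paragraph = {"id": value}
            in_subcorpus_header = False
        elif key == "sent_id":
            sentence[key] = value
            in_subcorpus_header = False
        elif in_subcorpus_header:
            subcorpus[key] = value
        else:
            sentence[key] = value

    # Handle a final sentence that is not followed by a blank line.
    if seen_token:
        yield snapshot()
```

With this, a sentence in the `weblog` subcorpus would carry `genre` and `feats` under its `"subcorpus"` entry without those lines being repeated on every sentence. Whether subcorpus-level keys should instead be marked explicitly (rather than inferred from position) is an open design question.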

Metadata for subcorpora #1135
