Combining `--multilang` and paragraph-level annotations #45

I've been trying to keep `--multilang` working (the option where CLD2 splits a document into up to three documents, each with its own language label) while adding classifiers and JSON support. But should we?

Does anyone use `--multilang`? In HPLT we want to avoid breaking up documents, so we won't be using it.

How is `--multilang` supposed to work with the `--identify-paragraphs` option? The current implementation treats each split-off document as a document of its own, so you can only reproduce these stand-off annotations if you use exactly the same langid, so that the split happens in exactly the same places. This sounds like a bug to me.
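To make the problem concrete, here is a minimal Python sketch. It assumes paragraph identifiers are numbered within each output document, which is a simplification for illustration and not necessarily warc2text's actual output format; `number_paragraphs` and the two splits are hypothetical.

```python
def number_paragraphs(docs):
    """Number paragraphs per output document.

    Returns (document index, paragraph index, text) tuples. This is an
    assumed numbering scheme, not warc2text's real one.
    """
    return [
        (doc_idx, par_idx, text)
        for doc_idx, doc in enumerate(docs)
        for par_idx, text in enumerate(doc)
    ]


paragraphs = ["Hello", "Hallo", "Bonjour"]

# Two hypothetical langid runs that split the same page in different places.
split_a = [paragraphs[:1], paragraphs[1:]]
split_b = [paragraphs[:2], paragraphs[2:]]

ids_a = {text: (d, p) for d, p, text in number_paragraphs(split_a)}
ids_b = {text: (d, p) for d, p, text in number_paragraphs(split_b)}

# The same paragraph gets a different identifier depending on the split.
print(ids_a["Hallo"], ids_b["Hallo"])
```

With numbering like this, an annotation that refers to "document 1, paragraph 0" only points at the same text if the split is reproduced exactly.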

Do we want to keep multilang support when adding other paragraph-level annotations, such as the name of the block element (tag) that delineated the paragraph? It's more cumbersome to implement, since the boundaries of the langid chunks are wherever CLD2 puts them, not the paragraph boundaries that warc2text introduces when parsing HTML.
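A rough Python sketch of that mismatch, assuming paragraphs are newline-delimited in the extracted text and that the langid reports chunk start offsets into that same text. The function names and the offsets below are made up for illustration and are not taken from CLD2 or warc2text.

```python
def paragraph_starts(text):
    """Start offset of every newline-delimited paragraph in text."""
    starts = [0]
    for i, ch in enumerate(text):
        # A newline starts a new paragraph unless it is the final character.
        if ch == "\n" and i + 1 < len(text):
            starts.append(i + 1)
    return starts


def chunks_inside_paragraphs(text, chunk_starts):
    """Chunk start offsets that do not coincide with a paragraph start."""
    starts = set(paragraph_starts(text))
    return [offset for offset in chunk_starts if offset not in starts]


text = "Hello world. Hallo Welt.\nBonjour le monde.\n"

# Hypothetical chunk boundaries: the second chunk starts at "Hallo",
# in the middle of the first paragraph.
hypothetical_chunks = [0, 13, 25]

print(paragraph_starts(text))
print(chunks_inside_paragraphs(text, hypothetical_chunks))
```

Any chunk start that shows up in the second list cuts a paragraph in two, so a per-paragraph annotation such as the tag name would have to be duplicated or assigned to one of the halves.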

Related to #35.
