Description
I've been trying to support --multilang (where CLD2 splits a document into up to three documents, each with its own language label) while adding classifiers and JSON support. But should we?
Does anyone use --multilang? I think in HPLT we want to avoid breaking up documents, so it won't be used by us.
How is --multilang supposed to work with the --identify-paragraphs option? The current implementation treats each split-off document as a document of its own, so you can only replicate these stand-off annotations if you use the exact same langid and the split happens in exactly the same places. This sounds like a bug to me.
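To illustrate the concern, here is a rough Python sketch. It is not warc2text's actual code. It assumes, purely for illustration, that the split happens at paragraph granularity and that paragraphs are numbered within each split-off document. The helpers `split_by_language` and `local_id` and the two label lists are hypothetical.

```python
from collections import defaultdict


def split_by_language(paragraphs, labels):
    """Group paragraphs into one document per language label.

    Hypothetical model: each paragraph gets an index that is local to
    the split-off document it ends up in.
    """
    docs = defaultdict(list)
    for original_index, (text, lang) in enumerate(zip(paragraphs, labels)):
        local_index = len(docs[lang])
        docs[lang].append((local_index, original_index, text))
    return dict(docs)


def local_id(split, original_index):
    """Return (language, local paragraph index) for an original paragraph."""
    for lang, doc in split.items():
        for local_index, orig, _ in doc:
            if orig == original_index:
                return (lang, local_index)
    raise KeyError(original_index)


paragraphs = ["Hello world", "Hallo wereld", "Good morning", "Goedemorgen"]

# Two made-up langid outputs that disagree on paragraph 1 only.
labels_a = ["en", "nl", "en", "nl"]
labels_b = ["en", "en", "en", "nl"]

split_a = split_by_language(paragraphs, labels_a)
split_b = split_by_language(paragraphs, labels_b)

# The same original paragraph gets a different identifier per langid.
print(local_id(split_a, 2))  # ('en', 1)
print(local_id(split_b, 2))  # ('en', 2)
```

Under these assumptions, a single disagreement between two langid runs shifts the identifiers of every later paragraph in the affected documents, which is why the annotations only line up when the split is identical.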
Do we want to keep --multilang support when adding other paragraph-level annotations, such as the name (tag) of the block element that delineated that paragraph? It's a bit more cumbersome to implement because the boundaries of the langid chunks are wherever CLD2 puts them, not the paragraph boundaries that warc2text introduces when parsing HTML.
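A small sketch of the boundary mismatch, again in Python and not taken from warc2text or CLD2. It assumes both paragraphs and langid chunks can be described as half-open character ranges over the same text. The function `chunks_per_paragraph` and all offsets are made up for illustration.

```python
def chunks_per_paragraph(paragraph_spans, chunk_spans):
    """For each paragraph span, list the languages of the overlapping chunks.

    paragraph_spans: list of (start, end) half-open ranges.
    chunk_spans: list of (start, end, lang) half-open ranges.
    """
    result = []
    for p_start, p_end in paragraph_spans:
        overlapping = [
            lang
            for c_start, c_end, lang in chunk_spans
            if c_start < p_end and c_end > p_start
        ]
        result.append(overlapping)
    return result


# Hypothetical offsets: three paragraphs, two langid chunks.
paragraph_spans = [(0, 40), (40, 100), (100, 130)]
chunk_spans = [(0, 60, "en"), (60, 130, "nl")]

per_paragraph = chunks_per_paragraph(paragraph_spans, chunk_spans)
print(per_paragraph)  # [['en'], ['en', 'nl'], ['nl']]

# Paragraphs covered by more than one chunk would be cut by the split.
straddled = [i for i, langs in enumerate(per_paragraph) if len(langs) > 1]
print(straddled)  # [1]
```

In this made-up example the middle paragraph straddles two chunks, so any paragraph-level annotation such as its tag would have to be duplicated or assigned to one side of the split.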
Related to #35.