Plain text extraction and newlines #799
Unanswered
billziss-gh asked this question in Q&A
Replies: 0 comments
Question
In what situations does trafilatura output a newline when doing plain text extraction?
Explanation
I am using trafilatura to extract plain text from HTML, which I then process further with spaCy. spaCy segments text into sentences using punctuation, which can be a problem because headlines often do not end with a period (full stop).
So the following problem often happens. Trafilatura extracts:
Spacy then sees the following sentences:
An obvious fix would be to enforce a sentence break wherever there is a newline. However, this assumes that newlines are used only to separate headings, paragraphs, and similar blocks, and never appear inside them.
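To make the idea concrete, here is a minimal sketch of that fix, under the assumption stated above that every newline in the extracted text marks a block boundary. It splits the text on newlines first and only then segments each line into sentences. The names `naive_segment` and `sentences_with_newline_breaks` and the sample text are made up for illustration. The regex splitter is only a stand-in for the real segmenter, so it can be swapped for a function that runs spaCy on a single line.

```python
import re
from typing import Callable, Iterable, List


def naive_segment(line: str) -> List[str]:
    """Hypothetical stand-in for a real sentence segmenter such as spaCy."""
    # Split on whitespace that follows a sentence-ending punctuation mark.
    parts = re.split(r"(?<=[.!?])\s+", line.strip())
    return [p for p in parts if p]


def sentences_with_newline_breaks(
    text: str,
    segment: Callable[[str], Iterable[str]] = naive_segment,
) -> List[str]:
    """Treat every newline as a hard sentence boundary.

    Assumption: newlines in the extracted text only separate blocks
    (headings, paragraphs, list items) and never occur mid-sentence.
    """
    sentences: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between blocks
        # Segment each line on its own so a sentence cannot span a newline.
        sentences.extend(segment(line))
    return sentences


# Made-up sample: a headline without final punctuation, then a paragraph.
sample = "A headline without a period\nFirst sentence. Second sentence."
result = sentences_with_newline_breaks(sample)
for sentence in result:
    print(sentence)
```

With this approach the headline can never be merged into the first sentence of the following paragraph, but it is only as safe as the assumption about newlines, which is what the question is asking about.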