Summary
When a page on PACER (or elsewhere) contains characters that are not in the set of valid XML characters, lxml's html5 parser will fail. This is not hypothetical: while scraping a docket at the Ohio Northern Bankruptcy Court (ohnb), docketreport.parse() failed because of invalid XML characters coming back in the response.
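For context, the same ValueError can be reproduced directly with lxml. This is a minimal sketch that triggers the error via text assignment rather than the exact html5-parser call path juriscraper hits:

```python
from lxml import etree

# Assigning a string containing an invalid XML character (here a NULL
# byte) raises the ValueError described below: "All strings must be
# XML compatible: Unicode or ASCII, no NULL bytes or control characters".
el = etree.Element("p")
try:
    el.text = "bad\x00text"
except ValueError as e:
    print(e)
```
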
Tasks
- Update the code in juriscraper/lib/html_utils.py to escape these characters, probably using a regex so we don't lose too much speed.
- Capture the raw response of the parsed docket and include it in the test suite.
Questions
- Has anyone seen this type of error coming from a PACER scrape? You would have seen a traceback bubble up the stack with the message "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters".
- Any opposition to having someone (possibly me) work on a patch for html_utils to ensure this type of data is protected against?