Invalid XML characters break docket parsers #348

Open
@cgdeboer-toptal

Summary

When a page on PACER (or elsewhere) contains characters outside the set of valid XML characters, lxml's html5 parser fails.

This is not hypothetical: I was scraping a docket at the Ohio Northern Bankruptcy Court (ohnb), and docketreport.parse() failed because the response contained invalid XML characters.

Tasks

  • Update the code in juriscraper/lib/html_utils.py to escape these characters, probably with a regex so we don't lose too much speed.
  • Capture the raw response for the docket that failed to parse, and include it in the test suite.
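
As a starting point for the first task, here is a minimal sketch of a regex-based filter. It is not existing juriscraper code: the function name strip_invalid_xml_chars and the replacement parameter are hypothetical, and it removes the offending characters rather than escaping them, which may or may not be what we want in the end. The character ranges follow the Char production in the XML 1.0 specification.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str, replacement: str = "") -> str:
    """Hypothetical helper: replace characters that are not valid in XML 1.0.

    By default the characters are dropped; pass e.g. "\ufffd" to keep a
    visible marker where something was removed.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    raw = "Docket\x00 text\x0b with\ttabs"
    # NULL and vertical tab are removed; the ordinary tab is kept.
    print(repr(strip_invalid_xml_chars(raw)))
```

A single precompiled pattern keeps this to one pass over the response text, which should address the speed concern, though I have not benchmarked it against real dockets.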

Questions

  • Has anyone seen this type of error coming from a PACER scrape? You would have seen an "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters" traceback bubble up the stack.
  • Any opposition to someone (possibly me) working on a patch for html_utils to protect against this type of data?