xh_scanner loses data when tag name or attribute is too long #32

@jelmervdl


I was debugging browsermt/bergamot-translator#273 when I noticed that xh_scanner checks MAX_TOKEN_SIZE everywhere it appends characters to a buffer, but does not call push_back(c) when the limit is hit. As a result, whenever one of the for-loops that fill its internal buffers reaches that limit, the character that was just read is neither stored nor returned to the input, so it may be lost.

I think this only affects CDATA sections, comments, attribute values and tag names, so the bug has little impact on the main use case of warc2text.

Edit: Thinking about it, it would only affect the tag filters.
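
To illustrate the pattern described above, here is a minimal sketch. It is not the actual xh_scanner code: `Input`, `read_token`, `read_all` and the tiny `MAX_TOKEN_SIZE` of 4 are hypothetical stand-ins, and only the idea of a size check without a matching `push_back(c)` is taken from the report.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical limit, kept tiny so the effect is easy to see.
constexpr std::size_t MAX_TOKEN_SIZE = 4;

// Hypothetical input stream with a one-character push-back slot.
struct Input {
    std::string data;
    std::size_t pos = 0;
    char pending = 0;  // 0 means "nothing pushed back"

    char get_char() {
        if (pending) { char c = pending; pending = 0; return c; }
        return pos < data.size() ? data[pos++] : '\0';  // '\0' marks end of input
    }
    void push_back(char c) { pending = c; }
};

// Reads one token of at most MAX_TOKEN_SIZE characters.
// With push_back_on_limit == false, the character that hits the limit
// has already been consumed and is silently dropped.
std::string read_token(Input& in, bool push_back_on_limit) {
    std::string token;
    for (char c = in.get_char(); c != '\0'; c = in.get_char()) {
        if (token.size() >= MAX_TOKEN_SIZE) {
            if (push_back_on_limit) in.push_back(c);  // return c to the input
            break;
        }
        token.push_back(c);
    }
    return token;
}

// Reads tokens until the input is exhausted and concatenates them.
std::string read_all(const std::string& text, bool push_back_on_limit) {
    Input in;
    in.data = text;
    std::string out;
    for (;;) {
        std::string token = read_token(in, push_back_on_limit);
        if (token.empty()) break;
        out += token;
    }
    return out;
}

// Small demonstration of both variants on the same input.
inline void demo() {
    assert(read_all("abcdefghij", false) == "abcdfghi");   // 'e' and 'j' are lost
    assert(read_all("abcdefghij", true) == "abcdefghij");  // nothing is lost
}
```

In this sketch, pushing the character back when the limit is reached is enough to keep the input intact across token boundaries. Whether the real fix should push back, or handle the overlong token some other way, is a separate design question.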
