Inefficient parsing of some invalid files #2345

@thomas-boucher

Describe the bug
When parsing a file with invalid fields, the ProcessRFC4180BadField() method prepares the arguments for the optional user-defined badDataFound callback. One of those arguments is RawRecord, which allocates a new string containing the entire current line. In a degenerate case where bad quotes cause the whole file to be read as a single line with many columns (say 10 columns and 100k lines, so about 1 million fields), this allocates a multi-MB string up to 1 million times, which puts huge pressure on the GC and can lead to an OOM.
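To give a feel for the scale, here is a rough back-of-the-envelope sketch. It assumes that every field is reported as bad and that each report copies the record read so far, so the copied length grows linearly with the field index. The class name `RawRecordCost`, the method name, and the average field length are hypothetical and chosen only for illustration; none of this is the library's code.

```csharp
using System;

// Hypothetical helper for estimating the cost described above; not part of the library.
public static class RawRecordCost
{
    // Estimates the total bytes copied when each of `fieldCount` fields triggers
    // a copy of the record read so far (assumption: the record grows by
    // `averageFieldChars` characters per field, delimiters included).
    public static long EstimateBytesCopied(long fieldCount, long averageFieldChars)
    {
        // Sum over k = 1..fieldCount of (k * averageFieldChars) characters.
        long totalChars = averageFieldChars * fieldCount * (fieldCount + 1) / 2;

        // .NET strings are UTF-16, so each char takes sizeof(char) == 2 bytes.
        return totalChars * sizeof(char);
    }
}
```

Under these assumptions, `RawRecordCost.EstimateBytesCopied(1_000_000, 5)` comes out at about 5 TB of short-lived string data, even though the final record is only around 10 MB in memory.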

To Reproduce

"A,"B","C""
""a1","b1","c1""
""a2","b2","c2""
...

Expected behavior
The functional behavior is fine, but at the very least this cost should not be paid when the bad-data callback is not defined.

Additional context
I will open a suggested PR to avoid preparing the arguments if the callback is not defined.
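A minimal sketch of the idea, not the actual patch: `SketchParser`, `BadFieldArgs`, and `RawRecordAllocations` are hypothetical stand-ins for illustration and do not mirror the library's real types or signatures. The point is only that the expensive string is built after the null check, not before it.

```csharp
using System;

// Hypothetical argument type; NOT the library's real callback arguments.
public sealed class BadFieldArgs
{
    public BadFieldArgs(string field, string rawRecord)
    {
        Field = field;
        RawRecord = rawRecord;
    }

    public string Field { get; }
    public string RawRecord { get; }
}

// Hypothetical parser that holds one record in a char buffer.
public sealed class SketchParser
{
    private readonly char[] buffer;
    private readonly Action<BadFieldArgs> badDataFound; // may be null

    public SketchParser(string record, Action<BadFieldArgs> badDataFound)
    {
        buffer = record.ToCharArray();
        this.badDataFound = badDataFound;
    }

    // Counts how often the full record was copied, for demonstration only.
    public int RawRecordAllocations { get; private set; }

    public void ProcessBadField(int start, int length)
    {
        // No callback registered: return before building any arguments.
        if (badDataFound == null)
        {
            return;
        }

        // Only now pay for copying the whole record and the field.
        RawRecordAllocations++;
        var rawRecord = new string(buffer);
        var field = new string(buffer, start, length);

        badDataFound(new BadFieldArgs(field, rawRecord));
    }
}
```

With this shape, a parser constructed with a `null` callback never copies the record, while a parser with a callback still receives the same arguments as before. Users who do register a callback would still pay the full cost, so this only addresses the case where no callback is defined.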
