
[BUG] Malformed fixed length byte array Parquet file loads corrupted data instead of error #14104

Open
@jlowe

Description


Describe the bug
Loading a malformed Parquet file with libcudf "succeeds", producing a table with some corrupted rows rather than returning an error as expected. Spark 3.5, parquet-mr 1.13.1, and pyarrow 13 all produce unexpected-EOF errors when trying to load the same file.

Steps/Code to reproduce bug
Load https://github.com/apache/parquet-testing/blob/master/data/fixed_length_byte_array.parquet using libcudf. Note that it produces a table of 1000 rows with no nulls, and some of those rows have a list of bytes longer than 4 entries. According to the docs for the file, the data is supposed to be a single column holding a fixed-length byte array of size 4, yet some rows load with more than four bytes and some with none at all.
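
For reference, a minimal C++ sketch of the repro using the public libcudf Parquet reader API. The local file path is an assumption (download the file from the parquet-testing repo first), and the printed counts are just to make the incorrect "success" visible:

```cpp
// Repro sketch: read the malformed file and print the row/column counts.
// Expected: an exception about the malformed FIXED_LEN_BYTE_ARRAY data.
// Observed: the read "succeeds" and reports 1000 rows with corrupted entries.
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>

#include <iostream>

int main()
{
  auto const opts =
    cudf::io::parquet_reader_options::builder(
      cudf::io::source_info{"fixed_length_byte_array.parquet"})  // assumed local path
      .build();

  auto result = cudf::io::read_parquet(opts);

  std::cout << "Loaded " << result.tbl->num_rows() << " rows, "
            << result.tbl->num_columns() << " column(s)\n";
  return 0;
}
```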

Expected behavior
libcudf should return an error when trying to load the file rather than producing corrupted rows.

Labels: 1 - On Deck (to be worked on next), Spark (functionality that helps Spark RAPIDS), bug (something isn't working), cuIO (cuIO issue), libcudf (affects libcudf C++/CUDA code)
