[BUG] Chunked parquet reader incorrect results for large string columns #17158

Open
brandon-b-miller opened this issue Oct 23, 2024 · 0 comments · May be fixed by #17207
Labels
bug Something isn't working cudf.polars Issues specific to cudf.polars cuIO cuIO issue

Comments

@brandon-b-miller
Contributor

Describe the bug
When reading parquet with the chunked parquet reader, I get a different result than with the non-chunked reader for certain datasets with large string columns.

Steps/Code to reproduce bug

import pylibcudf as plc
import pyarrow as pa
import cudf

PATH = xyz # snip
columns = abc # snip

reader = plc.io.parquet.ChunkedParquetReader(
    plc.io.SourceInfo([PATH]),
    columns=columns,
    nrows=-1,
    skip_rows=0,
    chunk_read_limit=0,
    pass_read_limit=17179869184,  # 16 GiB
)
chk = reader.read_chunk()
tbl = chk.tbl
names = chk.column_names()
concatenated_columns = tbl.columns()

while reader.has_next():
    tbl = reader.read_chunk().tbl
    for i in range(tbl.num_columns()):
        concatenated_columns[i] = plc.concatenate.concatenate(
                [concatenated_columns[i], tbl._columns[i]]
        )
        tbl._columns[i] = None

chunked_col = plc.interop.to_arrow(concatenated_columns[2])

non_chunked = plc.io.parquet.read_parquet(
    plc.io.SourceInfo([PATH]),
    columns=columns,
    nrows=-1,
    skip_rows=0,
)

non_chunked_col = plc.interop.to_arrow(non_chunked.columns[2])

assert chunked_col.equals(non_chunked_col)  # fails: equals() returns False

Expected behavior
I expect the chunked and non-chunked readers to return the same data.
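To narrow down where the two results diverge, something like the following sketch could help. It is not part of the original repro: `first_mismatch` is a hypothetical helper, written in plain Python so that it works on any two sequences of row values, and it assumes the columns being compared are small enough to materialize as Python lists.

```python
from itertools import zip_longest

# Sentinel used internally to detect that one side ran out of rows.
_MISSING = object()
# Placeholder reported in the result for the side that has no row.
NO_ROW = "<no row>"


def first_mismatch(left, right):
    """Return (index, left_value, right_value) for the first differing row.

    Returns None when both sequences have the same length and contents.
    If one sequence is shorter, the missing side is reported as NO_ROW.
    """
    pairs = zip_longest(left, right, fillvalue=_MISSING)
    for i, (a, b) in enumerate(pairs):
        if a is _MISSING or b is _MISSING:
            # Length mismatch: one reader produced fewer rows.
            return (
                i,
                NO_ROW if a is _MISSING else a,
                NO_ROW if b is _MISSING else b,
            )
        if a != b:
            # Same position, different value (nulls compare equal as None).
            return (i, a, b)
    return None
```

With the variables from the repro above, it could be called as `first_mismatch(chunked_col.to_pylist(), non_chunked_col.to_pylist())`. The row index it returns shows whether the difference is a length mismatch or a value mismatch at a particular position.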

Environment overview (please complete the following information)

  • Environment location: Bare metal
  • Method of cuDF install: Source

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Tested on 24.12 head. PATH and columns are internal values that I can provide; contact me for details.
cc @galipremsagar @GregoryKimball

@brandon-b-miller brandon-b-miller added bug Something isn't working cuIO cuIO issue labels Oct 23, 2024
@vyasr vyasr added the cudf.polars Issues specific to cudf.polars label Oct 28, 2024
@mhaseeb123 mhaseeb123 self-assigned this Oct 30, 2024
Projects
Status: Todo
Status: In Progress
3 participants