[FEA] Add multi-threaded GZIP decompression for multi-source JSON reading #17638

Open
GregoryKimball opened this issue Dec 19, 2024 · 0 comments
Assignees
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@GregoryKimball
Contributor

Is your feature request related to a problem? Please describe.
The JSON reader in libcudf supports multi-source reading of GZIP-compressed JSONL files, using host-side decompression algorithms.

However, throughput is limited to about 100 MB/s because a single host thread decompresses the sources one after another (see the discussion in #17219).

Describe the solution you'd like
We should add a multi-threaded implementation of GZIP decompression that uses one host thread per source. Each source is a single compression block, so the sources can be decompressed independently of one another.
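
As a rough illustration of the one-thread-per-source idea, here is a minimal sketch in Python using only the standard library. It is not libcudf code: the function name `decompress_sources` and the `max_workers` parameter are hypothetical, and a real implementation would live in the C++ host-side decompression path. The sketch assumes each source is a complete GZIP stream held in memory.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def decompress_sources(sources, max_workers=None):
    """Decompress each GZIP-compressed source on its own host thread.

    `sources` is a list of bytes objects, each assumed to hold one
    complete GZIP stream. Results come back in input order.
    (Hypothetical helper for illustration; not part of libcudf.)
    """
    if not sources:
        return []
    # One worker per source unless the caller caps the pool size.
    workers = max_workers or len(sources)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the decompressed buffers line up
        # with the original source list.
        return list(pool.map(gzip.decompress, sources))


if __name__ == "__main__":
    jsonl = [b'{"a": 1}\n{"a": 2}\n', b'{"a": 3}\n']
    compressed = [gzip.compress(buf) for buf in jsonl]
    assert decompress_sources(compressed) == jsonl
```

Because the sources share no state, the only coordination needed is keeping the output buffers in source order before they are handed to the JSON parser.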

@GregoryKimball GregoryKimball added the feature request New feature or request label Dec 19, 2024
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Dec 19, 2024