
daft.write_parquet cannot partition column #4077

Answered by colin-ho
GZ82 asked this question in Q&A

Yes, there are a couple of data-skipping optimizations that Daft can do with Parquet.

Firstly, Parquet files themselves contain metadata such as min/max statistics per column, per row group. When you do a read -> filter operation, such as daft.read_parquet(file).where(daft.col("id") > 100), Daft analyzes the Parquet metadata to determine which row groups and files can match the filter, and reads only those. By default, Daft writes statistics to the Parquet files it produces.

Secondly, Daft can infer a hive-style partitioning scheme on reads, e.g. daft.read_parquet(file, hive_partitioning=True), and skip partitions entirely. Hive partitioning is essentially where the directories are named by the partition value. This…

This discussion was converted from issue #4043 on March 26, 2025 17:36.