-
Describe the bug
Cannot apply partitioning using partition_cols in DataFrame.write_parquet.
Column to partition: id, datatype: Utf8
Python version: 3.11.9

To Reproduce
import daft
import random

n_rows = 200_000
random_numbers = [random.randint(1, n_rows) for _ in range(n_rows)]
df = {
    'id': [str(i) for i in random_numbers]
}
for i in range(11):
    df[f'col{i}'] = [random.randint(1, n_rows) for _ in range(n_rows)]
df = daft.from_pydict(df)

# df as described above
df.write_parquet(
    root_dir='path/to/test',  # please also check uploading to AWS S3 if possible, e.g., 's3://bucket/keys'
    write_mode="append",
    partition_cols=["id"],
)

Expected behavior
I would expect the partitioning to complete in about a second for 200_000 rows.

Component(s)
Python Runner

Additional context
No response
Replies: 6 comments
-
Hi @GZ82, thanks for opening this issue! Looks like you are partitioning on the id column, which has a very large number of distinct values, so this write produces a correspondingly large number of files. Writing 200_000 files to local disk, even if the files are small, is probably going to be quite slow as we will be bottlenecked by file system operations, disk I/O, etc. What we can do on our end is implement a mechanism to cap the number of open files at a time, which could help. Writing to S3 could be faster, and if that is what you need, then you could benefit from running this workload distributed, via the ray_runner. Just to make sure, is your intent to partition this data into a separate directory for each unique id?
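A minimal sketch of the distributed variant suggested above, assuming Ray is installed and a Ray runner can be selected via daft.context.set_runner_ray(); the S3 bucket/key, region, and credential setup are placeholders and not part of the original report:

import daft
from daft.io import IOConfig, S3Config

# Switch to the Ray runner so the many per-partition writes are spread across workers.
daft.context.set_runner_ray()  # pass address="..." to attach to a remote Ray cluster

# Placeholder region; in practice credentials/region usually come from the environment.
io_config = IOConfig(s3=S3Config(region_name="us-west-2"))

df = daft.from_pydict({
    "id": ["1", "2", "2", "3"],
    "col0": [10, 20, 30, 40],
})

df.write_parquet(
    "s3://bucket/keys",          # placeholder path from the report's comment
    write_mode="append",
    partition_cols=["id"],
    io_config=io_config,
)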
-
Hi @colin-ho, thanks very much for your quick reply! What I want to know is whether Daft has any feature/function that can make use of the partitioning to skip data when reading and filtering.
-
Yes, there's a couple of data skipping optimizations that Daft can do with parquet.

Firstly, parquet itself contains metadata such as min/max stats per column. When you do a read -> filter operation, such as daft.read_parquet(file).where('id' > 100), Daft will analyze the parquet metadata to determine which row groups / files match or don't match the filter, and only read the ones that do. By default, Daft will write statistics to parquet files.

Secondly, Daft can infer a hive style partitioning scheme for reads, e.g. daft.read_parquet(file, hive_partitioning=True), and skip partitions entirely. Hive partitioning is essentially where the directories are named by the partition value. Let's say we wrote parquet with partition_cols=["id"]: the output is laid out like root_dir/id=<value>/, so a filter on id can skip non-matching directories without touching their files.

The hive partitioning approach works well when there's a lot of data per partition. If you are filtering on something like a unique key, then it won't work so well because there will be many directories and files, and parquet statistics will likely do a better job.
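A slightly fuller sketch of the two read paths described above, assuming the hive-partitioned layout produced by the repro (path/to/test/id=<value>/...); the glob path and the filter value are placeholders:

import daft
from daft import col

# 1) Statistics-based skipping: the filter is pushed into the scan, and row groups /
#    files whose min/max stats for "id" cannot match are never read.
df_stats = (
    daft.read_parquet("path/to/test/**/*.parquet")
    .where(col("id") == "12345")   # id is Utf8 in the repro, so compare against a string
)

# 2) Hive partition pruning: directories named id=<value> are read back as a partition
#    column, and directories that cannot match the filter are skipped entirely.
df_hive = (
    daft.read_parquet("path/to/test/**/*.parquet", hive_partitioning=True)
    .where(col("id") == "12345")
)

df_stats.show()
df_hive.show()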
-
Hi @colin-ho, thanks again for your detailed reply! I will think about how to partition my data, and I'm glad to know Daft will write statistics to parquet files. My last question before closing the issue (which is not a real one): will any other metadata be written to the parquet file in addition to statistics?
-
Yes! Parquet metadata also stores information such as the encodings, compression scheme, byte offsets of the columns, etc. You can find a general list of what metadata can be stored in a parquet file here: https://parquet.apache.org/docs/file-format/metadata/

The byte offsets of the columns, for example, are useful when you read only a few columns, e.g. daft.read_parquet(file).select('id', 'col0'): Daft only needs to fetch the byte ranges of the selected column chunks instead of the whole file.
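This footer metadata is easy to inspect directly. A small sketch using pyarrow (not mentioned in the thread, just a common way to peek at a parquet footer); the file path is a placeholder for one of the files written by the repro:

import pyarrow.parquet as pq

# Open one of the written files and read only its footer metadata.
meta = pq.ParquetFile("path/to/test/id=1/some-file.parquet").metadata

print(meta)  # number of row groups, columns, rows, format version, ...

# Per-column-chunk details for the first row group: statistics (min/max, null count),
# encodings, compression codec, and where the column's bytes live in the file.
col_chunk = meta.row_group(0).column(0)
print(col_chunk.statistics)
print(col_chunk.encodings)
print(col_chunk.compression)
print(col_chunk.data_page_offset, col_chunk.total_compressed_size)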
-
Also, I'm going to convert this issue to a discussion.