Daft write_parquet() takes forever, and then the file cannot be read #4372
Replies: 3 comments 4 replies
-
Just curious, why write to disk then upload to s3? I think you'll get better performance writing directly to s3.
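Writing straight to an S3 prefix with Daft can look roughly like the sketch below (the bucket, columns, and region are placeholders; on EC2 the instance role's credentials are usually picked up automatically, so the explicit IOConfig may not even be needed):

```python
import daft
from daft.io import IOConfig, S3Config

# Placeholder region; an EC2 instance profile is normally discovered automatically.
io_config = IOConfig(s3=S3Config(region_name="us-east-1"))

# Placeholder data standing in for the real DataFrame.
df = daft.from_pydict({"id": list(range(1_000_000)), "value": [1.5] * 1_000_000})

# Write parquet files directly to S3, skipping the local-disk hop.
df.write_parquet("s3://my-bucket/data/", io_config=io_config)
```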
-
Good question! That's merely the pattern that existed before, but I will give the direct write a shot and let you know.
-
Fwiw, this doesn't address your permissions issue when writing to a local Windows file; we'll need to look into that more closely. I think it is also fair to say that our writes could be much faster than they are today. We currently use the pyarrow writers, which are completely single threaded. The next release of Daft will include a native writer that encodes columns in parallel [1], which we plan to extend to write row groups in parallel [2]. @rohitkulshreshtha is also working on our remote puts [3] so that our S3 writes also become blazing fast :) Keep your eyes peeled in the coming weeks, and do keep us accountable if your workload doesn't speed up as expected.
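To see how much of the hour goes to the write itself versus the upstream computation, one rough diagnostic (purely a sketch; the synthetic data and output paths are placeholders) is to materialize the DataFrame first and then time Daft's writer against a direct pyarrow write of the same data:

```python
import time

import daft
import pyarrow.parquet as pq

# Synthetic stand-in data; swap in the real DataFrame construction.
df = daft.from_pydict({"id": list(range(1_000_000)), "value": [1.5] * 1_000_000})

# Materialize first so the timings below measure only the write step.
df = df.collect()

t0 = time.time()
df.write_parquet("out_daft")  # Daft's current pyarrow-based writer
print(f"daft write_parquet: {time.time() - t0:.1f}s")

t0 = time.time()
pq.write_table(df.to_arrow(), "out_pyarrow.parquet")  # same data via pyarrow directly
print(f"pyarrow write_table: {time.time() - t0:.1f}s")
```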
-
Hello,
We are using daft==0.4.12 and are migrating some old code which uses pandas. The code involves constructing a dataframe, writing it to disk, and then uploading it to s3. The code is running on a Windows EC2 instance. The code looks as follows:
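A representative sketch of the pattern (the column names, local path, and bucket below are placeholders, not the original snippet):

```python
from pathlib import Path

import boto3
import daft

# Build the DataFrame (placeholder data).
df = daft.from_pydict({"id": list(range(1_000_000)), "value": [1.5] * 1_000_000})

# Write parquet files to a local directory on the Windows instance (placeholder path).
df.write_parquet(r"C:\temp\daft_output")

# Upload whatever files Daft produced to S3 (placeholder bucket/prefix).
s3 = boto3.client("s3")
for f in Path(r"C:\temp\daft_output").glob("*.parquet"):
    s3.upload_file(str(f), "my-bucket", f"daft_output/{f.name}")
```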
With Pandas, this code takes only a few seconds. However, with daft, the write_parquet() step takes over an hour. Then the next step fails due to a permission error. Any ideas what the issue might be?
Thank you!