Daft write_parquet() takes forever, and then the file cannot be read #4372
Replies: 3 comments 4 replies
-
Just curious, why write to disk then upload to s3? I think you'll get better performance writing directly to s3.
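Writing straight to an S3 prefix with Daft can look roughly like the sketch below (the bucket, columns, and region are placeholders; on EC2 the instance role's credentials are usually picked up automatically, so the explicit IOConfig may not even be needed):

```python
import daft
from daft.io import IOConfig, S3Config

# Placeholder region; an EC2 instance profile is normally discovered automatically.
io_config = IOConfig(s3=S3Config(region_name="us-east-1"))

# Placeholder data standing in for the real DataFrame.
df = daft.from_pydict({"id": list(range(1_000_000)), "value": [1.5] * 1_000_000})

# Write parquet files directly to S3, skipping the local-disk hop.
df.write_parquet("s3://my-bucket/data/", io_config=io_config)
```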
-
Good question! That's merely the pattern that existed before, but I will give the direct write a shot and let you know.
-
Fwiw, this doesn't address your permissions issue when writing to a local Windows file; we'll need to look into that more closely. I think it is also fair to say that our writes could be much faster than they are today. We currently use the pyarrow writers, which are completely single threaded. The next release of Daft will include a native writer that encodes columns in parallel [1], which we plan to extend to write row groups in parallel [2]. @rohitkulshreshtha is also working on our remote puts [3] so that our S3 writes also become blazing fast :) Keep your eyes peeled in the coming weeks, and do keep us accountable if your workload doesn't speed up as expected.
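To see how much of the hour goes to the write itself versus the upstream computation, one rough diagnostic (purely a sketch; the synthetic data and output paths are placeholders) is to materialize the DataFrame first and then time Daft's writer against a direct pyarrow write of the same data:

```python
import time

import daft
import pyarrow.parquet as pq

# Synthetic stand-in data; swap in the real DataFrame construction.
df = daft.from_pydict({"id": list(range(1_000_000)), "value": [1.5] * 1_000_000})

# Materialize first so the timings below measure only the write step.
df = df.collect()

t0 = time.time()
df.write_parquet("out_daft")  # Daft's current pyarrow-based writer
print(f"daft write_parquet: {time.time() - t0:.1f}s")

t0 = time.time()
pq.write_table(df.to_arrow(), "out_pyarrow.parquet")  # same data via pyarrow directly
print(f"pyarrow write_table: {time.time() - t0:.1f}s")
```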
-
Hello,
We are using daft==0.4.12 and are migrating some old code which uses pandas. The code involves constructing a dataframe, writing it to disk, and then uploading it to s3. The code is running on a Windows EC2 instance. The code looks as follows:
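A representative sketch of the pattern (the column names, local path, and bucket below are placeholders, not the original snippet):

```python
from pathlib import Path

import boto3
import daft

# Build the DataFrame (placeholder data).
df = daft.from_pydict({"id": list(range(1_000_000)), "value": [1.5] * 1_000_000})

# Write parquet files to a local directory on the Windows instance (placeholder path).
df.write_parquet(r"C:\temp\daft_output")

# Upload whatever files Daft produced to S3 (placeholder bucket/prefix).
s3 = boto3.client("s3")
for f in Path(r"C:\temp\daft_output").glob("*.parquet"):
    s3.upload_file(str(f), "my-bucket", f"daft_output/{f.name}")
```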
With Pandas, this code takes only a few seconds. However, with daft, the write_parquet() step takes over an hour. Then the next step fails due to a permission error. Any ideas what the issue might be?
Thank you!