Ray runner significantly slower than native runner #5812
-
@yuchaoran2011 Can you add to the test cases here? cc @colin-ho
-
Some additional information: I printed out the logical plans under both the Ray runner and the native runner. It turns out they are identical, except that in the Ray case the number of partitions is much larger. An example: the native runner plan for … The same partition mismatch also happens for other dataframes defined in the tutorial (e.g. …).
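For reference, a minimal sketch of how the plans can be printed for comparison under each runner, assuming Daft's DataFrame.explain API (the show_all flag and the exact output format may differ across versions; the input path is a placeholder, not the tutorial's data):

```python
import daft

# Placeholder input standing in for the dataframes built in the
# minhash-dedupe tutorial; the point is only how to dump the plans.
df = daft.read_parquet("data/corpus.parquet")

# With show_all=True, explain prints the unoptimized, optimized, and
# physical plans; the partition counts discussed above show up there.
df.explain(show_all=True)
```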
-
Figured it out, thanks to a kind community member I talked to off GitHub! The trick was to manually repartition the dataframes involved in a join before joining them. Now the script completes successfully in about 3 minutes, as opposed to over an hour. I was hoping Daft would intelligently decide on the number of partitions to use, but it looks like manual tuning is still required.
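A minimal sketch of that workaround, assuming Daft's repartition and join APIs; the input paths, the join key "id", and the partition count N are placeholders for illustration, not values from the tutorial:

```python
import daft

daft.set_runner_ray()

# Placeholder inputs standing in for the two sides of the join in the
# dedupe pipeline.
left = daft.read_parquet("data/docs.parquet")
right = daft.read_parquet("data/candidates.parquet")

# Manually repartition both sides on the join key before joining, so the
# Ray runner works with a modest partition count instead of the much
# larger one it was choosing by default.
N = 8  # roughly the number of available CPU cores
left = left.repartition(N, "id")
right = right.repartition(N, "id")

joined = left.join(right, on="id")
joined.collect()
```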
-
Hi all, I have tried to run a file deduplication pipeline with Daft, borrowing most of the code from this tutorial: https://docs.daft.ai/en/stable/examples/minhash-dedupe/. The code works well on the native Swordfish runner: processing a 100 MB dataset takes a couple of seconds. Now I'm ready to run the same code in cluster mode on Ray. I kept the code intact except for adding daft.set_runner_ray() at the beginning of the script. For the same input dataset it became extremely slow; it's now been 30 minutes and the job still hasn't finished.

I've tested that on a much smaller dataset (10 MB) the Ray runner runs the script successfully in about a minute. But why does performance degrade so much on a larger dataset? Initially I thought Ray worker communication overhead might be dominating, so I tried ray.init(num_cpus=2) to limit the number of CPUs, but it didn't help.

From the console output, I can see that (InMemoryScan, InMemoryScan)->HashJoin->Project->UnGroupedAggregate is the most time-consuming stage, which is understandable but still doesn't explain the huge discrepancy with the native runner. I should also add that no errors are reported. Any ideas?
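For completeness, a sketch of the only change made relative to the native-runner script, under the assumptions stated above (the input path is a placeholder; the rest of the pipeline follows the minhash-dedupe tutorial):

```python
import daft
import ray

# Experiment from the question: cap Ray's CPU count to rule out
# worker-communication overhead (it did not help).
ray.init(num_cpus=2)

# The runner must be selected before any dataframe work; this is the one
# line added to the otherwise unchanged tutorial script.
daft.set_runner_ray()

df = daft.read_parquet("data/corpus.parquet")  # ~100 MB dataset
# ... minhash/dedupe pipeline from the tutorial ...
df.collect()
```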