Replies: 1 comment 3 replies
Hi @min-mwei, thanks for raising this! Could you run:

We do have documentation around memory usage, but it's geared towards running on the Ray runner. Daft does not perform out-of-core processing when running on the PyRunner (the default single-node backend used when you run Daft without explicitly switching runners). cc @samster25 as well
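For reference, switching away from the default PyRunner is done through `daft.context`. A minimal sketch, assuming Daft 0.2.x with the Ray extra installed and either a local Ray instance or a reachable cluster:

```python
import daft

# Switch Daft from the default single-node PyRunner to the Ray runner,
# which is the backend the memory-usage docs are written for.
# This must be called before any dataframe operations are executed.
# With no address given, Ray is started locally.
daft.context.set_runner_ray()

# To attach to an existing cluster instead, pass its address, e.g.:
# daft.context.set_runner_ray(address="ray://<head-node>:10001")
```

The cluster address shown is a placeholder; substitute your own head-node host.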
I just played with some real data. My test is a simple count over 1000+ parquet files totaling 250+ GB, running on a VM with 370 GB of memory and 48 cores.
The following line finishes the count with ~100 GB of DRAM consumption, in about 20 minutes:
```python
daft.read_parquet("az://...", io_config=io_config) \
    .select('some_id', 'some_field') \
    .groupby('some_id') \
    .agg([('some_field', 'count')]) \
    .collect()
```

I thought splitting into two partitions, like below, would speed it up:

```python
daft.read_parquet("az://...", io_config=io_config) \
    .into_partitions(2) \
    .select('some_id', 'some_field') \
    .groupby('some_id') \
    .agg([('some_field', 'count')]) \
    .collect()
```

Yet it runs out of memory in under a minute, and there is no obvious config setting to control memory usage:

```
RuntimeError: Requested 4056071093404 bytes of memory but found only 405476225024 available
```

My package versions:

```
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
>>> daft.__version__
'0.2.14'
```

Thanks for any suggestion.
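As a sanity check on the error above, the requested allocation is roughly 10x what the VM actually has free, which suggests that `into_partitions(2)` forces Daft to materialize far more data per partition than the machine can hold:

```python
# Figures taken verbatim from the RuntimeError above.
requested = 4_056_071_093_404   # bytes Daft tried to allocate (~3.7 TiB)
available = 405_476_225_024     # bytes reported available (~378 GiB)

ratio = requested / available
print(f"requested / available = {ratio:.1f}x")  # prints: requested / available = 10.0x
```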