Replies: 1 comment 3 replies
Hi @min-mwei, thanks for raising this! Could you run:

We do have documentation around memory usage, but it's geared towards running on the Ray runner. Daft does not perform out-of-core processing when running on the PyRunner (the default single-node backend used when you run Daft without explicitly switching runners). cc @samster25 as well
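For reference, switching away from the default PyRunner is done through `daft.context`. A minimal sketch, assuming Daft 0.2.x with the Ray extra installed and either a local Ray instance or a reachable cluster:

```python
import daft

# Switch Daft from the default single-node PyRunner to the Ray runner,
# which is the backend the memory-usage docs are written for.
# This must be called before any dataframe operations are executed.
# With no address given, Ray is started locally.
daft.context.set_runner_ray()

# To attach to an existing cluster instead, pass its address, e.g.:
# daft.context.set_runner_ray(address="ray://<head-node>:10001")
```

The cluster address shown is a placeholder; substitute your own head-node host.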
I just played with some real data. My test is a simple count over 1000+ parquet files totaling 250+ GB, running on a VM with 370 GB of memory and 48 cores.
The following line finishes the count with ~100 GB of DRAM consumption, in about 20 minutes:
```python
daft.read_parquet("az://...", io_config=io_config) \
    .select('some_id', 'some_field') \
    .groupby('some_id') \
    .agg([('some_field', 'count')]) \
    .collect()
```

I thought splitting into two partitions, like below, would speed it up:

```python
daft.read_parquet("az://...", io_config=io_config) \
    .into_partitions(2) \
    .select('some_id', 'some_field') \
    .groupby('some_id') \
    .agg([('some_field', 'count')]) \
    .collect()
```

Yet it runs out of memory in under a minute, and there is no obvious config setting to control memory usage:

```
RuntimeError: Requested 4056071093404 bytes of memory but found only 405476225024 available
```

My package versions:

```
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
>>> daft.__version__
'0.2.14'
```

Thanks for any suggestion.
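As a sanity check on the error above, the requested allocation is roughly 10x what the VM actually has free, which suggests that `into_partitions(2)` forces Daft to materialize far more data per partition than the machine can hold:

```python
# Figures taken verbatim from the RuntimeError above.
requested = 4_056_071_093_404   # bytes Daft tried to allocate (~3.7 TiB)
available = 405_476_225_024     # bytes reported available (~378 GiB)

ratio = requested / available
print(f"requested / available = {ratio:.1f}x")  # prints: requested / available = 10.0x
```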