Heavy & increasing memory usage when looping write_parquet #5546
-
Hey @SanilK2108, thanks for the detailed issue and repro instructions. I'm running it now with the latest Daft on my MacBook and unfortunately can't reproduce it; memory hovers consistently around 160 MB. I also removed all the manual cleanup attempts and the execution config, and memory hovers around 190 MB. I'm going to try running on Linux next. In the meantime, one thing that might be useful is https://github.com/bloomberg/memray, which is what I use to profile memory. It can create flamegraphs of which functions allocate the most memory.
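For reference, here is a minimal sketch of the kind of write loop being discussed, together with the memray commands as comments. The loop body below is a hypothetical stand-in; the actual repro script from this issue is not reproduced here.

```python
# Hypothetical stand-in for the repro: repeatedly build a small DataFrame and
# write it to parquet in a loop. Not the original script from this issue.
import daft

def main() -> None:
    for i in range(100):
        df = daft.from_pydict({"id": list(range(100_000)), "value": [i] * 100_000})
        df.write_parquet(f"/tmp/daft_out/batch_{i}")

if __name__ == "__main__":
    main()

# Profile allocations with memray, e.g.:
#   memray run -o profile.bin repro.py
#   memray flamegraph profile.bin      # writes an HTML flamegraph of allocating functions
# or watch allocations interactively:
#   memray run --live repro.py
```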
-
Hi @colin-ho, thanks for replying. I reran the script I sent you, with and without memray live tracking, and I am seeing the same graph of pod memory usage in Grafana. This is the exact metric I am looking at: `container_memory_working_set_bytes`. However, under memray live tracking I don't see any increase in memory usage; memory consistently stays under ~100 MB. I had also tried using […]. I tried looking at […]. Is there any reason we would not see this memory growth in RSS, but would see it in `container_memory_working_set_bytes`?
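One way to line the two up is to sample the process's own RSS from inside the container while the loop runs. This is a Linux-only sketch and is not part of the original script:

```python
# Sketch: read the process's resident set size from /proc/self/statm (Linux only),
# so RSS over time can be compared against container_memory_working_set_bytes.
import os
import time

def rss_bytes() -> int:
    # /proc/self/statm fields are in pages: size resident shared text lib data dt
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")

if __name__ == "__main__":
    # Example: print RSS once per second; run alongside the write loop.
    for _ in range(5):
        print(f"rss = {rss_bytes() / 1e6:.1f} MB")
        time.sleep(1)
```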
-
Are there any other metrics you are able to see, e.g. […]? A possible theory is that during the writes the kernel has not yet started flushing the dirty pages used for the writes, and so those pages count towards `container_memory_working_set_bytes`.
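One way to check this theory is to look at the page-cache and dirty-page counters of the container's cgroup. This is a sketch assuming the cgroup v2 layout; under cgroup v1 the file lives at /sys/fs/cgroup/memory/memory.stat and the field names differ:

```python
# Sketch: dump the cache-related counters from the container's memory.stat.
# Assumes cgroup v2; adjust the path and field names for cgroup v1.
FIELDS = ("anon", "file", "file_dirty", "file_writeback", "inactive_file", "active_file")

def cgroup_memory_stat(path: str = "/sys/fs/cgroup/memory.stat") -> dict:
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

if __name__ == "__main__":
    stats = cgroup_memory_stat()
    for field in FIELDS:
        print(f"{field:>16}: {stats.get(field, 0) / 1e6:.1f} MB")
```

If `file`, `file_dirty`, and `inactive_file` grow while `anon` stays roughly flat, the growth is page cache from the writes rather than heap allocations by the process.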
-
Hi @colin-ho, adding more details on the metrics: […]
I had a few questions based on these: Do you think this is a Daft issue or a kernel / OS-level issue? I also had some doubts about which Daft configuration options we could experiment with. […]
Thanks for taking the time and helping us out here. Really appreciate it!
-
What I think is happening is that the page cache is being populated by the writes, which is why `memory_cache` and `working_set_bytes` are increasing. Lowering parameters like `parquet_target_file_size` or `parquet_target_row_group_size` will make Daft write more frequently, but when those pages get flushed to disk is up to the kernel, not Daft.
Only in UDFs. Your query doesn't have any UDFs, so there should be no other processes.
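As a concrete way to experiment with the write-target parameters mentioned above, a sketch; `daft.set_execution_config` is the entry point used here, but the exact keyword spellings (e.g. `parquet_target_filesize` vs `parquet_target_file_size`) and the values below are illustrative and should be checked against the installed Daft version:

```python
import daft

# Sketch: lower the parquet write targets so files / row groups are flushed more often.
# Keyword names follow the parameters mentioned above; verify the exact spelling
# against your Daft version's daft.set_execution_config signature. Sizes are
# illustrative, not recommendations.
daft.set_execution_config(
    parquet_target_filesize=64 * 1024 * 1024,       # ~64 MB files
    parquet_target_row_group_size=8 * 1024 * 1024,  # ~8 MB row groups
)
```

Even with smaller, more frequent writes, when the resulting dirty pages are flushed and reclaimed is still decided by the kernel, as noted above.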