-
Notifications
You must be signed in to change notification settings - Fork 665
Open
Labels
P1Important tasks that we should complete soonImportant tasks that we should complete soonPerformance 🚀Performance related issues and pull requests.Performance related issues and pull requests.
Description
No specific reproducer attached, but this is observable on a variety of workloads using large DFs.
Modin on Ray uses excessive amounts of memory for a wide range of operations, and we should investigate how this can be reduced. Consider this diagram from the authors of Dias [paper + GH] and PandasBench, where Modin uses obscene amounts of memory for some notebooks:

While testing an implementation for parallel writes from Ray datasets to Snowflake, we also observed the upload of a ~3GB in-memory dataframe to be using well over 100GB RAM.
Further areas of investigation:
- Is this entirely a Ray issue, or are there things Modin can do to address it?
- Is excessive memory usage also an issue for Modin on Dask? The PandasBench paper indicates that dask DFs' peak memory consumption is relatively low.
See also: #5524
scottedwards2000
Metadata
Metadata
Assignees
Labels
P1Important tasks that we should complete soonImportant tasks that we should complete soonPerformance 🚀Performance related issues and pull requests.Performance related issues and pull requests.