PERF: Reduce excessive memory usage in Ray #7655

@sfc-gh-joshi

Description

No specific reproducer attached, but this is observable on a variety of workloads using large DataFrames.

Modin on Ray uses excessive amounts of memory for a wide range of operations, and we should investigate how this can be reduced. Consider this diagram from the authors of Dias [paper + GH] and PandasBench, where Modin uses obscene amounts of memory for some notebooks:

[Figure: per-notebook peak memory comparison from the Dias/PandasBench authors, with Modin as a clear outlier on several notebooks]

While testing an implementation for parallel writes from Ray datasets to Snowflake, we also observed uploading a ~3GB in-memory dataframe consume well over 100GB of RAM.
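For triage, a first step is measuring peak memory per process. A minimal stdlib-only sketch (no Modin dependency; the threshold and allocation size here are illustrative) is below. Note that with Modin on Ray, most memory lives in Ray worker processes and the object store rather than the driver, so driver RSS alone undercounts; the `ray memory` CLI shows object-store usage across the cluster.

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MB."""
    ru = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux, but in bytes on macOS
    return ru / (1024 * 1024) if sys.platform == "darwin" else ru / 1024

# Allocate ~100 MB so the peak is clearly visible above interpreter baseline
buf = bytearray(100 * 1024 * 1024)
print(f"peak RSS: {peak_rss_mb():.0f} MB")
```

Wrapping the suspect operation (e.g. the DataFrame upload) between two `peak_rss_mb()` calls in each process gives a rough per-process bound to compare against the dataframe's in-memory size.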

Further areas of investigation:

  • Is this entirely a Ray issue, or are there things Modin can do to address it?
  • Is excessive memory usage also an issue for Modin on Dask? The PandasBench paper indicates that Dask DataFrames' peak memory consumption is relatively low.

See also: #5524


Labels: P1 (Important tasks that we should complete soon), Performance 🚀 (Performance related issues and pull requests)
