
Explore using Ray to run DASH with fewer tasks #511

Open
@smcguire-cmu

Description

Instead of individually splitting and writing the output shards and then reducing them in a separate step, we could try using Ray and taking advantage of its shared-memory object store: create one task per output HATS partition, give it all the input shards that partition needs, and write the output directly.

Since the parquet files are spatially partitioned by tract and patch, this should limit the number of input files each output task needs.

Pros:

  • Skips one full read/write pass of the pipeline

Cons:

  • Workers need enough memory to load multiple tract parquet files at the same time (I think up to 4 for HEALPix tiles at the corner of a patch).
  • Ray's scheduler needs to be smart about which tasks it assigns and about when it spills objects from the store to disk and reloads them. This is especially difficult on a cluster with multiple machines. If it is not, this could result in a lot of unnecessary I/O, which could make the entire thing slower.
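The approach above could be sketched with Ray's task API roughly as follows. This is a minimal illustration, not DASH code: the `shard_map` input, the file names, and the helper names are hypothetical, and the per-partition read/select/write logic is left as a stub.

```python
# Sketch: one Ray task per output HATS partition, reading every input
# tract/patch parquet file it needs and writing the partition directly,
# with no intermediate shard-reduce step.
from collections import defaultdict


def group_inputs_by_output(shard_map):
    """Invert a {input_file: [output_partition, ...]} map so that each
    output partition lists every input file it needs (expected to be a
    small number, since inputs are spatially partitioned by tract/patch)."""
    outputs = defaultdict(list)
    for input_file, partitions in shard_map.items():
        for part in partitions:
            outputs[part].append(input_file)
    return dict(outputs)


def run_with_ray(shard_map):
    """Launch one Ray task per output partition (requires `pip install ray`)."""
    import ray  # imported lazily so the grouping helper stays dependency-free

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def write_partition(partition, input_files):
        # Read the tract/patch parquet files, select the rows belonging to
        # this HEALPix partition, and write the output parquet directly.
        ...

    tasks = [
        write_partition.remote(part, files)
        for part, files in group_inputs_by_output(shard_map).items()
    ]
    ray.get(tasks)  # block until every output partition has been written
```

Because each task receives plain file paths and does its own reads, the object store is only exercised when Ray decides to cache or spill task outputs; whether that wins over the current two-step pipeline is exactly the scheduling question raised in the cons above.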

Metadata

    Labels

      • dataset — Working with a particular dataset
      • performance — For slow queries or compute bottlenecks
