Description
Instead of individually splitting and writing the output shards and then reducing them in a separate step, we could try using Ray and taking advantage of its shared-memory object store: create one task per output HATS partition, hand it all of the input shards it needs, and write the output directly.
Since the input parquet files are spatially partitioned by tract and patch, this should limit the number of input files each output task needs.
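As a rough illustration (not the actual pipeline code), the sketch below assumes one Ray task per input tract/patch parquet file that loads it into the object store, and one task per output HATS pixel that consumes the relevant shards from the store and writes the partition directly. `plan_partitions`, `select_rows_for_pixel`, and the `healpix_29` column name are placeholders standing in for whatever partition planner and spatial filter the real pipeline would use.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
import ray


def select_rows_for_pixel(table, pixel):
    # Placeholder spatial cut: assumes a precomputed healpix index column
    # ("healpix_29" is a hypothetical name) on each input table.
    return table.filter(pc.equal(table["healpix_29"], pixel))


@ray.remote
def load_input_shard(path):
    # One task per tract/patch parquet file. The returned Arrow table is
    # held in Ray's shared-memory object store, so every output task that
    # needs this shard can reuse it without re-reading the file.
    return pq.read_table(path)


@ray.remote
def write_output_partition(pixel, out_path, *input_tables):
    # input_tables are resolved from the object store by Ray before the
    # task runs; a pixel on a patch corner may need up to ~4 of them.
    combined = pa.concat_tables(input_tables)
    partition = select_rows_for_pixel(combined, pixel)
    pq.write_table(partition, out_path)
    return pixel, partition.num_rows


ray.init()

# plan_partitions() is a placeholder that yields
# (output_pixel, output_path, [input parquet paths]) for every HATS partition.
shard_refs = {}
futures = []
for pixel, out_path, in_paths in plan_partitions():
    for p in in_paths:
        if p not in shard_refs:  # launch each loader task only once per file
            shard_refs[p] = load_input_shard.remote(p)
    refs = [shard_refs[p] for p in in_paths]
    futures.append(write_output_partition.remote(pixel, out_path, *refs))

results = ray.get(futures)
```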
Pros:
- Removes one full read-and-write pass of the data from the pipeline
Cons:
- Workers need enough memory to hold several tract parquet files at once (I think up to 4 for HEALPix tiles at the corner of a patch)
- Ray's scheduling needs to be smart about which tasks to run when, and about when to spill objects from the store to disk and reload them. This is especially difficult on a cluster with multiple machines. If it is not, the result could be a lot of unnecessary I/O that makes the whole thing slower than the current approach
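Two Ray knobs that might help with the memory and spilling concerns, sketched under assumptions: the 6 GiB per-task figure is made up (roughly four input tables plus the concatenated output), the spill directory is illustrative, and the `_system_config` spilling mechanism should be checked against the Ray version's object-spilling docs.

```python
import json
import ray

# Point object spilling at fast local scratch so that any spill/reload
# cycles the scheduler triggers stay off shared network storage.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem",
             "params": {"directory_path": "/local/scratch/ray_spill"}}
        )
    }
)


# Declare an approximate per-task heap requirement so the scheduler does
# not pack more output tasks onto a node than its memory can hold.
@ray.remote(memory=6 * 1024**3)
def write_output_partition(pixel, out_path, *input_tables):
    ...  # body as in the sketch above
```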