Description
Instead of individually splitting and writing the output shards and then reducing them in a separate step, we could try using Ray and taking advantage of its shared-memory object store: create one task per output HATS partition, hand it all of the input shards it needs, and write the output directly.
Since the input parquet files are spatially partitioned by tract and patch, this should limit the number of input files each output task needs.
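As a rough illustration (not the actual pipeline code), the sketch below assumes one Ray task per input tract/patch parquet file that loads it into the object store, and one task per output HATS pixel that consumes the relevant shards from the store and writes the partition directly. `plan_partitions`, `select_rows_for_pixel`, and the `healpix_29` column name are placeholders standing in for whatever partition planner and spatial filter the real pipeline would use.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
import ray


def select_rows_for_pixel(table, pixel):
    # Placeholder spatial cut: assumes a precomputed healpix index column
    # ("healpix_29" is a hypothetical name) on each input table.
    return table.filter(pc.equal(table["healpix_29"], pixel))


@ray.remote
def load_input_shard(path):
    # One task per tract/patch parquet file. The returned Arrow table is
    # held in Ray's shared-memory object store, so every output task that
    # needs this shard can reuse it without re-reading the file.
    return pq.read_table(path)


@ray.remote
def write_output_partition(pixel, out_path, *input_tables):
    # input_tables are resolved from the object store by Ray before the
    # task runs; a pixel on a patch corner may need up to ~4 of them.
    combined = pa.concat_tables(input_tables)
    partition = select_rows_for_pixel(combined, pixel)
    pq.write_table(partition, out_path)
    return pixel, partition.num_rows


ray.init()

# plan_partitions() is a placeholder that yields
# (output_pixel, output_path, [input parquet paths]) for every HATS partition.
shard_refs = {}
futures = []
for pixel, out_path, in_paths in plan_partitions():
    for p in in_paths:
        if p not in shard_refs:  # launch each loader task only once per file
            shard_refs[p] = load_input_shard.remote(p)
    refs = [shard_refs[p] for p in in_paths]
    futures.append(write_output_partition.remote(pixel, out_path, *refs))

results = ray.get(futures)
```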
Pros:
- Removes one full read-and-write pass of the data from the pipeline
Cons:
- Workers need enough memory to hold several tract parquet files at once (I think up to 4 for HEALPix tiles at the corner of a patch)
- Ray's scheduling needs to be smart about which tasks to run when, and about when to spill objects from the store to disk and reload them. This is especially difficult on a cluster with multiple machines. If it is not, the result could be a lot of unnecessary I/O that makes the whole thing slower than the current approach
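Two Ray knobs that might help with the memory and spilling concerns, sketched under assumptions: the 6 GiB per-task figure is made up (roughly four input tables plus the concatenated output), the spill directory is illustrative, and the `_system_config` spilling mechanism should be checked against the Ray version's object-spilling docs.

```python
import json
import ray

# Point object spilling at fast local scratch so that any spill/reload
# cycles the scheduler triggers stay off shared network storage.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem",
             "params": {"directory_path": "/local/scratch/ray_spill"}}
        )
    }
)


# Declare an approximate per-task heap requirement so the scheduler does
# not pack more output tasks onto a node than its memory can hold.
@ray.remote(memory=6 * 1024**3)
def write_output_partition(pixel, out_path, *input_tables):
    ...  # body as in the sketch above
```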