Hi, I'm developing a PJRT plugin for a custom accelerator and want to enable availability-aware, cost-driven partitioning of an XLA/HLO module across GPU, CPU, and the accelerator:
- If only CPU and the accelerator are available, run using those two backends.
- If a GPU is present and in use, automatically identify HLO subgraphs that are better offloaded to the accelerator, and compile/run them there.
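To make the desired policy concrete, here is a minimal sketch of the availability-driven fallback logic, assuming the orchestrator can enumerate which device kinds are present (e.g., via its PJRT clients). The device-kind strings (`"my_accel"`) and the function itself are hypothetical, not an XLA/PJRT API:

```python
def choose_plan(available_kinds):
    """Pick (backends_to_use, offload_enabled) from the device kinds present.

    available_kinds: set of device-kind strings, e.g. {"cpu", "gpu", "my_accel"}.
    "my_accel" is a placeholder name for the custom accelerator.
    """
    has_gpu = "gpu" in available_kinds
    has_accel = "my_accel" in available_kinds
    if has_gpu and has_accel:
        # GPU runs the bulk of the module; profitable HLO subgraphs
        # are offloaded to the accelerator.
        return (["gpu", "my_accel"], True)
    if has_accel:
        # No GPU: run on CPU + accelerator only.
        return (["cpu", "my_accel"], False)
    # Accelerator absent: plain CPU execution, nothing to offload.
    return (["cpu"], False)
```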
Questions:
- Does XLA currently support multi-backend HLO partitioning/placement (i.e., splitting one HLO module across different backend types)?
- Can a PJRT plugin expose device costs/constraints, or otherwise influence partitioning during HLO-level compilation?
- If not, what's the recommended approach: implement an XLA pass that consumes cost info, or build an orchestration layer that partitions the model and invokes multiple PJRT clients/executables? Which option is more realistic today?
I can prototype either an XLA/HLO pass or an external orchestrator; I'm looking for pointers, existing examples, or caveats.
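For the orchestrator option, the core of what I have in mind is a greedy, cost-driven assignment of ops to backends. Below is a self-contained sketch of that idea; the op names, cost numbers, and flat transfer penalty are all illustrative assumptions, not anything taken from XLA or PJRT:

```python
# Assumed flat cost for moving a value across a backend boundary.
TRANSFER_COST = 5.0

def partition(ops, costs, producers):
    """Greedily place each op on the backend with the lowest total cost.

    ops: op names in topological order.
    costs: {op: {backend: compute_cost}} -- per-backend compute cost.
    producers: {op: [input op names]} -- dataflow edges.
    Returns {op: backend}.
    """
    placement = {}
    for op in ops:
        best_backend, best_total = None, float("inf")
        for backend, compute in costs[op].items():
            # Penalize every input that was produced on a different backend.
            transfers = sum(
                TRANSFER_COST
                for src in producers.get(op, [])
                if placement[src] != backend
            )
            total = compute + transfers
            if total < best_total:
                best_backend, best_total = backend, total
        placement[op] = best_backend
    return placement
```

With a transfer penalty in the model, a downstream op can stay on the accelerator even when another backend has a lower raw compute cost, which is exactly the subgraph-clustering effect I'm after:

```python
costs = {"conv": {"gpu": 1.0, "accel": 0.5},
         "matmul": {"gpu": 0.4, "accel": 2.0}}
producers = {"matmul": ["conv"]}
partition(["conv", "matmul"], costs, producers)
# -> {"conv": "accel", "matmul": "accel"}
```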
Thanks.