Feature request: "collective_broadcast" for CPU PJRT #33502

@janpfeifer

Description

Just to help make a case for it:

The collective_broadcast op is important for distributed (SPMD) training: it synchronizes variables across replicas and prevents numerical drift. Accelerators can (presumably?) do this much faster than moving the data through the host.

The "fake" multi-device CPU PJRT is very useful for developing and testing distributed models without spending expensive credits (and clearing the extra hurdles) to get multi-device hardware during development, which in some cases takes more time than the training itself.

Without collective_broadcast implemented on CPU, though, one has to maintain two versions: one for testing/development and one actually used in training, which adds maintenance burden, etc.
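To make the use case concrete, here is a minimal sketch of the synchronization pattern on the "fake" multi-device CPU backend. Note this is an assumption-laden illustration, not the StableHLO collective_broadcast op itself: it emulates broadcasting replica 0's parameters with a masked psum under jax.pmap.

```python
import functools
import os

# Assumption: this XLA flag makes the CPU backend expose 4 "fake" devices;
# it must be set before importing jax.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=4"

import jax
import jax.numpy as jnp

# Simulated per-replica drift: replica 0 holds 1.0, the others differ slightly.
params = 1.0 + jnp.arange(4, dtype=jnp.float32) * 0.001

# Emulate a broadcast from replica 0 with a masked all-reduce:
# zero out every replica's value except replica 0's, then psum
# so that every replica ends up holding replica 0's value.
@functools.partial(jax.pmap, axis_name="i")
def sync_from_replica_zero(p):
    mask = (jax.lax.axis_index("i") == 0).astype(p.dtype)
    return jax.lax.psum(p * mask, axis_name="i")

synced = sync_from_replica_zero(params)
print(synced)  # every replica now holds replica 0's value, 1.0
```

With collective_broadcast available on CPU, the same program that runs on accelerators could run unchanged here, instead of resorting to workarounds like the masked psum above.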

Many thanks!

Metadata

Labels

enhancement (New feature or request)
