
Support torch.distributed.scatter collective #9365


Open · wants to merge 7 commits into base: master

Conversation

bfolie (Collaborator) commented Jun 16, 2025

#9315

XLA doesn't have a distributed Scatter op, but we can put dummy tensor lists on the non-source ranks and use reduce_scatter.
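A minimal sketch of the idea, for illustration only: it uses the public torch.distributed API and a hypothetical helper name, while the actual change lives in ProcessGroupXLA.scatter and may differ in detail. Ranks other than src contribute all-zero inputs, so a summed reduce_scatter leaves each rank holding exactly the slice the source provided for it.

import torch
import torch.distributed as dist

def scatter_via_reduce_scatter(output, input_list, src, rank, world_size):
    # Non-source ranks supply dummy (all-zero) tensors shaped like the output.
    if rank != src:
        input_list = [torch.zeros_like(output) for _ in range(world_size)]
    # Only the source rank contributes non-zero data, so the summed
    # reduce_scatter writes src's i-th tensor into rank i's output.
    dist.reduce_scatter(output, input_list, op=dist.ReduceOp.SUM)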

bfolie requested review from bhavya01 and pgmoka on Jun 16, 2025 at 17:02
@@ -360,7 +360,6 @@ def test_barrier(self):
'allreduce_coalesced',
'alltoall',
'gather',
'scatter',
bfolie (Collaborator, Author):

I'm not sure if there's a reason to add a scatter test to this file. The test in test/pjrt/test_collective_ops_tpu.py is more robust in that it tests the actual result. The tests in this file just check that the IR looks correct, which can be misleading (as was the case for send/recv).

pgmoka (Collaborator):

I see other operation calls to things like 'gather' and 'alltoall'. What is the reasoning to keep them and remove scatter?

Would it perhaps be better to improve the test documentation of what it does, rather than remove 'scatter'?

bfolie (Collaborator, Author):

This test checks that the methods are unimplemented for XLAProcessGroup. scatter is now implemented, so it is being removed.

What I'm wondering is whether there's a reason to add a new test to this file that checks if group.scatter outputs the expected HLO. That's what other tests in this file do, but it's not clear to me what value they add beyond the existing tests in test_collective_ops_tpu.

pgmoka (Collaborator):

Which tests are calling this function in test/pjrt/test_collective_ops_tpu.py? The tests I see there are for reduce_scatter and ReduceScatter, which seem to me to be higher-level abstractions with other things happening. Perhaps I am missing something.

bfolie (Collaborator, Author):

The test added in this PR, test_collective_ops_tpu:test_scatter, calls torch.dist.scatter, which calls ProcessGroupXLA.scatter.

Abstracting things a bit, we have a function A (torch.dist.scatter) that wraps function B (ProcessGroupXLA.scatter). We have a test for A. Should we have a test for B as well? In many cases the answer is yes, especially if B is used in multiple places, if the test for B is logically self-contained and informative, if A adds significant additional logic that we want to test without having to think about B, etc. In this case I'm advocating for not testing B, because:

  • The test for B is unreliable (as we saw for send/recv, the IR might look reasonable but not work)
  • A is a fairly thin wrapper around B
  • The contents of A are in upstream PT, so we're not testing them independently of B
  • B isn't used anywhere else, nor would it be directly called by other code

What do you think?
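(For concreteness, a sketch of the A-over-B layering being discussed, based on the test code quoted further down; the receive buffer shape and the helper name are assumptions, and the real test_scatter may differ.)

import torch
import torch.distributed as dist
import torch_xla
import torch_xla.runtime as xr

def _scatter_example():
    # A: torch.distributed.scatter, the public API the test exercises ...
    dist.init_process_group("xla", init_method="xla://")
    device = torch_xla.device()
    world_size = xr.world_size()
    tensors = None
    if xr.global_ordinal() == 0:
        # Only the source rank supplies the list of tensors to scatter.
        tensors = [
            torch.tensor([i], device=device, dtype=torch.float)
            for i in range(world_size)
        ]
    output = torch.zeros(1, device=device, dtype=torch.float)
    # ... which dispatches to B: ProcessGroupXLA.scatter, because the
    # process group was initialized with the "xla" backend.
    dist.scatter(output, tensors, src=0)
    return output.cpu()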

pgmoka (Collaborator):

Ah, I see. I misunderstood a couple of things. The clarification here helps a lot. It is preferable to test the calling API of a method rather than the method itself. Given that torch.dist.scatter serves as the external API to ProcessGroupXLA.scatter, it is reasonable to test only it.

I can see this popping up as an issue on some coverage tests, so I would add an explicit comment to the tests in addition to your comment on torch.distributed.scatter.

pgmoka (Collaborator) left a comment:

Mostly questions, and a request for extra documentation.


bfolie requested review from pgmoka and ghpvnist on Jun 18, 2025 at 17:32
dist.init_process_group("xla", init_method='xla://')
device = torch_xla.device()
world_size = xr.world_size()
if xr.global_ordinal() == 0:
Collaborator:

Minor readability improvement:

tensors = None
if xr.global_ordinal() == 0:
    tensors = [
        torch.tensor([i], device=device, dtype=torch.float)
        for i in range(world_size)
    ]


pgmoka (Collaborator) left a comment:

Follow-up seems good. Let me know if you have any questions on https://github.com/pytorch/xla/pull/9365/files#r2151351304.

Otherwise, LGTM.

One minor thing: I believe the failing tests are due to flakiness. Can you confirm?
