the all_reduce thing in parallel.py was having errors with different buffer lengths; this should fix this #47
base: main
Conversation
Pull Request Overview
This PR fixes an issue in parallel.py where all_reduce operations failed due to inconsistent tensor buffer sizes across different ranks. The fix ensures that all ranks operate on buffers with the same shape during all_reduce operations.
- Modified the sync_buffer function to handle all_reduce operations differently from broadcast operations
- Added logic to use temporary buffers when operating on indexed subsets during all_reduce
- Preserved existing broadcast behavior while fixing the tensor shape consistency issue (a sketch of the combined pattern follows below)
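For context, here is a minimal sketch of the pattern described above, assuming a `sync_buffer(buffer, indices, op, src, use_all_reduce)`-style helper; the signature and names are illustrative, not the actual code in parallel.py:

```python
import torch
import torch.distributed as dist

def sync_buffer(buffer, indices=None, op=dist.ReduceOp.SUM, src=0, use_all_reduce=True):
    """Illustrative sketch only; not the actual parallel.py implementation."""
    if not use_all_reduce:
        # Broadcast path: unchanged, rank `src` remains the source of truth.
        dist.broadcast(buffer, src=src)
        return buffer

    if indices is None:
        # Whole-buffer reduction: shapes already agree across ranks.
        dist.all_reduce(buffer, op=op)
        return buffer

    # Indexed subset: route the reduction through a full-size temporary so
    # every rank hands all_reduce a tensor of identical shape.
    temp_buffer = torch.zeros_like(buffer)
    temp_buffer[indices] = buffer[indices]
    dist.all_reduce(temp_buffer, op=op)
    buffer[indices] = temp_buffer[indices]
    return buffer
```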
```python
# Create a temporary buffer for the indexed elements
temp_buffer = torch.zeros_like(buffer)
temp_buffer[indices] = buffer[indices]
dist.all_reduce(temp_buffer, op=op)
buffer[indices] = temp_buffer[indices]
```
Copilot AI · Oct 18, 2025
Creating a temporary buffer the size of the full buffer for partial updates is inefficient for memory usage. Consider implementing a gather-reduce-scatter pattern or using all_gather followed by local reduction to minimize memory overhead, especially for large buffers with small index subsets.
Suggested change (replacing the full-size temporary buffer above):

```python
# Efficient gather-reduce-scatter for partial index updates
local_part = buffer[indices].clone()
gathered_parts = [torch.zeros_like(local_part) for _ in range(dist.get_world_size())]
dist.all_gather(gathered_parts, local_part)
# Perform reduction locally
stacked = torch.stack(gathered_parts, dim=0)
if op == dist.ReduceOp.SUM:
    reduced = stacked.sum(dim=0)
elif op == dist.ReduceOp.PRODUCT:
    reduced = stacked.prod(dim=0)
elif op == dist.ReduceOp.MIN:
    reduced, _ = stacked.min(dim=0)
elif op == dist.ReduceOp.MAX:
    reduced, _ = stacked.max(dim=0)
else:
    raise NotImplementedError(f"Unsupported reduction op: {op}")
buffer[indices] = reduced
```
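The local-reduction step of this suggestion can be sanity-checked on a single process, without initializing a process group (toy values, not project data):

```python
import torch

# Simulate what three ranks would contribute via all_gather.
parts = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0]), torch.tensor([5.0, 6.0])]
stacked = torch.stack(parts, dim=0)

# SUM semantics: elementwise sum over the rank dimension.
assert torch.equal(stacked.sum(dim=0), torch.tensor([9.0, 12.0]))

# MIN/MAX return (values, indices); only the values are used.
min_vals, _ = stacked.min(dim=0)
assert torch.equal(min_vals, torch.tensor([1.0, 2.0]))
```

Note that because each rank allocates its receive buffers as `torch.zeros_like(local_part)`, the suggestion as written assumes `buffer[indices]` has the same length on every rank; if the index subsets can differ in size, the gather itself would hit a similar shape mismatch.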
Thanks for the patch! Do you have the full traceback where this error occurred?
See attached for the full traceback. I'm using `options.reconstructor_options.random_seed = round(int(datetime.datetime.now().strftime('%H%M%S')))` for the code in the new branch where you modified the random-seed handling:
In parallel.py, I was always getting errors of the sort:
```
[rank2]:[I1017 19:34:58.923985894 ProcessGroupWrapper.cpp:587] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=38, OpType=ALLREDUCE, TensorShape=[77, 1], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I1017 19:34:58.999468354 ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=38, OpType=ALLREDUCE, TensorShape=[100, 1], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank3]:[I1017 19:34:58.003353312 ProcessGroupWrapper.cpp:587] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=38, OpType=ALLREDUCE, TensorShape=[100, 1], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1017 19:34:58.003669532 ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=38, OpType=ALLREDUCE, TensorShape=[100, 1], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
```
Notice that the TensorShape is different for rank = 2.
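If it helps for triage, here is a standalone sketch (gloo on CPU, four processes; not the project's code) that reproduces this kind of mismatch. With TORCH_DISTRIBUTED_DEBUG=DETAIL set, the ProcessGroupWrapper prints the same per-rank collective fingerprints as above; depending on backend and version, the mismatched all_reduce then errors out or hangs:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    # DETAIL debugging wraps the process group and logs collective
    # fingerprints (op, shape, dtype) per rank, as in the traceback above.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # One rank gets a shorter buffer, mirroring the [77, 1] vs [100, 1] mismatch.
    n = 77 if rank == 2 else 100
    buf = torch.ones(n, 1)
    dist.all_reduce(buf)  # ranks disagree on shape, so the collective fails
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)
```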
I've tested the fix in this pull request on Califone and other GPU machines for Sector 9 and it seems to have solved this issue.