
Conversation

rockerBOO
Contributor

This can improve cases where we move multiple tensors to the GPU before processing them.

We need to synchronize (torch.cuda.synchronize) before processing them to be sure all the transfers have completed.

This code is mostly a prototype that converts the transfers to use non_blocking. It still needs testing and validation, because without synchronization it will appear to "work" while reading tensors whose copies may not have finished.

With this I am getting 8-10% faster training throughput.

https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html
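For illustration, here is a minimal sketch of the pattern described above (the function name and shapes are just examples, not code from this PR): pin the CPU tensors, issue non_blocking copies, then synchronize once before compute.

```python
import torch

def move_batch_to_device(tensors, device):
    # Pin the CPU tensors so the copies can be truly asynchronous,
    # then queue the transfers with non_blocking=True.
    moved = [t.pin_memory().to(device, non_blocking=True) for t in tensors]

    # Synchronize once before any computation that reads the tensors,
    # so every queued copy has finished.
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    return moved

# Example usage
device = torch.device("cuda")
batch = [torch.randn(1024, 1024) for _ in range(4)]
on_gpu = move_batch_to_device(batch, device)
```

The key point is that the synchronization cost is paid once per batch of transfers instead of once per tensor, which is where the throughput gain comes from.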

@kohya-ss
Owner

Thank you, this has the potential to improve overall performance.

Regarding stream synchronization, this project also supports mps and xpu, so I would appreciate it if you could use device_utils.synchronize_device.
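A device-agnostic helper along these lines would dispatch on the device type; this is only a sketch of the idea, and the actual implementation in the project's device_utils may differ:

```python
import torch

def synchronize_device(device: torch.device):
    # Wait for all queued work on the given accelerator to finish.
    # Sketch only: dispatches per backend instead of assuming CUDA.
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    elif device.type == "xpu":
        torch.xpu.synchronize(device)
    elif device.type == "mps":
        torch.mps.synchronize()
```

Calling such a helper instead of torch.cuda.synchronize keeps the non_blocking transfers correct on mps and xpu as well.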
