@bghira bghira commented Aug 6, 2025

when some of the cache is preserved across restarts, work units are distributed unevenly across GPUs, because one or more GPUs may receive only corrupted samples.

the reduction took the bucket contents from 79k down to 4 samples. when shuffling before distribution, we get better results, and each worker receives 19k remaining samples to process.

removed the do_shuffle parameter from reduce_buckets: the data has already been split by this point, so shuffling there does nothing.
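
a minimal sketch of the effect described above, not the code from this PR. the helper names (`contiguous_split`, `shuffled_split`, `pending_per_worker`) and the sample counts are hypothetical, and it assumes the already-cached samples sit together at the front of the list. splitting without shuffling leaves most workers with nothing to do, while shuffling first spreads the remaining work roughly evenly:

```python
import random


def contiguous_split(items, num_workers):
    """Split items into num_workers contiguous chunks, without shuffling."""
    chunk = -(-len(items) // num_workers)  # ceiling division
    return [items[i * chunk:(i + 1) * chunk] for i in range(num_workers)]


def shuffled_split(items, num_workers, seed=0):
    """Shuffle a copy of items, then split it into contiguous chunks."""
    pool = list(items)
    random.Random(seed).shuffle(pool)  # same seed on every rank gives the same order
    return contiguous_split(pool, num_workers)


def pending_per_worker(shards, cached):
    """Count, per shard, the samples that still need processing."""
    return [sum(1 for s in shard if s not in cached) for shard in shards]


# hypothetical dataset: 8000 samples, the first 6000 already cached
samples = [f"img_{i:05d}" for i in range(8000)]
cached = set(samples[:6000])

unshuffled = pending_per_worker(contiguous_split(samples, 4), cached)
shuffled = pending_per_worker(shuffled_split(samples, 4), cached)

print(unshuffled)  # [0, 0, 0, 2000]: one worker gets all of the remaining work
print(shuffled)    # roughly 500 per worker
```

the shuffle has to happen before the split. once each worker holds its own shard, shuffling inside that shard only changes the order, not how much work the worker has.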

bghira added 2 commits August 6, 2025 09:47
@bghira bghira merged commit cbad4fb into main Aug 6, 2025
1 check passed
@bghira bghira deleted the bugfix/multigpu-vae-shuffle branch August 6, 2025 15:55