Fix synchronization between SwapDeviceBuffers and Transport #401

Merged 2 commits on Jun 25, 2025

SeverinDiederichs
Collaborator

@SeverinDiederichs SeverinDiederichs commented Jun 25, 2025

This PR adds a subtle but important synchronization that was missing between the swap of the device buffers and the transport.

In rare cases, this could lead to the following scenario:

The hit slot counter statistics were copied from the GPU to the CPU, and reset to 0 on the device. Then the swap was executed.

However, between the reset of the slot counter statistics and the execution of the swap, the transport could already run again and write new steps. Because the counter had already been reset, these steps were written starting from the initial position of the buffer again, so some later steps overwrote some of the initial steps.

This race condition broke reproducibility and could produce wrong results.

So far, no further reproducibility issues have been observed with this fix, so the CI test is put back in place.
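
The ordering problem described above can be illustrated with a small host-side C++ analogy. This is only a sketch under stated assumptions, not the AdePT implementation: `StepBuffers`, `WriteStep`, `SwapAndCollect`, and `AllStepsArriveExactlyOnce` are hypothetical names, and a `std::mutex` stands in for whatever device-side synchronization the PR actually introduces. The point it tries to show is that reading the slot counter, resetting it, and swapping the buffers must form one step from the point of view of the writer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for a double-buffered step buffer.
// None of these names are taken from the AdePT code base.
struct StepBuffers {
  std::vector<int> active;    // buffer the "transport" writes into
  std::vector<int> drained;   // buffer handed to the consumer after a swap
  std::size_t usedSlots = 0;  // slot counter, reset on every swap
  std::mutex guard;           // stand-in for the missing synchronization

  explicit StepBuffers(std::size_t capacity) : active(capacity), drained(capacity) {}

  // Transport side: claim the next free slot and write one step into it.
  // Returns false if the active buffer is full.
  bool WriteStep(int step) {
    std::lock_guard<std::mutex> lock(guard);
    assert(usedSlots <= active.size());
    if (usedSlots == active.size()) return false;
    active[usedSlots++] = step;
    return true;
  }

  // Swap side: read the counter, reset it, and swap the buffers inside one
  // critical section, so no write can land between the reset and the swap.
  std::vector<int> SwapAndCollect() {
    std::lock_guard<std::mutex> lock(guard);
    const std::size_t n = usedSlots;
    usedSlots = 0;
    std::swap(active, drained);
    return std::vector<int>(drained.begin(),
                            drained.begin() + static_cast<std::ptrdiff_t>(n));
  }
};

// Runs a writer thread against repeated swaps and checks that every step
// written is collected exactly once (nothing lost, nothing overwritten).
bool AllStepsArriveExactlyOnce(int numSteps, std::size_t capacity) {
  StepBuffers buffers(capacity);
  std::atomic<bool> writerDone{false};

  std::thread transport([&] {
    for (int step = 0; step < numSteps; ++step) {
      while (!buffers.WriteStep(step)) std::this_thread::yield();
    }
    writerDone = true;
  });

  std::vector<int> seen(static_cast<std::size_t>(numSteps), 0);
  bool finished = false;
  while (!finished) {
    // Sample the flag before swapping so the final swap sees all writes.
    finished = writerDone.load();
    for (int step : buffers.SwapAndCollect()) ++seen[static_cast<std::size_t>(step)];
    std::this_thread::yield();
  }
  transport.join();

  for (int count : seen) {
    if (count != 1) return false;
  }
  return true;
}
```

In this sketch, a call such as `AllStepsArriveExactlyOnce(10000, 64)` returns `true` because the reset and the swap share one lock. If the counter reset and the swap were performed as two separately locked steps, a write could slip in between them and land at slot 0 of the buffer that is about to be handed over, which mirrors the overwrite described in this PR.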

@SeverinDiederichs SeverinDiederichs added the bug Type: Something isn't working label Jun 25, 2025
@phsft-bot

Can one of the admins verify this patch?

@agheata agheata merged commit f751a2b into apt-sim:master Jun 25, 2025
3 checks passed
@SeverinDiederichs SeverinDiederichs deleted the fix_deviceswap_sync branch June 25, 2025 13:28