Rework get_task and steal_task to better interact with out_of_work checks #1779

Open
wants to merge 15 commits into
base: master
from

Conversation

akukanov
Contributor

@akukanov akukanov commented Jul 11, 2025

This PR "revives" the improvement originated in #417, but takes a different approach to the problem.

As a reminder, when a thread looks for a task to take from a task pool, it might skip some tasks due to affinity or isolation restrictions. The task pool still contains pointers to those tasks, but the observable limits of the pool (the head, modified by thieves, and the tail, modified by the owning thread) might temporarily exclude the skipped tasks. Because of that, another thread that inspects the arena for work availability might find the task pool "empty" and potentially mark the whole arena empty, causing worker threads to leave prematurely. The current implementation mitigates that by issuing a "work advertisement" signal when the skipped tasks are "returned" to the observable pool.
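To make the window concrete, here is a minimal single-threaded sketch of the effect described above. It is not the oneTBB code: `toy_pool`, `looks_empty`, `owner_hide_last_task` and `owner_restore_tail` are hypothetical names, and only the two observable limits are modeled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for a task pool's observable limits.
// This is not the oneTBB arena_slot; only head and tail are modeled.
struct toy_pool {
    std::atomic<std::size_t> head{0}; // advanced by thieves
    std::atomic<std::size_t> tail{0}; // moved by the owning thread
};

// Naive inspection: the pool "has work" only if head < tail.
inline bool looks_empty(const toy_pool& p) {
    return p.head.load(std::memory_order_acquire) >=
           p.tail.load(std::memory_order_acquire);
}

// The owner skips the last task: it lowers the tail while it keeps
// looking for something it is allowed to run. Returns the old tail.
inline std::size_t owner_hide_last_task(toy_pool& p) {
    const std::size_t t0 = p.tail.load(std::memory_order_relaxed);
    assert(t0 > 0 && "there must be a task to skip");
    p.tail.store(t0 - 1, std::memory_order_release);
    return t0;
}

// Later the owner makes the skipped task observable again.
inline void owner_restore_tail(toy_pool& p, std::size_t t0) {
    p.tail.store(t0, std::memory_order_release);
}
```

Between `owner_hide_last_task` and `owner_restore_tail`, `looks_empty` reports an empty pool even though the skipped task is still there, which is the state an inspecting thread can observe at the wrong moment.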

PR #417 tried to improve the implementation by adding "shadow" head and tail indexes for the slot inspection, which are not changed until an operation on the task pool is complete, and so they should never exclude "skipped" tasks. In my opinion, however, it puts the burden on the wrong side and complicates the arbitration protocol between the pool owner and thieves. As implemented, it also does not achieve the goal: in the case of pool "exhaustion" the shadow limits would be temporarily reset, just like the real limits.

This PR takes a different approach and puts more burden on the inspecting thread, which has no tasks to execute anyway. If that thread suspects the task pool to be empty after comparing its head and tail, it locks the pool and re-reads its state. Holding the lock prevents any temporary modifications by stealing threads. To coordinate the inspection with changes made by the owning thread, a new flag is added to arena_slot. The flag is set by the owning thread in get_task if it skips one or more tasks, and is reset once the pool limits are restored. The flag is read and tested by the inspecting thread, and the slot is only considered empty when both the pool limits show no tasks and the skipping flag is not set.
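The emptiness rule from this paragraph can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: `toy_slot`, `owner_hide_skipped`, `owner_restore` and `slot_is_empty` are hypothetical names, and a `std::mutex` stands in for the pool lock. To keep the sketch obviously race-free, the owner here also takes the lock while it changes the flag and the tail; the actual patch may well coordinate the owner and the inspector differently.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

// Hypothetical model of a slot; not the oneTBB arena_slot.
struct toy_slot {
    std::atomic<std::size_t> head{0};           // advanced by thieves
    std::atomic<std::size_t> tail{0};           // moved by the owner
    std::atomic<bool> has_skipped_tasks{false}; // set while tasks are hidden
    std::mutex pool_lock;                       // stand-in for the pool lock
};

inline bool limits_show_tasks(const toy_slot& s) {
    return s.head.load(std::memory_order_acquire) <
           s.tail.load(std::memory_order_acquire);
}

// Owner: raise the flag, then hide the skipped tasks by lowering the tail.
// Assumption of this sketch only: the owner holds the lock while doing so.
inline std::size_t owner_hide_skipped(toy_slot& s, std::size_t new_tail) {
    std::lock_guard<std::mutex> guard(s.pool_lock);
    const std::size_t t0 = s.tail.load(std::memory_order_relaxed);
    s.has_skipped_tasks.store(true, std::memory_order_release);
    s.tail.store(new_tail, std::memory_order_release);
    return t0;
}

// Owner: restore the limits first, then clear the flag.
inline void owner_restore(toy_slot& s, std::size_t t0) {
    std::lock_guard<std::mutex> guard(s.pool_lock);
    s.tail.store(t0, std::memory_order_release);
    s.has_skipped_tasks.store(false, std::memory_order_release);
}

// Inspector: cheap check first; only a suspected-empty pool is locked
// and re-read. The slot is empty only if the limits show no tasks
// AND the skipping flag is not set.
inline bool slot_is_empty(toy_slot& s) {
    if (limits_show_tasks(s)) return false;
    std::lock_guard<std::mutex> guard(s.pool_lock); // keeps thieves out
    if (limits_show_tasks(s)) return false;
    return !s.has_skipped_tasks.load(std::memory_order_acquire);
}
```

The extra cost lands on the inspecting thread, which is idle anyway, while the common case (the limits already show tasks) still returns without taking the lock.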

Tests

  • [ X ] not needed, existing tests should be sufficient

Documentation

  • [ X ] not needed

Breaks backward compatibility?

  • [ X ] No - the changes are not exposed in API or ABI

@akukanov akukanov changed the title Change get_task to not reset the task pool if some tasks were skipped Rework get_task and steal_task to better interact with out_of_work checks Jul 11, 2025
@akukanov akukanov force-pushed the dev/improve-task-omission-akukanov branch from 0ec492d to 0e277fe Compare July 11, 2025 23:02
@akukanov akukanov force-pushed the dev/improve-task-omission-akukanov branch from 0e277fe to 13d8ce9 Compare July 11, 2025 23:08
@akukanov akukanov force-pushed the dev/improve-task-omission-akukanov branch from abde43c to deda35d Compare July 14, 2025 17:42
@akukanov akukanov marked this pull request as ready for review July 14, 2025 19:22
@akukanov
Contributor Author

@kboyarinov @isaevil @dnmokhov Please take a look.

Comment on lines -122 to -129
if ( H0 < T0 ) {
    // Restore the task pool if there are some tasks.
    head.store(H0, std::memory_order_relaxed);
    tail.store(T0, std::memory_order_relaxed);
    // The release fence is used in publish_task_pool.
    publish_task_pool();
    // Synchronize with snapshot as we published some tasks.
    ed.task_disp->m_thread_data->my_arena->advertise_new_work<arena::wakeup>();
Contributor

Just to clarify my understanding of the problem statement and the proposed fix (this may duplicate the description a bit :) ).

Let's assume that the current slot contains a single task whose isolation prevents the owning thread from executing it.
The owning thread skips this task while looking for work to do and temporarily decrements the tail (head is equal to tail in our case).
Between the decrement and the restore of the tail, the stealing thread can inspect the current slot in arena::out_of_work and see head equal to tail, meaning the current slot is out of tasks.

If all other slots in the arena are out of tasks, the stealing thread signals the workers to leave the arena.

When the owning thread has restored head and tail, advertise_new_work requests the workers to re-join the arena.

Am I correct that the idea of the patch is to prevent workers from leaving the arena in this case, and to double-check by reading the has_skipped_tasks flag, which the owning thread sets while tasks are skipped?

Contributor Author

@akukanov akukanov Jul 23, 2025

That's correct overall, with a couple of comments to be made.

Between the decrement and the restore of the tail, the stealing thread can inspect the current slot in arena::out_of_work...

Technically, it's not a stealing thread, as at that point it does not attempt to steal anything. I would call it an inspecting thread, to distinguish it from other threads that may really be trying to steal at the same time.

The second comment is that stealing threads might also skip tasks and temporarily make the slot appear empty. In the current implementation, they also call advertise_new_work afterwards. In the patch, the inspecting thread has to lock the pool for final inspection, therefore no stealing can happen at the same time.


if ( tasks_skipped ) {
    __TBB_ASSERT( is_task_pool_published(), nullptr ); // the pool was not reset
    tail.store(T0, std::memory_order_release);
Contributor

Do I understand correctly that we do not need to restore the head here, since that is the stealing thread's responsibility and is done in steal_task?

Contributor Author

@akukanov akukanov Jul 23, 2025

That is correct.

Generally, H0 represents the state of the pool head as it was seen by the owner; it might get outdated at any time. The core principle therefore is that the owner only works with the tail and does not change the head.

Indeed, if there was no conflict for the last task, the owner has no idea what the proper value for the head should be. And in case of a conflict, the pool lock is taken and the head is re-read, so we can be sure that there are no skipped tasks beyond the head and there is no need to change anything.

Prior to the patch, there is a store of H0 to the head - but it is done at the point where the pool is temporarily quiescent, and therefore it is safe. It "optimizes" the case when the task at the head is taken while others were skipped. In the patch, the pool is not reset if tasks were skipped, as that would also mislead observers. So this optimization cannot be safely performed anymore.
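A small single-threaded walkthrough may help show why a stale H0 must not be written back while the pool stays published. It is only a sketch: `toy_limits`, `thief_take`, `restore_tail_only` and `restore_with_stale_head` are hypothetical names, the thief is simulated by a plain function call, and the real locking between thieves and the owner is left out.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical model of the pool limits; not the oneTBB arena_slot.
struct toy_limits {
    std::atomic<std::size_t> head{0}; // written by thieves
    std::atomic<std::size_t> tail{0}; // written by the owner
};

// Number of tasks currently visible between head and tail.
inline std::size_t visible_tasks(const toy_limits& p) {
    const std::size_t h = p.head.load(std::memory_order_acquire);
    const std::size_t t = p.tail.load(std::memory_order_acquire);
    return t > h ? t - h : 0;
}

// Simulated thief: takes one task from the head side.
// Single-threaded walkthrough only; a real thief would hold the pool lock.
inline bool thief_take(toy_limits& p) {
    const std::size_t h = p.head.load(std::memory_order_relaxed);
    if (h >= p.tail.load(std::memory_order_acquire)) return false;
    p.head.store(h + 1, std::memory_order_release);
    return true;
}

// Owner restore in the spirit of the discussion: touch the tail only.
inline void restore_tail_only(toy_limits& p, std::size_t t0) {
    p.tail.store(t0, std::memory_order_release);
}

// For contrast: also writing back the owner's (possibly stale) view of the head.
inline void restore_with_stale_head(toy_limits& p, std::size_t h0, std::size_t t0) {
    p.head.store(h0, std::memory_order_relaxed);
    p.tail.store(t0, std::memory_order_release);
}
```

If a thief advances the head after the owner has read H0, `restore_tail_only` leaves the thief's progress intact, whereas `restore_with_stale_head` would make the already stolen task visible again.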

akukanov and others added 2 commits July 23, 2025 20:18
@akukanov akukanov force-pushed the dev/improve-task-omission-akukanov branch from 1da1b25 to 796db43 Compare July 25, 2025 09:08
@akukanov
Contributor Author

akukanov commented Jul 25, 2025

Commit 796db43 is a code refactoring and is not strictly necessary. It significantly changes get_task, and though the core logic there remains the same, some code blocks and checks had to be reordered. It's likely better to review it separately, on top of the previous commits. I can also revert it if you prefer to keep refactoring separate from the substantial changes.

@akukanov akukanov force-pushed the dev/improve-task-omission-akukanov branch from 5469345 to 6f0d728 Compare July 28, 2025 21:50
@akukanov akukanov requested review from isaevil and kboyarinov July 30, 2025 18:49
Labels
None yet
Projects
None yet
4 participants