[LibOS] execve: eliminate race on clear_child_tid before VMAs deallocation#2149
[LibOS] execve: eliminate race on clear_child_tid before VMAs deallocation#2149forkthus wants to merge 2 commits intogramineproject:masterfrom
execve: eliminate race on clear_child_tid before VMAs deallocation#2149Conversation
… threads’ clear_child_tid before unmapping VMAs Signed-off-by: Wenhan Ji <whji@cs.unc.edu>
|
Jenkins, test this please |
|
Jenkins, retest this please (All failures seem to be connectivity issues.) |
Could someone kindly help me trigger a Jenkins retest? My retest command doesn’t seem to work—possibly due to a permission issue. I also don’t have access to the build logs. |
|
Jenkins, retest this please |
|
Add to whitelist |
donporter
left a comment
There was a problem hiding this comment.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: )
libos/src/bookkeep/libos_thread.c line 567 at r1 (raw file):
free_threads_array(threads, count); count = needed_count;
Minor: assert needed_count > count?
This is a clever way to do things, but a comment or two would help the reader understand why this loop terminates (i.e., that count monotonically increases to match the actual thread count, and new threads should not be created at this point).
Or, consider restructuring as a do loop, with needed_count <= count being the loop condition.
|
The deb job failed with: The repository 'http://deb.debian.org/debian bullseye-backports Release' no longer has a Release file. This looks legit, and unrelated to this PR. |
…r other threads’ clear_child_tid before unmapping VMAs Signed-off-by: Wenhan Ji <whji@cs.unc.edu>
forkthus
left a comment
There was a problem hiding this comment.
Reviewable status: 3 of 4 files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @donporter)
libos/src/bookkeep/libos_thread.c line 567 at r1 (raw file):
Previously, donporter (Don Porter) wrote…
Minor: assert needed_count > count?
This is a clever way to do things, but a comment or two would help the reader understand why this loop terminates (i.e., that count monotonically increases to match the actual thread count, and new threads should not be created at this point).
Or, consider restructuring as a do loop, with needed_count <= count being the loop condition.
Done.
I added comments explaining why the loop terminates (count increases monotonically to the needed size) and an assert on the retry path.
I kept the structure to match the existing style—this pattern mirrors dump_vmas() in libos_vma.c.
|
Jenkins, retest this please. Just seeing if the webhook gets reactivated. |
kailun-qin
left a comment
There was a problem hiding this comment.
@kailun-qin reviewed 4 files and all commit messages, and made 4 comments.
Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on forkthus).
-- commits line 2 at r2:
Pls refine the commit subject to make it more concise. Also, pls include a detailed description in the commit msg.
libos/src/bookkeep/libos_thread.c line 551 at r2 (raw file):
int dump_all_threads(struct libos_thread*** out_threads, size_t* out_cnt) { size_t count = 1;
Is it better that we start w/ a more reasonable initial guess and/or a better growth strategy to make this more efficient?
Code quote:
size_t count = 1;libos/src/sys/libos_exec.c line 191 at r2 (raw file):
} (void)kill_other_threads();
Is it possible that a thread is spawned after we take the snapshot but before kill_other_threads()? What would happen in such case?
libos/src/sys/libos_futex.c line 989 at r2 (raw file):
} void wait_clear_child_tid(int* clear_child_tid, int child_tid) {
Do we need check clear_child_tid, similar to release_clear_child_tid() above?
Description of the changes
Fixes #2148
This PR makes
execve()wait until every sibling thread’s*clear_child_tidis zeroed before deallocating their VMAs.Implementation details:
g_thread_listas soon as the calling thread acquiresfirst.If
*clear_child_tid != 0, invokefutex_wait(); it will be awakened byrelease_clear_child_tid()viafutex_wake().*clear_child_tidare cleared, the calling thread then starts to deallocate VMAs.How to test this PR?
Repeating
gramine-sgx exec_same [args_#1...args_#49]Without this PR – the test usually fails within a few minutes, especially on the branch of PR #1795 because of the issue mentioned above. The main branch takes longer to fail.
With this PR applied – the same loop runs for hours without any failures on the branch of PR #1795 .
This change is