oshmem/shmem: Allocate and exchange base segment address beforehand #12889
Signed-off-by: Thomas Vegas <[email protected]>
071169b to 019badb
#endif
}

if (mca_sshmem_base_start_address != memheap_mmap_get(
This is based on the mmap() behavior where it always creates the vma at the hint position if possible. If this is not always true (kernel versions..), this could regress existing behavior and even fail to honor the command line parameter.
Shall we remove that confirmation check and proceed regardless? Or maybe only ignore that check when address was passed from command line?
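For reference, the confirmation check discussed here could be sketched as below. This is only an illustration under assumptions: `map_at_hint_or_fail()` is a hypothetical helper, not the `memheap_mmap_get()` from the patch, and it relies solely on the documented rule that without MAP_FIXED the kernel treats the address as a hint and may place the mapping elsewhere.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper (not part of the patch): map `size` bytes at `hint`
 * without MAP_FIXED. Returns the mapping only if the kernel honored the
 * hint; otherwise the mapping is released and NULL is returned. */
static void *map_at_hint_or_fail(void *hint, size_t size)
{
    void *p;

    assert(size > 0);
    p = mmap(hint, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return NULL;
    }
    if (hint != NULL && p != hint) {
        /* The kernel placed the mapping elsewhere: give it back. */
        (void)munmap(p, size);
        return NULL;
    }
    return p;
}
```

Dropping the `p != hint` branch would correspond to "proceed regardless"; keeping it only for a user-supplied address would correspond to the second option.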
@@ -126,6 +126,7 @@ segment_create(map_segment_t *ds_buf,
    /* init the contents of map_segment_t */
    shmem_ds_reset(ds_buf);

    (void)munmap(mca_sshmem_base_start_address, size);
probably not needed
why added then?
We now "reserve" that area by holding an mmap() on it, since there seems to be no randomization across an mmap/munmap + mmap sequence, and the area could be consumed by an unrelated mmap() in between.
Then in the modules we "overwrite" it (with ucp_mem_map() / mmap() / shmat()). The munmap() is an attempt to make that explicit, although it opens a race window, and mmap() replaces the area with MAP_FIXED anyway.
Will remove; I need to check that shmat() also overwrites an existing area.
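To illustrate the reserve-then-overwrite sequence described above, here is a minimal sketch. Both helper names are hypothetical and the code is not taken from the patch; it only assumes the documented mmap() behavior that MAP_FIXED replaces any existing mapping in the requested range.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: hold an inaccessible placeholder mapping so that
 * unrelated mmap() calls cannot be placed in this range. */
static void *reserve_range(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return (p == MAP_FAILED) ? NULL : p;
}

/* Hypothetical helper: replace the placeholder in place with usable
 * memory. With MAP_FIXED the old mapping is discarded by the same call,
 * so no explicit munmap() is needed beforehand. */
static void *commit_range(void *reserved, size_t size)
{
    void *p;

    assert(reserved != NULL && size > 0);
    p = mmap(reserved, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    return (p == MAP_FAILED) ? NULL : p;
}
```

Because the replacement happens inside a single mmap() call, there is no window in which another thread or library could take the range, which is the argument for dropping the explicit munmap() in the mmap module.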
removed for mmap module, kept for sysv module as it is needed
It seems negotiation is not done by default, as the default value of sshmem_base_start_address remains the same.
@@ -170,6 +170,7 @@ segment_create(map_segment_t *ds_buf,
    }

    /* Attach to the segment */
    (void)munmap(mca_sshmem_base_start_address, size);
why needed?
need to check shmat() overwrites existing mmap() areas, if yes I will remove.
could not remove for sysv case
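A possible explanation, as far as the Linux shmat(2) man page goes: attaching at an address that is already mapped fails with EINVAL unless the Linux-specific SHM_REMAP flag is passed. The sketch below shows the munmap()-then-shmat() order under that assumption; `sysv_attach_over_reservation()` is a hypothetical helper and not the sysv module code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

/* Hypothetical helper: attach a new System V segment at `reserved`, an
 * address currently held by a placeholder mmap() of `size` bytes.
 * Returns the attached address, or NULL on failure. If shmget() fails the
 * placeholder is left in place; once munmap() has run it is gone. */
static void *sysv_attach_over_reservation(void *reserved, size_t size)
{
    int shmid;
    void *p;

    assert(reserved != NULL && size > 0);
    shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shmid < 0) {
        return NULL;
    }

    /* Release the placeholder first. Another mmap() in the process could
     * take the range between this call and shmat(): this is the race
     * window mentioned earlier in the thread. */
    (void)munmap(reserved, size);

    /* The address must be SHMLBA-aligned, otherwise shmat() fails. */
    p = shmat(shmid, reserved, 0);

    /* Mark the segment for removal; it lives until the last detach. */
    (void)shmctl(shmid, IPC_RMID, NULL);

    return (p == (void *)-1) ? NULL : p;
}
```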
static int _check_non_static_segment(const map_segment_t *mem_segs,
                                     int n_segment,
                                     const void *start, const void *end)
{
    int i;

    for (i = 0; i < n_segment; i++) {
        if ((start <= mem_segs[i].super.va_base) &&
            (mem_segs[i].super.va_base < end)) {
            MEMHEAP_VERBOSE(100,
                            "non static segment: %p-%p already exists as %p-%p",
                            start, end, mem_segs[i].super.va_base,
                            mem_segs[i].super.va_end);
            return OSHMEM_ERROR;
        }
    }

    return OSHMEM_SUCCESS;
}
Is this just a safety check, or did you observe a real issue? Do we also need to check the case where the new segment starts in the middle of an existing one?
When looking for existing static segments, we need to skip the non-static segment that was created previously; otherwise it is added to the list twice, which is an issue, although I don't have a reproducer now.
In the previous patch we would try to make sure the segment is always above _end, and there is code to skip entries when > _end.
Since we look for known existing segments, such a segment can only be contained inside the entry we iterate over; I don't see a case where there can be a partial overlap.
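If partial overlap ever needed to be covered as well, a general test on half-open ranges would handle both the contained case and a segment starting in the middle of another. This is only a sketch: `ranges_overlap()` is a hypothetical helper, and it takes plain integers rather than the map_segment_t fields used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero when the half-open ranges
 * [a_start, a_end) and [b_start, b_end) share at least one byte.
 * Covers full containment as well as partial overlap on either side. */
static int ranges_overlap(uintptr_t a_start, uintptr_t a_end,
                          uintptr_t b_start, uintptr_t b_end)
{
    assert(a_start <= a_end && b_start <= b_end);
    return (a_start < b_end) && (b_start < a_end);
}
```

The check in the patch only tests whether `va_base` falls inside `[start, end)`, which is sufficient as long as the containment argument above holds.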
    rc = oshmem_shmem_allgather(&ptr, bases, sizeof(ptr));
    if (OSHMEM_SUCCESS != rc) {
        MEMHEAP_ERROR("Failed to exchange selected vma for base segment "
                      "(error %d)", rc);
        goto out;
    }
maybe we can also introduce an option without fallback to the original behavior? Then allgatherv will not be needed.
OK, in that case we could depend on the mca_sshmem_base_start_address value:
1- if 0: bcast the pointer value; any rank unable to create the mapping fails on its side, which is a global failure
2- if UINTPTR_MAX: bcast the pointer value, then allgather so that all ranks fall back on the default value together if any of them fails
The default could be point 2.
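The two modes proposed above could be modeled roughly as follows. This is a sketch under assumptions: the type, the function and the `mapped_ok` array are hypothetical, and the array stands in for what each rank would report (collected by allgather in mode 2; in mode 1 each rank would only know its own entry, while the job as a whole still fails).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of the base address selection. */
typedef enum {
    BASE_NEGOTIATED, /* every rank mapped the broadcast address */
    BASE_FALLBACK,   /* at least one rank failed: all use the default */
    BASE_FAILED      /* at least one rank failed and no fallback allowed */
} base_result_t;

/* Hypothetical decision function. `start_address` models the
 * sshmem_base_start_address parameter, `mapped_ok[i]` is nonzero when
 * rank i could map the address broadcast by rank 0. */
static base_result_t decide_base(uintptr_t start_address,
                                 const int *mapped_ok, size_t nranks)
{
    size_t i;

    assert(mapped_ok != NULL && nranks > 0);
    for (i = 0; i < nranks; i++) {
        if (!mapped_ok[i]) {
            /* Only the UINTPTR_MAX setting permits falling back. */
            return (start_address == UINTPTR_MAX) ? BASE_FALLBACK
                                                  : BASE_FAILED;
        }
    }
    return BASE_NEGOTIATED;
}
```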
fixed
        base = ptr;
    }

    rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
@brminich, I tried the patch below, where all ranks call mmap() themselves. The address returned by mmap() is randomized, as the output shows, so we need some form of synchronization of the base address.
memheap_exchange_base_address() #1: exchange base address: base 0x7fa7d9dff000: ok
memheap_exchange_base_address() #3: exchange base address: base 0x7fdc5a15b000: ok
memheap_exchange_base_address() #2: exchange base address: base 0x7fe8aa56a000: ok
memheap_exchange_base_address() #0: exchange base address: base 0x7f3d1736b000: ok
diff --git a/oshmem/mca/memheap/base/memheap_base_select.c b/oshmem/mca/memheap/base/memheap_base_select.c
index 0ec74de6aa..0b0cfe4bee 100644
--- a/oshmem/mca/memheap/base/memheap_base_select.c
+++ b/oshmem/mca/memheap/base/memheap_base_select.c
@@ -134,21 +134,8 @@ static int memheap_exchange_base_address(size_t size, void **address)
return OSHMEM_ERROR;
}
- if (oshmem_my_proc_id() == 0) {
- ptr = memheap_mmap_get(NULL, size);
- base = ptr;
- }
-
- rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
- if (OSHMEM_SUCCESS != rc) {
- MEMHEAP_ERROR("Failed to exchange allocated vma for base segment "
- "(error %d)", rc);
- goto out;
- }
-
- if (oshmem_my_proc_id() != 0) {
- ptr = memheap_mmap_get(base, size);
- }
+ ptr = memheap_mmap_get(NULL, size);
+ base = ptr;
MEMHEAP_VERBOSE(100, "#%d: exchange base address: base %p: %s",
oshmem_my_proc_id(), base,
I do not understand that comment, since the new default address is ~0 and rank 0 allocates and broadcasts the pointer value, but I acknowledge it is not a full negotiation.
What
Each process has its own _end, which depends on how the program was built. Try negotiation first, assuming that a symmetric layout leads to the same available memory areas. If not all ranks can create the mapping at the same position, fall back on the current hardcoded method.
We need to keep the mmap() as a reservation in all cases, so that intermediate library calls do not consume the area in between. If that happens, the UCX module overrides it, causing corruption later.
Tested
-mca sshmem_base_start_address 0xffffffffffffffff or no option: negotiation takes place, mmap reservation
-mca sshmem_base_start_address 0x7f.....: no negotiation, mmap reservation, detection of failure to allocate.
Static segment creation always skips the module-created segment. Segments found in /proc/self/maps are always bigger than or equal to the module-allocated one.
Misc
Configure: ./configure --prefix=rfs --enable-debug --with-ucx=rfs
Options: -mca memheap_base_verbose 100, -mca sshmem sysv/mmap/ucx