oshmem/shmem: Allocate and exchange base segment address beforehand #12889

Open: wants to merge 4 commits into base: main

@tvegas1 (Contributor) commented Oct 28, 2024

What

The position of each process's _end depends on the program that was built. Try negotiation first, on the assumption that a symmetric layout leaves the same memory areas available on every rank. If not all ranks can create the segment at the same position, fall back to the current hardcoded method.

We need to keep the mmap() in place as a reservation in all cases, so that intermediate library calls do not consume the area in the meantime. If that happens, the UCX module overwrites the mapping, causing corruption later on.
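
The reserve-then-overwrite sequence described above can be sketched as follows. This is a minimal illustration and not the code from this PR: reserve_range() and commit_range() are hypothetical helper names, and using PROT_NONE with MAP_NORESERVE for the placeholder is an assumption.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve `size` bytes of address space without
 * making them usable. The PROT_NONE placeholder only prevents other
 * mmap() calls from being placed in the range. Returns NULL on failure. */
static void *reserve_range(void *hint, size_t size)
{
    void *p = mmap(hint, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Hypothetical helper: replace the placeholder with a usable mapping.
 * MAP_FIXED discards whatever is mapped at `base` as part of the same
 * call, so the range is never left free. Returns NULL on failure. */
static void *commit_range(void *base, size_t size)
{
    assert(base != NULL && size > 0);
    void *p = mmap(base, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

A caller would keep the pointer returned by reserve_range() until the module is ready, then pass the same pointer and size to commit_range().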

Tested

  1. -mca sshmem_base_start_address 0xffffffffffffffff or no option: negotiation takes place, with mmap reservation
  2. -mca sshmem_base_start_address 0x7f.....: no negotiation, mmap reservation, failure to allocate is detected
  3. when one or more ranks fail to negotiate, all ranks fall back to the hardcoded method, with mmap reservation

Static segment creation always skips the module-created segment. Segments found in /proc/self/maps are always larger than or equal to the module-allocated one.

Misc

Configure: ./configure --prefix=rfs --enable-debug --with-ucx=rfs
Options: -mca memheap_base_verbose 100, -mca sshmem sysv/mmap/ucx

#endif
}

if (mca_sshmem_base_start_address != memheap_mmap_get(
Contributor Author

This relies on mmap() always creating the VMA at the hint position when possible. If that is not always true (for example across kernel versions), this could regress existing behavior and even fail to honor the command-line parameter.

Shall we remove that confirmation check and proceed regardless? Or maybe only ignore the check when the address was passed on the command line?
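
For reference, the kind of confirmation check being discussed can be sketched like this. It is a minimal illustration under assumptions: map_exactly_at() is a hypothetical name and is not the memheap_mmap_get() used in the patch. On Linux 4.17 and later, MAP_FIXED_NOREPLACE is another way to ask for "exactly here or fail", but older kernels treat it as a plain hint, so comparing the returned address is still needed.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: try to map `size` bytes exactly at `hint`
 * without MAP_FIXED. The kernel treats `hint` as a suggestion only, so
 * the result is compared with the request and the mapping is undone if
 * it landed elsewhere. Returns `hint` on success, NULL otherwise. */
static void *map_exactly_at(void *hint, size_t size)
{
    assert(hint != NULL && size > 0);

    void *p = mmap(hint, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return NULL;
    }
    if (p != hint) {
        /* The hint was not honored: release the stray mapping. */
        (void)munmap(p, size);
        return NULL;
    }
    return p;
}
```

When the hinted range is already occupied, a non-MAP_FIXED mmap() cannot return the hint, so the helper reports failure instead of silently using a different address.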

@@ -126,6 +126,7 @@ segment_create(map_segment_t *ds_buf,
/* init the contents of map_segment_t */
shmem_ds_reset(ds_buf);

(void)munmap(mca_sshmem_base_start_address, size);
Contributor Author

probably not needed

Member

why added then?

Contributor Author

We now "reserve" that area by holding an mmap() on it: there seems to be no randomization across an mmap()/munmap() + mmap() sequence, so the area could be consumed by an unrelated mmap() in between.

Then, in the modules, we "overwrite" it with ucp_mem_map(), mmap(), or shmat(). The munmap() is an attempt to make that explicit, although it opens a race window, and mmap() with MAP_FIXED replaces the mapping anyway.

Will remove; I need to check that shmat() also overwrites an existing area.

Contributor Author

Removed for the mmap module; kept for the sysv module, where it is needed.

@tvegas1 (Contributor Author) commented Oct 28, 2024

@brminich

@brminich (Member) left a comment

seems like negotiation is not done by default, as default value of sshmem_base_start_address remains the same

@@ -170,6 +170,7 @@ segment_create(map_segment_t *ds_buf,
}

/* Attach to the segment */
(void)munmap(mca_sshmem_base_start_address, size);
Member

why needed?

Contributor Author

Need to check whether shmat() overwrites existing mmap() areas; if so, I will remove it.

Contributor Author

could not remove for sysv case
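
A minimal sketch of the sysv sequence, assuming a reservation is already held at the target address. sysv_attach_at() is a hypothetical name and not the module's segment_create(). On Linux, shmat() without the Linux-specific SHM_REMAP flag is documented to fail with EINVAL when a mapping already exists in the target range, which is consistent with the munmap() being needed here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

/* Hypothetical helper: attach a new System V segment of `size` bytes at
 * the reserved, SHMLBA-aligned address `base`. Returns `base` on
 * success, NULL otherwise. */
static void *sysv_attach_at(void *base, size_t size)
{
    assert(base != NULL && ((uintptr_t)base % (uintptr_t)SHMLBA) == 0);

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) {
        return NULL;
    }

    /* Drop the placeholder. Until shmat() returns, the range is free
     * and another thread's mmap() could take it: this is the race. */
    (void)munmap(base, size);
    void *p = shmat(id, base, 0);

    /* Mark the segment for removal once the last process detaches. */
    (void)shmctl(id, IPC_RMID, NULL);

    return (p == (void *)-1) ? NULL : p;
}
```

If the race window is a concern, SHM_REMAP would let shmat() replace the placeholder directly, at the cost of being Linux-only.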

Comment on lines +143 to +162
static int _check_non_static_segment(const map_segment_t *mem_segs,
                                     int n_segment,
                                     const void *start, const void *end)
{
    int i;

    for (i = 0; i < n_segment; i++) {
        if ((start <= mem_segs[i].super.va_base) &&
            (mem_segs[i].super.va_base < end)) {
            MEMHEAP_VERBOSE(100,
                            "non static segment: %p-%p already exists as %p-%p",
                            start, end, mem_segs[i].super.va_base,
                            mem_segs[i].super.va_end);
            return OSHMEM_ERROR;
        }
    }

    return OSHMEM_SUCCESS;
}

Member

Is it just a safety check, or did you observe a real issue? Do we also need to check the case where the new segment starts in the middle of the existing one?

Contributor Author

When looking for existing static segments, we need to skip the non-static segment that was created earlier; otherwise it is added to the list twice, which is an issue, although I don't have a reproducer right now.

In the previous patch we tried to make sure the segment was always above _end, and there is code to skip anything > _end.

Since we are looking for known existing segments, the non-static segment can only be fully contained in the one we are iterating over; I don't see a case where a partial overlap can occur.
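
To make the difference between the two checks concrete, here is a small sketch that uses plain integer addresses. Both function names are hypothetical. base_inside() mirrors the condition in the patch, where only the base of the existing segment is tested; ranges_overlap() is the general half-open interval test that would also catch a range starting in the middle of the existing segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition used in the patch: the existing segment's base lies inside
 * the candidate range [start, end). */
static bool base_inside(uintptr_t start, uintptr_t end, uintptr_t seg_base)
{
    assert(start <= end);
    return start <= seg_base && seg_base < end;
}

/* General test: half-open ranges [a_start, a_end) and [b_start, b_end)
 * overlap exactly when each one starts before the other one ends. */
static bool ranges_overlap(uintptr_t a_start, uintptr_t a_end,
                           uintptr_t b_start, uintptr_t b_end)
{
    assert(a_start <= a_end && b_start <= b_end);
    return a_start < b_end && b_start < a_end;
}
```

If the containment argument above holds, both tests give the same answer for the ranges found in /proc/self/maps, and the simpler check is sufficient.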

Comment on lines 157 to 162
rc = oshmem_shmem_allgather(&ptr, bases, sizeof(ptr));
if (OSHMEM_SUCCESS != rc) {
    MEMHEAP_ERROR("Failed to exchange selected vma for base segment "
                  "(error %d)", rc);
    goto out;
}
Member

maybe we can also introduce an option without fallback to the original behavior? Then allgatherv will not be needed.

Contributor Author

OK, in that case we could depend on the value of mca_sshmem_base_start_address:
1- if 0: bcast the pointer value; any rank unable to create the segment fails on its side, which results in a global failure
2- if UINTPTR_MAX: bcast the pointer value, then allgather so that all ranks fall back to the default value

The default could be point 2-.
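
The two modes can be sketched as a pure decision function, with the collective calls left out. This is only an illustration: decide() and enum outcome are hypothetical names, and the ok[] array stands in for the per-rank results that an allgather would collect in mode 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum outcome { USE_NEGOTIATED, USE_HARDCODED, GLOBAL_FAILURE };

/* Hypothetical sketch: ok[i] is true when rank i managed to map the
 * base address broadcast by rank 0. Only the two start_address values
 * discussed above (0 and UINTPTR_MAX) are handled. */
static enum outcome decide(uintptr_t start_address,
                           const bool *ok, size_t nranks)
{
    assert(ok != NULL && nranks > 0);
    assert(start_address == 0 || start_address == UINTPTR_MAX);

    bool all_ok = true;
    for (size_t i = 0; i < nranks; i++) {
        all_ok = all_ok && ok[i];
    }
    if (all_ok) {
        return USE_NEGOTIATED;
    }

    /* 0: no fallback, a single failing rank fails the whole job.
     * UINTPTR_MAX: every rank falls back to the hardcoded address. */
    return (start_address == UINTPTR_MAX) ? USE_HARDCODED : GLOBAL_FAILURE;
}
```

In mode 1 a failing rank can abort locally without consulting the others, which is why the allgather is only needed for mode 2.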

Contributor Author

fixed

base = ptr;
}

rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
Contributor Author

@brminich, I tried the patch below, where all ranks do the mmap() themselves. The address returned by mmap() is randomized, as shown below, so we need some form of synchronization of the base address.

memheap_exchange_base_address() #1: exchange base address: base 0x7fa7d9dff000: ok
memheap_exchange_base_address() #3: exchange base address: base 0x7fdc5a15b000: ok
memheap_exchange_base_address() #2: exchange base address: base 0x7fe8aa56a000: ok
memheap_exchange_base_address() #0: exchange base address: base 0x7f3d1736b000: ok
diff --git a/oshmem/mca/memheap/base/memheap_base_select.c b/oshmem/mca/memheap/base/memheap_base_select.c
index 0ec74de6aa..0b0cfe4bee 100644
--- a/oshmem/mca/memheap/base/memheap_base_select.c
+++ b/oshmem/mca/memheap/base/memheap_base_select.c
@@ -134,21 +134,8 @@ static int memheap_exchange_base_address(size_t size, void **address)
         return OSHMEM_ERROR;
     }

-    if (oshmem_my_proc_id() == 0) {
-        ptr = memheap_mmap_get(NULL, size);
-        base = ptr;
-    }
-
-    rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
-    if (OSHMEM_SUCCESS != rc) {
-        MEMHEAP_ERROR("Failed to exchange allocated vma for base segment "
-                      "(error %d)", rc);
-        goto out;
-    }
-
-    if (oshmem_my_proc_id() != 0) {
-        ptr = memheap_mmap_get(base, size);
-    }
+    ptr = memheap_mmap_get(NULL, size);
+    base = ptr;

     MEMHEAP_VERBOSE(100, "#%d: exchange base address: base %p: %s",
                     oshmem_my_proc_id(), base,

@tvegas1 (Contributor Author) commented Nov 5, 2024

seems like negotiation is not done by default, as default value of sshmem_base_start_address remains the same

I do not understand that comment, since the new default address is ~0 and rank 0 allocates and bcasts the pointer value; but ack, it is not a full negotiation.
