-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[core][rdt] GC RDT object metadata on owner #57671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a memory leak where RDT object metadata was not being garbage collected on the owner process. The change introduces logic to free this metadata when the object's reference count drops to zero by popping it from managed_gpu_object_metadata
in free_object_primary_copy
. The PR includes comprehensive tests for both gloo
and nixl
backends to verify the new garbage collection behavior and adds a skipped test for future work on lineage reconstruction. The changes are logical and well-tested. I have one suggestion regarding the implementation in free_object_primary_copy
to make it fully idempotent as per its docstring, which would make it more robust.
# NOTE: This may have to change if we support lineage reconstruction for RDT | ||
self.managed_gpu_object_metadata.pop(object_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring for free_object_primary_copy
states that it is "Expected to be idempotent". However, the current implementation is not fully idempotent. If this function is called more than once for the same object_id
, the first call will pop the metadata, but subsequent calls will raise a KeyError
on line 543 when trying to access self.managed_gpu_object_metadata[object_id]
, resulting in an error being logged.
To make this function idempotent and avoid the error log on subsequent calls, you could fetch and remove the metadata atomically at the beginning of the function. Here's a suggested refactoring:
def free_object_primary_copy(self, object_id: str):
from ray.experimental.gpu_object_manager.gpu_object_store import (
__ray_free__,
)
try:
gpu_object_meta = self.managed_gpu_object_metadata.pop(object_id, None)
if gpu_object_meta is None:
# Already freed, so this is a no-op.
return
src_actor = gpu_object_meta.src_actor
tensor_transport_backend = gpu_object_meta.tensor_transport_backend
tensor_transport_meta = gpu_object_meta.tensor_transport_meta
src_actor.__ray_call__.options(concurrency_group="_ray_system").remote(
__ray_free__, object_id, tensor_transport_backend, tensor_transport_meta
)
except Exception as e:
logger.error(
"Something went wrong while freeing the RDT object!", exc_info=e
)
This change would make the function truly idempotent as per its docstring's expectation.
__ray_free__, object_id, tensor_transport_backend, tensor_transport_meta | ||
) | ||
# NOTE: This may have to change if we support lineage reconstruction for RDT | ||
self.managed_gpu_object_metadata.pop(object_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Non-Idempotent Metadata Removal Causes Errors
The free_object_primary_copy
method isn't idempotent as documented. It removes an object's metadata on the first call, causing subsequent calls for the same object ID to raise a KeyError
when accessing the missing metadata. This results in spurious error logs.
Why are these changes needed?
Currently RDT object metadata will stick around on the driver. This can lead to some weird behavior like the src actor living past expectations because the actor handle sticks around in the metadata. I'm now freeing the metadata on the owner at whenever we decide to tell the primary copy owner to free (when the ref counter decides it's "OutOfScope").
ray/python/ray/experimental/gpu_object_manager/gpu_object_manager.py
Lines 29 to 38 in d8b7a54
A couple TODO's for the future: