
Handling of streams for new API layer #176

@seberg

Moving over from gh-175, I think stream handling needs to be discussed separately. I had thought about this quite a bit a while ago. My takeaway (very summarized) was that the CAI approach is actually the sane one.
The caveat is the stream lifetime guarantee: you really need to keep the stream alive, which means attaching it to the lifetime of the exported object in some cases.

A very rough sketch of the reasoning. There are two main cases for exchange:

  1. The kernel use case (e.g. numba, FFI). In this case the lifetime of everything is effectively "borrowed": by the time the kernel returns, all tensors (and the stream) are released.
  2. Long-term export by the user. In this case, stream order guarantees should be up to the user IMO (besides, likely, at the start and end of the lifetime).
    • End of lifetime is the interesting point here, because if we need the stream then, the stream's lifetime has to be tied to the lifetime of the export (e.g. due to asynchronous deallocation).
    • There are probably some mixed cases, but I think it is helpful to focus on these two.
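The lifetime-tying point in case 2 can be sketched roughly as follows. All names here are hypothetical (this is not any real protocol's API); the only point is that the export holds a reference, so the stream outlives the producer's own handle to it:

```python
import weakref

class Stream:
    """Stand-in for a CUDA stream handle (toy, not a real stream)."""
    pass

class Export:
    """Exported buffer that pins the stream for its own lifetime."""
    def __init__(self, stream):
        self._stream = stream  # keep the stream alive as long as the export
    def __del__(self):
        # In a real implementation, the (asynchronous) free would be
        # enqueued on self._stream here -- which is why the stream must
        # still exist at this point.
        self._stream = None

s = Stream()
ref = weakref.ref(s)
exp = Export(s)
del s                      # producer drops its own reference...
assert ref() is not None   # ...but the export keeps the stream alive
del exp
assert ref() is None       # stream released together with the export
```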

There are three existing solutions that we see:

  • CAI (CUDA Array Interface): the producer provides the stream.
  • DLPack: the consumer provides the stream.
  • Arrow: use of a (synchronization) event.
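A hedged sketch of who supplies the stream in each protocol, at the Python level. The CAI dict keys and the `__dlpack__` method name are real; everything else (class names, toy integer stream handles, the return value) is made up for illustration:

```python
class CAIArray:
    # CAI v3: the *producer* advertises its stream in the interface dict.
    @property
    def __cuda_array_interface__(self):
        return {"shape": (4,), "typestr": "<f4",
                "data": (0, False), "version": 3,
                "stream": 7}  # toy handle for the producer's stream

class DLPackArray:
    # DLPack: the *consumer* passes its stream in; the producer makes the
    # data safe to use on that stream before handing over the capsule.
    def __dlpack__(self, stream=None):
        consumer_stream = stream  # producer would synchronize onto this
        return "capsule"          # stand-in for the real PyCapsule

# Arrow (C device interface): neither side passes a stream at call time;
# the producer attaches a sync event that the consumer must wait on
# before touching the data.

cai = CAIArray()
assert cai.__cuda_array_interface__["stream"] == 7
dl = DLPackArray()
assert dl.__dlpack__(stream=2) == "capsule"
```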

Of these, CAI is the only one that cleanly supports the first (kernel) use case. The other two seem designed mainly for the second use case and have a severe limitation here IMO. For Arrow this may actually not be a problem, because Arrow does not support mutability: as long as there is no asynchronous deallocation, the data passed into a kernel cannot be mutated before the kernel finishes (immutable data cannot be mutated, except by deallocation, which effectively amounts to mutating it)!
(For DLPack, mutation is a thing, and I think that rules out a single-event design. There may be ways to use multiple events, but it is tricky. A single event is maybe cleaner, though possibly slower, than the DLPack design, since the DLPack design does not provide additional functionality.)

To explain why CAI is the only one, consider the following:

# library-side kernel, consuming `arr` via the exchange protocol
def kernel(arr):
    return arr + 1

arr = producer.create_array()
res1 = library.kernel(arr)
arr += 2  # mutates arr while the kernel may still be running

CAI allows the kernel to sync back to the stream on which arr += 2 will happen; a single event or the consumer stream does not. (Leo correctly pointed out that if the kernel uses xp.from_dlpack(arr + 1) to return the result, or even just converts arr via DLPack a second time, a kernel could hack in an ordering that way.)
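The ordering argument above can be modeled with a toy (non-CUDA) sketch, where streams are just named operation logs and "synchronization" is modeled by the order operations are appended. The point it illustrates: because CAI exports the producer's stream, the consumer can make the later `arr += 2` wait on the kernel; with a consumer-only stream or a single event, the consumer has no handle to the stream that `arr += 2` will run on.

```python
# Toy model, not real CUDA: a "stream" is a name, and launching appends
# to a global log. Every name and handle here is made up.
log = []

class ToyStream:
    def __init__(self, name):
        self.name = name
    def launch(self, op):
        log.append((self.name, op))

producer_stream = ToyStream("producer")   # exported via CAI's "stream" key
consumer_stream = ToyStream("consumer")

consumer_stream.launch("kernel(arr)")
# Because the producer's stream was exported, the consumer can enqueue a
# wait on it, ordering everything the producer launches afterwards:
producer_stream.launch("wait_for(kernel)")
producer_stream.launch("arr += 2")

ops = [op for _, op in log]
assert ops.index("kernel(arr)") < ops.index("arr += 2")
```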

If asynchronous deallocation is something to expect, then things might get even more complicated (it is often possible, but maybe uncommon in practice).
Memory pools are also asynchronous, and maybe even worse in this respect?! It is honestly a problem I had missed here!

Another approach to the stream lifetime issue could be to provide a get_stream() which returns the currently active stream. I am not quite sure how that would work, but if lifetime management isn't a problem, it seemed easier to me.
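A minimal sketch of what that get_stream() idea might look like (no such API exists in CAI or DLPack today; the names and the integer handle are entirely made up): instead of baking a stream into the exchange, the producer exposes whichever stream it is currently enqueueing on, and the consumer queries it whenever it needs to order work.

```python
class ToyStream:
    """Stand-in for a stream handle (hypothetical)."""
    def __init__(self, handle):
        self.handle = handle

class ProducerArray:
    """Hypothetical producer exposing its currently active stream."""
    def __init__(self):
        self._stream = ToyStream(7)  # stream the producer enqueues work on
    def get_stream(self):
        return self._stream          # consumer syncs against this on demand

arr = ProducerArray()
assert arr.get_stream().handle == 7
```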

(I had written down some more problematic cases, but I think the mutating one is the interesting one. The escalation of the mutating case might mainly be the one where the "mutation" is really just the deallocation itself -- I would have to read my own notes again to be sure I am not missing things.)
