
Handling of streams for new API layer #176

@seberg

Moving over from gh-175, I think stream handling needs to be discussed separately. I had thought about this quite a bit a while ago. My takeaway (very summarized) was that the CAI approach is actually the sane one.
The caveat is the stream lifetime guarantee: you really need to keep the stream alive, which means attaching it to the lifetime of the exported object in some cases.

A very rough sketch of the reasoning. There are two main cases for exchange:

  1. The kernel use case (e.g. numba, FFI). In this case the lifetime of everything is effectively "borrowed": by the time the kernel returns, all tensors (and the stream) are released.
  2. Long-term export by the user. In this case, stream order guarantees should be up to the user IMO (besides, likely, at the start and end of the lifetime).
    • End of lifetime is the interesting point here, because if we need the stream then, the stream's lifetime has to be tied to the lifetime of the export (e.g. due to asynchronous deallocation).
    • There are probably some mixed cases, but I think it is helpful to focus on these two.
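The lifetime-tying point in case 2 can be sketched roughly as follows. All names here are hypothetical (this is not any real protocol's API); the only point is that the export holds a reference, so the stream outlives the producer's own handle to it:

```python
import weakref

class Stream:
    """Stand-in for a CUDA stream handle (toy, not a real stream)."""
    pass

class Export:
    """Exported buffer that pins the stream for its own lifetime."""
    def __init__(self, stream):
        self._stream = stream  # keep the stream alive as long as the export
    def __del__(self):
        # In a real implementation, the (asynchronous) free would be
        # enqueued on self._stream here -- which is why the stream must
        # still exist at this point.
        self._stream = None

s = Stream()
ref = weakref.ref(s)
exp = Export(s)
del s                      # producer drops its own reference...
assert ref() is not None   # ...but the export keeps the stream alive
del exp
assert ref() is None       # stream released together with the export
```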

There are three existing solutions that we see:

  • CAI (CUDA Array Interface): the producer provides the stream.
  • DLPack: the consumer provides the stream.
  • Arrow: use of a (synchronization) event.
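A hedged sketch of who supplies the stream in each protocol, at the Python level. The CAI dict keys and the `__dlpack__` method name are real; everything else (class names, toy integer stream handles, the return value) is made up for illustration:

```python
class CAIArray:
    # CAI v3: the *producer* advertises its stream in the interface dict.
    @property
    def __cuda_array_interface__(self):
        return {"shape": (4,), "typestr": "<f4",
                "data": (0, False), "version": 3,
                "stream": 7}  # toy handle for the producer's stream

class DLPackArray:
    # DLPack: the *consumer* passes its stream in; the producer makes the
    # data safe to use on that stream before handing over the capsule.
    def __dlpack__(self, stream=None):
        consumer_stream = stream  # producer would synchronize onto this
        return "capsule"          # stand-in for the real PyCapsule

# Arrow (C device interface): neither side passes a stream at call time;
# the producer attaches a sync event that the consumer must wait on
# before touching the data.

cai = CAIArray()
assert cai.__cuda_array_interface__["stream"] == 7
dl = DLPackArray()
assert dl.__dlpack__(stream=2) == "capsule"
```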

Of these, CAI is the only one that cleanly supports the first (kernel) use case. The other two seem designed mainly for the second use case and have a severe limitation here IMO. For Arrow this may actually not be a problem, because Arrow does not support mutability: as long as there is no asynchronous deallocation, the data passed into a kernel cannot be mutated before the kernel finishes (immutable data cannot be mutated, except by deallocation, which effectively amounts to mutating it)!
(For DLPack, mutation is a thing, and I think that rules out a single-event design. There may be ways to use multiple events, but it is tricky. A single event is maybe cleaner, though possibly slower, than the DLPack design, since the DLPack design does not provide additional functionality.)

To explain why CAI is the only one, consider the following:

# library-side kernel, consuming `arr` via the exchange protocol
def kernel(arr):
    return arr + 1

arr = producer.create_array()
res1 = library.kernel(arr)
arr += 2  # mutates arr while the kernel may still be running

CAI allows the kernel to sync back to the stream on which arr += 2 will happen; a single event or the consumer stream does not. (Leo correctly pointed out that if the kernel uses xp.from_dlpack(arr + 1) to return the result, or even just converts arr via DLPack a second time, a kernel could hack in an ordering that way.)
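The ordering argument above can be modeled with a toy (non-CUDA) sketch, where streams are just named operation logs and "synchronization" is modeled by the order operations are appended. The point it illustrates: because CAI exports the producer's stream, the consumer can make the later `arr += 2` wait on the kernel; with a consumer-only stream or a single event, the consumer has no handle to the stream that `arr += 2` will run on.

```python
# Toy model, not real CUDA: a "stream" is a name, and launching appends
# to a global log. Every name and handle here is made up.
log = []

class ToyStream:
    def __init__(self, name):
        self.name = name
    def launch(self, op):
        log.append((self.name, op))

producer_stream = ToyStream("producer")   # exported via CAI's "stream" key
consumer_stream = ToyStream("consumer")

consumer_stream.launch("kernel(arr)")
# Because the producer's stream was exported, the consumer can enqueue a
# wait on it, ordering everything the producer launches afterwards:
producer_stream.launch("wait_for(kernel)")
producer_stream.launch("arr += 2")

ops = [op for _, op in log]
assert ops.index("kernel(arr)") < ops.index("arr += 2")
```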

If asynchronous deallocation is something to expect, then things might get even more complicated (it is often possible, but maybe uncommon in practice).
Memory pools are also asynchronous, and maybe even worse in this respect?! It is honestly a problem I had missed here!

Another approach to the stream lifetime issue could be to provide a get_stream() which returns the currently active stream. I am not quite sure how that would work, but if lifetime management isn't a problem, it seemed easier to me.
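A minimal sketch of what that get_stream() idea might look like (no such API exists in CAI or DLPack today; the names and the integer handle are entirely made up): instead of baking a stream into the exchange, the producer exposes whichever stream it is currently enqueueing on, and the consumer queries it whenever it needs to order work.

```python
class ToyStream:
    """Stand-in for a stream handle (hypothetical)."""
    def __init__(self, handle):
        self.handle = handle

class ProducerArray:
    """Hypothetical producer exposing its currently active stream."""
    def __init__(self):
        self._stream = ToyStream(7)  # stream the producer enqueues work on
    def get_stream(self):
        return self._stream          # consumer syncs against this on demand

arr = ProducerArray()
assert arr.get_stream().handle == 7
```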

(I had written down some more problematic cases, but I think the mutating one is the interesting one. The escalation of the mutating case might mainly be the one where the "mutation" is really just the deallocation itself -- I would have to read my own notes again to be sure I am not missing things.)
