Batched Decoding #230
martindevans
started this conversation in General
Replies: 1 comment
-
If I understand this feature correctly, it's also possible to provide only one sequence per executor. It would then be possible to create a batch executor as a composition of existing executors, which would form batches from all of their sequence IDs instead.
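The composition idea could be sketched roughly like this (plain Python with illustrative names — `SequenceExecutor`, `BatchExecutor`, and `build_batch` are hypothetical, not part of the LLamaSharp API):

```python
# Hypothetical sketch of composing several single-sequence executors into
# one batch executor. Each child executor owns exactly one sequence ID;
# the batch executor interleaves one token per child into a single batch
# each step, mirroring the (token, position, sequence ID) triples that a
# llama.cpp-style batch carries.
from dataclasses import dataclass, field


@dataclass
class SequenceExecutor:
    """Owns exactly one sequence ID and its pending tokens."""
    seq_id: int
    pending: list = field(default_factory=list)

    def next_token(self):
        return self.pending.pop(0) if self.pending else None


class BatchExecutor:
    """Forms one decode batch per step from all child executors."""
    def __init__(self, executors):
        self.executors = executors

    def build_batch(self, pos):
        # Each entry: (token, position, seq_id).
        batch = []
        for ex in self.executors:
            tok = ex.next_token()
            if tok is not None:
                batch.append((tok, pos, ex.seq_id))
        return batch


a = SequenceExecutor(seq_id=0, pending=[11, 12])
b = SequenceExecutor(seq_id=1, pending=[21])
batched = BatchExecutor([a, b])
print(batched.build_batch(pos=0))  # [(11, 0, 0), (21, 0, 1)]
print(batched.build_batch(pos=1))  # [(12, 1, 0)]
```

The point of the composition is that the underlying model sees one batch per step regardless of how many logical sequences are in flight, while each child executor keeps its own single-sequence view.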
-
llama.cpp recently added an entirely new way to manage the KV cache. LLamaSharp has some bindings to this API (#185), but they're barely used - all of the executors are still using the old `llama_eval` method, which is now obsolete.

In #223 I added a new example of basic batched decoding, which is a direct port of one of the llama.cpp examples. It uses the low-level APIs directly; I'll be working to provide safe wrappers around everything it does.
In the future this may become the basis of an entirely new executor in LLamaSharp. For example, the batched decoding example could become a new type of executor that provides multiple output streams.
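The "multiple output streams" idea is essentially demultiplexing: one decode call processes a batch tagged with several sequence IDs, and each result is routed back to the stream for its sequence. A minimal plain-Python illustration of that routing (with `fake_decode` standing in for a real decode call — none of these names are LLamaSharp or llama.cpp API):

```python
# Illustrative sketch: one batch carries (token, position, seq_id) entries
# for several sequences; after decoding, results are routed back to a
# per-sequence output stream.
from collections import defaultdict


def fake_decode(batch):
    # Stand-in for a real decode call: pretend the "sampled token" for each
    # entry is simply token + 1, tagged with its sequence ID.
    return [(seq_id, token + 1) for (token, pos, seq_id) in batch]


def demux(results):
    # Group decoded tokens by sequence ID, preserving order within each.
    streams = defaultdict(list)
    for seq_id, token in results:
        streams[seq_id].append(token)
    return dict(streams)


# Two entries for sequence 0 (positions 0 and 1) and one for sequence 1.
batch = [(100, 0, 0), (200, 0, 1), (101, 1, 0)]
print(demux(fake_decode(batch)))  # {0: [101, 102], 1: [201]}
```

An executor built this way could expose one stream per sequence ID to callers while internally sharing a single context and a single decode loop.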