Replies: 2 comments
- I am also experiencing this with my M2 Air (16 GB RAM). It uses double the amount of RAM vs when I use …
- Did any of you fix it? I'm having the same problem: with ollama run I get around 20-30 tokens per second with high GPU usage, but Avante is doing maybe 2 tokens per second with low GPU usage.
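  Not a fix, but a way to compare the two setups on equal footing: Ollama's /api/generate response includes its own eval_count and eval_duration stats, so you can measure raw tokens per second outside of any editor plugin. A minimal sketch, assuming a local Ollama on the default port; the model name and prompt are just placeholders:

  ```python
  # Rough tokens/sec measurement against a local Ollama instance.
  # Assumes Ollama is serving on the default port 11434 and the
  # model below has already been pulled (`ollama pull codellama`).
  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "codellama",  # swap in whatever model Avante points at
          "prompt": "Write a hello-world function in Lua.",
          "stream": False,
      },
      timeout=300,
  )
  data = resp.json()

  # eval_count / eval_duration are Ollama's own generation stats;
  # eval_duration is reported in nanoseconds.
  tps = data["eval_count"] / (data["eval_duration"] / 1e9)
  print(f"{data['eval_count']} tokens at {tps:.1f} tok/s")
  ```

  If this raw number matches what you see with ollama run while Avante stays slow, the difference is likely in the prompt/context Avante ships rather than in the model itself.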
- I'm testing Avante locally with Ollama, similar to how I've previously used gen.nvim, except that with Avante the model's memory balloons out of VRAM as soon as I work with it, making it VERY slow and killing the "vibe", as they say.
Debugging this, I see a gnarly context prompt (not sure if that comes from Avante or CodeLlama, but the content, just wow), and I figure it's shipping a snippet of the file, or the whole thing, too. Is there a way I can scope it to limit the context / the amount of the file that is sent, and reduce the context size? I suspect this is causing an excess of tokens to be loaded into the model, making it expand out of VRAM (see the num_ctx sketch below).
Here's an example. Before: 5.7 GB, 100% GPU. After: [screenshot]
Ref system: [screenshot]
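  On the context-size suspicion: Ollama allocates the KV cache for the whole requested context window (num_ctx), so a client that asks for a big window can push the model out of VRAM even with a short prompt. What avante.nvim actually requests here is an assumption on my part, but the effect is easy to reproduce against the raw Ollama API. A minimal sketch, assuming the default local endpoint; the model name and prompt are placeholders:

  ```python
  # Show how the requested context window (num_ctx) affects Ollama.
  # The KV cache is allocated for the full window, so a large num_ctx
  # can push the model out of VRAM regardless of the prompt length.
  import requests

  def generate(prompt: str, num_ctx: int) -> dict:
      resp = requests.post(
          "http://localhost:11434/api/generate",
          json={
              "model": "codellama",
              "prompt": prompt,
              "stream": False,
              # Per-request override; this can also be baked into a
              # Modelfile with `PARAMETER num_ctx 2048`.
              "options": {"num_ctx": num_ctx},
          },
          timeout=300,
      )
      return resp.json()

  # Compare a small vs large context window; watch `ollama ps`
  # or GPU memory while these run to see the allocation jump.
  for ctx in (2048, 16384):
      data = generate("Summarize this function.", num_ctx=ctx)
      tps = data["eval_count"] / (data["eval_duration"] / 1e9)
      print(f"num_ctx={ctx}: {tps:.1f} tok/s")
  ```

  If the large-window call alone spills out of VRAM, capping num_ctx (or trimming what the plugin sends) is the lever to pull.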