Fun Experiment: Falcon 180B Q4_0 locally on a phone! #3113
BarfingLemurs
started this conversation in
Show and tell
Replies: 1 comment
-
It's not really doing anything too magical. Basically, you're going to be reading the whole model from the SSD once per token, so the bottleneck is mainly the speed of the storage. The good news is that reads alone shouldn't cause excessive wear on the flash, but it's not going to be fast. Don't try …
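The bottleneck described above can be sketched as simple arithmetic: if each token requires streaming roughly the whole model file from storage, the sustained read bandwidth caps tokens per second. The figures below are illustrative assumptions, not measurements from this thread.

```python
# Back-of-the-envelope: with mmap-backed weights that don't fit in RAM,
# each generated token re-reads roughly the whole model from storage,
# so read bandwidth bounds generation speed.
model_gb = 96.0    # assumed size of a Q4_0 ~180B model file, in GB
read_gbps = 1.0    # assumed sustained flash read speed, in GB/s
sec_per_token = model_gb / read_gbps
print(f"~{sec_per_token:.0f} s per token")  # ~96 s per token
```

With numbers in this ballpark, the estimate lands close to the ~93 s/token the original poster observed, which is consistent with storage being the limiting factor.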
-
Pixel 6 Pro (12GB)
(wanted to see t/s for streaming #3047 (comment))
Ran `termux-setup-storage`, then ran the model, resulting in generation at ~93 s per token.
My guess is this will produce ~390 tokens after 10 hours (36,000 s ÷ 93 s/token).
I would very much like to know more about mmap inference. This ability to run big models is incredible, and I had overlooked learning about it. I'm sure it's not one of the project's main goals, but it's a wonderful byproduct. Does having only 8 GB of RAM available hurt speed more than 12 GB?
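For anyone else curious how mmap inference works: the idea is that the model file is mapped into the process's address space without being read up front, and the OS pages data in from storage on first access and evicts pages under memory pressure. That is also why more RAM helps: more of the file stays cached between tokens. A minimal sketch in Python (llama.cpp does the equivalent in C/C++; the file name here is a stand-in, not a real model):

```python
import mmap
import os

# Create a small stand-in "model file" just for the demo.
path = "weights.bin"  # hypothetical file name
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of dummy data

with open(path, "rb") as f:
    # Map the whole file read-only; nothing is loaded into RAM yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching a byte triggers a page fault; the OS reads that page
    # from storage. Subsequent accesses hit the page cache until the
    # OS evicts the page to free memory.
    first = mm[0]
    mm.close()
os.remove(path)
```

The practical consequence for a 12 GB vs 8 GB phone: with less RAM, fewer pages survive in the cache between tokens, so a larger fraction of the model must be re-read from flash each step, and speed degrades toward the pure storage-bandwidth limit.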
What are the bottlenecks here?
I have little space for a prompt cache, only 172 MB. How much do these take at most?
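A prompt (KV) cache scales with layer count, KV heads, head size, and context length. A rough sizing formula, with parameter values that are illustrative assumptions rather than Falcon-180B's actual configuration:

```python
# Rough KV-cache size estimate: 2 tensors (K and V), each of shape
# n_layers x n_kv_heads x head_dim x n_ctx, stored at bytes_per each.
# All values below are assumptions for illustration.
n_layers = 80
n_kv_heads = 8     # grouped/multi-query attention keeps this small
head_dim = 64
n_ctx = 2048
bytes_per = 2      # fp16
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per
print(f"{kv_bytes / 2**20:.0f} MiB")  # 320 MiB
```

So with these assumed numbers a full-context cache would exceed 172 MB, but a shorter context (or a more aggressively quantized cache) shrinks it proportionally.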
Fun: I'm going to bed, any ideas for a story?