Fun Experiment: Falcon 180B Q4_0 locally on a phone! #3113
BarfingLemurs
started this conversation in
Show and tell
Replies: 1 comment
-
It's not really doing anything too magical. Basically, you're going to be reading the whole model from the SSD once per token, so the bottleneck is mainly the speed of the storage. The good news is that reads alone shouldn't cause excessive wear on the flash, but it's not going to be fast. Don't try …
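The bottleneck described above can be sketched as simple arithmetic: if each token requires streaming roughly the whole model file from storage, the sustained read bandwidth caps tokens per second. The figures below are illustrative assumptions, not measurements from this thread.

```python
# Back-of-the-envelope: with mmap-backed weights that don't fit in RAM,
# each generated token re-reads roughly the whole model from storage,
# so read bandwidth bounds generation speed.
model_gb = 96.0    # assumed size of a Q4_0 ~180B model file, in GB
read_gbps = 1.0    # assumed sustained flash read speed, in GB/s
sec_per_token = model_gb / read_gbps
print(f"~{sec_per_token:.0f} s per token")  # ~96 s per token
```

With numbers in this ballpark, the estimate lands close to the ~93 s/token the original poster observed, which is consistent with storage being the limiting factor.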
-
Pixel 6 Pro (12GB)
(wanted to see t/s for streaming #3047 (comment))
Ran `termux-setup-storage`, then ran the model, resulting in generation at ~93 s per token.
My guess is this will produce ~390 tokens after 10 hours (36,000 s ÷ 93 s/token).
I would very much like to know more about mmap inference. This ability to run big models is incredible, and I had overlooked learning about it. I'm sure it's not one of the project's main goals, but it's a wonderful byproduct. Does having only 8 GB of RAM available hurt speed more than 12 GB?
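For anyone else curious how mmap inference works: the idea is that the model file is mapped into the process's address space without being read up front, and the OS pages data in from storage on first access and evicts pages under memory pressure. That is also why more RAM helps: more of the file stays cached between tokens. A minimal sketch in Python (llama.cpp does the equivalent in C/C++; the file name here is a stand-in, not a real model):

```python
import mmap
import os

# Create a small stand-in "model file" just for the demo.
path = "weights.bin"  # hypothetical file name
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of dummy data

with open(path, "rb") as f:
    # Map the whole file read-only; nothing is loaded into RAM yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching a byte triggers a page fault; the OS reads that page
    # from storage. Subsequent accesses hit the page cache until the
    # OS evicts the page to free memory.
    first = mm[0]
    mm.close()
os.remove(path)
```

The practical consequence for a 12 GB vs 8 GB phone: with less RAM, fewer pages survive in the cache between tokens, so a larger fraction of the model must be re-read from flash each step, and speed degrades toward the pure storage-bandwidth limit.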
What are the bottlenecks here?
I have little space for a prompt cache, only 172 MB. How much do these take at most?
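A prompt (KV) cache scales with layer count, KV heads, head size, and context length. A rough sizing formula, with parameter values that are illustrative assumptions rather than Falcon-180B's actual configuration:

```python
# Rough KV-cache size estimate: 2 tensors (K and V), each of shape
# n_layers x n_kv_heads x head_dim x n_ctx, stored at bytes_per each.
# All values below are assumptions for illustration.
n_layers = 80
n_kv_heads = 8     # grouped/multi-query attention keeps this small
head_dim = 64
n_ctx = 2048
bytes_per = 2      # fp16
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per
print(f"{kv_bytes / 2**20:.0f} MiB")  # 320 MiB
```

So with these assumed numbers a full-context cache would exceed 172 MB, but a shorter context (or a more aggressively quantized cache) shrinks it proportionally.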
Fun: I'm going to bed, any ideas for a story?