Performance of llama.cpp on Apple Silicon M-series #4167
Replies: 84 comments 165 replies
-
M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅
build: 8e672ef (1550)
-
M2 Max Studio, 8+4 CPU, 38 GPU ✅
build: 8e672ef (1550)
-
M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅
build: 8e672ef (1550)
-
M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅
build: 55978ce (1555) Note: results are mostly similar to those reported by @slaren, except for Q4_0.
-
In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? It seems like GPU cores have more effect on PP t/s.
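For context on the axes: at batch size 1, text generation streams every weight from memory once per token, so TG tends to track memory bandwidth, while prompt processing at bs = 512 batches large matmuls and tends to track GPU compute. A back-of-envelope sketch of the bandwidth ceiling, with illustrative numbers (the model size and bandwidths below are assumptions for illustration, not measurements from this thread):

```python
def tg_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on TG t/s: each generated token reads all weights once,
    so tokens/s cannot exceed bandwidth divided by model size."""
    return bandwidth_gb_s / model_size_gb

# Illustrative: a 7B model at Q4_0 is roughly 3.6 GB of weights.
# Doubling bandwidth (e.g. 200 -> 400 GB/s) doubles the TG ceiling,
# independent of GPU core count.
print(tg_upper_bound(400.0, 3.6))  # ~111 t/s ceiling
print(tg_upper_bound(200.0, 3.6))  # ~55.6 t/s ceiling
```

Measured TG rates sit below these ceilings but scale with bandwidth in roughly this way; PP has no such weight-streaming bound, which is why it correlates with core count instead.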
-
How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.
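As a rough sizing aid: resident memory is approximately the quantized weights plus the KV cache, which grows linearly with context length. A hedged sketch using LLaMA-7B-like shapes (32 layers, 4096 embedding dim, F16 KV cache); the overhead figure is an assumption, and real usage also includes compute buffers:

```python
def kv_cache_gb(n_ctx: int, n_layers: int = 32, n_embd: int = 4096,
                bytes_per_elem: int = 2) -> float:
    """F16 KV cache size: 2 tensors (K and V) per layer, n_ctx x n_embd each."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem / 1e9

def fits_in_ram(model_gb: float, n_ctx: int, ram_gb: float,
                overhead_gb: float = 4.0) -> bool:
    """Leave headroom for the OS, compute buffers, and other apps."""
    return model_gb + kv_cache_gb(n_ctx) + overhead_gb <= ram_gb

# Illustrative: a ~3.6 GB Q4_0 7B model with 4096 context on a 16 GB machine
print(fits_in_ram(3.6, 4096, 16.0))  # True: ~3.6 + ~2.1 + 4.0 GB < 16 GB
```

With these shapes the KV cache is about 2.1 GB at 4096 context, so context length matters almost as much as the quantization level when deciding how much RAM to buy.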
-
M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅
build: e9c13ff (1560)
-
Would love to see how the M1 Max and M1 Ultra fare given their high memory bandwidth.
-
M2 Max (MBP 16), 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅
build: e9c13ff (1560)
-
M1 Max (MBP 16), 8+2 CPU, 32 GPU, 64 GB RAM (@CedricYauLBD) ✅
build: e9c13ff (1560) Note: M1 Max memory bandwidth is 400 GB/s
-
Look at what I started
-
M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅
build: e9c13ff (1560)
-
### M2 Max (MBP 16), 38-core GPU, 32 GB ✅
build: 795cd5a (1493)
-
Looking at the summary plot of "PP performance vs GPU cores", it seems the original unquantized F16 model always delivers higher PP performance than the quantized models.
-
... cross-posted to the Vulkan thread: Mac Pro 2013 🗑️, 12-core Xeon E5-2697 v2, dual FirePro D700, 64 GB RAM, macOS Monterey.

Note: I've updated this post. The first time I posted, I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and verified the models were producing correct output before re-benchmarking. When the system was spitting garbage it was running about 30% higher t/s across the board. Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null
build: d3bd719 (5092)

The FP16 model was throwing garbage, so I did not include it here; it will require some unique flags to run correctly. Additionally, here are the 8-bit and 4-bit Llama 2 7B runs on the CPU alone (using the -ngl 0 flag):

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null
build: d3bd719 (5092)
-
Just saying: shouldn't the OP be updated with the actual measured bandwidth numbers, rather than the marketing figures Apple gave to the press?
-
M3 Ultra (Mac Studio 2025), 24+8 CPU, 80 GPU, 512 GB RAM
build: 8e672ef (1550)
-
M1 (MacBook Air 2020), 8 CPU, 8 GPU, 16 GB RAM
build: 8e672ef (1550)
build: 3e0be1c (5410)
-
Finally got the results I was asking about here recently 😊 Though I had to purchase a Mac Studio with an M4 Max chip myself to achieve this.

M4 Max (Mac Studio 2024), 14 CPU, 32 GPU, 36 GB RAM
llama.cpp % ./llama-bench
build: 8e672ef (1550)

On new build:
build: b44890d (5440)
-
Old Intel Mac with AMD GPU is not dead yet! That old Mac with an Intel CPU and an AMD GPU might be showing its age, but it's far from useless and can still pack a punch today with the right tweaks.

export GGML_METAL_DEVICE_INDEX=1
./build/bin/llama-bench -ngl 99 -m ~/Models/llama-2-7b-q4_0.gguf
build: 79c1160 (6123 with metal3 patch)

export GGML_METAL_DEVICE_INDEX=2
build: 79c1160 (6123 with metal3 patch)

export GGML_METAL_DEVICE_INDEX=3
build: 79c1160 (6123 with metal3 patch)

The tweaks: https://gist.github.com/Basten7/f316fef96aac9a6614032a65c9825eaf
-
There's some speculation online that the new A19 chip has its NPU cores inside the GPU instead of as a separate module, which might indicate they can be used like tensor cores to help with prompt processing.
-
So, let's reignite the discussion: it looks like we have a good chance of a ~30% LLM performance increase.
-
M5 (base) benchmark: ~2.6x speedup. FA is slower for TG, but faster for prefill?

llama-bench, M5 without Neural Accelerator
M5 with Neural Accelerator
Also Metal 4 plus -fa 1
-
build: 8e672ef (1550)
-
We need a comparison between llama.cpp inference speed and PyTorch inference speed: a fair comparison. All of the quantization algorithms can be written in PyTorch too; it's just engineering work. But if someone shows that implementing inference in pure C++ has clear benefits and we can actually run inference faster (on CPU, GPU, or both), that is something people will invest in. We do not have any clear proof that llama.cpp is actually faster than a PyTorch implementation on either CPU or GPU. If we have that, please educate me.
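For whatever backend pair ends up being compared, the measurement itself should be identical: wall-clock tokens per second after a warmup pass, with prompt processing and generation timed separately. A backend-agnostic sketch (`generate_token` is a hypothetical stand-in for either engine, not a real API from llama.cpp or PyTorch):

```python
import time

def measure_tg(generate_token, n_tokens: int, warmup: int = 8) -> float:
    """Tokens per second for sequential generation, excluding warmup."""
    for _ in range(warmup):          # let caches / lazy init settle
        generate_token()
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stand-in "model" that takes at least 1 ms per token, so the
# measured rate is bounded above by 1000 t/s.
rate = measure_tg(lambda: time.sleep(0.001), n_tokens=50)
print(f"{rate:.0f} t/s")
```

The same harness run against both engines, with the same model, quantization, and context, would settle the question far better than cross-thread comparisons of unrelated benchmark runs.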
-
And you are a PhD too...
-
I came here trying to get a sense of the M5 speedup given the new Metal API in M5 chips. @ggerganov hasn't updated the post yet, but I found useful info in #16634
-
I'm holding out for a benchmark of the M5 Max, assuming there is a Mac Studio launching soon. I'd like to see some numbers comparing it to a Spark and a Strix Halo respectively. Anyone have $5K laying around for a new MacBook Pro... ;)
-
Hey all, I just stumbled across this excellent project. I created an open-source benchmarking app for Ollama, MLX, and any OpenAI-compatible endpoint, with public leaderboards, and open-sourced the dataset as well. If I can help contribute in some way, please let me know. The binary is certificate-signed and the full source is available. The dataset is small right now, but the datapoints are very robust; contributions are very welcome, as are stars to get the app on Homebrew as a cask. https://github.com/uncSoft/anubis-oss https://devpadapp.com/anubis-oss.html https://devpadapp.com/leaderboard.html https://devpadapp.com/explorer.html








-
Summary

LLaMA 7B

(benchmark table omitted: memory bandwidth [GB/s], GPU cores, and [t/s] columns; plotted via plot.py)
Description

This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful for comparing the performance that llama.cpp achieves across the M-series chips and will hopefully answer questions for people wondering whether they should upgrade. We are collecting info here just for Apple Silicon for simplicity. A similar collection for A-series chips is available here: #4508

If you are a collaborator on the project and have an Apple Silicon device, please add your device, results, and optionally your username for the following command directly into this post (requires LLaMA 7B v2).

PP means "prompt processing" (bs = 512), TG means "text generation" (bs = 1), and t/s means "tokens per second".

Note that in this benchmark we are evaluating performance against the same build 8e672ef (2023 Nov 21) in order to keep all performance factors even. Since then, there have been multiple improvements resulting in better absolute performance. As an example, here is how the same test compares over time on M2 Ultra:
(benchmark table omitted: [GB/s], Cores, and [t/s] columns)
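The per-device entries below were produced with llama-bench, which emits its results as a markdown table whose last two columns are the test name and a "mean ± stddev" t/s figure. A minimal sketch for extracting those numbers so results can be aggregated across machines (the sample table is illustrative, not a measurement from this thread):

```python
def parse_bench_table(text: str) -> list[dict]:
    """Extract (test, mean t/s) pairs from a llama-bench-style markdown table."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or set(cells[0]) <= {"-", ":", " "}:
            continue  # blank or separator row
        if cells[-1] == "t/s":
            continue  # header row
        mean = float(cells[-1].split("±")[0])  # "100.00 ± 0.50" -> 100.00
        rows.append({"test": cells[-2], "t/s": mean})
    return rows

sample = """| model | test | t/s |
| ----- | ---- | --- |
| llama 7B Q4_0 | pp 512 | 100.00 ± 0.50 |
| llama 7B Q4_0 | tg 128 | 20.00 ± 0.10 |"""
print(parse_bench_table(sample))
```

Real llama-bench output has more columns (size, params, backend, ngl), but the test name and t/s stay in the last two positions, which is all the sketch relies on.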
M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅
build: d103d93 (1553)
Footnotes
https://en.wikipedia.org/wiki/Apple_M1#Variants
https://en.wikipedia.org/wiki/Apple_M2#Variants
https://en.wikipedia.org/wiki/Apple_M3#Variants
https://en.wikipedia.org/wiki/Apple_M4#Variants
https://en.wikipedia.org/wiki/Apple_M5#Variants