Performance of llama.cpp on Apple Silicon M-series #4167
Replies: 84 comments 165 replies
-
M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅
build: 8e672ef (1550)
-
M2 Max Studio, 8+4 CPU, 38 GPU ✅
build: 8e672ef (1550)
-
M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅
build: 8e672ef (1550)
-
M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅
build: 55978ce (1555) Note: results are mostly similar to those reported by @slaren, except for Q4_0.
-
In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? It seems like GPU cores have more effect on PP t/s.
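For context on the axes: at batch size 1, text generation streams every weight from memory once per token, so TG tends to track memory bandwidth, while prompt processing at bs = 512 batches large matmuls and tends to track GPU compute. A back-of-envelope sketch of the bandwidth ceiling, with illustrative numbers (the model size and bandwidths below are assumptions for illustration, not measurements from this thread):

```python
def tg_upper_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on TG t/s: each generated token reads all weights once,
    so tokens/s cannot exceed bandwidth divided by model size."""
    return bandwidth_gb_s / model_size_gb

# Illustrative: a 7B model at Q4_0 is roughly 3.6 GB of weights.
# Doubling bandwidth (e.g. 200 -> 400 GB/s) doubles the TG ceiling,
# independent of GPU core count.
print(tg_upper_bound(400.0, 3.6))  # ~111 t/s ceiling
print(tg_upper_bound(200.0, 3.6))  # ~55.6 t/s ceiling
```

Measured TG rates sit below these ceilings but scale with bandwidth in roughly this way; PP has no such weight-streaming bound, which is why it correlates with core count instead.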
-
How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.
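As a rough sizing aid: resident memory is approximately the quantized weights plus the KV cache, which grows linearly with context length. A hedged sketch using LLaMA-7B-like shapes (32 layers, 4096 embedding dim, F16 KV cache); the overhead figure is an assumption, and real usage also includes compute buffers:

```python
def kv_cache_gb(n_ctx: int, n_layers: int = 32, n_embd: int = 4096,
                bytes_per_elem: int = 2) -> float:
    """F16 KV cache size: 2 tensors (K and V) per layer, n_ctx x n_embd each."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem / 1e9

def fits_in_ram(model_gb: float, n_ctx: int, ram_gb: float,
                overhead_gb: float = 4.0) -> bool:
    """Leave headroom for the OS, compute buffers, and other apps."""
    return model_gb + kv_cache_gb(n_ctx) + overhead_gb <= ram_gb

# Illustrative: a ~3.6 GB Q4_0 7B model with 4096 context on a 16 GB machine
print(fits_in_ram(3.6, 4096, 16.0))  # True: ~3.6 + ~2.1 + 4.0 GB < 16 GB
```

With these shapes the KV cache is about 2.1 GB at 4096 context, so context length matters almost as much as the quantization level when deciding how much RAM to buy.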
-
M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅
build: e9c13ff (1560)
-
Would love to see how the M1 Max and M1 Ultra fare given their high memory bandwidth.
-
M2 Max (MBP 16), 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅
build: e9c13ff (1560)
-
M1 Max (MBP 16), 8+2 CPU, 32 GPU, 64 GB RAM (@CedricYauLBD) ✅
build: e9c13ff (1560) Note: M1 Max memory bandwidth is 400 GB/s
-
Look at what I started
-
M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅
build: e9c13ff (1560)
-
### M2 Max (MBP 16), 38-core GPU, 32 GB ✅
build: 795cd5a (1493)
-
Looking at the summary plot of "PP performance vs GPU cores", it seems the original unquantized F16 model always delivers higher PP performance than the quantized models.
-
... cross-posted to the Vulkan thread: Mac Pro 2013 🗑️, 12-core Xeon E5-2697 v2, dual FirePro D700, 64 GB RAM, macOS Monterey.

Note: I've updated this post. The first time I posted, I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and verified the models were producing correct output before re-benchmarking. When the system was spitting garbage it was running about 30% higher t/s across the board. Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null
build: d3bd719 (5092)

The FP16 model was throwing garbage, so I did not include it here; it will require some unique flags to run correctly. Additionally, here are the 8-bit and 4-bit Llama 2 7B runs on the CPU alone (using the -ngl 0 flag):

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null
build: d3bd719 (5092)
-
Just saying: shouldn't the OP be updated with the actual measured bandwidth numbers, rather than the marketing figures Apple gave to the press?
-
M3 Ultra (Mac Studio 2025), 24+8 CPU, 80 GPU, 512 GB RAM
build: 8e672ef (1550)
-
M1 (MacBook Air 2020), 8 CPU, 8 GPU, 16 GB RAM
build: 8e672ef (1550)
build: 3e0be1c (5410)
-
Finally got the results I was asking about here recently 😊 Though I had to purchase a Mac Studio with an M4 Max chip myself to achieve this.

M4 Max (Mac Studio 2024), 14 CPU, 32 GPU, 36 GB RAM
llama.cpp % ./llama-bench
build: 8e672ef (1550)

On new build:
build: b44890d (5440)
-
Old Intel Mac with AMD GPU is not dead yet! That old Mac with an Intel CPU and an AMD GPU might be showing its age, but it's far from useless and can still pack a punch today with the right tweaks.

export GGML_METAL_DEVICE_INDEX=1
./build/bin/llama-bench -ngl 99 -m ~/Models/llama-2-7b-q4_0.gguf
build: 79c1160 (6123 with metal3 patch)

export GGML_METAL_DEVICE_INDEX=2
build: 79c1160 (6123 with metal3 patch)

export GGML_METAL_DEVICE_INDEX=3
build: 79c1160 (6123 with metal3 patch)

The tweaks: https://gist.github.com/Basten7/f316fef96aac9a6614032a65c9825eaf
-
There's some speculation online that the new A19 chip has its NPU cores inside the GPU instead of as a separate module, which might indicate they can be used like tensor cores to help with prompt processing.
-
So, let's reignite the discussion: it looks like we have a good chance of a ~30% LLM performance increase.
-
M5 (base) benchmark: ~2.6x speedup. FA is slower for TG, but faster for prefill?

llama-bench, M5 without Neural Accelerator
M5 with Neural Accelerator
Also Metal 4 plus -fa 1
-
build: 8e672ef (1550)
-
We need a comparison between llama.cpp inference speed and PyTorch inference speed: a fair comparison. All of the quantization algorithms can be written in PyTorch too; it's just engineering work. But if someone shows that implementing inference in pure C++ has clear benefits and we can actually run inference faster (on CPU, GPU, or both), that is something people will invest in. We do not have any clear proof that llama.cpp is actually faster than a PyTorch implementation on either CPU or GPU. If we have that, please educate me.
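For whatever backend pair ends up being compared, the measurement itself should be identical: wall-clock tokens per second after a warmup pass, with prompt processing and generation timed separately. A backend-agnostic sketch (`generate_token` is a hypothetical stand-in for either engine, not a real API from llama.cpp or PyTorch):

```python
import time

def measure_tg(generate_token, n_tokens: int, warmup: int = 8) -> float:
    """Tokens per second for sequential generation, excluding warmup."""
    for _ in range(warmup):          # let caches / lazy init settle
        generate_token()
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stand-in "model" that takes at least 1 ms per token, so the
# measured rate is bounded above by 1000 t/s.
rate = measure_tg(lambda: time.sleep(0.001), n_tokens=50)
print(f"{rate:.0f} t/s")
```

The same harness run against both engines, with the same model, quantization, and context, would settle the question far better than cross-thread comparisons of unrelated benchmark runs.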
-
And you are a PhD too...
-
I came here trying to get a sense of the M5 speedup given the new Metal API in M5 chips. @ggerganov hasn't updated the post yet, but I found useful info in #16634
-
I'm holding out for a benchmark of the M5 Max, assuming there is a Mac Studio launching soon. I'd like to see some numbers comparing it to a Spark and a Strix Halo respectively. Anyone have $5K laying around for a new MacBook Pro... ;)
-
Hey all, I just stumbled across this excellent project. I created an open-source benchmarking app for Ollama, MLX, and any OpenAI-compatible endpoint, with public leaderboards, and open-sourced the dataset as well. If I can help contribute in some way, please let me know. The binary is certificate-signed and the full source is available. The dataset is small right now, but the datapoints are very robust; contributions are very welcome, as are stars to get the app on Homebrew as a cask. https://github.com/uncSoft/anubis-oss https://devpadapp.com/anubis-oss.html https://devpadapp.com/leaderboard.html https://devpadapp.com/explorer.html








-
Summary

LLaMA 7B

(benchmark table omitted: memory bandwidth [GB/s], GPU cores, and [t/s] columns; plotted via plot.py)
Description

This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful for comparing the performance that llama.cpp achieves across the M-series chips and will hopefully answer questions for people wondering whether they should upgrade. We are collecting info here just for Apple Silicon for simplicity. A similar collection for A-series chips is available here: #4508

If you are a collaborator on the project and have an Apple Silicon device, please add your device, results, and optionally your username for the following command directly into this post (requires LLaMA 7B v2).

PP means "prompt processing" (bs = 512), TG means "text generation" (bs = 1), and t/s means "tokens per second".

Note that in this benchmark we are evaluating performance against the same build 8e672ef (2023 Nov 21) in order to keep all performance factors even. Since then, there have been multiple improvements resulting in better absolute performance. As an example, here is how the same test compares over time on M2 Ultra:
(benchmark table omitted: [GB/s], Cores, and [t/s] columns)
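The per-device entries below were produced with llama-bench, which emits its results as a markdown table whose last two columns are the test name and a "mean ± stddev" t/s figure. A minimal sketch for extracting those numbers so results can be aggregated across machines (the sample table is illustrative, not a measurement from this thread):

```python
def parse_bench_table(text: str) -> list[dict]:
    """Extract (test, mean t/s) pairs from a llama-bench-style markdown table."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or set(cells[0]) <= {"-", ":", " "}:
            continue  # blank or separator row
        if cells[-1] == "t/s":
            continue  # header row
        mean = float(cells[-1].split("±")[0])  # "100.00 ± 0.50" -> 100.00
        rows.append({"test": cells[-2], "t/s": mean})
    return rows

sample = """| model | test | t/s |
| ----- | ---- | --- |
| llama 7B Q4_0 | pp 512 | 100.00 ± 0.50 |
| llama 7B Q4_0 | tg 128 | 20.00 ± 0.10 |"""
print(parse_bench_table(sample))
```

Real llama-bench output has more columns (size, params, backend, ngl), but the test name and t/s stay in the last two positions, which is all the sketch relies on.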
M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅
build: 8e672ef (1550)
M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅
build: d103d93 (1553)
Footnotes
https://en.wikipedia.org/wiki/Apple_M1#Variants
https://en.wikipedia.org/wiki/Apple_M2#Variants
https://en.wikipedia.org/wiki/Apple_M3#Variants
https://en.wikipedia.org/wiki/Apple_M4#Variants
https://en.wikipedia.org/wiki/Apple_M5#Variants