more perfo with llamafile tinyblas on x86_64. #10714
Conversation
Force-pushed from f7c5a68 to b1c72b9
Some perplexity results with the new code (vs master, BF16/Zen 3):
Looks good to me.
Not sure; try merging the current
Force-pushed from b1c72b9 to d4a2a20
Looks like there is a small difference in the results.
@pytest.mark.parametrize("n_slots", [1, 2])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots

What is n_slots? I have to check a few things in my code tomorrow...
I am not sure what's the effect of increasing the number of slots for this test. I suspect that this error might indicate there is a buffer overflow somewhere, and random data beyond the tensor buffer may be causing it to generate different sequences despite using the same seed.
That's what I was thinking last night, but it was too late. I have an idea, but I was too tired to check/fix it.
The failing test seems to be using 2 slots. With 2 slots, the KV cache buffer is shared among the two generations. Initially, the buffer is empty:
Then the first request is processed by slot 0 and thus the beginning of the buffer is occupied:
The second request is processed on slot 1, so the old data remains in the buffer:
Because we compute the attention over the entire buffer by masking out the cross-sequence values, it is actually possible to get different results between the 2 generations. This happens due to summing floating-point values across the length of the KV buffer. In the next example, even though the data in the buffer is the same, it can lead to different numerical results during the
I'm thinking that maybe there isn't a bug in the implementation in this PR, and that it's a side-effect of the unified KV cache. Probably this test for
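As a side note, here is a minimal, self-contained example of the floating-point effect described above (my own illustration, not code from this PR or from ggml): float addition is not associative, so summing the same values with a different grouping, as happens when the reduction runs over a longer, partially masked KV buffer, can give slightly different results.

#include <cstdio>

int main() {
    // Same three values, two different groupings of the additions.
    float a = 1e8f, b = -1e8f, c = 1.0f;

    float sum1 = (a + b) + c;  // a and b cancel exactly, then c is added -> 1.0
    float sum2 = a + (b + c);  // c is rounded away when added to b       -> 0.0

    std::printf("sum1 = %.1f, sum2 = %.1f\n", sum1, sum2);
    return 0;
}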
Force-pushed from d4a2a20 to 6c398db
@ggerganov On the other hand, going over my code step by step, there are a small number of cases (2 to ~5?) where I do too much computation and write a value outside the intended range (possibly overwriting correct data I had just computed...). So I corrected that. It remains to be seen whether the test passes.
Well, that wasn't enough. I'm doing another perplexity pass to be sure of my latest correction.
Force-pushed from 6c398db to 01ba9f5
I have been running random tests with test-backend-ops
and I haven't seen any failure, so I am fairly confident that this is correct. Let's just disable the server test for 2 slots.
Not sure how to do it. Replace

@pytest.mark.parametrize("n_slots", [1, 2])

with this?

@pytest.mark.parametrize("n_slots", [1])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots
    server.start()
    last_res = None
    for _ in range(4):
        res = server.make_request("POST", "/completion", data={
            "prompt": "I believe the meaning of life is",
            "seed": 42,
            "temperature": 1.0,
            "cache_prompt": False,  # TODO: remove this once test_cache_vs_nocache_prompt is fixed
        })
        if last_res is not None:
            assert res.body["content"] == last_res.body["content"]
        last_res = res
A different test is failing now. Add:

--- a/examples/server/tests/unit/test_completion.py
+++ b/examples/server/tests/unit/test_completion.py
@@ -116,6 +116,7 @@ def test_different_result_different_seed(n_slots: int):
 def test_consistent_result_different_batch_size(n_batch: int, temperature: float):
     global server
     server.n_batch = n_batch
+    server.n_slots = 1
     server.start()
     last_res = None
     for _ in range(4):
On my system (Intel 13900K) I see better performance with BF16, but worse with F16 in some cases:
With different numbers of threads:
Is it AVX512 or AVX2?
AVX2
This is the "bad" intel CPU: that disable avx512 on effichent core
look faster on perf core slower on other... can you bench with that: #elif (defined(__AVX__) || defined(__AVX2__)) && defined(__F16C__)
// do not convert B to FP16
if (Btype == GGML_TYPE_F32) {
tinyBLAS<8, __m256, __m256, ggml_fp16_t, float, float> tb{ k,
(const ggml_fp16_t *)A, lda,
(const float *)B, ldb,
(float *)C, ldc,
ith, nth};
return tb.matmul(m, n);
} and may be with the BF16 too... |
Not sure where to change that. Can you show me the diff of this change (change it locally and run
tb.matmul(m, n);
return true;
if (Btype == GGML_TYPE_F16) {
    tinyBLAS<8, __m256, __m256, ggml_fp16_t, ggml_fp16_t, float> tb{ k,
#elif (defined(__AVX__) || defined(__AVX2__)) && defined(__F16C__)
    // do not convert B to FP16, as was done before...
    if (Btype == GGML_TYPE_F32) {
        tinyBLAS<8, __m256, __m256, ggml_fp16_t, float, float> tb{ k,
            (const ggml_fp16_t *)A, lda,
            (const float *)B, ldb,
            (float *)C, ldc,
            ith, nth};
        return tb.matmul(m, n);
    }
I will try the same on my Zen 3...
This is definitely faster, but still slower than master in some cases.
Model | Threads | Test | t/s master | t/s perfo/tinyblas | Speedup |
---|---|---|---|---|---|
llama 7B F16 | 8 | pp64 | 52.68 | 61.85 | 1.17 |
llama 7B F16 | 16 | pp64 | 42.89 | 37.64 | 0.88 |
llama 7B F16 | 24 | pp64 | 63.02 | 56.05 | 0.89 |
llama 7B F16 | 32 | pp64 | 77.93 | 68.80 | 0.88 |
With these heterogeneous CPUs, the dispatch would need to be refined. For the moment, the work is distributed evenly across the cores, so there is a good chance that we end up waiting for the E-cores...
My guess is that the tinyblas implementation doesn't play very well with the E-cores or multi-threading. The ggml implementation has dynamic chunking so that the faster threads get more work, but I don't think that this is implemented in tinyblas.
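For context, here is a minimal sketch of the dynamic chunking idea (my own illustration, not the actual ggml or tinyblas code): threads repeatedly claim the next chunk of work from a shared atomic counter, so faster P-cores naturally end up processing more chunks than slower E-cores.

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Each thread claims fixed-size chunks of rows from a shared counter until
// all rows are done; faster threads simply end up claiming more chunks.
void parallel_rows(int n_rows, int n_threads) {
    const int chunk_size = 16;          // rows per chunk (illustrative value)
    std::atomic<int> next_row{0};       // start of the next unclaimed chunk

    auto worker = [&]() {
        for (;;) {
            const int start = next_row.fetch_add(chunk_size);
            if (start >= n_rows) break;
            const int end = std::min(start + chunk_size, n_rows);
            for (int row = start; row < end; ++row) {
                // ... process one output row here ...
            }
        }
    };

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) workers.emplace_back(worker);
    for (auto & w : workers) w.join();
}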
My RAM is 3600 vs 2133; that may explain the difference in TG...
The CPU is the same.
I have a 4x32GB configuration, interleaved.
Do you have 4x16GB? With what configuration?
g++ --version
g++ (GCC) 14.2.1 20240912 (Red Hat 14.2.1-3)
OS: fedora 41 vs Ubuntu 11.4
- Mistral-7B: it looks like we have the same behavior with it, so it's not due to the model.
Model | Test | t/s master | t/s PR | Speedup | PC |
---|---|---|---|---|---|
llama 7B F16 | pp120 | 55.95 | 84.83 | 1.52 | djip007 |
llama 7B F16 | pp120 | 70.64 | 44.11 | 0.62 | ggerganov |
For now I don't understand what's happening...
Something with the memory config?
Memory Device
Array Handle: 0x0032
Error Information Handle: 0x003F
Total Width: 64 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM_B1
Bank Locator: BANK 2
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3600 MT/s
Manufacturer: CRUCIAL
Serial Number: -----------
Asset Tag: Not Specified
Part Number: BL32G36C16U4B.M16FB1
Rank: 2
Configured Memory Speed: 3600 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 6, Hex 0x9B
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5950X 16-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 65%
CPU max MHz: 5084,0000
CPU min MHz: 550,0000
BogoMIPS: 6800,74
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdp
e1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma c
x16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowpr
efetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs i
bpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv
1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_loc
k nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku o
spke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 8 MiB (16 instances)
L3: 64 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
memory addressing:
Handle 0x003B, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x003A
Memory Array Mapped Address Handle: 0x0034
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
Handle 0x003E, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x003D
Memory Array Mapped Address Handle: 0x0034
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
Handle 0x0041, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0040
Memory Array Mapped Address Handle: 0x0034
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
Handle 0x0044, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 128 GB
Physical Device Handle: 0x0043
Memory Array Mapped Address Handle: 0x0034
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
You have Corsair Vengeance RGB RS 16 GB DDR4 3200 MHz CL16 RAM and clock it at 2133 MHz?
> My RAM is 3600 vs 2133; that may explain the difference in TG...

Yes, probably that's it.

> I have a 4x32GB configuration, interleaved. Do you have 4x16GB? With what configuration?

Yes, these are 4x16GB. I don't know whether they are interleaved or how to check. Here is the full dmidecode output, if it helps:
$ sudo dmidecode -t memory
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
Handle 0x000A, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 128 GB
Error Information Handle: 0x0009
Number Of Devices: 4
Handle 0x0012, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x000A
Error Information Handle: 0x0011
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMG16GX4M1E3200C16
Rank: 1
Configured Memory Speed: 2133 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 3, Hex 0x9E
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 16 GB
Cache Size: None
Logical Size: None
Handle 0x0015, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x000A
Error Information Handle: 0x0014
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMG16GX4M1E3200C16
Rank: 1
Configured Memory Speed: 2133 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 3, Hex 0x9E
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 16 GB
Cache Size: None
Logical Size: None
Handle 0x0018, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x000A
Error Information Handle: 0x0017
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMG16GX4M1E3200C16
Rank: 1
Configured Memory Speed: 2133 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 3, Hex 0x9E
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 16 GB
Cache Size: None
Logical Size: None
Handle 0x001B, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x000A
Error Information Handle: 0x001A
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2133 MT/s
Manufacturer: Unknown
Serial Number: 00000000
Asset Tag: Not Specified
Part Number: CMG16GX4M1E3200C16
Rank: 1
Configured Memory Speed: 2133 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 3, Hex 0x9E
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 16 GB
Cache Size: None
Logical Size: None
$ sudo dmidecode -t 20
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
Handle 0x0013, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Device Handle: 0x0012
Memory Array Mapped Address Handle: 0x000C
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
Handle 0x0016, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Device Handle: 0x0015
Memory Array Mapped Address Handle: 0x000C
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
Handle 0x0019, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Device Handle: 0x0018
Memory Array Mapped Address Handle: 0x000C
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
Handle 0x001C, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Device Handle: 0x001B
Memory Array Mapped Address Handle: 0x000C
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5950X 16-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU max MHz: 5083,3979
CPU min MHz: 2200,0000
BogoMIPS: 6787.87
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr
8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm
_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 8 MiB (16 instances)
L3: 64 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
> You have Corsair Vengeance RGB RS 16 GB DDR4 3200 MHz CL16 RAM and clock it at 2133 MHz?
Hm, it's possible. It's an old machine that I currently keep remotely, and I remember that at some point in the past I was adjusting some settings in the BIOS with the main goal of reducing CPU fan noise. So it's entirely possible that I have underclocked the memory by accident.
Anyway, I am OK to ignore this data point for now, since it is very likely a misconfiguration on my side. Next time I have access to the machine, I will check what the BIOS settings are. But no need to worry about this for now.
Here are the tests you asked for.
Model | Threads | Test | t/s master | t/s perfo/tinyblas | Speedup |
---|---|---|---|---|---|
llama 7B BF16 | 8 | pp1 | 6.54 | 6.85 | 1.05 |
llama 7B BF16 | 8 | pp2 | 11.15 | 14.50 | 1.30 |
llama 7B BF16 | 8 | pp3 | 14.71 | 21.46 | 1.46 |
llama 7B BF16 | 8 | pp4 | 17.29 | 25.63 | 1.48 |
llama 7B BF16 | 8 | pp5 | 19.17 | 32.46 | 1.69 |
llama 7B BF16 | 8 | pp6 | 20.78 | 37.58 | 1.81 |
llama 7B BF16 | 8 | pp7 | 22.28 | 40.30 | 1.81 |
llama 7B BF16 | 8 | pp8 | 23.07 | 44.56 | 1.93 |
llama 7B BF16 | 8 | pp9 | 22.69 | 47.97 | 2.11 |
llama 7B BF16 | 8 | pp10 | 22.85 | 50.19 | 2.20 |
llama 7B BF16 | 8 | pp11 | 23.28 | 52.97 | 2.28 |
llama 7B BF16 | 8 | pp12 | 23.64 | 55.29 | 2.34 |
llama 7B BF16 | 8 | pp13 | 23.93 | 55.26 | 2.31 |
llama 7B BF16 | 8 | pp14 | 17.59 | 55.09 | 3.13 |
llama 7B BF16 | 8 | pp15 | 23.55 | 57.91 | 2.46 |
llama 7B BF16 | 8 | pp16 | 24.62 | 57.64 | 2.34 |
llama 7B BF16 | 8 | pp30 | 26.04 | 63.60 | 2.44 |
llama 7B BF16 | 8 | pp31 | 25.55 | 63.00 | 2.47 |
llama 7B BF16 | 8 | pp32 | 25.07 | 60.48 | 2.41 |
llama 7B BF16 | 8 | pp64 | 25.38 | 63.79 | 2.51 |
llama 7B BF16 | 8 | pp65 | 24.97 | 66.64 | 2.67 |
llama 7B BF16 | 8 | pp66 | 25.25 | 68.75 | 2.72 |
llama 7B BF16 | 8 | pp120 | 23.24 | 67.85 | 2.92 |
llama 7B BF16 | 8 | pp128 | 25.64 | 66.69 | 2.60 |
llama 7B BF16 | 8 | pp130 | 23.48 | 54.18 | 2.31 |
llama 7B BF16 | 8 | pp240 | 24.38 | 68.85 | 2.82 |
llama 7B BF16 | 8 | pp255 | 24.51 | 66.71 | 2.72 |
llama 7B BF16 | 8 | pp256 | 23.19 | 66.83 | 2.88 |
llama 7B BF16 | 8 | pp510 | 22.54 | 60.81 | 2.70 |
llama 7B BF16 | 8 | pp512 | 23.94 | 60.61 | 2.53 |
llama 7B BF16 | 8 | pp1023 | 23.68 | 59.02 | 2.49 |
llama 7B BF16 | 8 | pp1024 | 23.78 | 59.15 | 2.49 |
llama 7B BF16 | 8 | pp1025 | 23.73 | 58.89 | 2.48 |
llama 7B BF16 | 8 | pp2048 | 23.29 | 58.20 | 2.50 |
llama 7B BF16 | 8 | tg128 | 6.69 | 6.42 | 0.96 |
llama 7B BF16 | 16 | pp1 | 6.89 | 6.90 | 1.00 |
llama 7B BF16 | 16 | pp2 | 13.49 | 14.19 | 1.05 |
llama 7B BF16 | 16 | pp3 | 18.20 | 20.99 | 1.15 |
llama 7B BF16 | 16 | pp4 | 21.92 | 27.21 | 1.24 |
llama 7B BF16 | 16 | pp5 | 24.63 | 34.20 | 1.39 |
llama 7B BF16 | 16 | pp6 | 26.45 | 39.95 | 1.51 |
llama 7B BF16 | 16 | pp7 | 26.87 | 45.12 | 1.68 |
llama 7B BF16 | 16 | pp8 | 20.01 | 49.56 | 2.48 |
llama 7B BF16 | 16 | pp9 | 28.62 | 54.41 | 1.90 |
llama 7B BF16 | 16 | pp10 | 30.72 | 57.11 | 1.86 |
llama 7B BF16 | 16 | pp11 | 31.19 | 62.05 | 1.99 |
llama 7B BF16 | 16 | pp12 | 31.64 | 65.01 | 2.05 |
llama 7B BF16 | 16 | pp13 | 33.39 | 66.48 | 1.99 |
llama 7B BF16 | 16 | pp14 | 33.71 | 69.13 | 2.05 |
llama 7B BF16 | 16 | pp15 | 34.12 | 71.64 | 2.10 |
llama 7B BF16 | 16 | pp16 | 33.81 | 71.72 | 2.12 |
llama 7B BF16 | 16 | pp30 | 34.43 | 81.93 | 2.38 |
llama 7B BF16 | 16 | pp31 | 34.60 | 80.99 | 2.34 |
llama 7B BF16 | 16 | pp32 | 34.67 | 81.60 | 2.35 |
llama 7B BF16 | 16 | pp64 | 35.44 | 85.54 | 2.41 |
llama 7B BF16 | 16 | pp65 | 34.26 | 85.67 | 2.50 |
llama 7B BF16 | 16 | pp66 | 36.01 | 86.00 | 2.39 |
llama 7B BF16 | 16 | pp120 | 36.28 | 86.63 | 2.39 |
llama 7B BF16 | 16 | pp128 | 32.27 | 87.02 | 2.70 |
llama 7B BF16 | 16 | pp130 | 35.74 | 86.62 | 2.42 |
llama 7B BF16 | 16 | pp240 | 33.41 | 75.09 | 2.25 |
llama 7B BF16 | 16 | pp255 | 33.79 | 86.10 | 2.55 |
llama 7B BF16 | 16 | pp256 | 33.65 | 86.45 | 2.57 |
llama 7B BF16 | 16 | pp510 | 32.94 | 82.24 | 2.50 |
llama 7B BF16 | 16 | pp512 | 34.21 | 75.92 | 2.22 |
llama 7B BF16 | 16 | pp1023 | 33.01 | 76.81 | 2.33 |
llama 7B BF16 | 16 | pp1024 | 32.96 | 77.04 | 2.34 |
llama 7B BF16 | 16 | pp1025 | 33.54 | 72.84 | 2.17 |
llama 7B BF16 | 16 | pp2048 | 32.17 | 72.21 | 2.24 |
llama 7B BF16 | 16 | tg128 | 6.64 | 6.49 | 0.98 |
llama 7B BF16 | 24 | pp1 | 6.45 | 6.46 | 1.00 |
llama 7B BF16 | 24 | pp2 | 13.87 | 14.12 | 1.02 |
llama 7B BF16 | 24 | pp3 | 19.54 | 21.26 | 1.09 |
llama 7B BF16 | 24 | pp4 | 23.84 | 27.43 | 1.15 |
llama 7B BF16 | 24 | pp5 | 27.16 | 34.66 | 1.28 |
llama 7B BF16 | 24 | pp6 | 29.39 | 40.81 | 1.39 |
llama 7B BF16 | 24 | pp7 | 31.82 | 46.53 | 1.46 |
llama 7B BF16 | 24 | pp8 | 32.90 | 52.14 | 1.58 |
llama 7B BF16 | 24 | pp9 | 34.22 | 57.32 | 1.67 |
llama 7B BF16 | 24 | pp10 | 34.64 | 61.83 | 1.78 |
llama 7B BF16 | 24 | pp11 | 35.80 | 67.03 | 1.87 |
llama 7B BF16 | 24 | pp12 | 35.61 | 70.34 | 1.98 |
llama 7B BF16 | 24 | pp13 | 26.88 | 72.89 | 2.71 |
llama 7B BF16 | 24 | pp14 | 36.83 | 76.42 | 2.08 |
llama 7B BF16 | 24 | pp15 | 36.29 | 83.74 | 2.31 |
llama 7B BF16 | 24 | pp16 | 37.80 | 84.55 | 2.24 |
llama 7B BF16 | 24 | pp30 | 39.96 | 96.48 | 2.41 |
llama 7B BF16 | 24 | pp31 | 40.38 | 97.68 | 2.42 |
llama 7B BF16 | 24 | pp32 | 40.44 | 98.66 | 2.44 |
llama 7B BF16 | 24 | pp64 | 42.22 | 98.08 | 2.32 |
llama 7B BF16 | 24 | pp65 | 41.12 | 95.11 | 2.31 |
llama 7B BF16 | 24 | pp66 | 33.80 | 95.15 | 2.82 |
llama 7B BF16 | 24 | pp120 | 41.40 | 74.86 | 1.81 |
llama 7B BF16 | 24 | pp128 | 41.64 | 97.67 | 2.35 |
llama 7B BF16 | 24 | pp130 | 35.86 | 99.80 | 2.78 |
llama 7B BF16 | 24 | pp240 | 41.78 | 96.56 | 2.31 |
llama 7B BF16 | 24 | pp255 | 41.48 | 81.96 | 1.98 |
llama 7B BF16 | 24 | pp256 | 38.66 | 98.73 | 2.55 |
llama 7B BF16 | 24 | pp510 | 39.75 | 85.03 | 2.14 |
llama 7B BF16 | 24 | pp512 | 39.44 | 94.90 | 2.41 |
llama 7B BF16 | 24 | pp1023 | 39.18 | 88.19 | 2.25 |
llama 7B BF16 | 24 | pp1024 | 38.70 | 87.89 | 2.27 |
llama 7B BF16 | 24 | pp1025 | 38.99 | 85.98 | 2.20 |
llama 7B BF16 | 24 | pp2048 | 37.26 | 83.63 | 2.24 |
llama 7B BF16 | 24 | tg128 | 6.12 | 6.14 | 1.00 |
llama 7B BF16 | 32 | pp1 | 6.61 | 6.49 | 0.98 |
llama 7B BF16 | 32 | pp2 | 13.41 | 12.63 | 0.94 |
llama 7B BF16 | 32 | pp3 | 16.67 | 19.67 | 1.18 |
llama 7B BF16 | 32 | pp4 | 24.53 | 26.30 | 1.07 |
llama 7B BF16 | 32 | pp5 | 25.11 | 30.75 | 1.22 |
llama 7B BF16 | 32 | pp6 | 30.08 | 35.58 | 1.18 |
llama 7B BF16 | 32 | pp7 | 32.79 | 44.60 | 1.36 |
llama 7B BF16 | 32 | pp8 | 33.60 | 47.41 | 1.41 |
llama 7B BF16 | 32 | pp9 | 35.68 | 49.91 | 1.40 |
llama 7B BF16 | 32 | pp10 | 36.61 | 61.07 | 1.67 |
llama 7B BF16 | 32 | pp11 | 34.87 | 60.74 | 1.74 |
llama 7B BF16 | 32 | pp12 | 38.80 | 58.61 | 1.51 |
llama 7B BF16 | 32 | pp13 | 38.92 | 67.56 | 1.74 |
llama 7B BF16 | 32 | pp14 | 37.30 | 71.50 | 1.92 |
llama 7B BF16 | 32 | pp15 | 38.23 | 52.93 | 1.38 |
llama 7B BF16 | 32 | pp16 | 39.77 | 77.02 | 1.94 |
llama 7B BF16 | 32 | pp30 | 42.88 | 93.16 | 2.17 |
llama 7B BF16 | 32 | pp31 | 41.70 | 92.21 | 2.21 |
llama 7B BF16 | 32 | pp32 | 42.97 | 96.90 | 2.26 |
llama 7B BF16 | 32 | pp64 | 45.44 | 106.63 | 2.35 |
llama 7B BF16 | 32 | pp65 | 44.98 | 105.98 | 2.36 |
llama 7B BF16 | 32 | pp66 | 45.10 | 105.50 | 2.34 |
llama 7B BF16 | 32 | pp120 | 46.50 | 104.57 | 2.25 |
llama 7B BF16 | 32 | pp128 | 45.95 | 104.48 | 2.27 |
llama 7B BF16 | 32 | pp130 | 46.34 | 101.40 | 2.19 |
llama 7B BF16 | 32 | pp240 | 43.26 | 87.72 | 2.03 |
llama 7B BF16 | 32 | pp255 | 46.97 | 109.11 | 2.32 |
llama 7B BF16 | 32 | pp256 | 46.94 | 104.42 | 2.22 |
llama 7B BF16 | 32 | pp510 | 44.52 | 92.37 | 2.07 |
llama 7B BF16 | 32 | pp512 | 44.46 | 92.70 | 2.08 |
llama 7B BF16 | 32 | pp1023 | 43.70 | 93.70 | 2.14 |
llama 7B BF16 | 32 | pp1024 | 43.72 | 94.72 | 2.17 |
llama 7B BF16 | 32 | pp1025 | 43.49 | 93.42 | 2.15 |
llama 7B BF16 | 32 | pp2048 | 42.64 | 90.61 | 2.12 |
llama 7B BF16 | 32 | tg128 | 5.84 | 5.89 | 1.01 |
Model | Threads | Test | t/s master | t/s perfo/tinyblas | Speedup |
---|---|---|---|---|---|
llama 7B F16 | 8 | pp1 | 6.38 | 7.03 | 1.10 |
llama 7B F16 | 8 | pp2 | 12.49 | 14.03 | 1.12 |
llama 7B F16 | 8 | pp3 | 19.06 | 20.65 | 1.08 |
llama 7B F16 | 8 | pp4 | 12.97 | 25.21 | 1.94 |
llama 7B F16 | 8 | pp5 | 18.14 | 31.46 | 1.73 |
llama 7B F16 | 8 | pp6 | 28.90 | 36.29 | 1.26 |
llama 7B F16 | 8 | pp7 | 20.51 | 39.36 | 1.92 |
llama 7B F16 | 8 | pp8 | 27.07 | 43.70 | 1.61 |
llama 7B F16 | 8 | pp9 | 39.64 | 46.82 | 1.18 |
llama 7B F16 | 8 | pp10 | 30.42 | 49.03 | 1.61 |
llama 7B F16 | 8 | pp11 | 23.51 | 51.16 | 2.18 |
llama 7B F16 | 8 | pp12 | 44.81 | 53.55 | 1.20 |
llama 7B F16 | 8 | pp13 | 35.45 | 53.97 | 1.52 |
llama 7B F16 | 8 | pp14 | 37.40 | 55.91 | 1.50 |
llama 7B F16 | 8 | pp15 | 47.63 | 57.52 | 1.21 |
llama 7B F16 | 8 | pp16 | 38.27 | 56.69 | 1.48 |
llama 7B F16 | 8 | pp30 | 55.66 | 62.84 | 1.13 |
llama 7B F16 | 8 | pp31 | 46.17 | 61.79 | 1.34 |
llama 7B F16 | 8 | pp32 | 46.22 | 62.78 | 1.36 |
llama 7B F16 | 8 | pp64 | 51.43 | 65.10 | 1.27 |
llama 7B F16 | 8 | pp65 | 51.57 | 65.04 | 1.26 |
llama 7B F16 | 8 | pp66 | 55.29 | 50.84 | 0.92 |
llama 7B F16 | 8 | pp120 | 56.77 | 65.83 | 1.16 |
llama 7B F16 | 8 | pp128 | 55.29 | 65.64 | 1.19 |
llama 7B F16 | 8 | pp130 | 54.22 | 66.17 | 1.22 |
llama 7B F16 | 8 | pp240 | 51.06 | 58.46 | 1.14 |
llama 7B F16 | 8 | pp255 | 56.89 | 66.11 | 1.16 |
llama 7B F16 | 8 | pp256 | 56.60 | 58.38 | 1.03 |
llama 7B F16 | 8 | pp510 | 49.56 | 59.74 | 1.21 |
llama 7B F16 | 8 | pp512 | 48.86 | 59.75 | 1.22 |
llama 7B F16 | 8 | pp1023 | 50.73 | 58.39 | 1.15 |
llama 7B F16 | 8 | pp1024 | 46.55 | 58.51 | 1.26 |
llama 7B F16 | 8 | pp1025 | 50.70 | 58.01 | 1.14 |
llama 7B F16 | 8 | pp2048 | 48.65 | 56.76 | 1.17 |
llama 7B F16 | 8 | tg128 | 6.57 | 6.84 | 1.04 |
llama 7B F16 | 16 | pp1 | 7.36 | 7.04 | 0.96 |
llama 7B F16 | 16 | pp2 | 11.80 | 13.98 | 1.18 |
llama 7B F16 | 16 | pp3 | 16.18 | 20.90 | 1.29 |
llama 7B F16 | 16 | pp4 | 14.60 | 26.77 | 1.83 |
llama 7B F16 | 16 | pp5 | 17.95 | 32.82 | 1.83 |
llama 7B F16 | 16 | pp6 | 29.38 | 38.84 | 1.32 |
llama 7B F16 | 16 | pp7 | 24.25 | 42.92 | 1.77 |
llama 7B F16 | 16 | pp8 | 26.31 | 47.11 | 1.79 |
llama 7B F16 | 16 | pp9 | 36.26 | 51.87 | 1.43 |
llama 7B F16 | 16 | pp10 | 30.21 | 53.56 | 1.77 |
llama 7B F16 | 16 | pp11 | 31.95 | 57.11 | 1.79 |
llama 7B F16 | 16 | pp12 | 26.38 | 60.18 | 2.28 |
llama 7B F16 | 16 | pp13 | 31.82 | 60.02 | 1.89 |
llama 7B F16 | 16 | pp14 | 35.95 | 63.33 | 1.76 |
llama 7B F16 | 16 | pp15 | 41.48 | 65.03 | 1.57 |
llama 7B F16 | 16 | pp16 | 36.87 | 65.43 | 1.77 |
llama 7B F16 | 16 | pp30 | 46.23 | 74.36 | 1.61 |
llama 7B F16 | 16 | pp31 | 43.34 | 73.61 | 1.70 |
llama 7B F16 | 16 | pp32 | 44.02 | 74.51 | 1.69 |
llama 7B F16 | 16 | pp64 | 44.73 | 77.52 | 1.73 |
llama 7B F16 | 16 | pp65 | 43.67 | 77.59 | 1.78 |
llama 7B F16 | 16 | pp66 | 36.00 | 77.94 | 2.17 |
llama 7B F16 | 16 | pp120 | 46.27 | 78.65 | 1.70 |
llama 7B F16 | 16 | pp128 | 44.67 | 64.10 | 1.43 |
llama 7B F16 | 16 | pp130 | 38.55 | 75.47 | 1.96 |
llama 7B F16 | 16 | pp240 | 47.56 | 79.61 | 1.67 |
llama 7B F16 | 16 | pp255 | 43.95 | 67.90 | 1.54 |
llama 7B F16 | 16 | pp256 | 43.84 | 78.00 | 1.78 |
llama 7B F16 | 16 | pp510 | 47.37 | 67.67 | 1.43 |
llama 7B F16 | 16 | pp512 | 47.73 | 69.11 | 1.45 |
llama 7B F16 | 16 | pp1023 | 46.70 | 69.32 | 1.48 |
llama 7B F16 | 16 | pp1024 | 46.85 | 69.70 | 1.49 |
llama 7B F16 | 16 | pp1025 | 46.43 | 69.02 | 1.49 |
llama 7B F16 | 16 | pp2048 | 41.15 | 64.47 | 1.57 |
llama 7B F16 | 16 | tg128 | 6.77 | 6.79 | 1.00 |
llama 7B F16 | 24 | pp1 | 6.72 | 6.70 | 1.00 |
llama 7B F16 | 24 | pp2 | 13.08 | 13.80 | 1.06 |
llama 7B F16 | 24 | pp3 | 19.12 | 19.80 | 1.04 |
llama 7B F16 | 24 | pp4 | 17.18 | 25.52 | 1.49 |
llama 7B F16 | 24 | pp5 | 20.68 | 31.70 | 1.53 |
llama 7B F16 | 24 | pp6 | 33.86 | 25.86 | 0.76 |
llama 7B F16 | 24 | pp7 | 20.49 | 42.41 | 2.07 |
llama 7B F16 | 24 | pp8 | 32.02 | 49.30 | 1.54 |
llama 7B F16 | 24 | pp9 | 45.35 | 54.32 | 1.20 |
llama 7B F16 | 24 | pp10 | 38.55 | 58.23 | 1.51 |
llama 7B F16 | 24 | pp11 | 41.01 | 62.25 | 1.52 |
llama 7B F16 | 24 | pp12 | 54.11 | 65.30 | 1.21 |
llama 7B F16 | 24 | pp13 | 45.66 | 66.65 | 1.46 |
llama 7B F16 | 24 | pp14 | 48.00 | 68.97 | 1.44 |
llama 7B F16 | 24 | pp15 | 58.72 | 74.53 | 1.27 |
llama 7B F16 | 24 | pp16 | 50.72 | 74.72 | 1.47 |
llama 7B F16 | 24 | pp30 | 65.48 | 84.78 | 1.29 |
llama 7B F16 | 24 | pp31 | 60.81 | 84.96 | 1.40 |
llama 7B F16 | 24 | pp32 | 61.80 | 85.63 | 1.39 |
llama 7B F16 | 24 | pp64 | 65.16 | 85.38 | 1.31 |
llama 7B F16 | 24 | pp65 | 64.48 | 84.86 | 1.32 |
llama 7B F16 | 24 | pp66 | 66.96 | 84.38 | 1.26 |
llama 7B F16 | 24 | pp120 | 55.89 | 85.90 | 1.54 |
llama 7B F16 | 24 | pp128 | 65.76 | 68.28 | 1.04 |
llama 7B F16 | 24 | pp130 | 65.48 | 85.78 | 1.31 |
llama 7B F16 | 24 | pp240 | 59.03 | 87.17 | 1.48 |
llama 7B F16 | 24 | pp255 | 66.37 | 84.71 | 1.28 |
llama 7B F16 | 24 | pp256 | 65.55 | 86.56 | 1.32 |
llama 7B F16 | 24 | pp510 | 64.20 | 74.53 | 1.16 |
llama 7B F16 | 24 | pp512 | 63.80 | 82.39 | 1.29 |
llama 7B F16 | 24 | pp1023 | 58.59 | 76.50 | 1.31 |
llama 7B F16 | 24 | pp1024 | 58.57 | 73.12 | 1.25 |
llama 7B F16 | 24 | pp1025 | 58.36 | 75.34 | 1.29 |
llama 7B F16 | 24 | pp2048 | 53.68 | 71.46 | 1.33 |
llama 7B F16 | 24 | tg128 | 6.31 | 6.50 | 1.03 |
llama 7B F16 | 32 | pp1 | 6.95 | 6.52 | 0.94 |
llama 7B F16 | 32 | pp2 | 10.93 | 13.09 | 1.20 |
llama 7B F16 | 32 | pp3 | 20.45 | 16.78 | 0.82 |
llama 7B F16 | 32 | pp4 | 18.21 | 25.07 | 1.38 |
llama 7B F16 | 32 | pp5 | 21.85 | 31.60 | 1.45 |
llama 7B F16 | 32 | pp6 | 37.12 | 33.09 | 0.89 |
llama 7B F16 | 32 | pp7 | 29.10 | 43.65 | 1.50 |
llama 7B F16 | 32 | pp8 | 35.45 | 49.34 | 1.39 |
llama 7B F16 | 32 | pp9 | 45.19 | 46.98 | 1.04 |
llama 7B F16 | 32 | pp10 | 43.03 | 58.52 | 1.36 |
llama 7B F16 | 32 | pp11 | 42.45 | 55.39 | 1.30 |
llama 7B F16 | 32 | pp12 | 61.48 | 64.59 | 1.05 |
llama 7B F16 | 32 | pp13 | 47.92 | 67.93 | 1.42 |
llama 7B F16 | 32 | pp14 | 53.72 | 63.25 | 1.18 |
llama 7B F16 | 32 | pp15 | 62.70 | 74.37 | 1.19 |
llama 7B F16 | 32 | pp16 | 58.18 | 65.05 | 1.12 |
llama 7B F16 | 32 | pp30 | 81.05 | 86.72 | 1.07 |
llama 7B F16 | 32 | pp31 | 70.93 | 85.87 | 1.21 |
llama 7B F16 | 32 | pp32 | 69.70 | 86.73 | 1.24 |
llama 7B F16 | 32 | pp64 | 75.62 | 85.96 | 1.14 |
llama 7B F16 | 32 | pp65 | 80.69 | 88.52 | 1.10 |
llama 7B F16 | 32 | pp66 | 86.93 | 88.50 | 1.02 |
llama 7B F16 | 32 | pp120 | 88.89 | 90.04 | 1.01 |
llama 7B F16 | 32 | pp128 | 85.38 | 72.80 | 0.85 |
llama 7B F16 | 32 | pp130 | 82.46 | 88.17 | 1.07 |
llama 7B F16 | 32 | pp240 | 73.02 | 90.63 | 1.24 |
llama 7B F16 | 32 | pp255 | 88.03 | 90.93 | 1.03 |
llama 7B F16 | 32 | pp256 | 84.65 | 78.67 | 0.93 |
llama 7B F16 | 32 | pp510 | 78.80 | 87.88 | 1.12 |
llama 7B F16 | 32 | pp512 | 73.98 | 88.27 | 1.19 |
llama 7B F16 | 32 | pp1023 | 71.23 | 82.14 | 1.15 |
llama 7B F16 | 32 | pp1024 | 74.38 | 82.08 | 1.10 |
llama 7B F16 | 32 | pp1025 | 73.94 | 80.76 | 1.09 |
llama 7B F16 | 32 | pp2048 | 65.88 | 75.73 | 1.15 |
llama 7B F16 | 32 | tg128 | 5.92 | 5.98 | 1.01 |
Model | Threads | Test | t/s master | t/s perfo/tinyblas | Speedup |
---|---|---|---|---|---|
llama 7B all F32 | 8 | pp1 | 3.29 | 3.60 | 1.10 |
llama 7B all F32 | 8 | pp2 | 6.29 | 7.10 | 1.13 |
llama 7B all F32 | 8 | pp3 | 9.88 | 10.64 | 1.08 |
llama 7B all F32 | 8 | pp4 | 7.28 | 13.62 | 1.87 |
llama 7B all F32 | 8 | pp5 | 9.33 | 16.96 | 1.82 |
llama 7B all F32 | 8 | pp6 | 18.77 | 20.40 | 1.09 |
llama 7B all F32 | 8 | pp7 | 12.29 | 22.44 | 1.83 |
llama 7B all F32 | 8 | pp8 | 14.00 | 17.29 | 1.24 |
llama 7B all F32 | 8 | pp9 | 27.70 | 27.44 | 0.99 |
llama 7B all F32 | 8 | pp10 | 12.74 | 28.43 | 2.23 |
llama 7B all F32 | 8 | pp11 | 18.08 | 31.63 | 1.75 |
llama 7B all F32 | 8 | pp12 | 34.07 | 36.68 | 1.08 |
llama 7B all F32 | 8 | pp13 | 21.59 | 37.41 | 1.73 |
llama 7B all F32 | 8 | pp14 | 23.54 | 41.72 | 1.77 |
llama 7B all F32 | 8 | pp15 | 42.10 | 45.17 | 1.07 |
llama 7B all F32 | 8 | pp16 | 26.58 | 45.03 | 1.69 |
llama 7B all F32 | 8 | pp30 | 60.69 | 68.16 | 1.12 |
llama 7B all F32 | 8 | pp31 | 42.30 | 66.59 | 1.57 |
llama 7B all F32 | 8 | pp32 | 43.45 | 70.13 | 1.61 |
llama 7B all F32 | 8 | pp64 | 60.06 | 81.96 | 1.36 |
llama 7B all F32 | 8 | pp65 | 58.62 | 83.51 | 1.42 |
llama 7B all F32 | 8 | pp66 | 52.86 | 80.40 | 1.52 |
llama 7B all F32 | 8 | pp120 | 75.73 | 66.10 | 0.87 |
llama 7B all F32 | 8 | pp128 | 71.88 | 79.70 | 1.11 |
llama 7B all F32 | 8 | pp130 | 69.67 | 84.03 | 1.21 |
llama 7B all F32 | 8 | pp240 | 57.52 | 76.86 | 1.34 |
llama 7B all F32 | 8 | pp255 | 68.61 | 63.99 | 0.93 |
llama 7B all F32 | 8 | pp256 | 57.10 | 76.73 | 1.34 |
llama 7B all F32 | 8 | pp510 | 58.15 | 63.27 | 1.09 |
llama 7B all F32 | 8 | pp512 | 57.39 | 65.34 | 1.14 |
llama 7B all F32 | 8 | pp1023 | 56.31 | 61.57 | 1.09 |
llama 7B all F32 | 8 | pp1024 | 56.42 | 63.13 | 1.12 |
llama 7B all F32 | 8 | pp1025 | 55.91 | 64.92 | 1.16 |
llama 7B all F32 | 8 | pp2048 | 54.29 | 64.00 | 1.18 |
llama 7B all F32 | 8 | tg128 | 3.30 | 3.30 | 1.00 |
llama 7B all F32 | 16 | pp1 | 3.55 | 3.57 | 1.01 |
llama 7B all F32 | 16 | pp2 | 5.44 | 7.26 | 1.33 |
llama 7B all F32 | 16 | pp3 | 8.37 | 10.87 | 1.30 |
llama 7B all F32 | 16 | pp4 | 6.61 | 14.35 | 2.17 |
llama 7B all F32 | 16 | pp5 | 7.75 | 17.88 | 2.31 |
llama 7B all F32 | 16 | pp6 | 15.87 | 21.44 | 1.35 |
llama 7B all F32 | 16 | pp7 | 11.32 | 24.51 | 2.17 |
llama 7B all F32 | 16 | pp8 | 12.19 | 28.14 | 2.31 |
llama 7B all F32 | 16 | pp9 | 22.32 | 31.75 | 1.42 |
llama 7B all F32 | 16 | pp10 | 15.65 | 34.38 | 2.20 |
llama 7B all F32 | 16 | pp11 | 12.30 | 37.65 | 3.06 |
llama 7B all F32 | 16 | pp12 | 28.21 | 41.11 | 1.46 |
llama 7B all F32 | 16 | pp13 | 19.83 | 43.12 | 2.17 |
llama 7B all F32 | 16 | pp14 | 20.17 | 46.27 | 2.29 |
llama 7B all F32 | 16 | pp15 | 33.49 | 49.29 | 1.47 |
llama 7B all F32 | 16 | pp16 | 23.59 | 50.51 | 2.14 |
llama 7B all F32 | 16 | pp30 | 48.91 | 80.46 | 1.65 |
llama 7B all F32 | 16 | pp31 | 37.14 | 79.76 | 2.15 |
llama 7B all F32 | 16 | pp32 | 36.78 | 82.64 | 2.25 |
llama 7B all F32 | 16 | pp64 | 52.07 | 103.75 | 1.99 |
llama 7B all F32 | 16 | pp65 | 39.03 | 76.48 | 1.96 |
llama 7B all F32 | 16 | pp66 | 57.92 | 104.47 | 1.80 |
llama 7B all F32 | 16 | pp120 | 63.20 | 112.86 | 1.79 |
llama 7B all F32 | 16 | pp128 | 58.30 | 111.95 | 1.92 |
llama 7B all F32 | 16 | pp130 | 57.93 | 112.51 | 1.94 |
llama 7B all F32 | 16 | pp240 | 58.98 | 103.05 | 1.75 |
llama 7B all F32 | 16 | pp255 | 62.83 | 84.99 | 1.35 |
llama 7B all F32 | 16 | pp256 | 60.20 | 102.69 | 1.71 |
llama 7B all F32 | 16 | pp510 | 58.07 | 82.11 | 1.41 |
llama 7B all F32 | 16 | pp512 | 56.77 | 88.96 | 1.57 |
llama 7B all F32 | 16 | pp1023 | 57.33 | 84.32 | 1.47 |
llama 7B all F32 | 16 | pp1024 | 57.65 | 83.80 | 1.45 |
llama 7B all F32 | 16 | pp1025 | 54.17 | 85.64 | 1.58 |
llama 7B all F32 | 16 | pp2048 | 53.54 | 77.67 | 1.45 |
llama 7B all F32 | 16 | tg128 | 3.38 | 3.32 | 0.98 |
llama 7B all F32 | 24 | pp1 | 3.57 | 3.58 | 1.00 |
llama 7B all F32 | 24 | pp2 | 4.44 | 7.18 | 1.62 |
llama 7B all F32 | 24 | pp3 | 9.34 | 10.78 | 1.15 |
llama 7B all F32 | 24 | pp4 | 7.36 | 14.26 | 1.94 |
llama 7B all F32 | 24 | pp5 | 8.86 | 17.80 | 2.01 |
llama 7B all F32 | 24 | pp6 | 18.73 | 21.35 | 1.14 |
llama 7B all F32 | 24 | pp7 | 13.51 | 24.65 | 1.82 |
llama 7B all F32 | 24 | pp8 | 14.72 | 28.18 | 1.91 |
llama 7B all F32 | 24 | pp9 | 27.34 | 31.47 | 1.15 |
llama 7B all F32 | 24 | pp10 | 19.02 | 34.47 | 1.81 |
llama 7B all F32 | 24 | pp11 | 20.65 | 37.93 | 1.84 |
llama 7B all F32 | 24 | pp12 | 36.08 | 41.20 | 1.14 |
llama 7B all F32 | 24 | pp13 | 25.13 | 43.69 | 1.74 |
llama 7B all F32 | 24 | pp14 | 25.49 | 47.22 | 1.85 |
llama 7B all F32 | 24 | pp15 | 40.66 | 50.46 | 1.24 |
llama 7B all F32 | 24 | pp16 | 28.48 | 52.54 | 1.84 |
llama 7B all F32 | 24 | pp30 | 60.87 | 85.70 | 1.41 |
llama 7B all F32 | 24 | pp31 | 46.29 | 85.91 | 1.86 |
llama 7B all F32 | 24 | pp32 | 33.68 | 88.61 | 2.63 |
llama 7B all F32 | 24 | pp64 | 63.22 | 115.66 | 1.83 |
llama 7B all F32 | 24 | pp65 | 62.47 | 115.16 | 1.84 |
llama 7B all F32 | 24 | pp66 | 72.74 | 116.25 | 1.60 |
llama 7B all F32 | 24 | pp120 | 82.94 | 125.57 | 1.51 |
llama 7B all F32 | 24 | pp128 | 71.71 | 125.38 | 1.75 |
llama 7B all F32 | 24 | pp130 | 58.06 | 123.70 | 2.13 |
llama 7B all F32 | 24 | pp240 | 71.65 | 115.68 | 1.61 |
llama 7B all F32 | 24 | pp255 | 68.59 | 97.20 | 1.42 |
llama 7B all F32 | 24 | pp256 | 67.18 | 114.58 | 1.71 |
llama 7B all F32 | 24 | pp510 | 62.15 | 106.84 | 1.72 |
llama 7B all F32 | 24 | pp512 | 60.77 | 107.10 | 1.76 |
llama 7B all F32 | 24 | pp1023 | 59.90 | 94.97 | 1.59 |
llama 7B all F32 | 24 | pp1024 | 59.98 | 96.93 | 1.62 |
llama 7B all F32 | 24 | pp1025 | 59.06 | 94.01 | 1.59 |
llama 7B all F32 | 24 | pp2048 | 59.30 | 91.46 | 1.54 |
llama 7B all F32 | 24 | tg128 | 3.40 | 3.40 | 1.00 |
llama 7B all F32 | 32 | pp1 | 3.44 | 3.03 | 0.88 |
llama 7B all F32 | 32 | pp2 | 6.29 | 5.22 | 0.83 |
llama 7B all F32 | 32 | pp3 | 9.26 | 10.28 | 1.11 |
llama 7B all F32 | 32 | pp4 | 7.41 | 14.41 | 1.94 |
llama 7B all F32 | 32 | pp5 | 8.90 | 16.51 | 1.85 |
llama 7B all F32 | 32 | pp6 | 13.72 | 20.73 | 1.51 |
llama 7B all F32 | 32 | pp7 | 12.95 | 25.04 | 1.93 |
llama 7B all F32 | 32 | pp8 | 14.24 | 26.06 | 1.83 |
llama 7B all F32 | 32 | pp9 | 26.56 | 29.83 | 1.12 |
llama 7B all F32 | 32 | pp10 | 18.28 | 33.25 | 1.82 |
llama 7B all F32 | 32 | pp11 | 19.46 | 33.97 | 1.75 |
llama 7B all F32 | 32 | pp12 | 35.39 | 38.65 | 1.09 |
llama 7B all F32 | 32 | pp13 | 23.80 | 42.63 | 1.79 |
llama 7B all F32 | 32 | pp14 | 25.25 | 42.34 | 1.68 |
llama 7B all F32 | 32 | pp15 | 44.43 | 43.85 | 0.99 |
llama 7B all F32 | 32 | pp16 | 27.65 | 33.80 | 1.22 |
llama 7B all F32 | 32 | pp30 | 62.07 | 76.51 | 1.23 |
llama 7B all F32 | 32 | pp31 | 48.28 | 79.72 | 1.65 |
llama 7B all F32 | 32 | pp32 | 50.50 | 83.07 | 1.64 |
llama 7B all F32 | 32 | pp64 | 62.50 | 114.16 | 1.83 |
llama 7B all F32 | 32 | pp65 | 72.63 | 119.77 | 1.65 |
llama 7B all F32 | 32 | pp66 | 89.07 | 121.06 | 1.36 |
llama 7B all F32 | 32 | pp120 | 91.80 | 131.97 | 1.44 |
llama 7B all F32 | 32 | pp128 | 84.08 | 130.04 | 1.55 |
llama 7B all F32 | 32 | pp130 | 84.99 | 128.60 | 1.51 |
llama 7B all F32 | 32 | pp240 | 75.48 | 123.00 | 1.63 |
llama 7B all F32 | 32 | pp255 | 82.80 | 125.00 | 1.51 |
llama 7B all F32 | 32 | pp256 | 80.86 | 125.54 | 1.55 |
llama 7B all F32 | 32 | pp510 | 79.07 | 103.85 | 1.31 |
llama 7B all F32 | 32 | pp512 | 72.10 | 118.05 | 1.64 |
llama 7B all F32 | 32 | pp1023 | 69.98 | 102.26 | 1.46 |
llama 7B all F32 | 32 | pp1024 | 73.17 | 112.25 | 1.53 |
llama 7B all F32 | 32 | pp1025 | 71.53 | 102.11 | 1.43 |
llama 7B all F32 | 32 | pp2048 | 68.90 | 98.77 | 1.43 |
llama 7B all F32 | 32 | tg128 | 3.13 | 3.16 | 1.01 |
@slaren Thanks for the benchmark!
It looks good with the Intel CPU; FP32 is really good for PP...
With 24 threads it looks well balanced now.
@ggerganov: these differences are really strange; for now I don't see why. I'll try to do other tests later. It would be good to find out why; if it is a matter of configuration, it would be good to be able to document it.
I tried disabling interleaving: TG is halved, but PP has almost the same peak.
I will try to make some changes so that PP doesn't slow down. 🤞
Even if you don't find a solution to this, I think it is fine to ignore the results from my Ryzen, because it is very likely that I have misconfigured something in the BIOS.
Force-pushed from b2dab60 to 30ae0d2
- add BF16 support
- change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71)
- reduce memory bandwidth: simpler tinyblas dispatch, more cache friendly
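As a rough illustration of what a simpler, more cache-friendly dispatch can look like, here is my own sketch of the general idea (not the actual code of this commit or of ik_llama.cpp#71): each thread gets a contiguous range of output tiles, so the panels of A and B it reads stay hot in cache instead of being revisited by several threads.

#include <algorithm>

// Partition the M x N output into BM x BN tiles and give thread `ith` of
// `nth` a contiguous range of tiles to compute.
void matmul_dispatch(int M, int N, int ith, int nth) {
    const int BM = 64, BN = 96;                  // tile sizes (illustrative)
    const int tiles_m = (M + BM - 1) / BM;
    const int tiles_n = (N + BN - 1) / BN;
    const int n_tiles = tiles_m * tiles_n;

    const int per_thread = (n_tiles + nth - 1) / nth;
    const int first_tile = std::min(ith * per_thread, n_tiles);
    const int last_tile  = std::min(first_tile + per_thread, n_tiles);

    for (int t = first_tile; t < last_tile; ++t) {
        const int i0 = (t / tiles_n) * BM;       // tile origin along M
        const int j0 = (t % tiles_n) * BN;       // tile origin along N
        const int i1 = std::min(i0 + BM, M);
        const int j1 = std::min(j0 + BN, N);
        // ... compute the C[i0:i1, j0:j1] tile here, reusing the A/B panels ...
        (void)i1; (void)j1;
    }
}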
Force-pushed from 30ae0d2 to 94cd488
- show-progress is not part of GNU Wget2
Force-pushed from 94cd488 to 4bf8cd9
OK, the code looks good and I get good performance on the Ryzen 9 5950X and the 7945HS. Need to "remove" the non-working test in the "Server" check.
Force-pushed from 4bf8cd9 to 7b9119b
Some last benchmarks ("without" u-batch):
Do not compare these directly with the previous results; I have made some changes to the BIOS config (PBO / max TDP...).
Perplexity looks good.

./build/bin/./llama-perplexity -ctk bf16 -ctv bf16 --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk PPL ln(PPL(Q)/PPL(base)) KL Divergence Δp RMS Same top p
1 3.9443 ± 0.5267 0.00048 ± 0.00053 0.00004 ± 0.00001 0.169 ± 0.019 % 99.608 ± 0.392 %
2 5.4419 ± 0.6039 0.00133 ± 0.00153 0.00005 ± 0.00001 0.167 ± 0.012 % 99.412 ± 0.339 %
3 4.6835 ± 0.4027 0.00066 ± 0.00105 0.00006 ± 0.00001 0.236 ± 0.021 % 99.608 ± 0.226 %
4 5.0057 ± 0.3672 0.00051 ± 0.00080 0.00005 ± 0.00000 0.231 ± 0.017 % 99.608 ± 0.196 %
5 5.2931 ± 0.3434 0.00030 ± 0.00065 0.00005 ± 0.00000 0.220 ± 0.014 % 99.686 ± 0.157 %
6 5.8307 ± 0.3543 0.00030 ± 0.00055 0.00005 ± 0.00000 0.216 ± 0.012 % 99.739 ± 0.131 %
7 6.2255 ± 0.3544 0.00047 ± 0.00052 0.00005 ± 0.00000 0.210 ± 0.011 % 99.664 ± 0.137 %
8 6.4316 ± 0.3454 0.00047 ± 0.00046 0.00005 ± 0.00000 0.218 ± 0.010 % 99.657 ± 0.130 %
9 6.8874 ± 0.3580 0.00050 ± 0.00041 0.00005 ± 0.00000 0.213 ± 0.010 % 99.608 ± 0.130 %
10 7.2365 ± 0.3589 0.00030 ± 0.00038 0.00005 ± 0.00000 0.209 ± 0.009 % 99.569 ± 0.130 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.F16.gguf
chunk PPL ln(PPL(Q)/PPL(base)) KL Divergence Δp RMS Same top p
1 3.9432 ± 0.5262 0.00021 ± 0.00007 0.00000 ± 0.00000 0.023 ± 0.003 % 100.000 ± 0.000 %
2 5.4435 ± 0.6041 0.00163 ± 0.00150 0.00000 ± 0.00000 0.025 ± 0.002 % 100.000 ± 0.000 %
3 4.6856 ± 0.4029 0.00111 ± 0.00100 0.00000 ± 0.00000 0.030 ± 0.002 % 100.000 ± 0.000 %
4 5.0072 ± 0.3674 0.00081 ± 0.00075 0.00000 ± 0.00000 0.029 ± 0.002 % 100.000 ± 0.000 %
5 5.2951 ± 0.3437 0.00067 ± 0.00060 0.00000 ± 0.00000 0.030 ± 0.002 % 100.000 ± 0.000 %
6 5.8323 ± 0.3545 0.00057 ± 0.00050 0.00000 ± 0.00000 0.029 ± 0.002 % 100.000 ± 0.000 %
7 6.2269 ± 0.3546 0.00069 ± 0.00047 0.00000 ± 0.00000 0.028 ± 0.001 % 100.000 ± 0.000 %
8 6.4324 ± 0.3455 0.00059 ± 0.00041 0.00000 ± 0.00000 0.028 ± 0.001 % 100.000 ± 0.000 %
9 6.8876 ± 0.3581 0.00053 ± 0.00036 0.00000 ± 0.00000 0.027 ± 0.001 % 100.000 ± 0.000 %
10 7.2379 ± 0.3591 0.00049 ± 0.00033 0.00000 ± 0.00000 0.027 ± 0.001 % 99.961 ± 0.039 %
Force-pushed from 7b9119b to 2c6864a
Thanks!
ikawrakow/ik_llama.cpp#71 has a good idea.
I figured out how to add it to the llamafile/tinyblas sgemm (and a little more), and it works great: