% export OMP_NUM_THREADS=1
% python -m blis.benchmark
Setting up data for gemm. 1000 iters, nO=384 nI=384 batch_size=2000
Blis gemm...
Total: 11032014.6484375
9.54 seconds
Numpy (openblas) gemm...
Total: 11032015.625
9.50 seconds
Blis einsum ab,cb->ca
Total: 5510590.8203125
9.78 seconds
Numpy (openblas) einsum ab,cb->ca
Total: 5510596.19140625
90.67 seconds
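To make clear what is being timed, here is a rough plain-NumPy reconstruction of the workload. The operand shapes and the transpose are assumptions inferred from the printed parameters (nO=384, nI=384, batch_size=2000), not taken from blis.benchmark's source:

import time
import numpy as np

# Assumed shapes, inferred from the printed parameters; the actual layout
# inside blis.benchmark may differ.
nO, nI, batch_size = 384, 384, 2000
X = np.random.rand(batch_size, nI)   # activations: (batch_size, nI)
W = np.random.rand(nO, nI)           # weights: (nO, nI)

# gemm-style multiply: (batch_size, nI) x (nI, nO) -> (batch_size, nO)
Y = X @ W.T

# The einsum spec 'ab,cb->ca' contracts the shared b dimension:
# out[c, a] = sum_b X[a, b] * W[c, b], i.e. the same product, transposed.
Z = np.einsum('ab,cb->ca', X, W)
assert np.allclose(Z, (X @ W.T).T)

start = time.perf_counter()
for _ in range(100):                 # 100 iterations to keep the sketch quick
    X @ W.T
print(f"gemm, 100 iters: {time.perf_counter() - start:.2f} seconds")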
numpy with OpenBLAS and blis are on par for gemm. However, this run does not use intermediate optimization in numpy's einsum. Enabling it by passing optimize=True:
% python -m blis.benchmark
Setting up data for gemm. 1000 iters, nO=384 nI=384 batch_size=2000
Blis gemm...
Total: 11032014.6484375
9.62 seconds
Numpy (openblas) gemm...
Total: 11032015.625
9.51 seconds
Blis einsum ab,cb->ca
Total: 5510590.8203125
9.70 seconds
Numpy (openblas) einsum ab,cb->ca
Total: 5510592.28515625
11.43 seconds
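As far as I understand numpy's einsum, passing optimize=True makes it compute a contraction path first, and a two-operand contraction like this one can then be handed off to the BLAS-backed tensordot/dot machinery instead of einsum's generic C loop, which would explain the drop from ~90 to ~11 seconds. A minimal sketch of the comparison (same assumed shapes as above):

import time
import numpy as np

X = np.random.rand(2000, 384)
W = np.random.rand(384, 384)

def bench(optimize, iters=100):
    start = time.perf_counter()
    for _ in range(iters):
        np.einsum('ab,cb->ca', X, W, optimize=optimize)
    return time.perf_counter() - start

print("optimize=False:", round(bench(False), 2), "seconds")
print("optimize=True: ", round(bench(True), 2), "seconds")

# einsum_path prints the contraction plan the optimized path would use.
print(np.einsum_path('ab,cb->ca', X, W, optimize='optimal')[1])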
With optimize=True, numpy's einsum is only slightly slower than blis. However, I am skeptical of the claim that parallelization does not help in inference. The matrix sizes used in the benchmark are fairly typical for inference (e.g. the standard transformer attention matrices are 768x768). Testing with 4 threads (which is fairly modest on current multi-core SMT CPUs):
% export OMP_NUM_THREADS=4
% python -m blis.benchmark
Setting up data for gemm. 1000 iters, nO=384 nI=384 batch_size=2000
Blis gemm...
Total: 11032014.6484375
9.77 seconds
Numpy (openblas) gemm...
Total: 11032015.625
3.40 seconds
Blis einsum ab,cb->ca
Total: 5510590.8203125
9.83 seconds
Numpy (openblas) einsum ab,cb->ca
Total: 5510592.28515625
4.53 seconds
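The env var comparison above requires restarting the process for each thread count. As an aside, the third-party threadpoolctl package (not used by blis.benchmark; pip install threadpoolctl) can inspect and limit the BLAS thread pools at runtime, which makes the 1-vs-4-thread comparison reproducible from a single script:

import time
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

X = np.random.rand(2000, 384)
W = np.random.rand(384, 384)

def bench(iters=200):
    start = time.perf_counter()
    for _ in range(iters):
        X @ W.T
    return time.perf_counter() - start

# Report which BLAS/OpenMP libraries are loaded and their thread counts.
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])

# Runtime equivalent of the OMP_NUM_THREADS comparison above.
with threadpool_limits(limits=1, user_api="blas"):
    print("1 thread :", round(bench(), 2), "seconds")
with threadpool_limits(limits=4, user_api="blas"):
    print("4 threads:", round(bench(), 2), "seconds")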
Maybe it's worthwhile compiling blis with multi-threading support?
For reference:
% lscpu | grep name:
Model name: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz