Skip to content

Commit db1e16a

Browse files
committed
Update README, turn off precompilation as this caused allocations and worse benchmarks
1 parent 7ccf760 commit db1e16a

File tree

2 files changed

+198
-33
lines changed

2 files changed

+198
-33
lines changed

README.md

+158
Original file line numberDiff line numberDiff line change
@@ -4,3 +4,161 @@
44
[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://JuliaSIMD.github.io/TriangularSolve.jl/dev)
55
[![Build Status](https://github.com/JuliaSIMD/TriangularSolve.jl/workflows/CI/badge.svg)](https://github.com/JuliaSIMD/TriangularSolve.jl/actions)
66
[![Coverage](https://codecov.io/gh/JuliaSIMD/TriangularSolve.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JuliaSIMD/TriangularSolve.jl)
7+
8+
9+
Performs some triangular solves. For example:
10+
```julia
11+
julia> using TriangularSolve, LinearAlgebra, MKL;
12+
13+
julia> BLAS.set_num_threads(1)
14+
15+
julia> BLAS.get_config().loaded_libs
16+
1-element Vector{LinearAlgebra.BLAS.LBTLibraryInfo}:
17+
LBTLibraryInfo(libmkl_rt.so, ilp64)
18+
19+
julia> N = 100;
20+
21+
julia> A = rand(N,N); B = rand(N,N); C = similar(A);
22+
23+
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false)) # false means single threaded
24+
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
25+
Range (min max): 15.909 μs 41.524 μs ┊ GC (min max): 0.00% 0.00%
26+
Time (median): 17.916 μs ┊ GC (median): 0.00%
27+
Time (mean ± σ): 17.751 μs ± 697.786 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
28+
29+
▃▁ ▁ ▁ ▄▁ ▇▆ ▆█▃ ▂
30+
██▃▁▁██▁▁▁▁█▆▁▁▃▇██▄▃▁███▆▁▄▄███▄▄▅▅▆▇█▇▄▅▆▇██▇█▇▇▆▄▅▄▁▄▁▄▄▇ █
31+
15.9 μs Histogram: log(frequency) by time 19.9 μs <
32+
33+
Memory estimate: 0 bytes, allocs estimate: 0.
34+
35+
julia> @benchmark rdiv!(copyto!($C, $A), UpperTriangular($B))
36+
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
37+
Range (min max): 17.578 μs 75.835 μs ┊ GC (min max): 0.00% 0.00%
38+
Time (median): 19.852 μs ┊ GC (median): 0.00%
39+
Time (mean ± σ): 19.827 μs ± 1.342 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
40+
41+
▄▂ ▂ ▆▅ ▁█▇▂ ▅▃ ▂ ▂
42+
██▁▁▃█▇▁▁▁█▇▄▄▁██▇▄▄▄██▆▅▄████▅▄▆██▆▆▆▆▇██▇▇▆▆▇▆▅▆▄▅▅▆▄▅▄▅▅ █
43+
17.6 μs Histogram: log(frequency) by time 22.4 μs <
44+
45+
Memory estimate: 0 bytes, allocs estimate: 0.
46+
47+
julia> @benchmark ldiv!($C, LowerTriangular($B), $A)
48+
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
49+
Range (min max): 19.102 μs 69.966 μs ┊ GC (min max): 0.00% 0.00%
50+
Time (median): 21.561 μs ┊ GC (median): 0.00%
51+
Time (mean ± σ): 21.565 μs ± 890.952 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
52+
53+
▂▂ ▂▃ ▄▄ ▆█▄ ▅▅ ▂
54+
██▃▁▁▁▇█▁▁▁▁▅█▁▁▁▁▁██▅▁▁▁▅██▆▁▁▁▆███▆▅▃▅████▃▄▅██▇▇▅▆▆▇▇█▇▆▆ █
55+
19.1 μs Histogram: log(frequency) by time 23.4 μs <
56+
57+
Memory estimate: 0 bytes, allocs estimate: 0.
58+
59+
julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false)) # false means single threaded
60+
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
61+
Range (min max): 19.082 μs 39.078 μs ┊ GC (min max): 0.00% 0.00%
62+
Time (median): 19.694 μs ┊ GC (median): 0.00%
63+
Time (mean ± σ): 19.765 μs ± 774.848 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
64+
65+
▃ ▄█ ▁
66+
▂▇██▄▂▁▁▂▂▃███▃▂▁▂▁▂▂▅█▇▃▂▂▂▁▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂ ▃
67+
19.1 μs Histogram: frequency by time 22.1 μs <
68+
69+
Memory estimate: 0 bytes, allocs estimate: 0.
70+
```
71+
Multithreaded benchmarks:
72+
```julia
73+
julia> BLAS.set_num_threads(TriangularSolve.VectorizationBase.num_cores())
74+
75+
julia> @benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B))
76+
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
77+
Range (min max): 8.309 μs 24.357 μs ┊ GC (min max): 0.00% 0.00%
78+
Time (median): 8.769 μs ┊ GC (median): 0.00%
79+
Time (mean ± σ): 8.812 μs ± 382.702 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
80+
81+
▁▃▄▆▆██▇▆▅▃▁
82+
▂▁▂▂▂▂▃▃▃▄▅▇██████████████▇▆▅▄▃▃▃▃▃▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
83+
8.31 μs Histogram: frequency by time 9.7 μs <
84+
85+
Memory estimate: 0 bytes, allocs estimate: 0.
86+
87+
julia> @benchmark rdiv!(copyto!($C, $A), UpperTriangular($B))
88+
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
89+
Range (min max): 11.996 μs 151.147 μs ┊ GC (min max): 0.00% 0.00%
90+
Time (median): 14.163 μs ┊ GC (median): 0.00%
91+
Time (mean ± σ): 14.281 μs ± 2.372 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
92+
93+
▂▄▇███▇▆▅▃▂ ▁ ▂▄▄▅▅▅▆▃▃ ▁
94+
▁▁▁▂▂▃▄▇██████████████████████████▇▆▅▄▅▆▇███▆▅▅▃▄▂▂▂▁▁▁▁▁▁▁▁ ▅
95+
12 μs Histogram: frequency by time 17.3 μs <
96+
97+
Memory estimate: 0 bytes, allocs estimate: 0.
98+
99+
julia> @benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A)
100+
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
101+
Range (min max): 7.903 μs 22.442 μs ┊ GC (min max): 0.00% 0.00%
102+
Time (median): 9.871 μs ┊ GC (median): 0.00%
103+
Time (mean ± σ): 9.789 μs ± 864.957 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
104+
105+
▂▃ ▄▃ ▃▅ ▅▃ ▆▂ ▆▄ ▂▇▄ ▃█▅▂▂▁▁▄▆▃▁ ▁ ▂
106+
██▅▂██▆▅██▆▆▆██▇▇███▇▇▇████▇█████▆██████████████▇███▇▇▆▇▆▅▆ █
107+
7.9 μs Histogram: log(frequency) by time 11.8 μs <
108+
109+
Memory estimate: 0 bytes, allocs estimate: 0.
110+
111+
julia> @benchmark ldiv!($C, LowerTriangular($B), $A)
112+
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
113+
Range (min max): 13.507 μs 142.574 μs ┊ GC (min max): 0.00% 0.00%
114+
Time (median): 15.258 μs ┊ GC (median): 0.00%
115+
Time (mean ± σ): 15.319 μs ± 2.045 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
116+
117+
▁▃ ▁▂ ▁▃▅▁ ▁▄▄▁ ▂▆█▆▃
118+
▁▂▅███▆▇███▆▅████▆▅████▆▆█████▆▄▄▆▆▅▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ ▄
119+
13.5 μs Histogram: frequency by time 18.5 μs <
120+
121+
Memory estimate: 0 bytes, allocs estimate: 0.
122+
123+
julia> versioninfo()
124+
Julia Version 1.8.0-DEV.438
125+
Commit 88a6376e99* (2021-08-28 11:03 UTC)
126+
Platform Info:
127+
OS: Linux (x86_64-redhat-linux)
128+
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
129+
WORD_SIZE: 64
130+
LIBM: libopenlibm
131+
LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
132+
Environment:
133+
JULIA_NUM_THREADS = 8
134+
```
135+
136+
137+
For editing convenience (you can copy/paste the above into a REPL and it should automatically strip `julia> `s and outputs, but the above is less convenient to edit if you want to try changing the benchmarks):
138+
```julia
139+
using TriangularSolve, LinearAlgebra, MKL;
140+
BLAS.set_num_threads(Threads.nthreads())
141+
BLAS.get_config().loaded_libs
142+
N = 100;
143+
144+
A = rand(N,N); B = rand(N,N); C = similar(A);
145+
146+
@benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B), Val(false))
147+
@benchmark rdiv!(copyto!($C, $A), UpperTriangular($B))
148+
149+
@benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A, Val(false))
150+
@benchmark ldiv!($C, LowerTriangular($B), $A)
151+
152+
BLAS.set_num_threads(TriangularSolve.VectorizationBase.num_cores())
153+
@benchmark TriangularSolve.rdiv!($C, $A, UpperTriangular($B))
154+
@benchmark rdiv!(copyto!($C, $A), UpperTriangular($B))
155+
156+
@benchmark TriangularSolve.ldiv!($C, LowerTriangular($B), $A)
157+
@benchmark ldiv!($C, LowerTriangular($B), $A)
158+
159+
versioninfo()
160+
```
161+
162+
Currently, `rdiv!` with `UpperTriangular` and `ldiv!` with `LowerTriangulra` matrices are the only supported configurations.
163+
164+

src/TriangularSolve.jl

+40-33
Original file line numberDiff line numberDiff line change
@@ -291,6 +291,26 @@ const LDIVBUFFERS = Vector{UInt8}[]
291291
si = StrideIndex{2,(1,2),1}((VectorizationBase.static_sizeof(T), RSUF), (StaticInt(0),StaticInt(0)))
292292
stridedpointer(ptr, si, StaticInt{0}())
293293
end
294+
function div_dispatch!(C::AbstractMatrix{T}, A, U, ::Val{UNIT}, ::Val{THREAD}) where {UNIT,T,THREAD}
295+
M, N = size(A)
296+
((N == 0) | (M == 0)) && return nothing
297+
_spa, spap = stridedpointer_preserve(A)
298+
_spc, spcp = stridedpointer_preserve(C)
299+
_spu, spup = stridedpointer_preserve(U)
300+
spa = zero_offsets(_spa)
301+
spc = zero_offsets(_spc)
302+
spu = zero_offsets(_spu)
303+
GC.@preserve spap spcp spup begin
304+
mtb = m_thread_block_size(M, N, Val(T))
305+
if THREAD && (VectorizationBase.num_threads() > 1)
306+
(M > mtb) && return multithread_rdiv!(spc, spa, spu, M, N, mtb, Val(UNIT), VectorizationBase.contiguous_axis(A))
307+
elseif N > block_size(Val(T))
308+
return rdiv_block_MandN!(spc, spa, spu, M, N, Val(UNIT), VectorizationBase.contiguous_axis(A))
309+
end
310+
return rdiv_U!(spc, spa, spu, M, N, VectorizationBase.contiguous_axis(A), Val(UNIT))
311+
end
312+
end
313+
294314
function rdiv!(A::AbstractMatrix{T}, U::UpperTriangular{T}, ::Val{THREAD} = Val(true)) where {T<:Union{Float32,Float64},THREAD}
295315
div_dispatch!(A, A, parent(U), Val(false), Val(THREAD))
296316
return A
@@ -398,6 +418,7 @@ function rdiv_block_MandN!(
398418
spc = gesp(spc, (B_m, StaticInt{0}()))
399419
m = mu
400420
end
421+
nothing
401422
end
402423
function _nthreads()
403424
nc = VectorizationBase.num_cores()
@@ -432,6 +453,7 @@ function multithread_rdiv!(
432453
)
433454
end
434455
end
456+
nothing
435457
# nlaunch = Md - (Mr == 0)
436458
# threads, torelease = Polyester.request_threads(Base.Threads.threadid(), nlaunch)
437459
# nthread = length(threads)
@@ -442,25 +464,6 @@ function multithread_rdiv!(
442464

443465
end
444466

445-
function div_dispatch!(C::AbstractMatrix{T}, A, U, ::Val{UNIT}, ::Val{THREAD}) where {UNIT,T,THREAD}
446-
M, N = size(A)
447-
((N == 0) | (M == 0)) && return C
448-
_spa, spap = stridedpointer_preserve(A)
449-
_spc, spcp = stridedpointer_preserve(C)
450-
_spu, spup = stridedpointer_preserve(U)
451-
spa = zero_offsets(_spa)
452-
spc = zero_offsets(_spc)
453-
spu = zero_offsets(_spu)
454-
GC.@preserve spap spcp spup begin
455-
mtb = m_thread_block_size(M, N, Val(T))
456-
(THREAD & (M > mtb)) && return multithread_rdiv!(spc, spa, spu, M, N, mtb, Val(UNIT), VectorizationBase.contiguous_axis(A))
457-
if VectorizationBase.num_threads() == 1
458-
N > block_size(Val(T)) && return rdiv_block_MandN!(spc, spa, spu, M, N, Val(UNIT), VectorizationBase.contiguous_axis(A))
459-
end
460-
rdiv_U!(spc, spa, spu, M, N, VectorizationBase.contiguous_axis(A), Val(UNIT))
461-
end
462-
end
463-
464467
# We're using `W x W` blocks, consuming `W` registers
465468
# For each block we need to load 1 more value, plus another register is used for `B`. So:
466469
# remaining_registers == register_count() - num_blocks * (W + 1) - 1
@@ -524,18 +527,22 @@ function __init__()
524527
end
525528
end
526529

527-
let
528-
while true
529-
A = rand(1, 1)
530-
B = rand(1, 1)
531-
res = similar(A)
532-
rdiv!(res, A, UpperTriangular(B))
533-
rdiv!(res, A, UnitUpperTriangular(B))
534-
535-
__init__()
536-
ldiv!(res, LowerTriangular(B), A)
537-
ldiv!(res, UnitLowerTriangular(B), A)
538-
break
539-
end
540-
end
530+
# let
531+
# while true
532+
# A = rand(1, 1)
533+
# B = rand(1, 1)
534+
# res = similar(A)
535+
# rdiv!(res, A, UpperTriangular(B))
536+
# rdiv!(res, A, UnitUpperTriangular(B))
537+
# rdiv!(res, A, UpperTriangular(B), Val(false))
538+
# rdiv!(res, A, UnitUpperTriangular(B), Val(false))
539+
540+
# __init__()
541+
# ldiv!(res, LowerTriangular(B), A)
542+
# ldiv!(res, UnitLowerTriangular(B), A)
543+
# ldiv!(res, LowerTriangular(B), A, Val(false))
544+
# ldiv!(res, UnitLowerTriangular(B), A, Val(false))
545+
# break
546+
# end
547+
# end
541548
end

0 commit comments

Comments
 (0)