Fortran Fast math

A collection of functions for fast number crunching using Fortran.

To get the maximum performance from this library, compile with `-O3 -march=native` (or the equivalent flags for your compiler).

Available functions

function	name(s)	shapes	types
sum	`fsum` `fsum_kahan`(1)	`1d`	`real32` `real64`
dot	`fprod` `fprod_kahan`(2)	`1d`	`real32` `real64`
cos	`fcos`	`elemental`	`real32` `real64`
sin	`fsin`	`elemental`	`real32` `real64`
tan	`ftan`	`elemental`	`real32` `real64`
tanh	`ftanh`	`elemental`	`real32` `real64`
acos	`facos`	`elemental`	`real32` `real64`
atan	`fatan`	`elemental`	`real32` `real64`
erf	`ferf`	`elemental`	`real32` `real64`
log	`flog_p3` `flog_p5`	`elemental`	`real64`
rsqrt(3)	`frsqrt`	`elemental`	`real32` `real64`

(1) Fast (and precise) sum for 1D arrays, with an optional mask. `fsum`: the fastest method; at worst it matches the precision of the intrinsic `sum`, and it can be up to one order of magnitude more precise. It accumulates chunks of values into a temporary working batch, which is summed once at the end. `fsum_kahan`: the highest precision, close to that of a sum accumulated in the next higher precision (real64 for real32 data, real128 for real64 data). It uses the same chunking principle, with an elemental Kahan operator applied on top.
(2) Fast (and precise) dot product for 1D arrays, with an optional third weighting array. `fprod`: the fastest method; at worst it is one order of magnitude more precise than the intrinsic `dot_product`. Runtime can vary between 3X and 8X the intrinsic. It accumulates chunks of products into a temporary working batch, which is summed once at the end (based on `fsum`). `fprod_kahan`: same idea as `fsum_kahan`, but applied on top of chunked products.
(3) rsqrt: reciprocal square root, $f(x)=1/\sqrt{x}$
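The chunking idea described in notes (1) and (2) can be sketched as follows. This is a minimal illustration, not the library's implementation: the module name `chunked_sum_sketch`, the routines `chunk_sum` and `chunk_sum_kahan`, and the batch width of 64 are all assumptions made for this example, and only `real32` is covered.

```fortran
module chunked_sum_sketch
    use iso_fortran_env, only: sp => real32
    implicit none
    private
    public :: chunk_sum, chunk_sum_kahan

    ! Hypothetical batch width, chosen for illustration only.
    integer, parameter :: chunk = 64

contains

    !> Chunked sum: accumulate lane by lane into a working batch,
    !> then reduce the batch once at the end.
    pure function chunk_sum(a) result(s)
        real(sp), intent(in) :: a(:)
        real(sp) :: s
        real(sp) :: batch(chunk)
        integer :: i, n, r

        n = size(a)
        r = mod(n, chunk)          ! leftover elements that do not fill a batch
        batch = 0.0_sp
        batch(1:r) = a(1:r)        ! seed the batch with the remainder
        do i = r + 1, n, chunk     ! each pass adds one full chunk
            batch = batch + a(i:i+chunk-1)
        end do
        s = sum(batch)             ! single final reduction
    end function chunk_sum

    !> Same chunking, with a Kahan compensation term kept per lane.
    !> Value-unsafe optimizations (e.g. -ffast-math) can cancel the
    !> compensation algebraically, so check the result with your flags.
    pure function chunk_sum_kahan(a) result(s)
        real(sp), intent(in) :: a(:)
        real(sp) :: s
        real(sp) :: batch(chunk), comp(chunk), y(chunk), t(chunk)
        real(sp) :: c, ys, ts
        integer :: i, n, r

        n = size(a)
        r = mod(n, chunk)
        batch = 0.0_sp
        comp = 0.0_sp
        batch(1:r) = a(1:r)
        do i = r + 1, n, chunk
            y = a(i:i+chunk-1) - comp   ! apply each lane's pending correction
            t = batch + y
            comp = (t - batch) - y      ! rounding error of this addition
            batch = t
        end do

        ! Final reduction: scalar Kahan over the lanes, folding in
        ! each lane's remaining compensation.
        s = 0.0_sp
        c = 0.0_sp
        do i = 1, chunk
            ys = batch(i) - (comp(i) + c)
            ts = s + ys
            c = (ts - s) - ys
            s = ts
        end do
    end function chunk_sum_kahan

end module chunked_sum_sketch

program demo_chunked_sum
    use iso_fortran_env, only: sp => real32, dp => real64
    use chunked_sum_sketch, only: chunk_sum, chunk_sum_kahan
    implicit none
    integer, parameter :: n = 1000003   ! deliberately not a multiple of 64
    real(sp), allocatable :: x(:)
    real(dp) :: ref

    allocate(x(n))
    call random_number(x)
    ref = sum(real(x, dp))              ! real64 accumulation as the reference

    print '(a,es12.4)', 'intrinsic rel. error: ', abs(real(sum(x), dp) - ref) / ref
    print '(a,es12.4)', 'chunk     rel. error: ', abs(real(chunk_sum(x), dp) - ref) / ref
    print '(a,es12.4)', 'kahan     rel. error: ', abs(real(chunk_sum_kahan(x), dp) - ref) / ref
end program demo_chunked_sum
```

The independent lanes give the compiler a loop it can vectorize, and they also shorten each accumulation chain, which is one plausible reason a chunked sum can be both faster and more precise than a naive left-to-right loop. A chunked dot product would follow the same pattern with `a(i:i+chunk-1) * b(i:i+chunk-1)` in place of the plain slice.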

API documentation

To generate the API documentation for fast_math using ford, run the following command:

ford ford.yml

TODO

Contribution guidelines
Polish autodoc

Elapsed time examples and precision

Warning: The following values are only references, meant to show how much results can differ between compilers. Actual speed-ups (or slow-downs) should be measured under the true use conditions to account for inlining (or the lack of it), among other factors. Results were obtained on an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz 2.89 GHz.
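To measure a speed-up under your own conditions, a small timing harness along these lines may help. It is only a sketch: `bench_helpers`, `candidate_sum` and `neumaier_sum` are hypothetical names introduced here (swap `candidate_sum` for the routine you want to test), the interpretation of [ns/eval] as time per array element is an assumption, and this is not necessarily how the tables below were produced.

```fortran
module bench_helpers
    use iso_fortran_env, only: dp => real64
    implicit none
    private
    public :: candidate_sum, neumaier_sum

contains

    !> Stand-in for the routine under test: four independent accumulators.
    pure function candidate_sum(a) result(s)
        real(dp), intent(in) :: a(:)
        real(dp) :: s, p(4)
        integer :: i, m

        m = size(a) - mod(size(a), 4)
        p = 0.0_dp
        do i = 1, m, 4
            p = p + a(i:i+3)
        end do
        s = sum(p) + sum(a(m+1:))   ! add the leftover tail
    end function candidate_sum

    !> Compensated (Neumaier) sum, used here as the accuracy reference.
    pure function neumaier_sum(a) result(s)
        real(dp), intent(in) :: a(:)
        real(dp) :: s, c, t
        integer :: i

        s = 0.0_dp
        c = 0.0_dp
        do i = 1, size(a)
            t = s + a(i)
            if (abs(s) >= abs(a(i))) then
                c = c + ((s - t) + a(i))
            else
                c = c + ((a(i) - t) + s)
            end if
            s = t
        end do
        s = s + c
    end function neumaier_sum

end module bench_helpers

program bench_sketch
    use iso_fortran_env, only: dp => real64, int64
    use bench_helpers, only: candidate_sum, neumaier_sum
    implicit none
    integer, parameter :: n = 100000, reps = 500
    real(dp), allocatable :: x(:)
    real(dp) :: ref, acc, t_intr, t_cand
    integer(int64) :: c0, c1, rate
    integer :: k

    allocate(x(n))
    call random_number(x)

    ! Accuracy first. The reference is itself real64, so errors near
    ! machine epsilon are at the noise floor of this comparison.
    ref = neumaier_sum(x)
    print '(a,es12.4)', 'intrinsic rel. error: ', abs(sum(x) - ref) / abs(ref)
    print '(a,es12.4)', 'candidate rel. error: ', abs(candidate_sum(x) - ref) / abs(ref)

    call system_clock(count_rate=rate)
    acc = 0.0_dp

    call system_clock(c0)
    do k = 1, reps
        x(1) = real(k, dp)          ! change the input so the sum is not hoisted
        acc = acc + sum(x)
    end do
    call system_clock(c1)
    t_intr = real(c1 - c0, dp) / real(rate, dp)

    call system_clock(c0)
    do k = 1, reps
        x(1) = real(k, dp)
        acc = acc + candidate_sum(x)
    end do
    call system_clock(c1)
    t_cand = real(c1 - c0, dp) / real(rate, dp)

    print '(a,f10.4)', 'intrinsic [ns/elem]: ', 1.0e9_dp * t_intr / (real(reps, dp) * real(n, dp))
    print '(a,f10.4)', 'candidate [ns/elem]: ', 1.0e9_dp * t_cand / (real(reps, dp) * real(n, dp))
    print '(a,f10.2)', 'speed-up           : ', t_intr / max(t_cand, tiny(1.0_dp))
    print '(a,es12.4)', 'checksum (ignore)  : ', acc   ! keeps the loops from being optimized away
end program bench_sketch
```

Build it with the same flags and in the same project layout as your real code, since link-time optimization and inlining can change the outcome considerably.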

WSL2 gfortran 13.2 > fpm test --flag "-cpp -O3 -march=native -flto"

sum r32	[ns/eval]	Speed-Up	relative error
intrinsic	1.0300	1.00	3.1511E-06
kahan	0.1200	8.58	9.5367E-08
chunk	0.0900	11.44	1.0824E-07

sum r64	[ns/eval]	Speed-Up	relative error
intrinsic	1.2100	1.00	5.6974E-15
kahan	0.4300	2.81	1.3278E-16
chunk	0.1100	11.00	2.3359E-16

sum r32 mask	[ns/eval]	Speed-Up	relative error
intrinsic	1.2300	1.00	1.6180E-06
kahan	4.3400	0.28	8.3327E-08
chunk	0.3800	3.24	8.8394E-08

sum r64 mask	[ns/eval]	Speed-Up	relative error
intrinsic	3.8600	1.00	2.9463E-15
kahan	4.1950	0.92	6.8723E-17
chunk	0.4200	9.19	1.1879E-16

dot r32	[ns/eval]	Speed-Up	relative error
intrinsic	1.2100	1.00	3.2994E-06
kahan	0.2300	5.26	9.8348E-08
chunk	0.1200	10.08	1.1307E-07

dot r64	[ns/eval]	Speed-Up	relative error
intrinsic	1.1900	1.00	5.9648E-15
kahan	0.4400	2.70	1.2812E-16
chunk	0.0900	13.22	2.2760E-16

trigo	[ns/eval]	Speed-Up	relative error
fsin r32	0.5280	7.54	3.0972E-07
fsin r64	0.9320	8.76	3.9779E-16
facos r32	0.3080	20.87	2.9135E-05
facos r64	0.5960	15.90	2.1557E-14

hyperb	[ns/eval]	Speed-Up	relative error
ftanh r32	1.7280	10.07	7.4200E-08
ftanh r64	1.9360	9.32	1.3282E-09
ferf r32	0.4760	31.71	9.6432E-08
ferf r64	0.7640	18.42	9.6298E-08

rsqrt	[ns/eval]	Speed-Up	relative error
frsqrt r32	0.3480	1.06	9.4399E-04
frsqrt r64	0.6320	2.23	8.6268E-04

WSL2 nvfortran 23.9 > fpm test --flag "-Mpreprocess -fast -Minline"

sum r32	[ns/eval]	Speed-Up	relative error
intrinsic	0.1000	1.00	1.1295E-07
kahan	1.2500	0.08	9.8169E-08
chunk	0.0700	1.43	7.0930E-08

sum r64	[ns/eval]	Speed-Up	relative error
intrinsic	0.1400	1.00	3.8969E-16
kahan	1.6300	0.09	1.2623E-16
chunk	0.2500	0.56	1.8996E-16

sum r32 mask	[ns/eval]	Speed-Up	relative error
intrinsic	0.1700	1.00	2.0742E-07
kahan	5.5650	0.03	8.1956E-08
chunk	0.2550	0.67	5.8889E-08

sum r64 mask	[ns/eval]	Speed-Up	relative error
intrinsic	0.3600	1.00	3.8136E-16
kahan	5.7750	0.06	6.2839E-17
chunk	0.4400	0.82	8.5598E-17

dot r32	[ns/eval]	Speed-Up	relative error
intrinsic	0.1400	1.00	1.1426E-07
kahan	1.9700	0.07	9.7811E-08
chunk	0.1700	0.82	7.1764E-08

dot r64	[ns/eval]	Speed-Up	relative error
intrinsic	0.2400	1.00	3.9246E-16
kahan	1.8700	0.13	1.3178E-16
chunk	0.4100	0.59	1.9129E-16

trigo	[ns/eval]	Speed-Up	relative error
fsin r32	0.0160	726.00	1.0325E-07
fsin r64	0.0280	388.86	5.0118E-17
facos r32	0.0120	466.67	1.0563E-06
facos r64	0.0200	390.60	3.7996E-15

hyperb	[ns/eval]	Speed-Up	relative error
ftanh r32	0.0240	676.67	5.3264E-08
ftanh r64	0.0080	1595.00	1.3282E-09
ferf r32	0.0040	4851.00	9.1205E-08
ferf r64	0.0320	549.62	9.6298E-08

rsqrt	[ns/eval]	Speed-Up	relative error
frsqrt r32	16.5480	0.02	9.4387E-04
frsqrt r64	15.8280	0.09	8.6745E-04

WSL2 ifort 2021.10.0 > fpm test --flag "-fpp -O3 -xHost -ipo"

sum r32	[ns/eval]	Speed-Up	relative error
intrinsic	0.0700	1.00	6.2262E-08
kahan	0.2400	0.29	9.4564E-08
chunk	0.1000	0.70	7.0930E-08

sum r64	[ns/eval]	Speed-Up	relative error
intrinsic	0.0800	1.00	1.9862E-16
kahan	0.5200	0.15	1.2867E-16
chunk	0.1400	0.57	2.0384E-16

sum r32 mask	[ns/eval]	Speed-Up	relative error
intrinsic	0.2000	1.00	2.0568E-07
kahan	0.2150	0.93	7.7122E-08
chunk	0.1450	1.38	6.7770E-08

sum r64 mask	[ns/eval]	Speed-Up	relative error
intrinsic	0.2150	1.00	1.9040E-16
kahan	0.4400	0.49	7.0610E-17
chunk	0.3700	0.58	8.5154E-17

dot r32	[ns/eval]	Speed-Up	relative error
intrinsic	0.0700	1.00	6.2031E-08
kahan	0.2100	0.33	1.0544E-07
chunk	0.0500	1.40	7.1526E-08

dot r64	[ns/eval]	Speed-Up	relative error
intrinsic	0.2200	1.00	6.3782E-16
kahan	0.4600	0.48	2.4047E-16
chunk	0.1200	1.83	1.8829E-16

trigo	[ns/eval]	Speed-Up	relative error
fsin r32	0.3560	1.26	1.9746E-07
fsin r64	0.9280	1.38	7.5661E-17
facos r32	0.3200	2.01	3.0743E-06
facos r64	0.6520	3.36	6.3642E-15

hyperb	[ns/eval]	Speed-Up	relative error
ftanh r32	0.3960	3.70	1.1537E-08
ftanh r64	0.6760	5.17	1.3282E-09
ferf r32	0.3360	2.50	1.0924E-07
ferf r64	0.8440	2.18	9.6298E-08

rsqrt	[ns/eval]	Speed-Up	relative error
frsqrt r32	0.2600	1.31	9.4032E-04
frsqrt r64	0.6360	2.27	8.7360E-04

Windows ifx 2023.2.0 > fpm test --flag "-fpp -O3 -xHost -ipo"

sum r32	[ns/eval]	Speed-Up	relative error
intrinsic	0.3200	1.00	8.4376E-07
kahan	1.0300	0.31	8.7321E-08
chunk	0.4800	0.67	8.7082E-08

sum r64	[ns/eval]	Speed-Up	relative error
intrinsic	1.1200	1.00	5.7371E-15
kahan	0.9400	1.19	1.9507E-16
chunk	0.5600	2.00	1.9418E-16

sum r32 mask	[ns/eval]	Speed-Up	relative error
intrinsic	2.2700	1.00	1.5584E-06
kahan	4.4750	0.51	9.1434E-08
chunk	4.7200	0.48	8.7559E-08

sum r64 mask	[ns/eval]	Speed-Up	relative error
intrinsic	2.1750	1.00	2.9075E-15
kahan	4.7550	0.46	1.0636E-16
chunk	4.0250	0.54	1.0525E-16

dot r32	[ns/eval]	Speed-Up	relative error
intrinsic	0.2600	1.00	7.9530E-07
kahan	1.3800	0.19	6.8307E-08
chunk	0.4900	0.53	6.9737E-08

dot r64	[ns/eval]	Speed-Up	relative error
intrinsic	0.6200	1.00	2.9848E-15
kahan	1.4200	0.44	1.8197E-16
chunk	0.5800	1.07	1.8330E-16

trigo	[ns/eval]	Speed-Up	relative error
fsin r32	3.4640	0.47	1.3924E-07
fsin r64	3.2320	1.31	1.0296E-15
facos r32	1.3960	5.22	3.1710E-05
facos r64	1.4080	6.28	5.2928E-13

hyperb	[ns/eval]	Speed-Up	relative error
ftanh r32	2.8280	1.22	2.3012E-08
ftanh r64	2.6280	2.97	1.3282E-09
ferf r32	3.8600	1.57	3.0995E-07
ferf r64	3.9600	5.67	9.6298E-08

rsqrt	[ns/eval]	Speed-Up	relative error
frsqrt r32	1.6640	0.19	9.4038E-04
frsqrt r64	1.4320	0.96	8.7360E-04

Acknowledgement

Compilation of this library was possible thanks to Transvalor S.A. research activities.
Part of this library is based on the work of Perini and Reitz, which was funded through the Sandia National Laboratories by the U.S. Department of Energy, Office of Vehicle Technologies, program managers Leo Breton, Gurpreet Singh.
Thanks also to the fortran-lang community discussions, such as "Some Intrinsic SUMS" and "fastGPT".

Contribution of open-source developers:

jalvesz

perazz
