@@ -28,7 +28,8 @@ Requirements:
 - ``pycuda`` or ``cupy`` and CUDA development tools (``nvcc``) for the cuda backend
 - ``numpy``
 - on Windows, this requires visual studio (c++ tools) and a cuda toolkit installation,
-  with either CUDA_PATH or CUDA_HOME environment variable.
+  with either CUDA_PATH or CUDA_HOME environment variable. However, it should be
+  simpler to install using ``conda``, as detailed below.
 - *Only when installing from source*: ``vkfft.h`` installed in the usual include
   directories, or in the 'src' directory

@@ -105,8 +106,8 @@ Features
 - unit tests for all transforms: see test sub-directory. Note that these take a **long**
   time to finish due to the exhaustive number of sub-tests.
 - Note that out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2
-- tested on macOS (10.13.6), Linux (Debian/Ubuntu, x86-64 and power9), and Windows 10
-  (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
+- tested on macOS (10.13.6/x86, 12.6/M1), Linux (Debian/Ubuntu, x86-64 and power9),
+  and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
 - GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs.
 - inplace transforms do not require an extra buffer or work area (as in cuFFT), unless the x
   size is larger than 8192, or if the y and z FFT size are larger than 2048. In that case
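As an aside to the C2R note above: real transforms use a half-Hermitian spectrum, with the last axis shortened to ``n//2+1``, which is related to why a library may treat the complex input of an out-of-place C2R as scratch space. A CPU-side sketch with numpy (illustrative only, this is not pyvkfft code):

```python
import numpy as np

# Illustrative numpy sketch (not pyvkfft code): a real-to-complex (R2C)
# transform stores only the half-Hermitian spectrum, shortening the
# last axis to n//2 + 1.
rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16))

spec = np.fft.rfftn(a)                  # R2C: (16, 16) -> (16, 9)
assert spec.shape == (16, 16 // 2 + 1)

# The inverse (C2R) needs the original length to undo the halving;
# libraries are free to use the complex input as a work buffer here,
# consistent with vkFFT's out-of-place C2R overwriting it for ndim >= 2.
back = np.fft.irfftn(spec, s=a.shape)
assert np.allclose(a, back)
```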
@@ -131,9 +132,9 @@ Performance
 See the benchmark notebook, which allows plotting OpenCL and CUDA backend throughput, as well
 as comparing with cuFFT (using scikit-cuda) and clFFT (using gpyfft).

-Example result for batched 2D FFT with array dimensions of batch x N x N using a Titan V:
+Example result for batched 2D, single precision FFT with array dimensions of batch x N x N using a V100:

-.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-TITAN_V-Linux.png
+.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_V100-Linux.png

 Notes regarding this plot:

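The throughput in plots like this is typically an idealised memory-bandwidth figure rather than FLOP/s: each transformed axis is counted as one read plus one write of the whole array. A minimal sketch of that metric (the exact formula used by the pyvkfft benchmark is an assumption here):

```python
import numpy as np

def fft_throughput_gbs(shape, dtype_bytes, dt, ndim_fft=2):
    """Idealised FFT throughput in GB/s: 2 accesses (read + write) of
    the full array per transformed axis, divided by the elapsed time.
    Sketch of the usual benchmark metric; pyvkfft's exact formula may
    differ."""
    nbytes = np.prod(shape) * dtype_bytes
    return 2 * ndim_fft * nbytes / dt / 1e9

# e.g. a batch x N x N = 147 x 1024 x 1024 complex64 array (8 bytes
# per element), transformed along 2 axes in 10 ms:
print(f"{fft_throughput_gbs((147, 1024, 1024), 8, 0.010):.1f} GB/s")
```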
@@ -143,23 +144,29 @@ Notes regarding this plot:
 * the batch size is adapted for each N so the transform takes long enough; in practice the
   transformed array is around 600MB. Transforms on small arrays with small batch sizes
   could give lower performance, or better when fully cached.
-* a number of blue + (CuFFT) are actually performed as radix-N transforms with 7<N<127 (e.g. 11)
-  -hence the performance similar to the blue dots- but the list of supported radix transforms
-  is undocumented (?) so they are not correctly labeled.
+* the dots labelled as using the Bluestein algorithm may actually use Rader's algorithm
+  instead, hence the better performance of many sizes, both for vkFFT and cuFFT

 The general results are:

 * vkFFT throughput is similar to cuFFT up to N=1024. For N>1024 vkFFT is much more
   efficient than cuFFT due to the smaller number of reads and writes per FFT axis
   (apart from isolated radix-2 3 sizes)
 * the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges
-  where CUDA performs better, due to different cache . [Note that if the card is also used for display,
+  where CUDA performs better, due to different cache behaviour. [Note that if the card is also used for display,
   then the difference can increase, e.g. for nvidia cards OpenCL performance is more affected
   by display use than the CUDA backend]
 * clFFT (via gpyfft) generally performs much worse than the other transforms, though this was
   tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations
   to find the fastest combination)

+Another example on an A40 card (only with radix<=13 transforms):
+
+.. image:: https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_A40-Linux-radix13.png
+
+On this card cuFFT is significantly faster overall, even though the radix-11 and radix-13
+transforms supported by vkFFT give better results for those sizes.
+
 Accuracy
 --------
 See the accuracy notebook, which allows comparing the accuracy for different