mathemagix-devel - New FFT implementation

Subject: Mathemagix

List archive

New FFT implementation

From: Joris van der Hoeven <address@concealed>
To: address@concealed
Subject: New FFT implementation
Date: Thu, 20 Dec 2007 16:31:33 +0100

Hi all,

I just finished a new (raw) FFT multiplication for lengths of the form n=2^p.
The implementation uses SIMD instructions and multi-threading.
Here are benchmarks on my bipro :

---> Complexify Double

[vdhoeven@localhost algebramix]$ fft_bench
4 : 7.67037 ns 6.1363e-05 ms
8 : 6.13164 ns 0.000147159 ms
16 : 4.65555 ns 0.000297955 ms
32 : 4.11272 ns 0.000658035 ms
64 : 4.13011 ns 0.00158596 ms
128 : 3.69123 ns 0.00330735 ms
256 : 3.43379 ns 0.00703241 ms
512 : 3.44551 ns 0.0158769 ms
1024 : 3.51524 ns 0.035996 ms
2048 : 3.85681 ns 0.0868862 ms
4096 : 3.83977 ns 0.188732 ms
8192 : 3.46781 ns 0.369308 ms
16384 : 3.44088 ns 0.789255 ms
32768 : 3.44783 ns 1.69468 ms
65536 : 4.36525 ns 4.5773 ms
131072 : 5.9592 ns 13.2784 ms
262144 : 5.37996 ns 25.3858 ms
524288 : 5.52765 ns 55.0635 ms
1048576 : 5.18983 ns 108.839 ms
2097152 : 5.16801 ns 227.6 ms
4194304 : 4.98821 ns 460.286 ms
8388608 : 6.22998 ns 1202 ms
16777216 : 9.44982 ns 3805 ms

---> Modular P, P= 3*2^30+1

[vdhoeven@localhost mmx]$ fft_bench
4 : 13.3117 ns 0.000106494 ms
8 : 14.8151 ns 0.000355562 ms
16 : 16.2274 ns 0.00103855 ms
32 : 17.2079 ns 0.00275326 ms
64 : 17.8119 ns 0.00683977 ms
128 : 18.418 ns 0.0165025 ms
256 : 18.6565 ns 0.0382086 ms
512 : 18.4581 ns 0.0850551 ms
1024 : 18.5203 ns 0.189648 ms
2048 : 18.6802 ns 0.420828 ms
4096 : 18.9043 ns 0.929182 ms
8192 : 19.0554 ns 2.02933 ms
16384 : 19.1279 ns 4.38748 ms
32768 : 8.85069 ns 4.35029 ms
65536 : 8.95706 ns 9.39216 ms
131072 : 8.61178 ns 19.189 ms
262144 : 8.73949 ns 41.2381 ms
524288 : 8.82756 ns 87.9355 ms
1048576 : 8.97725 ns 188.267 ms
2097152 : 8.81662 ns 388.286 ms
4194304 : 8.85039 ns 816.667 ms
8388608 : 8.76102 ns 1690.33 ms
16777216 : 8.84881 ns 3563 ms

A few notes:

- We are about twice as slow as FFTW3 for complex doubles,
due to the fact that we don't use codelets for the moment.
- The jump in efficiency for Modular P in at n=32768 is due
to the fact that we start multi-threading at this point.
- Multi-threading gives a clean gain by a factor 2 in the case
of Modular P and only a gain of about 1.5 in the case of
Complexify Double.
- Modular P does not make use of SIMD instructions.
It should be possible to compute w.r.t. two moduli with
about the same speed.
- For large n, integer multiplication based on four Ps
might be competitive to integer multiplication based on FFTW3.
- The TFT transform still has to be re-implemented.

Best wishes, Joris

New FFT implementation, Joris van der Hoeven, 12/20/2007
- Re: [Mathemagix] New FFT implementation, Bernard Mourrain, 12/21/2007

List archive

New FFT implementation