Name: FFTR2E.ASM Type: Assembler Macro Version: 1.0 Last Change: 4-Feb-87 Description: 1024 Point, Non-In-place, Complex FFT Macro This FFT macro performs a 1024 point complex FFT on external data using the Radix 2, Decimation in Time, Cooley-Tukey FFT algorithm [1][2]. The Cooley-Tukey algorithm is perhaps the simplest and most widely used form of FFT. Here we demonstrate several programming techniques and memory mapping strategies which take advantage of the DSP56000/1 architecture to perform very fast FFT's. It does not use straight line code to speed up FFT's but rather exploits the use of the dual internal data buses and memories and the ability of the external bus controller to perform one external (off-chip) access without adding any wait states. FFTR2E starts by performing two Radix 2 passes on external data to partition the 1024 complex points into four sets of 256 complex points which can be processed independently. We take advantage of the fact that half of the FFT coefficients (twiddle factors) are zero for the first two passes to reduce the number of operations by one half. The remaining coefficients are +1 and -1, allowing add and subtract operations to replace coefficient lookup table accesses. Because of these facts, the output of the first pass can be kept internal to the Data ALU and only the output of the second pass is written back to memory. At this point all data is still external. Note that the coding of the first two passes is written with only one external access per instruction. Assuming the FFTR2E macro is in internal program memory, this guarantees that external data can be accessed at the same speed as internal data. Each set of 256 complex points is then moved into the DSP56000/1 internal data memories and a Radix 2 FFT is performed on-chip. Thus the calculation is non-in-place since the internal data memories are used as a temporary workspace. The first pass of each 256 point FFT reads the external data input and writes the intermediate data into the internal memories. All intermediate passes work on internal data. However, the output data from the last pass of each 256 point FFT is written back to the external data memories. Thus the output of the 1024 point complex FFT overwrites the input data in external data memory. The implementation uses 24 bit fixed point data storage and 24 bit fixed point FFT coefficients (sine and cosine lookup tables). The Data ALU maintains 56 bit accumulator precision whenever possible. All data is complex, with the real part in X Data memory and the imaginary part in Y Data memory. The external data buffer requires 1024 X Data and 1024 Y Data memory locations. The algorithm is performed "non-in-place", since the internal DSP56000/1 Data RAM's are used as additional workspace storage. The input data is assumed to be in normal (time-sequential) order and the output is in bit-reversed order. By using the reverse-carry address modifier and a separate output data buffer, the output data may be easily unscrambled. Other methods also exist to unscramble the output data without a separate output data buffer. For maximum speed, the FFT macro performs a lookup table operation to get new FFT coefficients (called twiddle factors) for each group of butterflies. The FFTR2E macro uses "twiddle factor" lookup tables (-cosine in X memory and -sine in Y memory) stored in external data memory. A SINCOS macro is available to generate these tables. The -sine table requires 512 X Data locations and the -cosine table requires 512 Y Data locations. All register initialization is performed by this macro. However, the macro assumes that registers which should not be altered by the FFT have already been saved by the main program. This allows the user to fit the FFT macro into his application and thus control the context switching overhead. No data scaling is performed and no overflow detection is done. Modifications to this routine could allow it to be used with the scaling modes and thus allow dynamic scaling for each FFT pass. The FFTR2E macro requires 104 words of program memory. Using a 20.5 MHz DSP56000/1, the FFTR2E macro can perform a 1024 point complex FFT in 3.39 milliseconds. Additional algorithm details are included in the source file; however, more algorithm description would be required for complete understanding by typical users. The test program FFTR2ET demonstrates the calling procedure for this macro. References ---------- [1] Cooley, J. W., and Tukey, J. W., "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297-301, April 1965. [2] Bergland, G. D., "A guided tour of the fast Fourier transform," IEEE Spectrum, vol. 6, pp. 41-52, July 1969, also in Digital Signal Processing, edited by L. R. Rabiner and C. M. Rader, IEEE Press, pp. 228-239, 1972.