Name: FFTR2E.ASM
         Type: Assembler Macro
      Version: 1.0
  Last Change:  4-Feb-87

  Description: 1024 Point, Non-In-place, Complex FFT Macro

 This FFT macro performs a 1024 point complex FFT on external data
 using the Radix 2, Decimation in Time, Cooley-Tukey FFT algorithm [1][2].
 The Cooley-Tukey algorithm is perhaps the simplest and most widely
 used form of FFT.  Here we demonstrate several programming techniques
 and memory mapping strategies which take advantage of the DSP56000/1
 architecture to perform very fast FFT's.  It does not use straight
 line code to speed up FFT's but rather exploits the use of the
 dual internal data buses and memories and the ability of the external
 bus controller to perform one external (off-chip) access without
 adding any wait states.

 FFTR2E starts by performing two Radix 2 passes on external data to
 partition the 1024 complex points into four sets of 256 complex points
 which can be processed independently.  We take advantage of the fact
 that half of the FFT coefficients (twiddle factors) are zero for the
 first two passes to reduce the number of operations by one half.  The
 remaining coefficients are +1 and -1, allowing add and subtract
 operations to replace coefficient lookup table accesses.  Because of
 these facts, the output of the first pass can be kept internal to the
 Data ALU and only the output of the second pass is written back to
 memory.  At this point all data is still external.  Note that the coding
 of the first two passes is written with only one external access per
 instruction.  Assuming the FFTR2E macro is in internal program memory,
 this guarantees that external data can be accessed at the same speed as
 internal data.

 Each set of 256 complex points is then moved into the DSP56000/1 internal
 data memories and a Radix 2 FFT is performed on-chip.  Thus the
 calculation is non-in-place since the internal data memories are used as
 a temporary workspace.  The first pass of each 256 point FFT reads the
 external data input and writes the intermediate data into the internal
 memories.  All intermediate passes work on internal data.  However, the
 output data from the last pass of each 256 point FFT is written back to
 the external data memories.  Thus the output of the 1024 point complex
 FFT overwrites the input data in external data memory.

 The implementation uses 24 bit fixed point data storage and 24 bit
 fixed point FFT coefficients (sine and cosine lookup tables).  The
 Data ALU maintains 56 bit accumulator precision whenever possible.
 All data is complex, with the real part in X Data memory and the
 imaginary part in Y Data memory.  The external data buffer requires
 1024 X Data and 1024 Y Data memory locations.  The algorithm is
 performed "non-in-place", since the internal DSP56000/1 Data RAM's
 are used as additional workspace storage. The input data is assumed
 to be in normal (time-sequential) order and the output is in bit-reversed
 order.  By using the reverse-carry address modifier and a separate
 output data buffer, the output data may be easily unscrambled.  Other
 methods also exist to unscramble the output data without a separate
 output data buffer.  For maximum speed, the FFT macro performs a lookup
 table operation to get new FFT coefficients (called twiddle factors) for
 each group of butterflies.  The FFTR2E macro uses "twiddle factor"
 lookup tables (-cosine in X memory and -sine in Y memory) stored in
 external data memory.  A SINCOS macro is available to generate these
 tables.  The -sine table requires 512 X Data locations and the -cosine
 table requires 512 Y Data locations.

 All register initialization is performed by this macro.  However, the
 macro assumes that registers which should not be altered by the FFT
 have already been saved by the main program.  This allows the user to
 fit the FFT macro into his application and thus control the context
 switching overhead.  No data scaling is performed and no overflow
 detection is done.  Modifications to this routine could allow it to
 be used with the scaling modes and thus allow dynamic scaling for
 each FFT pass.

 The FFTR2E macro requires 104 words of program memory.  Using a
 20.5 MHz DSP56000/1, the FFTR2E macro can perform a 1024 point
 complex FFT in 3.39 milliseconds.  Additional algorithm details are
 included in the source file; however, more algorithm description
 would be required for complete understanding by typical users.  The
 test program FFTR2ET demonstrates the calling procedure for this
 macro.

 References
 ----------

 [1]   Cooley, J. W., and Tukey, J. W., "An algorithm for the machine
       calculation of complex Fourier series," Math. Comput., vol. 19,
       pp. 297-301, April 1965.
 [2]   Bergland, G. D., "A guided tour of the fast Fourier transform,"
       IEEE Spectrum, vol. 6, pp. 41-52, July 1969, also in Digital
       Signal Processing, edited by L. R. Rabiner and C. M. Rader,
       IEEE Press, pp. 228-239, 1972.
