

SPONSORED WHITE PAPER

Efficient FPGA Multiplier Usage in Wireless Basestations

by Ian Ing and Asher Hazanchuk - Lattice Semiconductor

Introduction

Emerging broadband wireless protocols, based on WiMax and its derivatives, demand ever higher throughput and data rates. The fast chip-rate and digital RF processing that these protocols require is best implemented in hardware, using FPGAs.

FPGAs are ideal as a high-performance, cost-effective solution to implementing the digital functionality of these physical layer protocols because they include the following resources:

• DSP blocks to implement the multiplier and adder/accumulator functions required for various FIR filtering and FFT/IFFT operations.
• SERDES transceivers to support CPRI and OBSAI interfaces between radio heads and baseband digital boards
• Significant FPGA embedded block memory (EBR) to store filter coefficients, perform block interleaving and implement FEC decoding (Turbo, Viterbi, Reed-Solomon, etc.)
• High-speed LVDS I/Os to support wide parallel interfaces to and from D/A and A/D converters, respectively. Converters define the boundary between RF / analog functions and less expensive digital baseband logic. Higher rates of this interface allow more integration of digital upconversion / digital downconversion functions into the low cost FPGA solution.

This article focuses on the first resource, DSP multiplier blocks. By reducing the number of DSP multiplier blocks used in FFT and FIR implementations, and by using the remaining blocks optimally, the designer can minimize resources while still meeting throughput requirements. This allows users to migrate to the most cost-effective FPGA devices available. The four reduction techniques are as follows:

Index | Topic | Potential Multiplier Savings | Block Usage | Notes
------|-------|------------------------------|-------------|------
1 | Efficient Complex Multiplication | 25% | OFDM transmit and receive FFT and IFFT | For complex multipliers in radix-4 butterfly, use 3 multipliers instead of 4 to reduce mults and/or increase throughput
2 | Filter Coefficient Symmetry | 50% | DUC/DDC FIR filters | Use symmetry of coefficients to reduce multipliers, up to one half
3 | Distributed Arithmetic FIR Using EBRs or LUTs | 100% | DUC/DDC FIR filters | Use Embedded Block Memory (EBRs) as a LUT of coefficient x partial products, feed in data serially
4 | CIC filters using adders | 100% | DUC/DDC FIR filters | Implement Cascaded Integrator Comb filters for high factor decimation and interpolation

DUC/DDC: Digital upconversion/downconversion

Table 1 - Four Techniques for Multiplier Reduction in WiMax System Design

These four multiplier saving techniques are described in the following sections.

Efficient Complex Multiplication for OFDM Functions in WiMax

One key feature of WiMax system design is the support for orthogonal frequency division multiplexing, or OFDM. FPGAs make it straightforward to implement OFDM transmitters and receivers in discrete time using the IFFT (inverse FFT) and FFT (fast Fourier transform), respectively. Some protocols, such as 802.16a (fixed WiMAX), require a specific FFT size of 256 points. Other protocols require a range of FFT sizes (802.16e, mobile WiMax), or a scalable FFT size that adapts to dynamic channel and bandwidth requirements (Scalable OFDMA).

Complex Multiplication

The most efficient use of multipliers when implementing 256 and 1024 point FFTs is through a Radix-4 structure. FFT algorithms are decomposed via the reuse of 4-point discrete Fourier transform (DFT) butterfly structures. For example, a 16-point FFT is implemented with 2 stages of radix-4 DFT structures (16 = 4^2) using decimation-in-time, decimation-in-frequency or other related forms of decomposition. The first stage consists of four 4-point DFTs and the second stage also consists of four 4-point DFTs. The outputs of the first stage are multiplied by phase factors before they feed the second stage. One first-stage DFT needs no phase factors other than unity, and each of the other three needs three non-unity phase factors, so a total of 9 phase factors between the first and second stages require nine complex multiplications.
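The count of nine can be checked with a short sketch. It assumes one common Cooley-Tukey indexing, in which output k of first-stage DFT number n is multiplied by the phase factor W_16^(k*n); the function name and its parameters are illustrative and do not come from the article.

```python
def count_nontrivial_twiddles(n_points=16, radix=4):
    """Count inter-stage phase factors W_N^(k*n) that are not equal to 1."""
    n_dfts = n_points // radix
    count = 0
    for n in range(n_dfts):          # which first-stage DFT
        for k in range(radix):       # which output of that DFT
            # An exponent of 0 (mod N) means the factor is 1: no multiplier needed.
            if (k * n) % n_points != 0:
                count += 1
    return count


twiddles = count_nontrivial_twiddles()
mults_standard = 4 * twiddles   # four real multipliers per complex multiplication
mults_reduced = 3 * twiddles    # three real multipliers per complex multiplication
print(twiddles, mults_standard, mults_reduced)  # prints: 9 36 27
```

With four real multipliers per complex product, the nine phase factors cost 36 real multipliers; the three-multiplier form derived in the following paragraphs brings this down to 27.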

At first glance, four multipliers and two adder/subtractors are required to perform a complex multiplication. However, the expression can be rewritten in a form that requires only three multipliers, three adders and two subtractors. Note that adders are implemented in the FPGA core logic, which uses the abundant general-purpose Programmable Logic Cell (PLC) slices in ripple mode.

If D = Dr + jDi is the complex data and C = Cr + jCi is the complex coefficient, the standard expression for the complex multiplication is as follows:

R = D • C = (Dr + jDi) • (Cr + jCi) = Rr + jRi
where Rr = Dr*Cr - Di*Ci, Ri = Dr*Ci + Di*Cr

 

The above standard expression requires the use of four multipliers. This expression can be algebraically rearranged as follows:

Rr = Dr*Cr - Di*Ci
Rr = Dr*Cr - Di*Ci + 0
Rr = Dr*Cr - Di*Ci + (Dr*Ci - Di*Cr) - (Dr*Ci - Di*Cr)
Rr = (Dr*Cr - Dr*Ci + Di*Cr - Di*Ci) + (Dr*Ci - Di*Cr)

The new expression for the complex result is:

Rr = ((Dr + Di)*(Cr - Ci)) + (Dr*Ci - Di*Cr) (three multiplications)
Ri = Dr*Ci + Di*Cr (reuse products from Rr)

As shown schematically in Figure 1, the optimal complex multiplication is implemented with 3 multipliers, 3 adders and 2 subtractors. Note that add/subtract blocks use less die area than 18x18 multiplier blocks in an FPGA.

Figure 1 - Complex Multiplication with 4 and 3 Multipliers
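The rearranged expression can be checked numerically. The following is a minimal sketch rather than an FPGA implementation: the function names are illustrative, and plain Python integers stand in for the fixed-point data and coefficient words.

```python
def complex_mult_4(dr, di, cr, ci):
    """Standard form: four real multiplications, one subtractor, one adder."""
    rr = dr * cr - di * ci
    ri = dr * ci + di * cr
    return rr, ri


def complex_mult_3(dr, di, cr, ci):
    """Rearranged form: three real multiplications, three adders, two subtractors."""
    p1 = (dr + di) * (cr - ci)   # multiplier 1, fed by one adder and one subtractor
    p2 = dr * ci                 # multiplier 2
    p3 = di * cr                 # multiplier 3
    rr = p1 + (p2 - p3)          # one subtractor, then one adder
    ri = p2 + p3                 # one adder, reusing the products p2 and p3
    return rr, ri


# Both forms give the same result for D = 3 - 2j and C = 5 + 7j.
print(complex_mult_4(3, -2, 5, 7))  # prints: (29, 11)
print(complex_mult_3(3, -2, 5, 7))  # prints: (29, 11)
```

One point to keep in mind in hardware is that the pre-add and pre-subtract results (Dr + Di and Cr - Ci) are one bit wider than their inputs, so the operand width presented to multiplier 1 grows accordingly.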

In summary, a 25% reduction in multiplier usage allows the designer to choose one of two benefits:

1. Reduction of multipliers to achieve the same FFT throughput

2. Increase of FFT throughput with the same number of multipliers

Efficient FIR Filter Implementation in Digital Upconverters/Downconverters

The next three efficient multiplier techniques involve implementation of digital upconversion and downconversion in FPGAs. This has become an area of focus for optimization as wireless designers attempt to address the need to move data from very high frequency sample rates to chip processing rates. Digital down/up converter (DDC/DUC) sub-systems are some of the main digital components of the transmitter/receiver functionality within a base station, and historically have been implemented with expensive analog/mixed signal components. There are three techniques that can be used to reduce the number of multipliers in an FPGA implementation:

1. Coefficient symmetry of FIR filters saves multipliers

2. Distributed arithmetic operations use block memory (EBR)

3. Cascaded-integrator comb filters use adders

Up-Conversion / Down-Conversion Overview

As shown in the upper portion of Figure 2, the DDC is composed of the following components: an I/Q splitter, built from a numerically controlled oscillator (NCO) and two mixers, that mixes the input signal from the RF section with sine and cosine waves; and a decimation section that can be configured either as 3 levels of FIR decimation filters or as a FIR decimation filter followed by a cascaded integrator comb (CIC) filter.

Figure 2 - DDC/DUC Structures

The DUC in the lower portion of Figure 2 is composed of the following components: an interpolation section that can be configured either as 3 levels of FIR interpolation filters or as a CIC filter followed by a FIR interpolation filter; and an I/Q mixer, built from an NCO and two mixers, that modulates the I and Q output signals before they are sent to the RF section. Remember, decimation involves the removal of samples to reach a lower sample rate, while interpolation involves adding interpolated samples to increase the sample rate.

General Implementation Guidelines for Converters

The DDC/DUC system is a multiplier-intensive system. Decimation and interpolation filters are typically implemented by an array of multipliers and adders, and the mixer function is itself a multiplier. An area-efficient method to implement an NCO is based on phase shifting using complex multipliers.

The first step in overcoming the multiplier-intensive system challenge is to split and cascade the filters:

• A large FIR decimation filter or FIR interpolation filter with decimation/interpolation factor N can be broken down into two or three smaller and simpler cascading filters with N1, N2 and N3 decimation/interpolation factors. The decimation/interpolation factors satisfy the following equation: N = N1 * N2 * N3

• Breaking down FIR decimation filters or FIR interpolation filters into two or three separate filters reduces the total number of taps required to implement the entire filter. A single filter with decimation or interpolation factor N would need a large number of taps (multipliers) to meet adequate attenuation and noise requirements. Splitting it into two or three smaller and simpler filters reduces the number of taps (multipliers) in the filtering system as a whole. Additionally, the lower sampling rate of the second and third cascaded filters enables time multiplexing to reduce the implementation size even further.
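The tap savings from cascading can be illustrated with a rough estimate. The sketch below is not from the article: it assumes the widely used fred harris rule of thumb for FIR length, taps ≈ (fs / transition bandwidth) x (attenuation in dB / 22), and it assumes the first stage may leave its transition band as wide as the later stage can clean up. All function names, the 80 dB attenuation and the 0.8 passband fraction are hypothetical example values.

```python
import math


def harris_taps(sample_rate, transition_bw, atten_db):
    """Rule-of-thumb FIR length (assumption): (fs / delta_f) * (A_dB / 22)."""
    return math.ceil(sample_rate / transition_bw * atten_db / 22.0)


def single_stage_taps(fs, decim, passband_frac, atten_db):
    """One filter that decimates by `decim` in a single step."""
    f_stop = fs / (2 * decim)            # stopband edge: output Nyquist frequency
    f_pass = passband_frac * f_stop      # passband edge
    return harris_taps(fs, f_stop - f_pass, atten_db)


def two_stage_taps(fs, n1, n2, passband_frac, atten_db):
    """Two cascaded filters with decimation factors n1 and n2 (N = n1 * n2)."""
    f_stop = fs / (2 * n1 * n2)
    f_pass = passband_frac * f_stop
    # Stage 1 runs at fs but only has to reject energy that would alias into
    # the final band, so its stopband can start as late as fs/n1 - f_stop.
    stage1 = harris_taps(fs, (fs / n1 - f_stop) - f_pass, atten_db)
    # Stage 2 has the narrow transition band, but runs at the lower rate fs/n1.
    stage2 = harris_taps(fs / n1, f_stop - f_pass, atten_db)
    return stage1, stage2


single = single_stage_taps(1.0, 64, 0.8, 80.0)
stage1, stage2 = two_stage_taps(1.0, 8, 8, 0.8, 80.0)
print(single, stage1, stage2, stage1 + stage2)
```

Under these assumptions the two-stage cascade needs far fewer taps in total than the single filter, and most of its taps sit in the second stage, which runs at one eighth of the input rate and is therefore the natural candidate for time multiplexing.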

Once the filter steps are defined, there are techniques to reduce the number of multipliers in the actual filters. This is discussed in the next section.

Three Specific Multiplier Saving Techniques for Converters

1. Symmetric decimation and interpolation filters

The symmetry of coefficients of the DDC decimation filters and DUC interpolation filters can be used to achieve up to 50% multiplier reduction. In the symmetric case, the FIR filter coefficients h(0), h(1), ..., h(n) satisfy the condition h(k) = h(n-k) for 0 ≤ k ≤ n.

Since h(k) = h(n-k), the two corresponding samples can be added first and their sum multiplied once by h(k). The number of multipliers required is therefore reduced by a factor of up to 2 (exactly 2 for an even number of coefficients). In FPGAs, inexpensive ripple-mode logic is available to implement the addition of the two data samples that share a coefficient.
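A minimal sketch of this folding follows. The function names are illustrative, and Python integers stand in for fixed-point samples; the point is only that the folded form produces the same outputs with roughly half the multiplications per output.

```python
def fir_direct(h, x):
    """Direct form: one multiplication per coefficient per output sample."""
    taps = len(h)
    return [sum(h[k] * x[m - k] for k in range(taps))
            for m in range(taps - 1, len(x))]


def fir_symmetric(h, x):
    """Folded form for symmetric coefficients, h[k] == h[taps - 1 - k]."""
    taps = len(h)
    assert all(h[k] == h[taps - 1 - k] for k in range(taps)), "h must be symmetric"
    half = taps // 2
    out = []
    for m in range(taps - 1, len(x)):
        acc = 0
        for k in range(half):
            # Pre-add the two samples that share h[k], then multiply once.
            acc += h[k] * (x[m - k] + x[m - (taps - 1 - k)])
        if taps % 2:
            # An odd-length filter has an unpaired centre tap.
            acc += h[half] * x[m - half]
        out.append(acc)
    return out


def multipliers_needed(taps):
    """Multiplications per output in the folded form."""
    return (taps + 1) // 2


x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
print(fir_direct([2, -1, -1, 2], x) == fir_symmetric([2, -1, -1, 2], x))  # prints: True
print(multipliers_needed(4), multipliers_needed(5))                        # prints: 2 3
```

The saving is exactly one half for an even number of coefficients and slightly less for an odd number, because the centre tap has no partner.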

2. FIR filters implemented with EBR memory blocks via distributed arithmetic functions.

Efficient use of FPGA resources is extremely important for multiplier-intensive applications such as DDC or DUC. Using memories and LUT fabric resources as multipliers can significantly increase implementation efficiency. EBRs and the fabric's distributed memories can be used as FIR filter multipliers using the distributed arithmetic technique, also known as the soft multiplier technique. Using this technique, the effective number of multipliers in FPGA devices typically can be increased by a factor of 2 to 5.

Figure 3 demonstrates how an EBR can be used to implement an FIR filter using a distributed arithmetic technique. The samples are serially shifted, one bit at a time, onto the EBR address bus. Inside the EBR there is a table of pre-computed sums of products of each input sample bit (address bit) with its corresponding coefficient. The accumulator accumulates the n intermediate results (n is the sample bit resolution), each weighted by its bit position, and provides a complete FIR filter result after n clock cycles.

Figure 3 - Use of Block Memories as FIR multipliers
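The table lookup and bit-serial accumulation can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not the Lattice implementation: samples are taken to be two's-complement integers, a Python list stands in for the EBR contents, and the function names are hypothetical.

```python
def build_da_lut(h):
    """Pre-compute every possible sum of coefficients.

    Entry `addr` holds the sum of h[k] over all taps k whose address bit is 1,
    so a K-tap filter needs a table of 2**K words.
    """
    taps = len(h)
    return [sum(h[k] for k in range(taps) if (addr >> k) & 1)
            for addr in range(1 << taps)]


def da_fir_output(lut, window, n_bits):
    """One FIR output, sum(h[k] * window[k]), computed without any multiplier.

    `window[k]` is the sample paired with h[k], as a two's-complement
    integer of `n_bits` bits.
    """
    mask = (1 << n_bits) - 1
    acc = 0
    for b in range(n_bits):                  # one iteration per clock cycle
        addr = 0
        for k, sample in enumerate(window):
            # Bit b of each tap's sample forms one address line.
            addr |= (((sample & mask) >> b) & 1) << k
        partial = lut[addr] << b             # weight by bit position (a shift)
        if b == n_bits - 1:
            acc -= partial                   # sign bit carries weight -2**(n_bits-1)
        else:
            acc += partial
    return acc


h = [3, -1, 4, 2]
window = [5, -7, 0, -128]
lut = build_da_lut(h)
print(da_fir_output(lut, window, 8) == sum(c * s for c, s in zip(h, window)))  # prints: True
```

The table size doubles with every tap, which is why longer filters are usually split across several smaller tables whose outputs are added; that partitioning is a common practice and is not described in the article.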

3. CIC filter uses adders instead of multipliers

Replacing part of the interpolation/decimation FIR filter chain with Cascaded Integrator-Comb (CIC) filters is another method to reduce the number of multipliers needed for implementation. CIC filters have no multipliers. They are based on adders and subtractors. Digital Up/Down converters usually require a large rate change on the order of a few hundred. High rate changing interpolation or decimation filters tend to be very expensive in hardware. CIC filters, also called Hogenauer filters, can serve as inexpensive high-factor decimation or interpolation filters [1]. They are used to achieve arbitrary and large rate changes in digital systems and can be implemented efficiently by using only adders and subtractors. Because FPGAs have fast carry chains for implementing adders, a CIC filter is well suited to FPGA implementation.

The structure and characteristics of integrator and comb are listed in Table 2

Table 2 - Structure and Characteristics of Comb and Integrator
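A behavioural sketch of a CIC decimator shows that the whole filter reduces to additions and subtractions. It assumes the usual Hogenauer arrangement of cascaded integrators at the input rate, a rate reduction, and cascaded combs with a differential delay of 1 at the output rate; the function name and arguments are illustrative.

```python
def cic_decimate(x, rate, stages):
    """Decimate `x` by `rate` using `stages` integrators and `stages` combs."""
    integrators = [0] * stages
    comb_delays = [0] * stages
    out = []
    for i, sample in enumerate(x):
        v = sample
        # Integrator chain runs at the input rate: y[n] = y[n-1] + x[n].
        for s in range(stages):
            integrators[s] += v
            v = integrators[s]
        # Keep every `rate`-th sample, then run the comb chain at the low rate.
        if (i + 1) % rate == 0:
            for s in range(stages):
                previous = comb_delays[s]
                comb_delays[s] = v
                v = v - previous             # comb: y[m] = x[m] - x[m-1]
            out.append(v)
    return out


# A single stage is a moving sum of `rate` samples.
print(cic_decimate(list(range(8)), 4, 1))   # prints: [6, 22]
# For a constant input the output settles to the DC gain, rate ** stages.
print(cic_decimate([1] * 32, 4, 2)[-1])     # prints: 16
```

Python integers do not overflow, so this model ignores register width. In hardware the integrators are normally allowed to wrap around in two's-complement arithmetic, and the registers must be wide enough to absorb the DC gain of rate ** stages.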

Implementing Converter and OFDM Functions Using Intellectual Property Cores

It is fairly simple to implement DDC or DUC converters in Lattice FPGAs, because the constituent components are available as IP cores. Figure 4 shows one such application: a CIC interpolator used for digital rate conversion in an up-converter for digital radio applications.

Figure 4 - Digital Up-Converter for Digital Radio Application

The digital up-converter uses the following IP core configurations:

1. FIR Filter (63-tap, interpolating filter)

2. FIR Filter (31-tap, interpolating filter)

3. CIC Filter (Interpolating CIC filter with rates programmable between 8 and 2K)

4. NCO (NCO with sine and cosine outputs)

Advantages of LatticeECP2/M

The LatticeECP2/M family of low-cost FPGAs has several high-performance features that are very relevant to WiMax system design. Very few if any of these features are found in other low-cost FPGA families; instead, they are only found in expensive high-end FPGA families:

• High-performance DSP blocks with hardwired multipliers, adder/accumulator blocks, and pipeline stages.

• SERDES transceiver channels at rates up to 3.125 Gbps, to support CPRI and OBSAI interfaces between radio heads and baseband digital boards.

• Abundant quantities of 18 Kbit EBR memory blocks in the LatticeECP2/M memory enhanced family.

• High-speed LVDS I/Os to support ADC/DAC interfaces, up to 840 Mbps for both inputs and outputs.

These abundant and high performance resources are found in the low-cost LatticeECP2/M family, at price points that are far lower than other FPGA devices. The WiMax system designer is also able to use several design techniques to reduce the number of DSP multipliers required, thus providing opportunities to migrate to smaller, less expensive FPGA devices.


September 20, 2007



All material on this site copyright © 2006 techfocus media, inc. All rights reserved.
FPGA and Structured ASIC Journal