In complex signal-processing designs, designers must weigh the tradeoffs between the programming flexibility of discrete DSPs and the performance flexibility available in dedicated hardware while considering the needs of their system. The effort of creating DSP hardware has long been a major deterrent for all but the most performance-hungry applications. However, a new design methodology called algorithmic synthesis makes it easier to create high-performance dedicated DSP hardware. With this methodology, designers can automatically create RTL implementations in seconds, compare multiple micro-architectural options, and quickly achieve designs that are optimized for the application at hand. With these capabilities, designers find themselves seriously rethinking their overall design flow when implementing algorithms in hardware, via either ASIC or FPGA.

The processing performance required for next-generation compute-intensive applications, such as wireless communication and image processing, has created a gap between off-the-shelf DSP performance and market needs. More and more, discrete DSP devices fall short of the performance requirements for leading-edge communications and multimedia applications. In recent years, system designers have increasingly looked beyond programmable DSPs toward dedicated hardware solutions, such as FPGAs and ASICs, that deliver increased levels of performance.

However, manually implementing DSP algorithms in hardware can be an expensive, time-consuming process. Hand-coding hardware descriptions in RTL can take a design team weeks or months, with verification and optimization doubling or even tripling the total time required to implement complex DSP algorithms. This effort and expense meant that ASICs and FPGAs were used only in demanding niche applications.
Now, new algorithmic synthesis design tools, such as Catapult® C Synthesis, make it faster and easier to implement DSP algorithms in hardware, and put within easy reach hardware implementations that are optimized for performance, area or power consumption.

Algorithmic synthesis moves hardware design decisions to a higher level of abstraction. As a point of reference, one could compare the automated C-to-RTL design flow (Figure 1) used in algorithmic synthesis to the traditional DSP software programming flow. In both flows, algorithm designers develop a floating-point model of an algorithm, and then convert that to a fixed-point model, typically in C++. At this point in the traditional flow, software developers use a compiler to automatically compile the C code for an off-the-shelf DSP. With algorithmic synthesis, a hardware designer instead uses an algorithmic synthesis tool to automatically create an RTL description of DSP hardware that is tuned for a specific ASIC or FPGA technology platform, dramatically shortening the hardware design flow. In fact, the technology-independent ANSI C++ description enables the algorithmic synthesis tool to target either ASIC or FPGA implementations.
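As a rough illustration of the floating-point to fixed-point conversion step that both flows share, the sketch below quantizes a value to a signed 16-bit format with 15 fractional bits and multiplies two such values in integer arithmetic. The word length, the rounding and saturation choices, and the helper names `to_q15`, `from_q15` and `mul_q15` are all assumptions made for illustration; the article does not specify a number format.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical format: signed 16-bit word with 15 fractional bits ("Q1.15").
constexpr int kFracBits = 15;
constexpr int32_t kOne = 1 << kFracBits;  // 32768, the scale factor

// Quantize a floating-point value in [-1, 1) to Q1.15, rounding to nearest
// and saturating at the ends of the representable range.
inline int16_t to_q15(double x) {
    double scaled = std::round(x * kOne);
    if (scaled > 32767.0) scaled = 32767.0;
    if (scaled < -32768.0) scaled = -32768.0;
    return static_cast<int16_t>(scaled);
}

// Convert a Q1.15 value back to floating point, e.g. to compare the
// fixed-point model against the floating-point model.
inline double from_q15(int16_t q) {
    return static_cast<double>(q) / kOne;
}

// Fixed-point multiply: the 32-bit product carries 30 fractional bits, so
// divide by 2^15 to return to Q1.15 (truncating toward zero), then saturate
// the single overflow case (-1.0 * -1.0).
inline int16_t mul_q15(int16_t a, int16_t b) {
    int32_t product = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    int32_t result = product / kOne;
    if (result > 32767) result = 32767;
    return static_cast<int16_t>(result);
}
```

For example, `mul_q15(to_q15(0.5), to_q15(0.5))` yields the Q1.15 encoding of 0.25. Narrower or wider formats trade quantization error against multiplier size, which is the kind of decision the fixed-point model is meant to pin down before synthesis.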
Algorithmic synthesis tools like Catapult C Synthesis enable designers to tune the design to exactly match the performance required for a specific application, including latency, throughput, power consumption and frequency, helping designers avoid the common problem of overbuilt hardware. Since the C representation is completely abstracted from the final implementation, designers can later use the constraints in the algorithmic synthesis tool to easily re-target the same representation for different micro-architectures and ASIC/FPGA implementations.

Abstraction Delivers Greater Exploration and Optimization Opportunities

Algorithmic synthesis methodologies based on pure ANSI C++ speed the hardware design process by 10-100x. First, an ANSI C++ source is by nature a purely functional description of the algorithm and is therefore independent of the target architecture and technology. Lower levels of abstraction necessarily include more information on hardware structure, which limits the freedom a designer has when exploring alternative micro-architectures. Second, automating the painstaking RTL creation process means designers have more time to consider architectural decisions that can have a tremendous impact on design performance, area, and power consumption. The user can apply synthesis constraints to specify the target technology (ASIC or FPGA), the amount of parallelism, and the desired performance. These constraints, combined with increased productivity, give the designer both the ability and the time to explore different tradeoffs, resulting in a wider optimization scope for their design (Figure 2).
These factors enable designers to create hardware designs more tightly tuned to a specific application than traditional RTL methods allow. For example, the same pixel-pipe algorithm for video might be used in an HDTV, a notebook computer, and a 4G mobile handset, but each implementation would look very different based on performance, power consumption and area requirements. Tradeoffs to achieve these different implementations can be made using a variety of constraints within an algorithmic synthesis tool. These constraints can include loop unrolling or pipelining, loop merging, RAM, ROM and FIFO array mapping, memory resource merging and memory bit-width resizing. Using this methodology, hardware designers can easily perform "what if" tradeoffs evaluating area, latency, power consumption, throughput, and clock frequency for each micro-architecture, all the while leaving the original pure ANSI C/C++ source unchanged.

Using Algorithmic Synthesis to Design Low-Power Implementations

The dominance of battery-powered consumer applications that rely on power-efficient algorithms, such as portable communications, data devices and video systems, has made power optimization a higher priority. Well-known tactics like clock gating, optimizing memory accesses, controlling clock rates, and changing state machine encoding are typical RTL methods for power-efficient design. Most of these design techniques are available through advanced algorithmic synthesis tools like Catapult C Synthesis. Typically, the ability to influence factors like power consumption, performance and area is much greater at higher levels of abstraction. Unfortunately, the accuracy of power, performance, and area estimation is inversely proportional to a design's abstraction level.
However, the ability to compare power consumption estimates for the various algorithms, or for the micro-architectures within an algorithm, gives the designer an invaluable advantage and allows them to have a much greater impact on power-related decisions earlier in the design process.

Using a Finite Impulse Response (FIR) filter as an example, let's look at how this simple algorithm can be implemented in multiple ways using algorithmic synthesis. Used to restore signal clarity in transmission systems, the FIR filter is one of the most common components in signal processing applications. This example evaluates tradeoffs for an 8-tap FIR filter with a performance requirement of 400 MHz, targeted to a 90nm ASIC technology.
One of the most common structures of a FIR filter is the literal, or direct form, implementation (Figure 3), where the data is moved through a shift-register based delay line and each register's output is multiplied by its corresponding coefficient. The outputs of all the multipliers are then summed to create the filter's output. Typically, this implementation delivers the highest throughput. As most FIR filter coefficients are symmetrical, this traditional architecture can also be optimized by folding the structure, thus reducing the number of multipliers required. Figure 4 shows an RTL schematic view of a pipelined implementation of a direct form FIR filter.
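The direct form structure described above can be sketched in plain C++ roughly as follows. This is a minimal illustration rather than the source used in the article: the 16-bit sample and coefficient types, the 32-bit accumulator, and the function names `shift_in`, `fir_direct` and `fir_folded` are assumptions. The folded variant shows how symmetric coefficients let mirrored registers be added before multiplying, halving the multiplier count.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kFirTaps = 8;  // the 8-tap example from the text
using Line = std::array<int16_t, kFirTaps>;

// Shift-register delay line: every sample moves one register down, the
// oldest sample is discarded, and the new sample enters at position 0.
inline void shift_in(Line& regs, int16_t sample) {
    for (std::size_t i = kFirTaps - 1; i > 0; --i) {
        regs[i] = regs[i - 1];
    }
    regs[0] = sample;
}

// Direct form: one multiplication per tap, all products summed.
inline int32_t fir_direct(const Line& regs, const Line& coeffs) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < kFirTaps; ++i) {
        acc += static_cast<int32_t>(regs[i]) * coeffs[i];
    }
    return acc;
}

// Folded form, valid only when coeffs[i] == coeffs[kFirTaps - 1 - i]:
// add the two registers that share a coefficient, then multiply once.
inline int32_t fir_folded(const Line& regs, const Line& coeffs) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < kFirTaps / 2; ++i) {
        int32_t pair = static_cast<int32_t>(regs[i]) + regs[kFirTaps - 1 - i];
        acc += pair * coeffs[i];
    }
    return acc;
}
```

Calling `shift_in` and then `fir_direct` once per input sample produces one output per sample. Because the loops have fixed bounds, a synthesis tool is free to unroll them into parallel multipliers or keep them rolled to share a single one, which is the kind of micro-architectural choice the article attributes to constraints rather than to source changes.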
Another implementation of a FIR filter, typically used when the filter has a low number of taps, is the structure in which the taps are rotated through a shift register with only the end tap being indexed. This implementation typically results in a lower-area structure. Figure 5 shows a schematic view of a register-based rotate implementation of a 4-tap FIR filter.

Of course, there are many other popular, logically equivalent FIR implementations, such as a transpose form or a circular buffer using memory (for larger numbers of taps), and it is up to the designer to choose the one that best fits their performance needs. In this article, we will experiment with the direct form and register-based rotate implementations of the FIR filter and examine them with respect to power consumption.
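One possible reading of the register-based rotate structure is sketched below: the delay line behaves as a ring, a single multiply-accumulate reads only the end register, and the ring rotates by one position between accumulations. This is an interpretation of the prose description, not the article's source code; the function name `fir_rotate_step`, the data types, and the rotation direction are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRotateTaps = 4;  // Figure 5 describes a 4-tap filter
using Ring = std::array<int16_t, kRotateTaps>;

// Computes one output sample. On entry the end register holds the oldest
// sample, so the new sample simply replaces it.
inline int32_t fir_rotate_step(Ring& regs, const Ring& coeffs, int16_t sample) {
    regs[kRotateTaps - 1] = sample;

    int32_t acc = 0;
    for (std::size_t k = 0; k < kRotateTaps; ++k) {
        // Only the end register is read, so one multiplier is enough.
        acc += static_cast<int32_t>(regs[kRotateTaps - 1]) * coeffs[k];
        if (k + 1 < kRotateTaps) {
            // Rotate right by one: the end value wraps to the front and the
            // next-older sample moves into the end position.
            std::rotate(regs.begin(), regs.end() - 1, regs.end());
        }
    }
    // After kRotateTaps - 1 rotations the oldest sample sits in the end
    // register again, ready to be overwritten by the next call.
    return acc;
}
```

With a zero-initialized ring, feeding a unit impulse followed by zeros returns the coefficients in order, the same response as the direct form. The area saving comes from reusing one multiplier across all taps, at the cost of needing several accumulation steps per output sample.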
Using an advanced algorithmic synthesis tool, such as Catapult C Synthesis from Mentor Graphics, one can rapidly create various micro-architectures for any given algorithm. For example, a traditional or direct form implementation of the FIR filter can be designed with minimal resources or as a parallel, fully pipelined system. Though the implementations are functionally similar, their performance, especially with respect to power, is quite different, as can be clearly seen in Figure 6. The fully pipelined solution runs at the highest throughput rate, but also has larger area and higher estimated power usage. Similar experimental implementations can be created for the register-based rotate version of the FIR filter algorithm. As expected, this implementation uses less area than the shift-register based version and also consumes less power.
Leveraging Abstraction Benefits for Efficient Verification

Equally important, the quality of the RTL source code is greatly enhanced through algorithmic synthesis. Since the lower-level code is automatically generated from the system specification, up to 60% fewer bugs are introduced into the design. By eliminating errors that invariably crop up during manual RTL generation, algorithmic synthesis shortens the verification effort and thereby moves a design to completion faster. For those bugs that stem from design-related decisions, the same algorithmic description can be used to automatically create a consistent verification environment, including high-speed system models. Advanced algorithmic synthesis tools automatically create SystemC wrappers, allowing designers to verify their designs 20X to 100X faster than with traditional register transfer level (RTL) simulation. A test bench can also be generated that automatically compares the ANSI C/C++ input to the RTL output, providing debug information for specific synchronization points in the case of a simulation mismatch.

Rapidly Implementing Algorithms in Hardware using Catapult C Synthesis

Even with an automated approach like algorithmic synthesis, the hardware design process is a departure from the traditional DSP software programming flow. For companies that simply must achieve higher performance, lower power consumption, or more efficient implementations, Catapult C Synthesis is a cornerstone design technology, delivering productivity benefits that enable broader design exploration and optimization opportunities. Based on pure ANSI C++, the Catapult C Synthesis tool eliminates the manual effort involved in hardware creation, helping reduce overall design time by 10-100x.

by Shawn McCloud & Anil Khanna, Design Creation and Synthesis Division - Mentor Graphics Corporation
September 20, 2007
Comments on this article? Send them to comments@fpgajournal.com
All material on this site copyright © 2003-2007 techfocus media, inc. All rights reserved. FPGA and Structured ASIC Journal.