Reduce Build Costs by Offloading DSP Functions to an FPGA

Peter Baran, Hardware Architect, Be Here Technologies, Inc.

In any product development cycle there are multiple opportunities to reduce cost and/or increase functionality. This is particularly true in higher-end DSP applications, which are computationally intensive and performance critical, and require more processing power than can be provided by common microprocessors or low-cost DSP chips. For such applications there are numerous software/hardware alternatives from which to choose, including DSP devices, custom ASICs and field programmable gate arrays (FPGAs). These alternatives offer varying degrees of performance benefits that must be weighed against other factors including cost, power consumption and design time.

Recent reductions in the cost of high-density FPGAs, combined with advances in software-oriented FPGA design tools, have led to a corresponding rise in the use of these devices to handle functions that are traditionally the domain of DSP processors, while at the same time dramatically reducing the risks and up-front costs of custom ASIC solutions. The decision to add an FPGA to a new or existing design can be driven by the desire to extend the life of a common, lower-cost microprocessor (by offloading computationally intensive work) or to reduce or eliminate the need for a higher-cost, top-tier DSP processor. In cases where the throughput of an existing system must increase to handle higher resolutions or larger signal bandwidths, the required performance increases may be primarily computational (requiring a scaling of computational resources) or may require completely new approaches to resolve bandwidth issues.

Signal processing algorithms (the computational core of typical DSP applications) can often be described using a relatively small amount of C source code. The ability to quickly try out new algorithmic approaches from C language models is, therefore, beneficial.
Reengineering low-level hardware designs, on the other hand, can be a tedious and error-prone process.

FPGAs help to address these issues in two ways. First, they have the potential to implement extremely high-performance DSP applications as dedicated hardware without the up-front risk of custom ASICs. Mainstream (and relatively low cost) FPGA devices now have the capacity and features necessary to support such applications. Second, and equally important, FPGAs are becoming dramatically easier to use due to advances in design tools. It is now possible to use multiple software-oriented (graphical and language-based) design methods as part of the FPGA design process.

When used to implement signal processing or other computationally-intensive applications, FPGAs may be used as prototyping vehicles (which may later be converted to hard-mask versions in a dedicated ASIC or structured ASIC) or as actual product platforms, in which case FPGAs offer unique benefits for field software upgrades and compelling cost advantages for low to medium volume products.

High-capacity FPGAs also make sense for value engineering. In such cases, multiple devices (including processor peripherals and random or “glue” logic) may be consolidated into a single FPGA. While reduced size and system complexity are advantageous byproducts of such consolidation, the primary benefit is cost. Using an FPGA as a catch-all hardware platform is now becoming common, but this effort often ignores the benefits of using the FPGA for primary processing as well as for traditional hardware functions.

In recent years, FPGAs have become available with increasingly powerful embedded soft processor cores. Altera®’s Nios™ processor, for example, is a highly capable 32-bit RISC processor with a wide variety of peripheral options that may be selected and configured (using the SOPC Builder™ software available from Altera) and programmed directly into a high-density Stratix™ or Cyclone™ FPGA.
Such processors make excellent platforms for mixed software/hardware applications, and further increase the integration and value engineering opportunities of FPGAs.

An FPGA-based videoconferencing solution

To examine how the latest FPGA architectures are being used, consider the example of a videoconferencing product. Be Here Technologies (Fremont, California) produces a combined audio/video phone that includes a patented 360° lens (see figure below). The image processing required in this system includes the dynamic de-warping of the toroidal image produced by the single lens, as well as video compression and other computationally-intensive operations.
As with all advanced products of this type, there is a constant need for improved video and sound resolution and sampling rates. The initial analysis for this product suggested that an off-the-shelf processor (acting as a controller for the image warping software) combined with a high-performance DSP device would provide the necessary performance for the application’s video compression elements (which make use of the H.264 compression standard in a custom frame size, operating at rates of up to 20 frames per second). As the team looked at build cost, however, it became clear that the use of the higher-performance DSPs would put the system outside of the budget. Another solution was required.

The Be Here development team began looking at the newest FPGA platforms,
and at the Altera offerings in particular. The team settled on Altera’s
Stratix family because of the availability of large amounts of onboard
memory and the device’s ability to meet volume cost using the Altera
HardCopy™ ASIC conversion program. The system also required a microprocessor
to control the image warping and view generation hardware as well as the
image sensor. Altera’s embedded Nios processor (which is royalty
free and highly configurable) was the natural choice for these functions.

Be Here’s latest products include H.264 compression along with proprietary image processing as part of a comprehensive videoconferencing solution. This portion of the design, as noted earlier, could not meet its build-cost goals using the relatively expensive DSP devices available. By using a lower cost, lower capability DSP alongside the Stratix FPGA, however, the cost goals could be achieved. The design problem then became one of partitioning: determining which functions (representing algorithmic “hot spots”) are most appropriate for FPGA implementation, converting those functions to lower-level hardware descriptions appropriate for FPGAs, and leaving the less critical algorithms on the lower cost DSP.

As in nearly every DSP application, the best solution turns out to be a mixed processor design, in which the application’s less performance-critical components (including the operating system, network stack, user interface, audio codecs and POTS control) reside on the host microprocessor. Computationally-intensive components (including image de-warp, image view generation and compression/decompression acceleration) reside either in a high-end DSP or in dedicated hardware in the FPGA, or both. This requires multiple tools and knowledge of hardware design methods, but provides the greatest benefit in terms of performance for the dollar.

For each processor type in the system (standard processor, DSP and FPGA), there are different advantages, disadvantages and levels of required design expertise to consider. For example, while DSPs are software programmable and require a low initial investment in tools, they require some expertise in DSP-specific design techniques and often require assembly-level programming skills.
FPGAs, on the other hand, require a relatively large investment in design time and tools expertise, particularly when hardware design languages are used as the primary design input method. When compared to the expertise and tools investment required for custom ASIC design, however, FPGAs are clearly the lower-cost, lower-risk solution for developing custom hardware. Indeed, a key factor in the selection of an FPGA for this product was the design process’s simplicity and low risk relative to a custom ASIC approach. Further, by planning for a migration to an Altera HardCopy ASIC, Be Here could also lower the build cost and achieve ASIC-like price/performance. In terms of a payback analysis, the targeted savings after introducing an FPGA is a $35 to $45 reduction in the bill of materials cost, plus a savings of one to two months in the engineering time required to get the product to market.

FPGAs provide additional benefits related to the design process. By using an FPGA throughout the development process, the team was able to incrementally port and verify algorithms that were previously prototyped in software. This was done manually (by hand-translating C code to lower level HDL), but C-based design tools (see below) show great promise in speeding this aspect of the design process.

To speed hardware development and support an iterative approach to design, Be Here created a custom development board allowing image data to be processed through a succession of prototypes, starting with an off-the-shelf, high-resolution camera and frame grabber card installed into a high-end personal computer (running the prototype image warping and display software), then moving to a more comprehensive image chain consisting of a custom-designed camera with integrated image warping (performed on the FPGA). The display was still done on the development PC, again for prototyping purposes.
An important benefit of such an iterative approach is the ability to change the design (move an algorithm to the FPGA, for example) one element at a time. In this project, the developers started by replacing the off-the-shelf camera with the prototype custom camera, while having the development card send data to the frame grabber as if it were the off-the-shelf camera. This allowed the system to be verified using the new lens system without changing anything else in the display chain. Software components were then incrementally moved from the software prototype implementation to the FPGA, with each part being verified before going to the next. This approach dramatically shortened debug time and reduced the risk of introducing hard-to-trace system-wide errors.

C-based design and prototyping tools speed development

Experimenting with mixed hardware/software solutions can be a time-consuming process because of the historic disconnect between software development methods and the lower-level methods required for hardware design, including design for FPGAs. In the above example, the resulting software/hardware design is a collection of software and hardware source files that are not easily compiled, simulated or debugged through a single tool set. In addition, because the hardware design process is inefficient, hardware and software design cycles may be out of sync, requiring system interfaces and fundamental software/hardware partitions and algorithms to be prematurely locked down.

With the advent of C-based FPGA design tools, however, it is now possible to use familiar software design tools and standard C language for a much larger percentage of a given application, and in particular those parts of the design that are algorithmic in nature. Later performance tweaks may introduce hand-crafted HDL code as a replacement for the automatically-generated hardware (just as DSP users will often replace higher-level C code with hand-crafted assembly language).
Because the design can be compiled directly from C code to an initial FPGA implementation, however, the point at which a hardware engineer needs to be brought in to make such performance tweaks is pushed farther back in the design cycle, and the system as a whole can be designed using more productive software design methods.

Tools such as CoDeveloper™ (available from Impulse Accelerated Technologies) allow C language applications to be compiled to create hardware, in the form of FPGA netlists, and also include the necessary C language extensions to allow highly parallel, multiple-process applications to be described. For target platforms that include embedded processors (such as Altera’s Nios soft processor), CoDeveloper can be used to generate the necessary hardware/software interfaces as well as generating low-level hardware descriptions for specific processes.

One key to success with such tools, and with hardware/software approaches in general, is to partition the application appropriately between software and hardware processing resources. A good partitioning strategy must consider not only the computational requirements of a given algorithmic component, but the data bandwidth requirements as well. This is because hardware/software interfaces may represent a significant performance bottleneck.

Making use of a programming model appropriate for highly parallel applications is also important. While it is tempting to off-load specific functions onto an FPGA using legacy programming methods such as remote procedure call (RPC), research has demonstrated that alternate, more dataflow-oriented communication methods are more efficient and less likely to introduce blockages or deadlocks into an application. In many cases this means re-thinking the application as a whole and finding new ways to express data movement and processing. The results of doing so, however, can be dramatic.
By increasing application-level parallelism and taking advantage of programmable hardware resources, for example, it is possible to accelerate common algorithms by orders of magnitude over a software-only implementation.

During the development (or re-engineering) of such applications, design tools can be used to visualize and debug the interconnections of multiple parallel processes. Application monitoring, for example, can provide an overall view of the application and its constituent processes as it runs under the control of a standard C debugger. Such instrumentation can help to quantify the results of a given partitioning strategy by identifying areas of high data throughput that may represent application bottlenecks. When used in conjunction with familiar software profiling methods, these tools allow specific areas of code to be identified for more detailed analysis or performance tweaking. The use of cycle-accurate or instruction-set simulators later in the development process can help to further optimize the application.

Example: an edge-detection image filter

To demonstrate how such tools can be used to move algorithmic processes to an FPGA, consider the problem of image filtering, in which a stream of incoming image data must be processed very quickly (typically by performing some defined calculation against a “window” of adjacent pixels) to generate a converted image stream. Such a problem may involve substantial computation, but is also bandwidth-intensive. In addition, the final implementation must not compromise data throughput in order to increase overall performance. The specific image processing algorithm that we have chosen for this example is an image convolution algorithm, which is a critical step in many image processing algorithms and is representative of other such image processing filters.
We used CoDeveloper to make the conversion from the original C code to a version suitable for hardware compilation in the selected FPGA target and to perform the required C to hardware compilation. We made use of Be Here’s FPGA development board (described above) as well as standard Nios development kits (making use of both Stratix and Cyclone FPGAs) available from Altera to perform the experiments. The specific convolution performed in this test case is an edge-detection function, in which a 3 by 3 pixel window is assembled and processed for each pixel in the source image. Two pipelined hardware processes are written in C to implement this function. One process generates three marching columns of pixels from the source image (which is read as a single stream of pixels), while the second process accepts the results of the first and applies a convolution to each pixel window to produce an output image, represented by a stream of convolved pixels appearing on the output of the second process. The processes and streams are declared and read/written using C-compatible stream I/O routines provided in the Impulse C libraries.
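The two-process structure described above can be sketched in plain C. This is a minimal software model only, not the Impulse C source used in the project: the streams are modeled as ordinary arrays rather than Impulse C stream calls, and the names `column_t`, `gen_columns`, `convolve_columns` and `edge_detect`, as well as the kernel coefficients, are hypothetical choices made for illustration. The first stage also reads the source image with random access, whereas a hardware process fed by a single pixel stream would need line buffers to assemble each column.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 3x3 edge-detection kernel (assumed, not from the article). */
static const int KERNEL[3][3] = {
    { -1, -1, -1 },
    { -1,  8, -1 },
    { -1, -1, -1 }
};

/* One vertical column of three pixels: top, middle, bottom. */
typedef struct { uint8_t p[3]; } column_t;

/* Stage 1: emit one three-pixel column per x position for every interior
 * row, in raster order. Returns the number of columns written to cols. */
size_t gen_columns(const uint8_t *src, int width, int height, column_t *cols)
{
    size_t n = 0;
    for (int y = 1; y < height - 1; y++) {
        for (int x = 0; x < width; x++) {
            cols[n].p[0] = src[(y - 1) * width + x];
            cols[n].p[1] = src[y * width + x];
            cols[n].p[2] = src[(y + 1) * width + x];
            n++;
        }
    }
    return n;
}

/* Stage 2: slide a window of three adjacent columns along the column
 * stream and convolve each complete 3x3 window with KERNEL. The result is
 * the absolute value of the sum, clamped to 255. Returns pixels written. */
size_t convolve_columns(const column_t *cols, size_t ncols, int width,
                        uint8_t *dst)
{
    column_t win[3] = { { {0, 0, 0} }, { {0, 0, 0} }, { {0, 0, 0} } };
    size_t n = 0;
    for (size_t i = 0; i < ncols; i++) {
        int x = (int)(i % (size_t)width);
        win[0] = win[1];          /* shift the window left by one column */
        win[1] = win[2];
        win[2] = cols[i];
        if (x >= 2) {             /* window holds three columns of this row */
            int sum = 0;
            for (int r = 0; r < 3; r++)
                for (int c = 0; c < 3; c++)
                    sum += KERNEL[r][c] * win[c].p[r];
            if (sum < 0) sum = -sum;
            if (sum > 255) sum = 255;
            dst[n++] = (uint8_t)sum;
        }
    }
    return n;
}

/* Chain the two stages. dst must hold (width-2)*(height-2) pixels.
 * Returns the number of output pixels, or 0 on invalid size or failure. */
size_t edge_detect(const uint8_t *src, int width, int height, uint8_t *dst)
{
    if (width < 3 || height < 3) return 0;
    column_t *cols = malloc((size_t)width * (size_t)(height - 2) * sizeof *cols);
    if (cols == NULL) return 0;
    size_t ncols = gen_columns(src, width, height, cols);
    size_t nout = convolve_columns(cols, ncols, width, dst);
    free(cols);
    return nout;
}
```

Calling `edge_detect(src, 512, 512, dst)` on an 8-bit grayscale buffer would fill `dst` with 510 by 510 filtered pixels. In this model the intermediate column array stands in for the stream that connects the two processes; in a hardware implementation the stages would run concurrently, with the second consuming columns as the first produces them.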
Because the algorithm is described using standard C (with the addition of the Impulse C™ libraries) we were able to begin with a software test application (developed using Microsoft® Visual Studio™) that exercises the image algorithm in a desktop simulation environment. This test application combines the two hardware processes associated with the image convolution function, along with a software test bench process (that can be compiled and run as either the PC-based desktop application or as an embedded application running on the Nios processor) that reads data from a TIFF format file for processing. This test was set up and run using standard desktop debugging tools and the CoDeveloper application monitor (shown below). The results were verified before going to the next step and compiling to the target FPGA platform.
Compiling to hardware

After simulating its functionality using standard desktop tools, we were ready to implement the application on a mixed FPGA/processor target using the prototyping board. The Altera Nios development kit includes all hardware and software needed to compile and synthesize hardware and software applications (consisting of the automatically-generated HDL source files representing the hardware processes and the C source files representing the software test process) to the FPGA target. When combined with Impulse CoDeveloper, the Altera-provided software included everything needed to compile and execute the test application from our C language source files.

Our first step was to generate the hardware for the image filter itself. To do this, we selected the Altera Nios platform support package from within the CoDeveloper tools and processed the relevant Impulse C source files. This produced approximately 1200 lines of generated RTL and related hardware/software interface source files. During the compilation process, the two image filter hardware processes were analyzed, individual instructions were scheduled and pipelines were automatically introduced to create highly parallel structures. This automatically generated, process-level parallelism, combined with the explicit (coarse-grained) parallelism represented by the two pipelined C processes, resulted in a highly efficient hardware implementation of the image filter.

Next, a new project was created using the Altera Quartus tools and a Nios processor core was generated (using Altera’s SOPC Builder) that included the necessary peripherals. CoDeveloper’s export software and export hardware features were used to export the generated hardware and software files from CoDeveloper to the newly-created Quartus project. Using the Altera block diagram tools, we connected the generated hardware processes to the Nios processor via the Avalon on-chip bus.
The complete system, including the Nios processor and the generated hardware, was synthesized using Altera’s Quartus tool. The software portions of the application (the standard C version of the image filter and the test functions, including main) were also imported into the Quartus project and compiled using the included Nios compiler.

Evaluating the results

In this example, the explicit pipelining of the two image filtering processes, combined with the automatically-generated (process-level) pipelines produced by the CoDeveloper C to hardware compiler, resulted in a best-case image processing rate of one pixel for every two FPGA clock cycles, which translates to a processing time of approximately 10 ms for a full 512 by 512 image on the target prototyping board.

Of course the absolute performance of any algorithm implemented on an FPGA will depend on I/O factors as well as on the nature of the algorithm itself. Our image filter test case, in which the image data was passed from the test producer running on Nios to the generated FPGA hardware via the Avalon on-chip interconnect, resulted in a substantially lower effective throughput than the maximum (two clock cycle) pixel rate described above. An alternate version of the algorithm, in which pixel data is transferred directly to and from FPGA hardware interfaces, would yield results closer to the best case. It is, therefore, important to consider (and test in hardware when possible) the bandwidth limitations that may factor into application partitioning decisions. Tools such as CoDeveloper can make such evaluations and experiments easier and faster.

Conclusion

Modern high-density FPGAs have proven themselves more than capable of handling advanced, computationally-intensive algorithms and applications. These devices can, therefore, be considered as complements to (or replacements for) off-the-shelf microprocessors and DSPs. Recent advances in software-oriented design tools have made the use of FPGAs for mixed software/hardware applications even more compelling, and have dramatically reduced the time and expertise required to create mixed hardware/software prototypes and end-products.

Peter Baran, Hardware Architect, Be Here Technologies, Inc.
(www.BeHere.com) February 24, 2004
All
material on this site copyright © 2006 techfocus media, inc.
All rights reserved.