AN 831: Intel® FPGA SDK for OpenCL™: Host Pipelined Multithread

ID 683013
Date 11/20/2017
Public

1.1. Introduction

The basic idea of this framework is to build a pipelined architecture on the host that accelerates the processing of a large set of input data.

In this architecture, each task is a consumer of the previous task's output and a producer for the next task or stage in the pipeline. Because of this dependency between tasks, the tasks cannot run in parallel for a single input data item, so performance cannot be improved that way.

However, if each task works on a separate input data item, there is no data dependency and all tasks can run concurrently. Eliminating the dependency creates a pipeline of several tasks (that is, the steps of the original algorithm) and can therefore significantly improve the performance of the whole system, especially when processing a large amount of input data. For example, with three stages taking 3 ms, 5 ms, and 2 ms per item, steady-state throughput improves from one result every 10 ms to one result every 5 ms, the time of the slowest stage.
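The concurrent-tasks idea can be sketched in host code as follows. This is a minimal illustration, not the actual framework from this application note: three hypothetical tasks (taskA, taskB, taskC stand in for the steps of an algorithm) each run in their own thread and are connected by simple thread-safe queues, so each stage processes a different input item at any given time.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Thread-safe FIFO connecting two consecutive pipeline stages.
// The producer calls close() when no more items are coming.
template <typename T>
class StageQueue {
public:
    void push(T v) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(std::move(v));
        cv_.notify_one();
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        cv_.notify_all();
    }
    // Blocks until an item is available; returns nullopt once the
    // queue is drained and closed.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};

// Placeholder tasks standing in for the sequential steps of an algorithm.
int taskA(int x) { return x + 1; }
int taskB(int x) { return x * 2; }
int taskC(int x) { return x - 3; }

// Each stage consumes the previous stage's output and produces input
// for the next, but the three stages run concurrently across items.
std::vector<int> run_pipeline(const std::vector<int>& inputs) {
    StageQueue<int> qAB, qBC;
    std::vector<int> out;
    std::thread stageA([&] {
        for (int x : inputs) qAB.push(taskA(x));
        qAB.close();
    });
    std::thread stageB([&] {
        while (auto v = qAB.pop()) qBC.push(taskB(*v));
        qBC.close();
    });
    std::thread stageC([&] {
        while (auto v = qBC.pop()) out.push_back(taskC(*v));
    });
    stageA.join();
    stageB.join();
    stageC.join();
    return out;
}
```

Because each stage is a single thread reading from a FIFO, output order matches input order; the dependency between tasks is preserved per item while different items flow through different stages at the same time.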

The following figures illustrate the performance of such an algorithm, consisting of three sequential tasks, without and with the pipelined architecture, respectively.

Figure 1. Performance of the Original Algorithm for Multiple Input Data
Figure 2. Performance of the Pipelined Algorithm for Multiple Input Data

One of the best applications of this framework is in heterogeneous platforms, where high-throughput hardware is used to accelerate the most time-consuming part of the application. The remaining parts of the algorithm must run in sequential order on other platforms, such as a CPU, either to prepare the input data for the accelerated task or to use that task's output to produce the final output. In this scenario, although part of the algorithm is accelerated, the total system throughput is much lower than the accelerator's throughput because of the sequential nature of the original algorithm.
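The throughput gap described above can be quantified with simple steady-state arithmetic. The sketch below (an illustration, not taken from this application note) compares the per-item period of a three-stage algorithm run sequentially against the same stages run as a full pipeline, where the bottleneck stage sets the pace:

```cpp
#include <algorithm>
#include <cassert>

// Sequential execution: the next item cannot start until the current
// item has passed through every stage, so one result is produced
// every (tA + tB + tC) time units.
double sequential_period(double tA, double tB, double tC) {
    return tA + tB + tC;
}

// Pipelined execution: once the pipeline is full, one result is
// produced every max(tA, tB, tC) time units -- the slowest stage
// (for example, the accelerated task or the slowest CPU step)
// dictates the steady-state throughput.
double pipelined_period(double tA, double tB, double tC) {
    return std::max({tA, tB, tC});
}
```

For instance, if an accelerated middle stage takes 5 units while CPU pre- and post-processing take 3 and 2 units, sequential execution yields one result per 10 units, whereas the pipeline yields one result per 5 units, matching the accelerator's own throughput.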

Intel® has applied this framework to its data compression reference design. The data compression algorithm is based on the producer-consumer model and has several sequential steps. The most demanding task, called Deflate, is accelerated using an FPGA. Yet there are a few tasks or steps that must be performed on the CPU before and after the Deflate process, leading to a high degradation in total system throughput. However, by using the proposed pipelined framework, Intel® was able to achieve very high system throughput (close to the throughput of the Deflate task on the FPGA) for multiple input files.

For more information about the proposed pipelining framework, refer to Pipelining Framework for High Throughput Design. For information about relevant CPU optimization techniques that can improve this pipelined framework further, refer to Optimization Techniques for CPU Tasks. For information about Intel®'s data compression algorithm, refer to Design Example: Data Compression Algorithm.