AN 870: Stencil Computation Reference Design

ID 683051
Date 10/10/2018
Public

1.2. OpenCL Design

Stencil computations are generally memory-bound. The optimizations in this OpenCL* [1][2] reference design exploit the FPGA in two ways: they parallelize as much of the computation as possible, and they use channels/pipes to saturate the bandwidth of the on-board DDR. Together these maximize GFLOPS and minimize execution time.

Data is initially read from global memory by the compute unit (CU) named "feeder" and written back to global memory by the "writer" CU. Data is read and written in blocks of 16 floats, and the net data bandwidth depends on the kernel frequency and on the read/write rate supported by the on-board DDR and memory controller.
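The dependence on kernel frequency can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes that one 16-float block moves per kernel clock cycle in each direction, which this note does not state, and the function names and the DDR peak figure passed in are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_FLOATS 16  /* floats per block, as described above */

/* Bandwidth the kernel asks for in one direction, in GB/s.
 * Assumption: exactly one block is transferred per kernel clock cycle. */
double stream_demand_gbs(double kernel_mhz)
{
    assert(kernel_mhz > 0.0);
    double bytes_per_cycle = (double)(BLOCK_FLOATS * sizeof(float)); /* 64 bytes */
    return bytes_per_cycle * kernel_mhz * 1e6 / 1e9;
}

/* Net bandwidth is capped by whichever is lower: what the kernel requests
 * or what the DDR and memory controller can sustain (caller-supplied). */
double net_bandwidth_gbs(double kernel_mhz, double ddr_sustained_gbs)
{
    double demand = stream_demand_gbs(kernel_mhz);
    return demand < ddr_sustained_gbs ? demand : ddr_sustained_gbs;
}
```

Under these assumptions a 300 MHz kernel requests 19.2 GB/s per direction, so a memory system that sustains less than that becomes the limiting factor.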

Matrix data is read sequentially from index 0 up to the size of the matrix, and results are written in that same order. Two separate memory objects are allocated in device memory for this kernel:
  • A source location where data from the host is sent via PCIe to be calculated
  • An output location where data is read by the host after kernel execution
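The access pattern described above can be modeled on the host in a few lines. This is a minimal sketch, not the reference design's kernel code: the function name is hypothetical, the two arrays stand in for the source and output memory objects, and the matrix size is assumed to be a multiple of 16 floats.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_FLOATS 16

/* Walk the source buffer from index 0 to n_floats in 16-float blocks and
 * write each block to the same offset of a separate output buffer.
 * Returns the number of blocks streamed. */
size_t stream_blocks(const float *src, float *dst, size_t n_floats)
{
    assert(n_floats % BLOCK_FLOATS == 0); /* assumption: no partial blocks */
    size_t blocks = 0;
    for (size_t off = 0; off < n_floats; off += BLOCK_FLOATS) {
        float block[BLOCK_FLOATS];
        memcpy(block, src + off, sizeof block); /* "feeder": sequential read */
        /* the calculation CUs would transform the block at this point */
        memcpy(dst + off, block, sizeof block); /* "writer": same order */
        blocks++;
    }
    return blocks;
}
```

Keeping the source and output in separate buffers means the kernel never overwrites input it has not yet read.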

The following diagram captures the flow of data into and out of the kernel system:

Between the feeder and writer CUs is the main part of the kernel system: a series of chained and replicated calculation CUs. Data flows into the first calculation CU in the same order it was read from DDR, and the calculated data is piped into the next CU in the chain in the same order. Data coming into a CU is cached, boundaries are updated, and the result of each calculation is immediately sent to the next CU. In this way every CU, and hence every iteration, runs in parallel once enough data has been sent through the chain.
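The chaining idea can be illustrated with a deliberately simplified host-side model. The sketch below is not the reference design: it assumes a 1-D three-point averaging stencil with fixed end points instead of the 2-D matrix case, handles one value at a time instead of 16-float blocks, and uses hypothetical names throughout. Each stage forwards a result as soon as it has the inputs it needs, and the output matches running the same number of full passes one after another.

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 3 /* hypothetical number of chained calculation CUs */

/* One calculation stage remembers the last two values it received. */
typedef struct { float prev2, prev1; size_t seen; } stage_t;

typedef struct {
    stage_t stage[STAGES];
    float *out;      /* stands in for the writer CU */
    size_t out_len;
} chain_t;

static void push(chain_t *c, int k, float v);

/* Forward a result to the next stage, or to the output after the last one. */
static void emit(chain_t *c, int k, float v)
{
    if (k + 1 < STAGES) push(c, k + 1, v);
    else c->out[c->out_len++] = v;
}

/* Stage k receives element i and can now finish element i-1. */
static void push(chain_t *c, int k, float v)
{
    stage_t *s = &c->stage[k];
    if (s->seen == 1)
        emit(c, k, s->prev1);                         /* left boundary unchanged */
    else if (s->seen >= 2)
        emit(c, k, (s->prev2 + s->prev1 + v) / 3.0f); /* interior element */
    s->prev2 = s->prev1;
    s->prev1 = v;
    s->seen++;
}

/* End of stream: each stage releases its last element (right boundary). */
static void flush(chain_t *c, int k)
{
    emit(c, k, c->stage[k].prev1);
    if (k + 1 < STAGES) flush(c, k + 1);
}

/* Stream n values through the whole chain in a single pass. */
void run_chain(const float *in, size_t n, float *out)
{
    assert(n >= 1);
    chain_t c = { .out = out };
    for (size_t i = 0; i < n; i++) push(&c, 0, in[i]);
    flush(&c, 0);
    assert(c.out_len == n);
}

/* Reference: STAGES complete passes over the array, one after another. */
void run_sequential(const float *in, size_t n, float *out)
{
    float a[64], b[64];
    assert(n >= 1 && n <= 64);
    for (size_t i = 0; i < n; i++) a[i] = in[i];
    for (int it = 0; it < STAGES; it++) {
        b[0] = a[0];
        b[n - 1] = a[n - 1];
        for (size_t i = 1; i + 1 < n; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
        for (size_t i = 0; i < n; i++) a[i] = b[i];
    }
    for (size_t i = 0; i < n; i++) out[i] = a[i];
}
```

In this model a stage starts producing output after it has received only two values, so later stages begin working long before the first stage has seen the whole stream. On the FPGA the stages are separate hardware and operate concurrently; the nested calls here only model the order of data movement.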

Within each of these calculation CUs is a local memory system called "cache" that is composed of M20K blocks adjacent to the CUs. The cache must be large enough to store an entire row of the incoming matrix. The height of the matrix can be as large as device memory allows. Matrix attributes are fed forward through the system before stencil computations begin.
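The row-width constraint lends itself to a quick sizing check. The following sketch is illustrative: the function names are hypothetical, the matrix width is assumed to be a multiple of 16 floats, and the M20K count is only a lower bound derived from the raw 20,480-bit capacity of a block. The memory layout the compiler actually generates may use more blocks.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_FLOATS 16
#define FLOAT_BITS   32u
#define M20K_BITS    20480u /* raw storage bits in one M20K block */

/* Number of 16-float blocks that make up one matrix row. */
size_t blocks_per_row(size_t width)
{
    assert(width % BLOCK_FLOATS == 0); /* assumption: no partial blocks */
    return width / BLOCK_FLOATS;
}

/* Lower bound on the M20K blocks needed to hold one row of floats. */
size_t min_m20k_for_row(size_t width)
{
    size_t bits = width * FLOAT_BITS;
    return (bits + M20K_BITS - 1) / M20K_BITS; /* round up */
}

/* A matrix is supported only if a full row fits in the CU's cache. */
int row_fits_cache(size_t width, size_t cache_floats)
{
    return width <= cache_floats;
}
```

For a row of 4096 floats this gives 256 blocks per row and at least 7 M20K blocks per calculation CU, which shows how the supported row width trades off against the number of CUs that fit on the device.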

Each element needs both its left and right neighbors for a stencil computation, so the first and last elements of a block cannot be calculated. As a result, only 14 of the 16 floating-point values being forwarded can be calculated at a time, and boundary conditions are updated between CUs. Each computation requires three blocks of 16 floats to be loaded into private registers: the row being calculated and its top and bottom neighbors. After a calculation is performed, one block with 14 new and 2 outdated values is piped to the next CU.
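The 14-of-16 behavior can be sketched as a single block update. This is a hedged illustration, not the design's kernel: the five-point average used as the stencil formula is an assumption, because this section does not specify the formula, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_FLOATS 16

/* Update one 16-float block from the row being calculated (mid) and the
 * matching blocks of the rows above (top) and below (bot).
 * Elements 0 and 15 lack a left or right neighbor inside the block, so
 * they are forwarded unchanged. Returns the number of elements updated. */
int stencil_block(const float top[BLOCK_FLOATS],
                  const float mid[BLOCK_FLOATS],
                  const float bot[BLOCK_FLOATS],
                  float out[BLOCK_FLOATS])
{
    assert(top != NULL && mid != NULL && bot != NULL && out != NULL);

    out[0] = mid[0];                               /* outdated value */
    out[BLOCK_FLOATS - 1] = mid[BLOCK_FLOATS - 1]; /* outdated value */

    int updated = 0;
    for (int i = 1; i < BLOCK_FLOATS - 1; i++) {
        /* assumed stencil: average of the element and its four neighbors */
        out[i] = 0.2f * (mid[i] + mid[i - 1] + mid[i + 1] + top[i] + bot[i]);
        updated++;
    }
    return updated; /* 14 */
}
```

The two values forwarded unchanged are the ones that the boundary update between CUs has to correct before the next iteration uses them.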

[1] OpenCL™ and the OpenCL logo are trademarks of Apple Inc. used by permission of the Khronos Group™.
[2] The Intel® FPGA SDK for OpenCL™ is based on a published Khronos Specification, and has passed the Khronos Conformance Testing Process. Current conformance status is available at www.khronos.org/conformance.