Intel® FPGA SDK for OpenCL™ Pro Edition: Programming Guide

ID 683846
Date 12/19/2022
Public

5.2.7. Loop Concurrency (max_concurrency Pragma)

You can use the max_concurrency pragma to limit the concurrency of a loop in your kernel.

The concurrency of a loop is the number of iterations of that loop that can be in progress at one time. By default, the Intel® FPGA SDK for OpenCL™ tries to maximize the concurrency of loops so that your kernel runs at peak throughput.

The max_concurrency pragma applies to single work-item kernels (that is, single-threaded kernels) in which loops are pipelined. Refer to the Single Work-Item Kernel versus NDRange Kernel section of the Intel® FPGA SDK for OpenCL™ Pro Edition Best Practices Guide for information on loop pipelining, and on kernel properties that drive the offline compiler's decision on whether to treat a kernel as single-threaded.

The max_concurrency pragma enables you to control the on-chip memory resources required to pipeline your loop. To achieve simultaneous execution of loop iterations, the offline compiler must create copies of any memory that is private to a single iteration. These copies are called private copies. The greater the permitted concurrency, the more private copies the compiler must create.

The kernel's HTML report (report.html) provides the following information pertaining to loop concurrency:

  • Maximum concurrency that the offline compiler has chosen

    This information is available in the Loop Analysis report and the Kernel Memory Viewer:

    • In the Loop Analysis report, a message in the Details pane reports that the maximum number of simultaneous executions has been limited to N.
      Note: N is an unsigned integer, so N ≥ 0. A value of N = 0 indicates unlimited concurrency.
    • In the Kernel Memory Viewer, the bank view of your local memory graphically shows the number of private copies.

  • Impact to memory usage

    This information is available in the Area Analysis report. A message in the Details pane reports that the offline compiler has created N independent copies of the memory to enable simultaneous execution of N loop iterations.

If you want to trade some performance for physical memory savings, apply #pragma max_concurrency <N> to the loop, as shown below. When you apply this pragma, the offline compiler limits the number of simultaneously executed loop iterations to N. The number of private copies of loop memories is also reduced to N.

#pragma max_concurrency 1
for (int i = 0; i < N; i++) {
  int arr[M];
  // Doing work on arr
}
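
For context, the fragment above can be placed inside a complete single work-item kernel. The following is a minimal sketch, not taken from the SDK: the kernel name row_sums, its arguments, and the array length M = 64 are hypothetical, and whether arr is actually implemented as on-chip memory depends on the offline compiler. The #ifndef block is only an assumption-driven convenience that lets the same source compile as plain C for a quick functional check on the host.

```c
/* Host-side fallback (an assumption for illustration): lets this sketch
   compile as plain C. An OpenCL compiler defines __OPENCL_VERSION__ and
   therefore skips this block. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

#define M 64  /* hypothetical length of the per-iteration array */

/* Hypothetical single work-item kernel: it does not query any work-item
   IDs and does all of its work in one pipelined outer loop. */
__kernel void row_sums(__global const int *restrict in,
                       __global int *restrict out,
                       int n) {
  /* Allow at most one outer-loop iteration in flight, so only one
     private copy of arr is needed. */
  #pragma max_concurrency 1
  for (int i = 0; i < n; i++) {
    int arr[M];  /* private to a single iteration of the outer loop */

    /* Fill the private array from row i of the input. */
    for (int j = 0; j < M; j++) {
      arr[j] = in[i * M + j] * 2;
    }

    /* Reduce the private array to one value per row. */
    int sum = 0;
    for (int j = 0; j < M; j++) {
      sum += arr[j];
    }
    out[i] = sum;
  }
}
```

Changing the pragma value does not change the results the kernel computes. It changes only how many iterations the hardware may overlap and, as a consequence, how many copies of arr are created.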

You can also control the number of private copies (created for a local memory and accessed within a loop) by using __attribute__((private_copies(N))). Refer to Memory Attributes for Configuring Kernel Memory Systems for more details about the attribute. If a local memory with __attribute__((private_copies(N))) is accessed within a loop that has #pragma max_concurrency M, the offline compiler limits the number of simultaneously executed loop iterations to min(M,N).
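
The interaction between the attribute and the pragma can be sketched as follows. This is an illustrative example under stated assumptions: the kernel name shift_sums, its arguments, and LEN are hypothetical, and the placement of the attribute between the type and the variable name should be checked against the Memory Attributes for Configuring Kernel Memory Systems section. As in any plain C build, the #ifndef block exists only so that the source can be checked on the host, where the pragma and the attribute are ignored (typically with a warning).

```c
/* Host-side fallback (an assumption for illustration); skipped by an
   OpenCL compiler, which defines __OPENCL_VERSION__. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

#define LEN 64  /* hypothetical length of the per-iteration array */

__kernel void shift_sums(__global const int *restrict in,
                         __global int *restrict out,
                         int n) {
  /* The pragma permits up to 4 iterations in flight (M = 4). */
  #pragma max_concurrency 4
  for (int i = 0; i < n; i++) {
    /* Only 2 private copies are requested (N = 2), so by the rule
       stated above the loop is limited to min(4, 2) = 2 iterations. */
    int __attribute__((private_copies(2))) scratch[LEN];

    for (int j = 0; j < LEN; j++) {
      scratch[j] = in[i * LEN + j] + 1;
    }

    int sum = 0;
    for (int j = 0; j < LEN; j++) {
      sum += scratch[j];
    }
    out[i] = sum;
  }
}
```

In this sketch the attribute is the tighter of the two limits. The Loop Analysis and Area Analysis reports described earlier are the place to confirm the concurrency and number of copies that the offline compiler actually chose.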