AN 831: Intel® FPGA SDK for OpenCL™: Host Pipelined Multithread

ID 683013
Date 11/20/2017
Public

1.3.2. Cache Optimizations

Use data prefetching and cache coalescing techniques to optimize cache usage and improve processor performance.

Prefetching

Prefetching data or instructions from slower global memory into the local cache can improve the performance of algorithms on modern processors. Prefetching works best when you access data sequentially rather than from random locations in memory. Blocks of data are copied from global memory to the cache once, which minimizes cache misses and repeated fetches into the cache. For example, consider adding one million numbers from an array in global memory, as shown in the following code snippet:

long Sum = 0;  /* declare the accumulator; long avoids overflow */
for (int i = 0; i < 1000000; ++i) {
    Sum += a[i];
}

For each data access, the CPU checks whether the data is available in its cache; if not, it loads the data from global memory into its local cache and then adds it to Sum. Changing the above code to prefetch data with additional directives reduces cache misses and can improve the performance of the code significantly. *NX platforms have built-in APIs for prefetching:

void __builtin_prefetch (const void *addr,...)

The value of addr is the address of the memory to prefetch. The function has two optional arguments (a usage sketch follows the list):

  • rw: This argument is a compile-time constant (0 or 1).
    • A value of 0 (default) means that the prefetch is preparing for a read.
    • A value of 1 means that the prefetch is preparing for a write to the memory address.
  • locality: This argument is also a compile-time constant integer between zero and three.
    • A value of 3 (default) means that the data has a high degree of temporal locality and should be left in all levels of cache possible.
    • Values of 1 and 2 mean a low or moderate degree of temporal locality, respectively.
    • A value of 0 means that data has no temporal locality, so it need not be left in the cache after the access.
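
As an illustration, the summation loop shown earlier can be rewritten to prefetch data ahead of the current element. This is a minimal sketch, not a tuned implementation: the prefetch distance of 16 elements is an assumption that you would adjust for your cache line size and access pattern.

long Sum = 0;
for (int i = 0; i < 1000000; ++i) {
    /* Prefetch for read (rw = 0) with high temporal locality
       (locality = 3), 16 elements ahead of the current index.
       Data prefetches do not fault on invalid addresses, so
       running past the end of the array is safe. */
    __builtin_prefetch(&a[i + 16], 0, 3);
    Sum += a[i];
}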

Cache Coalescing

CPUs load 64 bytes of data at a time from the cache into registers, update the data, and write it back to the cache. If consecutive memory locations must be updated in consecutive cycles, the CPU adds barriers to prevent incorrect updates; that is, it ensures that the previous update completes before the next data is loaded. This can create performance bottlenecks in computations where the CPU attempts to vectorize and perform multiple operations in parallel. Hence, in scenarios where the processor must update a sequence of memory locations, a different type of cache optimization, cache coalescing, can be used.

Cache coalescing optimizes memory accesses so that the processor can perform multiple operations in parallel. For example, consider the following frequency counter:

for (int i = 0; i < length; ++i) {
    freq_counter[data[i]]++;
}

The statement freq_counter[data[i]]++ is composed of three instruction executions at the processor level: load, increment (update), and store. The simplest form of cache coalescing is to interleave the memory operations or to leave extra padding between adjacent memory locations that must be updated, as shown in the sketch below.
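
The following sketch illustrates the interleaving approach, reusing data and length from the snippet above. The counter range of 256 (byte-valued input) and the choice of four copies are illustrative assumptions. Consecutive iterations update different copies of the counter array, so back-to-back increments no longer touch the same memory locations, and the partial counts are reduced at the end:

#define NUM_COPIES 4    /* illustrative: spreads updates over 4 copies */
#define RANGE      256  /* assumption: byte-valued input data          */

unsigned freq_copy[NUM_COPIES][RANGE] = {{0}};
unsigned freq_counter[RANGE] = {0};

/* Rotate through the copies so that consecutive iterations
   update different memory regions and can overlap. */
for (int i = 0; i < length; ++i) {
    freq_copy[i % NUM_COPIES][data[i]]++;
}

/* Reduce the partial counters into the final histogram. */
for (int c = 0; c < NUM_COPIES; ++c) {
    for (int v = 0; v < RANGE; ++v) {
        freq_counter[v] += freq_copy[c][v];
    }
}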