AN 831: Intel® FPGA SDK for OpenCL™: Host Pipelined Multithread

ID 683013
Date 11/20/2017
Public

1.4.3.1. Huffman Code (Tree) Generation

With the existing FPGA software solutions, one of the main bottlenecks is the Huffman code or tree generation. Huffman code generation first computes the frequency of each character in the input data, and then uses these frequencies to construct the Huffman tree and the code corresponding to each character.
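
The tree-generation step itself is straightforward. The following is a minimal sketch of one common approach (a priority queue of subtree nodes); the Node type and the build_tree name are illustrative assumptions, not taken from this application note:

#include <queue>
#include <vector>

struct Node {
    unsigned freq;
    int symbol;              // -1 marks an internal node
    Node *left, *right;
};

Node *build_tree(const unsigned freq[256]) {
    auto cmp = [](const Node *a, const Node *b) { return a->freq > b->freq; };
    std::priority_queue<Node *, std::vector<Node *>, decltype(cmp)> pq(cmp);

    for (int s = 0; s < 256; ++s)
        if (freq[s] > 0)
            pq.push(new Node{freq[s], s, nullptr, nullptr});

    // Repeatedly merge the two least frequent subtrees; each character's
    // code is then read off its root-to-leaf path.
    while (pq.size() > 1) {
        Node *a = pq.top(); pq.pop();
        Node *b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }
    return pq.empty() ? nullptr : pq.top();
}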

It is observed that the major bottleneck is the frequency computation, not the generation of the tree and codes from those frequencies. The following code snippet computes the frequency table:

unsigned int freq[256] = {0};  // must be zero-initialized before counting

for (int i = 0; i < length; ++i) {
    freq[data[i]]++;
}
Existing implementations that compute the Huffman frequency table on a CPU reached a maximum throughput of up to 4.5 GB/s with multiple threads, but the worst case dropped to 1.2 GB/s when the data was homogeneous. The following two aspects of computing the Huffman data frequency table are considered:
  • Improving worst-case scenarios
  • Improving thread scaling

Worst Case Scenarios

With homogeneous data, the throughput of the frequency computation is always about one fourth of the best case. Each frequency update involves a load, an add, and a store operation (freq[ch]++). If adjacent characters in the data stream are identical or close together in the ASCII table (for example, A and B), the compiler and CPU cannot vectorize and pipeline the updates efficiently. Vectorization performs multiple load, add, and store operations in parallel. CPUs access caches in blocks (usually 64 bytes), so when adjacent instructions update the same block, the compiler and CPU must add barriers that serialize the updates to adjacent locations. For more information, refer to the Prefetching section in the Cache Optimizations topic.

You can solve this issue by adding padding between the frequency table entries that adjacent characters update. In the case of the Huffman frequency counter, the loop was modified to update 16 separate tables, one for each character position within a 16-byte group, and the tables are accumulated into the final table after the loop ends.

unsigned int freq[256] = {0};
unsigned int freq_tab[16][256] = {{0}};

// Each character in a 16-byte group updates its own table, so adjacent
// characters never touch the same cache block. Assumes length is a
// multiple of 16; a remainder loop would handle any leftover bytes.
for (int i = 0; i < length/16; ++i) {
   for (int j = 0; j < 16; ++j) {
      freq_tab[j][data[i*16 + j]]++;
   }
}

// Accumulate the 16 partial tables into the final frequency table.
for (int i = 0; i < 16; ++i) {
   for (int j = 0; j < 256; ++j) {
      freq[j] += freq_tab[i][j];
   }
}

With the above distributed table update, the frequency computation throughput became identical across all input data patterns, including the homogeneous worst case.

Threads in Huffman Frequency Counter

Data is divided into multiple blocks and the frequency table of each block is computed independently in separate threads. The frequency tables are then merged at the end to generate the final table.
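
The following is a minimal sketch of this scheme using std::thread; the names parallel_freq, count_block, and num_threads are illustrative assumptions, not taken from this application note:

#include <array>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Count one block of the input into this thread's private table.
static void count_block(const unsigned char *data, size_t n,
                        std::array<unsigned, 256> &table) {
    for (size_t i = 0; i < n; ++i)
        table[data[i]]++;
}

void parallel_freq(const unsigned char *data, size_t length,
                   unsigned freq[256], unsigned num_threads) {
    std::vector<std::array<unsigned, 256>> tables(num_threads);  // zeroed
    std::vector<std::thread> workers;
    size_t block = length / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        size_t begin = t * block;
        size_t end = (t == num_threads - 1) ? length : begin + block;
        workers.emplace_back(count_block, data + begin, end - begin,
                             std::ref(tables[t]));
    }
    for (auto &w : workers) w.join();

    // Merge the per-thread tables into the final table (freq must be
    // zero-initialized by the caller).
    for (unsigned t = 0; t < num_threads; ++t)
        for (int s = 0; s < 256; ++s)
            freq[s] += tables[t][s];
}

In practice, each thread's inner loop would also use the 16-table update shown earlier so that the single-thread worst case does not reappear inside each block.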

While experimenting, it was observed that the frequency computation did not scale as the number of threads increased. You can improve this by setting the CPU affinity of each thread, that is, by pinning each thread to a specific CPU so that its caches stay valid for longer than usual. By setting CPU affinity on the threads, the best-case throughput of the frequency computation improved to up to 15 GB/s.
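
On Linux with glibc, affinity can be set through pthread_setaffinity_np; the helper below is a minimal sketch under that assumption (the pin_to_core name is illustrative, and other platforms use different calls):

#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a worker thread to one core so its cached frequency table stays
// warm across time slices.
void pin_to_core(std::thread &t, int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}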

Figure 9. Update 16 Tables for 16 Adjacent Chars in the Stream