

# Migrating Offloading Software to Intel® Xeon Phi<sup>™</sup> Processor

White Paper

February 2018

Document Number: 337129-001US



Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, Intel Xeon Phi, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All Rights Reserved.



# Contents

| Section | Title |
|---------|-------|
| 1 | Introduction |
| 2 | Hardware Configurations |
| 2.1 | Introduction |
| 2.2 | Large and medium clusters |
| 2.3 | Small clusters and workstations |
| 3 | Software Migration |
| 3.1 | Introduction |
| 3.2 | Software tools |
| 3.2.1 | Intel® MPI Library |
| 3.2.2 | Offloading over Fabric |
| 3.3 | Porting application to the Intel® Xeon Phi™ Processor |
| 3.3.1 | Native and MPI (distributed) applications |
| 3.3.2 | Offloading applications |
| 3.4 | Tuning applications for Intel® Xeon Phi™ Processor |
| 3.4.1 | Increased number of cores |
| 3.4.2 | Intel® Advanced Vector Extensions 512 (Intel® AVX-512) |
| 3.4.3 | Cluster modes |
| 3.4.4 | High bandwidth memory |
| 3.4.5 | Load balancing of the application |
| 3.4.6 | Application examples |
| 4 | Benchmarking and Benchmarks Details |

## Figures

| Figure | Title |
|--------|-------|
| 2-1 | Fixed Assignment of Accelerators to Computing Nodes |
| 2-2 | Heterogeneous Computing Cluster |
| 2-3 | Cluster Rack with Network Switches |
| 2-4 | Example Server Rack Using Point to Point Fabric Connections |
| 2-5 | Example of a Workstation Based Setup (1:2) |
| 3-1 | LAMMPS* Ported to Offloading to Intel® Xeon Phi™ Processor |



| Document Number | Revision Number | Description     | Date          |
|-----------------|-----------------|-----------------|---------------|
| 337129          | 001             | Initial Release | February 2018 |

§



# 1 Introduction

The Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor introduced the concept of many core architecture with more than 57 processor cores in one package, 4-way multithreading and 512-bit vector instructions (Intel<sup>®</sup> Initial Many Core Instructions, Intel<sup>®</sup> IMCI). It enabled many usage models and can still be found in numerous machines. At the time of writing, the largest computing cluster with hosts equipped with Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors is ranked 2<sup>nd</sup> on the list of the 500 fastest clusters in the world (TOP500 list from June 2017<sup>1</sup>).

Many different programming models have been adopted for the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor and significant investments have been made into modernization of applications running on it.

There are three main programming models or types of applications that use the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor:

- Native applications running on Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors.
- Distributed applications which use Message Passing Interface (MPI) to communicate between processes (called MPI ranks) on Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors and platforms with Intel<sup>®</sup> Xeon<sup>®</sup> processors.
- Offload applications running on platforms with Intel<sup>®</sup> Xeon<sup>®</sup> processors and using compiler offloading features (Intel<sup>®</sup> Language Extensions for Offload or OpenMP\* target directives) or communicating via Symmetric Communication Interface (SCIF) to execute code on Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors.

The introduction of the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor brought another breakthrough. It not only extended the many core concept by adding more cores (64-72) but also increased computing power, improved 512-bit vector instructions and added a new type of memory (MCDRAM). Finally, the next generation of devices can work independently as the main CPU of a bootable platform.

Because the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor is an independent device, the programming models known for Intel<sup>®</sup> Xeon<sup>®</sup> processors can be employed, and porting highly parallel applications to the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor is relatively straightforward. However, there are applications that have both strong serial and strong parallel parts and take advantage of the heterogeneous nature of Intel<sup>®</sup> Xeon<sup>®</sup> hosts equipped with Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors.

This white paper proposes migration paths for such applications (referred to as heterogeneous applications in the document), from both the hardware and the software perspective.

§

<sup>1.</sup> https://www.top500.org/lists/2017/06/





# 2 Hardware Configurations

### 2.1 Introduction

There are three Intel<sup>®</sup> hardware products that can not only replace Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors in computing platforms, but also offer better functionality and improved performance:

- Intel<sup>®</sup> Xeon<sup>®</sup> processors can power workstations or rack-mounted servers. Visit this link to learn more about Intel<sup>®</sup> Xeon<sup>®</sup> processors.
- Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processors are usually mounted in rack-mounted servers, but are also available in workstations (http://dap.xeonphi.com/). Follow this link to learn more about Intel<sup>®</sup> Xeon Phi<sup>™</sup> x200 product family.
- Intel<sup>®</sup> Omni-Path Architecture (Intel<sup>®</sup> OPA) fast fabric adapters are usually separate PCI-e extension cards, but can also be integrated in Intel<sup>®</sup> Xeon Phi<sup>™</sup> x200 processors. Intel<sup>®</sup> Omni-Path Architecture is designed to connect nodes in clusters, but also supports point to point communication, so just two nodes can be connected with fast fabric, for example a platform with Intel<sup>®</sup> Xeon<sup>®</sup> processor and a workstation with Intel<sup>®</sup> Xeon Phi<sup>™</sup> x200 processor. It is also possible to install two Intel<sup>®</sup> OPA adapters in one machine to create a one-to-two configuration. Visit this link for more information.

### 2.2 Large and medium clusters

Traditionally Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors in large clusters were installed in servers with Intel<sup>®</sup> Xeon<sup>®</sup> processors, creating what can be called a hardware-defined heterogeneous topology. The following figure illustrates this approach. Usually, all nodes were connected with a fast fabric.



#### Figure 2-1. Fixed Assignment of Accelerators to Computing Nodes

If highly parallel applications with relatively small serial parts prevail in the anticipated usage model of a cluster, a homogeneous cluster of Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based nodes should be the first choice. If the planned usage model contains applications with both serial and parallel parts, the topology shown in Figure 2-1, "Fixed Assignment of Accelerators to Computing Nodes", can be extended by adding Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based nodes and connecting them with a fast fabric to the original cluster. Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors could then be removed, as Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processors can take over their responsibilities.



#### Figure 2-2. Heterogeneous Computing Cluster

Figure 2-2 shows a heterogeneous cluster with systems based on Intel<sup>®</sup> Xeon<sup>®</sup> processors and Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processors. Heterogeneous computing nodes can be joined in different configurations based on requirements of particular applications. This flexibility is usually provided by a fast fabric network, such as Intel<sup>®</sup> OPA. Both types of servers can reside in one rack or cabinet (the following figure) or in separate racks.

#### Figure 2-3. Cluster Rack with Network Switches



### 2.3 Small clusters and workstations

Creating a fast fabric network for small clusters and in workstation environments can significantly increase the cost of the installation, mainly because of the cost of fast fabric network switches. However, it is possible to create small setups using Intel<sup>®</sup> OPA adapters in a point to point configuration. The following figure shows an example configuration of a server rack with four Intel<sup>®</sup> Xeon<sup>®</sup> processor based servers connected using point to point connections to four servers with Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processors, which were installed in a single 2U server chassis.

#### Figure 2-4. Example Server Rack Using Point to Point Fabric Connections





#### Figure 2-5. Example of a Workstation Based Setup (1:2)



Both rack-mounted servers and workstations with Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processors are available (see http://dap.xeonphi.com/). It is therefore possible to build workstation setups connected with Intel<sup>®</sup> OPA fast fabric and, additionally, with regular Ethernet connections. This topology can use either a one to one connection or two Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based workstations serving one system with an Intel<sup>®</sup> Xeon<sup>®</sup> processor (see Figure 2-5). All software solutions that can be used for migration will work on such a setup.





# 3 Software Migration

### 3.1 Introduction

Porting different application types described in Chapter 1, "Introduction" can be quite straightforward if the right tools are utilized in the process.

Migration of any application running on the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor consists of two major steps:

1. Porting the application to run on the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based system.
2. Tuning the application to utilize available resources in the best possible way.

### 3.2 Software tools

#### 3.2.1 Intel<sup>®</sup> MPI Library

Intel<sup>®</sup> MPI Library is a highly optimized implementation of the Message Passing Interface (MPI), a communication standard defined by the MPI Forum. The standard consists of hundreds of functions, but the simplest program can be built using just a few. The most basic communication is provided by *MPI\_Send* and *MPI\_Recv* pairs, used to exchange messages between processes (or ranks in the MPI nomenclature) that can run on different machines. MPI implementations are often highly optimized for a particular network type or even a particular network topology. The following example code demonstrates a simple program using MPI:

```
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank;
    int ranks_no;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ranks_no);

    if (my_rank == 0)
    {
        // Orchestrator
        int other_rank;
        for (other_rank = 1; other_rank < ranks_no; other_rank++)
        {
            // Receive rank numbers from other ranks
            int other_rank_received = -1;
            MPI_Recv(&other_rank_received, 1, MPI_INT,
                     other_rank, 1, MPI_COMM_WORLD, &status);
            printf("Rank %d reported!\n", other_rank_received);
        }
    }
    else
    {
        // Report own rank to the orchestrator
        MPI_Send(&my_rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Compile and run the example using the Intel<sup>®</sup> MPI Library compiler wrapper and launcher:

\$ mpiicc -o basic\_mpi basic\_mpi.c

\$ mpirun -np 10 ./basic\_mpi

The *-np* parameter instructs the MPI runtime to start 10 ranks on the local machine.

To learn more about Intel<sup>®</sup> MPI Library visit https://software.intel.com/en-us/intel-mpi-library.

#### 3.2.2 Offloading over Fabric

The Intel<sup>®</sup> C/C++ and Fortran Compilers support offloading directives in the source code. This feature allows the application developer to specify which parts of the program will be offloaded to Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors or to Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based nodes. The Intel<sup>®</sup> offloading runtime supports Intel<sup>®</sup> Language Extensions for Offload (Intel<sup>®</sup> LEO) and OpenMP\* target directives.

An example code using OpenMP\* target directives to execute code on a coprocessor can look like this:

```
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    printf("Example: Offload using OpenMP target directives\n");
    printf("Hello from the host!\n");

    #pragma omp target
    {
        // Abort if the region was not offloaded to the target
        if (omp_is_initial_device() != 0)
        {
            printf("Offload executed on host.\n");
            exit(-1);
        }
        printf("Hello from the target!\n");
    }
    return 0;
}
```

To compile this program and run it on a host equipped with an Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor, execute:

```
$ icc -qopenmp -o omp_basic omp_basic.c
$ ./omp_basic
```

With the introduction of the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor, the offloading programming model is implemented as Offload over Fabric (OOF) and enables offloading to compute nodes connected within a high-speed network, such as Intel<sup>®</sup> OPA. Communication with the networking layer is realized through the OpenFabrics Interfaces (OFI) API. OOF allows easy porting of applications using offloading programming models. To compile the same code example for Offload over Fabric and run it between a host with an Intel<sup>®</sup> Xeon<sup>®</sup> processor named *host* and a host with an Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor named *target*, execute:

```
$ icc -qopenmp -qoffload-arch=mic-avx512 -o omp_basic omp_basic.c
$ OFFLOAD_NODES=target ./omp_basic
```

To learn more about Offload over Fabric visit https://software.intel.com/en-us/articles/how-to-use-offload-over-fabric-with-knights-landing-intel-xeon-phi-processor.

### 3.3 Porting application to the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor

#### 3.3.1 Native and MPI (distributed) applications

Porting native and MPI applications to Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based platforms can be very straightforward. In many cases only a simple recompilation for new hardware is required to have a running application:

```
$ icc -xMIC-AVX512 ...
$ mpiicc -xMIC-AVX512 ...
```

This strategy can sometimes fail if the application uses explicit vectorization (intrinsics). The Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor uses the Intel<sup>®</sup> Initial Many Core Instructions (Intel<sup>®</sup> IMCI) instruction set, which is not fully compatible with Intel<sup>®</sup> AVX-512. Code can be ported to the new instruction set or modernized to use Intel<sup>®</sup> Compiler auto-vectorization features or one of the standard approaches to vectorization, e.g. OpenMP\* SIMD directives. The last approach is highly recommended because it makes the code more portable in the future.
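As an illustration of the OpenMP\* SIMD approach, the following minimal sketch vectorizes a simple loop without any intrinsics. The function name *saxpy* and its signature are illustrative and are not taken from any particular application. The directive takes effect when the compiler's OpenMP\* support is enabled (for example with *-qopenmp*) and is otherwise ignored, so the loop still compiles as plain C.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * The simd directive asks the compiler to vectorize the loop; restrict
 * tells it that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
    {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the loop contains no instruction-set-specific code, the same source can be recompiled with *-xMIC-AVX512* and the choice of vector instructions is left to the compiler.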

#### 3.3.2 Offloading applications

#### Offloading using compiler directives



Offloading applications can require a little more work, mainly connected to installation and configuration of the Offload over Fabric runtime software. The runtime (for both Intel<sup>®</sup> Xeon<sup>®</sup> processor and Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based nodes) can be found on the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor Software page. The website also contains detailed documentation on the required configuration of nodes and of the offloading runtime.

Once the Offload over Fabric software is installed and tested, the offloading application can be recompiled for offloading to an Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based host. For applications using OpenMP\* target directives, the *-qoffload-arch=mic-avx512* switch should be added to the compiler options:

\$ icc -qopenmp -qoffload-arch=mic-avx512 ...

To compile applications using Intel<sup>®</sup> Language Extensions for Offload (Intel<sup>®</sup> LEO), the *-qoffload* switch should be replaced with the *-qoffload-arch=mic-avx512* compiler option:

\$ icc -qoffload-arch=mic-avx512 ...

There are two situations when code change may be required:

1. Explicit vectorization is used in the offloaded code (i.e. intrinsics)

2. \_\_MIC\_\_ preprocessor definition is used in the offloaded code

The first case was addressed in Section 3.3.1, "Native and MPI (distributed) applications."

The *\_\_MIC\_\_* preprocessor macro allows a code segment to be compiled only when the code is compiled for the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor.
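A minimal sketch of such a guard is shown below, assuming it has the same shape as the *\_\_TARGET\_ARCH\_MIC* listing later in this section. The helper name *ran\_on\_coprocessor* and the *on\_mic* flag are hypothetical and serve only to make the effect of the macro observable.

```c
/* Hypothetical helper: returns 1 if the offload region was compiled for
 * and executed on an x100 coprocessor, 0 if the host variant ran. */
int ran_on_coprocessor(void)
{
    int on_mic = 0;

    #pragma omp target map(from: on_mic)
    {
#ifdef __MIC__
        /* Variant compiled for the Intel Xeon Phi x100 coprocessor */
        on_mic = 1;
#else
        /* Variant compiled for the Intel Xeon processor (host) */
        on_mic = 0;
#endif
    }
    return on_mic;
}
```

When the source is built without offload support, only the host branch is compiled and the function returns 0.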

This construct is possible because each offload region is implicitly compiled twice: once for the Intel<sup>®</sup> Xeon<sup>®</sup> processor and once for the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor. This allows offload regions to be executed on the host processor if no coprocessor is present in the system. For the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x200 processor the *\_\_MIC\_\_* macro has been replaced with the *\_\_TARGET\_ARCH\_MIC* macro, and the code of the application may have to be changed to reflect that. The above example, after modification for running on the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor, should look like this:

```
#pragma omp target
{
#ifdef __TARGET_ARCH_MIC
    // This code is executed on
    // Intel® Xeon Phi™ Processor (target)
#else
    // This code is executed on Intel® Xeon® processor (host)
#endif
}
```



It is worth noting that the OpenMP\* target directives can be automatically executed on the host processor if no coprocessor is available. It is also possible to compile code using Intel<sup>®</sup> Language Extensions for Offload so that the presence of a coprocessor in the system is not required and the code will be executed entirely on the host:

\$ icc -qoffload=optional ...

This method carries the cost of executing offload semantics without doing actual offloading, but applications with a small amount of serial work can perform reasonably well on Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based systems. Those applications can later be modernized so that they do not use the offloading model at all.

#### Offloading using SCIF API

There is a subset of offloading applications that use Symmetric Communication Interface (SCIF) to communicate with the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor and execute offloaded code in a client/server model. Such applications can be easily migrated to other APIs working in highly optimized network environments, such as MPI. The table below shows an example mapping of SCIF functions to MPI functions.

| SCIF functionality                      | SCIF functions                                             | MPI functions                              |
|-----------------------------------------|------------------------------------------------------------|--------------------------------------------|
| Connection management                   | scif_connect, scif_listen, scif_accept                     | Not necessary (handled by the MPI runtime) |
| Sending/receiving messages              | scif_send/scif_recv                                        | MPI_Send/MPI_Recv                          |
| Remote Direct Memory Access<br>(RDMA)   | scif_writeto/scif_readfrom<br>scif_vwriteto/scif_vreadfrom | MPI_Put/MPI_Get                            |
| Memory registration for RDMA            | scif_register/scif_unregister                              | MPI_Win_create/MPI_Win_free                |
| Data transfer (RDMA)<br>synchronization | scif_fence_wait                                            | MPI_Win_fence                              |

This is an example mapping that may not lead to optimal code, so more careful investigation of the application algorithms and MPI features should be performed before pursuing this migration path.

To achieve full performance the application should be compiled into two binaries: one optimized for the Intel<sup>®</sup> Xeon<sup>®</sup> processor and one optimized for the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor. Both binaries should be started using a single *mpirun* command, which establishes communication between ranks running on different nodes. For example, the command line below starts one rank (from the binary file named *app\_mpi*) on a machine with hostname *hostname* and another rank (from the binary file optimized for the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor named *app\_mpi\_phi*) on a machine called *targetname*:

\$ mpirun -np 1 -hosts <hostname> ./app\_mpi : \
 -np 1 -hosts <targetname> ./app\_mpi\_phi



### 3.4 Tuning applications for Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor

#### 3.4.1 Increased number of cores

Applications written with portability in mind, i.e. not hardcoded for the number of cores available on Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor, should scale to more cores. However, some new bottlenecks can emerge, so careful testing of the scalability of the application should be performed. Use tools such as Intel<sup>®</sup> VTune Amplifier XE to diagnose any potential issues.
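One way to avoid hardcoding the core count is to query it at run time and derive the work distribution from it. The sketch below is a minimal illustration, assuming a Linux\* system where *sysconf(\_SC\_NPROCESSORS\_ONLN)* is available; the helper names *online\_cpus* and *chunk\_bounds* are hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Number of logical CPUs currently online; falls back to 1 if the
 * query is not supported. */
long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Computes the half-open range [*begin, *end) of items assigned to
 * worker w out of nworkers. The remainder of total / nworkers is spread
 * one item at a time over the first workers. */
void chunk_bounds(size_t total, size_t nworkers, size_t w,
                  size_t *begin, size_t *end)
{
    size_t base = total / nworkers;
    size_t extra = total % nworkers;

    *begin = w * base + (w < extra ? w : extra);
    *end = *begin + base + (w < extra ? 1 : 0);
}
```

With this scheme the same binary divides its work over however many logical CPUs the node reports, whether it runs on a coprocessor, a host processor or an Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor.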

#### 3.4.2 Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

The SIMD width of Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor vector instructions is the same as the width of Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor instructions, but the Instruction Set Architecture (ISA) is not fully compatible. One of the most important changes is that the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor (unlike the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor) also supports the SSE and Intel<sup>®</sup> AVX2 instruction sets. In portable programs most of the issues connected to vectorization should be handled by the compiler and runtimes, but the utilization of the vector processing units (VPUs) and vectorization quality should be monitored using tools such as Intel<sup>®</sup> Vector Advisor and Intel<sup>®</sup> Compiler vectorization reports. Those tools can advise users on how to refactor their code to release its full potential when run on the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor.

#### 3.4.3 Cluster modes

The Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor can run in several cluster (cluster on a die) modes, but for most workloads Quadrant mode is recommended. Some codes can achieve better performance when the processor runs in SNC-4 cluster mode. In this mode the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor is virtually divided into four (in Cache memory mode) or eight (in Flat memory mode) Non-uniform Memory Access (NUMA) nodes. This affects how the operating system kernel works and may have both negative and positive effects on the performance of the application. It is recommended to test both SNC-4 and Quadrant modes to find the most suitable mode for a particular application.

#### 3.4.4 High bandwidth memory

The Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor is equipped with 16 GB of high bandwidth memory (MCDRAM). This memory can be used automatically to extend the existing cache hierarchy (when the processor runs in Cache memory mode) or can extend the available physical address space (when the processor runs in Flat mode). Note that the latency of MCDRAM can be slightly higher than the latency of DDR4 memory<sup>1</sup>. To saturate the bandwidth, it is recommended to access the MCDRAM from many threads running on the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor.

The Non-uniform Memory Access (NUMA) mechanism is used in Flat and Hybrid memory modes to expose the MCDRAM memory to the operating system. Learn more about this mechanism.

<sup>1.</sup> https://sites.utexas.edu/jdm4372/2016/12/06/memory-latency-on-the-intel-xeon-phi-x200-knights-landing-processor/

#### Native/MPI applications



There are two approaches to using MCDRAM memory in Flat mode by native (and MPI) applications:

1. Using NUMA control features of the Linux\* operating system.
2. Using explicit API calls, either system calls or a library such as the *memkind* library.

The first method can be used for workloads that require relatively small amounts of memory (up to 16GB) and therefore can fit entirely within MCDRAM. In Quadrant cluster mode, the MCDRAM memory is assigned to NUMA node 1 and the application can be bound to this node:

```
$ numactl -m 1 ./application
```

This command instructs the operating system to use strict *bind* policy: all allocations will be performed by the operating system from NUMA node 1 and therefore from MCDRAM memory. It is also possible to use the *preferred* policy:

```
$ numactl -p 1 ./application
```

This policy instructs the operating system to allocate from the NUMA node 1 (and thus MCDRAM) first and, once this resource is exhausted, from regular DDR memory. The performance of an application using preferred policy can be less predictable. The second method (i.e. using the *memkind* library) allows the user to place only selected allocations in MCDRAM memory and can often achieve better results. Visit the *memkind* website and learn more about the library.
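The second method can be sketched as follows: only a buffer known to be bandwidth-critical is requested from MCDRAM, and everything else keeps using the regular allocator. This is a minimal sketch rather than a recommended implementation. The *hbw\_check\_available*, *hbw\_malloc* and *hbw\_free* calls come from the *hbwmalloc.h* interface of the *memkind* library; the *USE\_MEMKIND* macro, the *hot\_buffer* type and the *hot\_alloc*/*hot\_free* helpers are hypothetical names introduced for this illustration.

```c
#include <stdlib.h>
#ifdef USE_MEMKIND              /* hypothetical opt-in build switch */
#include <hbwmalloc.h>
#endif

/* Buffer handle that remembers which allocator produced the memory. */
typedef struct {
    void *ptr;
    int in_hbw;                 /* 1 if ptr came from hbw_malloc */
} hot_buffer;

/* Tries high bandwidth memory first (when built with USE_MEMKIND and
 * such memory is present), otherwise uses the regular heap. */
hot_buffer hot_alloc(size_t size)
{
    hot_buffer b = { NULL, 0 };
#ifdef USE_MEMKIND
    if (hbw_check_available() == 0)
    {
        b.ptr = hbw_malloc(size);
        b.in_hbw = (b.ptr != NULL);
    }
#endif
    if (b.ptr == NULL)
    {
        b.ptr = malloc(size);
    }
    return b;
}

/* Releases the buffer with the allocator that created it. */
void hot_free(hot_buffer *b)
{
#ifdef USE_MEMKIND
    if (b->in_hbw)
    {
        hbw_free(b->ptr);
        b->ptr = NULL;
        b->in_hbw = 0;
        return;
    }
#endif
    free(b->ptr);
    b->ptr = NULL;
}
```

When built with *-DUSE\_MEMKIND -lmemkind*, a call such as *hot\_alloc(n \* sizeof(double))* places that one array in MCDRAM if it is available, while the rest of the application continues to allocate from DDR memory.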

The OpenMP\* committee is currently working on new features in the OpenMP\* standard (see OpenMP\* TR5 document for details) that will expose new kinds of memory to the users in an easy and portable way.

#### Offload applications

Offload regions running on Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based nodes use MCDRAM memory by default if it is available. The runtime can be configured to use different memory policies. The user can define the first kind of memory to be used (MCDRAM or DDR) and the fallback mechanism. The fallback defines the action taken when the first kind of memory is exhausted: allocate from the other kind of memory or simply let the operating system abort the application.

It is also possible to use the *memkind* library to selectively place allocations made by the offloaded regions in the MCDRAM memory and use OpenMP\* target pointers to register those allocations in the OpenMP\* runtime:

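```
// Allocate the buffer in MCDRAM on the target; target_data then
// holds a device pointer
#pragma omp target map(from: target_data) \
    is_device_ptr(target_data)
{
    target_data = (double *)hbw_malloc(SIZE);
}

// Copy the input from the host (source) to device 0 (destination)
omp_target_memcpy(target_data, host_data, SIZE, 0, 0, 0,
                  omp_get_initial_device());

#pragma omp target is_device_ptr(target_data) map(from: result)
{
    result = compute(target_data);
}

// Release the MCDRAM buffer on the target
#pragma omp target is_device_ptr(target_data)
{
    hbw_free(target_data);
}
```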

This code can be compiled with the following command line:

```
$ icc <...> -lmemkind -qoffload-option,mic,compiler,\
'-lmemkind' <...>
```

See Offload over Fabric documentation for details about configuring the offloading runtime.

#### 3.4.5 Load balancing of the application

Some offloading and heterogeneous MPI applications implement an automatic load balancing algorithm for dividing workloads between the Intel<sup>®</sup> Xeon<sup>®</sup> processor and the Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor. Those applications should also work well on Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based nodes. Programs without those mechanisms should be analyzed taking into account the fact that the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor is more powerful. Consequently, the work division that worked well for Intel<sup>®</sup> Xeon<sup>®</sup> processors and Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessors will most likely no longer be appropriate when work is shared between Intel<sup>®</sup> Xeon<sup>®</sup> processors and Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor based hosts.
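For programs with a fixed work division, one simple way to re-derive the split is to time the same sample chunk on both sides and divide the work in proportion to the measured throughput. The sketch below illustrates that idea only; it is not the algorithm used by any particular application, and the helper names *target\_share* and *target\_items* are hypothetical.

```c
#include <stddef.h>

/* Fraction of the work to give to the offload target, given the time
 * each side needed for the same sample chunk. Throughput is 1/time, so
 * the target share is (1/t_target) / (1/t_host + 1/t_target), which
 * simplifies to t_host / (t_host + t_target). */
double target_share(double host_seconds, double target_seconds)
{
    if (host_seconds <= 0.0 || target_seconds <= 0.0)
    {
        return 0.5;             /* no usable measurement: split evenly */
    }
    return host_seconds / (host_seconds + target_seconds);
}

/* Number of items (rounded to nearest) assigned to the target. */
size_t target_items(size_t total, double share)
{
    size_t n = (size_t)((double)total * share + 0.5);
    return n > total ? total : n;
}
```

For instance, if the host needs 3 seconds and the target 1 second for the same chunk, the target receives three quarters of the work. Re-measuring after migration replaces a ratio that was tuned for the x100 coprocessor.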

#### 3.4.6 Application examples

The techniques described in this section were successfully applied to existing applications and showed that the presented approach can improve their performance. To confirm this, we recompiled LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) for Offload over Fabric and compared the results of running one of the Liquid Crystal benchmarks with the results achieved on a machine with two Intel<sup>®</sup> Xeon<sup>®</sup> processors and on the same machine with an Intel<sup>®</sup> Xeon Phi<sup>™</sup> x100 coprocessor. It was a naive port, with no additional optimizations for the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor. See Chapter 4, "Benchmarking and Benchmarks Details" for more information on benchmarking and configuration.





#### Figure 3-1. LAMMPS\* Ported to Offloading to Intel® Xeon Phi<sup>™</sup> Processor

Another example of successful migration is porting the Parallel Tissue Modeling Framework (Timothy). Learn more about the results of this effort.







# 4 Benchmarking and Benchmarks Details

Software and workloads used in performance tests may have been optimized for performance on Intel<sup>®</sup> microprocessors only.

Performance tests, such as SYSmark\* and MobileMark\*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including performance of combinations of products. For more complete information visit www.intel.com/benchmarks.

Intel<sup>®</sup> measured results as of November 2016 using LAMMPS\* benchmarks from the *USER-INTEL* package (*src/USER-INTEL/TEST*) with naive port to Offload over Fabric (compilation for offloading to Intel® Xeon Phi<sup>™</sup> Processor).

#### OFFLOAD HOST AND BASELINE CONFIGURATION:

- Dual Socket Intel<sup>®</sup> Xeon<sup>®</sup> processor E5-2699 v3 (45 M Cache, 2.3 GHz, 18 Cores) with Intel<sup>®</sup> Hyper-Threading and Turbo Boost Technologies enabled
- 64 GB DDR4-2133 MHz memory
- Red Hat Enterprise Linux\* 7.2 (Maipo)
- Intel<sup>®</sup> Omni-Path Host Fabric Interface Adapter 100 Series 1 Port PCIe\* x16
- Intel<sup>®</sup> Server Board S2600WT2
- 500GB SATA drive ST500DM002 and 1TB ST1000NM0033 Disks.

14 MPI ranks and 2 threads on the Intel<sup>®</sup> Xeon<sup>®</sup> processor based host used in the benchmark, Intel<sup>®</sup> package, 1 offload target, automatic load balancing.

#### OFFLOAD TARGET CONFIGURATION:

- One node Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor 7250 (16 GB, 1.4 GHz, 68 Cores) in Intel<sup>®</sup> Server System LADMP2312KXXX41
- 64GB DDR4, quad cluster mode
- MCDRAM flat memory mode
- Red Hat Enterprise Linux\* 7.2 (Maipo)
- Intel<sup>®</sup> Omni-Path Fabric Interface
- 250GB SATA WD2502ABYS-0 System Disk.
- Intel<sup>®</sup> Compiler 17.0.0
- Intel<sup>®</sup> MPI Library 2017
- LAMMPS\* code base: 30 Jul 16

