How to Use This Guide

This guide is for end users and application developers working with the NVIDIA® Grace CPU who want to achieve optimal performance for key benchmarks and applications (workloads). It includes procedures, sample code, reference performance numbers, recommendations, and technical best practices directly related to the NVIDIA Grace CPU. Following the instructions given in this guide will help you realize the best possible performance for your particular system.

This guide is a living document that is frequently updated with the latest recommendations, so it is best read online at https://nvidia.github.io/grace-cpu-benchmarking-guide/. If you want to help improve the guide, you can create a GitHub issue at https://github.com/NVIDIA/grace-cpu-benchmarking-guide/issues/new.

Understanding Workload Performance

Workload performance depends on many aspects of the system, so the measured performance of your system may be different from the performance figures presented here. These figures are provided as guidelines and should not be interpreted as performance expectations or targets. Do not use this guide for platform validation.

The guide is divided into the following sections:

  • Platform Configuration: This section helps you tune your system and optimize the platform configuration for benchmarking.
  • Foundational Benchmarks: After checking the platform configuration, this section helps you complete a sanity check and confirm that the system is healthy.
  • Common Benchmarks: This section has information about the industry-recognized benchmarks and mini-apps that represent the performance of key workloads.
  • Applications: This section has information about maximizing the performance of full applications.
  • Developer Best Practices: This section has general best practices for developing on NVIDIA Grace.

The sections can be read in any order, but we strongly recommend you begin by tuning and sanity checking your platform.

Platform Configuration

Before benchmarking, you should check whether the platform configuration is optimal for the target benchmark. The optimal configuration can vary by benchmark, but there are some common high-level settings of which you should be aware. Most platforms benefit from the settings shown below.

Info

Refer to the NVIDIA Grace Performance Tuning Guide and the platform-specific documentation at https://docs.nvidia.com/grace/ for instructions on how to tune your platform for optimal performance.

Warning

The settings shown on this page are intended to maximize system performance and may affect system security.

Linux Kernel

The following Linux kernel command line options are recommended for performance:

  • init_on_alloc=0: Do not fill newly allocated pages and heap objects with zeroes by default.
  • acpi_power_meter.force_cap_on=y: Enable the ACPI power meter with power capping.
  • numa_balancing=disable: Disable automatic NUMA balancing.

You can confirm these command line options are set by reading /proc/cmdline:

cat /proc/cmdline | tr ' ' '\n'
BOOT_IMAGE=/boot/vmlinuz-6.2.0-1012-nvidia-64k
root=UUID=76c84c6d-a59f-4a8d-903e-4cb9ef69b970
ro
rd.driver.blacklist=nouveau
nouveau.modeset=0
earlycon
module_blacklist=nouveau
acpi_power_meter.force_cap_on=y
numa_balancing=disable
init_on_alloc=0
preempt=none
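
If any of these options are missing, one way to add them on Ubuntu-style systems is to append them to the kernel command line via GRUB. The following is a minimal sketch that assumes GRUB is the bootloader and /etc/default/grub is in use; adapt the file name and update command to your distribution.

# Back up the file, append the recommended options, and regenerate the GRUB configuration.
sudo cp /etc/default/grub /etc/default/grub.bak
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 init_on_alloc=0 acpi_power_meter.force_cap_on=y numa_balancing=disable"/' /etc/default/grub
sudo update-grub   # some distributions use grub2-mkconfig -o /boot/grub2/grub.cfg instead
# Reboot for the new command line to take effect.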

CPU and Memory

  • Use the performance CPU frequency governor:

    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
  • Disable address space layout randomization (ASLR):

    sudo sysctl -w kernel.randomize_va_space=0
    
  • Drop the caches:

    echo 3 | sudo tee /proc/sys/vm/drop_caches
    
  • Set the kernel dirty page values to the default values:

    echo 10 | sudo tee /proc/sys/vm/dirty_ratio
    echo 5 | sudo tee /proc/sys/vm/dirty_background_ratio
    
  • To reduce disk I/O, check for dirty page writeback every 60 seconds:

    echo 6000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
    
  • Disable the NMI watchdog:

    echo 0 | sudo tee /proc/sys/kernel/watchdog
    
  • Optional: allow unprivileged users to measure system events. Note that this setting has implications for system security; see https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html for additional information.

    echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
    

Networking

  • Set the networking connection tracking size:

    echo 512000 | sudo tee /proc/sys/net/netfilter/nf_conntrack_max
    
  • Before starting the test, allow the kernel to reuse TCP ports which may be in a TIME_WAIT state:

    echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_reuse
    

Device I/O

  • Enable full power for generic devices:

    for i in `find /sys/devices/*/power/control` ; do
        echo 'on' | sudo tee ${i}
    done
    
  • Enable full power for PCI devices:

    for i in `find /sys/bus/pci/devices/*/power/control` ; do
        echo 'on' | sudo tee ${i}
    done
    

Benchmarking Software Environment

Begin by installing all available software updates, for example, sudo apt update && sudo apt upgrade on Ubuntu. Use the command ld --version to check that the GNU binutils version is 2.38 or later. For best performance, GCC should be at version 12.3 or later; gcc --version reports the GCC version.
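
As a convenience, the following sketch prints both versions and compares them against the suggested minimums; it assumes GNU sort with version-sort support (sort -V) and gcc -dumpfullversion, both of which are standard on recent distributions.

# Returns success if the installed version ($1) is at least the minimum ($2).
version_ge() { [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]; }
ld_ver=$(ld --version | head -n1 | grep -Eo '[0-9]+(\.[0-9]+)+' | head -n1)
gcc_ver=$(gcc -dumpfullversion)
version_ge "$ld_ver" 2.38 && echo "binutils $ld_ver OK" || echo "binutils $ld_ver is older than 2.38"
version_ge "$gcc_ver" 12.3 && echo "gcc $gcc_ver OK" || echo "gcc $gcc_ver is older than 12.3"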

Many Linux distributions provide packages for GCC 12 compilers that can be installed alongside the system GCC. For example, sudo apt install gcc-12 on Ubuntu. See your Linux distribution’s instructions for installing and using various GCC versions. In case your distribution does not provide these packages, or you are unable to install them, instructions for building and installing GCC are provided below.

This guide uses a variety of compilers, libraries, and tools. Suggested minimum versions of the major software packages are noted where they are used, but any recent version of these tools will work well on NVIDIA Grace.

Building and Installing GCC 12.3 from Source

Prefer Linux Distribution GCC 12 Packages

Many Linux distributions provide packages for GCC 12 compilers that can be installed alongside the system GCC. For example, sudo apt install gcc-12 on Ubuntu. You should prefer those packages over building GCC from source.

Follow the instructions below to build GCC 12.3 from source. Note that filesystem I/O performance can affect compilation time, so we recommend building GCC on a local filesystem or ramdisk, e.g. /tmp.

Download and unpack the GCC source code:

wget https://ftp.gnu.org/gnu/gcc/gcc-12.3.0/gcc-12.3.0.tar.xz
tar xvf gcc-12.3.0.tar.xz

Download the GCC prerequisites:

cd gcc-12.3.0
./contrib/download_prerequisites

You should see output similar to:

2024-01-24 08:04:44 URL:http://gcc.gnu.org/pub/gcc/infrastructure/gmp-6.2.1.tar.bz2 [2493916/2493916] -> "gmp-6.2.1.tar.bz2" [1]
2024-01-24 08:04:45 URL:http://gcc.gnu.org/pub/gcc/infrastructure/mpfr-4.1.0.tar.bz2 [1747243/1747243] -> "mpfr-4.1.0.tar.bz2" [1]
2024-01-24 08:04:47 URL:http://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.2.1.tar.gz [838731/838731] -> "mpc-1.2.1.tar.gz" [1]
2024-01-24 08:04:49 URL:http://gcc.gnu.org/pub/gcc/infrastructure/isl-0.24.tar.bz2 [2261594/2261594] -> "isl-0.24.tar.bz2" [1]
gmp-6.2.1.tar.bz2: OK
mpfr-4.1.0.tar.bz2: OK
mpc-1.2.1.tar.gz: OK
isl-0.24.tar.bz2: OK
All prerequisites downloaded successfully.

Configure, compile, and install GCC. Remember to set GCC_INSTALL_PREFIX appropriately! This example installs GCC to /opt/gcc/12.3 but any valid filesystem path can be used:

export GCC_INSTALL_PREFIX=/opt/gcc/12.3
./configure --prefix="$GCC_INSTALL_PREFIX" --enable-languages=c,c++,fortran --enable-lto --disable-bootstrap --disable-multilib
make -j
make install

To use the newly-installed GCC 12 compiler, simply update your $PATH environment variable:

export PATH=$GCC_INSTALL_PREFIX/bin:$PATH
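
Depending on how your applications link the GCC runtime libraries, you may also want the matching libstdc++ and libgfortran to be found at run time. One hedged option (the library directory is typically lib64 on AArch64, but check your installation) is:

export LD_LIBRARY_PATH=$GCC_INSTALL_PREFIX/lib64:$LD_LIBRARY_PATH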

Confirm that the gcc command invokes GCC 12.3:

which gcc
gcc --version

You should see output similar to:

/opt/gcc/12.3/bin/gcc

gcc (GCC) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Foundational Benchmarks

Foundational benchmarks confirm whether the system is operating as expected. These benchmarks do not represent one application or problem area. They are excellent sanity checks for the system and can produce simple, comparable numbers with minimal configuration.

Before performing any competitive analysis, we strongly recommend that you run all foundational benchmarks. These benchmarks are simple and execute quickly, so you should repeat them every time you benchmark.

Fused Multiply Add

NVIDIA provides an open source suite of benchmarking microkernels for Arm® CPUs. To allow precise counts of instructions and exercise specific functional units, these kernels are written in assembly language. To measure the peak floating point capability of a core and check the CPU clock speed, use a Fused Multiply Add (FMA) kernel.

Install

To measure achievable peak performance of a core, the fp64_sve_pred_fmla kernel executes a known number of SVE predicated fused multiply-add operations (FMLA). When combined with the perf tool, you can measure the performance and the core clock speed.

git clone https://github.com/NVIDIA/arm-kernels.git
cd arm-kernels
make

Execute

The benchmark score is reported in giga-operations per second (Gop/sec) near the top of the benchmark output. Grace can perform 16 FP64 FMA operations per cycle, so a Grace CPU with a nominal CPU frequency of 3.3GHz should report between 52 and 53 Gop/sec.

./arithmetic/fp64_sve_pred_fmla.x
4( 16(SVE_FMLA_64b) );
Iterations;100000000
Total Inst;6400000000
Total Ops;25600000000
Inst/Iter;64
Ops/Iter;256
Seconds;0.478488
GOps/sec;53.5019

Use the perf command to measure CPU frequency. The CPU frequency is reported in the perf output on the cycles line and after the # symbol.

Before running perf, check that the value of /proc/sys/kernel/perf_event_paranoid is less than 1. If it is, you can run the command as an unprivileged user.

perf stat ./arithmetic/fp64_sve_pred_fmla.x

If the value of /proc/sys/kernel/perf_event_paranoid is greater than 1, you will need to run perf as root.

sudo perf stat ./arithmetic/fp64_sve_pred_fmla.x

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

The key performance metric is giga-operations per second (Gop/sec). Grace can perform 16 FP64 FMA operations per cycle, so a Grace CPU with a nominal CPU frequency of 3.3GHz should report between 52 and 53 Gop/sec.
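
As a quick sanity check, the expected figure follows directly from the clock rate:

# 16 FP64 FMA operations per cycle at a nominal 3.3 GHz
awk 'BEGIN { print 16 * 3.3 " Gop/sec" }'   # prints 52.8 Gop/sec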

Here is an example of benchmark output:

$ perf stat ./arithmetic/fp64_sve_pred_fmla.x
4( 16(SVE_FMLA_64b) );
Iterations;100000000
Total Inst;6400000000
Total Ops;25600000000
Inst/Iter;64
Ops/Iter;256
Seconds;0.481267
GOps/sec;53.1929


 Performance counter stats for './arithmetic/fp64_sve_pred_fmla.x':


            482.25 msec task-clock                       #    0.996 CPUs utilized
                 0      context-switches                 #    0.000 /sec
                 0      cpu-migrations                   #    0.000 /sec
                65      page-faults                      #  134.786 /sec
     1,607,949,685      cycles                           #    3.334 GHz
     6,704,065,953      instructions                     #    4.17  insn per cycle
   <not supported>      branches
            18,383      branch-misses                    #    0.00% of all branches


       0.484136320 seconds time elapsed


       0.482678000 seconds user
       0.000000000 seconds sys

STREAM

STREAM is the standard for measuring memory bandwidth. The STREAM benchmark is a simple, synthetic benchmark program that measures the sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels. The benchmark includes the following kernels that operate on 1D arrays a, b, and c, with scalar x:

  • COPY: Measures transfer rates in the absence of arithmetic: c = a
  • SCALE: Adds a simple arithmetic operation: b = x*a
  • ADD: Adds a third operand to test multiple load/store ports: c = a + b
  • TRIAD: Allows chained/overlapped/fused multiply/add operations: a = b + x*c

The kernels are executed in sequence in a loop, and the following parameters configure STREAM:

  • STREAM_ARRAY_SIZE: The number of double-precision elements in each array. When you measure the bandwidth to/from main memory, you must select a sufficiently large array size.
  • NTIMES: The number of iterations of the test loop.

Use the STREAM benchmark to check LPDDR5X memory bandwidth.

Install

The following commands download and compile STREAM with a memory footprint of approximately 2.7GB per Grace CPU, which is sufficient to exceed the total system L3 cache without excessive runtime. The general rule for running STREAM is that each array must be at least four times the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger.

# Scale the array size with the core count: 120,000,000 elements per 72-core Grace CPU.
STREAM_ARRAY_SIZE=$(( $(nproc) / 72 * 120000000 ))
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -Ofast -march=native -fopenmp -mcmodel=large -fno-PIC \
  	-DSTREAM_ARRAY_SIZE=${STREAM_ARRAY_SIZE} -DNTIMES=200 \
  	-o stream_openmp.exe stream.c

Execute

To run STREAM, set the number of OpenMP threads (OMP_NUM_THREADS) according to the following example. Replace ${THREADS} with the appropriate value from the table of reference results shown below. To distribute the threads evenly over all available cores and maximize bandwidth, use OMP_PROC_BIND=spread.

OMP_NUM_THREADS=${THREADS} OMP_PROC_BIND=spread ./stream_openmp.exe

Grace Superchip memory bandwidth is proportional to the total memory capacity. Find your system’s memory capacity in the table below and use the corresponding number of threads to generate the expected score for STREAM TRIAD. For example, when running on a Grace Hopper Superchip with a memory capacity of 120GB, this command will report between 410GB/s and 486GB/s in STREAM TRIAD:

OMP_NUM_THREADS=72 OMP_PROC_BIND=spread ./stream_openmp.exe

Similarly, the following command will report between 820GB/s and 972GB/s in STREAM TRIAD on a Grace CPU Superchip with a memory capacity of 480GB:

OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./stream_openmp.exe

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

Memory bandwidth depends on many factors, for instance, operating system kernel version and the default memory page size.
Without any code changes, STREAM TRIAD should score between 80% and 95% of the system’s theoretical peak memory bandwidth.

Superchip      Memory Capacity (GB)   Threads   TRIAD Min MB/s
Grace Hopper   120                    72        400,000
Grace Hopper   480                    72        307,000
Grace CPU      240                    144       800,000
Grace CPU      480                    144       800,000

Here is an example of the STREAM execution output:

OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./stream_openmp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 240000000 (elements), Offset = 0 (elements)
Memory per array = 1831.1 MiB (= 1.8 GiB).
Total memory required = 5493.2 MiB (= 5.4 GiB).
Each kernel will be executed 200 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 144
Number of Threads counted = 144
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5729 microseconds.
   (= 5729 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          662394.7     0.005964     0.005797     0.008116
Scale:         685483.8     0.005744     0.005602     0.007843
Add:           787098.2     0.007689     0.007318     0.008325
Triad:         806812.4     0.007713     0.007139     0.011388
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Common Benchmarks

These industry-recognized benchmarks facilitate a fair competitive performance analysis for a class of workloads.

High Performance Linpack (HPL)

NVIDIA HPC-Benchmarks provides a multiplatform (x86 and aarch64) container image, based on the NVIDIA Optimized Frameworks container images, that includes NVIDIA’s HPL benchmark. HPL-NVIDIA solves a random dense linear system in double-precision arithmetic on distributed-memory computers and is based on the netlib HPL benchmark. Please visit the NVIDIA HPC-Benchmarks page in the NGC Catalog for detailed instructions.

The HPL-NVIDIA benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark for getting started with the HPL software concepts and best practices.

Downloading and using the container

The container image works well with Singularity, Docker, or Pyxis/Enroot. Instructions for running with Singularity are provided below. For a general guide on pulling and running containers, see Running A Container in the NVIDIA Containers For Deep Learning Frameworks User’s Guide. For more information about using NGC, refer to the NGC Container User Guide.

Running the benchmarks

The script hpl-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch HPL-NVIDIA for NVIDIA Grace CPU. As of HPC-Benchmarks 23.10, hpl-aarch64.sh accepts the following parameters:

  • Required parameters:
    • --dat path: Path to HPL.dat input file
  • Optional parameters:
    • --cpu-affinity <string>: A colon-separated list of CPU index ranges
    • --mem-affinity <string>: A colon-separated list of memory indices
    • --ucx-affinity <string>: A colon-separated list of UCX devices
    • --ucx-tls <string>: UCX transport to use
    • --exec-name <string>: HPL executable file

Several sample input files are available in the container at /workspace/hpl-linux-aarch64.

Run with Singularity

The instructions below assume Singularity 3.4.1 or later.

Save the HPC-Benchmark container as a local Singularity image file:

singularity pull --docker-login hpc-benchmarks:23.10.sif docker://nvcr.io/nvidia/hpc-benchmarks:23.10

If prompted for a Docker username or password, just press “enter” to continue with guest access:

Enter Docker Username: # press "enter" key to skip
Enter Docker Password: # press "enter" key to skip

This command saves the container in the current directory as hpc-benchmarks:23.10.sif.
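
To list the sample input files mentioned above without launching a benchmark, one option is to inspect the image directly; this assumes the sample-dat subdirectory referenced in the run commands below:

singularity exec ./hpc-benchmarks:23.10.sif ls /workspace/hpl-linux-aarch64/sample-dat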

Use one of the following commands to run HPL-NVIDIA with a sample input file on one NVIDIA Grace CPU Superchip.

  • To run from a local command line, i.e. not using Slurm:

    singularity run ./hpc-benchmarks:23.10.sif \
         mpirun -np 2 --bind-to none \
         ./hpl-aarch64.sh --dat ./hpl-linux-aarch64/sample-dat/HPL_2mpi.dat \
         --cpu-affinity 0-71:72-143 --mem-affinity 0:1
    
  • To run via Slurm:

    srun -N 1 --ntasks-per-node=2 singularity run ./hpc-benchmarks:23.10.sif \
         ./hpl-aarch64.sh --dat ./hpl-linux-aarch64/sample-dat/HPL_2mpi.dat \
         --cpu-affinity 0-71:72-143 --mem-affinity 0:1
    

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

The score below was taken on a Grace CPU Superchip with 480GB of CPU memory:

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00L2L2      168880   448     1     2             616.41             5.2093e+03

HiBench: K-means

This workload from the HiBench suite tests K-means clustering in spark.mllib, a well-known clustering algorithm for knowledge discovery and data mining. The input data set is generated by GenKMeansDataset based on Uniform Distribution and Gaussian Distribution. There is also an optimized K-means implementation based on Intel Data Analytics Library (DAL), which is available in the dal module of sparkbench. This benchmark requires Spark, HiBench, and Hadoop. HiBench is the workload generator, Hadoop is used to generate and store data, and Spark is the application we wish to test.

Installation

Java 8 and Java 11

Install Java 8, Java 11, and related tools from your Linux distribution’s package repository. For example, on Ubuntu:

sudo apt install openjdk-11-jre-headless openjdk-11-jdk-headless maven python2 net-tools openjdk-8-jdk

Hadoop

cd $HOME
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6-aarch64.tar.gz
tar zxvf hadoop-3.3.6-aarch64.tar.gz
export PATH_TO_HADOOP=$HOME/hadoop-3.3.6
cd $PATH_TO_HADOOP/etc/hadoop

Create configuration files:

yarn-site.xml

<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>127.0.0.1:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>127.0.0.1:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>127.0.0.1:8031</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
</configuration>

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property> 
</configuration>

mapred-site.xml

Important

Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. <value>HADOOP_MAPRED_HOME=/home/nvidia/hadoop-3.3.6</value>.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
</property>
</configuration>

hdfs-site.xml

Important

Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. <value>/home/nvidia/hadoop-3.3.6/namenode</value>

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>$PATH_TO_HADOOP/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>datanode</value>
</property>
</configuration>

hadoop-env.sh

Important

Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. export HADOOP_HOME="/home/nvidia/hadoop-3.3.6"

export HADOOP_HOME="$PATH_TO_HADOOP"
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-arm64"  
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}

After creating all Hadoop configuration files, initialize the namenode directory:

$PATH_TO_HADOOP/bin/hdfs namenode -format

Spark

Important

Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. export HADOOP_PREFIX="/home/nvidia/hadoop-3.3.6"

cd $HOME
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar zxvf spark-3.5.0-bin-hadoop3.tgz
cd spark-3.5.0-bin-hadoop3/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf

HiBench

Java 8 is Required to Build HiBench

If the mvn command given below fails with an error like object java.lang.Object in compiler mirror not found, check that you have installed Java 8 and updated your JAVA_HOME and PATH environment variables.

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-arm64"
export PATH="$JAVA_HOME/bin:$PATH"

cd $HOME
git clone https://github.com/Intel-bigdata/HiBench.git
cd HiBench
mvn -Phadoopbench -Psparkbench -Dspark=2.4 -Dscala=2.11 clean package

Configure HiBench:

Important

Replace $NUM_PARTITIONS, $PATH_TO_HADOOP, and $PATH_TO_SPARK with the appropriate paths.

cd $HOME/HiBench/conf

# Important: replace "$NUM_PARTITIONS" with the number of CPU cores you wish to use, e.g. 72 for Grace-Hopper.
sed -i 's#hibench.scale.profile.*$#hibench.scale.profile huge#g' hibench.conf
sed -i 's#hibench.default.map.parallelism.*$#hibench.default.map.parallelism $NUM_PARTITIONS#g' hibench.conf
sed -i 's#hibench.default.shuffle.parallelism.*$#hibench.default.shuffle.parallelism $NUM_PARTITIONS#g' hibench.conf

# IMPORTANT: replace "$PATH_TO_HADOOP"
cp hadoop.conf.template hadoop.conf
sed -i 's#/PATH/TO/YOUR/HADOOP/ROOT#$PATH_TO_HADOOP#g' hadoop.conf

# IMPORTANT: replace "$PATH_TO_SPARK"
cp spark.conf.template spark.conf
sed -i 's#/PATH/TO/YOUR/SPARK/HOME#$PATH_TO_SPARK#g' spark.conf
sed -i 's#hibench.spark.master.*$#hibench.spark.master local[*]#' spark.conf
sed -i 's#spark.executor.memory.*$#spark.executor.memory 50g#' spark.conf
sed -i 's#spark.driver.memory.*$#spark.driver.memory 50g#' spark.conf

Run the Benchmark

  1. Configure your environment. Remember to set $PATH_TO_HADOOP to the correct path.

    export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-arm64"
    export PATH="$JAVA_HOME/bin:$PATH"
    # Set Hadoop-related environment variables
    export PATH_TO_HADOOP=$HOME/hadoop-3.3.6
    export HADOOP_HOME=$PATH_TO_HADOOP
    export HADOOP_MAPRED_HOME=${HADOOP_HOME}
    export HADOOP_COMMON_HOME=${HADOOP_HOME}
    export HADOOP_HDFS_HOME=${HADOOP_HOME}
    export YARN_HOME=${HADOOP_HOME}
    export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
    # Native Path
    export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    
  2. Start Hadoop

    export PDSH_RCMD_TYPE=exec
    $PATH_TO_HADOOP/sbin/start-all.sh
    

    If Hadoop starts successfully, jps output should be similar to:

    369207 SecondaryNameNode
    369985 NodeManager
    371293 NameNode
    373148 Jps
    368895 DataNode
    369529 ResourceManager
    

    All of NameNode, SecondaryNameNode, NodeManager, DataNode, and ResourceManager must be running before proceeding with the benchmark. If you do not see a NameNode process, check that you initialized the namenode directory as described in the Hadoop installation steps above.

  3. Preprocess the k-means benchmark files

    $HOME/HiBench/bin/workloads/ml/kmeans/prepare/prepare.sh
    
  4. Run the benchmark once to initialize the system

    $HOME/HiBench/bin/workloads/ml/kmeans/spark/run.sh
    
  5. Run the k-means benchmark several times and record the scores; the reported result is the median across runs (see the reference results below). This example shows 72 cores. If you wish to use a different number of CPU cores, remember to update hibench.default.map.parallelism and hibench.default.shuffle.parallelism in hibench.conf.

    numactl -C0-71 -m0 $HOME/HiBench/bin/workloads/ml/kmeans/spark/run.sh
    

The results can be found in $HOME/HiBench/report/hibench.report.

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

Type         Date       Time     Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node    
ScalaSparkKmeans 2023-02-16 20:29:54 19903891441          23.692               840110224            840110224          
ScalaSparkKmeans 2023-02-16 23:45:33 19903891427          23.742               838340974            838340974          
ScalaSparkKmeans 2023-02-16 23:53:05 19903891439          24.129               824894999            824894999
 
The median throughput/node across the three runs is the result; in the example above, it is 838340974 bytes/s.
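
A hedged one-liner to pick that median from the report, assuming the column layout shown above (Throughput/node is the seventh whitespace-separated field):

awk 'NR>1 {print $7}' $HOME/HiBench/report/hibench.report | sort -n | awk '{v[NR]=$1} END {print v[int((NR+1)/2)]}'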

GAP Benchmark Suite

The GAP Benchmark Suite (Beamer, 2015) was released with the goal of helping standardize graph processing evaluations. Graph algorithms and their applications are currently gaining renewed interest, especially with the growth of social networks and their analysis. Graph algorithms are also important for their applications in science. The GAP benchmark suite provides high-performance (CPU-only) reference implementations of various graph operations and serves as a standard for graph processing performance evaluations.

Although the GAP benchmark suite provides real-world graphs and several kernels (high-performance implementations of various graph algorithms), this guide uses only synthetic Kronecker graphs and focuses on the Breadth-First Search (BFS) kernel.

Initial Configuration

The gapbs repository is the reference implementation of the GAP Benchmark Suite. It is designed to be a portable, high-performance baseline that requires only a compiler with C++11 support. It uses OpenMP for parallelism, but it can be compiled without OpenMP to run serially. The details of the benchmark can be found in the specification.

Quick Start

To build from source, run the following commands:

git clone https://github.com/sbeamer/gapbs.git 
cd gapbs
make

To quickly test the build, run the BFS kernel on 1024 vertices for one iteration:

$ ./bfs -g 10 -n 1

The command output should be similar to

Generate Time:       0.00547
Build Time:          0.00295
Graph has 1024 nodes and 10496 undirected edges for degree: 10
Trial Time:          0.00921
Average Time:        0.00921

Additional command line flags can be found with -h.

Running the BFS Kernel

These command line options set runtime parameters for the BFS kernel:

  • -g <scale>: generate Kronecker graph with 2^scale vertices.
  • -k <degree>: average degree for a synthetic graph.
  • -n <n>: performs n trials.

Typically, we select a scale so that the working set of the workload does not fit in the last-level cache of the test platform. A scale value of 26 produces a graph with approximately 67.11 million vertices, which is large enough that the working set will not lie entirely within the CPU's last-level cache.
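
For reference, the vertex count at scale 26 can be confirmed with a quick calculation:

echo $((1 << 26))   # 67108864 vertices, roughly 67.11 million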

Run bfs with the following command:

OMP_NUM_THREADS=72 OMP_PROC_BIND=close numactl -m0 -C 0-71 ./bfs -g 26 -k 16 -n 64

This command will pin our application to CPU socket 0 and physical cores 0-71.

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

When bfs is run using the command above on a Grace system with at least 72 cores, with a Kronecker graph of scale 26 and degree 16 for 64 trials, the average time is approximately 0.0395 +/- 0.001 seconds, as shown below.

Generate Time:       3.72415
Build Time:          5.94936
Graph has 67108864 nodes and 1051923215 undirected edges for degree: 15
Trial Time:          0.03807
Trial Time:          0.03730
Trial Time:          0.04042
Trial Time:          0.04184
Trial Time:          0.03676
...
Trial Time:          0.03795
Trial Time:          0.03576
Trial Time:          0.04318
Average Time:        0.03977

Graph500

Graph500 is a rating of supercomputer systems focused on data-intensive workloads.

Build

The following script will build Graph500 and all of its dependencies. The script was tested on a freshly booted Ubuntu 22.04 system.

#!/bin/bash

set -e

sudo apt update && sudo apt install -y wget build-essential python3 numactl git
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.1.tar.gz
gunzip -c openmpi-5.0.1.tar.gz | tar xf -
mkdir -p ./ompi
export PATH="${PWD}/ompi/bin:${PATH}"
export LD_LIBRARY_PATH="${PWD}/ompi/lib:${LD_LIBRARY_PATH}"
pushd openmpi-5.0.1
./configure --prefix=${PWD}/../ompi
make all install
popd

git clone https://github.com/graph500/graph500.git
pushd ./graph500/src/
sed -i '/^CFLAGS/s/$/ -DPROCS_PER_NODE_NOT_POWER_OF_TWO -fcommon/' Makefile
make
popd

Running Benchmarks on Grace

#!/bin/bash
export SKIP_VALIDATION=1
unset SKIP_BFS

mpirun -n $(nproc) --map-by core ./graph500/src/graph500_reference_bfs 28 16 

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

In the output below, bfs harmonic_mean_TEPS and bfs mean_time are the performance and runtime metrics, respectively. TEPS (traversed edges per second) is the absolute performance metric.

SCALE:                          28
edgefactor:                     16
NBFS:                           64
graph_generation:               14.2959
num_mpi_processes:              144
construction_time:              8.21669
bfs  min_time:                  1.4579
bfs  firstquartile_time:        1.47956
bfs  median_time:               1.54229
bfs  thirdquartile_time:        1.70613
bfs  max_time:                  1.81811
bfs  mean_time:                 1.58271
bfs  stddev_time:               0.112696
min_nedge:                      4294921166
firstquartile_nedge:            4294921166
median_nedge:                   4294921166
thirdquartile_nedge:            4294921166
max_nedge:                      4294921166
mean_nedge:                     4294921166
stddev_nedge:                   0
bfs  min_TEPS:                  2.3623e+09
bfs  firstquartile_TEPS:        2.51735e+09
bfs  median_TEPS:               2.78478e+09
bfs  thirdquartile_TEPS:        2.90284e+09
bfs  max_TEPS:                  2.94596e+09
bfs  harmonic_mean_TEPS:     !  2.71366e+09
bfs  harmonic_stddev_TEPS:      2.43441e+07
bfs  min_validate:              -1
bfs  firstquartile_validate:    -1
bfs  median_validate:           -1
bfs  thirdquartile_validate:    -1
bfs  max_validate:              -1
bfs  mean_validate:             -1
bfs  stddev_validate:           0
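
If you capture the benchmark output in a log file (for example, a hypothetical graph500.log written with tee), a simple way to pull out just these two metrics is:

# Assumes the run command above was captured with, e.g., "| tee graph500.log"
grep -E 'bfs +(harmonic_mean_TEPS|mean_time)' graph500.log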

NAS Parallel Benchmarks

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The NPB 1 benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications. Problem sizes in NPB are predefined and indicated as different classes. Reference implementations of NPB are available in commonly-used programming models like MPI and OpenMP.

Building the Benchmarks

  1. Download and unpack the NPB source code from nas.nasa.gov:

    wget https://www.nas.nasa.gov/assets/npb/NPB3.4.2.tar.gz
    tar xvzf NPB3.4.2.tar.gz
    cd NPB3.4.2/NPB3.4-OMP
    
  2. Create the make.def file to configure the build for NVIDIA HPC compilers:

    cat > config/make.def <<'EOF'
    FC = nvfortran
    FLINK = $(FC)
    F_LIB =
    F_INC =
    FFLAGS = -O3 -mp
    FLINKFLAGS = $(FFLAGS)
    CC = nvc
    CLINK = $(CC)
    C_LIB = -lm
    C_INC =
    CFLAGS = -O3 -mp
    CLINKFLAGS = $(CFLAGS)
    UCC = gcc
    BINDIR = ../bin
    RAND = randi8
    WTIME  = wtime.c
    EOF
    
  3. Create the suite.def file to build all benchmarks with the D problem size:

    cat > config/suite.def <<'EOF'
    bt	D
    cg	D
    ep	D
    lu	D
    mg	D
    sp	D
    ua	D
    EOF
    
  4. Compile all benchmarks:

    make -j suite
    

A successful compilation will generate these binaries in the bin/ directory:

$ ls bin/
bt.D.x  cg.D.x  ep.D.x  lu.D.x  mg.D.x  sp.D.x  ua.D.x

Running the Benchmarks

Run each benchmark individually using the command shown below. In the command, replace ${BENCHMARK} with the benchmark name, for example cg.D.x, and replace ${THREADS} and ${FLAGS} with the appropriate values from the reference results shown below.

OMP_NUM_THREADS=${THREADS} OMP_PROC_BIND=close numactl ${FLAGS} ./bin/${BENCHMARK}

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

Grace CPU Superchip, 480GB Memory Capacity

Use this script to run all the benchmarks on 72 cores of the Grace CPU:

#!/bin/bash
for BENCHMARK in bt cg ep lu mg sp ua ; do
    OMP_NUM_THREADS=72 OMP_PROC_BIND=close numactl -m0 ./bin/${BENCHMARK}.D.x
done

Performance is reported on the line marked “Mops / total”. The expected performance is shown below.

Benchmark   Mops / total
bt.D.x      386758.21
cg.D.x      26632.65
ep.D.x      10485.73
lu.D.x      293407.59
mg.D.x      125382.93
sp.D.x      136893.59
ua.D.x      973.52

protobuf

Protocol Buffers (a.k.a. protobuf) are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Build

The following script will build Protocol Buffers. The script was tested on a freshly booted Ubuntu 22.04 system.

#!/bin/bash

set -e

sudo apt update && sudo apt install -y autoconf automake libtool curl make g++ unzip libz-dev git cmake
git clone https://github.com/protocolbuffers/protobuf.git
pushd protobuf
# syncing at a specific commit
git checkout 7cd0b6fbf1643943560d8a9fe553fd206190b27f
git submodule update --init --recursive
./autogen.sh
./configure
make
make check 
sudo make install
sudo ldconfig
pushd benchmarks
make cpp
popd
popd

Running Benchmarks on Grace

#!/bin/bash
pushd ./protobuf
mkdir -p result
rm -rf result/*
C=$(nproc)
for (( i=0; i < $C-1; i++ ))
do
        filename_result="$i.log"
        filepath_result="result/$filename_result"
        taskset -c $i ./benchmarks/cpp-benchmark --benchmark_min_time=5.0 $(find $(cd . && pwd) -type f -name "dataset.*.pb" -not -path "$(cd . && pwd)/tmp/*") >> $filepath_result &
done
# The last copy runs synchronously on the final core so the script waits for it.
filename_result="$i.log"
filepath_result="result/$filename_result"
taskset -c $i ./benchmarks/cpp-benchmark --benchmark_min_time=5.0 $(find $(cd . && pwd) -type f -name "dataset.*.pb" -not -path "$(cd . && pwd)/tmp/*") >> $filepath_result 
sleep 1
popd
echo Done!

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

Run one copy of the benchmark per core, as shown above. For each copy, take the geometric mean of its scores, then take the lowest of those values across all copies. The socket score is that value multiplied by the number of copies (Score * copies = Socket score).

Geomean of mins: 1291.26863081306
Total score: 185942.682837081 MB/s
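
The following is a minimal sketch of that scoring procedure. It assumes a hypothetical preprocessing step has already extracted one throughput value in MB/s per line from each per-core log into result/<core>.scores; the exact log format depends on the benchmark build, so that extraction step is not shown.

#!/bin/bash
# Sketch: per-copy geometric mean, minimum across copies, then minimum * number of copies.
min=""
copies=0
for f in result/*.scores; do
    g=$(awk '{ s += log($1); n++ } END { if (n) printf "%.6f", exp(s / n) }' "$f")
    copies=$((copies + 1))
    if [ -z "$min" ] || awk -v a="$g" -v b="$min" 'BEGIN { exit !(a < b) }'; then
        min="$g"
    fi
done
echo "Per-copy geomean minimum: $min MB/s"
awk -v m="$min" -v c="$copies" 'BEGIN { printf "Socket score: %.3f MB/s\n", m * c }'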

Applications

The benchmarking recipes in this section show you how to maximize the performance of key applications.

NAMD

NAMD is a widely used molecular dynamics software package for large-scale simulations of biomolecular systems [1]. It is developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign and was used in the winning submission for the 2020 ACM Gordon Bell Special Prize for COVID-19 Research [2]. As part of the submission, NAMD was used to simulate a 305-million atom SARS-CoV-2 viral envelope on over four thousand nodes of the ORNL Summit supercomputer. The Charm++ framework is used to scale to thousands of GPUs and hundreds of thousands of CPU cores [3]. NAMD has supported Arm since 2014.

Building the Source Code

To access the NAMD source code, submit a request at https://www.ks.uiuc.edu/Research/namd/gitlabrequest.html. After the request is approved, you can access the source code at https://gitlab.com/tcbgUIUC/namd.

Dependencies

The following script will install NAMD’s dependencies to the ./build/ directory. Charm++ version 7.0.0 does not support targeting the Armv9 architecture, so Armv8 is used instead.

#!/bin/bash

set -e

if [[ ! -a build ]]; then
    mkdir build
fi
cd build

#
# FFTW
#
if [[ ! -a fftw-3.3.9 ]]; then
    wget http://www.fftw.org/fftw-3.3.9.tar.gz
    tar xvfz fftw-3.3.9.tar.gz
fi

if [[ ! -a fftw ]]; then
    mkdir fftw
    cd fftw-3.3.9
    ./configure CC=gcc --prefix=$PWD/../fftw \
        --enable-float --enable-fma \
        --enable-neon \
        --enable-openmp --enable-threads | tee fftw_config.log
    make -j 8 | tee fftw_build.log
    make install
    cd ..
fi

#
# TCL
#
if [[ ! -e tcl ]]; then
    wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-arm64-threaded.tar.gz
    tar zxvf tcl8.5.9-linux-arm64-threaded.tar.gz
    mv tcl8.5.9-linux-arm64-threaded tcl
fi

#
# Charm++
#
if [[ ! -a charm ]]; then
    git clone https://github.com/UIUC-PPL/charm.git
fi

cd charm
git checkout v7.0.0
if [[ ! -a multicore-linux-arm8-gcc ]]; then
    ./build charm++ multicore-linux-arm8 gcc --with-production --enable-tracing -j 8
fi 
cd ..

NAMD

The following script downloads and compiles NAMD in the ./build/ directory. We recommend GCC version 12.3 or later because it can target the neoverse-v2 architecture. If GCC 12.3 or later is not available, remove the sed command below because the architecture will not be recognized.

#!/bin/bash
set -e
if [[ ! -a build ]]; then
    mkdir build
fi
cd build

#
# NAMD
#
if [[ ! -a namd ]]; then
    git clone git@gitlab.com:tcbgUIUC/namd.git
    cd namd
    git checkout release-3-0-beta-3
    cd ..
fi

cd namd

if [[ ! -a Linux-ARM64-g++ ]]; then
    ./config Linux-ARM64-g++ \
        --charm-arch multicore-linux-arm8-gcc --charm-base $PWD/../charm \
        --with-tcl --tcl-prefix $PWD/../tcl \
        --with-fftw --with-fftw3 --fftw-prefix $PWD/../fftw
    sed -i 's/FLOATOPTS = .*/FLOATOPTS = -Ofast -mcpu=neoverse-v2/g' arch/Linux-ARM64-g++.arch
    cd Linux-ARM64-g++
    make depends
    make -j 8
    cd ..
fi
cd ..

Running Benchmarks on Grace

STMV is a standard benchmark system with 1,066,628 atoms. To download STMV, run the following command.

wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz 
tar zxvf stmv.tar.gz
cd stmv
wget http://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/stmv_nve_cuda.namd
wget https://www.ks.uiuc.edu/Research/namd/utilities/ns_per_day.py
chmod +x ns_per_day.py

The stmv_nve_cuda.namd input file is not specific to CUDA and runs an NVE simulation with a 2 femtosecond timestep and PME evaluated every 4 fs with multi-time stepping. The benchmark can be run with the following command from the stmv directory:

../build/namd/Linux-ARM64-g++/namd3 +p72 +pemap 0-71 stmv_nve_cuda.namd | tee output.txt
./ns_per_day.py output.txt

The metric of interest is ns/day (higher is better) corresponding to the number of nanoseconds of simulation time that can be computed in 24 hours. The ns_per_day.py script will parse the standard output of a simulation and compute the overall performance of the benchmark.

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

The following result was collected on a Grace Hopper Superchip using 72 CPU cores and completed in 121 seconds. As measured by hwmon, the average Grace module power was 275 Watts, and the benchmark consumed approximately 9.31 Watt-hours of energy.

$ ./ns_per_day.py output.txt
Nanoseconds per day:    2.97202

Mean time per step:     0.0581422
Standard deviation:     0.00074301

References

[1] Phillips, James C., et al. “Scalable molecular dynamics on CPU and GPU architectures with NAMD.” The Journal of Chemical Physics 153.4 (2020).

[2] Casalino, Lorenzo, et al. “AI-driven multiscale simulations illuminate mechanisms of SARS-CoV-2 spike dynamics.” The International Journal of High Performance Computing Applications 35.5 (2021): 432-451.

[3] Phillips, James C., et al. “Mapping to irregular torus topologies and other techniques for petascale biomolecular simulation.” SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2014.

OpenFOAM

OpenFOAM is a C++ toolbox for the development of customized numerical solvers and pre-/post-processing utilities for the solution of continuum mechanics problems, most prominently computational fluid dynamics.

Build

The following script will build OpenFOAM and all of its dependencies. The script was tested on a freshly booted Ubuntu 22.04 system.

#!/bin/bash

set -e

sudo apt update && sudo apt install -y time libfftw3-dev curl wget \
    build-essential libscotch-dev libcgal-dev git flex libfl-dev bison cmake \
    zlib1g-dev libboost-system-dev libboost-thread-dev \
    libopenmpi-dev openmpi-bin gnuplot \
    libreadline-dev libncurses-dev libxt-dev numactl

wget https://dl.openfoam.com/source/v2312/OpenFOAM-v2312.tgz 
wget https://dl.openfoam.com/source/v2312/ThirdParty-v2312.tgz
tar -zxvf OpenFOAM-v2312.tgz && tar -zxvf ThirdParty-v2312.tgz

source ./OpenFOAM-v2312/etc/bashrc
pushd $WM_PROJECT_DIR 
./Allwmake -j -s -l -q
popd

Running Benchmarks on Grace

export OPENFOAM_ROOT=${PWD}
export OPENFOAM_MPIRUN_ARGS="--map-by core --bind-to none --report-bindings"

source ./OpenFOAM-v2312/etc/bashrc
git clone https://develop.openfoam.com/committees/hpc.git 
pushd hpc/incompressible/simpleFoam/HPC_motorbike/Large/v1912 
sed -i "s/numberOfSubdomains.*/numberOfSubdomains $(nproc);/g" system/decomposeParDict 
sed -i "s/vector/normal/g" system/mirrorMeshDict 
sed -i "s/^endTime.*/endTime         100;/" system/controlDict
sed -i "s/^writeInterval.*/writeInterval   1000;/" system/controlDict
curl -o system/fvSolution "https://develop.openfoam.com/Development/openfoam/-/raw/master/tutorials/incompressible/simpleFoam/motorBike/system/fvSolution?ref_type=heads"
chmod +x All*

./AllmeshL
./Allrun
cat log.*
popd

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

The following result was collected on a Grace CPU Superchip using 144 CPU cores.

ExecutionTime = 189.49 s  ClockTime = 197 s

SPECFEM3D

SPECFEM3D Cartesian simulates acoustic (fluid), elastic (solid), coupled acoustic/elastic, poroelastic or seismic wave propagation in any type of conforming mesh of hexahedra (structured or not).

Build

The following script will build SPECFEM3D. The script was tested on a freshly booted Ubuntu 22.04 system.

#!/bin/bash

set -e

sudo apt update && sudo apt install -y git build-essential gcc gfortran libopenmpi-dev openmpi-bin
git clone https://github.com/SPECFEM/specfem3d.git
pushd ./specfem3d
./configure FC=gfortran CC=gcc
make all
cp -r EXAMPLES/applications/meshfem3D_examples/simple_model/DATA/* DATA/
sed -i "s/NPROC .*/NPROC    = $(nproc)/g" DATA/Par_file
sed -i "s/NSTEP .*/NSTEP    = 10000/g" DATA/Par_file
sed -i "s/DT .*/DT       = 0.01/g" DATA/Par_file
sed -i "s/NEX_XI .*/NEX_XI       = 448/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i "s/NEX_ETA .*/NEX_ETA      = 576/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i "s/NPROC_XI .*/NPROC_XI      = 8/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i "s/NPROC_ETA .*/NPROC_ETA     = 18/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i '/^#NEX_XI_BEGIN/{n;s/1.*/1 448 1 576 1 4 1/;n;s/1.*/1 448 1 576 5 5 2/;n;s/1.*/1 448 1 576 6 15 3/}' DATA/meshfem3D_files/Mesh_Par_file
popd

Running Benchmarks on Grace

pushd ./specfem3d
mkdir -p DATABASES_MPI
rm -rf DATABASES_MPI/*
rm -rf OUTPUT_FILES/*
mpirun -n $(nproc) --bind-to none --map-by core ./bin/xmeshfem3D
mpirun -n $(nproc) --bind-to none --map-by core ./bin/xgenerate_databases
mpirun -n $(nproc) --bind-to none --map-by core ./bin/xspecfem3D
cat OUTPUT_FILES/output_solver.txt
popd

Reference Results

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

The following result was collected on a Grace CPU Superchip using 144 CPU cores.

 Time loop finished. Timing info:
 Total elapsed time in seconds =    991.00492298699999
 Total elapsed time in hh:mm:ss =      0 h 16 m 31 s

Weather Research and Forecasting Model

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility.

Arm64 is supported by the standard WRF distribution as of WRF 4.3.3. The following example shows the standard procedure for building and running WRF on NVIDIA Grace. See http://www2.mmm.ucar.edu/wrf/users/download/get_source.html for more details.

Install WRF

Initial Configuration

Verify that the most recent NVIDIA HPC SDK is available in your environment. The simplest way to do this is to load the nvhpc module file.

module load nvhpc
nvc --version
nvc 23.7-0 linuxarm64 target on aarch64 Linux -tp neoverse-v2
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

NVIDIA HPC SDK includes optimized MPI compilers and libraries, so you’ll also have the appropriate MPI compilers in your path:

$ which mpirun
/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/mpi/bin/mpirun

$ mpicc -show
nvc -I/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/openmpi/openmpi-3.1.5/include -Wl,-rpath -Wl,$ORIGIN:$ORIGIN/../../lib:$ORIGIN/../../../lib:$ORIGIN/../../../compilers/lib:$ORIGIN/../../../../compilers/lib:$ORIGIN/../../../../../compilers/lib -Wl,-rpath -Wl,/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib -L/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib -lmpi

NetCDF requires libcurl. On Ubuntu, you can install this easily with this command:

sudo apt install libcurl4-openssl-dev

Create a build directory to hold WRF and all of its dependencies:

mkdir WRF

# Configure build environment
export BUILD_DIR="$HOME/WRF"
export HDFDIR=$BUILD_DIR/opt
export HDF5=$BUILD_DIR/opt
export NETCDF=$BUILD_DIR/opt

export PATH=$HDFDIR/bin:$PATH
export LD_LIBRARY_PATH=$HDFDIR/lib:$LD_LIBRARY_PATH

Dependencies

WRF depends on the NetCDF Fortran library, which in turn requires the NetCDF C library and HDF5. This guide assumes that all of WRF’s dependencies have been installed at the same location such that they share the same lib and include directories.

HDF5

cd $BUILD_DIR

wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.2/src/hdf5-1.14.2.tar.gz
tar xvzf hdf5-1.14.2.tar.gz
cd hdf5-1.14.2

CC=mpicc FC=mpifort \
    CFLAGS="-O3 -fPIC" FCFLAGS="-O3 -fPIC" \
    ./configure --prefix=$HDFDIR --enable-fortran --enable-parallel

make -j72
make install

NetCDF-C

cd $BUILD_DIR

wget https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.2.tar.gz
tar xvzf v4.9.2.tar.gz
cd netcdf-c-4.9.2

CC=mpicc FC=mpifort \
    CPPFLAGS="-I$HDFDIR/include" \
    CFLAGS="-O3 -fPIC -I$HDFDIR/include" \
    FFLAGS="-O3 -fPIC -I$HDFDIR/include" \
    FCFLAGS="-O3 -fPIC -I$HDFDIR/include" \
    LDFLAGS="-O3 -fPIC -L$HDFDIR/lib -lhdf5_hl -lhdf5 -lz" \
    ./configure --prefix=$NETCDF

make -j72
make install

NetCDF-Fortran

cd $BUILD_DIR

wget https://github.com/Unidata/netcdf-fortran/archive/refs/tags/v4.6.1.tar.gz
tar xvzf v4.6.1.tar.gz
cd netcdf-fortran-4.6.1/

CC=mpicc FC=mpifort \
    CPPFLAGS="-I$HDFDIR/include" \
    CFLAGS="-O3 -fPIC -I$HDFDIR/include" \
    FFLAGS="-O3 -fPIC -I$HDFDIR/include" \
    FCFLAGS="-O3 -fPIC -I$HDFDIR/include" \
    LDFLAGS="-O3 -fPIC -L$HDFDIR/lib -lhdf5_hl -lhdf5 -lz" \
    ./configure --prefix=$NETCDF

make -j72
make install

Build WRF with NVIDIA Compilers

cd $BUILD_DIR

wget https://github.com/wrf-model/WRF/releases/download/v4.5.2/v4.5.2.tar.gz
tar xvzf v4.5.2.tar.gz
cd WRFV4.5.2

Run ./configure and select the following options:

  • Choose a dm+sm option on the NVHPC row. In this example, this is option 20.
  • Choose 1 for nesting.
./configure
------------------------------------------------------------------------
Please select from among the following Linux aarch64 options:

  1. (serial)   2. (smpar)   3. (dmpar)   4. (dm+sm)   GNU (gfortran/gcc)
  5. (serial)   6. (smpar)   7. (dmpar)   8. (dm+sm)   GNU (gfortran/gcc)
  9. (serial)  10. (smpar)  11. (dmpar)  12. (dm+sm)   armclang (armflang/armclang): Aarch64
 13. (serial)  14. (smpar)  15. (dmpar)  16. (dm+sm)   GCC (gfortran/gcc): Aarch64
 17. (serial)  18. (smpar)  19. (dmpar)  20. (dm+sm)   NVHPC (nvfortran/nvc)

Enter selection [1-20] : 20
------------------------------------------------------------------------
Compile for nesting? (0=no nesting, 1=basic, 2=preset moves, 3=vortex following) [default 0]: 1

Important

Depending on the compilers available in your environment, other options may be presented in the menu. Check the numbers in the menu before making your selection.

Reset environment variables:

# Reset build environment to include `-lnetcdf` in LDFLAGS
export CC=$(which mpicc)
export CXX=$(which mpicxx)
export FC=$(which mpifort)
export CPPFLAGS="-O3 -fPIC -I$HDFDIR/include"
export CFLAGS="-O3 -fPIC -I$HDFDIR/include"
export FFLAGS="-O3 -fPIC -I$HDFDIR/include"
export LDFLAGS="-O3 -fPIC -L$HDFDIR/lib -lnetcdf -lhdf5_hl -lhdf5 -lz"
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH

Set stack size to “unlimited”:

ulimit -s unlimited

Run ./compile to build WRF and save the output to build.log:

./compile em_real 2>&1 | tee build.log

Look for a message similar to this at the end of the compilation log:

==========================================================================
build started:   Wed Oct  4 05:07:19 PM PDT 2023
build completed: Wed Oct 4 05:07:44 PM PDT 2023

--->                  Executables successfully built                  <---

-rwxrwxr-x 1 jlinford jlinford 44994360 Oct  4 17:07 main/ndown.exe
-rwxrwxr-x 1 jlinford jlinford 44921440 Oct  4 17:07 main/real.exe
-rwxrwxr-x 1 jlinford jlinford 44481744 Oct  4 17:07 main/tc.exe
-rwxrwxr-x 1 jlinford jlinford 48876800 Oct  4 17:07 main/wrf.exe

==========================================================================

Run WRF CONUS 12km

Verify that the most recent NVIDIA HPC SDK is available in your environment. The simplest way to do this is to load the nvhpc module file.

module load nvhpc

Configure your environment.

export BUILD_DIR="$HOME/WRF"
export HDFDIR=$BUILD_DIR/opt
export HDF5=$BUILD_DIR/opt
export NETCDF=$BUILD_DIR/opt
export PATH=$HDFDIR/bin:$PATH
export LD_LIBRARY_PATH=$HDFDIR/lib:$LD_LIBRARY_PATH

Download and unpack the CONUS 12km input files into a fresh run directory.

cd $BUILD_DIR/WRFV4.5.2

# Copy the run directory template
cp -a run run_CONUS12km
cd run_CONUS12km

# Download the test case files and merge them into the run directory
wget https://www2.mmm.ucar.edu/wrf/src/conus12km.tar.gz
tar xvzf conus12km.tar.gz --strip-components=1

Configure the environment:

ulimit -s unlimited
export PATH=$NETCDF/bin:$HDFDIR/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$HDFDIR/lib:$LD_LIBRARY_PATH

On a Grace CPU Superchip with 144 cores, run WRF with 36 MPI ranks and give each MPI rank 4 OpenMP threads:

export OMP_STACKSIZE=1G 
export OMP_PLACES=cores 
export OMP_PROC_BIND=close 
export OMP_NUM_THREADS=4 
mpirun -np 36 -map-by ppr:18:numa:PE=4 ./wrf.exe

On a Grace Hopper Superchip with 72 cores, run WRF with 18 MPI ranks and give each MPI rank 4 OpenMP threads:

export OMP_STACKSIZE=1G 
export OMP_PLACES=cores 
export OMP_PROC_BIND=close 
export OMP_NUM_THREADS=4 
mpirun -np 18 -map-by ppr:18:numa:PE=4 ./wrf.exe

You can monitor the run progress by watching the output logs for MPI rank 0:

tail -f rsl.out.0000 rsl.error.0000

The benchmark score is the average elapsed seconds per domain for all MPI ranks. You can use the jq utility command to calculate this easily from the output logs of all MPI ranks.

# Quickly calculate the average elapsed seconds per domain as a figure-of-merit
cat rsl.out.* | grep 'Timing for main:' | awk '{print $9}' | jq -s add/length

Reference Results: CONUS 12km

Important

These figures are provided as guidelines and should not be interpreted as performance targets.

System                Capacity (GB)   Ranks   Threads   Average Elapsed Seconds
Grace CPU Superchip   480             36      4         0.3884
Grace Hopper          120             18      4         0.5761

Developing for NVIDIA Grace

Architectural Features

NVIDIA Grace implements the SVE2 and the NEON single-instruction-multiple-data (SIMD) instruction sets (refer to Arm SIMD Instructions for more information).

All server-class Arm64 processors support low-cost atomic operations that can improve system throughput for thread communication, locks, and mutexes (refer to Locks, Synchronization, and Atomics for more information).

All Arm CPUs (including NVIDIA Grace) provide several ways to determine the available CPU resources and topology at runtime (refer to Runtime CPU Detection for more information and the example code).

Debugging and Profiling

Typically, the same debuggers and profilers you rely on for x86 are also available on NVIDIA Grace. The notable exceptions are vendor-specific products, for example, Intel® VTune. The capabilities provided by those tools are also provided by other tools on NVIDIA Grace (refer to Debugging for more information).

Language-Specific Guidance

Check the Languages page for any language-specific guidance related to LSE, locking, synchronization, and atomics. If no guidance is provided for a language, there are no Arm-specific issues for that language, and you can proceed as you would on any other platform.

Arm Vector Instructions: SVE and NEON

NVIDIA Grace implements two vector single-instruction-multiple-data (SIMD) instruction extensions:

  • Advanced SIMD Instructions (NEON)
  • Arm Scalable Vector Extensions (SVE)

Arm Advanced SIMD Instructions (or NEON) is the most common SIMD ISA for Arm64. It is a fixed-length SIMD ISA that supports 128-bit vectors. The first Arm-based supercomputer to appear on the Top500 Supercomputers list (Astra) used NEON to accelerate linear algebra, and many applications and libraries are already taking advantage of NEON.

More recently, Arm64 CPUs have started supporting Arm Scalable Vector Extensions (SVE), which is a length-agnostic SIMD ISA that supports more datatypes than NEON (for example, FP16), offers more powerful instructions (for example, gather/scatter), and supports vector lengths of more than 128 bits. SVE is currently found in NVIDIA Grace, the AWS Graviton 3, Fujitsu A64FX, and others. SVE is not a new version of NEON, but an entirely new SIMD ISA.

The following table provides a quick summary of the SIMD capabilities of some of the currently available Arm64 CPUs:

|                      | NVIDIA Grace | AWS Graviton3 | Fujitsu A64FX | AWS Graviton2 | Ampere Altra |
|----------------------|--------------|---------------|---------------|---------------|--------------|
| CPU Core             | Neoverse V2  | Neoverse V1   | A64FX         | Neoverse N1   | Neoverse N1  |
| SIMD ISA             | SVE2 & NEON  | SVE & NEON    | SVE & NEON    | NEON only     | NEON only    |
| NEON Configuration   | 4x128        | 4x128         | 2x128         | 2x128         | 2x128        |
| SVE Configuration    | 4x128        | 2x256         | 2x512         | N/A           | N/A          |
| SVE Version          | 2            | 1             | 1             | N/A           | N/A          |
| NEON FMLA FP64 TPeak | 16           | 16            | 8             | 8             | 8            |
| SVE FMLA FP64 TPeak  | 16           | 16            | 32            | N/A           | N/A          |

Many recent Arm64 CPUs provide the same peak theoretical performance for NEON and SVE. For example, NVIDIA Grace can retire four 128-bit NEON operations or four 128-bit SVE2 operations. Although the theoretical peak performance of SVE and NEON is the same for these CPUs, SVE (and especially SVE2) is a more capable SIMD ISA with support for complex data types and advanced features that enable the vectorization of complicated code. In practice, kernels that cannot be vectorized in NEON can be vectorized with SVE. So, although SVE will not beat NEON in a performance drag race, it can dramatically improve the overall performance of the application by vectorizing loops that would otherwise have executed with scalar instructions.

Fortunately, auto-vectorizing compilers are usually the best choice when programming Arm SIMD ISAs. The compiler will generally make the best decision on when to use SVE or NEON, and it will take advantage of SVE’s advanced auto-vectorization features more easily than a human coding in intrinsics or assembly can.

Avoid writing SVE or NEON intrinsics. To realize the best performance for a loop, use the appropriate command-line options with your favorite auto-vectorizing compiler. You might need to use compiler directives or make changes in the high-level code to facilitate auto-vectorization, but this will be much easier and more maintainable than writing intrinsics. Leave the finer details to the compiler and focus on code patterns that auto-vectorize well.
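As a concrete (hypothetical) illustration, the indirect access in the loop below generally cannot be vectorized with fixed-length NEON alone, but SVE's gather loads allow the compiler to vectorize it. The function and variable names are ours, not part of any benchmark:

// saxpy_gather.c: a gather-style access pattern that SVE can vectorize.
// Example build (assumes GCC or Clang on Grace): gcc -O3 -mcpu=native -c saxpy_gather.c
void saxpy_gather(int n, float a, const float *x, const int *idx, float *y)
{
    for (int i = 0; i < n; ++i) {
        // x[idx[i]] is a gather load; SVE provides gather/scatter instructions,
        // so the compiler may vectorize this loop with SVE even when a
        // NEON-only vectorization is not possible or profitable.
        y[i] += a * x[idx[i]];
    }
}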

Compiler-Driven Auto-Vectorization

The key to maximizing auto-vectorization is to allow the compiler to take advantage of the available hardware features. By default, the GCC and LLVM compilers take a conservative approach and do not enable advanced features unless explicitly told to do so. The easiest way to enable all available features for GCC or LLVM is to use the -mcpu compiler flag. If you are compiling on the same CPU on which the code will run, use -mcpu=native. Otherwise, you can use -mcpu=<target>, where <target> is one of the CPU identifiers, for example, -mcpu=neoverse-v2.

The NVIDIA compilers take a more aggressive approach. By default, these compilers assume that the machine on which you are compiling is the machine on which you will run and enable all available hardware features that were detected at compile time. When compiling with the NVIDIA compilers natively on Grace, you do not need additional flags.

Note: When possible, use the most recent version of your compiler. For example, GCC 9 supported auto-vectorization fairly well, but GCC 12 has shown impressive improvement over GCC 9 in most cases, and GCC 13 further improves auto-vectorization.

The second key compiler feature is the compiler vectorization report. GCC uses the -fopt-info flags to report on auto-vectorization success or failure. You can use the generated informational messages to guide code annotations or transformations that will facilitate auto-vectorization. For example, compiling with -fopt-info-vec-missed reports which loops were not vectorized, as shown in the sketch below.
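For instance, in the hypothetical kernel below the compiler may report a missed or partial vectorization when it cannot prove that the pointer arguments do not alias; adding restrict (or an equivalent directive) resolves the issue. The exact wording of the report varies by compiler version:

// scale.c -- example build: gcc -O3 -mcpu=native -fopt-info-vec-missed -c scale.c
// Without 'restrict', the compiler must assume dst and src may overlap and
// will either emit a runtime overlap check or report the loop as not vectorized.
// With 'restrict', the loop vectorizes cleanly.
void scale(int n, float a, float *restrict dst, const float *restrict src)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a * src[i];
}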

Relaxed Vector Conversions

Arm NEON differentiates between vectors of signed and unsigned types. For example, GCC will not implicitly cast between vectors of signed and unsigned 64-bit integers:

#include <arm_neon.h>
...
uint64x2_t u64x2;
int64x2_t s64x2;
// Error: cannot convert 'int64x2_t' to 'uint64x2_t' in assignment
u64x2 = s64x2;

To perform the cast, you must use NEON’s vreinterpretq functions:

u64x2 = vreinterpretq_u64_s64(s64x2);

Unfortunately, some codes written for other SIMD ISAs rely on these kinds of implicit conversions. If you see errors about “no known conversion” in a code that builds for AVX but does not build for NEON, you might need to relax GCC’s vector conversion rules:

/tmp/velox/third_party/xsimd/include/xsimd/types/xsimd_batch.hpp:35:11: note:   no known conversion for argument 1 from 'xsimd::batch<long int>' to 'const xsimd::batch<long unsigned int>&'

To allow implicit conversions between vectors with differing numbers of elements and/or incompatible element types, use the -flax-vector-conversions flag. This flag should be fine for legacy code, but it should not be used for new code. The safest option is to use the appropriate vreinterpretq calls.
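As a minimal, self-contained sketch (our own example, not taken from any particular code base), the explicit reinterpret looks like this:

#include <arm_neon.h>

// vreinterpretq_u64_s64 is a zero-cost bit reinterpretation, not a value
// conversion, so it can replace the implicit cast that GCC rejects.
uint64x2_t as_unsigned(int64x2_t s)
{
    return vreinterpretq_u64_s64(s);
}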

Runtime Detection of Supported SIMD Instructions

To make your binaries more portable across various Arm64 CPUs, use the Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core that is compliant with Armv8.4 must support dot-product instructions, but dot-products are optional in Armv8.2 and Armv8.3. A developer who wants to build an application or library that detects the supported instructions at runtime can follow this example:

#include<sys/auxv.h>
......
  uint64_t hwcaps = getauxval(AT_HWCAP);
  has_crc_feature = hwcaps & HWCAP_CRC32 ? true : false;
  has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false;
  has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false;
  has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false;
  has_sve_feature = hwcaps & HWCAP_SVE ? true : false;

The full list of Arm64 hardware capabilities is defined in the glibc header file and in the Linux kernel.

Porting Codes with SSE/AVX Intrinsics to NEON

Detecting Arm64 systems

If you see errors like error: unrecognized command-line option '-msse2', it usually means that the build system is failing to detect Grace as an Arm CPU and is incorrectly using x86 target-feature compiler flags.

To detect an Arm64 system, the build system can use the following command:

(test $(uname -m) = "aarch64" && echo "arm64 system") || echo "other system"

Alternatively, you can compile, run, and check the return value of a C program:

# cat << EOF > check-arm64.c
int main () {
#ifdef __aarch64__
  return 0;
#else
  return 1;
#endif
}
EOF

# gcc check-arm64.c -o check-arm64
# (./check-arm64 && echo "arm64 system") || echo "other system"

Translating x86 Intrinsics to NEON

When programs contain code with x86 intrinsics, drop-in intrinsic translation tools like SIMDe or sse2neon can be used to quickly obtain a working program on Arm64. This is a good starting point for rewriting the x86 intrinsics in NEON or SVE and will quickly get a prototype up and running. For example, to port code using AVX2 intrinsics with SIMDe:

#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/x86/avx2.h"

SIMDe provides a quick starting point to port performance critical codes to Arm64. It shortens the time needed to get a working program that can be used to extract profiles and to identify hot paths in the code. After a profile is established, the hot paths can be rewritten to avoid the overhead of the generic translation.

Since you are rewriting your x86 intrinsics, you might want to take this opportunity to create a more portable version. Here are some suggestions to consider:

  • Rewrite in native C/C++, Fortran, or another high-level compiled language. Compilers are constantly improving, and technologies like Arm SVE enable the auto-vectorization of codes that formerly would not vectorize. You can avoid platform-specific intrinsics entirely and let the compiler do all the work.
  • If your application is written in C++, use std::experimental::simd from the C++ Parallelism Technical Specification V2 by using the <experimental/simd> header.
  • Use the SLEEF Vectorized Math Library as a header-based set of “portable intrinsics”.
  • Instead of Time Stamp Counter (TSC) RDTSC intrinsics, use standards-compliant portable timers, for example, std::chrono (C++), clock_gettime (C/POSIX), omp_get_wtime (OpenMP), MPI_Wtime (MPI), and so on (see the sketch after this list).
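For example, here is a minimal portable timer sketch using the standard POSIX clock_gettime interface in place of an x86 RDTSC intrinsic (the helper name is ours):

#include <stdint.h>
#include <time.h>

// Monotonic wall-clock time in nanoseconds; works on any POSIX system,
// including Arm64, with no architecture-specific instructions.
static inline uint64_t wallclock_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

Call it before and after the region of interest and subtract the two values to measure elapsed time.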

Locks, Synchronization, and Atomics

Efficient synchronization is critical to achieving good performance in applications with high thread counts, but synchronization is a complex and nuanced topic. See below for a high level overview (refer to the Synchronization Overview and Case Study on Arm Architecture whitepaper from Arm for more information).

The Arm Memory Model

One of the most significant differences between Arm and x86 CPUs is their memory model: the Arm architecture has a weak memory model that allows more compiler and hardware optimization to boost system performance, whereas the x86 architecture uses a Total Store Order (TSO) model. Different memory models can cause low-level code (for example, drivers) to function well on one architecture but encounter performance problems or failures on the other.

Note

The unique features of the Arm memory model are only relevant if you are writing low-level code, such as assembly language. Most software developers will not be affected by the change in memory model.

The details about Arm’s memory model are below the application level and will be completely invisible to most users. If you are writing in a high-level language such as C, C++, or Fortran, you do not need to know the nuances of Arm’s memory model. The one exception to this general rule is code that uses boutique synchronization constructs instead of standard best practices, for example, using volatile as a means of thread synchronization.

Code that deviates from established standards or ignores best practices is almost guaranteed to be broken; rewrite it using system-provided locks and condition variables or the C/C++ standard atomics (stdatomic). (Refer to https://github.com/ParRes/Kernels/issues/611 for an example of this type of bug.)
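For instance, a flag that one thread sets and another thread polls should be a C11 atomic with explicit ordering rather than a volatile variable. The snippet below is a minimal sketch with names of our own choosing:

#include <stdatomic.h>
#include <stdbool.h>

// Shared ready flag between a producer and a consumer thread. C11 atomics
// give well-defined ordering on both x86 and Arm; 'volatile' alone does not
// provide the memory ordering the weak Arm memory model requires.
static atomic_bool data_ready;

void publish(void)   // producer: called after the shared data is written
{
    atomic_store_explicit(&data_ready, true, memory_order_release);
}

bool is_ready(void)  // consumer: check before reading the shared data
{
    return atomic_load_explicit(&data_ready, memory_order_acquire);
}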

Arm is not the only architecture that uses a weak memory model. If your application already runs on CPUs that are not x86-based, you might encounter fewer bugs that are related to the weak memory model. Specifically, if your application has been ported to a CPU implementing the POWER architecture, for example, IBM POWER9, the application will work on the Arm memory model.

Large-System Extension (LSE) Atomic Instructions

All server-class Arm64 processors, such as NVIDIA Grace, have support for the Large-System Extension (LSE), which was first introduced in Armv8.1. LSE provides low-cost atomic operations that can improve system throughput for thread communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. This improvement is not generally true for older Arm64 CPUs like the Marvell ThunderX2 or the Fujitsu A64FX (refer to these slides from the ISC 2022 AHUG Workshop for more information).

When building an application from source, the compiler needs to generate LSE atomic instructions for applications that use atomic operations. For example, the code of databases such as PostgreSQL contains atomic constructs: C++11 code with std::atomic statements that translate into atomic operations. Since GCC 9.4, GCC’s -mcpu=native flag enables all instructions supported by the host CPU, including LSE. To confirm that LSE instructions are generated, check that the output of the objdump command-line utility contains LSE instructions:

$ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l

To check whether the application binary contains load and store exclusives, run the following command:

$ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l
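As a quick sanity check (our own hypothetical example), a simple atomic increment built with -mcpu=native on Grace should show up in the first count above (for example, as an ldadd) rather than as an ldxr/stxr retry loop:

// atomic_inc.c -- example build: gcc -O3 -mcpu=native -c atomic_inc.c
// then inspect with: objdump -d atomic_inc.o
#include <stdatomic.h>

// With LSE enabled, this typically compiles to a single LSE atomic
// instruction instead of a load/store-exclusive loop.
void increment(atomic_long *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}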

Runtime CPU Detection

You can determine the available Arm CPU resources and topology at runtime in the following ways:

  • CPU architecture and supported instructions
  • CPU manufacturer
  • Number of CPU sockets
  • CPU cores per socket
  • Number of NUMA nodes
  • Number of NUMA nodes per socket
  • CPU cores per NUMA node

Well-established portable libraries like libnuma and hwloc are a great choice on Grace. You can also use Arm’s CPUID registers or query OS files. Since many of these methods serve the same function, you should choose the method that best fits your application.

If you are implementing your own approach, look at the Arm Architecture Registers, especially the Main ID Register MIDR_EL1: https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register.

The source code for the lscpu utility is a great example of how to retrieve and use these registers. For example, to learn how to translate the CPU part number in the MIDR_EL1 register to a human-readable string, read https://github.com/util-linux/util-linux/blob/master/sys-utils/lscpu-arm.c.

Here is the output of lscpu on an NVIDIA Grace Hopper Superchip:

nvidia@localhost:/home/nvidia$ lscpu
Architecture:          aarch64
  CPU op-mode(s):      64-bit
  Byte Order:          Little Endian
CPU(s):                72
  On-line CPU(s) list: 0-71
Vendor ID:             ARM
  Model:               0
  Thread(s) per core:  1
  Core(s) per socket:  72
  Socket(s):           1
  Stepping:            r0p0
  Frequency boost:     disabled
  CPU max MHz:         3438.0000
  CPU min MHz:         81.0000
  BogoMIPS:            2000.00
  Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
  L1d:                 4.5 MiB (72 instances)
  L1i:                 4.5 MiB (72 instances)
  L2:                  72 MiB (72 instances)
  L3:                  114 MiB (1 instance)
NUMA:
  NUMA node(s):        1
  NUMA node0 CPU(s):   0-71
  NUMA node1 CPU(s):
Vulnerabilities:
  Itlb multihit:       Not affected
  L1tf:                Not affected
  Mds:                 Not affected
  Meltdown:            Not affected
  Mmio stale data:     Not affected
  Retbleed:            Not affected
  Spec store bypass:   Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:          Mitigation; __user pointer sanitization
  Spectre v2:          Not affected
  Srbds:               Not affected
  Tsx async abort:     Not affected

CPU Hardware Capabilities

To make your binaries more portable across various Arm64 CPUs, you can use Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core that is compliant with Armv8.4 must support dot-product instructions, but dot-products are optional in Armv8.2 and Armv8.3. A developer who wants to build an application or library that detects the supported instructions at runtime can use this example:

#include<sys/auxv.h>
......
  uint64_t hwcaps = getauxval(AT_HWCAP);
  has_crc_feature = hwcaps & HWCAP_CRC32 ? true : false;
  has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false;
  has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false;
  has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false;
  has_sve_feature = hwcaps & HWCAP_SVE ? true : false;

A complete list of Arm64 hardware capabilities is defined in the glibc header file and in the Linux kernel.

Example Source Code

Here is a complete yet simple example code that includes some of the methods mentioned above.

#include <stdio.h>
#include <sys/auxv.h>
#include <numa.h>

// https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register
typedef union
{
    struct {
        unsigned int revision : 4;
        unsigned int part : 12;
        unsigned int arch : 4;
        unsigned int variant : 4;
        unsigned int implementer : 8;
        unsigned int _RES0 : 32;
    };
    unsigned long bits;
} MIDR_EL1;

static MIDR_EL1 read_MIDR_EL1()
{
    MIDR_EL1 reg;
    asm("mrs %0, MIDR_EL1" : "=r" (reg.bits));
    return reg;
}

static const char * get_implementer_name(MIDR_EL1 midr)
{
    switch(midr.implementer) 
    {
        case 0xC0: return "Ampere";
        case 0x41: return "Arm";
        case 0x42: return "Broadcom";
        case 0x43: return "Cavium";
        case 0x44: return "DEC";
        case 0x46: return "Fujitsu";
        case 0x48: return "HiSilicon";
        case 0x49: return "Infineon";
        case 0x4D: return "Motorola";
        case 0x4E: return "NVIDIA";
        case 0x50: return "Applied Micro";
        case 0x51: return "Qualcomm";
        case 0x56: return "Marvell";
        case 0x69: return "Intel";
        default:   return "Unknown";
    }
}

static const char * get_part_name(MIDR_EL1 midr)
{
    switch(midr.implementer) 
    {
        case 0x41: // Arm Ltd.
            switch (midr.part) {
                case 0xd03: return "Cortex A53";
                case 0xd07: return "Cortex A57";
                case 0xd08: return "Cortex A72";
                case 0xd09: return "Cortex A73";
                case 0xd0c: return "Neoverse N1";
                case 0xd40: return "Neoverse V1";
                case 0xd4f: return "Neoverse V2";
                default:    return "Unknown";
            }
        case 0x42: // Broadcom
            switch (midr.part) {
                case 0x516: return "Vulcan";
                default:    return "Unknown";
            }
        case 0x43: // Cavium
            switch (midr.part) {
                case 0x0a1: return "ThunderX";
                case 0x0af: return "ThunderX2";
                default:    return "Unknown";
            }
        case 0x46: // Fujitsu
            switch (midr.part) {
                case 0x001: return "A64FX";
                default:    return "Unknown";
            }
        case 0x4E: // NVIDIA
            switch (midr.part) {
                case 0x000: return "Denver";
                case 0x003: return "Denver 2";
                case 0x004: return "Carmel";
                default:    return "Unknown";
            }
        case 0x50: // Applied Micro
            switch (midr.part) {
                case 0x000: return "EMAG 8180";
                default:    return "Unknown";
            }
        default: return "Unknown";
    }
}

int main(void) 
{
    // Main ID register
    MIDR_EL1 midr = read_MIDR_EL1();

    // CPU ISA capabilities
    unsigned long hwcaps = getauxval(AT_HWCAP);

    printf("CPU revision    : 0x%x\n", midr.revision);
    printf("CPU part number : 0x%x (%s)\n", midr.part, get_part_name(midr));
    printf("CPU architecture: 0x%x\n", midr.arch);
    printf("CPU variant     : 0x%x\n", midr.variant);
    printf("CPU implementer : 0x%x (%s)\n", midr.implementer, get_implementer_name(midr));
    printf("CPU LSE atomics : %sSupported\n", (hwcaps & HWCAP_ATOMICS) ? "" : "Not ");
    printf("CPU NEON SIMD   : %sSupported\n", (hwcaps & HWCAP_ASIMD)   ? "" : "Not ");
    printf("CPU SVE SIMD    : %sSupported\n", (hwcaps & HWCAP_SVE)     ? "" : "Not ");
    printf("CPU Dot-product : %sSupported\n", (hwcaps & HWCAP_ASIMDDP) ? "" : "Not ");
    printf("CPU FP16        : %sSupported\n", (hwcaps & HWCAP_FPHP)    ? "" : "Not ");
    printf("CPU BF16        : %sSupported\n", (hwcaps & HWCAP2_BF16)   ? "" : "Not ");

    if (numa_available() == -1) {
        printf("libnuma not available\n");
    }
    printf("CPU NUMA nodes  : %d\n", numa_num_configured_nodes());
    printf("CPU Cores       : %d\n", numa_num_configured_cpus());

    return 0;
}

Debugging

This section provides information about useful techniques and tools to find and resolve bugs while migrating your applications to NVIDIA Grace.

Sanitizers

The compiler might generate code and lay out data slightly differently on Arm64 than on an x86 system, which could expose latent memory bugs that were previously hidden. On GCC, the easiest way to look for these bugs is to compile with the memory sanitizers by adding the following to the standard compiler flags:

    CFLAGS += -fsanitize=address -fsanitize=undefined
    LDFLAGS += -fsanitize=address -fsanitize=undefined

Run the resulting binary, and bugs that are detected by the sanitizers will cause the program to exit immediately and print helpful stack traces and other information.
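For example, an off-by-one heap write like the hypothetical one below might go unnoticed on one platform but is reported immediately when the program is built with the sanitizer flags above:

// asan_demo.c -- example build: gcc -O1 -g -fsanitize=address -fsanitize=undefined asan_demo.c
#include <stdlib.h>

int main(void)
{
    int *buf = malloc(8 * sizeof(int));
    buf[8] = 42;   // writes one element past the end of the allocation;
                   // AddressSanitizer reports a heap-buffer-overflow here
    free(buf);
    return 0;
}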

Memory Ordering

Arm is weakly ordered, like POWER and other modern architectures, and x86 is a variant of total-store-ordering (TSO). Code that relies on TSO might lack barriers to properly order memory references. Arm64 systems are weakly ordered multi-copy-atomic.

Although TSO allows reads to occur out-of-order with writes, and a processor to observe its own write before it is visible to others, the Armv8 memory model provides additional relaxations for performance and power efficiency. Code relying on pthread mutexes or locking abstractions found in C++, Java or other languages should not notice any difference. Code that has a bespoke implementation of lockless data structures, or implements its own synchronization primitives, will have to use the proper intrinsics and barriers to correctly order memory transactions (refer to Locks, Synchronization, and Atomics for more information).

Language-Specific Considerations

This section contains language-specific information and recommendations. If no section exists for a language, there is no specific guidance beyond using a suitably current version of the language, and you can proceed as you would on any other CPU, Arm-based or otherwise.

Broadly speaking, applications that are built with interpreted or JIT’ed languages (Python, Java, PHP, Node.js, and so on) should run as-is on Arm64. Applications written in compiled languages, including C/C++, Fortran, and Rust, need to be compiled for the Arm64 architecture. Most modern build systems (Make, CMake, Ninja, and so on) will just work on Arm64.

C/C++ on NVIDIA Grace

There are many C/C++ compilers available for NVIDIA Grace, including the NVIDIA HPC Compilers, Clang/LLVM, GCC, the Arm Compiler for Linux (ACfL), and the HPE Cray compilers.

Selecting a Compiler

The compiler you use depends on your application’s needs. If in doubt, try the NVIDIA HPC Compiler first because this compiler will always have the most recent updates and enhancements for Grace. If you prefer Clang, NVIDIA provides builds of Clang for NVIDIA Grace that are supported by NVIDIA and certified as a CUDA host compiler. GCC, ACfL, and the HPE Cray compilers also have their own advantages. As a general strategy, default to an NVIDIA-provided compiler and fall back to a third-party compiler as needed.

When possible, use the latest compiler version that is available on your system. Newer compilers provide better support and optimizations for Grace, and when using newer compilers, many codes will demonstrate significantly better performance.

The NVIDIA HPC Compilers accept PGI flags and many GCC or Clang compiler flags. These compilers include the NVFORTRAN, NVC++, and NVC compilers. They work with an assembler, linker, libraries, and header files on your target system, and include a CUDA toolchain, libraries, and header files for GPU computing. Refer to the NVIDIA HPC Compiler’s User’s Guide for more information. The freely available NVIDIA HPC SDK is the best way to quickly get started with the NVIDIA HPC Compilers.

Most compiler flags for GCC and Clang/LLVM operate the same on Arm64 as on other architectures except for the -mcpu flag. On Arm64, this flag specifies both the target architecture and the tuning strategy. It is generally better to use -mcpu instead of -march or -mtune on Grace. You can find additional details in this presentation given at Stony Brook University.

| CPU          | Flag              | GCC version | LLVM version |
|--------------|-------------------|-------------|--------------|
| NVIDIA Grace | -mcpu=neoverse-v2 | 12.3+       | 16+          |
| Ampere Altra | -mcpu=neoverse-n1 | 9+          | 10+          |
| Any Arm64    | -mcpu=native      | 9+          | 10+          |

If you are cross compiling, use the appropriate -mcpu option for your target CPU, for example, to target NVIDIA Grace when compiling on an AWS Graviton 3 use -mcpu=neoverse-v2.

Compiler-Supported Hardware Features

The common -mcpu=native flag enables all instructions supported by the host CPU. You can check which Arm features GCC will enable with the -mcpu=native flag by running the following command:

gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE

For example, on the NVIDIA Grace CPU with GCC 12.3, we see “__ARM_FEATURE_ATOMICS 1” indicating that LSE atomics are enabled:

$ gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE
#define __ARM_FEATURE_ATOMICS 1
#define __ARM_FEATURE_SM3 1
#define __ARM_FEATURE_SM4 1
#define __ARM_FEATURE_RCPC 1
#define __ARM_FEATURE_SVE_VECTOR_OPERATORS 1
#define __ARM_FEATURE_SVE2_AES 1
#define __ARM_FEATURE_AES 1
#define __ARM_FEATURE_SVE 1
#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_JCVT 1
#define __ARM_FEATURE_DOTPROD 1
#define __ARM_FEATURE_BF16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_MATMUL_INT8 1
#define __ARM_FEATURE_CRYPTO 1
#define __ARM_FEATURE_BF16_VECTOR_ARITHMETIC 1
#define __ARM_FEATURE_FRINT 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_CLZ 1
#define __ARM_FEATURE_SHA512 1
#define __ARM_FEATURE_QRDMX 1
#define __ARM_FEATURE_FMA 1
#define __ARM_FEATURE_SHA2 1
#define __ARM_FEATURE_SVE2_SHA3 1
#define __ARM_FEATURE_COMPLEX 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
#define __ARM_FEATURE_SVE2_SM4 1
#define __ARM_FEATURE_SVE_MATMUL_INT8 1
#define __ARM_FEATURE_FP16_FML 1
#define __ARM_FEATURE_UNALIGNED 1
#define __ARM_FEATURE_SHA3 1
#define __ARM_FEATURE_CRC32 1
#define __ARM_FEATURE_SVE_BITS 0
#define __ARM_FEATURE_NUMERIC_MAXMIN 1
#define __ARM_FEATURE_SVE2 1
#define __ARM_FEATURE_SVE2_BITPERM 1

Porting SSE or AVX Intrinsics

Header-based intrinsics translation tools, such as SIMDe and SSE2NEON are a great way to quickly get a prototype running on Arm64. These tools automatically translate x86 intrinsics to SVE or NEON intrinsics (refer to Arm Single-Instruction Multiple-Data Instructions). This approach provides a quick starting point when porting performance critical codes and shortens the time needed to get a working program that can be used to extract profiles and to identify hot paths in the code. After a profile is established, the hot paths can be rewritten to avoid the overhead of the automatic translation of intrinsics.

Note: GCC’s __sync built-ins are outdated and might be biased towards the x86 memory model. Use the __atomic versions of these functions instead of the __sync versions. Refer to the GCC documentation for more information.

Signedness of the char Type

The C standard does not specify the signedness of the char type. On x86, many compilers assume that char is signed by default, but on Arm64, compilers often assume it is unsigned by default. This difference can be addressed by using standard integer types that specify signedness (for example, uint8_t and int8_t) or by specifying char signedness with compiler flags, for example, -fsigned-char or -funsigned-char.
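The following minimal example (our own) shows how the same comparison can evaluate differently depending on the platform’s default char signedness, and why fixed-width types avoid the ambiguity:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    char c = 0xFF;   // -1 if char is signed (typical on x86), 255 if unsigned (typical on Arm64)
    if (c < 0)
        printf("char is signed on this platform\n");
    else
        printf("char is unsigned on this platform\n");

    int8_t s = -1;   // fixed-width types behave identically on every platform
    printf("int8_t value: %d\n", s);
    return 0;
}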

Arm Instructions for Machine Learning

NVIDIA Grace supports the Arm dot-product instructions (commonly used for quantized machine learning inference) and half-precision floating point (FP16). Compared to single-precision floating point, these features enable performant and power-efficient machine learning by doubling the number of operations per second and reducing the memory footprint, while retaining a large dynamic range. These features are enabled automatically in the NVIDIA compilers. To enable them in the GNU or LLVM compilers, compile with -mcpu=native or -mcpu=neoverse-v2.
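As a sketch (our own example), a quantized dot-product kernel like the one below can be mapped onto the Arm dot-product instructions by an auto-vectorizing compiler when built with -O3 and -mcpu=native or -mcpu=neoverse-v2:

#include <stdint.h>

// int8 dot product with an int32 accumulator: the pattern targeted by the
// Arm dot-product instructions (for example, sdot).
int32_t dot_s8(int n, const int8_t *a, const int8_t *b)
{
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}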

Fortran on NVIDIA Grace

There are many Fortran compilers available for NVIDIA Grace, including NVFORTRAN (part of the NVIDIA HPC SDK), GFORTRAN, the Arm Compiler for Linux (ACfL), and the HPE Cray compilers.

Selecting a Compiler

The compiler you use depends on your application’s needs. If in doubt, try the NVIDIA HPC Compiler first because this compiler will always have the most recent updates and enhancements for Grace. GFORTRAN, ACfL, and the HPE Cray compilers also have their own advantages. As a general strategy, default to an NVIDIA-provided compiler and fall back to a third-party compiler as needed.

When possible, use the latest compiler version that is available on your system. Newer compilers provide better support and optimizations for Grace, and when using newer compilers, many codes will demonstrate significantly better performance.

The NVIDIA HPC Compilers accept PGI flags and many GCC or Clang compiler flags. These compilers include the NVFORTRAN, NVC++, and NVC compilers. They work with an assembler, linker, libraries, and header files on your target system, and include a CUDA toolchain, libraries, and header files for GPU computing. Refer to the NVIDIA HPC Compiler’s User’s Guide for more information. The freely available NVIDIA HPC SDK is the best way to quickly get started with the NVIDIA HPC Compilers.

Most compiler flags for GCC and Clang/LLVM operate the same on Arm64 as on other architectures except for the -mcpu flag. On Arm64, this flag specifies both the target architecture and the tuning strategy. It is generally better to use -mcpu instead of -march or -mtune on Grace. You can find additional details in this presentation given at Stony Brook University.

| CPU          | Flag              | GFORTRAN version |
|--------------|-------------------|------------------|
| NVIDIA Grace | -mcpu=neoverse-v2 | 12.3+            |
| Ampere Altra | -mcpu=neoverse-n1 | 9+               |
| Any Arm64    | -mcpu=native      | 9+               |

If you are cross compiling, use the appropriate -mcpu option for your target CPU, for example, to target NVIDIA Grace when compiling on an AWS Graviton 3 use -mcpu=neoverse-v2.

Rust on NVIDIA Grace

Rust supports Arm64 systems as a Tier 1 platform.

Large-System Extensions (LSE)

LSE improves system throughput for CPU-to-CPU communication, locks, and mutexes. LSE can be enabled in Rust, and there have been instances on larger machines where performance improved by more than three times after setting the RUSTFLAGS environment variable and rebuilding the project:

export RUSTFLAGS="-Ctarget-cpu=neoverse-v2"
cargo build --release

Python on NVIDIA Grace

Python is an interpreted, high-level, general-purpose programming language with interpreters available for many operating systems and architectures, including Arm64. Python 2.7 reached end-of-life on January 1, 2020, so we recommend that you upgrade to a Python 3.x version.

Installing Python packages

pip, the standard package installer for Python, pulls packages from the Python Package Index (PyPI) and other indexes. If pip cannot find a pre-compiled package, it automatically downloads, compiles, and builds the package from source code. Installing from source typically takes a few minutes longer than installing a pre-built package, especially for large packages (for example, pandas).

To install common Python packages from the source code, you need to install the following development tools:

On RedHat

sudo yum install "@Development tools" python3-pip python3-devel blas-devel gcc-gfortran lapack-devel
python3 -m pip install --user --upgrade pip

On Debian/Ubuntu

sudo apt update
sudo apt-get install build-essential python3-pip python3-dev libblas-dev gfortran liblapack-dev
python3 -m pip install --user --upgrade pip

Scientific and Numerical Applications

Python relies on native code to achieve high performance. For scientific and numerical applications, NumPy and SciPy provide an interface to high performance computing libraries such as ATLAS, BLAS, BLIS, OpenBLAS, and so on. These libraries contain code tuned for Arm64 processors, and especially the Arm Neoverse V2 core found in NVIDIA Grace.

We recommend using the latest software versions whenever possible. If migrating to the latest version is not feasible, ensure that you use at least the minimum versions recommended below. Multiple fixes related to data precision and correctness on Arm64 went into OpenBLAS between v0.3.9 and v0.3.17, and the SciPy and NumPy versions listed below upgraded their bundled OpenBLAS from v0.3.9 to v0.3.17.

Here are the minimum versions:

  • OpenBLAS: >= v0.3.17
  • SciPy: >= v1.7.2
  • NumPy: >= 1.21.1

The default SciPy and NumPy binary installations with pip3 install numpy scipy are configured to use OpenBLAS. The default installations of SciPy and NumPy are easy to set up and well tested.

Anaconda and Conda

Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment.

Anaconda has supported Arm64 (via AWS Graviton 2) since 2021, so Anaconda works very well on NVIDIA Grace. Instructions to install the full Anaconda package installer can be found at https://docs.anaconda.com/. Anaconda also offers a lightweight version called Miniconda, which is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip and zlib.

Java on NVIDIA Grace

Java is well supported and generally performant out-of-the-box on Arm64. While Java 8 is fully supported on Arm64, some customers have not been able to obtain the CPU’s full performance benefit until after switching to Java 11.

This section includes specific details about building and tuning Java applications on Arm64.

Java JVM Options

There are JVM options that can lead to better performance. The flags -XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M have shown large (1.5x) improvements in some Java workloads. ReservedCodeCacheSize and InitialCodeCacheSize should be equal and between 64M and 127M. The JIT compiler stores generated code in the code cache, and these flags reduce the code cache from the default 240M to a smaller size. The smaller code cache can help the CPU better cache and predict JIT’ed code, and disabling tiered compilation lets the JIT compiler work within the smaller code cache. These flags help some workloads but can hurt others, so testing with and without them is essential.

Java Stack Size

For some JVMs, the default stack size for Java threads (ThreadStackSize) is 2MB on Arm64 instead of the 1MB used on x86. You can check the default with the following:

$ java -XX:+PrintFlagsFinal -version | grep ThreadStackSize
     intx CompilerThreadStackSize = 2048  {pd product} {default}
     intx ThreadStackSize         = 2048  {pd product} {default}
     intx VMThreadStackSize       = 2048  {pd product} {default}

The default can be easily changed on the command line with either -XX:ThreadStackSize=<kbytes> or -Xss<bytes>. Notice that -XX:ThreadStackSize interprets its argument as kilobytes and -Xss interprets it as bytes. As a result, -XX:ThreadStackSize=1024 and -Xss1m will both set the stack size for Java threads to 1 megabyte:

$ java -Xss1m -XX:+PrintFlagsFinal -version | grep ThreadStackSize
     intx CompilerThreadStackSize                  = 2048                                   {pd product} {default}
     intx ThreadStackSize                          = 1024                                   {pd product} {command line}
     intx VMThreadStackSize                        = 2048                                   {pd product} {default}

Typically, you do not have to change the default value because the thread stack will be committed lazily as it grows. Regardless of the default value, the thread will always only commit as much stack as it really uses (at page size granularity). However, there is one exception to this rule. If Transparent Huge Pages (THP) are turned on by default on a system, the stack will be completely committed to memory from the start. If you are using hundreds, or even thousands of threads, this memory overhead can be considerable.

To mitigate this issue, you can either manually change the stack size on the command line (as described above) or you can change the default for THP from always to madvise:

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Even if the default is changed from always to madvise, if you specify -XX:+UseTransparentHugePages on the command line, the JVM can still use THP for the Java heap and code cache.

Additional Resources

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice. Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete. NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks

NVIDIA and the NVIDIA logo are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Arm

Arm, AMBA, and ARM Powered are registered trademarks of Arm Limited. Cortex, MPCore, and Mali are trademarks of Arm Limited. All other brands or product names are the property of their respective holders. “Arm” is used to represent ARM Holdings plc; its operating company Arm Limited; and the regional subsidiaries Arm Inc.; Arm KK; Arm Korea Limited.; Arm Taiwan Limited; Arm France SAS; Arm Consulting (Shanghai) Co. Ltd.; Arm Germany GmbH; Arm Embedded Technologies Pvt. Ltd.; Arm Norway, AS, and Arm Sweden AB.

OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

© 2023 NVIDIA Corporation & Affiliates. All rights reserved.