How to Use This Guide
This guide is for end users and application developers working with the NVIDIA® Grace CPU who want to achieve optimal performance for key benchmarks and applications (workloads). It includes procedures, sample code, reference performance numbers, recommendations, and technical best practices directly related to the NVIDIA Grace CPU. Following the instructions given in this guide will help you realize the best possible performance for your particular system.
This guide is a living document that is frequently updated with the latest recommendations, so it is best read online at https://nvidia.github.io/grace-cpu-benchmarking-guide/. If you would like to help improve the guide, you can create a GitHub issue at https://github.com/NVIDIA/grace-cpu-benchmarking-guide/issues/new.
Workload performance depends on many aspects of the system, so the measured performance of your system may be different from the performance figures presented here. These figures are provided as guidelines and should not be interpreted as performance expectations or targets. Do not use this guide for platform validation.
The guide is divided into the following sections:
- Platform Configuration: This section provides instructions to help you tune and optimize the platform configuration for benchmarking.
- Foundational Benchmarks: After checking the platform configuration, this section helps you complete a sanity check and confirm that the system is healthy.
- Common Benchmarks: This section has information about the industry-recognized benchmarks and mini-apps that represent the performance of key workloads.
- Applications: This section has information about maximizing the performance of full applications.
- Developer Best Practices: This section has general best practices for developing software on NVIDIA Grace.
The sections can be read in any order, but we strongly recommend you begin by tuning and sanity checking your platform.
Platform Configuration
Before benchmarking, you should check whether the platform configuration is optimal for the target benchmark. The optimal configuration can vary by benchmark, but there are some common high-level settings of which you should be aware. Most platforms benefit from the settings shown below.
Refer to the NVIDIA Grace Performance Tuning Guide and the platform-specific documentation at https://docs.nvidia.com/grace/ for instructions on how to tune your platform for optimal performance.
The settings shown on this page are intended to maximize system performance and may affect system security.
Linux Kernel
The following Linux kernel command line options are recommended for performance:
- init_on_alloc=0: Do not fill newly allocated pages and heap objects with zeroes by default.
- acpi_power_meter.force_cap_on=y: Enable the ACPI power meter with power capping.
- numa_balancing=disable: Disable automatic NUMA balancing.
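How you add these options depends on your distribution and bootloader. For example, on a GRUB-based system such as Ubuntu (adjust for your bootloader and distribution), you can append the options to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate the GRUB configuration:
# Example for a GRUB-based distribution; adjust for your bootloader.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 init_on_alloc=0 acpi_power_meter.force_cap_on=y numa_balancing=disable"/' /etc/default/grub
sudo update-grub    # on RHEL-like systems: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot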
You can confirm these command line options are set by reading /proc/cmdline:
cat /proc/cmdline | tr ' ' '\n'
BOOT_IMAGE=/boot/vmlinuz-6.2.0-1012-nvidia-64k
root=UUID=76c84c6d-a59f-4a8d-903e-4cb9ef69b970
ro
rd.driver.blacklist=nouveau
nouveau.modeset=0
earlycon
module_blacklist=nouveau
acpi_power_meter.force_cap_on=y
numa_balancing=disable
init_on_alloc=0
preempt=none
CPU and Memory
- Use the performance CPU frequency governor:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- Disable address space layout randomization (ASLR):
sudo sysctl -w kernel.randomize_va_space=0
- Drop the caches:
echo 3 | sudo tee /proc/sys/vm/drop_caches
- Set the kernel dirty page values to the default values:
echo 10 | sudo tee /proc/sys/vm/dirty_ratio
echo 5 | sudo tee /proc/sys/vm/dirty_background_ratio
- To reduce disk I/O, check for dirty page writeback every 60 seconds:
echo 6000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
- Disable the NMI watchdog:
echo 0 | sudo tee /proc/sys/kernel/watchdog
- Optional: allow unprivileged users to measure system events. Note that this setting has implications for system security. See https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html for additional information.
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
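To verify that the settings above took effect, you can read back the corresponding procfs and sysfs entries, for example:
# Read back the values set above; adjust the list to match the settings you applied
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
for f in kernel/randomize_va_space vm/dirty_ratio vm/dirty_background_ratio \
         vm/dirty_writeback_centisecs kernel/watchdog kernel/perf_event_paranoid ; do
    echo "$f = $(cat /proc/sys/$f)"
done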
Networking
- Set the networking connection tracking size:
echo 512000 | sudo tee /proc/sys/net/netfilter/nf_conntrack_max
- Before starting the test, allow the kernel to reuse TCP ports which may be in a TIME_WAIT state:
echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_reuse
Device I/O
- To enable full power for generic devices:
for i in `find /sys/devices/*/power/control` ; do echo 'on' > ${i} ; done
- To enable full power for PCI devices:
for i in `find /sys/bus/pci/devices/*/power/control` ; do echo 'on' > ${i} ; done
Benchmarking Software Environment
Begin by installing all available software updates, for example, sudo apt update && sudo apt upgrade on Ubuntu. Use the command ld --version to check that the GNU binutils version is 2.38 or later.
For best performance, GCC should be at version 12.3 or later. gcc --version will report the GCC version.
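For example:
# Print the versions of the linker (binutils) and compiler currently on the PATH
ld --version | head -n1
gcc --version | head -n1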
Many Linux distributions provide packages for GCC 12 compilers that can be installed alongside the system GCC. For example, sudo apt install gcc-12
on Ubuntu. See your Linux distribution’s instructions for installing and using various GCC versions.
In case your distribution does not provide these packages, or you are unable to install them, instructions for building and installing GCC are provided below.
A Recommended Software Stack
This guide shows a variety of compilers, libraries, and tools. Suggested minimum versions of the major software packages used in this guide are shown below, but any recent version of these tools will work well on NVIDIA Grace. Installation instructions for each package are provided in the associated link.
Building and Installing GCC 12.3 from Source
Many Linux distributions provide packages for GCC 12 compilers that can be installed alongside the system GCC. For example, sudo apt install gcc-12
on Ubuntu. You should prefer those packages over building GCC from source.
Follow the instructions below to build GCC 12.3 from source. Note that filesystem I/O performance can affect compilation time, so we recommend building GCC on a local filesystem or ramdisk, e.g. /tmp.
Download and unpack the GCC source code:
wget https://ftp.gnu.org/gnu/gcc/gcc-12.3.0/gcc-12.3.0.tar.xz
tar xvf gcc-12.3.0.tar.xz
Download the GCC prerequisites:
cd gcc-12.3.0
./contrib/download_prerequisites
You should see output similar to:
2024-01-24 08:04:44 URL:http://gcc.gnu.org/pub/gcc/infrastructure/gmp-6.2.1.tar.bz2 [2493916/2493916] -> "gmp-6.2.1.tar.bz2" [1]
2024-01-24 08:04:45 URL:http://gcc.gnu.org/pub/gcc/infrastructure/mpfr-4.1.0.tar.bz2 [1747243/1747243] -> "mpfr-4.1.0.tar.bz2" [1]
2024-01-24 08:04:47 URL:http://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.2.1.tar.gz [838731/838731] -> "mpc-1.2.1.tar.gz" [1]
2024-01-24 08:04:49 URL:http://gcc.gnu.org/pub/gcc/infrastructure/isl-0.24.tar.bz2 [2261594/2261594] -> "isl-0.24.tar.bz2" [1]
gmp-6.2.1.tar.bz2: OK
mpfr-4.1.0.tar.bz2: OK
mpc-1.2.1.tar.gz: OK
isl-0.24.tar.bz2: OK
All prerequisites downloaded successfully.
Configure, compile, and install GCC. Remember to set GCC_INSTALL_PREFIX appropriately! This example installs GCC to /opt/gcc/12.3, but any valid filesystem path can be used:
export GCC_INSTALL_PREFIX=/opt/gcc/12.3
./configure --prefix="$GCC_INSTALL_PREFIX" --enable-languages=c,c++,fortran --enable-lto --disable-bootstrap --disable-multilib
make -j
make install
To use the newly-installed GCC 12 compiler, simply update your $PATH environment variable:
export PATH=$GCC_INSTALL_PREFIX/bin:$PATH
Confirm that the gcc command invokes GCC 12.3:
which gcc
gcc --version
You should see output similar to:
/opt/gcc/12.3/bin/gcc
gcc (GCC) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Foundational Benchmarks
Foundational benchmarks confirm that the system is operating as expected. These benchmarks do not represent any one application or problem area. They are excellent sanity checks for the system and can produce simple, comparable numbers with minimal configuration.
Before performing any competitive analysis, we strongly recommend that you run all foundational benchmarks. These benchmarks are simple and execute quickly, so you should repeat them every time you benchmark.
Fused Multiply Add
NVIDIA provides an open source suite of benchmarking microkernels for Arm® CPUs. To allow precise counts of instructions and exercise specific functional units, these kernels are written in assembly language. To measure the peak floating point capability of a core and check the CPU clock speed, use a Fused Multiply Add (FMA) kernel.
Install
To measure the achievable peak performance of a core, the fp64_sve_pred_fmla kernel executes a known number of SVE predicated fused multiply-add (FMLA) operations. Combined with the perf tool, this lets you measure both the floating point performance and the core clock speed.
git clone https://github.com/NVIDIA/arm-kernels.git
cd arm-kernels
make
Execute
The benchmark score is reported in giga-operations per second (Gop/sec) near the top of the benchmark output. Grace can perform 16 FP64 FMA operations per cycle, so a Grace CPU with a nominal CPU frequency of 3.3GHz should report between 52 and 53 Gop/sec.
./arithmetic/fp64_sve_pred_fmla.x
4( 16(SVE_FMLA_64b) );
Iterations;100000000
Total Inst;6400000000
Total Ops;25600000000
Inst/Iter;64
Ops/Iter;256
Seconds;0.478488
GOps/sec;53.5019
Use the perf command to measure CPU frequency. The CPU frequency is reported in the perf output on the cycles line, after the # symbol.
Before running perf, check that the value of /proc/sys/kernel/perf_event_paranoid is less than 1. If it is, you can run the command as an unprivileged user.
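For example, to inspect the current value:
cat /proc/sys/kernel/perf_event_paranoid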
perf stat ./arithmetic/fp64_sve_pred_fmla.x
If the value of /proc/sys/kernel/perf_event_paranoid is greater than 1, you will need to run perf as root.
sudo perf stat ./arithmetic/fp64_sve_pred_fmla.x
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
The key performance metric is giga-operations per second (Gop/sec). Grace can perform 16 FP64 FMA operations per cycle, so a Grace CPU with a nominal CPU frequency of 3.3GHz should report between 52 and 53 Gop/sec.
Here is an example of benchmark output:
$ perf stat ./arithmetic/fp64_sve_pred_fmla.x
4( 16(SVE_FMLA_64b) );
Iterations;100000000
Total Inst;6400000000
Total Ops;25600000000
Inst/Iter;64
Ops/Iter;256
Seconds;0.481267
GOps/sec;53.1929
Performance counter stats for './arithmetic/fp64_sve_pred_fmla.x':
482.25 msec task-clock # 0.996 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
65 page-faults # 134.786 /sec
1,607,949,685 cycles # 3.334 GHz
6,704,065,953 instructions # 4.17 insn per cycle
<not supported> branches
18,383 branch-misses # 0.00% of all branches
0.484136320 seconds time elapsed
0.482678000 seconds user
0.000000000 seconds sys
STREAM
STREAM is the standard for measuring memory bandwidth. The STREAM benchmark is a simple, synthetic benchmark program that measures the sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels. The benchmark includes the following kernels that operate on 1D arrays a, b, and c, with scalar x:
- COPY: Measures transfer rates in the absence of arithmetic: c = a
- SCALE: Adds a simple arithmetic operation: b = x*a
- ADD: Adds a third operand to test multiple load/store ports: c = a + b
- TRIAD: Allows chained/overlapped/fused multiply/add operations: a = b + x*c
The kernels are executed in sequence in a loop, and the following parameters configure STREAM:
- STREAM_ARRAY_SIZE: The number of double-precision elements in each array. When you measure the bandwidth to/from main memory, you must select a sufficiently large array size.
- NTIMES: The number of iterations of the test loop.
Use the STREAM benchmark to check LPDDR5X memory bandwidth.
Install
The following commands download and compile STREAM with memory footprint of approximately 2.7GB per Grace CPU, which is sufficient to exceed the total system L3 cache without excessive runtime. The general rule for running STREAM is that each array must be at least four times the size of the sum of all the last-level caches that were used in the run, or 1 million elements, whichever is larger.
# Scale the array size with the core count: 120,000,000 elements per 72 cores
STREAM_ARRAY_SIZE=$(( $(nproc) / 72 * 120000000 ))
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -Ofast -march=native -fopenmp -mcmodel=large -fno-PIC \
-DSTREAM_ARRAY_SIZE=${STREAM_ARRAY_SIZE} -DNTIMES=200 \
-o stream_openmp.exe stream.c
Execute
To run STREAM, set the number of OpenMP threads (OMP_NUM_THREADS) according to the following example. Replace ${THREADS} with the appropriate value from the table of reference results shown below. To distribute the threads evenly over all available cores and maximize bandwidth, use OMP_PROC_BIND=spread.
OMP_NUM_THREADS=${THREADS} OMP_PROC_BIND=spread ./stream_openmp.exe
Grace Superchip memory bandwidth is proportional to the total memory capacity. Find your system’s memory capacity in the table below and use the same number of threads to generate the expected score for STREAM TRIAD. For example, when running on a Grace Hopper Superchip with a memory capacity of 120GB, this command will report between 410GB/s and 486GB/s in STREAM TRIAD:
OMP_NUM_THREADS=72 OMP_PROC_BIND=spread ./stream_openmp.exe
Similarly, the following command will report between 820GB/s and 972GB/s in STREAM TRIAD on a Grace CPU Superchip with a memory capacity of 480GB:
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./stream_openmp.exe
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
Memory bandwidth depends on many factors, for instance, operating system kernel version and the default memory page size.
Without any code changes, STREAM TRIAD should score between 80% and 95% of the system’s theoretical peak memory bandwidth.
Superchip | Memory Capacity (GB) | Threads | TRIAD Min MB/s |
---|---|---|---|
Grace Hopper | 120 | 72 | 400,000 |
Grace Hopper | 480 | 72 | 307,000 |
Grace CPU | 240 | 144 | 800,000 |
Grace CPU | 480 | 144 | 800,000 |
Here is an example of the STREAM execution output:
OMP_NUM_THREADS=144 OMP_PROC_BIND=spread ./stream_openmp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 240000000 (elements), Offset = 0 (elements)
Memory per array = 1831.1 MiB (= 1.8 GiB).
Total memory required = 5493.2 MiB (= 5.4 GiB).
Each kernel will be executed 200 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 144
Number of Threads counted = 144
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5729 microseconds.
(= 5729 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 662394.7 0.005964 0.005797 0.008116
Scale: 685483.8 0.005744 0.005602 0.007843
Add: 787098.2 0.007689 0.007318 0.008325
Triad: 806812.4 0.007713 0.007139 0.011388
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Common Benchmarks
These industry-recognized benchmarks facilitate a fair competitive performance analysis for a class of workloads.
High Performance Linpack (HPL)
The NVIDIA HPC-Benchmarks collection provides a multiplatform (x86 and aarch64) container image, based on the NVIDIA Optimized Frameworks container images, that includes NVIDIA’s HPL benchmark. HPL-NVIDIA solves a random dense linear system in double-precision arithmetic on distributed-memory computers and is based on the netlib HPL benchmark. Please visit the NVIDIA HPC-Benchmarks page in the NGC Catalog for detailed instructions.
The HPL-NVIDIA benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark for getting started with the HPL software concepts and best practices.
Downloading and using the container
The container image works well with Singularity, Docker, or Pyxis/Enroot. Instructions for running with Singularity are provided below. For a general guide on pulling and running containers, see Running A Container in the NVIDIA Containers For Deep Learning Frameworks User’s Guide. For more information about using NGC, refer to the NGC Container User Guide.
Running the benchmarks
The script hpl-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch HPL-NVIDIA for the NVIDIA Grace CPU. As of HPC-Benchmarks 23.10, hpl-aarch64.sh accepts the following parameters:
- Required parameters:
  - --dat path: Path to the HPL.dat input file
- Optional parameters:
  - --cpu-affinity <string>: A colon-separated list of CPU index ranges
  - --mem-affinity <string>: A colon-separated list of memory indices
  - --ucx-affinity <string>: A colon-separated list of UCX devices
  - --ucx-tls <string>: UCX transport to use
  - --exec-name <string>: HPL executable file
Several sample input files are available in the container at /workspace/hpl-linux-aarch64.
Run with Singularity
The instructions below assume Singularity 3.4.1 or later.
Save the HPC-Benchmark container as a local Singularity image file:
singularity pull --docker-login hpc-benchmarks:23.10.sif docker://nvcr.io/nvidia/hpc-benchmarks:23.10
If prompted for a Docker username or password, just press “enter” to continue with guest access:
Enter Docker Username: # press "enter" key to skip
Enter Docker Password: # press "enter" key to skip
This command saves the container in the current directory as hpc-benchmarks:23.10.sif.
Use one of the following commands to run HPL-NVIDIA with a sample input file on one NVIDIA Grace CPU Superchip.
- To run from a local command line, i.e. not using Slurm:
singularity run ./hpc-benchmarks:23.10.sif \
    mpirun -np 2 --bind-to none \
    ./hpl-aarch64.sh --dat ./hpl-linux-aarch64/sample-dat/HPL_2mpi.dat \
    --cpu-affinity 0-71:72-143 --mem-affinity 0:1
- To run via Slurm:
srun -N 1 --ntasks-per-node=2 singularity run ./hpc-benchmarks:23.10.sif \
    ./hpl-aarch64.sh --dat ./hpl-linux-aarch64/sample-dat/HPL_2mpi.dat \
    --cpu-affinity 0-71:72-143 --mem-affinity 0:1
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
The score below was taken on a Grace CPU Superchip with 480GB of CPU memory:
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC00L2L2 168880 448 1 2 616.41 5.2093e+03
HiBench: K-means
This workload from the HiBench suite tests K-means clustering in spark.mllib, a well-known clustering algorithm for knowledge discovery and data mining. The input data set is generated by GenKMeansDataset based on Uniform Distribution and Gaussian Distribution. There is also an optimized K-means implementation based on the Intel Data Analytics Library (DAL), which is available in the dal module of sparkbench. This benchmark requires Spark, HiBench, and Hadoop. HiBench is the workload generator, Hadoop is used to generate and store data, and Spark is the application we wish to test.
Installation
Java 8 and Java 11
Install Java 8, Java 11, and related tools from your Linux distribution’s package repository. For example, on Ubuntu:
sudo apt install openjdk-11-jre-headless openjdk-11-jdk-headless maven python2 net-tools openjdk-8-jdk
Hadoop
cd $HOME
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6-aarch64.tar.gz
tar zxvf hadoop-3.3.6-aarch64.tar.gz
export PATH_TO_HADOOP=$HOME/hadoop-3.3.6
cd $PATH_TO_HADOOP/etc/hadoop
Create configuration files:
yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>127.0.0.1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>127.0.0.1:8031</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
mapred-site.xml
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. <value>HADOOP_MAPRED_HOME=/home/nvidia/hadoop-3.3.6</value>.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
</property>
</configuration>
hdfs-site.xml
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. <value>/home/nvidia/hadoop-3.3.6/namenode</value>
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>$PATH_TO_HADOOP/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>datanode</value>
</property>
</configuration>
hadoop-env.sh
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. export HADOOP_HOME="/home/nvidia/hadoop-3.3.6"
export HADOOP_HOME="$PATH_TO_HADOOP"
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-arm64"
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
After creating all Hadoop configuration files, initialize the namenode directory:
$PATH_TO_HADOOP/bin/hdfs namenode -format
Spark
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. export HADOOP_PREFIX="/home/nvidia/hadoop-3.3.6"
cd $HOME
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar zxvf spark-3.5.0-bin-hadoop3.tgz
cd spark-3.5.0-bin-hadoop3/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
HiBench
If the mvn command given below fails with an error like "object java.lang.Object in compiler mirror not found", check that you have installed Java 8 and updated your JAVA_HOME and PATH environment variables.
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-arm64"
export PATH="$JAVA_HOME/bin:$PATH"
cd $HOME
git clone https://github.com/Intel-bigdata/HiBench.git
cd HiBench
mvn -Phadoopbench -Psparkbench -Dspark=2.4 -Dscala=2.11 clean package
Configure HiBench:
cd $HOME/HiBench/conf
# Important: replace "$NUM_PARTITIONS" with the number of CPU cores you wish to use, e.g. 72 for Grace-Hopper.
sed -i 's#hibench.scale.profile.*$#hibench.scale.profile huge#g' hibench.conf
sed -i 's#hibench.default.map.parallelism.*$#hibench.default.map.parallelism $NUM_PARTITIONS#g' hibench.conf
sed -i 's#hibench.default.shuffle.parallelism.*$#hibench.default.shuffle.parallelism $NUM_PARTITIONS#g' hibench.conf
# IMPORTANT: replace "$PATH_TO_HADOOP"
cp hadoop.conf.template hadoop.conf
sed -i 's#/PATH/TO/YOUR/HADOOP/ROOT#$PATH_TO_HADOOP#g' hadoop.conf
# IMPORTANT: replace "$PATH_TO_SPARK"
cp spark.conf.template spark.conf
sed -i 's#/PATH/TO/YOUR/SPARK/HOME#$PATH_TO_SPARK#g' spark.conf
sed -i 's#hibench.spark.master.*$#hibench.spark.master local[*]#' spark.conf
sed -i 's#spark.executor.memory.*$#spark.executor.memory 50g#' spark.conf
sed -i 's#spark.driver.memory.*$#spark.driver.memory 50g#' spark.conf
Run the Benchmark
- Configure your environment. Remember to set $PATH_TO_HADOOP to the correct path.
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-arm64"
export PATH="$JAVA_HOME/bin:$PATH"
# Set Hadoop-related environment variables
export PATH_TO_HADOOP=$HOME/hadoop-3.3.6
export HADOOP_HOME=$PATH_TO_HADOOP
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- Start Hadoop:
export PDSH_RCMD_TYPE=exec
$PATH_TO_HADOOP/sbin/start-all.sh
If Hadoop starts successfully, jps output should be similar to:
369207 SecondaryNameNode
369985 NodeManager
371293 NameNode
373148 Jps
368895 DataNode
369529 ResourceManager
All of NameNode, SecondaryNameNode, NodeManager, DataNode, and ResourceManager must be running before proceeding with the benchmark. If you do not see a NameNode process, check that you initialized the namenode directory as described in the Hadoop installation steps above.
- Preprocess the k-means benchmark files:
$HOME/HiBench/bin/workloads/ml/kmeans/prepare/prepare.sh
- Run the benchmark once to initialize the system:
$HOME/HiBench/bin/workloads/ml/kmeans/spark/run.sh
- Run the k-means benchmark several times and average the scores. This example shows 72 cores. If you wish to use a different number of CPU cores, remember to update hibench.default.map.parallelism and hibench.default.shuffle.parallelism in hibench.conf.
numactl -C0-71 -m0 $HOME/HiBench/bin/workloads/ml/kmeans/spark/run.sh
The results can be found in $HOME/HiBench/report/hibench.report.
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
Type Date Time Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkKmeans 2023-02-16 20:29:54 19903891441 23.692 840110224 840110224
ScalaSparkKmeans 2023-02-16 23:45:33 19903891427 23.742 838340974 838340974
ScalaSparkKmeans 2023-02-16 23:53:05 19903891439 24.129 824894999 824894999
The result is the median throughput/node across the three runs; in this example, 838340974 bytes/s.
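As a convenience, here is a small sketch (assuming the hibench.report layout shown above, with Throughput/node in the seventh column) that extracts the median value:
# Extract the Throughput/node column from the ScalaSparkKmeans rows, sort numerically, and print the median
grep ScalaSparkKmeans $HOME/HiBench/report/hibench.report | awk '{print $7}' | sort -n | awk '{v[NR]=$1} END {print v[int((NR+1)/2)]}'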
GAP Benchmark Suite
The GAP Benchmark Suite (Beamer, 2015) was released with the goal of helping standardize graph processing evaluations. Graph algorithms and their applications are currently gaining renewed interest, especially with the growth of social networks and their analysis. Graph algorithms are also important for their applications in science and recognition. The GAP benchmark suite provides high performance (CPU only) reference implementations for various graph operations and provides a standard for graph processing performance evaluations.
Even though the GAP benchmark suite provides real-world graphs and more than one kernel (high-performance implementations of various graph algorithms), we will only use synthetic Kronecker graphs and will focus on the Breadth-First Search (BFS) kernel.
Initial Configuration
This repo is the reference implementation for the GAP Benchmark Suite. It is designed to be a portable high-performance baseline that only requires a compiler with support for C++11. It uses OpenMP for parallelism, but, to run serially, it can be compiled without OpenMP. The details of the benchmark can be found in the specification.
Quick Start
To build from source, run the following commands:
git clone https://github.com/sbeamer/gapbs.git
cd gapbs
make
To quickly test the build, run the BFS kernel on 1024 vertices for one iteration:
$ ./bfs -g 10 -n 1
The command output should be similar to
Generate Time: 0.00547
Build Time: 0.00295
Graph has 1024 nodes and 10496 undirected edges for degree: 10
Trial Time: 0.00921
Average Time: 0.00921
Additional command line flags can be found with -h.
Running the BFS Kernel
These command line options set runtime parameters for the BFS kernel:
- -g <scale>: Generate a Kronecker graph with 2^scale vertices.
- -k <degree>: Average degree for a synthetic graph.
- -n <n>: Perform n trials.
Typically, we select a scale so that the working dataset size for the workload lies outside the last-level cache of the test platform. A scale value of 26 means the graph will have approximately 67.11 million vertices, which is large enough that the working set will not fit completely within the last-level cache of the CPU.
Run bfs with the following command:
OMP_NUM_THREADS=72 OMP_PROC_BIND=close numactl -m0 -C 0-71 ./bfs -g 26 -k 16 -n 64
This command will pin our application to CPU socket 0 and physical cores 0-71.
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
When you run bfs using the command above on a Grace machine with at least 72 cores, with a Kronecker graph of scale 26 and degree 16 for 64 trials, you should see an average time of approximately 0.0395 +/- 0.001 seconds, as shown below.
Generate Time: 3.72415
Build Time: 5.94936
Graph has 67108864 nodes and 1051923215 undirected edges for degree: 15
Trial Time: 0.03807
Trial Time: 0.03730
Trial Time: 0.04042
Trial Time: 0.04184
Trial Time: 0.03676
...
Trial Time: 0.03795
Trial Time: 0.03576
Trial Time: 0.04318
Average Time: 0.03977
Graph500
The Graph500 is a rating of supercomputer systems focused on data-intensive workloads.
Build
The following script will build Graph500 and all of its dependencies. The script was tested on a freshly booted Ubuntu 22.04 system.
#!/bin/bash
set -e
sudo apt update && sudo apt install -y wget build-essential python3 numactl git
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.1.tar.gz
gunzip -c openmpi-5.0.1.tar.gz | tar xf -
mkdir -p ./ompi
export PATH="${PWD}/ompi/bin:${PATH}"
export LD_LIBRARY_PATH="${PWD}/ompi/lib:${LD_LIBRARY_PATH}"
pushd openmpi-5.0.1
./configure --prefix=${PWD}/../ompi
make all install
popd
git clone https://github.com/graph500/graph500.git
pushd ./graph500/src/
sed -i '/^CFLAGS/s/$/ -DPROCS_PER_NODE_NOT_POWER_OF_TWO -fcommon/' Makefile
make
popd
Running Benchmarks on Grace
#!/bin/bash
export SKIP_VALIDATION=1
unset SKIP_BFS
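# The two positional arguments below are the problem SCALE (2^28 vertices) and the edge factor (16)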
mpirun -n $(nproc) --map-by core ./graph500/src/graph500_reference_bfs 28 16
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
In the output below, bfs harmonic_mean_TEPS and bfs mean_time are the performance and runtime metrics, respectively. TEPS (traversed edges per second) is the absolute performance metric.
SCALE: 28
edgefactor: 16
NBFS: 64
graph_generation: 14.2959
num_mpi_processes: 144
construction_time: 8.21669
bfs min_time: 1.4579
bfs firstquartile_time: 1.47956
bfs median_time: 1.54229
bfs thirdquartile_time: 1.70613
bfs max_time: 1.81811
bfs mean_time: 1.58271
bfs stddev_time: 0.112696
min_nedge: 4294921166
firstquartile_nedge: 4294921166
median_nedge: 4294921166
thirdquartile_nedge: 4294921166
max_nedge: 4294921166
mean_nedge: 4294921166
stddev_nedge: 0
bfs min_TEPS: 2.3623e+09
bfs firstquartile_TEPS: 2.51735e+09
bfs median_TEPS: 2.78478e+09
bfs thirdquartile_TEPS: 2.90284e+09
bfs max_TEPS: 2.94596e+09
bfs harmonic_mean_TEPS: ! 2.71366e+09
bfs harmonic_stddev_TEPS: 2.43441e+07
bfs min_validate: -1
bfs firstquartile_validate: -1
bfs median_validate: -1
bfs thirdquartile_validate: -1
bfs max_validate: -1
bfs mean_validate: -1
bfs stddev_validate: 0
NAS Parallel Benchmarks
The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The NPB 1 benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications. Problem sizes in NPB are predefined and indicated as different classes. Reference implementations of NPB are available in commonly-used programming models like MPI and OpenMP.
Building the Benchmarks
- Download and unpack the NPB source code from nas.nasa.gov:
wget https://www.nas.nasa.gov/assets/npb/NPB3.4.2.tar.gz
tar xvzf NPB3.4.2.tar.gz
cd NPB3.4.2/NPB3.4-OMP
- Create the make.def file to configure the build for the NVIDIA HPC compilers:
cat > config/make.def <<'EOF'
FC = nvfortran
FLINK = $(FC)
F_LIB =
F_INC =
FFLAGS = -O3 -mp
FLINKFLAGS = $(FFLAGS)
CC = nvc
CLINK = $(CC)
C_LIB = -lm
C_INC =
CFLAGS = -O3 -mp
CLINKFLAGS = $(CFLAGS)
UCC = gcc
BINDIR = ../bin
RAND = randi8
WTIME = wtime.c
EOF
- Create the suite.def file to build all benchmarks with the D problem size:
cat > config/suite.def <<'EOF'
bt D
cg D
ep D
lu D
mg D
sp D
ua D
EOF
- Compile all benchmarks:
make -j suite
A successful compilation will generate these binaries in the bin/ directory:
$ ls bin/
bt.D.x cg.D.x ep.D.x ft.D.x lu.D.x mg.D.x sp.D.x ua.D.x
Running the Benchmarks
Run each benchmark individually using the command shown below. In the command, replace ${BENCHMARK} with the benchmark name, for example cg.D.x, and replace ${THREADS} and ${FLAGS} with the appropriate values from the reference results shown below.
OMP_NUM_THREADS=${THREADS} OMP_PROC_BIND=close numactl ${FLAGS} ./bin/${BENCHMARK}
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
Grace CPU Superchip, 480GB Memory Capacity
Use this script to run all the benchmarks on 72 cores of the Grace CPU:
#!/bin/bash
for BENCHMARK in bt cg ep lu mg sp ua ; do
OMP_NUM_THREADS=72 OMP_PROC_BIND=close numactl -m0 ./bin/${BENCHMARK}.D.x
done
Performance is reported on the line marked “Mops / total”. The expected performance is shown below.
Benchmark | Mops / total |
---|---|
bt.D.x | 386758.21 |
cg.D.x | 26632.65 |
ep.D.x | 10485.73 |
lu.D.x | 293407.59 |
mg.D.x | 125382.93 |
sp.D.x | 136893.59 |
ua.D.x | 973.52 |
protobuf
Protocol Buffers (a.k.a. protobuf) are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.
Build
The following script will build Protocol Buffers. The script was tested on a freshly booted Ubuntu 22.04 system.
#!/bin/bash
set -e
sudo apt update && sudo apt install -y autoconf automake libtool curl make g++ unzip libz-dev git cmake
git clone https://github.com/protocolbuffers/protobuf.git
pushd protobuf
# syncing at a specific commit
git checkout 7cd0b6fbf1643943560d8a9fe553fd206190b27f
git submodule update --init --recursive
./autogen.sh
./configure
make
make check
sudo make install
sudo ldconfig
pushd benchmarks
make cpp
popd
popd
Running Benchmarks on Grace
#!/bin/bash
pushd ./protobuf
mkdir -p result
rm -rf result/*
C=$(nproc)
for (( i=0; i < $C-1; i++ ))
do
filename_result="$i.log"
filepath_result="result/$filename_result"
taskset -c $i ./benchmarks/cpp-benchmark --benchmark_min_time=5.0 $(find $(cd . && pwd) -type f -name "dataset.*.pb" -not -path "$(cd . && pwd)/tmp/*") >> $filepath_result &
done
# The last copy runs on the final core in the foreground (synchronously)
filename_result="$i.log"
filepath_result="result/$filename_result"
taskset -c $i ./benchmarks/cpp-benchmark --benchmark_min_time=5.0 $(find $(cd . && pwd) -type f -name "dataset.*.pb" -not -path "$(cd . && pwd)/tmp/*") >> $filepath_result
sleep 1
popd
echo Done!
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
Run one copy of the benchmark per core as indicated above. For each test dataset, take the lowest score across all copies, then compute the geometric mean of these minimums ("Geomean of mins" below). The socket score is this geometric mean multiplied by the number of copies (Score * copies = Socket score).
Geomean of mins: 1291.26863081306
Total score: 185942.682837081 MB/s
Applications
The benchmarking recipes in this section show you how to maximize the performance of key applications.
NAMD
NAMD is a widely used molecular dynamics software used for large-scale simulations of biomolecular systems [1]. It is developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign and was used in the winning submission for the 2020 ACM Gordon Bell Special Prize for COVID-19 Research [2]. As part of the submission, NAMD was used to simulate a 305-million atom SARS-CoV-2 viral envelope on over four thousand nodes of the ORNL Summit supercomputer. The Charm++ framework is used to scale to thousands of GPUs and hundreds of thousands of CPU cores [3]. NAMD has supported Arm since 2014.
Building the Source Code
To access the NAMD source code, submit a request at https://www.ks.uiuc.edu/Research/namd/gitlabrequest.html. After the request is approved, you can access the source code at https://gitlab.com/tcbgUIUC/namd.
Dependencies
The following script will install NAMD’s dependencies to the ./build/ directory. Charm++ version 7.0.0 does not support targeting the Armv9 architecture, so Armv8 is used instead.
#!/bin/bash
set -e
if [[ ! -a build ]]; then
mkdir build
fi
cd build
#
# FFTW
#
if [[ ! -a fftw-3.3.9 ]]; then
wget http://www.fftw.org/fftw-3.3.9.tar.gz
tar xvfz fftw-3.3.9.tar.gz
fi
if [[ ! -a fftw ]]; then
mkdir fftw
cd fftw-3.3.9
./configure CC=gcc --prefix=$PWD/../fftw \
--enable-float --enable-fma \
--enable-neon \
--enable-openmp --enable-threads | tee fftw_config.log
make -j 8 | tee fftw_build.log
make install
cd ..
fi
#
# TCL
#
if [[ ! -e tcl ]]; then
wget http://www.ks.uiuc.edu/Research/namd/libraries/tcl8.5.9-linux-arm64-threaded.tar.gz
tar zxvf tcl8.5.9-linux-arm64-threaded.tar.gz
mv tcl8.5.9-linux-arm64-threaded tcl
fi
#
# Charm++
#
if [[ ! -a charm ]]; then
git clone https://github.com/UIUC-PPL/charm.git
fi
cd charm
git checkout v7.0.0
if [[ ! -a multicore-linux-arm8-gcc ]]; then
./build charm++ multicore-linux-arm8 gcc --with-production --enable-tracing -j 8
fi
cd ..
NAMD
The following script downloads and compiles NAMD in the ./build/ directory. We recommend that you use GCC version 12.3 or later because it can target the neoverse-v2 architecture. If GCC 12.3 or later is not available, remove the sed command below because the architecture will not be recognized.
#!/bin/bash
set -e
if [[ ! -a build ]]; then
mkdir build
fi
cd build
#
# NAMD
#
if [[ ! -a namd ]]; then
git clone git@gitlab.com:tcbgUIUC/namd.git
cd namd
git checkout release-3-0-beta-3
cd ..
fi
cd namd
if [[ ! -a Linux-ARM64-g++ ]]; then
./config Linux-ARM64-g++ \
--charm-arch multicore-linux-arm8-gcc --charm-base $PWD/../charm \
--with-tcl --tcl-prefix $PWD/../tcl \
--with-fftw --with-fftw3 --fftw-prefix $PWD/../fftw
sed -i 's/FLOATOPTS = .*/FLOATOPTS = -Ofast -mcpu=neoverse-v2/g' arch/Linux-ARM64-g++.arch
cd Linux-ARM64-g++
make depends
make -j 8
cd ..
fi
cd ..
Running Benchmarks on Grace
STMV is a standard benchmark system with 1,066,628 atoms. To download STMV, run the following commands.
wget https://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
tar zxvf stmv.tar.gz
cd stmv
wget http://www.ks.uiuc.edu/Research/namd/2.13/benchmarks/stmv_nve_cuda.namd
wget https://www.ks.uiuc.edu/Research/namd/utilities/ns_per_day.py
chmod +x ns_per_day.py
The stmv_nve_cuda.namd input file is not specific to CUDA and runs an NVE simulation with a 2 femtosecond timestep and PME evaluated every 4 fs with multi-time stepping. The benchmark can be run with the following command from the stmv directory:
../build/namd/Linux-ARM64-g++/namd3 +p72 +pemap 0-71 stmv_nve_cuda.namd | tee output.txt
./ns_per_day.py output.txt
The metric of interest is ns/day (higher is better), corresponding to the number of nanoseconds of simulation time that can be computed in 24 hours. The ns_per_day.py script will parse the standard output of a simulation and compute the overall performance of the benchmark.
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
The following result was collected on a Grace Hopper Superchip using 72 CPU cores and completed in 121 seconds. As measured by hwmon, the average Grace module power was 275 Watts, and the benchmark consumed approximately 9.31 Watt-hours of energy.
$ ./ns_per_day.py output.txt
Nanoseconds per day: 2.97202
Mean time per step: 0.0581422
Standard deviation: 0.00074301
References
1. Phillips, James C., et al. “Scalable molecular dynamics on CPU and GPU architectures with NAMD.” The Journal of Chemical Physics 153.4 (2020).
2. Casalino, Lorenzo, et al. “AI-driven multiscale simulations illuminate mechanisms of SARS-CoV-2 spike dynamics.” The International Journal of High Performance Computing Applications 35.5 (2021): 432-451.
3. Phillips, James C., et al. “Mapping to irregular torus topologies and other techniques for petascale biomolecular simulation.” SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2014.
OpenFOAM
OpenFOAM is a C++ toolbox for the development of customized numerical solvers, and pre-/post-processing utilities for the solution of continuum mechanics problems, most prominently including computational fluid dynamics.
Build
The following script will build OpenFOAM and all of its dependencies. The script was tested on a freshly booted Ubuntu 22.04 system.
#!/bin/bash
set -e
sudo apt update && sudo apt install -y time libfftw3-dev curl wget \
build-essential libscotch-dev libcgal-dev git flex libfl-dev bison cmake \
zlib1g-dev libboost-system-dev libboost-thread-dev \
libopenmpi-dev openmpi-bin gnuplot \
libreadline-dev libncurses-dev libxt-dev numactl
wget https://dl.openfoam.com/source/v2312/OpenFOAM-v2312.tgz
wget https://dl.openfoam.com/source/v2312/ThirdParty-v2312.tgz
tar -zxvf OpenFOAM-v2312.tgz && tar -zxvf ThirdParty-v2312.tgz
source ./OpenFOAM-v2312/etc/bashrc
pushd $WM_PROJECT_DIR
./Allwmake -j -s -l -q
popd
Running Benchmarks on Grace
export OPENFOAM_ROOT=${PWD}
export OPENFOAM_MPIRUN_ARGS="--map-by core --bind-to none --report-bindings"
source ./OpenFOAM-v2312/etc/bashrc
git clone https://develop.openfoam.com/committees/hpc.git
pushd hpc/incompressible/simpleFoam/HPC_motorbike/Large/v1912
sed -i "s/numberOfSubdomains.*/numberOfSubdomains $(nproc);/g" system/decomposeParDict
sed -i "s/vector/normal/g" system/mirrorMeshDict
sed -i "s/^endTime.*/endTime 100;/" system/controlDict
sed -i "s/^writeInterval.*/writeInterval 1000;/" system/controlDict
curl -o system/fvSolution "https://develop.openfoam.com/Development/openfoam/-/raw/master/tutorials/incompressible/simpleFoam/motorBike/system/fvSolution?ref_type=heads"
chmod +x All*
./AllmeshL
./Allrun
cat log.*
popd
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
The following result was collected on a Grace CPU Superchip using 144 CPU cores.
ExecutionTime = 189.49 s ClockTime = 197 s
SPECFEM3D
SPECFEM3D Cartesian simulates acoustic (fluid), elastic (solid), coupled acoustic/elastic, poroelastic or seismic wave propagation in any type of conforming mesh of hexahedra (structured or not).
Build
The following script will build SPECFEM3D. The script was tested on a freshly booted Ubuntu 22.04 system.
#!/bin/bash
set -e
sudo apt update && sudo apt install -y git build-essential gcc gfortran libopenmpi-dev openmpi-bin
git clone https://github.com/SPECFEM/specfem3d.git
pushd ./specfem3d
./configure FC=gfortran CC=gcc
make all
cp -r EXAMPLES/applications/meshfem3D_examples/simple_model/DATA/* DATA/
sed -i "s/NPROC .*/NPROC = $(nproc)/g" DATA/Par_file
sed -i "s/NSTEP .*/NSTEP = 10000/g" DATA/Par_file
sed -i "s/DT .*/DT = 0.01/g" DATA/Par_file
sed -i "s/NEX_XI .*/NEX_XI = 448/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i "s/NEX_ETA .*/NEX_ETA = 576/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i "s/NPROC_XI .*/NPROC_XI = 8/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i "s/NPROC_ETA .*/NPROC_ETA = 18/g" DATA/meshfem3D_files/Mesh_Par_file
sed -i '/^#NEX_XI_BEGIN/{n;s/1.*/1 448 1 576 1 4 1/;n;s/1.*/1 448 1 576 5 5 2/;n;s/1.*/1 448 1 576 6 15 3/}' DATA/meshfem3D_files/Mesh_Par_file
popd
Running Benchmarks on Grace
pushd ./specfem3d
mkdir -p DATABASES_MPI
rm -rf DATABASES_MPI/*
rm -rf OUTPUT_FILES/*
mpirun -n $(nproc) --bind-to none --map-by core ./bin/xmeshfem3D
mpirun -n $(nproc) --bind-to none --map-by core ./bin/xgenerate_databases
mpirun -n $(nproc) --bind-to none --map-by core ./bin/xspecfem3D
cat OUTPUT_FILES/output_solver.txt
popd
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
The following result was collected on a Grace CPU Superchip using 144 CPU cores.
Time loop finished. Timing info:
Total elapsed time in seconds = 991.00492298699999
Total elapsed time in hh:mm:ss = 0 h 16 m 31 s
Weather Research and Forecasting Model
The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility.
Arm64 is supported by the standard WRF distribution as of WRF 4.3.3. The following is an example of how to perform the standard procedure to build and execute on NVIDIA Grace. See http://www2.mmm.ucar.edu/wrf/users/download/get_source.html for more details.
Install WRF
Initial Configuration
Verify that the most recent NVIDIA HPC SDK is available in your environment. The simplest way to do this is to load the nvhpc module file.
module load nvhpc
nvc --version
nvc 23.7-0 linuxarm64 target on aarch64 Linux -tp neoverse-v2
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
NVIDIA HPC SDK includes optimized MPI compilers and libraries, so you’ll also have the appropriate MPI compilers in your path:
$ which mpirun
/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/mpi/bin/mpirun
$ mpicc -show
nvc -I/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/openmpi/openmpi-3.1.5/include -Wl,-rpath -Wl,$ORIGIN:$ORIGIN/../../lib:$ORIGIN/../../../lib:$ORIGIN/../../../compilers/lib:$ORIGIN/../../../../compilers/lib:$ORIGIN/../../../../../compilers/lib -Wl,-rpath -Wl,/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib -L/opt/nvidia/hpc_sdk/Linux_aarch64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib -lmpi
NetCDF requires libcurl. On Ubuntu, you can install this easily with this command:
sudo apt install libcurl4-openssl-dev
Create a build directory to hold WRF and all of its dependencies:
mkdir WRF
# Configure build environment
export BUILD_DIR="$HOME/WRF"
export HDFDIR=$BUILD_DIR/opt
export HDF5=$BUILD_DIR/opt
export NETCDF=$BUILD_DIR/opt
export PATH=$HDFDIR/bin:$PATH
export LD_LIBRARY_PATH=$HDFDIR/lib:$LD_LIBRARY_PATH
Dependencies
WRF depends on the NetCDF Fortran library, which in turn requires the NetCDF C library and HDF5. This guide assumes that all of WRF’s dependencies have been installed at the same location such that they share the same lib and include directories.
HDF5
cd $BUILD_DIR
wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.2/src/hdf5-1.14.2.tar.gz
tar xvzf hdf5-1.14.2.tar.gz
cd hdf5-1.14.2
CC=mpicc FC=mpifort \
CFLAGS="-O3 -fPIC" FCFLAGS="-O3 -fPIC" \
./configure --prefix=$HDFDIR --enable-fortran --enable-parallel
make -j72
make install
NetCDF-C
cd $BUILD_DIR
wget https://github.com/Unidata/netcdf-c/archive/refs/tags/v4.9.2.tar.gz
tar xvzf v4.9.2.tar.gz
cd netcdf-c-4.9.2
CC=mpicc FC=mpifort \
CPPFLAGS="-I$HDFDIR/include" \
CFLAGS="-O3 -fPIC -I$HDFDIR/include" \
FFLAGS="-O3 -fPIC -I$HDFDIR/include" \
FCFLAGS="-O3 -fPIC -I$HDFDIR/include" \
LDFLAGS="-O3 -fPIC -L$HDFDIR/lib -lhdf5_hl -lhdf5 -lz" \
./configure --prefix=$NETCDF
make -j72
make install
NetCDF-Fortran
cd $BUILD_DIR
wget https://github.com/Unidata/netcdf-fortran/archive/refs/tags/v4.6.1.tar.gz
tar xvzf v4.6.1.tar.gz
cd netcdf-fortran-4.6.1/
CC=mpicc FC=mpifort \
CPPFLAGS="-I$HDFDIR/include" \
CFLAGS="-O3 -fPIC -I$HDFDIR/include" \
FFLAGS="-O3 -fPIC -I$HDFDIR/include" \
FCFLAGS="-O3 -fPIC -I$HDFDIR/include" \
LDFLAGS="-O3 -fPIC -L$HDFDIR/lib -lhdf5_hl -lhdf5 -lz" \
./configure --prefix=$NETCDF
make -j72
make install
Build WRF with NVIDIA Compilers
cd $BUILD_DIR
wget https://github.com/wrf-model/WRF/releases/download/v4.5.2/v4.5.2.tar.gz
tar xvzf v4.5.2.tar.gz
cd WRFV4.5.2
Run ./configure and select the following options:
- Choose a dm+sm option on the NVHPC row. In this example, this is option 20.
- Choose 1 for nesting.
./configure
------------------------------------------------------------------------
Please select from among the following Linux aarch64 options:
1. (serial) 2. (smpar) 3. (dmpar) 4. (dm+sm) GNU (gfortran/gcc)
5. (serial) 6. (smpar) 7. (dmpar) 8. (dm+sm) GNU (gfortran/gcc)
9. (serial) 10. (smpar) 11. (dmpar) 12. (dm+sm) armclang (armflang/armclang): Aarch64
13. (serial) 14. (smpar) 15. (dmpar) 16. (dm+sm) GCC (gfortran/gcc): Aarch64
17. (serial) 18. (smpar) 19. (dmpar) 20. (dm+sm) NVHPC (nvfortran/nvc)
Enter selection [1-20] : 20
------------------------------------------------------------------------
Compile for nesting? (0=no nesting, 1=basic, 2=preset moves, 3=vortex following) [default 0]: 1
Depending on the compilers available in your environment, other options may be presented in the menu. Check the numbers in the menu before making your selection.
Reset environment variables:
# Reset build environment to include `-lnetcdf` in LDFLAGS
export CC=$(which mpicc)
export CXX=$(which mpicxx)
export FC=$(which mpifort)
export CPPFLAGS="-O3 -fPIC -I$HDFDIR/include"
export CFLAGS="-O3 -fPIC -I$HDFDIR/include"
export FFLAGS="-O3 -fPIC -I$HDFDIR/include"
export LDFLAGS="-O3 -fPIC -L$HDFDIR/lib -lnetcdf -lhdf5_hl -lhdf5 -lz"
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
Set stack size to “unlimited”:
ulimit -s unlimited
Run ./compile to build WRF and save the output to build.log:
./compile em_real 2>&1 | tee build.log
Look for a message similar to this at the end of the compilation log:
==========================================================================
build started: Wed Oct 4 05:07:19 PM PDT 2023
build completed: Wed Oct 4 05:07:44 PM PDT 2023
---> Executables successfully built <---
-rwxrwxr-x 1 jlinford jlinford 44994360 Oct 4 17:07 main/ndown.exe
-rwxrwxr-x 1 jlinford jlinford 44921440 Oct 4 17:07 main/real.exe
-rwxrwxr-x 1 jlinford jlinford 44481744 Oct 4 17:07 main/tc.exe
-rwxrwxr-x 1 jlinford jlinford 48876800 Oct 4 17:07 main/wrf.exe
==========================================================================
Run WRF CONUS 12km
Verify that the most recent NVIDIA HPC SDK is available in your environment. The simplest way to do this is to load the nvhpc module file.
module load nvhpc
Configure your environment.
export BUILD_DIR="$HOME/WRF"
export HDFDIR=$BUILD_DIR/opt
export HDF5=$BUILD_DIR/opt
export NETCDF=$BUILD_DIR/opt
export PATH=$HDFDIR/bin:$PATH
export LD_LIBRARY_PATH=$HDFDIR/lib:$LD_LIBRARY_PATH
Download and unpack the CONUS 12km input files into a fresh run directory.
cd $BUILD_DIR/WRFV4.5.2
# Copy the run directory template
cp -a run run_CONUS12km
cd run_CONUS12km
# Download the test case files and merge them into the run directory
wget https://www2.mmm.ucar.edu/wrf/src/conus12km.tar.gz
tar xvzf conus12km.tar.gz --strip-components=1
Configure the environment:
ulimit -s unlimited
export PATH=$NETCDF/bin:$HDFDIR/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$HDFDIR/lib:$LD_LIBRARY_PATH
On a Grace CPU Superchip with 144 cores, run WRF with 36 MPI ranks and give each MPI rank 4 OpenMP threads:
export OMP_STACKSIZE=1G
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=4
mpirun -np 36 -map-by ppr:18:numa:PE=4 ./wrf.exe
On a Grace Hopper Superchip with 72 cores, run WRF with 18 MPI ranks and give each MPI rank 4 OpenMP threads:
export OMP_STACKSIZE=1G
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=4
mpirun -np 18 -map-by ppr:18:numa:PE=4 ./wrf.exe
You can monitor the run progress by watching the output logs for MPI rank 0:
tail -f rsl.out.0000 rsl.error.0000
The benchmark score is the average elapsed seconds per domain for all MPI ranks. You can use the jq utility to calculate this easily from the output logs of all MPI ranks.
# Quickly calculate the average elapsed seconds per domain as a figure-of-merit
cat rsl.out.* | grep 'Timing for main:' | awk '{print $9}' | jq -s add/length
Reference Results: CONUS 12km
These figures are provided as guidelines and should not be interpreted as performance targets.
System | Capacity (GB) | Ranks | Threads | Average Elapsed Seconds |
---|---|---|---|---|
Grace CPU Superchip | 480 | 36 | 4 | 0.3884 |
Grace Hopper | 120 | 18 | 4 | 0.5761 |
Developing for NVIDIA Grace
Architectural Features
NVIDIA Grace implements the SVE2 and the NEON single-instruction-multiple-data (SIMD) instruction sets (refer to Arm SIMD Instructions for more information).
All server-class Arm64 processors support low-cost atomic operations that can improve system throughput for thread communication, locks, and mutexes (refer to Locks, Synchronization, and Atomics for more information).
All Arm CPUs (including NVIDIA Grace) provide several ways to determine the available CPU resources and topology at runtime (refer to Runtime CPU Detection for more information and the example code).
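For example, on Linux you can get a quick view of the core count, NUMA topology, and SIMD feature flags with standard tools (a minimal sketch; see Runtime CPU Detection for more robust, programmatic methods):
nproc                              # number of online CPUs
lscpu | grep -i numa               # NUMA node count and CPU assignment
grep -m1 Features /proc/cpuinfo    # look for "sve" and "sve2" in the feature flags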
Debugging and Profiling
Typically, all the same debuggers and profilers you rely on when working on x86 are available on NVIDIA Grace. The notable exceptions are vendor-specific products, for example, Intel® VTune. The capabilities provided by those tools are also available from other tools on NVIDIA Grace (refer to Debugging for more information).
Language-Specific Guidance
Check the Languages page for any language-specific guidance related to LSE, locking, synchronization, and atomics. If no guide is provided, there are no known Arm-specific issues for that language, and you can proceed as you would on any other platform.
Arm Vector Instructions: SVE and NEON
NVIDIA Grace implements two vector single-instruction-multiple-data (SIMD) instruction extensions:
- Advanced SIMD Instructions (NEON)
- Arm Scalable Vector Extensions (SVE)
Arm Advanced SIMD Instructions (or NEON) is the most common SIMD ISA for Arm64. It is a fixed-length SIMD ISA that supports 128-bit vectors. The first Arm-based supercomputer to appear on the Top500 Supercomputers list (Astra) used NEON to accelerate linear algebra, and many applications and libraries are already taking advantage of NEON.
More recently, Arm64 CPUs have started supporting Arm Scalable Vector Extensions (SVE), which is a length-agnostic SIMD ISA that supports more datatypes than NEON (for example, FP16), offers more powerful instructions (for example, gather/scatter), and supports vector lengths of more than 128 bits. SVE is currently found in NVIDIA Grace, the AWS Graviton 3, Fujitsu A64FX, and others. SVE is not a new version of NEON, but an entirely new SIMD ISA.
The following table provides a quick summary of the SIMD capabilities of some of the currently available Arm64 CPUs:
NVIDIA Grace | AWS Graviton3 | Fujitsu A64FX | AWS Graviton2 | Ampere Altra | |
---|---|---|---|---|---|
CPU Core | Neoverse V2 | Neoverse V1 | A64FX | Neoverse N1 | Neoverse N1 |
SIMD ISA | SVE2 & NEON | SVE & NEON | SVE & NEON | NEON only | NEON only |
NEON Configuration | 4x128 | 4x128 | 2x128 | 2x128 | 2x128 |
SVE Configuration | 4x128 | 2x256 | 2x512 | N/A | N/A |
SVE Version | 2 | 1 | 1 | N/A | N/A |
NEON FMLA FP64 TPeak | 16 | 16 | 8 | 8 | 8 |
SVE FMLA FP64 TPeak | 16 | 16 | 32 | N/A | N/A |
Many recent Arm64 CPUs provide the same peak theoretical performance for NEON and SVE. For example, NVIDIA Grace can retire four 128-bit NEON operations or four 128-bit SVE2 operations. Although the theoretical peak performance of SVE and NEON are the same for these CPUs, SVE (and especially SVE2) is a more capable SIMD ISA with support for complex data types and advanced features that enable the vectorization of complicated code. In practice, kernels that cannot be vectorized in NEON can be vectorized with SVE. So, although SVE will not beat NEON in a performance drag race, it can dramatically improve the overall performance of the application by vectorizing loops that would have otherwise executed with scalar instructions.
Fortunately, auto-vectorizing compilers are usually the best choice when programming Arm SIMD ISAs. The compiler will generally make the best decision on when to use SVE or NEON, and it will take advantage of SVE's advanced auto-vectorization features more easily than a human coding in intrinsics or assembly can.
Avoid writing SVE or NEON intrinsics. To realize the best performance for a loop, use the appropriate command-line options with your favorite auto-vectorizing compiler. You might need to use compiler directives or make changes in the high-level code to facilitate auto-vectorization, but this will be much easier and more maintainable than writing intrinsics. Leave the finer details to the compiler and focus on code patterns that auto-vectorize well.
Compiler-Driven Auto-Vectorization
The key to maximizing auto-vectorization is to allow the compiler to take advantage of the available hardware features. By default, GCC and LLVM compilers take a conservative approach and do not enable advanced features unless explicitly told to do so. The easiest way to enable all available features for GCC or LLVM is to use the -mcpu compiler flag. If you are compiling on the same CPU on which the code will run, use -mcpu=native. Otherwise, you can use -mcpu=<target>, where <target> is one of the CPU identifiers, for example, -mcpu=neoverse-v2.
The NVIDIA compilers take a more aggressive approach. By default, these compilers assume that the machine on which you are compiling is the machine on which you will run and enable all available hardware features that were detected at compile time. When compiling with the NVIDIA compilers natively on Grace, you do not need additional flags.
Note: When possible, use the most recent version of your compiler. For example, GCC9 supported auto-vectorization fairly well, but GCC12 has shown impressive improvement over GCC9 in most cases. GCC13 further improves auto-vectorization.
The second key compiler feature is the compiler vectorization report. GCC uses the -fopt-info flags to report on auto-vectorization success or failure. You can use the generated informational messages to guide code annotations or transformations that will facilitate auto-vectorization. For example, compiling with -fopt-info-vec-missed will report which loops were not vectorized.
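As a minimal sketch (the file name, kernel, and exact flags below are illustrative, not taken from the guide), the following loop is written so the compiler can vectorize it without further hints: the restrict qualifiers promise that the arrays do not overlap, and the loop body has no loop-carried dependence.
// saxpy.c -- illustrative example of an auto-vectorization-friendly loop
void saxpy(int n, float a, const float * restrict x, float * restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
Compiling it with, for example, gcc -O3 -mcpu=native -fopt-info-vec-missed -c saxpy.c should produce no missed-vectorization messages for this loop; any messages that do appear point at the loops that need attention.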
Relaxed Vector Conversions
Arm NEON differentiates between vectors of signed and unsigned types. For example, GCC will not implicitly cast between vectors of signed and unsigned 64-bit integers:
#include <arm_neon.h>
...
uint64x2_t u64x2;
int64x2_t s64x2;
// Error: cannot convert 'int64x2_t' to 'uint64x2_t' in assignment
u64x2 = s64x2;
To perform the cast, you must use NEON's vreinterpretq functions:
u64x2 = vreinterpretq_u64_s64(s64x2);
Unfortunately, some codes written for other SIMD ISAs rely on these kinds of implicit conversions. If you see errors about “no known conversion” in a code that builds for AVX but does not build for NEON, you might need to relax GCC’s vector conversion rules:
/tmp/velox/third_party/xsimd/include/xsimd/types/xsimd_batch.hpp:35:11: note: no known conversion for argument 1 from 'xsimd::batch<long int>' to 'const xsimd::batch<long unsigned int>&'
To allow implicit conversions between vectors with differing numbers of elements and/or incompatible element types, use the -flax-vector-conversions flag. This flag should be fine for legacy code, but it should not be used for new code. The safest option is to use the appropriate vreinterpretq calls.
Runtime Detection of Supported SIMD Instructions
To make your binaries more portable across various Arm64 CPUs, use the Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core that is compliant with Armv8.4 must support dot-product, but dot-products are optional in Armv8.2 and Armv8.3. A developer who wants to build an application or library that can detect the supported instructions at runtime can follow this example:
#include<sys/auxv.h>
......
uint64_t hwcaps = getauxval(AT_HWCAP);
has_crc_feature = hwcaps & HWCAP_CRC32 ? true : false;
has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false;
has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false;
has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false;
has_sve_feature = hwcaps & HWCAP_SVE ? true : false;
The full list of Arm64 hardware capabilities is defined in the glibc header file and in the Linux kernel.
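For example, the hardware capability bits can drive a simple runtime dispatch between an SVE-optimized routine and a generic fallback. The sketch below is illustrative: the two scale functions are placeholders for real kernel variants, and only the getauxval/HWCAP_SVE check is the point of the example.
#include <stdio.h>
#include <sys/auxv.h>

// Placeholder implementations; a real code base would provide an
// SVE-optimized build of the hot kernel here.
static void scale_generic(float *x, int n) { for (int i = 0; i < n; i++) x[i] *= 2.0f; }
static void scale_sve(float *x, int n)     { for (int i = 0; i < n; i++) x[i] *= 2.0f; }

typedef void (*scale_fn)(float *, int);

int main(void)
{
    // Pick the implementation once, based on what the kernel reports.
    unsigned long hwcaps = getauxval(AT_HWCAP);
    scale_fn scale = (hwcaps & HWCAP_SVE) ? scale_sve : scale_generic;
    float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    scale(data, 4);
    printf("SVE path selected: %s\n", (hwcaps & HWCAP_SVE) ? "yes" : "no");
    return 0;
}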
Porting Codes with SSE/AVX Intrinsics to NEON
Detecting Arm64 systems
If you see errors like error: unrecognized command-line option '-msse2', it usually means that the build system is failing to detect Grace as an Arm CPU and is incorrectly using x86 target-feature compiler flags.
To detect an Arm64 system, the build system can use the following command:
(test $(uname -m) = "aarch64" && echo "arm64 system") || echo "other system"
Alternatively, you can compile, run, and check the return value of a small C program:
# cat << EOF > check-arm64.c
int main () {
#ifdef __aarch64__
return 0;
#else
return 1;
#endif
}
EOF
# gcc check-arm64.c -o check-arm64
# (./check-arm64 && echo "arm64 system") || echo "other system"
Translating x86 Intrinsics to NEON
When programs contain code with x86 intrinsics, drop-in intrinsic translation tools like SIMDe or sse2neon can be used to quickly obtain a working program on Arm64. This is a good starting point for rewriting the x86 intrinsics in NEON or SVE and will quickly get a prototype up and running. For example, to port code using AVX2 intrinsics with SIMDe:
#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/x86/avx2.h"
SIMDe provides a quick starting point to port performance critical codes to Arm64. It shortens the time needed to get a working program that can be used to extract profiles and to identify hot paths in the code. After a profile is established, the hot paths can be rewritten to avoid the overhead of the generic translation.
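For illustration, the hypothetical function below uses AVX2 integer intrinsics unchanged; with the native aliases enabled as shown above, SIMDe maps the _mm256_* calls to NEON (or scalar) equivalents when the file is built on Arm64.
#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/x86/avx2.h"

// Unmodified AVX2 code: adds eight 32-bit integers element-wise.
void add_i32x8(const int *a, const int *b, int *c)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)c, _mm256_add_epi32(va, vb));
}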
Since you are rewriting your x86 intrinsics, you might want to take this opportunity to create a more portable version. Here are some suggestions to consider:
- Rewrite in native C/C++, Fortran, or another high-level compiled language. Compilers are constantly improving, and technologies like Arm SVE enable the auto-vectorization of codes that formerly would not vectorize. You can avoid platform-specific intrinsics entirely and let the compiler do all the work.
- If your application is written in C++, use std::experimental::simd from the C++ Parallelism Technical Specification V2 by using the <experimental/simd> header.
- Use the SLEEF Vectorized Math Library as a header-based set of “portable intrinsics”.
- Instead of Time Stamp Counter (TSC) RDTSC intrinsics, use standards-compliant portable timers, for example, std::chrono (C++), clock_gettime (C/POSIX), omp_get_wtime (OpenMP), MPI_Wtime (MPI), and so on; a minimal clock_gettime sketch follows this list.
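Here is a minimal sketch of such a portable timer using clock_gettime; the loop being timed is only a placeholder.
#include <stdio.h>
#include <time.h>

// Portable wall-clock timer: a drop-in replacement for RDTSC-based timing.
static double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

int main(void)
{
    double t0 = wtime();
    volatile double x = 0.0;                 // placeholder work to time
    for (int i = 0; i < 1000000; i++) x += 1.0;
    printf("elapsed: %.6f s (x=%.0f)\n", wtime() - t0, x);
    return 0;
}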
Locks, Synchronization, and Atomics
Efficient synchronization is critical to achieving good performance in applications with high thread counts, but synchronization is a complex and nuanced topic. See below for a high-level overview (refer to the Synchronization Overview and Case Study on Arm Architecture whitepaper from Arm for more information).
The Arm Memory Model
One of the most significant differences between Arm and the x86 CPUs is their memory model. The Arm architecture has a weak memory model that allows for more compiler and hardware optimization to boost system performance. This differs from the x86 architecture Total Store Order (TSO) model. Different memory models can cause low-level code (for example, drivers) to function well on one architecture but encounter performance problems or failures on the other.
The unique features of the Arm memory model are only relevant if you are writing low-level code, such as assembly language; most software developers will not be affected by the change in memory model. The details of Arm's memory model are below the application level and are completely invisible to most users. If you are writing in a high-level language such as C, C++, or Fortran, you do not need to know the nuances of Arm's memory model. The one exception to this general rule is code that uses boutique synchronization constructs instead of standard best practices, for example, using volatile as a means of thread synchronization.
Deviating from established standards or ignoring best practices results in code that is almost guaranteed to be broken; such code should be rewritten to use system-provided locks, condition variables, and the stdatomic tools. (Refer to https://github.com/ParRes/Kernels/issues/611 for an example of this type of bug.)
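As a minimal sketch of that kind of rewrite (the variable names are illustrative), a flag that one thread sets and another thread polls should be a C11 atomic rather than a volatile variable; the atomic store and load carry the inter-thread ordering guarantees that volatile does not.
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool ready = false;   // was: volatile bool ready;
static int payload;

void producer(void)
{
    payload = 42;                   // ordinary write
    atomic_store(&ready, true);     // publishes payload to other threads
}

int consumer(void)
{
    while (!atomic_load(&ready))    // payload is visible once this is true
        ;
    return payload;
}
With the default sequentially consistent ordering, the store to ready cannot be observed before the write to payload.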
Arm is not the only architecture that uses a weak memory model. If your application already runs on CPUs that are not x86-based, you might encounter fewer bugs that are related to the weak memory model. Specifically, if your application has been ported to a CPU implementing the POWER architecture, for example, IBM POWER9, the application will work on the Arm memory model.
Large-System Extension (LSE) Atomic Instructions
All server-class Arm64 processors, such as NVIDIA Grace, have support for the Large-System Extension (LSE), which was first introduced in Armv8.1. LSE provides low-cost atomic operations that can improve system throughput for thread communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. This improvement is not generally true for older Arm64 CPUs like the Marvell ThunderX2 or the Fujitsu A64FX (refer to these slides from the ISC 2022 AHUG Workshop for more information).
When building an application from source, the compiler needs to generate LSE atomic instructions for applications that use atomic operations. For example, the code of databases such as PostgreSQL contains atomic constructs: C++11 code with std::atomic statements that translate into atomic operations. Since GCC 9.4, GCC's -mcpu=native flag enables all instructions supported by the host CPU, including LSE. To confirm that LSE instructions are generated, the output of the objdump command-line utility should contain LSE instructions:
$ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l
To check whether the application binary contains load and store exclusives, run the following command:
$ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l
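As a minimal sketch of how such instructions arise (the file name and the commands in the comment are illustrative), an atomic fetch-and-add compiled with -mcpu=native on Grace should be emitted as a single LSE instruction rather than a load/store-exclusive retry loop:
// lse_check.c -- illustrative example
#include <stdatomic.h>

atomic_long counter;

void bump(void)
{
    atomic_fetch_add(&counter, 1);
}

// Build and inspect (illustrative commands):
//   gcc -O2 -mcpu=native -c lse_check.c
//   objdump -d lse_check.o | grep -i ldadd
// On Grace, the fetch-add above should appear as an LSE "ldadd" form;
// without LSE it would compile to an ldxr/stxr loop instead.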
Runtime CPU Detection
At runtime, you can determine the following information about the available Arm CPU resources and topology:
- CPU architecture and supported instructions
- CPU manufacturer
- Number of CPU sockets
- CPU cores per socket
- Number of NUMA nodes
- Number of NUMA nodes per socket
- CPU cores per NUMA node
Well-established portable libraries like libnuma and hwloc are a great choice on Grace. You can also use Arm's CPUID registers or query OS files. Since many of these methods serve the same function, you should choose the method that best fits your application.
If you are implementing your own approach, look at the Arm Architecture Registers, especially the Main ID Register MIDR_EL1: https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register.
The source code of the lscpu utility is a great example of how to retrieve and use these registers. For example, to learn how to translate the CPU part number in the MIDR_EL1 register to a human-readable string, read https://github.com/util-linux/util-linux/blob/master/sys-utils/lscpu-arm.c.
Here is the output of lscpu on NVIDIA Grace Hopper:
nvidia@localhost:/home/nvidia$ lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Vendor ID: ARM
Model: 0
Thread(s) per core: 1
Core(s) per socket: 72
Socket(s): 1
Stepping: r0p0
Frequency boost: disabled
CPU max MHz: 3438.0000
CPU min MHz: 81.0000
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
L1d: 4.5 MiB (72 instances)
L1i: 4.5 MiB (72 instances)
L2: 72 MiB (72 instances)
L3: 114 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-71
NUMA node1 CPU(s):
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
CPU Hardware Capabilities
To make your binaries more portable across various Arm64 CPUs, you can use Arm64 hardware capabilities to determine the available instructions at runtime. For example, a CPU core that is compliant with Armv8.4 must support dot-product, but dot-products are optional in Armv8.2 and Armv8.3. A developer who wants to build an application or library that can detect the supported instructions at runtime can use this example:
#include<sys/auxv.h>
......
uint64_t hwcaps = getauxval(AT_HWCAP);
has_crc_feature = hwcaps & HWCAP_CRC32 ? true : false;
has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false;
has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false;
has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false;
has_sve_feature = hwcaps & HWCAP_SVE ? true : false;
A complete list of Arm64 hardware capabilities is defined in the glibc header file and in the Linux kernel.
Example Source Code
Here is a complete yet simple example code that includes some of the methods mentioned above.
#include <stdio.h>
#include <sys/auxv.h>
#include <numa.h>
// https://developer.arm.com/documentation/ddi0601/2020-12/AArch64-Registers/MIDR-EL1--Main-ID-Register
typedef union
{
struct {
unsigned int revision : 4;
unsigned int part : 12;
unsigned int arch : 4;
unsigned int variant : 4;
unsigned int implementer : 8;
unsigned int _RES0 : 32;
};
unsigned long bits;
} MIDR_EL1;
static MIDR_EL1 read_MIDR_EL1()
{
MIDR_EL1 reg;
asm("mrs %0, MIDR_EL1" : "=r" (reg.bits));
return reg;
}
static const char * get_implementer_name(MIDR_EL1 midr)
{
switch(midr.implementer)
{
case 0xC0: return "Ampere";
case 0x41: return "Arm";
case 0x42: return "Broadcom";
case 0x43: return "Cavium";
case 0x44: return "DEC";
case 0x46: return "Fujitsu";
case 0x48: return "HiSilicon";
case 0x49: return "Infineon";
case 0x4D: return "Motorola";
case 0x4E: return "NVIDIA";
case 0x50: return "Applied Micro";
case 0x51: return "Qualcomm";
case 0x56: return "Marvell";
case 0x69: return "Intel";
default: return "Unknown";
}
}
static const char * get_part_name(MIDR_EL1 midr)
{
switch(midr.implementer)
{
case 0x41: // Arm Ltd.
switch (midr.part) {
case 0xd03: return "Cortex A53";
case 0xd07: return "Cortex A57";
case 0xd08: return "Cortex A72";
case 0xd09: return "Cortex A73";
case 0xd0c: return "Neoverse N1";
case 0xd40: return "Neoverse V1";
case 0xd4f: return "Neoverse V2";
default: return "Unknown";
}
case 0x42: // Broadcom
switch (midr.part) {
case 0x516: return "Vulcan";
default: return "Unknown";
}
case 0x43: // Cavium
switch (midr.part) {
case 0x0a1: return "ThunderX";
case 0x0af: return "ThunderX2";
default: return "Unknown";
}
case 0x46: // Fujitsu
switch (midr.part) {
case 0x001: return "A64FX";
default: return "Unknown";
}
case 0x4E: // NVIDIA
switch (midr.part) {
case 0x000: return "Denver";
case 0x003: return "Denver 2";
case 0x004: return "Carmel";
default: return "Unknown";
}
case 0x50: // Applied Micro
switch (midr.part) {
case 0x000: return "EMAG 8180";
default: return "Unknown";
}
default: return "Unknown";
}
}
int main(void)
{
// Main ID register
MIDR_EL1 midr = read_MIDR_EL1();
    // CPU ISA capabilities
    unsigned long hwcaps = getauxval(AT_HWCAP);
    unsigned long hwcaps2 = getauxval(AT_HWCAP2);
printf("CPU revision : 0x%x\n", midr.revision);
printf("CPU part number : 0x%x (%s)\n", midr.part, get_part_name(midr));
printf("CPU architecture: 0x%x\n", midr.arch);
printf("CPU variant : 0x%x\n", midr.variant);
printf("CPU implementer : 0x%x (%s)\n", midr.implementer, get_implementer_name(midr));
printf("CPU LSE atomics : %sSupported\n", (hwcaps & HWCAP_ATOMICS) ? "" : "Not ");
printf("CPU NEON SIMD : %sSupported\n", (hwcaps & HWCAP_ASIMD) ? "" : "Not ");
printf("CPU SVE SIMD : %sSupported\n", (hwcaps & HWCAP_SVE) ? "" : "Not ");
printf("CPU Dot-product : %sSupported\n", (hwcaps & HWCAP_ASIMDDP) ? "" : "Not ");
printf("CPU FP16 : %sSupported\n", (hwcaps & HWCAP_FPHP) ? "" : "Not ");
printf("CPU BF16 : %sSupported\n", (hwcaps & HWCAP2_BF16) ? "" : "Not ");
    if (numa_available() == -1) {
        printf("libnuma not available\n");
        return 1;
    }
printf("CPU NUMA nodes : %d\n", numa_num_configured_nodes());
printf("CPU Cores : %d\n", numa_num_configured_cpus());
return 0;
}
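Assuming the example is saved as a file named cpu_info.c (an illustrative name), it can be built with a command along the lines of gcc -O2 cpu_info.c -o cpu_info -lnuma; the libnuma development package provides the numa.h header and the library needed for the -lnuma link step.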
Debugging
This section provides information about useful techniques and tools to find and resolve bugs while migrating your applications to NVIDIA Grace.
Sanitizers
The compiler might generate code and lay out data slightly differently on Arm64 than on an x86 system, and this could expose latent memory bugs that were previously hidden. On GCC, the easiest way to look for these bugs is to compile with the address and undefined-behavior sanitizers by adding the following to the standard compiler flags:
CFLAGS += -fsanitize=address -fsanitize=undefined
LDFLAGS += -fsanitize=address -fsanitize=undefined
Run the resulting binary, and bugs that are detected by the sanitizers will cause the program to exit immediately and print helpful stack traces and other information.
Memory Ordering
Arm is weakly ordered, like POWER and other modern architectures, while x86 uses a variant of total store ordering (TSO). Code that relies on TSO might lack the barriers needed to properly order memory references. Arm64 systems are weakly ordered but multi-copy atomic.
Although TSO allows reads to occur out-of-order with writes, and a processor to observe its own write before it is visible to others, the Armv8 memory model provides additional relaxations for performance and power efficiency. Code relying on pthread mutexes or locking abstractions found in C++, Java or other languages should not notice any difference. Code that has a bespoke implementation of lockless data structures, or implements its own synchronization primitives, will have to use the proper intrinsics and barriers to correctly order memory transactions (refer to Locks, Synchronization, and Atomics for more information).
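As a minimal sketch of what the proper C11 orderings can look like (the variable names are illustrative), a hand-rolled handoff between two threads can use explicit release/acquire ordering; on x86 the hardware's TSO ordering often masks a missing barrier, whereas on Arm the explicit orderings below are what make the handoff correct:
#include <stdatomic.h>

static int data;
static atomic_int flag;

void publish(int value)
{
    data = value;                                           // plain write
    atomic_store_explicit(&flag, 1, memory_order_release);  // release: data is visible first
}

int consume(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   // acquire pairs with the release above
    return data;
}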
Language-Specific Considerations
This section contains language-specific information and recommendations. If no section exists for a language, it is because there is no specific guidance beyond using a suitably current version of the language; you can proceed as you would on any other CPU, Arm-based or otherwise.
Broadly speaking, applications that are built using interpreted or JIT’ed languages (Python, Java, PHP, Node.js, and so on) should run as-is on Arm64. Applications using compiled languages, including C/C++, Fortran, and Rust need to be compiled for the Arm64 architecture. Most modern build systems (Make, CMake, Ninja, and so on) will just work on Arm64.
C/C++ on NVIDIA Grace
There are many C/C++ compilers available for NVIDIA Grace, including the NVIDIA HPC Compiler, GCC, Clang/LLVM, the Arm Compiler for Linux (ACfL), and the HPE Cray Compilers.
Selecting a Compiler
The compiler you use depends on your application’s needs. If in doubt, try the NVIDIA HPC Compiler first because this compiler will always have the most recent updates and enhancements for Grace. If you prefer Clang, NVIDIA provides builds of Clang for NVIDIA Grace that are supported by NVIDIA and certified as a CUDA host compiler. GCC, ACfL, and HPE Cray Compilers also have their own advantages. As a general strategy, default to an NVIDIA-provided compiler and fall back to a third-party as needed.
When possible, use the latest compiler version that is available on your system. Newer compilers provide better support and optimizations for Grace, and when using newer compilers, many codes will demonstrate significantly better performance.
Recommended Compiler Flags
The NVIDIA HPC Compilers accept PGI flags and many GCC or Clang compiler flags. These compilers include the NVFORTRAN, NVC++, and NVC compilers. They work with an assembler, linker, libraries, and header files on your target system, and include a CUDA toolchain, libraries, and header files for GPU computing. Refer to the NVIDIA HPC Compiler's User's Guide for more information. The freely available NVIDIA HPC SDK is the best way to quickly get started with the NVIDIA HPC Compiler.
Most compiler flags for GCC and Clang/LLVM operate the same on Arm64 as on other architectures except for the -mcpu flag. On Arm64, this flag specifies both the appropriate architecture and the tuning strategy. It is generally better to use -mcpu instead of -march or -mtune on Grace. You can find additional details in this presentation given at Stony Brook University.
CPU | Flag | GCC version | LLVM version |
---|---|---|---|
NVIDIA Grace | -mcpu=neoverse-v2 | 12.3+ | 16+ |
Ampere Altra | -mcpu=neoverse-n1 | 9+ | 10+ |
Any Arm64 | -mcpu=native | 9+ | 10+ |
If you are cross compiling, use the appropriate -mcpu option for your target CPU. For example, to target NVIDIA Grace when compiling on an AWS Graviton 3, use -mcpu=neoverse-v2.
Compiler-Supported Hardware Features
The common -mcpu=native flag enables all instructions supported by the host CPU. You can check which Arm features GCC will enable with the -mcpu=native flag by running the following command:
gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE
For example, on the NVIDIA Grace CPU with GCC 12.3, we see “__ARM_FEATURE_ATOMICS 1”, indicating that LSE atomics are enabled:
$ gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE
#define __ARM_FEATURE_ATOMICS 1
#define __ARM_FEATURE_SM3 1
#define __ARM_FEATURE_SM4 1
#define __ARM_FEATURE_RCPC 1
#define __ARM_FEATURE_SVE_VECTOR_OPERATORS 1
#define __ARM_FEATURE_SVE2_AES 1
#define __ARM_FEATURE_AES 1
#define __ARM_FEATURE_SVE 1
#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_JCVT 1
#define __ARM_FEATURE_DOTPROD 1
#define __ARM_FEATURE_BF16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_MATMUL_INT8 1
#define __ARM_FEATURE_CRYPTO 1
#define __ARM_FEATURE_BF16_VECTOR_ARITHMETIC 1
#define __ARM_FEATURE_FRINT 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_CLZ 1
#define __ARM_FEATURE_SHA512 1
#define __ARM_FEATURE_QRDMX 1
#define __ARM_FEATURE_FMA 1
#define __ARM_FEATURE_SHA2 1
#define __ARM_FEATURE_SVE2_SHA3 1
#define __ARM_FEATURE_COMPLEX 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
#define __ARM_FEATURE_SVE2_SM4 1
#define __ARM_FEATURE_SVE_MATMUL_INT8 1
#define __ARM_FEATURE_FP16_FML 1
#define __ARM_FEATURE_UNALIGNED 1
#define __ARM_FEATURE_SHA3 1
#define __ARM_FEATURE_CRC32 1
#define __ARM_FEATURE_SVE_BITS 0
#define __ARM_FEATURE_NUMERIC_MAXMIN 1
#define __ARM_FEATURE_SVE2 1
#define __ARM_FEATURE_SVE2_BITPERM 1
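These predefined macros can also guard architecture-specific code paths at compile time. A minimal sketch (the messages are illustrative):
#include <stdio.h>

int main(void)
{
#if defined(__ARM_FEATURE_SVE2)
    printf("built with SVE2 enabled\n");
#elif defined(__ARM_FEATURE_SVE)
    printf("built with SVE enabled\n");
#else
    printf("built without SVE\n");
#endif
#if defined(__ARM_FEATURE_ATOMICS)
    printf("built with LSE atomics enabled\n");
#endif
    return 0;
}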
Porting SSE or AVX Intrinsics
Header-based intrinsics translation tools, such as SIMDe and SSE2NEON, are a great way to quickly get a prototype running on Arm64. These tools automatically translate x86 intrinsics to SVE or NEON intrinsics (refer to Arm Single-Instruction Multiple-Data Instructions). This approach provides a quick starting point when porting performance critical codes and shortens the time needed to get a working program that can be used to extract profiles and to identify hot paths in the code. After a profile is established, the hot paths can be rewritten to avoid the overhead of the automatic translation of intrinsics.
Note: GCC's __sync built-ins are outdated and might be biased towards the x86 memory model. Use the __atomic versions of these functions instead of the __sync versions. Refer to the GCC documentation for more information.
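A minimal sketch of the substitution (the counter is illustrative); the __atomic form takes an explicit memory-order argument instead of implying a full barrier:
long counter;

void bump(void)
{
    // Legacy form, implies a full barrier: __sync_fetch_and_add(&counter, 1);
    // Preferred form with an explicit memory order:
    __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
}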
Signedness of the char Type
The C standard does not specify the signedness of the char type. On x86, many compilers assume that char is signed by default, but on Arm64, compilers often assume that it is unsigned by default. This difference can be addressed by using standard integer types that specify signedness (for example, uint8_t and int8_t) or by specifying char signedness with compiler flags, for example, -fsigned-char or -funsigned-char.
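A minimal sketch of how the difference shows up (the printed value depends on the platform defaults):
#include <stdio.h>

int main(void)
{
    char c = (char)0xFF;
    // Typically prints -1 where char defaults to signed (common on x86)
    // and 255 where char defaults to unsigned (common on Arm64).
    printf("%d\n", c);
    return 0;
}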
Arm Instructions for Machine Learning
NVIDIA Grace supports Arm dot-product instructions (commonly used for quantized machine learning inference workloads) and half-precision floating point (FP16). These features enable performant and power-efficient machine learning by doubling the number of operations per second and reducing the memory footprint compared to single-precision floating point, all while enjoying a large dynamic range. These features are enabled automatically in the NVIDIA compilers. To enable them in the GNU or LLVM compilers, compile with -mcpu=native or -mcpu=neoverse-v2.
Fortran on NVIDIA Grace
There are many Fortran compilers available for NVIDIA Grace, including the NVIDIA HPC Compiler (NVFORTRAN), GFORTRAN, the Arm Compiler for Linux (ACfL), and the HPE Cray Compilers.
Selecting a Compiler
The compiler you use depends on your application’s needs. If in doubt, try the NVIDIA HPC Compiler first because this compiler will always have the most recent updates and enhancements for Grace. GFORTRAN, ACfL, and HPE Cray Compilers also have their own advantages. As a general strategy, default to an NVIDIA-provided compiler and fall back to a third-party as needed.
When possible, use the latest compiler version that is available on your system. Newer compilers provide better support and optimizations for Grace, and when using newer compilers, many codes will demonstrate significantly better performance.
Recommended Compiler Flags
The NVIDIA HPC Compilers accept PGI flags and many GCC or Clang compiler flags. These compilers include the NVFORTRAN, NVC++, and NVC compilers. They work with an assembler, linker, libraries, and header files on your target system, and include a CUDA toolchain, libraries, and header files for GPU computing. Refer to the NVIDIA HPC Compiler's User's Guide for more information. The freely available NVIDIA HPC SDK is the best way to quickly get started with the NVIDIA HPC Compiler.
Most compiler flags for GCC and Clang/LLVM operate the same on Arm64 as on other architectures except for the -mcpu flag. On Arm64, this flag specifies both the appropriate architecture and the tuning strategy. It is generally better to use -mcpu instead of -march or -mtune on Grace. You can find additional details in this presentation given at Stony Brook University.
CPU | Flag | GFORTRAN version |
---|---|---|
NVIDIA Grace | -mcpu=neoverse-v2 | 12.3+ |
Ampere Altra | -mcpu=neoverse-n1 | 9+ |
Any Arm64 | -mcpu=native | 9+ |
If you are cross compiling, use the appropriate -mcpu option for your target CPU. For example, to target NVIDIA Grace when compiling on an AWS Graviton 3, use -mcpu=neoverse-v2.
Rust on NVIDIA Grace
Rust supports Arm64 systems as a Tier 1 platform.
Large-System Extensions (LSE)
LSE improves system throughput for CPU-to-CPU communication, locks, and mutexes. LSE can be enabled in Rust, and on larger machines there have been instances where performance improved by more than three times after setting the RUSTFLAGS environment variable and rebuilding the project:
export RUSTFLAGS="-Ctarget-cpu=neoverse-v2"
cargo build --release
Python on NVIDIA Grace
Python is an interpreted, high-level, general-purpose programming language, with interpreters available for many operating systems and architectures, including Arm64. Python 2.7 went end-of-life on January 1, 2020, so we recommend that you upgrade to a Python 3.x version.
Installing Python packages
When pip, the standard package installer for Python, is used, it pulls packages from the Python Package Index and other indexes. If pip cannot find a pre-compiled package, it automatically downloads, compiles, and builds the package from source code. Installing a package from source typically takes a few minutes longer than installing a pre-built package, especially for large packages (for example, pandas).
To install common Python packages from the source code, you need to install the following development tools:
On RedHat
sudo yum install "@Development tools" python3-pip python3-devel blas-devel gcc-gfortran lapack-devel
python3 -m pip install --user --upgrade pip
On Debian/Ubuntu
sudo apt update
sudo apt-get install build-essential python3-pip python3-dev libblas-dev gfortran liblapack-dev
python3 -m pip install --user --upgrade pip
Scientific and Numerical Applications
Python relies on native code to achieve high performance. For scientific and numerical applications, NumPy and SciPy provide an interface to high performance computing libraries such as ATLAS, BLAS, BLIS, OpenBLAS, and so on. These libraries contain code tuned for Arm64 processors, and especially the Arm Neoverse V2 core found in NVIDIA Grace.
We recommend that you use the latest software versions as much as possible. If migrating to the latest version is not feasible, ensure that you use at least the minimum versions recommended below. Multiple fixes related to data precision and correctness on Arm64 went into OpenBLAS between v0.3.9 and v0.3.17, and the following SciPy and NumPy versions upgraded their bundled OpenBLAS from v0.3.9 to v0.3.17.
Here are the minimum versions:
- OpenBLAS: >= v0.3.17
- SciPy: >= v1.7.2
- NumPy: >= 1.21.1
The default SciPy and NumPy binary installations with pip3 install numpy scipy are configured to use OpenBLAS. The default installations of SciPy and NumPy are easy to set up and well tested.
Anaconda and Conda
Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment.
Anaconda has supported Arm64 (via AWS Graviton 2) since 2021, so Anaconda works very well on NVIDIA Grace. Instructions to install the full Anaconda package installer can be found at https://docs.anaconda.com/. Anaconda also offers a lightweight version called Miniconda, which is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip and zlib.
Java on NVIDIA Grace
Java is well supported and generally performant out-of-the-box on Arm64. While Java 8 is fully supported on Arm64, some customers have not been able to obtain the CPU’s full performance benefit until after switching to Java 11.
This section includes specific details about building and tuning Java applications on Arm64.
Java JVM Options
There are JVM options that can lead to better performance. The flags -XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M -XX:InitialCodeCacheSize=64M have shown large (1.5x) improvements in some Java workloads. ReservedCodeCacheSize and InitialCodeCacheSize should be equal and between 64M and 127M. The JIT compiler stores generated code in the code cache, and these flags shrink the code cache from the default 240M. A smaller code cache can help the CPU cache and predict JIT'ed code more effectively, and disabling tiered compilation lets the JIT compiler work within the smaller code cache. These flags help some workloads and hurt others, so testing with and without them is essential.
Java Stack Size
For some JVMs, the default stack size for Java threads (ThreadStackSize) is 2MB on Arm64 instead of the 1MB used on x86. You can check the default with the following command:
$ java -XX:+PrintFlagsFinal -version | grep ThreadStackSize
intx CompilerThreadStackSize = 2048 {pd product} {default}
intx ThreadStackSize = 2048 {pd product} {default}
intx VMThreadStackSize = 2048 {pd product} {default}
The default can be easily changed on the command line with either -XX:ThreadStackSize=<kbytes> or -Xss<bytes>. Notice that -XX:ThreadStackSize interprets its argument as kilobytes and -Xss interprets it as bytes. As a result, -XX:ThreadStackSize=1024 and -Xss1m will both set the stack size for Java threads to 1 megabyte:
$ java -Xss1m -XX:+PrintFlagsFinal -version | grep ThreadStackSize
intx CompilerThreadStackSize = 2048 {pd product} {default}
intx ThreadStackSize = 1024 {pd product} {command line}
intx VMThreadStackSize = 2048 {pd product} {default}
Typically, you do not have to change the default value because the thread stack will be committed lazily as it grows. Regardless of the default value, the thread will always only commit as much stack as it really uses (at page size granularity). However, there is one exception to this rule. If Transparent Huge Pages (THP) are turned on by default on a system, the stack will be completely committed to memory from the start. If you are using hundreds, or even thousands of threads, this memory overhead can be considerable.
To mitigate this issue, you can either manually change the stack size on the command line (as described above) or change the default for THP from always to madvise:
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Even if the default is changed from always to madvise, the JVM can still use THP for the Java heap and code cache if you specify -XX:+UseTransparentHugePages on the command line.
Additional Resources
- NVIDIA Grace Documentation
- NVIDIA Grace CPU Developer Resources
- Arm Neoverse V2 Software Optimization Guide
- Arm Neoverse V2 PMU Guide
- Neon Intrinsics
- Coding for Neon
- Neon Programmer’s Guide for Armv8-A
Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice. Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete. NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
Trademarks
NVIDIA and the NVIDIA logo are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
Arm
Arm, AMBA, and ARM Powered are registered trademarks of Arm Limited. Cortex, MPCore, and Mali are trademarks of Arm Limited. All other brands or product names are the property of their respective holders. “Arm” is used to represent ARM Holdings plc; its operating company Arm Limited; and the regional subsidiaries Arm Inc.; Arm KK; Arm Korea Limited.; Arm Taiwan Limited; Arm France SAS; Arm Consulting (Shanghai) Co. Ltd.; Arm Germany GmbH; Arm Embedded Technologies Pvt. Ltd.; Arm Norway, AS, and Arm Sweden AB.
OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Copyright
© 2023 NVIDIA Corporation & Affiliates. All rights reserved.