HiBench: K-means
This workload from the HiBench suite tests K-means clustering, a well-known algorithm for knowledge discovery and data mining, as implemented in spark.mllib. The input data set is generated by GenKMeansDataset based on Uniform Distribution and Gaussian Distribution. There is also an optimized K-means implementation based on the Intel Data Analytics Library (DAL), available in the dal module of sparkbench. This benchmark requires Spark, HiBench, and Hadoop: HiBench is the workload generator, Hadoop is used to generate and store the data, and Spark is the application under test.
Installation
Java 8 and Java 11
Install Java 8, Java 11, and related tools from your Linux distribution’s package repository. For example, on Ubuntu:
sudo apt install openjdk-11-jre-headless openjdk-11-jdk-headless maven python2 net-tools openjdk-8-jdk
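To confirm both JDKs are available and to locate the home directories used for JAVA_HOME later in this guide (the /usr/lib/jvm paths are typical for arm64 Ubuntu and may differ on your system):
# List installed JVMs; expect directories like java-8-openjdk-arm64 and java-11-openjdk-arm64
ls /usr/lib/jvm
# Check which Java version is currently on the PATH
java -version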
Hadoop
cd $HOME
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6-aarch64.tar.gz
tar zxvf hadoop-3.3.6-aarch64.tar.gz
export PATH_TO_HADOOP=$HOME/hadoop-3.3.6
cd $PATH_TO_HADOOP/etc/hadoop
Create configuration files:
yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>127.0.0.1:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>127.0.0.1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>127.0.0.1:8031</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
mapred-site.xml
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. <value>HADOOP_MAPRED_HOME=/home/nvidia/hadoop-3.3.6</value>.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$PATH_TO_HADOOP</value>
  </property>
</configuration>
hdfs-site.xml
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. <value>/home/nvidia/hadoop-3.3.6/namenode</value>.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>$PATH_TO_HADOOP/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>datanode</value>
  </property>
</configuration>
hadoop-env.sh
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. export HADOOP_HOME="/home/nvidia/hadoop-3.3.6".
export HADOOP_HOME="$PATH_TO_HADOOP"
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-arm64"
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
After creating all Hadoop configuration files, initialize the namenode
directory:
$PATH_TO_HADOOP/bin/hdfs namenode -format
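As a quick sanity check (assuming the dfs.namenode.name.dir value configured above), the formatted metadata directory should now contain a current subdirectory:
# The namenode format step creates the 'current' directory with the initial fsimage
ls $PATH_TO_HADOOP/namenode/current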
Spark
Replace $PATH_TO_HADOOP with the path to the hadoop-3.3.6 directory, e.g. export HADOOP_PREFIX="/home/nvidia/hadoop-3.3.6".
cd $HOME
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar zxvf spark-3.5.0-bin-hadoop3.tgz
cd spark-3.5.0-bin-hadoop3/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
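As an optional smoke test (not part of the HiBench workflow), you can run the SparkPi example bundled with the Spark distribution in local mode, assuming JAVA_HOME points at Java 8 or 11:
# Should finish quickly and print a line like "Pi is roughly 3.14..."
cd $HOME/spark-3.5.0-bin-hadoop3
./bin/run-example SparkPi 10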
HiBench
If the mvn command given below fails with an error like "object java.lang.Object in compiler mirror not found", check that you have installed Java 8 and updated your JAVA_HOME and PATH environment variables:
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-arm64"
export PATH="$JAVA_HOME/bin:$PATH"
cd $HOME
git clone https://github.com/Intel-bigdata/HiBench.git
cd HiBench
mvn -Phadoopbench -Psparkbench -Dspark=2.4 -Dscala=2.11 clean package
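If the build succeeds, Maven prints BUILD SUCCESS and produces a sparkbench assembly jar. A quick way to confirm (the exact jar name and path vary with the HiBench version):
# Locate the assembly jar produced by the sparkbench build
find $HOME/HiBench/sparkbench -name '*assembly*.jar'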
Configure HiBench:
cd $HOME/HiBench/conf
# Important: replace "$NUM_PARTITIONS" with the number of CPU cores you wish to use, e.g. 72 for Grace-Hopper.
sed -i 's#hibench.scale.profile.*$#hibench.scale.profile huge#g' hibench.conf
sed -i 's#hibench.default.map.parallelism.*$#hibench.default.map.parallelism $NUM_PARTITIONS#g' hibench.conf
sed -i 's#hibench.default.shuffle.parallelism.*$#hibench.default.shuffle.parallelism $NUM_PARTITIONS#g' hibench.conf
# IMPORTANT: replace "$PATH_TO_HADOOP" with the path to the hadoop-3.3.6 directory
cp hadoop.conf.template hadoop.conf
sed -i 's#/PATH/TO/YOUR/HADOOP/ROOT#$PATH_TO_HADOOP#g' hadoop.conf
# IMPORTANT: replace "$PATH_TO_SPARK" with the path to the spark-3.5.0-bin-hadoop3 directory
cp spark.conf.template spark.conf
sed -i 's#/PATH/TO/YOUR/SPARK/HOME#$PATH_TO_SPARK#g' spark.conf
sed -i 's#hibench.spark.master.*$#hibench.spark.master local[*]#' spark.conf
sed -i 's#spark.executor.memory.*$#spark.executor.memory 50g#' spark.conf
sed -i 's#spark.driver.memory.*$#spark.driver.memory 50g#' spark.conf
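To double-check that the substitutions took effect, you can grep the edited files; this assumes the standard property names from the HiBench configuration templates:
# The output should show your chosen scale profile, parallelism, paths, and memory settings
grep -E 'hibench.scale.profile|parallelism' hibench.conf
grep 'hibench.hadoop.home' hadoop.conf
grep -E 'hibench.spark.home|hibench.spark.master|spark.executor.memory|spark.driver.memory' spark.conf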
Run the Benchmark
- Configure your environment. Remember to set $PATH_TO_HADOOP to the correct path.
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-arm64"
export PATH="$JAVA_HOME/bin:$PATH"
# Set Hadoop-related environment variables
export PATH_TO_HADOOP=$HOME/hadoop-3.3.6
export HADOOP_HOME=$PATH_TO_HADOOP
# HADOOP_PREFIX is referenced by the native-library settings below
export HADOOP_PREFIX=$PATH_TO_HADOOP
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
# Add Hadoop bin/ and sbin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- Start Hadoop
export PDSH_RCMD_TYPE=exec
$PATH_TO_HADOOP/sbin/start-all.sh
If Hadoop starts successfully, jps output should be similar to:
369207 SecondaryNameNode
369985 NodeManager
371293 NameNode
373148 Jps
368895 DataNode
369529 ResourceManager
All of NameNode, SecondaryNameNode, NodeManager, DataNode, and ResourceManager must be running before proceeding with the benchmark. If you do not see a NameNode process, check that you initialized the namenode directory as described in the Hadoop installation steps above; an optional health-check sketch is given after this list.
- Preprocess the k-means benchmark files
$HOME/HiBench/bin/workloads/ml/kmeans/prepare/prepare.sh
- Run the benchmark once to initialize the system
$HOME/HiBench/bin/workloads/ml/kmeans/spark/run.sh
- Run the k-means benchmark several times and average the scores. This example pins the run to 72 cores; if you wish to use a different number of CPU cores, remember to update hibench.default.map.parallelism and hibench.default.shuffle.parallelism in hibench.conf.
numactl -C0-71 -m0 $HOME/HiBench/bin/workloads/ml/kmeans/spark/run.sh
The results can be found in $HOME/HiBench/report/hibench.report.
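Optional health check referenced in the Start Hadoop step above; with the environment configured as in the first step, these commands should report one live datanode and one running YARN node (this is a convenience sketch, not part of the benchmark itself):
# HDFS: the report should list "Live datanodes (1)"
$PATH_TO_HADOOP/bin/hdfs dfsadmin -report
# YARN: one node should be listed in the RUNNING state
$PATH_TO_HADOOP/bin/yarn node -list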
Reference Results
These figures are provided as guidelines and should not be interpreted as performance targets.
Type Date Time Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkKmeans 2023-02-16 20:29:54 19903891441 23.692 840110224 840110224
ScalaSparkKmeans 2023-02-16 23:45:33 19903891427 23.742 838340974 838340974
ScalaSparkKmeans 2023-02-16 23:53:05 19903891439 24.129 824894999 824894999
The result is the median throughput/node across the three runs; in the example above, that is 838340974 bytes/s.
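A minimal sketch for extracting the Throughput/node column (the last field of each ScalaSparkKmeans line in hibench.report, per the layout above) and printing its median:
# Collect throughput/node values, sort them numerically, and print the middle one
grep ScalaSparkKmeans $HOME/HiBench/report/hibench.report \
  | awk '{print $NF}' | sort -n \
  | awk '{v[NR]=$1} END {print v[int((NR+1)/2)]}'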