# Running HiBench with SparkRDMA
[HiBench](https://github.com/Intel-bigdata/HiBench) is a big data benchmark suite that can be used to evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO.
- Environment: 17 nodes with 2x Intel Xeon E5-2697 v3 @ 2.60GHz, 30 cores per Worker, 256GB RAM, non-flash storage (HDD), Mellanox ConnectX-4 network adapter with 100GbE RoCE fabric, connected with a Mellanox Spectrum switch
- Apache hadoop-2.7.4 HDFS (1 namenode, 16 datanodes)
- Spark-2.2 in standalone mode on 17 nodes
- Set up HiBench (a build sketch follows below)
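A minimal sketch of the HiBench build, assuming a stock Maven setup; the profile and version flags are assumptions and may differ for your HiBench release, so check HiBench's own build docs:

```bash
# Clone and build the Spark benchmarks (flags are assumptions for
# a Spark 2.2 / Scala 2.11 build; adjust to your HiBench version):
git clone https://github.com/Intel-bigdata/HiBench.git
cd HiBench
mvn -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package
```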
- Configure Hadoop and Spark settings in the HiBench `conf` directory
- In `HiBench/conf/hibench.conf` set:
```
hibench.scale.profile bigdata
# Mapper number in Hadoop, partition number in Spark
hibench.default.map.parallelism 1000
# Reducer number in Hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism 15000
```
- Set in `HiBench/conf/workloads/micro/terasort.conf`:
```
hibench.terasort.bigdata.datasize 1890000000
```
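The datasize is a record count, and a TeraSort record is 100 bytes, so 1,890,000,000 records × 100 bytes ≈ 189 GB, which matches the `Input_data_size` of 189000000000 bytes in the report below.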
- Run `HiBench/bin/workloads/micro/terasort/prepare/prepare.sh` and then `HiBench/bin/workloads/micro/terasort/spark/run.sh`, as shown below
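For reference, the two scripts can be invoked directly from a shell on the node hosting HiBench:

```bash
# Generate and upload the TeraSort input data to HDFS, then run the Spark workload:
bash HiBench/bin/workloads/micro/terasort/prepare/prepare.sh
bash HiBench/bin/workloads/micro/terasort/spark/run.sh
```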
- Open `HiBench/report/hibench.report`:
```
Type               Date       Time     Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkTerasort 2018-03-26 19:13:52 189000000000    79.931      2364539415          2364539415
```
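As a sanity check, the `Throughput(bytes/s)` column is `Input_data_size` divided by `Duration(s)`: 189000000000 / 79.931 ≈ 2364539415 bytes/s, about 2.4 GB/s.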
- Add to `HiBench/conf/spark.conf`:
```
spark.driver.extraClassPath /PATH/TO/spark-rdma-3.1-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath /PATH/TO/spark-rdma-3.1-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
spark.broadcast.compress false
spark.broadcast.checksum false
spark.locality.wait 0
```
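Compression and checksumming are disabled here because on a 100GbE RDMA fabric the CPU cost of compressing shuffle and broadcast data typically outweighs the bytes saved on the wire, and `spark.locality.wait 0` stops the scheduler from holding tasks for data-local slots, the idea being that remote reads are cheap over RDMA. If you launch jobs outside HiBench, the same settings can be passed on the command line; a minimal sketch, with the jar path and application jar as placeholders:

```bash
spark-submit \
  --conf spark.driver.extraClassPath=/PATH/TO/spark-rdma-3.1-for-spark-SPARK_VERSION-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=/PATH/TO/spark-rdma-3.1-for-spark-SPARK_VERSION-jar-with-dependencies.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
  --conf spark.shuffle.compress=false \
  --conf spark.shuffle.spill.compress=false \
  --conf spark.broadcast.compress=false \
  --conf spark.broadcast.checksum=false \
  --conf spark.locality.wait=0 \
  your-app.jar  # placeholder application jar
```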
- Run `HiBench/bin/workloads/micro/terasort/spark/run.sh` again
- Open `HiBench/report/hibench.report`:
```
Type               Date       Time     Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkTerasort 2018-03-26 19:13:52 189000000000    79.931      2364539415          2364539415
ScalaSparkTerasort 2018-03-26 19:17:13 189000000000    52.166      3623049495          3623049495
```
- Overall improvement: run time drops from 79.9 s to 52.2 s, i.e. about 1.53x the throughput (roughly 35% faster)
- Environment: 6 nodes with 2x Intel Xeon E5-2697 v3 @ 2.60GHz, 30 cores per Worker, 256GB RAM, non-flash storage (HDD), Mellanox ConnectX-4 network adapter with 100GbE RoCE fabric, connected with a Mellanox Spectrum switch
- Apache hadoop-2.7.4 HDFS (1 namenode, 5 datanodes)
- Spark-2.2 in YARN mode on 6 nodes
- Set up HiBench (as above)
- Configure Hadoop and Spark settings in the HiBench `conf` directory; in `HiBench/conf/spark.conf` set:

```
spark.executor.memory 160g
hibench.yarn.executor.num 5
hibench.yarn.executor.cores 20
```
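A sizing note, not from the original page: with 5 executors of 20 cores each, one executor lands on each of the 5 datanodes, and the 160g executor heap (plus off-heap overhead) must fit under YARN's per-node container limit (`yarn.nodemanager.resource.memory-mb`) for the containers to be scheduled.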
- In `HiBench/conf/hibench.conf` set:
```
hibench.scale.profile gigantic
hibench.default.map.parallelism 8000
hibench.default.shuffle.parallelism 8000
spark.default.parallelism 15000
```
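`spark.default.parallelism` sets the partition count for RDD operations that don't specify one; raising it to 15000 keeps the 100 executor cores (5 executors × 20 cores) busy with many smaller tasks during the PageRank iterations.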
- Run `HiBench/bin/workloads/websearch/pagerank/prepare/prepare.sh` and then `HiBench/bin/workloads/websearch/pagerank/spark/run.sh`, once without and once with the SparkRDMA settings in `HiBench/conf/spark.conf` shown above
- Open `HiBench/report/hibench.report`:
```
Type               Date       Time     Input_data_size Duration(s) Throughput(bytes/s) Throughput/node
ScalaSparkPagerank 2018-10-03 15:07:28 19933175464     2184.023    9126815             1825363
ScalaSparkPagerank 2018-10-03 15:07:28 19933175464     1086.578    18344899            3668979
```
- Overall improvement: run time drops from 2184 s to 1087 s, i.e. roughly 2x the throughput