Related posts
Spark Environment Setup - Local
Spark Environment Setup - Standalone
Spark Environment Setup - Standalone HA
Environment Setup - Spark on YARN
Mode overview
Spark Standalone Mode - Spark 2.4.5 Documentation (apache.org)
A Spark Standalone cluster uses a Master-Slaves architecture and, like most Master-Slaves clusters, suffers from a Master single point of failure. This mode uses ZooKeeper to provide HA for the Master.
Setup preparation
Environment
3 cloud servers
| | node1/172.17.0.8 | node2/172.17.30.12 | node3/172.17.30.26 |
| --- | --- | --- | --- |
| ResourceManager | √ | | |
| NodeManager | √ | √ | √ |
| JobHistoryServer | √ | | |
| HistoryServer | √ | | |
Download the installation package
Of the currently available stable Spark releases, the 2.x series is the one most widely used in industry.
Spark download page: Downloads | Apache Spark
Spark 2.4.5 download: Index of /dist/spark/spark-2.4.5 (apache.org)
The version installed in this post: spark-2.4.5-bin-hadoop2.7.tgz
Environment configuration
Install the JDK on all 3 servers, and configure hostnames, host-name mappings, the ZooKeeper cluster, and YARN in advance.
Installation and configuration
① Upload and extract
cd /opt/server
tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz
ln -s /opt/server/spark-2.4.5-bin-hadoop2.7 /opt/server/spark
② Edit the configuration file: slaves
cd /opt/server/spark/conf
mv slaves.template slaves
vim slaves
node1
node2
node3
③ Edit the configuration file: spark-env.sh
cd /opt/server/spark/conf
mv spark-env.sh.template spark-env.sh
vim spark-env.sh
JAVA_HOME=/usr/java/jdk1.8.0_172
export SPARK_MASTER_PORT=7077
HADOOP_CONF_DIR=/opt/server/hadoop/etc/hadoop
YARN_CONF_DIR=/opt/server/hadoop/etc/hadoop
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node1:2181,node2:2181,node3:2181 -Dspark.deploy.zookeeper.dir=/spark-ha"
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://node1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1g
Note: the history server log directory /sparklog configured above must be created on HDFS manually.
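The directory can be created with the HDFS CLI, for example as below (this assumes HDFS is already running with the NameNode at node1:8020, as configured above; the chmod is a convenience for a test cluster, not a production recommendation):

```shell
# Create the Spark event-log directory referenced by spark.history.fs.logDirectory
hdfs dfs -mkdir -p /sparklog
# Loosen permissions so whichever user submits jobs can write event logs
hdfs dfs -chmod 777 /sparklog
```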
④ Edit yarn-site.xml (integrate the YARN history server and disable resource checks)
cd /opt/server/hadoop/etc/hadoop
vim /opt/server/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <!-- Host running the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
  </property>
  <!-- Enable the MapReduce shuffle service on NodeManagers -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Total memory (MB) a NodeManager may hand out to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>20480</value>
  </property>
  <!-- Minimum container allocation (MB) -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>
  <!-- Allowed virtual-to-physical memory ratio per container -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
  </property>
  <!-- Aggregate container logs to HDFS -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Retain aggregated logs for 7 days -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <!-- URL of the MapReduce JobHistoryServer for viewing aggregated logs -->
  <property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
  </property>
  <!-- Disable physical-memory checks so containers are not killed on small machines -->
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <!-- Disable virtual-memory checks for the same reason -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
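Configuration changes only take effect on nodes that have the updated file and after YARN is restarted. A sketch of distributing the file and restarting, assuming the same Hadoop install path on every node and passwordless SSH between them:

```shell
# Push the updated yarn-site.xml to the other two nodes
scp /opt/server/hadoop/etc/hadoop/yarn-site.xml node2:/opt/server/hadoop/etc/hadoop/
scp /opt/server/hadoop/etc/hadoop/yarn-site.xml node3:/opt/server/hadoop/etc/hadoop/
# Restart YARN so the new settings are picked up
stop-yarn.sh
start-yarn.sh
```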
⑤ Edit spark-defaults.conf (configure the Spark history server)
cd /opt/server/spark/conf
mv spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1:8020/sparklog/
spark.eventLog.compress true
spark.yarn.historyServer.address node1:18080
⑥ Edit log4j.properties (set the log level)
cd /opt/server/spark/conf
mv log4j.properties.template log4j.properties
vim log4j.properties
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
⑦ Configure the dependent Spark jars
When a Spark application is submitted to YARN, by default the Spark jars are uploaded to the YARN cluster on every submission. To save submission time and storage, upload the Spark jars to an HDFS directory once, then set a property so the application knows where to find them.
hdfs dfs -mkdir -p /spark/jars/
hdfs dfs -put /opt/server/spark/jars/* /spark/jars/
vim /opt/server/spark/conf/spark-defaults.conf
spark.yarn.jars hdfs://node1:8020/spark/jars/*
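A quick sanity check that the upload succeeded is to compare the jar count on HDFS with the local directory (the two numbers should match):

```shell
# Number of jars uploaded to HDFS vs. the local Spark jars directory
hdfs dfs -ls /spark/jars/ | grep -c '\.jar$'
ls /opt/server/spark/jars/ | grep -c '\.jar$'
```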
Startup and testing
Start the cluster
start-all.sh
mr-jobhistory-daemon.sh start historyserver
/opt/server/spark/sbin/start-history-server.sh
Access test
Spark HistoryServer web UI: http://node1:18080/
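As an end-to-end smoke test, the bundled SparkPi example can be submitted to YARN; if it completes, the run should also appear in the HistoryServer UI above (the examples jar name below matches the spark-2.4.5-bin-hadoop2.7 distribution, which ships Scala 2.11 builds):

```shell
# Submit the SparkPi example to YARN in cluster mode; the final argument
# is the number of partitions used to estimate pi
/opt/server/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/server/spark/examples/jars/spark-examples_2.11-2.4.5.jar \
  10
```

The computed value of pi is written to the driver's container log, viewable through the yarn.log.server.url configured earlier.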