A problem I hit while learning Spark, written down for the record. This problem shows up all over the internet, the causes people report are all different, and not a single one matched my case exactly (I was so happy I could cry). Following everyone's leads, it still took me over two hours to figure out.
In most cases the root of this problem lies in the hostname configuration and its mapping.
Environment (simulated with three virtual machines):
- OS: CentOS Linux release 7.5.1804 (Core)
- hosts:
  - hadoop102  192.168.20.102  Master, Worker
  - hadoop103  192.168.20.103  Worker
  - hadoop104  192.168.20.104  Worker
- JDK 1.8
- spark-3.0.0-bin-hadoop3.2, standalone cluster
After running sbin/start-all.sh, checking with jps looks perfectly fine: every node has a Worker process (although, as it turned out, the Worker processes on the slave nodes shut themselves down again after a while).
[root@hadoop102 spark-standalone]
=============== hadoop102 ===============
2273 Worker
2363 Jps
2206 Master
=============== hadoop103 ===============
1768 Worker
1864 Jps
=============== hadoop104 ===============
1765 Worker
1870 Jps
A while later, running jps again shows that the Workers on hadoop103 and hadoop104 have already exited:
[root@hadoop102 spark-standalone]
=============== hadoop102 ===============
2930 Master
2995 Worker
3286 Jps
=============== hadoop103 ===============
3211 Jps
=============== hadoop104 ===============
3194 Jps
But the web UI on port 8080 shows only one Worker (see the screenshot); judging by its Address, it is the Worker on the master host itself.
So go to the slave1 and slave2 hosts and look at the Worker startup logs under $SPARK_HOME/logs. The log is long, but the only line that matters is this one:
Failed to connect to master hadoop102:7077
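(To save scrolling, a grep like the following pulls that line out; the exact log file name differs per host and user, so treat the glob as an example rather than my literal command.)
grep -i "Failed to connect to master" $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out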
First question: can the Master's port 7077 be reached at all? Install nmap (yum install nmap) and probe it from a Slave host:
[root@hadoop103 logs]
Starting Nmap 6.40 ( http://nmap.org ) at 2021-08-08 09:58 CST
Nmap scan report for hadoop102 (192.168.20.102)
Host is up (0.0011s latency).
PORT STATE SERVICE
7077/tcp closed unknown
MAC Address: 00:0C:29:93:10:89 (VMware)
Nmap done: 1 IP address (1 host up) scanned in 0.15 seconds
[root@hadoop103 ~]
Starting Nmap 6.40 ( http://nmap.org ) at 2021-08-08 10:20 CST
Nmap scan report for hadoop102 (192.168.20.102)
Host is up (0.0028s latency).
PORT STATE SERVICE
8080/tcp open http-proxy
MAC Address: 00:0C:29:93:10:89 (VMware)
Nmap done: 1 IP address (1 host up) scanned in 0.23 seconds
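(The commands themselves got lost from the prompts above; they were single-port scans roughly like the following.)
nmap -p 7077 hadoop102    # first scan: reported closed
nmap -p 8080 hadoop102    # second scan: reported open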
So 7077 is unreachable while 8080 is reachable. The firewall on hadoop102 is definitely off, so is port 7077 simply not being exposed? On the Master host, run netstat -ntlp:
[root@hadoop102 spark-standalone]
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 662/rpcbind
tcp 0 0 192.168.122.1:53 0.0.0.0:* LISTEN 1271/dnsmasq
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 905/sshd
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 903/cupsd
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 1098/master
tcp 0 0 127.0.0.1:6010 0.0.0.0:* LISTEN 1637/sshd: root@pts
tcp 0 0 127.0.0.1:6011 0.0.0.0:* LISTEN 1685/sshd: root@pts
tcp6 0 0 127.0.0.1:7077 :::* LISTEN 2930/java
tcp6 0 0 192.168.20.102:32812 :::* LISTEN 2995/java
tcp6 0 0 :::111 :::* LISTEN 662/rpcbind
tcp6 0 0 :::8080 :::* LISTEN 2930/java
tcp6 0 0 :::8081 :::* LISTEN 2995/java
tcp6 0 0 :::22 :::* LISTEN 905/sshd
tcp6 0 0 ::1:631 :::* LISTEN 903/cupsd
tcp6 0 0 ::1:6010 :::* LISTEN 1637/sshd: root@pts
tcp6 0 0 ::1:6011 :::* LISTEN 1685/sshd: root@pts
Notice that the 7077 and 8080 listeners are bound to different addresses. So the question becomes: where does this 127.0.0.1:7077 come from?
A socket bound to 127.0.0.1 only accepts connections from the local machine, which is exactly why the slaves cannot reach it (see https://blog.csdn.net/xifeijian/article/details/12879395); I did not have time to dig any deeper than that.
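(As a side note, you can see the same behaviour without nmap, using bash's built-in /dev/tcp; these two lines were not part of my original session, just a way to illustrate it.)
# on hadoop102 itself: the loopback-bound listener answers
timeout 1 bash -c 'exec 3<>/dev/tcp/127.0.0.1/7077' && echo open || echo closed   # open
# from hadoop103: nothing listens on 192.168.20.102:7077, so the connection is refused
timeout 1 bash -c 'exec 3<>/dev/tcp/hadoop102/7077' && echo open || echo closed   # closed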
Take a look at the sbin/start-all.sh script:
[root@hadoop102 spark-standalone]
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/sbin/spark-config.sh"
"${SPARK_HOME}/sbin"/start-master.sh
"${SPARK_HOME}/sbin"/start-slaves.sh
Then look at the start-slaves.sh script:
[root@hadoop102 spark-standalone]
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"
if [ "$SPARK_MASTER_PORT" = "" ]; then
SPARK_MASTER_PORT=7077
fi
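# when conf/spark-env.sh does not set SPARK_MASTER_HOST, fall back to this machine's fully-qualified hostname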
if [ "$SPARK_MASTER_HOST" = "" ]; then
case `uname` in
(SunOS)
SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
;;
(*)
SPARK_MASTER_HOST="`hostname -f`"
;;
esac
fi
"${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
The key part is the last line:
"${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
SPARK_MASTER_HOST and SPARK_MASTER_PORT are configured in conf/spark-env.sh (if you have not set them yourself, they take the defaults assigned in the two if blocks above, in which case it is also worth checking that the local hostname configuration is sane, see https://www.lmlphp.com/user/552/article/item/371036):
[root@hadoop102 spark-standalone]
export JAVA_HOME=/opt/module/jdk1.8.0_212
SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077
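Since SPARK_MASTER_HOST is set to the name hadoop102, what actually matters is which IP that name resolves to on this machine. A quick check (standard commands, not a capture from my original session):
hostname -f                 # the name start-slaves.sh would fall back to
getent hosts hadoop102      # which address /etc/hosts maps hadoop102 to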
hadoop102 is mapped in /etc/hosts:
[root@hadoop102 spark-standalone]
And that cracked the case... Earlier, while testing spark-local mode, I had pointed hadoop102 at 127.0.0.1 to work around a bug and then forgot to change it back.
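(Reconstructed from memory, since I did not keep the file contents; the broken and corrected entries looked roughly like this.)
# leftover from the spark-local experiment: hadoop102 resolves to loopback
127.0.0.1        hadoop102
# what it should be: the LAN address the other nodes can actually reach
192.168.20.102   hadoop102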
Two ways to fix it: either point hadoop102 to the LAN IP 192.168.20.102 in /etc/hosts, or set SPARK_MASTER_HOST to the LAN IP 192.168.20.102 in spark-env.sh. After the change, everything works.
The real root cause of this whole episode: for personal reasons I had to start learning the Spark framework while still shaky on Linux and shell basics; trying to rush only backfired.
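Either way, after restarting the cluster the listener should move off the loopback address; a quick way to confirm (again, standard commands rather than a capture from my session):
sbin/stop-all.sh && sbin/start-all.sh     # restart the standalone cluster from $SPARK_HOME
netstat -ntlp | grep 7077                 # should now show 192.168.20.102:7077 instead of 127.0.0.1:7077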