I. Deploying the virtual machines
1. Install CentOS 7 in VMware (installation steps omitted).
2. Deploy 3 machines.
II. Configuring the virtual machines
VM networking
1. The VMs use host-only networking. Select it when choosing the network during installation, or reconfigure it afterwards.
2. On the host, configure the VMware Virtual Ethernet Adapter for VMnet1.
3. Configure a static IP on each VM:
vim /etc/sysconfig/network-scripts/ifcfg-ens33
# hadoop110
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=bcb7a50e-bfdf-4255-b7fe-1c26e29d036d
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.40.110
NETMASK=255.255.255.0
GATEWAY=192.168.40.1
# hadoop120
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=9c11cadc-1810-4181-bd1c-bc0dd18c424f
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.40.120
NETMASK=255.255.255.0
GATEWAY=192.168.40.1
# hadoop130
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
# NOTE: this UUID duplicates hadoop120's; each interface needs its own (generate one with uuidgen)
UUID=9c11cadc-1810-4181-bd1c-bc0dd18c424f
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.40.130
NETMASK=255.255.255.0
GATEWAY=192.168.40.1
service network restart
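The three ifcfg files above differ only in IPADDR (and the UUID, which must be unique per machine). A small sketch of the per-host values, to make the pattern explicit:

```shell
# Shared subnet and gateway; only the last octet changes per host.
GATEWAY=192.168.40.1
for last in 110 120 130; do
  echo "hadoop$last -> IPADDR=192.168.40.$last GATEWAY=$GATEWAY"
done
```

This just prints the mapping; the actual values still go into each machine's ifcfg-ens33.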
4. Configure /etc/hosts (the same three entries on all three machines)
vim /etc/hosts
192.168.40.110 hadoop110
192.168.40.120 hadoop120
192.168.40.130 hadoop130
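A quick check that all three hostnames are actually resolvable via /etc/hosts (on a machine that has not been configured yet this prints "missing"):

```shell
# Report whether each cluster hostname appears in /etc/hosts.
for h in hadoop110 hadoop120 hadoop130; do
  if grep -qw "$h" /etc/hosts; then echo "$h present"; else echo "$h missing"; fi
done
```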
5. Disable the firewall
systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Sun 2021-08-08 11:53:17 CST; 2 days ago
Docs: man:firewalld(1)
Process: 670 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 670 (code=exited, status=0/SUCCESS)
Stop it now, and disable it so it stays off after reboots:
systemctl stop firewalld
systemctl disable firewalld
6. Network sharing (optional). In host-only mode the VMs cannot reach the internet, which can be inconvenient; configure sharing if you need it (see: giving host-only VMware VMs internet access).
7. If Xshell connects to the VMs very slowly, that can also be fixed (see: fixing slow Xshell connections).
Creating the hadoop user
1. Create a regular user named hadoop and set its password:
useradd hadoop
passwd hadoop
2. Add the user to sudoers (so sudo needs no password):
vim /etc/sudoers
hadoop ALL=(ALL) NOPASSWD:ALL
- Final result (screenshot omitted)
3. Create two folders, module and software, under /opt and give hadoop ownership of both:
mkdir /opt/module /opt/software
chown hadoop:hadoop /opt/module /opt/software
- Result (screenshot omitted)
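To confirm the directories exist and belong to hadoop:hadoop, a small check (on an unconfigured machine it prints "missing" instead):

```shell
# Print owner:group for each directory, or "missing" if it does not exist.
for d in /opt/module /opt/software; do
  if [ -d "$d" ]; then stat -c "$d %U:%G" "$d"; else echo "$d missing"; fi
done
```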
Installing Java
1. Download JDK 8.
2. Upload the downloaded file to /opt/software on the VM:
cd /opt/software
rz -b
- choose the downloaded file in the dialog
3. If the rz command is not available, install it first (or upload the file by some other means):
yum install -y lrzsz
4. Install the JDK:
cd /opt/software/
tar -zxvf jdk-8u261-linux-x64.tar.gz -C /opt/module
5. Configure the Java environment variables:
sudo vim /etc/profile
export JAVA_HOME=/opt/module/jdk1.8.0_261
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
java -version
java version "1.8.0_261"
Java(TM) SE Runtime Environment (build 1.8.0_261-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.261-b12, mixed mode)
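A small sketch for scripting this check: the version banner goes to stderr, so a real run is `java -version 2>&1 | java_version`; here the banner line is fed in directly so the parsing is visible:

```shell
# Extract the quoted version string from a `java -version` banner line.
java_version() { sed -n 's/.*version "\([^"]*\)".*/\1/p'; }
echo 'java version "1.8.0_261"' | java_version   # prints 1.8.0_261
```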
Installing Hadoop
1. Download Hadoop. Go to the Apache Hadoop site and find the release you need; 2.7.2 is used here (download link).
2. Upload the downloaded file to /opt/software on the VM.
3. Extract it to /opt/module:
cd /opt/software/
tar -zxvf hadoop-2.7.2.tar.gz -C /opt/module
4. Configure the Hadoop environment variables:
sudo vim /etc/profile
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2017-05-22T10:49Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
Configuring Hadoop
1. Hadoop directory layout
drwxr-xr-x. 2 hadoop hadoop 194 May 22 2017 bin
drwxr-xr-x. 3 hadoop hadoop 20 May 22 2017 etc
drwxr-xr-x. 2 hadoop hadoop 106 May 22 2017 include
drwxr-xr-x. 3 hadoop hadoop 20 May 22 2017 lib
drwxr-xr-x. 2 hadoop hadoop 239 May 22 2017 libexec
-rw-r--r--. 1 hadoop hadoop 15429 May 22 2017 LICENSE.txt
-rw-r--r--. 1 hadoop hadoop 101 May 22 2017 NOTICE.txt
-rw-r--r--. 1 hadoop hadoop 1366 May 22 2017 README.txt
drwxr-xr-x. 2 hadoop hadoop 4096 May 22 2017 sbin
drwxr-xr-x. 4 hadoop hadoop 31 May 22 2017 share
- bin: scripts for operating the Hadoop services (HDFS, YARN).
- etc: Hadoop's configuration files.
- lib: Hadoop's native libraries (compression and decompression support).
- sbin: scripts for starting and stopping the Hadoop services.
- share: Hadoop's dependency jars, documentation, and official examples.
2. Local (standalone) mode
2.1 Official Grep example
1. Create an input folder under hadoop-2.7.2:
cd /opt/module/hadoop-2.7.2
mkdir input
2. Copy Hadoop's xml configuration files into input:
cd /opt/module/hadoop-2.7.2
cp etc/hadoop/*.xml input/
3. Run the MapReduce example from the share directory:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
4. View the output:
cat output/*
2.2 Official WordCount example
1. Create a wcinput folder under hadoop-2.7.2:
cd /opt/module/hadoop-2.7.2
mkdir wcinput
2. Create a wc.input file under wcinput:
cd wcinput
vim wc.input
3. Put some words into wc.input.
4. Run the program:
cd /opt/module/hadoop-2.7.2
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount wcinput wcoutput
5. View the result:
cat wcoutput/part-r-00000
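WordCount's output is one `word<TAB>count` line per word. A coreutils re-implementation is handy for cross-checking the MapReduce result (the sample file here is an assumption; use your own wc.input in practice):

```shell
# Coreutils "WordCount": split on spaces, sort, count, print word<TAB>count.
printf 'hello hadoop\nhello world\n' > /tmp/wc.input
tr -s ' ' '\n' < /tmp/wc.input | sort | uniq -c | awk '{print $2"\t"$1}'
```

On the sample file this prints hadoop 1, hello 2, world 1, matching the format of part-r-00000.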
3. Pseudo-distributed mode
3.1 Start HDFS and run a MapReduce program
1. Goals
- Configure the cluster
- Start it and test creating, deleting, and reading files
- Run the WordCount example
2. Steps
(1) Configure hadoop-env.sh:
export JAVA_HOME=/opt/module/jdk1.8.0_261
(2) Configure core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop110:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>
</configuration>
(3) Configure hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Start the cluster:
(1) Format the NameNode (only on the very first start; do not reformat routinely).
Formatting generates a new cluster id. If you reformat while the DataNodes still hold the old id, the ids no longer match and the cluster cannot find its existing data. So before reformatting, always delete the data directory and the logs first, then format:
hdfs namenode -format
(2) Start the NameNode:
hadoop-daemon.sh start namenode
(3) Start the DataNode:
hadoop-daemon.sh start datanode
Check:
(1) Verify the daemons started:
jps
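A small helper for scripting that check against `jps` output (a sketch; in real use call it as `check_daemons "NameNode DataNode" "$(jps)"`):

```shell
# Check that every expected daemon name appears in the given jps output.
check_daemons() {
  local expected=$1 actual=$2 d
  for d in $expected; do
    echo "$actual" | grep -qw "$d" || { echo "missing: $d"; return 1; }
  done
  echo "all expected daemons running"
}
# Demo with canned jps output (the PIDs are made up):
check_daemons "NameNode DataNode" "2113 NameNode
2245 DataNode
2301 Jps"
```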
(2) View the HDFS web UI: http://192.168.40.110:50070/dfshealth.html#tab-overview
(3) Inspect the logs:
cd /opt/module/hadoop-2.7.2/logs
and open the log file for the daemon in question.
Operate the cluster:
(1) Create an input folder on HDFS:
hdfs dfs -mkdir -p /user/hadoop/input
(2) Upload the test file to the file system:
hdfs dfs -put wcinput/wc.input /user/hadoop/input/
(3) Check that the upload is correct:
hdfs dfs -ls /user/hadoop/input/
hdfs dfs -cat /user/hadoop/input/wc.input
(4) Run the MapReduce program:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/hadoop/input/ /user/hadoop/output
(5) View the output:
hdfs dfs -cat /user/hadoop/output/*
(6) Browse the folders in the web UI; you can see both the uploaded file and the generated output.
3.2 Start YARN and run a MapReduce program
1. Goals
- Configure the cluster to run MR on YARN
- Start it and test creating, deleting, and reading files
- Run the WordCount example on YARN
2. Steps
(1) Configure yarn-env.sh:
export JAVA_HOME=/opt/module/jdk1.8.0_261
(2) Configure yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop110</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>20480</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
(3) Configure mapred-env.sh:
export JAVA_HOME=/opt/module/jdk1.8.0_261
(4) Configure mapred-site.xml (first rename mapred-site.xml.template to mapred-site.xml):
mv mapred-site.xml.template mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Start the cluster:
(1) Make sure the NameNode and DataNode are already running before starting YARN.
(2) Start the ResourceManager:
yarn-daemon.sh start resourcemanager
(3) Start the NodeManager:
yarn-daemon.sh start nodemanager
Test:
(1) Open the YARN web UI: http://192.168.40.110:8088/cluster
(2) Delete the output folder on HDFS:
hdfs dfs -rm -R /user/hadoop/output
(3) Run the MapReduce program:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/hadoop/input/ /user/hadoop/output
(4) View the result:
hdfs dfs -cat /user/hadoop/output/*
3.3 Configure the history server
To review past job runs, configure the history server.
- Add to mapred-site.xml:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop110:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop110:19888</value>
</property>
- Start it and check:
mr-jobhistory-daemon.sh start historyserver
jps
- Then open: http://192.168.40.110:19888/jobhistory
3.4 Configure log aggregation
Log aggregation uploads an application's logs to HDFS after the application finishes. This makes it easy to inspect run details, which helps development and debugging. Note: enabling log aggregation requires restarting the NodeManager, ResourceManager, and JobHistoryServer.
- Add to yarn-site.xml:
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
- Stop the NodeManager, ResourceManager, and HistoryServer:
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
mr-jobhistory-daemon.sh stop historyserver
- Start the NodeManager, ResourceManager, and HistoryServer:
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
mr-jobhistory-daemon.sh start historyserver
- Delete the old output and rerun the job:
hdfs dfs -rm -R /user/hadoop/output
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/hadoop/input/ /user/hadoop/output
- Check the run at: http://192.168.40.110:19888/jobhistory
4. Fully distributed mode (important)
4.1 Preparation
- Prepare the 3 machines (firewall off, static IPs, hostnames set)
- Install the JDK and configure its environment variables
- Install Hadoop and configure its environment variables
- Configure the cluster
- Start the daemons individually
4.2 Cluster configuration
| | hadoop110 | hadoop120 | hadoop130 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode, JobHistory | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
(1) Download the latest Hadoop
- 3.3.1 is chosen here.
- Install it (not repeated here).
- Update the Hadoop environment variables (not repeated here).
- Remove the old Hadoop (not repeated here).
Reference: https://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/ClusterSetup.html
(2) Core configuration files
- Configure hadoop-env.sh (on every machine):
export JAVA_HOME=/opt/module/jdk1.8.0_261
- Configure core-site.xml (on every machine):
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop110:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.3.1/data/tmp</value>
</property>
- Configure hdfs-site.xml (on every machine):
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop130:50090</value>
</property>
- Configure yarn-site.xml (on every machine). (In 3.x, JAVA_HOME no longer needs to be set in yarn-env.sh.)
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop120</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
- Configure mapred-site.xml (on every machine). (In 3.x, JAVA_HOME no longer needs to be set in mapred-env.sh.)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop120:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop120:19888</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>
/opt/module/hadoop-3.3.1/etc/hadoop,
/opt/module/hadoop-3.3.1/share/hadoop/common/*,
/opt/module/hadoop-3.3.1/share/hadoop/common/lib/*,
/opt/module/hadoop-3.3.1/share/hadoop/hdfs/*,
/opt/module/hadoop-3.3.1/share/hadoop/hdfs/lib/*,
/opt/module/hadoop-3.3.1/share/hadoop/mapreduce/*,
/opt/module/hadoop-3.3.1/share/hadoop/mapreduce/lib/*,
/opt/module/hadoop-3.3.1/share/hadoop/yarn/*,
/opt/module/hadoop-3.3.1/share/hadoop/yarn/lib/*
</value>
</property>
- Configure the workers file (on every machine; in 2.x this file was called slaves):
hadoop110
hadoop120
hadoop130
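Since the same configuration must land on every machine, it helps to push the etc/hadoop directory from one node to the others. This sketch only prints the commands (the `echo` is a dry-run; remove it to actually copy, assuming the same paths and the hadoop user on every node):

```shell
# Dry-run: show the rsync commands that would distribute the config.
CONF=/opt/module/hadoop-3.3.1/etc/hadoop/
for h in hadoop120 hadoop130; do
  echo rsync -a "$CONF" "hadoop@$h:$CONF"
done
```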
Set up passwordless SSH. Run the following on each of the three machines (as the hadoop user), so every node can log in to every other node:
ssh-keygen -t rsa
cd /home/hadoop/.ssh/
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop110
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop120
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop130
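After distributing the keys, each login should succeed without a password prompt. This sketch prints the verification commands as a dry-run (remove the `echo` to actually run them; BatchMode makes ssh fail instead of prompting if a key is missing):

```shell
# Dry-run: commands to verify passwordless login to every node.
for h in hadoop110 hadoop120 hadoop130; do
  echo ssh -o BatchMode=yes "$h" hostname
done
```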
Time synchronization:
(1) Install ntp (on every machine):
yum install -y ntp
systemctl enable ntpd
(2) Configure ntp on hadoop110 (the time server):
vim /etc/ntp.conf
# allow clients on the host-only subnet to sync from this machine
restrict 192.168.40.0 mask 255.255.255.0 nomodify notrap
server ntp1.aliyun.com
server ntp2.aliyun.com
server ntp3.aliyun.com
server 127.0.0.1
fudge 127.0.0.1 stratum 10
(3) Test that synchronization works (on hadoop110):
ntpdate -u ntp2.aliyun.com
(4) Start the ntp service (on hadoop110):
systemctl start ntpd
systemctl status ntpd
(5) Configure the clients (hadoop120, hadoop130): in /etc/ntp.conf, point them at hadoop110, then test and start ntpd:
server hadoop110
server 127.0.0.1
fudge 127.0.0.1 stratum 10
ntpdate -u hadoop110
systemctl start ntpd
systemctl status ntpd
4.3 Starting the cluster
- If the cluster is being started for the first time, format the NameNode (on hadoop110):
hdfs namenode -format
hadoop-daemon.sh start namenode
(Note: in Hadoop 3.x these daemon scripts still work but are deprecated; the equivalents are hdfs --daemon start namenode, yarn --daemon start resourcemanager, and mapred --daemon start historyserver.)
jps
- Start the DataNodes on hadoop110, hadoop120, and hadoop130:
hadoop-daemon.sh start datanode
jps
- Start the ResourceManager on hadoop120:
yarn-daemon.sh start resourcemanager
jps
- Start the NodeManagers on hadoop110, hadoop120, and hadoop130:
yarn-daemon.sh start nodemanager
jps
- Start the SecondaryNameNode on hadoop130:
hadoop-daemon.sh start secondarynamenode
jps
- Start the HistoryServer on hadoop120:
mr-jobhistory-daemon.sh start historyserver
jps
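With everything started, each host's `jps` output should match the role table in 4.2. This sketch encodes that expected layout so it can be compared against each node (the daemon lists follow the table above and are an assumption about this particular layout):

```shell
# Expected daemons per host, derived from the cluster table in 4.2.
layout() {
  case $1 in
    hadoop110) echo "NameNode DataNode NodeManager" ;;
    hadoop120) echo "DataNode ResourceManager NodeManager JobHistoryServer" ;;
    hadoop130) echo "DataNode SecondaryNameNode NodeManager" ;;
  esac
}
for h in hadoop110 hadoop120 hadoop130; do
  echo "$h should run: $(layout "$h")"
done
```

On each machine, compare `jps` output against the corresponding list.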
(1) YARN web UI
http://192.168.40.120:8088/cluster
(2) SecondaryNameNode web UI
http://192.168.40.130:50090/status.html
(3) HistoryServer web UI
http://192.168.40.120:19888/jobhistory/app
(4) NameNode web UI (port 50070 in 2.x, 9870 in 3.x)
http://192.168.40.110:9870/dfshealth.html#tab-overview
(1) Create an input folder on HDFS:
hdfs dfs -mkdir -p /user/hadoop/input
(2) Upload test files to the file system:
hdfs dfs -put wcinput/wc.input /user/hadoop/input
hdfs dfs -put /opt/software/jdk-8u261-linux-x64.tar.gz /user/hadoop/input
(3) Check in the web UI that the upload succeeded.
(4) Check the block files on disk (visible on every node):
cd /opt/module/hadoop-3.3.1/data/tmp/dfs/data/current/BP-1927473396-192.168.40.110-1629372952977/current/finalized/subdir0/subdir0
Judging by the file sizes, blocks 1002 and 1003 are the uploaded JDK package; concatenated, they form the complete archive.
(5) Download a file from HDFS:
hadoop fs -get /user/hadoop/input/jdk-8u261-linux-x64.tar.gz ./
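To confirm the round trip preserved the file, compare checksums of the original and the downloaded copy. This is simulated here with two small local files so it runs anywhere; in real use compare the fetched jdk-8u261-linux-x64.tar.gz against the one in /opt/software:

```shell
# Simulated integrity check: an "original" and a "downloaded" copy.
printf 'payload' > /tmp/orig.bin
cp /tmp/orig.bin /tmp/fetched.bin
a=$(md5sum < /tmp/orig.bin); b=$(md5sum < /tmp/fetched.bin)
[ "$a" = "$b" ] && echo "checksums match"
```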