1. Overall workflow
2. Detailed configuration
3. Configuration steps
1. Configure the Flume agent
Create the file kafka-flume-hdfs.conf in the /opt/module/flume/conf directory on hadoop104:
[lili@hadoop104 conf]$ vim kafka-flume-hdfs.conf
The file contents are as follows:
# Components: two Kafka sources, two file channels, two HDFS sinks
a1.sources = r1 r2
a1.channels = c1 c2
a1.sinks = k1 k2

# Source r1: reads the topic_start topic from Kafka
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics = topic_start

# Source r2: reads the topic_event topic from Kafka
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r2.kafka.topics = topic_event

# File channels: each channel needs its own checkpoint and data directories
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1/
a1.channels.c1.keep-alive = 6
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/module/flume/checkpoint/behavior2
a1.channels.c2.dataDirs = /opt/module/flume/data/behavior2/
a1.channels.c2.keep-alive = 6

# HDFS sinks: write to date-partitioned directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-

# Roll a file every 10 s or at 128 MB, never by event count
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0

# Write compressed streams using the LZOP codec
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
a1.sinks.k2.hdfs.codeC = lzop

# Wire sources and sinks to their channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sources.r2.channels = c2
a1.sinks.k2.channel = c2
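The roll settings above close a file every 10 seconds or once it reaches 134217728 bytes, whichever comes first, while rollCount = 0 disables rolling by event count. The byte value is deliberate: it is exactly 128 MB, the default HDFS block size in Hadoop 2.x, so a fully rolled file fills a single block. A quick sanity check of that arithmetic:

```shell
# 134217728 bytes = 128 * 1024 * 1024, i.e. one default HDFS block (128 MB)
roll_size=$((128 * 1024 * 1024))
echo "$roll_size"
```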
2. Flume start/stop script
- Create the script f2.sh: [lili@hadoop102 bin]$ vim f2.sh
#!/bin/bash

case $1 in
"start"){
for i in hadoop104
do
echo " --------Starting consumer Flume on $i-------"
ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/kafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/job/data-warehouse/flume2.log 2>&1 &"
done
};;
"stop"){
for i in hadoop104
do
echo " --------Stopping consumer Flume on $i-------"
ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep | awk '{print \$2}' | xargs kill"
done
};;
esac
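The stop branch finds the agent's PID by grepping the `ps` output for the config-file name and piping it through `awk` to `kill`. A minimal, self-contained sketch of that pipeline against a throwaway `sleep` process (the marker `kafka-flume-hdfs` is replaced by a distinctive sleep duration; the extra awk filter is an addition for this demo, not part of the original script):

```shell
# Start a throwaway background process to stand in for the Flume agent
sleep 12345 &
pid=$!
# Same idea as the stop branch: list processes, grep for the marker,
# drop the grep line itself, then print the PID column with awk.
# ($8 == "sleep" additionally filters out this script's own command line.)
found=$(ps -ef | grep "sleep 12345" | grep -v grep | awk '$8 == "sleep" {print $2}' | head -n 1)
[ "$found" = "$pid" ] && echo "pipeline found the PID"
kill "$pid"
```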
- Grant execute permission: [lili@hadoop102 bin]$ chmod 777 f2.sh
- Start the script: [lili@hadoop102 bin]$ f2.sh start
- Stop the script: [lili@hadoop102 bin]$ f2.sh stop
4. Flume memory tuning
1. Exception thrown
- Problem: starting the consumer Flume may throw the following exception:
  ERROR hdfs.HDFSEventSink: process failed
  java.lang.OutOfMemoryError: GC overhead limit exceeded
- Fix: add the following setting to /opt/module/flume/conf/flume-env.sh on hadoop102:
  [lili@hadoop102 ~]$ vim /opt/module/flume/conf/flume-env.sh
  export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
- Then sync the file to hadoop103 and hadoop104.
2. Memory settings and tuning
- Set the JVM heap to 4 GB or more, and deploy Flume on a dedicated server (e.g. 4 cores / 8 threads / 16 GB RAM).
- Set -Xmx and -Xms to the same value to avoid the performance hit of heap resizing; unequal values easily lead to frequent Full GCs.
- -Xms is the minimum (initial) JVM heap size, allocated at startup; -Xmx is the maximum heap size, allocated on demand. If they differ, the initial heap can be too small, repeatedly triggering Full GC as the heap grows.
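Following the guideline above, a production flume-env.sh would pin both flags to the same value. A minimal sketch (the 4 GB figure comes from the guideline in the text, not from the original config, which used -Xms100m -Xmx2000m):

```shell
# Equal -Xms/-Xmx keeps the heap at a fixed size and avoids the
# resize-triggered Full GCs described above
heap="4g"
export JAVA_OPTS="-Xms${heap} -Xmx${heap} -Dcom.sun.management.jmxremote"
echo "$JAVA_OPTS"
```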
5. Collection-pipeline start/stop script
- Create the script: [lili@hadoop102 ~]$ vim /home/lili/bin/cluster.sh
#!/bin/bash

case $1 in
"start"){
echo "================ Starting cluster ================"
echo "================ Starting HDFS ================"
/opt/module/hadoop-2.7.2/sbin/start-dfs.sh
echo "================ Starting YARN ================"
ssh hadoop103 "/opt/module/hadoop-2.7.2/sbin/start-yarn.sh"
echo "============== Starting Zookeeper =============="
zk.sh start
sleep 4s;
f1.sh start
sleep 2s;
kf.sh start
sleep 6s;
f2.sh start
};;
"stop"){
echo "================ Stopping cluster ================"
f2.sh stop
sleep 2s;
kf.sh stop
sleep 6s;
f1.sh stop
sleep 2s;
echo "============== Stopping Zookeeper =============="
zk.sh stop
echo "================ Stopping YARN ================"
ssh hadoop103 "/opt/module/hadoop-2.7.2/sbin/stop-yarn.sh"
echo "================ Stopping HDFS ================"
/opt/module/hadoop-2.7.2/sbin/stop-dfs.sh
};;
esac
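Note the ordering: services start upstream-first (HDFS and YARN, then ZooKeeper, the collection Flume, Kafka, and finally the consumer Flume), and the stop branch tears them down in exactly the reverse order, so no consumer is left reading from a dead broker. A sketch of that symmetry (service names only, no real ssh calls):

```shell
# Start order as in cluster.sh; stopping simply walks the list backwards
start_order=(hdfs yarn zookeeper f1 kafka f2)
for ((i = ${#start_order[@]} - 1; i >= 0; i--)); do
    echo "stopping ${start_order[$i]}"
done
```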
- Grant execute permission: [lili@hadoop102 bin]$ chmod 777 cluster.sh
- Start the script: [lili@hadoop102 bin]$ cluster.sh start
- Stop the script: [lili@hadoop102 bin]$ cluster.sh stop
6. Data transfer test
1. Start the cluster
[lili@hadoop102 bin]$ cluster.sh start
================ Starting cluster ================
================ Starting HDFS ================
Starting namenodes on [hadoop102]
hadoop102: starting namenode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-lili-namenode-hadoop102.out
hadoop103: starting datanode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-lili-datanode-hadoop103.out
hadoop104: starting datanode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-lili-datanode-hadoop104.out
hadoop102: starting datanode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-lili-datanode-hadoop102.out
Starting secondary namenodes [hadoop104]
hadoop104: starting secondarynamenode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-lili-secondarynamenode-hadoop104.out
================ Starting YARN ================
starting yarn daemons
starting resourcemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-lili-resourcemanager-hadoop103.out
hadoop102: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-lili-nodemanager-hadoop102.out
hadoop104: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-lili-nodemanager-hadoop104.out
hadoop103: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-lili-nodemanager-hadoop103.out
============== Starting Zookeeper ==============
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
--------Starting collection Flume on hadoop102-------
--------Starting collection Flume on hadoop103-------
--------Starting Kafka on hadoop102-------
--------Starting Kafka on hadoop103-------
--------Starting Kafka on hadoop104-------
--------Starting consumer Flume on hadoop104-------
[lili@hadoop102 bin]$
2. Generate log data
[lili@hadoop102 bin]$ lg.sh
------Generating logs on hadoop102-------
------Generating logs on hadoop103-------
[lili@hadoop102 bin]$
3. Open the HDFS Web UI to confirm the files have landed