开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 大数据 -> Flume入门必看 -> 正文阅读

[大数据]Flume入门必看

Flume

一、概述

本文参考原文链接

1.Flume定义

Flume是Cloudera提供的一个海量日志采集、传输的系统。Flume基于流式架构，灵活简单。

在这里插入图片描述

2.Flume优点

① 可以和任意存储进程集成

② 输入的数据速率大于写入目的存储的速率，flume会进行缓冲，减小hdfs的压力。

③ flume中的事务基于channel，使用了两个事务模型（sender + receiver），确保消息被可靠发送。

? Flume使用两个独立的事务分别负责从source到channel，以及从channel到sink的事件传递。一旦事务中所有的数据全部成功提交到channel，那么source才认为该数据读取完成。同理，只有成功被sink写出去的数据，才会从channel中移除。

3.Flume组成架构

在这里插入图片描述

4.Flume组件

Agent

Agent是一个JVM进程，它以事件的形式将数据从源头送至目的地。

Agent主要有3个部分组成，Source、Channel、Sink。
Source

Source是负责接收数据到Flume Agent的组件。
Channel

Channel是位于Source和Sink之间的缓冲区。因此，Channel允许Source和Sink运作在不同的速率上。Channel是线程安全的，可以同时处理几个Source的写入操作和几个Sink的读取操作。

Flume自带两种Channel：Memory Channel和File Channel。

Memory Channel是内存中的队列。Memory Channel在不需要关心数据丢失的情景下适用。如果需要关心数据丢失，那么Memory Channel就不应该使用，因为程序死亡、机器宕机或者重启都会导致数据丢失。

File Channel将所有事件写到磁盘。因此在程序关闭或机器宕机的情况下不会丢失数据。
Sink

Sink不断地轮询Channel中的事件且批量地移除它们，并将这些事件批量写入到存储或索引系统、或者被发送到另一个Flume Agent。
Event (事件 --> 新的产生数据)

传输单元，Flume数据传输的基本单元，以事件的形式将数据从源头送至目的地。 Event由可选的header和载有数据的一个byte array 构成。Header是容纳了key-value字符串对的HashMap。

在这里插入图片描述

二、安装

1.安装地址

1） Flume官网地址

http://flume.apache.org

2）文档查看地址

http://flume.apache.org/FlumeUserGuide.html

3）下载地址

http://archive.apache.org/dist/flume/

2.安装步骤

准备工作：安装JDK、并且配置环境变量

1）将apache-flume-1.9.0-bin.tar.gz上传到linux的/opt/modules目录下

2）解压apache-flume-1.9.0-bin.tar.gz到/opt/installs目录下

3）将flume/conf下的flume-env.sh.template文件修改为flume-env.sh，并配置flume-env.sh文件

[root@flume0 conf]# pwd
/opt/installs/apache-flume-1.9.0-bin/conf
[root@flume0 conf]# mv flume-env.sh.template flume-env.sh
[root@flume0 conf]# vi flume-env.sh
export JAVA_HOME=/opt/installs/jdk1.8

[root@flume0 apache-flume-1.9.0-bin]# bin/flume-ng version
Flume 1.9.0

三、Flume 入门案例

3.1 监控端口数据

3.1.1 需求

使用 Flume 监听一个端口，收集该端口数据，并打印到控制台。

3.1.2 分析

在这里插入图片描述

3.1.3 实现流程

安装 netcat 工具。

yum install -y nc

创建 FLume Agent 的配置文件 flume-netcat-logger.conf 。
（1）在 flume 目录下创建 job 文件夹并进入 job 文件夹。

mkdir job
cd job/

? （2）在 job 文件夹下创建 FLume Agent 的配置文件 flume-netcat-logger.conf 。

vim flume-netcat-logger.conf

（3）在该配置文件中添加如下内容：

# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

注：a1 为 agent 的名称。

开启 Flume 监听窗口

写法一:

bin/flume-ng agent --conf conf --conf-file job/flume-netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console

写法二：

bin/flume-ng agent -c conf -f job/flume-netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console

使用 netcat 工具向本机 44444端口发送内容

nc localhost 44444

在 FLume 监听页面观察接收数据情况

3.2 监控单个追加文件

3.2.1 需求

   实时监控 Hive 日志，输出到控制台。
   实时监控 Hive 日志，输出到 HDFS 上。

3.2.2 分析

在这里插入图片描述

注：要想读 Linux 系统中的文件，就得按照 Linux 命令的规则执行命令。由于 Hive 日志在 Linux 系统中，所以读取文件的类型为：exec（execute）。表示执行 Linux 命令来读取文件。

3.2.3 实现流程

（一）输出到控制台

创建 flume-file-logger.conf 文件。vim flume-file-logger.conf
配置该文件内容。

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
#hadoop日志文件地址
a2.sources.r2.command = tail -F /hadoop/hive-2.3.6/logs/hive.log

# Describe the sink 输出到控制台
a2.sinks.k2.type = logger

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

运行 Flume。

 bin/flume-ng agent -c conf/ -f job/flume-file-logger.conf -n a2 -Dflume.root.logger=INFO,console

开启 Hadoop 的 Hive，并操作 Hive 产生日志。（比如：show databases;）
在控制台查看数据。

（二）输出到 HDFS 上

创建 flume-file-hdfs.conf 文件。vim flume-file-hdfs.conf
配置该文件。

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
#hadoop日志文件地址
a2.sources.r2.command = tail -F /hadoop/hive-2.3.6/logs/hive.log


# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://Hadoop10:9000/flume/%Y%m%d/%H
#上传文件的前缀
a2.sinks.k2.hdfs.filePrefix = logs- 
#是否按照时间滚动文件夹
a2.sinks.k2.hdfs.round = true
#多少时间单位创建一个新的文件夹
a2.sinks.k2.hdfs.roundValue = 1
#重新定义时间单位
a2.sinks.k2.hdfs.roundUnit = hour
#是否使用本地时间戳
a2.sinks.k2.hdfs.useLocalTimeStamp = true
#积攒多少个 Event 才 flush 到 HDFS 一次
a2.sinks.k2.hdfs.batchSize = 10
#设置文件类型，可支持压缩
a2.sinks.k2.hdfs.fileType = DataStream
#多久生成一个新的文件
a2.sinks.k2.hdfs.rollInterval = 30
#设置每个文件的滚动大小
a2.sinks.k2.hdfs.rollSize = 134217700
#文件的滚动与 Event 数量无关
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

运行 Flume。

bin/flume-ng agent -c conf/ -f job/flume-file-hdfs.conf -n a2

开启 Hadoop 的 Hive，并操作 Hive 产生日志。（比如：show databases;）
在 HDFS 上查看文件。

3.3 监控目录下多个新文件

3.3.1 需求

使用 Flume 监听整个目录的文件，并上传到 HDFS 上。

3.3.2 分析

在这里插入图片描述

3.3.3 实现流程

创建配置文件 flume-dir-hdfs.conf。vim flume-dir-hdfs.conf
配置该文件内容。

 # Name the components on this agent
 a3.sources = r3
 a3.sinks = k3
 a3.channels = c3
 
 # Describe/configure the source
 a3.sources.r3.type = spooldir
 a3.sources.r3.spoolDir = /opt/module/flume/upload
 a3.sources.r3.fileSuffix = .COMPLETED
 a3.sources.r3.fileHeader = true
 #忽略所有以.tmp 结尾的文件，不上传
 a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
 
 # Describe the sink
 a3.sinks.k3.type = hdfs
 a3.sinks.k3.hdfs.path =  hdfs://master:9000/flume/%Y%m%d/%H
 #上传文件的前缀
 a3.sinks.k3.hdfs.filePrefix = upload- 
 #是否按照时间滚动文件夹
 a3.sinks.k3.hdfs.round = true
 #多少时间单位创建一个新的文件夹
 a3.sinks.k3.hdfs.roundValue = 1
 #重新定义时间单位
 a3.sinks.k3.hdfs.roundUnit = hour
 #是否使用本地时间戳
 a3.sinks.k3.hdfs.useLocalTimeStamp = true
 #积攒多少个 Event 才 flush 到 HDFS 一次
 a3.sinks.k3.hdfs.batchSize = 10
 #设置文件类型，可支持压缩
 a3.sinks.k3.hdfs.fileType = DataStream
 #多久生成一个新的文件
 a3.sinks.k3.hdfs.rollInterval = 60
 #设置每个文件的滚动大小大概是 128M
 a3.sinks.k3.hdfs.rollSize = 134217700
 #文件的滚动与 Event 数量无关
 a3.sinks.k3.hdfs.rollCount = 0
 
 # Use a channel which buffers events in memory
 a3.channels.c3.type = memory
 a3.channels.c3.capacity = 1000
 
 a3.channels.c3.transactionCapacity = 100
 # Bind the source and sink to the channel
 a3.sources.r3.channels = c3
 a3.sinks.k3.channel = c3

启动监控文件夹命令。

bin/flume-ng agent -c conf -f job/flume-dir-hdfs.conf -n a3

向 upload 文件夹中添加文件。

(1）在 /opt/module/flume/ 下创建文件夹 upload
mkdir upload
(2）向 upload 文件夹中添加文件。
touch 1.txt
touch 2.txt
touch 3.txt

查看 HDFS 上的数据。
再次查看 upload 文件夹。

3.4 监控目录下的多个追加文件

Exec Source 适用于监控一个实时追加的文件，但不能保证数据不丢失；Spooldir Source 能够保证数据不丢失，且能实现断点续传，但延迟较高，不能实时监控；而 Taildir Source 既能实现断点续传，又可以保证数据不丢失，还能够进行实时监控。

3.4.1 需求

使用 Flume 监听整个目录的实时追加的文件，并上传至 HDFS。

3.4.2 分析

在这里插入图片描述

3.4.3 实现流程

创建配置文件 flume-taildir-hdfs.conf。vim flume-taildir-hdfs.conf
配置该文件。

# Name the components on this agent
a4.sources = r4
a4.sinks = k4
a4.channels = c4

# Describe/configure the source
a4.sources.r4.type = TAILDIR
a4.sources.r4.filegroups = f1 f2
a4.sources.r4.filegroups.f1 = /opt/module/flume/files/file1.txt
a4.sources.r4.filegroups.f2 = /opt/module/flume/files/file2.txt

# Describe the sink
a4.sinks.k4.type = hdfs
a4.sinks.k4.hdfs.path = hdfs://master:9000/flume/%Y%m%d/%H
#上传文件的前缀
a4.sinks.k4.hdfs.filePrefix = upload- 
#是否按照时间滚动文件夹
a4.sinks.k4.hdfs.round = true
#多少时间单位创建一个新的文件夹
a4.sinks.k4.hdfs.roundValue = 1
#重新定义时间单位
a4.sinks.k4.hdfs.roundUnit = hour
#是否使用本地时间戳
a4.sinks.k4.hdfs.useLocalTimeStamp = true
#积攒多少个 Event 才 flush 到 HDFS 一次
a4.sinks.k4.hdfs.batchSize = 10
#设置文件类型，可支持压缩
a4.sinks.k4.hdfs.fileType = DataStream
#多久生成一个新的文件
a4.sinks.k4.hdfs.rollInterval = 60
#设置每个文件的滚动大小大概是 128M
a4.sinks.k4.hdfs.rollSize = 134217700
#文件的滚动与 Event 数量无关
a4.sinks.k4.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a4.channels.c4.type = memory
a4.channels.c4.capacity = 1000
a4.channels.c4.transactionCapacity = 100

# Bind the source and sink to the channel
a4.sources.r4.channels = c4
a4.sinks.k4.channel = c4

启动监控文件夹命令。

   bin/flume-ng agent -c conf/ -f job/flume-taildir-hdfs.conf -n a4

向 files 文件夹中追加内容。

（1）在 /opt/module/flume 目录下创建 files 文件夹.
 mkdir files
（2）向 files 文件夹中追加内容。
echo hello >> file1.txt
echo hello >> file2.txt

查看 HDFS 上的数据。

四、Flume 企业开发案例

4.1 复制和多路复用–单数据源多出口案例(选择器)

4.1.1 需求

使用 Flume-1 监控文件变动，Flume-1 将文件变动内容传递给 Flume-2，Flume-2 负责存储到 HDFS。同时 Flume-1 将变动内容传递给 Flume-3，Flume-3 负责输出到 Local FileSystem。

4.1.2 分析

在这里插入图片描述

4.1.3 实现流程

准备工作

在 …/jobs 目录下创建 group1 文件夹，在 /opt/data/ 目录下创建 flume3 文件夹。

创建 flume1.conf

配置 1 个接收日志文件的 Source 和两个 Channel、两个 Sink，分别输送给 flume2，flume3。

me the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# 将数据流复制给所有 channel
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = TAILDIR
#a1.sources.r1.positionFile = /opt/module/flume/postion/position1.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/data/update/a.log

# Describe the sink
# sink 端的 avro 是一个数据发送者
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop10
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = Hadoop10
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

创建 flume2.conf。

配置上级 Flume 的 Source，输出是 HDFS 的 Sink。

me the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# source 端的 avro 是一个数据接收服务
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop10
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop10:9000/flume2/%Y%m%d/%H
#上传文件的前缀
a2.sinks.k1.hdfs.filePrefix = flume2- 
#是否按照时间滚动文件夹
a2.sinks.k1.hdfs.round = true
#多少时间单位创建一个新的文件夹
a2.sinks.k1.hdfs.roundValue = 1
#重新定义时间单位
a2.sinks.k1.hdfs.roundUnit = hour
#是否使用本地时间戳
a2.sinks.k1.hdfs.useLocalTimeStamp = true
#积攒多少个 Event 才 flush 到 HDFS 一次
a2.sinks.k1.hdfs.batchSize = 10
#设置文件类型，可支持压缩
a2.sinks.k1.hdfs.fileType = DataStream
#多久生成一个新的文件
a2.sinks.k1.hdfs.rollInterval = 600
#设置每个文件的滚动大小大概是 128M
a2.sinks.k1.hdfs.rollSize = 134217700
#文件的滚动与 Event 数量无关
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

创建 flume3.conf。

配置上级 Flume 输出的 Source ，输出是本地目录 Sink。

me the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop10
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/data/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

执行配置文件。

bin/flume-ng agent -c conf -f job/group1/flume1.conf -n a1
bin/flume-ng agent -c conf -f job/group1/flume2.conf -n a2
bin/flume-ng agent -c conf -f job/group1/flume3.conf -n a3

启动 Hadoop 的 Hive。
查看 HDFS 上数据。
查看 /opt/module/datas/flume3 目录中数据。

4.2 负载均衡和故障转移

4.2.1 需求

使用 Flume1 监控一个端口，其中 Sink 组中 Sink 分别对接 Flume2 和 Flume3，采用 FailoverSinkProcessor 时实现故障转移，使用 LoadBalancingSinkProcessor 时实现负载均衡。

4.2.2 分析

在这里插入图片描述

4.2.3 实现流程

（一）故障转移

准备工作。

在 /opt/module/flume/job 目录下创建 group2 文件夹

创建 flume1.conf。

配置 1 个 netcat Source 和 1 个 channel 、1 个 Sink Group（2 个 Sink）,分别输送给 flume2 和 flume3。

# Name the components on this agent
a1.sources = r1
a1.channels = c1
# 提供sinkgrounps
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
#avro 序列化方案
a1.sinks.k1.type = avro
# sink 主机名 （需配置映射）
a1.sinks.k1.hostname = slave1
# sink 发送数据端口号
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = slave1
a1.sinks.k2.port = 4142

# Sink Group
# failover --故障转移
a1.sinkgroups.g1.processor.type = failover
# 优先值。更高优先级的值接收器更早被激活。绝对值越大表示优先级越高
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# 失败 Sink 的最大退避周期（以毫秒为单位）默认30000
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

创建 flume2.conf。

配置上级 Flume 输出的 Source，输出是到本地控制台。

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = slave1
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

创建 flume3.conf。

配置上级 Flume 输出的 Source，输出是到本地控制台。

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = slave1
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

执行配置文件。

bin/flume-ng agent -c conf -f job/group2/flume1.conf -n a1

bin/flume-ng agent -c conf -f job/group2/flume2.conf -n a2 -Dflume.root.logger=INFO,console

bin/flume-ng agent -c conf -f job/group2/flume3.conf -n a3 -Dflume.root.logger=INFO,console

使用 netcat 工具向本机 44444 端口发送内容。

nc localhost 44444

查看 Flume2 及 Flume3 的控制台打印日志。
将 Flume2 kill 掉，观察 Flume3 的控制台打印情况。

（二）负载均衡

和上面故障转移实现流程一样，只需更改 flume1.conf 中 Sink Group 配置，其余一模一样。

# Sink Group	
a1.sinkgroups.g1.sinks = k1 k2
#处理器类型  load_balance 负载均衡
a1.sinkgroups.g1.processor.type = load_balance
# 失败的接收器是否应该以指数方式回退。
a1.sinkgroups.g1.processor.backoff = true
#选拔机制。必须是 从AbstractSinkSelector继承的自定义类的round_robin、random（随机）或 FQCN
a1.sinkgroups.g1.processor.selector = random

4.3 聚合–多数据源汇总案例

4.3.1 需求

slave1 上的 Flume-1 监控文件 /opt/module/datas/group.log，slave2 上的 Flume-2 监控某一端口数据流，Flume-1 与 Flume-2 将数据发送给 master 上的 Flume3，Flume3 将最终数据打印到控制台。

4.3.2 分析

在这里插入图片描述

4.3.3 实现流程

准备工作。

在 master、slave1 以及 slave2 的 /opt/module/flume/job 目录下创建一个 group4 文件夹。在 salve1 /opt/module/flume/datas 目录下创建 group.log 文件。

在 slave1 上创建 flume1.conf。

配置 Source 用于监控 group.log 文件，配置 Sink 输出数据到下一级 Flume。

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
# filegroup 配置多个文件
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/flume/datas/group.log

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

在 salve2 上创建 flume2.conf。

配置 Source 监控端口 44444 数据流，配置 Sink 输出数据到下一级 Flume。

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = localhost
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = master
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

在 master 上创建 flume3.conf

配置 Source 用于接收 flume1 与 flume2 发送过来的数据，最终合并后 Sink 输出数据到控制台。

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 4141

# Describe the sink
# 打印到控制台
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

在三台机器上分别执行配置文件。

bin/flume-ng agent -c conf/ -f job/group4/flume3.conf -n a3 -Dflume.root.logger=INFO,console

bin/flume-ng agent -c conf -f job/group4/flume2.conf -n a2

bin/flume-ng agent -c conf -f job/group4/flume1.conf -n a1

在 slave1 上向 /opt/module/flume/datas 目录下的 group.log 追加内容。

echo hello >> group.log

在 slave2 向 44444 端口发送数据。

nc localhost 44444

检查 master 上的数据，查看控制台的数据。

五、拦截器

flume通过使用Interceptors（拦截器）实现修改和过滤事件（包含body和header属性的对象）的功能。举个栗子，一个网站每天产生海量数据，但是可能会有很多数据是不完整的（缺少重要字段），或冗余的，如果不对这些数据进行特殊处理，那么会降低系统的效率。这时候拦截器就派上用场了。

1.flume内置的拦截器

先列个flume内置拦截器的表：

在这里插入图片描述

由于拦截器一般针对Event的Header进行处理，那我先介绍一Event吧

在这里插入图片描述

event是flume中处理消息的基本单元，由零个或者多个header和正文body组成。
Header 是 key/value 形式的，可以用来制造路由决策或携带其他结构化信息(如事件的时间戳或事件来源的服务器主机名)。你可以把它想象成和 HTTP 头一样提供相同的功能——通过该方法来传输正文之外的额外信息。
Body是一个字节数组，包含了实际的内容。
flume提供的不同source会给其生成的event添加不同的header

1.1 timestamp拦截器

Timestamp Interceptor拦截器就是可以往event的header中插入关键词为timestamp的时间戳。

[root@flume0 job]# mkdir interceptors
[root@flume0 job]# cd interceptors/	
[root@flume0 interceptors]# touch demo1-timestamp.conf

#文件内容如下
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = flume0
a1.sources.r1.port = 44444

#timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

测试

[root@flume0 interceptors]# nc flume0 44444

测试结果

2020-04-11 03:54:14,179 (SinkRunner-PollingRunner-DefaultSinkProcessor) 
[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
Event: { headers:{timestamp=1586548451701} body: 68 65 6C 6C 6F

1.2 host拦截器

该拦截器可以往event的header中插入关键词默认为host的主机名或者ip地址（注意是agent运行的机器的主机名或者ip地址）

[root@flume0 interceptors]# touch demo2-host.conf

#文件内容如下
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = flume0
a1.sources.r1.port = 44444

#host interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

测试

[root@flume0 interceptors]# nc flume0 44444

测试结果

2020-04-11 04:04:09,954 (SinkRunner-PollingRunner-DefaultSinkProcessor) 
[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
Event: { headers:{host=192.168.150.61} body: 61 61 61  
           aaa }

1.3 Regex Filtering Interceptor拦截器 (重要)

Regex Filtering Interceptor拦截器用于过滤事件，筛选出与配置的正则表达式相匹配的事件。可以用于包含事件和排除事件。常用于数据清洗，通过正则表达式把数据过滤出来。

[root@flume0 interceptors]# touch demo3-regex-filtering.conf

#文件内容如下
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = flume0
a1.sources.r1.port = 44444

#host interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
#全部是数字的数据
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
#排除符合正则表达式的数据
a1.sources.r1.interceptors.i1.excludeEvents  = true

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Regex Filtering 实战开发引用场景：排除错误日志

2007-02-13 15:22:26 [com.sms.test.TestLogTool]-[INFO] this is info
2007-02-13 15:22:26 [com.sms.test.TestLogTool]-[ERROR] my exception  
com.sms.test.TestLogTool.main(TestLogTool.java:8)

多个拦截器可以同时使用，例如:

# 拦截器：作用于Source，按照设定的顺序对event装饰或者过滤

a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = regex_filter
a1.sources.r1.interceptors.i3.regex = ^[20-9]*$

2.自定义拦截器

概述：在实际的开发中，一台服务器产生的日志类型可能有很多种，不同类型的日志可能需要发送到不同的分析系统。此时会用到 Flume 拓扑结构中的 Multiplexing 结构，Multiplexing的原理是，根据 event 中 Header 的某个 key 的值，将不同的 event 发送到不同的 Channel中，所以我们需要自定义一个 Interceptor，为不同类型的 event 的 Header 中的 key 赋予不同的值。

案例演示：我们以端口数据模拟日志，以数字（单个）和字母（单个）模拟不同类型的日志，我们需要自定义 interceptor 区分数字和字母，将其分别发往不同的分析系统（Channel）。

实现步骤

创建一个项目，并且引入以下依赖

<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.9.0</version>
</dependency>

自定义拦截器，实现拦截器接口

package com.baizhi.interceptors;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;

public class MyInterceptor implements Interceptor {
    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();   
        if (body[0] >= 'a' && body[0] <= 'z'){
            event.getHeaders().put("type","letter");
        }else if (body[0] >= '0' && body[0] <= '9'){
            event.getHeaders().put("type","number");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }

}

将项目打成jar包，上传到flume安装目录的lib目录下

在这里插入图片描述

编写agent,在job目录下的interceptors目录下创建，命名为my.conf

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.baizhi.interceptors.MyInterceptor$Builder

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
# map 的k-v  put("type","letter")  put("type","number")
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2
# 其他
a1.sources.r1.selector.default  = c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory  = /root/t1
a1.sinks.k1.sink.rollInterval = 600

a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /root/t2
a1.sinks.k1.sink.rollInterval = 600


a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

在/root目录下创建t1、t2文件夹
测试

bin/flume-ng agent --conf conf --name a1 --conf-file job/interceptors/my.conf -Dflume.roogger=INFO,console

六、事务

在这里插入图片描述

七、Agent 内部原理

在这里插入图片描述

重要组件：

ChannelSelector

? ChannelSelector 的作用就是选出 Event 将要被发往哪个 Channel。其共有两种类型，分别是 Replicatng 和 Multiplexing。
Replicatng 会将同一个 Event 发往所有的 Channel，Multiplexing 会根据相应的原则，将不同的 Event 发往不同的 Channel。

SinkProcessor

? SinkProcessor 共有三种类型，分别是 DefaultSinkProcessor、LoadBalancingSinkProcessor 和 FailoverSinkProcessor。
DefaultSinkProcessor 对应的是单个 Sink；LoadBalancingSinkProcessor 和 FailoverSinkProcessor 对应的是 SInk Group。LoadBalancingSinkProcessor 可以实现负载均衡的功能，FailoverSinkProcessor 可以实现故障转移的功能。

大数据最新文章

实现Kafka至少消费一次

亚马逊云科技：还在苦于ETL？Zero ETL的时代

初探MapReduce

【SpringBoot框架篇】32.基于注解+redis实现

Elasticsearch：如何减少 Elasticsearch 集

Go redis操作

Redis面试题

专题五 Redis高并发场景

基于GBase8s和Calcite的多数据源查询

Redis——底层数据结构原理

加:2021-07-16 11:22:15 更:2021-07-16 11:23:04

360图书馆购物三丰科技阅读网日历万年历 2025年12日历

-2025/12/1 5:46:55-

图片自动播放器
↓图片自动播放器↓

TxT小说阅读器
↓语音阅读,小说下载,古典文学↓

一键清除垃圾
↓轻轻一点,清除系统垃圾↓

图片批量下载器
↓批量下载图片,美女图库↓

网站联系: qq:121756557 email:121756557@qq.com IT数码