[Big Data] Flume Agent Internals, Topologies, and Case Studies


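Flume Agent Internals

Key components (search the official documentation for the terms in parentheses):
1) ChannelSelector (search for "flume channel selector")
A ChannelSelector decides which Channel an Event is sent to. There are two types: Replicating and Multiplexing. Replicating is the default.
The Replicating selector sends the same Event to every Channel.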

Examples:

a1.sources = r1
a1.channels = c1 c2 c3
# This line is optional because replicating is the default.
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

The Multiplexing selector sends different Events to different Channels according to the value of a configured header.

Examples:

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
# The Header is a HashMap of string key-value pairs; "state" is the key to inspect.
a1.sources.r1.selector.header = state
# CZ and US are the values of that key.
a1.sources.r1.selector.mapping.CZ = c1
# state=CZ goes to c1; state=US goes to both c2 and c3.
a1.sources.r1.selector.mapping.US = c2 c3
# Anything that matches none of the mappings goes to the default channel c4.
a1.sources.r1.selector.default = c4
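
To make the routing rule concrete, here is a minimal Java sketch of what the multiplexing example above does. It is a simplified model, not Flume's own selector class: the class name MultiplexingSketch and the method route are made up for illustration, and channels are represented only by their names.

```java
import java.util.List;
import java.util.Map;

// Simplified model of multiplexing channel selection (hypothetical class, not part of Flume).
class MultiplexingSketch {
    // Mirrors selector.mapping.CZ = c1 and selector.mapping.US = c2 c3.
    static final Map<String, List<String>> MAPPING = Map.of(
            "CZ", List.of("c1"),
            "US", List.of("c2", "c3"));

    // Mirrors selector.default = c4.
    static final List<String> DEFAULT_CHANNELS = List.of("c4");

    // Returns the names of the channels an event with these headers would be sent to.
    static List<String> route(Map<String, String> headers) {
        String state = headers.get("state"); // selector.header = state
        if (state == null) {
            return DEFAULT_CHANNELS;          // header missing: use the default channel
        }
        return MAPPING.getOrDefault(state, DEFAULT_CHANNELS);
    }

    public static void main(String[] args) {
        System.out.println(route(Map.of("state", "CZ"))); // [c1]
        System.out.println(route(Map.of("state", "US"))); // [c2, c3]
        System.out.println(route(Map.of("state", "FR"))); // [c4]
    }
}
```

The point of the sketch is that the selector looks only at one header value, so whatever produces the event (or an interceptor) has to set that header first.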

2) SinkProcessor (search for "Flume Sink Processors")
There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
DefaultSinkProcessor: handles a single Sink.

LoadBalancingSinkProcessor works on a Sink Group and balances load across its sinks, so that no single sink is overloaded. (This is the one used most often.)

Examples:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
# The type must be set to load_balance.
a1.sinkgroups.g1.processor.type = load_balance
# Optional, default false. Enables backoff for failed sinks; used together with the optional processor.selector.maxTimeOut (in milliseconds).
a1.sinkgroups.g1.processor.backoff = true
# Optional, default round_robin. Other choices are random or the FQCN of a user-defined selector.
a1.sinkgroups.g1.processor.selector = random

FailoverSinkProcessor also works on a Sink Group and provides failover: one sink is active and the others are standbys.

Examples:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Optional: the maximum backoff period for a failed sink, in milliseconds.
a1.sinkgroups.g1.processor.maxpenalty = 10000

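The processor.selector.maxTimeOut parameter: once a sink is known to have failed, the processor backs off for a while and stops trying that sink. The backoff time grows exponentially with the number of failed attempts, and this parameter caps it. (The cap keeps a sink that has come back up from waiting too long before data is sent to it again.) In practice backoff is usually enabled.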
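
The capped exponential growth described above can be sketched in a few lines of Java. This only illustrates the idea and is not Flume's actual formula: the doubling rule, the 1000 ms base delay, the 30000 ms cap, and the names BackoffSketch and backoffMillis are all assumptions made for the example.

```java
// Illustrative capped exponential backoff (hypothetical helper, not Flume code).
class BackoffSketch {
    // Delay after `failures` consecutive failures: baseMs * 2^(failures - 1), never more than maxMs.
    static long backoffMillis(int failures, long baseMs, long maxMs) {
        if (failures <= 0) {
            return 0L;                              // no failure yet, no waiting
        }
        int shift = Math.min(failures - 1, 30);     // bound the exponent so the shift cannot overflow here
        long delay = baseMs << shift;               // double the delay for every additional failure
        return Math.min(delay, maxMs);              // the cap plays the role of maxTimeOut
    }

    public static void main(String[] args) {
        for (int failures = 1; failures <= 8; failures++) {
            System.out.println(failures + " failure(s) -> wait "
                    + backoffMillis(failures, 1000L, 30000L) + " ms");
        }
    }
}
```

Without the cap, a sink that failed many times in a row would be skipped for a very long time even after it recovered, which is exactly what the parameter is meant to prevent.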

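Why use multiple Interceptors?

A single Interceptor is more efficient but less flexible: when you need only one particular interception step, or only part of what a large interceptor does, it becomes hard to work with. Splitting the functionality into separate interceptors lets you apply them selectively, which is more convenient and flexible.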

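Flume Topologies

Simple chaining (rarely used; the basic building block)

This pattern connects several Flume instances (agents) in sequence, from the initial source to the storage system that the final sink writes to. Chaining too many agents is not recommended: it slows the transfer, and if any agent in the chain goes down, the whole pipeline is affected.

Replication and multiplexing

To send one copy of the data to several storage systems, use multiple agents together with the Replicating ChannelSelector.

Flume supports routing events to one or more destinations. In this pattern the same data can be replicated to several channels, or different data can be distributed to different channels, and each sink can deliver to a different destination.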

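Load balancing and failover

Flume supports grouping several sinks logically into a sink group. Combined with different SinkProcessors, a sink group provides load balancing and failover.

For example, if each sink feeds a downstream agent, every downstream agent brings its own channel as a buffer, which largely solves the problem of a channel filling up because a sink writes slowly.

It also guards against the transfer stopping when a single sink goes down. (In practice this pattern is used more often for load balancing.)

Aggregation

This is the most common pattern and a very practical one. Web applications are usually spread across hundreds of servers, sometimes thousands or tens of thousands, and the logs they produce are troublesome to process. This Flume arrangement solves the problem well: each server runs a Flume agent that collects logs and forwards them to one or more collector agents, which then upload to hdfs, hive, hbase, and so on for log analysis.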

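Flume Enterprise Development Cases

Replication

1. Requirements

Flume-1 monitors a file for changes and passes the changes to Flume-2, which stores them in HDFS. Flume-1 also passes the changes to Flume-3, which writes them to the local file system.

2. Analysis

Agents pass data to each other through an Avro Sink and an Avro Source.

Note that a single Memory Channel feeding two sinks will not work here: an event in a channel is taken by only one sink, while both downstream agents need the complete data.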

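3. Implementation steps:

Step 1: create the three configuration files.

In the job directory, create a group1 directory (to keep this case apart from the later ones), then create the files in it:

# One source that reads the log file, two channels, and two sinks that feed flume2 and flume3
touch flume1.conf

# A source that receives the upstream Flume's output, and a sink that writes to HDFS
touch flume2.conf

# A source that receives the upstream Flume's output, and a sink that writes to a local directory
touch flume3.conf

vim flume1.conf

Configure one source that reads the log file, two channels, and two Avro sinks that send to flume-flume-hdfs and flume-flume-dir respectively.

Edit the configuration file and add the following (search the official documentation for "Avro Source"):

flume1.conf: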
    
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Replicate the data flow to all channels
# replicating is the default, so this parameter can be omitted
a1.sources.r1.selector.type = replicating
    
# Source    
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# This hive.log is a file created for the test; the real Hive log would be too large to load
a1.sources.r1.filegroups.f1 = /opt/module/data/hive.log
a1.sources.r1.positionFile = /opt/module/flume/position/position1.json
    
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
    
# Describe the sink
# On the sink side, avro acts as the data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
    
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142    
    
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
# A sink connects to exactly one channel, so the key is "channel" without an s
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Configure flume2.conf

Configure a source that receives the upstream Flume's output and a sink that writes to HDFS.

Edit the configuration file and add the following:

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
#Source
# On the source side, avro acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
    
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Describe the sink
a2.sinks.k1.type = hdfs
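# The HDFS target directory need not exist; it is created automatically
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/group1/%Y%m%d/%H
# Prefix of the uploaded files
a2.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories by time
a2.sinks.k1.hdfs.round = true
# How many time units before a new directory is created
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 1000
# File type; compression is also supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) a new file is started
a2.sinks.k1.hdfs.rollInterval = 30
# Roll each file at roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Do not roll files based on the number of Events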
a2.sinks.k1.hdfs.rollCount = 0
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1    
a2.sinks.k1.channel = c1

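Configure flume3.conf

Configure a source that receives the upstream Flume's output and a sink that writes to a local directory (search the official documentation for "File Roll Sink").

Edit the configuration file and add the following: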

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = file_roll
# The local directory must already exist; create it beforehand
a3.sinks.k1.sink.directory = /opt/module/datas/group1 
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

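Create hive.log in the /opt/module/data directory: touch hive.log

Open several terminal windows (a running Flume agent blocks its terminal).

Note: start the downstream agents first, then the upstream one. (The Avro Source acts as the server and the Avro Sink as the client, so start the server first.)

As a test, you can start a1 first: it reports errors because connections to ports 4141 and 4142 are refused.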

bin/flume-ng agent -c conf/ -f job/group1/flume3.conf -n a3

bin/flume-ng agent -c conf/ -f job/group1/flume2.conf -n a2

bin/flume-ng agent -c conf/ -f job/group1/flume1.conf -n a1

Start Hadoop (the HDFS sink in flume2 needs HDFS to be running).

Test:

Append two lines to hive.log:

echo hello >> hive.log
echo chenxu >> hive.log

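Check the local directory datas/group1 and, on HDFS, the logs- files under the group1 directory.

Verify that the files contain the appended content.

Note: the local directory keeps rolling new files, whereas on HDFS no new files are rolled while there are no new Events.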

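Load balancing and failover

1. Requirements

Flume1 monitors a port. The sinks in its sink group connect to Flume2 and Flume3, and a FailoverSinkProcessor provides failover.

FailoverSinkProcessor: search the official site for "Failover Sink Processor".

Define the group first, then set the priorities.

Examples:

# Define the sink group
a1.sinkgroups = g1
# The sinks that belong to the group (the channels and sinks must already be defined for the agent)
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# Priorities: the sink with the larger value is preferred
a1.sinkgroups.g1.processor.priority.k1 = 5
# Per-sink keys are written much like the f1, f2 file groups of the Taildir Source
a1.sinkgroups.g1.processor.priority.k2 = 10
# Backoff: a failed sink stays in the failed pool and is skipped; the penalty is capped at 10 seconds
a1.sinkgroups.g1.processor.maxpenalty = 10000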
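
The selection rule behind this configuration can be sketched as follows. This is a simplified Java model under stated assumptions: it ignores the penalty timer, represents sinks by name only, and the class FailoverSketch and method pickActive are hypothetical names, not Flume classes.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Simplified model of failover sink selection (hypothetical class, not part of Flume).
class FailoverSketch {
    // Picks the sink with the largest priority value among those not currently marked as failed.
    static Optional<String> pickActive(Map<String, Integer> priorities, Set<String> failed) {
        return priorities.entrySet().stream()
                .filter(entry -> !failed.contains(entry.getKey())) // skip sinks in the failed pool
                .max(Map.Entry.comparingByValue())                  // highest priority wins
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = Map.of("k1", 5, "k2", 10);
        System.out.println(pickActive(priorities, Set.of()));           // Optional[k2]
        System.out.println(pickActive(priorities, Set.of("k2")));       // Optional[k1]
        System.out.println(pickActive(priorities, Set.of("k1", "k2"))); // Optional.empty
    }
}
```

With the priorities above, k2 is the active sink while it is healthy, and k1 only takes over when k2 has failed.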

2. Analysis

3. Implementation steps

(1) Preparation

Create a group2 folder under /opt/module/flume/job

touch flume1.conf;

touch flume2.conf;

touch flume3.conf;

vim flume1.conf

Edit the configuration file and add the following:

flume1.conf:
    
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
a1.sinkgroups = g1    

# Source    
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
# Listen on port 44444
a1.sources.r1.port = 44444
    
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
      
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
    
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142    

#Sink Group
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
    
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
# A sink connects to exactly one channel, so the key is "channel" without an s
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

Configure flume2.conf

Edit the configuration file and add the following:

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
#Source
# On the source side, avro acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
    
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Describe the sink
a2.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1    
a2.sinks.k1.channel = c1

Configure flume3.conf

Edit the configuration file and add the following:

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

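4. Open several terminal windows:

Note: start the downstream agents first, then the upstream one. (The Avro Source acts as the server and the Avro Sink as the client, so start the server first.)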

bin/flume-ng agent -c conf/ -f job/group2/flume3.conf -n a3 -Dflume.root.logger=INFO,console

bin/flume-ng agent -c conf/ -f job/group2/flume2.conf -n a2 -Dflume.root.logger=INFO,console

bin/flume-ng agent -c conf/ -f job/group2/flume1.conf -n a1

Use netcat to send content to port 44444 on the local machine:

nc localhost 44444

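Check the console logs of Flume2 and Flume3.

All the content goes to a single console window.

Kill the agent that is currently receiving the data (with the priorities above this is Flume3, because k2 has the higher priority) and watch the other agent's console take over.

Note: use jps -ml to list the Flume processes.

For the load-balancing case, only the failover sink processor settings in flume1.conf need to change.

Search the official site for "Load balancing Sink Processor".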

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
    
# Replace the failover processor lines with these three lines
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

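The other steps are the same. The only difference is that the logs no longer all go to one window but are distributed randomly between the two.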
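
The difference between the round_robin and random selectors can be pictured with a small Java sketch. It is an illustration only: it leaves out backoff, treats sinks as plain names, and the class LoadBalanceSketch with its two methods is a hypothetical stand-in rather than Flume's selector implementation.

```java
import java.util.List;
import java.util.Random;

// Toy model of the two built-in selection strategies (hypothetical class, not part of Flume).
class LoadBalanceSketch {
    private final List<String> sinks;
    private int next = 0; // index of the sink that round robin will return next

    LoadBalanceSketch(List<String> sinks) {
        this.sinks = sinks;
    }

    // round_robin: walk through the sinks in order and wrap around at the end.
    String roundRobin() {
        String chosen = sinks.get(next);
        next = (next + 1) % sinks.size();
        return chosen;
    }

    // random: pick any sink with equal probability.
    String random(Random rng) {
        return sinks.get(rng.nextInt(sinks.size()));
    }

    public static void main(String[] args) {
        LoadBalanceSketch balancer = new LoadBalanceSketch(List.of("k1", "k2"));
        for (int i = 0; i < 4; i++) {
            System.out.println("round_robin -> " + balancer.roundRobin());
        }
        Random rng = new Random();
        for (int i = 0; i < 4; i++) {
            System.out.println("random -> " + balancer.random(rng));
        }
    }
}
```

Round robin alternates strictly between k1 and k2, while random may send several batches in a row to the same sink, which matches the scattered console output seen in the test.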

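Aggregation

1. Requirements

Flume-1 on hadoop102 monitors the file /opt/module/data/group.log.

Flume-2 on hadoop103 monitors the data stream on a port.

Flume-1 and Flume-2 send their data to Flume-3 on hadoop104, and Flume-3 prints the final data to the console.

Flume-1 and Flume-2 send to Flume-3, and Flume-3 reads the content on its own server. Flume-3 is the server side (the usual client/server model of remote communication, which the configuration files below reflect).

2. Analysis

3. Implementation steps

(1) Preparation

On hadoop102, create a group3 folder under /opt/module/flume/job.

Distribute Flume to the other hosts.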

touch flume1.conf;

touch flume2.conf;

touch flume3.conf;

These files belong to hadoop102, hadoop103, and hadoop104 respectively.

vim flume1.conf

Edit the configuration file and add the following:

First create the group.log file in /opt/module/data/ on hadoop102.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 
a1.channels = c1   
   
# Source    
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
# group.log is a file created for this test
a1.sources.r1.filegroups.f1 = /opt/module/data/group.log
a1.sources.r1.positionFile = /opt/module/flume/position/position2.json
    
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
      
# Describe the sink
a1.sinks.k1.type = avro
# Send to hadoop104
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141 

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
# A sink connects to exactly one channel, so the key is "channel" without an s
a1.sinks.k1.channel = c1

Configure flume2.conf

Edit the configuration file and add the following:

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
#Netcat Source    
a2.sources.r1.type = netcat
a2.sources.r1.bind = localhost
# Listen on port 44444
a2.sources.r1.port = 44444 
    
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
# Send to hadoop104
a2.sinks.k1.port = 4141 
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1    
a2.sinks.k1.channel = c1

Configure flume3.conf

Edit the configuration file and add the following:

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe the source
a3.sources.r1.type = avro
# The receiving host
a3.sources.r1.bind = hadoop104
# The receiving port
a3.sources.r1.port = 4141

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

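4. Open several terminal windows:

Run the configuration files.

Start each agent with its configuration file, beginning with the server on hadoop104. Run on the three machines respectively:

(104) bin/flume-ng agent -c conf/ -f job/group3/flume3.conf -n a3 -Dflume.root.logger=INFO,console

(103) bin/flume-ng agent -c conf/ -f job/group3/flume2.conf -n a2 # monitors the port

(102) bin/flume-ng agent -c conf/ -f job/group3/flume1.conf -n a1 # monitors the file

On hadoop102, append content to group.log: echo hello >> group.log

On hadoop103, send content to port 44444: run nc localhost 44444, then type the content.

Check the output on hadoop104.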

Custom Interceptor

1. Custom Interceptor

1) Requirements

Use Flume to collect local server logs and send different kinds of logs to different analysis systems according to log type.

2) Analysis

In real development, one server may produce many kinds of logs, and different kinds may need to go to different analysis systems. This calls for the Multiplexing structure among the Flume topologies. Multiplexing works by sending events to different Channels according to the value of a particular key in each event's Header, so we need a custom Interceptor that gives that key a different value for each kind of event.

In this case, port data simulates the logs, and whether a message contains "hello" or not (for example "chenxu") simulates the different log types. The custom interceptor tells the two apart and sends them to different analysis systems (Channels).


In the Hadoop-Study-01 project, create the sub-project Flume-Study-01.

Create the package com.chenxu.Interceptor.

Create the class TypeInterceptor and override the four methods:

package com.chenxu.Interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {

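    // A list that holds the events to return. intercept(List<Event> events) is called repeatedly, so the list
    // should not be created there, and a variable declared inside initialize() would not be visible elsewhere,
    // so it is declared here as a field.
    // It is not instantiated here, which would waste resources; it is created in initialize(), when the interceptor is actually used.
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {

        // Initialize the list
        addHeaderEvents = new ArrayList<>();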
    }

    @Override
    public Event intercept(Event event) {

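        // Get the Header of the event (key-value pairs)
        Map<String, String> headers = event.getHeaders();

        // Get the Body of the event (a byte array)
        String body = new String(event.getBody());

        // Decide from the content of the Body which channel the event goes to.
        // This is the decision logic of the interceptor:
        // add the header entry.
        if(body.contains("hello")){
            headers.put("type","op"); // mapping.op = c1
        }else{
            headers.put("type","np"); // mapping.np = c2; this keeps the two kinds of data apart
        }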
        }

        return event;
    }

    // Batch interception: simply calls intercept(Event event) for each event (the ArrayList serves as a buffer)
    @Override
    public List<Event> intercept(List<Event> events) {

        // 1. Clear the list
        addHeaderEvents.clear();

        // 2. Iterate over the events and add the header entry to each one
        for (Event event : events) {
            addHeaderEvents.add(intercept(event));
        }

        return addHeaderEvents;
    }

    @Override
    public void close() {

    }

    // The class does not have to be named Builder, but the name after $ in the configuration must match it
    public static class Builder implements Interceptor.Builder{

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

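For interceptor configuration, see the official documentation: search for "Flume Interceptors" (any page with an example that sets interceptor parameters also works).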

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
# A custom Interceptor is configured the same way, with $Builder appended; Builder is the static nested class inside the interceptor.
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1
    
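Give each interceptor a name and specify its type; custom interceptors are configured in much the same way.
The class must implement the org.apache.flume.interceptor.Interceptor interface.
The input is an Event and the output is also an Event.

Package the finished TypeInterceptor as a jar and put it on the cluster, in the /opt/module/flume/lib/ directory.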

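(3) Edit the Flume configuration files

Requirement recap: use Flume to collect local server logs and send different kinds of logs to different analysis systems according to log type.

To tell the outputs apart, the console output is placed on hadoop103 and hadoop104.

On each of the three servers, go to the job directory, create a group4 directory, and create flume1.conf, flume2.conf, and flume3.conf respectively.

Configure flume1.conf: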

# Name the components on this agent
a1.sources = r1
a1.channels = c1 c2 
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
    
#Interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.chenxu.Interceptor.TypeInterceptor$Builder    
    
#Channel Selector
# A Multiplexing Channel Selector is needed here
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.op = c1
a1.sources.r1.selector.mapping.np = c2 
# a1.sources.r1.selector.default does not need to be configured here
    

# Use a channel which buffers events in memory   
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 100

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4142

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4142
    
# Bind the source and sink to the channel   
a1.sources.r1.channels = c1 c2 
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Configure flume2.conf

a2.sources = r1
a2.sinks = k1
a2.channels = c1
    
# Describe the source
a2.sources.r1.type = avro
# The receiving host
a2.sources.r1.bind = hadoop103
# The receiving port
a2.sources.r1.port = 4142

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
    
# Describe the sink
a2.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Configure flume3.conf

a3.sources = r1
a3.sinks = k1
a3.channels = c1
    
# Describe the source
a3.sources.r1.type = avro
# The receiving host
a3.sources.r1.bind = hadoop104
# The receiving port; the two agents run on different servers, so they can use the same port number
a3.sources.r1.port = 4142

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
    
# Describe the sink
a3.sinks.k1.type = logger
    
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

As before, start the downstream agents on hadoop103 and hadoop104 first:

(104) bin/flume-ng agent -c conf/ -f job/group4/flume3.conf -n a3 -Dflume.root.logger=INFO,console

(103) bin/flume-ng agent -c conf/ -f job/group4/flume2.conf -n a2 -Dflume.root.logger=INFO,console

(102) bin/flume-ng agent -c conf/ -f job/group4/flume1.conf -n a1

Log output appears on hadoop103 and hadoop104, which shows that the connections succeeded. Start the test.

On hadoop102:

nc localhost 44444

Enter data:

hello world

chenxu qifei 

world

Lines that contain "hello" are printed on the hadoop103 console; all others are printed on the hadoop104 console.
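
The whole path from message body to channel can be checked without a cluster using a small stand-alone Java sketch. It repeats the rule from TypeInterceptor and the mapping from flume1.conf, but it is only a model: the class TypeRoutingSketch and its methods tag and channelFor are hypothetical names and use no Flume classes.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone model of "interceptor sets a header, selector maps it to a channel" (not Flume code).
class TypeRoutingSketch {
    // Same mapping as flume1.conf: op -> c1 (sent to hadoop103), np -> c2 (sent to hadoop104).
    static final Map<String, String> MAPPING = Map.of("op", "c1", "np", "c2");

    // Same rule as TypeInterceptor.intercept: bodies containing "hello" get type=op, all others type=np.
    static Map<String, String> tag(String body) {
        Map<String, String> headers = new HashMap<>();
        headers.put("type", body.contains("hello") ? "op" : "np");
        return headers;
    }

    // Looks up the channel for the "type" header; returns null when the value has no mapping.
    static String channelFor(Map<String, String> headers) {
        String type = headers.get("type");
        return type == null ? null : MAPPING.get(type);
    }

    public static void main(String[] args) {
        for (String body : new String[] {"hello world", "chenxu qifei", "world"}) {
            System.out.println(body + " -> " + channelFor(tag(body)));
        }
    }
}
```

For the three test inputs above, only "hello world" maps to c1; the other two map to c2, which agrees with what the two consoles show.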

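Custom Source

Source recap:

A Source is the component that receives data into the Flume Agent. Source components can handle log data of many types and formats, including avro (for passing data between Flume agents), thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.

The official documentation also describes the interface for custom sources:
https://flume.apache.org/FlumeDeveloperGuide.html#source
According to the official description, a custom MySource must extend the AbstractSource class and implement the Configurable and PollableSource interfaces.
Methods to implement:
getBackOffSleepIncrement() // not used for now
getMaxBackOffSleepInterval() // not used for now
configure(Context context) // initializes from the context (reads the configuration file)
process() // fetches data, wraps it in an event, and writes it to the channel; this method is called in a loop
Use case: reading data from MySQL or from other file systems.

Requirement:

Use Flume to receive data, add a prefix to each record, and print it to the console. The prefix can be set in the Flume configuration file.

Here we generate some data ourselves, read it in a loop, and print it to the console. Replacing the generated data later with data fetched over JDBC gives a MySQL source; substituting other data gives other sources.

// Template from the official Flume Developer Guide: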
public class MySource extends AbstractSource implements Configurable, PollableSource {
  private String myProp;

  @Override
  public void configure(Context context) {
    // Read the configuration: the first argument is the key, the second is the default value
    String myProp = context.getString("myProp", "defaultValue");

    // Process the myProp value (e.g. validation, convert to another type, ...)

    // Store myProp for later retrieval by process() method
    this.myProp = myProp;
  }

  // Similar to initialize
  @Override
  public void start() {
    // Initialize the connection to the external client
  }

  // Similar to close
  @Override
  public void stop () {
    // Disconnect from external client and do any additional cleanup
    // (e.g. releasing resources or nulling-out field values) ..
  }

  // This method is called in a loop
  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;

    try {
      // This try clause includes whatever Channel/Event operations you want to do

      // Receive new data
      // Fetch the data; this is the main part to change in your own source
      Event e = getSomeData();

      // Store the Event into this Source's associated Channel(s)
      getChannelProcessor().processEvent(e);

      status = Status.READY;
    } catch (Throwable t) {
      // Log exception, handle individual exceptions as needed

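      // Back off if an exception occurs
      status = Status.BACKOFF;

      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error)t;
      }
    }
    // No transaction is opened in this method (the ChannelProcessor handles it), so there is nothing to close here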
    return status;
  }
}

// Custom source
package com.chenxu.Source;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.HashMap;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    /* Requirements:
   1. Receive data (generated in a for loop)
   2. Wrap it in an Event
   3. Pass the event to the Channel
     */

    private String prefix;
    private String subfix;

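    // Read the parameters
    @Override
    public void configure(Context context) {
        // The key can be any string, as long as the last part of the property name in the conf file matches it (compare k1.type, a1.channel)
        // Prefer a name that makes the meaning of the parameter obvious at a glance
        prefix = context.getString("prefix");
        // If the configuration file gives no value for the key subfix, the default value "chenxu" is used
        subfix = context.getString("subfix","chenxu");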
    }

    @Override
    public Status process() throws EventDeliveryException {

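        // process() can use the values read in configure()

        Status status = null;
        // (IDE shortcut: Ctrl + Alt + T wraps the selection, for example in try/catch)
        try {
            // 1. Receive data (generated here)
            for (int i = 0; i < 5; i++) {

                // 2. Build the event
                SimpleEvent event = new SimpleEvent(); // Event is an interface; SimpleEvent and JSONEvent implement it

                // Create the event headers (setting headers is optional)
                HashMap<String, String> headerMap = new HashMap<>();
                event.setHeaders(headerMap);

                // 3. Set the Body: a prefix without a default value and a suffix with a default value
                event.setBody((prefix + "--" + i + "--" + subfix).getBytes());

                // Write the event to the channel
                getChannelProcessor().processEvent(event);
                // processEvent first runs the interceptors, then a null check, then the selector; it opens a transaction, does the put, and commits
                // Each event gets its own transaction

                // The transaction succeeded
                status = Status.READY;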

            }
        } catch (Throwable t) {
            // Log exception, handle individual exceptions as needed

            // Back off if an exception occurs
            status = Status.BACKOFF;

            // re-throw all Errors
            if (t instanceof Error) {
                throw (Error)t;
            }
        }

        // Run once every 2 seconds
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return status;
    }

    // Not used in this example
    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    // Not used in this example
    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }


}

Create mysource.conf in the job directory.

Add the following:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.chenxu.Source.MySource
a1.sources.r1.prefix = feiji
#a1.sources.r1.subfix= xiaxian

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel    
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

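Start the agent with this configuration file:

bin/flume-ng agent -c conf/ -f job/mysource.conf -n a1 -Dflume.root.logger=INFO,console

Check whether messages are printed automatically.

Note: if a message is too long to be printed in full, set the Logger Sink parameter maxBytesToLog, whose default value is 16.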

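Custom Sink

Sink recap:

A Sink continuously polls the Channel for events and removes them in batches, writing them in batches to a storage or indexing system or sending them on to another Flume Agent (avro).
A Sink is fully transactional. Before removing a batch of data from the Channel, each Sink starts a transaction with the Channel. Once the batch has been written successfully to the storage system or to the next Flume Agent, the Sink commits the transaction through the Channel. Once the transaction is committed, the Channel deletes the events from its internal buffer.
Sink destinations include hdfs, logger (prints to the console), avro (for passing data between Flume agents), thrift, ipc, file, null, HBase, solr, and custom sinks. A Channel is the buffer between the Source and the Sink, so it lets the Source and the Sink run at different rates. A Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

Requirement:

Use Flume to receive data, add a prefix and a suffix to each record on the Sink side, and print it to the console. The prefix and suffix can be set in the Flume job configuration file.

Note: in a custom Source, checking, committing, and the put of the transaction can all be left to the ChannelProcessor. A custom Sink cannot rely on it, so these steps must be written out when overriding the process method.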
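
Why the sink has to handle commit and rollback itself is easier to see with a toy model. The Java sketch below uses a hypothetical ToyChannel class with plain strings as events; it imitates take, commit, and rollback with two queues and is not the real org.apache.flume.Channel or Transaction API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Toy stand-in for a channel with a single open transaction (hypothetical, not the Flume API).
class ToyChannel {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Deque<String> taken = new ArrayDeque<>(); // events taken but not yet committed

    void put(String event) {
        queue.addLast(event);
    }

    // Removes the next event from the queue but remembers it until commit or rollback.
    String take() {
        String event = queue.pollFirst();
        if (event != null) {
            taken.addLast(event);
        }
        return event;
    }

    // Commit: the taken events are gone for good.
    void commit() {
        taken.clear();
    }

    // Rollback: put the taken events back at the front, in their original order.
    void rollback() {
        while (!taken.isEmpty()) {
            queue.addFirst(taken.pollLast());
        }
    }

    int size() {
        return queue.size();
    }

    // One "process" call of a sink: take, deliver, then commit; roll back if delivery fails.
    static boolean process(ToyChannel channel, Consumer<String> deliver) {
        try {
            String event = channel.take();
            if (event != null) {
                deliver.accept(event);
            }
            channel.commit();
            return true;
        } catch (RuntimeException e) {
            channel.rollback();
            return false;
        }
    }

    public static void main(String[] args) {
        ToyChannel channel = new ToyChannel();
        channel.put("hello");
        boolean first = process(channel, e -> { throw new RuntimeException("destination down"); });
        System.out.println("first attempt ok=" + first + ", events left=" + channel.size());
        boolean second = process(channel, e -> System.out.println("delivered " + e));
        System.out.println("second attempt ok=" + second + ", events left=" + channel.size());
    }
}
```

Because the failed delivery is rolled back, the event is still in the channel for the second attempt, which is the guarantee the transaction code in a real sink is there to provide.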

The code is in the com.chenxu.MySink package. The template from the official Flume Developer Guide comes first:

public class MySink extends AbstractSink implements Configurable {
  private String myProp;

  @Override
  public void configure(Context context) {
    String myProp = context.getString("myProp", "defaultValue");

    // Process the myProp value (e.g. validation)

    // Store myProp for later retrieval by process() method
    this.myProp = myProp;
  }

  // Overriding the following two methods is optional
  @Override
  public void start() {
    // Initialize the connection to the external repository (e.g. HDFS) that
    // this Sink will forward Events to ..
  }

  @Override
  public void stop () {
    // Disconnect from the external respository and do any
    // additional cleanup (e.g. releasing resources or nulling-out
    // field values) ..
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;

    // Start transaction
    Channel ch = getChannel();
    Transaction txn = ch.getTransaction();
    txn.begin();
    try {
      // This try clause includes whatever Channel operations you want to do

      // Unlike a custom Source, the data here can only be taken from the channel
      Event event = ch.take();

      // Send the Event to the external repository.
      // storeSomeData(e);

      txn.commit();
      status = Status.READY;
    } catch (Throwable t) {
      txn.rollback();

      // Log exception, handle individual exceptions as needed

      status = Status.BACKOFF;

      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error)t;
      }
    } finally {
      // Always close the transaction, whether it was committed or rolled back
      txn.close();
    }
    return status;
  }
}


// Custom Sink
package com.chenxu.MySink;

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {

    // Get a Logger object
    private Logger logger = LoggerFactory.getLogger(MySink.class);

    // Configuration parameters
    private String prefix;
    private String subfix;
    @Override
    public void configure(Context context) {

        prefix = context.getString("prefix");
        subfix = context.getString("subfix","chenxu");

    }

    /*
    1. Get the Channel
    2. Get a transaction and the data from the Channel
    3. Send the data
     */

    @Override
    public Status process() throws EventDeliveryException {
        Status status = null;

        // Start transaction
        Channel channel = getChannel();
        Transaction transaction = channel.getTransaction();
        transaction.begin();
        try {


            // This try clause includes whatever Channel operations you want to do

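            // Unlike a custom Source, a custom sink can only take data from the channel
            Event event = channel.take();

            // The business logic goes here; this is the most important part of a custom sink
            // Send the Event to the external repository.
            // storeSomeData(e);

            // take() returns null when the channel is empty, so guard against it
            if (event != null) {
                // Build the output from the event body
                String body = prefix + new String(event.getBody()) + subfix;
                // Print it
                logger.info(body);
            }

            transaction.commit();
            // Back off briefly when there was nothing to take
            status = (event != null) ? Status.READY : Status.BACKOFF;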
        } catch (Throwable t) {
            // Roll back on any exception
            transaction.rollback();

            // Log exception, handle individual exceptions as needed (some information can be printed here)

            status = Status.BACKOFF;

            // re-throw all Errors
            if (t instanceof Error) {
                throw (Error)t;
            }
        }finally {
            transaction.close();

        }
        return status;
    }
}

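Test:

1. Packaging
Package the finished code as a jar and put it in Flume's lib directory (/opt/module/flume/lib).

2. Configuration file

Create mysink.conf in the job directory and add the following: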

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
    
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
    
# Describe the sink
a1.sinks.k1.type = com.chenxu.MySink.MySink
a1.sinks.k1.prefix = feiji
# The key must match the one read in MySink.configure(), which is "subfix"
a1.sinks.k1.subfix = jiangluo
    
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
    
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1