IT数码 购物 网址 头条 软件 日历 阅读 图书馆
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
图片批量下载器
↓批量下载图片,美女图库↓
图片自动播放器
↓图片自动播放器↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁
 
   -> 大数据 -> Flink checkpoint 过期导致失败(线上问题) -> 正文阅读

[大数据]Flink checkpoint 过期导致失败(线上问题)

报错信息:

2021-08-18 18:28:40,502 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1629282520408 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:38:40,502 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 1 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:38:40,527 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 2 (type=CHECKPOINT) @ 1629283120507 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:48:40,528 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 2 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:48:40,581 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 3 (type=CHECKPOINT) @ 1629283720528 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:49:35,383 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 8e8ad4a5e66753df9bacb838c4a8c195 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000016 @ (dataPort=26682).
2021-08-18 18:49:41,220 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 6c425a7bf2cdb6f9eaff230e657e7b71 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000028 @(dataPort=26329).
2021-08-18 18:51:23,142 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 4aca7ff37807d7c53dab4af4d7317a92 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000015 @ (dataPort=28244).
2021-08-18 18:51:43,449 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task f410376b8a4c4482529b47d615d0ef7d of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000027 @ (dataPort=35241).
2021-08-18 18:52:14,054 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task b0b4e5bc960fd38f09ed136eabc92207 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000011 @  (dataPort=17858).
2021-08-18 18:53:15,172 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 254800e15801981e9a5fcaf8f8374f80 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000019 @ (dataPort=28753).
2021-08-18 18:54:23,649 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task bda7204e335e5818502024d14b9cf515 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000032 @ (dataPort=37476).
2021-08-18 18:54:53,507 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 9839eef45f9224ddbbd3383cf962ea75 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000026 @ (dataPort=31847).
2021-08-18 18:57:07,707 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 6b91c567ac5eb48d6d8c4c692327c95d of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000028 @  (dataPort=26329).
2021-08-18 18:58:40,581 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 3 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:58:40,604 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 4 (type=CHECKPOINT) @ 1629284320583 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:58:40,590 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.

jobmanager定时trigger checkpoint,给source处发送trigger信号,同时会启动一个异步线程,在checkpoint timeout时长之后停止本轮 checkpoint,cancel动作执行之后本轮的checkpoint就为超时,如果在超时之前收到了最后一个sink算子的ack信号,那么checkpoint就是成功的。

超时原因一般有两种

? ? ? ? 1. barrier对齐超时

? ? ? ? 2. 异步状态遍历和写HDFS超时(比如State太大)

解决:? ? ?

? ? ? ? 这个job 的sink是写入ES中,由于ES集群扩容了一些机器,因为跨机房导致ES端有问题,从而导致消费端有问题从而出现反压,从而导致barrier对齐时间过长 而导致checkpoint失败。

一般的排查方式(网上找的方法):

查看ui上的失败checkpoint的detail,可以看到失败的是Pending -> EventEmit xxx这个算子的10个subtask。

checkpoint

接着去jobmanager上查看这个checkpoint的一些延迟信息:

(最上面报错信息内容)

可以根据这些失败的task的id去查询这些任务落在哪一个taskmanager上,经过排查发现,是同一台机器,通过ui看到该机器流入的数据明显比别的流入量大 因此是因为数据倾斜导致了这个问题,追根溯源还是下游消费能力不足的问题

根据日志可以去源码中查找相对应的信息,例如:

2021-08-18 18:49:35,383 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Received late message for now expired checkpoint attempt 2 from task 8e8ad4a5e66753df9bacb838c4a8c195 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000016 @ xxxxxxxxx (dataPort=26682).

CheckpointCoordinator类中查找:

if (recentPendingCheckpoints.contains(checkpointId)) {
                    wasPendingCheckpoint = true;
                    LOG.warn(
                            "Received late message for now expired checkpoint attempt {} from task "
                                    + "{} of job {} at {}.",
                            checkpointId,
                            message.getTaskExecutionId(),
                            message.getJob(),
                            taskManagerLocationInfo);

代码中就可以知道 每个大括号{} 中代表的是什么。

(问题排查:主要是通过 JM的日志和TM日志,以及Flink WebUI 的 每个算子反压状况以及checkpoint的状况)

  大数据 最新文章
实现Kafka至少消费一次
亚马逊云科技:还在苦于ETL?Zero ETL的时代
初探MapReduce
【SpringBoot框架篇】32.基于注解+redis实现
Elasticsearch:如何减少 Elasticsearch 集
Go redis操作
Redis面试题
专题五 Redis高并发场景
基于GBase8s和Calcite的多数据源查询
Redis——底层数据结构原理
上一篇文章      下一篇文章      查看所有文章
加:2021-08-20 15:11:21  更:2021-08-20 15:12:59 
 
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁

360图书馆 购物 三丰科技 阅读网 日历 万年历 2024年11日历 -2024/11/23 13:06:43-

图片自动播放器
↓图片自动播放器↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  IT数码