报错信息:
2021-08-18 18:28:40,502 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1 (type=CHECKPOINT) @ 1629282520408 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:38:40,502 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 1 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:38:40,527 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2 (type=CHECKPOINT) @ 1629283120507 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:48:40,528 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 2 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:48:40,581 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 3 (type=CHECKPOINT) @ 1629283720528 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:49:35,383 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task 8e8ad4a5e66753df9bacb838c4a8c195 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000016 @ (dataPort=26682).
2021-08-18 18:49:41,220 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task 6c425a7bf2cdb6f9eaff230e657e7b71 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000028 @(dataPort=26329).
2021-08-18 18:51:23,142 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task 4aca7ff37807d7c53dab4af4d7317a92 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000015 @ (dataPort=28244).
2021-08-18 18:51:43,449 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task f410376b8a4c4482529b47d615d0ef7d of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000027 @ (dataPort=35241).
2021-08-18 18:52:14,054 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task b0b4e5bc960fd38f09ed136eabc92207 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000011 @ (dataPort=17858).
2021-08-18 18:53:15,172 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task 254800e15801981e9a5fcaf8f8374f80 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000019 @ (dataPort=28753).
2021-08-18 18:54:23,649 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task bda7204e335e5818502024d14b9cf515 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000032 @ (dataPort=37476).
2021-08-18 18:54:53,507 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task 9839eef45f9224ddbbd3383cf962ea75 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000026 @ (dataPort=31847).
2021-08-18 18:57:07,707 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task 6b91c567ac5eb48d6d8c4c692327c95d of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000028 @ (dataPort=26329).
2021-08-18 18:58:40,581 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 3 of job 1cba827ad8a8d68521605157b77a6191 expired before completing.
2021-08-18 18:58:40,604 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 4 (type=CHECKPOINT) @ 1629284320583 for job 1cba827ad8a8d68521605157b77a6191.
2021-08-18 18:58:40,590 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
jobmanager定时trigger checkpoint ,给source处发送trigger信号,同时会启动一个异步线程,在checkpoint timeout 时长之后停止本轮 checkpoint,cancel动作执行之后本轮的checkpoint就为超时,如果在超时之前收到了最后一个sink算子的ack 信号,那么checkpoint就是成功的。
超时原因一般有两种:
? ? ? ? 1. barrier对齐超时
? ? ? ? 2. 异步状态遍历和写HDFS超时(比如State太大)
解决:? ? ?
? ? ? ? 这个job 的sink是写入ES中,由于ES集群扩容了一些机器,因为跨机房导致ES端有问题,从而导致消费端有问题从而出现反压,从而导致barrier对齐时间过长 而导致checkpoint失败。
一般的排查方式(网上找的方法):
查看ui上的失败checkpoint的detail,可以看到失败的是Pending -> EventEmit xxx 这个算子的10个subtask。
接着去jobmanager上查看这个checkpoint的一些延迟信息:
(最上面报错信息内容)
可以根据这些失败的task的id去查询这些任务落在哪一个taskmanager上,经过排查发现,是同一台机器,通过ui看到该机器流入的数据明显比别的流入量大 因此是因为数据倾斜导致了这个问题,追根溯源还是下游消费能力不足的问题
根据日志可以去源码中查找相对应的信息,例如:
2021-08-18 18:49:35,383 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 2 from task 8e8ad4a5e66753df9bacb838c4a8c195 of job 1cba827ad8a8d68521605157b77a6191 at container_e150_1628460895322_672063_01_000016 @ xxxxxxxxx (dataPort=26682).
CheckpointCoordinator类中查找:
if (recentPendingCheckpoints.contains(checkpointId)) {
wasPendingCheckpoint = true;
LOG.warn(
"Received late message for now expired checkpoint attempt {} from task "
+ "{} of job {} at {}.",
checkpointId,
message.getTaskExecutionId(),
message.getJob(),
taskManagerLocationInfo);
代码中就可以知道 每个大括号{} 中代表的是什么。
(问题排查:主要是通过 JM的日志和TM日志,以及Flink WebUI 的 每个算子反压状况以及checkpoint的状况)
|