当 timeline server 使用LeveldbTimelineStore 时,hive 任务有时卡住。
1. 根本原因
根本原因,在删除过期数据时,对于每种 Entiry 类型,拿到 deleteLock.writeLock() , 在put 操作,需要拿到 deleteLock.readLock 。当长时间做清理工作时,put 需要一直等待。
2. 推荐方案
2.1 关闭 timeline server
在 yarn-site.xml 配置 yarn.timeline-service.enabled=false
影响: 此影响比较大,不仅 tez-ui 无法打开,其他服务如 spark history 也受影响。但是比较彻底,timeline server 不会出现问题。
2.2 关闭 hive 不向 timeline server 发送数据
在 hive-site.xml 配置 yarn.timeline-service.enabled=false 。因为现在 tez 产生的数据最多,所以关闭 hive 任务能解决此问题。 影响:tez-ui 没有用。历史任务如果知道 application id,查看日志没有影响。当其他任务发送大量数据时,还是会有问题。
2.3 减少 timeline ttl 时间
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>604800000</value>
</property>
默认 7 天,减少 timeline ttl 时间可以缓解,但不能彻底解决此问题。
3. 分析过程
3.1 测试当 timelineserver 没有运行,任务卡住一会还是能继续运行?
从以下日志分析,当 timelineserver 没有运行,任务卡住一会(5秒),然后继续运行
22/09/21 15:38:38 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:39 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:40 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:41 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
22/09/21 15:38:42 INFO ipc.Client: Retrying connect to server: bmr-master-40d309e/172.16.48.23:10200. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=5, sleepTime=1000 MILLISECONDS)
3.2 测试 timeline 各数据类型的数据量
level db(92G) 数据类型统计。手动 leveldb 切换时间 9月1日21:26. type:d, count: 180466 type:e, count: 205866053 type:i, count: 800383696 type:k, count: 9511102 type:o, count: 180466
d: DOMAIN_ENTRY_PREFIX e: ENTITY_ENTRY_PREFIX i: INDEXED_ENTRY_PREFIX k: START_TIME_LOOKUP_PREFIX o: OWNER_LOOKUP_PREFIX
3.3 2022-09-01部分清理日志
2022-09-01 15:28:57,711 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 12662 entities for timestamp 1661412494248 and earlier in 43.463 seconds
2022-09-01 15:34:48,522 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 15766 entities for timestamp 1661412837711 and earlier in 50.811 seconds
2022-09-01 15:41:31,624 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 32860 entities for timestamp 1661413188523 and earlier in 103.101 seconds
2022-09-01 15:52:43,284 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 128077 entities for timestamp 1661413591624 and earlier in 371.66 seconds
2022-09-01 16:11:08,737 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 279761 entities for timestamp 1661414263284 and earlier in 805.453 seconds
2022-09-01 16:46:48,740 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 630752 entities for timestamp 1661415368738 and earlier in 1840.002 seconds
2022-09-01 17:45:03,624 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1147831 entities for timestamp 1661417508740 and earlier in 3194.884 seconds
2022-09-01 19:30:44,539 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1664671 entities for timestamp 1661421003624 and earlier in 6040.915 seconds
2022-09-01 21:04:49,154 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 1855544 entities for timestamp 1661427344539 and earlier in 5344.615 seconds
2022-09-01 22:26:20,494 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 456813 entities for timestamp 1661436231351 and earlier in 1349.129 seconds
说明在切换leveldb 之前,每次 delete 过期数据耗费太长时间。导致 leveldb 响应变慢。
3.4 测试是否不能删除彻底。
如现在保留7天的日志,如果改为1小时,那么就会删除大部分日志。并且在1小时后,基本没有日志。
3.4.1 某集群,重启timeline-server 之前个日期的文件数量,第1列为文件数量,总数据量为 30G
3942 Sep 16
7073 Sep 17
3218 Sep 18
648 Sep 19
2630 Sep 20
1004 Sep 21
3.4.2 重启 timeline-server 时间 09:18
[root@bmr-master-40d309e ~]
2022-09-21 09:19:05,341 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1952 entities of type TEZ_APPLICATION
2022-09-21 09:19:15,403 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1952 entities of type TEZ_APPLICATION_ATTEMPT
2022-09-21 09:19:31,542 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 15392 entities of type TEZ_CONTAINER_ID
2022-09-21 09:19:32,657 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1845 entities of type TEZ_DAG_EXTRA_INFO
2022-09-21 09:19:38,116 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1845 entities of type TEZ_DAG_ID
2022-09-21 10:28:04,133 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1909751 entities of type TEZ_TASK_ATTEMPT_ID
2022-09-21 11:26:06,381 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1915052 entities of type TEZ_TASK_ID
2022-09-21 11:29:58,285 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 21668 entities of type TEZ_VERTEX_ID
2022-09-21 11:30:00,194 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2262 entities of type YARN_APPLICATION
2022-09-21 11:30:01,634 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2265 entities of type YARN_APPLICATION_ATTEMPT
2022-09-21 11:33:38,043 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2264 entities of type YARN_CONTAINER
2022-09-21 11:33:38,043 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 3876248 entities for timestamp 1663719541882 and earlier in 8076.019 seconds
3.4.3 查看 leveldb数据量
过两个小时后,查看数据量,为1.2G。数据量有变少, 说明清理清除历史数据, 但是可能比较慢。
[root@bmr-master-40d309e ~]
1.2G /mnt/hadoop/yarn/timeline/leveldb-timeline-store.ldb/
3.5 测试 delete 过去 entries 卡住put
再进行一次测试,在测试的时候,执行 tpcds-3 query。
3.5.1 timeline 相关日志
从以下日志分析,结束时间是 2022-09-21 16:31:48,531 ,开始时间是 2022-09-21 16:16:34,262 。每种类型的删除是获取锁的。
2022-09-21 16:16:36,468 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type HIVE_QUERY_ID
2022-09-21 16:16:36,795 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 256 entities of type TEZ_APPLICATION
2022-09-21 16:16:37,178 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 256 entities of type TEZ_APPLICATION_ATTEMPT
2022-09-21 16:16:39,607 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 2117 entities of type TEZ_CONTAINER_ID
2022-09-21 16:16:39,749 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type TEZ_DAG_EXTRA_INFO
2022-09-21 16:16:40,497 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 267 entities of type TEZ_DAG_ID
2022-09-21 16:23:59,988 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 234974 entities of type TEZ_TASK_ATTEMPT_ID
2022-09-21 16:31:29,118 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 234823 entities of type TEZ_TASK_ID
2022-09-21 16:31:42,539 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 3334 entities of type TEZ_VERTEX_ID
2022-09-21 16:31:43,628 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 1271 entities of type YARN_APPLICATION
2022-09-21 16:31:44,056 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 682 entities of type YARN_APPLICATION_ATTEMPT
2022-09-21 16:31:48,531 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1553)) - Deleted 668 entities of type YARN_CONTAINER
2022-09-21 16:31:48,531 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:discardOldEntities(1560)) - Discarded 479182 entities for timestamp 1663744594222 and earlier in 914.279 seconds
3.5.2 tpcds测试结果
可以看到,query3 卡住很长时间。
Begin exec query1 at 20220921-16:16.58
Run DAG 17.07s
End exec query1 successfully at 20220921-16:18.21
Begin exec query2 at 20220921-16:18.21
Run DAG 24.78s
End exec query2 successfully at 20220921-16:19.04
Begin exec query3 at 20220921-16:19.04
Run DAG 13.19s
End exec query3 successfully at 20220921-16:32.07
Begin exec query4 at 20220921-16:32.07
Run DAG 39.89s
End exec query4 successfully at 20220921-16:33.09
Begin exec query5 at 20220921-16:33.09
2. 代码分析
2.1 删除过期数据线程
@Override
public void run() {
while (true) {
long timestamp = System.currentTimeMillis() - ttl;
try {
discardOldEntities(timestamp);
Thread.sleep(ttlInterval);
} catch (IOException e) {
LOG.error(e.toString());
} catch (InterruptedException e) {
LOG.info("Deletion thread received interrupt, exiting");
break;
}
}
}
2.2 discardOldEntities 方法
在处理每种类型时,先获取 writeLock,处理后释放。
void discardOldEntities(long timestamp)
throws IOException, InterruptedException {
byte[] reverseTimestamp = writeReverseOrderedLong(timestamp);
long totalCount = 0;
long t1 = System.currentTimeMillis();
try {
List<String> entityTypes = getEntityTypes();
for (String entityType : entityTypes) {
LeveldbIterator iterator = null;
LeveldbIterator pfIterator = null;
long typeCount = 0;
try {
deleteLock.writeLock().lock();
iterator = getDbIterator(false);
pfIterator = getDbIterator(false);
if (deletionThread != null && deletionThread.isInterrupted()) {
throw new InterruptedException();
}
boolean seeked = false;
while (deleteNextEntity(entityType, reverseTimestamp, iterator,
pfIterator, seeked)) {
typeCount++;
totalCount++;
seeked = true;
if (deletionThread != null && deletionThread.isInterrupted()) {
throw new InterruptedException();
}
}
} catch (IOException e) {
LOG.error("Got IOException while deleting entities for type " +
entityType + ", continuing to next type", e);
} finally {
IOUtils.cleanupWithLogger(LOG, iterator, pfIterator);
deleteLock.writeLock().unlock();
if (typeCount > 0) {
LOG.info("Deleted " + typeCount + " entities of type " +
entityType);
}
}
}
} finally {
long t2 = System.currentTimeMillis();
LOG.info("Discarded " + totalCount + " entities for timestamp " +
timestamp + " and earlier in " + (t2 - t1) / 1000.0 + " seconds");
}
}
2.3 put 方法
在 put 时,获取读锁,如果正在删除,则此方法一直等待
@Override
public TimelinePutResponse put(TimelineEntities entities) {
try {
deleteLock.readLock().lock();
TimelinePutResponse response = new TimelinePutResponse();
for (TimelineEntity entity : entities.getEntities()) {
put(entity, response, false);
}
return response;
} finally {
deleteLock.readLock().unlock();
}
}
|