Here is what happened.
Problem
When I ran the DWS-layer command dws_load_member_start.sh 2020-07-21, it failed. Here is the full error output:
which: no hbase in (:/opt/install/jdk1.8.0_231/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/install/hadoop-2.9.2/bin:/opt/install/hadoop-2.9.2/sbin:/opt/install/flume-1.9.0/bin:/opt/install/hive-2.3.7/bin:/opt/install/datax/bin:/opt/install/spark-2.4.5/bin:/opt/install/spark-2.4.5/sbin:/root/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/install/hive-2.3.7/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/install/tez-0.9.2/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/install/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/install/hive-2.3.7/lib/hive-common-2.3.7.jar!/hive-log4j2.properties Async: true
Query ID = root_20211014210413_76de217f-e97b-4435-adca-7e662260ab0b
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1634216554071_0002)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
VERTICES: 00/00 [>>--------------------------] 0% ELAPSED TIME: 8.05 s
----------------------------------------------------------------------------------------------
Status: Failed--------------------------------------------------------------------------------
Application application_1634216554071_0002 failed 2 times due to AM Container for appattempt_1634216554071_0002_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: [2021-10-14 21:04:29.444]Container [pid=20544,containerID=container_1634216554071_0002_02_000001] is running beyond virtual memory limits. Current usage: 277.4 MB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1634216554071_0002_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 20544 20543 20544 20544 (bash) 0 0 115900416 304 /bin/bash -c /opt/install/jdk1.8.0_231/bin/java -Xmx819m -Djava.io.tmpdir=/opt/install/hadoop-2.9.2/data/tmp/nm-local-dir/usercache/root/appcache/application_1634216554071_0002/container_1634216554071_0002_02_000001/tmp -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001 -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster --session 1>/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001/stdout 2>/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001/stderr
|- 20551 20544 20544 20544 (java) 367 99 2771484672 70721 /opt/install/jdk1.8.0_231/bin/java -Xmx819m -Djava.io.tmpdir=/opt/install/hadoop-2.9.2/data/tmp/nm-local-dir/usercache/root/appcache/application_1634216554071_0002/container_1634216554071_0002_02_000001/tmp -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001 -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel= org.apache.tez.dag.app.DAGAppMaster --session
[2021-10-14 21:04:29.458]Container killed on request. Exit code is 143
[2021-10-14 21:04:29.481]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: http://hadoop1:8088/cluster/app/application_1634216554071_0002 Then click on links to logs of each attempt.
. Failing the application.
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Application application_1634216554071_0002 failed 2 times due to AM Container for appattempt_1634216554071_0002_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: [2021-10-14 21:04:29.444]Container [pid=20544,containerID=container_1634216554071_0002_02_000001] is running beyond virtual memory limits. Current usage: 277.4 MB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1634216554071_0002_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 20544 20543 20544 20544 (bash) 0 0 115900416 304 /bin/bash -c /opt/install/jdk1.8.0_231/bin/java -Xmx819m -Djava.io.tmpdir=/opt/install/hadoop-2.9.2/data/tmp/nm-local-dir/usercache/root/appcache/application_1634216554071_0002/container_1634216554071_0002_02_000001/tmp -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001 -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster --session 1>/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001/stdout 2>/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001/stderr
|- 20551 20544 20544 20544 (java) 367 99 2771484672 70721 /opt/install/jdk1.8.0_231/bin/java -Xmx819m -Djava.io.tmpdir=/opt/install/hadoop-2.9.2/data/tmp/nm-local-dir/usercache/root/appcache/application_1634216554071_0002/container_1634216554071_0002_02_000001/tmp -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/opt/install/hadoop-2.9.2/logs/userlogs/application_1634216554071_0002/container_1634216554071_0002_02_000001 -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel= org.apache.tez.dag.app.DAGAppMaster --session
[2021-10-14 21:04:29.458]Container killed on request. Exit code is 143
[2021-10-14 21:04:29.481]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: http://hadoop1:8088/cluster/app/application_1634216554071_0002 Then click on links to logs of each attempt.
. Failing the application.
I have copied the log out above so we can read through what it is saying.
The which: no hbase line and the SLF4J lines at the top can be ignored: the former just means HBase is not on the PATH, and the latter only warn about duplicate logging bindings on the classpath. Neither affects this job.
Next comes the execution status of the job I submitted:
Logging initialized using configuration in jar:file:/opt/install/hive-2.3.7/lib/hive-common-2.3.7.jar!/hive-log4j2.properties Async: true
Query ID = root_20211014210413_76de217f-e97b-4435-adca-7e662260ab0b
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1634216554071_0002)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
VERTICES: 00/00 [>>--------------------------] 0% ELAPSED TIME: 8.05 s
----------------------------------------------------------------------------------------------
Then it tells me that the job failed, and that the process exited with code -103 (YARN's exit status for a container killed for exceeding its virtual memory limit) ①
Status: Failed--------------------------------------------------------------------------------
Application application_1634216554071_0002 failed 2 times due to AM Container for appattempt_1634216554071_0002_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: [2021-10-14 21:04:29.444]Container [pid=20544,containerID=container_1634216554071_0002_02_000001] is running beyond virtual memory limits. Current usage: 277.4 MB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
It is followed by the reason: Container [attributes describing the Container] is running beyond virtual memory limits. Current usage: 277.4 MB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container ②. In plain terms, my running task (strictly speaking the Container, but "task" is easier to picture) exceeded its virtual memory limit: of the 1 GB of physical memory it was allocated it used only 277.4 MB, which is perfectly reasonable, but against a 2.1 GB virtual memory allowance it used 2.7 GB, which is clearly over the line, so the NodeManager killed it.
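Where does the 2.1 GB virtual memory ceiling come from? It is the container's physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio, and that ratio defaults to 2.1, so 1 GB × 2.1 = 2.1 GB — exactly the figure in the log, which suggests this cluster is running on the default. For reference, the Hadoop default looks roughly like this (a sketch of the default, not taken from this cluster's yarn-site.xml):
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
<description>Ratio of allowed virtual memory to physical memory for containers (Hadoop default, shown for reference only).</description>
</property>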
The last part of the log, which is also the longest, tells us where more detail about the problem is recorded:
Dump of the process-tree for container_1634216554071_0002_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- ...
|- ...
[2021-10-14 21:04:29.458]Container killed on request. Exit code is 143
[2021-10-14 21:04:29.481]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: http://hadoop1:8088/cluster/app/application_1634216554071_0002 Then click on links to logs of each attempt.
. Failing the application.
The key line is the second-to-last one: For more detailed output, check the application tracking page: http://hadoop1:8088/cluster/app/application_1634216554071_0002 Then click on links to logs of each attempt. It tells us that to learn more, we should go to http://hadoop1:8088/cluster/app/application_1634216554071_0002 and drill into the logs of each attempt. ③
Solution
Same old line of thinking: when resources are the problem, there are only two possible directions — 1. the job is too heavy, or 2. the resources are too small. In this case the job is not heavy; the physical memory usage alone shows that the memory I allocated was more than enough to finish it, so the problem lies with virtual memory. For a "virtual memory" problem, my first thought was that there must be some configuration that can intervene here; which one exactly, I did not know, so I had to search for it and eventually found a fairly complete analysis here: https://www.suibibk.com/topic/634440889984352256.
Next time this kind of problem comes up, these are the directions worth thinking along:

- Disable the virtual memory check. Set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml (or in the job configuration):
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers.</description>
</property>
Besides the virtual memory limit, it is also possible to exceed the physical memory limit; the physical memory check can likewise be disabled by setting yarn.nodemanager.pmem-check-enabled to false. Personally I do not think this approach is a good one: if the program has a memory leak or a similar problem, removing the check may end up bringing the whole cluster down.

- Increase mapreduce.map.memory.mb or mapreduce.reduce.memory.mb. This should be the first option considered: it not only covers the virtual memory case, but much of the time the real shortage is physical memory, and this handles that as well.
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
<description>Resource limit for map tasks</description>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
<description>Resource limit for reduce tasks</description>
</property>

- Increase yarn.nodemanager.vmem-pmem-ratio appropriately, so that each unit of physical memory is allowed more virtual memory (see the sketch after this list). Do not push this parameter to anything absurd either; in the end the limit is still computed as mapreduce.reduce.memory.mb * yarn.nodemanager.vmem-pmem-ratio.

- If the memory a job occupies is truly outrageous, the real question is whether the program has a memory leak, whether there is data skew, and so on; problems like that should be fixed in the program first.
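To illustrate the third option, here is a minimal sketch of what raising the ratio in yarn-site.xml might look like; the value 4 is purely illustrative, not a recommendation from the analysis linked above. With a 1 GB container it would lift the virtual memory ceiling to 1 GB * 4 = 4 GB, comfortably above the 2.7 GB this AM container actually used:
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Illustrative value only: allow each container 4 units of virtual memory per unit of allocated physical memory.</description>
</property>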