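Recently, in a user's environment, one node of a clustered database could not start and join the cluster. The cluster runs version 11.2. The cluster logs made the problem fairly clear: the cluster alert log pointed to the CSSD process log, and the CSSD log reported a missing network heartbeat: has a disk HB, but no network HB. The issue was investigated and resolved as follows:
1. First, identified the database heartbeat (interconnect) IPs from the hosts file, and confirmed at the operating-system level that the heartbeat NICs were up and that the nodes could PING each other and connect over SSH.
2. Ran gpnptool get to confirm that the heartbeat network used by the cluster is the one checked in the previous step.
3. In an 11.2 cluster, the GIPC process is responsible for monitoring the state of the cluster networks. The GIPC process log showed the heartbeat network as eth1 - rank 0, which indicates an abnormal state (the normal value is eth1 - rank 99).
4. Step 1 had already shown that the heartbeat network was healthy at the host level. Given how the cluster components work, the next step was to make the cluster re-detect the state of the heartbeat network (usually by killing the GIPC process or restarting the clusterware).
5. This time, neither killing the GIPC process nor restarting the clusterware helped. After the NIC was restarted at the operating-system level, the GIPC process detected the NIC state correctly and the cluster started normally.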
The relevant logs are as follows:
1. Heartbeat network information in the GPnP profile at the time of the failure:
[grid@nphisdb1 gpnpd]$gpnptool get
Warning: some command line parameters were defaulted. Resulting command line:
/u01/app/11.2.0/grid_1/bin/gpnptool.bin get -o-
<?xml version="1.0" encoding="UTF-8"?><gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd" ProfileSequence="4" ClusterUId="a3268b3b769cdf7dbfc43c8ffd69e87f" ClusterName="nphisdb-cluster" PALocation=""><gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*"><gpnp:Network id="net1" IP="192.168.205.0" Adapter="eth0" Use="public"/><gpnp:Network id="net2" IP="10.10.10.0" Adapter="eth1" Use="cluster_interconnect"/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/><orcl:ASM-Profile id="asm" DiscoveryString="/dev/oracleasm/disks" SPFile="+CRS/nphisdb-cluster/asmparameterfile/registry.253.1028034033"/><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/><ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/><ds:Reference URI=""><ds:Transforms><ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/><ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"> <InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl xsi"/></ds:Transform></ds:Transforms><ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/><ds:DigestValue>bjVFpM9uJREXWTWBP6GSC1A11Zw=</ds:DigestValue></ds:Reference></ds:SignedInfo><ds:SignatureValue>UN5iBJd7mbmW8usjptRlTXtIBf05z76r+MyCNOSlXAGcsTE/zbb2BFeZkH0LMpyF5jbpQUzHE+U3wjUzZl/VsQS+y9QPeANVz1q1E9XDpfsxJwhRyhv0MNtK4/yy9xr9Y/zgTdg6dO2utm2Hy9pyCoDIrQ75gsmnZCtmPrfwR0A=</ds:SignatureValue></ds:Signature></gpnp:GPnP-Profile>
Success.
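The profile above is a single long XML line, so the interconnect entry is easy to miss. As a minimal sketch (not part of the original troubleshooting), the following Python reads the text printed by gpnptool get and lists each network's adapter, subnet, and use. The function names are hypothetical, the sample is an abbreviated copy of the output above, and the sketch assumes that the first '<' and the last '>' in the output delimit the profile.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the profile's xmlns:gpnp declaration.
GPNP_NS = {"gpnp": "http://www.grid-pnp.org/2005/11/gpnp-profile"}


def extract_profile_xml(tool_output):
    """Cut the XML document out of the surrounding gpnptool messages.

    Assumption: the first '<' and the last '>' delimit the profile.
    """
    start = tool_output.index("<")
    end = tool_output.rindex(">") + 1
    return tool_output[start:end]


def list_networks(tool_output):
    """Return one dict per gpnp:Network element in the profile."""
    root = ET.fromstring(extract_profile_xml(tool_output))
    return [
        {
            "id": net.get("id"),
            "subnet": net.get("IP"),
            "adapter": net.get("Adapter"),
            "use": net.get("Use"),
        }
        for net in root.iterfind(".//gpnp:Network", GPNP_NS)
    ]


# Abbreviated copy of the output above (signature and CSS/ASM parts omitted).
sample_output = (
    "Warning: some command line parameters were defaulted. Resulting command line:\n"
    "/u01/app/11.2.0/grid_1/bin/gpnptool.bin get -o-\n"
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<gpnp:GPnP-Profile Version="1.0" '
    'xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile">'
    '<gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*">'
    '<gpnp:Network id="net1" IP="192.168.205.0" Adapter="eth0" Use="public"/>'
    '<gpnp:Network id="net2" IP="10.10.10.0" Adapter="eth1" Use="cluster_interconnect"/>'
    "</gpnp:HostNetwork></gpnp:Network-Profile></gpnp:GPnP-Profile>\n"
    "Success.\n"
)

networks = list_networks(sample_output)
for net in networks:
    print(net["adapter"], net["subnet"], net["use"])
```

In this profile the only entry with Use="cluster_interconnect" is eth1 on 10.10.10.0, which matches the heartbeat NIC checked at the operating-system level.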
2. Check the rank value of the network in the GIPC process log
2022-03-20 13:30:58.580: [ CLSINET][346261248] Returning NETDATA: 1 interfaces
2022-03-20 13:30:58.580: [ CLSINET][346261248] # 0 Interface 'eth1',ip='10.10.10.1',mac='40-f2-e9-64-24-5e',mask='255.255.255.0',net='10.10.10.0',use='cluster_interconnect'
2022-03-20 13:31:00.903: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2022-03-20 13:31:01.430: [GIPCDCLT][350463744] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000046d
2022-03-20 13:31:02.431: [GIPCDCLT][350463744] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000199
2022-03-20 13:31:03.432: [GIPCDCLT][350463744] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000032e
2022-03-20 13:31:03.584: [ CLSINET][346261248] Returning NETDATA: 1 interfaces
2022-03-20 13:31:03.584: [ CLSINET][346261248] # 0 Interface 'eth1',ip='10.10.10.1',mac='40-f2-e9-64-24-5e',mask='255.255.255.0',net='10.10.10.0',use='cluster_interconnect'
2022-03-20 13:31:06.433: [GIPCDCLT][350463744] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000046d
2022-03-20 13:31:07.434: [GIPCDCLT][350463744] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000199
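3. Restarting the clusterware did not resolve the problem, so the NIC was restarted
4. Check the GIPC process log again: the rank is back to the normal value of 99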
[grid@nphisdb1 gipcd]$tail -f gipcd.log |grep rank
2022-03-20 13:38:30.626: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 99, avgms 1.143791 [ 300 / 306 / 306 ]
2022-03-20 13:39:00.634: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 99, avgms 0.628019 [ 204 / 207 / 207 ]
2022-03-20 13:39:30.642: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 99, avgms 1.564626 [ 153 / 147 / 147 ]
2022-03-20 13:40:00.642: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 99, avgms 1.052632 [ 119 / 114 / 114 ]
2022-03-20 13:40:30.644: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 99, avgms 1.016949 [ 121 / 118 / 118 ]
2022-03-20 13:41:00.655: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 99, avgms 1.636364 [ 115 / 110 / 110 ]
2022-03-20 13:41:30.658: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 99, avgms 1.071429 [ 117 / 112 / 112 ]
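The rank check can also be scripted instead of reading gipcd.log by eye. The sketch below is illustrative only: the regular expression is inferred from the gipcdMonitorSaveInfMetrics lines shown in this post, not from any documented log format, and the function names are hypothetical. Following the observation above, it treats rank 0 as abnormal; other rank values are not covered by this post.

```python
import re

# Pattern inferred from the gipcdMonitorSaveInfMetrics lines in this post.
RANK_LINE = re.compile(
    r"^(?P<ts>\S+ \S+): .*gipcdMonitorSaveInfMetrics:\s+"
    r"inf\[\s*\d+\]\s+(?P<nic>\S+)\s+-\s+rank\s+(?P<rank>\d+),"
)


def parse_rank_lines(lines):
    """Return (timestamp, interface, rank) for every rank line found."""
    samples = []
    for line in lines:
        match = RANK_LINE.match(line)
        if match:
            samples.append(
                (match.group("ts"), match.group("nic"), int(match.group("rank")))
            )
    return samples


def degraded(samples):
    """Keep only samples with rank 0, the abnormal value seen in this case."""
    return [sample for sample in samples if sample[2] == 0]


# Three lines copied from the log excerpts in this post.
log_lines = [
    "2022-03-20 13:31:00.903: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: "
    "inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]",
    "2022-03-20 13:31:01.430: [GIPCDCLT][350463744] gipcdClientThread: req from "
    "local client of type gipcdmsgtypeInterfaceMetrics, endp 000000000000046d",
    "2022-03-20 13:38:30.626: [GIPCDMON][346261248] gipcdMonitorSaveInfMetrics: "
    "inf[ 0] eth1 - rank 99, avgms 1.143791 [ 300 / 306 / 306 ]",
]

samples = parse_rank_lines(log_lines)
for ts, nic, rank in degraded(samples):
    print(f"{ts}: {nic} reported rank {rank}")
```

To run it against a real log, pass an open file object for gipcd.log to parse_rank_lines instead of the sample list.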