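Problem
In the on-site test environment, two microservice groups failed to start. The logs show that the call to the Nacos API for service registration failed; in essence, the error says that the Raft group could not find a leader node.
Error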
2022-09-07 14:29:56.900 [main] ERROR -[NA] failed to request
com.alibaba.nacos.api.exception.NacosException: caused: java.util.concurrent.ExecutionException: com.alibaba.nacos.consistency.exception.ConsistencyException: com.alibaba.nacos.core.distributed.raft.exception.NoLeaderException: The Raft Group [naming_persistent_service_v2] did not find the Leader node;caused: com.alibaba.nacos.consistency.exception.ConsistencyException: com.alibaba.nacos.core.distributed.raft.exception.NoLeaderException: The Raft Group [naming_persistent_service_v2] did not find the Leader node;caused: com.alibaba.nacos.core.distributed.raft.exception.NoLeaderException: The Raft Group [naming_persistent_service_v2] did not find the Leader node;
at com.alibaba.nacos.client.naming.net.NamingProxy.callServer(NamingProxy.java:615)
at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:526)
at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:498)
at com.alibaba.nacos.client.naming.net.NamingProxy.reqApi(NamingProxy.java:493)
at com.alibaba.nacos.client.naming.net.NamingProxy.registerService(NamingProxy.java:246)
at com.alibaba.nacos.client.naming.NacosNamingService.registerInstance(NacosNamingService.java:212)
at com.alibaba.cloud.nacos.registry.NacosServiceRegistry.register(NacosServiceRegistry.java:74)
at org.springframework.cloud.client.serviceregistry.AbstractAutoServiceRegistration.register(AbstractAutoServiceRegistration.java:239)
at com.alibaba.cloud.nacos.registry.NacosAutoServiceRegistration.register(NacosAutoServiceRegistration.java:78)
at org.springframework.cloud.client.serviceregistry.AbstractAutoServiceRegistration.start(AbstractAutoServiceRegistration.java:138)
at org.springframework.cloud.client.serviceregistry.AbstractAutoServiceRegistration.bind(AbstractAutoServiceRegistration.java:101)
at org.springframework.cloud.client.serviceregistry.AbstractAutoServiceRegistration.onApplicationEvent(AbstractAutoServiceRegistration.java:88)
at org.springframework.cloud.client.serviceregistry.AbstractAutoServiceRegistration.onApplicationEvent(AbstractAutoServiceRegistration.java:47)
at org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(SimpleApplicationEventMulticaster.java:172)
at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:165)
at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:139)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:404)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:361)
at org.springframework.boot.web.servlet.context.WebServerStartStopLifecycle.start(WebServerStartStopLifecycle.java:46)
at org.springframework.context.support.DefaultLifecycleProcessor.doStart(DefaultLifecycleProcessor.java:182)
at org.springframework.context.support.DefaultLifecycleProcessor.access$200(DefaultLifecycleProcessor.java:53)
at org.springframework.context.support.DefaultLifecycleProcessor$LifecycleGroup.start(DefaultLifecycleProcessor.java:360)
at org.springframework.context.support.DefaultLifecycleProcessor.startBeans(DefaultLifecycleProcessor.java:158)
at org.springframework.context.support.DefaultLifecycleProcessor.onRefresh(DefaultLifecycleProcessor.java:122)
at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:895)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:554)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:143)
at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:758)
at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:750)
at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:397)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:315)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1237)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1226)
at com.ais.cdc.CdcManageApplication.main(CdcManageApplication.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
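Investigation
Log analysis
Comparing the logs of a component that started normally with those of a failed one shows that the failed components have ephemeral=false in their Nacos configuration. With this setting the application registers with Nacos as a persistent instance, which requires the Nacos server to use the Raft protocol.
Technical cause:
Persistent instances and their health checks require the Nacos servers to use Raft as the consistency protocol. The test environment runs a single Nacos node, where Raft is not usable (the Raft group has no leader, as the NoLeaderException above shows). Only the default Distro protocol is available there, and Distro supports ephemeral instances only.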
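Human cause
According to the test team, the site wanted to use persistent instances in production, so spring.cloud.nacos.discovery.ephemeral=false was added to the Nacos configuration list of the failed components.
Resolution
The component configuration had been changed on site without considering the actual scope of impact. After the added setting was removed, the components started successfully.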
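A guard along the following lines could flag this kind of change before a component is restarted. It is only a sketch: the class and method names are hypothetical, and it assumes the component's Nacos settings are available as flat `java.util.Properties` text. Only the property key spring.cloud.nacos.discovery.ephemeral comes from the incident itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical pre-deployment check; not part of Nacos or Spring Cloud Alibaba.
class EphemeralConfigCheck {
    static final String KEY = "spring.cloud.nacos.discovery.ephemeral";

    // Returns true when the config asks for a persistent (non-ephemeral) instance.
    // A missing key is treated as ephemeral=true, which is the default.
    static boolean requestsPersistentInstance(Properties props) {
        String value = props.getProperty(KEY, "true").trim();
        return "false".equalsIgnoreCase(value);
    }

    // Parses properties-format text held in memory.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected for an in-memory reader.
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse("spring.cloud.nacos.discovery.ephemeral=false\n");
        if (requestsPersistentInstance(props)) {
            System.out.println("Warning: persistent instance requested; "
                    + "confirm the target Nacos server has a Raft leader.");
        }
    }
}
```

Running such a check against the configuration of every component about to be deployed would have pointed at the two failed groups before they were restarted.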
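References
Nacos consistency protocols: CP/AP/JRaft/Distro
Raft protocol
- A CP protocol
- Configuration management uses the CP protocol
Distro protocol
- An AP distributed protocol developed in-house by the Nacos community and designed for ephemeral instances
- Service registration and discovery uses the AP protocol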
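Nacos registry: design overview
Nacos 2.0 provides two separate flows depending on an instance's ephemeral setting:
- ephemeral=false: persistent instance. The client talks to the server over HTTP, server nodes synchronize data with the Raft protocol, and health checks are active probes initiated by the server.
- ephemeral=true: ephemeral instance. The client talks to the server over gRPC, server nodes synchronize data with the Distro protocol, and health checks rely on TCP connection keep-alive.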
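The two flows can be summarized as a small lookup. The sketch below is purely illustrative: the enum and its field names are hypothetical and are not Nacos types; the values simply restate the two bullets above.

```java
// Illustrative model of the two registration flows; names are hypothetical.
enum RegistrationFlow {
    PERSISTENT("HTTP", "Raft", "server-side active probing"),
    EPHEMERAL("gRPC", "Distro", "TCP connection keep-alive");

    final String clientTransport; // how the client talks to the server
    final String syncProtocol;    // how server nodes synchronize data
    final String healthCheck;     // how instance health is determined

    RegistrationFlow(String clientTransport, String syncProtocol, String healthCheck) {
        this.clientTransport = clientTransport;
        this.syncProtocol = syncProtocol;
        this.healthCheck = healthCheck;
    }

    // Selects the flow from the instance's ephemeral flag.
    static RegistrationFlow forInstance(boolean ephemeral) {
        return ephemeral ? EPHEMERAL : PERSISTENT;
    }
}

class RegistrationFlowDemo {
    public static void main(String[] args) {
        for (boolean ephemeral : new boolean[] {false, true}) {
            RegistrationFlow flow = RegistrationFlow.forInstance(ephemeral);
            System.out.println("ephemeral=" + ephemeral
                    + " transport=" + flow.clientTransport
                    + " sync=" + flow.syncProtocol
                    + " healthCheck=" + flow.healthCheck);
        }
    }
}
```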
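Nacos ephemeral and persistent instances
In version 1.0.0, Nacos added an ephemeral field at the instance level, which indicates whether a registered instance is ephemeral or persistent. An ephemeral instance is not stored persistently on the Nacos server and must keep itself alive by reporting heartbeats; if no heartbeat arrives for a period of time, the Nacos server removes it. If it starts reporting heartbeats again after being removed, it is registered again. A persistent instance is stored persistently by the Nacos server: even if the client process that registered it is gone, the instance is not deleted from the server and is only marked as unhealthy.
A single service can contain both ephemeral and persistent instances. This means that when all instance processes of the service are gone, the ephemeral instances are removed from the service while the persistent ones remain listed under it.
The mode is determined by the instance's ephemeral field: true corresponds to the client mode of service health checking, and false corresponds to the server mode.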
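The difference in lifecycle can be illustrated with a toy in-memory registry. This is a sketch under simplifying assumptions, not Nacos code: `ToyRegistry` and its methods are hypothetical, and a single `onClientLost` call stands in for whatever the server uses to decide a client is gone (missed heartbeats for an ephemeral instance, failed probes for a persistent one).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of one service's instance list; hypothetical, not a Nacos class.
class ToyRegistry {
    static class Instance {
        final boolean ephemeral;
        boolean healthy = true;

        Instance(boolean ephemeral) {
            this.ephemeral = ephemeral;
        }
    }

    private final Map<String, Instance> instances = new LinkedHashMap<>();

    void register(String address, boolean ephemeral) {
        instances.put(address, new Instance(ephemeral));
    }

    // Called when the server concludes that the client behind `address` is gone.
    void onClientLost(String address) {
        Instance instance = instances.get(address);
        if (instance == null) {
            return; // unknown address, nothing to do
        }
        if (instance.ephemeral) {
            instances.remove(address);   // ephemeral: removed from the service
        } else {
            instance.healthy = false;    // persistent: kept, marked unhealthy
        }
    }

    boolean contains(String address) {
        return instances.containsKey(address);
    }

    boolean isHealthy(String address) {
        Instance instance = instances.get(address);
        return instance != null && instance.healthy;
    }

    public static void main(String[] args) {
        ToyRegistry registry = new ToyRegistry();
        registry.register("10.0.0.1:8080", true);   // ephemeral instance
        registry.register("10.0.0.2:8080", false);  // persistent instance
        registry.onClientLost("10.0.0.1:8080");
        registry.onClientLost("10.0.0.2:8080");
        System.out.println("ephemeral still listed: " + registry.contains("10.0.0.1:8080"));
        System.out.println("persistent still listed: " + registry.contains("10.0.0.2:8080"));
        System.out.println("persistent healthy: " + registry.isHealthy("10.0.0.2:8080"));
    }
}
```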
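CP and AP in Nacos
If there is only one Nacos node, it can simply be started in standalone mode, and the naming feature works fully.
It can also be started in cluster mode with only one node listed in the configuration file. In that case, however, Nacos's Raft implementation cannot elect a leader with a single node, so the CP-mode Raft protocol is unavailable and persistent services cannot be registered. Ephemeral services can still be registered.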
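The resulting behavior can be stated as a small decision rule. The sketch below is a toy model with hypothetical names (`ToyNacosServer`, `raftLeaderElected`); it does not call or reproduce any Nacos API and only encodes the statement above: the CP path needs a Raft leader, the AP path does not.

```java
// Toy model of registration outcomes; hypothetical, not a Nacos class.
class ToyNacosServer {
    private final boolean raftLeaderElected;

    ToyNacosServer(boolean raftLeaderElected) {
        this.raftLeaderElected = raftLeaderElected;
    }

    // Ephemeral registrations take the AP path and do not need a leader.
    // Persistent registrations take the CP (Raft) path and fail without one.
    String register(String service, boolean ephemeral) {
        if (!ephemeral && !raftLeaderElected) {
            throw new IllegalStateException(
                    "no Raft leader: persistent registration rejected for " + service);
        }
        return (ephemeral ? "ephemeral" : "persistent") + " instance registered for " + service;
    }

    public static void main(String[] args) {
        // Models a cluster-mode server with a single configured node.
        ToyNacosServer server = new ToyNacosServer(false);
        System.out.println(server.register("demo-service", true));
        try {
            server.register("demo-service", false);
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```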