Nacos cluster error: unable to find local peer: *.*.*.*, all peers: []

Posted by Geuni's Blog on January 11, 2022

Symptom:

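Our company's new system was about to go live. After we set up a Nacos cluster in the production environment, we found that some nodes could not be recognized.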

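nacos.log showed that the nodes had started normally and contained no exceptions, but naming-server.log reported exceptions saying the node information could not be matched: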

2022-01-11 08:31:03,630 WARN NamingProxy

java.io.IOException: failed to req API:http://10.*.*.*:8848/nacos/v1/ns/operator/cluster/state. code:500 msg: caused: unable to find local peer: 10.*.*.*:8848, all peers: [];
	at com.alibaba.nacos.naming.misc.NamingProxy.reqCommon(NamingProxy.java:321)
	at com.alibaba.nacos.naming.cluster.ServerListManager$ServerInfoUpdater.run(ServerListManager.java:183)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2022-01-11 08:31:04,688 WARN [STATUS-SYNCHRONIZE] failed to request serverStatus, remote server: 10.*.*.*:8848

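This problem is quite likely when a server has multiple network interfaces, or when Nacos runs inside a virtual machine or Docker: the cluster is configured with internal IPs, but Nacos detects an external IP as its own address (or vice versa), so it cannot find itself among the configured peers.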
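To make the failure concrete, here is a hypothetical, heavily simplified Java model of the self-lookup behind the error message. It is not Nacos source code; the class name, method name and addresses are made up for illustration. The only point it shows is that the address a node picks for itself must appear verbatim (ip:port) among the configured peers:

```java
import java.util.Set;

/**
 * Hypothetical illustration (NOT Nacos source code) of the self-lookup that
 * fails with "unable to find local peer": the detected "ip:port" of this node
 * must be one of the peers listed in cluster.conf.
 */
public class LocalPeerCheck {

    /** Returns the detected "ip:port" if it is a configured peer, otherwise throws. */
    static String findLocalPeer(String detectedIp, int port, Set<String> peers) {
        String self = detectedIp + ":" + port;
        if (!peers.contains(self)) {
            throw new IllegalStateException(
                "unable to find local peer: " + self + ", all peers: " + peers);
        }
        return self;
    }

    public static void main(String[] args) {
        // cluster.conf lists internal addresses (hypothetical values)
        Set<String> peers = Set.of("10.0.0.1:8848", "10.0.0.2:8848", "10.0.0.3:8848");
        // The node detected its internal NIC: the lookup succeeds
        System.out.println(findLocalPeer("10.0.0.1", 8848, peers));
        // The node detected a Docker bridge / external address instead: the lookup fails
        try {
            findLocalPeer("172.17.0.1", 8848, peers);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```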

Solution:

Add the -Dnacos.server.ip JVM parameter to startup.sh on every node, and the problem goes away:

JAVA_OPT="${JAVA_OPT} -Dnacos.server.ip=selfIp" # replace selfIp with this node's cluster IP, i.e. the IP listed for it in cluster.conf
JAVA_OPT="${JAVA_OPT} -Dloader.path=${BASE_DIR}/plugins/health,${BASE_DIR}/plugins/cmdb"
JAVA_OPT="${JAVA_OPT} -Dnacos.home=${BASE_DIR}"
JAVA_OPT="${JAVA_OPT} -jar ${BASE_DIR}/target/${SERVER}.jar"
JAVA_OPT="${JAVA_OPT} ${JAVA_OPT_EXT}"
JAVA_OPT="${JAVA_OPT} --spring.config.additional-location=${CUSTOM_SEARCH_LOCATIONS}"
JAVA_OPT="${JAVA_OPT} --logging.config=${BASE_DIR}/conf/nacos-logback.xml"
JAVA_OPT="${JAVA_OPT} --server.max-http-header-size=524288"
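Before hard-coding selfIp, it may be worth confirming on each node that the address really is bound to one of its network interfaces. The sketch below is a hypothetical helper, not part of Nacos; it relies on the JDK's NetworkInterface.getByInetAddress, which returns null when no local interface carries the given address:

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;

/**
 * Hypothetical sanity check (not part of Nacos): is the IP we plan to pass
 * as -Dnacos.server.ip actually assigned to an interface on this machine?
 */
public class SelfIpCheck {

    static boolean isBoundLocally(String ip) throws UnknownHostException, SocketException {
        // For a literal IP, getByName only parses the address; no DNS lookup happens
        InetAddress addr = InetAddress.getByName(ip);
        // null means no local network interface has this address
        return NetworkInterface.getByInetAddress(addr) != null;
    }

    public static void main(String[] args) throws Exception {
        String selfIp = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println(selfIp + " bound locally: " + isBoundLocally(selfIp));
    }
}
```

With Java 11 or later it can be run straight from source on each node, e.g. `java SelfIpCheck.java 10.0.0.1`.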

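For reference, see com.alibaba.nacos.config.server.utils.SystemConfig.java in the config module. If the nacos.server.ip property is set, it is returned as-is; otherwise the first non-loopback, non-IPv6 address found while enumerating the network interfaces is used: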

/**
 * System config.
 *
 * @author Nacos
 */
public class SystemConfig {
    
    public static final String LOCAL_IP = getHostAddress();
    
    private static final Logger LOGGER = LoggerFactory.getLogger(SystemConfig.class);
    
    private static String getHostAddress() {
        String address = System.getProperty("nacos.server.ip");
        if (StringUtils.isNotEmpty(address)) {
            return address;
        } else {
            address = InternetAddressUtil.localHostIP();
        }
        try {
            Enumeration<NetworkInterface> en = NetworkInterface.getNetworkInterfaces();
            while (en.hasMoreElements()) {
                NetworkInterface ni = en.nextElement();
                Enumeration<InetAddress> ads = ni.getInetAddresses();
                while (ads.hasMoreElements()) {
                    InetAddress ip = ads.nextElement();
                    // Compatible group does not regulate 11 network segments
                    if (!ip.isLoopbackAddress() && ip.getHostAddress().indexOf(":") == -1
                        /* && ip.isSiteLocalAddress() */) {
                        return ip.getHostAddress();
                    }
                }
            }
        } catch (Exception e) {
            LOGGER.error("get local host address error", e);
        }
        return address;
    }
    
}
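The selection order in getHostAddress() explains why the wrong address can get picked. Below is a simplified model of that logic, not the real method: interface enumeration is replaced by a plain list of address strings, and the loopback check is reduced to a `127.` prefix test. The class name and sample addresses are hypothetical:

```java
import java.util.List;

/**
 * Simplified model (not the real Nacos method) of SystemConfig.getHostAddress():
 * interface enumeration is replaced by an ordered list of address strings.
 */
public class HostAddressPick {

    static String pick(String override, List<String> addressesInEnumerationOrder, String fallback) {
        // 1. An explicit -Dnacos.server.ip always wins
        if (override != null && !override.isEmpty()) {
            return override;
        }
        // 2. Otherwise the first non-loopback, non-IPv6 address in enumeration order
        for (String ip : addressesInEnumerationOrder) {
            boolean loopback = ip.startsWith("127."); // simplified IPv4 loopback test
            boolean ipv6 = ip.contains(":");          // same test as indexOf(":") == -1
            if (!loopback && !ipv6) {
                return ip;
            }
        }
        // 3. Nothing suitable found: fall back to the resolved local host IP
        return fallback;
    }

    public static void main(String[] args) {
        // Hypothetical host where the docker0 bridge is enumerated before the intranet NIC
        List<String> addrs = List.of("127.0.0.1", "::1", "172.17.0.1", "10.0.0.1");
        System.out.println(pick(null, addrs, "10.0.0.1"));       // 172.17.0.1
        System.out.println(pick("10.0.0.1", addrs, "10.0.0.1")); // 10.0.0.1
    }
}
```

Since the result depends on whichever eligible address happens to be enumerated first, and nothing in cluster.conf influences that order, setting nacos.server.ip explicitly is the more predictable choice on multi-NIC or Docker hosts.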

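Whenever the cluster environment changes, it is best to delete the derby-data and protocol directories under the data directory before restarting, so that stale cached state does not cause further problems.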
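If you prefer to script this cleanup, a minimal Java sketch follows. The class name is hypothetical, and it assumes the default layout where both directories sit under {nacos home}/data. Stop the node before running it, since it permanently deletes both directories:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical cleanup helper: removes data/derby-data and data/protocol
 * under a Nacos home directory (assumed default layout). Run only while
 * the node is stopped.
 */
public class NacosDataCleanup {

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return; // nothing to clean
        }
        List<Path> toDelete;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order puts every child before its parent directory
            toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : toDelete) {
            Files.delete(p);
        }
    }

    static void cleanup(Path nacosHome) throws IOException {
        Path data = nacosHome.resolve("data");
        deleteRecursively(data.resolve("derby-data"));
        deleteRecursively(data.resolve("protocol"));
    }

    public static void main(String[] args) throws IOException {
        // Pass the Nacos home directory, e.g. the BASE_DIR used by startup.sh
        cleanup(Paths.get(args.length > 0 ? args[0] : "."));
    }
}
```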