# RabbitMQ node failure repair
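Today one node of a RabbitMQ cluster was unexpectedly kicked out of the cluster and could not be rejoined manually.
# Symptoms
The RabbitMQ node could not connect to the cluster and reported the following error: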
2022-06-15 15:41:05.244 [error] <0.274.0>
2022-06-15 15:41:05.244 [error] <0.274.0> BOOT FAILED
2022-06-15 15:41:05.244 [error] <0.274.0> ===========
2022-06-15 15:41:05.244 [error] <0.274.0> Timeout contacting cluster nodes: ['rabbit@rabbitmq-3','rabbit@rabbitmq-1'].
2022-06-15 15:41:05.245 [error] <0.274.0>
2022-06-15 15:41:05.245 [error] <0.274.0> BACKGROUND
2022-06-15 15:41:05.245 [error] <0.274.0> ==========
2022-06-15 15:41:05.245 [error] <0.274.0>
2022-06-15 15:41:05.245 [error] <0.274.0> This cluster node was shut down while other nodes were still running.
2022-06-15 15:41:05.245 [error] <0.274.0> To avoid losing data, you should start the other nodes first, then
2022-06-15 15:41:05.245 [error] <0.274.0> start this one. To force this node to start, first invoke
2022-06-15 15:41:05.245 [error] <0.274.0> "rabbitmqctl force_boot". If you do so, any changes made on other
2022-06-15 15:41:05.245 [error] <0.274.0> cluster nodes after this one was shut down may be lost.
2022-06-15 15:41:05.245 [error] <0.274.0>
2022-06-15 15:41:05.245 [error] <0.274.0> DIAGNOSTICS
2022-06-15 15:41:05.246 [error] <0.274.0> ===========
2022-06-15 15:41:05.246 [error] <0.274.0>
2022-06-15 15:41:05.246 [error] <0.274.0> attempted to contact: ['rabbit@rabbitmq-3','rabbit@rabbitmq-1']
2022-06-15 15:41:05.246 [error] <0.274.0>
2022-06-15 15:41:05.246 [error] <0.274.0> rabbit@rabbitmq-3:
2022-06-15 15:41:05.246 [error] <0.274.0> * connected to epmd (port 4369) on rabbitmq-3
2022-06-15 15:41:05.246 [error] <0.274.0> * node rabbit@rabbitmq-3 up, 'rabbit' application running
2022-06-15 15:41:05.246 [error] <0.274.0> rabbit@rabbitmq-1:
2022-06-15 15:41:05.246 [error] <0.274.0> * connected to epmd (port 4369) on rabbitmq-1
2022-06-15 15:41:05.247 [error] <0.274.0> * node rabbit@rabbitmq-1 up, 'rabbit' application running
2022-06-15 15:41:05.247 [error] <0.274.0>
2022-06-15 15:41:05.247 [error] <0.274.0> Current node details:
2022-06-15 15:41:05.247 [error] <0.274.0> * node name: 'rabbit@rabbitmq-2'
2022-06-15 15:41:05.247 [error] <0.274.0> * effective user's home directory: /var/lib/rabbitmq
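The peers that the node timed out on can be pulled straight out of a log like the one above. The following is a minimal sketch, assuming the log text is piped in on stdin; the function names extract_peers and peer_hosts are made up for illustration and are not part of RabbitMQ.

```shell
# Print the node names listed on the "Timeout contacting cluster nodes" line,
# one per line, de-duplicated. Reads log text from stdin.
extract_peers() {
  grep -F 'Timeout contacting cluster nodes:' \
    | sed -e 's/.*\[\(.*\)\].*/\1/' \
    | tr ',' '\n' \
    | tr -d "'" \
    | sort -u
}

# Same as above, but strip the "rabbit@" style prefix to get bare host names.
peer_hosts() {
  extract_peers | sed -e 's/^[^@]*@//'
}
```

For example, `peer_hosts < /path/to/node.log` would list the hosts to look at first; the log path here is a placeholder, since the post does not say where the log file lives.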
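# Troubleshooting process
On the failed node I checked the network and found no communication problems, yet both a manual reset and start_app returned errors. To recover quickly, I chose to remove the failed node from the cluster and then rebuild it.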
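The post does not show how the network was checked. One possible way is to probe the ports that clustering depends on, sketched below. It assumes the default ports (4369 for epmd, as seen in the log, and 25672 for inter-node traffic) and an nc binary that accepts -z and -w. The function name check_peers and the PROBE variable are hypothetical.

```shell
# Probe the clustering ports on each host given as an argument.
# PROBE is the command used to test "host port"; the default assumes an
# nc that supports -z (scan only) and -w (timeout in seconds).
check_peers() {
  probe=${PROBE:-"nc -z -w 3"}
  rc=0
  for host in "$@"; do
    for port in 4369 25672; do
      # $probe is intentionally unquoted so its options are split into words
      if $probe "$host" "$port" >/dev/null 2>&1; then
        echo "ok $host:$port"
      else
        echo "FAIL $host:$port"
        rc=1
      fi
    done
  done
  return $rc
}
```

Run from the failed node it would look like `check_peers rabbitmq-1 rabbitmq-3`. An all-ok result matches what the author saw: the network was fine, so the cause had to be elsewhere.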
# Remove the failed node from the cluster (run on rabbit@rabbitmq-1)
rabbitmqctl -n rabbit@rabbitmq-1 forget_cluster_node rabbit@rabbitmq-2
# On the failed node, move its data directory aside (renamed, so it is kept as a backup)
cd /var/lib/rabbitmq/
mv mnesia mnesia.bak
# Start rabbitmq-server
systemctl start rabbitmq-server
# Check the status
rabbitmqctl status
# Join the node to the cluster
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@rabbitmq-1
rabbitmqctl start_app
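The steps that run on the failed node can be wrapped in a small script with a dry-run mode, so the commands can be reviewed before anything is changed. This is only a sketch under a few assumptions. The names run, rebuild_node and DRY_RUN are invented here. The explicit systemctl stop and the timestamped backup name are additions that the original steps do not have. The forget_cluster_node step must already have been run on a healthy node.

```shell
# Dry-run by default: commands are printed, not executed. Set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# rebuild_node SEED [DATA_DIR]
#   SEED     a healthy cluster member, e.g. rabbit@rabbitmq-1
#   DATA_DIR the node's mnesia directory (defaults to the path used in this post)
rebuild_node() {
  seed=$1
  data_dir=${2:-/var/lib/rabbitmq/mnesia}
  backup="$data_dir.bak.$(date +%Y%m%d%H%M%S)"

  run systemctl stop rabbitmq-server    # assumption: make sure nothing holds the data dir
  run mv "$data_dir" "$backup"          # keep the old data instead of deleting it
  run systemctl start rabbitmq-server   # boots as a fresh standalone node
  run rabbitmqctl stop_app
  run rabbitmqctl join_cluster "$seed"
  run rabbitmqctl start_app
  run rabbitmqctl cluster_status        # confirm the node is listed again
}
```

Calling `rebuild_node rabbit@rabbitmq-1` prints the seven commands. Running it again with DRY_RUN=0 would execute them, so it is worth reading the printed plan first.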
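The failed node was recovered.
# Retrospective
To analyze the cause of the failure, I tried renaming mnesia.bak back to mnesia. Surprisingly, after a restart the failure did not reappear, and the node ran normally in the cluster.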
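The post does not say what happened to the freshly rebuilt mnesia directory during this experiment. A cautious way to repeat it is to keep both copies, as in the sketch below. The function name restore_data_dir and the .rebuilt suffix are hypothetical, and rabbitmq-server is assumed to be stopped while the directories are swapped.

```shell
# restore_data_dir CURRENT BACKUP
# Moves CURRENT aside as CURRENT.rebuilt, then renames BACKUP to CURRENT.
# Refuses to run if the backup is missing or CURRENT.rebuilt already exists.
restore_data_dir() {
  cur=$1
  bak=$2
  if [ ! -d "$bak" ]; then
    echo "backup not found: $bak" >&2
    return 1
  fi
  if [ -e "$cur.rebuilt" ]; then
    echo "refusing to overwrite: $cur.rebuilt" >&2
    return 1
  fi
  if [ -e "$cur" ]; then
    mv "$cur" "$cur.rebuilt" || return 1
  fi
  mv "$bak" "$cur"
}

# Intended use on the node (not executed here):
#   systemctl stop rabbitmq-server
#   restore_data_dir /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia.bak
#   systemctl start rabbitmq-server
```

Keeping the rebuilt directory means the experiment can be rolled back if the restored data fails to boot.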
Last updated: 2022/12/05, 22:29:05