# RabbitMQ Node Failure Recovery

Tags: CentOS, RabbitMQ

Today one node of our RabbitMQ cluster was unexpectedly dropped from the cluster and could not be rejoined manually.

# Symptoms

The RabbitMQ node could not connect to the cluster and logged the following error at boot:

2022-06-15 15:41:05.244 [error] <0.274.0> 
2022-06-15 15:41:05.244 [error] <0.274.0> BOOT FAILED
2022-06-15 15:41:05.244 [error] <0.274.0> ===========
2022-06-15 15:41:05.244 [error] <0.274.0> Timeout contacting cluster nodes: ['rabbit@rabbitmq-3','rabbit@rabbitmq-1'].
2022-06-15 15:41:05.245 [error] <0.274.0> 
2022-06-15 15:41:05.245 [error] <0.274.0> BACKGROUND
2022-06-15 15:41:05.245 [error] <0.274.0> ==========
2022-06-15 15:41:05.245 [error] <0.274.0> 
2022-06-15 15:41:05.245 [error] <0.274.0> This cluster node was shut down while other nodes were still running.
2022-06-15 15:41:05.245 [error] <0.274.0> To avoid losing data, you should start the other nodes first, then
2022-06-15 15:41:05.245 [error] <0.274.0> start this one. To force this node to start, first invoke
2022-06-15 15:41:05.245 [error] <0.274.0> "rabbitmqctl force_boot". If you do so, any changes made on other
2022-06-15 15:41:05.245 [error] <0.274.0> cluster nodes after this one was shut down may be lost.
2022-06-15 15:41:05.245 [error] <0.274.0> 
2022-06-15 15:41:05.245 [error] <0.274.0> DIAGNOSTICS
2022-06-15 15:41:05.246 [error] <0.274.0> ===========
2022-06-15 15:41:05.246 [error] <0.274.0> 
2022-06-15 15:41:05.246 [error] <0.274.0> attempted to contact: ['rabbit@rabbitmq-3','rabbit@rabbitmq-1']
2022-06-15 15:41:05.246 [error] <0.274.0> 
2022-06-15 15:41:05.246 [error] <0.274.0> rabbit@rabbitmq-3:
2022-06-15 15:41:05.246 [error] <0.274.0>   * connected to epmd (port 4369) on rabbitmq-3
2022-06-15 15:41:05.246 [error] <0.274.0>   * node rabbit@rabbitmq-3 up, 'rabbit' application running
2022-06-15 15:41:05.246 [error] <0.274.0> rabbit@rabbitmq-1:
2022-06-15 15:41:05.246 [error] <0.274.0>   * connected to epmd (port 4369) on rabbitmq-1
2022-06-15 15:41:05.247 [error] <0.274.0>   * node rabbit@rabbitmq-1 up, 'rabbit' application running
2022-06-15 15:41:05.247 [error] <0.274.0> 
2022-06-15 15:41:05.247 [error] <0.274.0> Current node details:
2022-06-15 15:41:05.247 [error] <0.274.0>  * node name: 'rabbit@rabbitmq-2'
2022-06-15 15:41:05.247 [error] <0.274.0>  * effective user's home directory: /var/lib/rabbitmq
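
When scripting follow-up checks, the list of peers this node tried to reach can be pulled straight out of the log. The following is a minimal sketch that assumes the log format shown above. The function name `contacted_nodes` is hypothetical and not part of RabbitMQ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a RabbitMQ boot-failure log on stdin and print
# the peers from the first "attempted to contact" line, one node per line.
contacted_nodes() {
  grep -m1 'attempted to contact:' \
    | sed -e 's/.*attempted to contact: *\[//' -e 's/\].*//' \
    | tr ',' '\n' \
    | tr -d "' "
}
```

It would be used as `contacted_nodes < /path/to/rabbit.log`, where the log path is a placeholder that depends on the installation.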

# Troubleshooting

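On the failed node I checked the network and found no connectivity problems, yet both a manual reset and start_app returned errors. To restore service quickly, I decided to remove the failed node from the cluster and then rebuild it.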
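
A network check of this kind can be sketched as a small script run from the failed node. Several things here are assumptions rather than details from the incident: the helper names `port_open` and `check_peers` are hypothetical, port 25672 is the default inter-node port (AMQP port 5672 plus 20000) and may differ if the cluster was configured otherwise, and the script relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port opens
# within 3 seconds. Uses bash's built-in /dev/tcp pseudo-device.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Check epmd (4369) and the default inter-node port (25672, an assumption)
# on every peer host passed as an argument.
check_peers() {
  local host port
  for host in "$@"; do
    for port in 4369 25672; do
      if port_open "$host" "$port"; then
        echo "$host:$port reachable"
      else
        echo "$host:$port NOT reachable"
      fi
    done
  done
}
```

For the cluster in the log above, the call would be `check_peers rabbitmq-1 rabbitmq-3`. A reachable port only shows that TCP connectivity exists. It does not prove that the peers accept this node as a cluster member.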

# Remove the failed node from the cluster (run on rabbit@rabbitmq-1)
rabbitmqctl -n rabbit@rabbitmq-1 forget_cluster_node rabbit@rabbitmq-2

# On the failed node, move its data directory aside (rabbitmq-server must be stopped)
cd /var/lib/rabbitmq/
mv mnesia mnesia.bak

# Start rabbitmq-server
systemctl start rabbitmq-server
# Check its status
rabbitmqctl status

# Join the node to the cluster
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@rabbitmq-1
rabbitmqctl start_app

With that, the failed node was back in the cluster.
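
One way to confirm the recovery is to check that the rebuilt node appears in the cluster status. This is only a sketch: `node_listed` is a hypothetical helper that searches for the node name in whatever text it receives, so it does not distinguish a running member from a stopped one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the node name given as $1 appears as a
# whole word in the cluster status text read from stdin.
node_listed() {
  grep -qwF -- "$1"
}
```

On any cluster member it could be used as `rabbitmqctl cluster_status | node_listed rabbit@rabbitmq-2 && echo "rabbit@rabbitmq-2 is listed"`.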

# Retrospective

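To investigate the root cause, I tried renaming mnesia.bak back to mnesia. Surprisingly, after a restart the failure did not recur, and the node ran normally in the cluster.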

Last updated: 2022/12/05, 22:29:05
