Ver configuración de la parte 1.
El estado previo del cluster es el siguiente:
[postgres@postgres1 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------------------+---------+-----------+-----------+----------+----------+----------+----------------------------------------------
1 | postgres1 | primary | * running | | default | 100 | 3 | host=192.168.0.101 user=repmgr dbname=repmgr
2 | postgres2 | standby | running | postgres1 | default | 100 | 3 | host=192.168.0.102 user=repmgr dbname=repmgr
3 | postgres3 | standby | running | postgres1 | default | 100 | 3 | host=192.168.0.103 user=repmgr dbname=repmgr
4 | postgres4-witness | witness | * running | postgres1 | default | 0 | n/a | host=192.168.0.104 user=repmgr dbname=repmgr
Simulamos un fallo en el nodo primary (postgres1) para ver el failover automático:
[postgres@postgres1 ~]$ sudo systemctl stop postgresql-13
Y el estado del cluster en ese momento:
[postgres@postgres2 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------------------+---------+---------------+-------------+----------+----------+----------+----------------------------------------------
1 | postgres1 | primary | ? unreachable | ? | default | 100 | | host=192.168.0.101 user=repmgr dbname=repmgr
2 | postgres2 | standby | running | ? postgres1 | default | 100 | 3 | host=192.168.0.102 user=repmgr dbname=repmgr
3 | postgres3 | standby | running | ? postgres1 | default | 100 | 3 | host=192.168.0.103 user=repmgr dbname=repmgr
4 | postgres4-witness | witness | * running | ? postgres1 | default | 0 | n/a | host=192.168.0.104 user=repmgr dbname=repmgr
WARNING: following issues were detected
- unable to connect to node "postgres1" (ID: 1)
- node "postgres1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "postgres2" (ID: 2)'s upstream node "postgres1" (ID: 1)
- unable to determine if node "postgres2" (ID: 2) is attached to its upstream node "postgres1" (ID: 1)
- unable to connect to node "postgres3" (ID: 3)'s upstream node "postgres1" (ID: 1)
- unable to determine if node "postgres3" (ID: 3) is attached to its upstream node "postgres1" (ID: 1)
- unable to connect to node "postgres4-witness" (ID: 4)'s upstream node "postgres1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
Tras un breve espacio de tiempo, comprobamos el estado del cluster otra vez:
[postgres@postgres4 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------------------+---------+-----------+-----------+----------+----------+----------+----------------------------------------------
1 | postgres1 | primary | - failed | ? | default | 100 | | host=192.168.0.101 user=repmgr dbname=repmgr
2 | postgres2 | primary | * running | | default | 100 | 4 | host=192.168.0.102 user=repmgr dbname=repmgr
3 | postgres3 | standby | running | postgres2 | default | 100 | 3 | host=192.168.0.103 user=repmgr dbname=repmgr
4 | postgres4-witness | witness | * running | postgres2 | default | 0 | n/a | host=192.168.0.104 user=repmgr dbname=repmgr
WARNING: following issues were detected
- unable to connect to node "postgres1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
Observad también, que el nodo postgres3 ha cambiado hacia el nuevo primary (postgres2)!!
Una vez arreglado el problema (provocado por nosotros) si queremos unir el antiguo nodo primario (postgres1) al cluster dónde está el nodo primario actual (postgres2), vemos que nos pide el rewind:
[postgres@postgres1 ~]$ repmgr -d 'host=192.168.0.102 user=repmgr dbname=repmgr' node rejoin
ERROR: this node cannot attach to rejoin target node 2
DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/16000028
HINT: use --force-rewind to execute pg_rewind
[postgres@postgres1 ~]$ repmgr -d 'host=192.168.0.102 user=repmgr dbname=repmgr' node rejoin --force-rewind
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/16000028
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-13/bin/pg_rewind -D '/var/lib/pgsql/13/data' --source-server='host=192.168.0.102 user=repmgr dbname=repmgr'"
NOTICE: 0 files copied to /var/lib/pgsql/13/data
INFO: creating replication slot as user "repmgr"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.0.101 user=repmgr dbname=repmgr"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "sudo /bin/systemctl start postgresql-13"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
El estado del cluster entonces queda de la siguiente manera:
[postgres@postgres4 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------------------+---------+-----------+-----------+----------+----------+----------+----------------------------------------------
1 | postgres1 | standby | running | postgres2 | default | 100 | 3 | host=192.168.0.101 user=repmgr dbname=repmgr
2 | postgres2 | primary | * running | | default | 100 | 4 | host=192.168.0.102 user=repmgr dbname=repmgr
3 | postgres3 | standby | running | postgres2 | default | 100 | 4 | host=192.168.0.103 user=repmgr dbname=repmgr
4 | postgres4-witness | witness | * running | postgres2 | default | 0 | n/a | host=192.168.0.104 user=repmgr dbname=repmgr
Si queremos ver los eventos del cluster:
[postgres@postgres2 ~]$ repmgr cluster event
Node ID | Name | Event | OK | Timestamp | Details
---------+-------------------+----------------------------+----+---------------------+------------------------------------------------------------------------------------------------
2 | postgres2 | child_node_new_connect | t | 2021-01-08 18:28:09 | new standby "postgres1" (ID: 1) has connected
1 | postgres1 | repmgrd_standby_reconnect | t | 2021-01-08 18:28:04 | node restored as standby after 26 seconds, monitoring connection to upstream node 2
1 | postgres1 | node_rejoin | t | 2021-01-08 18:28:03 | node 1 is now attached to node 2
3 | postgres3 | repmgrd_failover_follow | t | 2021-01-08 18:20:39 | node "postgres3" (ID: 3) now following new upstream node "postgres2" (ID: 2)
3 | postgres3 | standby_follow | t | 2021-01-08 18:20:39 | standby attached to upstream node "postgres2" (ID: 2)
2 | postgres2 | child_node_new_connect | t | 2021-01-08 18:20:39 | new witness "postgres4-witness" (ID: 4) has connected
2 | postgres2 | child_node_new_connect | t | 2021-01-08 18:20:39 | new standby "postgres3" (ID: 3) has connected
4 | postgres4-witness | repmgrd_upstream_reconnect | t | 2021-01-08 18:20:33 | witness monitoring connection to primary node "postgres2" (ID: 2)
4 | postgres4-witness | repmgrd_failover_follow | t | 2021-01-08 18:20:33 | witness node "postgres4-witness" (ID: 4) now following new primary node "postgres2" (ID: 2)
2 | postgres2 | repmgrd_reload | t | 2021-01-08 18:20:33 | monitoring cluster primary "postgres2" (ID: 2)
2 | postgres2 | repmgrd_failover_promote | t | 2021-01-08 18:20:33 | node "postgres2" (ID: 2) promoted to primary; old primary "postgres1" (ID: 1) marked as failed