Ver parte 1
Ver parte 2
Ver parte 3
Vamos a hacer una prueba de balanceo automático tras la pérdida del nodo primario (postgres2) y como balancea el failover a un nodo secundario (postgres1), tras la parada de la conexiones en el pgbouncer, posterior reconfiguración, y conexión posterior al nuevo primario (postgres1).
Estado previo del cluster:
[postgres@postgres4 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------------------+---------+-----------+-----------+----------+----------+----------+----------------------------------------------
1 | postgres1 | standby | running | postgres2 | default | 100 | 9 | host=192.168.0.101 user=repmgr dbname=repmgr
2 | postgres2 | primary | * running | | default | 100 | 10 | host=192.168.0.102 user=repmgr dbname=repmgr
3 | postgres3 | standby | running | postgres2 | default | 100 | 10 | host=192.168.0.103 user=repmgr dbname=repmgr
4 | postgres4-witness | witness | * running | postgres2 | default | 0 | n/a | host=192.168.0.104 user=repmgr dbname=repmgr
Desde una conexión cliente nos conectamos para comprobar la ip a la que nos ha conectado el pgbouncer (y no cerraremos la conexión para ver el balanceo automático tras el fallo del primary):
[postgres@postgres4 ~]$ psql 'host=192.168.0.104 user=appdb port=6432'
psql (13.1)
Type "help" for help.
appdb=> select inet_server_addr(), inet_server_port();
inet_server_addr | inet_server_port
------------------+------------------
192.168.0.102 | 5432
(1 row)
Desde el pgbouncer vemos las conexiones a las bases de datos y lo monitorizaremos cada 5 segundos:
[postgres@postgres4 ~]$ psql --host=192.168.0.104 --username=postgres --port=6432 --dbname=pgbouncer
psql (13.1, server 1.15.0/bouncer)
Type "help" for help.
pgbouncer=# show databases;
name | host | port | database | force_user | pool_size | reserve_pool | pool_mode | max_connections | current_connections | paused | disabled
-----------+-----------+------+-----------+------------+-----------+--------------+-----------+-----------------+---------------------+--------+----------
appdb | postgres2 | 5432 | appdb | | 20 | 5 | | 0 | 5 | 0 | 0
pgbouncer | | 6432 | pgbouncer | pgbouncer | 2 | 0 | statement | 0 | 0 | 0 | 0
(2 rows)
pgbouncer=# \watch 5;
Provocamos el fallo (simulado) en el nodo primario (postgres2):
[postgres@postgres2 ~]$ sudo systemctl stop postgresql-13
Vemos el estado del cluster:
[postgres@postgres4 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------------------+---------+-----------+-----------+----------+----------+----------+----------------------------------------------
1 | postgres1 | primary | * running | | default | 100 | 11 | host=192.168.0.101 user=repmgr dbname=repmgr
2 | postgres2 | primary | - failed | ? | default | 100 | | host=192.168.0.102 user=repmgr dbname=repmgr
3 | postgres3 | standby | running | postgres1 | default | 100 | 10 | host=192.168.0.103 user=repmgr dbname=repmgr
4 | postgres4-witness | witness | * running | postgres1 | default | 0 | n/a | host=192.168.0.104 user=repmgr dbname=repmgr
WARNING: following issues were detected
- unable to connect to node "postgres2" (ID: 2)
HINT: execute with --verbose option to see connection error messages
Vemos la parada y arranque del pgbouncer en la select que teníamos ejecutándose cada 5 segundos:
Sat 09 Jan 2021 06:31:05 PM CET (every 5s)
name | host | port | database | force_user | pool_size | reserve_pool | pool_mode | max_connections | current_connections | paused | disabled
-----------+-----------+------+-----------+------------+-----------+--------------+-----------+-----------------+---------------------+--------+----------
appdb | postgres2 | 5432 | appdb | | 20 | 5 | | 0 | 0 | 1 | 0
pgbouncer | | 6432 | pgbouncer | pgbouncer | 2 | 0 | statement | 0 | 0 | 0 | 0
(2 rows)
Sat 09 Jan 2021 06:31:10 PM CET (every 5s)
name | host | port | database | force_user | pool_size | reserve_pool | pool_mode | max_connections | current_connections | paused | disabled
-----------+-----------+------+-----------+------------+-----------+--------------+-----------+-----------------+---------------------+--------+----------
appdb | postgres1 | 5432 | appdb | | 20 | 5 | | 0 | 0 | 0 | 0
pgbouncer | | 6432 | pgbouncer | pgbouncer | 2 | 0 | statement | 0 | 0 | 0 | 0
(2 rows)
Comprobamos también desde nuestra conexión cliente a la base de datos primary (postgres2) como ha cambiado a la nueva primary (postgres1):
appdb=> select inet_server_addr(), inet_server_port();
inet_server_addr | inet_server_port
------------------+------------------
192.168.0.101 | 5432
(1 row)
Una vez recuperado el fallo (provocado por nosotros) que hizo este failover automático, si queremos unir el antiguo primary al cluster, sería:
[postgres@postgres2 ~]$ repmgr -d 'host=192.168.0.101 user=repmgr dbname=repmgr' node rejoin --force-rewind
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-13/bin/pg_rewind -D '/var/lib/pgsql/13/data' --source-server='host=192.168.0.101 user=repmgr dbname=repmgr'"
NOTICE: 0 files copied to /var/lib/pgsql/13/data
INFO: creating replication slot as user "repmgr"
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=192.168.0.102 user=repmgr dbname=repmgr"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "sudo /bin/systemctl start postgresql-13"
WARNING: node "postgres2" attached in state "startup"
INFO: waiting for node "postgres2" (ID: 2) to connect to new primary; 1 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "postgres1" (ID: 2) is currrently attached to its upstream node in state "startup"
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
El estado final del cluster sería:
[postgres@postgres4 ~]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------------------+---------+-----------+-----------+----------+----------+----------+----------------------------------------------
1 | postgres1 | primary | * running | | default | 100 | 11 | host=192.168.0.101 user=repmgr dbname=repmgr
2 | postgres2 | standby | running | postgres1 | default | 100 | 10 | host=192.168.0.102 user=repmgr dbname=repmgr
3 | postgres3 | standby | running | postgres1 | default | 100 | 10 | host=192.168.0.103 user=repmgr dbname=repmgr
4 | postgres4-witness | witness | * running | postgres1 | default | 0 | n/a | host=192.168.0.104 user=repmgr dbname=repmgr