PostgreSQL의 auto-failover 솔루션인 pgpool 은 multi-master 를 지원하지 않기 때문에
MySQL의 MMM 처럼 failover 된 OLD Master 를 자동으로 cluster의 slave 로 포함시키는 기능이 없습니다.
그래서 old master 를 new master의 slave 로 재구성하는 작업을 수작업으로 해야만 하는데
이 과정을 pgpool 의 recovery_node 기능을 통해 자동화 할 수 있습니다.
pgpool 구성 방법은 아래 내용 참고

자동화 설정

수행 과정

  • db down
  • pgpool failover 감지
    • failover command 수행 됨
  • failover 완료 후 pcp_recovery_node -n node-id ( new slave 노드 ) 커맨드 수행
    • pgpool.conf의 recovery_1st_stage_command = ‘recover.sh’ 설정에 따라 recover.sh 수행 됨
    • old master를 slave 로 재구성 후
    • old master 가 pgpool cluster로 attach 됨

아래 extension 설치 , pgpool 설정, 스크립트를 모든 노드에 동일하게 설정한다

pgpool_recovery extension 설치

cp /usr/pgsql-11/lib/pgpool* /home1/testuser/psql/engn/postgresql-11.7/lib/
cp /usr/pgsql-11/share/extension/pgpool* /home1/testuser/psql/engn/postgresql-11.7/share/extension

postgres=# create extension pgpool_recovery;
CREATE EXTENSION
postgres=# select * from pg_extension;
      extname       | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
--------------------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql            |       10 |           11 | f              | 1.0        |           |
 pg_stat_statements |       10 |         2200 | t              | 1.6        |           |
 pgpool_recovery    |       10 |         2200 | t              | 1.2        |           |
(3 rows)

postgres=# \c template1
You are now connected to database "template1" as user "testuser".

template1=# create extension pgpool_recovery;
CREATE EXTENSION
template1=# select * from pg_extension;
      extname       | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
--------------------+----------+--------------+----------------+------------+-----------+--------------
 plpgsql            |       10 |           11 | f              | 1.0        |           |
 pg_stat_statements |       10 |         2200 | t              | 1.6        |           |
 pgpool_recovery    |       10 |         2200 | t              | 1.2        |           |
(3 rows)


postgres=# select proname,pronamespace,prosrc from pg_proc where proname like '%pgpool%';
       proname       | pronamespace |       prosrc
---------------------+--------------+---------------------
 pgpool_recovery     |         2200 | pgpool_recovery
 pgpool_recovery     |         2200 | pgpool_recovery
 pgpool_recovery     |         2200 | pgpool_recovery
 pgpool_remote_start |         2200 | pgpool_remote_start
 pgpool_pgctl        |         2200 | pgpool_pgctl
 pgpool_switch_xlog  |         2200 | pgpool_switch_xlog
(6 rows)

=> pgpool_recovery extension 을 postgres , template1 database 에 설치

pgpool 설정

[testuser@testserver2 12:04:52 ~/psql/scripts ]$ vi /etc/pgpool-II/pgpool.conf

#------------------------------------------------------------------------------
# ONLINE RECOVERY
#------------------------------------------------------------------------------

recovery_user = 'testuser'
recovery_password = 'qhdks123'
recovery_1st_stage_command = 'recover.sh'

script 생성

[testuser@testserver2 11:59:52 ~/psql/scripts ]$ pwd
/home1/testuser/psql/scripts
[testuser@testserver2 11:59:57 ~/psql/scripts ]$ ls -ltr
total 16
-rwx------ 1 testuser testuser 177 Sep 26 20:49 start_instance.sh
-rwx------ 1 testuser testuser 280 Sep 26 20:49 rewind.sh
-rwx------ 1 testuser testuser 146 Sep 26 20:49 stop_instance.sh
-rwx------ 1 testuser testuser 724 Sep 26 23:31 failover.sh
  • stop_instance.sh
#!/bin/bash
PGDATA="/home1/testuser/psql/engn/PGSQL"
PATH="/home1/testuser/psql/engn/postgresql-11.7/bin/:$PATH"
pg_ctl -D $PGDATA stop -m immediate
  • start_instance.sh
#!/bin/bash
PGDATA="/home1/testuser/psql/engn/PGSQL"
PATH="/home1/testuser/psql/engn/postgresql-11.7/bin/:$PATH"
pg_ctl -w -s -D $PGDATA start 2>/dev/null 1>/dev/null < /dev/null &
  • rewind.sh
#!/bin/bash
master_node_host_name=$1
PGDATA="/home1/testuser/psql/engn/PGSQL"
PATH="/home1/testuser/psql/engn/postgresql-11.7/bin/:$PATH"
pg_rewind --target-pgdata=${PGDATA} --source-server="port=3000 user=testuser dbname=postgres host=${master_node_host_name}" >> /var/tmp/recover.log
  • failover.sh
#!/bin/bash
if [ $# -ne 3 ]
then
        echo "failover failed_node new_master trigger_file"
        exit 1
fi

FAILED_NODE=$1
NEW_MASTER=$2
TRIGGER_FILE=$3

# Do nothing if standby server goes down
if [$FAILED_NODE = 1]; then
        echo "Standby Server is downed\n" >> /etc/pgpool-II/failover.log
        exit 0
fi

echo "failover.sh FAILED_NODE:${FAILED_NODE}; NEW_MASTER:${NEW_MASTER}; at $(date)\n" >> /etc/pgpool-II/failover.log

#sudo -u testuser ssh -T testuser@$NEW_MASTER touch $TRIGGER_FILE
su - testuser -c "ssh -T testuser@$NEW_MASTER touch $TRIGGER_FILE"
su - testuser -c "ssh -T testuser@$NEW_MASTER 'psql -p 3000 -d postgres -c "\""select * from pg_create_physical_replication_slot('\''repl_slot_01'\'')"\""'"


exit 0
  • recover.sh => recover.sh 는 반드시 $PGDATA 아래에 있어야 함, pgpool 에 path가 하드코딩 되어있음
#!/bin/sh
set -x
master_node_host_name=`hostname`
master_db_cluster=$1
recovery_node_hostname=$2
recovery_db_cluster=$3
tmp=/tmp/mytemp$$
trap "rm -f $tmp" 0 1 2 3 15

PGDATA="/home1/testuser/psql/engn/PGSQL"
PATH="/home1/testuser/psql/engn/postgresql-11.7/bin/:$PATH"

export PATH PGDATA

# make sure the old master is down
ssh -T $recovery_node_hostname /home1/testuser/psql/scripts/stop_instance.sh
# rewind the old master
ssh -T ${recovery_node_hostname} /home1/testuser/psql/scripts/rewind.sh ${master_node_host_name}

cat > $tmp <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$master_node_host_name port=3000 user=testuser'
primary_slot_name = 'repl_slot_01'
recovery_target_timeline = 'latest'
trigger_file='/home1/testuser/psql/engn/PGSQL/failover_trigger'
EOF

scp $tmp $recovery_node_hostname:$recovery_db_cluster/recovery.conf

ssh -T ${recovery_node_hostname} /home1/testuser/psql/scripts/start_instance.sh
  • pgpool_remote_start => recover.sh 수행 후 $PGDATA/pgpool_remote_start 커맨드도 수행되도록 pgpool 에 하드코딩 되어있음 new slave 기동하는 커맨드를 recover.sh 에 포함했으므로 touch로 파일 생성만 함, $PGDATA 아래에 생성해야함
[testuser@testserver2 12:51:01 ~/psql/engn/PGSQL ]$ cat pgpool_remote_start
exit 0

테스트

  • status
postgres=# show pool_nodes;
 node_id |     hostname      | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-------------------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | testserver1 | 3000 | up     | 0.500000  | standby | 688        | true              | 0                 | 2020-09-27 11:42:00
 1       | testserver2 | 3000 | up     | 0.500000  | primary | 2632       | false             | 0                 | 2020-09-27 10:23:29

=> t1802 가 primary 인 상황

  • primary db down
[testuser@testserver2 12:19:45 ~/psql/scripts ]$ stop
waiting for server to shut down.... done
server stopped
  • failover 확인
[testuser@testserver1 12:20:02 ~ ]$ psql -p 4000 -d postgres
psql (11.7)
Type "help" for help.

postgres=# show pool_nodes;
 node_id |     hostname      | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-------------------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | testserver1 | 3000 | up     | 0.500000  | primary | 826        | true              | 0                 | 2020-09-27 12:20:17
 1       | testserver2 | 3000 | down   | 0.500000  | standby | 2596       | false             | 0                 | 2020-09-27 12:20:17
(2 rows)

=> t1801로 failover 완료

  • pcp_recovery_node 커맨드 수행
[testuser@testserver1 12:20:42 ~/psql/scripts ]$ pcp_recovery_node -n 1
Password:
pcp_recovery_node -- Command Successful



2020-09-27 12:21:04: pid 48880: DEBUG:  PCP: processing recovery request
2020-09-27 12:21:04: pid 48880: DETAIL:  start online recovery
2020-09-27 12:21:04: pid 48880: LOG:  starting recovering node 1
2020-09-27 12:21:04: pid 48880: LOG:  executing recovery
2020-09-27 12:21:04: pid 48880: DETAIL:  starting recovery command: "SELECT pgpool_recovery('recover.sh', 'testserver2', '/home1/testuser/psql/engn/PGSQL', '3000', 1)"
2020-09-27 12:21:04: pid 48880: LOG:  executing recovery
2020-09-27 12:21:04: pid 48880: DETAIL:  disabling statement_timeout
2020-09-27 12:21:04: pid 48880: DEBUG:  executing recovery, start recovery
2020-09-27 12:21:04: pid 48880: DEBUG:  executing recovery, finish recovery
2020-09-27 12:21:04: pid 48880: LOG:  node recovery, 1st stage is done
2020-09-27 12:21:04: pid 48880: DEBUG:  executing remote start
2020-09-27 12:21:04: pid 48880: DETAIL:  start pgpool_remote_start
2020-09-27 12:21:05: pid 48880: DEBUG:  executing remote start
2020-09-27 12:21:05: pid 48880: DETAIL:  finish pgpool_remote_start

=> t1801, t1802 아무 노드나 pgpool 살아있는 곳에서 수행하면 됨

  • 클러스터 확인
[testuser@testserver1 12:21:00 ~/psql/engn/PGSQL ]$ psql -p 4000 -d postgres
psql (11.7)
Type "help" for help.

postgres=# show pool_nodes;
 node_id |     hostname      | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-------------------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | testserver1 | 3000 | up     | 0.500000  | primary | 868        | false             | 0                 | 2020-09-27 12:21:10
 1       | testserver2 | 3000 | up     | 0.500000  | standby | 2602       | true              | 0                 | 2020-09-27 12:21:10
(2 rows)

=> old master인 t1802 가 slave 로 재구성 후 cluster 에 자동으로 포함된 것 확인