Disaster Recovery Strategies for PostgreSQL Deployment on Kubernetes

volcano eruption during daytime

For a production environment, it is important to design a disaster recovery strategy. The traditional disaster recovery strategy is to make regular database backups and store them. However, large databases may take hours or several days to restore. Cloud computing has made disaster recovery more efficient and simple. Cloud computing enables the deployment of applications or databases across multiple availability zones or multiple regions to achieve high availability and disaster recovery. 

Today, many companies have migrated their applications to a microservices architecture and deploy the application systems including databases on Kubernetes. However, how to implement disaster recovery for databases deployed on Kubernetes? In this post, I'd like to describe the disaster recovery strategies for PostgreSQL deployment on Kubernetes.

Because database is a stateful service, it is hard to manage and scale databases on Kubernetes. PostgreSQL Operators (e.g. Crunchy PostgreSQL Operator, Zalando PostgreSQL Operator) allow you to deploy and manage PostgreSQL clusters on Kubernetes. Both Crunchy PostgreSQL Operator and Zalando PostgreSQL Operator support for multi-cluster deployments used for disaster recovery strategies and high availability. In the next sections, I am going to describe how to use Crunchy PostgreSQL Operator to achieve disaster recovery and high availability.

Multi-Cluster Deployments 

Crunchy PostgreSQL Operator provides an advanced configuration to distribute your PostgreSQL database clusters across multiple Kubernetes clusters (created in multiple zones or multiple regions) to minimize service downtime. 

Multi-cluster architecture

Below is the multi-cluster architecture.

As the architecture above shows, there are two PostgreSQL clusters deployed in different Kubernetes clusters, and one is deployed as  active PostgreSQL cluster and another is deployed as standby PostgreSQL cluster. The active PostgreSQL cluster pushes WAL to an external storage (S3 or GCS).  One PostgreSQL instance in the standby PostgreSQL cluster continuously reads WAL from the external storage and and the replicas in the standby PostgreSQL cluster are cascading replicas. The standby PostgreSQL cluster is read-only.

If the active PostgreSQL cluster goes down, you can promote the standby PostgreSQL cluster manually. And the old active PostgreSQL cluster can be recovered as a standby PostgreSQL cluster.

Next, let's verify the multiple clusters architecture on Google Kubernetes Engine (GKE).

Creating prerequisite resources

Assuming you have installed and setup Google Cloud SDK.

Configure IAM 

Create an IAM service account for PostgreSQL cluster and add the storage.admin role to it so that it can access Google Cloud Storage.

export PROJECT_ID=$(gcloud config get-value project)
gcloud iam service-accounts create postgres-gcs

gcloud iam service-accounts keys create --iam-account \
    postgres-gcs@${PROJECT_ID}.iam.gserviceaccount.com gcs-key.json

gcloud projects add-iam-policy-binding \
   --role roles/storage.admin ${PROJECT_ID} \

Then the service account key file will be generated in the current directory.

Create GKE clusters

Create two GKE clusters in different regions.

gcloud container clusters create cluster1 \
    --release-channel regular --zone us-central1-a

gcloud container clusters create cluster2 \
    --release-channel regular --zone us-east1-b

Create a cloud storage bucket

Create a cloud storage bucket to store the backups and WAL.

gsutil mb gs://crunchy-postgres-backups

Install PostgreSQL Operator

Install PostgreSQL Operator in both GKE clusters. Here we use the default storage class. 

curl -LO https://raw.githubusercontent.com/CrunchyData/postgres-operator/v4.7.0/installers/kubectl/postgres-operator.yml

kubectl create namespace pgo
kubectl apply -f postgres-operator.yml

Once PostgreSQL Operator is deployed successfully, the pod will be in "Running" status. 

kubectl get pod -n pgo
NAME                                READY   STATUS      RESTARTS   AGE
pgo-deploy-qlcnb                    0/1     Completed   0          3m17s
postgres-operator-85cbf887f-lnzlk   4/4     Running     1          2m27s

Install pgo client

Install pgo client command.

curl -LO https://raw.githubusercontent.com/CrunchyData/postgres-operator/v4.7.0/installers/kubectl/client-setup.sh
chmod +x client-setup.sh

cat <<EOF >> ~/.bashrc
export PATH=${HOME}/.pgo/pgo:$PATH
export PGOUSER=${HOME}/.pgo/pgo/pgouser
export PGO_CA_CERT=${HOME}/.pgo/pgo/client.crt
export PGO_CLIENT_CERT=${HOME}/.pgo/pgo/client.crt
export PGO_CLIENT_KEY=${HOME}/.pgo/pgo/client.key
export PGO_NAMESPACE=pgo

source ~/.bashrc

Run "pgo version" to verify the installation.

kubectl -n pgo port-forward svc/postgres-operator 8443:8443 &

If the installation is successful, you will see the version 4.7.0 in the result.

pgo version
pgo client version 4.7.0
pgo-apiserver version 4.7.0

Deploy active PostgreSQL cluster

First, let's deploy the active PostgreSQL cluster in GKE cluster1.

[GKE cluster1]# pgo create cluster hippo \
--replica-count=2 \
--pgbackrest-storage-type=gcs \
--pgbackrest-gcs-bucket=crunchy-postgres-backups \
--pgbackrest-gcs-key=gcs-key.json \
--password-superuser=postgres \
--password-replication=postgres \

The command line options: 

  • --replica-count: deploy a PostgreSQL cluster with two replicas
  • --pgbackrest-storage-type: push the backups and WAL to GCS
  • --pgbackrest-gcs-bucket: the GCS bucket name used for storing backups and WAL
  • --pgbackrest-gcs-key: the file name that contains the service account key
  • --password-superuser, --password-replication, --password: the password for PostgreSQL superuser, replication user and the standard PostgreSQL user. Here we specify all passwords as "postgres".

Verify the cluster deployment. If successful, you can see one primary and two replicas in "UP" status. 

[GKE cluster1]# pgo test hippo
cluster : hippo
primary ( UP
replica ( UP
primary (hippo-85474cd644-hpprz): UP
replica (hippo-kvko-77bbdd4756-2hjr7): UP
replica (hippo-xvgh-7bb554d876-6pxr8): UP 

Connect to primary PostgreSQL and run several write queries.

[GKE cluster1]# kubectl port-forward svc/hippo 5432:5432 -n pgo &

[GKE cluster1]# psql -h -U postgres
postgres=# create table t1 (id int);
postgres=# insert into t1 select generate_series(1,10);

Deploy standby PostgreSQL cluster

Next, let's deploy the standby PostgreSQL cluster with two cascading replicas in GKE cluster2

Please notice that you need to include the passwords for PostgreSQL superuser, replication user and  the standard PostgreSQL user specified in the active cluster deployment. And ensure the service account key file is in the current directory.

[GKE cluster2]# pgo create cluster hippo-standby \
--standby \
--replica-count=2 \
--pgbackrest-repo-path=/backrestrepo/hippo-backrest-shared-repo \
--pgbackrest-storage-type=gcs \
--pgbackrest-gcs-bucket=crunchy-postgres-backups \
--pgbackrest-gcs-key=gcs-key.json \
--password-superuser=postgres \
--password-replication=postgres \

Verify the standby cluster deployment.

[GKE cluster2]# pgo test hippo-standby
cluster : hippo-standby
                primary ( UP
                replica ( UP
                primary (hippo-standby-7d949bb7f8-d7xwn): UP
                replica (hippo-standby-dsiw-5455d859d9-sr8c7): UP
                replica (hippo-standby-lpqp-599c56f5cc-dgbnd): UP

Once all the instances are in UP status, let's connect to the standby PostgreSQL and verify data replication.

[GKE cluster2]# kubectl port-forward svc/hippo-standby 5432:5432 -n pgo &

[GKE cluster2]# psql -h -U postgres
postgres=# select * from t1;


As you can see in the result above, replication has been configured correctly.

Promote standby PostgreSQL cluster

First, let's shutdown the active PostgreSQL cluster.  

[GKE cluster1]# pgo update cluster hippo --shutdown

You need to ensure the active PostgreSQL cluster is  shutdown completely to prevent a split-brain situation.

[GKE cluster1]# pgo test hippo
cluster : hippo
                primary (): DOWN

Once the active cluster is shutdown completely, you can perform promotion in cluster2: 

[GKE cluster2]# pgo update cluster hippo-standby --promote-standby

Now the promoted standby PostgreSQL cluster is running as an active cluster and can accept write/read queries. 

[GKE cluster2]# psql -h -U postgres
postgres=# insert into t1 select generate_series(11,20);

Recover the old active cluster as a standby cluster

Next, let's recover the failed active PostgrSQL cluster as a standby PostgreSQL cluster.

[GKE cluster1]# pgo update cluster hippo --enable-standby
Enabling standby mode will result in the deltion of all PVCs for this cluster!
Data will only be retained if the proper retention policy is configured for any associated storage
 classes and/or persistent volumes.
Please proceed with caution.
WARNING: Are you sure? (yes/no): yes
updated pgcluster hippo

Ensure the PVCs are recreated and the PostgreSQL is configured as a standby cluster.
[GKE cluster1]# kubectl get pvc -n pgo
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
hippo             Bound    pvc-196dcdf4-ad6a-448d-a31f-ed28464a19d6   1Gi        RWO            standard       7m53s
hippo-kvko        Bound    pvc-fc94bbfb-cf08-4ea7-bd2a-3f2d8531a174   1Gi        RWO            standard       7m52s
hippo-pgbr-repo   Bound    pvc-c4855f8a-6cb8-4d61-abe8-c9a0a497c433   1Gi        RWO            standard       7m52s
hippo-xvgh        Bound    pvc-6a482286-3122-478a-bca8-71e2aec16a2e   1Gi        RWO            standard       7m51s

[GKE cluster1]# kubectl get cm hippo-config -n pgo -o yaml | grep standby
      archive-get %f \"%p\""},"use_pg_rewind":true,"use_slots":false},"standby_cluster":{"create_replica_methods":["pgbackrest_standby"],"restore_command":"source

Then, you can start this cluster as a standby cluster.

[GKE cluster1]# pgo update cluster hippo --startup

Now, the PostgreSQL cluster is running as a standby cluster.

[GKE cluster1]# pgo test hippo
cluster : hippo
                primary ( UP
                replica ( UP
                primary (hippo-85474cd644-k9xwc): UP
                replica (hippo-kvko-77bbdd4756-jh7bx): UP
                replica (hippo-xvgh-7bb554d876-hdzzg): UP


Popular posts from this blog

Connection Pooling in Pgpool-II

Monitoring PostgreSQL Cluster via Pgpool-II with Prometheus

Authentication in Pgpool-II