Koray Karaman

22.08.2025 00:25



Building a PostgreSQL HA Cluster with Ubuntu VMs

In today’s digital landscape, data is the lifeblood of nearly every business. From e-commerce platforms and financial services to healthcare systems and enterprise SaaS applications, the reliability of your database infrastructure directly impacts user experience, revenue, and operational continuity. That’s why high availability (HA) is no longer a luxury—it’s a necessity.

PostgreSQL, one of the most powerful and widely adopted open-source relational databases, offers robust features for scalability, performance, and data integrity. However, PostgreSQL alone does not provide automatic failover: streaming replication is built in, but detecting a failed primary, promoting a standby, and redirecting clients all require external tooling. To achieve true resilience, you need a carefully orchestrated cluster architecture that can handle failover, load balancing, and real-time monitoring.

This blog post introduces the PostgreSQL-HA-Cluster-VM project—a fully automated, Docker-free solution for deploying a PostgreSQL high availability cluster on Ubuntu virtual machines. Designed for DevOps engineers, system administrators, and database architects, this project provides:

  • Seamless replication between master and replica nodes
  • Intelligent failover using Pgpool-II and keepalived
  • Real-time observability with Prometheus and Grafana
  • A single setup script for full automation
  • Full transparency and control over every component

Licensed under the MIT License, this project is free to use, modify, and distribute. Whether you're building a production-grade database cluster or experimenting in a lab environment, this project gives you the tools to deploy a resilient PostgreSQL infrastructure with confidence.

 

Project Philosophy: No Docker, Full Transparency

Why I Avoided Docker

Most modern HA setups rely on Docker or Kubernetes. While these tools are powerful, they abstract away too much for my taste, especially when you’re trying to learn or debug. I wanted:

  • Transparency: full control over, and visibility into, every configuration file and log
  • No abstraction layers between the OS and PostgreSQL
  • Real-world networking with static IPs and Netplan, since Docker’s virtual networks hide real-world behavior
  • Native systemd integration, which is closer to how services are managed in production
  • A modular architecture with clear separation of roles
  • Automation: a single script (setup.sh) handles the full deployment
  • Educational value for anyone who wants to understand HA in depth

This project is not just a deployment tool—it’s a learning platform.

 

Architecture Overview

The cluster consists of six Ubuntu Server 22.04 LTS virtual machines, each with a distinct role:

 

Node Name        | Role                    | IP         | Description
pg-master        | Primary PostgreSQL node | 10.0.2.101 | Handles writes and WAL archiving
pg-replica1      | Standby node            | 10.0.2.102 | Receives streaming replication
pg-replica2      | Standby node            | 10.0.2.103 | Adds redundancy
pgpool-node      | Connection manager      | 10.0.2.104 | Manages load balancing and failover
pgha-node        | Watchdog controller     | 10.0.2.105 | Oversees VIP and Pgpool health
monitoring-node  | Prometheus + Grafana    | 10.0.2.106 | Collects PostgreSQL metrics and serves Grafana dashboards
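If you want the nodes in this table to reach each other by name rather than by IP, one option is static /etc/hosts entries. The sketch below assumes that approach (the project may resolve names differently), and add_cluster_hosts is a hypothetical helper that appends only missing lines, so rerunning it is harmless.

```shell
#!/usr/bin/env bash
# Sketch: add the six cluster nodes from the architecture table to a hosts file.
# Assumption: nodes resolve each other via /etc/hosts (hypothetical helper).

# "IP name" pairs copied from the architecture table.
CLUSTER_NODES=(
  "10.0.2.101 pg-master"
  "10.0.2.102 pg-replica1"
  "10.0.2.103 pg-replica2"
  "10.0.2.104 pgpool-node"
  "10.0.2.105 pgha-node"
  "10.0.2.106 monitoring-node"
)

add_cluster_hosts() {
  local hosts_file="${1:-/etc/hosts}" entry
  for entry in "${CLUSTER_NODES[@]}"; do
    # Append the line only if an identical line is not already present.
    if ! grep -qxF "$entry" "$hosts_file" 2>/dev/null; then
      printf '%s\n' "$entry" >> "$hosts_file"
    fi
  done
}

# Usage, as root on every node:   add_cluster_hosts
# Preview into a scratch file:    add_cluster_hosts /tmp/hosts.preview
```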

1. PostgreSQL Master & Replicas

  • Master Node: pg-master
  • Replica Nodes: pg-replica1, pg-replica2
  • Replication: Streaming replication, with each replica initialized from the master via pg_basebackup
  • Health Checks: Monitored using healthcheck.sh
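Streaming replication also requires the master to accept replication connections. A minimal sketch for pg-master follows, assuming PostgreSQL 14 (the Ubuntu 22.04 package default), a role named replicator, and replicas in 10.0.2.0/24; these names are illustrative, not taken from the project. Since PostgreSQL 10, wal_level = replica and max_wal_senders = 10 are already the defaults, so they are not set here.

```shell
#!/usr/bin/env bash
# Sketch: let replicas open streaming-replication connections to pg-master.
# Assumptions (not from the project): PostgreSQL 14, a replication role named
# "replicator", and replicas in 10.0.2.0/24.

PG_VERSION="${PG_VERSION:-14}"
PG_HBA="${PG_HBA:-/etc/postgresql/${PG_VERSION}/main/pg_hba.conf}"
REPL_USER="${REPL_USER:-replicator}"
REPL_SUBNET="${REPL_SUBNET:-10.0.2.0/24}"

# pg_hba.conf rule that admits replication connections from the subnet.
replication_hba_rule() {
  printf 'host    replication    %s    %s    scram-sha-256\n' "$REPL_USER" "$REPL_SUBNET"
}

prepare_primary() {
  local password="$1"
  # psql substitutes :"user" (identifier) and :'pw' (literal) in stdin input.
  sudo -u postgres psql -v ON_ERROR_STOP=1 -v user="$REPL_USER" -v pw="$password" <<'SQL'
CREATE ROLE :"user" WITH REPLICATION LOGIN PASSWORD :'pw';
ALTER SYSTEM SET listen_addresses = '*';
SQL
  # Add the hba rule only if it is not already present.
  if ! sudo grep -qxF "$(replication_hba_rule)" "$PG_HBA"; then
    replication_hba_rule | sudo tee -a "$PG_HBA" >/dev/null
  fi
  # listen_addresses only takes effect after a restart.
  sudo systemctl restart postgresql
}

# Usage on pg-master:  prepare_primary 'choose-a-strong-password'
```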

2. Pgpool-II

  • Acts as a connection pooler and load balancer
  • Provides automatic failover and query routing
  • Uses Virtual IP (VIP) for seamless client access

3. PGHA Controller

  • Built with keepalived and watchdog.conf
  • Manages VIP failover between Pgpool nodes

4. Monitoring Stack

  • Prometheus collects metrics via postgres_exporter
  • Grafana visualizes metrics with prebuilt dashboards
  • Tracks replication lag, active connections, CPU/RAM usage, and more

 

Installation and Automation

The installation is fully scripted, making it a great example of PostgreSQL cluster automation. The process includes:

  • Static IP configuration for all nodes
  • PostgreSQL installation and configuration
  • PostgreSQL streaming replication configuration
  • Pgpool-II setup for connection pooling and failover
  • PGHA deployment for VIP failover
  • Monitoring stack with Prometheus PostgreSQL metrics
  • Healthcheck scripts for replication lag alerts

This approach follows DevOps practices and can be integrated into CI/CD pipelines for infrastructure provisioning.

Download Setup Script

wget https://raw.githubusercontent.com/koray-karaman/PostgreSQL-HA-Cluster-VM/main/setup.sh -O setup.sh
chmod +x setup.sh

Run Setup Script

./setup.sh 

Configure Networking

sudo cp configs/network-setup.yaml /etc/netplan/00-installer-config.yaml
sudo netplan apply 
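The project ships its own configs/network-setup.yaml. For illustration only, the hypothetical helper below renders a typical static-IP Netplan file for one node. The interface name enp0s3 and the gateway 10.0.2.1 in the usage comment are placeholders, not values from the repository.

```shell
#!/usr/bin/env bash
# Sketch: render a static-IP Netplan config for one node (hypothetical helper).
# Uses "routes: - to: default" because gateway4 is deprecated on Ubuntu 22.04.

render_netplan() {
  local iface="$1" address="$2" gateway="$3" dns="${4:-$3}"
  cat <<EOF
network:
  version: 2
  ethernets:
    ${iface}:
      dhcp4: false
      addresses:
        - ${address}
      routes:
        - to: default
          via: ${gateway}
      nameservers:
        addresses: [${dns}]
EOF
}

# Usage for pg-master (placeholder interface and gateway):
#   render_netplan enp0s3 10.0.2.101/24 10.0.2.1 | sudo tee /etc/netplan/00-installer-config.yaml >/dev/null
#   sudo netplan try    # reverts automatically unless you confirm the change
```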

 

Monitoring & Alerts

The Grafana dashboards track the following metrics:

  • postgres_up
  • postgres_replication_lag_seconds
  • postgres_replication_lag_alert
  • postgres_in_recovery

You can customize alert thresholds using the LAG_THRESHOLD variable. 
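To have Prometheus raise an alert from the same threshold, one option is an alerting rule on postgres_replication_lag_seconds. The sketch below assumes that metric name as listed above, a 30-second default, and a rules path under /etc/prometheus/rules; all three are assumptions, and the generated file still has to be listed under rule_files in prometheus.yml.

```shell
#!/usr/bin/env bash
# Sketch: write a Prometheus alerting rule driven by LAG_THRESHOLD.
# Assumptions: metric postgres_replication_lag_seconds, a 30 s default
# threshold, and a hypothetical output path.

write_lag_rule() {
  local out="${1:-/etc/prometheus/rules/pg-replication.yml}"
  local threshold="${LAG_THRESHOLD:-30}"
  mkdir -p "$(dirname "$out")"
  # Unquoted heredoc: ${threshold} expands, \$labels stays literal for Prometheus.
  cat > "$out" <<EOF
groups:
  - name: postgresql-replication
    rules:
      - alert: PostgresReplicationLagHigh
        expr: postgres_replication_lag_seconds > ${threshold}
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag above ${threshold}s on {{ \$labels.instance }}"
EOF
}

# Usage on monitoring-node (as root), then validate before reloading Prometheus:
#   LAG_THRESHOLD=10 write_lag_rule
#   promtool check rules /etc/prometheus/rules/pg-replication.yml
```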

 

Replication and Failover Scenarios

PostgreSQL’s streaming replication is used to keep replicas in sync with the master. In case of failure, Pgpool and watchdog coordinate to promote a replica and reassign the virtual IP.

Planned Failover

  1. Stop PostgreSQL on master
  2. Promote replica using pg_ctl promote
  3. Update Pgpool configuration
  4. Restart Pgpool
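The four steps above can be sketched as one function run from an admin host. It assumes PostgreSQL 14 in Ubuntu's default cluster layout, passwordless SSH with sudo on every node, and Pgpool-II installed as the pgpool2 service. On Ubuntu, pg_ctl is not on the PATH, so its full path is used. The pgpool.conf change is left as a comment because it depends on how the project configures its backends.

```shell
#!/usr/bin/env bash
# Sketch of the planned-failover steps. Assumptions (not from the project):
# PostgreSQL 14 in Ubuntu's default "main" cluster, passwordless SSH + sudo
# from an admin host, and Pgpool-II installed as the pgpool2 service.
set -euo pipefail

PG_VERSION="${PG_VERSION:-14}"
PG_CTL="/usr/lib/postgresql/${PG_VERSION}/bin/pg_ctl"   # not on Ubuntu's PATH
PG_DATA="/var/lib/postgresql/${PG_VERSION}/main"

planned_failover() {
  local old_primary="$1" new_primary="$2" pgpool_host="$3"
  # 1. Stop PostgreSQL on the master so it accepts no further writes.
  ssh "$old_primary" "sudo systemctl stop postgresql"
  # 2. Promote the replica; -w waits until promotion has finished.
  ssh "$new_primary" "sudo -u postgres $PG_CTL promote -w -D $PG_DATA"
  # A promoted server reports 'f' for pg_is_in_recovery().
  if [[ "$(ssh "$new_primary" "sudo -u postgres psql -Atc 'SELECT pg_is_in_recovery()'")" != "f" ]]; then
    echo "promotion did not complete on $new_primary" >&2
    return 1
  fi
  # 3. Update pgpool.conf for the new topology here (project-specific, not shown).
  # 4. Restart Pgpool so it picks up the change.
  ssh "$pgpool_host" "sudo systemctl restart pgpool2"
}

# Usage: planned_failover pg-master pg-replica1 pgpool-node
```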

Unplanned Failover

  • Pgpool detects failure
  • Watchdog promotes a replica
  • VIP is reassigned
  • Clients reconnect automatically

This setup keeps downtime short and restores service automatically, although the failed master must be rejoined manually as described under Recovery.
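After an unplanned failover, Pgpool's own view of the backends is a quick way to confirm which node it now treats as primary. Pgpool-II answers the SQL command SHOW pool_nodes. The sketch below assumes Pgpool's default port 9999 and a VIP supplied through PGPOOL_VIP, since the article does not list the VIP address. The parsing helper locates the hostname and role columns by header name because the column order differs between Pgpool versions.

```shell
#!/usr/bin/env bash
# Sketch: ask Pgpool-II which backend it currently treats as primary.
# Assumptions: Pgpool listens on port 9999 behind the VIP, and the VIP is
# supplied in PGPOOL_VIP (the article does not list it).

# Read "SHOW pool_nodes" output (unaligned, '|' separated, header first) on
# stdin and print the hostname of the backend whose role is "primary".
primary_from_pool_nodes() {
  awk -F'|' '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "hostname") h = i
        if ($i == "role") r = i
      }
      next
    }
    h && r && $r == "primary" { print $h }
  '
}

current_primary() {
  psql -h "${PGPOOL_VIP:?set PGPOOL_VIP}" -p 9999 -U postgres -d postgres \
    -A -F'|' -P footer=off -c 'SHOW pool_nodes' | primary_from_pool_nodes
}

# Usage: PGPOOL_VIP=<your VIP> current_primary
```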

Recovery

  • Reinitialize old master as replica
  • Use pg_basebackup for sync
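The recovery steps map onto pg_basebackup's -R option, which writes standby.signal and primary_conninfo so the rebuilt node starts as a replica. The same command also initializes a brand-new replica. The sketch assumes PostgreSQL 14 paths and a replication role named replicator whose password is in the postgres user's ~/.pgpass. It deletes the old data directory, so run it only on the node being rejoined.

```shell
#!/usr/bin/env bash
# Sketch: rejoin the old master as a replica of the new primary.
# Assumptions (not from the project): PostgreSQL 14 paths, replication role
# "replicator" with its password in ~postgres/.pgpass.
# WARNING: deletes this node's existing data directory.
set -euo pipefail

PG_VERSION="${PG_VERSION:-14}"
PG_DATA="/var/lib/postgresql/${PG_VERSION}/main"

rejoin_as_replica() {
  local new_primary="$1"
  # Refuse to wipe anything unless the new primary is accepting connections.
  if ! pg_isready -q -h "$new_primary"; then
    echo "new primary $new_primary is not reachable; aborting" >&2
    return 1
  fi
  sudo systemctl stop postgresql
  # pg_basebackup needs an empty target directory.
  sudo -u postgres find "$PG_DATA" -mindepth 1 -delete
  # -X stream copies the WAL needed for consistency; -R writes standby.signal
  # and primary_conninfo so the node starts as a standby.
  sudo -u postgres pg_basebackup -h "$new_primary" -U replicator \
    -D "$PG_DATA" -X stream -R -P
  sudo systemctl start postgresql
}

# Usage, on the old master after pg-replica1 was promoted:
#   rejoin_as_replica 10.0.2.102
```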

 

Post-Deployment Checklist

  • Can all nodes ping each other?
  • Are PostgreSQL services running?
  • Is replication visible in pg_stat_replication?
  • Is Pgpool VIP correctly assigned?
  • Are Prometheus and Grafana dashboards active?
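The checklist can be automated. The sketch below runs each check and prints PASS or FAIL. It assumes the node IPs from the architecture table, default ports (PostgreSQL 5432, Pgpool 9999, Prometheus 9090, Grafana 3000), credentials for the postgres user in ~/.pgpass, and a VIP supplied through PGPOOL_VIP. Reachability through the VIP serves as a proxy for correct VIP assignment.

```shell
#!/usr/bin/env bash
# Sketch: automate the post-deployment checklist. Assumptions: IPs from the
# architecture table, default ports, postgres credentials in ~/.pgpass, and
# the VIP in PGPOOL_VIP (the article does not list it).

NODES=(10.0.2.101 10.0.2.102 10.0.2.103 10.0.2.104 10.0.2.105 10.0.2.106)
PG_NODES=(10.0.2.101 10.0.2.102 10.0.2.103)
MASTER=10.0.2.101
MONITOR=10.0.2.106
FAILURES=0

# Run a command silently and print PASS or FAIL with a description.
check() {
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $desc"
  else
    echo "FAIL  $desc"
    FAILURES=$((FAILURES + 1))
  fi
}

run_checklist() {
  local ip streaming
  FAILURES=0
  for ip in "${NODES[@]}"; do
    check "ping $ip" ping -c 1 -W 2 "$ip"
  done
  for ip in "${PG_NODES[@]}"; do
    check "PostgreSQL accepting connections on $ip" pg_isready -q -h "$ip" -p 5432
  done
  # The master should show both replicas as streaming in pg_stat_replication.
  streaming="$(psql -h "$MASTER" -U postgres -d postgres -Atc \
    "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming'" 2>/dev/null)"
  check "2 replicas streaming (found: ${streaming:-none})" test "${streaming:-0}" = 2
  if [[ -n "${PGPOOL_VIP:-}" ]]; then
    check "Pgpool reachable through VIP $PGPOOL_VIP" pg_isready -q -h "$PGPOOL_VIP" -p 9999
  else
    echo "SKIP  PGPOOL_VIP not set"
  fi
  check "Prometheus ready" curl -fsS "http://$MONITOR:9090/-/ready"
  check "Grafana healthy" curl -fsS "http://$MONITOR:3000/api/health"
  (( FAILURES == 0 ))
}

# Usage: PGPOOL_VIP=<your VIP> run_checklist
```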

 

Best Practices

  • Version control your configuration files
  • Monitor replication lag regularly
  • Test failover scenarios in staging
  • Backup VIP configuration
  • Use verify.sh for health checks

 

Health Checks and Cron Jobs

A custom script healthcheck.sh runs periodic checks on the cluster:

  • Verifies replication status
  • Checks Pgpool node health
  • Logs anomalies

You can schedule it with cron:

(crontab -l 2>/dev/null; echo "* * * * * /opt/pg-ha/healthcheck.sh") | crontab -

This ensures continuous monitoring even outside Grafana.
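As a hypothetical sketch of the three checks listed above (not the project's actual script), the version below measures replay lag on a replica, asks Pgpool for backends marked down, and logs anomalies to syslog with logger. It assumes it runs as the postgres user on a replica, reuses a LAG_THRESHOLD variable with an assumed 30-second default, and reaches Pgpool on pgpool-node (10.0.2.104) at port 9999. Saved as /opt/pg-ha/healthcheck.sh, it would be scheduled from the postgres user's crontab with the cron line above.

```shell
#!/usr/bin/env bash
# Hypothetical healthcheck.sh sketch (not the project's script). Assumptions:
# runs as postgres on a replica, Pgpool on 10.0.2.104:9999, syslog via logger.
set -uo pipefail

LAG_THRESHOLD="${LAG_THRESHOLD:-30}"       # seconds; the default is assumed
PGPOOL_HOST="${PGPOOL_HOST:-10.0.2.104}"   # pgpool-node in the architecture table

log_anomaly() { logger -t pg-healthcheck -p user.warning "$*"; }

# Print 1 if the lag (seconds, possibly fractional) exceeds the threshold, else 0.
lag_exceeds() { awk -v lag="$1" -v max="$2" 'BEGIN { r = (lag + 0 > max + 0); print r }'; }

check_replication() {
  local in_recovery lag
  in_recovery="$(psql -Atc 'SELECT pg_is_in_recovery()')" \
    || { log_anomaly "local PostgreSQL unreachable"; return 1; }
  # Replay lag is only meaningful on a standby.
  [[ "$in_recovery" == "t" ]] || return 0
  # Seconds since the last replayed transaction. Caveat: on an idle primary
  # this keeps growing even when the replica is fully caught up.
  lag="$(psql -Atc 'SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)')"
  if [[ "$(lag_exceeds "$lag" "$LAG_THRESHOLD")" == 1 ]]; then
    log_anomaly "replication lag ${lag}s exceeds ${LAG_THRESHOLD}s"
  fi
}

check_pgpool() {
  local down
  # Column 4 of SHOW pool_nodes is the status; collect hostnames marked "down".
  down="$(psql -h "$PGPOOL_HOST" -p 9999 -U postgres -d postgres -At -c 'SHOW pool_nodes' \
    | awk -F'|' '$4 == "down" { print $2 }')" \
    || { log_anomaly "Pgpool unreachable at $PGPOOL_HOST:9999"; return 1; }
  [[ -z "$down" ]] || log_anomaly "Pgpool reports backends down: $down"
}

main() {
  local rc=0
  check_replication || rc=1
  check_pgpool || rc=1
  return "$rc"
}

# Run only when executed (e.g. by cron), not when sourced.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  main
fi
```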

 

Troubleshooting Tips

Issue                    | Solution
VIP not moving           | Check watchdog.conf and Pgpool connectivity
Replication lag too high | Investigate disk I/O and network latency
Prometheus not scraping  | Verify targets and firewall rules
Grafana dashboard empty  | Ensure exporters are running

 

Lessons Learned and Honest Reflections

This project taught me a lot—but it wasn’t all smooth sailing. Here are some hard-earned lessons:

1. Pgpool Is Powerful but Fragile

Pgpool-II is a beast. It handles connection pooling, load balancing, and failover—but its configuration is notoriously complex. A single misconfigured line in pgpool.conf can break the entire cluster. I spent hours debugging watchdog behavior and VIP transitions.

2. VIP Failover Isn’t Always Instant

While Pgpool and watchdog handle VIP reassignment, it’s not always seamless. In some cases, traffic for the VIP keeps reaching the failed node because of network delays or stale ARP caches on clients and switches, which still map the VIP to the old node’s MAC address. This can cause brief outages.
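When the VIP seems stuck, two things are worth checking on the node that should own it: whether the address is actually bound there, and whether neighbours have refreshed their ARP caches. The sketch below does both with ip and iputils arping (-U sends unsolicited, gratuitous ARP). The helper names are hypothetical, the script must run as root, and keepalived normally sends gratuitous ARP on its own, so this is a troubleshooting aid rather than part of the failover path.

```shell
#!/usr/bin/env bash
# Sketch: confirm this node holds the VIP, then send gratuitous ARP so clients
# and switches refresh stale caches. Hypothetical helpers; run as root.

# Read `ip -o -4 addr show` output on stdin; succeed if the VIP is listed.
vip_in_ip_output() {
  awk -v vip="$1" '{ split($4, a, "/"); if (a[1] == vip) found = 1 } END { exit !found }'
}

holds_vip() { ip -o -4 addr show | vip_in_ip_output "$1"; }

announce_vip() {
  local vip="$1" iface="$2"
  if ! holds_vip "$vip"; then
    echo "VIP $vip is not bound on this node" >&2
    return 1
  fi
  # iputils arping: -U = unsolicited ARP, -c 3 = three packets, -I = interface.
  arping -U -c 3 -I "$iface" "$vip"
}

# Usage: announce_vip <VIP address> <interface>   (both are site-specific)
```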

3. Monitoring Needs More Alerts

Prometheus and Grafana provide great visibility, but alerting is minimal. I plan to add Slack and email notifications for critical events like failover or replication lag spikes.

4. No Backup System Yet

This cluster handles availability—but not backups. That’s a major gap. I’m exploring tools like pgBackRest and barman to integrate automated backups and point-in-time recovery.

5. Resource Usage Is High

Running six VMs locally is resource-intensive. This setup is better suited for cloud environments or powerful physical servers. On a typical laptop, you’ll hit CPU and RAM limits quickly.

 

MIT License

This project is released under the MIT License:

MIT License

Copyright (c) 2025 Koray Karaman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction.

You are free to use, modify, and distribute this project with proper attribution.

 

Contribute

Want to improve or extend the project?

  • Fork the repo and submit a pull request
  • Open issues for bugs or feature requests
  • Star the project to support its visibility

GitHub Repository: PostgreSQL-HA-Cluster-VM (https://github.com/koray-karaman/PostgreSQL-HA-Cluster-VM)

 

Conclusion

Building a high availability PostgreSQL cluster may sound complex—but with the right tools and architecture, it becomes a repeatable, scalable, and maintainable process. The PostgreSQL HA Cluster VM project simplifies this journey by offering a complete, open-source solution that runs on Ubuntu virtual machines without relying on containerization or external orchestration platforms.

By combining PostgreSQL replication, Pgpool-II failover logic, VIP management via keepalived, and monitoring with Prometheus and Grafana, this project delivers an HA setup that can be deployed with a single script. Whether you're managing mission-critical applications or preparing for failure scenarios, this architecture helps keep your data available and consistent; as noted in the lessons above, backups and point-in-time recovery still need to be added separately.

And because it's licensed under the MIT License, you're free to adapt it to your own infrastructure, contribute improvements, or integrate it into larger automation pipelines. The project is designed to be transparent, modular, and extensible—perfect for teams that value control and clarity.

If you're ready to take your PostgreSQL deployment to the next level, explore the GitHub repository, try the setup script, and start building a truly resilient database cluster today.

Koray Karaman

© 2006—2026 Koray Karaman