Understanding High Availability Clusters
Elasticsearch cluster architecture — quorum, node roles, split-brain prevention, failure simulation, and recovery procedures.
40m · 25m reading · 15m lab
Why High Availability?
Single-Node Risks
| Failure Type | Impact |
|---|---|
| Node crash | Entire cluster stops, data unavailable |
| Network issues | Cluster unreachable |
| Hardware failure | Potential data loss |
| Disk full | All indexing and search stops |
HA Benefits
With multiple nodes, if one node fails, the others continue serving requests. Replica shards are promoted to primary, and the cluster keeps running.
Cluster Architecture
┌─────────────────────────────────────────────────────────────┐
│                    Elasticsearch Cluster                    │
│                                                             │
│     ┌─────────────┐   ┌─────────────┐   ┌─────────────┐     │
│     │   Node 1    │   │   Node 2    │   │   Node 3    │     │
│     │   Master    │   │   Master    │   │   Master    │     │
│     │  (Active)   │   │  (Standby)  │   │  (Standby)  │     │
│     │   + Data    │   │   + Data    │   │   + Data    │     │
│     └─────────────┘   └─────────────┘   └─────────────┘     │
│                                                             │
│     Primary shards: [P0] [P1] [P2]                          │
│     Replica shards: [R0] [R1] [R2]                          │
│     Distributed across nodes for redundancy                 │
└─────────────────────────────────────────────────────────────┘
Understanding Quorum
Quorum is the minimum number of master-eligible nodes that must agree on decisions (electing a master, modifying cluster state). It prevents split-brain scenarios where the cluster splits into two independent clusters that both accept writes.
Quorum Formula
Quorum = floor(Number of master-eligible nodes / 2) + 1
| Master-eligible Nodes | Quorum | Can Survive |
|---|---|---|
| 1 | 1 | 0 node failures |
| 2 | 2 | 0 node failures (avoid!) |
| 3 | 2 | 1 node failure |
| 5 | 3 | 2 node failures |
| 7 | 4 | 3 node failures |
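The rows above follow from integer division. As a minimal sketch (the function names `quorum` and `tolerated_failures` are illustrative and not part of Elasticsearch), the table can be reproduced like this:

```python
def quorum(master_eligible: int) -> int:
    """Smallest majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can fail while a majority remains."""
    return master_eligible - quorum(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 7):
        print(f"{n} nodes: quorum {quorum(n)}, survives {tolerated_failures(n)} failure(s)")
```

Note that going from 1 to 2 master-eligible nodes raises the quorum without adding any failure tolerance, which is why the 2-node row is marked "avoid".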
Why Is Quorum Important?
Without quorum, a network partition can create two independent clusters:
Network Partition:
┌──────────┐     X     ┌──────────┐
│  Node 1  │     X     │  Node 2  │
│  Node 3  │     X     │          │
└──────────┘           └──────────┘
Quorum: 2/3            Quorum: 1/3
Stays active           Becomes read-only
With 3 nodes and quorum of 2, the partition with 2 nodes continues operating while the isolated node stops accepting writes.
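The same reasoning can be written as a small check. This is a simplified sketch of the majority rule described above, not Elasticsearch's actual election code; `partition_outcome` and its labels are hypothetical names used for illustration.

```python
def partition_outcome(total_master_eligible: int, partition_sizes: list[int]) -> list[str]:
    """Label each side of a network partition by whether it holds a majority."""
    if sum(partition_sizes) > total_master_eligible:
        raise ValueError("partitions contain more nodes than the cluster has")
    needed = total_master_eligible // 2 + 1  # quorum
    return ["active" if size >= needed else "no master" for size in partition_sizes]


if __name__ == "__main__":
    # The partition from the diagram: Node 1 + Node 3 on one side, Node 2 alone.
    print(partition_outcome(3, [2, 1]))
```

Because at most one side can hold a majority, two sides can never both elect a master.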
Node Roles
Elasticsearch nodes can serve different roles. In small clusters, each node typically handles all roles. In larger clusters, dedicate nodes to specific roles.
Role Reference
| Role | Purpose | When to Dedicate |
|---|---|---|
| Master | Manages cluster state, creates/deletes indices | 5+ nodes |
| Data | Stores data, handles search and indexing | Always |
| Ingest | Pre-processes documents before indexing | Heavy pipeline use |
| ML | Machine learning tasks | If using ML features |
| Coordinating | Routes requests, aggregates results | High query volume |
| Remote Cluster Client | Cross-cluster search/replication | Multi-cluster setups |
| Transform | Pivot and aggregation transforms | If using transforms |
Role Configuration
# Master-only node (no data)
node.roles: [master]
# Data and ingest node (not master-eligible)
node.roles: [data, ingest]
# All roles (default for small clusters)
node.roles: [master, data, ingest]
# Coordinating-only node (no data, no master)
node.roles: []
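To make the effect of each `node.roles` value concrete, here is a small sketch that describes a node from its role list. The helper `describe_node` is hypothetical and only distinguishes the roles used in the examples above.

```python
def describe_node(roles: list[str]) -> str:
    """Summarize what a node.roles list means for a node."""
    if not roles:
        # An empty list leaves only request routing and result aggregation.
        return "coordinating-only"
    parts = []
    if "master" in roles:
        parts.append("master-eligible")
    if "data" in roles:
        parts.append("holds data")
    others = [role for role in roles if role not in ("master", "data")]
    if others:
        parts.append("also: " + ", ".join(others))
    return ", ".join(parts)


if __name__ == "__main__":
    for roles in (["master"], ["data", "ingest"], ["master", "data", "ingest"], []):
        print(roles, "->", describe_node(roles))
```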
Role Strategy by Cluster Size
| Cluster Size | Strategy |
|---|---|
| 1-3 nodes | All roles on every node |
| 3-5 nodes | All roles, but consider dedicated masters |
| 5-10 nodes | 3 dedicated masters + data nodes |
| 10+ nodes | Dedicated masters + dedicated data + coordinating |
Essential Configuration
For a 3-node HA cluster, these properties are essential:
cluster.initial_master_nodes
Defines the master-eligible nodes that bootstrap a brand-new cluster. It is used only during the very first startup; remove it once the cluster has formed.
cluster.initial_master_nodes:
- elasticsearch-1
- elasticsearch-2
- elasticsearch-3
discovery.seed_hosts
Lists the addresses of master-eligible nodes to contact for cluster discovery. A node uses this list at startup to find and join the cluster.
discovery.seed_hosts:
- elasticsearch-1
- elasticsearch-2
- elasticsearch-3
Minimum Configuration per Node
# Node identity
cluster.name: production-cluster
node.name: elasticsearch-1
node.roles: [master, data, ingest]
# Network
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
# Discovery
discovery.seed_hosts:
- elasticsearch-2:9300
- elasticsearch-3:9300
cluster.initial_master_nodes:
- elasticsearch-1
- elasticsearch-2
- elasticsearch-3
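Only `node.name` and `discovery.seed_hosts` differ between the three nodes. As a sketch of that pattern, the hypothetical helper below builds the settings shown above for any node in the list; it returns a plain dictionary and does not write `elasticsearch.yml` itself.

```python
import json


def node_settings(node_name: str, all_nodes: list[str],
                  cluster_name: str = "production-cluster") -> dict:
    """Build the per-node settings from the example above."""
    if node_name not in all_nodes:
        raise ValueError(f"{node_name} is not in the node list")
    return {
        "cluster.name": cluster_name,
        "node.name": node_name,
        "node.roles": ["master", "data", "ingest"],
        "network.host": "0.0.0.0",
        "http.port": 9200,
        "transport.port": 9300,
        # Seed hosts are the other nodes, addressed on the transport port.
        "discovery.seed_hosts": [f"{n}:9300" for n in all_nodes if n != node_name],
        # Identical on every node; only relevant for the first startup.
        "cluster.initial_master_nodes": list(all_nodes),
    }


if __name__ == "__main__":
    nodes = ["elasticsearch-1", "elasticsearch-2", "elasticsearch-3"]
    print(json.dumps(node_settings("elasticsearch-2", nodes), indent=2))
```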
Cluster Health
| Status | Meaning | Action |
|---|---|---|
| Green | All primary and replica shards allocated | None needed |
| Yellow | All primaries allocated, some replicas missing | Add nodes or reduce replicas |
| Red | Some primary shards missing | Immediate attention required |
# Check cluster health
curl -s "localhost:9200/_cluster/health?pretty"
# View shard allocation
curl -s "localhost:9200/_cat/shards?v"
# View node info
curl -s "localhost:9200/_cat/nodes?v&h=name,role,heap.percent,disk.used_percent"
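The health response is JSON, so the status table above is easy to apply in a script. The sketch below parses a response body and maps the status to the suggested action. `summarize_health` is a hypothetical helper, and the sample body is a hand-trimmed illustration rather than real cluster output; `status`, `number_of_nodes`, and `unassigned_shards` are fields of the cluster health response.

```python
import json

ACTIONS = {
    "green": "none needed",
    "yellow": "add nodes or reduce replicas",
    "red": "immediate attention required",
}


def summarize_health(raw: str) -> str:
    """Turn a _cluster/health JSON body into a one-line summary."""
    health = json.loads(raw)
    status = health["status"]
    return (
        f"{status}: {health['number_of_nodes']} node(s), "
        f"{health['unassigned_shards']} unassigned shard(s); "
        f"action: {ACTIONS.get(status, 'unknown status')}"
    )


if __name__ == "__main__":
    # Illustrative sample only, trimmed to the fields used here.
    sample = '{"status": "yellow", "number_of_nodes": 2, "unassigned_shards": 3}'
    print(summarize_health(sample))
```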
Cluster Failure Simulation
Understanding how your cluster behaves during failures is critical for production readiness.
Scenario 1: Single Node Failure
# Check health before
curl -s "localhost:9200/_cluster/health?pretty"
# Expected: green, 3 nodes
# Stop one node
docker stop elasticsearch-3
# Check health after
curl -s "localhost:9200/_cluster/health?pretty"
# Expected: yellow — replicas promoted to primary, some replicas unassigned
What happens: Quorum is maintained (2 of 3 nodes). The cluster promotes replica shards to primary. Status turns yellow because some replicas are now unassigned.
Scenario 2: Two Node Failure
# Stop two nodes
docker stop elasticsearch-2 elasticsearch-3
# Check health
curl -s "localhost:9200/_cluster/health?pretty"
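# Expected: red, or an HTTP 503 master_not_discovered_exception, because quorum is lost (only 1 of 3 master-eligible nodes)
What happens: With only 1 node and a quorum of 2, the cluster cannot elect a master. The remaining node rejects writes, and cluster-level requests such as the health check can fail until a second master-eligible node returns. Data held only on the stopped nodes is inaccessible.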
Scenario 3: Complete Cluster Failure
# Stop all nodes
docker stop elasticsearch-1 elasticsearch-2 elasticsearch-3
# Cluster is completely unavailable
curl -s "localhost:9200/_cluster/health?pretty"
# Expected: connection refused
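The three scenarios can be predicted before touching a real cluster. The following is a toy model under stated assumptions: every node is master-eligible, each shard has one primary and one replica, and the shard layout in `LAYOUT` is made up for illustration. The `simulate` function is hypothetical and ignores shard reallocation over time.

```python
def simulate(nodes: list[str], shards: dict[str, list[str]], failed: set[str]) -> str:
    """Predict the outcome of stopping the nodes in `failed`.

    `shards` maps a shard name to the nodes holding its copies
    (primary first, then replicas).
    """
    alive = [n for n in nodes if n not in failed]
    if not alive:
        return "unreachable"
    if len(alive) < len(nodes) // 2 + 1:
        return "no master (quorum lost)"
    status = "green"
    for copies in shards.values():
        surviving = [n for n in copies if n in alive]
        if not surviving:
            return "red"  # no copy of this shard is left
        if len(surviving) < len(copies):
            status = "yellow"  # a surviving copy acts as primary, a replica is missing
    return status


NODES = ["elasticsearch-1", "elasticsearch-2", "elasticsearch-3"]
LAYOUT = {  # assumed layout, primary listed first
    "shard-0": ["elasticsearch-1", "elasticsearch-2"],
    "shard-1": ["elasticsearch-2", "elasticsearch-3"],
    "shard-2": ["elasticsearch-3", "elasticsearch-1"],
}

if __name__ == "__main__":
    print(simulate(NODES, LAYOUT, {"elasticsearch-3"}))
    print(simulate(NODES, LAYOUT, {"elasticsearch-2", "elasticsearch-3"}))
    print(simulate(NODES, LAYOUT, set(NODES)))
```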
Recovery
# Start nodes one by one
docker start elasticsearch-1
docker start elasticsearch-2
docker start elasticsearch-3
# Monitor recovery
watch -n 2 'curl -s "localhost:9200/_cluster/health?pretty"'
# Cluster recovers to green once all nodes rejoin and shards rebalance
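Where `watch` is not available, or when a script should block until recovery completes, the same polling can be done in a few lines. This is a sketch that assumes the cluster is reachable without authentication at `localhost:9200`, as in the curl commands above; `fetch_status` and `wait_for_green` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_status(url: str = "http://localhost:9200/_cluster/health") -> str:
    """Return the cluster status, or "unreachable" if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return json.load(response)["status"]
    except (OSError, ValueError, KeyError):
        # Covers connection errors, HTTP errors, and malformed responses.
        return "unreachable"


def wait_for_green(fetch=fetch_status, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll until the status is green; give up after `attempts` checks."""
    for _ in range(attempts):
        status = fetch()
        print(f"cluster status: {status}")
        if status == "green":
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print("recovered" if wait_for_green() else "still not green")
```

Passing `fetch` as a parameter keeps the polling loop testable without a running cluster.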
Key takeaways:
- A 3-node cluster survives 1 node failure (status turns yellow, requests keep working)
- 2 node failures cause quorum loss (cluster goes red)
- Elasticsearch automatically recovers when nodes rejoin
- Replicas are essential: without them, any node failure makes that node's shards unavailable and risks data loss
Split-Brain Prevention
In Elasticsearch 7+, split-brain prevention is automatic. The cluster uses a voting configuration that requires a majority of master-eligible nodes.
For Elasticsearch 6.x and earlier, configure manually:
# Deprecated in 7.x, where this is handled automatically
discovery.zen.minimum_master_nodes: 2
Lab: Explore Cluster Behavior
1. Start a 3-node cluster with Docker Compose
2. Check cluster health and verify green status
3. Create an index with 1 replica
4. Stop one node and observe health change to yellow
5. Verify the data is still accessible
6. Stop a second node and observe red status
7. Restart all nodes and watch recovery to green
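For step 3, an index with one replica is created by sending a `PUT` request with `number_of_shards` and `number_of_replicas` in the settings. The sketch below only builds that request. The index name `ha-lab`, the shard count of 3, and the helper `create_index_request` are assumptions for illustration, and the cluster is assumed to accept unauthenticated HTTP on `localhost:9200`.

```python
import json
import urllib.request


def create_index_request(index: str, shards: int = 3, replicas: int = 1,
                         base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build a PUT request that creates `index` with the given shard settings."""
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    request = create_index_request("ha-lab")  # hypothetical index name
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To send it against a running lab cluster:
    # urllib.request.urlopen(request)
```

With 1 replica on 3 nodes, every shard has a copy on two different nodes, which is what lets step 5 succeed after one node is stopped.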
Next Steps
- Three Node Cluster Setup — deploy a 3-node cluster with Docker
- Five Node Cluster — production-grade multi-node deployment