Cluster Monitoring

Monitor Elasticsearch cluster health — cat APIs, node stats, shard allocation, thread pool monitoring, and diagnosing cluster issues.

30m15m reading15m lab

Cluster Health

The first thing to check — always.

curl -s "http://localhost:9200/_cluster/health?pretty"
{
  "cluster_name": "production-cluster",
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 15,
  "active_shards": 30,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0
}

Cluster Status Meanings

StatusMeaningAction
GreenAll primary and replica shards assignedNone
YellowAll primaries assigned, some replicas unassignedCheck node count vs replica count
RedSome primary shards unassignedImmediate investigation needed

Wait for Status

Block until the cluster reaches a specific status (useful in scripts):

curl -s "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30s"

Cat APIs

The _cat APIs provide human-readable output for quick monitoring. Add ?v for column headers.

Essential Cat Commands

# Cluster health (compact)
curl -s "http://localhost:9200/_cat/health?v"

# List all nodes with key stats
curl -s "http://localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,node.role,master"

# List all indices sorted by size
curl -s "http://localhost:9200/_cat/indices?v&s=store.size:desc"

# List all indices sorted by document count
curl -s "http://localhost:9200/_cat/indices?v&s=docs.count:desc"

# Shard distribution
curl -s "http://localhost:9200/_cat/shards?v&s=index"

# Unassigned shards (problems)
curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"

# Thread pool activity
curl -s "http://localhost:9200/_cat/thread_pool?v&h=name,active,rejected,completed"

# Pending tasks
curl -s "http://localhost:9200/_cat/pending_tasks?v"

# Recovery progress
curl -s "http://localhost:9200/_cat/recovery?v&active_only=true"

# Disk allocation per node
curl -s "http://localhost:9200/_cat/allocation?v"

# Segment information
curl -s "http://localhost:9200/_cat/segments?v&s=index"

Cat API Formatting

# JSON output
curl -s "http://localhost:9200/_cat/indices?format=json&pretty"

# Custom columns
curl -s "http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size,pri,rep&s=store.size:desc"

# Filter with pattern
curl -s "http://localhost:9200/_cat/indices/logs-*?v"

Node Statistics

Deep dive into node-level metrics:

# All node stats
curl -s "http://localhost:9200/_nodes/stats?pretty"

# Specific stats
curl -s "http://localhost:9200/_nodes/stats/jvm,os,process?pretty"

JVM Memory

Extract heap usage from the full JVM stats response:

curl -s "http://localhost:9200/_nodes/stats/jvm?pretty" | \
  grep -A 10 '"heap_used"'
Key metrics:
  • heap_used_percent — Should stay below 75%. Consistently above 85% = increase heap
  • heap_max — Verify it matches your configured -Xmx
  • gc.collectors.old.collection_count — Frequent old GC = heap pressure

OS and Process

curl -s "http://localhost:9200/_nodes/stats/os?pretty"

Key metrics:
  • os.cpu.percent — Overall CPU usage
  • os.mem.used_percent — System memory usage
  • os.swap.used_in_bytes — Should be 0 (swapping is bad)

Thread Pool Monitoring

Thread pools handle different operation types. Rejections mean the queue is full.

curl -s "http://localhost:9200/_cat/thread_pool?v&h=name,active,queue,rejected,completed"

Critical Thread Pools

Thread PoolPurposeWarning Sign
searchSearch requestsRejections = search queue full
writeIndex/update/deleteRejections = indexing bottleneck
getGet requestsRejections = heavy read load
managementCluster managementQueue buildup = cluster instability
flushFlush operationsQueue buildup = disk I/O issues
refreshSegment refreshQueue buildup = too many shards

Thread Pool Rejections

Any rejection count above 0 needs investigation:

# Watch for rejected operations
curl -s "http://localhost:9200/_cat/thread_pool?v&h=name,rejected&s=rejected:desc" | head -10

Shard Allocation

Why Shards Go Unassigned

# Find unassigned shards and their reasons
curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

ReasonMeaningFix
INDEX_CREATEDIndex just createdWait or add nodes
CLUSTER_RECOVEREDCluster restartWait for recovery
NODE_LEFTNode departedRestart the node
ALLOCATION_FAILEDFailed to allocateCheck disk space, shard limits
REPLICA_ADDEDNew replica addedWait or add nodes

Explain Shard Allocation

When a shard won't allocate, find out why:

curl -s "http://localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}'

Disk Watermarks

Elasticsearch stops allocating shards when disk is nearly full:

# Check current settings
curl -s "http://localhost:9200/_cluster/settings?include_defaults&filter_path=**.watermark&pretty"

WatermarkDefaultEffect
low85%No new shards allocated to this node
high90%Shards relocated away from this node
flood_stage95%Index becomes read-only
Adjust watermark thresholds for large disks where the defaults are too aggressive:

curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}'

Hot Threads

Find what the cluster is spending CPU time on:

curl -s "http://localhost:9200/_nodes/hot_threads"
This shows the Java stack traces of the busiest threads. Look for:
  • search threads — heavy query load
  • write threads — heavy indexing
  • merge threads — segment merges (I/O bound)
  • GC threads — garbage collection (heap pressure)

Pending Tasks

Cluster-level tasks waiting to be processed:

curl -s "http://localhost:9200/_cluster/pending_tasks?pretty"
A growing queue of pending tasks indicates the master node is overloaded.

Index Statistics

Per-index performance metrics:

# Index stats
curl -s "http://localhost:9200/my-index/_stats?pretty"

# Search performance
curl -s "http://localhost:9200/my-index/_stats/search?pretty"

# Indexing performance
curl -s "http://localhost:9200/my-index/_stats/indexing?pretty"
Key metrics:
  • search.query_time_in_millis / search.query_total = average query time
  • indexing.index_time_in_millis / indexing.index_total = average index time
  • refresh.total_time_in_millis — time spent refreshing

Monitoring Checklist

Run these checks regularly (or automate with Metricbeat):

CheckCommandHealthy Value
Cluster status_cluster/healthGreen
Heap usage_nodes/stats/jvm< 75%
CPU usage_nodes/stats/os< 80% sustained
Disk usage_cat/allocation< 85%
Unassigned shards_cat/shards0
Thread pool rejections_cat/thread_pool0
Pending tasks_cluster/pending_tasks0 or near 0
GC frequency_nodes/stats/jvmOld GC < 1/min
Swap usage_nodes/stats/os0 bytes

Common Error Scenarios

Error: All Shards Failed

# Check which shards are available
curl -s "http://localhost:9200/_cat/shards/problem-index?v"

# Check shard allocation
curl -s "http://localhost:9200/_cluster/allocation/explain?pretty"

Error: Too Many Concurrent Shard Recoveries

# Check concurrent recoveries
curl -s "http://localhost:9200/_cat/recovery?v&active_only=true"

# Throttle recovery
curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 2,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 2,
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}'

Error: Read-Only Index (Flood Stage)

# Remove read-only block
curl -X PUT "http://localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'

# Or for all indices
curl -X PUT "http://localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'

Lab: Monitor Your Cluster

  1. 1 Check cluster health and node status
  2. 2 List all indices sorted by size
  3. 3 Check thread pool for rejections
  4. 4 Verify no unassigned shards
  5. 5 Check JVM heap usage across all nodes
  6. 6 Run the hot threads API and analyze output
  7. 7 Check disk watermark settings

Next Steps

Discussion