etcd — Metadata Store
Deep dive into etcd — Milvus's metadata storage. Learn configuration, backup, recovery, and sizing for production.
30m15m reading15m lab
etcd — Metadata Store
etcd is Milvus's brain. It stores:- Collection schemas and metadata
- Segment information and states
- Coordination data (timestamps, leader election)
- Service registration and discovery
Why etcd Matters
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Proxy │────▶│ etcd │◀────│ RootCoord │
└─────────────┘ │ (metadata) │ └─────────────┘
┌─────────────┐ └─────────────┘ ┌─────────────┐
│ QueryNode │ │ DataCoord │
└─────────────┘ └─────────────┘
Every Milvus component depends on etcd:
- Coordinators store state and elect leaders
- Workers register themselves and discover work
- Proxy routes requests based on metadata
Deployment Modes
Single Node (Development Only)
docker run -d \
--name etcd \
-p 2379:2379 \
-v etcd-data:/etcd-data \
quay.io/coreos/etcd:v3.5.16 \
etcd --data-dir /etcd-data \
--listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://localhost:2379
⚠️
Warning: Single-node etcd has no redundancy. Data loss on failure.
Three-Node Cluster (Production Minimum)
# docker-compose.yml snippet
services:
etcd-0:
image: quay.io/coreos/etcd:v3.5.16
environment:
- ETCD_NAME=etcd-0
- ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-0:2380
- ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
- ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
- ETCD_ADVERTISE_CLIENT_URLS=http://etcd-0:2379
- ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
- ETCD_INITIAL_CLUSTER=etcd-0=http://etcd-0:2380,etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380
- ETCD_INITIAL_CLUSTER_STATE=new
volumes:
- etcd-0-data:/etcd-data
etcd-1:
image: quay.io/coreos/etcd:v3.5.16
environment:
- ETCD_NAME=etcd-1
- ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-1:2380
- ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
- ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
- ETCD_ADVERTISE_CLIENT_URLS=http://etcd-1:2379
- ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
- ETCD_INITIAL_CLUSTER=etcd-0=http://etcd-0:2380,etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380
- ETCD_INITIAL_CLUSTER_STATE=new
volumes:
- etcd-1-data:/etcd-data
etcd-2:
image: quay.io/coreos/etcd:v3.5.16
environment:
- ETCD_NAME=etcd-2
- ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-2:2380
- ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
- ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
- ETCD_ADVERTISE_CLIENT_URLS=http://etcd-2:2379
- ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster
- ETCD_INITIAL_CLUSTER=etcd-0=http://etcd-0:2380,etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380
- ETCD_INITIAL_CLUSTER_STATE=new
volumes:
- etcd-2-data:/etcd-data
Key Configuration Parameters
Auto-Compaction
etcd keeps all revision history. Without compaction, it grows indefinitely:
# Auto-compact every 1000 revisions
ETCD_AUTO_COMPACTION_MODE=revision
ETCD_AUTO_COMPACTION_RETENTION=1000
# Or time-based (compact hourly)
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h
Important: Milvus requires compaction. Without it, etcd will run out of disk.
Quota and Limits
# Set quota to 4GB (default is 2GB)
ETCD_QUOTA_BACKEND_BYTES=4294967296
# Snapshot count (when to create snapshot)
ETCD_SNAPSHOT_COUNT=50000
Milvus-Specific Settings
In milvus.yaml:
etcd:
endpoints:
- etcd-0:2379
- etcd-1:2379
- etcd-2:2379
rootPath: by-dev # Key prefix for Milvus data
metaSubPath: meta # Metadata subdirectory
kvSubPath: kv # Key-value subdirectory
# Authentication (if enabled)
auth:
enabled: false
userName: ""
password: ""
Operations
Check Cluster Health
# Single node health
docker exec etcd-0 etcdctl endpoint health
# Full cluster health
docker exec etcd-0 etcdctl endpoint health --cluster
# Expected:
# http://etcd-0:2379 is healthy
# http://etcd-1:2379 is healthy
# http://etcd-2:2379 is healthy
Check Cluster Members
docker exec etcd-0 etcdctl member list
# Expected:
# 1234567890abcdef, started, etcd-0, http://etcd-0:2380, http://etcd-0:2379
# 1234567890abcd01, started, etcd-1, http://etcd-1:2380, http://etcd-1:2379
# 1234567890abcd02, started, etcd-2, http://etcd-2:2380, http://etcd-2:2379
View Milvus Data
# List all keys (use with caution on large clusters)
docker exec etcd-0 etcdctl get --prefix "" --keys-only | head -20
# Get specific key
docker exec etcd-0 etcdctl get "by-dev/meta/collection/xxx"
# Watch changes in real-time
docker exec etcd-0 etcdctl watch "by-dev/meta" --prefix
Backup and Recovery
Snapshot Backup
Method 1: Online Snapshot (Recommended)#!/bin/bash
# backup-etcd.sh
BACKUP_DIR="/backups/etcd/$(date +%Y%m%d_%H%M%S)"
mkdir -p $BACKUP_DIR
# Create snapshot
docker exec etcd-0 etcdctl snapshot save /tmp/etcd.snapshot
# Copy from container
docker cp etcd-0:/tmp/etcd.snapshot $BACKUP_DIR/
# Also backup cluster info
docker exec etcd-0 etcdctl member list > $BACKUP_DIR/members.txt
echo "Backup saved to $BACKUP_DIR"
Method 2: Volume Backup (Offline)
# Stop etcd (requires downtime)
docker compose stop etcd-0 etcd-1 etcd-2
# Backup volumes
tar czf etcd-backup-$(date +%Y%m%d).tar.gz etcd-0-data/ etcd-1-data/ etcd-2-data/
# Restart
docker compose start etcd-0 etcd-1 etcd-2
Restore from Snapshot
#!/bin/bash
# restore-etcd.sh
SNAPSHOT_FILE="$1"
if [ -z "$SNAPSHOT_FILE" ]; then
echo "Usage: $0 <snapshot-file>"
exit 1
fi
# Stop Milvus first (critical!)
docker compose stop milvus-proxy milvus-rootcoord milvus-querycoord milvus-datacoord
# Stop etcd
docker compose stop etcd-0 etcd-1 etcd-2
# Clear old data
rm -rf etcd-0-data/* etcd-1-data/* etcd-2-data/*
# Restore to first node
docker run --rm \
-v $(pwd)/etcd-0-data:/etcd-data \
-v $(pwd)/$SNAPSHOT_FILE:/snapshot.db \
quay.io/coreos/etcd:v3.5.16 \
etcdctl snapshot restore /snapshot.db \
--data-dir /etcd-data \
--name etcd-0 \
--initial-cluster etcd-0=http://etcd-0:2380,etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380 \
--initial-cluster-token etcd-cluster \
--initial-advertise-peer-urls http://etcd-0:2380
# Copy restored data to other nodes
cp -r etcd-0-data/* etcd-1-data/
cp -r etcd-0-data/* etcd-2-data/
# Start etcd
docker compose up -d etcd-0 etcd-1 etcd-2
# Wait for health
sleep 5
docker exec etcd-0 etcdctl endpoint health --cluster
# Start Milvus
docker compose up -d
echo "Restore complete"
Sizing Guidelines
| Milvus Scale | etcd Nodes | CPU | RAM | Disk |
|---|---|---|---|---|
| <1M vectors | 1 (dev) | 0.5 | 512 MB | 10 GB |
| <10M vectors | 3 | 1 | 2 GB | 20 GB |
| <100M vectors | 3 | 2 | 4 GB | 50 GB |
| >100M vectors | 3-5 | 4 | 8 GB | 100 GB |
- Number of collections (not vectors)
- Number of segments
- Frequency of DDL operations
Troubleshooting
etcd is Slow
Symptoms: Milvus operations timeout, high latency Check:# Disk latency
docker exec etcd-0 etcdctl check perf
# Expected: <10ms for 90% of requests
Solutions:
- Use SSD (NVMe preferred)
- Ensure dedicated disk for etcd (not shared with other I/O)
- Increase CPU allocation
"No space left on device"
Cause: etcd quota exceeded or no compaction Fix:# Check size
docker exec etcd-0 etcdctl endpoint status --write-out table
# Manual compaction
docker exec etcd-0 etcdctl compaction $(docker exec etcd-0 etcdctl endpoint status --write-out json | jq -r '.[0].Header.revision')
# Defragment
docker exec etcd-0 etcdctl defrag
Leader Election Issues
Symptoms: Cluster unavailable, frequent leader changes Check:# Check network latency between nodes
for node in etcd-0 etcd-1 etcd-2; do
docker exec etcd-0 ping -c 3 $node
done
# Check logs
docker logs etcd-0 | grep -i "leader"
Common causes:
- Network latency > 100ms between nodes
- Clock skew (ensure NTP sync)
- Resource starvation (CPU/memory)
Member Recovery
If one etcd node fails permanently:
# Remove failed member
docker exec etcd-0 etcdctl member remove <member-id>
# Add new member
docker exec etcd-0 etcdctl member add etcd-new --peer-urls=http://etcd-new:2380
# Start new node with --initial-cluster-state=existing
docker run -d \
--name etcd-new \
-e ETCD_NAME=etcd-new \
-e ETCD_INITIAL_CLUSTER_STATE=existing \
-e ETCD_INITIAL_CLUSTER=etcd-0=http://etcd-0:2380,etcd-1=http://etcd-1:2380,etcd-new=http://etcd-new:2380 \
# ... other env vars
quay.io/coreos/etcd:v3.5.16
Best Practices
- 1 Always run 3+ nodes for production (5 for very large clusters)
- 2 Use dedicated SSD for etcd data — latency matters
- 3 Enable auto-compaction — prevents unbounded growth
- 4 Monitor etcd metrics — leader changes, disk usage, latency
- 5 Regular backups — snapshots every 6-24 hours
- 6 Keep etcd close to Milvus — same datacenter, <10ms latency
- 7 Don't share etcd — dedicated instance per Milvus cluster
Next Steps
Learn about object storage:
→ MinIO/S3 — Object Storage