Milvus Autoscaling on Kubernetes with EC2 and S3-Backed Persistent Storage
An AI platform with rapidly growing vector data (embeddings for search, recommendations, and RAG) needed its Milvus vector database to scale automatically with query load and data volume, backed by durable, cost-effective storage so that no data was lost when pods restarted or nodes were replaced.
The Challenge
Running Milvus at scale in production presented several infrastructure challenges:
- Fixed Capacity — Static Milvus deployments couldn't handle 10x query load spikes during peak hours
- Data Loss Risk — Pod restarts on ephemeral storage caused index rebuilds taking hours on large collections
- Cost Inefficiency — Over-provisioning for peak load meant paying for idle compute 70% of the time
- Storage Costs — Block storage volumes tied to instances were expensive for multi-terabyte vector datasets
- Index Rebuilds — Re-indexing millions of vectors after a node replacement took hours of downtime
- Multi-AZ Durability — Single-AZ storage couldn't survive availability zone failures
Our Solution
We deployed Milvus on Kubernetes (EKS) with Horizontal Pod Autoscaling for query nodes, Cluster Autoscaler for compute, and Amazon S3 as the persistent storage backend — eliminating data loss risk and reducing storage costs by ~80%.
Architecture
- Orchestration: Amazon EKS (Elastic Kubernetes Service)
- Compute: EC2 instances (mixed instance types) managed by Cluster Autoscaler
- Vector DB: Milvus deployed via Helm chart in distributed mode
- Object Storage: Amazon S3 for segment files, index files, and binlog persistence
- Metadata: etcd cluster for Milvus coordination and metadata
- Message Queue: Log streaming layer (e.g., Apache Pulsar or Kafka) for the Milvus write-ahead log pipeline
- Monitoring: Prometheus + Grafana for Milvus metrics and autoscaling signals
Milvus Distributed Architecture on Kubernetes
Component Deployment
Milvus runs in distributed mode with dedicated node types, each deployed as a Kubernetes workload with independent scaling (a Helm values sketch follows the list):
- Proxy Nodes — Handle client connections and request routing
- Query Nodes — Execute vector searches and load segments into memory
- Data Nodes — Handle write paths and flush segments to S3
- Index Nodes — Build vector indexes and write to S3
- Coordinator — Cluster coordination and timestamp allocation
- etcd — Metadata storage and service discovery
- Message Queue — Log streaming and write-ahead log
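As a rough sketch, this topology maps onto the Milvus Helm chart's values file. The keys below follow the public milvus-helm chart conventions; replica counts and resource figures are illustrative placeholders, not the production values:

```yaml
# values.yaml sketch for the Milvus Helm chart (numbers are illustrative)
cluster:
  enabled: true          # distributed mode: one workload per component
proxy:
  replicas: 2
queryNode:
  replicas: 4            # primary autoscaling target (see HPA below)
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
dataNode:
  replicas: 2
indexNode:
  replicas: 2            # scales with the index build queue
```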
Horizontal Pod Autoscaling (HPA)
Query Node Autoscaling
Query nodes are the primary scaling target — they load vector segments into memory and execute searches. Scaling is driven by multiple metrics including CPU utilization, memory utilization, query queue depth, and P99 query latency. The HPA is configured with appropriate min/max replicas, fast scale-up for handling spikes, and gradual scale-down to avoid flapping.
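A minimal sketch of such an HPA is below, using the standard autoscaling/v2 API. Only the CPU and memory resource metrics are shown; queue depth and P99 latency would be wired in as additional custom metrics through the Prometheus Adapter. The target name and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: milvus-querynode
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: milvus-querynode          # name depends on the Helm release
  minReplicas: 4
  maxReplicas: 32
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react immediately to spikes
      policies:
        - type: Percent
          value: 100                  # allow doubling the replica count per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 minutes before shrinking
      policies:
        - type: Pods
          value: 1                    # then remove one pod at a time
          periodSeconds: 120
```

The asymmetric behavior stanza is the key detail: scale-up may double the replica count every minute, while scale-down removes a single pod every two minutes after a five-minute stabilization window, which is what prevents flapping.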
Index Node Autoscaling
Index nodes scale based on pending index build jobs — scaling up when the build queue has pending items and scaling back down when idle.
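A minimal sketch, assuming the pending-build count is surfaced to the HPA as an external metric through the Prometheus Adapter; milvus_pending_index_tasks is a hypothetical metric name standing in for whatever counter the deployed Milvus version exports:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: milvus-indexnode
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: milvus-indexnode          # name depends on the Helm release
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: milvus_pending_index_tasks   # hypothetical adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "2"         # add replicas once the queue exceeds ~2 builds per node
```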
EC2 Cluster Autoscaler
Instance Strategy
- Node Groups: Multiple node groups with different instance types for cost optimization (see the eksctl sketch after this list)
- Query Workload: Memory-optimized instances for in-memory vector segments
- Index Workload: Compute-optimized instances for CPU-intensive index building
- Spot Instances: Index nodes and non-critical data nodes run on spot instances for significant savings
- On-Demand: Query nodes and coordinators on on-demand instances for stability
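In eksctl terms, this strategy looks roughly like the ClusterConfig below. The cluster name, instance types, and sizes are illustrative; Milvus pods are steered onto the right group with matching nodeSelectors and tolerations:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: milvus-cluster              # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: query-ondemand            # memory-optimized, on-demand for query nodes
    instanceTypes: ["r6i.2xlarge", "r6i.4xlarge"]
    minSize: 2
    maxSize: 20
    labels:
      workload: milvus-query
  - name: index-spot                # compute-optimized spot for index builds
    instanceTypes: ["c6i.4xlarge", "c5.4xlarge"]
    spot: true
    minSize: 0
    maxSize: 10
    labels:
      workload: milvus-index
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule          # only pods tolerating spot land here
```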
Scaling Behavior
When HPA creates new pods that can't be scheduled, the Cluster Autoscaler provisions new EC2 instances in the appropriate node group. New query nodes then load their assigned segments from S3 into memory and begin serving queries, with the total scale-up process completing in minutes.
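The Cluster Autoscaler's behavior is tuned through its container flags; the excerpt below shows a plausible configuration, with the image version and cluster tag as placeholders:

```yaml
# Excerpt from the cluster-autoscaler Deployment's pod spec
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --expander=least-waste             # prefer the cheapest node group that fits pending pods
      - --balance-similar-node-groups      # keep multi-AZ groups evenly sized
      - --scale-down-delay-after-add=10m   # don't remove nodes right after a scale-up
      - --scale-down-unneeded-time=10m     # a node must be idle this long before removal
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/milvus-cluster
```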
S3-Backed Persistent Storage
Why S3 Instead of Block Storage
S3 provides significant advantages over block storage for Milvus:
- ~80% lower storage cost for large datasets
- 11-nines durability with built-in multi-AZ replication
- Unlimited scaling without manual volume resizing
- Pod-independent — Data always available regardless of pod or node lifecycle
- No AZ lock-in — Data accessible from any availability zone
Data Flow with S3
- Write Path: Data nodes buffer inserts in memory, then flush sealed segments to S3 (bucket configuration sketched after this list)
- Index Build: Index nodes read segments from S3, build indexes, and write index files back to S3
- Query Path: Query nodes download segments and indexes from S3, load into memory, and serve queries
- Recovery: On pod restart, query nodes re-download assigned segments from S3 (no data loss)
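Pointing Milvus at S3 is a chart-level switch: the bundled MinIO is disabled and an externalS3 block supplies the bucket. A sketch with a hypothetical bucket name, assuming IAM roles for service accounts (IRSA) rather than static keys:

```yaml
# values.yaml excerpt: swap the bundled MinIO for Amazon S3
minio:
  enabled: false                        # don't deploy in-cluster MinIO
externalS3:
  enabled: true
  host: "s3.us-east-1.amazonaws.com"
  port: "443"
  useSSL: true
  bucketName: "milvus-vectors-example"  # hypothetical bucket name
  rootPath: "milvus"                    # key prefix inside the bucket
  useIAM: true                          # authenticate via IRSA, no static keys
  cloudProvider: "aws"
  region: "us-east-1"
```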
S3 Performance Optimization
- Segment size tuning balances S3 request costs vs. data freshness
- Local SSD caching on NVMe instance storage avoids repeated S3 reads for hot segments
- Parallel downloads enable fast query node startup
- Lifecycle policies archive old data to cheaper storage tiers (see the CloudFormation sketch below)
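For the lifecycle piece, a CloudFormation sketch of the bucket is below. The bucket name and the 90-day threshold are assumptions; STANDARD_IA keeps objects instantly readable (at a per-GB retrieval fee), so tiered segments can still be loaded by query nodes:

```yaml
# CloudFormation sketch; bucket name and transition window are assumptions
Resources:
  MilvusBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: milvus-vectors-example
      LifecycleConfiguration:
        Rules:
          - Id: tier-cold-segments
            Status: Enabled
            Transitions:
              - StorageClass: STANDARD_IA   # instantly readable, lower $/GB stored
                TransitionInDays: 90
          - Id: clean-failed-uploads
            Status: Enabled
            AbortIncompleteMultipartUpload:
              DaysAfterInitiation: 7        # drop stalled multipart uploads
```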
Monitoring & Observability
The deployment includes comprehensive monitoring via Prometheus and Grafana:
- Query Performance — Latency distribution, QPS, cache hit rate
- Cluster Overview — Node count, pod status, resource utilization
- Storage Health — S3 usage, segment counts, flush rates
- Autoscaling Events — HPA events, node scaling, pod scheduling latency
- Alerting — Automated alerts for high latency, OOM risk, flush failures, and capacity limits (example rules below)
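Two representative alert rules, expressed as a Prometheus Operator PrometheusRule. The Milvus latency metric name and its millisecond unit are assumptions to verify against the metrics your Milvus version actually exports:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: milvus-alerts
spec:
  groups:
    - name: milvus
      rules:
        - alert: MilvusHighSearchLatency
          # milvus_proxy_sq_latency is assumed to be a millisecond histogram
          expr: |
            histogram_quantile(0.99,
              sum(rate(milvus_proxy_sq_latency_bucket[5m])) by (le)) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Milvus P99 search latency above 500 ms for 5 minutes"
        - alert: MilvusQueryNodeMemoryPressure
          expr: |
            max by (pod) (container_memory_working_set_bytes{pod=~".*querynode.*", container!=""})
              / max by (pod) (kube_pod_container_resource_limits{pod=~".*querynode.*", resource="memory"})
              > 0.9
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Query node {{ $labels.pod }} above 90% of its memory limit (OOM risk)"
```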
Key Features
- Query Node HPA — Automatic scaling based on CPU, memory, latency, and queue depth
- EC2 Cluster Autoscaler — Dynamic node provisioning with mixed instance types
- S3 Persistence — 11-nines durability, ~80% cheaper than block storage, survives AZ failures
- Spot Instances — Index and data nodes on spot for significant compute savings
- Local SSD Cache — NVMe caching eliminates repeated S3 reads for hot segments
- Zero-Downtime Recovery — Pod restarts reload segments from S3 without data loss
- Multi-AZ — S3 storage + multi-AZ node groups for full AZ failure tolerance
- Observability — Prometheus + Grafana with Milvus-specific metrics and autoscaling visibility