
On-Off Scaling Pattern for AI & Video Processing Workloads

An AI-powered video processing platform needed to handle highly variable workloads — from zero jobs during off-hours to hundreds of concurrent video processing and AI inference tasks during peak times — without paying for idle GPU and compute resources.

Domain: GPU Infrastructure
Technologies: 10
Key Results: 5
Status: Delivered

The Challenge

AI and video processing workloads are inherently bursty and expensive:

  • GPU instances are costly whether processing jobs or sitting idle
  • Video encoding, transcription, and AI inference demand different resource profiles
  • Peak-to-trough ratio was 50:1 — 200+ jobs during peak, near-zero overnight
  • Traditional auto-scaling was too slow (5-10 min cold start) for time-sensitive user requests
  • Fixed infrastructure provisioned for peak meant 80%+ waste during off-peak hours

Our Solution

We implemented an On-Off scaling pattern — a hybrid architecture where compute resources are provisioned just-in-time for active workloads and fully deallocated when idle, with warm pools for latency-sensitive tasks and cold pools for batch jobs.

Architecture

  • Job Queue: Database-backed job queue with priority classification
  • Orchestrator: Service managing resource lifecycle and job routing
  • GPU Workers (AI): Cloud GPU pods for inference (object detection, transcription, speaker detection)
  • CPU Workers (Video): Cloud VMs for video encoding and rendering
  • Warm Pool: Pre-initialized instances for latency-sensitive jobs (< 30s startup)
  • Cold Pool: On-demand instances for batch/bulk processing (2-5 min startup acceptable)
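
A minimal sketch of how these pools might be described in configuration, written in TypeScript since the orchestrator runs on Node.js. Every type name, instance type, and value below is an illustrative assumption, not the platform's actual settings:

```typescript
// Illustrative pool configuration. All names and values are assumptions
// for this sketch, not the production configuration.
type PoolKind = "warm" | "cold";

interface PoolConfig {
  kind: PoolKind;
  instanceType: string;    // cloud GPU pod or VM type
  minInstances: number;    // pre-initialized floor (0 for cold pools)
  maxInstances: number;    // hard ceiling to cap spend
  cooldownSeconds: number; // idle time before scale-down
  useSpot: boolean;        // spot/preemptible instances for batch work
}

const pools: Record<string, PoolConfig> = {
  gpuWarm: { kind: "warm", instanceType: "gpu-pod",   minInstances: 2, maxInstances: 20, cooldownSeconds: 300, useSpot: false },
  gpuCold: { kind: "cold", instanceType: "gpu-pod",   minInstances: 0, maxInstances: 50, cooldownSeconds: 0,   useSpot: true  },
  cpuWarm: { kind: "warm", instanceType: "vm-16core", minInstances: 1, maxInstances: 10, cooldownSeconds: 300, useSpot: false },
  cpuCold: { kind: "cold", instanceType: "vm-16core", minInstances: 0, maxInstances: 40, cooldownSeconds: 0,   useSpot: true  },
};
```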

On-Off Pattern Implementation

Resource Lifecycle States

Resources move through a defined lifecycle: from fully deallocated (zero cost), through provisioning and warming (models loading, health checks), to ready and processing states, then through a cooldown window before returning to deallocated.
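
One natural encoding of that lifecycle is an explicit state machine with a whitelist of legal transitions. This is a sketch under our own state names, derived from the description above rather than the platform's actual code:

```typescript
// Resource lifecycle states as described above (names are illustrative).
type LifecycleState =
  | "deallocated"   // zero cost, nothing provisioned
  | "provisioning"  // cloud API call in flight
  | "warming"       // models loading, health checks running
  | "ready"         // idle but able to accept a job
  | "processing"    // actively running a job
  | "cooldown";     // idle timer running before deallocation

// Legal transitions; anything else is rejected as a programming error.
const transitions: Record<LifecycleState, LifecycleState[]> = {
  deallocated:  ["provisioning"],
  provisioning: ["warming", "deallocated"],  // back to deallocated on failure
  warming:      ["ready", "deallocated"],    // back to deallocated if health checks fail
  ready:        ["processing", "cooldown"],
  processing:   ["ready", "cooldown"],
  cooldown:     ["ready", "deallocated"],    // a new job cancels the cooldown
};

function transition(from: LifecycleState, to: LifecycleState): LifecycleState {
  if (!transitions[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```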

Warm Pool Strategy

For latency-sensitive processing (user-initiated, with results expected in minutes; see the sketch after this list):

  • Maintain a minimum warm pool of instances during business hours
  • Pre-load AI models at container startup
  • Route incoming jobs to warm instances first
  • Scale out additional warm instances when queue depth exceeds threshold
  • Configurable cooldown timer keeps instances alive between sporadic jobs
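
A sketch of the sizing rule those bullets imply. The floor, the jobs-per-instance threshold, and the cooldown length are made-up parameters, and a real implementation would also clamp to the pool's maximum:

```typescript
interface WarmPoolState {
  readyInstances: number;
  queueDepth: number;
  lastJobFinishedAt: number; // epoch ms
}

const MIN_WARM = 2;          // business-hours floor (illustrative)
const JOBS_PER_INSTANCE = 5; // scale-out threshold (illustrative)
const COOLDOWN_MS = 5 * 60_000;

// Desired warm instance count: keep the floor during business hours,
// add capacity when the queue backs up, and hold instances through the
// cooldown window so sporadic jobs reuse them instead of cold-starting.
function desiredWarmInstances(s: WarmPoolState, now: number, businessHours: boolean): number {
  const floor = businessHours ? MIN_WARM : 0;
  const forQueue = Math.ceil(s.queueDepth / JOBS_PER_INSTANCE);
  const inCooldown = now - s.lastJobFinishedAt < COOLDOWN_MS;
  const hold = inCooldown ? s.readyInstances : 0; // never shrink mid-cooldown
  return Math.max(floor, forQueue, hold);
}
```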

Cold Pool Strategy

For batch processing (overnight bulk jobs, non-urgent re-encodes; see the sketch after this list):

  • Zero instances running by default
  • Job queue triggers provisioning when batch jobs are submitted
  • Bulk-optimized instances for throughput over latency
  • Terminate immediately after batch completes
  • Use spot/preemptible instances for significant cost savings
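
The cold pool only reacts to batch submissions. In the sketch below, provisionSpotInstance and terminateInstance are hypothetical stand-ins for whatever the cloud provider's API actually exposes:

```typescript
// Hypothetical stand-ins for provider-specific cloud API calls.
async function provisionSpotInstance(instanceType: string): Promise<string> {
  console.log(`provisioning spot ${instanceType}`);
  return `i-${Math.random().toString(36).slice(2, 8)}`; // fake instance id
}
async function terminateInstance(id: string): Promise<void> {
  console.log(`terminating ${id}`);
}

const BATCH_JOBS_PER_INSTANCE = 20; // throughput-oriented packing (illustrative)

// Zero instances by default: provision only when a batch is submitted,
// and tear everything down as soon as the batch drains.
async function runBatch(jobCount: number, drainQueue: () => Promise<void>): Promise<void> {
  const n = Math.ceil(jobCount / BATCH_JOBS_PER_INSTANCE);
  const ids = await Promise.all(
    Array.from({ length: n }, () => provisionSpotInstance("vm-16core"))
  );
  try {
    await drainQueue(); // process the batch on the provisioned workers
  } finally {
    // Terminate immediately after the batch completes -- no cooldown.
    await Promise.all(ids.map(terminateInstance));
  }
}
```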

Job Classification & Routing

Jobs are automatically classified by priority and type, then routed to the appropriate pool (a routing sketch follows the list):

  • High-priority, user-initiated AI tasks route to warm GPU pools
  • Critical real-time tasks route to always-on dedicated instances
  • Medium-priority encoding tasks route to warm or cold CPU pools
  • Low-priority batch tasks route to cold spot/preemptible instances
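
That routing table compresses to a small function. The priority levels, job types, and pool names are our own labels for the sketch:

```typescript
type Priority = "critical" | "high" | "medium" | "low";
type JobType = "ai-inference" | "video-encode";

interface Job { priority: Priority; type: JobType; }

type Pool = "dedicated" | "gpu-warm" | "cpu-warm" | "cpu-cold-spot";

// Routing rules from the list above (pool names are illustrative).
function routeJob(job: Job): Pool {
  if (job.priority === "critical") return "dedicated"; // always-on instances
  if (job.priority === "high" && job.type === "ai-inference") return "gpu-warm";
  if (job.priority === "medium") return "cpu-warm";    // overflows to cold if warm is saturated
  return "cpu-cold-spot";                              // low-priority batch work
}
```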

Orchestrator Logic

Scale-Up Triggers

  • Queue depth exceeds configurable threshold
  • Average wait time exceeds SLA for the priority level
  • Scheduled ramp-up before known peak hours
  • Manual trigger via admin API for anticipated traffic spikes

Scale-Down Triggers

  • No jobs processed for the duration of the cooldown window
  • Scheduled wind-down after peak hours
  • All queued jobs completed with no new submissions
  • Cost threshold reached for the billing period
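
Both trigger lists reduce to two predicates the orchestrator can evaluate on each control-loop tick. The metric names and the queue-depth threshold are illustrative; the manual admin-API trigger bypasses these checks entirely:

```typescript
interface OrchestratorMetrics {
  queueDepth: number;
  avgWaitSeconds: number;
  slaWaitSeconds: number;       // SLA for the current priority level
  idleSeconds: number;          // time since the last job completed
  cooldownSeconds: number;      // configured cooldown window
  monthSpendUsd: number;        // spend so far this billing period
  monthBudgetUsd: number;       // cost threshold for the billing period
  inScheduledPeak: boolean;     // pre-peak ramp-up window (cron rule)
  inScheduledWindDown: boolean; // post-peak wind-down window (cron rule)
}

const QUEUE_DEPTH_THRESHOLD = 10; // illustrative

function shouldScaleUp(m: OrchestratorMetrics): boolean {
  return (
    m.queueDepth > QUEUE_DEPTH_THRESHOLD ||
    m.avgWaitSeconds > m.slaWaitSeconds ||
    m.inScheduledPeak
  );
}

function shouldScaleDown(m: OrchestratorMetrics): boolean {
  const queueDrained = m.queueDepth === 0;
  return (
    (queueDrained && m.idleSeconds >= m.cooldownSeconds) || // cooldown elapsed
    m.inScheduledWindDown ||                                // after peak hours
    m.monthSpendUsd >= m.monthBudgetUsd                     // billing-period cap
  );
}
```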

Health & Recovery

  • Regular health probes on all active instances
  • Unhealthy instances replaced automatically
  • Failed jobs re-queued with retry count and routed to a different instance
  • Dead letter queue for jobs exceeding max retries
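
A sketch of the re-queue path; the retry cap and the queue callbacks are assumptions for illustration:

```typescript
interface QueuedJob {
  id: string;
  retries: number;
  lastInstanceId?: string; // the router avoids this instance on retry
}

const MAX_RETRIES = 3; // illustrative

// On failure: re-queue with an incremented retry count and record the
// failing instance so the job lands elsewhere; once the cap is exceeded,
// park the job in a dead letter queue for manual inspection.
function handleFailure(
  job: QueuedJob,
  failedInstanceId: string,
  requeue: (j: QueuedJob) => void,
  deadLetter: (j: QueuedJob) => void,
): void {
  if (job.retries + 1 > MAX_RETRIES) {
    deadLetter(job);
    return;
  }
  requeue({ ...job, retries: job.retries + 1, lastInstanceId: failedInstanceId });
}
```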

Cost Impact

The On-Off pattern delivered approximately 70% cost reduction vs. always-on fixed infrastructure by eliminating idle compute during off-peak hours, right-sizing resources per job type, and leveraging spot instances for batch workloads.
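
To see where a figure of that order can come from, take purely illustrative numbers: a $2/hour GPU instance running always-on costs $48/day. With roughly six active warm-pool hours per day ($12) plus four hours of overnight batch on spot instances at a 70% discount ($2.40), the daily bill is about $14.40, roughly 70% less. The real savings depend on the actual peak-to-trough profile and spot pricing.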

Key Features

  1. Zero Idle Cost — Resources fully deallocated when not processing jobs
  2. Warm Pools — Pre-initialized instances for latency-sensitive workloads
  3. Cold Pools — On-demand provisioning for batch jobs at lowest cost
  4. Job Classification — Automatic routing based on priority, type, and latency requirements
  5. Cooldown Windows — Configurable idle timeout prevents premature scale-down between bursts
  6. Spot/Preemptible Support — Batch jobs routed to discounted instances for significant savings
  7. Health & Recovery — Auto-replacement of unhealthy instances with job re-queuing
  8. Scheduled Scaling — Anticipate known traffic patterns with time-based provisioning rules

Results

Cost Reduction: ~70% savings vs. always-on fixed infrastructure
Latency: < 30 seconds from cold start to ready for warm pool instances
Reliability: Auto-recovery and job re-queuing maintained 99.5%+ job completion rate
Flexibility: Different GPU/CPU tiers for different job types optimized cost-per-job
Scale: Handled 200+ concurrent jobs during peak with zero pre-provisioned infrastructure during off-peak

Technology Stack

Node.js · MongoDB · RunPod API · Cloud VM APIs · Docker · FastAPI · FFmpeg · Redis · Job Queue · Cron Scheduling

Have a Similar Project in Mind?

Let's discuss how we can build a solution tailored to your needs.

Contact Us · Schedule Appointment