AI Face Tracking & Smart Reframing for Vertical Video Conversion
A content repurposing platform needed to automatically convert horizontal (16:9) long-form videos into vertical (9:16) short-form clips while keeping speakers and subjects perfectly centered — without any manual cropping or keyframing.
The Challenge
Converting horizontal video to vertical format was one of the most tedious steps in short-form content production:
- Manually cropping and repositioning the frame for every clip was time-consuming
- Multi-person conversations required dynamic reframing as speakers changed
- Static center-crop cut off speakers who moved or sat off-center
- Traditional face detection was too slow for real-time reframing decisions across thousands of clips
- Different content types (interviews, solo vlogs, presentations) required different framing strategies
Our Solution
We built an AI-powered face tracking and smart reframing engine that detects faces in video frames, tracks their movement, and dynamically adjusts the vertical crop region to keep the active subject centered.
Architecture
- Face Detection: YOLO-based face detection model optimized for speed
- Face Tracking: IoU-based frame-to-frame tracking with persistent subject IDs
- Reframing Engine: Dynamic crop region calculation based on face positions and movement
- Active Speaker Coupling: Integration with speaker detection to prioritize the person talking
- Rendering: FFmpeg crop filter chain with smooth pan transitions
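The IoU-based tracking component can be illustrated with a minimal sketch. It assumes face boxes arrive as `(x1, y1, x2, y2)` pixel tuples from the detector; the `IouTracker` class, its greedy matching strategy, and the 0.3 threshold are illustrative assumptions, not the production implementation.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IouTracker:
    """Greedy frame-to-frame matcher that keeps persistent subject IDs."""

    def __init__(self, iou_threshold: float = 0.3) -> None:
        self.iou_threshold = iou_threshold  # assumed value; tune per footage
        self.tracks: Dict[int, Box] = {}
        self._next_id = 0

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        # Score every (track, detection) pair and visit the best overlaps first.
        pairs = sorted(
            ((iou(box, det), tid, i)
             for tid, box in self.tracks.items()
             for i, det in enumerate(detections)),
            reverse=True,
        )
        assigned: Dict[int, Box] = {}
        used = set()
        for score, tid, i in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to be the same face
            if tid in assigned or i in used:
                continue
            assigned[tid] = detections[i]
            used.add(i)
        # Detections that matched no existing track start new subject IDs.
        for i, det in enumerate(detections):
            if i not in used:
                assigned[self._next_id] = det
                self._next_id += 1
        # Simplification: tracks with no match this frame are dropped at once.
        self.tracks = assigned
        return assigned
```

A real tracker would usually keep unmatched tracks alive for a few frames so that a brief missed detection does not reset the subject ID.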
Reframing Pipeline
- Face Detection - Run YOLO face detection across sampled frames
- Subject Tracking - Link face detections across frames using IoU-based tracking
- Speaker Priority - When coupled with active speaker detection, prioritize the talking subject
- Crop Calculation - Determine optimal 9:16 crop region based on primary subject position
- Smoothing - Apply easing to crop movement to avoid jarring jumps
- Rendering - FFmpeg applies the dynamic crop with smooth pan transitions
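The crop calculation and smoothing steps above can be sketched as follows. This is a minimal illustration that assumes a full-height crop, a horizontal-only pan, and an exponential moving average as the easing function; the function names and the `alpha` value are hypothetical and not taken from the production system.

```python
from typing import List


def crop_width_for(src_h: int, aspect_w: int = 9, aspect_h: int = 16) -> int:
    """Width of a full-height crop at the target aspect ratio.

    Rounded down to an even number, since many video encoders
    expect even frame dimensions.
    """
    w = src_h * aspect_w // aspect_h
    return w - (w % 2)


def crop_left(face_cx: float, src_w: int, crop_w: int) -> float:
    """Left edge of a crop centered on the face, clamped to the frame."""
    left = face_cx - crop_w / 2
    return min(max(left, 0.0), float(src_w - crop_w))


def smooth(positions: List[float], alpha: float = 0.2) -> List[float]:
    """Exponential moving average: each step moves a fraction `alpha`
    of the remaining distance toward the target position."""
    if not positions:
        return []
    out: List[float] = []
    current = float(positions[0])
    for p in positions:
        current += alpha * (p - current)
        out.append(current)
    return out


# Example: 1920x1080 source, subject's face centered at x = 960.
crop_w = crop_width_for(1080)          # 606
left = crop_left(960, 1920, crop_w)    # 657.0
```

A lower `alpha` gives slower, calmer pans; a higher value follows the subject more tightly at the cost of visible jitter.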
Key Features
- Multi-Subject Handling - Tracks multiple faces and determines the primary subject per segment
- Speaker-Aware Framing - Prioritizes the active speaker when integrated with speaker detection
- Smooth Transitions - Eased panning between subjects eliminates jarring cuts
- Content-Type Adaptation - Different framing strategies for solo, interview, and group content
- Batch Processing - Reframe hundreds of clips from a single long-form video
- No Manual Intervention - Fully automated from detection to final render
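As a rough illustration of the final render step, the sketch below builds an FFmpeg command that applies a fixed crop to one clip. The helper name `build_crop_command` and the file names are hypothetical; only the standard `crop=w:h:x:y` filter syntax and common FFmpeg options are used, and the smooth pan transitions are not covered.

```python
from typing import List


def build_crop_command(src: str, dst: str, crop_w: int, crop_h: int,
                       x: int, y: int) -> List[str]:
    """Build an FFmpeg argument list that crops `src` to a fixed window."""
    video_filter = f"crop={crop_w}:{crop_h}:{x}:{y}"
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,
        "-vf", video_filter,   # crop=width:height:left:top
        "-c:a", "copy",        # keep the audio stream untouched
        dst,
    ]


cmd = build_crop_command("clip_001.mp4", "clip_001_vertical.mp4", 606, 1080, 657, 0)
```

The list can be passed to `subprocess.run(cmd, check=True)`. For a moving crop, the `x` and `y` arguments of the crop filter can be expressions of the timestamp `t`, which FFmpeg evaluates per frame.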
Results
Technology Stack
Frequently Asked Questions
MicrocosmWorks implemented a hybrid tracking approach that combines a lightweight face detector running on every 5th frame with a KCF (kernelized correlation filter) tracker for inter-frame predictions. When occlusion is detected via a drop in confidence score, the system maintains the last known trajectory with Kalman filtering and re-acquires the face within 200ms of it becoming visible again.
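The idea of holding a trajectory through occlusion can be sketched with an alpha-beta filter, which is a fixed-gain simplification of a constant-velocity Kalman filter rather than the full filter described above. The class name, gains, and one-dimensional state (the face's horizontal center) are assumptions made for illustration.

```python
from typing import Optional


class AlphaBetaTrack:
    """Tracks one coordinate with position and velocity estimates.

    With a measurement, the estimate is corrected toward it.
    Without one (occlusion), the track coasts on its last velocity.
    """

    def __init__(self, x0: float, alpha: float = 0.85, beta: float = 0.005) -> None:
        self.x = x0        # estimated position (pixels)
        self.v = 0.0       # estimated velocity (pixels per step)
        self.alpha = alpha  # position correction gain (assumed value)
        self.beta = beta    # velocity correction gain (assumed value)

    def step(self, measurement: Optional[float], dt: float = 1.0) -> float:
        predicted = self.x + self.v * dt
        if measurement is None:
            # Occluded: keep following the last known trajectory.
            self.x = predicted
            return self.x
        residual = measurement - predicted
        self.x = predicted + self.alpha * residual
        self.v += (self.beta / dt) * residual
        return self.x


# Gains of 1.0 make the behavior easy to follow by hand.
track = AlphaBetaTrack(x0=0.0, alpha=1.0, beta=1.0)
track.step(10.0)              # face seen at x = 10
track.step(20.0)              # face seen at x = 20, velocity is now 10 per step
coasted = track.step(None)    # occluded: the estimate continues to 30.0
```

A full Kalman filter additionally tracks uncertainty, so its gains adapt over time instead of staying fixed.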
MicrocosmWorks built a saliency-weighted cropping algorithm that prioritizes detected faces, then text regions, then motion areas when determining the 9:16 crop window position. For multi-person scenes, the system uses a configurable priority ranking, defaulting to the active speaker or the largest face, with smooth interpolation between crop positions to avoid jarring shifts.
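One way to express this priority ranking is a sort key over candidate regions. The sketch below assumes each candidate carries a kind (`face`, `text`, or `motion`), an area, and a speaking flag; the `Region` type and `pick_target` function are hypothetical names, not the system's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lower number means higher priority: faces, then text, then motion.
PRIORITY = {"face": 0, "text": 1, "motion": 2}


@dataclass
class Region:
    kind: str            # "face", "text", or "motion"
    cx: float            # horizontal center in pixels
    area: float          # region area in square pixels
    is_speaking: bool = False


def pick_target(regions: List[Region]) -> Optional[Region]:
    """Choose the region the crop window should follow.

    Order of preference: region kind, then active speaker,
    then largest area.
    """
    if not regions:
        return None
    return min(
        regions,
        key=lambda r: (PRIORITY[r.kind], not r.is_speaking, -r.area),
    )


candidates = [
    Region("face", cx=400, area=9000),
    Region("face", cx=1400, area=6000, is_speaking=True),
    Region("text", cx=960, area=50000),
]
target = pick_target(candidates)  # the speaking face at cx = 1400
```

The chosen region's center would then feed the crop position, with interpolation applied between successive targets.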
Yes, MicrocosmWorks implemented a fallback saliency detection mode that activates when no faces are present, using a combination of motion detection, visual attention modeling, and mouse cursor tracking for screen recordings. The system intelligently follows the most relevant content region even in purely visual or text-based footage.
MicrocosmWorks optimized the pipeline for batch workflows, achieving 8x real-time processing speed on a single NVIDIA T4 GPU, meaning a 10-minute video is reframed in approximately 75 seconds. The system supports parallel processing across multiple GPUs, scaling linearly for high-volume content operations.
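The arithmetic behind these figures is simple to reproduce. The sketch below assumes ideal linear scaling across GPUs, as stated above, and ignores overheads such as decoding, file transfer, and queueing; the function names are illustrative.

```python
from typing import Iterable


def processing_seconds(video_seconds: float, speedup: float) -> float:
    """Time to process one video at `speedup` times real time."""
    return video_seconds / speedup


def batch_wall_clock_seconds(video_lengths: Iterable[float], speedup: float,
                             gpus: int = 1) -> float:
    """Wall-clock estimate for a batch, assuming ideal linear GPU scaling."""
    total = sum(processing_seconds(v, speedup) for v in video_lengths)
    return total / gpus


single = processing_seconds(600, 8)                         # 75.0 seconds
batch = batch_wall_clock_seconds([600] * 100, 8, gpus=4)    # 1875.0 seconds
```

Under these assumptions, one hundred 10-minute videos on four GPUs would take about 31 minutes.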
MicrocosmWorks develops AI video reframing systems at rates of $25-$45/hr, with a full face tracking and smart reframing solution including model optimization, batch processing support, and API integration typically requiring 350-550 development hours. This investment eliminates the need for manual reframing editors, which typically cost $5-$15 per video.