What is Smallest Self-Host?
Smallest Self-Host enables you to deploy state-of-the-art speech-to-text (STT) models in your own infrastructure, whether in the cloud or on-premises. Built for enterprises with stringent performance, security, or compliance requirements, it provides the same powerful AI capabilities as Smallest’s cloud service while keeping your data under your complete control.

Why Self-Host?
Using Smallest as a managed service has many benefits: it’s fast to start developing with, requires no infrastructure setup, and eliminates hardware, installation, configuration, backup, and maintenance costs. However, there are situations where a self-hosted deployment makes more sense.

Performance Requirements
Certain use cases have very strict latency and load requirements. If you need ultra-low latency, with voice AI services colocated alongside your other services, self-hosting can meet them.

Ideal for:
- Real-time AI voicebots requiring <100ms response times
- Live transcription systems for broadcasts or conferences
- High-volume processing with predictable costs
- Edge deployments with limited internet connectivity

Key benefits:
- Colocate speech services with your application infrastructure
- Scale independently based on your specific workload patterns
- No network latency to external APIs
- Consistent performance regardless of internet conditions
Security & Data Privacy
One of the most common reasons to self-host Smallest is to satisfy security or data privacy requirements. In a typical self-hosted deployment, no audio, transcripts, or other identifying markers of the request content are sent to Smallest servers.

Ideal for:
- Healthcare applications requiring HIPAA compliance
- Financial services with strict data governance
- Government and defense applications
- Enterprise environments with air-gapped networks

Key benefits:
- Your audio data never leaves your infrastructure
- Transcripts remain entirely within your control
- No data is stored beyond the duration of the API request; self-hosted deployments do not persist request/response data

Usage reporting:
- Only metadata such as audio duration, character count, features requested, and success response codes
- No audio content, transcripts, or personally identifiable information

Note: In a typical self-hosted deployment, no audio or transcript data is sent to Smallest servers. Only usage metadata (duration, feature flags, response codes) is reported to the license server for validation and billing purposes.
Cost Optimization
For high-volume or predictable workloads, self-hosting can be more cost-effective:
- Predictable costs based on infrastructure, not usage
- No per-minute charges for audio processing
- Efficient resource utilization with autoscaling
- Long-term savings for sustained high volumes
Customization & Control
Self-hosting provides complete control over your deployment:
- Custom resource allocation optimized for your workload
- Version control - upgrade on your schedule
- Network isolation - deploy in private networks
- Integration flexibility - direct database access, custom monitoring
Components
Before you deploy Smallest, you’ll need to understand the components of your system, their relationships, and the interactions between them. A well-designed architecture will meet your business needs, optimize both performance and security, and provide a strong technical foundation for future growth.

Architecture Diagram

In brief: clients call the API Server, which queues and routes work to Lightning ASR workers, coordinating through Redis; every component validates its license with the Smallest License Server, either directly or via the License Proxy.
Component Details
API Server
Purpose: The API server interfaces with Lightning ASR to expose endpoints for your requests.

Key Features:
- Routes incoming API requests to available Lightning ASR workers
- Manages WebSocket connections for streaming transcription
- Handles request queuing and load balancing across workers
- Provides a unified REST API interface (see the example below)

Resource Requirements:
- CPU: 0.5-2 cores
- Memory: 512 MB - 2 GB
- No GPU required
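Once the stack is running, a quick way to exercise the REST interface is a direct HTTP request. This is a minimal sketch only: the host, port, and endpoint path are assumptions for illustration, so consult the API Reference for the routes your deployment actually exposes.

```bash
# Hypothetical request; host, port, and path are placeholders --
# check the API Reference for your deployment's real routes.
curl -X POST "http://localhost:8080/v1/transcribe" \
  -F "file=@sample.wav"
```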
Lightning ASR
Purpose: The Lightning ASR engine performs the computationally intensive task of speech recognition. It manages GPU devices and responds to requests from the API layer.

Key Features:
- GPU-accelerated speech recognition (0.05-0.15x real-time factor, i.e., a 10-minute recording transcribes in roughly 30-90 seconds)
- Real-time and batch audio transcription
- Automatic model loading and optimization
- Horizontal scaling support

Resource Requirements:
- CPU: 4-8 cores
- Memory: 12-16 GB RAM
- GPU: 1x NVIDIA GPU (16+ GB VRAM required)
- Storage: 50+ GB for models

Note: Because Lightning ASR is decoupled from the API Server, you can scale it independently based on your transcription load.
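Because the tiers are decoupled, scaling the ASR workers on Kubernetes is a single operation. A minimal sketch, assuming a Deployment named lightning-asr (match your actual resource name):

```bash
# Scale only the ASR tier; the API Server deployment is untouched.
# "lightning-asr" is a placeholder; use your actual Deployment name.
kubectl scale deployment lightning-asr --replicas=3
```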
License Proxy
Purpose: Components register with the Smallest License Server to verify licensing and report usage. API and Engine containers can be configured to connect directly to the licensing server, or to proxy their communication through the License Proxy.

Key Features:
- License key validation on startup
- Usage metadata reporting (no audio/transcript data)
- Grace period support for offline operation
- Secure communication with the Smallest License Server

Resource Requirements:
- CPU: 0.25-1 core
- Memory: 256-512 MB
- No GPU required

Network: Requires outbound HTTPS to https://console-api.smallest.ai
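To illustrate the two topologies, here is a sketch of how the licensing endpoint might be switched between direct and proxied modes. The variable name and proxy address are hypothetical, not the product’s actual configuration keys:

```bash
# Both values here are illustrative, shown only to contrast the two
# topologies; use the configuration keys from your deployment docs.

# Direct mode: each container reaches the license server itself.
export LICENSE_SERVER_URL="https://console-api.smallest.ai"

# Proxied mode: containers talk to a local License Proxy, which then
# becomes the only component that needs outbound HTTPS.
export LICENSE_SERVER_URL="http://license-proxy:9000"
```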
Redis
Purpose: Provides caching and state management for the system.

Key Features:
- Request queuing and coordination between API and ASR workers
- Session state for streaming connections
- Performance optimization through caching
- Can be embedded or external (AWS ElastiCache, etc.; see the sketch after this list)

Resource Requirements:
- CPU: 0.5-1 core
- Memory: 512 MB - 2 GB
- No GPU required
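If you opt for an external Redis, configuration typically amounts to a connection string. A sketch, assuming a hypothetical REDIS_URL setting and a made-up ElastiCache endpoint:

```bash
# REDIS_URL is a hypothetical key and the endpoint below is a made-up
# ElastiCache address; point it at your actual managed instance.
export REDIS_URL="redis://my-cache.abc123.use1.cache.amazonaws.com:6379"
```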
Common Setup Path
All deployments follow the same initial setup path through environment preparation. Here’s what to expect:

1. Choose Your Deployment Method
Docker/Podman
Best for: Development, testing, small-scale production
Timeline: 15-30 minutes
Complexity: Low
Kubernetes
Best for: Production deployments with autoscaling
Timeline: 1-2 hours
Complexity: Medium-High
2. Prepare Infrastructure
Steps:
- Obtain credentials from Smallest.ai (license key, registry access, model URLs); see the registry login sketch after this list
- Prepare infrastructure (Docker host or Kubernetes cluster)
- Set up GPU support (NVIDIA drivers, device plugins)
- Deploy components (API Server, Lightning ASR, License Proxy, Redis)
- Configure autoscaling (optional, Kubernetes only)
- Set up monitoring (optional, Prometheus & Grafana)
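The credential step usually amounts to authenticating Docker against Smallest’s private registry and pulling the component images. A sketch with placeholder registry host and image names; use the values that come with your credentials:

```bash
# Registry host and image names are placeholders.
echo "$REGISTRY_PASSWORD" | docker login registry.smallest.ai \
  --username "$REGISTRY_USERNAME" --password-stdin
docker pull registry.smallest.ai/api-server:latest
docker pull registry.smallest.ai/lightning-asr:latest
```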
What You’ll Need
Before starting, ensure you have:

From Smallest.ai
- License key
- Container registry credentials
- Model download URLs
Technical Requirements
- GPU infrastructure (NVIDIA A10, T4, or better)
- Kubernetes cluster or Docker host
- Basic DevOps knowledge
- Network connectivity for license validation
Deployment Options
Smallest Self-Host supports two primary deployment methods, each suited to different operational requirements:

Docker Deployment
Best for development, testing, or small-scale production deployments.

Pros:
- Fastest setup (under 15 minutes)
- Minimal infrastructure requirements
- Single-machine deployment
- Easy configuration with docker-compose (see the sketch after this list)

Use cases:
- Development and testing
- Proof of concept
- Small-scale production
- Edge deployments
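To give a feel for the shape of a single-machine deployment, here is a skeletal compose file written from the shell. Every image name, port, and environment key below is an illustrative placeholder, not the product’s real values:

```bash
# Write a skeletal docker-compose.yml; all names are placeholders.
cat > docker-compose.yml <<'EOF'
services:
  redis:
    image: redis:7
  api-server:
    image: registry.smallest.ai/api-server:latest
    ports:
      - "8080:8080"
    depends_on:
      - redis
  lightning-asr:
    image: registry.smallest.ai/lightning-asr:latest
    environment:
      - LICENSE_KEY=${SMALLEST_LICENSE_KEY}   # hypothetical key name
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
EOF
docker compose up -d
```

The GPU reservation block is standard Docker Compose syntax for exposing an NVIDIA device to a service; it requires the NVIDIA Container Toolkit on the host.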
Kubernetes Deployment
Production-grade deployment with enterprise features.

Pros:
- Auto-scaling based on load
- High availability and fault tolerance
- Advanced monitoring with Grafana
- Shared model storage

Use cases:
- Production workloads
- High-traffic applications
- Multi-region deployments
- Enterprise infrastructure
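A Kubernetes install typically reduces to a Helm release. The chart repository, chart name, and values keys below are hypothetical; Smallest provides the actual chart details along with your credentials:

```bash
# Hypothetical chart repo, chart name, and values keys.
helm repo add smallest https://charts.smallest.ai
helm repo update
helm install smallest-selfhost smallest/self-host \
  --namespace smallest --create-namespace \
  --set licenseKey="$SMALLEST_LICENSE_KEY"
```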
Prerequisites
Before deploying Smallest Self-Host, ensure you have:

1. License Key
Contact [email protected] or your Smallest representative to obtain:
- License key for validation
- Container registry credentials

2. Infrastructure
Provision compute resources:
- For Docker: Single machine with NVIDIA GPU
- For Kubernetes: Cluster with GPU node pool

3. GPU Drivers
Install NVIDIA drivers and container runtime (example below):
- NVIDIA Driver 525+ (for A10, A100, L4)
- NVIDIA Driver 470+ (for T4, V100)
- NVIDIA Container Toolkit
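For the container runtime piece, NVIDIA Container Toolkit setup on an Ubuntu/Debian Docker host generally looks like this (assuming NVIDIA’s package repository is already configured; see NVIDIA’s install guide otherwise):

```bash
# Install the toolkit and wire it into Docker (Ubuntu/Debian example;
# assumes NVIDIA's apt repository is already configured).
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can see the GPU:
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```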
What’s Next?
Choose your deployment path based on your needs:

For Quick Start & Testing
Start with Docker
Fastest path to get running (15-30 minutes). Perfect if you’re:
- Evaluating Smallest Self-Host for the first time
- Building a proof-of-concept
- Setting up a development environment
- Running on a single GPU server

Go to Docker Setup →
For Production Deployment
Kubernetes on AWS
Full-featured production setup
- Auto-scaling (HPA + Cluster Autoscaler)
- High availability across zones
- Grafana monitoring dashboards
- Shared model storage with EFS
Kubernetes (Generic)
For any Kubernetes cluster
- Works on GCP, Azure, on-prem
- Full autoscaling support
- Advanced monitoring
- Production-ready
Quick Links by Role
I'm a DevOps Engineer
Start here:
- Kubernetes Prerequisites - Check cluster requirements
- AWS EKS Setup - Create EKS cluster (if on AWS)
- Quick Start - Deploy with Helm
- Autoscaling - Configure HPA
- Monitoring - Set up Grafana
I'm a Developer
Start here:
1. Docker Prerequisites - Set up local environment
2. Docker Quick Start - Get running in 15 minutes
3. API Reference - Integrate with your app
4. Examples - See code examples
I'm Evaluating the Product
Start here:
1. Docker Quick Start - Fastest way to test
2. API Reference - See what you can do
3. Common Issues - Get help if stuck
4. Then move to Kubernetes for production
I Need Help
Resources:
- Common Issues - Quick fixes
- Debugging Guide - Advanced troubleshooting
- Logs Analysis - Interpret error messages
- Support: [email protected]
Recommendation: Start with Docker to familiarize yourself with the components and API. Once you’re comfortable, move to Kubernetes for production deployments with autoscaling and high availability.

