What is Smallest Self-Host?

Smallest Self-Host enables you to deploy state-of-the-art speech-to-text (STT) models in your own infrastructure, whether in the cloud or on-premises. Built for enterprises with stringent performance, security, or compliance requirements, it provides the same powerful AI capabilities as Smallest’s cloud service while keeping your data under your complete control.

Why Self-Host?

Using Smallest as a managed service has many benefits: it’s fast to start developing with, requires no infrastructure setup, and eliminates all hardware, installation, configuration, backup, and maintenance-related costs. However, there are situations where a self-hosted deployment makes more sense.

Performance Requirements

Some use cases have strict latency and load requirements. If you need ultra-low latency, with voice AI services colocated alongside your other services, self-hosting can meet those requirements. Ideal for:
  • Real-time AI voicebots requiring <100ms response times
  • Live transcription systems for broadcasts or conferences
  • High-volume processing with predictable costs
  • Edge deployments with limited internet connectivity
Benefits:
  • Colocate speech services with your application infrastructure
  • Scale independently based on your specific workload patterns
  • No network latency to external APIs
  • Consistent performance regardless of internet conditions

Security & Data Privacy

One of the most common use cases for self-hosting Smallest is to satisfy security or data privacy requirements. In a typical self-hosted deployment, no audio, transcripts, or other identifying markers of the request content are sent to Smallest servers. Ideal for:
  • Healthcare applications requiring HIPAA compliance
  • Financial services with strict data governance
  • Government and defense applications
  • Enterprise environments with air-gapped networks
Data Privacy:
  • Your audio data never leaves your infrastructure
  • Transcripts remain entirely within your control
  • No data stored beyond the duration of the API request
  • Self-hosted deployments do not persist request/response data
What is reported:
  • Only usage metadata, such as audio duration, character count, features requested, and success response codes, sent to the Smallest license server for validation and billing
  • No audio content, transcripts, or personally identifiable information

Cost Optimization

For high-volume or predictable workloads, self-hosting can be more cost-effective:
  • Predictable costs based on infrastructure, not usage
  • No per-minute charges for audio processing
  • Efficient resource utilization with autoscaling
  • Long-term savings for sustained high volumes

Customization & Control

Self-hosting provides complete control over your deployment:
  • Custom resource allocation optimized for your workload
  • Version control - upgrade on your schedule
  • Network isolation - deploy in private networks
  • Integration flexibility - direct database access, custom monitoring

Components

Before you deploy Smallest, you’ll need to understand the components of your system, their relationships, and the interactions between components. A well-designed architecture will meet your business needs, optimize both performance and security, and provide a strong technical foundation for future growth.

Architecture Diagram

Component Details

API Server

Purpose: The API server interfaces with Lightning ASR to expose endpoints for your requests.
Key Features:
  • Routes incoming API requests to available Lightning ASR workers
  • Manages WebSocket connections for streaming transcription
  • Handles request queuing and load balancing across workers
  • Provides unified REST API interface
Resource Requirements:
  • CPU: 0.5-2 cores
  • Memory: 512 MB - 2 GB
  • No GPU required
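
For illustration, a batch transcription request to a self-hosted API Server might look like the sketch below. The port, endpoint path, and form field are assumptions, not the documented interface; check the API Reference for the actual routes.

    # Hypothetical endpoint and form field -- verify against the API Reference.
    curl -X POST "http://localhost:8080/v1/transcribe" \
      -F "file=@meeting.wav"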

Lightning ASR Engine

Purpose: The Lightning ASR engine performs the computationally intensive task of speech recognition. It manages GPU devices and responds to requests from the API layer.
Key Features:
  • GPU-accelerated speech recognition (0.05-0.15x real-time factor)
  • Real-time and batch audio transcription
  • Automatic model loading and optimization
  • Horizontal scaling support
Resource Requirements:
  • CPU: 4-8 cores
  • Memory: 12-16 GB RAM
  • GPU: 1x NVIDIA GPU (16+ GB VRAM required)
  • Storage: 50+ GB for models
Note: Because Lightning ASR is decoupled from the API Server, you can scale it independently based on your transcription load, as sketched below.
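
As a sketch of that independent scaling, assuming the stack runs under Docker Compose with a service named lightning-asr (an illustrative name, not the published one):

    # Add ASR workers without touching the API Server service.
    docker compose up -d --scale lightning-asr=3

On Kubernetes, the equivalent is adjusting the replica count of the ASR workers' Deployment.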

License Proxy

Purpose: Components register with the Smallest License Server to verify licensing and report usage. API and Engine containers can be configured to connect directly to the licensing server, or to proxy their communication through the License Proxy.
Key Features:
  • License key validation on startup
  • Usage metadata reporting (no audio/transcript data)
  • Grace period support for offline operation
  • Secure communication with the Smallest License Server
Resource Requirements:
  • CPU: 0.25-1 core
  • Memory: 256-512 MB
  • No GPU required
Network: Requires outbound HTTPS to https://console-api.smallest.ai
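
A minimal sketch of running the proxy and pointing it at the license server follows. The environment variable names and image path are assumptions for illustration; only the license server URL comes from this page.

    # Illustrative configuration -- exact keys are in the deployment reference.
    docker run -d --name license-proxy \
      -e SMALLEST_LICENSE_KEY="<your-license-key>" \
      -e LICENSE_SERVER_URL="https://console-api.smallest.ai" \
      <registry>/<license-proxy-image>:<tag>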

Redis

Purpose: Provides caching and state management for the system.
Key Features:
  • Request queuing and coordination between API and ASR workers
  • Session state for streaming connections
  • Performance optimization through caching
  • Can be embedded or external (AWS ElastiCache, etc.)
Resource Requirements:
  • CPU: 0.5-1 core
  • Memory: 512 MB - 2 GB
  • No GPU required
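
If you opt for an external cache such as AWS ElastiCache, the API Server is pointed at it through configuration, roughly as in the sketch below. The environment variable name and image path are assumptions for illustration.

    # Point the API Server at an external Redis endpoint (hypothetical env var).
    docker run -d --name api-server \
      -e REDIS_URL="redis://my-elasticache-host:6379" \
      <registry>/<api-server-image>:<tag>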

Common Setup Path

All deployments follow the same initial setup path through environment preparation. Here’s what to expect:

1. Choose Your Deployment Method

Docker/Podman

Best for: Development, testing, small-scale production
Timeline: 15-30 minutes
Complexity: Low
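
A single-host run might look roughly like the sketch below. Image names, ports, and environment variables are placeholders; the real values come with your registry credentials and the Docker quick start guide.

    # Shared network so the API Server can reach the ASR worker.
    docker network create smallest
    # GPU-backed ASR engine (image path is a placeholder).
    docker run -d --name lightning-asr --network smallest --gpus all \
      -e SMALLEST_LICENSE_KEY="<your-license-key>" \
      <registry>/<lightning-asr-image>:<tag>
    # API Server exposed on port 8080 (placeholder image and port).
    docker run -d --name api-server --network smallest -p 8080:8080 \
      <registry>/<api-server-image>:<tag>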

Kubernetes

Best for: Production deployments with autoscaling
Timeline: 1-2 hours
Complexity: Medium-High
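
On Kubernetes, deployment is driven by Helm. The chart reference and values keys in this sketch are placeholders, not the published ones; use the chart documented in the Kubernetes quick start.

    # Hypothetical chart and values -- see the Kubernetes quick start.
    helm install smallest-selfhost <smallest-chart> \
      --namespace smallest --create-namespace \
      --set license.key="<your-license-key>"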

2. Prepare Infrastructure

Steps:
  1. Obtain credentials from Smallest.ai (license key, registry access, model URLs)
  2. Prepare infrastructure (Docker host or Kubernetes cluster)
  3. Set up GPU support (NVIDIA drivers, device plugins)
  4. Deploy components (API Server, Lightning ASR, License Proxy, Redis)
  5. Configure autoscaling (optional, Kubernetes only)
  6. Set up monitoring (optional, Prometheus & Grafana)

What You’ll Need

Before starting, ensure you have:

From Smallest.ai

  • License key
  • Container registry credentials
  • Model download URLs
Contact: [email protected]

Technical Requirements

  • GPU infrastructure (NVIDIA A10, T4, or better)
  • Kubernetes cluster or Docker host
  • Basic DevOps knowledge
  • Network connectivity for license validation

Deployment Options

Smallest Self-Host supports two primary deployment methods, each suited for different operational requirements:

Prerequisites

Before deploying Smallest Self-Host, ensure you have:

1. License Key

Contact [email protected] or your Smallest representative to obtain:
  • License key for validation
  • Container registry credentials

2. Infrastructure

Provision compute resources:
  • For Docker: Single machine with NVIDIA GPU
  • For Kubernetes: Cluster with GPU node pool

3. GPU Drivers

Install NVIDIA drivers and container runtime:
  • NVIDIA Driver 525+ (for A10, A100, L4)
  • NVIDIA Driver 470+ (for T4, V100)
  • NVIDIA Container Toolkit
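
After installing the driver and the NVIDIA Container Toolkit, you can confirm that containers can see the GPU with standard NVIDIA tooling:

    # Driver check on the host.
    nvidia-smi
    # Toolkit check: the same output should appear from inside a container.
    docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi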

What’s Next?

Choose your deployment path based on your needs:

For Quick Start & Testing

Start here:
  1. Docker Prerequisites - Set up your local environment
  2. Docker Quick Start - Get running in 15 minutes
  3. API Reference - Integrate with your app
  4. Examples - See code examples
  5. Common Issues - Get help if stuck

For Production Deployment

Start here:
  1. Kubernetes Prerequisites - Check cluster requirements
  2. AWS EKS Setup - Create EKS cluster (if on AWS)
  3. Quick Start - Deploy with Helm
  4. Autoscaling - Configure HPA (see the sketch after this list)
  5. Monitoring - Set up Grafana
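
For the Autoscaling step, a minimal CPU-based HorizontalPodAutoscaler is one way to start. The deployment name and namespace here are assumptions, and production setups may scale on GPU or queue-depth metrics instead.

    # Hypothetical deployment name and namespace -- adjust to your release.
    kubectl autoscale deployment lightning-asr \
      --min=1 --max=4 --cpu-percent=70 -n smallest
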
Recommendation: Start with Docker to familiarize yourself with the components and API. Once you’re comfortable, move to Kubernetes for production deployments with autoscaling and high availability.