Kubernetes deployment is currently available for ASR (Speech-to-Text) only. For TTS deployments, use Docker.
Ensure you’ve completed all prerequisites before starting.

Add Helm Repository

```shell
helm repo add smallest-self-host https://smallest-inc.github.io/smallest-self-host
helm repo update
```

Create Namespace

```shell
kubectl create namespace smallest
kubectl config set-context --current --namespace=smallest
```

Configure Values

Create a `values.yaml` file:

```yaml values.yaml
global:
  licenseKey: "your-license-key-here"
  imageCredentials:
    create: true
    registry: quay.io
    username: "your-registry-username"
    password: "your-registry-password"
    email: "[email protected]"

models:
  asrModelUrl: "your-model-url-here"

scaling:
  replicas:
    lightningAsr: 1
    licenseProxy: 1

lightningAsr:
  nodeSelector: {}
  tolerations: []

redis:
  enabled: true
  auth:
    enabled: true
```
Replace placeholder values with credentials provided by Smallest.ai support.
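The `lightningAsr.nodeSelector` and `tolerations` fields are left empty by default. If your GPU nodes are labeled and tainted, you can use them to pin ASR pods to that node pool. A sketch, assuming NVIDIA GPU nodes with the common `nvidia.com/gpu` label and taint keys (these keys depend on your cluster setup and are not required by the chart):

```yaml
# Assumption: GPU nodes carry the nvidia.com/gpu.present label and the
# nvidia.com/gpu taint. Substitute the keys your cluster actually uses.
lightningAsr:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```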

Install

```shell
helm install smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml \
  --namespace smallest
```

Monitor the deployment:

```shell
kubectl get pods -w
```
| Component | Startup Time | Ready Indicator |
| --- | --- | --- |
| Redis | ~30s | 1/1 Running |
| License Proxy | ~1m | 1/1 Running |
| Lightning ASR | 2-10m | 1/1 Running (model download on first run) |
| API Server | ~30s | 1/1 Running |
Model downloads are cached when using shared storage (EFS). Subsequent starts complete in under a minute.
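For the cached-download behavior described above, the model volume must be backed by shared storage that survives pod restarts. A minimal sketch of an EFS-backed StorageClass on AWS, assuming the EFS CSI driver is installed (the file-system ID is a placeholder, and how the chart consumes this class depends on its storage values, which are not shown here):

```yaml
# Assumption: AWS EFS CSI driver is installed in the cluster.
# fs-xxxxxxxx is a placeholder for your EFS file-system ID.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-models
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxx
  directoryPerms: "700"
```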

Verify Installation

```shell
kubectl get pods,svc
```
All pods should show Running status with the following services available:
| Service | Port | Description |
| --- | --- | --- |
| api-server | 7100 | REST API endpoint |
| lightning-asr-internal | 2269 | ASR inference service |
| license-proxy | 3369 | License validation |
| redis-master | 6379 | Request queue |

Test the API

Port forward and send a health check:
```shell
kubectl port-forward svc/api-server 7100:7100
curl http://localhost:7100/health
```

Autoscaling

Enable automatic scaling based on real-time inference load:
```yaml values.yaml
scaling:
  auto:
    enabled: true
```
This deploys HorizontalPodAutoscalers that scale based on active requests:
| Component | Metric | Default Target | Behavior |
| --- | --- | --- | --- |
| Lightning ASR | asr_active_requests | 4 per pod | Scales GPU workers based on inference queue depth |
| API Server | lightning_asr_replica_count | 2:1 ratio | Maintains API capacity proportional to ASR workers |

How It Works

1. Lightning ASR exposes the `asr_active_requests` metric on port 9090
2. Prometheus scrapes this metric via a ServiceMonitor
3. Prometheus Adapter makes it available to the Kubernetes metrics API
4. The HPA scales pods when the average requests per pod exceeds the target
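The steps above come together in a HorizontalPodAutoscaler driven by a Pods metric. A sketch of what the chart renders when `scaling.auto.enabled` is true, using the defaults shown in this page (the labels and exact object names here are illustrative, not the chart's literal output):

```yaml
# Illustrative sketch of the rendered HPA; object names and labels
# may differ from what the chart actually generates.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lightning-asr
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lightning-asr
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: asr_active_requests
        target:
          type: AverageValue
          averageValue: "4"
```

The `AverageValue` target means the HPA divides the summed metric across pods, so scale-up triggers once the fleet averages more than 4 active requests per pod.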

Configuration

```yaml values.yaml
scaling:
  auto:
    enabled: true
    lightningAsr:
      hpa:
        minReplicas: 1
        maxReplicas: 10
        targetActiveRequests: 4
```

Verify Autoscaling

```shell
kubectl get hpa
```

```
NAME            REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
lightning-asr   Deployment/lightning-asr   0/4       1         10        1
api-server      Deployment/api-server      1/2       1         10        1
```

The TARGETS column shows current/target. When the current value exceeds the target, pods scale up.
Autoscaling requires the Prometheus stack. It’s included as a dependency and enabled by default.

Helm Operations

Apply configuration changes with an upgrade:

```shell
helm upgrade smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml -n smallest
```

Troubleshooting

| Issue | Cause | Resolution |
| --- | --- | --- |
| Pods Pending | Insufficient resources or missing GPU nodes | Check `kubectl describe pod <name>` for scheduling errors |
| ImagePullBackOff | Invalid registry credentials | Verify `imageCredentials` in `values.yaml` |
| CrashLoopBackOff | Invalid license or insufficient memory | Check logs with `kubectl logs <pod> --previous` |
| Slow model download | Large model size (~20GB) | Use shared storage (EFS) for caching |
For detailed troubleshooting, see Troubleshooting Guide.

Next Steps