Kubernetes deployment is currently available for ASR (Speech-to-Text) only. For TTS deployments, use Docker.
## Add Helm Repository

```shell
helm repo add smallest-self-host https://smallest-inc.github.io/smallest-self-host
helm repo update
```
## Create Namespace

```shell
kubectl create namespace smallest
kubectl config set-context --current --namespace=smallest
```
Create a `values.yaml` file:

```yaml
global:
  licenseKey: "your-license-key-here"

imageCredentials:
  create: true
  registry: quay.io
  username: "your-registry-username"
  password: "your-registry-password"
  email: "your-email@example.com"

models:
  asrModelUrl: "your-model-url-here"

scaling:
  replicas:
    lightningAsr: 1
    licenseProxy: 1

lightningAsr:
  nodeSelector: {}
  tolerations: []

redis:
  enabled: true
  auth:
    enabled: true
```
Replace placeholder values with credentials provided by Smallest.ai support.
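With `imageCredentials.create: true` the chart creates the image pull secret for you. If you prefer to manage the secret yourself, the standard kubectl equivalent is sketched below (the secret name `smallest-registry` is illustrative; match it to whatever name the chart expects):

```shell
# Create the registry pull secret manually (secret name is illustrative)
kubectl create secret docker-registry smallest-registry \
  --docker-server=quay.io \
  --docker-username="your-registry-username" \
  --docker-password="your-registry-password" \
  --docker-email="your-email@example.com" \
  -n smallest
```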
## Install

```shell
helm install smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml \
  --namespace smallest
```
Monitor the deployment:
| Component | Startup Time | Ready Indicator |
|---|---|---|
| Redis | ~30s | 1/1 Running |
| License Proxy | ~1m | 1/1 Running |
| Lightning ASR | 2-10m | 1/1 Running (model download on first run) |
| API Server | ~30s | 1/1 Running |
Model downloads are cached when using shared storage (EFS). Subsequent starts complete in under a minute.
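A simple way to watch the rollout as these components come up (assuming the `smallest` namespace created earlier):

```shell
kubectl get pods -n smallest -w
```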
## Verify Installation

All pods should show `Running` status, with the following services available:
| Service | Port | Description |
|---|---|---|
| api-server | 7100 | REST API endpoint |
| lightning-asr-internal | 2269 | ASR inference service |
| license-proxy | 3369 | License validation |
| redis-master | 6379 | Request queue |
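One way to confirm the services above were created (again assuming the `smallest` namespace):

```shell
kubectl get svc -n smallest
```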
## Test the API

Port-forward the API server, then send a health check from a second terminal:

```shell
kubectl port-forward svc/api-server 7100:7100

# In a separate terminal:
curl http://localhost:7100/health
```
## Autoscaling

Enable automatic scaling based on real-time inference load:

```yaml
scaling:
  auto:
    enabled: true
```
This deploys HorizontalPodAutoscalers that scale based on active requests:
| Component | Metric | Default Target | Behavior |
|---|---|---|---|
| Lightning ASR | asr_active_requests | 4 per pod | Scales GPU workers based on inference queue depth |
| API Server | lightning_asr_replica_count | 2:1 ratio | Maintains API capacity proportional to ASR workers |
## How It Works

- Lightning ASR exposes the `asr_active_requests` metric on port 9090
- Prometheus scrapes this metric via a ServiceMonitor
- Prometheus Adapter makes it available through the Kubernetes custom metrics API
- The HPA scales pods when the average requests per pod exceeds the target
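The scaling decision itself follows the standard HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch of the arithmetic (the observed values are illustrative):

```shell
current_replicas=1
avg_active_requests=6   # illustrative average asr_active_requests per pod
target=4                # targetActiveRequests

# ceil(a / b) via integer arithmetic: (a + b - 1) / b
desired=$(( (current_replicas * avg_active_requests + target - 1) / target ))
echo "desired replicas: $desired"   # prints "desired replicas: 2"
```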
## Configuration

```yaml
scaling:
  auto:
    enabled: true

lightningAsr:
  hpa:
    minReplicas: 1
    maxReplicas: 10
    targetActiveRequests: 4
```
## Verify Autoscaling

```shell
kubectl get hpa -n smallest
```

```
NAME            REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS
lightning-asr   Deployment/lightning-asr    0/4       1         10        1
api-server      Deployment/api-server       1/2       1         10        1
```

The TARGETS column shows current/target. When the current value exceeds the target, the HPA scales up.
Autoscaling requires the Prometheus stack. It’s included as a dependency and enabled by default.
## Helm Operations

Apply configuration changes by upgrading the release:

```shell
helm upgrade smallest-self-host smallest-self-host/smallest-self-host \
  -f values.yaml -n smallest
```
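Two other routine operations, using the standard Helm CLI:

```shell
# Roll back to the previous release revision
helm rollback smallest-self-host -n smallest

# Remove the release entirely
helm uninstall smallest-self-host -n smallest
```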
## Troubleshooting

| Issue | Cause | Resolution |
|---|---|---|
| Pods Pending | Insufficient resources or missing GPU nodes | Check `kubectl describe pod <name>` for scheduling errors |
| ImagePullBackOff | Invalid registry credentials | Verify `imageCredentials` in `values.yaml` |
| CrashLoopBackOff | Invalid license or insufficient memory | Check logs with `kubectl logs <pod> --previous` |
| Slow model download | Large model size (~20GB) | Use shared storage (EFS) for caching |
For detailed troubleshooting, see the Troubleshooting Guide.
## Next Steps