Overview
This guide covers common issues encountered when deploying Smallest Self-Host on Kubernetes and how to resolve them.
Diagnostic Commands
Quick Status Check
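The exact commands are not preserved here; a minimal status sweep, assuming the release runs in the smallest namespace used throughout this guide, could be:

```bash
# Pods, services, and deployments at a glance
kubectl get pods -n smallest
kubectl get svc -n smallest
kubectl get deployments -n smallest
```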
Detailed Pod Information
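For a single pod, kubectl describe and the pod logs usually show why it is unhealthy (the pod name is a placeholder):

```bash
kubectl describe pod <pod-name> -n smallest
kubectl logs <pod-name> -n smallest
# Logs from the previous container instance, useful for CrashLoopBackOff
kubectl logs <pod-name> -n smallest --previous
```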
Events
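Namespace events sorted by time surface scheduling, image-pull, and mount errors quickly:

```bash
kubectl get events -n smallest --sort-by='.lastTimestamp'
```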
Common Issues
Pods Stuck in Pending
Symptoms: Pods show Pending status in kubectl get pods and never reach Running.
Insufficient GPU Resources
Check:
  kubectl describe pod <pod-name> -n smallest
Look for:
  0/3 nodes are available: 3 Insufficient nvidia.com/gpu
Solutions:
- Add GPU nodes to the cluster
- Check that GPU nodes are ready:
  kubectl get nodes -l nvidia.com/gpu=true
- Verify the GPU device plugin is running:
  kubectl get pods -n kube-system -l name=nvidia-device-plugin
- Reduce the number of requested GPUs, or add more nodes
Node Selector Mismatch
Check:
  kubectl get nodes --show-labels
Solutions:
- Update nodeSelector in values.yaml to match the actual node labels (see the example after this list)
- Remove nodeSelector if it is not needed
- Add labels to nodes:
  kubectl label nodes <node-name> workload=gpu
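An illustrative nodeSelector entry for values.yaml; the exact key path depends on the chart, so treat this as a sketch rather than the chart's actual schema:

```yaml
# Illustrative only: place under the component the chart actually exposes
nodeSelector:
  workload: gpu
```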
Tolerations Missing
Check:
  kubectl describe nodes | grep -i taint
Solutions:
Update tolerations in values.yaml to match the node taints (see the example below):
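A typical tolerations block for GPU nodes; the taint key and effect here are assumptions, so match them to the taints reported by kubectl describe nodes:

```yaml
# Illustrative only: align key, value, and effect with your node taints
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```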
PVC Not Bound
Check:
  kubectl get pvc -n smallest
Look for:
  STATUS: Pending
Solutions:
- Check the storage class exists:
  kubectl get storageclass
- Verify sufficient storage is available:
  kubectl describe pvc <pvc-name> -n smallest
- Check the EFS/EBS CSI driver is running:
  kubectl get pods -n kube-system -l app=efs-csi-controller
ImagePullBackOff
Symptoms: Pods show ImagePullBackOff or ErrImagePull status in kubectl get pods.
Invalid Credentials
Error:
  unauthorized: authentication required
Solutions:
- Verify imageCredentials in values.yaml
- Check the secret was created:
  kubectl get secrets -n smallest | grep registry
- Test the credentials locally:
  docker login quay.io
- Recreate the secret (a sketch follows):
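One way to recreate the pull secret by hand; the secret name is a placeholder and should match whatever name the chart expects for imageCredentials:

```bash
# Replace <registry-secret-name> and the credentials with your own values
kubectl delete secret <registry-secret-name> -n smallest
kubectl create secret docker-registry <registry-secret-name> \
  --docker-server=quay.io \
  --docker-username=<username> \
  --docker-password=<password> \
  -n smallest
```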
Image Not Found
Error:
  manifest unknown or not found
Solutions:
- Verify the image name in values.yaml
- Check the image exists:
  docker pull quay.io/smallestinc/lightning-asr:latest
- Contact [email protected] for access
Rate Limited
Error:
  rate limit exceeded
Solutions:
- Wait and retry
- Use authenticated pulls (imageCredentials)
CrashLoopBackOff
Symptoms: Pods repeatedly start, crash, and restart; kubectl get pods shows CrashLoopBackOff with an increasing restart count.
License Validation Failed
Error:
  License validation failed or Invalid license key
Solutions:
- Check the License Proxy is running:
  kubectl get pods -l app=license-proxy -n smallest
- Verify the license key in values.yaml
- Check the License Proxy logs:
  kubectl logs -l app=license-proxy -n smallest
- Test the License Proxy:
  kubectl exec -it <api-server-pod> -- curl http://license-proxy:3369/health
Model Download Failed
Error:
  Failed to download model or Connection timeout
Solutions:
- Verify MODEL_URL in values.yaml
- Check network connectivity
- Check disk space:
  kubectl exec -it <pod> -- df -h
- Test the URL:
  kubectl run test --rm -it --image=curlimages/curl -- curl -I $MODEL_URL
Out of Memory
Error: Pod killed with exit code 137 (OOMKilled)
Solutions:
- Check the current memory limits:
  kubectl describe pod <pod-name> -n smallest
- Increase the memory limit in values.yaml (see the example after this list)
- Check node capacity:
  kubectl describe node <node-name>
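An illustrative resources override for values.yaml; the key path and sizes are assumptions to adapt to your chart and workload:

```yaml
# Illustrative only: adjust the path and sizes to your chart and workload
resources:
  requests:
    memory: "8Gi"
  limits:
    memory: "16Gi"
```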
GPU Not Accessible
Error:
  No CUDA-capable device or GPU not found
Solutions:
- Verify the GPU is available on the node:
  kubectl describe node <node-name> | grep nvidia.com/gpu
- Check the NVIDIA device plugin:
  kubectl get pods -n kube-system -l name=nvidia-device-plugin
- Restart the device plugin:
  kubectl delete pod -n kube-system -l name=nvidia-device-plugin
- Verify the GPU driver on the node
Service Not Accessible
Symptoms:
- Cannot connect to the API server
- Connection refused errors
- Timeouts
No Endpoints
Issue: The service has no endpoints.
Check:
  kubectl get endpoints -n smallest
Solutions:
- Verify the pods are running:
  kubectl get pods -l app=api-server -n smallest
- Check that the pod labels match the service selector
- Check the pods are ready:
  kubectl get pods -l app=api-server -o wide
Wrong Port
Solutions:
- Verify the service port:
  kubectl get svc -n smallest
- Use the correct port in connections (7100 for the API Server)
Network Policy Blocking
Check:
  kubectl get networkpolicies -n smallest
Solutions:
- Review the network policies
- Temporarily delete the policy to test, then restore it:
  kubectl delete networkpolicy <policy-name> -n smallest
HPA Not Scaling
Symptoms:
- HPA shows <unknown> for metrics
- Pods not scaling despite high load
Metrics Not Available
Check:
  kubectl describe hpa <hpa-name> -n smallest
Solutions:
- Enable the ServiceMonitor in values.yaml (see the example after this list)
- Verify Prometheus is scraping the metrics; in Prometheus, run the query:
  asr_active_requests
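An illustrative values.yaml toggle for the ServiceMonitor; the key names are assumptions and depend on how the chart exposes monitoring:

```yaml
# Illustrative only: key names depend on the chart
monitoring:
  serviceMonitor:
    enabled: true
```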
Already at Max Replicas
Check:
  kubectl get hpa -n smallest
Solutions:
- Increase maxReplicas in values.yaml (see the example after this list)
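An illustrative autoscaling override; the key names and replica counts are assumptions to match against the chart's values.yaml:

```yaml
# Illustrative only: match the keys to the chart's autoscaling section
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
```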
Insufficient Cluster Resources
Solutions:
- Add more nodes
- Enable the Cluster Autoscaler
- Check for pending pods:
  kubectl get pods --field-selector=status.phase=Pending
Persistent Volume Issues
Symptoms:
- PVC stuck in Pending
- Mount failures
- Permission denied
No Storage Class
Check:
  kubectl get storageclass
Solutions:
- Install the EBS CSI driver (AWS)
- Install the EFS CSI driver (AWS)
- Create a storage class
EFS Mount Failed
Check:
  kubectl describe pvc <pvc-name> -n smallest
Solutions:
- Verify the EFS file system ID
- Check the security group allows NFS (port 2049)
- Verify the EFS CSI driver is running:
  kubectl get pods -n kube-system -l app=efs-csi-controller
Permission Denied
Solutions:
- Check volume permissions
- Add fsGroup to the pod securityContext (see the example below):
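An illustrative pod securityContext with fsGroup; the IDs are assumptions, so use the UID/GID the container images actually run as:

```yaml
# Illustrative only: use the UID/GID of your container user
securityContext:
  fsGroup: 1000
  runAsUser: 1000
```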
Performance Issues
Slow Response Times
Solutions:
- Increase pod resources
- Scale up replicas
- Check GPU utilization:
  kubectl exec -it <lightning-asr-pod> -- nvidia-smi
- Review the model configuration
- Check network latency
High CPU/Memory Usage
Solutions:
- Increase resource limits
- Scale horizontally (more pods)
- Investigate memory leaks in the logs
- Enable monitoring with Grafana
Debugging Tools
Interactive Shell
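A typical way to open a shell in a running pod; minimal images may not ship bash, in which case fall back to sh:

```bash
kubectl exec -it <pod-name> -n smallest -- /bin/bash
# Fallback for minimal images
kubectl exec -it <pod-name> -n smallest -- /bin/sh
```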
Debug Container
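An ephemeral debug container attaches tooling to a running pod without restarting it; this sketch assumes a cluster recent enough to support kubectl debug with ephemeral containers:

```bash
kubectl debug -it <pod-name> -n smallest --image=busybox --target=<container-name>
```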
Network Debugging
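A throwaway curl pod is a quick way to test in-cluster connectivity; the service name and health path here are assumptions based on the API Server port mentioned above:

```bash
kubectl run netcheck --rm -it -n smallest --image=curlimages/curl -- \
  curl -v http://api-server:7100/health
```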
Copy Files
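kubectl cp copies files between a pod and the local machine; paths are placeholders:

```bash
# Pod -> local
kubectl cp smallest/<pod-name>:/path/in/pod/app.log ./app.log
# Local -> pod
kubectl cp ./config.json smallest/<pod-name>:/path/in/pod/config.json
```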
Getting Help
Collect Diagnostic Information
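The original list of artifacts is not preserved here; a reasonable minimal capture, assuming the smallest namespace and the label selectors used earlier in this guide, might be:

```bash
# Save cluster state for the support ticket
kubectl get pods -n smallest -o wide > pods.txt
kubectl describe pods -n smallest > describe.txt
kubectl get events -n smallest --sort-by='.lastTimestamp' > events.txt
kubectl logs -l app=api-server -n smallest --tail=500 > api-server.log
helm get values <release-name> -n smallest > values.yaml.txt
```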
Before contacting support, collect the diagnostic output described above.
Contact Support
Email: [email protected]
Include:
- Description of the issue
- Steps to reproduce
- Diagnostic files collected above
- Cluster information (EKS version, node types, etc.)
- Helm chart version

