Overview

This guide covers common issues encountered when deploying Smallest Self-Host on Kubernetes and how to resolve them.

Diagnostic Commands

Quick Status Check

kubectl get all -n smallest
kubectl get pods -n smallest --show-labels
kubectl top pods -n smallest
kubectl top nodes

Detailed Pod Information

kubectl describe pod <pod-name> -n smallest
kubectl logs <pod-name> -n smallest
kubectl logs <pod-name> -n smallest --previous
kubectl logs <pod-name> -c <container-name> -n smallest -f

Events

kubectl get events -n smallest --sort-by='.lastTimestamp'
kubectl get events -n smallest --field-selector type=Warning

Common Issues

Pods Stuck in Pending

Symptoms:
NAME                READY   STATUS    RESTARTS   AGE
lightning-asr-xxx   0/1     Pending   0          5m
Causes and Solutions:
Cause: Insufficient GPU resources
Check:
kubectl describe pod lightning-asr-xxx -n smallest
Look for: 0/3 nodes are available: 3 Insufficient nvidia.com/gpu
Solutions:
  • Add GPU nodes to cluster
  • Check GPU nodes are ready: kubectl get nodes -l nvidia.com/gpu=true
  • Verify GPU device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
  • Reduce requested GPUs or add more nodes
Cause: Node selector mismatch
Check:
kubectl get nodes --show-labels
kubectl describe pod lightning-asr-xxx -n smallest | grep "Node-Selectors"
Solutions:
  • Update nodeSelector in values.yaml to match actual node labels
  • Remove nodeSelector if not needed
  • Add labels to nodes: kubectl label nodes <node-name> workload=gpu
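If the chart's node selector must match the labels you applied above, the corresponding values.yaml entry might look like this (the `workload: gpu` key/value mirrors the example label above; the exact key depends on your node labels):

```yaml
lightningAsr:
  nodeSelector:
    workload: gpu
```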
Cause: Node taints without matching tolerations
Check:
kubectl describe pod lightning-asr-xxx -n smallest | grep -A5 "Tolerations"
kubectl describe node <node-name> | grep "Taints"
Solutions: Update tolerations in values.yaml:
lightningAsr:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
Cause: PersistentVolumeClaim not bound
Check:
kubectl get pvc -n smallest
Look for: STATUS: Pending
Solutions:
  • Check storage class exists: kubectl get storageclass
  • Verify sufficient storage: kubectl describe pvc <pvc-name> -n smallest
  • Check EFS/EBS CSI driver running: kubectl get pods -n kube-system -l app=efs-csi-controller
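As a quick triage step, the status check above can be narrowed to just Pending pods. A minimal sketch (with a live cluster, pipe in the output of kubectl get pods -n smallest instead of the sample text):

```shell
# List only pods whose STATUS column is Pending.
pending() { awk 'NR > 1 && $3 == "Pending" { print $1 }'; }

# Demonstration on sample output (no cluster needed):
printf '%s\n' \
  'NAME                READY   STATUS    RESTARTS   AGE' \
  'lightning-asr-xxx   0/1     Pending   0          5m' \
  'api-server-yyy      1/1     Running   0          5m' | pending
# → lightning-asr-xxx
```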

ImagePullBackOff

Symptoms:
NAME                READY   STATUS             RESTARTS   AGE
lightning-asr-xxx   0/1     ImagePullBackOff   0          2m
Diagnosis:
kubectl describe pod lightning-asr-xxx -n smallest
Look for errors in the Events section.
Error: unauthorized: authentication required
Solutions:
  • Verify imageCredentials in values.yaml
  • Check secret created: kubectl get secrets -n smallest | grep registry
  • Test credentials locally: docker login quay.io
  • Recreate secret:
    kubectl delete secret <pull-secret> -n smallest
    helm upgrade smallest-self-host ... -f values.yaml
    
Error: manifest unknown or not found
Solutions:
  • Verify image name in values.yaml
  • Check image exists: docker pull quay.io/smallestinc/lightning-asr:latest
  • Contact [email protected] for access
Error: rate limit exceeded
Solutions:
  • Wait and retry
  • Use authenticated pulls (imageCredentials)
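For reference, the imageCredentials block in values.yaml might take a shape like the following; this is illustrative only, since the exact field names depend on the chart's actual schema:

```yaml
# Illustrative shape — verify field names against the chart's values reference.
imageCredentials:
  registry: quay.io
  username: <your-username>
  password: <your-token>
```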

CrashLoopBackOff

Symptoms:
NAME                READY   STATUS             RESTARTS   AGE
lightning-asr-xxx   0/1     CrashLoopBackOff   5          5m
Diagnosis:
kubectl logs lightning-asr-xxx -n smallest
kubectl logs lightning-asr-xxx -n smallest --previous
kubectl describe pod lightning-asr-xxx -n smallest
Common Causes:
Error: License validation failed or Invalid license key
Solutions:
  • Check License Proxy is running: kubectl get pods -l app=license-proxy -n smallest
  • Verify license key in values.yaml
  • Check License Proxy logs: kubectl logs -l app=license-proxy -n smallest
  • Test License Proxy: kubectl exec -it <api-server-pod> -- curl http://license-proxy:3369/health
Error: Failed to download model or Connection timeout
Solutions:
  • Verify MODEL_URL in values.yaml
  • Check network connectivity
  • Check disk space: kubectl exec -it <pod> -- df -h
  • Test URL: kubectl run test --rm -it --image=curlimages/curl -- curl -I $MODEL_URL
Error: Pod killed, exit code 137 (OOMKilled)
Solutions:
  • Check memory limits:
    kubectl describe pod lightning-asr-xxx -n smallest | grep -A5 Limits
    
  • Increase memory:
    lightningAsr:
      resources:
        limits:
          memory: 16Gi
    
  • Check node capacity: kubectl describe node <node-name>
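Exit code 137 is 128 + 9 (SIGKILL), which in Kubernetes almost always means the container was OOMKilled. A small helper to decode the common codes (the mapping is the standard 128 + signal convention, not anything specific to this chart):

```shell
# Decode common container exit codes (128 + signal number convention).
exit_reason() {
  case "$1" in
    137) echo "OOMKilled or SIGKILL (128 + 9): check memory limits" ;;
    143) echo "SIGTERM (128 + 15): graceful shutdown or eviction" ;;
    139) echo "SIGSEGV (128 + 11): application crash" ;;
    1)   echo "application error: check container logs" ;;
    *)   echo "exit code $1: check logs and pod events" ;;
  esac
}

exit_reason 137   # → OOMKilled or SIGKILL (128 + 9): check memory limits
```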
Error: No CUDA-capable device or GPU not found
Solutions:
  • Verify GPU available on node: kubectl describe node <node-name> | grep nvidia.com/gpu
  • Check NVIDIA device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
  • Restart device plugin: kubectl delete pod -n kube-system -l name=nvidia-device-plugin
  • Verify GPU driver on node

Service Not Accessible

Symptoms:
  • Cannot connect to API server
  • Connection refused errors
  • Timeouts
Diagnosis:
kubectl get svc -n smallest
kubectl describe svc api-server -n smallest
kubectl get endpoints -n smallest
Solutions:
Issue: Service has no endpoints
Check:
kubectl get endpoints api-server -n smallest
Solutions:
  • Verify pods are running: kubectl get pods -l app=api-server -n smallest
  • Check pod labels match service selector
  • Check pods are ready: kubectl get pods -l app=api-server -o wide
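Missing endpoints usually mean the service selector and the pod labels disagree. The comparison can be sketched as a tiny shell helper; with a live cluster, feed it values from kubectl get svc api-server -o jsonpath='{.spec.selector}' and kubectl get pods --show-labels (the label sets below are illustrative):

```shell
# Succeed only if every selector key=value pair appears in the label set.
selector_matches() {
  sel=$1; labels=$2
  for kv in $sel; do
    case " $labels " in
      *" $kv "*) ;;        # this selector pair is present in the labels
      *) return 1 ;;       # pair missing: the service will not select the pod
    esac
  done
  return 0
}

selector_matches "app=api-server" "app=api-server tier=backend" && echo "labels match"
selector_matches "app=api-server" "app=apiserver" || echo "labels do not match"
```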
Issue: Wrong port
Solutions:
  • Verify service port:
    kubectl get svc api-server -n smallest -o yaml
    
  • Use correct port in connections (7100 for API Server)
Issue: Network policy blocking traffic
Check:
kubectl get networkpolicy -n smallest
Solutions:
  • Review network policies
  • Temporarily disable to test:
    kubectl delete networkpolicy <policy-name> -n smallest
    

HPA Not Scaling

Symptoms:
  • HPA shows <unknown> for metrics
  • Pods not scaling despite high load
Diagnosis:
kubectl get hpa -n smallest
kubectl describe hpa lightning-asr -n smallest
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
Solutions:
Issue: Custom metrics unavailable
Check:
kubectl get servicemonitor -n smallest
kubectl logs -n kube-system -l app.kubernetes.io/name=prometheus-adapter
Solutions:
  • Enable ServiceMonitor:
    scaling:
      auto:
        lightningAsr:
          servicemonitor:
            enabled: true
    
  • Verify Prometheus is scraping:
    kubectl port-forward svc/smallest-prometheus-stack-prometheus 9090:9090
    
    Query: asr_active_requests
Issue: maxReplicas limit reached
Check:
kubectl get hpa lightning-asr -n smallest
Solutions:
  • Increase maxReplicas:
    scaling:
      auto:
        lightningAsr:
          hpa:
            maxReplicas: 20
    
Issue: Insufficient cluster capacity
Solutions:
  • Add more nodes
  • Enable Cluster Autoscaler
  • Check pending pods: kubectl get pods --field-selector=status.phase=Pending
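For context when reading kubectl describe hpa output: the HPA controller computes the target replica count as desired = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch of that arithmetic in shell:

```shell
# desired = ceil(current * metric / target), via integer arithmetic.
desired_replicas() {
  current=$1; metric=$2; target=$3
  echo $(( (current * metric + target - 1) / target ))
}

desired_replicas 4 90 60    # → 6  (4 pods at metric 90 against a target of 60)
```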

Persistent Volume Issues

Symptoms:
  • PVC stuck in Pending
  • Mount failures
  • Permission denied
Solutions:
Issue: No storage class available
Check:
kubectl get storageclass
Solutions:
  • Install EBS CSI driver (AWS)
  • Install EFS CSI driver (AWS)
  • Create storage class
Issue: Mount failures
Check:
kubectl describe pod <pod-name> | grep -A10 "Events"
Solutions:
  • Verify EFS file system ID
  • Check security group allows NFS (port 2049)
  • Verify EFS CSI driver: kubectl get pods -n kube-system -l app=efs-csi-controller
Issue: Permission denied
Solutions:
  • Check volume permissions
  • Add fsGroup to pod securityContext:
    securityContext:
      fsGroup: 1000
    

Performance Issues

Slow Response Times

Check:
kubectl top pods -n smallest
kubectl top nodes
kubectl logs -l app=lightning-asr -n smallest | grep -i "latency\|duration"
Solutions:
  • Increase pod resources
  • Scale up replicas
  • Check GPU utilization: kubectl exec -it <lightning-asr-pod> -- nvidia-smi
  • Review model configuration
  • Check network latency

High CPU/Memory Usage

Check:
kubectl top pods -n smallest
kubectl describe pod <pod-name> -n smallest | grep -A5 "Limits"
Solutions:
  • Increase resource limits
  • Scale horizontally (more pods)
  • Investigate memory leaks in logs
  • Enable monitoring with Grafana

Debugging Tools

Interactive Shell

kubectl exec -it <pod-name> -n smallest -- /bin/sh

Debug Container

kubectl debug <pod-name> -n smallest -it --image=ubuntu -- bash

Network Debugging

kubectl run netdebug --rm -it --restart=Never \
  --image=nicolaka/netshoot \
  --namespace=smallest
Inside the debug pod:
nslookup api-server
curl http://api-server:7100/health
traceroute lightning-asr

Copy Files

kubectl cp <pod-name>:/path/to/file ./local-file -n smallest
kubectl cp ./local-file <pod-name>:/path/to/file -n smallest

Getting Help

Collect Diagnostic Information

Before contacting support, collect:
kubectl get all -n smallest > status.txt
kubectl describe pods -n smallest > pods.txt
kubectl logs -l app=lightning-asr -n smallest --tail=500 > asr-logs.txt
kubectl logs -l app=api-server -n smallest --tail=500 > api-logs.txt
kubectl logs -l app=license-proxy -n smallest --tail=500 > license-logs.txt
kubectl get events -n smallest --sort-by='.lastTimestamp' > events.txt
kubectl top nodes > nodes.txt
kubectl top pods -n smallest > pod-resources.txt
helm get values smallest-self-host -n smallest > values.txt
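Once the commands above have produced their output files, they can be bundled into a single archive to attach to a support ticket (the function name and archive filename are just suggestions):

```shell
# Bundle the diagnostic files collected above into one archive for support.
# Call this in the directory where the files were written.
bundle_diagnostics() {
  tar czf smallest-diagnostics-"$(date +%Y%m%d)".tar.gz \
    status.txt pods.txt asr-logs.txt api-logs.txt license-logs.txt \
    events.txt nodes.txt pod-resources.txt values.txt
}
```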

Contact Support

Email: [email protected]
Include:
  • Description of the issue
  • Steps to reproduce
  • Diagnostic files collected above
  • Cluster information (EKS version, node types, etc.)
  • Helm chart version
