Overview

This guide covers common issues encountered when deploying Smallest Self-Host on Kubernetes and how to resolve them.

Diagnostic Commands

Quick Status Check

kubectl get all -n smallest
kubectl get pods -n smallest --show-labels
kubectl top pods -n smallest
kubectl top nodes

Detailed Pod Information

kubectl describe pod <pod-name> -n smallest
kubectl logs <pod-name> -n smallest
kubectl logs <pod-name> -n smallest --previous
kubectl logs <pod-name> -c <container-name> -n smallest -f

Events

kubectl get events -n smallest --sort-by='.lastTimestamp'
kubectl get events -n smallest --field-selector type=Warning

Common Issues

Pods Stuck in Pending

Symptoms:
NAME                READY   STATUS    RESTARTS   AGE
lightning-asr-xxx   0/1     Pending   0          5m
Causes and Solutions:
Cause: Insufficient GPU resources
Check:
kubectl describe pod lightning-asr-xxx -n smallest
Look for: 0/3 nodes are available: 3 Insufficient nvidia.com/gpu
Solutions:
  • Add GPU nodes to cluster
  • Check GPU nodes are ready: kubectl get nodes -l nvidia.com/gpu=true
  • Verify GPU device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
  • Reduce requested GPUs or add more nodes
Cause: Node selector mismatch
Check:
kubectl get nodes --show-labels
kubectl describe pod lightning-asr-xxx -n smallest | grep "Node-Selectors"
Solutions:
  • Update nodeSelector in values.yaml to match actual node labels
  • Remove nodeSelector if not needed
  • Add labels to nodes: kubectl label nodes <node-name> workload=gpu
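If the chart's node selector must match the labels you applied above, the corresponding values.yaml entry might look like this (the `workload: gpu` key/value mirrors the example label above; the exact key depends on your node labels):

```yaml
lightningAsr:
  nodeSelector:
    workload: gpu
```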
Cause: Node taints without matching tolerations
Check:
kubectl describe pod lightning-asr-xxx -n smallest | grep -A5 "Tolerations"
kubectl describe node <node-name> | grep "Taints"
Solutions: Update tolerations in values.yaml:
lightningAsr:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
Cause: PersistentVolumeClaim not bound
Check:
kubectl get pvc -n smallest
Look for: STATUS: Pending
Solutions:
  • Check storage class exists: kubectl get storageclass
  • Verify sufficient storage: kubectl describe pvc <pvc-name> -n smallest
  • Check EFS/EBS CSI driver running: kubectl get pods -n kube-system -l app=efs-csi-controller
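As a quick triage step, the status check above can be narrowed to just Pending pods. A minimal sketch (with a live cluster, pipe in the output of kubectl get pods -n smallest instead of the sample text):

```shell
# List only pods whose STATUS column is Pending.
pending() { awk 'NR > 1 && $3 == "Pending" { print $1 }'; }

# Demonstration on sample output (no cluster needed):
printf '%s\n' \
  'NAME                READY   STATUS    RESTARTS   AGE' \
  'lightning-asr-xxx   0/1     Pending   0          5m' \
  'api-server-yyy      1/1     Running   0          5m' | pending
# → lightning-asr-xxx
```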

ImagePullBackOff

Symptoms:
NAME                READY   STATUS             RESTARTS   AGE
lightning-asr-xxx   0/1     ImagePullBackOff   0          2m
Diagnosis:
kubectl describe pod lightning-asr-xxx -n smallest
Look for errors in the Events section.
Error: unauthorized: authentication required
Solutions:
  • Verify imageCredentials in values.yaml
  • Check secret created: kubectl get secrets -n smallest | grep registry
  • Test credentials locally: docker login quay.io
  • Recreate secret:
    kubectl delete secret <pull-secret> -n smallest
    helm upgrade smallest-self-host ... -f values.yaml
    
Error: manifest unknown or not found
Solutions:
  • Verify image name in values.yaml
  • Check image exists: docker pull quay.io/smallestinc/lightning-asr:latest
  • Contact [email protected] for access
Error: rate limit exceeded
Solutions:
  • Wait and retry
  • Use authenticated pulls (imageCredentials)
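For reference, the imageCredentials block in values.yaml might take a shape like the following; this is illustrative only, since the exact field names depend on the chart's actual schema:

```yaml
# Illustrative shape — verify field names against the chart's values reference.
imageCredentials:
  registry: quay.io
  username: <your-username>
  password: <your-token>
```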

CrashLoopBackOff

Symptoms:
NAME                READY   STATUS             RESTARTS   AGE
lightning-asr-xxx   0/1     CrashLoopBackOff   5          5m
Diagnosis:
kubectl logs lightning-asr-xxx -n smallest
kubectl logs lightning-asr-xxx -n smallest --previous
kubectl describe pod lightning-asr-xxx -n smallest
Common Causes:
Error: License validation failed or Invalid license key
Solutions:
  • Check License Proxy is running: kubectl get pods -l app=license-proxy -n smallest
  • Verify license key in values.yaml
  • Check License Proxy logs: kubectl logs -l app=license-proxy -n smallest
  • Test License Proxy: kubectl exec -it <api-server-pod> -- curl http://license-proxy:3369/health
Error: Failed to download model or Connection timeout
Solutions:
  • Verify MODEL_URL in values.yaml
  • Check network connectivity
  • Check disk space: kubectl exec -it <pod> -- df -h
  • Test URL: kubectl run test --rm -it --image=curlimages/curl -- curl -I $MODEL_URL
Error: Pod killed, exit code 137 (OOMKilled)
Solutions:
  • Check memory limits:
    kubectl describe pod lightning-asr-xxx -n smallest | grep -A5 Limits
    
  • Increase memory:
    lightningAsr:
      resources:
        limits:
          memory: 16Gi
    
  • Check node capacity: kubectl describe node <node-name>
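Exit code 137 is 128 + 9 (SIGKILL), which in Kubernetes almost always means the container was OOMKilled. A small helper to decode the common codes (the mapping is the standard 128 + signal convention, not anything specific to this chart):

```shell
# Decode common container exit codes (128 + signal number convention).
exit_reason() {
  case "$1" in
    137) echo "OOMKilled or SIGKILL (128 + 9): check memory limits" ;;
    143) echo "SIGTERM (128 + 15): graceful shutdown or eviction" ;;
    139) echo "SIGSEGV (128 + 11): application crash" ;;
    1)   echo "application error: check container logs" ;;
    *)   echo "exit code $1: check logs and pod events" ;;
  esac
}

exit_reason 137   # → OOMKilled or SIGKILL (128 + 9): check memory limits
```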
Error: No CUDA-capable device or GPU not found
Solutions:
  • Verify GPU available on node: kubectl describe node <node-name> | grep nvidia.com/gpu
  • Check NVIDIA device plugin: kubectl get pods -n kube-system -l name=nvidia-device-plugin
  • Restart device plugin: kubectl delete pod -n kube-system -l name=nvidia-device-plugin
  • Verify GPU driver on node

Service Not Accessible

Symptoms:
  • Cannot connect to API server
  • Connection refused errors
  • Timeouts
Diagnosis:
kubectl get svc -n smallest
kubectl describe svc api-server -n smallest
kubectl get endpoints -n smallest
Solutions:
Issue: Service has no endpoints
Check:
kubectl get endpoints api-server -n smallest
Solutions:
  • Verify pods are running: kubectl get pods -l app=api-server -n smallest
  • Check pod labels match service selector
  • Check pods are ready: kubectl get pods -l app=api-server -o wide
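Missing endpoints usually mean the service selector and the pod labels disagree. The comparison can be sketched as a tiny shell helper; with a live cluster, feed it values from kubectl get svc api-server -o jsonpath='{.spec.selector}' and kubectl get pods --show-labels (the label sets below are illustrative):

```shell
# Succeed only if every selector key=value pair appears in the label set.
selector_matches() {
  sel=$1; labels=$2
  for kv in $sel; do
    case " $labels " in
      *" $kv "*) ;;        # this selector pair is present in the labels
      *) return 1 ;;       # pair missing: the service will not select the pod
    esac
  done
  return 0
}

selector_matches "app=api-server" "app=api-server tier=backend" && echo "labels match"
selector_matches "app=api-server" "app=apiserver" || echo "labels do not match"
```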
Issue: Wrong port
Solutions:
  • Verify service port:
    kubectl get svc api-server -n smallest -o yaml
    
  • Use correct port in connections (7100 for API Server)
Issue: Network policy blocking traffic
Check:
kubectl get networkpolicy -n smallest
Solutions:
  • Review network policies
  • Temporarily disable to test:
    kubectl delete networkpolicy <policy-name> -n smallest
    

HPA Not Scaling

Symptoms:
  • HPA shows <unknown> for metrics
  • Pods not scaling despite high load
Diagnosis:
kubectl get hpa -n smallest
kubectl describe hpa lightning-asr -n smallest
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
Solutions:
Issue: Custom metrics unavailable
Check:
kubectl get servicemonitor -n smallest
kubectl logs -n kube-system -l app.kubernetes.io/name=prometheus-adapter
Solutions:
  • Enable ServiceMonitor:
    scaling:
      auto:
        lightningAsr:
          servicemonitor:
            enabled: true
    
  • Verify Prometheus is scraping:
    kubectl port-forward svc/smallest-prometheus-stack-prometheus 9090:9090
    
    Query: asr_active_requests
Issue: maxReplicas limit reached
Check:
kubectl get hpa lightning-asr -n smallest
Solutions:
  • Increase maxReplicas:
    scaling:
      auto:
        lightningAsr:
          hpa:
            maxReplicas: 20
    
Issue: Insufficient cluster capacity
Solutions:
  • Add more nodes
  • Enable Cluster Autoscaler
  • Check pending pods: kubectl get pods --field-selector=status.phase=Pending
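For context when reading kubectl describe hpa output: the HPA controller computes the target replica count as desired = ceil(currentReplicas × currentMetric / targetMetric). A quick sketch of that arithmetic in shell:

```shell
# desired = ceil(current * metric / target), via integer arithmetic.
desired_replicas() {
  current=$1; metric=$2; target=$3
  echo $(( (current * metric + target - 1) / target ))
}

desired_replicas 4 90 60    # → 6  (4 pods at metric 90 against a target of 60)
```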

Persistent Volume Issues

Symptoms:
  • PVC stuck in Pending
  • Mount failures
  • Permission denied
Solutions:
Issue: No storage class available
Check:
kubectl get storageclass
Solutions:
  • Install EBS CSI driver (AWS)
  • Install EFS CSI driver (AWS)
  • Create storage class
Issue: Mount failures
Check:
kubectl describe pod <pod-name> | grep -A10 "Events"
Solutions:
  • Verify EFS file system ID
  • Check security group allows NFS (port 2049)
  • Verify EFS CSI driver: kubectl get pods -n kube-system -l app=efs-csi-controller
Issue: Permission denied
Solutions:
  • Check volume permissions
  • Add fsGroup to pod securityContext:
    securityContext:
      fsGroup: 1000
    

Performance Issues

Slow Response Times

Check:
kubectl top pods -n smallest
kubectl top nodes
kubectl logs -l app=lightning-asr -n smallest | grep -i "latency\|duration"
Solutions:
  • Increase pod resources
  • Scale up replicas
  • Check GPU utilization: kubectl exec -it <lightning-asr-pod> -- nvidia-smi
  • Review model configuration
  • Check network latency

High CPU/Memory Usage

Check:
kubectl top pods -n smallest
kubectl describe pod <pod-name> -n smallest | grep -A5 "Limits"
Solutions:
  • Increase resource limits
  • Scale horizontally (more pods)
  • Investigate memory leaks in logs
  • Enable monitoring with Grafana

Debugging Tools

Interactive Shell

kubectl exec -it <pod-name> -n smallest -- /bin/sh

Debug Container

kubectl debug <pod-name> -n smallest -it --image=ubuntu -- bash

Network Debugging

kubectl run netdebug --rm -it --restart=Never \
  --image=nicolaka/netshoot \
  --namespace=smallest
Inside the debug pod:
nslookup api-server
curl http://api-server:7100/health
traceroute lightning-asr

Copy Files

kubectl cp <pod-name>:/path/to/file ./local-file -n smallest
kubectl cp ./local-file <pod-name>:/path/to/file -n smallest

Getting Help

Collect Diagnostic Information

Before contacting support, collect:
kubectl get all -n smallest > status.txt
kubectl describe pods -n smallest > pods.txt
kubectl logs -l app=lightning-asr -n smallest --tail=500 > asr-logs.txt
kubectl logs -l app=api-server -n smallest --tail=500 > api-logs.txt
kubectl logs -l app=license-proxy -n smallest --tail=500 > license-logs.txt
kubectl get events -n smallest --sort-by='.lastTimestamp' > events.txt
kubectl top nodes > nodes.txt
kubectl top pods -n smallest > pod-resources.txt
helm get values smallest-self-host -n smallest > values.txt
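Once the commands above have produced their output files, they can be bundled into a single archive to attach to a support ticket (the function name and archive filename are just suggestions):

```shell
# Bundle the diagnostic files collected above into one archive for support.
# Call this in the directory where the files were written.
bundle_diagnostics() {
  tar czf smallest-diagnostics-"$(date +%Y%m%d)".tar.gz \
    status.txt pods.txt asr-logs.txt api-logs.txt license-logs.txt \
    events.txt nodes.txt pod-resources.txt values.txt
}
```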

Contact Support

Email: [email protected]
Include:
  • Description of the issue
  • Steps to reproduce
  • Diagnostic files collected above
  • Cluster information (EKS version, node types, etc.)
  • Helm chart version
