Overview
This guide covers advanced debugging techniques for troubleshooting complex issues with Smallest Self-Host.
Docker Debugging
Enter Running Container
docker exec -it <container-name> /bin/bash
Inside the container:
ls -la
ps aux
df -h
nvidia-smi
env
Debug Failed Container
View logs of crashed container:
docker logs <container-name>
docker logs <container-name> --tail=100 --follow
Inspect container configuration:
docker inspect <container-name>
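The full inspect output is long. As a minimal sketch, the helper below pulls out only the fields that usually matter for a crashed container, using Docker's Go-template --format flag. The function name container_exit_summary is hypothetical and not part of Smallest Self-Host.

```shell
# Hypothetical helper: print exit details for one container.
# Usage: container_exit_summary <container-name>
container_exit_summary() {
  name="$1"
  # .State.ExitCode, .State.OOMKilled, .State.Error and .RestartCount are
  # standard fields of the container object returned by docker inspect.
  docker inspect \
    --format 'exit_code={{.State.ExitCode}} oom_killed={{.State.OOMKilled}} restarts={{.RestartCount}} error={{.State.Error}}' \
    "$name"
}
```

An exit code of 137 together with oom_killed=true usually means the container was killed for exceeding its memory limit.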
Network Debugging
Check container networking:
docker network ls
docker network inspect <network-name>
docker exec <container-name> ping -c 3 license-proxy
docker exec <container-name> curl http://license-proxy:3369/health
Kubernetes Debugging
Commands shown without a -n flag use the namespace of your current kubectl context; add -n smallest if that is not your default.
Debug Pod
Interactive debug container:
kubectl debug <pod-name> -it --image=ubuntu --target=<container-name>
Copy debug tools into pod:
kubectl cp ./debug-script.sh <pod-name>:/tmp/debug.sh
kubectl exec -it <pod-name> -- bash /tmp/debug.sh
Ephemeral Debug Container
Add temporary container to running pod:
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=lightning-asr
Inside debug container:
nslookup license-proxy
curl http://api-server:7100/health
tcpdump -i eth0
Get Previous Logs
If pod crashed and restarted:
kubectl logs <pod-name> --previous
kubectl logs <pod-name> -c <container-name> --previous
Network Debugging
Test Service Connectivity
From inside cluster:
kubectl run netdebug --rm -it --restart=Never \
--image=nicolaka/netshoot \
--namespace=smallest \
-- bash
Inside debug pod:
nslookup api-server
nslookup license-proxy
nslookup lightning-asr
curl http://api-server:7100/health
curl http://license-proxy:3369/health
traceroute api-server
ping -c 3 lightning-asr
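The individual checks above can be combined into one pass. The sketch below is meant to run inside the netshoot debug pod; it tests DNS first and then the health endpoint for the two services whose ports appear in this guide. The function name check_services is hypothetical.

```shell
# Hypothetical helper, intended for use inside the debug pod.
# Service names and ports are the ones used elsewhere in this guide.
check_services() {
  failed=0
  for target in api-server:7100 license-proxy:3369; do
    host="${target%%:*}"
    # DNS first: a failure here points at CoreDNS or a missing Service
    if ! nslookup "$host" >/dev/null 2>&1; then
      echo "FAIL dns    $host"
      failed=1
      continue
    fi
    # Then HTTP: -f returns non-zero on HTTP errors, -m caps the wait at 5s
    if curl -fsS -m 5 -o /dev/null "http://$target/health"; then
      echo "OK   health $target"
    else
      echo "FAIL health $target"
      failed=1
    fi
  done
  return "$failed"
}
```

The function returns non-zero if any check fails, so it can also gate a script.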
DNS Resolution
Check DNS is working:
kubectl run dnstest --rm -it --restart=Never \
--image=busybox \
-- nslookup kubernetes.default
Check CoreDNS logs:
kubectl logs -n kube-system -l k8s-app=kube-dns
Network Policies
List network policies:
kubectl get networkpolicy -n smallest
kubectl describe networkpolicy <policy-name> -n smallest
Temporarily disable for testing:
kubectl delete networkpolicy <policy-name> -n smallest
Remember to recreate network policies after testing!
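Deleting a policy removes its definition from the cluster, so it helps to export it first. A minimal sketch, assuming you have no other copy of the manifest; the helper names and the backup file name are hypothetical. If the policy is managed by Helm or kept in version control, re-applying from that source is the safer route.

```shell
# Hypothetical helpers. Usage: disable_policy <policy-name>
disable_policy() {
  policy="$1"
  backup="./${policy}.networkpolicy.yaml"
  # Export first; delete only if the export succeeded
  kubectl get networkpolicy "$policy" -n smallest -o yaml > "$backup" \
    && kubectl delete networkpolicy "$policy" -n smallest
}

# Usage: restore_policy <policy-name>
restore_policy() {
  # Re-create the policy from the exported file
  kubectl apply -f "./$1.networkpolicy.yaml"
}
```

The export contains server-populated metadata (for example uid and resourceVersion). If the apply step rejects the file, remove those fields and apply again.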
Resource Usage
Check pod resource consumption:
kubectl top pods -n smallest
kubectl top pods -n smallest --sort-by=memory
kubectl top pods -n smallest --sort-by=cpu
Check node resource usage:
kubectl top nodes
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
GPU Debugging
Check GPU availability in pod:
kubectl exec -it <lightning-asr-pod> -- nvidia-smi
kubectl exec -it <lightning-asr-pod> -- nvidia-smi dmon
Watch GPU utilization:
kubectl exec -it <lightning-asr-pod> -- watch -n 1 nvidia-smi
Query detailed GPU state:
kubectl exec -it <lightning-asr-pod> -- nvidia-smi -q -d MEMORY,UTILIZATION,POWER,CLOCK,PERFORMANCE
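For a record that can be saved and compared over time, nvidia-smi can also emit selected fields as CSV. A sketch under the assumption that the pod runs in the smallest namespace; the function name gpu_csv is hypothetical.

```shell
# Hypothetical helper: sample GPU utilization and memory as CSV.
# Usage: gpu_csv <lightning-asr-pod> [interval-seconds]
gpu_csv() {
  pod="$1"
  interval="${2:-5}"
  # --query-gpu, --format=csv and -l are standard nvidia-smi options;
  # -l repeats the query every N seconds until interrupted.
  kubectl exec "$pod" -n smallest -- nvidia-smi \
    --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
    --format=csv -l "$interval"
}
```

Redirect the output to a file to keep a trace of utilization during a load test.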
Application Profiling
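Profile Lightning ASR with py-spy (this assumes a Debian-based image with network access; py-spy also needs the SYS_PTRACE capability to attach to a process):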
kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y python3-pip && pip3 install py-spy'
kubectl exec -it <pod> -- py-spy top --pid 1
Check the memory usage of the main process (PID 1):
kubectl exec -it <pod> -- sh -c 'grep -E "^Vm(Peak|Size|HWM|RSS)" /proc/1/status'
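The kernel reports a process's resident memory as the VmRSS field of /proc/PID/status, in kB. As a small sketch, the filter below converts that field to MiB for easier comparison with pod memory limits; the name rss_mib is hypothetical.

```shell
# Hypothetical helper: read /proc/<pid>/status content on stdin and
# print the resident set size in MiB.
# Example: kubectl exec <pod> -- cat /proc/1/status | rss_mib
rss_mib() {
  # The VmRSS line looks like "VmRSS:     2048 kB"; field 2 is the kB value
  awk '/^VmRSS:/ { printf "%.1f MiB\n", $2 / 1024 }'
}
```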
Log Analysis
Structured Log Parsing
Extract errors from logs:
kubectl logs <pod> | grep -i "error\|exception\|failed"
Count lines that contain an error:
kubectl logs <pod> | grep -ci "error"
Show errors with five lines of context on each side:
kubectl logs <pod> | grep -i -B 5 -A 5 "error"
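To see which kind of failure dominates, the same keywords can be tallied in a single pass. A minimal sketch; summarize_errors is a hypothetical helper that reads log text on stdin.

```shell
# Hypothetical helper: print a count per keyword, most frequent first.
# Example: kubectl logs <pod> | summarize_errors
summarize_errors() {
  # -o prints only the matched text, -i ignores case, -E enables alternation;
  # matches are lower-cased so "ERROR" and "error" are counted together.
  grep -oiE 'error|exception|failed' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn
}
```

This counts occurrences rather than lines, so a line containing two keywords contributes to both totals.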
Log Aggregation
Combine logs from all replicas:
kubectl logs -l app=lightning-asr -n smallest --tail=100 --all-containers=true
Follow logs from multiple pods:
kubectl logs -l app=lightning-asr -n smallest -f --max-log-requests=10
Parse JSON Logs
Using jq:
kubectl logs <pod> | jq 'select(.level=="error")'
kubectl logs <pod> | jq 'select(.duration > 1000)'
kubectl logs <pod> | jq '.message' -r
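The jq filters above stop with a parse error if the log stream contains any line that is not JSON, such as a startup banner. A sketch of a more tolerant filter, assuming the level field shown above; json_errors is a hypothetical name.

```shell
# Hypothetical helper: read mixed log output on stdin and print only the
# JSON objects whose "level" field is "error", one per line.
# Example: kubectl logs <pod> | json_errors
json_errors() {
  # -R reads each line as a raw string; "fromjson?" skips lines that are
  # not valid JSON instead of aborting; -c prints compact one-line output.
  jq -cR 'fromjson? | select(type == "object" and .level == "error")'
}
```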
Database Debugging
Redis Debugging
Connect to Redis:
kubectl exec -it <redis-pod> -- redis-cli
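Inside the Redis CLI (KEYS * blocks the server while it walks every key, so SCAN is used here; run MONITOR only briefly, as it slows a busy instance):
AUTH your-password
INFO
DBSIZE
SCAN 0 COUNT 100
GET some_key
MONITOR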
Check Redis memory:
Check slow queries:
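No commands appear on this page for the two checks above. A minimal sketch using the standard Redis commands INFO memory and SLOWLOG GET; the helper name, the smallest namespace, and the absence of a password are assumptions (pass -a with the password to redis-cli if AUTH is required).

```shell
# Hypothetical helper. Usage: redis_diagnostics <redis-pod>
redis_diagnostics() {
  pod="$1"
  echo "== memory =="
  # INFO memory reports used_memory, used_memory_peak, maxmemory and
  # mem_fragmentation_ratio, among other fields
  kubectl exec "$pod" -n smallest -- redis-cli INFO memory
  echo "== slow log =="
  # SLOWLOG GET 10 returns the ten most recent commands that exceeded
  # the slowlog-log-slower-than threshold
  kubectl exec "$pod" -n smallest -- redis-cli SLOWLOG GET 10
}
```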
API Debugging
Test API Endpoints
Health check (run the port-forward in a separate terminal, since it stays in the foreground):
kubectl port-forward -n smallest svc/api-server 7100:7100
curl http://localhost:7100/health
Test transcription:
curl -X POST http://localhost:7100/v1/listen \
-H "Authorization: Token ${LICENSE_KEY}" \
-H "Content-Type: application/json" \
-d '{"url": "https://www2.cs.uic.edu/~i101/SoundFiles/StarWars60.wav"}' \
-v
Request Tracing
Add request ID tracking:
curl -X POST http://localhost:7100/v1/listen \
-H "Authorization: Token ${LICENSE_KEY}" \
-H "X-Request-ID: debug-123" \
-H "Content-Type: application/json" \
-d '{"url": "..."}' \
-v
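Grep the logs for the request ID (--tail=-1 is needed because kubectl returns only the last 10 lines per pod when a label selector is used):
kubectl logs -l app=api-server -n smallest --tail=-1 | grep "debug-123"
kubectl logs -l app=lightning-asr -n smallest --tail=-1 | grep "debug-123"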
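A fixed ID such as debug-123 matches every earlier test request too. As a sketch, the helper below generates a fresh ID per call and then searches both services' logs for it. It assumes the port-forward to localhost:7100 is active, LICENSE_KEY is set, and the services log the X-Request-ID header; the function name trace_request is hypothetical.

```shell
# Hypothetical helper. Usage: trace_request <audio-url>
trace_request() {
  audio_url="$1"
  # Unique per invocation: epoch seconds plus the shell's PID
  request_id="debug-$(date +%s)-$$"
  echo "request id: $request_id"
  curl -sS -X POST http://localhost:7100/v1/listen \
    -H "Authorization: Token ${LICENSE_KEY}" \
    -H "X-Request-ID: ${request_id}" \
    -H "Content-Type: application/json" \
    -d "{\"url\": \"${audio_url}\"}"
  echo
  for app in api-server lightning-asr; do
    echo "--- $app ---"
    # --since limits the search to recent lines; --tail=-1 lifts the
    # 10-line default that applies with -l; -F matches the ID literally
    kubectl logs -l "app=$app" -n smallest --since=5m --tail=-1 \
      | grep -F "$request_id" || echo "(no matching lines)"
  done
}
```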
Packet Capture
Capture network traffic:
kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y tcpdump'
kubectl exec -it <pod> -- tcpdump -i any -w /tmp/capture.pcap port 7100
kubectl cp <pod>:/tmp/capture.pcap ./capture.pcap
Analyze with Wireshark or:
tcpdump -r capture.pcap -A
Event Debugging
Watch Events
Real-time events:
kubectl get events -n smallest --watch
Filter by type:
kubectl get events -n smallest --field-selector type=Warning
Sort by timestamp:
kubectl get events -n smallest --sort-by='.lastTimestamp'
Event Analysis
Count events by reason:
kubectl get events -n smallest -o json | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'
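The same tally can be produced without jq, sorted so the most frequent reason comes first. A minimal sketch using kubectl's custom-columns output; the function name events_by_reason is hypothetical.

```shell
# Hypothetical helper. Usage: events_by_reason [namespace]
events_by_reason() {
  ns="${1:-smallest}"
  # Print one reason per line, then count duplicates and sort descending
  kubectl get events -n "$ns" --no-headers -o custom-columns=REASON:.reason \
    | sort | uniq -c | sort -rn
}
```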
Metrics Debugging
Check Prometheus Metrics
Port forward Prometheus:
kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090
Query metrics:
Open http://localhost:9090 and run:
asr_active_requests
rate(asr_total_requests[5m])
asr_gpu_utilization
Check Custom Metrics
Verify metrics available to HPA:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
Query specific metric:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests" | jq .
Debugging Checklists
Startup Issues Checklist
Check Image Pull
kubectl describe pod <pod> | grep -A 10 "Events"
Verify Secrets
kubectl get secrets -n smallest
kubectl describe secret <secret-name>
Check Resources
kubectl describe node <node> | grep "Allocated resources" -A 10
Review Logs
kubectl logs <pod> --all-containers=true
Performance Issues Checklist
Check Resource Usage
kubectl top pods -n smallest
kubectl top nodes
Verify GPU
kubectl exec <pod> -- nvidia-smi
Check HPA
kubectl get hpa -n smallest
kubectl describe hpa lightning-asr -n smallest
Review Metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
Advanced Techniques
Enable Debug Logging
Increase log verbosity:
lightningAsr:
  env:
    - name: LOG_LEVEL
      value: "DEBUG"
Simulate Failures
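Test error handling by deleting a pod or draining a node:
kubectl delete pod <pod-name>
kubectl drain <node-name> --ignore-daemonsets
Return the drained node to service after the test:
kubectl uncordon <node-name>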
Load Testing
Generate load:
kubectl run load-test --rm -it --restart=Never \
--image=williamyeh/hey \
--namespace=smallest \
-- -z 5m -c 50 http://api-server:7100/health
Chaos Engineering
Test resilience (requires Chaos Mesh):
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - smallest
    labelSelectors:
      app: lightning-asr
  duration: "30s"