
Overview

This guide covers advanced debugging techniques for troubleshooting complex issues with Smallest Self-Host.

Debugging Tools

Docker Debugging

Enter Running Container

docker exec -it <container-name> /bin/bash
Inside the container:
ls -la
ps aux
df -h
nvidia-smi
env

Debug Failed Container

View logs of crashed container:
docker logs <container-name>
docker logs <container-name> --tail=100 --follow
Inspect container configuration:
docker inspect <container-name>
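When the logs are inconclusive, the exit code recorded by Docker often narrows down the cause. The sketch below reads `.State.ExitCode` and `.State.OOMKilled` from `docker inspect` and maps common codes to a likely cause. The function names are hypothetical, and the interpretations are conventional rules of thumb rather than anything specific to Smallest Self-Host.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    1)   echo "application error (check the logs)" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "killed by SIGKILL (often an out-of-memory kill)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "unrecognized exit code" ;;
  esac
}

# Print the exit code, the OOM flag, and an interpretation for one container.
container_exit_summary() {
  # .State.ExitCode and .State.OOMKilled are fields of `docker inspect` output.
  ces_state=$(docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' "$1") || return 1
  ces_code=${ces_state%% *}
  ces_oom=${ces_state##* }
  echo "exit_code=$ces_code oom_killed=$ces_oom: $(explain_exit_code "$ces_code")"
}
```

Call it as `container_exit_summary <container-name>`. An `oom_killed=true` result usually points at the container memory limit rather than at the application itself.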

Network Debugging

Check container networking:
docker network ls
docker network inspect <network-name>
docker exec <container> ping license-proxy
docker exec <container> curl http://license-proxy:3369/health

Kubernetes Debugging

Debug Pod

Interactive debug container:
kubectl debug <pod-name> -it --image=ubuntu --target=<container-name>
Copy debug tools into pod:
kubectl cp ./debug-script.sh <pod-name>:/tmp/debug.sh
kubectl exec -it <pod-name> -- bash /tmp/debug.sh

Ephemeral Debug Container

Add temporary container to running pod:
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=lightning-asr
Inside debug container:
nslookup license-proxy
curl http://api-server:7100/health
tcpdump -i eth0

Get Previous Logs

If the pod crashed and restarted:
kubectl logs <pod-name> --previous
kubectl logs <pod-name> -c <container-name> --previous
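In a multi-container pod it helps to know which container actually restarted before asking for its previous logs. A minimal sketch, assuming the `smallest` namespace used elsewhere in this guide; the function name is hypothetical, while the `containerStatuses` fields it reads are part of the standard pod status.

```shell
# Hypothetical helper: list containers in a pod that have restarted,
# with the reason recorded for their last termination.
# $1 = pod name, $2 = namespace (defaults to "smallest")
restarted_containers() {
  kubectl get pod "$1" -n "${2:-smallest}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 > 0 { printf "%s restarts=%s last_reason=%s\n", $1, $2, ($3 == "" ? "unknown" : $3) }'
}
```

Each name it prints can then be passed to `kubectl logs <pod-name> -c <container-name> --previous`.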

Network Debugging

Test Service Connectivity

From inside cluster:
kubectl run netdebug --rm -it --restart=Never \
  --image=nicolaka/netshoot \
  --namespace=smallest \
  -- bash
Inside debug pod:
nslookup api-server
nslookup license-proxy
nslookup lightning-asr

curl http://api-server:7100/health
curl http://license-proxy:3369/health

traceroute api-server
ping -c 3 lightning-asr
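The health checks above can be wrapped in a small loop so that every endpoint is probed in one pass. This is a sketch rather than a supported tool: `check_health` is a hypothetical name, and the 5-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: probe each URL and report pass/fail.
# Returns non-zero if any endpoint fails.
check_health() {
  ch_failed=0
  for ch_url in "$@"; do
    # -f: fail on HTTP >= 400, -sS: quiet but still show errors,
    # --max-time: overall timeout in seconds
    if curl -fsS --max-time 5 -o /dev/null "$ch_url"; then
      echo "OK   $ch_url"
    else
      echo "FAIL $ch_url"
      ch_failed=1
    fi
  done
  return "$ch_failed"
}
```

Inside the debug pod, `check_health http://api-server:7100/health http://license-proxy:3369/health` prints one line per endpoint and exits non-zero if any of them is unreachable.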

DNS Resolution

Check that DNS is working:
kubectl run dnstest --rm -it --restart=Never \
  --image=busybox \
  -- nslookup kubernetes.default
Check CoreDNS logs:
kubectl logs -n kube-system -l k8s-app=kube-dns

Network Policies

List network policies:
kubectl get networkpolicy -n smallest
kubectl describe networkpolicy <policy-name> -n smallest
Temporarily disable for testing:
kubectl delete networkpolicy <policy-name> -n smallest
Remember to recreate network policies after testing!
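One way to make sure a deleted policy can be recreated is to save its manifest before deleting it. The sketch below assumes the `smallest` namespace; `disable_policy` and `restore_policy` are hypothetical helper names, not part of any shipped tooling.

```shell
# Hypothetical helper: save a NetworkPolicy to a file, then delete it.
# $1 = policy name, $2 = backup file, $3 = namespace (defaults to "smallest")
disable_policy() {
  kubectl get networkpolicy "$1" -n "${3:-smallest}" -o yaml > "$2" || return 1
  # Only delete once the backup exists and is non-empty.
  [ -s "$2" ] || { echo "backup $2 is empty; not deleting" >&2; return 1; }
  kubectl delete networkpolicy "$1" -n "${3:-smallest}"
}

# Hypothetical helper: recreate the policy from the saved manifest.
# $1 = backup file, $2 = namespace (defaults to "smallest")
restore_policy() {
  kubectl apply -f "$1" -n "${2:-smallest}"
}
```

A typical session would be `disable_policy <policy-name> /tmp/policy.yaml`, run the connectivity test, then `restore_policy /tmp/policy.yaml`.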

Performance Debugging

Resource Usage

Check pod resource consumption:
kubectl top pods -n smallest
kubectl top pods -n smallest --sort-by=memory
kubectl top pods -n smallest --sort-by=cpu
Check node resource usage:
kubectl top nodes
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
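When many pods are running, it can be quicker to filter the `kubectl top` output than to read it. A minimal sketch that flags pods above a memory threshold, assuming the usual three-column output (name, CPU, memory); the function name and the threshold are illustrative.

```shell
# Hypothetical helper: read `kubectl top pods --no-headers` output on stdin and
# print pods whose memory usage exceeds a threshold given in MiB.
pods_over_memory() {
  awk -v limit="$1" '
    {
      mem = $3
      value = mem + 0                      # numeric prefix, e.g. 2048 from "2048Mi"
      if (mem ~ /Gi$/)      value *= 1024  # convert GiB to MiB
      else if (mem ~ /Ki$/) value /= 1024  # convert KiB to MiB
      if (value > limit) printf "%s %dMi\n", $1, value
    }'
}
```

For example, `kubectl top pods -n smallest --no-headers | pods_over_memory 8192` lists only the pods using more than 8 GiB.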

GPU Debugging

Check GPU availability in pod:
kubectl exec -it <lightning-asr-pod> -- nvidia-smi

Stream per-GPU device statistics:
kubectl exec -it <lightning-asr-pod> -- nvidia-smi dmon
Watch GPU utilization:
kubectl exec -it <lightning-asr-pod> -- watch -n 1 nvidia-smi
Query detailed GPU state (memory, utilization, power, clocks, performance):
kubectl exec -it <lightning-asr-pod> -- nvidia-smi -q -d MEMORY,UTILIZATION,POWER,CLOCK,PERFORMANCE
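For scripted checks, the CSV query mode of `nvidia-smi` is easier to parse than the default table. The sketch below summarizes one line per GPU; `gpu_summary` is a hypothetical name, and it assumes input columns in the order index, utilization, memory used, memory total.

```shell
# Hypothetical helper: read CSV rows of
# "index, utilization %, memory used MiB, memory total MiB" on stdin
# and print one summary line per GPU.
gpu_summary() {
  awk -F ', *' '$4 > 0 {
    printf "gpu=%s util=%s%% mem=%d/%dMiB (%.0f%%)\n", $1, $2, $3, $4, 100 * $3 / $4
  }'
}
```

It is meant to be fed from `kubectl exec <lightning-asr-pod> -- nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | gpu_summary`. Leaving out `-it` here keeps terminal control characters out of the piped output.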

Application Profiling

Profile Lightning ASR:
kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y python3-pip && pip3 install py-spy'

kubectl exec -it <pod> -- py-spy top --pid 1
Memory profiling:
kubectl exec -it <pod> -- grep -E '^Vm(Peak|Size|HWM|RSS)' /proc/1/status

Log Analysis

Structured Log Parsing

Extract errors from logs:
kubectl logs <pod> | grep -i "error\|exception\|failed"
Count errors:
kubectl logs <pod> | grep -i "error" | wc -l
Show errors with context:
kubectl logs <pod> | grep -i -B 5 -A 5 "error"
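Raw error counts say little about which failure dominates. A minimal sketch that groups similar error lines by replacing every run of digits with `N`, so that timestamps and IDs do not split otherwise identical messages; `top_errors` is a hypothetical name, and the digit collapsing is a rough heuristic.

```shell
# Hypothetical helper: read log lines on stdin, keep those mentioning errors,
# collapse digits, and print the most frequent patterns first.
# $1 = number of patterns to show (defaults to 10)
top_errors() {
  grep -iE 'error|exception|failed' |
    sed -E 's/[0-9]+/N/g' |
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}
```

Usage: `kubectl logs <pod> | top_errors 5`.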

Log Aggregation

Combine logs from all replicas:
kubectl logs -l app=lightning-asr -n smallest --tail=100 --all-containers=true
Follow logs from multiple pods:
kubectl logs -l app=lightning-asr -n smallest -f --max-log-requests=10

Parse JSON Logs

Using jq:
kubectl logs <pod> | jq 'select(.level=="error")'
kubectl logs <pod> | jq 'select(.duration > 1000)'
kubectl logs <pod> | jq '.message' -r
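Plain `jq` filters stop with a parse error as soon as the stream contains a non-JSON line, such as a startup banner. A sketch that tolerates mixed output, assuming each JSON log line is an object with a `level` field as in the filters above; the function name is hypothetical.

```shell
# Hypothetical helper: read mixed log output on stdin and print only the lines
# that parse as JSON objects with the given level.
# -R reads each line as a raw string; `fromjson?` silently skips non-JSON lines.
json_logs_at_level() {
  jq -R -c --arg level "$1" 'fromjson? | select(type == "object" and .level == $level)'
}
```

Usage: `kubectl logs <pod> | json_logs_at_level error`.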

Database Debugging

Redis Debugging

Connect to Redis:
kubectl exec -it <redis-pod> -- redis-cli
Inside Redis CLI:
AUTH your-password
INFO
DBSIZE
KEYS *
GET some_key
MONITOR
Check Redis memory:
INFO memory
Check slow queries:
SLOWLOG GET 10
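`KEYS *` walks the entire keyspace in a single blocking call, which can stall a busy instance. `redis-cli --scan` iterates with the incremental `SCAN` command instead. A minimal sketch, assuming an instance without authentication and the `smallest` namespace; the function name is hypothetical.

```shell
# Hypothetical helper: list keys matching a pattern without blocking Redis.
# $1 = Redis pod, $2 = key pattern (defaults to "*"), $3 = namespace (defaults to "smallest")
redis_scan_keys() {
  kubectl exec "$1" -n "${3:-smallest}" -- \
    redis-cli --scan --pattern "${2:-*}"
}
```

Usage: `redis_scan_keys <redis-pod> 'session:*' | head -n 20`. The `session:*` pattern is only an example. For a password-protected instance, credentials have to be supplied to `redis-cli` separately.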

API Debugging

Test API Endpoints

Health check (the port-forward blocks while active, so run it in a separate terminal):
kubectl port-forward -n smallest svc/api-server 7100:7100
curl http://localhost:7100/health
Test transcription:
curl -X POST http://localhost:7100/v1/listen \
  -H "Authorization: Token ${LICENSE_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www2.cs.uic.edu/~i101/SoundFiles/StarWars60.wav"}' \
  -v

Request Tracing

Add request ID tracking:
curl -X POST http://localhost:7100/v1/listen \
  -H "Authorization: Token ${LICENSE_KEY}" \
  -H "X-Request-ID: debug-123" \
  -H "Content-Type: application/json" \
  -d '{"url": "..."}' \
  -v
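Grep logs for the request (with a label selector, kubectl logs returns only the last 10 lines per pod unless --tail is set):
kubectl logs -l app=api-server -n smallest --tail=-1 | grep "debug-123"
kubectl logs -l app=lightning-asr -n smallest --tail=-1 | grep "debug-123"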
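A fixed ID such as `debug-123` can collide with earlier test runs. The sketch below generates a fresh ID and searches several workloads for it in one call. Both function names are hypothetical, and it assumes the services log the `X-Request-ID` header value, as the commands above imply.

```shell
# Hypothetical helper: build a request ID that is unlikely to collide with other traffic.
new_request_id() {
  echo "debug-$(date +%s)-$$"
}

# Hypothetical helper: search the logs of several workloads for one request ID.
# $1 = request ID; remaining arguments = values of the "app" label.
grep_request() {
  gr_id=$1
  shift
  for gr_app in "$@"; do
    echo "== $gr_app =="
    # --tail=-1 returns all lines (a selector otherwise limits output to 10 per pod);
    # --prefix tags each line with its source pod; no match is not an error here.
    kubectl logs -l "app=$gr_app" -n smallest --tail=-1 --prefix 2>/dev/null |
      grep -F -- "$gr_id" || true
  done
}
```

Usage: set `REQ_ID=$(new_request_id)`, send it in the `X-Request-ID` header, then run `grep_request "$REQ_ID" api-server lightning-asr`.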

Packet Capture

Capture network traffic:
kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y tcpdump'

kubectl exec -it <pod> -- tcpdump -i any -w /tmp/capture.pcap port 7100

kubectl cp <pod>:/tmp/capture.pcap ./capture.pcap
Analyze with Wireshark or:
tcpdump -r capture.pcap -A

Event Debugging

Watch Events

Real-time events:
kubectl get events -n smallest --watch
Filter by type:
kubectl get events -n smallest --field-selector type=Warning
Sort by timestamp:
kubectl get events -n smallest --sort-by='.lastTimestamp'

Event Analysis

Count events by reason:
kubectl get events -n smallest -o json | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

Metrics Debugging

Check Prometheus Metrics

Port forward Prometheus:
kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090
Query metrics: Open http://localhost:9090 and run:
asr_active_requests
rate(asr_total_requests[5m])
asr_gpu_utilization

Check Custom Metrics

Verify metrics available to HPA:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
Query specific metric:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests" | jq .

Debugging Checklists

Startup Issues Checklist

1. Check Image Pull

kubectl describe pod <pod> | grep -A 10 "Events"

2. Verify Secrets

kubectl get secrets -n smallest
kubectl describe secret <secret-name>

3. Check Resources

kubectl describe node <node> | grep "Allocated resources" -A 10

4. Review Logs

kubectl logs <pod> --all-containers=true
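The pod-level checks from this list can be captured in one file, which is convenient when attaching evidence to a support request. A minimal sketch with a hypothetical function name, assuming the `smallest` namespace. The node resource check is left out because it needs a node name, and the 200-line log limit is arbitrary.

```shell
# Hypothetical helper: run the pod-level startup checks and save the output.
# $1 = pod name, $2 = output file, $3 = namespace (defaults to "smallest")
startup_report() {
  sr_ns=${3:-smallest}
  {
    echo "### Pod events"
    kubectl describe pod "$1" -n "$sr_ns" | grep -A 10 "Events"
    echo "### Secrets"
    kubectl get secrets -n "$sr_ns"
    echo "### Logs (all containers, last 200 lines)"
    kubectl logs "$1" -n "$sr_ns" --all-containers=true --tail=200
  } > "$2" 2>&1
  echo "wrote $2"
}
```

Usage: `startup_report <pod> /tmp/startup-report.txt`.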

Performance Issues Checklist

1. Check Resource Usage

kubectl top pods -n smallest
kubectl top nodes

2. Verify GPU

kubectl exec <pod> -- nvidia-smi

3. Check HPA

kubectl get hpa
kubectl describe hpa lightning-asr

4. Review Metrics

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

Advanced Techniques

Enable Debug Logging

Increase log verbosity:
lightningAsr:
  env:
    - name: LOG_LEVEL
      value: "DEBUG"

Simulate Failures

Test error handling:
kubectl delete pod <pod-name>
kubectl drain <node-name> --ignore-daemonsets
After the test, make the node schedulable again:
kubectl uncordon <node-name>

Load Testing

Generate load:
kubectl run load-test --rm -it --restart=Never \
  --image=williamyeh/hey \
  --namespace=smallest \
  -- -z 5m -c 50 http://api-server:7100/health

Chaos Engineering

Test resilience (requires Chaos Mesh):
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - smallest
    labelSelectors:
      app: lightning-asr
  duration: "30s"
