Debugging Guide

Overview

This guide covers advanced debugging techniques for troubleshooting complex issues with Smallest Self-Host.

Debugging Tools

Docker Debugging

Enter Running Container

docker exec -it <container-name> /bin/bash

Inside the container:

ls -la
ps aux
df -h
nvidia-smi
env

Debug Failed Container

View logs of crashed container:

docker logs <container-name>
docker logs <container-name> --tail=100 --follow

Inspect container configuration:

docker inspect <container-name>

Network Debugging

Check container networking:

docker network ls
docker network inspect <network-name>
docker exec <container-name> ping -c 3 license-proxy
docker exec <container-name> curl http://license-proxy:3369/health

Kubernetes Debugging

Debug Pod

Interactive debug container:

kubectl debug <pod-name> -it --image=ubuntu --target=<container-name>

Copy debug tools into pod:

kubectl cp ./debug-script.sh <pod-name>:/tmp/debug.sh
kubectl exec -it <pod-name> -- bash /tmp/debug.sh

Ephemeral Debug Container

Add temporary container to running pod:

kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=lightning-asr

Inside debug container:

nslookup license-proxy
curl http://api-server:7100/health
tcpdump -i eth0

Get Previous Logs

If pod crashed and restarted:

kubectl logs <pod-name> --previous
kubectl logs <pod-name> -c <container-name> --previous

Network Debugging

Test Service Connectivity

From inside cluster:

kubectl run netdebug --rm -it --restart=Never \
  --image=nicolaka/netshoot \
  --namespace=smallest \
  -- bash

Inside debug pod:

nslookup api-server
nslookup license-proxy
nslookup lightning-asr

curl http://api-server:7100/health
curl http://license-proxy:3369/health

traceroute api-server
ping -c 3 lightning-asr

DNS Resolution

Check DNS is working:

kubectl run dnstest --rm -it --restart=Never \
  --image=busybox \
  -- nslookup kubernetes.default

Check CoreDNS logs:

kubectl logs -n kube-system -l k8s-app=kube-dns

Network Policies

List network policies:

kubectl get networkpolicy -n smallest
kubectl describe networkpolicy <policy-name> -n smallest

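Temporarily disable a policy for testing. Save its manifest first so it can be restored:

kubectl get networkpolicy <policy-name> -n smallest -o yaml > <policy-name>-backup.yaml
kubectl delete networkpolicy <policy-name> -n smallest

Recreate the policy as soon as testing is done, for example with kubectl apply -f <policy-name>-backup.yaml.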

Performance Debugging

Resource Usage

Check pod resource consumption:

kubectl top pods -n smallest
kubectl top pods -n smallest --sort-by=memory
kubectl top pods -n smallest --sort-by=cpu
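To flag only the pods above a memory threshold, the kubectl top output can be filtered with awk. This is a minimal sketch, assuming the MEMORY column is reported with the Mi suffix (as kubectl top usually prints it); pods_over_mem is a hypothetical helper name, not part of any tool.

```shell
# pods_over_mem THRESHOLD_MI
# Reads `kubectl top pods --no-headers` output on stdin (NAME CPU MEMORY)
# and prints the pods whose memory usage exceeds THRESHOLD_MI.
# Assumption: the MEMORY column uses the Mi suffix; other units are skipped.
pods_over_mem() {
  awk -v limit="$1" '
    $3 ~ /^[0-9]+Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)              # strip the unit, keep the number
      if (mem + 0 > limit + 0) print $1, $3
    }
  '
}

# Usage (not run here):
#   kubectl top pods -n smallest --no-headers | pods_over_mem 4096
```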

Check node resource usage:

kubectl top nodes
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

GPU Debugging

Check GPU availability in pod:

kubectl exec -it <lightning-asr-pod> -- nvidia-smi

kubectl exec -it <lightning-asr-pod> -- nvidia-smi dmon

Watch GPU utilization:

kubectl exec -it <lightning-asr-pod> -- watch -n 1 nvidia-smi

Query detailed GPU state (memory, utilization, power, clocks, performance):

kubectl exec -it <lightning-asr-pod> -- nvidia-smi -q -d MEMORY,UTILIZATION,POWER,CLOCK,PERFORMANCE

Application Profiling

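Profile Lightning ASR with py-spy. This assumes a Debian-based image with network access; py-spy typically also needs the SYS_PTRACE capability in the container's securityContext: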

kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y python3-pip && pip3 install py-spy'

kubectl exec -it <pod> -- py-spy top --pid 1

Memory profiling (the Vm* fields in /proc/1/status report the memory use of PID 1):

kubectl exec -it <pod> -- sh -c 'grep -i "^Vm" /proc/1/status'
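When tracking memory over time, a single number is easier to compare than the full status file. The sketch below extracts the resident set size (the VmRSS field, which the kernel reports in kB) and converts it to MiB; rss_mib is a hypothetical helper name.

```shell
# rss_mib: reads the contents of /proc/<pid>/status on stdin and prints
# the resident set size (VmRSS, reported by the kernel in kB) in MiB.
rss_mib() {
  awk '/^VmRSS:/ { printf "%.1f\n", $2 / 1024 }'
}

# Usage (not run here):
#   kubectl exec <pod> -- cat /proc/1/status | rss_mib
```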

Log Analysis

Structured Log Parsing

Extract errors from logs:

kubectl logs <pod> | grep -i "error\|exception\|failed"

Count errors:

kubectl logs <pod> | grep -ci "error"

Show errors with context:

kubectl logs <pod> | grep -i -B 5 -A 5 "error"
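The three keywords used above can also be counted in one pass, which gives a quick summary before reading individual lines. This is a minimal sketch; count_by_keyword is a hypothetical helper, and a line that contains several keywords is counted once for each.

```shell
# count_by_keyword: reads log text on stdin and prints how many lines
# contain each keyword (case-insensitive).
count_by_keyword() {
  awk '
    { line = tolower($0) }
    line ~ /error/     { n["error"]++ }
    line ~ /exception/ { n["exception"]++ }
    line ~ /failed/    { n["failed"]++ }
    END {
      printf "error=%d exception=%d failed=%d\n",
             n["error"], n["exception"], n["failed"]
    }
  '
}

# Usage (not run here):
#   kubectl logs <pod> | count_by_keyword
```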

Log Aggregation

Combine logs from all replicas:

kubectl logs -l app=lightning-asr -n smallest --tail=100 --all-containers=true

Follow logs from multiple pods:

kubectl logs -l app=lightning-asr -n smallest -f --max-log-requests=10

Parse JSON Logs

Using jq:

kubectl logs <pod> | jq 'select(.level=="error")'
kubectl logs <pod> | jq 'select(.duration > 1000)'
kubectl logs <pod> | jq '.message' -r
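The jq filters above stop with a parse error if the log stream mixes JSON with plain-text lines (startup banners, stack traces). One way around this, sketched below, is to read each line as a raw string and keep only the lines that parse as JSON objects. The level field comes from the examples above; json_errors is a hypothetical helper name.

```shell
# json_errors: reads mixed log output on stdin and prints only the JSON
# objects whose "level" field equals "error", one compact object per line.
#   -R         read each input line as a raw string instead of JSON
#   fromjson?  parse the string, silently skipping lines that are not JSON
#   objects    drop JSON values that are not objects (numbers, strings, ...)
json_errors() {
  jq -cR 'fromjson? | objects | select(.level == "error")'
}

# Usage (not run here):
#   kubectl logs <pod> | json_errors
```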

Database Debugging

Redis Debugging

Connect to Redis:

kubectl exec -it <redis-pod> -- redis-cli

Inside Redis CLI (KEYS * and MONITOR are expensive on a busy instance; prefer SCAN there and stop MONITOR quickly):

AUTH your-password
INFO
DBSIZE
KEYS *
GET some_key
MONITOR

Check Redis memory:

INFO memory

Check slow queries:

SLOWLOG GET 10

API Debugging

Test API Endpoints

Health check (kubectl port-forward blocks, so run it in a separate terminal):

kubectl port-forward -n smallest svc/api-server 7100:7100
curl http://localhost:7100/health
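A port-forward can take a moment to become ready, and a service that is still starting may refuse the first few requests. A small retry wrapper avoids mistaking that for a real failure. This is a generic sketch in POSIX shell; retry is a hypothetical helper, not part of kubectl or curl.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs out.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && return 0
    status=$?                       # exit status of the failed command
    [ "$i" -ge "$attempts" ] && return "$status"
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage (not run here): up to 5 attempts, 2 seconds apart
#   retry 5 2 curl -fsS http://localhost:7100/health
```

The -f flag makes curl return a non-zero status on HTTP errors, so an unhealthy response also triggers a retry.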

Test transcription:

curl -X POST http://localhost:7100/v1/listen \
  -H "Authorization: Token ${LICENSE_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www2.cs.uic.edu/~i101/SoundFiles/StarWars60.wav"}' \
  -v

Request Tracing

Add request ID tracking:

curl -X POST http://localhost:7100/v1/listen \
  -H "Authorization: Token ${LICENSE_KEY}" \
  -H "X-Request-ID: debug-123" \
  -H "Content-Type: application/json" \
  -d '{"url": "..."}' \
  -v

Grep logs for request:

kubectl logs -l app=api-server -n smallest | grep -F "debug-123"
kubectl logs -l app=lightning-asr -n smallest | grep -F "debug-123"
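A fixed ID such as debug-123 matches every earlier test run that used it. Generating a fresh ID per request keeps the grep results specific to one call. A minimal sketch; new_request_id is a hypothetical helper, and the debug- prefix only mirrors the example above.

```shell
# new_request_id: prints an ID of the form debug-<timestamp>-<shell pid>,
# unique enough to tell apart manual test requests from one terminal.
new_request_id() {
  printf 'debug-%s-%s\n' "$(date +%Y%m%dT%H%M%S)" "$$"
}

# Usage (not run here):
#   REQUEST_ID=$(new_request_id)
#   curl ... -H "X-Request-ID: ${REQUEST_ID}" ...
#   kubectl logs -l app=api-server | grep -F "${REQUEST_ID}"
```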

Packet Capture

Capture network traffic:

kubectl exec -it <pod> -- sh -c 'apt-get update && apt-get install -y tcpdump'

kubectl exec -it <pod> -- tcpdump -i any -w /tmp/capture.pcap port 7100

kubectl cp <pod>:/tmp/capture.pcap ./capture.pcap

Analyze with Wireshark or:

tcpdump -r capture.pcap -A

Event Debugging

Watch Events

Real-time events:

kubectl get events -n smallest --watch

Filter by type:

kubectl get events -n smallest --field-selector type=Warning

Sort by timestamp:

kubectl get events -n smallest --sort-by='.lastTimestamp'

Event Analysis

Count events by reason:

kubectl get events -n smallest -o json | jq '.items | group_by(.reason) | map({reason: .[0].reason, count: length})'

Metrics Debugging

Check Prometheus Metrics

Port forward Prometheus:

kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090

Query metrics: Open http://localhost:9090 and run:

asr_active_requests
rate(asr_total_requests[5m])
asr_gpu_utilization

Check Custom Metrics

Verify metrics available to HPA:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

Query specific metric:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests" | jq .

Debugging Checklists

Startup Issues Checklist

Check Image Pull

kubectl describe pod <pod> | grep -A 10 "Events"

Verify Secrets

kubectl get secrets -n smallest
kubectl describe secret <secret-name> -n smallest

Check Resources

kubectl describe node <node> | grep "Allocated resources" -A 10

Review Logs

kubectl logs <pod> --all-containers=true

Performance Issues Checklist

Check Resource Usage

kubectl top pods -n smallest
kubectl top nodes

Verify GPU

kubectl exec <pod> -- nvidia-smi

Check HPA

kubectl get hpa -n smallest
kubectl describe hpa lightning-asr -n smallest

Review Metrics

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

Advanced Techniques

Enable Debug Logging

Increase log verbosity:

lightningAsr:
  env:
    - name: LOG_LEVEL
      value: "DEBUG"

Simulate Failures

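Test error handling by deleting a pod or draining a node. Uncordon the node afterwards so it can schedule pods again:

kubectl delete pod <pod-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl uncordon <node-name>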

Load Testing

Generate load:

kubectl run load-test --rm -it --restart=Never --namespace=smallest \
  --image=williamyeh/hey \
  -- -z 5m -c 50 http://api-server:7100/health

Chaos Engineering

Test resilience (requires Chaos Mesh):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - smallest
    labelSelectors:
      app: lightning-asr
  duration: "30s"
