Overview
The metrics setup enables autoscaling by collecting Lightning ASR metrics with Prometheus and exposing them to the Kubernetes Horizontal Pod Autoscaler (HPA) through the Prometheus Adapter.
Architecture
Lightning ASR pods expose metrics on a /metrics endpoint. A ServiceMonitor tells Prometheus which services to scrape, Prometheus stores the resulting series, and the Prometheus Adapter republishes them through the Kubernetes custom metrics API, where the HPA reads them to make scaling decisions.
Components
Prometheus
Collects and stores metrics from Lightning ASR pods.
Included in the chart:

scaling:
  auto:
    enabled: true

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      serviceMonitorSelectorNilUsesHelmValues: false  # discover ServiceMonitors outside the Helm release
      retention: 7d
      resources:
        requests:
          memory: 2Gi
ServiceMonitor
CRD that tells Prometheus which services to scrape.
Enabled for Lightning ASR:

scaling:
  auto:
    lightningAsr:
      servicemonitor:
        enabled: true
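For reference, the ServiceMonitor rendered by the chart looks roughly like the sketch below; the labels and scrape interval are assumptions based on the describe output shown later in this page, so the real template may differ.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lightning-asr
  namespace: smallest
spec:
  selector:
    matchLabels:
      app: lightning-asr      # matches the Lightning ASR Service labels
  endpoints:
    - port: metrics           # named port on the Service
      path: /metrics
      interval: 30s           # assumption; falls back to the global scrape interval if omitted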
Prometheus Adapter
Converts Prometheus metrics to Kubernetes custom metrics API.
Configuration:

prometheus-adapter:
  prometheus:
    url: http://smallest-prometheus-stack-prometheus.default.svc
    port: 9090
  rules:
    custom:
      - seriesQuery: "asr_active_requests"
        resources:
          overrides:
            namespace: { resource: "namespace" }
            pod: { resource: "pod" }
        name:
          matches: "^(.*)$"
          as: "${1}"
        metricsQuery: "asr_active_requests{<<.LabelMatchers>>}"
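Once the adapter publishes pods/asr_active_requests, an HPA can consume it as a Pods metric. A minimal sketch, assuming a Deployment named lightning-asr; the replica bounds and the target of 10 active requests per pod are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lightning-asr
  namespace: smallest
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lightning-asr
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: asr_active_requests
        target:
          type: AverageValue
          averageValue: "10"   # illustrative: ~10 active requests per pod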
Available Metrics
Lightning ASR exposes the following metrics:
Metric                         Type       Description
asr_active_requests            Gauge      Current number of active transcription requests
asr_total_requests             Counter    Total requests processed
asr_failed_requests            Counter    Total failed requests
asr_request_duration_seconds   Histogram  Request processing time
asr_model_load_time_seconds    Gauge      Time to load model on startup
asr_gpu_utilization            Gauge      GPU utilization percentage
asr_gpu_memory_used_bytes      Gauge      GPU memory used
Verify Metrics Setup
Check Prometheus
Forward Prometheus port:
kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090
Open http://localhost:9090 and verify:
Status → Targets: Lightning ASR endpoints should be “UP”
Graph: the query asr_active_requests should return data
Status → Service Discovery: should list the ServiceMonitor
Check ServiceMonitor
kubectl get servicemonitor -n smallest
Expected output:
NAME AGE
lightning-asr 5m
Describe ServiceMonitor:
kubectl describe servicemonitor lightning-asr -n smallest
Should show:
Spec:
  Endpoints:
    Port:  metrics
    Path:  /metrics
  Selector:
    Match Labels:
      app: lightning-asr
Check Prometheus Adapter
Verify custom metrics are available:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name' | grep asr
Expected output:
pods/asr_active_requests
pods/asr_total_requests
pods/asr_failed_requests
Query specific metric:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/smallest/pods/*/asr_active_requests" | jq .
Custom Metric Configuration
Add New Custom Metrics
To expose additional metrics to HPA:
prometheus-adapter:
  rules:
    custom:
      - seriesQuery: "asr_active_requests"
        resources:
          overrides:
            namespace: { resource: "namespace" }
            pod: { resource: "pod" }
        name:
          matches: "^(.*)$"
          as: "${1}"
        metricsQuery: "asr_active_requests{<<.LabelMatchers>>}"
      - seriesQuery: "asr_gpu_utilization"
        resources:
          overrides:
            namespace: { resource: "namespace" }
            pod: { resource: "pod" }
        name:
          as: "gpu_utilization"
        metricsQuery: "avg_over_time(asr_gpu_utilization{<<.LabelMatchers>>}[2m])"
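The renamed metric can then drive the HPA as a Pods metric; a minimal sketch of the relevant spec fragment, with an illustrative target of 70% average GPU utilization per pod:

spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"   # illustrative: scale out above ~70% average GPU utilization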
External Metrics
For cluster-wide metrics:
prometheus-adapter:
  rules:
    external:
      - seriesQuery: 'kube_deployment_status_replicas{deployment="lightning-asr"}'
        metricsQuery: 'sum(kube_deployment_status_replicas{deployment="lightning-asr"})'
        name:
          as: "lightning_asr_replica_count"
        resources:
          overrides:
            namespace: { resource: "namespace" }
Use in HPA:
spec:
  metrics:
    - type: External
      external:
        metric:
          name: lightning_asr_replica_count
        target:
          type: Value
          value: "5"
Prometheus Configuration
Retention Policy
Configure how long metrics are stored:
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      retention: 15d
      retentionSize: "50GB"
Storage
Persist Prometheus data:
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: gp3
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi
Scrape Interval
Adjust how frequently metrics are collected:
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      scrapeInterval: 30s
      evaluationInterval: 30s
Lower intervals (e.g., 15s) provide faster HPA response but increase storage.
Recording Rules
Pre-compute expensive queries:
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalScrapeConfigs:
        - job_name: 'lightning-asr-aggregated'
          scrape_interval: 15s
          static_configs:
            - targets: ['lightning-asr:2269']
  additionalPrometheusRulesMap:
    asr-rules:
      groups:
        - name: asr_aggregations
          interval: 30s
          rules:
            - record: asr:requests:rate5m
              expr: rate(asr_total_requests[5m])
            - record: asr:requests:active_avg
              expr: avg(asr_active_requests) by (namespace)
            - record: asr:gpu:utilization_avg
              expr: avg(asr_gpu_utilization) by (namespace)
Use recording rules in HPA for better performance.
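For example, a recorded series can be re-exposed through the adapter under an HPA-friendly name. This is a sketch, not part of the chart defaults: the rename and the namespace-only resource mapping are assumptions.

prometheus-adapter:
  rules:
    custom:
      - seriesQuery: 'asr:requests:active_avg'
        resources:
          overrides:
            namespace: { resource: "namespace" }
        name:
          as: "asr_requests_active_avg"   # rename; colons are awkward in HPA metric names
        metricsQuery: 'asr:requests:active_avg{<<.LabelMatchers>>}'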
Alerting Rules
Create alerts for anomalies:
kube-prometheus-stack:
  additionalPrometheusRulesMap:
    asr-alerts:
      groups:
        - name: asr_alerts
          rules:
            - alert: HighErrorRate
              expr: rate(asr_failed_requests[5m]) > 0.1
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "High ASR error rate"
                description: "Error rate is {{ $value }} errors/sec"
            - alert: HighQueueLength
              expr: asr_active_requests > 50
              for: 2m
              labels:
                severity: warning
              annotations:
                summary: "ASR queue backing up"
                description: "{{ $value }} requests queued"
            - alert: GPUMemoryHigh
              expr: asr_gpu_memory_used_bytes / 24000000000 > 0.9   # assumes a 24 GB GPU
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: "GPU memory usage high"
                description: "GPU memory at {{ $value | humanizePercentage }}"
Debugging Metrics
Check Metrics Endpoint
Directly query Lightning ASR metrics:
kubectl port-forward -n smallest svc/lightning-asr 2269:2269
curl http://localhost:2269/metrics
Expected output:
# HELP asr_active_requests Current active requests
# TYPE asr_active_requests gauge
asr_active_requests{pod="lightning-asr-xxx"} 3
# HELP asr_total_requests Total requests processed
# TYPE asr_total_requests counter
asr_total_requests{pod="lightning-asr-xxx"} 1523
...
Test Prometheus Query
Access Prometheus UI and test queries:
asr_active_requests
rate(asr_total_requests[5m])
histogram_quantile(0.95, rate(asr_request_duration_seconds_bucket[5m]))
Check Prometheus Targets
kubectl port-forward -n default svc/smallest-prometheus-stack-prometheus 9090:9090
Navigate to: http://localhost:9090/targets
Verify Lightning ASR targets are “UP”
View Prometheus Logs
kubectl logs -n default -l app.kubernetes.io/name=prometheus --tail=100
Look for scrape errors.
Troubleshooting
Metrics Not Appearing
Check that the ServiceMonitor exists:
kubectl get servicemonitor -n smallest
Check that Prometheus is discovering the target:
kubectl logs -n default -l app.kubernetes.io/name=prometheus | grep lightning-asr
Check that the service exposes the metrics port:
kubectl get svc lightning-asr -n smallest -o yaml
Should show:
ports:
  - name: metrics
    port: 2269
Custom Metrics Not Available
Check the Prometheus Adapter logs:
kubectl logs -n kube-system -l app.kubernetes.io/name=prometheus-adapter
Verify the adapter configuration:
kubectl get configmap prometheus-adapter -n kube-system -o yaml
Test the API manually:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
High Cardinality Issues
If Prometheus is using too much memory:
Reduce label cardinality (see the relabeling sketch below)
Reduce the retention period or set retentionSize
Use recording rules for complex queries
Increase Prometheus resource limits:
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 8Gi
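To cut cardinality at the source, high-cardinality labels can also be dropped at scrape time with metric relabeling on the ServiceMonitor. A sketch, assuming a hypothetical request_id label; whether the chart exposes metricRelabelings through its values is not covered here:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lightning-asr
  namespace: smallest
spec:
  selector:
    matchLabels:
      app: lightning-asr
  endpoints:
    - port: metrics
      path: /metrics
      metricRelabelings:
        - action: labeldrop
          regex: request_id    # hypothetical high-cardinality label to drop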
Best Practices
Use Recording Rules
Pre-compute expensive queries:
  - record: asr:requests:rate5m
    expr: rate(asr_total_requests[5m])
Then reference the recorded metric in the HPA instead of the raw query.
Set Appropriate Scrape Intervals
Balance responsiveness vs storage:
Fast autoscaling: 15s
Normal: 30s
Cost-optimized: 60s
Persist Prometheus Data
Always persist Prometheus data:
storageSpec:
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 100Gi
Monitor Prometheus Itself
Track Prometheus performance; an example self-monitoring sketch follows this list:
Query duration
Scrape duration
Memory usage
TSDB size
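A hedged sketch covering two of these signals, following the additionalPrometheusRulesMap pattern used above; the thresholds are illustrative, not recommendations:

kube-prometheus-stack:
  additionalPrometheusRulesMap:
    prometheus-self:
      groups:
        - name: prometheus_self_monitoring
          rules:
            - alert: PrometheusHighSeriesCount
              # illustrative threshold: ~1M active series usually indicates a cardinality problem
              expr: prometheus_tsdb_head_series > 1000000
              for: 15m
              labels:
                severity: warning
              annotations:
                summary: "Prometheus TSDB series count is high"
            - alert: PrometheusSlowScrapes
              # illustrative threshold: average scrape taking longer than 10 seconds
              expr: avg(scrape_duration_seconds) > 10
              for: 10m
              labels:
                severity: warning
              annotations:
                summary: "Prometheus scrapes are slow"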
Use Grafana for Visualization
Grafana is bundled with kube-prometheus-stack and can chart the ASR metrics and recording rules above on dashboards.
What’s Next?