POST /api/v1/lightning/get_text
Convert speech to text
curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/lightning/get_text \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/octet-stream' \
  --data-binary '@audio.wav'
{
  "status": "success",
  "transcription": "Hello world.",
  "word_timestamps": [
    {
      "word": "Hello",
      "start": 0,
      "end": 0.5
    },
    {
      "word": "world.",
      "start": 0.6,
      "end": 0.9
    }
  ],
  "age": "adult",
  "gender": "male",
  "emotions": {
    "happiness": 0.8,
    "sadness": 0.15,
    "disgust": 0.02,
    "fear": 0.03,
    "anger": 0.05
  },
  "metadata": {
    "filename": "audio.mp3",
    "duration": 1.7,
    "fileSize": 1000000
  }
}
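A minimal Python sketch of working with a decoded response of the shape shown above (the values are the documented example, not live output): it pulls out the transcript, derives per-word durations from the word-level timestamps, and picks the highest-scoring emotion.

```python
import json

# Example response as documented above.
response = json.loads("""
{
  "status": "success",
  "transcription": "Hello world.",
  "word_timestamps": [
    {"word": "Hello", "start": 0, "end": 0.5},
    {"word": "world.", "start": 0.6, "end": 0.9}
  ],
  "emotions": {"happiness": 0.8, "sadness": 0.15, "disgust": 0.02,
               "fear": 0.03, "anger": 0.05}
}
""")

# Per-word durations (seconds) from the word-level timestamps.
durations = {w["word"]: round(w["end"] - w["start"], 3)
             for w in response.get("word_timestamps", [])}

# Highest-scoring emotion, present only when emotion_detection was enabled.
dominant = max(response["emotions"], key=response["emotions"].get)

print(response["transcription"])  # Hello world.
print(durations)                  # {'Hello': 0.5, 'world.': 0.3}
print(dominant)                   # happiness
```

Note that `word_timestamps`, `age`, `gender`, and `emotions` appear only when the corresponding query flags are enabled, so guard for their absence as shown with `.get()`.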
The ASR POST API allows you to convert speech to text using two different input methods:
  1. Raw Audio Bytes (application/octet-stream) - Send raw audio data in the request body; all parameters are passed as query parameters
  2. Audio URL (application/json) - Send a JSON body containing only a URL to an audio file; all other parameters are passed as query parameters
Both methods use our Lightning ASR model with automatic language detection across 30+ languages.

Authentication

This endpoint requires authentication using a Bearer token in the Authorization header:
Authorization: Bearer YOUR_API_KEY

Input Methods

Choose the input method that best fits your use case:
Method      Content Type               Use Case                                    Parameters
Raw Bytes   application/octet-stream   Streaming audio data, real-time processing  Query parameters
Audio URL   application/json           Remote audio files, webhook processing      Query parameters

Code Examples

Method 1: Raw Audio Bytes (application/octet-stream)

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/lightning/get_text?model=lightning&language=en&word_timestamps=true&age_detection=true&gender_detection=true&emotion_detection=true" \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: audio/wav' \
  --data-binary '@/path/to/your/audio.wav'

Method 2: Audio URL (application/json)

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/lightning/get_text?model=lightning&language=en&word_timestamps=true&age_detection=true&gender_detection=true&emotion_detection=true" \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/audio.mp3"
  }'
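The same requests can be sketched in Python. Because every ASR option travels as a query parameter, a small helper that assembles the URL keeps both methods consistent; the helper name and flag handling below are illustrative, not part of the API.

```python
from urllib.parse import urlencode

BASE = "https://waves-api.smallest.ai/api/v1/lightning/get_text"

def build_url(model="lightning", language="en", **flags):
    """Assemble the endpoint URL; all ASR options are query parameters.

    Boolean flags (word_timestamps, age_detection, gender_detection,
    emotion_detection) are serialized as lowercase strings, matching
    the curl examples above.
    """
    params = {"model": model, "language": language}
    params.update({k: str(v).lower() for k, v in flags.items()})
    return f"{BASE}?{urlencode(params)}"

url = build_url(word_timestamps=True, emotion_detection=True)
print(url)
```

The resulting URL is then used with either a raw-bytes body (Content-Type set to the audio format) or a JSON body containing only the `url` field, exactly as in the two curl examples.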

Supported Languages

The Lightning ASR model supports automatic language detection and transcription across 30+ languages. For the full list of supported languages, please check ASR Supported Languages.
Specify the language of the input audio using its ISO 639-1 code. Use multi to enable automatic language detection from the supported list. The default is en (English).
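A small client-side sketch (a hypothetical helper, not part of the API) that validates the language code against the supported set before issuing a request; the fallback to `multi` for unrecognized codes is a design choice of this helper, leaning on the model's automatic detection.

```python
# ISO 639-1 codes listed under the `language` query parameter, plus "multi".
SUPPORTED = {
    "it", "es", "en", "pt", "hi", "de", "fr", "uk", "ru", "kn", "ml",
    "pl", "mr", "gu", "cs", "sk", "te", "or", "nl", "bn", "lv", "et",
    "ro", "pa", "fi", "sv", "bg", "ta", "hu", "da", "lt", "mt", "multi",
}

def normalize_language(code=None):
    """Return a valid `language` value: the lowercased code if supported,
    "en" when omitted (the documented default), otherwise "multi" to let
    the model auto-detect the language."""
    if code is None:
        return "en"
    code = code.lower()
    return code if code in SUPPORTED else "multi"

print(normalize_language("HI"))   # hi
print(normalize_language(None))   # en
print(normalize_language("ja"))   # multi
```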

Authorizations

Authorization
string
header
required

API key authentication using Bearer token format. Include your API key in the Authorization header as: Bearer YOUR_API_KEY

Query Parameters

model
enum<string>
required

The ASR model to use for transcription

Available options:
lightning
Example:

"lightning"

language
enum<string>
default:en

Language of the audio file. Use multi for automatic language detection

Available options:
it,
es,
en,
pt,
hi,
de,
fr,
uk,
ru,
kn,
ml,
pl,
mr,
gu,
cs,
sk,
te,
or,
nl,
bn,
lv,
et,
ro,
pa,
fi,
sv,
bg,
ta,
hu,
da,
lt,
mt,
multi
word_timestamps
boolean
default:false

Whether to include word-level timestamps in the response

age_detection
enum<string>
default:false

Whether to predict age group of the speaker

Available options:
true,
false
gender_detection
enum<string>
default:false

Whether to predict the gender of the speaker

Available options:
true,
false
emotion_detection
enum<string>
default:false

Whether to predict speaker emotions

Available options:
true,
false

Body

Raw audio bytes (for the octet-stream method). The Content-Type header should specify the audio format (e.g., audio/wav, audio/mp3). For the URL method, send a JSON body containing a single url field pointing to the audio file. All other parameters are passed as query parameters.

Response

Speech transcribed successfully

status
string

Status of the transcription request

Example:

"success"

transcription
string

The transcribed text from the audio file

Example:

"Hello world."

audio_length
number

Duration of the audio file in seconds

Example:

1.7

word_timestamps
object[]

Word-level timestamps in seconds.

age
enum<string>

Predicted age group of the speaker, returned when age_detection is enabled

Available options:
infant,
teenager,
adult,
old
Example:

"adult"

gender
enum<string>

Predicted gender of the speaker if requested

Available options:
male,
female
Example:

"male"

emotions
object

Predicted emotions of the speaker if requested

metadata
object

Metadata about the transcription