POST /api/v1/speech-to-text
Convert speech to text
curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form model=lightning \
  --form language=en \
  --form word_timestamps=true \
  --form age_detection=true \
  --form gender_detection=true \
  --form emotion_detection=true \
  --form file=@example-file
{
  "status": "success",
  "transcription": "Hello world.",
  "word_timestamps": [
    {
      "word": "Hello",
      "start": 0,
      "end": 0.5
    },
    {
      "word": "world.",
      "start": 0.6,
      "end": 0.9
    }
  ],
  "age": "adult",
  "gender": "male",
  "emotions": {
    "happiness": 0.8,
    "sadness": 0.15,
    "disgust": 0.02,
    "fear": 0.03,
    "anger": 0.05
  },
  "metadata": {
    "filename": "audio.mp3",
    "duration": 1.7,
    "fileSize": 1000000
  }
}
The ASR POST API converts speech to text from an uploaded audio file. The endpoint accepts any audio/* format (mp3, wav, flac, m4a, ogg, and others) and returns the text transcribed by the Lightning ASR model. Set language to multi to have the model detect the spoken language automatically; if language is omitted, it defaults to en.

Authentication

This endpoint requires authentication using a Bearer token in the Authorization header:
Authorization: Bearer YOUR_API_KEY

Code Examples

curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --form 'model="lightning"' \
  --form 'age_detection="true"' \
  --form 'gender_detection="true"' \
  --form 'emotion_detection="true"' \
  --form 'language="en"' \
  --form 'file=@"/path/to/your/audio.mp3"'
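
The same request can be sketched in Python with only the standard library. This is a minimal illustration rather than an official client: the helper names build_multipart and transcribe are hypothetical, and guessing the file part's MIME type from the file extension is an assumption about what the server accepts. The URL, header, and form field names are the ones shown in the curl example above.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_URL = "https://waves-api.smallest.ai/api/v1/speech-to-text"


def build_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one "file" part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    # Assumption: the part's MIME type is guessed from the file extension.
    mime = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
            f"Content-Type: {mime}\r\n\r\n"
        ).encode()
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, api_key, language="en"):
    """POST an audio file and return the decoded JSON response."""
    audio = Path(path)
    fields = {"model": "lightning", "language": language, "word_timestamps": "true"}
    body, content_type = build_multipart(fields, audio.name, audio.read_bytes())
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as transcribe("/path/to/your/audio.mp3", "YOUR_API_KEY") would then return the parsed response as a dictionary. The Content-Type header carries the generated boundary, which is why it is built together with the body instead of being set by hand.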

Supported Languages

The Lightning ASR model supports transcription and automatic language detection across 30+ languages. For the full list, see ASR Supported Languages.
Specify the language of the input audio with its ISO 639-1 code, or use multi to enable automatic language detection among the supported languages. The default is en (English).
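
As an illustration of the rule above, a client could validate the language value before sending a request. The sketch below is hypothetical (resolve_language is not part of the API); the set of codes is copied from the options listed for the language parameter on this page, and the fallback to en mirrors the documented default.

```python
# Codes copied from the "language" parameter options on this page.
SUPPORTED_LANGUAGES = frozenset({
    "it", "es", "en", "pt", "hi", "de", "fr", "uk", "ru", "kn", "ml",
    "pl", "mr", "gu", "cs", "sk", "te", "or", "nl", "bn", "lv", "et",
    "ro", "pa", "fi", "sv", "bg", "ta", "hu", "da", "lt", "mt",
    "multi",  # enables automatic language detection
})
DEFAULT_LANGUAGE = "en"


def resolve_language(code=None):
    """Return a normalized language code, or the default when none is given."""
    if code is None or not code.strip():
        return DEFAULT_LANGUAGE
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized
```

Rejecting unknown codes on the client side is a design choice for this sketch; how the server itself responds to an unsupported code is not stated here.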

Authorizations

Authorization
string
header
required

API key authentication using Bearer token format. Include your API key in the Authorization header as: Bearer YOUR_API_KEY

Body

multipart/form-data
model
enum<string>
required

The ASR model to use for transcription

Available options:
lightning
Example:

"lightning"

file
file
required

Audio file to transcribe. Supports any audio/* format, including mp3, wav, flac, m4a, and ogg.

language
enum<string>

Language of the audio file, given as an ISO 639-1 code. Use multi for automatic language detection. Default is en.

Available options:
it,
es,
en,
pt,
hi,
de,
fr,
uk,
ru,
kn,
ml,
pl,
mr,
gu,
cs,
sk,
te,
or,
nl,
bn,
lv,
et,
ro,
pa,
fi,
sv,
bg,
ta,
hu,
da,
lt,
mt,
multi
Example:

"en"

word_timestamps
boolean

Whether to include word-level timestamps in the response

Example:

true

age_detection
enum<string>

Whether to predict the age group of the speaker (infant, teenager, adult, old)

Available options:
true,
false
Example:

"true"

gender_detection
enum<string>

Whether to predict the gender of the speaker

Available options:
true,
false
Example:

"true"

emotion_detection
enum<string>

Whether to predict speaker emotions (happiness, sadness, disgust, fear, anger)

Available options:
true,
false
Example:

"true"

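Because the detection parameters are string enums ("true" / "false") rather than native booleans, a client written in a typed language has to convert its flags before building the form. A minimal sketch, with the hypothetical helper name detection_fields and the field names taken from the parameter list above:

```python
def detection_fields(age=False, gender=False, emotion=False, word_timestamps=False):
    """Map Python booleans to the "true"/"false" strings sent as form fields."""
    flags = {
        "age_detection": age,
        "gender_detection": gender,
        "emotion_detection": emotion,
        "word_timestamps": word_timestamps,
    }
    # Multipart form values are plain text, so every flag becomes a string.
    return {name: "true" if enabled else "false" for name, enabled in flags.items()}
```

For example, detection_fields(emotion=True) yields a dictionary in which emotion_detection is "true" and the other three fields are "false". Sending "false" explicitly instead of omitting the field is an assumption of this sketch.
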
Response

Speech transcribed successfully

status
string

Status of the transcription request

Example:

"success"

transcription
string

The transcribed text from the audio file

Example:

"Hello world."

audio_length
number

Duration of the audio file in seconds

Example:

1.7

word_timestamps
object[]

Word-level timestamps in seconds.

age
enum<string>

Predicted age group of the speaker if requested

Available options:
infant,
teenager,
adult,
old
Example:

"adult"

gender
enum<string>

Predicted gender of the speaker if requested

Available options:
male,
female
Example:

"male"

emotions
object

Predicted emotions of the speaker if requested

metadata
object

Metadata about the transcription
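
To show how these fields might be consumed, the sketch below works on the example response from the top of this page. The helper names dominant_emotion and group_into_phrases are hypothetical, and the 0.3-second pause threshold is an arbitrary illustrative value, not something the API defines. Optional fields are read defensively, since they appear only when the matching detection option was requested.

```python
SAMPLE_RESPONSE = {
    "status": "success",
    "transcription": "Hello world.",
    "word_timestamps": [
        {"word": "Hello", "start": 0, "end": 0.5},
        {"word": "world.", "start": 0.6, "end": 0.9},
    ],
    "emotions": {
        "happiness": 0.8,
        "sadness": 0.15,
        "disgust": 0.02,
        "fear": 0.03,
        "anger": 0.05,
    },
}


def dominant_emotion(response):
    """Return the highest-scoring emotion label, or None if emotions are absent."""
    emotions = response.get("emotions") or {}
    if not emotions:
        return None
    return max(emotions, key=emotions.get)


def group_into_phrases(words, max_gap=0.3):
    """Join consecutive words, starting a new phrase after a pause longer than max_gap seconds."""
    phrases = []
    current = []
    previous_end = None
    for item in words:
        if current and item["start"] - previous_end > max_gap:
            phrases.append(" ".join(current))
            current = []
        current.append(item["word"])
        previous_end = item["end"]
    if current:
        phrases.append(" ".join(current))
    return phrases
```

With the sample data, dominant_emotion picks the label with the largest score, and group_into_phrases keeps both words in one phrase because the pause between them is shorter than the default threshold.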