ASR

Convert speech to text

curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form model=lightning \
  --form language=en \
  --form age_detection=true \
  --form gender_detection=true \
  --form emotion_detection=true \
  --form file=@example-file

{
  "status": "success",
  "transcription": "Hello world.",
  "metadata": {
    "filename": "audio.mp3",
    "duration": 1.7,
    "fileSize": 1000000
  },
  "age": "adult",
  "gender": "male",
  "emotions": {
    "happiness": 0.8,
    "sadness": 0.15,
    "disgust": 0.02,
    "fear": 0.03,
    "anger": 0.05
  }
}

POST /api/v1/speech-to-text

The ASR POST API converts speech to text from an uploaded audio file. The endpoint accepts any audio/* format (mp3, wav, flac, m4a, ogg, and others) and returns the text transcribed by the Lightning ASR model. Set language to en for English-only audio, or to multi to have the model detect the spoken language automatically.

Authentication

This endpoint requires authentication using a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY
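
As a minimal sketch of building this header in client code, the helper below reads the key from an argument or from an environment variable. The variable name SMALLEST_API_KEY and the function name auth_headers are assumptions made for illustration; neither comes from the API itself.

```python
import os
from typing import Dict, Optional


def auth_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Return the Authorization header in the 'Bearer <key>' format."""
    # SMALLEST_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("SMALLEST_API_KEY", "")
    if not key:
        raise ValueError("no API key supplied")
    return {"Authorization": f"Bearer {key}"}
```

Keeping the key out of source code and passing it in at runtime avoids committing credentials by accident.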

Code Examples

curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --form 'model="lightning"' \
  --form 'age_detection="true"' \
  --form 'gender_detection="true"' \
  --form 'emotion_detection="true"' \
  --form 'language="en"' \
  --form 'file=@"/path/to/your/audio.mp3"'
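
The same request can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not an official client: the function names encode_multipart and transcribe are hypothetical, and the sketch sends only the model and language fields. Note that the Content-Type header has to carry the multipart boundary; curl adds it automatically when --form is used.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Dict, Tuple

API_URL = "https://waves-api.smallest.ai/api/v1/speech-to-text"


def encode_multipart(fields: Dict[str, str], file_path: str) -> Tuple[bytes, str]:
    """Encode text fields plus one audio file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    path = Path(file_path)
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    chunks = []
    for name, value in fields.items():
        chunks.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    chunks.append(file_header + path.read_bytes() + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def transcribe(file_path: str, api_key: str, language: str = "en") -> dict:
    """POST the audio file and return the decoded JSON response."""
    body, content_type = encode_multipart(
        {"model": "lightning", "language": language}, file_path
    )
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as transcribe("/path/to/your/audio.mp3", "YOUR_API_KEY") would then return a dictionary shaped like the example response.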

Supported Languages

The Lightning ASR model supports automatic language detection and transcription for the following languages:

Italian (it)
Spanish (es)
Portuguese (pt)
English (en)
German (de)
Hindi (hi)
French (fr)
Russian (ru)
Ukrainian (uk)
Polish (pl)
Dutch (nl)
Slovak (sk)
Czech (cs)
Bulgarian (bg)
Romanian (ro)
Finnish (fi)
Hungarian (hu)
Swedish (sv)
Danish (da)
Estonian (et)
Maltese (mt)
Lithuanian (lt)
Latvian (lv)
Slovenian (sl)

Use en if your audio is strictly English. Use multi if you do not know the language in advance and want the model to detect it automatically from the list above. The codes in the list identify the supported languages; the language parameter itself accepts only en and multi.
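
The rule above can be sketched as a small helper that maps what the caller knows about the audio to a value for the language parameter. The name language_param is hypothetical, and rejecting unsupported codes on the client side is a choice made for this sketch rather than documented API behavior.

```python
from typing import Optional

# The 24 language codes listed in this section.
SUPPORTED_LANGUAGES = frozenset({
    "it", "es", "pt", "en", "de", "hi", "fr", "ru", "uk", "pl", "nl", "sk",
    "cs", "bg", "ro", "fi", "hu", "sv", "da", "et", "mt", "lt", "lv", "sl",
})


def language_param(known_code: Optional[str] = None) -> str:
    """Pick 'en' or 'multi' for the request's language field.

    known_code is the language of the audio if the caller knows it,
    or None when it is unknown.
    """
    if known_code is None:
        return "multi"  # unknown language: let the model detect it
    code = known_code.lower()
    if code not in SUPPORTED_LANGUAGES:
        # Assumption of this sketch: fail early instead of sending the request.
        raise ValueError(f"unsupported language code: {known_code}")
    return "en" if code == "en" else "multi"
```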

Authorizations

Authorization

string

header

required

API key authentication using Bearer token format. Include your API key in the Authorization header as: Bearer YOUR_API_KEY

Body

multipart/form-data

model

enum<string>

required

The ASR model to use for transcription

Available options:

lightning

Example:

"lightning"

file

required

Audio file to transcribe. Supports any audio/* format including mp3, wav, flac, m4a, ogg, and more

language

enum<string>

Language of the audio file. Use 'en' for English-only or 'multi' for multilingual audio.

Available options: en, multi

Example:

"en"

age_detection

enum<string>

Whether to predict the age group of the speaker (infant, teenager, adult, old)

Available options: true, false

Example:

"true"

gender_detection

enum<string>

Whether to predict the gender of the speaker

Available options: true, false

Example:

"true"

emotion_detection

enum<string>

Whether to predict speaker emotions (happiness, sadness, disgust, fear, anger)

Available options: true, false

Example:

"true"

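Because every body field other than file is a string enum, boolean options have to be sent as the strings "true" and "false". The sketch below shows one way to validate and serialize the text fields before sending them; the function name build_form_fields and its keyword names are hypothetical.

```python
from typing import Dict


def build_form_fields(
    language: str = "en",
    *,
    age: bool = False,
    gender: bool = False,
    emotion: bool = False,
) -> Dict[str, str]:
    """Return the non-file form fields as the string values the body expects."""
    if language not in ("en", "multi"):
        raise ValueError("language must be 'en' or 'multi'")

    def flag(value: bool) -> str:
        # The detection options are string enums, not JSON booleans.
        return "true" if value else "false"

    return {
        "model": "lightning",  # the only model option listed
        "language": language,
        "age_detection": flag(age),
        "gender_detection": flag(gender),
        "emotion_detection": flag(emotion),
    }
```

For example, build_form_fields("multi", emotion=True) enables emotion detection only and leaves the other two options set to "false".
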
Response

Speech transcribed successfully

status

string

Status of the transcription request

Example:

"success"

transcription

string

The transcribed text from the audio file

Example:

"Hello world."

audio_length

number

Duration of the audio file in seconds

Example:

1.7

metadata

object

Metadata about the transcription

Child attributes, as they appear in the example response: filename (string), duration (number), fileSize (number)

age

enum<string>

Predicted age group of the speaker if requested

Available options: infant, teenager, adult, old

Example:

"adult"

gender

enum<string>

Predicted gender of the speaker if requested

Available options: male, female

Example:

"male"

emotions

object

Predicted emotions of the speaker if requested

Child attributes, as they appear in the example response: happiness, sadness, disgust, fear, anger (each a number)
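
As a sketch of consuming this response, the snippet below parses the example payload into a small dataclass and picks the highest-scoring emotion. The class name Transcript and its methods are hypothetical, and the optional fields are treated as absent when the matching detection option was not requested, which is an assumption based on the "if requested" wording above.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional

SAMPLE = """{
  "status": "success",
  "transcription": "Hello world.",
  "metadata": {"filename": "audio.mp3", "duration": 1.7, "fileSize": 1000000},
  "age": "adult",
  "gender": "male",
  "emotions": {"happiness": 0.8, "sadness": 0.15, "disgust": 0.02,
               "fear": 0.03, "anger": 0.05}
}"""


@dataclass
class Transcript:
    status: str
    transcription: str
    age: Optional[str] = None
    gender: Optional[str] = None
    emotions: Dict[str, float] = field(default_factory=dict)

    @classmethod
    def from_json(cls, payload: str) -> "Transcript":
        data = json.loads(payload)
        return cls(
            status=data["status"],
            transcription=data["transcription"],
            age=data.get("age"),          # missing unless age detection was on
            gender=data.get("gender"),    # missing unless gender detection was on
            emotions=data.get("emotions") or {},
        )

    def dominant_emotion(self) -> Optional[str]:
        """Return the emotion with the highest score, or None if absent."""
        if not self.emotions:
            return None
        return max(self.emotions, key=self.emotions.get)


result = Transcript.from_json(SAMPLE)
print(result.transcription, result.dominant_emotion())  # Hello world. happiness
```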
