ASR

Convert speech to text

curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form model=lightning \
  --form language=en \
  --form word_timestamps=true \
  --form age_detection=true \
  --form gender_detection=true \
  --form emotion_detection=true \
  --form file=@example-file

{
  "status": "success",
  "transcription": "Hello world.",
  "word_timestamps": [
    {
      "word": "Hello",
      "start": 0,
      "end": 0.5
    },
    {
      "word": "world.",
      "start": 0.6,
      "end": 0.9
    }
  ],
  "age": "adult",
  "gender": "male",
  "emotions": {
    "happiness": 0.8,
    "sadness": 0.15,
    "disgust": 0.02,
    "fear": 0.03,
    "anger": 0.05
  },
  "metadata": {
    "filename": "audio.mp3",
    "duration": 1.7,
    "fileSize": 1000000
  }
}

POST https://waves-api.smallest.ai/api/v1/speech-to-text

The ASR POST API converts speech to text from an uploaded audio file. The endpoint accepts any standard audio format and returns the transcription produced by our Lightning ASR model. The audio is transcribed in the language given by the language parameter (default en); set language to multi to have the model detect the spoken language automatically.

Authentication

This endpoint requires authentication using a Bearer token in the Authorization header:

Authorization: Bearer YOUR_API_KEY
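
As a minimal Python sketch of building this header, the helper below reads the key from an environment variable. The variable name SMALLEST_API_KEY and the function name auth_headers are assumptions made for illustration; the API itself only defines the Bearer format shown above.

```python
import os


def auth_headers(env_var: str = "SMALLEST_API_KEY") -> dict:
    """Build the Authorization header from an API key held in the environment.

    The environment variable name is an assumption for this sketch,
    not something the API defines.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your API key first")
    return {"Authorization": f"Bearer {api_key}"}
```

Keeping the key out of source code this way avoids committing it by accident.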

Code Examples

curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --form 'model="lightning"' \
  --form 'age_detection="true"' \
  --form 'gender_detection="true"' \
  --form 'emotion_detection="true"' \
  --form 'language="en"' \
  --form 'file=@"/path/to/your/audio.mp3"'
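
The same request can be sketched in Python using only the standard library. This is an illustrative equivalent of the curl call, not an official client: the function names encode_multipart and transcribe are hypothetical, and guessing the file part's content type from the filename is an assumption about what the server expects.

```python
import json
import mimetypes
import os
import urllib.request
import uuid

API_URL = "https://waves-api.smallest.ai/api/v1/speech-to-text"


def encode_multipart(fields: dict, filename: str, audio: bytes) -> tuple:
    """Encode text fields plus one file part named "file" as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # Assumption: an audio/* content type on the file part is helpful.
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: {mime}\r\n\r\n'
    ).encode()
    parts.append(header + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, api_key: str, **fields: str) -> dict:
    """POST an audio file to the endpoint and return the decoded JSON response."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = encode_multipart(
        {"model": "lightning", **fields}, os.path.basename(path), audio
    )
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call might look like transcribe("/path/to/your/audio.mp3", api_key, language="en", emotion_detection="true").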

Supported Languages

The Lightning ASR model supports automatic language detection and transcription across 30+ languages. For the full list of supported languages, please check ASR Supported Languages.

Specify the language of the input audio using its ISO 639-1 code.
Use multi to enable automatic language detection from the supported list. The default is en (English).
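
A client may want to check the language value before uploading audio. The sketch below assumes the codes listed under the language parameter in the Body section are the complete set; the helper name resolve_language is hypothetical, and normalizing case and whitespace is a convenience of this sketch rather than documented API behavior.

```python
from typing import Optional

# ISO 639-1 codes taken from the language parameter's available options.
SUPPORTED_LANGUAGES = frozenset({
    "it", "es", "en", "pt", "hi", "de", "fr", "uk", "ru", "kn", "ml",
    "pl", "mr", "gu", "cs", "sk", "te", "or", "nl", "bn", "lv", "et",
    "ro", "pa", "fi", "sv", "bg", "ta", "hu", "da", "lt", "mt",
})
AUTO_DETECT = "multi"      # enables automatic language detection
DEFAULT_LANGUAGE = "en"    # documented default


def resolve_language(code: Optional[str] = None) -> str:
    """Return a value suitable for the language form field, or raise ValueError."""
    if code is None:
        return DEFAULT_LANGUAGE
    normalized = code.strip().lower()
    if normalized == AUTO_DETECT or normalized in SUPPORTED_LANGUAGES:
        return normalized
    raise ValueError(f"Unsupported language code: {code!r}")
```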

Authorizations

Authorization

string

header

required

API key authentication using Bearer token format. Include your API key in the Authorization header as: Bearer YOUR_API_KEY

Body

multipart/form-data

model

enum<string>

required

The ASR model to use for transcription

Available options:

lightning

Example:

"lightning"

file

required

Audio file to transcribe. Supports any audio/* format including mp3, wav, flac, m4a, ogg, and more

language

enum<string>

Language of the audio file. Use multi for automatic language detection. Language follows the ISO 639-1 code standard. Default is en.

Available options: it, es, en, pt, hi, de, fr, uk, ru, kn, ml, pl, mr, gu, cs, sk, te, or, nl, bn, lv, et, ro, pa, fi, sv, bg, ta, hu, da, lt, mt, multi

Example:

"en"

word_timestamps

boolean

Whether to include word-level timestamps in the response

Example:

true

age_detection

enum<string>

Whether to predict the age group of the speaker (infant, teenager, adult, old)

Available options: true, false

Example:

"true"

gender_detection

enum<string>

Whether to predict the gender of the speaker

Available options: true, false

Example:

"true"

emotion_detection

enum<string>

Whether to predict speaker emotions (happiness, sadness, disgust, fear, anger)

Available options: true, false

Example:

"true"
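
Because the detection flags are string enums ("true" or "false") rather than native booleans, a client written in Python has to convert its own bool values before sending the form. The following is a small sketch of that conversion; the helper name build_form_fields is hypothetical, and sending "false" explicitly instead of omitting a field is a choice of this sketch.

```python
def build_form_fields(
    language: str = "en",
    word_timestamps: bool = False,
    age_detection: bool = False,
    gender_detection: bool = False,
    emotion_detection: bool = False,
) -> dict:
    """Return the text fields of the request body as strings."""

    def flag(value: bool) -> str:
        # Form fields travel as text, so booleans become "true" / "false".
        return "true" if value else "false"

    return {
        "model": "lightning",  # the only model option listed
        "language": language,
        "word_timestamps": flag(word_timestamps),
        "age_detection": flag(age_detection),
        "gender_detection": flag(gender_detection),
        "emotion_detection": flag(emotion_detection),
    }
```

The resulting dictionary holds only the text fields; the audio itself still goes in the separate file part.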

Response

Speech transcribed successfully

status

string

Status of the transcription request

Example:

"success"

transcription

string

The transcribed text from the audio file

Example:

"Hello world."

audio_length

number

Duration of the audio file in seconds

Example:

1.7

word_timestamps

object[]

Word-level timestamps in seconds.

Each object contains word, start, and end.

age

enum<string>

Predicted age group of the speaker (e.g., infant, teenager, adult, old)

Available options: infant, teenager, adult, old

Example:

"adult"

gender

enum<string>

Predicted gender of the speaker if requested

Available options: male, female

Example:

"male"

emotions

object

Predicted emotions of the speaker if requested

Contains a score for each of happiness, sadness, disgust, fear, and anger.

metadata

object

Metadata about the transcription

Contains filename, duration, and fileSize.
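
Several response fields appear only when the matching option was requested, so client code should treat them as optional. The sketch below reads a decoded response defensively; the function name summarize_response and the shape of its return value are illustrative choices, not part of the API.

```python
def summarize_response(response: dict) -> dict:
    """Pull the commonly used fields out of a decoded transcription response."""
    # emotions is only present when emotion_detection was requested.
    emotions = response.get("emotions") or {}
    dominant = max(emotions, key=emotions.get) if emotions else None

    # word_timestamps is only present when it was requested.
    words = [
        (entry["word"], entry["start"], entry["end"])
        for entry in response.get("word_timestamps") or []
    ]

    return {
        "text": response.get("transcription", ""),
        "dominant_emotion": dominant,
        "words": words,
        "age": response.get("age"),
        "gender": response.get("gender"),
    }
```

Applied to the example response at the top of this page, dominant_emotion would be "happiness", since 0.8 is the highest score.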
