POST /api/v1/speech-to-text
Convert speech to text
curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form model=lightning \
  --form language=en \
  --form age_detection=true \
  --form gender_detection=true \
  --form emotion_detection=true \
  --form file=@example-file
{
  "status": "success",
  "transcription": "Hello world.",
  "metadata": {
    "filename": "audio.mp3",
    "duration": 1.7,
    "fileSize": 1000000
  },
  "age": "adult",
  "gender": "male",
  "emotions": {
    "happiness": 0.8,
    "sadness": 0.15,
    "disgust": 0.02,
    "fear": 0.03,
    "anger": 0.05
  }
}
The ASR POST API converts speech to text from an uploaded audio file. The endpoint accepts any audio/* format (mp3, wav, flac, m4a, ogg, and more) and returns the text transcribed by our Lightning ASR model. When language is set to multi, the model automatically detects the spoken language from the audio.

Authentication

This endpoint requires authentication using a Bearer token in the Authorization header:
Authorization: Bearer YOUR_API_KEY
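
As a minimal sketch of the header format above, the helper below builds the Authorization header in Python. The function name auth_headers and the environment variable WAVES_API_KEY are assumptions made for illustration; the API itself only defines the Bearer header format.

```python
import os


def auth_headers(api_key=None):
    """Return the Authorization header expected by the endpoint."""
    # WAVES_API_KEY is a hypothetical variable name chosen for this sketch;
    # the API only specifies the "Bearer <key>" header format.
    key = api_key or os.environ.get("WAVES_API_KEY")
    if not key:
        raise ValueError("no API key provided")
    return {"Authorization": f"Bearer {key}"}
```

Reading the key from the environment keeps it out of source control; passing it explicitly works the same way.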

Code Examples

curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/speech-to-text \
  --header 'Authorization: Bearer <token>' \
  --form 'model="lightning"' \
  --form 'age_detection="true"' \
  --form 'gender_detection="true"' \
  --form 'emotion_detection="true"' \
  --form 'language="en"' \
  --form 'file=@"/path/to/your/audio.mp3"'
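
The same request can be sketched in Python using only the standard library. This is an illustrative equivalent of the curl command above, not an official client: the helper names encode_multipart and transcribe are hypothetical, and the MIME type of the file part is guessed from the file extension, which is an assumption about what the server accepts.

```python
import json
import mimetypes
import os
import urllib.request
import uuid

API_URL = "https://waves-api.smallest.ai/api/v1/speech-to-text"


def encode_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one "file" part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        chunks.append((head + f"{value}\r\n").encode("utf-8"))
    # Assumption: guess the part's MIME type from the extension.
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    )
    chunks.append(head.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def transcribe(path, api_key, **fields):
    """POST an audio file and return the decoded JSON response."""
    with open(path, "rb") as f:
        audio = f.read()
    form = {"model": "lightning", **fields}
    body, content_type = encode_multipart(form, os.path.basename(path), audio)
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )
    # This performs a real network call and raises HTTPError on non-2xx replies.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like transcribe("/path/to/your/audio.mp3", "YOUR_API_KEY", language="en", emotion_detection="true"). Note that the Content-Type header must carry the boundary, which is why it is generated together with the body.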

Supported Languages

The Lightning ASR model supports automatic language detection and transcription for the following languages:
  • Italian (it)
  • Spanish (es)
  • Portuguese (pt)
  • English (en)
  • German (de)
  • Hindi (hi)
  • French (fr)
  • Russian (ru)
  • Ukrainian (uk)
  • Polish (pl)
  • Dutch (nl)
  • Slovak (sk)
  • Czech (cs)
  • Bulgarian (bg)
  • Romanian (ro)
  • Finnish (fi)
  • Hungarian (hu)
  • Swedish (sv)
  • Danish (da)
  • Estonian (et)
  • Maltese (mt)
  • Lithuanian (lt)
  • Latvian (lv)
  • Slovenian (sl)
Set the language parameter to en if your audio is strictly English. Set it to multi if you do not know the language in advance and want the model to detect the spoken language automatically from the list above.
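
The choice between en and multi can be sketched as a small helper. The name language_param is hypothetical, and mapping a known non-English language to multi is an inference from the two documented options rather than a documented rule.

```python
# Language codes listed above for the Lightning ASR model.
SUPPORTED_LANGUAGES = {
    "it", "es", "pt", "en", "de", "hi", "fr", "ru", "uk", "pl", "nl", "sk",
    "cs", "bg", "ro", "fi", "hu", "sv", "da", "et", "mt", "lt", "lv", "sl",
}


def language_param(known_code=None):
    """Pick the value of the language form field.

    known_code is the ISO code of the audio language if it is known in
    advance, or None when the language is unknown.
    """
    if known_code is None:
        return "multi"  # let the model detect the language
    code = known_code.lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {known_code}")
    # Assumption: any supported non-English language is sent as "multi",
    # since "en" and "multi" are the only documented options.
    return "en" if code == "en" else "multi"
```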

Authorizations

Authorization
string
header
required

API key authentication using Bearer token format. Include your API key in the Authorization header as: Bearer YOUR_API_KEY

Body

multipart/form-data
model
enum<string>
required

The ASR model to use for transcription

Available options:
lightning
Example:

"lightning"

file
file
required

Audio file to transcribe. Supports any audio/* format including mp3, wav, flac, m4a, ogg, and more

language
enum<string>

Language of the audio file. Use 'en' for English-only or 'multi' for multilingual audio.

Available options:
en,
multi
Example:

"en"

age_detection
enum<string>

Whether to predict the age group of the speaker (e.g., infant, teenager, adult, old)

Available options:
true,
false
Example:

"true"

gender_detection
enum<string>

Whether to predict the gender of the speaker

Available options:
true,
false
Example:

"true"

emotion_detection
enum<string>

Whether to predict speaker emotions (happiness, sadness, disgust, fear, anger)

Available options:
true,
false
Example:

"true"

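Because the detection flags are string enums rather than JSON booleans, a client written in Python has to convert its own booleans to "true" or "false" before sending the form. The sketch below shows one way to do that; build_form_fields is a hypothetical helper, and sending "false" explicitly for disabled flags is a choice made here, since the defaults are not stated above.

```python
def build_form_fields(language="en", age_detection=False,
                      gender_detection=False, emotion_detection=False):
    """Build the text fields of the multipart body (the file is added separately)."""
    if language not in ("en", "multi"):
        raise ValueError("language must be 'en' or 'multi'")

    def flag(value):
        # The API expects the strings "true" / "false", not booleans.
        return "true" if value else "false"

    return {
        "model": "lightning",  # the only documented model option
        "language": language,
        "age_detection": flag(age_detection),
        "gender_detection": flag(gender_detection),
        "emotion_detection": flag(emotion_detection),
    }
```
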
Response

Speech transcribed successfully

status
string

Status of the transcription request

Example:

"success"

transcription
string

The transcribed text from the audio file

Example:

"Hello world."

audio_length
number

Duration of the audio file in seconds

Example:

1.7

metadata
object

Metadata about the transcription

age
enum<string>

Predicted age group of the speaker (e.g., infant, teenager, adult, old)

Available options:
infant,
teenager,
adult,
old
Example:

"adult"

gender
enum<string>

Predicted gender of the speaker if requested

Available options:
male,
female
Example:

"male"

emotions
object

Predicted emotions of the speaker if requested
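
To illustrate how the response fields fit together, the sketch below parses a response body and reports the highest-scoring emotion. The helper name summarize_response is hypothetical, and treating any status other than "success" as a failure is an assumption, since only the success value is documented here.

```python
import json


def summarize_response(response_text):
    """Extract the transcription and optional speaker attributes."""
    data = json.loads(response_text)
    # Assumption: anything other than "success" is treated as an error.
    if data.get("status") != "success":
        raise RuntimeError(f"transcription failed: {data.get('status')!r}")
    # age, gender, and emotions are only present when requested.
    emotions = data.get("emotions") or {}
    top_emotion = max(emotions, key=emotions.get) if emotions else None
    return {
        "text": data.get("transcription", ""),
        "age": data.get("age"),
        "gender": data.get("gender"),
        "top_emotion": top_emotion,
    }
```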
