To achieve the best results when cloning your voice, it’s essential to provide high-quality reference audio. Below are some best practices, dos and don’ts, and examples to guide you.

Ready to clone your voice? Try it out on our platform: waves.smallest.ai


🎙️ How to Record Reference Audio

  1. Environment

    • Record in a quiet room with minimal background noise.
    • Use a good-quality microphone. Dedicated mics are ideal, but MacBook microphones also work well for this purpose.
    • Mobile and laptop recordings can work well too, as long as the device is placed at a moderate distance, neither too far nor too close, so the sound stays clear and natural without distortion.
    • Make sure the recording environment doesn’t introduce echo or distortion (e.g., avoid large empty rooms or outdoor spaces).
    • After uploading the audio, listen to it to ensure it is clear and free of interruptions, background noise, or distortion.
  2. Speaking Style

    • Speak naturally and avoid excessive emotion unless a specific tone is required.
    • Maintain a consistent pace and tone throughout the recording. Be mindful of long pauses, as they can impact the quality of the cloned voice.
  3. Length of Audio

    • Provide between 5 and 15 seconds of clean audio.
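
The length and pause guidelines above can be checked locally before uploading. The following is a minimal sketch, not part of the platform: it assumes the recording is a 16-bit PCM WAV file, and the function name `check_reference_audio`, the silence level, and the one-second pause limit are illustrative assumptions rather than documented thresholds.

```python
import array
import sys
import wave

MIN_SECONDS = 5.0         # lower bound from the length guideline
MAX_SECONDS = 15.0        # upper bound from the length guideline
SILENCE_LEVEL = 0.02      # assumption: below 2% of full scale counts as silence
MAX_PAUSE_SECONDS = 1.0   # assumption: pauses longer than this are "long"


def check_reference_audio(path):
    """Return a list of warnings for a 16-bit PCM WAV file (empty if none)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    mono = samples[::channels]  # inspect the first channel only

    warnings = []
    duration = len(mono) / rate
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        warnings.append(
            f"duration {duration:.1f}s is outside {MIN_SECONDS:g}-{MAX_SECONDS:g}s"
        )

    # Longest run of consecutive near-silent samples, converted to seconds.
    threshold = SILENCE_LEVEL * 32768
    longest = current = 0
    for sample in mono:
        current = current + 1 if abs(sample) < threshold else 0
        longest = max(longest, current)
    if longest / rate > MAX_PAUSE_SECONDS:
        warnings.append(
            f"pause of {longest / rate:.1f}s exceeds {MAX_PAUSE_SECONDS:g}s"
        )

    # Samples at the 16-bit limit usually mean the input was too loud.
    if any(abs(sample) >= 32767 for sample in mono):
        warnings.append("samples reach full scale (possible clipping)")
    return warnings
```

Calling `check_reference_audio("reference.wav")` (a hypothetical file name) returns an empty list when none of the three checks fire. Listening to the recording is still the better test for echo and background noise, which this sketch does not detect.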

🎧 Examples of Good and Bad Reference Audio

NOTE: Currently, there is no direct support for adding audio to Mintlify. As a workaround, we have embedded a video to include the necessary audio content.

Good Reference Audio

  • High-quality, clear, and consistent tone.

Bad Reference Audio

  1. With Background Noise

  2. Inconsistent Speaking Style

  3. Overlapping Voices


🎭 Creating Expressive Voice Clones

Our platform supports emotional reference audio, meaning the emotions, pitch or tone in the reference audio will influence the output. This is ideal for creating expressive clones that match your intended tone.

😄 Emotional Control

  • The emotions in the reference audio (e.g., angry, happy, sad) directly impact the tone of the generated voice.
  • For example, if the reference audio conveys happiness, the output will replicate that cheerful tone.

⚡ Speed Control

  • The pace of your reference audio determines the speed of the output.
  • A fast-paced reference will generate a similarly fast delivery, while a slower reference will produce a more measured response.

🔊 Loudness Control

  • The loudness or volume in your reference audio is reflected in the output.
  • For instance, a soft-spoken input will result in a quieter clone, while a louder, more energetic recording will produce a bolder output.
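
Because loudness carries over to the clone, it can help to compare the level of two takes before choosing one. The sketch below is an illustration only: `rms_dbfs` and `tone` are hypothetical helpers, the samples are assumed to be floats normalized to the range -1.0 to 1.0, and synthetic sine tones stand in for real recordings.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale, for samples in [-1.0, 1.0]."""
    if not samples:
        raise ValueError("no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # pure digital silence
    return 10 * math.log10(mean_square)


def tone(amplitude, n=8000, rate=8000, freq=200):
    """Synthetic stand-in for a recording: one second of a sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


loud = rms_dbfs(tone(1.0))   # energetic take
soft = rms_dbfs(tone(0.5))   # soft-spoken take at half the amplitude
print(f"loud take: {loud:.1f} dBFS, soft take: {soft:.1f} dBFS")
# prints "loud take: -3.0 dBFS, soft take: -9.0 dBFS"
```

Halving the amplitude lowers the level by about 6 dB, so a difference of that size between two takes is clearly audible and likely to show up in the output.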

🎧 Emotional Reference Audio Examples

Angry Tone

  • Reference Audio Sample:

  • Output Audio Example:

Quiet Tone

  • Reference Audio Sample:

  • Output Audio Example:

Fast-Paced Tone

  • Reference Audio Sample:

  • Output Audio Example:


By following these guidelines and leveraging emotional reference audio, you can achieve highly accurate and expressive voice clones tailored to your needs.