Technical guidelines

Preparation of training data for voice synthesis

Based on the audio recordings and scripts for training the voice, BotTalk's voice experts create a unique voice that matches the audio recording.

Speaker voice

The corporate selects the speaker voice and must obtain the speaker's consent. Please provide BotTalk with the speaker's full name and confirm that we may synthesize the speaker's voice.

Type of training data

In order for us to achieve the best possible quality of the desired custom voice, the training data must be delivered in the following format.

A data set contains audio recordings and a text file with the corresponding transcriptions. Each audio file should contain exactly one utterance (a single sentence or turn of a dialog system).

Technical delivery:

  • Collection of audio files (.zip)

  • Audio recordings as single utterances (.wav)

  • Associated formatted transcript (.txt)

To produce a good voice model, create the recordings in a quiet room with a high-quality microphone. Consistent volume, speaking rate, speaking pitch, and expressive mannerisms of speech are essential.


BotTalk compiles relevant sentences for the transcript in advance and provides it to the corporate. The transcript is based on real news articles.

BotTalk keeps each sentence short enough to fit within the maximum audio length. If a recording contains pronunciation errors, noise, or overly long pauses, please record it again.

If necessary, the speaker can listen to the audio recording and re-record it.

The recordings must match the corresponding transcript exactly. Errors in the transcripts reduce the quality of the trained voice.

Audio files

Each audio file should contain a single utterance (a single sentence or a single turn of a dialog system). All files must be in the same spoken language. Multi-language custom Text-to-Speech voices aren't supported. Each audio file must have a unique filename with the filename extension .wav.

At least 2 hours of audio recording are needed to synthesize a voice.
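The 2-hour minimum can be checked before delivery. The following is a minimal sketch that assumes all recordings sit directly in one folder; the function names and the folder layout are illustrative and not part of BotTalk's tooling.

```python
import wave
from pathlib import Path

MIN_TOTAL_SECONDS = 2 * 60 * 60  # the 2-hour minimum stated above


def total_duration_seconds(folder):
    """Sum the playing time of every .wav file directly inside `folder`."""
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            total += wav.getnframes() / wav.getframerate()
    return total


def has_enough_audio(folder, minimum=MIN_TOTAL_SECONDS):
    """True if the recordings in `folder` add up to at least `minimum` seconds."""
    return total_duration_seconds(folder) >= minimum
```

Calling `has_enough_audio("recordings")` on a hypothetical `recordings` folder returns `False` until the total reaches two hours.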

Follow these guidelines when preparing audio.

File format

RIFF (.wav), grouped into a .zip file

File name

BotTalk provides the file names.

Duplicate file names are not allowed.
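Duplicate names are easy to introduce when recordings are collected from several folders. The sketch below checks for them and then writes the delivery archive. It rests on two assumptions that this guideline does not state: names are compared without regard to letter case, and the archive is flat (no subfolders). All function names are illustrative.

```python
import zipfile
from collections import Counter
from pathlib import Path


def find_duplicate_names(paths):
    """Return file names that occur more than once, ignoring letter case."""
    counts = Counter(Path(p).name.lower() for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


def build_archive(wav_paths, transcript_path, archive_path):
    """Write all recordings plus the transcript into one flat .zip file."""
    duplicates = find_duplicate_names(wav_paths)
    if duplicates:
        raise ValueError(f"duplicate file names: {duplicates}")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in wav_paths:
            # arcname drops the source folder so the archive stays flat
            archive.write(path, arcname=Path(path).name)
        archive.write(transcript_path, arcname=Path(transcript_path).name)
```

Raising an error instead of silently renaming keeps the file names identical to the ones BotTalk provided.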

Sampling rate

A sampling rate of 44,100 Hz (44.1 kHz) is required for creating a custom voice.

Recordings must have no silence at the beginning or the end.

Peaks must not exceed -6 dB.
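Peak level and edge silence can be measured with the Python standard library. This is a rough sketch under stated assumptions: it reads 16-bit PCM only, it interprets the -6 dB limit as relative to digital full scale (dBFS), and it treats samples below about -40 dBFS (`threshold=328`) as silence. The guideline defines neither the reference level nor a silence threshold, so both values are placeholders.

```python
import array
import math
import sys
import wave


def read_samples_16bit(path):
    """Return (samples, frame_rate, channels) for a 16-bit PCM .wav file."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
        rate, channels = wav.getframerate(), wav.getnchannels()
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # .wav sample data is little-endian
    return samples, rate, channels


def peak_dbfs(samples):
    """Peak level relative to 16-bit full scale; -inf for digital silence."""
    peak = max((abs(s) for s in samples), default=0)
    return 20 * math.log10(peak / 32768) if peak else float("-inf")


def edge_silence_seconds(samples, rate, channels, threshold=328):
    """Seconds of near-silence at the start and at the end of the recording."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        # the whole file is below the threshold
        return len(samples) / channels / rate, 0.0
    lead = loud[0] // channels / rate
    trail = (len(samples) - 1 - loud[-1]) // channels / rate
    return lead, trail
```

A recording would pass the level check when `peak_dbfs(samples) <= -6.0`. How much edge silence is tolerable is a judgment call, since even a carefully trimmed file keeps a few milliseconds of room tone.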

Sample format

PCM, at least 16-bit
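The sampling rate and sample format can be verified per file. The sketch below relies on Python's `wave` module, which reads uncompressed PCM files, so a file that opens without error is treated as PCM. The function name and the message wording are illustrative.

```python
import wave

REQUIRED_RATE_HZ = 44_100
MIN_SAMPLE_WIDTH_BYTES = 2  # 16 bits


def format_problems(path):
    """Return a list of human-readable format problems; empty if none found."""
    try:
        with wave.open(str(path), "rb") as wav:
            rate, width = wav.getframerate(), wav.getsampwidth()
    except (wave.Error, EOFError) as exc:
        # raised for non-RIFF data, unsupported encodings, or truncated files
        return [f"not a readable PCM .wav file: {exc}"]
    problems = []
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"sampling rate is {rate} Hz, expected {REQUIRED_RATE_HZ} Hz")
    if width < MIN_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample format is {width * 8}-bit, expected at least 16-bit")
    return problems
```

Returning a list instead of raising on the first problem lets a whole batch be reported in one pass.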

Archive format

.zip
