
Training your own voice

This is a short, practical tutorial on using Piper TTS to train a neural network on your own voice.

Introduction

Piper TTS is an open-source, neural text-to-speech engine focused on self-hosted projects, developed by the Open Home Foundation: https://github.com/OHF-Voice/piper1-gpl/tree/main?tab=readme-ov-file

Key Features

  • Fully Offline: No data leaves your machine. Once you download the voice models, you don’t need an internet connection.

  • Incredible Speed: It uses the ONNX (Open Neural Network Exchange) runtime, making it much faster than many other AI-based TTS systems like Tortoise or XTTS.

  • Low Requirements: It is optimized for low-power ARM devices (like the Raspberry Pi or Jetson Nano) but works great on Windows, Linux, and macOS.

  • Wide Language Support: It supports over 40 languages and hundreds of voices, ranging from English and Spanish to rarer inclusions like Swahili, Kazakh, and Welsh.

Voice Quality Levels

Piper offers voices in four tiers so you can balance quality vs. performance:

  • x_low: Tiny models (5-7M parameters), extremely fast, 16kHz audio.

  • low: Small and fast with decent quality.

  • medium: The “sweet spot” for most users; 22kHz audio with natural-sounding intonation.

  • high: Largest models (28-32M parameters) with the best clarity, but requires more CPU power.

Use Cases

Because of its speed and privacy, it is the primary engine used in:

  • Home Assistant: It’s the default for the “Year of the Voice” local voice assistant project.

  • Accessibility Tools: Screen readers or plugins (like the Piper VS Code extension) that need instant feedback.

  • E-book Readers: People often use it to turn EPUBs into audiobooks locally without paying for cloud API credits.

Training your own Voice

Following the official documentation, to train your own voice you must first install the dependencies on Linux or on Windows under WSL.

Installation & Environment Setup

Install system dependencies:

sudo apt-get install build-essential cmake ninja-build espeak-ng

Clone and install Piper:

git clone https://github.com/OHF-voice/piper1-gpl.git 
cd piper1-gpl 
python3 -m venv .venv 
source .venv/bin/activate 
python3 -m pip install -e .[train]

./build_monotonic_align.sh 
python3 setup.py build_ext --inplace

Troubleshooting

Error on build_monotonic_align.sh

If you receive an error like “Exception check on ‘maximum_path_each’ will always require the GIL” when running the script, you need to modify the file /piper1-gpl/src/piper/train/vits/monotonic_align/core.pyx.

$ ./build_monotonic_align.sh

Compiling /mnt/d/Projects/piper/piper1-gpl/src/piper/train/vits/monotonic_align/core.pyx because it changed.
[1/1] Cythonizing /mnt/d/Projects/piper/piper1-gpl/src/piper/train/vits/monotonic_align/core.pyx
performance hint: core.pyx:5:0: Exception check on 'maximum_path_each' will always require the GIL to be acquired.
Possible solutions:
        1. Declare 'maximum_path_each' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on 'maximum_path_each' to allow an error code to be returned.
performance hint: core.pyx:36:0: Exception check on 'maximum_path_c' will always require the GIL to be acquired.
Possible solutions:
        1. Declare 'maximum_path_c' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on 'maximum_path_c' to allow an error code to be returned.
performance hint: core.pyx:42:21: Exception check after calling 'maximum_path_each' will always require the GIL to be acquired.
Possible solutions:
        1. Declare 'maximum_path_each' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on 'maximum_path_each' to allow an error code to be returned.

Edit /piper1-gpl/src/piper/train/vits/monotonic_align/core.pyx:

vi /piper1-gpl/src/piper/train/vits/monotonic_align/core.pyx

To the code snippet

cdef void maximum_path_each(...)

add

cdef void maximum_path_each(...) noexcept:

and to the snippet

cpdef void maximum_path_c(...)

add

cpdef void maximum_path_c(...) noexcept:

The final code:

cimport cython
from cython.parallel import prange


@cython.boundscheck(False)
@cython.wraparound(False)
cdef void maximum_path_each(int[:,::1] path, float[:,::1] value, int t_y, int t_x, float max_neg_val=-1e9) noexcept nogil:
  cdef int x
  cdef int y
  cdef float v_prev
  cdef float v_cur
  cdef float tmp
  cdef int index = t_x - 1

  for y in range(t_y):
    for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
      if x == y:
        v_cur = max_neg_val
      else:
        v_cur = value[y-1, x]
      if x == 0:
        if y == 0:
          v_prev = 0.
        else:
          v_prev = max_neg_val
      else:
        v_prev = value[y-1, x-1]
      value[y, x] += max(v_prev, v_cur)

  for y in range(t_y - 1, -1, -1):
    path[y, index] = 1
    if index != 0 and (index == y or value[y-1, index] < value[y-1, index-1]):
      index = index - 1


@cython.boundscheck(False)
@cython.wraparound(False)
cpdef void maximum_path_c(int[:,:,::1] paths, float[:,:,::1] values, int[::1] t_ys, int[::1] t_xs) noexcept nogil:
  cdef int b = paths.shape[0]
  cdef int i
  for i in prange(b, nogil=True):
    maximum_path_each(paths[i], values[i], t_ys[i], t_xs[i])

Prepare your Dataset (I use an Italian dataset)

A. Audio Files (.wav)

  • Record yourself speaking in a quiet room (no echo/noise).

  • Save files as WAV.

  • Ideally, convert them to 22050 Hz, 16-bit, Mono (Piper can process others, but this is the target format).

  • Put them all in one folder, e.g., /home/user/my-voice-dataset/wavs/.
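
If you want to double-check the format before training, a small script can scan the folder. This is a minimal sketch using only the Python standard library; the dataset path is just an example:

```python
import wave
from pathlib import Path

def check_wav(path):
    """Return a list of problems with this WAV file (empty if it is 22050 Hz, 16-bit, mono)."""
    problems = []
    with wave.open(str(path), "rb") as w:
        if w.getframerate() != 22050:
            problems.append(f"sample rate is {w.getframerate()} Hz, expected 22050")
        if w.getsampwidth() != 2:
            problems.append(f"sample width is {w.getsampwidth() * 8}-bit, expected 16-bit")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
    return problems

if __name__ == "__main__":
    # Report every file that does not match the target format.
    for wav in sorted(Path("/home/user/my-voice-dataset/wavs").glob("*.wav")):
        issues = check_wav(wav)
        if issues:
            print(f"{wav.name}: {'; '.join(issues)}")
```

Any file listed in the output should be re-exported (e.g. from Audacity) in the target format.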

B. The Metadata (metadata.csv)

Create a file named metadata.csv. It must use the pipe symbol | as a separator.

  • Format: filename.wav|Italian text here

C. Dataset Example

001.wav|Ciao, questa è la mia nuova voce sintetica creata con l'intelligenza artificiale.
002.wav|Oggi è una giornata meravigliosa per imparare qualcosa di nuovo.
003.wav|Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura.
004.wav|Per favore, portami un bicchiere d'acqua fresca e frizzante.
005.wav|Il gatto dorme tranquillamente sul divano rosso del salotto.
006.wav|Ieri sera ho guardato un film molto interessante alla televisione.
007.wav|La pizza napoletana è famosa in tutto il mondo per il suo sapore unico.
008.wav|Vorrei prenotare un tavolo per due persone per stasera alle otto.
009.wav|Attenzione, il treno regionale è in arrivo al binario tre.
010.wav|Non credo che sia una buona idea uscire con questo tempaccio.
011.wav|Scusi, sa dirmi dov'è la farmacia più vicina da qui?
012.wav|Ho comprato delle mele, delle pere, delle banane e un po' di uva.
013.wav|Sopra la panca la capra campa, sotto la panca la capra crepa.
014.wav|È incredibile quanto velocemente cambi la tecnologia al giorno d'oggi.
015.wav|Mi piace molto ascoltare la musica classica quando lavoro al computer.
016.wav|L'estate scorsa siamo andati in vacanza in Sardegna e il mare era stupendo.
017.wav|Devo ricordarmi di pagare le bollette prima della fine del mese.
018.wav|Il cane abbaiava forte perché aveva visto un gatto passare in giardino.
019.wav|Potresti passarmi il sale e il pepe, per favore?
020.wav|La felicità non è una destinazione, ma un modo di viaggiare.
021.wav|Hai visto le chiavi della macchina? Non riesco a trovarle da nessuna parte.
022.wav|Domani mattina devo svegliarmi molto presto per prendere l'aereo.
023.wav|Il cielo è azzurro e non c'è nemmeno una nuvola all'orizzonte.
024.wav|Preferisco il caffè espresso, corto e senza zucchero.
025.wav|Guglielmo ha deciso di imparare a suonare la chitarra elettrica.
026.wav|La scienza e l'arte sono due facce della stessa medaglia umana.
027.wav|Non c'è peggior sordo di chi non vuol sentire.
028.wav|Ho bisogno di un consiglio sincero su questa situazione complicata.
029.wav|Il traffico in centro a Milano è sempre molto intenso durante la settimana.
030.wav|Grazie per aver ascoltato, spero che questa voce vi piaccia molto.
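
Before training, it is worth validating metadata.csv against the wavs folder. The following is a hedged sketch; the two-column filename|text layout follows the format above, and the paths are examples:

```python
import csv
from pathlib import Path

def validate_metadata(csv_path, wav_dir):
    """Yield (line_number, message) for rows that do not match 'file.wav|text'."""
    wav_dir = Path(wav_dir)
    with open(csv_path, encoding="utf-8", newline="") as f:
        for lineno, row in enumerate(csv.reader(f, delimiter="|"), start=1):
            if len(row) != 2:
                yield lineno, f"expected 2 fields separated by '|', got {len(row)}"
            elif not (wav_dir / row[0]).is_file():
                yield lineno, f"missing audio file: {row[0]}"
            elif not row[1].strip():
                yield lineno, "empty transcription"
```

Call it with your dataset paths, e.g. validate_metadata("/home/user/my-voice-dataset/metadata.csv", "/home/user/my-voice-dataset/wavs"), and fix every row it reports before launching the training.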

Recording Tips for Best Quality:

  • Consistency: Try to record everything in one session so your voice tone and distance from the microphone stay the same.

  • Silence: Leave 0.5 seconds of silence at the start and end of every recording.

  • Format: If possible, set your recording software (like Audacity) to save as Mono, 22050Hz, 16-bit WAV.

  • Performance: Read naturally! If you read like a robot, the AI will sound like a robot. If you read with emotion, the AI will learn that emotion.
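
To check the “0.5 seconds of silence” tip programmatically, you can measure the quiet lead-in of each clip. A rough sketch for 16-bit mono WAV files; the amplitude threshold is an arbitrary assumption you may need to tune for your noise floor:

```python
import struct
import wave

def leading_silence_seconds(path, threshold=500):
    """Seconds before the first sample whose amplitude reaches the threshold (16-bit mono WAV)."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        n = w.getnframes()
        # For mono 16-bit audio, one frame is one signed 16-bit sample.
        samples = struct.unpack(f"<{n}h", w.readframes(n))
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return i / rate
    return n / rate  # the whole file stayed below the threshold
```

Values well below 0.5 suggest the clip needs re-trimming; applying the same loop to reversed(samples) measures the trailing silence.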

Download a Base Checkpoint

The documentation highly recommends using an existing checkpoint (“finetuning”) rather than training from zero. This makes the process much faster and requires less data (15-30 minutes of audio is often enough for a decent result).

  1. Go to the Piper Checkpoints Hugging Face page.

  2. Download a medium quality checkpoint *.ckpt.

    • Tip: If you can find an Italian checkpoint (like it_IT-riccardo-medium.ckpt), that is best.

    • If not, an English one (like en_US-lessac-medium.ckpt) will still work; the model will just “learn” the Italian accent from your data.

  3. Place it somewhere accessible, e.g., /home/user/piper-checkpoints/epoch=2000.ckpt.

Note on older Piper versions: According to official documentation, only medium quality checkpoints are supported without tweaking other settings. Many checkpoints on Hugging Face might not work with the current version of Piper and require specific configuration changes (refer to config.py in the source code).

Run the Training Command

This is the command, customized for your Italian voice.

  • --data.espeak_voice "it": This tells the system to pronounce the text using Italian rules.

  • Paths: Update the /path/to/... placeholders with your actual file locations.

python3 -m piper.train fit \
  --data.voice_name "my_italian_voice" \
  --data.csv_path /home/user/my-voice-dataset/metadata.csv \
  --data.audio_dir /home/user/my-voice-dataset/wavs/ \
  --model.sample_rate 22050 \
  --data.espeak_voice "it" \
  --data.cache_dir /home/user/my-voice-dataset/cache/ \
  --data.config_path /home/user/my-voice-dataset/config.json \
  --data.batch_size 32 \
  --ckpt_path /home/user/piper-checkpoints/epoch=2000.ckpt

Workaround

Hardware Warning: If you have a GPU with less than 24GB VRAM (like an RTX 3060 or 4070), you may get an “Out of Memory” error. If this happens, lower the batch size: Change --data.batch_size 32 to 16 or 8.

ImportError: cannot import name 'espeakbridge'

If you see this error, it means the compiled espeakbridge module is missing. First make sure the espeak-ng development packages are installed:

sudo apt update
sudo apt install espeak-ng libespeak-ng-dev

Compile the missing module:

# Compile the missing module
python3 setup.py build_ext --inplace

# Verify the compiled extension exists
ls -l src/piper/espeakbridge*.so

Monitoring Progress

Open another Linux terminal, activate the environment (the python venv in the piper folder), and launch TensorBoard:

tensorboard --logdir /mnt/d/Projects/piper/piper1-gpl/lightning_logs

Open your browser on Windows and navigate to the address displayed (usually localhost:6006) to view graphs and listen to audio previews.

When to stop: Once you are satisfied (when the “loss” in the graph stops decreasing and the audio samples sound good), you can stop the training (Ctrl+C) and proceed to export the model.

Exporting Your Voice

Once training is done (or you stop it because the loss is low enough), export the model to ONNX format to use it.

python3 -m piper.train.export_onnx \
  --checkpoint /home/user/my-voice-dataset/logs/lightning_logs/version_0/checkpoints/last.ckpt \
  --output-file /home/user/my_italian_voice.onnx

Final Steps:

  1. Copy your config file so its name matches the model: cp /home/user/my-voice-dataset/config.json my_italian_voice.onnx.json

  2. You now have the pair needed to use the voice:

    • my_italian_voice.onnx

    • my_italian_voice.onnx.json
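
As a final sanity check, you can confirm the pair is consistent before pointing Piper at it. This is a minimal sketch; the config keys mirror the layout of published Piper voice configs, so treat the field names as an assumption:

```python
import json
from pathlib import Path

def check_voice_pair(onnx_path):
    """Verify model and config sit side by side, and report key settings from the config."""
    onnx = Path(onnx_path)
    config = onnx.parent / (onnx.name + ".json")  # my_voice.onnx -> my_voice.onnx.json
    if not onnx.is_file() or not config.is_file():
        raise FileNotFoundError(f"expected both {onnx.name} and {config.name}")
    cfg = json.loads(config.read_text(encoding="utf-8"))
    # Assumed layout: {"audio": {"sample_rate": ...}, "espeak": {"voice": ...}}
    return {
        "sample_rate": cfg.get("audio", {}).get("sample_rate"),
        "espeak_voice": cfg.get("espeak", {}).get("voice"),
    }
```

For the Italian voice trained above you would expect a sample_rate of 22050 and an espeak_voice of "it".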


Licensed under CC BY-NC-SA 4.0
Last updated on Jan 13, 2026 21:05 +0100
Created with Hugo, Theme Stack;
Created by Francesco Garofalo;