Speech Generation for Indigenous Language Education & the EveryVoice TTS Toolkit


Vocoders

FastSpeech2 + HiFi-GAN base

5741.wav (Ground-Truth)
5741.wav (Synthesized)
Text: Pero la vergüenza no me impidió tomar una resolución.


6941.wav (Ground-Truth)
6941.wav (Synthesized)
Text: Se proponía presentarse a Don Carlos y retarle a desafío para decidir en juicio de Dios, peleando con toda lealtad, la grave cuestión que motivaba la guerra.


9603.wav (Ground-Truth)
9603.wav (Synthesized)
Text: Era un día de Marzo de esos que parecen días de Junio, privilegio de la corte de las Españas, que suele abrasarse en Febrero y helarse en Mayo.


FastSpeech2 + HiFi-GAN finetuned

5741.wav (Ground-Truth)
5741.wav (Synthesized)
Text: Pero la vergüenza no me impidió tomar una resolución.


6941.wav (Ground-Truth)
6941.wav (Synthesized)
Text: Se proponía presentarse a Don Carlos y retarle a desafío para decidir en juicio de Dios, peleando con toda lealtad, la grave cuestión que motivaba la guerra.


9603.wav (Ground-Truth)
9603.wav (Synthesized)
Text: Era un día de Marzo de esos que parecen días de Junio, privilegio de la corte de las Españas, que suele abrasarse en Febrero y helarse en Mayo.


FastSpeech2 + BigVGAN base

5741.wav (Ground-Truth)
5741.wav (Synthesized)
Text: Pero la vergüenza no me impidió tomar una resolución.


6941.wav (Ground-Truth)
6941.wav (Synthesized)
Text: Se proponía presentarse a Don Carlos y retarle a desafío para decidir en juicio de Dios, peleando con toda lealtad, la grave cuestión que motivaba la guerra.


9603.wav (Ground-Truth)
9603.wav (Synthesized)
Text: Era un día de Marzo de esos que parecen días de Junio, privilegio de la corte de las Españas, que suele abrasarse en Febrero y helarse en Mayo.


FastSpeech2 + BigVGAN finetuned

5741.wav (Ground-Truth)
5741.wav (Synthesized)
Text: Pero la vergüenza no me impidió tomar una resolución.


6941.wav (Ground-Truth)
6941.wav (Synthesized)
Text: Se proponía presentarse a Don Carlos y retarle a desafío para decidir en juicio de Dios, peleando con toda lealtad, la grave cuestión que motivaba la guerra.


9603.wav (Ground-Truth)
9603.wav (Synthesized)
Text: Era un día de Marzo de esos que parecen días de Junio, privilegio de la corte de las Españas, que suele abrasarse en Febrero y helarse en Mayo.


Ensemble

Speaker ID                       ES     EN
Training (unique utterances):
  Speaker-Dependent              541    13100
  Sampling 1st                   541    4112
  Sampling 2nd                   541    4174
  Sampling 3rd                   541    4142
  Ensemble (Sampling 1+2+3)      541    8921
Testing                          100    100
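
The table above counts unique utterances per resampled English subset and for their union. As a rough illustration only, the sketch below shows how such counts can arise if each subset is drawn with replacement from the 13100 English utterances. The function `resample_ids`, the draw count `N_DRAWS`, and the seeds are hypothetical; the page does not state how the subsets were actually drawn.

```python
import random

# Hypothetical draw count per subset; not a value given on this page.
N_DRAWS = 5000


def resample_ids(utterance_ids, n_draws, seed):
    """Draw n_draws utterance IDs with replacement and keep the unique ones."""
    rng = random.Random(seed)
    return {rng.choice(utterance_ids) for _ in range(n_draws)}


# 13100 English utterances, as in the Speaker-Dependent row of the table.
english_ids = list(range(13100))

# Three independent resampled subsets (assumed seeds 1, 2, 3).
samples = [resample_ids(english_ids, N_DRAWS, seed) for seed in (1, 2, 3)]

# The ensemble training set is taken here as the union of the three subsets.
ensemble = set().union(*samples)

for i, subset in enumerate(samples, start=1):
    print(f"Sampling {i}: {len(subset)} unique utterances")
print(f"Ensemble (1+2+3): {len(ensemble)} unique utterances")
```

Because the subsets overlap, the union holds fewer unique utterances than the sum of the three subset sizes, which matches the pattern in the table.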


                                                      HiFi-GAN (universal)    HiFi-GAN (finetuned)
Model                                                 English    Spanish      English    Spanish
SD-30-minutes (speaker-dependent)                     -          7.476        -          7.791
SD-5-hours (speaker-dependent)                        -          7.286        -          7.631
MU (multi-speaker/language)                           -          8.467        -          8.625
OV (oversampling)                                     -          7.677        -          7.946
E1 (multi-speaker/language with resampled corpora)    -          7.698        -          8.063
E2 (multi-speaker/language with resampled corpora)    -          7.834        -          8.093
E3 (multi-speaker/language with resampled corpora)    -          7.745        -          8.081
EN (ensemble model)                                   -          7.289        -          7.592
EN (ensemble model weighted)                          -          7.372        -          7.695
Mel-cepstral distortion (smaller is better).
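
For readers unfamiliar with the metric, here is a minimal sketch of the commonly used mel-cepstral distortion formula, in dB, over frames that are already time-aligned. It assumes each frame is a list of mel-cepstral coefficients and that the 0th (energy) coefficient is excluded. The function name is hypothetical, and the page does not say which alignment or implementation produced the reported numbers.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), skipping c_0.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("inputs must be non-empty and have the same number of frames")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Skip index 0 (energy term) and sum squared differences of the rest.
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)


# Toy example with two frames of four coefficients each.
reference = [[1.0, 0.5, 0.2, 0.1], [1.1, 0.4, 0.3, 0.0]]
synthesized = [[0.9, 0.6, 0.2, 0.1], [1.0, 0.4, 0.1, 0.0]]
print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.3f} dB")
```

Identical inputs give 0 dB, and larger spectral differences give larger values, hence "smaller is better".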


                                                      HiFi-GAN (universal)    HiFi-GAN (finetuned)
Model                                                 English    Spanish      English    Spanish
SD-30-minutes (speaker-dependent)                     -          0.287        -          0.286
SD-5-hours (speaker-dependent)                        -          0.328        -          0.336
MU (multi-speaker/language)                           -          0.316        -          0.308
OV (oversampling)                                     -          0.312        -          0.320
E1 (multi-speaker/language with resampled corpora)    -          0.306        -          0.289
E2 (multi-speaker/language with resampled corpora)    -          0.324        -          0.312
E3 (multi-speaker/language with resampled corpora)    -          0.308        -          0.299
EN (ensemble model)                                   -          0.344        -          0.303
EN (ensemble model weighted)                          -          0.335        -          0.312
F0 correlation (bigger is better).
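
As an illustration of this metric, the sketch below computes a Pearson correlation between two F0 contours. It assumes the contours are frame-aligned, that unvoiced frames are marked with 0, and that only frames voiced in both contours are compared. These conventions and the function name are assumptions, not details given on this page.

```python
import math


def f0_correlation(ref_f0, syn_f0):
    """Pearson correlation of F0 over frames voiced (F0 > 0) in both contours."""
    pairs = [(r, s) for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both contours")
    n = len(pairs)
    mean_ref = sum(r for r, _ in pairs) / n
    mean_syn = sum(s for _, s in pairs) / n
    cov = sum((r - mean_ref) * (s - mean_syn) for r, s in pairs)
    var_ref = sum((r - mean_ref) ** 2 for r, _ in pairs)
    var_syn = sum((s - mean_syn) ** 2 for _, s in pairs)
    if var_ref == 0 or var_syn == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_ref * var_syn)


# Toy contours in Hz; 0 marks an unvoiced frame.
reference = [110.0, 0.0, 120.0, 135.0, 128.0]
synthesized = [105.0, 90.0, 118.0, 140.0, 0.0]
print(f"F0 correlation: {f0_correlation(reference, synthesized):.3f}")
```

A value near 1 means the synthesized pitch rises and falls with the reference, hence "bigger is better".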


SeamlessM4T

SeamlessM4T-large

English | Spanish | French
Text (eng): The quick brown fox jumps over the lazy dog.
Text (spa): El zorro marrón rápido salta sobre el perro perezoso.
Text (fra): Le renard brun rapide saute sur le chien paresseux.


English | Spanish | French
Text (eng): The earliest book printed with movable types, the Gutenberg, or forty-two line Bible of about 1455
Text (spa): El primer libro impreso con tipos móviles, la Gutenberg, o Biblia de cuarenta y dos líneas de alrededor de 1455
Text (fra): Le premier livre imprimé avec des caractères mobiles, la Bible de Gutenberg, ou Bible de quarante-deux lignes, datant d'environ 1455


English | Spanish | French
Text (eng): The Middle Ages brought calligraphy to perfection, and it was natural therefore
Text (spa): La Edad Media trajo la caligrafía a la perfección, y por lo tanto era natural
Text (fra): Le Moyen Âge a apporté la calligraphie à la perfection, et il était donc naturel


English | Spanish | French
Text (eng): The earliest book printed with movable type, the aforesaid Gutenberg Bible, is printed in letters which are an exact imitation
Text (spa): El primer libro impreso con tipos móviles, la mencionada Biblia de Gutenberg, está impreso en letras que son una imitación exacta
Text (fra): Le premier livre imprimé avec des caractères mobiles, la Bible de Gutenberg susmentionnée, est imprimé en lettres qui sont une imitation exacte


SeamlessM4T-large finetuned (eng-spa)

English | Spanish | French
Text (eng): The quick brown fox jumps over the lazy dog.
Text (spa): El zorro marrón rápido salta sobre el perro perezoso.
Text (fra): Le renard brun rapide saute sur le chien paresseux.


English | Spanish | French
Text (eng): The earliest book printed with movable types, the Gutenberg, or forty-two line Bible of about 1455
Text (spa): El primer libro impreso con tipos móviles, la Gutenberg, o Biblia de cuarenta y dos líneas de alrededor de 1455
Text (fra): Le premier livre imprimé avec des caractères mobiles, la Bible de Gutenberg, ou Bible de quarante-deux lignes, datant d'environ 1455


English | Spanish | French
Text (eng): The Middle Ages brought calligraphy to perfection, and it was natural therefore
Text (spa): La Edad Media trajo la caligrafía a la perfección, y por lo tanto era natural
Text (fra): Le Moyen Âge a apporté la calligraphie à la perfection, et il était donc naturel


English | Spanish | French
Text (eng): The earliest book printed with movable type, the aforesaid Gutenberg Bible, is printed in letters which are an exact imitation
Text (spa): El primer libro impreso con tipos móviles, la mencionada Biblia de Gutenberg, está impreso en letras que son una imitación exacta
Text (fra): Le premier livre imprimé avec des caractères mobiles, la Bible de Gutenberg susmentionnée, est imprimé en lettres qui sont une imitation exacte


SeamlessM4T-large finetuned (moh)

501-650-kawe0612.wav (Ground-Truth)
501-650-kawe0612.wav (10 epochs)
501-650-kawe0612.wav (20 epochs)
Text: kèn:'en tsitskó:tak kanaktà:ke


651-1056-kawe0755.wav (Ground-Truth)
651-1056-kawe0755.wav (10 epochs)
651-1056-kawe0755.wav (20 epochs)
Text: tó: nitiothó:re', ronhátien iowisóntion nè:'e ki' shà:ka tenkate'khahahkwà:na'


251-500-kawe0401.wav (Ground-Truth)
251-500-kawe0401.wav (10 epochs)
251-500-kawe0401.wav (20 epochs)
Text: kanatí:re's, ka' nòn:wa niahón:ne' thí: ratiksa'okòn:'a?


251-500-kawe0305.wav (Ground-Truth)
251-500-kawe0305.wav (10 epochs)
251-500-kawe0305.wav (20 epochs)
Text: ahsonhtà:ke ken kahrónnion?


651-1056-kawe0696.wav (Ground-Truth)
651-1056-kawe0696.wav (10 epochs)
651-1056-kawe0696.wav (20 epochs)
Text: á:keh tsi niiononhwákte' khsinà:ke


Denoiser

Metric                     FastSpeech2    FastSpeech2 + denoiser    FastSpeech2 + postnet    FastSpeech2 + postnet + denoiser
Mel-cepstral distortion    5.2347         5.2297                    5.2307                   5.2294
F0 correlation             0.1808         0.1985                    0.1902                   0.2014
PESQ                       1.2898         1.2858                    1.2851                   1.2972
LJ001-0007.wav (FastSpeech2)
LJ001-0007.wav (FastSpeech2 + denoiser)
LJ001-0007.wav (FastSpeech2 + postnet)
LJ001-0007.wav (FastSpeech2 + postnet + denoiser)
Text: the earliest book printed with movable types, the Gutenberg, or "forty-two line Bible" of about fourteen fifty-five


40 and 80 Mel bands

Mel bands    Metric                     FastSpeech2    FastSpeech2 + denoiser    FastSpeech2 + postnet    FastSpeech2 + postnet + denoiser
80-mel       Mel-cepstral distortion    7.3128         7.3531                    7.3203                   7.3594
             F0 correlation             0.0360         0.0418                    0.0370                   0.0385
             PESQ                       1.3170         1.2809                    1.2939                   1.2858
40-mel       Mel-cepstral distortion    7.3444         7.3391                    7.3535                   7.3462
             F0 correlation             0.0392         0.0320                    0.0381                   0.0307
             PESQ                       1.2604         1.2254                    1.2436                   1.2295
LJ015-0199.wav (40 mel)
LJ015-0199.wav (80 mel)
LJ024-0078.wav (40 mel)
LJ024-0078.wav (80 mel)
LJ037-0160.wav (40 mel)
LJ037-0160.wav (80 mel)

EveryVoice vocoder

This model was trained only on English data (around 445 hours from LJSpeech, LibriTTS, and VCTK). We also test it on unseen speakers and languages.

English

LJ022-0180.wav (Ground truth)
LJ022-0180.wav (Universal vocoder)
LJ022-0180.wav (EveryVoice vocoder)


LJ030-0054.wav (Ground truth)
LJ030-0054.wav (Universal vocoder)
LJ030-0054.wav (EveryVoice vocoder)


7314_77782_000009_000003.wav (Ground truth)
7314_77782_000009_000003.wav (Universal vocoder)
7314_77782_000009_000003.wav (EveryVoice vocoder)


Spanish

549.wav (Ground truth)
549.wav (Universal vocoder)
549.wav (EveryVoice vocoder)


774.wav (Ground truth)
774.wav (Universal vocoder)
774.wav (EveryVoice vocoder)


822.wav (Ground truth)
822.wav (Universal vocoder)
822.wav (EveryVoice vocoder)


French

F10_a1_s050_v01.wav (Ground truth)
F10_a1_s050_v01.wav (Universal vocoder)
F10_a1_s050_v01.wav (EveryVoice vocoder)


M07_a3_s076_v01.wav (Ground truth)
M07_a3_s076_v01.wav (Universal vocoder)
M07_a3_s076_v01.wav (EveryVoice vocoder)


M07_a3_s095_v01.wav (Ground truth)
M07_a3_s095_v01.wav (Universal vocoder)
M07_a3_s095_v01.wav (EveryVoice vocoder)


Xhosa

xho_0050_0251778038.wav (Ground truth)
xho_0050_0251778038.wav (Universal vocoder)
xho_0050_0251778038.wav (EveryVoice vocoder)


xho_0050_2031808299.wav (Ground truth)
xho_0050_2031808299.wav (Universal vocoder)
xho_0050_2031808299.wav (EveryVoice vocoder)


xho_4280_6127787822.wav (Ground truth)
xho_4280_6127787822.wav (Universal vocoder)
xho_4280_6127787822.wav (EveryVoice vocoder)


Acknowledgement

The synthetic speech samples were constructed using a Spanish speaker dataset in the LJSpeech format, collected and validated by Carlos Fonseca.