ZeroSpeech 2019: TTS without T
This document details how to register for the challenge, download the development datasets and evaluation code, download the surprise datasets, and submit your results for official evaluation. If you experience any issues, please contact us at email@example.com.
The baseline and evaluation system is provided in a Docker image running Ubuntu 16.04.
To register, send an email to firstname.lastname@example.org, with the subject “Registration” (the email body can be empty).
Registration is mandatory in order to submit your results for evaluation and to download the surprise speech dataset.
We will keep you informed at your registered email address in case there is any update.
Create a project directory (e.g. zerospeech2019) somewhere on your machine, and inside it create the directory tree shared/databases; for instance:
mkdir -p zerospeech2019/shared/databases
The shared/ directory must exist and will be shared between your host machine and the Docker container (see Development kit). It contains the speech datasets in databases and may contain any supplementary folders to store your own models and results.
Download the English development data from download.zerospeech.com (all wavs are mono, 16 bit, 16 kHz; 2.5GB) and uncompress it in the databases directory. For example, on a standard Unix system:
wget https://download.zerospeech.com/archive/2019/english.tgz
tar xvfz english.tgz -C zerospeech2019/shared/databases
rm -f english.tgz
Do the same for the English small dataset, a toy dataset (150MB) used to check that your system or the baseline runs correctly (see Using the baseline system):
wget https://download.zerospeech.com/archive/2019/english_small.tgz
tar xvfz english_small.tgz -C zerospeech2019/shared/databases
rm -f english_small.tgz
Download the surprise dataset as well (all wavs are mono, 16 bit, 16 kHz; 1.5GB). The archive is protected by a password, which you can retrieve by accepting a licence agreement on the data page:
wget https://download.zerospeech.com/archive/2019/surprise.zip
unzip surprise.zip -d zerospeech2019/shared/databases  # enter the password when prompted
rm -f surprise.zip
The speech datasets contain only audio files (mono wav, 16 kHz, 16 bit). See Datasets for a description of the corpus sub-components.
voice/: Voice dataset
unit/: Unit discovery dataset (note: not cut at sentence boundaries)
parallel/: Optional parallel datasets (note: parallel files share the same <FILE_ID> but have a different <SPEAKER_ID>)
test/: .wav files to be re-encoded and resynthesized (see Submissions procedure; note: many files are only a few hundred ms).
vads.txt: Timestamps indicating where speech occurs in each file. VAD timestamps are in the format
<file> <speech onset in seconds> <speech offset in seconds>.
synthesis.txt: The list of files for which a synthesis is required, and an indication of the voice in which they need to be synthesized.
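The vads.txt format above can be parsed in a few lines of Python. This is a sketch of our own (the helper name parse_vads is not part of the challenge kit), and it assumes a file may list several speech intervals, one per line:

```python
def parse_vads(lines):
    """Parse vads.txt lines of the form '<file> <onset_s> <offset_s>'.

    Returns a dict mapping each file to its list of (onset, offset)
    speech intervals, since a file may contain several speech stretches.
    """
    intervals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, onset, offset = line.split()
        intervals.setdefault(name, []).append((float(onset), float(offset)))
    return intervals

# Example with made-up entries:
example = ["S002_0379088085.wav 0.240 1.890",
           "S002_0379088085.wav 2.310 3.005"]
vads = parse_vads(example)
```

The resulting dict makes it easy to restrict feature extraction or unit discovery to the speech regions of each file.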
Audio filenames are in the format <SPEAKER_ID>_<FILE_ID>.wav. Some <SPEAKER_ID> are target voices for synthesis (two voices in the English development data, one in the surprise language data). The same <SPEAKER_ID> can appear in several files (including very small ones); this is useful for speaker adaptation.
The English development data contains an additional top-level folder
synthesis_comparison/ containing a subset of the test files in the
source voices (they will have the same
<FILE_ID> as the ones in
test/, but a different
<SPEAKER_ID>). These are for the human
evaluation, which can be done manually, and are provided to aid in
evaluation during development. The surprise language dataset will have
The code for baselines and evaluation is provided as a Docker image.
To install Docker, visit Docker installation documentation.
To download the Docker image (8GB) do:
docker pull zeroresource/zs2019 # This is the name of the ZR2019 Docker image
Then run an interactive bash session in a new container with:
docker run -it --name <container_name> \
    -v <absolute_path>/shared:/shared zeroresource/zs2019 bash
Replace <container_name> with a name of your choice; this will allow you to stop and restart the running container conveniently. Replace <absolute_path> with the absolute path on your machine (host) to the project directory created above (e.g. zerospeech2019). The -v (--volume) option allows access to folders on your machine from within the Docker container. In this case, the contents of the shared folder are accessible from within your Docker container via /shared. Since it is impossible to directly access folders within the Docker container from your machine, storing your important data outside the container is essential for practical usage. The evaluation scripts within the container require the data to be stored under /shared/databases.
You can add additional directories in shared to store your own models and results and share them between your machine and the Docker container.
docker run creates a new container (like a “light” virtual
machine) from the Docker image. If you log out of the only running
terminal on the container, then the “machine” will shut down. You
can resume your work by doing:
docker start <container_name>   # restart the Docker "machine"
docker attach <container_name>  # start an interactive session
# ... now, do some work within the Docker "machine" ...
docker stop <container_name>    # "power down" the Docker "machine"
You do not need to immediately start a terminal in your Docker
container. By adding the
-d option to the
docker run command
above, you can start a “detached” instance, which you can later open
a shell on, or simply run commands in from the outside. To create a
new terminal on a detached instance, do:
docker exec -it <container_name> bash
Quitting this new terminal will not shut down the running container (the container shuts down when the last running process terminates). To run a command from the outside, do:
docker exec <container_name> COMMAND [ARGS*]
Other useful Docker commands:
docker images              # shows all the images
docker container ls --all  # shows which container is doing what
docker rm <container_name> # deletes the container and any data stored within it
                           # (except the 'shared' folder); does not delete the source image
More commands can be found in the Docker commands documentation.
The Docker image comes with a complete Linux system (Ubuntu) preinstalled with Python and several basic libraries (miniconda). The virtual machine has one user, with root privileges. Here is the content of the virtual machine's home directory:
validate.sh  # validation script
evaluate.sh  # evaluation script
baseline/    # baseline system outputs and training scripts
Dockerfile   # contains the image creation script (do not touch)
system/      # scripts used by the Docker and evaluation (do not touch)
miniconda3/  # local Python installation
(Note that /shared/, where the shared folder is mounted, is under the root, not under the home directory.)
You are encouraged to use the Docker virtual machine to build your system, in addition to using it for evaluation and submission. This will also enable easy distribution of your code in a replicable environment. If you do complex things inside your container (like installing additional libraries or dependencies, etc.), all of this will remain inside the specific container
<container_name>. New containers created from the original
zeroresource/zs2019 image will not contain these changes. If you wish to create your own image from your own modified container, use
docker commit <container_name> <image_name>.
It is possible to use NVIDIA GPUs (CUDA-9.0, CUDNN-7) within the Docker container. You must install nvidia-docker (https://github.com/NVIDIA/nvidia-docker) on your host machine and run the container with the --runtime=nvidia option:
docker run -it --runtime=nvidia --name <container_name> \
    -v <absolute_path>/shared:/shared zeroresource/zs2019 bash
To make sure your GPUs are usable inside the container, run nvidia-smi.
The development kit includes a baseline system. It first performs unsupervised unit discovery using the BEER system described in Ondel et al. 2016, generates a decoding of the voice training corpus, and then trains a synthesis voice on this unsupervised decoding using Ossian, a synthesis system based on Merlin.
The precompiled results (decoded symbolic embeddings and resynthesized wavs) of the test dataset are provided in baseline/baseline_submission.zip (see also Submission procedure below).
Tools are provided to re-train the baseline system from scratch, and also to generate decodings and resynthesis for the test corpus given a trained system (see Using the baseline system below).
A very small dataset is also provided in the Docker image to quickly test the training (in datasets/english_small; for download, see Dataset Download).
During the life of the challenge, we may upgrade some of the functions
and scripts in the Docker. If this happens, an email notice will be
sent to you. In this case, the upgrade will require a simple
git pull command in the $HOME directory of the virtual machine.
The submission procedure is decomposed into four steps: (1) preparing the output files, (2) validating the files, (3) evaluating the results, and (4) submitting the output files to Codalab.
During the development phase with the English dataset (and debugging
with the surprise dataset), only steps 1 to 3 are used, of course.
The files should be organized in a submission.zip archive with the following structure:

metadata.yaml
code/* (optional)
surprise/test/*.txt and *.wav
surprise/auxiliary_embedding1/*.txt (optional)
surprise/auxiliary_embedding2/*.txt (optional)
english/test/*.txt and *.wav
english/auxiliary_embedding1/*.txt (optional)
english/auxiliary_embedding2/*.txt (optional)
The metadata.yaml file must exist. It should contain the following entries (order does not matter):
author: authors of the submission
affiliation: affiliation of the authors (university or company)
abx distance: the ABX distance used for ranking the test embeddings; must be 'dtw_cosine', 'dtw_kl' or 'levenshtein'
open source: true or false; if true you must provide a 'code' folder in the submission archive with the source code or a pointer to a public repository (e.g. on github)
system description: a brief description of your system, possibly pointing to a paper
auxiliary1 description: description of the auxiliary1 embeddings (if used)
auxiliary2 description: description of the auxiliary2 embeddings (if used)
using parallel train: true or false; set to true if you used the parallel train dataset
using external data: true or false; set to true if you used an external dataset
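For reference, a filled-in metadata.yaml could look like the following. All values here are illustrative placeholders, not real entries:

```yaml
author: Jane Doe and John Smith
affiliation: Example University
abx distance: dtw_cosine
open source: true
system description: unsupervised acoustic unit discovery followed by parametric synthesis
auxiliary1 description: continuous embeddings before quantization
auxiliary2 description: none
using parallel train: false
using external data: false
```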
The code/ folder must be present if you specified open source as true in the metadata. The folder can contain a simple README file with a link to download your code or a full source tree. Binary-only submissions are also possible. You are strongly encouraged to submit your code. Participants who submit their code in this way will be awarded an OPEN SCIENCE badge that will appear in the leaderboard.
The .txt files in the english/test and surprise/test subdirectories contain the embeddings in the learned unit representation corresponding to the audio files in the test folders of the given corpora (see below for file format information). All the input audio files in the test subfolders must be decoded.
The .wav files in the test subdirectories contain the output of synthesis applied to the embedding files.
The contents of a given embedding file is all that may be used to generate the corresponding waveform, and an embedding file must not contain any supplementary information not read by the decoder. Only the subset of files specified in synthesis.txt (in the root corpus folders, either english/ or surprise/) need to be resynthesized, although you are welcome to resynthesize all of them.
synthesis.txt in the dataset also specifies which of the synthesis voices is to be used for resynthesizing a given file. For a given test audio file <SXXX>_<ID>.wav, the corresponding resynthesized file should be called <VXXX>_<ID>.wav, where <VXXX> is the name of the voice indicated in synthesis.txt. Thus, for example, the file test/S002_0379088085.wav, which is marked in the English development data set as going with voice V002, should be resynthesized in the submission as V002_0379088085.wav.
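This renaming can be sketched in a few lines of Python. The helper name is ours; it assumes the <ID> part of the filename is kept and the speaker prefix is replaced by the target voice:

```python
def resynthesis_name(test_filename, voice):
    """Map a test file '<SXXX>_<ID>.wav' and a target voice '<VXXX>'
    to the expected resynthesized filename '<VXXX>_<ID>.wav'."""
    speaker, file_id = test_filename.split("_", 1)
    return f"{voice}_{file_id}"

# Example from the English development data:
out = resynthesis_name("S002_0379088085.wav", "V002")  # 'V002_0379088085.wav'
```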
Results on the surprise language must be the output of applying exactly the same training procedure as the one applied to the development corpus; all hyperparameter selection must be done beforehand, on the development corpus, or automated and integrated into training.
Embedding file format
The format of embedding files is plain text, with no header, with one discovered unit per line. No requirement is placed on the length of the sequence (i.e., number of lines). The sequence of units need not represent “frames” at any constant sampling rate. Each line corresponds to a unit in a fixed-dimension numerical encoding. Each column represents a dimension. Columns are separated by spaces. “Textual” encodings must be converted into numerical, one-hot representations (one binary dimension per symbol).
Example (dense, continuous encoding):
42.286527175400906 -107.68503050450957 59.79000088588511 -113.85831030071697
0.7872647311548775 45.33505222077471 -8.468742865224545 0
328.05422046327067 -4.495454384937348 241.186547397405 40.16161685378687
Example (binary encoding, converted from a symbolic representation):
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0
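A symbolic decoding can be converted into this one-hot format with a short script. The following sketch is our own (not part of the challenge kit) and assumes the full symbol inventory is known in advance:

```python
def to_one_hot(symbols, inventory):
    """Convert a sequence of discovered-unit symbols into one-hot rows:
    one line per unit, one binary column per symbol in the inventory."""
    index = {sym: i for i, sym in enumerate(sorted(inventory))}
    lines = []
    for sym in symbols:
        row = ["0"] * len(index)
        row[index[sym]] = "1"
        lines.append(" ".join(row))
    return "\n".join(lines)

# Example: a 4-symbol inventory and the decoding 'b c c a'
encoded = to_one_hot(["b", "c", "c", "a"], {"a", "b", "c", "d"})
```

Writing `encoded` to a .txt file yields exactly the binary layout shown in the example above.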
While the embeddings immediately passed to the synthesis system are an obligatory part of the submission, each submission may include up to two intermediate or derived representations of interest. This may be of particular interest to participants doing unit discovery, and making use of the baseline synthesis system. This system requires one-hot representations, and therefore requires participants to do a quantization and a transformation into one-hot representations on their embeddings. Since any such transformation will radically change the distances between vectors, it may induce an unwanted degradation in the ABX discrimination evaluation. It is therefore of interest to submit the representation prior to quantization as an auxiliary embedding.
The example submission from the baseline system gives two embeddings
in this way. The baseline unit discovery system gives a decoding at
the frame level (one symbol per 10-ms frame), and these initial
decodings therefore contain long repeated sequences consisting of the
same discovered unit repeated for a number of frames. While one could
pass this decoding directly to a synthesis system, that is not the way
our system works; repetitions are first removed, and the synthesis
system predicts durations using a model learned during its training
phase. Since we do not pass the initial, frame-level decodings
directly to the synthesis module, we cannot put these decodings in
test; we place the collapsed embeddings in
test, and the
frame-level decodings in
auxiliary_embedding1 for comparison.
(Note, however, that, in an end-to-end system which never explicitly
removed repetitions, we would not be allowed to include embeddings
with repetitions removed: the representation in
test must be the
one on the basis of which the synthesis is done.)
The validate.sh program in the home directory will be
automatically executed as the first step of the evaluation
process. This will verify that all required files exist, and verify
that embedding and synthesized wav files are in conformance with the
required format. If the check fails, your submission will be rejected
(it will not be counted as one of the two submissions).
You are advised to run the validation program on your own before making your submission. To apply the script that will be run when you make your submission, run:
./validate.sh <submission> (english|surprise|both)
where <submission> is the path to your submission (which can be a
directory or a zip archive). It is possible to validate only the
English results by setting the second argument to
english, or only
the surprise language submission by changing the second argument to
surprise. However, upon submission in Codalab,
both will be
automatically run, meaning that both languages will be checked and
need to be present in the zip file. For example, to check the
precompiled baseline results for English:
./validate.sh baseline/baseline_submission.zip english
This script will check that all necessary files are present and validate the file formats.
The evaluate.sh script will be automatically executed on Codalab during the evaluation process. It executes the machine evaluations on the embeddings (ABX and bitrate). By default it runs on the English language. We do not provide the evaluation for the surprise dataset: you are not supposed to optimise your system on this dataset. To run the evaluation on a submission, do:
bash evaluate.sh <submission_zipfile> <embedding> <distance>
where <submission_zipfile> is the name of the .zip file containing your submission, <embedding> is either test, auxiliary_embedding1 or auxiliary_embedding2, and <distance> is either dtw_cosine, dtw_kl or levenshtein (see Evaluation metrics).
For example, to evaluate the precompiled results with the Levenshtein distance:
bash evaluate.sh baseline/baseline_submission.zip test levenshtein
The evaluation process should take about 15 minutes to run. The output in this example should give an ABX score of around 34.7% error, and an estimated bitrate of 72 bits/sec.
Please remember that the next deadline, for wavefile submission on Codalab, is March 15, 2019, 23h59 GMT-12. This deadline will be strictly enforced.
The results must be submitted as an archive called submission.zip to the competition page on Codalab. Each
team is only allowed TWO submissions. You can use them to vary the
bitrate/quality balance, or to submit one system using the parallel
dataset, and one without. If you submitted something by mistake,
please let us know ASAP. Once the human evaluation is launched, we
cannot cancel it.
Using the baseline system
To re-train the complete baseline system
Run baseline/train_all.sh, with the name of the corpus and the
number of epochs to train the unit discovery for as arguments. For
example, to do the same training as was done in the sample submission,
do (from the home directory):
bash baseline/train_all.sh english 10
This will train the unit discovery system and the two synthesis voices
on the full development corpus. It will take a long time (twelve hours
or so). To simply verify that the baseline will run, replace the corpus name with english_small, and replace the number of epochs with something
small (e.g., 1).
The learned unit discovery models will be stored in
complete decoding of the Voice training subset and the Unit
discovery training subset, in a symbolic version of the learned
representation, can also be found in that folder, under
The learned synthesis voices will be stored in
To re-train the synthesizer only (using different symbolic embeddings)
The unit discovery component of the baseline training script generates a decoding of the Voice training subset, in terms of unsupervised units, which is then passed on to Ossian for training. For participants not developing their own synthesis system, it is possible to train Ossian using any other set of discovered units. Only one-hot vectors are supported. (See Output Preparation for an explanation of the format and information about submitting and evaluating auxiliary embeddings.)
To do only the synthesis voice training component, first generate
one-hot decodings of all the files in the Voice subset, in the
challenge submission format, in some folder. Then, run
train_synth.sh, as follows (from the home directory),
bash baseline/train_synth.sh <decoding_folder> <corpus_name> <clean|noclean>
where <decoding_folder> is the folder in which the decodings of the
Voice files are contained,
<corpus_name> is the name of the
corpus, and the third argument indicates whether existing models
should be removed before training. For example,
bash baseline/train_synth.sh voice_txts/ english clean
The learned synthesis voices will be stored in
# MODE='gpu') of the script
To generate test outputs (embeddings and wavs) from a trained baseline model
Run submission.sh as follows (from the home directory):
bash baseline/submission.sh <submission_dir> <zip|nozip> <corpus_name>
The first argument is the directory where the submission will be
generated. If the second argument is
zip, then, additionally, a zip
file containing the submission will be created in the folder above
<submission_dir>. The third argument is the name of the corpus on
which the models were trained, which will determine the test corpus
for which to generate the submission. For example,
bash $HOME/baseline/submission.sh $HOME/mysubmission zip english
will look for trained unit and synthesis models stored under
in the relevant subfolders (see above) and generate the embeddings and
synthesis for the test stimuli in the English development corpus. The
resulting files will be stored under
$HOME/mysubmission, and a zip
file containing the contents of this folder will be saved as $HOME/mysubmission.zip.
(See Submission procedure for more information about what is required in a complete submission.)
The last argument can also be omitted to create a submission
combining both the English development and the surprise language. In
this case, the script will first invoke itself with english as the last argument, and then with
surprise. The zip
file, if requested, will be created at the end.
To synthesize test outputs from other symbolic embeddings
For participants not developing their own synthesis system, it is possible to run Ossian to synthesize the test stimuli using any other one-hot encoding of the test stimuli, provided that an appropriate synthesis model has been trained using this encoding (see above). (See Output preparation for an explanation of the format and information about submitting and evaluating auxiliary embeddings.)
To do only the synthesis for the test stimuli, first generate one-hot
decodings of the files in the Test subset, in the challenge
submission format, into the appropriate folder in your submission (see Submission Procedure
for the submission structure). Then, run
generate_tts_from_onehot.sh, with that folder as the first
argument, and the name of the corpus as the second argument. For example,
bash $HOME/baseline/synthesize.sh mysubmission/test/ english english
This will use the trained Ossian voices called
english to generate
synthesis for the test files in the English development corpus on the
basis of the decodings contained in mysubmission/test/, storing the resulting wav files in the same folder.
Note that only a subset of the Test files need to be synthesized,
and only this subset will be looked for by this script. The list of
files for which a synthesis is required, and an indication of the
voice in which they need to be synthesized, is contained in
synthesis.txt in the root folder of the dataset. However, in a
complete submission, an embedding file must be provided for each of
the Test audio files. See Submission procedure for the full requirements.
While participants must provide an accompanying synthesis for their embedding files, participants wishing to concentrate only on unit discovery are free to use the provided baseline synthesis system rather than building their own. It should be noted, however, that while some modern speech synthesis systems allow for conditioning on potentially dense vectors, the baseline system we provide, based on Ossian, is designed for use with traditional textual input. As such, it must be given sequences of one-hot vectors as input. Participants with unit discovery systems that yield continuous or otherwise structured vector representations will need to quantize them and convert them to one-hot representations before using the baseline synthesis system. See Output preparation.
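The quantization step can be illustrated with a small sketch of our own (not part of the challenge kit): each continuous frame vector is assigned to its nearest codebook centroid under squared Euclidean distance, and emitted as a one-hot row. How the codebook itself is learned (e.g. k-means) is left out here:

```python
def quantize_to_one_hot(vectors, centroids):
    """Assign each continuous vector to its nearest centroid (squared
    Euclidean distance) and emit one one-hot row per input vector."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    lines = []
    for vec in vectors:
        best = min(range(len(centroids)), key=lambda i: sqdist(vec, centroids[i]))
        row = ["0"] * len(centroids)
        row[best] = "1"
        lines.append(" ".join(row))
    return "\n".join(lines)

# Two centroids, three frames: the first two frames land on centroid 0,
# the last on centroid 1.
codebook = [(0.0, 0.0), (10.0, 10.0)]
frames = [(0.5, -0.2), (1.0, 1.0), (9.0, 11.0)]
onehot = quantize_to_one_hot(frames, codebook)
```

As noted in Output preparation, such a transformation radically changes distances between vectors, so the pre-quantization embeddings are worth submitting as an auxiliary embedding.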
This 2019 Challenge is targeted at an Interspeech 2019 Special Workshop. Participants will submit their papers via https://www.interspeech2019.org/authors/author_resources, be reviewed by an external panel, and receive the response by email. The selected papers will be presented either as a talk or as a poster plus flash presentation during the Workshop.