ZeroSpeech 2019: TTS without T
Getting Started
This document details how to register to the challenge, download the development datasets and evaluation code, download the surprise datasets and submit your results for official evaluation. If you are experiencing any issue, please contact us at contact@zerospeech.com.
Summary
System requirements
The baseline and evaluation system is provided in a Docker image running Ubuntu 16.04.
Registration
- To register, send an email to contact@zerospeech.com with the subject “Registration” (the email body can be empty).
- Registration is mandatory in order to submit your results for evaluation and to download the surprise speech dataset.
- We will keep you informed at your registered email address in case there is any update.
Speech dataset
Download
- Create a project directory (e.g. zerospeech2019) somewhere on your machine and, in this directory, create the tree shared/databases; for instance:

  mkdir -p zerospeech2019/shared/databases

  The folder shared/ must exist and will be shared between your host machine and the Docker container (see Development kit). It contains the speech datasets in databases and may contain any supplementary folders to store your own models and results.
- Download the English development data from download.zerospeech.com (all wavs are mono, 16 bit, 16 kHz; 2.5 GB) and uncompress it in the databases directory. For example, on a standard Unix system:

  wget https://download.zerospeech.com/archive/2019/english.tgz
  tar xvfz english.tgz -C zerospeech2019/shared/databases
  rm -f english.tgz
- Do the same for the English small dataset, a toy dataset (150 MB) used to check that your system or the baseline runs correctly (see Using the baseline system):

  wget https://download.zerospeech.com/archive/2019/english_small.tgz
  tar xvfz english_small.tgz -C zerospeech2019/shared/databases
  rm -f english_small.tgz
- Download the surprise dataset as well (all wavs are mono, 16 bit, 16 kHz; 1.5 GB). The archive is protected by a password, which you can retrieve by accepting a licence agreement on the data page:

  wget https://download.zerospeech.com/archive/2019/surprise.zip
  unzip surprise.zip -d zerospeech2019/shared/databases   # enter the password when prompted
  rm -f surprise.zip
Structure
The speech datasets contain only audio files (mono wav, 16 kHz, 16-bit). See Datasets for a description of the corpus sub-components.

- train/
  - voice/: Voice dataset
  - unit/: Unit discovery dataset (note: not cut at sentence boundaries)
  - parallel/: Optional parallel datasets (note: parallel files share the same <FILE_ID> but have a different <SPEAKER_ID>)
- test/: .wav files to be re-encoded and resynthesized (see Submission procedure; note: many files are only a few hundred ms long).
- vads.txt: Timestamps indicating where speech occurs in each file. VAD timestamps are in the format <file> <speech onset in seconds> <speech offset in seconds> (see the sketch after this list).
- synthesis.txt: The list of files for which a synthesis is required, and an indication of the voice in which they need to be synthesized.
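For instance, here is a minimal sketch (not part of the official tooling) that uses vads.txt to trim each file to its speech region; it assumes sox is installed and that the <file> field is a wav filename relative to the corpus root:

  mkdir -p trimmed
  while read wav onset offset; do
      mkdir -p "trimmed/$(dirname "$wav")"
      # keep only the speech portion; '=' marks an absolute position in sox trim
      sox "$wav" "trimmed/$wav" trim "$onset" "=$offset"
  done < vads.txt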
Audio filenames are in the format <SPEAKER_ID>_<FILE_ID>.wav. When prefixed by V, the <SPEAKER_ID> denotes a target voice for synthesis (two voices in the English development data, one in the surprise language data). The same <SPEAKER_ID> can appear in several files (including very short ones); this is useful for speaker normalization techniques.
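For example, a quick way to see how many files each speaker contributes to a given subset (the path below is illustrative):

  ls zerospeech2019/shared/databases/english/train/unit/*.wav \
      | xargs -n1 basename | cut -d_ -f1 | sort | uniq -c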
The English development data contains an additional top-level folder, synthesis_comparison/, containing a subset of the test files in the source voices (they have the same <FILE_ID> as the ones in test/, but a different <SPEAKER_ID>). These files are for the human evaluation, which you can also carry out manually yourself; they are provided to aid in evaluation during development. The surprise language dataset has no such synthesis_comparison/ directory.
Development kit
The code for baselines and evaluation is provided as a Docker image.
- To install Docker, see the Docker installation documentation.
- To download the Docker image (8 GB), do:

  docker pull zeroresource/zs2019   # this is the name of the ZR2019 Docker image
- Then run an interactive bash session in a new container with:

  docker run -it --name <container_name> \
      -v <absolute_path>/shared:/shared zeroresource/zs2019 bash

  Replace <container_name> with a name of your choice; this will allow you to stop and restart the running container conveniently. Replace <absolute_path> with the absolute path on your machine (the host) to the folder shared created above.
- The -v (or --volume) option gives access, from within the Docker container, to folders on your machine. In this case, the contents of the shared folder are accessible from within your Docker container via /shared. Since folders inside the container cannot be accessed directly from your machine, storing your important data outside the container is essential in practice. The evaluation scripts within the container require the data to be stored under /shared/databases/.
- You can add additional directories in shared to store your own models and results and share them between your machine and the Docker container (see the example right after this list).
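For example (the directory names are only suggestions):

  mkdir -p zerospeech2019/shared/models zerospeech2019/shared/results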
Note
docker run
creates a new container (like a “light” virtual
machine) from the Docker image. If you log out of the only running
terminal on the container, then the “machine” will shut down. You
can resume your work by doing:
docker start <container_name> # to restart the Docker "machine"
docker attach <container_name> # to start an interactive session
## ... now, do some work within the Docker "machine" ...
docker stop <container_name> # to "power down" the Docker "machine"
You do not need to immediately start a terminal in your Docker
container. By adding the -d
option to the docker run
command
above, you can start a “detached” instance, which you can later open
a shell on, or simply run commands in from the outside. To create a
new terminal on a detached instance, do:
docker exec -it <container_name> bash
Quitting this new terminal will not shut down the running container (the container shuts down when the last running process terminates). To run a command from the outside, do:
docker exec <container_name> COMMAND [ARGS*]
Other useful Docker commands:
docker images # shows all the images
docker container ls --all # shows which container is doing what
docker rm <container_name> # deletes the container and any data stored within it (except the 'shared' folder);
# does not delete the source image
More commands can be found in the Docker commands documentation.
Contents
The Docker image ships with a complete Linux system (Ubuntu), preinstalled with Python and several basic libraries (Miniconda). The virtual machine has one user (home directory in /home/zs2019) with root privileges. Here are the contents of the virtual machine's home directory:
validate.sh # validation script
evaluate.sh # evaluation script
baseline/ # baseline system outputs and training scripts
Dockerfile # contains the image creation script (do not touch)
system/ # scripts used by the Docker and evaluation (do not touch)
miniconda3/ # local Python installation
(The directory /shared/
, where the shared folder is mounted, is
under the root, not under the home directory.)
- You are encouraged to use the Docker virtual machine to build your system, in addition to using it for evaluation and submission; this also makes it easy to distribute your code in a replicable environment. If you do complex things inside your container (such as installing additional libraries or dependencies), all of this will remain inside the specific container <container_name>. New containers created from the original zeroresource/zs2019 image will not contain these changes. If you wish to create your own image from your own modified container, use docker commit <container_name> <image_name>.
- It is possible to use NVIDIA GPUs (CUDA 9.0, cuDNN 7) within the Docker container. You must install nvidia-docker (https://github.com/NVIDIA/nvidia-docker) on your host machine and run the container with the --runtime=nvidia option:

  docker run -it --runtime=nvidia --name <container_name> \
      -v <absolute_path>/shared:/shared zeroresource/zs2019 bash

  To make sure your GPUs are usable inside the container, run nvidia-smi, for instance as shown below.
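For instance, from the host (using the detached-mode commands described above):

  docker exec <container_name> nvidia-smi   # should list your GPUs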
Baseline System
The development kit includes a baseline system. It first performs unsupervised unit discovery using the BEER system described in Ondel et al. 2016, generates a decoding of the voice training corpus, and then trains a synthesis voice on that unsupervised decoding using Ossian, a synthesis system based on Merlin.
The precompiled results (decoded symbolic embeddings and resynthesized wavs) for the test dataset are provided in baseline/baseline_submission.zip (see also Submission procedure below). Tools are provided to re-train the baseline system from scratch, and also to generate decodings and resynthesis for the test corpus given a trained system (see Using the baseline system below). A very small dataset is also provided in the Docker image to quickly test the training (datasets/english_small; for download, see Dataset Download above).
Updates
During the life of the challenge, we may upgrade some of the functions and scripts in the Docker image. If this happens, an email notice will be sent to you, and the upgrade will require a simple git pull command in the $HOME directory of the virtual machine.
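If your container is running detached, the update can also be applied from the host; a minimal sketch (assuming the container was created as described in Development kit):

  docker exec <container_name> bash -c 'cd $HOME && git pull'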
Submission procedure
The submission procedure is decomposed into four steps: (1) preparing the output files, (2) validating the files, (3) evaluating the results, (4) submitting the output files to Codalab (https://competitions.codalab.org/competitions/20692#participate). During the development phase with the English dataset (and debugging with the surprise dataset), only steps 1 to 3 are used, of course.
Output Preparation
The files should be organized in a submission.zip
archive with the
following content:
metadata.yaml
code/* (optional)
surprise/test/*.txt and *.wav
surprise/auxiliary_embedding1/*.txt (optional)
surprise/auxiliary_embedding2/*.txt (optional)
english/test/*.txt and *.wav
english/auxiliary_embedding1/*.txt (optional)
english/auxiliary_embedding2/*.txt (optional)
- The metadata.yaml file must exist. It should contain the following entries (order does not matter; a placeholder example is given after this list):
author:
authors of the submission
affiliation:
affiliation of the authors (university or company)
abx distance:
the ABX distance used for ranking the test embeddings,
must be 'dtw_cosine', 'dtw_kl' or 'levenshtein'
open source:
true or false; if true, you must provide a 'code' folder in the
submission archive with the source code or a pointer to a public
repository (e.g. on github)
system description:
a brief description of your system, possibly pointing to a paper
auxiliary1 description:
description of the auxiliary1 embeddings (if used)
auxiliary2 description:
description of the auxiliary2 embeddings (if used)
using parallel train:
true or false, set to true if you used the parallel train dataset
using external data:
true or false, set to true if you used an external dataset
- The code/ folder must be present if you set open source to true in the metadata. The folder can contain either a simple README file with a link for downloading your code or a full source tree. Binary-only submissions are also possible. You are strongly encouraged to submit your code; participants who do so will be awarded an OPEN SCIENCE badge that will appear on the leaderboard.
- The .txt files in the surprise/test and english/test subdirectories contain the embeddings, in the learned unit representation, corresponding to the audio files in the test folders of the given corpora (see below for file format information). All the input audio files in the test subfolder must be decoded.
- The .wav files in the surprise/test and english/test subdirectories contain the output of synthesis applied to the embedding files.
  - The contents of a given embedding file are all that may be used to generate the corresponding waveform, and an embedding file must not contain any supplementary information not read by the decoder.
  - Only the subset of files specified in synthesis.txt (in the root corpus folders, either surprise/ or english/) need to be resynthesized, although you are welcome to resynthesize all of them.
  - The file synthesis.txt in the dataset also specifies which of the synthesis voices is to be used for resynthesizing a given file. For a given test audio file <SXXX>_<ID>.wav, the corresponding resynthesized file should be called <VXXX>_<ID>.wav, where <VXXX> is the name of the voice indicated in synthesis.txt. Thus, for example, the file test/S002_0379088085.wav, which is marked in the English development dataset as going with voice V002, should be resynthesized in the submission as test/V002_0379088085.wav.
- Results on the surprise language must be the output of applying exactly the same training procedure as the one applied to the development corpus; all hyperparameter selection must be done beforehand, on the development corpus, or automated and integrated into training.
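For illustration, a metadata.yaml could look like the following (all values here are placeholders, not recommendations):

  author: Jane Doe and John Smith
  affiliation: Example University
  abx distance: dtw_cosine
  open source: true
  system description: Unsupervised unit discovery followed by the baseline Ossian synthesizer.
  auxiliary1 description: frame-level decodings, before collapsing repeated units
  auxiliary2 description: continuous embeddings, before quantization
  using parallel train: false
  using external data: false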
Note:
Embedding file format
The format of embedding files is plain text, with no header, with one discovered unit per line. No requirement is placed on the length of the sequence (i.e., number of lines). The sequence of units need not represent “frames” at any constant sampling rate. Each line corresponds to a unit in a fixed-dimension numerical encoding. Each column represents a dimension. Columns are separated by spaces. “Textual” encodings must be converted into numerical, one-hot representations (one binary dimension per symbol).
Example (dense, continuous encoding):
42.286527175400906 -107.68503050450957 59.79000088588511 -113.85831030071697
0.7872647311548775 45.33505222077471 -8.468742865224545 0
328.05422046327067 -4.495454384937348 241.186547397405 40.16161685378687
Example (binary encoding, converted from a symbolic representation):
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0
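As a sketch of how a symbolic decoding can be turned into this binary format (the file names and the symbol inventory file units.txt are hypothetical, not part of the challenge data):

  # units.txt: the symbol inventory, one symbol per line
  # decoding.txt: a decoding of one test file, one discovered symbol per line
  awk 'NR==FNR { idx[$1] = FNR; n = FNR; next }
       { for (i = 1; i <= n; i++)
             printf "%d%s", (i == idx[$1] ? 1 : 0), (i < n ? " " : "\n") }' \
      units.txt decoding.txt > decoding_onehot.txt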
Auxiliary embeddings
While the embeddings passed directly to the synthesis system are an obligatory part of the submission, each submission may also include up to two intermediate or derived representations of interest. This may be of particular interest to participants doing unit discovery and making use of the baseline synthesis system. That system requires one-hot representations, and therefore requires participants to quantize their embeddings and convert them into one-hot representations. Since any such transformation will radically change the distances between vectors, it may induce an unwanted degradation in the ABX discrimination evaluation. It is therefore of interest to submit the representation prior to quantization as an auxiliary embedding.
The example submission from the baseline system gives two embeddings
in this way. The baseline unit discovery system gives a decoding at
the frame level (one symbol per 10-ms frame), and these initial
decodings therefore contain long repeated sequences consisting of the
same discovered unit repeated for a number of frames. While one could
pass this decoding directly to a synthesis system, that is not the way
our system works; repetitions are first removed, and the synthesis
system predicts durations using a model learned during its training
phase. Since we do not pass the initial, frame-level decodings
directly to the synthesis module, we cannot put these decodings in
test
; we place the collapsed embeddings in test
, and the
frame-level decodings in auxiliary_embedding1
for comparison.
(Note, however, that, in an end-to-end system which never explicitly
removed repetitions, we would not be allowed to include embeddings
with repetitions removed: the representation in test
must be the
one on the basis of which the synthesis is done.)
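For a frame-level one-hot decoding stored one vector per line, such as the baseline's, collapsing repetitions simply amounts to removing adjacent duplicate lines, for example (file names are illustrative):

  uniq frame_level_decoding.txt > collapsed_decoding.txt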
Output validation
The validate.sh
program in the home directory will be
automatically executed as the first step of the evaluation
process. This will verify that all required files exist, and verify
that embedding and synthesized wav files are in conformance with the
required format. If the check fails, your submission will be rejected
(it will not be counted as one of the two submissions).
You are advised to run the validation program on your own before making your submission. To apply the script that will be run when you make your submission, run:
./validate.sh <submission> (english|surprise|both)
where <submission> is the path to your submission (it can be a directory or a zip archive). It is possible to validate only the English results by setting the second argument to english, or only the surprise language results by setting it to surprise. However, upon submission to Codalab, both validations will automatically be run, meaning that both languages will be checked and need to be present in the zip file. For example, to check the precompiled baseline results for English:
./validate.sh baseline/baseline_submission.zip english
This script will check that all necessary files are present and validate the file formats.
Output evaluation
The evaluate.sh script will be automatically executed on Codalab during the evaluation process. It executes the machine evaluations (ABX and bitrate) on the embeddings. By default it runs on the English language. We do not provide the evaluation for the surprise dataset: you are not supposed to optimise your system on that dataset. To run the evaluation on a submission, do:
bash evaluate.sh <submission_zipfile> <embedding> <distance>
where <submission_zipfile> is the name of the .zip file containing your submission, <embedding> is either test, auxiliary_embedding1, or auxiliary_embedding2, and <distance> is dtw_cosine, dtw_kl or levenshtein (see Evaluation metrics).
For example, to evaluate the precompiled results with the Levenshtein distance:
bash evaluate.sh baseline/baseline_submission.zip test levenshtein
The evaluation process should take about 15 minutes to run. The output in this example should give an ABX score of around 34.7% error, and an estimated bitrate of 72 bits/sec.
Output Submission
Note
Please remember that the next deadline, for wavefile submission on Codalab, is March 15, 2019, 23h59 GMT-12. This deadline will be strictly enforced.
The results must be submitted as an archive called submission.zip
to the competition page on Codalab. Each
team is only allowed TWO submissions. You can use them to vary the
bitrate/quality balance, or to submit one system using the parallel
dataset, and one without. If you submitted something by mistake,
please let us know ASAP. Once the human evaluation is launched, we
cannot cancel it.
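A minimal sketch for assembling such an archive from a directory laid out as described in Output Preparation (the directory name mysubmission is illustrative; omit code if you do not submit sources):

  cd mysubmission
  zip -r ../submission.zip metadata.yaml code english surprise
  cd ..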
Using the baseline system
To re-train the complete baseline system
Run baseline/train_all.sh with the name of the corpus and the number of epochs for which to train the unit discovery as arguments. For example, to do the same training as was done in the sample submission, do (from the home directory):
bash baseline/train_all.sh english 10
This will train the unit discovery system and the two synthesis voices
on the full development corpus. It will take a long time (twelve hours
or so). To simply verify that the baseline will run, replace english
with english_small
, and replace the number of epochs with something
small (e.g., 1).
The learned unit discovery models will be stored in
baseline/training/beer/recipes/zrc2019/exp/<corpus_name>/aud
. A
complete decoding of the Voice training subset and the Unit
discovery training subset, in a symbolic version of the learned
representation, can also be found in that folder, under
trans_voice.txt
and trans_unit.txt
.
The learned synthesis voices will be stored in
$HOME/baseline/training/ossian/voices/<corpus_name>
.
To re-train the synthesizer only (using different symbolic embeddings)
The unit discovery component of the baseline training script generates a decoding of the Voice training subset, in terms of unsupervised units, which is then passed on to Ossian for training. For participants not developing their own synthesis system, it is possible to train Ossian using any other set of discovered units. Only one-hot vectors are supported. (See Output Preparation for an explanation of the format and information about submitting and evaluating auxiliary embeddings.)
To do only the synthesis voice training component, first generate
one-hot decodings of all the files in the Voice subset, in the
challenge submission format, in some folder. Then, run
train_synth.sh
, as follows (from the home directory),
bash baseline/train_synth.sh <decoding_folder> <corpus_name> <clean|noclean>
where <decoding_folder>
is the folder in which the decodings of the
Voice files are contained, <corpus_name>
is the name of the
corpus, and the third argument indicates whether existing models
should be removed before training. For example,
bash baseline/train_synth.sh voice_txts/ english clean
The learned synthesis voices will be stored in
$HOME/baseline/training/ossian/voices/<corpus_name>
.
GPU use for synthesis training can be enabled by editing the corresponding configuration line (# MODE='gpu') of the script train_synth.sh.

To generate test outputs (embeddings and wavs) from a trained baseline model
Run submission.sh
as follows (from the home directory):
bash baseline/submission.sh <submission_dir> <zip|nozip> <corpus_name>
The first argument is the directory where the submission will be
generated. If the second argument is zip
, then, additionally, a zip
file containing the submission will be created in the folder above
<submission_dir>
. The third argument is the name of the corpus on
which the models were trained, which will determine the test corpus
for which to generate the submission. For example,
bash $HOME/baseline/submission.sh $HOME/mysubmission zip english
will look for trained unit and synthesis models stored under english
in the relevant subfolders (see above) and generate the embeddings and
synthesis for the test stimuli in the English development corpus. The
resulting files will be stored under $HOME/mysubmission
, and a zip
file containing the contents of this folder will be saved as
$HOME/mysubmission.zip
.
(See Submission procedure for more information about what is required in a complete submission.)
The last argument can also be omitted to create a submission
combining both the English development and the surprise language. In
this case, the script will first invoke itself with english
as the last argument, and then with surprise
. The zip
file, if requested, will be created at the end.
To synthesize test outputs from other symbolic embeddings
For participants not developing their own synthesis system, it is possible to run Ossian to synthesize the test stimuli using any other one-hot encoding of the test stimuli, provided that an appropriate synthesis model has been trained using this encoding (see above). (See Output preparation for an explanation of the format and information about submitting and evaluating auxiliary embeddings.)
To do only the synthesis for the test stimuli, first generate one-hot
decodings of the files in the Test subset, in the challenge
submission format, into the appropriate folder in your submission (see Submission Procedure
for the submission structure). Then, run generate_tts_from_onehot.sh
, with that folder as the first
argument, and the name of the corpus as the second argument. For
example,
bash $HOME/baseline/synthesize.sh mysubmission/test/ english english
This will use the trained Ossian voices called english
to generate
synthesis for the test files in the English development corpus on the
basis of the decodings contained in mysubmission/test/
, storing
the resulting wav files in mysubmission/test/
.
Note that only a subset of the Test files need to be synthesized,
and only this subset will be looked for by this script. The list of
files for which a synthesis is required, and an indication of the
voice in which they need to be synthesized, is contained in
synthesis.txt
in the root folder of the dataset. However, in a
complete submission, an embedding file must be provided for each of
the Test audio files. See Submission procedure for the
submission structure.
Note:
While participants must provide an accompanying synthesis for their embedding files, participants wishing to concentrate only on unit discovery are free to use the provided baseline synthesis system rather than building their own. It should be noted, however, that while some modern speech synthesis systems allow for conditioning on potentially dense vectors, the baseline system we provide, based on Ossian, is designed for use with traditional textual input. As such, it must be given sequences of one-hot vectors as input. Participants with unit discovery systems that yield continuous or otherwise structured vector representations will need to quantize them and convert them to one-hot representations before using the baseline synthesis system. See Output preparation.
Paper Submission
This 2019 Challenge is targeted for an Interspeech 2019 Special Workshop. Participants will submit their papers via https://www.interspeech2019.org/authors/author_resources; papers will be reviewed by an external panel, and authors will be notified of the decision by email. Authors of selected papers will present their work either as a talk or as a poster plus flash presentation during the Workshop.