ZeroSpeech 2019: TTS without T

Getting Started

This document explains how to register for the challenge, download the development datasets and evaluation code, download the surprise datasets, and submit your results for official evaluation. If you experience any issues, please contact us at

System requirements

The baseline and evaluation system is provided in a Docker image running Ubuntu 16.04.


Registration

  • To register, send an email to, with the subject “Registration” (the email body can be empty).

  • Create an account on Codalab, which hosts the submission page.

  • Registration is mandatory in order to submit your results for evaluation and to download the surprise speech dataset.

  • We will keep you informed at your registered email address in case there is any update.

Speech dataset


  • Create a project directory (e.g. zerospeech2019) somewhere on your machine, and in this directory, the tree shared/databases; for instance:

    mkdir -p zerospeech2019/shared/databases

    The folder shared/ must exist and will be shared between your host machine and the Docker container (see Development kit). It contains the speech datasets in databases and may have any supplementary folders to store your own models and results.

  • Download the English development data from (all wavs are mono, 16 bit, 16 kHz; 2.5GB) and uncompress it in the databases directory. For example, on a standard Unix system:

     tar xvfz english.tgz -C zerospeech2019/shared/databases
     rm -f english.tgz
  • Do the same for the English small dataset, a toy dataset (150MB) used to check that your system or the baseline runs correctly (see Using the baseline system):

      tar xvfz english_small.tgz -C zerospeech2019/shared/databases
      rm -f english_small.tgz
  • Download the surprise dataset as well (all wavs are mono, 16 bit, 16 kHz; 1.5GB). The archive is protected by a password, which you can retrieve by accepting a licence agreement on the data page:

      unzip -d zerospeech2019/shared/databases
      # enter the password when prompted
      rm -f  


The speech datasets contain only audio files (mono wav, 16 kHz, 16-bit). See Datasets for a description of the corpus sub-components. Each dataset contains the following:

  • train/

    • voice/: Voice dataset
    • unit/: Unit discovery dataset (note: not cut at sentence boundaries)
    • parallel/: Optional parallel datasets (note: parallel files share the same <FILE_ID> but have a different <SPEAKER_ID>)
  • test/: .wav files to be re-encoded and resynthesized (see Submissions procedure; note: many files are only a few hundred ms).

  • vads.txt: Timestamps indicating where speech occurs in each file. VAD timestamps are in the format <file> <speech onset in seconds> <speech offset in seconds>.

  • synthesis.txt: The list of files for which a synthesis is required, and an indication of the voice in which they need to be synthesized.
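As an illustration of the vads.txt format described above, here is a minimal Python sketch that groups speech intervals by file and sums their durations. It assumes whitespace-separated fields; the sample lines are hypothetical, and lines that do not match the three-field pattern (a header line, for instance) are simply skipped.

```python
from typing import Dict, List, Tuple


def parse_vad_lines(lines: List[str]) -> Dict[str, List[Tuple[float, float]]]:
    """Group (onset, offset) speech intervals, in seconds, by file name."""
    intervals: Dict[str, List[Tuple[float, float]]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            onset, offset = float(fields[1]), float(fields[2])
        except ValueError:
            continue  # skip lines whose timestamps are not numbers
        intervals.setdefault(fields[0], []).append((onset, offset))
    return intervals


def speech_duration(intervals: List[Tuple[float, float]]) -> float:
    """Total duration of speech, in seconds, over a list of intervals."""
    return sum(offset - onset for onset, offset in intervals)


# Hypothetical sample lines; real file names and timestamps will differ.
sample = [
    "S001_0000000001 0.25 1.50",
    "S001_0000000001 2.00 2.75",
    "S002_0000000002 0.00 0.50",
]
vads = parse_vad_lines(sample)
print(speech_duration(vads["S001_0000000001"]))
```

In practice the lines would come from reading vads.txt in the root folder of a dataset rather than from a hard-coded list.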

Audio filenames are in the format <SPEAKER_ID>_<FILE_ID>.wav. Speaker IDs prefixed with V are target voices for synthesis (two voices in the English development data, one in the surprise language data). The same <SPEAKER_ID> can appear in several files (including very small ones); this is useful for speaker normalization techniques.
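The naming convention can be exploited programmatically, for example to group files by speaker before normalization. The following Python sketch is only illustrative: the function names are hypothetical, and it assumes that speaker IDs contain no underscore, so that the first underscore separates the two parts.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Tuple


def split_audio_name(path: str) -> Tuple[str, str]:
    """Return (speaker_id, file_id) for a <SPEAKER_ID>_<FILE_ID>.wav path."""
    stem = PurePosixPath(path).stem
    speaker_id, sep, file_id = stem.partition("_")
    if not sep or not speaker_id or not file_id:
        raise ValueError(f"unexpected audio file name: {path!r}")
    return speaker_id, file_id


def is_target_voice(speaker_id: str) -> bool:
    """Speaker IDs prefixed with V denote target voices for synthesis."""
    return speaker_id.startswith("V")


def group_by_speaker(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the paths belonging to each speaker, e.g. for normalization."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        speaker_id, _ = split_audio_name(path)
        groups[speaker_id].append(path)
    return dict(groups)


print(split_audio_name("test/S002_0379088085.wav"))
```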

The English development data contains an additional top-level folder synthesis_comparison/ containing a subset of the test files in the source voices (they will have the same <FILE_ID> as the ones in test/, but a different <SPEAKER_ID>). These are for the human evaluation, which can be done manually, and are provided to aid in evaluation during development. The surprise language dataset will have no such synthesis_comparison/ directory.

Development kit

The code for baselines and evaluation is provided as a Docker image.

  • To install Docker, see the Docker installation documentation.

  • To download the Docker image (8GB) do:

    docker pull zeroresource/zs2019    # This is the name of the ZR2019 Docker image
  • Then run an interactive bash session in a new container with:

    docker run -it --name <container_name> \
    -v <absolute_path>/shared:/shared zeroresource/zs2019 bash

    Replace <container_name> with a name of your choice. This will allow you to stop and restart the running container conveniently. Replace <absolute_path> with the absolute path on your machine (host) to the folder shared created above.

  • The -v or --volume option gives the Docker container access to folders on your machine. In this case, the contents of the shared folder are accessible from within the container via /shared. Since folders inside the container cannot be accessed directly from your machine, storing your important data outside the container is essential in practice. The evaluation scripts within the container require the data to be stored under /shared/databases/.

  • You can add additional directories in shared to store your own models and results and share them between your machine and the Docker container.


The Docker image ships with a complete Linux system (Ubuntu) preinstalled with Python and several basic libraries (miniconda). The virtual machine has one user (home directory in /home/zs2019) with root privileges. Here is the content of the virtual machine’s home directory:

              # validation script
              # evaluation script
baseline/     # baseline system outputs and training scripts
Dockerfile    # contains the image creation script (do not touch)
system/       # scripts used by the Docker and evaluation (do not touch)
miniconda3/   # local Python installation

(The directory /shared/, where the shared folder is mounted, is under the root, not under the home directory.)

Baseline System

The development kit includes a baseline system. It first performs unsupervised unit discovery using the BEER system described in Ondel et al. 2016 and generates a decoding of the voice training corpus. It then uses Ossian, a synthesis system based on Merlin, to train a synthesis voice on this unsupervised decoding.

The precompiled results (decoded symbolic embeddings and resynthesized wavs) for the test dataset are provided in baseline/ (see also Submission procedure below). Tools are also provided to re-train the baseline system from scratch and to generate decodings and resynthesis for the test corpus given a trained system (see Using the baseline system below). A very small dataset is also provided in the Docker image to quickly test the training (datasets/english_small; for download, see Speech dataset above).


During the life of the challenge, we may upgrade some of the functions and scripts in the Docker. If this happens, an email notice will be sent to you. In this case, the upgrade will require a simple git pull command in the $HOME directory of the virtual machine.

Submission procedure

The submission procedure consists of four steps: (1) preparing the output files, (2) validating the files, (3) evaluating the results, and (4) submitting the output files to Codalab. During the development phase with the English dataset (and debugging with the surprise dataset), only steps 1 to 3 are used.

Output Preparation

The files should be organized in an archive with the following content:

   metadata.yaml
   code/* (optional)
   surprise/test/*.txt and *.wav
   surprise/auxiliary_embedding1/*.txt (optional)
   surprise/auxiliary_embedding2/*.txt (optional)
   english/test/*.txt and *.wav
   english/auxiliary_embedding1/*.txt (optional)
   english/auxiliary_embedding2/*.txt (optional)
  • The metadata.yaml file must exist. It should contain the following entries (order does not matter):
       authors of the submission
       affiliation of the authors (university or company)
     abx distance:
       the ABX distance used for ranking the test embeddings,
       must be 'dtw_cosine', 'dtw_kl' or 'levenshtein'
     open source:
       true or false, if true you must provide a 'code' folder in the
       submission archive with the source code or a pointer to a public
       repository (e.g. on github)
     system description:
       a brief description of your system, possibly pointing to a paper
     auxiliary1 description:
       description of the auxiliary1 embeddings (if used)
     auxiliary2 description:
       description of the auxiliary2 embeddings (if used)
     using parallel train:
       true or false, set to true if you used the parallel train dataset
     using external data:
       true or false, set to true if you used an external dataset
  • The code/ folder must be present if you specified open source as true in the metadata. The folder can contain a simple README file with a link to download your code or a full source tree. Binary-only submissions are also possible. You are strongly encouraged to submit your code. Participants who submit their code in this way will be awarded an OPEN SCIENCE badge that will appear in the leaderboard.

  • The .txt files in the surprise/test and english/test subdirectories contain the embeddings in the learned unit representation corresponding to the audio files in the test folders in the given corpora (see below for file format information). All the input audio files in the test subfolder must be decoded.

  • The .wav files in the surprise/test and english/test subdirectories contain the output of synthesis applied to the embedding files.

    • The contents of a given embedding file are all that may be used to generate the corresponding waveform, and an embedding file must not contain any supplementary information not read by the decoder. Only the subset of files specified in synthesis.txt (in the root corpus folders, either surprise/ or english/) need to be resynthesized, although you are welcome to resynthesize all of them.

      The file synthesis.txt in the dataset also specifies which of the synthesis voices is to be used for resynthesizing a given file. For a given test audio file <SXXX>_<ID>.wav, the corresponding resynthesized file should be called <VXXX>_<ID>.wav, where <VXXX> is the name of the voice indicated in synthesis.txt. Thus, for example, the file test/S002_0379088085.wav, which is marked in the English development data set as going with voice V002, should be resynthesized in the submission as test/V002_0379088085.wav.
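The renaming rule for resynthesized files can be sketched in Python as follows. The function name is hypothetical, and the target voice is passed in explicitly because the exact layout of synthesis.txt is not described here.

```python
from pathlib import PurePosixPath


def resynthesis_name(test_wav: str, voice: str) -> str:
    """Map <SXXX>_<ID>.wav and a target voice to <VXXX>_<ID>.wav in the same folder."""
    path = PurePosixPath(test_wav)
    _source_speaker, sep, file_id = path.stem.partition("_")
    if not sep or not file_id:
        raise ValueError(f"unexpected test file name: {test_wav!r}")
    # Only the speaker part of the name changes; folder and file ID are kept.
    return str(path.with_name(f"{voice}_{file_id}.wav"))


print(resynthesis_name("test/S002_0379088085.wav", "V002"))
# prints test/V002_0379088085.wav
```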

Results on the surprise language must be the output of applying exactly the same training procedure as the one applied to the development corpus; all hyperparameter selection must be done beforehand, on the development corpus, or automated and integrated into training.

Auxiliary embeddings

While the embeddings immediately passed to the synthesis system are an obligatory part of the submission, each submission may include up to two intermediate or derived representations of interest. This may be of particular interest to participants doing unit discovery, and making use of the baseline synthesis system. This system requires one-hot representations, and therefore requires participants to do a quantization and a transformation into one-hot representations on their embeddings. Since any such transformation will radically change the distances between vectors, it may induce an unwanted degradation in the ABX discrimination evaluation. It is therefore of interest to submit the representation prior to quantization as an auxiliary embedding.
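To make the effect of quantization on distances concrete, here is a small Python sketch with made-up two-dimensional frames and two made-up centroids. It uses a plain cosine distance between single frames, without the alignment step of the actual ABX evaluation, and nearest-centroid assignment as one possible quantization; none of this is taken from the baseline code.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cos(a, b); both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_centroid(vec: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid closest to vec in squared Euclidean distance."""
    def sq_dist(centroid: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(vec, centroid))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def one_hot(index: int, size: int) -> List[float]:
    """One-hot vector of the given size with a 1 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical data: two centroids and three continuous frames.
centroids = [[1.0, 0.0], [0.0, 1.0]]
frame_a, frame_b, frame_c = [0.9, 0.1], [0.6, 0.4], [0.4, 0.6]
codes = [one_hot(nearest_centroid(f, centroids), len(centroids))
         for f in (frame_a, frame_b, frame_c)]

print("continuous a-b:", cosine_distance(frame_a, frame_b))
print("continuous b-c:", cosine_distance(frame_b, frame_c))
print("one-hot    a-b:", cosine_distance(codes[0], codes[1]))
print("one-hot    b-c:", cosine_distance(codes[1], codes[2]))
```

Before quantization, frames b and c are closer to each other than a and b are. After quantization, a and b become identical while b and c become maximally distant, which is the kind of distortion that an auxiliary embedding lets you measure.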

The example submission from the baseline system gives two embeddings in this way. The baseline unit discovery system gives a decoding at the frame level (one symbol per 10-ms frame), and these initial decodings therefore contain long repeated sequences consisting of the same discovered unit repeated for a number of frames. While one could pass this decoding directly to a synthesis system, that is not the way our system works; repetitions are first removed, and the synthesis system predicts durations using a model learned during its training phase. Since we do not pass the initial, frame-level decodings directly to the synthesis module, we cannot put these decodings in test; we place the collapsed embeddings in test, and the frame-level decodings in auxiliary_embedding1 for comparison. (Note, however, that, in an end-to-end system which never explicitly removed repetitions, we would not be allowed to include embeddings with repetitions removed: the representation in test must be the one on the basis of which the synthesis is done.)
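The relationship between the frame-level decoding and the collapsed one can be sketched in Python as follows. This is not the baseline code: the unit labels are made up, and integer labels are used only for brevity.

```python
from itertools import groupby
from typing import List, Tuple


def collapse_repeats(frames: List[int]) -> List[int]:
    """Remove consecutive repetitions, keeping one symbol per run."""
    return [unit for unit, _run in groupby(frames)]


def run_lengths(frames: List[int]) -> List[Tuple[int, int]]:
    """Return (unit, number of frames) for each run of identical units."""
    return [(unit, len(list(run))) for unit, run in groupby(frames)]


# Hypothetical frame-level decoding, one symbol per 10-ms frame.
frame_level = [3, 3, 3, 7, 7, 3]
print(collapse_repeats(frame_level))  # [3, 7, 3]
print(run_lengths(frame_level))       # [(3, 3), (7, 2), (3, 1)]
```

With 10-ms frames, the run lengths above correspond to durations of 30, 20 and 10 ms. This duration information is what the collapsed sequence discards and what the synthesis system has to predict again.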

Output validation

The validation program in the home directory is automatically executed as the first step of the evaluation process. It verifies that all required files exist and that the embedding and synthesized wav files conform to the required format. If the check fails, your submission will be rejected (it will not be counted as one of the two submissions).

You are advised to run the validation program on your own before making your submission. To apply the script that will be run when you make your submission, run:

     ./ <submission> (english|surprise|both)

where <submission> is the path to your submission (either a directory or a zip archive). You can validate only the English results by setting the second argument to english, or only the surprise language results by setting it to surprise. However, upon submission to Codalab, both will automatically be run, meaning that both languages will be checked and need to be present in the zip file. For example, to check the precompiled baseline results for English:

   ./ baseline/ english

This script will check that all necessary files are present and validate the file formats.
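Before running the official validation, a quick structural check of your own can catch obvious omissions. The Python sketch below is not the official validator and checks far less: it only looks for metadata.yaml (assumed here to sit at the root of the submission) and for at least one .txt and one .wav file in each <language>/test folder. The function name is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def check_layout(submission: Path,
                 languages: Sequence[str] = ("english", "surprise")) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    # Assumption: metadata.yaml sits at the root of the submission.
    if not (submission / "metadata.yaml").is_file():
        problems.append("missing metadata.yaml")
    for lang in languages:
        test_dir = submission / lang / "test"
        if not test_dir.is_dir():
            problems.append(f"missing folder {lang}/test")
            continue
        for suffix in (".txt", ".wav"):
            if not any(test_dir.glob(f"*{suffix}")):
                problems.append(f"no {suffix} files in {lang}/test")
    return problems


# Demonstration on a throwaway directory holding a single embedding file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "english" / "test").mkdir(parents=True)
    (root / "english" / "test" / "S002_0379088085.txt").write_text("")
    demo_problems = check_layout(root, languages=("english",))
print(demo_problems)
```

A clean result from this sketch does not imply that the official validation will pass, since file formats and metadata entries are not inspected.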

Output evaluation

The evaluation script will be automatically executed on Codalab during the evaluation process. It runs the machine evaluations on the embeddings (ABX and bitrate). By default it runs on the English language. We do not provide the evaluation for the surprise dataset: you are not supposed to optimise your system on this dataset. To run the evaluation on a submission, do:

bash <submission_zipfile> <embedding> <distance>

where <submission_zipfile> is the name of the .zip file containing your submission, <embedding> is either test, auxiliary_embedding1, or auxiliary_embedding2, and <distance> is dtw_cosine, dtw_kl or levenshtein (see Evaluation metrics).

For example, to evaluate the precompiled results with the Levenshtein distance:

bash baseline/ test levenshtein

The evaluation process should take about 15 minutes to run. The output in this example should give an ABX score of around 34.7% error, and an estimated bitrate of 72 bits/sec.

Output Submission


The results must be submitted as an archive called to the competition page on Codalab. Each team is only allowed TWO submissions. You can use them to vary the bitrate/quality balance, or to submit one system using the parallel dataset, and one without. If you submitted something by mistake, please let us know ASAP. Once the human evaluation is launched, we cannot cancel it.

Using the baseline system

To re-train the complete baseline system

Run baseline/, passing as arguments the name of the corpus and the number of epochs for which to train the unit discovery. For example, to do the same training as was done in the sample submission, do (from the home directory):

bash baseline/ english 10

This will train the unit discovery system and the two synthesis voices on the full development corpus. It will take a long time (twelve hours or so). To simply verify that the baseline will run, replace english with english_small, and replace the number of epochs with something small (e.g., 1).

The learned unit discovery models will be stored in baseline/training/beer/recipes/zrc2019/exp/<corpus_name>/aud. A complete decoding of the Voice training subset and the Unit discovery training subset, in a symbolic version of the learned representation, can also be found in that folder, under trans_voice.txt and trans_unit.txt.

The learned synthesis voices will be stored in $HOME/baseline/training/ossian/voices/<corpus_name>.

To re-train the synthesizer only (using different symbolic embeddings)

The unit discovery component of the baseline training script generates a decoding of the Voice training subset, in terms of unsupervised units, which is then passed on to Ossian for training. For participants not developing their own synthesis system, it is possible to train Ossian using any other set of discovered units. Only one-hot vectors are supported. (See Output Preparation for an explanation of the format and information about submitting and evaluating auxiliary embeddings.)

To do only the synthesis voice training component, first generate one-hot decodings of all the files in the Voice subset, in the challenge submission format, in some folder. Then, run, as follows (from the home directory),

bash baseline/ <decoding_folder> <corpus_name> <clean|noclean>

where <decoding_folder> is the folder in which the decodings of the Voice files are contained, <corpus_name> is the name of the corpus, and the third argument indicates whether existing models should be removed before training. For example,

bash baseline/ voice_txts/ english clean

The learned synthesis voices will be stored in $HOME/baseline/training/ossian/voices/<corpus_name>.

To generate test outputs (embeddings and wavs) from a trained baseline model

Run as follows (from the home directory):

bash baseline/ <submission_dir> <zip|nozip> <corpus_name>

The first argument is the directory where the submission will be generated. If the second argument is zip, then, additionally, a zip file containing the submission will be created in the folder above <submission_dir>. The third argument is the name of the corpus on which the models were trained, which will determine the test corpus for which to generate the submission. For example,

bash $HOME/baseline/ $HOME/mysubmission zip english

will look for trained unit and synthesis models stored under english in the relevant subfolders (see above) and generate the embeddings and synthesis for the test stimuli in the English development corpus. The resulting files will be stored under $HOME/mysubmission, and a zip file containing the contents of this folder will be saved as $HOME/

(See Submission procedure for more information about what is required in a complete submission.)

The last argument can also be omitted to create a submission combining both the English development and the surprise language. In this case, the script will first invoke itself with english as the last argument, and then with surprise. The zip file, if requested, will be created at the end.

To synthesize test outputs from other symbolic embeddings

For participants not developing their own synthesis system, it is possible to run Ossian to synthesize the test stimuli using any other one-hot encoding of the test stimuli, provided that an appropriate synthesis model has been trained using this encoding (see above). (See Output preparation for an explanation of the format and information about submitting and evaluating auxiliary embeddings.)

To do only the synthesis for the test stimuli, first generate one-hot decodings of the files in the Test subset, in the challenge submission format, into the appropriate folder in your submission (see Submission Procedure for the submission structure). Then, run, with that folder as the first argument, and the name of the corpus as the second argument. For example,

bash $HOME/baseline/ mysubmission/test/ english english

This will use the trained Ossian voices called english to generate synthesis for the test files in the English development corpus on the basis of the decodings contained in mysubmission/test/, storing the resulting wav files in mysubmission/test/.

Note that only a subset of the Test files need to be synthesized, and only this subset will be looked for by this script. The list of files for which a synthesis is required, and an indication of the voice in which they need to be synthesized, is contained in synthesis.txt in the root folder of the dataset. However, in a complete submission, an embedding file must be provided for each of the Test audio files. See Submission procedure for the submission structure.
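The two requirements above (an embedding for every Test file, a synthesis only for the listed subset) can be checked with a few lines of Python. The sketch makes two assumptions that are not stated in this document: each embedding file carries the same name as its test audio file with a .txt extension, and the content of synthesis.txt has already been loaded into a dictionary mapping test file stems to voice names. The function name and sample names are hypothetical.

```python
from typing import Dict, Iterable, List


def missing_outputs(test_stems: Iterable[str],
                    synthesis_voices: Dict[str, str],
                    present_files: Iterable[str]) -> Dict[str, List[str]]:
    """List the embedding and wav files that a submission folder still lacks."""
    present = set(present_files)
    # Every test file needs an embedding (assumed name: <SXXX>_<ID>.txt).
    missing_txt = sorted(f"{stem}.txt" for stem in test_stems
                         if f"{stem}.txt" not in present)
    # Only the files listed for synthesis need a wav, named <VXXX>_<ID>.wav.
    missing_wav = []
    for stem, voice in sorted(synthesis_voices.items()):
        file_id = stem.partition("_")[2]
        wav_name = f"{voice}_{file_id}.wav"
        if wav_name not in present:
            missing_wav.append(wav_name)
    return {"embeddings": missing_txt, "wavs": missing_wav}


report = missing_outputs(
    test_stems=["S001_0000000001", "S002_0379088085"],
    synthesis_voices={"S002_0379088085": "V002"},
    present_files=["S002_0379088085.txt", "V002_0379088085.wav"],
)
print(report)
```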

Paper Submission

This 2019 Challenge is targeted at an Interspeech 2019 Special Workshop. Submitted papers will be reviewed by an external panel, and participants will receive the response by email. Authors of the selected papers will present their work during the Workshop, either as a talk or as a poster plus flash presentation.