Zerospeech 2015

System requirements

Any recent Linux system can be used. If the package does not run, try installing from source. The evaluation runs faster on a multicore machine; ten cores is a good number.

The start up kit

We provide a basic kit to get things started. It contains a sample dataset (in English), together with the packages to do the evaluations. They can be downloaded here:

You can download this dataset and evaluation software without committing to the challenge. To register, send an email to and use this github repository for instructions.

General organisation of the start up and evaluation kits

The start up and evaluation kits are organized in a similar fashion. We give instructions for the start up kit below, but they can be adapted for the evaluation kit in a straightforward fashion.

Once downloaded, the kit will contain the following files:

  samplewav/                 # contains sample *.wav
  sampleeval1/               # evaluation software for task 1
      eval1*                 # the executable
      resources/             # the code, libraries, etc (do not enter)
      HTKposteriors/         # sample posteriorgram
      MFCC/                  # sample mfcc
  sampleeval2/               # evaluation software for task 2
      sample_eval2*          # the executable
      resources/             # the guts of the eval2 software (do not enter)
      sample.classes.example # example output for the ``sample'' dataset

To run the Track 1 evaluations:

  Usage: eval1 [options] <feature_file> <output-directory>
      -h                  # print help message with details
      -j <int>            # number of CPUs to use (default 1)
      -kl                 # DTW+KL distance (DTW+cosine is the default)
      -d <distancemodule> # define your own distance
                          # (in the format `path/module.function')
      -csv                # outputs a csv file with detailed results

  <feature_file>          # a text file containing the frame by frame
                          # values of the speech features to be
                          # evaluated (see below for the precise format)
  <output-directory>      # name of a directory where the results will be stored


  $ cd sampleeval1
  $ ./eval1 MFCC MFCCscore        # by default, the distance is DTW+cosine

  ... lots of progress information ...  # takes about 3-5 minutes

  {'within_talkers': 24.3, 'across_talkers': 32.5} # ABX discriminability scores
                                                   # for each task
                                                   # (between 0 and 1, 1 being best)

  $ ./eval1 -kl HTKposteriors HTKscore  # you can specify a distance other than cosine

  {'within_talkers': 20.3, 'across_talkers': 23.9}

To run the Track 2 evaluations on the provided sampleset:

   usage: sample_eval2 [-h] [-v] [-j N_JOBS] [-V] DISCCLSFILE DESTINATION

   Evaluate spoken term discovery

   positional arguments:
       DISCCLSFILE               discovered classes
       DESTINATION               location for the evaluation results

   optional arguments:
      -h, --help                 show this help message and exit
      -v, --verbose              display progress
      -j N_JOBS, --n-jobs N_JOBS number of cores to use
      -V, --version              show program's version number and exit

For example, to run the evaluation on the provided output (sample.classes.example) for the sample dataset:

  $ cd sampleeval2
  $ ./sample_eval2 sample.classes.example outputdir

To evaluate your own system’s output on the provided English dataset and print progress information along the way:

  $ ./english_eval2 my_output.classes outputdir -v

To run the evaluation on multiple cores, use the -j flag. As an indication of the runtime, evaluating the English dataset with a gold output takes 20 minutes on two 3.2 GHz cores, with peak RAM usage of about 10 GB. Memory usage increases with the number of cores used, and both runtime and memory usage depend strongly on the particulars of the input file. Using more than 10 cores brings no benefit, since each parallel job processes one of the 10 subsampling folds.
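The advice on the number of cores can be sketched in Python as follows. This is only an illustration: the helper names choose_n_jobs and build_eval2_command are hypothetical and not part of the kit; only the executable name and the -j and -v flags are taken from the usage message above.

```python
import os

MAX_USEFUL_JOBS = 10  # one parallel job per subsampling fold, as stated above


def choose_n_jobs(available_cores):
    """Return a job count between 1 and MAX_USEFUL_JOBS."""
    return max(1, min(available_cores, MAX_USEFUL_JOBS))


def build_eval2_command(classes_file, output_dir, n_jobs,
                        executable="./english_eval2"):
    """Build the argument list for a Track 2 evaluation run (hypothetical helper)."""
    return [executable, classes_file, output_dir, "-v", "-j", str(n_jobs)]


if __name__ == "__main__":
    n_jobs = choose_n_jobs(os.cpu_count() or 1)
    print(" ".join(build_eval2_command("my_output.classes", "outputdir", n_jobs)))
```

Since memory usage grows with the number of cores, a smaller job count than the one chosen here may be preferable on machines with limited RAM.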

The output directory will contain one file for each of the above-described measures, with scores for both cross-speaker and within-speaker performance. The directory will also contain a file called VERSION_$ indicating the version of the evaluation code that was used. Please make sure to include that version number in your report. The version number can also be obtained by:

  $ ./sample_eval2 -V
  $ ./english_eval2 -V

File formats

Speech Datasets

The recordings have been cut into sentence-sized files. They are in wav format (16 kHz, 16 bits). The speaker identity can be obtained for each file by taking characters 11 to 13 of the file name.
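A minimal Python sketch of this extraction is given below. It assumes that "characters 11 to 13" are counted from 1 and are inclusive, and that the directory part of the path is ignored; the function name speaker_id is hypothetical.

```python
import os


def speaker_id(wav_path):
    """Return characters 11 to 13 (1-based, inclusive) of the file name."""
    name = os.path.basename(wav_path)
    if len(name) < 13:
        raise ValueError("file name too short to hold a speaker identity: %r" % name)
    # 1-based characters 11..13 correspond to the 0-based slice [10:13]
    return name[10:13]
```

If the counting convention turns out to differ, only the slice bounds need to change.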

Track 1 feature file format

Our evaluation system requires that your unsupervised subword modeling system outputs a vector of feature values for each frame. For each utterance in the set (e.g. aghsu09.wav), an ASCII features file with the same name as the utterance (e.g. aghsu09.fea) should be generated, with the following format:

  <time> <val1>    ... <valN>
  <time> <val1>    ... <valN>

For example:

  0.0125 12.3 428.8 -92.3 0.021 43.23
  0.0225 19.0 392.9 -43.1 10.29 40.02
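As an illustration, such a file could be written and read back in Python along the following lines. This is a sketch under assumptions: fields are taken to be separated by single spaces, times are taken to be in seconds, and the function names write_features and read_features are hypothetical.

```python
def write_features(path, times, frames):
    """Write one line per frame: the time followed by the feature values."""
    if len(times) != len(frames):
        raise ValueError("times and frames must have the same length")
    with open(path, "w") as out:
        for time, frame in zip(times, frames):
            fields = [repr(float(time))] + [repr(float(v)) for v in frame]
            out.write(" ".join(fields) + "\n")


def read_features(path):
    """Read a feature file back into a list of times and a list of frames."""
    times, frames = [], []
    with open(path) as src:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            times.append(float(fields[0]))
            frames.append([float(v) for v in fields[1:]])
    return times, frames
```

Using repr keeps the values exact when read back; a fixed number of decimals would work as well if smaller files are preferred.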

Track 2 output format

The spoken word discovery system should output an ASCII file listing the set of fragments that were found with the following format:

  Class <classnb>
  <filename> <fragment_onset> <fragment_offset>
  <filename> <fragment_onset> <fragment_offset>
  Class <classnb>
  <filename> <fragment_onset> <fragment_offset>

For example:

  Class 1
  dsgea01   1.238  1.763
  dsgea19   3.380  3.821
  reuiz28  18.036 18.537

  Class 2
  zeoqx71   8.389  9.132
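A short Python sketch that produces this format from a list of discovered classes is shown below. The function name format_classes is hypothetical, and the sketch assumes that classes are numbered from 1 and separated by a blank line, as in the example, and that three decimals are sufficient for the onsets and offsets.

```python
def format_classes(classes):
    """Format discovered classes as text.

    `classes` is a list of classes; each class is a list of
    (filename, onset, offset) tuples, with times in seconds.
    """
    lines = []
    for class_nb, fragments in enumerate(classes, start=1):
        lines.append("Class %d" % class_nb)
        for filename, onset, offset in fragments:
            if offset <= onset:
                raise ValueError("fragment offset must be after its onset")
            lines.append("%s %.3f %.3f" % (filename, onset, offset))
        lines.append("")  # blank line after each class, as in the example
    return "\n".join(lines)
```

The returned string can then be written to the file that is passed to the Track 2 evaluation.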

Track 1 details

Processing time

The evaluation can be divided in two components:

  • One component whose running time is independent of the particular features and distance used in the evaluation. It takes around 30 min to run on the English test set on an average machine. This component isn’t parallelized and will always use only one CPU, irrespective of the -j option specified.

  • One component whose running time depends on the time it takes to compute distances between the features corresponding to various triphones. On the English test set, a total of 42 158 722 distances must be computed. A triphone typically lasts 150-250 ms, i.e. around 20 frames with a 10 ms spacing between frames. Our implementation of DTW + cosine distance typically requires 0.2 ms to compute a distance between two sequences of around 20 frames, where each frame is composed of 13 MFCC coefficients. This yields an expected execution time of about 2h20min (42 158 722 * 0.0002 seconds, i.e. about 8 432 seconds) for this component when using 13 MFCC coefficients as features with DTW + cosine as the distance. This component is parallelized, and its effective duration should be inversely proportional to the number of CPUs specified with the -j option. For example, with 10 cores the expected execution time in the previous example is about 14 min.
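The arithmetic behind this estimate can be reproduced with a few lines of Python. The sketch assumes perfect linear scaling with the number of cores, as the text suggests, and covers only the parallel component; the function name expected_minutes is hypothetical. With the figures quoted above, the product comes to about 8 432 seconds on a single core.

```python
N_DISTANCES = 42158722         # distances to compute on the English test set
SECONDS_PER_DISTANCE = 0.0002  # about 0.2 ms per DTW + cosine distance


def expected_minutes(n_cores):
    """Expected duration of the parallel component, in minutes."""
    if n_cores < 1:
        raise ValueError("at least one core is required")
    return N_DISTANCES * SECONDS_PER_DISTANCE / n_cores / 60.0


if __name__ == "__main__":
    for n_cores in (1, 2, 10):
        print("%2d core(s): about %.0f min" % (n_cores, expected_minutes(n_cores)))
```

A slower distance function or higher-dimensional features change SECONDS_PER_DISTANCE, which can be measured on a handful of frame sequences before launching a full run.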

Using your own distance

To see how you can provide your own distance, let us first show how to obtain the default DTW+cosine distance using the -d option. The distance function used by default (DTW + cosine) is defined in the python script:


by the function named distance. So calling the eval1 executable from the sampleeval1 folder with the option:

-d ./resources/distance.distance

will reproduce the default behavior.

To define your own distance function, you can for example copy the file:


into a directory dir somewhere on your system, modify the definition of the distance function to suit your needs, and call eval1 with the option:

-d dir/distance.distance

You will see that the script begins by importing three other python modules: one for DTW, one for cosine distance and one for Kullback-Leibler divergence. The cosine and Kullback-Leibler modules are located in the folder:


and implement frame-to-frame distance computations in a fashion similar to the scipy.spatial.distance.cdist function from the scipy python library. The DTW module is also located in the folder:


but as a static library compiled from the cython source file:


for efficiency reasons. You can use our optimized DTW implementation with any frame-to-frame distance function whose synopsis resembles that of the scipy.spatial.distance.cdist function, by modifying your copy of the script appropriately. You can also replace the whole distance computation by any python or cython module that you designed, as long as it has the same input and output format as the distance function in the script.
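To give an idea of what a replacement module might look like, here is a pure-Python sketch of a DTW + cosine distance. It is not the implementation shipped with the kit. It assumes that the distance function receives two sequences of frames (each frame being a sequence of numbers) and returns a single number; the actual input and output format must be checked against the distance function in the provided script. The DTW below returns the unnormalized cost of the cheapest alignment, which may differ from what the kit computes.

```python
import math


def cosine_distance(u, v):
    """Frame-to-frame cosine distance: 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero frame")
    return 1.0 - dot / (norm_u * norm_v)


def dtw(x, y, frame_distance):
    """Cost of the cheapest monotonic alignment between sequences x and y."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i frames of x
    # with the first j frames of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # x advances alone
                                 cost[i][j - 1],      # y advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def distance(x, y):
    """DTW + cosine distance between two frame sequences (assumed interface)."""
    return dtw(x, y, cosine_distance)
```

A pure-Python loop like this one is far slower than a compiled DTW, so with tens of millions of distances to compute it is mainly useful for checking a new frame-to-frame distance on small inputs.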


If the Linux executables do not work directly for you, you might want to try installing from source. This is not the recommended solution; please try using the provided executables first.

It is strongly recommended that you use Anaconda, a self-contained scientific python installation that contains most of the libraries and dependencies needed by this software. Anaconda does not require administrator privileges and can be installed in any directory on your system. You can use a virtual environment to isolate it completely from the rest of your system.

To install with anaconda, go to the src folder and type:

  pip install h5features
  make install

If PyTables fails building, try:

  pip install numexpr
  pip install tables

If you really don’t want to use anaconda, check out the README.rst and requirements.txt files in the src folder.

The parallelisation of our program relies on a module from python’s standard library that can be a bit unstable. If you experience problems when running the evaluation, try requesting only one CPU to avoid using this module altogether.