Zerospeech 2015
Summary
System requirements
Any recent Linux system can be used. If the package does not run, try installing from source. The evaluation runs faster on a multicore machine; ten cores is a good number.
The start up kit
We provide a basic kit to get things started. It contains a sample dataset (in English), together with the packages to do the evaluations. They can be downloaded here:
- sample dataset (.wav) (65.3 Mb)
- evaluation kit Track 1
  - binary executable (Linux 64 bits) (170 Mb)
  - source code (73.8 Mb) – see below for more information
- evaluation kit Track 2
You can download this dataset and evaluation software without committing to the challenge. To register, send an email to contact@zerospeech.com and follow the instructions in the github repository.
General organisation of the start up and evaluation kits
The start up and evaluation kits are organized in a similar fashion. We give instructions for the start up kit below, but they can be adapted for the evaluation kit in a straightforward fashion.
Once downloaded, the kit will contain the following files:
samplewav/                   # contains sample *.wav
sampleeval1/                 # evaluation software for task 1
    eval1*                   # the executable
    resources/               # the code, libraries, etc. (do not enter)
    HTKposteriors/           # sample posteriorgrams
    MFCC/                    # sample MFCCs
sampleeval2/                 # evaluation software for task 2
    sample_eval2*            # the executable
    resources/               # the guts of the eval2 software (do not enter)
    sample.classes.example   # example output for the ``sample'' dataset
To run the Track 1 evaluations:
Usage: eval1 [options] <feature_file> <output-directory>
options:
  -h                   # print help message with details
  -j <int>             # number of CPUs to use (default 1)
  -kl                  # DTW+KL distance (DTW+cosine is the default)
  -d <distancemodule>  # define your own distance
                       # (in the format `path/module.function')
  -csv                 # outputs a csv file with detailed results
  <feature_file>       # a text file containing the frame-by-frame
                       # values of the speech features to be
                       # evaluated (see below for the precise format)
  <output-directory>   # name of a directory where the results will be stored
Example:
$ cd sampleeval1
$ ./eval1 MFCC MFCCscore # by default, the distance is DTW+cosine
... lots of progress information ... # takes about 3-5 minutes
{'within_talkers': 24.3, 'across_talkers': 32.5}  # ABX error rates for each task
                                                  # (in percent; lower is better)
$ ./eval1 -kl HTKposteriors HTKscore # you can specify a distance other than cosine
{'within_talkers': 20.3, 'across_talkers': 23.9}
Note:
The posteriorgrams are provided for illustration purposes only; they were obtained through a not particularly optimized HTK pipeline using a monophone model (PER: 42%). The output directory contains a text file with the above results (called results.txt) plus, if the -csv option is used, .csv files with the detailed results per minimal pair and talker (called DATASET_across.csv and DATASET_within.csv, where DATASET corresponds to one of the datasets provided with the challenge). The directory will also contain a file called VERSION_$ indicating the version of the evaluation code that was used. Please make sure to report that number in your report.
To run the Track 2 evaluations on the provided sampleset:
usage: sample_eval2 [-h] [-v] [-j N_JOBS] [-V]
                    DISCCLSFILE DESTINATION
Evaluate spoken term discovery
positional arguments:
DISCCLSFILE discovered classes
DESTINATION location for the evaluation results
optional arguments:
-h, --help show this help message and exit
-v, --verbose display progress
-j N_JOBS, --n-jobs N_JOBS number of cores to use
-V, --version show program's version number and exit
For example, to run the evaluation on the provided output (sample.classes.example) for the sample dataset:
$ cd sample_eval2
$ ./sample_eval2 sample.classes.example outputdir
To evaluate your own system’s output on the provided English dataset and print progress information along the way:
$ ./english_eval2 my_output.classes outputdir -v
To run the evaluation on multiple cores, use the -j flag. As an indication of the runtime of the program, the evaluation of the English dataset with a gold output takes 20 minutes using two 3.2 GHz cores. Top RAM usage is about 10 GB; note that this will increase with the number of cores used. Evaluation runtime and memory usage also depend strongly on the particulars of the input file. It is not useful to use more than 10 cores (each parallel job will do one of the 10 subsampling folds).
The output directory will contain one file for each of the measures described above, with scores for both cross-speaker and within-speaker performance. The directory will also contain a file called VERSION_$ indicating the version of the evaluation code that was used. Please make sure to report that number in your report. The version number can also be obtained by:
$ ./sample_eval2 -V
$ ./english_eval2 -V
File formats
Speech Datasets
The recordings have been cut into sentence-sized files. They are in WAV format (16 kHz, 16 bit). The speaker identity can be obtained for each file by taking characters 11 to 13 of the file name.
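For instance, a minimal Python sketch of this lookup (assuming the character positions are counted from 1, so characters 11 to 13 correspond to the zero-based slice [10:13]) could be:
import os

def speaker_id(wav_path):
    # characters 11 to 13 of the file name (1-based) -> slice [10:13]
    return os.path.basename(wav_path)[10:13]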
Track 1 feature file format
Our evaluation system requires that your unsupervised subword modeling
system outputs a vector of feature values for each frame. For each
utterance in the set (e.g. aghsu09.wav
), an ASCII features file
with the same name (e.g. aghsu09.fea
) as the utterance should be
generated with the following format:
<time> <val1> ... <valN>
<time> <val1> ... <valN>
For example:
0.0125 12.3 428.8 -92.3 0.021 43.23
0.0225 19.0 392.9 -43.1 10.29 40.02
...
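As an illustration, here is a minimal Python sketch (not part of the kit) that writes a feature matrix in this format; the 13-dimensional features, the 10 ms frame spacing and the 12.5 ms first frame center are assumptions made for the example:
import numpy as np

def write_fea(path, times, features):
    # one line per frame: '<time> <val1> ... <valN>', space separated
    with open(path, 'w') as f:
        for t, frame in zip(times, features):
            f.write('%g %s\n' % (t, ' '.join('%g' % v for v in frame)))

# hypothetical example: 200 frames of 13-dimensional features,
# 10 ms apart, first frame centered at 12.5 ms
features = np.random.randn(200, 13)
times = 0.0125 + 0.01 * np.arange(len(features))
write_fea('aghsu09.fea', times, features)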
Track 2 output format
The spoken word discovery system should output an ASCII file listing the set of fragments that were found, with the following format:
Class <classnb>
<filename> <fragment_onset> <fragment_offset>
<...>
<filename> <fragment_onset> <fragment_offset>
<NEWLINE>
Class <classnb>
<filename> <fragment_onset> <fragment_offset>
For example:
Class 1
dsgea01 1.238 1.763
dsgea19 3.380 3.821
reuiz28 18.036 18.537
Class 2
zeoqx71 8.389 9.132
...etc...
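For illustration, a short Python sketch (not part of the kit) that writes a set of discovered fragments in this format might look as follows; the layout of the input dictionary is an assumption made for the example:
def write_classes(path, classes):
    # 'classes' maps a class number to a list of (filename, onset, offset)
    # tuples, with onsets and offsets in seconds
    with open(path, 'w') as f:
        for classnb, fragments in sorted(classes.items()):
            f.write('Class %d\n' % classnb)
            for fname, onset, offset in fragments:
                f.write('%s %.3f %.3f\n' % (fname, onset, offset))
            f.write('\n')  # a blank line separates classes

# hypothetical example reproducing the listing above
write_classes('my_output.classes', {
    1: [('dsgea01', 1.238, 1.763),
        ('dsgea19', 3.380, 3.821),
        ('reuiz28', 18.036, 18.537)],
    2: [('zeoqx71', 8.389, 9.132)],
})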
Track 1 details
Processing time
The evaluation can be divided into two components:
- One component whose running time is independent of the particular features and distance used in the evaluation. It takes around 30 minutes to run on the English test set on an average machine. This component is not parallelized and will always use only one CPU, irrespective of the -j option specified.
- One component whose running time depends on the time it takes to compute distances between the features corresponding to the various triphones. On the English test set there is a total of 42 158 722 distances to compute. The typical duration of a triphone is on the order of 150-250 ms, i.e. around 20 frames for a 10 ms between-frame spacing, and our implementation of DTW + cosine distance takes about 0.2 ms to compare two such sequences of roughly 20 frames of 13 MFCC coefficients each. This yields an expected execution time of about 2 h 20 min (42 158 722 * 0.0002 seconds) for this component when using 13 MFCC coefficients as features with DTW + cosine as a distance. This component is parallelized and its effective duration should be inversely proportional to the number of CPUs specified with the -j option; for example, with 10 cores the expected execution time is about 14 minutes.
Using your own distance
To see how it is possible to provide your own distance, let us show
first how it is possible to obtain the default DTW+cosine distance
using the -d
option. The distance function used by default (DTW +
cosine) is defined in the python script:
sampleeval1/ressources/distance.py
by the function named distance
. So calling the eval1
executable
from the sampleeval1
folder with the option:
-d ./ressources/distance.distance
will reproduce the default behavior.
Now, to define your own distance function, you can for example copy the file sampleeval1/ressources/distance.py to a directory dir somewhere on your system, modify the distance function definition to suit your needs, and call eval1 with the option:
-d dir/distance.distance
You will see that the distance.py script begins by importing three other python modules: one for DTW, one for cosine distance and one for Kullback-Leibler divergence. The cosine and Kullback-Leibler modules are located in the folder sampleeval1/src/ABXpy/distances/metrics and implement frame-to-frame distance computations in a fashion similar to the scipy.spatial.distance.cdist function from the scipy python library. The DTW module is also located in the folder sampleeval1/src/ABXpy/distances/metrics, but, for efficiency reasons, as a static library (dtw.so) compiled from the cython source file sampleeval1/src/ABXpy/distances/metrics/install/dtw.pyx. You can use our optimized DTW implementation with any frame-to-frame distance function that has a synopsis like the scipy.spatial.distance.cdist function by modifying your copy of distance.py appropriately. You can also replace the whole distance computation by any python or cython module that you designed, as long as it has the same input and output format as the distance function in the distance.py script.
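As an illustration, the sketch below is a self-contained replacement module that computes a DTW alignment cost over a Euclidean frame-to-frame distance using only numpy and scipy (it does not use the kit's optimized dtw.so, so it will be slower). The assumed signature, two 2-D arrays of shape (n_frames, n_features) returning a single scalar, should be checked against the distance function in your copy of distance.py:
# my_distance.py -- illustrative custom distance module (a sketch, not part of the kit)
import numpy as np
from scipy.spatial.distance import cdist

def distance(x, y):
    # assumed inputs: two arrays of shape (n_frames, n_features)
    # assumed output: a single scalar dissimilarity
    d = cdist(x, y, metric='euclidean')   # frame-to-frame distance matrix
    n, m = d.shape
    acc = np.full((n, m), np.inf)         # accumulated DTW cost
    acc[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = d[i, j] + best
    # normalize by the sequence lengths so that longer items are not penalized
    return acc[-1, -1] / (n + m)
If this file were saved as dir/my_distance.py, it would be selected with the option -d dir/my_distance.distance.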
Troubleshooting
If the linux executables do not work directly for you, you might want to try installing from source. This is not the recommended solution; please try the provided executables first.
It is strongly recommended that you use python anaconda, which is a self-contained scientific python installation containing most of the libraries and dependencies needed by this software. Python anaconda does not require admin privileges and can be installed in any directory on your system. You can use a virtual environment to isolate it completely from the rest of your system.
To install with anaconda, go to the src folder and type:
pip install h5features
make install
If PyTables fails building, try:
pip install numexpr
pip install tables
If you really don’t want to use anaconda, check out the README.rst and requirements.txt files in the src folder.
Multiprocessing.py
The parallelisation of our program relies on the multiprocessing module from python’s standard library, which can be a bit unstable. If you experience problems when running the evaluation, try requesting only one CPU (-j 1) so as to avoid using this module altogether.