Spoken term discovery / word segmentation

How to participate

For any issues please email us at issue@zerospeech.com

Choosing a train dataset

For training, you can use the zrc2017-train-dataset or the zrc2015-dataset, depending on your use case. The datasets can be downloaded using our toolkit or directly from the URLs provided in our repository. For more details on the train datasets, see the datasets section.

To download the dataset you can run the command zrc datasets:pull zrc2017-train-dataset

Using our toolkit

We recommend installing and using our toolkit to manage, evaluate and upload your submissions. The toolkit is a Python package containing evaluation scripts, scripts to download datasets and other relevant files, and scripts that facilitate uploading results to the leaderboards. Instructions on how to download and use the toolkit are available here.

Submission Preparation

Each benchmark requires a specific set of files to be prepared.

To facilitate this, you can use the zrc submission:init <name> <location> command from the toolkit to create an empty submission template folder, where <name> is the name of the benchmark (tde15, tde17) and <location> is the path where the directory will be created.

`meta.yaml`

This file contains meta information about the author and how this submission was created.

example :

model_info:
  model_id: null
  gpu_budget: 60
  system_description: "CPC-big (trained on librispeech 960), kmeans (trained on librispeech 100), LSTM. See https://zerospeech.com/2021 for more details."
  train_set: "librispeech 960, librispeech 100"
publication:
  author_label: "Nguyen et al."
  authors: "Nguyen, T., Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E. & Dupoux, E."
  paper_title: "The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling."
  paper_url: "https://arxiv.org/abs/2011.11589"
  publication_year: 2021
  institution: "EHESS, ENS, PSL Research University, CNRS and Inria"
  team: "CoML Team"
code_url: "https://github.com/zerospeech/zerospeech2021_baseline"
open_source: true

To Note

While most of the information in meta.yaml is optional, we would appreciate it if you took the time to fill it in, as it allows us to verify submissions and keep track of all the systems that use our benchmarks.

We would also appreciate it if you made your code open source and provided a link to it, although we understand that this is not always possible.

The model_id parameter is generated when you submit a system to our backend. If you wish to submit the same system to multiple benchmarks, keep the model_id the same so that our system can link the submissions.

`params.yaml`

This file contains various parameters that can override the defaults of each benchmark.

njobs: <int> specifies the number of processes to use for evaluation acceleration.

`model outputs`

The spoken word discovery system should output an ASCII file listing the set of fragments that were found with the following format:

Class <classnb>
<filename> <fragment_onset> <fragment_offset>
<...>
<filename> <fragment_onset> <fragment_offset>
<NEWLINE>
Class <classnb>
<filename> <fragment_onset> <fragment_offset>

For example:

Class 1
dsgea01   1.238  1.763
dsgea19   3.380  3.821
reuiz28  18.036 18.537

Class 2
zeoqx71   8.389  9.132
...etc...

The onset and offset are in seconds. If your system only does matching and not clustering, your classes will have only two elements each. If your system does matching as well as clustering and parsing, the fragments found will cover the entirety of the files, and there may be classes with only one element in them (the remainder of lexical-based segmentation).
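As an illustration of the format above, the following Python sketch renders a dictionary of discovered classes into that text layout and parses it back. The helper names format_classes and parse_classes are hypothetical and not part of the toolkit. Single-space separators, three-decimal timestamps and a blank line after the last class are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# One fragment: (filename, onset in seconds, offset in seconds)
Fragment = Tuple[str, float, float]


def format_classes(classes: Dict[int, List[Fragment]]) -> str:
    """Render classes as 'Class <classnb>' blocks, each followed by a blank line."""
    blocks = []
    for class_nb, fragments in sorted(classes.items()):
        lines = [f"Class {class_nb}"]
        for filename, onset, offset in fragments:
            if offset <= onset:
                raise ValueError(f"offset must be after onset: {filename}")
            # Assumption: three decimals, as in the example above.
            lines.append(f"{filename} {onset:.3f} {offset:.3f}")
        blocks.append("\n".join(lines) + "\n\n")
    return "".join(blocks)


def parse_classes(text: str) -> Dict[int, List[Fragment]]:
    """Parse the text layout back into a {class number: fragments} mapping."""
    classes: Dict[int, List[Fragment]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line closes the current class
            continue
        if line.startswith("Class "):
            current = int(line.split()[1])
            classes[current] = []
            continue
        if current is None:
            raise ValueError(f"fragment outside of a class: {line!r}")
        filename, onset, offset = line.split()
        classes[current].append((filename, float(onset), float(offset)))
    return classes
```

The string returned by format_classes can then be written to the per-language file of the submission, for example english.txt.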

Structure of files for each tde benchmark:

tde15

meta.yaml
params.yaml
english.txt
xitsonga.txt

tde17

meta.yaml
params.yaml
english.txt
french.txt
mandarin.txt
german.txt
wolof.txt
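Before running the evaluation, it can be useful to check that a submission directory contains every file listed above. The sketch below does this with the standard library only. The names REQUIRED_FILES and missing_files are hypothetical helpers, not toolkit functions; the file lists are copied from the listings above.

```python
from pathlib import Path
from typing import Dict, List

# Expected files per benchmark, copied from the listings above.
REQUIRED_FILES: Dict[str, List[str]] = {
    "tde15": ["meta.yaml", "params.yaml", "english.txt", "xitsonga.txt"],
    "tde17": [
        "meta.yaml",
        "params.yaml",
        "english.txt",
        "french.txt",
        "mandarin.txt",
        "german.txt",
        "wolof.txt",
    ],
}


def missing_files(submission_dir: str, benchmark: str) -> List[str]:
    """Return the expected files that are absent from submission_dir."""
    root = Path(submission_dir)
    return [
        name
        for name in REQUIRED_FILES[benchmark]
        if not (root / name).is_file()
    ]
```

An empty list means all expected files are present. This only checks file names, not their contents.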

Running the evaluation

Once the submission has been created, you can run the evaluation. Depending on your benchmark choice, use one of the following commands:

zrc benchmarks:run tde17 </path/to/submission> -o scores_dir
zrc benchmarks:run tde15 </path/to/submission> -o scores_dir

Your results are created in the scores_dir directory.

You can run a partial evaluation using the -t, --tasks option to select specific sub-tasks. For example:

zrc benchmarks:run tde17 </path/to/submission> -o scores_dir -t english french mandarin

This runs the evaluation only on those languages, skipping German and Wolof. This can be useful during development. When uploading results to our leaderboards, please evaluate on all the languages, as this makes system comparison more meaningful.
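If you want to launch the evaluation from a Python script, a thin wrapper around the command line is one option. The sketch below uses only the options shown above (-o and -t) and assumes the zrc executable is on your PATH. The functions build_run_command and run_evaluation are hypothetical helpers, not part of the toolkit.

```python
import subprocess
from typing import List, Optional, Sequence


def build_run_command(
    benchmark: str,
    submission_dir: str,
    scores_dir: str,
    tasks: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build the argument list for 'zrc benchmarks:run'."""
    cmd = ["zrc", "benchmarks:run", benchmark, submission_dir, "-o", scores_dir]
    if tasks:
        # Restrict the evaluation to the given sub-tasks (languages).
        cmd += ["-t", *tasks]
    return cmd


def run_evaluation(
    benchmark: str,
    submission_dir: str,
    scores_dir: str,
    tasks: Optional[Sequence[str]] = None,
) -> None:
    """Run the evaluation; raises CalledProcessError if the command fails."""
    cmd = build_run_command(benchmark, submission_dir, scores_dir, tasks)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the submission path contains spaces.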

Uploading Results

DEV-NOTE: The upload functionality will become available in January 2023

We would appreciate it if you uploaded your results so that we can compile them into our leaderboards. This helps us in several ways:

It allows us to follow new systems that are evaluated on our benchmarks and compare them.
It also helps us with creating a central place where all systems trying to solve unsupervised speech processing can be indexed.
It shows that interest in our benchmarks is still active and motivates us to create more.

To submit your results, you need an account on our website. If you do not have one yet, you can follow this link to create your account.

Using the toolkit, create a local session by running zrc user:login and providing your username and password.

Once this is done, you can upload using the following command: zrc submit <submission_dir>

To submit your scores, you need to include all the required files in the same directory.

source files: (embeddings/probabilities) these are files extracted from your model.
score files: these are the result of the evaluation process.
params.yaml: these are the parameters of the evaluation process.
meta.yaml: generic information on submission

Multiple Submissions

If your system can be used for multiple tasks (for example, Task 1 and Task 3, or Task 1 and Task 4), you are strongly encouraged to make submissions to all the tasks you can. To link the submissions of a single system, use the same model_id in each meta.yaml; this identifier is auto-generated after the first submission.