Benchmarks and datasets

Three benchmark datasets and four benchmarks exist:

  • zr2015: Data from English and Xitsonga, used to define the abx15 benchmark
  • zrc2017: Data from English, French, Mandarin, German, and Wolof, used to define the abx17 benchmark
  • abxLS: Data from English (LibriSpeech), used to define the abxLS benchmark

For more information about the details of the benchmarks, see Metrics explained.

Each dataset is also associated with a training set. Use of these training sets is strongly suggested, but the benchmarks are independent of the training sets. Systems must train using no labels or supervision other than speaker IDs.

Table. Characteristics of the different ZRC ABX Benchmark Datasets.

Dataset | Language | Dataset type | Train set (duration / speakers) | Test set (duration / speakers) | Availability
zr2015 | English | Buckeye conversations | same as test set | 5h, 12 spk | External: Download Buckeye Corpus
zr2015 | Xitsonga | NCHLT TIMIT-like | same as test set | 2h30, 24 spk | External: Download NCHLT Xitsonga corpus
zrc2017 | English | Librivox audiobook | 45h, 69 spk | 27h, 9 spk | Train and test available via ZRC Toolbox
zrc2017 | French | Librivox audiobook | 24h, 28 spk | 17h, 10 spk | Train and test available via ZRC Toolbox
zrc2017 | Mandarin | THCHS-30 read speech | 2h30, 12 spk | 25h, 4 spk | Train and test available via ZRC Toolbox
zrc2017 | German (LANG1) | Librivox audiobook | 25h, 30 spk | 11h, 10 spk | Train and test available via ZRC Toolbox
zrc2017 | Wolof (LANG2) | TIMIT-like | 10h, 14 spk | 5.9h, 4 spk | Train and test available via ZRC Toolbox
abxLS | English | LibriSpeech audiobook | LibriSpeech, LibriLight, etc. | dev/test x clean/other: 5h each, 40 spk (clean), 33 spk (other) | Test available via ZRC Toolbox; train sets: LibriSpeech, LibriLight
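
The durations in the table mix several notations (45h, 2h30, 5.9h). As a minimal sketch, the snippet below converts them to hours and totals the zrc2017 train and test sets. It assumes that "2h30" means 2 hours 30 minutes; the helper name parse_hours and the dictionary layout are illustrative and not part of the ZRC toolkit.

```python
import re

# Train/test durations for zrc2017, copied from the table above.
ZRC2017_DURATIONS = {
    "English": ("45h", "27h"),
    "French": ("24h", "17h"),
    "Mandarin": ("2h30", "25h"),
    "German": ("25h", "11h"),
    "Wolof": ("10h", "5.9h"),
}


def parse_hours(text):
    """Convert strings such as '45h', '2h30' or '5.9h' to hours as a float.

    Assumption: the digits after 'h' are minutes (so '2h30' is 2.5 hours).
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)h(\d{1,2})?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = float(match.group(1))
    minutes = int(match.group(2) or 0)
    return hours + minutes / 60


train_total = sum(parse_hours(train) for train, _ in ZRC2017_DURATIONS.values())
test_total = sum(parse_hours(test) for _, test in ZRC2017_DURATIONS.values())
print(f"zrc2017 train: {train_total:.1f}h, test: {test_total:.1f}h")
```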

zr2015 and abx15

The abx15 benchmark uses the zr2015 dataset.

This dataset is based on two external datasets which we do not have the right to distribute, but which can be downloaded freely from their respective websites:

  • Buckeye Corpus of conversational American English (Pitt et al., 2007): www.buckeyecorpus.osu.edu
  • NCHLT Xitsonga corpus of read speech in Xitsonga (Tsonga) (Barnard, 2014): https://sites.google.com/site/nchltspeechcorpus/home

Before evaluating on the abx15 benchmark, you need to download these datasets using the links just above. Then import them with our toolkit, using the following commands:

> zrc datasets:import zr2015-buckeye [/path/to/buckeye-corpus]
> zrc datasets:import zr2015-nchlt [/path/to/nchlt_tso]
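
If you prefer to script this step, the sketch below wraps the two import commands in Python. It is a hedged example, not part of the toolkit: it assumes the zrc command is on your PATH and takes the corpus directory as a positional argument, as in the commands above. The function names and the dry_run flag are hypothetical, and the paths are the same placeholders as above.

```python
import shutil
import subprocess
from pathlib import Path

# Dataset names come from the commands above; the paths are placeholders
# that you need to replace with your local corpus directories.
IMPORTS = {
    "zr2015-buckeye": Path("/path/to/buckeye-corpus"),
    "zr2015-nchlt": Path("/path/to/nchlt_tso"),
}


def build_import_command(name, corpus_dir):
    """Return the argument list for one `zrc datasets:import` call."""
    return ["zrc", "datasets:import", name, str(corpus_dir)]


def import_all(imports, dry_run=True):
    """Build every import command, and run it unless dry_run is set."""
    commands = []
    for name, corpus_dir in imports.items():
        cmd = build_import_command(name, corpus_dir)
        commands.append(cmd)
        if dry_run:
            print("would run:", " ".join(cmd))
            continue
        # Fail early with a clear message rather than deep inside the CLI.
        if not corpus_dir.is_dir():
            raise FileNotFoundError(f"corpus directory not found: {corpus_dir}")
        if shutil.which("zrc") is None:
            raise RuntimeError("the zrc command was not found on PATH")
        subprocess.run(cmd, check=True)
    return commands


commands = import_all(IMPORTS, dry_run=True)
```

Calling import_all(IMPORTS, dry_run=False) runs the imports for real, once the placeholder paths point to the downloaded corpora.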

The zr2015 dataset was used for the Zero Resource Speech Challenge 2015. Systems submitted to this challenge were required to train on the zr2015 dataset, and we continue to strongly encourage the use of this set to train systems evaluated on this benchmark. The train set is coextensive with the zr2015 test set.

zrc2017 and abx17

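The abx17 benchmark uses the zrc2017 dataset. This dataset comes from Librivox audiobooks (English, French, and German), the THCHS-30 read speech corpus (Mandarin; Wang et al., 2015), and a TIMIT-like Wolof corpus (Gauthier et al., 2016).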

The abx17 benchmark is split into small files of varying durations (1s, 10s, and 120s), in order to evaluate systems’ ability to (implicitly or explicitly) perform speaker normalization on the fly at test time. The same data is distributed in the three durations, to allow for comparison.
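
To see why file duration matters for on-the-fly speaker normalization, consider a toy model in which each speaker adds a constant offset to every feature frame and a system estimates that offset from the mean of a single file. The sketch below is an illustration under loud assumptions (100 frames per second, one-dimensional Gaussian features, a fixed offset). It is not how the benchmark is computed; it only shows that the estimate gets more reliable as files get longer.

```python
import random
import statistics

FRAME_RATE = 100  # frames per second; an assumed, typical feature rate


def mean_estimate_error(duration_s, speaker_offset=2.0, trials=50, seed=0):
    """Average absolute error when estimating a speaker offset from one file.

    Each simulated file holds duration_s * FRAME_RATE frames drawn from a
    unit-variance Gaussian centred on speaker_offset (a toy assumption).
    """
    rng = random.Random(seed)
    n_frames = int(duration_s * FRAME_RATE)
    errors = []
    for _ in range(trials):
        frames = [speaker_offset + rng.gauss(0.0, 1.0) for _ in range(n_frames)]
        errors.append(abs(statistics.fmean(frames) - speaker_offset))
    return statistics.fmean(errors)


# The three file durations used by the abx17 benchmark.
errors = {d: mean_estimate_error(d) for d in (1, 10, 120)}
for duration, error in errors.items():
    print(f"{duration:>3}s files: mean abs error of offset estimate = {error:.4f}")
```

In this toy setting the error shrinks roughly with the square root of the file length, which is one reason the 1s condition is the hardest for systems that normalize per file.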

Before evaluating on the abx17 benchmark, you need to download the zrc2017 dataset, which you can do using our toolkit with the following command:

> zrc datasets:pull zrc2017-test-dataset

This dataset was used for the Zero Resource Speech Challenge 2017. Systems submitted to this challenge were required to train on the separate zr2017-train dataset, which comes from the same source corpora but is disjoint from the main zrc2017 test set in both utterances and speakers. We continue to strongly encourage the use of this set to train systems evaluated on this benchmark.

In addition to the disjoint train and test split within each language, the zr2017-train set was split into development languages (English, French, and Mandarin) and test languages (German and Wolof). Participants in the Zero Resource Speech Challenge 2017 did not have access to the automatic evaluation for the test languages, in order to encourage systems that work on multiple languages without architectural changes. The training set was deliberately set up with a power-law imbalance in speakers, to mimic a common feature of young children’s early environments, in which they are exposed to more speech from a handful of close family members.
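
To give a feel for what a power-law imbalance in speakers looks like, the sketch below spreads the 45 hours of the English training set over its 69 speakers with weights proportional to rank^(-exponent). The exponent of 1.0 and the function name are assumptions made for illustration; the actual per-speaker durations in the training set are not reproduced here.

```python
def power_law_allocation(total_hours, n_speakers, exponent=1.0):
    """Split total_hours over speakers with weights rank ** -exponent.

    The exponent is an illustrative assumption, not a documented value.
    """
    weights = [rank ** -exponent for rank in range(1, n_speakers + 1)]
    total_weight = sum(weights)
    return [total_hours * weight / total_weight for weight in weights]


# English training set figures from the table: 45h, 69 speakers.
hours = power_law_allocation(45.0, 69)
top_five_share = sum(hours[:5]) / sum(hours)
print(f"most frequent speaker: {hours[0]:.1f}h, least frequent: {hours[-1]:.2f}h")
print(f"share of speech from the five most frequent speakers: {top_five_share:.0%}")
```

Under these assumptions a handful of speakers account for a large share of the speech, while most speakers contribute only a few minutes each.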

abxLS dataset and benchmark

The abxLS benchmark (see Metrics explained) uses the abxLS dataset. This dataset is derived from the popular LibriSpeech dataset of English read speech from audiobooks (Panayotov et al., 2015).

The abxLS dataset can be downloaded using our toolkit with the following command:

> zrc datasets:pull abxLS-dataset

The abxLS dataset is split into a dev and a test set, and split into clean and other based on the degree of filtering applied in the original LibriSpeech corpus.

As a training set, participants are strongly encouraged to use the different sections of LibriSpeech (100, 360, 960) or LibriLight (60k, 6k, etc.).

Cited

Pitt, Dilley, Johnson, Kiesling, Raymond, Hume & Fosler-Lussier (2007). Buckeye corpus of conversational speech (2nd release). www.buckeyecorpus.osu.edu; Columbus, OH: Department of Psychology, Ohio State University (Distributor).
Gauthier, Besacier, Voisin, Melese & Elingui (2016). Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. Retrieved from https://hal.archives-ouvertes.fr/hal-01350037
Wang, Zhang & Zhang (2015). THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
Barnard (2014). The NCHLT speech corpus of the South African languages. https://sites.google.com/site/nchltspeechcorpus/home; 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. Retrieved from http://hdl.handle.net/10204/7549
Panayotov, Chen, Povey & Khudanpur (2015). Librispeech: An ASR corpus based on public domain audio books. IEEE.