Acoustic Unit Discovery / Speech Representation Learning

Benchmarks and datasets

Three benchmark datasets and four benchmarks exist:

zr2015: Data from English and Xitsonga, used to define the abx15 benchmark
zrc2017: Data from English, French, Mandarin, German, and Wolof, used to define the abx17 benchmark
abxLS: Data from English (LibriSpeech), used to define the abxLS benchmark

For details of the benchmarks, see Metrics explained.

Each dataset is also associated with a training set. Use of these training sets is strongly suggested, but the benchmarks are independent of the training sets. Systems must train using no labels or supervision other than speaker IDs.

Table. Characteristics of the different ZRC ABX Benchmark Datasets.

Dataset	Language	Source corpus	Type	Train set (duration / speakers)	Test set (duration / speakers)	Availability
zr2015	English	Buckeye	conversations	same as test set	5h, 12 spk	External: Download Buckeye Corpus
" "	Xitsonga	NCHLT	TIMIT-like	same as test set	2h30, 24 spk	External: Download NCHLT Xitsonga corpus
zrc2017	English	LibriVox	audiobook	45h, 69 spk	27h, 9 spk	Train and test available via ZRC Toolbox
" "	French	LibriVox	audiobook	24h, 28 spk	17h, 10 spk	Train and test available via ZRC Toolbox
" "	Mandarin	THCHS-30	read speech	2h30, 12 spk	25h, 4 spk	Train and test available via ZRC Toolbox
" "	German (LANG1)	LibriVox	audiobook	25h, 30 spk	11h, 10 spk	Train and test available via ZRC Toolbox
" "	Wolof (LANG2)		TIMIT-like	10h, 14 spk	5.9h, 4 spk	Train and test available via ZRC Toolbox
abxLS	English	LibriSpeech	audiobook	LibriSpeech, Libri-light, etc.	dev/test × clean/other: 5h each, 40 spk (clean), 33 spk (other)	Test available via ZRC Toolbox; train sets: LibriSpeech, Libri-light
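The durations in the table mix several notations ("45h", "2h30", "5.9h"). As a minimal sketch of how they can be read consistently, the helper below converts each notation to decimal hours and sums the zrc2017 train and test figures from the table. The function name `parse_hours` and the two dictionaries are hypothetical and exist only for this illustration; they are not part of the ZRC toolkit.

```python
import re

# Matches "45h", "5.9h" and "2h30" (hours, optionally followed by minutes).
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)h(\d{1,2})?$")


def parse_hours(text):
    """Convert a duration such as '2h30' or '5.9h' to decimal hours."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = float(match.group(1))
    minutes = int(match.group(2)) if match.group(2) else 0
    return hours + minutes / 60.0


# Durations copied from the zrc2017 rows of the table.
ZRC2017_TRAIN = {"English": "45h", "French": "24h", "Mandarin": "2h30",
                 "German": "25h", "Wolof": "10h"}
ZRC2017_TEST = {"English": "27h", "French": "17h", "Mandarin": "25h",
                "German": "11h", "Wolof": "5.9h"}

if __name__ == "__main__":
    train_total = sum(parse_hours(d) for d in ZRC2017_TRAIN.values())
    test_total = sum(parse_hours(d) for d in ZRC2017_TEST.values())
    # prints "train: 106.5 h, test: 85.9 h"
    print(f"train: {train_total:.1f} h, test: {test_total:.1f} h")
```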

zr2015 and abx15

The abx15 benchmark uses the zr2015 dataset.

This dataset is based on two external datasets which we do not have the right to distribute, but which can be downloaded freely from their respective websites:

Buckeye Corpus of conversational American English (Pitt, Dilley & al., 2007)
NCHLT Xitsonga corpus of read speech in Xitsonga (Tsonga) (Barnard, 2014)

Before evaluating on the abx15 benchmark, download both corpora using the links above, then import them with our toolkit using the following commands:

> zrc datasets:import zr2015-buckeye [/path/to/buckeye-corpus]
> zrc datasets:import zr2015-nchlt [/path/to/nchlt_tso]
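The two import steps can also be scripted. The following is a minimal sketch, assuming the `zrc` command is available on the PATH. The helper name `build_import_commands` is hypothetical, and the paths are the same placeholders as above. Only the two `datasets:import` invocations themselves come from the toolkit commands shown here. The sketch prints the commands rather than running them.

```python
import shlex

# Dataset names exactly as used in the two import commands above.
ZR2015_IMPORTS = ("zr2015-buckeye", "zr2015-nchlt")


def build_import_commands(corpus_dirs):
    """Return one `zrc datasets:import` argument list per zr2015 corpus.

    `corpus_dirs` maps each dataset name to the local directory holding
    the externally downloaded corpus (placeholder paths in this sketch).
    """
    commands = []
    for name in ZR2015_IMPORTS:
        if name not in corpus_dirs:
            raise KeyError(f"no local path given for {name}")
        commands.append(["zrc", "datasets:import", name, str(corpus_dirs[name])])
    return commands


if __name__ == "__main__":
    commands = build_import_commands({
        "zr2015-buckeye": "/path/to/buckeye-corpus",
        "zr2015-nchlt": "/path/to/nchlt_tso",
    })
    for command in commands:
        # Printed only; to execute, pass each list to
        # subprocess.run(command, check=True).
        print(shlex.join(command))
```

Building argument lists rather than shell strings avoids quoting problems when a corpus path contains spaces.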

The zr2015 dataset was used for the Zero Resource Speech Challenge 2015. Systems submitted to this challenge were required to train on the zr2015 dataset, and we continue to strongly encourage its use for training systems evaluated on this benchmark. The train set is coextensive with the zr2015 test set.

zrc2017 and abx17

The abx17 benchmark uses the zrc2017 dataset. This dataset is drawn from the following sources:

English audiobook speech from LibriVox (librivox.org)
French audiobook speech from LibriVox (librivox.org)
German audiobook speech from LibriVox (librivox.org)
Mandarin read speech from THCHS-30 (Wang, Zhang & al., 2015)
Wolof read speech from the TIMIT-style Wolof sentence corpus (Gauthier, Besacier & al., 2016)

The abx17 benchmark is split into small files of varying durations (1s, 10s, and 120s), in order to evaluate systems’ ability to (implicitly or explicitly) perform speaker normalization on the fly at test time. The same data is distributed in the three durations, to allow for comparison.
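To illustrate what distributing the same data at several file durations means, the sketch below computes the sample boundaries obtained when one recording is cut into consecutive 1 s, 10 s, and 120 s pieces. This is an illustration only and not the procedure used to build the benchmark files. The function name `chunk_bounds`, the 16 kHz sampling rate, and the 4-minute recording are all assumptions.

```python
def chunk_bounds(n_samples, sample_rate, chunk_seconds):
    """Return (start, end) sample indices of consecutive chunks.

    The final chunk may be shorter than `chunk_seconds` when the
    recording length is not a multiple of the chunk length.
    """
    step = int(round(sample_rate * chunk_seconds))
    if step <= 0:
        raise ValueError("chunk length must be at least one sample")
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


if __name__ == "__main__":
    sample_rate = 16000            # assumed sampling rate
    n_samples = 240 * sample_rate  # hypothetical 4-minute recording
    for seconds in (1, 10, 120):
        n_chunks = len(chunk_bounds(n_samples, sample_rate, seconds))
        print(f"{seconds:>3} s files: {n_chunks}")
```

With shorter files a system sees less speech from each speaker at a time, which is why the 1 s condition is the most demanding for on-the-fly speaker normalization.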

Before evaluating on the abx17 benchmark, you need to download the zrc2017 dataset, which you can do using our toolkit with the following command:

> zrc datasets:pull zrc2017-test-dataset

This dataset was used for the Zero Resource Speech Challenge 2017. Systems submitted to this challenge were required to train on the separate zr2017-train dataset, which comes from the same source corpora but is disjoint from the main zrc2017 test set in both utterances and speakers. We continue to strongly encourage the use of this set to train systems evaluated on this benchmark.

In addition to the disjoint train and test split within each language, the zr2017-train set was split into development languages (English, French, and Mandarin) and test languages (German and Wolof). Participants in the Zero Resource Speech Challenge 2017 did not have access to the automatic evaluation for the test languages, in order to encourage systems that work on multiple languages without architectural changes. The training set was deliberately set up with a power-law imbalance in speakers, to mimic a common feature of young children's early environments: they are exposed to more speech from a handful of close family members.
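To make the idea of a power-law speaker imbalance concrete, the sketch below splits a fixed amount of speech across speakers so that the speaker of rank k receives a share proportional to k^(-alpha). The exponent `alpha` and the function name `power_law_hours` are assumptions chosen for illustration. The exact distribution used to build the training set is not stated here. The totals (45 h, 69 speakers) are the English train-set figures from the table above.

```python
def power_law_hours(total_hours, n_speakers, alpha=1.0):
    """Split `total_hours` across speakers with shares ~ rank ** -alpha.

    `alpha` is an illustrative assumption: larger values concentrate
    more of the speech in the first few speakers.
    """
    if n_speakers <= 0:
        raise ValueError("n_speakers must be positive")
    weights = [rank ** -alpha for rank in range(1, n_speakers + 1)]
    norm = sum(weights)
    return [total_hours * weight / norm for weight in weights]


if __name__ == "__main__":
    hours = power_law_hours(45.0, 69)
    top_three = sum(hours[:3])
    print(f"top 3 speakers: {top_three:.1f} h of {sum(hours):.1f} h")
    print(f"least frequent speaker: {hours[-1] * 60:.1f} min")
```

Under such an allocation a few speakers account for a large share of the data while most contribute only a little each, which is the "handful of close family members" pattern described above.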

abxLS dataset and benchmark

The abxLS benchmark (see Metrics explained) uses the abxLS dataset. This dataset is derived from the popular LibriSpeech dataset of English read speech from audiobooks (Panayotov, Chen & al., 2015).

The abxLS dataset can be downloaded using our toolkit with the following command:

> zrc datasets:pull abxLS-dataset

The abxLS dataset is split into a dev and a test set, each of which is further divided into clean and other subsets based on the degree of filtering applied in the original LibriSpeech corpus.
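The two-way split yields four evaluation subsets. As a small sketch, they can be enumerated together with the per-subset figures from the table above (5 h each, 40 speakers for clean, 33 for other). The subset names follow the LibriSpeech convention (`dev-clean`, `test-other`, and so on). Whether the abxLS files use exactly these names is an assumption, and `abxls_subsets` is a hypothetical helper.

```python
from itertools import product

# Per-subset figures taken from the abxLS row of the table.
SPEAKERS = {"clean": 40, "other": 33}
HOURS_PER_SUBSET = 5


def abxls_subsets():
    """List the four dev/test x clean/other subsets with their sizes."""
    return [
        {
            "name": f"{split}-{condition}",  # LibriSpeech-style name (assumed)
            "hours": HOURS_PER_SUBSET,
            "speakers": SPEAKERS[condition],
        }
        for split, condition in product(("dev", "test"), ("clean", "other"))
    ]


if __name__ == "__main__":
    subsets = abxls_subsets()
    for subset in subsets:
        print(subset["name"], subset["hours"], "h,", subset["speakers"], "spk")
    print("total:", sum(s["hours"] for s in subsets), "h")
```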

As a training set, participants are strongly encouraged to use the different sections of LibriSpeech (100, 360, 960) or Libri-light (60k, 6k, etc.).

Bibliography

*The full bibliography can be found here

Pitt, Dilley, Johnson, Kiesling, Raymond, Hume & Fosler-Lussier (2007): Pitt, M., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E. & Fosler-Lussier, E. (2007). Buckeye corpus of conversational speech (2nd release). www.buckeyecorpus.osu.edu; Columbus, OH: Department of Psychology, Ohio State University (Distributor).
Barnard (2014): Barnard, D. (2014). The NCHLT speech corpus of the South African languages. https://sites.google.com/site/nchltspeechcorpus/home; 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. Retrieved from http://hdl.handle.net/10204/7549
Wang, Zhang & Zhang (2015): Wang, D., Zhang, X. & Zhang, Z. (2015). THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
Panayotov, Chen, Povey & Khudanpur (2015): Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5206-5210. IEEE.
Gauthier, Besacier, Voisin, Melese & Elingui (2016): Gauthier, E., Besacier, L., Voisin, S., Melese, M. & Elingui, U. (2016). Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. 10th Language Resources and Evaluation Conference (LREC 2016). Retrieved from https://hal.archives-ouvertes.fr/hal-01350037