Benchmarks and datasets
Three benchmark datasets define three benchmarks:
- zr2015: Data from English and Xitsonga, used to define the abx15 benchmark
- zrc2017: Data from English, French, Mandarin, German, and Wolof, used to define the abx17 benchmark
- abxLS: Data from English (LibriSpeech), used to define the abxLS benchmark
For more information about the details of the benchmarks, see Metrics explained.
Each dataset is also associated with a training set. Use of these training sets is strongly encouraged, but the benchmarks are independent of them. Systems must be trained without any labels or supervision other than speaker IDs.
Table. Characteristics of the different ZRC ABX benchmark datasets.

| Dataset | Language | Corpus | Type | Train set (duration / speakers) | Test set (duration / speakers) | Availability |
|---|---|---|---|---|---|---|
| zr2015 | English | Buckeye | conversational | same as test set | 5h, 12 spk | external: download the Buckeye Corpus |
| zr2015 | Xitsonga | NCHLT | read speech (TIMIT-like) | same as test set | 2h30, 24 spk | external: download the NCHLT Xitsonga corpus |
| zrc2017 | English | LibriVox | audiobook | 45h, 69 spk | 27h, 9 spk | train and test via the ZRC toolbox |
| zrc2017 | French | LibriVox | audiobook | 24h, 28 spk | 17h, 10 spk | train and test via the ZRC toolbox |
| zrc2017 | Mandarin | THCHS-30 | read speech | 2h30, 12 spk | 25h, 4 spk | train and test via the ZRC toolbox |
| zrc2017 | German (LANG1) | LibriVox | audiobook | 25h, 30 spk | 11h, 10 spk | train and test via the ZRC toolbox |
| zrc2017 | Wolof (LANG2) | Wolof sentence corpus | read speech (TIMIT-like) | 10h, 14 spk | 5.9h, 4 spk | train and test via the ZRC toolbox |
| abxLS | English | LibriSpeech | audiobook | LibriSpeech, Libri-Light, etc. | dev/test x clean/other: 5h each; 40 spk (clean), 33 spk (other) | test via the ZRC toolbox; train sets: LibriSpeech, Libri-Light |
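For quick reference in scripts, the table above can be mirrored as a small data structure. This is a minimal sketch: the dictionary layout and helper function are illustrative, not part of the ZRC toolbox's API.

```python
# Illustrative summary of the ZRC ABX benchmark datasets (not a toolbox API).
# Languages and benchmark names are taken from the table above.
BENCHMARK_DATASETS = {
    "zr2015": {
        "languages": ["English", "Xitsonga"],
        "benchmark": "abx15",
    },
    "zrc2017": {
        "languages": ["English", "French", "Mandarin", "German", "Wolof"],
        "benchmark": "abx17",
    },
    "abxLS": {
        "languages": ["English"],
        "benchmark": "abxLS",
    },
}

def languages_for(benchmark: str) -> list:
    """Return the evaluation languages for a given benchmark name."""
    for info in BENCHMARK_DATASETS.values():
        if info["benchmark"] == benchmark:
            return info["languages"]
    raise KeyError(benchmark)

print(languages_for("abx17"))  # the five zrc2017 languages
```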
zr2015 and abx15
The abx15 benchmark uses the zr2015 dataset.
This dataset is based on two external datasets which we do not have the right to distribute, but which can be downloaded freely from their respective websites:
- the Buckeye Corpus of conversational American English (Pitt, Dilley & al., 2007)
- the NCHLT Xitsonga corpus of read Xitsonga (Tsonga) speech (Barnard, 2014)
Before evaluating on the abx15 benchmark, you need to download these datasets using the links above. Then import them with our toolkit using the following commands:
> zrc datasets:import zr2015-buckeye [/path/to/buckeye-corpus]
> zrc datasets:import zr2015-nchlt [/path/to/nchlt_tso]
The zr2015 dataset was used for the Zero Resource Speech Challenge 2015. Systems submitted to that challenge were required to train on the zr2015 dataset itself, and we continue to strongly encourage training on this set for systems evaluated on this benchmark. The train set is coextensive with the zr2015 test set.
zrc2017 and abx17
The abx17 benchmark uses the zrc2017 dataset. This dataset comes from:
- English audiobook speech from LibriVox (librivox.org)
- French audiobook speech from LibriVox (librivox.org)
- German audiobook speech from LibriVox (librivox.org)
- Mandarin read speech from THCHS-30 (Wang, Zhang & al., 2015)
- Wolof read speech from a TIMIT-style Wolof sentence corpus (Gauthier, Besacier & al., 2016)
The zrc2017 test data is split into small files of varying duration (1 s, 10 s, and 120 s) in order to evaluate systems' ability to perform speaker normalization (implicitly or explicitly) on the fly at test time. The same data is distributed at all three durations to allow for comparison.
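The effect of the three file durations can be pictured with a simple chunking sketch. This is illustrative only: it is not how the official files were produced, and the 16 kHz sample rate is an assumption.

```python
# Split a signal into fixed-duration chunks (illustrative; 16 kHz assumed).
SAMPLE_RATE = 16_000  # Hz; an assumption, not a property of the dataset

def split_into_chunks(samples: list, seconds: float) -> list:
    """Cut a sample sequence into consecutive chunks of `seconds` duration.

    The final partial chunk, if any, is kept, mirroring the fact that the
    same underlying audio is distributed at every duration.
    """
    size = int(seconds * SAMPLE_RATE)
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# A fake 25-second recording: the 1 s condition yields 25 chunks,
# the 10 s condition yields 3 (the last one partial).
signal = [0.0] * (25 * SAMPLE_RATE)
print(len(split_into_chunks(signal, 1)))   # 25
print(len(split_into_chunks(signal, 10)))  # 3
```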
Before evaluating on the abx17 benchmark, you need to download the zrc2017 dataset, which you can do using our toolkit with the following command:
> zrc datasets:pull zrc2017-test-dataset
This dataset was used for the Zero Resource Speech Challenge 2017. Systems submitted to this challenge were required to use the separate zr2017-train dataset to train on, which comes from the same source corpora, but is disjoint from the main zrc2017 test set, both in terms of utterances and in terms of speakers. We continue to strongly encourage the use of this set to train systems evaluated on this benchmark.
In addition to the disjoint train/test split within each language, the zr2017-train set was divided into development languages (English, French, and Mandarin) and test languages (German and Wolof); participants in the Zero Resource Speech Challenge 2017 did not have access to the automatic evaluation for the test languages, in order to encourage systems that work on multiple languages without architectural changes. The training set was deliberately set up with a power-law imbalance in speakers, mimicking a common feature of young children's early environments: they are exposed to much more speech from a handful of close family members than from anyone else.
abxLS dataset and benchmark
The abxLS benchmark (see Metrics explained) uses the abxLS dataset. This dataset is derived from the popular LibriSpeech corpus of English read speech from audiobooks (Panayotov, Chen & al., 2015).
The abxLS dataset can be downloaded using our toolkit with the following command:
> zrc datasets:pull abxLS-dataset
The abxLS dataset is split into a dev set and a test set, each further split into clean and other subsets based on the degree of filtering applied in the original LibriSpeech corpus.
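Mirroring LibriSpeech's naming, the four resulting subsets can be pictured as below. The exact directory names are an assumption for illustration, not a specification of the toolbox's layout; durations and speaker counts are those from the table above.

```
abxLS/
├── dev-clean/    # 5h, 40 speakers
├── dev-other/    # 5h, 33 speakers
├── test-clean/   # 5h, 40 speakers
└── test-other/   # 5h, 33 speakers
```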
As a training set, participants are strongly encouraged to use the different subsets of LibriSpeech (100 h, 360 h, 960 h) or of Libri-Light (60k h, 6k h, etc.).
Cited
- Pitt, Dilley, Johnson, Kiesling, Raymond, Hume & Fosler-Lussier (2007)
- Pitt, M., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E. & Fosler-Lussier, E. (2007). Buckeye corpus of conversational speech (2nd release). www.buckeyecorpus.osu.edu; Columbus, OH: Department of Psychology, Ohio State University (Distributor).
- Gauthier, Besacier, Voisin, Melese & Elingui (2016)
- Gauthier, E., Besacier, L., Voisin, S., Melese, M. & Elingui, U. (2016). Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. Retrieved from https://hal.archives-ouvertes.fr/hal-01350037
- Wang, Zhang & Zhang (2015)
- Wang, D., Zhang, X. & Zhang, Z. (2015). THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
- Barnard (2014)
- Barnard, D. (2014). The NCHLT speech corpus of the South African languages. https://sites.google.com/site/nchltspeechcorpus/home; 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. Retrieved from http://hdl.handle.net/10204/7549
- Panayotov, Chen, Povey & Khudanpur (2015)
- Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. IEEE.