Benchmarks & Datasets
Two benchmarks have been defined for spoken term discovery: TDE-15 and TDE-17. As is customary for this kind of problem, there is no separate train/test split; evaluation is done directly on the terms discovered in the training set.
Table. Characteristics of the different ZRC TDE Benchmarks.
| Benchmark | Language | Dataset | Type | Train/Test Set (Duration/Speakers) |
|-----------|----------|---------|------|------------------------------------|
| TDE-17 | English | Librivox | audiobook | 45h, 69 spk |
| ^^ | French | Librivox | audiobook | 24h, 28 spk |
| ^^ | Mandarin | THCHS-30 | Timit-like | 2h30, 12 spk |
| ^^ | German (L1) | Librivox | audiobook | 25h, 30 spk |
| ^^ | Wolof (L2) | | Timit-like | 10h, 14 spk |
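For quick bookkeeping, the TDE-17 composition above can be encoded as plain data and aggregated. This is an illustrative sketch only, not part of the zrc toolkit; the figures are taken verbatim from the table (the Wolof dataset name is not listed there, so it is left as `None`):

```python
# Illustrative only: the TDE-17 table above as plain Python data.
# Durations are converted to minutes so they can be summed.
TDE17_SUBSETS = [
    # (language, dataset, type, duration_minutes, speakers)
    ("English",  "Librivox", "audiobook",  45 * 60,     69),
    ("French",   "Librivox", "audiobook",  24 * 60,     28),
    ("Mandarin", "THCHS-30", "Timit-like", 2 * 60 + 30, 12),
    ("German",   "Librivox", "audiobook",  25 * 60,     30),
    ("Wolof",    None,       "Timit-like", 10 * 60,     14),  # dataset name not given in the table
]

total_hours = sum(minutes for *_, minutes, _ in TDE17_SUBSETS) / 60
total_speakers = sum(spk for *_, spk in TDE17_SUBSETS)
print(f"{total_hours:.1f}h of audio, {total_speakers} speakers")
```

Summing the table this way gives roughly 106.5 hours of audio across 153 speakers for the five TDE-17 subsets.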
TDE-15 contains conversational English, based on a fragment of the Buckeye dataset, and Xitsonga, a fragment of the NCHLT dataset (read speech).
TDE-17 was aimed at testing the robustness of the algorithms across languages. There were 3 dev languages (English, French and Mandarin) and 2 held-out test languages (German and Wolof).
- Buckeye
- NCHLT
- Librivox (librivox.org)
- THCHS-30
- Wolof
The TDE-15 benchmark uses the zr-2015 dataset, which is based on the Buckeye corpus. We cannot bundle this dataset due to restrictive licensing, so you will have to download it from their website and then import it with our toolkit, using the following command:
> zrc datasets:import buckeye-corpus
> zrc datasets:pull zs2017-test-dataset
The TDE-17 benchmark uses the zr-2017 dataset, which you can download using the following command:
> zrc datasets:pull zrc2017-dataset
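To keep the two workflows straight, the retrieval commands above can be summarized as a small lookup. This is an illustrative Python sketch, not part of the zrc toolkit; the command strings are exactly those shown above:

```python
# Illustrative summary of the dataset-retrieval commands shown above.
# TDE-15's Buckeye-based data must be downloaded manually first, then
# imported; the zr-2017 data can be pulled directly.
DATASET_COMMANDS = {
    "TDE-15": "zrc datasets:import buckeye-corpus",  # after manual download
    "TDE-17": "zrc datasets:pull zrc2017-dataset",
}

for benchmark, command in DATASET_COMMANDS.items():
    print(f"{benchmark}: {command}")
```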