Benchmarks & Datasets
Introduction
Two benchmarks have been defined for spoken term discovery: TDE-15 and TDE-17. As is customary for this kind of problem, there are no separate train and test sets; the evaluation is done directly on the terms discovered in the train set.
Table. Characteristics of the different ZRC TDE Benchmarks.
Benchmark | Language | Dataset | Type | Train/Test Set (Duration/Speakers) |
---|---|---|---|---|
TDE-15 | English | Buckeye | conversations | 5h, 12 spk |
^^ | Xitsonga | NCHLT | Timit-like | 2h30, 24 spk |
TDE-17 | English | Librivox | audiobook | 45h, 69 spk |
^^ | French | Librivox | audiobook | 24h, 28 spk |
^^ | Mandarin | THCHS-30 | Timit-like | 2h30, 12 spk |
^^ | German (L1) | Librivox | audiobook | 25h, 30 spk |
^^ | Wolof (L2) | | Timit-like | 10h, 14 spk |
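To get a feel for the scale of each benchmark, the durations in the table can be summed per benchmark. The snippet below is a minimal sketch: the rows are copied from the table, while the names `SUBSETS`, `to_minutes` and `total_hours` are illustrative and not part of any ZRC tooling.

```python
import re

# (benchmark, language, duration, speakers), copied from the table above.
SUBSETS = [
    ("TDE-15", "English", "5h", 12),
    ("TDE-15", "Xitsonga", "2h30", 24),
    ("TDE-17", "English", "45h", 69),
    ("TDE-17", "French", "24h", 28),
    ("TDE-17", "Mandarin", "2h30", 12),
    ("TDE-17", "German", "25h", 30),
    ("TDE-17", "Wolof", "10h", 14),
]


def to_minutes(duration):
    """Convert strings such as '5h' or '2h30' to a number of minutes."""
    match = re.fullmatch(r"(\d+)h(\d*)", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes = match.groups()
    return int(hours) * 60 + int(minutes or 0)


def total_hours(benchmark):
    """Sum the audio duration, in hours, of every subset of a benchmark."""
    minutes = sum(to_minutes(d) for b, _, d, _ in SUBSETS if b == benchmark)
    return minutes / 60


if __name__ == "__main__":
    for name in ("TDE-15", "TDE-17"):
        print(name, total_hours(name), "hours")
```

Durations are converted to minutes before summing so that entries such as `2h30` are handled without floating-point parsing.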
Datasets
TDE-15 contains conversational English, based on a fragment of the Buckeye dataset, and Xitsonga, based on a fragment of the NCHLT dataset (read speech).
TDE-17 was aimed at testing the robustness of the algorithms across languages. There were three development languages (English, French and Mandarin) and two held-out test languages (German and Wolof).
Dataset References
- Buckeye
- NCHLT
- Librivox: librivox.org
- THCHS-30
- Wolof
Download
TDE-15
The tde15 benchmark uses the zr-2015 dataset, which is based on the Buckeye corpus. We cannot bundle this dataset due to restrictive licensing, so you will have to download it from the Buckeye website and then import it with our toolkit using the following command:
> zrc datasets:import buckeye-corpus
TDE-17
> zrc datasets:pull zs2017-test-dataset
The tde17 benchmark uses the zr-2017 dataset, which you can download using the following command:
> zrc datasets:pull zrc2017-dataset
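If you prefer to script the dataset setup, the commands above can be wrapped in a small helper. This is only a sketch: the command lines are copied verbatim from this page, while `DATASET_COMMANDS`, `fetch_dataset`, the `tde15`/`tde17` keys and the `dry_run` flag are hypothetical names introduced for illustration. It assumes the `zrc` toolkit is installed and, for TDE-15, that the Buckeye corpus has already been downloaded.

```python
import shutil
import subprocess

# Commands copied verbatim from this page; nothing else about the CLI is assumed.
DATASET_COMMANDS = {
    # Requires the Buckeye corpus to have been downloaded beforehand.
    "tde15": ["zrc", "datasets:import", "buckeye-corpus"],
    "tde17": ["zrc", "datasets:pull", "zrc2017-dataset"],
}


def fetch_dataset(benchmark, dry_run=True):
    """Return the command for `benchmark`, running it unless dry_run is set."""
    command = DATASET_COMMANDS[benchmark]
    if dry_run:
        return command
    if shutil.which("zrc") is None:
        raise RuntimeError("the zrc toolkit was not found on PATH")
    # check=True raises CalledProcessError if the command fails.
    subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    for name in DATASET_COMMANDS:
        print(name, "->", " ".join(fetch_dataset(name)))
```

By default the helper only reports the command it would run; pass `dry_run=False` to execute it.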