Benchmarks & Datasets


Two benchmarks have been defined for spoken term discovery: TDE-15 and TDE-17. As is customary for this kind of problem, there is no separate train/test split; evaluation is done directly on the terms discovered in the training set.

Table. Characteristics of the different ZRC TDE Benchmarks.

| Benchmark | Language     | Dataset  | Type          | Train/Test Set (Duration/Speakers) |
|-----------|--------------|----------|---------------|------------------------------------|
| TDE-15    | English      | Buckeye  | conversations | 5h, 12 spk                         |
| TDE-15    | Xitsonga     | NCHLT    | Timit-like    | 2h30, 24 spk                       |
| TDE-17    | English      | Librivox | audiobook     | 45h, 69 spk                        |
| TDE-17    | French       | Librivox | audiobook     | 24h, 28 spk                        |
| TDE-17    | Mandarin     | THCHS-30 | Timit-like    | 2h30, 12 spk                       |
| TDE-17    | German (L1)  | Librivox | audiobook     | 25h, 30 spk                        |
| TDE-17    | Wolof (L2)   |          | Timit-like    | 10h, 14 spk                        |
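For programmatic use, the characteristics in the table above can be captured as a plain mapping. This is only an illustrative sketch (the structure and names below are not part of the toolkit); durations are converted to hours.

```python
# Benchmark characteristics from the table above, keyed by benchmark then
# language. "hours" is the set duration in hours; "speakers" is the number of
# distinct speakers. The Wolof dataset name is not given in the table, so it
# is left as None here.
TDE_BENCHMARKS = {
    "TDE-15": {
        "English":  {"dataset": "Buckeye",  "type": "conversations", "hours": 5.0,  "speakers": 12},
        "Xitsonga": {"dataset": "NCHLT",    "type": "Timit-like",    "hours": 2.5,  "speakers": 24},
    },
    "TDE-17": {
        "English":  {"dataset": "Librivox", "type": "audiobook",     "hours": 45.0, "speakers": 69},
        "French":   {"dataset": "Librivox", "type": "audiobook",     "hours": 24.0, "speakers": 28},
        "Mandarin": {"dataset": "THCHS-30", "type": "Timit-like",    "hours": 2.5,  "speakers": 12},
        "German":   {"dataset": "Librivox", "type": "audiobook",     "hours": 25.0, "speakers": 30},
        "Wolof":    {"dataset": None,       "type": "Timit-like",    "hours": 10.0, "speakers": 14},
    },
}

# Total amount of speech in the TDE-17 benchmark across its five languages.
total_tde17_hours = sum(v["hours"] for v in TDE_BENCHMARKS["TDE-17"].values())
print(total_tde17_hours)  # 106.5
```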


TDE-15 contains conversational English, based on a fragment of the Buckeye dataset, and Xitsonga, based on a fragment of the NCHLT dataset (read speech).

TDE-17 was aimed at testing the robustness of the algorithms across languages. There were 3 development languages (English, French, and Mandarin) and 2 held-out test languages (German and Wolof).

Dataset References

  • Buckeye
  • NCHLT
  • Librivox
  • THCHS-30
  • Wolof



The TDE-15 benchmark uses the ZR-2015 dataset, which is based on the Buckeye corpus. We cannot bundle this dataset due to restrictive licensing, so you will have to download it from the Buckeye website and then import it with our toolkit using the following command:

> zrc datasets:import buckeye-corpus

The TDE-17 benchmark uses the ZR-2017 dataset, which you can download using the following command:

> zrc datasets:pull zrc2017-dataset