Benchmarks & Datasets

Introduction

Two benchmarks have been defined for spoken term discovery: TDE-15 and TDE-17. As is customary for this kind of problem, there is no separate train and test set; the evaluation is done directly on the terms discovered in the train set.

Table. Characteristics of the different ZRC TDE Benchmarks.

Benchmark  Language     Dataset   Type           Train/Test Set (Duration, Speakers)
TDE-15     English      Buckeye   conversations  5h, 12 spk
TDE-15     Xitsonga     NCHLT     Timit-like     2h30, 24 spk
TDE-17     English      Librivox  audiobook      45h, 69 spk
TDE-17     French       Librivox  audiobook      24h, 28 spk
TDE-17     Mandarin     THCHS-30  Timit-like     2h30, 12 spk
TDE-17     German (L1)  Librivox  audiobook      25h, 30 spk
TDE-17     Wolof (L2)             Timit-like     10h, 14 spk

Datasets

TDE-15 covers two languages: conversational English, based on a fragment of the Buckeye dataset, and Xitsonga, based on a fragment of the NCHLT dataset (read speech).

TDE-17 was aimed at testing the robustness of algorithms across languages. There were three development languages (English, French, and Mandarin) and two held-out test languages (German and Wolof).

Dataset References

  • Buckeye
  • NCHLT
  • Librivox: librivox.org
  • THCHS-30
  • Wolof

Download

TDE-15

The tde15 benchmark uses the zr-2015 dataset, which is based on the Buckeye corpus. We cannot bundle this dataset due to restrictive licensing, so you will have to download it from the Buckeye website and then import it with our toolkit using the following command:

> zrc datasets:import buckeye-corpus
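
Once imported, the dataset should appear in the toolkit's local index. As a quick sanity check, here is a minimal sketch assuming that running `zrc datasets` with no subcommand lists the known datasets and their installation status (true of recent toolkit versions, but worth verifying against the version you have installed):

> zrc datasets
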
TDE-17

The tde17 benchmark uses the zr-2017 dataset, which you can download using the following command:

> zrc datasets:pull zrc2017-dataset
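
Once the dataset has been pulled, you can evaluate your discovered terms against it. The sketch below is illustrative rather than authoritative: it assumes the benchmark is registered under the name tde17, that `zrc benchmarks:run` is the runner command, and that ./my-submission is a placeholder for a directory laid out in the expected submission format:

> zrc benchmarks:run tde17 ./my-submission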