Zerospeech 2017
The 2017 Challenge appeared as a special session at ASRU 2017 (Dec 16-20, 2017, Okinawa, Japan; see the ASRU 2017 webpage). The challenge's aims and metrics extend those presented in [1], and the main results are summarized in [2]. Below are the baseline results for the hyper training set. The baselines and system results for the hyper test set will be revealed after the paper's acceptance.
Track 1
Baseline and topline
The baseline ABX error rates for Track 1 are given in Table 1 (see also [1]). For the baseline model, we used 39-dimensional MFCC+Delta+Delta2 features computed every 10 ms, and the ABX score was computed using the frame-wise cosine distance averaged along the DTW path. The topline consists of posteriorgrams from a supervised phone-recognition Kaldi pipeline.
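The frame-wise distance described above can be sketched as follows. This is a minimal illustration, not the challenge's evaluation code: it computes pairwise cosine distances between two feature matrices, finds the minimum-cost DTW alignment, and averages the distance along that path. Function names and the simple O(nm) recursion are our own choices for illustration.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between frames of a (n, d) and b (m, d)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T

def dtw_average_cosine(a, b):
    """Cosine distance averaged along the minimum-cost DTW path (illustrative)."""
    dist = cosine_distance_matrix(a, b)
    n, m = dist.shape
    # acc[i, j]: cost of best path ending at (i, j); length[i, j]: its length
    acc = np.full((n, m), np.inf)
    length = np.zeros((n, m), dtype=int)
    acc[0, 0] = dist[0, 0]
    length[0, 0] = 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((acc[i - 1, j], length[i - 1, j]))
            if j > 0:
                candidates.append((acc[i, j - 1], length[i, j - 1]))
            if i > 0 and j > 0:
                candidates.append((acc[i - 1, j - 1], length[i - 1, j - 1]))
            best_cost, best_len = min(candidates, key=lambda c: c[0])
            acc[i, j] = best_cost + dist[i, j]
            length[i, j] = best_len + 1
    return acc[-1, -1] / length[-1, -1]
```

Two identical sequences align along the diagonal and yield an average distance of zero, which is the sanity check one would expect before running the full ABX evaluation.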
Systems comparison
| Authors | Affiliation | DOI | # | English (1s / 10s / 120s) | French (1s / 10s / 120s) | Mandarin (1s / 10s / 120s) | LANG1 (1s / 10s / 120s) | LANG2 (1s / 10s / 120s) | With supervision |
|---|---|---|---|---|---|---|---|---|---|
Track 2
Baseline and topline
For the baseline model, we used the JHU system described in Jansen & Van Durme (2011) on PLP features. It performs DTW matching, using random projections to increase efficiency, followed by connected-component graph clustering as a second step. We also report a topline: an Adaptor Grammar with a unigram model applied to the decoding provided by the phone recognizer. The topline performance is probably not attainable by unsupervised systems, since it uses the gold transcription; it is better seen as a reference for the maximum value that it is reasonable to expect on these scores.
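The second step of the baseline, connected-component graph clustering, can be sketched with a small union-find structure: each DTW match is an edge between two speech fragments, and clusters are the connected components of the resulting graph. This is a hedged illustration of the clustering idea only, with our own function names; it is not the JHU system's actual code.

```python
class UnionFind:
    """Minimal union-find (disjoint sets) with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster_matches(pairs):
    """Group matched fragments into clusters = connected components of the match graph."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    clusters = {}
    for x in uf.parent:
        clusters.setdefault(uf.find(x), set()).add(x)
    return list(clusters.values())
```

For example, matches (f1, f2), (f2, f3) and (f4, f5) yield two clusters, {f1, f2, f3} and {f4, f5}, which would then be scored as candidate word types.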
| Authors | DOI | English (NED / Cov / Words) | French (NED / Cov / Words) | Mandarin (NED / Cov / Words) | LANG1 (NED / Cov / Words) | LANG2 (NED / Cov / Words) |
|---|---|---|---|---|---|---|
Note:
The spoken term discovery baseline can be found here. The Pitman-Yor Adaptor Grammar sampler can be found here. The baseline can be replicated by running the code kept on GitHub.
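The NED column in the table above is a normalized edit distance between the phone transcriptions of discovered fragment pairs (lower is better). A minimal sketch of that computation, assuming transcriptions are given as phone strings, could look as follows; the function names are ours and this is not the official evaluation toolkit.

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming (one rolling row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(pairs):
    """Mean normalized edit distance over pairs of phone transcriptions."""
    total = 0.0
    for a, b in pairs:
        total += levenshtein(a, b) / max(len(a), len(b))
    return total / len(pairs)
```

Identical transcriptions give a NED of 0; completely disjoint ones give 1, so a system's NED reflects how phonetically coherent its discovered clusters are.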
Challenge References
- [1] Dunbar, E., Cao, X. N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., … & Dupoux, E. (2017, December). The zero resource speech challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 323-330). IEEE.