The 2015 Challenge appeared as a special session at Interspeech 2015 (September 6-10, 2015, Dresden; see the Interspeech 2015 proceedings). The challenge's aims were presented in [Versteegh et al., 2015] and the main results summarized in [Versteegh et al., 2016]. The references for Track 1 are [Thiollière et al., 2015], [Badino et al., 2015], [Renshaw et al., 2015], [Agenbag & Niesler, 2015], [Chen et al., 2015], and [Baljekar et al., 2015]; those for Track 2 are [Räsänen et al., 2015] and [Lyzinski et al., 2015]. Further papers were published in the SLTU 2016 special topic on zero-resource speech technology and elsewhere [Zeghidour et al., 2016; Heck et al., 2016; Srivastava & Shrivastava, 2016].
Baseline and topline (Track 1)
The baseline and topline ABX error rates for Track 1 are given in Table 1 (see also [Versteegh et al., 2016]). For the baseline model, we used 13-dimensional MFCC features computed every 10 ms, and the ABX score was computed using the cosine distance. For the topline model, we used posteriorgrams extracted from a Kaldi GMM-HMM pipeline with MFCC, Delta, and Delta-Delta features, Gaussian mixtures, triphone word-position-dependent states, fMLLR talker adaptation, and a bigram word language model. The exact same Kaldi pipeline was used for the two languages and gave a phone error rate (PER) of 26.4% for English and 7.5% for Tsonga. Note that the two corpora are quite different: the English corpus contains spontaneous, casual speech; the Tsonga corpus contains read speech constructed from a small vocabulary and tailored for building speech recognition applications. The acoustic and language models were trained on the part of the corpora not used in the evaluation, and the posteriors were fed into the ABX evaluation software using the KL divergence. Unsupervised models are expected to fall in between the performance of these two systems.
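The ABX logic above can be sketched as follows: a trial is correct when token X is closer, under DTW-aligned frame distances, to a token A of its own category than to a token B of another category. This is a minimal illustration only, not the official evaluation toolkit; the function names and the exact frame distance (1 − cosine similarity here; the toolkit uses an angle-based variant, or the KL divergence for posteriorgrams) are choices of this sketch.

```python
import numpy as np

def cosine_distance(x, y):
    # Frame-level distance: 1 - cosine similarity (one common choice).
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def dtw_distance(a, b, frame_dist=cosine_distance):
    """Average frame distance along the best DTW alignment path
    between two sequences of feature frames (n x d arrays)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    length = np.zeros((n + 1, m + 1), dtype=int)  # path length, for normalization
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            # predecessor with the smallest accumulated cost
            preds = [(cost[i - 1, j - 1], (i - 1, j - 1)),
                     (cost[i - 1, j], (i - 1, j)),
                     (cost[i, j - 1], (i, j - 1))]
            best, (pi, pj) = min(preds, key=lambda t: t[0])
            cost[i, j] = best + d
            length[i, j] = length[pi, pj] + 1
    return cost[n, m] / length[n, m]

def abx_correct(A, B, X):
    """One ABX trial: X belongs to A's category; the trial is
    correct if X is closer to A than to B."""
    return dtw_distance(A, X) < dtw_distance(B, X)
```

The ABX error rate is then the fraction of incorrect trials, aggregated over all (A, B, X) triplets for a given phoneme contrast.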
The Kaldi recipes can be found here.
If your result does not appear there, please email us.
Baseline and topline (Track 2)
For the baseline model, we used the JHU system described in Jansen & van Durme (2011), run on PLP features. It performs DTW matching, uses random projections to increase efficiency, and applies connected-component clustering as a second step. The topline is an Adaptor Grammar with a unigram grammar, run on the gold phoneme transcription. Here, the topline performance is probably not attainable by unsupervised systems, since it uses the gold transcription; it is better seen as a reference for the maximum value that it is reasonable to expect for the metrics used.
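The second step mentioned above, grouping pairwise DTW matches into clusters, can be sketched with a simple union-find over the graph of discovered pairs. This is a hypothetical illustration of connected-component clustering, not the actual JHU implementation; the function names are ours.

```python
def connected_components(pairs):
    """Group discovered fragment pairs into clusters by taking the
    connected components of the match graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in pairs:
        union(a, b)

    # collect every node under its root
    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

For example, the matched pairs (f1, f2), (f2, f3), and (f4, f5) yield the two clusters {f1, f2, f3} and {f4, f5}: any chain of pairwise matches ends up in a single cluster, which is what makes this step sensitive to spurious matches.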
Table 2: Track 2 scores (NED, Coverage, Boundary F-score, Token F-score, Type F-score, Grouping F-score), reported separately for English and Tsonga.
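Among these metrics, NED (normalized edit distance) measures how well the phoneme transcriptions of discovered pairs match: the Levenshtein distance between the two transcriptions, normalized by the longer one, averaged over all discovered pairs (0 is perfect). A minimal sketch, assuming pairs of phoneme strings as input; the function names are ours:

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences, via the
    standard dynamic-programming recurrence (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(pairs):
    """Average normalized edit distance over discovered pairs of
    phoneme transcriptions; 0 means all pairs match exactly."""
    return sum(levenshtein(a, b) / max(len(a), len(b))
               for a, b in pairs) / len(pairs)
```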
Maarten Versteegh, Roland Thiollière, Thomas Schatz, Xuan-Nga Cao, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux. The zero resource speech challenge 2015. In INTERSPEECH-2015. 2015. URL: http://www.isca-speech.org/archive/interspeech_2015/i15_3169.html.
Maarten Versteegh, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux. The zero resource speech challenge 2015: proposed approaches and results. In SLTU-2016. ISCA-ITRW, 2016. URL: http://www.lscp.net/persons/dupoux/papers/Versteegh_AJD_2016.ZeroSpeech%202015%20results.SLTU.pdf.
Roland Thiollière, Ewan Dunbar, Gabriel Synnaeve, Maarten Versteegh, and Emmanuel Dupoux. A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In INTERSPEECH-2015. 2015. URL: http://www.isca-speech.org/archive/interspeech_2015/i15_3169.html.
Leonardo Badino, Alessio Mereta, and Lorenzo Rosasco. Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders. In INTERSPEECH-2015. 2015. URL: http://www.isca-speech.org/archive/interspeech_2015/i15_3174.html.
Daniel Renshaw, Herman Kamper, Aren Jansen, and Sharon Goldwater. A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. In INTERSPEECH-2015. 2015. URL: http://www.isca-speech.org/archive/interspeech_2015/i15_3199.html.
Wiehan Agenbag and Thomas Niesler. Automatic segmentation and clustering of speech using sparse coding and metaheuristic search. In INTERSPEECH-2015. 2015. URL: http://www.isca-speech.org/archive/interspeech_2015/i15_3184.html.
Hongjie Chen, Cheung-Chi Leung, Lei Xie, Bin Ma, and Haizhou Li. Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: a feasibility study. In INTERSPEECH-2015. 2015. URL: http://www.isca-speech.org/archive/interspeech_2015/i15_3189.html.
Pallavi Baljekar, Sunayana Sitaram, Prasanna Kumar Muthukumar, and Alan W. Black. Using articulatory features and inferred phonological segments in zero resource speech processing. In INTERSPEECH-2015. 2015. URL: http://www.cs.cmu.edu/~pbaljeka/papers/IS2015.pdf.
Okko Räsänen, Gabriel Doyle, and Michael C. Frank. Unsupervised word discovery from speech using automatic segmentation into syllable-like units. In INTERSPEECH-2015. 2015. URL: http://www.isca-speech.org/archive/interspeech_2015/i15_3204.html.
Vince Lyzinski, Gregory Sell, and Aren Jansen. An evaluation of graph clustering methods for unsupervised term discovery. In INTERSPEECH-2015. 2015. URL: https://ccrma.stanford.edu/~gsell/pubs/2015_IS1.pdf.
Neil Zeghidour, Gabriel Synnaeve, Maarten Versteegh, and Emmanuel Dupoux. A deep scattering spectrum - deep siamese network pipeline for unsupervised acoustic modeling. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4965–4969. IEEE, 2016.
Michael Heck, Sakriani Sakti, and Satoshi Nakamura. Unsupervised linear discriminant analysis for supporting dpgmm clustering in the zero resource scenario. Procedia Computer Science, 81:73–79, 2016.
Brij Mohan Lal Srivastava and Manish Shrivastava. Articulatory gesture rich representation learning of phonological units in low resource settings. In International Conference on Statistical Language and Speech Processing, 80–95. Springer, 2016.