Zerospeech 2015

The 2015 Challenge appeared as a special session at Interspeech 2015 (September 6-10, 2015, Dresden); see the Interspeech 2015 proceedings. The challenge's aims were presented in [1] and the main results summarized in [2]. The references for Track 1 are [3], [4], [5], [6], [7], [8], and for Track 2 [9], [10]. Further papers were published in the SLTU 2016 special topic on zero resource speech technology and elsewhere [11], [12], [13].

Track 1

Baseline and topline

The baseline and topline ABX error rates for Track 1 are given in Table 1 (see also [1]). For the baseline model, we used 13-dimensional MFCC features computed every 10 ms, and the ABX score was computed using the cosine distance. For the topline model, we used posteriorgrams extracted from a Kaldi GMM-HMM pipeline with MFCC, delta, and delta-delta features, Gaussian mixtures, triphone word-position-dependent states, fMLLR talker adaptation, and a bigram word language model. The exact same Kaldi pipeline was used for the two languages and gave a phone error rate (PER) of 26.4% for English and 7.5% for Tsonga. Note that the two corpora are quite different: the English corpus contains spontaneous, casual speech, whereas the Tsonga corpus contains read speech constructed from a small vocabulary and tailored for building speech recognition applications. The acoustic and language models were trained on the part of the corpora not used in the evaluation, and the posteriors were fed into the ABX evaluation software using the KL divergence. Unsupervised models are expected to fall between the performance of these two systems.
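As a rough illustration of how an ABX error rate can be computed from frame-level features, the sketch below pairs a frame distance (cosine for MFCC-like features, KL divergence for posteriorgrams) with a DTW alignment. It is not the challenge's evaluation software: the function names, the path-length normalization, and the tie handling are assumptions made for this example.

```python
import numpy as np

def cosine_distance(x, y):
    """Pairwise cosine distance between rows of x (n, d) and rows of y (m, d)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T

def kl_divergence(p, q, eps=1e-10):
    """Pairwise KL(p_i || q_j) between rows of p (n, k) and rows of q (m, k)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    # sum_k p_ik * log(p_ik)  -  sum_k p_ik * log(q_jk)
    return np.sum(p * np.log(p), axis=1)[:, None] - p @ np.log(q).T

def dtw_distance(a, b, frame_dist):
    """Mean frame distance along the lowest-cost DTW path (assumed normalization)."""
    d = frame_dist(a, b)
    n, m = d.shape
    cost = np.full((n + 1, m + 1), np.inf)
    steps = np.zeros((n + 1, m + 1))
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Pick the cheapest predecessor: diagonal, up, or left.
            c, s = min(
                (cost[i - 1, j - 1], steps[i - 1, j - 1]),
                (cost[i - 1, j], steps[i - 1, j]),
                (cost[i, j - 1], steps[i, j - 1]),
            )
            cost[i, j] = c + d[i - 1, j - 1]
            steps[i, j] = s + 1
    return cost[n, m] / steps[n, m]

def abx_error_rate(triplets, frame_dist):
    """Fraction of (A, B, X) triplets where X, which shares A's category,
    is closer to B than to A. Ties count as half an error (an assumption)."""
    errors = 0.0
    for a, b, x in triplets:
        d_ax = dtw_distance(a, x, frame_dist)
        d_bx = dtw_distance(b, x, frame_dist)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triplets)
```

Under these assumptions, a baseline-style score would be obtained with `abx_error_rate(triplets, cosine_distance)` on MFCC arrays and a topline-style score with `abx_error_rate(triplets, kl_divergence)` on posteriorgram arrays, where each triplet element is a (frames, dimensions) array.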

System comparisons

Authors | Ref. | Team | With supervision | Binary | English (across) | English (within) | Xitsonga (across) | Xitsonga (within)
Table 1. Track 1 - ABX across and within speaker results, on English and Xitsonga Datasets

If your result does not appear there, please email us.

Track 2

Baseline and topline

For the baseline model, we used the JHU system described in Jansen & van Durme (2011) on PLP features. It performs DTW matching, uses random projections to increase efficiency, and applies connected component clustering as a second step. The topline is an Adaptor Grammar using a unigram grammar, run on the gold phoneme transcription. Because it uses the gold transcription, the topline performance is probably not attainable by unsupervised systems. It serves more as a reference for the maximum value that it is reasonable to expect for the metrics used.
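The connected component clustering step can be pictured with a small union-find sketch: speech fragments are nodes, each DTW match is an edge, and every connected component becomes one discovered cluster. This is only an illustration of the general technique, not the JHU implementation; the function name and the integer fragment indexing are assumptions.

```python
from collections import defaultdict

def connected_components(num_fragments, matched_pairs):
    """Group fragment indices 0..num_fragments-1 into clusters, where
    matched_pairs is an iterable of (i, j) index pairs that were matched."""
    parent = list(range(num_fragments))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matched_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    clusters = defaultdict(list)
    for i in range(num_fragments):
        clusters[find(i)].append(i)
    return sorted(clusters.values())

# Fragments 0-1-2 are chained by matches, as are 3-4.
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [[0, 1, 2], [3, 4]]
```

One consequence of this choice is that a single spurious match can merge two otherwise unrelated clusters, since any edge joins its two components.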

System comparisons

Authors | Team | English: NED, Cov, Boundary F-score, Token F-score, Type F-score, Grouping F-score | Xitsonga: NED, Cov, Boundary F-score, Token F-score, Type F-score, Grouping F-score
Table 2. Track 2 metrics on English and Xitsonga datasets.

Challenge References

  1. Ludusan, Bogdan and Versteegh, Maarten and Jansen, Aren and Gravier, Guillaume and Cao, Xuan-Nga and Johnson, Mark and Dupoux, Emmanuel. Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems. In Proceedings of LREC 2014, 2014, pp. 560-567.
  2. Versteegh, Maarten and Thiollière, Roland and Schatz, Thomas and Cao, Xuan-Nga and Anguera, Xavier and Jansen, Aren and Dupoux, Emmanuel. The Zero Resource Speech Challenge 2015. In INTERSPEECH-2015, 2015.
  3. Versteegh, Maarten and Anguera, Xavier and Jansen, Aren and Dupoux, Emmanuel. The Zero Resource Speech Challenge 2015: Proposed Approaches and Results. In SLTU-2016, 2016.
  4. Thiollière, Roland and Dunbar, Ewan and Synnaeve, Gabriel and Versteegh, Maarten and Dupoux, Emmanuel. A Hybrid Dynamic Time Warping-Deep Neural Network Architecture for Unsupervised Acoustic Modeling. In INTERSPEECH-2015, 2015.
  5. Badino, Leonardo and Mereta, Alessio and Rosasco, Lorenzo. Discovering Discrete Subword Units with Binarized Autoencoders and Hidden-Markov-Model Encoders. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  6. Renshaw, Daniel and Kamper, Herman and Jansen, Aren and Goldwater, Sharon. A Comparison of Neural Network Methods for Unsupervised Representation Learning on the Zero Resource Speech Challenge. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  7. Agenbag, Wiehan and Niesler, Thomas. Automatic Segmentation and Clustering of Speech Using Sparse Coding and Metaheuristic Search. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  8. Chen, Hongjie and Leung, Cheung-Chi and Xie, Lei and Ma, Bin and Li, Haizhou. Parallel Inference of Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic Modeling: A Feasibility Study. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  9. Baljekar, Pallavi and Sitaram, Sunayana and Muthukumar, Prasanna Kumar and Black, Alan W. Using Articulatory Features and Inferred Phonological Segments in Zero Resource Speech Processing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  10. Räsänen, Okko and Doyle, Gabriel and Frank, Michael C. Unsupervised word discovery from speech using automatic segmentation into syllable-like units. In Proceedings of Interspeech, 2015.
  11. Lyzinski, Vince and Sell, Gregory and Jansen, Aren. An Evaluation of Graph Clustering Methods for Unsupervised Term Discovery. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  12. Zeghidour, Neil and Synnaeve, Gabriel and Versteegh, Maarten and Dupoux, Emmanuel. A Deep Scattering Spectrum - Deep Siamese network Pipeline For Unsupervised Acoustic Modeling. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 4965-4969.
  13. Heck, Michael and Sakti, Sakriani and Nakamura, Satoshi. Unsupervised Linear Discriminant Analysis for Supporting DPGMM Clustering in the Zero Resource Scenario. In Procedia Computer Science, 2016, vol. 81, pp. 73-79.
  14. Srivastava, Brij Mohan Lal and Shrivastava, Manish. Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings. In International Conference on Statistical Language and Speech Processing, Springer, 2016, pp. 80-95.