Bibliography

Dunbar, Hamilakis & Dupoux (2022): Dunbar, E., Hamilakis, N. & Dupoux, E. (2022). Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge series. IEEE Journal of Selected Topics in Signal Processing, 16(6). 1211-1226. Retrieved from https://arxiv.org/abs/2005.12656
Hallap, Dupoux & Dunbar (2022): Hallap, M., Dupoux, E. & Dunbar, E. (2022). Evaluating context-invariance in unsupervised speech representations. arXiv preprint arXiv:2210.15775.
Goldwater, Griffiths & Johnson (2009): Goldwater, S., Griffiths, T. & Johnson, M. (2009). A bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1). 21-54.
Agirre, Alfonseca, Hall, Kravalova, Pasca & Soroa (2009): Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M. & Soroa, A. (2009). A study on similarity and relatedness using distributional and wordnet-based approaches.
Al-Rfou, Choe, Constant, Guo & Jones (2018): Al-Rfou, R., Choe, D., Constant, N., Guo, M. & Jones, L. (2018). Character-level language modeling with deeper self-attention. arXiv preprint 1808.04444.
Allen & Seidenberg (1999): Allen, J. & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. The emergence of language. 115-151.
Ansari, Kumar, Singh, Ganapathy & Devi (): Ansari, T., Kumar, R., Singh, S., Ganapathy, S. & Devi, S. (). Unsupervised HMM posteriograms for language independent acoustic modeling in zero resource conditions. Submitted to ASRU 2017.
Chaudhuri, Roth, Ellis, Gallagher, Kaver, Marvin, Pantofaru, Reale, Reid, Wilson & Xi (2018): Chaudhuri, S., Roth, J., Ellis, D., Gallagher, A., Kaver, L., Marvin, R., Pantofaru, C., Reale, N., Reid, L., Wilson, K. & Xi, Z. (2018). AVA-speech: A densely labeled dataset of speech activity in movies. Proceedings of interspeech, 2018. Retrieved from https://arxiv.org/pdf/1808.00606
Baevski, Auli & Mohamed (2019): Baevski, A., Auli, M. & Mohamed, A. (2019). Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912.
Baevski, Zhou, Mohamed & Auli (2020): Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
Baker, Reichart & Korhonen (2014): Baker, S., Reichart, R. & Korhonen, A. (2014). An unsupervised model for instance level subcategorization acquisition. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 278-289.
Bérard, Pietquin, Servan & Besacier (2016): Bérard, A., Pietquin, O., Servan, C. & Besacier, L. (2016). Listen and translate: A proof of concept for end-to-end speech-to-text translation. NIPS workshop on end-to-end learning for speech and audio processing.
Best (1995): Best, C. (1995). A direct realist perspective on cross-language speech perception. Speech perception and linguistic experience: Issues in cross-language research. 167-200. York Press.
Bavin (2009): Bavin, E. (2009). The Cambridge handbook of child language. Cambridge University Press. Retrieved from http://site.ebrary.com/id/10303044
Dunbar, Bernard, Hamilakis, Nguyen, Seyssel, Rozé, Rivière, Kharitonov & Dupoux (2021): Dunbar, E., Bernard, M., Hamilakis, N., Nguyen, T., Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E. & Dupoux, E. (2021). The zero resource speech challenge 2021: Spoken language modelling. Interspeech 2021-conference of the international speech communication association.
Kohonen (1988): Kohonen, T. (1988). The ‘neural’ phonetic typewriter. Computer, 21(3). 11-22.
Adda, Stüker, Adda-Decker, Ambouroue, Besacier, Blachon, Bonneau-Maynard, Godard, Hamlaoui, Idiatov, Kouarata, Lamel, Makasso, Rialland, Van de Velde, Yvon & Zerbian (2016): Adda, G., Stüker, S., Adda-Decker, M., Ambouroue, O., Besacier, L., Blachon, D., Bonneau-Maynard, H., Godard, P., Hamlaoui, F., Idiatov, D., Kouarata, G., Lamel, L., Makasso, E., Rialland, A., Van de Velde, M., Yvon, F. & Zerbian, S. (2016). Breaking the unwritten language barrier: The Bulb project. Proceedings of SLTU (spoken language technologies for under-resourced languages).
Akaike (1974): Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6). 716-723. IEEE.
Alishahi, Barking & Chrupała (2017): Alishahi, A., Barking, M. & Chrupała, G. (2017). Encoding of phonology in a recurrent neural model of grounded speech. Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 368-378.
Ansari, Singh, Kumar & Ganapathy (): Ansari, T., Singh, S., Kumar, R. & Ganapathy, S. (). Deep learning methods for unsupervised acoustic modeling: LEAP submission to ZeroSpeech challenge 2017. Submitted to ASRU 2017.
Ansari, Kumar, Singh & Ganapathy (2017): Ansari, T., Kumar, R., Singh, S. & Ganapathy, S. (2017). Deep learning methods for unsupervised acoustic modeling—leap submission to zerospeech challenge 2017. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 754-761. IEEE.
Jansen & Van Durme (2011): Jansen, A. & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. Automatic speech recognition and understanding (ASRU), 2011 IEEE workshop on. 401-406. IEEE.
Badino, Canevari, Fadiga & Metta (2014): Badino, L., Canevari, C., Fadiga, L. & Metta, G. (2014). An Auto-encoder based approach to unsupervised learning of subword units. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Baevski, Schneider & Auli (2020): Baevski, A., Schneider, S. & Auli, M. (2020). Vq-wav2vec: Self-supervised learning of discrete speech representations. International conference on learning representations. Retrieved from https://openreview.net/forum?id=rylwJxrYDS
Hannun, Case, Casper, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates et al. (2014): Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A. et al. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Bengio, Ducharme, Vincent & Jauvin (2003): Bengio, Y., Ducharme, R., Vincent, P. & Jauvin, C. (2003). A neural probabilistic language model. JMLR.
Besacier, Zhou & Gao (2006): Besacier, L., Zhou, B. & Gao, Y. (2006). Towards speech translation of non written languages. Spoken Language Technology Workshop, 2006. IEEE. 222-225.
Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019): Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. & Bowman, S. (2019). BLiMP: A benchmark of linguistic minimal pairs for english. arXiv preprint arXiv:1912.00582.
Peng & Harwath (2022): Peng, P. & Harwath, D. (2022). Self-supervised representation learning for speech using visual grounding and masked language modeling. arXiv preprint arXiv:2202.03543.
Bruni, Boleda, Baroni & Tran (2012): Bruni, E., Boleda, G., Baroni, M. & Tran, N. (2012). Distributional semantics in technicolor. Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: Long papers). 136-145.
Chalnick & Billman (1988): Chalnick, A. & Billman, D. (1988). Unsupervised learning of correlational structure. Proceedings of the tenth annual conference of the cognitive science society. 510-516. Lawrence Erlbaum Associates.
Chen, Leung, Xie, Ma & Li (2015): Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. INTERSPEECH.
Chrupała (2021): Chrupała, G. (2021). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Retrieved from https://arxiv.org/abs/2104.13225
Chung & Glass (2018): Chung, Y. & Glass, J. (2018). Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
Chung, Hsu, Tang & Glass (2019): Chung, Y., Hsu, W., Tang, H. & Glass, J. (2019). An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240.
Chung, Hsu, Tang & Glass (2019): Chung, Y., Hsu, W., Tang, H. & Glass, J. (2019). An unsupervised autoregressive model for speech representation learning. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 146-150.
Keuleers & Brysbaert (2010): Keuleers, E. & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior research methods, 42(3). 627-633. Springer.
Kharitonov, Lee, Polyak, Adi, Copet, Lakhotia, Nguyen, Rivière, Mohamed, Dupoux et al. (2021): Kharitonov, E., Lee, A., Polyak, A., Adi, Y., Copet, J., Lakhotia, K., Nguyen, T., Rivière, M., Mohamed, A., Dupoux, E. et al. (2021). Text-free prosody-aware generative spoken language modeling. arXiv preprint arXiv:2109.03264.
Heck, Sakti & Nakamura (2016): Heck, M., Sakti, S. & Nakamura, S. (2016). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. Procedia Computer Science, 81. 73-79. Elsevier.
Srivastava & Shrivastava (2016): Srivastava, B. & Shrivastava, M. (2016). Articulatory gesture rich representation learning of phonological units in low resource settings. International conference on statistical language and speech processing. 80-95. Springer.
Heck, Sakti & Nakamura (2017): Heck, M., Sakti, S. & Nakamura, S. (2017). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 740-746. IEEE.
Shibata, Kato, Shinozaki & Watanabe (2017): Shibata, H., Kato, T., Shinozaki, T. & Watanabe, S. (2017). Composite embedding systems for ZeroSpeech2017 Track1. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 747-753. IEEE.
Chorowski, Weiss, Bengio & Van Den Oord (2019): Chorowski, J., Weiss, R., Bengio, S. & Van Den Oord, A. (2019). Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing, 27(12). 2041-2053. IEEE.
Kamper, Livescu & Goldwater (2017): Kamper, H., Livescu, K. & Goldwater, S. (2017). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 719-726. IEEE.
Hsu, Harwath & Glass (2019): Hsu, W., Harwath, D. & Glass, J. (2019). Transfer learning from audio-visual grounding to speech recognition. arXiv preprint arXiv:1907.04355.
Chung & Glass (2019): Chung, Y. & Glass, J. (2019). Generative pre-training for speech with autoregressive predictive coding. arXiv preprint arXiv:1910.12607.
Millet, Chitoran & Dunbar (2021): Millet, J., Chitoran, I. & Dunbar, E. (2021). Predicting non-native speech perception using the perceptual assimilation model and state-of-the-art acoustic models. Proceedings of the 25th conference on computational natural language learning. 661-673.
Warstadt, Singh & Bowman (2018): Warstadt, A., Singh, A. & Bowman, S. (2018). Neural network acceptability judgments. arXiv preprint 1805.12471.
Dai, Yang, Yang, Cohen, Carbonell, Le & Salakhutdinov (2019): Dai, Z., Yang, Z., Yang, Y., Cohen, W., Carbonell, J., Le, Q. & Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint 1901.02860.
Räsänen, Doyle & Frank (2015): Räsänen, O., Doyle, G. & Frank, M. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. Sixteenth annual conference of the international speech communication association.
Räsänen & Blandón (2020): Räsänen, O. & Blandón, M. (2020). Unsupervised discovery of recurring speech patterns using probabilistic adaptive metrics. arXiv preprint arXiv:2008.00731.
Prakash, Kumar, Murthy et al. (2020): Prakash, A., Kumar, M., Murthy, H. et al. (2020). Exploration of end-to-end synthesisers for zero resource speech challenge 2020. arXiv preprint arXiv:2009.04983.
Davis & Mermelstein (1980): Davis, S. & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4). 357-366.
Lee & Glass (2012): Lee, C. & Glass, J. (2012). A nonparametric bayesian approach to acoustic model discovery. ACL (1). 40-49. The Association for Computer Linguistics.
Hsu, Hwang, Wu, Tsao & Wang (2016): Hsu, C., Hwang, H., Wu, Y., Tsao, Y. & Wang, H. (2016). Voice conversion from non-parallel corpora using variational auto-encoder. Asia-pacific signal and information processing association annual summit and conference, APSIPA 2016, jeju, south korea, december 13-16, 2016. 1-6.
Tjandra, Sakti & Nakamura (2017): Tjandra, A., Sakti, S. & Nakamura, S. (2017). Listening while speaking: Speech chain by deep learning. ASRU 2017. 301-308.
Badino, Canevari, Fadiga & Metta (2014): Badino, L., Canevari, C., Fadiga, L. & Metta, G. (2014). An auto-encoder based approach to unsupervised learning of subword units. ICASSP. 7634-7638. IEEE.
Gao, Singh & Raj (2018): Gao, Y., Singh, R. & Raj, B. (2018). Voice impersonation using generative adversarial networks. ICASSP. 2506-2510. IEEE.
Jansen, Thomas & Hermansky (2013): Jansen, A., Thomas, S. & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. ICASSP. 8091-8095. IEEE.
Eloff, Nortje, Niekerk, Govender, Nortje, Pretorius, Van Biljon, Westhuizen, Staden & Kamper (2019): Eloff, R., Nortje, A., Niekerk, B., Govender, A., Nortje, L., Pretorius, A., Van Biljon, E., Westhuizen, E., Staden, L. & Kamper, H. (2019). Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. arXiv preprint arXiv:1904.07556.
Yusuf, Gök, Gündogdu, Kose & Saraclar (2019): Yusuf, B., Gök, A., Gündogdu, B., Kose, O. & Saraclar, M. (2019). Temporally-aware acoustic unit discovery for zerospeech 2019 challenge. INTERSPEECH. 1098-1102.
Liu, Hsu & Lee (2019): Liu, A., Hsu, P. & Lee, H. (2019). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. arXiv preprint arXiv:1905.11563.
Nayak, Kumar, Ramesh, Bhati & Murty (2019): Nayak, S., Kumar, C., Ramesh, G., Bhati, S. & Murty, K. (2019). Virtual phone discovery for speech synthesis without text. 2019 IEEE global conference on signal and information processing (GlobalSIP). 1-5. IEEE.
Muthukumar & Black (2014): Muthukumar, P. & Black, A. (2014). Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. IEEE international conference on acoustics, speech and signal processing, ICASSP 2014, florence, italy, may 4-9, 2014. 2594-2598.
Scharenborg, Besacier, Black, Hasegawa-Johnson, Metze, Neubig, Stüker, Godard, Müller, Ondel, Palaskar, Arthur, Ciannella, Du, Larsen, Merkx, Riad, Wang & Dupoux (2018): Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stüker, S., Godard, P., Müller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L. & Dupoux, E. (2018). Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "speaking rosetta" JSALT 2017 workshop. ICASSP. 4979-4983. IEEE.
Shen, Pang, Weiss, Schuster, Jaitly, Yang, Chen, Zhang, Wang, Ryan, Saurous, Agiomyrgiannakis & Wu (2018): Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Ryan, R., Saurous, R., Agiomyrgiannakis, Y. & Wu, Y. (2018). Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions. ICASSP. 4779-4783. IEEE.
Heck, Sakti & Nakamura (2016): Heck, M., Sakti, S. & Nakamura, S. (2016). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. SLTU-2016, 5th workshop on spoken language technologies for under-resourced languages, 9-12 may 2016, yogyakarta, indonesia. 73-79.
Ondel, Burget & Cernocký (2016): Ondel, L., Burget, L. & Cernocký, J. (2016). Variational inference for acoustic unit discovery. SLTU, 81. 80-86. Elsevier.
Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior & Kavukcuoglu (2016): Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. SSW. 125. ISCA.
Wu, Watts & King (2016): Wu, Z., Watts, O. & King, S. (2016). Merlin: An open source neural network speech synthesis system. Speech Synthesis Workshop. 202-207. ISCA.
Ping, Peng, Gibiansky, Arik, Kannan, Narang, Raiman & Miller (2017): Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J. & Miller, J. (2017). Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654.
Kaneko & Kameoka (2017): Kaneko, T. & Kameoka, H. (2017). Parallel-data-free voice conversion using cycle-consistent adversarial networks. CoRR, abs/1711.11293. Retrieved from https://arxiv.org/abs/1711.11293
Chou, Yeh, Lee & Lee (2018): Chou, J., Yeh, C., Lee, H. & Lee, L. (2018). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. CoRR, abs/1804.02812. Retrieved from https://arxiv.org/abs/1804.02812
Li, Liu, Liu, Zhao, Liu & Zhou (2018): Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M. & Zhou, M. (2018). Close to human quality TTS with transformer. CoRR, abs/1809.08895. Retrieved from https://arxiv.org/abs/1809.08895
Mehri, Kumar, Gulrajani, Kumar, Jain, Sotelo, Courville & Bengio (2016): Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A. & Bengio, Y. (2016). SampleRNN: An unconditional end-to-end neural audio generation model. CoRR, abs/1612.07837. Retrieved from https://arxiv.org/abs/1612.07837
Taigman, Wolf, Polyak & Nachmani (2017): Taigman, Y., Wolf, L., Polyak, A. & Nachmani, E. (2017). Voice synthesis for in-the-wild speakers via a phonological loop. CoRR, abs/1707.06588.
Dillon, Dunbar & Idsardi (2013): Dillon, B., Dunbar, E. & Idsardi, W. (2013). A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37(2). 344-377. Wiley Online Library.
DeCarlo (1998): DeCarlo, L. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3(2). 186. American Psychological Association.
Deng, Dong, Socher, Li, Li & Fei-Fei (2009): Deng, J., Dong, W., Socher, R., Li, L., Li, K. & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248-255.
Devlin, Chang, Lee & Toutanova (2019): Devlin, J., Chang, M., Lee, K. & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
Driesen & Van hamme (2011): Driesen, J. & Van hamme, H. (2011). Modeling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA. Neurocomputing, 74. 1874-1882.
Dunbar, Cao, Benjumea, Karadayi, Bernard, Besacier, Anguera & Dupoux (2017): Dunbar, E., Cao, X., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X. & Dupoux, E. (2017). The Zero Resource Speech Challenge 2017. 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). 323-330. IEEE. Retrieved from https://arxiv.org/abs/1712.04313
Algayres, Zaiem, Sagot & Dupoux (2020): Algayres, R., Zaiem, M., Sagot, B. & Dupoux, E. (2020). Evaluating the reliability of acoustic speech embeddings. arXiv preprint arXiv:2007.13542.
Riad, Dancette, Karadayi, Zeghidour, Schatz & Dupoux (2018): Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T. & Dupoux, E. (2018). Sampling strategies in siamese networks for unsupervised speech representation learning. arXiv preprint arXiv:1804.11297.
Dunbar, Algayres, Karadayi, Bernard, Benjumea, Cao, Miskic, Dugrain, Ondel, Black et al. (2019): Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X., Miskic, L., Dugrain, C., Ondel, L., Black, A. et al. (2019). The zero resource speech challenge 2019: TTS without T. INTERSPEECH. Retrieved from https://arxiv.org/abs/1904.11469
Dunbar, Karadayi, Bernard, Cao, Algayres, Ondel, Besacier, Sakti & Dupoux (2020): Dunbar, E., Karadayi, J., Bernard, M., Cao, X., Algayres, R., Ondel, L., Besacier, L., Sakti, S. & Dupoux, E. (2020). The zero resource speech challenge 2020: Discovering discrete subword and word units. INTERSPEECH.
Duong, Anastasopoulos, Chiang, Bird & Cohn (2016): Duong, L., Anastasopoulos, A., Chiang, D., Bird, S. & Cohn, T. (2016). An attentional model for speech translation without transcription. Proceedings of NAACL-HLT. 949-959.
Dupoux (2018): Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173. 43-59. Elsevier. Retrieved from https://arxiv.org/abs/1607.08723
Peters, Neumann, Iyyer, Gardner, Clark, Lee & Zettlemoyer (2018): Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. & Zettlemoyer, L. (2018). Deep contextualized word representations. NAACL.
Faruqui, Tsvetkov, Rastogi & Dyer (2016): Faruqui, M., Tsvetkov, Y., Rastogi, P. & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.
Feigenbaum (1963): Feigenbaum, E. (1963). The simulation of verbal learning behavior. Computers and thought. McGraw-Hill.
Feldman & Griffiths (2007): Feldman, N. & Griffiths, T. (2007). A rational account of the perceptual magnet effect. Proceedings of the annual meeting of the cognitive science society, 29.
Feldman, Griffiths, Goldwater & Morgan (2013): Feldman, N., Griffiths, T., Goldwater, S. & Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological review, 120(4). 751-778. American Psychological Association.
Feng, Lee & Peng (2019): Feng, S., Lee, T. & Peng, Z. (2019). Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1906.07234
Cieri, Miller & Walker (2004): Cieri, C., Miller, D. & Walker, K. (2004). The fisher corpus: A resource for the next generations of speech-to-text. LREC.
Frome, Corrado, Shlens, Bengio, Dean, Ranzato & Mikolov (2013): Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M. & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems (NIPS 2013). 2121-2129.
Futrell, Wilcox, Morita, Qian, Ballesteros & Levy (2019): Futrell, R., Wilcox, E., Morita, T., Qian, P., Ballesteros, M. & Levy, R. (2019). Neural language models as psycholinguistic subjects: Representations of syntactic state.
Futrell, Wilcox, Morita & Levy (2018): Futrell, R., Wilcox, E., Morita, T. & Levy, R. (2018). RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint 1809.01329.
Gage (1994): Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2). 23-38. McPherson, KS: R & D Publications, c1987-1994.
García-Granada, Sanchis, Castro-Bleda, González & Hurtado (): García-Granada, F., Sanchis, E., Castro-Bleda, M., González, J. & Hurtado, L. (). ZeroSpeech2017 ELIRF-UPV system. Submitted to ASRU 2017.
Gerz, Vulić, Hill, Reichart & Korhonen (2016): Gerz, D., Vulić, I., Hill, F., Reichart, R. & Korhonen, A. (2016). Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
Glass (2012): Glass, J. (2012). Towards unsupervised speech processing. Information Science, Signal Processing and their Applications (ISSPA), 2012 11th International Conference on. 1-4. IEEE.
Myrman & Salvi (2017): Myrman, A. & Salvi, G. (2017). Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling. International Workshop on Grounding Language Understanding (GLU). ISCA.
Godais, Linzen & Dupoux (2017): Godais, G., Linzen, T. & Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. 125-130.
Godfrey, Holliman & McDaniel (1992): Godfrey, J., Holliman, E. & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for research and development. [Proceedings] ICASSP-92: 1992 IEEE international conference on acoustics, speech, and signal processing, 1. 517-520. IEEE.
Goldberg (2019): Goldberg, Y. (2019). Assessing BERT’s syntactic abilities. arXiv preprint 1901.05287.
Goldwater, Griffiths & Johnson (2009): Goldwater, S., Griffiths, T. & Johnson, M. (2009). A bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112. 21-54. Elsevier.
Guenther & Gjaja (1996): Guenther, F. & Gjaja, M. (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America, 100(2). 1111-1121. Acoustical Society of America.
Gulordava, Bojanowski, Grave, Linzen & Baroni (2018): Gulordava, K., Bojanowski, P., Grave, E., Linzen, T. & Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Retrieved from https://www.aclweb.org/anthology/N18-1108
Hahn & Baroni (2019): Hahn, M. & Baroni, M. (2019). Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text. Transactions of the Association for Computational Linguistics (Accepted). Retrieved from https://arxiv.org/abs/1906.07285
Halawi, Dror, Gabrilovich & Koren (2012): Halawi, G., Dror, G., Gabrilovich, E. & Koren, Y. (2012). Large-scale learning of word relatedness with constraints. Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. 1406-1414.
Harwath & Glass (2015): Harwath, D. & Glass, J. (2015). Deep multimodal semantic embeddings for speech and images. 2015 IEEE workshop on automatic speech recognition and understanding (ASRU). 237-244. IEEE.
Harwath, Torralba & Glass (2016): Harwath, D., Torralba, A. & Glass, J. (2016). Unsupervised learning of spoken language with visual context. Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS 2016). 1858-1866.
Harwath, Hsu & Glass (2019): Harwath, D., Hsu, W. & Glass, J. (2019). Learning hierarchical discrete linguistic units from visually-grounded speech. arXiv preprint arXiv:1911.09602.
Tiede, Espy-Wilson, Goldenberg, Mitra, Nam & Sivaraman (2017): Tiede, M., Espy-Wilson, C., Goldenberg, D., Mitra, V., Nam, H. & Sivaraman, G. (2017). Quantifying kinematic aspects of reduction in a contrasting rate production task. The Journal of the Acoustical Society of America, 141(5). 3580-3580. Retrieved from https://doi.org/10.1121/1.4987629
Hastie, Tibshirani & Friedman (2009): Hastie, T., Tibshirani, R. & Friedman, J. (2009). The elements of statistical learning – data mining, inference, and prediction. Springer.
Havard, Besacier & Rosec (2017): Havard, W., Besacier, L. & Rosec, O. (2017). SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set. Proc. GLU 2017 international workshop on grounding language understanding. 42-46. Retrieved from http://dx.doi.org/10.21437/GLU.2017-9
Arandjelovic & Zisserman (2017): Arandjelovic, R. & Zisserman, A. (2017). Look, listen and learn. Proceedings of the IEEE international conference on computer vision. 609-617.
Chrupała, Gelderloos & Alishahi (2017): Chrupała, G., Gelderloos, L. & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. arXiv preprint arXiv:1702.01991.
Chrupała, Gelderloos & Alishahi (2017): Chrupała, G., Gelderloos, L. & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 613-622.
Jansen, Dupoux, Goldwater, Johnson, Khudanpur, Church, Feldman, Hermansky, Metze, Rose, Seltzer, Clark, McGraw, Varadarajan, Bennett, Borschinger, Chiu, Dunbar, Fourtassi, Harwath, Lee, Levin, Norouzian, Peddinti, Richardson, Schatz & Thomas (2013): Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., Feldman, N., Hermansky, H., Metze, F., Rose, R., Seltzer, M., Clark, P., McGraw, I., Varadarajan, B., Bennett, E., Borschinger, B., Chiu, J., Dunbar, E., Fourtassi, A., Harwath, D., Lee, C., Levin, K., Norouzian, A., Peddinti, V., Richardson, R., Schatz, T. & Thomas, S. (2013). A summary of the 2012 JH CLSP Workshop on zero resource speech technologies and models of early language acquisition. Proceedings of ICASSP 2013.
Elsner, Goldwater & Eisenstein (2012): Elsner, M., Goldwater, S. & Eisenstein, J. (2012). Bootstrapping a unified model of lexical and phonetic acquisition. Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: Long papers). 184-193.
Bostrom & Durrett (2020): Bostrom, K. & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. Retrieved from https://arxiv.org/abs/2004.03720
Fer, Matejka, Grezl, Plchot, Vesely & Cernocky (2017): Fer, R., Matejka, P., Grezl, F., Plchot, O., Vesely, K. & Cernocky, J. (2017). Multilingually trained bottleneck features in spoken language recognition. Computer Speech and Language, 46(Supplement C). 252-267.
Yusuf, Gok, Gundogdu, Kose & Saraclar (2019): Yusuf, B., Gok, A., Gundogdu, B., Kose, O. & Saraclar, M. (2019). Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge. INTERSPEECH 2019.
Pitt, Dilley, Johnson, Kiesling, Raymond, Hume & Fosler-Lussier (2007): Pitt, M., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E. & Fosler-Lussier, E. (2007). Buckeye corpus of conversational speech (2nd release). www.buckeyecorpus.osu.edu; Columbus, OH: Department of Psychology, Ohio State University (Distributor).
Barnard (2014): Barnard, D. (2014). The NCHLT speech corpus of the south african languages. https://sites.google.com/site/nchltspeechcorpus/home; 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. Retrieved from http://hdl.handle.net/10204/7549
Chen, Leung, Xie, Ma & Li (): Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (). Multilingual bottle-neck feature learning from untranscribed speech. Submitted to ASRU 2017.
Cho, Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk & Bengio (2014): Cho, K., Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724-1734. Association for Computational Linguistics.
Chrupała (2019): Chrupała, G. (2019). Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6452-6462. Association for Computational Linguistics.
Badino, Mereta & Rosasco (2015): Badino, L., Mereta, A. & Rosasco, L. (2015). Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders. Sixteenth annual conference of the international speech communication association.
Chen, Leung, Xie, Ma & Li (2015): Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. Sixteenth annual conference of the international speech communication association.
Myrman & Salvi (2017): Myrman, A. & Salvi, G. (2017). Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling. International workshop on grounding language understanding (GLU). ISCA.
Renshaw, Kamper, Jansen & Goldwater (2015): Renshaw, D., Kamper, H., Jansen, A. & Goldwater, S. (2015). A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. Sixteenth annual conference of the international speech communication association.
Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015): Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M. & Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. Sixteenth annual conference of the international speech communication association.
Zeghidour, Synnaeve, Versteegh & Dupoux (2016): Zeghidour, N., Synnaeve, G., Versteegh, M. & Dupoux, E. (2016). A deep scattering spectrum—deep siamese network pipeline for unsupervised acoustic modeling. 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). 4965-4969. IEEE.
Chen, Leung, Xie, Ma & Li (2017): Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2017). Multilingual bottle-neck feature learning from untranscribed speech. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 727-733. IEEE.
Pellegrini, Manenti & Pinquier (2017): Pellegrini, T., Manenti, C. & Pinquier, J. (2017). Technical report: The IRIT-UPS system @ ZeroSpeech 2017 Track 1: Unsupervised subword modeling. Tech. rep., IRIT, Université de Toulouse.
Kharitonov, Rivière, Synnaeve, Wolf, Mazaré, Douze & Dupoux (2021): Kharitonov, E., Rivière, M., Synnaeve, G., Wolf, L., Mazaré, P., Douze, M. & Dupoux, E. (2021). Data augmenting contrastive learning of speech representations in the time domain. 2021 IEEE spoken language technology workshop (SLT). 215-222. IEEE.
Jansen & Van Durme (2011): Jansen, A. & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. 2011 IEEE workshop on automatic speech recognition & understanding. 401-406. IEEE.
Seshadri, Remes & Räsänen (2017): Seshadri, S., Remes, U. & Räsänen, O. (2017). Comparison of non-parametric Bayesian mixture models for syllable clustering and zero-resource speech processing. INTERSPEECH 2017. ISCA.
Lyzinski, Sell & Jansen (2015): Lyzinski, V., Sell, G. & Jansen, A. (2015). An evaluation of graph clustering methods for unsupervised term discovery. Sixteenth annual conference of the international speech communication association.
Lakhotia, Kharitonov, Hsu, Adi, Polyak, Bolte, Nguyen, Copet, Baevski, Mohamed et al. (2021): Lakhotia, K., Kharitonov, E., Hsu, W., Adi, Y., Polyak, A., Bolte, B., Nguyen, T., Copet, J., Baevski, A., Mohamed, A. et al. (2021). On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9. 1336-1354. MIT Press.
Millet & Dunbar (2020): Millet, J. & Dunbar, E. (2020). The Perceptimatic English benchmark for speech perception models. CogSci Conference 2020.
Millet & Dunbar (2022): Millet, J. & Dunbar, E. (2022). Do self-supervised speech models develop human-like perception biases?
Moore (2012): Moore, B. (2012). An introduction to the psychology of hearing. Brill.
Weerts, Rosen, Clopath & Goodman (2021): Weerts, L., Rosen, S., Clopath, C. & Goodman, D. (2021). The psychometrics of automatic speech recognition. bioRxiv. Cold Spring Harbor Laboratory.
Tsuji, Cristia & Dupoux (2021): Tsuji, S., Cristia, A. & Dupoux, E. (2021). SCALa: A blueprint for computational models of language acquisition in social context. Cognition, 213. 104779. Elsevier.
Buerkin-Pontrelli, Culbertson, Legendre & Nazzi (2017): Buerkin-Pontrelli, A., Culbertson, J., Legendre, G. & Nazzi, T. (2017). Competing models of liaison acquisition: Evidence from corpus and experimental data. Language, 93(1). 189-219. Linguistic Society of America.
Babineau, Legrand & Shi (2021): Babineau, M., Legrand, C. & Shi, R. (2021). Variable forms in French-learning toddlers’ lexical representations. Developmental Psychology. American Psychological Association.
Van Gijn & Zúñiga (2014): Van Gijn, R. & Zúñiga, F. (2014). Word and the Americanist perspective. Morphology, 24(3). 135-160. Springer.
Millet & Dunbar (2020): Millet, J. & Dunbar, E. (2020). Perceptimatic: A human speech perception benchmark for unsupervised subword modelling. arXiv preprint arXiv:2010.05961.
Warstadt & Bowman (2019): Warstadt, A. & Bowman, S. (2019). Grammatical analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.
Pandia & Murthy (2020): Pandia, K. & Murthy, H. (2020). Zero resource speech synthesis using transcripts derived from perceptual acoustic units. arXiv preprint arXiv:2006.04372.
Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021): Chorowski, J., Ciesielski, G., Dzikowski, J., Łańcucki, A., Marxer, R., Opala, M., Pusz, P., Rychlikowski, P. & Stypułkowski, M. (2021). Aligned contrastive predictive coding. arXiv preprint arXiv:2104.11946.
Chrupała (2022): Chrupała, G. (2022). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Journal of Artificial Intelligence Research, 73. 673-707.
Hsu, Bolte, Tsai, Lakhotia, Salakhutdinov & Mohamed (2021): Hsu, W., Bolte, B., Tsai, Y., Lakhotia, K., Salakhutdinov, R. & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29. 3451-3460. IEEE.
Gwilliams, Linzen, Poeppel & Marantz (2018): Gwilliams, L., Linzen, T., Poeppel, D. & Marantz, A. (2018). In spoken word recognition, the future predicts the past. Journal of Neuroscience, 38(35). 7585-7599. Soc Neuroscience.
Beekhuizen, Armstrong & Stevenson (2021): Beekhuizen, B., Armstrong, B. & Stevenson, S. (2021). Probing lexical ambiguity: Word vectors encode number and relatedness of senses. Cognitive Science, 45(5). e12943. Wiley Online Library.
Nikolaus, Alishahi & Chrupała (2022): Nikolaus, M., Alishahi, A. & Chrupała, G. (2022). Learning English with Peppa Pig. arXiv preprint arXiv:2202.12917.
Havard, Chevrot & Besacier (2019): Havard, W., Chevrot, J. & Besacier, L. (2019). Models of visually grounded speech signal pay attention to nouns: A bilingual experiment on English and Japanese. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019). 8618-8622.
Havard, Chevrot & Besacier (2019): Havard, W., Chevrot, J. & Besacier, L. (2019). Word recognition, competition, and activation in a model of visually grounded speech. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL 2019). 339-348.
He, Zhang, Ren & Sun (2016): He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. 770-778. IEEE.
Heck, Sakti & Nakamura (): Heck, M., Sakti, S. & Nakamura, S. (). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017. Submitted to ASRU 2017.
Higy, Elliott & Chrupała (2020): Higy, B., Elliott, D. & Chrupała, G. (2020). Textual Supervision for Visually Grounded Spoken Language Understanding. Findings of the Association for Computational Linguistics: EMNLP 2020. 2698-2709. Association for Computational Linguistics.
Hill (1983): Hill, J. (1983). A computational model of language acquisition in the two-year-old. Cognition and Brain Theory, 6. 287-317.
Hill, Reichart & Korhonen (2015): Hill, F., Reichart, R. & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4). 665-695. MIT Press.
Hochreiter & Schmidhuber (1997): Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8). 1735-1780. MIT Press.
Bin & Yuan (2019): Bin, Y. & Yuan, W. (2019). A VAE model with speaker verification for unsupervised subword modeling: A submission to ZeroSpeech 2019. Submitted to INTERSPEECH 2019.
Hsu, Harwath, Song & Glass (2020): Hsu, W., Harwath, D., Song, C. & Glass, J. (2020). Text-Free Image-to-Speech Synthesis Using Learned Segmental Units. 34th Conference on Neural Information Processing Systems (NeurIPS) Workshop on Self-Supervised Learning for Speech and Audio Processing.
Huijbregts, McLaren & Leeuwen (2011): Huijbregts, M., McLaren, M. & Leeuwen, D. (2011). Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4436-4439.
(2019): (2019). INTERSPEECH 2019 – 20th annual conference of the International Speech Communication Association, September 15-19, Graz, Austria, proceedings.
Riochet, Castro, Bernard, Lerer, Fergus, Izard & Dupoux (2018): Riochet, R., Castro, M., Bernard, M., Lerer, A., Fergus, R., Izard, V. & Dupoux, E. (2018). IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616.
Jansen & Van Durme (2011): Jansen, A. & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. 401-406.
Jansen, Thomas & Hermansky (2013): Jansen, A., Thomas, S. & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. ICASSP. 8091-8095.
Johnson, Griffiths & Goldwater (2007): Johnson, M., Griffiths, T. & Goldwater, S. (2007). Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. Advances in neural information processing systems, 19. 641-648. MIT Press.
Jürgens, Brand & Kollmeier (2007): Jürgens, T., Brand, T. & Kollmeier, B. (2007). Modelling the human-machine gap in speech reception: Microscopic speech intelligibility prediction for normal-hearing subjects with an auditory model. Eighth annual conference of the international speech communication association.
Kahn, Riviere, Zheng, Kharitonov, Xu, Mazare, Karadayi, Liptchinsky, Collobert, Fuegen et al. (2020): Kahn, J., Riviere, M., Zheng, W., Kharitonov, E., Xu, Q., Mazare, P., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C. et al. (2020). Libri-light: A benchmark for ASR with limited or no supervision. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. Retrieved from http://dx.doi.org/10.1109/ICASSP40776.2020.9052942
Kahn, Rivière, Zheng, Kharitonov, Xu, Mazaré, Karadayi, Liptchinsky, Collobert, Fuegen, Likhomanenko, Synnaeve, Joulin, Mohamed & Dupoux (2020): Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A. & Dupoux, E. (2020). Libri-light: A benchmark for ASR with limited or no supervision. INTERSPEECH. Retrieved from https://arxiv.org/abs/1912.07875
Kamper, Livescu & Goldwater (2017): Kamper, H., Livescu, K. & Goldwater, S. (2017). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. ASRU 2017. Retrieved from https://arxiv.org/abs/1904.07556
Kamper, Shakhnarovich & Livescu (2019): Kamper, H., Shakhnarovich, G. & Livescu, K. (2019). Semantic speech retrieval with a visually grounded model of untranscribed speech. IEEE/ACM Transactions on Audio, Speech and Language Processing, 27. 89-98.
Kamper, Elsner, Jansen & Goldwater (2015): Kamper, H., Elsner, M., Jansen, A. & Goldwater, S. (2015). Unsupervised neural network based feature extraction using weak top-down constraints. Proceedings of ICASSP.
Karpathy & Li (2015): Karpathy, A. & Li, F. (2015). Deep visual-semantic alignments for generating image descriptions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015). 3128-3137.
Kawakami, Wang, Dyer, Blunsom & Oord (2020): Kawakami, K., Wang, L., Dyer, C., Blunsom, P. & Oord, A. (2020). Learning robust and multilingual speech representations. Retrieved from https://arxiv.org/abs/2001.11128
Kleinschmidt & Jaeger (2015): Kleinschmidt, D. & Jaeger, T. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2). 148-203. American Psychological Association.
Kuhl (1991): Kuhl, P. (1991). Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Attention, Perception, & Psychophysics, 50(2). 93-107. Springer.
Lau, Clark & Lappin (2017): Lau, J., Clark, A. & Lappin, S. (2017). Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science. 1202-1241.
Lee & Glass (2012): Lee, C. & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. 40-49.
Chomsky (1957): Chomsky, N. (1957). Syntactic structures. JSTOR.
Liberman, Cooper, Shankweiler & Studdert-Kennedy (1967): Liberman, A., Cooper, F., Shankweiler, D. & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6). 431. American Psychological Association.
Fowler (1986): Fowler, C. (1986). An event approach to the study of speech perception from a direct–realist perspective. Journal of phonetics, 14(1). 3-28. Elsevier.
Baljekar, Sitaram, Muthukumar & Black (2015): Baljekar, P., Sitaram, S., Muthukumar, P. & Black, A. (2015). Using articulatory features and inferred phonological segments in zero resource speech processing. Sixteenth annual conference of the international speech communication association.
Morita & Koda (2020): Morita, T. & Koda, H. (2020). Exploring TTS without T using biologically/psychologically motivated neural network modules (ZeroSpeech 2020). arXiv preprint arXiv:2005.05487.
Chomsky & Halle (1968): Chomsky, N. & Halle, M. (1968). The sound pattern of English. Harper & Row.
Linzen, Dupoux & Goldberg (2016): Linzen, T., Dupoux, E. & Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL.
Linzen & Leonard (2018): Linzen, T. & Leonard, B. (2018). Distinct patterns of syntactic agreement errors in recurrent networks and humans. arXiv preprint arXiv:1807.06882.
Lisker & Abramson (1964): Lisker, L. & Abramson, A. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3). 384-422. Taylor & Francis.
Liu, Hsu & Lee (2019): Liu, A., Hsu, P. & Lee, H. (2019). Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1905.11563
Liu, Lowe, Serban, Noseworthy, Charlin & Pineau (2016): Liu, C., Lowe, R., Serban, I., Noseworthy, M., Charlin, L. & Pineau, J. (2016). How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer & Stoyanov (2019): Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692. Retrieved from http://arxiv.org/abs/1907.11692
Bates, Mächler, Bolker & Walker (2015): Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1). 1-48.
Ludusan, Versteegh, Jansen, Gravier, Cao, Johnson & Dupoux (2014): Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X., Johnson, M. & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: An evaluation toolbox for term discovery systems. Proceedings of LREC. 560-567.
Luong, Socher & Manning (2013): Luong, M., Socher, R. & Manning, C. (2013). Better word representations with recursive neural networks for morphology. Proceedings of the seventeenth conference on computational natural language learning. 104-113.
Versteegh & Thiolliere (2015): Versteegh, M. & Thiolliere, R. (2015). ZeroSpeech term discovery evaluation toolkit. Retrieved from http://dx.doi.org/10.5281/zenodo.16330
Macmillan & Creelman (2004): Macmillan, N. & Creelman, C. (2004). Detection theory: A user’s guide. Psychology Press.
Mahrt (2016): Mahrt, T. (2016). LMEDS: Language markup and experimental design software.
Wang, Zhang & Zhang (2015): Wang, D., Zhang, X. & Zhang, Z. (2015). THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
Manenti, Pellegrini & Pinquier (2017): Manenti, C., Pellegrini, T. & Pinquier, J. (2017). Unsupervised speech unit discovery using k-means and neural networks. International conference on statistical language and speech processing. 169-180. Springer.
Mangin, Filliat, Bosch & Oudeyer (2015): Mangin, O., Filliat, D., Bosch, L. & Oudeyer, P. (2015). MCA-NMF: Multimodal concept acquisition with non-negative matrix factorization. PLOS One.
Marvin & Linzen (2018): Marvin, R. & Linzen, T. (2018). Targeted syntactic evaluation of language models. Retrieved from https://www.aclweb.org/anthology/D18-1151
Matlock (2001): Matlock, T. (2001). How real is fictive motion? Psychology Department, University of California, Santa Cruz.
Melis, Dyer & Blunsom (2018): Melis, G., Dyer, C. & Blunsom, P. (2018). On the state of the art of evaluation in neural language models. ICLR.
Merkx, Frank & Ernestus (2019): Merkx, D., Frank, S. & Ernestus, M. (2019). Language Learning Using Speech to Image Retrieval. Proc. Interspeech 2019. 1841-1845.
Meyer, Wesker, Brand, Mertins & Kollmeier (2006): Meyer, B., Wesker, T., Brand, T., Mertins, A. & Kollmeier, B. (2006). A human-machine comparison in speech recognition based on a logatome corpus. Speech recognition and intrinsic variation workshop.
Meyer, Wächter, Brand & Kollmeier (2007): Meyer, B., Wächter, M., Brand, T. & Kollmeier, B. (2007). Phoneme confusions in human and automatic speech recognition. Eighth annual conference of the international speech communication association.
Meyer, Jürgens, Wesker, Brand & Kollmeier (2010): Meyer, B., Jürgens, T., Wesker, T., Brand, T. & Kollmeier, B. (2010). Human phoneme recognition depending on speech-intrinsic variability. The Journal of the Acoustical Society of America, 128(5). 3126-3141. Acoustical Society of America.
Miao, Gowayyed & Metze (2015): Miao, Y., Gowayyed, M. & Metze, F. (2015). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. Automatic speech recognition and understanding (ASRU), 2015 IEEE workshop on. 167-174. IEEE.
Miech, Zhukov, Alayrac, Tapaswi, Laptev & Sivic (2019): Miech, A., Zhukov, D., Alayrac, J., Tapaswi, M., Laptev, I. & Sivic, J. (2019). HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV.
Miller & Charles (1991): Miller, G. & Charles, W. (1991). Contextual correlates of semantic similarity. Language and cognitive processes, 6(1). 1-28. Taylor & Francis.
Millet, Jurov & Dunbar (2019): Millet, J., Jurov, N. & Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. CogSci Conference 2019.
Muscariello, Gravier & Bimbot (2012): Muscariello, A., Gravier, G. & Bimbot, F. (2012). Unsupervised Motif Acquisition in Speech via Seeded Discovery and Template Matching Combination. IEEE Transactions on Audio, Speech and Language Processing, 20(7). 2031-2044.
Gulordava, Bojanowski, Grave, Linzen & Baroni (2018): Gulordava, K., Bojanowski, P., Grave, E., Linzen, T. & Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 1195-1205. Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/N18-1108
Kwiatkowski, Palomaki, Redfield, Collins, Parikh, Alberti, Epstein, Polosukhin, Devlin, Lee et al. (2019): Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K. et al. (2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7. 453-466. MIT Press.
Cuervo, Grabias, Chorowski, Ciesielski, Łańcucki, Rychlikowski & Marxer (2021): Cuervo, S., Grabias, M., Chorowski, J., Ciesielski, G., Łańcucki, A., Rychlikowski, P. & Marxer, R. (2021). Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words. arXiv preprint arXiv:2110.15909.
Iwamoto & Shinozaki (2021): Iwamoto, Y. & Shinozaki, T. (2021). Unsupervised spoken term discovery using wav2vec 2.0. 2021 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC). 1082-1086. IEEE.
Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021): Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L. & Dehak, N. (2021). Segmental contrastive predictive coding for unsupervised word segmentation. arXiv preprint arXiv:2106.02170.
Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021): Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L. & Dehak, N. (2021). Unsupervised speech segmentation and variable rate representation learning using segmental contrastive predictive coding. arXiv preprint arXiv:2110.02345.
Bhati, Villalba, Żelasko & Dehak (2020): Bhati, S., Villalba, J., Żelasko, P. & Dehak, N. (2020). Self-expressing autoencoders for unsupervised spoken term discovery. arXiv preprint arXiv:2007.13033.
Borgholt, Havtorn, Edin, Maaløe & Igel (2022): Borgholt, L., Havtorn, J., Edin, J., Maaløe, L. & Igel, C. (2022). A brief overview of unsupervised neural speech representation learning.
Nayak, Kumar, Ramesh, Bhati & Murty (2019): Nayak, S., Kumar, C., Ramesh, G., Bhati, S. & Murty, K. (2019). Virtual Phone Discovery for Speech Synthesis. Retrieved from https://doi.org/10.13140/RG.2.2.23356.08324
Tobing, Hayashi, Wu, Kobayashi & Toda (2020): Tobing, P., Hayashi, T., Wu, Y., Kobayashi, K. & Toda, T. (2020). Cyclic spectral modeling for unsupervised unit discovery into voice conversion with excitation and waveform modeling. INTERSPEECH. 4861-4865.
Chen & Hain (2020): Chen, M. & Hain, T. (2020). Unsupervised acoustic unit representation learning for voice conversion using wavenet auto-encoders. arXiv preprint arXiv:2008.06892.
Niekerk, Nortje & Kamper (2020): Niekerk, B., Nortje, L. & Kamper, H. (2020). Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge. arXiv preprint arXiv:2005.09409.
Yusuf, Ondel, Burget, Černocký & Saraclar (2021): Yusuf, B., Ondel, L., Burget, L., Černocký, J. & Saraclar, M. (2021). A hierarchical subspace model for language-attuned acoustic unit discovery. ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). 3710-3714. IEEE.
Gündogdu, Yusuf, Yesilbursa & Saraclar (2020): Gündogdu, B., Yusuf, B., Yesilbursa, M. & Saraclar, M. (2020). Vector quantized temporally-aware correspondence sparse autoencoders for zero-resource acoustic unit discovery. INTERSPEECH. 4846-4850.
Newell & Simon (1972): Newell, A. & Simon, H. (1972). Human problem solving. Prentice-Hall.
Nguyen, Seyssel, Rozé, Rivière, Kharitonov, Baevski, Dunbar & Dupoux (2020): Nguyen, T., Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E. & Dupoux, E. (2020). The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. arXiv preprint arXiv:2011.11588.
Jurov (2019): Jurov, N. (2019). Phonetics or Phonology? Modelling Non-Native Perception. Université Paris Diderot.
Ondel, Godard, Besacier, Larsen, Hasegawa-Johnson, Scharenborg, Dupoux, Burget, Yvon & Khudanpur (2018): Ondel, L., Godard, P., Besacier, L., Larsen, E., Hasegawa-Johnson, M., Scharenborg, O., Dupoux, E., Burget, L., Yvon, F. & Khudanpur, S. (2018). Bayesian models for unit discovery on a very low resource language. 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). 5939-5943. IEEE.
Oord, Li & Vinyals (2018): Oord, A., Li, Y. & Vinyals, O. (2018). Representation learning with contrastive predictive coding. CoRR, abs/1807.03748. Retrieved from http://arxiv.org/abs/1807.03748
Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier & Auli (2019): Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D. & Auli, M. (2019). Fairseq: A fast, extensible toolkit for sequence modeling. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (demonstrations). 48-53. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N19-4009
Panayotov, Chen, Povey & Khudanpur (2015): Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). 5206-5210. IEEE.
Pandia & Murthy (2019): Pandia, K. & Murthy, H. (2019). Zero Resource Speech Synthesis Using Transcripts Derived from Perceptual Acoustic Units. INTERSPEECH 2019.
Park & Glass (2008): Park, A. & Glass, J. (2008). Unsupervised Pattern Discovery in Speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1). 186-197.
Parrot, Millet & Dunbar (2019): Parrot, M., Millet, J. & Dunbar, E. (2019). Independent and automatic evaluation of acoustic-to-articulatory inversion models. arXiv. arXiv-1911.
Pauls & Klein (2012): Pauls, A. & Klein, D. (2012). Large-scale syntactic language modeling with treelets.
Chang & Fisher III (2013): Chang, J. & Fisher III, J. (2013). Parallel sampling of DP mixture models using sub-cluster splits. Advances in Neural Information Processing Systems. 620-628.
Pellegrini, Manenti & Pinquier (): Pellegrini, T., Manenti, C. & Pinquier, J. (). Unsupervised discovery of sub-lexical units in speech based on ZCA and k-means. Submitted to ASRU 2017.
Peperkamp (2015): Peperkamp, S. (2015). Phonology versus phonetics in loanword adaptations. 71-90. John Benjamins Publishing Company.
Phillips, Wagers & Lau (2011): Phillips, C., Wagers, M. & Lau, E. (2011). Grammatical illusions and selective fallibility in real-time language comprehension. Experiments at the Interfaces, 37. 147-180. Brill.
Pintér & Watanabe (2016): Pintér, G. & Watanabe, H. (2016). Do GMM phoneme classifiers perceive synthetic sibilants as humans do? INTERSPEECH. 1363-1367.
Pitt, Johnson, Hume, Kiesling & Raymond (2005): Pitt, M., Johnson, K., Hume, E., Kiesling, S. & Raymond, W. (2005). The buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45(1). 89-95. Elsevier.
Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz, Silovsky, Stemmer & Vesely (2011): Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G. & Vesely, K. (2011). The kaldi speech recognition toolkit. IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.
Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz et al. (2011): Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P. et al. (2011). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE Signal Processing Society.
R Core Team (2017): R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Rabiner (1989): Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2). 257-286.
Radford, Wu, Child, Luan, Amodei & Sutskever (2019): Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. & Sutskever, I. (2019). Language models are unsupervised multitask learners.
Radinsky, Agichtein, Gabrilovich & Markovitch (2011): Radinsky, K., Agichtein, E., Gabrilovich, E. & Markovitch, S. (2011). A word at a time: Computing word relatedness using temporal semantic analysis. Proceedings of the 20th international conference on world wide web. 337-346.
Räsänen & Rasilo (2015): Räsänen, O. & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122. 792-829.
Ravfogel, Tyers & Goldberg (2018): Ravfogel, S., Tyers, F. & Goldberg, Y. (2018). Can LSTM learn to capture agreement? The case of Basque. arXiv preprint arXiv:1809.04022.
Kamper, Jansen & Goldwater (2017): Kamper, H., Jansen, A. & Goldwater, S. (2017). A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46. 154-174. Elsevier.
Kamper (2022): Kamper, H. (2022). Word segmentation on discovered phone units with dynamic programming and self-supervised scoring. arXiv preprint arXiv:2202.11929.
Dupoux (2018): Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173. 43-59. Elsevier.
Rivière, Joulin, Mazaré & Dupoux (2020): Rivière, M., Joulin, A., Mazaré, P. & Dupoux, E. (2020). Unsupervised pretraining transfers well across languages. Retrieved from https://arxiv.org/abs/2002.02848
Roy & Pentland (2002): Roy, D. & Pentland, A. (2002). Learning words from sights and sounds: A computational model. Cognitive Science, 26. 113-146.
Rubenstein & Goodenough (1965): Rubenstein, H. & Goodenough, J. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10). 627-633. ACM New York, NY, USA.
Tjandra, Sisman, Zhang, Sakti, Li & Nakamura (2019): Tjandra, A., Sisman, B., Zhang, M., Sakti, S., Li, H. & Nakamura, S. (2019). VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1905.11449
Sakti, Kelana, Riza, Sakai, Markov & Nakamura (2008): Sakti, S., Kelana, E., Riza, H., Sakai, S., Markov, K. & Nakamura, S. (2008). Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project. Proceedings of the workshop on technologies and corpora for Asia-Pacific speech translation (TCAST).
Sakti, Maia, Sakai, Shimizu & Nakamura (2008): Sakti, S., Maia, R., Sakai, S., Shimizu, T. & Nakamura, S. (2008). Development of HMM-based Indonesian speech synthesis. Proc. Oriental COCOSDA. 215-219.
Salazar, Liang, Nguyen & Kirchhoff (2020): Salazar, J., Liang, D., Nguyen, T. & Kirchhoff, K. (2020). Masked language model scoring. Proceedings of the 58th annual meeting of the association for computational linguistics. 2699-2712. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.240
Sanabria, Caglayan, Palaskar, Elliott, Barrault, Specia & Metze (2018): Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L. & Metze, F. (2018). How2: A large-scale dataset for multimodal language understanding. Proceedings of the workshop on visually grounded interaction and language (ViGIL). NeurIPS. Retrieved from http://arxiv.org/abs/1811.00347
Scharenborg (2007): Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication, 49(5). 336-347. Elsevier.
Scharenborg, Tiesmeyer, Hasegawa-Johnson & Dehak (2018): Scharenborg, O., Tiesmeyer, S., Hasegawa-Johnson, M. & Dehak, N. (2018). Visualizing phoneme category adaptation in deep neural networks. INTERSPEECH. 1482-1486.
Scharenborg, Gouw, Larson & Marchiori (2019): Scharenborg, O., Gouw, N., Larson, M. & Marchiori, E. (2019). The representation of speech in deep neural networks. International conference on multimedia modeling. 194-205. Springer.
Scharenborg (2019): Scharenborg, O. (2019). The representation of speech and its processing in the human brain and deep neural networks. International conference on speech and computer. 1-8. Springer.
Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013): Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task (I): Analysis of the classical MFC/PLP pipeline. INTERSPEECH.
Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013): Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. INTERSPEECH.
Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013): Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association. 1-5.
Schatz, Peddinti, Cao, Bach, Hermansky & Dupoux (2014): Schatz, T., Peddinti, V., Cao, X., Bach, F., Hermansky, H. & Dupoux, E. (2014). Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise. INTERSPEECH.
Schatz, Peddinti, Cao, Bach, Hermansky & Dupoux (2014): Schatz, T., Peddinti, V., Cao, X., Bach, F., Hermansky, H. & Dupoux, E. (2014). Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise. Fifteenth annual conference of the international speech communication association.
Schatz (2016): Schatz, T. (2016). ABX-discriminability measures and applications. École Normale Supérieure.
Schatz (2016): Schatz, T. (2016). ABX-discriminability measures and applications. Paris 6.
Schatz, Bach & Dupoux (2017): Schatz, T., Bach, F. & Dupoux, E. (2017). ASR systems as models of phonetic category perception in adults. Proceedings of the 39th Annual CogSci Meeting.
Schatz & Feldman (2018): Schatz, T. & Feldman, N. (2018). Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. Proceedings of the Conference on Cognitive Computational Neuroscience.
Schatz, Feldman, Goldwater, Cao & Dupoux (): Schatz, T., Feldman, N., Goldwater, S., Cao, X. & Dupoux, E. (). Early phonetic learning without phonetic categories: Insights from machine learning. Proceedings of the National Academy of Sciences.
Schnabel, Labutov, Mimno & Joachims (2015): Schnabel, T., Labutov, I., Mimno, D. & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. Proceedings of the 2015 conference on empirical methods in natural language processing. 298-307.
Schneider, Baevski, Collobert & Auli (2019): Schneider, S., Baevski, A., Collobert, R. & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
Senin (2008): Senin, P. (2008). Dynamic time warping algorithm review. Retrieved from http://seninp.github.io/assets/pubs/senin_dtw_litreview_2008.pdf
Sennrich, Haddow & Birch (2016): Sennrich, R., Haddow, B. & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers). 1715-1725. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P16-1162
Sennrich, Haddow & Birch (2015): Sennrich, R., Haddow, B. & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Shibata, Kato, Shinozaki & Watanabe (): Shibata, H., Kato, T., Shinozaki, T. & Watanabe, S. (). Composite embedding systems for ZeroSpeech2017 track 1. Submitted to ASRU 2017.
Norris & McQueen (2008): Norris, D. & McQueen, J. (2008). Shortlist B: a Bayesian model of continuous speech recognition. Psychological Review, 115(2). 357-395. American Psychological Association.
Shrager & Langley (1990): Shrager, J. & Langley, P. (1990). Computational models of scientific discovery and theory formation. Morgan Kaufmann.
Siu, Gish, Chan, Belfield & Lowe (2013): Siu, M., Gish, H., Chan, A., Belfield, W. & Lowe, S. (2013). Unsupervised training of an HMM-based self-organizing recognizer with applications to topic classification and keyword discovery. Computer Speech & Language, preprint.
Socher, Karpathy, Le, Manning & Ng (2014): Socher, R., Karpathy, A., Le, Q., Manning, C. & Ng, A. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2. 207-218.
Scharenborg, Norris, Bosch & McQueen (2005): Scharenborg, O., Norris, D., Bosch, L. & McQueen, J. (2005). How should a speech recognizer work? Cognitive Science, 29. 867-918.
Stolcke & Droppo (2017): Stolcke, A. & Droppo, J. (2017). Comparing human and machine errors in conversational speech transcription. INTERSPEECH.
Sun, Myers, Vondrick, Murphy & Schmid (2019): Sun, C., Myers, A., Vondrick, C., Murphy, K. & Schmid, C. (2019). VideoBERT: A joint model for video and language representation learning. Proceedings of the IEEE international conference on computer vision. 7464-7473.
Synnaeve, Schatz & Dupoux (2014): Synnaeve, G., Schatz, T. & Dupoux, E. (2014). Phonetic embedding learning with side information. Proceedings of IEEE spoken language technology.
Synnaeve, Versteegh & Dupoux (2014): Synnaeve, G., Versteegh, M. & Dupoux, E. (2014). Learning words from images and speech. 28th Conference on Neural Information Processing Systems (NIPS) Workshop on Learning Semantics.
Bosch, Van hamme, Boves & Moore (2008): Bosch, L., Van hamme, H., Boves, L. & Moore, R. (2008). A computational model of language acquisition: The emergence of words. Fundamenta Informaticae, 90. 229-249.
McMurray, Aslin & Toscano (2009): McMurray, B., Aslin, R. & Toscano, J. (2009). Statistical learning of phonetic categories: Insights from a computational approach. Developmental Science, 12(3). 369-378. Wiley Online Library.
Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015): Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M. & Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. INTERSPEECH. 3179-3183.
Schatz, Thiolliere, Dupoux, Synnaeve & Dunbar (2015): Schatz, T., Thiolliere, R., Dupoux, E., Synnaeve, G. & Dunbar, E. (2015). ABXpy v0.1. Retrieved from http://dx.doi.org/10.5281/zenodo.16239
Schatz, Cao, Synnaeve, Thiolliere & Dupoux (2015): Schatz, T., Cao, X., Synnaeve, G., Thiolliere, R. & Dupoux, E. (2015). Abkhazia: Preliminary release. Retrieved from http://dx.doi.org/10.5281/zenodo.16242
Schatz & Feldman (2018): Schatz, T. & Feldman, N. (2018). Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. Proceedings of the conference on cognitive computational neuroscience. 1-4.
Elman & McClelland (2015): Elman, J. & McClelland, J. (2015). Exploiting the lawful variability in the speech wave. 71-90. Erlbaum.
McClelland & Elman (1986): McClelland, J. & Elman, J. (1986). Interactive processes in speech perception: The TRACE model. Cognitive Psychology, 18. 1-86.
Tran, Bisazza & Monz (2018): Tran, K., Bisazza, A. & Monz, C. (2018). The importance of being recurrent for modeling hierarchical structure. Retrieved from https://www.aclweb.org/anthology/D18-1503
Vallabha, McClelland, Pons, Werker & Amano (2007): Vallabha, G., McClelland, J., Pons, F., Werker, J. & Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33). 13273-13278. National Acad Sciences.
Oord, Vinyals et al. (2017): Oord, A., Vinyals, O. et al. (2017). Neural discrete representation learning. Advances in neural information processing systems. 6306-6315.
VanDam (2015): VanDam, M. (2015). HomeBank VanDam Public 5-minute Corpus. TalkBank. Retrieved from http://homebank.talkbank.org/access/Public/VanDam-5minute.html
VanDam (2015): VanDam, M. (2015). HomeBank VanDam Public Daylong Corpus. TalkBank. Retrieved from http://homebank.talkbank.org/access/Public/VanDam-Daylong.html
Varadarajan, Khudanpur & Dupoux (2008): Varadarajan, B., Khudanpur, S. & Dupoux, E. (2008). Unsupervised learning of acoustic sub-word units. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. 165-168. Association for Computational Linguistics.
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin (2017): Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. & Polosukhin, I. (2017). Attention is all you need. CoRR, abs/1706.03762. Retrieved from http://arxiv.org/abs/1706.03762
Versteegh, Thiolliere, Schatz, Cao, Anguera, Jansen & Dupoux (2015): Versteegh, M., Thiolliere, R., Schatz, T., Cao, X., Anguera, X., Jansen, A. & Dupoux, E. (2015). The zero resource speech challenge 2015. Proc. of Interspeech.
Versteegh, Anguera, Jansen & Dupoux (2016): Versteegh, M., Anguera, X., Jansen, A. & Dupoux, E. (2016). The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science: Proceedings of SLTU 2016, 81. 67-72.
Versteegh, Thiollière, Schatz, Cao, Anguera, Jansen & Dupoux (2015): Versteegh, M., Thiollière, R., Schatz, T., Cao, X., Anguera, X., Jansen, A. & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. INTERSPEECH 2015.
Versteegh, Anguera, Jansen & Dupoux (2016): Versteegh, M., Anguera, X., Jansen, A. & Dupoux, E. (2016). The Zero Resource Speech Challenge 2015: Proposed approaches and results. Procedia Computer Science, 81. 67-72. Elsevier.
Wang, Tang & Livescu (2020): Wang, W., Tang, Q. & Livescu, K. (2020). Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). 6889-6893. IEEE.
Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019): Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. & Bowman, S. (2019). BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
Werker & Tees (1984): Werker, J. & Tees, R. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant behavior and development, 7(1). 49-63. Elsevier.
Wesker, Meyer, Wagener, Anemüller, Mertins & Kollmeier (2005): Wesker, T., Meyer, B., Wagener, K., Anemüller, J., Mertins, A. & Kollmeier, B. (2005). Oldenburg logatome speech corpus (OLLO) for speech recognition experiments with humans and machines. Ninth European conference on speech communication and technology.
Wilcox, Levy, Morita & Futrell (2018): Wilcox, E., Levy, R., Morita, T. & Futrell, R. (2018). What do RNN language models learn about filler–gap dependencies?
Wilcox, Levy, Morita & Futrell (2018): Wilcox, E., Levy, R., Morita, T. & Futrell, R. (2018). What do RNN language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042.
Gauthier, Besacier, Voisin, Melese & Elingui (2016): Gauthier, E., Besacier, L., Voisin, S., Melese, M. & Elingui, U. (2016). Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. LREC.
Vries, Davel, Badenhorst, Basson, Wet, Barnard & Waal (2014): Vries, N., Davel, M., Badenhorst, J., Basson, W., Wet, F., Barnard, E. & Waal, A. (2014). A smartphone-based ASR data collection tool for under-resourced languages. Speech Communication, 56. 119-131.
Xu & Tenenbaum (2007): Xu, F. & Tenenbaum, J. (2007). Word learning as Bayesian inference. Psychological review, 114(2). 245-272. American Psychological Association.
Yang & Powers (2006): Yang, D. & Powers, D. (2006). Verb similarity on the taxonomy of WordNet. Masaryk University.
Yang, Dai, Yang, Carbonell, Salakhutdinov & Le (2019): Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. & Le, Q. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Retrieved from https://arxiv.org/abs/1906.08237
Yu & Ballard (2004): Yu, C. & Ballard, D. (2004). A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception, 1. 57-80.
Yuan, Leung, Xie, Chen, Ma & Li (): Yuan, Y., Leung, C., Xie, L., Chen, H., Ma, B. & Li, H. (). Extracting bottleneck features and word-like pairs from untranscribed speech for feature representations. Submitted to ASRU 2017.
Yuan, Leung, Xie, Chen, Ma & Li (2017): Yuan, Y., Leung, C., Xie, L., Chen, H., Ma, B. & Li, H. (2017). Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 734-739. IEEE.
Zhang & Glass (2010): Zhang, Y. & Glass, J. (2010). Towards multi-speaker unsupervised speech pattern discovery. Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. 4366-4369.
Zhou, Xu & Corso (2018): Zhou, L., Xu, C. & Corso, J. (2018). Towards automatic learning of procedures from web instructional videos. Proceedings of the AAAI conference on artificial intelligence, 32.
Gauthier, Besacier, Voisin, Melese & Elingui (2016): Gauthier, E., Besacier, L., Voisin, S., Melese, M. & Elingui, U. (2016). Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. 10th Language Resources and Evaluation Conference (LREC 2016). Retrieved from https://hal.archives-ouvertes.fr/hal-01350037
Jia, Weiss, Biadsy, Macherey, Johnson, Chen & Wu (2019): Jia, Y., Weiss, R., Biadsy, F., Macherey, W., Johnson, M., Chen, Z. & Wu, Y. (2019). Direct speech-to-speech translation with a sequence-to-sequence model. arXiv preprint arXiv:1904.06037.
Lee, Chen, Wang, Gu, Ma, Polyak, Adi, He, Tang, Pino et al. (2021): Lee, A., Chen, P., Wang, C., Gu, J., Ma, X., Polyak, A., Adi, Y., He, Q., Tang, Y., Pino, J. et al. (2021). Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604.
Tjandra, Sakti & Nakamura (2020): Tjandra, A., Sakti, S. & Nakamura, S. (2020). Transformer VQ-VAE for unsupervised unit discovery and speech synthesis: ZeroSpeech 2020 challenge. arXiv preprint arXiv:2005.11676.
Alishahi, Chrupała, Cristia, Dupoux, Higy, Lavechin, Räsänen & Yu (2021): Alishahi, A., Chrupała, G., Cristia, A., Dupoux, E., Higy, B., Lavechin, M., Räsänen, O. & Yu, C. (2021). ZR-2021VG: Zero-resource speech challenge, visually-grounded language modelling track. arXiv preprint arXiv:2107.06546.
Maekaku, Chang, Fujita, Chen, Watanabe & Rudnicky (2021): Maekaku, T., Chang, X., Fujita, Y., Chen, L., Watanabe, S. & Rudnicky, A. (2021). Speech representation learning combining Conformer CPC with deep cluster for the ZeroSpeech challenge 2021. arXiv preprint arXiv:2107.05899.
Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021): Chorowski, J., Ciesielski, G., Dzikowski, J., Łańcucki, A., Marxer, R., Opala, M., Pusz, P., Rychlikowski, P. & Stypułkowski, M. (2021). Information retrieval for ZeroSpeech 2021: The submission by University of Wroclaw. arXiv preprint arXiv:2106.11603.
Niekerk, Nortje, Baas & Kamper (2021): Niekerk, B., Nortje, L., Baas, M. & Kamper, H. (2021). Analyzing speaker information in self-supervised models to improve zero-resource speech processing. arXiv preprint arXiv:2108.00917.
Tjandra, Sakti & Nakamura (2019): Tjandra, A., Sakti, S. & Nakamura, S. (2019). Speech-to-speech translation between untranscribed unknown languages. 2019 IEEE automatic speech recognition and understanding workshop (ASRU). 593-600. IEEE.
Jia, Ramanovich, Remez & Pomerantz (2021): Jia, Y., Ramanovich, M., Remez, T. & Pomerantz, R. (2021). Translatotron 2: Robust direct speech-to-speech translation. arXiv preprint arXiv:2107.08661.
Lee, Gong, Duquenne, Schwenk, Chen, Wang, Popuri, Pino, Gu & Hsu (2021): Lee, A., Gong, H., Duquenne, P., Schwenk, H., Chen, P., Wang, C., Popuri, S., Pino, J., Gu, J. & Hsu, W. (2021). Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352.
Ostendorf, Price & Shattuck-Hufnagel (1995): Ostendorf, M., Price, P. & Shattuck-Hufnagel, S. (1995). The Boston University radio news corpus. Linguistic Data Consortium. 1-19.
Algayres, Ricoul, Karadayi, Mohammed, Sagot & Dupoux (2022): Algayres, R., Ricoul, T., Karadayi, J., Mohammed, A., Sagot, B. & Dupoux, E. (2022). DP-PARSE: Finding word boundaries from raw speech with a token lexicon.
Nguyen, Sagot & Dupoux (2022): Nguyen, T., Sagot, B. & Dupoux, E. (2022). Are discrete units necessary for spoken language modeling?
De Saussure (1916): De Saussure, F. (1916). Course in general linguistics. McGraw-Hill Book Company, New York-Toronto-London.
Seyssel, Lavechin, Titeux, Thomas, Virlet, Santos Revilla, Wisniewski, Ludusan & Dupoux (2023): Seyssel, M., Lavechin, M., Titeux, H., Thomas, A., Virlet, G., Santos Revilla, A., Wisniewski, G., Ludusan, B. & Dupoux, E. (2023). ProsAudit, a prosodic benchmark for self-supervised speech models.