- Dunbar, Hamilakis & Dupoux (2022)
- Dunbar, E., Hamilakis, N. & Dupoux, E. (2022). Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge series. IEEE Journal of Selected Topics in Signal Processing, 16(6). 1211-1226. Retrieved from https://arxiv.org/abs/2005.12656
- Hallap, Dupoux & Dunbar (2022)
- Hallap, M., Dupoux, E. & Dunbar, E. (2022). Evaluating context-invariance in unsupervised speech representations. arXiv preprint arXiv:2210.15775.
- Goldwater, Griffiths & Johnson (2009)
- Goldwater, S., Griffiths, T. & Johnson, M. (2009). A bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1). 21-54.
- Agirre, Alfonseca, Hall, Kravalova, Pasca & Soroa (2009)
- Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M. & Soroa, A. (2009). A study on similarity and relatedness using distributional and wordnet-based approaches.
- Al-Rfou, Choe, Constant, Guo & Jones (2018)
- Al-Rfou, R., Choe, D., Constant, N., Guo, M. & Jones, L. (2018). Character-level language modeling with deeper self-attention. arXiv preprint 1808.04444.
- Allen & Seidenberg (1999)
- Allen, J. & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. The emergence of language. 115-151.
- Ansari, Kumar, Singh, Ganapathy & Devi (2017)
- Ansari, T., Kumar, R., Singh, S., Ganapathy, S. & Devi, S. (2017). Unsupervised HMM posteriograms for language independent acoustic modeling in zero resource conditions. Submitted to ASRU 2017.
- Chaudhuri, Roth, Ellis, Gallagher, Kaver, Marvin, Pantofaru, Reale, Reid, Wilson & Xi (2018)
- Chaudhuri, S., Roth, J., Ellis, D., Gallagher, A., Kaver, L., Marvin, R., Pantofaru, C., Reale, N., Reid, L., Wilson, K. & Xi, Z. (2018). AVA-speech: A densely labeled dataset of speech activity in movies. Proceedings of interspeech, 2018. Retrieved from https://arxiv.org/pdf/1808.00606
- Baevski, Auli & Mohamed (2019)
- Baevski, A., Auli, M. & Mohamed, A. (2019). Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912.
- Baevski, Zhou, Mohamed & Auli (2020)
- Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
- Baker, Reichart & Korhonen (2014)
- Baker, S., Reichart, R. & Korhonen, A. (2014). An unsupervised model for instance level subcategorization acquisition. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 278-289.
- Bérard, Pietquin, Servan & Besacier (2016)
- Bérard, A., Pietquin, O., Servan, C. & Besacier, L. (2016). Listen and translate: A proof of concept for end-to-end speech-to-text translation. NIPS workshop on end-to-end learning for speech and audio processing.
- Best (1995)
- Best, C. (1995). A direct realist perspective on cross-language speech perception. Speech perception and linguistic experience: Issues in cross-language research. 167-200. York Press.
- Dunbar, Bernard, Hamilakis, Nguyen, Seyssel, Rozé, Rivière, Kharitonov & Dupoux (2021)
- Dunbar, E., Bernard, M., Hamilakis, N., Nguyen, T., Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E. & Dupoux, E. (2021). The zero resource speech challenge 2021: Spoken language modelling. Interspeech 2021-conference of the international speech communication association.
- Kohonen (1988)
- Kohonen, T. (1988). The ’neural’ phonetic typewriter. Computer, 21(3). 11-22.
- Adda, Stüker, Adda-Decker, Ambouroue, Besacier, Blachon, Bonneau-Maynard, Godard, Hamlaoui, Idiatov, Kouarata, Lamel, Makasso, Rialland, Van de Velde, Yvon & Zerbian (2016)
- Adda, G., Stüker, S., Adda-Decker, M., Ambouroue, O., Besacier, L., Blachon, D., Bonneau-Maynard, H., Godard, P., Hamlaoui, F., Idiatov, D., Kouarata, G., Lamel, L., Makasso, E., Rialland, A., Van de Velde, M., Yvon, F. & Zerbian, S. (2016). Breaking the unwritten language barrier: The Bulb project. Proceedings of SLTU (spoken language technologies for under-resourced languages).
- Akaike (1974)
- Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6). 716-723. IEEE.
- Alishahi, Barking & Chrupała (2017)
- Alishahi, A., Barking, M. & Chrupała, G. (2017). Encoding of phonology in a recurrent neural model of grounded speech. Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 368-378.
- Ansari, Singh, Kumar & Ganapathy (2017)
- Ansari, T., Singh, S., Kumar, R. & Ganapathy, S. (2017). Deep learning methods for unsupervised acoustic modeling: LEAP submission to ZeroSpeech challenge 2017. Submitted to ASRU 2017.
- Ansari, Kumar, Singh & Ganapathy (2017)
- Ansari, T., Kumar, R., Singh, S. & Ganapathy, S. (2017). Deep learning methods for unsupervised acoustic modeling—leap submission to zerospeech challenge 2017. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 754-761. IEEE.
- Jansen & Van Durme (2011)
- Jansen, A. & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. Automatic speech recognition and understanding (ASRU), 2011 IEEE workshop on. 401-406. IEEE.
- Badino, Canevari, Fadiga & Metta (2014)
- Badino, L., Canevari, C., Fadiga, L. & Metta, G. (2014). An Auto-encoder based approach to unsupervised learning of subword units. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Baevski, Schneider & Auli (2020)
- Baevski, A., Schneider, S. & Auli, M. (2020). Vq-wav2vec: Self-supervised learning of discrete speech representations. International conference on learning representations. Retrieved from https://openreview.net/forum?id=rylwJxrYDS
- Hannun, Case, Casper, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates et al. (2014)
- Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A. et al. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
- Bengio, Ducharme, Vincent & Jauvin (2003)
- Bengio, Y., Ducharme, R., Vincent, P. & Jauvin, C. (2003). A neural probabilistic language model. JMLR.
- Besacier, Zhou & Gao (2006)
- Besacier, L., Zhou, B. & Gao, Y. (2006). Towards speech translation of non written languages. Spoken Language Technology Workshop, 2006. IEEE. 222-225.
- Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019)
- Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. & Bowman, S. (2019). BLiMP: A benchmark of linguistic minimal pairs for english. arXiv preprint arXiv:1912.00582.
- Peng & Harwath (2022)
- Peng, P. & Harwath, D. (2022). Self-supervised representation learning for speech using visual grounding and masked language modeling. arXiv preprint arXiv:2202.03543.
- Bruni, Boleda, Baroni & Tran (2012)
- Bruni, E., Boleda, G., Baroni, M. & Tran, N. (2012). Distributional semantics in technicolor. Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: Long papers). 136-145.
- Chalnick & Billman (1988)
- Chalnick, A. & Billman, D. (1988). Unsupervised learning of correlational structure. Proceedings of the tenth annual conference of the cognitive science society. 510-516. Lawrence Erlbaum Associates.
- Chen, Leung, Xie, Ma & Li (2015)
- Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. INTERSPEECH.
- Chrupała (2021)
- Chrupała, G. (2021). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Retrieved from https://arxiv.org/abs/2104.13225
- Chung & Glass (2018)
- Chung, Y. & Glass, J. (2018). Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
- Chung, Hsu, Tang & Glass (2019)
- Chung, Y., Hsu, W., Tang, H. & Glass, J. (2019). An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240.
- Chung, Hsu, Tang & Glass (2019)
- Chung, Y., Hsu, W., Tang, H. & Glass, J. (2019). An unsupervised autoregressive model for speech representation learning. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 146-150.
- Keuleers & Brysbaert (2010)
- Keuleers, E. & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior research methods, 42(3). 627-633. Springer.
- Kharitonov, Lee, Polyak, Adi, Copet, Lakhotia, Nguyen, Rivière, Mohamed, Dupoux et al. (2021)
- Kharitonov, E., Lee, A., Polyak, A., Adi, Y., Copet, J., Lakhotia, K., Nguyen, T., Rivière, M., Mohamed, A., Dupoux, E. et al. (2021). Text-free prosody-aware generative spoken language modeling. arXiv preprint arXiv:2109.03264.
- Heck, Sakti & Nakamura (2016)
- Heck, M., Sakti, S. & Nakamura, S. (2016). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. Procedia Computer Science, 81. 73-79. Elsevier.
- Srivastava & Shrivastava (2016)
- Srivastava, B. & Shrivastava, M. (2016). Articulatory gesture rich representation learning of phonological units in low resource settings. International conference on statistical language and speech processing. 80-95. Springer.
- Heck, Sakti & Nakamura (2017)
- Heck, M., Sakti, S. & Nakamura, S. (2017). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 740-746. IEEE.
- Shibata, Kato, Shinozaki & Watanabe (2017)
- Shibata, H., Kato, T., Shinozaki, T. & Watanabe, S. (2017). Composite embedding systems for ZeroSpeech2017 Track1. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 747-753. IEEE.
- Chorowski, Weiss, Bengio & Van Den Oord (2019)
- Chorowski, J., Weiss, R., Bengio, S. & Van Den Oord, A. (2019). Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing, 27(12). 2041-2053. IEEE.
- Kamper, Livescu & Goldwater (2017)
- Kamper, H., Livescu, K. & Goldwater, S. (2017). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 719-726. IEEE.
- Hsu, Harwath & Glass (2019)
- Hsu, W., Harwath, D. & Glass, J. (2019). Transfer learning from audio-visual grounding to speech recognition. arXiv preprint arXiv:1907.04355.
- Chung & Glass (2019)
- Chung, Y. & Glass, J. (2019). Generative pre-training for speech with autoregressive predictive coding. arXiv preprint arXiv:1910.12607.
- Millet, Chitoran & Dunbar (2021)
- Millet, J., Chitoran, I. & Dunbar, E. (2021). Predicting non-native speech perception using the perceptual assimilation model and state-of-the-art acoustic models. Proceedings of the 25th conference on computational natural language learning. 661-673.
- Warstadt, Singh & Bowman (2018)
- Warstadt, A., Singh, A. & Bowman, S. (2018). Neural network acceptability judgments. arXiv preprint 1805.12471.
- Dai, Yang, Yang, Cohen, Carbonell, Le & Salakhutdinov (2019)
- Dai, Z., Yang, Z., Yang, Y., Cohen, W., Carbonell, J., Le, Q. & Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint 1901.02860.
- Räsänen, Doyle & Frank (2015)
- Räsänen, O., Doyle, G. & Frank, M. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. Sixteenth annual conference of the international speech communication association.
- Räsänen & Blandón (2020)
- Räsänen, O. & Blandón, M. (2020). Unsupervised discovery of recurring speech patterns using probabilistic adaptive metrics. arXiv preprint arXiv:2008.00731.
- Prakash, Kumar, Murthy et al. (2020)
- Prakash, A., Kumar, M., Murthy, H. et al. (2020). Exploration of end-to-end synthesisers for zero resource speech challenge 2020. arXiv preprint arXiv:2009.04983.
- Davis & Mermelstein (1980)
- Davis, S. & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4). 357-366.
- Lee & Glass (2012)
- Lee, C. & Glass, J. (2012). A nonparametric bayesian approach to acoustic model discovery. ACL (1). 40-49. The Association for Computer Linguistics.
- Hsu, Hwang, Wu, Tsao & Wang (2016)
- Hsu, C., Hwang, H., Wu, Y., Tsao, Y. & Wang, H. (2016). Voice conversion from non-parallel corpora using variational auto-encoder. Asia-pacific signal and information processing association annual summit and conference, APSIPA 2016, jeju, south korea, december 13-16, 2016. 1-6.
- Tjandra, Sakti & Nakamura (2017)
- Tjandra, A., Sakti, S. & Nakamura, S. (2017). Listening while speaking: Speech chain by deep learning. ASRU 2017. 301-308.
- Badino, Canevari, Fadiga & Metta (2014)
- Badino, L., Canevari, C., Fadiga, L. & Metta, G. (2014). An auto-encoder based approach to unsupervised learning of subword units. ICASSP. 7634-7638. IEEE.
- Gao, Singh & Raj (2018)
- Gao, Y., Singh, R. & Raj, B. (2018). Voice impersonation using generative adversarial networks. ICASSP. 2506-2510. IEEE.
- Jansen, Thomas & Hermansky (2013)
- Jansen, A., Thomas, S. & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. ICASSP. 8091-8095. IEEE.
- Eloff, Nortje, Niekerk, Govender, Nortje, Pretorius, Van Biljon, Westhuizen, Staden & Kamper (2019)
- Eloff, R., Nortje, A., Niekerk, B., Govender, A., Nortje, L., Pretorius, A., Van Biljon, E., Westhuizen, E., Staden, L. & Kamper, H. (2019). Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. arXiv preprint arXiv:1904.07556.
- Yusuf, Gök, Gündogdu, Kose & Saraclar (2019)
- Yusuf, B., Gök, A., Gündogdu, B., Kose, O. & Saraclar, M. (2019). Temporally-aware acoustic unit discovery for zerospeech 2019 challenge. INTERSPEECH. 1098-1102.
- Liu, Hsu & Lee (2019)
- Liu, A., Hsu, P. & Lee, H. (2019). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. arXiv preprint arXiv:1905.11563.
- Nayak, Kumar, Ramesh, Bhati & Murty (2019)
- Nayak, S., Kumar, C., Ramesh, G., Bhati, S. & Murty, K. (2019). Virtual phone discovery for speech synthesis without text. 2019 IEEE global conference on signal and information processing (GlobalSIP). 1-5. IEEE.
- Muthukumar & Black (2014)
- Muthukumar, P. & Black, A. (2014). Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. IEEE international conference on acoustics, speech and signal processing, ICASSP 2014, florence, italy, may 4-9, 2014. 2594-2598.
- Scharenborg, Besacier, Black, Hasegawa-Johnson, Metze, Neubig, Stüker, Godard, Müller, Ondel, Palaskar, Arthur, Ciannella, Du, Larsen, Merkx, Riad, Wang & Dupoux (2018)
- Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stüker, S., Godard, P., Müller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L. & Dupoux, E. (2018). Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "speaking rosetta" JSALT 2017 workshop. ICASSP. 4979-4983. IEEE.
- Shen, Pang, Weiss, Schuster, Jaitly, Yang, Chen, Zhang, Wang, Ryan, Saurous, Agiomyrgiannakis & Wu (2018)
- Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Ryan, R., Saurous, R., Agiomyrgiannakis, Y. & Wu, Y. (2018). Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions. ICASSP. 4779-4783. IEEE.
- Heck, Sakti & Nakamura (2016)
- Heck, M., Sakti, S. & Nakamura, S. (2016). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. SLTU-2016, 5th workshop on spoken language technologies for under-resourced languages, 9-12 may 2016, yogyakarta, indonesia. 73-79.
- Ondel, Burget & Cernocký (2016)
- Ondel, L., Burget, L. & Cernocký, J. (2016). Variational inference for acoustic unit discovery. SLTU, 81. 80-86. Elsevier.
- Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior & Kavukcuoglu (2016)
- Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. SSW. 125. ISCA.
- Wu, Watts & King (2016)
- Wu, Z., Watts, O. & King, S. (2016). Merlin: An open source neural network speech synthesis system. Speech Synthesis Workshop. 202-207. ISCA.
- Ping, Peng, Gibiansky, Arik, Kannan, Narang, Raiman & Miller (2017)
- Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J. & Miller, J. (2017). Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654.
- Kaneko & Kameoka (2017)
- Kaneko, T. & Kameoka, H. (2017). Parallel-data-free voice conversion using cycle-consistent adversarial networks. CoRR, abs/1711.11293. Retrieved from https://arxiv.org/abs/1711.11293
- Chou, Yeh, Lee & Lee (2018)
- Chou, J., Yeh, C., Lee, H. & Lee, L. (2018). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. CoRR, abs/1804.02812. Retrieved from https://arxiv.org/abs/1804.02812
- Li, Liu, Liu, Zhao, Liu & Zhou (2018)
- Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M. & Zhou, M. (2018). Close to human quality TTS with transformer. CoRR, abs/1809.08895. Retrieved from https://arxiv.org/abs/1809.08895
- Mehri, Kumar, Gulrajani, Kumar, Jain, Sotelo, Courville & Bengio (2016)
- Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A. & Bengio, Y. (2016). SampleRNN: An unconditional end-to-end neural audio generation model. CoRR, abs/1612.07837. Retrieved from https://arxiv.org/abs/1612.07837
- Taigman, Wolf, Polyak & Nachmani (2017)
- Taigman, Y., Wolf, L., Polyak, A. & Nachmani, E. (2017). Voice synthesis for in-the-wild speakers via a phonological loop. CoRR, abs/1707.06588.
- Dillon, Dunbar & Idsardi (2013)
- Dillon, B., Dunbar, E. & Idsardi, W. (2013). A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37(2). 344-377. Wiley Online Library.
- DeCarlo (1998)
- DeCarlo, L. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3(2). 186. American Psychological Association.
- Deng, Dong, Socher, Li, Li & Fei-Fei (2009)
- Deng, J., Dong, W., Socher, R., Li, L., Li, K. & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248-255.
- Devlin, Chang, Lee & Toutanova (2019)
- Devlin, J., Chang, M., Lee, K. & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
- Driesen & Van hamme (2011)
- Driesen, J. & Van hamme, H. (2011). Modeling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA. Neurocomputing, 74. 1874-1882.
- Dunbar, Cao, Benjumea, Karadayi, Bernard, Besacier, Anguera & Dupoux (2017)
- Dunbar, E., Cao, X., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X. & Dupoux, E. (2017). The Zero Resource Speech Challenge 2017. 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). 323-330. IEEE. Retrieved from https://arxiv.org/abs/1712.04313
- Algayres, Zaiem, Sagot & Dupoux (2020)
- Algayres, R., Zaiem, M., Sagot, B. & Dupoux, E. (2020). Evaluating the reliability of acoustic speech embeddings. arXiv preprint arXiv:2007.13542.
- Riad, Dancette, Karadayi, Zeghidour, Schatz & Dupoux (2018)
- Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T. & Dupoux, E. (2018). Sampling strategies in siamese networks for unsupervised speech representation learning. arXiv preprint arXiv:1804.11297.
- Dunbar, Algayres, Karadayi, Bernard, Benjumea, Cao, Miskic, Dugrain, Ondel, Black et al. (2019)
- Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X., Miskic, L., Dugrain, C., Ondel, L., Black, A. et al. (2019). The zero resource speech challenge 2019: TTS without T. INTERSPEECH. Retrieved from https://arxiv.org/abs/1904.11469
- Dunbar, Karadayi, Bernard, Cao, Algayres, Ondel, Besacier, Sakti & Dupoux (2020)
- Dunbar, E., Karadayi, J., Bernard, M., Cao, X., Algayres, R., Ondel, L., Besacier, L., Sakti, S. & Dupoux, E. (2020). The zero resource speech challenge 2020: Discovering discrete subword and word units. INTERSPEECH.
- Duong, Anastasopoulos, Chiang, Bird & Cohn (2016)
- Duong, L., Anastasopoulos, A., Chiang, D., Bird, S. & Cohn, T. (2016). An attentional model for speech translation without transcription. Proceedings of NAACL-HLT. 949-959.
- Dupoux (2018)
- Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173. 43-59. Elsevier. Retrieved from https://arxiv.org/abs/1607.08723
- Peters, Neumann, Iyyer, Gardner, Clark, Lee & Zettlemoyer (2018)
- Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. & Zettlemoyer, L. (2018). Deep contextualized word representations. NAACL.
- Faruqui, Tsvetkov, Rastogi & Dyer (2016)
- Faruqui, M., Tsvetkov, Y., Rastogi, P. & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.
- Feigenbaum (1963)
- Feigenbaum, E. (1963). The simulation of verbal learning behavior. Computers and thought. McGraw-Hill.
- Feldman & Griffiths (2007)
- Feldman, N. & Griffiths, T. (2007). A rational account of the perceptual magnet effect. Proceedings of the annual meeting of the cognitive science society, 29.
- Feldman, Griffiths, Goldwater & Morgan (2013)
- Feldman, N., Griffiths, T., Goldwater, S. & Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological review, 120(4). 751-778. American Psychological Association.
- Feng, Lee & Peng (2019)
- Feng, S., Lee, T. & Peng, Z. (2019). Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1906.07234
- Cieri, Miller & Walker (2004)
- Cieri, C., Miller, D. & Walker, K. (2004). The fisher corpus: A resource for the next generations of speech-to-text. LREC.
- Frome, Corrado, Shlens, Bengio, Dean, Ranzato & Mikolov (2013)
- Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M. & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems (NIPS 2013). 2121-2129.
- Futrell, Wilcox, Morita, Qian, Ballesteros & Levy (2019)
- Futrell, R., Wilcox, E., Morita, T., Qian, P., Ballesteros, M. & Levy, R. (2019). Neural language models as psycholinguistic subjects: Representations of syntactic state.
- Futrell, Wilcox, Morita & Levy (2018)
- Futrell, R., Wilcox, E., Morita, T. & Levy, R. (2018). RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint 1809.01329.
- Gage (1994)
- Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2). 23-38. McPherson, KS: R & D Publications, c1987-1994.
- García-Granada, Sanchis, Castro-Bleda, González & Hurtado (2017)
- García-Granada, F., Sanchis, E., Castro-Bleda, M., González, J. & Hurtado, L. (2017). ZeroSpeech2017 ELIRF-UPV system. Submitted to ASRU 2017.
- Gerz, Vulić, Hill, Reichart & Korhonen (2016)
- Gerz, D., Vulić, I., Hill, F., Reichart, R. & Korhonen, A. (2016). Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
- Glass (2012)
- Glass, J. (2012). Towards unsupervised speech processing. Information Science, Signal Processing and their Applications (ISSPA), 2012 11th International Conference on. 1-4. IEEE.
- Myrman & Salvi (2017)
- Myrman, A. & Salvi, G. (2017). Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling. International Workshop on Grounding Language Understanding (GLU). ISCA.
- Godais, Linzen & Dupoux (2017)
- Godais, G., Linzen, T. & Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. 125-130.
- Godfrey, Holliman & McDaniel (1992)
- Godfrey, J., Holliman, E. & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for research and development. [Proceedings] ICASSP-92: 1992 IEEE international conference on acoustics, speech, and signal processing, 1. 517-520. IEEE.
- Goldberg (2019)
- Goldberg, Y. (2019). Assessing BERT’s syntactic abilities. arXiv preprint 1901.05287.
- Goldwater, Griffiths & Johnson (2009)
- Goldwater, S., Griffiths, T. & Johnson, M. (2009). A bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112. 21-54. Elsevier.
- Guenther & Gjaja (1996)
- Guenther, F. & Gjaja, M. (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America, 100(2). 1111-1121. Acoustical Society of America.
- Gulordava, Bojanowski, Grave, Linzen & Baroni (2018)
- Gulordava, K., Bojanowski, P., Grave, E., Linzen, T. & Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Retrieved from https://www.aclweb.org/anthology/N18-1108
- Hahn & Baroni (2019)
- Hahn, M. & Baroni, M. (2019). Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text. Transactions of the Association for Computational Linguistics (Accepted). Retrieved from https://arxiv.org/abs/1906.07285
- Halawi, Dror, Gabrilovich & Koren (2012)
- Halawi, G., Dror, G., Gabrilovich, E. & Koren, Y. (2012). Large-scale learning of word relatedness with constraints. Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. 1406-1414.
- Harwath & Glass (2015)
- Harwath, D. & Glass, J. (2015). Deep multimodal semantic embeddings for speech and images. 2015 IEEE workshop on automatic speech recognition and understanding (ASRU). 237-244. IEEE.
- Harwath, Torralba & Glass (2016)
- Harwath, D., Torralba, A. & Glass, J. (2016). Unsupervised learning of spoken language with visual context. Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS 2016). 1858-1866.
- Harwath, Hsu & Glass (2019)
- Harwath, D., Hsu, W. & Glass, J. (2019). Learning hierarchical discrete linguistic units from visually-grounded speech. arXiv preprint arXiv:1911.09602.
- Tiede, Espy-Wilson, Goldenberg, Mitra, Nam & Sivaraman (2017)
- Tiede, M., Espy-Wilson, C., Goldenberg, D., Mitra, V., Nam, H. & Sivaraman, G. (2017). Quantifying kinematic aspects of reduction in a contrasting rate production task. The Journal of the Acoustical Society of America, 141(5). 3580-3580. Retrieved from https://doi.org/10.1121/1.4987629
- Hastie, Tibshirani & Friedman (2009)
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The elements of statistical learning – data mining, inference, and prediction. Springer.
- Havard, Besacier & Rosec (2017)
- Havard, W., Besacier, L. & Rosec, O. (2017). SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set. Proc. GLU 2017 international workshop on grounding language understanding. 42-46. Retrieved from http://dx.doi.org/10.21437/GLU.2017-9
- Arandjelovic & Zisserman (2017)
- Arandjelovic, R. & Zisserman, A. (2017). Look, listen and learn. Proceedings of the IEEE international conference on computer vision. 609-617.
- Chrupała, Gelderloos & Alishahi (2017)
- Chrupała, G., Gelderloos, L. & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. arXiv preprint arXiv:1702.01991.
- Chrupała, Gelderloos & Alishahi (2017)
- Chrupała, G., Gelderloos, L. & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 613-622.
- Jansen, Dupoux, Goldwater, Johnson, Khudanpur, Church, Feldman, Hermansky, Metze, Rose, Seltzer, Clark, McGraw, Varadarajan, Bennett, Borschinger, Chiu, Dunbar, Fourtassi, Harwath, Lee, Levin, Norouzian, Peddinti, Richardson, Schatz & Thomas (2013)
- Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., Feldman, N., Hermansky, H., Metze, F., Rose, R., Seltzer, M., Clark, P., McGraw, I., Varadarajan, B., Bennett, E., Borschinger, B., Chiu, J., Dunbar, E., Fourtassi, A., Harwath, D., Lee, C., Levin, K., Norouzian, A., Peddinti, V., Richardson, R., Schatz, T. & Thomas, S. (2013). A summary of the 2012 JH CLSP Workshop on zero resource speech technologies and models of early language acquisition. Proceedings of ICASSP 2013.
- Elsner, Goldwater & Eisenstein (2012)
- Elsner, M., Goldwater, S. & Eisenstein, J. (2012). Bootstrapping a unified model of lexical and phonetic acquisition. Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: Long papers). 184-193.
- Bostrom & Durrett (2020)
- Bostrom, K. & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. Retrieved from https://arxiv.org/abs/2004.03720
- Fer, Matejka, Grezl, Plchot, Vesely & Cernocky (2017)
- Fer, R., Matejka, P., Grezl, F., Plchot, O., Vesely, K. & Cernocky, J. (2017). Multilingually trained bottleneck features in spoken language recognition. Computer Speech and Language, 46(Supplement C). 252-267.
- Yusuf, Gok, Gundogdu, Kose & Saraclar (2019)
- Yusuf, B., Gok, A., Gundogdu, B., Kose, O. & Saraclar, M. (2019). Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge. INTERSPEECH 2019.
- Pitt, Dilley, Johnson, Kiesling, Raymond, Hume & Fosler-Lussier (2007)
- Pitt, M., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E. & Fosler-Lussier, E. (2007). Buckeye corpus of conversational speech (2nd release). www.buckeyecorpus.osu.edu; Columbus, OH: Department of Psychology, Ohio State University (Distributor).
- Barnard (2014)
- Barnard, D. (2014). The NCHLT speech corpus of the South African languages. https://sites.google.com/site/nchltspeechcorpus/home; 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. Retrieved from http://hdl.handle.net/10204/7549
- Chen, Leung, Xie, Ma & Li (2017)
- Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2017). Multilingual bottle-neck feature learning from untranscribed speech. Submitted to ASRU 2017.
- Cho, Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk & Bengio (2014)
- Cho, K., Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724-1734. Association for Computational Linguistics.
- Chrupała (2019)
- Chrupała, G. (2019). Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6452-6462. Association for Computational Linguistics.
- Badino, Mereta & Rosasco (2015)
- Badino, L., Mereta, A. & Rosasco, L. (2015). Discovering discrete subword units with binarized autoencoders and hidden-markov-model encoders. Sixteenth annual conference of the international speech communication association.
- Chen, Leung, Xie, Ma & Li (2015)
- Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2015). Parallel inference of dirichlet process gaussian mixture models for unsupervised acoustic modeling: A feasibility study. Sixteenth annual conference of the international speech communication association.
- Myrman & Salvi (2017)
- Myrman, A. & Salvi, G. (2017). Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling. International workshop on grounding language understanding (GLU). ISCA.
- Renshaw, Kamper, Jansen & Goldwater (2015)
- Renshaw, D., Kamper, H., Jansen, A. & Goldwater, S. (2015). A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. Sixteenth annual conference of the international speech communication association.
- Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015)
- Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M. & Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. Sixteenth annual conference of the international speech communication association.
- Zeghidour, Synnaeve, Versteegh & Dupoux (2016)
- Zeghidour, N., Synnaeve, G., Versteegh, M. & Dupoux, E. (2016). A deep scattering spectrum—deep siamese network pipeline for unsupervised acoustic modeling. 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). 4965-4969. IEEE.
- Chen, Leung, Xie, Ma & Li (2017)
- Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2017). Multilingual bottle-neck feature learning from untranscribed speech. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 727-733. IEEE.
- Pellegrini, Manenti & Pinquier (2017)
- Pellegrini, T., Manenti, C. & Pinquier, J. (2017). Technical report the IRIT-UPS system@ ZeroSpeech 2017 Track1: Unsupervised subword modeling. Tech. rep., IRIT, Université de Toulouse.
- Kharitonov, Rivière, Synnaeve, Wolf, Mazaré, Douze & Dupoux (2021)
- Kharitonov, E., Rivière, M., Synnaeve, G., Wolf, L., Mazaré, P., Douze, M. & Dupoux, E. (2021). Data augmenting contrastive learning of speech representations in the time domain. 2021 IEEE spoken language technology workshop (SLT). 215-222. IEEE.
- Jansen & Van Durme (2011)
- Jansen, A. & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. 2011 IEEE workshop on automatic speech recognition & understanding. 401-406. IEEE.
- Seshadri, Remes & Räsänen (2017)
- Seshadri, S., Remes, U. & Räsänen, O. (2017). Comparison of non-parametric bayesian mixture models for syllable clustering and zero-resource speech processing. INTERSPEECH 2017. ISCA.
- Lyzinski, Sell & Jansen (2015)
- Lyzinski, V., Sell, G. & Jansen, A. (2015). An evaluation of graph clustering methods for unsupervised term discovery. Sixteenth annual conference of the international speech communication association.
- Lakhotia, Kharitonov, Hsu, Adi, Polyak, Bolte, Nguyen, Copet, Baevski, Mohamed et al. (2021)
- Lakhotia, K., Kharitonov, E., Hsu, W., Adi, Y., Polyak, A., Bolte, B., Nguyen, T., Copet, J., Baevski, A., Mohamed, A. et al. (2021). On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9. 1336-1354. MIT Press.
- Millet & Dunbar (2020)
- Millet, J. & Dunbar, E. (2020). The perceptimatic english benchmark for speech perception models. CogSci Conference 2020.
- Millet & Dunbar (2022)
- Millet, J. & Dunbar, E. (2022). Do self-supervised speech models develop human-like perception biases?
- Moore (2012)
- Moore, B. (2012). An introduction to the psychology of hearing. Brill.
- Weerts, Rosen, Clopath & Goodman (2021)
- Weerts, L., Rosen, S., Clopath, C. & Goodman, D. (2021). The psychometrics of automatic speech recognition. bioRxiv. Cold Spring Harbor Laboratory.
- Tsuji, Cristia & Dupoux (2021)
- Tsuji, S., Cristia, A. & Dupoux, E. (2021). SCALa: A blueprint for computational models of language acquisition in social context. Cognition, 213. 104779. Elsevier.
- Buerkin-Pontrelli, Culbertson, Legendre & Nazzi (2017)
- Buerkin-Pontrelli, A., Culbertson, J., Legendre, G. & Nazzi, T. (2017). Competing models of liaison acquisition: Evidence from corpus and experimental data. Language, 93(1). 189-219. Linguistic Society of America.
- Babineau, Legrand & Shi (2021)
- Babineau, M., Legrand, C. & Shi, R. (2021). Variable forms in french-learning toddlers’ lexical representations. Developmental Psychology. American Psychological Association.
- Van Gijn & Zúñiga (2014)
- Van Gijn, R. & Zúñiga, F. (2014). Word and the americanist perspective. Morphology, 24(3). 135-160. Springer.
- Millet & Dunbar (2020)
- Millet, J. & Dunbar, E. (2020). Perceptimatic: A human speech perception benchmark for unsupervised subword modelling. arXiv preprint arXiv:2010.05961.
- Warstadt & Bowman (2019)
- Warstadt, A. & Bowman, S. (2019). Grammatical analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint 1901.03438.
- Pandia & Murthy (2020)
- Pandia, K. & Murthy, H. (2020). Zero resource speech synthesis using transcripts derived from perceptual acoustic units. arXiv preprint arXiv:2006.04372.
- Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021)
- Chorowski, J., Ciesielski, G., Dzikowski, J., Łańcucki, A., Marxer, R., Opala, M., Pusz, P., Rychlikowski, P. & Stypułkowski, M. (2021). Aligned contrastive predictive coding. arXiv preprint arXiv:2104.11946.
- Chrupała (2022)
- Chrupała, G. (2022). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Journal of Artificial Intelligence Research, 73. 673-707.
- Hsu, Bolte, Tsai, Lakhotia, Salakhutdinov & Mohamed (2021)
- Hsu, W., Bolte, B., Tsai, Y., Lakhotia, K., Salakhutdinov, R. & Mohamed, A. (2021). Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29. 3451-3460. IEEE.
- Gwilliams, Linzen, Poeppel & Marantz (2018)
- Gwilliams, L., Linzen, T., Poeppel, D. & Marantz, A. (2018). In spoken word recognition, the future predicts the past. Journal of Neuroscience, 38(35). 7585-7599. Soc Neuroscience.
- Beekhuizen, Armstrong & Stevenson (2021)
- Beekhuizen, B., Armstrong, B. & Stevenson, S. (2021). Probing lexical ambiguity: Word vectors encode number and relatedness of senses. Cognitive Science, 45(5). e12943. Wiley Online Library.
- Nikolaus, Alishahi & Chrupała (2022)
- Nikolaus, M., Alishahi, A. & Chrupała, G. (2022). Learning english with peppa pig. arXiv preprint arXiv:2202.12917.
- Havard, Chevrot & Besacier (2019)
- Havard, W., Chevrot, J. & Besacier, L. (2019). Models of visually grounded speech signal pay attention to nouns: A bilingual experiment on english and japanese. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019). 8618-8622.
- Havard, Chevrot & Besacier (2019)
- Havard, W., Chevrot, J. & Besacier, L. (2019). Word recognition, competition, and activation in a model of visually grounded speech. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL 2019). 339-348.
- He, Zhang, Ren & Sun (2016)
- He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. 770-778. IEEE.
- Heck, Sakti & Nakamura ()
- Heck, M., Sakti, S. & Nakamura, S. (). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017. Submitted to ASRU 2017.
- Higy, Elliott & Chrupała (2020)
- Higy, B., Elliott, D. & Chrupała, G. (2020). Textual Supervision for Visually Grounded Spoken Language Understanding. Findings of the Association for Computational Linguistics: EMNLP 2020. 2698-2709. Association for Computational Linguistics.
- Hill (1983)
- Hill, J. (1983). A computational model of language acquisition in the two-year old. Cognition and Brain Theory, 6. 287-317.
- Hill, Reichart & Korhonen (2015)
- Hill, F., Reichart, R. & Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4). 665-695. MIT Press.
- Hochreiter & Schmidhuber (1997)
- Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8). 1735-1780. MIT Press.
- Bin & Yuan (2019)
- Bin, Y. & Yuan, W. (2019). A VAE model with speaker verification for unsupervised subword modeling: A submission to ZeroSpeech 2019. Submitted to INTERSPEECH 2019.
- Hsu, Harwath, Song & Glass (2020)
- Hsu, W., Harwath, D., Song, C. & Glass, J. (2020). Text-Free Image-to-Speech Synthesis Using Learned Segmental Units. 34th Conference on Neural Information Processing Systems (NeurIPS) Workshop on Self-Supervised Learning for Speech and Audio Processing.
- Huijbregts, McLaren & Leeuwen (2011)
- Huijbregts, M., McLaren, M. & Leeuwen, D. (2011). Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4436-4439.
- (2019)
- (2019). INTERSPEECH 2019 – 20th Annual Conference of the International Speech Communication Association, September 15-19, Graz, Austria, Proceedings.
- Riochet, Castro, Bernard, Lerer, Fergus, Izard & Dupoux (2018)
- Riochet, R., Castro, M., Bernard, M., Lerer, A., Fergus, R., Izard, V. & Dupoux, E. (2018). Intphys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616.
- Jansen, Thomas & Hermansky (2013)
- Jansen, A., Thomas, S. & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. ICASSP. 8091-8095.
- Johnson, Griffiths & Goldwater (2007)
- Johnson, M., Griffiths, T. & Goldwater, S. (2007). Adaptor grammars: A framework for specifying compositional nonparametric bayesian models. Advances in neural information processing systems, 19. 641-648. MIT Press.
- Jürgens, Brand & Kollmeier (2007)
- Jürgens, T., Brand, T. & Kollmeier, B. (2007). Modelling the human-machine gap in speech reception: Microscopic speech intelligibility prediction for normal-hearing subjects with an auditory model. Eighth annual conference of the international speech communication association.
- Kahn, Riviere, Zheng, Kharitonov, Xu, Mazare, Karadayi, Liptchinsky, Collobert, Fuegen et al. (2020)
- Kahn, J., Riviere, M., Zheng, W., Kharitonov, E., Xu, Q., Mazare, P., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C. et al. (2020). Libri-light: A benchmark for ASR with limited or no supervision. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. Retrieved from http://dx.doi.org/10.1109/ICASSP40776.2020.9052942
- Kahn, Rivière, Zheng, Kharitonov, Xu, Mazaré, Karadayi, Liptchinsky, Collobert, Fuegen, Likhomanenko, Synnaeve, Joulin, Mohamed & Dupoux (2020)
- Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A., Mohamed, A. & Dupoux, E. (2020). Libri-light: A benchmark for ASR with limited or no supervision. INTERSPEECH. Retrieved from https://arxiv.org/abs/1912.07875
- Kamper, Livescu & Goldwater (2017)
- Kamper, H., Livescu, K. & Goldwater, S. (2017). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. ASRU 2017. Retrieved from https://arxiv.org/abs/1904.07556
- Kamper, Shakhnarovich & Livescu (2019)
- Kamper, H., Shakhnarovich, G. & Livescu, K. (2019). Semantic speech retrieval with a visually grounded model of untranscribed speech. IEEE/ACM Transactions on Audio, Speech and Language Processing, 27. 89-98.
- Kamper, Elsner, Jansen & Goldwater (2015)
- Kamper, H., Elsner, M., Jansen, A. & Goldwater, S. (2015). Unsupervised neural network based feature extraction using weak top-down constraints. Proceedings of ICASSP.
- Karpathy & Li (2015)
- Karpathy, A. & Li, F. (2015). Deep visual-semantic alignments for generating image descriptions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015). 3128-3137.
- Kawakami, Wang, Dyer, Blunsom & Oord (2020)
- Kawakami, K., Wang, L., Dyer, C., Blunsom, P. & Oord, A. (2020). Learning robust and multilingual speech representations. Retrieved from https://arxiv.org/abs/2001.11128
- Kleinschmidt & Jaeger (2015)
- Kleinschmidt, D. & Jaeger, T. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2). 148-203. American Psychological Association.
- Kuhl (1991)
- Kuhl, P. (1991). Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Attention, Perception, & Psychophysics, 50(2). 93-107. Springer.
- Lau, Clark & Lappin (2017)
- Lau, J., Clark, A. & Lappin, S. (2017). Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science. 1202-1241.
- Lee & Glass (2012)
- Lee, C. & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. 40-49.
- Chomsky (1957)
- Chomsky, N. (1957). Syntactic structures. JSTOR.
- Liberman, Cooper, Shankweiler & Studdert-Kennedy (1967)
- Liberman, A., Cooper, F., Shankweiler, D. & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological review, 74(6). 431. American Psychological Association.
- Fowler (1986)
- Fowler, C. (1986). An event approach to the study of speech perception from a direct–realist perspective. Journal of phonetics, 14(1). 3-28. Elsevier.
- Baljekar, Sitaram, Muthukumar & Black (2015)
- Baljekar, P., Sitaram, S., Muthukumar, P. & Black, A. (2015). Using articulatory features and inferred phonological segments in zero resource speech processing. Sixteenth annual conference of the international speech communication association.
- Morita & Koda (2020)
- Morita, T. & Koda, H. (2020). Exploring TTS without T using biologically/psychologically motivated neural network modules (ZeroSpeech 2020). arXiv preprint arXiv:2005.05487.
- Chomsky & Halle (1968)
- Chomsky, N. & Halle, M. (1968). The sound pattern of English. Harper & Row.
- Linzen, Dupoux & Goldberg (2016)
- Linzen, T., Dupoux, E. & Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL.
- Linzen & Leonard (2018)
- Linzen, T. & Leonard, B. (2018). Distinct patterns of syntactic agreement errors in recurrent networks and humans. arXiv preprint 1807.06882.
- Lisker & Abramson (1964)
- Lisker, L. & Abramson, A. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3). 384-422. Taylor & Francis.
- Liu, Hsu & Lee (2019)
- Liu, A., Hsu, P. & Lee, H. (2019). Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1905.11563
- Liu, Lowe, Serban, Noseworthy, Charlin & Pineau (2016)
- Liu, C., Lowe, R., Serban, I., Noseworthy, M., Charlin, L. & Pineau, J. (2016). How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
- Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer & Stoyanov (2019)
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692. Retrieved from http://arxiv.org/abs/1907.11692
- Bates, Mächler, Bolker & Walker (2015)
- Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1). 1-48.
- Ludusan, Versteegh, Jansen, Gravier, Cao, Johnson & Dupoux (2014)
- Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X., Johnson, M. & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: An evaluation toolbox for term discovery systems. Proceedings of LREC. 560-567.
- Luong, Socher & Manning (2013)
- Luong, M., Socher, R. & Manning, C. (2013). Better word representations with recursive neural networks for morphology. Proceedings of the seventeenth conference on computational natural language learning. 104-113.
- Macmillan & Creelman (2004)
- Macmillan, N. & Creelman, C. (2004). Detection theory: A user’s guide. Psychology Press.
- Mahrt (2016)
- Mahrt, T. (2016). LMEDS: Language markup and experimental design software.
- Wang, Zhang & Zhang (2015)
- Wang, D., Zhang, X. & Zhang, Z. (2015). THCHS-30: A free chinese speech corpus. arXiv preprint arXiv:1512.01882.
- Manenti, Pellegrini & Pinquier (2017)
- Manenti, C., Pellegrini, T. & Pinquier, J. (2017). Unsupervised speech unit discovery using k-means and neural networks. International conference on statistical language and speech processing. 169-180. Springer.
- Mangin, Filliat, Bosch & Oudeyer (2015)
- Mangin, O., Filliat, D., Bosch, L. & Oudeyer, P. (2015). MCA-NMF: Multimodal concept acquisition with non-negative matrix factorization. PLOS One.
- Matlock (2001)
- Matlock, T. (2001). How real is fictive motion? Psychology Department, University of California, Santa Cruz.
- Melis, Dyer & Blunsom (2018)
- Melis, G., Dyer, C. & Blunsom, P. (2018). On the state of the art of evaluation in neural language models. ICLR.
- Merkx, Frank & Ernestus (2019)
- Merkx, D., Frank, S. & Ernestus, M. (2019). Language Learning Using Speech to Image Retrieval. Proc. Interspeech 2019. 1841-1845.
- Meyer, Wesker, Brand, Mertins & Kollmeier (2006)
- Meyer, B., Wesker, T., Brand, T., Mertins, A. & Kollmeier, B. (2006). A human-machine comparison in speech recognition based on a logatome corpus. Speech recognition and intrinsic variation workshop.
- Meyer, Wächter, Brand & Kollmeier (2007)
- Meyer, B., Wächter, M., Brand, T. & Kollmeier, B. (2007). Phoneme confusions in human and automatic speech recognition. Eighth annual conference of the international speech communication association.
- Meyer, Jürgens, Wesker, Brand & Kollmeier (2010)
- Meyer, B., Jürgens, T., Wesker, T., Brand, T. & Kollmeier, B. (2010). Human phoneme recognition depending on speech-intrinsic variability. The Journal of the Acoustical Society of America, 128(5). 3126-3141. Acoustical Society of America.
- Miao, Gowayyed & Metze (2015)
- Miao, Y., Gowayyed, M. & Metze, F. (2015). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. Automatic speech recognition and understanding (ASRU), 2015 IEEE workshop on. 167-174. IEEE.
- Miech, Zhukov, Alayrac, Tapaswi, Laptev & Sivic (2019)
- Miech, A., Zhukov, D., Alayrac, J., Tapaswi, M., Laptev, I. & Sivic, J. (2019). HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV.
- Miller & Charles (1991)
- Miller, G. & Charles, W. (1991). Contextual correlates of semantic similarity. Language and cognitive processes, 6(1). 1-28. Taylor & Francis.
- Millet, Jurov & Dunbar (2019)
- Millet, J., Jurov, N. & Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. CogSci Conference 2019.
- Muscariello, Gravier & Bimbot (2012)
- Muscariello, A., Gravier, G. & Bimbot, F. (2012). Unsupervised Motif Acquisition in Speech via Seeded Discovery and Template Matching Combination. IEEE Transactions on Audio, Speech and Language Processing, 20(7). 2031-2044.
- Gulordava, Bojanowski, Grave, Linzen & Baroni (2018)
- Gulordava, K., Bojanowski, P., Grave, E., Linzen, T. & Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 1195-1205. Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/N18-1108
- Kwiatkowski, Palomaki, Redfield, Collins, Parikh, Alberti, Epstein, Polosukhin, Devlin, Lee et al. (2019)
- Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K. et al. (2019). Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7. 453-466. MIT Press.
- Cuervo, Grabias, Chorowski, Ciesielski, Łańcucki, Rychlikowski & Marxer (2021)
- Cuervo, S., Grabias, M., Chorowski, J., Ciesielski, G., Łańcucki, A., Rychlikowski, P. & Marxer, R. (2021). Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words. arXiv preprint arXiv:2110.15909.
- Iwamoto & Shinozaki (2021)
- Iwamoto, Y. & Shinozaki, T. (2021). Unsupervised spoken term discovery using wav2vec 2.0. 2021 asia-pacific signal and information processing association annual summit and conference (APSIPA ASC). 1082-1086. IEEE.
- Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021)
- Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L. & Dehak, N. (2021). Segmental contrastive predictive coding for unsupervised word segmentation. arXiv preprint arXiv:2106.02170.
- Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021)
- Bhati, S., Villalba, J., Żelasko, P., Moro-Velazquez, L. & Dehak, N. (2021). Unsupervised speech segmentation and variable rate representation learning using segmental contrastive predictive coding. arXiv preprint arXiv:2110.02345.
- Bhati, Villalba, Żelasko & Dehak (2020)
- Bhati, S., Villalba, J., Żelasko, P. & Dehak, N. (2020). Self-expressing autoencoders for unsupervised spoken term discovery. arXiv preprint arXiv:2007.13033.
- Borgholt, Havtorn, Edin, Maaløe & Igel (2022)
- Borgholt, L., Havtorn, J., Edin, J., Maaløe, L. & Igel, C. (2022). A brief overview of unsupervised neural speech representation learning.
- Nayak, Kumar, Ramesh, Bhati & Murty (2019)
- Nayak, S., Kumar, C., Ramesh, G., Bhati, S. & Murty, K. (2019). Virtual Phone Discovery for Speech Synthesis. Retrieved from https://doi.org/10.13140/RG.2.2.23356.08324
- Tobing, Hayashi, Wu, Kobayashi & Toda (2020)
- Tobing, P., Hayashi, T., Wu, Y., Kobayashi, K. & Toda, T. (2020). Cyclic spectral modeling for unsupervised unit discovery into voice conversion with excitation and waveform modeling. INTERSPEECH. 4861-4865.
- Chen & Hain (2020)
- Chen, M. & Hain, T. (2020). Unsupervised acoustic unit representation learning for voice conversion using wavenet auto-encoders. arXiv preprint arXiv:2008.06892.
- Niekerk, Nortje & Kamper (2020)
- Niekerk, B., Nortje, L. & Kamper, H. (2020). Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge. arXiv preprint arXiv:2005.09409.
- Yusuf, Ondel, Burget, Černocký & Saraclar (2021)
- Yusuf, B., Ondel, L., Burget, L., Černocký, J. & Saraclar, M. (2021). A hierarchical subspace model for language-attuned acoustic unit discovery. ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). 3710-3714. IEEE.
- Gündogdu, Yusuf, Yesilbursa & Saraclar (2020)
- Gündogdu, B., Yusuf, B., Yesilbursa, M. & Saraclar, M. (2020). Vector quantized temporally-aware correspondence sparse autoencoders for zero-resource acoustic unit discovery. INTERSPEECH. 4846-4850.
- Newell & Simon (1972)
- Newell, A. & Simon, H. (1972). Human problem solving. Prentice-Hall.
- Nguyen, Seyssel, Rozé, Rivière, Kharitonov, Baevski, Dunbar & Dupoux (2020)
- Nguyen, T., Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E. & Dupoux, E. (2020). The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. arXiv preprint arXiv:2011.11588.
- Jurov (2019)
- Jurov, N. (2019). Phonetics or Phonology? Modelling Non-Native Perception. Université Paris Diderot.
- Ondel, Godard, Besacier, Larsen, Hasegawa-Johnson, Scharenborg, Dupoux, Burget, Yvon & Khudanpur (2018)
- Ondel, L., Godard, P., Besacier, L., Larsen, E., Hasegawa-Johnson, M., Scharenborg, O., Dupoux, E., Burget, L., Yvon, F. & Khudanpur, S. (2018). Bayesian models for unit discovery on a very low resource language. 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). 5939-5943. IEEE.
- Oord, Li & Vinyals (2018)
- Oord, A., Li, Y. & Vinyals, O. (2018). Representation learning with contrastive predictive coding. CoRR, abs/1807.03748. Retrieved from http://arxiv.org/abs/1807.03748
- Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier & Auli (2019)
- Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D. & Auli, M. (2019). Fairseq: A fast, extensible toolkit for sequence modeling. Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics (demonstrations). 48-53. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N19-4009
- Panayotov, Chen, Povey & Khudanpur (2015)
- Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. (2015). Librispeech: An asr corpus based on public domain audio books. 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). 5206-5210. IEEE.
- Pandia & Murthy (2019)
- Pandia, K. & Murthy, H. (2019). Zero Resource Speech Synthesis Using Transcripts Derived from Perceptual Acoustic Units. INTERSPEECH 2019.
- Park & Glass (2008)
- Park, A. & Glass, J. (2008). Unsupervised Pattern Discovery in Speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1). 186-197.
- Parrot, Millet & Dunbar (2019)
- Parrot, M., Millet, J. & Dunbar, E. (2019). Independent and automatic evaluation of acoustic-to-articulatory inversion models. arXiv. arXiv-1911.
- Pauls & Klein (2012)
- Pauls, A. & Klein, D. (2012). Large-scale syntactic language modeling with treelets.
- Chang & Fisher III (2013)
- Chang, J. & Fisher III, J. (2013). Parallel sampling of DP mixture models using sub-cluster splits. Advances in Neural Information Processing Systems. 620-628.
- Pellegrini, Manenti & Pinquier ()
- Pellegrini, T., Manenti, C. & Pinquier, J. (). Unsupervised discovery of sub-lexical units in speech based on ZCA and k-means. Submitted to ASRU 2017.
- Peperkamp (2015)
- Peperkamp, S. (2015). Phonology versus phonetics in loanword adaptations. 71-90. John Benjamins Publishing Company.
- Phillips, Wagers & Lau (2011)
- Phillips, C., Wagers, M. & Lau, E. (2011). Grammatical illusions and selective fallibility in real-time language comprehension. Experiments at the Interfaces, 37. 147-180. Brill.
- Pintér & Watanabe (2016)
- Pintér, G. & Watanabe, H. (2016). Do GMM phoneme classifiers perceive synthetic sibilants as humans do? INTERSPEECH. 1363-1367.
- Pitt, Johnson, Hume, Kiesling & Raymond (2005)
- Pitt, M., Johnson, K., Hume, E., Kiesling, S. & Raymond, W. (2005). The buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45(1). 89-95. Elsevier.
- Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz, Silovsky, Stemmer & Vesely (2011)
- Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G. & Vesely, K. (2011). The kaldi speech recognition toolkit. IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.
- R Core Team (2017)
- R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
- Rabiner (1989)
- Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2). 257-286.
- Radford, Wu, Child, Luan, Amodei & Sutskever (2019)
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. & Sutskever, I. (2019). Language models are unsupervised multitask learners.
- Radinsky, Agichtein, Gabrilovich & Markovitch (2011)
- Radinsky, K., Agichtein, E., Gabrilovich, E. & Markovitch, S. (2011). A word at a time: Computing word relatedness using temporal semantic analysis. Proceedings of the 20th international conference on world wide web. 337-346.
- Räsänen & Rasilo (2015)
- Räsänen, O. & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122. 792-829.
- Ravfogel, Tyers & Goldberg (2018)
- Ravfogel, S., Tyers, F. & Goldberg, Y. (2018). Can LSTM learn to capture agreement? The case of basque. arXiv preprint 1809.04022.
- Kamper, Jansen & Goldwater (2017)
- Kamper, H., Jansen, A. & Goldwater, S. (2017). A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46. 154-174. Elsevier.
- Kamper (2022)
- Kamper, H. (2022). Word segmentation on discovered phone units with dynamic programming and self-supervised scoring. arXiv preprint arXiv:2202.11929.
- Dupoux (2018)
- Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173. 43-59. Elsevier.
- Rivière, Joulin, Mazaré & Dupoux (2020)
- Rivière, M., Joulin, A., Mazaré, P. & Dupoux, E. (2020). Unsupervised pretraining transfers well across languages. Retrieved from https://arxiv.org/abs/2002.02848
- Roy & Pentland (2002)
- Roy, D. & Pentland, A. (2002). Learning words from sights and sounds: A computational model. Cognitive Science, 26. 113-146.
- Rubenstein & Goodenough (1965)
- Rubenstein, H. & Goodenough, J. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10). 627-633. ACM New York, NY, USA.
- Tjandra, Sisman, Zhang, Sakti, Li & Nakamura (2019)
- Tjandra, A., Sisman, B., Zhang, M., Sakti, S., Li, H. & Nakamura, S. (2019). VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1905.11449
- Sakti, Kelana, Riza, Sakai, Markov & Nakamura (2008)
- Sakti, S., Kelana, E., Riza, H., Sakai, S., Markov, K. & Nakamura, S. (2008). Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project. Proceedings of the workshop on technologies and corpora for asia-pacific speech translation (TCAST).
- Sakti, Maia, Sakai, Shimizu & Nakamura (2008)
- Sakti, S., Maia, R., Sakai, S., Shimizu, T. & Nakamura, S. (2008). Development of HMM-based Indonesian speech synthesis. Proc. Oriental COCOSDA. 215-219.
- Salazar, Liang, Nguyen & Kirchhoff (2020)
- Salazar, J., Liang, D., Nguyen, T. & Kirchhoff, K. (2020). Masked language model scoring. Proceedings of the 58th annual meeting of the association for computational linguistics. 2699-2712. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.240
- Sanabria, Caglayan, Palaskar, Elliott, Barrault, Specia & Metze (2018)
- Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L. & Metze, F. (2018). How2: A large-scale dataset for multimodal language understanding. Proceedings of the workshop on visually grounded interaction and language (ViGIL). NeurIPS. Retrieved from http://arxiv.org/abs/1811.00347
- Scharenborg (2007)
- Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication, 49(5). 336-347. Elsevier.
- Scharenborg, Tiesmeyer, Hasegawa-Johnson & Dehak (2018)
- Scharenborg, O., Tiesmeyer, S., Hasegawa-Johnson, M. & Dehak, N. (2018). Visualizing phoneme category adaptation in deep neural networks. INTERSPEECH. 1482-1486.
- Scharenborg, Gouw, Larson & Marchiori (2019)
- Scharenborg, O., Gouw, N., Larson, M. & Marchiori, E. (2019). The representation of speech in deep neural networks. International conference on multimedia modeling. 194-205. Springer.
- Scharenborg (2019)
- Scharenborg, O. (2019). The representation of speech and its processing in the human brain and deep neural networks. International conference on speech and computer. 1-8. Springer.
- Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013)
- Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task (I): Analysis of the classical MFC/PLP pipeline. INTERSPEECH.
- Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013)
- Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association. 1-5.
- Schatz, Peddinti, Cao, Bach, Hermansky & Dupoux (2014)
- Schatz, T., Peddinti, V., Cao, X., Bach, F., Hermansky, H. & Dupoux, E. (2014). Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise. INTERSPEECH.
- Schatz, Peddinti, Cao, Bach, Hermansky & Dupoux (2014)
- Schatz, T., Peddinti, V., Cao, X., Bach, F., Hermansky, H. & Dupoux, E. (2014). Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise. Fifteenth annual conference of the international speech communication association.
- Schatz (2016)
- Schatz, T. (2016). ABX-discriminability measures and applications. École Normale Supérieure.
- Schatz (2016)
- Schatz, T. (2016). ABX-discriminability measures and applications. Paris 6.
- Schatz, Bach & Dupoux (2017)
- Schatz, T., Bach, F. & Dupoux, E. (2017). ASR systems as models of phonetic category perception in adults. Proceedings of the 39th Annual CogSci Meeting.
- Schatz & Feldman (2018)
- Schatz, T. & Feldman, N. (2018). Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. Proceedings of the Conference on Cognitive Computational Neuroscience.
- Schatz, Feldman, Goldwater, Cao & Dupoux ()
- Schatz, T., Feldman, N., Goldwater, S., Cao, X. & Dupoux, E. (). Early phonetic learning without phonetic categories: Insights from machine learning. Proceedings of the National Academy of Sciences.
- Schnabel, Labutov, Mimno & Joachims (2015)
- Schnabel, T., Labutov, I., Mimno, D. & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. Proceedings of the 2015 conference on empirical methods in natural language processing. 298-307.
- Schneider, Baevski, Collobert & Auli (2019)
- Schneider, S., Baevski, A., Collobert, R. & Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv:1904.05862.
- Sennrich, Haddow & Birch (2016)
- Sennrich, R., Haddow, B. & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers). 1715-1725. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P16-1162
- Sennrich, Haddow & Birch (2015)
- Sennrich, R., Haddow, B. & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Shibata, Kato, Shinozaki & Watanabe ()
- Shibata, H., Kato, T., Shinozaki, T. & Watanabe, S. (). Composite embedding systems for ZeroSpeech2017 track 1. Submitted to ASRU 2017.
- Norris & McQueen (2008)
- Norris, D. & McQueen, J. (2008). Shortlist B: a Bayesian model of continuous speech recognition. Psychological Review, 115(2). 357-395. American Psychological Association.
- Shrager & Langley (1990)
- Shrager, J. & Langley, P. (1990). Computational models of scientific discovery and theory formation. Morgan Kaufmann.
- Siu, Gish, Chan, Belfield & Lowe (2013)
- Siu, M., Gish, H., Chan, A., Belfield, W. & Lowe, S. (2013). Unsupervised training of an HMM-based self-organizing recognizer with applications to topic classification and keyword discovery. Computer Speech & Language, preprint.
- Socher, Karpathy, Le, Manning & Ng (2014)
- Socher, R., Karpathy, A., Le, Q., Manning, C. & Ng, A. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2. 207-218.
- Scharenborg, Norris, Bosch & McQueen (2005)
- Scharenborg, O., Norris, D., Bosch, L. & McQueen, J. (2005). How should a speech recognizer work? Cognitive Science, 29. 867-918.
- Stolcke & Droppo (2017)
- Stolcke, A. & Droppo, J. (2017). Comparing human and machine errors in conversational speech transcription. INTERSPEECH.
- Sun, Myers, Vondrick, Murphy & Schmid (2019)
- Sun, C., Myers, A., Vondrick, C., Murphy, K. & Schmid, C. (2019). VideoBERT: A joint model for video and language representation learning. Proceedings of the IEEE international conference on computer vision. 7464-7473.
- Synnaeve, Schatz & Dupoux (2014)
- Synnaeve, G., Schatz, T. & Dupoux, E. (2014). Phonetic embedding learning with side information. Proceedings of IEEE spoken language technology.
- Synnaeve, Versteegh & Dupoux (2014)
- Synnaeve, G., Versteegh, M. & Dupoux, E. (2014). Learning words from images and speech. 28th Conference on Neural Information Processing Systems (NIPS) Workshop on Learning Semantics.
- Bosch, Van hamme, Boves & Moore (2008)
- Bosch, L., Van hamme, H., Boves, L. & Moore, R. (2008). A computational model of language acquisition: The emergence of words. Fundamenta Informaticae, 90. 229-249.
- McMurray, Aslin & Toscano (2009)
- McMurray, B., Aslin, R. & Toscano, J. (2009). Statistical learning of phonetic categories: Insights from a computational approach. Developmental Science, 12(3). 369-378. Wiley Online Library.
- Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015)
- Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M. & Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. INTERSPEECH. 3179-3183.
- Schatz, Thiolliere, Dupoux, Synnaeve & Dunbar (2015)
- Schatz, T., Thiolliere, R., Dupoux, E., Synnaeve, G. & Dunbar, E. (2015). ABXpy v0.1. Retrieved from http://dx.doi.org/10.5281/zenodo.16239
- Schatz, Cao, Synnaeve, Thiolliere & Dupoux (2015)
- Schatz, T., Cao, X., Synnaeve, G., Thiolliere, R. & Dupoux, E. (2015). Abkhazia: Preliminary release. Retrieved from http://dx.doi.org/10.5281/zenodo.16242
- Schatz & Feldman (2018)
- Schatz, T. & Feldman, N. (2018). Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. Proceedings of the conference on cognitive computational neuroscience. 1-4.
- Elman & McClelland (2015)
- Elman, J. & McClelland, J. (2015). Exploiting the lawful variability in the speech wave. 71-90. Erlbaum.
- McClelland & Elman (1986)
- McClelland, J. & Elman, J. (1986). Interactive processes in speech perception: The TRACE model. Cognitive Psychology, 18. 1-86.
- Vallabha, McClelland, Pons, Werker & Amano (2007)
- Vallabha, G., McClelland, J., Pons, F., Werker, J. & Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33). 13273-13278. National Acad Sciences.
- Oord, Vinyals et al. (2017)
- Oord, A., Vinyals, O. et al. (2017). Neural discrete representation learning. Advances in neural information processing systems. 6306-6315.
- Varadarajan, Khudanpur & Dupoux (2008)
- Varadarajan, B., Khudanpur, S. & Dupoux, E. (2008). Unsupervised learning of acoustic sub-word units. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. 165-168. Association for Computational Linguistics.
- Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin (2017)
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. & Polosukhin, I. (2017). Attention is all you need. CoRR, abs/1706.03762. Retrieved from http://arxiv.org/abs/1706.03762
- Versteegh, Thiolliere, Schatz, Cao, Anguera, Jansen & Dupoux (2015)
- Versteegh, M., Thiolliere, R., Schatz, T., Cao, X., Anguera, X., Jansen, A. & Dupoux, E. (2015). The zero resource speech challenge 2015. Proc. of Interspeech.
- Versteegh, Anguera, Jansen & Dupoux (2016)
- Versteegh, M., Anguera, X., Jansen, A. & Dupoux, E. (2016). The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science: Proceedings of SLTU 2016, 81. 67-72.
- Versteegh, Thiollière, Schatz, Cao, Anguera, Jansen & Dupoux (2015)
- Versteegh, M., Thiollière, R., Schatz, T., Cao, X., Anguera, X., Jansen, A. & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. INTERSPEECH-16, 81. 67-72.
- Versteegh, Anguera, Jansen & Dupoux (2016)
- Versteegh, M., Anguera, X., Jansen, A. & Dupoux, E. (2016). The Zero Resource Speech Challenge 2015: Proposed approaches and results. Procedia Computer Science, 81. 67-72. Elsevier.
- Wang, Tang & Livescu (2020)
- Wang, W., Tang, Q. & Livescu, K. (2020). Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). 6889-6893. IEEE.
- Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019)
- Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. & Bowman, S. (2019). BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
- Werker & Tees (1984)
- Werker, J. & Tees, R. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1). 49-63. Elsevier.
- Wesker, Meyer, Wagener, Anemüller, Mertins & Kollmeier (2005)
- Wesker, T., Meyer, B., Wagener, K., Anemüller, J., Mertins, A. & Kollmeier, B. (2005). Oldenburg logatome speech corpus (OLLO) for speech recognition experiments with humans and machines. Ninth European conference on speech communication and technology.
- Wilcox, Levy, Morita & Futrell (2018)
- Wilcox, E., Levy, R., Morita, T. & Futrell, R. (2018). What do RNN language models learn about filler–gap dependencies?
- Wilcox, Levy, Morita & Futrell (2018)
- Wilcox, E., Levy, R., Morita, T. & Futrell, R. (2018). What do RNN language models learn about filler-gap dependencies? arXiv preprint 1809.00042.
- Gauthier, Besacier, Voisin, Melese & Elingui (2016)
- Gauthier, E., Besacier, L., Voisin, S., Melese, M. & Elingui, U. (2016). Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. LREC.
- Vries, Davel, Badenhorst, Basson, Wet, Barnard & Waal (2014)
- Vries, N., Davel, M., Badenhorst, J., Basson, W., Wet, F., Barnard, E. & Waal, A. (2014). A smartphone-based ASR data collection tool for under-resourced languages. Speech Communication, 56. 119-131.
- Xu & Tenenbaum (2007)
- Xu, F. & Tenenbaum, J. (2007). Word learning as Bayesian inference. Psychological Review, 114(2). 245-272. American Psychological Association.
- Yang & Powers (2006)
- Yang, D. & Powers, D. (2006). Verb similarity on the taxonomy of WordNet. Masaryk University.
- Yang, Dai, Yang, Carbonell, Salakhutdinov & Le (2019)
- Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. & Le, Q. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Retrieved from https://arxiv.org/abs/1906.08237
- Yu & Ballard (2004)
- Yu, C. & Ballard, D. (2004). A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception, 1. 57-80.
- Yuan, Leung, Xie, Chen, Ma & Li ()
- Yuan, Y., Leung, C., Xie, L., Chen, H., Ma, B. & Li, H. (). Extracting bottleneck features and word-like pairs from untranscribed speech for feature representations. Submitted to ASRU 2017.
- Zhang & Glass (2010)
- Zhang, Y. & Glass, J. (2010). Towards multi-speaker unsupervised speech pattern discovery. Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. 4366-4369.
- Zhou, Xu & Corso (2018)
- Zhou, L., Xu, C. & Corso, J. (2018). Towards automatic learning of procedures from web instructional videos. Proceedings of the AAAI conference on artificial intelligence, 32.
- Gauthier, Besacier, Voisin, Melese & Elingui (2016)
- Gauthier, E., Besacier, L., Voisin, S., Melese, M. & Elingui, U. (2016). Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. 10th Language Resources and Evaluation Conference (LREC 2016). Retrieved from https://hal.archives-ouvertes.fr/hal-01350037
- Jia, Weiss, Biadsy, Macherey, Johnson, Chen & Wu (2019)
- Jia, Y., Weiss, R., Biadsy, F., Macherey, W., Johnson, M., Chen, Z. & Wu, Y. (2019). Direct speech-to-speech translation with a sequence-to-sequence model. arXiv preprint arXiv:1904.06037.
- Lee, Chen, Wang, Gu, Ma, Polyak, Adi, He, Tang, Pino et al. (2021)
- Lee, A., Chen, P., Wang, C., Gu, J., Ma, X., Polyak, A., Adi, Y., He, Q., Tang, Y., Pino, J. et al. (2021). Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604.
- Tjandra, Sakti & Nakamura (2020)
- Tjandra, A., Sakti, S. & Nakamura, S. (2020). Transformer VQ-VAE for unsupervised unit discovery and speech synthesis: ZeroSpeech 2020 challenge. arXiv preprint arXiv:2005.11676.
- Alishahi, Chrupała, Cristia, Dupoux, Higy, Lavechin, Räsänen & Yu (2021)
- Alishahi, A., Chrupała, G., Cristia, A., Dupoux, E., Higy, B., Lavechin, M., Räsänen, O. & Yu, C. (2021). ZR-2021VG: Zero-resource speech challenge, visually-grounded language modelling track. arXiv preprint arXiv:2107.06546.
- Maekaku, Chang, Fujita, Chen, Watanabe & Rudnicky (2021)
- Maekaku, T., Chang, X., Fujita, Y., Chen, L., Watanabe, S. & Rudnicky, A. (2021). Speech representation learning combining Conformer CPC with deep cluster for the ZeroSpeech challenge 2021. arXiv preprint arXiv:2107.05899.
- Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021)
- Chorowski, J., Ciesielski, G., Dzikowski, J., Łańcucki, A., Marxer, R., Opala, M., Pusz, P., Rychlikowski, P. & Stypułkowski, M. (2021). Information retrieval for ZeroSpeech 2021: The submission by University of Wroclaw. arXiv preprint arXiv:2106.11603.
- Niekerk, Nortje, Baas & Kamper (2021)
- Niekerk, B., Nortje, L., Baas, M. & Kamper, H. (2021). Analyzing speaker information in self-supervised models to improve zero-resource speech processing. arXiv preprint arXiv:2108.00917.
- Tjandra, Sakti & Nakamura (2019)
- Tjandra, A., Sakti, S. & Nakamura, S. (2019). Speech-to-speech translation between untranscribed unknown languages. 2019 IEEE automatic speech recognition and understanding workshop (ASRU). 593-600. IEEE.
- Jia, Ramanovich, Remez & Pomerantz (2021)
- Jia, Y., Ramanovich, M., Remez, T. & Pomerantz, R. (2021). Translatotron 2: Robust direct speech-to-speech translation. arXiv preprint arXiv:2107.08661.
- Lee, Gong, Duquenne, Schwenk, Chen, Wang, Popuri, Pino, Gu & Hsu (2021)
- Lee, A., Gong, H., Duquenne, P., Schwenk, H., Chen, P., Wang, C., Popuri, S., Pino, J., Gu, J. & Hsu, W. (2021). Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352.
- Ostendorf, Price & Shattuck-Hufnagel (1995)
- Ostendorf, M., Price, P. & Shattuck-Hufnagel, S. (1995). The Boston University radio news corpus. Linguistic Data Consortium. 1-19.
- Algayres, Ricoul, Karadayi, Mohammed, Sagot & Dupoux (2022)
- Algayres, R., Ricoul, T., Karadayi, J., Mohammed, A., Sagot, B. & Dupoux, E. (2022). DP-PARSE: Finding word boundaries from raw speech with a token lexicon.
- Nguyen, Sagot & Dupoux (2022)
- Nguyen, T., Sagot, B. & Dupoux, E. (2022). Are discrete units necessary for spoken language modeling?
- De Saussure (1916)
- De Saussure, F. (1916). Course in general linguistics. McGraw-Hill Book Company, New York-Toronto-London.
- Seyssel, Lavechin, Titeux, Thomas, Virlet, Santos Revilla, Wisniewski, Ludusan & Dupoux (2023)
- Seyssel, M., Lavechin, M., Titeux, H., Thomas, A., Virlet, G., Santos Revilla, A., Wisniewski, G., Ludusan, B. & Dupoux, E. (2023). ProsAudit, a prosodic benchmark for self-supervised speech models.