Bibliography

Dunbar, Hamilakis & Dupoux (2022)
, & (). Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge series. IEEE Journal of Selected Topics in Signal Processing, 16(6). 1211-1226. Retrieved from https://arxiv.org/abs/2005.12656
Hallap, Dupoux & Dunbar (2022)
, & (). Evaluating context-invariance in unsupervised speech representations. arXiv preprint arXiv:2210.15775.
Goldwater, Griffiths & Johnson (2009)
, & (). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1). 21-54.
Agirre, Alfonseca, Hall, Kravalova, Pasca & Soroa (2009)
, , , , & (). A study on similarity and relatedness using distributional and wordnet-based approaches.
Al-Rfou, Choe, Constant, Guo & Jones (2018)
, , , & (). Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.
Allen & Seidenberg (1999)
& (). The emergence of grammaticality in connectionist networks. The emergence of language. 115-151.
Ansari, Kumar, Singh, Ganapathy & Devi ()
, , , & (). Unsupervised HMM posteriograms for language independent acoustic modeling in zero resource conditions. Submitted to ASRU 2017.
Chaudhuri, Roth, Ellis, Gallagher, Kaver, Marvin, Pantofaru, Reale, Reid, Wilson & Xi (2018)
, , , , , , , , , & (). AVA-speech: A densely labeled dataset of speech activity in movies. Proceedings of interspeech, 2018. Retrieved from https://arxiv.org/pdf/1808.00606
Baevski, Auli & Mohamed (2019)
, & (). Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912.
Baevski, Zhou, Mohamed & Auli (2020)
, , & (). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
Baker, Reichart & Korhonen (2014)
, & (). An unsupervised model for instance level subcategorization acquisition. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 278-289.
Bérard, Pietquin, Servan & Besacier (2016)
, , & (). Listen and translate: A proof of concept for end-to-end speech-to-text translation. NIPS workshop on end-to-end learning for speech and audio processing.
Best (1995)
(). A direct realist perspective on cross-language speech perception. Speech perception and linguistic experience: Issues in cross-language research. 167-200. York Press.
Bavin (2009)
Bavin, E. (). The Cambridge handbook of child language. Cambridge University Press. Retrieved from http://site.ebrary.com/id/10303044
Dunbar, Bernard, Hamilakis, Nguyen, Seyssel, Rozé, Rivière, Kharitonov & Dupoux (2021)
, , , , , , , & (). The zero resource speech challenge 2021: Spoken language modelling. Interspeech 2021-conference of the international speech communication association.
Kohonen (1988)
(). The ’neural’ phonetic typewriter. Computer, 21(3). 11-22.
Adda, Stüker, Adda-Decker, Ambouroue, Besacier, Blachon, Bonneau-Maynard, Godard, Hamlaoui, Idiatov, Kouarata, Lamel, Makasso, Rialland, Van de Velde, Yvon & Zerbian (2016)
, , , , , , , , , , , , , , , & (). Breaking the unwritten language barrier: The Bulb project. Proceedings of SLTU (spoken language technologies for under-resourced languages).
Akaike (1974)
(). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6). 716-723. IEEE.
Alishahi, Barking & Chrupała (2017)
, & (). Encoding of phonology in a recurrent neural model of grounded speech. Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 368-378.
Ansari, Singh, Kumar & Ganapathy ()
, , & (). Deep learning methods for unsupervised acoustic modeling: LEAP submission to ZeroSpeech challenge 2017. Submitted to ASRU 2017.
Ansari, Kumar, Singh & Ganapathy (2017)
, , & (). Deep learning methods for unsupervised acoustic modeling—leap submission to zerospeech challenge 2017. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 754-761. IEEE.
Jansen & Van Durme (2011)
& (). Efficient spoken term discovery using randomized algorithms. Automatic speech recognition and understanding (ASRU), 2011 IEEE workshop on. 401-406. IEEE.
Badino, Canevari, Fadiga & Metta (2014)
, , & (). An Auto-encoder based approach to unsupervised learning of subword units. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Baevski, Schneider & Auli (2020)
, & (). Vq-wav2vec: Self-supervised learning of discrete speech representations. International conference on learning representations. Retrieved from https://openreview.net/forum?id=rylwJxrYDS
Hannun, Case, Casper, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates et al. (2014)
, , , , , , , , , & (). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Bengio, Ducharme, Vincent & Jauvin (2003)
, , & (). A neural probabilistic language model. JMLR.
Besacier, Zhou & Gao (2006)
, & (). Towards speech translation of non written languages. Spoken Language Technology Workshop, 2006. IEEE. 222-225.
Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019)
, , , , , & (). BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
Peng & Harwath (2022)
& (). Self-supervised representation learning for speech using visual grounding and masked language modeling. arXiv preprint arXiv:2202.03543.
Bruni, Boleda, Baroni & Tran (2012)
, , & (). Distributional semantics in technicolor. Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: Long papers). 136-145.
Chalnick & Billman (1988)
& (). Unsupervised learning of correlational structure. Proceedings of the tenth annual conference of the cognitive science society. 510-516. Lawrence Erlbaum Associates.
Chen, Leung, Xie, Ma & Li (2015)
, , , & (). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. INTERSPEECH.
Chrupała (2021)
(). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Retrieved from https://arxiv.org/abs/2104.13225
Chung & Glass (2018)
& (). Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
Chung, Hsu, Tang & Glass (2019)
, , & (). An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240.
Chung, Hsu, Tang & Glass (2019)
, , & (). An unsupervised autoregressive model for speech representation learning. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 146-150.
Keuleers & Brysbaert (2010)
& (). Wuggy: A multilingual pseudoword generator. Behavior research methods, 42(3). 627-633. Springer.
Kharitonov, Lee, Polyak, Adi, Copet, Lakhotia, Nguyen, Rivière, Mohamed, Dupoux et al. (2021)
, , , , , , , , , & (). Text-free prosody-aware generative spoken language modeling. arXiv preprint arXiv:2109.03264.
Heck, Sakti & Nakamura (2016)
, & (). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. Procedia Computer Science, 81. 73-79. Elsevier.
Srivastava & Shrivastava (2016)
& (). Articulatory gesture rich representation learning of phonological units in low resource settings. International conference on statistical language and speech processing. 80-95. Springer.
Heck, Sakti & Nakamura (2017)
, & (). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 740-746. IEEE.
Shibata, Kato, Shinozaki & Watanabe (2017)
, , & (). Composite embedding systems for ZeroSpeech2017 Track1. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 747-753. IEEE.
Chorowski, Weiss, Bengio & Van Den Oord (2019)
, , & (). Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing, 27(12). 2041-2053. IEEE.
Kamper, Livescu & Goldwater (2017)
, & (). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 719-726. IEEE.
Hsu, Harwath & Glass (2019)
, & (). Transfer learning from audio-visual grounding to speech recognition. arXiv preprint arXiv:1907.04355.
Chung & Glass (2019)
& (). Generative pre-training for speech with autoregressive predictive coding. arXiv preprint arXiv:1910.12607.
Millet, Chitoran & Dunbar (2021)
, & (). Predicting non-native speech perception using the perceptual assimilation model and state-of-the-art acoustic models. Proceedings of the 25th conference on computational natural language learning. 661-673.
Warstadt, Singh & Bowman (2018)
, & (). Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Dai, Yang, Yang, Cohen, Carbonell, Le & Salakhutdinov (2019)
, , , , , & (). Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Räsänen, Doyle & Frank (2015)
, & (). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. Sixteenth annual conference of the international speech communication association.
Räsänen & Blandón (2020)
& (). Unsupervised discovery of recurring speech patterns using probabilistic adaptive metrics. arXiv preprint arXiv:2008.00731.
Prakash, Kumar, Murthy et al. (2020)
, , & (). Exploration of end-to-end synthesisers for zero resource speech challenge 2020. arXiv preprint arXiv:2009.04983.
Davis & Mermelstein (1980)
& (). Comparison of parametric representation for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4). 357-366.
Lee & Glass (2012)
& (). A nonparametric Bayesian approach to acoustic model discovery. ACL (1). 40-49. The Association for Computer Linguistics.
Hsu, Hwang, Wu, Tsao & Wang (2016)
, , , & (). Voice conversion from non-parallel corpora using variational auto-encoder. Asia-pacific signal and information processing association annual summit and conference, APSIPA 2016, jeju, south korea, december 13-16, 2016. 1-6.
Tjandra, Sakti & Nakamura (2017)
, & (). Listening while speaking: Speech chain by deep learning. ASRU 2017. 301-308.
Badino, Canevari, Fadiga & Metta (2014)
, , & (). An auto-encoder based approach to unsupervised learning of subword units. ICASSP. 7634-7638. IEEE.
Gao, Singh & Raj (2018)
, & (). Voice impersonation using generative adversarial networks. ICASSP. 2506-2510. IEEE.
Jansen, Thomas & Hermansky (2013)
, & (). Weak top-down constraints for unsupervised acoustic model training. ICASSP. 8091-8095. IEEE.
Eloff, Nortje, Niekerk, Govender, Nortje, Pretorius, Van Biljon, Westhuizen, Staden & Kamper (2019)
, , , , , , , , & (). Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. arXiv preprint arXiv:1904.07556.
Yusuf, Gök, Gündogdu, Kose & Saraclar (2019)
, , , & (). Temporally-aware acoustic unit discovery for zerospeech 2019 challenge. INTERSPEECH. 1098-1102.
Liu, Hsu & Lee (2019)
, & (). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. arXiv preprint arXiv:1905.11563.
Nayak, Kumar, Ramesh, Bhati & Murty (2019)
, , , & (). Virtual phone discovery for speech synthesis without text. 2019 IEEE global conference on signal and information processing (GlobalSIP). 1-5. IEEE.
Muthukumar & Black (2014)
& (). Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. IEEE international conference on acoustics, speech and signal processing, ICASSP 2014, florence, italy, may 4-9, 2014. 2594-2598.
Scharenborg, Besacier, Black, Hasegawa-Johnson, Metze, Neubig, Stüker, Godard, Müller, Ondel, Palaskar, Arthur, Ciannella, Du, Larsen, Merkx, Riad, Wang & Dupoux (2018)
, , , , , , , , , , , , , , , , , & (). Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "speaking rosetta" JSALT 2017 workshop. ICASSP. 4979-4983. IEEE.
Shen, Pang, Weiss, Schuster, Jaitly, Yang, Chen, Zhang, Wang, Ryan, Saurous, Agiomyrgiannakis & Wu (2018)
, , , , , , , , , , , & (). Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions. ICASSP. 4779-4783. IEEE.
Heck, Sakti & Nakamura (2016)
, & (). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. SLTU-2016, 5th workshop on spoken language technologies for under-resourced languages, 9-12 may 2016, yogyakarta, indonesia. 73-79.
Ondel, Burget & Cernocký (2016)
, & (). Variational inference for acoustic unit discovery. SLTU, 81. 80-86. Elsevier.
Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior & Kavukcuoglu (2016)
, , , , , , , & (). WaveNet: A generative model for raw audio. SSW. 125. ISCA.
Wu, Watts & King (2016)
, & (). Merlin: An open source neural network speech synthesis system. Speech Synthesis Workshop. 202-207. ISCA.
Ping, Peng, Gibiansky, Arik, Kannan, Narang, Raiman & Miller (2017)
, , , , , , & (). Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654.
Kaneko & Kameoka (2017)
& (). Parallel-data-free voice conversion using cycle-consistent adversarial networks. CoRR, abs/1711.11293. Retrieved from https://arxiv.org/abs/1711.11293
Chou, Yeh, Lee & Lee (2018)
, , & (). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. CoRR, abs/1804.02812. Retrieved from https://arxiv.org/abs/1804.02812
Li, Liu, Liu, Zhao, Liu & Zhou (2018)
, , , , & (). Close to human quality TTS with transformer. CoRR, abs/1809.08895. Retrieved from https://arxiv.org/abs/1809.08895
Mehri, Kumar, Gulrajani, Kumar, Jain, Sotelo, Courville & Bengio (2016)
, , , , , , & (). SampleRNN: An unconditional end-to-end neural audio generation model. CoRR, abs/1612.07837. Retrieved from https://arxiv.org/abs/1612.07837
Taigman, Wolf, Polyak & Nachmani (2017)
, , & (). Voice synthesis for in-the-wild speakers via a phonological loop. CoRR, abs/1707.06588.
Dillon, Dunbar & Idsardi (2013)
, & (). A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37(2). 344-377. Wiley Online Library.
DeCarlo (1998)
(). Signal detection theory and generalized linear models. Psychological Methods, 3(2). 186. American Psychological Association.
Deng, Dong, Socher, Li, Li & Fei-Fei (2009)
, , , , & (). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248-255.
Devlin, Chang, Lee & Toutanova (2019)
, , & (). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
Driesen & Van hamme (2011)
& (). Modeling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA. Neurocomputing, 74. 1874-1882.
Dunbar, Cao, Benjumea, Karadayi, Bernard, Besacier, Anguera & Dupoux (2017)
, , , , , , & (). The Zero Resource Speech Challenge 2017. 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). 323-330. IEEE. Retrieved from https://arxiv.org/abs/1712.04313
Algayres, Zaiem, Sagot & Dupoux (2020)
, , & (). Evaluating the reliability of acoustic speech embeddings. arXiv preprint arXiv:2007.13542.
Riad, Dancette, Karadayi, Zeghidour, Schatz & Dupoux (2018)
, , , , & (). Sampling strategies in siamese networks for unsupervised speech representation learning. arXiv preprint arXiv:1804.11297.
Dunbar, Algayres, Karadayi, Bernard, Benjumea, Cao, Miskic, Dugrain, Ondel, Black et al. (2019)
, , , , , , , , , & (). The zero resource speech challenge 2019: TTS without T. INTERSPEECH. Retrieved from https://arxiv.org/abs/1904.11469
Dunbar, Karadayi, Bernard, Cao, Algayres, Ondel, Besacier, Sakriani & Dupoux (2020)
, , , , , , , & (). The zero resource speech challenge 2020: Discovering discrete subword and word units. INTERSPEECH.
Duong, Anastasopoulos, Chiang, Bird & Cohn (2016)
, , , & (). An attentional model for speech translation without transcription. Proceedings of NAACL-HLT. 949-959.
Dupoux (2018)
(). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173. 43-59. Elsevier. Retrieved from https://arxiv.org/abs/1607.08723
Peters, Neumann, Iyyer, Gardner, Clark, Lee & Zettlemoyer (2018)
, , , , , & (). Deep contextualized word representations. NAACL.
Faruqui, Tsvetkov, Rastogi & Dyer (2016)
, , & (). Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.
Feigenbaum (1963)
(). The simulation of verbal learning behavior. Computers and thought. McGraw-Hill.
Feldman & Griffiths (2007)
& (). A rational account of the perceptual magnet effect. Proceedings of the annual meeting of the cognitive science society, 29.
Feldman, Griffiths, Goldwater & Morgan (2013)
, , & (). A role for the developing lexicon in phonetic category acquisition. Psychological review, 120(4). 751-778. American Psychological Association.
Feng, Lee & Peng (2019)
, & (). Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1906.07234
Cieri, Miller & Walker (2004)
, & (). The fisher corpus: A resource for the next generations of speech-to-text. LREC.
Frome, Corrado, Shlens, Bengio, Dean, Ranzato & Mikolov (2013)
, , , , , & (). DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems (NIPS 2013). 2121-2129.
Futrell, Wilcox, Morita, Qian, Ballesteros & Levy (2019)
, , , , & (). Neural language models as psycholinguistic subjects: Representations of syntactic state.
Futrell, Wilcox, Morita & Levy (2018)
, , & (). RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint arXiv:1809.01329.
Gage (1994)
(). A new algorithm for data compression. C Users Journal, 12(2). 23-38. McPherson, KS: R & D Publications, c1987-1994.
García-Granada, Sanchis, Castro-Bleda, González & Hurtado ()
, , , & (). ZeroSpeech2017 ELIRF-UPV system. Submitted to ASRU 2017.
Gerz, Vulić, Hill, Reichart & Korhonen (2016)
, , , & (). Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
Glass (2012)
(). Towards unsupervised speech processing. Information Science, Signal Processing and their Applications (ISSPA), 2012 11th International Conference on. 1-4. IEEE.
Myrman & Salvi (2017)
& (). Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling. International Workshop on Grounding Language Understanding (GLU). ISCA.
Godais, Linzen & Dupoux (2017)
, & (). Comparing character-level neural language models using a lexical decision task. 125-130.
Godfrey, Holliman & McDaniel (1992)
, & (). SWITCHBOARD: Telephone speech corpus for research and development. [Proceedings] ICASSP-92: 1992 IEEE international conference on acoustics, speech, and signal processing, 1. 517-520. IEEE.
Goldberg (2019)
(). Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
Goldwater, Griffiths & Johnson (2009)
, & (). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112. 21-54. Elsevier.
Guenther & Gjaja (1996)
& (). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America, 100(2). 1111-1121. Acoustical Society of America.
Gulordava, Bojanowski, Grave, Linzen & Baroni (2018)
, , , & (). Colorless green recurrent networks dream hierarchically. Retrieved from https://www.aclweb.org/anthology/N18-1108
Hahn & Baroni (2019)
& (). Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text. Transactions of the Association for Computational Linguistics (Accepted). Retrieved from https://arxiv.org/abs/1906.07285
Halawi, Dror, Gabrilovich & Koren (2012)
, , & (). Large-scale learning of word relatedness with constraints. Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. 1406-1414.
Harwath & Glass (2015)
& (). Deep multimodal semantic embeddings for speech and images. 2015 IEEE workshop on automatic speech recognition and understanding (ASRU). 237-244. IEEE.
Harwath, Torralba & Glass (2016)
, & (). Unsupervised learning of spoken language with visual context. Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS 2016). 1858-1866.
Harwath, Hsu & Glass (2019)
, & (). Learning hierarchical discrete linguistic units from visually-grounded speech. arXiv preprint arXiv:1911.09602.
Tiede, Espy-Wilson, Goldenberg, Mitra, Nam & Sivaraman (2017)
, , , , & (). Quantifying kinematic aspects of reduction in a contrasting rate production task. The Journal of the Acoustical Society of America, 141(5). 3580-3580. Retrieved from https://doi.org/10.1121/1.4987629
Hastie, Tibshirani & Friedman (2009)
, & (). The elements of statistical learning – data mining, inference, and prediction. Springer.
Havard, Besacier & Rosec (2017)
, & (). SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set. Proc. GLU 2017 international workshop on grounding language understanding. 42-46. Retrieved from http://dx.doi.org/10.21437/GLU.2017-9
Arandjelovic & Zisserman (2017)
& (). Look, listen and learn. Proceedings of the IEEE international conference on computer vision. 609-617.
Chrupała, Gelderloos & Alishahi (2017)
, & (). Representations of language in a model of visually grounded speech signal. arXiv preprint arXiv:1702.01991.
Chrupała, Gelderloos & Alishahi (2017)
, & (). Representations of language in a model of visually grounded speech signal. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 613-622.
Jansen, Dupoux, Goldwater, Johnson, Khudanpur, Church, Feldman, Hermansky, Metze, Rose, Seltzer, Clark, McGraw, Varadarajan, Bennett, Borschinger, Chiu, Dunbar, Fourtassi, Harwath, Lee, Levin, Norouzian, Peddinti, Richardson, Schatz & Thomas (2013)
, , , , , , , , , , , , , , , , , , , , , , , , , & (). A summary of the 2012 JH CLSP Workshop on zero resource speech technologies and models of early language acquisition. Proceedings of ICASSP 2013.
Elsner, Goldwater & Eisenstein (2012)
, & (). Bootstrapping a unified model of lexical and phonetic acquisition. Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: Long papers). 184-193.
Bostrom & Durrett (2020)
& (). Byte pair encoding is suboptimal for language model pretraining. Retrieved from https://arxiv.org/abs/2004.03720
Fer, Matejka, Grezl, Plchot, Vesely & Cernocky (2017)
, , , , & (). Multilingually trained bottleneck features in spoken language recognition. Computer Speech and Language, 46(Supplement C). 252-267.
Yusuf, Gok, Gundogdu, Kose & Saraclar (2019)
, , , & (). Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge. INTERSPEECH 2019.
Pitt, Dilley, Johnson, Kiesling, Raymond, Hume & Fosler-Lussier (2007)
, , , , , & (). Buckeye corpus of conversational speech (2nd release). www.buckeyecorpus.osu.edu; Columbus, OH: Department of Psychology, Ohio State University (Distributor).
Barnard (2014)
(). The NCHLT speech corpus of the South African languages. https://sites.google.com/site/nchltspeechcorpus/home; 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. Retrieved from http://hdl.handle.net/10204/7549
Chen, Leung, Xie, Ma & Li ()
, , , & (). Multilingual bottle-neck feature learning from untranscribed speech. Submitted to ASRU 2017.
Cho, Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk & Bengio (2014)
, , , , , & (). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724-1734. Association for Computational Linguistics.
Chrupała (2019)
(). Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6452-6462. Association for Computational Linguistics.
Badino, Mereta & Rosasco (2015)
, & (). Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders. Sixteenth annual conference of the international speech communication association.
Chen, Leung, Xie, Ma & Li (2015)
, , , & (). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. Sixteenth annual conference of the international speech communication association.
Myrman & Salvi (2017)
& (). Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling. International workshop on grounding language understanding (GLU). ISCA.
Renshaw, Kamper, Jansen & Goldwater (2015)
, , & (). A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. Sixteenth annual conference of the international speech communication association.
Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015)
, , , & (). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. Sixteenth annual conference of the international speech communication association.
Zeghidour, Synnaeve, Versteegh & Dupoux (2016)
, , & (). A deep scattering spectrum—deep siamese network pipeline for unsupervised acoustic modeling. 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). 4965-4969. IEEE.
Chen, Leung, Xie, Ma & Li (2017)
, , , & (). Multilingual bottle-neck feature learning from untranscribed speech. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 727-733. IEEE.
Pellegrini, Manenti & Pinquier (2017)
, & (). Technical report the IRIT-UPS system@ ZeroSpeech 2017 Track1: Unsupervised subword modeling. Tech. rep., IRIT, Université de Toulouse.
Kharitonov, Rivière, Synnaeve, Wolf, Mazaré, Douze & Dupoux (2021)
, , , , , & (). Data augmenting contrastive learning of speech representations in the time domain. 2021 IEEE spoken language technology workshop (SLT). 215-222. IEEE.
Jansen & Van Durme (2011)
& (). Efficient spoken term discovery using randomized algorithms. 2011 IEEE workshop on automatic speech recognition & understanding. 401-406. IEEE.
Seshadri, Remes, Räsänen et al. (2017)
, , & (). Comparison of non-parametric Bayesian mixture models for syllable clustering and zero-resource speech processing. INTERSPEECH 2017. ISCA.
Lyzinski, Sell & Jansen (2015)
, & (). An evaluation of graph clustering methods for unsupervised term discovery. Sixteenth annual conference of the international speech communication association.
Lakhotia, Kharitonov, Hsu, Adi, Polyak, Bolte, Nguyen, Copet, Baevski, Mohamed et al. (2021)
, , , , , , , , , & (). On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9. 1336-1354. MIT Press.
Millet & Dunbar (2020)
& (). The Perceptimatic English benchmark for speech perception models. CogSci Conference 2020.
Millet & Dunbar (2022)
& (). Do self-supervised speech models develop human-like perception biases?
Moore (2012)
(). An introduction to the psychology of hearing. Brill.
Weerts, Rosen, Clopath & Goodman (2021)
, , & (). The psychometrics of automatic speech recognition. bioRxiv. Cold Spring Harbor Laboratory.
Tsuji, Cristia & Dupoux (2021)
, & (). SCALa: A blueprint for computational models of language acquisition in social context. Cognition, 213. 104779. Elsevier.
Buerkin-Pontrelli, Culbertson, Legendre & Nazzi (2017)
, , & (). Competing models of liaison acquisition: Evidence from corpus and experimental data. Language, 93(1). 189-219. Linguistic Society of America.
Babineau, Legrand & Shi (2021)
, & (). Variable forms in French-learning toddlers’ lexical representations. Developmental Psychology. American Psychological Association.
Van Gijn & Zúñiga (2014)
& (). Word and the americanist perspective. Morphology, 24(3). 135-160. Springer.
Millet & Dunbar (2020)
& (). Perceptimatic: A human speech perception benchmark for unsupervised subword modelling. arXiv preprint arXiv:2010.05961.
Warstadt & Bowman (2019)
& (). Grammatical analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.
Pandia & Murthy (2020)
& (). Zero resource speech synthesis using transcripts derived from perceptual acoustic units. arXiv preprint arXiv:2006.04372.
Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021)
, , , , , , , & (). Aligned contrastive predictive coding. arXiv preprint arXiv:2104.11946.
Chrupała (2022)
(). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Journal of Artificial Intelligence Research, 73. 673-707.
Hsu, Bolte, Tsai, Lakhotia, Salakhutdinov & Mohamed (2021)
, , , , & (). Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29. 3451-3460. IEEE.
Gwilliams, Linzen, Poeppel & Marantz (2018)
, , & (). In spoken word recognition, the future predicts the past. Journal of Neuroscience, 38(35). 7585-7599. Soc Neuroscience.
Beekhuizen, Armstrong & Stevenson (2021)
, & (). Probing lexical ambiguity: Word vectors encode number and relatedness of senses. Cognitive Science, 45(5). e12943. Wiley Online Library.
Nikolaus, Alishahi & Chrupała (2022)
, & (). Learning English with Peppa Pig. arXiv preprint arXiv:2202.12917.
Havard, Chevrot & Besacier (2019)
, & (). Models of visually grounded speech signal pay attention to nouns: A bilingual experiment on English and Japanese. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019). 8618-8622.
Havard, Chevrot & Besacier (2019)
, & (). Word recognition, competition, and activation in a model of visually grounded speech. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL 2019). 339-348.
He, Zhang, Ren & Sun (2016)
, , & (). Deep Residual Learning for Image Recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. 770-778. IEEE.
Heck, Sakti & Nakamura ()
, & (). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017. Submitted to ASRU 2017.
Higy, Elliott & Chrupała (2020)
, & (). Textual Supervision for Visually Grounded Spoken Language Understanding. Findings of the Association for Computational Linguistics: EMNLP 2020. 2698-2709. Association for Computational Linguistics.
Hill (1983)
(). A computational model of language acquisition in the two-year old. Cognition and Brain Theory, 6. 287-317.
Hill, Reichart & Korhonen (2015)
, & (). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4). 665-695. MIT Press.
Hochreiter & Schmidhuber (1997)
& (). Long short-term memory. Neural computation, 9(8). 1735-1780. MIT Press.
Bin & Yuan (2019)
& (). A VAE model with speaker verification for unsupervised subword modeling: A submission to ZeroSpeech 2019. Submitted to INTERSPEECH 2019.
Hsu, Harwath, Song & Glass (2020)
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units. 34th Conference on Neural Information Processing Systems (NeurIPS) Workshop on Self-Supervised Learning for Speech and Audio Processing.
Huijbregts, McLaren & Leeuwen (2011)
Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4436-4439.
(2019)
INTERSPEECH 2019 – 20th Annual Conference of the International Speech Communication Association, September 15-19, Graz, Austria, Proceedings.
Jansen & Van Durme (2011)
Efficient spoken term discovery using randomized algorithms. Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. 401-406.
Jansen, Thomas & Hermansky (2013)
Weak top-down constraints for unsupervised acoustic model training. ICASSP. 8091-8095.
Johnson, Griffiths & Goldwater (2007)
Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. Advances in neural information processing systems, 19. 641-648. MIT Press.
Jürgens, Brand & Kollmeier (2007)
Modelling the human-machine gap in speech reception: Microscopic speech intelligibility prediction for normal-hearing subjects with an auditory model. Eighth annual conference of the international speech communication association.
Kahn, Riviere, Zheng, Kharitonov, Xu, Mazare, Karadayi, Liptchinsky, Collobert, Fuegen et al. (2020)
Libri-light: A benchmark for ASR with limited or no supervision. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. Retrieved from http://dx.doi.org/10.1109/ICASSP40776.2020.9052942
Kahn, Rivière, Zheng, Kharitonov, Xu, Mazaré, Karadayi, Liptchinsky, Collobert, Fuegen, Likhomanenko, Synnaeve, Joulin, Mohamed & Dupoux (2020)
Libri-light: A benchmark for ASR with limited or no supervision. INTERSPEECH. Retrieved from https://arxiv.org/abs/1912.07875
Kamper, Livescu & Goldwater (2017)
An embedded segmental k-means model for unsupervised segmentation and clustering of speech. ASRU 2017. Retrieved from https://arxiv.org/abs/1904.07556
Kamper, Shakhnarovich & Livescu (2019)
Semantic speech retrieval with a visually grounded model of untranscribed speech. IEEE/ACM Transactions on Audio, Speech and Language Processing, 27. 89-98.
Kamper, Elsner, Jansen & Goldwater (2015)
Unsupervised neural network based feature extraction using weak top-down constraints. Proceedings of ICASSP.
Karpathy & Li (2015)
Deep visual-semantic alignments for generating image descriptions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015). 3128-3137.
Kawakami, Wang, Dyer, Blunsom & Oord (2020)
Learning robust and multilingual speech representations. Retrieved from https://arxiv.org/abs/2001.11128
Kleinschmidt & Jaeger (2015)
Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2). 148-203. American Psychological Association.
Kuhl (1991)
Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Attention, Perception, & Psychophysics, 50(2). 93-107. Springer.
Lau, Clark & Lappin (2017)
Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science. 1202-1241.
Lee & Glass (2012)
A nonparametric Bayesian approach to acoustic model discovery. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. 40-49.
Chomsky (1957)
Syntactic structures. JSTOR.
Liberman, Cooper, Shankweiler & Studdert-Kennedy (1967)
Perception of the speech code. Psychological Review, 74(6). 431. American Psychological Association.
Fowler (1986)
An event approach to the study of speech perception from a direct–realist perspective. Journal of phonetics, 14(1). 3-28. Elsevier.
Baljekar, Sitaram, Muthukumar & Black (2015)
Using articulatory features and inferred phonological segments in zero resource speech processing. Sixteenth annual conference of the international speech communication association.
Morita & Koda (2020)
Exploring TTS without T using biologically/psychologically motivated neural network modules (ZeroSpeech 2020). arXiv preprint arXiv:2005.05487.
Chomsky & Halle (1968)
The sound pattern of English. Harper & Row.
Linzen, Dupoux & Goldberg (2016)
Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL.
Linzen & Leonard (2018)
Distinct patterns of syntactic agreement errors in recurrent networks and humans. arXiv preprint 1807.06882.
Lisker & Abramson (1964)
A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3). 384-422. Taylor & Francis.
Liu, Hsu & Lee (2019)
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1905.11563
Liu, Lowe, Serban, Noseworthy, Charlin & Pineau (2016)
How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer & Stoyanov (2019)
RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692. Retrieved from http://arxiv.org/abs/1907.11692
Bates, Mächler, Bolker & Walker (2015)
Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1). 1-48.
Ludusan, Versteegh, Jansen, Gravier, Cao, Johnson & Dupoux (2014)
Bridging the gap between speech technology and natural language processing: An evaluation toolbox for term discovery systems. Proceedings of LREC. 560-567.
Luong, Socher & Manning (2013)
Better word representations with recursive neural networks for morphology. Proceedings of the seventeenth conference on computational natural language learning. 104-113.
Versteegh & Thiolliere (2015)
ZeroSpeech term discovery evaluation toolkit. Retrieved from http://dx.doi.org/10.5281/zenodo.16330
Macmillan & Creelman (2004)
Detection theory: A user’s guide. Psychology Press.
Mahrt (2016)
LMEDS: Language markup and experimental design software.
Wang, Zhang & Zhang (2015)
THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
Manenti, Pellegrini & Pinquier (2017)
Unsupervised speech unit discovery using k-means and neural networks. International conference on statistical language and speech processing. 169-180. Springer.
Mangin, Filliat, Bosch & Oudeyer (2015)
MCA-NMF: Multimodal concept acquisition with non-negative matrix factorization. PLOS One.
Marvin & Linzen (2018)
Targeted syntactic evaluation of language models. Retrieved from https://www.aclweb.org/anthology/D18-1151
Matlock (2001)
How real is fictive motion? Psychology Department, University of California, Santa Cruz.
Melis, Dyer & Blunsom (2018)
On the state of the art of evaluation in neural language models. ICLR.
Merkx, Frank & Ernestus (2019)
Language Learning Using Speech to Image Retrieval. Proc. Interspeech 2019. 1841-1845.
Meyer, Wesker, Brand, Mertins & Kollmeier (2006)
A human-machine comparison in speech recognition based on a logatome corpus. Speech recognition and intrinsic variation workshop.
Meyer, Wächter, Brand & Kollmeier (2007)
Phoneme confusions in human and automatic speech recognition. Eighth annual conference of the international speech communication association.
Meyer, Jürgens, Wesker, Brand & Kollmeier (2010)
Human phoneme recognition depending on speech-intrinsic variability. The Journal of the Acoustical Society of America, 128(5). 3126-3141. Acoustical Society of America.
Miao, Gowayyed & Metze (2015)
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. Automatic speech recognition and understanding (ASRU), 2015 IEEE workshop on. 167-174. IEEE.
Miech, Zhukov, Alayrac, Tapaswi, Laptev & Sivic (2019)
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV.
Miller & Charles (1991)
Contextual correlates of semantic similarity. Language and cognitive processes, 6(1). 1-28. Taylor & Francis.
Millet, Jurov & Dunbar (2019)
Comparing unsupervised speech learning directly to human performance in speech perception. CogSci Conference 2019.
Muscariello, Gravier & Bimbot (2012)
Unsupervised Motif Acquisition in Speech via Seeded Discovery and Template Matching Combination. IEEE Transactions on Audio, Speech and Language Processing, 20(7). 2031-2044.
Gulordava, Bojanowski, Grave, Linzen & Baroni (2018)
Colorless green recurrent networks dream hierarchically. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 1195-1205. Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/N18-1108
Kwiatkowski, Palomaki, Redfield, Collins, Parikh, Alberti, Epstein, Polosukhin, Devlin, Lee et al. (2019)
Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7. 453-466. MIT Press.
Cuervo, Grabias, Chorowski, Ciesielski, Łańcucki, Rychlikowski & Marxer (2021)
Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words. arXiv preprint arXiv:2110.15909.
Iwamoto & Shinozaki (2021)
Unsupervised spoken term discovery using wav2vec 2.0. 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 1082-1086. IEEE.
Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021)
Segmental contrastive predictive coding for unsupervised word segmentation. arXiv preprint arXiv:2106.02170.
Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021)
Unsupervised speech segmentation and variable rate representation learning using segmental contrastive predictive coding. arXiv preprint arXiv:2110.02345.
Bhati, Villalba, Żelasko & Dehak (2020)
Self-expressing autoencoders for unsupervised spoken term discovery. arXiv preprint arXiv:2007.13033.
Borgholt, Havtorn, Edin, Maaløe & Igel (2022)
A brief overview of unsupervised neural speech representation learning.
Nayak, Kumar, Ramesh, Bhati & Murty (2019)
Virtual Phone Discovery for Speech Synthesis. Retrieved from https://doi.org/10.13140/RG.2.2.23356.08324
Tobing, Hayashi, Wu, Kobayashi & Toda (2020)
Cyclic spectral modeling for unsupervised unit discovery into voice conversion with excitation and waveform modeling. INTERSPEECH. 4861-4865.
Chen & Hain (2020)
Unsupervised acoustic unit representation learning for voice conversion using WaveNet auto-encoders. arXiv preprint arXiv:2008.06892.
Niekerk, Nortje & Kamper (2020)
Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge. arXiv preprint arXiv:2005.09409.
Yusuf, Ondel, Burget, Černockỳ & Saraclar (2021)
A hierarchical subspace model for language-attuned acoustic unit discovery. ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). 3710-3714. IEEE.
Gündogdu, Yusuf, Yesilbursa & Saraclar (2020)
Vector quantized temporally-aware correspondence sparse autoencoders for zero-resource acoustic unit discovery. INTERSPEECH. 4846-4850.
Newell & Simon (1972)
Human problem solving. Prentice-Hall.
Nguyen, Seyssel, Rozé, Rivière, Kharitonov, Baevski, Dunbar & Dupoux (2020)
The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. arXiv preprint arXiv:2011.11588.
Jurov (2019)
Phonetics or Phonology? Modelling Non-Native Perception. Université Paris Diderot.
Ondel, Godard, Besacier, Larsen, Hasegawa-Johnson, Scharenborg, Dupoux, Burget, Yvon & Khudanpur (2018)
Bayesian models for unit discovery on a very low resource language. 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). 5939-5943. IEEE.
Oord, Li & Vinyals (2018)
Representation learning with contrastive predictive coding. CoRR, abs/1807.03748. Retrieved from http://arxiv.org/abs/1807.03748
Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier & Auli (2019)
Fairseq: A fast, extensible toolkit for sequence modeling. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (demonstrations). 48-53. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N19-4009
Panayotov, Chen, Povey & Khudanpur (2015)
Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). 5206-5210. IEEE.
Pandia & Murthy (2019)
Zero Resource Speech Synthesis Using Transcripts Derived from Perceptual Acoustic Units. INTERSPEECH 2019.
Park & Glass (2008)
Unsupervised Pattern Discovery in Speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1). 186-197.
Parrot, Millet & Dunbar (2019)
Independent and automatic evaluation of acoustic-to-articulatory inversion models. arXiv. arXiv-1911.
Pauls & Klein (2012)
Large-scale syntactic language modeling with treelets.
Chang & Fisher III (2013)
Parallel sampling of DP mixture models using sub-cluster splits. Advances in Neural Information Processing Systems. 620-628.
Pellegrini, Manenti & Pinquier ()
Unsupervised discovery of sub-lexical units in speech based on ZCA and k-means. Submitted to ASRU 2017.
Peperkamp (2015)
Phonology versus phonetics in loanword adaptations. 71-90. John Benjamins Publishing Company.
Phillips, Wagers & Lau (2011)
Grammatical illusions and selective fallibility in real-time language comprehension. Experiments at the Interfaces, 37. 147-180. Brill.
Pintér & Watanabe (2016)
Do GMM phoneme classifiers perceive synthetic sibilants as humans do? INTERSPEECH. 1363-1367.
Pitt, Johnson, Hume, Kiesling & Raymond (2005)
The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45(1). 89-95. Elsevier.
Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz, Silovsky, Stemmer & Vesely (2011)
The Kaldi speech recognition toolkit. IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.
R Core Team (2017)
R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Rabiner (1989)
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2). 257-286.
Radford, Wu, Child, Luan, Amodei & Sutskever (2019)
Language models are unsupervised multitask learners.
Radinsky, Agichtein, Gabrilovich & Markovitch (2011)
A word at a time: Computing word relatedness using temporal semantic analysis. Proceedings of the 20th international conference on World Wide Web. 337-346.
Räsänen & Rasilo (2015)
A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122. 792-829.
Ravfogel, Tyers & Goldberg (2018)
Can LSTM learn to capture agreement? The case of Basque. arXiv preprint 1809.04022.
Kamper, Jansen & Goldwater (2017)
A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46. 154-174. Elsevier.
Kamper (2022)
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring. arXiv preprint arXiv:2202.11929.
Renshaw, Kamper, Jansen & Goldwater (2015)
A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge. Sixteenth annual conference of the international speech communication association.
Dupoux (2018)
Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173. 43-59. Elsevier.
Riochet, Castro, Bernard, Lerer, Fergus, Izard & Dupoux (2018)
IntPhys: A Framework and Benchmark for Visual Intuitive Physics Reasoning. arXiv preprint arXiv:1803.07616.
Rivière, Joulin, Mazaré & Dupoux (2020)
Unsupervised pretraining transfers well across languages. Retrieved from https://arxiv.org/abs/2002.02848
Roy & Pentland (2002)
Learning words from sights and sounds: A computational model. Cognitive Science, 26. 113-146.
Rubenstein & Goodenough (1965)
Contextual correlates of synonymy. Communications of the ACM, 8(10). 627-633. ACM New York, NY, USA.
Tjandra, Sisman, Zhang, Sakti, Li & Nakamura (2019)
VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1905.11449
Sakti, Kelana, Riza, Sakai, Markov & Nakamura (2008)
Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project. Proceedings of the workshop on technologies and corpora for asia-pacific speech translation (TCAST).
Sakti, Maia, Sakai, Shimizu & Nakamura (2008)
Development of HMM-based Indonesian speech synthesis. Proc. Oriental COCOSDA. 215-219.
Salazar, Liang, Nguyen & Kirchhoff (2020)
Masked language model scoring. Proceedings of the 58th annual meeting of the association for computational linguistics. 2699-2712. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.acl-main.240
Sanabria, Caglayan, Palaskar, Elliott, Barrault, Specia & Metze (2018)
How2: A large-scale dataset for multimodal language understanding. Proceedings of the workshop on visually grounded interaction and language (ViGIL). NeurIPS. Retrieved from http://arxiv.org/abs/1811.00347
Scharenborg (2007)
Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication, 49(5). 336-347. Elsevier.
Scharenborg, Tiesmeyer, Hasegawa-Johnson & Dehak (2018)
Visualizing phoneme category adaptation in deep neural networks. INTERSPEECH. 1482-1486.
Scharenborg, Gouw, Larson & Marchiori (2019)
The representation of speech in deep neural networks. International conference on multimedia modeling. 194-205. Springer.
Scharenborg (2019)
The representation of speech and its processing in the human brain and deep neural networks. International conference on speech and computer. 1-8. Springer.
Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013)
Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association. 1-5.
Schatz, Peddinti, Cao, Bach, Hermansky & Dupoux (2014)
Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise. Fifteenth annual conference of the international speech communication association.
Schatz (2016)
ABX-discriminability measures and applications. École Normale Supérieure.
Schatz (2016)
ABX-discriminability measures and applications. Paris 6.
Schatz, Bach & Dupoux (2017)
ASR systems as models of phonetic category perception in adults. Proceedings of the 39th Annual CogSci Meeting.
Schatz, Feldman, Goldwater, Cao & Dupoux ()
Early phonetic learning without phonetic categories: Insights from machine learning. Proceedings of the National Academy of Sciences.
Schnabel, Labutov, Mimno & Joachims (2015)
Evaluation methods for unsupervised word embeddings. Proceedings of the 2015 conference on empirical methods in natural language processing. 298-307.
Schneider, Baevski, Collobert & Auli (2019)
wav2vec: Unsupervised pre-training for speech recognition. arXiv:1904.05862.
Senin (2008)
Dynamic time warping algorithm review. Retrieved from http://seninp.github.io/assets/pubs/senin_dtw_litreview_2008.pdf
Sennrich, Haddow & Birch (2016)
Neural machine translation of rare words with subword units. Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers). 1715-1725. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P16-1162
Sennrich, Haddow & Birch (2015)
Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Shibata, Kato, Shinozaki & Watanabe ()
Composite embedding systems for ZeroSpeech2017 track 1. Submitted to ASRU 2017.
Norris & McQueen (2008)
Shortlist B: a Bayesian model of continuous speech recognition. Psychological Review, 115(2). 357-395. American Psychological Association.
Shrager & Langley (1990)
Shrager, J. & Langley, P. (1990). Computational models of scientific discovery and theory formation. Morgan Kaufmann.
Siu, Gish, Chan, Belfield & Lowe (2013)
Unsupervised training of an HMM-based self-organizing recognizer with applications to topic classification and keyword discovery. Computer Speech & Language, preprint.
Socher, Karpathy, Le, Manning & Ng (2014)
Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2. 207-218.
Scharenborg, Norris, Bosch & McQueen (2005)
How should a speech recognizer work? Cognitive Science, 29. 867-918.
Stolcke & Droppo (2017)
Comparing human and machine errors in conversational speech transcription. INTERSPEECH.
Sun, Myers, Vondrick, Murphy & Schmid (2019)
VideoBERT: A joint model for video and language representation learning. Proceedings of the IEEE international conference on computer vision. 7464-7473.
Synnaeve, Schatz & Dupoux (2014)
Phonetic embedding learning with side information. Proceedings of IEEE spoken language technology.
Synnaeve, Versteegh & Dupoux (2014)
Learning words from images and speech. 28th Conference on Neural Information Processing Systems (NIPS) Workshop on Learning Semantics.
Bosch, Van hamme, Boves & Moore (2008)
A computational model of language acquisition: The emergence of words. Fundamenta Informaticae, 90. 229-249.
McMurray, Aslin & Toscano (2009)
Statistical learning of phonetic categories: Insights from a computational approach. Developmental Science, 12(3). 369-378. Wiley Online Library.
Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015)
A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. INTERSPEECH. 3179-3183.
Schatz, Thiolliere, Dupoux, Synnaeve & Dunbar (2015)
ABXpy v0.1. Retrieved from http://dx.doi.org/10.5281/zenodo.16239
Schatz, Cao, Synnaeve, Thiolliere & Dupoux (2015)
Abkhazia: Preliminary release. Retrieved from http://dx.doi.org/10.5281/zenodo.16242
Schatz & Feldman (2018)
Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception. Proceedings of the conference on cognitive computational neuroscience. 1-4.
Elman & McClelland (2015)
Exploiting the lawful variability in the speech wave. 71-90. Erlbaum.
McClelland & Elman (1986)
Interactive processes in speech perception: The TRACE model. Cognitive Psychology, 18. 1-86.
Tran, Bisazza & Monz (2018)
The importance of being recurrent for modeling hierarchical structure. Retrieved from https://www.aclweb.org/anthology/D18-1503
Vallabha, McClelland, Pons, Werker & Amano (2007)
Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33). 13273-13278. National Acad Sciences.
Oord, Vinyals et al. (2017)
Neural discrete representation learning. Advances in neural information processing systems. 6306-6315.
VanDam (2015)
HomeBank VanDam Public 5-minute Corpus. TalkBank. Retrieved from http://homebank.talkbank.org/access/Public/VanDam-5minute.html
VanDam (2015)
HomeBank VanDam Public Daylong Corpus. TalkBank. Retrieved from http://homebank.talkbank.org/access/Public/VanDam-Daylong.html
Varadarajan, Khudanpur & Dupoux (2008)
Unsupervised learning of acoustic sub-word units. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. 165-168. Association for Computational Linguistics.
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin (2017)
Attention is all you need. CoRR, abs/1706.03762. Retrieved from http://arxiv.org/abs/1706.03762
Versteegh, Thiolliere, Schatz, Cao, Anguera, Jansen & Dupoux (2015)
The zero resource speech challenge 2015. Proc. of Interspeech.
Versteegh, Anguera, Jansen & Dupoux (2016)
The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science: Proceedings of SLTU 2016, 81. 67-72.
Wang, Tang & Livescu (2020)
Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). 6889-6893. IEEE.
Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019)
BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
Werker & Tees (1984)
Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant behavior and development, 7(1). 49-63. Elsevier.
Wesker, Meyer, Wagener, Anemüller, Mertins & Kollmeier (2005)
Oldenburg logatome speech corpus (OLLO) for speech recognition experiments with humans and machines. Ninth European conference on speech communication and technology.
Wilcox, Levy, Morita & Futrell (2018)
What do RNN language models learn about filler-gap dependencies? arXiv preprint 1809.00042.
Vries, Davel, Badenhorst, Basson, Wet, Barnard & Waal (2014)
A smartphone-based ASR data collection tool for under-resourced languages. Speech Communication, 56. 119-131.
Xu & Tenenbaum (2007)
Word learning as Bayesian inference. Psychological Review, 114(2). 245-272. American Psychological Association.
Yang & Powers (2006)
Verb similarity on the taxonomy of WordNet. Masaryk University.
Yang, Dai, Yang, Carbonell, Salakhutdinov & Le (2019)
XLNet: Generalized autoregressive pretraining for language understanding. Retrieved from https://arxiv.org/abs/1906.08237
Yu & Ballard (2004)
A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception, 1. 57-80.
Yuan, Leung, Xie, Chen, Ma & Li (2017)
Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. 2017 IEEE automatic speech recognition and understanding workshop (ASRU). 734-739. IEEE.
Zhang & Glass (2010)
Towards multi-speaker unsupervised speech pattern discovery. Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. 4366-4369.
Zhou, Xu & Corso (2018)
Towards automatic learning of procedures from web instructional videos. Proceedings of the AAAI conference on artificial intelligence, 32.
Gauthier, Besacier, Voisin, Melese & Elingui (2016)
Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof. 10th Language Resources and Evaluation Conference (LREC 2016). Retrieved from https://hal.archives-ouvertes.fr/hal-01350037
Jia, Weiss, Biadsy, Macherey, Johnson, Chen & Wu (2019)
Direct speech-to-speech translation with a sequence-to-sequence model. arXiv preprint arXiv:1904.06037.
Lee, Chen, Wang, Gu, Ma, Polyak, Adi, He, Tang, Pino et al. (2021)
Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604.
Tjandra, Sakti & Nakamura (2020)
Transformer VQ-VAE for unsupervised unit discovery and speech synthesis: ZeroSpeech 2020 challenge. arXiv preprint arXiv:2005.11676.
Alishahi, Chrupała, Cristia, Dupoux, Higy, Lavechin, Räsänen & Yu (2021)
ZR-2021VG: Zero-resource speech challenge, visually-grounded language modelling track. arXiv preprint arXiv:2107.06546.
Maekaku, Chang, Fujita, Chen, Watanabe & Rudnicky (2021)
Speech representation learning combining Conformer CPC with deep cluster for the ZeroSpeech challenge 2021. arXiv preprint arXiv:2107.05899.
Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021)
Information retrieval for ZeroSpeech 2021: The submission by University of Wroclaw. arXiv preprint arXiv:2106.11603.
Niekerk, Nortje, Baas & Kamper (2021)
Analyzing speaker information in self-supervised models to improve zero-resource speech processing. arXiv preprint arXiv:2108.00917.
Tjandra, Sakti & Nakamura (2019)
Speech-to-speech translation between untranscribed unknown languages. 2019 IEEE automatic speech recognition and understanding workshop (ASRU). 593-600. IEEE.
Jia, Ramanovich, Remez & Pomerantz (2021)
Translatotron 2: Robust direct speech-to-speech translation. arXiv preprint arXiv:2107.08661.
Lee, Gong, Duquenne, Schwenk, Chen, Wang, Popuri, Pino, Gu & Hsu (2021)
Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352.
Ostendorf, Price & Shattuck-Hufnagel (1995)
The Boston University radio news corpus. Linguistic Data Consortium. 1-19.
Algayres, Ricoul, Karadayi, Mohammed, Sagot & Dupoux (2022)
DP-PARSE: Finding word boundaries from raw speech with a token lexicon.
Nguyen, Sagot & Dupoux (2022)
Are discrete units necessary for spoken language modeling?
De Saussure (1916)
Course in general linguistics. McGraw-Hill Book Company, New York-Toronto-London.
Seyssel, Lavechin, Titeux, Thomas, Virlet, Santos Revilla, Wisniewski, Ludusan & Dupoux (2023)
ProsAudit, a prosodic benchmark for self-supervised speech models.