Full Bibliography

Goldwater, Griffiths & Johnson (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1), 21–54.
Agirre, Alfonseca, Hall, Kravalova, Pasca & Soroa (2009). A study on similarity and relatedness using distributional and WordNet-based approaches.
Al-Rfou, Choe, Constant, Guo & Jones (2018). Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.
Allen & Seidenberg (1999). The emergence of grammaticality in connectionist networks. The emergence of language, 115–151.
Ansari, Kumar, Singh, Ganapathy & Devi (n.d.). Unsupervised HMM posteriograms for language independent acoustic modeling in zero resource conditions.
Chaudhuri, Roth, Ellis, Gallagher, Kaver, Marvin, Pantofaru, Reale, Reid, Wilson & Xi (2018). AVA-Speech: A densely labeled dataset of speech activity in movies. Retrieved from https://arxiv.org/pdf/1808.00606
Baevski, Auli & Mohamed (2019). Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912.
Baevski, Zhou, Mohamed & Auli (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
Baker, Reichart & Korhonen (2014). An unsupervised model for instance level subcategorization acquisition.
Bérard, Pietquin, Servan & Besacier (2016). Listen and translate: A proof of concept for end-to-end speech-to-text translation.
Best (1995). A direct realist perspective on cross-language speech perception. In Strange, W. (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 167–200). York Press.
Bavin, E. (2009). The Cambridge handbook of child language. Cambridge University Press. Retrieved from http://site.ebrary.com/id/10303044
Dunbar, Bernard, Hamilakis, Nguyen, Seyssel, Rozé, Rivière, Kharitonov & Dupoux (2021). The Zero Resource Speech Challenge 2021: Spoken language modelling.
Kohonen (1988). The 'neural' phonetic typewriter. Computer, 21(3), 11–22.
Adda, Stüker, Adda-Decker, Ambouroue, Besacier, Blachon, Bonneau-Maynard, Godard, Hamlaoui, Idiatov, Kouarata, Lamel, Makasso, Rialland, Van de Velde, Yvon & Zerbian (2016). Breaking the unwritten language barrier: The BULB project.
Akaike (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Alishahi, Barking & Chrupała (2017). Encoding of phonology in a recurrent neural model of grounded speech.
Ansari, Kumar, Singh & Ganapathy (2017). Deep learning methods for unsupervised acoustic modeling: LEAP submission to ZeroSpeech Challenge 2017. IEEE.
Jansen & Van Durme (2011). Efficient spoken term discovery using randomized algorithms. IEEE.
Badino, Canevari, Fadiga & Metta (2014). An auto-encoder based approach to unsupervised learning of subword units. IEEE.
Baevski, Schneider & Auli (2020). vq-wav2vec: Self-supervised learning of discrete speech representations. Retrieved from https://openreview.net/forum?id=rylwJxrYDS
Hannun, Case, Casper, Catanzaro, Diamos, Elsen, Prenger, Satheesh, Sengupta, Coates et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Bengio, Ducharme, Vincent & Jauvin (2003). A neural probabilistic language model. JMLR.
Besacier, Zhou & Gao (2006). Towards speech translation of non written languages.
Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019). BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
Peng & Harwath (2022). Self-supervised representation learning for speech using visual grounding and masked language modeling. arXiv preprint arXiv:2202.03543.
Bruni, Boleda, Baroni & Tran (2012). Distributional semantics in technicolor.
Chalnick & Billman (1988). Unsupervised learning of correlational structure. Lawrence Erlbaum Associates.
Chen, Leung, Xie, Ma & Li (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study.
Chrupała (2021). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Retrieved from https://arxiv.org/abs/2104.13225
Chung & Glass (2018). Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
Chung, Hsu, Tang & Glass (2019). An unsupervised autoregressive model for speech representation learning. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 146–150. https://doi.org/10.21437/Interspeech.2019-1473
Keuleers & Brysbaert (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633.
Kharitonov, Lee, Polyak, Adi, Copet, Lakhotia, Nguyen, Rivière, Mohamed, Dupoux et al. (2021). Text-free prosody-aware generative spoken language modeling. arXiv preprint arXiv:2109.03264.
Heck, Sakti & Nakamura (2016). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. Procedia Computer Science, 81, 73–79.
Srivastava & Shrivastava (2016). Articulatory gesture rich representation learning of phonological units in low resource settings. Springer.
Heck, Sakti & Nakamura (2017). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017. IEEE.
Shibata, Kato, Shinozaki & Watanabe (2017). Composite embedding systems for ZeroSpeech 2017 Track 1. IEEE.
Chorowski, Weiss, Bengio & Van Den Oord (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), 2041–2053.
Kamper, Livescu & Goldwater (2017). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. IEEE.
Hsu, Harwath & Glass (2019). Transfer learning from audio-visual grounding to speech recognition. arXiv preprint arXiv:1907.04355.
Chung & Glass (2019). Generative pre-training for speech with autoregressive predictive coding. arXiv preprint arXiv:1910.12607.
Millet, Chitoran & Dunbar (2021). Predicting non-native speech perception using the perceptual assimilation model and state-of-the-art acoustic models.
Warstadt, Singh & Bowman (2018). Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Dai, Yang, Yang, Cohen, Carbonell, Le & Salakhutdinov (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Räsänen, Doyle & Frank (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units.
Räsänen & Blandón (2020). Unsupervised discovery of recurring speech patterns using probabilistic adaptive metrics. arXiv preprint arXiv:2008.00731.
Prakash, Kumar, Murthy et al. (2020). Exploration of end-to-end synthesisers for Zero Resource Speech Challenge 2020. arXiv preprint arXiv:2009.04983.
Davis & Mermelstein (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4), 357–366.
Lee & Glass (2012). A nonparametric Bayesian approach to acoustic model discovery. The Association for Computational Linguistics.
Hsu, Hwang, Wu, Tsao & Wang (2016). Voice conversion from non-parallel corpora using variational auto-encoder. https://doi.org/10.1109/APSIPA.2016.7820786
Tjandra, Sakti & Nakamura (2017). Listening while speaking: Speech chain by deep learning.
Gao, Singh & Raj (2018). Voice impersonation using generative adversarial networks. IEEE.
Jansen, Thomas & Hermansky (2013). Weak top-down constraints for unsupervised acoustic model training. IEEE.
Eloff, Nortje, Niekerk, Govender, Nortje, Pretorius, Van Biljon, Westhuizen, Staden & Kamper (2019). Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. arXiv preprint arXiv:1904.07556.
Yusuf, Gök, Gündogdu, Kose & Saraclar (2019). Temporally-aware acoustic unit discovery for ZeroSpeech 2019 Challenge. INTERSPEECH 2019.
Liu, Hsu & Lee (2019). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. arXiv preprint arXiv:1905.11563.
Nayak, Kumar, Ramesh, Bhati & Murty (2019). Virtual phone discovery for speech synthesis without text. IEEE.
Muthukumar & Black (2014). Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis.
Scharenborg, Besacier, Black, Hasegawa-Johnson, Metze, Neubig, Stüker, Godard, Müller, Ondel, Palaskar, Arthur, Ciannella, Du, Larsen, Merkx, Riad, Wang & Dupoux (2018). Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 workshop. IEEE.
Shen, Pang, Weiss, Schuster, Jaitly, Yang, Chen, Zhang, Wang, Ryan, Saurous, Agiomyrgiannakis & Wu (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. IEEE.
Ondel, Burget & Cernocký (2016). Variational inference for acoustic unit discovery. Elsevier.
Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior & Kavukcuoglu (2016). WaveNet: A generative model for raw audio. ISCA.
Wu, Watts & King (2016). Merlin: An open source neural network speech synthesis system. ISCA.
Ping, Peng, Gibiansky, Arik, Kannan, Narang, Raiman & Miller (2017). Deep Voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654.
Kaneko & Kameoka (2017). Parallel-data-free voice conversion using cycle-consistent adversarial networks. CoRR, abs/1711.11293. Retrieved from https://arxiv.org/abs/1711.11293
Chou, Yeh, Lee & Lee (2018). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. CoRR, abs/1804.02812. Retrieved from https://arxiv.org/abs/1804.02812
Li, Liu, Liu, Zhao, Liu & Zhou (2018). Close to human quality TTS with Transformer. CoRR, abs/1809.08895. Retrieved from https://arxiv.org/abs/1809.08895
Mehri, Kumar, Gulrajani, Kumar, Jain, Sotelo, Courville & Bengio (2016). SampleRNN: An unconditional end-to-end neural audio generation model. CoRR, abs/1612.07837. Retrieved from https://arxiv.org/abs/1612.07837
Taigman, Wolf, Polyak & Nachmani (2017). Voice synthesis for in-the-wild speakers via a phonological loop. CoRR, abs/1707.06588.
Dillon, Dunbar & Idsardi (2013). A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37(2), 344–377.
DeCarlo (1998). Signal detection theory and generalized linear models. Psychological Methods, 3(2), 186.
Deng, Dong, Socher, Li, Li & Fei-Fei (2009). ImageNet: A large-scale hierarchical image database. https://doi.org/10.1109/CVPR.2009.5206848
Devlin, Chang, Lee & Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
Driesen & Van hamme (2011). Modeling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA. Neurocomputing, 74, 1874–1882.
Dunbar, Cao, Benjumea, Karadayi, Bernard, Besacier, Anguera & Dupoux (2017). The Zero Resource Speech Challenge 2017. IEEE. Retrieved from https://arxiv.org/abs/1712.04313
Algayres, Zaiem, Sagot & Dupoux (2020). Evaluating the reliability of acoustic speech embeddings. arXiv preprint arXiv:2007.13542.
Riad, Dancette, Karadayi, Zeghidour, Schatz & Dupoux (2018). Sampling strategies in Siamese networks for unsupervised speech representation learning. arXiv preprint arXiv:1804.11297.
Dunbar, Algayres, Karadayi, Bernard, Benjumea, Cao, Miskic, Dugrain, Ondel, Black et al. (2019). The Zero Resource Speech Challenge 2019: TTS without T. Retrieved from https://arxiv.org/abs/1904.11469
Dunbar, Karadayi, Bernard, Cao, Algayres, Ondel, Besacier, Sakriani & Dupoux (2020). The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units.
Duong, Anastasopoulos, Chiang, Bird & Cohn (2016). An attentional model for speech translation without transcription.
Dupoux (2016). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. arXiv preprint arXiv:1607.08723.
Dupoux (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173, 43–59.
Peters, Neumann, Iyyer, Gardner, Clark, Lee & Zettlemoyer (2018). Deep contextualized word representations. NAACL.
Faruqui, Tsvetkov, Rastogi & Dyer (2016). Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.
Feigenbaum (1963). The simulation of verbal learning behavior. In Feigenbaum, E. & Feldman, J. (Eds.), Computers and thought. McGraw-Hill.
Feldman & Griffiths (2007). A rational account of the perceptual magnet effect.
Feldman, Griffiths, Goldwater & Morgan (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4), 751–778.
Feng, Lee & Peng (2019). Combining adversarial training and disentangled speech representation for robust zero-resource subword modeling. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1906.07234
Cieri, Miller & Walker (2004). The Fisher corpus: A resource for the next generations of speech-to-text.
Frome, Corrado, Shlens, Bengio, Dean, Ranzato & Mikolov (2013). DeViSE: A deep visual-semantic embedding model.
Futrell, Wilcox, Morita, Qian, Ballesteros & Levy (2019). Neural language models as psycholinguistic subjects: Representations of syntactic state.
Futrell, Wilcox, Morita & Levy (2018). RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint arXiv:1809.01329.
Gage (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.
García-Granada, Sanchis, Castro-Bleda, González & Hurtado (n.d.). ZeroSpeech 2017 ELIRF-UPV system. Submitted to ASRU 2017.
Gerz, Vulić, Hill, Reichart & Korhonen (2016). SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
Glass (2012). Towards unsupervised speech processing. IEEE.
Myrman & Salvi (2017). Partitioning of posteriorgrams using Siamese models for unsupervised acoustic modelling. ISCA.
Godais, Linzen & Dupoux (2017). Comparing character-level neural language models using a lexical decision task. https://doi.org/10.18653/v1/E17-2020
Godfrey, Holliman & McDaniel (1992). SWITCHBOARD: Telephone speech corpus for research and development. IEEE.
Goldberg (2019). Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.
Guenther & Gjaja (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America, 100(2), 1111–1121.
Gulordava, Bojanowski, Grave, Linzen & Baroni (2018). Colorless green recurrent networks dream hierarchically. Retrieved from https://www.aclweb.org/anthology/N18-1108
Hahn & Baroni (2019). Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text. Transactions of the Association for Computational Linguistics (accepted). Retrieved from https://arxiv.org/abs/1906.07285
Halawi, Dror, Gabrilovich & Koren (2012). Large-scale learning of word relatedness with constraints.
Harwath & Glass (2015). Deep multimodal semantic embeddings for speech and images. IEEE.
Harwath, Torralba & Glass (2016). Unsupervised learning of spoken language with visual context.
Harwath, Hsu & Glass (2019). Learning hierarchical discrete linguistic units from visually-grounded speech. arXiv preprint arXiv:1911.09602.
Tiede, Espy-Wilson, Goldenberg, Mitra, Nam & Sivaraman (2017). Quantifying kinematic aspects of reduction in a contrasting rate production task. The Journal of the Acoustical Society of America, 141(5), 3580. https://doi.org/10.1121/1.4987629
Hastie, Tibshirani & Friedman (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Havard, Besacier & Rosec (2017). SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set. https://doi.org/10.21437/GLU.2017-9
Arandjelovic & Zisserman (2017). Look, listen and learn.
Chrupała, Gelderloos & Alishahi (2017). Representations of language in a model of visually grounded speech signal. arXiv preprint arXiv:1702.01991.
Jansen, Dupoux, Goldwater, Johnson, Khudanpur, Church, Feldman, Hermansky, Metze, Rose, Seltzer, Clark, McGraw, Varadarajan, Bennett, Borschinger, Chiu, Dunbar, Fourtassi, Harwath, Lee, Levin, Norouzian, Peddinti, Richardson, Schatz & Thomas (2013). A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition.
Elsner, Goldwater & Eisenstein (2012). Bootstrapping a unified model of lexical and phonetic acquisition.
Bostrom & Durrett (2020). Byte pair encoding is suboptimal for language model pretraining. Retrieved from https://arxiv.org/abs/2004.03720
Fer, Matejka, Grezl, Plchot, Vesely & Cernocky (2017). Multilingually trained bottleneck features in spoken language recognition. Computer Speech and Language, 46(Supplement C), 252–267.
Pitt, Dilley, Johnson, Kiesling, Raymond, Hume & Fosler-Lussier (2007). Buckeye corpus of conversational speech (2nd release). Columbus, OH: Department of Psychology, Ohio State University (Distributor). www.buckeyecorpus.osu.edu
Barnard (2014). The NCHLT speech corpus of the South African languages. 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. https://sites.google.com/site/nchltspeechcorpus/home; retrieved from http://hdl.handle.net/10204/7549
Cho, Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk & Bengio (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1179
Chrupała (2019). Symbolic inductive bias for visually grounded learning of spoken language. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1647
Badino, Mereta & Rosasco (2015). Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders.
Renshaw, Kamper, Jansen & Goldwater (2015). A comparison of neural network methods for unsupervised representation learning on the Zero Resource Speech Challenge.
Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling.
Zeghidour, Synnaeve, Versteegh & Dupoux (2016). A deep scattering spectrum–deep Siamese network pipeline for unsupervised acoustic modeling. IEEE.
Chen, Leung, Xie, Ma & Li (2017). Multilingual bottle-neck feature learning from untranscribed speech. IEEE.
Pellegrini, Manenti & Pinquier (2017). The IRIT-UPS system @ ZeroSpeech 2017 Track 1: Unsupervised subword modeling (Technical report). IRIT, Université de Toulouse.
Kharitonov, Rivière, Synnaeve, Wolf, Mazaré, Douze & Dupoux (2021). Data augmenting contrastive learning of speech representations in the time domain. IEEE.
Seshadri, Remes, Räsänen et al. (2017). Comparison of non-parametric Bayesian mixture models for syllable clustering and zero-resource speech processing. INTERSPEECH 2017.
Lyzinski, Sell & Jansen (2015). An evaluation of graph clustering methods for unsupervised term discovery.
Lakhotia, Kharitonov, Hsu, Adi, Polyak, Bolte, Nguyen, Copet, Baevski, Mohamed et al. (2021). On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9, 1336–1354.
Millet & Dunbar (2020). The Perceptimatic English benchmark for speech perception models.
Millet & Dunbar (2022). Do self-supervised speech models develop human-like perception biases?
Moore (2012). An introduction to the psychology of hearing. Brill.
Weerts, Rosen, Clopath & Goodman (2021). The psychometrics of automatic speech recognition. bioRxiv.
Tsuji, Cristia & Dupoux (2021). SCALa: A blueprint for computational models of language acquisition in social context. Cognition, 213, 104779.
Buerkin-Pontrelli, Culbertson, Legendre & Nazzi (2017). Competing models of liaison acquisition: Evidence from corpus and experimental data. Language, 93(1), 189–219.
Babineau, Legrand & Shi (2021). Variable forms in French-learning toddlers' lexical representations. Developmental Psychology.
Van Gijn & Zúñiga (2014). Word and the Americanist perspective. Morphology, 24(3), 135–160.
Millet & Dunbar (2020). Perceptimatic: A human speech perception benchmark for unsupervised subword modelling. arXiv preprint arXiv:2010.05961.
Warstadt & Bowman (2019). Grammatical analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.
Pandia & Murthy (2020). Zero resource speech synthesis using transcripts derived from perceptual acoustic units. arXiv preprint arXiv:2006.04372.
Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021). Aligned contrastive predictive coding. arXiv preprint arXiv:2104.11946.
Chrupała (2022). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Journal of Artificial Intelligence Research, 73, 673–707.
Hsu, Bolte, Tsai, Lakhotia, Salakhutdinov & Mohamed (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460.
Gwilliams, Linzen, Poeppel & Marantz (2018). In spoken word recognition, the future predicts the past. Journal of Neuroscience, 38(35), 7585–7599.
Beekhuizen, Armstrong & Stevenson (2021). Probing lexical ambiguity: Word vectors encode number and relatedness of senses. Cognitive Science, 45(5), e12943.
Nikolaus, Alishahi & Chrupała (2022). Learning English with Peppa Pig. arXiv preprint arXiv:2202.12917.
Havard, Chevrot & Besacier (2019). Models of visually grounded speech signal pay attention to nouns: A bilingual experiment on English and Japanese.
Havard, Chevrot & Besacier (2019). Word recognition, competition, and activation in a model of visually grounded speech.
He, Zhang, Ren & Sun (2016). Deep residual learning for image recognition. IEEE. https://doi.org/10.1109/CVPR.2016.90
Higy, Elliott & Chrupała (2020). Textual supervision for visually grounded spoken language understanding. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.244
Hill (1983). A computational model of language acquisition in the two-year old. Cognition and Brain Theory, 6, 287–317.
Hill, Reichart & Korhonen (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695.
Hochreiter & Schmidhuber (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Bin & Yuan (2019). A VAE model with speaker verification for unsupervised subword modeling: A submission to ZeroSpeech 2019. Submitted to INTERSPEECH 2019.
Hsu, Harwath, Song & Glass (2020). Text-free image-to-speech synthesis using learned segmental units.
Huijbregts, McLaren & Leeuwen (2011). Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection.
(N.A.) (2019). INTERSPEECH 2019 – 20th Annual Conference of the International Speech Communication Association, September 15–19, Graz, Austria, Proceedings.
Riochet, Castro, Bernard, Lerer, Fergus, Izard & Dupoux (2018). IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616.
Johnson, Griffiths & Goldwater (2007). Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Schölkopf, B., Platt, J. & Hoffman, T. (Eds.), Advances in neural information processing systems (pp. 641–648). MIT Press.
Jürgens, Brand & Kollmeier (2007). Modelling the human-machine gap in speech reception: Microscopic speech intelligibility prediction for normal-hearing subjects with an auditory model.
Kahn, Rivière, Zheng, Kharitonov, Xu, Mazaré, Karadayi, Liptchinsky, Collobert, Fuegen, Likhomanenko, Synnaeve, Joulin, Mohamed & Dupoux (2020). Libri-light: A benchmark for ASR with limited or no supervision. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp40776.2020.9052942; retrieved from https://arxiv.org/abs/1912.07875
Kamper, Shakhnarovich & Livescu (2019). Semantic speech retrieval with a visually grounded model of untranscribed speech. IEEE/ACM Transactions on Audio, Speech and Language Processing, 27, 89–98.
Kamper, Elsner, Jansen & Goldwater (2015). Unsupervised neural network based feature extraction using weak top-down constraints.
Karpathy & Li (2015). Deep visual-semantic alignments for generating image descriptions.
Kawakami, Wang, Dyer, Blunsom & Oord (2020). Learning robust and multilingual speech representations. Retrieved from https://arxiv.org/abs/2001.11128
Kleinschmidt & Jaeger (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148–203.
Kuhl (1991). Human adults and human infants show a "perceptual magnet effect" for the prototypes of speech categories, monkeys do not. Attention, Perception, & Psychophysics, 50(2), 93–107.
Lau, Clark & Lappin (2017). Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science, 1202–1241.
Chomsky (1957). Syntactic structures. Mouton.
Liberman, Cooper, Shankweiler & Studdert-Kennedy (1967). Perception of the speech code. Psychological Review, 74(6), 431.
Fowler (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14(1), 3–28.
Baljekar, Sitaram, Muthukumar & Black (2015). Using articulatory features and inferred phonological segments in zero resource speech processing.
Morita & Koda (2020). Exploring TTS without T using biologically/psychologically motivated neural network modules (ZeroSpeech 2020). arXiv preprint arXiv:2005.05487.
Chomsky & Halle (1968). The sound pattern of English.
Linzen, Dupoux & Goldberg (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL.
Linzen & Leonard (2018). Distinct patterns of syntactic agreement errors in recurrent networks and humans. arXiv preprint arXiv:1807.06882.
Lisker & Abramson (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3), 384–422.
Liu, Lowe, Serban, Noseworthy, Charlin & Pineau (2016). How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer & Stoyanov (2019). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692. Retrieved from http://arxiv.org/abs/1907.11692
Bates, Mächler, Bolker & Walker (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Ludusan, Versteegh, Jansen, Gravier, Cao, Johnson & Dupoux (2014). Bridging the gap between speech technology and natural language processing: An evaluation toolbox for term discovery systems.
Luong, Socher & Manning (2013)
Better word representations with recursive neural networks for morphology.
Versteegh & Thiolliere (2015)
ZeroSpeech term discovery evaluation toolkit. https://doi.org/10.5281/zenodo.16330
Macmillan & Creelman (2004)
Detection theory: A user’s guide. Psychology Press.
Mahrt (2016)
LMEDS: Language markup and experimental design software.
Wang, Zhang & Zhang (2015)
THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
Manenti, Pellegrini & Pinquier (2017)
Unsupervised speech unit discovery using k-means and neural networks. Springer.
Mangin, Filliat, Bosch & Oudeyer (2015)
MCA-NMF: Multimodal concept acquisition with non-negative matrix factorization. PLOS ONE. https://doi.org/10.1371/journal.pone.0140732
Marvin & Linzen (2018)
Targeted syntactic evaluation of language models. Retrieved from https://www.aclweb.org/anthology/D18-1151
Matlock (2001)
How real is fictive motion? (Doctoral dissertation). Psychology Department, University of California, Santa Cruz.
Melis, Dyer & Blunsom (2018)
On the state of the art of evaluation in neural language models. ICLR.
Merkx, Frank & Ernestus (2019)
Language learning using speech to image retrieval. https://doi.org/10.21437/Interspeech.2019-3067
Meyer, Wesker, Brand, Mertins & Kollmeier (2006)
A human-machine comparison in speech recognition based on a logatome corpus.
Meyer, Wächter, Brand & Kollmeier (2007)
Phoneme confusions in human and automatic speech recognition.
Meyer, Jürgens, Wesker, Brand & Kollmeier (2010)
Human phoneme recognition depending on speech-intrinsic variability. The Journal of the Acoustical Society of America, 128(5). 3126–3141.
Miao, Gowayyed & Metze (2015)
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. IEEE.
Miech, Zhukov, Alayrac, Tapaswi, Laptev & Sivic (2019)
HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips.
Miller & Charles (1991)
Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1). 1–28.
Millet, Jurov & Dunbar (2019)
Comparing unsupervised speech learning directly to human performance in speech perception.
Muscariello, Gravier & Bimbot (2012)
Unsupervised motif acquisition in speech via seeded discovery and template matching combination. IEEE Transactions on Audio, Speech and Language Processing, 20(7). 2031–2044.
Gulordava, Bojanowski, Grave, Linzen & Baroni (2018)
Colorless green recurrent networks dream hierarchically. Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/N18-1108
Kwiatkowski, Palomaki, Redfield, Collins, Parikh, Alberti, Epstein, Polosukhin, Devlin, Lee, et al. (2019)
Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7. 453–466.
Cuervo, Grabias, Chorowski, Ciesielski, Łańcucki, Rychlikowski & Marxer (2021)
Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words. arXiv preprint arXiv:2110.15909.
Iwamoto & Shinozaki (2021)
Unsupervised spoken term discovery using wav2vec 2.0. IEEE.
Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021)
Segmental contrastive predictive coding for unsupervised word segmentation. arXiv preprint arXiv:2106.02170.
Bhati, Villalba, Żelasko, Moro-Velazquez & Dehak (2021)
Unsupervised speech segmentation and variable rate representation learning using segmental contrastive predictive coding. arXiv preprint arXiv:2110.02345.
Bhati, Villalba, Żelasko & Dehak (2020)
Self-expressing autoencoders for unsupervised spoken term discovery. arXiv preprint arXiv:2007.13033.
Borgholt, Havtorn, Edin, Maaløe & Igel (2022)
A brief overview of unsupervised neural speech representation learning.
Nayak, Kumar, Ramesh, Bhati & Murty (2019)
Virtual phone discovery for speech synthesis. https://doi.org/10.13140/RG.2.2.23356.08324
Tobing, Hayashi, Wu, Kobayashi & Toda (2020)
Cyclic spectral modeling for unsupervised unit discovery into voice conversion with excitation and waveform modeling.
Chen & Hain (2020)
Unsupervised acoustic unit representation learning for voice conversion using WaveNet auto-encoders. arXiv preprint arXiv:2008.06892.
Niekerk, Nortje & Kamper (2020)
Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge. arXiv preprint arXiv:2005.09409.
Yusuf, Ondel, Burget, Černockỳ & Saraclar (2021)
A hierarchical subspace model for language-attuned acoustic unit discovery. IEEE.
Gündogdu, Yusuf, Yesilbursa & Saraclar (2020)
Vector quantized temporally-aware correspondence sparse autoencoders for zero-resource acoustic unit discovery.
Newell & Simon (1972)
Human problem solving. Prentice-Hall.
Nguyen, Seyssel, Rozé, Rivière, Kharitonov, Baevski, Dunbar & Dupoux (2020)
The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling. arXiv preprint arXiv:2011.11588.
Jurov (2019)
Phonetics or phonology? Modelling non-native perception (Master’s thesis). Université Paris Diderot, Paris, France.
Ondel, Godard, Besacier, Larsen, Hasegawa-Johnson, Scharenborg, Dupoux, Burget, Yvon & Khudanpur (2018)
Bayesian models for unit discovery on a very low resource language. IEEE.
Oord, Li & Vinyals (2018)
Representation learning with contrastive predictive coding. CoRR, abs/1807.03748. Retrieved from http://arxiv.org/abs/1807.03748
Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier & Auli (2019)
Fairseq: A fast, extensible toolkit for sequence modeling. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-4009
Panayotov, Chen, Povey & Khudanpur (2015)
Librispeech: An ASR corpus based on public domain audio books. IEEE.
Pandia & Murthy (2019)
Zero resource speech synthesis using transcripts derived from perceptual acoustic units. INTERSPEECH 2019.
Park & Glass (2008)
Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1). 186–197.
Parrot, Millet & Dunbar (2019)
Independent and automatic evaluation of acoustic-to-articulatory inversion models. arXiv preprint.
Pauls & Klein (2012)
Large-scale syntactic language modeling with treelets.
Chang & Fisher III (2013)
Parallel sampling of DP mixture models using sub-cluster splits.
Pellegrini, Manenti & Pinquier (n.d.)
Unsupervised discovery of sub-lexical units in speech based on ZCA and k-means. Submitted to ASRU 2017.
Peperkamp (2015)
Phonology versus phonetics in loanword adaptations. (pp. 71–90). John Benjamins Publishing Company.
Phillips, Wagers & Lau (2011)
Grammatical illusions and selective fallibility in real-time language comprehension. Experiments at the Interfaces, 37. 147–180.
Pintér & Watanabe (2016)
Do GMM phoneme classifiers perceive synthetic sibilants as humans do?
Pitt, Johnson, Hume, Kiesling & Raymond (2005)
The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45(1). 89–95.
Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz, Silovsky, Stemmer & Vesely (2011)
The Kaldi speech recognition toolkit.
Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz, Silovsky, Stemmer & Vesely (2011)
The Kaldi speech recognition toolkit. IEEE Signal Processing Society.
Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz, et al. (2011)
The Kaldi speech recognition toolkit. IEEE Signal Processing Society.
R Core Team (2017)
R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Rabiner (1989)
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2). 257–286.
Radford, Wu, Child, Luan, Amodei & Sutskever (2019)
Language models are unsupervised multitask learners.
Radinsky, Agichtein, Gabrilovich & Markovitch (2011)
A word at a time: Computing word relatedness using temporal semantic analysis.
Räsänen & Rasilo (2015)
A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122. 792–829.
Ravfogel, Tyers & Goldberg (2018)
Can LSTM learn to capture agreement? The case of Basque. arXiv preprint arXiv:1809.04022.
Kamper, Jansen & Goldwater (2017)
A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46. 154–174.
Kamper (2022)
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring. arXiv preprint arXiv:2202.11929.
Renshaw, Kamper, Jansen & Goldwater (2015)
A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge.
Dupoux (2018)
Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173. 43–59.
Riochet, Castro, Bernard, Lerer, Fergus, Izard & Dupoux (2018)
IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616.
Rivière, Joulin, Mazaré & Dupoux (2020)
Unsupervised pretraining transfers well across languages. Retrieved from https://arxiv.org/abs/2002.02848
Roy & Pentland (2002)
Learning words from sights and sounds: A computational model. Cognitive Science, 26. 113–146.
Rubenstein & Goodenough (1965)
Contextual correlates of synonymy. Communications of the ACM, 8(10). 627–633.
Tjandra, Sisman, Zhang, Sakti, Li & Nakamura (2019)
VQVAE unsupervised unit discovery and multi-scale Code2Spec inverter for ZeroSpeech challenge 2019. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1905.11449
Sakti, Kelana, Riza, Sakai, Markov & Nakamura (2008)
Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project.
Sakti, Maia, Sakai, Shimizu & Nakamura (2008)
Development of HMM-based Indonesian speech synthesis.
Salazar, Liang, Nguyen & Kirchhoff (2020)
Masked language model scoring. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.240
Sanabria, Caglayan, Palaskar, Elliott, Barrault, Specia & Metze (2018)
How2: A large-scale dataset for multimodal language understanding. NeurIPS. Retrieved from http://arxiv.org/abs/1811.00347
Scharenborg (2007)
Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication, 49(5). 336–347.
Scharenborg, Tiesmeyer, Hasegawa-Johnson & Dehak (2018)
Visualizing phoneme category adaptation in deep neural networks.
Scharenborg, Gouw, Larson & Marchiori (2019)
The representation of speech in deep neural networks. Springer.
Scharenborg (2019)
The representation of speech and its processing in the human brain and deep neural networks. Springer.
Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013)
Evaluating speech features with the minimal-pair ABX task (I): Analysis of the classical MFC/PLP pipeline.
Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013)
Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. INTERSPEECH.
Schatz, Peddinti, Cao, Bach, Hermansky & Dupoux (2014)
Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise.
Schatz (2016)
ABX-discriminability measures and applications (Doctoral dissertation). École Normale Supérieure.
Schatz (2016)
ABX-discriminability measures and applications (PhD thesis). Université Paris 6.
Schatz, Bach & Dupoux (2017)
ASR systems as models of phonetic category perception in adults.
Schatz & Feldman (2018)
Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception.
Schatz, Feldman, Goldwater, Cao & Dupoux (2021)
Early phonetic learning without phonetic categories: Insights from machine learning. Proceedings of the National Academy of Sciences.
Schnabel, Labutov, Mimno & Joachims (2015)
Evaluation methods for unsupervised word embeddings.
Schneider, Baevski, Collobert & Auli (2019)
wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
Senin (2008)
Dynamic time warping algorithm review. Retrieved from http://seninp.github.io/assets/pubs/senin_dtw_litreview_2008.pdf
Sennrich, Haddow & Birch (2016)
Neural machine translation of rare words with subword units. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162
Sennrich, Haddow & Birch (2015)
Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Shibata, Kato, Shinozaki & Watanabe (n.d.)
Composite embedding systems for ZeroSpeech2017 track 1. Submitted to ASRU 2017.
Norris & McQueen (2008)
Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115(2). 357–395.
Shrager & Langley (1990)
Computational models of scientific discovery and theory formation. Morgan Kaufmann.
Siu, Gish, Chan, Belfield & Lowe (2013)
Unsupervised training of an HMM-based self-organizing recognizer with applications to topic classification and keyword discovery. Computer Speech & Language (preprint).
Socher, Karpathy, Le, Manning & Ng (2014)
Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2. 207–218.
Scharenborg, Norris, Bosch & McQueen (2005)
How should a speech recognizer work? Cognitive Science, 29. 867–918.
Stolcke & Droppo (2017)
Comparing human and machine errors in conversational speech transcription.
Sun, Myers, Vondrick, Murphy & Schmid (2019)
VideoBERT: A joint model for video and language representation learning.
Synnaeve, Schatz & Dupoux (2014)
Phonetic embedding learning with side information.
Synnaeve, Versteegh & Dupoux (2014)
Learning words from images and speech.
Bosch, Van hamme, Boves & Moore (2008)
A computational model of language acquisition: The emergence of words. Fundamenta Informaticae, 90. 229–249.
McMurray, Aslin & Toscano (2009)
Statistical learning of phonetic categories: Insights from a computational approach. Developmental Science, 12(3). 369–378.
Thiolliere, Dunbar, Synnaeve, Versteegh & Dupoux (2015)
A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling.
Schatz, Thiolliere, Dupoux, Synnaeve & Dunbar (2015)
ABXpy v0.1. https://doi.org/10.5281/zenodo.16239
Schatz, Cao, Synnaeve, Thiolliere & Dupoux (2015)
Abkhazia: Preliminary release. https://doi.org/10.5281/zenodo.16242
Elman & McClelland (2015)
Exploiting the lawful variability in the speech wave. (pp. 71–90). Erlbaum.
McClelland & Elman (1986)
Interactive processes in speech perception: The TRACE model. Cognitive Psychology, 18. 1–86.
Tran, Bisazza & Monz (2018)
The importance of being recurrent for modeling hierarchical structure. Retrieved from https://www.aclweb.org/anthology/D18-1503
Vallabha, McClelland, Pons, Werker & Amano (2007)
Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33). 13273–13278.
Oord, Vinyals, et al. (2017)
Neural discrete representation learning.
VanDam (2015)
HomeBank VanDam Public 5-minute Corpus. TalkBank. https://doi.org/10.21415/T5388S
VanDam (2015)
HomeBank VanDam Public Daylong Corpus. TalkBank. https://doi.org/10.21415/T5QH5N
Varadarajan, Khudanpur & Dupoux (2008)
Unsupervised learning of acoustic sub-word units. Association for Computational Linguistics.
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin (2017)
Attention is all you need. CoRR, abs/1706.03762. Retrieved from http://arxiv.org/abs/1706.03762
Versteegh, Thiolliere, Schatz, Cao, Anguera, Jansen & Dupoux (2015)
The zero resource speech challenge 2015.
Versteegh, Anguera, Jansen & Dupoux (2016)
The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science: Proceedings of SLTU 2016, 81. 67–72.
Versteegh, Thiollière, Schatz, Cao, Anguera, Jansen & Dupoux (2015)
The zero resource speech challenge 2015. https://doi.org/10.1016/j.procs.2016.04.031
Versteegh, Anguera, Jansen & Dupoux (2016)
The zero resource speech challenge 2015: Proposed approaches and results. Procedia Computer Science, 81. 67–72.
Wang, Tang & Livescu (2020)
Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. IEEE.
Warstadt, Parrish, Liu, Mohananey, Peng, Wang & Bowman (2019)
BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
Werker & Tees (1984)
Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1). 49–63.
Wesker, Meyer, Wagener, Anemüller, Mertins & Kollmeier (2005)
Oldenburg logatome speech corpus (OLLO) for speech recognition experiments with humans and machines.
Wilcox, Levy, Morita & Futrell (2018)
What do RNN language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042.
Gauthier, Besacier, Voisin, Melese & Elingui (2016)
Collecting resources in sub-Saharan African languages for automatic speech recognition: A case study of Wolof. LREC.
Vries, Davel, Badenhorst, Basson, Wet, Barnard & Waal (2014)
A smartphone-based ASR data collection tool for under-resourced languages. Speech Communication, 56. 119–131.
Xu & Tenenbaum (2007)
Word learning as Bayesian inference. Psychological Review, 114(2). 245–272.
Yang & Powers (2006)
Verb similarity on the taxonomy of WordNet. Masaryk University.
Yang, Dai, Yang, Carbonell, Salakhutdinov & Le (2019)
XLNet: Generalized autoregressive pretraining for language understanding. Retrieved from https://arxiv.org/abs/1906.08237
Yu & Ballard (2004)
A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception, 1. 57–80.
Yuan, Leung, Xie, Chen, Ma & Li (n.d.)
Extracting bottleneck features and word-like pairs from untranscribed speech for feature representations. Submitted to ASRU 2017.
Yuan, Leung, Xie, Chen, Ma & Li (2017)
Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation. IEEE.
Zhang & Glass (2010)
Towards multi-speaker unsupervised speech pattern discovery.
Zhou, Xu & Corso (2018)
Towards automatic learning of procedures from web instructional videos.
Gauthier, Besacier, Voisin, Melese & Elingui (2016)
Collecting resources in sub-Saharan African languages for automatic speech recognition: A case study of Wolof. Retrieved from https://hal.archives-ouvertes.fr/hal-01350037
Jia, Weiss, Biadsy, Macherey, Johnson, Chen & Wu (2019)
Direct speech-to-speech translation with a sequence-to-sequence model. arXiv preprint arXiv:1904.06037.
Lee, Chen, Wang, Gu, Ma, Polyak, Adi, He, Tang, Pino, et al. (2021)
Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604.
Tjandra, Sakti & Nakamura (2020)
Transformer VQ-VAE for unsupervised unit discovery and speech synthesis: ZeroSpeech 2020 challenge. arXiv preprint arXiv:2005.11676.
Alishahi, Chrupała, Cristia, Dupoux, Higy, Lavechin, Räsänen & Yu (2021)
ZR-2021VG: Zero-resource speech challenge, visually-grounded language modelling track. arXiv preprint arXiv:2107.06546.
Maekaku, Chang, Fujita, Chen, Watanabe & Rudnicky (2021)
Speech representation learning combining Conformer CPC with deep cluster for the ZeroSpeech challenge 2021. arXiv preprint arXiv:2107.05899.
Chorowski, Ciesielski, Dzikowski, Łańcucki, Marxer, Opala, Pusz, Rychlikowski & Stypułkowski (2021)
Information retrieval for ZeroSpeech 2021: The submission by University of Wroclaw. arXiv preprint arXiv:2106.11603.
Niekerk, Nortje, Baas & Kamper (2021)
Analyzing speaker information in self-supervised models to improve zero-resource speech processing. arXiv preprint arXiv:2108.00917.
Tjandra, Sakti & Nakamura (2019)
Speech-to-speech translation between untranscribed unknown languages. IEEE.
Jia, Ramanovich, Remez & Pomerantz (2021)
Translatotron 2: Robust direct speech-to-speech translation. arXiv preprint arXiv:2107.08661.
Lee, Gong, Duquenne, Schwenk, Chen, Wang, Popuri, Pino, Gu & Hsu (2021)
Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352.
Algayres, Ricoul, Karadayi, Mohammed, Sagot & Dupoux (2022)
DP-PARSE: Finding word boundaries from raw speech with a token lexicon.
Nguyen, Sagot & Dupoux (2022)
Are discrete units necessary for spoken language modeling?
De Saussure (1916)
Course in general linguistics. McGraw-Hill Book Company, New York-Toronto-London.