- Goldwater, S., Griffiths, T. & Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1), 21–54. https://doi.org/10.1016/j.cognition.2009.03.008
- Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M. & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches.
- Al-Rfou, R., Choe, D., Constant, N., Guo, M. & Jones, L. (2018). Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.
- Allen, J. & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. The emergence of language, 115–151.
- Ansari, T., Kumar, R., Singh, S., Ganapathy, S. & Devi, S. (n.d.). Unsupervised HMM posteriograms for language independent acoustic modeling in zero resource conditions.
- Chaudhuri, S., Roth, J., Ellis, D., Gallagher, A., Kaver, L., Marvin, R., Pantofaru, C., Reale, N., Reid, L., Wilson, K. & Xi, Z. (2018). AVA-Speech: A densely labeled dataset of speech activity in movies. Retrieved from https://arxiv.org/pdf/1808.00606
- Baevski, A., Auli, M. & Mohamed, A. (2019). Effectiveness of self-supervised pre-training for speech recognition. arXiv preprint arXiv:1911.03912.
- Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
- Baker, S., Reichart, R. & Korhonen, A. (2014). An unsupervised model for instance level subcategorization acquisition.
- Bérard, A., Pietquin, O., Servan, C. & Besacier, L. (2016). Listen and translate: A proof of concept for end-to-end speech-to-text translation.
- Best, C. (1995). A direct realist perspective on cross-language speech perception. In Strange, W. (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 167–200). York Press.
- Dunbar, E., Bernard, M., Hamilakis, N., Nguyen, T., Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E. & Dupoux, E. (2021). The Zero Resource Speech Challenge 2021: Spoken language modelling.
- Kohonen, T. (1988). The 'neural' phonetic typewriter. Computer, 21(3), 11–22.
- Adda, G., Stücker, S., Adda-Decker, M., Ambouroue, O., Besacier, L., Blachon, D., Bonneau-Maynard, H., Godard, P., Hamlaoui, F., Idiatov, D., Kouarata, G., Lamel, L., Makasso, E., Rialland, A., Van de Velde, M., Yvon, F. & Zerbian, S. (2016). Breaking the unwritten language barrier: The BULB project.
- Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
- Alishahi, A., Barking, M. & Chrupała, G. (2017). Encoding of phonology in a recurrent neural model of grounded speech.
- Ansari, T., Singh, S., Kumar, R. & Ganapathy, S. (n.d.). Deep learning methods for unsupervised acoustic modeling: LEAP submission to ZeroSpeech challenge 2017.
- Ansari, T., Kumar, R., Singh, S. & Ganapathy, S. (2017). Deep learning methods for unsupervised acoustic modeling: LEAP submission to ZeroSpeech challenge 2017. IEEE.
- Jansen, A. & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. IEEE.
- Badino, L., Canevari, C., Fadiga, L. & Metta, G. (2014). An auto-encoder based approach to unsupervised learning of subword units. IEEE.
- Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A. et al. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
- Bengio, Y., Ducharme, R., Vincent, P. & Jauvin, C. (2003). A neural probabilistic language model. JMLR.
- Besacier, L., Zhou, B. & Gao, Y. (2006). Towards speech translation of non written languages.
- Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. & Bowman, S. (2019). BLiMP: A benchmark of linguistic minimal pairs for English. arXiv preprint arXiv:1912.00582.
- Peng, P. & Harwath, D. (2022). Self-supervised representation learning for speech using visual grounding and masked language modeling. arXiv preprint arXiv:2202.03543.
- Bruni, E., Boleda, G., Baroni, M. & Tran, N. (2012). Distributional semantics in technicolor.
- Chalnick, A. & Billman, D. (1988). Unsupervised learning of correlational structure. Lawrence Erlbaum Associates.
- Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study.
- Chrupała, G. (2021). Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques. Retrieved from https://arxiv.org/abs/2104.13225
- Chung, Y. & Glass, J. (2018). Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech. arXiv preprint arXiv:1803.08976.
- Chung, Y., Hsu, W., Tang, H. & Glass, J. (2019). An unsupervised autoregressive model for speech representation learning. Proceedings of INTERSPEECH 2019, 146–150. https://doi.org/10.21437/Interspeech.2019-1473
- Keuleers, E. & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633.
- Kharitonov, E., Lee, A., Polyak, A., Adi, Y., Copet, J., Lakhotia, K., Nguyen, T., Rivière, M., Mohamed, A., Dupoux, E. et al. (2021). Text-free prosody-aware generative spoken language modeling. arXiv preprint arXiv:2109.03264.
- Heck, M., Sakti, S. & Nakamura, S. (2016). Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario. Procedia Computer Science, 81, 73–79.
- Srivastava, B. & Shrivastava, M. (2016). Articulatory gesture rich representation learning of phonological units in low resource settings. Springer.
- Heck, M., Sakti, S. & Nakamura, S. (2017). Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017. IEEE.
- Shibata, H., Kato, T., Shinozaki, T. & Watanabe, S. (2017). Composite embedding systems for ZeroSpeech 2017 Track 1. IEEE.
- Chorowski, J., Weiss, R., Bengio, S. & Van Den Oord, A. (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), 2041–2053.
- Kamper, H., Livescu, K. & Goldwater, S. (2017). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. IEEE.
- Hsu, W., Harwath, D. & Glass, J. (2019). Transfer learning from audio-visual grounding to speech recognition. arXiv preprint arXiv:1907.04355.
- Chung, Y. & Glass, J. (2019). Generative pre-training for speech with autoregressive predictive coding. arXiv preprint arXiv:1910.12607.
- Millet, J., Chitoran, I. & Dunbar, E. (2021). Predicting non-native speech perception using the perceptual assimilation model and state-of-the-art acoustic models.
- Warstadt, A., Singh, A. & Bowman, S. (2018). Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
- Dai, Z., Yang, Z., Yang, Y., Cohen, W., Carbonell, J., Le, Q. & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- Räsänen, O., Doyle, G. & Frank, M. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units.
- Räsänen, O. & Blandón, M. (2020). Unsupervised discovery of recurring speech patterns using probabilistic adaptive metrics. arXiv preprint arXiv:2008.00731.
- Prakash, A., Kumar, M., Murthy, H. et al. (2020). Exploration of end-to-end synthesisers for zero resource speech challenge 2020. arXiv preprint arXiv:2009.04983.
- Davis, S. & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4), 357–366.
- Lee, C. & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. The Association for Computational Linguistics.
- Hsu, C., Hwang, H., Wu, Y., Tsao, Y. & Wang, H. (2016). Voice conversion from non-parallel corpora using variational auto-encoder. https://doi.org/10.1109/APSIPA.2016.7820786
- Tjandra, A., Sakti, S. & Nakamura, S. (2017). Listening while speaking: Speech chain by deep learning.
- Gao, Y., Singh, R. & Raj, B. (2018). Voice impersonation using generative adversarial networks. IEEE.
- Jansen, A., Thomas, S. & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. IEEE.
- Eloff, R., Nortje, A., Niekerk, B., Govender, A., Nortje, L., Pretorius, A., Van Biljon, E., Westhuizen, E., Staden, L. & Kamper, H. (2019). Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. arXiv preprint arXiv:1904.07556.
- Yusuf, B., Gök, A., Gündogdu, B., Kose, O. & Saraclar, M. (2019). Temporally-aware acoustic unit discovery for ZeroSpeech 2019 challenge. INTERSPEECH 2019.
- Liu, A., Hsu, P. & Lee, H. (2019). Unsupervised end-to-end learning of discrete linguistic units for voice conversion. arXiv preprint arXiv:1905.11563.
- Nayak, S., Kumar, C., Ramesh, G., Bhati, S. & Murty, K. (2019). Virtual phone discovery for speech synthesis without text. IEEE.
- Muthukumar, P. & Black, A. (2014). Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis.
- Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stüker, S., Godard, P., Müller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L. & Dupoux, E. (2018). Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 workshop. IEEE.
- Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Ryan, R., Saurous, R., Agiomyrgiannakis, Y. & Wu, Y. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. IEEE.
- Ondel, L., Burget, L. & Cernocký, J. (2016). Variational inference for acoustic unit discovery. Elsevier.
- Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. ISCA.
- Wu, Z., Watts, O. & King, S. (2016). Merlin: An open source neural network speech synthesis system. ISCA.
- Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J. & Miller, J. (2017). Deep Voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654.
- Kaneko, T. & Kameoka, H. (2017). Parallel-data-free voice conversion using cycle-consistent adversarial networks. CoRR, abs/1711.11293. Retrieved from https://arxiv.org/abs/1711.11293
- Chou, J., Yeh, C., Lee, H. & Lee, L. (2018). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. CoRR, abs/1804.02812. Retrieved from https://arxiv.org/abs/1804.02812
- Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M. & Zhou, M. (2018). Close to human quality TTS with transformer. CoRR, abs/1809.08895. Retrieved from https://arxiv.org/abs/1809.08895
- Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A. & Bengio, Y. (2016). SampleRNN: An unconditional end-to-end neural audio generation model. CoRR, abs/1612.07837. Retrieved from https://arxiv.org/abs/1612.07837
- Taigman, Y., Wolf, L., Polyak, A. & Nachmani, E. (2017). Voice synthesis for in-the-wild speakers via a phonological loop. CoRR, abs/1707.06588.
- Dillon, B., Dunbar, E. & Idsardi, W. (2013). A single-stage approach to learning phonological categories: Insights from Inuktitut. Cognitive Science, 37(2), 344–377.
- DeCarlo, L. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3(2), 186.
- Deng, J., Dong, W., Socher, R., Li, L., Li, K. & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. https://doi.org/10.1109/CVPR.2009.5206848
- Devlin, J., Chang, M., Lee, K. & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
- Driesen, J. & Van hamme, H. (2011). Modeling vocabulary acquisition, adaptation and generalization in infants using adaptive Bayesian PLSA. Neurocomputing, 74, 1874–1882.
- Dunbar, E., Cao, X., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X. & Dupoux, E. (2017). The Zero Resource Speech Challenge 2017. IEEE. Retrieved from https://arxiv.org/abs/1712.04313
- Algayres, R., Zaiem, M., Sagot, B. & Dupoux, E. (2020). Evaluating the reliability of acoustic speech embeddings. arXiv preprint arXiv:2007.13542.
- Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T. & Dupoux, E. (2018). Sampling strategies in siamese networks for unsupervised speech representation learning. arXiv preprint arXiv:1804.11297.
- Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, J., Cao, X., Miskic, L., Dugrain, C., Ondel, L., Black, A. et al. (2019). The Zero Resource Speech Challenge 2019: TTS without T. Retrieved from https://arxiv.org/abs/1904.11469
- Dunbar, E., Karadayi, J., Bernard, M., Cao, X., Algayres, R., Ondel, L., Besacier, L., Sakti, S. & Dupoux, E. (2020). The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units.
- Duong, L., Anastasopoulos, A., Chiang, D., Bird, S. & Cohn, T. (2016). An attentional model for speech translation without transcription.
- Dupoux, E. (2016). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. arXiv preprint arXiv:1607.08723.
- Dupoux, E. (2018). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173, 43–59.
- Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. & Zettlemoyer, L. (2018). Deep contextualized word representations. NAACL.
- Faruqui, M., Tsvetkov, Y., Rastogi, P. & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.
- Feigenbaum, E. (1963). The simulation of verbal learning behavior. In Feigenbaum, E. & Feldman, J. (Eds.), Computers and thought. McGraw-Hill.
- Feldman, N. & Griffiths, T. (2007). A rational account of the perceptual magnet effect.
- Feldman, N., Griffiths, T., Goldwater, S. & Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4), 751–778.
- Feng, S., Lee, T. & Peng, Z. (2019). Combining adversarial training and disentangled speech representation for robust zero-resource subword modeling. INTERSPEECH 2019. Retrieved from https://arxiv.org/abs/1906.07234
- Cieri, C., Miller, D. & Walker, K. (2004). The Fisher corpus: A resource for the next generations of speech-to-text.
- Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M. & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model.
- Futrell, R., Wilcox, E., Morita, T., Qian, P., Ballesteros, M. & Levy, R. (2019). Neural language models as psycholinguistic subjects: Representations of syntactic state.
- Futrell, R., Wilcox, E., Morita, T. & Levy, R. (2018). RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint arXiv:1809.01329.
- Gage, P. (1994). A new algorithm for data compression. C Users Journal, 12(2), 23–38.
- García-Granada, F., Sanchis, E., Castro-Bleda, M., González, J. & Hurtado, L. (n.d.). ZeroSpeech 2017 ELIRF-UPV system. Submitted to ASRU 2017.
- Gerz, D., Vulić, I., Hill, F., Reichart, R. & Korhonen, A. (2016). SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
- Glass, J. (2012). Towards unsupervised speech processing. IEEE.
- Myrman, A. & Salvi, G. (2017). Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling. ISCA.
- Godais, G., Linzen, T. & Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. https://doi.org/10.18653/v1/E17-2020
- Godfrey, J., Holliman, E. & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for research and development. IEEE.
- Goldberg, Y. (2019). Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.
- Guenther, F. & Gjaja, M. (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America, 100(2), 1111–1121.
- Gulordava, K., Bojanowski, P., Grave, E., Linzen, T. & Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Retrieved from https://www.aclweb.org/anthology/N18-1108
- Hahn, M. & Baroni, M. (2019). Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text. Transactions of the Association for Computational Linguistics. Retrieved from https://arxiv.org/abs/1906.07285
- Halawi, G., Dror, G., Gabrilovich, E. & Koren, Y. (2012). Large-scale learning of word relatedness with constraints.
- Harwath, D. & Glass, J. (2015). Deep multimodal semantic embeddings for speech and images. IEEE.
- Harwath, D., Torralba, A. & Glass, J. (2016). Unsupervised learning of spoken language with visual context.
- Harwath, D., Hsu, W. & Glass, J. (2019). Learning hierarchical discrete linguistic units from visually-grounded speech. arXiv preprint arXiv:1911.09602.
- Tiede, M., Espy-Wilson, C., Goldenberg, D., Mitra, V., Nam, H. & Sivaraman, G. (2017). Quantifying kinematic aspects of reduction in a contrasting rate production task. The Journal of the Acoustical Society of America, 141(5), 3580. https://doi.org/10.1121/1.4987629
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
- Havard, W., Besacier, L. & Rosec, O. (2017). SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set. https://doi.org/10.21437/GLU.2017-9
- Arandjelovic, R. & Zisserman, A. (2017). Look, listen and learn.
- Chrupała, G., Gelderloos, L. & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. arXiv preprint arXiv:1702.01991.
- Jansen, A., Dupoux, E., Goldwater, S., Johnson, M., Khudanpur, S., Church, K., Feldman, N., Hermansky, H., Metze, F., Rose, R., Seltzer, M., Clark, P., McGraw, I., Varadarajan, B., Bennett, E., Borschinger, B., Chiu, J., Dunbar, E., Fourtassi, A., Harwath, D., Lee, C., Levin, K., Norouzian, A., Peddinti, V., Richardson, R., Schatz, T. & Thomas, S. (2013). A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition.
- Elsner, M., Goldwater, S. & Eisenstein, J. (2012). Bootstrapping a unified model of lexical and phonetic acquisition.
- Bostrom, K. & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. Retrieved from https://arxiv.org/abs/2004.03720
- Fer, R., Matejka, P., Grezl, F., Plchot, O., Vesely, K. & Cernocky, J. (2017). Multilingually trained bottleneck features in spoken language recognition. Computer Speech and Language, 46(Supplement C), 252–267.
- Pitt, M., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E. & Fosler-Lussier, E. (2007). Buckeye corpus of conversational speech (2nd release). Columbus, OH: Department of Psychology, Ohio State University (Distributor). www.buckeyecorpus.osu.edu
- Barnard, D. (2014). The NCHLT speech corpus of the South African languages. 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia. https://sites.google.com/site/nchltspeechcorpus/home. Retrieved from http://hdl.handle.net/10204/7549
- Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (n.d.). Multilingual bottle-neck feature learning from untranscribed speech. Submitted to ASRU 2017.
- Cho, K., Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1179
- Chrupała, G. (2019). Symbolic inductive bias for visually grounded learning of spoken language. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1647
- Badino, L., Mereta, A. & Rosasco, L. (2015). Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders.
- Renshaw, D., Kamper, H., Jansen, A. & Goldwater, S. (2015). A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge.
- Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M. & Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling.
- Zeghidour, N., Synnaeve, G., Versteegh, M. & Dupoux, E. (2016). A deep scattering spectrum—deep siamese network pipeline for unsupervised acoustic modeling. IEEE.
- Chen, H., Leung, C., Xie, L., Ma, B. & Li, H. (2017). Multilingual bottle-neck feature learning from untranscribed speech. IEEE.
- Pellegrini, T., Manenti, C. & Pinquier, J. (2017). The IRIT-UPS system @ ZeroSpeech 2017 Track 1: Unsupervised subword modeling. Tech. rep., IRIT, Université de Toulouse.
- Kharitonov, E., Rivière, M., Synnaeve, G., Wolf, L., Mazaré, P., Douze, M. & Dupoux, E. (2021). Data augmenting contrastive learning of speech representations in the time domain. IEEE.
- Seshadri, S., Remes, U., Räsänen, O. et al. (2017). Comparison of non-parametric Bayesian mixture models for syllable clustering and zero-resource speech processing. INTERSPEECH 2017.
- Lyzinski, V., Sell, G. & Jansen, A. (2015). An evaluation of graph clustering methods for unsupervised term discovery.
- Lakhotia, K., Kharitonov, E., Hsu, W., Adi, Y., Polyak, A., Bolte, B., Nguyen, T., Copet, J., Baevski, A., Mohamed, A. et al. (2021). On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9, 1336–1354.
- Millet, J. & Dunbar, E. (2020). The Perceptimatic English Benchmark for speech perception models.
- Millet, J. & Dunbar, E. (2022). Do self-supervised speech models develop human-like perception biases?
- Moore, B. (2012). An introduction to the psychology of hearing. Brill.
- Weerts, L., Rosen, S., Clopath, C. & Goodman, D. (2021). The psychometrics of automatic speech recognition. bioRxiv.
- Tsuji, S., Cristia, A. & Dupoux, E. (2021). SCALa: A blueprint for computational models of language acquisition in social context. Cognition, 213, 104779.
- Buerkin-Pontrelli, A., Culbertson, J., Legendre, G. & Nazzi, T. (2017). Competing models of liaison acquisition: Evidence from corpus and experimental data. Language, 93(1), 189–219.
- Babineau, M., Legrand, C. & Shi, R. (2021). Variable forms in French-learning toddlers' lexical representations. Developmental Psychology.
- Van Gijn, R. & Zúñiga, F. (2014). Word and the Americanist perspective. Morphology, 24(3), 135–160.
- Millet, J. & Dunbar, E. (2020). Perceptimatic: A human speech perception benchmark for unsupervised subword modelling. arXiv preprint arXiv:2010.05961.
- Warstadt, A. & Bowman, S. (2019). Grammatical analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.
-
Pandia & Murthy
(2020)
-
Pandia,
K. & Murthy,
H.
(2020).
Zero resource speech synthesis using transcripts derived from perceptual acoustic units.
arXiv preprint arXiv:2006.04372.
-
Chorowski,
Ciesielski,
Dzikowski,
Łańcucki,
Marxer,
Opala,
Pusz,
Rychlikowski & Stypułkowski
(2021)
-
Chorowski,
J.,
Ciesielski,
G.,
Dzikowski,
J.,
Łańcucki,
A.,
Marxer,
R.,
Opala,
M.,
Pusz,
P.,
Rychlikowski,
P. & Stypułkowski,
M.
(2021).
Aligned contrastive predictive coding.
arXiv preprint arXiv:2104.11946.
-
Chrupała
(2022)
-
Chrupała,
G.
(2022).
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques.
Journal of Artificial Intelligence Research, 73. 673–707.
-
Hsu,
Bolte,
Tsai,
Lakhotia,
Salakhutdinov & Mohamed
(2021)
-
Hsu,
W.,
Bolte,
B.,
Tsai,
Y.,
Lakhotia,
K.,
Salakhutdinov,
R. & Mohamed,
A.
(2021).
HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29. 3451–3460.
-
Gwilliams,
Linzen,
Poeppel & Marantz
(2018)
-
Gwilliams,
L.,
Linzen,
T.,
Poeppel,
D. & Marantz,
A.
(2018).
In spoken word recognition, the future predicts the past.
Journal of Neuroscience, 38(35). 7585–7599.
-
Beekhuizen,
Armstrong & Stevenson
(2021)
-
Beekhuizen,
B.,
Armstrong,
B. & Stevenson,
S.
(2021).
Probing lexical ambiguity: Word vectors encode number and relatedness of senses.
Cognitive Science, 45(5). e12943.
-
Nikolaus,
Alishahi & Chrupała
(2022)
-
Nikolaus,
M.,
Alishahi,
A. & Chrupała,
G.
(2022).
Learning English with Peppa Pig.
arXiv preprint arXiv:2202.12917.
-
Havard,
Chevrot & Besacier
(2019)
-
Havard,
W.,
Chevrot,
J. & Besacier,
L.
(2019).
Models of visually grounded speech signal pay attention to nouns: A bilingual experiment on English and Japanese.
-
Havard,
Chevrot & Besacier
(2019)
-
Havard,
W.,
Chevrot,
J. & Besacier,
L.
(2019).
Word recognition, competition, and activation in a model of visually grounded speech.
-
Heck,
Sakti & Nakamura
(n.d.)
-
Heck,
M.,
Sakti,
S. & Nakamura,
S.
(n.d.).
Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017.
Submitted to ASRU 2017.
-
Higy,
Elliott & Chrupała
(2020)
-
Higy,
B.,
Elliott,
D. & Chrupała,
G.
(2020).
Textual Supervision for Visually Grounded Spoken Language Understanding.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.244
-
Hill
(1983)
-
Hill,
J.
(1983).
A computational model of language acquisition in the two-year old.
Cognition and Brain Theory, 6. 287–317.
-
Hill,
Reichart & Korhonen
(2015)
-
Hill,
F.,
Reichart,
R. & Korhonen,
A.
(2015).
Simlex-999: Evaluating semantic models with (genuine) similarity estimation.
Computational Linguistics, 41(4). 665–695.
-
Hochreiter & Schmidhuber
(1997)
-
Hochreiter,
S. & Schmidhuber,
J.
(1997).
Long short-term memory.
Neural computation, 9(8). 1735–1780.
-
Bin & Yuan
(2019)
-
Bin,
Y. & Yuan,
W.
(2019).
A VAE model with speaker verification for unsupervised subword modeling: A submission to ZeroSpeech 2019.
Submitted to INTERSPEECH 2019.
-
Hsu,
Harwath,
Song & Glass
(2020)
-
Hsu,
W.,
Harwath,
D.,
Song,
C. & Glass,
J.
(2020).
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units.
-
Huijbregts,
McLaren & Leeuwen
(2011)
-
Huijbregts,
M.,
McLaren,
M. & Leeuwen,
D.
(2011).
Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection.
-
(N.A.)
(2019)
-
(2019).
INTERSPEECH 2019 – 20th Annual Conference of the International Speech Communication Association, September 15-19, Graz, Austria, Proceedings.
-
Riochet,
Castro,
Bernard,
Lerer,
Fergus,
Izard & Dupoux
(2018)
-
Riochet,
R.,
Castro,
M.,
Bernard,
M.,
Lerer,
A.,
Fergus,
R.,
Izard,
V. & Dupoux,
E.
(2018).
IntPhys: A framework and benchmark for visual intuitive physics reasoning.
arXiv preprint arXiv:1803.07616.
-
Jansen & Van Durme
(2011)
-
Jansen,
A. & Van Durme,
B.
(2011).
Efficient spoken term discovery using randomized algorithms.
-
Jansen,
Thomas & Hermansky
(2013)
-
Jansen,
A.,
Thomas,
S. & Hermansky,
H.
(2013).
Weak top-down constraints for unsupervised acoustic model training.
-
Johnson,
Griffiths & Goldwater
(2007)
-
Johnson,
M.,
Griffiths,
T. & Goldwater,
S.
(2007).
Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Schölkopf,
B.,
Platt,
J. & Hoffman,
T. (Eds.),
Advances in neural information processing systems. (pp. 641–648).
MIT Press.
-
Jürgens,
Brand & Kollmeier
(2007)
-
Jürgens,
T.,
Brand,
T. & Kollmeier,
B.
(2007).
Modelling the human-machine gap in speech reception: Microscopic speech intelligibility prediction for normal-hearing subjects with an auditory model.
-
Kahn,
Riviere,
Zheng,
Kharitonov,
Xu,
Mazare,
Karadayi,
Liptchinsky,
Collobert,
Fuegen & al.
(2020)
-
Kahn,
J.,
Riviere,
M.,
Zheng,
W.,
Kharitonov,
E.,
Xu,
Q.,
Mazare,
P.,
Karadayi,
J.,
Liptchinsky,
V.,
Collobert,
R.,
Fuegen,
C. & al.
(2020).
Libri-light: A benchmark for ASR with limited or no supervision.
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
https://doi.org/10.1109/icassp40776.2020.9052942
-
Kamper,
Livescu & Goldwater
(2017)
-
Kamper,
H.,
Livescu,
K. & Goldwater,
S.
(2017).
An embedded segmental k-means model for unsupervised segmentation and clustering of speech.
ASRU 2017. Retrieved from
https://arxiv.org/abs/1904.07556
-
Kamper,
Shakhnarovich & Livescu
(2019)
-
Kamper,
H.,
Shakhnarovich,
G. & Livescu,
K.
(2019).
Semantic speech retrieval with a visually grounded model of untranscribed speech.
IEEE/ACM Transactions on Audio, Speech and Language Processing, 27. 89–98.
-
Kamper,
Elsner,
Jansen & Goldwater
(2015)
-
Kamper,
H.,
Elsner,
M.,
Jansen,
A. & Goldwater,
S.
(2015).
Unsupervised neural network based feature extraction using weak top-down constraints.
-
Karpathy & Li
(2015)
-
Karpathy,
A. & Li,
F.
(2015).
Deep visual-semantic alignments for generating image descriptions.
-
Kawakami,
Wang,
Dyer,
Blunsom & Oord
(2020)
-
Kawakami,
K.,
Wang,
L.,
Dyer,
C.,
Blunsom,
P. & Oord,
A.
(2020).
Learning robust and multilingual speech representations.
Retrieved from
https://arxiv.org/abs/2001.11128
-
Kleinschmidt & Jaeger
(2015)
-
Kleinschmidt,
D. & Jaeger,
T.
(2015).
Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel.
Psychological Review, 122(2). 148–203.
-
Kuhl
(1991)
-
Kuhl,
P.
(1991).
Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not.
Attention, Perception, & Psychophysics, 50(2). 93–107.
-
Lau,
Clark & Lappin
(2017)
-
Lau,
J.,
Clark,
A. & Lappin,
S.
(2017).
Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge.
Cognitive Science. 1202–1241.
-
Lee & Glass
(2012)
-
Lee,
C. & Glass,
J.
(2012).
A nonparametric Bayesian approach to acoustic model discovery.
-
Chomsky
(1957)
-
Chomsky,
N.
(1957).
Syntactic structures.
JSTOR.
-
Liberman,
Cooper,
Shankweiler & Studdert-Kennedy
(1967)
-
Liberman,
A.,
Cooper,
F.,
Shankweiler,
D. & Studdert-Kennedy,
M.
(1967).
Perception of the speech code.
Psychological review, 74(6). 431.
-
Fowler
(1986)
-
Fowler,
C.
(1986).
An event approach to the study of speech perception from a direct–realist perspective.
Journal of phonetics, 14(1). 3–28.
-
Baljekar,
Sitaram,
Muthukumar & Black
(2015)
-
Baljekar,
P.,
Sitaram,
S.,
Muthukumar,
P. & Black,
A.
(2015).
Using articulatory features and inferred phonological segments in zero resource speech processing.
-
Morita & Koda
(2020)
-
Morita,
T. & Koda,
H.
(2020).
Exploring TTS without t using biologically/psychologically motivated neural network modules (ZeroSpeech 2020).
arXiv preprint arXiv:2005.05487.
-
Chomsky & Halle
(1968)
-
Chomsky,
N. & Halle,
M.
(1968).
The sound pattern of English.
-
Linzen,
Dupoux & Goldberg
(2016)
-
Linzen,
T.,
Dupoux,
E. & Goldberg,
Y.
(2016).
Assessing the ability of LSTMs to learn syntax-sensitive dependencies.
TACL.
-
Linzen & Leonard
(2018)
-
Linzen,
T. & Leonard,
B.
(2018).
Distinct patterns of syntactic agreement errors in recurrent networks and humans.
arXiv preprint arXiv:1807.06882.
-
Lisker & Abramson
(1964)
-
Lisker,
L. & Abramson,
A.
(1964).
A cross-language study of voicing in initial stops: Acoustical measurements.
Word, 20(3). 384–422.
-
Liu,
Hsu & Lee
(2019)
-
Liu,
A.,
Hsu,
P. & Lee,
H.
(2019).
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion.
INTERSPEECH 2019. Retrieved from
https://arxiv.org/abs/1905.11563
-
Liu,
Lowe,
Serban,
Noseworthy,
Charlin & Pineau
(2016)
-
Liu,
C.,
Lowe,
R.,
Serban,
I.,
Noseworthy,
M.,
Charlin,
L. & Pineau,
J.
(2016).
How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation.
arXiv preprint arXiv:1603.08023.
-
Liu,
Ott,
Goyal,
Du,
Joshi,
Chen,
Levy,
Lewis,
Zettlemoyer & Stoyanov
(2019)
-
Liu,
Y.,
Ott,
M.,
Goyal,
N.,
Du,
J.,
Joshi,
M.,
Chen,
D.,
Levy,
O.,
Lewis,
M.,
Zettlemoyer,
L. & Stoyanov,
V.
(2019).
RoBERTa: A robustly optimized BERT pretraining approach.
CoRR, abs/1907.11692. Retrieved from
http://arxiv.org/abs/1907.11692
-
Bates,
Mächler,
Bolker & Walker
(2015)
-
Bates,
D.,
Mächler,
M.,
Bolker,
B. & Walker,
S.
(2015).
Fitting linear mixed-effects models using lme4.
Journal of Statistical Software, 67(1). 1–48.
-
Ludusan,
Versteegh,
Jansen,
Gravier,
Cao,
Johnson & Dupoux
(2014)
-
Ludusan,
B.,
Versteegh,
M.,
Jansen,
A.,
Gravier,
G.,
Cao,
X.,
Johnson,
M. & Dupoux,
E.
(2014).
Bridging the gap between speech technology and natural language processing: An evaluation toolbox for term discovery systems.
-
Luong,
Socher & Manning
(2013)
-
Luong,
M.,
Socher,
R. & Manning,
C.
(2013).
Better word representations with recursive neural networks for morphology.
-
Macmillan & Creelman
(2004)
-
Macmillan,
N. & Creelman,
C.
(2004).
Detection theory: A user’s guide.
Psychology Press.
-
Mahrt
(2016)
-
Mahrt,
T.
(2016).
LMEDS: Language markup and experimental design software.
-
Wang,
Zhang & Zhang
(2015)
-
Wang,
D.,
Zhang,
X. & Zhang,
Z.
(2015).
THCHS-30: A free Chinese speech corpus.
arXiv preprint arXiv:1512.01882.
-
Manenti,
Pellegrini & Pinquier
(2017)
-
Manenti,
C.,
Pellegrini,
T. & Pinquier,
J.
(2017).
Unsupervised speech unit discovery using k-means and neural networks.
Springer.
-
Matlock
(2001)
-
Matlock,
T.
(2001).
How real is fictive motion?
(Doctoral dissertation).
Psychology Department, University of California, Santa Cruz
-
Melis,
Dyer & Blunsom
(2018)
-
Melis,
G.,
Dyer,
C. & Blunsom,
P.
(2018).
On the state of the art of evaluation in neural language models.
ICLR.
-
Meyer,
Wesker,
Brand,
Mertins & Kollmeier
(2006)
-
Meyer,
B.,
Wesker,
T.,
Brand,
T.,
Mertins,
A. & Kollmeier,
B.
(2006).
A human-machine comparison in speech recognition based on a logatome corpus.
-
Meyer,
Wächter,
Brand & Kollmeier
(2007)
-
Meyer,
B.,
Wächter,
M.,
Brand,
T. & Kollmeier,
B.
(2007).
Phoneme confusions in human and automatic speech recognition.
-
Meyer,
Jürgens,
Wesker,
Brand & Kollmeier
(2010)
-
Meyer,
B.,
Jürgens,
T.,
Wesker,
T.,
Brand,
T. & Kollmeier,
B.
(2010).
Human phoneme recognition depending on speech-intrinsic variability.
The Journal of the Acoustical Society of America, 128(5). 3126–3141.
-
Miao,
Gowayyed & Metze
(2015)
-
Miao,
Y.,
Gowayyed,
M. & Metze,
F.
(2015).
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding.
IEEE.
-
Miech,
Zhukov,
Alayrac,
Tapaswi,
Laptev & Sivic
(2019)
-
Miech,
A.,
Zhukov,
D.,
Alayrac,
J.,
Tapaswi,
M.,
Laptev,
I. & Sivic,
J.
(2019).
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips.
-
Miller & Charles
(1991)
-
Miller,
G. & Charles,
W.
(1991).
Contextual correlates of semantic similarity.
Language and cognitive processes, 6(1). 1–28.
-
Millet,
Jurov & Dunbar
(2019)
-
Millet,
J.,
Jurov,
N. & Dunbar,
E.
(2019).
Comparing unsupervised speech learning directly to human performance in speech perception.
-
Muscariello,
Gravier & Bimbot
(2012)
-
Muscariello,
A.,
Gravier,
G. & Bimbot,
F.
(2012).
Unsupervised Motif Acquisition in Speech via Seeded Discovery and Template Matching Combination.
IEEE Transactions on Audio, Speech and Language Processing, 20(7). 2031–2044.
-
Gulordava,
Bojanowski,
Grave,
Linzen & Baroni
(2018)
-
Gulordava,
K.,
Bojanowski,
P.,
Grave,
E.,
Linzen,
T. & Baroni,
M.
(2018).
Colorless green recurrent networks dream hierarchically.
Association for Computational Linguistics. Retrieved from
http://aclweb.org/anthology/N18-1108
-
Kwiatkowski,
Palomaki,
Redfield,
Collins,
Parikh,
Alberti,
Epstein,
Polosukhin,
Devlin,
Lee & al.
(2019)
-
Kwiatkowski,
T.,
Palomaki,
J.,
Redfield,
O.,
Collins,
M.,
Parikh,
A.,
Alberti,
C.,
Epstein,
D.,
Polosukhin,
I.,
Devlin,
J.,
Lee,
K. & al.
(2019).
Natural questions: A benchmark for question answering research.
Transactions of the Association for Computational Linguistics, 7. 453–466.
-
Cuervo,
Grabias,
Chorowski,
Ciesielski,
Łańcucki,
Rychlikowski & Marxer
(2021)
-
Cuervo,
S.,
Grabias,
M.,
Chorowski,
J.,
Ciesielski,
G.,
Łańcucki,
A.,
Rychlikowski,
P. & Marxer,
R.
(2021).
Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words.
arXiv preprint arXiv:2110.15909.
-
Iwamoto & Shinozaki
(2021)
-
Iwamoto,
Y. & Shinozaki,
T.
(2021).
Unsupervised spoken term discovery using wav2vec 2.0.
IEEE.
-
Bhati,
Villalba,
Żelasko,
Moro-Velazquez & Dehak
(2021)
-
Bhati,
S.,
Villalba,
J.,
Żelasko,
P.,
Moro-Velazquez,
L. & Dehak,
N.
(2021).
Segmental contrastive predictive coding for unsupervised word segmentation.
arXiv preprint arXiv:2106.02170.
-
Bhati,
Villalba,
Żelasko,
Moro-Velazquez & Dehak
(2021)
-
Bhati,
S.,
Villalba,
J.,
Żelasko,
P.,
Moro-Velazquez,
L. & Dehak,
N.
(2021).
Unsupervised speech segmentation and variable rate representation learning using segmental contrastive predictive coding.
arXiv preprint arXiv:2110.02345.
-
Bhati,
Villalba,
Żelasko & Dehak
(2020)
-
Bhati,
S.,
Villalba,
J.,
Żelasko,
P. & Dehak,
N.
(2020).
Self-expressing autoencoders for unsupervised spoken term discovery.
arXiv preprint arXiv:2007.13033.
-
Borgholt,
Havtorn,
Edin,
Maaløe & Igel
(2022)
-
Borgholt,
L.,
Havtorn,
J.,
Edin,
J.,
Maaløe,
L. & Igel,
C.
(2022).
A brief overview of unsupervised neural speech representation learning.
-
Nayak,
Kumar,
Ramesh,
Bhati & Murty
(2019)
-
Nayak,
S.,
Kumar,
C.,
Ramesh,
G.,
Bhati,
S. & Murty,
K.
(2019).
Virtual Phone Discovery for Speech Synthesis.
https://doi.org/10.13140/RG.2.2.23356.08324
-
Tobing,
Hayashi,
Wu,
Kobayashi & Toda
(2020)
-
Tobing,
P.,
Hayashi,
T.,
Wu,
Y.,
Kobayashi,
K. & Toda,
T.
(2020).
Cyclic spectral modeling for unsupervised unit discovery into voice conversion with excitation and waveform modeling.
-
Chen & Hain
(2020)
-
Chen,
M. & Hain,
T.
(2020).
Unsupervised acoustic unit representation learning for voice conversion using wavenet auto-encoders.
arXiv preprint arXiv:2008.06892.
-
Niekerk,
Nortje & Kamper
(2020)
-
Niekerk,
B.,
Nortje,
L. & Kamper,
H.
(2020).
Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge.
arXiv preprint arXiv:2005.09409.
-
Yusuf,
Ondel,
Burget,
Černockỳ & Saraclar
(2021)
-
Yusuf,
B.,
Ondel,
L.,
Burget,
L.,
Černockỳ,
J. & Saraclar,
M.
(2021).
A hierarchical subspace model for language-attuned acoustic unit discovery.
IEEE.
-
Gündogdu,
Yusuf,
Yesilbursa & Saraclar
(2020)
-
Gündogdu,
B.,
Yusuf,
B.,
Yesilbursa,
M. & Saraclar,
M.
(2020).
Vector quantized temporally-aware correspondence sparse autoencoders for zero-resource acoustic unit discovery.
-
Newell & Simon
(1972)
-
Newell,
A. & Simon,
H.
(1972).
Human problem solving.
Prentice-Hall.
-
Nguyen,
Seyssel,
Rozé,
Rivière,
Kharitonov,
Baevski,
Dunbar & Dupoux
(2020)
-
Nguyen,
T.,
Seyssel,
M.,
Rozé,
P.,
Rivière,
M.,
Kharitonov,
E.,
Baevski,
A.,
Dunbar,
E. & Dupoux,
E.
(2020).
The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling.
arXiv preprint arXiv:2011.11588.
-
Jurov
(2019)
-
Jurov,
N.
(2019).
Phonetics or Phonology? Modelling Non-Native Perception
(Master’s thesis).
Université Paris Diderot, Paris, France.
-
Ondel,
Godard,
Besacier,
Larsen,
Hasegawa-Johnson,
Scharenborg,
Dupoux,
Burget,
Yvon & Khudanpur
(2018)
-
Ondel,
L.,
Godard,
P.,
Besacier,
L.,
Larsen,
E.,
Hasegawa-Johnson,
M.,
Scharenborg,
O.,
Dupoux,
E.,
Burget,
L.,
Yvon,
F. & Khudanpur,
S.
(2018).
Bayesian models for unit discovery on a very low resource language.
IEEE.
-
Oord,
Li & Vinyals
(2018)
-
Oord,
A.,
Li,
Y. & Vinyals,
O.
(2018).
Representation learning with contrastive predictive coding.
CoRR, abs/1807.03748. Retrieved from
http://arxiv.org/abs/1807.03748
-
Ott,
Edunov,
Baevski,
Fan,
Gross,
Ng,
Grangier & Auli
(2019)
-
Ott,
M.,
Edunov,
S.,
Baevski,
A.,
Fan,
A.,
Gross,
S.,
Ng,
N.,
Grangier,
D. & Auli,
M.
(2019).
Fairseq: A fast, extensible toolkit for sequence modeling.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N19-4009
-
Panayotov,
Chen,
Povey & Khudanpur
(2015)
-
Panayotov,
V.,
Chen,
G.,
Povey,
D. & Khudanpur,
S.
(2015).
LibriSpeech: An ASR corpus based on public domain audio books.
IEEE.
-
Pandia & Murthy
(2019)
-
Pandia,
K. & Murthy,
H.
(2019).
Zero Resource Speech Synthesis Using Transcripts Derived from Perceptual Acoustic Units.
INTERSPEECH 2019.
-
Park & Glass
(2008)
-
Park,
A. & Glass,
J.
(2008).
Unsupervised Pattern Discovery in Speech.
IEEE Transactions on Audio, Speech, and Language Processing, 16(1). 186–197.
-
Parrot,
Millet & Dunbar
(2019)
-
Parrot,
M.,
Millet,
J. & Dunbar,
E.
(2019).
Independent and automatic evaluation of acoustic-to-articulatory inversion models.
arXiv. arXiv–1911.
-
Pauls & Klein
(2012)
-
Pauls,
A. & Klein,
D.
(2012).
Large-scale syntactic language modeling with treelets.
-
Chang & Fisher III
(2013)
-
Chang,
J. & Fisher III,
J.
(2013).
Parallel sampling of DP mixture models using sub-cluster splits.
-
Pellegrini,
Manenti & Pinquier
(n.d.)
-
Pellegrini,
T.,
Manenti,
C. & Pinquier,
J.
(n.d.).
Unsupervised discovery of sub-lexical units in speech based on ZCA and k-means.
Submitted to ASRU 2017.
-
Peperkamp
(2015)
-
Peperkamp,
S.
(2015).
Phonology versus phonetics in loanword adaptations. (pp. 71–90).
John Benjamins Publishing Company.
-
Phillips,
Wagers & Lau
(2011)
-
Phillips,
C.,
Wagers,
M. & Lau,
E.
(2011).
Grammatical illusions and selective fallibility in real-time language comprehension.
Experiments at the Interfaces, 37. 147–180.
-
Pintér & Watanabe
(2016)
-
Pintér,
G. & Watanabe,
H.
(2016).
Do GMM phoneme classifiers perceive synthetic sibilants as humans do?
-
Pitt,
Johnson,
Hume,
Kiesling & Raymond
(2005)
-
Pitt,
M.,
Johnson,
K.,
Hume,
E.,
Kiesling,
S. & Raymond,
W.
(2005).
The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability.
Speech Communication, 45(1). 89–95.
-
Povey,
Ghoshal,
Boulianne,
Burget,
Glembek,
Goel,
Hannemann,
Motlicek,
Qian,
Schwarz,
Silovsky,
Stemmer & Vesely
(2011)
-
Povey,
D.,
Ghoshal,
A.,
Boulianne,
G.,
Burget,
L.,
Glembek,
O.,
Goel,
N.,
Hannemann,
M.,
Motlicek,
P.,
Qian,
Y.,
Schwarz,
P.,
Silovsky,
J.,
Stemmer,
G. & Vesely,
K.
(2011).
The Kaldi speech recognition toolkit.
-
R Core Team
(2017)
-
R Core Team.
(2017).
R: A language and environment for statistical computing.
R Foundation for Statistical Computing. Retrieved from
https://www.R-project.org/
-
Rabiner
(1989)
-
Rabiner,
L.
(1989).
A tutorial on hidden Markov models and selected applications in speech recognition.
Proceedings of the IEEE, 77(2). 257–286.
-
Radford,
Wu,
Child,
Luan,
Amodei & Sutskever
(2019)
-
Radford,
A.,
Wu,
J.,
Child,
R.,
Luan,
D.,
Amodei,
D. & Sutskever,
I.
(2019).
Language models are unsupervised multitask learners.
-
Radinsky,
Agichtein,
Gabrilovich & Markovitch
(2011)
-
Radinsky,
K.,
Agichtein,
E.,
Gabrilovich,
E. & Markovitch,
S.
(2011).
A word at a time: Computing word relatedness using temporal semantic analysis.
-
Räsänen & Rasilo
(2015)
-
Räsänen,
O. & Rasilo,
H.
(2015).
A joint model of word segmentation and meaning acquisition through cross-situational learning.
Psychological Review, 122. 792–829.
-
Ravfogel,
Tyers & Goldberg
(2018)
-
Ravfogel,
S.,
Tyers,
F. & Goldberg,
Y.
(2018).
Can LSTM learn to capture agreement? The case of basque.
arXiv preprint arXiv:1809.04022.
-
Kamper,
Jansen & Goldwater
(2017)
-
Kamper,
H.,
Jansen,
A. & Goldwater,
S.
(2017).
A segmental framework for fully-unsupervised large-vocabulary speech recognition.
Computer Speech & Language, 46. 154–174.
-
Kamper
(2022)
-
Kamper,
H.
(2022).
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring.
arXiv preprint arXiv:2202.11929.
-
Renshaw,
Kamper,
Jansen & Goldwater
(2015)
-
Renshaw,
D.,
Kamper,
H.,
Jansen,
A. & Goldwater,
S.
(2015).
A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge.
-
Dupoux
(2018)
-
Dupoux,
E.
(2018).
Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner.
Cognition, 173. 43–59.
-
Rivière,
Joulin,
Mazaré & Dupoux
(2020)
-
Rivière,
M.,
Joulin,
A.,
Mazaré,
P. & Dupoux,
E.
(2020).
Unsupervised pretraining transfers well across languages.
Retrieved from
https://arxiv.org/abs/2002.02848
-
Roy & Pentland
(2002)
-
Roy,
D. & Pentland,
A.
(2002).
Learning words from sights and sounds: A computational model.
Cognitive Science, 26. 113–146.
-
Rubenstein & Goodenough
(1965)
-
Rubenstein,
H. & Goodenough,
J.
(1965).
Contextual correlates of synonymy.
Communications of the ACM, 8(10). 627–633.
-
Tjandra,
Sisman,
Zhang,
Sakti,
Li & Nakamura
(2019)
-
Tjandra,
A.,
Sisman,
B.,
Zhang,
M.,
Sakti,
S.,
Li,
H. & Nakamura,
S.
(2019).
VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for ZeroSpeech Challenge 2019.
INTERSPEECH 2019. Retrieved from
https://arxiv.org/abs/1905.11449
-
Sakti,
Kelana,
Riza,
Sakai,
Markov & Nakamura
(2008)
-
Sakti,
S.,
Kelana,
E.,
Riza,
H.,
Sakai,
S.,
Markov,
K. & Nakamura,
S.
(2008).
Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project.
-
Sakti,
Maia,
Sakai,
Shimizu & Nakamura
(2008)
-
Sakti,
S.,
Maia,
R.,
Sakai,
S.,
Shimizu,
T. & Nakamura,
S.
(2008).
Development of HMM-based Indonesian speech synthesis.
-
Sanabria,
Caglayan,
Palaskar,
Elliott,
Barrault,
Specia & Metze
(2018)
-
Sanabria,
R.,
Caglayan,
O.,
Palaskar,
S.,
Elliott,
D.,
Barrault,
L.,
Specia,
L. & Metze,
F.
(2018).
How2: A large-scale dataset for multimodal language understanding.
NeurIPS. Retrieved from
http://arxiv.org/abs/1811.00347
-
Scharenborg
(2007)
-
Scharenborg,
O.
(2007).
Reaching over the gap: A review of efforts to link human and automatic speech recognition research.
Speech Communication, 49(5). 336–347.
-
Scharenborg,
Tiesmeyer,
Hasegawa-Johnson & Dehak
(2018)
-
Scharenborg,
O.,
Tiesmeyer,
S.,
Hasegawa-Johnson,
M. & Dehak,
N.
(2018).
Visualizing phoneme category adaptation in deep neural networks.
-
Scharenborg,
Gouw,
Larson & Marchiori
(2019)
-
Scharenborg,
O.,
Gouw,
N.,
Larson,
M. & Marchiori,
E.
(2019).
The representation of speech in deep neural networks.
Springer.
-
Scharenborg
(2019)
-
Scharenborg,
O.
(2019).
The representation of speech and its processing in the human brain and deep neural networks.
Springer.
-
Schatz,
Peddinti,
Bach,
Jansen,
Hermansky & Dupoux
(2013)
-
Schatz,
T.,
Peddinti,
V.,
Bach,
F.,
Jansen,
A.,
Hermansky,
H. & Dupoux,
E.
(2013).
Evaluating speech features with the Minimal-Pair ABX task (I): Analysis of the classical MFC/PLP pipeline.
-
Schatz,
Peddinti,
Cao,
Bach,
Hermansky & Dupoux
(2014)
-
Schatz,
T.,
Peddinti,
V.,
Cao,
X.,
Bach,
F.,
Hermansky,
H. & Dupoux,
E.
(2014).
Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise.
-
Schatz
(2016)
-
Schatz,
T.
(2016).
ABX-discriminability measures and applications
(Doctoral dissertation).
École Normale Supérieure
-
Schatz,
Bach & Dupoux
(2017)
-
Schatz,
T.,
Bach,
F. & Dupoux,
E.
(2017).
ASR systems as models of phonetic category perception in adults.
-
Schatz & Feldman
(2018)
-
Schatz,
T. & Feldman,
N.
(2018).
Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception.
-
Schatz,
Feldman,
Goldwater,
Cao & Dupoux
(2021)
-
Schatz,
T.,
Feldman,
N.,
Goldwater,
S.,
Cao,
X. & Dupoux,
E.
(2021).
Early phonetic learning without phonetic categories: Insights from machine learning.
Proceedings of the National Academy of Sciences.
-
Schnabel,
Labutov,
Mimno & Joachims
(2015)
-
Schnabel,
T.,
Labutov,
I.,
Mimno,
D. & Joachims,
T.
(2015).
Evaluation methods for unsupervised word embeddings.
-
Schneider,
Baevski,
Collobert & Auli
(2019)
-
Schneider,
S.,
Baevski,
A.,
Collobert,
R. & Auli,
M.
(2019).
wav2vec: Unsupervised pre-training for speech recognition.
arXiv:1904.05862.
-
Sennrich,
Haddow & Birch
(2016)
-
Sennrich,
R.,
Haddow,
B. & Birch,
A.
(2016).
Neural machine translation of rare words with subword units.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P16-1162
-
Sennrich,
Haddow & Birch
(2015)
-
Sennrich,
R.,
Haddow,
B. & Birch,
A.
(2015).
Neural machine translation of rare words with subword units.
arXiv preprint arXiv:1508.07909.
-
Shibata,
Kato,
Shinozaki & Watanabe
(n.d.)
-
Shibata,
H.,
Kato,
T.,
Shinozaki,
T. & Watanabe,
S.
(n.d.).
Composite embedding systems for ZeroSpeech2017 track 1.
Submitted to ASRU 2017.
-
Norris & McQueen
(2008)
-
Norris,
D. & McQueen,
J.
(2008).
Shortlist B: a Bayesian model of continuous speech recognition.
Psychological Review, 115(2). 357–395.
-
Shrager & Langley
(1990)
-
Shrager,
J. & Langley,
P.
(1990).
Computational models of scientific discovery and theory formation.
Morgan Kaufmann.
-
Siu,
Gish,
Chan,
Belfield & Lowe
(2013)
-
Siu,
M.,
Gish,
H.,
Chan,
A.,
Belfield,
W. & Lowe,
S.
(2013).
Unsupervised training of an HMM-based self-organizing recognizer with applications to topic classification and keyword discovery.
Computer Speech & Language, preprint.
-
Socher,
Karpathy,
Le,
Manning & Ng
(2014)
-
Socher,
R.,
Karpathy,
A.,
Le,
Q.,
Manning,
C. & Ng,
A.
(2014).
Grounded compositional semantics for finding and describing images with sentences.
Transactions of the Association for Computational Linguistics, 2. 207–218.
-
Scharenborg,
Norris,
Bosch & McQueen
(2005)
-
Scharenborg,
O.,
Norris,
D.,
Bosch,
L. & McQueen,
J.
(2005).
How should a speech recognizer work?
Cognitive Science, 29. 867–918.
-
Stolcke & Droppo
(2017)
-
Stolcke,
A. & Droppo,
J.
(2017).
Comparing human and machine errors in conversational speech transcription.
-
Sun,
Myers,
Vondrick,
Murphy & Schmid
(2019)
-
Sun,
C.,
Myers,
A.,
Vondrick,
C.,
Murphy,
K. & Schmid,
C.
(2019).
Videobert: A joint model for video and language representation learning.
-
Synnaeve,
Schatz & Dupoux
(2014)
-
Synnaeve,
G.,
Schatz,
T. & Dupoux,
E.
(2014).
Phonetic embedding learning with side information.
-
Synnaeve,
Versteegh & Dupoux
(2014)
-
Synnaeve,
G.,
Versteegh,
M. & Dupoux,
E.
(2014).
Learning words from images and speech.
-
Bosch,
Van hamme,
Boves & Moore
(2008)
-
Bosch,
L.,
Van hamme,
H.,
Boves,
L. & Moore,
R.
(2008).
A computational model of language acquisition: The emergence of words.
Fundamenta Informaticae, 90. 229–249.
-
McMurray,
Aslin & Toscano
(2009)
-
McMurray,
B.,
Aslin,
R. & Toscano,
J.
(2009).
Statistical learning of phonetic categories: Insights from a computational approach.
Developmental Science, 12(3). 369–378.
-
Thiolliere,
Dunbar,
Synnaeve,
Versteegh & Dupoux
(2015)
-
Thiolliere,
R.,
Dunbar,
E.,
Synnaeve,
G.,
Versteegh,
M. & Dupoux,
E.
(2015).
A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling.
-
Schatz,
Thiolliere,
Dupoux,
Synnaeve & Dunbar
(2015)
-
Schatz,
T.,
Thiolliere,
R.,
Dupoux,
E.,
Synnaeve,
G. & Dunbar,
E.
(2015).
ABXpy v0.1.
https://doi.org/10.5281/zenodo.16239
-
Schatz,
Cao,
Synnaeve,
Thiolliere & Dupoux
(2015)
-
Schatz,
T.,
Cao,
X.,
Synnaeve,
G.,
Thiolliere,
R. & Dupoux,
E.
(2015).
Abkhazia: Preliminary release.
https://doi.org/10.5281/zenodo.16242
-
Elman & McClelland
(2015)
-
Elman,
J. & McClelland,
J.
(2015).
Exploiting the lawful variability in the speech wave. (pp. 71–90).
Erlbaum.
-
McClelland & Elman
(1986)
-
McClelland,
J. & Elman,
J.
(1986).
Interactive processes in speech perception: The TRACE model.
Cognitive Psychology, 18. 1–86.
-
Vallabha,
McClelland,
Pons,
Werker & Amano
(2007)
-
Vallabha,
G.,
McClelland,
J.,
Pons,
F.,
Werker,
J. & Amano,
S.
(2007).
Unsupervised learning of vowel categories from infant-directed speech.
Proceedings of the National Academy of Sciences, 104(33). 13273–13278.
-
Oord,
Vinyals & al.
(2017)
-
Oord,
A.,
Vinyals,
O. & al.
(2017).
Neural discrete representation learning.
-
Varadarajan,
Khudanpur & Dupoux
(2008)
-
Varadarajan,
B.,
Khudanpur,
S. & Dupoux,
E.
(2008).
Unsupervised learning of acoustic sub-word units.
Association for Computational Linguistics.
-
Vaswani,
Shazeer,
Parmar,
Uszkoreit,
Jones,
Gomez,
Kaiser & Polosukhin
(2017)
-
Vaswani,
A.,
Shazeer,
N.,
Parmar,
N.,
Uszkoreit,
J.,
Jones,
L.,
Gomez,
A.,
Kaiser,
L. & Polosukhin,
I.
(2017).
Attention is all you need.
CoRR, abs/1706.03762. Retrieved from
http://arxiv.org/abs/1706.03762
-
Versteegh,
Thiolliere,
Schatz,
Cao,
Anguera,
Jansen & Dupoux
(2015)
-
Versteegh,
M.,
Thiolliere,
R.,
Schatz,
T.,
Cao,
X.,
Anguera,
X.,
Jansen,
A. & Dupoux,
E.
(2015).
The zero resource speech challenge 2015.
-
Versteegh,
Anguera,
Jansen & Dupoux
(2016)
-
Versteegh,
M.,
Anguera,
X.,
Jansen,
A. & Dupoux,
E.
(2016).
The zero resource speech challenge 2015: Proposed approaches and results.
Procedia Computer Science: Proceedings of SLTU 2016, 81. 67–72.
-
Versteegh,
Anguera,
Jansen & Dupoux
(2016)
-
Versteegh,
M.,
Anguera,
X.,
Jansen,
A. & Dupoux,
E.
(2016).
The Zero Resource Speech Challenge 2015: Proposed approaches and results.
Procedia Computer Science, 81. 67–72.
-
Wang,
Tang & Livescu
(2020)
-
Wang,
W.,
Tang,
Q. & Livescu,
K.
(2020).
Unsupervised pre-training of bidirectional speech encoders via masked reconstruction.
IEEE.
-
Warstadt,
Parrish,
Liu,
Mohananey,
Peng,
Wang & Bowman
(2019)
-
Warstadt,
A.,
Parrish,
A.,
Liu,
H.,
Mohananey,
A.,
Peng,
W.,
Wang,
S. & Bowman,
S.
(2019).
BLiMP: A benchmark of linguistic minimal pairs for English.
arXiv preprint arXiv:1912.00582.
-
Werker & Tees
(1984)
-
Werker,
J. & Tees,
R.
(1984).
Cross-language speech perception: Evidence for perceptual reorganization during the first year of life.
Infant Behavior and Development, 7(1). 49–63.
-
Wesker,
Meyer,
Wagener,
Anemüller,
Mertins & Kollmeier
(2005)
-
Wesker,
T.,
Meyer,
B.,
Wagener,
K.,
Anemüller,
J.,
Mertins,
A. & Kollmeier,
B.
(2005).
Oldenburg logatome speech corpus (OLLO) for speech recognition experiments with humans and machines.
-
Wilcox,
Levy,
Morita & Futrell
(2018)
-
Wilcox,
E.,
Levy,
R.,
Morita,
T. & Futrell,
R.
(2018).
What do RNN language models learn about filler–gap dependencies?
arXiv preprint arXiv:1809.00042.
-
Gauthier,
Besacier,
Voisin,
Melese & Elingui
(2016)
-
Gauthier,
E.,
Besacier,
L.,
Voisin,
S.,
Melese,
M. & Elingui,
U.
(2016).
Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof.
LREC. Retrieved from
https://hal.archives-ouvertes.fr/hal-01350037
-
Vries,
Davel,
Badenhorst,
Basson,
Wet,
Barnard & Waal
(2014)
-
Vries,
N.,
Davel,
M.,
Badenhorst,
J.,
Basson,
W.,
Wet,
F.,
Barnard,
E. & Waal,
A.
(2014).
A smartphone-based ASR data collection tool for under-resourced languages.
Speech Communication, 56. 119–131.
-
Xu & Tenenbaum
(2007)
-
Xu,
F. & Tenenbaum,
J.
(2007).
Word learning as Bayesian inference.
Psychological Review, 114(2). 245–272.
-
Yang & Powers
(2006)
-
Yang,
D. & Powers,
D.
(2006).
Verb similarity on the taxonomy of WordNet.
Masaryk University.
-
Yang,
Dai,
Yang,
Carbonell,
Salakhutdinov & Le
(2019)
-
Yang,
Z.,
Dai,
Z.,
Yang,
Y.,
Carbonell,
J.,
Salakhutdinov,
R. & Le,
Q.
(2019).
XLNet: Generalized autoregressive pretraining for language understanding.
Retrieved from
https://arxiv.org/abs/1906.08237
-
Yu & Ballard
(2004)
-
Yu,
C. & Ballard,
D.
(2004).
A multimodal learning interface for grounding spoken language in sensory perceptions.
ACM Transactions on Applied Perception, 1. 57–80.
-
Yuan,
Leung,
Xie,
Chen,
Ma & Li
(n.d.)
-
Yuan,
Y.,
Leung,
C.,
Xie,
L.,
Chen,
H.,
Ma,
B. & Li,
H.
(n.d.).
Extracting bottleneck features and word-like pairs from untranscribed speech for feature representations.
Submitted to ASRU 2017.
-
Zhang & Glass
(2010)
-
Zhang,
Y. & Glass,
J.
(2010).
Towards multi-speaker unsupervised speech pattern discovery.
-
Zhou,
Xu & Corso
(2018)
-
Zhou,
L.,
Xu,
C. & Corso,
J.
(2018).
Towards automatic learning of procedures from web instructional videos.
-
Jia,
Weiss,
Biadsy,
Macherey,
Johnson,
Chen & Wu
(2019)
-
Jia,
Y.,
Weiss,
R.,
Biadsy,
F.,
Macherey,
W.,
Johnson,
M.,
Chen,
Z. & Wu,
Y.
(2019).
Direct speech-to-speech translation with a sequence-to-sequence model.
arXiv preprint arXiv:1904.06037.
-
Lee,
Chen,
Wang,
Gu,
Ma,
Polyak,
Adi,
He,
Tang,
Pino & Hsu
(2021)
-
Lee,
A.,
Chen,
P.,
Wang,
C.,
Gu,
J.,
Ma,
X.,
Polyak,
A.,
Adi,
Y.,
He,
Q.,
Tang,
Y.,
Pino,
J. & Hsu,
W.
(2021).
Direct speech-to-speech translation with discrete units.
arXiv preprint arXiv:2107.05604.
-
Tjandra,
Sakti & Nakamura
(2020)
-
Tjandra,
A.,
Sakti,
S. & Nakamura,
S.
(2020).
Transformer VQ-VAE for unsupervised unit discovery and speech synthesis: ZeroSpeech 2020 challenge.
arXiv preprint arXiv:2005.11676.
-
Alishahi,
Chrupała,
Cristia,
Dupoux,
Higy,
Lavechin,
Räsänen & Yu
(2021)
-
Alishahi,
A.,
Chrupała,
G.,
Cristia,
A.,
Dupoux,
E.,
Higy,
B.,
Lavechin,
M.,
Räsänen,
O. & Yu,
C.
(2021).
ZR-2021VG: Zero-resource speech challenge, visually-grounded language modelling track.
arXiv preprint arXiv:2107.06546.
-
Maekaku,
Chang,
Fujita,
Chen,
Watanabe & Rudnicky
(2021)
-
Maekaku,
T.,
Chang,
X.,
Fujita,
Y.,
Chen,
L.,
Watanabe,
S. & Rudnicky,
A.
(2021).
Speech representation learning combining Conformer CPC with deep cluster for the ZeroSpeech Challenge 2021.
arXiv preprint arXiv:2107.05899.
-
Chorowski,
Ciesielski,
Dzikowski,
Łańcucki,
Marxer,
Opala,
Pusz,
Rychlikowski & Stypułkowski
(2021)
-
Chorowski,
J.,
Ciesielski,
G.,
Dzikowski,
J.,
Łańcucki,
A.,
Marxer,
R.,
Opala,
M.,
Pusz,
P.,
Rychlikowski,
P. & Stypułkowski,
M.
(2021).
Information retrieval for ZeroSpeech 2021: The submission by University of Wrocław.
arXiv preprint arXiv:2106.11603.
-
Niekerk,
Nortje,
Baas & Kamper
(2021)
-
Niekerk,
B.,
Nortje,
L.,
Baas,
M. & Kamper,
H.
(2021).
Analyzing speaker information in self-supervised models to improve zero-resource speech processing.
arXiv preprint arXiv:2108.00917.
-
Tjandra,
Sakti & Nakamura
(2019)
-
Tjandra,
A.,
Sakti,
S. & Nakamura,
S.
(2019).
Speech-to-speech translation between untranscribed unknown languages.
IEEE.
-
Jia,
Ramanovich,
Remez & Pomerantz
(2021)
-
Jia,
Y.,
Ramanovich,
M.,
Remez,
T. & Pomerantz,
R.
(2021).
Translatotron 2: Robust direct speech-to-speech translation.
arXiv preprint arXiv:2107.08661.
-
Lee,
Gong,
Duquenne,
Schwenk,
Chen,
Wang,
Popuri,
Pino,
Gu & Hsu
(2021)
-
Lee,
A.,
Gong,
H.,
Duquenne,
P.,
Schwenk,
H.,
Chen,
P.,
Wang,
C.,
Popuri,
S.,
Pino,
J.,
Gu,
J. & Hsu,
W.
(2021).
Textless speech-to-speech translation on real data.
arXiv preprint arXiv:2112.08352.
-
Algayres,
Ricoul,
Karadayi,
Mohammed,
Sagot & Dupoux
(2022)
-
Algayres,
R.,
Ricoul,
T.,
Karadayi,
J.,
Mohammed,
A.,
Sagot,
B. & Dupoux,
E.
(2022).
DP-PARSE: Finding word boundaries from raw speech with a token lexicon.
-
Nguyen,
Sagot & Dupoux
(2022)
-
Nguyen,
T.,
Sagot,
B. & Dupoux,
E.
(2022).
Are discrete units necessary for spoken language modeling?
-
De Saussure
(1916)
-
De Saussure,
F.
(1916).
Course in general linguistics.
McGraw-Hill Book Company, New York-Toronto-London.